Most architectures currently have a debugfs file for dumping the kernel
page tables. Currently each architecture has to implement custom
functions for walking the page tables because the generic
walk_page_range() function is unable to walk the page tables used by the
kernel.
This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space). x86 and arm64 are
then converted to make use of walk_page_range(), removing the custom page
table walkers.
To enable a generic page table walker to walk the unusual mappings of
the kernel we need to implement a set of functions which let us know
when the walker has reached the leaf entry. Since arm, powerpc, s390,
sparc and x86 all have p?d_large macros, let's standardise on that and
implement those that are missing.
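
As a rough illustration of why the walker needs a leaf test, here is a small user-space sketch. It is not kernel code: struct entry, ENT_PRESENT, ENT_LARGE, entry_large() and count_leaves() are all hypothetical names invented for this model, with entry_large() standing in for the p?d_large() checks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags for a toy page-table entry (not kernel definitions). */
#define ENT_PRESENT (1u << 0)
#define ENT_LARGE   (1u << 1)	/* entry maps memory directly: a leaf */

/* Toy entry: either a leaf or a pointer to a next-level table. */
struct entry {
	uint32_t flags;
	struct entry *table;	/* next-level table when not a leaf */
	size_t nr;		/* number of entries in that table */
};

/* Stand-in for p?d_large(): true when this entry is a leaf at its level. */
static bool entry_large(const struct entry *e)
{
	return (e->flags & ENT_LARGE) != 0;
}

/*
 * Count the leaf mappings reachable from a table. levels_left is the
 * number of table levels below this one; at the last level every
 * present entry is a leaf. Without the entry_large() test the walker
 * would follow a large entry as if it pointed to another table.
 */
static size_t count_leaves(const struct entry *tbl, size_t nr, int levels_left)
{
	size_t leaves = 0;

	for (size_t i = 0; i < nr; i++) {
		const struct entry *e = &tbl[i];

		if (!(e->flags & ENT_PRESENT))
			continue;	/* hole: nothing mapped here */
		if (levels_left == 0 || entry_large(e)) {
			leaves++;	/* leaf: do not descend */
			continue;
		}
		leaves += count_leaves(e->table, e->nr, levels_left - 1);
	}
	return leaves;
}
```

An architecture without huge pages at a given level corresponds to entry_large() always returning false there, which is what the stub definitions in this series do.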
Potentially, future changes could unify the implementations of the
debugfs walkers further, moving the common functionality into common
code. This would require a common way of handling the effective
permissions (currently implemented only for x86) along with a per-arch
way of formatting the page table information for debugfs. One
immediate benefit would be getting the KASAN speed up optimisation in
arm64 (and other arches) which is currently only implemented for x86.
Changes since v2:
* Rather than attempting to provide generic macros, actually implement
p?d_large() for each architecture.
Changes since v1:
* Added p4d_large() macro
* Comments to explain p?d_large() macro semantics
* Expanded comment for pte_hole() callback to explain mapping between
depth and P?D
* Handle folded page tables at all levels, so depth from pte_hole()
ignores folding at any level (see real_depth() function in
mm/pagewalk.c)
Steven Price (34):
alpha: mm: Add p?d_large() definitions
arc: mm: Add p?d_large() definitions
arm: mm: Add p?d_large() definitions
arm64: mm: Add p?d_large() definitions
c6x: mm: Add p?d_large() definitions
csky: mm: Add p?d_large() definitions
hexagon: mm: Add p?d_large() definitions
ia64: mm: Add p?d_large() definitions
m68k: mm: Add p?d_large() definitions
microblaze: mm: Add p?d_large() definitions
mips: mm: Add p?d_large() definitions
nds32: mm: Add p?d_large() definitions
nios2: mm: Add p?d_large() definitions
openrisc: mm: Add p?d_large() definitions
parisc: mm: Add p?d_large() definitions
powerpc: mm: Add p?d_large() definitions
riscv: mm: Add p?d_large() definitions
s390: mm: Add p?d_large() definitions
sh: mm: Add p?d_large() definitions
sparc: mm: Add p?d_large() definitions
um: mm: Add p?d_large() definitions
unicore32: mm: Add p?d_large() definitions
xtensa: mm: Add p?d_large() definitions
mm: Add generic p?d_large() macros
mm: pagewalk: Add p4d_entry() and pgd_entry()
mm: pagewalk: Allow walking without vma
mm: pagewalk: Add 'depth' parameter to pte_hole
mm: pagewalk: Add test_p?d callbacks
arm64: mm: Convert mm/dump.c to use walk_page_range()
x86/mm: Point to struct seq_file from struct pg_state
x86/mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
x86/mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
x86/mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
x86: mm: Convert dump_pagetables to use walk_page_range
arch/alpha/include/asm/pgtable.h | 2 +
arch/arc/include/asm/pgtable.h | 1 +
arch/arm/include/asm/pgtable-2level.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/include/asm/pgtable.h | 2 +
arch/arm64/mm/dump.c | 108 +++---
arch/c6x/include/asm/pgtable.h | 2 +
arch/csky/include/asm/pgtable.h | 2 +
arch/hexagon/include/asm/pgtable.h | 5 +
arch/ia64/include/asm/pgtable.h | 3 +
arch/m68k/include/asm/mcf_pgtable.h | 2 +
arch/m68k/include/asm/motorola_pgtable.h | 2 +
arch/m68k/include/asm/pgtable_no.h | 1 +
arch/m68k/include/asm/sun3_pgtable.h | 2 +
arch/microblaze/include/asm/pgtable.h | 2 +
arch/mips/include/asm/pgtable-32.h | 5 +
arch/mips/include/asm/pgtable-64.h | 15 +
arch/mips/include/asm/pgtable-bits.h | 2 +-
arch/nds32/include/asm/pgtable.h | 2 +
arch/nios2/include/asm/pgtable.h | 5 +
arch/openrisc/include/asm/pgtable.h | 1 +
arch/parisc/include/asm/pgtable.h | 3 +
arch/powerpc/include/asm/book3s/32/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 27 +-
arch/powerpc/include/asm/nohash/32/pgtable.h | 1 +
.../include/asm/nohash/64/pgtable-4k.h | 1 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 12 +-
arch/riscv/include/asm/pgtable-64.h | 6 +
arch/riscv/include/asm/pgtable.h | 6 +
arch/s390/include/asm/pgtable.h | 10 +
arch/sh/include/asm/pgtable-3level.h | 1 +
arch/sh/include/asm/pgtable_32.h | 1 +
arch/sh/include/asm/pgtable_64.h | 1 +
arch/sparc/include/asm/pgtable_32.h | 10 +
arch/sparc/include/asm/pgtable_64.h | 1 +
arch/um/include/asm/pgtable-3level.h | 1 +
arch/um/include/asm/pgtable.h | 1 +
arch/unicore32/include/asm/pgtable.h | 1 +
arch/x86/include/asm/pgtable.h | 26 +-
arch/x86/mm/debug_pagetables.c | 8 +-
arch/x86/mm/dump_pagetables.c | 342 +++++++++---------
arch/x86/platform/efi/efi_32.c | 2 +-
arch/x86/platform/efi/efi_64.c | 4 +-
arch/xtensa/include/asm/pgtable.h | 1 +
fs/proc/task_mmu.c | 4 +-
include/asm-generic/4level-fixup.h | 1 +
include/asm-generic/5level-fixup.h | 1 +
include/asm-generic/pgtable-nop4d-hack.h | 1 +
include/asm-generic/pgtable-nop4d.h | 1 +
include/asm-generic/pgtable-nopmd.h | 1 +
include/asm-generic/pgtable-nopud.h | 1 +
include/linux/mm.h | 26 +-
mm/hmm.c | 2 +-
mm/migrate.c | 1 +
mm/mincore.c | 1 +
mm/pagewalk.c | 107 ++++--
56 files changed, 489 insertions(+), 291 deletions(-)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information will be provided by the
p?d_large() functions/macros.
For alpha, we don't support huge pages, so add stubs returning 0.
CC: [email protected]
CC: Richard Henderson <[email protected]>
CC: Ivan Kokshaysky <[email protected]>
CC: Matt Turner <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/alpha/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index 89c2032f9960..e5726d3a3200 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -254,11 +254,13 @@ extern inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *pt
extern inline int pmd_none(pmd_t pmd) { return !pmd_val(pmd); }
extern inline int pmd_bad(pmd_t pmd) { return (pmd_val(pmd) & ~_PFN_MASK) != _PAGE_TABLE; }
extern inline int pmd_present(pmd_t pmd) { return pmd_val(pmd) & _PAGE_VALID; }
+extern inline int pmd_large(pmd_t pmd) { return 0; }
extern inline void pmd_clear(pmd_t * pmdp) { pmd_val(*pmdp) = 0; }
extern inline int pgd_none(pgd_t pgd) { return !pgd_val(pgd); }
extern inline int pgd_bad(pgd_t pgd) { return (pgd_val(pgd) & ~_PFN_MASK) != _PAGE_TABLE; }
extern inline int pgd_present(pgd_t pgd) { return pgd_val(pgd) & _PAGE_VALID; }
+extern inline int pgd_large(pgd_t pgd) { return 0; }
extern inline void pgd_clear(pgd_t * pgdp) { pgd_val(*pgdp) = 0; }
/*
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information will be provided by the
p?d_large() functions/macros.
For arc, we only have two levels, so only pmd_large() is needed.
CC: Vineet Gupta <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/arc/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index cf4be70d5892..0edd27bc7018 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -277,6 +277,7 @@ static inline void pmd_set(pmd_t *pmdp, pte_t *ptep)
#define pmd_none(x) (!pmd_val(x))
#define pmd_bad(x) ((pmd_val(x) & ~PAGE_MASK))
#define pmd_present(x) (pmd_val(x))
+#define pmd_large(x) (pmd_val(x) & _PAGE_HW_SZ)
#define pmd_clear(xp) do { pmd_val(*(xp)) = 0; } while (0)
#define pte_page(pte) pfn_to_page(pte_pfn(pte))
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information will be provided by the
p?d_large() functions/macros.
For arm64, we already have p?d_sect() macros which we can reuse for
p?d_large().
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index de70c1eabf33..6eef345dbaf4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -428,6 +428,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
PMD_TYPE_TABLE)
#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_SECT)
+#define pmd_large(pmd) pmd_sect(pmd)
#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
#define pud_sect(pud) (0)
@@ -511,6 +512,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
#define pud_present(pud) pte_present(pud_pte(pud))
+#define pud_large(pud) pud_sect(pud)
#define pud_valid(pud) pte_valid(pud_pte(pud))
static inline void set_pud(pud_t *pudp, pud_t pud)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information will be provided by the
p?d_large() functions/macros.
For arm, we already provide most p?d_large() macros. Add a stub for PUD
as we don't have huge pages at that level.
CC: Russell King <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm/include/asm/pgtable-2level.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 12659ce5c1f3..adcef1306892 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -183,6 +183,7 @@
#define pud_none(pud) (0)
#define pud_bad(pud) (0)
#define pud_present(pud) (1)
+#define pud_large(pud) (0)
#define pud_clear(pudp) do { } while (0)
#define set_pud(pud,pudp) do { } while (0)
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 6d50a11d7793..9f63a4b89f45 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -141,6 +141,7 @@
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!(pud_val(pud) & 2))
#define pud_present(pud) (pud_val(pud))
+#define pud_large(pud) (0)
#define pmd_table(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_TABLE)
#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
c6x has no MMU, so there is never a large page; just add stubs.
CC: Mark Salter <[email protected]>
CC: Aurelien Jacquiot <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/c6x/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/c6x/include/asm/pgtable.h b/arch/c6x/include/asm/pgtable.h
index ec4db6df5e0d..d532b7df9001 100644
--- a/arch/c6x/include/asm/pgtable.h
+++ b/arch/c6x/include/asm/pgtable.h
@@ -26,6 +26,7 @@
#define pgd_present(pgd) (1)
#define pgd_none(pgd) (0)
#define pgd_bad(pgd) (0)
+#define pgd_large(pgd) (0)
#define pgd_clear(pgdp)
#define kern_addr_valid(addr) (1)
@@ -34,6 +35,7 @@
#define pmd_present(x) (pmd_val(x))
#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
#define pmd_bad(x) (pmd_val(x) & ~PAGE_MASK)
+#define pmd_large(x) (0)
#define PAGE_NONE __pgprot(0) /* these mean nothing to NO_MM */
#define PAGE_SHARED __pgprot(0) /* these mean nothing to NO_MM */
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For m68k, we don't support large pages, so add stubs returning 0.
CC: Geert Uytterhoeven <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/m68k/include/asm/mcf_pgtable.h | 2 ++
arch/m68k/include/asm/motorola_pgtable.h | 2 ++
arch/m68k/include/asm/pgtable_no.h | 1 +
arch/m68k/include/asm/sun3_pgtable.h | 2 ++
4 files changed, 7 insertions(+)
diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mcf_pgtable.h
index 5d5502cb2b2d..63827d28a017 100644
--- a/arch/m68k/include/asm/mcf_pgtable.h
+++ b/arch/m68k/include/asm/mcf_pgtable.h
@@ -196,11 +196,13 @@ static inline int pmd_none2(pmd_t *pmd) { return !pmd_val(*pmd); }
static inline int pmd_bad2(pmd_t *pmd) { return 0; }
#define pmd_bad(pmd) pmd_bad2(&(pmd))
#define pmd_present(pmd) (!pmd_none2(&(pmd)))
+#define pmd_large(pmd) (0)
static inline void pmd_clear(pmd_t *pmdp) { pmd_val(*pmdp) = 0; }
static inline int pgd_none(pgd_t pgd) { return 0; }
static inline int pgd_bad(pgd_t pgd) { return 0; }
static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline int pgd_large(pgd_t pgd) { return 0; }
static inline void pgd_clear(pgd_t *pgdp) {}
#define pte_ERROR(e) \
diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/asm/motorola_pgtable.h
index 7f66a7bad7a5..a649eb8a91de 100644
--- a/arch/m68k/include/asm/motorola_pgtable.h
+++ b/arch/m68k/include/asm/motorola_pgtable.h
@@ -138,6 +138,7 @@ static inline void pgd_set(pgd_t *pgdp, pmd_t *pmdp)
#define pmd_none(pmd) (!pmd_val(pmd))
#define pmd_bad(pmd) ((pmd_val(pmd) & _DESCTYPE_MASK) != _PAGE_TABLE)
#define pmd_present(pmd) (pmd_val(pmd) & _PAGE_TABLE)
+#define pmd_large(pmd) (0)
#define pmd_clear(pmdp) ({ \
unsigned long *__ptr = pmdp->pmd; \
short __i = 16; \
@@ -150,6 +151,7 @@ static inline void pgd_set(pgd_t *pgdp, pmd_t *pmdp)
#define pgd_none(pgd) (!pgd_val(pgd))
#define pgd_bad(pgd) ((pgd_val(pgd) & _DESCTYPE_MASK) != _PAGE_TABLE)
#define pgd_present(pgd) (pgd_val(pgd) & _PAGE_TABLE)
+#define pgd_large(pgd) (0)
#define pgd_clear(pgdp) ({ pgd_val(*pgdp) = 0; })
#define pgd_page(pgd) (mem_map + ((unsigned long)(__va(pgd_val(pgd)) - PAGE_OFFSET) >> PAGE_SHIFT))
diff --git a/arch/m68k/include/asm/pgtable_no.h b/arch/m68k/include/asm/pgtable_no.h
index fc3a96c77bd8..eeef17b2eae8 100644
--- a/arch/m68k/include/asm/pgtable_no.h
+++ b/arch/m68k/include/asm/pgtable_no.h
@@ -17,6 +17,7 @@
* Trivial page table functions.
*/
#define pgd_present(pgd) (1)
+#define pgd_large(pgd) (0)
#define pgd_none(pgd) (0)
#define pgd_bad(pgd) (0)
#define pgd_clear(pgdp)
diff --git a/arch/m68k/include/asm/sun3_pgtable.h b/arch/m68k/include/asm/sun3_pgtable.h
index c987d50866b4..eea72e3515db 100644
--- a/arch/m68k/include/asm/sun3_pgtable.h
+++ b/arch/m68k/include/asm/sun3_pgtable.h
@@ -143,6 +143,7 @@ static inline int pmd_bad2 (pmd_t *pmd) { return 0; }
static inline int pmd_present2 (pmd_t *pmd) { return pmd_val (*pmd) & SUN3_PMD_VALID; }
/* #define pmd_present(pmd) pmd_present2(&(pmd)) */
#define pmd_present(pmd) (!pmd_none2(&(pmd)))
+#define pmd_large(pmd) (0)
static inline void pmd_clear (pmd_t *pmdp) { pmd_val (*pmdp) = 0; }
static inline int pgd_none (pgd_t pgd) { return 0; }
@@ -150,6 +151,7 @@ static inline int pgd_bad (pgd_t pgd) { return 0; }
static inline int pgd_present (pgd_t pgd) { return 1; }
static inline void pgd_clear (pgd_t *pgdp) {}
+static inline int pgd_large(pgd_t pgd) { return 0; }
#define pte_ERROR(e) \
pr_err("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, pte_val(e))
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For parisc, we don't support large pages, so add stubs returning 0.
CC: "James E.J. Bottomley" <[email protected]>
CC: Helge Deller <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/parisc/include/asm/pgtable.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index c7bb74e22436..1f38c85a9530 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -302,6 +302,7 @@ extern unsigned long *empty_zero_page;
#endif
#define pmd_bad(x) (!(pmd_flag(x) & PxD_FLAG_VALID))
#define pmd_present(x) (pmd_flag(x) & PxD_FLAG_PRESENT)
+#define pmd_large(x) (0)
static inline void pmd_clear(pmd_t *pmd) {
#if CONFIG_PGTABLE_LEVELS == 3
if (pmd_flag(*pmd) & PxD_FLAG_ATTACHED)
@@ -324,6 +325,7 @@ static inline void pmd_clear(pmd_t *pmd) {
#define pgd_none(x) (!pgd_val(x))
#define pgd_bad(x) (!(pgd_flag(x) & PxD_FLAG_VALID))
#define pgd_present(x) (pgd_flag(x) & PxD_FLAG_PRESENT)
+#define pgd_large(x) (0)
static inline void pgd_clear(pgd_t *pgd) {
#if CONFIG_PGTABLE_LEVELS == 3
if(pgd_flag(*pgd) & PxD_FLAG_ATTACHED)
@@ -342,6 +344,7 @@ static inline void pgd_clear(pgd_t *pgd) {
static inline int pgd_none(pgd_t pgd) { return 0; }
static inline int pgd_bad(pgd_t pgd) { return 0; }
static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline int pgd_large(pgd_t pgd) { return 0; }
static inline void pgd_clear(pgd_t * pgdp) { }
#endif
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For mips, we don't support large pages on 32-bit, so add stubs returning 0.
For 64-bit, look for the _PAGE_HUGE flag being set. This means exposing the
flag when !CONFIG_MIPS_HUGE_TLB_SUPPORT.
CC: Ralf Baechle <[email protected]>
CC: Paul Burton <[email protected]>
CC: James Hogan <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/mips/include/asm/pgtable-32.h | 5 +++++
arch/mips/include/asm/pgtable-64.h | 15 +++++++++++++++
arch/mips/include/asm/pgtable-bits.h | 2 +-
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
index 74afe8c76bdd..58cab62d768b 100644
--- a/arch/mips/include/asm/pgtable-32.h
+++ b/arch/mips/include/asm/pgtable-32.h
@@ -104,6 +104,11 @@ static inline int pmd_present(pmd_t pmd)
return pmd_val(pmd) != (unsigned long) invalid_pte_table;
}
+static inline int pmd_large(pmd_t pmd)
+{
+ return 0;
+}
+
static inline void pmd_clear(pmd_t *pmdp)
{
pmd_val(*pmdp) = ((unsigned long) invalid_pte_table);
diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 93a9dce31f25..981930e1f843 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -204,6 +204,11 @@ static inline int pgd_present(pgd_t pgd)
return pgd_val(pgd) != (unsigned long)invalid_pud_table;
}
+static inline int pgd_large(pgd_t pgd)
+{
+ return 0;
+}
+
static inline void pgd_clear(pgd_t *pgdp)
{
pgd_val(*pgdp) = (unsigned long)invalid_pud_table;
@@ -273,6 +278,11 @@ static inline int pmd_present(pmd_t pmd)
return pmd_val(pmd) != (unsigned long) invalid_pte_table;
}
+static inline int pmd_large(pmd_t pmd)
+{
+ return (pmd_val(pmd) & _PAGE_HUGE) != 0;
+}
+
static inline void pmd_clear(pmd_t *pmdp)
{
pmd_val(*pmdp) = ((unsigned long) invalid_pte_table);
@@ -297,6 +307,11 @@ static inline int pud_present(pud_t pud)
return pud_val(pud) != (unsigned long) invalid_pmd_table;
}
+static inline int pud_large(pud_t pud)
+{
+ return (pud_val(pud) & _PAGE_HUGE) != 0;
+}
+
static inline void pud_clear(pud_t *pudp)
{
pud_val(*pudp) = ((unsigned long) invalid_pmd_table);
diff --git a/arch/mips/include/asm/pgtable-bits.h b/arch/mips/include/asm/pgtable-bits.h
index f88a48cd68b2..5ab296dee8fa 100644
--- a/arch/mips/include/asm/pgtable-bits.h
+++ b/arch/mips/include/asm/pgtable-bits.h
@@ -132,7 +132,7 @@ enum pgtable_bits {
#define _PAGE_WRITE (1 << _PAGE_WRITE_SHIFT)
#define _PAGE_ACCESSED (1 << _PAGE_ACCESSED_SHIFT)
#define _PAGE_MODIFIED (1 << _PAGE_MODIFIED_SHIFT)
-#if defined(CONFIG_64BIT) && defined(CONFIG_MIPS_HUGE_TLB_SUPPORT)
+#if defined(CONFIG_64BIT)
# define _PAGE_HUGE (1 << _PAGE_HUGE_SHIFT)
#endif
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For sh, we don't support large pages, so add stubs returning 0.
CC: Yoshinori Sato <[email protected]>
CC: Rich Felker <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/sh/include/asm/pgtable-3level.h | 1 +
arch/sh/include/asm/pgtable_32.h | 1 +
arch/sh/include/asm/pgtable_64.h | 1 +
3 files changed, 3 insertions(+)
diff --git a/arch/sh/include/asm/pgtable-3level.h b/arch/sh/include/asm/pgtable-3level.h
index 7d8587eb65ff..9d8b2b002582 100644
--- a/arch/sh/include/asm/pgtable-3level.h
+++ b/arch/sh/include/asm/pgtable-3level.h
@@ -48,6 +48,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
#define pud_present(x) (pud_val(x))
#define pud_clear(xp) do { set_pud(xp, __pud(0)); } while (0)
#define pud_bad(x) (pud_val(x) & ~PAGE_MASK)
+#define pud_large(x) (0)
/*
* (puds are folded into pgds so this doesn't get actually called,
diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h
index 29274f0e428e..61186aa11021 100644
--- a/arch/sh/include/asm/pgtable_32.h
+++ b/arch/sh/include/asm/pgtable_32.h
@@ -329,6 +329,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
#define pmd_present(x) (pmd_val(x))
#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
#define pmd_bad(x) (pmd_val(x) & ~PAGE_MASK)
+#define pmd_large(x) (0)
#define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))
#define pte_page(x) pfn_to_page(pte_pfn(x))
diff --git a/arch/sh/include/asm/pgtable_64.h b/arch/sh/include/asm/pgtable_64.h
index 1778bc5971e7..80fe9264babf 100644
--- a/arch/sh/include/asm/pgtable_64.h
+++ b/arch/sh/include/asm/pgtable_64.h
@@ -64,6 +64,7 @@ static __inline__ void set_pte(pte_t *pteptr, pte_t pteval)
#define pmd_clear(pmd_entry_p) (set_pmd((pmd_entry_p), __pmd(_PMD_EMPTY)))
#define pmd_none(pmd_entry) (pmd_val((pmd_entry)) == _PMD_EMPTY)
#define pmd_bad(pmd_entry) ((pmd_val(pmd_entry) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
+#define pmd_large(pmd_entry) (0)
#define pmd_page_vaddr(pmd_entry) \
((unsigned long) __va(pmd_val(pmd_entry) & PAGE_MASK))
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For nds32, we don't support large pages, so add stubs returning 0.
CC: Greentime Hu <[email protected]>
CC: Vincent Chen <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/nds32/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/nds32/include/asm/pgtable.h b/arch/nds32/include/asm/pgtable.h
index 9f52db930c00..202ac93c0a6b 100644
--- a/arch/nds32/include/asm/pgtable.h
+++ b/arch/nds32/include/asm/pgtable.h
@@ -309,6 +309,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
#define pmd_none(pmd) (pmd_val(pmd)&0x1)
#define pmd_present(pmd) (!pmd_none(pmd))
+#define pmd_large(pmd) (0)
#define pmd_bad(pmd) pmd_none(pmd)
#define copy_pmd(pmdpd,pmdps) set_pmd((pmdpd), *(pmdps))
@@ -349,6 +350,7 @@ static inline pmd_t __mk_pmd(pte_t * ptep, unsigned long prot)
#define pgd_none(pgd) (0)
#define pgd_bad(pgd) (0)
#define pgd_present(pgd) (1)
+#define pgd_large(pgd) (0)
#define pgd_clear(pgdp) do { } while (0)
#define page_pte_prot(page,prot) mk_pte(page, prot)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For xtensa, we don't support large pages, so add a stub returning 0.
CC: Chris Zankel <[email protected]>
CC: Max Filippov <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/xtensa/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
index 29cfe421cf41..60c3e86b9782 100644
--- a/arch/xtensa/include/asm/pgtable.h
+++ b/arch/xtensa/include/asm/pgtable.h
@@ -266,6 +266,7 @@ static inline void pgtable_cache_init(void) { }
#define pmd_none(pmd) (!pmd_val(pmd))
#define pmd_present(pmd) (pmd_val(pmd) & PAGE_MASK)
#define pmd_bad(pmd) (pmd_val(pmd) & ~PAGE_MASK)
+#define pmd_large(pmd) (0)
#define pmd_clear(pmdp) do { set_pmd(pmdp, __pmd(0)); } while (0)
static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITABLE; }
--
2.20.1
pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
no users. We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.
Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
PUD-sized transparent hugepages") already re-added pud_entry() but with
different semantics to the other callbacks. Since there have never
been upstream users of this, revert the semantics back to match the
other callbacks. This means pud_entry() is called for all entries, not
just transparent huge pages.
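
The callback semantics described above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: struct model_walk, model_must_descend(), model_visit() and model_count_cb() are hypothetical names and are not part of mm/pagewalk.c. It shows two things: a callback at a level is invoked for every non-empty entry at that level, and a level is only descended into when some callback below it is set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Table levels, top to bottom (model only). */
enum { LVL_PGD, LVL_P4D, LVL_PUD, LVL_PMD, LVL_PTE, NR_LVLS };

typedef int (*entry_cb)(unsigned long addr, unsigned long next, void *priv);

/* Hypothetical stand-in for struct mm_walk: one optional callback per level. */
struct model_walk {
	entry_cb entry[NR_LVLS];	/* NULL means no callback at this level */
	void *priv;
};

/* True if any callback is registered strictly below the given level. */
static bool model_must_descend(const struct model_walk *w, int level)
{
	for (int l = level + 1; l < NR_LVLS; l++)
		if (w->entry[l])
			return true;
	return false;
}

/*
 * Visit one non-empty entry: run the callback for this level if there is
 * one, stop on a non-zero return, otherwise report whether the walker
 * should continue into the next level.
 */
static int model_visit(const struct model_walk *w, int level,
		       unsigned long addr, unsigned long next, bool *descend)
{
	int err = 0;

	*descend = false;
	if (w->entry[level])
		err = w->entry[level](addr, next, w->priv);
	if (err)
		return err;
	*descend = model_must_descend(w, level);
	return 0;
}

/* Example callback: counts how many entries it was called for. */
static int model_count_cb(unsigned long addr, unsigned long next, void *priv)
{
	(void)addr;
	(void)next;
	++*(int *)priv;
	return 0;
}
```

With only a PUD-level callback registered, the model walks PGD and P4D entries to reach the PUDs but does not go below them.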
Signed-off-by: Steven Price <[email protected]>
---
include/linux/mm.h | 9 ++++++---
mm/pagewalk.c | 27 ++++++++++++++++-----------
2 files changed, 22 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..1a4b1615d012 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1412,10 +1412,9 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
/**
* mm_walk - callbacks for walk_page_range
+ * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
+ * @p4d_entry: if set, called for each non-empty P4D (1st-level) entry
* @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
- * this handler should only handle pud_trans_huge() puds.
- * the pmd_entry or pte_entry callbacks will be used for
- * regular PUDs.
* @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
* this handler is required to be able to handle
* pmd_trans_huge() pmds. They may simply choose to
@@ -1435,6 +1434,10 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
* (see the comment on walk_page_range() for more details)
*/
struct mm_walk {
+ int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk);
+ int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk);
int (*pud_entry)(pud_t *pud, unsigned long addr,
unsigned long next, struct mm_walk *walk);
int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c3084ff2569d..98373a9f88b8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -90,15 +90,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
}
if (walk->pud_entry) {
- spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
-
- if (ptl) {
- err = walk->pud_entry(pud, addr, next, walk);
- spin_unlock(ptl);
- if (err)
- break;
- continue;
- }
+ err = walk->pud_entry(pud, addr, next, walk);
+ if (err)
+ break;
}
split_huge_pud(walk->vma, pud, addr);
@@ -131,7 +125,12 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
break;
continue;
}
- if (walk->pmd_entry || walk->pte_entry)
+ if (walk->p4d_entry) {
+ err = walk->p4d_entry(p4d, addr, next, walk);
+ if (err)
+ break;
+ }
+ if (walk->pud_entry || walk->pmd_entry || walk->pte_entry)
err = walk_pud_range(p4d, addr, next, walk);
if (err)
break;
@@ -157,7 +156,13 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
break;
continue;
}
- if (walk->pmd_entry || walk->pte_entry)
+ if (walk->pgd_entry) {
+ err = walk->pgd_entry(pgd, addr, next, walk);
+ if (err)
+ break;
+ }
+ if (walk->p4d_entry || walk->pud_entry || walk->pmd_entry ||
+ walk->pte_entry)
err = walk_p4d_range(pgd, addr, next, walk);
if (err)
break;
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For s390, pud_large() and pmd_large() already exist, so add stubs
returning 0 for pgd_large() and p4d_large().
CC: Martin Schwidefsky <[email protected]>
CC: Heiko Carstens <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/s390/include/asm/pgtable.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 063732414dfb..9617f1fb69b4 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -605,6 +605,11 @@ static inline int pgd_present(pgd_t pgd)
return (pgd_val(pgd) & _REGION_ENTRY_ORIGIN) != 0UL;
}
+static inline int pgd_large(pgd_t pgd)
+{
+ return 0;
+}
+
static inline int pgd_none(pgd_t pgd)
{
if (pgd_folded(pgd))
@@ -645,6 +650,11 @@ static inline int p4d_present(p4d_t p4d)
return (p4d_val(p4d) & _REGION_ENTRY_ORIGIN) != 0UL;
}
+static inline int p4d_large(p4d_t p4d)
+{
+ return 0;
+}
+
static inline int p4d_none(p4d_t p4d)
{
if (p4d_folded(p4d))
--
2.20.1
The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know at what depth
the missing entry is. Add this as an extra parameter to pte_hole().
When the depth isn't known (e.g. processing a vma) then -1 is passed.
The depth that is reported is the actual level where the entry is
missing (ignoring any folding that is in place), i.e. any levels where
PTRS_PER_P?D is set to 1 are ignored.
Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.
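
The depth adjustment for folded levels can be pictured with the following user-space sketch. It is a model under stated assumptions, not the real_depth() implementation in mm/pagewalk.c: model_real_depth() and the ptrs_per[] array are hypothetical, with ptrs_per[d] standing in for the PTRS_PER_P?D value at depth d.

```c
/* Depth numbering from the text above: 0:PGD 1:P4D 2:PUD 3:PMD 4:PTE */
enum { DEPTH_PGD, DEPTH_P4D, DEPTH_PUD, DEPTH_PMD, DEPTH_PTE, NR_DEPTHS };

/*
 * ptrs_per[d] models PTRS_PER_P?D for depth d. A value of 1 means the
 * level is folded into the one above it, so a hole seen there is
 * reported at the nearest unfolded level above. The PGD is never
 * folded, and an unknown depth of -1 is passed through unchanged.
 */
static int model_real_depth(int depth, const int ptrs_per[NR_DEPTHS])
{
	while (depth > DEPTH_PGD && ptrs_per[depth] == 1)
		depth--;
	return depth;
}
```

For example, with a folded P4D (ptrs_per[DEPTH_P4D] == 1) a hole found while walking the P4D level is reported as depth 0, while holes at the PUD, PMD and PTE levels keep their depths of 2, 3 and 4.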
Signed-off-by: Steven Price <[email protected]>
---
fs/proc/task_mmu.c | 4 ++--
include/linux/mm.h | 6 ++++--
mm/hmm.c | 2 +-
mm/migrate.c | 1 +
mm/mincore.c | 1 +
mm/pagewalk.c | 31 +++++++++++++++++++++++++------
6 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0ec9edab2f3..91131cd4e9e0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -474,7 +474,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
#ifdef CONFIG_SHMEM
static int smaps_pte_hole(unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+ __always_unused int depth, struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
@@ -1203,7 +1203,7 @@ static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
}
static int pagemap_pte_hole(unsigned long start, unsigned long end,
- struct mm_walk *walk)
+ __always_unused int depth, struct mm_walk *walk)
{
struct pagemapread *pm = walk->private;
unsigned long addr = start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1a4b1615d012..4ae3634a9118 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1420,7 +1420,9 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
* pmd_trans_huge() pmds. They may simply choose to
* split_huge_page() instead of handling it explicitly.
* @pte_entry: if set, called for each non-empty PTE (4th-level) entry
- * @pte_hole: if set, called for each hole at all levels
+ * @pte_hole: if set, called for each hole at all levels,
+ * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD, 4:PTE
+ * any depths where PTRS_PER_P?D is equal to 1 are skipped
* @hugetlb_entry: if set, called for each hugetlb entry
* @test_walk: caller specific callback function to determine whether
* we walk over the current vma or not. Returning 0
@@ -1445,7 +1447,7 @@ struct mm_walk {
int (*pte_entry)(pte_t *pte, unsigned long addr,
unsigned long next, struct mm_walk *walk);
int (*pte_hole)(unsigned long addr, unsigned long next,
- struct mm_walk *walk);
+ int depth, struct mm_walk *walk);
int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long next,
struct mm_walk *walk);
diff --git a/mm/hmm.c b/mm/hmm.c
index a04e4b810610..e3e6b8fda437 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -440,7 +440,7 @@ static void hmm_range_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
}
static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+ __always_unused int depth, struct mm_walk *walk)
{
struct hmm_vma_walk *hmm_vma_walk = walk->private;
struct hmm_range *range = hmm_vma_walk->range;
diff --git a/mm/migrate.c b/mm/migrate.c
index d4fd680be3b0..8b62a9fecb5c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2121,6 +2121,7 @@ struct migrate_vma {
static int migrate_vma_collect_hole(unsigned long start,
unsigned long end,
+ __always_unused int depth,
struct mm_walk *walk)
{
struct migrate_vma *migrate = walk->private;
diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..c4edbc688241 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -104,6 +104,7 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
}
static int mincore_unmapped_range(unsigned long addr, unsigned long end,
+ __always_unused int depth,
struct mm_walk *walk)
{
walk->private += __mincore_unmapped_range(addr, end,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index dac0c848b458..57946bcd810c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -4,6 +4,22 @@
#include <linux/sched.h>
#include <linux/hugetlb.h>
+/*
+ * We want to know the real level where an entry is located ignoring any
+ * folding of levels which may be happening. For example if p4d is folded then
+ * a missing entry found at level 1 (p4d) is actually at level 0 (pgd).
+ */
+static int real_depth(int depth)
+{
+ if (depth == 3 && PTRS_PER_PMD == 1)
+ depth = 2;
+ if (depth == 2 && PTRS_PER_PUD == 1)
+ depth = 1;
+ if (depth == 1 && PTRS_PER_P4D == 1)
+ depth = 0;
+ return depth;
+}
+
static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
@@ -31,6 +47,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
pmd_t *pmd;
unsigned long next;
int err = 0;
+ int depth = real_depth(3);
pmd = pmd_offset(pud, addr);
do {
@@ -38,7 +55,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
next = pmd_addr_end(addr, end);
if (pmd_none(*pmd)) {
if (walk->pte_hole)
- err = walk->pte_hole(addr, next, walk);
+ err = walk->pte_hole(addr, next, depth, walk);
if (err)
break;
continue;
@@ -81,6 +98,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
pud_t *pud;
unsigned long next;
int err = 0;
+ int depth = real_depth(2);
pud = pud_offset(p4d, addr);
do {
@@ -88,7 +106,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(*pud)) {
if (walk->pte_hole)
- err = walk->pte_hole(addr, next, walk);
+ err = walk->pte_hole(addr, next, depth, walk);
if (err)
break;
continue;
@@ -123,13 +141,14 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
p4d_t *p4d;
unsigned long next;
int err = 0;
+ int depth = real_depth(1);
p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
if (p4d_none_or_clear_bad(p4d)) {
if (walk->pte_hole)
- err = walk->pte_hole(addr, next, walk);
+ err = walk->pte_hole(addr, next, depth, walk);
if (err)
break;
continue;
@@ -160,7 +179,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd)) {
if (walk->pte_hole)
- err = walk->pte_hole(addr, next, walk);
+ err = walk->pte_hole(addr, next, 0, walk);
if (err)
break;
continue;
@@ -206,7 +225,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
else if (walk->pte_hole)
- err = walk->pte_hole(addr, next, walk);
+ err = walk->pte_hole(addr, next, -1, walk);
if (err)
break;
@@ -249,7 +268,7 @@ static int walk_page_test(unsigned long start, unsigned long end,
if (vma->vm_flags & VM_PFNMAP) {
int err = 1;
if (walk->pte_hole)
- err = walk->pte_hole(start, end, walk);
+ err = walk->pte_hole(start, end, -1, walk);
return err ? err : 1;
}
return 0;
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For microblaze, we don't support large pages, so add stubs returning 0.
CC: Michal Simek <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/microblaze/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h
index 142d3f004848..044ea7dbb4cc 100644
--- a/arch/microblaze/include/asm/pgtable.h
+++ b/arch/microblaze/include/asm/pgtable.h
@@ -303,6 +303,7 @@ extern unsigned long empty_zero_page[1024];
#define pmd_none(pmd) (!pmd_val(pmd))
#define pmd_bad(pmd) ((pmd_val(pmd) & _PMD_PRESENT) == 0)
#define pmd_present(pmd) ((pmd_val(pmd) & _PMD_PRESENT) != 0)
+#define pmd_large(pmd) (0)
#define pmd_clear(pmdp) do { pmd_val(*(pmdp)) = 0; } while (0)
#define pte_page(x) (mem_map + (unsigned long) \
@@ -323,6 +324,7 @@ extern unsigned long empty_zero_page[1024];
static inline int pgd_none(pgd_t pgd) { return 0; }
static inline int pgd_bad(pgd_t pgd) { return 0; }
static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline int pgd_large(pgd_t pgd) { return 0; }
#define pgd_clear(xp) do { } while (0)
#define pgd_page(pgd) \
((unsigned long) __va(pgd_val(pgd) & PAGE_MASK))
--
2.20.1
Now that walk_page_range() can walk kernel page tables, we can switch the
arm64 ptdump code over to using it, simplifying the code.
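
The callback style this conversion relies on can be sketched in userspace
as follows. This is a toy two-level table, not the kernel walker: every
toy_* name, the table size and the page size are assumptions made up for
illustration. The point is only that one generic loop reports both populated
entries and holes, with the hole callback told at which depth the entry was
missing.

```c
#include <stddef.h>

#define TOY_ENTRIES 4		/* entries per table (hypothetical) */
#define TOY_PAGE    0x1000UL	/* bytes covered by one leaf entry */

struct toy_walk {
	int (*leaf_entry)(unsigned long val, unsigned long addr, void *priv);
	int (*hole)(unsigned long addr, unsigned long next, int depth,
		    void *priv);
	void *priv;
};

/* Example private state: count what the walker reports. */
struct toy_stats {
	int holes[2];		/* indexed by depth: 0 = top, 1 = leaf table */
	int leaves;
};

static int count_leaf(unsigned long val, unsigned long addr, void *priv)
{
	struct toy_stats *stats = priv;

	(void)val;
	(void)addr;
	stats->leaves++;
	return 0;
}

static int count_hole(unsigned long addr, unsigned long next, int depth,
		      void *priv)
{
	struct toy_stats *stats = priv;

	(void)addr;
	(void)next;
	stats->holes[depth]++;
	return 0;
}

/* top[i] == NULL is a hole at depth 0; a zero leaf value is a hole at depth 1. */
static int toy_walk_range(unsigned long *const *top, const struct toy_walk *walk)
{
	const unsigned long span = TOY_ENTRIES * TOY_PAGE;
	unsigned long addr = 0;
	int err = 0;

	for (size_t i = 0; i < TOY_ENTRIES; i++, addr += span) {
		if (!top[i]) {
			if (walk->hole)
				err = walk->hole(addr, addr + span, 0, walk->priv);
			if (err)
				break;
			continue;
		}
		for (size_t j = 0; j < TOY_ENTRIES; j++) {
			unsigned long leaf_addr = addr + j * TOY_PAGE;
			unsigned long val = top[i][j];

			if (val) {
				if (walk->leaf_entry)
					err = walk->leaf_entry(val, leaf_addr,
							       walk->priv);
			} else if (walk->hole) {
				err = walk->hole(leaf_addr, leaf_addr + TOY_PAGE,
						 1, walk->priv);
			}
			if (err)
				return err;
		}
	}
	return err;
}
```

A dumper built this way only supplies callbacks and private state; the
iteration itself lives in one place, which is the simplification the
conversion is after.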
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/mm/dump.c | 108 +++++++++++++++++++++----------------------
1 file changed, 53 insertions(+), 55 deletions(-)
diff --git a/arch/arm64/mm/dump.c b/arch/arm64/mm/dump.c
index 99bb8facb5cb..ee0bc1441dd0 100644
--- a/arch/arm64/mm/dump.c
+++ b/arch/arm64/mm/dump.c
@@ -286,73 +286,71 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level,
}
-static void walk_pte(struct pg_state *st, pmd_t *pmdp, unsigned long start,
- unsigned long end)
+static int pud_entry(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- unsigned long addr = start;
- pte_t *ptep = pte_offset_kernel(pmdp, start);
+ struct pg_state *st = walk->private;
+ pud_t val = READ_ONCE(*pud);
- do {
- note_page(st, addr, 4, READ_ONCE(pte_val(*ptep)));
- } while (ptep++, addr += PAGE_SIZE, addr != end);
+ if (pud_table(val))
+ return 0;
+
+ note_page(st, addr, 2, pud_val(val));
+
+ return 0;
}
-static void walk_pmd(struct pg_state *st, pud_t *pudp, unsigned long start,
- unsigned long end)
+static int pmd_entry(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- unsigned long next, addr = start;
- pmd_t *pmdp = pmd_offset(pudp, start);
-
- do {
- pmd_t pmd = READ_ONCE(*pmdp);
- next = pmd_addr_end(addr, end);
-
- if (pmd_none(pmd) || pmd_sect(pmd)) {
- note_page(st, addr, 3, pmd_val(pmd));
- } else {
- BUG_ON(pmd_bad(pmd));
- walk_pte(st, pmdp, addr, next);
- }
- } while (pmdp++, addr = next, addr != end);
+ struct pg_state *st = walk->private;
+ pmd_t val = READ_ONCE(*pmd);
+
+ if (pmd_table(val))
+ return 0;
+
+ note_page(st, addr, 3, pmd_val(val));
+
+ return 0;
}
-static void walk_pud(struct pg_state *st, pgd_t *pgdp, unsigned long start,
- unsigned long end)
+static int pte_entry(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- unsigned long next, addr = start;
- pud_t *pudp = pud_offset(pgdp, start);
-
- do {
- pud_t pud = READ_ONCE(*pudp);
- next = pud_addr_end(addr, end);
-
- if (pud_none(pud) || pud_sect(pud)) {
- note_page(st, addr, 2, pud_val(pud));
- } else {
- BUG_ON(pud_bad(pud));
- walk_pmd(st, pudp, addr, next);
- }
- } while (pudp++, addr = next, addr != end);
+ struct pg_state *st = walk->private;
+ pte_t val = READ_ONCE(*pte);
+
+ note_page(st, addr, 4, pte_val(val));
+
+ return 0;
+}
+
+static int pte_hole(unsigned long addr, unsigned long next, int depth,
+ struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+
+ note_page(st, addr, depth + 1, 0);
+
+ return 0;
}
static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
- unsigned long start)
+ unsigned long start)
{
- unsigned long end = (start < TASK_SIZE_64) ? TASK_SIZE_64 : 0;
- unsigned long next, addr = start;
- pgd_t *pgdp = pgd_offset(mm, start);
-
- do {
- pgd_t pgd = READ_ONCE(*pgdp);
- next = pgd_addr_end(addr, end);
-
- if (pgd_none(pgd)) {
- note_page(st, addr, 1, pgd_val(pgd));
- } else {
- BUG_ON(pgd_bad(pgd));
- walk_pud(st, pgdp, addr, next);
- }
- } while (pgdp++, addr = next, addr != end);
+ struct mm_walk walk = {
+ .mm = mm,
+ .private = st,
+ .pud_entry = pud_entry,
+ .pmd_entry = pmd_entry,
+ .pte_entry = pte_entry,
+ .pte_hole = pte_hole
+ };
+ down_read(&mm->mmap_sem);
+ walk_page_range(start, start | (((unsigned long)PTRS_PER_PGD <<
+ PGDIR_SHIFT) - 1),
+ &walk);
+ up_read(&mm->mmap_sem);
}
void ptdump_walk_pgd(struct seq_file *m, struct ptdump_info *info)
--
2.20.1
To enable x86 to use the generic walk_page_range() function, the
callers of ptdump_walk_pgd_level() need to pass an mm_struct rather
than the raw pgd_t pointer. Luckily since commit 7e904a91bf60
("efi: Use efi_mm in x86 as well as ARM") we now have an mm_struct
for EFI on x86.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/mm/dump_pagetables.c | 4 ++--
arch/x86/platform/efi/efi_32.c | 2 +-
arch/x86/platform/efi/efi_64.c | 4 ++--
4 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1b854c64cc7d..def035fa230e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -27,7 +27,7 @@
extern pgd_t early_top_pgt[PTRS_PER_PGD];
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
void ptdump_walk_pgd_level_checkwx(void);
void ptdump_walk_user_pgd_level_checkwx(void);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ecbaf30a6a2f..3a8cf6699976 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -572,9 +572,9 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
pr_info("x86/mm: Checked W+X mappings: passed, no W+X pages found.\n");
}
-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
{
- ptdump_walk_pgd_level_core(m, pgd, false, true);
+ ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
}
void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index 9959657127f4..9175ceaa6e72 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -49,7 +49,7 @@ void efi_sync_low_kernel_mappings(void) {}
void __init efi_dump_pagetable(void)
{
#ifdef CONFIG_EFI_PGT_DUMP
- ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+ ptdump_walk_pgd_level(NULL, &init_mm);
#endif
}
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index cf0347f61b21..a2e0f9800190 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -611,9 +611,9 @@ void __init efi_dump_pagetable(void)
{
#ifdef CONFIG_EFI_PGT_DUMP
if (efi_enabled(EFI_OLD_MEMMAP))
- ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+ ptdump_walk_pgd_level(NULL, &init_mm);
else
- ptdump_walk_pgd_level(NULL, efi_mm.pgd);
+ ptdump_walk_pgd_level(NULL, &efi_mm);
#endif
}
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For nios2, we don't support large pages, so add a stub returning 0.
CC: Ley Foon Tan <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/nios2/include/asm/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index db4f7d179220..b6ee0c205279 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -190,6 +190,11 @@ static inline int pmd_present(pmd_t pmd)
&& (pmd_val(pmd) != 0UL);
}
+static inline int pmd_large(pmd_t pmd)
+{
+ return 0;
+}
+
static inline void pmd_clear(pmd_t *pmdp)
{
pmd_val(*pmdp) = (unsigned long) invalid_pte_table;
--
2.20.1
To enable x86 to use the generic walk_page_range() function, the
callers of ptdump_walk_pgd_level_debugfs() need to pass in the mm_struct.
This means that ptdump_walk_pgd_level_core() is now always passed a
valid pgd, so drop the support for pgd==NULL.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 3 ++-
arch/x86/mm/debug_pagetables.c | 8 ++++----
arch/x86/mm/dump_pagetables.c | 14 ++++++--------
3 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index def035fa230e..d6d919a20aac 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,7 +28,8 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD];
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+ bool user);
void ptdump_walk_pgd_level_checkwx(void);
void ptdump_walk_user_pgd_level_checkwx(void);
diff --git a/arch/x86/mm/debug_pagetables.c b/arch/x86/mm/debug_pagetables.c
index cd84f067e41d..824131052574 100644
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -6,7 +6,7 @@
static int ptdump_show(struct seq_file *m, void *v)
{
- ptdump_walk_pgd_level_debugfs(m, NULL, false);
+ ptdump_walk_pgd_level_debugfs(m, &init_mm, false);
return 0;
}
@@ -16,7 +16,7 @@ static int ptdump_curknl_show(struct seq_file *m, void *v)
{
if (current->mm->pgd) {
down_read(&current->mm->mmap_sem);
- ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, false);
+ ptdump_walk_pgd_level_debugfs(m, current->mm, false);
up_read(&current->mm->mmap_sem);
}
return 0;
@@ -31,7 +31,7 @@ static int ptdump_curusr_show(struct seq_file *m, void *v)
{
if (current->mm->pgd) {
down_read(&current->mm->mmap_sem);
- ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, true);
+ ptdump_walk_pgd_level_debugfs(m, current->mm, true);
up_read(&current->mm->mmap_sem);
}
return 0;
@@ -46,7 +46,7 @@ static struct dentry *pe_efi;
static int ptdump_efi_show(struct seq_file *m, void *v)
{
if (efi_mm.pgd)
- ptdump_walk_pgd_level_debugfs(m, efi_mm.pgd, false);
+ ptdump_walk_pgd_level_debugfs(m, &efi_mm, false);
return 0;
}
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 3a8cf6699976..fb4b9212cae5 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -523,16 +523,12 @@ static inline bool is_hypervisor_range(int idx)
static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx, bool dmesg)
{
- pgd_t *start = INIT_PGD;
+ pgd_t *start = pgd;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};
- if (pgd) {
- start = pgd;
- st.to_dmesg = dmesg;
- }
-
+ st.to_dmesg = dmesg;
st.check_wx = checkwx;
st.seq = m;
if (checkwx)
@@ -577,8 +573,10 @@ void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
}
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+ bool user)
{
+ pgd_t *pgd = mm->pgd;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (user && static_cpu_has(X86_FEATURE_PTI))
pgd = kernel_to_user_pgdp(pgd);
@@ -604,7 +602,7 @@ void ptdump_walk_user_pgd_level_checkwx(void)
void ptdump_walk_pgd_level_checkwx(void)
{
- ptdump_walk_pgd_level_core(NULL, NULL, true, false);
+ ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
}
static int __init pt_dump_init(void)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For openrisc, we don't support large pages, so add a stub returning 0.
CC: Jonas Bonn <[email protected]>
CC: Stefan Kristiansson <[email protected]>
CC: Stafford Horne <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/openrisc/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
index 21c71303012f..5a375104ef71 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -228,6 +228,7 @@ extern unsigned long empty_zero_page[2048];
#define pmd_none(x) (!pmd_val(x))
#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK)) != _KERNPG_TABLE)
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
+#define pmd_large(x) (0)
#define pmd_clear(xp) do { pmd_val(*(xp)) = 0; } while (0)
/*
--
2.20.1
An mm_struct is needed to enable x86 to make use of the generic
walk_page_range() function.
In the case of walking the user page tables (when
CONFIG_PAGE_TABLE_ISOLATION is enabled), it is necessary to create a
fake_mm structure because there isn't an mm_struct with a pointer
to the pgd of the user page tables. This fake_mm structure is
initialised with the minimum necessary for the generic page walk code.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 36 ++++++++++++++++++++---------------
1 file changed, 21 insertions(+), 15 deletions(-)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index fb4b9212cae5..40a8b0da2170 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -111,8 +111,6 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR] = { -1, NULL }
};
-#define INIT_PGD ((pgd_t *) &init_top_pgt)
-
#else /* CONFIG_X86_64 */
enum address_markers_idx {
@@ -147,8 +145,6 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR] = { -1, NULL }
};
-#define INIT_PGD (swapper_pg_dir)
-
#endif /* !CONFIG_X86_64 */
/* Multipliers for offsets within the PTEs */
@@ -520,10 +516,10 @@ static inline bool is_hypervisor_range(int idx)
#endif
}
-static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
+static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
bool checkwx, bool dmesg)
{
- pgd_t *start = pgd;
+ pgd_t *start = mm->pgd;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};
@@ -570,39 +566,49 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
{
- ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
+ ptdump_walk_pgd_level_core(m, mm, false, true);
}
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static void ptdump_walk_pgd_level_user_core(struct seq_file *m,
+ struct mm_struct *mm,
+ bool checkwx, bool dmesg)
+{
+ struct mm_struct fake_mm = {
+ .pgd = kernel_to_user_pgdp(mm->pgd)
+ };
+ init_rwsem(&fake_mm.mmap_sem);
+ ptdump_walk_pgd_level_core(m, &fake_mm, checkwx, dmesg);
+}
+#endif
+
void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
bool user)
{
- pgd_t *pgd = mm->pgd;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (user && static_cpu_has(X86_FEATURE_PTI))
- pgd = kernel_to_user_pgdp(pgd);
+ ptdump_walk_pgd_level_user_core(m, mm, false, false);
+ else
#endif
- ptdump_walk_pgd_level_core(m, pgd, false, false);
+ ptdump_walk_pgd_level_core(m, mm, false, false);
}
EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
void ptdump_walk_user_pgd_level_checkwx(void)
{
#ifdef CONFIG_PAGE_TABLE_ISOLATION
- pgd_t *pgd = INIT_PGD;
-
if (!(__supported_pte_mask & _PAGE_NX) ||
!static_cpu_has(X86_FEATURE_PTI))
return;
pr_info("x86/mm: Checking user space page tables\n");
- pgd = kernel_to_user_pgdp(pgd);
- ptdump_walk_pgd_level_core(NULL, pgd, true, false);
+ ptdump_walk_pgd_level_user_core(NULL, &init_mm, true, false);
#endif
}
void ptdump_walk_pgd_level_checkwx(void)
{
- ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
+ ptdump_walk_pgd_level_core(NULL, &init_mm, true, false);
}
static int __init pt_dump_init(void)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For riscv a page is large when it has a read, write or execute bit
set on it.
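
The rule can be illustrated with a small userspace sketch. It assumes the
low PTE bits of the RISC-V privileged specification (V, R, W, X in bits 0
to 3); the PTE_* macro names and entry_is_leaf() are hypothetical stand-ins,
not the kernel's _PAGE_* definitions or p?d_large() helpers.

```c
#include <stdint.h>

/* Low PTE bits as laid out by the RISC-V privileged specification. */
#define PTE_V (1ULL << 0)	/* valid */
#define PTE_R (1ULL << 1)	/* readable */
#define PTE_W (1ULL << 2)	/* writable */
#define PTE_X (1ULL << 3)	/* executable */

/*
 * A valid entry with any of R/W/X set maps memory directly (a leaf);
 * a valid entry with none of them set points at the next-level table.
 */
static int entry_is_leaf(uint64_t entry)
{
	return (entry & PTE_V) && (entry & (PTE_R | PTE_W | PTE_X));
}
```

The same test applies at every level, which is why pud_large() and
pmd_large() in this patch have identical bodies apart from the accessor.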
CC: Palmer Dabbelt <[email protected]>
CC: Albert Ou <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/riscv/include/asm/pgtable-64.h | 6 ++++++
arch/riscv/include/asm/pgtable.h | 6 ++++++
2 files changed, 12 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 7aa0ea9bd8bb..6763c44d338d 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -51,6 +51,12 @@ static inline int pud_bad(pud_t pud)
return !pud_present(pud);
}
+static inline int pud_large(pud_t pud)
+{
+ return pud_present(pud)
+ && (pud_val(pud) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
+}
+
static inline void set_pud(pud_t *pudp, pud_t pud)
{
*pudp = pud;
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 16301966d65b..17624d7e7e8b 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -111,6 +111,12 @@ static inline int pmd_bad(pmd_t pmd)
return !pmd_present(pmd);
}
+static inline int pmd_large(pmd_t pmd)
+{
+ return pmd_present(pmd)
+ && (pmd_val(pmd) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
+}
+
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
*pmdp = pmd;
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For powerpc pmd_large() was already implemented, so hoist it out of the
CONFIG_TRANSPARENT_HUGEPAGE condition and implement the other levels.
Also since we now have a pmd_large always implemented we can drop the
pmd_is_leaf() function.
For 32 bit simply implement stubs returning 0.
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: [email protected]
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/powerpc/include/asm/book3s/32/pgtable.h | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 27 ++++++++++++-------
arch/powerpc/include/asm/nohash/32/pgtable.h | 1 +
.../include/asm/nohash/64/pgtable-4k.h | 1 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 12 +++------
5 files changed, 24 insertions(+), 18 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 49d76adb9bc5..036052a792c8 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -202,6 +202,7 @@ extern unsigned long ioremap_bot;
#define pmd_none(pmd) (!pmd_val(pmd))
#define pmd_bad(pmd) (pmd_val(pmd) & _PMD_BAD)
#define pmd_present(pmd) (pmd_val(pmd) & _PMD_PRESENT_MASK)
+#define pmd_large(pmd) (0)
static inline void pmd_clear(pmd_t *pmdp)
{
*pmdp = __pmd(0);
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index c9bfe526ca9d..1705b1a201bd 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -907,6 +907,11 @@ static inline int pud_present(pud_t pud)
return (pud_raw(pud) & cpu_to_be64(_PAGE_PRESENT));
}
+static inline int pud_large(pud_t pud)
+{
+ return (pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
+
extern struct page *pud_page(pud_t pud);
extern struct page *pmd_page(pmd_t pmd);
static inline pte_t pud_pte(pud_t pud)
@@ -954,6 +959,11 @@ static inline int pgd_present(pgd_t pgd)
return (pgd_raw(pgd) & cpu_to_be64(_PAGE_PRESENT));
}
+static inline int pgd_large(pgd_t pgd)
+{
+ return (pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
+}
+
static inline pte_t pgd_pte(pgd_t pgd)
{
return __pte_raw(pgd_raw(pgd));
@@ -1107,6 +1117,14 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write)
return pte_access_permitted(pmd_pte(pmd), write);
}
+/*
+ * returns true for pmd migration entries, THP, devmap, hugetlb
+ */
+static inline int pmd_large(pmd_t pmd)
+{
+ return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
@@ -1133,15 +1151,6 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
}
-/*
- * returns true for pmd migration entries, THP, devmap, hugetlb
- * But compile time dependent on THP config
- */
-static inline int pmd_large(pmd_t pmd)
-{
- return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
-}
-
static inline pmd_t pmd_mknotpresent(pmd_t pmd)
{
return __pmd(pmd_val(pmd) & ~_PAGE_PRESENT);
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index bed433358260..ebd55449914b 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -190,6 +190,7 @@ static inline pte_t pte_mkexec(pte_t pte)
#define pmd_none(pmd) (!pmd_val(pmd))
#define pmd_bad(pmd) (pmd_val(pmd) & _PMD_BAD)
#define pmd_present(pmd) (pmd_val(pmd) & _PMD_PRESENT_MASK)
+#define pmd_large(pmd) (0)
static inline void pmd_clear(pmd_t *pmdp)
{
*pmdp = __pmd(0);
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
index c40ec32b8194..9e6fa5646c9f 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
@@ -56,6 +56,7 @@
#define pgd_none(pgd) (!pgd_val(pgd))
#define pgd_bad(pgd) (pgd_val(pgd) == 0)
#define pgd_present(pgd) (pgd_val(pgd) != 0)
+#define pgd_large(pgd) (0)
#define pgd_page_vaddr(pgd) (pgd_val(pgd) & ~PGD_MASKED_BITS)
#ifndef __ASSEMBLY__
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 1b821c6efdef..040db20ac2ab 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -363,12 +363,6 @@ static void kvmppc_pte_free(pte_t *ptep)
kmem_cache_free(kvm_pte_cache, ptep);
}
-/* Like pmd_huge() and pmd_large(), but works regardless of config options */
-static inline int pmd_is_leaf(pmd_t pmd)
-{
- return !!(pmd_val(pmd) & _PAGE_PTE);
-}
-
static pmd_t *kvmppc_pmd_alloc(void)
{
return kmem_cache_alloc(kvm_pmd_cache, GFP_KERNEL);
@@ -455,7 +449,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full,
for (im = 0; im < PTRS_PER_PMD; ++im, ++p) {
if (!pmd_present(*p))
continue;
- if (pmd_is_leaf(*p)) {
+ if (pmd_large(*p)) {
if (full) {
pmd_clear(p);
} else {
@@ -588,7 +582,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
else if (level <= 1)
new_pmd = kvmppc_pmd_alloc();
- if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_is_leaf(*pmd)))
+ if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_large(*pmd)))
new_ptep = kvmppc_pte_alloc();
/* Check if we might have been invalidated; let the guest retry if so */
@@ -657,7 +651,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
new_pmd = NULL;
}
pmd = pmd_offset(pud, gpa);
- if (pmd_is_leaf(*pmd)) {
+ if (pmd_large(*pmd)) {
unsigned long lgpa = gpa & PMD_MASK;
/* Check if we raced and someone else has set the same thing */
--
2.20.1
Make use of the new functionality in walk_page_range to remove the
arch page walking code and use the generic code to walk the page tables.
The effective permissions are passed down the chain using new fields
in struct pg_state.
The KASAN optimisation is implemented by including test_p?d callbacks
which can decide to skip an entire tree of entries.
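
As a rough userspace sketch of how effective permissions accumulate down the
levels: write and user access need the bit set at every level, while
no-execute at any level makes the whole mapping non-executable. The bit
positions are the architectural x86 ones (R/W bit 1, U/S bit 2, NX bit 63);
the X86_PAGE_* names, combine_prot() and effective_at_leaf() are hypothetical,
and the combining rule is my reading of the existing effective_prot() helper
rather than a copy of it.

```c
#include <stdint.h>

#define X86_PAGE_RW   (1ULL << 1)
#define X86_PAGE_USER (1ULL << 2)
#define X86_PAGE_NX   (1ULL << 63)

/* RW and USER are ANDed across levels, NX is ORed. */
static uint64_t combine_prot(uint64_t parent, uint64_t entry)
{
	return (parent & entry & (X86_PAGE_USER | X86_PAGE_RW)) |
	       ((parent | entry) & X86_PAGE_NX);
}

/*
 * levels[0] is the top-level entry, levels[nlevels - 1] the leaf.
 * Each step plays the role of one effective_prot_p?d field in pg_state.
 */
static uint64_t effective_at_leaf(const uint64_t *levels, int nlevels)
{
	uint64_t eff = levels[0];

	for (int i = 1; i < nlevels; i++)
		eff = combine_prot(eff, levels[i]);
	return eff;
}
```

Because each callback only sees its own entry, the value computed at one
level has to be stashed somewhere for the next level down to read, hence the
extra fields.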
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 279 ++++++++++++++++++----------------
1 file changed, 146 insertions(+), 133 deletions(-)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 40a8b0da2170..64d1619493a4 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -33,6 +33,10 @@ struct pg_state {
int level;
pgprot_t current_prot;
pgprotval_t effective_prot;
+ pgprotval_t effective_prot_pgd;
+ pgprotval_t effective_prot_p4d;
+ pgprotval_t effective_prot_pud;
+ pgprotval_t effective_prot_pmd;
unsigned long start_address;
unsigned long current_address;
const struct addr_marker *marker;
@@ -355,22 +359,21 @@ static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
((prot1 | prot2) & _PAGE_NX);
}
-static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
- unsigned long P)
+static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- int i;
- pte_t *pte;
- pgprotval_t prot, eff;
-
- for (i = 0; i < PTRS_PER_PTE; i++) {
- st->current_address = normalize_addr(P + i * PTE_LEVEL_MULT);
- pte = pte_offset_map(&addr, st->current_address);
- prot = pte_flags(*pte);
- eff = effective_prot(eff_in, prot);
- note_page(st, __pgprot(prot), eff, 5);
- pte_unmap(pte);
- }
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ st->current_address = normalize_addr(addr);
+
+ prot = pte_flags(*pte);
+ eff = effective_prot(st->effective_prot_pmd, prot);
+ note_page(st, __pgprot(prot), eff, 5);
+
+ return 0;
}
+
#ifdef CONFIG_KASAN
/*
@@ -399,130 +402,152 @@ static inline bool kasan_page_table(struct pg_state *st, void *pt)
}
#endif
-#if PTRS_PER_PMD > 1
-
-static void walk_pmd_level(struct pg_state *st, pud_t addr,
- pgprotval_t eff_in, unsigned long P)
+static int ptdump_test_pmd(unsigned long addr, unsigned long next,
+ pmd_t *pmd, struct mm_walk *walk)
{
- int i;
- pmd_t *start, *pmd_start;
- pgprotval_t prot, eff;
-
- pmd_start = start = (pmd_t *)pud_page_vaddr(addr);
- for (i = 0; i < PTRS_PER_PMD; i++) {
- st->current_address = normalize_addr(P + i * PMD_LEVEL_MULT);
- if (!pmd_none(*start)) {
- prot = pmd_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (pmd_large(*start) || !pmd_present(*start)) {
- note_page(st, __pgprot(prot), eff, 4);
- } else if (!kasan_page_table(st, pmd_start)) {
- walk_pte_level(st, *start, eff,
- P + i * PMD_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 4);
- start++;
- }
+ struct pg_state *st = walk->private;
+
+ st->current_address = normalize_addr(addr);
+
+ if (kasan_page_table(st, pmd))
+ return 1;
+ return 0;
}
-#else
-#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
-#define pud_large(a) pmd_large(__pmd(pud_val(a)))
-#define pud_none(a) pmd_none(__pmd(pud_val(a)))
-#endif
+static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ prot = pmd_flags(*pmd);
+ eff = effective_prot(st->effective_prot_pud, prot);
+
+ st->current_address = normalize_addr(addr);
+
+ if (pmd_large(*pmd))
+ note_page(st, __pgprot(prot), eff, 4);
-#if PTRS_PER_PUD > 1
+ st->effective_prot_pmd = eff;
-static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
- unsigned long P)
+ return 0;
+}
+
+static int ptdump_test_pud(unsigned long addr, unsigned long next,
+ pud_t *pud, struct mm_walk *walk)
{
- int i;
- pud_t *start, *pud_start;
- pgprotval_t prot, eff;
- pud_t *prev_pud = NULL;
-
- pud_start = start = (pud_t *)p4d_page_vaddr(addr);
-
- for (i = 0; i < PTRS_PER_PUD; i++) {
- st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
- if (!pud_none(*start)) {
- prot = pud_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (pud_large(*start) || !pud_present(*start)) {
- note_page(st, __pgprot(prot), eff, 3);
- } else if (!kasan_page_table(st, pud_start)) {
- walk_pmd_level(st, *start, eff,
- P + i * PUD_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 3);
+ struct pg_state *st = walk->private;
- prev_pud = start;
- start++;
- }
+ st->current_address = normalize_addr(addr);
+
+ if (kasan_page_table(st, pud))
+ return 1;
+ return 0;
}
-#else
-#define walk_pud_level(s,a,e,p) walk_pmd_level(s,__pud(p4d_val(a)),e,p)
-#define p4d_large(a) pud_large(__pud(p4d_val(a)))
-#define p4d_none(a) pud_none(__pud(p4d_val(a)))
-#endif
+static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ prot = pud_flags(*pud);
+ eff = effective_prot(st->effective_prot_p4d, prot);
+
+ st->current_address = normalize_addr(addr);
+
+ if (pud_large(*pud))
+ note_page(st, __pgprot(prot), eff, 3);
+
+ st->effective_prot_pud = eff;
-static void walk_p4d_level(struct pg_state *st, pgd_t addr, pgprotval_t eff_in,
- unsigned long P)
+ return 0;
+}
+
+static int ptdump_test_p4d(unsigned long addr, unsigned long next,
+ p4d_t *p4d, struct mm_walk *walk)
{
- int i;
- p4d_t *start, *p4d_start;
- pgprotval_t prot, eff;
-
- if (PTRS_PER_P4D == 1)
- return walk_pud_level(st, __p4d(pgd_val(addr)), eff_in, P);
-
- p4d_start = start = (p4d_t *)pgd_page_vaddr(addr);
-
- for (i = 0; i < PTRS_PER_P4D; i++) {
- st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT);
- if (!p4d_none(*start)) {
- prot = p4d_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (p4d_large(*start) || !p4d_present(*start)) {
- note_page(st, __pgprot(prot), eff, 2);
- } else if (!kasan_page_table(st, p4d_start)) {
- walk_pud_level(st, *start, eff,
- P + i * P4D_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 2);
+ struct pg_state *st = walk->private;
- start++;
- }
+ st->current_address = normalize_addr(addr);
+
+ if (kasan_page_table(st, p4d))
+ return 1;
+ return 0;
}
-#define pgd_large(a) (pgtable_l5_enabled() ? pgd_large(a) : p4d_large(__p4d(pgd_val(a))))
-#define pgd_none(a) (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))
+static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ prot = p4d_flags(*p4d);
+ eff = effective_prot(st->effective_prot_pgd, prot);
+
+ st->current_address = normalize_addr(addr);
+
+ if (p4d_large(*p4d))
+ note_page(st, __pgprot(prot), eff, 2);
+
+ st->effective_prot_p4d = eff;
+
+ return 0;
+}
-static inline bool is_hypervisor_range(int idx)
+static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
-#ifdef CONFIG_X86_64
- /*
- * A hole in the beginning of kernel address space reserved
- * for a hypervisor.
- */
- return (idx >= pgd_index(GUARD_HOLE_BASE_ADDR)) &&
- (idx < pgd_index(GUARD_HOLE_END_ADDR));
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ prot = pgd_flags(*pgd);
+
+#ifdef CONFIG_X86_PAE
+ eff = _PAGE_USER | _PAGE_RW;
#else
- return false;
+ eff = prot;
#endif
+
+ st->current_address = normalize_addr(addr);
+
+ if (pgd_large(*pgd))
+ note_page(st, __pgprot(prot), eff, 1);
+
+ st->effective_prot_pgd = eff;
+
+ return 0;
+}
+
+static int ptdump_hole(unsigned long addr, unsigned long next, int depth,
+ struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+
+ st->current_address = normalize_addr(addr);
+
+ note_page(st, __pgprot(0), 0, depth + 1);
+
+ return 0;
}
static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
bool checkwx, bool dmesg)
{
- pgd_t *start = mm->pgd;
- pgprotval_t prot, eff;
- int i;
struct pg_state st = {};
+ struct mm_walk walk = {
+ .mm = mm,
+ .pgd_entry = ptdump_pgd_entry,
+ .p4d_entry = ptdump_p4d_entry,
+ .pud_entry = ptdump_pud_entry,
+ .pmd_entry = ptdump_pmd_entry,
+ .pte_entry = ptdump_pte_entry,
+ .test_p4d = ptdump_test_p4d,
+ .test_pud = ptdump_test_pud,
+ .test_pmd = ptdump_test_pmd,
+ .pte_hole = ptdump_hole,
+ .private = &st
+ };
st.to_dmesg = dmesg;
st.check_wx = checkwx;
@@ -530,27 +555,15 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
if (checkwx)
st.wx_pages = 0;
- for (i = 0; i < PTRS_PER_PGD; i++) {
- st.current_address = normalize_addr(i * PGD_LEVEL_MULT);
- if (!pgd_none(*start) && !is_hypervisor_range(i)) {
- prot = pgd_flags(*start);
-#ifdef CONFIG_X86_PAE
- eff = _PAGE_USER | _PAGE_RW;
+ down_read(&mm->mmap_sem);
+#ifdef CONFIG_X86_64
+ walk_page_range(0, PTRS_PER_PGD*PGD_LEVEL_MULT/2, &walk);
+ walk_page_range(normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT/2), ~0,
+ &walk);
#else
- eff = prot;
+ walk_page_range(0, ~0, &walk);
#endif
- if (pgd_large(*start) || !pgd_present(*start)) {
- note_page(&st, __pgprot(prot), eff, 1);
- } else {
- walk_p4d_level(&st, *start, eff,
- i * PGD_LEVEL_MULT);
- }
- } else
- note_page(&st, __pgprot(0), 0, 1);
-
- cond_resched();
- start++;
- }
+ up_read(&mm->mmap_sem);
/* Flush out the last page */
st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For sparc, we don't support large pages, so add stubs returning 0.
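As a rough illustration of why the walker needs this predicate, the sketch
below classifies a single entry as a hole, a leaf, or a pointer to a
lower-level table. Every toy_* name and the bit layout are hypothetical and
exist only for this example; they are not the sparc (or any other
architecture's) definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical entry encoding, for illustration only. */
#define TOY_PRESENT	0x1u
#define TOY_LARGE	0x2u	/* entry maps memory directly */

typedef struct { uint64_t val; } toy_pmd_t;

/* An architecture with huge mappings tests a bit in the entry. */
static inline int toy_pmd_large(toy_pmd_t pmd)
{
	return (pmd.val & TOY_LARGE) != 0;
}

/* An architecture without them provides a constant stub. */
static inline int toy_pmd_large_stub(toy_pmd_t pmd)
{
	(void)pmd;
	return 0;
}

enum toy_action { TOY_HOLE, TOY_LEAF, TOY_DESCEND };

/* What a generic walker does with one entry, given the arch predicate. */
static enum toy_action toy_classify(toy_pmd_t pmd, int (*is_large)(toy_pmd_t))
{
	if (pmd.val == 0)
		return TOY_HOLE;	/* nothing mapped here */
	if (is_large(pmd))
		return TOY_LEAF;	/* not a pointer to a lower table */
	return TOY_DESCEND;		/* entry points to a lower-level table */
}
```

With the stub, every non-empty entry is treated as a table pointer, which is
the right answer on an architecture that never creates large mappings at
that level.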
CC: "David S. Miller" <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/sparc/include/asm/pgtable_32.h | 10 ++++++++++
arch/sparc/include/asm/pgtable_64.h | 1 +
2 files changed, 11 insertions(+)
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 4eebed6c6781..dbc533e4c460 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -177,6 +177,11 @@ static inline int pmd_none(pmd_t pmd)
return !pmd_val(pmd);
}
+static inline int pmd_large(pmd_t pmd)
+{
+ return 0;
+}
+
static inline void pmd_clear(pmd_t *pmdp)
{
int i;
@@ -199,6 +204,11 @@ static inline int pgd_present(pgd_t pgd)
return ((pgd_val(pgd) & SRMMU_ET_MASK) == SRMMU_ET_PTD);
}
+static inline int pgd_large(pgd_t pgd)
+{
+ return 0;
+}
+
static inline void pgd_clear(pgd_t *pgdp)
{
set_pte((pte_t *)pgdp, __pte(0));
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1393a8ac596b..c32b26bdea53 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -892,6 +892,7 @@ static inline unsigned long pud_page_vaddr(pud_t pud)
#define pgd_page_vaddr(pgd) \
((unsigned long) __va(pgd_val(pgd)))
#define pgd_present(pgd) (pgd_val(pgd) != 0U)
+#define pgd_large(pgd) (0)
#define pgd_clear(pgdp) (pgd_val(*(pgdp)) = 0UL)
static inline unsigned long pud_large(pud_t pud)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For um, we don't support large pages, so add stubs returning 0.
CC: Jeff Dike <[email protected]>
CC: Richard Weinberger <[email protected]>
CC: Anton Ivanov <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/um/include/asm/pgtable-3level.h | 1 +
arch/um/include/asm/pgtable.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index c4d876dfb9ac..2abf9aa5808e 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -57,6 +57,7 @@
#define pud_none(x) (!(pud_val(x) & ~_PAGE_NEWPAGE))
#define pud_bad(x) ((pud_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
#define pud_present(x) (pud_val(x) & _PAGE_PRESENT)
+#define pud_large(x) (0)
#define pud_populate(mm, pud, pmd) \
set_pud(pud, __pud(_PAGE_TABLE + __pa(pmd)))
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index 9c04562310b3..d5fa4e118dcc 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -100,6 +100,7 @@ extern unsigned long end_iomem;
#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
+#define pmd_large(x) (0)
#define pmd_clear(xp) do { pmd_val(*(xp)) = _PAGE_NEWPAGE; } while (0)
#define pmd_newpage(x) (pmd_val(x) & _PAGE_NEWPAGE)
--
2.20.1
It is useful to be able to skip parts of the page table tree even when
walking without VMAs. Add test_p?d callbacks, similar to test_walk, which
are called just before a table at that level is walked. If the callback
returns a positive value the entire table is skipped; a negative value
aborts the walk and is returned as the error.
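The return-value handling can be sketched in isolation as follows. This is
a simplified model, not the mm/pagewalk.c code: struct toy_walk, the single
test_table hook and the tables_walked counter are assumptions made for the
example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct mm_walk. */
struct toy_walk {
	int (*test_table)(unsigned long addr, unsigned long end, void *priv);
	void *priv;
	int tables_walked;	/* counts tables that were actually walked */
};

/* Example callbacks: skip everything, or abort with an error code. */
static int toy_skip(unsigned long addr, unsigned long end, void *priv)
{
	(void)addr; (void)end; (void)priv;
	return 1;
}

static int toy_abort(unsigned long addr, unsigned long end, void *priv)
{
	(void)addr; (void)end; (void)priv;
	return -22;
}

static int toy_walk_table(struct toy_walk *walk, unsigned long addr,
			  unsigned long end)
{
	if (walk->test_table) {
		int err = walk->test_table(addr, end, walk->priv);

		if (err < 0)
			return err;	/* abort the whole walk */
		if (err > 0)
			return 0;	/* skip this table, keep walking */
	}

	walk->tables_walked++;		/* stands in for the per-entry loop */
	return 0;
}
```

A skipped table reports success to the caller, so the walk carries on with
the next range; only a negative value propagates upwards.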
Signed-off-by: Steven Price <[email protected]>
---
include/linux/mm.h | 11 +++++++++++
mm/pagewalk.c | 24 ++++++++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4ae3634a9118..581f31c6b6d9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1429,6 +1429,11 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
* value means "do page table walk over the current vma,"
* and a negative one means "abort current page table walk
* right now." 1 means "skip the current vma."
+ * @test_pmd: similar to test_walk(), but called for every pmd.
+ * @test_pud: similar to test_walk(), but called for every pud.
+ * @test_p4d: similar to test_walk(), but called for every p4d.
+ * Returning 0 means walk this part of the page tables,
+ * returning 1 means to skip this range.
* @mm: mm_struct representing the target process of page table walk
* @vma: vma currently walked (NULL if walking outside vmas)
* @private: private data for callbacks' usage
@@ -1453,6 +1458,12 @@ struct mm_walk {
struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
+ int (*test_pmd)(unsigned long addr, unsigned long next,
+ pmd_t *pmd_start, struct mm_walk *walk);
+ int (*test_pud)(unsigned long addr, unsigned long next,
+ pud_t *pud_start, struct mm_walk *walk);
+ int (*test_p4d)(unsigned long addr, unsigned long next,
+ p4d_t *p4d_start, struct mm_walk *walk);
struct mm_struct *mm;
struct vm_area_struct *vma;
void *private;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 57946bcd810c..ff2fc8490435 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -49,6 +49,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
int err = 0;
int depth = real_depth(3);
+ if (walk->test_pmd) {
+ err = walk->test_pmd(addr, end, pmd_offset(pud, 0), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
pmd = pmd_offset(pud, addr);
do {
again:
@@ -100,6 +108,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
int err = 0;
int depth = real_depth(2);
+ if (walk->test_pud) {
+ err = walk->test_pud(addr, end, pud_offset(p4d, 0), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
pud = pud_offset(p4d, addr);
do {
again:
@@ -143,6 +159,14 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
int err = 0;
int depth = real_depth(1);
+ if (walk->test_p4d) {
+ err = walk->test_p4d(addr, end, p4d_offset(pgd, 0), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For unicore32, we don't support large pages, so add a stub returning 0.
CC: Guan Xuetao <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/unicore32/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/unicore32/include/asm/pgtable.h b/arch/unicore32/include/asm/pgtable.h
index a4f2bef37e70..b45429df8b99 100644
--- a/arch/unicore32/include/asm/pgtable.h
+++ b/arch/unicore32/include/asm/pgtable.h
@@ -209,6 +209,7 @@ static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
#define pmd_bad(pmd) (((pmd_val(pmd) & \
(PMD_PRESENT | PMD_TYPE_MASK)) \
!= (PMD_PRESENT | PMD_TYPE_TABLE)))
+#define pmd_large(pmd) (0)
#define set_pmd(pmdpd, pmdval) \
do { \
--
2.20.1
Exposing the pud/pgd levels of the page tables to walk_page_range() means
we may come across the exotic large mappings that come with large areas
of contiguous memory (such as the kernel's linear map).
Where levels are folded, we need to provide the appropriate stub
implementation of p?d_large().
For x86, move the existing definitions of p?d_large() so that they are
only defined when the corresponding levels are not folded.
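A minimal sketch of the folding pattern, loosely modelled on the
pgtable-nopud.h style of header: the folded-away upper level gets constant
predicates, and the real test happens one level down on the same slot. The
toy_* names and the TOY_LARGE bit are hypothetical.

```c
#include <assert.h>

typedef struct { unsigned long val; } toy_p4d_t;

/* The pud level is folded: a "pud" is just the p4d entry in a wrapper. */
typedef struct { toy_p4d_t p4d; } toy_pud_t;

/* The folded-away upper level never reports a leaf ... */
static inline int toy_p4d_large(toy_p4d_t p4d)
{
	(void)p4d;
	return 0;
}

/* ... and "descending" from it yields the very same slot. */
static inline toy_pud_t *toy_pud_offset(toy_p4d_t *p4d, unsigned long addr)
{
	(void)addr;
	return (toy_pud_t *)p4d;
}

/* Hypothetical leaf bit, tested at the level that really exists. */
#define TOY_LARGE 0x80ul

static inline int toy_pud_large(toy_pud_t pud)
{
	return (pud.p4d.val & TOY_LARGE) != 0;
}
```

Because the upper-level stub always returns 0, a walker never mistakes the
wrapper level for a leaf and the entry is examined exactly once, at the
level where the architecture defines it.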
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 21 ++++++++-------------
include/asm-generic/4level-fixup.h | 1 +
include/asm-generic/5level-fixup.h | 1 +
include/asm-generic/pgtable-nop4d-hack.h | 1 +
include/asm-generic/pgtable-nop4d.h | 1 +
include/asm-generic/pgtable-nopmd.h | 1 +
include/asm-generic/pgtable-nopud.h | 1 +
7 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..1b854c64cc7d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -222,12 +222,6 @@ static inline unsigned long pgd_pfn(pgd_t pgd)
return (pgd_val(pgd) & PTE_PFN_MASK) >> PAGE_SHIFT;
}
-static inline int p4d_large(p4d_t p4d)
-{
- /* No 512 GiB pages yet */
- return 0;
-}
-
#define pte_page(pte) pfn_to_page(pte_pfn(pte))
static inline int pmd_large(pmd_t pte)
@@ -867,11 +861,6 @@ static inline int pud_bad(pud_t pud)
{
return (pud_flags(pud) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
}
-#else
-static inline int pud_large(pud_t pud)
-{
- return 0;
-}
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
static inline unsigned long pud_index(unsigned long address)
@@ -890,6 +879,12 @@ static inline int p4d_present(p4d_t p4d)
return p4d_flags(p4d) & _PAGE_PRESENT;
}
+static inline int p4d_large(p4d_t p4d)
+{
+ /* No 512 GiB pages yet */
+ return 0;
+}
+
static inline unsigned long p4d_page_vaddr(p4d_t p4d)
{
return (unsigned long)__va(p4d_val(p4d) & p4d_pfn_mask(p4d));
@@ -931,6 +926,8 @@ static inline int pgd_present(pgd_t pgd)
return pgd_flags(pgd) & _PAGE_PRESENT;
}
+static inline int pgd_large(pgd_t pgd) { return 0; }
+
static inline unsigned long pgd_page_vaddr(pgd_t pgd)
{
return (unsigned long)__va((unsigned long)pgd_val(pgd) & PTE_PFN_MASK);
@@ -1213,8 +1210,6 @@ static inline bool pgdp_maps_userspace(void *__ptr)
return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
}
-static inline int pgd_large(pgd_t pgd) { return 0; }
-
#ifdef CONFIG_PAGE_TABLE_ISOLATION
/*
* All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
index e3667c9a33a5..3cc65a4dd093 100644
--- a/include/asm-generic/4level-fixup.h
+++ b/include/asm-generic/4level-fixup.h
@@ -20,6 +20,7 @@
#define pud_none(pud) 0
#define pud_bad(pud) 0
#define pud_present(pud) 1
+#define pud_large(pud) 0
#define pud_ERROR(pud) do { } while (0)
#define pud_clear(pud) pgd_clear(pud)
#define pud_val(pud) pgd_val(pud)
diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
index bb6cb347018c..c4377db09a4f 100644
--- a/include/asm-generic/5level-fixup.h
+++ b/include/asm-generic/5level-fixup.h
@@ -22,6 +22,7 @@
#define p4d_none(p4d) 0
#define p4d_bad(p4d) 0
#define p4d_present(p4d) 1
+#define p4d_large(p4d) 0
#define p4d_ERROR(p4d) do { } while (0)
#define p4d_clear(p4d) pgd_clear(p4d)
#define p4d_val(p4d) pgd_val(p4d)
diff --git a/include/asm-generic/pgtable-nop4d-hack.h b/include/asm-generic/pgtable-nop4d-hack.h
index 829bdb0d6327..d3967560d123 100644
--- a/include/asm-generic/pgtable-nop4d-hack.h
+++ b/include/asm-generic/pgtable-nop4d-hack.h
@@ -27,6 +27,7 @@ typedef struct { pgd_t pgd; } pud_t;
static inline int pgd_none(pgd_t pgd) { return 0; }
static inline int pgd_bad(pgd_t pgd) { return 0; }
static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline int pgd_large(pgd_t pgd) { return 0; }
static inline void pgd_clear(pgd_t *pgd) { }
#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index aebab905e6cd..5d5dde24a8ca 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -22,6 +22,7 @@ typedef struct { pgd_t pgd; } p4d_t;
static inline int pgd_none(pgd_t pgd) { return 0; }
static inline int pgd_bad(pgd_t pgd) { return 0; }
static inline int pgd_present(pgd_t pgd) { return 1; }
+static inline int pgd_large(pgd_t pgd) { return 0; }
static inline void pgd_clear(pgd_t *pgd) { }
#define p4d_ERROR(p4d) (pgd_ERROR((p4d).pgd))
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index b85b8271a73d..beeb8f375d4d 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -30,6 +30,7 @@ typedef struct { pud_t pud; } pmd_t;
static inline int pud_none(pud_t pud) { return 0; }
static inline int pud_bad(pud_t pud) { return 0; }
static inline int pud_present(pud_t pud) { return 1; }
+static inline int pud_large(pud_t pud) { return 0; }
static inline void pud_clear(pud_t *pud) { }
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index c77a1d301155..b3ba496965d3 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -31,6 +31,7 @@ typedef struct { p4d_t p4d; } pud_t;
static inline int p4d_none(p4d_t p4d) { return 0; }
static inline int p4d_bad(p4d_t p4d) { return 0; }
static inline int p4d_present(p4d_t p4d) { return 1; }
+static inline int p4d_large(p4d_t p4d) { return 0; }
static inline void p4d_clear(p4d_t *p4d) { }
#define pud_ERROR(pud) (p4d_ERROR((pud).p4d))
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For csky, we don't support large pages, so add a stub returning 0.
CC: Guo Ren <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/csky/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index edfcbb25fd9f..4ffdb6bfbede 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -158,6 +158,8 @@ static inline int pmd_present(pmd_t pmd)
return (pmd_val(pmd) != __pa(invalid_pte_table));
}
+#define pmd_large(pmd) (0)
+
static inline void pmd_clear(pmd_t *p)
{
pmd_val(*p) = (__pa(invalid_pte_table));
--
2.20.1
arch/x86/mm/dump_pagetables.c passes both struct seq_file and struct
pg_state down the chain of walk_*_level() functions, only so that they can
be handed to note_page(). Instead, store the struct seq_file pointer in
struct pg_state (which is private to this file) and have note_page() read
it from there.
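The shape of the refactoring, reduced to a toy: the output sink lives in
the state structure, so intermediate walkers take a single argument. struct
toy_seq is a hypothetical stand-in for struct seq_file, and the "L%d" output
format is invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct seq_file. */
struct toy_seq {
	char buf[64];
	size_t len;
};

struct toy_state {
	int level;
	struct toy_seq *seq;	/* output sink travels with the state */
};

static void toy_note_page(struct toy_state *st, int level)
{
	struct toy_seq *m = st->seq;
	int n = snprintf(m->buf + m->len, sizeof(m->buf) - m->len,
			 "L%d\n", level);

	/* Only advance if the record fitted completely. */
	if (n > 0 && (size_t)n < sizeof(m->buf) - m->len)
		m->len += (size_t)n;
	st->level = level;
}

/* Intermediate levels no longer need to know about the sink. */
static void toy_walk_pte(struct toy_state *st)
{
	toy_note_page(st, 5);
}

static void toy_walk_pmd(struct toy_state *st)
{
	toy_note_page(st, 4);
	toy_walk_pte(st);
}
```

This also suits a callback-based walker, where each callback receives only
one private pointer.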
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 69 ++++++++++++++++++-----------------
1 file changed, 35 insertions(+), 34 deletions(-)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e3cdc85ce5b6..ecbaf30a6a2f 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -40,6 +40,7 @@ struct pg_state {
bool to_dmesg;
bool check_wx;
unsigned long wx_pages;
+ struct seq_file *seq;
};
struct addr_marker {
@@ -268,11 +269,12 @@ static void note_wx(struct pg_state *st)
* of PTE entries; the next one is different so we need to
* print what we collected so far.
*/
-static void note_page(struct seq_file *m, struct pg_state *st,
- pgprot_t new_prot, pgprotval_t new_eff, int level)
+static void note_page(struct pg_state *st, pgprot_t new_prot,
+ pgprotval_t new_eff, int level)
{
pgprotval_t prot, cur, eff;
static const char units[] = "BKMGTPE";
+ struct seq_file *m = st->seq;
/*
* If we have a "break" in the series, we need to flush the state that
@@ -357,8 +359,8 @@ static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
((prot1 | prot2) & _PAGE_NX);
}
-static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
pte_t *pte;
@@ -369,7 +371,7 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
pte = pte_offset_map(&addr, st->current_address);
prot = pte_flags(*pte);
eff = effective_prot(eff_in, prot);
- note_page(m, st, __pgprot(prot), eff, 5);
+ note_page(st, __pgprot(prot), eff, 5);
pte_unmap(pte);
}
}
@@ -382,22 +384,20 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
* us dozens of seconds (minutes for 5-level config) while checking for
* W+X mapping or reading kernel_page_tables debugfs file.
*/
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
- void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
{
if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
(pgtable_l5_enabled() &&
__pa(pt) == __pa(kasan_early_shadow_p4d)) ||
__pa(pt) == __pa(kasan_early_shadow_pud)) {
pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]);
- note_page(m, st, __pgprot(prot), 0, 5);
+ note_page(st, __pgprot(prot), 0, 5);
return true;
}
return false;
}
#else
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
- void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
{
return false;
}
@@ -405,7 +405,7 @@ static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
#if PTRS_PER_PMD > 1
-static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
+static void walk_pmd_level(struct pg_state *st, pud_t addr,
pgprotval_t eff_in, unsigned long P)
{
int i;
@@ -419,27 +419,27 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
prot = pmd_flags(*start);
eff = effective_prot(eff_in, prot);
if (pmd_large(*start) || !pmd_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 4);
- } else if (!kasan_page_table(m, st, pmd_start)) {
- walk_pte_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 4);
+ } else if (!kasan_page_table(st, pmd_start)) {
+ walk_pte_level(st, *start, eff,
P + i * PMD_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 4);
+ note_page(st, __pgprot(0), 0, 4);
start++;
}
}
#else
-#define walk_pmd_level(m,s,a,e,p) walk_pte_level(m,s,__pmd(pud_val(a)),e,p)
+#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
#define pud_large(a) pmd_large(__pmd(pud_val(a)))
#define pud_none(a) pmd_none(__pmd(pud_val(a)))
#endif
#if PTRS_PER_PUD > 1
-static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
pud_t *start, *pud_start;
@@ -454,13 +454,13 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
prot = pud_flags(*start);
eff = effective_prot(eff_in, prot);
if (pud_large(*start) || !pud_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 3);
- } else if (!kasan_page_table(m, st, pud_start)) {
- walk_pmd_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 3);
+ } else if (!kasan_page_table(st, pud_start)) {
+ walk_pmd_level(st, *start, eff,
P + i * PUD_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 3);
+ note_page(st, __pgprot(0), 0, 3);
prev_pud = start;
start++;
@@ -468,20 +468,20 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
}
#else
-#define walk_pud_level(m,s,a,e,p) walk_pmd_level(m,s,__pud(p4d_val(a)),e,p)
+#define walk_pud_level(s,a,e,p) walk_pmd_level(s,__pud(p4d_val(a)),e,p)
#define p4d_large(a) pud_large(__pud(p4d_val(a)))
#define p4d_none(a) pud_none(__pud(p4d_val(a)))
#endif
-static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_p4d_level(struct pg_state *st, pgd_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
p4d_t *start, *p4d_start;
pgprotval_t prot, eff;
if (PTRS_PER_P4D == 1)
- return walk_pud_level(m, st, __p4d(pgd_val(addr)), eff_in, P);
+ return walk_pud_level(st, __p4d(pgd_val(addr)), eff_in, P);
p4d_start = start = (p4d_t *)pgd_page_vaddr(addr);
@@ -491,13 +491,13 @@ static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
prot = p4d_flags(*start);
eff = effective_prot(eff_in, prot);
if (p4d_large(*start) || !p4d_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 2);
- } else if (!kasan_page_table(m, st, p4d_start)) {
- walk_pud_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 2);
+ } else if (!kasan_page_table(st, p4d_start)) {
+ walk_pud_level(st, *start, eff,
P + i * P4D_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 2);
+ note_page(st, __pgprot(0), 0, 2);
start++;
}
@@ -534,6 +534,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
}
st.check_wx = checkwx;
+ st.seq = m;
if (checkwx)
st.wx_pages = 0;
@@ -547,13 +548,13 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
eff = prot;
#endif
if (pgd_large(*start) || !pgd_present(*start)) {
- note_page(m, &st, __pgprot(prot), eff, 1);
+ note_page(&st, __pgprot(prot), eff, 1);
} else {
- walk_p4d_level(m, &st, *start, eff,
+ walk_p4d_level(&st, *start, eff,
i * PGD_LEVEL_MULT);
}
} else
- note_page(m, &st, __pgprot(0), 0, 1);
+ note_page(&st, __pgprot(0), 0, 1);
cond_resched();
start++;
@@ -561,7 +562,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
/* Flush out the last page */
st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
- note_page(m, &st, __pgprot(0), 0, 0);
+ note_page(&st, __pgprot(0), 0, 0);
if (!checkwx)
return;
if (st.wx_pages)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For ia64 leaf entries are always at the lowest level, so implement
stubs returning 0.
CC: Tony Luck <[email protected]>
CC: Fenghua Yu <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/ia64/include/asm/pgtable.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index b1e7468eb65a..84dda295391b 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -271,6 +271,7 @@ extern unsigned long VMALLOC_END;
#define pmd_none(pmd) (!pmd_val(pmd))
#define pmd_bad(pmd) (!ia64_phys_addr_valid(pmd_val(pmd)))
#define pmd_present(pmd) (pmd_val(pmd) != 0UL)
+#define pmd_large(pmd) (0)
#define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0UL)
#define pmd_page_vaddr(pmd) ((unsigned long) __va(pmd_val(pmd) & _PFN_MASK))
#define pmd_page(pmd) virt_to_page((pmd_val(pmd) + PAGE_OFFSET))
@@ -278,6 +279,7 @@ extern unsigned long VMALLOC_END;
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!ia64_phys_addr_valid(pud_val(pud)))
#define pud_present(pud) (pud_val(pud) != 0UL)
+#define pud_large(pud) (0)
#define pud_clear(pudp) (pud_val(*(pudp)) = 0UL)
#define pud_page_vaddr(pud) ((unsigned long) __va(pud_val(pud) & _PFN_MASK))
#define pud_page(pud) virt_to_page((pud_val(pud) + PAGE_OFFSET))
@@ -286,6 +288,7 @@ extern unsigned long VMALLOC_END;
#define pgd_none(pgd) (!pgd_val(pgd))
#define pgd_bad(pgd) (!ia64_phys_addr_valid(pgd_val(pgd)))
#define pgd_present(pgd) (pgd_val(pgd) != 0UL)
+#define pgd_large(pgd) (0)
#define pgd_clear(pgdp) (pgd_val(*(pgdp)) = 0UL)
#define pgd_page_vaddr(pgd) ((unsigned long) __va(pgd_val(pgd) & _PFN_MASK))
#define pgd_page(pgd) virt_to_page((pgd_val(pgd) + PAGE_OFFSET))
--
2.20.1
Since commit 48684a65b4e3 ("mm: pagewalk: fix misbehavior of
walk_page_range for vma(VM_PFNMAP)"), walk_page_range() reports any
kernel area as a hole, because it lacks a vma.
This means each arch has re-implemented page table walking when needed,
for example in the per-arch ptdump walker.
Remove the requirement to have a vma except when trying to split huge
pages.
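The per-entry decision at the pmd level after this change can be modelled
as below. It is a sketch of the control flow only: the enum, the function
name and the boolean inputs are assumptions for illustration, and any
pmd_entry callback is taken to have run already for a non-empty entry.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_step {
	TOY_STEP_HOLE,			/* report via pte_hole() */
	TOY_STEP_SPLIT_THEN_DESCEND,	/* user mapping: split, then walk ptes */
	TOY_STEP_SKIP_LEAF,		/* kernel large mapping: nothing below */
	TOY_STEP_DESCEND,		/* ordinary table: walk ptes */
};

static enum toy_step toy_pmd_step(bool none, bool large, bool have_vma)
{
	if (none)
		return TOY_STEP_HOLE;
	if (have_vma)
		return TOY_STEP_SPLIT_THEN_DESCEND;
	if (large)
		return TOY_STEP_SKIP_LEAF;
	return TOY_STEP_DESCEND;
}
```

The missing vma no longer forces the hole path; it only decides whether a
large entry is split (user space) or left alone and skipped (kernel).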
Signed-off-by: Steven Price <[email protected]>
---
mm/pagewalk.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 98373a9f88b8..dac0c848b458 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -36,7 +36,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
do {
again:
next = pmd_addr_end(addr, end);
- if (pmd_none(*pmd) || !walk->vma) {
+ if (pmd_none(*pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
@@ -59,9 +59,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
if (!walk->pte_entry)
continue;
- split_huge_pmd(walk->vma, pmd, addr);
- if (pmd_trans_unstable(pmd))
- goto again;
+ if (walk->vma) {
+ split_huge_pmd(walk->vma, pmd, addr);
+ if (pmd_trans_unstable(pmd))
+ goto again;
+ } else if (pmd_large(*pmd)) {
+ continue;
+ }
+
err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
@@ -81,7 +86,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
do {
again:
next = pud_addr_end(addr, end);
- if (pud_none(*pud) || !walk->vma) {
+ if (pud_none(*pud)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
@@ -95,9 +100,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
break;
}
- split_huge_pud(walk->vma, pud, addr);
- if (pud_none(*pud))
- goto again;
+ if (walk->vma) {
+ split_huge_pud(walk->vma, pud, addr);
+ if (pud_none(*pud))
+ goto again;
+ } else if (pud_large(*pud)) {
+ continue;
+ }
if (walk->pmd_entry || walk->pte_entry)
err = walk_pmd_range(pud, addr, next, walk);
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.
For hexagon, we don't support large pages, so add a stub returning 0.
CC: Richard Kuo <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/hexagon/include/asm/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
index 65125d0b02dd..422e6aa2a6ef 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -281,6 +281,11 @@ static inline int pmd_bad(pmd_t pmd)
return 0;
}
+static inline int pmd_large(pmd_t pmd)
+{
+ return 0;
+}
+
/*
* pmd_page - converts a PMD entry to a page pointer
*/
--
2.20.1
On Wed, Feb 27, 2019 at 9:07 AM Steven Price <[email protected]> wrote:
>
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For xtensa, we don't support large pages, so add a stub returning 0.
>
> CC: Chris Zankel <[email protected]>
> CC: Max Filippov <[email protected]>
> CC: [email protected]
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/xtensa/include/asm/pgtable.h | 1 +
> 1 file changed, 1 insertion(+)
Acked-by: Max Filippov <[email protected]>
--
Thanks.
-- Max
On 2/27/19 9:06 AM, Steven Price wrote:
> #ifdef CONFIG_SHMEM
> static int smaps_pte_hole(unsigned long addr, unsigned long end,
> - struct mm_walk *walk)
> + __always_unused int depth, struct mm_walk *walk)
> {
I think this 'depth' argument is a mistake. It's synthetic and it's
surely going to be a source of bugs.
The page table dumpers seem to be using this to dump out the "name" of a
hole which seems a bit bogus in the first place. I'd much rather teach
the dumpers about the length of the hole, "the hole is 0x12340000 bytes
long", rather than "there's a hole at this level".
On Wed, 27 Feb 2019 17:05:52 +0000
Steven Price <[email protected]> wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For s390, we don't support large pages, so add a stub returning 0.
Well s390 does support 1MB and 2GB large pages, pmd_large() and pud_large()
are non-empty. We do not support 4TB or 8PB large pages though, which
makes the patch itself correct. Just the wording is slightly off.
> CC: Martin Schwidefsky <[email protected]>
> CC: Heiko Carstens <[email protected]>
> CC: [email protected]
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/s390/include/asm/pgtable.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 063732414dfb..9617f1fb69b4 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -605,6 +605,11 @@ static inline int pgd_present(pgd_t pgd)
> return (pgd_val(pgd) & _REGION_ENTRY_ORIGIN) != 0UL;
> }
>
> +static inline int pgd_large(pgd_t pgd)
> +{
> + return 0;
> +}
> +
> static inline int pgd_none(pgd_t pgd)
> {
> if (pgd_folded(pgd))
> @@ -645,6 +650,11 @@ static inline int p4d_present(p4d_t p4d)
> return (p4d_val(p4d) & _REGION_ENTRY_ORIGIN) != 0UL;
> }
>
> +static inline int p4d_large(p4d_t p4d)
> +{
> + return 0;
> +}
> +
> static inline int p4d_none(p4d_t p4d)
> {
> if (p4d_folded(p4d))
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On 2/27/19 9:06 AM, Steven Price wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information will be provided by the
> p?d_large() functions/macros.
>
> For arc, we only have two levels, so only pmd_large() is needed.
>
> CC: Vineet Gupta <[email protected]>
> CC: [email protected]
> Signed-off-by: Steven Price <[email protected]>
Acked-by: Vineet Gupta <[email protected]>
Thx,
-Vineet
From: Steven Price <[email protected]>
Date: Wed, 27 Feb 2019 17:05:54 +0000
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For sparc, we don't support large pages, so add stubs returning 0.
>
> CC: "David S. Miller" <[email protected]>
> CC: [email protected]
> Signed-off-by: Steven Price <[email protected]>
Sparc does support large pages on 64-bit, just not at this level. It
would be nice if the commit message was made more accurate. Other than
that:
Acked-by: David S. Miller <[email protected]>
On 27.02.19 18:05, Steven Price wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For parisc, we don't support large pages, so add stubs returning 0.
We do support huge pages on parisc, but not yet on those levels.
So, you may add
Acked-by: Helge Deller <[email protected]> # parisc
Helge
> CC: "James E.J. Bottomley" <[email protected]>
> CC: Helge Deller <[email protected]>
> CC: [email protected]
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/parisc/include/asm/pgtable.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
> index c7bb74e22436..1f38c85a9530 100644
> --- a/arch/parisc/include/asm/pgtable.h
> +++ b/arch/parisc/include/asm/pgtable.h
> @@ -302,6 +302,7 @@ extern unsigned long *empty_zero_page;
> #endif
> #define pmd_bad(x) (!(pmd_flag(x) & PxD_FLAG_VALID))
> #define pmd_present(x) (pmd_flag(x) & PxD_FLAG_PRESENT)
> +#define pmd_large(x) (0)
> static inline void pmd_clear(pmd_t *pmd) {
> #if CONFIG_PGTABLE_LEVELS == 3
> if (pmd_flag(*pmd) & PxD_FLAG_ATTACHED)
> @@ -324,6 +325,7 @@ static inline void pmd_clear(pmd_t *pmd) {
> #define pgd_none(x) (!pgd_val(x))
> #define pgd_bad(x) (!(pgd_flag(x) & PxD_FLAG_VALID))
> #define pgd_present(x) (pgd_flag(x) & PxD_FLAG_PRESENT)
> +#define pgd_large(x) (0)
> static inline void pgd_clear(pgd_t *pgd) {
> #if CONFIG_PGTABLE_LEVELS == 3
> if(pgd_flag(*pgd) & PxD_FLAG_ATTACHED)
> @@ -342,6 +344,7 @@ static inline void pgd_clear(pgd_t *pgd) {
> static inline int pgd_none(pgd_t pgd) { return 0; }
> static inline int pgd_bad(pgd_t pgd) { return 0; }
> static inline int pgd_present(pgd_t pgd) { return 1; }
> +static inline int pgd_large(pgd_t pgd) { return 0; }
> static inline void pgd_clear(pgd_t * pgdp) { }
> #endif
>
>
Hi Steven,
On Wed, Feb 27, 2019 at 6:07 PM Steven Price <[email protected]> wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For m68k, we don't support large pages, so add stubs returning 0
>
> CC: Geert Uytterhoeven <[email protected]>
> CC: [email protected]
> Signed-off-by: Steven Price <[email protected]>
Thanks for your patch!
> arch/m68k/include/asm/mcf_pgtable.h | 2 ++
> arch/m68k/include/asm/motorola_pgtable.h | 2 ++
> arch/m68k/include/asm/pgtable_no.h | 1 +
> arch/m68k/include/asm/sun3_pgtable.h | 2 ++
> 4 files changed, 7 insertions(+)
If the definitions are the same, why not add them to
arch/m68k/include/asm/pgtable.h instead?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Hi Steven,
On Wed, Feb 27, 2019 at 05:05:45PM +0000, Steven Price wrote:
> For mips, we don't support large pages on 32 bit so add stubs returning 0.
So far so good :)
> For 64 bit look for _PAGE_HUGE flag being set. This means exposing the
> flag when !CONFIG_MIPS_HUGE_TLB_SUPPORT.
Here I have to ask why? We could just return 0 like the mips32 case when
CONFIG_MIPS_HUGE_TLB_SUPPORT=n, let the compiler optimize the whole
thing out and avoid redundant work at runtime.
This could be unified too in asm/pgtable.h - checking for
CONFIG_MIPS_HUGE_TLB_SUPPORT should be sufficient to cover the mips32
case along with the subset of mips64 configurations without huge pages.
Thanks,
Paul
On 27/02/2019 17:38, Dave Hansen wrote:
> On 2/27/19 9:06 AM, Steven Price wrote:
>> #ifdef CONFIG_SHMEM
>> static int smaps_pte_hole(unsigned long addr, unsigned long end,
>> - struct mm_walk *walk)
>> + __always_unused int depth, struct mm_walk *walk)
>> {
>
> I think this 'depth' argument is a mistake. It's synthetic and it's
> surely going to be a source of bugs.
>
> The page table dumpers seem to be using this to dump out the "name" of a
> hole which seems a bit bogus in the first place. I'd much rather teach
> the dumpers about the length of the hole, "the hole is 0x12340000 bytes
> long", rather than "there's a hole at this level".
I originally started by trying to calculate the 'depth' from (end -
addr), e.g. for arm64:
level = 4 - (ilog2(end - addr) - PAGE_SHIFT) / (PAGE_SHIFT - 3)
However there are two issues that I encountered:
* walk_page_range() takes a range of addresses to walk. This means that
holes at the beginning/end of the range are clamped to the address
range. This particularly shows up at the end of the range as I use ~0ULL
as the end which leads to (~0ULL - addr) which is 1 byte short of the
desired size. Obviously that particular corner-case is easy to work
round, but it seemed fragile.
* The above definition for arm64 isn't correct in all cases. You need to
account for things like CONFIG_PGTABLE_LEVELS. Other architectures also
have various quirks in their page tables.
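As a rough illustration of both issues, here is a user-space sketch of the
arm64 calculation above. It assumes 4 KiB pages (a PAGE_SHIFT of 12) and a
full set of levels; the demo_* names are made up for this example and are
not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/* floor(log2(v)) for v != 0; a stand-in for the kernel's ilog2() */
static int demo_ilog2(uint64_t v)
{
	int log = -1;

	while (v) {
		v >>= 1;
		log++;
	}
	return log;
}

/* Derive the level purely from the size of the hole [addr, end). */
static int demo_level_from_size(uint64_t addr, uint64_t end)
{
	/* The formula only makes sense for holes of at least one page. */
	assert(end - addr >= (1ULL << DEMO_PAGE_SHIFT));

	return 4 - (demo_ilog2(end - addr) - DEMO_PAGE_SHIFT) /
		   (DEMO_PAGE_SHIFT - 3);
}
```

Under these assumptions a 2 MiB hole comes out as level 3 and a 1 GiB hole
as level 2, but the top 1 GiB of the address space, walked with ~0ULL as
the end, is reported as level 3 because the clamped size is one byte short.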
I guess I could try something like:
static int get_level(unsigned long addr, unsigned long end)
{
/* Add 1 to account for ~0ULL */
unsigned long size = (end - addr) + 1;
if (size < PMD_SIZE)
return 4;
else if (size < PUD_SIZE)
return 3;
else if (size < P4D_SIZE)
return 2;
else if (size < PGD_SIZE)
return 1;
return 0;
}
There are two immediate problems with that:
* The "+1" to deal with ~0ULL is fragile
* PGD_SIZE isn't what you might expect, it's not defined for most
architectures and arm64/x86 use it as the size of the PGD table.
Although that's easy enough to fix up.
Do you think a function like above would be preferable?
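To make the "+1" concern concrete, below is a user-space sketch of
get_level(). The sizes assume an x86-64 style five-level layout with 4 KiB
pages, and DEMO_PGD_SPAN is a hypothetical name for the range covered by
one PGD entry (which, as noted, is not what PGD_SIZE means on arm64/x86).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: five levels, 4 KiB pages, 9 bits per level. */
#define DEMO_PMD_SIZE	(1ULL << 21)
#define DEMO_PUD_SIZE	(1ULL << 30)
#define DEMO_P4D_SIZE	(1ULL << 39)
#define DEMO_PGD_SPAN	(1ULL << 48)	/* hypothetical: span of one PGD entry */

static int demo_get_level(uint64_t addr, uint64_t end)
{
	uint64_t size;

	assert(end >= addr);

	/* Add 1 to account for ~0ULL; this wraps to 0 for [0, ~0ULL]. */
	size = (end - addr) + 1;

	if (size < DEMO_PMD_SIZE)
		return 4;
	else if (size < DEMO_PUD_SIZE)
		return 3;
	else if (size < DEMO_P4D_SIZE)
		return 2;
	else if (size < DEMO_PGD_SPAN)
		return 1;
	return 0;
}
```

The clamped top-of-address-space case now gives level 2 as intended, but a
hole covering the whole range (addr of 0, end of ~0ULL) wraps size to 0 and
is reported as level 4.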
The other option would of course be to just drop the information from
the debugfs file about at which level the holes are. But it can be
useful information to see whether there are empty levels in the page
table structure. Although this is an area where x86 and arm64 differ
currently (x86 explicitly shows the gaps, arm64 doesn't), so if x86
doesn't mind losing that functionality that would certainly simplify things!
Thanks,
Steve
Hi,
On Wed, Feb 27, 2019 at 08:27:40PM +0100, Geert Uytterhoeven wrote:
> Hi Steven,
>
> On Wed, Feb 27, 2019 at 6:07 PM Steven Price <[email protected]> wrote:
> > walk_page_range() is going to be allowed to walk page tables other than
> > those of user space. For this it needs to know when it has reached a
> > 'leaf' entry in the page tables. This information is provided by the
> > p?d_large() functions/macros.
> >
> > For m68k, we don't support large pages, so add stubs returning 0
> >
> > CC: Geert Uytterhoeven <[email protected]>
> > CC: [email protected]
> > Signed-off-by: Steven Price <[email protected]>
>
> Thanks for your patch!
>
> > arch/m68k/include/asm/mcf_pgtable.h | 2 ++
> > arch/m68k/include/asm/motorola_pgtable.h | 2 ++
> > arch/m68k/include/asm/pgtable_no.h | 1 +
> > arch/m68k/include/asm/sun3_pgtable.h | 2 ++
> > 4 files changed, 7 insertions(+)
>
> If the definitions are the same, why not add them to
> arch/m68k/include/asm/pgtable.h instead?
Maybe I'm missing something, but why do the stubs have to be defined in
arch/*/include/asm/pgtable.h rather than in include/asm-generic/pgtable.h?
--
Sincerely yours,
Mike.
On 27/02/2019 18:38, David Miller wrote:
> From: Steven Price <[email protected]>
> Date: Wed, 27 Feb 2019 17:05:54 +0000
>
>> walk_page_range() is going to be allowed to walk page tables other than
>> those of user space. For this it needs to know when it has reached a
>> 'leaf' entry in the page tables. This information is provided by the
>> p?d_large() functions/macros.
>>
>> For sparc, we don't support large pages, so add stubs returning 0.
>>
>> CC: "David S. Miller" <[email protected]>
>> CC: [email protected]
>> Signed-off-by: Steven Price <[email protected]>
>
> Sparc does support large pages on 64-bit, just not at this level. It
> would be nice if the commit message was made more accurate.
Yes you are right, I fear I only looked at the 32 bit changes when I
wrote the commit message. I'll clarify the difference between 32/64 bit.
> Other than that:
>
> Acked-by: David S. Miller <[email protected]>
Thanks,
Steve
On 28/02/2019 11:53, Geert Uytterhoeven wrote:
> Hi Mike,
>
> On Thu, Feb 28, 2019 at 12:37 PM Mike Rapoport <[email protected]> wrote:
>> On Wed, Feb 27, 2019 at 08:27:40PM +0100, Geert Uytterhoeven wrote:
>>> On Wed, Feb 27, 2019 at 6:07 PM Steven Price <[email protected]> wrote:
>>>> walk_page_range() is going to be allowed to walk page tables other than
>>>> those of user space. For this it needs to know when it has reached a
>>>> 'leaf' entry in the page tables. This information is provided by the
>>>> p?d_large() functions/macros.
>>>>
>>>> For m68k, we don't support large pages, so add stubs returning 0
>>>>
>>>> CC: Geert Uytterhoeven <[email protected]>
>>>> CC: [email protected]
>>>> Signed-off-by: Steven Price <[email protected]>
>>>
>>> Thanks for your patch!
>>>
>>>> arch/m68k/include/asm/mcf_pgtable.h | 2 ++
>>>> arch/m68k/include/asm/motorola_pgtable.h | 2 ++
>>>> arch/m68k/include/asm/pgtable_no.h | 1 +
>>>> arch/m68k/include/asm/sun3_pgtable.h | 2 ++
>>>> 4 files changed, 7 insertions(+)
>>>
>>> If the definitions are the same, why not add them to
>>> arch/m68k/include/asm/pgtable.h instead?
I don't really understand the structure of m68k, so I just followed the
existing layout (arch/m68k/include/asm/pgtable.h is basically empty). I
believe the following patch would be functionally equivalent.
----8<----
diff --git a/arch/m68k/include/asm/pgtable.h b/arch/m68k/include/asm/pgtable.h
index ad15d655a9bf..6f6d463e69c1 100644
--- a/arch/m68k/include/asm/pgtable.h
+++ b/arch/m68k/include/asm/pgtable.h
@@ -3,4 +3,9 @@
#include <asm/pgtable_no.h>
#else
#include <asm/pgtable_mm.h>
+
+#define pmd_large(pmd) (0)
+
#endif
+
+#define pgd_large(pgd) (0)
----8<----
Let me know if you'd prefer that
>> Maybe I'm missing something, but why the stubs have to be defined in
>> arch/*/include/asm/pgtable.h rather than in include/asm-generic/pgtable.h?
>
> That would even make more sense, given most architectures don't
> support huge pages.
Where the architecture has folded a level, stubs are provided by the
asm-generic layer, see this later patch:
https://lore.kernel.org/lkml/[email protected]/
However just because an architecture port doesn't (currently) support
huge pages doesn't mean that the architecture itself can't have large[1]
mappings at higher levels of the page table. For instance an
architecture might use large pages for the linear map but not support
huge page mappings for user space.
My previous posting of this series attempted to define generic versions
of p?d_large(), but it was pointed out to me that this was fragile and
having a way of knowing whether the page table was a 'leaf' is actually
useful, so I've attempted to implement it for all architectures. See the
discussion here:
https://lore.kernel.org/lkml/[email protected]/T/#mf0bd0155f185a19681b48a288be212ed1596e85d
Steve
[1] Note I've tried to use the term "large page" where I mean that the page
table walk terminates early, and "huge page" for the Linux concept of
combining a large area of memory to reduce TLB pressure. Some
architectures have ways of mapping a large block in the TLB without
reducing the number of levels in the table walk - for example contiguous
hint bits in the page table entries.
On 28/02/2019 02:15, Paul Burton wrote:
> Hi Steven,
>
> On Wed, Feb 27, 2019 at 05:05:45PM +0000, Steven Price wrote:
>> For mips, we don't support large pages on 32 bit so add stubs returning 0.
>
> So far so good :)
>
>> For 64 bit look for _PAGE_HUGE flag being set. This means exposing the
>> flag when !CONFIG_MIPS_HUGE_TLB_SUPPORT.
>
> Here I have to ask why? We could just return 0 like the mips32 case when
> CONFIG_MIPS_HUGE_TLB_SUPPORT=n, let the compiler optimize the whole
> thing out and avoid redundant work at runtime.
>
> This could be unified too in asm/pgtable.h - checking for
> CONFIG_MIPS_HUGE_TLB_SUPPORT should be sufficient to cover the mips32
> case along with the subset of mips64 configurations without huge pages.
The intention here is to define a new set of macros/functions which will
always tell us whether we're at the leaf of a page table walk, whether
or not huge pages are compiled into the kernel. Basically this allows
the page walking code to be used on page tables other than user space,
for instance the kernel page tables (which e.g. might use a large
mapping for linear memory even if huge pages are not compiled in) or
page tables from firmware (e.g. EFI on arm64).
I'm not familiar enough with mips to know how it handles things like the
linear map so I don't know how relevant that is, but I'm trying to
introduce a new set of functions which differ from the existing
p?d_huge() macros by not depending on whether these mappings could exist
for a user space VMA (i.e. not depending on HUGETLB support and existing
for all levels that architecturally they can occur at).
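A toy user-space model of that distinction, purely as a sketch: the
structures, flag bits and demo_* names below are invented for illustration
and do not correspond to any real architecture. The point is only that the
walker asks a "large" (leaf) predicate whether to stop descending,
independent of any huge page configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PTRS_PER_PMD	3
#define DEMO_PTRS_PER_PTE	4
#define DEMO_PRESENT		0x1u
#define DEMO_LEAF		0x2u	/* hypothetical "block mapping" bit */

struct demo_pmd {
	uint32_t flags;
	const uint32_t *ptes;	/* next-level table, NULL for a hole or leaf */
};

struct demo_counts {
	int holes;
	int large;
	int ptes;
};

static int demo_pmd_none(const struct demo_pmd *pmd)
{
	return !(pmd->flags & DEMO_PRESENT);
}

/* True when the walk terminates here, i.e. the entry maps memory directly. */
static int demo_pmd_large(const struct demo_pmd *pmd)
{
	return (pmd->flags & DEMO_LEAF) != 0;
}

static struct demo_counts demo_walk(const struct demo_pmd *pmds, size_t n)
{
	struct demo_counts c = { 0, 0, 0 };
	size_t i, j;

	for (i = 0; i < n; i++) {
		if (demo_pmd_none(&pmds[i])) {
			c.holes++;
			continue;
		}
		if (demo_pmd_large(&pmds[i])) {
			c.large++;	/* leaf entry: nothing below to walk */
			continue;
		}
		assert(pmds[i].ptes != NULL);
		for (j = 0; j < DEMO_PTRS_PER_PTE; j++)
			if (pmds[i].ptes[j] & DEMO_PRESENT)
				c.ptes++;
	}
	return c;
}

/* One hole, one large mapping and one table with three present PTEs. */
static struct demo_counts demo_example(void)
{
	static const uint32_t ptes[DEMO_PTRS_PER_PTE] = {
		DEMO_PRESENT, 0, DEMO_PRESENT, DEMO_PRESENT,
	};
	const struct demo_pmd pmds[DEMO_PTRS_PER_PMD] = {
		{ 0, NULL },
		{ DEMO_PRESENT | DEMO_LEAF, NULL },
		{ DEMO_PRESENT, ptes },
	};

	return demo_walk(pmds, DEMO_PTRS_PER_PMD);
}
```

Note that nothing in demo_walk() depends on a VMA or on a config option;
the leaf test looks only at the entry itself.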
Steve
On 27/02/2019 17:40, Martin Schwidefsky wrote:
> On Wed, 27 Feb 2019 17:05:52 +0000
> Steven Price <[email protected]> wrote:
>
>> walk_page_range() is going to be allowed to walk page tables other than
>> those of user space. For this it needs to know when it has reached a
>> 'leaf' entry in the page tables. This information is provided by the
>> p?d_large() functions/macros.
>>
>> For s390, we don't support large pages, so add a stub returning 0.
>
> Well s390 does support 1MB and 2GB large pages, pmd_large() and pud_large()
> are non-empty. We do not support 4TB or 8PB large pages though, which
> makes the patch itself correct. Just the wording is slightly off.
Sorry, you're absolutely right - I'll update the commit message for the
next posting.
Thanks,
Steve
Hi Mike,
On Thu, Feb 28, 2019 at 12:37 PM Mike Rapoport <[email protected]> wrote:
> On Wed, Feb 27, 2019 at 08:27:40PM +0100, Geert Uytterhoeven wrote:
> > On Wed, Feb 27, 2019 at 6:07 PM Steven Price <[email protected]> wrote:
> > > walk_page_range() is going to be allowed to walk page tables other than
> > > those of user space. For this it needs to know when it has reached a
> > > 'leaf' entry in the page tables. This information is provided by the
> > > p?d_large() functions/macros.
> > >
> > > For m68k, we don't support large pages, so add stubs returning 0
> > >
> > > CC: Geert Uytterhoeven <[email protected]>
> > > CC: [email protected]
> > > Signed-off-by: Steven Price <[email protected]>
> >
> > Thanks for your patch!
> >
> > > arch/m68k/include/asm/mcf_pgtable.h | 2 ++
> > > arch/m68k/include/asm/motorola_pgtable.h | 2 ++
> > > arch/m68k/include/asm/pgtable_no.h | 1 +
> > > arch/m68k/include/asm/sun3_pgtable.h | 2 ++
> > > 4 files changed, 7 insertions(+)
> >
> > If the definitions are the same, why not add them to
> > arch/m68k/include/asm/pgtable.h instead?
>
> Maybe I'm missing something, but why the stubs have to be defined in
> arch/*/include/asm/pgtable.h rather than in include/asm-generic/pgtable.h?
That would even make more sense, given most architectures don't
support huge pages.
Gr{oetje,eeting}s,
Geert
Hi Steven,
On Thu, Feb 28, 2019 at 12:11:24PM +0000, Steven Price wrote:
> On 28/02/2019 02:15, Paul Burton wrote:
> > On Wed, Feb 27, 2019 at 05:05:45PM +0000, Steven Price wrote:
> >> For mips, we don't support large pages on 32 bit so add stubs returning 0.
> >
> > So far so good :)
> >
> >> For 64 bit look for _PAGE_HUGE flag being set. This means exposing the
> >> flag when !CONFIG_MIPS_HUGE_TLB_SUPPORT.
> >
> > Here I have to ask why? We could just return 0 like the mips32 case when
> > CONFIG_MIPS_HUGE_TLB_SUPPORT=n, let the compiler optimize the whole
> > thing out and avoid redundant work at runtime.
> >
> > This could be unified too in asm/pgtable.h - checking for
> > CONFIG_MIPS_HUGE_TLB_SUPPORT should be sufficient to cover the mips32
> > case along with the subset of mips64 configurations without huge pages.
>
> The intention here is to define a new set of macros/functions which will
> always tell us whether we're at the leaf of a page table walk, whether
> or not huge pages are compiled into the kernel. Basically this allows
> the page walking code to be used on page tables other than user space,
> for instance the kernel page tables (which e.g. might use a large
> mapping for linear memory even if huge pages are not compiled in) or
> page tables from firmware (e.g. EFI on arm64).
>
> I'm not familiar enough with mips to know how it handles things like the
> linear map so I don't know how relevant that is, but I'm trying to
> introduce a new set of functions which differ from the existing
> p?d_huge() macros by not depending on whether these mappings could exist
> for a user space VMA (i.e. not depending on HUGETLB support and existing
> for all levels that architecturally they can occur at).
Thanks for the explanation - the background helps.
Right now for MIPS, with one exception, there'll be no difference
between a page being huge or large. So for the vast majority of kernels
with CONFIG_MIPS_HUGE_TLB_SUPPORT=n we should just return 0.
The one exception I mentioned is old SGI IP27 support, which allows the
kernel to be mapped through the TLB & does that using 2x 16MB pages when
CONFIG_MAPPED_KERNEL=y. However even there your patch as-is won't pick
up on that for 2 reasons:
1) The pages in question don't appear to actually be recorded in the
page tables - they're just written straight into the TLB as wired
entries (ie. entries that will never be evicted).
2) Even if they were in the page tables the _PAGE_HUGE bit isn't set.
Since those pages aren't recorded in the page tables anyway we'd either
need to:
a) Add them to the page tables, and set the _PAGE_HUGE bit.
b) Ignore them if the code you're working on won't be operating on the
memory mapping the kernel.
For other platforms the kernel is run from unmapped memory, and for all
cases including IP27 the kernel will use unmapped memory to access
lowmem or peripherals when possible. That is, MIPS has virtual address
regions ((c)kseg[01] or xkphys) which are architecturally defined as
linear maps to physical memory & so VA->PA translation doesn't use the
TLB at all.
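For readers less familiar with MIPS, a minimal sketch of what "unmapped"
means for the 32-bit kseg0/kseg1 segments: the physical address is obtained
by masking off the top three address bits, with no page table or TLB lookup
involved. The demo_* names are made up for this example; xkphys on 64-bit
works along similar lines but is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_KSEG0_BASE	0x80000000u	/* unmapped, cached */
#define DEMO_KSEG1_BASE	0xa0000000u	/* unmapped, uncached */
#define DEMO_KSEG2_BASE	0xc0000000u	/* mapped through the TLB */

/* True for addresses in kseg0 or kseg1. */
static int demo_is_unmapped(uint32_t va)
{
	return va >= DEMO_KSEG0_BASE && va < DEMO_KSEG2_BASE;
}

/* Fixed linear translation: drop the top three bits of the address. */
static uint32_t demo_unmapped_to_phys(uint32_t va)
{
	assert(demo_is_unmapped(va));
	return va & 0x1fffffffu;
}
```

Because such accesses never consult the page tables, a page table walker
has nothing to report for them, whatever p?d_large() returns.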
So my thought would be that for almost everything we could just do:
#define pmd_large(pmd) pmd_huge(pmd)
#define pud_large(pud) pud_huge(pud)
And whether we need to do anything about IP27 depends on whether a) or
b) is chosen above.
Or alternatively you could do something like:
#ifdef _PAGE_HUGE
static inline int pmd_large(pmd_t pmd)
{
return (pmd_val(pmd) & _PAGE_HUGE) != 0;
}
static inline int pud_large(pud_t pud)
{
return (pud_val(pud) & _PAGE_HUGE) != 0;
}
#else
# define pmd_large(pmd) 0
# define pud_large(pud) 0
#endif
That would cover everything except for the IP27, but would make it pick
up the IP27 kernel pages automatically if someone later defines
_PAGE_HUGE for IP27 CONFIG_MAPPED_KERNEL=y & makes use of it for those
pages.
Thanks,
Paul
On 2/28/19 3:28 AM, Steven Price wrote:
> static int get_level(unsigned long addr, unsigned long end)
> {
> /* Add 1 to account for ~0ULL */
> unsigned long size = (end - addr) + 1;
> if (size < PMD_SIZE)
> return 4;
> else if (size < PUD_SIZE)
> return 3;
> else if (size < P4D_SIZE)
> return 2;
> else if (size < PGD_SIZE)
> return 1;
> return 0;
> }
>
> There are two immediate problems with that:
>
> * The "+1" to deal with ~0ULL is fragile
>
> * PGD_SIZE isn't what you might expect, it's not defined for most
> architectures and arm64/x86 use it as the size of the PGD table.
> Although that's easy enough to fix up.
>
> Do you think a function like above would be preferable?
The question still stands of why we *need* the depth/level in the first
place. As I said, we obviously need it for printing out the "name" of
the level. Is that it?
> The other option would of course be to just drop the information from
> the debugfs file about at which level the holes are. But it can be
> useful information to see whether there are empty levels in the page
> table structure. Although this is an area where x86 and arm64 differ
> currently (x86 explicitly shows the gaps, arm64 doesn't), so if x86
> doesn't mind losing that functionality that would certainly simplify things!
I think I'd actually be OK with the holes just not showing up. I
actually find it kinda hard to read sometimes with the holes in there.
I'd be curious what others think though.
On 28/02/2019 18:55, Paul Burton wrote:
> Hi Steven,
>
> On Thu, Feb 28, 2019 at 12:11:24PM +0000, Steven Price wrote:
>> On 28/02/2019 02:15, Paul Burton wrote:
>>> On Wed, Feb 27, 2019 at 05:05:45PM +0000, Steven Price wrote:
>>>> For mips, we don't support large pages on 32 bit so add stubs returning 0.
>>>
>>> So far so good :)
>>>
>>>> For 64 bit look for _PAGE_HUGE flag being set. This means exposing the
>>>> flag when !CONFIG_MIPS_HUGE_TLB_SUPPORT.
>>>
>>> Here I have to ask why? We could just return 0 like the mips32 case when
>>> CONFIG_MIPS_HUGE_TLB_SUPPORT=n, let the compiler optimize the whole
>>> thing out and avoid redundant work at runtime.
>>>
>>> This could be unified too in asm/pgtable.h - checking for
>>> CONFIG_MIPS_HUGE_TLB_SUPPORT should be sufficient to cover the mips32
>>> case along with the subset of mips64 configurations without huge pages.
>>
>> The intention here is to define a new set of macros/functions which will
>> always tell us whether we're at the leaf of a page table walk, whether
>> or not huge pages are compiled into the kernel. Basically this allows
>> the page walking code to be used on page tables other than user space,
>> for instance the kernel page tables (which e.g. might use a large
>> mapping for linear memory even if huge pages are not compiled in) or
>> page tables from firmware (e.g. EFI on arm64).
>>
>> I'm not familiar enough with mips to know how it handles things like the
>> linear map so I don't know how relevant that is, but I'm trying to
>> introduce a new set of functions which differ from the existing
>> p?d_huge() macros by not depending on whether these mappings could exist
>> for a user space VMA (i.e. not depending on HUGETLB support and existing
>> for all levels that architecturally they can occur at).
>
> Thanks for the explanation - the background helps.
>
> Right now for MIPS, with one exception, there'll be no difference
> between a page being huge or large. So for the vast majority of kernels
> with CONFIG_MIPS_HUGE_TLB_SUPPORT=n we should just return 0.
>
> The one exception I mentioned is old SGI IP27 support, which allows the
> kernel to be mapped through the TLB & does that using 2x 16MB pages when
> CONFIG_MAPPED_KERNEL=y. However even there your patch as-is won't pick
> up on that for 2 reasons:
>
> 1) The pages in question don't appear to actually be recorded in the
> page tables - they're just written straight into the TLB as wired
> entries (ie. entries that will never be evicted).
>
> 2) Even if they were in the page tables the _PAGE_HUGE bit isn't set.
>
> Since those pages aren't recorded in the page tables anyway we'd either
> need to:
>
> a) Add them to the page tables, and set the _PAGE_HUGE bit.
>
> b) Ignore them if the code you're working on won't be operating on the
> memory mapping the kernel.
>
> For other platforms the kernel is run from unmapped memory, and for all
> cases including IP27 the kernel will use unmapped memory to access
> lowmem or peripherals when possible. That is, MIPS has virtual address
> regions ((c)kseg[01] or xkphys) which are architecturally defined as
> linear maps to physical memory & so VA->PA translation doesn't use the
> TLB at all.
>
> So my thought would be that for almost everything we could just do:
>
> #define pmd_large(pmd) pmd_huge(pmd)
> #define pud_large(pud) pud_huge(pud)
>
> And whether we need to do anything about IP27 depends on whether a) or
> b) is chosen above.
>
> Or alternatively you could do something like:
>
> #ifdef _PAGE_HUGE
>
> static inline int pmd_large(pmd_t pmd)
> {
> return (pmd_val(pmd) & _PAGE_HUGE) != 0;
> }
>
> static inline int pud_large(pud_t pud)
> {
> return (pud_val(pud) & _PAGE_HUGE) != 0;
> }
>
> #else
> # define pmd_large(pmd) 0
> # define pud_large(pud) 0
> #endif
>
> That would cover everything except for the IP27, but would make it pick
> up the IP27 kernel pages automatically if someone later defines
> _PAGE_HUGE for IP27 CONFIG_MAPPED_KERNEL=y & makes use of it for those
> pages.
Thanks for the detailed explanation. I think my preference is your
change above (#ifdef _PAGE_HUGE) because I'm trying to stop people
thinking p?d_large==p?d_huge. MIPS is a little different from other
architectures in that the hardware doesn't walk the page tables, so
there isn't a definitive answer as to whether there is a 'huge' bit in
the tables or not - it actually does depend on the kernel configuration.
For the IP27 case I think the current situation is probably fine - the
intention is to walk the page tables, so even though the TLBs (and
therefore the actual translations) might not match, at least p?d_large
will accurately tell when the leaf of the page table tree has been
reached. And as you say, using _PAGE_HUGE as the #ifdef means that
should support be added the code should automatically make use of it.
Thanks for your help,
Steve
On 28/02/2019 19:00, Dave Hansen wrote:
> On 2/28/19 3:28 AM, Steven Price wrote:
>> static int get_level(unsigned long addr, unsigned long end)
>> {
>> /* Add 1 to account for ~0ULL */
>> unsigned long size = (end - addr) + 1;
>> if (size < PMD_SIZE)
>> return 4;
>> else if (size < PUD_SIZE)
>> return 3;
>> else if (size < P4D_SIZE)
>> return 2;
>> else if (size < PGD_SIZE)
>> return 1;
>> return 0;
>> }
>>
>> There are two immediate problems with that:
>>
>> * The "+1" to deal with ~0ULL is fragile
>>
>> * PGD_SIZE isn't what you might expect, it's not defined for most
>> architectures and arm64/x86 use it as the size of the PGD table.
>> Although that's easy enough to fix up.
>>
>> Do you think a function like above would be preferable?
>
> The question still stands of why we *need* the depth/level in the first
> place. As I said, we obviously need it for printing out the "name" of
> the level. Is that it?
That is the only use I'm currently aware of.
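To make the first problem concrete, here is a userspace sketch (not
kernel code). The shift values are assumed 5-level x86-64-like numbers,
and LEVEL_SIZE(), get_level_plus_one() and get_level_inclusive() are
hypothetical names used only for illustration. It shows the "+1"
wrapping to 0 for the full range [0, ~0], and one possible way of
comparing the inclusive span instead:

```c
#include <stdint.h>

/* Assumed 5-level x86-64-like shifts, for illustration only. */
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define P4D_SHIFT	39
#define PGDIR_SHIFT	48
#define LEVEL_SIZE(shift)	(UINT64_C(1) << (shift))

/* The variant from the mail: 'end' is inclusive, so add 1 to get a size. */
static inline int get_level_plus_one(uint64_t addr, uint64_t end)
{
	/* Wraps to 0 when the range is [0, ~0], i.e. the whole space. */
	uint64_t size = (end - addr) + 1;

	if (size < LEVEL_SIZE(PMD_SHIFT))
		return 4;
	else if (size < LEVEL_SIZE(PUD_SHIFT))
		return 3;
	else if (size < LEVEL_SIZE(P4D_SHIFT))
		return 2;
	else if (size < LEVEL_SIZE(PGDIR_SHIFT))
		return 1;
	return 0;
}

/* Possible alternative: compare (size - 1) so nothing can wrap. */
static inline int get_level_inclusive(uint64_t addr, uint64_t end)
{
	/* span == size - 1; valid for any end >= addr. */
	uint64_t span = end - addr;

	if (span < LEVEL_SIZE(PMD_SHIFT) - 1)
		return 4;
	else if (span < LEVEL_SIZE(PUD_SHIFT) - 1)
		return 3;
	else if (span < LEVEL_SIZE(P4D_SHIFT) - 1)
		return 2;
	else if (span < LEVEL_SIZE(PGDIR_SHIFT) - 1)
		return 1;
	return 0;
}
```

With these definitions the range [0, ~0] is classified as level 4 by
the first variant and level 0 by the second. The two agree on any range
that does not cover the whole address space.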
>> The other option would of course be to just drop the information from
>> the debugfs file about at which level the holes are. But it can be
>> useful information to see whether there are empty levels in the page
>> table structure. Although this is an area where x86 and arm64 differ
>> currently (x86 explicitly shows the gaps, arm64 doesn't), so if x86
>> doesn't mind losing that functionality that would certainly simplify things!
>
> I think I'd actually be OK with the holes just not showing up. I
> actually find it kinda hard to read sometimes with the holes in there.
> I'd be curious what others think though.
If no-one has any objections to dropping the holes in the output, then I
can rebase on something like below and drop this 'depth' patch.
Steve
----8<----
From a9eabadfc212389068ec5cc60265c7a55585bb76 Mon Sep 17 00:00:00 2001
From: Steven Price <[email protected]>
Date: Fri, 1 Mar 2019 10:06:33 +0000
Subject: [PATCH] x86: mm: Hide page table holes in debugfs
For the /sys/kernel/debug/page_tables/ files, rather than outputting a
mostly empty line when a block of memory isn't present just skip the
line. This keeps the output shorter and will help with a future change
switching to using the generic page walk code as we no longer care about
the 'level' that the page table holes are at.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e3cdc85ce5b6..a0f4139631dd 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -304,8 +304,8 @@ static void note_page(struct seq_file *m, struct pg_state *st,
/*
* Now print the actual finished series
*/
- if (!st->marker->max_lines ||
- st->lines < st->marker->max_lines) {
+ if ((cur & _PAGE_PRESENT) && (!st->marker->max_lines ||
+ st->lines < st->marker->max_lines)) {
pt_dump_seq_printf(m, st->to_dmesg,
"0x%0*lx-0x%0*lx ",
width, st->start_address,
@@ -321,7 +321,9 @@ static void note_page(struct seq_file *m, struct pg_state *st,
printk_prot(m, st->current_prot, st->level,
st->to_dmesg);
}
- st->lines++;
+ if (cur & _PAGE_PRESENT) {
+ st->lines++;
+ }
/*
* We print markers for special areas of address space,
--
2.20.1
On Thu, Feb 28, 2019 at 12:04:08PM +0000, Steven Price wrote:
> On 28/02/2019 11:53, Geert Uytterhoeven wrote:
> > Hi Mike,
> >
> > On Thu, Feb 28, 2019 at 12:37 PM Mike Rapoport <[email protected]> wrote:
> >> On Wed, Feb 27, 2019 at 08:27:40PM +0100, Geert Uytterhoeven wrote:
> >>> On Wed, Feb 27, 2019 at 6:07 PM Steven Price <[email protected]> wrote:
> >>>> walk_page_range() is going to be allowed to walk page tables other than
> >>>> those of user space. For this it needs to know when it has reached a
> >>>> 'leaf' entry in the page tables. This information is provided by the
> >>>> p?d_large() functions/macros.
> >>>>
> >>>> For m68k, we don't support large pages, so add stubs returning 0
> >>>>
> >>>> CC: Geert Uytterhoeven <[email protected]>
> >>>> CC: [email protected]
> >>>> Signed-off-by: Steven Price <[email protected]>
> >>>
> >>> Thanks for your patch!
> >>>
> >>>> arch/m68k/include/asm/mcf_pgtable.h | 2 ++
> >>>> arch/m68k/include/asm/motorola_pgtable.h | 2 ++
> >>>> arch/m68k/include/asm/pgtable_no.h | 1 +
> >>>> arch/m68k/include/asm/sun3_pgtable.h | 2 ++
> >>>> 4 files changed, 7 insertions(+)
> >>>
> >> Maybe I'm missing something, but why the stubs have to be defined in
> >> arch/*/include/asm/pgtable.h rather than in include/asm-generic/pgtable.h?
> >
> > That would even make more sense, given most architectures don't
> > support huge pages.
>
> Where the architecture has folded a level stubs are provided by the
> asm-generic layer, see this later patch:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> However just because an architecture port doesn't (currently) support
> huge pages doesn't mean that the architecture itself can't have large[1]
> mappings at higher levels of the page table. For instance an
> architecture might use large pages for the linear map but not support
> huge page mappings for user space.
Well, I doubt m68k can support large mappings at higher levels at all.
This, IMHO, applies to many other architectures and spreading p?d_large all
over those architectures seems wrong to me...
> My previous posting of this series attempted to define generic versions
> of p?d_large(), but it was pointed out to me that this was fragile and
> having a way of knowing whether the page table was a 'leaf' is actually
> useful, so I've attempted to implement for all architectures. See the
> discussion here:
> https://lore.kernel.org/lkml/[email protected]/T/#mf0bd0155f185a19681b48a288be212ed1596e85d
I'll reply on that thread, somehow I missed it then.
> Steve
>
--
Sincerely yours,
Mike.
On Wed, Feb 27, 2019 at 05:05:37PM +0000, Steven Price wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information will be provided by the
> p?d_large() functions/macros.
>
> For arm, we already provide most p?d_large() macros. Add a stub for PUD
> as we don't have huge pages at that level.
We do not have PUD for 2- and 3-level paging. Macros from generic header
should cover it, shouldn't it?
--
Kirill A. Shutemov
On Wed, Feb 27, 2019 at 05:05:39PM +0000, Steven Price wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For c6x there's no MMU so there's never a large page, so just add stubs.
Another option would be to provide the stubs via generic headers for !MMU.
--
Kirill A. Shutemov
On Wed, Feb 27, 2019 at 05:05:40PM +0000, Steven Price wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For csky, we don't support large pages, so add a stub returning 0.
>
> CC: Guo Ren <[email protected]>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/csky/include/asm/pgtable.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
> index edfcbb25fd9f..4ffdb6bfbede 100644
> --- a/arch/csky/include/asm/pgtable.h
> +++ b/arch/csky/include/asm/pgtable.h
> @@ -158,6 +158,8 @@ static inline int pmd_present(pmd_t pmd)
> return (pmd_val(pmd) != __pa(invalid_pte_table));
> }
>
> +#define pmd_large(pmd) (0)
Nit: here and in other places, parentheses around 0 are not needed.
--
Kirill A. Shutemov
On Wed, Feb 27, 2019 at 05:05:42PM +0000, Steven Price wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For ia64 leaf entries are always at the lowest level, so implement
> stubs returning 0.
Are you sure about this? I see pte_mkhuge defined for ia64 and Kconfig
contains hugetlb references.
--
Kirill A. Shutemov
On Wed, Feb 27, 2019 at 07:54:22PM +0100, Helge Deller wrote:
> On 27.02.19 18:05, Steven Price wrote:
> > walk_page_range() is going to be allowed to walk page tables other than
> > those of user space. For this it needs to know when it has reached a
> > 'leaf' entry in the page tables. This information is provided by the
> > p?d_large() functions/macros.
> >
> > For parisc, we don't support large pages, so add stubs returning 0.
>
> We do support huge pages on parisc, but not yet on those levels.
Just curious, what level does parisc support huge pages on?
AFAICS, it can have 2- or 3-level paging and the patch defines helpers
for two levels: pgd and pmd. Hm?
--
Kirill A. Shutemov
On 01/03/2019 21:47, Kirill A. Shutemov wrote:
> On Wed, Feb 27, 2019 at 05:05:37PM +0000, Steven Price wrote:
>> walk_page_range() is going to be allowed to walk page tables other than
>> those of user space. For this it needs to know when it has reached a
>> 'leaf' entry in the page tables. This information will be provided by the
>> p?d_large() functions/macros.
>>
>> For arm, we already provide most p?d_large() macros. Add a stub for PUD
>> as we don't have huge pages at that level.
>
> We do not have PUD for 2- and 3-level paging. Macros from generic header
> should cover it, shouldn't it?
>
I'm not sure of the reasoning behind this, but levels are folded in a
slightly strange way. arm/include/asm/pgtable.h defines
__ARCH_USE_5LEVEL_HACK which means:
PGD has 2048 (2-level) or 4 (3-level) entries which are always
considered 'present' (pgd_present() returns 1, as defined in
asm-generic/pgtable-nop4d-hack.h).
P4D has 1 entry which is always present (see asm-generic/5level-fixup.h)
PUD has 1 entry (see asm-generic/pgtable-nop4d-hack.h). This is always
present for 2-level, and present only if the first level of real page
table is present with a 3-level.
PMD/PTE are as you might expect.
So in terms of tables which are more than one entry you have PGD,
(optionally) PMD, PTE. But the levels which actually read the table
entries are PUD, PMD, PTE.
This means that the corresponding p?d_large() macros are needed for
PUD/PMD as that is where the actual entries are read. The asm-generic
files provide the definitions for PGD/P4D.
Steve
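A toy userspace model of the 3-level case may help (a sketch only, not
kernel code: struct entry, the ENT_* flags and leaf_level() are
hypothetical, and each "table" has a single slot). PGD and P4D are
folded, so their p?d_large() stubs never report a leaf, and the first
entry that is really read is at the PUD level:

```c
#include <stddef.h>

/* Hypothetical entry encoding, for this sketch only. */
#define ENT_PRESENT	0x1u
#define ENT_LARGE	0x2u

enum level { LVL_PGD, LVL_P4D, LVL_PUD, LVL_PMD, LVL_PTE };

struct entry {
	unsigned int flags;
	struct entry *next;	/* next-level table (one slot in this toy) */
};

/* Folded levels: generic-style stubs, never a leaf. */
static inline int pgd_large(const struct entry *e) { (void)e; return 0; }
static inline int p4d_large(const struct entry *e) { (void)e; return 0; }

/* Levels where an entry is actually read. */
static inline int pud_large(const struct entry *e)
{
	return (e->flags & ENT_LARGE) != 0;
}

static inline int pmd_large(const struct entry *e)
{
	return (e->flags & ENT_LARGE) != 0;
}

/*
 * Return the level at which the walk reaches a leaf, or -1 for a hole.
 * A present, non-large entry is assumed to have a valid 'next' pointer.
 */
static inline int leaf_level(const struct entry *top)
{
	const struct entry *pud, *pmd, *pte;

	/* PGD and P4D alias the same slot as PUD and never terminate. */
	if (pgd_large(top))
		return LVL_PGD;
	if (p4d_large(top))
		return LVL_P4D;

	pud = top;
	if (!(pud->flags & ENT_PRESENT))
		return -1;
	if (pud_large(pud))
		return LVL_PUD;

	pmd = pud->next;
	if (!(pmd->flags & ENT_PRESENT))
		return -1;
	if (pmd_large(pmd))
		return LVL_PMD;

	pte = pmd->next;
	return (pte->flags & ENT_PRESENT) ? LVL_PTE : -1;
}
```

The point of the sketch is only that the walker needs a real answer
from p?d_large() at PUD and PMD, while constant stubs are enough for
the folded PGD and P4D levels.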
On 01/03/2019 21:48, Kirill A. Shutemov wrote:
> On Wed, Feb 27, 2019 at 05:05:39PM +0000, Steven Price wrote:
>> walk_page_range() is going to be allowed to walk page tables other than
>> those of user space. For this it needs to know when it has reached a
>> 'leaf' entry in the page tables. This information is provided by the
>> p?d_large() functions/macros.
>>
>> For c6x there's no MMU so there's never a large page, so just add stubs.
>
> Another option would be to provide the stubs via generic headers for !MMU.
>
I agree that could be done, but equally the definitions of
p?d_present/p?d_none/p?d_bad etc could be provided by a generic header
for !MMU but currently are not. It makes sense to keep the p?d_large
definitions next to the others.
I'd prefer to stick with a (relatively) small change here - it's already
quite a long series! But this is certainly something that could be
tidied up for !MMU archs.
Steve
On Mon, Mar 04, 2019 at 12:01:37PM +0000, Steven Price wrote:
> On 01/03/2019 21:48, Kirill A. Shutemov wrote:
> > On Wed, Feb 27, 2019 at 05:05:39PM +0000, Steven Price wrote:
> >> walk_page_range() is going to be allowed to walk page tables other than
> >> those of user space. For this it needs to know when it has reached a
> >> 'leaf' entry in the page tables. This information is provided by the
> >> p?d_large() functions/macros.
> >>
> >> For c6x there's no MMU so there's never a large page, so just add stubs.
> >
> > Another option would be to provide the stubs via generic headers for !MMU.
> >
>
> I agree that could be done, but equally the definitions of
> p?d_present/p?d_none/p?d_bad etc could be provided by a generic header
> for !MMU but currently are not. It makes sense to keep the p?d_large
> definitions next to the others.
>
> I'd prefer to stick with a (relatively) small change here - it's already
> quite a long series! But this is certainly something that could be
> tidied up for !MMU archs.
Agreed.
--
Kirill A. Shutemov
On 01/03/2019 21:57, Kirill A. Shutemov wrote:
> On Wed, Feb 27, 2019 at 05:05:42PM +0000, Steven Price wrote:
>> walk_page_range() is going to be allowed to walk page tables other than
>> those of user space. For this it needs to know when it has reached a
>> 'leaf' entry in the page tables. This information is provided by the
>> p?d_large() functions/macros.
>>
>> For ia64 leaf entries are always at the lowest level, so implement
>> stubs returning 0.
>
> Are you sure about this? I see pte_mkhuge defined for ia64 and Kconfig
> contains hugetlb references.
>
I'm not completely familiar with ia64, but my understanding is that it
doesn't have the situation where a page table walk ends early - there is
always the full depth of entries. The p?d_huge() functions always return 0.
However my understanding is that it does support huge TLB entries, so
when populating the TLB a region larger than a standard page can be mapped.
I'd definitely welcome review by someone more familiar with ia64 to
check my assumptions.
Thanks,
Steve
On Mon, Mar 04, 2019 at 11:56:13AM +0000, Steven Price wrote:
> On 01/03/2019 21:47, Kirill A. Shutemov wrote:
> > On Wed, Feb 27, 2019 at 05:05:37PM +0000, Steven Price wrote:
> >> walk_page_range() is going to be allowed to walk page tables other than
> >> those of user space. For this it needs to know when it has reached a
> >> 'leaf' entry in the page tables. This information will be provided by the
> >> p?d_large() functions/macros.
> >>
> >> For arm, we already provide most p?d_large() macros. Add a stub for PUD
> >> as we don't have huge pages at that level.
> >
> > We do not have PUD for 2- and 3-level paging. Macros from generic header
> > should cover it, shouldn't it?
> >
>
> I'm not sure of the reasoning behind this, but levels are folded in a
> slightly strange way. arm/include/asm/pgtable.h defines
> __ARCH_USE_5LEVEL_HACK which means:
>
> PGD has 2048 (2-level) or 4 (3-level) entries which are always
> considered 'present' (pgd_present() returns 1 defined in
> asm-generic/pgtable-nop4d-hack.h).
>
> P4D has 1 entry which is always present (see asm-generic/5level-fixup.h)
>
> PUD has 1 entry (see asm-generic/pgtable-nop4d-hack.h). This is always
> present for 2-level, and present only if the first level of real page
> table is present with a 3-level.
>
> PMD/PTE are as you might expect.
>
> So in terms of tables which are more than one entry you have PGD,
> (optionally) PMD, PTE. But the levels which actually read the table
> entries are PUD, PMD, PTE.
>
> This means that the corresponding p?d_large() macros are needed for
> PUD/PMD as that is where the actual entries are read. The asm-generic
> files provide the definitions for PGD/P4D.
Makes sense.
The only additional thing worth noting is that ARM in the 2-level paging
case folds PMD manually, without help from the generic headers.
I'm partly responsible for the mess with folding. Sorry that you need to
explain this to me :P
--
Kirill A. Shutemov
On Mon, Mar 04, 2019 at 01:16:47PM +0000, Steven Price wrote:
> On 01/03/2019 21:57, Kirill A. Shutemov wrote:
> > On Wed, Feb 27, 2019 at 05:05:42PM +0000, Steven Price wrote:
> >> walk_page_range() is going to be allowed to walk page tables other than
> >> those of user space. For this it needs to know when it has reached a
> >> 'leaf' entry in the page tables. This information is provided by the
> >> p?d_large() functions/macros.
> >>
> >> For ia64 leaf entries are always at the lowest level, so implement
> >> stubs returning 0.
> >
> > Are you sure about this? I see pte_mkhuge defined for ia64 and Kconfig
> > contains hugetlb references.
> >
>
> I'm not completely familiar with ia64, but my understanding is that it
> doesn't have the situation where a page table walk ends early - there is
> always the full depth of entries. The p?d_huge() functions always return 0.
>
> However my understanding is that it does support huge TLB entries, so
> when populating the TLB a region larger than a standard page can be mapped.
>
> I'd definitely welcome review by someone more familiar with ia64 to
> check my assumptions.
ia64 has several ways to manage page tables. The one
used by Linux has multi-level table walks like other
architectures, but we don't allow mixing of different
page sizes within a "region" (there are eight regions
selected by the high 3 bits of the virtual address).
Is the series in some GIT tree that I can pull, rather
than tracking down all 34 pieces? I can try it out and
see if things work/break.
-Tony
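For readers unfamiliar with regions, the selection described above can
be sketched as follows (illustrative userspace C; REGION_SHIFT,
va_region() and same_region() are hypothetical names for this example,
not the kernel's own macros):

```c
#include <stdint.h>

/* The region is taken from the high 3 bits of a 64-bit address. */
#define REGION_SHIFT	61

/* Region number (0..7) that a virtual address falls in. */
static inline unsigned int va_region(uint64_t va)
{
	return (unsigned int)(va >> REGION_SHIFT);
}

/* True if two addresses lie in the same region. */
static inline int same_region(uint64_t a, uint64_t b)
{
	return va_region(a) == va_region(b);
}
```

Since page sizes are not mixed within a region, a walker that stays
inside one region only has to deal with a single page size.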
On 01.03.19 23:12, Kirill A. Shutemov wrote:
> On Wed, Feb 27, 2019 at 07:54:22PM +0100, Helge Deller wrote:
>> On 27.02.19 18:05, Steven Price wrote:
>>> walk_page_range() is going to be allowed to walk page tables other than
>>> those of user space. For this it needs to know when it has reached a
>>> 'leaf' entry in the page tables. This information is provided by the
>>> p?d_large() functions/macros.
>>>
>>> For parisc, we don't support large pages, so add stubs returning 0.
>>
>> We do support huge pages on parisc, but not yet on those levels.
>
> Just curious, what level does parisc support huge pages on?
> AFAICS, it can have 2- or 3- level paging and the patch defines helpers
> for two level: pgd and pmd. Hm?
You are correct.
My comment was misleading and meant to say that we do support generic
huge pages for applications.
Helge
On 04/03/2019 19:06, Luck, Tony wrote:
> On Mon, Mar 04, 2019 at 01:16:47PM +0000, Steven Price wrote:
>> On 01/03/2019 21:57, Kirill A. Shutemov wrote:
>>> On Wed, Feb 27, 2019 at 05:05:42PM +0000, Steven Price wrote:
>>>> walk_page_range() is going to be allowed to walk page tables other than
>>>> those of user space. For this it needs to know when it has reached a
>>>> 'leaf' entry in the page tables. This information is provided by the
>>>> p?d_large() functions/macros.
>>>>
>>>> For ia64 leaf entries are always at the lowest level, so implement
>>>> stubs returning 0.
>>>
>>> Are you sure about this? I see pte_mkhuge defined for ia64 and Kconfig
>>> contains hugetlb references.
>>>
>>
>> I'm not completely familiar with ia64, but my understanding is that it
>> doesn't have the situation where a page table walk ends early - there is
>> always the full depth of entries. The p?d_huge() functions always return 0.
>>
>> However my understanding is that it does support huge TLB entries, so
>> when populating the TLB a region larger than a standard page can be mapped.
>>
>> I'd definitely welcome review by someone more familiar with ia64 to
>> check my assumptions.
>
> ia64 has several ways to manage page tables. The one
> used by Linux has multi-level table walks like other
> architectures, but we don't allow mixing of different
> page sizes within a "region" (there are eight regions
> selected by the high 3 bits of the virtual address).
I'd gathered that ia64 has this "region" concept. From what I can tell,
the existing p?d_present() etc. macros assume a particular
configuration of a region, and so the p?d_large macros would follow that
scheme. This of course does limit any generic page walking code to
dealing only with this one type of region, but that doesn't seem
unreasonable.
> Is the series in some GIT tree that I can pull, rather
> than tracking down all 34 pieces? I can try it out and
> see if things work/break.
At the moment I don't have a public tree - I'm trying to get that set
up. In the meantime you can download the entire series as a mbox from
patchwork:
https://patchwork.kernel.org/series/85673/mbox/
(it's currently based on v5.0-rc6)
However you won't see anything particularly interesting on ia64 (yet)
because my focus has been converting the PTDUMP implementations that
several architectures have (arm, arm64, powerpc, s390, x86), and ia64 is
not one of them. For now I've also only done the PTDUMP work for
arm64/x86 as a way of testing out the idea. Ideally the PTDUMP code can
be made generic enough that implementing it for other architectures
(including ia64) will be trivial.
Thanks,
Steve