This is a slight reworking and extension of my previous patch set
(Convert x86 & arm64 to use generic page walk), but I've continued the
version numbering as most of the changes are the same. In particular,
this series ends with a generic PTDUMP implementation for arm64 and x86.
Many architectures currently have a debugfs file for dumping the kernel
page tables. Each architecture has to implement custom functions for
this because the details of walking the page tables used by the kernel
are different between architectures.
This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space). A generic PTDUMP
implementation is then implemented making use of the new functionality of
walk_page_range(), and finally arm64 and x86 are switched to using it,
removing the custom table walkers.
To enable a generic page table walker to walk the unusual mappings of
the kernel we need to implement a set of functions which let us know
when the walker has reached a leaf entry. After a suggestion from Will
Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
the purpose (and is a new name so has no historic baggage). Some
architectures have p?d_large() macros, but that name is easily confused
with "large pages".
Mostly this is a cleanup and there should be very little functional
change. The exceptions are:
* x86 PTDUMP debugfs output no longer displays pages which aren't
present (patch 14).
* arm64 has the ability to efficiently process KASAN pages (which
previously only x86 implemented). This means that the combination of
KASAN and DEBUG_WX is now usable.
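To illustrate the p?d_leaf() helpers described above, here is a purely
illustrative sketch (not part of the series) of how a walker is expected
to use them; note_mapping() and walk_ptes() are hypothetical helpers
standing in for whatever the caller wants to do:

/*
 * Sketch only: pmd_leaf() tells a final block mapping apart from a
 * pointer to a lower-level page table, so the walker knows when to
 * stop descending. note_mapping()/walk_ptes() are hypothetical.
 */
static int walk_one_pmd(pmd_t *pmd, unsigned long addr, unsigned long next)
{
	pmd_t val = READ_ONCE(*pmd);

	if (pmd_none(val))
		return 0;			/* hole */
	if (pmd_leaf(val))
		return note_mapping(addr, next, pmd_val(val));
	return walk_ptes(pmd, addr, next);
}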
Also available as a git tree:
git://linux-arm.org/linux-sp.git walk_page_range/v9
Changes since v8:
https://lore.kernel.org/lkml/[email protected]/
* Rename from p?d_large() to p?d_leaf()
* Dropped patches migrating arm64/x86 custom walkers to
walk_page_range() in favour of adding a generic PTDUMP implementation
and migrating arm64/x86 to that instead.
* Rebased to v5.3-rc1
Steven Price (21):
arc: mm: Add p?d_leaf() definitions
arm: mm: Add p?d_leaf() definitions
arm64: mm: Add p?d_leaf() definitions
mips: mm: Add p?d_leaf() definitions
powerpc: mm: Add p?d_leaf() definitions
riscv: mm: Add p?d_leaf() definitions
s390: mm: Add p?d_leaf() definitions
sparc: mm: Add p?d_leaf() definitions
x86: mm: Add p?d_leaf() definitions
mm: Add generic p?d_leaf() macros
mm: pagewalk: Add p4d_entry() and pgd_entry()
mm: pagewalk: Allow walking without vma
mm: pagewalk: Add test_p?d callbacks
x86: mm: Don't display pages which aren't present in debugfs
x86: mm: Point to struct seq_file from struct pg_state
x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
mm: Add generic ptdump
x86: mm: Convert dump_pagetables to use walk_page_range
arm64: mm: Convert mm/dump.c to use walk_page_range()
arch/arc/include/asm/pgtable.h | 1 +
arch/arm/include/asm/pgtable-2level.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/Kconfig | 1 +
arch/arm64/Kconfig.debug | 19 +-
arch/arm64/include/asm/pgtable.h | 2 +
arch/arm64/include/asm/ptdump.h | 8 +-
arch/arm64/mm/Makefile | 4 +-
arch/arm64/mm/dump.c | 117 +++----
arch/arm64/mm/ptdump_debugfs.c | 2 +-
arch/mips/include/asm/pgtable-64.h | 8 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 30 +-
arch/riscv/include/asm/pgtable-64.h | 7 +
arch/riscv/include/asm/pgtable.h | 7 +
arch/s390/include/asm/pgtable.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 2 +
arch/x86/Kconfig | 1 +
arch/x86/Kconfig.debug | 20 +-
arch/x86/include/asm/pgtable.h | 10 +-
arch/x86/mm/Makefile | 4 +-
arch/x86/mm/debug_pagetables.c | 8 +-
arch/x86/mm/dump_pagetables.c | 339 +++++--------------
arch/x86/platform/efi/efi_32.c | 2 +-
arch/x86/platform/efi/efi_64.c | 4 +-
drivers/firmware/efi/arm-runtime.c | 2 +-
include/asm-generic/pgtable.h | 19 ++
include/linux/mm.h | 26 +-
include/linux/ptdump.h | 19 ++
mm/Kconfig.debug | 21 ++
mm/Makefile | 1 +
mm/pagewalk.c | 76 +++--
mm/ptdump.c | 161 +++++++++
32 files changed, 507 insertions(+), 418 deletions(-)
create mode 100644 include/linux/ptdump.h
create mode 100644 mm/ptdump.c
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_leaf() functions/macros.
For s390, pud_large() and pmd_large() are already implemented as static
inline functions. Add macros to provide the p?d_leaf() names for the
generic code to use.
CC: Martin Schwidefsky <[email protected]>
CC: Heiko Carstens <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/s390/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 9b274fcaacb6..f99a5f546e5e 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -674,6 +674,7 @@ static inline int pud_none(pud_t pud)
return pud_val(pud) == _REGION3_ENTRY_EMPTY;
}
+#define pud_leaf pud_large
static inline int pud_large(pud_t pud)
{
if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R3)
@@ -691,6 +692,7 @@ static inline unsigned long pud_pfn(pud_t pud)
return (pud_val(pud) & origin_mask) >> PAGE_SHIFT;
}
+#define pmd_leaf pmd_large
static inline int pmd_large(pmd_t pmd)
{
return (pmd_val(pmd) & _SEGMENT_ENTRY_LARGE) != 0;
--
2.20.1
Exposing the pud/pgd levels of the page tables to walk_page_range() means
we may come across the exotic large mappings that come with large areas
of contiguous memory (such as the kernel's linear map).
For architectures that don't provide all p?d_leaf() macros, provide
generic do-nothing defaults that are suitable where there cannot be leaf
pages at that level.
Signed-off-by: Steven Price <[email protected]>
---
include/asm-generic/pgtable.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..46275896ca66 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1188,4 +1188,23 @@ static inline bool arch_has_pfn_modify_check(void)
#define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED)
#endif
+/*
+ * p?d_leaf() - true if this entry is a final mapping to a physical address.
+ * This differs from p?d_huge() by the fact that they are always available (if
+ * the architecture supports large pages at the appropriate level) even
+ * if CONFIG_HUGETLB_PAGE is not defined.
+ */
+#ifndef pgd_leaf
+#define pgd_leaf(x) 0
+#endif
+#ifndef p4d_leaf
+#define p4d_leaf(x) 0
+#endif
+#ifndef pud_leaf
+#define pud_leaf(x) 0
+#endif
+#ifndef pmd_leaf
+#define pmd_leaf(x) 0
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
--
2.20.1
For the /sys/kernel/debug/page_tables/ files, rather than outputting a
mostly empty line when a block of memory isn't present, just skip the
line. This keeps the output shorter and will help with a future change
switching to the generic page walk code, as we no longer care about the
'level' at which the page table holes sit.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ab67822fd2f4..95728027dd3b 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -301,8 +301,8 @@ static void note_page(struct seq_file *m, struct pg_state *st,
/*
* Now print the actual finished series
*/
- if (!st->marker->max_lines ||
- st->lines < st->marker->max_lines) {
+ if ((cur & _PAGE_PRESENT) && (!st->marker->max_lines ||
+ st->lines < st->marker->max_lines)) {
pt_dump_seq_printf(m, st->to_dmesg,
"0x%0*lx-0x%0*lx ",
width, st->start_address,
@@ -318,7 +318,8 @@ static void note_page(struct seq_file *m, struct pg_state *st,
printk_prot(m, st->current_prot, st->level,
st->to_dmesg);
}
- st->lines++;
+ if (cur & _PAGE_PRESENT)
+ st->lines++;
/*
* We print markers for special areas of address space,
--
2.20.1
To enable x86 to use the generic walk_page_range() function, the
callers of ptdump_walk_pgd_level_debugfs() need to pass in the mm_struct.
This means that ptdump_walk_pgd_level_core() is now always passed a
valid pgd, so drop the support for pgd==NULL.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 3 ++-
arch/x86/mm/debug_pagetables.c | 8 ++++----
arch/x86/mm/dump_pagetables.c | 14 ++++++--------
3 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1a2b469f6e75..1b255987712e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -30,7 +30,8 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD];
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+ bool user);
void ptdump_walk_pgd_level_checkwx(void);
void ptdump_walk_user_pgd_level_checkwx(void);
diff --git a/arch/x86/mm/debug_pagetables.c b/arch/x86/mm/debug_pagetables.c
index 39001a401eff..d0efec713c6c 100644
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -7,7 +7,7 @@
static int ptdump_show(struct seq_file *m, void *v)
{
- ptdump_walk_pgd_level_debugfs(m, NULL, false);
+ ptdump_walk_pgd_level_debugfs(m, &init_mm, false);
return 0;
}
@@ -17,7 +17,7 @@ static int ptdump_curknl_show(struct seq_file *m, void *v)
{
if (current->mm->pgd) {
down_read(&current->mm->mmap_sem);
- ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, false);
+ ptdump_walk_pgd_level_debugfs(m, current->mm, false);
up_read(&current->mm->mmap_sem);
}
return 0;
@@ -30,7 +30,7 @@ static int ptdump_curusr_show(struct seq_file *m, void *v)
{
if (current->mm->pgd) {
down_read(&current->mm->mmap_sem);
- ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, true);
+ ptdump_walk_pgd_level_debugfs(m, current->mm, true);
up_read(&current->mm->mmap_sem);
}
return 0;
@@ -43,7 +43,7 @@ DEFINE_SHOW_ATTRIBUTE(ptdump_curusr);
static int ptdump_efi_show(struct seq_file *m, void *v)
{
if (efi_mm.pgd)
- ptdump_walk_pgd_level_debugfs(m, efi_mm.pgd, false);
+ ptdump_walk_pgd_level_debugfs(m, &efi_mm, false);
return 0;
}
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 6f0d1296dee1..bcaf27b637e0 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -519,16 +519,12 @@ static inline bool is_hypervisor_range(int idx)
static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx, bool dmesg)
{
- pgd_t *start = INIT_PGD;
+ pgd_t *start = pgd;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};
- if (pgd) {
- start = pgd;
- st.to_dmesg = dmesg;
- }
-
+ st.to_dmesg = dmesg;
st.check_wx = checkwx;
st.seq = m;
if (checkwx)
@@ -573,8 +569,10 @@ void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
}
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+ bool user)
{
+ pgd_t *pgd = mm->pgd;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (user && boot_cpu_has(X86_FEATURE_PTI))
pgd = kernel_to_user_pgdp(pgd);
@@ -600,7 +598,7 @@ void ptdump_walk_user_pgd_level_checkwx(void)
void ptdump_walk_pgd_level_checkwx(void)
{
- ptdump_walk_pgd_level_core(NULL, NULL, true, false);
+ ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
}
static int __init pt_dump_init(void)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information will be provided by the
p?d_leaf() functions/macros.
For arm64, we already have p?d_sect() macros which we can reuse for
p?d_leaf().
pud_sect() is defined as a dummy function when CONFIG_PGTABLE_LEVELS < 3
or CONFIG_ARM64_64K_PAGES is defined. However, when the kernel is
configured this way it is architecturally impossible to have a large
page at this level, and any code using these page walking macros is
implicitly relying on the page size/number of levels being the same as
the kernel's. So it is safe to reuse this for p?d_leaf() as it is an
architectural restriction.
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 87a4b2ddc1a1..2c123d59dbff 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -446,6 +446,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
PMD_TYPE_TABLE)
#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_SECT)
+#define pmd_leaf(pmd) pmd_sect(pmd)
#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
#define pud_sect(pud) (0)
@@ -528,6 +529,7 @@ static inline void pte_unmap(pte_t *pte) { }
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
#define pud_present(pud) pte_present(pud_pte(pud))
+#define pud_leaf(pud) pud_sect(pud)
#define pud_valid(pud) pte_valid(pud_pte(pud))
static inline void set_pud(pud_t *pudp, pud_t pud)
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_leaf() functions/macros.
For mips, we only support large pages on 64 bit.
For 64 bit, if _PAGE_HUGE is defined we can simply look for it. When it
is not defined we can be confident that there are no leaf pages in
existence and fall back on the generic implementation (added in a later
patch) which returns 0.
CC: Ralf Baechle <[email protected]>
CC: Paul Burton <[email protected]>
CC: James Hogan <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/mips/include/asm/pgtable-64.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 93a9dce31f25..2bdbf8652b5f 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -273,6 +273,10 @@ static inline int pmd_present(pmd_t pmd)
return pmd_val(pmd) != (unsigned long) invalid_pte_table;
}
+#ifdef _PAGE_HUGE
+#define pmd_leaf(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0)
+#endif
+
static inline void pmd_clear(pmd_t *pmdp)
{
pmd_val(*pmdp) = ((unsigned long) invalid_pte_table);
@@ -297,6 +301,10 @@ static inline int pud_present(pud_t pud)
return pud_val(pud) != (unsigned long) invalid_pmd_table;
}
+#ifdef _PAGE_HUGE
+#define pud_leaf(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
+#endif
+
static inline void pud_clear(pud_t *pudp)
{
pud_val(*pudp) = ((unsigned long) invalid_pmd_table);
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_leaf() functions/macros.
For sparc 64 bit, pmd_large() and pud_large() are already provided, so
add macros to provide the p?d_leaf names required by the generic code.
CC: "David S. Miller" <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1599de730532..a78b968ae3fa 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -683,6 +683,7 @@ static inline unsigned long pte_special(pte_t pte)
return pte_val(pte) & _PAGE_SPECIAL;
}
+#define pmd_leaf pmd_large
static inline unsigned long pmd_large(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
@@ -867,6 +868,7 @@ static inline unsigned long pud_page_vaddr(pud_t pud)
/* only used by the stubbed out hugetlb gup code, should never be called */
#define pgd_page(pgd) NULL
+#define pud_leaf pud_large
static inline unsigned long pud_large(pud_t pud)
{
pte_t pte = __pte(pud_val(pud));
--
2.20.1
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_leaf() functions/macros.
For x86 we already have p?d_large() functions, so add macros to provide
the p?d_leaf() names for the generic code.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0bc530c4eb13..6986a451619e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -239,6 +239,7 @@ static inline unsigned long pgd_pfn(pgd_t pgd)
return (pgd_val(pgd) & PTE_PFN_MASK) >> PAGE_SHIFT;
}
+#define p4d_leaf p4d_large
static inline int p4d_large(p4d_t p4d)
{
/* No 512 GiB pages yet */
@@ -247,6 +248,7 @@ static inline int p4d_large(p4d_t p4d)
#define pte_page(pte) pfn_to_page(pte_pfn(pte))
+#define pmd_leaf pmd_large
static inline int pmd_large(pmd_t pte)
{
return pmd_flags(pte) & _PAGE_PSE;
@@ -874,6 +876,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
+#define pud_leaf pud_large
static inline int pud_large(pud_t pud)
{
return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
@@ -885,6 +888,7 @@ static inline int pud_bad(pud_t pud)
return (pud_flags(pud) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
}
#else
+#define pud_leaf pud_large
static inline int pud_large(pud_t pud)
{
return 0;
@@ -1233,6 +1237,7 @@ static inline bool pgdp_maps_userspace(void *__ptr)
return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
}
+#define pgd_leaf pgd_large
static inline int pgd_large(pgd_t pgd) { return 0; }
#ifdef CONFIG_PAGE_TABLE_ISOLATION
--
2.20.1
pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
no users. We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.
Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
PUD-sized transparent hugepages") already re-added pud_entry(), but with
different semantics from the other callbacks. Since there have never
been any upstream users of this, revert the semantics to match the
other callbacks. This means pud_entry() is called for all entries, not
just transparent huge pages.
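A minimal sketch (not part of this patch) of what the restored semantics
mean for a caller, where example_pud_entry() is a hypothetical callback:

/*
 * Sketch only: pud_entry() is now invoked for every non-empty PUD, so
 * the callback must check for huge/leaf entries itself rather than
 * relying on the core to filter them.
 */
static int example_pud_entry(pud_t *pud, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	pud_t val = READ_ONCE(*pud);

	if (pud_leaf(val))
		pr_info("leaf PUD covering %#lx-%#lx\n", addr, next);

	return 0;	/* continue the walk; lower levels follow if set */
}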
Signed-off-by: Steven Price <[email protected]>
---
include/linux/mm.h | 15 +++++++++------
mm/pagewalk.c | 27 ++++++++++++++++-----------
2 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..b22799129128 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1432,15 +1432,14 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
/**
* mm_walk - callbacks for walk_page_range
- * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
- * this handler should only handle pud_trans_huge() puds.
- * the pmd_entry or pte_entry callbacks will be used for
- * regular PUDs.
- * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
+ * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
+ * @p4d_entry: if set, called for each non-empty P4D entry
+ * @pud_entry: if set, called for each non-empty PUD entry
+ * @pmd_entry: if set, called for each non-empty PMD entry
* this handler is required to be able to handle
* pmd_trans_huge() pmds. They may simply choose to
* split_huge_page() instead of handling it explicitly.
- * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
+ * @pte_entry: if set, called for each non-empty PTE (lowest-level) entry
* @pte_hole: if set, called for each hole at all levels
* @hugetlb_entry: if set, called for each hugetlb entry
* @test_walk: caller specific callback function to determine whether
@@ -1455,6 +1454,10 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
* (see the comment on walk_page_range() for more details)
*/
struct mm_walk {
+ int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk);
+ int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk);
int (*pud_entry)(pud_t *pud, unsigned long addr,
unsigned long next, struct mm_walk *walk);
int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c3084ff2569d..98373a9f88b8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -90,15 +90,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
}
if (walk->pud_entry) {
- spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
-
- if (ptl) {
- err = walk->pud_entry(pud, addr, next, walk);
- spin_unlock(ptl);
- if (err)
- break;
- continue;
- }
+ err = walk->pud_entry(pud, addr, next, walk);
+ if (err)
+ break;
}
split_huge_pud(walk->vma, pud, addr);
@@ -131,7 +125,12 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
break;
continue;
}
- if (walk->pmd_entry || walk->pte_entry)
+ if (walk->p4d_entry) {
+ err = walk->p4d_entry(p4d, addr, next, walk);
+ if (err)
+ break;
+ }
+ if (walk->pud_entry || walk->pmd_entry || walk->pte_entry)
err = walk_pud_range(p4d, addr, next, walk);
if (err)
break;
@@ -157,7 +156,13 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
break;
continue;
}
- if (walk->pmd_entry || walk->pte_entry)
+ if (walk->pgd_entry) {
+ err = walk->pgd_entry(pgd, addr, next, walk);
+ if (err)
+ break;
+ }
+ if (walk->p4d_entry || walk->pud_entry || walk->pmd_entry ||
+ walk->pte_entry)
err = walk_p4d_range(pgd, addr, next, walk);
if (err)
break;
--
2.20.1
It is useful to be able to skip parts of the page table tree even when
walking without VMAs. Add test_p?d callbacks, similar to test_walk, but
which are called just before a table at that level is walked. If the
callback returns a positive value the entire table is skipped (a
negative return value aborts the walk).
Signed-off-by: Steven Price <[email protected]>
---
include/linux/mm.h | 11 +++++++++++
mm/pagewalk.c | 24 ++++++++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b22799129128..325a1ca6f820 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1447,6 +1447,11 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
* value means "do page table walk over the current vma,"
* and a negative one means "abort current page table walk
* right now." 1 means "skip the current vma."
+ * @test_pmd: similar to test_walk(), but called for every pmd.
+ * @test_pud: similar to test_walk(), but called for every pud.
+ * @test_p4d: similar to test_walk(), but called for every p4d.
+ * Returning 0 means walk this part of the page tables,
+ * returning 1 means to skip this range.
* @mm: mm_struct representing the target process of page table walk
* @vma: vma currently walked (NULL if walking outside vmas)
* @private: private data for callbacks' usage
@@ -1471,6 +1476,12 @@ struct mm_walk {
struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
+ int (*test_pmd)(unsigned long addr, unsigned long next,
+ pmd_t *pmd_start, struct mm_walk *walk);
+ int (*test_pud)(unsigned long addr, unsigned long next,
+ pud_t *pud_start, struct mm_walk *walk);
+ int (*test_p4d)(unsigned long addr, unsigned long next,
+ p4d_t *p4d_start, struct mm_walk *walk);
struct mm_struct *mm;
struct vm_area_struct *vma;
void *private;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 1cbef99e9258..6bea79b95be3 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -32,6 +32,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
unsigned long next;
int err = 0;
+ if (walk->test_pmd) {
+ err = walk->test_pmd(addr, end, pmd_offset(pud, 0UL), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
pmd = pmd_offset(pud, addr);
do {
again:
@@ -82,6 +90,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
unsigned long next;
int err = 0;
+ if (walk->test_pud) {
+ err = walk->test_pud(addr, end, pud_offset(p4d, 0UL), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
pud = pud_offset(p4d, addr);
do {
again:
@@ -124,6 +140,14 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
unsigned long next;
int err = 0;
+ if (walk->test_p4d) {
+ err = walk->test_p4d(addr, end, p4d_offset(pgd, 0UL), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
--
2.20.1
Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range
for vma(VM_PFNMAP)", walk_page_range() will report any kernel area as
a hole, because it lacks a vma.
This means each arch has re-implemented page table walking when needed,
for example in the per-arch ptdump walker.
Remove the requirement to have a vma except when trying to split huge
pages.
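For illustration only (not part of this patch), assuming hypothetical
example_* callbacks and using the vmalloc area as an example range, a
caller can now walk kernel mappings directly:

/*
 * Sketch only: walk a kernel address range which has no VMA backing.
 * example_pmd_entry()/example_pte_entry() are hypothetical callbacks.
 */
static void example_walk_vmalloc(void)
{
	struct mm_walk walk = {
		.mm        = &init_mm,
		.pmd_entry = example_pmd_entry,
		.pte_entry = example_pte_entry,
	};

	down_read(&init_mm.mmap_sem);
	walk_page_range(VMALLOC_START, VMALLOC_END, &walk);
	up_read(&init_mm.mmap_sem);
}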
Signed-off-by: Steven Price <[email protected]>
---
mm/pagewalk.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 98373a9f88b8..1cbef99e9258 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -36,7 +36,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
do {
again:
next = pmd_addr_end(addr, end);
- if (pmd_none(*pmd) || !walk->vma) {
+ if (pmd_none(*pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
@@ -59,9 +59,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
if (!walk->pte_entry)
continue;
- split_huge_pmd(walk->vma, pmd, addr);
- if (pmd_trans_unstable(pmd))
- goto again;
+ if (walk->vma) {
+ split_huge_pmd(walk->vma, pmd, addr);
+ if (pmd_trans_unstable(pmd))
+ goto again;
+ } else if (pmd_leaf(*pmd)) {
+ continue;
+ }
+
err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
@@ -81,7 +86,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
do {
again:
next = pud_addr_end(addr, end);
- if (pud_none(*pud) || !walk->vma) {
+ if (pud_none(*pud)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
@@ -95,9 +100,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
break;
}
- split_huge_pud(walk->vma, pud, addr);
- if (pud_none(*pud))
- goto again;
+ if (walk->vma) {
+ split_huge_pud(walk->vma, pud, addr);
+ if (pud_none(*pud))
+ goto again;
+ } else if (pud_leaf(*pud)) {
+ continue;
+ }
if (walk->pmd_entry || walk->pte_entry)
err = walk_pmd_range(pud, addr, next, walk);
--
2.20.1
arch/x86/mm/dump_pagetables.c passes both struct seq_file and struct
pg_state down the chain of walk_*_level() functions to be passed to
note_page(). Instead place the struct seq_file in struct pg_state and
access it from struct pg_state (which is private to this file) in
note_page().
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 69 ++++++++++++++++++-----------------
1 file changed, 35 insertions(+), 34 deletions(-)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 95728027dd3b..fe21b57f629f 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -36,6 +36,7 @@ struct pg_state {
bool to_dmesg;
bool check_wx;
unsigned long wx_pages;
+ struct seq_file *seq;
};
struct addr_marker {
@@ -265,11 +266,12 @@ static void note_wx(struct pg_state *st)
* of PTE entries; the next one is different so we need to
* print what we collected so far.
*/
-static void note_page(struct seq_file *m, struct pg_state *st,
- pgprot_t new_prot, pgprotval_t new_eff, int level)
+static void note_page(struct pg_state *st, pgprot_t new_prot,
+ pgprotval_t new_eff, int level)
{
pgprotval_t prot, cur, eff;
static const char units[] = "BKMGTPE";
+ struct seq_file *m = st->seq;
/*
* If we have a "break" in the series, we need to flush the state that
@@ -355,8 +357,8 @@ static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
((prot1 | prot2) & _PAGE_NX);
}
-static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
pte_t *pte;
@@ -367,7 +369,7 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
pte = pte_offset_map(&addr, st->current_address);
prot = pte_flags(*pte);
eff = effective_prot(eff_in, prot);
- note_page(m, st, __pgprot(prot), eff, 5);
+ note_page(st, __pgprot(prot), eff, 5);
pte_unmap(pte);
}
}
@@ -380,22 +382,20 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
* us dozens of seconds (minutes for 5-level config) while checking for
* W+X mapping or reading kernel_page_tables debugfs file.
*/
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
- void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
{
if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
(pgtable_l5_enabled() &&
__pa(pt) == __pa(kasan_early_shadow_p4d)) ||
__pa(pt) == __pa(kasan_early_shadow_pud)) {
pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]);
- note_page(m, st, __pgprot(prot), 0, 5);
+ note_page(st, __pgprot(prot), 0, 5);
return true;
}
return false;
}
#else
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
- void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
{
return false;
}
@@ -403,7 +403,7 @@ static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
#if PTRS_PER_PMD > 1
-static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
+static void walk_pmd_level(struct pg_state *st, pud_t addr,
pgprotval_t eff_in, unsigned long P)
{
int i;
@@ -417,27 +417,27 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
prot = pmd_flags(*start);
eff = effective_prot(eff_in, prot);
if (pmd_large(*start) || !pmd_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 4);
- } else if (!kasan_page_table(m, st, pmd_start)) {
- walk_pte_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 4);
+ } else if (!kasan_page_table(st, pmd_start)) {
+ walk_pte_level(st, *start, eff,
P + i * PMD_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 4);
+ note_page(st, __pgprot(0), 0, 4);
start++;
}
}
#else
-#define walk_pmd_level(m,s,a,e,p) walk_pte_level(m,s,__pmd(pud_val(a)),e,p)
+#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
#define pud_large(a) pmd_large(__pmd(pud_val(a)))
#define pud_none(a) pmd_none(__pmd(pud_val(a)))
#endif
#if PTRS_PER_PUD > 1
-static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
pud_t *start, *pud_start;
@@ -451,33 +451,33 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
prot = pud_flags(*start);
eff = effective_prot(eff_in, prot);
if (pud_large(*start) || !pud_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 3);
- } else if (!kasan_page_table(m, st, pud_start)) {
- walk_pmd_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 3);
+ } else if (!kasan_page_table(st, pud_start)) {
+ walk_pmd_level(st, *start, eff,
P + i * PUD_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 3);
+ note_page(st, __pgprot(0), 0, 3);
start++;
}
}
#else
-#define walk_pud_level(m,s,a,e,p) walk_pmd_level(m,s,__pud(p4d_val(a)),e,p)
+#define walk_pud_level(s,a,e,p) walk_pmd_level(s,__pud(p4d_val(a)),e,p)
#define p4d_large(a) pud_large(__pud(p4d_val(a)))
#define p4d_none(a) pud_none(__pud(p4d_val(a)))
#endif
-static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_p4d_level(struct pg_state *st, pgd_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
p4d_t *start, *p4d_start;
pgprotval_t prot, eff;
if (PTRS_PER_P4D == 1)
- return walk_pud_level(m, st, __p4d(pgd_val(addr)), eff_in, P);
+ return walk_pud_level(st, __p4d(pgd_val(addr)), eff_in, P);
p4d_start = start = (p4d_t *)pgd_page_vaddr(addr);
@@ -487,13 +487,13 @@ static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
prot = p4d_flags(*start);
eff = effective_prot(eff_in, prot);
if (p4d_large(*start) || !p4d_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 2);
- } else if (!kasan_page_table(m, st, p4d_start)) {
- walk_pud_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 2);
+ } else if (!kasan_page_table(st, p4d_start)) {
+ walk_pud_level(st, *start, eff,
P + i * P4D_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 2);
+ note_page(st, __pgprot(0), 0, 2);
start++;
}
@@ -530,6 +530,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
}
st.check_wx = checkwx;
+ st.seq = m;
if (checkwx)
st.wx_pages = 0;
@@ -543,13 +544,13 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
eff = prot;
#endif
if (pgd_large(*start) || !pgd_present(*start)) {
- note_page(m, &st, __pgprot(prot), eff, 1);
+ note_page(&st, __pgprot(prot), eff, 1);
} else {
- walk_p4d_level(m, &st, *start, eff,
+ walk_p4d_level(&st, *start, eff,
i * PGD_LEVEL_MULT);
}
} else
- note_page(m, &st, __pgprot(0), 0, 1);
+ note_page(&st, __pgprot(0), 0, 1);
cond_resched();
start++;
@@ -557,7 +558,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
/* Flush out the last page */
st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
- note_page(m, &st, __pgprot(0), 0, 0);
+ note_page(&st, __pgprot(0), 0, 0);
if (!checkwx)
return;
if (st.wx_pages)
--
2.20.1
To enable x86 to use the generic walk_page_range() function, the
callers of ptdump_walk_pgd_level() need to pass an mm_struct rather
than the raw pgd_t pointer. Luckily since commit 7e904a91bf60
("efi: Use efi_mm in x86 as well as ARM") we now have an mm_struct
for EFI on x86.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/mm/dump_pagetables.c | 4 ++--
arch/x86/platform/efi/efi_32.c | 2 +-
arch/x86/platform/efi/efi_64.c | 4 ++--
4 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6986a451619e..1a2b469f6e75 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -29,7 +29,7 @@
extern pgd_t early_top_pgt[PTRS_PER_PGD];
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
void ptdump_walk_pgd_level_checkwx(void);
void ptdump_walk_user_pgd_level_checkwx(void);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index fe21b57f629f..6f0d1296dee1 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -568,9 +568,9 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
pr_info("x86/mm: Checked W+X mappings: passed, no W+X pages found.\n");
}
-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
{
- ptdump_walk_pgd_level_core(m, pgd, false, true);
+ ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
}
void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index 9959657127f4..9175ceaa6e72 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -49,7 +49,7 @@ void efi_sync_low_kernel_mappings(void) {}
void __init efi_dump_pagetable(void)
{
#ifdef CONFIG_EFI_PGT_DUMP
- ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+ ptdump_walk_pgd_level(NULL, &init_mm);
#endif
}
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 08ce8177c3af..47a4c6c70648 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -614,9 +614,9 @@ void __init efi_dump_pagetable(void)
{
#ifdef CONFIG_EFI_PGT_DUMP
if (efi_enabled(EFI_OLD_MEMMAP))
- ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+ ptdump_walk_pgd_level(NULL, &init_mm);
else
- ptdump_walk_pgd_level(NULL, efi_mm.pgd);
+ ptdump_walk_pgd_level(NULL, &efi_mm);
#endif
}
--
2.20.1
An mm_struct is needed to enable x86 to make use of the generic
walk_page_range() function.
In the case of walking the user page tables (when
CONFIG_PAGE_TABLE_ISOLATION is enabled), it is necessary to create a
fake_mm structure because there isn't an mm_struct with a pointer
to the pgd of the user page tables. This fake_mm structure is
initialised with the minimum necessary for the generic page walk code.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 36 ++++++++++++++++++++---------------
1 file changed, 21 insertions(+), 15 deletions(-)
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index bcaf27b637e0..546e28a7785c 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -107,8 +107,6 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR] = { -1, NULL }
};
-#define INIT_PGD ((pgd_t *) &init_top_pgt)
-
#else /* CONFIG_X86_64 */
enum address_markers_idx {
@@ -143,8 +141,6 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR] = { -1, NULL }
};
-#define INIT_PGD (swapper_pg_dir)
-
#endif /* !CONFIG_X86_64 */
/* Multipliers for offsets within the PTEs */
@@ -516,10 +512,10 @@ static inline bool is_hypervisor_range(int idx)
#endif
}
-static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
+static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
bool checkwx, bool dmesg)
{
- pgd_t *start = pgd;
+ pgd_t *start = mm->pgd;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};
@@ -566,39 +562,49 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
{
- ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
+ ptdump_walk_pgd_level_core(m, mm, false, true);
}
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static void ptdump_walk_pgd_level_user_core(struct seq_file *m,
+ struct mm_struct *mm,
+ bool checkwx, bool dmesg)
+{
+ struct mm_struct fake_mm = {
+ .pgd = kernel_to_user_pgdp(mm->pgd)
+ };
+ init_rwsem(&fake_mm.mmap_sem);
+ ptdump_walk_pgd_level_core(m, &fake_mm, checkwx, dmesg);
+}
+#endif
+
void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
bool user)
{
- pgd_t *pgd = mm->pgd;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (user && boot_cpu_has(X86_FEATURE_PTI))
- pgd = kernel_to_user_pgdp(pgd);
+ ptdump_walk_pgd_level_user_core(m, mm, false, false);
+ else
#endif
- ptdump_walk_pgd_level_core(m, pgd, false, false);
+ ptdump_walk_pgd_level_core(m, mm, false, false);
}
EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
void ptdump_walk_user_pgd_level_checkwx(void)
{
#ifdef CONFIG_PAGE_TABLE_ISOLATION
- pgd_t *pgd = INIT_PGD;
-
if (!(__supported_pte_mask & _PAGE_NX) ||
!boot_cpu_has(X86_FEATURE_PTI))
return;
pr_info("x86/mm: Checking user space page tables\n");
- pgd = kernel_to_user_pgdp(pgd);
- ptdump_walk_pgd_level_core(NULL, pgd, true, false);
+ ptdump_walk_pgd_level_user_core(NULL, &init_mm, true, false);
#endif
}
void ptdump_walk_pgd_level_checkwx(void)
{
- ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
+ ptdump_walk_pgd_level_core(NULL, &init_mm, true, false);
}
static int __init pt_dump_init(void)
--
2.20.1
Add a generic version of page table dumping that architectures can
opt in to.
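A rough sketch of how an architecture might opt in (the my_* names are
hypothetical; only ptdump_walk_pgd(), struct ptdump_state and
struct ptdump_range come from this patch):

#include <linux/ptdump.h>

/*
 * Sketch only: decode and print one contiguous run of entries.
 * A level of 0 is the final flush, -1 marks a hole.
 */
static void my_note_page(struct ptdump_state *st, unsigned long addr,
			 int level, unsigned long val)
{
	/* accumulate/print the range ending at 'addr' here */
}

static const struct ptdump_range my_ranges[] = {
	{PAGE_OFFSET, ~0UL},	/* example range: the linear map upwards */
	{0, 0}			/* terminator */
};

static void my_arch_ptdump(void)
{
	struct ptdump_state st = {
		.note_page = my_note_page,
		.range = my_ranges,
	};

	ptdump_walk_pgd(&st, &init_mm);
}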
Signed-off-by: Steven Price <[email protected]>
---
include/linux/ptdump.h | 19 +++++
mm/Kconfig.debug | 21 ++++++
mm/Makefile | 1 +
mm/ptdump.c | 161 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 202 insertions(+)
create mode 100644 include/linux/ptdump.h
create mode 100644 mm/ptdump.c
diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h
new file mode 100644
index 000000000000..eb8e78154be3
--- /dev/null
+++ b/include/linux/ptdump.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_PTDUMP_H
+#define _LINUX_PTDUMP_H
+
+struct ptdump_range {
+ unsigned long start;
+ unsigned long end;
+};
+
+struct ptdump_state {
+ void (*note_page)(struct ptdump_state *st, unsigned long addr,
+ int level, unsigned long val);
+ const struct ptdump_range *range;
+};
+
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm);
+
+#endif /* _LINUX_PTDUMP_H */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 82b6a20898bd..7ad939b7140f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -115,3 +115,24 @@ config DEBUG_RODATA_TEST
depends on STRICT_KERNEL_RWX
---help---
This option enables a testcase for the setting rodata read-only.
+
+config GENERIC_PTDUMP
+ bool
+
+config PTDUMP_CORE
+ bool
+
+config PTDUMP_DEBUGFS
+ bool "Export kernel pagetable layout to userspace via debugfs"
+ depends on DEBUG_KERNEL
+ depends on DEBUG_FS
+ depends on GENERIC_PTDUMP
+ select PTDUMP_CORE
+ help
+ Say Y here if you want to show the kernel pagetable layout in a
+ debugfs file. This information is only useful for kernel developers
+ who are working in architecture specific areas of the kernel.
+ It is probably not a good idea to enable this feature in a production
+ kernel.
+
+ If in doubt, say N.
diff --git a/mm/Makefile b/mm/Makefile
index 338e528ad436..750a4c12d5da 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
obj-$(CONFIG_HMM_MIRROR) += hmm.o
obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
diff --git a/mm/ptdump.c b/mm/ptdump.c
new file mode 100644
index 000000000000..39befc9088b8
--- /dev/null
+++ b/mm/ptdump.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/ptdump.h>
+#include <linux/kasan.h>
+
+static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+ pgd_t val = READ_ONCE(*pgd);
+
+ if (pgd_leaf(val))
+ st->note_page(st, addr, 1, pgd_val(val));
+
+ return 0;
+}
+
+static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+ p4d_t val = READ_ONCE(*p4d);
+
+ if (p4d_leaf(val))
+ st->note_page(st, addr, 2, p4d_val(val));
+
+ return 0;
+}
+
+static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+ pud_t val = READ_ONCE(*pud);
+
+ if (pud_leaf(val))
+ st->note_page(st, addr, 3, pud_val(val));
+
+ return 0;
+}
+
+static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+ pmd_t val = READ_ONCE(*pmd);
+
+ if (pmd_leaf(val))
+ st->note_page(st, addr, 4, pmd_val(val));
+
+ return 0;
+}
+
+static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ st->note_page(st, addr, 5, pte_val(READ_ONCE(*pte)));
+
+ return 0;
+}
+
+#ifdef CONFIG_KASAN
+/*
+ * This is an optimization for KASAN=y case. Since all kasan page tables
+ * eventually point to the kasan_early_shadow_page we could call note_page()
+ * right away without walking through lower level page tables. This saves
+ * us dozens of seconds (minutes for 5-level config) while checking for
+ * W+X mapping or reading kernel_page_tables debugfs file.
+ */
+static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
+ unsigned long addr)
+{
+ if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
+#ifdef CONFIG_X86
+ (pgtable_l5_enabled() &&
+ __pa(pt) == __pa(kasan_early_shadow_p4d)) ||
+#endif
+ __pa(pt) == __pa(kasan_early_shadow_pud)) {
+ st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
+ return true;
+ }
+ return false;
+}
+#else
+static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
+ unsigned long addr)
+{
+ return false;
+}
+#endif
+
+static int ptdump_test_p4d(unsigned long addr, unsigned long next,
+ p4d_t *p4d, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ if (kasan_page_table(st, p4d, addr))
+ return 1;
+ return 0;
+}
+
+static int ptdump_test_pud(unsigned long addr, unsigned long next,
+ pud_t *pud, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ if (kasan_page_table(st, pud, addr))
+ return 1;
+ return 0;
+}
+
+static int ptdump_test_pmd(unsigned long addr, unsigned long next,
+ pmd_t *pmd, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ if (kasan_page_table(st, pmd, addr))
+ return 1;
+ return 0;
+}
+
+static int ptdump_hole(unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ st->note_page(st, addr, -1, 0);
+
+ return 0;
+}
+
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
+{
+ struct mm_walk walk = {
+ .mm = mm,
+ .pgd_entry = ptdump_pgd_entry,
+ .p4d_entry = ptdump_p4d_entry,
+ .pud_entry = ptdump_pud_entry,
+ .pmd_entry = ptdump_pmd_entry,
+ .pte_entry = ptdump_pte_entry,
+ .test_p4d = ptdump_test_p4d,
+ .test_pud = ptdump_test_pud,
+ .test_pmd = ptdump_test_pmd,
+ .pte_hole = ptdump_hole,
+ .private = st
+ };
+ const struct ptdump_range *range = st->range;
+
+ down_read(&mm->mmap_sem);
+ while (range->start != range->end) {
+ walk_page_range(range->start, range->end, &walk);
+ range++;
+ }
+ up_read(&mm->mmap_sem);
+
+ /* Flush out the last page */
+ st->note_page(st, 0, 0, 0);
+}
--
2.20.1
Make use of the new functionality in walk_page_range() to remove the
arch page walking code and use the generic code to walk the page tables.
The effective permissions are passed down the chain using new fields
in struct pg_state.
The KASAN optimisation is implemented by including test_p?d callbacks
which can decide to skip an entire tree of entries.
Signed-off-by: Steven Price <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/Kconfig.debug | 20 +--
arch/x86/mm/Makefile | 4 +-
arch/x86/mm/dump_pagetables.c | 285 +++++++---------------------------
4 files changed, 64 insertions(+), 246 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 222855cc0158..76beeec13c27 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -119,6 +119,7 @@ config X86
select GENERIC_IRQ_RESERVATION_MODE
select GENERIC_IRQ_SHOW
select GENERIC_PENDING_IRQ if SMP
+ select GENERIC_PTDUMP
select GENERIC_SMP_IDLE_THREAD
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 71c92db47c41..ca4ee374e685 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -62,26 +62,10 @@ config EARLY_PRINTK_USB_XDBC
config MCSAFE_TEST
def_bool n
-config X86_PTDUMP_CORE
- def_bool n
-
-config X86_PTDUMP
- tristate "Export kernel pagetable layout to userspace via debugfs"
- depends on DEBUG_KERNEL
- select DEBUG_FS
- select X86_PTDUMP_CORE
- ---help---
- Say Y here if you want to show the kernel pagetable layout in a
- debugfs file. This information is only useful for kernel developers
- who are working in architecture specific areas of the kernel.
- It is probably not a good idea to enable this feature in a production
- kernel.
- If in doubt, say "N"
-
config EFI_PGT_DUMP
bool "Dump the EFI pagetable"
depends on EFI
- select X86_PTDUMP_CORE
+ select PTDUMP_CORE
---help---
Enable this if you want to dump the EFI page table before
enabling virtual mode. This can be used to debug miscellaneous
@@ -90,7 +74,7 @@ config EFI_PGT_DUMP
config DEBUG_WX
bool "Warn on W+X mappings at boot"
- select X86_PTDUMP_CORE
+ select PTDUMP_CORE
---help---
Generate a warning if any W+X mappings are found at boot.
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 84373dc9b341..66cf0ea5c2be 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -28,8 +28,8 @@ obj-$(CONFIG_X86_PAT) += pat_rbtree.o
obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-obj-$(CONFIG_X86_PTDUMP_CORE) += dump_pagetables.o
-obj-$(CONFIG_X86_PTDUMP) += debug_pagetables.o
+obj-$(CONFIG_PTDUMP_CORE) += dump_pagetables.o
+obj-$(CONFIG_PTDUMP_DEBUGFS) += debug_pagetables.o
obj-$(CONFIG_HIGHMEM) += highmem_32.o
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 546e28a7785c..61b5feff12c7 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -16,6 +16,7 @@
#include <linux/seq_file.h>
#include <linux/highmem.h>
#include <linux/pci.h>
+#include <linux/ptdump.h>
#include <asm/e820/types.h>
#include <asm/pgtable.h>
@@ -26,11 +27,12 @@
* when a "break" in the continuity is found.
*/
struct pg_state {
+ struct ptdump_state ptdump;
int level;
- pgprot_t current_prot;
+ pgprotval_t current_prot;
pgprotval_t effective_prot;
+ pgprotval_t prot_levels[5];
unsigned long start_address;
- unsigned long current_address;
const struct addr_marker *marker;
unsigned long lines;
bool to_dmesg;
@@ -171,9 +173,8 @@ static struct addr_marker address_markers[] = {
/*
* Print a readable form of a pgprot_t to the seq_file
*/
-static void printk_prot(struct seq_file *m, pgprot_t prot, int level, bool dmsg)
+static void printk_prot(struct seq_file *m, pgprotval_t pr, int level, bool dmsg)
{
- pgprotval_t pr = pgprot_val(prot);
static const char * const level_name[] =
{ "cr3", "pgd", "p4d", "pud", "pmd", "pte" };
@@ -220,24 +221,11 @@ static void printk_prot(struct seq_file *m, pgprot_t prot, int level, bool dmsg)
pt_dump_cont_printf(m, dmsg, "%s\n", level_name[level]);
}
-/*
- * On 64 bits, sign-extend the 48 bit address to 64 bit
- */
-static unsigned long normalize_addr(unsigned long u)
-{
- int shift;
- if (!IS_ENABLED(CONFIG_X86_64))
- return u;
-
- shift = 64 - (__VIRTUAL_MASK_SHIFT + 1);
- return (signed long)(u << shift) >> shift;
-}
-
-static void note_wx(struct pg_state *st)
+static void note_wx(struct pg_state *st, unsigned long addr)
{
unsigned long npages;
- npages = (st->current_address - st->start_address) / PAGE_SIZE;
+ npages = (addr - st->start_address) / PAGE_SIZE;
#ifdef CONFIG_PCI_BIOS
/*
@@ -245,7 +233,7 @@ static void note_wx(struct pg_state *st)
* Inform about it, but avoid the warning.
*/
if (pcibios_enabled && st->start_address >= PAGE_OFFSET + BIOS_BEGIN &&
- st->current_address <= PAGE_OFFSET + BIOS_END) {
+ addr <= PAGE_OFFSET + BIOS_END) {
pr_warn_once("x86/mm: PCI BIOS W+X mapping %lu pages\n", npages);
return;
}
@@ -257,25 +245,44 @@ static void note_wx(struct pg_state *st)
(void *)st->start_address);
}
+static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
+{
+ return (prot1 & prot2 & (_PAGE_USER | _PAGE_RW)) |
+ ((prot1 | prot2) & _PAGE_NX);
+}
+
/*
* This function gets called on a break in a continuous series
* of PTE entries; the next one is different so we need to
* print what we collected so far.
*/
-static void note_page(struct pg_state *st, pgprot_t new_prot,
- pgprotval_t new_eff, int level)
+static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+ unsigned long val)
{
- pgprotval_t prot, cur, eff;
+ struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
+ pgprotval_t new_prot, new_eff;
+ pgprotval_t cur, eff;
static const char units[] = "BKMGTPE";
struct seq_file *m = st->seq;
+ new_prot = val & PTE_FLAGS_MASK;
+
+ if (level > 1) {
+ new_eff = effective_prot(st->prot_levels[level - 2],
+ new_prot);
+ } else {
+ new_eff = new_prot;
+ }
+
+ if (level > 0)
+ st->prot_levels[level-1] = new_eff;
+
/*
* If we have a "break" in the series, we need to flush the state that
* we have now. "break" is either changing perms, levels or
* address space marker.
*/
- prot = pgprot_val(new_prot);
- cur = pgprot_val(st->current_prot);
+ cur = st->current_prot;
eff = st->effective_prot;
if (!st->level) {
@@ -287,14 +294,14 @@ static void note_page(struct pg_state *st, pgprot_t new_prot,
st->lines = 0;
pt_dump_seq_printf(m, st->to_dmesg, "---[ %s ]---\n",
st->marker->name);
- } else if (prot != cur || new_eff != eff || level != st->level ||
- st->current_address >= st->marker[1].start_address) {
+ } else if (new_prot != cur || new_eff != eff || level != st->level ||
+ addr >= st->marker[1].start_address) {
const char *unit = units;
unsigned long delta;
int width = sizeof(unsigned long) * 2;
if (st->check_wx && (eff & _PAGE_RW) && !(eff & _PAGE_NX))
- note_wx(st);
+ note_wx(st, addr);
/*
* Now print the actual finished series
@@ -304,9 +311,9 @@ static void note_page(struct pg_state *st, pgprot_t new_prot,
pt_dump_seq_printf(m, st->to_dmesg,
"0x%0*lx-0x%0*lx ",
width, st->start_address,
- width, st->current_address);
+ width, addr);
- delta = st->current_address - st->start_address;
+ delta = addr - st->start_address;
while (!(delta & 1023) && unit[1]) {
delta >>= 10;
unit++;
@@ -324,7 +331,7 @@ static void note_page(struct pg_state *st, pgprot_t new_prot,
* such as the start of vmalloc space etc.
* This helps in the interpretation.
*/
- if (st->current_address >= st->marker[1].start_address) {
+ if (addr >= st->marker[1].start_address) {
if (st->marker->max_lines &&
st->lines > st->marker->max_lines) {
unsigned long nskip =
@@ -340,217 +347,43 @@ static void note_page(struct pg_state *st, pgprot_t new_prot,
st->marker->name);
}
- st->start_address = st->current_address;
+ st->start_address = addr;
st->current_prot = new_prot;
st->effective_prot = new_eff;
st->level = level;
}
}
-static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
-{
- return (prot1 & prot2 & (_PAGE_USER | _PAGE_RW)) |
- ((prot1 | prot2) & _PAGE_NX);
-}
-
-static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
- unsigned long P)
-{
- int i;
- pte_t *pte;
- pgprotval_t prot, eff;
-
- for (i = 0; i < PTRS_PER_PTE; i++) {
- st->current_address = normalize_addr(P + i * PTE_LEVEL_MULT);
- pte = pte_offset_map(&addr, st->current_address);
- prot = pte_flags(*pte);
- eff = effective_prot(eff_in, prot);
- note_page(st, __pgprot(prot), eff, 5);
- pte_unmap(pte);
- }
-}
-#ifdef CONFIG_KASAN
-
-/*
- * This is an optimization for KASAN=y case. Since all kasan page tables
- * eventually point to the kasan_early_shadow_page we could call note_page()
- * right away without walking through lower level page tables. This saves
- * us dozens of seconds (minutes for 5-level config) while checking for
- * W+X mapping or reading kernel_page_tables debugfs file.
- */
-static inline bool kasan_page_table(struct pg_state *st, void *pt)
-{
- if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
- (pgtable_l5_enabled() &&
- __pa(pt) == __pa(kasan_early_shadow_p4d)) ||
- __pa(pt) == __pa(kasan_early_shadow_pud)) {
- pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]);
- note_page(st, __pgprot(prot), 0, 5);
- return true;
- }
- return false;
-}
-#else
-static inline bool kasan_page_table(struct pg_state *st, void *pt)
-{
- return false;
-}
-#endif
-
-#if PTRS_PER_PMD > 1
-
-static void walk_pmd_level(struct pg_state *st, pud_t addr,
- pgprotval_t eff_in, unsigned long P)
-{
- int i;
- pmd_t *start, *pmd_start;
- pgprotval_t prot, eff;
-
- pmd_start = start = (pmd_t *)pud_page_vaddr(addr);
- for (i = 0; i < PTRS_PER_PMD; i++) {
- st->current_address = normalize_addr(P + i * PMD_LEVEL_MULT);
- if (!pmd_none(*start)) {
- prot = pmd_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (pmd_large(*start) || !pmd_present(*start)) {
- note_page(st, __pgprot(prot), eff, 4);
- } else if (!kasan_page_table(st, pmd_start)) {
- walk_pte_level(st, *start, eff,
- P + i * PMD_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 4);
- start++;
- }
-}
-
-#else
-#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
-#define pud_large(a) pmd_large(__pmd(pud_val(a)))
-#define pud_none(a) pmd_none(__pmd(pud_val(a)))
-#endif
-
-#if PTRS_PER_PUD > 1
-
-static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
- unsigned long P)
-{
- int i;
- pud_t *start, *pud_start;
- pgprotval_t prot, eff;
-
- pud_start = start = (pud_t *)p4d_page_vaddr(addr);
-
- for (i = 0; i < PTRS_PER_PUD; i++) {
- st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
- if (!pud_none(*start)) {
- prot = pud_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (pud_large(*start) || !pud_present(*start)) {
- note_page(st, __pgprot(prot), eff, 3);
- } else if (!kasan_page_table(st, pud_start)) {
- walk_pmd_level(st, *start, eff,
- P + i * PUD_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 3);
-
- start++;
- }
-}
-
-#else
-#define walk_pud_level(s,a,e,p) walk_pmd_level(s,__pud(p4d_val(a)),e,p)
-#define p4d_large(a) pud_large(__pud(p4d_val(a)))
-#define p4d_none(a) pud_none(__pud(p4d_val(a)))
-#endif
-
-static void walk_p4d_level(struct pg_state *st, pgd_t addr, pgprotval_t eff_in,
- unsigned long P)
-{
- int i;
- p4d_t *start, *p4d_start;
- pgprotval_t prot, eff;
-
- if (PTRS_PER_P4D == 1)
- return walk_pud_level(st, __p4d(pgd_val(addr)), eff_in, P);
-
- p4d_start = start = (p4d_t *)pgd_page_vaddr(addr);
-
- for (i = 0; i < PTRS_PER_P4D; i++) {
- st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT);
- if (!p4d_none(*start)) {
- prot = p4d_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (p4d_large(*start) || !p4d_present(*start)) {
- note_page(st, __pgprot(prot), eff, 2);
- } else if (!kasan_page_table(st, p4d_start)) {
- walk_pud_level(st, *start, eff,
- P + i * P4D_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 2);
-
- start++;
- }
-}
+static const struct ptdump_range ptdump_ranges[] = {
+#ifdef CONFIG_X86_64
-#define pgd_large(a) (pgtable_l5_enabled() ? pgd_large(a) : p4d_large(__p4d(pgd_val(a))))
-#define pgd_none(a) (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))
+#define normalize_addr_shift (64 - (__VIRTUAL_MASK_SHIFT + 1))
+#define normalize_addr(u) ((signed long)(u << normalize_addr_shift) \
+ >> normalize_addr_shift)
-static inline bool is_hypervisor_range(int idx)
-{
-#ifdef CONFIG_X86_64
- /*
- * A hole in the beginning of kernel address space reserved
- * for a hypervisor.
- */
- return (idx >= pgd_index(GUARD_HOLE_BASE_ADDR)) &&
- (idx < pgd_index(GUARD_HOLE_END_ADDR));
+ {0, PTRS_PER_PGD * PGD_LEVEL_MULT / 2},
+ {normalize_addr(PTRS_PER_PGD * PGD_LEVEL_MULT / 2), ~0UL},
#else
- return false;
+ {0, ~0UL},
#endif
-}
+ {0, 0}
+};
static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
bool checkwx, bool dmesg)
{
- pgd_t *start = mm->pgd;
- pgprotval_t prot, eff;
- int i;
- struct pg_state st = {};
-
- st.to_dmesg = dmesg;
- st.check_wx = checkwx;
- st.seq = m;
- if (checkwx)
- st.wx_pages = 0;
-
- for (i = 0; i < PTRS_PER_PGD; i++) {
- st.current_address = normalize_addr(i * PGD_LEVEL_MULT);
- if (!pgd_none(*start) && !is_hypervisor_range(i)) {
- prot = pgd_flags(*start);
-#ifdef CONFIG_X86_PAE
- eff = _PAGE_USER | _PAGE_RW;
-#else
- eff = prot;
-#endif
- if (pgd_large(*start) || !pgd_present(*start)) {
- note_page(&st, __pgprot(prot), eff, 1);
- } else {
- walk_p4d_level(&st, *start, eff,
- i * PGD_LEVEL_MULT);
- }
- } else
- note_page(&st, __pgprot(0), 0, 1);
+ struct pg_state st = {
+ .ptdump = {
+ .note_page = note_page,
+ .range = ptdump_ranges
+ },
+ .to_dmesg = dmesg,
+ .check_wx = checkwx,
+ .seq = m
+ };
- cond_resched();
- start++;
- }
+ ptdump_walk_pgd(&st.ptdump, mm);
- /* Flush out the last page */
- st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
- note_page(&st, __pgprot(0), 0, 0);
if (!checkwx)
return;
if (st.wx_pages)
--
2.20.1
Now that walk_page_range() can walk kernel page tables, we can switch
the arm64 ptdump code over to using it, simplifying the code.
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/Kconfig.debug | 19 +----
arch/arm64/include/asm/ptdump.h | 8 +-
arch/arm64/mm/Makefile | 4 +-
arch/arm64/mm/dump.c | 117 ++++++++++-------------------
arch/arm64/mm/ptdump_debugfs.c | 2 +-
drivers/firmware/efi/arm-runtime.c | 2 +-
7 files changed, 48 insertions(+), 105 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3adcec05b1f6..5a32c87f37c6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -105,6 +105,7 @@ config ARM64
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAP
+ select GENERIC_PTDUMP
select GENERIC_SCHED_CLOCK
select GENERIC_SMP_IDLE_THREAD
select GENERIC_STRNCPY_FROM_USER
diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index cf09010d825f..1c906d932d6b 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -1,22 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
-config ARM64_PTDUMP_CORE
- def_bool n
-
-config ARM64_PTDUMP_DEBUGFS
- bool "Export kernel pagetable layout to userspace via debugfs"
- depends on DEBUG_KERNEL
- select ARM64_PTDUMP_CORE
- select DEBUG_FS
- help
- Say Y here if you want to show the kernel pagetable layout in a
- debugfs file. This information is only useful for kernel developers
- who are working in architecture specific areas of the kernel.
- It is probably not a good idea to enable this feature in a production
- kernel.
-
- If in doubt, say N.
-
config PID_IN_CONTEXTIDR
bool "Write the current PID to the CONTEXTIDR register"
help
@@ -42,7 +25,7 @@ config ARM64_RANDOMIZE_TEXT_OFFSET
config DEBUG_WX
bool "Warn on W+X mappings at boot"
- select ARM64_PTDUMP_CORE
+ select PTDUMP_CORE
---help---
Generate a warning if any W+X mappings are found at boot.
diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 0b8e7269ec82..38187f74e089 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -5,7 +5,7 @@
#ifndef __ASM_PTDUMP_H
#define __ASM_PTDUMP_H
-#ifdef CONFIG_ARM64_PTDUMP_CORE
+#ifdef CONFIG_PTDUMP_CORE
#include <linux/mm_types.h>
#include <linux/seq_file.h>
@@ -21,15 +21,15 @@ struct ptdump_info {
unsigned long base_addr;
};
-void ptdump_walk_pgd(struct seq_file *s, struct ptdump_info *info);
-#ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
+void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
+#ifdef CONFIG_PTDUMP_DEBUGFS
void ptdump_debugfs_register(struct ptdump_info *info, const char *name);
#else
static inline void ptdump_debugfs_register(struct ptdump_info *info,
const char *name) { }
#endif
void ptdump_check_wx(void);
-#endif /* CONFIG_ARM64_PTDUMP_CORE */
+#endif /* CONFIG_PTDUMP_CORE */
#ifdef CONFIG_DEBUG_WX
#define debug_checkwx() ptdump_check_wx()
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..d91030f0ffee 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -4,8 +4,8 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
ioremap.o mmap.o pgd.o mmu.o \
context.o proc.o pageattr.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-obj-$(CONFIG_ARM64_PTDUMP_CORE) += dump.o
-obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS) += ptdump_debugfs.o
+obj-$(CONFIG_PTDUMP_CORE) += dump.o
+obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
obj-$(CONFIG_NUMA) += numa.o
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
KASAN_SANITIZE_physaddr.o += n
diff --git a/arch/arm64/mm/dump.c b/arch/arm64/mm/dump.c
index 82b3a7fdb4a6..5cc71ad567b4 100644
--- a/arch/arm64/mm/dump.c
+++ b/arch/arm64/mm/dump.c
@@ -15,6 +15,7 @@
#include <linux/io.h>
#include <linux/init.h>
#include <linux/mm.h>
+#include <linux/ptdump.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
@@ -65,10 +66,11 @@ static const struct addr_marker address_markers[] = {
* dumps out a description of the range.
*/
struct pg_state {
+ struct ptdump_state ptdump;
struct seq_file *seq;
const struct addr_marker *marker;
unsigned long start_address;
- unsigned level;
+ int level;
u64 current_prot;
bool check_wx;
unsigned long wx_pages;
@@ -168,6 +170,10 @@ static struct pg_level pg_level[] = {
.name = "PGD",
.bits = pte_bits,
.num = ARRAY_SIZE(pte_bits),
+ }, { /* p4d */
+ .name = "P4D",
+ .bits = pte_bits,
+ .num = ARRAY_SIZE(pte_bits),
}, { /* pud */
.name = (CONFIG_PGTABLE_LEVELS > 3) ? "PUD" : "PGD",
.bits = pte_bits,
@@ -230,11 +236,15 @@ static void note_prot_wx(struct pg_state *st, unsigned long addr)
st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
}
-static void note_page(struct pg_state *st, unsigned long addr, unsigned level,
- u64 val)
+static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+ unsigned long val)
{
+ struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
static const char units[] = "KMGTPE";
- u64 prot = val & pg_level[level].mask;
+ u64 prot = 0;
+
+ if (level >= 0)
+ prot = val & pg_level[level].mask;
if (!st->level) {
st->level = level;
@@ -282,85 +292,27 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level,
}
-static void walk_pte(struct pg_state *st, pmd_t *pmdp, unsigned long start,
- unsigned long end)
-{
- unsigned long addr = start;
- pte_t *ptep = pte_offset_kernel(pmdp, start);
-
- do {
- note_page(st, addr, 4, READ_ONCE(pte_val(*ptep)));
- } while (ptep++, addr += PAGE_SIZE, addr != end);
-}
-
-static void walk_pmd(struct pg_state *st, pud_t *pudp, unsigned long start,
- unsigned long end)
-{
- unsigned long next, addr = start;
- pmd_t *pmdp = pmd_offset(pudp, start);
-
- do {
- pmd_t pmd = READ_ONCE(*pmdp);
- next = pmd_addr_end(addr, end);
-
- if (pmd_none(pmd) || pmd_sect(pmd)) {
- note_page(st, addr, 3, pmd_val(pmd));
- } else {
- BUG_ON(pmd_bad(pmd));
- walk_pte(st, pmdp, addr, next);
- }
- } while (pmdp++, addr = next, addr != end);
-}
-
-static void walk_pud(struct pg_state *st, pgd_t *pgdp, unsigned long start,
- unsigned long end)
+void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
{
- unsigned long next, addr = start;
- pud_t *pudp = pud_offset(pgdp, start);
-
- do {
- pud_t pud = READ_ONCE(*pudp);
- next = pud_addr_end(addr, end);
-
- if (pud_none(pud) || pud_sect(pud)) {
- note_page(st, addr, 2, pud_val(pud));
- } else {
- BUG_ON(pud_bad(pud));
- walk_pmd(st, pudp, addr, next);
- }
- } while (pudp++, addr = next, addr != end);
-}
+ unsigned long end = ~0UL;
+ struct pg_state st;
-static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
- unsigned long start)
-{
- unsigned long end = (start < TASK_SIZE_64) ? TASK_SIZE_64 : 0;
- unsigned long next, addr = start;
- pgd_t *pgdp = pgd_offset(mm, start);
-
- do {
- pgd_t pgd = READ_ONCE(*pgdp);
- next = pgd_addr_end(addr, end);
-
- if (pgd_none(pgd)) {
- note_page(st, addr, 1, pgd_val(pgd));
- } else {
- BUG_ON(pgd_bad(pgd));
- walk_pud(st, pgdp, addr, next);
- }
- } while (pgdp++, addr = next, addr != end);
-}
+ if (info->base_addr < TASK_SIZE_64)
+ end = TASK_SIZE_64;
-void ptdump_walk_pgd(struct seq_file *m, struct ptdump_info *info)
-{
- struct pg_state st = {
- .seq = m,
+ st = (struct pg_state){
+ .seq = s,
.marker = info->markers,
+ .ptdump = {
+ .note_page = note_page,
+ .range = (struct ptdump_range[]){
+ {info->base_addr, end},
+ {0, 0}
+ }
+ }
};
- walk_pgd(&st, info->mm, info->base_addr);
-
- note_page(&st, 0, 0, 0);
+ ptdump_walk_pgd(&st.ptdump, info->mm);
}
static void ptdump_initialize(void)
@@ -388,10 +340,17 @@ void ptdump_check_wx(void)
{ -1, NULL},
},
.check_wx = true,
+ .ptdump = {
+ .note_page = note_page,
+ .range = (struct ptdump_range[]) {
+ {VA_START, ~0UL},
+ {0, 0}
+ }
+ }
};
- walk_pgd(&st, &init_mm, VA_START);
- note_page(&st, 0, 0, 0);
+ ptdump_walk_pgd(&st.ptdump, &init_mm);
+
if (st.wx_pages || st.uxn_pages)
pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
st.wx_pages, st.uxn_pages);
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f25592..1f2eae3e988b 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -7,7 +7,7 @@
static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
- ptdump_walk_pgd(m, info);
+ ptdump_walk(m, info);
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/drivers/firmware/efi/arm-runtime.c b/drivers/firmware/efi/arm-runtime.c
index e2ac5fa5531b..1283685f9c20 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -27,7 +27,7 @@
extern u64 efi_system_table;
-#ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
+#if defined(CONFIG_PTDUMP_DEBUGFS) && defined(CONFIG_ARM64)
#include <asm/ptdump.h>
static struct ptdump_info efi_ptdump_info = {
--
2.20.1
Hi Steven,
On Mon, Jul 22, 2019 at 04:41:53PM +0100, Steven Price wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_leaf() functions/macros.
>
> For mips, we only support large pages on 64 bit.
That ceases to be true with commit 35476311e529 ("MIPS: Add partial
32-bit huge page support") in mips-next, so I think it may be best to
move the definition to asm/pgtable.h so that both 32b & 64b kernels can
pick it up.
Thanks,
Paul
> For 64 bit if _PAGE_HUGE is defined we can simply look for it. When not
> defined we can be confident that there are no leaf pages in existence
> and fall back on the generic implementation (added in a later patch)
> which returns 0.
>
> CC: Ralf Baechle <[email protected]>
> CC: Paul Burton <[email protected]>
> CC: James Hogan <[email protected]>
> CC: [email protected]
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/mips/include/asm/pgtable-64.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
> index 93a9dce31f25..2bdbf8652b5f 100644
> --- a/arch/mips/include/asm/pgtable-64.h
> +++ b/arch/mips/include/asm/pgtable-64.h
> @@ -273,6 +273,10 @@ static inline int pmd_present(pmd_t pmd)
> return pmd_val(pmd) != (unsigned long) invalid_pte_table;
> }
>
> +#ifdef _PAGE_HUGE
> +#define pmd_leaf(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0)
> +#endif
> +
> static inline void pmd_clear(pmd_t *pmdp)
> {
> pmd_val(*pmdp) = ((unsigned long) invalid_pte_table);
> @@ -297,6 +301,10 @@ static inline int pud_present(pud_t pud)
> return pud_val(pud) != (unsigned long) invalid_pmd_table;
> }
>
> +#ifdef _PAGE_HUGE
> +#define pud_leaf(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
> +#endif
> +
> static inline void pud_clear(pud_t *pudp)
> {
> pud_val(*pudp) = ((unsigned long) invalid_pmd_table);
> --
> 2.20.1
>
Hello Steven,
On 07/22/2019 09:11 PM, Steven Price wrote:
> This is a slight reworking and extension of my previous patch set
> (Convert x86 & arm64 to use generic page walk), but I've continued the
> version numbering as most of the changes are the same. In particular
> this series ends with a generic PTDUMP implemention for arm64 and x86.
>
> Many architectures current have a debugfs file for dumping the kernel
> page tables. Currently each architecture has to implement custom
> functions for this because the details of walking the page tables used
> by the kernel are different between architectures.
>
> This series extends the capabilities of walk_page_range() so that it can
> deal with the page tables of the kernel (which have no VMAs and can
> contain larger huge pages than exist for user space). A generic PTDUMP
> implementation is the implemented making use of the new functionality of
> walk_page_range() and finally arm64 and x86 are switch to using it,
> removing the custom table walkers.
Could other architectures just enable this new generic PTDUMP feature if
required without much problem?
>
> To enable a generic page table walker to walk the unusual mappings of
> the kernel we need to implement a set of functions which let us know
> when the walker has reached the leaf entry. After a suggestion from Will
> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
> the purpose (and is a new name so has no historic baggage). Some
> architectures have p?d_large macros but this is easily confused with
> "large pages".
I have not been following the previous versions of the series closely, so I
might be missing something here. But p?d_large(), which identifies large
mappings at a given level, can only signify a leaf entry. Large pages in the
tables exist only as leaf entries. So what is the problem with using it
directly instead? Is there any possibility in the kernel mapping where these
large pages are not leaf entries?
>
> Mostly this is a clean up and there should be very little functional
> change. The exceptions are:
>
> * x86 PTDUMP debugfs output no longer display pages which aren't
> present (patch 14).
Hmm, pages in the kernel mapping which are not present! Which ones are those?
Just curious.
>
> * arm64 has the ability to efficiently process KASAN pages (which
> previously only x86 implemented). This means that the combination of
> KASAN and DEBUG_WX is now useable.
>
> Also available as a git tree:
> git://linux-arm.org/linux-sp.git walk_page_range/v9
>
> Changes since v8:
> https://lore.kernel.org/lkml/[email protected]/
> * Rename from p?d_large() to p?d_leaf()
As mentioned before, I am wondering whether this is actually required, or
whether it is even a good idea to introduce something like this which expands
the scope of the page table helper semantics further in generic MM.
> * Dropped patches migrating arm64/x86 custom walkers to
> walk_page_range() in favour of adding a generic PTDUMP implementation
> and migrating arm64/x86 to that instead.
> * Rebased to v5.3-rc1
Creating a generic PTDUMP implementation is definitely a better idea.
On Mon, Jul 22, 2019 at 04:42:08PM +0100, Steven Price wrote:
> Add a generic version of page table dumping that architectures can
> opt-in to
>
> Signed-off-by: Steven Price <[email protected]>
[...]
> +#ifdef CONFIG_KASAN
> +/*
> + * This is an optimization for KASAN=y case. Since all kasan page tables
> + * eventually point to the kasan_early_shadow_page we could call note_page()
> + * right away without walking through lower level page tables. This saves
> + * us dozens of seconds (minutes for 5-level config) while checking for
> + * W+X mapping or reading kernel_page_tables debugfs file.
> + */
> +static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
> + unsigned long addr)
> +{
> + if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
> +#ifdef CONFIG_X86
> + (pgtable_l5_enabled() &&
> + __pa(pt) == __pa(kasan_early_shadow_p4d)) ||
> +#endif
> + __pa(pt) == __pa(kasan_early_shadow_pud)) {
> + st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
> + return true;
> + }
> + return false;
Have you tried this with CONFIG_DEBUG_VIRTUAL?
The kasan_early_shadow_pmd is a kernel object rather than a linear map
object, so you should use __pa_symbol for that.
It's a bit horrid to have to test multiple levels in one function; can't
we check the relevant level inline in each of the test_p?d funcs?
They're optional anyway, so they only need to be defined for
CONFIG_KASAN.
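Something like the below, perhaps (a completely untested sketch, just to
show the shape I have in mind; the exact __pa()/__pa_symbol() usage would
need double-checking):

#ifdef CONFIG_KASAN
static int ptdump_test_pmd(unsigned long addr, unsigned long next,
			   pmd_t *pmd, struct mm_walk *walk)
{
	struct ptdump_state *st = walk->private;

	/* The whole table is the KASAN early shadow: note one entry, skip it */
	if (__pa(pmd) == __pa_symbol(kasan_early_shadow_pmd)) {
		st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
		return 1;
	}

	return 0;
}
#endif

... and similarly for the pud/p4d variants, with the p4d one additionally
guarded by pgtable_l5_enabled() on x86.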
Thanks,
Mark.
> +}
> +#else
> +static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
> + unsigned long addr)
> +{
> + return false;
> +}
> +#endif
> +
> +static int ptdump_test_p4d(unsigned long addr, unsigned long next,
> + p4d_t *p4d, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + if (kasan_page_table(st, p4d, addr))
> + return 1;
> + return 0;
> +}
> +static int ptdump_test_pud(unsigned long addr, unsigned long next,
> + pud_t *pud, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + if (kasan_page_table(st, pud, addr))
> + return 1;
> + return 0;
> +}
> +
> +static int ptdump_test_pmd(unsigned long addr, unsigned long next,
> + pmd_t *pmd, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + if (kasan_page_table(st, pmd, addr))
> + return 1;
> + return 0;
> +}
> +
> +static int ptdump_hole(unsigned long addr, unsigned long next,
> + struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + st->note_page(st, addr, -1, 0);
> +
> + return 0;
> +}
> +
> +void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
> +{
> + struct mm_walk walk = {
> + .mm = mm,
> + .pgd_entry = ptdump_pgd_entry,
> + .p4d_entry = ptdump_p4d_entry,
> + .pud_entry = ptdump_pud_entry,
> + .pmd_entry = ptdump_pmd_entry,
> + .pte_entry = ptdump_pte_entry,
> + .test_p4d = ptdump_test_p4d,
> + .test_pud = ptdump_test_pud,
> + .test_pmd = ptdump_test_pmd,
> + .pte_hole = ptdump_hole,
> + .private = st
> + };
> + const struct ptdump_range *range = st->range;
> +
> + down_read(&mm->mmap_sem);
> + while (range->start != range->end) {
> + walk_page_range(range->start, range->end, &walk);
> + range++;
> + }
> + up_read(&mm->mmap_sem);
> +
> + /* Flush out the last page */
> + st->note_page(st, 0, 0, 0);
> +}
> --
> 2.20.1
>
On Mon, Jul 22, 2019 at 04:41:59PM +0100, Steven Price wrote:
> Exposing the pud/pgd levels of the page tables to walk_page_range() means
> we may come across the exotic large mappings that come with large areas
> of contiguous memory (such as the kernel's linear map).
>
> For architectures that don't provide all p?d_leaf() macros, provide
> generic do nothing default that are suitable where there cannot be leaf
> pages that that level.
>
> Signed-off-by: Steven Price <[email protected]>
Not a big deal, but it would probably make sense for this to be patch 1
in the series, given it defines the semantics of p?d_leaf(), and they're
not used until we provide all the architectural implementations anyway.
It might also be worth pointing out the reasons for this naming, e.g.
p?d_large() aren't currently generic, and this name minimizes potential
confusion between p?d_{large,huge}().
> ---
> include/asm-generic/pgtable.h | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 75d9d68a6de7..46275896ca66 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1188,4 +1188,23 @@ static inline bool arch_has_pfn_modify_check(void)
> #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED)
> #endif
>
> +/*
> + * p?d_leaf() - true if this entry is a final mapping to a physical address.
> + * This differs from p?d_huge() by the fact that they are always available (if
> + * the architecture supports large pages at the appropriate level) even
> + * if CONFIG_HUGETLB_PAGE is not defined.
> + */
I assume it's only safe to call these on valid entries? I think it would
be worth calling that out explicitly.
Otherwise, this looks sound to me:
Acked-by: Mark Rutland <[email protected]>
Thanks,
Mark.
> +#ifndef pgd_leaf
> +#define pgd_leaf(x) 0
> +#endif
> +#ifndef p4d_leaf
> +#define p4d_leaf(x) 0
> +#endif
> +#ifndef pud_leaf
> +#define pud_leaf(x) 0
> +#endif
> +#ifndef pmd_leaf
> +#define pmd_leaf(x) 0
> +#endif
> +
> #endif /* _ASM_GENERIC_PGTABLE_H */
> --
> 2.20.1
>
On Mon, Jul 22, 2019 at 04:42:00PM +0100, Steven Price wrote:
> pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
> ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
> no users. We're about to add users so reintroduce them, along with
> p4d_entry() as we now have 5 levels of tables.
>
> Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
> PUD-sized transparent hugepages") already re-added pud_entry() but with
> different semantics to the other callbacks. Since there have never
> been upstream users of this, revert the semantics back to match the
> other callbacks. This means pud_entry() is called for all entries, not
> just transparent huge pages.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> include/linux/mm.h | 15 +++++++++------
> mm/pagewalk.c | 27 ++++++++++++++++-----------
> 2 files changed, 25 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0334ca97c584..b22799129128 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1432,15 +1432,14 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>
> /**
> * mm_walk - callbacks for walk_page_range
> - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
> - * this handler should only handle pud_trans_huge() puds.
> - * the pmd_entry or pte_entry callbacks will be used for
> - * regular PUDs.
> - * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
> + * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
> + * @p4d_entry: if set, called for each non-empty P4D entry
> + * @pud_entry: if set, called for each non-empty PUD entry
> + * @pmd_entry: if set, called for each non-empty PMD entry
How are these expected to work with folding?
For example, on arm64 with 64K pages and 42-bit VA, you can have 2-level
tables where the PGD is P4D, PUD, and PMD. IIUC we'd invoke the
callbacks for each of those levels where we found an entry in the pgd.
Either the callee has to handle that, or we should inhibit the callbacks when
levels are folded, and I think that needs to be explicitly stated either
way.
IIRC on x86 the p4d folding is dynamic depending on whether the HW
supports 5-level page tables. Maybe that implies the callee has to
handle that.
Thanks,
Mark.
> * this handler is required to be able to handle
> * pmd_trans_huge() pmds. They may simply choose to
> * split_huge_page() instead of handling it explicitly.
> - * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
> + * @pte_entry: if set, called for each non-empty PTE (lowest-level) entry
> * @pte_hole: if set, called for each hole at all levels
> * @hugetlb_entry: if set, called for each hugetlb entry
> * @test_walk: caller specific callback function to determine whether
> @@ -1455,6 +1454,10 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> * (see the comment on walk_page_range() for more details)
> */
> struct mm_walk {
> + int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk);
> + int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
> + unsigned long next, struct mm_walk *walk);
> int (*pud_entry)(pud_t *pud, unsigned long addr,
> unsigned long next, struct mm_walk *walk);
> int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index c3084ff2569d..98373a9f88b8 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -90,15 +90,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> }
>
> if (walk->pud_entry) {
> - spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
> -
> - if (ptl) {
> - err = walk->pud_entry(pud, addr, next, walk);
> - spin_unlock(ptl);
> - if (err)
> - break;
> - continue;
> - }
> + err = walk->pud_entry(pud, addr, next, walk);
> + if (err)
> + break;
> }
>
> split_huge_pud(walk->vma, pud, addr);
> @@ -131,7 +125,12 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> break;
> continue;
> }
> - if (walk->pmd_entry || walk->pte_entry)
> + if (walk->p4d_entry) {
> + err = walk->p4d_entry(p4d, addr, next, walk);
> + if (err)
> + break;
> + }
> + if (walk->pud_entry || walk->pmd_entry || walk->pte_entry)
> err = walk_pud_range(p4d, addr, next, walk);
> if (err)
> break;
> @@ -157,7 +156,13 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
> break;
> continue;
> }
> - if (walk->pmd_entry || walk->pte_entry)
> + if (walk->pgd_entry) {
> + err = walk->pgd_entry(pgd, addr, next, walk);
> + if (err)
> + break;
> + }
> + if (walk->p4d_entry || walk->pud_entry || walk->pmd_entry ||
> + walk->pte_entry)
> err = walk_p4d_range(pgd, addr, next, walk);
> if (err)
> break;
> --
> 2.20.1
>
On Mon, Jul 22, 2019 at 04:41:49PM +0100, Steven Price wrote:
> This is a slight reworking and extension of my previous patch set
> (Convert x86 & arm64 to use generic page walk), but I've continued the
> version numbering as most of the changes are the same. In particular
> this series ends with a generic PTDUMP implemention for arm64 and x86.
>
> Many architectures current have a debugfs file for dumping the kernel
> page tables. Currently each architecture has to implement custom
> functions for this because the details of walking the page tables used
> by the kernel are different between architectures.
>
> This series extends the capabilities of walk_page_range() so that it can
> deal with the page tables of the kernel (which have no VMAs and can
> contain larger huge pages than exist for user space). A generic PTDUMP
> implementation is the implemented making use of the new functionality of
> walk_page_range() and finally arm64 and x86 are switch to using it,
> removing the custom table walkers.
>
> To enable a generic page table walker to walk the unusual mappings of
> the kernel we need to implement a set of functions which let us know
> when the walker has reached the leaf entry. After a suggestion from Will
> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
> the purpose (and is a new name so has no historic baggage). Some
> architectures have p?d_large macros but this is easily confused with
> "large pages".
>
> Mostly this is a clean up and there should be very little functional
> change. The exceptions are:
>
> * x86 PTDUMP debugfs output no longer display pages which aren't
> present (patch 14).
>
> * arm64 has the ability to efficiently process KASAN pages (which
> previously only x86 implemented). This means that the combination of
> KASAN and DEBUG_WX is now useable.
Are there any visible changes to the arm64 output?
Could you dump a before/after example somewhere?
Thanks,
Mark.
On 22/07/2019 22:47, Paul Burton wrote:
> Hi Steven,
>
> On Mon, Jul 22, 2019 at 04:41:53PM +0100, Steven Price wrote:
>> walk_page_range() is going to be allowed to walk page tables other than
>> those of user space. For this it needs to know when it has reached a
>> 'leaf' entry in the page tables. This information is provided by the
>> p?d_leaf() functions/macros.
>>
>> For mips, we only support large pages on 64 bit.
>
> That ceases to be true with commit 35476311e529 ("MIPS: Add partial
> 32-bit huge page support") in mips-next, so I think it may be best to
> move the definition to asm/pgtable.h so that both 32b & 64b kernels can
> pick it up.
Thanks for pointing that out. I'll move the definitions as you suggest.
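Presumably just moving the existing definitions across, something like
(untested, and assuming the mips-next change where _PAGE_HUGE is also
available on 32-bit):

/* arch/mips/include/asm/pgtable.h */
#ifdef _PAGE_HUGE
#define pmd_leaf(pmd)	((pmd_val(pmd) & _PAGE_HUGE) != 0)
#define pud_leaf(pud)	((pud_val(pud) & _PAGE_HUGE) != 0)
#endif

with pud_leaf() only defined where the pud level actually exists.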
Steve
> Thanks,
> Paul
>
>> For 64 bit if _PAGE_HUGE is defined we can simply look for it. When not
>> defined we can be confident that there are no leaf pages in existence
>> and fall back on the generic implementation (added in a later patch)
>> which returns 0.
>>
>> CC: Ralf Baechle <[email protected]>
>> CC: Paul Burton <[email protected]>
>> CC: James Hogan <[email protected]>
>> CC: [email protected]
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/mips/include/asm/pgtable-64.h | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
>> index 93a9dce31f25..2bdbf8652b5f 100644
>> --- a/arch/mips/include/asm/pgtable-64.h
>> +++ b/arch/mips/include/asm/pgtable-64.h
>> @@ -273,6 +273,10 @@ static inline int pmd_present(pmd_t pmd)
>> return pmd_val(pmd) != (unsigned long) invalid_pte_table;
>> }
>>
>> +#ifdef _PAGE_HUGE
>> +#define pmd_leaf(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0)
>> +#endif
>> +
>> static inline void pmd_clear(pmd_t *pmdp)
>> {
>> pmd_val(*pmdp) = ((unsigned long) invalid_pte_table);
>> @@ -297,6 +301,10 @@ static inline int pud_present(pud_t pud)
>> return pud_val(pud) != (unsigned long) invalid_pmd_table;
>> }
>>
>> +#ifdef _PAGE_HUGE
>> +#define pud_leaf(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
>> +#endif
>> +
>> static inline void pud_clear(pud_t *pudp)
>> {
>> pud_val(*pudp) = ((unsigned long) invalid_pmd_table);
>> --
>> 2.20.1
>>
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
On 23/07/2019 11:14, Mark Rutland wrote:
> On Mon, Jul 22, 2019 at 04:42:00PM +0100, Steven Price wrote:
>> pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
>> ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
>> no users. We're about to add users so reintroduce them, along with
>> p4d_entry() as we now have 5 levels of tables.
>>
>> Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
>> PUD-sized transparent hugepages") already re-added pud_entry() but with
>> different semantics to the other callbacks. Since there have never
>> been upstream users of this, revert the semantics back to match the
>> other callbacks. This means pud_entry() is called for all entries, not
>> just transparent huge pages.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> include/linux/mm.h | 15 +++++++++------
>> mm/pagewalk.c | 27 ++++++++++++++++-----------
>> 2 files changed, 25 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 0334ca97c584..b22799129128 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1432,15 +1432,14 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>>
>> /**
>> * mm_walk - callbacks for walk_page_range
>> - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
>> - * this handler should only handle pud_trans_huge() puds.
>> - * the pmd_entry or pte_entry callbacks will be used for
>> - * regular PUDs.
>> - * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
>> + * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
>> + * @p4d_entry: if set, called for each non-empty P4D entry
>> + * @pud_entry: if set, called for each non-empty PUD entry
>> + * @pmd_entry: if set, called for each non-empty PMD entry
>
> How are these expected to work with folding?
>
> For example, on arm64 with 64K pages and 42-bit VA, you can have 2-level
> tables where the PGD is P4D, PUD, and PMD. IIUC we'd invoke the
> callbacks for each of those levels where we found an entry in the pgd.
>
> Either the callee handle that, or we should inhibit the callbacks when
> levels are folded, and I think that needs to be explcitly stated either
> way.
>
> IIRC on x86 the p4d folding is dynamic depending on whether the HW
> supports 5-level page tables. Maybe that implies the callee has to
> handle that.
Yes, my assumption is that it has to be up to the callee to handle that
because folding can be dynamic. I believe this is also how these
callbacks worked before they were removed. However, I'll add a comment
explaining that here as it's probably non-obvious.
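Roughly along these lines (exact wording still to be polished):

 * Callbacks may also be invoked for folded levels: for example with
 * 2-level page tables a single entry serves as PGD, P4D, PUD and PMD,
 * and on some architectures (e.g. x86) the folding is dynamic at run
 * time. It is up to the callback to recognise and skip folded levels
 * if the distinction matters to it.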
Steve
On 23/07/2019 10:41, Mark Rutland wrote:
> On Mon, Jul 22, 2019 at 04:41:59PM +0100, Steven Price wrote:
>> Exposing the pud/pgd levels of the page tables to walk_page_range() means
>> we may come across the exotic large mappings that come with large areas
>> of contiguous memory (such as the kernel's linear map).
>>
>> For architectures that don't provide all p?d_leaf() macros, provide
>> generic do nothing default that are suitable where there cannot be leaf
>> pages that that level.
>>
>> Signed-off-by: Steven Price <[email protected]>
>
> Not a big deal, but it would probably make sense for this to be patch 1
> in the series, given it defines the semantic of p?d_leaf(), and they're
> not used until we provide all the architectural implemetnations anyway.
Sure, I'll move it. When it was named p?d_large() this had to come after
some architectures that implement p?d_large() as static inline. But
p?d_leaf() doesn't have that issue.
> It might also be worth pointing out the reasons for this naming, e.g.
> p?d_large() aren't currently generic, and this name minimizes potential
> confusion between p?d_{large,huge}().
Ok, how about:
The name p?d_leaf() is chosen to minimize the confusion with existing
uses of "large" pages and "huge" pages which do not necessarily mean
that the entry is a leaf (for example it may be a set of contiguous
entries that only take 1 TLB slot). For the purpose of walking the page
tables we don't need to know how it will be represented in the TLB, but
we do need to know for sure if it is a leaf of the tree.
>> ---
>> include/asm-generic/pgtable.h | 19 +++++++++++++++++++
>> 1 file changed, 19 insertions(+)
>>
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> index 75d9d68a6de7..46275896ca66 100644
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -1188,4 +1188,23 @@ static inline bool arch_has_pfn_modify_check(void)
>> #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED)
>> #endif
>>
>> +/*
>> + * p?d_leaf() - true if this entry is a final mapping to a physical address.
>> + * This differs from p?d_huge() by the fact that they are always available (if
>> + * the architecture supports large pages at the appropriate level) even
>> + * if CONFIG_HUGETLB_PAGE is not defined.
>> + */
>
> I assume it's only safe to call these on valid entries? I think it would
> be worth calling that out explicitly.
Yes, only meaningful on valid entries - I'll add that as a comment.
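i.e. extending the comment to something like:

/*
 * p?d_leaf() - true if this entry is a final mapping to a physical address.
 * This differs from p?d_huge() by the fact that they are always available (if
 * the architecture supports large pages at the appropriate level) even
 * if CONFIG_HUGETLB_PAGE is not defined.
 *
 * Only meaningful when called on a valid entry.
 */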
> Otherwise, this looks sound to me:
>
> Acked-by: Mark Rutland <[email protected]>
Thanks for the review
Steve
On Wed, 24 Jul 2019, Steven Price wrote:
> On 23/07/2019 11:16, Mark Rutland wrote:
> > Are there any visible changes to the arm64 output?
>
> arm64 output shouldn't change. I've confirmed that "efi_page_tables" is
> identical on a Juno before/after the change. "kernel_page_tables"
> obviously will vary depending on the exact layout of memory, but the
> format isn't changed.
>
> x86 output does change due to patch 14. In this case the change is
> removing the lines from the output of the form...
>
> > 0xffffffff84800000-0xffffffffa0000000 440M pmd
>
> ...which are unpopulated areas of the memory map. Populated lines which
> have attributes are unchanged.
Having the hole size and the level in the dump is a very convenient thing.
Right now we have:
0xffffffffc0427000-0xffffffffc042b000 16K ro NX pte
0xffffffffc042b000-0xffffffffc042e000 12K RW NX pte
0xffffffffc042e000-0xffffffffc042f000 4K pte
0xffffffffc042f000-0xffffffffc0430000 4K ro x pte
0xffffffffc0430000-0xffffffffc0431000 4K ro NX pte
0xffffffffc0431000-0xffffffffc0433000 8K RW NX pte
0xffffffffc0433000-0xffffffffc0434000 4K pte
0xffffffffc0434000-0xffffffffc0436000 8K ro x pte
0xffffffffc0436000-0xffffffffc0438000 8K ro NX pte
0xffffffffc0438000-0xffffffffc043a000 8K RW NX pte
0xffffffffc043a000-0xffffffffc043f000 20K pte
0xffffffffc043f000-0xffffffffc0444000 20K ro x pte
0xffffffffc0444000-0xffffffffc0447000 12K ro NX pte
0xffffffffc0447000-0xffffffffc0449000 8K RW NX pte
0xffffffffc0449000-0xffffffffc044f000 24K pte
0xffffffffc044f000-0xffffffffc0450000 4K ro x pte
0xffffffffc0450000-0xffffffffc0451000 4K ro NX pte
0xffffffffc0451000-0xffffffffc0453000 8K RW NX pte
0xffffffffc0453000-0xffffffffc0458000 20K pte
0xffffffffc0458000-0xffffffffc0459000 4K ro x pte
0xffffffffc0459000-0xffffffffc045b000 8K ro NX pte
with your change this becomes:
0xffffffffc0427000-0xffffffffc042b000 16K ro NX pte
0xffffffffc042b000-0xffffffffc042e000 12K RW NX pte
0xffffffffc042f000-0xffffffffc0430000 4K ro x pte
0xffffffffc0430000-0xffffffffc0431000 4K ro NX pte
0xffffffffc0431000-0xffffffffc0433000 8K RW NX pte
0xffffffffc0434000-0xffffffffc0436000 8K ro x pte
0xffffffffc0436000-0xffffffffc0438000 8K ro NX pte
0xffffffffc0438000-0xffffffffc043a000 8K RW NX pte
0xffffffffc043f000-0xffffffffc0444000 20K ro x pte
0xffffffffc0444000-0xffffffffc0447000 12K ro NX pte
0xffffffffc0447000-0xffffffffc0449000 8K RW NX pte
0xffffffffc044f000-0xffffffffc0450000 4K ro x pte
0xffffffffc0450000-0xffffffffc0451000 4K ro NX pte
0xffffffffc0451000-0xffffffffc0453000 8K RW NX pte
0xffffffffc0458000-0xffffffffc0459000 4K ro x pte
0xffffffffc0459000-0xffffffffc045b000 8K ro NX pte
which is 5 lines less, but a pain to figure out the size of the holes. And
it becomes even more painful when the holes go across different mapping
levels.
From your 14/N changelog:
> This keeps the output shorter and will help with a future change
I don't care about shorter at all. It's debug information.
> switching to using the generic page walk code as we no longer care about
> the 'level' that the page table holes are at.
I really do not understand why you think that WE no longer care about the
level (and the size) of the holes. I assume that WE is pluralis majestatis
and not meant to reflect the opinion of you and everyone else.
I have no idea whether you ever had to do serious work with PT dump, but I
surely have on various occasions, including the PTI mess, and I definitely
found the size and level information for holes very useful.
Thanks,
tglx
On Wed, Jul 24, 2019 at 03:57:33PM +0200, Thomas Gleixner wrote:
> On Wed, 24 Jul 2019, Steven Price wrote:
> > On 23/07/2019 11:16, Mark Rutland wrote:
> > > Are there any visible changes to the arm64 output?
> >
> > arm64 output shouldn't change. I've confirmed that "efi_page_tables" is
> > identical on a Juno before/after the change. "kernel_page_tables"
> > obviously will vary depending on the exact layout of memory, but the
> > format isn't changed.
> >
> > x86 output does change due to patch 14. In this case the change is
> > removing the lines from the output of the form...
> >
> > > 0xffffffff84800000-0xffffffffa0000000 440M pmd
> >
> > ...which are unpopulated areas of the memory map. Populated lines which
> > have attributes are unchanged.
>
> Having the hole size and the level in the dump is a very conveniant thing.
Mhmm; I thought that we logged which level was empty on arm64 (but
apparently not), since knowing the structure can be important.
> Right now we have:
>
> 0xffffffffc0427000-0xffffffffc042b000 16K ro NX pte
> 0xffffffffc042b000-0xffffffffc042e000 12K RW NX pte
> 0xffffffffc042e000-0xffffffffc042f000 4K pte
> 0xffffffffc042f000-0xffffffffc0430000 4K ro x pte
> 0xffffffffc0430000-0xffffffffc0431000 4K ro NX pte
> 0xffffffffc0431000-0xffffffffc0433000 8K RW NX pte
> 0xffffffffc0433000-0xffffffffc0434000 4K pte
> 0xffffffffc0434000-0xffffffffc0436000 8K ro x pte
> 0xffffffffc0436000-0xffffffffc0438000 8K ro NX pte
> 0xffffffffc0438000-0xffffffffc043a000 8K RW NX pte
> 0xffffffffc043a000-0xffffffffc043f000 20K pte
> 0xffffffffc043f000-0xffffffffc0444000 20K ro x pte
> 0xffffffffc0444000-0xffffffffc0447000 12K ro NX pte
> 0xffffffffc0447000-0xffffffffc0449000 8K RW NX pte
> 0xffffffffc0449000-0xffffffffc044f000 24K pte
> 0xffffffffc044f000-0xffffffffc0450000 4K ro x pte
> 0xffffffffc0450000-0xffffffffc0451000 4K ro NX pte
> 0xffffffffc0451000-0xffffffffc0453000 8K RW NX pte
> 0xffffffffc0453000-0xffffffffc0458000 20K pte
> 0xffffffffc0458000-0xffffffffc0459000 4K ro x pte
> 0xffffffffc0459000-0xffffffffc045b000 8K ro NX pte
>
> with your change this becomes:
>
> 0xffffffffc0427000-0xffffffffc042b000 16K ro NX pte
> 0xffffffffc042b000-0xffffffffc042e000 12K RW NX pte
> 0xffffffffc042f000-0xffffffffc0430000 4K ro x pte
> 0xffffffffc0430000-0xffffffffc0431000 4K ro NX pte
> 0xffffffffc0431000-0xffffffffc0433000 8K RW NX pte
> 0xffffffffc0434000-0xffffffffc0436000 8K ro x pte
> 0xffffffffc0436000-0xffffffffc0438000 8K ro NX pte
> 0xffffffffc0438000-0xffffffffc043a000 8K RW NX pte
> 0xffffffffc043f000-0xffffffffc0444000 20K ro x pte
> 0xffffffffc0444000-0xffffffffc0447000 12K ro NX pte
> 0xffffffffc0447000-0xffffffffc0449000 8K RW NX pte
> 0xffffffffc044f000-0xffffffffc0450000 4K ro x pte
> 0xffffffffc0450000-0xffffffffc0451000 4K ro NX pte
> 0xffffffffc0451000-0xffffffffc0453000 8K RW NX pte
> 0xffffffffc0458000-0xffffffffc0459000 4K ro x pte
> 0xffffffffc0459000-0xffffffffc045b000 8K ro NX pte
>
> which is 5 lines less, but a pain to figure out the size of the holes. And
> it becomes even more painful when the holes go across different mapping
> levels.
I agree.
Steven, could you align arm64 with the x86 behaviour here?
Thanks,
Mark.
On Wed, Jul 24, 2019 at 02:53:04PM +0100, Steven Price wrote:
> On 23/07/2019 11:14, Mark Rutland wrote:
> > On Mon, Jul 22, 2019 at 04:42:00PM +0100, Steven Price wrote:
> >> pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
> >> ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
> >> no users. We're about to add users so reintroduce them, along with
> >> p4d_entry() as we now have 5 levels of tables.
> >>
> >> Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
> >> PUD-sized transparent hugepages") already re-added pud_entry() but with
> >> different semantics to the other callbacks. Since there have never
> >> been upstream users of this, revert the semantics back to match the
> >> other callbacks. This means pud_entry() is called for all entries, not
> >> just transparent huge pages.
> >>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> include/linux/mm.h | 15 +++++++++------
> >> mm/pagewalk.c | 27 ++++++++++++++++-----------
> >> 2 files changed, 25 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 0334ca97c584..b22799129128 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1432,15 +1432,14 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> >>
> >> /**
> >> * mm_walk - callbacks for walk_page_range
> >> - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
> >> - * this handler should only handle pud_trans_huge() puds.
> >> - * the pmd_entry or pte_entry callbacks will be used for
> >> - * regular PUDs.
> >> - * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
> >> + * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
> >> + * @p4d_entry: if set, called for each non-empty P4D entry
> >> + * @pud_entry: if set, called for each non-empty PUD entry
> >> + * @pmd_entry: if set, called for each non-empty PMD entry
> >
> > How are these expected to work with folding?
> >
> > For example, on arm64 with 64K pages and 42-bit VA, you can have 2-level
> > tables where the PGD is P4D, PUD, and PMD. IIUC we'd invoke the
> > callbacks for each of those levels where we found an entry in the pgd.
> >
> > Either the callee handle that, or we should inhibit the callbacks when
> > levels are folded, and I think that needs to be explcitly stated either
> > way.
> >
> > IIRC on x86 the p4d folding is dynamic depending on whether the HW
> > supports 5-level page tables. Maybe that implies the callee has to
> > handle that.
>
> Yes, my assumption is that it has to be up to the callee to handle that
> because folding can be dynamic. I believe this also was how these
> callbacks work before they were removed. However I'll add a comment
> explaining that here as it's probably non-obvious.
That sounds good to me.
Thanks,
Mark.
On 24/07/2019 14:57, Thomas Gleixner wrote:
> On Wed, 24 Jul 2019, Steven Price wrote:
>> On 23/07/2019 11:16, Mark Rutland wrote:
>>> Are there any visible changes to the arm64 output?
>>
>> arm64 output shouldn't change. I've confirmed that "efi_page_tables" is
>> identical on a Juno before/after the change. "kernel_page_tables"
>> obviously will vary depending on the exact layout of memory, but the
>> format isn't changed.
>>
>> x86 output does change due to patch 14. In this case the change is
>> removing the lines from the output of the form...
>>
>>> 0xffffffff84800000-0xffffffffa0000000 440M pmd
>>
>> ...which are unpopulated areas of the memory map. Populated lines which
>> have attributes are unchanged.
>
> Having the hole size and the level in the dump is a very conveniant thing.
>
> Right now we have:
>
> 0xffffffffc0427000-0xffffffffc042b000 16K ro NX pte
> 0xffffffffc042b000-0xffffffffc042e000 12K RW NX pte
> 0xffffffffc042e000-0xffffffffc042f000 4K pte
> 0xffffffffc042f000-0xffffffffc0430000 4K ro x pte
> 0xffffffffc0430000-0xffffffffc0431000 4K ro NX pte
> 0xffffffffc0431000-0xffffffffc0433000 8K RW NX pte
> 0xffffffffc0433000-0xffffffffc0434000 4K pte
> 0xffffffffc0434000-0xffffffffc0436000 8K ro x pte
> 0xffffffffc0436000-0xffffffffc0438000 8K ro NX pte
> 0xffffffffc0438000-0xffffffffc043a000 8K RW NX pte
> 0xffffffffc043a000-0xffffffffc043f000 20K pte
> 0xffffffffc043f000-0xffffffffc0444000 20K ro x pte
> 0xffffffffc0444000-0xffffffffc0447000 12K ro NX pte
> 0xffffffffc0447000-0xffffffffc0449000 8K RW NX pte
> 0xffffffffc0449000-0xffffffffc044f000 24K pte
> 0xffffffffc044f000-0xffffffffc0450000 4K ro x pte
> 0xffffffffc0450000-0xffffffffc0451000 4K ro NX pte
> 0xffffffffc0451000-0xffffffffc0453000 8K RW NX pte
> 0xffffffffc0453000-0xffffffffc0458000 20K pte
> 0xffffffffc0458000-0xffffffffc0459000 4K ro x pte
> 0xffffffffc0459000-0xffffffffc045b000 8K ro NX pte
>
> with your change this becomes:
>
> 0xffffffffc0427000-0xffffffffc042b000 16K ro NX pte
> 0xffffffffc042b000-0xffffffffc042e000 12K RW NX pte
> 0xffffffffc042f000-0xffffffffc0430000 4K ro x pte
> 0xffffffffc0430000-0xffffffffc0431000 4K ro NX pte
> 0xffffffffc0431000-0xffffffffc0433000 8K RW NX pte
> 0xffffffffc0434000-0xffffffffc0436000 8K ro x pte
> 0xffffffffc0436000-0xffffffffc0438000 8K ro NX pte
> 0xffffffffc0438000-0xffffffffc043a000 8K RW NX pte
> 0xffffffffc043f000-0xffffffffc0444000 20K ro x pte
> 0xffffffffc0444000-0xffffffffc0447000 12K ro NX pte
> 0xffffffffc0447000-0xffffffffc0449000 8K RW NX pte
> 0xffffffffc044f000-0xffffffffc0450000 4K ro x pte
> 0xffffffffc0450000-0xffffffffc0451000 4K ro NX pte
> 0xffffffffc0451000-0xffffffffc0453000 8K RW NX pte
> 0xffffffffc0458000-0xffffffffc0459000 4K ro x pte
> 0xffffffffc0459000-0xffffffffc045b000 8K ro NX pte
>
> which is 5 lines less, but a pain to figure out the size of the holes. And
> it becomes even more painful when the holes go across different mapping
> levels.
>
> From your 14/N changelog:
>
>> This keeps the output shorter and will help with a future change
>
> I don't care about shorter at all. It's debug information.
Sorry, the "shorter" part was because Dave Hansen originally said[1]:
> I think I'd actually be OK with the holes just not showing up. I
> actually find it kinda hard to read sometimes with the holes in there.
> I'd be curious what others think though.
[1]
https://lore.kernel.org/lkml/[email protected]/
And I'd abbreviated "holes not showing up" as "shorter" in the commit
message - not the best wording, I agree.
>> switching to using the generic page walk code as we no longer care about
>> the 'level' that the page table holes are at.
>
> I really do not understand why you think that WE no longer care about the
> level (and the size) of the holes. I assume that WE is pluralis majestatis
> and not meant to reflect the opinion of you and everyone else.
Again, I apologise - that was sloppy wording in the commit message. By
"we" I meant the code, not any particular person. In my original patch[2]
the only use of the 'depth' argument to pte_hole was to report the level
for these debug lines. Removing those lines simplified the code and at
the time nobody raised any objections.
[2]
https://lore.kernel.org/lkml/[email protected]/
> I have no idea whether you ever had to do serious work with PT dump, but I
> surely have at various occasions including the PTI mess and I definitely
> found the size and the level information from holes very useful.
On arm64 we don't have those lines, but equally it's possible they might
be useful in the future. So this might be something to add.
As I said in a previous email[3] I was dropping the lines from the
output assuming nobody had any objections. Since you find these lines
useful, I'll see about reworking the change to retain the lines.
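One way to do that would be to give the pte_hole callback a depth
argument again and let note_page() treat it as 'hole at this level',
roughly (sketch only):

static int ptdump_hole(unsigned long addr, unsigned long next,
		       int depth, struct mm_walk *walk)
{
	struct ptdump_state *st = walk->private;

	/* depth tells note_page() which level the hole was found at */
	st->note_page(st, addr, depth, 0);

	return 0;
}

so the x86 note_page() can keep printing the size and level of
unpopulated ranges.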
Steve
[3]
https://lore.kernel.org/lkml/[email protected]/
On 23/07/2019 11:16, Mark Rutland wrote:
> On Mon, Jul 22, 2019 at 04:41:49PM +0100, Steven Price wrote:
>> This is a slight reworking and extension of my previous patch set
>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>> version numbering as most of the changes are the same. In particular
>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>
>> Many architectures current have a debugfs file for dumping the kernel
>> page tables. Currently each architecture has to implement custom
>> functions for this because the details of walking the page tables used
>> by the kernel are different between architectures.
>>
>> This series extends the capabilities of walk_page_range() so that it can
>> deal with the page tables of the kernel (which have no VMAs and can
>> contain larger huge pages than exist for user space). A generic PTDUMP
>> implementation is the implemented making use of the new functionality of
>> walk_page_range() and finally arm64 and x86 are switch to using it,
>> removing the custom table walkers.
>>
>> To enable a generic page table walker to walk the unusual mappings of
>> the kernel we need to implement a set of functions which let us know
>> when the walker has reached the leaf entry. After a suggestion from Will
>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>> the purpose (and is a new name so has no historic baggage). Some
>> architectures have p?d_large macros but this is easily confused with
>> "large pages".
>>
>> Mostly this is a clean up and there should be very little functional
>> change. The exceptions are:
>>
>> * x86 PTDUMP debugfs output no longer display pages which aren't
>> present (patch 14).
>>
>> * arm64 has the ability to efficiently process KASAN pages (which
>> previously only x86 implemented). This means that the combination of
>> KASAN and DEBUG_WX is now useable.
>
> Are there any visible changes to the arm64 output?
arm64 output shouldn't change. I've confirmed that "efi_page_tables" is
identical on a Juno before/after the change. "kernel_page_tables"
obviously will vary depending on the exact layout of memory, but the
format isn't changed.
x86 output does change due to patch 14. In this case the change is
removing the lines from the output of the form...
> 0xffffffff84800000-0xffffffffa0000000 440M pmd
...which are unpopulated areas of the memory map. Populated lines which
have attributes are unchanged.
Steve
On 23/07/2019 07:39, Anshuman Khandual wrote:
> Hello Steven,
>
> On 07/22/2019 09:11 PM, Steven Price wrote:
>> This is a slight reworking and extension of my previous patch set
>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>> version numbering as most of the changes are the same. In particular
>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>
>> Many architectures current have a debugfs file for dumping the kernel
>> page tables. Currently each architecture has to implement custom
>> functions for this because the details of walking the page tables used
>> by the kernel are different between architectures.
>>
>> This series extends the capabilities of walk_page_range() so that it can
>> deal with the page tables of the kernel (which have no VMAs and can
>> contain larger huge pages than exist for user space). A generic PTDUMP
>> implementation is the implemented making use of the new functionality of
>> walk_page_range() and finally arm64 and x86 are switch to using it,
>> removing the custom table walkers.
>
> Could other architectures just enable this new generic PTDUMP feature if
> required without much problem ?
The generic PTDUMP is implemented as a library, so the architectures
would have to provide the call into ptdump_walk_pgd() and provide the
necessary note_page() callback which formats the lines in the output.
Hopefully the implementation is flexible enough to work for most
architectures.
arm, powerpc and s390 are the obvious architectures to convert next as
they already have note_page() functions which shouldn't be too difficult
to convert to match the callback prototype.
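In the simplest case the per-architecture glue boils down to something
like the sketch below (the arch_* names and the address range are made
up; a real implementation, like the arm64 patch, wraps ptdump_state in
an arch-specific pg_state so note_page() can reach the seq_file and the
address markers):

static void arch_note_page(struct ptdump_state *st, unsigned long addr,
			   int level, unsigned long val)
{
	/* decode 'val' into attribute strings and emit one line per range */
}

static const struct ptdump_range arch_ranges[] = {
	{PAGE_OFFSET, ~0UL},	/* placeholder for the arch's kernel VA range */
	{0, 0}			/* terminator */
};

void arch_dump_kernel_page_tables(void)
{
	struct ptdump_state st = {
		.note_page	= arch_note_page,
		.range		= arch_ranges,
	};

	ptdump_walk_pgd(&st, &init_mm);
}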
>>
>> To enable a generic page table walker to walk the unusual mappings of
>> the kernel we need to implement a set of functions which let us know
>> when the walker has reached the leaf entry. After a suggestion from Will
>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>> the purpose (and is a new name so has no historic baggage). Some
>> architectures have p?d_large macros but this is easily confused with
>> "large pages".
>
> I have not been following the previous version of the series closely, hence
> might be missing something here. But p?d_large() which identifies large
> mappings on a given level can only signify a leaf entry. Large pages on the
> table exist only as leaf entries. So what is the problem for it being used
> directly instead. Is there any possibility in the kernel mapping when these
> large pages are not leaf entries ?
There isn't any problem as such with using p?d_large macros. However the
name "large" has caused confusion in the past. In particular there are
two types of "large" page:
1. leaf entries at higher levels than normal ('sections' on Arm; for 4K
pages this gives you 2MB and 1GB pages).
2. sets of contiguous entries that can share a TLB entry (the
'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
KB 'pages').
In many cases both give the same effect (they reduce pressure on TLBs and
require contiguous and aligned physical addresses). But for this case
we only care about the 'leaf' case (because the contiguous bit makes no
difference to walking the page tables).
As far as I'm aware p?d_large() currently implements the first and
p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
Will[1] suggested the name p?d_leaf(), and this also avoids stepping on
the existing usage of p?d_large() which isn't always available on every
architecture.
[1]
https://lore.kernel.org/linux-mm/20190701101510.qup3nd6vm6cbdgjv@willie-the-truck/
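To make the distinction concrete, here is a minimal arm64-flavoured sketch
based on the existing pmd_sect()/pud_sect() helpers (illustrative only - the
actual definitions are in the per-arch patches of the series):

/* A PMD/PUD is a "leaf" when it is a section (block) mapping, i.e. case 1
 * above. The contiguous bit (case 2) is deliberately ignored: PTE_CONT
 * entries are still individual leaf entries as far as walking the page
 * tables is concerned. */
#define pmd_leaf(pmd)	pmd_sect(pmd)
#define pud_leaf(pud)	pud_sect(pud)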
>>
>> Mostly this is a clean up and there should be very little functional
>> change. The exceptions are:
>>
>> * x86 PTDUMP debugfs output no longer display pages which aren't
>> present (patch 14).
>
> Hmm, kernel mappings pages which are not present! which ones are those ?
> Just curious.
x86 currently outputs a line for every range, including those which are
unpopulated. Patch 14 removes those lines from the output, only showing
the populated ranges. This was discussed[2] previously.
[2]
https://lore.kernel.org/lkml/[email protected]/
>>
>> * arm64 has the ability to efficiently process KASAN pages (which
>> previously only x86 implemented). This means that the combination of
>> KASAN and DEBUG_WX is now useable.
>>
>> Also available as a git tree:
>> git://linux-arm.org/linux-sp.git walk_page_range/v9
>>
>> Changes since v8:
>> https://lore.kernel.org/lkml/[email protected]/
>> * Rename from p?d_large() to p?d_leaf()
>
> As mentioned before wondering if this is actually required or it is even a
> good idea to introduce something like this which expands page table helper
> semantics scope further in generic MM.
>
>> * Dropped patches migrating arm64/x86 custom walkers to
>> walk_page_range() in favour of adding a generic PTDUMP implementation
>> and migrating arm64/x86 to that instead.
>> * Rebased to v5.3-rc1
>
> Creating a generic PTDUMP implementation is definitely a better idea.
Yes, that was always where I was heading. But I initially thought it
would be easier to get the generic walking code in, followed by
implementing generic PTDUMP. But it turns out the generic PTDUMP is
actually the easy bit :)
Steve
On Wed, 24 Jul 2019, Steven Price wrote:
> On 24/07/2019 14:57, Thomas Gleixner wrote:
> > From your 14/N changelog:
> >
> >> This keeps the output shorter and will help with a future change
> >
> > I don't care about shorter at all. It's debug information.
>
> Sorry, the "shorter" part was because Dave Hansen originally said[1]:
> > I think I'd actually be OK with the holes just not showing up. I
> > actually find it kinda hard to read sometimes with the holes in there.
> > I'd be curious what others think though.
I missed that otherwise I'd have disagreed right away.
> > I really do not understand why you think that WE no longer care about the
> > level (and the size) of the holes. I assume that WE is pluralis majestatis
> > and not meant to reflect the opinion of you and everyone else.
>
> Again, I apologise - that was sloppy wording in the commit message. By
> "we" I meant the code not any particular person.
Nothing to apologize. Common mistake of trying to impersonate code. That
always reads wrong :)
> > I have no idea whether you ever had to do serious work with PT dump, but I
> > surely have at various occasions including the PTI mess and I definitely
> > found the size and the level information from holes very useful.
>
> On arm64 we don't have those lines, but equally it's possible they might
> be useful in the future. So this might be something to add.
>
> As I said in a previous email[3] I was dropping the lines from the
> output assuming nobody had any objections. Since you find these lines
> useful, I'll see about reworking the change to retain the lines.
That would be great and as I saw in the other mail, Mark wants to have them
as well :)
That reminds me, that I had a patch when dealing with L1TF which printed
the PFNs so I could verify that the mitigations do what they are supposed
to do, but that patch got obviously lost somewhere down the road.
Thanks,
tglx
On 23/07/2019 10:57, Mark Rutland wrote:
> On Mon, Jul 22, 2019 at 04:42:08PM +0100, Steven Price wrote:
>> Add a generic version of page table dumping that architectures can
>> opt-in to
>>
>> Signed-off-by: Steven Price <[email protected]>
>
> [...]
>
>> +#ifdef CONFIG_KASAN
>> +/*
>> + * This is an optimization for KASAN=y case. Since all kasan page tables
>> + * eventually point to the kasan_early_shadow_page we could call note_page()
>> + * right away without walking through lower level page tables. This saves
>> + * us dozens of seconds (minutes for 5-level config) while checking for
>> + * W+X mapping or reading kernel_page_tables debugfs file.
>> + */
>> +static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
>> + unsigned long addr)
>> +{
>> + if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
>> +#ifdef CONFIG_X86
>> + (pgtable_l5_enabled() &&
>> + __pa(pt) == __pa(kasan_early_shadow_p4d)) ||
>> +#endif
>> + __pa(pt) == __pa(kasan_early_shadow_pud)) {
>> + st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
>> + return true;
>> + }
>> + return false;
>
> Having you tried this with CONFIG_DEBUG_VIRTUAL?
>
> The kasan_early_shadow_pmd is a kernel object rather than a linear map
> object, so you should use __pa_symbol for that.
Thanks for pointing that out - it is indeed broken on arm64. This was
moved from x86 where CONFIG_DEBUG_VIRTUAL doesn't seem to pick this up.
There is actually a further problem: 'pt' itself might not be in the linear
map (so __pa(pt) barfs on arm64, just as __pa(kasan_early_shadow_p?d) does).
It looks like having the comparisons of the form "pt ==
lm_alias(kasan_early_shadow_p?d)" is probably best.
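A sketch of that comparison, following the suggestion above (the x86
5-level/p4d case is omitted here for brevity):

/* Sketch only: compare the table pointer against the linear-map alias of
 * the KASAN shadow tables instead of using __pa() on a pointer that may
 * not be in the linear map. */
static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
				    unsigned long addr)
{
	if (pt == lm_alias(kasan_early_shadow_pmd) ||
	    pt == lm_alias(kasan_early_shadow_pud)) {
		st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
		return true;
	}
	return false;
}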
> It's a bit horrid to have to test multiple levels in one function; can't
> we check the relevant level inline in each of the test_p?d funcs?
>
> They're optional anyway, so they only need to be defined for
> CONFIG_KASAN.
Good point - removing the test_p?d callbacks when !CONFIG_KASAN
simplifies the code.
Thanks,
Steve
On 07/24/2019 07:05 PM, Steven Price wrote:
> On 23/07/2019 07:39, Anshuman Khandual wrote:
>> Hello Steven,
>>
>> On 07/22/2019 09:11 PM, Steven Price wrote:
>>> This is a slight reworking and extension of my previous patch set
>>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>>> version numbering as most of the changes are the same. In particular
>>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>>
>>> Many architectures current have a debugfs file for dumping the kernel
>>> page tables. Currently each architecture has to implement custom
>>> functions for this because the details of walking the page tables used
>>> by the kernel are different between architectures.
>>>
>>> This series extends the capabilities of walk_page_range() so that it can
>>> deal with the page tables of the kernel (which have no VMAs and can
>>> contain larger huge pages than exist for user space). A generic PTDUMP
>>> implementation is the implemented making use of the new functionality of
>>> walk_page_range() and finally arm64 and x86 are switch to using it,
>>> removing the custom table walkers.
>>
>> Could other architectures just enable this new generic PTDUMP feature if
>> required without much problem ?
>
> The generic PTDUMP is implemented as a library - so the architectures
> would have to provide the call into ptdump_walk_pgd() and provide the
> necessary callback note_page() which formats the lines in the output.
Though I understand that the leaf entry details (at any given level) are very
much arch specific, would there be any possibility for the note_page() callback
to be unified as well? This is extracted from the current PTDUMP output on arm64.
0xffffffc000000000-0xffffffc000080000 512K PTE RW NX SHD AF UXN MEM/NORMAL
The first three columns are generic
1. Kernel virtual range span
2. Kernel virtual range size
3. Kernel virtual range mapping level
Whereas the rest of the output is architecture specific page table entry flags.
Just wondering if we could print the first three columns in ptdump_walk_pgd()
itself, before calling an arch specific callback to fetch a print buffer for the
rest of the line (bounded with some character limit so the line does not overflow).
It's not something which must be done but I guess it's worth a try.
This would help consolidate ptdump_walk_pgd() further.
>
> Hopefully the implementation is generic enough that it should be
> flexible enough to work for most architectures.
>
> arm, powerpc and s390 are the obvious architectures to convert next as
> they already have note_page() functions which shouldn't be too difficult
> to convert to match the callback prototype.
Which can be done independently later on, fair enough.
>
>>>
>>> To enable a generic page table walker to walk the unusual mappings of
>>> the kernel we need to implement a set of functions which let us know
>>> when the walker has reached the leaf entry. After a suggestion from Will
>>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>>> the purpose (and is a new name so has no historic baggage). Some
>>> architectures have p?d_large macros but this is easily confused with
>>> "large pages".
>>
>> I have not been following the previous version of the series closely, hence
>> might be missing something here. But p?d_large() which identifies large
>> mappings on a given level can only signify a leaf entry. Large pages on the
>> table exist only as leaf entries. So what is the problem for it being used
>> directly instead. Is there any possibility in the kernel mapping when these
>> large pages are not leaf entries ?
>
> There isn't any problem as such with using p?d_large macros. However the
> name "large" has caused confusion in the past. In particular there are
> two types of "large" page:
>
> 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
> pages this gives you 2MB and 1GB pages).
>
> 2. sets of contiguous entries that can share a TLB entry (the
> 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
> KB 'pages').
This is arm64 specific and AFAIK there are no other architectures where there
will be any confusion wrt p?d_large() not meaning a single entry.
As you have noted before, if we are printing individual entries with PTE_CONT
then they need not be identified as p?d_large(). In which case p?d_large()
can just safely point to p?d_sect(), identifying regular huge leaf entries.
>
> In many cases both give the same effect (reduce pressure on TLBs and
> requires contiguous and aligned physical addresses). But for this case
> we only care about the 'leaf' case (because the contiguous bit makes no
> difference to walking the page tables).
Right and we can just safely identify section entries with it. What will be
the problem with that ? Again this is only arm64 specific.
>
> As far as I'm aware p?d_large() currently implements the first and
> p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
AFAIK option 2 exists only on the arm64 platform. IIUC generic MM requires two
different huge page definitions from the platform. HugeTLB identifies large
entries at PGD|PUD|PMD after converting their contents into a PTE first, so
there is no need for direct large page definitions at other levels.
1. THP - pmd_trans_huge()
2. HugeTLB - pte_huge() CONFIG_ARCH_WANT_GENERAL_HUGETLB is set
A simple check for p?d_large() in mm/ and include/linux shows that there is
no existing usage of these in generic MM. Hence the name is available.
p?d_trans_huge() cannot use contiguous entries, so it is definitely option 1
in the above example.
The problem is that there is no other type of mapped leaf entry apart from a
large mapping at the PGD, PUD or PMD level. Had there been another type of
leaf entry then p?d_leaf() would have made sense, as p?d_large() would not
have been sufficient. Hence I am just wondering if it is really necessary to
add a brand new p?d_leaf() page table helper to generic MM.
IMHO the addition of p?d_leaf() can be avoided; p?d_large() should be cleaned
up (if required) in the platforms and used in the generic functions.
>
> Will[1] suggested the same p?d_leaf() and this also avoids stepping on
> the existing usage of p?d_large() which isn't always available on every
> architecture.
PTDUMP now needs to identify large leaf entries uniformly on each platform.
Hence platforms enabling generic PTDUMP need to provide clean p?d_large()
definitions.
If there are existing definitions and usages of p?d_large() functions on some
platforms, those need to be fixed before they can use generic PTDUMP. IMHO we
should not be adding new page table helpers just to avoid cleaning these up
in platform code.
>
> [1]
> https://lore.kernel.org/linux-mm/20190701101510.qup3nd6vm6cbdgjv@willie-the-truck/
I guess the primary concern was with the existing p?d_large() or p?d_huge()
definitions in various platform code, and p?d_leaf() was a way of working around
them. The problem is that it adds a new helper without a real need for one.
>
>>>
>>> Mostly this is a clean up and there should be very little functional
>>> change. The exceptions are:
>>>
>>> * x86 PTDUMP debugfs output no longer display pages which aren't
>>> present (patch 14).
>>
>> Hmm, kernel mappings pages which are not present! which ones are those ?
>> Just curious.
>
> x86 currently outputs a line for every range, including those which are
> unpopulated. Patch 14 removes those lines from the output, only showing
> the populated ranges. This was discussed[2] previously.
>
> [2]
> https://lore.kernel.org/lkml/[email protected]/
Currently that is a difference between the x86 and arm64 ptdump output. Could
the choice of whether to show the gaps not be left to the note_page() callback,
which could simply return without printing anything for holes? But if a single
output line is split between generic code and the callback, as I proposed
earlier, that will no longer be possible as half of the line would already have
been printed.
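For what it's worth, suppressing holes with the current interface could be as
simple as this sketch, using the level == -1 value that the generic
ptdump_hole() passes (the function name is made up):

static void quiet_note_page(struct ptdump_state *st, unsigned long addr,
			    int level, unsigned long val)
{
	if (level == -1)
		return;		/* a hole: print nothing for it */

	/* ... otherwise handle the populated range as usual ... */
}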
On Thu, Jul 25, 2019 at 02:39:22PM +0530, Anshuman Khandual wrote:
> On 07/24/2019 07:05 PM, Steven Price wrote:
> > There isn't any problem as such with using p?d_large macros. However the
> > name "large" has caused confusion in the past. In particular there are
> > two types of "large" page:
> >
> > 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
> > pages this gives you 2MB and 1GB pages).
> >
> > 2. sets of contiguous entries that can share a TLB entry (the
> > 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
> > KB 'pages').
>
> This is arm64 specific and AFAIK there are no other architectures where there
> will be any confusion wrt p?d_large() not meaning a single entry.
>
> As you have noted before if we are printing individual entries with PTE_CONT
> then they need not be identified as p??d_large(). In which case p?d_large()
> can just safely point to p?d_sect() identifying regular huge leaf entries.
Steven's stuck in the middle of things here, but I do object to p?d_large()
because I find it bonkers to have p?d_large() and p?d_huge() mean completely
different things when they are synonyms in the English language.
Yes, p?d_leaf() matches the terminology used by the Arm architecture, but
given that most page table structures are arranged as a 'tree', then it's
not completely unreasonable, in my opinion. If you have a more descriptive
name, we could use that instead. We could also paint it blue.
> > In many cases both give the same effect (reduce pressure on TLBs and
> > requires contiguous and aligned physical addresses). But for this case
> > we only care about the 'leaf' case (because the contiguous bit makes no
> > difference to walking the page tables).
>
> Right and we can just safely identify section entries with it. What will be
> the problem with that ? Again this is only arm64 specific.
>
> >
> > As far as I'm aware p?d_large() currently implements the first and
> > p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
>
> AFAIK option 2 exists only on arm6 platform. IIUC generic MM requires two
> different huge page dentition from platform. HugeTLB identifies large entries
> at PGD|PUD|PMD after converting it's content into PTE first. So there is no
> need for direct large page definitions for other levels.
>
> 1. THP - pmd_trans_huge()
> 2. HugeTLB - pte_huge() CONFIG_ARCH_WANT_GENERAL_HUGETLB is set
>
> A simple check for p?d_large() on mm/ and include/linux shows that there are
> no existing usage for these in generic MM. Hence it is available.
Alternatively, it means we have a good opportunity to give it a better name
before it spreads into the core code.
> IMHO the new addition of p?d_leaf() can be avoided and p?d_large() should be
> cleaned up (if required) in platforms and used in generic functions.
Again, I disagree and think p?d_large() should be confined to arch code
if it sticks around at all.
I don't usually care much about naming, but in this case I really find
the status quo needlessly confusing.
Will
On 25/07/2019 10:09, Anshuman Khandual wrote:
>
>
> On 07/24/2019 07:05 PM, Steven Price wrote:
>> On 23/07/2019 07:39, Anshuman Khandual wrote:
>>> Hello Steven,
>>>
>>> On 07/22/2019 09:11 PM, Steven Price wrote:
>>>> This is a slight reworking and extension of my previous patch set
>>>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>>>> version numbering as most of the changes are the same. In particular
>>>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>>>
>>>> Many architectures current have a debugfs file for dumping the kernel
>>>> page tables. Currently each architecture has to implement custom
>>>> functions for this because the details of walking the page tables used
>>>> by the kernel are different between architectures.
>>>>
>>>> This series extends the capabilities of walk_page_range() so that it can
>>>> deal with the page tables of the kernel (which have no VMAs and can
>>>> contain larger huge pages than exist for user space). A generic PTDUMP
>>>> implementation is the implemented making use of the new functionality of
>>>> walk_page_range() and finally arm64 and x86 are switch to using it,
>>>> removing the custom table walkers.
>>>
>>> Could other architectures just enable this new generic PTDUMP feature if
>>> required without much problem ?
>>
>> The generic PTDUMP is implemented as a library - so the architectures
>> would have to provide the call into ptdump_walk_pgd() and provide the
>> necessary callback note_page() which formats the lines in the output.
>
> Though I understand that the leaf flag (any given level) details are very much
> arch specific would there be any possibility for note_page() call back to be
> unified as well. This is extracted from current PTDUMP output on arm64.
>
> 0xffffffc000000000-0xffffffc000080000 512K PTE RW NX SHD AF UXN MEM/NORMAL
>
> The first three columns are generic
>
> 1. Kernel virtual range span
> 2. Kernel virtual range size
> 3. Kernel virtual range mapping level
>
> Where as rest of the output are architecture specific page table entry flags.
> Just wondering if we could print the first three columns in ptdump_walk_pgd()
> itself before calling arch specific callback to fetch a print buffer for rest
> of the line bounded with some character limit so that line does not overflow.
> Its not something which must be done but I guess it's worth giving it a try.
> This will help consolidate ptdump_walk_pgd() further.
It's not quite as simple as it seems. One of the things note_page() does
is work out whether a contiguous set of pages is "the same" (i.e.
should appear as one range). This is ultimately an architecture specific
decision: we need to look at the flags to do this.
I'm of course happy to be proved wrong if you can see a neat way of
making this work.
>>
>> Hopefully the implementation is generic enough that it should be
>> flexible enough to work for most architectures.
>>
>> arm, powerpc and s390 are the obvious architectures to convert next as
>> they already have note_page() functions which shouldn't be too difficult
>> to convert to match the callback prototype.
>
> Which can be done independently later on, fair enough.
>
>>
>>>>
>>>> To enable a generic page table walker to walk the unusual mappings of
>>>> the kernel we need to implement a set of functions which let us know
>>>> when the walker has reached the leaf entry. After a suggestion from Will
>>>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>>>> the purpose (and is a new name so has no historic baggage). Some
>>>> architectures have p?d_large macros but this is easily confused with
>>>> "large pages".
>>>
>>> I have not been following the previous version of the series closely, hence
>>> might be missing something here. But p?d_large() which identifies large
>>> mappings on a given level can only signify a leaf entry. Large pages on the
>>> table exist only as leaf entries. So what is the problem for it being used
>>> directly instead. Is there any possibility in the kernel mapping when these
>>> large pages are not leaf entries ?
>>
>> There isn't any problem as such with using p?d_large macros. However the
>> name "large" has caused confusion in the past. In particular there are
>> two types of "large" page:
>>
>> 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
>> pages this gives you 2MB and 1GB pages).
>>
>> 2. sets of contiguous entries that can share a TLB entry (the
>> 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
>> KB 'pages').
>
> This is arm64 specific and AFAIK there are no other architectures where there
> will be any confusion wrt p?d_large() not meaning a single entry.
This isn't arm64 specific (or even Arm specific) - only the examples I
gave are. There are several architectures with software walks where the
TLB can be populated with arbitrary sized entries. I have to admit I
don't fully understand the page table layouts of many of the other
architectures that Linux supports.
> As you have noted before if we are printing individual entries with PTE_CONT
> then they need not be identified as p??d_large(). In which case p?d_large()
> can just safely point to p?d_sect() identifying regular huge leaf entries.
The printing is largely irrelevant here (it's handled by arch code), so
PTE_CONT isn't a problem. However, to walk the page tables we need to know
precisely "is this the leaf of the tree?"; we don't really care what size of
page is being mapped, just whether we should continue the walk or not.
>>
>> In many cases both give the same effect (reduce pressure on TLBs and
>> requires contiguous and aligned physical addresses). But for this case
>> we only care about the 'leaf' case (because the contiguous bit makes no
>> difference to walking the page tables).
>
> Right and we can just safely identify section entries with it. What will be
> the problem with that ? Again this is only arm64 specific.
It's not arm64 specific.
>>
>> As far as I'm aware p?d_large() currently implements the first and
>> p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
>
> AFAIK option 2 exists only on arm6 platform. IIUC generic MM requires two
> different huge page dentition from platform. HugeTLB identifies large entries
> at PGD|PUD|PMD after converting it's content into PTE first. So there is no
> need for direct large page definitions for other levels.
>
> 1. THP - pmd_trans_huge()
> 2. HugeTLB - pte_huge() CONFIG_ARCH_WANT_GENERAL_HUGETLB is set
>
> A simple check for p?d_large() on mm/ and include/linux shows that there are
> no existing usage for these in generic MM. Hence it is available.
As Will has already replied - this is probably a good opportunity to
pick a better name - arch code can then be tidied up to use the new name.
[...]
>
> Currently that is a difference between x86 and arm64 ptdump output. Whether to
> show the gaps or not could not be achieved by defining a note_page() callback
> function which does nothing but just return ? But if the single line output is
> split between generic and callback as I had proposed earlier this will not be
> possible any more as half the line would have been already printed.
I think the proposal at the moment is for arm64 to match x86 as it seems
like it would be useful to know at what level the gaps are. But I also
like giving each arch the flexibility to display what information is
relevant for that architecture. It's the custom page walkers I'm trying
to remove as really there isn't much difference between architectures
there (as lots of generic code has to deal with page tables in one way
or another).
Steve
On 07/25/2019 03:00 PM, Will Deacon wrote:
> On Thu, Jul 25, 2019 at 02:39:22PM +0530, Anshuman Khandual wrote:
>> On 07/24/2019 07:05 PM, Steven Price wrote:
>>> There isn't any problem as such with using p?d_large macros. However the
>>> name "large" has caused confusion in the past. In particular there are
>>> two types of "large" page:
>>>
>>> 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
>>> pages this gives you 2MB and 1GB pages).
>>>
>>> 2. sets of contiguous entries that can share a TLB entry (the
>>> 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
>>> KB 'pages').
>>
>> This is arm64 specific and AFAIK there are no other architectures where there
>> will be any confusion wrt p?d_large() not meaning a single entry.
>>
>> As you have noted before if we are printing individual entries with PTE_CONT
>> then they need not be identified as p??d_large(). In which case p?d_large()
>> can just safely point to p?d_sect() identifying regular huge leaf entries.
>
> Steven's stuck in the middle of things here, but I do object to p?d_large()
> because I find it bonkers to have p?d_large() and p?d_huge() mean completely
> different things when they are synonyms in the English language.
Agreed that p?d_large() and p?d_huge() should not both exist at the same time
when they imply the same thing. I believe all this name proliferation happened
because of THP, HugeTLB and the kernel's large mappings (linear, vmemmap,
ioremap etc.) which the platform code had to deal with in various forms.
>
> Yes, p?d_leaf() matches the terminology used by the Arm architecture, but
> given that most page table structures are arranged as a 'tree', then it's
> not completely unreasonable, in my opinion. If you have a more descriptive
> name, we could use that instead. We could also paint it blue.
The alternative name chosen, p?d_leaf(), is absolutely fine and it describes
the entry as intended. There is no disagreement over that. My original concern
was the introduction of yet another page table helper.
>
>>> In many cases both give the same effect (reduce pressure on TLBs and
>>> requires contiguous and aligned physical addresses). But for this case
>>> we only care about the 'leaf' case (because the contiguous bit makes no
>>> difference to walking the page tables).
>>
>> Right and we can just safely identify section entries with it. What will be
>> the problem with that ? Again this is only arm64 specific.
>>
>>>
>>> As far as I'm aware p?d_large() currently implements the first and
>>> p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
>>
>> AFAIK option 2 exists only on arm6 platform. IIUC generic MM requires two
>> different huge page dentition from platform. HugeTLB identifies large entries
>> at PGD|PUD|PMD after converting it's content into PTE first. So there is no
>> need for direct large page definitions for other levels.
>>
>> 1. THP - pmd_trans_huge()
>> 2. HugeTLB - pte_huge() CONFIG_ARCH_WANT_GENERAL_HUGETLB is set
>>
>> A simple check for p?d_large() on mm/ and include/linux shows that there are
>> no existing usage for these in generic MM. Hence it is available.
>
> Alternatively, it means we have a good opportunity to give it a better name
> before it spreads into the core code.
Fair enough, that is another way. So you expect the existing platform
definitions of p?d_large()/p?d_huge() to be cleaned up and to start using the
new p?d_leaf() instead?
>
>> IMHO the new addition of p?d_leaf() can be avoided and p?d_large() should be
>> cleaned up (if required) in platforms and used in generic functions.
>
> Again, I disagree and think p?d_large() should be confined to arch code
> if it sticks around at all.
All of those instances should eventually migrate to using p?d_leaf(), else
there will be three different helpers which probably mean the same thing.
On 07/22/2019 09:11 PM, Steven Price wrote:
> Steven Price (21):
> arc: mm: Add p?d_leaf() definitions
> arm: mm: Add p?d_leaf() definitions
> arm64: mm: Add p?d_leaf() definitions
> mips: mm: Add p?d_leaf() definitions
> powerpc: mm: Add p?d_leaf() definitions
> riscv: mm: Add p?d_leaf() definitions
> s390: mm: Add p?d_leaf() definitions
> sparc: mm: Add p?d_leaf() definitions
> x86: mm: Add p?d_leaf() definitions
The set of architectures here is neither complete (e.g. ia64 and parisc are
missing) nor does it include only the architectures which had previously
enabled PTDUMP, i.e. arm, arm64, powerpc, s390 and x86. Is there any reason
for this set of archs to be on the list and not the others, which currently
fall back on the generic p?d_leaf() defined later in the series? Do the
missing archs not have huge page support in the MMU? If there is a direct
dependency between these symbols and CONFIG_HUGETLB_PAGE then it must be
checked before falling back on the generic ones.
Now that pmd_leaf() and pud_leaf() are used in walk_page_range(), these
functions need to be defined on all archs irrespective of whether they use
PTDUMP or not - or otherwise just define them for the archs which definitely
need them now, i.e. x86 and arm64 (which are moving to the new generic PTDUMP
framework). Other archs can implement them later.
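For reference, the generic fallbacks added later in the series take roughly
this shape, so an architecture which never has leaf entries above the lowest
level needs nothing more (sketch; the exact header and guards may differ):

/* Fallbacks for architectures which do not provide their own: at these
 * levels there can never be a leaf (block) mapping, so report false. */
#ifndef pgd_leaf
#define pgd_leaf(x)	0
#endif
#ifndef p4d_leaf
#define p4d_leaf(x)	0
#endif
#ifndef pud_leaf
#define pud_leaf(x)	0
#endif
#ifndef pmd_leaf
#define pmd_leaf(x)	0
#endif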
On 07/23/2019 03:11 PM, Mark Rutland wrote:
> On Mon, Jul 22, 2019 at 04:41:59PM +0100, Steven Price wrote:
>> Exposing the pud/pgd levels of the page tables to walk_page_range() means
>> we may come across the exotic large mappings that come with large areas
>> of contiguous memory (such as the kernel's linear map).
>>
>> For architectures that don't provide all p?d_leaf() macros, provide
>> generic do nothing default that are suitable where there cannot be leaf
>> pages that that level.
>>
>> Signed-off-by: Steven Price <[email protected]>
>
> Not a big deal, but it would probably make sense for this to be patch 1
> in the series, given it defines the semantic of p?d_leaf(), and they're
> not used until we provide all the architectural implemetnations anyway.
Agreed.
>
> It might also be worth pointing out the reasons for this naming, e.g.
> p?d_large() aren't currently generic, and this name minimizes potential
> confusion between p?d_{large,huge}().
Agreed. But these fallbacks also need to first check the non-availability of
large pages. I am not sure whether CONFIG_HUGETLB_PAGE being clear indicates
that conclusively or not. Being a page table leaf entry has a broader meaning
than being a large page, but that is not really the case today: all leaf
entries here are large page entries from the MMU's perspective. This
dependency can definitely be removed once there are other types of leaf
entries, but for now IMHO it feels a bit problematic not to directly associate
leaf entries with large pages via a config restriction while doing exactly the
same thing.
On 07/22/2019 09:12 PM, Steven Price wrote:
> pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
> ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
> no users. We're about to add users so reintroduce them, along with
> p4d_entry() as we now have 5 levels of tables.
>
> Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
> PUD-sized transparent hugepages") already re-added pud_entry() but with
> different semantics to the other callbacks. Since there have never
> been upstream users of this, revert the semantics back to match the
> other callbacks. This means pud_entry() is called for all entries, not
> just transparent huge pages.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> include/linux/mm.h | 15 +++++++++------
> mm/pagewalk.c | 27 ++++++++++++++++-----------
> 2 files changed, 25 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0334ca97c584..b22799129128 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1432,15 +1432,14 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>
> /**
> * mm_walk - callbacks for walk_page_range
> - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
> - * this handler should only handle pud_trans_huge() puds.
> - * the pmd_entry or pte_entry callbacks will be used for
> - * regular PUDs.
> - * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
> + * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
> + * @p4d_entry: if set, called for each non-empty P4D entry
> + * @pud_entry: if set, called for each non-empty PUD entry
> + * @pmd_entry: if set, called for each non-empty PMD entry
> * this handler is required to be able to handle
> * pmd_trans_huge() pmds. They may simply choose to
> * split_huge_page() instead of handling it explicitly.
> - * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
> + * @pte_entry: if set, called for each non-empty PTE (lowest-level) entry
> * @pte_hole: if set, called for each hole at all levels
> * @hugetlb_entry: if set, called for each hugetlb entry
> * @test_walk: caller specific callback function to determine whether
> @@ -1455,6 +1454,10 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> * (see the comment on walk_page_range() for more details)
> */
> struct mm_walk {
> + int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk);
> + int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
> + unsigned long next, struct mm_walk *walk);
> int (*pud_entry)(pud_t *pud, unsigned long addr,
> unsigned long next, struct mm_walk *walk);
> int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index c3084ff2569d..98373a9f88b8 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -90,15 +90,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> }
>
> if (walk->pud_entry) {
> - spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
> -
> - if (ptl) {
> - err = walk->pud_entry(pud, addr, next, walk);
> - spin_unlock(ptl);
> - if (err)
> - break;
> - continue;
> - }
> + err = walk->pud_entry(pud, addr, next, walk);
> + if (err)
> + break;
But will this not still encounter possible THP entries when walking user page
tables (valid walk->vma), in which case it still needs to take the lock?
Or will the callback take care of it?
On 07/22/2019 09:12 PM, Steven Price wrote:
> It is useful to be able to skip parts of the page table tree even when
> walking without VMAs. Add test_p?d callbacks similar to test_walk but
> which are called just before a table at that level is walked. If the
> callback returns non-zero then the entire table is skipped.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> include/linux/mm.h | 11 +++++++++++
> mm/pagewalk.c | 24 ++++++++++++++++++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b22799129128..325a1ca6f820 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1447,6 +1447,11 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> * value means "do page table walk over the current vma,"
> * and a negative one means "abort current page table walk
> * right now." 1 means "skip the current vma."
> + * @test_pmd: similar to test_walk(), but called for every pmd.
> + * @test_pud: similar to test_walk(), but called for every pud.
> + * @test_p4d: similar to test_walk(), but called for every p4d.
> + * Returning 0 means walk this part of the page tables,
> + * returning 1 means to skip this range.
> * @mm: mm_struct representing the target process of page table walk
> * @vma: vma currently walked (NULL if walking outside vmas)
> * @private: private data for callbacks' usage
> @@ -1471,6 +1476,12 @@ struct mm_walk {
> struct mm_walk *walk);
> int (*test_walk)(unsigned long addr, unsigned long next,
> struct mm_walk *walk);
> + int (*test_pmd)(unsigned long addr, unsigned long next,
> + pmd_t *pmd_start, struct mm_walk *walk);
> + int (*test_pud)(unsigned long addr, unsigned long next,
> + pud_t *pud_start, struct mm_walk *walk);
> + int (*test_p4d)(unsigned long addr, unsigned long next,
> + p4d_t *p4d_start, struct mm_walk *walk);
> struct mm_struct *mm;
> struct vm_area_struct *vma;
> void *private;
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 1cbef99e9258..6bea79b95be3 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -32,6 +32,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> unsigned long next;
> int err = 0;
>
> + if (walk->test_pmd) {
> + err = walk->test_pmd(addr, end, pmd_offset(pud, 0UL), walk);
> + if (err < 0)
> + return err;
> + if (err > 0)
> + return 0;
> + }
Though this attempts to match the semantics of test_walk() and be
comprehensive, I am just wondering what the real world situations are in which
a page walk needs to be aborted based on an error condition at a given page
table level.
On 07/22/2019 09:12 PM, Steven Price wrote:
> Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range
> for vma(VM_PFNMAP)", page_table_walk() will report any kernel area as
> a hole, because it lacks a vma.
>
> This means each arch has re-implemented page table walking when needed,
> for example in the per-arch ptdump walker.
>
> Remove the requirement to have a vma except when trying to split huge
> pages.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> mm/pagewalk.c | 25 +++++++++++++++++--------
> 1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 98373a9f88b8..1cbef99e9258 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -36,7 +36,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> do {
> again:
> next = pmd_addr_end(addr, end);
> - if (pmd_none(*pmd) || !walk->vma) {
> + if (pmd_none(*pmd)) {
> if (walk->pte_hole)
> err = walk->pte_hole(addr, next, walk);
> if (err)
> @@ -59,9 +59,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> if (!walk->pte_entry)
> continue;
>
> - split_huge_pmd(walk->vma, pmd, addr);
> - if (pmd_trans_unstable(pmd))
> - goto again;
> + if (walk->vma) {
> + split_huge_pmd(walk->vma, pmd, addr);
Check for a PMD THP entry before attempting to split it ?
> + if (pmd_trans_unstable(pmd))
> + goto again;
> + } else if (pmd_leaf(*pmd)) {
> + continue;
> + }
> +
> err = walk_pte_range(pmd, addr, next, walk);
> if (err)
> break;
> @@ -81,7 +86,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> do {
> again:
> next = pud_addr_end(addr, end);
> - if (pud_none(*pud) || !walk->vma) {
> + if (pud_none(*pud)) {
> if (walk->pte_hole)
> err = walk->pte_hole(addr, next, walk);
> if (err)
> @@ -95,9 +100,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> break;
> }
>
> - split_huge_pud(walk->vma, pud, addr);
> - if (pud_none(*pud))
> - goto again;
> + if (walk->vma) {
> + split_huge_pud(walk->vma, pud, addr);
Check for a PUD THP entry before attempting to split it ?
> + if (pud_none(*pud))
> + goto again;
> + } else if (pud_leaf(*pud)) {
> + continue;
> + }
This is a bit cryptic. The walk->vma check should be inside a helper such as
is_user_page_table() or similar to make things clear. Is a p4d_leaf() check
missing in walk_p4d_range() for the kernel page table walk? I am also
wondering whether the p?d_leaf() test should be moved earlier, when calling
p?d_entry() for the kernel page table walk.
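A sketch of the helper being suggested here (the name is only a suggestion,
not something in the series):

/* Make the "is this a user or a kernel page table walk?" test explicit
 * instead of open-coding walk->vma checks at each level. */
static inline bool is_user_page_table(struct mm_walk *walk)
{
	/* Kernel page tables are walked without a VMA. */
	return walk->vma != NULL;
}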
On 07/22/2019 09:12 PM, Steven Price wrote:
> Add a generic version of page table dumping that architectures can
> opt-in to
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> include/linux/ptdump.h | 19 +++++
> mm/Kconfig.debug | 21 ++++++
> mm/Makefile | 1 +
> mm/ptdump.c | 161 +++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 202 insertions(+)
> create mode 100644 include/linux/ptdump.h
> create mode 100644 mm/ptdump.c
>
> diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h
> new file mode 100644
> index 000000000000..eb8e78154be3
> --- /dev/null
> +++ b/include/linux/ptdump.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_PTDUMP_H
> +#define _LINUX_PTDUMP_H
> +
> +struct ptdump_range {
> + unsigned long start;
> + unsigned long end;
> +};
> +
> +struct ptdump_state {
> + void (*note_page)(struct ptdump_state *st, unsigned long addr,
> + int level, unsigned long val);
> + const struct ptdump_range *range;
> +};
> +
> +void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm);
> +
> +#endif /* _LINUX_PTDUMP_H */
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 82b6a20898bd..7ad939b7140f 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -115,3 +115,24 @@ config DEBUG_RODATA_TEST
> depends on STRICT_KERNEL_RWX
> ---help---
> This option enables a testcase for the setting rodata read-only.
> +
> +config GENERIC_PTDUMP
> + bool
> +
> +config PTDUMP_CORE
> + bool
> +
> +config PTDUMP_DEBUGFS
> + bool "Export kernel pagetable layout to userspace via debugfs"
> + depends on DEBUG_KERNEL
> + depends on DEBUG_FS
> + depends on GENERIC_PTDUMP
> + select PTDUMP_CORE
So PTDUMP_DEBUGFS depends on GENERIC_PTDUMP but selects PTDUMP_CORE. Hence any
arch subscribing to this new generic PTDUMP by selecting GENERIC_PTDUMP needs
to provide some functions for PTDUMP_DEBUGFS, which does not really have any
code in generic MM. Also, ptdump_walk_pgd() is wrapped in PTDUMP_CORE, not
GENERIC_PTDUMP. So what does GENERIC_PTDUMP really indicate? A bit confusing
here.
The new ptdump_walk_pgd() symbol certainly needs to be wrapped in a config
symbol which is selected by all platforms wishing to use it. GENERIC_PTDUMP
can be that config.
PTDUMP_DEBUGFS will require a full implementation (i.e. PTDUMP_CORE)
irrespective of whether the platform subscribes to GENERIC_PTDUMP or not. It
should be something like this:
config PTDUMP_DEBUGFS
bool "Export kernel pagetable layout to userspace via debugfs"
depends on DEBUG_KERNEL
depends on DEBUG_FS
select PTDUMP_CORE
PTDUMP_DEBUGFS need not depend on GENERIC_PTDUMP. All it requires is a
PTDUMP_CORE implementation, which can optionally use ptdump_walk_pgd() through
GENERIC_PTDUMP.
s/GENERIC_PTDUMP/PTDUMP_GENERIC/ to match and group with the other configs.
Could DEBUG_WX also be moved to generic MM like PTDUMP_DEBUGFS?
> + help
> + Say Y here if you want to show the kernel pagetable layout in a
> + debugfs file. This information is only useful for kernel developers
> + who are working in architecture specific areas of the kernel.
> + It is probably not a good idea to enable this feature in a production
> + kernel.
> +
> + If in doubt, say N.
> diff --git a/mm/Makefile b/mm/Makefile
> index 338e528ad436..750a4c12d5da 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
> obj-$(CONFIG_HMM_MIRROR) += hmm.o
> obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> +obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
Should be GENERIC_PTDUMP instead ?
> diff --git a/mm/ptdump.c b/mm/ptdump.c
> new file mode 100644
> index 000000000000..39befc9088b8
> --- /dev/null
> +++ b/mm/ptdump.c
> @@ -0,0 +1,161 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/mm.h>
> +#include <linux/ptdump.h>
> +#include <linux/kasan.h>
> +
> +static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> + pgd_t val = READ_ONCE(*pgd);
> +
> + if (pgd_leaf(val))
> + st->note_page(st, addr, 1, pgd_val(val));
> +
> + return 0;
> +}
> +
> +static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> + p4d_t val = READ_ONCE(*p4d);
> +
> + if (p4d_leaf(val))
> + st->note_page(st, addr, 2, p4d_val(val));
> +
> + return 0;
> +}
> +
> +static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> + pud_t val = READ_ONCE(*pud);
> +
> + if (pud_leaf(val))
> + st->note_page(st, addr, 3, pud_val(val));
> +
> + return 0;
> +}
> +
> +static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> + pmd_t val = READ_ONCE(*pmd);
> +
> + if (pmd_leaf(val))
> + st->note_page(st, addr, 4, pmd_val(val));
> +
> + return 0;
> +}
> +
> +static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + st->note_page(st, addr, 5, pte_val(READ_ONCE(*pte)));
> +
> + return 0;
> +}
> +
> +#ifdef CONFIG_KASAN
> +/*
> + * This is an optimization for KASAN=y case. Since all kasan page tables
> + * eventually point to the kasan_early_shadow_page we could call note_page()
> + * right away without walking through lower level page tables. This saves
> + * us dozens of seconds (minutes for 5-level config) while checking for
> + * W+X mapping or reading kernel_page_tables debugfs file.
> + */
> +static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
> + unsigned long addr)
> +{
> + if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
> +#ifdef CONFIG_X86
> + (pgtable_l5_enabled() &&
> + __pa(pt) == __pa(kasan_early_shadow_p4d)) ||
> +#endif
> + __pa(pt) == __pa(kasan_early_shadow_pud)) {
> + st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
> + return true;
> + }
> + return false;
> +}
> +#else
> +static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
> + unsigned long addr)
> +{
> + return false;
> +}
> +#endif
> +
> +static int ptdump_test_p4d(unsigned long addr, unsigned long next,
> + p4d_t *p4d, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + if (kasan_page_table(st, p4d, addr))
> + return 1;
> + return 0;
> +}
> +
> +static int ptdump_test_pud(unsigned long addr, unsigned long next,
> + pud_t *pud, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + if (kasan_page_table(st, pud, addr))
> + return 1;
> + return 0;
> +}
> +
> +static int ptdump_test_pmd(unsigned long addr, unsigned long next,
> + pmd_t *pmd, struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + if (kasan_page_table(st, pmd, addr))
> + return 1;
> + return 0;
> +}
> +
> +static int ptdump_hole(unsigned long addr, unsigned long next,
> + struct mm_walk *walk)
> +{
> + struct ptdump_state *st = walk->private;
> +
> + st->note_page(st, addr, -1, 0);
> +
> + return 0;
> +}
> +
> +void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
> +{
> + struct mm_walk walk = {
> + .mm = mm,
> + .pgd_entry = ptdump_pgd_entry,
> + .p4d_entry = ptdump_p4d_entry,
> + .pud_entry = ptdump_pud_entry,
> + .pmd_entry = ptdump_pmd_entry,
> + .pte_entry = ptdump_pte_entry,
> + .test_p4d = ptdump_test_p4d,
> + .test_pud = ptdump_test_pud,
> + .test_pmd = ptdump_test_pmd,
> + .pte_hole = ptdump_hole,
> + .private = st
> + };
> + const struct ptdump_range *range = st->range;
> +
> + down_read(&mm->mmap_sem);
> + while (range->start != range->end) {
> + walk_page_range(range->start, range->end, &walk);
> + range++;
> + }
> + up_read(&mm->mmap_sem);
Is walk_page_range() really needed here when it is definitely walking a
kernel page table? Why not directly use walk_pgd_range() instead, which can
save some cycles by avoiding going over VMAs, checking for HugeTLB, taking
the mmap_sem lock etc.? AFAICS the only thing it would miss is the opportunity
to call walk->test_walk() via walk_page_test(). IIUC the test_walk() callback
is primarily for testing a VMA's eligibility, and for kernel page tables there
are now the test_p?d() callbacks for individual levels anyway.
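A sketch of what that alternative could look like - note that walk_pgd_range()
is currently static in mm/pagewalk.c, so this assumes it (or an equivalent)
were made available to callers:

void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
{
	struct mm_walk walk = {
		.mm = mm,
		/* ... the same ptdump_* callbacks as in the patch ... */
		.private = st,
	};
	const struct ptdump_range *range = st->range;

	/* No VMAs and no mmap_sem needed for a kernel page table walk. */
	while (range->start != range->end) {
		walk_pgd_range(range->start, range->end, &walk);
		range++;
	}
}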
On 28/07/2019 12:44, Anshuman Khandual wrote:
>
>
> On 07/23/2019 03:11 PM, Mark Rutland wrote:
>> On Mon, Jul 22, 2019 at 04:41:59PM +0100, Steven Price wrote:
>>> Exposing the pud/pgd levels of the page tables to walk_page_range() means
>>> we may come across the exotic large mappings that come with large areas
>>> of contiguous memory (such as the kernel's linear map).
>>>
>>> For architectures that don't provide all p?d_leaf() macros, provide
>>> generic do nothing default that are suitable where there cannot be leaf
>>> pages that that level.
>>>
>>> Signed-off-by: Steven Price <[email protected]>
>>
>> Not a big deal, but it would probably make sense for this to be patch 1
>> in the series, given it defines the semantic of p?d_leaf(), and they're
>> not used until we provide all the architectural implemetnations anyway.
>
> Agreed.
>
>>
>> It might also be worth pointing out the reasons for this naming, e.g.
>> p?d_large() aren't currently generic, and this name minimizes potential
>> confusion between p?d_{large,huge}().
>
> Agreed. But these fallback also need to first check non-availability of large
> pages. I am not sure whether CONFIG_HUGETLB_PAGE config being clear indicates
> that conclusively or not. Being a page table leaf entry has a broader meaning
> than a large page but that is really not the case today. All leaf entries here
> are large page entries from MMU perspective. This dependency can definitely be
> removed when there are other types of leaf entries but for now IMHO it feels
> bit problematic not to directly associate leaf entries with large pages in
> config restriction while doing exactly the same.
The intention here is that the page walkers are able to walk any type of
page table entry which the kernel may use. CONFIG_HUGETLB_PAGE only
controls whether "huge TLB pages" are used by user space processes. It's
quite possible for that option not to be selected but for the linear mapping
to have been mapped using "large pages" (i.e. leaf entries further up the
tree than normal).
One of the goals was to avoid tying the new functions to a configuration
option but instead match the hardware architecture. Of course this isn't
possible in the most general case (e.g. an architecture may have
multiple hardware page table formats). But to the extent that other
functions like p?d_none() work the desire is that p?d_leaf() should also
work.
Steve
On 28/07/2019 12:20, Anshuman Khandual wrote:
> On 07/22/2019 09:11 PM, Steven Price wrote:
>> Steven Price (21):
>> arc: mm: Add p?d_leaf() definitions
>> arm: mm: Add p?d_leaf() definitions
>> arm64: mm: Add p?d_leaf() definitions
>> mips: mm: Add p?d_leaf() definitions
>> powerpc: mm: Add p?d_leaf() definitions
>> riscv: mm: Add p?d_leaf() definitions
>> s390: mm: Add p?d_leaf() definitions
>> sparc: mm: Add p?d_leaf() definitions
>> x86: mm: Add p?d_leaf() definitions
>
> The set of architectures here is neither complete (e.g ia64, parisc missing)
> nor does it only include architectures which had previously enabled PTDUMP
> like arm, arm64, powerpc, s390 and x86. Is there any reason for this set of
> archs to be on the list and not the others which are currently falling back
> on generic p?d_leaf() defined later in the series ? Are the missing archs
> do not have huge page support in the MMU ? If there is a direct dependency
> for these symbols with CONFIG_HUGETLB_PAGE then it must be checked before
> falling back on the generic ones.
The list of architectures here is what I believe to be the list of
architectures which can have leaf entries further up the tree than
normal. I'm by no means an expert on all these architectures so I'm
hoping someone will chime in if they notice something amiss. Obviously
all the NO_MMU architectures, which have no kernel page tables to walk, are
not relevant here.
ia64 as far as I can tell doesn't implement leaf entries further up - it
has an interesting hybrid hardware/software walk mechanism and as I
understand it the hardware never walks the page table tree that the
p?d_xxx() operations operate on. So this is a software implementation
detail - but the existence of p?d_huge() functions which always return 0
was my first clue that leaf entries are only at the bottom of the tree.
parisc is more interesting and I'm not sure if this is necessarily
correct. I originally proposed a patch[1] with the line "For parisc, we
don't support large pages, so add stubs returning 0" which got Acked by
Helge Deller. However, going back to look at that again, I see there was a
follow up thread[2] which possibly suggests I was wrong?
Can anyone shed some light on whether parisc does support leaf entries
of the page table tree higher up than the normal depth?
[1] https://lkml.org/lkml/2019/2/27/572
[2] https://lkml.org/lkml/2019/3/5/610
The intention is that the page table walker would be available for all
architectures so that it can be used in any generic code - PTDUMP simply
seemed like a good place to start.
> Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range() these
> functions need to be defined on all arch irrespective if they use PTDUMP or not
> or otherwise just define it for archs which need them now for sure i.e x86 and
> arm64 (which are moving to new generic PTDUMP framework). Other archs can
> implement these later.
The intention was to have valid definitions for all architectures, but
obviously I need help from those familiar with those other architectures
to check whether I've understood them correctly.
Thanks,
Steve
On 28/07/2019 13:33, Anshuman Khandual wrote:
>
>
> On 07/22/2019 09:12 PM, Steven Price wrote:
>> pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
>> ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
>> no users. We're about to add users so reintroduce them, along with
>> p4d_entry() as we now have 5 levels of tables.
>>
>> Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
>> PUD-sized transparent hugepages") already re-added pud_entry() but with
>> different semantics to the other callbacks. Since there have never
>> been upstream users of this, revert the semantics back to match the
>> other callbacks. This means pud_entry() is called for all entries, not
>> just transparent huge pages.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> include/linux/mm.h | 15 +++++++++------
>> mm/pagewalk.c | 27 ++++++++++++++++-----------
>> 2 files changed, 25 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 0334ca97c584..b22799129128 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1432,15 +1432,14 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>>
>> /**
>> * mm_walk - callbacks for walk_page_range
>> - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
>> - * this handler should only handle pud_trans_huge() puds.
>> - * the pmd_entry or pte_entry callbacks will be used for
>> - * regular PUDs.
>> - * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
>> + * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
>> + * @p4d_entry: if set, called for each non-empty P4D entry
>> + * @pud_entry: if set, called for each non-empty PUD entry
>> + * @pmd_entry: if set, called for each non-empty PMD entry
>> * this handler is required to be able to handle
>> * pmd_trans_huge() pmds. They may simply choose to
>> * split_huge_page() instead of handling it explicitly.
>> - * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
>> + * @pte_entry: if set, called for each non-empty PTE (lowest-level) entry
>> * @pte_hole: if set, called for each hole at all levels
>> * @hugetlb_entry: if set, called for each hugetlb entry
>> * @test_walk: caller specific callback function to determine whether
>> @@ -1455,6 +1454,10 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>> * (see the comment on walk_page_range() for more details)
>> */
>> struct mm_walk {
>> + int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk);
>> + int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk);
>> int (*pud_entry)(pud_t *pud, unsigned long addr,
>> unsigned long next, struct mm_walk *walk);
>> int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index c3084ff2569d..98373a9f88b8 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -90,15 +90,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>> }
>>
>> if (walk->pud_entry) {
>> - spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
>> -
>> - if (ptl) {
>> - err = walk->pud_entry(pud, addr, next, walk);
>> - spin_unlock(ptl);
>> - if (err)
>> - break;
>> - continue;
>> - }
>> + err = walk->pud_entry(pud, addr, next, walk);
>> + if (err)
>> + break;
>
> But will not this still encounter possible THP entries when walking user
> page tables (valid walk->vma) in which case still needs to get a lock.
> OR will the callback take care of it ?
This is what I mean in the commit message by:
> Since there have never
> been upstream users of this, revert the semantics back to match the
> other callbacks. This means pud_entry() is called for all entries, not
> just transparent huge pages.
So the expectation is that the caller takes care of it.
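For illustration, a pud_entry() implementation that only cares about huge
PUDs under the new semantics could take the lock itself - a rough sketch
(the callback name is made up):

static int example_pud_entry(pud_t *pud, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	/* Now called for every non-empty PUD, so filter for huge entries
	 * and take the lock here rather than in walk_pud_range(). */
	spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);

	if (!ptl)
		return 0;	/* not a huge PUD - lower levels will be walked */

	/* ... handle the huge PUD ... */

	spin_unlock(ptl);
	return 0;
}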
However, having checked again, it appears that mm/hmm.c now does use
this callback (merged in v5.2-rc1).
Jérôme - are you happy with this change in semantics? It looks like
hmm_vma_walk_pud() should deal gracefully with both normal and large
pages - although I'm unsure whether you are relying on the lock from
pud_trans_huge_lock()?
Thanks,
Steve
On 28/07/2019 15:20, Anshuman Khandual wrote:
>
>
> On 07/22/2019 09:12 PM, Steven Price wrote:
>> Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range
>> for vma(VM_PFNMAP)", page_table_walk() will report any kernel area as
>> a hole, because it lacks a vma.
>>
>> This means each arch has re-implemented page table walking when needed,
>> for example in the per-arch ptdump walker.
>>
>> Remove the requirement to have a vma except when trying to split huge
>> pages.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> mm/pagewalk.c | 25 +++++++++++++++++--------
>> 1 file changed, 17 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index 98373a9f88b8..1cbef99e9258 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -36,7 +36,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>> do {
>> again:
>> next = pmd_addr_end(addr, end);
>> - if (pmd_none(*pmd) || !walk->vma) {
>> + if (pmd_none(*pmd)) {
>> if (walk->pte_hole)
>> err = walk->pte_hole(addr, next, walk);
>> if (err)
>> @@ -59,9 +59,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>> if (!walk->pte_entry)
>> continue;
>>
>> - split_huge_pmd(walk->vma, pmd, addr);
>> - if (pmd_trans_unstable(pmd))
>> - goto again;
>> + if (walk->vma) {
>> + split_huge_pmd(walk->vma, pmd, addr);
>
> Check for a PMD THP entry before attempting to split it ?
split_huge_pmd does the check for us:
> #define split_huge_pmd(__vma, __pmd, __address) \
> do { \
> pmd_t *____pmd = (__pmd); \
> if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
> || pmd_devmap(*____pmd)) \
> __split_huge_pmd(__vma, __pmd, __address, \
> false, NULL); \
> } while (0)
And this isn't a change from the previous code - only that the entry is
no longer split when walk->vma==NULL.
>> + if (pmd_trans_unstable(pmd))
>> + goto again;
>> + } else if (pmd_leaf(*pmd)) {
>> + continue;
>> + }
>> +
>> err = walk_pte_range(pmd, addr, next, walk);
>> if (err)
>> break;
>> @@ -81,7 +86,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>> do {
>> again:
>> next = pud_addr_end(addr, end);
>> - if (pud_none(*pud) || !walk->vma) {
>> + if (pud_none(*pud)) {
>> if (walk->pte_hole)
>> err = walk->pte_hole(addr, next, walk);
>> if (err)
>> @@ -95,9 +100,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>> break;
>> }
>>
>> - split_huge_pud(walk->vma, pud, addr);
>> - if (pud_none(*pud))
>> - goto again;
>> + if (walk->vma) {
>> + split_huge_pud(walk->vma, pud, addr);
>
> Check for a PUD THP entry before attempting to split it ?
Same as above.
>> + if (pud_none(*pud))
>> + goto again;
>> + } else if (pud_leaf(*pud)) {
>> + continue;
>> + }
>
> This is bit cryptic. walk->vma check should be inside a helper is_user_page_table()
> or similar to make things clear. p4d_leaf() check missing in walk_p4d_range() for
> kernel page table walk ? Wondering if p?d_leaf() test should be moved earlier while
> calling p?d_entry() for kernel page table walk.
I wasn't sure if it was worth putting p4d_leaf() and pgd_leaf() checks
in (yet). No architecture that I know of uses such large pages.
I'm not sure what you mean by moving the p?d_leaf() test earlier? Can
you explain with an example?
Thanks,
Steve
On 28/07/2019 14:41, Anshuman Khandual wrote:
>
>
> On 07/22/2019 09:12 PM, Steven Price wrote:
>> It is useful to be able to skip parts of the page table tree even when
>> walking without VMAs. Add test_p?d callbacks similar to test_walk but
>> which are called just before a table at that level is walked. If the
>> callback returns non-zero then the entire table is skipped.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> include/linux/mm.h | 11 +++++++++++
>> mm/pagewalk.c | 24 ++++++++++++++++++++++++
>> 2 files changed, 35 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b22799129128..325a1ca6f820 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1447,6 +1447,11 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>> * value means "do page table walk over the current vma,"
>> * and a negative one means "abort current page table walk
>> * right now." 1 means "skip the current vma."
>> + * @test_pmd: similar to test_walk(), but called for every pmd.
>> + * @test_pud: similar to test_walk(), but called for every pud.
>> + * @test_p4d: similar to test_walk(), but called for every p4d.
>> + * Returning 0 means walk this part of the page tables,
>> + * returning 1 means to skip this range.
>> * @mm: mm_struct representing the target process of page table walk
>> * @vma: vma currently walked (NULL if walking outside vmas)
>> * @private: private data for callbacks' usage
>> @@ -1471,6 +1476,12 @@ struct mm_walk {
>> struct mm_walk *walk);
>> int (*test_walk)(unsigned long addr, unsigned long next,
>> struct mm_walk *walk);
>> + int (*test_pmd)(unsigned long addr, unsigned long next,
>> + pmd_t *pmd_start, struct mm_walk *walk);
>> + int (*test_pud)(unsigned long addr, unsigned long next,
>> + pud_t *pud_start, struct mm_walk *walk);
>> + int (*test_p4d)(unsigned long addr, unsigned long next,
>> + p4d_t *p4d_start, struct mm_walk *walk);
>> struct mm_struct *mm;
>> struct vm_area_struct *vma;
>> void *private;
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index 1cbef99e9258..6bea79b95be3 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -32,6 +32,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>> unsigned long next;
>> int err = 0;
>>
>> + if (walk->test_pmd) {
>> + err = walk->test_pmd(addr, end, pmd_offset(pud, 0UL), walk);
>> + if (err < 0)
>> + return err;
>> + if (err > 0)
>> + return 0;
>> + }
>
> Though this attempts to match semantics with test_walk() and be comprehensive
> just wondering what are the real world situations when page walking need to be
> aborted based on error condition at a given page table level.
I'm not aware of a situation yet where aborting early is necessary - but
as you say this matches the semantics of test_walk() and was easy to
implement.
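For illustration, a caller wanting to skip everything below some boundary
could do something like this (just a sketch - the callback name and the
EXAMPLE_SKIP_END boundary are made up):

static int example_test_pud(unsigned long addr, unsigned long next,
			    pud_t *pud_start, struct mm_walk *walk)
{
	/* Matching the test_walk() convention: return 1 to skip this
	 * table, 0 to walk it, negative to abort the walk. */
	if (next <= EXAMPLE_SKIP_END)
		return 1;
	return 0;
}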
Steve
On Sun, Jul 28, 2019 at 05:14:31PM +0530, Anshuman Khandual wrote:
> On 07/23/2019 03:11 PM, Mark Rutland wrote:
> > It might also be worth pointing out the reasons for this naming, e.g.
> > p?d_large() aren't currently generic, and this name minimizes potential
> > confusion between p?d_{large,huge}().
>
> Agreed. But these fallback also need to first check non-availability of large
> pages.
We're deliberately not making the p?d_large() helpers generic, so this
shouldn't fall back on those.
It's up to the architecture to implement these correctly, and the
architecture-specific implementations will be added in subsequent
patches.
Thanks,
Mark.
On 29/07/2019 03:59, Anshuman Khandual wrote:
>
> On 07/22/2019 09:12 PM, Steven Price wrote:
>> Add a generic version of page table dumping that architectures can
>> opt-in to
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> include/linux/ptdump.h | 19 +++++
>> mm/Kconfig.debug | 21 ++++++
>> mm/Makefile | 1 +
>> mm/ptdump.c | 161 +++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 202 insertions(+)
>> create mode 100644 include/linux/ptdump.h
>> create mode 100644 mm/ptdump.c
>>
>> diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h
>> new file mode 100644
>> index 000000000000..eb8e78154be3
>> --- /dev/null
>> +++ b/include/linux/ptdump.h
>> @@ -0,0 +1,19 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef _LINUX_PTDUMP_H
>> +#define _LINUX_PTDUMP_H
>> +
>> +struct ptdump_range {
>> + unsigned long start;
>> + unsigned long end;
>> +};
>> +
>> +struct ptdump_state {
>> + void (*note_page)(struct ptdump_state *st, unsigned long addr,
>> + int level, unsigned long val);
>> + const struct ptdump_range *range;
>> +};
>> +
>> +void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm);
>> +
>> +#endif /* _LINUX_PTDUMP_H */
>> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
>> index 82b6a20898bd..7ad939b7140f 100644
>> --- a/mm/Kconfig.debug
>> +++ b/mm/Kconfig.debug
>> @@ -115,3 +115,24 @@ config DEBUG_RODATA_TEST
>> depends on STRICT_KERNEL_RWX
>> ---help---
>> This option enables a testcase for the setting rodata read-only.
>> +
>> +config GENERIC_PTDUMP
>> + bool
>> +
>> +config PTDUMP_CORE
>> + bool
>> +
>> +config PTDUMP_DEBUGFS
>> + bool "Export kernel pagetable layout to userspace via debugfs"
>> + depends on DEBUG_KERNEL
>> + depends on DEBUG_FS
>> + depends on GENERIC_PTDUMP
>> + select PTDUMP_CORE
>
> So PTDUMP_DEBUGFS depends on GENERIC_PTDUMP but selects PTDUMP_CORE. So any arch
> subscribing this new generic PTDUMP by selecting GENERIC_PTDUMP needs to provide
> some functions for PTDUMP_DEBUGFS which does not really have any code in generic
> MM. Also ptdump_walk_pgd() is wrapped in PTDUMP_CORE not GENERIC_PTDUMP. Then what
> does PTDUMP_GENERIC really indicate ? Bit confusing here.
The intention is:
* PTDUMP_DEBUGFS: Controls whether the debugfs file is available. This
enables arch-specific code which creates the debugfs file (as the files
available vary between architectures).
* GENERIC_PTDUMP: Architecture is opting in to the generic ptdump
infrastructure. The arch code is expected to provide the debugfs code
for PTDUMP_DEBUGFS.
* PTDUMP_CORE: The core page table walker is enabled. This code is used
by both PTDUMP_DEBUGFS and DEBUG_WX ("Warn on W+X mappings at
boot"). x86 also has EFI_PGT_DUMP which uses the core.
> The new ptdump_walk_pgd() symbol needs to be wrapped in a config symbol for sure
> which should be selected in all platforms wishing to use it. GENERIC_PTDUMP can
> be that config.
The intention is that GENERIC_PTDUMP is signalling that the architecture
supports PTDUMP_DEBUGFS. PTDUMP_CORE is the configuration which chooses
whether ptdump_walk_pgd() is built - selected by the options that
require it.
> PTDUMP_DEBUGFS will require a full implementation (i.e PTDUMP_CORE) irrespective
> of whether the platform subscribes GENERIC_PTDUMP or not. It should be something
> like this.
>
> config PTDUMP_DEBUGFS
> bool "Export kernel pagetable layout to userspace via debugfs"
> depends on DEBUG_KERNEL
> depends on DEBUG_FS
> select PTDUMP_CORE
>
> PTDUMP_DEBUGFS need not depend on GENERIC_PTDUMP. All it requires is a PTDUMP_CORE
> implementation which can optionally use ptdump_walk_pgd() through GENERIC_PTDUMP.
> s/GENERIC_PTDUMP/PTDUMP_GENERIC to match and group with other configs.
The intention here is to hide PTDUMP_DEBUGFS on architectures that
haven't migrated to it. Because the generic code isn't responsible for
creating the debugfs entries, if we don't hide the option then it will
compile but nothing will appear in debugfs.
> DEBUG_WX can also be moved to generic MM like PTDUMP_DEBUGFS ?
Well, DEBUG_WX requires some arch-specific code (separate from
PTDUMP_DEBUGFS), so while the config option could be moved you would
then need an "ARCH_HAS_DEBUG_WX" option to hide it on the architectures
that don't support it. I'm not sure that's worth doing when only 3
architectures support it, and I would argue it's separate from this
patch series so can be done at a different time.
>> + help
>> + Say Y here if you want to show the kernel pagetable layout in a
>> + debugfs file. This information is only useful for kernel developers
>> + who are working in architecture specific areas of the kernel.
>> + It is probably not a good idea to enable this feature in a production
>> + kernel.
>> +
>> + If in doubt, say N.
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 338e528ad436..750a4c12d5da 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>> obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
>> obj-$(CONFIG_HMM_MIRROR) += hmm.o
>> obj-$(CONFIG_MEMFD_CREATE) += memfd.o
>> +obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
>
> Should be GENERIC_PTDUMP instead ?
No - GENERIC_PTDUMP just signals that the architecture supports it; we
don't want to compile in the code unless it is used.
>> diff --git a/mm/ptdump.c b/mm/ptdump.c
>> new file mode 100644
>> index 000000000000..39befc9088b8
>> --- /dev/null
>> +++ b/mm/ptdump.c
>> @@ -0,0 +1,161 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <linux/mm.h>
>> +#include <linux/ptdump.h>
>> +#include <linux/kasan.h>
>> +
>> +static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> + pgd_t val = READ_ONCE(*pgd);
>> +
>> + if (pgd_leaf(val))
>> + st->note_page(st, addr, 1, pgd_val(val));
>> +
>> + return 0;
>> +}
>> +
>> +static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> + p4d_t val = READ_ONCE(*p4d);
>> +
>> + if (p4d_leaf(val))
>> + st->note_page(st, addr, 2, p4d_val(val));
>> +
>> + return 0;
>> +}
>> +
>> +static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> + pud_t val = READ_ONCE(*pud);
>> +
>> + if (pud_leaf(val))
>> + st->note_page(st, addr, 3, pud_val(val));
>> +
>> + return 0;
>> +}
>> +
>> +static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> + pmd_t val = READ_ONCE(*pmd);
>> +
>> + if (pmd_leaf(val))
>> + st->note_page(st, addr, 4, pmd_val(val));
>> +
>> + return 0;
>> +}
>> +
>> +static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> +
>> + st->note_page(st, addr, 5, pte_val(READ_ONCE(*pte)));
>> +
>> + return 0;
>> +}
>> +
>> +#ifdef CONFIG_KASAN
>> +/*
>> + * This is an optimization for KASAN=y case. Since all kasan page tables
>> + * eventually point to the kasan_early_shadow_page we could call note_page()
>> + * right away without walking through lower level page tables. This saves
>> + * us dozens of seconds (minutes for 5-level config) while checking for
>> + * W+X mapping or reading kernel_page_tables debugfs file.
>> + */
>> +static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
>> + unsigned long addr)
>> +{
>> + if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
>> +#ifdef CONFIG_X86
>> + (pgtable_l5_enabled() &&
>> + __pa(pt) == __pa(kasan_early_shadow_p4d)) ||
>> +#endif
>> + __pa(pt) == __pa(kasan_early_shadow_pud)) {
>> + st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
>> + return true;
>> + }
>> + return false;
>> +}
>> +#else
>> +static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
>> + unsigned long addr)
>> +{
>> + return false;
>> +}
>> +#endif
>> +
>> +static int ptdump_test_p4d(unsigned long addr, unsigned long next,
>> + p4d_t *p4d, struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> +
>> + if (kasan_page_table(st, p4d, addr))
>> + return 1;
>> + return 0;
>> +}
>> +
>> +static int ptdump_test_pud(unsigned long addr, unsigned long next,
>> + pud_t *pud, struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> +
>> + if (kasan_page_table(st, pud, addr))
>> + return 1;
>> + return 0;
>> +}
>> +
>> +static int ptdump_test_pmd(unsigned long addr, unsigned long next,
>> + pmd_t *pmd, struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> +
>> + if (kasan_page_table(st, pmd, addr))
>> + return 1;
>> + return 0;
>> +}
>> +
>> +static int ptdump_hole(unsigned long addr, unsigned long next,
>> + struct mm_walk *walk)
>> +{
>> + struct ptdump_state *st = walk->private;
>> +
>> + st->note_page(st, addr, -1, 0);
>> +
>> + return 0;
>> +}
>> +
>> +void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
>> +{
>> + struct mm_walk walk = {
>> + .mm = mm,
>> + .pgd_entry = ptdump_pgd_entry,
>> + .p4d_entry = ptdump_p4d_entry,
>> + .pud_entry = ptdump_pud_entry,
>> + .pmd_entry = ptdump_pmd_entry,
>> + .pte_entry = ptdump_pte_entry,
>> + .test_p4d = ptdump_test_p4d,
>> + .test_pud = ptdump_test_pud,
>> + .test_pmd = ptdump_test_pmd,
>> + .pte_hole = ptdump_hole,
>> + .private = st
>> + };
>> + const struct ptdump_range *range = st->range;
>> +
>> + down_read(&mm->mmap_sem);
>> + while (range->start != range->end) {
>> + walk_page_range(range->start, range->end, &walk);
>> + range++;
>> + }
>> + up_read(&mm->mmap_sem);
>
> Does walk_page_range() really needed here when it is definitely walking a
> kernel page table. Why not directly use walk_pgd_range() instead which can
> save some cycles avoiding going over VMAs, checking for HugeTLB, taking the
> mmap_sem lock etc. AFAICS only thing it will miss is the opportunity to call
> walk->test_walk() via walk_page_test(). IIUC test_walk() callback is primarily
> for testing a VMA for it's eligibility and for kernel page table now there are
> test callbacks like p?d_test() for individual levels anyway.
Well, it's a debug interface, so saving a few cycles is largely
irrelevant. I'm reluctant to export walk_pgd_range() in case it gets
used to walk real VMAs. Having just one interface is cleanest. But I
agree that for kernel mappings all the extra work in walk_page_range()
isn't needed.
Steve
Hi Steven,
On Mon, Jul 29, 2019 at 12:32:25PM +0100, Steven Price wrote:
>
> parisc is more interesting and I'm not sure if this is necessarily
> correct. I originally proposed a patch with the line "For parisc, we
> don't support large pages, so add stubs returning 0" which got Acked by
> Helge Deller. However going back to look at that again I see there was a
> follow up thread[2] which possibly suggests I was wrong?
I just started a week ago implementing ptdump for PA-RISC. Didn't notice that
you're working on making it generic, which is nice. I'll adjust my code
to use the infrastructure you're currently developing.
> Can anyone shed some light on whether parisc does support leaf entries
> of the page table tree at a level higher than the normal depth?
>
> [1] https://lkml.org/lkml/2019/2/27/572
> [2] https://lkml.org/lkml/2019/3/5/610
My understanding is that PA-RISC only has leaf entries on PTE level.
> The intention is that the page table walker would be available for all
> architectures so that it can be used in any generic code - PTDUMP simply
> seemed like a good place to start.
>
> > Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range() these
> > functions need to be defined on all arch irrespective if they use PTDUMP or not
> > or otherwise just define it for archs which need them now for sure i.e x86 and
> > arm64 (which are moving to new generic PTDUMP framework). Other archs can
> > implement these later.
I'll take care of the PA-RISC part - for 32-bit your generic code works, for 64-bit
I need to learn a bit more about the following hack:
arch/parisc/include/asm/pgalloc.h:15
/* Allocate the top level pgd (page directory)
*
* Here (for 64 bit kernels) we implement a Hybrid L2/L3 scheme: we
* allocate the first pmd adjacent to the pgd. This means that we can
* subtract a constant offset to get to it. The pmd and pgd sizes are
* arranged so that a single pmd covers 4GB (giving a full 64-bit
* process access to 8TB) so our lookups are effectively L2 for the
* first 4GB of the kernel (i.e. for all ILP32 processes and all the
* kernel for machines with under 4GB of memory)
*/
I see that your change clears P?D entries when p?d_bad() returns true, which - I think -
would be the case with the PA-RISC implementation.
Regards
Sven
On 31/07/2019 10:27, Sven Schnelle wrote:
> Hi Steven,
>
> On Mon, Jul 29, 2019 at 12:32:25PM +0100, Steven Price wrote:
>>
>> parisc is more interesting and I'm not sure if this is necessarily
>> correct. I originally proposed a patch with the line "For parisc, we
>> don't support large pages, so add stubs returning 0" which got Acked by
>> Helge Deller. However going back to look at that again I see there was a
>> follow up thread[2] which possibly suggests I was wrong?
>
> I just started a week ago implementing ptdump for PA-RISC. Didn't notice that
> you're working on making it generic, which is nice. I'll adjust my code
> to use the infrastructure you're currently developing.
Great, hopefully it will make it easier to implement.
>> Can anyone shed some light on whether parisc does support leaf entries
>> of the page table tree at a level higher than the normal depth?
>>
>> [1] https://lkml.org/lkml/2019/2/27/572
>> [2] https://lkml.org/lkml/2019/3/5/610
>
> My understanding is that PA-RISC only has leaf entries on PTE level.
Yes, that's my current interpretation.
>> The intention is that the page table walker would be available for all
>> architectures so that it can be used in any generic code - PTDUMP simply
>> seemed like a good place to start.
>>
>>> Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range() these
>>> functions need to be defined on all arch irrespective if they use PTDUMP or not
>>> or otherwise just define it for archs which need them now for sure i.e x86 and
>>> arm64 (which are moving to new generic PTDUMP framework). Other archs can
>>> implement these later.
>
> I'll take care of the PA-RISC part - for 32-bit your generic code works, for 64-bit
> I need to learn a bit more about the following hack:
>
> arch/parisc/include/asm/pgalloc.h:15
> /* Allocate the top level pgd (page directory)
> *
> * Here (for 64 bit kernels) we implement a Hybrid L2/L3 scheme: we
> * allocate the first pmd adjacent to the pgd. This means that we can
> * subtract a constant offset to get to it. The pmd and pgd sizes are
> * arranged so that a single pmd covers 4GB (giving a full 64-bit
> * process access to 8TB) so our lookups are effectively L2 for the
> * first 4GB of the kernel (i.e. for all ILP32 processes and all the
> * kernel for machines with under 4GB of memory)
> */
As far as I understand this, the page table tree isn't any different
here. It's just that there's a PMD which is allocated at the same time
as the PGD. The PGD's first entry then points to the PMD (P4D/PUD are
folded). There are then some tricks which mean that for addresses < 4GB
the PGD stage can be skipped because you already know where the relevant
PMD is.
However, nothing should stop a simple walk from PGD down - it's just an
optimisation to remove the pointer fetch from PGD in the usual case for
accesses < 4GB.
> I see that your change clears P?D entries when p?d_bad() returns true, which - I think -
> would be the case with the PA-RISC implementation.
The only case where p?d_bad() is checked is at the PGD and P4D levels
(unless I'm missing something?). I have to admit I'm a little unsure
about this. Basically the code as it stands doesn't allow leaf entries
at PGD or P4D. I'm not aware of any architectures that do this though.
Thanks,
Steve
On 07/29/2019 05:08 PM, Steven Price wrote:
> On 28/07/2019 12:44, Anshuman Khandual wrote:
>>
>>
>> On 07/23/2019 03:11 PM, Mark Rutland wrote:
>>> On Mon, Jul 22, 2019 at 04:41:59PM +0100, Steven Price wrote:
>>>> Exposing the pud/pgd levels of the page tables to walk_page_range() means
>>>> we may come across the exotic large mappings that come with large areas
>>>> of contiguous memory (such as the kernel's linear map).
>>>>
>>>> For architectures that don't provide all p?d_leaf() macros, provide
>>>> generic do nothing default that are suitable where there cannot be leaf
>>>> pages that that level.
>>>>
>>>> Signed-off-by: Steven Price <[email protected]>
>>>
>>> Not a big deal, but it would probably make sense for this to be patch 1
>>> in the series, given it defines the semantic of p?d_leaf(), and they're
>>> not used until we provide all the architectural implemetnations anyway.
>>
>> Agreed.
>>
>>>
>>> It might also be worth pointing out the reasons for this naming, e.g.
>>> p?d_large() aren't currently generic, and this name minimizes potential
>>> confusion between p?d_{large,huge}().
>>
>> Agreed. But these fallback also need to first check non-availability of large
>> pages. I am not sure whether CONFIG_HUGETLB_PAGE config being clear indicates
>> that conclusively or not. Being a page table leaf entry has a broader meaning
>> than a large page but that is really not the case today. All leaf entries here
>> are large page entries from MMU perspective. This dependency can definitely be
>> removed when there are other types of leaf entries but for now IMHO it feels
>> bit problematic not to directly associate leaf entries with large pages in
>> config restriction while doing exactly the same.
>
> The intention here is that the page walkers are able to walk any type of
> page table entry which the kernel may use. CONFIG_HUGETLB_PAGE only
> controls whether "huge TLB pages" are used by user space processes. It's
> quite possible that option to not be selected but the linear mapping to
> have been mapped using "large pages" (i.e. leaf entries further up the
> tree than normal).
I understand that the kernel page table might use large pages even where user
space never enabled HugeTLB. The point here was that CONFIG_HUGETLB approximately
indicates the presence of large pages, though its absence does not
conclusively indicate that large pages are really absent from the MMU. Perhaps
it would require something new like MMU_[LARGE|HUGE]_PAGES.
>
> One of the goals was to avoid tying the new functions to a configuration
> option but instead match the hardware architecture. Of course this isn't
> possible in the most general case (e.g. an architecture may have
> multiple hardware page table formats). But to the extent that other
> functions like p?d_none() work the desire is that p?d_leaf() should also
> work.
It is fair enough to assume that a platform can decide wisely and provide
accurate definitions for the p?d_leaf() functions. Anyway, it's okay not to make
this more complex by tying it to a new config option which does not exist.
On 07/29/2019 05:59 PM, Steven Price wrote:
> On 28/07/2019 15:20, Anshuman Khandual wrote:
>>
>>
>> On 07/22/2019 09:12 PM, Steven Price wrote:
>>> Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range
>>> for vma(VM_PFNMAP)", page_table_walk() will report any kernel area as
>>> a hole, because it lacks a vma.
>>>
>>> This means each arch has re-implemented page table walking when needed,
>>> for example in the per-arch ptdump walker.
>>>
>>> Remove the requirement to have a vma except when trying to split huge
>>> pages.
>>>
>>> Signed-off-by: Steven Price <[email protected]>
>>> ---
>>> mm/pagewalk.c | 25 +++++++++++++++++--------
>>> 1 file changed, 17 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index 98373a9f88b8..1cbef99e9258 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -36,7 +36,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>> do {
>>> again:
>>> next = pmd_addr_end(addr, end);
>>> - if (pmd_none(*pmd) || !walk->vma) {
>>> + if (pmd_none(*pmd)) {
>>> if (walk->pte_hole)
>>> err = walk->pte_hole(addr, next, walk);
>>> if (err)
>>> @@ -59,9 +59,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>> if (!walk->pte_entry)
>>> continue;
>>>
>>> - split_huge_pmd(walk->vma, pmd, addr);
>>> - if (pmd_trans_unstable(pmd))
>>> - goto again;
>>> + if (walk->vma) {
>>> + split_huge_pmd(walk->vma, pmd, addr);
>>
>> Check for a PMD THP entry before attempting to split it ?
>
> split_huge_pmd does the check for us:
>> #define split_huge_pmd(__vma, __pmd, __address) \
>> do { \
>> pmd_t *____pmd = (__pmd); \
>> if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
>> || pmd_devmap(*____pmd)) \
>> __split_huge_pmd(__vma, __pmd, __address, \
>> false, NULL); \
>> } while (0)
>
> And this isn't a change from the previous code - only that the entry is
> no longer split when walk->vma==NULL.
Does it make sense to wrap the walk->vma check in a named helper to differentiate
between user and kernel page table walks? IMHO that would help make things clear
and explicit during the page table walk.
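Something along the lines of the is_user_page_table() helper suggested
earlier, e.g. (just a sketch, the name is only a suggestion):

static inline bool is_user_page_table(struct mm_walk *walk)
{
	/* A walk without a VMA is a walk of the kernel page tables. */
	return walk->vma != NULL;
}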
>
>>> + if (pmd_trans_unstable(pmd))
>>> + goto again;
>>> + } else if (pmd_leaf(*pmd)) {
>>> + continue;
>>> + }
>>> +
>>> err = walk_pte_range(pmd, addr, next, walk);
>>> if (err)
>>> break;
>>> @@ -81,7 +86,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>>> do {
>>> again:
>>> next = pud_addr_end(addr, end);
>>> - if (pud_none(*pud) || !walk->vma) {
>>> + if (pud_none(*pud)) {
>>> if (walk->pte_hole)
>>> err = walk->pte_hole(addr, next, walk);
>>> if (err)
>>> @@ -95,9 +100,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>>> break;
>>> }
>>>
>>> - split_huge_pud(walk->vma, pud, addr);
>>> - if (pud_none(*pud))
>>> - goto again;
>>> + if (walk->vma) {
>>> + split_huge_pud(walk->vma, pud, addr);
>>
>> Check for a PUD THP entry before attempting to split it ?
>
> Same as above.
>
>>> + if (pud_none(*pud))
>>> + goto again;
>>> + } else if (pud_leaf(*pud)) {
>>> + continue;
>>> + }
>>
>> This is bit cryptic. walk->vma check should be inside a helper is_user_page_table()
>> or similar to make things clear. p4d_leaf() check missing in walk_p4d_range() for
>> kernel page table walk ? Wondering if p?d_leaf() test should be moved earlier while
>> calling p?d_entry() for kernel page table walk.
>
> I wasn't sure if it was worth putting p4d_leaf() and pgd_leaf() checks
> in (yet). No architecture that I know of uses such large pages.
Just to be complete, it does make sense to add the remaining possible leaf
entry checks, but I will leave it up to you.
>
> I'm not sure what you mean by moving the p?d_leaf() test earlier? Can
> you explain with an example?
If it's a kernel p?d_leaf() entry then there is nothing more to be done
after calling the respective walk->p?d_entry() function. Hence this check
should not sit alongside the user page table check (walk->vma) later in the
function, but instead be done right after walk->p?d_entry() - roughly like
the sketch below. But it's not a big deal I guess.
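Untested, just to illustrate the idea in walk_pud_range():

	/* Inside the loop in walk_pud_range(), after the pud_none() check: */
	if (walk->pud_entry) {
		err = walk->pud_entry(pud, addr, next, walk);
		if (err)
			break;
		/* Kernel page table walk: a leaf entry has already been
		 * reported by the callback, nothing left to descend into. */
		if (!walk->vma && pud_leaf(*pud))
			continue;
	}

	if (walk->vma) {
		split_huge_pud(walk->vma, pud, addr);
		if (pud_none(*pud))
			goto again;
	}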
On 07/29/2019 06:20 PM, Mark Rutland wrote:
> On Sun, Jul 28, 2019 at 05:14:31PM +0530, Anshuman Khandual wrote:
>> On 07/23/2019 03:11 PM, Mark Rutland wrote:
>>> It might also be worth pointing out the reasons for this naming, e.g.
>>> p?d_large() aren't currently generic, and this name minimizes potential
>>> confusion between p?d_{large,huge}().
>>
>> Agreed. But these fallback also need to first check non-availability of large
>> pages.
>
> We're deliberately not making the p?d_large() helpers generic, so this
> shouldn't fall back on those.
I meant the non-availability of large page support in the MMU hardware, not just
the presence of p?d_large() helpers.
On 01/08/2019 07:09, Anshuman Khandual wrote:
>
>
> On 07/29/2019 05:08 PM, Steven Price wrote:
>> On 28/07/2019 12:44, Anshuman Khandual wrote:
>>>
>>>
>>> On 07/23/2019 03:11 PM, Mark Rutland wrote:
>>>> On Mon, Jul 22, 2019 at 04:41:59PM +0100, Steven Price wrote:
>>>>> Exposing the pud/pgd levels of the page tables to walk_page_range() means
>>>>> we may come across the exotic large mappings that come with large areas
>>>>> of contiguous memory (such as the kernel's linear map).
>>>>>
>>>>> For architectures that don't provide all p?d_leaf() macros, provide
>>>>> generic do nothing default that are suitable where there cannot be leaf
>>>>> pages that that level.
>>>>>
>>>>> Signed-off-by: Steven Price <[email protected]>
>>>>
>>>> Not a big deal, but it would probably make sense for this to be patch 1
>>>> in the series, given it defines the semantic of p?d_leaf(), and they're
>>>> not used until we provide all the architectural implemetnations anyway.
>>>
>>> Agreed.
>>>
>>>>
>>>> It might also be worth pointing out the reasons for this naming, e.g.
>>>> p?d_large() aren't currently generic, and this name minimizes potential
>>>> confusion between p?d_{large,huge}().
>>>
>>> Agreed. But these fallback also need to first check non-availability of large
>>> pages. I am not sure whether CONFIG_HUGETLB_PAGE config being clear indicates
>>> that conclusively or not. Being a page table leaf entry has a broader meaning
>>> than a large page but that is really not the case today. All leaf entries here
>>> are large page entries from MMU perspective. This dependency can definitely be
>>> removed when there are other types of leaf entries but for now IMHO it feels
>>> bit problematic not to directly associate leaf entries with large pages in
>>> config restriction while doing exactly the same.
>>
>> The intention here is that the page walkers are able to walk any type of
>> page table entry which the kernel may use. CONFIG_HUGETLB_PAGE only
>> controls whether "huge TLB pages" are used by user space processes. It's
>> quite possible that option to not be selected but the linear mapping to
>> have been mapped using "large pages" (i.e. leaf entries further up the
>> tree than normal).
>
> I understand that kernel page table might use large pages where as user space
> never enabled HugeTLB. The point to make here was CONFIG_HUGETLB approximately
> indicates the presence of large pages though the absence of same does not
> conclusively indicate that large pages are really absent on the MMU. Perhaps it
> will requires something new like MMU_[LARGE|HUGE]_PAGES.
CONFIG_HUGETLB doesn't necessarily mean leaf entries can appear anywhere
other than the PTE level. Some architectures always have a full tree of page
tables, but can program their TLBs with larger entries - I think all the
architectures I've come across that do this have software page table walking,
but in theory the arm64 contiguous hint bit could be considered similar.
Steve