2019-11-01 15:38:36

by Steven Price

Subject: [PATCH v15 00/23] Generic page walk and ptdump

Many architectures currently have a debugfs file for dumping the kernel
page tables. Each architecture has to implement custom functions for
this because the details of walking the page tables used by the kernel
differ between architectures.

This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space). A generic PTDUMP
implementation is then built on the new functionality of
walk_page_range(), and finally arm64 and x86 are switched over to it,
removing their custom table walkers.

To enable a generic page table walker to walk the unusual mappings of
the kernel, we need to implement a set of functions which let us know
when the walker has reached a leaf entry. After a suggestion from Will
Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
the purpose (and is a new name so has no historic baggage). Some
architectures have p?d_large() macros, but that name is easily confused
with "large pages".

This series ends with a generic PTDUMP implementation for arm64 and x86.

Mostly this is a clean up and there should be very little functional
change. The exceptions are:

* arm64 PTDUMP debugfs now displays pages which aren't present (patch 22).

* arm64 has the ability to efficiently process KASAN pages (which
previously only x86 implemented). This means that the combination of
KASAN and DEBUG_WX is now usable.

Also available as a git tree:
git://linux-arm.org/linux-sp.git walk_page_range/v15

Changes since v14:
https://lore.kernel.org/lkml/[email protected]/
* Split walk_page_range() into two functions: the existing
walk_page_range() still requires VMAs (and treats areas without a
VMA as a 'hole'), while the new walk_page_range_novma() ignores VMAs
and simply reports the actual page table layout. This fixes the
previous breakage of /proc/<pid>/pagemap
* New patch at the end of the series which reduces the 'level' numbers
by 1 to simplify the code slightly
* Added tags

Changes since v13:
https://lore.kernel.org/lkml/[email protected]/
* Fixed typo in arc definition of pmd_leaf() spotted by the kbuild test
robot
* Added tags

Changes since v12:
https://lore.kernel.org/lkml/[email protected]/
* Correct code format in riscv pud_leaf()/pmd_leaf()
* v12 may not have reached everyone because of mail server problems
(which are now hopefully resolved!)

Changes since v11:
https://lore.kernel.org/lkml/[email protected]/
* Use "-1" as dummy depth parameter in patch 14.

Changes since v10:
https://lore.kernel.org/lkml/[email protected]/
* Rebased to v5.4-rc1 - mainly various updates to deal with the
splitting out of ops from struct mm_walk.
* Deal with PGD_LEVEL_MULT not always being constant on x86.

Changes since v9:
https://lore.kernel.org/lkml/[email protected]/
* Moved generic macros to the first patch in the series and explained
the macro naming in the commit message.
* mips: Moved macros to pgtable.h as they are now valid for both 32 and 64
bit
* x86: Dropped patch which changed the debugfs output for x86, instead
we have...
* new patch adding 'depth' parameter to pte_hole. This is used to
provide the necessary information to output lines for 'holes' in the
debugfs files
* new patch changing arm64 debugfs output to include holes to match x86
* generic ptdump KASAN handling has been simplified and now works with
CONFIG_DEBUG_VIRTUAL.

Changes since v8:
https://lore.kernel.org/lkml/[email protected]/
* Rename from p?d_large() to p?d_leaf()
* Dropped patches migrating arm64/x86 custom walkers to
walk_page_range() in favour of adding a generic PTDUMP implementation
and migrating arm64/x86 to that instead.
* Rebased to v5.3-rc1

Steven Price (23):
mm: Add generic p?d_leaf() macros
arc: mm: Add p?d_leaf() definitions
arm: mm: Add p?d_leaf() definitions
arm64: mm: Add p?d_leaf() definitions
mips: mm: Add p?d_leaf() definitions
powerpc: mm: Add p?d_leaf() definitions
riscv: mm: Add p?d_leaf() definitions
s390: mm: Add p?d_leaf() definitions
sparc: mm: Add p?d_leaf() definitions
x86: mm: Add p?d_leaf() definitions
mm: pagewalk: Add p4d_entry() and pgd_entry()
mm: pagewalk: Allow walking without vma
mm: pagewalk: Add test_p?d callbacks
mm: pagewalk: Add 'depth' parameter to pte_hole
x86: mm: Point to struct seq_file from struct pg_state
x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
mm: Add generic ptdump
x86: mm: Convert dump_pagetables to use walk_page_range
arm64: mm: Convert mm/dump.c to use walk_page_range()
arm64: mm: Display non-present entries in ptdump
mm: ptdump: Reduce level numbers by 1 in note_page()

arch/arc/include/asm/pgtable.h | 1 +
arch/arm/include/asm/pgtable-2level.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/Kconfig | 1 +
arch/arm64/Kconfig.debug | 19 +-
arch/arm64/include/asm/pgtable.h | 2 +
arch/arm64/include/asm/ptdump.h | 8 +-
arch/arm64/mm/Makefile | 4 +-
arch/arm64/mm/dump.c | 148 +++-----
arch/arm64/mm/mmu.c | 4 +-
arch/arm64/mm/ptdump_debugfs.c | 2 +-
arch/mips/include/asm/pgtable.h | 5 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 30 +-
arch/riscv/include/asm/pgtable-64.h | 7 +
arch/riscv/include/asm/pgtable.h | 7 +
arch/s390/include/asm/pgtable.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 2 +
arch/x86/Kconfig | 1 +
arch/x86/Kconfig.debug | 20 +-
arch/x86/include/asm/pgtable.h | 10 +-
arch/x86/mm/Makefile | 4 +-
arch/x86/mm/debug_pagetables.c | 8 +-
arch/x86/mm/dump_pagetables.c | 343 +++++--------------
arch/x86/platform/efi/efi_32.c | 2 +-
arch/x86/platform/efi/efi_64.c | 4 +-
drivers/firmware/efi/arm-runtime.c | 2 +-
fs/proc/task_mmu.c | 4 +-
include/asm-generic/pgtable.h | 20 ++
include/linux/pagewalk.h | 42 ++-
include/linux/ptdump.h | 22 ++
mm/Kconfig.debug | 21 ++
mm/Makefile | 1 +
mm/hmm.c | 8 +-
mm/migrate.c | 5 +-
mm/mincore.c | 1 +
mm/pagewalk.c | 126 +++++--
mm/ptdump.c | 151 ++++++++
37 files changed, 586 insertions(+), 453 deletions(-)
create mode 100644 include/linux/ptdump.h
create mode 100644 mm/ptdump.c

--
2.20.1


2019-11-01 15:38:37

by Steven Price

Subject: [PATCH v15 01/23] mm: Add generic p?d_leaf() macros

Exposing the pud/pgd levels of the page tables to walk_page_range() means
we may come across the exotic large mappings that come with large areas
of contiguous memory (such as the kernel's linear map).

For architectures that don't provide all p?d_leaf() macros, provide
generic do-nothing defaults that are suitable where there cannot be leaf
pages at that level. Further patches will add implementations for
individual architectures.

The name p?d_leaf() is chosen to minimize the confusion with existing
uses of "large" pages and "huge" pages, which do not necessarily mean
that the entry is a leaf (for example it may be a set of contiguous
entries that only take one TLB slot). For the purpose of walking the
page tables we don't need to know how it will be represented in the
TLB, but we do need to know for sure if it is a leaf of the tree.

Signed-off-by: Steven Price <[email protected]>
Acked-by: Mark Rutland <[email protected]>
---
include/asm-generic/pgtable.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 818691846c90..7f9eced287b7 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1187,4 +1187,24 @@ static inline bool arch_has_pfn_modify_check(void)
#define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED)
#endif

+/*
+ * p?d_leaf() - true if this entry is a final mapping to a physical address.
+ * This differs from p?d_huge() by the fact that they are always available (if
+ * the architecture supports large pages at the appropriate level) even
+ * if CONFIG_HUGETLB_PAGE is not defined.
+ * Only meaningful when called on a valid entry.
+ */
+#ifndef pgd_leaf
+#define pgd_leaf(x) 0
+#endif
+#ifndef p4d_leaf
+#define p4d_leaf(x) 0
+#endif
+#ifndef pud_leaf
+#define pud_leaf(x) 0
+#endif
+#ifndef pmd_leaf
+#define pmd_leaf(x) 0
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
--
2.20.1

2019-11-01 15:40:12

by Steven Price

Subject: [PATCH v15 06/23] powerpc: mm: Add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_leaf() functions/macros.

For powerpc pmd_large() already exists and does what we want, so hoist
it out of the CONFIG_TRANSPARENT_HUGEPAGE condition and implement the
other levels. Macros are used to provide the generic p?d_leaf() names.

CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: [email protected]
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 30 ++++++++++++++------
1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b01624e5c467..3dd7b6f5edd0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -923,6 +923,12 @@ static inline int pud_present(pud_t pud)
return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PRESENT));
}

+#define pud_leaf pud_large
+static inline int pud_large(pud_t pud)
+{
+ return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
+
extern struct page *pud_page(pud_t pud);
extern struct page *pmd_page(pmd_t pmd);
static inline pte_t pud_pte(pud_t pud)
@@ -966,6 +972,12 @@ static inline int pgd_present(pgd_t pgd)
return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PRESENT));
}

+#define pgd_leaf pgd_large
+static inline int pgd_large(pgd_t pgd)
+{
+ return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
+}
+
static inline pte_t pgd_pte(pgd_t pgd)
{
return __pte_raw(pgd_raw(pgd));
@@ -1133,6 +1145,15 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write)
return pte_access_permitted(pmd_pte(pmd), write);
}

+#define pmd_leaf pmd_large
+/*
+ * returns true for pmd migration entries, THP, devmap, hugetlb
+ */
+static inline int pmd_large(pmd_t pmd)
+{
+ return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
@@ -1159,15 +1180,6 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
}

-/*
- * returns true for pmd migration entries, THP, devmap, hugetlb
- * But compile time dependent on THP config
- */
-static inline int pmd_large(pmd_t pmd)
-{
- return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
-}
-
static inline pmd_t pmd_mknotpresent(pmd_t pmd)
{
return __pmd(pmd_val(pmd) & ~_PAGE_PRESENT);
--
2.20.1

2019-11-01 15:41:02

by Steven Price

Subject: [PATCH v15 09/23] sparc: mm: Add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_leaf() functions/macros.

For sparc 64 bit, pmd_large() and pud_large() are already provided, so
add macros to provide the p?d_leaf() names required by the generic code.

CC: "David S. Miller" <[email protected]>
CC: [email protected]
Acked-by: David S. Miller <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 6ae8016ef4ec..43206652eaf5 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -683,6 +683,7 @@ static inline unsigned long pte_special(pte_t pte)
return pte_val(pte) & _PAGE_SPECIAL;
}

+#define pmd_leaf pmd_large
static inline unsigned long pmd_large(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
@@ -867,6 +868,7 @@ static inline unsigned long pud_page_vaddr(pud_t pud)
/* only used by the stubbed out hugetlb gup code, should never be called */
#define pgd_page(pgd) NULL

+#define pud_leaf pud_large
static inline unsigned long pud_large(pud_t pud)
{
pte_t pte = __pte(pud_val(pud));
--
2.20.1

2019-11-01 15:43:30

by Steven Price

Subject: [PATCH v15 12/23] mm: pagewalk: Allow walking without vma

Since commit 48684a65b4e3 ("mm: pagewalk: fix misbehavior of
walk_page_range for vma(VM_PFNMAP)"), walk_page_range() will report any
kernel area as a hole, because it lacks a vma.

This means each arch has re-implemented page table walking when needed,
for example in the per-arch ptdump walker.

Remove the requirement to have a vma in the generic code and add a new
function walk_page_range_novma() which ignores the VMAs and simply walks
the page tables.

Signed-off-by: Steven Price <[email protected]>
---
include/linux/pagewalk.h | 5 +++++
mm/pagewalk.c | 44 ++++++++++++++++++++++++++++++++--------
2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 12004b097eae..ed2bb399fac2 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -53,6 +53,7 @@ struct mm_walk_ops {
* @ops: operation to call during the walk
* @mm: mm_struct representing the target process of page table walk
* @vma: vma currently walked (NULL if walking outside vmas)
+ * @no_vma: walk ignoring vmas (vma will always be NULL)
* @private: private data for callbacks' usage
*
* (see the comment on walk_page_range() for more details)
@@ -61,12 +62,16 @@ struct mm_walk {
const struct mm_walk_ops *ops;
struct mm_struct *mm;
struct vm_area_struct *vma;
+ bool no_vma;
void *private;
};

int walk_page_range(struct mm_struct *mm, unsigned long start,
unsigned long end, const struct mm_walk_ops *ops,
void *private);
+int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
+ unsigned long end, const struct mm_walk_ops *ops,
+ void *private);
int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
void *private);

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index fc4d98a3a5a0..626e7fdb0508 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -38,7 +38,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
do {
again:
next = pmd_addr_end(addr, end);
- if (pmd_none(*pmd) || !walk->vma) {
+ if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
if (ops->pte_hole)
err = ops->pte_hole(addr, next, walk);
if (err)
@@ -61,9 +61,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
if (!ops->pte_entry)
continue;

- split_huge_pmd(walk->vma, pmd, addr);
- if (pmd_trans_unstable(pmd))
- goto again;
+ if (walk->vma) {
+ split_huge_pmd(walk->vma, pmd, addr);
+ if (pmd_trans_unstable(pmd))
+ goto again;
+ } else if (pmd_leaf(*pmd)) {
+ continue;
+ }
+
err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
@@ -84,7 +89,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
do {
again:
next = pud_addr_end(addr, end);
- if (pud_none(*pud) || !walk->vma) {
+ if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
if (ops->pte_hole)
err = ops->pte_hole(addr, next, walk);
if (err)
@@ -98,9 +103,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
break;
}

- split_huge_pud(walk->vma, pud, addr);
- if (pud_none(*pud))
- goto again;
+ if (walk->vma) {
+ split_huge_pud(walk->vma, pud, addr);
+ if (pud_none(*pud))
+ goto again;
+ } else if (pud_leaf(*pud)) {
+ continue;
+ }

if (ops->pmd_entry || ops->pte_entry)
err = walk_pmd_range(pud, addr, next, walk);
@@ -358,6 +367,25 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
return err;
}

+int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
+ unsigned long end, const struct mm_walk_ops *ops,
+ void *private)
+{
+ struct mm_walk walk = {
+ .ops = ops,
+ .mm = mm,
+ .private = private,
+ .no_vma = true
+ };
+
+ if (start >= end || !walk.mm)
+ return -EINVAL;
+
+ lockdep_assert_held(&walk.mm->mmap_sem);
+
+ return __walk_page_range(start, end, &walk);
+}
+
int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
void *private)
{
--
2.20.1

2019-11-01 15:44:17

by Steven Price

Subject: [PATCH v15 21/23] arm64: mm: Convert mm/dump.c to use walk_page_range()

Now that walk_page_range() can walk kernel page tables, we can switch
the arm64 ptdump code over to using it, simplifying the code.

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/Kconfig.debug | 19 +----
arch/arm64/include/asm/ptdump.h | 8 +-
arch/arm64/mm/Makefile | 4 +-
arch/arm64/mm/dump.c | 117 ++++++++++-------------------
arch/arm64/mm/mmu.c | 4 +-
arch/arm64/mm/ptdump_debugfs.c | 2 +-
drivers/firmware/efi/arm-runtime.c | 2 +-
8 files changed, 50 insertions(+), 107 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 41a9b4257b72..0f6ad8dabd77 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -104,6 +104,7 @@ config ARM64
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAP
+ select GENERIC_PTDUMP
select GENERIC_SCHED_CLOCK
select GENERIC_SMP_IDLE_THREAD
select GENERIC_STRNCPY_FROM_USER
diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index cf09010d825f..1c906d932d6b 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -1,22 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only

-config ARM64_PTDUMP_CORE
- def_bool n
-
-config ARM64_PTDUMP_DEBUGFS
- bool "Export kernel pagetable layout to userspace via debugfs"
- depends on DEBUG_KERNEL
- select ARM64_PTDUMP_CORE
- select DEBUG_FS
- help
- Say Y here if you want to show the kernel pagetable layout in a
- debugfs file. This information is only useful for kernel developers
- who are working in architecture specific areas of the kernel.
- It is probably not a good idea to enable this feature in a production
- kernel.
-
- If in doubt, say N.
-
config PID_IN_CONTEXTIDR
bool "Write the current PID to the CONTEXTIDR register"
help
@@ -42,7 +25,7 @@ config ARM64_RANDOMIZE_TEXT_OFFSET

config DEBUG_WX
bool "Warn on W+X mappings at boot"
- select ARM64_PTDUMP_CORE
+ select PTDUMP_CORE
---help---
Generate a warning if any W+X mappings are found at boot.

diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 0b8e7269ec82..38187f74e089 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -5,7 +5,7 @@
#ifndef __ASM_PTDUMP_H
#define __ASM_PTDUMP_H

-#ifdef CONFIG_ARM64_PTDUMP_CORE
+#ifdef CONFIG_PTDUMP_CORE

#include <linux/mm_types.h>
#include <linux/seq_file.h>
@@ -21,15 +21,15 @@ struct ptdump_info {
unsigned long base_addr;
};

-void ptdump_walk_pgd(struct seq_file *s, struct ptdump_info *info);
-#ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
+void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
+#ifdef CONFIG_PTDUMP_DEBUGFS
void ptdump_debugfs_register(struct ptdump_info *info, const char *name);
#else
static inline void ptdump_debugfs_register(struct ptdump_info *info,
const char *name) { }
#endif
void ptdump_check_wx(void);
-#endif /* CONFIG_ARM64_PTDUMP_CORE */
+#endif /* CONFIG_PTDUMP_CORE */

#ifdef CONFIG_DEBUG_WX
#define debug_checkwx() ptdump_check_wx()
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..d91030f0ffee 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -4,8 +4,8 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
ioremap.o mmap.o pgd.o mmu.o \
context.o proc.o pageattr.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-obj-$(CONFIG_ARM64_PTDUMP_CORE) += dump.o
-obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS) += ptdump_debugfs.o
+obj-$(CONFIG_PTDUMP_CORE) += dump.o
+obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
obj-$(CONFIG_NUMA) += numa.o
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
KASAN_SANITIZE_physaddr.o += n
diff --git a/arch/arm64/mm/dump.c b/arch/arm64/mm/dump.c
index 93f9f77582ae..9d9b740a86d2 100644
--- a/arch/arm64/mm/dump.c
+++ b/arch/arm64/mm/dump.c
@@ -15,6 +15,7 @@
#include <linux/io.h>
#include <linux/init.h>
#include <linux/mm.h>
+#include <linux/ptdump.h>
#include <linux/sched.h>
#include <linux/seq_file.h>

@@ -75,10 +76,11 @@ static struct addr_marker address_markers[] = {
* dumps out a description of the range.
*/
struct pg_state {
+ struct ptdump_state ptdump;
struct seq_file *seq;
const struct addr_marker *marker;
unsigned long start_address;
- unsigned level;
+ int level;
u64 current_prot;
bool check_wx;
unsigned long wx_pages;
@@ -178,6 +180,10 @@ static struct pg_level pg_level[] = {
.name = "PGD",
.bits = pte_bits,
.num = ARRAY_SIZE(pte_bits),
+ }, { /* p4d */
+ .name = "P4D",
+ .bits = pte_bits,
+ .num = ARRAY_SIZE(pte_bits),
}, { /* pud */
.name = (CONFIG_PGTABLE_LEVELS > 3) ? "PUD" : "PGD",
.bits = pte_bits,
@@ -240,11 +246,15 @@ static void note_prot_wx(struct pg_state *st, unsigned long addr)
st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
}

-static void note_page(struct pg_state *st, unsigned long addr, unsigned level,
- u64 val)
+static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+ unsigned long val)
{
+ struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
static const char units[] = "KMGTPE";
- u64 prot = val & pg_level[level].mask;
+ u64 prot = 0;
+
+ if (level >= 0)
+ prot = val & pg_level[level].mask;

if (!st->level) {
st->level = level;
@@ -292,85 +302,27 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level,

}

-static void walk_pte(struct pg_state *st, pmd_t *pmdp, unsigned long start,
- unsigned long end)
-{
- unsigned long addr = start;
- pte_t *ptep = pte_offset_kernel(pmdp, start);
-
- do {
- note_page(st, addr, 4, READ_ONCE(pte_val(*ptep)));
- } while (ptep++, addr += PAGE_SIZE, addr != end);
-}
-
-static void walk_pmd(struct pg_state *st, pud_t *pudp, unsigned long start,
- unsigned long end)
-{
- unsigned long next, addr = start;
- pmd_t *pmdp = pmd_offset(pudp, start);
-
- do {
- pmd_t pmd = READ_ONCE(*pmdp);
- next = pmd_addr_end(addr, end);
-
- if (pmd_none(pmd) || pmd_sect(pmd)) {
- note_page(st, addr, 3, pmd_val(pmd));
- } else {
- BUG_ON(pmd_bad(pmd));
- walk_pte(st, pmdp, addr, next);
- }
- } while (pmdp++, addr = next, addr != end);
-}
-
-static void walk_pud(struct pg_state *st, pgd_t *pgdp, unsigned long start,
- unsigned long end)
+void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
{
- unsigned long next, addr = start;
- pud_t *pudp = pud_offset(pgdp, start);
-
- do {
- pud_t pud = READ_ONCE(*pudp);
- next = pud_addr_end(addr, end);
-
- if (pud_none(pud) || pud_sect(pud)) {
- note_page(st, addr, 2, pud_val(pud));
- } else {
- BUG_ON(pud_bad(pud));
- walk_pmd(st, pudp, addr, next);
- }
- } while (pudp++, addr = next, addr != end);
-}
+ unsigned long end = ~0UL;
+ struct pg_state st;

-static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
- unsigned long start)
-{
- unsigned long end = (start < TASK_SIZE_64) ? TASK_SIZE_64 : 0;
- unsigned long next, addr = start;
- pgd_t *pgdp = pgd_offset(mm, start);
-
- do {
- pgd_t pgd = READ_ONCE(*pgdp);
- next = pgd_addr_end(addr, end);
-
- if (pgd_none(pgd)) {
- note_page(st, addr, 1, pgd_val(pgd));
- } else {
- BUG_ON(pgd_bad(pgd));
- walk_pud(st, pgdp, addr, next);
- }
- } while (pgdp++, addr = next, addr != end);
-}
+ if (info->base_addr < TASK_SIZE_64)
+ end = TASK_SIZE_64;

-void ptdump_walk_pgd(struct seq_file *m, struct ptdump_info *info)
-{
- struct pg_state st = {
- .seq = m,
+ st = (struct pg_state){
+ .seq = s,
.marker = info->markers,
+ .ptdump = {
+ .note_page = note_page,
+ .range = (struct ptdump_range[]){
+ {info->base_addr, end},
+ {0, 0}
+ }
+ }
};

- walk_pgd(&st, info->mm, info->base_addr);
-
- note_page(&st, 0, 0, 0);
+ ptdump_walk_pgd(&st.ptdump, info->mm);
}

static void ptdump_initialize(void)
@@ -398,10 +350,17 @@ void ptdump_check_wx(void)
{ -1, NULL},
},
.check_wx = true,
+ .ptdump = {
+ .note_page = note_page,
+ .range = (struct ptdump_range[]) {
+ {PAGE_OFFSET, ~0UL},
+ {0, 0}
+ }
+ }
};

- walk_pgd(&st, &init_mm, PAGE_OFFSET);
- note_page(&st, 0, 0, 0);
+ ptdump_walk_pgd(&st.ptdump, &init_mm);
+
if (st.wx_pages || st.uxn_pages)
pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
st.wx_pages, st.uxn_pages);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 60c929f3683b..6f12951c8052 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -944,13 +944,13 @@ int __init arch_ioremap_pud_supported(void)
* SW table walks can't handle removal of intermediate entries.
*/
return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
- !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
+ !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}

int __init arch_ioremap_pmd_supported(void)
{
/* See arch_ioremap_pud_supported() */
- return !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
+ return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}

int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f25592..1f2eae3e988b 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -7,7 +7,7 @@
static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
- ptdump_walk_pgd(m, info);
+ ptdump_walk(m, info);
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/drivers/firmware/efi/arm-runtime.c b/drivers/firmware/efi/arm-runtime.c
index e2ac5fa5531b..1283685f9c20 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -27,7 +27,7 @@

extern u64 efi_system_table;

-#ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
+#if defined(CONFIG_PTDUMP_DEBUGFS) && defined(CONFIG_ARM64)
#include <asm/ptdump.h>

static struct ptdump_info efi_ptdump_info = {
--
2.20.1

2019-11-04 19:38:50

by Qian Cai

Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump

On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
> Many architectures currently have a debugfs file for dumping the kernel
> page tables. Each architecture has to implement custom functions for
> this because the details of walking the page tables used by the kernel
> differ between architectures.
>
> [...]

Does this new version also take care of the boot crash seen with v14? I
suppose it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

[   10.550313][    T0] Switched APIC routing to physical flat.
[   10.563899][    T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[   10.614633][    T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fa6f481074, max_idle_ns: 440795311917 ns
[   10.625979][    T0] Calibrating delay loop (skipped), value calculated using timer frequency.. 4391.73 BogoMIPS (lpj=21958690)
[   10.635990][    T0] pid_max: default: 131072 minimum: 1024
[   11.259736][    T0] ---[ User Space ]---
[   11.263737][    T0] 0x0000000000000000-0x0000000000001000           4K     RW                     x  pte
[   11.266028][    T0] 0x0000000000001000-0x0000000000200000        2044K                               pte
[   11.275992][    T0] 0x0000000000200000-0x0000000004000000          62M                               pmd
[   11.285998][    T0] 0x0000000004000000-0x0000000004076000         472K                               pte
[   11.296019][    T0] 0x0000000004076000-0x0000000004200000        1576K                               pte
[   11.305997][    T0] 0x0000000004200000-0x0000000011000000         206M                               pmd
[   11.316008][    T0] 0x0000000011000000-0x0000000011100000           1M                               pte
[   11.326008][    T0] 0x0000000011100000-0x0000000011200000           1M                               pte
[   11.335990][    T0] 0x0000000011200000-0x0000000011800000           6M                               pmd
[   11.346054][    T0] ==================================================================
[   11.354074][    T0] BUG: KASAN: wild-memory-access in ptdump_pte_entry+0x39/0x60
[   11.355975][    T0] Read of size 8 at addr 000f887fee5ff000 by task swapper/0/0
[   11.355975][    T0]
[   11.355975][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc5-mm1+ #1
[   11.355975][    T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   11.355975][    T0] Call Trace:
[   11.355975][    T0]  dump_stack+0xa0/0xea
[   11.355975][    T0]  __kasan_report.cold.7+0xb0/0xc0
[   11.355975][    T0]  ? note_page+0x7f8/0xa70
[   11.355975][    T0]  ? ptdump_pte_entry+0x39/0x60
[   11.355975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.355975][    T0]  kasan_report+0x12/0x20
[   11.355975][    T0]  __asan_load8+0x71/0xa0
[   11.355975][    T0]  ptdump_pte_entry+0x39/0x60
[   11.355975][    T0]  walk_pgd_range+0xb75/0xce0
[   11.355975][    T0]  __walk_page_range+0x206/0x230
[   11.355975][    T0]  ? vmacache_find+0x3a/0x170
[   11.355975][    T0]  walk_page_range+0x136/0x210
[   11.355975][    T0]  ? __walk_page_range+0x230/0x230
[   11.355975][    T0]  ? find_held_lock+0xca/0xf0
[   11.355975][    T0]  ptdump_walk_pgd+0x76/0xd0
[   11.355975][    T0]  ptdump_walk_pgd_level_core+0x13b/0x1b0
[   11.355975][    T0]  ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[   11.355975][    T0]  ? trace_hardirqs_on+0x3a/0x160
[   11.355975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.355975][    T0]  ? efi_delete_dummy_variable+0xa9/0xd0
[   11.355975][    T0]  ? __enc_copy+0x90/0x90
[   11.355975][    T0]  ptdump_walk_pgd_level+0x15/0x20
[   11.355975][    T0]  efi_dump_pagetable+0x35/0x37
[   11.355975][    T0]  efi_enter_virtual_mode+0x72a/0x737
[   11.355975][    T0]  start_kernel+0x607/0x6a9
[   11.355975][    T0]  ? thread_stack_cache_init+0xb/0xb
[   11.355975][    T0]  ? idt_setup_from_table+0xd9/0x130
[   11.355975][    T0]  x86_64_start_reservations+0x24/0x26
[   11.355975][    T0]  x86_64_start_kernel+0xf4/0xfb
[   11.355975][    T0]  secondary_startup_64+0xb6/0xc0
[   11.355975][    T0] ==================================================================
[   11.355975][    T0] Disabling lock debugging due to kernel taint
[   11.355991][    T0] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   11.364049][    T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B             5.4.0-rc5-mm1+ #1
[   11.365975][    T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   11.365975][    T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[   11.365975][    T0] Code: 55 41 54 49 89 fc 48 8d 79 18 53 48 89 cb e8 5e 0e fa ff 48 8b 5b 18 48 89 df e8 52 0e fa ff 4c 89 e7 4c 8b 2b e8 47 0e fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 05 00 00 00 e8 03 1d 9b 00 31 c0
[   11.365975][    T0] RSP: 0000:ffffffffaf8079d0 EFLAGS: 00010282
[   11.365975][    T0] RAX: 0000000000000000 RBX: ffffffffaf807cf0 RCX: ffffffffae374306
[   11.365975][    T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffffafef2bf4
[   11.365975][    T0] RBP: ffffffffaf8079f0 R08: fffffbfff5fdbb22 R09: fffffbfff5fdbb22
[   11.365975][    T0] R10: fffffbfff5fdbb21 R11: ffffffffafedd90b R12: 000f887fee5ff000
[   11.365975][    T0] R13: ffffffffae2aee40 R14: 0000000011a00000 R15: 0000000011a01000
[   11.365975][    T0] FS:  0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[   11.365975][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.365975][    T0] CR2: ffff8890779ff000 CR3: 0000000baf412000 CR4: 00000000000406b0
[   11.365975][    T0] Call Trace:
[   11.365975][    T0]  walk_pgd_range+0xb75/0xce0
[   11.365975][    T0]  __walk_page_range+0x206/0x230
[   11.365975][    T0]  ? vmacache_find+0x3a/0x170
[   11.365975][    T0]  walk_page_range+0x136/0x210
[   11.365975][    T0]  ? __walk_page_range+0x230/0x230
[   11.365975][    T0]  ? find_held_lock+0xca/0xf0
[   11.365975][    T0]  ptdump_walk_pgd+0x76/0xd0
[   11.365975][    T0]  ptdump_walk_pgd_level_core+0x13b/0x1b0
[   11.365975][    T0]  ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[   11.365975][    T0]  ? trace_hardirqs_on+0x3a/0x160
[   11.365975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.365975][    T0]  ? efi_delete_dummy_variable+0xa9/0xd0
[   11.365975][    T0]  ? __enc_copy+0x90/0x90
[   11.365975][    T0]  ptdump_walk_pgd_level+0x15/0x20
[   11.365975][    T0]  efi_dump_pagetable+0x35/0x37
[   11.365975][    T0]  efi_enter_virtual_mode+0x72a/0x737
[   11.365975][    T0]  start_kernel+0x607/0x6a9
[   11.365975][    T0]  ? thread_stack_cache_init+0xb/0xb
[   11.365975][    T0]  ? idt_setup_from_table+0xd9/0x130
[   11.365975][    T0]  x86_64_start_reservations+0x24/0x26
[   11.365975][    T0]  x86_64_start_kernel+0xf4/0xfb
[   11.365975][    T0]  secondary_startup_64+0xb6/0xc0
[   11.365975][    T0] Modules linked in:
[   11.365988][    T0] ---[ end trace 8e90dc89e2468d55 ]---
[   11.375984][    T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[   11.381335][    T0] Code: 55 41 54 49 89 fc 48 8d 79 18 53 48 89 cb e8 5e 0e fa ff 48 8b 5b 18 48 89 df e8 52 0e fa ff 4c 89 e7 4c 8b 2b e8 47 0e fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 05 00 00 00 e8 03 1d 9b 00 31 c0
[   11.385982][    T0] RSP: 0000:ffffffffaf8079d0 EFLAGS: 00010282
[   11.395982][    T0] RAX: 0000000000000000 RBX: ffffffffaf807cf0 RCX: ffffffffae374306
[   11.403864][    T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffffafef2bf4
[   11.405982][    T0] RBP: ffffffffaf8079f0 R08: fffffbfff5fdbb22 R09: fffffbfff5fdbb22
[   11.415982][    T0] R10: fffffbfff5fdbb21 R11: ffffffffafedd90b R12: 000f887fee5ff000
[   11.425982][    T0] R13: ffffffffae2aee40 R14: 0000000011a00000 R15: 0000000011a01000
[   11.435982][    T0] FS:  0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[   11.445982][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.452466][    T0] CR2: ffff8890779ff000 CR3: 0000000baf412000 CR4: 00000000000406b0
[   11.455981][    T0] Kernel panic - not syncing: Fatal exception
[   11.462246][    T0] ---[ end Kernel panic - not syncing: Fatal exception ]---

>
> Changes since v13:
> https://lore.kernel.org/lkml/[email protected]/
> * Fixed typo in arc definition of pmd_leaf() spotted by the kbuild test
> robot
> * Added tags
>
> Changes since v12:
> https://lore.kernel.org/lkml/[email protected]/
> * Correct code format in riscv pud_leaf()/pmd_leaf()
> * v12 may not have reached everyone because of mail server problems
> (which are now hopefully resolved!)
>
> Changes since v11:
> https://lore.kernel.org/lkml/[email protected]/
> * Use "-1" as dummy depth parameter in patch 14.
>
> Changes since v10:
> https://lore.kernel.org/lkml/[email protected]/
> * Rebased to v5.4-rc1 - mainly various updates to deal with the
> splitting out of ops from struct mm_walk.
> * Deal with PGD_LEVEL_MULT not always being constant on x86.
>
> Changes since v9:
> https://lore.kernel.org/lkml/[email protected]/
> * Moved generic macros to first page in the series and explained the
> macro naming in the commit message.
> * mips: Moved macros to pgtable.h as they are now valid for both 32 and 64
> bit
> * x86: Dropped patch which changed the debugfs output for x86, instead
> we have...
> * new patch adding 'depth' parameter to pte_hole. This is used to
> provide the necessary information to output lines for 'holes' in the
> debugfs files
> * new patch changing arm64 debugfs output to include holes to match x86
> * generic ptdump KASAN handling has been simplified and now works with
> CONFIG_DEBUG_VIRTUAL.
>
> Changes since v8:
> https://lore.kernel.org/lkml/[email protected]/
> * Rename from p?d_large() to p?d_leaf()
> * Dropped patches migrating arm64/x86 custom walkers to
> walk_page_range() in favour of adding a generic PTDUMP implementation
> and migrating arm64/x86 to that instead.
> * Rebased to v5.3-rc1
>
> Steven Price (23):
> mm: Add generic p?d_leaf() macros
> arc: mm: Add p?d_leaf() definitions
> arm: mm: Add p?d_leaf() definitions
> arm64: mm: Add p?d_leaf() definitions
> mips: mm: Add p?d_leaf() definitions
> powerpc: mm: Add p?d_leaf() definitions
> riscv: mm: Add p?d_leaf() definitions
> s390: mm: Add p?d_leaf() definitions
> sparc: mm: Add p?d_leaf() definitions
> x86: mm: Add p?d_leaf() definitions
> mm: pagewalk: Add p4d_entry() and pgd_entry()
> mm: pagewalk: Allow walking without vma
> mm: pagewalk: Add test_p?d callbacks
> mm: pagewalk: Add 'depth' parameter to pte_hole
> x86: mm: Point to struct seq_file from struct pg_state
> x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
> x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
> x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
> mm: Add generic ptdump
> x86: mm: Convert dump_pagetables to use walk_page_range
> arm64: mm: Convert mm/dump.c to use walk_page_range()
> arm64: mm: Display non-present entries in ptdump
> mm: ptdump: Reduce level numbers by 1 in note_page()
>
> arch/arc/include/asm/pgtable.h | 1 +
> arch/arm/include/asm/pgtable-2level.h | 1 +
> arch/arm/include/asm/pgtable-3level.h | 1 +
> arch/arm64/Kconfig | 1 +
> arch/arm64/Kconfig.debug | 19 +-
> arch/arm64/include/asm/pgtable.h | 2 +
> arch/arm64/include/asm/ptdump.h | 8 +-
> arch/arm64/mm/Makefile | 4 +-
> arch/arm64/mm/dump.c | 148 +++-----
> arch/arm64/mm/mmu.c | 4 +-
> arch/arm64/mm/ptdump_debugfs.c | 2 +-
> arch/mips/include/asm/pgtable.h | 5 +
> arch/powerpc/include/asm/book3s/64/pgtable.h | 30 +-
> arch/riscv/include/asm/pgtable-64.h | 7 +
> arch/riscv/include/asm/pgtable.h | 7 +
> arch/s390/include/asm/pgtable.h | 2 +
> arch/sparc/include/asm/pgtable_64.h | 2 +
> arch/x86/Kconfig | 1 +
> arch/x86/Kconfig.debug | 20 +-
> arch/x86/include/asm/pgtable.h | 10 +-
> arch/x86/mm/Makefile | 4 +-
> arch/x86/mm/debug_pagetables.c | 8 +-
> arch/x86/mm/dump_pagetables.c | 343 +++++--------------
> arch/x86/platform/efi/efi_32.c | 2 +-
> arch/x86/platform/efi/efi_64.c | 4 +-
> drivers/firmware/efi/arm-runtime.c | 2 +-
> fs/proc/task_mmu.c | 4 +-
> include/asm-generic/pgtable.h | 20 ++
> include/linux/pagewalk.h | 42 ++-
> include/linux/ptdump.h | 22 ++
> mm/Kconfig.debug | 21 ++
> mm/Makefile | 1 +
> mm/hmm.c | 8 +-
> mm/migrate.c | 5 +-
> mm/mincore.c | 1 +
> mm/pagewalk.c | 126 +++++--
> mm/ptdump.c | 151 ++++++++
> 37 files changed, 586 insertions(+), 453 deletions(-)
> create mode 100644 include/linux/ptdump.h
> create mode 100644 mm/ptdump.c
>

2019-11-06 13:34:31

by Qian Cai

Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump



> On Nov 4, 2019, at 2:35 PM, Qian Cai <[email protected]> wrote:
>
> On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
>> Many architectures current have a debugfs file for dumping the kernel
>> page tables. Currently each architecture has to implement custom
>> functions for this because the details of walking the page tables used
>> by the kernel are different between architectures.
>>
>> This series extends the capabilities of walk_page_range() so that it can
>> deal with the page tables of the kernel (which have no VMAs and can
>> contain larger huge pages than exist for user space). A generic PTDUMP
>> implementation is the implemented making use of the new functionality of
>> walk_page_range() and finally arm64 and x86 are switch to using it,
>> removing the custom table walkers.
>>
>> To enable a generic page table walker to walk the unusual mappings of
>> the kernel we need to implement a set of functions which let us know
>> when the walker has reached the leaf entry. After a suggestion from Will
>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>> the purpose (and is a new name so has no historic baggage). Some
>> architectures have p?d_large macros but this is easily confused with
>> "large pages".
>>
>> This series ends with a generic PTDUMP implemention for arm64 and x86.
>>
>> Mostly this is a clean up and there should be very little functional
>> change. The exceptions are:
>>
>> * arm64 PTDUMP debugfs now displays pages which aren't present (patch 22).
>>
>> * arm64 has the ability to efficiently process KASAN pages (which
>> previously only x86 implemented). This means that the combination of
>> KASAN and DEBUG_WX is now useable.
>>
>> Also available as a git tree:
>> git://linux-arm.org/linux-sp.git walk_page_range/v15
>>
>> Changes since v14:
>> https://lore.kernel.org/lkml/[email protected]/
>> * Switch walk_page_range() into two functions, the existing
>> walk_page_range() now still requires VMAs (and treats areas without a
>> VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>> will report the actual page table layout. This fixes the previous
>> breakage of /proc/<pid>/pagemap
>> * New patch at the end of the series which reduces the 'level' numbers
>> by 1 to simplify the code slightly
>> * Added tags
>
> Does this new version also take care of this boot crash seen with v14? Suppose
> it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,
>
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>

V15 is indeed DOA here.

[ 10.957006][ T0] pid_max: default: 131072 minimum: 1024
[ 11.543186][ T0] ---[ User Space ]---
[ 11.547009][ T0] 0x0000000000000000-0x0000000000001000 4K RW x pte
[ 11.556612][ T0] 0x0000000000001000-0x0000000000200000 2044K pte
[ 11.557008][ T0] 0x0000000000200000-0x0000000004000000 62M pmd
[ 11.567014][ T0] 0x0000000004000000-0x0000000004076000 472K pte
[ 11.577033][ T0] 0x0000000004076000-0x0000000004200000 1576K pte
[ 11.587013][ T0] 0x0000000004200000-0x0000000011000000 206M pmd
[ 11.597023][ T0] 0x0000000011000000-0x0000000011100000 1M pte
[ 11.607023][ T0] 0x0000000011100000-0x0000000011200000 1M pte
[ 11.617006][ T0] 0x0000000011200000-0x0000000011800000 6M pmd
[ 11.627068][ T0] ==================================================================
[ 11.635087][ T0] BUG: KASAN: wild-memory-access in ptdump_pte_entry+0x39/0x60
[ 11.636992][ T0] Read of size 8 at addr 000f887fee5ff000 by task swapper/0/0
[ 11.636992][ T0]
[ 11.636992][ T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc6-next-20191106+ #6
[ 11.636992][ T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[ 11.636992][ T0] Call Trace:
[ 11.636992][ T0] dump_stack+0xa0/0xea
[ 11.636992][ T0] __kasan_report.cold.7+0xb0/0xc0
[ 11.636992][ T0] ? note_page+0x6a9/0xa70
[ 11.636992][ T0] ? ptdump_pte_entry+0x39/0x60
[ 11.636992][ T0] ? ptdump_walk_pgd_level_core+0x1e0/0x1e0
[ 11.636992][ T0] kasan_report+0x12/0x20
[ 11.636992][ T0] __asan_load8+0x71/0xa0
[ 11.636992][ T0] ptdump_pte_entry+0x39/0x60
[ 11.636992][ T0] walk_pgd_range+0x9e5/0xdb0
[ 11.636992][ T0] __walk_page_range+0x206/0x230
[ 11.636992][ T0] walk_page_range_novma+0xc5/0x130
[ 11.636992][ T0] ? walk_page_range+0x220/0x220
[ 11.636992][ T0] ptdump_walk_pgd+0x76/0xd0
[ 11.636992][ T0] ptdump_walk_pgd_level_core+0x169/0x1e0
[ 11.636992][ T0] ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[ 11.636992][ T0] ? trace_hardirqs_on+0x3a/0x160
[ 11.636992][ T0] ? ptdump_walk_pgd_level_core+0x1e0/0x1e0
[ 11.636992][ T0] ? efi_delete_dummy_variable+0xa9/0xd0
[ 11.636992][ T0] ? __enc_copy+0x90/0x90
[ 11.636992][ T0] ptdump_walk_pgd_level+0x15/0x20
[ 11.636992][ T0] efi_dump_pagetable+0x35/0x37
[ 11.636992][ T0] efi_enter_virtual_mode+0x72a/0x737
[ 11.636992][ T0] start_kernel+0x607/0x6a9
[ 11.636992][ T0] ? thread_stack_cache_init+0xb/0xb
[ 11.636992][ T0] ? idt_setup_from_table+0xd9/0x130
[ 11.636992][ T0] x86_64_start_reservations+0x24/0x26
[ 11.636992][ T0] x86_64_start_kernel+0xf4/0xfb
[ 11.636992][ T0] secondary_startup_64+0xb6/0xc0
[ 11.636992][ T0] ==================================================================
[ 11.636992][ T0] Disabling lock debugging due to kernel taint
[ 11.637009][ T0] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 11.645067][ T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 5.4.0-rc6-next-20191106+ #6
[ 11.646992][ T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[ 11.646992][ T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[ 11.646992][ T0] Code: 55 41 54 49 89 fc 48 8d 79 20 53 48 89 cb e8 8e 9d fa ff 48 8b 5b 20 48 89 df e8 82 9d fa ff 4c 89 e7 4c 8b 2b e8 77 9d fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 04 00 00 00 e8 f3 8d 9b 00 31 c0
[ 11.646992][ T0] RSP: 0000:ffffffff8a2079f0 EFLAGS: 00010286
[ 11.646992][ T0] RAX: 0000000000000000 RBX: ffffffff8a207cf0 RCX: ffffffff88d74576
[ 11.646992][ T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffff8a8f53d4
[ 11.646992][ T0] RBP: ffffffff8a207a10 R08: fffffbfff151c01a R09: fffffbfff151c01a
[ 11.646992][ T0] R10: fffffbfff151c019 R11: ffffffff8a8e00cb R12: 000f887fee5ff000
[ 11.646992][ T0] R13: ffffffff88caf040 R14: 0000000011a00000 R15: ffffffff89cfdcc0
[ 11.646992][ T0] FS: 0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[ 11.646992][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 11.646992][ T0] CR2: ffff8890779ff000 CR3: 0000000c54a12000 CR4: 00000000000406b0
[ 11.646992][ T0] Call Trace:
[ 11.646992][ T0] walk_pgd_range+0x9e5/0xdb0
[ 11.646992][ T0] __walk_page_range+0x206/0x230
[ 11.646992][ T0] walk_page_range_novma+0xc5/0x130
[ 11.646992][ T0] ? walk_page_range+0x220/0x220
[ 11.646992][ T0] ptdump_walk_pgd+0x76/0xd0
[ 11.646992][ T0] ptdump_walk_pgd_level_core+0x169/0x1e0
[ 11.646992][ T0] ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[ 11.646992][ T0] ? trace_hardirqs_on+0x3a/0x160
[ 11.646992][ T0] ? ptdump_walk_pgd_level_core+0x1e0/0x1e0
[ 11.646992][ T0] ? efi_delete_dummy_variable+0xa9/0xd0
[ 11.646992][ T0] ? __enc_copy+0x90/0x90
[ 11.646992][ T0] ptdump_walk_pgd_level+0x15/0x20
[ 11.646992][ T0] efi_dump_pagetable+0x35/0x37
[ 11.646992][ T0] efi_enter_virtual_mode+0x72a/0x737
[ 11.646992][ T0] start_kernel+0x607/0x6a9
[ 11.646992][ T0] ? thread_stack_cache_init+0xb/0xb
[ 11.646992][ T0] ? idt_setup_from_table+0xd9/0x130
[ 11.646992][ T0] x86_64_start_reservations+0x24/0x26
[ 11.646992][ T0] x86_64_start_kernel+0xf4/0xfb
[ 11.646992][ T0] secondary_startup_64+0xb6/0xc0
[ 11.646992][ T0] Modules linked in:
[ 11.647003][ T0] ---[ end trace 751e8882de194a93 ]---
[ 11.652355][ T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[ 11.657001][ T0] Code: 55 41 54 49 89 fc 48 8d 79 20 53 48 89 cb e8 8e 9d fa ff 48 8b 5b 20 48 89 df e8 82 9d fa ff 4c 89 e7 4c 8b 2b e8 77 9d fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 04 00 00 00 e8 f3 8d 9b 00 31 c0
[ 11.666998][ T0] RSP: 0000:ffffffff8a2079f0 EFLAGS: 00010286
[ 11.672961][ T0] RAX: 0000000000000000 RBX: ffffffff8a207cf0 RCX: ffffffff88d74576
[ 11.676998][ T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffff8a8f53d4
[ 11.686998][ T0] RBP: ffffffff8a207a10 R08: fffffbfff151c01a R09: fffffbfff151c01a
[ 11.696998][ T0] R10: fffffbfff151c019 R11: ffffffff8a8e00cb R12: 000f887fee5ff000
[ 11.704882][ T0] R13: ffffffff88caf040 R14: 0000000011a00000 R15: ffffffff89cfdcc0
[ 11.706999][ T0] FS: 0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[ 11.716998][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 11.726998][ T0] CR2: ffff8890779ff000 CR3: 0000000c54a12000 CR4: 00000000000406b0
[ 11.736998][ T0] Kernel panic - not syncing: Fatal exception
[ 11.743272][ T0] ---[ end Kernel panic - not syncing: Fatal exception ]---

>> [...]

2019-11-06 15:09:21

by Steven Price

Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump

On 06/11/2019 13:31, Qian Cai wrote:
>
>
>> On Nov 4, 2019, at 2:35 PM, Qian Cai <[email protected]> wrote:
>>
>> On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
[...]
>>> Changes since v14:
>>> https://lore.kernel.org/lkml/[email protected]/
>>> * Switch walk_page_range() into two functions, the existing
>>> walk_page_range() now still requires VMAs (and treats areas without a
>>> VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>>> will report the actual page table layout. This fixes the previous
>>> breakage of /proc/<pid>/pagemap
>>> * New patch at the end of the series which reduces the 'level' numbers
>>> by 1 to simplify the code slightly
>>> * Added tags
>>
>> Does this new version also take care of this boot crash seen with v14? Suppose
>> it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,
>>
>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>>
>
> V15 is indeed DOA here.

Thanks for finding this; it looks like EFI causes issues here. The below fixes
this for me (booting in QEMU).

Andrew: do you want me to send out the entire series again for this fix, or
can you squash this into mm-pagewalk-allow-walking-without-vma.patch?

Thanks,

Steve

---8<---
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c7529dc4f82b..70dcaa23598f 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -90,7 +90,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
split_huge_pmd(walk->vma, pmd, addr);
if (pmd_trans_unstable(pmd))
goto again;
- } else if (pmd_leaf(*pmd)) {
+ } else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) {
continue;
}

@@ -141,7 +141,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
split_huge_pud(walk->vma, pud, addr);
if (pud_none(*pud))
goto again;
- } else if (pud_leaf(*pud)) {
+ } else if (pud_leaf(*pud) || !pud_present(*pud)) {
continue;
}

2019-12-03 11:03:26

by David Hildenbrand

Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump

On 06.11.19 16:05, Steven Price wrote:
> On 06/11/2019 13:31, Qian Cai wrote:
>>
>> [...]
>>
>> V15 is indeed DOA here.
>
> Thanks for finding this, it looks like EFI causes issues here. The below fixes
> this for me (booting in QEMU).
>
> Andrew: do you want me to send out the entire series again for this fix, or
> can you squash this into mm-pagewalk-allow-walking-without-vma.patch?
>
> Thanks,
>
> Steve
>
> ---8<---
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index c7529dc4f82b..70dcaa23598f 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -90,7 +90,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> split_huge_pmd(walk->vma, pmd, addr);
> if (pmd_trans_unstable(pmd))
> goto again;
> - } else if (pmd_leaf(*pmd)) {
> + } else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) {
> continue;
> }
>
> @@ -141,7 +141,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> split_huge_pud(walk->vma, pud, addr);
> if (pud_none(*pud))
> goto again;
> - } else if (pud_leaf(*pud)) {
> + } else if (pud_leaf(*pud) || !pud_present(*pud)) {
> continue;
> }
>
>

Even with this fix, booting for me under QEMU fails. See

https://lore.kernel.org/linux-mm/[email protected]/

--
Thanks,

David / dhildenb

2019-12-04 14:56:09

by Qian Cai

Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump



> On Dec 3, 2019, at 6:02 AM, David Hildenbrand <[email protected]> wrote:
>
> On 06.11.19 16:05, Steven Price wrote:
>> [...]
>
> Even with this fix, booting for me under QEMU fails. See
>
> https://lore.kernel.org/linux-mm/[email protected]/
>

Yes, for some reason, this starts to crash on almost all arches here, so it might be worth
having Andrew revert those in the meantime to allow Steven to rework the series.

2019-12-04 16:34:53

by Steven Price

Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump

On Wed, Dec 04, 2019 at 02:56:58PM +0000, David Hildenbrand wrote:
> On 04.12.19 15:54, Qian Cai wrote:
> >
> >
> >> On Dec 3, 2019, at 6:02 AM, David Hildenbrand <[email protected]> wrote:
> >> [...]
> >>
> >> Even with this fix, booting for me under QEMU fails. See
> >>
> >> https://lore.kernel.org/linux-mm/[email protected]/
> >>
> >
> > Yes, for some reasons, this starts to crash on almost all arches here, so it might be worth
> > for Andrew to revert those in the meantime while allowing Steven to rework.
>
> I agree, this produces too much noise.

I've bisected this problem and it's a merge conflict with:

ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in walk_pte_range()")

Reverting that commit "fixes" the problem. That commit adds a call to
pte_offset_map_lock(), however that isn't necessarily safe when
considering an "unusual" mapping in the kernel. Combined with my patch
set this leads to the BUG when walking the kernel's page tables.

At this stage I think it's best if Andrew drops my series and I'll try
to rework it on top of -rc1, fixing up this conflict and the other x86
32-bit issue that has cropped up.

Steve

2019-12-04 17:13:20

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump

On 04.12.19 15:54, Qian Cai wrote:
>
>
>> On Dec 3, 2019, at 6:02 AM, David Hildenbrand <[email protected]> wrote:
>>
>> On 06.11.19 16:05, Steven Price wrote:
>>> On 06/11/2019 13:31, Qian Cai wrote:
>>>>
>>>>
>>>>> On Nov 4, 2019, at 2:35 PM, Qian Cai <[email protected]> wrote:
>>>>>
>>>>> On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
>>> [...]
>>>>>> Changes since v14:
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>> * Switch walk_page_range() into two functions, the existing
>>>>>> walk_page_range() now still requires VMAs (and treats areas without a
>>>>>> VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>>>>>> will report the actual page table layout. This fixes the previous
>>>>>> breakage of /proc/<pid>/pagemap
>>>>>> * New patch at the end of the series which reduces the 'level' numbers
>>>>>> by 1 to simplify the code slightly
>>>>>> * Added tags
>>>>>
>>>>> Does this new version also take care of this boot crash seen with v14? Suppose
>>>>> it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,
>>>>>
>>>>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>>>>>
>>>>
>>>> V15 is indeed DOA here.
>>>
>>> Thanks for finding this, it looks like EFI causes issues here. The below fixes
>>> this for me (booting in QEMU).
>>>
>>> Andrew: do you want me to send out the entire series again for this fix, or
>>> can you squash this into mm-pagewalk-allow-walking-without-vma.patch?
>>>
>>> Thanks,
>>>
>>> Steve
>>>
>>> ---8<---
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index c7529dc4f82b..70dcaa23598f 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -90,7 +90,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>> split_huge_pmd(walk->vma, pmd, addr);
>>> if (pmd_trans_unstable(pmd))
>>> goto again;
>>> - } else if (pmd_leaf(*pmd)) {
>>> + } else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) {
>>> continue;
>>> }
>>>
>>> @@ -141,7 +141,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>>> split_huge_pud(walk->vma, pud, addr);
>>> if (pud_none(*pud))
>>> goto again;
>>> - } else if (pud_leaf(*pud)) {
>>> + } else if (pud_leaf(*pud) || !pud_present(*pud)) {
>>> continue;
>>> }
>>>
>>>
>>
>> Even with this fix, booting for me under QEMU fails. See
>>
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>
> Yes, for some reason, this starts to crash on almost all arches here, so it
> might be worthwhile for Andrew to revert those in the meantime while allowing
> Steven to rework.

I agree, this produces too much noise.


--
Thanks,

David / dhildenb

2019-12-04 17:53:51

by Thomas Hellstrom

[permalink] [raw]
Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump

On 12/4/19 5:32 PM, Steven Price wrote:
> On Wed, Dec 04, 2019 at 02:56:58PM +0000, David Hildenbrand wrote:
>> On 04.12.19 15:54, Qian Cai wrote:
>>>
>>>> On Dec 3, 2019, at 6:02 AM, David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 06.11.19 16:05, Steven Price wrote:
>>>>> On 06/11/2019 13:31, Qian Cai wrote:
>>>>>>
>>>>>>> On Nov 4, 2019, at 2:35 PM, Qian Cai <[email protected]> wrote:
>>>>>>>
>>>>>>> On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
>>>>> [...]
>>>>>>>> Changes since v14:
>>>>>>>> https://lore.kernel.org/lkml/20191028135910.33253-1-steven.price@arm.com/
>>>>>>>> * Switch walk_page_range() into two functions, the existing
>>>>>>>> walk_page_range() now still requires VMAs (and treats areas without a
>>>>>>>> VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
>>>>>>>> will report the actual page table layout. This fixes the previous
>>>>>>>> breakage of /proc/<pid>/pagemap
>>>>>>>> * New patch at the end of the series which reduces the 'level' numbers
>>>>>>>> by 1 to simplify the code slightly
>>>>>>>> * Added tags
>>>>>>> Does this new version also take care of this boot crash seen with v14? Suppose
>>>>>>> it is now breaking CONFIG_EFI_PGT_DUMP=y? The full config is,
>>>>>>>
>>>>>>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>>>>>>>
>>>>>> V15 is indeed DOA here.
>>>>> Thanks for finding this, it looks like EFI causes issues here. The below fixes
>>>>> this for me (booting in QEMU).
>>>>>
>>>>> Andrew: do you want me to send out the entire series again for this fix, or
>>>>> can you squash this into mm-pagewalk-allow-walking-without-vma.patch?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Steve
>>>>>
>>>>> ---8<---
>>>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>>>> index c7529dc4f82b..70dcaa23598f 100644
>>>>> --- a/mm/pagewalk.c
>>>>> +++ b/mm/pagewalk.c
>>>>> @@ -90,7 +90,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>>>> split_huge_pmd(walk->vma, pmd, addr);
>>>>> if (pmd_trans_unstable(pmd))
>>>>> goto again;
>>>>> - } else if (pmd_leaf(*pmd)) {
>>>>> + } else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) {
>>>>> continue;
>>>>> }
>>>>>
>>>>> @@ -141,7 +141,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>>>>> split_huge_pud(walk->vma, pud, addr);
>>>>> if (pud_none(*pud))
>>>>> goto again;
>>>>> - } else if (pud_leaf(*pud)) {
>>>>> + } else if (pud_leaf(*pud) || !pud_present(*pud)) {
>>>>> continue;
>>>>> }
>>>>>
>>>>>
>>>> Even with this fix, booting for me under QEMU fails. See
>>>>
>>>> https://lore.kernel.org/linux-mm/b7ce62f2-9a48-6e48-6685-003431e521aa@redhat.com/
>>>>
>>> Yes, for some reason, this starts to crash on almost all arches here, so it
>>> might be worthwhile for Andrew to revert those in the meantime while allowing
>>> Steven to rework.
>> I agree, this produces too much noise.
> I've bisected this problem and it's a merge conflict with:
>
> ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in walk_pte_range()")
>
> Reverting that commit "fixes" the problem. That commit adds a call to
> pte_offset_map_lock(), however that isn't necessarily safe when
> considering an "unusual" mapping in the kernel. Combined with my patch
> set this leads to the BUG when walking the kernel's page tables.
>
> At this stage I think it's best if Andrew drops my series and I'll try
> to rework it on top -rc1 fixing up this conflict and the other x86
> 32-bit issue that has cropped up.

Hi,

Unfortunately I wasn't aware of that conflict.

Perhaps something similar to this

https://elixir.bootlin.com/linux/v5.4/source/mm/memory.c#L2012

would fix at least this particular issue?
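For reference, the pattern at that location (apply_to_pte_range() in v5.4)
avoids the page table lock for the kernel's own mappings by special-casing
init_mm, roughly as follows (an abridged sketch of the mm/memory.c code, not
a complete function):

```c
/* Sketch of the v5.4 apply_to_pte_range() locking pattern (mm/memory.c). */
pte = (mm == &init_mm) ?
	pte_alloc_kernel(pmd, addr) :
	pte_alloc_map_lock(mm, pmd, addr, &ptl);
/* ... walk the PTEs ... */
if (mm != &init_mm)
	pte_unmap_unlock(pte - 1, ptl);
```

A similar mm == &init_mm test in walk_pte_range() would presumably let the
generic walker skip pte_offset_map_lock() for kernel page tables.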

/Thomas




>
> Steve
>

2019-12-05 13:16:07

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump



> On Dec 4, 2019, at 11:32 AM, Steven Price <[email protected]> wrote:
>
> I've bisected this problem and it's a merge conflict with:
>
> ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in walk_pte_range()")

Sigh, how did that commit end up merged into the mainline without going through Andrew’s tree, missing all the linux-next testing? It was merged into the mainline Oct 4th?

> Reverting that commit "fixes" the problem. That commit adds a call to
> pte_offset_map_lock(), however that isn't necessarily safe when
> considering an "unusual" mapping in the kernel. Combined with my patch
> set this leads to the BUG when walking the kernel's page tables.
>
> At this stage I think it's best if Andrew drops my series and I'll try
> to rework it on top -rc1 fixing up this conflict and the other x86
> 32-bit issue that has cropped up.

2019-12-05 14:34:40

by Thomas Hellstrom

[permalink] [raw]
Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump

On Thu, 2019-12-05 at 08:15 -0500, Qian Cai wrote:
> > On Dec 4, 2019, at 11:32 AM, Steven Price <[email protected]>
> > wrote:
> >
> > I've bisected this problem and it's a merge conflict with:
> >
> > ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in
> > walk_pte_range()")
>
> Sigh, how does that commit end up merging in the mainline without
> going through Andrew’s tree and missed all the linux-next testing? It
> was merged into the mainline Oct 4th?

It was acked by Andrew to be merged through a drm tree, since it was
part of graphics driver functionality. It was preceded by a fairly
lengthy discussion on linux-mm / linux-kernel.

It was merged into drm-next on 19-11-28, I think that's when it
normally is seen by linux-next. Merged into mainline 19-11-30. Andrew's
tree got merged 19-12-05.

linux-next signaled a merge conflict from one of the patches in this
series (not this one) resolved manually with the akpm tree on 19-12-02.

Thomas






2019-12-05 14:39:13

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v15 00/23] Generic page walk and ptdump



> On Dec 5, 2019, at 9:32 AM, Thomas Hellstrom <[email protected]> wrote:
>
> On Thu, 2019-12-05 at 08:15 -0500, Qian Cai wrote:
>>> On Dec 4, 2019, at 11:32 AM, Steven Price <[email protected]>
>>> wrote:
>>>
>>> I've bisected this problem and it's a merge conflict with:
>>>
>>> ace88f1018b8 ("mm: pagewalk: Take the pagetable lock in
>>> walk_pte_range()")
>>
>> Sigh, how does that commit end up merging in the mainline without
>> going through Andrew’s tree and missed all the linux-next testing? It
>> was merged into the mainline Oct 4th?
>
> It was acked by Andrew to be merged through a drm tree, since it was
> part of a graphics driver functionality. It was preceded by a fairly
> lenghty discussion on linux-mm / linux-kernel.
>
> It was merged into drm-next on 19-11-28, I think that's when it
> normally is seen by linux-next. Merged into mainline 19-11-30. Andrew's
> tree got merged 19-12-05.

Ah, that was the problem. It was merged into the mainline after only a day or
two of showing up in linux-next. There wasn’t enough time for integration
testing.

>
> linux-next signaled a merge conflict from one of the patches in this
> series (not this one) resolved manually with the akpm tree on 19-12-02.
>
> Thomas