2019-04-03 14:18:10

by Steven Price

Subject: [PATCH v8 00/20] Convert x86 & arm64 to use generic page walk

Most architectures currently have a debugfs file for dumping the kernel
page tables. Currently each architecture has to implement custom
functions for walking the page tables because the generic
walk_page_range() function is unable to walk the page tables used by the
kernel.

This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space). x86 and arm64 are
then converted to make use of walk_page_range() removing the custom page
table walkers.

To enable a generic page table walker to walk the unusual mappings of
the kernel we need to implement a set of functions which let us know
when the walker has reached the leaf entry. Since arm, powerpc, s390,
sparc and x86 all have p?d_large macros, let's standardise on that and
implement those that are missing.

Potentially future changes could unify the implementations of the
debugfs walkers further, moving the common functionality into common
code. This would require a common way of handling the effective
permissions (currently implemented only for x86) along with a per-arch
way of formatting the page table information for debugfs. One
immediate benefit would be getting the KASAN speed up optimisation in
arm64 (and other arches) which is currently only implemented for x86.

Also available as a git tree:
git://linux-arm.org/linux-sp.git walk_page_range/v8

Changes since v7:
https://lore.kernel.org/lkml/[email protected]/T/
* Updated commit message in patch 2 to clarify that we rely on the page
tables being walked to be the same page size/depth as the kernel's
(since this confused me earlier today).

Changes since v6:
https://lore.kernel.org/lkml/[email protected]/T/
* Split the changes for powerpc. pmd_large() is now added in patch 4,
and pmd_is_leaf() removed in patch 5.

Changes since v5:
https://lore.kernel.org/lkml/[email protected]/T/
* Updated comment for struct mm_walk based on Mike Rapoport's
suggestion

Changes since v4:
https://lore.kernel.org/lkml/[email protected]/T/
* Correctly force result to a boolean in p?d_large for powerpc.
* Added Acked-bys
* Rebased onto v5.1-rc1

Changes since v3:
https://lore.kernel.org/lkml/[email protected]/T/
* Restored the generic macros, only implement p?d_large() for
architectures that have support for large pages. This also means
adding dummy #defines for architectures that define p?d_large as
static inline to avoid picking up the generic macro.
* Drop the 'depth' argument from pte_hole
* Because we no longer have the depth for holes, we also drop support
in x86 for showing missing pages in debugfs. See discussion below:
https://lore.kernel.org/lkml/[email protected]/
* mips: only define p?d_large when _PAGE_HUGE is defined.

Changes since v2:
https://lore.kernel.org/lkml/[email protected]/T/
* Rather than attempting to provide generic macros, actually implement
p?d_large() for each architecture.

Changes since v1:
https://lore.kernel.org/lkml/[email protected]/T/
* Added p4d_large() macro
* Comments to explain p?d_large() macro semantics
* Expanded comment for pte_hole() callback to explain mapping between
depth and P?D
* Handle folded page tables at all levels, so depth from pte_hole()
ignores folding at any level (see real_depth() function in
mm/pagewalk.c)

Steven Price (20):
arc: mm: Add p?d_large() definitions
arm64: mm: Add p?d_large() definitions
mips: mm: Add p?d_large() definitions
powerpc: mm: Add p?d_large() definitions
KVM: PPC: Book3S HV: Remove pmd_is_leaf()
riscv: mm: Add p?d_large() definitions
s390: mm: Add p?d_large() definitions
sparc: mm: Add p?d_large() definitions
x86: mm: Add p?d_large() definitions
mm: Add generic p?d_large() macros
mm: pagewalk: Add p4d_entry() and pgd_entry()
mm: pagewalk: Allow walking without vma
mm: pagewalk: Add test_p?d callbacks
arm64: mm: Convert mm/dump.c to use walk_page_range()
x86: mm: Don't display pages which aren't present in debugfs
x86: mm: Point to struct seq_file from struct pg_state
x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
x86: mm: Convert dump_pagetables to use walk_page_range

arch/arc/include/asm/pgtable.h | 1 +
arch/arm64/include/asm/pgtable.h | 2 +
arch/arm64/mm/dump.c | 117 +++----
arch/mips/include/asm/pgtable-64.h | 8 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 30 +-
arch/powerpc/kvm/book3s_64_mmu_radix.c | 12 +-
arch/riscv/include/asm/pgtable-64.h | 7 +
arch/riscv/include/asm/pgtable.h | 7 +
arch/s390/include/asm/pgtable.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 2 +
arch/x86/include/asm/pgtable.h | 10 +-
arch/x86/mm/debug_pagetables.c | 8 +-
arch/x86/mm/dump_pagetables.c | 347 ++++++++++---------
arch/x86/platform/efi/efi_32.c | 2 +-
arch/x86/platform/efi/efi_64.c | 4 +-
include/asm-generic/pgtable.h | 19 +
include/linux/mm.h | 26 +-
mm/pagewalk.c | 76 +++-
18 files changed, 407 insertions(+), 273 deletions(-)

--
2.20.1


2019-04-03 14:18:14

by Steven Price

Subject: [PATCH v8 01/20] arc: mm: Add p?d_large() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information will be provided by the
p?d_large() functions/macros.

For arc, we only have two levels, so only pmd_large() is needed.

CC: Vineet Gupta <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
Acked-by: Vineet Gupta <[email protected]>
---
arch/arc/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index cf4be70d5892..0edd27bc7018 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -277,6 +277,7 @@ static inline void pmd_set(pmd_t *pmdp, pte_t *ptep)
#define pmd_none(x) (!pmd_val(x))
#define pmd_bad(x) ((pmd_val(x) & ~PAGE_MASK))
#define pmd_present(x) (pmd_val(x))
+#define pmd_large(x) (pmd_val(x) & _PAGE_HW_SZ)
#define pmd_clear(xp) do { pmd_val(*(xp)) = 0; } while (0)

#define pte_page(pte) pfn_to_page(pte_pfn(pte))
--
2.20.1

2019-04-03 14:18:22

by Steven Price

Subject: [PATCH v8 02/20] arm64: mm: Add p?d_large() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information will be provided by the
p?d_large() functions/macros.

For arm64, we already have p?d_sect() macros which we can reuse for
p?d_large().

pud_sect() is defined as a dummy function when CONFIG_PGTABLE_LEVELS < 3
or CONFIG_ARM64_64K_PAGES is defined. However when the kernel is
configured this way then architecturally it isn't allowed to have a
large page at this level, and any code using these page walking macros
is implicitly relying on the page size/number of levels being the same as
the kernel. So it is safe to reuse this for p?d_large() as it is an
architectural restriction.

CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index de70c1eabf33..6eef345dbaf4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -428,6 +428,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
PMD_TYPE_TABLE)
#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_SECT)
+#define pmd_large(pmd) pmd_sect(pmd)

#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
#define pud_sect(pud) (0)
@@ -511,6 +512,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
#define pud_present(pud) pte_present(pud_pte(pud))
+#define pud_large(pud) pud_sect(pud)
#define pud_valid(pud) pte_valid(pud_pte(pud))

static inline void set_pud(pud_t *pudp, pud_t pud)
--
2.20.1

2019-04-03 14:18:30

by Steven Price

Subject: [PATCH v8 03/20] mips: mm: Add p?d_large() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For mips, we only support large pages on 64 bit.

For 64 bit if _PAGE_HUGE is defined we can simply look for it. When not
defined we can be confident that there are no large pages in existence
and fall back on the generic implementation (added in a later patch)
which returns 0.

CC: Ralf Baechle <[email protected]>
CC: Paul Burton <[email protected]>
CC: James Hogan <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
Acked-by: Paul Burton <[email protected]>
---
arch/mips/include/asm/pgtable-64.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 93a9dce31f25..42162877ac62 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -273,6 +273,10 @@ static inline int pmd_present(pmd_t pmd)
return pmd_val(pmd) != (unsigned long) invalid_pte_table;
}

+#ifdef _PAGE_HUGE
+#define pmd_large(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0)
+#endif
+
static inline void pmd_clear(pmd_t *pmdp)
{
pmd_val(*pmdp) = ((unsigned long) invalid_pte_table);
@@ -297,6 +301,10 @@ static inline int pud_present(pud_t pud)
return pud_val(pud) != (unsigned long) invalid_pmd_table;
}

+#ifdef _PAGE_HUGE
+#define pud_large(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
+#endif
+
static inline void pud_clear(pud_t *pudp)
{
pud_val(*pudp) = ((unsigned long) invalid_pmd_table);
--
2.20.1

2019-04-03 14:18:48

by Steven Price

Subject: [PATCH v8 05/20] KVM: PPC: Book3S HV: Remove pmd_is_leaf()

Since pmd_large() is now always available, pmd_is_leaf() is redundant.
Replace all uses with calls to pmd_large().

CC: Benjamin Herrenschmidt <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: [email protected]
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/powerpc/kvm/book3s_64_mmu_radix.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index f55ef071883f..1b57b4e3f819 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -363,12 +363,6 @@ static void kvmppc_pte_free(pte_t *ptep)
kmem_cache_free(kvm_pte_cache, ptep);
}

-/* Like pmd_huge() and pmd_large(), but works regardless of config options */
-static inline int pmd_is_leaf(pmd_t pmd)
-{
- return !!(pmd_val(pmd) & _PAGE_PTE);
-}
-
static pmd_t *kvmppc_pmd_alloc(void)
{
return kmem_cache_alloc(kvm_pmd_cache, GFP_KERNEL);
@@ -460,7 +454,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full,
for (im = 0; im < PTRS_PER_PMD; ++im, ++p) {
if (!pmd_present(*p))
continue;
- if (pmd_is_leaf(*p)) {
+ if (pmd_large(*p)) {
if (full) {
pmd_clear(p);
} else {
@@ -593,7 +587,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
else if (level <= 1)
new_pmd = kvmppc_pmd_alloc();

- if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_is_leaf(*pmd)))
+ if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_large(*pmd)))
new_ptep = kvmppc_pte_alloc();

/* Check if we might have been invalidated; let the guest retry if so */
@@ -662,7 +656,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
new_pmd = NULL;
}
pmd = pmd_offset(pud, gpa);
- if (pmd_is_leaf(*pmd)) {
+ if (pmd_large(*pmd)) {
unsigned long lgpa = gpa & PMD_MASK;

/* Check if we raced and someone else has set the same thing */
--
2.20.1

2019-04-03 14:18:54

by Steven Price

Subject: [PATCH v8 06/20] riscv: mm: Add p?d_large() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For riscv a page is large when it has a read, write or execute bit
set on it.
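
As a rough user-space illustration of the rule above (not the kernel
code itself), the check can be sketched as follows. The bit positions
follow the RISC-V page table entry layout (V, R, W, X in bits 0 to 3);
the toy_ names are hypothetical, and the presence test is simplified
compared with the kernel's p?d_present() helpers.

```c
#include <stdint.h>

/* Bit positions follow the RISC-V PTE layout; the names are illustrative. */
#define TOY_PAGE_PRESENT (1UL << 0)
#define TOY_PAGE_READ    (1UL << 1)
#define TOY_PAGE_WRITE   (1UL << 2)
#define TOY_PAGE_EXEC    (1UL << 3)

/*
 * A present entry with none of R/W/X set points to the next table level;
 * a present entry with any of them set maps memory directly (a leaf).
 */
static inline int toy_entry_large(uint64_t val)
{
	return (val & TOY_PAGE_PRESENT) &&
	       (val & (TOY_PAGE_READ | TOY_PAGE_WRITE | TOY_PAGE_EXEC)) != 0;
}
```

The same predicate serves both the PMD and PUD levels in this sketch,
which matches the two near-identical helpers added by the diff below.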

CC: Palmer Dabbelt <[email protected]>
CC: Albert Ou <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/riscv/include/asm/pgtable-64.h | 7 +++++++
arch/riscv/include/asm/pgtable.h | 7 +++++++
2 files changed, 14 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 7aa0ea9bd8bb..73747d9d7c66 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -51,6 +51,13 @@ static inline int pud_bad(pud_t pud)
return !pud_present(pud);
}

+#define pud_large pud_large
+static inline int pud_large(pud_t pud)
+{
+ return pud_present(pud)
+ && (pud_val(pud) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
+}
+
static inline void set_pud(pud_t *pudp, pud_t pud)
{
*pudp = pud;
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 1141364d990e..9570883c79e7 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -111,6 +111,13 @@ static inline int pmd_bad(pmd_t pmd)
return !pmd_present(pmd);
}

+#define pmd_large pmd_large
+static inline int pmd_large(pmd_t pmd)
+{
+ return pmd_present(pmd)
+ && (pmd_val(pmd) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
+}
+
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
*pmdp = pmd;
--
2.20.1

2019-04-03 14:19:00

by Steven Price

Subject: [PATCH v8 07/20] s390: mm: Add p?d_large() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For s390, pud_large() and pmd_large() are already implemented as static
inline functions. Add a #define so we don't pick up the generic version
introduced in a later patch.

CC: Martin Schwidefsky <[email protected]>
CC: Heiko Carstens <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/s390/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 76dc344edb8c..3ad4c69e1f2d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -679,6 +679,7 @@ static inline int pud_none(pud_t pud)
return pud_val(pud) == _REGION3_ENTRY_EMPTY;
}

+#define pud_large pud_large
static inline int pud_large(pud_t pud)
{
if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R3)
@@ -696,6 +697,7 @@ static inline unsigned long pud_pfn(pud_t pud)
return (pud_val(pud) & origin_mask) >> PAGE_SHIFT;
}

+#define pmd_large pmd_large
static inline int pmd_large(pmd_t pmd)
{
return (pmd_val(pmd) & _SEGMENT_ENTRY_LARGE) != 0;
--
2.20.1

2019-04-03 14:19:07

by Steven Price

Subject: [PATCH v8 08/20] sparc: mm: Add p?d_large() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For sparc 64 bit, pmd_large() and pud_large() are already provided, so
add #defines to prevent the generic versions (added in a later patch)
from being used.

CC: "David S. Miller" <[email protected]>
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
Acked-by: David S. Miller <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1393a8ac596b..f502e937c8fe 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -713,6 +713,7 @@ static inline unsigned long pte_special(pte_t pte)
return pte_val(pte) & _PAGE_SPECIAL;
}

+#define pmd_large pmd_large
static inline unsigned long pmd_large(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
@@ -894,6 +895,7 @@ static inline unsigned long pud_page_vaddr(pud_t pud)
#define pgd_present(pgd) (pgd_val(pgd) != 0U)
#define pgd_clear(pgdp) (pgd_val(*(pgdp)) = 0UL)

+#define pud_large pud_large
static inline unsigned long pud_large(pud_t pud)
{
pte_t pte = __pte(pud_val(pud));
--
2.20.1

2019-04-03 14:19:16

by Steven Price

Subject: [PATCH v8 09/20] x86: mm: Add p?d_large() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For x86 we already have static inline functions, so simply add #defines
to prevent the generic versions (added in a later patch) from being
picked up.

We also need to add corresponding #undefs in dump_pagetables.c. This
code will be removed when x86 is switched over to using the generic
pagewalk code in a later patch.

Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 5 +++++
arch/x86/mm/dump_pagetables.c | 3 +++
2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..0dd04cf6ebeb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -222,6 +222,7 @@ static inline unsigned long pgd_pfn(pgd_t pgd)
return (pgd_val(pgd) & PTE_PFN_MASK) >> PAGE_SHIFT;
}

+#define p4d_large p4d_large
static inline int p4d_large(p4d_t p4d)
{
/* No 512 GiB pages yet */
@@ -230,6 +231,7 @@ static inline int p4d_large(p4d_t p4d)

#define pte_page(pte) pfn_to_page(pte_pfn(pte))

+#define pmd_large pmd_large
static inline int pmd_large(pmd_t pte)
{
return pmd_flags(pte) & _PAGE_PSE;
@@ -857,6 +859,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

+#define pud_large pud_large
static inline int pud_large(pud_t pud)
{
return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
@@ -868,6 +871,7 @@ static inline int pud_bad(pud_t pud)
return (pud_flags(pud) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
}
#else
+#define pud_large pud_large
static inline int pud_large(pud_t pud)
{
return 0;
@@ -1213,6 +1217,7 @@ static inline bool pgdp_maps_userspace(void *__ptr)
return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
}

+#define pgd_large pgd_large
static inline int pgd_large(pgd_t pgd) { return 0; }

#ifdef CONFIG_PAGE_TABLE_ISOLATION
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ee8f8ab46941..ca270fb00805 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -432,6 +432,7 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,

#else
#define walk_pmd_level(m,s,a,e,p) walk_pte_level(m,s,__pmd(pud_val(a)),e,p)
+#undef pud_large
#define pud_large(a) pmd_large(__pmd(pud_val(a)))
#define pud_none(a) pmd_none(__pmd(pud_val(a)))
#endif
@@ -467,6 +468,7 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,

#else
#define walk_pud_level(m,s,a,e,p) walk_pmd_level(m,s,__pud(p4d_val(a)),e,p)
+#undef p4d_large
#define p4d_large(a) pud_large(__pud(p4d_val(a)))
#define p4d_none(a) pud_none(__pud(p4d_val(a)))
#endif
@@ -501,6 +503,7 @@ static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
}
}

+#undef pgd_large
#define pgd_large(a) (pgtable_l5_enabled() ? pgd_large(a) : p4d_large(__p4d(pgd_val(a))))
#define pgd_none(a) (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))

--
2.20.1

2019-04-03 14:19:16

by Steven Price

Subject: [PATCH v8 10/20] mm: Add generic p?d_large() macros

Exposing the pud/pgd levels of the page tables to walk_page_range() means
we may come across the exotic large mappings that come with large areas
of contiguous memory (such as the kernel's linear map).

For architectures that don't provide p?d_large() macros, provide generic
do-nothing defaults.
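
The interaction between an architecture's static inline helper and the
generic fallback can be sketched in user space as below. This is a
minimal illustration, assuming a made-up leaf bit (TOY_PMD_LEAF_BIT) and
plain unsigned long entries instead of the kernel's pmd_t and pud_t; the
"#define name name" idiom is the one the s390, sparc and x86 patches in
this series rely on.

```c
/* --- what an architecture header might contain (hypothetical arch) --- */
#define TOY_PMD_LEAF_BIT 0x80UL

/* Defining the name as a macro of itself makes "#ifndef pmd_large" false. */
#define pmd_large pmd_large
static inline int pmd_large(unsigned long pmd)
{
	return (pmd & TOY_PMD_LEAF_BIT) != 0;
}

/* --- what the generic header does afterwards --- */
#ifndef pud_large
#define pud_large(x) 0	/* this arch has no large PUD mappings */
#endif
#ifndef pmd_large
#define pmd_large(x) 0	/* not used: the arch version above wins */
#endif
```

In this sketch pmd_large() resolves to the architecture's function while
pud_large() falls back to the constant 0, so a caller can use both names
unconditionally.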

Signed-off-by: Steven Price <[email protected]>
---
include/asm-generic/pgtable.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fa782fba51ee..9c5d0f73db67 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1186,4 +1186,23 @@ static inline bool arch_has_pfn_modify_check(void)
#define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED)
#endif

+/*
+ * p?d_large() - true if this entry is a final mapping to a physical address.
+ * This differs from p?d_huge() by the fact that they are always available (if
+ * the architecture supports large pages at the appropriate level) even
+ * if CONFIG_HUGETLB_PAGE is not defined.
+ */
+#ifndef pgd_large
+#define pgd_large(x) 0
+#endif
+#ifndef p4d_large
+#define p4d_large(x) 0
+#endif
+#ifndef pud_large
+#define pud_large(x) 0
+#endif
+#ifndef pmd_large
+#define pmd_large(x) 0
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
--
2.20.1

2019-04-03 14:19:20

by Steven Price

Subject: [PATCH v8 11/20] mm: pagewalk: Add p4d_entry() and pgd_entry()

pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
no users. We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.

Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
PUD-sized transparent hugepages") already re-added pud_entry() but with
different semantics to the other callbacks. Since there have never
been upstream users of this, revert the semantics back to match the
other callbacks. This means pud_entry() is called for all entries, not
just transparent huge pages.
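
The resulting callback semantics can be modelled with a toy two-level
walker. Everything below is a hypothetical user-space sketch (the toy_
names, the table layout and the counting callbacks are not kernel API);
it only illustrates that each level's callback runs for every non-empty
entry, and that the walker descends only when a lower-level callback is
registered.

```c
#include <stddef.h>

#define TOY_ENTRIES 4

struct toy_walk;

/* Hypothetical two-level model: a top entry points at a table of low entries. */
typedef int (*toy_top_fn)(unsigned long **entry, struct toy_walk *walk);
typedef int (*toy_low_fn)(unsigned long *entry, struct toy_walk *walk);

struct toy_walk {
	toy_top_fn top_entry;	/* called for each non-empty top-level entry */
	toy_low_fn low_entry;	/* called for each non-empty low-level entry */
	void *private;
};

struct toy_counts {
	int top;
	int low;
};

static int toy_walk_low(unsigned long *table, struct toy_walk *walk)
{
	for (int i = 0; i < TOY_ENTRIES; i++) {
		int err;

		if (!table[i])
			continue;
		err = walk->low_entry(&table[i], walk);
		if (err)
			return err;
	}
	return 0;
}

static int toy_walk_top(unsigned long **top, struct toy_walk *walk)
{
	for (int i = 0; i < TOY_ENTRIES; i++) {
		int err = 0;

		if (!top[i])
			continue;
		if (walk->top_entry) {
			err = walk->top_entry(&top[i], walk);
			if (err)
				return err;
		}
		/* Only descend when a lower-level callback is registered. */
		if (walk->low_entry)
			err = toy_walk_low(top[i], walk);
		if (err)
			return err;
	}
	return 0;
}

static int toy_count_top(unsigned long **entry, struct toy_walk *walk)
{
	(void)entry;
	((struct toy_counts *)walk->private)->top++;
	return 0;
}

static int toy_count_low(unsigned long *entry, struct toy_walk *walk)
{
	(void)entry;
	((struct toy_counts *)walk->private)->low++;
	return 0;
}

/* Builds a small table, walks it, and reports how often each callback ran. */
struct toy_counts toy_callback_demo(int with_low)
{
	unsigned long a[TOY_ENTRIES] = { 1, 0, 1, 0 };
	unsigned long b[TOY_ENTRIES] = { 0, 0, 0, 1 };
	unsigned long *top[TOY_ENTRIES] = { a, NULL, b, NULL };
	struct toy_counts counts = { 0, 0 };
	struct toy_walk walk = {
		.top_entry = toy_count_top,
		.low_entry = with_low ? toy_count_low : NULL,
		.private = &counts,
	};

	toy_walk_top(top, &walk);
	return counts;
}
```

With both callbacks set the sketch visits two top-level and three
low-level entries; with only top_entry set the lower tables are never
touched.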

Signed-off-by: Steven Price <[email protected]>
---
include/linux/mm.h | 15 +++++++++------
mm/pagewalk.c | 27 ++++++++++++++++-----------
2 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76769749b5a5..f6de08c116e6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1367,15 +1367,14 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,

/**
* mm_walk - callbacks for walk_page_range
- * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
- * this handler should only handle pud_trans_huge() puds.
- * the pmd_entry or pte_entry callbacks will be used for
- * regular PUDs.
- * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
+ * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
+ * @p4d_entry: if set, called for each non-empty P4D entry
+ * @pud_entry: if set, called for each non-empty PUD entry
+ * @pmd_entry: if set, called for each non-empty PMD entry
* this handler is required to be able to handle
* pmd_trans_huge() pmds. They may simply choose to
* split_huge_page() instead of handling it explicitly.
- * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
+ * @pte_entry: if set, called for each non-empty PTE (lowest-level) entry
* @pte_hole: if set, called for each hole at all levels
* @hugetlb_entry: if set, called for each hugetlb entry
* @test_walk: caller specific callback function to determine whether
@@ -1390,6 +1389,10 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
* (see the comment on walk_page_range() for more details)
*/
struct mm_walk {
+ int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk);
+ int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk);
int (*pud_entry)(pud_t *pud, unsigned long addr,
unsigned long next, struct mm_walk *walk);
int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c3084ff2569d..98373a9f88b8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -90,15 +90,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
}

if (walk->pud_entry) {
- spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
-
- if (ptl) {
- err = walk->pud_entry(pud, addr, next, walk);
- spin_unlock(ptl);
- if (err)
- break;
- continue;
- }
+ err = walk->pud_entry(pud, addr, next, walk);
+ if (err)
+ break;
}

split_huge_pud(walk->vma, pud, addr);
@@ -131,7 +125,12 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
break;
continue;
}
- if (walk->pmd_entry || walk->pte_entry)
+ if (walk->p4d_entry) {
+ err = walk->p4d_entry(p4d, addr, next, walk);
+ if (err)
+ break;
+ }
+ if (walk->pud_entry || walk->pmd_entry || walk->pte_entry)
err = walk_pud_range(p4d, addr, next, walk);
if (err)
break;
@@ -157,7 +156,13 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
break;
continue;
}
- if (walk->pmd_entry || walk->pte_entry)
+ if (walk->pgd_entry) {
+ err = walk->pgd_entry(pgd, addr, next, walk);
+ if (err)
+ break;
+ }
+ if (walk->p4d_entry || walk->pud_entry || walk->pmd_entry ||
+ walk->pte_entry)
err = walk_p4d_range(pgd, addr, next, walk);
if (err)
break;
--
2.20.1

2019-04-03 14:19:25

by Steven Price

Subject: [PATCH v8 12/20] mm: pagewalk: Allow walking without vma

Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range
for vma(VM_PFNMAP)", page_table_walk() will report any kernel area as
a hole, because it lacks a vma.

This means each arch has re-implemented page table walking when needed,
for example in the per-arch ptdump walker.

Remove the requirement to have a vma except when trying to split huge
pages.
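
A user-space model of the vma-less path may help to show the intent.
The structures and the TOY_PMD_LEAF flag below are hypothetical; the
sketch only mirrors the control flow of the diff: empty entries are
reported as holes, large entries are leaves that must not be descended
into (there is no vma, so nothing can be split), and everything else is
a table whose entries are visited.

```c
#include <stddef.h>

#define TOY_PTES 4
#define TOY_PMD_LEAF 0x1UL	/* hypothetical "large mapping" flag */

struct toy_pmd {
	unsigned long flags;		/* 0 and ptes == NULL: entry is empty */
	const unsigned long *ptes;	/* lower-level table, NULL for a leaf */
};

struct toy_result {
	int holes;	/* empty pmd entries reported as holes */
	int leaves;	/* large pmd entries, not descended into */
	int ptes;	/* non-empty pte entries visited */
};

static int toy_pmd_none(const struct toy_pmd *pmd)
{
	return pmd->flags == 0 && pmd->ptes == NULL;
}

static int toy_pmd_large(const struct toy_pmd *pmd)
{
	return (pmd->flags & TOY_PMD_LEAF) != 0;
}

/* Models only the vma-less path: there is nothing to split, so skip leaves. */
struct toy_result toy_walk_pmds(const struct toy_pmd *pmds, size_t n)
{
	struct toy_result res = { 0, 0, 0 };

	for (size_t i = 0; i < n; i++) {
		const struct toy_pmd *pmd = &pmds[i];

		if (toy_pmd_none(pmd)) {
			res.holes++;
			continue;
		}
		if (toy_pmd_large(pmd)) {
			res.leaves++;
			continue;
		}
		for (size_t j = 0; j < TOY_PTES; j++)
			if (pmd->ptes[j])
				res.ptes++;
	}
	return res;
}

/* Walks a table holding a hole, a large mapping, a pte table and a hole. */
struct toy_result toy_kernel_walk_demo(void)
{
	static const unsigned long ptes[TOY_PTES] = { 1, 0, 1, 1 };
	const struct toy_pmd pmds[] = {
		{ 0, NULL },			/* hole */
		{ TOY_PMD_LEAF, NULL },		/* large mapping */
		{ 0x2UL, ptes },		/* table of ptes */
		{ 0, NULL },			/* hole */
	};

	return toy_walk_pmds(pmds, sizeof(pmds) / sizeof(pmds[0]));
}
```

Before this patch the equivalent of the "!walk->vma" test would have
reported all four entries as holes.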

Signed-off-by: Steven Price <[email protected]>
---
mm/pagewalk.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 98373a9f88b8..dac0c848b458 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -36,7 +36,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
do {
again:
next = pmd_addr_end(addr, end);
- if (pmd_none(*pmd) || !walk->vma) {
+ if (pmd_none(*pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
@@ -59,9 +59,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
if (!walk->pte_entry)
continue;

- split_huge_pmd(walk->vma, pmd, addr);
- if (pmd_trans_unstable(pmd))
- goto again;
+ if (walk->vma) {
+ split_huge_pmd(walk->vma, pmd, addr);
+ if (pmd_trans_unstable(pmd))
+ goto again;
+ } else if (pmd_large(*pmd)) {
+ continue;
+ }
+
err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
@@ -81,7 +86,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
do {
again:
next = pud_addr_end(addr, end);
- if (pud_none(*pud) || !walk->vma) {
+ if (pud_none(*pud)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
@@ -95,9 +100,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
break;
}

- split_huge_pud(walk->vma, pud, addr);
- if (pud_none(*pud))
- goto again;
+ if (walk->vma) {
+ split_huge_pud(walk->vma, pud, addr);
+ if (pud_none(*pud))
+ goto again;
+ } else if (pud_large(*pud)) {
+ continue;
+ }

if (walk->pmd_entry || walk->pte_entry)
err = walk_pmd_range(pud, addr, next, walk);
--
2.20.1

2019-04-03 14:19:33

by Steven Price

Subject: [PATCH v8 15/20] x86: mm: Don't display pages which aren't present in debugfs

For the /sys/kernel/debug/page_tables/ files, rather than outputting a
mostly empty line when a block of memory isn't present, just skip the
line. This keeps the output shorter and will help with a future change
switching to using the generic page walk code as we no longer care about
the 'level' that the page table holes are at.

Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ca270fb00805..e2b53db92c34 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -304,8 +304,8 @@ static void note_page(struct seq_file *m, struct pg_state *st,
/*
* Now print the actual finished series
*/
- if (!st->marker->max_lines ||
- st->lines < st->marker->max_lines) {
+ if ((cur & _PAGE_PRESENT) && (!st->marker->max_lines ||
+ st->lines < st->marker->max_lines)) {
pt_dump_seq_printf(m, st->to_dmesg,
"0x%0*lx-0x%0*lx ",
width, st->start_address,
@@ -321,7 +321,8 @@ static void note_page(struct seq_file *m, struct pg_state *st,
printk_prot(m, st->current_prot, st->level,
st->to_dmesg);
}
- st->lines++;
+ if (cur & _PAGE_PRESENT)
+ st->lines++;

/*
* We print markers for special areas of address space,
--
2.20.1

2019-04-03 14:19:36

by Steven Price

Subject: [PATCH v8 13/20] mm: pagewalk: Add test_p?d callbacks

It is useful to be able to skip parts of the page table tree even when
walking without VMAs. Add test_p?d callbacks similar to test_walk but
which are called just before a table at that level is walked. If the
callback returns non-zero then the entire table is skipped.
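
The return convention, as implemented in the diff below, is slightly
richer than "non-zero means skip": a negative value aborts the walk and
is propagated as an error, while a positive value skips the table and
lets the walk continue. A small user-space sketch of that convention,
with hypothetical toy_ names and example policies, might look like this:

```c
#include <stddef.h>

typedef int (*toy_test_fn)(unsigned long addr, unsigned long end);

/* Walks one table covering [addr, end); *walked counts tables actually walked. */
static int toy_walk_table(unsigned long addr, unsigned long end,
			  toy_test_fn test, int *walked)
{
	if (test) {
		int err = test(addr, end);

		if (err < 0)
			return err;	/* abort: propagate the error */
		if (err > 0)
			return 0;	/* skip this table, keep walking */
	}
	(*walked)++;
	return 0;
}

/* Example policy (hypothetical): skip the first range only. */
int toy_skip_low(unsigned long addr, unsigned long end)
{
	(void)addr;
	return end <= 0x1000UL;
}

/* Example policy (hypothetical): abort the walk immediately. */
int toy_abort(unsigned long addr, unsigned long end)
{
	(void)addr;
	(void)end;
	return -1;
}

/* Walks three adjacent ranges; returns tables walked or a negative error. */
int toy_walked_count(toy_test_fn test)
{
	int walked = 0;

	for (unsigned long addr = 0; addr < 0x3000UL; addr += 0x1000UL) {
		int err = toy_walk_table(addr, addr + 0x1000UL, test, &walked);

		if (err)
			return err;
	}
	return walked;
}
```

Without a test callback all three ranges are walked; toy_skip_low()
leaves two, and toy_abort() stops the walk with its error value.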

Signed-off-by: Steven Price <[email protected]>
---
include/linux/mm.h | 11 +++++++++++
mm/pagewalk.c | 24 ++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6de08c116e6..a4c1ed255455 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1382,6 +1382,11 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
* value means "do page table walk over the current vma,"
* and a negative one means "abort current page table walk
* right now." 1 means "skip the current vma."
+ * @test_pmd: similar to test_walk(), but called for every pmd.
+ * @test_pud: similar to test_walk(), but called for every pud.
+ * @test_p4d: similar to test_walk(), but called for every p4d.
+ * Returning 0 means walk this part of the page tables,
+ * returning 1 means to skip this range.
* @mm: mm_struct representing the target process of page table walk
* @vma: vma currently walked (NULL if walking outside vmas)
* @private: private data for callbacks' usage
@@ -1406,6 +1411,12 @@ struct mm_walk {
struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
+ int (*test_pmd)(unsigned long addr, unsigned long next,
+ pmd_t *pmd_start, struct mm_walk *walk);
+ int (*test_pud)(unsigned long addr, unsigned long next,
+ pud_t *pud_start, struct mm_walk *walk);
+ int (*test_p4d)(unsigned long addr, unsigned long next,
+ p4d_t *p4d_start, struct mm_walk *walk);
struct mm_struct *mm;
struct vm_area_struct *vma;
void *private;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index dac0c848b458..231655db1295 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -32,6 +32,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
unsigned long next;
int err = 0;

+ if (walk->test_pmd) {
+ err = walk->test_pmd(addr, end, pmd_offset(pud, 0), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
pmd = pmd_offset(pud, addr);
do {
again:
@@ -82,6 +90,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
unsigned long next;
int err = 0;

+ if (walk->test_pud) {
+ err = walk->test_pud(addr, end, pud_offset(p4d, 0), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
pud = pud_offset(p4d, addr);
do {
again:
@@ -124,6 +140,14 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
unsigned long next;
int err = 0;

+ if (walk->test_p4d) {
+ err = walk->test_p4d(addr, end, p4d_offset(pgd, 0), walk);
+ if (err < 0)
+ return err;
+ if (err > 0)
+ return 0;
+ }
+
p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
--
2.20.1
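
The return-value convention of the test_p?d() hooks added above (negative aborts the walk, positive skips the table, zero descends into it) can be sketched in user space. This is only an illustration under stated assumptions: struct toy_walk, toy_walk_table() and the two callbacks are hypothetical stand-ins, not the kernel's struct mm_walk or walk_pmd_range().

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mm_walk; not the kernel type. */
struct toy_walk {
	/* <0: abort the walk, 0: descend into the table, >0: skip it */
	int (*test_table)(unsigned long addr, unsigned long end,
			  const unsigned long *table_start,
			  struct toy_walk *walk);
	int (*entry)(const unsigned long *entry, unsigned long addr,
		     struct toy_walk *walk);
	void *private;
};

#define TOY_ENTRIES	4
#define TOY_ENTRY_SIZE	0x1000UL

/* Same shape as the check added at the top of walk_pmd_range(). */
int toy_walk_table(const unsigned long *table, unsigned long addr,
		   unsigned long end, struct toy_walk *walk)
{
	int err = 0;
	size_t i;

	if (walk->test_table) {
		err = walk->test_table(addr, end, table, walk);
		if (err < 0)
			return err;	/* propagate the error to the caller */
		if (err > 0)
			return 0;	/* skip this table, the walk continues */
	}

	for (i = 0; i < TOY_ENTRIES && addr < end; i++, addr += TOY_ENTRY_SIZE) {
		err = walk->entry(&table[i], addr, walk);
		if (err)
			break;
	}
	return err;
}

/* Hypothetical "shared" table that the test callback asks to skip. */
const unsigned long *toy_shared_table;

int toy_test_skip_shared(unsigned long addr, unsigned long end,
			 const unsigned long *table_start,
			 struct toy_walk *walk)
{
	(void)addr; (void)end; (void)walk;
	return table_start == toy_shared_table ? 1 : 0;
}

/* Counts visited entries in the int that walk->private points to. */
int toy_count_entry(const unsigned long *entry, unsigned long addr,
		    struct toy_walk *walk)
{
	(void)entry; (void)addr;
	++*(int *)walk->private;
	return 0;
}
```

Walking an ordinary table visits every entry, while walking the table registered in toy_shared_table returns 0 without calling the entry callback at all, which is the behaviour the KASAN shortcut later in the series depends on.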

2019-04-03 14:19:55

by Steven Price

Subject: [PATCH v8 19/20] x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct

An mm_struct is needed to enable x86 to make use of the generic
walk_page_range() function.

In the case of walking the user page tables (when
CONFIG_PAGE_TABLE_ISOLATION is enabled), it is necessary to create a
fake_mm structure because there isn't an mm_struct with a pointer
to the pgd of the user page tables. This fake_mm structure is
initialised with the minimum necessary for the generic page walk code.

Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 36 ++++++++++++++++++++---------------
1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 40b3f1da6e15..c0fbb9e5a790 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -111,8 +111,6 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR] = { -1, NULL }
};

-#define INIT_PGD ((pgd_t *) &init_top_pgt)
-
#else /* CONFIG_X86_64 */

enum address_markers_idx {
@@ -147,8 +145,6 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR] = { -1, NULL }
};

-#define INIT_PGD (swapper_pg_dir)
-
#endif /* !CONFIG_X86_64 */

/* Multipliers for offsets within the PTEs */
@@ -522,10 +518,10 @@ static inline bool is_hypervisor_range(int idx)
#endif
}

-static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
+static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
bool checkwx, bool dmesg)
{
- pgd_t *start = pgd;
+ pgd_t *start = mm->pgd;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};
@@ -572,39 +568,49 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,

void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
{
- ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
+ ptdump_walk_pgd_level_core(m, mm, false, true);
}

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static void ptdump_walk_pgd_level_user_core(struct seq_file *m,
+ struct mm_struct *mm,
+ bool checkwx, bool dmesg)
+{
+ struct mm_struct fake_mm = {
+ .pgd = kernel_to_user_pgdp(mm->pgd)
+ };
+ init_rwsem(&fake_mm.mmap_sem);
+ ptdump_walk_pgd_level_core(m, &fake_mm, checkwx, dmesg);
+}
+#endif
+
void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
bool user)
{
- pgd_t *pgd = mm->pgd;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (user && static_cpu_has(X86_FEATURE_PTI))
- pgd = kernel_to_user_pgdp(pgd);
+ ptdump_walk_pgd_level_user_core(m, mm, false, false);
+ else
#endif
- ptdump_walk_pgd_level_core(m, pgd, false, false);
+ ptdump_walk_pgd_level_core(m, mm, false, false);
}
EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);

void ptdump_walk_user_pgd_level_checkwx(void)
{
#ifdef CONFIG_PAGE_TABLE_ISOLATION
- pgd_t *pgd = INIT_PGD;
-
if (!(__supported_pte_mask & _PAGE_NX) ||
!static_cpu_has(X86_FEATURE_PTI))
return;

pr_info("x86/mm: Checking user space page tables\n");
- pgd = kernel_to_user_pgdp(pgd);
- ptdump_walk_pgd_level_core(NULL, pgd, true, false);
+ ptdump_walk_pgd_level_user_core(NULL, &init_mm, true, false);
#endif
}

void ptdump_walk_pgd_level_checkwx(void)
{
- ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
+ ptdump_walk_pgd_level_core(NULL, &init_mm, true, false);
}

static int __init pt_dump_init(void)
--
2.20.1
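
The fake_mm approach in the patch above amounts to building a throwaway descriptor on the stack that carries only what the walker reads. A rough user-space sketch follows; struct toy_mm, the lock_ready flag and the layout assumed by toy_kernel_to_user_pgdp() are all hypothetical and chosen purely for illustration.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct mm_struct. */
struct toy_mm {
	unsigned long *pgd;	/* top-level table */
	int lock_ready;		/* stands in for the initialised mmap_sem */
};

/* Hypothetical walker: it only needs a pgd pointer and a usable lock. */
unsigned long toy_walk_first_entry(const struct toy_mm *mm)
{
	if (!mm->lock_ready)
		return 0;
	return mm->pgd[0];
}

/*
 * Stand-in for kernel_to_user_pgdp(). Assumption for this sketch only:
 * the user table sits directly after the kernel table in memory.
 */
unsigned long *toy_kernel_to_user_pgdp(unsigned long *pgd, size_t entries)
{
	return pgd + entries;
}

/* Build a minimal descriptor on the stack and hand it to the walker. */
unsigned long toy_walk_user(const struct toy_mm *mm, size_t entries)
{
	struct toy_mm fake_mm = {
		.pgd = toy_kernel_to_user_pgdp(mm->pgd, entries),
		.lock_ready = 1,
	};

	return toy_walk_first_entry(&fake_mm);
}
```

The real mm is never modified; only the temporary copy points at the user tables, so the walker needs no special case for them.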

2019-04-03 14:19:56

by Steven Price

Subject: [PATCH v8 04/20] powerpc: mm: Add p?d_large() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For powerpc pmd_large() was already implemented, so hoist it out of the
CONFIG_TRANSPARENT_HUGEPAGE condition and implement the other levels.

CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: [email protected]
CC: [email protected]
Signed-off-by: Steven Price <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 30 ++++++++++++++------
1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 581f91be9dd4..f6d1ac8b832e 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -897,6 +897,12 @@ static inline int pud_present(pud_t pud)
return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PRESENT));
}

+#define pud_large pud_large
+static inline int pud_large(pud_t pud)
+{
+ return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
+
extern struct page *pud_page(pud_t pud);
extern struct page *pmd_page(pmd_t pmd);
static inline pte_t pud_pte(pud_t pud)
@@ -940,6 +946,12 @@ static inline int pgd_present(pgd_t pgd)
return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PRESENT));
}

+#define pgd_large pgd_large
+static inline int pgd_large(pgd_t pgd)
+{
+ return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
+}
+
static inline pte_t pgd_pte(pgd_t pgd)
{
return __pte_raw(pgd_raw(pgd));
@@ -1093,6 +1105,15 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write)
return pte_access_permitted(pmd_pte(pmd), write);
}

+#define pmd_large pmd_large
+/*
+ * returns true for pmd migration entries, THP, devmap, hugetlb
+ */
+static inline int pmd_large(pmd_t pmd)
+{
+ return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
@@ -1119,15 +1140,6 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
}

-/*
- * returns true for pmd migration entries, THP, devmap, hugetlb
- * But compile time dependent on THP config
- */
-static inline int pmd_large(pmd_t pmd)
-{
- return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
-}
-
static inline pmd_t pmd_mknotpresent(pmd_t pmd)
{
return __pmd(pmd_val(pmd) & ~_PAGE_PRESENT);
--
2.20.1
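
The p?d_large() helpers in the patch above all perform the same test: is the leaf bit set in the raw big-endian entry? A user-space sketch of that check is below. TOY_PAGE_PTE uses an assumed bit position and toy_cpu_to_be64() is a portable stand-in for cpu_to_be64(); neither is taken from the powerpc headers.

```c
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define TOY_PAGE_PTE	(UINT64_C(1) << 62)

/* Portable stand-in for cpu_to_be64(): byte-swap on little-endian hosts. */
uint64_t toy_cpu_to_be64(uint64_t v)
{
	const union { uint16_t u; uint8_t b[2]; } probe = { .u = 1 };
	uint64_t out = 0;
	int i;

	if (probe.b[0] == 0)
		return v;	/* big-endian host: already in the right order */

	for (i = 0; i < 8; i++) {
		out = (out << 8) | (v & 0xff);
		v >>= 8;
	}
	return out;
}

/*
 * Leaf test on a raw (big-endian) entry. The constant is converted
 * rather than the entry, mirroring the form used in the patch.
 */
int toy_entry_is_leaf(uint64_t raw_be)
{
	return !!(raw_be & toy_cpu_to_be64(TOY_PAGE_PTE));
}
```

Converting the constant instead of the entry gives the compiler the chance to fold the byte swap at build time when the real cpu_to_be64() is used, so the run-time cost is a single AND.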

2019-04-03 14:19:56

by Steven Price

Subject: [PATCH v8 18/20] x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct

To enable x86 to use the generic walk_page_range() function, the
callers of ptdump_walk_pgd_level_debugfs() need to pass in the mm_struct.

This means that ptdump_walk_pgd_level_core() is now always passed a
valid pgd, so drop the support for pgd==NULL.

Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 3 ++-
arch/x86/mm/debug_pagetables.c | 8 ++++----
arch/x86/mm/dump_pagetables.c | 14 ++++++--------
3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 579959750f34..5abf693dc9b2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,7 +28,8 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD];
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);

void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+ bool user);
void ptdump_walk_pgd_level_checkwx(void);
void ptdump_walk_user_pgd_level_checkwx(void);

diff --git a/arch/x86/mm/debug_pagetables.c b/arch/x86/mm/debug_pagetables.c
index cd84f067e41d..824131052574 100644
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -6,7 +6,7 @@

static int ptdump_show(struct seq_file *m, void *v)
{
- ptdump_walk_pgd_level_debugfs(m, NULL, false);
+ ptdump_walk_pgd_level_debugfs(m, &init_mm, false);
return 0;
}

@@ -16,7 +16,7 @@ static int ptdump_curknl_show(struct seq_file *m, void *v)
{
if (current->mm->pgd) {
down_read(&current->mm->mmap_sem);
- ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, false);
+ ptdump_walk_pgd_level_debugfs(m, current->mm, false);
up_read(&current->mm->mmap_sem);
}
return 0;
@@ -31,7 +31,7 @@ static int ptdump_curusr_show(struct seq_file *m, void *v)
{
if (current->mm->pgd) {
down_read(&current->mm->mmap_sem);
- ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, true);
+ ptdump_walk_pgd_level_debugfs(m, current->mm, true);
up_read(&current->mm->mmap_sem);
}
return 0;
@@ -46,7 +46,7 @@ static struct dentry *pe_efi;
static int ptdump_efi_show(struct seq_file *m, void *v)
{
if (efi_mm.pgd)
- ptdump_walk_pgd_level_debugfs(m, efi_mm.pgd, false);
+ ptdump_walk_pgd_level_debugfs(m, &efi_mm, false);
return 0;
}

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ddf8ea6b059d..40b3f1da6e15 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -525,16 +525,12 @@ static inline bool is_hypervisor_range(int idx)
static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
bool checkwx, bool dmesg)
{
- pgd_t *start = INIT_PGD;
+ pgd_t *start = pgd;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};

- if (pgd) {
- start = pgd;
- st.to_dmesg = dmesg;
- }
-
+ st.to_dmesg = dmesg;
st.check_wx = checkwx;
st.seq = m;
if (checkwx)
@@ -579,8 +575,10 @@ void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
}

-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+ bool user)
{
+ pgd_t *pgd = mm->pgd;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (user && static_cpu_has(X86_FEATURE_PTI))
pgd = kernel_to_user_pgdp(pgd);
@@ -606,7 +604,7 @@ void ptdump_walk_user_pgd_level_checkwx(void)

void ptdump_walk_pgd_level_checkwx(void)
{
- ptdump_walk_pgd_level_core(NULL, NULL, true, false);
+ ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
}

static int __init pt_dump_init(void)
--
2.20.1

2019-04-03 14:19:59

by Steven Price

Subject: [PATCH v8 20/20] x86: mm: Convert dump_pagetables to use walk_page_range

Make use of the new functionality in walk_page_range to remove the
arch page walking code and use the generic code to walk the page tables.

The effective permissions are passed down the chain using new fields
in struct pg_state.

The KASAN optimisation is implemented by including test_p?d callbacks
which can decide to skip an entire tree of entries.

Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 280 ++++++++++++++++++----------------
1 file changed, 146 insertions(+), 134 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index c0fbb9e5a790..f6b814aaddf7 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -33,6 +33,10 @@ struct pg_state {
int level;
pgprot_t current_prot;
pgprotval_t effective_prot;
+ pgprotval_t effective_prot_pgd;
+ pgprotval_t effective_prot_p4d;
+ pgprotval_t effective_prot_pud;
+ pgprotval_t effective_prot_pmd;
unsigned long start_address;
unsigned long current_address;
const struct addr_marker *marker;
@@ -356,22 +360,21 @@ static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
((prot1 | prot2) & _PAGE_NX);
}

-static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
- unsigned long P)
+static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- int i;
- pte_t *pte;
- pgprotval_t prot, eff;
-
- for (i = 0; i < PTRS_PER_PTE; i++) {
- st->current_address = normalize_addr(P + i * PTE_LEVEL_MULT);
- pte = pte_offset_map(&addr, st->current_address);
- prot = pte_flags(*pte);
- eff = effective_prot(eff_in, prot);
- note_page(st, __pgprot(prot), eff, 5);
- pte_unmap(pte);
- }
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ st->current_address = normalize_addr(addr);
+
+ prot = pte_flags(*pte);
+ eff = effective_prot(st->effective_prot_pmd, prot);
+ note_page(st, __pgprot(prot), eff, 5);
+
+ return 0;
}
+
#ifdef CONFIG_KASAN

/*
@@ -400,131 +403,152 @@ static inline bool kasan_page_table(struct pg_state *st, void *pt)
}
#endif

-#if PTRS_PER_PMD > 1
-
-static void walk_pmd_level(struct pg_state *st, pud_t addr,
- pgprotval_t eff_in, unsigned long P)
+static int ptdump_test_pmd(unsigned long addr, unsigned long next,
+ pmd_t *pmd, struct mm_walk *walk)
{
- int i;
- pmd_t *start, *pmd_start;
- pgprotval_t prot, eff;
-
- pmd_start = start = (pmd_t *)pud_page_vaddr(addr);
- for (i = 0; i < PTRS_PER_PMD; i++) {
- st->current_address = normalize_addr(P + i * PMD_LEVEL_MULT);
- if (!pmd_none(*start)) {
- prot = pmd_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (pmd_large(*start) || !pmd_present(*start)) {
- note_page(st, __pgprot(prot), eff, 4);
- } else if (!kasan_page_table(st, pmd_start)) {
- walk_pte_level(st, *start, eff,
- P + i * PMD_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 4);
- start++;
- }
+ struct pg_state *st = walk->private;
+
+ st->current_address = normalize_addr(addr);
+
+ if (kasan_page_table(st, pmd))
+ return 1;
+ return 0;
}

-#else
-#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
-#undef pud_large
-#define pud_large(a) pmd_large(__pmd(pud_val(a)))
-#define pud_none(a) pmd_none(__pmd(pud_val(a)))
-#endif
+static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ prot = pmd_flags(*pmd);
+ eff = effective_prot(st->effective_prot_pud, prot);
+
+ st->current_address = normalize_addr(addr);
+
+ if (pmd_large(*pmd))
+ note_page(st, __pgprot(prot), eff, 4);

-#if PTRS_PER_PUD > 1
+ st->effective_prot_pmd = eff;

-static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
- unsigned long P)
+ return 0;
+}
+
+static int ptdump_test_pud(unsigned long addr, unsigned long next,
+ pud_t *pud, struct mm_walk *walk)
{
- int i;
- pud_t *start, *pud_start;
- pgprotval_t prot, eff;
-
- pud_start = start = (pud_t *)p4d_page_vaddr(addr);
-
- for (i = 0; i < PTRS_PER_PUD; i++) {
- st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
- if (!pud_none(*start)) {
- prot = pud_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (pud_large(*start) || !pud_present(*start)) {
- note_page(st, __pgprot(prot), eff, 3);
- } else if (!kasan_page_table(st, pud_start)) {
- walk_pmd_level(st, *start, eff,
- P + i * PUD_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 3);
+ struct pg_state *st = walk->private;

- start++;
- }
+ st->current_address = normalize_addr(addr);
+
+ if (kasan_page_table(st, pud))
+ return 1;
+ return 0;
}

-#else
-#define walk_pud_level(s,a,e,p) walk_pmd_level(s,__pud(p4d_val(a)),e,p)
-#undef p4d_large
-#define p4d_large(a) pud_large(__pud(p4d_val(a)))
-#define p4d_none(a) pud_none(__pud(p4d_val(a)))
-#endif
+static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ prot = pud_flags(*pud);
+ eff = effective_prot(st->effective_prot_p4d, prot);
+
+ st->current_address = normalize_addr(addr);
+
+ if (pud_large(*pud))
+ note_page(st, __pgprot(prot), eff, 3);
+
+ st->effective_prot_pud = eff;

-static void walk_p4d_level(struct pg_state *st, pgd_t addr, pgprotval_t eff_in,
- unsigned long P)
+ return 0;
+}
+
+static int ptdump_test_p4d(unsigned long addr, unsigned long next,
+ p4d_t *p4d, struct mm_walk *walk)
{
- int i;
- p4d_t *start, *p4d_start;
- pgprotval_t prot, eff;
-
- if (PTRS_PER_P4D == 1)
- return walk_pud_level(st, __p4d(pgd_val(addr)), eff_in, P);
-
- p4d_start = start = (p4d_t *)pgd_page_vaddr(addr);
-
- for (i = 0; i < PTRS_PER_P4D; i++) {
- st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT);
- if (!p4d_none(*start)) {
- prot = p4d_flags(*start);
- eff = effective_prot(eff_in, prot);
- if (p4d_large(*start) || !p4d_present(*start)) {
- note_page(st, __pgprot(prot), eff, 2);
- } else if (!kasan_page_table(st, p4d_start)) {
- walk_pud_level(st, *start, eff,
- P + i * P4D_LEVEL_MULT);
- }
- } else
- note_page(st, __pgprot(0), 0, 2);
+ struct pg_state *st = walk->private;

- start++;
- }
+ st->current_address = normalize_addr(addr);
+
+ if (kasan_page_table(st, p4d))
+ return 1;
+ return 0;
}

-#undef pgd_large
-#define pgd_large(a) (pgtable_l5_enabled() ? pgd_large(a) : p4d_large(__p4d(pgd_val(a))))
-#define pgd_none(a) (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))
+static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ prot = p4d_flags(*p4d);
+ eff = effective_prot(st->effective_prot_pgd, prot);
+
+ st->current_address = normalize_addr(addr);
+
+ if (p4d_large(*p4d))
+ note_page(st, __pgprot(prot), eff, 2);
+
+ st->effective_prot_p4d = eff;
+
+ return 0;
+}

-static inline bool is_hypervisor_range(int idx)
+static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
-#ifdef CONFIG_X86_64
- /*
- * A hole in the beginning of kernel address space reserved
- * for a hypervisor.
- */
- return (idx >= pgd_index(GUARD_HOLE_BASE_ADDR)) &&
- (idx < pgd_index(GUARD_HOLE_END_ADDR));
+ struct pg_state *st = walk->private;
+ pgprotval_t eff, prot;
+
+ prot = pgd_flags(*pgd);
+
+#ifdef CONFIG_X86_PAE
+ eff = _PAGE_USER | _PAGE_RW;
#else
- return false;
+ eff = prot;
#endif
+
+ st->current_address = normalize_addr(addr);
+
+ if (pgd_large(*pgd))
+ note_page(st, __pgprot(prot), eff, 1);
+
+ st->effective_prot_pgd = eff;
+
+ return 0;
+}
+
+static int ptdump_hole(unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+
+ st->current_address = normalize_addr(addr);
+
+ note_page(st, __pgprot(0), 0, -1);
+
+ return 0;
}

static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
bool checkwx, bool dmesg)
{
- pgd_t *start = mm->pgd;
- pgprotval_t prot, eff;
- int i;
struct pg_state st = {};
+ struct mm_walk walk = {
+ .mm = mm,
+ .pgd_entry = ptdump_pgd_entry,
+ .p4d_entry = ptdump_p4d_entry,
+ .pud_entry = ptdump_pud_entry,
+ .pmd_entry = ptdump_pmd_entry,
+ .pte_entry = ptdump_pte_entry,
+ .test_p4d = ptdump_test_p4d,
+ .test_pud = ptdump_test_pud,
+ .test_pmd = ptdump_test_pmd,
+ .pte_hole = ptdump_hole,
+ .private = &st
+ };

st.to_dmesg = dmesg;
st.check_wx = checkwx;
@@ -532,27 +556,15 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
if (checkwx)
st.wx_pages = 0;

- for (i = 0; i < PTRS_PER_PGD; i++) {
- st.current_address = normalize_addr(i * PGD_LEVEL_MULT);
- if (!pgd_none(*start) && !is_hypervisor_range(i)) {
- prot = pgd_flags(*start);
-#ifdef CONFIG_X86_PAE
- eff = _PAGE_USER | _PAGE_RW;
+ down_read(&mm->mmap_sem);
+#ifdef CONFIG_X86_64
+ walk_page_range(0, PTRS_PER_PGD*PGD_LEVEL_MULT/2, &walk);
+ walk_page_range(normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT/2), ~0,
+ &walk);
#else
- eff = prot;
+ walk_page_range(0, ~0, &walk);
#endif
- if (pgd_large(*start) || !pgd_present(*start)) {
- note_page(&st, __pgprot(prot), eff, 1);
- } else {
- walk_p4d_level(&st, *start, eff,
- i * PGD_LEVEL_MULT);
- }
- } else
- note_page(&st, __pgprot(0), 0, 1);
-
- cond_resched();
- start++;
- }
+ up_read(&mm->mmap_sem);

/* Flush out the last page */
st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
--
2.20.1
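
How the effective permissions travel down the levels through the new effective_prot_* fields can be sketched in user space. Assumptions in this sketch: the toy_* names are hypothetical, the bit positions follow the architectural x86 layout (R/W bit 1, U/S bit 2, NX bit 63), and the first half of the combining expression is reproduced from memory because the hunk above shows only its NX half.

```c
#include <stdint.h>

typedef uint64_t pgprotval_t;

#define TOY_PAGE_RW	(UINT64_C(1) << 1)
#define TOY_PAGE_USER	(UINT64_C(1) << 2)
#define TOY_PAGE_NX	(UINT64_C(1) << 63)

/* RW and USER must be granted at every level; NX at any level is sticky. */
pgprotval_t toy_effective_prot(pgprotval_t prot1, pgprotval_t prot2)
{
	return (prot1 & prot2 & (TOY_PAGE_USER | TOY_PAGE_RW)) |
	       ((prot1 | prot2) & TOY_PAGE_NX);
}

/* Only the fields the patch adds to struct pg_state. */
struct toy_pg_state {
	pgprotval_t effective_prot_pgd;
	pgprotval_t effective_prot_p4d;
	pgprotval_t effective_prot_pud;
	pgprotval_t effective_prot_pmd;
};

/* Each callback combines its own flags with the parent level's result. */
void toy_pgd_entry(struct toy_pg_state *st, pgprotval_t prot)
{
	st->effective_prot_pgd = prot;
}

void toy_p4d_entry(struct toy_pg_state *st, pgprotval_t prot)
{
	st->effective_prot_p4d = toy_effective_prot(st->effective_prot_pgd, prot);
}

void toy_pud_entry(struct toy_pg_state *st, pgprotval_t prot)
{
	st->effective_prot_pud = toy_effective_prot(st->effective_prot_p4d, prot);
}

void toy_pmd_entry(struct toy_pg_state *st, pgprotval_t prot)
{
	st->effective_prot_pmd = toy_effective_prot(st->effective_prot_pud, prot);
}

pgprotval_t toy_pte_entry(const struct toy_pg_state *st, pgprotval_t prot)
{
	return toy_effective_prot(st->effective_prot_pmd, prot);
}
```

Because the generic walker calls the entry callbacks top-down, each level can rely on its parent's field already being up to date, which replaces the eff_in argument that the removed walk_*_level() functions passed explicitly.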

2019-04-03 14:20:44

by Steven Price

Subject: [PATCH v8 14/20] arm64: mm: Convert mm/dump.c to use walk_page_range()

Now that walk_page_range() can walk kernel page tables, we can switch
the arm64 ptdump code over to using it, simplifying the code.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/mm/dump.c | 117 ++++++++++++++++++++++---------------------
1 file changed, 59 insertions(+), 58 deletions(-)

diff --git a/arch/arm64/mm/dump.c b/arch/arm64/mm/dump.c
index 14fe23cd5932..ea20c1213498 100644
--- a/arch/arm64/mm/dump.c
+++ b/arch/arm64/mm/dump.c
@@ -72,7 +72,7 @@ struct pg_state {
struct seq_file *seq;
const struct addr_marker *marker;
unsigned long start_address;
- unsigned level;
+ int level;
u64 current_prot;
bool check_wx;
unsigned long wx_pages;
@@ -234,11 +234,14 @@ static void note_prot_wx(struct pg_state *st, unsigned long addr)
st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
}

-static void note_page(struct pg_state *st, unsigned long addr, unsigned level,
+static void note_page(struct pg_state *st, unsigned long addr, int level,
u64 val)
{
static const char units[] = "KMGTPE";
- u64 prot = val & pg_level[level].mask;
+ u64 prot = 0;
+
+ if (level >= 0)
+ prot = val & pg_level[level].mask;

if (!st->level) {
st->level = level;
@@ -286,73 +289,71 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level,

}

-static void walk_pte(struct pg_state *st, pmd_t *pmdp, unsigned long start,
- unsigned long end)
+static int pud_entry(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- unsigned long addr = start;
- pte_t *ptep = pte_offset_kernel(pmdp, start);
+ struct pg_state *st = walk->private;
+ pud_t val = READ_ONCE(*pud);
+
+ if (pud_table(val))
+ return 0;
+
+ note_page(st, addr, 2, pud_val(val));

- do {
- note_page(st, addr, 4, READ_ONCE(pte_val(*ptep)));
- } while (ptep++, addr += PAGE_SIZE, addr != end);
+ return 0;
}

-static void walk_pmd(struct pg_state *st, pud_t *pudp, unsigned long start,
- unsigned long end)
+static int pmd_entry(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- unsigned long next, addr = start;
- pmd_t *pmdp = pmd_offset(pudp, start);
-
- do {
- pmd_t pmd = READ_ONCE(*pmdp);
- next = pmd_addr_end(addr, end);
-
- if (pmd_none(pmd) || pmd_sect(pmd)) {
- note_page(st, addr, 3, pmd_val(pmd));
- } else {
- BUG_ON(pmd_bad(pmd));
- walk_pte(st, pmdp, addr, next);
- }
- } while (pmdp++, addr = next, addr != end);
+ struct pg_state *st = walk->private;
+ pmd_t val = READ_ONCE(*pmd);
+
+ if (pmd_table(val))
+ return 0;
+
+ note_page(st, addr, 3, pmd_val(val));
+
+ return 0;
}

-static void walk_pud(struct pg_state *st, pgd_t *pgdp, unsigned long start,
- unsigned long end)
+static int pte_entry(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- unsigned long next, addr = start;
- pud_t *pudp = pud_offset(pgdp, start);
-
- do {
- pud_t pud = READ_ONCE(*pudp);
- next = pud_addr_end(addr, end);
-
- if (pud_none(pud) || pud_sect(pud)) {
- note_page(st, addr, 2, pud_val(pud));
- } else {
- BUG_ON(pud_bad(pud));
- walk_pmd(st, pudp, addr, next);
- }
- } while (pudp++, addr = next, addr != end);
+ struct pg_state *st = walk->private;
+ pte_t val = READ_ONCE(*pte);
+
+ note_page(st, addr, 4, pte_val(val));
+
+ return 0;
+}
+
+static int pte_hole(unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ struct pg_state *st = walk->private;
+
+ note_page(st, addr, -1, 0);
+
+ return 0;
}

static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
- unsigned long start)
+ unsigned long start)
{
- unsigned long end = (start < TASK_SIZE_64) ? TASK_SIZE_64 : 0;
- unsigned long next, addr = start;
- pgd_t *pgdp = pgd_offset(mm, start);
-
- do {
- pgd_t pgd = READ_ONCE(*pgdp);
- next = pgd_addr_end(addr, end);
-
- if (pgd_none(pgd)) {
- note_page(st, addr, 1, pgd_val(pgd));
- } else {
- BUG_ON(pgd_bad(pgd));
- walk_pud(st, pgdp, addr, next);
- }
- } while (pgdp++, addr = next, addr != end);
+ struct mm_walk walk = {
+ .mm = mm,
+ .private = st,
+ .pud_entry = pud_entry,
+ .pmd_entry = pmd_entry,
+ .pte_entry = pte_entry,
+ .pte_hole = pte_hole
+ };
+ down_read(&mm->mmap_sem);
+ walk_page_range(start, start | (((unsigned long)PTRS_PER_PGD <<
+ PGDIR_SHIFT) - 1),
+ &walk);
+ up_read(&mm->mmap_sem);
}

void ptdump_walk_pgd(struct seq_file *m, struct ptdump_info *info)
--
2.20.1
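
Two details of the arm64 conversion above lend themselves to a small worked example: the end address handed to walk_page_range(), and the guard that makes level -1 (a hole) safe in note_page(). The sketch assumes a 4 KiB page, 48-bit VA configuration (PGDIR_SHIFT of 39, 512 PGD entries); the toy_* names and the mask table are hypothetical.

```c
#include <stdint.h>

/* Assumed configuration: 4 KiB pages, 48-bit virtual addresses. */
#define TOY_PGDIR_SHIFT		39
#define TOY_PTRS_PER_PGD	512

/* Last address covered by one full PGD's worth of mappings from start. */
uint64_t toy_walk_end(uint64_t start)
{
	return start | (((uint64_t)TOY_PTRS_PER_PGD << TOY_PGDIR_SHIFT) - 1);
}

/* Hypothetical per-level attribute masks, indexed by level 0..4. */
const uint64_t toy_level_mask[] = { 0x0, 0xf, 0xff, 0xfff, 0xffff };

/*
 * With a signed level, -1 can mark a hole; the mask table must not be
 * indexed in that case, so the protection bits default to 0.
 */
uint64_t toy_prot_for_level(int level, uint64_t val)
{
	uint64_t prot = 0;

	if (level >= 0)
		prot = val & toy_level_mask[level];
	return prot;
}
```

Under these assumptions a start of 0 yields an end of 0x0000ffffffffffff and a start of 0xffff000000000000 yields 0xffffffffffffffff, so one expression covers both the user and the kernel half.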

2019-04-03 14:20:51

by Steven Price

Subject: [PATCH v8 16/20] x86: mm: Point to struct seq_file from struct pg_state

mm/dump_pagetables.c passes both struct seq_file and struct pg_state
down the chain of walk_*_level() functions to be passed to note_page().
Instead place the struct seq_file in struct pg_state and access it from
struct pg_state (which is private to this file) in note_page().

Signed-off-by: Steven Price <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 69 ++++++++++++++++++-----------------
1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e2b53db92c34..3d12ac031144 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -40,6 +40,7 @@ struct pg_state {
bool to_dmesg;
bool check_wx;
unsigned long wx_pages;
+ struct seq_file *seq;
};

struct addr_marker {
@@ -268,11 +269,12 @@ static void note_wx(struct pg_state *st)
* of PTE entries; the next one is different so we need to
* print what we collected so far.
*/
-static void note_page(struct seq_file *m, struct pg_state *st,
- pgprot_t new_prot, pgprotval_t new_eff, int level)
+static void note_page(struct pg_state *st, pgprot_t new_prot,
+ pgprotval_t new_eff, int level)
{
pgprotval_t prot, cur, eff;
static const char units[] = "BKMGTPE";
+ struct seq_file *m = st->seq;

/*
* If we have a "break" in the series, we need to flush the state that
@@ -358,8 +360,8 @@ static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
((prot1 | prot2) & _PAGE_NX);
}

-static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
pte_t *pte;
@@ -370,7 +372,7 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
pte = pte_offset_map(&addr, st->current_address);
prot = pte_flags(*pte);
eff = effective_prot(eff_in, prot);
- note_page(m, st, __pgprot(prot), eff, 5);
+ note_page(st, __pgprot(prot), eff, 5);
pte_unmap(pte);
}
}
@@ -383,22 +385,20 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
* us dozens of seconds (minutes for 5-level config) while checking for
* W+X mapping or reading kernel_page_tables debugfs file.
*/
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
- void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
{
if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
(pgtable_l5_enabled() &&
__pa(pt) == __pa(kasan_early_shadow_p4d)) ||
__pa(pt) == __pa(kasan_early_shadow_pud)) {
pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]);
- note_page(m, st, __pgprot(prot), 0, 5);
+ note_page(st, __pgprot(prot), 0, 5);
return true;
}
return false;
}
#else
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
- void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
{
return false;
}
@@ -406,7 +406,7 @@ static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,

#if PTRS_PER_PMD > 1

-static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
+static void walk_pmd_level(struct pg_state *st, pud_t addr,
pgprotval_t eff_in, unsigned long P)
{
int i;
@@ -420,19 +420,19 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
prot = pmd_flags(*start);
eff = effective_prot(eff_in, prot);
if (pmd_large(*start) || !pmd_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 4);
- } else if (!kasan_page_table(m, st, pmd_start)) {
- walk_pte_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 4);
+ } else if (!kasan_page_table(st, pmd_start)) {
+ walk_pte_level(st, *start, eff,
P + i * PMD_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 4);
+ note_page(st, __pgprot(0), 0, 4);
start++;
}
}

#else
-#define walk_pmd_level(m,s,a,e,p) walk_pte_level(m,s,__pmd(pud_val(a)),e,p)
+#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
#undef pud_large
#define pud_large(a) pmd_large(__pmd(pud_val(a)))
#define pud_none(a) pmd_none(__pmd(pud_val(a)))
@@ -440,8 +440,8 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,

#if PTRS_PER_PUD > 1

-static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
pud_t *start, *pud_start;
@@ -455,34 +455,34 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
prot = pud_flags(*start);
eff = effective_prot(eff_in, prot);
if (pud_large(*start) || !pud_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 3);
- } else if (!kasan_page_table(m, st, pud_start)) {
- walk_pmd_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 3);
+ } else if (!kasan_page_table(st, pud_start)) {
+ walk_pmd_level(st, *start, eff,
P + i * PUD_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 3);
+ note_page(st, __pgprot(0), 0, 3);

start++;
}
}

#else
-#define walk_pud_level(m,s,a,e,p) walk_pmd_level(m,s,__pud(p4d_val(a)),e,p)
+#define walk_pud_level(s,a,e,p) walk_pmd_level(s,__pud(p4d_val(a)),e,p)
#undef p4d_large
#define p4d_large(a) pud_large(__pud(p4d_val(a)))
#define p4d_none(a) pud_none(__pud(p4d_val(a)))
#endif

-static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
- pgprotval_t eff_in, unsigned long P)
+static void walk_p4d_level(struct pg_state *st, pgd_t addr, pgprotval_t eff_in,
+ unsigned long P)
{
int i;
p4d_t *start, *p4d_start;
pgprotval_t prot, eff;

if (PTRS_PER_P4D == 1)
- return walk_pud_level(m, st, __p4d(pgd_val(addr)), eff_in, P);
+ return walk_pud_level(st, __p4d(pgd_val(addr)), eff_in, P);

p4d_start = start = (p4d_t *)pgd_page_vaddr(addr);

@@ -492,13 +492,13 @@ static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
prot = p4d_flags(*start);
eff = effective_prot(eff_in, prot);
if (p4d_large(*start) || !p4d_present(*start)) {
- note_page(m, st, __pgprot(prot), eff, 2);
- } else if (!kasan_page_table(m, st, p4d_start)) {
- walk_pud_level(m, st, *start, eff,
+ note_page(st, __pgprot(prot), eff, 2);
+ } else if (!kasan_page_table(st, p4d_start)) {
+ walk_pud_level(st, *start, eff,
P + i * P4D_LEVEL_MULT);
}
} else
- note_page(m, st, __pgprot(0), 0, 2);
+ note_page(st, __pgprot(0), 0, 2);

start++;
}
@@ -536,6 +536,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
}

st.check_wx = checkwx;
+ st.seq = m;
if (checkwx)
st.wx_pages = 0;

@@ -549,13 +550,13 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
eff = prot;
#endif
if (pgd_large(*start) || !pgd_present(*start)) {
- note_page(m, &st, __pgprot(prot), eff, 1);
+ note_page(&st, __pgprot(prot), eff, 1);
} else {
- walk_p4d_level(m, &st, *start, eff,
+ walk_p4d_level(&st, *start, eff,
i * PGD_LEVEL_MULT);
}
} else
- note_page(m, &st, __pgprot(0), 0, 1);
+ note_page(&st, __pgprot(0), 0, 1);

cond_resched();
start++;
@@ -563,7 +564,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,

/* Flush out the last page */
st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
- note_page(m, &st, __pgprot(0), 0, 0);
+ note_page(&st, __pgprot(0), 0, 0);
if (!checkwx)
return;
if (st.wx_pages)
--
2.20.1

2019-04-03 14:21:14

by Steven Price

Subject: [PATCH v8 17/20] x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct

To enable x86 to use the generic walk_page_range() function, the
callers of ptdump_walk_pgd_level() need to pass an mm_struct rather
than the raw pgd_t pointer. Luckily since commit 7e904a91bf60
("efi: Use efi_mm in x86 as well as ARM") we now have an mm_struct
for EFI on x86.

Signed-off-by: Steven Price <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/mm/dump_pagetables.c | 4 ++--
arch/x86/platform/efi/efi_32.c | 2 +-
arch/x86/platform/efi/efi_64.c | 4 ++--
4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0dd04cf6ebeb..579959750f34 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -27,7 +27,7 @@
extern pgd_t early_top_pgt[PTRS_PER_PGD];
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);

-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
void ptdump_walk_pgd_level_checkwx(void);
void ptdump_walk_user_pgd_level_checkwx(void);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 3d12ac031144..ddf8ea6b059d 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -574,9 +574,9 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
pr_info("x86/mm: Checked W+X mappings: passed, no W+X pages found.\n");
}

-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
{
- ptdump_walk_pgd_level_core(m, pgd, false, true);
+ ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
}

void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index 9959657127f4..9175ceaa6e72 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -49,7 +49,7 @@ void efi_sync_low_kernel_mappings(void) {}
void __init efi_dump_pagetable(void)
{
#ifdef CONFIG_EFI_PGT_DUMP
- ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+ ptdump_walk_pgd_level(NULL, &init_mm);
#endif
}

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index cf0347f61b21..a2e0f9800190 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -611,9 +611,9 @@ void __init efi_dump_pagetable(void)
{
#ifdef CONFIG_EFI_PGT_DUMP
if (efi_enabled(EFI_OLD_MEMMAP))
- ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+ ptdump_walk_pgd_level(NULL, &init_mm);
else
- ptdump_walk_pgd_level(NULL, efi_mm.pgd);
+ ptdump_walk_pgd_level(NULL, &efi_mm);
#endif
}

--
2.20.1

2019-04-05 04:15:12

by Anup Patel

Subject: Re: [PATCH v8 06/20] riscv: mm: Add p?d_large() definitions

On Wed, Apr 3, 2019 at 7:47 PM Steven Price <[email protected]> wrote:
>
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_large() functions/macros.
>
> For riscv a page is large when it has a read, write or execute bit
> set on it.
>
> CC: Palmer Dabbelt <[email protected]>
> CC: Albert Ou <[email protected]>
> CC: [email protected]
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/riscv/include/asm/pgtable-64.h | 7 +++++++
> arch/riscv/include/asm/pgtable.h | 7 +++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> index 7aa0ea9bd8bb..73747d9d7c66 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -51,6 +51,13 @@ static inline int pud_bad(pud_t pud)
> return !pud_present(pud);
> }
>
> +#define pud_large pud_large
> +static inline int pud_large(pud_t pud)
> +{
> + return pud_present(pud)
> + && (pud_val(pud) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
> +}
> +
> static inline void set_pud(pud_t *pudp, pud_t pud)
> {
> *pudp = pud;
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 1141364d990e..9570883c79e7 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -111,6 +111,13 @@ static inline int pmd_bad(pmd_t pmd)
> return !pmd_present(pmd);
> }
>
> +#define pmd_large pmd_large
> +static inline int pmd_large(pmd_t pmd)
> +{
> + return pmd_present(pmd)
> + && (pmd_val(pmd) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
> +}
> +
> static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
> {
> *pmdp = pmd;
> --
> 2.20.1
>

Looks good to me.

Reviewed-by: Anup Patel <[email protected]>

Regards,
Anup
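
For readers unfamiliar with the RISC-V encoding, the leaf test in the quoted patch can be modelled in user space as below. This is only a sketch under stated assumptions: the PTE_* names and count_leaves() are made up for the example, the bit positions are assumed to follow the RISC-V privileged spec (V, R, W, X in bits 0 to 3), and "present" is simplified to the V bit alone, which is narrower than the kernel's p?d_present().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed RISC-V PTE bits: Valid, Read, Write, eXecute. Names are local. */
#define PTE_VALID (UINT64_C(1) << 0)
#define PTE_READ  (UINT64_C(1) << 1)
#define PTE_WRITE (UINT64_C(1) << 2)
#define PTE_EXEC  (UINT64_C(1) << 3)

/* Simplified stand-in for p?d_present(): only looks at the V bit. */
static bool entry_present(uint64_t e)
{
	return (e & PTE_VALID) != 0;
}

/*
 * An entry is a leaf (maps memory directly) when it is present and has
 * any of R/W/X set. A present entry with R/W/X all clear points to the
 * next level of page table instead.
 */
static bool entry_is_leaf(uint64_t e)
{
	return entry_present(e) &&
	       (e & (PTE_READ | PTE_WRITE | PTE_EXEC)) != 0;
}

/* Hypothetical helper: count the leaf entries in one table. */
int count_leaves(const uint64_t *entries, int n)
{
	int i, leaves = 0;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (entry_is_leaf(entries[i]))
			leaves++;
	return leaves;
}
```

A walker using this test stops descending as soon as entry_is_leaf() is true, which is the information walk_page_range() needs from p?d_large().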

2019-04-10 14:57:39

by Steven Price

Subject: Re: [PATCH v8 00/20] Convert x86 & arm64 to use generic page walk

Hi all,

Gentle ping: who can take this? Is there anything blocking this series?

Thanks,

Steve

On 03/04/2019 15:16, Steven Price wrote:
> Most architectures currently have a debugfs file for dumping the kernel
> page tables. Currently each architecture has to implement custom
> functions for walking the page tables because the generic
> walk_page_range() function is unable to walk the page tables used by the
> kernel.
>
> This series extends the capabilities of walk_page_range() so that it can
> deal with the page tables of the kernel (which have no VMAs and can
> contain larger huge pages than exist for user space). x86 and arm64 are
> then converted to make use of walk_page_range() removing the custom page
> table walkers.
>
> To enable a generic page table walker to walk the unusual mappings of
> the kernel we need to implement a set of functions which let us know
> when the walker has reached the leaf entry. Since arm, powerpc, s390,
> sparc and x86 all have p?d_large macros let's standardise on that and
> implement those that are missing.
>
> Potentially future changes could unify the implementations of the
> debugfs walkers further, moving the common functionality into common
> code. This would require a common way of handling the effective
> permissions (currently implemented only for x86) along with a per-arch
> way of formatting the page table information for debugfs. One
> immediate benefit would be getting the KASAN speed up optimisation in
> arm64 (and other arches) which is currently only implemented for x86.
>
> Also available as a git tree:
> git://linux-arm.org/linux-sp.git walk_page_range/v8
>
> Changes since v7:
> https://lore.kernel.org/lkml/[email protected]/T/
> * Updated commit message in patch 2 to clarify that we rely on the page
> tables being walked to be the same page size/depth as the kernel's
> (since this confused me earlier today).
>
> Changes since v6:
> https://lore.kernel.org/lkml/[email protected]/T/
> * Split the changes for powerpc. pmd_large() is now added in patch 4,
> and pmd_is_leaf() is removed in patch 5.
>
> Changes since v5:
> https://lore.kernel.org/lkml/[email protected]/T/
> * Updated comment for struct mm_walk based on Mike Rapoport's
> suggestion
>
> Changes since v4:
> https://lore.kernel.org/lkml/[email protected]/T/
> * Correctly force result to a boolean in p?d_large for powerpc.
> * Added Acked-bys
> * Rebased onto v5.1-rc1
>
> Changes since v3:
> https://lore.kernel.org/lkml/[email protected]/T/
> * Restored the generic macros, only implement p?d_large() for
> architectures that have support for large pages. This also means
> adding dummy #defines for architectures that define p?d_large as
> static inline to avoid picking up the generic macro.
> * Drop the 'depth' argument from pte_hole
> * Because we no longer have the depth for holes, we also drop support
> in x86 for showing missing pages in debugfs. See discussion below:
> https://lore.kernel.org/lkml/[email protected]/
> * mips: only define p?d_large when _PAGE_HUGE is defined.
>
> Changes since v2:
> https://lore.kernel.org/lkml/[email protected]/T/
> * Rather than attempting to provide generic macros, actually implement
> p?d_large() for each architecture.
>
> Changes since v1:
> https://lore.kernel.org/lkml/[email protected]/T/
> * Added p4d_large() macro
> * Comments to explain p?d_large() macro semantics
> * Expanded comment for pte_hole() callback to explain mapping between
> depth and P?D
> * Handle folded page tables at all levels, so depth from pte_hole()
> ignores folding at any level (see real_depth() function in
> mm/pagewalk.c)
>
> Steven Price (20):
> arc: mm: Add p?d_large() definitions
> arm64: mm: Add p?d_large() definitions
> mips: mm: Add p?d_large() definitions
> powerpc: mm: Add p?d_large() definitions
> KVM: PPC: Book3S HV: Remove pmd_is_leaf()
> riscv: mm: Add p?d_large() definitions
> s390: mm: Add p?d_large() definitions
> sparc: mm: Add p?d_large() definitions
> x86: mm: Add p?d_large() definitions
> mm: Add generic p?d_large() macros
> mm: pagewalk: Add p4d_entry() and pgd_entry()
> mm: pagewalk: Allow walking without vma
> mm: pagewalk: Add test_p?d callbacks
> arm64: mm: Convert mm/dump.c to use walk_page_range()
> x86: mm: Don't display pages which aren't present in debugfs
> x86: mm: Point to struct seq_file from struct pg_state
> x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
> x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
> x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
> x86: mm: Convert dump_pagetables to use walk_page_range
>
> arch/arc/include/asm/pgtable.h | 1 +
> arch/arm64/include/asm/pgtable.h | 2 +
> arch/arm64/mm/dump.c | 117 +++----
> arch/mips/include/asm/pgtable-64.h | 8 +
> arch/powerpc/include/asm/book3s/64/pgtable.h | 30 +-
> arch/powerpc/kvm/book3s_64_mmu_radix.c | 12 +-
> arch/riscv/include/asm/pgtable-64.h | 7 +
> arch/riscv/include/asm/pgtable.h | 7 +
> arch/s390/include/asm/pgtable.h | 2 +
> arch/sparc/include/asm/pgtable_64.h | 2 +
> arch/x86/include/asm/pgtable.h | 10 +-
> arch/x86/mm/debug_pagetables.c | 8 +-
> arch/x86/mm/dump_pagetables.c | 347 ++++++++++---------
> arch/x86/platform/efi/efi_32.c | 2 +-
> arch/x86/platform/efi/efi_64.c | 4 +-
> include/asm-generic/pgtable.h | 19 +
> include/linux/mm.h | 26 +-
> mm/pagewalk.c | 76 +++-
> 18 files changed, 407 insertions(+), 273 deletions(-)
>

2019-04-12 14:45:44

by Dave Hansen

Subject: Re: [PATCH v8 00/20] Convert x86 & arm64 to use generic page walk

On 4/10/19 7:56 AM, Steven Price wrote:
> Gentle ping: who can take this? Is there anything blocking this series?

First of all, I really appreciate that you tried this. Every open-coded
page walk has a set of common pitfalls, but is pretty unbounded in what
kinds of bugs it can contain. I think this at least gets us to the
point where some of those pitfalls won't happen. That's cool, but I'm a
bit worried that it hasn't gotten easier in the end.

Linus also had some strong opinions in the past on how page walks should
be written. He needs to have a look before we go much further.

2019-04-17 14:35:42

by Steven Price

Subject: Re: [PATCH v8 00/20] Convert x86 & arm64 to use generic page walk

On 12/04/2019 15:44, Dave Hansen wrote:
> On 4/10/19 7:56 AM, Steven Price wrote:
>> Gentle ping: who can take this? Is there anything blocking this series?
>
> First of all, I really appreciate that you tried this. Every open-coded
> page walk has a set of common pitfalls, but is pretty unbounded in what
> kinds of bugs it can contain. I think this at least gets us to the
> point where some of those pitfalls won't happen. That's cool, but I'm a
> bit worried that it hasn't gotten easier in the end.

My plan was to implement the generic infrastructure and then work to
remove the per-arch code for ptdump debugfs where possible. This patch
series doesn't actually get that far because I wanted to get some
confidence that the general approach would be accepted.

> Linus also had some strong opinions in the past on how page walks should
> be written. He needs to have a look before we go much further.

Fair enough. I'll post the initial work I've done on unifying the
x86/arm64 ptdump code - the diffstat is a bit nicer on that - but
there's still work to be done so I'm posting just as an RFC.

Thanks,

Steve

2019-04-17 14:36:27

by Steven Price

Subject: [RFC PATCH 3/3] x86: mm: Switch to using generic pt_dump

Rather than providing our own callbacks for walking the page tables,
switch to using the generic version.

Signed-off-by: Steven Price <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/Kconfig.debug | 20 +--
arch/x86/mm/Makefile | 4 +-
arch/x86/mm/dump_pagetables.c | 297 +++++++---------------------------
4 files changed, 62 insertions(+), 260 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c1f9b3cf437c..122c24055f02 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -106,6 +106,7 @@ config X86
select GENERIC_IRQ_RESERVATION_MODE
select GENERIC_IRQ_SHOW
select GENERIC_PENDING_IRQ if SMP
+ select GENERIC_PTDUMP
select GENERIC_SMP_IDLE_THREAD
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 15d0fbe27872..dc1dfe213657 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -62,26 +62,10 @@ config EARLY_PRINTK_USB_XDBC
config MCSAFE_TEST
def_bool n

-config X86_PTDUMP_CORE
- def_bool n
-
-config X86_PTDUMP
- tristate "Export kernel pagetable layout to userspace via debugfs"
- depends on DEBUG_KERNEL
- select DEBUG_FS
- select X86_PTDUMP_CORE
- ---help---
- Say Y here if you want to show the kernel pagetable layout in a
- debugfs file. This information is only useful for kernel developers
- who are working in architecture specific areas of the kernel.
- It is probably not a good idea to enable this feature in a production
- kernel.
- If in doubt, say "N"
-
config EFI_PGT_DUMP
bool "Dump the EFI pagetable"
depends on EFI
- select X86_PTDUMP_CORE
+ select PTDUMP_CORE
---help---
Enable this if you want to dump the EFI page table before
enabling virtual mode. This can be used to debug miscellaneous
@@ -90,7 +74,7 @@ config EFI_PGT_DUMP

config DEBUG_WX
bool "Warn on W+X mappings at boot"
- select X86_PTDUMP_CORE
+ select PTDUMP_CORE
---help---
Generate a warning if any W+X mappings are found at boot.

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..5233190fc6bf 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -28,8 +28,8 @@ obj-$(CONFIG_X86_PAT) += pat_rbtree.o
obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o

obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-obj-$(CONFIG_X86_PTDUMP_CORE) += dump_pagetables.o
-obj-$(CONFIG_X86_PTDUMP) += debug_pagetables.o
+obj-$(CONFIG_PTDUMP_CORE) += dump_pagetables.o
+obj-$(CONFIG_PTDUMP_DEBUGFS) += debug_pagetables.o

obj-$(CONFIG_HIGHMEM) += highmem_32.o

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index f6b814aaddf7..955824c7cddb 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -20,6 +20,7 @@
#include <linux/seq_file.h>
#include <linux/highmem.h>
#include <linux/pci.h>
+#include <linux/ptdump.h>

#include <asm/e820/types.h>
#include <asm/pgtable.h>
@@ -30,15 +31,12 @@
* when a "break" in the continuity is found.
*/
struct pg_state {
+ struct ptdump_state ptdump;
int level;
- pgprot_t current_prot;
+ pgprotval_t current_prot;
pgprotval_t effective_prot;
- pgprotval_t effective_prot_pgd;
- pgprotval_t effective_prot_p4d;
- pgprotval_t effective_prot_pud;
- pgprotval_t effective_prot_pmd;
+ pgprotval_t prot_levels[5];
unsigned long start_address;
- unsigned long current_address;
const struct addr_marker *marker;
unsigned long lines;
bool to_dmesg;
@@ -179,9 +177,8 @@ static struct addr_marker address_markers[] = {
/*
* Print a readable form of a pgprot_t to the seq_file
*/
-static void printk_prot(struct seq_file *m, pgprot_t prot, int level, bool dmsg)
+static void printk_prot(struct seq_file *m, pgprotval_t pr, int level, bool dmsg)
{
- pgprotval_t pr = pgprot_val(prot);
static const char * const level_name[] =
{ "cr3", "pgd", "p4d", "pud", "pmd", "pte" };

@@ -228,24 +225,11 @@ static void printk_prot(struct seq_file *m, pgprot_t prot, int level, bool dmsg)
pt_dump_cont_printf(m, dmsg, "%s\n", level_name[level]);
}

-/*
- * On 64 bits, sign-extend the 48 bit address to 64 bit
- */
-static unsigned long normalize_addr(unsigned long u)
-{
- int shift;
- if (!IS_ENABLED(CONFIG_X86_64))
- return u;
-
- shift = 64 - (__VIRTUAL_MASK_SHIFT + 1);
- return (signed long)(u << shift) >> shift;
-}
-
-static void note_wx(struct pg_state *st)
+static void note_wx(struct pg_state *st, unsigned long addr)
{
unsigned long npages;

- npages = (st->current_address - st->start_address) / PAGE_SIZE;
+ npages = (addr - st->start_address) / PAGE_SIZE;

#ifdef CONFIG_PCI_BIOS
/*
@@ -253,7 +237,7 @@ static void note_wx(struct pg_state *st)
* Inform about it, but avoid the warning.
*/
if (pcibios_enabled && st->start_address >= PAGE_OFFSET + BIOS_BEGIN &&
- st->current_address <= PAGE_OFFSET + BIOS_END) {
+ addr <= PAGE_OFFSET + BIOS_END) {
pr_warn_once("x86/mm: PCI BIOS W+X mapping %lu pages\n", npages);
return;
}
@@ -264,25 +248,44 @@ static void note_wx(struct pg_state *st)
(void *)st->start_address);
}

+static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
+{
+ return (prot1 & prot2 & (_PAGE_USER | _PAGE_RW)) |
+ ((prot1 | prot2) & _PAGE_NX);
+}
+
/*
* This function gets called on a break in a continuous series
* of PTE entries; the next one is different so we need to
* print what we collected so far.
*/
-static void note_page(struct pg_state *st, pgprot_t new_prot,
- pgprotval_t new_eff, int level)
+static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+ unsigned long val)
{
- pgprotval_t prot, cur, eff;
+ struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
+ pgprotval_t new_prot, new_eff;
+ pgprotval_t cur, eff;
static const char units[] = "BKMGTPE";
struct seq_file *m = st->seq;

+ new_prot = val & PTE_FLAGS_MASK;
+
+ if (level > 1) {
+ new_eff = effective_prot(st->prot_levels[level - 2],
+ new_prot);
+ } else {
+ new_eff = new_prot;
+ }
+
+ if (level > 0)
+ st->prot_levels[level-1] = new_eff;
+
/*
* If we have a "break" in the series, we need to flush the state that
* we have now. "break" is either changing perms, levels or
* address space marker.
*/
- prot = pgprot_val(new_prot);
- cur = pgprot_val(st->current_prot);
+ cur = st->current_prot;
eff = st->effective_prot;

if (!st->level) {
@@ -294,14 +297,14 @@ static void note_page(struct pg_state *st, pgprot_t new_prot,
st->lines = 0;
pt_dump_seq_printf(m, st->to_dmesg, "---[ %s ]---\n",
st->marker->name);
- } else if (prot != cur || new_eff != eff || level != st->level ||
- st->current_address >= st->marker[1].start_address) {
+ } else if (new_prot != cur || new_eff != eff || level != st->level ||
+ addr >= st->marker[1].start_address) {
const char *unit = units;
unsigned long delta;
int width = sizeof(unsigned long) * 2;

if (st->check_wx && (eff & _PAGE_RW) && !(eff & _PAGE_NX))
- note_wx(st);
+ note_wx(st, addr);

/*
* Now print the actual finished series
@@ -311,9 +314,9 @@ static void note_page(struct pg_state *st, pgprot_t new_prot,
pt_dump_seq_printf(m, st->to_dmesg,
"0x%0*lx-0x%0*lx ",
width, st->start_address,
- width, st->current_address);
+ width, addr);

- delta = st->current_address - st->start_address;
+ delta = addr - st->start_address;
while (!(delta & 1023) && unit[1]) {
delta >>= 10;
unit++;
@@ -331,7 +334,7 @@ static void note_page(struct pg_state *st, pgprot_t new_prot,
* such as the start of vmalloc space etc.
* This helps in the interpretation.
*/
- if (st->current_address >= st->marker[1].start_address) {
+ if (addr >= st->marker[1].start_address) {
if (st->marker->max_lines &&
st->lines > st->marker->max_lines) {
unsigned long nskip =
@@ -347,228 +350,42 @@ static void note_page(struct pg_state *st, pgprot_t new_prot,
st->marker->name);
}

- st->start_address = st->current_address;
+ st->start_address = addr;
st->current_prot = new_prot;
st->effective_prot = new_eff;
st->level = level;
}
}

-static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
-{
- return (prot1 & prot2 & (_PAGE_USER | _PAGE_RW)) |
- ((prot1 | prot2) & _PAGE_NX);
-}
-
-static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
- pgprotval_t eff, prot;
-
- st->current_address = normalize_addr(addr);
-
- prot = pte_flags(*pte);
- eff = effective_prot(st->effective_prot_pmd, prot);
- note_page(st, __pgprot(prot), eff, 5);
-
- return 0;
-}
-
-#ifdef CONFIG_KASAN
-
-/*
- * This is an optimization for KASAN=y case. Since all kasan page tables
- * eventually point to the kasan_early_shadow_page we could call note_page()
- * right away without walking through lower level page tables. This saves
- * us dozens of seconds (minutes for 5-level config) while checking for
- * W+X mapping or reading kernel_page_tables debugfs file.
- */
-static inline bool kasan_page_table(struct pg_state *st, void *pt)
-{
- if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
- (pgtable_l5_enabled() &&
- __pa(pt) == __pa(kasan_early_shadow_p4d)) ||
- __pa(pt) == __pa(kasan_early_shadow_pud)) {
- pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]);
- note_page(st, __pgprot(prot), 0, 5);
- return true;
- }
- return false;
-}
-#else
-static inline bool kasan_page_table(struct pg_state *st, void *pt)
-{
- return false;
-}
-#endif
-
-static int ptdump_test_pmd(unsigned long addr, unsigned long next,
- pmd_t *pmd, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
-
- st->current_address = normalize_addr(addr);
-
- if (kasan_page_table(st, pmd))
- return 1;
- return 0;
-}
-
-static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
- pgprotval_t eff, prot;
-
- prot = pmd_flags(*pmd);
- eff = effective_prot(st->effective_prot_pud, prot);
-
- st->current_address = normalize_addr(addr);
-
- if (pmd_large(*pmd))
- note_page(st, __pgprot(prot), eff, 4);
-
- st->effective_prot_pmd = eff;
-
- return 0;
-}
-
-static int ptdump_test_pud(unsigned long addr, unsigned long next,
- pud_t *pud, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
-
- st->current_address = normalize_addr(addr);
-
- if (kasan_page_table(st, pud))
- return 1;
- return 0;
-}
-
-static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
- pgprotval_t eff, prot;
-
- prot = pud_flags(*pud);
- eff = effective_prot(st->effective_prot_p4d, prot);
-
- st->current_address = normalize_addr(addr);
-
- if (pud_large(*pud))
- note_page(st, __pgprot(prot), eff, 3);
-
- st->effective_prot_pud = eff;
-
- return 0;
-}
-
-static int ptdump_test_p4d(unsigned long addr, unsigned long next,
- p4d_t *p4d, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
-
- st->current_address = normalize_addr(addr);
-
- if (kasan_page_table(st, p4d))
- return 1;
- return 0;
-}
-
-static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
- pgprotval_t eff, prot;
-
- prot = p4d_flags(*p4d);
- eff = effective_prot(st->effective_prot_pgd, prot);
-
- st->current_address = normalize_addr(addr);
-
- if (p4d_large(*p4d))
- note_page(st, __pgprot(prot), eff, 2);
-
- st->effective_prot_p4d = eff;
-
- return 0;
-}
-
-static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
- pgprotval_t eff, prot;
+static const struct ptdump_range ptdump_ranges[] = {
+#ifdef CONFIG_X86_64

- prot = pgd_flags(*pgd);
+#define normalize_addr_shift (64 - (__VIRTUAL_MASK_SHIFT + 1))
+#define normalize_addr(u) ((signed long)((u) << normalize_addr_shift) >> normalize_addr_shift)

-#ifdef CONFIG_X86_PAE
- eff = _PAGE_USER | _PAGE_RW;
+ {0, PTRS_PER_PGD * PGD_LEVEL_MULT / 2},
+ {normalize_addr(PTRS_PER_PGD * PGD_LEVEL_MULT / 2), ~0UL},
#else
- eff = prot;
+ {0, ~0UL},
#endif
-
- st->current_address = normalize_addr(addr);
-
- if (pgd_large(*pgd))
- note_page(st, __pgprot(prot), eff, 1);
-
- st->effective_prot_pgd = eff;
-
- return 0;
-}
-
-static int ptdump_hole(unsigned long addr, unsigned long next,
- struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
-
- st->current_address = normalize_addr(addr);
-
- note_page(st, __pgprot(0), 0, -1);
-
- return 0;
-}
+ {0, 0}
+};

static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct *mm,
bool checkwx, bool dmesg)
{
- struct pg_state st = {};
- struct mm_walk walk = {
- .mm = mm,
- .pgd_entry = ptdump_pgd_entry,
- .p4d_entry = ptdump_p4d_entry,
- .pud_entry = ptdump_pud_entry,
- .pmd_entry = ptdump_pmd_entry,
- .pte_entry = ptdump_pte_entry,
- .test_p4d = ptdump_test_p4d,
- .test_pud = ptdump_test_pud,
- .test_pmd = ptdump_test_pmd,
- .pte_hole = ptdump_hole,
- .private = &st
+ struct pg_state st = {
+ .ptdump = {
+ .note_page = note_page,
+ .range = ptdump_ranges
+ },
+ .to_dmesg = dmesg,
+ .check_wx = checkwx,
+ .seq = m
};

- st.to_dmesg = dmesg;
- st.check_wx = checkwx;
- st.seq = m;
- if (checkwx)
- st.wx_pages = 0;
-
- down_read(&mm->mmap_sem);
-#ifdef CONFIG_X86_64
- walk_page_range(0, PTRS_PER_PGD*PGD_LEVEL_MULT/2, &walk);
- walk_page_range(normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT/2), ~0,
- &walk);
-#else
- walk_page_range(0, ~0, &walk);
-#endif
- up_read(&mm->mmap_sem);
+ ptdump_walk_pgd(&st.ptdump, mm);

- /* Flush out the last page */
- st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
- note_page(&st, __pgprot(0), 0, 0);
if (!checkwx)
return;
if (st.wx_pages)
--
2.20.1
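
The two x86_64 entries in ptdump_ranges[] rely on normalize_addr() to turn the first address of the upper half into its canonical, sign-extended form. A user-space sketch of that arithmetic follows. Assumptions: 4-level paging, so VIRTUAL_MASK_SHIFT is 47 here; upper_half_start() is a name made up for the example; and the sketch deliberately uses an XOR/subtract form instead of the patch's shift pair, since right-shifting a negative value is implementation-defined in ISO C.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for 4-level paging (48-bit virtual addresses). */
#define VIRTUAL_MASK_SHIFT 47

/*
 * Sign-extend a 48-bit virtual address to 64 bits. Bits above bit 47 of
 * the input are discarded first, then bit 47 is copied into bits 48..63.
 */
static uint64_t normalize_addr(uint64_t u)
{
	const uint64_t sign = UINT64_C(1) << VIRTUAL_MASK_SHIFT;

	u &= (sign << 1) - 1;		/* keep the low 48 bits */
	return (u ^ sign) - sign;	/* portable sign extension */
}

/* Hypothetical helper: first canonical address of the upper half. */
uint64_t upper_half_start(void)
{
	uint64_t start = normalize_addr(UINT64_C(1) << VIRTUAL_MASK_SHIFT);

	assert((start >> 63) == 1);	/* must land in the upper half */
	return start;
}
```

With these assumptions the lower range covers 0 up to 2^47 and the upper range starts at 0xffff800000000000, skipping the non-canonical hole in between.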

2019-04-17 14:37:00

by Steven Price

Subject: [RFC PATCH 1/3] mm: Add generic ptdump

Add a generic version of page table dumping that architectures can
opt in to.

Signed-off-by: Steven Price <[email protected]>
---
include/linux/ptdump.h | 19 +++++
mm/Kconfig.debug | 21 ++++++
mm/Makefile | 1 +
mm/ptdump.c | 159 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 200 insertions(+)
create mode 100644 include/linux/ptdump.h
create mode 100644 mm/ptdump.c

diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h
new file mode 100644
index 000000000000..eb8e78154be3
--- /dev/null
+++ b/include/linux/ptdump.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_PTDUMP_H
+#define _LINUX_PTDUMP_H
+
+struct ptdump_range {
+ unsigned long start;
+ unsigned long end;
+};
+
+struct ptdump_state {
+ void (*note_page)(struct ptdump_state *st, unsigned long addr,
+ int level, unsigned long val);
+ const struct ptdump_range *range;
+};
+
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm);
+
+#endif /* _LINUX_PTDUMP_H */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index e3df921208c0..21bbf559408b 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -111,3 +111,24 @@ config DEBUG_RODATA_TEST
depends on STRICT_KERNEL_RWX
---help---
This option enables a testcase for the setting rodata read-only.
+
+config GENERIC_PTDUMP
+ bool
+
+config PTDUMP_CORE
+ bool
+
+config PTDUMP_DEBUGFS
+ bool "Export kernel pagetable layout to userspace via debugfs"
+ depends on DEBUG_KERNEL
+ depends on DEBUG_FS
+ depends on GENERIC_PTDUMP
+ select PTDUMP_CORE
+ help
+ Say Y here if you want to show the kernel pagetable layout in a
+ debugfs file. This information is only useful for kernel developers
+ who are working in architecture specific areas of the kernel.
+ It is probably not a good idea to enable this feature in a production
+ kernel.
+
+ If in doubt, say N.
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..59d653c3250d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,3 +99,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
obj-$(CONFIG_HMM) += hmm.o
obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
diff --git a/mm/ptdump.c b/mm/ptdump.c
new file mode 100644
index 000000000000..c8e4c08ce206
--- /dev/null
+++ b/mm/ptdump.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/ptdump.h>
+#include <linux/kasan.h>
+
+static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+ pgd_t val = READ_ONCE(*pgd);
+
+ if (pgd_large(val))
+ st->note_page(st, addr, 1, pgd_val(val));
+
+ return 0;
+}
+
+static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+ p4d_t val = READ_ONCE(*p4d);
+
+ if (p4d_large(val))
+ st->note_page(st, addr, 2, p4d_val(val));
+
+ return 0;
+}
+
+static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+ pud_t val = READ_ONCE(*pud);
+
+ if (pud_large(val))
+ st->note_page(st, addr, 3, pud_val(val));
+
+ return 0;
+}
+
+static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+ pmd_t val = READ_ONCE(*pmd);
+
+ if (pmd_large(val))
+ st->note_page(st, addr, 4, pmd_val(val));
+
+ return 0;
+}
+
+static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ st->note_page(st, addr, 5, pte_val(READ_ONCE(*pte)));
+
+ return 0;
+}
+
+#ifdef CONFIG_KASAN
+/*
+ * This is an optimization for KASAN=y case. Since all kasan page tables
+ * eventually point to the kasan_early_shadow_page we could call note_page()
+ * right away without walking through lower level page tables. This saves
+ * us dozens of seconds (minutes for 5-level config) while checking for
+ * W+X mapping or reading kernel_page_tables debugfs file.
+ */
+static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
+ unsigned long addr)
+{
+ if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
+ (pgtable_l5_enabled() &&
+ __pa(pt) == __pa(kasan_early_shadow_p4d)) ||
+ __pa(pt) == __pa(kasan_early_shadow_pud)) {
+ st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
+ return true;
+ }
+ return false;
+}
+#else
+static inline bool kasan_page_table(struct ptdump_state *st, void *pt,
+ unsigned long addr)
+{
+ return false;
+}
+#endif
+
+static int ptdump_test_p4d(unsigned long addr, unsigned long next,
+ p4d_t *p4d, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ if (kasan_page_table(st, p4d, addr))
+ return 1;
+ return 0;
+}
+
+static int ptdump_test_pud(unsigned long addr, unsigned long next,
+ pud_t *pud, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ if (kasan_page_table(st, pud, addr))
+ return 1;
+ return 0;
+}
+
+static int ptdump_test_pmd(unsigned long addr, unsigned long next,
+ pmd_t *pmd, struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ if (kasan_page_table(st, pmd, addr))
+ return 1;
+ return 0;
+}
+
+static int ptdump_hole(unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ struct ptdump_state *st = walk->private;
+
+ st->note_page(st, addr, -1, 0);
+
+ return 0;
+}
+
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
+{
+ struct mm_walk walk = {
+ .mm = mm,
+ .pgd_entry = ptdump_pgd_entry,
+ .p4d_entry = ptdump_p4d_entry,
+ .pud_entry = ptdump_pud_entry,
+ .pmd_entry = ptdump_pmd_entry,
+ .pte_entry = ptdump_pte_entry,
+ .test_p4d = ptdump_test_p4d,
+ .test_pud = ptdump_test_pud,
+ .test_pmd = ptdump_test_pmd,
+ .pte_hole = ptdump_hole,
+ .private = st
+ };
+ const struct ptdump_range *range = st->range;
+
+ down_read(&mm->mmap_sem);
+ while (range->start != range->end) {
+ walk_page_range(range->start, range->end, &walk);
+ range++;
+ }
+ up_read(&mm->mmap_sem);
+
+ /* Flush out the last page */
+ st->note_page(st, 0, 0, 0);
+}
--
2.20.1
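
Note on the range loop in ptdump_walk_pgd() above: it walks an array of ranges until it reaches an entry whose start equals its end, so callers terminate the array with a {0, 0} sentinel. Below is a minimal userspace sketch of that pattern. It is not the kernel code: struct range, range_fn, walk_ranges() and sum_bytes() are hypothetical names, and the callback stands in for walk_page_range().

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct ptdump_range. */
struct range {
	unsigned long start;
	unsigned long end;
};

/* Illustrative callback type; it takes the place of walk_page_range(). */
typedef void (*range_fn)(unsigned long start, unsigned long end, void *ctx);

/*
 * Visit each range in order until the sentinel entry (start == end).
 * Returns the number of ranges visited, excluding the sentinel.
 */
static size_t walk_ranges(const struct range *r, range_fn fn, void *ctx)
{
	size_t n = 0;

	while (r->start != r->end) {
		fn(r->start, r->end, ctx);
		r++;
		n++;
	}
	return n;
}

/* Example callback: accumulate the number of bytes covered by the ranges. */
static void sum_bytes(unsigned long start, unsigned long end, void *ctx)
{
	*(unsigned long *)ctx += end - start;
}
```

One consequence of this convention is that an empty range cannot appear in the middle of the array, since it would be read as the terminator.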

2019-04-17 14:37:00

by Steven Price

Subject: [RFC PATCH 2/3] arm64: mm: Switch to using generic pt_dump

Instead of providing our own callbacks for walking the page tables,
switch to the generic version.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/Kconfig.debug | 19 +-----
arch/arm64/include/asm/ptdump.h | 8 +--
arch/arm64/mm/Makefile | 4 +-
arch/arm64/mm/dump.c | 104 +++++++++-----------------------
arch/arm64/mm/ptdump_debugfs.c | 2 +-
6 files changed, 37 insertions(+), 101 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 117b2541ef3d..4ff55b3ce8dc 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -97,6 +97,7 @@ config ARM64
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAP
+ select GENERIC_PTDUMP
select GENERIC_SCHED_CLOCK
select GENERIC_SMP_IDLE_THREAD
select GENERIC_STRNCPY_FROM_USER
diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index 69c9170bdd24..570dba4d4a0e 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -1,21 +1,4 @@

-config ARM64_PTDUMP_CORE
- def_bool n
-
-config ARM64_PTDUMP_DEBUGFS
- bool "Export kernel pagetable layout to userspace via debugfs"
- depends on DEBUG_KERNEL
- select ARM64_PTDUMP_CORE
- select DEBUG_FS
- help
- Say Y here if you want to show the kernel pagetable layout in a
- debugfs file. This information is only useful for kernel developers
- who are working in architecture specific areas of the kernel.
- It is probably not a good idea to enable this feature in a production
- kernel.
-
- If in doubt, say N.
-
config PID_IN_CONTEXTIDR
bool "Write the current PID to the CONTEXTIDR register"
help
@@ -41,7 +24,7 @@ config ARM64_RANDOMIZE_TEXT_OFFSET

config DEBUG_WX
bool "Warn on W+X mappings at boot"
- select ARM64_PTDUMP_CORE
+ select PTDUMP_CORE
---help---
Generate a warning if any W+X mappings are found at boot.

diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 9e948a93d26c..f8fecda7b61d 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -16,7 +16,7 @@
#ifndef __ASM_PTDUMP_H
#define __ASM_PTDUMP_H

-#ifdef CONFIG_ARM64_PTDUMP_CORE
+#ifdef CONFIG_PTDUMP_CORE

#include <linux/mm_types.h>
#include <linux/seq_file.h>
@@ -32,15 +32,15 @@ struct ptdump_info {
unsigned long base_addr;
};

-void ptdump_walk_pgd(struct seq_file *s, struct ptdump_info *info);
-#ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
+void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
+#ifdef CONFIG_PTDUMP_DEBUGFS
void ptdump_debugfs_register(struct ptdump_info *info, const char *name);
#else
static inline void ptdump_debugfs_register(struct ptdump_info *info,
const char *name) { }
#endif
void ptdump_check_wx(void);
-#endif /* CONFIG_ARM64_PTDUMP_CORE */
+#endif /* CONFIG_PTDUMP_CORE */

#ifdef CONFIG_DEBUG_WX
#define debug_checkwx() ptdump_check_wx()
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..d91030f0ffee 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -4,8 +4,8 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
ioremap.o mmap.o pgd.o mmu.o \
context.o proc.o pageattr.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-obj-$(CONFIG_ARM64_PTDUMP_CORE) += dump.o
-obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS) += ptdump_debugfs.o
+obj-$(CONFIG_PTDUMP_CORE) += dump.o
+obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
obj-$(CONFIG_NUMA) += numa.o
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
KASAN_SANITIZE_physaddr.o += n
diff --git a/arch/arm64/mm/dump.c b/arch/arm64/mm/dump.c
index ea20c1213498..e68df2ad8863 100644
--- a/arch/arm64/mm/dump.c
+++ b/arch/arm64/mm/dump.c
@@ -19,6 +19,7 @@
#include <linux/io.h>
#include <linux/init.h>
#include <linux/mm.h>
+#include <linux/ptdump.h>
#include <linux/sched.h>
#include <linux/seq_file.h>

@@ -69,6 +70,7 @@ static const struct addr_marker address_markers[] = {
* dumps out a description of the range.
*/
struct pg_state {
+ struct ptdump_state ptdump;
struct seq_file *seq;
const struct addr_marker *marker;
unsigned long start_address;
@@ -172,6 +174,10 @@ static struct pg_level pg_level[] = {
.name = "PGD",
.bits = pte_bits,
.num = ARRAY_SIZE(pte_bits),
+ }, { /* p4d */
+ .name = "P4D",
+ .bits = pte_bits,
+ .num = ARRAY_SIZE(pte_bits),
}, { /* pud */
.name = (CONFIG_PGTABLE_LEVELS > 3) ? "PUD" : "PGD",
.bits = pte_bits,
@@ -234,9 +240,10 @@ static void note_prot_wx(struct pg_state *st, unsigned long addr)
st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
}

-static void note_page(struct pg_state *st, unsigned long addr, int level,
- u64 val)
+static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+ unsigned long val)
{
+ struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
static const char units[] = "KMGTPE";
u64 prot = 0;

@@ -289,83 +296,21 @@ static void note_page(struct pg_state *st, unsigned long addr, int level,

}

-static int pud_entry(pud_t *pud, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
- pud_t val = READ_ONCE(*pud);
-
- if (pud_table(val))
- return 0;
-
- note_page(st, addr, 2, pud_val(val));
-
- return 0;
-}
-
-static int pmd_entry(pmd_t *pmd, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
- pmd_t val = READ_ONCE(*pmd);
-
- if (pmd_table(val))
- return 0;
-
- note_page(st, addr, 3, pmd_val(val));
-
- return 0;
-}
-
-static int pte_entry(pte_t *pte, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
- pte_t val = READ_ONCE(*pte);
-
- note_page(st, addr, 4, pte_val(val));
-
- return 0;
-}
-
-static int pte_hole(unsigned long addr, unsigned long next,
- struct mm_walk *walk)
-{
- struct pg_state *st = walk->private;
-
- note_page(st, addr, -1, 0);
-
- return 0;
-}
-
-static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
- unsigned long start)
-{
- struct mm_walk walk = {
- .mm = mm,
- .private = st,
- .pud_entry = pud_entry,
- .pmd_entry = pmd_entry,
- .pte_entry = pte_entry,
- .pte_hole = pte_hole
- };
- down_read(&mm->mmap_sem);
- walk_page_range(start, start | (((unsigned long)PTRS_PER_PGD <<
- PGDIR_SHIFT) - 1),
- &walk);
- up_read(&mm->mmap_sem);
-}
-
-void ptdump_walk_pgd(struct seq_file *m, struct ptdump_info *info)
+void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
{
struct pg_state st = {
- .seq = m,
+ .seq = s,
.marker = info->markers,
+ .ptdump = {
+ .note_page = note_page,
+ .range = (struct ptdump_range[]){
+ {info->base_addr, ~0UL},
+ {0, 0}
+ }
+ }
};

- walk_pgd(&st, info->mm, info->base_addr);
-
- note_page(&st, 0, 0, 0);
+ ptdump_walk_pgd(&st.ptdump, info->mm);
}

static void ptdump_initialize(void)
@@ -393,10 +338,17 @@ void ptdump_check_wx(void)
{ -1, NULL},
},
.check_wx = true,
+ .ptdump = {
+ .note_page = note_page,
+ .range = (struct ptdump_range[]) {
+ {VA_START, ~0UL},
+ {0, 0}
+ }
+ }
};

- walk_pgd(&st, &init_mm, VA_START);
- note_page(&st, 0, 0, 0);
+ ptdump_walk_pgd(&st.ptdump, &init_mm);
+
if (st.wx_pages || st.uxn_pages)
pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
st.wx_pages, st.uxn_pages);
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f25592..1f2eae3e988b 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -7,7 +7,7 @@
static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
- ptdump_walk_pgd(m, info);
+ ptdump_walk(m, info);
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
--
2.20.1
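
Note on the conversion above: the arch-specific struct pg_state embeds the generic struct ptdump_state, and note_page() recovers the outer structure with container_of(). The following is a minimal userspace sketch of that embedding pattern, not the kernel code. The names walker_state, arch_state, arch_note_page() and generic_walk() are hypothetical, and container_of() is re-defined here in terms of offsetof() because the kernel macro is not available outside the kernel.

```c
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic state: the only thing the shared walker knows about. */
struct walker_state {
	void (*note_page)(struct walker_state *st, unsigned long addr,
			  int level);
};

/* Arch-private state embedding the generic one (hypothetical fields). */
struct arch_state {
	struct walker_state base;
	unsigned long pages_seen;
	int last_level;
};

/* Callback: map the generic pointer back to the enclosing arch state. */
static void arch_note_page(struct walker_state *st, unsigned long addr,
			   int level)
{
	struct arch_state *as = container_of(st, struct arch_state, base);

	(void)addr; /* unused in this sketch */
	as->pages_seen++;
	as->last_level = level;
}

/* Generic code sees only the embedded struct and calls through it. */
static void generic_walk(struct walker_state *st)
{
	st->note_page(st, 0x1000, 4);
	st->note_page(st, 0x2000, 4);
	st->note_page(st, 0, 0); /* final flush call, as in the patch */
}
```

The design choice this illustrates is that the generic walker needs no void pointer for private data: each architecture keeps its own state around the embedded structure and passes only a pointer to the embedded member.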

2019-04-29 02:07:09

by Paul Mackerras

Subject: Re: [PATCH v8 05/20] KVM: PPC: Book3S HV: Remove pmd_is_leaf()

On Wed, Apr 03, 2019 at 03:16:12PM +0100, Steven Price wrote:
> Since pmd_large() is now always available, pmd_is_leaf() is redundant.
> Replace all uses with calls to pmd_large().

NAK. I don't want to do this, because pmd_is_leaf() is purely about
the guest page tables (the "partition-scoped" radix tree which
specifies the guest physical to host physical translation), not about
anything to do with the Linux process page tables. The guest page
tables have the same format as the Linux process page tables, but they
are managed separately.

If it makes things clearer, I could rename it to "guest_pmd_is_leaf()"
or something similar.

Paul.

2019-05-09 15:06:55

by Steven Price

Subject: Re: [PATCH v8 05/20] KVM: PPC: Book3S HV: Remove pmd_is_leaf()

On 29/04/2019 03:05, Paul Mackerras wrote:
> On Wed, Apr 03, 2019 at 03:16:12PM +0100, Steven Price wrote:
>> Since pmd_large() is now always available, pmd_is_leaf() is redundant.
>> Replace all uses with calls to pmd_large().
>
> NAK. I don't want to do this, because pmd_is_leaf() is purely about
> the guest page tables (the "partition-scoped" radix tree which
> specifies the guest physical to host physical translation), not about
> anything to do with the Linux process page tables. The guest page
> tables have the same format as the Linux process page tables, but they
> are managed separately.

Fair enough, I'll drop this patch in the next posting.

> If it makes things clearer, I could rename it to "guest_pmd_is_leaf()"
> or something similar.

I'll leave that decision up to you - it might prevent similar confusion
in the future.

Steve

2019-06-11 16:40:32

by Will Deacon

Subject: Re: [PATCH v8 02/20] arm64: mm: Add p?d_large() definitions

On Wed, Apr 03, 2019 at 03:16:09PM +0100, Steven Price wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information will be provided by the
> p?d_large() functions/macros.

I'd have thought p?d_leaf() might match better with your description above,
but I'm not going to quibble on naming.

For this patch:

Acked-by: Will Deacon <[email protected]>

Will