2020-05-15 14:02:57

by Joerg Roedel

Subject: [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

Hi,

here is the updated version of this series with these
changes:

- Removed sync_current_stack_to_mm() too.

- Added Acked-by's from Andy Lutomirski

The previous versions can be found here:

v1: https://lore.kernel.org/lkml/[email protected]/

v2: https://lore.kernel.org/lkml/[email protected]/

The cover-letter of v1 has more details on the motivation
for this patch-set.
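
For reference, the mechanism introduced by this series boils down to
an opt-in mask plus one arch hook. A simplified sketch, condensed from
patches 1 and 4 below and using the x86-64 names (not the exact hunks):

	/* arch header: modified levels that require a sync on this arch */
	#define ARCH_PAGE_TABLE_SYNC_MASK	PGTBL_P4D_MODIFIED

	/* arch code: called by vmalloc/ioremap when such a level changed */
	void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
	{
		sync_global_pgds(start, end);
	}

Architectures that leave ARCH_PAGE_TABLE_SYNC_MASK at its generic
default of 0 never get the hook called.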

Please review.

Regards,

Joerg

Joerg Roedel (7):
mm: Add functions to track page directory modifications
mm/vmalloc: Track which page-table levels were modified
mm/ioremap: Track which page-table levels were modified
x86/mm/64: Implement arch_sync_kernel_mappings()
x86/mm/32: Implement arch_sync_kernel_mappings()
mm: Remove vmalloc_sync_(un)mappings()
x86/mm: Remove vmalloc faulting

arch/x86/include/asm/pgtable-2level_types.h | 2 +
arch/x86/include/asm/pgtable-3level_types.h | 2 +
arch/x86/include/asm/pgtable_64_types.h | 2 +
arch/x86/include/asm/switch_to.h | 23 ---
arch/x86/kernel/setup_percpu.c | 6 +-
arch/x86/mm/fault.c | 176 +-------------------
arch/x86/mm/init_64.c | 5 +
arch/x86/mm/pti.c | 8 +-
arch/x86/mm/tlb.c | 37 ----
drivers/acpi/apei/ghes.c | 6 -
include/asm-generic/5level-fixup.h | 5 +-
include/asm-generic/pgtable.h | 23 +++
include/linux/mm.h | 46 +++++
include/linux/vmalloc.h | 18 +-
kernel/notifier.c | 1 -
kernel/trace/trace.c | 12 --
lib/ioremap.c | 46 +++--
mm/nommu.c | 12 --
mm/vmalloc.c | 109 +++++++-----
19 files changed, 204 insertions(+), 335 deletions(-)

--
2.17.1


2020-05-15 14:03:03

by Joerg Roedel

Subject: [PATCH v3 5/7] x86/mm/32: Implement arch_sync_kernel_mappings()

From: Joerg Roedel <[email protected]>

Implement the function to sync changes in vmalloc and ioremap ranges
to all page-tables.

Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/pgtable-2level_types.h | 2 ++
arch/x86/include/asm/pgtable-3level_types.h | 2 ++
arch/x86/mm/fault.c | 25 +++++++++++++--------
3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index 6deb6cd236e3..7f6ccff0ba72 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -20,6 +20,8 @@ typedef union {

#define SHARED_KERNEL_PMD 0

+#define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PMD_MODIFIED
+
/*
* traditional i386 two-level paging structure:
*/
diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index 33845d36897c..80fbb4a9ed87 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -27,6 +27,8 @@ typedef union {
#define SHARED_KERNEL_PMD (!static_cpu_has(X86_FEATURE_PTI))
#endif

+#define ARCH_PAGE_TABLE_SYNC_MASK (SHARED_KERNEL_PMD ? 0 : PGTBL_PMD_MODIFIED)
+
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a51df516b87b..edeb2adaf31f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -190,16 +190,13 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
return pmd_k;
}

-static void vmalloc_sync(void)
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
{
- unsigned long address;
-
- if (SHARED_KERNEL_PMD)
- return;
+ unsigned long addr;

- for (address = VMALLOC_START & PMD_MASK;
- address >= TASK_SIZE_MAX && address < VMALLOC_END;
- address += PMD_SIZE) {
+ for (addr = start & PMD_MASK;
+ addr >= TASK_SIZE_MAX && addr < VMALLOC_END;
+ addr += PMD_SIZE) {
struct page *page;

spin_lock(&pgd_lock);
@@ -210,13 +207,23 @@ static void vmalloc_sync(void)
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;

spin_lock(pgt_lock);
- vmalloc_sync_one(page_address(page), address);
+ vmalloc_sync_one(page_address(page), addr);
spin_unlock(pgt_lock);
}
spin_unlock(&pgd_lock);
}
}

+static void vmalloc_sync(void)
+{
+ unsigned long address;
+
+ if (SHARED_KERNEL_PMD)
+ return;
+
+ arch_sync_kernel_mappings(VMALLOC_START, VMALLOC_END);
+}
+
void vmalloc_sync_mappings(void)
{
vmalloc_sync();
--
2.17.1

2020-05-15 14:03:31

by Joerg Roedel

Subject: [PATCH v3 4/7] x86/mm/64: Implement arch_sync_kernel_mappings()

From: Joerg Roedel <[email protected]>

Implement the function to sync changes in vmalloc and ioremap ranges
to all page-tables.

Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/pgtable_64_types.h | 2 ++
arch/x86/mm/init_64.c | 5 +++++
2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 52e5f5f2240d..8f63efb2a2cc 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -159,4 +159,6 @@ extern unsigned int ptrs_per_p4d;

#define PGD_KERNEL_START ((PAGE_SIZE / 2) / sizeof(pgd_t))

+#define ARCH_PAGE_TABLE_SYNC_MASK (pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)
+
#endif /* _ASM_X86_PGTABLE_64_DEFS_H */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3b289c2f75cd..541af8e5bcd4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -217,6 +217,11 @@ void sync_global_pgds(unsigned long start, unsigned long end)
sync_global_pgds_l4(start, end);
}

+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+{
+ sync_global_pgds(start, end);
+}
+
/*
* NOTE: This function is marked __ref because it calls __init function
* (alloc_bootmem_pages). It's safe to do it ONLY when after_bootmem == 0.
--
2.17.1

2020-05-15 14:03:31

by Joerg Roedel

Subject: [PATCH v3 1/7] mm: Add functions to track page directory modifications

From: Joerg Roedel <[email protected]>

Add page-table allocation functions which will keep track of changed
directory entries. They are needed for new PGD, P4D, PUD, and PMD
entries and will be used in vmalloc and ioremap code to decide whether
any changes in the kernel mappings need to be synchronized between
page-tables in the system.
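
For illustration, a user of these helpers threads a pgtbl_mod_mask
through the page-table walk and acts on it afterwards, roughly like
this (sketch only; the real users and the ARCH_PAGE_TABLE_SYNC_MASK
check are added by the following patches):

	pgtbl_mod_mask mask = 0;
	pmd_t *pmd;
	pte_t *pte;

	/* allocating a new PMD page writes the PUD entry */
	pmd = pmd_alloc_track(&init_mm, pud, addr, &mask);
	if (!pmd)
		return -ENOMEM;
	/* allocating a new PTE page writes the PMD entry */
	pte = pte_alloc_kernel_track(pmd, addr, &mask);
	if (!pte)
		return -ENOMEM;
	/* ... install PTEs ... */

	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
		arch_sync_kernel_mappings(start, end);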

Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
---
include/asm-generic/5level-fixup.h | 5 ++--
include/asm-generic/pgtable.h | 23 +++++++++++++++
include/linux/mm.h | 46 ++++++++++++++++++++++++++++++
3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
index 4c74b1c1d13b..58046ddc08d0 100644
--- a/include/asm-generic/5level-fixup.h
+++ b/include/asm-generic/5level-fixup.h
@@ -17,8 +17,9 @@
((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
NULL : pud_offset(p4d, address))

-#define p4d_alloc(mm, pgd, address) (pgd)
-#define p4d_offset(pgd, start) (pgd)
+#define p4d_alloc(mm, pgd, address) (pgd)
+#define p4d_alloc_track(mm, pgd, address, mask) (pgd)
+#define p4d_offset(pgd, start) (pgd)

#ifndef __ASSEMBLY__
static inline int p4d_none(p4d_t p4d)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 329b8c8ca703..bf1418ae91a2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1209,6 +1209,29 @@ static inline bool arch_has_pfn_modify_check(void)
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif

+/*
+ * Page Table Modification bits for pgtbl_mod_mask.
+ *
+ * These are used by the p?d_alloc_track*() set of functions and in the generic
+ * vmalloc/ioremap code to track at which page-table levels entries have been
+ * modified. Based on that the code can better decide when vmalloc and ioremap
+ * mapping changes need to be synchronized to other page-tables in the system.
+ */
+#define __PGTBL_PGD_MODIFIED 0
+#define __PGTBL_P4D_MODIFIED 1
+#define __PGTBL_PUD_MODIFIED 2
+#define __PGTBL_PMD_MODIFIED 3
+#define __PGTBL_PTE_MODIFIED 4
+
+#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
+#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
+#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
+#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
+#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
+
+/* Page-Table Modification Mask */
+typedef unsigned int pgtbl_mod_mask;
+
#endif /* !__ASSEMBLY__ */

#ifndef io_remap_pfn_range
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..022fe682af9e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2078,13 +2078,54 @@ static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
NULL : pud_offset(p4d, address);
}
+
+static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long address,
+ pgtbl_mod_mask *mod_mask)
+
+{
+ if (unlikely(pgd_none(*pgd))) {
+ if (__p4d_alloc(mm, pgd, address))
+ return NULL;
+ *mod_mask |= PGTBL_PGD_MODIFIED;
+ }
+
+ return p4d_offset(pgd, address);
+}
+
#endif /* !__ARCH_HAS_5LEVEL_HACK */

+static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
+ unsigned long address,
+ pgtbl_mod_mask *mod_mask)
+{
+ if (unlikely(p4d_none(*p4d))) {
+ if (__pud_alloc(mm, p4d, address))
+ return NULL;
+ *mod_mask |= PGTBL_P4D_MODIFIED;
+ }
+
+ return pud_offset(p4d, address);
+}
+
static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?
NULL: pmd_offset(pud, address);
}
+
+static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
+ unsigned long address,
+ pgtbl_mod_mask *mod_mask)
+{
+ if (unlikely(pud_none(*pud))) {
+ if (__pmd_alloc(mm, pud, address))
+ return NULL;
+ *mod_mask |= PGTBL_PUD_MODIFIED;
+ }
+
+ return pmd_offset(pud, address);
+}
#endif /* CONFIG_MMU */

#if USE_SPLIT_PTE_PTLOCKS
@@ -2200,6 +2241,11 @@ static inline void pgtable_pte_page_dtor(struct page *page)
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
NULL: pte_offset_kernel(pmd, address))

+#define pte_alloc_kernel_track(pmd, address, mask) \
+ ((unlikely(pmd_none(*(pmd))) && \
+ (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
+ NULL: pte_offset_kernel(pmd, address))
+
#if USE_SPLIT_PMD_PTLOCKS

static struct page *pmd_to_page(pmd_t *pmd)
--
2.17.1

2020-05-15 14:05:36

by Joerg Roedel

Subject: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified

From: Joerg Roedel <[email protected]>

Track at which levels in the page-table entries were modified by
vmap/vunmap. After the page-table has been modified, use that
information to decide whether the new arch_sync_kernel_mappings()
needs to be called.
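
Note that architectures which do not define ARCH_PAGE_TABLE_SYNC_MASK
get the generic default of 0 from this patch, so the new check is
constant-folded away and no arch_sync_kernel_mappings() implementation
is needed; schematically:

	#ifndef ARCH_PAGE_TABLE_SYNC_MASK
	#define ARCH_PAGE_TABLE_SYNC_MASK 0	/* generic default */
	#endif

	/* in the generic mapping/unmapping paths */
	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)		/* always false here */
		arch_sync_kernel_mappings(start, end);	/* call optimized out */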

Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
---
include/linux/vmalloc.h | 16 ++++++++
mm/vmalloc.c | 88 ++++++++++++++++++++++++++++++-----------
2 files changed, 80 insertions(+), 24 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index a95d3cc74d79..c80bdb8a6b55 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -144,6 +144,22 @@ extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
void vmalloc_sync_mappings(void);
void vmalloc_sync_unmappings(void);

+/*
+ * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
+ * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
+ * needs to be called.
+ */
+#ifndef ARCH_PAGE_TABLE_SYNC_MASK
+#define ARCH_PAGE_TABLE_SYNC_MASK 0
+#endif
+
+/*
+ * There is no default implementation for arch_sync_kernel_mappings(). It is
+ * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK
+ * is 0.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
+
/*
* Lowlevel-APIs (not for driver use!)
*/
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9a8227afa073..184f5a556cf7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -69,7 +69,8 @@ static void free_work(struct work_struct *w)

/*** Page table manipulation functions ***/

-static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
+static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pte_t *pte;

@@ -78,73 +79,104 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
WARN_ON(!pte_none(ptent) && !pte_present(ptent));
} while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
}

-static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
+static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
+ int cleared;

pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_clear_huge(pmd))
+
+ cleared = pmd_clear_huge(pmd);
+ if (cleared || pmd_bad(*pmd))
+ *mask |= PGTBL_PMD_MODIFIED;
+
+ if (cleared)
continue;
if (pmd_none_or_clear_bad(pmd))
continue;
- vunmap_pte_range(pmd, addr, next);
+ vunmap_pte_range(pmd, addr, next, mask);
} while (pmd++, addr = next, addr != end);
}

-static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end)
+static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
+ int cleared;

pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_clear_huge(pud))
+
+ cleared = pud_clear_huge(pud);
+ if (cleared || pud_bad(*pud))
+ *mask |= PGTBL_PUD_MODIFIED;
+
+ if (cleared)
continue;
if (pud_none_or_clear_bad(pud))
continue;
- vunmap_pmd_range(pud, addr, next);
+ vunmap_pmd_range(pud, addr, next, mask);
} while (pud++, addr = next, addr != end);
}

-static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
+ int cleared;

p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
- if (p4d_clear_huge(p4d))
+
+ cleared = p4d_clear_huge(p4d);
+ if (cleared || p4d_bad(*p4d))
+ *mask |= PGTBL_P4D_MODIFIED;
+
+ if (cleared)
continue;
if (p4d_none_or_clear_bad(p4d))
continue;
- vunmap_pud_range(p4d, addr, next);
+ vunmap_pud_range(p4d, addr, next, mask);
} while (p4d++, addr = next, addr != end);
}

-static void vunmap_page_range(unsigned long addr, unsigned long end)
+static void vunmap_page_range(unsigned long start, unsigned long end)
{
pgd_t *pgd;
+ unsigned long addr = start;
unsigned long next;
+ pgtbl_mod_mask mask = 0;

BUG_ON(addr >= end);
+ start = addr;
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
+ if (pgd_bad(*pgd))
+ mask |= PGTBL_PGD_MODIFIED;
if (pgd_none_or_clear_bad(pgd))
continue;
- vunmap_p4d_range(pgd, addr, next);
+ vunmap_p4d_range(pgd, addr, next, &mask);
} while (pgd++, addr = next, addr != end);
+
+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
}

static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pte_t *pte;

@@ -153,7 +185,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
* callers keep track of where we're up to.
*/

- pte = pte_alloc_kernel(pmd, addr);
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
if (!pte)
return -ENOMEM;
do {
@@ -166,55 +198,59 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
(*nr)++;
} while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
return 0;
}

static int vmap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;

- pmd = pmd_alloc(&init_mm, pud, addr);
+ pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
if (!pmd)
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+ if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
}

static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;

- pud = pud_alloc(&init_mm, p4d, addr);
+ pud = pud_alloc_track(&init_mm, p4d, addr, mask);
if (!pud)
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+ if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
}

static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;

- p4d = p4d_alloc(&init_mm, pgd, addr);
+ p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
if (!p4d)
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
- if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+ if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -234,16 +270,20 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
unsigned long addr = start;
int err = 0;
int nr = 0;
+ pgtbl_mod_mask mask = 0;

BUG_ON(addr >= end);
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+ err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);

+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
+
return nr;
}

--
2.17.1

2020-05-15 14:05:38

by Joerg Roedel

Subject: [PATCH v3 3/7] mm/ioremap: Track which page-table levels were modified

From: Joerg Roedel <[email protected]>

Track at which levels in the page-table entries were modified by
ioremap_page_range(). After the page-table has been modified, use that
information to decide whether the new arch_sync_kernel_mappings()
needs to be called. The iounmap path re-uses vunmap(), which has
already been taken care of.

Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
---
lib/ioremap.c | 46 +++++++++++++++++++++++++++++++---------------
1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/lib/ioremap.c b/lib/ioremap.c
index 3f0e18543de8..ad485f08173b 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -61,13 +61,14 @@ static inline int ioremap_pmd_enabled(void) { return 0; }
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */

static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ pgtbl_mod_mask *mask)
{
pte_t *pte;
u64 pfn;

pfn = phys_addr >> PAGE_SHIFT;
- pte = pte_alloc_kernel(pmd, addr);
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
if (!pte)
return -ENOMEM;
do {
@@ -75,6 +76,7 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
return 0;
}

@@ -101,21 +103,24 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
}

static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;

- pmd = pmd_alloc(&init_mm, pud, addr);
+ pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
if (!pmd)
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);

- if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot))
+ if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+ *mask |= PGTBL_PMD_MODIFIED;
continue;
+ }

- if (ioremap_pte_range(pmd, addr, next, phys_addr, prot))
+ if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
@@ -144,21 +149,24 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
}

static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;

- pud = pud_alloc(&init_mm, p4d, addr);
+ pud = pud_alloc_track(&init_mm, p4d, addr, mask);
if (!pud)
return -ENOMEM;
do {
next = pud_addr_end(addr, end);

- if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot))
+ if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+ *mask |= PGTBL_PUD_MODIFIED;
continue;
+ }

- if (ioremap_pmd_range(pud, addr, next, phys_addr, prot))
+ if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
@@ -187,21 +195,24 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
}

static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;

- p4d = p4d_alloc(&init_mm, pgd, addr);
+ p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
if (!p4d)
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);

- if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot))
+ if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+ *mask |= PGTBL_P4D_MODIFIED;
continue;
+ }

- if (ioremap_pud_range(p4d, addr, next, phys_addr, prot))
+ if (ioremap_pud_range(p4d, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
@@ -214,6 +225,7 @@ int ioremap_page_range(unsigned long addr,
unsigned long start;
unsigned long next;
int err;
+ pgtbl_mod_mask mask = 0;

might_sleep();
BUG_ON(addr >= end);
@@ -222,13 +234,17 @@ int ioremap_page_range(unsigned long addr,
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot);
+ err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
+ &mask);
if (err)
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);

flush_cache_vmap(start, end);

+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
+
return err;
}

--
2.17.1

2020-05-15 14:19:52

by Peter Zijlstra

Subject: Re: [PATCH v3 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Fri, May 15, 2020 at 04:00:16PM +0200, Joerg Roedel wrote:
> Joerg Roedel (7):
> mm: Add functions to track page directory modifications
> mm/vmalloc: Track which page-table levels were modified
> mm/ioremap: Track which page-table levels were modified
> x86/mm/64: Implement arch_sync_kernel_mappings()
> x86/mm/32: Implement arch_sync_kernel_mappings()
> mm: Remove vmalloc_sync_(un)mappings()
> x86/mm: Remove vmalloc faulting
>
> arch/x86/include/asm/pgtable-2level_types.h | 2 +
> arch/x86/include/asm/pgtable-3level_types.h | 2 +
> arch/x86/include/asm/pgtable_64_types.h | 2 +
> arch/x86/include/asm/switch_to.h | 23 ---
> arch/x86/kernel/setup_percpu.c | 6 +-
> arch/x86/mm/fault.c | 176 +-------------------
> arch/x86/mm/init_64.c | 5 +
> arch/x86/mm/pti.c | 8 +-
> arch/x86/mm/tlb.c | 37 ----
> drivers/acpi/apei/ghes.c | 6 -
> include/asm-generic/5level-fixup.h | 5 +-
> include/asm-generic/pgtable.h | 23 +++
> include/linux/mm.h | 46 +++++
> include/linux/vmalloc.h | 18 +-
> kernel/notifier.c | 1 -
> kernel/trace/trace.c | 12 --
> lib/ioremap.c | 46 +++--
> mm/nommu.c | 12 --
> mm/vmalloc.c | 109 +++++++-----
> 19 files changed, 204 insertions(+), 335 deletions(-)

I'm thinking this improves the status-quo, so:

Acked-by: Peter Zijlstra (Intel) <[email protected]>

Like Andy, I think I'd like x86_64 to pre-populate, but that can easily
be done on top and should not hold this back.

2020-05-15 20:03:50

by Andrew Morton

Subject: Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified

On Fri, 15 May 2020 16:00:18 +0200 Joerg Roedel <[email protected]> wrote:

> Track at which levels in the page-table entries were modified by
> vmap/vunmap. After the page-table has been modified, use that
> information to decide whether the new arch_sync_kernel_mappings()
> needs to be called.

Lots of collisions here with Christoph's "decruft the vmalloc API" series
(http://lkml.kernel.org/r/[email protected]).

I attempted to fix things up.

unmap_kernel_range_noflush() needed to be redone.

map_kernel_range_noflush() might need the arch_sync_kernel_mappings() call?


From: Joerg Roedel <[email protected]>
Subject: mm/vmalloc: Track which page-table levels were modified

Track at which levels in the page-table entries were modified by
vmap/vunmap. After the page-table has been modified, use that information
to decide whether the new arch_sync_kernel_mappings() needs to be called.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

include/linux/vmalloc.h | 16 ++++++
mm/vmalloc.c | 91 +++++++++++++++++++++++++++-----------
2 files changed, 81 insertions(+), 26 deletions(-)

--- a/include/linux/vmalloc.h~mm-vmalloc-track-which-page-table-levels-were-modified
+++ a/include/linux/vmalloc.h
@@ -134,6 +134,22 @@ void vmalloc_sync_mappings(void);
void vmalloc_sync_unmappings(void);

/*
+ * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
+ * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
+ * needs to be called.
+ */
+#ifndef ARCH_PAGE_TABLE_SYNC_MASK
+#define ARCH_PAGE_TABLE_SYNC_MASK 0
+#endif
+
+/*
+ * There is no default implementation for arch_sync_kernel_mappings(). It is
+ * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK
+ * is 0.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
+
+/*
* Lowlevel-APIs (not for driver use!)
*/

--- a/mm/vmalloc.c~mm-vmalloc-track-which-page-table-levels-were-modified
+++ a/mm/vmalloc.c
@@ -69,7 +69,8 @@ static void free_work(struct work_struct

/*** Page table manipulation functions ***/

-static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
+static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pte_t *pte;

@@ -78,59 +79,81 @@ static void vunmap_pte_range(pmd_t *pmd,
pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
WARN_ON(!pte_none(ptent) && !pte_present(ptent));
} while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
}

-static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
+static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
+ int cleared;

pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_clear_huge(pmd))
+
+ cleared = pmd_clear_huge(pmd);
+ if (cleared || pmd_bad(*pmd))
+ *mask |= PGTBL_PMD_MODIFIED;
+
+ if (cleared)
continue;
if (pmd_none_or_clear_bad(pmd))
continue;
- vunmap_pte_range(pmd, addr, next);
+ vunmap_pte_range(pmd, addr, next, mask);
} while (pmd++, addr = next, addr != end);
}

-static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end)
+static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
+ int cleared;

pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_clear_huge(pud))
+
+ cleared = pud_clear_huge(pud);
+ if (cleared || pud_bad(*pud))
+ *mask |= PGTBL_PUD_MODIFIED;
+
+ if (cleared)
continue;
if (pud_none_or_clear_bad(pud))
continue;
- vunmap_pmd_range(pud, addr, next);
+ vunmap_pmd_range(pud, addr, next, mask);
} while (pud++, addr = next, addr != end);
}

-static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
+ int cleared;

p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
- if (p4d_clear_huge(p4d))
+
+ cleared = p4d_clear_huge(p4d);
+ if (cleared || p4d_bad(*p4d))
+ *mask |= PGTBL_P4D_MODIFIED;
+
+ if (cleared)
continue;
if (p4d_none_or_clear_bad(p4d))
continue;
- vunmap_pud_range(p4d, addr, next);
+ vunmap_pud_range(p4d, addr, next, mask);
} while (p4d++, addr = next, addr != end);
}

/**
* unmap_kernel_range_noflush - unmap kernel VM area
- * @addr: start of the VM area to unmap
+ * @start: start of the VM area to unmap
* @size: size of the VM area to unmap
*
* Unmap PFN_UP(@size) pages at @addr. The VM area @addr and @size specify
@@ -141,24 +164,33 @@ static void vunmap_p4d_range(pgd_t *pgd,
* for calling flush_cache_vunmap() on to-be-mapped areas before calling this
* function and flush_tlb_kernel_range() after.
*/
-void unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
+void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
{
- unsigned long end = addr + size;
+ unsigned long end = start + size;
unsigned long next;
pgd_t *pgd;
+ unsigned long addr = start;
+ pgtbl_mod_mask mask = 0;

BUG_ON(addr >= end);
+ start = addr;
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
+ if (pgd_bad(*pgd))
+ mask |= PGTBL_PGD_MODIFIED;
if (pgd_none_or_clear_bad(pgd))
continue;
- vunmap_p4d_range(pgd, addr, next);
+ vunmap_p4d_range(pgd, addr, next, &mask);
} while (pgd++, addr = next, addr != end);
+
+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
}

static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pte_t *pte;

@@ -167,7 +199,7 @@ static int vmap_pte_range(pmd_t *pmd, un
* callers keep track of where we're up to.
*/

- pte = pte_alloc_kernel(pmd, addr);
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
if (!pte)
return -ENOMEM;
do {
@@ -180,55 +212,59 @@ static int vmap_pte_range(pmd_t *pmd, un
set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
(*nr)++;
} while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
return 0;
}

static int vmap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;

- pmd = pmd_alloc(&init_mm, pud, addr);
+ pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
if (!pmd)
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+ if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
}

static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;

- pud = pud_alloc(&init_mm, p4d, addr);
+ pud = pud_alloc_track(&init_mm, p4d, addr, mask);
if (!pud)
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+ if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
}

static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;

- p4d = p4d_alloc(&init_mm, pgd, addr);
+ p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
if (!p4d)
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
- if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+ if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -260,12 +296,15 @@ int map_kernel_range_noflush(unsigned lo
pgd_t *pgd;
int err = 0;
int nr = 0;
+ pgtbl_mod_mask mask = 0;

BUG_ON(addr >= end);
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+ if (pgd_bad(*pgd))
+ mask |= PGTBL_PGD_MODIFIED;
+ err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
_

2020-05-16 12:58:23

by Jörg Rödel

Subject: Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified

Hi Andrew,

On Fri, May 15, 2020 at 01:01:42PM -0700, Andrew Morton wrote:
> On Fri, 15 May 2020 16:00:18 +0200 Joerg Roedel <[email protected]> wrote:
> Lots of collisions here with Christoph's "decruft the vmalloc API" series
> (http://lkml.kernel.org/r/[email protected]).
>
> I attempted to fix things up.
>
> unmap_kernel_range_noflush() needed to be redone.
>
> map_kernel_range_noflush() might need the arch_sync_kernel_mappings() call?

Yes, map_kernel_range_noflush() needs the arch_sync_kernel_mappings()
call as well.

Regards,

Joerg

2020-05-18 22:22:36

by Andrew Morton

Subject: Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified

On Sat, 16 May 2020 14:56:41 +0200 Joerg Roedel <[email protected]> wrote:

> Hi Andrew,
>
> On Fri, May 15, 2020 at 01:01:42PM -0700, Andrew Morton wrote:
> > On Fri, 15 May 2020 16:00:18 +0200 Joerg Roedel <[email protected]> wrote:
> > Lots of collisions here with Christoph's "decruft the vmalloc API" series
> > (http://lkml.kernel.org/r/[email protected]).
> >
> > I attempted to fix things up.
> >
> > unmap_kernel_range_noflush() needed to be redone.
> >
> > map_kernel_range_noflush() might need the arch_sync_kernel_mappings() call?
>
> Yes, map_kernel_range_noflush() needs the arch_sync_kernel_mappings()
> call as well.
>

This?

--- a/mm/vmalloc.c~mm-vmalloc-track-which-page-table-levels-were-modified-fix
+++ a/mm/vmalloc.c
@@ -309,6 +309,9 @@ int map_kernel_range_noflush(unsigned lo
return err;
} while (pgd++, addr = next, addr != end);

+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
+
return 0;
}


It would be nice to get all this (ie, linux-next) retested before we
send it upstream, please.

2020-05-19 15:27:18

by Jörg Rödel

Subject: Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified

On Mon, May 18, 2020 at 03:18:28PM -0700, Andrew Morton wrote:
> On Sat, 16 May 2020 14:56:41 +0200 Joerg Roedel <[email protected]> wrote:
> --- a/mm/vmalloc.c~mm-vmalloc-track-which-page-table-levels-were-modified-fix
> +++ a/mm/vmalloc.c
> @@ -309,6 +309,9 @@ int map_kernel_range_noflush(unsigned lo
> return err;
> } while (pgd++, addr = next, addr != end);
>
> + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> + arch_sync_kernel_mappings(start, end);
> +
> return 0;
> }

Yes, this is the right call.

> It would be nice to get all this (ie, linux-next) retested before we
> send it upstream, please.

Will do and report back.


Thanks,

Joerg

2020-05-25 10:08:30

by Jörg Rödel

Subject: Re: [PATCH v3 2/7] mm/vmalloc: Track which page-table levels were modified

On Mon, May 18, 2020 at 03:18:28PM -0700, Andrew Morton wrote:
> It would be nice to get all this (ie, linux-next) retested before we
> send it upstream, please.

Tested these patches as in next-20200522, with 2, 3, and 4-level
paging. On 4-level (aka 64-bit) I also re-ran Steven's trace-cmd
reproducer; no issues found, and the trace-cmd problem is still fixed.

Thanks,

Joerg

2020-06-05 10:10:28

by Catalin Marinas

Subject: Re: [PATCH v3 1/7] mm: Add functions to track page directory modifications

Hi Joerg,

On Fri, May 15, 2020 at 04:00:17PM +0200, Joerg Roedel wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a323422d783..022fe682af9e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2078,13 +2078,54 @@ static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
> return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
> NULL : pud_offset(p4d, address);
> }
> +
> +static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd,
> + unsigned long address,
> + pgtbl_mod_mask *mod_mask)
> +
> +{
> + if (unlikely(pgd_none(*pgd))) {
> + if (__p4d_alloc(mm, pgd, address))
> + return NULL;
> + *mod_mask |= PGTBL_PGD_MODIFIED;
> + }
> +
> + return p4d_offset(pgd, address);
> +}
> +
> #endif /* !__ARCH_HAS_5LEVEL_HACK */
>
> +static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
> + unsigned long address,
> + pgtbl_mod_mask *mod_mask)
> +{
> + if (unlikely(p4d_none(*p4d))) {
> + if (__pud_alloc(mm, p4d, address))
> + return NULL;
> + *mod_mask |= PGTBL_P4D_MODIFIED;
> + }
> +
> + return pud_offset(p4d, address);
> +}

This patch causes a kernel panic on arm64 (and possibly powerpc, I
haven't tried). arm64 still uses the 5level-fixup.h and pud_alloc()
checks for the empty p4d with pgd_none() instead of p4d_none().

The patch below fixes it:

-----------------------8<-----------------------------------
From 8fb5f4c1e0983bb501c8ec71f9bb9dd344b80e87 Mon Sep 17 00:00:00 2001
From: Catalin Marinas <[email protected]>
Date: Fri, 5 Jun 2020 10:56:03 +0100
Subject: [PATCH] mm: Use pgd_none() in pud_alloc_track() with
__ARCH_HAS_5LEVEL_HACK

Commit d8626138009b ("mm: add functions to track page directory
modifications") introduced the pud_alloc_track() function checking for
an empty p4d using p4d_none(). However, when __ARCH_HAS_5LEVEL_HACK is
defined, the pud_alloc() counterpart checks for an empty p4d using
pgd_none(). Since p4d_none() is always 0 in this case, no pud would be
allocated and the kernel panics during boot on arm64 (at least).

Until all architectures are moved away from the 5level-fixup.h, define a
pud_alloc_track() that matches the __ARCH_HAS_5LEVEL_HACK pud_alloc().

Fixes: d8626138009b ("mm: add functions to track page directory modifications")
Signed-off-by: Catalin Marinas <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/mm.h | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fda41eb7f1c8..9d3761a1fad5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2106,8 +2106,6 @@ static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd,
return p4d_offset(pgd, address);
}

-#endif /* !__ARCH_HAS_5LEVEL_HACK */
-
static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
unsigned long address,
pgtbl_mod_mask *mod_mask)
@@ -2121,6 +2119,23 @@ static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
return pud_offset(p4d, address);
}

+#else /* __ARCH_HAS_5LEVEL_HACK */
+
+static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
+ unsigned long address,
+ pgtbl_mod_mask *mod_mask)
+{
+ if (unlikely(pgd_none(*p4d))) {
+ if (__pud_alloc(mm, p4d, address))
+ return NULL;
+ *mod_mask |= PGTBL_P4D_MODIFIED;
+ }
+
+ return pud_offset(p4d, address);
+}
+
+#endif /* !__ARCH_HAS_5LEVEL_HACK */
+
static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?

2020-06-05 11:49:26

by Catalin Marinas

Subject: Re: [PATCH v3 1/7] mm: Add functions to track page directory modifications

On Fri, Jun 05, 2020 at 11:08:13AM +0100, Catalin Marinas wrote:
> This patch causes a kernel panic on arm64 (and possibly powerpc, I
> haven't tried). arm64 still uses the 5level-fixup.h and pud_alloc()
> checks for the empty p4d with pgd_none() instead of p4d_none().

Ah, should have checked the list first
(https://lore.kernel.org/linux-mm/[email protected]/).

Please ignore my patch then.

--
Catalin

2020-06-06 01:03:34

by Andrew Morton

Subject: Re: [PATCH v3 1/7] mm: Add functions to track page directory modifications

On Fri, 5 Jun 2020 12:46:55 +0100 Catalin Marinas <[email protected]> wrote:

> On Fri, Jun 05, 2020 at 11:08:13AM +0100, Catalin Marinas wrote:
> > This patch causes a kernel panic on arm64 (and possibly powerpc, I
> > haven't tried). arm64 still uses the 5level-fixup.h and pud_alloc()
> > checks for the empty p4d with pgd_none() instead of p4d_none().
>
> Ah, should have checked the list first
> (https://lore.kernel.org/linux-mm/[email protected]/).
>

Sorry about that.

Can you confirm that the merge of Mike's 5level_hack zappage has fixed
the arm64 wontboot?

2020-06-06 12:52:50

by Catalin Marinas

Subject: Re: [PATCH v3 1/7] mm: Add functions to track page directory modifications

On Fri, Jun 05, 2020 at 06:01:16PM -0700, Andrew Morton wrote:
> On Fri, 5 Jun 2020 12:46:55 +0100 Catalin Marinas <[email protected]> wrote:
> > On Fri, Jun 05, 2020 at 11:08:13AM +0100, Catalin Marinas wrote:
> > > This patch causes a kernel panic on arm64 (and possibly powerpc, I
> > > haven't tried). arm64 still uses the 5level-fixup.h and pud_alloc()
> > > checks for the empty p4d with pgd_none() instead of p4d_none().
> >
> > Ah, should have checked the list first
> > (https://lore.kernel.org/linux-mm/[email protected]/).
>
> Can you confirm that the merge of Mike's 5level_hack zappage has fixed
> the arm64 wontboot?

Yes, it has. Mainline is booting again on arm64.

Thanks.

--
Catalin