2020-05-08 14:42:54

by Joerg Roedel

Subject: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

Hi,

after the recent issue with vmalloc and tracing code[1] on x86, and a
long history of previous issues related to the vmalloc_sync_mappings()
interface, I think the time has come to remove it. Please
see [2], [3], and [4] for some of the issues encountered in the past.

The patches are based on v5.7-rc4 and add tracking of page-table
directory changes to the vmalloc and ioremap code. Depending on the
page-table levels at which changes have been made, a new per-arch
function, arch_sync_kernel_mappings(), is called.

On x86-64 with 4-level paging, this function will not be called more
than 64 times over a system's runtime, because the vmalloc space spans
64 PGD entries, which are only ever populated and never cleared.

As a side effect, this also allows us to get rid of vmalloc faulting
on x86, making it safe to touch vmalloc'ed memory in the page-fault
handler. Note that this potentially includes per-cpu memory.

The code has been tested on x86-64 and on x86-32 with and without PAE,
and it fixes the issue described in [1]. Additionally, it has been
compile-tested on almost all architectures supported by Linux. I
could not find working compilers for hexagon and unicore32, so those
two remain untested.

Please review.

Regards,

Joerg

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://lore.kernel.org/lkml/[email protected]/

Joerg Roedel (7):
mm: Add functions to track page directory modifications
mm/vmalloc: Track which page-table levels were modified
mm/ioremap: Track which page-table levels were modified
x86/mm/64: Implement arch_sync_kernel_mappings()
x86/mm/32: Implement arch_sync_kernel_mappings()
mm: Remove vmalloc_sync_(un)mappings()
x86/mm: Remove vmalloc faulting

arch/x86/include/asm/pgtable-2level_types.h | 2 +
arch/x86/include/asm/pgtable-3level_types.h | 2 +
arch/x86/include/asm/pgtable_64_types.h | 2 +
arch/x86/include/asm/switch_to.h | 23 ---
arch/x86/kernel/setup_percpu.c | 6 +-
arch/x86/mm/fault.c | 176 +-------------------
arch/x86/mm/init_64.c | 5 +
arch/x86/mm/pti.c | 8 +-
drivers/acpi/apei/ghes.c | 6 -
include/asm-generic/5level-fixup.h | 5 +-
include/asm-generic/pgtable.h | 23 +++
include/linux/mm.h | 46 +++++
include/linux/vmalloc.h | 13 +-
kernel/notifier.c | 1 -
lib/ioremap.c | 46 +++--
mm/nommu.c | 12 --
mm/vmalloc.c | 109 +++++++-----
17 files changed, 199 insertions(+), 286 deletions(-)

--
2.17.1


2020-05-08 14:42:55

by Joerg Roedel

Subject: [RFC PATCH 4/7] x86/mm/64: Implement arch_sync_kernel_mappings()

From: Joerg Roedel <[email protected]>

Implement the function to sync changes in vmalloc and ioremap ranges
to all page-tables.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable_64_types.h | 2 ++
arch/x86/mm/init_64.c | 5 +++++
2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 52e5f5f2240d..8f63efb2a2cc 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -159,4 +159,6 @@ extern unsigned int ptrs_per_p4d;

#define PGD_KERNEL_START ((PAGE_SIZE / 2) / sizeof(pgd_t))

+#define ARCH_PAGE_TABLE_SYNC_MASK (pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)
+
#endif /* _ASM_X86_PGTABLE_64_DEFS_H */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3b289c2f75cd..541af8e5bcd4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -217,6 +217,11 @@ void sync_global_pgds(unsigned long start, unsigned long end)
sync_global_pgds_l4(start, end);
}

+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+{
+ sync_global_pgds(start, end);
+}
+
/*
* NOTE: This function is marked __ref because it calls __init function
* (alloc_bootmem_pages). It's safe to do it ONLY when after_bootmem == 0.
--
2.17.1

2020-05-08 14:43:30

by Joerg Roedel

Subject: [RFC PATCH 2/7] mm/vmalloc: Track which page-table levels were modified

From: Joerg Roedel <[email protected]>

Track at which page-table levels entries were modified by
vmap/vunmap. After the page-table has been modified, use that
information to decide whether the new arch_sync_kernel_mappings()
needs to be called.

Signed-off-by: Joerg Roedel <[email protected]>
---
include/linux/vmalloc.h | 11 ++++++
mm/vmalloc.c | 88 ++++++++++++++++++++++++++++++-----------
2 files changed, 75 insertions(+), 24 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index a95d3cc74d79..eb364000cb03 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -144,6 +144,17 @@ extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
void vmalloc_sync_mappings(void);
void vmalloc_sync_unmappings(void);

+/*
+ * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
+ * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
+ * needs to be called.
+ */
+#ifndef ARCH_PAGE_TABLE_SYNC_MASK
+#define ARCH_PAGE_TABLE_SYNC_MASK 0
+#endif
+
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
+
/*
* Lowlevel-APIs (not for driver use!)
*/
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9a8227afa073..184f5a556cf7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -69,7 +69,8 @@ static void free_work(struct work_struct *w)

/*** Page table manipulation functions ***/

-static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
+static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pte_t *pte;

@@ -78,73 +79,104 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
WARN_ON(!pte_none(ptent) && !pte_present(ptent));
} while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
}

-static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
+static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;
+ int cleared;

pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_clear_huge(pmd))
+
+ cleared = pmd_clear_huge(pmd);
+ if (cleared || pmd_bad(*pmd))
+ *mask |= PGTBL_PMD_MODIFIED;
+
+ if (cleared)
continue;
if (pmd_none_or_clear_bad(pmd))
continue;
- vunmap_pte_range(pmd, addr, next);
+ vunmap_pte_range(pmd, addr, next, mask);
} while (pmd++, addr = next, addr != end);
}

-static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end)
+static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;
+ int cleared;

pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_clear_huge(pud))
+
+ cleared = pud_clear_huge(pud);
+ if (cleared || pud_bad(*pud))
+ *mask |= PGTBL_PUD_MODIFIED;
+
+ if (cleared)
continue;
if (pud_none_or_clear_bad(pud))
continue;
- vunmap_pmd_range(pud, addr, next);
+ vunmap_pmd_range(pud, addr, next, mask);
} while (pud++, addr = next, addr != end);
}

-static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;
+ int cleared;

p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
- if (p4d_clear_huge(p4d))
+
+ cleared = p4d_clear_huge(p4d);
+ if (cleared || p4d_bad(*p4d))
+ *mask |= PGTBL_P4D_MODIFIED;
+
+ if (cleared)
continue;
if (p4d_none_or_clear_bad(p4d))
continue;
- vunmap_pud_range(p4d, addr, next);
+ vunmap_pud_range(p4d, addr, next, mask);
} while (p4d++, addr = next, addr != end);
}

-static void vunmap_page_range(unsigned long addr, unsigned long end)
+static void vunmap_page_range(unsigned long start, unsigned long end)
{
pgd_t *pgd;
+ unsigned long addr = start;
unsigned long next;
+ pgtbl_mod_mask mask = 0;

BUG_ON(addr >= end);
+ start = addr;
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
+ if (pgd_bad(*pgd))
+ mask |= PGTBL_PGD_MODIFIED;
if (pgd_none_or_clear_bad(pgd))
continue;
- vunmap_p4d_range(pgd, addr, next);
+ vunmap_p4d_range(pgd, addr, next, &mask);
} while (pgd++, addr = next, addr != end);
+
+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
}

static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pte_t *pte;

@@ -153,7 +185,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
* callers keep track of where we're up to.
*/

- pte = pte_alloc_kernel(pmd, addr);
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
if (!pte)
return -ENOMEM;
do {
@@ -166,55 +198,59 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
(*nr)++;
} while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
return 0;
}

static int vmap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;

- pmd = pmd_alloc(&init_mm, pud, addr);
+ pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
if (!pmd)
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+ if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
}

static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;

- pud = pud_alloc(&init_mm, p4d, addr);
+ pud = pud_alloc_track(&init_mm, p4d, addr, mask);
if (!pud)
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+ if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
}

static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+ unsigned long end, pgprot_t prot, struct page **pages, int *nr,
+ pgtbl_mod_mask *mask)
{
p4d_t *p4d;
unsigned long next;

- p4d = p4d_alloc(&init_mm, pgd, addr);
+ p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
if (!p4d)
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
- if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+ if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -234,16 +270,20 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
unsigned long addr = start;
int err = 0;
int nr = 0;
+ pgtbl_mod_mask mask = 0;

BUG_ON(addr >= end);
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
- err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+ err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);

+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
+
return nr;
}

--
2.17.1

2020-05-08 14:44:51

by Joerg Roedel

Subject: [RFC PATCH 6/7] mm: Remove vmalloc_sync_(un)mappings()

From: Joerg Roedel <[email protected]>

These functions are not needed anymore because the vmalloc and ioremap
mappings are now synchronized when they are created or torn down.

Remove all callers and function definitions.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/mm/fault.c | 37 -------------------------------------
drivers/acpi/apei/ghes.c | 6 ------
include/linux/vmalloc.h | 2 --
kernel/notifier.c | 1 -
mm/nommu.c | 12 ------------
mm/vmalloc.c | 21 ---------------------
6 files changed, 79 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index edeb2adaf31f..255fc631b042 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -214,26 +214,6 @@ void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
}
}

-static void vmalloc_sync(void)
-{
- unsigned long address;
-
- if (SHARED_KERNEL_PMD)
- return;
-
- arch_sync_kernel_mappings(VMALLOC_START, VMALLOC_END);
-}
-
-void vmalloc_sync_mappings(void)
-{
- vmalloc_sync();
-}
-
-void vmalloc_sync_unmappings(void)
-{
- vmalloc_sync();
-}
-
/*
* 32-bit:
*
@@ -336,23 +316,6 @@ static void dump_pagetable(unsigned long address)

#else /* CONFIG_X86_64: */

-void vmalloc_sync_mappings(void)
-{
- /*
- * 64-bit mappings might allocate new p4d/pud pages
- * that need to be propagated to all tasks' PGDs.
- */
- sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
-}
-
-void vmalloc_sync_unmappings(void)
-{
- /*
- * Unmappings never allocate or free p4d/pud pages.
- * No work is required here.
- */
-}
-
/*
* 64-bit:
*
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 24c9642e8fc7..aabe9c5ee515 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -167,12 +167,6 @@ int ghes_estatus_pool_init(int num_ghes)
if (!addr)
goto err_pool_alloc;

- /*
- * New allocation must be visible in all pgd before it can be found by
- * an NMI allocating from the pool.
- */
- vmalloc_sync_mappings();
-
rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
if (rc)
goto err_pool_add;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index eb364000cb03..9063cdeb15bb 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -141,8 +141,6 @@ extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,

extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
-void vmalloc_sync_mappings(void);
-void vmalloc_sync_unmappings(void);

/*
* Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
diff --git a/kernel/notifier.c b/kernel/notifier.c
index 5989bbb93039..84c987dfbe03 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -519,7 +519,6 @@ NOKPROBE_SYMBOL(notify_die);

int register_die_notifier(struct notifier_block *nb)
{
- vmalloc_sync_mappings();
return atomic_notifier_chain_register(&die_chain, nb);
}
EXPORT_SYMBOL_GPL(register_die_notifier);
diff --git a/mm/nommu.c b/mm/nommu.c
index 318df4e236c9..b4267e1471f3 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -369,18 +369,6 @@ void vm_unmap_aliases(void)
}
EXPORT_SYMBOL_GPL(vm_unmap_aliases);

-/*
- * Implement a stub for vmalloc_sync_[un]mapping() if the architecture
- * chose not to have one.
- */
-void __weak vmalloc_sync_mappings(void)
-{
-}
-
-void __weak vmalloc_sync_unmappings(void)
-{
-}
-
struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes)
{
BUG();
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 184f5a556cf7..901540e4773b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1332,12 +1332,6 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
if (unlikely(valist == NULL))
return false;

- /*
- * First make sure the mappings are removed from all page-tables
- * before they are freed.
- */
- vmalloc_sync_unmappings();
-
/*
* TODO: to calculate a flush range without looping.
* The list can be up to lazy_max_pages() elements.
@@ -3177,21 +3171,6 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
}
EXPORT_SYMBOL(remap_vmalloc_range);

-/*
- * Implement stubs for vmalloc_sync_[un]mappings () if the architecture chose
- * not to have one.
- *
- * The purpose of this function is to make sure the vmalloc area
- * mappings are identical in all page-tables in the system.
- */
-void __weak vmalloc_sync_mappings(void)
-{
-}
-
-void __weak vmalloc_sync_unmappings(void)
-{
-}
-
static int f(pte_t *pte, unsigned long addr, void *data)
{
pte_t ***p = data;
--
2.17.1

2020-05-08 14:44:57

by Joerg Roedel

Subject: [RFC PATCH 1/7] mm: Add functions to track page directory modifications

From: Joerg Roedel <[email protected]>

Add page-table allocation functions which will keep track of changed
directory entries. They are needed for new PGD, P4D, PUD, and PMD
entries and will be used in vmalloc and ioremap code to decide whether
any changes in the kernel mappings need to be synchronized between
page-tables in the system.

Signed-off-by: Joerg Roedel <[email protected]>
---
include/asm-generic/5level-fixup.h | 5 ++--
include/asm-generic/pgtable.h | 23 +++++++++++++++
include/linux/mm.h | 46 ++++++++++++++++++++++++++++++
3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
index 4c74b1c1d13b..58046ddc08d0 100644
--- a/include/asm-generic/5level-fixup.h
+++ b/include/asm-generic/5level-fixup.h
@@ -17,8 +17,9 @@
((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
NULL : pud_offset(p4d, address))

-#define p4d_alloc(mm, pgd, address) (pgd)
-#define p4d_offset(pgd, start) (pgd)
+#define p4d_alloc(mm, pgd, address) (pgd)
+#define p4d_alloc_track(mm, pgd, address, mask) (pgd)
+#define p4d_offset(pgd, start) (pgd)

#ifndef __ASSEMBLY__
static inline int p4d_none(p4d_t p4d)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 329b8c8ca703..bf1418ae91a2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1209,6 +1209,29 @@ static inline bool arch_has_pfn_modify_check(void)
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif

+/*
+ * Page Table Modification bits for pgtbl_mod_mask.
+ *
+ * These are used by the p?d_alloc_track*() set of functions and in the generic
+ * vmalloc/ioremap code to track at which page-table levels entries have been
+ * modified. Based on that the code can better decide when vmalloc and ioremap
+ * mapping changes need to be synchronized to other page-tables in the system.
+ */
+#define __PGTBL_PGD_MODIFIED 0
+#define __PGTBL_P4D_MODIFIED 1
+#define __PGTBL_PUD_MODIFIED 2
+#define __PGTBL_PMD_MODIFIED 3
+#define __PGTBL_PTE_MODIFIED 4
+
+#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
+#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
+#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
+#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
+#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
+
+/* Page-Table Modification Mask */
+typedef unsigned int pgtbl_mod_mask;
+
#endif /* !__ASSEMBLY__ */

#ifndef io_remap_pfn_range
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..022fe682af9e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2078,13 +2078,54 @@ static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
NULL : pud_offset(p4d, address);
}
+
+static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long address,
+ pgtbl_mod_mask *mod_mask)
+
+{
+ if (unlikely(pgd_none(*pgd))) {
+ if (__p4d_alloc(mm, pgd, address))
+ return NULL;
+ *mod_mask |= PGTBL_PGD_MODIFIED;
+ }
+
+ return p4d_offset(pgd, address);
+}
+
#endif /* !__ARCH_HAS_5LEVEL_HACK */

+static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d,
+ unsigned long address,
+ pgtbl_mod_mask *mod_mask)
+{
+ if (unlikely(p4d_none(*p4d))) {
+ if (__pud_alloc(mm, p4d, address))
+ return NULL;
+ *mod_mask |= PGTBL_P4D_MODIFIED;
+ }
+
+ return pud_offset(p4d, address);
+}
+
static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?
NULL: pmd_offset(pud, address);
}
+
+static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
+ unsigned long address,
+ pgtbl_mod_mask *mod_mask)
+{
+ if (unlikely(pud_none(*pud))) {
+ if (__pmd_alloc(mm, pud, address))
+ return NULL;
+ *mod_mask |= PGTBL_PUD_MODIFIED;
+ }
+
+ return pmd_offset(pud, address);
+}
#endif /* CONFIG_MMU */

#if USE_SPLIT_PTE_PTLOCKS
@@ -2200,6 +2241,11 @@ static inline void pgtable_pte_page_dtor(struct page *page)
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
NULL: pte_offset_kernel(pmd, address))

+#define pte_alloc_kernel_track(pmd, address, mask) \
+ ((unlikely(pmd_none(*(pmd))) && \
+ (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
+ NULL: pte_offset_kernel(pmd, address))
+
#if USE_SPLIT_PMD_PTLOCKS

static struct page *pmd_to_page(pmd_t *pmd)
--
2.17.1

2020-05-08 14:45:06

by Joerg Roedel

Subject: [RFC PATCH 7/7] x86/mm: Remove vmalloc faulting

From: Joerg Roedel <[email protected]>

Remove fault handling on vmalloc areas, as the vmalloc code now takes
care of synchronizing changes to all page-tables in the system.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/switch_to.h | 23 ------
arch/x86/kernel/setup_percpu.c | 6 +-
arch/x86/mm/fault.c | 134 -------------------------------
arch/x86/mm/pti.c | 8 +-
4 files changed, 4 insertions(+), 167 deletions(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 0e059b73437b..9f69cc497f4b 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -12,27 +12,6 @@ struct task_struct *__switch_to_asm(struct task_struct *prev,
__visible struct task_struct *__switch_to(struct task_struct *prev,
struct task_struct *next);

-/* This runs runs on the previous thread's stack. */
-static inline void prepare_switch_to(struct task_struct *next)
-{
-#ifdef CONFIG_VMAP_STACK
- /*
- * If we switch to a stack that has a top-level paging entry
- * that is not present in the current mm, the resulting #PF will
- * will be promoted to a double-fault and we'll panic. Probe
- * the new stack now so that vmalloc_fault can fix up the page
- * tables if needed. This can only happen if we use a stack
- * in vmap space.
- *
- * We assume that the stack is aligned so that it never spans
- * more than one top-level paging entry.
- *
- * To minimize cache pollution, just follow the stack pointer.
- */
- READ_ONCE(*(unsigned char *)next->thread.sp);
-#endif
-}
-
asmlinkage void ret_from_fork(void);

/*
@@ -67,8 +46,6 @@ struct fork_frame {

#define switch_to(prev, next, last) \
do { \
- prepare_switch_to(next); \
- \
((last) = __switch_to_asm((prev), (next))); \
} while (0)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index e6d7894ad127..fd945ce78554 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -287,9 +287,9 @@ void __init setup_per_cpu_areas(void)
/*
* Sync back kernel address range again. We already did this in
* setup_arch(), but percpu data also needs to be available in
- * the smpboot asm. We can't reliably pick up percpu mappings
- * using vmalloc_fault(), because exception dispatch needs
- * percpu data.
+ * the smpboot asm and arch_sync_kernel_mappings() doesn't sync to
+ * swapper_pg_dir on 32-bit. The per-cpu mappings need to be available
+ * there too.
*
* FIXME: Can the later sync in setup_cpu_entry_areas() replace
* this call?
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 255fc631b042..dffe8e4d3140 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -214,44 +214,6 @@ void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
}
}

-/*
- * 32-bit:
- *
- * Handle a fault on the vmalloc or module mapping area
- */
-static noinline int vmalloc_fault(unsigned long address)
-{
- unsigned long pgd_paddr;
- pmd_t *pmd_k;
- pte_t *pte_k;
-
- /* Make sure we are in vmalloc area: */
- if (!(address >= VMALLOC_START && address < VMALLOC_END))
- return -1;
-
- /*
- * Synchronize this task's top level page-table
- * with the 'reference' page table.
- *
- * Do _not_ use "current" here. We might be inside
- * an interrupt in the middle of a task switch..
- */
- pgd_paddr = read_cr3_pa();
- pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
- if (!pmd_k)
- return -1;
-
- if (pmd_large(*pmd_k))
- return 0;
-
- pte_k = pte_offset_kernel(pmd_k, address);
- if (!pte_present(*pte_k))
- return -1;
-
- return 0;
-}
-NOKPROBE_SYMBOL(vmalloc_fault);
-
/*
* Did it hit the DOS screen memory VA from vm86 mode?
*/
@@ -316,79 +278,6 @@ static void dump_pagetable(unsigned long address)

#else /* CONFIG_X86_64: */

-/*
- * 64-bit:
- *
- * Handle a fault on the vmalloc area
- */
-static noinline int vmalloc_fault(unsigned long address)
-{
- pgd_t *pgd, *pgd_k;
- p4d_t *p4d, *p4d_k;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- /* Make sure we are in vmalloc area: */
- if (!(address >= VMALLOC_START && address < VMALLOC_END))
- return -1;
-
- /*
- * Copy kernel mappings over when needed. This can also
- * happen within a race in page table update. In the later
- * case just flush:
- */
- pgd = (pgd_t *)__va(read_cr3_pa()) + pgd_index(address);
- pgd_k = pgd_offset_k(address);
- if (pgd_none(*pgd_k))
- return -1;
-
- if (pgtable_l5_enabled()) {
- if (pgd_none(*pgd)) {
- set_pgd(pgd, *pgd_k);
- arch_flush_lazy_mmu_mode();
- } else {
- BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_k));
- }
- }
-
- /* With 4-level paging, copying happens on the p4d level. */
- p4d = p4d_offset(pgd, address);
- p4d_k = p4d_offset(pgd_k, address);
- if (p4d_none(*p4d_k))
- return -1;
-
- if (p4d_none(*p4d) && !pgtable_l5_enabled()) {
- set_p4d(p4d, *p4d_k);
- arch_flush_lazy_mmu_mode();
- } else {
- BUG_ON(p4d_pfn(*p4d) != p4d_pfn(*p4d_k));
- }
-
- BUILD_BUG_ON(CONFIG_PGTABLE_LEVELS < 4);
-
- pud = pud_offset(p4d, address);
- if (pud_none(*pud))
- return -1;
-
- if (pud_large(*pud))
- return 0;
-
- pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd))
- return -1;
-
- if (pmd_large(*pmd))
- return 0;
-
- pte = pte_offset_kernel(pmd, address);
- if (!pte_present(*pte))
- return -1;
-
- return 0;
-}
-NOKPROBE_SYMBOL(vmalloc_fault);
-
#ifdef CONFIG_CPU_SUP_AMD
static const char errata93_warning[] =
KERN_ERR
@@ -1227,29 +1116,6 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
*/
WARN_ON_ONCE(hw_error_code & X86_PF_PK);

- /*
- * We can fault-in kernel-space virtual memory on-demand. The
- * 'reference' page table is init_mm.pgd.
- *
- * NOTE! We MUST NOT take any locks for this case. We may
- * be in an interrupt or a critical region, and should
- * only copy the information from the master page table,
- * nothing more.
- *
- * Before doing this on-demand faulting, ensure that the
- * fault is not any of the following:
- * 1. A fault on a PTE with a reserved bit set.
- * 2. A fault caused by a user-mode access. (Do not demand-
- * fault kernel memory due to user-mode accesses).
- * 3. A fault caused by a page-level protection violation.
- * (A demand fault would be on a non-present page which
- * would have X86_PF_PROT==0).
- */
- if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
- if (vmalloc_fault(address) >= 0)
- return;
- }
-
/* Was the fault spurious, caused by lazy TLB invalidation? */
if (spurious_kernel_fault(hw_error_code, address))
return;
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 843aa10a4cb6..da0fb17a1a36 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -448,13 +448,7 @@ static void __init pti_clone_user_shared(void)
* the sp1 and sp2 slots.
*
* This is done for all possible CPUs during boot to ensure
- * that it's propagated to all mms. If we were to add one of
- * these mappings during CPU hotplug, we would need to take
- * some measure to make sure that every mm that subsequently
- * ran on that CPU would have the relevant PGD entry in its
- * pagetables. The usual vmalloc_fault() mechanism would not
- * work for page faults taken in entry_SYSCALL_64 before RSP
- * is set up.
+ * that it's propagated to all mms.
*/

unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
--
2.17.1

2020-05-08 14:45:37

by Joerg Roedel

Subject: [RFC PATCH 3/7] mm/ioremap: Track which page-table levels were modified

From: Joerg Roedel <[email protected]>

Track at which page-table levels entries were modified by
ioremap_page_range(). After the page-table has been modified, use that
information to decide whether the new arch_sync_kernel_mappings()
needs to be called. The iounmap path re-uses vunmap(), which has
already been taken care of.

Signed-off-by: Joerg Roedel <[email protected]>
---
lib/ioremap.c | 46 +++++++++++++++++++++++++++++++---------------
1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/lib/ioremap.c b/lib/ioremap.c
index 3f0e18543de8..ad485f08173b 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -61,13 +61,14 @@ static inline int ioremap_pmd_enabled(void) { return 0; }
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */

static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ pgtbl_mod_mask *mask)
{
pte_t *pte;
u64 pfn;

pfn = phys_addr >> PAGE_SHIFT;
- pte = pte_alloc_kernel(pmd, addr);
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
if (!pte)
return -ENOMEM;
do {
@@ -75,6 +76,7 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
+ *mask |= PGTBL_PTE_MODIFIED;
return 0;
}

@@ -101,21 +103,24 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
}

static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ pgtbl_mod_mask *mask)
{
pmd_t *pmd;
unsigned long next;

- pmd = pmd_alloc(&init_mm, pud, addr);
+ pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
if (!pmd)
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);

- if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot))
+ if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+ *mask |= PGTBL_PMD_MODIFIED;
continue;
+ }

- if (ioremap_pte_range(pmd, addr, next, phys_addr, prot))
+ if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
@@ -144,21 +149,24 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
}

static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
- unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+ pgtbl_mod_mask *mask)
{
pud_t *pud;
unsigned long next;

- pud = pud_alloc(&init_mm, p4d, addr);
+ pud = pud_alloc_track(&init_mm, p4d, addr, mask);
if (!pud)
return -ENOMEM;
do {
next = pud_addr_end(addr, end);

- if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot))
+ if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+ *mask |= PGTBL_PUD_MODIFIED;
continue;
+ }

- if (ioremap_pmd_range(pud, addr, next, phys_addr, prot))
+ if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
@@ -187,21 +195,24 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
 }
 
 static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+		pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
 
-	p4d = p4d_alloc(&init_mm, pgd, addr);
+	p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
 	if (!p4d)
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
 
-		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot))
+		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+			*mask |= PGTBL_P4D_MODIFIED;
 			continue;
+		}
 
-		if (ioremap_pud_range(p4d, addr, next, phys_addr, prot))
+		if (ioremap_pud_range(p4d, addr, next, phys_addr, prot, mask))
 			return -ENOMEM;
 	} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
@@ -214,6 +225,7 @@ int ioremap_page_range(unsigned long addr,
 	unsigned long start;
 	unsigned long next;
 	int err;
+	pgtbl_mod_mask mask = 0;
 
 	might_sleep();
 	BUG_ON(addr >= end);
@@ -222,13 +234,17 @@ int ioremap_page_range(unsigned long addr,
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot);
+		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
+					&mask);
 		if (err)
 			break;
 	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
 	flush_cache_vmap(start, end);
 
+	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+		arch_sync_kernel_mappings(start, end);
+
 	return err;
 }

--
2.17.1

2020-05-08 14:45:54

by Joerg Roedel

[permalink] [raw]
Subject: [RFC PATCH 5/7] x86/mm/32: Implement arch_sync_kernel_mappings()

From: Joerg Roedel <[email protected]>

Implement the function to sync changes in vmalloc and ioremap ranges
to all page-tables.

Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/include/asm/pgtable-2level_types.h | 2 ++
arch/x86/include/asm/pgtable-3level_types.h | 2 ++
arch/x86/mm/fault.c | 25 +++++++++++++--------
3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index 6deb6cd236e3..7f6ccff0ba72 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -20,6 +20,8 @@ typedef union {

 #define SHARED_KERNEL_PMD	0
 
+#define ARCH_PAGE_TABLE_SYNC_MASK	PGTBL_PMD_MODIFIED
+
 /*
  * traditional i386 two-level paging structure:
  */
diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index 33845d36897c..80fbb4a9ed87 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -27,6 +27,8 @@ typedef union {
 #define SHARED_KERNEL_PMD	(!static_cpu_has(X86_FEATURE_PTI))
 #endif
 
+#define ARCH_PAGE_TABLE_SYNC_MASK	(SHARED_KERNEL_PMD ? 0 : PGTBL_PMD_MODIFIED)
+
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a51df516b87b..edeb2adaf31f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -190,16 +190,13 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 	return pmd_k;
 }
 
-static void vmalloc_sync(void)
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
 {
-	unsigned long address;
-
-	if (SHARED_KERNEL_PMD)
-		return;
+	unsigned long addr;
 
-	for (address = VMALLOC_START & PMD_MASK;
-	     address >= TASK_SIZE_MAX && address < VMALLOC_END;
-	     address += PMD_SIZE) {
+	for (addr = start & PMD_MASK;
+	     addr >= TASK_SIZE_MAX && addr < VMALLOC_END;
+	     addr += PMD_SIZE) {
 		struct page *page;
 
 		spin_lock(&pgd_lock);
@@ -210,13 +207,23 @@ static void vmalloc_sync(void)
 			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
 
 			spin_lock(pgt_lock);
-			vmalloc_sync_one(page_address(page), address);
+			vmalloc_sync_one(page_address(page), addr);
 			spin_unlock(pgt_lock);
 		}
 		spin_unlock(&pgd_lock);
 	}
 }

+static void vmalloc_sync(void)
+{
+	if (SHARED_KERNEL_PMD)
+		return;
+
+	arch_sync_kernel_mappings(VMALLOC_START, VMALLOC_END);
+}
+
 void vmalloc_sync_mappings(void)
 {
 	vmalloc_sync();
--
2.17.1

2020-05-08 19:02:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 6/7] mm: Remove vmalloc_sync_(un)mappings()

On Fri, 8 May 2020 16:40:42 +0200
Joerg Roedel <[email protected]> wrote:

> From: Joerg Roedel <[email protected]>
>
> These functions are not needed anymore because the vmalloc and ioremap
> mappings are now synchronized when they are created or torn down.
>
> Remove all callers and function definitions.
>
> Signed-off-by: Joerg Roedel <[email protected]>
>

You'll need to fold this into this patch, as my patch has already hit
Linus's tree.

But I applied your whole series and I'm not able to reproduce the bug.

Tested-by: Steven Rostedt (VMware) <[email protected]>

-- Steve

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 29615f15a820..1424a89193c6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8526,19 +8526,6 @@ static int allocate_trace_buffers(struct trace_array *tr, int size)
 	 */
 	allocate_snapshot = false;
 #endif
-
-	/*
-	 * Because of some magic with the way alloc_percpu() works on
-	 * x86_64, we need to synchronize the pgd of all the tables,
-	 * otherwise the trace events that happen in x86_64 page fault
-	 * handlers can't cope with accessing the chance that a
-	 * alloc_percpu()'d memory might be touched in the page fault trace
-	 * event. Oh, and we need to audit all other alloc_percpu() and vmalloc()
-	 * calls in tracing, because something might get triggered within a
-	 * page fault trace event!
-	 */
-	vmalloc_sync_mappings();
-
 	return 0;
 }

2020-05-08 19:15:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] mm/vmalloc: Track which page-table levels were modified

On Fri, May 08, 2020 at 04:40:38PM +0200, Joerg Roedel wrote:

> +/*
> + * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
> + * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
> + * needs to be called.
> + */
> +#ifndef ARCH_PAGE_TABLE_SYNC_MASK
> +#define ARCH_PAGE_TABLE_SYNC_MASK 0
> +#endif
> +
> +void arch_sync_kernel_mappings(unsigned long start, unsigned long end);


> + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> + arch_sync_kernel_mappings(start, end);

So you're relying on the compiler DCE'ing the call in the 'normal' case.

Works I suppose, but I went over the patches twice to look for a default
implementation of it before I figured that out ...

2020-05-08 19:22:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Fri, May 08, 2020 at 04:40:36PM +0200, Joerg Roedel wrote:

> Joerg Roedel (7):
> mm: Add functions to track page directory modifications
> mm/vmalloc: Track which page-table levels were modified
> mm/ioremap: Track which page-table levels were modified
> x86/mm/64: Implement arch_sync_kernel_mappings()
> x86/mm/32: Implement arch_sync_kernel_mappings()
> mm: Remove vmalloc_sync_(un)mappings()
> x86/mm: Remove vmalloc faulting
>
> arch/x86/include/asm/pgtable-2level_types.h | 2 +
> arch/x86/include/asm/pgtable-3level_types.h | 2 +
> arch/x86/include/asm/pgtable_64_types.h | 2 +
> arch/x86/include/asm/switch_to.h | 23 ---
> arch/x86/kernel/setup_percpu.c | 6 +-
> arch/x86/mm/fault.c | 176 +-------------------
> arch/x86/mm/init_64.c | 5 +
> arch/x86/mm/pti.c | 8 +-
> drivers/acpi/apei/ghes.c | 6 -
> include/asm-generic/5level-fixup.h | 5 +-
> include/asm-generic/pgtable.h | 23 +++
> include/linux/mm.h | 46 +++++
> include/linux/vmalloc.h | 13 +-
> kernel/notifier.c | 1 -
> lib/ioremap.c | 46 +++--
> mm/nommu.c | 12 --
> mm/vmalloc.c | 109 +++++++-----
> 17 files changed, 199 insertions(+), 286 deletions(-)

The only concern I have is the pgd_lock lock hold times.

By not doing on-demand faults anymore, and consistently calling
sync_global_*(), we iterate that pgd_list thing much more often than
before if I'm not mistaken.


2020-05-08 21:26:47

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 6/7] mm: Remove vmalloc_sync_(un)mappings()

On Fri, May 08, 2020 at 02:58:22PM -0400, Steven Rostedt wrote:
> You'll need to fold this into this patch, as my patch has already hit
> Linus's tree.

Yeah, it's already on my todo-list for when I rebase this.

> But I applied your whole series and I'm not able to reproduce the bug.
>
> Tested-by: Steven Rostedt (VMware) <[email protected]>

Thanks a lot for testing!


Joerg

2020-05-08 21:27:04

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] mm/vmalloc: Track which page-table levels were modified

On Fri, May 08, 2020 at 09:10:48PM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 04:40:38PM +0200, Joerg Roedel wrote:
> > + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> > + arch_sync_kernel_mappings(start, end);
>
> So you're relying on the compiler DCE'ing the call in the 'normal' case.
>
> Works I suppose, but I went over the patches twice to look for a default
> implementation of it before I figured that out ...

Yes, works on all architectures I compile-tested this on, with gcc-9.3.
I should probably add a comment about that.


Joerg

2020-05-08 21:35:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <[email protected]> wrote:
>
> Hi,
>
> after the recent issue with vmalloc and tracing code[1] on x86 and a
> long history of previous issues related to the vmalloc_sync_mappings()
> interface, I thought the time has come to remove it. Please
> see [2], [3], and [4] for some other issues in the past.
>
> The patches are based on v5.7-rc4 and add tracking of page-table
> directory changes to the vmalloc and ioremap code. Depending on which
> page-table levels changes have been made, a new per-arch function is
> called: arch_sync_kernel_mappings().
>
> On x86-64 with 4-level paging, this function will not be called more
> than 64 times in a system's runtime (because vmalloc-space takes 64 PGD
> entries which are only populated, but never cleared).

What's the maximum on other system types? It might make more sense to
take the memory hit and pre-populate all the tables at boot so we
never have to sync them.

2020-05-08 21:37:55

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

Hi Peter,

thanks for reviewing this!

On Fri, May 08, 2020 at 09:20:00PM +0200, Peter Zijlstra wrote:
> The only concern I have is the pgd_lock lock hold times.
>
> By not doing on-demand faults anymore, and consistently calling
> sync_global_*(), we iterate that pgd_list thing much more often than
> before if I'm not mistaken.

Should not be a problem; from what I have seen, this function is not
called often on x86-64. The vmalloc area needs to be synchronized at
the top level there, which by now is the P4D (with 4-level paging). And
the vmalloc area takes 64 entries; once all of them are populated, the
function will not be called again.

On 32-bit it might be called more often, because synchronization happens
on the PMD level, which is also used for large-page mapped ioremap
regions. But these don't happen very often, and there are also no VMAP
stacks on x86-32 which could cause this function to be called
frequently.


Joerg

2020-05-08 21:40:49

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <[email protected]> wrote:

> What's the maximum on other system types? It might make more sense to
> take the memory hit and pre-populate all the tables at boot so we
> never have to sync them.

I need to look it up for 5-level paging; with 4-level paging it's 64
pages to pre-populate the vmalloc area.

But that would not solve the problem on x86-32, which needs to
synchronize unmappings on the PMD level.


Joerg

2020-05-08 23:53:19

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Fri, May 8, 2020 at 2:36 PM Joerg Roedel <[email protected]> wrote:
>
> On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> > On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <[email protected]> wrote:
>
> > What's the maximum on other system types? It might make more sense to
> > take the memory hit and pre-populate all the tables at boot so we
> > never have to sync them.
>
> Need to look it up for 5-level paging, with 4-level paging its 64 pages
> to pre-populate the vmalloc area.
>
> But that would not solve the problem on x86-32, which needs to
> synchronize unmappings on the PMD level.

What changes in this series with x86-32? We already do that
synchronization, right? IOW, in the cases where the vmalloc *fault*
code does anything at all, we should have a small bound for how much
memory to preallocate and, if we preallocate it, then there is nothing
to sync and nothing to fault. And we have the benefit that we never
need to sync anything on 64-bit, which is kind of nice.

Do we actually need PMD-level things for 32-bit? What if we just
outlawed huge pages in the vmalloc space on 32-bit non-PAE?

Or maybe the net result isn't much of a cleanup after all given the
need to support 32-bit.

>
>
> Joerg

2020-05-09 09:29:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Fri, May 08, 2020 at 11:34:07PM +0200, Joerg Roedel wrote:
> Hi Peter,
>
> thanks for reviewing this!
>
> On Fri, May 08, 2020 at 09:20:00PM +0200, Peter Zijlstra wrote:
> > The only concern I have is the pgd_lock lock hold times.
> >
> > By not doing on-demand faults anymore, and consistently calling
> > sync_global_*(), we iterate that pgd_list thing much more often than
> > before if I'm not mistaken.
>
> Should not be a problem, from what I have seen this function is not
> called often on x86-64. The vmalloc area needs to be synchronized at
> the top-level there, which is by now P4D (with 4-level paging). And the
> vmalloc area takes 64 entries, when all of them are populated the
> function will not be called again.

Right; it's just that the moment you do trigger it, it'll iterate that
pgd_list and that is potentially huge. Then again, that's not a new
problem.

I suppose we can deal with it if/when it becomes an actual problem.

I had a quick look and I think it might be possible to make it an RCU
managed list. We'd have to remove the pgd_list entry in
put_task_struct_rcu_user(). Then we can play games in sync_global_*()
and use RCU iteration. New tasks (which can be missed in the RCU
iteration) will already have a full clone of the PGD anyway.

But like I said, something for later I suppose.

2020-05-09 17:54:00

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Fri, May 08, 2020 at 04:49:17PM -0700, Andy Lutomirski wrote:
> On Fri, May 8, 2020 at 2:36 PM Joerg Roedel <[email protected]> wrote:
> >
> > On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> > > On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <[email protected]> wrote:
> >
> > > What's the maximum on other system types? It might make more sense to
> > > take the memory hit and pre-populate all the tables at boot so we
> > > never have to sync them.
> >
> > Need to look it up for 5-level paging, with 4-level paging its 64 pages
> > to pre-populate the vmalloc area.
> >
> > But that would not solve the problem on x86-32, which needs to
> > synchronize unmappings on the PMD level.
>
> What changes in this series with x86-32?

This series sets ARCH_PAGE_TABLE_SYNC_MASK to PGTBL_PMD_MODIFIED, so
that the synchronization happens every time PMD(s) in the vmalloc areas
are changed. Before this series this synchronization only happened at
arbitrary places calling vmalloc_sync_(un)mappings().

> We already do that synchronization, right? IOW, in the cases where
> the vmalloc *fault* code does anything at all, we should have a small
> bound for how much memory to preallocate and, if we preallocate it,
> then there is nothing to sync and nothing to fault. And we have the
> benefit that we never need to sync anything on 64-bit, which is kind
> of nice.

I don't really get you here: what is pre-allocated, and why is there no
need to sync and fault then?

> Do we actually need PMD-level things for 32-bit? What if we just
> outlawed huge pages in the vmalloc space on 32-bit non-PAE?

Disallowing huge-pages would at least remove the need to sync
unmappings, but we still need to sync new PMD entries. Remember that the
size of the vmalloc area on 32 bit is dynamic and depends on the VM-split
and the actual amount of RAM on the system.

A machine with 512MB of RAM and a 1G/3G split will have around 2.5G of
VMALLOC address space. And if we want to avoid vmalloc-faults there, we
need to pre-allocate all PTE pages for that area (and the amount of PTE
pages needed increases when RAM decreases).

On a machine with 512M of RAM we would need ca. 1270+ PTE pages, which
is around 5M (or 1% of total system memory).

Regards,

Joerg

2020-05-09 17:56:53

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Sat, May 09, 2020 at 11:25:16AM +0200, Peter Zijlstra wrote:
> Right; it's just that the moment you do trigger it, it'll iterate that
> pgd_list and that is potentially huge. Then again, that's not a new
> problem.
>
> I suppose we can deal with it if/when it becomes an actual problem.
>
> I had a quick look and I think it might be possible to make it an RCU
> managed list. We'd have to remove the pgd_list entry in
> put_task_struct_rcu_user(). Then we can play games in sync_global_*()
> and use RCU iteration. New tasks (which can be missed in the RCU
> iteration) will already have a full clone of the PGD anyway.

Right, it should not be that difficult to rcu-protect the pgd-list, but
we can try that when it actually becomes a problem.


Joerg

2020-05-09 19:07:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Sat, May 9, 2020 at 10:52 AM Joerg Roedel <[email protected]> wrote:
>
> On Fri, May 08, 2020 at 04:49:17PM -0700, Andy Lutomirski wrote:
> > On Fri, May 8, 2020 at 2:36 PM Joerg Roedel <[email protected]> wrote:
> > >
> > > On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> > > > On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <[email protected]> wrote:
> > >
> > > > What's the maximum on other system types? It might make more sense to
> > > > take the memory hit and pre-populate all the tables at boot so we
> > > > never have to sync them.
> > >
> > > Need to look it up for 5-level paging, with 4-level paging its 64 pages
> > > to pre-populate the vmalloc area.
> > >
> > > But that would not solve the problem on x86-32, which needs to
> > > synchronize unmappings on the PMD level.
> >
> > What changes in this series with x86-32?
>
> This series sets ARCH_PAGE_TABLE_SYNC_MASK to PGTBL_PMD_MODIFIED, so
> that the synchronization happens every time PMD(s) in the vmalloc areas
> are changed. Before this series this synchronization only happened at
> arbitrary places calling vmalloc_sync_(un)mappings().
>
> > We already do that synchronization, right? IOW, in the cases where
> > the vmalloc *fault* code does anything at all, we should have a small
> > bound for how much memory to preallocate and, if we preallocate it,
> > then there is nothing to sync and nothing to fault. And we have the
> > benefit that we never need to sync anything on 64-bit, which is kind
> > of nice.
>
> Don't really get you here, what is pre-allocated and why is there no
> need to sync and fault then?
>
> > Do we actually need PMD-level things for 32-bit? What if we just
> > outlawed huge pages in the vmalloc space on 32-bit non-PAE?
>
> Disallowing huge-pages would at least remove the need to sync
> unmappings, but we still need to sync new PMD entries. Remember that the
> size of the vmalloc area on 32 bit is dynamic and depends on the VM-split
> and the actual amount of RAM on the system.
>
> A machine wit 512MB of RAM and a 1G/3G split will have around 2.5G of
> VMALLOC address space. And if we want to avoid vmalloc-faults there, we
> need to pre-allocate all PTE pages for that area (and the amount of PTE
> pages needed increases when RAM decreases).
>
> On a machine with 512M of RAM we would need ca. 1270+ PTE pages, which
> is around 5M (or 1% of total system memory).

I can never remember which P?D name goes with which level and which
machine type, but I don't think I agree with your math regardless. On
x86, there are two fundamental situations that can occur:

1. Non-PAE. There is a single 4k top-level page table per mm, and
this table contains either 512 or 1024 entries total. Of those
entries, some fraction (half or less) control the kernel address
space, and some fraction of *that* is for vmalloc space. Those
entries are the *only* thing that needs syncing -- all mms will either
have null (not present) in those slots or will have pointers to the
*same* next-level-down directories.

2. PAE. Depending on your perspective, there could be a grand total
of four top-level paging pointers, of which one (IIRC) is for the
kernel. That points to the same place for all mms. Or, if you look
at it the other way, PAE is just like #1 except that the top-level
table has only four entries and only one points to VMALLOC space.

So, unless I'm missing something here, there is an absolute maximum of
512 top-level entries that ever need to be synchronized.

Now, there's an additional complication. On x86_64, we have a rule:
those entries that need to be synced start out null and may, during
the lifetime of the system, change *once*. They are never unmapped or
modified after being allocated. This means that those entries can
only ever point to a page *table* and not to a ginormous page. So,
even if the hardware were to support ginormous pages (which, IIRC, it
doesn't), we would be limited to merely immense and not ginormous
pages in the vmalloc range. On x86_32, I don't think we have this
rule right now. And this means that it's possible for one of these
pages to be unmapped or modified.

So my suggestion is that just apply the x86_64 rule to x86_32 as well.
The practical effect will be that 2-level-paging systems will not be
able to use huge pages in the vmalloc range, since the rule will be
that the vmalloc-relevant entries in the top-level table must point to
page *tables* instead of huge pages.

On top of this, if we preallocate these entries, then the maximum
amount of memory we can possibly waste is 4k * (entries pointing to
vmalloc space - entries actually used for vmalloc space). I don't
know what this number typically is, but I don't think it's very large.
Preallocating means that vmalloc faults *and* synchronization go away
entirely. All of the page tables used for vmalloc will be entirely
shared by all mms, so all that's needed to modify vmalloc mappings is
to update init_mm and, if needed, flush TLBs. No other page tables
will need modification at all.

On x86_64, the only real advantage is that the handful of corner cases
that make vmalloc faults unpleasant (mostly relating to vmap stacks)
go away. On x86_32, a bunch of mind-bending stuff (everything your
series deletes but also almost everything your series *adds*) goes
away. There may be a genuine tiny performance hit on 2-level systems
due to the loss of huge pages in vmalloc space, but I'm not sure I
care or that we use them anyway on these systems. And PeterZ can stop
even thinking about RCU.

Am I making sense?


(Aside: I *hate* the PMD, etc terminology. Even the kernel's C types
can't keep track of whether pmd_t* points to an entire paging
directory or to a single entry. Similarly, everyone knows that a
pte_t is a "page table entry", except that pte_t* might instead be a
pointer to an array of 512 or 1024 page table entries.)

2020-05-09 21:59:21

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

Hi Andy,

On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> 1. Non-PAE. There is a single 4k top-level page table per mm, and
> this table contains either 512 or 1024 entries total. Of those
> entries, some fraction (half or less) control the kernel address
> space, and some fraction of *that* is for vmalloc space. Those
> entries are the *only* thing that needs syncing -- all mms will either
> have null (not present) in those slots or will have pointers to the
> *same* next-level-down directories.

Not entirely correct: on 32-bit non-PAE there are two levels with 1024
entries each. In Linux these map to the PMD and PTE levels. With 1024
entries, each PMD entry maps 4MB of address space.

How many of these 1024 top-level entries are used for the kernel is
configuration-dependent in Linux; it can be 768, 512, or 256.

> 2. PAE. Depending on your perspective, there could be a grand total
> of four top-level paging pointers, of which one (IIRC) is for the
> kernel.

That is again configuration-dependent; up to 3 of the top-level pointers
can be used for kernel space.

> That points to the same place for all mms. Or, if you look at it the
> other way, PAE is just like #1 except that the top-level table has
> only four entries and only one points to VMALLOC space.

There are more differences. With PAE an entry is 64 bits wide, which
means each PMD entry only maps 2MB of address space, so you need two PTE
pages to map the same amount of address space you could map with one
page without PAE paging.
And as noted above, all 3 of the top-level pointers can point to vmalloc
space (not exclusively).

> So, unless I'm missing something here, there is an absolute maximum of
> 512 top-level entries that ever need to be synchronized.

And here is where your assumption is wrong. On 32-bit PAE systems it is
not the top-level entries that need to be synchronized for vmalloc, but
the second-level entries. Depending on the kernel configuration, there
are (in total, not only for vmalloc) 1536, 1024, or 512 of these
second-level entries. How many of them are actually used for vmalloc
depends on the size of the system RAM (but it is at least 64), because
the vmalloc area begins after the kernel direct-mapping (with an 8MB
unmapped hole).

With 512MB of RAM you need 256 entries for the kernel direct-mapping (I
ignore the ISA hole here). After that there are 4 unmapped entries and
the rest is vmalloc, minus some fixmap and LDT mappings at the end of
the address space. I haven't looked up the exact number, but 4 entries
(8MB) for this should be close to the real value.

1536 total PMD entries
- 256 for direct-mapping
- 4 for ISA hole
- 4 for stuff at the end of the address space

= 1272 PMD entries for VMALLOC space which need synchronization.

With less RAM it is even more, and to get rid of faulting and
synchronization you need to pre-allocate a 4k PTE page for each of these
PMD entries.

> Now, there's an additional complication. On x86_64, we have a rule:
> those entries that need to be synced start out null and may, during
> the lifetime of the system, change *once*. They are never unmapped or
> modified after being allocated. This means that those entries can
> only ever point to a page *table* and not to a ginormous page. So,
> even if the hardware were to support ginormous pages (which, IIRC, it
> doesn't), we would be limited to merely immense and not ginormous
> pages in the vmalloc range. On x86_32, I don't think we have this
> rule right now. And this means that it's possible for one of these
> pages to be unmapped or modified.

The reason for x86-32 being different is that the address space is
orders of magnitude smaller than on x86-64. We just have 4 top-level
entries with PAE paging and can't afford to partition kernel address
space on that level like we do on x86-64. That is the reason the address
space is partitioned on the second (PMD) level, which is also the reason
vmalloc synchronization needs to happen on that level. And because
that's not enough yet, it's also the page-table level used to map huge
pages.

> So my suggestion is that just apply the x86_64 rule to x86_32 as well.
> The practical effect will be that 2-level-paging systems will not be
> able to use huge pages in the vmalloc range, since the rule will be
> that the vmalloc-relevant entries in the top-level table must point to
> page *tables* instead of huge pages.

I could very well live with prohibiting huge-page ioremap mappings for
x86-32. But as I wrote before, this doesn't solve the problems I am
trying to address with this patch-set, or would only address them if a
significant amount of total system memory is used.

The pre-allocation solution would work for x86-64; it would only need
256KB of preallocated memory for the vmalloc range to never synchronize
or fault again. I thought about that and did the math before writing
this patch-set, but doing the math for 32-bit drove me away from it, for
the reasons written above.

And since a lot of the vmalloc_sync_(un)mappings problems I debugged
were actually related to 32-bit, I want a solution that works for 32 and
64-bit x86 (at least until support for x86-32 is removed). And I think
this patch-set provides a solution that works well for both.


Joerg

2020-05-10 01:13:58

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Sat, May 09, 2020 at 11:25:16AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 11:34:07PM +0200, Joerg Roedel wrote:
> > On Fri, May 08, 2020 at 09:20:00PM +0200, Peter Zijlstra wrote:
> > > The only concern I have is the pgd_lock lock hold times.
> > >
> > > By not doing on-demand faults anymore, and consistently calling
> > > sync_global_*(), we iterate that pgd_list thing much more often than
> > > before if I'm not mistaken.
> >
> > Should not be a problem, from what I have seen this function is not
> > called often on x86-64. The vmalloc area needs to be synchronized at
> > the top-level there, which is by now P4D (with 4-level paging). And the
> > vmalloc area takes 64 entries, when all of them are populated the
> > function will not be called again.
>
> Right; it's just that the moment you do trigger it, it'll iterate that
> pgd_list and that is potentially huge. Then again, that's not a new
> problem.
>
> I suppose we can deal with it if/when it becomes an actual problem.
>
> I had a quick look and I think it might be possible to make it an RCU
> managed list. We'd have to remove the pgd_list entry in
> put_task_struct_rcu_user(). Then we can play games in sync_global_*()
> and use RCU iteration. New tasks (which can be missed in the RCU
> iteration) will already have a full clone of the PGD anyway.

One of the things on my long-term todo list is to replace mm_struct.mmlist
with an XArray containing all mm_structs. Then we can use one mark
to indicate maybe-swapped and another mark to indicate ... whatever it
is pgd_list indicates. Iterating an XArray (whether the entire thing
or with marks) is RCU-safe and faster than iterating a linked list,
so this should solve the problem?

2020-05-10 05:10:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Sat, May 9, 2020 at 2:57 PM Joerg Roedel <[email protected]> wrote:
>
> Hi Andy,
>
> On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:

> > So, unless I'm missing something here, there is an absolute maximum of
> > 512 top-level entries that ever need to be synchronized.
>
> And here is where your assumption is wrong. On 32-bit PAE systems it is
> not the top-level entries that need to be synchronized for vmalloc, but
> the second-level entries. And dependent on the kernel configuration,
> there are (in total, not only vmalloc) 1536, 1024, or 512 of these
> second-level entries. How much of them are actually used for vmalloc
> depends on the size of the system RAM (but is at least 64), because
> the vmalloc area begins after the kernel direct-mapping (with an 8MB
> unmapped hole).

I spent some time looking at the code, and I'm guessing you're talking
about the 3-level !SHARED_KERNEL_PMD case. I can't quite figure out
what's going on.

Can you explain what is actually going on that causes different
mms/pgds to have top-level entries in the kernel range that point to
different tables? Because I'm not seeing why this makes any sense.

>
> > Now, there's an additional complication. On x86_64, we have a rule:
> > those entries that need to be synced start out null and may, during
> > the lifetime of the system, change *once*. They are never unmapped or
> > modified after being allocated. This means that those entries can
> > only ever point to a page *table* and not to a ginormous page. So,
> > even if the hardware were to support ginormous pages (which, IIRC, it
> > doesn't), we would be limited to merely immense and not ginormous
> > pages in the vmalloc range. On x86_32, I don't think we have this
> > rule right now. And this means that it's possible for one of these
> > pages to be unmapped or modified.
>
> The reason for x86-32 being different is that the address space is
> orders of magnitude smaller than on x86-64. We just have 4 top-level
> entries with PAE paging and can't afford to partition kernel address
> space on that level like we do on x86-64. That is the reason the address
> space is partitioned on the second (PMD) level, which is also the reason
> vmalloc synchronization needs to happen on that level. And because
> that's not enough yet, it's also the page-table level used to map huge pages.

Why does it need to be partitioned at all? The only thing that comes
to mind is that the LDT range is per-mm. So I can imagine that the
PAE case with a 3G user / 1G kernel split has to have the vmalloc
range and the LDT range in the same top-level entry. Yuck.

>
> > So my suggestion is that just apply the x86_64 rule to x86_32 as well.
> > The practical effect will be that 2-level-paging systems will not be
> > able to use huge pages in the vmalloc range, since the rule will be
> > that the vmalloc-relevant entries in the top-level table must point to
> > page *tables* instead of huge pages.
>
> I could very well live with prohibiting huge-page ioremap mappings for
> x86-32. But as I wrote before, this doesn't solve the problems I am
> trying to address with this patch-set, or would only address them if a
> significant amount of total system memory is used.
>
> The pre-allocation solution would work for x86-64; it would only need
> 256 KiB of preallocated memory for the vmalloc range to never synchronize
> or fault again. I had thought about that and done the math before
> writing this patch-set, but doing the math for 32 bit drove me away from
> it for the reasons written above.
>

If it's *just* the LDT that's a problem, we could plausibly shrink the
user address range a little bit and put the LDT in the user portion.
I suppose this could end up creating its own set of problems involving
tracking which code owns which page tables.

> And since a lot of the vmalloc_sync_(un)mappings problems I debugged
> were actually related to 32-bit, I want a solution that works for 32 and
> 64-bit x86 (at least until support for x86-32 is removed). And I think
> this patch-set provides a solution that works well for both.

I'm not fundamentally objecting to your patch set, but I do want to
understand what's going on that needs this stuff.

>
>
> Joerg

2020-05-10 08:19:42

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Sat, May 09, 2020 at 10:05:43PM -0700, Andy Lutomirski wrote:
> On Sat, May 9, 2020 at 2:57 PM Joerg Roedel <[email protected]> wrote:
> I spent some time looking at the code, and I'm guessing you're talking
> about the 3-level !SHARED_KERNEL_PMD case. I can't quite figure out
> what's going on.
>
> Can you explain what is actually going on that causes different
> mms/pgds to have top-level entries in the kernel range that point to
> different tables? Because I'm not seeing why this makes any sense.

There are three cases where the PMDs are not shared on x86-32:

1) With non-PAE the top-level is already the PMD level, because
the page-table only has two levels. Since the top-level can't
be shared, the PMDs are also not shared.

	2) For some reason Xen-PV also does not share kernel PMDs on PAE
	   systems, but I haven't looked into why.

3) On 32-bit PAE with PTI enabled the kernel address space
contains the LDT mapping, which is also different per-PGD.
There is one PMD entry reserved for the LDT, giving it 2MB of
address space. I implemented it this way to keep the 32-bit
implementation of PTI mostly similar to the 64-bit one.

> Why does it need to be partitioned at all? The only thing that comes
> to mind is that the LDT range is per-mm. So I can imagine that the
> PAE case with a 3G user / 1G kernel split has to have the vmalloc
> range and the LDT range in the same top-level entry. Yuck.

PAE with 3G user / 1G kernel has _all_ of the kernel mappings in one
top-level entry (direct-mapping, vmalloc, ldt, fixmap).

> If it's *just* the LDT that's a problem, we could plausibly shrink the
> user address range a little bit and put the LDT in the user portion.
> I suppose this could end up creating its own set of problems involving
> tracking which code owns which page tables.

Yeah, for the PTI case it is only the LDT that causes the unshared
kernel PMDs, but even if we move the LDT somewhere else we still have
two-level paging and the xen-pv case.


Joerg

2020-05-11 07:36:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> Iterating an XArray (whether the entire thing
> or with marks) is RCU-safe and faster than iterating a linked list,
> so this should solve the problem?

It can hardly be faster if you want all elements -- which is I think the
case here. We only call into this if we change an entry, and then we
need to propagate that change to all.

2020-05-11 07:45:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:

> On x86_64, the only real advantage is that the handful of corner cases
> that make vmalloc faults unpleasant (mostly relating to vmap stacks)
> go away. On x86_32, a bunch of mind-bending stuff (everything your
> series deletes but also almost everything your series *adds*) goes
> away. There may be a genuine tiny performance hit on 2-level systems
> due to the loss of huge pages in vmalloc space, but I'm not sure I
> care or that we use them anyway on these systems. And PeterZ can stop
> even thinking about RCU.
>
> Am I making sense?

I think it'll work for x86_64 and that is really all I care about :-)

2020-05-11 15:39:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Mon, May 11, 2020 at 12:42 AM Peter Zijlstra <[email protected]> wrote:
>
> On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
>
> > On x86_64, the only real advantage is that the handful of corner cases
> > that make vmalloc faults unpleasant (mostly relating to vmap stacks)
> > go away. On x86_32, a bunch of mind-bending stuff (everything your
> > series deletes but also almost everything your series *adds*) goes
> > away. There may be a genuine tiny performance hit on 2-level systems
> > due to the loss of huge pages in vmalloc space, but I'm not sure I
> > care or that we use them anyway on these systems. And PeterZ can stop
> > even thinking about RCU.
> >
> > Am I making sense?
>
> I think it'll work for x86_64 and that is really all I care about :-)

Sadly, I think that Joerg has convinced me that this doesn't really
work for 32-bit unless we rework the LDT code or drop support for
something that we might not want to drop support for. So, last try --
maybe we can start defeaturing 32-bit:

What if we make 32-bit PTI depend on PAE? And drop 32-bit Xen PV
support? And make 32-bit huge pages depend on PAE? Then 32-bit
non-PAE can use the direct-mapped LDT, 32-bit PTI (and optionally PAE
non-PTI) can use the evil virtually mapped LDT. And 32-bit non-PAE
(the 2-level case) will only have pointers to page tables at the top
level. And then we can preallocate.

Or maybe we don't want to defeature this much, or maybe the memory hit
from this preallocation will hurt little 2-level 32-bit systems too
much.

(Xen 32-bit PV support seems to be on its way out upstream.)

2020-05-11 15:56:47

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > Iterating an XArray (whether the entire thing
> > or with marks) is RCU-safe and faster than iterating a linked list,
> > so this should solve the problem?
>
> It can hardly be faster if you want all elements -- which is I think the
> case here. We only call into this if we change an entry, and then we
> need to propagate that change to all.

Of course it can be faster. Iterating an array is faster than iterating
a linked list because caches. While an XArray is a segmented array
(so slower than a plain array), it's plainly going to be faster than
iterating a linked list.

2020-05-11 16:10:45

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Mon, May 11, 2020 at 08:52:04AM -0700, Matthew Wilcox wrote:
> On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> > On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > > Iterating an XArray (whether the entire thing
> > > or with marks) is RCU-safe and faster than iterating a linked list,
> > > so this should solve the problem?
> >
> > It can hardly be faster if you want all elements -- which is I think the
> > case here. We only call into this if we change an entry, and then we
> > need to propagate that change to all.
>
> Of course it can be faster. Iterating an array is faster than iterating
> a linked list because caches. While an XArray is a segmented array
> (so slower than a plain array), it's plainly going to be faster than
> iterating a linked list.

Quantifying this:

$ ./array-vs-list
walked sequential array in 0.002039s
walked sequential list in 0.002807s
walked sequential array in 0.002017s
walked shuffled list in 0.102367s
walked shuffled array in 0.012114s

Attached is the source code; above results on a Kaby Lake with
CFLAGS="-O2 -W -Wall -g".


Attachments:
(No filename) (1.12 kB)
array-vs-list.c (4.01 kB)
Download all attachments

2020-05-11 17:04:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Mon, May 11, 2020 at 08:52:04AM -0700, Matthew Wilcox wrote:
> On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> > On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > > Iterating an XArray (whether the entire thing
> > > or with marks) is RCU-safe and faster than iterating a linked list,
> > > so this should solve the problem?
> >
> > It can hardly be faster if you want all elements -- which is I think the
> > case here. We only call into this if we change an entry, and then we
> > need to propagate that change to all.
>
> Of course it can be faster. Iterating an array is faster than iterating
> a linked list because caches. While an XArray is a segmented array
> (so slower than a plain array), it's plainly going to be faster than
> iterating a linked list.

Fair enough, mostly also because the actual work (setting a single PTE)
doesn't dominate in this case.

2020-05-11 17:09:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
> On Mon, May 11, 2020 at 12:42 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> >
> > > On x86_64, the only real advantage is that the handful of corner cases
> > > that make vmalloc faults unpleasant (mostly relating to vmap stacks)
> > > go away. On x86_32, a bunch of mind-bending stuff (everything your
> > > series deletes but also almost everything your series *adds*) goes
> > > away. There may be a genuine tiny performance hit on 2-level systems
> > > due to the loss of huge pages in vmalloc space, but I'm not sure I
> > > care or that we use them anyway on these systems. And PeterZ can stop
> > > even thinking about RCU.
> > >
> > > Am I making sense?
> >
> > I think it'll work for x86_64 and that is really all I care about :-)
>
> Sadly, I think that Joerg has convinced me that this doesn't really
> work for 32-bit unless we rework the LDT code or drop support for
> something that we might not want to drop support for.

I was thinking keep these patches for 32bit and fix 64bit 'proper'. But
sure, if we can get rid of it all by stripping 32bit features I'm not
going to object either.

2020-05-11 19:17:04

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
> What if we make 32-bit PTI depend on PAE?

It already does, PTI support for legacy paging had to be removed because
there were memory corruption problems with THP. The reason was that huge
PTEs in the user-space area were mapped in two page-tables (kernel and
user), but A/D bits were only fetched from the kernel part. To not make
things more complicated we agreed on just not supporting PTI without
PAE.

> And drop 32-bit Xen PV support? And make 32-bit huge pages depend on
> PAE? Then 32-bit non-PAE can use the direct-mapped LDT, 32-bit PTI
> (and optionally PAE non-PTI) can use the evil virtually mapped LDT.
> And 32-bit non-PAE (the 2-level case) will only have pointers to page
> tables at the top level. And then we can preallocate.

Not sure I can follow you here. How can 32-bit PTI with PAE use the LDT
from the direct mapping? I am guessing you want to get rid of the
SHARED_KERNEL_PMD==0 case for PAE kernels. This would indeed make
syncing unnecessary on PAE, but pre-allocation would still be needed
for 2-level paging. Just the amount of memory needed for the
pre-allocated PTE pages is half as big as it would be with PAE.

> Or maybe we don't want to defeature this much, or maybe the memory hit
> from this preallocation will hurt little 2-level 32-bit systems too
> much.

It will certainly make Linux less likely to boot on low-memory x86-32
systems, whoever will be affected by this.


Joerg

2020-05-11 19:40:40

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()


> On May 11, 2020, at 12:14 PM, Joerg Roedel <[email protected]> wrote:
>
> On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
>> What if we make 32-bit PTI depend on PAE?
>
> It already does, PTI support for legacy paging had to be removed because
> there were memory corruption problems with THP. The reason was that huge
> PTEs in the user-space area were mapped in two page-tables (kernel and
> user), but A/D bits were only fetched from the kernel part. To not make
> things more complicated we agreed on just not supporting PTI without
> PAE.
>
>> And drop 32-bit Xen PV support? And make 32-bit huge pages depend on
>> PAE? Then 32-bit non-PAE can use the direct-mapped LDT, 32-bit PTI
>> (and optionally PAE non-PTI) can use the evil virtually mapped LDT.
>> And 32-bit non-PAE (the 2-level case) will only have pointers to page
>> tables at the top level. And then we can preallocate.
>
> Not sure I can follow you here. How can 32-bit PTI with PAE use the LDT
> from the direct mapping? I am guessing you want to get rid of the
> SHARED_KERNEL_PMD==0 case for PAE kernels.

I wrote nonsense. I mean bite off a piece of the *user* portion of the address space and stick the LDT there.

I contemplated doing this when I wrote the 64-bit code, but I decided we had so much address space to throw around that I liked my solution better.

> This would indeed make
> syncing unnecessary on PAE, but pre-allocation would still be needed
> for 2-level paging. Just the amount of memory needed for the
> pre-allocated PTE pages is half as big as it would be with PAE.
>
>> Or maybe we don't want to defeature this much, or maybe the memory hit
>> from this preallocation will hurt little 2-level 32-bit systems too
>> much.
>
> It will certainly make Linux less likely to boot on low-memory x86-32
> systems, whoever will be affected by this.
>
>

I’m guessing the right solution is either your series or your series plus preallocation on 64-bit. I’m just grumpy about it...

2020-05-11 20:52:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Mon, 11 May 2020 21:14:14 +0200
Joerg Roedel <[email protected]> wrote:

> > Or maybe we don't want to defeature this much, or maybe the memory hit
> > from this preallocation will hurt little 2-level 32-bit systems too
> > much.
>
> It will certainly make Linux less likely to boot on low-memory x86-32
> systems, whoever will be affected by this.

That may be an issue, as I'm guessing the biggest users of x86-32 are
probably small systems that use Intel's attempts at the embedded space.

-- Steve

2020-05-12 15:05:15

by Jörg Rödel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Mon, May 11, 2020 at 12:36:19PM -0700, Andy Lutomirski wrote:
> I’m guessing the right solution is either your series or your series
> plus preallocation on 64-bit. I’m just grumpy about it...

Okay, so we can do the pre-allocation when it turns out the pgd_list
lock-times become a problem on x86-64. The tracking code in vmalloc.c is
needed anyway for 32-bit, and there is no reason why 64-bit shouldn't use
it as well for now.

I don't think that taking the lock _will_ be a problem, as it is only
taken when a new PGD/P4D entry is populated. And it is pretty unlikely
that a system will populate all 64 of them; with 4-level paging, each of
these entries maps 512GB of address space. But if I am wrong here,
pre-allocating is still an option.


Joerg

2020-05-12 15:17:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()

On Tue, 12 May 2020 17:02:50 +0200
Joerg Roedel <[email protected]> wrote:

> On Mon, May 11, 2020 at 12:36:19PM -0700, Andy Lutomirski wrote:
> > I’m guessing the right solution is either your series or your series
> > plus preallocation on 64-bit. I’m just grumpy about it...
>
> Okay, so we can do the pre-allocation when it turns out the pgd_list
> lock-times become a problem on x86-64. The tracking code in vmalloc.c is
> needed anyway for 32-bit, and there is no reason why 64-bit shouldn't use
> it as well for now.
>
> I don't think that taking the lock _will_ be a problem, as it is only
> taken when a new PGD/P4D entry is populated. And it is pretty unlikely
> that a system will populate all 64 of them; with 4-level paging, each of
> these entries maps 512GB of address space. But if I am wrong here,
> pre-allocating is still an option.
>

256TB of RAM isn't too far in the future ;-)

-- Steve