2021-01-27 19:43:41

by Nicholas Piggin

Subject: [PATCH v11 00/13] huge vmalloc mappings

I think I ended up implementing all of Christoph's comments because
they turned out better in the end. Cleanups are coming in another
series, though.

Thanks,
Nick

Since v10:
- Fixed code style, mostly >80 column lines, tweaked patch titles, etc. [thanks Christoph]
- Made huge vmalloc code and data structure compile away if unselected
[Christoph]
- Archs only have to provide arch_vmap_p?d_supported for levels they
implement [Christoph]

Since v9:
- Fixed intermediate build breakage on x86-32 !PAE [thanks Ding]
- Fixed small page fallback case vm_struct double-free [thanks Ding]

Since v8:
- Fixed nommu compile.
- Added Kconfig option help text
- Added VM_NO_HUGE_VMAP, which should help archs implement it [suggested by Rick]

Since v7:
- Rebase, added some acks, compile fix
- Removed "order=" from vmallocinfo; it's a bit confusing (nr_pages
is in units of small pages, for compatibility).
- Added arch_vmap_pmd_supported() test before starting to allocate
the large page, rather than only testing it when doing the map, to
avoid unsupported configs trying to allocate huge pages for no
reason.

Since v6:
- Fixed a false positive warning introduced in patch 2, found by
kbuild test robot.

Since v5:
- Split arch changes out better and make the constant folding work
- Avoid most of the 80 column wrap, fix a reference to lib/ioremap.c
- Fix compile error on some archs

Since v4:
- Fixed an off-by-page-order bug in v4
- Several minor cleanups.
- Added page order to /proc/vmallocinfo
- Added hugepage to alloc_large_system_hash output.
- Made an architecture config option, powerpc only for now.

Since v3:
- Fixed an off-by-one bug in a loop
- Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail

Nicholas Piggin (13):
mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in
vmalloc_to_page
mm: apply_to_pte_range warn and fail if a large pte is encountered
mm/vmalloc: rename vmap_*_range vmap_pages_*_range
mm/ioremap: rename ioremap_*_range to vmap_*_range
mm: HUGE_VMAP arch support cleanup
powerpc: inline huge vmap supported functions
arm64: inline huge vmap supported functions
x86: inline huge vmap supported functions
mm/vmalloc: provide fallback arch huge vmap support functions
mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c
mm/vmalloc: add vmap_range_noflush variant
mm/vmalloc: Hugepage vmalloc mappings
powerpc/64s/radix: Enable huge vmalloc mappings

.../admin-guide/kernel-parameters.txt | 2 +
arch/Kconfig | 11 +
arch/arm64/include/asm/vmalloc.h | 24 +
arch/arm64/mm/mmu.c | 26 -
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/vmalloc.h | 20 +
arch/powerpc/kernel/module.c | 21 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 21 -
arch/x86/include/asm/vmalloc.h | 20 +
arch/x86/mm/ioremap.c | 19 -
arch/x86/mm/pgtable.c | 13 -
include/linux/io.h | 9 -
include/linux/vmalloc.h | 46 ++
init/main.c | 1 -
mm/ioremap.c | 225 +-------
mm/memory.c | 66 ++-
mm/page_alloc.c | 5 +-
mm/vmalloc.c | 484 +++++++++++++++---
18 files changed, 614 insertions(+), 400 deletions(-)

--
2.23.0


2021-01-27 19:43:41

by Nicholas Piggin

Subject: [PATCH v11 07/13] arm64: inline huge vmap supported functions

This allows unsupported levels to be constant-folded away, and
p4d_free_pud_page can then be removed because nothing references it any
more.
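
For illustration (a sketch, not part of the patch): with the helpers
inline, IS_ENABLED() makes them compile-time constants, so a caller's
huge-mapping branch folds away entirely and p4d_free_pud_page() ends up
with no remaining references:

	/* Sketch of a caller after constant folding on arm64 */
	if (arch_vmap_p4d_supported(prot)) {
		/*
		 * Constant false on arm64, so the compiler drops this
		 * branch entirely -- and with it the code that would
		 * have needed p4d_free_pud_page().
		 */
	}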

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/arm64/include/asm/vmalloc.h | 23 ++++++++++++++++++++---
arch/arm64/mm/mmu.c | 26 --------------------------
2 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 597b40405319..fc9a12d6cc1a 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -4,9 +4,26 @@
#include <asm/page.h>

#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+ return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+ /*
+ * Only 4k granule supports level 1 block mappings.
+ * SW table walks can't handle removal of intermediate entries.
+ */
+ return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
+ !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+ /* See arch_vmap_pud_supported() */
+ return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
#endif

#endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1613d290cbd1..ab9ba7c36dae 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1313,27 +1313,6 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
return dt_virt;
}

-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
- return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
- /*
- * Only 4k granule supports level 1 block mappings.
- * SW table walks can't handle removal of intermediate entries.
- */
- return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
- !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
- /* See arch_vmap_pud_supported() */
- return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
{
pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot));
@@ -1425,11 +1404,6 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
return 1;
}

-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
- return 0; /* Don't attempt a block mapping */
-}
-
#ifdef CONFIG_MEMORY_HOTPLUG
static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
{
--
2.23.0

2021-01-27 19:43:41

by Nicholas Piggin

Subject: [PATCH v11 12/13] mm/vmalloc: Hugepage vmalloc mappings

Support huge page vmalloc mappings. The config option
HAVE_ARCH_HUGE_VMALLOC enables support on architectures that define
HAVE_ARCH_HUGE_VMAP and support PMD-sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages when the allocation is
PMD-sized or larger, and fall back to small pages if that is
unsuccessful.

Architectures must ensure that any arch-specific vmalloc allocations
that require PAGE_SIZE mappings (e.g., module allocations vs. strict
module rwx) use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
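
As an illustration (a hypothetical sketch, not code from this series,
though patch 13 adjusts the powerpc module code along these lines), an
arch allocation that must stay PAGE_SIZE-mapped would pass the flag
through __vmalloc_node_range(); MODULES_VADDR, MODULES_END and the
protection are arch-specific placeholders here:

	/* Sketch: a module_alloc() that forces PAGE_SIZE ptes */
	void *module_alloc(unsigned long size)
	{
		return __vmalloc_node_range(size, 1, MODULES_VADDR,
				MODULES_END, GFP_KERNEL, PAGE_KERNEL_EXEC,
				VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
				NUMA_NO_NODE, __builtin_return_address(0));
	}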

When hugepage vmalloc mappings are enabled in the next patch, this
reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.

This can result in more internal fragmentation and memory overhead for
a given allocation, so a boot option, nohugevmalloc, is added to
disable it.
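
For reference, set_nohugevmalloc() below is a bare early_param that
ignores its argument, so disabling huge vmalloc at boot is just a
matter of appending the option to the kernel command line:

	linux ... nohugevmalloc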

Signed-off-by: Nicholas Piggin <[email protected]>
---
arch/Kconfig | 11 ++
include/linux/vmalloc.h | 21 ++++
mm/page_alloc.c | 5 +-
mm/vmalloc.c | 215 +++++++++++++++++++++++++++++++---------
4 files changed, 205 insertions(+), 47 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 24862d15f3a3..eef170e0c9b8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -724,6 +724,17 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
config HAVE_ARCH_HUGE_VMAP
bool

+#
+# Archs that select this would be capable of PMD-sized vmaps (i.e.,
+# arch_vmap_pmd_supported() returns true), and they must make no assumptions
+# that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
+# can be used to prohibit arch-specific allocations from using hugepages to
+# help with this (e.g., modules may require it).
+#
+config HAVE_ARCH_HUGE_VMALLOC
+ depends on HAVE_ARCH_HUGE_VMAP
+ bool
+
config ARCH_WANT_HUGE_PMD_SHARE
bool

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 99ea72d547dc..93270adf5db5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -25,6 +25,7 @@ struct notifier_block; /* in notifier.h */
#define VM_NO_GUARD 0x00000040 /* don't add guard page */
#define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */
#define VM_MAP_PUT_PAGES 0x00000100 /* put pages and free array in vfree */
+#define VM_NO_HUGE_VMAP 0x00000200 /* force PAGE_SIZE pte mapping */

/*
* VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
@@ -59,6 +60,9 @@ struct vm_struct {
unsigned long size;
unsigned long flags;
struct page **pages;
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+ unsigned int page_order;
+#endif
unsigned int nr_pages;
phys_addr_t phys_addr;
const void *caller;
@@ -193,6 +197,22 @@ void free_vm_area(struct vm_struct *area);
extern struct vm_struct *remove_vm_area(const void *addr);
extern struct vm_struct *find_vm_area(const void *addr);

+static inline bool is_vm_area_hugepages(const void *addr)
+{
+ /*
+ * This may not tell with 100% accuracy whether the area is mapped
+ * with > PAGE_SIZE page table entries: if for some reason the
+ * architecture indicates larger sizes are available but decides not
+ * to use them, nothing prevents that. This only indicates the size
+ * of the physical page allocated in the vmalloc layer.
+ */
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+ return find_vm_area(addr)->page_order > 0;
+#else
+ return false;
+#endif
+}
+
#ifdef CONFIG_MMU
int vmap_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
@@ -210,6 +230,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
}
+
#else
static inline int
map_kernel_range_noflush(unsigned long start, unsigned long size,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 027f6481ba59..b7a9661fa232 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -72,6 +72,7 @@
#include <linux/padata.h>
#include <linux/khugepaged.h>
#include <linux/buffer_head.h>
+#include <linux/vmalloc.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -8238,6 +8239,7 @@ void *__init alloc_large_system_hash(const char *tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+ bool huge;

/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8305,6 +8307,7 @@ void *__init alloc_large_system_hash(const char *tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+ huge = is_vm_area_hugepages(table);
} else {
/*
* If bucketsize is not a power-of-two, we may free
@@ -8321,7 +8324,7 @@ void *__init alloc_large_system_hash(const char *tablename,

pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
- virt ? "vmalloc" : "linear");
+ virt ? (huge ? "vmalloc hugepage" : "vmalloc") : "linear");

if (_hash_shift)
*_hash_shift = log2qty;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 47ab4338cfff..e9a28de04182 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -42,6 +42,19 @@
#include "internal.h"
#include "pgalloc-track.h"

+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmalloc(char *str)
+{
+ vmap_allow_huge = false;
+ return 0;
+}
+early_param("nohugevmalloc", set_nohugevmalloc);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+static const bool vmap_allow_huge = false;
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+
bool is_vmalloc_addr(const void *x)
{
unsigned long addr = (unsigned long)x;
@@ -483,31 +496,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
return 0;
}

-/**
- * map_kernel_range_noflush - map kernel VM area with the specified pages
- * @addr: start of the VM area to map
- * @size: size of the VM area to map
- * @prot: page protection flags to use
- * @pages: pages to map
- *
- * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should
- * have been allocated using get_vm_area() and its friends.
- *
- * NOTE:
- * This function does NOT do any cache flushing. The caller is responsible for
- * calling flush_cache_vmap() on to-be-mapped areas before calling this
- * function.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int map_kernel_range_noflush(unsigned long addr, unsigned long size,
- pgprot_t prot, struct page **pages)
+static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages)
{
unsigned long start = addr;
- unsigned long end = addr + size;
- unsigned long next;
pgd_t *pgd;
+ unsigned long next;
int err = 0;
int nr = 0;
pgtbl_mod_mask mask = 0;
@@ -529,6 +523,66 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
return 0;
}

+static int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+ unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
+
+ WARN_ON(page_shift < PAGE_SHIFT);
+
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
+ page_shift == PAGE_SHIFT)
+ return vmap_small_pages_range_noflush(addr, end, prot, pages);
+
+ for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
+ int err;
+
+ err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+ __pa(page_address(pages[i])), prot,
+ page_shift);
+ if (err)
+ return err;
+
+ addr += 1UL << page_shift;
+ }
+
+ return 0;
+}
+
+static int vmap_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+ int err;
+
+ err = vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
+ flush_cache_vmap(addr, end);
+ return err;
+}
+
+/**
+ * map_kernel_range_noflush - map kernel VM area with the specified pages
+ * @addr: start of the VM area to map
+ * @size: size of the VM area to map
+ * @prot: page protection flags to use
+ * @pages: pages to map
+ *
+ * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should
+ * have been allocated using get_vm_area() and its friends.
+ *
+ * NOTE:
+ * This function does NOT do any cache flushing. The caller is responsible for
+ * calling flush_cache_vmap() on to-be-mapped areas before calling this
+ * function.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int map_kernel_range_noflush(unsigned long addr, unsigned long size,
+ pgprot_t prot, struct page **pages)
+{
+ return vmap_pages_range_noflush(addr, addr + size, prot, pages, PAGE_SHIFT);
+}
+
int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
struct page **pages)
{
@@ -2112,6 +2166,24 @@ EXPORT_SYMBOL(vm_map_ram);

static struct vm_struct *vmlist __initdata;

+static inline unsigned int vm_area_page_order(struct vm_struct *vm)
+{
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+ return vm->page_order;
+#else
+ return 0;
+#endif
+}
+
+static inline void set_vm_area_page_order(struct vm_struct *vm, unsigned int order)
+{
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+ vm->page_order = order;
+#else
+ BUG_ON(order != 0);
+#endif
+}
+
/**
* vm_area_add_early - add vmap area early during boot
* @vm: vm_struct to add
@@ -2422,6 +2494,7 @@ static inline void set_area_direct_map(const struct vm_struct *area,
{
int i;

+ /* HUGE_VMALLOC passes small pages to set_direct_map */
for (i = 0; i < area->nr_pages; i++)
if (page_address(area->pages[i]))
set_direct_map(area->pages[i]);
@@ -2431,6 +2504,7 @@ static inline void set_area_direct_map(const struct vm_struct *area,
static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
{
unsigned long start = ULONG_MAX, end = 0;
+ unsigned int page_order = vm_area_page_order(area);
int flush_reset = area->flags & VM_FLUSH_RESET_PERMS;
int flush_dmap = 0;
int i;
@@ -2455,11 +2529,14 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
* map. Find the start and end range of the direct mappings to make sure
* the vm_unmap_aliases() flush includes the direct map.
*/
- for (i = 0; i < area->nr_pages; i++) {
+ for (i = 0; i < area->nr_pages; i += 1U << page_order) {
unsigned long addr = (unsigned long)page_address(area->pages[i]);
if (addr) {
+ unsigned long page_size;
+
+ page_size = PAGE_SIZE << page_order;
start = min(addr, start);
- end = max(addr + PAGE_SIZE, end);
+ end = max(addr + page_size, end);
flush_dmap = 1;
}
}
@@ -2500,13 +2577,14 @@ static void __vunmap(const void *addr, int deallocate_pages)
vm_remove_mappings(area, deallocate_pages);

if (deallocate_pages) {
+ unsigned int page_order = vm_area_page_order(area);
int i;

- for (i = 0; i < area->nr_pages; i++) {
+ for (i = 0; i < area->nr_pages; i += 1U << page_order) {
struct page *page = area->pages[i];

BUG_ON(!page);
- __free_pages(page, 0);
+ __free_pages(page, page_order);
}
atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);

@@ -2697,15 +2775,19 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
#endif /* CONFIG_VMAP_PFN */

static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
- pgprot_t prot, int node)
+ pgprot_t prot, unsigned int page_shift,
+ int node)
{
const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
- unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
+ unsigned long addr = (unsigned long)area->addr;
+ unsigned long size = get_vm_area_size(area);
unsigned long array_size;
- unsigned int i;
+ unsigned int nr_small_pages = size >> PAGE_SHIFT;
+ unsigned int page_order;
struct page **pages;
+ unsigned int i;

- array_size = (unsigned long)nr_pages * sizeof(struct page *);
+ array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
gfp_mask |= __GFP_NOWARN;
if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
gfp_mask |= __GFP_HIGHMEM;
@@ -2724,30 +2806,37 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
}

area->pages = pages;
- area->nr_pages = nr_pages;
+ area->nr_pages = nr_small_pages;
+ set_vm_area_page_order(area, page_shift - PAGE_SHIFT);

- for (i = 0; i < area->nr_pages; i++) {
- struct page *page;
+ page_order = vm_area_page_order(area);

- if (node == NUMA_NO_NODE)
- page = alloc_page(gfp_mask);
- else
- page = alloc_pages_node(node, gfp_mask, 0);
+ /*
+ * Careful, we allocate and map page_order pages, but tracking is done
+ * per PAGE_SIZE page so as to keep the vm_struct APIs independent of
+ * the physical/mapped size.
+ */
+ for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+ struct page *page;
+ int p;

+ page = alloc_pages_node(node, gfp_mask, page_order);
if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vfree() */
area->nr_pages = i;
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
goto fail;
}
- area->pages[i] = page;
+
+ for (p = 0; p < (1U << page_order); p++)
+ area->pages[i + p] = page + p;
+
if (gfpflags_allow_blocking(gfp_mask))
cond_resched();
}
atomic_long_add(area->nr_pages, &nr_vmalloc_pages);

- if (map_kernel_range((unsigned long)area->addr, get_vm_area_size(area),
- prot, pages) < 0)
+ if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0)
goto fail;

return area->addr;
@@ -2755,7 +2844,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
fail:
warn_alloc(gfp_mask, NULL,
"vmalloc: allocation failure, allocated %ld of %ld bytes",
- (area->nr_pages*PAGE_SIZE), area->size);
+ (area->nr_pages*PAGE_SIZE), size);
__vfree(area->addr);
return NULL;
}
@@ -2786,19 +2875,43 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
struct vm_struct *area;
void *addr;
unsigned long real_size = size;
+ unsigned long real_align = align;
+ unsigned int shift = PAGE_SHIFT;

- size = PAGE_ALIGN(size);
if (!size || (size >> PAGE_SHIFT) > totalram_pages())
goto fail;

- area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED |
+ if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP) &&
+ arch_vmap_pmd_supported(prot)) {
+ unsigned long size_per_node;
+
+ /*
+ * Try huge pages. Only try for PAGE_KERNEL allocations,
+ * others like modules don't yet expect huge pages in
+ * their allocations due to apply_to_page_range not
+ * supporting them.
+ */
+
+ size_per_node = size;
+ if (node == NUMA_NO_NODE)
+ size_per_node /= num_online_nodes();
+ if (size_per_node >= PMD_SIZE) {
+ shift = PMD_SHIFT;
+ align = max(real_align, 1UL << shift);
+ size = ALIGN(real_size, 1UL << shift);
+ }
+ }
+
+again:
+ size = PAGE_ALIGN(size);
+ area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
vm_flags, start, end, node, gfp_mask, caller);
if (!area)
goto fail;

- addr = __vmalloc_area_node(area, gfp_mask, prot, node);
+ addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
if (!addr)
- return NULL;
+ goto fail;

/*
* In this function, newly allocated vm_struct has VM_UNINITIALIZED
@@ -2812,8 +2925,18 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
return addr;

fail:
- warn_alloc(gfp_mask, NULL,
+ if (shift > PAGE_SHIFT) {
+ shift = PAGE_SHIFT;
+ align = real_align;
+ size = real_size;
+ goto again;
+ }
+
+ if (!area) {
+ /* Warn for area allocation, page allocations already warn */
+ warn_alloc(gfp_mask, NULL,
"vmalloc: allocation failure: %lu bytes", real_size);
+ }
return NULL;
}

--
2.23.0

2021-01-27 19:44:08

by Nicholas Piggin

Subject: [PATCH v11 02/13] mm: apply_to_pte_range warn and fail if a large pte is encountered

apply_to_pte_range might mistake a large pte for a bad one, or treat it
as a page table, resulting in a crash or corruption: on many
architectures a leaf entry looks "bad" to pmd_none_or_clear_bad() and
friends, which would then clear a live huge mapping. Add tests that
warn and return an error if large entries are encountered.
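
To make the failure mode concrete, a sketch of a hypothetical caller
(not from the patch); with this change, walking into a huge mapping
returns -EINVAL with a warning instead of mistreating the leaf entry:

	/* Sketch: count the ptes backing a small-page-mapped range */
	static int count_pte(pte_t *ptep, unsigned long addr, void *data)
	{
		(*(unsigned long *)data)++;
		return 0;
	}

	unsigned long n = 0;
	int err = apply_to_page_range(&init_mm, addr, size, count_pte, &n);
	/* err == -EINVAL (plus WARN_ON_ONCE) if a large entry was hit */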

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Nicholas Piggin <[email protected]>
---
mm/memory.c | 66 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..672e39a72788 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2440,13 +2440,21 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
}
do {
next = pmd_addr_end(addr, end);
- if (create || !pmd_none_or_clear_bad(pmd)) {
- err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
- create, mask);
- if (err)
- break;
+ if (pmd_none(*pmd) && !create)
+ continue;
+ if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+ return -EINVAL;
+ if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) {
+ if (!create)
+ continue;
+ pmd_clear_bad(pmd);
}
+ err = apply_to_pte_range(mm, pmd, addr, next,
+ fn, data, create, mask);
+ if (err)
+ break;
} while (pmd++, addr = next, addr != end);
+
return err;
}

@@ -2468,13 +2476,21 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
}
do {
next = pud_addr_end(addr, end);
- if (create || !pud_none_or_clear_bad(pud)) {
- err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
- create, mask);
- if (err)
- break;
+ if (pud_none(*pud) && !create)
+ continue;
+ if (WARN_ON_ONCE(pud_leaf(*pud)))
+ return -EINVAL;
+ if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
+ if (!create)
+ continue;
+ pud_clear_bad(pud);
}
+ err = apply_to_pmd_range(mm, pud, addr, next,
+ fn, data, create, mask);
+ if (err)
+ break;
} while (pud++, addr = next, addr != end);
+
return err;
}

@@ -2496,13 +2512,21 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
}
do {
next = p4d_addr_end(addr, end);
- if (create || !p4d_none_or_clear_bad(p4d)) {
- err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
- create, mask);
- if (err)
- break;
+ if (p4d_none(*p4d) && !create)
+ continue;
+ if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+ return -EINVAL;
+ if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) {
+ if (!create)
+ continue;
+ p4d_clear_bad(p4d);
}
+ err = apply_to_pud_range(mm, p4d, addr, next,
+ fn, data, create, mask);
+ if (err)
+ break;
} while (p4d++, addr = next, addr != end);
+
return err;
}

@@ -2522,9 +2546,17 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
- if (!create && pgd_none_or_clear_bad(pgd))
+ if (pgd_none(*pgd) && !create)
continue;
- err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
+ if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+ return -EINVAL;
+ if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
+ if (!create)
+ continue;
+ pgd_clear_bad(pgd);
+ }
+ err = apply_to_p4d_range(mm, pgd, addr, next,
+ fn, data, create, &mask);
if (err)
break;
} while (pgd++, addr = next, addr != end);
--
2.23.0

2021-01-27 19:46:40

by Nicholas Piggin

Subject: [PATCH v11 01/13] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page

vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages, and returns
the struct page that corresponds to the offset within the large page.
This makes the API agnostic to mapping implementation details.

[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")
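
A worked example of the offset math (assuming 4K base pages, a 2M PMD
leaf, and a PMD-aligned map_base):

	/*
	 * addr & ~PMD_MASK  -> byte offset into the 2M mapping
	 * >> PAGE_SHIFT     -> index of the 4K tail page
	 *
	 * e.g. for addr = map_base + 0x5000:
	 *   (addr & ~PMD_MASK) >> PAGE_SHIFT == 0x5000 >> 12 == 5
	 * so vmalloc_to_page() returns pmd_page(*pmd) + 5, the same
	 * page a small-page mapping of the range would have given.
	 */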

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Nicholas Piggin <[email protected]>
---
mm/vmalloc.c | 41 ++++++++++++++++++++++++++---------------
1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e6f352bf0498..62372f9e0167 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -34,7 +34,7 @@
#include <linux/bitops.h>
#include <linux/rbtree_augmented.h>
#include <linux/overflow.h>
-
+#include <linux/pgtable.h>
#include <linux/uaccess.h>
#include <asm/tlbflush.h>
#include <asm/shmparam.h>
@@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
}

/*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
*/
struct page *vmalloc_to_page(const void *vmalloc_addr)
{
@@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)

if (pgd_none(*pgd))
return NULL;
+ if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+ return NULL; /* XXX: no allowance for huge pgd */
+ if (WARN_ON_ONCE(pgd_bad(*pgd)))
+ return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
- pud = pud_offset(p4d, addr);
+ if (p4d_leaf(*p4d))
+ return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(p4d_bad(*p4d)))
+ return NULL;

- /*
- * Don't dereference bad PUD or PMD (below) entries. This will also
- * identify huge mappings, which we may encounter on architectures
- * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
- * identified as vmalloc addresses by is_vmalloc_addr(), but are
- * not [unambiguously] associated with a struct page, so there is
- * no correct value to return for them.
- */
- WARN_ON_ONCE(pud_bad(*pud));
- if (pud_none(*pud) || pud_bad(*pud))
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return NULL;
+ if (pud_leaf(*pud))
+ return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
- WARN_ON_ONCE(pmd_bad(*pmd));
- if (pmd_none(*pmd) || pmd_bad(*pmd))
+ if (pmd_none(*pmd))
+ return NULL;
+ if (pmd_leaf(*pmd))
+ return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;

ptep = pte_offset_map(pmd, addr);
@@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
}
EXPORT_SYMBOL(vmalloc_to_page);
--
2.23.0

2021-01-27 19:48:07

by Nicholas Piggin

Subject: [PATCH v11 11/13] mm/vmalloc: add vmap_range_noflush variant

Add a vmap_range_noflush() variant so that a following patch can map a
series of huge pages and flush the cache once over the whole range. As
a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches the
other callers in this file.
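
The shape this enables, sketched after the next patch's
vmap_pages_range_noflush(): map each huge page without flushing, then
flush the cache once over the whole range:

	/* Sketch, following patch 12 in this series */
	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
					 __pa(page_address(pages[i])),
					 prot, page_shift);
		if (err)
			return err;
		addr += 1UL << page_shift;
	}
	flush_cache_vmap(start, end);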

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Nicholas Piggin <[email protected]>
---
mm/vmalloc.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f043386bb51d..47ab4338cfff 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -240,7 +240,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
return 0;
}

-int vmap_range(unsigned long addr, unsigned long end,
+static int vmap_range_noflush(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
unsigned int max_page_shift)
{
@@ -263,14 +263,24 @@ int vmap_range(unsigned long addr, unsigned long end,
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);

- flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);

return err;
}

+int vmap_range(unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot,
+ unsigned int max_page_shift)
+{
+ int err;
+
+ err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+ flush_cache_vmap(addr, end);
+
+ return err;
+}
+
static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pgtbl_mod_mask *mask)
{
--
2.23.0

2021-01-28 03:16:30

by Ding Tianhong

Subject: Re: [PATCH v11 01/13] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page

On 2021/1/26 12:44, Nicholas Piggin wrote:
> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
> Whether or not a vmap is huge depends on the architecture details,
> alignments, boot options, etc., which the caller can not be expected
> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
>
> This change teaches vmalloc_to_page about larger pages, and returns
> the struct page that corresponds to the offset within the large page.
> This makes the API agnostic to mapping implementation details.
>
> [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
> fail gracefully on unexpected huge vmap mappings")
>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Nicholas Piggin <[email protected]>
> ---
> mm/vmalloc.c | 41 ++++++++++++++++++++++++++---------------
> 1 file changed, 26 insertions(+), 15 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e6f352bf0498..62372f9e0167 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -34,7 +34,7 @@
> #include <linux/bitops.h>
> #include <linux/rbtree_augmented.h>
> #include <linux/overflow.h>
> -
> +#include <linux/pgtable.h>
> #include <linux/uaccess.h>
> #include <asm/tlbflush.h>
> #include <asm/shmparam.h>
> @@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
> }
>
> /*
> - * Walk a vmap address to the struct page it maps.
> + * Walk a vmap address to the struct page it maps. Huge vmap mappings will
> + * return the tail page that corresponds to the base page address, which
> + * matches small vmap mappings.
> */
> struct page *vmalloc_to_page(const void *vmalloc_addr)
> {
> @@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>
> if (pgd_none(*pgd))
> return NULL;
> + if (WARN_ON_ONCE(pgd_leaf(*pgd)))
> + return NULL; /* XXX: no allowance for huge pgd */
> + if (WARN_ON_ONCE(pgd_bad(*pgd)))
> + return NULL;
> +
> p4d = p4d_offset(pgd, addr);
> if (p4d_none(*p4d))
> return NULL;
> - pud = pud_offset(p4d, addr);
> + if (p4d_leaf(*p4d))
> + return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
> + if (WARN_ON_ONCE(p4d_bad(*p4d)))
> + return NULL;
>
> - /*
> - * Don't dereference bad PUD or PMD (below) entries. This will also
> - * identify huge mappings, which we may encounter on architectures
> - * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
> - * identified as vmalloc addresses by is_vmalloc_addr(), but are
> - * not [unambiguously] associated with a struct page, so there is
> - * no correct value to return for them.
> - */
> - WARN_ON_ONCE(pud_bad(*pud));
> - if (pud_none(*pud) || pud_bad(*pud))
> + pud = pud_offset(p4d, addr);
> + if (pud_none(*pud))
> + return NULL;
> + if (pud_leaf(*pud))
> + return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);

Hi Nicho:

/builds/1mzfdQzleCy69KZFb5qHNSEgabZ/mm/vmalloc.c: In function 'vmalloc_to_page':
/builds/1mzfdQzleCy69KZFb5qHNSEgabZ/include/asm-generic/pgtable-nop4d-hack.h:48:27: error: implicit declaration of function 'pud_page'; did you mean 'put_page'? [-Werror=implicit-function-declaration]
48 | #define pgd_page(pgd) (pud_page((pud_t){ pgd }))
| ^~~~~~~~

pud_page is not defined for aarch32 with the 2-level page table config
enabled, which breaks the build.


> + if (WARN_ON_ONCE(pud_bad(*pud)))
> return NULL;
> +
> pmd = pmd_offset(pud, addr);
> - WARN_ON_ONCE(pmd_bad(*pmd));
> - if (pmd_none(*pmd) || pmd_bad(*pmd))
> + if (pmd_none(*pmd))
> + return NULL;
> + if (pmd_leaf(*pmd))
> + return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> + if (WARN_ON_ONCE(pmd_bad(*pmd)))
> return NULL;
>
> ptep = pte_offset_map(pmd, addr);
> @@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
> if (pte_present(pte))
> page = pte_page(pte);
> pte_unmap(ptep);
> +
> return page;
> }
> EXPORT_SYMBOL(vmalloc_to_page);
>

2021-02-02 10:25:35

by Nicholas Piggin

Subject: Re: [PATCH v11 01/13] mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page

Excerpts from Ding Tianhong's message of January 28, 2021 1:13 pm:
> On 2021/1/26 12:44, Nicholas Piggin wrote:
>> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
>> Whether or not a vmap is huge depends on the architecture details,
>> alignments, boot options, etc., which the caller can not be expected
>> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
>>
>> This change teaches vmalloc_to_page about larger pages, and returns
>> the struct page that corresponds to the offset within the large page.
>> This makes the API agnostic to mapping implementation details.
>>
>> [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
>> fail gracefully on unexpected huge vmap mappings")
>>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: Nicholas Piggin <[email protected]>
>> ---
>> mm/vmalloc.c | 41 ++++++++++++++++++++++++++---------------
>> 1 file changed, 26 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index e6f352bf0498..62372f9e0167 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -34,7 +34,7 @@
>> #include <linux/bitops.h>
>> #include <linux/rbtree_augmented.h>
>> #include <linux/overflow.h>
>> -
>> +#include <linux/pgtable.h>
>> #include <linux/uaccess.h>
>> #include <asm/tlbflush.h>
>> #include <asm/shmparam.h>
>> @@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
>> }
>>
>> /*
>> - * Walk a vmap address to the struct page it maps.
>> + * Walk a vmap address to the struct page it maps. Huge vmap mappings will
>> + * return the tail page that corresponds to the base page address, which
>> + * matches small vmap mappings.
>> */
>> struct page *vmalloc_to_page(const void *vmalloc_addr)
>> {
>> @@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>
>> if (pgd_none(*pgd))
>> return NULL;
>> + if (WARN_ON_ONCE(pgd_leaf(*pgd)))
>> + return NULL; /* XXX: no allowance for huge pgd */
>> + if (WARN_ON_ONCE(pgd_bad(*pgd)))
>> + return NULL;
>> +
>> p4d = p4d_offset(pgd, addr);
>> if (p4d_none(*p4d))
>> return NULL;
>> - pud = pud_offset(p4d, addr);
>> + if (p4d_leaf(*p4d))
>> + return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>> + if (WARN_ON_ONCE(p4d_bad(*p4d)))
>> + return NULL;
>>
>> - /*
>> - * Don't dereference bad PUD or PMD (below) entries. This will also
>> - * identify huge mappings, which we may encounter on architectures
>> - * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
>> - * identified as vmalloc addresses by is_vmalloc_addr(), but are
>> - * not [unambiguously] associated with a struct page, so there is
>> - * no correct value to return for them.
>> - */
>> - WARN_ON_ONCE(pud_bad(*pud));
>> - if (pud_none(*pud) || pud_bad(*pud))
>> + pud = pud_offset(p4d, addr);
>> + if (pud_none(*pud))
>> + return NULL;
>> + if (pud_leaf(*pud))
>> + return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>
> Hi Nicho:
>
> /builds/1mzfdQzleCy69KZFb5qHNSEgabZ/mm/vmalloc.c: In function 'vmalloc_to_page':
> /builds/1mzfdQzleCy69KZFb5qHNSEgabZ/include/asm-generic/pgtable-nop4d-hack.h:48:27: error: implicit declaration of function 'pud_page'; did you mean 'put_page'? [-Werror=implicit-function-declaration]
> 48 | #define pgd_page(pgd) (pud_page((pud_t){ pgd }))
> | ^~~~~~~~
>
> pud_page is not defined for aarch32 with the 2-level page table config
> enabled, which breaks the build.

Hey, thanks for finding that; not sure why it didn't trigger any CI.

Anyway, newer kernels don't have the pgtable-*-hack.h headers, but even
so it still breaks upstream. arm uses some hand-rolled 2-level folding
of its own (which is fair enough, because most 32-bit archs were
2-level at the time I added the pgtable-nopud.h header).

This patch seems to at least make it build.

Thanks,
Nick

---
arch/arm/include/asm/pgtable-3level.h | 2 --
arch/arm/include/asm/pgtable.h | 3 +++
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 2b85d175e999..d4edab51a77c 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -186,8 +186,6 @@ static inline pte_t pte_mkspecial(pte_t pte)

#define pmd_write(pmd) (pmd_isclear((pmd), L_PMD_SECT_RDONLY))
#define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))
-#define pud_page(pud) pmd_page(__pmd(pud_val(pud)))
-#define pud_write(pud) pmd_write(__pmd(pud_val(pud)))

#define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd))
#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index c02f24400369..d63a5bb6bd0c 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -166,6 +166,9 @@ extern struct page *empty_zero_page;

extern pgd_t swapper_pg_dir[PTRS_PER_PGD];

+#define pud_page(pud) pmd_page(__pmd(pud_val(pud)))
+#define pud_write(pud) pmd_write(__pmd(pud_val(pud)))
+
#define pmd_none(pmd) (!pmd_val(pmd))

static inline pte_t *pmd_page_vaddr(pmd_t pmd)
--
2.23.0