From: "Mike Rapoport (IBM)" <[email protected]>
Hi,
This is a third attempt to make the page allocator aware of the direct map
layout and allow grouping of the pages that must be unmapped from
the direct map.
This is a new implementation of __GFP_UNMAPPED, essentially a follow-up to this set:
https://lore.kernel.org/all/[email protected]
but instead of using a migrate type to cache the unmapped pages, the
current implementation adds a dedicated cache to serve __GFP_UNMAPPED
allocations.
The last two patches in the series demonstrate how __GFP_UNMAPPED can be
used in two in-tree use cases.
The first one switches secretmem to the new mechanism, which is a
straightforward optimization.
The second use case enables __GFP_UNMAPPED in x86::module_alloc(), which is
essentially a method to allocate code pages and thus requires permission
changes for base pages in the direct map.
This set is x86 specific at the moment because other architectures either
do not support set_memory APIs that split the direct^w linear map (e.g.
PowerPC) or only enable set_memory APIs when the linear map uses basic page
size (like arm64).
The patches are only lightly tested.
== Motivation ==
There are use-cases that need to remove pages from the direct map or at
least map them with 4K granularity. Whenever this is done e.g. with
set_memory/set_direct_map APIs, the PUD and PMD sized mappings in the
direct map are split into smaller pages.
To reduce the performance hit caused by the fragmentation of the direct map
it makes sense to group and/or cache the pages removed from the direct
map so that the split large pages won't be all over the place.
There were RFCs for grouped page allocations for vmalloc permissions [1]
and for using PKS to protect page tables [2], as well as an attempt to use a
pool of large pages in secretmem [3], but these suggestions address each
use case separately, while a common mechanism at the core mm level could
serve all of them.
== Implementation overview ==
The pages that need to be removed from the direct map are grouped in a
dedicated cache. When there is a page allocation request with
__GFP_UNMAPPED set, it is redirected from __alloc_pages() to that cache
using a new unmapped_alloc() function.
The cache is implemented as a buddy allocator and it can handle high order
requests.
The cache starts empty and whenever it does not have enough pages to
satisfy an allocation request, it attempts to allocate a PMD_SIZE page to
replenish itself. If a PMD_SIZE page cannot be allocated, the cache is
replenished with a page of the highest order available. That page is
removed from the direct map and added to the local buddy allocator.
There is also a shrinker that releases pages from the unmapped cache when
there is memory pressure in the system. When the shrinker releases a page,
it is mapped back into the direct map.
[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/[email protected]
[3] https://lore.kernel.org/lkml/[email protected]
Mike Rapoport (IBM) (5):
mm: introduce __GFP_UNMAPPED and unmapped_alloc()
mm/unmapped_alloc: add debugfs file similar to /proc/pagetypeinfo
mm/unmapped_alloc: add shrinker
EXPERIMENTAL: x86: use __GFP_UNMAPPED for module_alloc()
EXPERIMENTAL: mm/secretmem: use __GFP_UNMAPPED
arch/x86/Kconfig | 3 +
arch/x86/kernel/module.c | 2 +-
include/linux/gfp_types.h | 11 +-
include/linux/page-flags.h | 6 +
include/linux/pageblock-flags.h | 28 +++
include/trace/events/mmflags.h | 10 +-
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/internal.h | 24 +++
mm/page_alloc.c | 39 +++-
mm/secretmem.c | 26 +--
mm/unmapped-alloc.c | 334 ++++++++++++++++++++++++++++++++
mm/vmalloc.c | 2 +-
13 files changed, 459 insertions(+), 31 deletions(-)
create mode 100644 mm/unmapped-alloc.c
base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
--
2.35.1
From: "Mike Rapoport (IBM)" <[email protected]>
Present statistics about unmapped_alloc in debugfs
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
mm/unmapped-alloc.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/mm/unmapped-alloc.c b/mm/unmapped-alloc.c
index fb2d54204a3c..f74640e9ce9f 100644
--- a/mm/unmapped-alloc.c
+++ b/mm/unmapped-alloc.c
@@ -3,6 +3,7 @@
#include <linux/gfp.h>
#include <linux/mmzone.h>
#include <linux/printk.h>
+#include <linux/debugfs.h>
#include <linux/spinlock.h>
#include <linux/set_memory.h>
@@ -213,3 +214,37 @@ int unmapped_alloc_init(void)
return 0;
}
+
+static int unmapped_alloc_debug_show(struct seq_file *m, void *private)
+{
+ int order;
+
+ seq_printf(m, "MAX_ORDER: %d\n", MAX_ORDER);
+ seq_putc(m, '\n');
+
+ seq_printf(m, "%-10s", "Order:");
+ for (order = 0; order < MAX_ORDER; ++order)
+ seq_printf(m, "%5d ", order);
+ seq_putc(m, '\n');
+
+ seq_printf(m, "%-10s", "Free:");
+ for (order = 0; order < MAX_ORDER; ++order)
+ seq_printf(m, "%5lu ", free_area[order].nr_free);
+ seq_putc(m, '\n');
+
+ seq_printf(m, "%-10s", "Cached:");
+ for (order = 0; order < MAX_ORDER; ++order)
+ seq_printf(m, "%5lu ", free_area[order].nr_cached);
+ seq_putc(m, '\n');
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(unmapped_alloc_debug);
+
+static int __init unmapped_alloc_init_late(void)
+{
+ debugfs_create_file("unmapped_alloc", 0444, NULL,
+ NULL, &unmapped_alloc_debug_fops);
+ return 0;
+}
+late_initcall(unmapped_alloc_init_late);
--
2.35.1
From: "Mike Rapoport (IBM)" <[email protected]>
Allow shrinking the unmapped cache when there is memory pressure.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
mm/internal.h | 2 ++
mm/page_alloc.c | 12 ++++++-
mm/unmapped-alloc.c | 86 ++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 98 insertions(+), 2 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 8d84cceab467..aa2934594dcf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1064,4 +1064,6 @@ static inline struct page *unmapped_pages_alloc(gfp_t gfp, int order)
static inline void unmapped_pages_free(struct page *page, int order) {}
#endif
+void __free_unmapped_page(struct page *page, unsigned int order);
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 01f18e7529b0..ece213fac27a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -123,6 +123,11 @@ typedef int __bitwise fpi_t;
*/
#define FPI_SKIP_KASAN_POISON ((__force fpi_t)BIT(2))
+/*
+ * Free pages from the unmapped cache
+ */
+#define FPI_UNMAPPED ((__force fpi_t)BIT(3))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1467,7 +1472,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
PAGE_SIZE << order);
}
- if (get_pageblock_unmapped(page)) {
+ if (!(fpi_flags & FPI_UNMAPPED) && get_pageblock_unmapped(page)) {
unmapped_pages_free(page, order);
return false;
}
@@ -1636,6 +1641,11 @@ static void free_one_page(struct zone *zone,
spin_unlock_irqrestore(&zone->lock, flags);
}
+void __free_unmapped_page(struct page *page, unsigned int order)
+{
+ __free_pages_ok(page, order, FPI_UNMAPPED | FPI_TO_TAIL);
+}
+
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
diff --git a/mm/unmapped-alloc.c b/mm/unmapped-alloc.c
index f74640e9ce9f..89f54383df92 100644
--- a/mm/unmapped-alloc.c
+++ b/mm/unmapped-alloc.c
@@ -4,6 +4,7 @@
#include <linux/mmzone.h>
#include <linux/printk.h>
#include <linux/debugfs.h>
+#include <linux/shrinker.h>
#include <linux/spinlock.h>
#include <linux/set_memory.h>
@@ -16,6 +17,7 @@ struct unmapped_free_area {
spinlock_t lock;
unsigned long nr_free;
unsigned long nr_cached;
+ unsigned long nr_released;
};
static struct unmapped_free_area free_area[MAX_ORDER];
@@ -204,6 +206,82 @@ void unmapped_pages_free(struct page *page, int order)
__free_one_page(page, order, false);
}
+static unsigned long unmapped_alloc_count_objects(struct shrinker *sh,
+ struct shrink_control *sc)
+{
+ unsigned long pages_to_free = 0;
+
+ for (int order = 0; order < MAX_ORDER; order++)
+ pages_to_free += (free_area[order].nr_free << order);
+
+ return pages_to_free;
+}
+
+static unsigned long scan_free_area(struct shrink_control *sc, int order)
+{
+ struct unmapped_free_area *area = &free_area[order];
+ unsigned long nr_pages = (1 << order);
+ unsigned long pages_freed = 0;
+ unsigned long flags;
+ struct page *page;
+
+ spin_lock_irqsave(&area->lock, flags);
+ while (pages_freed < sc->nr_to_scan) {
+
+ page = list_first_entry_or_null(&area->free_list, struct page,
+ lru);
+ if (!page)
+ break;
+
+ del_page_from_free_list(page, order);
+ expand(page, order, order);
+
+ area->nr_released++;
+
+ if (order == pageblock_order)
+ clear_pageblock_unmapped(page);
+
+ spin_unlock_irqrestore(&area->lock, flags);
+
+ for (int i = 0; i < nr_pages; i++)
+ set_direct_map_default_noflush(page + i);
+
+ __free_unmapped_page(page, order);
+
+ pages_freed += nr_pages;
+
+ cond_resched();
+ spin_lock_irqsave(&area->lock, flags);
+ }
+
+ spin_unlock_irqrestore(&area->lock, flags);
+
+ return pages_freed;
+}
+
+static unsigned long unmapped_alloc_scan_objects(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ sc->nr_scanned = 0;
+
+ for (int order = 0; order < MAX_ORDER; order++) {
+ sc->nr_scanned += scan_free_area(sc, order);
+
+ if (sc->nr_scanned >= sc->nr_to_scan)
+ break;
+
+ sc->nr_to_scan -= sc->nr_scanned;
+ }
+
+ return sc->nr_scanned ? sc->nr_scanned : SHRINK_STOP;
+}
+
+static struct shrinker shrinker = {
+ .count_objects = unmapped_alloc_count_objects,
+ .scan_objects = unmapped_alloc_scan_objects,
+ .seeks = DEFAULT_SEEKS,
+};
+
int unmapped_alloc_init(void)
{
for (int order = 0; order < MAX_ORDER; order++) {
@@ -237,6 +315,11 @@ static int unmapped_alloc_debug_show(struct seq_file *m, void *private)
seq_printf(m, "%5lu ", free_area[order].nr_cached);
seq_putc(m, '\n');
+ seq_printf(m, "%-10s", "Released:");
+ for (order = 0; order < MAX_ORDER; ++order)
+ seq_printf(m, "%5lu ", free_area[order].nr_released);
+ seq_putc(m, '\n');
+
return 0;
}
DEFINE_SHOW_ATTRIBUTE(unmapped_alloc_debug);
@@ -245,6 +328,7 @@ static int __init unmapped_alloc_init_late(void)
{
debugfs_create_file("unmapped_alloc", 0444, NULL,
NULL, &unmapped_alloc_debug_fops);
- return 0;
+
+ return register_shrinker(&shrinker, "mm-unmapped-alloc");
}
late_initcall(unmapped_alloc_init_late);
--
2.35.1
From: "Mike Rapoport (IBM)" <[email protected]>
When the set_memory or set_direct_map APIs are used to change attributes or
permissions for chunks of several pages, the large PMD that maps these
pages in the direct map must be split. Fragmenting the direct map in this
manner causes TLB pressure and, eventually, performance degradation.
To avoid excessive direct map fragmentation, add the ability to allocate
"unmapped" pages with the __GFP_UNMAPPED flag that causes removal of the
allocated pages from the direct map, backed by a cache of unmapped pages.
This cache is replenished with higher order pages with preference for
PMD_SIZE pages when possible so that there will be fewer splits of large
pages in the direct map.
The cache is implemented as a buddy allocator, so it can serve high order
allocations of unmapped pages.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/x86/Kconfig | 3 +
include/linux/gfp_types.h | 11 +-
include/linux/page-flags.h | 6 +
include/linux/pageblock-flags.h | 28 +++++
include/trace/events/mmflags.h | 10 +-
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/internal.h | 22 ++++
mm/page_alloc.c | 29 ++++-
mm/unmapped-alloc.c | 215 ++++++++++++++++++++++++++++++++
10 files changed, 325 insertions(+), 4 deletions(-)
create mode 100644 mm/unmapped-alloc.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..735f691d449c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,9 @@ config X86_64
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+config ARCH_WANTS_GFP_UNMAPPED
+ def_bool y if X86_64
+
config FORCE_DYNAMIC_FTRACE
def_bool y
depends on X86_32
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 5088637fe5c2..234122b434dd 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -60,6 +60,12 @@ typedef unsigned int __bitwise gfp_t;
#else
#define ___GFP_NOLOCKDEP 0
#endif
+#ifdef CONFIG_UNMAPPED_ALLOC
+#define ___GFP_UNMAPPED 0x10000000u
+#else
+#define ___GFP_UNMAPPED 0
+#endif
+
/* If the above are modified, __GFP_BITS_SHIFT may need updating */
/*
@@ -101,12 +107,15 @@ typedef unsigned int __bitwise gfp_t;
* node with no fallbacks or placement policy enforcements.
*
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_UNMAPPED removes the allocated pages from the direct map.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_UNMAPPED ((__force gfp_t)___GFP_UNMAPPED)
/**
* DOC: Watermark modifiers
@@ -252,7 +261,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (28 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/**
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a7e3a3405520..e66dbfdba8f2 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -922,6 +922,7 @@ static inline bool is_page_hwpoison(struct page *page)
#define PG_offline 0x00000100
#define PG_table 0x00000200
#define PG_guard 0x00000400
+#define PG_unmapped 0x00000800
#define PageType(page, flag) \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -992,6 +993,11 @@ PAGE_TYPE_OPS(Table, table)
*/
PAGE_TYPE_OPS(Guard, guard)
+/*
+ * Marks pages in free lists of unmapped allocator.
+ */
+PAGE_TYPE_OPS(Unmapped, unmapped)
+
extern bool is_free_buddy_page(struct page *page);
PAGEFLAG(Isolated, isolated, PF_ANY);
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 5f1ae07d724b..bb6f5ec83881 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -21,6 +21,9 @@ enum pageblock_bits {
/* 3 bits required for migrate types */
PB_migrate_skip,/* If set the block is skipped by compaction */
+ PB_unmapped = 7,/* if set the block has pages unmapped in the direct
+ map */
+
/*
* Assume the bits will always align on a word. If this assumption
* changes then get/set pageblock needs updating.
@@ -95,4 +98,29 @@ static inline void set_pageblock_skip(struct page *page)
}
#endif /* CONFIG_COMPACTION */
+#ifdef CONFIG_UNMAPPED_ALLOC
+#define get_pageblock_unmapped(page) \
+ get_pfnblock_flags_mask(page, page_to_pfn(page), \
+ (1 << (PB_unmapped)))
+#define clear_pageblock_unmapped(page) \
+ set_pfnblock_flags_mask(page, 0, page_to_pfn(page), \
+ (1 << PB_unmapped))
+#define set_pageblock_unmapped(page) \
+ set_pfnblock_flags_mask(page, (1 << PB_unmapped), \
+ page_to_pfn(page), \
+ (1 << PB_unmapped))
+#else /* CONFIG_UNMAPPED_ALLOC */
+static inline bool get_pageblock_unmapped(struct page *page)
+{
+ return false;
+}
+static inline void clear_pageblock_unmapped(struct page *page)
+{
+}
+static inline void set_pageblock_unmapped(struct page *page)
+{
+}
+#endif /* CONFIG_UNMAPPED_ALLOC */
+
+
#endif /* PAGEBLOCK_FLAGS_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 9db52bc4ce19..951d294a3840 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -61,9 +61,17 @@
#define __def_gfpflag_names_kasan
#endif
+#ifdef CONFIG_UNMAPPED_ALLOC
+#define __def_gfpflag_names_unmapped \
+ , gfpflag_string(__GFP_UNMAPPED)
+#else
+#define __def_gfpflag_names_unmapped
+#endif
+
#define show_gfp_flags(flags) \
(flags) ? __print_flags(flags, "|", \
- __def_gfpflag_names __def_gfpflag_names_kasan \
+ __def_gfpflag_names __def_gfpflag_names_kasan \
+ __def_gfpflag_names_unmapped \
) : "none"
#ifdef CONFIG_MMU
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f05..404be73e00e8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,10 @@ config LRU_GEN_STATS
This option has a per-memcg and per-node memory overhead.
# }
+config UNMAPPED_ALLOC
+ def_bool y
+ depends on ARCH_WANTS_GFP_UNMAPPED
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..9143303295af 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_UNMAPPED_ALLOC) += unmapped-alloc.o
diff --git a/mm/internal.h b/mm/internal.h
index 7920a8b7982e..8d84cceab467 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1042,4 +1042,26 @@ struct vma_prepare {
struct vm_area_struct *remove;
struct vm_area_struct *remove2;
};
+
+/*
+ * mm/unmapped-alloc.c
+ */
+#ifdef CONFIG_UNMAPPED_ALLOC
+int unmapped_alloc_init(void);
+struct page *unmapped_pages_alloc(gfp_t gfp, int order);
+void unmapped_pages_free(struct page *page, int order);
+#else
+static inline int unmapped_alloc_init(void)
+{
+ return 0;
+}
+
+static inline struct page *unmapped_pages_alloc(gfp_t gfp, int order)
+{
+ return NULL;
+}
+
+static inline void unmapped_pages_free(struct page *page, int order) {}
+#endif
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac1fc986af44..01f18e7529b0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -589,7 +589,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
unsigned long bitidx, word_bitidx;
unsigned long word;
- BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
+ BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8);
BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));
bitmap = get_pageblock_bitmap(page, pfn);
@@ -746,6 +746,13 @@ static inline bool pcp_allowed_order(unsigned int order)
return false;
}
+static inline bool pcp_allowed(unsigned int order, gfp_t gfp)
+{
+ if (unlikely(gfp & __GFP_UNMAPPED))
+ return false;
+ return pcp_allowed_order(order);
+}
+
static inline void free_the_page(struct page *page, unsigned int order)
{
if (pcp_allowed_order(order)) /* Via pcp? */
@@ -1460,6 +1467,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
PAGE_SIZE << order);
}
+ if (get_pageblock_unmapped(page)) {
+ unmapped_pages_free(page, order);
+ return false;
+ }
+
kernel_poison_pages(page, 1 << order);
/*
@@ -3525,6 +3537,13 @@ void free_unref_page_list(struct list_head *list)
/* Prepare pages for freeing */
list_for_each_entry_safe(page, next, list, lru) {
unsigned long pfn = page_to_pfn(page);
+
+ if (get_pageblock_unmapped(page)) {
+ list_del(&page->lru);
+ unmapped_pages_free(page, 0);
+ continue;
+ }
+
if (!free_unref_page_prepare(page, pfn, 0)) {
list_del(&page->lru);
continue;
@@ -3856,7 +3875,7 @@ struct page *rmqueue(struct zone *preferred_zone,
*/
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
- if (likely(pcp_allowed_order(order))) {
+ if (likely(pcp_allowed(order, gfp_flags))) {
/*
* MIGRATE_MOVABLE pcplist could have the pages on CMA area and
* we need to skip it when CMA area isn't allowed.
@@ -5581,6 +5600,11 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
&alloc_gfp, &alloc_flags))
return NULL;
+ if (alloc_gfp & __GFP_UNMAPPED) {
+ page = unmapped_pages_alloc(gfp, order);
+ goto out;
+ }
+
/*
* Forbid the first pass from falling back to types that fragment
* memory until all local zones are considered.
@@ -8437,6 +8461,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
}
memmap_init();
+ unmapped_alloc_init();
}
static int __init cmdline_parse_core(char *p, unsigned long *core,
diff --git a/mm/unmapped-alloc.c b/mm/unmapped-alloc.c
new file mode 100644
index 000000000000..fb2d54204a3c
--- /dev/null
+++ b/mm/unmapped-alloc.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/gfp.h>
+#include <linux/mmzone.h>
+#include <linux/printk.h>
+#include <linux/spinlock.h>
+#include <linux/set_memory.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+struct unmapped_free_area {
+ struct list_head free_list;
+ spinlock_t lock;
+ unsigned long nr_free;
+ unsigned long nr_cached;
+};
+
+static struct unmapped_free_area free_area[MAX_ORDER];
+
+static inline void add_to_free_list(struct page *page, unsigned int order)
+{
+ struct unmapped_free_area *area = &free_area[order];
+
+ list_add(&page->buddy_list, &area->free_list);
+ area->nr_free++;
+}
+
+static inline void del_page_from_free_list(struct page *page, unsigned int order)
+{
+ list_del(&page->buddy_list);
+ __ClearPageUnmapped(page);
+ set_page_private(page, 0);
+ free_area[order].nr_free--;
+}
+
+static inline void set_unmapped_order(struct page *page, unsigned int order)
+{
+ set_page_private(page, order);
+ __SetPageUnmapped(page);
+}
+
+static inline bool page_is_unmapped_buddy(struct page *page, struct page *buddy,
+ unsigned int order)
+{
+ if (!PageUnmapped(buddy))
+ return false;
+
+ if (buddy_order(buddy) != order)
+ return false;
+
+ return true;
+}
+
+static struct page *find_unmapped_buddy_page_pfn(struct page *page,
+ unsigned long pfn,
+ unsigned int order,
+ unsigned long *buddy_pfn)
+{
+ unsigned long __buddy_pfn = __find_buddy_pfn(pfn, order);
+ struct page *buddy;
+
+ buddy = page + (__buddy_pfn - pfn);
+ if (buddy_pfn)
+ *buddy_pfn = __buddy_pfn;
+
+ if (page_is_unmapped_buddy(page, buddy, order))
+ return buddy;
+
+ return NULL;
+}
+
+static inline void __free_one_page(struct page *page, unsigned int order,
+ bool cache_refill)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long buddy_pfn;
+ unsigned long combined_pfn;
+ struct page *buddy;
+ unsigned long flags;
+
+ spin_lock_irqsave(&free_area->lock, flags);
+
+ if (cache_refill) {
+ set_pageblock_unmapped(page);
+ free_area[order].nr_cached++;
+ }
+
+ while (order < MAX_ORDER - 1) {
+ buddy = find_unmapped_buddy_page_pfn(page, pfn, order,
+ &buddy_pfn);
+ if (!buddy)
+ break;
+
+ del_page_from_free_list(buddy, order);
+ combined_pfn = buddy_pfn & pfn;
+ page = page + (combined_pfn - pfn);
+ pfn = combined_pfn;
+ order++;
+ }
+
+ set_unmapped_order(page, order);
+ add_to_free_list(page, order);
+ spin_unlock_irqrestore(&free_area->lock, flags);
+}
+
+static inline void expand(struct page *page, int low, int high)
+{
+ unsigned long size = 1 << high;
+
+ while (high > low) {
+ high--;
+ size >>= 1;
+
+ add_to_free_list(&page[size], high);
+ set_unmapped_order(&page[size], high);
+ }
+}
+
+static struct page *__rmqueue_smallest(unsigned int order)
+{
+ unsigned int current_order;
+ struct unmapped_free_area *area;
+ struct page *page = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&free_area->lock, flags);
+
+ /* Find a page of the appropriate size in the preferred list */
+ for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+ area = &free_area[current_order];
+ page = list_first_entry_or_null(&area->free_list, struct page,
+ lru);
+ if (!page)
+ continue;
+
+ del_page_from_free_list(page, current_order);
+ expand(page, order, current_order);
+
+ break;
+ }
+
+ spin_unlock_irqrestore(&free_area->lock, flags);
+
+ return page;
+}
+
+/* FIXME: have PMD_ORDER at last available in include/linux */
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
+
+struct page *unmapped_pages_alloc(gfp_t gfp, int order)
+{
+
+ int cache_order = PMD_ORDER;
+ struct page *page;
+
+ page = __rmqueue_smallest(order);
+ if (page)
+ goto out;
+
+ gfp &= ~__GFP_UNMAPPED;
+ while (cache_order >= order) {
+ page = alloc_pages(gfp | __GFP_ZERO, cache_order);
+ if (page)
+ break;
+ cache_order--;
+ }
+
+ if (page) {
+ unsigned long addr = (unsigned long)page_address(page);
+ unsigned long nr_pages = (1 << order);
+ unsigned long size = PAGE_SIZE * nr_pages;
+
+ /* FIXME: set_direct_map_invalid_noflush may fail */
+ for (int i = 0; i < nr_pages; i++)
+ set_direct_map_invalid_noflush(page + i);
+
+ flush_tlb_kernel_range(addr, addr + size);
+
+ /*
+ * FIXME: have this under lock so that allocation running
+ * in parallel won't steal all pages from the newly cached
+ * ones
+ */
+ __free_one_page(page, cache_order, true);
+ page = __rmqueue_smallest(order);
+
+ }
+
+out:
+ if (page) {
+ /* FIXME: __prep_new_page() expects page count of 0 */
+ set_page_count(page, 0);
+ post_alloc_hook(page, order, GFP_KERNEL);
+ }
+
+ return page;
+}
+
+void unmapped_pages_free(struct page *page, int order)
+{
+ __free_one_page(page, order, false);
+}
+
+int unmapped_alloc_init(void)
+{
+ for (int order = 0; order < MAX_ORDER; order++) {
+ struct unmapped_free_area *area = &free_area[order];
+ INIT_LIST_HEAD(&area->free_list);
+ spin_lock_init(&area->lock);
+ }
+
+ return 0;
+}
--
2.35.1
From: "Mike Rapoport (IBM)" <[email protected]>
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
mm/secretmem.c | 26 +-------------------------
1 file changed, 1 insertion(+), 25 deletions(-)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 0b502625cd30..f66dfd16a0c3 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -53,7 +53,6 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
struct inode *inode = file_inode(vmf->vma->vm_file);
pgoff_t offset = vmf->pgoff;
gfp_t gfp = vmf->gfp_mask;
- unsigned long addr;
struct page *page;
vm_fault_t ret;
int err;
@@ -66,38 +65,22 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
retry:
page = find_lock_page(mapping, offset);
if (!page) {
- page = alloc_page(gfp | __GFP_ZERO);
+ page = alloc_page(gfp | __GFP_ZERO | __GFP_UNMAPPED);
if (!page) {
ret = VM_FAULT_OOM;
goto out;
}
- err = set_direct_map_invalid_noflush(page);
- if (err) {
- put_page(page);
- ret = vmf_error(err);
- goto out;
- }
-
__SetPageUptodate(page);
err = add_to_page_cache_lru(page, mapping, offset, gfp);
if (unlikely(err)) {
put_page(page);
- /*
- * If a split of large page was required, it
- * already happened when we marked the page invalid
- * which guarantees that this call won't fail
- */
- set_direct_map_default_noflush(page);
if (err == -EEXIST)
goto retry;
ret = vmf_error(err);
goto out;
}
-
- addr = (unsigned long)page_address(page);
- flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
vmf->page = page;
@@ -150,15 +133,8 @@ static int secretmem_migrate_folio(struct address_space *mapping,
return -EBUSY;
}
-static void secretmem_free_folio(struct folio *folio)
-{
- set_direct_map_default_noflush(&folio->page);
- folio_zero_segment(folio, 0, folio_size(folio));
-}
-
const struct address_space_operations secretmem_aops = {
.dirty_folio = noop_dirty_folio,
- .free_folio = secretmem_free_folio,
.migrate_folio = secretmem_migrate_folio,
};
--
2.35.1
From: "Mike Rapoport (IBM)" <[email protected]>
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/x86/kernel/module.c | 2 +-
mm/vmalloc.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 84ad0e61ba6e..845ed70ba5ab 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -67,7 +67,7 @@ static unsigned long int get_module_load_offset(void)
void *module_alloc(unsigned long size)
{
- gfp_t gfp_mask = GFP_KERNEL;
+ gfp_t gfp_mask = GFP_KERNEL | __GFP_UNMAPPED;
void *p;
if (PAGE_ALIGN(size) > MODULES_LEN)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ef910bf349e1..84220ec45ec2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2892,7 +2892,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
* to fails, fallback to a single page allocator that is
* more permissive.
*/
- if (!order) {
+ if (!order && !(gfp & __GFP_UNMAPPED)) {
gfp_t bulk_gfp = gfp & ~__GFP_NOFAIL;
while (nr_allocated < nr_pages) {
--
2.35.1
On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ef910bf349e1..84220ec45ec2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2892,7 +2892,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> * to fails, fallback to a single page allocator that is
> * more permissive.
> */
> - if (!order) {
> + if (!order && !(gfp & __GFP_UNMAPPED)) {
> gfp_t bulk_gfp = gfp & ~__GFP_NOFAIL;
>
> while (nr_allocated < nr_pages) {
This is obviously a quick POC patch, but I guess we should skip the
whole vm_remove_mappings() thing since it would reset the direct map to
RW for these unmapped pages. Or rather, modules shouldn't set
VM_FLUSH_RESET_PERMS if they use this.
On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> +
> +static inline void __free_one_page(struct page *page, unsigned int
> order,
> + bool cache_refill)
> +{
> + unsigned long pfn = page_to_pfn(page);
> + unsigned long buddy_pfn;
> + unsigned long combined_pfn;
> + struct page *buddy;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&free_area->lock, flags);
> +
> + if (cache_refill) {
> + set_pageblock_unmapped(page);
> + free_area[order].nr_cached++;
> + }
> +
> + while (order < MAX_ORDER - 1) {
> + buddy = find_unmapped_buddy_page_pfn(page, pfn,
> order,
> + &buddy_pfn);
> + if (!buddy)
> + break;
> +
> + del_page_from_free_list(buddy, order);
> + combined_pfn = buddy_pfn & pfn;
> + page = page + (combined_pfn - pfn);
> + pfn = combined_pfn;
> + order++;
> + }
> +
> + set_unmapped_order(page, order);
> + add_to_free_list(page, order);
> + spin_unlock_irqrestore(&free_area->lock, flags);
> +}
> +
The page has to be zeroed before it goes back on the list, right? I
didn't see it.
On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> Hi,
>
> This is a third attempt to make page allocator aware of the direct
> map
> layout and allow grouping of the pages that must be unmapped from
> the direct map.
>
> This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> this set:
>
> https://lore.kernel.org/all/[email protected]
>
> but instead of using a migrate type to cache the unmapped pages, the
> current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> allocations.
It seems a downside of having a page allocator outside of _the_ page
allocator is that you don't get all of the features that are baked in there.
For example does secretmem care about numa? I guess in this
implementation there is just one big cache for all nodes.
Probably most users would want __GFP_ZERO. Would secretmem care about
__GFP_ACCOUNT? I'm sure there is more, but I guess the question is, is
the idea that these features all get built into unmapped-alloc at some
point? The alternate approach is to have little caches for each usage
like the grouped pages, which is probably less efficient when you have
a bunch of them. Or solve it just for modules like the bpf allocator.
Those are the tradeoffs for the approaches that have been explored,
right?
On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> When set_memory or set_direct_map APIs used to change attribute or
> permissions for chunks of several pages, the large PMD that maps these
> pages in the direct map must be split. Fragmenting the direct map in such
> manner causes TLB pressure and, eventually, performance degradation.
>
> To avoid excessive direct map fragmentation, add ability to allocate
> "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> allocated pages from the direct map and use a cache of the unmapped pages.
>
> This cache is replenished with higher order pages with preference for
> PMD_SIZE pages when possible so that there will be fewer splits of large
> pages in the direct map.
>
> The cache is implemented as a buddy allocator, so it can serve high order
> allocations of unmapped pages.
Hello,
To me it seems unnecessary to increase the pageblock bitmap size just to
distinguish whether a page was allocated with __GFP_UNMAPPED.
Couldn't this instead be implemented as an independent cache on top of
the buddy allocator, introducing a new API for unmapped page allocation
and freeing?
Thanks,
Hyeonggon
On Thu, Mar 09, 2023 at 01:56:37AM +0000, Edgecombe, Rick P wrote:
> On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > +
> > +static inline void __free_one_page(struct page *page, unsigned int
> > order,
> > + bool cache_refill)
> > +{
> > + unsigned long pfn = page_to_pfn(page);
> > + unsigned long buddy_pfn;
> > + unsigned long combined_pfn;
> > + struct page *buddy;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&free_area->lock, flags);
> > +
> > + if (cache_refill) {
> > + set_pageblock_unmapped(page);
> > + free_area[order].nr_cached++;
> > + }
> > +
> > + while (order < MAX_ORDER - 1) {
> > + buddy = find_unmapped_buddy_page_pfn(page, pfn,
> > order,
> > + &buddy_pfn);
> > + if (!buddy)
> > + break;
> > +
> > + del_page_from_free_list(buddy, order);
> > + combined_pfn = buddy_pfn & pfn;
> > + page = page + (combined_pfn - pfn);
> > + pfn = combined_pfn;
> > + order++;
> > + }
> > +
> > + set_unmapped_order(page, order);
> > + add_to_free_list(page, order);
> > + spin_unlock_irqrestore(&free_area->lock, flags);
> > +}
> > +
>
> The page has to be zeroed before it goes back on the list, right? I
> didn't see it.
You are right, I missed it.
--
Sincerely yours,
Mike.
On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <[email protected]>
> >
> > Hi,
> >
> > This is a third attempt to make page allocator aware of the direct
> > map
> > layout and allow grouping of the pages that must be unmapped from
> > the direct map.
> >
> > This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> > this set:
> >
> > https://lore.kernel.org/all/[email protected]
> >
> > but instead of using a migrate type to cache the unmapped pages, the
> > current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> > allocations.
>
> It seems a downside to having a page allocator outside of _the_ page
> allocator is you don't get all of the features that are baked in there.
> For example does secretmem care about numa? I guess in this
> implementation there is just one big cache for all nodes.
>
> Probably most users would want __GFP_ZERO. Would secretmem care about
> __GFP_ACCOUNT?
The intention was that the pages in the cache are always zeroed, so
__GFP_ZERO is always implicitly there; at least it should have been.
__GFP_ACCOUNT is respected in this implementation. If you look at the
changes to __alloc_pages(), after getting pages from the unmapped cache
there is a 'goto out' to the point where the accounting is handled.
> I'm sure there is more, but I guess the question is, is
> the idea that these features all get built into unmapped-alloc at some
> point? The alternate approach is to have little caches for each usage
> like the grouped pages, which is probably less efficient when you have
> a bunch of them. Or solve it just for modules like the bpf allocator.
> Those are the tradeoffs for the approaches that have been explored,
> right?
I think that no matter what cache we use, it won't be able to support all
the features _the_ page allocator has. Indeed, with per-use-case cache
implementations we could tune each implementation to support the features
of interest for that use case, but then we'd be less efficient in reducing
splits of the large pages. Not to mention the increase in complexity, as
there would be several caches doing similar yet different things.
This POC mostly targets secretmem and modules, so it was pretty much about
GFP_KERNEL without considerations for NUMA. I think extending unmapped
alloc to support NUMA should be simple enough, but it will increase the
memory overhead even more.
--
Sincerely yours,
Mike.
On Thu, Mar 09, 2023 at 06:31:25AM +0000, Hyeonggon Yoo wrote:
> On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <[email protected]>
> >
> > When set_memory or set_direct_map APIs used to change attribute or
> > permissions for chunks of several pages, the large PMD that maps these
> > pages in the direct map must be split. Fragmenting the direct map in such
> > manner causes TLB pressure and, eventually, performance degradation.
> >
> > To avoid excessive direct map fragmentation, add ability to allocate
> > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > allocated pages from the direct map and use a cache of the unmapped pages.
> >
> > This cache is replenished with higher order pages with preference for
> > PMD_SIZE pages when possible so that there will be fewer splits of large
> > pages in the direct map.
> >
> > The cache is implemented as a buddy allocator, so it can serve high order
> > allocations of unmapped pages.
>
> Hello,
>
> To me it seems unnecessary to increase pageblock bitmap size just to
> distinguish if it is allocated with __GFP_UNMAPPED.
>
> I think this can be implemented as an independent cache on top of
> buddy allocator, and introducing new API for unmapped page
> allocation and freeing?
Yes, it could. But with such an API we'd also need to take care of
mapping/unmapping these pages for the modules use case, which is not that
difficult, but still more complex than a flag to vmalloc().
As for secretmem, freeing of secretmem pages happens in the page cache, so
changing that to use, say, unmapped_free() would be quite intrusive.
Another reason to mark the entire pageblock as unmapped is the possibility
of stealing pages from that pageblock into the unmapped cache when they
become free. When we allocate pages to replenish the unmapped cache, we
split the large mapping in the direct map, so even the pages in the same
pageblock that do not get into the cache are still mapped at PTE level.
When they are freed, they will be put into the cache, and so fewer
explicit cache refills that split the large mappings will be required.
> Thanks,
> Hyeonggon
--
Sincerely yours,
Mike.
On Thu, 2023-03-09 at 16:39 +0200, Mike Rapoport wrote:
> On Thu, Mar 09, 2023 at 01:56:37AM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > > +
> > > +static inline void __free_one_page(struct page *page, unsigned
> > > int
> > > order,
> > > + bool cache_refill)
> > > +{
> > > + unsigned long pfn = page_to_pfn(page);
> > > + unsigned long buddy_pfn;
> > > + unsigned long combined_pfn;
> > > + struct page *buddy;
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&free_area->lock, flags);
> > > +
> > > + if (cache_refill) {
> > > + set_pageblock_unmapped(page);
> > > + free_area[order].nr_cached++;
> > > + }
> > > +
> > > + while (order < MAX_ORDER - 1) {
> > > + buddy = find_unmapped_buddy_page_pfn(page, pfn,
> > > order,
> > > + &buddy_pfn);
> > > + if (!buddy)
> > > + break;
> > > +
> > > + del_page_from_free_list(buddy, order);
> > > + combined_pfn = buddy_pfn & pfn;
> > > + page = page + (combined_pfn - pfn);
> > > + pfn = combined_pfn;
> > > + order++;
> > > + }
> > > +
> > > + set_unmapped_order(page, order);
> > > + add_to_free_list(page, order);
> > > + spin_unlock_irqrestore(&free_area->lock, flags);
> > > +}
> > > +
> >
> > The page has to be zeroed before it goes back on the list, right? I
> > didn't see it.
>
> You are right, I missed it.
The text_poke() family is probably the way on x86, but I don't think
there is a cross-arch way that doesn't involve kernel shootdowns...
On Wed, Mar 08, 2023 at 11:41:01AM +0200, Mike Rapoport wrote:
> The last two patches in the series demonstrate how __GFP_UNMAPPED can be
> used in two in-tree use cases.
dma_alloc_attrs(DMA_ATTR_NO_KERNEL_MAPPING) would be another easy one.
On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> When set_memory or set_direct_map APIs used to change attribute or
> permissions for chunks of several pages, the large PMD that maps these
> pages in the direct map must be split. Fragmenting the direct map in such
> manner causes TLB pressure and, eventually, performance degradation.
>
> To avoid excessive direct map fragmentation, add ability to allocate
> "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> allocated pages from the direct map and use a cache of the unmapped pages.
>
> This cache is replenished with higher order pages with preference for
> PMD_SIZE pages when possible so that there will be fewer splits of large
> pages in the direct map.
>
> The cache is implemented as a buddy allocator, so it can serve high order
> allocations of unmapped pages.
Why do we need a dedicated gfp flag for all this when a dedicated
allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
--
Michal Hocko
SUSE Labs
On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <[email protected]>
> >
> > When set_memory or set_direct_map APIs used to change attribute or
> > permissions for chunks of several pages, the large PMD that maps these
> > pages in the direct map must be split. Fragmenting the direct map in such
> > manner causes TLB pressure and, eventually, performance degradation.
> >
> > To avoid excessive direct map fragmentation, add ability to allocate
> > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > allocated pages from the direct map and use a cache of the unmapped pages.
> >
> > This cache is replenished with higher order pages with preference for
> > PMD_SIZE pages when possible so that there will be fewer splits of large
> > pages in the direct map.
> >
> > The cache is implemented as a buddy allocator, so it can serve high order
> > allocations of unmapped pages.
>
> Why do we need a dedicated gfp flag for all this when a dedicated
> allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
Using unmapped_pages_{alloc,free} adds complexity to the users, which IMO
outweighs the cost of a dedicated gfp flag.
For modules we'd have to make x86::module_{alloc,free}() take care of
mapping and unmapping the allocated pages in the modules virtual address
range. This also might become relevant for other architectures in the
future, and then we'll have several complex module_alloc()s.
And for secretmem while using unmapped_pages_alloc() is easy, the free path
becomes really complex because actual page freeing for fd-based memory is
deeply buried in the page cache code.
My gut feeling is that for PKS using a gfp flag would save a lot of hassle
as well.
I also think that some of the core buddy allocator code might be reused,
and unmapped_pages_{alloc,free} could be statics in mm/page_alloc.c, not
exposed at all. For now I've just copied the free list helpers to a
separate file out of laziness.
> --
> Michal Hocko
> SUSE Labs
--
Sincerely yours,
Mike.
On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > > From: "Mike Rapoport (IBM)" <[email protected]>
> > >
> > > When set_memory or set_direct_map APIs used to change attribute or
> > > permissions for chunks of several pages, the large PMD that maps these
> > > pages in the direct map must be split. Fragmenting the direct map in such
> > > manner causes TLB pressure and, eventually, performance degradation.
> > >
> > > To avoid excessive direct map fragmentation, add ability to allocate
> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > allocated pages from the direct map and use a cache of the unmapped pages.
> > >
> > > This cache is replenished with higher order pages with preference for
> > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > pages in the direct map.
> > >
> > > The cache is implemented as a buddy allocator, so it can serve high order
> > > allocations of unmapped pages.
> >
> > Why do we need a dedicated gfp flag for all this when a dedicated
> > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
>
> Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> outweighs the cost of a dedicated gfp flag.
Aren't those users rare and very special anyway?
> For modules we'd have to make x86::module_{alloc,free}() take care of
> mapping and unmapping the allocated pages in the modules virtual address
> range. This also might become relevant for another architectures in future
> and than we'll have several complex module_alloc()s.
The module_alloc use is lacking any justification. More context would be
more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
likely needs more explanation as well.
> And for secretmem while using unmapped_pages_alloc() is easy, the free path
> becomes really complex because actual page freeing for fd-based memory is
> deeply buried in the page cache code.
Why is that a problem? You already hook into the page freeing path and
special case unmapped memory.
> My gut feeling is that for PKS using a gfp flag would save a lot of hassle
> as well.
Well, my take on this is that this is not generic page allocator
functionality. It is clearly an allocator on top of the page allocator.
In general, gfp flags are scarce and the convenience argument usually
fires back later on in hard-to-predict ways. So I've learned to be
careful here. I am not saying this is a no-go, but right now I do not
see any actual advantage. The vmalloc use case could be interesting in
that regard, but it is not really clear to me whether this is a good
idea in the first place.
--
Michal Hocko
SUSE Labs
(adding Mel)
On Wed, Mar 08, 2023 at 11:41:01AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> Hi,
>
> This is a third attempt to make page allocator aware of the direct map
> layout and allow grouping of the pages that must be unmapped from
> the direct map.
>
> This a new implementation of __GFP_UNMAPPED, kinda a follow up for this set:
>
> https://lore.kernel.org/all/[email protected]
>
> but instead of using a migrate type to cache the unmapped pages, the
> current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> allocations.
>
> The last two patches in the series demonstrate how __GFP_UNMAPPED can be
> used in two in-tree use cases.
>
> First one is to switch secretmem to use the new mechanism, which is
> straight forward optimization.
>
> The second use-case is to enable __GFP_UNMAPPED in x86::module_alloc()
> that is essentially used as a method to allocate code pages and thus
> requires permission changes for basic pages in the direct map.
>
> This set is x86 specific at the moment because other architectures either
> do not support set_memory APIs that split the direct^w linear map (e.g.
> PowerPC) or only enable set_memory APIs when the linear map uses basic page
> size (like arm64).
>
> The patches are only lightly tested.
>
> == Motivation ==
>
> There are use-cases that need to remove pages from the direct map or at
> least map them with 4K granularity. Whenever this is done e.g. with
> set_memory/set_direct_map APIs, the PUD and PMD sized mappings in the
> direct map are split into smaller pages.
>
> To reduce the performance hit caused by the fragmentation of the direct map
> it makes sense to group and/or cache the pages removed from the direct
> map so that the split large pages won't be all over the place.
>
> There were RFCs for grouped page allocations for vmalloc permissions [1]
> and for using PKS to protect page tables [2] as well as an attempt to use a
> pool of large pages in secretmtm [3], but these suggestions address each
> use case separately, while having a common mechanism at the core mm level
> could be used by all use cases.
>
> == Implementation overview ==
>
> The pages that need to be removed from the direct map are grouped in a
> dedicated cache. When there is a page allocation request with
> __GFP_UNMAPPED set, it is redirected from __alloc_pages() to that cache
> using a new unmapped_alloc() function.
>
> The cache is implemented as a buddy allocator and it can handle high order
> requests.
>
> The cache starts empty and whenever it does not have enough pages to
> satisfy an allocation request the cache attempts to allocate PMD_SIZE page
> to replenish the cache. If PMD_SIZE page cannot be allocated, the cache is
> replenished with a page of the highest order available. That page is
> removed from the direct map and added to the local buddy allocator.
>
> There is also a shrinker that releases pages from the unmapped cache when
> there is memory pressure in the system. When the shrinker releases a page
> it is mapped back into the direct map.
>
> [1] https://lore.kernel.org/lkml/[email protected]
> [2] https://lore.kernel.org/lkml/[email protected]
> [3] https://lore.kernel.org/lkml/[email protected]
>
> Mike Rapoport (IBM) (5):
> mm: introduce __GFP_UNMAPPED and unmapped_alloc()
> mm/unmapped_alloc: add debugfs file similar to /proc/pagetypeinfo
> mm/unmapped_alloc: add shrinker
> EXPERIMENTAL: x86: use __GFP_UNMAPPED for module_alloc()
> EXPERIMENTAL: mm/secretmem: use __GFP_UNMAPPED
>
> arch/x86/Kconfig | 3 +
> arch/x86/kernel/module.c | 2 +-
> include/linux/gfp_types.h | 11 +-
> include/linux/page-flags.h | 6 +
> include/linux/pageblock-flags.h | 28 +++
> include/trace/events/mmflags.h | 10 +-
> mm/Kconfig | 4 +
> mm/Makefile | 1 +
> mm/internal.h | 24 +++
> mm/page_alloc.c | 39 +++-
> mm/secretmem.c | 26 +--
> mm/unmapped-alloc.c | 334 ++++++++++++++++++++++++++++++++
> mm/vmalloc.c | 2 +-
> 13 files changed, 459 insertions(+), 31 deletions(-)
> create mode 100644 mm/unmapped-alloc.c
>
>
> base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
> --
> 2.35.1
>
--
Sincerely yours,
Mike.
On 3/27/23 15:43, Michal Hocko wrote:
> On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
>> On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
>> > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
>> > > From: "Mike Rapoport (IBM)" <[email protected]>
>> > >
>> > > When set_memory or set_direct_map APIs used to change attribute or
>> > > permissions for chunks of several pages, the large PMD that maps these
>> > > pages in the direct map must be split. Fragmenting the direct map in such
>> > > manner causes TLB pressure and, eventually, performance degradation.
>> > >
>> > > To avoid excessive direct map fragmentation, add ability to allocate
>> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
>> > > allocated pages from the direct map and use a cache of the unmapped pages.
>> > >
>> > > This cache is replenished with higher order pages with preference for
>> > > PMD_SIZE pages when possible so that there will be fewer splits of large
>> > > pages in the direct map.
>> > >
>> > > The cache is implemented as a buddy allocator, so it can serve high order
>> > > allocations of unmapped pages.
>> >
>> > Why do we need a dedicated gfp flag for all this when a dedicated
>> > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
>>
>> Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
>> outweighs the cost of a dedicated gfp flag.
>
> Aren't those users rare and very special anyway?
I think it's mostly about the freeing that can happen from a generic
context not aware of the special allocation, so it's not about how rare
it is, but how complex it would be to determine exhaustively those
contexts and do something in them.
>> For modules we'd have to make x86::module_{alloc,free}() take care of
>> mapping and unmapping the allocated pages in the modules virtual address
>> range. This also might become relevant for another architectures in future
>> and than we'll have several complex module_alloc()s.
>
> The module_alloc use is lacking any justification. More context would be
> more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> likely needs more explanation as well.
>
>> And for secretmem while using unmapped_pages_alloc() is easy, the free path
>> becomes really complex because actual page freeing for fd-based memory is
>> deeply buried in the page cache code.
>
> Why is that a problem? You already hook into the page freeing path and
> special case unmapped memory.
But the proposal of unmapped_pages_free() would suggest this would no longer
be the case?
But maybe we could, as a compromise, provide unmapped_pages_alloc() to get
rid of the new __GFP flag, provide unmapped_pages_free() to annotate places
that are known to free unmapped memory explicitly, but the generic page
freeing would also keep the hook?
>> My gut feeling is that for PKS using a gfp flag would save a lot of hassle
>> as well.
>
> Well, my take on this is that this is not a generic page allocator
> functionality. It is clearly an allocator on top of the page allocator.
> In general gfp flags are scarce and convenience argument usually fires
> back later on in hard to predict ways. So I've learned to be careful
> here. I am not saying this is a no-go but right now I do not see any
> acutal advantage. The vmalloc usecase could be interesting in that
> regards but it is not really clear to me whether this is a good idea in
> the first place.
>
On Mon 27-03-23 16:31:45, Vlastimil Babka wrote:
> On 3/27/23 15:43, Michal Hocko wrote:
> > On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> >> On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> >> > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> >> > > From: "Mike Rapoport (IBM)" <[email protected]>
> >> > >
> >> > > When set_memory or set_direct_map APIs used to change attribute or
> >> > > permissions for chunks of several pages, the large PMD that maps these
> >> > > pages in the direct map must be split. Fragmenting the direct map in such
> >> > > manner causes TLB pressure and, eventually, performance degradation.
> >> > >
> >> > > To avoid excessive direct map fragmentation, add ability to allocate
> >> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> >> > > allocated pages from the direct map and use a cache of the unmapped pages.
> >> > >
> >> > > This cache is replenished with higher order pages with preference for
> >> > > PMD_SIZE pages when possible so that there will be fewer splits of large
> >> > > pages in the direct map.
> >> > >
> >> > > The cache is implemented as a buddy allocator, so it can serve high order
> >> > > allocations of unmapped pages.
> >> >
> >> > Why do we need a dedicated gfp flag for all this when a dedicated
> >> > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> >>
> >> Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> >> outweighs the cost of a dedicated gfp flag.
> >
> > Aren't those users rare and very special anyway?
>
> I think it's mostly about the freeing that can happen from a generic context
> not aware of the special allocation, so it's not about how rare it is, but
> how complex would be to determine exhaustively those contexts and do
> something in them.
Yes, I can see a challenge with put_page users but that is not really
related to the gfp flag as those are only relevant for the allocation
context.
> >> For modules we'd have to make x86::module_{alloc,free}() take care of
> >> mapping and unmapping the allocated pages in the modules virtual address
> >> range. This also might become relevant for another architectures in future
> >> and than we'll have several complex module_alloc()s.
> >
> > The module_alloc use is lacking any justification. More context would be
> > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> > likely needs more explanation as well.
> >
> >> And for secretmem while using unmapped_pages_alloc() is easy, the free path
> >> becomes really complex because actual page freeing for fd-based memory is
> >> deeply buried in the page cache code.
> >
> > Why is that a problem? You already hook into the page freeing path and
> > special case unmapped memory.
>
> But the proposal of unmapped_pages_free() would suggest this would no longer
> be the case?
I can see a check in the freeing path.
> But maybe we could, as a compromise, provide unmapped_pages_alloc() to get
> rid of the new __GFP flag, provide unmapped_pages_free() to annotate places
> that are known to free unmapped memory explicitly, but the generic page
> freeing would also keep the hook?
Honestly, I do not see a different option if those pages are to be
reference counted, unless they can use a destructor concept like hugetlb
pages. At least the secretmem use case cannot, AFAICS.
--
Michal Hocko
SUSE Labs
On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote:
> On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > > > From: "Mike Rapoport (IBM)" <[email protected]>
> > > >
> > > > When set_memory or set_direct_map APIs used to change attribute or
> > > > permissions for chunks of several pages, the large PMD that maps these
> > > > pages in the direct map must be split. Fragmenting the direct map in such
> > > > manner causes TLB pressure and, eventually, performance degradation.
> > > >
> > > > To avoid excessive direct map fragmentation, add ability to allocate
> > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > > allocated pages from the direct map and use a cache of the unmapped pages.
> > > >
> > > > This cache is replenished with higher order pages with preference for
> > > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > > pages in the direct map.
> > > >
> > > > The cache is implemented as a buddy allocator, so it can serve high order
> > > > allocations of unmapped pages.
> > >
> > > Why do we need a dedicated gfp flag for all this when a dedicated
> > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> >
> > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> > outweighs the cost of a dedicated gfp flag.
>
> Aren't those users rare and very special anyway?
>
> > For modules we'd have to make x86::module_{alloc,free}() take care of
> > mapping and unmapping the allocated pages in the modules virtual address
> > range. This also might become relevant for another architectures in future
> > and than we'll have several complex module_alloc()s.
>
> The module_alloc use is lacking any justification. More context would be
> more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> likely needs more explanation as well.
Right now module_alloc() boils down to vmalloc() with the virtual range
limited to the modules area. The allocated chunk contains both code and
data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated
with module_alloc() are remapped with different permissions both in the
vmalloc address space and in the direct map. The change of permissions
for small ranges causes splits of large pages in the direct map.
If we were to use unmapped_pages_alloc() in module_alloc(), we would have
to implement in module_alloc() the part of vmalloc() that reserves the
virtual addresses and maps the allocated memory there.
> > And for secretmem while using unmapped_pages_alloc() is easy, the free path
> > becomes really complex because actual page freeing for fd-based memory is
> > deeply buried in the page cache code.
>
> Why is that a problem? You already hook into the page freeing path and
> special case unmapped memory.
I didn't say there is a problem with unmapped_pages_alloc() in secretmem;
I said there is a problem with unmapped_pages_free(), and hence the
special casing of unmapped memory in the freeing path.
> > My gut feeling is that for PKS using a gfp flag would save a lot of hassle
> > as well.
>
> Well, my take on this is that this is not a generic page allocator
> functionality. It is clearly an allocator on top of the page allocator.
> In general gfp flags are scarce and convenience argument usually fires
> back later on in hard to predict ways. So I've learned to be careful
> here. I am not saying this is a no-go but right now I do not see any
> acutal advantage. The vmalloc usecase could be interesting in that
> regards but it is not really clear to me whether this is a good idea in
> the first place.
I don't see the usage of a gfp flag as a convenience argument; rather, it
feels to me that a gfp flag will cause less maintenance burden. Of course
this is subjective.
And although this is an allocator on top of the page allocator, it is
still very tightly coupled with the core page allocator. I still think
that using a migrate type for this would have been more elegant, but I
realize that a migrate type would have more impact on the allocation path.
> --
> Michal Hocko
> SUSE Labs
--
Sincerely yours,
Mike.
On Tue 28-03-23 09:25:35, Mike Rapoport wrote:
> On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote:
> > On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> > > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> > > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > > > > From: "Mike Rapoport (IBM)" <[email protected]>
> > > > >
> > > > > When set_memory or set_direct_map APIs used to change attribute or
> > > > > permissions for chunks of several pages, the large PMD that maps these
> > > > > pages in the direct map must be split. Fragmenting the direct map in such
> > > > > manner causes TLB pressure and, eventually, performance degradation.
> > > > >
> > > > > To avoid excessive direct map fragmentation, add ability to allocate
> > > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > > > allocated pages from the direct map and use a cache of the unmapped pages.
> > > > >
> > > > > This cache is replenished with higher order pages with preference for
> > > > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > > > pages in the direct map.
> > > > >
> > > > > The cache is implemented as a buddy allocator, so it can serve high order
> > > > > allocations of unmapped pages.
> > > >
> > > > Why do we need a dedicated gfp flag for all this when a dedicated
> > > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> > >
> > > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> > > outweighs the cost of a dedicated gfp flag.
> >
> > Aren't those users rare and very special anyway?
> >
> > > For modules we'd have to make x86::module_{alloc,free}() take care of
> > > mapping and unmapping the allocated pages in the modules virtual address
> > > range. This also might become relevant for another architectures in future
> > > and than we'll have several complex module_alloc()s.
> >
> > The module_alloc use is lacking any justification. More context would be
> > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> > likely needs more explanation as well.
>
> Right now module_alloc() boils down to vmalloc() with the virtual range
> limited to the modules area. The allocated chunk contains both code and
> data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated
> with module_alloc() remapped with different permissions both in vmalloc
> address space and in the direct map. The change of permissions for small
> ranges causes splits of large pages in the direct map.
OK, so you want to reduce that direct map fragmentation? Is that a real
problem? My impression is that modules are mostly a static thing. BPF
might be a different thing though. I have a recollection that the BPF
guys were dealing with direct map fragmentation as well.
> If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> to implement the part of vmalloc() that reserves the virtual addresses and
> maps the allocated memory there in module_alloc().
Another option would be to provide an allocator for the backing pages to
vmalloc. But I do agree that a gfp flag is a less laborious way to
achieve the same. So the primary question really is whether we need
vmalloc support for unmapped memory at all.
--
Michal Hocko
SUSE Labs
On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> On Tue 28-03-23 09:25:35, Mike Rapoport wrote:
> > On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote:
> > > On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> > > > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> > > > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > > > > > From: "Mike Rapoport (IBM)" <[email protected]>
> > > > > >
> > > > > > When set_memory or set_direct_map APIs used to change attribute or
> > > > > > permissions for chunks of several pages, the large PMD that maps these
> > > > > > pages in the direct map must be split. Fragmenting the direct map in such
> > > > > > manner causes TLB pressure and, eventually, performance degradation.
> > > > > >
> > > > > > To avoid excessive direct map fragmentation, add ability to allocate
> > > > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > > > > allocated pages from the direct map and use a cache of the unmapped pages.
> > > > > >
> > > > > > This cache is replenished with higher order pages with preference for
> > > > > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > > > > pages in the direct map.
> > > > > >
> > > > > > The cache is implemented as a buddy allocator, so it can serve high order
> > > > > > allocations of unmapped pages.
> > > > >
> > > > > Why do we need a dedicated gfp flag for all this when a dedicated
> > > > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> > > >
> > > > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> > > > outweighs the cost of a dedicated gfp flag.
> > >
> > > Aren't those users rare and very special anyway?
> > >
> > > > For modules we'd have to make x86::module_{alloc,free}() take care of
> > > > mapping and unmapping the allocated pages in the modules virtual address
> > > > range. This also might become relevant for another architectures in future
> > > > and than we'll have several complex module_alloc()s.
> > >
> > > The module_alloc use is lacking any justification. More context would be
> > > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> > > likely needs more explanation as well.
> >
> > Right now module_alloc() boils down to vmalloc() with the virtual range
> > limited to the modules area. The allocated chunk contains both code and
> > data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated
> > with module_alloc() remapped with different permissions both in vmalloc
> > address space and in the direct map. The change of permissions for small
> > ranges causes splits of large pages in the direct map.
>
> OK, so you want to reduce that direct map fragmentation?
Yes.
> Is that a real problem?
A while ago Intel folks published a report [1] that showed better performance
with large pages in the direct map for the majority of benchmarks.
> My impression is that modules are mostly static thing. BPF
> might be a different thing though. I have a recollection that BPF guys
> were dealing with direct map fragmention as well.
Modules are indeed static, but module_alloc() is used by anything that
allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
mentioned that having code in 2M pages reduces iTLB pressure [2], but
that's not only about avoiding the splits in the direct map but also about
using large mappings in the modules address space.
The BPF folks suggested an allocator for executable memory [3] mainly because
they've seen performance improvements of 0.6% - 0.9% in their setups [4].
> > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > to implement the part of vmalloc() that reserves the virtual addresses and
> > maps the allocated memory there in module_alloc().
>
> Another option would be to provide an allocator for the backing pages to
> vmalloc. But I do agree that a gfp flag is a less laborous way to
> achieve the same. So the primary question really is whether we really
> need vmalloc support for unmapped memory.
I'm not sure I follow here. module_alloc() is essentially an alias to
vmalloc(), so to reduce direct map fragmentation caused by code allocations
the most sensible way IMO is to support unmapped memory in vmalloc().
I also think vmalloc with unmapped pages can provide backing pages for the
execmem_alloc() Song proposed.
> --
> Michal Hocko
> SUSE Labs
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/bpf/[email protected]/
--
Sincerely yours,
Mike.
On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
[...]
> > OK, so you want to reduce that direct map fragmentation?
>
> Yes.
>
> > Is that a real problem?
>
> A while ago Intel folks published report [1] that showed better performance
> with large pages in the direct map for majority of benchmarks.
>
> > My impression is that modules are mostly static thing. BPF
> > might be a different thing though. I have a recollection that BPF guys
> > were dealing with direct map fragmention as well.
>
> Modules are indeed static, but module_alloc() used by anything that
> allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> mentioned that having code in 2M pages reduces iTLB pressure [2], but
> that's not only about avoiding the splits in the direct map but also about
> using large mappings in the modules address space.
>
> BPF guys suggested an allocator for executable memory [3] mainly because
> they've seen performance improvement of 0.6% - 0.9% in their setups [4].
These are fair arguments and it would have been better to have them in
the RFC. Also it is not really clear to me what the actual benefit of
the unmapping is for those use cases. I do get that they want to benefit
from caching with the same permission setup, but do they need the unmapping
as well?
> > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > > to implement the part of vmalloc() that reserves the virtual addresses and
> > > maps the allocated memory there in module_alloc().
> >
> > Another option would be to provide an allocator for the backing pages to
> > vmalloc. But I do agree that a gfp flag is a less laborous way to
> > achieve the same. So the primary question really is whether we really
> > need vmalloc support for unmapped memory.
>
> I'm not sure I follow here. module_alloc() is essentially an alias to
> vmalloc(), so to reduce direct map fragmentation caused by code allocations
> the most sensible way IMO is to support unmapped memory in vmalloc().
What I meant to say is that vmalloc currently uses the page
allocator (resp. the bulk allocator) for the backing storage. I can imagine
an extension that replaces this default allocator with something else (e.g. a
given allocation function). This would be more generic and it would
allow different use cases (e.g. benefiting from the caching without doing the
actual unmapping).
> I also think vmalloc with unmmapped pages can provide backing pages for
> execmem_alloc() Song proposed.
Why would you need to have execmem_alloc have its memory virtually
mapped into vmalloc space?
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/
> [3] https://lore.kernel.org/all/[email protected]/
> [4] https://lore.kernel.org/bpf/[email protected]/
>
> --
> Sincerely yours,
> Mike.
--
Michal Hocko
SUSE Labs
Mike, please Cc linux-modules if you want modules folks' input as well ;)
On Tue, Mar 28, 2023 at 8:11 AM Mike Rapoport <[email protected]> wrote:
> On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > OK, so you want to reduce that direct map fragmentation?
>
> Yes.
>
> > Is that a real problem?
>
> A while ago Intel folks published report [1] that showed better performance
> with large pages in the direct map for majority of benchmarks.
>
> > My impression is that modules are mostly static thing. BPF
> > might be a different thing though. I have a recollection that BPF guys
> > were dealing with direct map fragmention as well.
>
> Modules are indeed static, but module_alloc() used by anything that
> allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> mentioned that having code in 2M pages reduces iTLB pressure [2], but
> that's not only about avoiding the splits in the direct map but also about
> using large mappings in the modules address space.
It is easily overlooked why such things create direct map fragmentation --
it's because of the special permission changes involved: module_alloc()
targets memory which can somehow be made executable.
> BPF guys suggested an allocator for executable memory [3] mainly because
> they've seen performance improvement of 0.6% - 0.9% in their setups [4].
The performance metrics came from an opaque, synthetic
environment, and our goal is to showcase real value with reproducible
performance benchmarks. Now that Song is convinced that modules need
to be a first class citizen in order to generalize a special allocator,
we may sooner rather than later have real, reproducible performance data
to show the benefit of such a special allocator. One of the big
differences with eBPF programs is that modules *can* be rather large
in size. What is the average size of modules? Well, let's take a look:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | wc -l
9173
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
175.1
An average of 175 KiB per module is pretty large.
Ignoring the top 5 piggies:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | head -5
58315248 - ./drivers/gpu/drm/amd/amdgpu/amdgpu.ko
29605592 - ./drivers/gpu/drm/i915/i915.ko
18591256 - ./drivers/gpu/drm/nouveau/nouveau.ko
16867048 - ./fs/xfs/xfs.ko
14209440 - ./fs/btrfs/btrfs.ko
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
160.54
Ignoring the top 10 largest modules:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-10)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
154.656
Ignoring the top 20 piggies:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-20)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
146.384
Ignoring the top 100 bloated modules:
mcgrof@bigtwin mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-100)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
117.97
Ignoring the top 1000 bloated modules:
mcgrof@bigtwin mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-1000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
64.4686
Ignoring the top 2000 bloated modules:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-2000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
48.7869
Ignoring the top 3000:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-3000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
39.6037
Ignoring the top 4000:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-4000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
33.106
Ignoring the top 5000:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko | xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-5000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
28.0925
But at least on the driver front we know we won't always have *all* GPU
drivers loaded, but just *one*. So the math needs to consider the sizes of
the modules actually in use.
Let's see what the average module size is on a big system:
mcgrof@bigtwin ~ $ lsmod | grep -v Module | awk '{print $1}' | xargs sudo modinfo --field filename | xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
313.432
On a small desktop:
mcgrof@desktop ~ $ lsmod | grep -v Module | awk '{print $1}' | xargs sudo modinfo --field filename | xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
292.786
For each finit_module() we actually do vmalloc allocations twice: once for
the kernel_read_*() which lets us read the module file from userspace, and
then a second set of allocations where we copy the data over, each step
using either vzalloc() or module_alloc() (vmalloc() with special
permissions). This is more easily reflected and visible now on
linux-next with Song's new module_memory_alloc().
However -- for the context of *this* effort, we're only talking about
the executable sections, the areas we'd use module_alloc() for.
On a big system:
mcgrof@bigtwin ~ $ lsmod | grep -v Module | awk '{print $1}' | xargs sudo modinfo --field filename | xargs size --radix=10 | awk '{print $1}' | grep -v text | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
88.7964
On a small desktop:
mcgrof@desktop ~ $ lsmod | grep -v Module | awk '{print $1}' | xargs sudo modinfo --field filename | xargs size --radix=10 | awk '{print $1}' | grep -v text | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
92.1473
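Taking the ~88.8 KiB average .text size above at face value, a rough back-of-the-envelope (illustrative only; it ignores section alignment and per-module padding) suggests how many modules' text could share a single 2M mapping:

```python
# Back-of-the-envelope: how many average-sized module .text sections
# would fit in one 2M mapping (average taken from the lsmod runs above).
avg_text_kib = 88.8            # big-system average .text size, KiB
pmd_kib = 2 * 1024             # 2 MiB expressed in KiB
modules_per_2m = int(pmd_kib // avg_text_kib)  # ~23 modules per 2M page
```

So on the order of twenty modules' worth of text could, in principle, be packed into one large mapping instead of fragmenting twenty separate ones.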
Regardless, the amount of memory allocated is pretty significant, to
the point that a 400 CPU system can easily run out of vmap space (yes, the
kernel does some stupid stuff, which we're correcting over time):
https://lkml.kernel.org/r/[email protected]
There are also some recent efforts to help with this pressure on vmap
space:
https://lkml.kernel.org/r/[email protected]
> > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > > to implement the part of vmalloc() that reserves the virtual addresses and
> > > maps the allocated memory there in module_alloc().
> >
> > Another option would be to provide an allocator for the backing pages to
> > vmalloc. But I do agree that a gfp flag is a less laborous way to
> > achieve the same. So the primary question really is whether we really
> > need vmalloc support for unmapped memory.
>
> I'm not sure I follow here. module_alloc() is essentially an alias to
> vmalloc(), so to reduce direct map fragmentation caused by code allocations
> the most sensible way IMO is to support unmapped memory in vmalloc().
The big win would also be to use huge pages from a performance point
of view, specially reducing iTLB pressure, however that is *slightly*
orthogonal to your goal of reducing direct map fragmentation. Still,
we should not ignore that strategy, as it is a special use case which
*could* be leveraged in other ways too. And so I'd prefer to see that
over just a flag to hide those allocations. It would not only reduce
that fragmentation, but also reduce iTLB pressure, which I think *is* what
experimentation revealed to show more gains.
> I also think vmalloc with unmmapped pages can provide backing pages for
> execmem_alloc() Song proposed.
Only if you consider huge pages. I don't see that here.
Luis
On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote:
> differences with eBPF programs is that modules *can* be rather large
> in size. What is the average size of modules? Well let's take a look:
>
> mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> -name \*.ko| wc -l
> 9173
ummm ... wc -c, surely?
> mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
> -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
> 160.54
... which invalidates all of these.
On Tue, Mar 28, 2023 at 06:37:13PM +0100, Matthew Wilcox wrote:
> On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote:
> > differences with eBPF programs is that modules *can* be rather large
> > in size. What is the average size of modules? Well let's take a look:
> >
> > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> > -name \*.ko| wc -l
> > 9173
>
> ummm ... wc -c, surely?
That's the number of allmodconfig modules found.
mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko | head -2
./arch/x86/crypto/twofish-x86_64.ko
./arch/x86/crypto/serpent-avx2.ko
mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko | head -2 | wc -l
2
mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko | head -2 | wc -c
70
wc -c would give a lot more; wc -l gives me the module count.
> > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> > -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
> > -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
> > 160.54
>
> ... which invalidates all of these.
Not sure? But regardless, the *.text* lookup is what we care about, and
that came later.
Luis
On Tue, Mar 28, 2023 at 10:52:08AM -0700, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 06:37:13PM +0100, Matthew Wilcox wrote:
> > On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote:
> > > differences with eBPF programs is that modules *can* be rather large
> > > in size. What is the average size of modules? Well let's take a look:
> > >
> > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> > > -name \*.ko| wc -l
> > > 9173
> >
> > ummm ... wc -c, surely?
>
> That's the number of allmodconfig modules found.
>
> mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2
> ./arch/x86/crypto/twofish-x86_64.ko
> ./arch/x86/crypto/serpent-avx2.ko
> mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 |
> wc -l
> 2
> mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 |
> wc -c
> 70
>
> wc -c would give a lot more. wc -l gives me the module count.
>
> > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> > > -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
> > > -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
> > > 160.54
> >
> > ... which invalidates all of these.
>
> Not sure ? But regardless the *.text* lookup is what we care for though
> which was later.
Which gets me thinking it'd be super nice if kmod tools supported
querying this for us, then no fat finger could mess up the math:
For all modules available:
* Average module size
* Average .text module size
For only modules loaded:
* Average module size
* Average .text module size
Luis
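A sketch of what such a kmod helper could compute, assuming it is fed the per-module sizes that modinfo and size already report (the interface and the .text numbers below are made up for illustration):

```python
# Hypothetical sketch of the helper Luis wishes kmod provided: given
# (module, total_size_bytes, text_size_bytes) tuples, compute the
# averages in KiB so no fat-fingered awk is needed.
def module_size_stats(mods):
    n = len(mods)
    avg_total = sum(total for _, total, _ in mods) / n / 1024
    avg_text = sum(text for _, _, text in mods) / n / 1024
    return avg_total, avg_text

# Total sizes from the allmodconfig listing above; .text sizes invented:
mods = [("xfs.ko", 16867048, 4_500_000),
        ("btrfs.ko", 14209440, 3_900_000)]
avg_total, avg_text = module_size_stats(mods)
```

The real tool would run this over all available modules (or only loaded ones, via lsmod), matching the four statistics listed above.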
On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote:
> On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> [...]
> > > OK, so you want to reduce that direct map fragmentation?
> >
> > Yes.
> >
> > > Is that a real problem?
> >
> > A while ago Intel folks published report [1] that showed better performance
> > with large pages in the direct map for majority of benchmarks.
> >
> > > My impression is that modules are mostly static thing. BPF
> > > might be a different thing though. I have a recollection that BPF guys
> > > were dealing with direct map fragmention as well.
> >
> > Modules are indeed static, but module_alloc() used by anything that
> > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> > mentioned that having code in 2M pages reduces iTLB pressure [2], but
> > that's not only about avoiding the splits in the direct map but also about
> > using large mappings in the modules address space.
> >
> > BPF guys suggested an allocator for executable memory [3] mainly because
> > they've seen performance improvement of 0.6% - 0.9% in their setups [4].
>
> These are fair arguments and it would have been better to have them in
> the RFC. Also it is not really clear to me what is the actual benefit of
> the unmapping for those usecases. I do get they want to benefit from
> caching on the same permission setup but do they need unmapping as well?
The pages allocated with module_alloc() get different permissions depending
on whether they belong to text, rodata, data etc. The permissions are
updated both in the vmalloc address space and in the direct map. The
updates to the direct map cause splits of the large pages. If we cache
large pages as unmapped, we take the entire 2M page out of the direct map
and then, if/when it becomes free, it can be returned to the direct map as
a 2M page.
Generally, the unmapped allocations are intended for use-cases that anyway
map the memory somewhere other than the direct map and need to modify the
direct mappings of that memory, be it module_alloc(), secretmem, PKS page
tables or maybe even some of the encrypted VM memory.
> > > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > > > to implement the part of vmalloc() that reserves the virtual addresses and
> > > > maps the allocated memory there in module_alloc().
> > >
> > > Another option would be to provide an allocator for the backing pages to
> > > vmalloc. But I do agree that a gfp flag is a less laborous way to
> > > achieve the same. So the primary question really is whether we really
> > > need vmalloc support for unmapped memory.
> >
> > I'm not sure I follow here. module_alloc() is essentially an alias to
> > vmalloc(), so to reduce direct map fragmentation caused by code allocations
> > the most sensible way IMO is to support unmapped memory in vmalloc().
>
> What I meant to say is that vmalloc is currently using the page
> allocator (resp bulk allocator) for the backing storage. I can imagine
> an extension to replace this default allocator by something else (e.g. a
> given allocation function). This would be more generic and it would
> allow different usecases (e.g. benefit from caching withtout doing the
> actual unmapping).
The whole point of the unmapped cache is to allow non-default permissions
in the direct map without splitting large pages there. So a cache that
unmaps large pages in the direct map and then hands out subpages of those
pages seems like the most logical thing to do. With the current use cases
the callers anyway map these pages at different virtual addresses, i.e. the
page cache or vmalloc.
> > I also think vmalloc with unmmapped pages can provide backing pages for
> > execmem_alloc() Song proposed.
>
> Why would you need to have execmem_alloc have its memory virtually
> mapped into vmalloc space?
Currently all the memory allocated for code is managed in a subset of the
vmalloc() space. The intention of execmem_alloc() was to replace
module_alloc() for the code pages, so it's natural that it will use the
same virtual ranges.
But anyway, execmem_alloc() is a long shot, as it also requires quite a
bit of refactoring of module loading.
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> > [2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/
> > [3] https://lore.kernel.org/all/[email protected]/
> > [4] https://lore.kernel.org/bpf/[email protected]/
> >
> > --
> > Sincerely yours,
> > Mike.
>
> --
> Michal Hocko
> SUSE Labs
--
Sincerely yours,
Mike.
On Wed 29-03-23 10:28:02, Mike Rapoport wrote:
> On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote:
> > On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > [...]
> > > > OK, so you want to reduce that direct map fragmentation?
> > >
> > > Yes.
> > >
> > > > Is that a real problem?
> > >
> > > A while ago Intel folks published report [1] that showed better performance
> > > with large pages in the direct map for majority of benchmarks.
> > >
> > > > My impression is that modules are mostly static thing. BPF
> > > > might be a different thing though. I have a recollection that BPF guys
> > > > were dealing with direct map fragmention as well.
> > >
> > > Modules are indeed static, but module_alloc() used by anything that
> > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> > > mentioned that having code in 2M pages reduces iTLB pressure [2], but
> > > that's not only about avoiding the splits in the direct map but also about
> > > using large mappings in the modules address space.
> > >
> > > BPF guys suggested an allocator for executable memory [3] mainly because
> > > they've seen performance improvement of 0.6% - 0.9% in their setups [4].
> >
> > These are fair arguments and it would have been better to have them in
> > the RFC. Also it is not really clear to me what is the actual benefit of
> > the unmapping for those usecases. I do get they want to benefit from
> > caching on the same permission setup but do they need unmapping as well?
>
> The pages allocated with module_alloc() get different permissions depending
> on whether they belong to text, rodata, data etc. The permissions are
> updated in both vmalloc address space and in the direct map. The updates to
> the direct map cause splits of the large pages.
That much is clear (wouldn't hurt to mention that in the changelog
though).
> If we cache large pages as
> unmapped we take out the entire 2M page from the direct map and then
> if/when it becomes free it can be returned to the direct map as a 2M page.
>
> Generally, the unmapped allocations are intended for use-cases that anyway
> map the memory elsewhere than direct map and need to modify direct mappings
> of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe
> even some of the encrypted VM memory.
I believe we are still not on the same page here. I do understand that
you want to re-use the caching capability of unmapped_pages_alloc
for module allocations as well. My question is whether module_alloc
really benefits from the fact that the memory is unmapped.
I guess you want to say that it does, because it wouldn't have to change
the permissions in the direct map, but I do not see that anywhere in the
patch... Also consider the slow path where module_alloc needs to
allocate a fresh (huge)page and unmap it. Does the overhead of the
unmapping suit the needs of all module_alloc users? The module loader is
likely not interesting as it tends to be rather static, but what about BPF
programs?
[...]
> > > I also think vmalloc with unmmapped pages can provide backing pages for
> > > execmem_alloc() Song proposed.
> >
> > Why would you need to have execmem_alloc have its memory virtually
> > mapped into vmalloc space?
>
> Currently all the memory allocated for code is managed in a subset of
> vmalloc() space. The intention of execmem_alloc() was to replace
> module_alloc() for the code pages, so it's natural that it will use the
> same virtual ranges.
Ohh, sorry, my bad. I have only now realized I misread and was thinking
of secretmem_alloc...
--
Michal Hocko
SUSE Labs
On Wed, Mar 29, 2023 at 10:13:23AM +0200, Michal Hocko wrote:
> On Wed 29-03-23 10:28:02, Mike Rapoport wrote:
> > On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote:
> > > On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> > > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > OK, so you want to reduce that direct map fragmentation?
> > > >
> > > > Yes.
> > > >
> > > > > Is that a real problem?
> > > >
> > > > A while ago Intel folks published report [1] that showed better performance
> > > > with large pages in the direct map for majority of benchmarks.
> > > >
> > > > > My impression is that modules are mostly static thing. BPF
> > > > > might be a different thing though. I have a recollection that BPF guys
> > > > > were dealing with direct map fragmention as well.
> > > >
> > > > Modules are indeed static, but module_alloc() used by anything that
> > > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> > > > mentioned that having code in 2M pages reduces iTLB pressure [2], but
> > > > that's not only about avoiding the splits in the direct map but also about
> > > > using large mappings in the modules address space.
> > > >
> > > > BPF guys suggested an allocator for executable memory [3] mainly because
> > > > they've seen performance improvement of 0.6% - 0.9% in their setups [4].
> > >
> > > These are fair arguments and it would have been better to have them in
> > > the RFC. Also it is not really clear to me what is the actual benefit of
> > > the unmapping for those usecases. I do get they want to benefit from
> > > caching on the same permission setup but do they need unmapping as well?
> >
> > The pages allocated with module_alloc() get different permissions depending
> > on whether they belong to text, rodata, data etc. The permissions are
> > updated in both vmalloc address space and in the direct map. The updates to
> > the direct map cause splits of the large pages.
>
> That much is clear (wouldn't hurt to mention that in the changelog
> though).
>
> > If we cache large pages as
> > unmapped we take out the entire 2M page from the direct map and then
> > if/when it becomes free it can be returned to the direct map as a 2M page.
> >
> > Generally, the unmapped allocations are intended for use-cases that anyway
> > map the memory elsewhere than direct map and need to modify direct mappings
> > of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe
> > even some of the encrypted VM memory.
>
> I believe we are still not on the same page here. I do understand that
> you want to re-use the caching capability of the unmapped_pages_alloc
> for modules allocations as well. My question is whether module_alloc
> really benefits from the fact that the memory is unmapped?
>
> I guess you want to say that it does because it wouldn't have to change
> the permission for the direct map but I do not see that anywhere in the
> patch...
This happens automagically outside the patch :)
Currently, to change memory permissions the modules code calls the
set_memory APIs and passes a vmalloced address to them. The set_memory
functions look up the direct map alias and update the permissions there as
well.
If the memory allocated with module_alloc() is unmapped in the direct map,
there won't be an alias for the set_memory APIs to update.
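The alias behavior described above can be modeled with a toy sketch (illustrative only, not the actual set_memory implementation; the mapping-state strings are invented):

```python
# Toy model: changing permissions on a vmalloc alias also updates the
# direct map alias, and that update is what splits a 2M mapping; if the
# page was unmapped from the direct map, there is no alias to touch.
direct_map = {0x1000: "RW(2M)"}  # phys addr -> direct map mapping state

def set_memory_rox(phys, unmapped):
    """Mimic a permission change; return how many large mappings split."""
    splits = 0
    if not unmapped and phys in direct_map:
        direct_map[phys] = "ROX(4K)"  # alias updated -> 2M mapping split
        splits = 1
    return splits

splits_mapped = set_memory_rox(0x1000, unmapped=False)   # alias split
splits_unmapped = set_memory_rox(0x2000, unmapped=True)  # nothing to do
```

In the toy model, only the page that still has a direct map alias costs a split; the unmapped page changes permissions for free, which is the "automagic" benefit.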
> Also consinder the slow path where module_alloc needs to
> allocate a fresh (huge)page and unmap it. Does the overhead of the
> unmapping suits needs of all module_alloc users? Module loader is likely
> not interesting as it tends to be rather static but what about BPF
> programs?
The overhead of unmapping pages in the direct map on the allocation path
will be offset by the reduced overhead of updating permissions in the
direct map after the allocation. Both use the same APIs, and where today
the permission update causes a split of a large page, unmapping of a large
page won't.
Of course on a loaded system unmapped_alloc() won't always be able to
allocate large pages to replenish the cache, but there will still be fewer
updates to the direct map.
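This trade-off can be sketched with a toy model of the cache (illustrative only; the real patch implements this as a buddy allocator over unmapped pages, while the sketch below just pops 4K subpages off a list):

```python
# Toy sketch of the unmapped page cache idea: unmap one 2M page up front,
# then serve 4K allocations from it, so the direct map sees one unmap
# operation instead of one split per allocation.
PTES_PER_PMD = 512  # 4K subpages per 2M page

class UnmappedCache:
    def __init__(self):
        self.free_pages = []      # cached 4K subpages, already unmapped
        self.direct_map_ops = 0   # set_direct_map-style operations

    def alloc(self):
        if not self.free_pages:
            # slow path: grab a fresh 2M page and unmap it as a whole
            self.direct_map_ops += 1
            self.free_pages = list(range(PTES_PER_PMD))
        return self.free_pages.pop()

cache = UnmappedCache()
pages = [cache.alloc() for _ in range(100)]
# 100 permission-sensitive allocations, a single direct map operation
```

Without the cache, each of those 100 allocations could trigger its own permission update (and potential split) in the direct map.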
> --
> Michal Hocko
> SUSE Labs
--
Sincerely yours,
Mike.
On Thu 30-03-23 08:13:48, Mike Rapoport wrote:
> On Wed, Mar 29, 2023 at 10:13:23AM +0200, Michal Hocko wrote:
> > On Wed 29-03-23 10:28:02, Mike Rapoport wrote:
> > > On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote:
> > > > On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> > > > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > OK, so you want to reduce that direct map fragmentation?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > Is that a real problem?
> > > > >
> > > > > A while ago Intel folks published report [1] that showed better performance
> > > > > with large pages in the direct map for majority of benchmarks.
> > > > >
> > > > > > My impression is that modules are mostly static thing. BPF
> > > > > > might be a different thing though. I have a recollection that BPF guys
> > > > > > were dealing with direct map fragmention as well.
> > > > >
> > > > > Modules are indeed static, but module_alloc() is used by anything that
> > > > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> > > > > mentioned that having code in 2M pages reduces iTLB pressure [2], but
> > > > > that's not only about avoiding the splits in the direct map but also about
> > > > > using large mappings in the modules address space.
> > > > >
> > > > > BPF guys suggested an allocator for executable memory [3] mainly because
> > > > > they've seen performance improvement of 0.6% - 0.9% in their setups [4].
> > > >
> > > > These are fair arguments and it would have been better to have them in
> > > > the RFC. Also it is not really clear to me what is the actual benefit of
> > > > the unmapping for those usecases. I do get they want to benefit from
> > > > caching on the same permission setup but do they need unmapping as well?
> > >
> > > The pages allocated with module_alloc() get different permissions depending
> > > on whether they belong to text, rodata, data etc. The permissions are
> > > updated in both vmalloc address space and in the direct map. The updates to
> > > the direct map cause splits of the large pages.
> >
> > That much is clear (wouldn't hurt to mention that in the changelog
> > though).
> >
> > > If we cache large pages as
> > > unmapped we take out the entire 2M page from the direct map and then
> > > if/when it becomes free it can be returned to the direct map as a 2M page.
> > >
> > > Generally, the unmapped allocations are intended for use-cases that anyway
> > > map the memory elsewhere than direct map and need to modify direct mappings
> > > of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe
> > > even some of the encrypted VM memory.
> >
> > I believe we are still not on the same page here. I do understand that
> > you want to re-use the caching capability of the unmapped_pages_alloc
> > for modules allocations as well. My question is whether module_alloc
> > really benefits from the fact that the memory is unmapped?
> >
> > I guess you want to say that it does because it wouldn't have to change
> > the permission for the direct map but I do not see that anywhere in the
> > patch...
>
> This happens automagically outside the patch :)
>
> Currently, to change memory permissions the modules code calls set_memory APIs
> and passes a vmalloc'ed address to them. The set_memory functions look up the
> direct map alias and update the permissions there as well.
> If the memory allocated with module_alloc() is unmapped in the direct map,
> there won't be an alias for set_memory APIs to update.
>
> > Also consider the slow path where module_alloc needs to
> > allocate a fresh (huge)page and unmap it. Does the overhead of the
> > unmapping suit the needs of all module_alloc users? Module loader is likely
> > not interesting as it tends to be rather static but what about BPF
> > programs?
>
> The overhead of unmapping pages in the direct map on the allocation path will
> be offset by the reduced overhead of updating permissions in the direct map
> after the allocation. Both use the same APIs, and if today the
> permission update causes a split of a large page, unmapping of a large page
> won't.
>
> Of course in a loaded system unmapped_alloc() won't always be able to
> allocate large pages to replenish the cache, but there will still be fewer
> updates to the direct map.
Ok, all of that is changelog material. I would recommend going this
way. Start with the simple unmapped_pages_alloc interface and use it for
the secret memory. There shouldn't be anything controversial there. In a
follow up patch add support for vmalloc, which would add a new gfp
flag, with a justification that this is the simplest way to support the
module_alloc usecase, and do not feel shy of providing as much context as
you can. Ideally with some numbers for the best/worst/avg cases.
Thanks
--
Michal Hocko
SUSE Labs
On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> When the set_memory or set_direct_map APIs are used to change attributes or
> permissions for chunks of several pages, the large PMD that maps these
> pages in the direct map must be split. Fragmenting the direct map in such
> a manner causes TLB pressure and, eventually, performance degradation.
>
> To avoid excessive direct map fragmentation, add the ability to allocate
> "unmapped" pages with a __GFP_UNMAPPED flag that will cause removal of the
> allocated pages from the direct map and use a cache of the unmapped pages.
>
> This cache is replenished with higher order pages with preference for
> PMD_SIZE pages when possible so that there will be fewer splits of large
> pages in the direct map.
>
> The cache is implemented as a buddy allocator, so it can serve high order
> allocations of unmapped pages.
So I'm late to this discussion; I stumbled in because of my own run-in
with executable memory allocation.
I understand that post LSF this patchset seems to not be going anywhere,
but OTOH there's also been a desire for better executable memory
allocation; as noted by tglx and elsewhere, there _is_ a definite
performance impact on page size with kernel text - I've seen numbers in
the multiple single digit percentage range in the past.
This patchset does seem to me to be roughly the right approach for that,
and coupled with the slab allocator for sub-page sized allocations it
seems there's the potential for getting a nice interface that spans the
full range of allocation sizes, from small bpf/trampoline allocations up
to modules.
Is this patchset worth reviving/continuing with? Was it really just the
needed module refactoring that was the blocker?
On Wed, May 17, 2023 at 11:35:56PM -0400, Kent Overstreet wrote:
> On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <[email protected]>
> >
> > When the set_memory or set_direct_map APIs are used to change attributes or
> > permissions for chunks of several pages, the large PMD that maps these
> > pages in the direct map must be split. Fragmenting the direct map in such
> > a manner causes TLB pressure and, eventually, performance degradation.
> >
> > To avoid excessive direct map fragmentation, add the ability to allocate
> > "unmapped" pages with a __GFP_UNMAPPED flag that will cause removal of the
> > allocated pages from the direct map and use a cache of the unmapped pages.
> >
> > This cache is replenished with higher order pages with preference for
> > PMD_SIZE pages when possible so that there will be fewer splits of large
> > pages in the direct map.
> >
> > The cache is implemented as a buddy allocator, so it can serve high order
> > allocations of unmapped pages.
>
> So I'm late to this discussion; I stumbled in because of my own run-in
> with executable memory allocation.
>
> I understand that post LSF this patchset seems to not be going anywhere,
> but OTOH there's also been a desire for better executable memory
> allocation; as noted by tglx and elsewhere, there _is_ a definite
> performance impact on page size with kernel text - I've seen numbers in
> the multiple single digit percentage range in the past.
>
> This patchset does seem to me to be roughly the right approach for that,
> and coupled with the slab allocator for sub-page sized allocations it
> seems there's the potential for getting a nice interface that spans the
> full range of allocation sizes, from small bpf/trampoline allocations up
> to modules.
>
> Is this patchset worth reviving/continuing with? Was it really just the
> needed module refactoring that was the blocker?
As I see it, this patchset is only one building block out of three or four.
If we are to repurpose it for code allocations it should be something like
1) allocate 2M page to fill the cache
2) remove this page from the direct map
3) map the 2M page ROX in module address space (usually some part of
vmalloc address space)
4) allocate a smaller chunk of that page to the actual caller (bpf,
modules, whatever)
Right now (3) and (4) won't work for modules because they mix code and data
in a single allocation.
So module refactoring is one blocker, and another is teaching vmalloc
to map executable memory areas with 2M pages, and probably
something else.
I remember there was an attempt to add VM_ALLOW_HUGE_VMAP to
x86::module_alloc(), but it caused problems and was reverted. Sorry, I
could not find the lore link.
There was a related discussion here:
https://lore.kernel.org/all/[email protected]/
--
Sincerely yours,
Mike.
On Thu, May 18, 2023 at 8:24 AM Mike Rapoport <[email protected]> wrote:
>
> On Wed, May 17, 2023 at 11:35:56PM -0400, Kent Overstreet wrote:
> > On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> > > From: "Mike Rapoport (IBM)" <[email protected]>
> > >
> > > When the set_memory or set_direct_map APIs are used to change attributes or
> > > permissions for chunks of several pages, the large PMD that maps these
> > > pages in the direct map must be split. Fragmenting the direct map in such
> > > a manner causes TLB pressure and, eventually, performance degradation.
> > >
> > > To avoid excessive direct map fragmentation, add the ability to allocate
> > > "unmapped" pages with a __GFP_UNMAPPED flag that will cause removal of the
> > > allocated pages from the direct map and use a cache of the unmapped pages.
> > >
> > > This cache is replenished with higher order pages with preference for
> > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > pages in the direct map.
> > >
> > > The cache is implemented as a buddy allocator, so it can serve high order
> > > allocations of unmapped pages.
> >
> > So I'm late to this discussion; I stumbled in because of my own run-in
> > with executable memory allocation.
> >
> > I understand that post LSF this patchset seems to not be going anywhere,
> > but OTOH there's also been a desire for better executable memory
> > allocation; as noted by tglx and elsewhere, there _is_ a definite
> > performance impact on page size with kernel text - I've seen numbers in
> > the multiple single digit percentage range in the past.
> >
> > This patchset does seem to me to be roughly the right approach for that,
> > and coupled with the slab allocator for sub-page sized allocations it
> > seems there's the potential for getting a nice interface that spans the
> > full range of allocation sizes, from small bpf/trampoline allocations up
> > to modules.
> >
> > Is this patchset worth reviving/continuing with? Was it really just the
> > needed module refactoring that was the blocker?
>
> As I see it, this patchset is only one building block out of three or four.
> If we are to repurpose it for code allocations it should be something like
>
> 1) allocate 2M page to fill the cache
> 2) remove this page from the direct map
> 3) map the 2M page ROX in module address space (usually some part of
> vmalloc address space)
> 4) allocate a smaller chunk of that page to the actual caller (bpf,
> modules, whatever)
>
> Right now (3) and (4) won't work for modules because they mix code and data
> in a single allocation.
I am working on patches based on the discussion in [1]. I am planning to
send v1 for review in a week or so.
Thanks,
Song
[1] https://lore.kernel.org/linux-mm/[email protected]/
[...]
On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> I am working on patches based on the discussion in [1]. I am planning to
> send v1 for review in a week or so.
Hey Song, I was reviewing that thread too,
Are you taking a different approach based on Thomas's feedback? I think
he had some fair points in that thread.
My own feeling is that the buddy allocator is our tool for allocating
larger variable sized physically contiguous allocations, so I'd like to
see something based on that - I think we could do a hybrid buddy/slab
allocator approach, like we have for regular memory allocations.
I started on a slab allocator for executable memory allocations the
other day (very minimal, but tested it for bcachefs and it works).
But I'd love to hear more about your current approach as well.
Cheers,
Kent
On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> I am working on patches based on the discussion in [1]. I am planning to
> send v1 for review in a week or so.
For reference, here's my own (early, but functioning :) slab allocator:
Look forward to comparing!
-->--
From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <[email protected]>
Date: Wed, 17 May 2023 01:22:06 -0400
Subject: [PATCH] mm: jit/text allocator
This provides a new, very simple slab allocator for jit/text, i.e. bpf,
ftrace trampolines, or bcachefs unpack functions.
With this API we can avoid ever mapping pages both writeable and
executable (not implemented in this patch: need to tweak
module_alloc()), and it also supports sub-page sized allocations.
Signed-off-by: Kent Overstreet <[email protected]>
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
new file mode 100644
index 0000000000..f1549d60e8
--- /dev/null
+++ b/include/linux/jitalloc.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_JITALLOC_H
+#define _LINUX_JITALLOC_H
+
+void jit_update(void *buf, void *new_buf, size_t len);
+void jit_free(void *buf);
+void *jit_alloc(void *buf, size_t len);
+
+#endif /* _LINUX_JITALLOC_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f..ff26a4f0c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
This option has a per-memcg and per-node memory overhead.
# }
+config JITALLOC
+ bool
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index c03e1e5859..25e82db9e8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_JITALLOC) += jitalloc.o
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
new file mode 100644
index 0000000000..7c4d621802
--- /dev/null
+++ b/mm/jitalloc.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/gfp.h>
+#include <linux/highmem.h>
+#include <linux/jitalloc.h>
+#include <linux/mm.h>
+#include <linux/moduleloader.h>
+#include <linux/mutex.h>
+#include <linux/set_memory.h>
+#include <linux/vmalloc.h>
+
+#include <asm/text-patching.h>
+
+static DEFINE_MUTEX(jit_alloc_lock);
+
+struct jit_cache {
+ unsigned obj_size_bits;
+ unsigned objs_per_slab;
+ struct list_head partial;
+};
+
+#define JITALLOC_MIN_SIZE 16
+#define NR_JIT_CACHES ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
+
+static struct jit_cache jit_caches[NR_JIT_CACHES];
+
+struct jit_slab {
+ unsigned long __page_flags;
+
+ struct jit_cache *cache;
+ void *executably_mapped;
+ unsigned long *objs_allocated; /* bitmap of allocated objects */
+ struct list_head list;
+};
+
+#define folio_jit_slab(folio) (_Generic((folio), \
+ const struct folio *: (const struct jit_slab *)(folio), \
+ struct folio *: (struct jit_slab *)(folio)))
+
+#define jit_slab_folio(s) (_Generic((s), \
+ const struct jit_slab *: (const struct folio *)s, \
+ struct jit_slab *: (struct folio *)s))
+
+static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
+{
+ void *executably_mapped = module_alloc(PAGE_SIZE);
+ struct page *page;
+ struct folio *folio;
+ struct jit_slab *slab;
+ unsigned long *objs_allocated;
+
+ if (!executably_mapped)
+ return NULL;
+
+ objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
+ if (!objs_allocated) {
+ vfree(executably_mapped);
+ return NULL;
+ }
+
+ set_vm_flush_reset_perms(executably_mapped);
+ set_memory_rox((unsigned long) executably_mapped, 1);
+
+ page = vmalloc_to_page(executably_mapped);
+ folio = page_folio(page);
+
+ __folio_set_slab(folio);
+ slab = folio_jit_slab(folio);
+ slab->cache = cache;
+ slab->executably_mapped = executably_mapped;
+ slab->objs_allocated = objs_allocated;
+ INIT_LIST_HEAD(&slab->list);
+
+ return slab;
+}
+
+static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
+{
+ struct jit_slab *s =
+ list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
+ jit_slab_alloc(cache);
+ unsigned obj_idx, nr_allocated;
+
+ if (!s)
+ return NULL;
+
+ obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
+
+ BUG_ON(obj_idx >= cache->objs_per_slab);
+ __set_bit(obj_idx, s->objs_allocated);
+
+ nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+ if (nr_allocated == s->cache->objs_per_slab) {
+ list_del_init(&s->list);
+ } else if (nr_allocated == 1) {
+ list_del(&s->list);
+ list_add(&s->list, &s->cache->partial);
+ }
+
+ return s->executably_mapped + (obj_idx << cache->obj_size_bits);
+}
+
+void jit_update(void *buf, void *new_buf, size_t len)
+{
+ text_poke_copy(buf, new_buf, len);
+}
+EXPORT_SYMBOL_GPL(jit_update);
+
+void jit_free(void *buf)
+{
+ struct page *page;
+ struct folio *folio;
+ struct jit_slab *s;
+ unsigned obj_idx, nr_allocated;
+ size_t offset;
+
+ if (!buf)
+ return;
+
+ page = vmalloc_to_page(buf);
+ folio = page_folio(page);
+ offset = offset_in_folio(folio, buf);
+
+ if (!folio_test_slab(folio)) {
+ vfree(buf);
+ return;
+ }
+
+ s = folio_jit_slab(folio);
+
+ mutex_lock(&jit_alloc_lock);
+ obj_idx = offset >> s->cache->obj_size_bits;
+
+ __clear_bit(obj_idx, s->objs_allocated);
+
+ nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+ if (nr_allocated == 0) {
+ list_del(&s->list);
+ kfree(s->objs_allocated);
+ folio_put(folio);
+ } else if (nr_allocated + 1 == s->cache->objs_per_slab) {
+ list_del(&s->list);
+ list_add(&s->list, &s->cache->partial);
+ }
+
+ mutex_unlock(&jit_alloc_lock);
+}
+EXPORT_SYMBOL_GPL(jit_free);
+
+void *jit_alloc(void *buf, size_t len)
+{
+ unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / JITALLOC_MIN_SIZE);
+ void *p;
+
+ if (jit_cache_idx < NR_JIT_CACHES) {
+ mutex_lock(&jit_alloc_lock);
+ p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
+ mutex_unlock(&jit_alloc_lock);
+ } else {
+ p = module_alloc(len);
+ if (p) {
+ set_vm_flush_reset_perms(p);
+ set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
+ }
+ }
+
+ if (p && buf)
+ jit_update(p, buf, len);
+
+ return p;
+}
+EXPORT_SYMBOL_GPL(jit_alloc);
+
+static int __init jit_alloc_init(void)
+{
+ for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
+ jit_caches[i].obj_size_bits = ilog2(JITALLOC_MIN_SIZE) + i;
+ jit_caches[i].objs_per_slab = PAGE_SIZE >> jit_caches[i].obj_size_bits;
+
+ INIT_LIST_HEAD(&jit_caches[i].partial);
+ }
+
+ return 0;
+}
+core_initcall(jit_alloc_init);
On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > I am working on patches based on the discussion in [1]. I am planning to
> > send v1 for review in a week or so.
>
> Hey Song, I was reviewing that thread too,
>
> Are you taking a different approach based on Thomas's feedback? I think
> he had some fair points in that thread.
Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
>
> My own feeling is that the buddy allocator is our tool for allocating
> larger variable sized physically contiguous allocations, so I'd like to
> see something based on that - I think we could do a hybrid buddy/slab
> allocator approach, like we have for regular memory allocations.
I am planning to implement the allocator based on this (reuse
vmap_area logic):
https://lore.kernel.org/linux-mm/[email protected]/
Thanks,
Song
>
> I started on a slab allocator for executable memory allocations the
> other day (very minimal, but tested it for bcachefs and it works).
>
> But I'd love to hear more about your current approach as well.
>
> Cheers,
> Kent
On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > I am working on patches based on the discussion in [1]. I am planning to
> > send v1 for review in a week or so.
>
> For reference, here's my own (early, but functioning :) slab allocator:
>
> Look forward to comparing!
This looks similar to the bpf_prog_pack allocator we use for BPF at the
moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we
can also use bpf_prog_pack for bcachefs for now.
Thanks,
Song
> -->--
> From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> From: Kent Overstreet <[email protected]>
> Date: Wed, 17 May 2023 01:22:06 -0400
> Subject: [PATCH] mm: jit/text allocator
>
> This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> ftrace trampolines, or bcachefs unpack functions.
>
> With this API we can avoid ever mapping pages both writeable and
> executable (not implemented in this patch: need to tweak
> module_alloc()), and it also supports sub-page sized allocations.
>
> Signed-off-by: Kent Overstreet <[email protected]>
>
> diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
> new file mode 100644
> index 0000000000..f1549d60e8
> --- /dev/null
> +++ b/include/linux/jitalloc.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_JITALLOC_H
> +#define _LINUX_JITALLOC_H
> +
> +void jit_update(void *buf, void *new_buf, size_t len);
> +void jit_free(void *buf);
> +void *jit_alloc(void *buf, size_t len);
> +
> +#endif /* _LINUX_JITALLOC_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 4751031f3f..ff26a4f0c9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
> This option has a per-memcg and per-node memory overhead.
> # }
>
> +config JITALLOC
> + bool
> +
> source "mm/damon/Kconfig"
>
> endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index c03e1e5859..25e82db9e8 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
> obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
> obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
> obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
> +obj-$(CONFIG_JITALLOC) += jitalloc.o
> diff --git a/mm/jitalloc.c b/mm/jitalloc.c
> new file mode 100644
> index 0000000000..7c4d621802
> --- /dev/null
> +++ b/mm/jitalloc.c
> @@ -0,0 +1,187 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/gfp.h>
> +#include <linux/highmem.h>
> +#include <linux/jitalloc.h>
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/mutex.h>
> +#include <linux/set_memory.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/text-patching.h>
> +
> +static DEFINE_MUTEX(jit_alloc_lock);
> +
> +struct jit_cache {
> + unsigned obj_size_bits;
> + unsigned objs_per_slab;
> + struct list_head partial;
> +};
> +
> +#define JITALLOC_MIN_SIZE 16
> +#define NR_JIT_CACHES ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
> +
> +static struct jit_cache jit_caches[NR_JIT_CACHES];
> +
> +struct jit_slab {
> + unsigned long __page_flags;
> +
> + struct jit_cache *cache;
> + void *executably_mapped;
> + unsigned long *objs_allocated; /* bitmap of allocated objects */
> + struct list_head list;
> +};
> +
> +#define folio_jit_slab(folio) (_Generic((folio), \
> + const struct folio *: (const struct jit_slab *)(folio), \
> + struct folio *: (struct jit_slab *)(folio)))
> +
> +#define jit_slab_folio(s) (_Generic((s), \
> + const struct jit_slab *: (const struct folio *)s, \
> + struct jit_slab *: (struct folio *)s))
> +
> +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
> +{
> + void *executably_mapped = module_alloc(PAGE_SIZE);
> + struct page *page;
> + struct folio *folio;
> + struct jit_slab *slab;
> + unsigned long *objs_allocated;
> +
> + if (!executably_mapped)
> + return NULL;
> +
> + objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
> + if (!objs_allocated) {
> + vfree(executably_mapped);
> + return NULL;
> + }
> +
> + set_vm_flush_reset_perms(executably_mapped);
> + set_memory_rox((unsigned long) executably_mapped, 1);
> +
> + page = vmalloc_to_page(executably_mapped);
> + folio = page_folio(page);
> +
> + __folio_set_slab(folio);
> + slab = folio_jit_slab(folio);
> + slab->cache = cache;
> + slab->executably_mapped = executably_mapped;
> + slab->objs_allocated = objs_allocated;
> + INIT_LIST_HEAD(&slab->list);
> +
> + return slab;
> +}
> +
> +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> +{
> + struct jit_slab *s =
> + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> + jit_slab_alloc(cache);
> + unsigned obj_idx, nr_allocated;
> +
> + if (!s)
> + return NULL;
> +
> + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> +
> + BUG_ON(obj_idx >= cache->objs_per_slab);
> + __set_bit(obj_idx, s->objs_allocated);
> +
> + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> + if (nr_allocated == s->cache->objs_per_slab) {
> + list_del_init(&s->list);
> + } else if (nr_allocated == 1) {
> + list_del(&s->list);
> + list_add(&s->list, &s->cache->partial);
> + }
> +
> + return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> +}
> +
> +void jit_update(void *buf, void *new_buf, size_t len)
> +{
> + text_poke_copy(buf, new_buf, len);
> +}
> +EXPORT_SYMBOL_GPL(jit_update);
> +
> +void jit_free(void *buf)
> +{
> + struct page *page;
> + struct folio *folio;
> + struct jit_slab *s;
> + unsigned obj_idx, nr_allocated;
> + size_t offset;
> +
> + if (!buf)
> + return;
> +
> + page = vmalloc_to_page(buf);
> + folio = page_folio(page);
> + offset = offset_in_folio(folio, buf);
> +
> + if (!folio_test_slab(folio)) {
> + vfree(buf);
> + return;
> + }
> +
> + s = folio_jit_slab(folio);
> +
> + mutex_lock(&jit_alloc_lock);
> + obj_idx = offset >> s->cache->obj_size_bits;
> +
> + __clear_bit(obj_idx, s->objs_allocated);
> +
> + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> + if (nr_allocated == 0) {
> + list_del(&s->list);
> + kfree(s->objs_allocated);
> + folio_put(folio);
> + } else if (nr_allocated + 1 == s->cache->objs_per_slab) {
> + list_del(&s->list);
> + list_add(&s->list, &s->cache->partial);
> + }
> +
> + mutex_unlock(&jit_alloc_lock);
> +}
> +EXPORT_SYMBOL_GPL(jit_free);
> +
> +void *jit_alloc(void *buf, size_t len)
> +{
> + unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / JITALLOC_MIN_SIZE);
> + void *p;
> +
> + if (jit_cache_idx < NR_JIT_CACHES) {
> + mutex_lock(&jit_alloc_lock);
> + p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
> + mutex_unlock(&jit_alloc_lock);
> + } else {
> + p = module_alloc(len);
> + if (p) {
> + set_vm_flush_reset_perms(p);
> + set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
> + }
> + }
> +
> + if (p && buf)
> + jit_update(p, buf, len);
> +
> + return p;
> +}
> +EXPORT_SYMBOL_GPL(jit_alloc);
> +
> +static int __init jit_alloc_init(void)
> +{
> + for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
> + jit_caches[i].obj_size_bits = ilog2(JITALLOC_MIN_SIZE) + i;
> + jit_caches[i].objs_per_slab = PAGE_SIZE >> jit_caches[i].obj_size_bits;
> +
> + INIT_LIST_HEAD(&jit_caches[i].partial);
> + }
> +
> + return 0;
> +}
> +core_initcall(jit_alloc_init);
>
On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > I am working on patches based on the discussion in [1]. I am planning to
> > > send v1 for review in a week or so.
> >
> > Hey Song, I was reviewing that thread too,
> >
> > Are you taking a different approach based on Thomas's feedback? I think
> > he had some fair points in that thread.
>
> Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
>
> >
> > My own feeling is that the buddy allocator is our tool for allocating
> > larger variable sized physically contiguous allocations, so I'd like to
> > see something based on that - I think we could do a hybrid buddy/slab
> > allocator approach, like we have for regular memory allocations.
>
> I am planning to implement the allocator based on this (reuse
> vmap_area logic):
Ah, you're still doing vmap_area approach.
Mike's approach looks like it'll be _much_ lighter weight and higher
performance, to me. vmalloc is known to be slow compared to the buddy
allocator, and with Mike's approach we're only modifying mappings once
per 2 MB chunk.
I don't see anything in your code for sub-page sized allocations either, so
perhaps I should keep going with my slab allocator.
Could you share your thoughts on your approach vs. Mike's? I'm newer to
this area of the code than you two so maybe there's an angle I've missed
:)
On Thu, May 18, 2023 at 10:15:59AM -0700, Song Liu wrote:
> On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > I am working on patches based on the discussion in [1]. I am planning to
> > > send v1 for review in a week or so.
> >
> > For reference, here's my own (early, but functioning :) slab allocator:
> >
> > Look forward to comparing!
>
> This looks similar to the bpf_prog_pack allocator we use for BPF at the
> moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we
> can also use bpf_prog_pack for bcachefs for now.
The code in bpf/core.c does a _lot_, my preference would be to split
that up and refactor the bpf code to use something generic :)
On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > <[email protected]> wrote:
> > >
> > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > send v1 for review in a week or so.
> > >
> > > Hey Song, I was reviewing that thread too,
> > >
> > > Are you taking a different approach based on Thomas's feedback? I think
> > > he had some fair points in that thread.
> >
> > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> >
> > >
> > > My own feeling is that the buddy allocator is our tool for allocating
> > > larger variable sized physically contiguous allocations, so I'd like to
> > > see something based on that - I think we could do a hybrid buddy/slab
> > > allocator approach, like we have for regular memory allocations.
> >
> > I am planning to implement the allocator based on this (reuse
> > vmap_area logic):
>
> Ah, you're still doing vmap_area approach.
>
> Mike's approach looks like it'll be _much_ lighter weight and higher
> performance, to me. vmalloc is known to be slow compared to the buddy
> allocator, and with Mike's approach we're only modifying mappings once
> per 2 MB chunk.
>
> I don't see anything in your code for sub-page sized allocations too, so
> perhaps I should keep going with my slab allocator.
The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
we showed that multiple BPF programs share the same page with some
kernel text (_etext).
> Could you share your thoughts on your approach vs. Mike's? I'm newer to
> this area of the code than you two so maybe there's an angle I've missed
> :)
AFAICT, a tree-based solution (vmap_area) is more efficient than a
bitmap-based solution.
First, for a 2MiB page with a 64B chunk size, we need a bitmap of
2MiB / 64B = 32k bits = 4kB,
while the tree-based solution adapts to the number of allocations within
this 2MiB page. Also, searching for a free range within a 4kB bitmap may
actually be slower than searching the tree.
Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
while tree based solution can. For example, if a big driver uses 3MiB, the
tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
smaller allocations.
Thanks,
Song
[2] https://lore.kernel.org/linux-mm/[email protected]/
On Thu, May 18, 2023 at 10:25 AM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 10:15:59AM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
> > <[email protected]> wrote:
> > >
> > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > send v1 for review in a week or so.
> > >
> > > For reference, here's my own (early, but functioning :) slab allocator:
> > >
> > > Look forward to comparing!
> >
> > This looks similar to the bpf_prog_pack allocator we use for BPF at the
> > moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we
> > can also use bpf_prog_pack for bcachefs for now.
>
> The code in bpf/core.c does a _lot_, my preference would be to split
> that up and refactor the bpf code to use something generic :)
The code is actually split into two parts:
bpf_prog_pack_[alloc|free] vs. bpf_jit_binary_pack_[alloc|free].
The latter has logic just for BPF. But the former can be reused by
other users (maybe with a little refactoring).
Once the vmap_area based solution is committed, we will replace the
former with the new allocator.
Thanks,
Song
On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> On Thu, May 18, 2023 at 11:47 AM Song Liu <[email protected]> wrote:
> >
> > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
> > <[email protected]> wrote:
> > >
> > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > > send v1 for review in a week or so.
> > > > >
> > > > > Hey Song, I was reviewing that thread too,
> > > > >
> > > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > > he had some fair points in that thread.
> > > >
> > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > > >
> > > > >
> > > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > > allocator approach, like we have for regular memory allocations.
> > > >
> > > > I am planning to implement the allocator based on this (reuse
> > > > vmap_area logic):
> > >
> > > Ah, you're still doing vmap_area approach.
> > >
> > > Mike's approach looks like it'll be _much_ lighter weight and higher
> > > performance, to me. vmalloc is known to be slow compared to the buddy
> > > allocator, and with Mike's approach we're only modifying mappings once
> > > per 2 MB chunk.
> > >
> > > I don't see anything in your code for sub-page sized allocations too, so
> > > perhaps I should keep going with my slab allocator.
> >
> > The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
> > we showed that multiple BPF programs share the same page with some
> > kernel text (_etext).
> >
> > > Could you share your thoughts on your approach vs. Mike's? I'm newer to
> > > this area of the code than you two so maybe there's an angle I've missed
> > > :)
> >
> > AFAICT, tree based solution (vmap_area) is more efficient than bitmap
> > based solution.
Tree based requires quite a bit of overhead for the rbtree pointers, and
additional vmap_area structs.
With a buddy allocator based approach, there's no additional state that
needs to be allocated, since it all fits in struct page.
> > First, for 2MiB page with 64B chunk size, we need a bitmap of
> > 2MiB / 64B = 32k bit = 4k bytes
> > While the tree based solution can adapt to the number of allocations within
> > This 2MiB page. Also, searching a free range within 4kB of bitmap may
> > actually be slower than searching in the tree.
> >
> > Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
> > while tree based solution can. For example, if a big driver uses 3MiB, the
> > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
> > smaller allocations.
We're not talking about a bitmap based solution for >= PAGE_SIZE
allocations, the alternative is a buddy allocator - so no searching,
just per power-of-two freelists.
>
> Missed one:
>
> Third, bitmap based solution requires a "size" parameter in free(). It is an
> overhead for the user. Tree based solution doesn't have this issue.
No, we can recover the size of the allocation via compound_order() -
hasn't historically been done for alloc_pages() allocations to avoid
setting up the state in each page for compound head/tail, but it perhaps
should be (and is with folios, which we've generally been switching to).
On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> > Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
> > while tree based solution can. For example, if a big driver uses 3MiB, the
> > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
> > smaller allocations.
2 MB is just a good default size for the initial unmapped allocations if
the main consideration is avoiding mapping fragmentation. The approach
can work with larger allocations if required.
On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > I am working on patches based on the discussion in [1]. I am planning to
> > send v1 for review in a week or so.
>
> For reference, here's my own (early, but functioning :) slab allocator:
>
> Look forward to comparing!
> -->--
> From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> From: Kent Overstreet <[email protected]>
> Date: Wed, 17 May 2023 01:22:06 -0400
> Subject: [PATCH] mm: jit/text allocator
>
> This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> ftrace trampolines, or bcachefs unpack functions.
>
> With this API we can avoid ever mapping pages both writeable and
> executable (not implemented in this patch: need to tweak
> module_alloc()), and it also supports sub-page sized allocations.
>
> Signed-off-by: Kent Overstreet <[email protected]>
[...]
> +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> +{
> + struct jit_slab *s =
> + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> + jit_slab_alloc(cache);
> + unsigned obj_idx, nr_allocated;
> +
> + if (!s)
> + return NULL;
> +
> + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> +
> + BUG_ON(obj_idx >= cache->objs_per_slab);
> + __set_bit(obj_idx, s->objs_allocated);
> +
> + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> + if (nr_allocated == s->cache->objs_per_slab) {
> + list_del_init(&s->list);
> + } else if (nr_allocated == 1) {
> + list_del(&s->list);
> + list_add(&s->list, &s->cache->partial);
> + }
> +
> + return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> +}
IIUC, "len" is ignored in jit_cache_alloc(), so it can only handle
<=16 byte allocations?
Thanks,
Song
On Thu, May 18, 2023 at 12:01:26PM -0700, Song Liu wrote:
> On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > I am working on patches based on the discussion in [1]. I am planning to
> > > send v1 for review in a week or so.
> >
> > For reference, here's my own (early, but functioning :) slab allocator:
> >
> > Look forward to comparing!
> > -->--
> > From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> > From: Kent Overstreet <[email protected]>
> > Date: Wed, 17 May 2023 01:22:06 -0400
> > Subject: [PATCH] mm: jit/text allocator
> >
> > This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> > ftrace trampolines, or bcachefs unpack functions.
> >
> > With this API we can avoid ever mapping pages both writeable and
> > executable (not implemented in this patch: need to tweak
> > module_alloc()), and it also supports sub-page sized allocations.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
>
> [...]
>
> > +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> > +{
> > + struct jit_slab *s =
> > + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> > + jit_slab_alloc(cache);
> > + unsigned obj_idx, nr_allocated;
> > +
> > + if (!s)
> > + return NULL;
> > +
> > + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> > +
> > + BUG_ON(obj_idx >= cache->objs_per_slab);
> > + __set_bit(obj_idx, s->objs_allocated);
> > +
> > + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> > +
> > + if (nr_allocated == s->cache->objs_per_slab) {
> > + list_del_init(&s->list);
> > + } else if (nr_allocated == 1) {
> > + list_del(&s->list);
> > + list_add(&s->list, &s->cache->partial);
> > + }
> > +
> > + return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> > +}
>
> IIUC, "len" is ignored in jit_cache_alloc(), so it can only handle
> <=16 byte allocations?
len is a redundant parameter (good catch); at that point we've picked a
cache for the specific allocation size.
Since there's multiple caches for each power of two size, it can handle
allocations up to PAGE_SIZE.
On Thu, May 18, 2023 at 11:47 AM Song Liu <[email protected]> wrote:
>
> On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > send v1 for review in a week or so.
> > > >
> > > > Hey Song, I was reviewing that thread too,
> > > >
> > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > he had some fair points in that thread.
> > >
> > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > >
> > > >
> > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > allocator approach, like we have for regular memory allocations.
> > >
> > > I am planning to implement the allocator based on this (reuse
> > > vmap_area logic):
> >
> > Ah, you're still doing vmap_area approach.
> >
> > Mike's approach looks like it'll be _much_ lighter weight and higher
> > performance, to me. vmalloc is known to be slow compared to the buddy
> > allocator, and with Mike's approach we're only modifying mappings once
> > per 2 MB chunk.
> >
> > I don't see anything in your code for sub-page sized allocations too, so
> > perhaps I should keep going with my slab allocator.
>
> The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
> we showed that multiple BPF programs share the same page with some
> kernel text (_etext).
>
> > Could you share your thoughts on your approach vs. Mike's? I'm newer to
> > this area of the code than you two so maybe there's an angle I've missed
> > :)
>
> AFAICT, tree based solution (vmap_area) is more efficient than bitmap
> based solution.
>
> First, for 2MiB page with 64B chunk size, we need a bitmap of
> 2MiB / 64B = 32k bit = 4k bytes
> While the tree based solution can adapt to the number of allocations within
> This 2MiB page. Also, searching a free range within 4kB of bitmap may
> actually be slower than searching in the tree.
>
> Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
> while tree based solution can. For example, if a big driver uses 3MiB, the
> tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
> smaller allocations.
Missed one:
Third, bitmap based solution requires a "size" parameter in free(). It is an
overhead for the user. Tree based solution doesn't have this issue.
Thanks,
Song
>
> [2] https://lore.kernel.org/linux-mm/[email protected]/
On Thu, May 18, 2023 at 12:15 PM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 11:47 AM Song Liu <[email protected]> wrote:
> > >
> > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > > > send v1 for review in a week or so.
> > > > > >
> > > > > > Hey Song, I was reviewing that thread too,
> > > > > >
> > > > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > > > he had some fair points in that thread.
> > > > >
> > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > > > >
> > > > > >
> > > > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > > > allocator approach, like we have for regular memory allocations.
> > > > >
> > > > > I am planning to implement the allocator based on this (reuse
> > > > > vmap_area logic):
> > > >
> > > > Ah, you're still doing vmap_area approach.
> > > >
> > > > Mike's approach looks like it'll be _much_ lighter weight and higher
> > > > performance, to me. vmalloc is known to be slow compared to the buddy
> > > > allocator, and with Mike's approach we're only modifying mappings once
> > > > per 2 MB chunk.
> > > >
> > > > I don't see anything in your code for sub-page sized allocations too, so
> > > > perhaps I should keep going with my slab allocator.
> > >
> > > The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
> > > we showed that multiple BPF programs share the same page with some
> > > kernel text (_etext).
> > >
> > > > Could you share your thoughts on your approach vs. Mike's? I'm newer to
> > > > this area of the code than you two so maybe there's an angle I've missed
> > > > :)
> > >
> > > AFAICT, tree based solution (vmap_area) is more efficient than bitmap
> > > based solution.
>
> Tree based requires quite a bit of overhead for the rbtree pointers, and
> additional vmap_area structs.
>
> With a buddy allocator based approach, there's no additional state that
> needs to be allocated, since it all fits in struct page.
>
> > > First, for 2MiB page with 64B chunk size, we need a bitmap of
> > > 2MiB / 64B = 32k bit = 4k bytes
> > > While the tree based solution can adapt to the number of allocations within
> > > This 2MiB page. Also, searching a free range within 4kB of bitmap may
> > > actually be slower than searching in the tree.
> > >
> > > Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
> > > while tree based solution can. For example, if a big driver uses 3MiB, the
> > > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
> > > smaller allocations.
>
> We're not talking about a bitmap based solution for >= PAGE_SIZE
> allocations, the alternative is a buddy allocator - so no searching,
> just per power-of-two freelists.
>
> >
> > Missed one:
> >
> > Third, bitmap based solution requires a "size" parameter in free(). It is an
> > overhead for the user. Tree based solution doesn't have this issue.
>
> No, we can recover the size of the allocation via compound_order() -
> hasn't historically been done for alloc_pages() allocations to avoid
> setting up the state in each page for compound head/tail, but it perhaps
> should be (and is with folios, which we've generally been switching to).
If we use compound_order(), we will round up to power of 2 for all
allocations. Does this mean we will use 4MiB for a 2.1MiB allocation?
Thanks,
Song
On Thu, May 18, 2023 at 01:03:28PM -0700, Song Liu wrote:
> If we use compound_order(), we will round up to power of 2 for all
> allocations. Does this mean we will use 4MiB for a 2.1MiB allocation?
Yes. This means we lose on average 33% (I believe; someone with better
statistics might correct me) to internal fragmentation. But the buddy
allocator will be better at avoiding and dealing with external
fragmentation over time.
On Thu, May 18, 2023 at 12:15 PM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 11:47 AM Song Liu <[email protected]> wrote:
> > >
> > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > > > send v1 for review in a week or so.
> > > > > >
> > > > > > Hey Song, I was reviewing that thread too,
> > > > > >
> > > > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > > > he had some fair points in that thread.
> > > > >
> > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > > > >
> > > > > >
> > > > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > > > allocator approach, like we have for regular memory allocations.
> > > > >
> > > > > I am planning to implement the allocator based on this (reuse
> > > > > vmap_area logic):
> > > >
> > > > Ah, you're still doing vmap_area approach.
> > > >
> > > > Mike's approach looks like it'll be _much_ lighter weight and higher
> > > > performance, to me. vmalloc is known to be slow compared to the buddy
> > > > allocator, and with Mike's approach we're only modifying mappings once
> > > > per 2 MB chunk.
> > > >
> > > > I don't see anything in your code for sub-page sized allocations too, so
> > > > perhaps I should keep going with my slab allocator.
> > >
> > > The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
> > > we showed that multiple BPF programs share the same page with some
> > > kernel text (_etext).
> > >
> > > > Could you share your thoughts on your approach vs. Mike's? I'm newer to
> > > > this area of the code than you two so maybe there's an angle I've missed
> > > > :)
> > >
> > > AFAICT, tree based solution (vmap_area) is more efficient than bitmap
> > > based solution.
>
> Tree based requires quite a bit of overhead for the rbtree pointers, and
> additional vmap_area structs.
>
> With a buddy allocator based approach, there's no additional state that
> needs to be allocated, since it all fits in struct page.
To allocate memory for text, we will allocate 2MiB, make it ROX, and then
use it for many small allocations. IIUC, the buddy allocator will use unallocated
parts of this page for metadata. I guess this may be a problem, as the
whole page is ROX now, and we would have to use text_poke() to write to it.
OTOH, if we allocate extra memory for metadata (tree based solution),
all the metadata operations can be regular read/write.
Thanks,
Song
On Thu, May 18, 2023 at 01:51:08PM -0700, Song Liu wrote:
> To allocate memory for text, we will allocate 2MiB, make it ROX, and then
> use it for many small allocations. IIUC, buddy allocator will use unallocated
> parts of this page for metadata. I guess this may be a problem, as the
> whole page is ROX now, and we have to use text_poke to write to it.
The standard kernel buddy allocator does _not_ store anything in the
page itself - because the page might be a highmem page.
That's also why I went with the bitmap for my slab allocator; standard
kernel slab allocator stores a freelist ptr in free objects.
Hi Kent,
On Thu, May 18, 2023 at 01:23:56PM -0400, Kent Overstreet wrote:
> On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > <[email protected]> wrote:
> > >
> > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > send v1 for review in a week or so.
> > >
> > > Hey Song, I was reviewing that thread too,
> > >
> > > Are you taking a different approach based on Thomas's feedback? I think
> > > he had some fair points in that thread.
> >
> > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> >
> > >
> > > My own feeling is that the buddy allocator is our tool for allocating
> > > larger variable sized physically contiguous allocations, so I'd like to
> > > see something based on that - I think we could do a hybrid buddy/slab
> > > allocator approach, like we have for regular memory allocations.
> >
> > I am planning to implement the allocator based on this (reuse
> > vmap_area logic):
>
> Ah, you're still doing vmap_area approach.
>
> Mike's approach looks like it'll be _much_ lighter weight and higher
> performance, to me. vmalloc is known to be slow compared to the buddy
> allocator, and with Mike's approach we're only modifying mappings once
> per 2 MB chunk.
>
> I don't see anything in your code for sub-page sized allocations too, so
> perhaps I should keep going with my slab allocator.
Your allocator implicitly relies on vmalloc because of module_alloc ;-)
What I was thinking is that we can replace module_alloc() calls in your
allocator with something based on my unmapped_alloc(). If we make the part
that refills the cache also take care of creating the mapping in the
module address space, that should cover everything.
> Could you share your thoughts on your approach vs. Mike's? I'm newer to
> this area of the code than you two so maybe there's an angle I've missed
> :)
>
--
Sincerely yours,
Mike.
On Thu, May 18, 2023 at 6:24 PM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 01:51:08PM -0700, Song Liu wrote:
>
> > To allocate memory for text, we will allocate 2MiB, make it ROX, and then
> > use it for many small allocations. IIUC, buddy allocator will use unallocated
> > parts of this page for metadata. I guess this may be a problem, as the
> > whole page is ROX now, and we have to use text_poke to write to it.
>
> The standard kernel buddy allocator does _not_ store anything in the
> page itself - because the page might be a highmem page.
Thanks for the correction.
Song
>
> That's also why I went with the bitmap for my slab allocator; standard
> kernel slab allocator stores a freelist ptr in free objects.
On Thu, Mar 09, 2023, Mike Rapoport wrote:
> On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > > From: "Mike Rapoport (IBM)" <[email protected]>
> > >
> > > Hi,
> > >
> > > This is a third attempt to make page allocator aware of the direct
> > > map
> > > layout and allow grouping of the pages that must be unmapped from
> > > the direct map.
> > >
> > > This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> > > this set:
> > >
> > > https://lore.kernel.org/all/[email protected]
> > >
> > > but instead of using a migrate type to cache the unmapped pages, the
> > > current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> > > allocations.
> >
> > It seems a downside to having a page allocator outside of _the_ page
> > allocator is you don't get all of the features that are baked in there.
> > For example does secretmem care about numa? I guess in this
> > implementation there is just one big cache for all nodes.
> >
> > Probably most users would want __GFP_ZERO. Would secretmem care about
> > __GFP_ACCOUNT?
>
> The intention was that the pages in cache are always zeroed, so __GFP_ZERO
> is always implicitly there, at least should have been.
Would it be possible to drop that assumption/requirement, i.e. allow allocation of
__GFP_UNMAPPED without __GFP_ZERO? At a glance, __GFP_UNMAPPED looks like it would
be a great fit for backing guest memory, in particular for confidential VMs. And
for some flavors of CoCo, i.e. TDX, the trusted intermediary is responsible for
zeroing/initializing guest memory as the untrusted host (kernel/KVM) doesn't have
access to the guest's encryption key. In other words, zeroing in the kernel would
be unnecessary work.
On Fri, May 19, 2023 at 11:29:45AM +0300, Mike Rapoport wrote:
> Your allocator implicitly relies on vmalloc because of module_alloc ;-)
>
> What I was thinking is that we can replace module_alloc() calls in your
> allocator with something based on my unmapped_alloc(). If we make the part
> that refills the cache also take care of creating the mapping in the
> module address space, that should cover everything.
Yeah, that's exactly what I was thinking :)
Liam was also just mentioning on IRC vmalloc lock contention came up
again at LSF, and that's historically always been an issue - going with
your patchset for the backend nicely avoids that.
If I have time (hah! big if :) I'll see if I can cook up a patchset that
combines our two approaches over the weekend.
On Fri, May 19, 2023 at 1:30 AM Mike Rapoport <[email protected]> wrote:
>
> Hi Kent,
>
> On Thu, May 18, 2023 at 01:23:56PM -0400, Kent Overstreet wrote:
> > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > send v1 for review in a week or so.
> > > >
> > > > Hey Song, I was reviewing that thread too,
> > > >
> > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > he had some fair points in that thread.
> > >
> > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > >
> > > >
> > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > allocator approach, like we have for regular memory allocations.
> > >
> > > I am planning to implement the allocator based on this (reuse
> > > vmap_area logic):
> >
> > Ah, you're still doing vmap_area approach.
> >
> > Mike's approach looks like it'll be _much_ lighter weight and higher
> > performance, to me. vmalloc is known to be slow compared to the buddy
> > allocator, and with Mike's approach we're only modifying mappings once
> > per 2 MB chunk.
> >
> > I don't see anything in your code for sub-page sized allocations too, so
> > perhaps I should keep going with my slab allocator.
>
> Your allocator implicitly relies on vmalloc because of module_alloc ;-)
>
> What I was thinking is that we can replace module_alloc() calls in your
> allocator with something based on my unmapped_alloc(). If we make the part
> that refills the cache also take care of creating the mapping in the
> module address space, that should cover everything.
Here are what I found as I work more on the code:
1. It takes quite some work to create a clean interface and make sure
all the users of module_alloc can use the new allocator on all archs
(archs with text poke need to work with ROX memory from the
allocator; archs without text poke need a clean fallback
mechanism, etc.). Most of this work is independent of the actual
allocator, so we can do this part and plug in whatever allocator we
want (buddy+slab, vmap-based, or any other solution).
2. A vmap_area based solution will work, and it will be a single solution
for both < PAGE_SIZE and > PAGE_SIZE allocations. Given that
module_alloc is not in any hot path (AFAIK), I don't see any
practical issues with this approach. It will be a little tricky to place
and name the code, as it uses vmalloc logic but is technically a
module allocator.
I will prioritize building the interface, and migrating users to it. If we
do this part right, changing the underlying allocator should be
straightforward.
Thanks,
Song
On Fri, May 19, 2023 at 11:47:42AM -0400, Kent Overstreet wrote:
> On Fri, May 19, 2023 at 11:29:45AM +0300, Mike Rapoport wrote:
> > Your allocator implicitly relies on vmalloc because of module_alloc ;-)
> >
> > What I was thinking is that we can replace module_alloc() calls in your
> > allocator with something based on my unmapped_alloc(). If we make the part
> > that refills the cache also take care of creating the mapping in the
> > module address space, that should cover everything.
>
> Yeah, that's exactly what I was thinking :)
>
> Liam was also just mentioning on IRC vmalloc lock contention came up
> again at LSF, and that's historically always been an issue - going with
> your patchset for the backend nicely avoids that.
Unfortunately not because we still need to map the pages in the modules
area which is essentially a subset of vmalloc address space.
> If I have time (hah! big if :) I'll see if I can cook up a patchset that
> combines our two approaches over the weekend.
Now there is also an interest about unmapped allocations from KVM folks, so
I might continue pursuing unmapped allocator, probably just without a new
GFP flag and hooks into page allocator.
--
Sincerely yours,
Mike.
On Fri, May 19, 2023 at 07:14:14PM +0300, Mike Rapoport wrote:
> Unfortunately not because we still need to map the pages in the modules
> area which is essentially a subset of vmalloc address space.
Why wouldn't we be doing that once per 2 MB page?
>
> > If I have time (hah! big if :) I'll see if I can cook up a patchset that
> > combines our two approaches over the weekend.
>
> Now there is also an interest about unmapped allocations from KVM folks, so
> I might continue pursuing unmapped allocator, probably just without a new
> GFP flag and hooks into page allocator.
Cool - it does seem like a generally valuable capability :)
On Fri, May 19, 2023 at 08:40:48AM -0700, Sean Christopherson wrote:
> On Thu, Mar 09, 2023, Mike Rapoport wrote:
> > On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> > > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > > > From: "Mike Rapoport (IBM)" <[email protected]>
> > > >
> > > > Hi,
> > > >
> > > > This is a third attempt to make page allocator aware of the direct
> > > > map
> > > > layout and allow grouping of the pages that must be unmapped from
> > > > the direct map.
> > > >
> > > > This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> > > > this set:
> > > >
> > > > https://lore.kernel.org/all/[email protected]
> > > >
> > > > but instead of using a migrate type to cache the unmapped pages, the
> > > > current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> > > > allocations.
> > >
> > > It seems a downside to having a page allocator outside of _the_ page
> > > allocator is you don't get all of the features that are baked in there.
> > > For example does secretmem care about numa? I guess in this
> > > implementation there is just one big cache for all nodes.
> > >
> > > Probably most users would want __GFP_ZERO. Would secretmem care about
> > > __GFP_ACCOUNT?
> >
> > The intention was that the pages in cache are always zeroed, so __GFP_ZERO
> > is always implicitly there, at least should have been.
>
> Would it be possible to drop that assumption/requirement, i.e. allow allocation of
> __GFP_UNMAPPED without __GFP_ZERO? At a glance, __GFP_UNMAPPED looks like it would
> be a great fit for backing guest memory, in particular for confidential VMs. And
> for some flavors of CoCo, i.e. TDX, the trusted intermediary is responsible for
> zeroing/initializing guest memory as the untrusted host (kernel/KVM) doesn't have
> access to the guest's encryption key. In other words, zeroing in the kernel would
> be unnecessary work.
Making an unmapped allocation without __GFP_ZERO shouldn't be a problem.
However, using a GFP flag and hooking into the free path of the page
allocator has issues and should preferably be avoided.
Will something like unmapped_alloc() and unmapped_free() work for your
usecase?
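For illustration, a userspace model of what such a pair could look like. The
names unmapped_alloc()/unmapped_free() are taken from the question above, but
everything else (the 2 MB refill unit, the explicit zeroing flag standing in
for __GFP_ZERO) is an assumption for this sketch, not the kernel code:

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE    4096UL
#define REFILL_PAGES 512          /* model a 2 MB refill unit */

/* Free list of page-sized blocks handed back by unmapped_free(). */
static void *cache_head;

/* Refill the cache from one large allocation, modeling the idea that
 * pages removed from the direct map are grouped in a single 2 MB unit. */
static int cache_refill(void)
{
	char *chunk = aligned_alloc(PAGE_SIZE, REFILL_PAGES * PAGE_SIZE);
	unsigned long i;

	if (!chunk)
		return -1;
	for (i = 0; i < REFILL_PAGES; i++) {
		void *page = chunk + i * PAGE_SIZE;

		*(void **)page = cache_head;	/* link into the free list */
		cache_head = page;
	}
	return 0;
}

void *unmapped_alloc(int zero)
{
	void *page;

	if (!cache_head && cache_refill())
		return NULL;
	page = cache_head;
	cache_head = *(void **)page;
	if (zero)		/* zeroing is optional, per the discussion */
		memset(page, 0, PAGE_SIZE);
	return page;
}

void unmapped_free(void *page)
{
	*(void **)page = cache_head;
	cache_head = page;
}
```

A caller like TDX that re-initializes guest memory anyway would simply pass
zero == 0 and skip the memset.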
--
Sincerely yours,
Mike.
On Fri, May 19, 2023, Mike Rapoport wrote:
> On Fri, May 19, 2023 at 08:40:48AM -0700, Sean Christopherson wrote:
> > On Thu, Mar 09, 2023, Mike Rapoport wrote:
> > > On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> > > > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > > > > From: "Mike Rapoport (IBM)" <[email protected]>
> > > > >
> > > > > Hi,
> > > > >
> > > > > This is a third attempt to make page allocator aware of the direct
> > > > > map
> > > > > layout and allow grouping of the pages that must be unmapped from
> > > > > the direct map.
> > > > >
> > > > > This is a new implementation of __GFP_UNMAPPED, kinda a follow-up to
> > > > > this set:
> > > > >
> > > > > https://lore.kernel.org/all/[email protected]
> > > > >
> > > > > but instead of using a migrate type to cache the unmapped pages, the
> > > > > current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> > > > > allocations.
> > > >
> > > > It seems a downside to having a page allocator outside of _the_ page
> > > > allocator is you don't get all of the features that are baked in there.
> > > > For example does secretmem care about numa? I guess in this
> > > > implementation there is just one big cache for all nodes.
> > > >
> > > > Probably most users would want __GFP_ZERO. Would secretmem care about
> > > > __GFP_ACCOUNT?
> > >
> > > The intention was that the pages in cache are always zeroed, so __GFP_ZERO
> > > is always implicitly there, at least should have been.
> >
> > Would it be possible to drop that assumption/requirement, i.e. allow allocation of
> > __GFP_UNMAPPED without __GFP_ZERO? At a glance, __GFP_UNMAPPED looks like it would
> > be a great fit for backing guest memory, in particular for confidential VMs. And
> > for some flavors of CoCo, i.e. TDX, the trusted intermediary is responsible for
> > zeroing/initializing guest memory as the untrusted host (kernel/KVM) doesn't have
> > access to the guest's encryption key. In other words, zeroing in the kernel would
> > be unnecessary work.
>
> Making an unmapped allocation without __GFP_ZERO shouldn't be a problem.
> 
> However, using a GFP flag and hooking into the free path of the page
> allocator has issues and should preferably be avoided.
>
> Will something like unmapped_alloc() and unmapped_free() work for your
> usecase?
Yep, I'm leaning more and more towards having KVM implement its own ioctl() for
managing this type of memory. Wiring that up to use dedicated APIs should be no
problem.
Thanks!
On Fri, May 19 2023 at 08:42, Song Liu wrote:
> On Fri, May 19, 2023 at 1:30 AM Mike Rapoport <[email protected]> wrote:
>> What I was thinking is that we can replace module_alloc() calls in your
>> allocator with something based on my unmapped_alloc(). If we make the part
>> that refills the cache also take care of creating the mapping in the
>> module address space, that should cover everything.
>
> Here are what I found as I work more on the code:
>
> 1. It takes quite some work to create a clean interface and make sure
> all the users of module_alloc can use the new allocator on all archs.
> (archs with text poke need to work with ROX memory from the
> allocator; archs without text poke need to have a clean fall back
> mechanism, etc.). Most of this work is independent of the actual
> allocator, so we can do this part and plug in whatever allocator we
> want (buddy+slab or vmap-based or any other solutions).
>
> 2. vmap_area based solution will work. And it will be one solution for
> both < PAGE_SIZE and > PAGE_SIZE allocations. Given
> module_alloc is not in any hot path (AFAIK), I don't see any
> practical issues with this solution. It will be a little tricky to place
> and name the code, as it uses vmalloc logic, but it is technically a
> module allocator.
>
> I will prioritize building the interface, and migrating users to it. If we
> do this part right, changing the underlying allocator should be
> straightforward.
Correct, that's the only workable approach.
Once you have solved #1 the mechanism of the underlying allocator (#2)
is pretty much exchangeable.
Any attempt to solve #2 before #1 is doomed by definition.
Thanks,
tglx
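Song's point #1 above, the interface that is independent of the underlying
allocator, could be modeled roughly as below. All names here (execmem_ops,
code_alloc(), and so on) are hypothetical placeholders, not an existing kernel
API; the point of the sketch is only the separation between the arch-facing
interface and the swappable backend:

```c
#include <stdlib.h>

/* Hypothetical arch-neutral interface: each architecture supplies ops,
 * and the allocator backend behind them can be exchanged independently. */
struct execmem_ops {
	void *(*alloc)(size_t size);
	void  (*free)(void *ptr);
	int    rox;	/* arch can text-poke into ROX memory */
};

/* Generic fallback for archs without a specialized backend;
 * malloc stands in for module_alloc()/vmalloc here. */
static void *fallback_alloc(size_t size) { return malloc(size); }
static void  fallback_free(void *ptr)    { free(ptr); }

static const struct execmem_ops fallback_ops = {
	.alloc = fallback_alloc,
	.free  = fallback_free,
	.rox   = 0,
};

/* Arch hook: return arch-specific ops, or NULL to use the fallback. */
static const struct execmem_ops *arch_execmem_ops(void)
{
	return NULL;
}

void *code_alloc(size_t size)
{
	const struct execmem_ops *ops = arch_execmem_ops();

	if (!ops)
		ops = &fallback_ops;
	return ops->alloc(size);
}

void code_free(void *ptr)
{
	const struct execmem_ops *ops = arch_execmem_ops();

	if (!ops)
		ops = &fallback_ops;
	ops->free(ptr);
}
```

With this split, plugging in buddy+slab, a vmap_area-based allocator, or an
unmapped-pages cache only means providing a different execmem_ops instance.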
On Fri, May 19, 2023 at 11:25:53AM -0700, Sean Christopherson wrote:
> On Fri, May 19, 2023, Mike Rapoport wrote:
> > On Fri, May 19, 2023 at 08:40:48AM -0700, Sean Christopherson wrote:
> > > On Thu, Mar 09, 2023, Mike Rapoport wrote:
> > > > On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> > >
> > > Would it be possible to drop that assumption/requirement, i.e. allow allocation of
> > > __GFP_UNMAPPED without __GFP_ZERO? At a glance, __GFP_UNMAPPED looks like it would
> > > be a great fit for backing guest memory, in particular for confidential VMs. And
> > > for some flavors of CoCo, i.e. TDX, the trusted intermediary is responsible for
> > > zeroing/initializing guest memory as the untrusted host (kernel/KVM) doesn't have
> > > access to the guest's encryption key. In other words, zeroing in the kernel would
> > > be unnecessary work.
> >
> > Making an unmapped allocation without __GFP_ZERO shouldn't be a problem.
> > 
> > However, using a GFP flag and hooking into the free path of the page
> > allocator has issues and should preferably be avoided.
> >
> > Will something like unmapped_alloc() and unmapped_free() work for your
> > usecase?
>
> Yep, I'm leaning more and more towards having KVM implement its own ioctl() for
> managing this type of memory. Wiring that up to use dedicated APIs should be no
> problem.
Ok, I've dropped the GFP flag and made unmapped_pages_{alloc,free} the only
APIs.
Totally untested version of what I've got is here:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=unmapped-alloc/rfc-v2
I have some thoughts about adding support for 1G pages, but this is still
really vague at the moment.
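A deliberately simplistic userspace model of the unmapped_pages_{alloc,free}
idea: the names match the APIs mentioned above, but the bump-pointer carving
and the unit bookkeeping are assumptions made only to show the grouping, i.e.
that the direct-map split stays confined to one large unit and could be undone
once the whole unit is free again:

```c
#include <stdlib.h>

#define PAGE_SHIFT     12
#define PAGE_SIZE      (1UL << PAGE_SHIFT)
#define PAGES_PER_UNIT 512UL	/* a 2 MB unit; a 1 GB unit would be 262144 */

/* One large unit from which 2^order blocks are carved.  A bump pointer
 * is intentionally simplistic: it shows the grouping, not page reuse. */
static struct unit {
	char *base;
	unsigned long used;	/* pages handed out from this unit */
} unit;

void *unmapped_pages_alloc(int order)
{
	unsigned long npages = 1UL << order;
	void *p;

	if (!unit.base) {
		unit.base = aligned_alloc(PAGE_SIZE,
					  PAGES_PER_UNIT << PAGE_SHIFT);
		if (!unit.base)
			return NULL;
		unit.used = 0;
	}
	if (unit.used + npages > PAGES_PER_UNIT)
		return NULL;

	p = unit.base + (unit.used << PAGE_SHIFT);
	unit.used += npages;
	return p;
}

void unmapped_pages_free(void *pages, int order)
{
	(void)pages;
	unit.used -= 1UL << order;
	if (unit.used == 0) {
		/* whole unit free: this is where the kernel could restore
		 * the large direct-map mapping before releasing the unit */
		free(unit.base);
		unit.base = NULL;
	}
}
```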
> Thanks!
--
Sincerely yours,
Mike.