From: Mike Rapoport <[email protected]>
Hi,
This is a second attempt to make page allocator aware of the direct map
layout and allow grouping of the pages that must be mapped at PTE level in
the direct map.
In a way, this is a v2 of "mm/page_alloc: cache pte-mapped allocations"
(https://lore.kernel.org/all/[email protected])
This time the grouping is implemented as a new migrate type -
MIGRATE_UNMAPPED, and a GFP flag - __GFP_UNMAPPED to request such pages
(more details are below).
I've abandoned the example use-case of page table protection because people
seemed to concentrate on it too much; instead, this set has two other
use-cases as examples of how __GFP_UNMAPPED can be used.
The first one is to switch secretmem to use the new mechanism, which is a
straightforward optimization.
The second use-case is to enable __GFP_UNMAPPED in x86::module_alloc(),
which is essentially used as a method to allocate code pages and thus
requires permission changes for 4K pages in the direct map. This change
reduced the number of large page splits in the direct map from several tens
to a single-digit number for a boot plus several bpftrace runs.
This set is x86-specific at the moment because other architectures either
do not support set_memory APIs that split the direct/linear map (e.g.
PowerPC) or only enable set_memory APIs when the linear map uses basic page
size (like arm64).
== Motivation ==
There are use-cases that need to remove pages from the direct map or at
least map them with 4K granularity. Whenever this is done e.g. with
set_memory/set_direct_map APIs, the PUD and PMD sized mappings in the
direct map are split into smaller pages.
To reduce the performance hit caused by the fragmentation of the direct map,
it makes sense to group and/or cache the 4K pages removed from the direct
map so that the split large pages won't be all over the place.
There were RFCs for grouped page allocations for vmalloc permissions [1]
and for using PKS to protect page tables [2], as well as an attempt to use a
pool of large pages in secretmem [3].
== Implementation overview ==
The pages that need to be removed from the direct map are grouped in the
free lists in a dedicated migrate type called MIGRATE_UNMAPPED.
When there is a page allocation request with __GFP_UNMAPPED set and the
MIGRATE_UNMAPPED free list is empty, a page of the requested order is stolen
from another migrate type, just as it would happen for the existing
fallback processing. The difference is that the pageblock containing that
page is remapped with PTEs in the direct map, and all the free pages in
that pageblock are moved to MIGRATE_UNMAPPED unconditionally.
The pages are actually unmapped only at the end of post_alloc_hook() and
they are mapped back in free_pages_prepare() to keep the existing page
initialization, clearing and poisoning logic intact.
(This makes the MIGRATE_UNMAPPED name imprecise, but I don't really like
MIGRATE_PTE_MAPPED and have no better ideas.)
In this prototype __alloc_pages_bulk() does not deal with __GFP_UNMAPPED
and this flag cannot be used with SL*B allocators.
[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/[email protected]
[3] https://lore.kernel.org/lkml/[email protected]
Mike Rapoport (3):
mm/page_alloc: introduce __GFP_UNMAPPED and MIGRATE_UNMAPPED
mm/secretmem: use __GFP_UNMAPPED to allocate pages
EXPERIMENTAL: x86/module: use __GFP_UNMAPPED in module_alloc
arch/Kconfig | 7 ++
arch/x86/Kconfig | 1 +
arch/x86/kernel/module.c | 2 +-
include/linux/gfp.h | 13 +++-
include/linux/mmzone.h | 11 +++
include/trace/events/mmflags.h | 3 +-
mm/internal.h | 2 +-
mm/page_alloc.c | 129 ++++++++++++++++++++++++++++++++-
mm/secretmem.c | 8 +-
9 files changed, 162 insertions(+), 14 deletions(-)
base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
--
2.34.1
From: Mike Rapoport <[email protected]>
Currently secretmem explicitly removes allocated pages from the direct map.
This fragments the direct map because allocated pages may reside in
different pageblocks.
Use __GFP_UNMAPPED to utilize caching of unmapped pages done by the page
allocator.
Signed-off-by: Mike Rapoport <[email protected]>
---
mm/secretmem.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 22b310adb53d..878ef004d7a7 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -63,16 +63,10 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
retry:
page = find_lock_page(mapping, offset);
if (!page) {
- page = alloc_page(gfp | __GFP_ZERO);
+ page = alloc_page(gfp | __GFP_ZERO | __GFP_UNMAPPED);
if (!page)
return VM_FAULT_OOM;
- err = set_direct_map_invalid_noflush(page);
- if (err) {
- put_page(page);
- return vmf_error(err);
- }
-
__SetPageUptodate(page);
err = add_to_page_cache_lru(page, mapping, offset, gfp);
if (unlikely(err)) {
--
2.34.1
From: Mike Rapoport <[email protected]>
The permissions of pages allocated by module_alloc() are frequently updated
using set_memory and set_direct_map APIs. Such permission changes cause
fragmentation of the direct map.
Since module_alloc() essentially wraps vmalloc(), the memory allocated by
it is mapped in the vmalloc area and can be completely removed from the
direct map.
Use __GFP_UNMAPPED to utilize caching of unmapped pages done by the page
allocator.
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/x86/kernel/module.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 95fa745e310a..70aa6ec0cc16 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -75,7 +75,7 @@ void *module_alloc(unsigned long size)
p = __vmalloc_node_range(size, MODULE_ALIGN,
MODULES_VADDR + get_module_load_offset(),
- MODULES_END, gfp_mask,
+ MODULES_END, gfp_mask | __GFP_UNMAPPED,
PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
__builtin_return_address(0));
if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) {
--
2.34.1
From: Mike Rapoport <[email protected]>
When the set_memory or set_direct_map APIs are used to change attributes or
permissions of chunks spanning several pages, the large PMD that maps these
pages in the direct map must be split. Fragmenting the direct map in such a
manner causes TLB pressure and, eventually, performance degradation.
To avoid excessive direct map fragmentation, add the ability to allocate
"unmapped" pages with a __GFP_UNMAPPED flag and a new migrate type
MIGRATE_UNMAPPED that contains pages mapped with PTEs in the direct map.
When the allocation flags include __GFP_UNMAPPED, the allocated page(s) are
remapped at PTE level in the direct map along with the other free pages in
the same pageblock. The free pages in that pageblock are placed on the
MIGRATE_UNMAPPED free lists, and subsequent __GFP_UNMAPPED requests are
served from the MIGRATE_UNMAPPED free lists.
This way, a single split of a large PMD creates a cache of PTE-mapped pages
that can be used without the need to split another PMD.
The pages are removed from the direct map late during allocation to keep
the existing logic in prep_new_page() for clearing, poisoning and other
page content accesses.
When a page from MIGRATE_UNMAPPED pageblock is freed, it is restored in the
direct map to allow proper page contents access in free_pages_prepare().
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/Kconfig | 7 ++
arch/x86/Kconfig | 1 +
include/linux/gfp.h | 13 +++-
include/linux/mmzone.h | 11 +++
include/trace/events/mmflags.h | 3 +-
mm/internal.h | 2 +-
mm/page_alloc.c | 129 ++++++++++++++++++++++++++++++++-
7 files changed, 160 insertions(+), 6 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 678a80713b21..e16a70ed0b2d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1321,6 +1321,13 @@ config DYNAMIC_SIGFRAME
# Select, if arch has a named attribute group bound to NUMA device nodes.
config HAVE_ARCH_NODE_DEV_GROUP
bool
+#
+# Select if the architecture wants to minimize fragmentation of its
+# direct/linear map caused by set_memory and set_direct_map operations
+#
+config ARCH_WANTS_GFP_UNMAPPED
+ bool
+ depends on ARCH_HAS_SET_MEMORY || ARCH_HAS_SET_DIRECT_MAP
source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ebe8fc76949a..564e97c88ef0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -120,6 +120,7 @@ config X86
select ARCH_WANTS_NO_INSTR
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_LD_ORPHAN_WARN
+ select ARCH_WANTS_GFP_UNMAPPED
select ARCH_WANTS_THP_SWAP if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 80f63c862be5..63b8d3b2711d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -55,8 +55,9 @@ struct vm_area_struct;
#define ___GFP_ACCOUNT 0x400000u
#define ___GFP_ZEROTAGS 0x800000u
#define ___GFP_SKIP_KASAN_POISON 0x1000000u
+#define ___GFP_UNMAPPED 0x2000000u
#ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP 0x2000000u
+#define ___GFP_NOLOCKDEP 0x4000000u
#else
#define ___GFP_NOLOCKDEP 0
#endif
@@ -101,12 +102,15 @@ struct vm_area_struct;
* node with no fallbacks or placement policy enforcements.
*
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_UNMAPPED removes the allocated pages from the direct map.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_UNMAPPED ((__force gfp_t)___GFP_UNMAPPED)
/**
* DOC: Watermark modifiers
@@ -249,7 +253,7 @@ struct vm_area_struct;
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/**
@@ -348,6 +352,11 @@ static inline int gfp_migratetype(const gfp_t gfp_flags)
BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
+#ifdef CONFIG_ARCH_WANTS_GFP_UNMAPPED
+ if (unlikely(gfp_flags & __GFP_UNMAPPED))
+ return MIGRATE_UNMAPPED;
+#endif
+
if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aed44e9b5d89..7971af97b6cf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -43,6 +43,9 @@ enum migratetype {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
+#ifdef CONFIG_ARCH_WANTS_GFP_UNMAPPED
+ MIGRATE_UNMAPPED,
+#endif
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
@@ -78,6 +81,14 @@ extern const char * const migratetype_names[MIGRATE_TYPES];
# define is_migrate_cma_page(_page) false
#endif
+#ifdef CONFIG_ARCH_WANTS_GFP_UNMAPPED
+# define is_migrate_unmapped(migratetype) unlikely((migratetype) == MIGRATE_UNMAPPED)
+# define is_migrate_unmapped_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_UNMAPPED)
+#else
+# define is_migrate_unmapped(migratetype) false
+# define is_migrate_unmapped_page(_page) false
+#endif
+
static inline bool is_migrate_movable(int mt)
{
return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..501f37f4095d 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,7 +50,8 @@
{(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\
{(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\
{(unsigned long)__GFP_ZEROTAGS, "__GFP_ZEROTAGS"}, \
- {(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"}\
+ {(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"},\
+ {(unsigned long)__GFP_UNMAPPED, "__GFP_UNMAPPED"} \
#define show_gfp_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/mm/internal.h b/mm/internal.h
index d80300392a19..3edc3abf3f56 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -32,7 +32,7 @@ struct folio_batch;
#define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
/* Do not use these with a slab allocator */
-#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
+#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|__GFP_UNMAPPED|~__GFP_BITS_MASK)
void page_writeback_init(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..da6b1bb912a8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -75,6 +75,7 @@
#include <linux/khugepaged.h>
#include <linux/buffer_head.h>
#include <linux/delayacct.h>
+#include <linux/set_memory.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -82,6 +83,12 @@
#include "shuffle.h"
#include "page_reporting.h"
+/*
+ * FIXME: add a proper definition in include/linux/mm.h once remaining
+ * definitions of PMD_ORDER in arch/ are updated
+ */
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
+
/* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
typedef int __bitwise fpi_t;
@@ -319,6 +326,9 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Movable",
"Reclaimable",
+#ifdef CONFIG_ARCH_WANTS_GFP_UNMAPPED
+ "Unmapped",
+#endif
"HighAtomic",
#ifdef CONFIG_CMA
"CMA",
@@ -938,9 +948,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
if (!capc || order != capc->cc->order)
return false;
- /* Do not accidentally pollute CMA or isolated regions*/
+ /* Do not accidentally pollute CMA or isolated or unmapped regions */
if (is_migrate_cma(migratetype) ||
- is_migrate_isolate(migratetype))
+ is_migrate_isolate(migratetype) ||
+ is_migrate_unmapped(migratetype))
return false;
/*
@@ -1143,6 +1154,17 @@ static inline void __free_one_page(struct page *page,
done_merging:
set_buddy_order(page, order);
+#if 0
+ /*
+ * FIXME: collapse PMD-size page in the direct map and move the
+ * pageblock from MIGRATE_UNMAPPED to another migrate type.
+ */
+ if ((order == PMD_ORDER) && is_migrate_unmapped_page(page)) {
+ set_direct_map_PMD(page);
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ }
+#endif
+
if (fpi_flags & FPI_TO_TAIL)
to_tail = true;
else if (is_shuffle_order(order))
@@ -1271,6 +1293,40 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
return ret;
}
+static void migrate_unmapped_map_pages(struct page *page, unsigned int nr)
+{
+#ifdef CONFIG_ARCH_WANTS_GFP_UNMAPPED
+ int i;
+
+ if (!is_migrate_unmapped_page(page))
+ return;
+
+ for (i = 0; i < nr; i++)
+ set_direct_map_default_noflush(page + i);
+#endif
+}
+
+static void migrate_unmapped_unmap_pages(struct page *page, unsigned int nr,
+ gfp_t gfp)
+{
+#ifdef CONFIG_ARCH_WANTS_GFP_UNMAPPED
+ unsigned long start = (unsigned long)page_address(page);
+ unsigned long end = start + nr * PAGE_SIZE;
+ int i;
+
+ if (!(gfp & __GFP_UNMAPPED))
+ return;
+
+ if (!is_migrate_unmapped_page(page))
+ return;
+
+ for (i = 0; i < nr; i++)
+ set_direct_map_invalid_noflush(page + i);
+
+ flush_tlb_kernel_range(start, end);
+#endif
+}
+
static void kernel_init_free_pages(struct page *page, int numpages, bool zero_tags)
{
int i;
@@ -1359,6 +1415,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
PAGE_SIZE << order);
}
+ migrate_unmapped_map_pages(page, 1 << order);
kernel_poison_pages(page, 1 << order);
/*
@@ -2426,6 +2483,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
+ migrate_unmapped_unmap_pages(page, 1 << order, gfp_flags);
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
@@ -2480,6 +2538,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* This array describes the order lists are fallen back to when
* the free lists for the desirable migrate type are depleted
*/
+#ifndef CONFIG_ARCH_WANTS_GFP_UNMAPPED
static int fallbacks[MIGRATE_TYPES][3] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
@@ -2492,6 +2551,22 @@ static int fallbacks[MIGRATE_TYPES][3] = {
#endif
};
+#else /* CONFIG_ARCH_WANTS_GFP_UNMAPPED */
+
+static int fallbacks[MIGRATE_TYPES][4] = {
+ [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_UNMAPPED, MIGRATE_TYPES },
+ [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_UNMAPPED, MIGRATE_TYPES },
+ [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_UNMAPPED, MIGRATE_TYPES },
+ [MIGRATE_UNMAPPED] = { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
+#ifdef CONFIG_CMA
+ [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
+#endif
+#ifdef CONFIG_MEMORY_ISOLATION
+ [MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */
+#endif
+};
+#endif /* CONFIG_ARCH_WANTS_GFP_UNMAPPED */
+
#ifdef CONFIG_CMA
static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
unsigned int order)
@@ -2567,6 +2642,39 @@ int move_freepages_block(struct zone *zone, struct page *page,
num_movable);
}
+static int set_pageblock_unmapped(struct zone *zone, struct page *page,
+ unsigned int order)
+{
+#ifdef CONFIG_ARCH_WANTS_GFP_UNMAPPED
+ int migratetype = get_pageblock_migratetype(page);
+ unsigned long err;
+
+ BUILD_BUG_ON(pageblock_order != PMD_ORDER);
+
+ if (is_migrate_unmapped_page(page))
+ return 0;
+
+ /*
+ * Calling set_direct_map_invalid_noflush() for any page in a
+ * pageblock will split the PMD entry, and the split may fail to
+ * allocate a page table page.
+ * Subsequent calls to set_direct_map APIs within the same
+ * pageblock will only update the PTEs, so they cannot fail.
+ */
+ err = set_direct_map_invalid_noflush(page);
+ if (err) {
+ move_to_free_list(page, zone, order, migratetype);
+ return err;
+ }
+
+ set_direct_map_default_noflush(page);
+ set_pageblock_migratetype(page, MIGRATE_UNMAPPED);
+ move_freepages_block(zone, page, MIGRATE_UNMAPPED, NULL);
+#endif
+
+ return 0;
+}
+
static void change_pageblock_range(struct page *pageblock_page,
int start_order, int migratetype)
{
@@ -2605,6 +2713,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
if (order >= pageblock_order / 2 ||
start_mt == MIGRATE_RECLAIMABLE ||
start_mt == MIGRATE_UNMOVABLE ||
+ is_migrate_unmapped(start_mt) ||
page_group_by_mobility_disabled)
return true;
@@ -2672,6 +2781,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
if (is_migrate_highatomic(old_block_type))
goto single_page;
+ /*
+ * If the new migrate type is MIGRATE_UNMAPPED, the entire
+ * pageblock will be moved, but it is handled later in
+ * get_page_from_freelist() to allow error handling and recovery
+ */
+ if (is_migrate_unmapped(start_type))
+ goto single_page;
+
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order) {
change_pageblock_range(page, current_order, start_type);
@@ -4162,6 +4279,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
page = rmqueue(ac->preferred_zoneref->zone, zone, order,
gfp_mask, alloc_flags, ac->migratetype);
if (page) {
+ if ((gfp_mask & __GFP_UNMAPPED) &&
+ set_pageblock_unmapped(zone, page, order))
+ return NULL;
+
prep_new_page(page, order, gfp_mask, alloc_flags);
/*
@@ -5241,6 +5362,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
if (nr_pages - nr_populated == 1)
goto failed;
+ /* Bulk allocator does not support __GFP_UNMAPPED */
+ if (gfp & __GFP_UNMAPPED)
+ goto failed;
+
#ifdef CONFIG_PAGE_OWNER
/*
* PAGE_OWNER may recurse into the allocator to allocate space to
--
2.34.1
Hello Hyeonggon,
On Tue, Apr 26, 2022 at 05:54:49PM +0900, Hyeonggon Yoo wrote:
> On Thu, Jan 27, 2022 at 10:56:05AM +0200, Mike Rapoport wrote:
> > From: Mike Rapoport <[email protected]>
> >
> > Hi,
> >
> > This is a second attempt to make page allocator aware of the direct map
> > layout and allow grouping of the pages that must be mapped at PTE level in
> > the direct map.
> >
>
> Hello mike, It may be a silly question...
>
> Looking at implementation of set_memory*(), they only split
> PMD/PUD-sized entries. But why not _merge_ them when all entries
> have same permissions after changing permission of an entry?
>
> I think grouping __GFP_UNMAPPED allocations would help reducing
> direct map fragmentation, but IMHO merging split entries seems better
> to be done in those helpers than in page allocator.
Maybe. I didn't get as far as trying to merge split entries in the direct
map. IIRC, Kirill sent a patch for collapsing huge pages in the direct map
some time ago, but there still was something that had to initiate the
collapse.
> For example:
> 1) set_memory_ro() splits 1 RW PMD entry into 511 RW PTE
> entries and 1 RO PTE entry.
>
> 2) before freeing the pages, we call set_memory_rw() and we have
> 512 RW PTE entries. Then we can merge it to 1 RW PMD entry.
For this we need to check permissions of all 512 pages to make sure we can
use a PMD entry to map them.
Not sure that doing the scan in each set_memory call won't cause an overall
slowdown.
> 3) after 2) we can do same thing about PMD-sized entries
> and merge them into 1 PUD entry if 512 PMD entries have
> same permissions.
>
> [...]
>
> > Mike Rapoport (3):
> > mm/page_alloc: introduce __GFP_UNMAPPED and MIGRATE_UNMAPPED
> > mm/secretmem: use __GFP_UNMAPPED to allocate pages
> > EXPERIMENTAL: x86/module: use __GFP_UNMAPPED in module_alloc
> >
> > arch/Kconfig | 7 ++
> > arch/x86/Kconfig | 1 +
> > arch/x86/kernel/module.c | 2 +-
> > include/linux/gfp.h | 13 +++-
> > include/linux/mmzone.h | 11 +++
> > include/trace/events/mmflags.h | 3 +-
> > mm/internal.h | 2 +-
> > mm/page_alloc.c | 129 ++++++++++++++++++++++++++++++++-
> > mm/secretmem.c | 8 +-
> > 9 files changed, 162 insertions(+), 14 deletions(-)
> >
> >
> > base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
> > --
> > 2.34.1
> >
>
> --
> Thanks,
> Hyeonggon
--
Sincerely yours,
Mike.
On Thu, Jan 27, 2022 at 10:56:05AM +0200, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Hi,
>
> This is a second attempt to make page allocator aware of the direct map
> layout and allow grouping of the pages that must be mapped at PTE level in
> the direct map.
>
Hello Mike, it may be a silly question...
Looking at the implementation of set_memory*(), they only split
PMD/PUD-sized entries. But why not _merge_ them when all entries
have the same permissions after changing the permission of an entry?
I think grouping __GFP_UNMAPPED allocations would help reduce
direct map fragmentation, but IMHO merging split entries seems better
done in those helpers than in the page allocator.
For example:
1) set_memory_ro() splits 1 RW PMD entry into 511 RW PTE
entries and 1 RO PTE entry.
2) before freeing the pages, we call set_memory_rw() and we have
512 RW PTE entries. Then we can merge it to 1 RW PMD entry.
3) after 2) we can do same thing about PMD-sized entries
and merge them into 1 PUD entry if 512 PMD entries have
same permissions.
[...]
> Mike Rapoport (3):
> mm/page_alloc: introduce __GFP_UNMAPPED and MIGRATE_UNMAPPED
> mm/secretmem: use __GFP_UNMAPPED to allocate pages
> EXPERIMENTAL: x86/module: use __GFP_UNMAPPED in module_alloc
>
> arch/Kconfig | 7 ++
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/module.c | 2 +-
> include/linux/gfp.h | 13 +++-
> include/linux/mmzone.h | 11 +++
> include/trace/events/mmflags.h | 3 +-
> mm/internal.h | 2 +-
> mm/page_alloc.c | 129 ++++++++++++++++++++++++++++++++-
> mm/secretmem.c | 8 +-
> 9 files changed, 162 insertions(+), 14 deletions(-)
>
>
> base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
> --
> 2.34.1
>
--
Thanks,
Hyeonggon
On Tue, Apr 26, 2022 at 06:21:57PM +0300, Mike Rapoport wrote:
> Hello Hyeonggon,
>
> On Tue, Apr 26, 2022 at 05:54:49PM +0900, Hyeonggon Yoo wrote:
> > On Thu, Jan 27, 2022 at 10:56:05AM +0200, Mike Rapoport wrote:
> > > From: Mike Rapoport <[email protected]>
> > >
> > > Hi,
> > >
> > > This is a second attempt to make page allocator aware of the direct map
> > > layout and allow grouping of the pages that must be mapped at PTE level in
> > > the direct map.
> > >
> >
> > Hello mike, It may be a silly question...
> >
> > Looking at implementation of set_memory*(), they only split
> > PMD/PUD-sized entries. But why not _merge_ them when all entries
> > have same permissions after changing permission of an entry?
> >
> > I think grouping __GFP_UNMAPPED allocations would help reducing
> > direct map fragmentation, but IMHO merging split entries seems better
> > to be done in those helpers than in page allocator.
>
> Maybe, I didn't got as far as to try merging split entries in the direct
> map. IIRC, Kirill sent a patch for collapsing huge pages in the direct map
> some time ago, but there still was something that had to initiate the
> collapse.
But in this case the buddy allocator's view of the direct map is quite limited.
It cannot merge 2M entries into a 1G entry, as it does not support
such big allocations. Also, it cannot merge entries for pages freed during the
boot process, as they weren't allocated from the page allocator.
And it will become harder when pages in MIGRATE_UNMAPPED are borrowed
from another migrate type...
So it would be nice if we could efficiently merge mappings in
change_page_attr_set(); this approach can handle the cases above.
I think in this case grouping allocations and merging mappings
should be done separately.
> > For example:
> > 1) set_memory_ro() splits 1 RW PMD entry into 511 RW PTE
> > entries and 1 RO PTE entry.
> >
> > 2) before freeing the pages, we call set_memory_rw() and we have
> > 512 RW PTE entries. Then we can merge it to 1 RW PMD entry.
>
> For this we need to check permissions of all 512 pages to make sure we can
> use a PMD entry to map them.
Of course that may be slow. Maybe one way to optimize this is to use some bits
in struct page, something like: each bit of page->direct_map_split (unsigned long)
is set when at least one of its (PTRS_PER_PTE = 512) / (BITS_PER_LONG = 64) = 8
entries has special permissions.
Then we just need to set the corresponding bit when splitting mappings and
iterate over 8 entries when changing permissions back again (and then unset the
bit when all 8 entries have the usual permissions). We can decide to merge by
checking whether page->direct_map_split is zero.
When scanning, 8 entries would fit into one cacheline.
Any other ideas?
> Not sure that doing the scan in each set_memory call won't cause an overall
> slowdown.
I think we can evaluate it by measuring boot time and bpf/module
load/unload time.
Is there any other workload that is directly affected
by performance of set_memory*()?
> > 3) after 2) we can do same thing about PMD-sized entries
> > and merge them into 1 PUD entry if 512 PMD entries have
> > same permissions.
> > [...]
> > > Mike Rapoport (3):
> > > mm/page_alloc: introduce __GFP_UNMAPPED and MIGRATE_UNMAPPED
> > > mm/secretmem: use __GFP_UNMAPPED to allocate pages
> > > EXPERIMENTAL: x86/module: use __GFP_UNMAPPED in module_alloc
> > --
> > Thanks,
> > Hyeonggon
>
> --
> Sincerely yours,
> Mike.
On Sat, Apr 30, 2022 at 01:44:16PM +0000, Hyeonggon Yoo wrote:
> On Tue, Apr 26, 2022 at 06:21:57PM +0300, Mike Rapoport wrote:
> > Hello Hyeonggon,
> >
> > On Tue, Apr 26, 2022 at 05:54:49PM +0900, Hyeonggon Yoo wrote:
> > > On Thu, Jan 27, 2022 at 10:56:05AM +0200, Mike Rapoport wrote:
> > > > From: Mike Rapoport <[email protected]>
> > > >
> > > > Hi,
> > > >
> > > > This is a second attempt to make page allocator aware of the direct map
> > > > layout and allow grouping of the pages that must be mapped at PTE level in
> > > > the direct map.
> > > >
> > >
> > > Hello mike, It may be a silly question...
> > >
> > > Looking at implementation of set_memory*(), they only split
> > > PMD/PUD-sized entries. But why not _merge_ them when all entries
> > > have same permissions after changing permission of an entry?
> > >
> > > I think grouping __GFP_UNMAPPED allocations would help reducing
> > > direct map fragmentation, but IMHO merging split entries seems better
> > > to be done in those helpers than in page allocator.
> >
> > Maybe, I didn't got as far as to try merging split entries in the direct
> > map. IIRC, Kirill sent a patch for collapsing huge pages in the direct map
> > some time ago, but there still was something that had to initiate the
> > collapse.
>
> But in this case buddy allocator's view of direct map is quite limited.
> It cannot merge 2M entries to 1G entry as it does not support
> big allocations. Also it cannot merge entries of pages freed in boot process
> as they weren't allocated from page allocator.
>
> And it will become harder when pages in MIGRATE_UNMAPPED is borrowed
> from another migrate type....
>
> So it would be nice if we can efficiently merge mappings in
> change_page_attr_set(). this approach can handle cases above.
>
> I think in this case grouping allocations and merging mappings
> should be done separately.
I've added the provision to merge the mappings in __free_one_page() because
at that spot we know for sure we can replace multiple PTEs with a single
PMD.
I'm not saying there should be no additional mechanism for collapsing
direct map pages, but I don't know when and how it should be invoked.
> > > For example:
> > > 1) set_memory_ro() splits 1 RW PMD entry into 511 RW PTE
> > > entries and 1 RO PTE entry.
> > >
> > > 2) before freeing the pages, we call set_memory_rw() and we have
> > > 512 RW PTE entries. Then we can merge it to 1 RW PMD entry.
> >
> > For this we need to check permissions of all 512 pages to make sure we can
> > use a PMD entry to map them.
>
> Of course that may be slow. Maybe one way to optimize this is using some bits
> in struct page, something like: each bit of page->direct_map_split (unsigned long)
> is set when at least one entry in (PTRS_PER_PTE = 512)/(BITS_PER_LONG = 64) = 8 entries
> has special permissions.
>
> Then we just need to set the corresponding bit when splitting mappings and
> iterate 8 entries when changing permission back again. (and then unset the bit when 8 entries has
> usual permissions). we can decide to merge by checking if page->direct_map_split is zero.
>
> When scanning, 8 entries would fit into one cacheline.
>
> Any other ideas?
>
> > Not sure that doing the scan in each set_memory call won't cause an overall
> > slowdown.
>
> I think we can evaluate it by measuring boot time and bpf/module
> load/unload time.
>
> Is there any other workload that is directly affected
> by performance of set_memory*()?
>
> > > 3) after 2) we can do same thing about PMD-sized entries
> > > and merge them into 1 PUD entry if 512 PMD entries have
> > > same permissions.
> > > [...]
> > > > Mike Rapoport (3):
> > > > mm/page_alloc: introduce __GFP_UNMAPPED and MIGRATE_UNMAPPED
> > > > mm/secretmem: use __GFP_UNMAPPED to allocate pages
> > > > EXPERIMENTAL: x86/module: use __GFP_UNMAPPED in module_alloc
> > > --
> > > Thanks,
> > > Hyeonggon
> >
> > --
> > Sincerely yours,
> > Mike.
--
Sincerely yours,
Mike.
On Mon, May 02, 2022 at 09:44:48PM -0700, Mike Rapoport wrote:
> On Sat, Apr 30, 2022 at 01:44:16PM +0000, Hyeonggon Yoo wrote:
> > On Tue, Apr 26, 2022 at 06:21:57PM +0300, Mike Rapoport wrote:
> > > Hello Hyeonggon,
> > >
> > > On Tue, Apr 26, 2022 at 05:54:49PM +0900, Hyeonggon Yoo wrote:
> > > > On Thu, Jan 27, 2022 at 10:56:05AM +0200, Mike Rapoport wrote:
> > > > > From: Mike Rapoport <[email protected]>
> > > > >
> > > > > Hi,
> > > > >
> > > > > This is a second attempt to make page allocator aware of the direct map
> > > > > layout and allow grouping of the pages that must be mapped at PTE level in
> > > > > the direct map.
> > > > >
> > > >
> > > > Hello mike, It may be a silly question...
> > > >
> > > > Looking at implementation of set_memory*(), they only split
> > > > PMD/PUD-sized entries. But why not _merge_ them when all entries
> > > > have same permissions after changing permission of an entry?
> > > >
> > > > I think grouping __GFP_UNMAPPED allocations would help reducing
> > > > direct map fragmentation, but IMHO merging split entries seems better
> > > > to be done in those helpers than in page allocator.
> > >
> > > Maybe, I didn't got as far as to try merging split entries in the direct
> > > map. IIRC, Kirill sent a patch for collapsing huge pages in the direct map
> > > some time ago, but there still was something that had to initiate the
> > > collapse.
> >
> > But in this case buddy allocator's view of direct map is quite limited.
> > It cannot merge 2M entries to 1G entry as it does not support
> > big allocations. Also it cannot merge entries of pages freed in boot process
> > as they weren't allocated from page allocator.
> >
> > And it will become harder when pages in MIGRATE_UNMAPPED is borrowed
> > from another migrate type....
> >
> > So it would be nice if we can efficiently merge mappings in
> > change_page_attr_set(). this approach can handle cases above.
> >
> > I think in this case grouping allocations and merging mappings
> > should be done separately.
>
> I've added the provision to merge the mappings in __free_one_page() because
> at that spot we know for sure we can replace multiple PTEs with a single
> PMD.
Actually, no external merging mechanism is needed if CPA supports merging
mappings.

Recently I started implementing the idea I described above. The approach
is slightly different: instead of scanning the page table, it maintains
a count of the mappings that have non-standard protection bits
("non-standard" meaning the pgprot is not equal to PAGE_KERNEL).

It increments split_count when a standard mapping becomes non-standard,
decrements it in the opposite case, and merges the mappings when the
count reaches zero.

Updating the count and merging are done in __change_page_attr(), which
is called by set_memory_{rw,ro}(),
set_direct_map_{default,invalid}_noflush(), etc.

The implementation resembles the revert_page() function that existed in
arch/i386/mm/pageattr.c decades ago...

There are still some issues: 1) set_memory_4k()-ed memory must not be
merged, and 2) we need to be extremely sure that the count is always
valid.

But I think this approach is definitely worth trying. I'll send an RFC
version to the list after a bit more work.

Still, I think grouping allocations using a migrate type would work well
together with a merging feature in CPA.
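A minimal sketch of that counting scheme (a hypothetical userspace model: only the increment/decrement/merge-at-zero logic mirrors the proposal; the names and the PAGE_KERNEL stand-in value are illustrative, not actual CPA code):

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_KERNEL 0x163UL	/* stand-in for the standard pgprot */

struct pmd_region {
	unsigned int split_count;	/* 4K mappings with non-standard pgprot */
	bool pte_mapped;		/* region split to PTE level? */
};

/* called for each 4K entry that a (hypothetical) __change_page_attr()
 * touches; merges the region back when the count drops to zero */
static void update_split_count(struct pmd_region *r,
			       unsigned long old_prot, unsigned long new_prot)
{
	bool was_std = (old_prot == PAGE_KERNEL);
	bool is_std = (new_prot == PAGE_KERNEL);

	if (was_std && !is_std) {
		r->split_count++;		/* standard -> non-standard */
		r->pte_mapped = true;		/* split (or stay split) */
	} else if (!was_std && is_std) {
		r->split_count--;		/* non-standard -> standard */
		if (r->split_count == 0)
			r->pte_mapped = false;	/* merge back to one PMD */
	}
}
```

This avoids any page-table scan at merge time, at the cost of the two issues noted above: set_memory_4k()-ed ranges must be excluded, and the count must never go stale.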
Thanks!
Hyeonggon
> I'm not saying there should be no additional mechanism for collapsing
> direct map pages, but I don't know when and how it should be invoked.
>
> > > > For example:
> > > > 1) set_memory_ro() splits 1 RW PMD entry into 511 RW PTE
> > > > entries and 1 RO PTE entry.
> > > >
> > > > 2) before freeing the pages, we call set_memory_rw() and we have
> > > > 512 RW PTE entries. Then we can merge it to 1 RW PMD entry.
> > >
> > > For this we need to check permissions of all 512 pages to make sure we can
> > > use a PMD entry to map them.
> >
> > Of course that may be slow. Maybe one way to optimize this is using some bits
> > in struct page, something like: each bit of page->direct_map_split (unsigned long)
> > is set when at least one entry in (PTRS_PER_PTE = 512)/(BITS_PER_LONG = 64) = 8 entries
> > has special permissions.
> >
> > Then we just need to set the corresponding bit when splitting mappings and
> > iterate 8 entries when changing permission back again. (and then unset the bit when 8 entries has
> > usual permissions). we can decide to merge by checking if page->direct_map_split is zero.
> >
> > When scanning, 8 entries would fit into one cacheline.
> >
> > Any other ideas?
> >
> > > Not sure that doing the scan in each set_memory call won't cause an overall
> > > slowdown.
> >
> > I think we can evaluate it by measuring boot time and bpf/module
> > load/unload time.
> >
> > Is there any other workload that is directly affected
> > by performance of set_memory*()?
> >
> > > > 3) after 2) we can do same thing about PMD-sized entries
> > > > and merge them into 1 PUD entry if 512 PMD entries have
> > > > same permissions.
--
Thanks,
Hyeonggon
On Mon, May 02, 2022 at 09:44:48PM -0700, Mike Rapoport wrote:
>
> I've added the provision to merge the mappings in __free_one_page() because
> at that spot we know for sure we can replace multiple PTEs with a single
> PMD.
>
> I'm not saying there should be no additional mechanism for collapsing
> direct map pages, but I don't know when and how it should be invoked.
>
I'm still thinking about a way to accurately track the number of split
pages, because tracking it only in the CPA code may be inaccurate when
the kernel page table is changed outside CPA.

In case you wonder, my code is available at:
https://github.com/hygoni/linux/tree/merge-mapping-v1r3

It also adds vmstat items:
# cat /proc/vmstat | grep direct_map
direct_map_level2_splits 1079
direct_map_level3_splits 6
direct_map_level1_merges 1079
direct_map_level2_merges 6
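The counts suggest that a merge is accounted against the level being merged *from*: undoing a level-2 (2M) split shows up as a level-1 merge, which is presumably why level2_splits pairs with level1_merges (1079) and level3_splits with level2_merges (6). A toy model of that bookkeeping (names mirror the vmstat items above; purely illustrative, not the actual implementation):

```c
#include <assert.h>

enum pg_level { LEVEL_1 = 1, LEVEL_2, LEVEL_3 };	/* 4K, 2M, 1G */

/* splits indexed by the level being split,
 * merges indexed by the level being merged from */
static unsigned long level_splits[4];
static unsigned long level_merges[4];

static void count_split(enum pg_level level)	/* e.g. one 2M -> 512 x 4K */
{
	level_splits[level]++;
}

static void count_merge(enum pg_level from)	/* e.g. 512 x 4K -> one 2M */
{
	level_merges[from]++;
}
```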
--
Thanks,
Hyeonggon