2023-04-18 19:15:20

by Johannes Weiner

Subject: [RFC PATCH 00/26] mm: reliable huge page allocator

As memory capacity continues to grow, 4k TLB coverage has not been
able to keep up. On Meta's 64G webservers, close to 20% of execution
cycles are observed to be handling TLB misses when using 4k pages
only. Huge pages are shifting from being a nice-to-have optimization
for HPC workloads to becoming a necessity for common applications.

However, while trying to deploy THP more universally, we observe a
fragmentation problem in the page allocator that often prevents larger
requests from being met quickly, or met at all, at runtime. Since we
have to provision hardware capacity for worst case performance,
unreliable huge page coverage isn't of much help.

Drilling into the allocator, we find that existing defrag efforts,
such as mobility grouping and watermark boosting, help, but are
insufficient by themselves. We still observe a high number of blocks
being routinely shared by allocations of different migratetypes. This
in turn results in inefficient or ineffective reclaim/compaction runs.

In a broad sample of Meta servers, we find that unmovable allocations
make up less than 7% of total memory on average, yet occupy 34% of the
2M blocks in the system. We also found that this effect isn't
correlated with high uptimes, and that servers can get heavily
fragmented within the first hour of running a workload.

The following experiment shows that only 20min of build load under
moderate memory pressure already results in a significant number of
typemixed blocks (block analysis run after system is back to idle):

vanilla:
unmovable 50
movable 701
reclaimable 149
unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)

patched:
unmovable 65
movable 457
reclaimable 159
free 219
unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)

[ The remaining "mixed blocks" in the patched kernel are false
positives: LRU pages without migrate callbacks (empty_aops), and
i915 shmem that is pinned until reclaimed through shrinkers. ]

Root causes

One of the behaviors that sabotage the page allocator's mobility
grouping is the fact that requests of one migratetype are allowed to
fall back into blocks of another type before reclaim and compaction
occur. This is a design decision to prioritize memory utilization over
block fragmentation - especially considering the history of lumpy
reclaim and its tendency to overreclaim. However, with compaction
available, these two goals are no longer in conflict: the scratch
space of free pages that compaction needs to work with is only twice
the size of the allocation request; in most cases, only a small amount
of proactive, coordinated reclaim and compaction is required to
prevent a fallback which may fragment a pageblock indefinitely.

Another problem lies in how the page allocator drives reclaim and
compaction when it does invoke it. While the page allocator targets
migratetype grouping at the pageblock level, it calls reclaim and
compaction with the order of the allocation request. As most requests
are smaller than a pageblock, this results in partial block freeing
and subsequent fallbacks and type mixing.

Note that in combination, these two design decisions have a
self-reinforcing effect on fragmentation: 1. Partially used unmovable
blocks are filled up with fallback movable pages. 2. A subsequent
unmovable allocation, instead of grouping up, will then need to enter
reclaim, which most likely results in a partially freed movable block
that it falls back into. Over time, unmovable allocations are sparsely
scattered throughout the address space and poison many pageblocks.

Note that block fragmentation is driven by lower-order requests. It is
not reliably mitigated by the mere presence of higher-order requests.

Proposal

This series proposes to make THP allocations reliable by enforcing
pageblock hygiene, and aligning the allocator, reclaim and compaction
on the pageblock as the base unit for managing free memory. All orders
up to and including the pageblock are made first-class requests that
(outside of OOM situations) are expected to succeed without
exceptional investment by the allocating thread.

A neutral pageblock type is introduced, MIGRATE_FREE. The first
allocation to be placed into such a block claims it exclusively for
the allocation's migratetype. Fallbacks from a different type are no
longer allowed, and the block is "kept open" for more allocations of
the same type to ensure tight grouping. A pageblock becomes neutral
again only once all its pages have been freed.
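
[ Illustration: the claiming rules above, condensed into a small
  userspace model. This is not kernel code; the struct, the helpers
  and the per-block free counter are made up for demonstration. ]

/*
 * Userspace model of the MIGRATE_FREE lifecycle described above.
 * Not kernel code; names and the per-block counter are illustrative.
 */
#include <stdio.h>

#define PAGES_PER_BLOCK 512             /* 2M pageblock with 4k pages */

enum blocktype { UNMOVABLE, MOVABLE, RECLAIMABLE, FREE };

struct block {
        enum blocktype type;
        int free_pages;
};

/* The first allocation claims a neutral block; fallbacks are refused */
static int block_alloc(struct block *b, enum blocktype type, int pages)
{
        if (b->type == FREE)
                b->type = type;         /* claim for this migratetype */
        else if (b->type != type)
                return -1;              /* no cross-type fallback */
        if (b->free_pages < pages)
                return -1;
        b->free_pages -= pages;
        return 0;
}

/* A block becomes neutral again only once all its pages are free */
static void block_free(struct block *b, int pages)
{
        b->free_pages += pages;
        if (b->free_pages == PAGES_PER_BLOCK)
                b->type = FREE;
}

int main(void)
{
        struct block b = { FREE, PAGES_PER_BLOCK };

        if (block_alloc(&b, UNMOVABLE, 1) == 0)         /* claims the block */
                printf("block claimed by unmovable\n");
        if (block_alloc(&b, MOVABLE, 1) < 0)            /* would mix types */
                printf("movable request refused\n");
        block_free(&b, 1);                              /* fully free again */
        printf("block is %s\n", b.type == FREE ? "neutral" : "typed");
        return 0;
}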

Reclaim and compaction are changed from reclaiming partial blocks to
producing whole neutral pageblocks. The watermark logic is adjusted
to apply to neutral blocks, ensuring that background and direct
reclaim always maintain a readily-available reserve of them.

The defragmentation effort changes from reactive to proactive. In
turn, this makes defragmentation actually more efficient: compaction
only has to scan movable blocks and can skip other blocks entirely;
since movable blocks aren't poisoned by unmovable pages, the chances
of successful compaction in each block are greatly improved as well.

Defragmentation becomes an ongoing responsibility of all allocations,
rather than being the burden of only higher-order asks. This prevents
sub-block allocations - which cause block fragmentation in the first
place - from starving the increasingly important larger requests.

There is a slight increase in worst-case memory overhead by requiring
the watermarks to be met against neutral blocks even when there might
be free pages in typed blocks. However, the high watermarks are less
than 1% of the zone, so the increase is relatively small.

These changes only apply to CONFIG_COMPACTION kernels. Without
compaction, fallbacks and partial block reclaim remain the best
trade-off between memory utilization and fragmentation.

Initial Test Results

The following is purely an allocation reliability test. Achieving full
THP benefits in practice is tied to other pending changes, such as the
THP shrinker to avoid memory pressure from excessive internal
fragmentation, and tweaks to the kernel's THP allocation strategy.

The test is a kernel build under moderate-to-high memory pressure,
with a concurrent process trying to repeatedly fault THPs (madvise):

HUGEALLOC-VANILLA HUGEALLOC-PATCHED
Real time 265.04 ( +0.00%) 268.12 ( +1.16%)
User time 1131.05 ( +0.00%) 1131.13 ( +0.01%)
System time 474.66 ( +0.00%) 478.97 ( +0.91%)
THP fault alloc 17913.24 ( +0.00%) 19647.50 ( +9.68%)
THP fault fallback 1947.12 ( +0.00%) 223.40 ( -88.48%)
THP fault fail rate % 9.80 ( +0.00%) 1.12 ( -80.34%)
Direct compact stall 282.44 ( +0.00%) 543.90 ( +92.25%)
Direct compact fail 262.44 ( +0.00%) 239.90 ( -8.56%)
Direct compact success 20.00 ( +0.00%) 304.00 ( +1352.38%)
Direct compact success rate % 7.15 ( +0.00%) 57.10 ( +612.90%)
Compact daemon scanned migrate 21643.80 ( +0.00%) 387479.80 ( +1690.18%)
Compact daemon scanned free 188462.36 ( +0.00%) 2842824.10 ( +1408.42%)
Compact direct scanned migrate 1601294.84 ( +0.00%) 275670.70 ( -82.78%)
Compact direct scanned free 4476155.60 ( +0.00%) 2438835.00 ( -45.51%)
Compact migrate scanned daemon % 1.32 ( +0.00%) 59.18 ( +2499.00%)
Compact free scanned daemon % 3.95 ( +0.00%) 54.31 ( +1018.20%)
Alloc stall 2425.00 ( +0.00%) 992.00 ( -59.07%)
Pages kswapd scanned 586756.68 ( +0.00%) 975390.20 ( +66.23%)
Pages kswapd reclaimed 385468.20 ( +0.00%) 437767.50 ( +13.57%)
Pages direct scanned 335199.56 ( +0.00%) 501824.20 ( +49.71%)
Pages direct reclaimed 127953.72 ( +0.00%) 151880.70 ( +18.70%)
Pages scanned kswapd % 64.43 ( +0.00%) 66.39 ( +2.99%)
Swap out 14083.88 ( +0.00%) 45034.60 ( +219.74%)
Swap in 3395.08 ( +0.00%) 7767.50 ( +128.75%)
File refaults 93546.68 ( +0.00%) 129648.30 ( +38.59%)

The THP fault success rate is drastically improved. A bigger share of
the work is done by the background threads, as they now proactively
maintain MIGRATE_FREE block reserves. The increase in memory pressure
is shown by the uptick in swap activity.

Status

Initial test results look promising, but production testing has been
lagging behind the effort to generalize this code for upstream and to
put all the pieces together to make THP work. I'll follow up as I
gather more data.

Sending this out now as an RFC to get input on the overall direction.

The patches are based on v6.2.

Documentation/admin-guide/sysctl/vm.rst | 21 -
block/bdev.c | 2 +-
include/linux/compaction.h | 100 +---
include/linux/gfp.h | 2 -
include/linux/mm.h | 1 -
include/linux/mmzone.h | 30 +-
include/linux/page-isolation.h | 28 +-
include/linux/pageblock-flags.h | 4 +-
include/linux/vmstat.h | 8 -
include/trace/events/mmflags.h | 4 +-
kernel/sysctl.c | 8 -
mm/compaction.c | 242 +++-----
mm/internal.h | 14 +-
mm/memory_hotplug.c | 4 +-
mm/page_alloc.c | 930 +++++++++++++-----------------
mm/page_isolation.c | 42 +-
mm/vmscan.c | 251 ++------
mm/vmstat.c | 6 +-
18 files changed, 629 insertions(+), 1068 deletions(-)



2023-04-18 19:15:32

by Johannes Weiner

Subject: [RFC PATCH 05/26] mm: page_alloc: per-migratetype pcplist for THPs

Right now, there is only one pcplist for THP allocations. However,
while most THPs are movable, the huge zero page is not. This means a
movable THP allocation can grab an unmovable block from the pcplist,
and a subsequent THP split, partial free, and reallocation of the
remainder will mix movable and unmovable pages in the block.

While this isn't a huge source of block pollution in practice, it
happens often enough to trigger debug warnings fairly quickly under
load. In the interest of tightening up pageblock hygiene, make the THP
pcplists fully migratetype-aware, just like the lower order ones.

Signed-off-by: Johannes Weiner <[email protected]>
---
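
To illustrate the resulting pcplist layout, here is a small userspace
sketch of the index mapping, mirroring the hunks below. It is not the
kernel code; the constants assume MIGRATE_PCPTYPES == 3,
PAGE_ALLOC_COSTLY_ORDER == 3 and a pageblock order of 9 (4k pages).

#include <stdio.h>

#define MIGRATE_PCPTYPES        3       /* unmovable, movable, reclaimable */
#define PAGE_ALLOC_COSTLY_ORDER 3
#define PAGEBLOCK_ORDER         9
#define NR_LOWORDER_PCP_LISTS   (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))

/* Mirrors the patched order_to_pindex()/pindex_to_order() round trip */
static int order_to_pindex(int migratetype, int order)
{
        if (order > PAGE_ALLOC_COSTLY_ORDER)    /* THP: one list per type now */
                return NR_LOWORDER_PCP_LISTS + migratetype;
        return MIGRATE_PCPTYPES * order + migratetype;
}

static int pindex_to_order(int pindex)
{
        if (pindex >= NR_LOWORDER_PCP_LISTS)
                return PAGEBLOCK_ORDER;
        return pindex / MIGRATE_PCPTYPES;
}

int main(void)
{
        for (int mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                int pindex = order_to_pindex(mt, PAGEBLOCK_ORDER);

                printf("THP migratetype %d -> pcplist %d (order %d)\n",
                       mt, pindex, pindex_to_order(pindex));
        }
        return 0;
}
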
include/linux/mmzone.h | 8 +++-----
mm/page_alloc.c | 4 ++--
2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cd28a100d9e4..53e55882a4e7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -552,13 +552,11 @@ enum zone_watermarks {
};

/*
- * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
- * for THP which will usually be GFP_MOVABLE. Even if it is another type,
- * it should not contribute to serious fragmentation causing THP allocation
- * failures.
+ * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional set
+ * for THP (usually GFP_MOVABLE, but with exception of the huge zero page.)
*/
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP MIGRATE_PCPTYPES
#else
#define NR_PCP_THP 0
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e04a69f6a26..d3d01019ce77 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -710,7 +710,7 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (order > PAGE_ALLOC_COSTLY_ORDER) {
VM_BUG_ON(order != pageblock_order);
- return NR_LOWORDER_PCP_LISTS;
+ return NR_LOWORDER_PCP_LISTS + migratetype;
}
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -724,7 +724,7 @@ static inline int pindex_to_order(unsigned int pindex)
int order = pindex / MIGRATE_PCPTYPES;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (pindex == NR_LOWORDER_PCP_LISTS)
+ if (pindex >= NR_LOWORDER_PCP_LISTS)
order = pageblock_order;
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
--
2.39.2

2023-04-18 19:15:35

by Johannes Weiner

Subject: [RFC PATCH 06/26] mm: page_alloc: consolidate free page accounting

Free page accounting currently happens a bit too high up the call
stack, where it has to deal with guard pages, compaction capturing,
block stealing and even page isolation. This is subtle and fragile,
and makes it difficult to hack on the code.

Push the accounting down to where pages enter and leave the physical
freelists, where all these higher-level exceptions are of no concern.

Signed-off-by: Johannes Weiner <[email protected]>
---
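
The invariant after this patch is that the free counters change in
exactly one place: when a page enters or leaves a freelist. A minimal
userspace model of why that removes the need for fixups in the
higher-level paths (illustrative only, not the kernel code):

#include <assert.h>
#include <stdio.h>

struct zone {
        long nr_free_pages;     /* NR_FREE_PAGES in the kernel */
};

static void account_freepages(struct zone *zone, long nr_pages)
{
        zone->nr_free_pages += nr_pages;
}

/* Accounting happens only where pages enter or leave a freelist... */
static void add_to_free_list(struct zone *zone, int order)
{
        /* list_add() would go here */
        account_freepages(zone, 1L << order);
}

static void del_page_from_free_list(struct zone *zone, int order)
{
        /* list_del() would go here */
        account_freepages(zone, -(1L << order));
}

/* ...so split/expand-style paths balance out without manual fixups */
int main(void)
{
        struct zone zone = { 0 };

        add_to_free_list(&zone, 9);             /* free one 2M block */
        del_page_from_free_list(&zone, 9);      /* rmqueue takes it... */
        for (int order = 0; order < 9; order++) /* ...expand() returns the tail */
                add_to_free_list(&zone, order);

        assert(zone.nr_free_pages == 511);      /* 1 page allocated, 511 free */
        printf("%ld pages free\n", zone.nr_free_pages);
        return 0;
}
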
include/linux/page-isolation.h | 2 +-
include/linux/vmstat.h | 8 --
mm/internal.h | 5 --
mm/page_alloc.c | 153 +++++++++++++++++----------------
mm/page_isolation.c | 13 ++-
5 files changed, 83 insertions(+), 98 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 0ab089e89db4..b519fffb3dee 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -35,7 +35,7 @@ static inline bool is_migrate_isolate(int migratetype)

void set_pageblock_migratetype(struct page *page, int migratetype);
int move_freepages_block(struct zone *zone, struct page *page,
- int migratetype, int *num_movable);
+ int old_mt, int new_mt, int *num_movable);

int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int flags, gfp_t gfp_flags);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 19cf5b6892ce..219ccf3f91cd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -481,14 +481,6 @@ static inline void node_stat_sub_folio(struct folio *folio,
mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
}

-static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
- int migratetype)
-{
- __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
- if (is_migrate_cma(migratetype))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
-}
-
extern const char * const vmstat_text[];

static inline const char *zone_stat_name(enum zone_stat_item item)
diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8b032d..024affd4e4b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -781,11 +781,6 @@ static inline bool is_migrate_highatomic(enum migratetype migratetype)
return migratetype == MIGRATE_HIGHATOMIC;
}

-static inline bool is_migrate_highatomic_page(struct page *page)
-{
- return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
-}
-
void setup_zone_pageset(struct zone *zone);

struct migration_target_control {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d3d01019ce77..b9366c002334 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -843,7 +843,7 @@ static int __init debug_guardpage_minorder_setup(char *buf)
early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);

static inline bool set_page_guard(struct zone *zone, struct page *page,
- unsigned int order, int migratetype)
+ unsigned int order)
{
if (!debug_guardpage_enabled())
return false;
@@ -854,15 +854,12 @@ static inline bool set_page_guard(struct zone *zone, struct page *page,
__SetPageGuard(page);
INIT_LIST_HEAD(&page->buddy_list);
set_page_private(page, order);
- /* Guard pages are not available for any usage */
- if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, -(1 << order), migratetype);

return true;
}

static inline void clear_page_guard(struct zone *zone, struct page *page,
- unsigned int order, int migratetype)
+ unsigned int order)
{
if (!debug_guardpage_enabled())
return;
@@ -870,14 +867,12 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
__ClearPageGuard(page);

set_page_private(page, 0);
- if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, (1 << order), migratetype);
}
#else
static inline bool set_page_guard(struct zone *zone, struct page *page,
- unsigned int order, int migratetype) { return false; }
+ unsigned int order) { return false; }
static inline void clear_page_guard(struct zone *zone, struct page *page,
- unsigned int order, int migratetype) {}
+ unsigned int order) {}
#endif

/*
@@ -994,24 +989,36 @@ compaction_capture(struct capture_control *capc, struct page *page,
}
#endif /* CONFIG_COMPACTION */

-/* Used for pages not on another list */
-static inline void add_to_free_list(struct page *page, struct zone *zone,
- unsigned int order, int migratetype)
+static inline void account_freepages(struct page *page, struct zone *zone,
+ int nr_pages, int migratetype)
{
- struct free_area *area = &zone->free_area[order];
+ if (is_migrate_isolate(migratetype))
+ return;

- list_add(&page->buddy_list, &area->free_list[migratetype]);
- area->nr_free++;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
+
+ if (is_migrate_cma(migratetype))
+ __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}

/* Used for pages not on another list */
-static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
- unsigned int order, int migratetype)
+static inline void add_to_free_list(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype,
+ bool tail)
{
struct free_area *area = &zone->free_area[order];

- list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+ VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
+ "page type is %lu, passed migratetype is %d (nr=%d)\n",
+ get_pageblock_migratetype(page), migratetype, 1 << order);
+
+ if (tail)
+ list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+ else
+ list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
+
+ account_freepages(page, zone, 1 << order, migratetype);
}

/*
@@ -1020,16 +1027,23 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
* allocation again (e.g., optimization for memory onlining).
*/
static inline void move_to_free_list(struct page *page, struct zone *zone,
- unsigned int order, int migratetype)
+ unsigned int order, int old_mt, int new_mt)
{
struct free_area *area = &zone->free_area[order];

- list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
+ list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+
+ account_freepages(page, zone, -(1 << order), old_mt);
+ account_freepages(page, zone, 1 << order, new_mt);
}

static inline void del_page_from_free_list(struct page *page, struct zone *zone,
- unsigned int order)
+ unsigned int order, int migratetype)
{
+ VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
+ "page type is %lu, passed migratetype is %d (nr=%d)\n",
+ get_pageblock_migratetype(page), migratetype, 1 << order);
+
/* clear reported state and update reported page count */
if (page_reported(page))
__ClearPageReported(page);
@@ -1038,6 +1052,8 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
__ClearPageBuddy(page);
set_page_private(page, 0);
zone->free_area[order].nr_free--;
+
+ account_freepages(page, zone, -(1 << order), migratetype);
}

/*
@@ -1104,23 +1120,21 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

VM_BUG_ON(migratetype == -1);
- if (likely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
-
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);

while (order < MAX_ORDER - 1) {
- if (compaction_capture(capc, page, order, migratetype)) {
- __mod_zone_freepage_state(zone, -(1 << order),
- migratetype);
+ int buddy_mt;
+
+ if (compaction_capture(capc, page, order, migratetype))
return;
- }

buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
if (!buddy)
goto done_merging;

+ buddy_mt = get_pageblock_migratetype(buddy);
+
if (unlikely(order >= pageblock_order)) {
/*
* We want to prevent merge between freepages on pageblock
@@ -1128,8 +1142,6 @@ static inline void __free_one_page(struct page *page,
* pageblock isolation could cause incorrect freepage or CMA
* accounting or HIGHATOMIC accounting.
*/
- int buddy_mt = get_pageblock_migratetype(buddy);
-
if (migratetype != buddy_mt
&& (!migratetype_is_mergeable(migratetype) ||
!migratetype_is_mergeable(buddy_mt)))
@@ -1141,9 +1153,9 @@ static inline void __free_one_page(struct page *page,
* merge with it and move up one order.
*/
if (page_is_guard(buddy))
- clear_page_guard(zone, buddy, order, migratetype);
+ clear_page_guard(zone, buddy, order);
else
- del_page_from_free_list(buddy, zone, order);
+ del_page_from_free_list(buddy, zone, order, buddy_mt);
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
@@ -1160,10 +1172,7 @@ static inline void __free_one_page(struct page *page,
else
to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);

- if (to_tail)
- add_to_free_list_tail(page, zone, order, migratetype);
- else
- add_to_free_list(page, zone, order, migratetype);
+ add_to_free_list(page, zone, order, migratetype, to_tail);

/* Notify page reporting subsystem of freed page */
if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
@@ -1205,10 +1214,8 @@ int split_free_page(struct page *free_page,
}

mt = get_pageblock_migratetype(free_page);
- if (likely(!is_migrate_isolate(mt)))
- __mod_zone_freepage_state(zone, -(1UL << order), mt);
+ del_page_from_free_list(free_page, zone, order, mt);

- del_page_from_free_list(free_page, zone, order);
for (pfn = free_page_pfn;
pfn < free_page_pfn + (1UL << order);) {
int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
@@ -2352,10 +2359,10 @@ static inline void expand(struct zone *zone, struct page *page,
* Corresponding page table entries will not be touched,
* pages will stay not present in virtual address space
*/
- if (set_page_guard(zone, &page[size], high, migratetype))
+ if (set_page_guard(zone, &page[size], high))
continue;

- add_to_free_list(&page[size], zone, high, migratetype);
+ add_to_free_list(&page[size], zone, high, migratetype, false);
set_buddy_order(&page[size], high);
}
}
@@ -2559,11 +2566,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,

/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+ int actual_mt;
+
area = &(zone->free_area[current_order]);
page = get_page_from_free_area(area, migratetype);
if (!page)
continue;
- del_page_from_free_list(page, zone, current_order);
+ /* move_freepages_block() may strand types on wrong list */
+ actual_mt = get_pageblock_migratetype(page);
+ del_page_from_free_list(page, zone, current_order, actual_mt);
expand(zone, page, order, current_order, migratetype);
set_pcppage_migratetype(page, migratetype);
trace_mm_page_alloc_zone_locked(page, order, migratetype,
@@ -2606,7 +2617,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
*/
static int move_freepages(struct zone *zone,
unsigned long start_pfn, unsigned long end_pfn,
- int migratetype, int *num_movable)
+ int old_mt, int new_mt, int *num_movable)
{
struct page *page;
unsigned long pfn;
@@ -2633,7 +2644,7 @@ static int move_freepages(struct zone *zone,
VM_BUG_ON_PAGE(page_zone(page) != zone, page);

order = buddy_order(page);
- move_to_free_list(page, zone, order, migratetype);
+ move_to_free_list(page, zone, order, old_mt, new_mt);
pfn += 1 << order;
pages_moved += 1 << order;
}
@@ -2642,7 +2653,7 @@ static int move_freepages(struct zone *zone,
}

int move_freepages_block(struct zone *zone, struct page *page,
- int migratetype, int *num_movable)
+ int old_mt, int new_mt, int *num_movable)
{
unsigned long start_pfn, end_pfn, pfn;

@@ -2659,8 +2670,8 @@ int move_freepages_block(struct zone *zone, struct page *page,
if (!zone_spans_pfn(zone, end_pfn))
return 0;

- return move_freepages(zone, start_pfn, end_pfn, migratetype,
- num_movable);
+ return move_freepages(zone, start_pfn, end_pfn,
+ old_mt, new_mt, num_movable);
}

static void change_pageblock_range(struct page *pageblock_page,
@@ -2786,8 +2797,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
if (!whole_block)
goto single_page;

- free_pages = move_freepages_block(zone, page, start_type,
- &movable_pages);
+ free_pages = move_freepages_block(zone, page, old_block_type,
+ start_type, &movable_pages);
+
/*
* Determine how many pages are compatible with our allocation.
* For movable allocation, it's the number of movable pages which
@@ -2825,7 +2837,8 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
return;

single_page:
- move_to_free_list(page, zone, current_order, start_type);
+ move_to_free_list(page, zone, current_order,
+ old_block_type, start_type);
}

/*
@@ -2895,7 +2908,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
if (migratetype_is_mergeable(mt)) {
zone->nr_reserved_highatomic += pageblock_nr_pages;
set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
- move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
+ move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC, NULL);
}

out_unlock:
@@ -2935,11 +2948,13 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
struct free_area *area = &(zone->free_area[order]);
+ int mt;

page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
if (!page)
continue;

+ mt = get_pageblock_migratetype(page);
/*
* In page freeing path, migratetype change is racy so
* we can counter several free pages in a pageblock
@@ -2947,7 +2962,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* from highatomic to ac->migratetype. So we should
* adjust the count once.
*/
- if (is_migrate_highatomic_page(page)) {
+ if (is_migrate_highatomic(mt)) {
/*
* It should never happen but changes to
* locking could inadvertently allow a per-cpu
@@ -2970,8 +2985,8 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* may increase.
*/
set_pageblock_migratetype(page, ac->migratetype);
- ret = move_freepages_block(zone, page, ac->migratetype,
- NULL);
+ ret = move_freepages_block(zone, page, mt,
+ ac->migratetype, NULL);
if (ret) {
spin_unlock_irqrestore(&zone->lock, flags);
return ret;
@@ -3142,18 +3157,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
list_add_tail(&page->pcp_list, list);
allocated++;
- if (is_migrate_cma(get_pcppage_migratetype(page)))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
- -(1 << order));
}
-
- /*
- * i pages were removed from the buddy list even if some leak due
- * to check_pcp_refill failing so adjust NR_FREE_PAGES based
- * on i. Do not confuse with 'allocated' which is the number of
- * pages added to the pcp list.
- */
- __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock_irqrestore(&zone->lock, flags);
return allocated;
}
@@ -3615,11 +3619,9 @@ int __isolate_free_page(struct page *page, unsigned int order)
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
return 0;
-
- __mod_zone_freepage_state(zone, -(1UL << order), mt);
}

- del_page_from_free_list(page, zone, order);
+ del_page_from_free_list(page, zone, order, mt);

/*
* Set the pageblock if the isolated page is at least half of a
@@ -3715,8 +3717,6 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
return NULL;
}
}
- __mod_zone_freepage_state(zone, -(1 << order),
- get_pcppage_migratetype(page));
spin_unlock_irqrestore(&zone->lock, flags);
} while (check_new_pages(page, order));

@@ -9578,8 +9578,9 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)

BUG_ON(page_count(page));
BUG_ON(!PageBuddy(page));
+ VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
- del_page_from_free_list(page, zone, order);
+ del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
@@ -9630,11 +9631,12 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
current_buddy = page + size;
}

- if (set_page_guard(zone, current_buddy, high, migratetype))
+ if (set_page_guard(zone, current_buddy, high))
continue;

if (current_buddy != target) {
- add_to_free_list(current_buddy, zone, high, migratetype);
+ add_to_free_list(current_buddy, zone, high,
+ migratetype, false);
set_buddy_order(current_buddy, high);
page = next_page;
}
@@ -9662,12 +9664,11 @@ bool take_page_off_buddy(struct page *page)
int migratetype = get_pfnblock_migratetype(page_head,
pfn_head);

- del_page_from_free_list(page_head, zone, page_order);
+ del_page_from_free_list(page_head, zone, page_order,
+ migratetype);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
SetPageHWPoisonTakenOff(page);
- if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, -1, migratetype);
ret = true;
break;
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b67800f7f6b1..e119a37ac661 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -183,10 +183,8 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_

set_pageblock_migratetype(page, MIGRATE_ISOLATE);
zone->nr_isolate_pageblock++;
- nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
- NULL);
-
- __mod_zone_freepage_state(zone, -nr_pages, mt);
+ nr_pages = move_freepages_block(zone, page, mt,
+ MIGRATE_ISOLATE, NULL);
spin_unlock_irqrestore(&zone->lock, flags);
return 0;
}
@@ -251,10 +249,9 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
* onlining - just onlined memory won't immediately be considered for
* allocation.
*/
- if (!isolated_page) {
- nr_pages = move_freepages_block(zone, page, migratetype, NULL);
- __mod_zone_freepage_state(zone, nr_pages, migratetype);
- }
+ if (!isolated_page)
+ nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
+ migratetype, NULL);
set_pageblock_migratetype(page, migratetype);
if (isolated_page)
__putback_isolated_page(page, order, migratetype);
--
2.39.2

2023-04-18 19:15:45

by Johannes Weiner

Subject: [RFC PATCH 10/26] mm: page_alloc: allow compaction capturing from larger blocks

Currently, capturing only works on matching orders and matching
migratetypes. However, if capturing is initially skipped on the
migratetype, it's possible that merging continues up to a full
pageblock, in which case the migratetype is up for grabs again.

Allow capturing to grab smaller chunks from claimed pageblocks, and
expand the remainder of the block back onto the freelists.

Signed-off-by: Johannes Weiner <[email protected]>
---
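
As an example of the arithmetic (with a hypothetical order-4 request):
if a freed page merges up to an order-9 buddy that is eligible for
capture, the capturer takes the order-4 head and expand() returns the
order 4-8 remainders to the freelists. A userspace sketch of that
split, not the kernel code:

#include <stdio.h>

/* Return the untaken halves to the freelists, largest first */
static void expand(int low, int high)
{
        while (high > low) {
                high--;
                printf("  put order-%d buddy (%d pages) back on freelist\n",
                       high, 1 << high);
        }
}

int main(void)
{
        int request_order = 4;          /* what the compaction caller wants */
        int merged_order = 9;           /* what the free page merged up to */

        printf("capture order-%d head for the caller\n", request_order);
        expand(request_order, merged_order);
        return 0;
}
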
mm/page_alloc.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd86f80d7bbe..5ebfcf18537b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1067,10 +1067,10 @@ static inline struct capture_control *task_capc(struct zone *zone)
}

static inline bool
-compaction_capture(struct capture_control *capc, struct page *page,
- int order, int migratetype)
+compaction_capture(struct zone *zone, struct page *page, int order,
+ int migratetype, struct capture_control *capc)
{
- if (!capc || order != capc->cc->order)
+ if (!capc || order < capc->cc->order)
return false;

/* Do not accidentally pollute CMA or isolated regions*/
@@ -1092,6 +1092,9 @@ compaction_capture(struct capture_control *capc, struct page *page,
return false;
}

+ if (order > capc->cc->order)
+ expand(zone, page, capc->cc->order, order, migratetype);
+
capc->page = page;
return true;
}
@@ -1103,8 +1106,8 @@ static inline struct capture_control *task_capc(struct zone *zone)
}

static inline bool
-compaction_capture(struct capture_control *capc, struct page *page,
- int order, int migratetype)
+compaction_capture(struct zone *zone, struct page *page, int order,
+ int migratetype, struct capture_control *capc)
{
return false;
}
@@ -1180,7 +1183,7 @@ static inline void __free_one_page(struct page *page,
while (order < MAX_ORDER - 1) {
int buddy_mt;

- if (compaction_capture(capc, page, order, migratetype))
+ if (compaction_capture(zone, page, order, migratetype, capc))
return;

buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
--
2.39.2

2023-04-18 19:15:46

by Johannes Weiner

Subject: [RFC PATCH 03/26] mm: make pageblock_order 2M per default

pageblock_order can be of various sizes, depending on configuration,
but the default is MAX_ORDER-1. Given 4k pages, that comes out to
4M. This is a large chunk for the allocator/reclaim/compaction to try
to keep grouped per migratetype. It's also unnecessary as the majority
of higher order allocations - THP and slab - are smaller than that.

Before subsequent patches increase the effort that goes into
maintaining migratetype isolation, it's important to first set the
defrag block size to what's likely to have common consumers.

Signed-off-by: Johannes Weiner <[email protected]>
---
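
The expression evaluates to a 2M block for the common base page
sizes. A quick userspace check of the arithmetic (illustrative only):

#include <stdio.h>

/* pageblock_order = ilog2(2M / PAGE_SIZE), evaluated for a few page sizes */
static unsigned int ilog2(unsigned int x)
{
        unsigned int log = 0;

        while (x >>= 1)
                log++;
        return log;
}

int main(void)
{
        unsigned int page_shifts[] = { 12, 14, 16 };    /* 4k, 16k, 64k */

        for (int i = 0; i < 3; i++) {
                unsigned int shift = page_shifts[i];
                unsigned int order = ilog2(2U << (20 - shift));

                printf("PAGE_SHIFT=%u: pageblock_order=%u (%u pages, %u kB)\n",
                       shift, order, 1U << order, (1U << order) << (shift - 10));
        }
        return 0;
}
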
include/linux/pageblock-flags.h | 4 ++--
mm/page_alloc.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 5f1ae07d724b..05b6811f8cee 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -47,8 +47,8 @@ extern unsigned int pageblock_order;

#else /* CONFIG_HUGETLB_PAGE */

-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order (MAX_ORDER-1)
+/* Manage fragmentation at the 2M level */
+#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))

#endif /* CONFIG_HUGETLB_PAGE */

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac03571e0532..5e04a69f6a26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
void __init set_pageblock_order(void)
{
- unsigned int order = MAX_ORDER - 1;
+ unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));

/* Check that pageblock_nr_pages has not already been setup */
if (pageblock_order)
--
2.39.2

2023-04-18 19:15:47

by Johannes Weiner

Subject: [RFC PATCH 11/26] mm: page_alloc: introduce MIGRATE_FREE

To cut down on type mixing, put empty pageblocks on separate freelists
and make them the first fallback preference before stealing space from
incompatible blocks.

The neutral block designation will also be handy in subsequent patches
that: simplify compaction; add per-mt freelist counts and make
compaction_suitable() more precise; and ultimately make pageblocks the
basis of free memory management.

Signed-off-by: Johannes Weiner <[email protected]>
---
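
To illustrate what the new fallback order means in practice, here is a
small userspace model of the lookup. It is not the kernel's
find_suitable_fallback(); freelist occupancy is simulated with plain
counters.

#include <stdio.h>

enum mt { UNMOVABLE, MOVABLE, RECLAIMABLE, HIGHATOMIC, FREE, NR_MT };

static const char * const names[NR_MT] = {
        "Unmovable", "Movable", "Reclaimable", "HighAtomic", "Free"
};

/* Every type now tries empty (FREE) blocks before stealing from others */
static const int fallbacks[][4] = {
        [UNMOVABLE]   = { FREE, RECLAIMABLE, MOVABLE, -1 },
        [MOVABLE]     = { FREE, RECLAIMABLE, UNMOVABLE, -1 },
        [RECLAIMABLE] = { FREE, UNMOVABLE, MOVABLE, -1 },
};

/* Pick the first fallback type that still has free blocks */
static int find_fallback(int start_mt, const int nr_free[NR_MT])
{
        for (int i = 0; fallbacks[start_mt][i] != -1; i++)
                if (nr_free[fallbacks[start_mt][i]])
                        return fallbacks[start_mt][i];
        return -1;
}

int main(void)
{
        /* a few neutral blocks left, plus typed blocks of each kind */
        int nr_free[NR_MT] = { 2, 10, 1, 0, 3 };

        printf("unmovable falls back to: %s\n",
               names[find_fallback(UNMOVABLE, nr_free)]);

        nr_free[FREE] = 0;      /* neutral blocks exhausted */
        printf("unmovable falls back to: %s\n",
               names[find_fallback(UNMOVABLE, nr_free)]);
        return 0;
}
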
include/linux/mmzone.h | 3 +-
mm/memory_hotplug.c | 4 +--
mm/page_alloc.c | 63 +++++++++++++++++++++++++++++++-----------
3 files changed, 51 insertions(+), 19 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 53e55882a4e7..20542e5a0a43 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -45,6 +45,7 @@ enum migratetype {
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
+ MIGRATE_FREE,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
@@ -88,7 +89,7 @@ static inline bool is_migrate_movable(int mt)
*/
static inline bool migratetype_is_mergeable(int mt)
{
- return mt < MIGRATE_PCPTYPES;
+ return mt < MIGRATE_PCPTYPES || mt == MIGRATE_FREE;
}

#define for_each_migratetype_order(order, type) \
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fd40f7e9f176..d7b9f0e70b58 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1129,7 +1129,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
build_all_zonelists(NULL);

/* Basic onlining is complete, allow allocation of onlined pages. */
- undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
+ undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_FREE);

/*
* Freshly onlined pages aren't shuffled (e.g., all pages are placed to
@@ -1951,7 +1951,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,

failed_removal_isolated:
/* pushback to free area */
- undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+ undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_FREE);
memory_notify(MEM_CANCEL_OFFLINE, &arg);
failed_removal_pcplists_disabled:
lru_cache_enable();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ebfcf18537b..44da23625f51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -380,6 +380,7 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
"Movable",
"Reclaimable",
"HighAtomic",
+ "Free",
#ifdef CONFIG_CMA
"CMA",
#endif
@@ -1222,6 +1223,13 @@ static inline void __free_one_page(struct page *page,
done_merging:
set_buddy_order(page, order);

+ /* If we freed one or more pageblocks, mark them free. */
+ if (unlikely(order >= pageblock_order &&
+ migratetype_is_mergeable(migratetype))) {
+ change_pageblock_range(page, order, MIGRATE_FREE);
+ migratetype = MIGRATE_FREE;
+ }
+
if (fpi_flags & FPI_TO_TAIL)
to_tail = true;
else if (is_shuffle_order(order))
@@ -1961,14 +1969,14 @@ static void __init deferred_free_range(unsigned long pfn,

/* Free a large naturally-aligned chunk if possible */
if (nr_pages == pageblock_nr_pages && pageblock_aligned(pfn)) {
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ set_pageblock_migratetype(page, MIGRATE_FREE);
__free_pages_core(page, pageblock_order);
return;
}

for (i = 0; i < nr_pages; i++, page++, pfn++) {
if (pageblock_aligned(pfn))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ set_pageblock_migratetype(page, MIGRATE_FREE);
__free_pages_core(page, 0);
}
}
@@ -2612,10 +2620,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
*
* The other migratetypes do not have fallbacks.
*/
-static int fallbacks[MIGRATE_TYPES][3] = {
- [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
- [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
- [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
+static int fallbacks[MIGRATE_TYPES][4] = {
+ [MIGRATE_UNMOVABLE] = { MIGRATE_FREE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
+ [MIGRATE_MOVABLE] = { MIGRATE_FREE, MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
+ [MIGRATE_RECLAIMABLE] = { MIGRATE_FREE, MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
};

#ifdef CONFIG_CMA
@@ -2705,8 +2713,13 @@ int move_freepages_block(struct zone *zone, struct page *page,
* is worse than movable allocations stealing from unmovable and reclaimable
* pageblocks.
*/
-static bool can_steal_fallback(unsigned int order, int start_mt)
+static bool can_steal_fallback(unsigned int order, int start_mt,
+ int fallback_mt)
{
+ /* The first allocation in a free block *must* claim it. */
+ if (fallback_mt == MIGRATE_FREE)
+ return true;
+
/*
* Leaving this order check is intended, although there is
* relaxed order check in next check. The reason is that
@@ -2808,6 +2821,21 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
free_pages = move_freepages_block(zone, page, old_block_type,
start_type, &movable_pages);

+ /*
+ * If we fell back into a free block, claim the whole thing
+ */
+ if (old_block_type == MIGRATE_FREE) {
+ set_pageblock_migratetype(page, start_type);
+ if (!free_pages) {
+ /*
+ * This can leave some non-FREE pages on the
+ * FREE list. Future fallbacks will get them.
+ */
+ goto single_page;
+ }
+ return;
+ }
+
/*
* Determine how many pages are compatible with our allocation.
* For movable allocation, it's the number of movable pages which
@@ -2873,7 +2901,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
if (free_area_empty(area, fallback_mt))
continue;

- if (can_steal_fallback(order, migratetype))
+ if (can_steal_fallback(order, migratetype, fallback_mt))
*can_steal = true;

if (!only_stealable)
@@ -3485,7 +3513,7 @@ void free_unref_page(struct page *page, unsigned int order)
*/
migratetype = get_pcppage_migratetype(page);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
- if (unlikely(is_migrate_isolate(migratetype))) {
+ if (unlikely(is_migrate_isolate(migratetype) || migratetype == MIGRATE_FREE)) {
free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
return;
}
@@ -3529,7 +3557,7 @@ void free_unref_page_list(struct list_head *list)
* comment in free_unref_page.
*/
migratetype = get_pcppage_migratetype(page);
- if (unlikely(is_migrate_isolate(migratetype))) {
+ if (unlikely(is_migrate_isolate(migratetype) || migratetype == MIGRATE_FREE)) {
list_del(&page->lru);
free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
continue;
@@ -3632,10 +3660,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
del_page_from_free_list(page, zone, order, mt);

/*
- * Set the pageblock if the isolated page is at least half of a
- * pageblock
+ * Set the pageblock if the isolated page is from a free block
+ * or at least half of a pageblock
*/
- if (order >= pageblock_order - 1) {
+ if (mt == MIGRATE_FREE || order >= pageblock_order - 1) {
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
int mt = get_pageblock_migratetype(page);
@@ -4020,6 +4048,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
if (!area->nr_free)
continue;

+ if (!free_area_empty(area, MIGRATE_FREE))
+ return true;
+
for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
if (!free_area_empty(area, mt))
return true;
@@ -6081,6 +6112,7 @@ static void show_migration_types(unsigned char type)
[MIGRATE_MOVABLE] = 'M',
[MIGRATE_RECLAIMABLE] = 'E',
[MIGRATE_HIGHATOMIC] = 'H',
+ [MIGRATE_FREE] = 'F',
#ifdef CONFIG_CMA
[MIGRATE_CMA] = 'C',
#endif
@@ -7025,7 +7057,7 @@ static void __init memmap_init_zone_range(struct zone *zone,
return;

memmap_init_range(end_pfn - start_pfn, nid, zone_id, start_pfn,
- zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
+ zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_FREE);

if (*hole_pfn < start_pfn)
init_unavailable_range(*hole_pfn, start_pfn, zone_id, nid);
@@ -9422,8 +9454,7 @@ static int __alloc_contig_pages(unsigned long start_pfn,
{
unsigned long end_pfn = start_pfn + nr_pages;

- return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
+ return alloc_contig_range(start_pfn, end_pfn, MIGRATE_FREE, gfp_mask);
}

static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
--
2.39.2

2023-04-18 19:15:53

by Johannes Weiner

Subject: [RFC PATCH 12/26] mm: page_alloc: per-migratetype free counts

Increase visibility into the defragmentation behavior by tracking and
reporting per-migratetype free counters.

Subsequent patches will also use those counters to make more targeted
reclaim/compaction decisions.

Signed-off-by: Johannes Weiner <[email protected]>
---
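
The accounting indexes the counters as NR_FREE_UNMOVABLE + migratetype,
which relies on the NR_FREE_* items being declared in the same order as
the migratetypes. A compile-time check of that assumption, as a
standalone sketch (the enum values are mirrored from the series; the
asserts themselves are not part of the patch):

#include <assert.h>

/* Enum order mirrored from the series */
enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE,
                   MIGRATE_HIGHATOMIC, MIGRATE_FREE };

enum zone_stat_item { NR_FREE_PAGES, NR_FREE_UNMOVABLE, NR_FREE_MOVABLE,
                      NR_FREE_RECLAIMABLE, NR_FREE_HIGHATOMIC, NR_FREE_FREE };

/* account_freepages() indexes NR_FREE_UNMOVABLE + migratetype */
static_assert(NR_FREE_UNMOVABLE + MIGRATE_MOVABLE == NR_FREE_MOVABLE,
              "stat items must follow migratetype order");
static_assert(NR_FREE_UNMOVABLE + MIGRATE_RECLAIMABLE == NR_FREE_RECLAIMABLE,
              "stat items must follow migratetype order");
static_assert(NR_FREE_UNMOVABLE + MIGRATE_HIGHATOMIC == NR_FREE_HIGHATOMIC,
              "stat items must follow migratetype order");
static_assert(NR_FREE_UNMOVABLE + MIGRATE_FREE == NR_FREE_FREE,
              "stat items must follow migratetype order");

int main(void)
{
        return 0;
}
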
include/linux/mmzone.h | 5 +++++
mm/page_alloc.c | 29 +++++++++++++++++++++++++----
mm/vmstat.c | 5 +++++
3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 20542e5a0a43..d1083ab81998 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -139,6 +139,11 @@ enum numa_stat_item {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
+ NR_FREE_UNMOVABLE,
+ NR_FREE_MOVABLE,
+ NR_FREE_RECLAIMABLE,
+ NR_FREE_HIGHATOMIC,
+ NR_FREE_FREE,
NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
NR_ZONE_ACTIVE_ANON,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 44da23625f51..5f2a0037bed1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -959,8 +959,12 @@ static inline void account_freepages(struct page *page, struct zone *zone,

__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

- if (is_migrate_cma(migratetype))
+ if (migratetype <= MIGRATE_FREE)
+ __mod_zone_page_state(zone, NR_FREE_UNMOVABLE + migratetype, nr_pages);
+ else if (is_migrate_cma(migratetype))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
+ else
+ VM_WARN_ONCE(1, "unexpected migratetype %d\n", migratetype);
}

/* Used for pages not on another list */
@@ -6175,7 +6179,9 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
" mapped:%lu shmem:%lu pagetables:%lu\n"
" sec_pagetables:%lu bounce:%lu\n"
" kernel_misc_reclaimable:%lu\n"
- " free:%lu free_pcp:%lu free_cma:%lu\n",
+ " free:%lu free_unmovable:%lu free_movable:%lu\n"
+ " free_reclaimable:%lu free_highatomic:%lu free_free:%lu\n"
+ " free_cma:%lu free_pcp:%lu\n",
global_node_page_state(NR_ACTIVE_ANON),
global_node_page_state(NR_INACTIVE_ANON),
global_node_page_state(NR_ISOLATED_ANON),
@@ -6194,8 +6200,13 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
global_zone_page_state(NR_BOUNCE),
global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE),
global_zone_page_state(NR_FREE_PAGES),
- free_pcp,
- global_zone_page_state(NR_FREE_CMA_PAGES));
+ global_zone_page_state(NR_FREE_UNMOVABLE),
+ global_zone_page_state(NR_FREE_MOVABLE),
+ global_zone_page_state(NR_FREE_RECLAIMABLE),
+ global_zone_page_state(NR_FREE_HIGHATOMIC),
+ global_zone_page_state(NR_FREE_FREE),
+ global_zone_page_state(NR_FREE_CMA_PAGES),
+ free_pcp);

for_each_online_pgdat(pgdat) {
if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
@@ -6273,6 +6284,11 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
printk(KERN_CONT
"%s"
" free:%lukB"
+ " free_unmovable:%lukB"
+ " free_movable:%lukB"
+ " free_reclaimable:%lukB"
+ " free_highatomic:%lukB"
+ " free_free:%lukB"
" boost:%lukB"
" min:%lukB"
" low:%lukB"
@@ -6294,6 +6310,11 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
"\n",
zone->name,
K(zone_page_state(zone, NR_FREE_PAGES)),
+ K(zone_page_state(zone, NR_FREE_UNMOVABLE)),
+ K(zone_page_state(zone, NR_FREE_MOVABLE)),
+ K(zone_page_state(zone, NR_FREE_RECLAIMABLE)),
+ K(zone_page_state(zone, NR_FREE_HIGHATOMIC)),
+ K(zone_page_state(zone, NR_FREE_FREE)),
K(zone->watermark_boost),
K(min_wmark_pages(zone)),
K(low_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..c8b8e6e259da 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1168,6 +1168,11 @@ int fragmentation_index(struct zone *zone, unsigned int order)
const char * const vmstat_text[] = {
/* enum zone_stat_item counters */
"nr_free_pages",
+ "nr_free_unmovable",
+ "nr_free_movable",
+ "nr_free_reclaimable",
+ "nr_free_highatomic",
+ "nr_free_free",
"nr_zone_inactive_anon",
"nr_zone_active_anon",
"nr_zone_inactive_file",
--
2.39.2

2023-04-18 19:15:58

by Johannes Weiner

Subject: [RFC PATCH 16/26] mm: compaction: improve compaction_suitable() accuracy

With the new per-mt free counts, compaction can check the watermarks
specifically against suitable migration targets. This ensures reclaim
keeps going when the free pages are in blocks that aren't actually
suitable migration targets: MIGRATE_FREE, UNMOVABLE, RECLAIMABLE.

Signed-off-by: Johannes Weiner <[email protected]>
---
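
In other words, the order-0 gate for compaction is now checked against
only the pages compaction can actually migrate into. A simplified
userspace model of the difference, with made-up numbers:

#include <stdbool.h>
#include <stdio.h>

/* Free pages by block type, i.e. the new NR_FREE_* counters */
struct zone_free {
        long unmovable, movable, reclaimable, highatomic, free, cma;
};

/* Sketch of the order-0 input to __compaction_suitable() */
static bool watermark_ok(const struct zone_free *z, long mark, bool count_all)
{
        long pages = z->movable + z->cma;       /* suitable migration targets */

        if (count_all)  /* previous behavior: any free page counted */
                pages += z->unmovable + z->reclaimable +
                         z->highatomic + z->free;
        return pages >= mark;
}

int main(void)
{
        /* plenty of free memory overall, little of it in movable blocks */
        struct zone_free z = { .unmovable = 8000, .reclaimable = 4000,
                               .highatomic = 0, .movable = 500,
                               .free = 1000, .cma = 0 };

        printf("old check: %s\n",
               watermark_ok(&z, 2000, true) ? "compact" : "keep reclaiming");
        printf("new check: %s\n",
               watermark_ok(&z, 2000, false) ? "compact" : "keep reclaiming");
        return 0;
}
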
mm/compaction.c | 23 +++++++++++++++--------
mm/vmscan.c | 7 +++++--
2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b9eed0d43403..f637b4ed7f3c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2226,11 +2226,17 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
int highest_zoneidx)
{
+ unsigned long free_pages;
enum compact_result ret;
int fragindex;

- ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
- zone_page_state(zone, NR_FREE_PAGES));
+ /* Suitable migration targets */
+ free_pages = zone_page_state(zone, NR_FREE_MOVABLE);
+ free_pages += zone_page_state(zone, NR_FREE_CMA_PAGES);
+
+ ret = __compaction_suitable(zone, order, alloc_flags,
+ highest_zoneidx, free_pages);
+
/*
* fragmentation index determines if allocation failures are due to
* low memory or external fragmentation
@@ -2273,19 +2279,20 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
ac->highest_zoneidx, ac->nodemask) {
unsigned long available;
- enum compact_result compact_result;

+ available = zone_page_state_snapshot(zone, NR_FREE_MOVABLE);
+ available += zone_page_state_snapshot(zone, NR_FREE_CMA_PAGES);
/*
* Do not consider all the reclaimable memory because we do not
* want to trash just for a single high order allocation which
* is even not guaranteed to appear even if __compaction_suitable
* is happy about the watermark check.
*/
- available = zone_reclaimable_pages(zone) / order;
- available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
- compact_result = __compaction_suitable(zone, order, alloc_flags,
- ac->highest_zoneidx, available);
- if (compact_result == COMPACT_CONTINUE)
+ available += zone_reclaimable_pages(zone) / order;
+
+ if (__compaction_suitable(zone, order, alloc_flags,
+ ac->highest_zoneidx,
+ available) == COMPACT_CONTINUE)
return true;
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5b7b8d4f5297..9ecf29f4dab8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6270,6 +6270,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
{
unsigned long watermark;
+ unsigned long free_pages;
enum compact_result suitable;

suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx);
@@ -6290,8 +6291,10 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
* we are already above the high+gap watermark, don't reclaim at all.
*/
watermark = high_wmark_pages(zone) + compact_gap(sc->order);
-
- return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
+ free_pages = zone_page_state_snapshot(zone, NR_FREE_MOVABLE);
+ free_pages += zone_page_state_snapshot(zone, NR_FREE_CMA_PAGES);
+ return __zone_watermark_ok(zone, 0, watermark, sc->reclaim_idx,
+ ALLOC_CMA, free_pages);
}

static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
--
2.39.2

2023-04-18 19:16:00

by Johannes Weiner

Subject: [RFC PATCH 13/26] mm: compaction: remove compaction result helpers

I found myself repeatedly looking up the implementation of these
helpers while working on the code, which suggests they are not a
helpful abstraction. Inline them.

Signed-off-by: Johannes Weiner <[email protected]>
---
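
For reference, the retry policy that the removed helpers encoded maps
onto the raw compact_result values roughly as follows. This is a
standalone summary sketch matching the open-coded checks below, not
kernel code; the enumerator values are arbitrary.

#include <stdio.h>

/* Enumerator values are arbitrary; only the names matter here */
enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED, COMPACT_CONTENDED,
                      COMPACT_PARTIAL_SKIPPED, COMPACT_COMPLETE,
                      COMPACT_SUCCESS };

static const char *retry_policy(enum compact_result result)
{
        switch (result) {
        case COMPACT_SUCCESS:
                return "made progress, count a retry";
        case COMPACT_COMPLETE:
                return "all zones scanned, only retry at higher priority";
        case COMPACT_SKIPPED:
                return "not enough order-0 targets, retry if reclaim can help";
        case COMPACT_DEFERRED:
        case COMPACT_CONTENDED:
        case COMPACT_PARTIAL_SKIPPED:
                return "backed off, retry with increased priority";
        }
        return "no retry decision";
}

int main(void)
{
        printf("COMPACT_SKIPPED: %s\n", retry_policy(COMPACT_SKIPPED));
        return 0;
}
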
include/linux/compaction.h | 92 ----------------------------------
include/trace/events/mmflags.h | 4 +-
mm/page_alloc.c | 30 ++++++-----
3 files changed, 19 insertions(+), 107 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 06eeb2e25833..7635e220215a 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -103,78 +103,6 @@ extern enum compact_result compaction_suitable(struct zone *zone, int order,
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);

-/* Compaction has made some progress and retrying makes sense */
-static inline bool compaction_made_progress(enum compact_result result)
-{
- /*
- * Even though this might sound confusing this in fact tells us
- * that the compaction successfully isolated and migrated some
- * pageblocks.
- */
- if (result == COMPACT_SUCCESS)
- return true;
-
- return false;
-}
-
-/* Compaction has failed and it doesn't make much sense to keep retrying. */
-static inline bool compaction_failed(enum compact_result result)
-{
- /* All zones were scanned completely and still not result. */
- if (result == COMPACT_COMPLETE)
- return true;
-
- return false;
-}
-
-/* Compaction needs reclaim to be performed first, so it can continue. */
-static inline bool compaction_needs_reclaim(enum compact_result result)
-{
- /*
- * Compaction backed off due to watermark checks for order-0
- * so the regular reclaim has to try harder and reclaim something.
- */
- if (result == COMPACT_SKIPPED)
- return true;
-
- return false;
-}
-
-/*
- * Compaction has backed off for some reason after doing some work or none
- * at all. It might be throttling or lock contention. Retrying might be still
- * worthwhile, but with a higher priority if allowed.
- */
-static inline bool compaction_withdrawn(enum compact_result result)
-{
- /*
- * If compaction is deferred for high-order allocations, it is
- * because sync compaction recently failed. If this is the case
- * and the caller requested a THP allocation, we do not want
- * to heavily disrupt the system, so we fail the allocation
- * instead of entering direct reclaim.
- */
- if (result == COMPACT_DEFERRED)
- return true;
-
- /*
- * If compaction in async mode encounters contention or blocks higher
- * priority task we back off early rather than cause stalls.
- */
- if (result == COMPACT_CONTENDED)
- return true;
-
- /*
- * Page scanners have met but we haven't scanned full zones so this
- * is a back off in fact.
- */
- if (result == COMPACT_PARTIAL_SKIPPED)
- return true;
-
- return false;
-}
-
-
bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
int alloc_flags);

@@ -193,26 +121,6 @@ static inline enum compact_result compaction_suitable(struct zone *zone, int ord
return COMPACT_SKIPPED;
}

-static inline bool compaction_made_progress(enum compact_result result)
-{
- return false;
-}
-
-static inline bool compaction_failed(enum compact_result result)
-{
- return false;
-}
-
-static inline bool compaction_needs_reclaim(enum compact_result result)
-{
- return false;
-}
-
-static inline bool compaction_withdrawn(enum compact_result result)
-{
- return true;
-}
-
static inline void kcompactd_run(int nid)
{
}
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..47bfeca4cf02 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -222,8 +222,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
#define compact_result_to_feedback(result) \
({ \
enum compact_result __result = result; \
- (compaction_failed(__result)) ? COMPACTION_FAILED : \
- (compaction_withdrawn(__result)) ? COMPACTION_WITHDRAWN : COMPACTION_PROGRESS; \
+ (__result == COMPACT_COMPLETE) ? COMPACTION_FAILED : \
+ (__result == COMPACT_SUCCESS) ? COMPACTION_PROGRESS : COMPACTION_WITHDRAWN; \
})

#define COMPACTION_FEEDBACK \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f2a0037bed1..c3b7dc479936 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4620,35 +4620,39 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
if (fatal_signal_pending(current))
return false;

- if (compaction_made_progress(compact_result))
+ /*
+ * Compaction managed to coalesce some page blocks, but the
+ * allocation failed presumably due to a race. Retry some.
+ */
+ if (compact_result == COMPACT_SUCCESS)
(*compaction_retries)++;

/*
- * compaction considers all the zone as desperately out of memory
- * so it doesn't really make much sense to retry except when the
+ * All zones were scanned completely and still no result. It
+ * doesn't really make much sense to retry except when the
* failure could be caused by insufficient priority
*/
- if (compaction_failed(compact_result))
+ if (compact_result == COMPACT_COMPLETE)
goto check_priority;

/*
- * compaction was skipped because there are not enough order-0 pages
- * to work with, so we retry only if it looks like reclaim can help.
+ * Compaction was skipped due to a lack of free order-0
+ * migration targets. Continue if reclaim can help.
*/
- if (compaction_needs_reclaim(compact_result)) {
+ if (compact_result == COMPACT_SKIPPED) {
ret = compaction_zonelist_suitable(ac, order, alloc_flags);
goto out;
}

/*
- * make sure the compaction wasn't deferred or didn't bail out early
- * due to locks contention before we declare that we should give up.
- * But the next retry should use a higher priority if allowed, so
- * we don't just keep bailing out endlessly.
+ * If compaction backed due to being deferred, due to
+ * contended locks in async mode, or due to scanners meeting
+ * after a partial scan, retry with increased priority.
*/
- if (compaction_withdrawn(compact_result)) {
+ if (compact_result == COMPACT_DEFERRED ||
+ compact_result == COMPACT_CONTENDED ||
+ compact_result == COMPACT_PARTIAL_SKIPPED)
goto check_priority;
- }

/*
* !costly requests are much more important than __GFP_RETRY_MAYFAIL
--
2.39.2

2023-04-18 19:16:10

by Johannes Weiner

Subject: [RFC PATCH 04/26] mm: page_isolation: write proper kerneldoc

And remove the incorrect header comments.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/page-isolation.h | 24 ++++++------------------
mm/page_isolation.c | 29 ++++++++++++++++++++++++-----
2 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 5456b7be38ae..0ab089e89db4 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -37,24 +37,12 @@ void set_pageblock_migratetype(struct page *page, int migratetype);
int move_freepages_block(struct zone *zone, struct page *page,
int migratetype, int *num_movable);

-/*
- * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
- */
-int
-start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
- int migratetype, int flags, gfp_t gfp_flags);
-
-/*
- * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
- * target range is [start_pfn, end_pfn)
- */
-void
-undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
- int migratetype);
-
-/*
- * Test all pages in [start_pfn, end_pfn) are isolated or not.
- */
+int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+ int migratetype, int flags, gfp_t gfp_flags);
+
+void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+ int migratetype);
+
int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
int isol_flags);

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 47fbc1696466..b67800f7f6b1 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -481,8 +481,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
}

/**
- * start_isolate_page_range() - make page-allocation-type of range of pages to
- * be MIGRATE_ISOLATE.
+ * start_isolate_page_range() - mark page range MIGRATE_ISOLATE
* @start_pfn: The lower PFN of the range to be isolated.
* @end_pfn: The upper PFN of the range to be isolated.
* @migratetype: Migrate type to set in error recovery.
@@ -571,8 +570,14 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
return 0;
}

-/*
- * Make isolated pages available again.
+/**
+ * undo_isolate_page_range - undo effects of start_isolate_page_range()
+ * @start_pfn: The lower PFN of the isolated range
+ * @end_pfn: The upper PFN of the isolated range
+ * @migratetype: New migrate type to set on the range
+ *
+ * This finds every MIGRATE_ISOLATE page block in the given range
+ * and switches it to @migratetype.
*/
void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype)
@@ -631,7 +636,21 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
return pfn;
}

-/* Caller should ensure that requested range is in a single zone */
+/**
+ * test_pages_isolated - check if pageblocks in range are isolated
+ * @start_pfn: The first PFN of the isolated range
+ * @end_pfn: The first PFN *after* the isolated range
+ * @isol_flags: Testing mode flags
+ *
+ * This tests if all pages in the specified range are free.
+ *
+ * If %MEMORY_OFFLINE is specified in @isol_flags, it will consider
+ * poisoned and offlined pages free as well.
+ *
+ * Caller must ensure the requested range doesn't span zones.
+ *
+ * Returns 0 if true, -EBUSY if one or more pages are in use.
+ */
int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
int isol_flags)
{
--
2.39.2

2023-04-18 19:16:14

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 18/26] mm: compaction: remove unnecessary is_via_compact_memory() checks

Remove the is_via_compact_memory() checks from all paths that cannot
be reached via /proc/sys/vm/compact_memory.
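
For context, the helper only fires for the magic order of -1 that the
sysctl handler passes in (compare the "sc->order == -1" markers in the
vmscan hunks below); every other path carries a real allocation order.
A minimal userspace sketch of that reasoning, with made-up caller
orders:

#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel helper: only the sysctl passes order == -1 */
static bool is_via_compact_memory(int order)
{
        return order == -1;
}

int main(void)
{
        /* kcompactd, kswapd and direct compaction always carry a real order, */
        int kcompactd_order = 9;
        int direct_order = 4;

        /* so the check can never fire on those paths and is dead code there. */
        assert(!is_via_compact_memory(kcompactd_order));
        assert(!is_via_compact_memory(direct_order));
        return 0;
}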

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/compaction.c | 11 +----------
mm/vmscan.c | 8 +-------
2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index d4b7d5b36600..0aa2a0a192dc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2265,9 +2265,6 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
unsigned long available;
unsigned long watermark;

- if (is_via_compact_memory(order))
- return true;
-
/* Allocation can already succeed, nothing to do */
watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
if (zone_watermark_ok(zone, order, watermark,
@@ -2808,9 +2805,6 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
if (!populated_zone(zone))
continue;

- if (is_via_compact_memory(pgdat->kcompactd_max_order))
- return true;
-
/* Allocation can already succeed, check other zones */
if (zone_watermark_ok(zone, pgdat->kcompactd_max_order,
min_wmark_pages(zone),
@@ -2855,9 +2849,6 @@ static void kcompactd_do_work(pg_data_t *pgdat)
if (compaction_deferred(zone, cc.order))
continue;

- if (is_via_compact_memory(cc.order))
- goto compact;
-
/* Allocation can already succeed, nothing to do */
if (zone_watermark_ok(zone, cc.order,
min_wmark_pages(zone), zoneid, 0))
@@ -2866,7 +2857,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
if (compaction_suitable(zone, cc.order,
zoneid) != COMPACT_CONTINUE)
continue;
-compact:
+
if (kthread_should_stop())
return;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0ebdbf3efcf..ee8c8ca2e7b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6076,9 +6076,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
if (!managed_zone(zone))
continue;

- if (sc->order == -1) /* is_via_compact_memory() */
- return false;
-
/* Allocation can already succeed, nothing to do */
if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
sc->reclaim_idx, 0))
@@ -6275,9 +6272,6 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
unsigned long watermark;
unsigned long free_pages;

- if (sc->order == -1) /* is_via_compact_memory() */
- goto suitable;
-
/* Allocation can already succeed, nothing to do */
if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
sc->reclaim_idx, 0))
@@ -6287,7 +6281,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
if (compaction_suitable(zone, sc->order,
sc->reclaim_idx) == COMPACT_SKIPPED)
return false;
-suitable:
+
/*
* Compaction is already possible, but it takes time to run and there
* are potentially other callers using the pages just freed. So proceed
--
2.39.2

2023-04-18 19:16:14

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 19/26] mm: compaction: drop redundant watermark check in compaction_zonelist_suitable()

should_compact_retry() is called after direct reclaim and compaction
have failed to produce a page, to check whether there is still enough
compactable memory to warrant continuing. If there isn't, one last
attempt at the freelists happens in __alloc_pages_may_oom(). The
"allocation can already succeed" watermark check in
compaction_zonelist_suitable() is not necessary on this path. Kill it.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/compaction.c | 7 -------
1 file changed, 7 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 0aa2a0a192dc..52103545d58c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2263,13 +2263,6 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
ac->highest_zoneidx, ac->nodemask) {
unsigned long available;
- unsigned long watermark;
-
- /* Allocation can already succeed, nothing to do */
- watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
- if (zone_watermark_ok(zone, order, watermark,
- ac->highest_zoneidx, alloc_flags))
- continue;

available = zone_page_state_snapshot(zone, NR_FREE_MOVABLE);
available += zone_page_state_snapshot(zone, NR_FREE_CMA_PAGES);
--
2.39.2

2023-04-18 19:16:15

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 02/26] mm: compaction: avoid GFP_NOFS deadlocks

During stress testing, two deadlock scenarios were observed:

1. One GFP_NOFS allocation was sleeping on too_many_isolated(), and
all CPUs were busy with compactors that appeared to be spinning on
buffer locks.

Give GFP_NOFS compactors additional isolation headroom, the same
way we do during reclaim, to eliminate this deadlock scenario.

2. In a more pernicious scenario, the GFP_NOFS allocation was
busy-spinning in compaction, but seemingly never making
progress. Upon closer inspection, memory was dominated by file
pages, which the GFP_NOFS compactor isn't allowed to touch. The remaining
anon pages didn't have the contiguity to satisfy the request.

Allow GFP_NOFS allocations to bypass watermarks when compaction
failed at the highest priority.

While these deadlocks were encountered only in tests with the
subsequent patches (which put a lot more demand on compaction), in
theory these problems already exist in the code today. Fix them now.
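
A rough userspace sketch of the headroom rule from scenario 1,
mirroring only the threshold logic of the too_many_isolated() hunk
below (the GFP bit and the LRU sizes are stand-ins): callers that may
touch the filesystem see a scaled-down LRU and back off earlier, which
leaves GFP_NOFS compactors room to keep isolating instead of sleeping
behind them.

#include <stdbool.h>
#include <stdio.h>

#define GFP_FS (1u << 0)        /* stand-in for __GFP_FS */

/* __GFP_FS callers see a smaller effective LRU, so they back off
 * before GFP_NOFS callers do. */
static bool too_many_isolated(unsigned long active, unsigned long inactive,
                              unsigned long isolated, unsigned int gfp_mask)
{
        if (gfp_mask & GFP_FS) {
                inactive >>= 3;
                active >>= 3;
        }
        return isolated > (inactive + active) / 2;
}

int main(void)
{
        unsigned long active = 4096, inactive = 4096, isolated = 1500;

        /* The regular compactor throttles at this isolation level... */
        printf("GFP_KERNEL throttles: %d\n",
               too_many_isolated(active, inactive, isolated, GFP_FS));
        /* ...while the GFP_NOFS compactor may continue and make progress. */
        printf("GFP_NOFS throttles:   %d\n",
               too_many_isolated(active, inactive, isolated, 0));
        return 0;
}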

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/compaction.c | 15 +++++++++++++--
mm/page_alloc.c | 10 +++++++++-
2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8238e83385a7..84db84e8fd3a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -745,8 +745,9 @@ isolate_freepages_range(struct compact_control *cc,
}

/* Similar to reclaim, but different enough that they don't share logic */
-static bool too_many_isolated(pg_data_t *pgdat)
+static bool too_many_isolated(struct compact_control *cc)
{
+ pg_data_t *pgdat = cc->zone->zone_pgdat;
bool too_many;

unsigned long active, inactive, isolated;
@@ -758,6 +759,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
node_page_state(pgdat, NR_ISOLATED_ANON);

+ /*
+ * GFP_NOFS callers are allowed to isolate more pages, so they
+ * won't get blocked by normal direct-reclaimers, forming a
+ * circular deadlock. GFP_NOIO won't get here.
+ */
+ if (cc->gfp_mask & __GFP_FS) {
+ inactive >>= 3;
+ active >>= 3;
+ }
+
too_many = isolated > (inactive + active) / 2;
if (!too_many)
wake_throttle_isolated(pgdat);
@@ -806,7 +817,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* list by either parallel reclaimers or compaction. If there are,
* delay for some time until fewer pages are isolated
*/
- while (unlikely(too_many_isolated(pgdat))) {
+ while (unlikely(too_many_isolated(cc))) {
/* stop isolation if there are still pages not migrated */
if (cc->nr_migratepages)
return -EAGAIN;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bb3484563ed..ac03571e0532 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4508,8 +4508,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
prep_new_page(page, order, gfp_mask, alloc_flags);

/* Try get a page from the freelist if available */
- if (!page)
+ if (!page) {
+ /*
+ * It's possible that the only migration sources are
+ * file pages, and the GFP_NOFS stack is holding up
+ * other compactors. Use reserves to avoid deadlock.
+ */
+ if (prio == MIN_COMPACT_PRIORITY && !(gfp_mask & __GFP_FS))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+ }

if (page) {
struct zone *zone = page_zone(page);
--
2.39.2

2023-04-18 19:16:17

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 09/26] mm: page_alloc: move expand() above compaction_capture()

The next patch will allow compaction to capture from
larger-than-requested page blocks and free the remainder.

Rearrange the code in advance to make the diff more readable. No
functional change.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 186 ++++++++++++++++++++++++------------------------
1 file changed, 93 insertions(+), 93 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e5996f8b4b4..cd86f80d7bbe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -950,61 +950,6 @@ static inline void set_buddy_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
}

-#ifdef CONFIG_COMPACTION
-static inline struct capture_control *task_capc(struct zone *zone)
-{
- struct capture_control *capc = current->capture_control;
-
- return unlikely(capc && capc->cc) &&
- !(current->flags & PF_KTHREAD) &&
- !capc->page &&
- capc->cc->zone == zone ? capc : NULL;
-}
-
-static inline bool
-compaction_capture(struct capture_control *capc, struct page *page,
- int order, int migratetype)
-{
- if (!capc || order != capc->cc->order)
- return false;
-
- /* Do not accidentally pollute CMA or isolated regions*/
- if (is_migrate_cma(migratetype) ||
- is_migrate_isolate(migratetype))
- return false;
-
- if (order >= pageblock_order) {
- migratetype = capc->migratetype;
- change_pageblock_range(page, order, migratetype);
- } else if (migratetype == MIGRATE_MOVABLE) {
- /*
- * Do not let lower order allocations pollute a
- * movable pageblock. This might let an unmovable
- * request use a reclaimable pageblock and vice-versa
- * but no more than normal fallback logic which can
- * have trouble finding a high-order free page.
- */
- return false;
- }
-
- capc->page = page;
- return true;
-}
-
-#else
-static inline struct capture_control *task_capc(struct zone *zone)
-{
- return NULL;
-}
-
-static inline bool
-compaction_capture(struct capture_control *capc, struct page *page,
- int order, int migratetype)
-{
- return false;
-}
-#endif /* CONFIG_COMPACTION */
-
static inline void account_freepages(struct page *page, struct zone *zone,
int nr_pages, int migratetype)
{
@@ -1072,6 +1017,99 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
account_freepages(page, zone, -(1 << order), migratetype);
}

+/*
+ * The order of subdivision here is critical for the IO subsystem.
+ * Please do not alter this order without good reasons and regression
+ * testing. Specifically, as large blocks of memory are subdivided,
+ * the order in which smaller blocks are delivered depends on the order
+ * they're subdivided in this function. This is the primary factor
+ * influencing the order in which pages are delivered to the IO
+ * subsystem according to empirical testing, and this is also justified
+ * by considering the behavior of a buddy system containing a single
+ * large block of memory acted on by a series of small allocations.
+ * This behavior is a critical factor in sglist merging's success.
+ *
+ * -- nyc
+ */
+static inline void expand(struct zone *zone, struct page *page,
+ int low, int high, int migratetype)
+{
+ unsigned long size = 1 << high;
+
+ while (high > low) {
+ high--;
+ size >>= 1;
+ VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
+
+ /*
+ * Mark as guard pages (or page), that will allow to
+ * merge back to allocator when buddy will be freed.
+ * Corresponding page table entries will not be touched,
+ * pages will stay not present in virtual address space
+ */
+ if (set_page_guard(zone, &page[size], high))
+ continue;
+
+ add_to_free_list(&page[size], zone, high, migratetype, false);
+ set_buddy_order(&page[size], high);
+ }
+}
+
+#ifdef CONFIG_COMPACTION
+static inline struct capture_control *task_capc(struct zone *zone)
+{
+ struct capture_control *capc = current->capture_control;
+
+ return unlikely(capc && capc->cc) &&
+ !(current->flags & PF_KTHREAD) &&
+ !capc->page &&
+ capc->cc->zone == zone ? capc : NULL;
+}
+
+static inline bool
+compaction_capture(struct capture_control *capc, struct page *page,
+ int order, int migratetype)
+{
+ if (!capc || order != capc->cc->order)
+ return false;
+
+ /* Do not accidentally pollute CMA or isolated regions*/
+ if (is_migrate_cma(migratetype) ||
+ is_migrate_isolate(migratetype))
+ return false;
+
+ if (order >= pageblock_order) {
+ migratetype = capc->migratetype;
+ change_pageblock_range(page, order, migratetype);
+ } else if (migratetype == MIGRATE_MOVABLE) {
+ /*
+ * Do not let lower order allocations pollute a
+ * movable pageblock. This might let an unmovable
+ * request use a reclaimable pageblock and vice-versa
+ * but no more than normal fallback logic which can
+ * have trouble finding a high-order free page.
+ */
+ return false;
+ }
+
+ capc->page = page;
+ return true;
+}
+
+#else
+static inline struct capture_control *task_capc(struct zone *zone)
+{
+ return NULL;
+}
+
+static inline bool
+compaction_capture(struct capture_control *capc, struct page *page,
+ int order, int migratetype)
+{
+ return false;
+}
+#endif /* CONFIG_COMPACTION */
+
/*
* If this is not the largest possible page, check if the buddy
* of the next-highest order is free. If it is, it's possible
@@ -2345,44 +2383,6 @@ void __init init_cma_reserved_pageblock(struct page *page)
}
#endif

-/*
- * The order of subdivision here is critical for the IO subsystem.
- * Please do not alter this order without good reasons and regression
- * testing. Specifically, as large blocks of memory are subdivided,
- * the order in which smaller blocks are delivered depends on the order
- * they're subdivided in this function. This is the primary factor
- * influencing the order in which pages are delivered to the IO
- * subsystem according to empirical testing, and this is also justified
- * by considering the behavior of a buddy system containing a single
- * large block of memory acted on by a series of small allocations.
- * This behavior is a critical factor in sglist merging's success.
- *
- * -- nyc
- */
-static inline void expand(struct zone *zone, struct page *page,
- int low, int high, int migratetype)
-{
- unsigned long size = 1 << high;
-
- while (high > low) {
- high--;
- size >>= 1;
- VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
-
- /*
- * Mark as guard pages (or page), that will allow to
- * merge back to allocator when buddy will be freed.
- * Corresponding page table entries will not be touched,
- * pages will stay not present in virtual address space
- */
- if (set_page_guard(zone, &page[size], high))
- continue;
-
- add_to_free_list(&page[size], zone, high, migratetype, false);
- set_buddy_order(&page[size], high);
- }
-}
-
static void check_new_page_bad(struct page *page)
{
if (unlikely(page->flags & __PG_HWPOISON)) {
--
2.39.2

2023-04-18 19:16:18

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 15/26] mm: compaction: simplify free block check in suitable_migration_target()

Free page blocks are now marked MIGRATE_FREE. Consult that directly.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/compaction.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a2280001eea3..b9eed0d43403 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1281,22 +1281,17 @@ static bool suitable_migration_source(struct compact_control *cc,
static bool suitable_migration_target(struct compact_control *cc,
struct page *page)
{
+ int mt = get_pageblock_migratetype(page);
+
/* If the page is a large free page, then disallow migration */
- if (PageBuddy(page)) {
- /*
- * We are checking page_order without zone->lock taken. But
- * the only small danger is that we skip a potentially suitable
- * pageblock, so it's not worth to check order for valid range.
- */
- if (buddy_order_unsafe(page) >= pageblock_order)
- return false;
- }
+ if (mt == MIGRATE_FREE)
+ return false;

if (cc->ignore_block_suitable)
return true;

/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
- if (is_migrate_movable(get_pageblock_migratetype(page)))
+ if (is_migrate_movable(mt))
return true;

/* Otherwise skip the block */
--
2.39.2

2023-04-18 19:16:19

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 07/26] mm: page_alloc: move capture_control to the page allocator

Compaction capturing is really a component of the page allocator.
Later patches will also disconnect allocation request size from the
compaction size, so move the setup of the capturing parameters to the
"request domain", i.e. the page allocator. No functional change.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/compaction.h | 3 ++-
mm/compaction.c | 33 ++++++++++-----------------------
mm/page_alloc.c | 25 ++++++++++++++++++++++---
3 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 52a9ff65faee..06eeb2e25833 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -56,6 +56,7 @@ enum compact_result {
};

struct alloc_context; /* in mm/internal.h */
+struct capture_control; /* in mm/internal.h */

/*
* Number of free order-0 pages that should be available above given watermark
@@ -94,7 +95,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
const struct alloc_context *ac, enum compact_priority prio,
- struct page **page);
+ struct capture_control *capc);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags, int highest_zoneidx);
diff --git a/mm/compaction.c b/mm/compaction.c
index 84db84e8fd3a..a2280001eea3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2510,7 +2510,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int highest_zoneidx,
- struct page **capture)
+ struct capture_control *capc)
{
enum compact_result ret;
struct compact_control cc = {
@@ -2527,38 +2527,25 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
};
- struct capture_control capc = {
- .cc = &cc,
- .page = NULL,
- };

- /*
- * Make sure the structs are really initialized before we expose the
- * capture control, in case we are interrupted and the interrupt handler
- * frees a page.
- */
+ /* See the comment in __alloc_pages_direct_compact() */
barrier();
- WRITE_ONCE(current->capture_control, &capc);
+ WRITE_ONCE(capc->cc, &cc);

- ret = compact_zone(&cc, &capc);
+ ret = compact_zone(&cc, capc);
+
+ WRITE_ONCE(capc->cc, NULL);

VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));

- /*
- * Make sure we hide capture control first before we read the captured
- * page pointer, otherwise an interrupt could free and capture a page
- * and we would leak it.
- */
- WRITE_ONCE(current->capture_control, NULL);
- *capture = READ_ONCE(capc.page);
/*
* Technically, it is also possible that compaction is skipped but
* the page is still captured out of luck(IRQ came and freed the page).
* Returning COMPACT_SUCCESS in such cases helps in properly accounting
* the COMPACT[STALL|FAIL] when compaction is skipped.
*/
- if (*capture)
+ if (capc->page)
ret = COMPACT_SUCCESS;

return ret;
@@ -2573,13 +2560,13 @@ int sysctl_extfrag_threshold = 500;
* @alloc_flags: The allocation flags of the current allocation
* @ac: The context of current allocation
* @prio: Determines how hard direct compaction should try to succeed
- * @capture: Pointer to free page created by compaction will be stored here
+ * @capc: The context for capturing pages during freeing
*
* This is the main entry point for direct page compaction.
*/
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
- enum compact_priority prio, struct page **capture)
+ enum compact_priority prio, struct capture_control *capc)
{
int may_perform_io = (__force int)(gfp_mask & __GFP_IO);
struct zoneref *z;
@@ -2607,7 +2594,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
}

status = compact_zone_order(zone, order, gfp_mask, prio,
- alloc_flags, ac->highest_zoneidx, capture);
+ alloc_flags, ac->highest_zoneidx, capc);
rc = max(status, rc);

/* The allocation should succeed, stop compacting */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b9366c002334..4d20513c83be 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -944,7 +944,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
{
struct capture_control *capc = current->capture_control;

- return unlikely(capc) &&
+ return unlikely(capc && capc->cc) &&
!(current->flags & PF_KTHREAD) &&
!capc->page &&
capc->cc->zone == zone ? capc : NULL;
@@ -4480,22 +4480,41 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
unsigned int noreclaim_flag;
+ struct capture_control capc = {
+ .page = NULL,
+ };

if (!order)
return NULL;

+ /*
+ * Make sure the structs are really initialized before we expose the
+ * capture control, in case we are interrupted and the interrupt handler
+ * frees a page.
+ */
+ barrier();
+ WRITE_ONCE(current->capture_control, &capc);
+
psi_memstall_enter(&pflags);
delayacct_compact_start();
noreclaim_flag = memalloc_noreclaim_save();

*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &page);
+ prio, &capc);

memalloc_noreclaim_restore(noreclaim_flag);
psi_memstall_leave(&pflags);
delayacct_compact_end();

- if (*compact_result == COMPACT_SKIPPED)
+ /*
+ * Make sure we hide capture control first before we read the captured
+ * page pointer, otherwise an interrupt could free and capture a page
+ * and we would leak it.
+ */
+ WRITE_ONCE(current->capture_control, NULL);
+ page = READ_ONCE(capc.page);
+
+ if (!page && *compact_result == COMPACT_SKIPPED)
return NULL;
/*
* At least in one zone compaction wasn't deferred or skipped, so let's
--
2.39.2

2023-04-18 19:16:19

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 17/26] mm: compaction: refactor __compaction_suitable()

__compaction_suitable() is supposed to check for available migration
targets. However, it also checks whether the operation was requested
via /proc/sys/vm/compact_memory, and whether the original allocation
request can already succeed. These don't apply to all callsites.

Move the checks out to the callers, so that later patches can deal
with them one by one. No functional change intended.
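
The resulting caller-side pattern looks roughly like this condensed
userspace model (the stubs and the should_run_compaction() name are
illustrative, not kernel API):

#include <stdbool.h>

enum compact_result { COMPACT_SKIPPED, COMPACT_SUCCESS, COMPACT_CONTINUE };

/* Toy stubs; the real predicates take a zone, watermark flags etc. */
static bool is_via_compact_memory(int order) { return order == -1; }
static bool zone_watermark_ok(int order) { (void)order; return false; }
static enum compact_result compaction_suitable(int order)
{
        (void)order;
        return COMPACT_CONTINUE;        /* enough order-0 migration targets */
}

/*
 * After the refactor, each caller strings the checks together itself
 * instead of getting all three bundled inside compaction_suitable():
 */
static enum compact_result should_run_compaction(int order)
{
        if (is_via_compact_memory(order))       /* sysctl trigger: always run */
                return COMPACT_CONTINUE;
        if (zone_watermark_ok(order))           /* request already succeeds */
                return COMPACT_SUCCESS;
        return compaction_suitable(order);      /* free pages to migrate to? */
}

int main(void)
{
        return should_run_compaction(4) == COMPACT_CONTINUE ? 0 : 1;
}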

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/compaction.h | 4 +-
mm/compaction.c | 80 ++++++++++++++++++++++++--------------
mm/vmscan.c | 35 ++++++++++-------
3 files changed, 74 insertions(+), 45 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7635e220215a..9e1b2c56df62 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -98,7 +98,7 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
struct capture_control *capc);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern enum compact_result compaction_suitable(struct zone *zone, int order,
- unsigned int alloc_flags, int highest_zoneidx);
+ int highest_zoneidx);

extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -116,7 +116,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}

static inline enum compact_result compaction_suitable(struct zone *zone, int order,
- int alloc_flags, int highest_zoneidx)
+ int highest_zoneidx)
{
return COMPACT_SKIPPED;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index f637b4ed7f3c..d4b7d5b36600 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2173,24 +2173,10 @@ static enum compact_result compact_finished(struct compact_control *cc)
}

static enum compact_result __compaction_suitable(struct zone *zone, int order,
- unsigned int alloc_flags,
int highest_zoneidx,
unsigned long wmark_target)
{
unsigned long watermark;
-
- if (is_via_compact_memory(order))
- return COMPACT_CONTINUE;
-
- watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
- /*
- * If watermarks for high-order allocation are already met, there
- * should be no need for compaction at all.
- */
- if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
- alloc_flags))
- return COMPACT_SUCCESS;
-
/*
* Watermarks for order-0 must be met for compaction to be able to
* isolate free pages for migration targets. This means that the
@@ -2223,7 +2209,6 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
* COMPACT_CONTINUE - If compaction should run now
*/
enum compact_result compaction_suitable(struct zone *zone, int order,
- unsigned int alloc_flags,
int highest_zoneidx)
{
unsigned long free_pages;
@@ -2234,8 +2219,7 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
free_pages = zone_page_state(zone, NR_FREE_MOVABLE);
free_pages += zone_page_state(zone, NR_FREE_CMA_PAGES);

- ret = __compaction_suitable(zone, order, alloc_flags,
- highest_zoneidx, free_pages);
+ ret = __compaction_suitable(zone, order, highest_zoneidx, free_pages);

/*
* fragmentation index determines if allocation failures are due to
@@ -2279,6 +2263,16 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
ac->highest_zoneidx, ac->nodemask) {
unsigned long available;
+ unsigned long watermark;
+
+ if (is_via_compact_memory(order))
+ return true;
+
+ /* Allocation can already succeed, nothing to do */
+ watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
+ if (zone_watermark_ok(zone, order, watermark,
+ ac->highest_zoneidx, alloc_flags))
+ continue;

available = zone_page_state_snapshot(zone, NR_FREE_MOVABLE);
available += zone_page_state_snapshot(zone, NR_FREE_CMA_PAGES);
@@ -2290,8 +2284,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
*/
available += zone_reclaimable_pages(zone) / order;

- if (__compaction_suitable(zone, order, alloc_flags,
- ac->highest_zoneidx,
+ if (__compaction_suitable(zone, order, ac->highest_zoneidx,
available) == COMPACT_CONTINUE)
return true;
}
@@ -2322,14 +2315,26 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
INIT_LIST_HEAD(&cc->migratepages);

cc->migratetype = gfp_migratetype(cc->gfp_mask);
- ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
- cc->highest_zoneidx);
- /* Compaction is likely to fail */
- if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
- return ret;

- /* huh, compaction_suitable is returning something unexpected */
- VM_BUG_ON(ret != COMPACT_CONTINUE);
+ if (!is_via_compact_memory(cc->order)) {
+ unsigned long watermark;
+
+ /* Allocation can already succeed, nothing to do */
+ watermark = wmark_pages(cc->zone,
+ cc->alloc_flags & ALLOC_WMARK_MASK);
+ if (zone_watermark_ok(cc->zone, cc->order, watermark,
+ cc->highest_zoneidx, cc->alloc_flags))
+ return COMPACT_SUCCESS;
+
+ ret = compaction_suitable(cc->zone, cc->order,
+ cc->highest_zoneidx);
+ /* Compaction is likely to fail */
+ if (ret == COMPACT_SKIPPED)
+ return ret;
+
+ /* huh, compaction_suitable is returning something unexpected */
+ VM_BUG_ON(ret != COMPACT_CONTINUE);
+ }

/*
* Clear pageblock skip if there were failures recently and compaction
@@ -2803,7 +2808,16 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
if (!populated_zone(zone))
continue;

- if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
+ if (is_via_compact_memory(pgdat->kcompactd_max_order))
+ return true;
+
+ /* Allocation can already succeed, check other zones */
+ if (zone_watermark_ok(zone, pgdat->kcompactd_max_order,
+ min_wmark_pages(zone),
+ highest_zoneidx, 0))
+ continue;
+
+ if (compaction_suitable(zone, pgdat->kcompactd_max_order,
highest_zoneidx) == COMPACT_CONTINUE)
return true;
}
@@ -2841,10 +2855,18 @@ static void kcompactd_do_work(pg_data_t *pgdat)
if (compaction_deferred(zone, cc.order))
continue;

- if (compaction_suitable(zone, cc.order, 0, zoneid) !=
- COMPACT_CONTINUE)
+ if (is_via_compact_memory(cc.order))
+ goto compact;
+
+ /* Allocation can already succeed, nothing to do */
+ if (zone_watermark_ok(zone, cc.order,
+ min_wmark_pages(zone), zoneid, 0))
continue;

+ if (compaction_suitable(zone, cc.order,
+ zoneid) != COMPACT_CONTINUE)
+ continue;
+compact:
if (kthread_should_stop())
return;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9ecf29f4dab8..a0ebdbf3efcf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6076,14 +6076,17 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
if (!managed_zone(zone))
continue;

- switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
- case COMPACT_SUCCESS:
- case COMPACT_CONTINUE:
+ if (sc->order == -1) /* is_via_compact_memory() */
+ return false;
+
+ /* Allocation can already succeed, nothing to do */
+ if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
+ sc->reclaim_idx, 0))
+ return false;
+
+ if (compaction_suitable(zone, sc->order,
+ sc->reclaim_idx) == COMPACT_CONTINUE)
return false;
- default:
- /* check next zone */
- ;
- }
}

/*
@@ -6271,16 +6274,20 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
{
unsigned long watermark;
unsigned long free_pages;
- enum compact_result suitable;

- suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx);
- if (suitable == COMPACT_SUCCESS)
- /* Allocation should succeed already. Don't reclaim. */
+ if (sc->order == -1) /* is_via_compact_memory() */
+ goto suitable;
+
+ /* Allocation can already succeed, nothing to do */
+ if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
+ sc->reclaim_idx, 0))
return true;
- if (suitable == COMPACT_SKIPPED)
- /* Compaction cannot yet proceed. Do reclaim. */
- return false;

+ /* Compaction cannot yet proceed. Do reclaim. */
+ if (compaction_suitable(zone, sc->order,
+ sc->reclaim_idx) == COMPACT_SKIPPED)
+ return false;
+suitable:
/*
* Compaction is already possible, but it takes time to run and there
* are potentially other callers using the pages just freed. So proceed
--
2.39.2

2023-04-18 19:16:31

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 23/26] mm: page_alloc: kill highatomic

The highatomic reserves are blocks set aside specifically for
higher-order atomic allocations. Since watermarks are now required to
be met in pageblocks, this is no longer necessary.
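
As a back-of-the-envelope illustration (toy numbers; assuming, as
elsewhere in the series, that watermarks are pageblock-aligned and
counted against whole free blocks, see the NR_FREE_FREE accounting in
the __zone_free_pages() hunk below):

#include <assert.h>
#include <stdbool.h>

#define PAGEBLOCK_NR_PAGES 512UL        /* 2M blocks of 4k pages */

/* Toy watermark check: only pages in fully free blocks are counted */
static bool watermark_ok(unsigned long free_block_pages, unsigned long mark)
{
        return free_block_pages > mark;
}

int main(void)
{
        unsigned long min_mark = PAGEBLOCK_NR_PAGES;    /* >= one whole block */
        unsigned long nr_free_free = 3 * PAGEBLOCK_NR_PAGES;

        /*
         * Whenever the watermark holds, at least one whole free block
         * exists, so a high-order atomic allocation can be served from
         * it directly; no dedicated MIGRATE_HIGHATOMIC reserve needed.
         */
        if (watermark_ok(nr_free_free, min_mark))
                assert(nr_free_free / PAGEBLOCK_NR_PAGES >= 1);
        return 0;
}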

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/gfp.h | 2 -
include/linux/mmzone.h | 6 +-
mm/internal.h | 5 --
mm/page_alloc.c | 187 ++---------------------------------------
mm/vmstat.c | 1 -
5 files changed, 10 insertions(+), 191 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65a78773dcca..78b5176d354e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -19,8 +19,6 @@ static inline int gfp_migratetype(const gfp_t gfp_flags)
BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
BUILD_BUG_ON((___GFP_RECLAIMABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_RECLAIMABLE);
- BUILD_BUG_ON(((___GFP_MOVABLE | ___GFP_RECLAIMABLE) >>
- GFP_MOVABLE_SHIFT) != MIGRATE_HIGHATOMIC);

if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d1083ab81998..c705f2f7c829 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -44,8 +44,7 @@ enum migratetype {
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
- MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
- MIGRATE_FREE,
+ MIGRATE_FREE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
@@ -142,7 +141,6 @@ enum zone_stat_item {
NR_FREE_UNMOVABLE,
NR_FREE_MOVABLE,
NR_FREE_RECLAIMABLE,
- NR_FREE_HIGHATOMIC,
NR_FREE_FREE,
NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
@@ -713,8 +711,6 @@ struct zone {
unsigned long _watermark[NR_WMARK];
unsigned long watermark_boost;

- unsigned long nr_reserved_highatomic;
-
/*
* We don't know if the memory that we're going to allocate will be
* freeable or/and it will be released eventually, so to avoid totally
diff --git a/mm/internal.h b/mm/internal.h
index 5c76455f8042..24f43f5db88b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -778,11 +778,6 @@ extern const struct trace_print_flags pageflag_names[];
extern const struct trace_print_flags vmaflag_names[];
extern const struct trace_print_flags gfpflag_names[];

-static inline bool is_migrate_highatomic(enum migratetype migratetype)
-{
- return migratetype == MIGRATE_HIGHATOMIC;
-}
-
void setup_zone_pageset(struct zone *zone);

struct migration_target_control {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f0bfc226c36..e8ae04feb1bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -379,7 +379,6 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Movable",
"Reclaimable",
- "HighAtomic",
"Free",
#ifdef CONFIG_CMA
"CMA",
@@ -1202,7 +1201,7 @@ static inline void __free_one_page(struct page *page,
* We want to prevent merge between freepages on pageblock
* without fallbacks and normal pageblock. Without this,
* pageblock isolation could cause incorrect freepage or CMA
- * accounting or HIGHATOMIC accounting.
+ * accounting.
*/
if (migratetype != buddy_mt
&& (!migratetype_is_mergeable(migratetype) ||
@@ -2797,13 +2796,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,

old_block_type = get_pageblock_migratetype(page);

- /*
- * This can happen due to races and we want to prevent broken
- * highatomic accounting.
- */
- if (is_migrate_highatomic(old_block_type))
- goto single_page;
-
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order) {
change_pageblock_range(page, current_order, start_type);
@@ -2918,126 +2910,6 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
return -1;
}

-/*
- * Reserve a pageblock for exclusive use of high-order atomic allocations if
- * there are no empty page blocks that contain a page with a suitable order
- */
-static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
- unsigned int alloc_order)
-{
- int mt;
- unsigned long max_managed, flags;
-
- /*
- * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
- * Check is race-prone but harmless.
- */
- max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
- if (zone->nr_reserved_highatomic >= max_managed)
- return;
-
- spin_lock_irqsave(&zone->lock, flags);
-
- /* Recheck the nr_reserved_highatomic limit under the lock */
- if (zone->nr_reserved_highatomic >= max_managed)
- goto out_unlock;
-
- /* Yoink! */
- mt = get_pageblock_migratetype(page);
- /* Only reserve normal pageblocks (i.e., they can merge with others) */
- if (migratetype_is_mergeable(mt)) {
- zone->nr_reserved_highatomic += pageblock_nr_pages;
- set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
- move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC, NULL);
- }
-
-out_unlock:
- spin_unlock_irqrestore(&zone->lock, flags);
-}
-
-/*
- * Used when an allocation is about to fail under memory pressure. This
- * potentially hurts the reliability of high-order allocations when under
- * intense memory pressure but failed atomic allocations should be easier
- * to recover from than an OOM.
- *
- * If @force is true, try to unreserve a pageblock even though highatomic
- * pageblock is exhausted.
- */
-static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
- bool force)
-{
- struct zonelist *zonelist = ac->zonelist;
- unsigned long flags;
- struct zoneref *z;
- struct zone *zone;
- struct page *page;
- int order;
- bool ret;
-
- for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
- ac->nodemask) {
- /*
- * Preserve at least one pageblock unless memory pressure
- * is really high.
- */
- if (!force && zone->nr_reserved_highatomic <=
- pageblock_nr_pages)
- continue;
-
- spin_lock_irqsave(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
- struct free_area *area = &(zone->free_area[order]);
- int mt;
-
- page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
- if (!page)
- continue;
-
- mt = get_pageblock_migratetype(page);
- /*
- * In page freeing path, migratetype change is racy so
- * we can counter several free pages in a pageblock
- * in this loop although we changed the pageblock type
- * from highatomic to ac->migratetype. So we should
- * adjust the count once.
- */
- if (is_migrate_highatomic(mt)) {
- /*
- * It should never happen but changes to
- * locking could inadvertently allow a per-cpu
- * drain to add pages to MIGRATE_HIGHATOMIC
- * while unreserving so be safe and watch for
- * underflows.
- */
- zone->nr_reserved_highatomic -= min(
- pageblock_nr_pages,
- zone->nr_reserved_highatomic);
- }
-
- /*
- * Convert to ac->migratetype and avoid the normal
- * pageblock stealing heuristics. Minimally, the caller
- * is doing the work and needs the pages. More
- * importantly, if the block was always converted to
- * MIGRATE_UNMOVABLE or another type then the number
- * of pageblocks that cannot be completely freed
- * may increase.
- */
- set_pageblock_migratetype(page, ac->migratetype);
- ret = move_freepages_block(zone, page, mt,
- ac->migratetype, NULL);
- if (ret) {
- spin_unlock_irqrestore(&zone->lock, flags);
- return ret;
- }
- }
- spin_unlock_irqrestore(&zone->lock, flags);
- }
-
- return false;
-}
-
/*
* Try finding a free buddy page on the fallback list and put it on the free
* list of requested migratetype, possibly along with other pages from the same
@@ -3510,18 +3382,11 @@ void free_unref_page(struct page *page, unsigned int order)

/*
* We only track unmovable, reclaimable and movable on pcp lists.
- * Place ISOLATE pages on the isolated list because they are being
- * offlined but treat HIGHATOMIC as movable pages so we can get those
- * areas back if necessary. Otherwise, we may have to free
- * excessively into the page allocator
*/
migratetype = get_pcppage_migratetype(page);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
- if (unlikely(is_migrate_isolate(migratetype) || migratetype == MIGRATE_FREE)) {
- free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
- return;
- }
- migratetype = MIGRATE_MOVABLE;
+ free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
+ return;
}

zone = page_zone(page);
@@ -3740,24 +3605,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
unsigned long flags;

do {
- page = NULL;
spin_lock_irqsave(&zone->lock, flags);
- /*
- * order-0 request can reach here when the pcplist is skipped
- * due to non-CMA allocation context. HIGHATOMIC area is
- * reserved for high-order atomic allocation, so order-0
- * request should skip it.
- */
- if (order > 0 && alloc_flags & ALLOC_HARDER)
- page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
- if (!page) {
- page = __rmqueue(zone, order, migratetype, alloc_flags);
- if (!page) {
- spin_unlock_irqrestore(&zone->lock, flags);
- return NULL;
- }
- }
+ page = __rmqueue(zone, order, migratetype, alloc_flags);
spin_unlock_irqrestore(&zone->lock, flags);
+ if (!page)
+ return NULL;
} while (check_new_pages(page, order));

__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -4003,8 +3855,6 @@ static long __zone_free_pages(struct zone *zone, int alloc_flags, bool safe)
* compaction is necessary to restore the watermarks.
*/
free_pages = page_state(zone, NR_FREE_FREE, safe);
- if (alloc_flags & (ALLOC_HARDER | ALLOC_OOM))
- free_pages += page_state(zone, NR_FREE_HIGHATOMIC, safe);
if (IS_ENABLED(CONFIG_CMA) && (alloc_flags & ALLOC_CMA))
free_pages += page_state(zone, NR_FREE_CMA_PAGES, safe);

@@ -4098,8 +3948,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
return true;
}
#endif
- if (alloc_harder && !free_area_empty(area, MIGRATE_HIGHATOMIC))
- return true;
}
return false;
}
@@ -4340,14 +4188,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
gfp_mask, alloc_flags, ac->migratetype);
if (page) {
prep_new_page(page, order, gfp_mask, alloc_flags);
-
- /*
- * If this is a high-order atomic allocation then check
- * if the pageblock should be reserved for the future
- */
- if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
- reserve_highatomic_pageblock(page, zone, order);
-
return page;
} else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -4844,7 +4684,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
* Shrink them and try again
*/
if (!page && !drained) {
- unreserve_highatomic_pageblock(ac, false);
drain_all_pages(NULL);
drained = true;
goto retry;
@@ -5013,10 +4852,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
* Make sure we converge to OOM if we cannot make any progress
* several times in the row.
*/
- if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
- /* Before OOM, exhaust highatomic_reserve */
- return unreserve_highatomic_pageblock(ac, true);
- }
+ if (*no_progress_loops > MAX_RECLAIM_RETRIES)
+ return false;

/*
* Keep reclaiming pages while there is a chance this will lead
@@ -6129,7 +5966,6 @@ static void show_migration_types(unsigned char type)
[MIGRATE_UNMOVABLE] = 'U',
[MIGRATE_MOVABLE] = 'M',
[MIGRATE_RECLAIMABLE] = 'E',
- [MIGRATE_HIGHATOMIC] = 'H',
[MIGRATE_FREE] = 'F',
#ifdef CONFIG_CMA
[MIGRATE_CMA] = 'C',
@@ -6194,7 +6030,7 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
" sec_pagetables:%lu bounce:%lu\n"
" kernel_misc_reclaimable:%lu\n"
" free:%lu free_unmovable:%lu free_movable:%lu\n"
- " free_reclaimable:%lu free_highatomic:%lu free_free:%lu\n"
+ " free_reclaimable:%lu free_free:%lu\n"
" free_cma:%lu free_pcp:%lu\n",
global_node_page_state(NR_ACTIVE_ANON),
global_node_page_state(NR_INACTIVE_ANON),
@@ -6217,7 +6053,6 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
global_zone_page_state(NR_FREE_UNMOVABLE),
global_zone_page_state(NR_FREE_MOVABLE),
global_zone_page_state(NR_FREE_RECLAIMABLE),
- global_zone_page_state(NR_FREE_HIGHATOMIC),
global_zone_page_state(NR_FREE_FREE),
global_zone_page_state(NR_FREE_CMA_PAGES),
free_pcp);
@@ -6301,13 +6136,11 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
" free_unmovable:%lukB"
" free_movable:%lukB"
" free_reclaimable:%lukB"
- " free_highatomic:%lukB"
" free_free:%lukB"
" boost:%lukB"
" min:%lukB"
" low:%lukB"
" high:%lukB"
- " reserved_highatomic:%luKB"
" active_anon:%lukB"
" inactive_anon:%lukB"
" active_file:%lukB"
@@ -6327,13 +6160,11 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
K(zone_page_state(zone, NR_FREE_UNMOVABLE)),
K(zone_page_state(zone, NR_FREE_MOVABLE)),
K(zone_page_state(zone, NR_FREE_RECLAIMABLE)),
- K(zone_page_state(zone, NR_FREE_HIGHATOMIC)),
K(zone_page_state(zone, NR_FREE_FREE)),
K(zone->watermark_boost),
K(min_wmark_pages(zone)),
K(low_wmark_pages(zone)),
K(high_wmark_pages(zone)),
- K(zone->nr_reserved_highatomic),
K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c8b8e6e259da..a2f7b41564df 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1171,7 +1171,6 @@ const char * const vmstat_text[] = {
"nr_free_unmovable",
"nr_free_movable",
"nr_free_reclaimable",
- "nr_free_highatomic",
"nr_free_free",
"nr_zone_inactive_anon",
"nr_zone_active_anon",
--
2.39.2

2023-04-18 19:16:38

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 24/26] mm: page_alloc: kill watermark boosting

Watermark boosting is meant to increase the chances of pageblock
production when fallbacks are observed. Since reclaim/compaction now
produce neutral pageblocks by default, this is no longer needed.

Signed-off-by: Johannes Weiner <[email protected]>
---
Documentation/admin-guide/sysctl/vm.rst | 21 -----
include/linux/mm.h | 1 -
include/linux/mmzone.h | 12 +--
kernel/sysctl.c | 8 --
mm/page_alloc.c | 67 --------------
mm/vmscan.c | 111 +-----------------------
mm/vmstat.c | 2 -
7 files changed, 7 insertions(+), 215 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 988f6a4c8084..498655c322bc 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -72,7 +72,6 @@ files can be found in mm/swap.c.
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
-- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode

@@ -968,26 +967,6 @@ directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.


-watermark_boost_factor
-======================
-
-This factor controls the level of reclaim when memory is being fragmented.
-It defines the percentage of the high watermark of a zone that will be
-reclaimed if pages of different mobility are being mixed within pageblocks.
-The intent is that compaction has less work to do in the future and to
-increase the success rate of future high-order allocations such as SLUB
-allocations, THP and hugetlbfs pages.
-
-To make it sensible with respect to the watermark_scale_factor
-parameter, the unit is in fractions of 10,000. The default value of
-15,000 means that up to 150% of the high watermark will be reclaimed in the
-event of a pageblock being mixed due to fragmentation. The level of reclaim
-is determined by the number of fragmentation events that occurred in the
-recent past. If this value is smaller than a pageblock then a pageblocks
-worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
-of 0 will disable the feature.
-
-
watermark_scale_factor
======================

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f13f20258ce9..e7c2631848ed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2746,7 +2746,6 @@ extern void setup_per_cpu_pageset(void);

/* page_alloc.c */
extern int min_free_kbytes;
-extern int watermark_boost_factor;
extern int watermark_scale_factor;
extern bool arch_has_descending_max_zone_pfns(void);

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c705f2f7c829..1363ff6caff3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -567,10 +567,10 @@ enum zone_watermarks {
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)

-#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
-#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
-#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
-#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
+#define wmark_pages(z, i) (z->_watermark[i])

/* Fields and list protected by pagesets local_lock in page_alloc.c */
struct per_cpu_pages {
@@ -709,7 +709,6 @@ struct zone {

/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long _watermark[NR_WMARK];
- unsigned long watermark_boost;

/*
* We don't know if the memory that we're going to allocate will be
@@ -884,9 +883,6 @@ enum pgdat_flags {
};

enum zone_flags {
- ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
- * Cleared when kswapd is woken.
- */
ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */
};

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 137d4abe3eda..68bcd3a7c9c6 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2229,14 +2229,6 @@ static struct ctl_table vm_table[] = {
.proc_handler = min_free_kbytes_sysctl_handler,
.extra1 = SYSCTL_ZERO,
},
- {
- .procname = "watermark_boost_factor",
- .data = &watermark_boost_factor,
- .maxlen = sizeof(watermark_boost_factor),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = SYSCTL_ZERO,
- },
{
.procname = "watermark_scale_factor",
.data = &watermark_scale_factor,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8ae04feb1bd..f835a5548164 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -401,7 +401,6 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {

int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
-int watermark_boost_factor __read_mostly = 15000;
int watermark_scale_factor = 10;

static unsigned long nr_kernel_pages __initdata;
@@ -2742,43 +2741,6 @@ static bool can_steal_fallback(unsigned int order, int start_mt,
return false;
}

-static inline bool boost_watermark(struct zone *zone)
-{
- unsigned long max_boost;
-
- if (!watermark_boost_factor)
- return false;
- /*
- * Don't bother in zones that are unlikely to produce results.
- * On small machines, including kdump capture kernels running
- * in a small area, boosting the watermark can cause an out of
- * memory situation immediately.
- */
- if ((pageblock_nr_pages * 4) > zone_managed_pages(zone))
- return false;
-
- max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
- watermark_boost_factor, 10000);
-
- /*
- * high watermark may be uninitialised if fragmentation occurs
- * very early in boot so do not boost. We do not fall
- * through and boost by pageblock_nr_pages as failing
- * allocations that early means that reclaim is not going
- * to help and it may even be impossible to reclaim the
- * boosted watermark resulting in a hang.
- */
- if (!max_boost)
- return false;
-
- max_boost = max(pageblock_nr_pages, max_boost);
-
- zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
- max_boost);
-
- return true;
-}
-
/*
* This function implements actual steal behaviour. If order is large enough,
* we can steal whole pageblock. If not, we first move freepages in this
@@ -2802,14 +2764,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
goto single_page;
}

- /*
- * Boost watermarks to increase reclaim pressure to reduce the
- * likelihood of future fallbacks. Wake kswapd now as the node
- * may be balanced overall and kswapd will not wake naturally.
- */
- if (boost_watermark(zone) && (alloc_flags & ALLOC_KSWAPD))
- set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
-
/* We are not allowed to try stealing from the whole block */
if (!whole_block)
goto single_page;
@@ -3738,12 +3692,6 @@ struct page *rmqueue(struct zone *preferred_zone,
migratetype);

out:
- /* Separate test+clear to avoid unnecessary atomics */
- if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
- clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
- wakeup_kswapd(zone, 0, 0, zone_idx(zone));
- }
-
VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
return page;
}
@@ -3976,18 +3924,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
if (__zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
free_pages))
return true;
- /*
- * Ignore watermark boosting for GFP_ATOMIC order-0 allocations
- * when checking the min watermark. The min watermark is the
- * point where boosting is ignored so that kswapd is woken up
- * when below the low watermark.
- */
- if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost
- && ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
- mark = z->_watermark[WMARK_MIN];
- return __zone_watermark_ok(z, order, mark, highest_zoneidx,
- alloc_flags, free_pages);
- }

return false;
}
@@ -6137,7 +6073,6 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
" free_movable:%lukB"
" free_reclaimable:%lukB"
" free_free:%lukB"
- " boost:%lukB"
" min:%lukB"
" low:%lukB"
" high:%lukB"
@@ -6161,7 +6096,6 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i
K(zone_page_state(zone, NR_FREE_MOVABLE)),
K(zone_page_state(zone, NR_FREE_RECLAIMABLE)),
K(zone_page_state(zone, NR_FREE_FREE)),
- K(zone->watermark_boost),
K(min_wmark_pages(zone)),
K(low_wmark_pages(zone)),
K(high_wmark_pages(zone)),
@@ -8701,7 +8635,6 @@ static void __setup_per_zone_wmarks(void)
if (IS_ENABLED(CONFIG_COMPACTION))
tmp = ALIGN(tmp, 1 << pageblock_order);

- zone->watermark_boost = 0;
zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = low_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_PROMO] = high_wmark_pages(zone) + tmp;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a7374cd6fe91..5586be6997cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6827,30 +6827,6 @@ static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
} while (memcg);
}

-static bool pgdat_watermark_boosted(pg_data_t *pgdat, int highest_zoneidx)
-{
- int i;
- struct zone *zone;
-
- /*
- * Check for watermark boosts top-down as the higher zones
- * are more likely to be boosted. Both watermarks and boosts
- * should not be checked at the same time as reclaim would
- * start prematurely when there is no boosting and a lower
- * zone is balanced.
- */
- for (i = highest_zoneidx; i >= 0; i--) {
- zone = pgdat->node_zones + i;
- if (!managed_zone(zone))
- continue;
-
- if (zone->watermark_boost)
- return true;
- }
-
- return false;
-}
-
/*
* Returns true if there is an eligible zone balanced for the request order
* and highest_zoneidx
@@ -7025,14 +7001,13 @@ static void balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
unsigned long pflags;
- unsigned long nr_boost_reclaim;
- unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
- bool boosted;
struct zone *zone;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.order = order,
.may_unmap = 1,
+ .may_swap = 1,
+ .may_writepage = !laptop_mode,
};

set_task_reclaim_state(current, &sc.reclaim_state);
@@ -7041,29 +7016,11 @@ static void balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)

count_vm_event(PAGEOUTRUN);

- /*
- * Account for the reclaim boost. Note that the zone boost is left in
- * place so that parallel allocations that are near the watermark will
- * stall or direct reclaim until kswapd is finished.
- */
- nr_boost_reclaim = 0;
- for (i = 0; i <= highest_zoneidx; i++) {
- zone = pgdat->node_zones + i;
- if (!managed_zone(zone))
- continue;
-
- nr_boost_reclaim += zone->watermark_boost;
- zone_boosts[i] = zone->watermark_boost;
- }
- boosted = nr_boost_reclaim;
-
-restart:
set_reclaim_active(pgdat, highest_zoneidx);
sc.priority = DEF_PRIORITY;
do {
unsigned long nr_reclaimed = sc.nr_reclaimed;
bool raise_priority = true;
- bool balanced;
bool ret;

sc.reclaim_idx = highest_zoneidx;
@@ -7089,40 +7046,9 @@ static void balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
}
}

- /*
- * If the pgdat is imbalanced then ignore boosting and preserve
- * the watermarks for a later time and restart. Note that the
- * zone watermarks will be still reset at the end of balancing
- * on the grounds that the normal reclaim should be enough to
- * re-evaluate if boosting is required when kswapd next wakes.
- */
- balanced = pgdat_balanced(pgdat, sc.order, highest_zoneidx);
- if (!balanced && nr_boost_reclaim) {
- nr_boost_reclaim = 0;
- goto restart;
- }
-
- /*
- * If boosting is not active then only reclaim if there are no
- * eligible zones. Note that sc.reclaim_idx is not used as
- * buffer_heads_over_limit may have adjusted it.
- */
- if (!nr_boost_reclaim && balanced)
+ if (pgdat_balanced(pgdat, sc.order, highest_zoneidx))
goto out;

- /* Limit the priority of boosting to avoid reclaim writeback */
- if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
- raise_priority = false;
-
- /*
- * Do not writeback or swap pages for boosted reclaim. The
- * intent is to relieve pressure not issue sub-optimal IO
- * from reclaim context. If no pages are reclaimed, the
- * reclaim will be aborted.
- */
- sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
- sc.may_swap = !nr_boost_reclaim;
-
/*
* Do some background aging, to give pages a chance to be
* referenced before reclaiming. All pages are rotated
@@ -7173,15 +7099,6 @@ static void balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
* progress in reclaiming pages
*/
nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
- nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);
-
- /*
- * If reclaim made no progress for a boost, stop reclaim as
- * IO cannot be queued and it could be an infinite loop in
- * extreme circumstances.
- */
- if (nr_boost_reclaim && !nr_reclaimed)
- break;

if (raise_priority || !nr_reclaimed)
sc.priority--;
@@ -7193,28 +7110,6 @@ static void balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
out:
clear_reclaim_active(pgdat, highest_zoneidx);

- /* If reclaim was boosted, account for the reclaim done in this pass */
- if (boosted) {
- unsigned long flags;
-
- for (i = 0; i <= highest_zoneidx; i++) {
- if (!zone_boosts[i])
- continue;
-
- /* Increments are under the zone lock */
- zone = pgdat->node_zones + i;
- spin_lock_irqsave(&zone->lock, flags);
- zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
- spin_unlock_irqrestore(&zone->lock, flags);
- }
-
- /*
- * As there is now likely space, wakeup kcompact to defragment
- * pageblocks.
- */
- wakeup_kcompactd(pgdat, pageblock_order, highest_zoneidx);
- }
-
snapshot_refaults(NULL, pgdat);
__fs_reclaim_release(_THIS_IP_);
psi_memstall_leave(&pflags);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a2f7b41564df..80ee26588242 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1682,7 +1682,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
}
seq_printf(m,
"\n pages free %lu"
- "\n boost %lu"
"\n min %lu"
"\n low %lu"
"\n high %lu"
@@ -1691,7 +1690,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n managed %lu"
"\n cma %lu",
zone_page_state(zone, NR_FREE_PAGES),
- zone->watermark_boost,
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
--
2.39.2

2023-04-18 19:16:40

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 20/26] mm: vmscan: use compaction_suitable() check in kswapd

Kswapd currently bails on higher-order allocations with an open-coded
check for whether it has reclaimed the compaction gap.

compaction_suitable() is the customary interface to coordinate reclaim
with compaction.
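
To illustrate the criterion this switches kswapd to, here is a minimal
userspace sketch (not kernel code; struct zone_model, zone_balanced()
and the hardcoded numbers are stand-ins, and the 2 << order gap mirrors
compact_gap()): a zone counts as balanced if the watermark is met at
the requested order, or if there is at least a compaction gap worth of
order-0 pages for compaction to work with.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the zone state kswapd looks at. */
struct zone_model {
        long free_pages;        /* order-0 free pages */
        long high_wmark;        /* high watermark, in pages */
        bool order_ok;          /* watermark met at the requested order? */
};

/* Compaction needs roughly twice the request size as scratch space. */
static long compact_gap(int order)
{
        return 2L << order;
}

/* Model of the per-zone balance check after this patch. */
static bool zone_balanced(const struct zone_model *z, int order)
{
        if (z->order_ok)
                return true;    /* allocation can already succeed */
        /* Not balanced at @order, but enough order-0 memory to compact. */
        return z->free_pages >= z->high_wmark + compact_gap(order);
}

int main(void)
{
        struct zone_model z = { .free_pages = 4096, .high_wmark = 1024 };

        printf("order-9 balanced: %d\n", zone_balanced(&z, 9));
        return 0;
}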

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 67 ++++++++++++++++++-----------------------------------
1 file changed, 23 insertions(+), 44 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee8c8ca2e7b5..723705b9e4d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6872,12 +6872,18 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
if (!managed_zone(zone))
continue;

+ /* Allocation can succeed in any zone, done */
if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
mark = wmark_pages(zone, WMARK_PROMO);
else
mark = high_wmark_pages(zone);
if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
return true;
+
+ /* Allocation can't succeed, but enough order-0 to compact */
+ if (compaction_suitable(zone, order,
+ highest_zoneidx) == COMPACT_CONTINUE)
+ return true;
}

/*
@@ -6968,16 +6974,6 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
*/
shrink_node(pgdat, sc);

- /*
- * Fragmentation may mean that the system cannot be rebalanced for
- * high-order allocations. If twice the allocation size has been
- * reclaimed then recheck watermarks only at order-0 to prevent
- * excessive reclaim. Assume that a process requested a high-order
- * can direct reclaim/compact.
- */
- if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
- sc->order = 0;
-
return sc->nr_scanned >= sc->nr_to_reclaim;
}

@@ -7018,15 +7014,13 @@ clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx)
* that are eligible for use by the caller until at least one zone is
* balanced.
*
- * Returns the order kswapd finished reclaiming at.
- *
* kswapd scans the zones in the highmem->normal->dma direction. It skips
* zones which have free_pages > high_wmark_pages(zone), but once a zone is
* found to have free_pages <= high_wmark_pages(zone), any page in that zone
* or lower is eligible for reclaim until at least one usable zone is
* balanced.
*/
-static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
+static void balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
{
int i;
unsigned long nr_soft_reclaimed;
@@ -7226,14 +7220,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
__fs_reclaim_release(_THIS_IP_);
psi_memstall_leave(&pflags);
set_task_reclaim_state(current, NULL);
-
- /*
- * Return the order kswapd stopped reclaiming at as
- * prepare_kswapd_sleep() takes it into account. If another caller
- * entered the allocator slow path while kswapd was awake, order will
- * remain at the higher level.
- */
- return sc.order;
}

/*
@@ -7251,7 +7237,7 @@ static enum zone_type kswapd_highest_zoneidx(pg_data_t *pgdat,
return curr_idx == MAX_NR_ZONES ? prev_highest_zoneidx : curr_idx;
}

-static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
unsigned int highest_zoneidx)
{
long remaining = 0;
@@ -7269,7 +7255,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
* eligible zone balanced that it's also unlikely that compaction will
* succeed.
*/
- if (prepare_kswapd_sleep(pgdat, reclaim_order, highest_zoneidx)) {
+ if (prepare_kswapd_sleep(pgdat, order, highest_zoneidx)) {
/*
* Compaction records what page blocks it recently failed to
* isolate pages from and skips them in the future scanning.
@@ -7282,7 +7268,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
* We have freed the memory, now we should compact it to make
* allocation of the requested order possible.
*/
- wakeup_kcompactd(pgdat, alloc_order, highest_zoneidx);
+ wakeup_kcompactd(pgdat, order, highest_zoneidx);

remaining = schedule_timeout(HZ/10);

@@ -7296,8 +7282,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
kswapd_highest_zoneidx(pgdat,
highest_zoneidx));

- if (READ_ONCE(pgdat->kswapd_order) < reclaim_order)
- WRITE_ONCE(pgdat->kswapd_order, reclaim_order);
+ if (READ_ONCE(pgdat->kswapd_order) < order)
+ WRITE_ONCE(pgdat->kswapd_order, order);
}

finish_wait(&pgdat->kswapd_wait, &wait);
@@ -7308,8 +7294,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
* After a short sleep, check if it was a premature sleep. If not, then
* go fully to sleep until explicitly woken up.
*/
- if (!remaining &&
- prepare_kswapd_sleep(pgdat, reclaim_order, highest_zoneidx)) {
+ if (!remaining && prepare_kswapd_sleep(pgdat, order, highest_zoneidx)) {
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

/*
@@ -7350,8 +7335,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
*/
static int kswapd(void *p)
{
- unsigned int alloc_order, reclaim_order;
- unsigned int highest_zoneidx = MAX_NR_ZONES - 1;
+ unsigned int order, highest_zoneidx;
pg_data_t *pgdat = (pg_data_t *)p;
struct task_struct *tsk = current;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
@@ -7374,22 +7358,20 @@ static int kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_KSWAPD;
set_freezable();

- WRITE_ONCE(pgdat->kswapd_order, 0);
+ order = 0;
+ highest_zoneidx = MAX_NR_ZONES - 1;
+ WRITE_ONCE(pgdat->kswapd_order, order);
WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+
atomic_set(&pgdat->nr_writeback_throttled, 0);
+
for ( ; ; ) {
bool ret;

- alloc_order = reclaim_order = READ_ONCE(pgdat->kswapd_order);
- highest_zoneidx = kswapd_highest_zoneidx(pgdat,
- highest_zoneidx);
-
-kswapd_try_sleep:
- kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
- highest_zoneidx);
+ kswapd_try_to_sleep(pgdat, order, highest_zoneidx);

/* Read the new order and highest_zoneidx */
- alloc_order = READ_ONCE(pgdat->kswapd_order);
+ order = READ_ONCE(pgdat->kswapd_order);
highest_zoneidx = kswapd_highest_zoneidx(pgdat,
highest_zoneidx);
WRITE_ONCE(pgdat->kswapd_order, 0);
@@ -7415,11 +7397,8 @@ static int kswapd(void *p)
* request (alloc_order).
*/
trace_mm_vmscan_kswapd_wake(pgdat->node_id, highest_zoneidx,
- alloc_order);
- reclaim_order = balance_pgdat(pgdat, alloc_order,
- highest_zoneidx);
- if (reclaim_order < alloc_order)
- goto kswapd_try_sleep;
+ order);
+ balance_pgdat(pgdat, order, highest_zoneidx);
}

tsk->flags &= ~(PF_MEMALLOC | PF_KSWAPD);
--
2.39.2

2023-04-18 19:16:50

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 08/26] mm: page_alloc: claim blocks during compaction capturing

When capturing a whole block, update the migratetype accordingly. For
example, a THP allocation might capture an unmovable block. If the THP
gets split and partially freed later, the remainder should group up
with movable allocations.
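
As a rough illustration of the claiming step, here is a standalone
sketch (not the kernel implementation; the block_mt[] array,
claim_blocks() and PAGEBLOCK_ORDER=9 for 4k pages are made up for the
example): capturing a buddy of at least pageblock size re-tags every
pageblock it spans with the capturing allocation's migratetype, the
way change_pageblock_range() does, so later partial frees group
correctly.

#include <stdio.h>

#define PAGEBLOCK_ORDER 9       /* assumed: 2M blocks with 4k pages */

enum mt { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE };

/* Toy "pageblock bitmap": one migratetype per block. */
static enum mt block_mt[16];

/* Re-tag all pageblocks spanned by a captured buddy of @order. */
static void claim_blocks(int first_block, int order, enum mt new_mt)
{
        int nr_blocks = 1 << (order - PAGEBLOCK_ORDER);

        while (nr_blocks--)
                block_mt[first_block++] = new_mt;
}

int main(void)
{
        block_mt[4] = MT_UNMOVABLE;
        block_mt[5] = MT_UNMOVABLE;

        /* A movable (e.g. THP) capture of an order-10 buddy spanning 4-5. */
        claim_blocks(4, 10, MT_MOVABLE);

        printf("block 4 mt=%d, block 5 mt=%d\n", block_mt[4], block_mt[5]);
        return 0;
}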

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/internal.h | 1 +
mm/page_alloc.c | 42 ++++++++++++++++++++++++------------------
2 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 024affd4e4b5..39f65a463631 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -432,6 +432,7 @@ struct compact_control {
*/
struct capture_control {
struct compact_control *cc;
+ int migratetype;
struct page *page;
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4d20513c83be..8e5996f8b4b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -615,6 +615,17 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
page_to_pfn(page), MIGRATETYPE_MASK);
}

+static void change_pageblock_range(struct page *pageblock_page,
+ int start_order, int migratetype)
+{
+ int nr_pageblocks = 1 << (start_order - pageblock_order);
+
+ while (nr_pageblocks--) {
+ set_pageblock_migratetype(pageblock_page, migratetype);
+ pageblock_page += pageblock_nr_pages;
+ }
+}
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -962,14 +973,19 @@ compaction_capture(struct capture_control *capc, struct page *page,
is_migrate_isolate(migratetype))
return false;

- /*
- * Do not let lower order allocations pollute a movable pageblock.
- * This might let an unmovable request use a reclaimable pageblock
- * and vice-versa but no more than normal fallback logic which can
- * have trouble finding a high-order free page.
- */
- if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
+ if (order >= pageblock_order) {
+ migratetype = capc->migratetype;
+ change_pageblock_range(page, order, migratetype);
+ } else if (migratetype == MIGRATE_MOVABLE) {
+ /*
+ * Do not let lower order allocations pollute a
+ * movable pageblock. This might let an unmovable
+ * request use a reclaimable pageblock and vice-versa
+ * but no more than normal fallback logic which can
+ * have trouble finding a high-order free page.
+ */
return false;
+ }

capc->page = page;
return true;
@@ -2674,17 +2690,6 @@ int move_freepages_block(struct zone *zone, struct page *page,
old_mt, new_mt, num_movable);
}

-static void change_pageblock_range(struct page *pageblock_page,
- int start_order, int migratetype)
-{
- int nr_pageblocks = 1 << (start_order - pageblock_order);
-
- while (nr_pageblocks--) {
- set_pageblock_migratetype(pageblock_page, migratetype);
- pageblock_page += pageblock_nr_pages;
- }
-}
-
/*
* When we are falling back to another migratetype during allocation, try to
* steal extra free pages from the same pageblocks to satisfy further
@@ -4481,6 +4486,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned long pflags;
unsigned int noreclaim_flag;
struct capture_control capc = {
+ .migratetype = ac->migratetype,
.page = NULL,
};

--
2.39.2

2023-04-18 19:17:03

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 26/26] mm: page_alloc: add sanity checks for migratetypes

Now that the known sources of block pollution (fallbacks, !movable
compaction, highatomic reserves and single-THP pcplists) are gone, add
high-level sanity checks to ensure that pages coming out of the
allocator are of the requested migratetype.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9db588a1de3b..b8767a6075e8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3675,6 +3675,7 @@ struct page *rmqueue(struct zone *preferred_zone,
int migratetype)
{
struct page *page;
+ int buddy = 0;

/*
* We most definitely don't want callers attempting to
@@ -3698,9 +3699,14 @@ struct page *rmqueue(struct zone *preferred_zone,

page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
migratetype);
+ buddy = 1;

out:
VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
+ VM_WARN_ONCE(page && get_pageblock_migratetype(page) != migratetype,
+ "%d:%s order=%u gfp=%pGg mt=%s alloc_flags=%x buddy=%d\n",
+ zone_to_nid(zone), zone->name, order, &gfp_flags,
+ migratetype_names[migratetype], alloc_flags, buddy);
return page;
}

--
2.39.2

2023-04-18 19:17:10

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 14/26] mm: compaction: simplify should_compact_retry()

The different branches for retry are unnecessarily complicated. There
are really only three outcomes: progress, skipped, failed. Also, the
retry counter only applies to loops that made progress, so move it
there.
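
Schematically, the simplified decision looks like this (a standalone
sketch with made-up names, not the kernel's should_compact_retry()):
retry a bounded number of times after progress, defer to the reclaim
path when compaction was skipped for lack of order-0 pages, and
otherwise bump the compaction priority.

#include <stdbool.h>
#include <stdio.h>

enum outcome { PROGRESS, SKIPPED, FAILED };

#define MAX_RETRIES     16
#define MIN_PRIORITY    0       /* highest effort */

/* Model of the three-way retry logic. */
static bool should_retry(enum outcome result, int *retries, int *priority,
                         bool reclaim_would_help)
{
        switch (result) {
        case PROGRESS:
                /* Coalesced blocks but lost the race: retry a few times. */
                return ++(*retries) <= MAX_RETRIES;
        case SKIPPED:
                /* Not enough order-0 memory; only worth retrying if
                 * reclaim can free more pages first. */
                return reclaim_would_help;
        case FAILED:
        default:
                /* Full scan made no progress: retry only with more effort. */
                if (*priority > MIN_PRIORITY) {
                        (*priority)--;
                        *retries = 0;
                        return true;
                }
                return false;
        }
}

int main(void)
{
        int retries = 0, priority = 2;
        bool again = should_retry(FAILED, &retries, &priority, false);

        printf("retry after failure: %d (new priority %d)\n", again, priority);
        return 0;
}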

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 60 +++++++++++++++++--------------------------------
1 file changed, 20 insertions(+), 40 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3b7dc479936..18fa2bbba44b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4608,7 +4608,6 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
enum compact_priority *compact_priority,
int *compaction_retries)
{
- int max_retries = MAX_COMPACT_RETRIES;
int min_priority;
bool ret = false;
int retries = *compaction_retries;
@@ -4621,19 +4620,27 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
return false;

/*
- * Compaction managed to coalesce some page blocks, but the
- * allocation failed presumably due to a race. Retry some.
+ * Compaction coalesced some page blocks, but the allocation
+ * failed, presumably due to a race. Retry a few times.
*/
- if (compact_result == COMPACT_SUCCESS)
- (*compaction_retries)++;
+ if (compact_result == COMPACT_SUCCESS) {
+ int max_retries = MAX_COMPACT_RETRIES;

- /*
- * All zones were scanned completely and still no result. It
- * doesn't really make much sense to retry except when the
- * failure could be caused by insufficient priority
- */
- if (compact_result == COMPACT_COMPLETE)
- goto check_priority;
+ /*
+ * !costly requests are much more important than
+ * __GFP_RETRY_MAYFAIL costly ones because they are de
+ * facto nofail and invoke OOM killer to move on while
+ * costly can fail and users are ready to cope with
+ * that. 1/4 retries is rather arbitrary but we would
+ * need much more detailed feedback from compaction to
+ * make a better decision.
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ max_retries /= 4;
+
+ ret = ++(*compaction_retries) <= MAX_COMPACT_RETRIES;
+ goto out;
+ }

/*
* Compaction was skipped due to a lack of free order-0
@@ -4645,35 +4652,8 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
}

/*
- * If compaction backed due to being deferred, due to
- * contended locks in async mode, or due to scanners meeting
- * after a partial scan, retry with increased priority.
- */
- if (compact_result == COMPACT_DEFERRED ||
- compact_result == COMPACT_CONTENDED ||
- compact_result == COMPACT_PARTIAL_SKIPPED)
- goto check_priority;
-
- /*
- * !costly requests are much more important than __GFP_RETRY_MAYFAIL
- * costly ones because they are de facto nofail and invoke OOM
- * killer to move on while costly can fail and users are ready
- * to cope with that. 1/4 retries is rather arbitrary but we
- * would need much more detailed feedback from compaction to
- * make a better decision.
- */
- if (order > PAGE_ALLOC_COSTLY_ORDER)
- max_retries /= 4;
- if (*compaction_retries <= max_retries) {
- ret = true;
- goto out;
- }
-
- /*
- * Make sure there are attempts at the highest priority if we exhausted
- * all retries or failed at the lower priorities.
+ * Compaction failed. Retry with increasing priority.
*/
-check_priority:
min_priority = (order > PAGE_ALLOC_COSTLY_ORDER) ?
MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;

--
2.39.2

2023-04-18 19:17:14

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 25/26] mm: page_alloc: disallow fallbacks when 2M defrag is enabled

Fallbacks are already unlikely due to watermarks being enforced
against MIGRATE_FREE blocks. Eliminate them altogether. This allows
compaction to look exclusively at movable blocks, reducing the number
of pageblocks it needs to scan on an ongoing basis.
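
To make the table change concrete, here is a minimal standalone model
(made-up names, not the kernel's __rmqueue_fallback()): the fallback
list is walked until the sentinel, and with compaction enabled the
walk can only ever land on a neutral MIGRATE_FREE block.

#include <stdio.h>

enum mt { UNMOVABLE, MOVABLE, RECLAIMABLE, FREE, NR_MT /* sentinel */ };

/* With defrag enabled: the only permitted fallback is a neutral block. */
static const enum mt fallbacks[][2] = {
        [UNMOVABLE]   = { FREE, NR_MT },
        [MOVABLE]     = { FREE, NR_MT },
        [RECLAIMABLE] = { FREE, NR_MT },
};

/* Return the first fallback type with free pages, or -1 if none. */
static int find_fallback(enum mt start, const long *nr_free)
{
        for (int i = 0; fallbacks[start][i] != NR_MT; i++) {
                enum mt mt = fallbacks[start][i];

                if (nr_free[mt] > 0)
                        return mt;
        }
        return -1;      /* nothing suitable: reclaim/compact instead */
}

int main(void)
{
        long nr_free[NR_MT] = { [FREE] = 512 };

        printf("unmovable request falls back to mt %d\n",
               find_fallback(UNMOVABLE, nr_free));
        return 0;
}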

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/compaction.c | 52 +++++--------------------------------------------
mm/internal.h | 2 +-
mm/page_alloc.c | 8 ++++++++
3 files changed, 14 insertions(+), 48 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e33c99eb34a8..37dfd1878bef 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1258,46 +1258,6 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
#endif /* CONFIG_COMPACTION || CONFIG_CMA */
#ifdef CONFIG_COMPACTION

-static bool suitable_migration_source(struct compact_control *cc,
- struct page *page)
-{
- int block_mt;
-
- if (pageblock_skip_persistent(page))
- return false;
-
- if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
- return true;
-
- block_mt = get_pageblock_migratetype(page);
-
- if (cc->migratetype == MIGRATE_MOVABLE)
- return is_migrate_movable(block_mt);
- else
- return block_mt == cc->migratetype;
-}
-
-/* Returns true if the page is within a block suitable for migration to */
-static bool suitable_migration_target(struct compact_control *cc,
- struct page *page)
-{
- int mt = get_pageblock_migratetype(page);
-
- /* If the page is a large free page, then disallow migration */
- if (mt == MIGRATE_FREE)
- return false;
-
- if (cc->ignore_block_suitable)
- return true;
-
- /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
- if (is_migrate_movable(mt))
- return true;
-
- /* Otherwise skip the block */
- return false;
-}
-
static inline unsigned int
freelist_scan_limit(struct compact_control *cc)
{
@@ -1620,7 +1580,7 @@ static void isolate_freepages(struct compact_control *cc)
continue;

/* Check the block is suitable for migration */
- if (!suitable_migration_target(cc, page))
+ if (!is_migrate_movable(get_pageblock_migratetype(page)))
continue;

/* If isolation recently failed, do not retry */
@@ -1927,14 +1887,12 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
continue;

/*
- * For async direct compaction, only scan the pageblocks of the
- * same migratetype without huge pages. Async direct compaction
- * is optimistic to see if the minimum amount of work satisfies
- * the allocation. The cached PFN is updated as it's possible
- * that all remaining blocks between source and target are
+ * The cached PFN is updated as it's possible that all
+ * remaining blocks between source and target are
* unsuitable and the compaction scanners fail to meet.
*/
- if (!suitable_migration_source(cc, page)) {
+ if (pageblock_skip_persistent(page) ||
+ !is_migrate_movable(get_pageblock_migratetype(page))) {
update_cached_migrate(cc, block_end_pfn);
continue;
}
diff --git a/mm/internal.h b/mm/internal.h
index 24f43f5db88b..1c0886c3ce0e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -741,7 +741,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
-#ifdef CONFIG_ZONE_DMA32
+#if defined(CONFIG_ZONE_DMA32) && !defined(CONFIG_COMPACTION)
#define ALLOC_NOFRAGMENT 0x100 /* avoid mixing pageblock types */
#else
#define ALLOC_NOFRAGMENT 0x0
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f835a5548164..9db588a1de3b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2622,11 +2622,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
*
* The other migratetypes do not have fallbacks.
*/
+#ifdef CONFIG_COMPACTION
+static int fallbacks[MIGRATE_TYPES][2] = {
+ [MIGRATE_UNMOVABLE] = { MIGRATE_FREE, MIGRATE_TYPES },
+ [MIGRATE_MOVABLE] = { MIGRATE_FREE, MIGRATE_TYPES },
+ [MIGRATE_RECLAIMABLE] = { MIGRATE_FREE, MIGRATE_TYPES },
+};
+#else
static int fallbacks[MIGRATE_TYPES][4] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_FREE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_MOVABLE] = { MIGRATE_FREE, MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
[MIGRATE_RECLAIMABLE] = { MIGRATE_FREE, MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
};
+#endif

#ifdef CONFIG_CMA
static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
--
2.39.2

2023-04-18 19:18:32

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 22/26] mm: page_alloc: manage free memory in whole pageblocks

Right now, allocation requests only reclaim (and compact) for their
exact order. Since the majority of allocation requests are smaller
than a pageblock, this is likely to result in partial blocks being
freed, which are subsequently fragmented by fallbacks. This defeats
the allocator's efforts to group pageblocks by mobility.

Fix this mismatch between the allocator and reclaim/compaction: make
the pageblock the default unit for free memory by enforcing watermarks
against MIGRATE_FREE blocks, and have reclaim/compaction produce them.
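
The core of the change can be summarized in a few lines (a standalone
sketch; work_order() and PAGEBLOCK_ORDER=9 for 4k pages are
illustrative, not the kernel API): whenever reclaim or compaction runs
on behalf of an allocation, the goal order is raised to the pageblock,
so the work always produces whole neutral blocks.

#include <stdio.h>

#define PAGEBLOCK_ORDER 9       /* assumed: 2M blocks with 4k pages */

static int max_int(int a, int b)
{
        return a > b ? a : b;
}

/* All requests, even order-0, drive reclaim and compaction at block
 * granularity, so they produce whole neutral blocks instead of
 * partially freed, soon-to-be-remixed ones. */
static int work_order(int alloc_order)
{
        return max_int(alloc_order, PAGEBLOCK_ORDER);
}

int main(void)
{
        printf("order-0 request  -> work order %d\n", work_order(0));
        printf("order-10 request -> work order %d\n", work_order(10));
        return 0;
}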

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/compaction.h | 1 -
mm/compaction.c | 65 ++++---------
mm/internal.h | 1 +
mm/page_alloc.c | 183 ++++++++++++++++++++++---------------
mm/vmscan.c | 6 +-
5 files changed, 131 insertions(+), 125 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9e1b2c56df62..52b2487ef901 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -10,7 +10,6 @@ enum compact_priority {
COMPACT_PRIO_SYNC_FULL,
MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
COMPACT_PRIO_SYNC_LIGHT,
- MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
COMPACT_PRIO_ASYNC,
INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
diff --git a/mm/compaction.c b/mm/compaction.c
index 8080c04e644a..e33c99eb34a8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1784,15 +1784,6 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
if (cc->order <= PAGE_ALLOC_COSTLY_ORDER)
return pfn;

- /*
- * Only allow kcompactd and direct requests for movable pages to
- * quickly clear out a MOVABLE pageblock for allocation. This
- * reduces the risk that a large movable pageblock is freed for
- * an unmovable/reclaimable small allocation.
- */
- if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE)
- return pfn;
-
/*
* When starting the migration scanner, pick any pageblock within the
* first half of the search space. Otherwise try and pick a pageblock
@@ -2065,8 +2056,7 @@ static bool should_proactive_compact_node(pg_data_t *pgdat)

static enum compact_result __compact_finished(struct compact_control *cc)
{
- unsigned int order;
- const int migratetype = cc->migratetype;
+ unsigned long mark;
int ret;

/* Compaction run completes if the migrate and free scanner meet */
@@ -2120,39 +2110,16 @@ static enum compact_result __compact_finished(struct compact_control *cc)
if (!pageblock_aligned(cc->migrate_pfn))
return COMPACT_CONTINUE;

- /* Direct compactor: Is a suitable page free? */
+ /* Done when watermarks are restored */
ret = COMPACT_NO_SUITABLE_PAGE;
- for (order = cc->order; order < MAX_ORDER; order++) {
- struct free_area *area = &cc->zone->free_area[order];
- bool can_steal;
-
- /* Job done if page is free of the right migratetype */
- if (!free_area_empty(area, migratetype))
- return COMPACT_SUCCESS;
-
-#ifdef CONFIG_CMA
- /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
- if (migratetype == MIGRATE_MOVABLE &&
- !free_area_empty(area, MIGRATE_CMA))
- return COMPACT_SUCCESS;
-#endif
- /*
- * Job done if allocation would steal freepages from
- * other migratetype buddy lists.
- */
- if (find_suitable_fallback(area, order, migratetype,
- true, &can_steal) != -1)
- /*
- * Movable pages are OK in any pageblock. If we are
- * stealing for a non-movable allocation, make sure
- * we finish compacting the current pageblock first
- * (which is assured by the above migrate_pfn align
- * check) so it is as free as possible and we won't
- * have to steal another one soon.
- */
- return COMPACT_SUCCESS;
- }
-
+ if (cc->direct_compaction)
+ mark = wmark_pages(cc->zone,
+ cc->alloc_flags & ALLOC_WMARK_MASK);
+ else
+ mark = high_wmark_pages(cc->zone);
+ if (zone_watermark_ok(cc->zone, cc->order, mark,
+ cc->highest_zoneidx, cc->alloc_flags))
+ return COMPACT_SUCCESS;
out:
if (cc->contended || fatal_signal_pending(current))
ret = COMPACT_CONTENDED;
@@ -2310,8 +2277,12 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
unsigned long watermark;

/* Allocation can already succeed, nothing to do */
- watermark = wmark_pages(cc->zone,
- cc->alloc_flags & ALLOC_WMARK_MASK);
+ if (cc->direct_compaction)
+ watermark = wmark_pages(cc->zone,
+ cc->alloc_flags &
+ ALLOC_WMARK_MASK);
+ else
+ watermark = high_wmark_pages(cc->zone);
if (zone_watermark_ok(cc->zone, cc->order, watermark,
cc->highest_zoneidx, cc->alloc_flags))
return COMPACT_SUCCESS;
@@ -2800,7 +2771,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)

/* Allocation can succeed in any zone, done */
if (zone_watermark_ok(zone, pgdat->kcompactd_max_order,
- min_wmark_pages(zone),
+ high_wmark_pages(zone),
highest_zoneidx, 0))
return true;

@@ -2845,7 +2816,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)

/* Allocation can already succeed, nothing to do */
if (zone_watermark_ok(zone, cc.order,
- min_wmark_pages(zone), zoneid, 0))
+ high_wmark_pages(zone), zoneid, 0))
continue;

if (compaction_suitable(zone, cc.order,
diff --git a/mm/internal.h b/mm/internal.h
index 39f65a463631..5c76455f8042 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -432,6 +432,7 @@ struct compact_control {
*/
struct capture_control {
struct compact_control *cc;
+ int order;
int migratetype;
struct page *page;
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18fa2bbba44b..6f0bfc226c36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1075,7 +1075,7 @@ static inline bool
compaction_capture(struct zone *zone, struct page *page, int order,
int migratetype, struct capture_control *capc)
{
- if (!capc || order < capc->cc->order)
+ if (!capc || order < capc->order)
return false;

/* Do not accidentally pollute CMA or isolated regions*/
@@ -1097,8 +1097,8 @@ compaction_capture(struct zone *zone, struct page *page, int order,
return false;
}

- if (order > capc->cc->order)
- expand(zone, page, capc->cc->order, order, migratetype);
+ if (order > capc->order)
+ expand(zone, page, capc->order, order, migratetype);

capc->page = page;
return true;
@@ -3649,15 +3649,15 @@ int __isolate_free_page(struct page *page, unsigned int order)
int mt = get_pageblock_migratetype(page);

if (!is_migrate_isolate(mt)) {
+ long free_pages = zone_page_state(zone, NR_FREE_PAGES);
unsigned long watermark;
/*
- * Obey watermarks as if the page was being allocated. We can
- * emulate a high-order watermark check with a raised order-0
- * watermark, because we already know our high-order page
- * exists.
+ * Keep a lid on concurrent compaction. MIGRATE_FREE
+ * watermarks alone cannot be checked here, because
+ * that's what the caller is trying to produce.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ if (!__zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA, free_pages))
return 0;
}

@@ -3976,27 +3976,59 @@ noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
}
ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);

-static inline long __zone_watermark_unusable_free(struct zone *z,
- unsigned int order, unsigned int alloc_flags)
+static long page_state(struct zone *zone, enum zone_stat_item item, bool safe)
{
- const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
- long unusable_free = (1 << order) - 1;
+ if (safe)
+ return zone_page_state_snapshot(zone, item);
+ else
+ return zone_page_state(zone, item);
+}
+
+static long __zone_free_pages(struct zone *zone, int alloc_flags, bool safe)
+{
+ long free_pages;

/*
- * If the caller does not have rights to ALLOC_HARDER then subtract
- * the high-atomic reserves. This will over-estimate the size of the
- * atomic reserve but it avoids a search.
+ * Enforce watermarks against MIGRATE_FREE pages. This ensures
+ * that there is always a reserve of higher-order pages
+ * maintained for all migratetypes and allocation contexts.
+ *
+ * Allocations will still use up any compatible free pages
+ * that may exist inside claimed blocks first. But the reserve
+ * prevents smaller allocations from starving out higher-order
+ * requests (which may not be able to sleep, e.g. highatomic).
+ *
+ * The additional memory requirements of this are minimal. If
+ * internal free pages already exceed the compact_gap(), only
+ * compaction is necessary to restore the watermarks.
*/
- if (likely(!alloc_harder))
- unusable_free += z->nr_reserved_highatomic;
+ free_pages = page_state(zone, NR_FREE_FREE, safe);
+ if (alloc_flags & (ALLOC_HARDER | ALLOC_OOM))
+ free_pages += page_state(zone, NR_FREE_HIGHATOMIC, safe);
+ if (IS_ENABLED(CONFIG_CMA) && (alloc_flags & ALLOC_CMA))
+ free_pages += page_state(zone, NR_FREE_CMA_PAGES, safe);

-#ifdef CONFIG_CMA
- /* If allocation can't use CMA areas don't use free CMA pages */
- if (!(alloc_flags & ALLOC_CMA))
- unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
+ if (!IS_ENABLED(CONFIG_COMPACTION)) {
+ /*
+ * We can't reasonably defragment without compaction.
+ * Consider everything and do best-effort grouping.
+ */
+ free_pages += page_state(zone, NR_FREE_UNMOVABLE, safe);
+ free_pages += page_state(zone, NR_FREE_MOVABLE, safe);
+ free_pages += page_state(zone, NR_FREE_RECLAIMABLE, safe);
+ }

- return unusable_free;
+ return free_pages;
+}
+
+static long zone_free_pages(struct zone *zone, int alloc_flags)
+{
+ return __zone_free_pages(zone, alloc_flags, false);
+}
+
+static long zone_free_pages_safe(struct zone *zone, int alloc_flags)
+{
+ return __zone_free_pages(zone, alloc_flags, true);
}

/*
@@ -4014,7 +4046,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));

/* free_pages may go negative - that's OK */
- free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
+ free_pages -= (1 << order) - 1;

if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
@@ -4076,33 +4108,22 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
int highest_zoneidx, unsigned int alloc_flags)
{
return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
- zone_page_state(z, NR_FREE_PAGES));
+ zone_free_pages(z, alloc_flags));
}

static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
unsigned long mark, int highest_zoneidx,
unsigned int alloc_flags, gfp_t gfp_mask)
{
- long free_pages;
-
- free_pages = zone_page_state(z, NR_FREE_PAGES);
+ long free_pages = zone_free_pages(z, alloc_flags);

/*
* Fast check for order-0 only. If this fails then the reserves
* need to be calculated.
*/
- if (!order) {
- long usable_free;
- long reserved;
-
- usable_free = free_pages;
- reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);
-
- /* reserved may over estimate high-atomic reserves. */
- usable_free -= min(usable_free, reserved);
- if (usable_free > mark + z->lowmem_reserve[highest_zoneidx])
- return true;
- }
+ if (!order && (free_pages - ((1 << order) - 1) >
+ mark + z->lowmem_reserve[highest_zoneidx]))
+ return true;

if (__zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
free_pages))
@@ -4126,13 +4147,8 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
unsigned long mark, int highest_zoneidx)
{
- long free_pages = zone_page_state(z, NR_FREE_PAGES);
-
- if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
- free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
-
return __zone_watermark_ok(z, order, mark, highest_zoneidx, 0,
- free_pages);
+ zone_free_pages_safe(z, 0));
}

#ifdef CONFIG_NUMA
@@ -4524,12 +4540,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned long pflags;
unsigned int noreclaim_flag;
struct capture_control capc = {
+ .order = order,
.migratetype = ac->migratetype,
.page = NULL,
};
+ int compact_order;

- if (!order)
- return NULL;
+ /* Use reclaim/compaction to produce neutral blocks */
+ compact_order = max_t(int, order, pageblock_order);

/*
* Make sure the structs are really initialized before we expose the
@@ -4543,8 +4561,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
delayacct_compact_start();
noreclaim_flag = memalloc_noreclaim_save();

- *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &capc);
+ *compact_result = try_to_compact_pages(gfp_mask, compact_order,
+ alloc_flags, ac, prio, &capc);

memalloc_noreclaim_restore(noreclaim_flag);
psi_memstall_leave(&pflags);
@@ -4608,13 +4626,12 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
enum compact_priority *compact_priority,
int *compaction_retries)
{
- int min_priority;
bool ret = false;
int retries = *compaction_retries;
enum compact_priority priority = *compact_priority;

- if (!order)
- return false;
+ /* Use reclaim/compaction to produce neutral blocks */
+ order = max_t(int, order, pageblock_order);

if (fatal_signal_pending(current))
return false;
@@ -4624,20 +4641,6 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
* failed, presumably due to a race. Retry a few times.
*/
if (compact_result == COMPACT_SUCCESS) {
- int max_retries = MAX_COMPACT_RETRIES;
-
- /*
- * !costly requests are much more important than
- * __GFP_RETRY_MAYFAIL costly ones because they are de
- * facto nofail and invoke OOM killer to move on while
- * costly can fail and users are ready to cope with
- * that. 1/4 retries is rather arbitrary but we would
- * need much more detailed feedback from compaction to
- * make a better decision.
- */
- if (order > PAGE_ALLOC_COSTLY_ORDER)
- max_retries /= 4;
-
ret = ++(*compaction_retries) <= MAX_COMPACT_RETRIES;
goto out;
}
@@ -4654,16 +4657,13 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
/*
* Compaction failed. Retry with increasing priority.
*/
- min_priority = (order > PAGE_ALLOC_COSTLY_ORDER) ?
- MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;
-
- if (*compact_priority > min_priority) {
+ if (*compact_priority > MIN_COMPACT_PRIORITY) {
(*compact_priority)--;
*compaction_retries = 0;
ret = true;
}
out:
- trace_compact_retry(order, priority, compact_result, retries, max_retries, ret);
+ trace_compact_retry(order, priority, compact_result, retries, MAX_COMPACT_RETRIES, ret);
return ret;
}
#else
@@ -4822,9 +4822,16 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
bool drained = false;
+ int reclaim_order;
+
+ /* Use reclaim/compaction to produce neutral blocks */
+ if (IS_ENABLED(CONFIG_COMPACTION))
+ reclaim_order = max_t(int, order, pageblock_order);
+ else
+ reclaim_order = order;

psi_memstall_enter(&pflags);
- *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
+ *did_some_progress = __perform_reclaim(gfp_mask, reclaim_order, ac);
if (unlikely(!(*did_some_progress)))
goto out;

@@ -4856,6 +4863,10 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
pg_data_t *last_pgdat = NULL;
enum zone_type highest_zoneidx = ac->highest_zoneidx;

+ /* Use reclaim/compaction to produce neutral blocks */
+ if (IS_ENABLED(CONFIG_COMPACTION))
+ order = max_t(unsigned int, order, pageblock_order);
+
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, highest_zoneidx,
ac->nodemask) {
if (!managed_zone(zone))
@@ -4970,6 +4981,24 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
struct zoneref *z;
bool ret = false;

+ /*
+ * In the old world, order-0 pages only need reclaim, and
+ * higher orders might be present but the order-0 watermarks
+ * aren't met yet. These things can be fixed by reclaim alone.
+ *
+ * In the new world, though, watermark checks are against
+ * MIGRATE_FREE blocks. That means if the watermarks aren't
+ * met, reclaim isn't going to be the solution. Neither for
+ * order-0 nor for anything else. Whether it makes sense to
+ * retry depends fully on whether compaction should retry.
+ *
+ * should_compact_retry() already checks for COMPACT_SKIPPED
+ * and compaction_zonelist_suitable() to test whether reclaim
+ * is needed.
+ */
+ if (IS_ENABLED(CONFIG_COMPACTION))
+ goto schedule;
+
/*
* Costly allocations might have made a progress but this doesn't mean
* their order will become available due to high fragmentation so
@@ -5019,6 +5048,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
}
}

+schedule:
/*
* Memory allocation/reclaim might be called from a WQ context and the
* current implementation of the WQ concurrency control doesn't
@@ -8833,6 +8863,13 @@ static void __setup_per_zone_wmarks(void)
mult_frac(zone_managed_pages(zone),
watermark_scale_factor, 10000));

+ /*
+ * Ensure the watermark delta is a multiple of the
+ * neutral block that reclaim/compaction produces.
+ */
+ if (IS_ENABLED(CONFIG_COMPACTION))
+ tmp = ALIGN(tmp, 1 << pageblock_order);
+
zone->watermark_boost = 0;
zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = low_wmark_pages(zone) + tmp;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 14d6116384cc..a7374cd6fe91 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7438,8 +7438,7 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,

/* Hopeless node, leave it to direct reclaim if possible */
if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
- (pgdat_balanced(pgdat, order, highest_zoneidx) &&
- !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
+ pgdat_balanced(pgdat, order, highest_zoneidx)) {
/*
* There may be plenty of free memory available, but it's too
* fragmented for high-order allocations. Wake up kcompactd
@@ -7447,8 +7446,7 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
* needed. If it fails, it will defer subsequent attempts to
* ratelimit its work.
*/
- if (!(gfp_flags & __GFP_DIRECT_RECLAIM))
- wakeup_kcompactd(pgdat, order, highest_zoneidx);
+ wakeup_kcompactd(pgdat, order, highest_zoneidx);
return;
}

--
2.39.2

2023-04-18 19:24:26

by Johannes Weiner

[permalink] [raw]
Subject: [RFC PATCH 21/26] mm: compaction: align compaction goals with reclaim goals

Kswapd's goal is to balance at least one zone in the node for the
requested zoneidx, but no more than that. Kcompactd, on the other hand,
compacts all the zones in the node, even if one of them is already
compacted for the given request.

This can hog kcompactd unnecessarily long on a requested zoneidx. It
also has kcompactd working on zones without the cooperation of
kswapd. There is a compaction_suitable() check of course, but whether
that is true or not depends on luck, risking erratic behavior.

Make kcompactd follow the same criteria as kswapd when deciding to
work on a node, to keep them working in unison as much as possible.

Likewise, direct reclaim can bail as soon as one zone in the zonelist
is compaction_ready(), so check up front before hammering lower zones
while higher zones might already be suitable. This matches
compaction_zonelist_suitable() on the compaction side.
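
Both changes boil down to the same up-front test. Here is a standalone
model of the direct reclaim side (bail_before_reclaim() and struct
zone_model are made up for the example; in the kernel this corresponds
to the compaction_ready() check moved to the top of shrink_zones()):

#include <stdbool.h>
#include <stdio.h>

struct zone_model {
        const char *name;
        bool compaction_ready;  /* enough free memory to compact already? */
};

/* If any eligible zone can already be compacted for the request,
 * skip reclaim entirely and leave the rest to compaction. */
static bool bail_before_reclaim(const struct zone_model *zones, int nr)
{
        for (int i = 0; i < nr; i++)
                if (zones[i].compaction_ready)
                        return true;
        return false;
}

int main(void)
{
        struct zone_model zones[] = {
                { "DMA32",  false },
                { "Normal", true  },
        };

        printf("bail before reclaim: %d\n", bail_before_reclaim(zones, 2));
        return 0;
}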

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/compaction.c | 5 +++--
mm/vmscan.c | 35 +++++++++++++++++------------------
2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 52103545d58c..8080c04e644a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2798,12 +2798,13 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
if (!populated_zone(zone))
continue;

- /* Allocation can already succeed, check other zones */
+ /* Allocation can succeed in any zone, done */
if (zone_watermark_ok(zone, pgdat->kcompactd_max_order,
min_wmark_pages(zone),
highest_zoneidx, 0))
- continue;
+ return true;

+ /* Allocation can't succeed, but enough order-0 to compact */
if (compaction_suitable(zone, pgdat->kcompactd_max_order,
highest_zoneidx) == COMPACT_CONTINUE)
return true;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 723705b9e4d9..14d6116384cc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6277,7 +6277,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
sc->reclaim_idx, 0))
return true;

- /* Compaction cannot yet proceed. Do reclaim. */
+ /* Compaction cannot yet proceed, might need reclaim */
if (compaction_suitable(zone, sc->order,
sc->reclaim_idx) == COMPACT_SKIPPED)
return false;
@@ -6357,6 +6357,21 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
sc->reclaim_idx = gfp_zone(sc->gfp_mask);
}

+ /* Bail if any of the zones are already compactable */
+ if (IS_ENABLED(CONFIG_COMPACTION) &&
+ sc->order > PAGE_ALLOC_COSTLY_ORDER) {
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ sc->reclaim_idx, sc->nodemask) {
+ if (!cpuset_zone_allowed(zone,
+ GFP_KERNEL | __GFP_HARDWALL))
+ continue;
+ if (compaction_ready(zone, sc)) {
+ sc->compaction_ready = true;
+ goto out;
+ }
+ }
+ }
+
for_each_zone_zonelist_nodemask(zone, z, zonelist,
sc->reclaim_idx, sc->nodemask) {
/*
@@ -6368,22 +6383,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
GFP_KERNEL | __GFP_HARDWALL))
continue;

- /*
- * If we already have plenty of memory free for
- * compaction in this zone, don't free any more.
- * Even though compaction is invoked for any
- * non-zero order, only frequent costly order
- * reclamation is disruptive enough to become a
- * noticeable problem, like transparent huge
- * page allocations.
- */
- if (IS_ENABLED(CONFIG_COMPACTION) &&
- sc->order > PAGE_ALLOC_COSTLY_ORDER &&
- compaction_ready(zone, sc)) {
- sc->compaction_ready = true;
- continue;
- }
-
/*
* Shrink each node in the zonelist once. If the
* zonelist is ordered by zone (not the default) then a
@@ -6420,7 +6419,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)

if (first_pgdat)
consider_reclaim_throttle(first_pgdat, sc);
-
+out:
/*
* Restore to original mask to avoid the impact on the caller if we
* promoted it to __GFP_HIGHMEM.
--
2.39.2

2023-04-19 00:04:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] mm: reliable huge page allocator

On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> As memory capacity continues to grow, 4k TLB coverage has not been
> able to keep up. On Meta's 64G webservers, close to 20% of execution
> cycles are observed to be handling TLB misses when using 4k pages
> only. Huge pages are shifting from being a nice-to-have optimization
> for HPC workloads to becoming a necessity for common applications.
>
> However, while trying to deploy THP more universally, we observe a
> fragmentation problem in the page allocator that often prevents larger
> requests from being met quickly, or met at all, at runtime. Since we
> have to provision hardware capacity for worst case performance,
> unreliable huge page coverage isn't of much help.
>
> Drilling into the allocator, we find that existing defrag efforts,
> such as mobility grouping and watermark boosting, help, but are
> insufficient by themselves. We still observe a high number of blocks
> being routinely shared by allocations of different migratetypes. This
> in turn results in inefficient or ineffective reclaim/compaction runs.
>
> In a broad sample of Meta servers, we find that unmovable allocations
> make up less than 7% of total memory on average, yet occupy 34% of the
> 2M blocks in the system. We also found that this effect isn't
> correlated with high uptimes, and that servers can get heavily
> fragmented within the first hour of running a workload.
>
> The following experiment shows that only 20min of build load under
> moderate memory pressure already results in a significant number of
> typemixed blocks (block analysis run after system is back to idle):
>
> vanilla:
> unmovable 50
> movable 701
> reclaimable 149
> unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
> movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
> reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)
>
> patched:
> unmovable 65
> movable 457
> reclaimable 159
> free 219
> unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
> movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
> reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)
>
> [ The remaining "mixed blocks" in the patched kernel are false
> positives: LRU pages without migrate callbacks (empty_aops), and
> i915 shmem that is pinned until reclaimed through shrinkers. ]
>
> Root causes
>
> One of the behaviors that sabotage the page allocator's mobility
> grouping is the fact that requests of one migratetype are allowed to
> fall back into blocks of another type before reclaim and compaction
> occur. This is a design decision to prioritize memory utilization over
> block fragmentation - especially considering the history of lumpy
> reclaim and its tendency to overreclaim. However, with compaction
> available, these two goals are no longer in conflict: the scratch
> space of free pages for compaction to work is only twice the size of
> the allocation request; in most cases, only small amounts of
> proactive, coordinated reclaim and compaction is required to prevent a
> fallback which may fragment a pageblock indefinitely.
>
> Another problem lies in how the page allocator drives reclaim and
> compaction when it does invoke it. While the page allocator targets
> migratetype grouping at the pageblock level, it calls reclaim and
> compaction with the order of the allocation request. As most requests
> are smaller than a pageblock, this results in partial block freeing
> and subsequent fallbacks and type mixing.
>
> Note that in combination, these two design decisions have a
> self-reinforcing effect on fragmentation: 1. Partially used unmovable
> blocks are filled up with fallback movable pages. 2. A subsequent
> unmovable allocation, instead of grouping up, will then need to enter
> reclaim, which most likely results in a partially freed movable block
> that it falls back into. Over time, unmovable allocations are sparsely
> scattered throughout the address space and poison many pageblocks.
>
> Note that block fragmentation is driven by lower-order requests. It is
> not reliably mitigated by the mere presence of higher-order requests.
>
> Proposal
>
> This series proposes to make THP allocations reliable by enforcing
> pageblock hygiene, and aligning the allocator, reclaim and compaction
> on the pageblock as the base unit for managing free memory. All orders
> up to and including the pageblock are made first-class requests that
> (outside of OOM situations) are expected to succeed without
> exceptional investment by the allocating thread.
>
> A neutral pageblock type is introduced, MIGRATE_FREE. The first
> allocation to be placed into such a block claims it exclusively for
> the allocation's migratetype. Fallbacks from a different type are no
> longer allowed, and the block is "kept open" for more allocations of
> the same type to ensure tight grouping. A pageblock becomes neutral
> again only once all its pages have been freed.

Sounds like this will cause earlier OOM, no?

I guess with 2M pageblock on 64G server it shouldn't matter much. But how
about smaller machines?

> Reclaim and compaction are changed from partial block reclaim to
> producing whole neutral page blocks.

How does it affect allocation latencies? I see direct compact stall grew
substantially. Hm?

> The watermark logic is adjusted
> to apply to neutral blocks, ensuring that background and direct
> reclaim always maintain a readily-available reserve of them.
>
> The defragmentation effort changes from reactive to proactive. In
> turn, this makes defragmentation actually more efficient: compaction
> only has to scan movable blocks and can skip other blocks entirely;
> since movable blocks aren't poisoned by unmovable pages, the chances
> of successful compaction in each block are greatly improved as well.
>
> Defragmentation becomes an ongoing responsibility of all allocations,
> rather than being the burden of only higher-order asks. This prevents
> sub-block allocations - which cause block fragmentation in the first
> place - from starving the increasingly important larger requests.
>
> There is a slight increase in worst-case memory overhead by requiring
> the watermarks to be met against neutral blocks even when there might
> be free pages in typed blocks. However, the high watermarks are less
> than 1% of the zone, so the increase is relatively small.
>
> These changes only apply to CONFIG_COMPACTION kernels. Without
> compaction, fallbacks and partial block reclaim remain the best
> trade-off between memory utilization and fragmentation.
>
> Initial Test Results
>
> The following is purely an allocation reliability test. Achieving full
> THP benefits in practice is tied to other pending changes, such as the
> THP shrinker to avoid memory pressure from excessive internal
> fragmentation, and tweaks to the kernel's THP allocation strategy.
>
> The test is a kernel build under moderate-to-high memory pressure,
> with a concurrent process trying to repeatedly fault THPs (madvise):
>
> HUGEALLOC-VANILLA HUGEALLOC-PATCHED
> Real time 265.04 ( +0.00%) 268.12 ( +1.16%)
> User time 1131.05 ( +0.00%) 1131.13 ( +0.01%)
> System time 474.66 ( +0.00%) 478.97 ( +0.91%)
> THP fault alloc 17913.24 ( +0.00%) 19647.50 ( +9.68%)
> THP fault fallback 1947.12 ( +0.00%) 223.40 ( -88.48%)
> THP fault fail rate % 9.80 ( +0.00%) 1.12 ( -80.34%)
> Direct compact stall 282.44 ( +0.00%) 543.90 ( +92.25%)
> Direct compact fail 262.44 ( +0.00%) 239.90 ( -8.56%)
> Direct compact success 20.00 ( +0.00%) 304.00 ( +1352.38%)
> Direct compact success rate % 7.15 ( +0.00%) 57.10 ( +612.90%)
> Compact daemon scanned migrate 21643.80 ( +0.00%) 387479.80 ( +1690.18%)
> Compact daemon scanned free 188462.36 ( +0.00%) 2842824.10 ( +1408.42%)
> Compact direct scanned migrate 1601294.84 ( +0.00%) 275670.70 ( -82.78%)
> Compact direct scanned free 4476155.60 ( +0.00%) 2438835.00 ( -45.51%)
> Compact migrate scanned daemon % 1.32 ( +0.00%) 59.18 ( +2499.00%)
> Compact free scanned daemon % 3.95 ( +0.00%) 54.31 ( +1018.20%)
> Alloc stall 2425.00 ( +0.00%) 992.00 ( -59.07%)
> Pages kswapd scanned 586756.68 ( +0.00%) 975390.20 ( +66.23%)
> Pages kswapd reclaimed 385468.20 ( +0.00%) 437767.50 ( +13.57%)
> Pages direct scanned 335199.56 ( +0.00%) 501824.20 ( +49.71%)
> Pages direct reclaimed 127953.72 ( +0.00%) 151880.70 ( +18.70%)
> Pages scanned kswapd % 64.43 ( +0.00%) 66.39 ( +2.99%)
> Swap out 14083.88 ( +0.00%) 45034.60 ( +219.74%)
> Swap in 3395.08 ( +0.00%) 7767.50 ( +128.75%)
> File refaults 93546.68 ( +0.00%) 129648.30 ( +38.59%)
>
> The THP fault success rate is drastically improved. A bigger share of
> the work is done by the background threads, as they now proactively
> maintain MIGRATE_FREE block reserves. The increase in memory pressure
> is shown by the uptick in swap activity.
>
> Status
>
> Initial test results look promising, but production testing has been
> lagging behind the effort to generalize this code for upstream, and
> putting all the pieces together to make THP work. I'll follow up as I
> gather more data.
>
> Sending this out now as an RFC to get input on the overall direction.
>
> The patches are based on v6.2.
>
> Documentation/admin-guide/sysctl/vm.rst | 21 -
> block/bdev.c | 2 +-
> include/linux/compaction.h | 100 +---
> include/linux/gfp.h | 2 -
> include/linux/mm.h | 1 -
> include/linux/mmzone.h | 30 +-
> include/linux/page-isolation.h | 28 +-
> include/linux/pageblock-flags.h | 4 +-
> include/linux/vmstat.h | 8 -
> include/trace/events/mmflags.h | 4 +-
> kernel/sysctl.c | 8 -
> mm/compaction.c | 242 +++-----
> mm/internal.h | 14 +-
> mm/memory_hotplug.c | 4 +-
> mm/page_alloc.c | 930 +++++++++++++-----------------
> mm/page_isolation.c | 42 +-
> mm/vmscan.c | 251 ++------
> mm/vmstat.c | 6 +-
> 18 files changed, 629 insertions(+), 1068 deletions(-)
>
>

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-19 00:06:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] mm: make pageblock_order 2M per default

On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1.

Note that MAX_ORDER got redefined in -mm tree recently.

> Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.

This seems way too x86-specific. Other arches have larger THP sizes. I
believe 16M is common.

Maybe define it as min(MAX_ORDER, PMD_ORDER)?

> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/pageblock-flags.h | 4 ++--
> mm/page_alloc.c | 2 +-
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 5f1ae07d724b..05b6811f8cee 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
>
> #else /* CONFIG_HUGETLB_PAGE */
>
> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> -#define pageblock_order (MAX_ORDER-1)
> +/* Manage fragmentation at the 2M level */
> +#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))
>
> #endif /* CONFIG_HUGETLB_PAGE */
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac03571e0532..5e04a69f6a26 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
> /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
> void __init set_pageblock_order(void)
> {
> - unsigned int order = MAX_ORDER - 1;
> + unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
>
> /* Check that pageblock_nr_pages has not already been setup */
> if (pageblock_order)
> --
> 2.39.2
>
>

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-19 02:12:07

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] mm: reliable huge page allocator

Hi Kirill, thanks for taking a look so quickly.

On Wed, Apr 19, 2023 at 02:54:02AM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> > This series proposes to make THP allocations reliable by enforcing
> > pageblock hygiene, and aligning the allocator, reclaim and compaction
> > on the pageblock as the base unit for managing free memory. All orders
> > up to and including the pageblock are made first-class requests that
> > (outside of OOM situations) are expected to succeed without
> > exceptional investment by the allocating thread.
> >
> > A neutral pageblock type is introduced, MIGRATE_FREE. The first
> > allocation to be placed into such a block claims it exclusively for
> > the allocation's migratetype. Fallbacks from a different type are no
> > longer allowed, and the block is "kept open" for more allocations of
> > the same type to ensure tight grouping. A pageblock becomes neutral
> > again only once all its pages have been freed.
>
> Sounds like this will cause earlier OOM, no?
>
> I guess with 2M pageblock on 64G server it shouldn't matter much. But how
> about smaller machines?

Yes, it's a tradeoff.

It's not really possible to reduce external fragmentation and increase
contiguity, without also increasing the risk of internal fragmentation
to some extent. The tradeoff is slightly less usable memory, but
overall faster memory.

A 2M block size *seems* reasonable for most current setups. It's
actually still somewhat on the lower side, if you consider that we had
4k blocks when memory was a few megabytes. (4k pages for 4M RAM is the
same ratio as 2M pages for 2G RAM. My phone has 8G and my desktop 32G.
64G is unusually small for a datacenter server.)

I wouldn't be opposed to sticking this behind a separate config option
if there are setups that WOULD want to keep the current best-effort
compaction without the block hygiene. But obviously, from a
maintenance POV life would be much easier if we didn't have to.

FWIW, I have been doing tests in an environment constrained to 2G and
haven't had any issues with premature OOMs. But I'm happy to test
other situations and workloads that might be of interest to people.

> > Reclaim and compaction are changed from partial block reclaim to
> > producing whole neutral page blocks.
>
> How does it affect allocation latencies? I see direct compact stall grew
> substantially. Hm?

Good question.

There are 260 more compact stalls but also 1,734 more successful THP
allocations. And 1,433 fewer allocation stalls. There seems to be much
less direct work performed per successful allocation.

But of course, that's not the whole story. Let me trace the actual
latencies.

Thanks for your thoughts!
Johannes

2023-04-19 03:09:42

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] mm: make pageblock_order 2M per default

On Wed, Apr 19, 2023 at 03:01:05AM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> > pageblock_order can be of various sizes, depending on configuration,
> > but the default is MAX_ORDER-1.
>
> Note that MAX_ORDER got redefined in -mm tree recently.
>
> > Given 4k pages, that comes out to
> > 4M. This is a large chunk for the allocator/reclaim/compaction to try
> > to keep grouped per migratetype. It's also unnecessary as the majority
> > of higher order allocations - THP and slab - are smaller than that.
>
> This seems way too x86-specific.

Hey, those are the machines I have access to ;)

> Other arches have larger THP sizes. I believe 16M is common.
>
> Maybe define it as min(MAX_ORDER, PMD_ORDER)?

Hm, let me play around with larger pageblocks.

The thing that gives me pause is that this seems quite aggressive as a
default block size for the allocator and reclaim/compaction - if you
consider the implications for internal fragmentation and the amount of
ongoing defragmentation work it would require.

IOW, it's not just a function of physical page size supported by the
CPU. It's also a function of overall memory capacity. Independent of
architecture, 2MB seems like a more reasonable step up than 16M.

16M is great for TLB coverage, and in our DCs we're getting a lot of
use out of 1G hugetlb pages as well. The question is if those archs
are willing to pay the cost of serving such page sizes quickly and
reliably during runtime; or if that's something better left to setups
with explicit preallocations and stuff like hugetlb_cma reservations.

2023-04-19 03:49:16

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] mm: make pageblock_order 2M per default

On Tue, Apr 18, 2023 at 10:55:53PM -0400, Johannes Weiner wrote:
> On Wed, Apr 19, 2023 at 03:01:05AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> > > pageblock_order can be of various sizes, depending on configuration,
> > > but the default is MAX_ORDER-1.
> >
> > Note that MAX_ORDER got redefined in -mm tree recently.
> >
> > > Given 4k pages, that comes out to
> > > 4M. This is a large chunk for the allocator/reclaim/compaction to try
> > > to keep grouped per migratetype. It's also unnecessary as the majority
> > > of higher order allocations - THP and slab - are smaller than that.
> >
> > This seems way too x86-specific.
> > Other arches have larger THP sizes. I believe 16M is common.
> >
> > Maybe define it as min(MAX_ORDER, PMD_ORDER)?
>
> Hm, let me play around with larger pageblocks.
>
> The thing that gives me pause is that this seems quite aggressive as a
> default block size for the allocator and reclaim/compaction - if you
> consider the implications for internal fragmentation and the amount of
> ongoing defragmentation work it would require.
>
> IOW, it's not just a function of physical page size supported by the
> CPU. It's also a function of overall memory capacity. Independent of
> architecture, 2MB seems like a more reasonable step up than 16M.

[ Quick addition: on those other archs, these patches would still help
with other, non-THP sources of compound allocations, such as slub,
variable-order cache folios, and really any orders up to 2M. So it's
not like we *have* to raise it to PMD_ORDER for them to benefit. ]

2023-04-19 04:21:38

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] mm: reliable huge page allocator

On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> This series proposes to make THP allocations reliable by enforcing
> pageblock hygiene, and aligning the allocator, reclaim and compaction
> on the pageblock as the base unit for managing free memory. All orders
> up to and including the pageblock are made first-class requests that
> (outside of OOM situations) are expected to succeed without
> exceptional investment by the allocating thread.
>
> A neutral pageblock type is introduced, MIGRATE_FREE. The first
> allocation to be placed into such a block claims it exclusively for
> the allocation's migratetype. Fallbacks from a different type are no
> longer allowed, and the block is "kept open" for more allocations of
> the same type to ensure tight grouping. A pageblock becomes neutral
> again only once all its pages have been freed.

YES! This is exactly what I've been thinking is the right solution
for some time. Thank you for doing it.

2023-04-19 10:53:34

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] mm: make pageblock_order 2M per default

On 4/18/23 21:12, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.

Well in my experience the kernel usually has hugetlbfs config-enabled so it
uses 2MB pageblocks (on x86) even if hugetlbfs is unused at runtime and THP
is used instead. But sure, we can set a better default that's not tied to
hugetlbfs.

> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/pageblock-flags.h | 4 ++--
> mm/page_alloc.c | 2 +-
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 5f1ae07d724b..05b6811f8cee 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
>
> #else /* CONFIG_HUGETLB_PAGE */
>
> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> -#define pageblock_order (MAX_ORDER-1)
> +/* Manage fragmentation at the 2M level */
> +#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))
>
> #endif /* CONFIG_HUGETLB_PAGE */
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac03571e0532..5e04a69f6a26 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
> /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
> void __init set_pageblock_order(void)
> {
> - unsigned int order = MAX_ORDER - 1;
> + unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
>
> /* Check that pageblock_nr_pages has not already been setup */
> if (pageblock_order)

2023-04-19 11:03:09

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] mm: reliable huge page allocator

On 4/19/23 04:08, Johannes Weiner wrote:
> Hi Kirill, thanks for taking a look so quickly.
>
> On Wed, Apr 19, 2023 at 02:54:02AM +0300, Kirill A. Shutemov wrote:
>> On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
>> > This series proposes to make THP allocations reliable by enforcing
>> > pageblock hygiene, and aligning the allocator, reclaim and compaction
>> > on the pageblock as the base unit for managing free memory. All orders
>> > up to and including the pageblock are made first-class requests that
>> > (outside of OOM situations) are expected to succeed without
>> > exceptional investment by the allocating thread.
>> >
>> > A neutral pageblock type is introduced, MIGRATE_FREE. The first
>> > allocation to be placed into such a block claims it exclusively for
>> > the allocation's migratetype. Fallbacks from a different type are no
>> > longer allowed, and the block is "kept open" for more allocations of
>> > the same type to ensure tight grouping. A pageblock becomes neutral
>> > again only once all its pages have been freed.
>>
>> Sounds like this will cause earlier OOM, no?
>>
>> I guess with 2M pageblock on 64G server it shouldn't matter much. But how
>> about smaller machines?
>
> Yes, it's a tradeoff.
>
> It's not really possible to reduce external fragmentation and increase
> contiguity, without also increasing the risk of internal fragmentation
> to some extent. The tradeoff is slightly less usable memory, but
> overall faster memory.
>
> A 2M block size *seems* reasonable for most current setups. It's
> actually still somewhat on the lower side, if you consider that we had
> 4k blocks when memory was a few megabytes. (4k pages for 4M RAM is the
> same ratio as 2M pages for 2G RAM. My phone has 8G and my desktop 32G.
> 64G is unusually small for a datacenter server.)
>
> I wouldn't be opposed to sticking this behind a separate config option
> if there are setups that WOULD want to keep the current best-effort
> compaction without the block hygiene. But obviously, from a
> maintenance POV life would be much easier if we didn't have to.

As much as tunables are frowned upon in general, this could make sense to me
even as a runtime tunable (maybe with defaults based on how large the
system is), because a datacenter server and a phone are, after all, not the
same thing. But of course it would be preferable to find out that it works
reasonably well even for the smaller systems. For example, we already
completely disable mobility grouping if there's too little RAM for it to
make sense, which is a somewhat similar (but not identical) decision.
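
For reference, the existing bail-out is roughly this (paraphrased from
build_all_zonelists() in mm/page_alloc.c):

	/*
	 * Grouping by mobility is disabled when there isn't enough memory
	 * for at least one pageblock per migratetype.
	 */
	if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
		page_group_by_mobility_disabled = 1;
	else
		page_group_by_mobility_disabled = 0;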

> FWIW, I have been doing tests in an environment constrained to 2G and
> haven't had any issues with premature OOMs. But I'm happy to test
> other situations and workloads that might be of interest to people.
>
>> > Reclaim and compaction are changed from partial block reclaim to
>> > producing whole neutral page blocks.
>>
>> How does it affect allocation latencies? I see direct compact stall grew
>> substantially. Hm?
>
> Good question.
>
> There are 260 more compact stalls but also 1,734 more successful THP
> allocations. And 1,433 fewer allocation stalls. There seems to be much
> less direct work performed per successful allocation.

Yeah, if there's a workload that uses THP madvise to indicate it prefers
compaction stalls to base-page fallbacks, and compaction is more successful,
it won't defer further attempts, so as a result there will be more stalls.
What we should watch out for instead are the latencies of allocations that
don't prefer the stalls, but might now be forced to clean up new MIGRATE_FREE
pageblocks for their order-0 allocation when they would previously just
fall back, etc.

> But of course, that's not the whole story. Let me trace the actual
> latencies.
>
> Thanks for your thoughts!
> Johannes

2023-04-19 11:11:02

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] mm: make pageblock_order 2M per default

On 19.04.23 12:36, Vlastimil Babka wrote:
> On 4/18/23 21:12, Johannes Weiner wrote:
>> pageblock_order can be of various sizes, depending on configuration,
>> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
>> 4M. This is a large chunk for the allocator/reclaim/compaction to try
>> to keep grouped per migratetype. It's also unnecessary as the majority
>> of higher order allocations - THP and slab - are smaller than that.
>
> Well in my experience the kernel usually has hugetlbfs config-enabled so it
> uses 2MB pageblocks (on x86) even if hugetlbfs is unused at runtime and THP
> is used instead. But sure, we can set a better default that's not tied to
> hugetlbfs.

As virtio-mem really wants small pageblocks (hot(un)plug granularity),
I've seen reports from users without HUGETLB configured complaining
about this (on x86, we'd get 4M instead of 2M).

So having a better default (PMD_SIZE) sounds like a good idea to me (and
I even recall suggesting to change the !hugetlb default).

--
Thanks,

David / dhildenb

2023-04-19 11:24:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] mm: make pageblock_order 2M per default

On 19.04.23 02:01, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
>> pageblock_order can be of various sizes, depending on configuration,
>> but the default is MAX_ORDER-1.
>
> Note that MAX_ORDER got redefined in -mm tree recently.
>
>> Given 4k pages, that comes out to
>> 4M. This is a large chunk for the allocator/reclaim/compaction to try
>> to keep grouped per migratetype. It's also unnecessary as the majority
>> of higher order allocations - THP and slab - are smaller than that.
>
> This seems way too x86-specific. Other arches have larger THP sizes. I
> believe 16M is common.
>

arm64 with 64k pages has ... 512 MiB IIRC :/ It's the weird one.

> Maybe define it as min(MAX_ORDER, PMD_ORDER)?

Sounds good to me.

--
Thanks,

David / dhildenb

2023-04-21 12:28:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] mm: compaction: avoid GFP_NOFS deadlocks

On Tue, Apr 18, 2023 at 03:12:49PM -0400, Johannes Weiner wrote:
> During stress testing, two deadlock scenarios were observed:
>
> 1. One GFP_NOFS allocation was sleeping on too_many_isolated(), and
> all CPUs were busy with compactors that appeared to be spinning on
> buffer locks.
>
> Give GFP_NOFS compactors additional isolation headroom, the same
> way we do during reclaim, to eliminate this deadlock scenario.
>
> 2. In a more pernicious scenario, the GFP_NOFS allocation was
> busy-spinning in compaction, but seemingly never making
> progress. Upon closer inspection, memory was dominated by file
> pages, which the fs compactor isn't allowed to touch. The remaining
> anon pages didn't have the contiguity to satisfy the request.
>
> Allow GFP_NOFS allocations to bypass watermarks when compaction
> failed at the highest priority.
>
> While these deadlocks were encountered only in tests with the
> subsequent patches (which put a lot more demand on compaction), in
> theory these problems already exist in the code today. Fix them now.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Definitely needs to be split out.


> ---
> mm/compaction.c | 15 +++++++++++++--
> mm/page_alloc.c | 10 +++++++++-
> 2 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8238e83385a7..84db84e8fd3a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -745,8 +745,9 @@ isolate_freepages_range(struct compact_control *cc,
> }
>
> /* Similar to reclaim, but different enough that they don't share logic */
> -static bool too_many_isolated(pg_data_t *pgdat)
> +static bool too_many_isolated(struct compact_control *cc)
> {
> + pg_data_t *pgdat = cc->zone->zone_pgdat;
> bool too_many;
>
> unsigned long active, inactive, isolated;
> @@ -758,6 +759,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
> isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
> node_page_state(pgdat, NR_ISOLATED_ANON);
>
> + /*
> + * GFP_NOFS callers are allowed to isolate more pages, so they
> + * won't get blocked by normal direct-reclaimers, forming a
> + * circular deadlock. GFP_NOIO won't get here.
> + */
> + if (cc->gfp_mask & __GFP_FS) {
> + inactive >>= 3;
> + active >>= 3;
> + }
> +

This comment needs to explain why GFP_NOFS gets special treatment,
i.e. that a GFP_NOFS context may not be able to migrate pages, and why.

As a follow-up, if GFP_NOFS cannot deal with the majority of the
migration contexts then it should bail out of compaction entirely. The
changelog doesn't say why but maybe SYNC_LIGHT is the issue?

--
Mel Gorman
SUSE Labs

2023-04-21 12:39:28

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] mm: make pageblock_order 2M per default

On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.
>
> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <[email protected]>

This patch may be a distraction in the context of this series. I don't feel
particularly strongly about it but it has strong bikeshed potential. For
configurations that support huge pages of any sort, it should be PMD_ORDER,
for anything else the choice is arbitrary. 2M is as good a guess as
any, because even if it were tied to PAGE_ALLOC_COSTLY_ORDER, the
pageblock bitmap overhead might be annoying.

--
Mel Gorman
SUSE Labs

2023-04-21 12:40:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 04/26] mm: page_isolation: write proper kerneldoc

On Tue, Apr 18, 2023 at 03:12:51PM -0400, Johannes Weiner wrote:
> And remove the incorrect header comments.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Split out.

--
Mel Gorman
SUSE Labs

2023-04-21 12:51:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 05/26] mm: page_alloc: per-migratetype pcplist for THPs

On Tue, Apr 18, 2023 at 03:12:52PM -0400, Johannes Weiner wrote:
> Right now, there is only one pcplist for THP allocations. However,
> while most THPs are movable, the huge zero page is not. This means a
> movable THP allocation can grab an unmovable block from the pcplist,
> and a subsequent THP split, partial free, and reallocation of the
> remainder will mix movable and unmovable pages in the block.
>
> While this isn't a huge source of block pollution in practice, it
> happens often enough to trigger debug warnings fairly quickly under
> load. In the interest of tightening up pageblock hygiene, make the THP
> pcplists fully migratetype-aware, just like the lower order ones.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Split out :P

Take special care of this one because, while I didn't check this, I
suspect it'll push the PCP structure size into the next cache line and
increase overhead.

The changelog makes it unclear why exactly this happens or why the
patch fixes it. The huge zero page strips GFP_MOVABLE (so unmovable)
but at allocation time, it doesn't really matter what the movable type
is because it's a full pageblock. It doesn't appear to be a hazard until
the split happens. Assuming that's the case, it should be ok to always
set the pageblock movable for THP allocations regardless of GFP flags at
allocation time or else set the pageblock MOVABLE at THP split (always
MOVABLE at allocation time makes more sense).
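
Roughly something like this at allocation time, as a sketch only
(untested):

	/*
	 * Sketch: a THP-sized allocation takes a whole block anyway, so mark
	 * it MOVABLE regardless of the request's GFP flags; a later split
	 * then frees the remainder into a movable block.
	 */
	if (order >= pageblock_order)
		set_pageblock_migratetype(page, MIGRATE_MOVABLE);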

--
Mel Gorman
SUSE Labs

2023-04-21 12:55:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 06/26] mm: page_alloc: consolidate free page accounting

On Tue, Apr 18, 2023 at 03:12:53PM -0400, Johannes Weiner wrote:
> Free page accounting currently happens a bit too high up the call
> stack, where it has to deal with guard pages, compaction capturing,
> block stealing and even page isolation. This is subtle and fragile,
> and makes it difficult to hack on the code.
>
> Push the accounting down to where pages enter and leave the physical
> freelists, where all these higher-level exceptions are of no concern.
>
> Signed-off-by: Johannes Weiner <[email protected]>

I didn't look too closely at this one as I'm scanning through to see how
the overall series works and this is mostly a mechanical patch.
However, it definitely breaks build

> @@ -843,7 +843,7 @@ static int __init debug_guardpage_minorder_setup(char *buf)
> early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);
>
> static inline bool set_page_guard(struct zone *zone, struct page *page,
> - unsigned int order, int migratetype)
> + unsigned int order
> {
> if (!debug_guardpage_enabled())
> return false;

Here

--
Mel Gorman
SUSE Labs

2023-04-21 13:07:18

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 07/26] mm: page_alloc: move capture_control to the page allocator

On Tue, Apr 18, 2023 at 03:12:54PM -0400, Johannes Weiner wrote:
> Compaction capturing is really a component of the page allocator.
> Later patches will also disconnect allocation request size from the
> compaction size, so move the setup of the capturing parameters to the
> "request domain", i.e. the page allocator. No functional change.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2023-04-21 13:16:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 08/26] mm: page_alloc: claim blocks during compaction capturing

On Tue, Apr 18, 2023 at 03:12:55PM -0400, Johannes Weiner wrote:
> When capturing a whole block, update the migratetype accordingly. For
> example, a THP allocation might capture an unmovable block. If the THP
> gets split and partially freed later, the remainder should group up
> with movable allocations.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/internal.h | 1 +
> mm/page_alloc.c | 42 ++++++++++++++++++++++++------------------
> 2 files changed, 25 insertions(+), 18 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 024affd4e4b5..39f65a463631 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -432,6 +432,7 @@ struct compact_control {
> */
> struct capture_control {
> struct compact_control *cc;
> + int migratetype;
> struct page *page;
> };
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4d20513c83be..8e5996f8b4b4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -615,6 +615,17 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
> page_to_pfn(page), MIGRATETYPE_MASK);
> }
>
> +static void change_pageblock_range(struct page *pageblock_page,
> + int start_order, int migratetype)
> +{
> + int nr_pageblocks = 1 << (start_order - pageblock_order);
> +
> + while (nr_pageblocks--) {
> + set_pageblock_migratetype(pageblock_page, migratetype);
> + pageblock_page += pageblock_nr_pages;
> + }
> +}
> +
> #ifdef CONFIG_DEBUG_VM
> static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
> {
> @@ -962,14 +973,19 @@ compaction_capture(struct capture_control *capc, struct page *page,
> is_migrate_isolate(migratetype))
> return false;
>
> - /*
> - * Do not let lower order allocations pollute a movable pageblock.
> - * This might let an unmovable request use a reclaimable pageblock
> - * and vice-versa but no more than normal fallback logic which can
> - * have trouble finding a high-order free page.
> - */
> - if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
> + if (order >= pageblock_order) {
> + migratetype = capc->migratetype;
> + change_pageblock_range(page, order, migratetype);
> + } else if (migratetype == MIGRATE_MOVABLE) {
> + /*
> + * Do not let lower order allocations pollute a
> + * movable pageblock. This might let an unmovable
> + * request use a reclaimable pageblock and vice-versa
> + * but no more than normal fallback logic which can
> + * have trouble finding a high-order free page.
> + */
> return false;
> + }
>

For capturing pageblock order or larger, why not unconditionally make
the block MOVABLE? Even if it's a zero page allocation, it would be nice
to keep the pageblock for movable pages after the split as long as
possible.

--
Mel Gorman
SUSE Labs

2023-04-21 14:16:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 10/26] mm: page_alloc: allow compaction capturing from larger blocks

On Tue, Apr 18, 2023 at 03:12:57PM -0400, Johannes Weiner wrote:
> Currently, capturing only works on matching orders and matching
> migratetypes. However, if capturing is initially skipped on the
> migratetype, it's possible that merging continues up to a full
> pageblock, in which case the migratetype is up for grabs again.
>
> Allow capturing to grab smaller chunks from claimed pageblocks, and
> expand the remainder of the block back onto the freelists.
>
> Signed-off-by: Johannes Weiner <[email protected]>

No objections other than we're still in the preparation phase and the
series needs to be split. Out of curiosity, how often does this actually
trigger in practice? I ask because, superficially, I would expect capture to
happen while pages are being merged, and I'm not sure how much this actually
helps. If anything, the anomaly would be merging !MOVABLE types, capturing
one pageblock, and leaving the adjacent block eligible for splitting as
UNMOVABLE/RECLAIMABLE, which is not necessarily desirable.

I nagged about the splitting already but ideally there would be supporting
data for all the incremental improvements before a major modification is made
to fragmentation avoidance. That way, even if the fragmentation avoidance
changes have side-effects, the incremental changes stand alone.

--
Mel Gorman
SUSE Labs

2023-04-21 14:18:32

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] mm: compaction: avoid GFP_NOFS deadlocks

On Fri, Apr 21, 2023 at 01:27:43PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:12:49PM -0400, Johannes Weiner wrote:
> > During stress testing, two deadlock scenarios were observed:
> >
> > 1. One GFP_NOFS allocation was sleeping on too_many_isolated(), and
> > all CPUs were busy with compactors that appeared to be spinning on
> > buffer locks.
> >
> > Give GFP_NOFS compactors additional isolation headroom, the same
> > way we do during reclaim, to eliminate this deadlock scenario.
> >
> > 2. In a more pernicious scenario, the GFP_NOFS allocation was
> > busy-spinning in compaction, but seemingly never making
> > progress. Upon closer inspection, memory was dominated by file
> > pages, which the fs compactor isn't allowed to touch. The remaining
> > anon pages didn't have the contiguity to satisfy the request.
> >
> > Allow GFP_NOFS allocations to bypass watermarks when compaction
> > failed at the highest priority.
> >
> > While these deadlocks were encountered only in tests with the
> > subsequent patches (which put a lot more demand on compaction), in
> > theory these problems already exist in the code today. Fix them now.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Definitely needs to be split out.

Will do.

> > mm/compaction.c | 15 +++++++++++++--
> > mm/page_alloc.c | 10 +++++++++-
> > 2 files changed, 22 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 8238e83385a7..84db84e8fd3a 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -745,8 +745,9 @@ isolate_freepages_range(struct compact_control *cc,
> > }
> >
> > /* Similar to reclaim, but different enough that they don't share logic */
> > -static bool too_many_isolated(pg_data_t *pgdat)
> > +static bool too_many_isolated(struct compact_control *cc)
> > {
> > + pg_data_t *pgdat = cc->zone->zone_pgdat;
> > bool too_many;
> >
> > unsigned long active, inactive, isolated;
> > @@ -758,6 +759,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
> > isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
> > node_page_state(pgdat, NR_ISOLATED_ANON);
> >
> > + /*
> > + * GFP_NOFS callers are allowed to isolate more pages, so they
> > + * won't get blocked by normal direct-reclaimers, forming a
> > + * circular deadlock. GFP_NOIO won't get here.
> > + */
> > + if (cc->gfp_mask & __GFP_FS) {
> > + inactive >>= 3;
> > + active >>= 3;
> > + }
> > +
>
> This comment needs to explain why GFP_NOFS gets special treatment
> explaning that a GFP_NOFS context may not be able to migrate pages and
> why.

Fair point, I'll expand on that.

> As a follow-up, if GFP_NOFS cannot deal with the majority of the
> migration contexts then it should bail out of compaction entirely. The
> changelog doesn't say why but maybe SYNC_LIGHT is the issue?

It's this condition in isolate_migratepages_block():

	/*
	 * Only allow to migrate anonymous pages in GFP_NOFS context
	 * because those do not depend on fs locks.
	 */
	if (!(cc->gfp_mask & __GFP_FS) && mapping)
		goto isolate_fail_put;

In terms of bailing even earlier: We do have per-zone file and anon
counts that could be consulted. However, the real problem is
interleaving of anon and file. Even if only 10% of the zone is anon,
it could still be worth trying to compact if they're relatively
contiguous. OTOH 50% anon could be uncompactable if every block also
contains at least one file. We don't know until we actually scan. I'm
hesitant to give allocations premature access to the last reserves.

What might work is for NOFS contexts to test up front whether anon is
low, and shortcut directly to the highest priority (SYNC_FULL): at
least one good-faith scan attempt before touching the reserves.
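
Roughly like this, as a sketch (anon_low_threshold is made up, and the
exact hookup into the allocator slowpath would need more thought):

	/*
	 * Sketch: if a GFP_NOFS request sees very little anon memory in the
	 * zone, skip the lower compaction priorities and make one full sync
	 * attempt before we consider dipping into the reserves.
	 */
	unsigned long anon = zone_page_state(zone, NR_ZONE_ACTIVE_ANON) +
			     zone_page_state(zone, NR_ZONE_INACTIVE_ANON);

	if (!(gfp_mask & __GFP_FS) && anon < anon_low_threshold)
		compact_priority = MIN_COMPACT_PRIORITY; /* SYNC_FULL */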

2023-04-21 14:29:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 11/26] mm: page_alloc: introduce MIGRATE_FREE

On Tue, Apr 18, 2023 at 03:12:58PM -0400, Johannes Weiner wrote:
> To cut down on type mixing, put empty pageblocks on separate freelists
> and make them the first fallback preference before stealing space from
> incompatible blocks.
>
> The neutral block designation will also be handy in subsequent patches
> that: simplify compaction; add per-mt freelist counts and make
> compaction_suitable() more precise; and ultimately make pageblocks the
> basis of free memory management.
>

This patch is a line in the sand for the series. Patches 1-10 can stand
alone with supporting data because this is the first major change that
has a material impact on fragmentation avoidance and its overhead.

Maybe there is something in the later patches that makes the need for this
patch more obvious, but putting the empty pageblocks on separate freelists
is not that helpful in itself. The main problem is that __rmqueue() starts
with __rmqueue_smallest(), which is probably fine for huge pages because it
searches free pageblocks first, but not for SLUB high-order allocations,
because __rmqueue_smallest() for orders < pageblock_order encourages mixing.
Obviously it would also not be fine for contiguous page allocations for page
cache or anything else that is planned. If nothing else, this patch
highlights that fragmentation avoidance was originally focused on huge
pages, which was fine at the time but not any longer.

The need for a MIGRATE_FREE type could potentially be avoided by having
__rmqueue() start with __rmqueue_smallest(order == pageblock_order) to
encourage full-block usage before mixing.

--
Mel Gorman
SUSE Labs

2023-04-21 14:30:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 12/26] mm: page_alloc: per-migratetype free counts

On Tue, Apr 18, 2023 at 03:12:59PM -0400, Johannes Weiner wrote:
> Increase visibility into the defragmentation behavior by tracking and
> reporting per-migratetype free counters.
>
> Subsequent patches will also use those counters to make more targeted
> reclaim/compaction decisions.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Visibility into fragmentation behaviour is information that is
almost certainly only useful to a developer and even then, there is
/proc/pagetypeinfo. At minimum, move this patch to later in the series
but I'm skeptical about its benefit.

--
Mel Gorman
SUSE Labs

2023-04-21 14:38:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 14/26] mm: compaction: simplify should_compact_retry()

On Tue, Apr 18, 2023 at 03:13:01PM -0400, Johannes Weiner wrote:
> The different branches for retry are unnecessarily complicated. There
> is really only three outcomes: progress, skipped, failed. Also, the
> retry counter only applies to loops that made progress, move it there.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/page_alloc.c | 60 +++++++++++++++++--------------------------------
> 1 file changed, 20 insertions(+), 40 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c3b7dc479936..18fa2bbba44b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4608,7 +4608,6 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> enum compact_priority *compact_priority,
> int *compaction_retries)
> {
> - int max_retries = MAX_COMPACT_RETRIES;
> int min_priority;
> bool ret = false;
> int retries = *compaction_retries;

I think this breaks the build because of trace_compact_retry:
max_retries is now declared in a different scope than where it's used
in the next hunk.

Again, move this to a preparation series. I didn't actually think about
this patch at all because I'm trying to reach the main purpose of the series
and it's now late on a Friday so I'll probably fail or forget by Monday.

--
Mel Gorman
SUSE Labs

2023-04-21 14:39:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 13/26] mm: compaction: remove compaction result helpers

On Tue, Apr 18, 2023 at 03:13:00PM -0400, Johannes Weiner wrote:
> I found myself repeatedly looking up the implementation of these
> helpers while working on the code, which suggests they are not a
> helpful abstraction. Inline them.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Not massively convinced one way or the other. Still, it probably deserves
to join the end of the preparation patches before MIGRATE_FREE is
introduced, in the interest of splitting the series into digestible
pieces.

--
Mel Gorman
SUSE Labs

2023-04-21 15:04:36

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 15/26] mm: compaction: simplify free block check in suitable_migration_target()

On Tue, Apr 18, 2023 at 03:13:02PM -0400, Johannes Weiner wrote:
> Free page blocks are now marked MIGRATE_FREE. Consult that directly.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Justification for the patch depends on whether MIGRATE_FREE is itself
necessary. Even then, it's a micro-optimisation, but maybe it matters
more later.

--
Mel Gorman
SUSE Labs

2023-04-21 15:05:52

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 25/26] mm: page_alloc: disallow fallbacks when 2M defrag is enabled

On Tue, Apr 18, 2023 at 03:13:12PM -0400, Johannes Weiner wrote:
> Fallbacks are already unlikely due to watermarks being enforced
> against MIGRATE_FREE blocks. Eliminate them altogether. This allows
> compaction to look exclusively at movable blocks, reducing the number
> of pageblocks it needs to scan on an ongoing basis.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Conceptually this could be fun if a GFP_NOFS allocation cannot migrate
enough memory to free one pageblock and there are no pageblocks
available of the correct migratetype. Fallbacks might be unlikely, but
never being able to fall back is a livelock risk, no?

--
Mel Gorman
SUSE Labs

2023-04-21 15:10:20

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 05/26] mm: page_alloc: per-migratetype pcplist for THPs

On Fri, Apr 21, 2023 at 01:47:44PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:12:52PM -0400, Johannes Weiner wrote:
> > Right now, there is only one pcplist for THP allocations. However,
> > while most THPs are movable, the huge zero page is not. This means a
> > movable THP allocation can grab an unmovable block from the pcplist,
> > and a subsequent THP split, partial free, and reallocation of the
> > remainder will mix movable and unmovable pages in the block.
> >
> > While this isn't a huge source of block pollution in practice, it
> > happens often enough to trigger debug warnings fairly quickly under
> > load. In the interest of tightening up pageblock hygiene, make the THP
> > pcplists fully migratetype-aware, just like the lower order ones.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Split out :P
>
> Take special care of this one because, while I didn't check this, I
> suspect it'll push the PCP structure size into the next cache line and
> increase overhead.
>
> The changelog makes it unclear why exactly this happens or why the
> patch fixes it.

Before this, I'd see warnings from the last patch in the series about
received migratetype not matching requested mt.

The way it happens is that the zero page gets freed and the unmovable
block put on the pcplist. A regular THP allocation is subsequently
served from an unmovable block.

Mental note, I think this can happen the other way around too: a
regular THP on the pcp being served to a MIGRATE_UNMOVABLE zero
THP. It's not supposed to, but it looks like there is a bug in the
code that's meant to prevent that from happening in rmqueue():

	if (likely(pcp_allowed_order(order))) {
		/*
		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
		 * we need to skip it when CMA area isn't allowed.
		 */
		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
		    migratetype != MIGRATE_MOVABLE) {
			page = rmqueue_pcplist(preferred_zone, zone, order,
					migratetype, alloc_flags);
			if (likely(page))
				goto out;
		}
	}

Surely that last condition should be migratetype == MIGRATE_MOVABLE?

> The huge zero page strips GFP_MOVABLE (so unmovable)
> but at allocation time, it doesn't really matter what the movable type
> is because it's a full pageblock. It doesn't appear to be a hazard until
> the split happens. Assuming that's the case, it should be ok to always
> set the pageblock movable for THP allocations regardless of GFP flags at
> allocation time or else set the pageblock MOVABLE at THP split (always
> MOVABLE at allocation time makes more sense).

The regular allocator compaction skips over compound pages anyway, so
the migratetype should indeed not matter there.

The bigger issue is CMA. alloc_contig_range() will try to move THPs to
free a larger range. We have to be careful not to place an unmovable
zero THP into a CMA region. That means we cannot play games with the
migratetype - we really do have to physically keep unmovable and
movable THPs apart.

Another option would be not to use pcp for the zero THP. It's cached
anyway in the caller. But it would add branches to the THP alloc and
free fast paths (pcp_allowed_order() also checking migratetype).
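
For illustration, something in this direction (sketch only; the helper
name and the exact check are made up):

	/*
	 * Sketch: keep THP-order requests off the pcplists unless they are
	 * movable, so the huge zero page never shares a pcp batch with
	 * regular (movable) THPs.
	 */
	static inline bool pcp_allowed(unsigned int order, int migratetype)
	{
		if (order <= PAGE_ALLOC_COSTLY_ORDER)
			return true;
		return order == pageblock_order && migratetype == MIGRATE_MOVABLE;
	}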

2023-04-21 15:16:46

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 06/26] mm: page_alloc: consolidate free page accounting

On Fri, Apr 21, 2023 at 01:54:53PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:12:53PM -0400, Johannes Weiner wrote:
> > Free page accounting currently happens a bit too high up the call
> > stack, where it has to deal with guard pages, compaction capturing,
> > block stealing and even page isolation. This is subtle and fragile,
> > and makes it difficult to hack on the code.
> >
> > Push the accounting down to where pages enter and leave the physical
> > freelists, where all these higher-level exceptions are of no concern.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> I didn't look too closely at this one as I'm scanning through to see how
> the overall series works and this is mostly a mechanical patch.
> However, it definitely breaks build
>
> > @@ -843,7 +843,7 @@ static int __init debug_guardpage_minorder_setup(char *buf)
> > early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);
> >
> > static inline bool set_page_guard(struct zone *zone, struct page *page,
> > - unsigned int order, int migratetype)
> > + unsigned int order
> > {
> > if (!debug_guardpage_enabled())
> > return false;

Oops, this is under a config I didn't test. Will fix. Thanks.

2023-04-21 15:38:19

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 25/26] mm: page_alloc: disallow fallbacks when 2M defrag is enabled

On Fri, Apr 21, 2023 at 03:56:57PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:13:12PM -0400, Johannes Weiner wrote:
> > Fallbacks are already unlikely due to watermarks being enforced
> > against MIGRATE_FREE blocks. Eliminate them altogether. This allows
> > compaction to look exclusively at movable blocks, reducing the number
> > of pageblocks it needs to scan on an ongoing basis.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Conceptually this could be fun if a GFP_NOFS allocation cannot migrate
> enough memory to free one pageblock and there are no pageblocks
> available of the correct migratetype. Fallbacks might be unlikely but
> never being able to fallback is a livelock risk, no?

The reserves below the watermarks are maintained in neutral
MIGRATE_FREE blocks. So just like today, critical/limited allocation
contexts are ensured forward progress as long as there are reserves.

An argument could be made that because smaller orders type-claim the
entire neutral block on allocation now, the reserves can deplete
faster than they do today given comparable watermarks. I haven't run
into an issue during testing that would have me raise the reserves by
a factor of NR_MIGRATETYPES, but it's something to keep an eye out for.

2023-04-21 15:43:40

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 12/26] mm: page_alloc: per-migratetype free counts

On Fri, Apr 21, 2023 at 03:28:41PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:12:59PM -0400, Johannes Weiner wrote:
> > Increase visibility into the defragmentation behavior by tracking and
> > reporting per-migratetype free counters.
> >
> > Subsequent patches will also use those counters to make more targeted
> > reclaim/compaction decisions.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Visibility into fragmentation behaviour is information that is
> almost certainly only useful to a developer and even then, there is
> /proc/pagetypeinfo. At minimum, move this patch to later in the series
> but I'm skeptical about its benefit.

Having them available in the memory dump (OOM, sysrq) was essential
while debugging problems in later patches. For OOMs or lockups,
pagetypeinfo isn't available. It would be useful to have them included
in user reports if any issues pop up.

They're used internally in several places later on, too.

I'll expand on the changelog and move them ahead in the series.

Thanks

2023-04-21 15:57:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 25/26] mm: page_alloc: disallow fallbacks when 2M defrag is enabled

On Fri, Apr 21, 2023 at 11:24:48AM -0400, Johannes Weiner wrote:
> On Fri, Apr 21, 2023 at 03:56:57PM +0100, Mel Gorman wrote:
> > On Tue, Apr 18, 2023 at 03:13:12PM -0400, Johannes Weiner wrote:
> > > Fallbacks are already unlikely due to watermarks being enforced
> > > against MIGRATE_FREE blocks. Eliminate them altogether. This allows
> > > compaction to look exclusively at movable blocks, reducing the number
> > > of pageblocks it needs to scan on an ongoing basis.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> >
> > Conceptually this could be fun if a GFP_NOFS allocation cannot migrate
> > enough memory to free one pageblock and there are no pageblocks
> > available of the correct migratetype. Fallbacks might be unlikely but
> > never being able to fallback is a livelock risk, no?
>
> The reserves below the watermarks are maintained in neutral
> MIGRATE_FREE blocks. So just like today, critical/limited allocation
> contexts are ensured forward progress as long as there are reserves.
>
> An argument could be made that because smaller orders type-claim the
> entire neutral block on allocation now, the reserves can deplete
> faster than they do today given comparable watermarks. I haven't run
> into this issue during testing that would have me raise the reserves
> by a factor of NR_MIGRATETYPES. But something to keep an eye out for.

Also bear in mind low-memory systems - not even the embedded ones, just
things like a cheap laptop with <= 1G. I haven't consulted the code to
think this through, but forcing watermarks to be in multiples of a
pageblock could get trickier the smaller the memory size is, and it
would have to be considered what happens when min_free_kbytes for a
zone is smaller than pageblock*MIGRATE_TYPES. It's possible it's
protected, or could be protected, by page_group_by_mobility_disabled,
but that only applies to very small systems.
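
As a rough illustration (using the usual min_free_kbytes heuristic of
about 4*sqrt(lowmem_kbytes)): a 1G machine ends up with roughly
4*sqrt(1048576) = 4096 kB, i.e. ~4MB of reserves, while one 2MB
pageblock per migratetype is already ~12MB with six types, more with
MIGRATE_FREE added. So the small-memory case does need some explicit
handling.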

--
Mel Gorman
SUSE Labs

2023-04-21 16:10:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 12/26] mm: page_alloc: per-migratetype free counts

On Fri, Apr 21, 2023 at 11:35:01AM -0400, Johannes Weiner wrote:
> On Fri, Apr 21, 2023 at 03:28:41PM +0100, Mel Gorman wrote:
> > On Tue, Apr 18, 2023 at 03:12:59PM -0400, Johannes Weiner wrote:
> > > Increase visibility into the defragmentation behavior by tracking and
> > > reporting per-migratetype free counters.
> > >
> > > Subsequent patches will also use those counters to make more targeted
> > > reclaim/compaction decisions.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> >
> > Visibility into fragmentation behaviour is information that is
> > almost certainly only useful to a developer and even then, there is
> > /proc/pagetypeinfo. At minimum, move this patch to later in the series
> > but I'm skeptical about its benefit.
>
> Having them available in the memory dump (OOM, sysrq) was essential
> while debugging problems in later patches. For OOMs or lockups,
> pagetypeinfo isn't available. It would be useful to have them included
> in user reports if any issues pop up.
>

OOM+sysrq could optionally take the very expensive step of traversing
the lists to get the counts, so yes, it helps debugging, but it's not
necessarily critical.

> They're used internally in several places later on, too.
>

I did see that for deciding the suitability for compaction. Minimally, put
the patches adjacent in the series and later if possible so that the series
can be taken in parts. There are a lot of patches that should be relatively
uncontroversial so maybe make "mm: page_alloc: introduce MIGRATE_FREE" the
pivot point between incremental improvements and "everything on and after
this patch is relatively high risk, could excessively compact/reclaim,
could livelock etc".

--
Mel Gorman
SUSE Labs

2023-04-21 16:13:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] mm: reliable huge page allocator

On Wed, Apr 19, 2023 at 05:11:45AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> > This series proposes to make THP allocations reliable by enforcing
> > pageblock hygiene, and aligning the allocator, reclaim and compaction
> > on the pageblock as the base unit for managing free memory. All orders
> > up to and including the pageblock are made first-class requests that
> > (outside of OOM situations) are expected to succeed without
> > exceptional investment by the allocating thread.
> >
> > A neutral pageblock type is introduced, MIGRATE_FREE. The first
> > allocation to be placed into such a block claims it exclusively for
> > the allocation's migratetype. Fallbacks from a different type are no
> > longer allowed, and the block is "kept open" for more allocations of
> > the same type to ensure tight grouping. A pageblock becomes neutral
> > again only once all its pages have been freed.
>
> YES! This is exactly what I've been thinking is the right solution
> for some time. Thank you for doing it.
>

It was considered once upon a time and comes up every so often as variants
of a "sticky" pageblock bit that prevents mixing. The risk was ending up
in a context where memory within a suitable pageblock cannot be freed and
all of the available MOVABLE pageblocks have at least one pinned page that
cannot be migrated from the allocating context. It can also potentially
hit a case where the majority of memory is UNMOVABLE pageblocks, each of
which has a single pagetable page that cannot be freed without an OOM
kill. Variants of issues like this would manifest as an "OOM kill with
plenty of memory free" bug, or excessive CPU usage on reclaim or
compaction.

It doesn't kill the idea of the series at all, but it puts a lot of
emphasis on splitting the series by low-risk and high-risk. Maybe to the
extent that the absolute protection against mixing can be broken in OOM
situations, via the kernel command line, or via a sysctl.

--
Mel Gorman
SUSE Labs

2023-04-21 16:36:17

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 12/26] mm: page_alloc: per-migratetype free counts

On Fri, Apr 21, 2023 at 05:03:20PM +0100, Mel Gorman wrote:
> On Fri, Apr 21, 2023 at 11:35:01AM -0400, Johannes Weiner wrote:
> > On Fri, Apr 21, 2023 at 03:28:41PM +0100, Mel Gorman wrote:
> > > On Tue, Apr 18, 2023 at 03:12:59PM -0400, Johannes Weiner wrote:
> > > > Increase visibility into the defragmentation behavior by tracking and
> > > > reporting per-migratetype free counters.
> > > >
> > > > Subsequent patches will also use those counters to make more targeted
> > > > reclaim/compaction decisions.
> > > >
> > > > Signed-off-by: Johannes Weiner <[email protected]>
> > >
> > > Visibility into fragmentation behaviour is information that is
> > > almost certainly only useful to a developer and even then, there is
> > > /proc/pagetypeinfo. At minimum, move this patch to later in the series
> > > but I'm skeptical about its benefit.
> >
> > Having them available in the memory dump (OOM, sysrq) was essential
> > while debugging problems in later patches. For OOMs or lockups,
> > pagetypeinfo isn't available. It would be useful to have them included
> > in user reports if any issues pop up.
> >
>
> OOM+sysrq could optionally take the very expensive step of traversing the
> lists to get the count so yes, it helps debugging, but not necessarily
> critical.
>
> > They're used internally in several places later on, too.
> >
>
> I did see that for deciding the suitability for compaction. Minimally, put
> the patches adjacent in the series and later if possible so that the series
> can be taken in parts. There are a lot of patches that should be relatively
> uncontroversial so maybe make "mm: page_alloc: introduce MIGRATE_FREE" the
> pivot point between incremental improvements and "everything on and after
> this patch is relatively high risk, could excessively compact/reclaim,
> could livelock etc".

Okay, I see now where you're coming from. That's good feedback.

Actually most of the patches work toward the final goal of managing
free memory in whole blocks. The only exceptions are the block pages,
the nofs deadlock, the page_isolation kernel doc, and *maybe* the
should_[compact|reclaim]_retry cleanups. I tried to find the
standalone value in each of the prep patches as well to avoid
forward-referencing in the series too much. But obviously these
standalone reasons tend to be on the weak side.

I'll rework the changelogs (and patch ordering) where applicable to
try to make the dependencies clearer.

2023-04-21 17:30:56

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] mm: reliable huge page allocator

On Fri, Apr 21, 2023 at 05:11:56PM +0100, Mel Gorman wrote:
> It was considered once upon a time and comes up every so often as variants
> of a "sticky" pageblock bit that prevents mixing. The risk was ending up
> in a context where memory within a suitable pageblock cannot be freed
> and all of the available MOVABLE pageblocks have at least one pinned
> page that cannot be migrated from the allocating context. It can also
> potentially hit a case where the majority of memory is UNMOVABLE pageblocks,
> each of which has a single pagetable page that cannot be freed without an
> OOM kill. Variants of issues like this would manifest as an "OOM kill with
> plenty of memory free" bug, or excessive CPU usage on reclaim or compaction.
>
> It doesn't kill the idea of the series at all, but it puts a lot of emphasis
> on splitting the series by low-risk and high-risk. Maybe to the extent that
> the absolute protection against mixing can be broken in OOM situations,
> via the kernel command line, or via a sysctl.

Has a variant been previously considered where MOVABLE allocations are
allowed to come from UNMOVABLE blocks? After all, MOVABLE allocations
are generally, well, movable. So an UNMOVABLE allocation could try to
migrate pages from a MIXED pageblock in order to turn the MIXED pageblock
back into an UNMOVABLE pageblock.

This might work better in practice because GFP_NOFS allocations tend
to also be MOVABLE, so allowing them to take up some of the UNMOVABLE
space temporarily feels like a get-out-of-OOM card.

(I've resisted talking about plans to make page table pages movable
because I don't think that's your point; that's just an example of a
currently-unmovable allocation, right?)

I mention this in part because on my laptop, ZONE_DMA is almost unused:

Node 0, zone DMA 0 0 0 0 0 0 0 0 1 2 2
Node 0, zone DMA32 1685 1345 1152 554 424 212 104 40 2 0 0
Node 0, zone Normal 6959 3530 1893 1862 629 483 107 10 0 0 0

That's 2 order-10 (=8MB), 2 order-9 (=4MB) and 1 order-8 (=1MB) for a
total of 13MB of memory. That's insignificant to a 16GB laptop, but on
smaller machines, it might be worth allowing MOVABLE allocations to come
from ZONE_DMA on the grounds that they can be easily freed if anybody
ever needs to allocate from ZONE_DMA.
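
For reference, the arithmetic above works out as follows with 4KiB base
pages; this is a standalone back-of-the-envelope sketch (not kernel code),
and the array is just the ZONE_DMA line quoted above:

#include <stdio.h>

int main(void)
{
	/* free pages per order in ZONE_DMA, orders 0..10, from the listing above */
	unsigned long free_per_order[] = { 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2 };
	unsigned long kib = 0;

	for (int order = 0; order <= 10; order++)
		kib += free_per_order[order] * (4UL << order);	/* 4KiB pages */

	/* prints 13312 KiB, i.e. the 13MB figure above */
	printf("%lu KiB (%lu MB) free\n", kib, kib / 1024);
	return 0;
}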

2023-04-25 01:04:42

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH 14/26] mm: compaction: simplify should_compact_retry()

Johannes Weiner <[email protected]> writes:

> The different branches for retry are unnecessarily complicated. There
> are really only three outcomes: progress, skipped, failed. Also, the
> retry counter only applies to loops that made progress, so move it there.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/page_alloc.c | 60 +++++++++++++++++--------------------------------
> 1 file changed, 20 insertions(+), 40 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c3b7dc479936..18fa2bbba44b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4608,7 +4608,6 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> enum compact_priority *compact_priority,
> int *compaction_retries)
> {
> - int max_retries = MAX_COMPACT_RETRIES;
> int min_priority;
> bool ret = false;
> int retries = *compaction_retries;
> @@ -4621,19 +4620,27 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> return false;
>
> /*
> - * Compaction managed to coalesce some page blocks, but the
> - * allocation failed presumably due to a race. Retry some.
> + * Compaction coalesced some page blocks, but the allocation
> + * failed, presumably due to a race. Retry a few times.
> */
> - if (compact_result == COMPACT_SUCCESS)
> - (*compaction_retries)++;
> + if (compact_result == COMPACT_SUCCESS) {
> + int max_retries = MAX_COMPACT_RETRIES;
>
> - /*
> - * All zones were scanned completely and still no result. It
> - * doesn't really make much sense to retry except when the
> - * failure could be caused by insufficient priority
> - */
> - if (compact_result == COMPACT_COMPLETE)
> - goto check_priority;
> + /*
> + * !costly requests are much more important than
> + * __GFP_RETRY_MAYFAIL costly ones because they are de
> + * facto nofail and invoke OOM killer to move on while
> + * costly can fail and users are ready to cope with
> + * that. 1/4 retries is rather arbitrary but we would
> + * need much more detailed feedback from compaction to
> + * make a better decision.
> + */
> + if (order > PAGE_ALLOC_COSTLY_ORDER)
> + max_retries /= 4;
> +
> + ret = ++(*compaction_retries) <= MAX_COMPACT_RETRIES;
~~~~~~~~~~~~~~~~~~~

Should be max_retries?

Best Regards,
Huang, Ying

> + goto out;
> + }
>
> /*
> * Compaction was skipped due to a lack of free order-0
> @@ -4645,35 +4652,8 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> }
>
> /*
> - * If compaction backed due to being deferred, due to
> - * contended locks in async mode, or due to scanners meeting
> - * after a partial scan, retry with increased priority.
> - */
> - if (compact_result == COMPACT_DEFERRED ||
> - compact_result == COMPACT_CONTENDED ||
> - compact_result == COMPACT_PARTIAL_SKIPPED)
> - goto check_priority;
> -
> - /*
> - * !costly requests are much more important than __GFP_RETRY_MAYFAIL
> - * costly ones because they are de facto nofail and invoke OOM
> - * killer to move on while costly can fail and users are ready
> - * to cope with that. 1/4 retries is rather arbitrary but we
> - * would need much more detailed feedback from compaction to
> - * make a better decision.
> - */
> - if (order > PAGE_ALLOC_COSTLY_ORDER)
> - max_retries /= 4;
> - if (*compaction_retries <= max_retries) {
> - ret = true;
> - goto out;
> - }
> -
> - /*
> - * Make sure there are attempts at the highest priority if we exhausted
> - * all retries or failed at the lower priorities.
> + * Compaction failed. Retry with increasing priority.
> */
> -check_priority:
> min_priority = (order > PAGE_ALLOC_COSTLY_ORDER) ?
> MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;

2023-04-25 02:38:26

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 14/26] mm: compaction: simplify should_compact_retry()

On Fri, Apr 21, 2023 at 03:36:54PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:13:01PM -0400, Johannes Weiner wrote:
> > The different branches for retry are unnecessarily complicated. There
> > are really only three outcomes: progress, skipped, failed. Also, the
> > retry counter only applies to loops that made progress, so move it there.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/page_alloc.c | 60 +++++++++++++++++--------------------------------
> > 1 file changed, 20 insertions(+), 40 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c3b7dc479936..18fa2bbba44b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4608,7 +4608,6 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> > enum compact_priority *compact_priority,
> > int *compaction_retries)
> > {
> > - int max_retries = MAX_COMPACT_RETRIES;
> > int min_priority;
> > bool ret = false;
> > int retries = *compaction_retries;
>
> Think this breaks build because of trace_compact_retry and max_retries is
> declared in a different scope on the next hunk.

Right you are. Will fix. Thanks!


2023-04-25 02:45:38

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 14/26] mm: compaction: simplify should_compact_retry()

On Tue, Apr 25, 2023 at 08:56:47AM +0800, Huang, Ying wrote:
> Johannes Weiner <[email protected]> writes:
>
> > The different branches for retry are unnecessarily complicated. There
> > are really only three outcomes: progress, skipped, failed. Also, the
> > retry counter only applies to loops that made progress, so move it there.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/page_alloc.c | 60 +++++++++++++++++--------------------------------
> > 1 file changed, 20 insertions(+), 40 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c3b7dc479936..18fa2bbba44b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4608,7 +4608,6 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> > enum compact_priority *compact_priority,
> > int *compaction_retries)
> > {
> > - int max_retries = MAX_COMPACT_RETRIES;
> > int min_priority;
> > bool ret = false;
> > int retries = *compaction_retries;
> > @@ -4621,19 +4620,27 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> > return false;
> >
> > /*
> > - * Compaction managed to coalesce some page blocks, but the
> > - * allocation failed presumably due to a race. Retry some.
> > + * Compaction coalesced some page blocks, but the allocation
> > + * failed, presumably due to a race. Retry a few times.
> > */
> > - if (compact_result == COMPACT_SUCCESS)
> > - (*compaction_retries)++;
> > + if (compact_result == COMPACT_SUCCESS) {
> > + int max_retries = MAX_COMPACT_RETRIES;
> >
> > - /*
> > - * All zones were scanned completely and still no result. It
> > - * doesn't really make much sense to retry except when the
> > - * failure could be caused by insufficient priority
> > - */
> > - if (compact_result == COMPACT_COMPLETE)
> > - goto check_priority;
> > + /*
> > + * !costly requests are much more important than
> > + * __GFP_RETRY_MAYFAIL costly ones because they are de
> > + * facto nofail and invoke OOM killer to move on while
> > + * costly can fail and users are ready to cope with
> > + * that. 1/4 retries is rather arbitrary but we would
> > + * need much more detailed feedback from compaction to
> > + * make a better decision.
> > + */
> > + if (order > PAGE_ALLOC_COSTLY_ORDER)
> > + max_retries /= 4;
> > +
> > + ret = ++(*compaction_retries) <= MAX_COMPACT_RETRIES;
> ~~~~~~~~~~~~~~~~~~~
>
> Should be max_retries?

Good catch. max_retries is deleted in a later patch, but this one
should be fixed regardless. Thanks, I will correct it.
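
For clarity, the agreed fix is to compare against the locally scaled
limit. A standalone model of the intended logic (the function name and
the driver in main() are made up for the sketch; only the constants
mirror the kernel):

#include <stdbool.h>
#include <stdio.h>

#define MAX_COMPACT_RETRIES	16	/* mirrors the kernel constant */
#define PAGE_ALLOC_COSTLY_ORDER	3

/* model of the COMPACT_SUCCESS branch: costly orders get 1/4 of the retries */
static bool should_retry_after_progress(unsigned int order, int *compaction_retries)
{
	int max_retries = MAX_COMPACT_RETRIES;

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		max_retries /= 4;

	/* compare against the scaled local limit, not MAX_COMPACT_RETRIES */
	return ++(*compaction_retries) <= max_retries;
}

int main(void)
{
	int retries = 0;

	while (should_retry_after_progress(9, &retries))
		;
	printf("costly order: stopped once the counter reached %d (limit %d)\n",
	       retries, MAX_COMPACT_RETRIES / 4);
	return 0;
}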

2023-04-25 03:20:10

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] mm: vmscan: use compaction_suitable() check in kswapd

Johannes Weiner <[email protected]> writes:

> Kswapd currently bails on higher-order allocations with an open-coded
> check for whether it's reclaimed the compaction gap.
>
> compaction_suitable() is the customary interface to coordinate reclaim
> with compaction.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/vmscan.c | 67 ++++++++++++++++++-----------------------------------
> 1 file changed, 23 insertions(+), 44 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ee8c8ca2e7b5..723705b9e4d9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6872,12 +6872,18 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> if (!managed_zone(zone))
> continue;
>
> + /* Allocation can succeed in any zone, done */
> if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> mark = wmark_pages(zone, WMARK_PROMO);
> else
> mark = high_wmark_pages(zone);
> if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
> return true;
> +
> + /* Allocation can't succeed, but enough order-0 to compact */
> + if (compaction_suitable(zone, order,
> + highest_zoneidx) == COMPACT_CONTINUE)
> + return true;

Should we check the following first?

order > 0 && zone_watermark_ok_safe(zone, 0, mark, highest_zoneidx)

Best Regards,
Huang, Ying

> }
>
> /*
> @@ -6968,16 +6974,6 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
> */
> shrink_node(pgdat, sc);
>
> - /*
> - * Fragmentation may mean that the system cannot be rebalanced for
> - * high-order allocations. If twice the allocation size has been
> - * reclaimed then recheck watermarks only at order-0 to prevent
> - * excessive reclaim. Assume that a process requested a high-order
> - * can direct reclaim/compact.
> - */
> - if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> - sc->order = 0;
> -
> return sc->nr_scanned >= sc->nr_to_reclaim;
> }
>
> @@ -7018,15 +7014,13 @@ clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx)
> * that are eligible for use by the caller until at least one zone is
> * balanced.
> *
> - * Returns the order kswapd finished reclaiming at.
> - *
> * kswapd scans the zones in the highmem->normal->dma direction. It skips
> * zones which have free_pages > high_wmark_pages(zone), but once a zone is
> * found to have free_pages <= high_wmark_pages(zone), any page in that zone
> * or lower is eligible for reclaim until at least one usable zone is
> * balanced.
> */
> -static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> +static void balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> {
> int i;
> unsigned long nr_soft_reclaimed;
> @@ -7226,14 +7220,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> __fs_reclaim_release(_THIS_IP_);
> psi_memstall_leave(&pflags);
> set_task_reclaim_state(current, NULL);
> -
> - /*
> - * Return the order kswapd stopped reclaiming at as
> - * prepare_kswapd_sleep() takes it into account. If another caller
> - * entered the allocator slow path while kswapd was awake, order will
> - * remain at the higher level.
> - */
> - return sc.order;
> }
>
> /*
> @@ -7251,7 +7237,7 @@ static enum zone_type kswapd_highest_zoneidx(pg_data_t *pgdat,
> return curr_idx == MAX_NR_ZONES ? prev_highest_zoneidx : curr_idx;
> }
>
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
> +static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
> unsigned int highest_zoneidx)
> {
> long remaining = 0;
> @@ -7269,7 +7255,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
> * eligible zone balanced that it's also unlikely that compaction will
> * succeed.
> */
> - if (prepare_kswapd_sleep(pgdat, reclaim_order, highest_zoneidx)) {
> + if (prepare_kswapd_sleep(pgdat, order, highest_zoneidx)) {
> /*
> * Compaction records what page blocks it recently failed to
> * isolate pages from and skips them in the future scanning.
> @@ -7282,7 +7268,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
> * We have freed the memory, now we should compact it to make
> * allocation of the requested order possible.
> */
> - wakeup_kcompactd(pgdat, alloc_order, highest_zoneidx);
> + wakeup_kcompactd(pgdat, order, highest_zoneidx);
>
> remaining = schedule_timeout(HZ/10);
>
> @@ -7296,8 +7282,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
> kswapd_highest_zoneidx(pgdat,
> highest_zoneidx));
>
> - if (READ_ONCE(pgdat->kswapd_order) < reclaim_order)
> - WRITE_ONCE(pgdat->kswapd_order, reclaim_order);
> + if (READ_ONCE(pgdat->kswapd_order) < order)
> + WRITE_ONCE(pgdat->kswapd_order, order);
> }
>
> finish_wait(&pgdat->kswapd_wait, &wait);
> @@ -7308,8 +7294,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
> * After a short sleep, check if it was a premature sleep. If not, then
> * go fully to sleep until explicitly woken up.
> */
> - if (!remaining &&
> - prepare_kswapd_sleep(pgdat, reclaim_order, highest_zoneidx)) {
> + if (!remaining && prepare_kswapd_sleep(pgdat, order, highest_zoneidx)) {
> trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>
> /*
> @@ -7350,8 +7335,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
> */
> static int kswapd(void *p)
> {
> - unsigned int alloc_order, reclaim_order;
> - unsigned int highest_zoneidx = MAX_NR_ZONES - 1;
> + unsigned int order, highest_zoneidx;
> pg_data_t *pgdat = (pg_data_t *)p;
> struct task_struct *tsk = current;
> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> @@ -7374,22 +7358,20 @@ static int kswapd(void *p)
> tsk->flags |= PF_MEMALLOC | PF_KSWAPD;
> set_freezable();
>
> - WRITE_ONCE(pgdat->kswapd_order, 0);
> + order = 0;
> + highest_zoneidx = MAX_NR_ZONES - 1;
> + WRITE_ONCE(pgdat->kswapd_order, order);
> WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
> +
> atomic_set(&pgdat->nr_writeback_throttled, 0);
> +
> for ( ; ; ) {
> bool ret;
>
> - alloc_order = reclaim_order = READ_ONCE(pgdat->kswapd_order);
> - highest_zoneidx = kswapd_highest_zoneidx(pgdat,
> - highest_zoneidx);
> -
> -kswapd_try_sleep:
> - kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> - highest_zoneidx);
> + kswapd_try_to_sleep(pgdat, order, highest_zoneidx);
>
> /* Read the new order and highest_zoneidx */
> - alloc_order = READ_ONCE(pgdat->kswapd_order);
> + order = READ_ONCE(pgdat->kswapd_order);
> highest_zoneidx = kswapd_highest_zoneidx(pgdat,
> highest_zoneidx);
> WRITE_ONCE(pgdat->kswapd_order, 0);
> @@ -7415,11 +7397,8 @@ static int kswapd(void *p)
> * request (alloc_order).
> */
> trace_mm_vmscan_kswapd_wake(pgdat->node_id, highest_zoneidx,
> - alloc_order);
> - reclaim_order = balance_pgdat(pgdat, alloc_order,
> - highest_zoneidx);
> - if (reclaim_order < alloc_order)
> - goto kswapd_try_sleep;
> + order);
> + balance_pgdat(pgdat, order, highest_zoneidx);
> }
>
> tsk->flags &= ~(PF_MEMALLOC | PF_KSWAPD);

2023-04-25 14:31:11

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] mm: vmscan: use compaction_suitable() check in kswapd

On Tue, Apr 25, 2023 at 11:12:28AM +0800, Huang, Ying wrote:
> Johannes Weiner <[email protected]> writes:
>
> > Kswapd currently bails on higher-order allocations with an open-coded
> > check for whether it's reclaimed the compaction gap.
> >
> > compaction_suitable() is the customary interface to coordinate reclaim
> > with compaction.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/vmscan.c | 67 ++++++++++++++++++-----------------------------------
> > 1 file changed, 23 insertions(+), 44 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ee8c8ca2e7b5..723705b9e4d9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6872,12 +6872,18 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> > if (!managed_zone(zone))
> > continue;
> >
> > + /* Allocation can succeed in any zone, done */
> > if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> > mark = wmark_pages(zone, WMARK_PROMO);
> > else
> > mark = high_wmark_pages(zone);
> > if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
> > return true;
> > +
> > + /* Allocation can't succeed, but enough order-0 to compact */
> > + if (compaction_suitable(zone, order,
> > + highest_zoneidx) == COMPACT_CONTINUE)
> > + return true;
>
> Should we check the following first?
>
> order > 0 && zone_watermark_ok_safe(zone, 0, mark, highest_zoneidx)

That's what compaction_suitable() does. It checks whether there are
enough migration targets for compaction (COMPACT_CONTINUE) or whether
reclaim needs to do some more work (COMPACT_SKIPPED).
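
A minimal standalone model of that contract, not the kernel
implementation (suitable_model() is a made-up name, and compact_gap()
here just follows the "twice the request size" rule of thumb):

#include <stdio.h>

enum compact_result { COMPACT_SKIPPED, COMPACT_CONTINUE };

/* scratch space compaction wants: twice the allocation request */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* enough order-0 free pages above the watermark -> compaction can run */
static enum compact_result suitable_model(unsigned long free_pages,
					  unsigned long watermark,
					  unsigned int order)
{
	if (free_pages >= watermark + compact_gap(order))
		return COMPACT_CONTINUE;	/* enough migration targets */
	return COMPACT_SKIPPED;			/* reclaim needs to do more work */
}

int main(void)
{
	/* e.g. an order-9 request against a watermark of 1024 pages */
	printf("free=2000: %s\n", suitable_model(2000, 1024, 9) == COMPACT_CONTINUE ?
	       "COMPACT_CONTINUE" : "COMPACT_SKIPPED");
	printf("free=4000: %s\n", suitable_model(4000, 1024, 9) == COMPACT_CONTINUE ?
	       "COMPACT_CONTINUE" : "COMPACT_SKIPPED");
	return 0;
}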

2023-04-25 14:41:53

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 08/26] mm: page_alloc: claim blocks during compaction capturing

On Fri, Apr 21, 2023 at 02:12:27PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:12:55PM -0400, Johannes Weiner wrote:
> > When capturing a whole block, update the migratetype accordingly. For
> > example, a THP allocation might capture an unmovable block. If the THP
> > gets split and partially freed later, the remainder should group up
> > with movable allocations.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/internal.h | 1 +
> > mm/page_alloc.c | 42 ++++++++++++++++++++++++------------------
> > 2 files changed, 25 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 024affd4e4b5..39f65a463631 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -432,6 +432,7 @@ struct compact_control {
> > */
> > struct capture_control {
> > struct compact_control *cc;
> > + int migratetype;
> > struct page *page;
> > };
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4d20513c83be..8e5996f8b4b4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -615,6 +615,17 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
> > page_to_pfn(page), MIGRATETYPE_MASK);
> > }
> >
> > +static void change_pageblock_range(struct page *pageblock_page,
> > + int start_order, int migratetype)
> > +{
> > + int nr_pageblocks = 1 << (start_order - pageblock_order);
> > +
> > + while (nr_pageblocks--) {
> > + set_pageblock_migratetype(pageblock_page, migratetype);
> > + pageblock_page += pageblock_nr_pages;
> > + }
> > +}
> > +
> > #ifdef CONFIG_DEBUG_VM
> > static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
> > {
> > @@ -962,14 +973,19 @@ compaction_capture(struct capture_control *capc, struct page *page,
> > is_migrate_isolate(migratetype))
> > return false;
> >
> > - /*
> > - * Do not let lower order allocations pollute a movable pageblock.
> > - * This might let an unmovable request use a reclaimable pageblock
> > - * and vice-versa but no more than normal fallback logic which can
> > - * have trouble finding a high-order free page.
> > - */
> > - if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
> > + if (order >= pageblock_order) {
> > + migratetype = capc->migratetype;
> > + change_pageblock_range(page, order, migratetype);
> > + } else if (migratetype == MIGRATE_MOVABLE) {
> > + /*
> > + * Do not let lower order allocations pollute a
> > + * movable pageblock. This might let an unmovable
> > + * request use a reclaimable pageblock and vice-versa
> > + * but no more than normal fallback logic which can
> > + * have trouble finding a high-order free page.
> > + */
> > return false;
> > + }
> >
>
> For capturing pageblock order or larger, why not unconditionally make
> the block MOVABLE? Even if it's a zero page allocation, it would be nice
> to keep the pageblock for movable pages after the split as long as
> possible.

The zero page isn't split, but if some other unmovable allocation does
a split and free later on, I want to avoid backfilling a block that
holds an unmovable allocation with movables. That block is already lost
to compaction, and this way future unmovable allocations are more likely
to group into that block rather than claim an additional unmovable block.

I did a double take for block merges beyond pageblock order,
wondering if we can claim multiple blocks for requests (capc->order)
smaller than a block. But that can't happen. Once we reach
pageblock_order during merging we claim, capture and exit. That means
order > pageblock_order can only happen if capc->order is actually
larger than a pageblock as well. I'll add a comment.
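
As an aside, the block-claiming arithmetic in the hunk above works out
as below, assuming pageblock_order == 9 (2MB blocks with 4KiB pages);
a standalone sketch, not kernel code:

#include <stdio.h>

#define PAGEBLOCK_ORDER	9	/* assumed; 2MB blocks with 4KiB pages */

int main(void)
{
	/* change_pageblock_range() retypes every pageblock the buddy spans */
	for (unsigned int order = PAGEBLOCK_ORDER; order <= 10; order++)
		printf("order %u buddy covers %d pageblock(s)\n",
		       order, 1 << (order - PAGEBLOCK_ORDER));
	return 0;
}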

2023-04-25 15:56:13

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 10/26] mm: page_alloc: allow compaction capturing from larger blocks

On Fri, Apr 21, 2023 at 03:14:47PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:12:57PM -0400, Johannes Weiner wrote:
> > Currently, capturing only works on matching orders and matching
> > migratetypes. However, if capturing is initially skipped on the
> > migratetype, it's possible that merging continues up to a full
> > pageblock, in which case the migratetype is up for grabs again.
> >
> > Allow capturing to grab smaller chunks from claimed pageblocks, and
> > expand the remainder of the block back onto the freelists.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> No objections other than we're still in the preparation phase and the
> series needs to be split. Out of curiosity, how often does this actually
> trigger in practice? I ask because superficially, I would expect capture to
> happen while pages are being merged and I'm not sure how much this actually
> helps. If anything the anomaly would be merging !MOVABLE types, capturing
> one pageblock and leaving the adjacent block eligible for splitting as
> UNMOVABLE/RECLAIMABLE which is not necessarily desirable.

Looking at this patch independently, once merging continues to the
full block, a fallback would be allowed to claim it anyway
(can_steal_fallback() returns true). I don't quite see a downside
letting capture apply in this case. The plus is of course avoiding the
indirection through the freelist which risks an opportunist request of
a smaller order fragmenting the block and wasting the contiguity work.

In the context of the full series, this becomes even more
important. Once watermarks are required to be met in MIGRATE_FREE
blocks, and reclaim/compaction recycle full blocks, merging up to
pageblock_order happens all the time - and needs to happen for
allocations to succeed. This applies to all types of direct reclaim:
unmovable request freeing reclaimable/movable blocks, reclaimable
freeing movable blocks, movable freeing reclaimable blocks.

I see your point about smaller orders now always ending the merge at
the pageblock, even when there could be additional merging
opportunities beyond. However, I'm not sure these accidental larger
merges, beyond what's needed to fulfill the request at hand, are
preferable to reclaimer fairness, and thus ultimately to the reliability
of orders up to the pageblock size.

I'll try to get some numbers for this patch independently, though.
This should manifest in p99 allocation latencies and near-OOM
behavior. Is there anything else you'd want me to look for?

Thanks!

2023-04-26 01:42:58

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] mm: vmscan: use compaction_suitable() check in kswapd

Johannes Weiner <[email protected]> writes:

> On Tue, Apr 25, 2023 at 11:12:28AM +0800, Huang, Ying wrote:
>> Johannes Weiner <[email protected]> writes:
>>
>> > Kswapd currently bails on higher-order allocations with an open-coded
>> > check for whether it's reclaimed the compaction gap.
>> >
>> > compaction_suitable() is the customary interface to coordinate reclaim
>> > with compaction.
>> >
>> > Signed-off-by: Johannes Weiner <[email protected]>
>> > ---
>> > mm/vmscan.c | 67 ++++++++++++++++++-----------------------------------
>> > 1 file changed, 23 insertions(+), 44 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index ee8c8ca2e7b5..723705b9e4d9 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -6872,12 +6872,18 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
>> > if (!managed_zone(zone))
>> > continue;
>> >
>> > + /* Allocation can succeed in any zone, done */
>> > if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
>> > mark = wmark_pages(zone, WMARK_PROMO);
>> > else
>> > mark = high_wmark_pages(zone);
>> > if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
>> > return true;
>> > +
>> > + /* Allocation can't succeed, but enough order-0 to compact */
>> > + if (compaction_suitable(zone, order,
>> > + highest_zoneidx) == COMPACT_CONTINUE)
>> > + return true;
>>
>> Should we check the following first?
>>
>> order > 0 && zone_watermark_ok_safe(zone, 0, mark, highest_zoneidx)
>
> That's what compaction_suitable() does. It checks whether there are
> enough migration targets for compaction (COMPACT_CONTINUE) or whether
> reclaim needs to do some more work (COMPACT_SKIPPED).

Yes. And I found that the watermark used in compaction_suitable() is
low_wmark_pages() or min_wmark_pages(), which doesn't match the
watermark here. Or did I miss something?

Best Regards,
Huang, Ying

2023-04-26 15:27:39

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] mm: vmscan: use compaction_suitable() check in kswapd

On Wed, Apr 26, 2023 at 09:30:23AM +0800, Huang, Ying wrote:
> Johannes Weiner <[email protected]> writes:
>
> > On Tue, Apr 25, 2023 at 11:12:28AM +0800, Huang, Ying wrote:
> >> Johannes Weiner <[email protected]> writes:
> >>
> >> > Kswapd currently bails on higher-order allocations with an open-coded
> >> > check for whether it's reclaimed the compaction gap.
> >> >
> >> > compaction_suitable() is the customary interface to coordinate reclaim
> >> > with compaction.
> >> >
> >> > Signed-off-by: Johannes Weiner <[email protected]>
> >> > ---
> >> > mm/vmscan.c | 67 ++++++++++++++++++-----------------------------------
> >> > 1 file changed, 23 insertions(+), 44 deletions(-)
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index ee8c8ca2e7b5..723705b9e4d9 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -6872,12 +6872,18 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> >> > if (!managed_zone(zone))
> >> > continue;
> >> >
> >> > + /* Allocation can succeed in any zone, done */
> >> > if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> >> > mark = wmark_pages(zone, WMARK_PROMO);
> >> > else
> >> > mark = high_wmark_pages(zone);
> >> > if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
> >> > return true;
> >> > +
> >> > + /* Allocation can't succeed, but enough order-0 to compact */
> >> > + if (compaction_suitable(zone, order,
> >> > + highest_zoneidx) == COMPACT_CONTINUE)
> >> > + return true;
> >>
> >> Should we check the following first?
> >>
> >> order > 0 && zone_watermark_ok_safe(zone, 0, mark, highest_zoneidx)
> >
> > That's what compaction_suitable() does. It checks whether there are
> > enough migration targets for compaction (COMPACT_CONTINUE) or whether
> > reclaim needs to do some more work (COMPACT_SKIPPED).
>
> Yes. And I found that the watermark used in compaction_suitable() is
> low_wmark_pages() or min_wmark_pages(), which doesn't match the
> watermark here. Or did I miss something?

Ahh, you're right, kswapd will bail prematurely. Compaction cannot
reliably meet the high watermark with a low watermark scratch space.

I'll add the order check before the suitable test, for clarity, and so
that order-0 requests don't check the same thing twice.

For the watermark, I'd make it an arg to compaction_suitable() and use
whatever the reclaimer targets (high for kswapd, min for direct).

However, there is a minor snag. compaction_suitable() currently has
its own smarts regarding the watermark:

/*
* Watermarks for order-0 must be met for compaction to be able to
* isolate free pages for migration targets. This means that the
* watermark and alloc_flags have to match, or be more pessimistic than
* the check in __isolate_free_page(). We don't use the direct
* compactor's alloc_flags, as they are not relevant for freepage
* isolation. We however do use the direct compactor's highest_zoneidx
* to skip over zones where lowmem reserves would prevent allocation
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
* ALLOC_CMA is used, as pages in CMA pageblocks are considered
* suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);

Historically it has always checked low instead of min. Then Vlastimil
changed it to min for non-costly orders here:

commit 8348faf91f56371d4bada6fc5915e19580a15ffe
Author: Vlastimil Babka <[email protected]>
Date: Fri Oct 7 16:58:00 2016 -0700

mm, compaction: require only min watermarks for non-costly orders

The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. Then __isolate_free_page uses low watermark check to decide
if particular free page can be isolated. In the latter case, using low
watermark is needlessly pessimistic, as the free page isolations are
only temporary. For __compaction_suitable() the higher watermark makes
sense for high-order allocations where more freepages increase the
chance of success, and we can typically fail with some order-0 fallback
when the system is struggling to reach that watermark. But for
low-order allocation, forming the page should not be that hard. So
using low watermark here might just prevent compaction from even trying,
and eventually lead to OOM killer even if we are above min watermarks.

So after this patch, we use min watermark for non-costly orders in
__compaction_suitable(), and for all orders in __isolate_free_page().

Lowering to min wasn't an issue for non-costly, but AFAICS there was
no explicit testing for whether min would work for costly orders too.

I'd propose trying it with min even for costly and see what happens.

If it does regress, a better place to boost scratch space for costly
orders might be compact_gap(), so I'd move it there.

Does that sound reasonable?
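
For illustration only, the kind of compact_gap() adjustment being
floated might look like the sketch below; the costly-order factor is a
made-up placeholder, not a proposed value:

#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER	3

/* hypothetical: keep the min watermark, boost the gap for costly orders */
static unsigned long compact_gap(unsigned int order)
{
	unsigned long gap = 2UL << order;	/* current rule: twice the request */

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		gap *= 2;			/* placeholder boost, would need data */

	return gap;
}

int main(void)
{
	printf("order 3 gap: %lu pages\n", compact_gap(3));
	printf("order 9 gap: %lu pages\n", compact_gap(9));
	return 0;
}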

2023-04-27 05:54:51

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] mm: vmscan: use compaction_suitable() check in kswapd

Johannes Weiner <[email protected]> writes:

> On Wed, Apr 26, 2023 at 09:30:23AM +0800, Huang, Ying wrote:
>> Johannes Weiner <[email protected]> writes:
>>
>> > On Tue, Apr 25, 2023 at 11:12:28AM +0800, Huang, Ying wrote:
>> >> Johannes Weiner <[email protected]> writes:
>> >>
>> >> > Kswapd currently bails on higher-order allocations with an open-coded
>> >> > check for whether it's reclaimed the compaction gap.
>> >> >
>> >> > compaction_suitable() is the customary interface to coordinate reclaim
>> >> > with compaction.
>> >> >
>> >> > Signed-off-by: Johannes Weiner <[email protected]>
>> >> > ---
>> >> > mm/vmscan.c | 67 ++++++++++++++++++-----------------------------------
>> >> > 1 file changed, 23 insertions(+), 44 deletions(-)
>> >> >
>> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> > index ee8c8ca2e7b5..723705b9e4d9 100644
>> >> > --- a/mm/vmscan.c
>> >> > +++ b/mm/vmscan.c
>> >> > @@ -6872,12 +6872,18 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
>> >> > if (!managed_zone(zone))
>> >> > continue;
>> >> >
>> >> > + /* Allocation can succeed in any zone, done */
>> >> > if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
>> >> > mark = wmark_pages(zone, WMARK_PROMO);
>> >> > else
>> >> > mark = high_wmark_pages(zone);
>> >> > if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
>> >> > return true;
>> >> > +
>> >> > + /* Allocation can't succeed, but enough order-0 to compact */
>> >> > + if (compaction_suitable(zone, order,
>> >> > + highest_zoneidx) == COMPACT_CONTINUE)
>> >> > + return true;
>> >>
>> >> Should we check the following first?
>> >>
>> >> order > 0 && zone_watermark_ok_safe(zone, 0, mark, highest_zoneidx)
>> >
>> > That's what compaction_suitable() does. It checks whether there are
>> > enough migration targets for compaction (COMPACT_CONTINUE) or whether
>> > reclaim needs to do some more work (COMPACT_SKIPPED).
>>
>> Yes. And I found that the watermark used in compaction_suitable() is
>> low_wmark_pages() or min_wmark_pages(), which doesn't match the
>> watermark here. Or did I miss something?
>
> Ahh, you're right, kswapd will bail prematurely. Compaction cannot
> reliably meet the high watermark with a low watermark scratch space.
>
> I'll add the order check before the suitable test, for clarity, and so
> that order-0 requests don't check the same thing twice.
>
> For the watermark, I'd make it an arg to compaction_suitable() and use
> whatever the reclaimer targets (high for kswapd, min for direct).
>
> However, there is a minor snag. compaction_suitable() currently has
> its own smarts regarding the watermark:
>
> /*
> * Watermarks for order-0 must be met for compaction to be able to
> * isolate free pages for migration targets. This means that the
> * watermark and alloc_flags have to match, or be more pessimistic than
> * the check in __isolate_free_page(). We don't use the direct
> * compactor's alloc_flags, as they are not relevant for freepage
> * isolation. We however do use the direct compactor's highest_zoneidx
> * to skip over zones where lowmem reserves would prevent allocation
> * even if compaction succeeds.
> * For costly orders, we require low watermark instead of min for
> * compaction to proceed to increase its chances.
> * ALLOC_CMA is used, as pages in CMA pageblocks are considered
> * suitable migration targets
> */
> watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
> low_wmark_pages(zone) : min_wmark_pages(zone);
>
> Historically it has always checked low instead of min. Then Vlastimil
> changed it to min for non-costly orders here:
>
> commit 8348faf91f56371d4bada6fc5915e19580a15ffe
> Author: Vlastimil Babka <[email protected]>
> Date: Fri Oct 7 16:58:00 2016 -0700
>
> mm, compaction: require only min watermarks for non-costly orders
>
> The __compaction_suitable() function checks the low watermark plus a
> compact_gap() gap to decide if there's enough free memory to perform
> compaction. Then __isolate_free_page uses low watermark check to decide
> if particular free page can be isolated. In the latter case, using low
> watermark is needlessly pessimistic, as the free page isolations are
> only temporary. For __compaction_suitable() the higher watermark makes
> sense for high-order allocations where more freepages increase the
> chance of success, and we can typically fail with some order-0 fallback
> when the system is struggling to reach that watermark. But for
> low-order allocation, forming the page should not be that hard. So
> using low watermark here might just prevent compaction from even trying,
> and eventually lead to OOM killer even if we are above min watermarks.
>
> So after this patch, we use min watermark for non-costly orders in
> __compaction_suitable(), and for all orders in __isolate_free_page().
>
> Lowering to min wasn't an issue for non-costly, but AFAICS there was
> no explicit testing for whether min would work for costly orders too.
>
> I'd propose trying it with min even for costly and see what happens.
>
> If it does regress, a better place to boost scratch space for costly
> orders might be compact_gap(), so I'd move it there.
>
> Does that sound reasonable?

Sounds good to me, Thanks!

Best Regards,
Huang, Ying

2023-04-28 10:37:25

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 05/26] mm: page_alloc: per-migratetype pcplist for THPs

On Fri, Apr 21, 2023 at 11:06:48AM -0400, Johannes Weiner wrote:
> On Fri, Apr 21, 2023 at 01:47:44PM +0100, Mel Gorman wrote:
> > On Tue, Apr 18, 2023 at 03:12:52PM -0400, Johannes Weiner wrote:
> > > Right now, there is only one pcplist for THP allocations. However,
> > > while most THPs are movable, the huge zero page is not. This means a
> > > movable THP allocation can grab an unmovable block from the pcplist,
> > > and a subsequent THP split, partial free, and reallocation of the
> > > remainder will mix movable and unmovable pages in the block.
> > >
> > > While this isn't a huge source of block pollution in practice, it
> > > happens often enough to trigger debug warnings fairly quickly under
> > > load. In the interest of tightening up pageblock hygiene, make the THP
> > > pcplists fully migratetype-aware, just like the lower order ones.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> >
> > Split out :P
> >
> > Take special care of this one because, while I didn't check this, I
> > suspect it'll push the PCP structure size into the next cache line and
> > increase overhead.
> >
> > The changelog makes it unclear why exactly this happens or why the
> > patch fixes it.
>
> Before this, I'd see warnings from the last patch in the series about
> received migratetype not matching requested mt.
>
> The way it happens is that the zero page gets freed and the unmovable
> block put on the pcplist. A regular THP allocation is subsequently
> served from an unmovable block.
>
> Mental note, I think this can happen the other way around too: a
> regular THP on the pcp being served to a MIGRATE_UNMOVABLE zero
> THP. It's not supposed to, but it looks like there is a bug in the
> code that's meant to prevent that from happening in rmqueue():
>
> if (likely(pcp_allowed_order(order))) {
> /*
> * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
> * we need to skip it when CMA area isn't allowed.
> */
> if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> migratetype != MIGRATE_MOVABLE) {
> page = rmqueue_pcplist(preferred_zone, zone, order,
> migratetype, alloc_flags);
> if (likely(page))
> goto out;
> }
> }
>
> Surely that last condition should be migratetype == MIGRATE_MOVABLE?
>

It should be. It would have been missed for ages because it would need a
test case based on a machine configuration that requires CMA for functional
correctness and is using THP, which is an unlikely combination.

> > The huge zero page strips GFP_MOVABLE (so unmovable)
> > but at allocation time, it doesn't really matter what the movable type
> > is because it's a full pageblock. It doesn't appear to be a hazard until
> > the split happens. Assuming that's the case, it should be ok to always
> > set the pageblock movable for THP allocations regardless of GFP flags at
> > allocation time or else set the pageblock MOVABLE at THP split (always
> > MOVABLE at allocation time makes more sense).
>
> The regular allocator compaction skips over compound pages anyway, so
> the migratetype should indeed not matter there.
>
> The bigger issue is CMA. alloc_contig_range() will try to move THPs to
> free a larger range. We have to be careful not to place an unmovable
zero THP into a CMA region. That means we cannot play games with MT -
> we really do have to physically keep unmovable and movable THPs apart.
>

Fair point.

> Another option would be not to use pcp for the zero THP. It's cached
> anyway in the caller. But it would add branches to the THP alloc and
> free fast paths (pcp_allowed_order() also checking migratetype).

And this is probably the most straightforward option. The intent behind
caching some THPs on the PCP was to speed up faulting of large mappings
of normal THPs and reduce contention on the zone lock a little. The zero
THP is somewhat special because it should not be allocated at high frequency.

--
Mel Gorman
SUSE Labs

2023-04-28 10:57:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 10/26] mm: page_alloc: allow compaction capturing from larger blocks

On Tue, Apr 25, 2023 at 11:40:26AM -0400, Johannes Weiner wrote:
> On Fri, Apr 21, 2023 at 03:14:47PM +0100, Mel Gorman wrote:
> > On Tue, Apr 18, 2023 at 03:12:57PM -0400, Johannes Weiner wrote:
> > > Currently, capturing only works on matching orders and matching
> > > migratetypes. However, if capturing is initially skipped on the
> > > migratetype, it's possible that merging continues up to a full
> > > pageblock, in which case the migratetype is up for grabs again.
> > >
> > > Allow capturing to grab smaller chunks from claimed pageblocks, and
> > > expand the remainder of the block back onto the freelists.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> >
> > No objections other than we're still in the preparation phase and the
> > series needs to be split. Out of curiosity, how often does this actually
> > trigger in practice? I ask because superficially, I would expect capture to
> > happen while pages are being merged and I'm not sure how much this actually
> > helps. If anything the anomaly would be merging !MOVABLE types, capturing
> > one pageblock and leaving the adjacent block eligible for splitting as
> > UNMOVABLE/RECLAIMABLE which is not necessarily desirable.
>
> Looking at this patch independently, once merging continues to the
> full block, a fallback would be allowed to claim it anyway
> (can_steal_fallback() returns true). I don't quite see a downside
> letting capture apply in this case. The plus is of course avoiding the
> indirection through the freelist which risks an opportunist request of
> a smaller order fragmenting the block and wasting the contiguity work.
>
> In the context of the full series, this becomes even more
> important. Once watermarks are required to be met in MIGRATE_FREE
> blocks, and reclaim/compaction recycle full blocks, merging up to
> pageblock_order happens all the time - and needs to happen for
> allocations to succeed. This applies to all types of direct reclaim:
> unmovable request freeing reclaimable/movable blocks, reclaimable
> freeing movable blocks, movable freeing reclaimable blocks.
>
> I see your point about smaller orders now always ending the merge at
> the pageblock, even when there could be additional merging
> opportunities beyond. However, I'm not sure these accidental larger
> merges, beyond what's needed to fulfill the request at hand, are
> preferable to reclaimer fairness, and thus ultimately to the reliability
> of orders up to the pageblock size.
>
> I'll try to get some numbers for this patch independently, though.
> This should manifest in p99 allocation latencies and near-OOM
> behavior. Is there anything else you'd want me to look for?
>

Any major change in how often the mm_page_alloc_extfrag trace event
triggers. Also put the patch at the end of the preparation series if
possible, or even do it as a separate follow-up patch after the bulk of
the series has been handled. While I cannot 100% convince myself either
way, I wonder if variable high-order allocation requests smaller than a
pageblock could cause premature mixing due to capture.

--
Mel Gorman
SUSE Labs

2023-05-02 15:23:58

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] mm: reliable huge page allocator

On 21.04.23 19:14, Matthew Wilcox wrote:
> On Fri, Apr 21, 2023 at 05:11:56PM +0100, Mel Gorman wrote:
>> It was considered once upon a time and comes up every so often as variants
>> of a "sticky" pageblock pageblock bit that prevents mixing. The risks was
>> ending up in a context where memory within a suitable pageblock cannot
>> be freed and all of the available MOVABLE pageblocks have at least one
>> pinned page that cannot migrate from the allocating context. It can also
>> potentially hit a case where the majority of memory is UNMOVABLE pageblocks,
>> each of which has a single pagetable page that cannot be freed without an
>> OOM kill. Variants of issues like this would manifest as an "OOM kill with
>> plenty of memory free" bug, or as excessive CPU usage on reclaim or compaction.
>>
>> It doesn't kill the idea of the series at all but it puts a lot of emphasis
>> in splitting the series by low-risk and high-risk. Maybe to the extent where
>> the absolute protection against mixing can be broken in OOM situations,
>> kernel command line or sysctl.
>
> Has a variant been previously considered where MOVABLE allocations are
> allowed to come from UNMOVABLE blocks? After all, MOVABLE allocations
> are generally, well, movable. So an UNMOVABLE allocation could try to
> migrate pages from a MIXED pageblock in order to turn the MIXED pageblock
> back into an UNMOVABLE pageblock.

I might be completely off, but my understanding was that movable
allocations can be happily placed into unmovable blocks if required already?

IIRC, it's primarily the zone fallback rules that prevent, e.g., ZONE_DMA
from getting filled immediately with movable data in your example. I might
be wrong, though.

I guess what you mean is serving movable allocations much earlier from
these other zones.

Having memory hotunplug in mind ( as always ;) ), I'd expect that such
fragmentation must be allowed to happen to guarantee that memory (esp.
ZONE_MOVABLE) can be properly evacuated even if there are not sufficient
MOVABLE pageblocks around to hold all that (movable) data.

--
Thanks,

David / dhildenb