V4:
- fixed the !pcp_allowed_order() case in free_unref_folios()
- reworded the patch 0 changelog a bit for the git log
- rebased to mm-everything-2024-03-19-23-01
- runtime-tested again with various CONFIG_DEBUG_FOOs enabled
---
The page allocator's mobility grouping is intended to keep unmovable
pages separate from reclaimable/compactable ones to allow on-demand
defragmentation for higher-order allocations and huge pages.
Currently, there are several places where accidental type mixing
occurs: an allocation asks for a page of a certain migratetype and
receives another. This ruins pageblocks for compaction, which in turn
makes allocating huge pages more expensive and less reliable.
This series addresses those causes. The last patch adds type checks on
all freelist movements to prevent new violations from being introduced.
The benefits can be seen in a mixed workload that stresses the machine
with a memcache-type workload and a kernel build job while
periodically attempting to allocate batches of THP. The following data
is aggregated over 50 consecutive defconfig builds:
VANILLA PATCHED
Hugealloc Time mean 165843.93 ( +0.00%) 113025.88 ( -31.85%)
Hugealloc Time stddev 158957.35 ( +0.00%) 114716.07 ( -27.83%)
Kbuild Real time 310.24 ( +0.00%) 300.73 ( -3.06%)
Kbuild User time 1271.13 ( +0.00%) 1259.42 ( -0.92%)
Kbuild System time 582.02 ( +0.00%) 559.79 ( -3.81%)
THP fault alloc 30585.14 ( +0.00%) 40853.62 ( +33.57%)
THP fault fallback 36626.46 ( +0.00%) 26357.62 ( -28.04%)
THP fault fail rate % 54.49 ( +0.00%) 39.22 ( -27.53%)
Pagealloc fallback 1328.00 ( +0.00%) 1.00 ( -99.85%)
Pagealloc type mismatch 181009.50 ( +0.00%) 0.00 ( -100.00%)
Direct compact stall 434.56 ( +0.00%) 257.66 ( -40.61%)
Direct compact fail 421.70 ( +0.00%) 249.94 ( -40.63%)
Direct compact success 12.86 ( +0.00%) 7.72 ( -37.09%)
Direct compact success rate % 2.86 ( +0.00%) 2.82 ( -0.96%)
Compact daemon scanned migrate 3370059.62 ( +0.00%) 3612054.76 ( +7.18%)
Compact daemon scanned free 7718439.20 ( +0.00%) 5386385.02 ( -30.21%)
Compact direct scanned migrate 309248.62 ( +0.00%) 176721.04 ( -42.85%)
Compact direct scanned free 433582.84 ( +0.00%) 315727.66 ( -27.18%)
Compact migrate scanned daemon % 91.20 ( +0.00%) 94.48 ( +3.56%)
Compact free scanned daemon % 94.58 ( +0.00%) 94.42 ( -0.16%)
Compact total migrate scanned 3679308.24 ( +0.00%) 3788775.80 ( +2.98%)
Compact total free scanned 8152022.04 ( +0.00%) 5702112.68 ( -30.05%)
Alloc stall 872.04 ( +0.00%) 5156.12 ( +490.71%)
Pages kswapd scanned 510645.86 ( +0.00%) 3394.94 ( -99.33%)
Pages kswapd reclaimed 134811.62 ( +0.00%) 2701.26 ( -98.00%)
Pages direct scanned 99546.06 ( +0.00%) 376407.52 ( +278.12%)
Pages direct reclaimed 62123.40 ( +0.00%) 289535.70 ( +366.06%)
Pages total scanned 610191.92 ( +0.00%) 379802.46 ( -37.76%)
Pages scanned kswapd % 76.36 ( +0.00%) 0.10 ( -98.58%)
Swap out 12057.54 ( +0.00%) 15022.98 ( +24.59%)
Swap in 209.16 ( +0.00%) 256.48 ( +22.52%)
File refaults 17701.64 ( +0.00%) 11765.40 ( -33.53%)
The huge page success rate is higher, and allocation latencies are
shorter and more predictable.
Stealing (fallback) rate is drastically reduced. Notably, while the
vanilla kernel keeps doing fallbacks on an ongoing basis, the patched
kernel enters a steady state once the distribution of block types is
adequate for the workload. Steals over 50 runs:
VANILLA PATCHED
1504.0 227.0
1557.0 6.0
1391.0 13.0
1080.0 26.0
1057.0 40.0
1156.0 6.0
805.0 46.0
736.0 20.0
1747.0 2.0
1699.0 34.0
1269.0 13.0
1858.0 12.0
907.0 4.0
727.0 2.0
563.0 2.0
3094.0 2.0
10211.0 3.0
2621.0 1.0
5508.0 2.0
1060.0 2.0
538.0 3.0
5773.0 2.0
2199.0 0.0
3781.0 2.0
1387.0 1.0
4977.0 0.0
2865.0 1.0
1814.0 1.0
3739.0 1.0
6857.0 0.0
382.0 0.0
407.0 1.0
3784.0 0.0
297.0 0.0
298.0 0.0
6636.0 0.0
4188.0 0.0
242.0 0.0
9960.0 0.0
5816.0 0.0
354.0 0.0
287.0 0.0
261.0 0.0
140.0 1.0
2065.0 0.0
312.0 0.0
331.0 0.0
164.0 0.0
465.0 1.0
219.0 0.0
Type mismatches are down too. Those count every time an allocation
request asks for one migratetype and gets another. This can still
occur minimally in the patched kernel due to non-stealing fallbacks,
but it's quite rare and follows the pattern of overall fallbacks -
once the block type distribution settles, mismatches cease as well:
VANILLA: PATCHED:
182602.0 268.0
135794.0 20.0
88619.0 19.0
95973.0 0.0
129590.0 0.0
129298.0 0.0
147134.0 0.0
230854.0 0.0
239709.0 0.0
137670.0 0.0
132430.0 0.0
65712.0 0.0
57901.0 0.0
67506.0 0.0
63565.0 4.0
34806.0 0.0
42962.0 0.0
32406.0 0.0
38668.0 0.0
61356.0 0.0
57800.0 0.0
41435.0 0.0
83456.0 0.0
65048.0 0.0
28955.0 0.0
47597.0 0.0
75117.0 0.0
55564.0 0.0
38280.0 0.0
52404.0 0.0
26264.0 0.0
37538.0 0.0
19671.0 0.0
30936.0 0.0
26933.0 0.0
16962.0 0.0
44554.0 0.0
46352.0 0.0
24995.0 0.0
35152.0 0.0
12823.0 0.0
21583.0 0.0
18129.0 0.0
31693.0 0.0
28745.0 0.0
33308.0 0.0
31114.0 0.0
35034.0 0.0
12111.0 0.0
24885.0 0.0
Compaction work is markedly reduced despite much better THP rates.
In the vanilla kernel, reclaim seems to have been driven primarily by
watermark boosting that happens as a result of fallbacks. With those
all but eliminated, watermarks average lower and kswapd does less
work. The uptick in direct reclaim is because THP requests have to
fend for themselves more often - which is intended policy right
now. Aggregate reclaim activity is lowered significantly, though.
---
V3:
- fixed freelist type violations from non-atomic page isolation
updates (Zi Yan)
- fixed incorrect migratetype update ordering during merge (Vlastimil Babka)
- reject moving a zone-straddling block altogether (Vlastimil Babka)
- fixed freelist type violations from lockless migratetype lookups in
cornercase freeing paths (Vlastimil Babka)
- fixed erroneous WARN in the bulk freeing path that was intended to catch
mistakes in the now-removed pcpcache (Mike Kravetz)
- fixed typo in patch 1's changelog (Zi Yan)
- optimized migratetype lookup in free_unref_page_list() (Vlastimil Babka)
- batched vmstat updates in page merging hot path (Vlastimil Babka)
- rebased to mm-everything-2024-03-05-20-43 (v6.8-rc5+)
V2:
- dropped the get_pfnblock_migratetype() optimization
patchlet since somebody else beat me to it (thanks Zi)
- broke out pcp bypass fix since somebody else reported the bug:
https://lore.kernel.org/linux-mm/[email protected]/
- fixed the CONFIG_UNACCEPTED_MEMORY build (lkp)
- rebased to v6.6-rc1
include/linux/mm.h | 18 +-
include/linux/page-isolation.h | 5 +-
include/linux/vmstat.h | 8 -
mm/debug_page_alloc.c | 12 +-
mm/internal.h | 9 -
mm/page_alloc.c | 650 +++++++++++++++++++++------------------
mm/page_isolation.c | 122 +++-----
7 files changed, 415 insertions(+), 409 deletions(-)
Based on mm-everything-2024-03-19-23-01.
The idea behind the pcppage migratetype cache is to save get_pageblock_migratetype()
lookups during bulk freeing. A microbenchmark suggests this isn't
helping, though. The pcp migratetype can get stale, which means that
bulk freeing has an extra branch to check if the pageblock was
isolated while on the pcp.
While the variance overlaps, the cache write and the branch seem to
make this a net negative. The following test allocates and frees
batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
Before:
8,668.48 msec task-clock # 99.735 CPUs utilized ( +- 2.90% )
19 context-switches # 4.341 /sec ( +- 3.24% )
0 cpu-migrations # 0.000 /sec
17,440 page-faults # 3.984 K/sec ( +- 2.90% )
41,758,692,473 cycles # 9.541 GHz ( +- 2.90% )
126,201,294,231 instructions # 5.98 insn per cycle ( +- 2.90% )
25,348,098,335 branches # 5.791 G/sec ( +- 2.90% )
33,436,921 branch-misses # 0.26% of all branches ( +- 2.90% )
0.0869148 +- 0.0000302 seconds time elapsed ( +- 0.03% )
After:
8,444.81 msec task-clock # 99.726 CPUs utilized ( +- 2.90% )
22 context-switches # 5.160 /sec ( +- 3.23% )
0 cpu-migrations # 0.000 /sec
17,443 page-faults # 4.091 K/sec ( +- 2.90% )
40,616,738,355 cycles # 9.527 GHz ( +- 2.90% )
126,383,351,792 instructions # 6.16 insn per cycle ( +- 2.90% )
25,224,985,153 branches # 5.917 G/sec ( +- 2.90% )
32,236,793 branch-misses # 0.25% of all branches ( +- 2.90% )
0.0846799 +- 0.0000412 seconds time elapsed ( +- 0.05% )
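For reference, the core loop of such a test could look roughly like the
kernel-module sketch below. This is an illustrative assumption, not the
actual harness behind the numbers above; the module name, BATCH and
ROUNDS values are made up:

	/* Allocate and free order-0 pages in batches large enough to
	 * overflow the pcplists and exercise the bulk freeing path. */
	#include <linux/module.h>
	#include <linux/gfp.h>
	#include <linux/mm.h>
	#include <linux/slab.h>

	#define BATCH	10000	/* ~3x pcp->high to force flushing */
	#define ROUNDS	1000

	static int __init pcpbench_init(void)
	{
		struct page **pages;
		int i, r;

		pages = kvmalloc_array(BATCH, sizeof(*pages), GFP_KERNEL);
		if (!pages)
			return -ENOMEM;

		for (r = 0; r < ROUNDS; r++) {
			for (i = 0; i < BATCH; i++)
				pages[i] = alloc_page(GFP_KERNEL);
			for (i = 0; i < BATCH; i++)
				if (pages[i])
					__free_page(pages[i]);
		}
		kvfree(pages);
		return 0;
	}

	static void __exit pcpbench_exit(void)
	{
	}

	module_init(pcpbench_init);
	module_exit(pcpbench_exit);
	MODULE_LICENSE("GPL");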
A side effect is that this also ensures that pages whose pageblock
gets stolen while on the pcplist end up on the right freelist, and
that we don't perform potentially type-incompatible buddy merges (or
skip merges when we shouldn't). This is likely beneficial to long-term
fragmentation management, although the effects would be harder to
measure. Settle for the simpler and faster code as justification here.
v2:
- remove erroneous leftover VM_BUG_ON in pcp bulk freeing (Mike)
Acked-by: Zi Yan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Tested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 66 +++++++++++--------------------------------------
1 file changed, 14 insertions(+), 52 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4491d0240bc6..60a632b7c9f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -206,24 +206,6 @@ EXPORT_SYMBOL(node_states);
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
-/*
- * A cached value of the page's pageblock's migratetype, used when the page is
- * put on a pcplist. Used to avoid the pageblock migratetype lookup when
- * freeing from pcplists in most cases, at the cost of possibly becoming stale.
- * Also the migratetype set in the page does not necessarily match the pcplist
- * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
- * other index - this ensures that it will be put on the correct CMA freelist.
- */
-static inline int get_pcppage_migratetype(struct page *page)
-{
- return page->index;
-}
-
-static inline void set_pcppage_migratetype(struct page *page, int migratetype)
-{
- page->index = migratetype;
-}
-
#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
unsigned int pageblock_order __read_mostly;
#endif
@@ -1191,7 +1173,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
{
unsigned long flags;
unsigned int order;
- bool isolated_pageblocks;
struct page *page;
/*
@@ -1204,7 +1185,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
pindex = pindex - 1;
spin_lock_irqsave(&zone->lock, flags);
- isolated_pageblocks = has_isolate_pageblock(zone);
while (count > 0) {
struct list_head *list;
@@ -1220,23 +1200,19 @@ static void free_pcppages_bulk(struct zone *zone, int count,
order = pindex_to_order(pindex);
nr_pages = 1 << order;
do {
+ unsigned long pfn;
int mt;
page = list_last_entry(list, struct page, pcp_list);
- mt = get_pcppage_migratetype(page);
+ pfn = page_to_pfn(page);
+ mt = get_pfnblock_migratetype(page, pfn);
/* must delete to avoid corrupting pcp list */
list_del(&page->pcp_list);
count -= nr_pages;
pcp->count -= nr_pages;
- /* MIGRATE_ISOLATE page should not go to pcplists */
- VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
- /* Pageblock could have been isolated meanwhile */
- if (unlikely(isolated_pageblocks))
- mt = get_pageblock_migratetype(page);
-
- __free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE);
+ __free_one_page(page, pfn, zone, order, mt, FPI_NONE);
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
}
@@ -1575,7 +1551,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
continue;
del_page_from_free_list(page, zone, current_order);
expand(zone, page, order, current_order, migratetype);
- set_pcppage_migratetype(page, migratetype);
trace_mm_page_alloc_zone_locked(page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
@@ -2182,7 +2157,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages are ordered properly.
*/
list_add_tail(&page->pcp_list, list);
- if (is_migrate_cma(get_pcppage_migratetype(page)))
+ if (is_migrate_cma(get_pageblock_migratetype(page)))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-(1 << order));
}
@@ -2378,19 +2353,6 @@ void drain_all_pages(struct zone *zone)
__drain_all_pages(zone, false);
}
-static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
- unsigned int order)
-{
- int migratetype;
-
- if (!free_pages_prepare(page, order))
- return false;
-
- migratetype = get_pfnblock_migratetype(page, pfn);
- set_pcppage_migratetype(page, migratetype);
- return true;
-}
-
static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high)
{
int min_nr_free, max_nr_free;
@@ -2523,7 +2485,7 @@ void free_unref_page(struct page *page, unsigned int order)
unsigned long pfn = page_to_pfn(page);
int migratetype, pcpmigratetype;
- if (!free_unref_page_prepare(page, pfn, order))
+ if (!free_pages_prepare(page, order))
return;
/*
@@ -2533,7 +2495,7 @@ void free_unref_page(struct page *page, unsigned int order)
* get those areas back if necessary. Otherwise, we may have to free
* excessively into the page allocator
*/
- migratetype = pcpmigratetype = get_pcppage_migratetype(page);
+ migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
@@ -2572,14 +2534,14 @@ void free_unref_folios(struct folio_batch *folios)
if (order > 0 && folio_test_large_rmappable(folio))
folio_undo_large_rmappable(folio);
- if (!free_unref_page_prepare(&folio->page, pfn, order))
+ if (!free_pages_prepare(&folio->page, order))
continue;
/*
* Free isolated folios and orders not handled on the PCP
* directly to the allocator, see comment in free_unref_page.
*/
- migratetype = get_pcppage_migratetype(&folio->page);
+ migratetype = get_pfnblock_migratetype(&folio->page, pfn);
if (!pcp_allowed_order(order) ||
is_migrate_isolate(migratetype)) {
free_one_page(folio_zone(folio), &folio->page, pfn,
@@ -2596,10 +2558,11 @@ void free_unref_folios(struct folio_batch *folios)
for (i = 0; i < folios->nr; i++) {
struct folio *folio = folios->folios[i];
struct zone *zone = folio_zone(folio);
+ unsigned long pfn = folio_pfn(folio);
unsigned int order = (unsigned long)folio->private;
folio->private = NULL;
- migratetype = get_pcppage_migratetype(&folio->page);
+ migratetype = get_pfnblock_migratetype(&folio->page, pfn);
/* Different zone requires a different pcp lock */
if (zone != locked_zone) {
@@ -2616,9 +2579,8 @@ void free_unref_folios(struct folio_batch *folios)
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (unlikely(!pcp)) {
pcp_trylock_finish(UP_flags);
- free_one_page(zone, &folio->page,
- folio_pfn(folio), order,
- migratetype, FPI_NONE);
+ free_one_page(zone, &folio->page, pfn,
+ order, migratetype, FPI_NONE);
locked_zone = NULL;
continue;
}
@@ -2787,7 +2749,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
}
}
__mod_zone_freepage_state(zone, -(1 << order),
- get_pcppage_migratetype(page));
+ get_pageblock_migratetype(page));
spin_unlock_irqrestore(&zone->lock, flags);
} while (check_new_pages(page, order));
--
2.44.0
Move direct freeing of isolated pages to the lock-breaking block in
the second loop. This saves an unnecessary migratetype reassessment.
Minor comment and local variable scoping cleanups.
Suggested-by: Vlastimil Babka <[email protected]>
Tested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 32 +++++++++++++++++++++++---------
1 file changed, 23 insertions(+), 9 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 60a632b7c9f6..994e4f790e92 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2524,7 +2524,7 @@ void free_unref_folios(struct folio_batch *folios)
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp = NULL;
struct zone *locked_zone = NULL;
- int i, j, migratetype;
+ int i, j;
/* Prepare folios for freeing */
for (i = 0, j = 0; i < folios->nr; i++) {
@@ -2536,14 +2536,15 @@ void free_unref_folios(struct folio_batch *folios)
folio_undo_large_rmappable(folio);
if (!free_pages_prepare(&folio->page, order))
continue;
-
/*
- * Free isolated folios and orders not handled on the PCP
- * directly to the allocator, see comment in free_unref_page.
+ * Free orders not handled on the PCP directly to the
+ * allocator.
*/
- migratetype = get_pfnblock_migratetype(&folio->page, pfn);
- if (!pcp_allowed_order(order) ||
- is_migrate_isolate(migratetype)) {
+ if (!pcp_allowed_order(order)) {
+ int migratetype;
+
+ migratetype = get_pfnblock_migratetype(&folio->page,
+ pfn);
free_one_page(folio_zone(folio), &folio->page, pfn,
order, migratetype, FPI_NONE);
continue;
@@ -2560,15 +2561,29 @@ void free_unref_folios(struct folio_batch *folios)
struct zone *zone = folio_zone(folio);
unsigned long pfn = folio_pfn(folio);
unsigned int order = (unsigned long)folio->private;
+ int migratetype;
folio->private = NULL;
migratetype = get_pfnblock_migratetype(&folio->page, pfn);
/* Different zone requires a different pcp lock */
- if (zone != locked_zone) {
+ if (zone != locked_zone ||
+ is_migrate_isolate(migratetype)) {
if (pcp) {
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+ locked_zone = NULL;
+ pcp = NULL;
+ }
+
+ /*
+ * Free isolated pages directly to the
+ * allocator, see comment in free_unref_page.
+ */
+ if (is_migrate_isolate(migratetype)) {
+ free_one_page(zone, &folio->page, pfn,
+ order, migratetype, FPI_NONE);
+ continue;
}
/*
@@ -2581,7 +2596,6 @@ void free_unref_folios(struct folio_batch *folios)
pcp_trylock_finish(UP_flags);
free_one_page(zone, &folio->page, pfn,
order, migratetype, FPI_NONE);
- locked_zone = NULL;
continue;
}
locked_zone = zone;
--
2.44.0
The buddy allocator coalesces compatible blocks during freeing, but it
doesn't update the types of the subblocks to match. When an allocation
later breaks the chunk down again, its pieces will be put on freelists
of the wrong type. This encourages incompatible page mixing (ask for
one type, get another), and thus long-term fragmentation.
Update the subblocks when merging a larger chunk, such that a later
expand() will maintain freelist type hygiene.
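To make the failure mode concrete, consider a tiny userspace model of
the scenario (an illustrative sketch with made-up names, not kernel
code): a MOVABLE page is freed, merges with its free buddy in an
adjacent UNMOVABLE pageblock, and the merged chunk is later split
again by expand():

	#include <stdio.h>

	enum mt { MOVABLE, UNMOVABLE };

	int main(void)
	{
		/* two adjacent pageblocks forming one potential larger buddy */
		enum mt block_type[2] = { MOVABLE, UNMOVABLE };

		/* __free_one_page(): the freed MOVABLE page merges with its free
		 * UNMOVABLE buddy; the merged chunk lands on the freelist of the
		 * type being freed, but block_type[1] is not updated. */
		enum mt merged_freelist = MOVABLE;

		/* expand(): a later split puts the second half back on the same
		 * (MOVABLE) freelist, although its block still says UNMOVABLE. */
		enum mt second_half_freelist = merged_freelist;

		if (second_half_freelist != block_type[1])
			printf("freelist/block type mismatch in block 1\n");

		/* this patch: match the buddy's block type at merge time */
		block_type[1] = merged_freelist;
		return 0;
	}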
v2:
- remove spurious change_pageblock_range() move (Zi Yan)
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Tested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 994e4f790e92..4529893d9f04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -785,10 +785,17 @@ static inline void __free_one_page(struct page *page,
*/
int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
- if (migratetype != buddy_mt
- && (!migratetype_is_mergeable(migratetype) ||
- !migratetype_is_mergeable(buddy_mt)))
- goto done_merging;
+ if (migratetype != buddy_mt) {
+ if (!migratetype_is_mergeable(migratetype) ||
+ !migratetype_is_mergeable(buddy_mt))
+ goto done_merging;
+ /*
+ * Match buddy type. This ensures that
+ * an expand() down the line puts the
+ * sub-blocks on the right freelists.
+ */
+ set_pageblock_migratetype(buddy, migratetype);
+ }
}
/*
--
2.44.0
When claiming a block during compaction isolation, move any remaining
free pages to the correct freelists as well, instead of stranding them
on the wrong list. Otherwise, this encourages incompatible page mixing
down the line, and thus long-term fragmentation.
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Tested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4529893d9f04..a1376a6fe7e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2683,9 +2683,12 @@ int __isolate_free_page(struct page *page, unsigned int order)
* Only change normal pageblocks (i.e., they can merge
* with others)
*/
- if (migratetype_is_mergeable(mt))
+ if (migratetype_is_mergeable(mt)) {
set_pageblock_migratetype(page,
MIGRATE_MOVABLE);
+ move_freepages_block(zone, page,
+ MIGRATE_MOVABLE, NULL);
+ }
}
}
--
2.44.0
When a block is partially outside the zone of the cursor page,
move_freepages_block() cuts the range to the pivot page instead of the
zone start. This can leave large parts of the block behind, which
encourages incompatible page mixing down the line (ask for one type,
get another), and thus long-term fragmentation.
This triggers reliably on the first block in the DMA zone, whose
start_pfn is 1. The block is stolen, but everything before the pivot
page (which was often hundreds of pages) is left on the old list.
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a1376a6fe7e4..7373329763e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1645,9 +1645,15 @@ int move_freepages_block(struct zone *zone, struct page *page,
start_pfn = pageblock_start_pfn(pfn);
end_pfn = pageblock_end_pfn(pfn) - 1;
- /* Do not cross zone boundaries */
+ /*
+ * The caller only has the lock for @zone, don't touch ranges
+ * that straddle into other zones. While we could move part of
+ * the range that's inside the zone, this call is usually
+ * accompanied by other operations such as migratetype updates
+ * which also should be locked.
+ */
if (!zone_spans_pfn(zone, start_pfn))
- start_pfn = pfn;
+ return 0;
if (!zone_spans_pfn(zone, end_pfn))
return 0;
--
2.44.0
There are several freeing paths that read the page's migratetype
optimistically before grabbing the zone lock. When this races with
block stealing, those pages go on the wrong freelist.
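Schematically (an informal sketch of the race; the exact call chains
vary by path):

    CPU A: free                       CPU B: fallback/steal
    mt = get_pfnblock_migratetype()
                                      spin_lock(&zone->lock)
                                      move_freepages_block(block, new_mt)
                                      set_pageblock_migratetype(block, new_mt)
                                      spin_unlock(&zone->lock)
    spin_lock(&zone->lock)
    __free_one_page(page, ..., mt)    /* stale mt: old type's freelist */
    spin_unlock(&zone->lock)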
The paths in question are:
- when freeing >costly orders that aren't THP
- when freeing pages to the buddy upon pcp lock contention
- when freeing pages that are isolated
- when freeing pages initially during boot
- when freeing the remainder in alloc_pages_exact()
- when "accepting" unaccepted VM host memory before first use
- when freeing pages during unpoisoning
None of these are so hot that they would need this optimization at the
cost of hampering defrag efforts, especially when contrasted with the
fact that the most common buddy freeing path, free_pcppages_bulk(),
checks the migratetype under the zone->lock just fine.
In addition, isolated pages need to look up the migratetype under the
lock anyway, which adds branches to the locked section, and results in
a double lookup when the pages are in fact isolated.
Move the lookups into the lock.
Reported-by: Vlastimil Babka <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 52 ++++++++++++++++++-------------------------------
1 file changed, 19 insertions(+), 33 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e7d0d4711bdd..3f65b565eaad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1227,18 +1227,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
spin_unlock_irqrestore(&zone->lock, flags);
}
-static void free_one_page(struct zone *zone,
- struct page *page, unsigned long pfn,
- unsigned int order,
- int migratetype, fpi_t fpi_flags)
+static void free_one_page(struct zone *zone, struct page *page,
+ unsigned long pfn, unsigned int order,
+ fpi_t fpi_flags)
{
unsigned long flags;
+ int migratetype;
spin_lock_irqsave(&zone->lock, flags);
- if (unlikely(has_isolate_pageblock(zone) ||
- is_migrate_isolate(migratetype))) {
- migratetype = get_pfnblock_migratetype(page, pfn);
- }
+ migratetype = get_pfnblock_migratetype(page, pfn);
__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -1246,21 +1243,13 @@ static void free_one_page(struct zone *zone,
static void __free_pages_ok(struct page *page, unsigned int order,
fpi_t fpi_flags)
{
- int migratetype;
unsigned long pfn = page_to_pfn(page);
struct zone *zone = page_zone(page);
if (!free_pages_prepare(page, order))
return;
- /*
- * Calling get_pfnblock_migratetype() without spin_lock_irqsave() here
- * is used to avoid calling get_pfnblock_migratetype() under the lock.
- * This will reduce the lock holding time.
- */
- migratetype = get_pfnblock_migratetype(page, pfn);
-
- free_one_page(zone, page, pfn, order, migratetype, fpi_flags);
+ free_one_page(zone, page, pfn, order, fpi_flags);
__count_vm_events(PGFREE, 1 << order);
}
@@ -2533,7 +2522,7 @@ void free_unref_page(struct page *page, unsigned int order)
struct per_cpu_pages *pcp;
struct zone *zone;
unsigned long pfn = page_to_pfn(page);
- int migratetype, pcpmigratetype;
+ int migratetype;
if (!free_pages_prepare(page, order))
return;
@@ -2545,23 +2534,23 @@ void free_unref_page(struct page *page, unsigned int order)
* get those areas back if necessary. Otherwise, we may have to free
* excessively into the page allocator
*/
- migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
+ migratetype = get_pfnblock_migratetype(page, pfn);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
+ free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
return;
}
- pcpmigratetype = MIGRATE_MOVABLE;
+ migratetype = MIGRATE_MOVABLE;
}
zone = page_zone(page);
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
- free_unref_page_commit(zone, pcp, page, pcpmigratetype, order);
+ free_unref_page_commit(zone, pcp, page, migratetype, order);
pcp_spin_unlock(pcp);
} else {
- free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
+ free_one_page(zone, page, pfn, order, FPI_NONE);
}
pcp_trylock_finish(UP_flags);
}
@@ -2591,12 +2580,8 @@ void free_unref_folios(struct folio_batch *folios)
* allocator.
*/
if (!pcp_allowed_order(order)) {
- int migratetype;
-
- migratetype = get_pfnblock_migratetype(&folio->page,
- pfn);
- free_one_page(folio_zone(folio), &folio->page, pfn,
- order, migratetype, FPI_NONE);
+ free_one_page(folio_zone(folio), &folio->page,
+ pfn, order, FPI_NONE);
continue;
}
folio->private = (void *)(unsigned long)order;
@@ -2632,7 +2617,7 @@ void free_unref_folios(struct folio_batch *folios)
*/
if (is_migrate_isolate(migratetype)) {
free_one_page(zone, &folio->page, pfn,
- order, migratetype, FPI_NONE);
+ order, FPI_NONE);
continue;
}
@@ -2645,7 +2630,7 @@ void free_unref_folios(struct folio_batch *folios)
if (unlikely(!pcp)) {
pcp_trylock_finish(UP_flags);
free_one_page(zone, &folio->page, pfn,
- order, migratetype, FPI_NONE);
+ order, FPI_NONE);
continue;
}
locked_zone = zone;
@@ -6823,13 +6808,14 @@ bool take_page_off_buddy(struct page *page)
bool put_page_back_buddy(struct page *page)
{
struct zone *zone = page_zone(page);
- unsigned long pfn = page_to_pfn(page);
unsigned long flags;
- int migratetype = get_pfnblock_migratetype(page, pfn);
bool ret = false;
spin_lock_irqsave(&zone->lock, flags);
if (put_page_testzero(page)) {
+ unsigned long pfn = page_to_pfn(page);
+ int migratetype = get_pfnblock_migratetype(page, pfn);
+
ClearPageHWPoisonTakenOff(page);
__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
if (TestClearPageHWPoison(page)) {
--
2.44.0
Currently, page block type conversion during fallbacks, atomic
reservations and isolation can strand various amounts of free pages on
incorrect freelists.
For example, fallback stealing moves free pages in the block to the
new type's freelists, but then may not actually claim the block for
that type if there aren't enough compatible pages already allocated.
In all cases, free page moving might fail if the block straddles more
than one zone, in which case no free pages are moved at all, but the
block type is changed anyway.
This is detrimental to type hygiene on the freelists. It encourages
incompatible page mixing down the line (ask for one type, get another)
and thus contributes to long-term fragmentation.
Split the process into a proper transaction: check first if conversion
will happen, then try to move the free pages, and only if that was
successful convert the block to the new type.
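The resulting ordering at the conversion sites (as in
reserve_highatomic_pageblock() in the hunks below) is:

	/* attempt the move first; it returns -1 if the block straddles
	 * a zone boundary and cannot be moved safely */
	if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) {
		/* only a successful move claims the block */
		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
		zone->nr_reserved_highatomic += pageblock_nr_pages;
	}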
Tested-by: "Huang, Ying" <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/page-isolation.h | 3 +-
mm/page_alloc.c | 175 ++++++++++++++++++++-------------
mm/page_isolation.c | 22 +++--
3 files changed, 121 insertions(+), 79 deletions(-)
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 4ac34392823a..8550b3c91480 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,8 +34,7 @@ static inline bool is_migrate_isolate(int migratetype)
#define REPORT_FAILURE 0x2
void set_pageblock_migratetype(struct page *page, int migratetype);
-int move_freepages_block(struct zone *zone, struct page *page,
- int migratetype, int *num_movable);
+int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7373329763e6..e7d0d4711bdd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1596,9 +1596,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
* Note that start_page and end_pages are not aligned on a pageblock
* boundary. If alignment is required, use move_freepages_block()
*/
-static int move_freepages(struct zone *zone,
- unsigned long start_pfn, unsigned long end_pfn,
- int migratetype, int *num_movable)
+static int move_freepages(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn, int migratetype)
{
struct page *page;
unsigned long pfn;
@@ -1608,14 +1607,6 @@ static int move_freepages(struct zone *zone,
for (pfn = start_pfn; pfn <= end_pfn;) {
page = pfn_to_page(pfn);
if (!PageBuddy(page)) {
- /*
- * We assume that pages that could be isolated for
- * migration are movable. But we don't actually try
- * isolating, as that would be expensive.
- */
- if (num_movable &&
- (PageLRU(page) || __PageMovable(page)))
- (*num_movable)++;
pfn++;
continue;
}
@@ -1633,17 +1624,16 @@ static int move_freepages(struct zone *zone,
return pages_moved;
}
-int move_freepages_block(struct zone *zone, struct page *page,
- int migratetype, int *num_movable)
+static bool prep_move_freepages_block(struct zone *zone, struct page *page,
+ unsigned long *start_pfn,
+ unsigned long *end_pfn,
+ int *num_free, int *num_movable)
{
- unsigned long start_pfn, end_pfn, pfn;
-
- if (num_movable)
- *num_movable = 0;
+ unsigned long pfn, start, end;
pfn = page_to_pfn(page);
- start_pfn = pageblock_start_pfn(pfn);
- end_pfn = pageblock_end_pfn(pfn) - 1;
+ start = pageblock_start_pfn(pfn);
+ end = pageblock_end_pfn(pfn) - 1;
/*
* The caller only has the lock for @zone, don't touch ranges
@@ -1652,13 +1642,50 @@ int move_freepages_block(struct zone *zone, struct page *page,
* accompanied by other operations such as migratetype updates
* which also should be locked.
*/
- if (!zone_spans_pfn(zone, start_pfn))
- return 0;
- if (!zone_spans_pfn(zone, end_pfn))
- return 0;
+ if (!zone_spans_pfn(zone, start))
+ return false;
+ if (!zone_spans_pfn(zone, end))
+ return false;
+
+ *start_pfn = start;
+ *end_pfn = end;
+
+ if (num_free) {
+ *num_free = 0;
+ *num_movable = 0;
+ for (pfn = start; pfn <= end;) {
+ page = pfn_to_page(pfn);
+ if (PageBuddy(page)) {
+ int nr = 1 << buddy_order(page);
+
+ *num_free += nr;
+ pfn += nr;
+ continue;
+ }
+ /*
+ * We assume that pages that could be isolated for
+ * migration are movable. But we don't actually try
+ * isolating, as that would be expensive.
+ */
+ if (PageLRU(page) || __PageMovable(page))
+ (*num_movable)++;
+ pfn++;
+ }
+ }
+
+ return true;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page,
+ int migratetype)
+{
+ unsigned long start_pfn, end_pfn;
+
+ if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
+ NULL, NULL))
+ return -1;
- return move_freepages(zone, start_pfn, end_pfn, migratetype,
- num_movable);
+ return move_freepages(zone, start_pfn, end_pfn, migratetype);
}
static void change_pageblock_range(struct page *pageblock_page,
@@ -1743,33 +1770,37 @@ static inline bool boost_watermark(struct zone *zone)
}
/*
- * This function implements actual steal behaviour. If order is large enough,
- * we can steal whole pageblock. If not, we first move freepages in this
- * pageblock to our migratetype and determine how many already-allocated pages
- * are there in the pageblock with a compatible migratetype. If at least half
- * of pages are free or compatible, we can change migratetype of the pageblock
- * itself, so pages freed in the future will be put on the correct free list.
+ * This function implements actual steal behaviour. If order is large enough, we
+ * can claim the whole pageblock for the requested migratetype. If not, we check
+ * the pageblock for constituent pages; if at least half of the pages are free
+ * or compatible, we can still claim the whole block, so pages freed in the
+ * future will be put on the correct free list. Otherwise, we isolate exactly
+ * the order we need from the fallback block and leave its migratetype alone.
*/
-static void steal_suitable_fallback(struct zone *zone, struct page *page,
- unsigned int alloc_flags, int start_type, bool whole_block)
+static struct page *
+steal_suitable_fallback(struct zone *zone, struct page *page,
+ int current_order, int order, int start_type,
+ unsigned int alloc_flags, bool whole_block)
{
- unsigned int current_order = buddy_order(page);
int free_pages, movable_pages, alike_pages;
- int old_block_type;
+ unsigned long start_pfn, end_pfn;
+ int block_type;
- old_block_type = get_pageblock_migratetype(page);
+ block_type = get_pageblock_migratetype(page);
/*
* This can happen due to races and we want to prevent broken
* highatomic accounting.
*/
- if (is_migrate_highatomic(old_block_type))
+ if (is_migrate_highatomic(block_type))
goto single_page;
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order) {
+ del_page_from_free_list(page, zone, current_order);
change_pageblock_range(page, current_order, start_type);
- goto single_page;
+ expand(zone, page, order, current_order, start_type);
+ return page;
}
/*
@@ -1784,10 +1815,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
if (!whole_block)
goto single_page;
- free_pages = move_freepages_block(zone, page, start_type,
- &movable_pages);
/* moving whole block can fail due to zone boundary conditions */
- if (!free_pages)
+ if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
+ &free_pages, &movable_pages))
goto single_page;
/*
@@ -1805,7 +1835,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
* vice versa, be conservative since we can't distinguish the
* exact migratetype of non-movable pages.
*/
- if (old_block_type == MIGRATE_MOVABLE)
+ if (block_type == MIGRATE_MOVABLE)
alike_pages = pageblock_nr_pages
- (free_pages + movable_pages);
else
@@ -1816,13 +1846,16 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
* compatible migratability as our allocation, claim the whole block.
*/
if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
- page_group_by_mobility_disabled)
+ page_group_by_mobility_disabled) {
+ move_freepages(zone, start_pfn, end_pfn, start_type);
set_pageblock_migratetype(page, start_type);
-
- return;
+ return __rmqueue_smallest(zone, order, start_type);
+ }
single_page:
- move_to_free_list(page, zone, current_order, start_type);
+ del_page_from_free_list(page, zone, current_order);
+ expand(zone, page, order, current_order, block_type);
+ return page;
}
/*
@@ -1890,9 +1923,10 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
mt = get_pageblock_migratetype(page);
/* Only reserve normal pageblocks (i.e., they can merge with others) */
if (migratetype_is_mergeable(mt)) {
- zone->nr_reserved_highatomic += pageblock_nr_pages;
- set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
- move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
+ if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) {
+ set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
+ zone->nr_reserved_highatomic += pageblock_nr_pages;
+ }
}
out_unlock:
@@ -1917,7 +1951,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
struct zone *zone;
struct page *page;
int order;
- bool ret;
+ int ret;
for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
ac->nodemask) {
@@ -1966,10 +2000,14 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* of pageblocks that cannot be completely freed
* may increase.
*/
+ ret = move_freepages_block(zone, page, ac->migratetype);
+ /*
+ * Reserving this block already succeeded, so this should
+ * not fail on zone boundaries.
+ */
+ WARN_ON_ONCE(ret == -1);
set_pageblock_migratetype(page, ac->migratetype);
- ret = move_freepages_block(zone, page, ac->migratetype,
- NULL);
- if (ret) {
+ if (ret > 0) {
spin_unlock_irqrestore(&zone->lock, flags);
return ret;
}
@@ -1990,7 +2028,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* deviation from the rest of this file, to make the for loop
* condition simpler.
*/
-static __always_inline bool
+static __always_inline struct page *
__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
unsigned int alloc_flags)
{
@@ -2037,7 +2075,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
goto do_steal;
}
- return false;
+ return NULL;
find_smallest:
for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
@@ -2057,14 +2095,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
do_steal:
page = get_page_from_free_area(area, fallback_mt);
- steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
- can_steal);
+ /* take off list, maybe claim block, expand remainder */
+ page = steal_suitable_fallback(zone, page, current_order, order,
+ start_migratetype, alloc_flags, can_steal);
trace_mm_page_alloc_extfrag(page, order, current_order,
start_migratetype, fallback_mt);
- return true;
-
+ return page;
}
#ifdef CONFIG_CMA
@@ -2127,15 +2165,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
return page;
}
}
-retry:
+
page = __rmqueue_smallest(zone, order, migratetype);
if (unlikely(!page)) {
if (alloc_flags & ALLOC_CMA)
page = __rmqueue_cma_fallback(zone, order);
-
- if (!page && __rmqueue_fallback(zone, order, migratetype,
- alloc_flags))
- goto retry;
+ else
+ page = __rmqueue_fallback(zone, order, migratetype,
+ alloc_flags);
}
return page;
}
@@ -2689,12 +2726,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
* Only change normal pageblocks (i.e., they can merge
* with others)
*/
- if (migratetype_is_mergeable(mt)) {
- set_pageblock_migratetype(page,
- MIGRATE_MOVABLE);
- move_freepages_block(zone, page,
- MIGRATE_MOVABLE, NULL);
- }
+ if (migratetype_is_mergeable(mt) &&
+ move_freepages_block(zone, page,
+ MIGRATE_MOVABLE) != -1)
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
}
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index a5c8fa4c2a75..71539d7b96cf 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -178,15 +178,18 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
migratetype, isol_flags);
if (!unmovable) {
- unsigned long nr_pages;
+ int nr_pages;
int mt = get_pageblock_migratetype(page);
+ nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
+ /* Block spans zone boundaries? */
+ if (nr_pages == -1) {
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return -EBUSY;
+ }
+ __mod_zone_freepage_state(zone, -nr_pages, mt);
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
zone->nr_isolate_pageblock++;
- nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
- NULL);
-
- __mod_zone_freepage_state(zone, -nr_pages, mt);
spin_unlock_irqrestore(&zone->lock, flags);
return 0;
}
@@ -206,7 +209,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
static void unset_migratetype_isolate(struct page *page, int migratetype)
{
struct zone *zone;
- unsigned long flags, nr_pages;
+ unsigned long flags;
bool isolated_page = false;
unsigned int order;
struct page *buddy;
@@ -252,7 +255,12 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
* allocation.
*/
if (!isolated_page) {
- nr_pages = move_freepages_block(zone, page, migratetype, NULL);
+ int nr_pages = move_freepages_block(zone, page, migratetype);
+ /*
+ * Isolating this block already succeeded, so this
+ * should not fail on zone boundaries.
+ */
+ WARN_ON_ONCE(nr_pages == -1);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
}
set_pageblock_migratetype(page, migratetype);
--
2.44.0
From: Zi Yan <[email protected]>
This avoids changing migratetype after move_freepages() or
move_freepages_block(), which is error prone. It also prepares for
upcoming changes to fix move_freepages() not moving free pages
partially in the range.
Signed-off-by: Zi Yan <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 27 +++++++++++++--------------
mm/page_isolation.c | 7 +++----
2 files changed, 16 insertions(+), 18 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f65b565eaad..d687f27d891f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1581,9 +1581,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
#endif
/*
- * Move the free pages in a range to the freelist tail of the requested type.
- * Note that start_page and end_pages are not aligned on a pageblock
- * boundary. If alignment is required, use move_freepages_block()
+ * Change the type of a block and move all its free pages to that
+ * type's freelist.
*/
static int move_freepages(struct zone *zone, unsigned long start_pfn,
unsigned long end_pfn, int migratetype)
@@ -1593,6 +1592,9 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
unsigned int order;
int pages_moved = 0;
+ VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
+ VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+
for (pfn = start_pfn; pfn <= end_pfn;) {
page = pfn_to_page(pfn);
if (!PageBuddy(page)) {
@@ -1610,6 +1612,8 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
pages_moved += 1 << order;
}
+ set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype);
+
return pages_moved;
}
@@ -1837,7 +1841,6 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
page_group_by_mobility_disabled) {
move_freepages(zone, start_pfn, end_pfn, start_type);
- set_pageblock_migratetype(page, start_type);
return __rmqueue_smallest(zone, order, start_type);
}
@@ -1911,12 +1914,10 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
/* Yoink! */
mt = get_pageblock_migratetype(page);
/* Only reserve normal pageblocks (i.e., they can merge with others) */
- if (migratetype_is_mergeable(mt)) {
- if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) {
- set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
+ if (migratetype_is_mergeable(mt))
+ if (move_freepages_block(zone, page,
+ MIGRATE_HIGHATOMIC) != -1)
zone->nr_reserved_highatomic += pageblock_nr_pages;
- }
- }
out_unlock:
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1995,7 +1996,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* not fail on zone boundaries.
*/
WARN_ON_ONCE(ret == -1);
- set_pageblock_migratetype(page, ac->migratetype);
if (ret > 0) {
spin_unlock_irqrestore(&zone->lock, flags);
return ret;
@@ -2711,10 +2711,9 @@ int __isolate_free_page(struct page *page, unsigned int order)
* Only change normal pageblocks (i.e., they can merge
* with others)
*/
- if (migratetype_is_mergeable(mt) &&
- move_freepages_block(zone, page,
- MIGRATE_MOVABLE) != -1)
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ if (migratetype_is_mergeable(mt))
+ move_freepages_block(zone, page,
+ MIGRATE_MOVABLE);
}
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 71539d7b96cf..f84f0981b2df 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -188,7 +188,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
return -EBUSY;
}
__mod_zone_freepage_state(zone, -nr_pages, mt);
- set_pageblock_migratetype(page, MIGRATE_ISOLATE);
zone->nr_isolate_pageblock++;
spin_unlock_irqrestore(&zone->lock, flags);
return 0;
@@ -262,10 +261,10 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
*/
WARN_ON_ONCE(nr_pages == -1);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
- }
- set_pageblock_migratetype(page, migratetype);
- if (isolated_page)
+ } else {
+ set_pageblock_migratetype(page, migratetype);
__putback_isolated_page(page, order, migratetype);
+ }
zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
--
2.44.0
Page isolation currently sets MIGRATE_ISOLATE on a block, then drops
zone->lock and scans the block for straddling buddies to split
up. Because this happens non-atomically wrt the page allocator, it's
possible for allocations to get a buddy whose first block is a regular
pcp migratetype but whose tail is isolated. This means that in certain
cases memory can still be allocated after isolation. It will also
trigger the freelist type hygiene warnings in subsequent patches.
start_isolate_page_range()
  isolate_single_pageblock()
    set_migratetype_isolate(tail)
      lock zone->lock
      move_freepages_block(tail) // nop
      set_pageblock_migratetype(tail)
      unlock zone->lock
                                        __rmqueue_smallest()
                                          del_page_from_freelist(head)
                                          expand(head, head_mt)
                                          WARN(head_mt != tail_mt)
    start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
    for (pfn = start_pfn, pfn < end_pfn)
      if (PageBuddy())
        split_free_page(head)
Introduce a variant of move_freepages_block() provided by the
allocator specifically for page isolation; it moves free pages,
converts the block, and handles the splitting of straddling buddies
while holding zone->lock.
The allocator knows that pageblocks and buddies are always naturally
aligned, which means that buddies can only straddle blocks if they're
actually >pageblock_order. This means the search-and-split part can be
simplified compared to what page isolation used to do.
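As a sanity check of that alignment argument, here is a small
standalone userspace model (an illustrative sketch, not kernel code;
pageblock_order is assumed to be 9): a buddy of order o always starts
at a pfn aligned to (1 << o), so no buddy of order <= pageblock_order
can cross a pageblock boundary.

	#include <assert.h>

	int main(void)
	{
		const unsigned long pageblock_order = 9;	/* assumed */
		const unsigned long pageblock_nr_pages = 1UL << pageblock_order;

		for (unsigned long order = 0; order <= pageblock_order; order++) {
			/* walk every aligned buddy start pfn of this order */
			for (unsigned long pfn = 0; pfn < 4 * pageblock_nr_pages;
			     pfn += 1UL << order) {
				unsigned long last = pfn + (1UL << order) - 1;

				/* first and last pfn are in the same pageblock */
				assert(pfn / pageblock_nr_pages ==
				       last / pageblock_nr_pages);
			}
		}
		return 0;
	}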
Also tighten up the page isolation code around the expectations of
which pages can be large, and how they are freed.
Based on extensive discussions with and invaluable input from Zi Yan.
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/page-isolation.h | 4 +-
mm/internal.h | 4 -
mm/page_alloc.c | 200 +++++++++++++++++++--------------
mm/page_isolation.c | 106 ++++++-----------
4 files changed, 151 insertions(+), 163 deletions(-)
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 8550b3c91480..c16db0067090 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,7 +34,9 @@ static inline bool is_migrate_isolate(int migratetype)
#define REPORT_FAILURE 0x2
void set_pageblock_migratetype(struct page *page, int migratetype);
-int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
+
+bool move_freepages_block_isolate(struct zone *zone, struct page *page,
+ int migratetype);
int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/internal.h b/mm/internal.h
index f8b31234c130..d6e6c7d9f04e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -559,10 +559,6 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
unsigned long, enum meminit_context, struct vmem_altmap *, int);
-
-int split_free_page(struct page *free_page,
- unsigned int order, unsigned long split_pfn_offset);
-
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d687f27d891f..efb2581ac142 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -832,64 +832,6 @@ static inline void __free_one_page(struct page *page,
page_reporting_notify_free(order);
}
-/**
- * split_free_page() -- split a free page at split_pfn_offset
- * @free_page: the original free page
- * @order: the order of the page
- * @split_pfn_offset: split offset within the page
- *
- * Return -ENOENT if the free page is changed, otherwise 0
- *
- * It is used when the free page crosses two pageblocks with different migratetypes
- * at split_pfn_offset within the page. The split free page will be put into
- * separate migratetype lists afterwards. Otherwise, the function achieves
- * nothing.
- */
-int split_free_page(struct page *free_page,
- unsigned int order, unsigned long split_pfn_offset)
-{
- struct zone *zone = page_zone(free_page);
- unsigned long free_page_pfn = page_to_pfn(free_page);
- unsigned long pfn;
- unsigned long flags;
- int free_page_order;
- int mt;
- int ret = 0;
-
- if (split_pfn_offset == 0)
- return ret;
-
- spin_lock_irqsave(&zone->lock, flags);
-
- if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
- ret = -ENOENT;
- goto out;
- }
-
- mt = get_pfnblock_migratetype(free_page, free_page_pfn);
- if (likely(!is_migrate_isolate(mt)))
- __mod_zone_freepage_state(zone, -(1UL << order), mt);
-
- del_page_from_free_list(free_page, zone, order);
- for (pfn = free_page_pfn;
- pfn < free_page_pfn + (1UL << order);) {
- int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
-
- free_page_order = min_t(unsigned int,
- pfn ? __ffs(pfn) : order,
- __fls(split_pfn_offset));
- __free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
- mt, FPI_NONE);
- pfn += 1UL << free_page_order;
- split_pfn_offset -= (1UL << free_page_order);
- /* we have done the first part, now switch to second part */
- if (split_pfn_offset == 0)
- split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
- }
-out:
- spin_unlock_irqrestore(&zone->lock, flags);
- return ret;
-}
/*
* A bad page could be due to a number of fields. Instead of multiple branches,
* try and check multiple fields with one check. The caller must do a detailed
@@ -1669,8 +1611,8 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
return true;
}
-int move_freepages_block(struct zone *zone, struct page *page,
- int migratetype)
+static int move_freepages_block(struct zone *zone, struct page *page,
+ int migratetype)
{
unsigned long start_pfn, end_pfn;
@@ -1681,6 +1623,119 @@ int move_freepages_block(struct zone *zone, struct page *page,
return move_freepages(zone, start_pfn, end_pfn, migratetype);
}
+#ifdef CONFIG_MEMORY_ISOLATION
+/* Look for a buddy that straddles start_pfn */
+static unsigned long find_large_buddy(unsigned long start_pfn)
+{
+ int order = 0;
+ struct page *page;
+ unsigned long pfn = start_pfn;
+
+ while (!PageBuddy(page = pfn_to_page(pfn))) {
+ /* Nothing found */
+ if (++order > MAX_PAGE_ORDER)
+ return start_pfn;
+ pfn &= ~0UL << order;
+ }
+
+ /*
+ * Found a preceding buddy, but does it straddle?
+ */
+ if (pfn + (1 << buddy_order(page)) > start_pfn)
+ return pfn;
+
+ /* Nothing found */
+ return start_pfn;
+}
+
+/* Split a multi-block free page into its individual pageblocks */
+static void split_large_buddy(struct zone *zone, struct page *page,
+ unsigned long pfn, int order)
+{
+ unsigned long end_pfn = pfn + (1 << order);
+
+ VM_WARN_ON_ONCE(order <= pageblock_order);
+ VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1));
+
+ /* Caller removed page from freelist, buddy info cleared! */
+ VM_WARN_ON_ONCE(PageBuddy(page));
+
+ while (pfn != end_pfn) {
+ int mt = get_pfnblock_migratetype(page, pfn);
+
+ __free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE);
+ pfn += pageblock_nr_pages;
+ page = pfn_to_page(pfn);
+ }
+}
+
+/**
+ * move_freepages_block_isolate - move free pages in block for page isolation
+ * @zone: the zone
+ * @page: the pageblock page
+ * @migratetype: migratetype to set on the pageblock
+ *
+ * This is similar to move_freepages_block(), but handles the special
+ * case encountered in page isolation, where the block of interest
+ * might be part of a larger buddy spanning multiple pageblocks.
+ *
+ * Unlike the regular page allocator path, which moves pages while
+ * stealing buddies off the freelist, page isolation is interested in
+ * arbitrary pfn ranges that may have overlapping buddies on both ends.
+ *
+ * This function handles that. Straddling buddies are split into
+ * individual pageblocks. Only the block of interest is moved.
+ *
+ * Returns %true if pages could be moved, %false otherwise.
+ */
+bool move_freepages_block_isolate(struct zone *zone, struct page *page,
+ int migratetype)
+{
+ unsigned long start_pfn, end_pfn, pfn;
+ int nr_moved, mt;
+
+ if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
+ NULL, NULL))
+ return false;
+
+ /* We're a tail block in a larger buddy */
+ pfn = find_large_buddy(start_pfn);
+ if (pfn != start_pfn) {
+ struct page *buddy = pfn_to_page(pfn);
+ int order = buddy_order(buddy);
+ int mt = get_pfnblock_migratetype(buddy, pfn);
+
+ if (!is_migrate_isolate(mt))
+ __mod_zone_freepage_state(zone, -(1UL << order), mt);
+ del_page_from_free_list(buddy, zone, order);
+ set_pageblock_migratetype(page, migratetype);
+ split_large_buddy(zone, buddy, pfn, order);
+ return true;
+ }
+
+ /* We're the starting block of a larger buddy */
+ if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
+ int mt = get_pfnblock_migratetype(page, pfn);
+ int order = buddy_order(page);
+
+ if (!is_migrate_isolate(mt))
+ __mod_zone_freepage_state(zone, -(1UL << order), mt);
+ del_page_from_free_list(page, zone, order);
+ set_pageblock_migratetype(page, migratetype);
+ split_large_buddy(zone, page, pfn, order);
+ return true;
+ }
+
+ mt = get_pfnblock_migratetype(page, start_pfn);
+ nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
+ if (!is_migrate_isolate(mt))
+ __mod_zone_freepage_state(zone, -nr_moved, mt);
+ else if (!is_migrate_isolate(migratetype))
+ __mod_zone_freepage_state(zone, nr_moved, migratetype);
+ return true;
+}
+#endif /* CONFIG_MEMORY_ISOLATION */
+
static void change_pageblock_range(struct page *pageblock_page,
int start_order, int migratetype)
{
@@ -6390,7 +6445,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask)
{
unsigned long outer_start, outer_end;
- int order;
int ret = 0;
struct compact_control cc = {
@@ -6463,29 +6517,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
* We don't have to hold zone->lock here because the pages are
* isolated thus they won't get removed from buddy.
*/
-
- order = 0;
- outer_start = start;
- while (!PageBuddy(pfn_to_page(outer_start))) {
- if (++order > MAX_PAGE_ORDER) {
- outer_start = start;
- break;
- }
- outer_start &= ~0UL << order;
- }
-
- if (outer_start != start) {
- order = buddy_order(pfn_to_page(outer_start));
-
- /*
- * outer_start page could be small order buddy page and
- * it doesn't include start page. Adjust outer_start
- * in this case to report failed page properly
- * on tracepoint in test_pages_isolated()
- */
- if (outer_start + (1UL << order) <= start)
- outer_start = start;
- }
+ outer_start = find_large_buddy(start);
/* Make sure the range is really isolated. */
if (test_pages_isolated(outer_start, end, 0)) {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index f84f0981b2df..042937d5abe4 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -178,16 +178,10 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
migratetype, isol_flags);
if (!unmovable) {
- int nr_pages;
- int mt = get_pageblock_migratetype(page);
-
- nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
- /* Block spans zone boundaries? */
- if (nr_pages == -1) {
+ if (!move_freepages_block_isolate(zone, page, MIGRATE_ISOLATE)) {
spin_unlock_irqrestore(&zone->lock, flags);
return -EBUSY;
}
- __mod_zone_freepage_state(zone, -nr_pages, mt);
zone->nr_isolate_pageblock++;
spin_unlock_irqrestore(&zone->lock, flags);
return 0;
@@ -254,13 +248,11 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
* allocation.
*/
if (!isolated_page) {
- int nr_pages = move_freepages_block(zone, page, migratetype);
/*
* Isolating this block already succeeded, so this
* should not fail on zone boundaries.
*/
- WARN_ON_ONCE(nr_pages == -1);
- __mod_zone_freepage_state(zone, nr_pages, migratetype);
+ WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
} else {
set_pageblock_migratetype(page, migratetype);
__putback_isolated_page(page, order, migratetype);
@@ -374,26 +366,29 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
VM_BUG_ON(!page);
pfn = page_to_pfn(page);
- /*
- * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
- * free pages in [start_pfn, boundary_pfn), its head page will
- * always be in the range.
- */
+
if (PageBuddy(page)) {
int order = buddy_order(page);
- if (pfn + (1UL << order) > boundary_pfn) {
- /* free page changed before split, check it again */
- if (split_free_page(page, order, boundary_pfn - pfn))
- continue;
- }
+ /* move_freepages_block_isolate() handled this */
+ VM_WARN_ON_ONCE(pfn + (1 << order) > boundary_pfn);
pfn += 1UL << order;
continue;
}
+
/*
- * migrate compound pages then let the free page handling code
- * above do the rest. If migration is not possible, just fail.
+ * If a compound page is straddling our block, attempt
+ * to migrate it out of the way.
+ *
+ * We don't have to worry about this creating a large
+ * free page that straddles into our block: gigantic
+ * pages are freed as order-0 chunks, and LRU pages
+ * (currently) do not exceed pageblock_order.
+ *
+ * The block of interest has already been marked
+ * MIGRATE_ISOLATE above, so when migration is done it
+ * will free its pages onto the correct freelists.
*/
if (PageCompound(page)) {
struct page *head = compound_head(page);
@@ -404,16 +399,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
pfn = head_pfn + nr_pages;
continue;
}
+
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
- /*
- * hugetlb, lru compound (THP), and movable compound pages
- * can be migrated. Otherwise, fail the isolation.
- */
- if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
- int order;
- unsigned long outer_pfn;
+ if (PageHuge(page)) {
int page_mt = get_pageblock_migratetype(page);
- bool isolate_page = !is_migrate_isolate_page(page);
struct compact_control cc = {
.nr_migratepages = 0,
.order = -1,
@@ -426,56 +415,25 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
};
INIT_LIST_HEAD(&cc.migratepages);
- /*
- * XXX: mark the page as MIGRATE_ISOLATE so that
- * no one else can grab the freed page after migration.
- * Ideally, the page should be freed as two separate
- * pages to be added into separate migratetype free
- * lists.
- */
- if (isolate_page) {
- ret = set_migratetype_isolate(page, page_mt,
- flags, head_pfn, head_pfn + nr_pages);
- if (ret)
- goto failed;
- }
-
ret = __alloc_contig_migrate_range(&cc, head_pfn,
head_pfn + nr_pages, page_mt);
-
- /*
- * restore the page's migratetype so that it can
- * be split into separate migratetype free lists
- * later.
- */
- if (isolate_page)
- unset_migratetype_isolate(page, page_mt);
-
if (ret)
goto failed;
- /*
- * reset pfn to the head of the free page, so
- * that the free page handling code above can split
- * the free page to the right migratetype list.
- *
- * head_pfn is not used here as a hugetlb page order
- * can be bigger than MAX_PAGE_ORDER, but after it is
- * freed, the free page order is not. Use pfn within
- * the range to find the head of the free page.
- */
- order = 0;
- outer_pfn = pfn;
- while (!PageBuddy(pfn_to_page(outer_pfn))) {
- /* stop if we cannot find the free page */
- if (++order > MAX_PAGE_ORDER)
- goto failed;
- outer_pfn &= ~0UL << order;
- }
- pfn = outer_pfn;
+ pfn = head_pfn + nr_pages;
continue;
- } else
+ }
+
+ /*
+ * These pages are movable too, but they're
+ * not expected to exceed pageblock_order.
+ *
+ * Let us know when they do, so we can add
+ * proper free and split handling for them.
+ */
+ VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
+ VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page);
#endif
- goto failed;
+ goto failed;
}
pfn++;
--
2.44.0
Free page accounting currently happens a bit too high up the call
stack, where it has to deal with guard pages, compaction capturing,
block stealing and even page isolation. This is subtle and fragile,
and makes it difficult to hack on the code.
Now that type violations on the freelists have been fixed, push the
accounting down to where pages enter and leave the freelist.
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/mm.h | 18 ++--
include/linux/vmstat.h | 8 --
mm/debug_page_alloc.c | 12 +--
mm/internal.h | 5 --
mm/page_alloc.c | 194 +++++++++++++++++++++++------------------
mm/page_isolation.c | 3 +-
6 files changed, 120 insertions(+), 120 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8147b1302413..bd2e94391c7e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3781,24 +3781,22 @@ static inline bool page_is_guard(struct page *page)
return PageGuard(page);
}
-bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
- int migratetype);
+bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order);
static inline bool set_page_guard(struct zone *zone, struct page *page,
- unsigned int order, int migratetype)
+ unsigned int order)
{
if (!debug_guardpage_enabled())
return false;
- return __set_page_guard(zone, page, order, migratetype);
+ return __set_page_guard(zone, page, order);
}
-void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order,
- int migratetype);
+void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order);
static inline void clear_page_guard(struct zone *zone, struct page *page,
- unsigned int order, int migratetype)
+ unsigned int order)
{
if (!debug_guardpage_enabled())
return;
- __clear_page_guard(zone, page, order, migratetype);
+ __clear_page_guard(zone, page, order);
}
#else /* CONFIG_DEBUG_PAGEALLOC */
@@ -3808,9 +3806,9 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
static inline bool debug_guardpage_enabled(void) { return false; }
static inline bool page_is_guard(struct page *page) { return false; }
static inline bool set_page_guard(struct zone *zone, struct page *page,
- unsigned int order, int migratetype) { return false; }
+ unsigned int order) { return false; }
static inline void clear_page_guard(struct zone *zone, struct page *page,
- unsigned int order, int migratetype) {}
+ unsigned int order) {}
#endif /* CONFIG_DEBUG_PAGEALLOC */
#ifdef __HAVE_ARCH_GATE_AREA
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 343906a98d6e..735eae6e272c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -487,14 +487,6 @@ static inline void node_stat_sub_folio(struct folio *folio,
mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
}
-static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
- int migratetype)
-{
- __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
- if (is_migrate_cma(migratetype))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
-}
-
extern const char * const vmstat_text[];
static inline const char *zone_stat_name(enum zone_stat_item item)
diff --git a/mm/debug_page_alloc.c b/mm/debug_page_alloc.c
index 6755f0c9d4a3..d46acf989dde 100644
--- a/mm/debug_page_alloc.c
+++ b/mm/debug_page_alloc.c
@@ -32,8 +32,7 @@ static int __init debug_guardpage_minorder_setup(char *buf)
}
early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);
-bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
- int migratetype)
+bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order)
{
if (order >= debug_guardpage_minorder())
return false;
@@ -41,19 +40,12 @@ bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
__SetPageGuard(page);
INIT_LIST_HEAD(&page->buddy_list);
set_page_private(page, order);
- /* Guard pages are not available for any usage */
- if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, -(1 << order), migratetype);
return true;
}
-void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order,
- int migratetype)
+void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order)
{
__ClearPageGuard(page);
-
set_page_private(page, 0);
- if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, (1 << order), migratetype);
}
diff --git a/mm/internal.h b/mm/internal.h
index d6e6c7d9f04e..0a4007b03d0d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1036,11 +1036,6 @@ static inline bool is_migrate_highatomic(enum migratetype migratetype)
return migratetype == MIGRATE_HIGHATOMIC;
}
-static inline bool is_migrate_highatomic_page(struct page *page)
-{
- return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
-}
-
void setup_zone_pageset(struct zone *zone);
struct migration_target_control {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index efb2581ac142..c46491f83ac2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -642,42 +642,72 @@ compaction_capture(struct capture_control *capc, struct page *page,
}
#endif /* CONFIG_COMPACTION */
-/* Used for pages not on another list */
-static inline void add_to_free_list(struct page *page, struct zone *zone,
- unsigned int order, int migratetype)
+static inline void account_freepages(struct page *page, struct zone *zone,
+ int nr_pages, int migratetype)
{
- struct free_area *area = &zone->free_area[order];
+ if (is_migrate_isolate(migratetype))
+ return;
- list_add(&page->buddy_list, &area->free_list[migratetype]);
- area->nr_free++;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
+
+ if (is_migrate_cma(migratetype))
+ __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}
/* Used for pages not on another list */
-static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
- unsigned int order, int migratetype)
+static inline void __add_to_free_list(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype,
+ bool tail)
{
struct free_area *area = &zone->free_area[order];
- list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+ VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
+ "page type is %lu, passed migratetype is %d (nr=%d)\n",
+ get_pageblock_migratetype(page), migratetype, 1 << order);
+
+ if (tail)
+ list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+ else
+ list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
}
+static inline void add_to_free_list(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype,
+ bool tail)
+{
+ __add_to_free_list(page, zone, order, migratetype, tail);
+ account_freepages(page, zone, 1 << order, migratetype);
+}
+
/*
* Used for pages which are on another list. Move the pages to the tail
* of the list - so the moved pages won't immediately be considered for
* allocation again (e.g., optimization for memory onlining).
*/
static inline void move_to_free_list(struct page *page, struct zone *zone,
- unsigned int order, int migratetype)
+ unsigned int order, int old_mt, int new_mt)
{
struct free_area *area = &zone->free_area[order];
- list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
+ /* Free page moving can fail, so it happens before the type update */
+ VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
+ "page type is %lu, passed migratetype is %d (nr=%d)\n",
+ get_pageblock_migratetype(page), old_mt, 1 << order);
+
+ list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+
+ account_freepages(page, zone, -(1 << order), old_mt);
+ account_freepages(page, zone, 1 << order, new_mt);
}
-static inline void del_page_from_free_list(struct page *page, struct zone *zone,
- unsigned int order)
+static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype)
{
+ VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
+ "page type is %lu, passed migratetype is %d (nr=%d)\n",
+ get_pageblock_migratetype(page), migratetype, 1 << order);
+
/* clear reported state and update reported page count */
if (page_reported(page))
__ClearPageReported(page);
@@ -688,6 +718,13 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
zone->free_area[order].nr_free--;
}
+static inline void del_page_from_free_list(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype)
+{
+ __del_page_from_free_list(page, zone, order, migratetype);
+ account_freepages(page, zone, -(1 << order), migratetype);
+}
+
static inline struct page *get_page_from_free_area(struct free_area *area,
int migratetype)
{
@@ -759,18 +796,16 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
VM_BUG_ON(migratetype == -1);
- if (likely(!is_migrate_isolate(migratetype)))
- __mod_zone_freepage_state(zone, 1 << order, migratetype);
-
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
+ account_freepages(page, zone, 1 << order, migratetype);
+
while (order < MAX_PAGE_ORDER) {
- if (compaction_capture(capc, page, order, migratetype)) {
- __mod_zone_freepage_state(zone, -(1 << order),
- migratetype);
+ int buddy_mt = migratetype;
+
+ if (compaction_capture(capc, page, order, migratetype))
return;
- }
buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
if (!buddy)
@@ -783,19 +818,12 @@ static inline void __free_one_page(struct page *page,
* pageblock isolation could cause incorrect freepage or CMA
* accounting or HIGHATOMIC accounting.
*/
- int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
+ buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
- if (migratetype != buddy_mt) {
- if (!migratetype_is_mergeable(migratetype) ||
- !migratetype_is_mergeable(buddy_mt))
- goto done_merging;
- /*
- * Match buddy type. This ensures that
- * an expand() down the line puts the
- * sub-blocks on the right freelists.
- */
- set_pageblock_migratetype(buddy, migratetype);
- }
+ if (migratetype != buddy_mt &&
+ (!migratetype_is_mergeable(migratetype) ||
+ !migratetype_is_mergeable(buddy_mt)))
+ goto done_merging;
}
/*
@@ -803,9 +831,19 @@ static inline void __free_one_page(struct page *page,
* merge with it and move up one order.
*/
if (page_is_guard(buddy))
- clear_page_guard(zone, buddy, order, migratetype);
+ clear_page_guard(zone, buddy, order);
else
- del_page_from_free_list(buddy, zone, order);
+ __del_page_from_free_list(buddy, zone, order, buddy_mt);
+
+ if (unlikely(buddy_mt != migratetype)) {
+ /*
+ * Match buddy type. This ensures that an
+ * expand() down the line puts the sub-blocks
+ * on the right freelists.
+ */
+ set_pageblock_migratetype(buddy, migratetype);
+ }
+
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
@@ -822,10 +860,7 @@ static inline void __free_one_page(struct page *page,
else
to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
- if (to_tail)
- add_to_free_list_tail(page, zone, order, migratetype);
- else
- add_to_free_list(page, zone, order, migratetype);
+ __add_to_free_list(page, zone, order, migratetype, to_tail);
/* Notify page reporting subsystem of freed page */
if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
@@ -1314,10 +1349,10 @@ static inline void expand(struct zone *zone, struct page *page,
* Corresponding page table entries will not be touched,
* pages will stay not present in virtual address space
*/
- if (set_page_guard(zone, &page[size], high, migratetype))
+ if (set_page_guard(zone, &page[size], high))
continue;
- add_to_free_list(&page[size], zone, high, migratetype);
+ add_to_free_list(&page[size], zone, high, migratetype, false);
set_buddy_order(&page[size], high);
}
}
@@ -1487,7 +1522,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = get_page_from_free_area(area, migratetype);
if (!page)
continue;
- del_page_from_free_list(page, zone, current_order);
+ del_page_from_free_list(page, zone, current_order, migratetype);
expand(zone, page, order, current_order, migratetype);
trace_mm_page_alloc_zone_locked(page, order, migratetype,
pcp_allowed_order(order) &&
@@ -1527,7 +1562,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
* type's freelist.
*/
static int move_freepages(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn, int migratetype)
+ unsigned long end_pfn, int old_mt, int new_mt)
{
struct page *page;
unsigned long pfn;
@@ -1549,12 +1584,14 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
VM_BUG_ON_PAGE(page_zone(page) != zone, page);
order = buddy_order(page);
- move_to_free_list(page, zone, order, migratetype);
+
+ move_to_free_list(page, zone, order, old_mt, new_mt);
+
pfn += 1 << order;
pages_moved += 1 << order;
}
- set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype);
+ set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
return pages_moved;
}
@@ -1612,7 +1649,7 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
}
static int move_freepages_block(struct zone *zone, struct page *page,
- int migratetype)
+ int old_mt, int new_mt)
{
unsigned long start_pfn, end_pfn;
@@ -1620,7 +1657,7 @@ static int move_freepages_block(struct zone *zone, struct page *page,
NULL, NULL))
return -1;
- return move_freepages(zone, start_pfn, end_pfn, migratetype);
+ return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt);
}
#ifdef CONFIG_MEMORY_ISOLATION
@@ -1692,7 +1729,6 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
int migratetype)
{
unsigned long start_pfn, end_pfn, pfn;
- int nr_moved, mt;
if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
NULL, NULL))
@@ -1703,11 +1739,9 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
if (pfn != start_pfn) {
struct page *buddy = pfn_to_page(pfn);
int order = buddy_order(buddy);
- int mt = get_pfnblock_migratetype(buddy, pfn);
- if (!is_migrate_isolate(mt))
- __mod_zone_freepage_state(zone, -(1UL << order), mt);
- del_page_from_free_list(buddy, zone, order);
+ del_page_from_free_list(buddy, zone, order,
+ get_pfnblock_migratetype(buddy, pfn));
set_pageblock_migratetype(page, migratetype);
split_large_buddy(zone, buddy, pfn, order);
return true;
@@ -1715,23 +1749,17 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
/* We're the starting block of a larger buddy */
if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
- int mt = get_pfnblock_migratetype(page, pfn);
int order = buddy_order(page);
- if (!is_migrate_isolate(mt))
- __mod_zone_freepage_state(zone, -(1UL << order), mt);
- del_page_from_free_list(page, zone, order);
+ del_page_from_free_list(page, zone, order,
+ get_pfnblock_migratetype(page, pfn));
set_pageblock_migratetype(page, migratetype);
split_large_buddy(zone, page, pfn, order);
return true;
}
- mt = get_pfnblock_migratetype(page, start_pfn);
- nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
- if (!is_migrate_isolate(mt))
- __mod_zone_freepage_state(zone, -nr_moved, mt);
- else if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, nr_moved, migratetype);
+ move_freepages(zone, start_pfn, end_pfn,
+ get_pfnblock_migratetype(page, start_pfn), migratetype);
return true;
}
#endif /* CONFIG_MEMORY_ISOLATION */
@@ -1845,7 +1873,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order) {
- del_page_from_free_list(page, zone, current_order);
+ del_page_from_free_list(page, zone, current_order, block_type);
change_pageblock_range(page, current_order, start_type);
expand(zone, page, order, current_order, start_type);
return page;
@@ -1895,12 +1923,12 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
*/
if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
page_group_by_mobility_disabled) {
- move_freepages(zone, start_pfn, end_pfn, start_type);
+ move_freepages(zone, start_pfn, end_pfn, block_type, start_type);
return __rmqueue_smallest(zone, order, start_type);
}
single_page:
- del_page_from_free_list(page, zone, current_order);
+ del_page_from_free_list(page, zone, current_order, block_type);
expand(zone, page, order, current_order, block_type);
return page;
}
@@ -1970,7 +1998,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
mt = get_pageblock_migratetype(page);
/* Only reserve normal pageblocks (i.e., they can merge with others) */
if (migratetype_is_mergeable(mt))
- if (move_freepages_block(zone, page,
+ if (move_freepages_block(zone, page, mt,
MIGRATE_HIGHATOMIC) != -1)
zone->nr_reserved_highatomic += pageblock_nr_pages;
@@ -2011,11 +2039,13 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct free_area *area = &(zone->free_area[order]);
+ int mt;
page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
if (!page)
continue;
+ mt = get_pageblock_migratetype(page);
/*
* In page freeing path, migratetype change is racy so
* we can counter several free pages in a pageblock
@@ -2023,7 +2053,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* from highatomic to ac->migratetype. So we should
* adjust the count once.
*/
- if (is_migrate_highatomic_page(page)) {
+ if (is_migrate_highatomic(mt)) {
/*
* It should never happen but changes to
* locking could inadvertently allow a per-cpu
@@ -2045,7 +2075,8 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* of pageblocks that cannot be completely freed
* may increase.
*/
- ret = move_freepages_block(zone, page, ac->migratetype);
+ ret = move_freepages_block(zone, page, mt,
+ ac->migratetype);
/*
* Reserving this block already succeeded, so this should
* not fail on zone boundaries.
@@ -2251,12 +2282,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages are ordered properly.
*/
list_add_tail(&page->pcp_list, list);
- if (is_migrate_cma(get_pageblock_migratetype(page)))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
- -(1 << order));
}
-
- __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock_irqrestore(&zone->lock, flags);
return i;
@@ -2748,11 +2774,9 @@ int __isolate_free_page(struct page *page, unsigned int order)
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
return 0;
-
- __mod_zone_freepage_state(zone, -(1UL << order), mt);
}
- del_page_from_free_list(page, zone, order);
+ del_page_from_free_list(page, zone, order, mt);
/*
* Set the pageblock if the isolated page is at least half of a
@@ -2767,7 +2791,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
* with others)
*/
if (migratetype_is_mergeable(mt))
- move_freepages_block(zone, page,
+ move_freepages_block(zone, page, mt,
MIGRATE_MOVABLE);
}
}
@@ -2852,8 +2876,6 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
return NULL;
}
}
- __mod_zone_freepage_state(zone, -(1 << order),
- get_pageblock_migratetype(page));
spin_unlock_irqrestore(&zone->lock, flags);
} while (check_new_pages(page, order));
@@ -6737,8 +6759,9 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
BUG_ON(page_count(page));
BUG_ON(!PageBuddy(page));
+ VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
- del_page_from_free_list(page, zone, order);
+ del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
@@ -6788,10 +6811,10 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
current_buddy = page + size;
}
- if (set_page_guard(zone, current_buddy, high, migratetype))
+ if (set_page_guard(zone, current_buddy, high))
continue;
- add_to_free_list(current_buddy, zone, high, migratetype);
+ add_to_free_list(current_buddy, zone, high, migratetype, false);
set_buddy_order(current_buddy, high);
}
}
@@ -6817,12 +6840,11 @@ bool take_page_off_buddy(struct page *page)
int migratetype = get_pfnblock_migratetype(page_head,
pfn_head);
- del_page_from_free_list(page_head, zone, page_order);
+ del_page_from_free_list(page_head, zone, page_order,
+ migratetype);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
SetPageHWPoisonTakenOff(page);
- if (!is_migrate_isolate(migratetype))
- __mod_zone_freepage_state(zone, -1, migratetype);
ret = true;
break;
}
@@ -6930,7 +6952,7 @@ static bool try_to_accept_memory_one(struct zone *zone)
list_del(&page->lru);
last = list_empty(&zone->unaccepted_pages);
- __mod_zone_freepage_state(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+ account_freepages(page, zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
__mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -6982,7 +7004,7 @@ static bool __free_unaccepted(struct page *page)
spin_lock_irqsave(&zone->lock, flags);
first = list_empty(&zone->unaccepted_pages);
list_add_tail(&page->lru, &zone->unaccepted_pages);
- __mod_zone_freepage_state(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+ account_freepages(page, zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
__mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES);
spin_unlock_irqrestore(&zone->lock, flags);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 042937d5abe4..914a71c580d8 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -252,7 +252,8 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
* Isolating this block already succeeded, so this
* should not fail on zone boundaries.
*/
- WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
+ WARN_ON_ONCE(!move_freepages_block_isolate(zone, page,
+ migratetype));
} else {
set_pageblock_migratetype(page, migratetype);
__putback_isolated_page(page, order, migratetype);
--
2.44.0
Hi Johannes,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-page_alloc-remove-pcppage-migratetype-caching/20240321-020814
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240320180429.678181-10-hannes%40cmpxchg.org
patch subject: [PATCH 09/10] mm: page_isolation: prepare for hygienic freelists
config: i386-randconfig-003-20240321 (https://download.01.org/0day-ci/archive/20240321/[email protected]/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240321/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
mm/page_alloc.c: In function 'move_freepages_block_isolate':
>> mm/page_alloc.c:688:17: warning: array subscript 11 is above array bounds of 'struct free_area[11]' [-Warray-bounds]
688 | zone->free_area[order].nr_free--;
| ~~~~~~~~~~~~~~~^~~~~~~
>> mm/page_alloc.c:688:17: warning: array subscript 11 is above array bounds of 'struct free_area[11]' [-Warray-bounds]
vim +688 mm/page_alloc.c
6ab0136310961eb Alexander Duyck 2020-04-06 677
6ab0136310961eb Alexander Duyck 2020-04-06 678 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
6ab0136310961eb Alexander Duyck 2020-04-06 679 unsigned int order)
6ab0136310961eb Alexander Duyck 2020-04-06 680 {
36e66c554b5c6a9 Alexander Duyck 2020-04-06 681 /* clear reported state and update reported page count */
36e66c554b5c6a9 Alexander Duyck 2020-04-06 682 if (page_reported(page))
36e66c554b5c6a9 Alexander Duyck 2020-04-06 683 __ClearPageReported(page);
36e66c554b5c6a9 Alexander Duyck 2020-04-06 684
bf75f200569dd05 Mel Gorman 2022-06-24 685 list_del(&page->buddy_list);
6ab0136310961eb Alexander Duyck 2020-04-06 686 __ClearPageBuddy(page);
6ab0136310961eb Alexander Duyck 2020-04-06 687 set_page_private(page, 0);
6ab0136310961eb Alexander Duyck 2020-04-06 @688 zone->free_area[order].nr_free--;
6ab0136310961eb Alexander Duyck 2020-04-06 689 }
6ab0136310961eb Alexander Duyck 2020-04-06 690
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Thu, Mar 21, 2024 at 09:13:57PM +0800, kernel test robot wrote:
> Hi Johannes,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-page_alloc-remove-pcppage-migratetype-caching/20240321-020814
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20240320180429.678181-10-hannes%40cmpxchg.org
> patch subject: [PATCH 09/10] mm: page_isolation: prepare for hygienic freelists
> config: i386-randconfig-003-20240321 (https://download.01.org/0day-ci/archive/20240321/[email protected]/config)
> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240321/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All warnings (new ones prefixed by >>):
>
> mm/page_alloc.c: In function 'move_freepages_block_isolate':
> >> mm/page_alloc.c:688:17: warning: array subscript 11 is above array bounds of 'struct free_area[11]' [-Warray-bounds]
> 688 | zone->free_area[order].nr_free--;
> | ~~~~~~~~~~~~~~~^~~~~~~
> >> mm/page_alloc.c:688:17: warning: array subscript 11 is above array bounds of 'struct free_area[11]' [-Warray-bounds]
I think this is a bug in the old gcc.
We have this in move_freepages_block_isolate():
/* We're the starting block of a larger buddy */
if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
int mt = get_pfnblock_migratetype(page, pfn);
int order = buddy_order(page);
if (!is_migrate_isolate(mt))
__mod_zone_freepage_state(zone, -(1UL << order), mt);
del_page_from_free_list(page, zone, order);
And this config doesn't have hugetlb enabled, so:
/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
#define pageblock_order MAX_PAGE_ORDER
If buddies were indeed >MAX_PAGE_ORDER, this would be an out-of-bounds
access when delete updates the freelist count. Of course, buddies per
definition cannot be larger than MAX_PAGE_ORDER. But the older gcc
doesn't seem to realize this branch in this configuration is dead.
Maybe we can help it out and make the impossible scenario a bit more
explicit? Does this fixlet silence the warning?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index efb2581ac142..4cdc356e73f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1698,6 +1698,10 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
NULL, NULL))
return false;
+ /* No splits needed if buddies can't span multiple blocks */
+ if (pageblock_order == MAX_PAGE_ORDER)
+ goto move;
+
/* We're a tail block in a larger buddy */
pfn = find_large_buddy(start_pfn);
if (pfn != start_pfn) {
@@ -1725,7 +1729,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
split_large_buddy(zone, page, pfn, order);
return true;
}
-
+move:
mt = get_pfnblock_migratetype(page, start_pfn);
nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
if (!is_migrate_isolate(mt))
Zi Yan, does this look sane to you as well?
On 21 Mar 2024, at 10:24, Johannes Weiner wrote:
> On Thu, Mar 21, 2024 at 09:13:57PM +0800, kernel test robot wrote:
>> Hi Johannes,
>>
>> kernel test robot noticed the following build warnings:
>>
>> [auto build test WARNING on akpm-mm/mm-everything]
>>
>> url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-page_alloc-remove-pcppage-migratetype-caching/20240321-020814
>> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>> patch link: https://lore.kernel.org/r/20240320180429.678181-10-hannes%40cmpxchg.org
>> patch subject: [PATCH 09/10] mm: page_isolation: prepare for hygienic freelists
>> config: i386-randconfig-003-20240321 (https://download.01.org/0day-ci/archive/20240321/[email protected]/config)
>> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
>> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240321/[email protected]/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <[email protected]>
>> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>>
>> All warnings (new ones prefixed by >>):
>>
>> mm/page_alloc.c: In function 'move_freepages_block_isolate':
>>>> mm/page_alloc.c:688:17: warning: array subscript 11 is above array bounds of 'struct free_area[11]' [-Warray-bounds]
>> 688 | zone->free_area[order].nr_free--;
>> | ~~~~~~~~~~~~~~~^~~~~~~
>>>> mm/page_alloc.c:688:17: warning: array subscript 11 is above array bounds of 'struct free_area[11]' [-Warray-bounds]
>
> I think this is a bug in the old gcc.
>
> We have this in move_freepages_block_isolate():
>
> /* We're the starting block of a larger buddy */
> if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
> int mt = get_pfnblock_migratetype(page, pfn);
> int order = buddy_order(page);
>
> if (!is_migrate_isolate(mt))
> __mod_zone_freepage_state(zone, -(1UL << order), mt);
> del_page_from_free_list(page, zone, order);
>
> And this config doesn't have hugetlb enabled, so:
>
> /* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> #define pageblock_order MAX_PAGE_ORDER
>
> If buddies were indeed >MAX_PAGE_ORDER, this would be an out-of-bounds
> access when delete updates the freelist count. Of course, buddies per
> definition cannot be larger than MAX_PAGE_ORDER. But the older gcc
> doesn't seem to realize this branch in this configuration is dead.
>
> Maybe we can help it out and make the impossible scenario a bit more
> explicit? Does this fixlet silence the warning?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index efb2581ac142..4cdc356e73f6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1698,6 +1698,10 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
> NULL, NULL))
> return false;
>
> + /* No splits needed if buddies can't span multiple blocks */
> + if (pageblock_order == MAX_PAGE_ORDER)
> + goto move;
> +
> /* We're a tail block in a larger buddy */
> pfn = find_large_buddy(start_pfn);
> if (pfn != start_pfn) {
> @@ -1725,7 +1729,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
> split_large_buddy(zone, page, pfn, order);
> return true;
> }
> -
> +move:
> mt = get_pfnblock_migratetype(page, start_pfn);
> nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
> if (!is_migrate_isolate(mt))
>
> Zi Yan, does this look sane to you as well?
Yes. This and patch 9 look good to me. Thanks.
For both, Reviewed-by: Zi Yan <[email protected]>
I also tested them locally and confirm it is a gcc bug and this fix
works for gcc-9.3:
1. gcc-13.2 does not give any warning for the original patch 9
2. gcc-9.3 gives the warning for the original patch, but the warning goes
away with this patch applied.
--
Best Regards,
Yan, Zi
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> Move direct freeing of isolated pages to the lock-breaking block in
> the second loop. This saves an unnecessary migratetype reassessment.
>
> Minor comment and local variable scoping cleanups.
>
> Suggested-by: Vlastimil Babka <[email protected]>
> Tested-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> When a block is partially outside the zone of the cursor page, the
> function cuts the range to the pivot page instead of the zone
> start. This can leave large parts of the block behind, which
> encourages incompatible page mixing down the line (ask for one type,
> get another), and thus long-term fragmentation.
>
> This triggers reliably on the first block in the DMA zone, whose
> start_pfn is 1. The block is stolen, but everything before the pivot
> page (which was often hundreds of pages) is left on the old list.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
> ---
> mm/page_alloc.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a1376a6fe7e4..7373329763e6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1645,9 +1645,15 @@ int move_freepages_block(struct zone *zone, struct page *page,
> start_pfn = pageblock_start_pfn(pfn);
> end_pfn = pageblock_end_pfn(pfn) - 1;
>
> - /* Do not cross zone boundaries */
> + /*
> + * The caller only has the lock for @zone, don't touch ranges
> + * that straddle into other zones. While we could move part of
> + * the range that's inside the zone, this call is usually
> + * accompanied by other operations such as migratetype updates
> + * which also should be locked.
> + */
> if (!zone_spans_pfn(zone, start_pfn))
> - start_pfn = pfn;
> + return 0;
> if (!zone_spans_pfn(zone, end_pfn))
> return 0;
>
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> Currently, page block type conversion during fallbacks, atomic
> reservations and isolation can strand various amounts of free pages on
> incorrect freelists.
>
> For example, fallback stealing moves free pages in the block to the
> new type's freelists, but then may not actually claim the block for
> that type if there aren't enough compatible pages already allocated.
>
> In all cases, free page moving might fail if the block straddles more
> than one zone, in which case no free pages are moved at all, but the
> block type is changed anyway.
>
> This is detrimental to type hygiene on the freelists. It encourages
> incompatible page mixing down the line (ask for one type, get another)
> and thus contributes to long-term fragmentation.
>
> Split the process into a proper transaction: check first if conversion
> will happen, then try to move the free pages, and only if that was
> successful convert the block to the new type.
>
> Tested-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Nit below:
> @@ -1743,33 +1770,37 @@ static inline bool boost_watermark(struct zone *zone)
> }
>
> /*
> - * This function implements actual steal behaviour. If order is large enough,
> - * we can steal whole pageblock. If not, we first move freepages in this
> - * pageblock to our migratetype and determine how many already-allocated pages
> - * are there in the pageblock with a compatible migratetype. If at least half
> - * of pages are free or compatible, we can change migratetype of the pageblock
> - * itself, so pages freed in the future will be put on the correct free list.
> + * This function implements actual steal behaviour. If order is large enough, we
> + * can claim the whole pageblock for the requested migratetype. If not, we check
> + * the pageblock for constituent pages; if at least half of the pages are free
> + * or compatible, we can still claim the whole block, so pages freed in the
> + * future will be put on the correct free list. Otherwise, we isolate exactly
> + * the order we need from the fallback block and leave its migratetype alone.
> */
> -static void steal_suitable_fallback(struct zone *zone, struct page *page,
> - unsigned int alloc_flags, int start_type, bool whole_block)
> +static struct page *
> +steal_suitable_fallback(struct zone *zone, struct page *page,
> + int current_order, int order, int start_type,
> + unsigned int alloc_flags, bool whole_block)
> {
> - unsigned int current_order = buddy_order(page);
> int free_pages, movable_pages, alike_pages;
> - int old_block_type;
> + unsigned long start_pfn, end_pfn;
> + int block_type;
>
> - old_block_type = get_pageblock_migratetype(page);
> + block_type = get_pageblock_migratetype(page);
>
> /*
> * This can happen due to races and we want to prevent broken
> * highatomic accounting.
> */
> - if (is_migrate_highatomic(old_block_type))
> + if (is_migrate_highatomic(block_type))
> goto single_page;
>
> /* Take ownership for orders >= pageblock_order */
> if (current_order >= pageblock_order) {
> + del_page_from_free_list(page, zone, current_order);
> change_pageblock_range(page, current_order, start_type);
> - goto single_page;
> + expand(zone, page, order, current_order, start_type);
> + return page;
Is the exact order here important (AFAIK it shouldn't be?), or could we just
change_pageblock_range(); block_type = start_type; goto single_page?
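I.e. roughly this (untested):

	/* Take ownership for orders >= pageblock_order */
	if (current_order >= pageblock_order) {
		change_pageblock_range(page, current_order, start_type);
		block_type = start_type;
		goto single_page;
	}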
> }
>
> /*
> @@ -1784,10 +1815,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
> if (!whole_block)
> goto single_page;
>
> - free_pages = move_freepages_block(zone, page, start_type,
> - &movable_pages);
> /* moving whole block can fail due to zone boundary conditions */
> - if (!free_pages)
> + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
> + &free_pages, &movable_pages))
> goto single_page;
>
> /*
> @@ -1805,7 +1835,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
> * vice versa, be conservative since we can't distinguish the
> * exact migratetype of non-movable pages.
> */
> - if (old_block_type == MIGRATE_MOVABLE)
> + if (block_type == MIGRATE_MOVABLE)
> alike_pages = pageblock_nr_pages
> - (free_pages + movable_pages);
> else
> @@ -1816,13 +1846,16 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
> * compatible migratability as our allocation, claim the whole block.
> */
> if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
> - page_group_by_mobility_disabled)
> + page_group_by_mobility_disabled) {
> + move_freepages(zone, start_pfn, end_pfn, start_type);
> set_pageblock_migratetype(page, start_type);
> -
> - return;
> + return __rmqueue_smallest(zone, order, start_type);
> + }
>
> single_page:
> - move_to_free_list(page, zone, current_order, start_type);
> + del_page_from_free_list(page, zone, current_order);
> + expand(zone, page, order, current_order, block_type);
> + return page;
> }
On Tue, Mar 26, 2024 at 12:28:37PM +0100, Vlastimil Babka wrote:
> On 3/20/24 7:02 PM, Johannes Weiner wrote:
> > Currently, page block type conversion during fallbacks, atomic
> > reservations and isolation can strand various amounts of free pages on
> > incorrect freelists.
> >
> > For example, fallback stealing moves free pages in the block to the
> > new type's freelists, but then may not actually claim the block for
> > that type if there aren't enough compatible pages already allocated.
> >
> > In all cases, free page moving might fail if the block straddles more
> > than one zone, in which case no free pages are moved at all, but the
> > block type is changed anyway.
> >
> > This is detrimental to type hygiene on the freelists. It encourages
> > incompatible page mixing down the line (ask for one type, get another)
> > and thus contributes to long-term fragmentation.
> >
> > Split the process into a proper transaction: check first if conversion
> > will happen, then try to move the free pages, and only if that was
> > successful convert the block to the new type.
> >
> > Tested-by: "Huang, Ying" <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Reviewed-by: Vlastimil Babka <[email protected]>
Thanks!
> > /* Take ownership for orders >= pageblock_order */
> > if (current_order >= pageblock_order) {
> > + del_page_from_free_list(page, zone, current_order);
> > change_pageblock_range(page, current_order, start_type);
> > - goto single_page;
> > + expand(zone, page, order, current_order, start_type);
> > + return page;
>
> Is the exact order here important (AFAIK it shouldn't be?), or could we just
> change_pageblock_range(); block_type = start_type; goto single_page?
At the end of the series, the delete function will do the
(type-sensitive) accounting and have a sanity check on the mt. So we
have to remove the page from the list before updating the type.
We *could* do it in this patch and revert it again 4 patches later,
but that's likely not worth the hassle to temporarily save one line.
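FWIW, by the end of the series that branch reads like this (comments added
here just for illustration):

	/* Take ownership for orders >= pageblock_order */
	if (current_order >= pageblock_order) {
		/* the delete checks and accounts the old block_type... */
		del_page_from_free_list(page, zone, current_order, block_type);
		/* ...so the type update has to come after it */
		change_pageblock_range(page, current_order, start_type);
		expand(zone, page, order, current_order, start_type);
		return page;
	}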
> > single_page:
> > - move_to_free_list(page, zone, current_order, start_type);
> > + del_page_from_free_list(page, zone, current_order);
> > + expand(zone, page, order, current_order, block_type);
> > + return page;
> > }
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> There are three freeing paths that read the page's migratetype
> optimistically before grabbing the zone lock. When this races with
> block stealing, those pages go on the wrong freelist.
>
> The paths in question are:
> - when freeing >costly orders that aren't THP
> - when freeing pages to the buddy upon pcp lock contention
> - when freeing pages that are isolated
> - when freeing pages initially during boot
> - when freeing the remainder in alloc_pages_exact()
> - when "accepting" unaccepted VM host memory before first use
> - when freeing pages during unpoisoning
>
> None of these are so hot that they would need this optimization at the
> cost of hampering defrag efforts. Especially when contrasted with the
> fact that the most common buddy freeing path - free_pcppages_bulk - is
> checking the migratetype under the zone->lock just fine.
>
> In addition, isolated pages need to look up the migratetype under the
> lock anyway, which adds branches to the locked section, and results in
> a double lookup when the pages are in fact isolated.
>
> Move the lookups into the lock.
>
> Reported-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> From: Zi Yan <[email protected]>
>
> This avoids changing migratetype after move_freepages() or
> move_freepages_block(), which is error prone. It also prepares for
> upcoming changes to fix move_freepages() not moving free pages
> partially in the range.
>
> Signed-off-by: Zi Yan <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
BTW I noticed in -mm this has R-b: Zi Yan, which is odd as he's the author. In
the subthread for patch 9/10 Zi posted a R-b for patch 9/10 and its fixup,
not this one :)
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> Page isolation currently sets MIGRATE_ISOLATE on a block, then drops
> zone->lock and scans the block for straddling buddies to split
> up. Because this happens non-atomically wrt the page allocator, it's
> possible for allocations to get a buddy whose first block is a regular
> pcp migratetype but whose tail is isolated. This means that in certain
> cases memory can still be allocated after isolation. It will also
> trigger the freelist type hygiene warnings in subsequent patches.
>
> start_isolate_page_range()
> isolate_single_pageblock()
> set_migratetype_isolate(tail)
> lock zone->lock
> move_freepages_block(tail) // nop
> set_pageblock_migratetype(tail)
> unlock zone->lock
> __rmqueue_smallest()
> del_page_from_freelist(head)
> expand(head, head_mt)
> WARN(head_mt != tail_mt)
> start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
> for (pfn = start_pfn, pfn < end_pfn)
> if (PageBuddy())
> split_free_page(head)
>
> Introduce a variant of move_freepages_block() provided by the
> allocator specifically for page isolation; it moves free pages,
> converts the block, and handles the splitting of straddling buddies
> while holding zone->lock.
>
> The allocator knows that pageblocks and buddies are always naturally
> aligned, which means that buddies can only straddle blocks if they're
> actually >pageblock_order. This means the search-and-split part can be
> simplified compared to what page isolation used to do.
>
> Also tighten up the page isolation code around the expectations of
> which pages can be large, and how they are freed.
>
> Based on extensive discussions with and invaluable input from Zi Yan.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Nice cleanup of hairy code as well!
Reviewed-by: Vlastimil Babka <[email protected]>
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> Free page accounting currently happens a bit too high up the call
> stack, where it has to deal with guard pages, compaction capturing,
> block stealing and even page isolation. This is subtle and fragile,
> and makes it difficult to hack on the code.
>
> Now that type violations on the freelists have been fixed, push the
> accounting down to where pages enter and leave the freelist.
Awesome!
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Just some nits:
> @@ -1314,10 +1349,10 @@ static inline void expand(struct zone *zone, struct page *page,
> * Corresponding page table entries will not be touched,
> * pages will stay not present in virtual address space
> */
> - if (set_page_guard(zone, &page[size], high, migratetype))
> + if (set_page_guard(zone, &page[size], high))
> continue;
>
> - add_to_free_list(&page[size], zone, high, migratetype);
> + add_to_free_list(&page[size], zone, high, migratetype, false);
This is account_freepages() in the hot loop; what if we instead used
__add_to_free_list(), summed up nr_pages and called account_freepages() once
outside of the loop?
> set_buddy_order(&page[size], high);
> }
> }
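Something like this untested sketch, reusing the helpers from this patch
(nr_added is just illustrative):

	unsigned long nr_added = 0;

	while (high > low) {
		high--;
		size >>= 1;
		VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);

		/* guard pages skip the freelist and the accounting */
		if (set_page_guard(zone, &page[size], high))
			continue;

		__add_to_free_list(&page[size], zone, high, migratetype, false);
		set_buddy_order(&page[size], high);
		nr_added += size;
	}
	account_freepages(page, zone, nr_added, migratetype);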
<snip>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 042937d5abe4..914a71c580d8 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -252,7 +252,8 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
> * Isolating this block already succeeded, so this
> * should not fail on zone boundaries.
> */
> - WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
> + WARN_ON_ONCE(!move_freepages_block_isolate(zone, page,
> + migratetype));
> } else {
> set_pageblock_migratetype(page, migratetype);
> __putback_isolated_page(page, order, migratetype);
Looks like a drive-by edit of an extra file just to adjust indentation.
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> V4:
> - fixed !pcp_order_allowed() case in free_unref_folios()
> - reworded the patch 0 changelog a bit for the git log
> - rebased to mm-everything-2024-03-19-23-01
> - runtime-tested again with various CONFIG_DEBUG_FOOs enabled
>
> ---
>
> The page allocator's mobility grouping is intended to keep unmovable
> pages separate from reclaimable/compactable ones to allow on-demand
> defragmentation for higher-order allocations and huge pages.
>
> Currently, there are several places where accidental type mixing
> occurs: an allocation asks for a page of a certain migratetype and
> receives another. This ruins pageblocks for compaction, which in turn
> makes allocating huge pages more expensive and less reliable.
>
> The series addresses those causes. The last patch adds type checks on
> all freelist movements to prevent new violations being introduced.
>
> The benefits can be seen in a mixed workload that stresses the machine
> with a memcache-type workload and a kernel build job while
> periodically attempting to allocate batches of THP. The following data
> is aggregated over 50 consecutive defconfig builds:
Great stuff. What would you say to the following on top?
----8<----
From 84f8a6d3a9e34c7ed8b438c3152d56e359a4ffb4 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Wed, 27 Mar 2024 10:19:47 +0100
Subject: [PATCH] mm: page_alloc: change move_freepages() to
__move_freepages_block()
The function is now supposed to be called only on a single pageblock and
checks start_pfn and end_pfn accordingly. Rename it to make this more
obvious and drop the end_pfn parameter which can be determined trivially
and none of the callers use it for anything else.
Also make the (now internal) end_pfn exclusive, which is more common.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/page_alloc.c | 43 ++++++++++++++++++++-----------------------
1 file changed, 20 insertions(+), 23 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34c84ef16b66..75aefbd52ef9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1566,18 +1566,18 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
* Change the type of a block and move all its free pages to that
* type's freelist.
*/
-static int move_freepages(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn, int old_mt, int new_mt)
+static int __move_freepages_block(struct zone *zone, unsigned long start_pfn,
+ int old_mt, int new_mt)
{
struct page *page;
- unsigned long pfn;
+ unsigned long pfn, end_pfn;
unsigned int order;
int pages_moved = 0;
VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
- VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+ end_pfn = pageblock_end_pfn(start_pfn);
- for (pfn = start_pfn; pfn <= end_pfn;) {
+ for (pfn = start_pfn; pfn < end_pfn;) {
page = pfn_to_page(pfn);
if (!PageBuddy(page)) {
pfn++;
@@ -1603,14 +1603,13 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
static bool prep_move_freepages_block(struct zone *zone, struct page *page,
unsigned long *start_pfn,
- unsigned long *end_pfn,
int *num_free, int *num_movable)
{
unsigned long pfn, start, end;
pfn = page_to_pfn(page);
start = pageblock_start_pfn(pfn);
- end = pageblock_end_pfn(pfn) - 1;
+ end = pageblock_end_pfn(pfn);
/*
* The caller only has the lock for @zone, don't touch ranges
@@ -1621,16 +1620,15 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
*/
if (!zone_spans_pfn(zone, start))
return false;
- if (!zone_spans_pfn(zone, end))
+ if (!zone_spans_pfn(zone, end - 1))
return false;
*start_pfn = start;
- *end_pfn = end;
if (num_free) {
*num_free = 0;
*num_movable = 0;
- for (pfn = start; pfn <= end;) {
+ for (pfn = start; pfn < end;) {
page = pfn_to_page(pfn);
if (PageBuddy(page)) {
int nr = 1 << buddy_order(page);
@@ -1656,13 +1654,12 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
static int move_freepages_block(struct zone *zone, struct page *page,
int old_mt, int new_mt)
{
- unsigned long start_pfn, end_pfn;
+ unsigned long start_pfn;
- if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
- NULL, NULL))
+ if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL))
return -1;
- return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt);
+ return __move_freepages_block(zone, start_pfn, old_mt, new_mt);
}
#ifdef CONFIG_MEMORY_ISOLATION
@@ -1733,10 +1730,9 @@ static void split_large_buddy(struct zone *zone, struct page *page,
bool move_freepages_block_isolate(struct zone *zone, struct page *page,
int migratetype)
{
- unsigned long start_pfn, end_pfn, pfn;
+ unsigned long start_pfn, pfn;
- if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
- NULL, NULL))
+ if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL))
return false;
/* No splits needed if buddies can't span multiple blocks */
@@ -1767,8 +1763,9 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
return true;
}
move:
- move_freepages(zone, start_pfn, end_pfn,
- get_pfnblock_migratetype(page, start_pfn), migratetype);
+ __move_freepages_block(zone, start_pfn,
+ get_pfnblock_migratetype(page, start_pfn),
+ migratetype);
return true;
}
#endif /* CONFIG_MEMORY_ISOLATION */
@@ -1868,7 +1865,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
unsigned int alloc_flags, bool whole_block)
{
int free_pages, movable_pages, alike_pages;
- unsigned long start_pfn, end_pfn;
+ unsigned long start_pfn;
int block_type;
block_type = get_pageblock_migratetype(page);
@@ -1901,8 +1898,8 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
goto single_page;
/* moving whole block can fail due to zone boundary conditions */
- if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
- &free_pages, &movable_pages))
+ if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages,
+ &movable_pages))
goto single_page;
/*
@@ -1932,7 +1929,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
*/
if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
page_group_by_mobility_disabled) {
- move_freepages(zone, start_pfn, end_pfn, block_type, start_type);
+ __move_freepages_block(zone, start_pfn, block_type, start_type);
return __rmqueue_smallest(zone, order, start_type);
}
--
2.44.0
On Wed, Mar 27, 2024 at 10:30:30AM +0100, Vlastimil Babka wrote:
> On 3/20/24 7:02 PM, Johannes Weiner wrote:
> > V4:
> > - fixed !pcp_order_allowed() case in free_unref_folios()
> > - reworded the patch 0 changelog a bit for the git log
> > - rebased to mm-everything-2024-03-19-23-01
> > - runtime-tested again with various CONFIG_DEBUG_FOOs enabled
> >
> > ---
> >
> > The page allocator's mobility grouping is intended to keep unmovable
> > pages separate from reclaimable/compactable ones to allow on-demand
> > defragmentation for higher-order allocations and huge pages.
> >
> > Currently, there are several places where accidental type mixing
> > occurs: an allocation asks for a page of a certain migratetype and
> > receives another. This ruins pageblocks for compaction, which in turn
> > makes allocating huge pages more expensive and less reliable.
> >
> > The series addresses those causes. The last patch adds type checks on
> > all freelist movements to prevent new violations being introduced.
> >
> > The benefits can be seen in a mixed workload that stresses the machine
> > with a memcache-type workload and a kernel build job while
> > periodically attempting to allocate batches of THP. The following data
> > is aggregated over 50 consecutive defconfig builds:
>
> Great stuff. What would you say to the following on top?
>
> ----8<----
> From 84f8a6d3a9e34c7ed8b438c3152d56e359a4ffb4 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <[email protected]>
> Date: Wed, 27 Mar 2024 10:19:47 +0100
> Subject: [PATCH] mm: page_alloc: change move_freepages() to
> __move_freepages_block()
>
> The function is now supposed to be called only on a single pageblock and
> checks start_pfn and end_pfn accordingly. Rename it to make this more
> obvious and drop the end_pfn parameter, which can be determined trivially
> and which none of the callers use for anything else.
>
> Also make the (now internal) end_pfn exclusive, which is more common.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Nice, that's better.
Acked-by: Johannes Weiner <[email protected]>
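As an aside, a stand-alone sketch of the convention the rename settles on:
the wrapper derives the pageblock from a pfn and returns -1 when the block
does not sit fully inside the zone, while the internal helper walks an
exclusive [start_pfn, end_pfn) range that it computes itself. Everything
below (the pfn helpers, the toy zone) is a simplified stand-in for
illustration, not the kernel code:

#include <stdio.h>

#define PAGEBLOCK_ORDER		9
#define PAGEBLOCK_NR_PAGES	(1UL << PAGEBLOCK_ORDER)

/* toy zone covering pfns [0, 4096) */
static unsigned long zone_start_pfn = 0, zone_end_pfn = 4096;

static unsigned long pageblock_start_pfn(unsigned long pfn)
{
	return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}

static unsigned long pageblock_end_pfn(unsigned long pfn)
{
	return pageblock_start_pfn(pfn) + PAGEBLOCK_NR_PAGES;
}

/* single-block worker: derives its own exclusive end_pfn */
static int __move_freepages_block(unsigned long start_pfn)
{
	unsigned long end_pfn = pageblock_end_pfn(start_pfn);
	int moved = 0;

	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++)
		moved++;	/* stand-in for moving one free page */

	return moved;
}

/* wrapper: resolve the block, refuse blocks straddling the zone */
static int move_freepages_block(unsigned long pfn)
{
	unsigned long start = pageblock_start_pfn(pfn);
	unsigned long end = pageblock_end_pfn(pfn);

	if (start < zone_start_pfn || end - 1 >= zone_end_pfn)
		return -1;

	return __move_freepages_block(start);
}

int main(void)
{
	printf("%d\n", move_freepages_block(1000));	/* 512: block fully inside the zone */
	printf("%d\n", move_freepages_block(5000));	/* -1: block ends past the zone */
	return 0;
}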
On Wed, Mar 27, 2024 at 09:54:01AM +0100, Vlastimil Babka wrote:
> On 3/20/24 7:02 PM, Johannes Weiner wrote:
> > Free page accounting currently happens a bit too high up the call
> > stack, where it has to deal with guard pages, compaction capturing,
> > block stealing and even page isolation. This is subtle and fragile,
> > and makes it difficult to hack on the code.
> >
> > Now that type violations on the freelists have been fixed, push the
> > accounting down to where pages enter and leave the freelist.
>
> Awesome!
>
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Reviewed-by: Vlastimil Babka <[email protected]>
>
> Just some nits:
>
> > @@ -1314,10 +1349,10 @@ static inline void expand(struct zone *zone, struct page *page,
> > * Corresponding page table entries will not be touched,
> > * pages will stay not present in virtual address space
> > */
> > - if (set_page_guard(zone, &page[size], high, migratetype))
> > + if (set_page_guard(zone, &page[size], high))
> > continue;
> >
> > - add_to_free_list(&page[size], zone, high, migratetype);
> > + add_to_free_list(&page[size], zone, high, migratetype, false);
>
> This is account_freepages() in the hot loop, what if we instead used
> __add_to_free_list(), sum up nr_pages and called account_freepages() once
> outside of the loop?
Good idea. I'll send a fixlet for that.
> > set_buddy_order(&page[size], high);
> > }
> > }
>
> <snip>
>
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 042937d5abe4..914a71c580d8 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -252,7 +252,8 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
> > * Isolating this block already succeeded, so this
> > * should not fail on zone boundaries.
> > */
> > - WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
> > + WARN_ON_ONCE(!move_freepages_block_isolate(zone, page,
> > + migratetype));
> > } else {
> > set_pageblock_migratetype(page, migratetype);
> > __putback_isolated_page(page, order, migratetype);
>
> Looks like a drive-by edit of an extra file just to adjust indentation.
Argh, yeah, I think an earlier version mucked with the signature and I
didn't undo that cleanly. I'll send a fixlet for that too.
Thanks for the review!
On 27 Mar 2024, at 5:30, Vlastimil Babka wrote:
> On 3/20/24 7:02 PM, Johannes Weiner wrote:
>> V4:
>> - fixed !pcp_order_allowed() case in free_unref_folios()
>> - reworded the patch 0 changelog a bit for the git log
>> - rebased to mm-everything-2024-03-19-23-01
>> - runtime-tested again with various CONFIG_DEBUG_FOOs enabled
>>
>> ---
>>
>> The page allocator's mobility grouping is intended to keep unmovable
>> pages separate from reclaimable/compactable ones to allow on-demand
>> defragmentation for higher-order allocations and huge pages.
>>
>> Currently, there are several places where accidental type mixing
>> occurs: an allocation asks for a page of a certain migratetype and
>> receives another. This ruins pageblocks for compaction, which in turn
>> makes allocating huge pages more expensive and less reliable.
>>
>> The series addresses those causes. The last patch adds type checks on
>> all freelist movements to prevent new violations being introduced.
>>
>> The benefits can be seen in a mixed workload that stresses the machine
>> with a memcache-type workload and a kernel build job while
>> periodically attempting to allocate batches of THP. The following data
>> is aggregated over 50 consecutive defconfig builds:
>
> Great stuff. What would you say to the following on top?
>
> ----8<----
> From 84f8a6d3a9e34c7ed8b438c3152d56e359a4ffb4 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <[email protected]>
> Date: Wed, 27 Mar 2024 10:19:47 +0100
> Subject: [PATCH] mm: page_alloc: change move_freepages() to
> __move_freepages_block()
>
> The function is now supposed to be called only on a single pageblock and
> checks start_pfn and end_pfn accordingly. Rename it to make this more
> obvious and drop the end_pfn parameter, which can be determined trivially
> and which none of the callers use for anything else.
>
> Also make the (now internal) end_pfn exclusive, which is more common.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/page_alloc.c | 43 ++++++++++++++++++++-----------------------
> 1 file changed, 20 insertions(+), 23 deletions(-)
>
This looks good to me and makes sense.
Reviewed-by: Zi Yan <[email protected]>
--
Best Regards,
Yan, Zi
From d3a0b6847f8bf1ecfa6b08c758dfa5a86cfb18b8 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Wed, 27 Mar 2024 12:28:41 -0400
Subject: [PATCH 2/3] mm: page_alloc: consolidate free page accounting fix 2
remove unused page parameter from account_freepages()
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34c84ef16b66..8987e8869f6d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -643,8 +643,8 @@ compaction_capture(struct capture_control *capc, struct page *page,
}
#endif /* CONFIG_COMPACTION */
-static inline void account_freepages(struct page *page, struct zone *zone,
- int nr_pages, int migratetype)
+static inline void account_freepages(struct zone *zone, int nr_pages,
+ int migratetype)
{
if (is_migrate_isolate(migratetype))
return;
@@ -678,7 +678,7 @@ static inline void add_to_free_list(struct page *page, struct zone *zone,
bool tail)
{
__add_to_free_list(page, zone, order, migratetype, tail);
- account_freepages(page, zone, 1 << order, migratetype);
+ account_freepages(zone, 1 << order, migratetype);
}
/*
@@ -698,8 +698,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
- account_freepages(page, zone, -(1 << order), old_mt);
- account_freepages(page, zone, 1 << order, new_mt);
+ account_freepages(zone, -(1 << order), old_mt);
+ account_freepages(zone, 1 << order, new_mt);
}
static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
@@ -723,7 +723,7 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype)
{
__del_page_from_free_list(page, zone, order, migratetype);
- account_freepages(page, zone, -(1 << order), migratetype);
+ account_freepages(zone, -(1 << order), migratetype);
}
static inline struct page *get_page_from_free_area(struct free_area *area,
@@ -800,7 +800,7 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
- account_freepages(page, zone, 1 << order, migratetype);
+ account_freepages(zone, 1 << order, migratetype);
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
@@ -6930,7 +6930,7 @@ static bool try_to_accept_memory_one(struct zone *zone)
list_del(&page->lru);
last = list_empty(&zone->unaccepted_pages);
- account_freepages(page, zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+ account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
__mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -6982,7 +6982,7 @@ static bool __free_unaccepted(struct page *page)
spin_lock_irqsave(&zone->lock, flags);
first = list_empty(&zone->unaccepted_pages);
list_add_tail(&page->lru, &zone->unaccepted_pages);
- account_freepages(page, zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+ account_freepages(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
__mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES);
spin_unlock_irqrestore(&zone->lock, flags);
--
2.44.0
On Wed, Mar 27, 2024 at 09:54:01AM +0100, Vlastimil Babka wrote:
> > @@ -1314,10 +1349,10 @@ static inline void expand(struct zone *zone, struct page *page,
> > * Corresponding page table entries will not be touched,
> > * pages will stay not present in virtual address space
> > */
> > - if (set_page_guard(zone, &page[size], high, migratetype))
> > + if (set_page_guard(zone, &page[size], high))
> > continue;
> >
> > - add_to_free_list(&page[size], zone, high, migratetype);
> > + add_to_free_list(&page[size], zone, high, migratetype, false);
>
> This is account_freepages() in the hot loop, what if we instead used
> __add_to_free_list(), sum up nr_pages and called account_freepages() once
> outside of the loop?
How about this on top of the series? Could be folded into
mm-page_alloc-consolidate-free-page-accounting, but for credit and
bisectability (just in case) I think stand-alone makes sense.
---
From 361f5df28183d85c7718fe0b579438d3d58777be Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Wed, 27 Mar 2024 12:29:25 -0400
Subject: [PATCH 3/3] mm: page_alloc: batch vmstat updates in expand()
expand() currently updates vmstat for every subpage. This is
unnecessary, since they're all of the same zone and migratetype.
Count added pages locally, then do a single vmstat update.
Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8987e8869f6d..13fe5c612fbe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1341,6 +1341,7 @@ static inline void expand(struct zone *zone, struct page *page,
int low, int high, int migratetype)
{
unsigned long size = 1 << high;
+ unsigned long nr_added = 0;
while (high > low) {
high--;
@@ -1356,9 +1357,11 @@ static inline void expand(struct zone *zone, struct page *page,
if (set_page_guard(zone, &page[size], high))
continue;
- add_to_free_list(&page[size], zone, high, migratetype, false);
+ __add_to_free_list(&page[size], zone, high, migratetype, false);
set_buddy_order(&page[size], high);
+ nr_added += size;
}
+ account_freepages(zone, nr_added, migratetype);
}
static void check_new_page_bad(struct page *page)
--
2.44.0
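A quick sanity check on the arithmetic behind the batching: absent guard
pages, the subpages queued by the loop have sizes 1 << (high - 1) down to
1 << low, so nr_added ends up as (1 << high) - (1 << low), and a single
account_freepages() call replaces high - low per-order updates. The snippet
below is plain C with made-up example numbers, only mirroring the loop
structure:

#include <stdio.h>

int main(void)
{
	unsigned int low = 0, high = 9;		/* e.g. splitting an order-9 page to order-0 */
	unsigned long size = 1UL << high;
	unsigned long nr_added = 0;
	unsigned int h = high, iterations = 0;

	while (h > low) {			/* same shape as the loop in expand() */
		h--;
		size >>= 1;
		nr_added += size;		/* one __add_to_free_list() per iteration */
		iterations++;
	}

	/* 9 list insertions, but only one vmstat update covering 511 pages */
	printf("iterations=%u nr_added=%lu expected=%lu\n",
	       iterations, nr_added, (1UL << high) - (1UL << low));
	return 0;
}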
On Wed, Mar 27, 2024 at 09:54:01AM +0100, Vlastimil Babka wrote:
> > @@ -252,7 +252,8 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
> > * Isolating this block already succeeded, so this
> > * should not fail on zone boundaries.
> > */
> > - WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
> > + WARN_ON_ONCE(!move_freepages_block_isolate(zone, page,
> > + migratetype));
> > } else {
> > set_pageblock_migratetype(page, migratetype);
> > __putback_isolated_page(page, order, migratetype);
>
> Looks like a drive-by edit of an extra file just to adjust indentation.
Andrew, can you please fold the following fixlet?
From fa8dae0b39c5acfd4c0035c5a25b706b3a105497 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Wed, 27 Mar 2024 12:19:17 -0400
Subject: [PATCH 1/3] mm: page_alloc: consolidate free page accounting fix
undo unrelated drive-by line wrap
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_isolation.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 914a71c580d8..042937d5abe4 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -252,8 +252,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
* Isolating this block already succeeded, so this
* should not fail on zone boundaries.
*/
- WARN_ON_ONCE(!move_freepages_block_isolate(zone, page,
- migratetype));
+ WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
} else {
set_pageblock_migratetype(page, migratetype);
__putback_isolated_page(page, order, migratetype);
--
2.44.0
On 3/27/24 8:01 PM, Johannes Weiner wrote:
> On Wed, Mar 27, 2024 at 09:54:01AM +0100, Vlastimil Babka wrote:
>> > @@ -1314,10 +1349,10 @@ static inline void expand(struct zone *zone, struct page *page,
>> > * Corresponding page table entries will not be touched,
>> > * pages will stay not present in virtual address space
>> > */
>> > - if (set_page_guard(zone, &page[size], high, migratetype))
>> > + if (set_page_guard(zone, &page[size], high))
>> > continue;
>> >
>> > - add_to_free_list(&page[size], zone, high, migratetype);
>> > + add_to_free_list(&page[size], zone, high, migratetype, false);
>>
>> This is account_freepages() in the hot loop, what if we instead used
>> __add_to_free_list(), sum up nr_pages and called account_freepages() once
>> outside of the loop?
>
> How about this on top of the series? Could be folded into
> mm-page_alloc-consolidate-free-page-accounting, but for credit and
> bisectability (just in case) I think stand-alone makes sense.
>
> ---
>
> From 361f5df28183d85c7718fe0b579438d3d58777be Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Wed, 27 Mar 2024 12:29:25 -0400
> Subject: [PATCH 3/3] mm: page_alloc: batch vmstat updates in expand()
>
> expand() currently updates vmstat for every subpage. This is
> unnecessary, since they're all of the same zone and migratetype.
>
> Count added pages locally, then do a single vmstat update.
>
> Suggested-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
> ---
> mm/page_alloc.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8987e8869f6d..13fe5c612fbe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1341,6 +1341,7 @@ static inline void expand(struct zone *zone, struct page *page,
> int low, int high, int migratetype)
> {
> unsigned long size = 1 << high;
> + unsigned long nr_added = 0;
>
> while (high > low) {
> high--;
> @@ -1356,9 +1357,11 @@ static inline void expand(struct zone *zone, struct page *page,
> if (set_page_guard(zone, &page[size], high))
> continue;
>
> - add_to_free_list(&page[size], zone, high, migratetype, false);
> + __add_to_free_list(&page[size], zone, high, migratetype, false);
> set_buddy_order(&page[size], high);
> + nr_added += size;
> }
> + account_freepages(zone, nr_added, migratetype);
> }
>
> static void check_new_page_bad(struct page *page)
Hi Johannes,
On 2024/3/21 02:02, Johannes Weiner wrote:
> Currently, page block type conversion during fallbacks, atomic
> reservations and isolation can strand various amounts of free pages on
> incorrect freelists.
>
> For example, fallback stealing moves free pages in the block to the
> new type's freelists, but then may not actually claim the block for
> that type if there aren't enough compatible pages already allocated.
>
> In all cases, free page moving might fail if the block straddles more
> than one zone, in which case no free pages are moved at all, but the
> block type is changed anyway.
>
> This is detrimental to type hygiene on the freelists. It encourages
> incompatible page mixing down the line (ask for one type, get another)
> and thus contributes to long-term fragmentation.
>
> Split the process into a proper transaction: check first if conversion
> will happen, then try to move the free pages, and only if that was
> successful convert the block to the new type.
>
> Tested-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/page-isolation.h | 3 +-
> mm/page_alloc.c | 175 ++++++++++++++++++++-------------
> mm/page_isolation.c | 22 +++--
> 3 files changed, 121 insertions(+), 79 deletions(-)
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 4ac34392823a..8550b3c91480 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -34,8 +34,7 @@ static inline bool is_migrate_isolate(int migratetype)
> #define REPORT_FAILURE 0x2
>
> void set_pageblock_migratetype(struct page *page, int migratetype);
> -int move_freepages_block(struct zone *zone, struct page *page,
> - int migratetype, int *num_movable);
> +int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
>
> int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> int migratetype, int flags, gfp_t gfp_flags);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7373329763e6..e7d0d4711bdd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1596,9 +1596,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
> * Note that start_page and end_pages are not aligned on a pageblock
> * boundary. If alignment is required, use move_freepages_block()
> */
> -static int move_freepages(struct zone *zone,
> - unsigned long start_pfn, unsigned long end_pfn,
> - int migratetype, int *num_movable)
> +static int move_freepages(struct zone *zone, unsigned long start_pfn,
> + unsigned long end_pfn, int migratetype)
> {
> struct page *page;
> unsigned long pfn;
> @@ -1608,14 +1607,6 @@ static int move_freepages(struct zone *zone,
> for (pfn = start_pfn; pfn <= end_pfn;) {
> page = pfn_to_page(pfn);
> if (!PageBuddy(page)) {
> - /*
> - * We assume that pages that could be isolated for
> - * migration are movable. But we don't actually try
> - * isolating, as that would be expensive.
> - */
> - if (num_movable &&
> - (PageLRU(page) || __PageMovable(page)))
> - (*num_movable)++;
> pfn++;
> continue;
> }
> @@ -1633,17 +1624,16 @@ static int move_freepages(struct zone *zone,
> return pages_moved;
> }
>
> -int move_freepages_block(struct zone *zone, struct page *page,
> - int migratetype, int *num_movable)
> +static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> + unsigned long *start_pfn,
> + unsigned long *end_pfn,
> + int *num_free, int *num_movable)
> {
> - unsigned long start_pfn, end_pfn, pfn;
> -
> - if (num_movable)
> - *num_movable = 0;
> + unsigned long pfn, start, end;
>
> pfn = page_to_pfn(page);
> - start_pfn = pageblock_start_pfn(pfn);
> - end_pfn = pageblock_end_pfn(pfn) - 1;
> + start = pageblock_start_pfn(pfn);
> + end = pageblock_end_pfn(pfn) - 1;
>
> /*
> * The caller only has the lock for @zone, don't touch ranges
> @@ -1652,13 +1642,50 @@ int move_freepages_block(struct zone *zone, struct page *page,
> * accompanied by other operations such as migratetype updates
> * which also should be locked.
> */
> - if (!zone_spans_pfn(zone, start_pfn))
> - return 0;
> - if (!zone_spans_pfn(zone, end_pfn))
> - return 0;
> + if (!zone_spans_pfn(zone, start))
> + return false;
> + if (!zone_spans_pfn(zone, end))
> + return false;
> +
> + *start_pfn = start;
> + *end_pfn = end;
> +
> + if (num_free) {
> + *num_free = 0;
> + *num_movable = 0;
> + for (pfn = start; pfn <= end;) {
> + page = pfn_to_page(pfn);
> + if (PageBuddy(page)) {
> + int nr = 1 << buddy_order(page);
> +
> + *num_free += nr;
> + pfn += nr;
> + continue;
> + }
> + /*
> + * We assume that pages that could be isolated for
> + * migration are movable. But we don't actually try
> + * isolating, as that would be expensive.
> + */
> + if (PageLRU(page) || __PageMovable(page))
> + (*num_movable)++;
> + pfn++;
> + }
> + }
> +
> + return true;
> +}
> +
> +int move_freepages_block(struct zone *zone, struct page *page,
> + int migratetype)
> +{
> + unsigned long start_pfn, end_pfn;
> +
> + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
> + NULL, NULL))
> + return -1;
>
> - return move_freepages(zone, start_pfn, end_pfn, migratetype,
> - num_movable);
> + return move_freepages(zone, start_pfn, end_pfn, migratetype);
> }
>
> static void change_pageblock_range(struct page *pageblock_page,
> @@ -1743,33 +1770,37 @@ static inline bool boost_watermark(struct zone *zone)
> }
>
> /*
> - * This function implements actual steal behaviour. If order is large enough,
> - * we can steal whole pageblock. If not, we first move freepages in this
> - * pageblock to our migratetype and determine how many already-allocated pages
> - * are there in the pageblock with a compatible migratetype. If at least half
> - * of pages are free or compatible, we can change migratetype of the pageblock
> - * itself, so pages freed in the future will be put on the correct free list.
> + * This function implements actual steal behaviour. If order is large enough, we
> + * can claim the whole pageblock for the requested migratetype. If not, we check
> + * the pageblock for constituent pages; if at least half of the pages are free
> + * or compatible, we can still claim the whole block, so pages freed in the
> + * future will be put on the correct free list. Otherwise, we isolate exactly
> + * the order we need from the fallback block and leave its migratetype alone.
> */
> -static void steal_suitable_fallback(struct zone *zone, struct page *page,
> - unsigned int alloc_flags, int start_type, bool whole_block)
> +static struct page *
> +steal_suitable_fallback(struct zone *zone, struct page *page,
> + int current_order, int order, int start_type,
> + unsigned int alloc_flags, bool whole_block)
> {
> - unsigned int current_order = buddy_order(page);
> int free_pages, movable_pages, alike_pages;
> - int old_block_type;
> + unsigned long start_pfn, end_pfn;
> + int block_type;
>
> - old_block_type = get_pageblock_migratetype(page);
> + block_type = get_pageblock_migratetype(page);
>
> /*
> * This can happen due to races and we want to prevent broken
> * highatomic accounting.
> */
> - if (is_migrate_highatomic(old_block_type))
> + if (is_migrate_highatomic(block_type))
> goto single_page;
>
> /* Take ownership for orders >= pageblock_order */
> if (current_order >= pageblock_order) {
> + del_page_from_free_list(page, zone, current_order);
> change_pageblock_range(page, current_order, start_type);
> - goto single_page;
> + expand(zone, page, order, current_order, start_type);
> + return page;
> }
>
> /*
> @@ -1784,10 +1815,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
> if (!whole_block)
> goto single_page;
>
> - free_pages = move_freepages_block(zone, page, start_type,
> - &movable_pages);
> /* moving whole block can fail due to zone boundary conditions */
> - if (!free_pages)
> + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
> + &free_pages, &movable_pages))
> goto single_page;
>
> /*
> @@ -1805,7 +1835,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
> * vice versa, be conservative since we can't distinguish the
> * exact migratetype of non-movable pages.
> */
> - if (old_block_type == MIGRATE_MOVABLE)
> + if (block_type == MIGRATE_MOVABLE)
> alike_pages = pageblock_nr_pages
> - (free_pages + movable_pages);
> else
> @@ -1816,13 +1846,16 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
> * compatible migratability as our allocation, claim the whole block.
> */
> if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
> - page_group_by_mobility_disabled)
> + page_group_by_mobility_disabled) {
> + move_freepages(zone, start_pfn, end_pfn, start_type);
> set_pageblock_migratetype(page, start_type);
> -
> - return;
> + return __rmqueue_smallest(zone, order, start_type);
> + }
>
> single_page:
> - move_to_free_list(page, zone, current_order, start_type);
> + del_page_from_free_list(page, zone, current_order);
> + expand(zone, page, order, current_order, block_type);
> + return page;
> }
>
> /*
> @@ -1890,9 +1923,10 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> mt = get_pageblock_migratetype(page);
> /* Only reserve normal pageblocks (i.e., they can merge with others) */
> if (migratetype_is_mergeable(mt)) {
> - zone->nr_reserved_highatomic += pageblock_nr_pages;
> - set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> - move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
> + if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) {
> + set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> + zone->nr_reserved_highatomic += pageblock_nr_pages;
> + }
> }
>
> out_unlock:
> @@ -1917,7 +1951,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> struct zone *zone;
> struct page *page;
> int order;
> - bool ret;
> + int ret;
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
> ac->nodemask) {
> @@ -1966,10 +2000,14 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> * of pageblocks that cannot be completely freed
> * may increase.
> */
> + ret = move_freepages_block(zone, page, ac->migratetype);
> + /*
> + * Reserving this block already succeeded, so this should
> + * not fail on zone boundaries.
> + */
> + WARN_ON_ONCE(ret == -1);
> set_pageblock_migratetype(page, ac->migratetype);
> - ret = move_freepages_block(zone, page, ac->migratetype,
> - NULL);
> - if (ret) {
> + if (ret > 0) {
> spin_unlock_irqrestore(&zone->lock, flags);
> return ret;
> }
> @@ -1990,7 +2028,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> * deviation from the rest of this file, to make the for loop
> * condition simpler.
> */
> -static __always_inline bool
> +static __always_inline struct page *
> __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> unsigned int alloc_flags)
> {
> @@ -2037,7 +2075,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> goto do_steal;
> }
>
> - return false;
> + return NULL;
>
> find_smallest:
> for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
> @@ -2057,14 +2095,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> do_steal:
> page = get_page_from_free_area(area, fallback_mt);
>
> - steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
> - can_steal);
> + /* take off list, maybe claim block, expand remainder */
> + page = steal_suitable_fallback(zone, page, current_order, order,
> + start_migratetype, alloc_flags, can_steal);
>
> trace_mm_page_alloc_extfrag(page, order, current_order,
> start_migratetype, fallback_mt);
>
> - return true;
> -
> + return page;
> }
>
> #ifdef CONFIG_CMA
> @@ -2127,15 +2165,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
> return page;
> }
> }
> -retry:
> +
> page = __rmqueue_smallest(zone, order, migratetype);
> if (unlikely(!page)) {
> if (alloc_flags & ALLOC_CMA)
> page = __rmqueue_cma_fallback(zone, order);
> -
> - if (!page && __rmqueue_fallback(zone, order, migratetype,
> - alloc_flags))
> - goto retry;
> + else
> + page = __rmqueue_fallback(zone, order, migratetype,
> + alloc_flags);
(Sorry for chiming in late.)
I have some doubts about the changes here. The original code logic was that
if the allocation for the requested 'migratetype' failed, it would first try
a CMA page allocation and then attempt to fall back to other migratetypes.
Now, if the CMA allocation fails, it returns directly. This change has caused
a regression when I ran the thpcompact benchmark, resulting in a significant
reduction in the percentage of THPs, as shown below:
thpcompact Percentage Faults Huge
K6.9-rc2 K6.9-rc2 + this patch
Percentage huge-1 78.18 ( 0.00%) 42.49 ( -45.65%)
Percentage huge-3 86.70 ( 0.00%) 35.13 ( -59.49%)
Percentage huge-5 90.26 ( 0.00%) 52.35 ( -42.00%)
Percentage huge-7 92.34 ( 0.00%) 31.84 ( -65.52%)
Percentage huge-12 91.18 ( 0.00%) 45.85 ( -49.72%)
Percentage huge-18 89.00 ( 0.00%) 29.18 ( -67.22%)
Percentage huge-24 90.52 ( 0.00%) 46.68 ( -48.43%)
Percentage huge-30 94.44 ( 0.00%) 38.35 ( -59.39%)
Percentage huge-32 93.09 ( 0.00%) 39.37 ( -57.70%)
After making the following modifications, the regression is gone.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce67dc6777fa..a7cfe65e45c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2139,7 +2139,8 @@ __rmqueue(struct zone *zone, unsigned int order,
int migratetype,
if (unlikely(!page)) {
if (alloc_flags & ALLOC_CMA)
page = __rmqueue_cma_fallback(zone, order);
- else
+
+ if (!page)
page = __rmqueue_fallback(zone, order, migratetype,
alloc_flags);
}
But I am not sure whether your original change is intentional. IIUC, we
still need to try the fallback even if the CMA allocation fails; please
correct me if I misunderstand your code. Thanks.
Hi Baolin,
On Fri, Apr 05, 2024 at 08:11:47PM +0800, Baolin Wang wrote:
> On 2024/3/21 02:02, Johannes Weiner wrote:
> > @@ -2127,15 +2165,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
> > return page;
> > }
> > }
> > -retry:
> > +
> > page = __rmqueue_smallest(zone, order, migratetype);
> > if (unlikely(!page)) {
> > if (alloc_flags & ALLOC_CMA)
> > page = __rmqueue_cma_fallback(zone, order);
> > -
> > - if (!page && __rmqueue_fallback(zone, order, migratetype,
> > - alloc_flags))
> > - goto retry;
> > + else
> > + page = __rmqueue_fallback(zone, order, migratetype,
> > + alloc_flags);
>
> (Sorry for chiming in late.)
>
> I have some doubts about the changes here. The original code logic was that
> if the allocation for the requested 'migratetype' failed, it would first try
> a CMA page allocation and then attempt to fall back to other migratetypes.
> Now, if the CMA allocation fails, it returns directly. This change has caused
> a regression when I ran the thpcompact benchmark, resulting in a significant
> reduction in the percentage of THPs, as shown below:
>
> thpcompact Percentage Faults Huge
> K6.9-rc2 K6.9-rc2 + this patch
> Percentage huge-1 78.18 ( 0.00%) 42.49 ( -45.65%)
> Percentage huge-3 86.70 ( 0.00%) 35.13 ( -59.49%)
> Percentage huge-5 90.26 ( 0.00%) 52.35 ( -42.00%)
> Percentage huge-7 92.34 ( 0.00%) 31.84 ( -65.52%)
> Percentage huge-12 91.18 ( 0.00%) 45.85 ( -49.72%)
> Percentage huge-18 89.00 ( 0.00%) 29.18 ( -67.22%)
> Percentage huge-24 90.52 ( 0.00%) 46.68 ( -48.43%)
> Percentage huge-30 94.44 ( 0.00%) 38.35 ( -59.39%)
> Percentage huge-32 93.09 ( 0.00%) 39.37 ( -57.70%)
Ouch, sorry about that! I changed that specific part around later
during development and didn't retest with CMA. I'll be sure to
re-enable it again in my config.
> After making the following modifications, the regression is gone.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ce67dc6777fa..a7cfe65e45c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2139,7 +2139,8 @@ __rmqueue(struct zone *zone, unsigned int order,
> int migratetype,
> if (unlikely(!page)) {
> if (alloc_flags & ALLOC_CMA)
> page = __rmqueue_cma_fallback(zone, order);
> - else
> +
> + if (!page)
> page = __rmqueue_fallback(zone, order, migratetype,
> alloc_flags);
> }
>
> But I am not sure whether your original change is intentional. IIUC, we
> still need to try the fallback even if the CMA allocation fails; please
> correct me if I misunderstand your code. Thanks.
No, this was accidental. I missed that CMA dependency when changing
things around for the new return type of __rmqueue_fallback(). Your
fix is good: just because the request qualifies for CMA doesn't mean
it will succeed from that region. We need the fallback for those.
Andrew, could you please pick up Baolin's change for this patch?
[[email protected]: fix allocation failures with CONFIG_CMA]
Thanks for debugging this and the fix, Baolin.
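For clarity, a stripped-down model of the corrected __rmqueue() ordering
with Baolin's fix applied. The helper names below are made-up stand-ins for
__rmqueue_smallest(), __rmqueue_cma_fallback() and __rmqueue_fallback();
only the control flow is the point: a CMA-eligible request that finds the
CMA region empty must still reach the fallback path.

#include <stdbool.h>
#include <stdio.h>

struct page { int order; };			/* toy placeholder type */

static struct page fallback_page;

/* stand-ins: preferred freelists and the CMA region are both empty here */
static struct page *try_smallest(void) { return NULL; }
static struct page *try_cma(void)      { return NULL; }
static struct page *try_fallback(void) { return &fallback_page; }

static struct page *rmqueue_model(bool alloc_cma)
{
	struct page *page = try_smallest();

	if (!page) {
		if (alloc_cma)
			page = try_cma();
		/*
		 * The point of the fix: fall back whenever there is still no
		 * page, not only when the request wasn't CMA-eligible.
		 */
		if (!page)
			page = try_fallback();
	}
	return page;
}

int main(void)
{
	/* with the fix, a CMA-eligible request still succeeds via fallback */
	printf("got a page: %s\n", rmqueue_model(true) ? "yes" : "no");
	return 0;
}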
On 2024/4/6 00:56, Johannes Weiner wrote:
> Hi Baolin,
>
> On Fri, Apr 05, 2024 at 08:11:47PM +0800, Baolin Wang wrote:
>> On 2024/3/21 02:02, Johannes Weiner wrote:
>>> @@ -2127,15 +2165,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>>> return page;
>>> }
>>> }
>>> -retry:
>>> +
>>> page = __rmqueue_smallest(zone, order, migratetype);
>>> if (unlikely(!page)) {
>>> if (alloc_flags & ALLOC_CMA)
>>> page = __rmqueue_cma_fallback(zone, order);
>>> -
>>> - if (!page && __rmqueue_fallback(zone, order, migratetype,
>>> - alloc_flags))
>>> - goto retry;
>>> + else
>>> + page = __rmqueue_fallback(zone, order, migratetype,
>>> + alloc_flags);
>>
>> (Sorry for chiming in late.)
>>
>> I have some doubts about the changes here. The original code logic was that
>> if the allocation for the requested 'migratetype' failed, it would first try
>> a CMA page allocation and then attempt to fall back to other migratetypes.
>> Now, if the CMA allocation fails, it returns directly. This change has caused
>> a regression when I ran the thpcompact benchmark, resulting in a significant
>> reduction in the percentage of THPs, as shown below:
>>
>> thpcompact Percentage Faults Huge
>> K6.9-rc2 K6.9-rc2 + this patch
>> Percentage huge-1 78.18 ( 0.00%) 42.49 ( -45.65%)
>> Percentage huge-3 86.70 ( 0.00%) 35.13 ( -59.49%)
>> Percentage huge-5 90.26 ( 0.00%) 52.35 ( -42.00%)
>> Percentage huge-7 92.34 ( 0.00%) 31.84 ( -65.52%)
>> Percentage huge-12 91.18 ( 0.00%) 45.85 ( -49.72%)
>> Percentage huge-18 89.00 ( 0.00%) 29.18 ( -67.22%)
>> Percentage huge-24 90.52 ( 0.00%) 46.68 ( -48.43%)
>> Percentage huge-30 94.44 ( 0.00%) 38.35 ( -59.39%)
>> Percentage huge-32 93.09 ( 0.00%) 39.37 ( -57.70%)
>
> Ouch, sorry about that! I changed that specific part around later
> during development and didn't retest with CMA. I'll be sure to
> re-enable it again in my config.
>
>> After making the following modifications, the regression is gone.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ce67dc6777fa..a7cfe65e45c1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2139,7 +2139,8 @@ __rmqueue(struct zone *zone, unsigned int order,
>> int migratetype,
>> if (unlikely(!page)) {
>> if (alloc_flags & ALLOC_CMA)
>> page = __rmqueue_cma_fallback(zone, order);
>> - else
>> +
>> + if (!page)
>> page = __rmqueue_fallback(zone, order, migratetype,
>> alloc_flags);
>> }
>>
>> But I am not sure whether your original change is intentional. IIUC, we
>> still need to try the fallback even if the CMA allocation fails; please
>> correct me if I misunderstand your code. Thanks.
>
> No, this was accidental. I missed that CMA dependency when changing
> things around for the new return type of __rmqueue_fallback(). Your
> fix is good: just because the request qualifies for CMA doesn't mean
> it will succeed from that region. We need the fallback for those.
OK. Thanks for the clarification.
> Andrew, could you please pick up Baolin's change for this patch?
>
> [[email protected]: fix allocation failures with CONFIG_CMA]
>
> Thanks for debugging this and the fix, Baolin.
My pleasure.
On 2024/3/21 02:02, Johannes Weiner wrote:
> Free page accounting currently happens a bit too high up the call
> stack, where it has to deal with guard pages, compaction capturing,
> block stealing and even page isolation. This is subtle and fragile,
> and makes it difficult to hack on the code.
>
> Now that type violations on the freelists have been fixed, push the
> accounting down to where pages enter and leave the freelist.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/mm.h | 18 ++--
> include/linux/vmstat.h | 8 --
> mm/debug_page_alloc.c | 12 +--
> mm/internal.h | 5 --
> mm/page_alloc.c | 194 +++++++++++++++++++++++------------------
> mm/page_isolation.c | 3 +-
> 6 files changed, 120 insertions(+), 120 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8147b1302413..bd2e94391c7e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3781,24 +3781,22 @@ static inline bool page_is_guard(struct page *page)
> return PageGuard(page);
> }
>
> -bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
> - int migratetype);
> +bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order);
> static inline bool set_page_guard(struct zone *zone, struct page *page,
> - unsigned int order, int migratetype)
> + unsigned int order)
> {
> if (!debug_guardpage_enabled())
> return false;
> - return __set_page_guard(zone, page, order, migratetype);
> + return __set_page_guard(zone, page, order);
> }
>
> -void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order,
> - int migratetype);
> +void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order);
> static inline void clear_page_guard(struct zone *zone, struct page *page,
> - unsigned int order, int migratetype)
> + unsigned int order)
> {
> if (!debug_guardpage_enabled())
> return;
> - __clear_page_guard(zone, page, order, migratetype);
> + __clear_page_guard(zone, page, order);
> }
>
> #else /* CONFIG_DEBUG_PAGEALLOC */
> @@ -3808,9 +3806,9 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
> static inline bool debug_guardpage_enabled(void) { return false; }
> static inline bool page_is_guard(struct page *page) { return false; }
> static inline bool set_page_guard(struct zone *zone, struct page *page,
> - unsigned int order, int migratetype) { return false; }
> + unsigned int order) { return false; }
> static inline void clear_page_guard(struct zone *zone, struct page *page,
> - unsigned int order, int migratetype) {}
> + unsigned int order) {}
> #endif /* CONFIG_DEBUG_PAGEALLOC */
>
> #ifdef __HAVE_ARCH_GATE_AREA
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 343906a98d6e..735eae6e272c 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -487,14 +487,6 @@ static inline void node_stat_sub_folio(struct folio *folio,
> mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
> }
>
> -static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
> - int migratetype)
> -{
> - __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
> - if (is_migrate_cma(migratetype))
> - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
> -}
> -
> extern const char * const vmstat_text[];
>
> static inline const char *zone_stat_name(enum zone_stat_item item)
> diff --git a/mm/debug_page_alloc.c b/mm/debug_page_alloc.c
> index 6755f0c9d4a3..d46acf989dde 100644
> --- a/mm/debug_page_alloc.c
> +++ b/mm/debug_page_alloc.c
> @@ -32,8 +32,7 @@ static int __init debug_guardpage_minorder_setup(char *buf)
> }
> early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);
>
> -bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
> - int migratetype)
> +bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order)
> {
> if (order >= debug_guardpage_minorder())
> return false;
> @@ -41,19 +40,12 @@ bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
> __SetPageGuard(page);
> INIT_LIST_HEAD(&page->buddy_list);
> set_page_private(page, order);
> - /* Guard pages are not available for any usage */
> - if (!is_migrate_isolate(migratetype))
> - __mod_zone_freepage_state(zone, -(1 << order), migratetype);
>
> return true;
> }
>
> -void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order,
> - int migratetype)
> +void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order)
> {
> __ClearPageGuard(page);
> -
> set_page_private(page, 0);
> - if (!is_migrate_isolate(migratetype))
> - __mod_zone_freepage_state(zone, (1 << order), migratetype);
> }
> diff --git a/mm/internal.h b/mm/internal.h
> index d6e6c7d9f04e..0a4007b03d0d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1036,11 +1036,6 @@ static inline bool is_migrate_highatomic(enum migratetype migratetype)
> return migratetype == MIGRATE_HIGHATOMIC;
> }
>
> -static inline bool is_migrate_highatomic_page(struct page *page)
> -{
> - return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
> -}
> -
> void setup_zone_pageset(struct zone *zone);
>
> struct migration_target_control {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index efb2581ac142..c46491f83ac2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -642,42 +642,72 @@ compaction_capture(struct capture_control *capc, struct page *page,
> }
> #endif /* CONFIG_COMPACTION */
>
> -/* Used for pages not on another list */
> -static inline void add_to_free_list(struct page *page, struct zone *zone,
> - unsigned int order, int migratetype)
> +static inline void account_freepages(struct page *page, struct zone *zone,
> + int nr_pages, int migratetype)
> {
> - struct free_area *area = &zone->free_area[order];
> + if (is_migrate_isolate(migratetype))
> + return;
>
> - list_add(&page->buddy_list, &area->free_list[migratetype]);
> - area->nr_free++;
> + __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
> +
> + if (is_migrate_cma(migratetype))
> + __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
> }
>
> /* Used for pages not on another list */
> -static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
> - unsigned int order, int migratetype)
> +static inline void __add_to_free_list(struct page *page, struct zone *zone,
> + unsigned int order, int migratetype,
> + bool tail)
> {
> struct free_area *area = &zone->free_area[order];
>
> - list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
> + VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
> + "page type is %lu, passed migratetype is %d (nr=%d)\n",
> + get_pageblock_migratetype(page), migratetype, 1 << order);
> +
> + if (tail)
> + list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
> + else
> + list_add(&page->buddy_list, &area->free_list[migratetype]);
> area->nr_free++;
> }
>
> +static inline void add_to_free_list(struct page *page, struct zone *zone,
> + unsigned int order, int migratetype,
> + bool tail)
> +{
> + __add_to_free_list(page, zone, order, migratetype, tail);
> + account_freepages(page, zone, 1 << order, migratetype);
> +}
> +
> /*
> * Used for pages which are on another list. Move the pages to the tail
> * of the list - so the moved pages won't immediately be considered for
> * allocation again (e.g., optimization for memory onlining).
> */
> static inline void move_to_free_list(struct page *page, struct zone *zone,
> - unsigned int order, int migratetype)
> + unsigned int order, int old_mt, int new_mt)
> {
> struct free_area *area = &zone->free_area[order];
>
> - list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
> + /* Free page moving can fail, so it happens before the type update */
> + VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
> + "page type is %lu, passed migratetype is %d (nr=%d)\n",
> + get_pageblock_migratetype(page), old_mt, 1 << order);
> +
> + list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
> +
> + account_freepages(page, zone, -(1 << order), old_mt);
> + account_freepages(page, zone, 1 << order, new_mt);
> }
>
> -static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> - unsigned int order)
> +static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
> + unsigned int order, int migratetype)
> {
> + VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
> + "page type is %lu, passed migratetype is %d (nr=%d)\n",
> + get_pageblock_migratetype(page), migratetype, 1 << order);
> +
> /* clear reported state and update reported page count */
> if (page_reported(page))
> __ClearPageReported(page);
> @@ -688,6 +718,13 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> zone->free_area[order].nr_free--;
> }
>
> +static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> + unsigned int order, int migratetype)
> +{
> + __del_page_from_free_list(page, zone, order, migratetype);
> + account_freepages(page, zone, -(1 << order), migratetype);
> +}
> +
> static inline struct page *get_page_from_free_area(struct free_area *area,
> int migratetype)
> {
> @@ -759,18 +796,16 @@ static inline void __free_one_page(struct page *page,
> VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
>
> VM_BUG_ON(migratetype == -1);
> - if (likely(!is_migrate_isolate(migratetype)))
> - __mod_zone_freepage_state(zone, 1 << order, migratetype);
> -
> VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> VM_BUG_ON_PAGE(bad_range(zone, page), page);
>
> + account_freepages(page, zone, 1 << order, migratetype);
> +
> while (order < MAX_PAGE_ORDER) {
> - if (compaction_capture(capc, page, order, migratetype)) {
> - __mod_zone_freepage_state(zone, -(1 << order),
> - migratetype);
> + int buddy_mt = migratetype;
> +
> + if (compaction_capture(capc, page, order, migratetype))
> return;
> - }
IIUC, if the released page is captured by compaction, then the statistics
for free pages should be decreased correspondingly; otherwise, there will be
a slight regression for my thpcompact benchmark.
thpcompact Percentage Faults Huge
k6.9-rc2-base base + patch10 + 2 fixes
Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
I added the fix below based on your fix 2, and with it the thpcompact huge
page percentages look good. What do you think of the fix?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8330c5c2de6b..2facf844ef84 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
- if (compaction_capture(capc, page, order, migratetype))
+ if (compaction_capture(capc, page, order, migratetype)) {
+ account_freepages(zone, -(1 << order), migratetype);
return;
+ }
buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
if (!buddy)
With my fix, the THP percentage looks better:
k6.9-rc2-base base + patch10 + 2 fixes +
my fix
Percentage huge-1 78.18 ( 0.00%) 82.83 ( 5.94%)
Percentage huge-3 86.70 ( 0.00%) 93.47 ( 7.81%)
Percentage huge-5 90.26 ( 0.00%) 94.73 ( 4.95%)
Percentage huge-7 92.34 ( 0.00%) 95.22 ( 3.12%)
Percentage huge-12 91.18 ( 0.00%) 92.40 ( 1.34%)
Percentage huge-18 89.00 ( 0.00%) 85.39 ( -4.06%)
Percentage huge-24 90.52 ( 0.00%) 94.70 ( 4.61%)
Percentage huge-30 94.44 ( 0.00%) 97.00 ( 2.71%)
Percentage huge-32 93.09 ( 0.00%) 92.87 ( -0.24%)
On 4/5/24 6:56 PM, Johannes Weiner wrote:
> Hi Baolin,
> Ouch, sorry about that! I changed that specific part around later
> during development and didn't retest with CMA. I'll be sure to
> re-enable it again in my config.
>
>> After making the following modifications, the regression is gone.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ce67dc6777fa..a7cfe65e45c1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2139,7 +2139,8 @@ __rmqueue(struct zone *zone, unsigned int order,
>> int migratetype,
>> if (unlikely(!page)) {
>> if (alloc_flags & ALLOC_CMA)
>> page = __rmqueue_cma_fallback(zone, order);
>> - else
>> +
>> + if (!page)
>> page = __rmqueue_fallback(zone, order, migratetype,
>> alloc_flags);
>> }
>>
>> But I am not sure whether your original change is intentional. IIUC, we
>> still need to try the fallback even if the CMA allocation fails; please
>> correct me if I misunderstand your code. Thanks.
>
> No, this was accidental. I missed that CMA dependency when changing
> things around for the new return type of __rmqueue_fallback(). Your
> fix is good: just because the request qualifies for CMA doesn't mean
> it will succeed from that region. We need the fallback for those.
>
> Andrew, could you please pick up Baolin's change for this patch?
>
> [[email protected]: fix allocation failures with CONFIG_CMA]
>
> Thanks for debugging this and the fix, Baolin.
Good fix indeed, didn't spot the issue during review. Thanks!
On 4/7/24 12:19 PM, Baolin Wang wrote:
> On 2024/3/21 02:02, Johannes Weiner wrote:
>>
>> + account_freepages(page, zone, 1 << order, migratetype);
>> +
>> while (order < MAX_PAGE_ORDER) {
>> - if (compaction_capture(capc, page, order, migratetype)) {
>> - __mod_zone_freepage_state(zone, -(1 << order),
>> - migratetype);
>> + int buddy_mt = migratetype;
>> +
>> + if (compaction_capture(capc, page, order, migratetype))
>> return;
>> - }
>
> IIUC, if the released page is captured by compaction, then the statistics
> for free pages should be decreased correspondingly; otherwise, there will be
> a slight regression for my thpcompact benchmark.
>
> thpcompact Percentage Faults Huge
> k6.9-rc2-base base + patch10 + 2 fixes
> Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
> Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
> Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
> Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
> Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
> Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
> Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
> Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
> Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
>
> I added the fix below based on your fix 2, and with it the thpcompact huge
> page percentages look good. What do you think of the fix?
Yeah, another one well spotted, thanks. "slight regression" is an understatement:
this affects not just "statistics" but the very important NR_FREE_PAGES
counter, which IIUC would eventually become larger than reality, make the
watermark checks falsely pass, and result in depleted reserves etc etc.
Actually wondering why we're not seeing -next failures already (or maybe I
just haven't noticed).
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8330c5c2de6b..2facf844ef84 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
> while (order < MAX_PAGE_ORDER) {
> int buddy_mt = migratetype;
>
> - if (compaction_capture(capc, page, order, migratetype))
> + if (compaction_capture(capc, page, order, migratetype)) {
> + account_freepages(zone, -(1 << order), migratetype);
> return;
> + }
>
> buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
> if (!buddy)
>
> With my fix, the THP percentage looks better:
> k6.9-rc2-base base + patch10 + 2 fixes +
> my fix
> Percentage huge-1 78.18 ( 0.00%) 82.83 ( 5.94%)
> Percentage huge-3 86.70 ( 0.00%) 93.47 ( 7.81%)
> Percentage huge-5 90.26 ( 0.00%) 94.73 ( 4.95%)
> Percentage huge-7 92.34 ( 0.00%) 95.22 ( 3.12%)
> Percentage huge-12 91.18 ( 0.00%) 92.40 ( 1.34%)
> Percentage huge-18 89.00 ( 0.00%) 85.39 ( -4.06%)
> Percentage huge-24 90.52 ( 0.00%) 94.70 ( 4.61%)
> Percentage huge-30 94.44 ( 0.00%) 97.00 ( 2.71%)
> Percentage huge-32 93.09 ( 0.00%) 92.87 ( -0.24%)
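To make that failure mode concrete: if captured pages are accounted as free
on entry to __free_one_page() but never subtracted when compaction captures
them, the counter only drifts upward, and any check based on it keeps
passing even when the freelists are actually empty. A toy model of the
drift, in plain C; the watermark check is a made-up stand-in, not the
kernel's:

#include <stdbool.h>
#include <stdio.h>

static long nr_free_pages;	/* models the zone's NR_FREE_PAGES counter */
static long really_free;	/* what actually sits on the freelists */

static void free_one_page(int order, bool captured_without_rollback)
{
	nr_free_pages += 1L << order;		/* accounted up front */
	if (captured_without_rollback)
		return;				/* bug: counter never decremented */
	really_free += 1L << order;		/* page actually reaches a freelist */
}

static bool watermark_ok(long mark)
{
	return nr_free_pages > mark;		/* made-up stand-in for the real check */
}

int main(void)
{
	/* compaction keeps capturing order-9 frees without the rollback */
	for (int i = 0; i < 8; i++)
		free_one_page(9, true);

	printf("counter=%ld actually free=%ld watermark_ok=%d\n",
	       nr_free_pages, really_free, watermark_ok(1024));
	return 0;
}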
On 2024/4/8 15:38, Vlastimil Babka wrote:
> On 4/7/24 12:19 PM, Baolin Wang wrote:
>> On 2024/3/21 02:02, Johannes Weiner wrote:
>>>
>>> + account_freepages(page, zone, 1 << order, migratetype);
>>> +
>>> while (order < MAX_PAGE_ORDER) {
>>> - if (compaction_capture(capc, page, order, migratetype)) {
>>> - __mod_zone_freepage_state(zone, -(1 << order),
>>> - migratetype);
>>> + int buddy_mt = migratetype;
>>> +
>>> + if (compaction_capture(capc, page, order, migratetype))
>>> return;
>>> - }
>>
>> IIUC, if the released page is captured by compaction, then the statistics
>> for free pages should be decreased correspondingly; otherwise, there will be
>> a slight regression for my thpcompact benchmark.
>>
>> thpcompact Percentage Faults Huge
>> k6.9-rc2-base base + patch10 + 2 fixes
>> Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
>> Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
>> Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
>> Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
>> Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
>> Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
>> Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
>> Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
>> Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
>>
>> I added the fix below based on your fix 2, and with it the thpcompact huge
>> page percentages look good. What do you think of the fix?
>
> Yeah, another one well spotted, thanks. "slight regression" is an understatement:
Thanks for helping to confirm.
> this affects not just "statistics" but the very important NR_FREE_PAGES
> counter, which IIUC would eventually become larger than reality, make the
> watermark checks falsely pass, and result in depleted reserves etc etc.
Right, agree.
> Actually wondering why we're not seeing -next failures already (or maybe I
> just haven't noticed).
>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 8330c5c2de6b..2facf844ef84 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
>> while (order < MAX_PAGE_ORDER) {
>> int buddy_mt = migratetype;
>>
>> - if (compaction_capture(capc, page, order, migratetype))
>> + if (compaction_capture(capc, page, order, migratetype)) {
>> + account_freepages(zone, -(1 << order), migratetype);
>> return;
>> + }
>>
>> buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
>> if (!buddy)
>>
>> With my fix, the THP percentage looks better:
>> k6.9-rc2-base base + patch10 + 2 fixes +
>> my fix
>> Percentage huge-1 78.18 ( 0.00%) 82.83 ( 5.94%)
>> Percentage huge-3 86.70 ( 0.00%) 93.47 ( 7.81%)
>> Percentage huge-5 90.26 ( 0.00%) 94.73 ( 4.95%)
>> Percentage huge-7 92.34 ( 0.00%) 95.22 ( 3.12%)
>> Percentage huge-12 91.18 ( 0.00%) 92.40 ( 1.34%)
>> Percentage huge-18 89.00 ( 0.00%) 85.39 ( -4.06%)
>> Percentage huge-24 90.52 ( 0.00%) 94.70 ( 4.61%)
>> Percentage huge-30 94.44 ( 0.00%) 97.00 ( 2.71%)
>> Percentage huge-32 93.09 ( 0.00%) 92.87 ( -0.24%)
On 2024/3/21 02:02, Johannes Weiner wrote:
With my 2 fixes, the whole series works well on my platform, so please
feel free to add:
Tested-by: Baolin Wang <[email protected]>
On Mon, Apr 08, 2024 at 09:38:20AM +0200, Vlastimil Babka wrote:
> On 4/7/24 12:19 PM, Baolin Wang wrote:
> > On 2024/3/21 02:02, Johannes Weiner wrote:
> >>
> >> + account_freepages(page, zone, 1 << order, migratetype);
> >> +
> >> while (order < MAX_PAGE_ORDER) {
> >> - if (compaction_capture(capc, page, order, migratetype)) {
> >> - __mod_zone_freepage_state(zone, -(1 << order),
> >> - migratetype);
> >> + int buddy_mt = migratetype;
> >> +
> >> + if (compaction_capture(capc, page, order, migratetype))
> >> return;
> >> - }
> >
> > IIUC, if the released page is captured by compaction, then the
> > statistics for free pages should be correspondingly decreased,
> > otherwise, there will be a slight regression for my thpcompact benchmark.
> >
> > thpcompact Percentage Faults Huge
> > k6.9-rc2-base base + patch10 + 2 fixes
> > Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
> > Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
> > Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
> > Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
> > Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
> > Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
> > Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
> > Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
> > Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
> >
> > I add below fix based on your fix 2, then the thpcompact Percentage
> > looks good. How do you think for the fix?
>
> Yeah another well spotted, thanks. "slight regression" is an understatement,
> this affects not just a "statistics" but very important counter
> NR_FREE_PAGES which IIUC would eventually become larger than reality, make
> the watermark checks false positive and result in depleted reserves etc etc.
> Actually wondering why we're not seeing -next failures already (or maybe I
> just haven't noticed).
Good catch indeed.
Trying to understand why I didn't notice this during testing, and I
think it's because I had order-10 pageblocks in my config. There is
this in compaction_capture():
if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
	return false;
Most compaction is for order-9 THPs on movable blocks, so I didn't get
much capturing in practice, and the leak wasn't noticeable.
In earlier versions of the patches I had more aggressive capturing, but
I also did the accounting in add_page_to/del_page_from_freelist(), so a
capture only stole pages that were already accounted as being off the
freelists.
With the __add/__del variants and the consolidated account_freepages()
optimization, compaction_capture() needs explicit accounting again.
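To make the invariant concrete, here is a rough sketch of the capture path
with the consolidated accounting, pieced together from the hunks quoted in
this thread and assuming the post-fixlet account_freepages(zone, nr_pages,
migratetype) signature; it is not the exact upstream code:
	/* The entire 1 << order chunk is accounted as free up front. */
	account_freepages(zone, 1 << order, migratetype);

	while (order < MAX_PAGE_ORDER) {
		/*
		 * A captured chunk never reaches a freelist, so the entry
		 * accounting above has to be undone here; otherwise
		 * NR_FREE_PAGES drifts above what is actually on the lists.
		 */
		if (compaction_capture(capc, page, order, migratetype)) {
			account_freepages(zone, -(1 << order), migratetype);
			return;
		}
		/* ... buddy merging and freelist insertion as before ... */
	}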
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8330c5c2de6b..2facf844ef84 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
> > while (order < MAX_PAGE_ORDER) {
> > int buddy_mt = migratetype;
> >
> > - if (compaction_capture(capc, page, order, migratetype))
> > + if (compaction_capture(capc, page, order, migratetype)) {
> > + account_freepages(zone, -(1 << order), migratetype);
> > return;
> > + }
> >
> > buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
> > if (!buddy)
> >
> > With my fix, the THP percentage looks better:
> > k6.9-rc2-base base + patch10 + 2 fixes +
> > my fix
> > Percentage huge-1 78.18 ( 0.00%) 82.83 ( 5.94%)
> > Percentage huge-3 86.70 ( 0.00%) 93.47 ( 7.81%)
> > Percentage huge-5 90.26 ( 0.00%) 94.73 ( 4.95%)
> > Percentage huge-7 92.34 ( 0.00%) 95.22 ( 3.12%)
> > Percentage huge-12 91.18 ( 0.00%) 92.40 ( 1.34%)
> > Percentage huge-18 89.00 ( 0.00%) 85.39 ( -4.06%)
> > Percentage huge-24 90.52 ( 0.00%) 94.70 ( 4.61%)
> > Percentage huge-30 94.44 ( 0.00%) 97.00 ( 2.71%)
> > Percentage huge-32 93.09 ( 0.00%) 92.87 ( -0.24%)
Acked-by: Johannes Weiner <[email protected]>
With fixed indentation:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 70f82635f650..96815e3c22f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
 	while (order < MAX_PAGE_ORDER) {
 		int buddy_mt = migratetype;
 
-		if (compaction_capture(capc, page, order, migratetype))
+		if (compaction_capture(capc, page, order, migratetype)) {
+			account_freepages(zone, -(1 << order), migratetype);
 			return;
+		}
 
 		buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
 		if (!buddy)
On Mon, Apr 08, 2024 at 05:30:04PM +0800, Baolin Wang wrote:
>
> With my 2 fixes, the whole series works well on my platform, so please
> feel free to add:
> Tested-by: Baolin Wang <[email protected]>
Very much appreciate your testing and the two fixes. Thank you!
On 4/5/24 6:56 PM, Johannes Weiner wrote:
> Hi Baolin,
>
> On Fri, Apr 05, 2024 at 08:11:47PM +0800, Baolin Wang wrote:
>> On 2024/3/21 02:02, Johannes Weiner wrote:
>> > @@ -2127,15 +2165,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>> > return page;
>> > }
>> > }
>> > -retry:
>> > +
>> > page = __rmqueue_smallest(zone, order, migratetype);
>> > if (unlikely(!page)) {
>> > if (alloc_flags & ALLOC_CMA)
>> > page = __rmqueue_cma_fallback(zone, order);
>> > -
>> > - if (!page && __rmqueue_fallback(zone, order, migratetype,
>> > - alloc_flags))
>> > - goto retry;
>> > + else
>> > + page = __rmqueue_fallback(zone, order, migratetype,
>> > + alloc_flags);
>>
>> (Sorry for chim in late.)
>>
>> I have some doubts about the changes here. The original code logic was
>> that if the 'migratetype' type allocation is failed, it would first try
>> CMA page allocation and then attempt to fallback to other migratetype
>> allocations. Now it has been changed so that if CMA allocation fails, it
>> will directly return. This change has caused a regression when I running
>> the thpcompact benchmark, resulting in a significant reduction in the
>> percentage of THPs like below:
>>
>> thpcompact Percentage Faults Huge
>> K6.9-rc2 K6.9-rc2 + this patch
>> Percentage huge-1 78.18 ( 0.00%) 42.49 ( -45.65%)
>> Percentage huge-3 86.70 ( 0.00%) 35.13 ( -59.49%)
>> Percentage huge-5 90.26 ( 0.00%) 52.35 ( -42.00%)
>> Percentage huge-7 92.34 ( 0.00%) 31.84 ( -65.52%)
>> Percentage huge-12 91.18 ( 0.00%) 45.85 ( -49.72%)
>> Percentage huge-18 89.00 ( 0.00%) 29.18 ( -67.22%)
>> Percentage huge-24 90.52 ( 0.00%) 46.68 ( -48.43%)
>> Percentage huge-30 94.44 ( 0.00%) 38.35 ( -59.39%)
>> Percentage huge-32 93.09 ( 0.00%) 39.37 ( -57.70%)
>
> Ouch, sorry about that! I changed that specific part around later
> during development and didn't retest with CMA. I'll be sure to
> re-enable it again in my config.
>
>> After making the following modifications, the regression is gone.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ce67dc6777fa..a7cfe65e45c1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2139,7 +2139,8 @@ __rmqueue(struct zone *zone, unsigned int order,
>> int migratetype,
>> if (unlikely(!page)) {
>> if (alloc_flags & ALLOC_CMA)
>> page = __rmqueue_cma_fallback(zone, order);
>> - else
>> +
>> + if (!page)
>> page = __rmqueue_fallback(zone, order, migratetype,
>> alloc_flags);
>> }
>>
>> But I am not sure your original change is intentional? IIUC, we still
>> need try fallbacking even though CMA allocation is failed, please
>> correct me if I misunderstand your code. Thanks.
>
> No, this was accidental. I missed that CMA dependency when changing
> things around for the new return type of __rmqueue_fallback(). Your
> fix is good: just because the request qualifies for CMA doesn't mean
> it will succeed from that region. We need the fallback for those.
>
> Andrew, could you please pick up Baolin's change for this patch?
>
> [[email protected]: fix allocation failures with CONFIG_CMA]
Acked-by: Vlastimil Babka <[email protected]>
> Thanks for debugging this and the fix, Baolin.
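For readers skimming the thread, the allocation-side ordering after Baolin's
CMA fix ends up roughly as below (a sketch of the relevant part of
__rmqueue(), reconstructed from the hunks quoted above; not a verbatim copy
of the tree):
	page = __rmqueue_smallest(zone, order, migratetype);
	if (unlikely(!page)) {
		/* CMA-eligible requests may dip into the CMA region first... */
		if (alloc_flags & ALLOC_CMA)
			page = __rmqueue_cma_fallback(zone, order);
		/*
		 * ...but CMA can fail too, so still fall back to other
		 * migratetypes in that case (the v4 'else' skipped this
		 * and caused the THP fault regression above).
		 */
		if (!page)
			page = __rmqueue_fallback(zone, order, migratetype,
						  alloc_flags);
	}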
On 4/8/24 4:23 PM, Johannes Weiner wrote:
> On Mon, Apr 08, 2024 at 09:38:20AM +0200, Vlastimil Babka wrote:
>> On 4/7/24 12:19 PM, Baolin Wang wrote:
>> > On 2024/3/21 02:02, Johannes Weiner wrote:
>> >>
>> >> + account_freepages(page, zone, 1 << order, migratetype);
>> >> +
>> >> while (order < MAX_PAGE_ORDER) {
>> >> - if (compaction_capture(capc, page, order, migratetype)) {
>> >> - __mod_zone_freepage_state(zone, -(1 << order),
>> >> - migratetype);
>> >> + int buddy_mt = migratetype;
>> >> +
>> >> + if (compaction_capture(capc, page, order, migratetype))
>> >> return;
>> >> - }
>> >
>> > IIUC, if the released page is captured by compaction, then the
>> > statistics for free pages should be correspondingly decreased,
>> > otherwise, there will be a slight regression for my thpcompact benchmark.
>> >
>> > thpcompact Percentage Faults Huge
>> > k6.9-rc2-base base + patch10 + 2 fixes
>> > Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
>> > Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
>> > Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
>> > Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
>> > Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
>> > Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
>> > Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
>> > Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
>> > Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
>> >
>> > I add below fix based on your fix 2, then the thpcompact Percentage
>> > looks good. How do you think for the fix?
>>
>> Yeah another well spotted, thanks. "slight regression" is an understatement,
>> this affects not just a "statistics" but very important counter
>> NR_FREE_PAGES which IIUC would eventually become larger than reality, make
>> the watermark checks false positive and result in depleted reserves etc etc.
>> Actually wondering why we're not seeing -next failures already (or maybe I
>> just haven't noticed).
>
> Good catch indeed.
>
> Trying to understand why I didn't notice this during testing, and I
> think it's because I had order-10 pageblocks in my config. There is
> this in compaction_capture():
>
> if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
> return false;
>
> Most compaction is for order-9 THPs on movable blocks, so I didn't get
> much capturing in practice in order for that leak to be noticable.
>
> In earlier versions of the patches I had more aggressive capturing,
> but also did the accounting in add_page_to/del_page_from_freelist(),
> so capturing only steals what's already accounted to be off list.
>
> With the __add/__del and the consolidated account_freepages()
> optimization, compaction_capture() needs explicit accounting again.
>
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 8330c5c2de6b..2facf844ef84 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
>> > while (order < MAX_PAGE_ORDER) {
>> > int buddy_mt = migratetype;
>> >
>> > - if (compaction_capture(capc, page, order, migratetype))
>> > + if (compaction_capture(capc, page, order, migratetype)) {
>> > + account_freepages(zone, -(1 << order), migratetype);
>> > return;
>> > + }
>> >
>> > buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
>> > if (!buddy)
>> >
>> > With my fix, the THP percentage looks better:
>> > k6.9-rc2-base base + patch10 + 2 fixes +
>> > my fix
>> > Percentage huge-1 78.18 ( 0.00%) 82.83 ( 5.94%)
>> > Percentage huge-3 86.70 ( 0.00%) 93.47 ( 7.81%)
>> > Percentage huge-5 90.26 ( 0.00%) 94.73 ( 4.95%)
>> > Percentage huge-7 92.34 ( 0.00%) 95.22 ( 3.12%)
>> > Percentage huge-12 91.18 ( 0.00%) 92.40 ( 1.34%)
>> > Percentage huge-18 89.00 ( 0.00%) 85.39 ( -4.06%)
>> > Percentage huge-24 90.52 ( 0.00%) 94.70 ( 4.61%)
>> > Percentage huge-30 94.44 ( 0.00%) 97.00 ( 2.71%)
>> > Percentage huge-32 93.09 ( 0.00%) 92.87 ( -0.24%)
>
> Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> With fixed indentation:
Maybe Baolin could resend the finalized 2 fixups in a more ready-to-pick form?
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 70f82635f650..96815e3c22f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
> while (order < MAX_PAGE_ORDER) {
> int buddy_mt = migratetype;
>
> - if (compaction_capture(capc, page, order, migratetype))
> + if (compaction_capture(capc, page, order, migratetype)) {
> + account_freepages(zone, -(1 << order), migratetype);
> return;
> + }
>
> buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
> if (!buddy)
If the released page is captured by compaction, the free page accounting
is currently not decreased correspondingly, which can make the watermark
checks falsely pass and result in depleted reserves etc. The false-positive
watermark checks are visible in the thpcompact benchmark, where they led to
a regression:
thpcompact Percentage Faults Huge
k6.9-rc2-base base + patch10 + 2 fixes
Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
After the fix, the regression is gone.
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/page_alloc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8330c5c2de6b..2facf844ef84 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
 	while (order < MAX_PAGE_ORDER) {
 		int buddy_mt = migratetype;
 
-		if (compaction_capture(capc, page, order, migratetype))
+		if (compaction_capture(capc, page, order, migratetype)) {
+			account_freepages(zone, -(1 << order), migratetype);
 			return;
+		}
 
 		buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
 		if (!buddy)
--
2.39.3
On 2024/4/9 14:23, Vlastimil Babka wrote:
> On 4/8/24 4:23 PM, Johannes Weiner wrote:
>> On Mon, Apr 08, 2024 at 09:38:20AM +0200, Vlastimil Babka wrote:
>>> On 4/7/24 12:19 PM, Baolin Wang wrote:
>>>> On 2024/3/21 02:02, Johannes Weiner wrote:
>>>>>
>>>>> + account_freepages(page, zone, 1 << order, migratetype);
>>>>> +
>>>>> while (order < MAX_PAGE_ORDER) {
>>>>> - if (compaction_capture(capc, page, order, migratetype)) {
>>>>> - __mod_zone_freepage_state(zone, -(1 << order),
>>>>> - migratetype);
>>>>> + int buddy_mt = migratetype;
>>>>> +
>>>>> + if (compaction_capture(capc, page, order, migratetype))
>>>>> return;
>>>>> - }
>>>>
>>>> IIUC, if the released page is captured by compaction, then the
>>>> statistics for free pages should be correspondingly decreased,
>>>> otherwise, there will be a slight regression for my thpcompact benchmark.
>>>>
>>>> thpcompact Percentage Faults Huge
>>>> k6.9-rc2-base base + patch10 + 2 fixes
>>>> Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
>>>> Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
>>>> Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
>>>> Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
>>>> Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
>>>> Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
>>>> Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
>>>> Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
>>>> Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
>>>>
>>>> I add below fix based on your fix 2, then the thpcompact Percentage
>>>> looks good. How do you think for the fix?
>>>
>>> Yeah another well spotted, thanks. "slight regression" is an understatement,
>>> this affects not just a "statistics" but very important counter
>>> NR_FREE_PAGES which IIUC would eventually become larger than reality, make
>>> the watermark checks false positive and result in depleted reserves etc etc.
>>> Actually wondering why we're not seeing -next failures already (or maybe I
>>> just haven't noticed).
>>
>> Good catch indeed.
>>
>> Trying to understand why I didn't notice this during testing, and I
>> think it's because I had order-10 pageblocks in my config. There is
>> this in compaction_capture():
>>
>> if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
>> return false;
>>
>> Most compaction is for order-9 THPs on movable blocks, so I didn't get
>> much capturing in practice in order for that leak to be noticable.
>>
>> In earlier versions of the patches I had more aggressive capturing,
>> but also did the accounting in add_page_to/del_page_from_freelist(),
>> so capturing only steals what's already accounted to be off list.
>>
>> With the __add/__del and the consolidated account_freepages()
>> optimization, compaction_capture() needs explicit accounting again.
>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 8330c5c2de6b..2facf844ef84 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -805,8 +805,10 @@ static inline void __free_one_page(struct page *page,
>>>> while (order < MAX_PAGE_ORDER) {
>>>> int buddy_mt = migratetype;
>>>>
>>>> - if (compaction_capture(capc, page, order, migratetype))
>>>> + if (compaction_capture(capc, page, order, migratetype)) {
>>>> + account_freepages(zone, -(1 << order), migratetype);
>>>> return;
>>>> + }
>>>>
>>>> buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
>>>> if (!buddy)
>>>>
>>>> With my fix, the THP percentage looks better:
>>>> k6.9-rc2-base base + patch10 + 2 fixes +
>>>> my fix
>>>> Percentage huge-1 78.18 ( 0.00%) 82.83 ( 5.94%)
>>>> Percentage huge-3 86.70 ( 0.00%) 93.47 ( 7.81%)
>>>> Percentage huge-5 90.26 ( 0.00%) 94.73 ( 4.95%)
>>>> Percentage huge-7 92.34 ( 0.00%) 95.22 ( 3.12%)
>>>> Percentage huge-12 91.18 ( 0.00%) 92.40 ( 1.34%)
>>>> Percentage huge-18 89.00 ( 0.00%) 85.39 ( -4.06%)
>>>> Percentage huge-24 90.52 ( 0.00%) 94.70 ( 4.61%)
>>>> Percentage huge-30 94.44 ( 0.00%) 97.00 ( 2.71%)
>>>> Percentage huge-32 93.09 ( 0.00%) 92.87 ( -0.24%)
>>
>> Acked-by: Johannes Weiner <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
Thanks.
>> With fixed indentation:
>
> Maybe Baolin could resend the finalized 2 fixups in a more ready-to-pick form?
Sure, I've sent it out.
And I see Andrew has already folded the first fix
("mm-page_alloc-fix-freelist-movement-during-block-conversion-fix") into
the patch set. If a formal fix patch is needed, Andrew, please let me know.
On 4/9/24 9:56 AM, Baolin Wang wrote:
>
>
>>>
>>> Acked-by: Johannes Weiner <[email protected]>
>>
>> Acked-by: Vlastimil Babka <[email protected]>
>
> Thanks.
>
>>> With fixed indentation:
>>
>> Maybe Baolin could resend the finalized 2 fixups in a more ready-to-pick form?
>
> Sure, I've send it out.
Thanks.
> And I see Andrew has already folded the first fix
> ("mm-page_alloc-fix-freelist-movement-during-block-conversion-fix") into
> the patch set. If a formal fix patch is needed, Andrew, please let me know.
Oh, I didn't notice; in that case nothing more should be needed.
On 2024/4/8 22:23, Johannes Weiner wrote:
> On Mon, Apr 08, 2024 at 09:38:20AM +0200, Vlastimil Babka wrote:
>> On 4/7/24 12:19 PM, Baolin Wang wrote:
>>> On 2024/3/21 02:02, Johannes Weiner wrote:
>>>>
>>>> + account_freepages(page, zone, 1 << order, migratetype);
>>>> +
>>>> while (order < MAX_PAGE_ORDER) {
>>>> - if (compaction_capture(capc, page, order, migratetype)) {
>>>> - __mod_zone_freepage_state(zone, -(1 << order),
>>>> - migratetype);
>>>> + int buddy_mt = migratetype;
>>>> +
>>>> + if (compaction_capture(capc, page, order, migratetype))
>>>> return;
>>>> - }
>>>
>>> IIUC, if the released page is captured by compaction, then the
>>> statistics for free pages should be correspondingly decreased,
>>> otherwise, there will be a slight regression for my thpcompact benchmark.
>>>
>>> thpcompact Percentage Faults Huge
>>> k6.9-rc2-base base + patch10 + 2 fixes
>>> Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
>>> Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
>>> Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
>>> Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
>>> Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
>>> Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
>>> Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
>>> Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
>>> Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
>>>
>>> I add below fix based on your fix 2, then the thpcompact Percentage
>>> looks good. How do you think for the fix?
>>
>> Yeah another well spotted, thanks. "slight regression" is an understatement,
>> this affects not just a "statistics" but very important counter
>> NR_FREE_PAGES which IIUC would eventually become larger than reality, make
>> the watermark checks false positive and result in depleted reserves etc etc.
>> Actually wondering why we're not seeing -next failures already (or maybe I
>> just haven't noticed).
>
> Good catch indeed.
>
> Trying to understand why I didn't notice this during testing, and I
> think it's because I had order-10 pageblocks in my config. There is
> this in compaction_capture():
>
> if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
> return false;
>
> Most compaction is for order-9 THPs on movable blocks, so I didn't get
> much capturing in practice in order for that leak to be noticable.
This makes me wonder: why not compare against 'cc->migratetype', so that
low-order (e.g. mTHP) compaction can capture the released pages directly?
That could avoid some compaction scans without mixing migratetypes.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2facf844ef84..7a64020f8222 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,7 +622,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
 	 * and vice-versa but no more than normal fallback logic which can
 	 * have trouble finding a high-order free page.
 	 */
-	if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
+	if (order < pageblock_order && capc->cc->migratetype != migratetype)
 		return false;
 
 	capc->page = page;
On 9 Apr 2024, at 5:31, Baolin Wang wrote:
> On 2024/4/8 22:23, Johannes Weiner wrote:
>> On Mon, Apr 08, 2024 at 09:38:20AM +0200, Vlastimil Babka wrote:
>>> On 4/7/24 12:19 PM, Baolin Wang wrote:
>>>> On 2024/3/21 02:02, Johannes Weiner wrote:
>>>>> + account_freepages(page, zone, 1 << order, migratetype);
>>>>> +
>>>>> while (order < MAX_PAGE_ORDER) {
>>>>> - if (compaction_capture(capc, page, order, migratetype)) {
>>>>> - __mod_zone_freepage_state(zone, -(1 << order),
>>>>> - migratetype);
>>>>> + int buddy_mt = migratetype;
>>>>> +
>>>>> + if (compaction_capture(capc, page, order, migratetype))
>>>>> return;
>>>>> - }
>>>>
>>>> IIUC, if the released page is captured by compaction, then the
>>>> statistics for free pages should be correspondingly decreased,
>>>> otherwise, there will be a slight regression for my thpcompact benchmark.
>>>>
>>>> thpcompact Percentage Faults Huge
>>>> k6.9-rc2-base base + patch10 + 2 fixes
>>>> Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
>>>> Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
>>>> Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
>>>> Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
>>>> Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
>>>> Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
>>>> Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
>>>> Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
>>>> Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
>>>>
>>>> I add below fix based on your fix 2, then the thpcompact Percentage
>>>> looks good. How do you think for the fix?
>>>
>>> Yeah another well spotted, thanks. "slight regression" is an understatement,
>>> this affects not just a "statistics" but very important counter
>>> NR_FREE_PAGES which IIUC would eventually become larger than reality, make
>>> the watermark checks false positive and result in depleted reserves etc etc.
>>> Actually wondering why we're not seeing -next failures already (or maybe I
>>> just haven't noticed).
>>
>> Good catch indeed.
>>
>> Trying to understand why I didn't notice this during testing, and I
>> think it's because I had order-10 pageblocks in my config. There is
>> this in compaction_capture():
>>
>> if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
>> return false;
>>
>> Most compaction is for order-9 THPs on movable blocks, so I didn't get
>> much capturing in practice in order for that leak to be noticable.
>
> This makes me wonder why not use 'cc->migratetype' for migratetype comparison, so that low-order (like mTHP) compaction can directly get the released pages, which could avoid some compaction scans without mixing the migratetype?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2facf844ef84..7a64020f8222 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -622,7 +622,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
> * and vice-versa but no more than normal fallback logic which can
> * have trouble finding a high-order free page.
> */
> - if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
> + if (order < pageblock_order && capc->cc->migratetype != migratetype)
> return false;
>
> capc->page = page;
It is worth trying: when the original check was written, mTHP did not
exist, so not capturing any sub-pageblock MIGRATE_MOVABLE page made sense.
But with your change, the capture path loses the opportunity to let an
unmovable request use a reclaimable pageblock and vice versa, as the
comment says. Please update the comment as well, and we should watch for
potential unmovable and reclaimable regressions.
--
Best Regards,
Yan, Zi
Hi Baolin,
kernel test robot noticed the following build errors:
url: https://github.com/intel-lab-lkp/linux/commits/UPDATE-20240409-154935/Johannes-Weiner/mm-page_alloc-remove-pcppage-migratetype-caching/20240321-020814
base: the 10th patch of https://lore.kernel.org/r/20240320180429.678181-11-hannes%40cmpxchg.org
patch link: https://lore.kernel.org/r/a2a48baca69f103aa431fd201f8a06e3b95e203d.1712648441.git.baolin.wang%40linux.alibaba.com
patch subject: [PATCH] mm: page_alloc: consolidate free page accounting fix 3
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20240410/[email protected]/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240410/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> mm/page_alloc.c:808:54: error: too few arguments to function call, expected 4, have 3
808 | account_freepages(zone, -(1 << order), migratetype);
| ~~~~~~~~~~~~~~~~~ ^
mm/page_alloc.c:645:20: note: 'account_freepages' declared here
645 | static inline void account_freepages(struct page *page, struct zone *zone,
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
646 | int nr_pages, int migratetype)
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
vim +808 mm/page_alloc.c
759
760 /*
761 * Freeing function for a buddy system allocator.
762 *
763 * The concept of a buddy system is to maintain direct-mapped table
764 * (containing bit values) for memory blocks of various "orders".
765 * The bottom level table contains the map for the smallest allocatable
766 * units of memory (here, pages), and each level above it describes
767 * pairs of units from the levels below, hence, "buddies".
768 * At a high level, all that happens here is marking the table entry
769 * at the bottom level available, and propagating the changes upward
770 * as necessary, plus some accounting needed to play nicely with other
771 * parts of the VM system.
772 * At each level, we keep a list of pages, which are heads of continuous
773 * free pages of length of (1 << order) and marked with PageBuddy.
774 * Page's order is recorded in page_private(page) field.
775 * So when we are allocating or freeing one, we can derive the state of the
776 * other. That is, if we allocate a small block, and both were
777 * free, the remainder of the region must be split into blocks.
778 * If a block is freed, and its buddy is also free, then this
779 * triggers coalescing into a block of larger size.
780 *
781 * -- nyc
782 */
783
784 static inline void __free_one_page(struct page *page,
785 unsigned long pfn,
786 struct zone *zone, unsigned int order,
787 int migratetype, fpi_t fpi_flags)
788 {
789 struct capture_control *capc = task_capc(zone);
790 unsigned long buddy_pfn = 0;
791 unsigned long combined_pfn;
792 struct page *buddy;
793 bool to_tail;
794
795 VM_BUG_ON(!zone_is_initialized(zone));
796 VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
797
798 VM_BUG_ON(migratetype == -1);
799 VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
800 VM_BUG_ON_PAGE(bad_range(zone, page), page);
801
802 account_freepages(page, zone, 1 << order, migratetype);
803
804 while (order < MAX_PAGE_ORDER) {
805 int buddy_mt = migratetype;
806
807 if (compaction_capture(capc, page, order, migratetype)) {
> 808 account_freepages(zone, -(1 << order), migratetype);
809 return;
810 }
811
812 buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
813 if (!buddy)
814 goto done_merging;
815
816 if (unlikely(order >= pageblock_order)) {
817 /*
818 * We want to prevent merge between freepages on pageblock
819 * without fallbacks and normal pageblock. Without this,
820 * pageblock isolation could cause incorrect freepage or CMA
821 * accounting or HIGHATOMIC accounting.
822 */
823 buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
824
825 if (migratetype != buddy_mt &&
826 (!migratetype_is_mergeable(migratetype) ||
827 !migratetype_is_mergeable(buddy_mt)))
828 goto done_merging;
829 }
830
831 /*
832 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
833 * merge with it and move up one order.
834 */
835 if (page_is_guard(buddy))
836 clear_page_guard(zone, buddy, order);
837 else
838 __del_page_from_free_list(buddy, zone, order, buddy_mt);
839
840 if (unlikely(buddy_mt != migratetype)) {
841 /*
842 * Match buddy type. This ensures that an
843 * expand() down the line puts the sub-blocks
844 * on the right freelists.
845 */
846 set_pageblock_migratetype(buddy, migratetype);
847 }
848
849 combined_pfn = buddy_pfn & pfn;
850 page = page + (combined_pfn - pfn);
851 pfn = combined_pfn;
852 order++;
853 }
854
855 done_merging:
856 set_buddy_order(page, order);
857
858 if (fpi_flags & FPI_TO_TAIL)
859 to_tail = true;
860 else if (is_shuffle_order(order))
861 to_tail = shuffle_pick_tail();
862 else
863 to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
864
865 __add_to_free_list(page, zone, order, migratetype, to_tail);
866
867 /* Notify page reporting subsystem of freed page */
868 if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
869 page_reporting_notify_free(order);
870 }
871
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Baolin,
kernel test robot noticed the following build errors:
url: https://github.com/intel-lab-lkp/linux/commits/UPDATE-20240409-154935/Johannes-Weiner/mm-page_alloc-remove-pcppage-migratetype-caching/20240321-020814
base: the 10th patch of https://lore.kernel.org/r/20240320180429.678181-11-hannes%40cmpxchg.org
patch link: https://lore.kernel.org/r/a2a48baca69f103aa431fd201f8a06e3b95e203d.1712648441.git.baolin.wang%40linux.alibaba.com
patch subject: [PATCH] mm: page_alloc: consolidate free page accounting fix 3
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20240410/[email protected]/config)
compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240410/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All error/warnings (new ones prefixed by >>):
mm/page_alloc.c: In function '__free_one_page':
>> mm/page_alloc.c:808:43: error: passing argument 1 of 'account_freepages' from incompatible pointer type [-Werror=incompatible-pointer-types]
808 | account_freepages(zone, -(1 << order), migratetype);
| ^~~~
| |
| struct zone *
mm/page_alloc.c:645:51: note: expected 'struct page *' but argument is of type 'struct zone *'
645 | static inline void account_freepages(struct page *page, struct zone *zone,
| ~~~~~~~~~~~~~^~~~
>> mm/page_alloc.c:808:49: warning: passing argument 2 of 'account_freepages' makes pointer from integer without a cast [-Wint-conversion]
808 | account_freepages(zone, -(1 << order), migratetype);
| ^~~~~~~~~~~~~
| |
| int
mm/page_alloc.c:645:70: note: expected 'struct zone *' but argument is of type 'int'
645 | static inline void account_freepages(struct page *page, struct zone *zone,
| ~~~~~~~~~~~~~^~~~
>> mm/page_alloc.c:808:25: error: too few arguments to function 'account_freepages'
808 | account_freepages(zone, -(1 << order), migratetype);
| ^~~~~~~~~~~~~~~~~
mm/page_alloc.c:645:20: note: declared here
645 | static inline void account_freepages(struct page *page, struct zone *zone,
| ^~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/account_freepages +808 mm/page_alloc.c
759
760 /*
761 * Freeing function for a buddy system allocator.
762 *
763 * The concept of a buddy system is to maintain direct-mapped table
764 * (containing bit values) for memory blocks of various "orders".
765 * The bottom level table contains the map for the smallest allocatable
766 * units of memory (here, pages), and each level above it describes
767 * pairs of units from the levels below, hence, "buddies".
768 * At a high level, all that happens here is marking the table entry
769 * at the bottom level available, and propagating the changes upward
770 * as necessary, plus some accounting needed to play nicely with other
771 * parts of the VM system.
772 * At each level, we keep a list of pages, which are heads of continuous
773 * free pages of length of (1 << order) and marked with PageBuddy.
774 * Page's order is recorded in page_private(page) field.
775 * So when we are allocating or freeing one, we can derive the state of the
776 * other. That is, if we allocate a small block, and both were
777 * free, the remainder of the region must be split into blocks.
778 * If a block is freed, and its buddy is also free, then this
779 * triggers coalescing into a block of larger size.
780 *
781 * -- nyc
782 */
783
784 static inline void __free_one_page(struct page *page,
785 unsigned long pfn,
786 struct zone *zone, unsigned int order,
787 int migratetype, fpi_t fpi_flags)
788 {
789 struct capture_control *capc = task_capc(zone);
790 unsigned long buddy_pfn = 0;
791 unsigned long combined_pfn;
792 struct page *buddy;
793 bool to_tail;
794
795 VM_BUG_ON(!zone_is_initialized(zone));
796 VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
797
798 VM_BUG_ON(migratetype == -1);
799 VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
800 VM_BUG_ON_PAGE(bad_range(zone, page), page);
801
802 account_freepages(page, zone, 1 << order, migratetype);
803
804 while (order < MAX_PAGE_ORDER) {
805 int buddy_mt = migratetype;
806
807 if (compaction_capture(capc, page, order, migratetype)) {
> 808 account_freepages(zone, -(1 << order), migratetype);
809 return;
810 }
811
812 buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
813 if (!buddy)
814 goto done_merging;
815
816 if (unlikely(order >= pageblock_order)) {
817 /*
818 * We want to prevent merge between freepages on pageblock
819 * without fallbacks and normal pageblock. Without this,
820 * pageblock isolation could cause incorrect freepage or CMA
821 * accounting or HIGHATOMIC accounting.
822 */
823 buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
824
825 if (migratetype != buddy_mt &&
826 (!migratetype_is_mergeable(migratetype) ||
827 !migratetype_is_mergeable(buddy_mt)))
828 goto done_merging;
829 }
830
831 /*
832 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
833 * merge with it and move up one order.
834 */
835 if (page_is_guard(buddy))
836 clear_page_guard(zone, buddy, order);
837 else
838 __del_page_from_free_list(buddy, zone, order, buddy_mt);
839
840 if (unlikely(buddy_mt != migratetype)) {
841 /*
842 * Match buddy type. This ensures that an
843 * expand() down the line puts the sub-blocks
844 * on the right freelists.
845 */
846 set_pageblock_migratetype(buddy, migratetype);
847 }
848
849 combined_pfn = buddy_pfn & pfn;
850 page = page + (combined_pfn - pfn);
851 pfn = combined_pfn;
852 order++;
853 }
854
855 done_merging:
856 set_buddy_order(page, order);
857
858 if (fpi_flags & FPI_TO_TAIL)
859 to_tail = true;
860 else if (is_shuffle_order(order))
861 to_tail = shuffle_pick_tail();
862 else
863 to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
864
865 __add_to_free_list(page, zone, order, migratetype, to_tail);
866
867 /* Notify page reporting subsystem of freed page */
868 if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
869 page_reporting_notify_free(order);
870 }
871
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Wed, Apr 10, 2024 at 05:15:01AM +0800, kernel test robot wrote:
> Hi Baolin,
>
> kernel test robot noticed the following build errors:
>
>
>
> url: https://github.com/intel-lab-lkp/linux/commits/UPDATE-20240409-154935/Johannes-Weiner/mm-page_alloc-remove-pcppage-migratetype-caching/20240321-020814
> base: the 10th patch of https://lore.kernel.org/r/20240320180429.678181-11-hannes%40cmpxchg.org
> patch link: https://lore.kernel.org/r/a2a48baca69f103aa431fd201f8a06e3b95e203d.1712648441.git.baolin.wang%40linux.alibaba.com
> patch subject: [PATCH] mm: page_alloc: consolidate free page accounting fix 3
> config: x86_64-kexec (https://download.01.org/0day-ci/archive/20240410/[email protected]/config)
> compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240410/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All errors (new ones prefixed by >>):
>
> >> mm/page_alloc.c:808:54: error: too few arguments to function call, expected 4, have 3
> 808 | account_freepages(zone, -(1 << order), migratetype);
> | ~~~~~~~~~~~~~~~~~ ^
> mm/page_alloc.c:645:20: note: 'account_freepages' declared here
> 645 | static inline void account_freepages(struct page *page, struct zone *zone,
> | ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 646 | int nr_pages, int migratetype)
> | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1 error generated.
Looks like a false alarm: the test bot didn't pick up the fixlet that
removes the page parameter from account_freepages():
https://lore.kernel.org/all/[email protected]/
It's already in Andrew's tree.
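Assuming that fixlet, the helper and the new call site line up roughly as
follows (a sketch of the intended end state, not the exact code in Andrew's
tree):
static inline void account_freepages(struct zone *zone, int nr_pages,
				     int migratetype)
{
	/* adjusts NR_FREE_PAGES (and the CMA counter for CMA pages) by nr_pages */
}

	/* so the call in __free_one_page() matches: */
	account_freepages(zone, -(1 << order), migratetype);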
On 2024/4/9 22:46, Zi Yan wrote:
> On 9 Apr 2024, at 5:31, Baolin Wang wrote:
>
>> On 2024/4/8 22:23, Johannes Weiner wrote:
>>> On Mon, Apr 08, 2024 at 09:38:20AM +0200, Vlastimil Babka wrote:
>>>> On 4/7/24 12:19 PM, Baolin Wang wrote:
>>>>> On 2024/3/21 02:02, Johannes Weiner wrote:
>>>>>> + account_freepages(page, zone, 1 << order, migratetype);
>>>>>> +
>>>>>> while (order < MAX_PAGE_ORDER) {
>>>>>> - if (compaction_capture(capc, page, order, migratetype)) {
>>>>>> - __mod_zone_freepage_state(zone, -(1 << order),
>>>>>> - migratetype);
>>>>>> + int buddy_mt = migratetype;
>>>>>> +
>>>>>> + if (compaction_capture(capc, page, order, migratetype))
>>>>>> return;
>>>>>> - }
>>>>>
>>>>> IIUC, if the released page is captured by compaction, then the
>>>>> statistics for free pages should be correspondingly decreased,
>>>>> otherwise, there will be a slight regression for my thpcompact benchmark.
>>>>>
>>>>> thpcompact Percentage Faults Huge
>>>>> k6.9-rc2-base base + patch10 + 2 fixes
>>>>> Percentage huge-1 78.18 ( 0.00%) 71.92 ( -8.01%)
>>>>> Percentage huge-3 86.70 ( 0.00%) 86.07 ( -0.73%)
>>>>> Percentage huge-5 90.26 ( 0.00%) 78.02 ( -13.57%)
>>>>> Percentage huge-7 92.34 ( 0.00%) 78.67 ( -14.81%)
>>>>> Percentage huge-12 91.18 ( 0.00%) 81.04 ( -11.12%)
>>>>> Percentage huge-18 89.00 ( 0.00%) 79.57 ( -10.60%)
>>>>> Percentage huge-24 90.52 ( 0.00%) 80.07 ( -11.54%)
>>>>> Percentage huge-30 94.44 ( 0.00%) 96.28 ( 1.95%)
>>>>> Percentage huge-32 93.09 ( 0.00%) 99.39 ( 6.77%)
>>>>>
>>>>> I add below fix based on your fix 2, then the thpcompact Percentage
>>>>> looks good. How do you think for the fix?
>>>>
>>>> Yeah another well spotted, thanks. "slight regression" is an understatement,
>>>> this affects not just a "statistics" but very important counter
>>>> NR_FREE_PAGES which IIUC would eventually become larger than reality, make
>>>> the watermark checks false positive and result in depleted reserves etc etc.
>>>> Actually wondering why we're not seeing -next failures already (or maybe I
>>>> just haven't noticed).
>>>
>>> Good catch indeed.
>>>
>>> Trying to understand why I didn't notice this during testing, and I
>>> think it's because I had order-10 pageblocks in my config. There is
>>> this in compaction_capture():
>>>
>>> if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
>>> return false;
>>>
>>> Most compaction is for order-9 THPs on movable blocks, so I didn't get
>>> much capturing in practice in order for that leak to be noticable.
>>
>> This makes me wonder why not use 'cc->migratetype' for migratetype comparison, so that low-order (like mTHP) compaction can directly get the released pages, which could avoid some compaction scans without mixing the migratetype?
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 2facf844ef84..7a64020f8222 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -622,7 +622,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
>> * and vice-versa but no more than normal fallback logic which can
>> * have trouble finding a high-order free page.
>> */
>> - if (order < pageblock_order && migratetype == MIGRATE_MOVABLE)
>> + if (order < pageblock_order && capc->cc->migratetype != migratetype)
>> return false;
>>
>> capc->page = page;
>
> It is worth trying, since at the original patch time mTHP was not present and
> not capturing any MIGRATE_MOVABLE makes sense. But with your change, the capture
> will lose the opportunity of letting an unmovable request use a reclaimable
> pageblock and vice-versa, like the comment says. Please change the comment
> as well and we should monitor potential unmovable and reclaimable regression.
Yes, but I think that case is easy to handle. Anyway, let me try to do
some measurements with mTHP.
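For reference, with the proposal applied and the comment updated as Zi Yan
suggests, the check in compaction_capture() would read roughly like this (a
sketch of the idea under discussion, not a merged change):
	/*
	 * Only capture a page whose migratetype matches the compaction
	 * request, so that sub-pageblock (e.g. mTHP) captures don't mix
	 * block types. Note this gives up the old leeway of letting an
	 * unmovable request capture from a reclaimable block and vice
	 * versa, which needs to be watched for regressions.
	 */
	if (order < pageblock_order && capc->cc->migratetype != migratetype)
		return false;

	capc->page = page;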
On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <[email protected]> wrote:
This series significantly regresses Android and ChromeOS under memory
pressure. THPs are virtually nonexistent on client devices, and IIRC,
it was mentioned in the early discussions that potential regressions
for such a case are somewhat expected?
On Android (ARMv8.2), app launch time regressed by about 7%; on
ChromeOS (Intel ADL), tab switch time regressed by about 8%. PSI
(full and some) on both platforms also increased by over 20%. I could post
the details of the benchmarks and the metrics they measure, but I
doubt they would mean much to you. I did ask our test teams to save
extra kernel logs that might be more helpful, and I could forward them
to you.
Note that the numbers above were from the default LRU, not MGLRU,
which I specifically asked our test teams to disable to double check
the regressions.
Given that the merge window will be open soon, I don't plan to stand in
its way. If we can't fix the regression after a reasonable amount of
time, can we find a way to disable this series at runtime or build time?
On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <[email protected]> wrote:
> >
> > [...]
>
> This series significantly regresses Android and ChromeOS under memory
> pressure. THPs are virtually nonexistent on client devices, and IIRC,
> it was mentioned in the early discussions that potential regressions
> for such a case are somewhat expected?
This is not expected for the 10 patches here. You might be referring
to the discussion around the huge page allocator series, which had
fallback restrictions and many changes to reclaim and compaction.
Can you confirm that you were testing the latest patches that are in
mm-stable as of today? There was a series of follow-up fixes.
Especially, please double check you have the follow-up fixes to
compaction capturing and the CMA fallback policy. It sounds like the
behavior Baolin described before the CMA fix.
Lastly, what's the base you backported this series to?
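For reference, the CMA handling mentioned above revolves around (roughly)
the balancing heuristic in __rmqueue(); a simplified sketch of the
mainline logic, which may not match your backport line for line:

        /*
         * Sketch of the CMA balancing heuristic in __rmqueue()
         * (mm/page_alloc.c): when more than half of the zone's free
         * pages sit in CMA, serve eligible movable allocations from
         * CMA first so the regular freelists aren't drained early.
         */
        if (IS_ENABLED(CONFIG_CMA) && (alloc_flags & ALLOC_CMA) &&
            zone_page_state(zone, NR_FREE_CMA_PAGES) >
            zone_page_state(zone, NR_FREE_PAGES) / 2) {
                page = __rmqueue_cma_fallback(zone, order);
                if (page)
                        return page;
        }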
> On Android (ARMv8.2), app launch time regressed by about 7%; On
> ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> (full and some) on both platforms increased by over 20%. I could post
> the details of the benchmarks and the metrics they measure, but I
> doubt they would mean much to you. I did ask our test teams to save
> extra kernel logs that might be more helpful, and I could forward them
> to you.
If the issue persists with the latest patches in -mm, a kernel config
and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
before/during/after the problematic behavior would be very helpful.
On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > [...]
> >
> > This series significantly regresses Android and ChromeOS under memory
> > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > it was mentioned in the early discussions that potential regressions
> > for such a case are somewhat expected?
>
> This is not expected for the 10 patches here. You might be referring
> to the discussion around the huge page allocator series, which had
> fallback restrictions and many changes to reclaim and compaction.
Right, now I remember.
> Can you confirm that you were testing the latest patches that are in
> mm-stable as of today? There was a series of follow-up fixes.
Here is what I have on top of 6.8.y, which I think includes all the
follow-up fixes. The performance delta was measured between patches 5
and 22 in the list below.
1 mm: convert free_unref_page_list() to use folios
2 mm: add free_unref_folios()
3 mm: handle large folios in free_unref_folios()
4 mm/page_alloc: remove unused fpi_flags in free_pages_prepare()
5 mm: add alloc_contig_migrate_range allocation statistics
6 mm: page_alloc: remove pcppage migratetype caching
7 mm: page_alloc: optimize free_unref_folios()
8 mm: page_alloc: fix up block types when merging compatible blocks
9 mm: page_alloc: move free pages when converting block during isolation
10 mm: page_alloc: fix move_freepages_block() range error
11 mm: page_alloc: fix freelist movement during block conversion
12 mm-page_alloc-fix-freelist-movement-during-block-conversion-fix
13 mm: page_alloc: close migratetype race between freeing and stealing
14 mm: page_alloc: set migratetype inside move_freepages()
15 mm: page_isolation: prepare for hygienic freelists
16 mm-page_isolation-prepare-for-hygienic-freelists-fix
17 mm: page_alloc: consolidate free page accounting
18 mm: page_alloc: consolidate free page accounting fix
19 mm: page_alloc: consolidate free page accounting fix 2
20 mm: page_alloc: consolidate free page accounting fix 3
21 mm: page_alloc: change move_freepages() to __move_freepages_block()
22 mm: page_alloc: batch vmstat updates in expand()
> Especially, please double check you have the follow-up fixes to
> compaction capturing and the CMA fallback policy. It sounds like the
> behavior Baolin described before the CMA fix.
Yes, that one was included.
> Lastly, what's the base you backported this series to?
It was 6.8; we can potentially try 6.9 this week and 6.10-rc in a few
weeks, when it's in good shape for performance benchmarks.
> > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > (full and some) on both platforms increased by over 20%. I could post
> > the details of the benchmarks and the metrics they measure, but I
> > doubt they would mean much to you. I did ask our test teams to save
> > extra kernel logs that might be more helpful, and I could forward them
> > to you.
>
> If the issue persists with the latest patches in -mm, a kernel config
> and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> before/during/after the problematic behavior would be very helpful.
Assuming all the fixes were included, do you want the logs from 6.8?
We have them available now.
On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <[email protected]> wrote:
> > > >
[...]
> > >
> > > This series significantly regresses Android and ChromeOS under memory
> > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > it was mentioned in the early discussions that potential regressions
> > > for such a case are somewhat expected?
> >
> > This is not expected for the 10 patches here. You might be referring
> > to the discussion around the huge page allocator series, which had
> > fallback restrictions and many changes to reclaim and compaction.
>
> Right, now I remember.
>
> > Can you confirm that you were testing the latest patches that are in
> > mm-stable as of today? There was a series of follow-up fixes.
>
> Here is what I have on top of 6.8.y, which I think includes all the
> follow-up fixes. The performance delta was measured between 5 & 22.
>
> 1 mm: convert free_unref_page_list() to use folios
> 2 mm: add free_unref_folios()
> 3 mm: handle large folios in free_unref_folios()
> 4 mm/page_alloc: remove unused fpi_flags in free_pages_prepare()
> 5 mm: add alloc_contig_migrate_range allocation statistics
> 6 mm: page_alloc: remove pcppage migratetype caching
> 7 mm: page_alloc: optimize free_unref_folios()
> 8 mm: page_alloc: fix up block types when merging compatible blocks
> 9 mm: page_alloc: move free pages when converting block during isolation
> 10 mm: page_alloc: fix move_freepages_block() range error
> 11 mm: page_alloc: fix freelist movement during block conversion
> 12 mm-page_alloc-fix-freelist-movement-during-block-conversion-fix
> 13 mm: page_alloc: close migratetype race between freeing and stealing
> 14 mm: page_alloc: set migratetype inside move_freepages()
> 15 mm: page_isolation: prepare for hygienic freelists
> 16 mm-page_isolation-prepare-for-hygienic-freelists-fix
> 17 mm: page_alloc: consolidate free page accounting
> 18 mm: page_alloc: consolidate free page accounting fix
> 19 mm: page_alloc: consolidate free page accounting fix 2
> 20 mm: page_alloc: consolidate free page accounting fix 3
> 21 mm: page_alloc: change move_freepages() to __move_freepages_block()
> 22 mm: page_alloc: batch vmstat updates in expand()
It does look complete to me. Did you encounter any conflicts during
the backport? Is there any chance you can fold the fixes into their
respective main patches and bisect the sequence?
Again, it's not expected behavior given the fairly conservative
changes above. It sounds like a bug.
> > Especially, please double check you have the follow-up fixes to
> > compaction capturing and the CMA fallback policy. It sounds like the
> > behavior Baolin described before the CMA fix.
>
> Yes, that one was included.
Ok.
> > Lastly, what's the base you backported this series to?
>
> It was 6.8, we can potentially try 6.9 this week and 6.10-rc in a few
> weeks when it's in good shape for performance benchmarks.
If you could try 6.9 as well, that would be great. I backported the
series to 6.9 the other day (git cherry-picks from mm-stable) and I
didn't encounter any conflicts.
> > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > (full and some) on both platforms increased by over 20%. I could post
> > > the details of the benchmarks and the metrics they measure, but I
> > > doubt they would mean much to you. I did ask our test teams to save
> > > extra kernel logs that might be more helpful, and I could forward them
> > > to you.
> >
> > If the issue persists with the latest patches in -mm, a kernel config
> > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > before/during/after the problematic behavior would be very helpful.
>
> Assuming all the fixes were included, do you want the logs from 6.8?
> We have them available now.
Yes, that would be helpful.
If you have them, it would also be quite useful to have the vmstat
before-after-test delta from a good kernel, for baseline comparison.
Thanks
On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <[email protected]> wrote:
>
> On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <[email protected]> wrote:
> > > > >
>
> [...]
>
> > > >
> > > > This series significantly regresses Android and ChromeOS under memory
> > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > it was mentioned in the early discussions that potential regressions
> > > > for such a case are somewhat expected?
> > >
> > > This is not expected for the 10 patches here. You might be referring
> > > to the discussion around the huge page allocator series, which had
> > > fallback restrictions and many changes to reclaim and compaction.
> >
> > Right, now I remember.
> >
> > > Can you confirm that you were testing the latest patches that are in
> > > mm-stable as of today? There was a series of follow-up fixes.
> >
> > Here is what I have on top of 6.8.y, which I think includes all the
> > follow-up fixes. The performance delta was measured between 5 & 22.
> >
> > 1 mm: convert free_unref_page_list() to use folios
> > 2 mm: add free_unref_folios()
> > 3 mm: handle large folios in free_unref_folios()
> > 4 mm/page_alloc: remove unused fpi_flags in free_pages_prepare()
> > 5 mm: add alloc_contig_migrate_range allocation statistics
> > 6 mm: page_alloc: remove pcppage migratetype caching
> > 7 mm: page_alloc: optimize free_unref_folios()
> > 8 mm: page_alloc: fix up block types when merging compatible blocks
> > 9 mm: page_alloc: move free pages when converting block during isolation
> > 10 mm: page_alloc: fix move_freepages_block() range error
> > 11 mm: page_alloc: fix freelist movement during block conversion
> > 12 mm-page_alloc-fix-freelist-movement-during-block-conversion-fix
> > 13 mm: page_alloc: close migratetype race between freeing and stealing
> > 14 mm: page_alloc: set migratetype inside move_freepages()
> > 15 mm: page_isolation: prepare for hygienic freelists
> > 16 mm-page_isolation-prepare-for-hygienic-freelists-fix
> > 17 mm: page_alloc: consolidate free page accounting
> > 18 mm: page_alloc: consolidate free page accounting fix
> > 19 mm: page_alloc: consolidate free page accounting fix 2
> > 20 mm: page_alloc: consolidate free page accounting fix 3
> > 21 mm: page_alloc: change move_freepages() to __move_freepages_block()
> > 22 mm: page_alloc: batch vmstat updates in expand()
>
> It does look complete to me. Did you encounter any conflicts during
> the backport? Is there any chance you can fold the fixes into their
> respective main patches and bisect the sequence?
>
> Again, it's not expected behavior given the fairly conservative
> changes above. It sounds like a bug.
>
> > > Especially, please double check you have the follow-up fixes to
> > > compaction capturing and the CMA fallback policy. It sounds like the
> > > behavior Baolin described before the CMA fix.
> >
> > Yes, that one was included.
>
> Ok.
>
> > > Lastly, what's the base you backported this series to?
> >
> > It was 6.8, we can potentially try 6.9 this week and 6.10-rc in a few
> > weeks when it's in good shape for performance benchmarks.
>
> If you could try 6.9 as well, that would be great. I backported the
> series to 6.9 the other day (git cherry-picks from mm-stable) and I
> didn't encounter any conflicts.
>
> > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > (full and some) on both platforms increased by over 20%. I could post
> > > > the details of the benchmarks and the metrics they measure, but I
> > > > doubt they would mean much to you. I did ask our test teams to save
> > > > extra kernel logs that might be more helpful, and I could forward them
> > > > to you.
> > >
> > > If the issue persists with the latest patches in -mm, a kernel config
> > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > before/during/after the problematic behavior would be very helpful.
> >
> > Assuming all the fixes were included, do you want the logs from 6.8?
> > We have them available now.
>
> Yes, that would be helpful.
>
> If you have them, it would also be quite useful to have the vmstat
> before-after-test delta from a good kernel, for baseline comparison.
Sorry for taking this long -- I wanted to see if the regression is
still reproducible on v6.9.
Apparently we got similar results on v6.9 with the following
patches cherry-picked cleanly from v6.10-rc1:
1 mm: page_alloc: remove pcppage migratetype caching
2 mm: page_alloc: optimize free_unref_folios()
3 mm: page_alloc: fix up block types when merging compatible blocks
4 mm: page_alloc: move free pages when converting block during isolation
5 mm: page_alloc: fix move_freepages_block() range error
6 mm: page_alloc: fix freelist movement during block conversion
7 mm: page_alloc: close migratetype race between freeing and stealing
8 mm: page_alloc: set migratetype inside move_freepages()
9 mm: page_isolation: prepare for hygienic freelists
10 mm: page_alloc: consolidate free page accounting
11 mm: page_alloc: change move_freepages() to __move_freepages_block()
12 mm: page_alloc: batch vmstat updates in expand()
Unfortunately I just realized that the automated benchmark didn't
collect the kernel stats before it started (since it always starts on a
freshly booted device). While this is being fixed, I'm attaching the
kernel stats collected after the benchmark finished. I grabbed 10 runs
for each (baseline/patched), and if you need more, please let me know.
(And we should have the stats before the benchmark soon.)
On Tue, Jun 04, 2024 at 10:53:55PM -0600, Yu Zhao wrote:
> On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <[email protected]> wrote:
> > On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <[email protected]> wrote:
> > > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > > This series significantly regresses Android and ChromeOS under memory
> > > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > > it was mentioned in the early discussions that potential regressions
> > > > > for such a case are somewhat expected?
> > > >
> > > > This is not expected for the 10 patches here. You might be referring
> > > > to the discussion around the huge page allocator series, which had
> > > > fallback restrictions and many changes to reclaim and compaction.
> > >
> > > Right, now I remember.
> > >
> > > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > > (full and some) on both platforms increased by over 20%. I could post
> > > > > the details of the benchmarks and the metrics they measure, but I
> > > > > doubt they would mean much to you. I did ask our test teams to save
> > > > > extra kernel logs that might be more helpful, and I could forward them
> > > > > to you.
> > > >
> > > > If the issue persists with the latest patches in -mm, a kernel config
> > > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > > before/during/after the problematic behavior would be very helpful.
> > >
> > > Assuming all the fixes were included, do you want the logs from 6.8?
> > > We have them available now.
> >
> > Yes, that would be helpful.
> >
> > If you have them, it would also be quite useful to have the vmstat
> > before-after-test delta from a good kernel, for baseline comparison.
>
> Sorry for taking this long -- I wanted to see if the regression is
> still reproducible on v6.9.
>
> Apparently we got the similar results on v6.9 with the following
> patches cherry-picked cleanly from v6.10-rc1:
>
> 1 mm: page_alloc: remove pcppage migratetype caching
> 2 mm: page_alloc: optimize free_unref_folios()
> 3 mm: page_alloc: fix up block types when merging compatible blocks
> 4 mm: page_alloc: move free pages when converting block during isolation
> 5 mm: page_alloc: fix move_freepages_block() range error
> 6 mm: page_alloc: fix freelist movement during block conversion
> 7 mm: page_alloc: close migratetype race between freeing and stealing
> 8 mm: page_alloc: set migratetype inside move_freepages()
> 9 mm: page_isolation: prepare for hygienic freelists
> 10 mm: page_alloc: consolidate free page accounting
> 11 mm: page_alloc: change move_freepages() to __move_freepages_block()
> 12 mm: page_alloc: batch vmstat updates in expand()
>
> Unfortunately I just realized that that automated benchmark didn't
> collect the kernel stats before it starts (since it always starts on a
> freshly booted device). While this is being fixed, I'm attaching the
> kernel stats collected after the benchmark finished. I grabbed 10 runs
> for each (baseline/patched), and if you need more, please let me know.
> (And we should have the stats before the benchmark soon.)
Thanks for grabbing these, and sorry about the delay; I was traveling
last week.
You mentioned "THPs are virtually nonexistent". But the workload
doesn't seem to allocate anon THPs at all. For file THP, the patched
kernel's median for allocation success is 90% of baseline, but the
inter-run min/max deviation from the median is 85%/108% in baseline
and 85%/112% in patched, so this is quite noisy. Was that initial
comment regarding a different workload?
This other data point has me stumped. Comparing medians, there is a
1.5% reduction in anon refaults and a 4.8% increase in file
refaults. And indeed, more file pages and fewer anon pages are being scanned.
I think this could explain the PSI delta, since AFAIK you have zram or
zswap, and anon decompression loads are cheaper than filesystem IO.
The above patches don't do anything that directly influences the
anon-file reclaim balance. However, if file THPs fall back to 4k file
pages more, that *might* be able to explain a shift in reclaim
balance, if some hot subpages in those THPs were protecting colder
subpages from being reclaimed and refaulting.
In that case, the root cause would still be a simple THP success rate
regression. To confirm this theory, could you run the baseline and the
patched sets both with THP disabled entirely?
Can you elaborate more on what the workload is doing exactly? What are
the parameters of the test machine (CPUs, memory size)? It'd be
helpful if I could reproduce this locally as well.
On Mon, Jun 10, 2024 at 9:28 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Jun 04, 2024 at 10:53:55PM -0600, Yu Zhao wrote:
> > On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <[email protected]> wrote:
> > > On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > > > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <[email protected]> wrote:
> > > > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > > > This series significantly regresses Android and ChromeOS under memory
> > > > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > > > it was mentioned in the early discussions that potential regressions
> > > > > > for such a case are somewhat expected?
> > > > >
> > > > > This is not expected for the 10 patches here. You might be referring
> > > > > to the discussion around the huge page allocator series, which had
> > > > > fallback restrictions and many changes to reclaim and compaction.
> > > >
> > > > Right, now I remember.
> > > >
> > > > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > > > (full and some) on both platforms increased by over 20%. I could post
> > > > > > the details of the benchmarks and the metrics they measure, but I
> > > > > > doubt they would mean much to you. I did ask our test teams to save
> > > > > > extra kernel logs that might be more helpful, and I could forward them
> > > > > > to you.
> > > > >
> > > > > If the issue persists with the latest patches in -mm, a kernel config
> > > > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > > > before/during/after the problematic behavior would be very helpful.
> > > >
> > > > Assuming all the fixes were included, do you want the logs from 6.8?
> > > > We have them available now.
> > >
> > > Yes, that would be helpful.
> > >
> > > If you have them, it would also be quite useful to have the vmstat
> > > before-after-test delta from a good kernel, for baseline comparison.
> >
> > Sorry for taking this long -- I wanted to see if the regression is
> > still reproducible on v6.9.
> >
> > Apparently we got the similar results on v6.9 with the following
> > patches cherry-picked cleanly from v6.10-rc1:
> >
> > 1 mm: page_alloc: remove pcppage migratetype caching
> > 2 mm: page_alloc: optimize free_unref_folios()
> > 3 mm: page_alloc: fix up block types when merging compatible blocks
> > 4 mm: page_alloc: move free pages when converting block during isolation
> > 5 mm: page_alloc: fix move_freepages_block() range error
> > 6 mm: page_alloc: fix freelist movement during block conversion
> > 7 mm: page_alloc: close migratetype race between freeing and stealing
> > 8 mm: page_alloc: set migratetype inside move_freepages()
> > 9 mm: page_isolation: prepare for hygienic freelists
> > 10 mm: page_alloc: consolidate free page accounting
> > 11 mm: page_alloc: change move_freepages() to __move_freepages_block()
> > 12 mm: page_alloc: batch vmstat updates in expand()
> >
> > Unfortunately I just realized that that automated benchmark didn't
> > collect the kernel stats before it starts (since it always starts on a
> > freshly booted device). While this is being fixed, I'm attaching the
> > kernel stats collected after the benchmark finished. I grabbed 10 runs
> > for each (baseline/patched), and if you need more, please let me know.
> > (And we should have the stats before the benchmark soon.)
>
> Thanks for grabbing these, and sorry about the delay, I was traveling
> last week.
>
> You mentioned "THPs are virtually non-existant". But the workload
> doesn't seem to allocate anon THPs at all.
Sorry for not being clear there: you are correct.
I meant that client devices rarely use 2MB THPs or __GFP_COMP. (They
simply can't due to both internal and external fragmentation, but we
are trying!)
> For file THP, the patched
> kernel's median for allocation success is 90% of baseline, but the
> inter-run min/max deviation from the median is 85%/108% in baseline
> and 85%/112% in patched, so this is quite noisy. Was
> that initial comment regarding a different workload?
No, in both of the cases we tried (Android and ChromeOS), we were hoping
the series could help with mTHP (64KB and 32KB), but we hit the
regressions with their baseline (4KB). Again, 2MB THPs, if they are
used, are reserved (allocated and mlocked to hold text/code sections
after a reboot). So they shouldn't matter, and I highly doubt the
regressions are because of them.
> This other data point has me stumped. Comparing medians, there is a
> 1.5% reduction in anon refaults and a 4.8% increase in file
> refaults. And indeed, there is more files and less anon being scanned.
> I think this could explain the PSI delta, since AFAIK you have zram or
> zswap, and anon decompression loads are cheaper than filesystem IO.
>
> The above patches don't do anything that directly influences the
> anon-file reclaim balance. However, if file THPs fall back to 4k file
> pages more, that *might* be able to explain a shift in reclaim
> balance, if some hot subpages in those THPs were protecting colder
> subpages from being reclaimed and refaulting.
>
> In that case, the root cause would still be a simple THP success rate
> regression. To confirm this theory, could you run the baseline and the
> patched sets both with THP disabled entirely?
Will try this. And is bisecting within this series possible?
> Can you elaborate more on what the workload is doing exactly?
These are simple benchmarks that measure the system and foreground
app/tab performance under memory pressure, e.g., [1]. They open a
bunch of apps/tabs (respectively on Android/ChromeOS) and switch
between them. At a given time, one of them is foreground and the rest
are background, obviously. When an app/tab has been in the background
for a while, userspace may call madvise(MADV_PAGEOUT) to reclaim (most
of) its LRU pages, leaving unmovable kernel memory there. This
strategy allows client systems to cache more apps/tabs in the
background and reduce their startup/switch time. But it's also a major
source of fragmentation (I'm sure you get why, so I won't go into
details here). Userspace also tries to make a better decision between
reclaim/compact/kill based on fragmentation, but that's not easy.
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/refs/heads/main/src/go.chromium.org/tast-tests/cros/local/bundles/cros/platform/memory_pressure.go
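To make that userspace strategy concrete, here is a minimal,
hypothetical sketch of the madvise(MADV_PAGEOUT) step described above;
the real Android/ChromeOS services are of course far more involved:

/*
 * Minimal illustration of proactive reclaim for a backgrounded region:
 * ask the kernel to reclaim the LRU pages of a mapping we no longer
 * expect to touch soon. Requires Linux 5.4+ for MADV_PAGEOUT.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21         /* reclaim these pages */
#endif

int main(void)
{
        size_t len = 64 << 20;  /* stand-in for a background app's heap */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(buf, 0xaa, len);         /* fault the pages in */

        /* "App went to the background": push its LRU pages out. */
        if (madvise(buf, len, MADV_PAGEOUT))
                perror("madvise(MADV_PAGEOUT)");

        munmap(buf, len);
        return 0;
}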
> What are
> the parameters of the test machine (CPUs, memory size)? It'd be
> helpful if I could reproduce this locally as well.
The data I shared previously is from an Intel i7-1255U + 4GB Chromebook.
More data attached -- it contains vmstat, zoneinfo and pagetypeinfo
files collected before the benchmark (after fresh reboots) and after
the benchmark.
On Wed, Jun 12, 2024 at 12:52:20PM -0600, Yu Zhao wrote:
> On Mon, Jun 10, 2024 at 9:28 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Tue, Jun 04, 2024 at 10:53:55PM -0600, Yu Zhao wrote:
> > > On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <[email protected]> wrote:
> > > > On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > > > > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <[email protected]> wrote:
> > > > > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > > > > This series significantly regresses Android and ChromeOS under memory
> > > > > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > > > > it was mentioned in the early discussions that potential regressions
> > > > > > > for such a case are somewhat expected?
> > > > > >
> > > > > > This is not expected for the 10 patches here. You might be referring
> > > > > > to the discussion around the huge page allocator series, which had
> > > > > > fallback restrictions and many changes to reclaim and compaction.
> > > > >
> > > > > Right, now I remember.
> > > > >
> > > > > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > > > > (full and some) on both platforms increased by over 20%. I could post
> > > > > > > the details of the benchmarks and the metrics they measure, but I
> > > > > > > doubt they would mean much to you. I did ask our test teams to save
> > > > > > > extra kernel logs that might be more helpful, and I could forward them
> > > > > > > to you.
> > > > > >
> > > > > > If the issue persists with the latest patches in -mm, a kernel config
> > > > > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > > > > before/during/after the problematic behavior would be very helpful.
> > > > >
> > > > > Assuming all the fixes were included, do you want the logs from 6.8?
> > > > > We have them available now.
> > > >
> > > > Yes, that would be helpful.
> > > >
> > > > If you have them, it would also be quite useful to have the vmstat
> > > > before-after-test delta from a good kernel, for baseline comparison.
> > >
> > > Sorry for taking this long -- I wanted to see if the regression is
> > > still reproducible on v6.9.
> > >
> > > Apparently we got the similar results on v6.9 with the following
> > > patches cherry-picked cleanly from v6.10-rc1:
> > >
> > > 1 mm: page_alloc: remove pcppage migratetype caching
> > > 2 mm: page_alloc: optimize free_unref_folios()
> > > 3 mm: page_alloc: fix up block types when merging compatible blocks
> > > 4 mm: page_alloc: move free pages when converting block during isolation
> > > 5 mm: page_alloc: fix move_freepages_block() range error
> > > 6 mm: page_alloc: fix freelist movement during block conversion
> > > 7 mm: page_alloc: close migratetype race between freeing and stealing
> > > 8 mm: page_alloc: set migratetype inside move_freepages()
> > > 9 mm: page_isolation: prepare for hygienic freelists
> > > 10 mm: page_alloc: consolidate free page accounting
> > > 11 mm: page_alloc: change move_freepages() to __move_freepages_block()
> > > 12 mm: page_alloc: batch vmstat updates in expand()
> > >
> > > Unfortunately I just realized that that automated benchmark didn't
> > > collect the kernel stats before it starts (since it always starts on a
> > > freshly booted device). While this is being fixed, I'm attaching the
> > > kernel stats collected after the benchmark finished. I grabbed 10 runs
> > > for each (baseline/patched), and if you need more, please let me know.
> > > (And we should have the stats before the benchmark soon.)
> >
> > Thanks for grabbing these, and sorry about the delay, I was traveling
> > last week.
> >
> > You mentioned "THPs are virtually non-existant". But the workload
> > doesn't seem to allocate anon THPs at all.
>
> Sorry for not being clear there: you are correct.
>
> I meant that client devices rarely use 2MB THPs or __GFP_COMP. (They
> simply can't due to both internal and external fragmentations, but we
> are trying!)
Ah, understood. So this is nominally a non-THP workload, and we're
suspecting a simple 4k allocation issue in low memory conditions.
Thanks for clarifying.
However, I don't think 4k alone would explain pressure just yet. PSI
is triggered by reclaim and compaction, but with this series, type
fallbacks are still allowed to the full extent before entering any
such remediation. The series merely fixes type safety and eliminates
avoidable/accidental mixing.
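For context, the fallback order referred to here is (roughly) the static
preference table in mm/page_alloc.c; a sketch, since the exact
declaration differs between kernel versions:

/*
 * Sketch of the allocator's fallback preference table: when the
 * freelists of the requested type are empty, these types are tried in
 * order. The series keeps this behavior; it only makes the resulting
 * freelist/pageblock bookkeeping type-safe.
 */
static int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = {
        [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE   },
        [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
        [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE   },
};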
So I'm thinking something else must still be going on. Either THP
(however limited its use in this workload), or the userspace feedback
mechanism you mention below...
> > For file THP, the patched
> > kernel's median for allocation success is 90% of baseline, but the
> > inter-run min/max deviation from the median is 85%/108% in baseline
> > and 85%/112% in patched, so this is quite noisy. Was
> > that initial comment regarding a different workload?
>
> No, in both cases (Android and ChromeOS) we tried, we were hoping the
> series could help with mTHP (64KB and 32KB). But we hit the
> regressions with their baseline (4KB). Again, 2MB THPs, if they are
> used, are reserved (allocated and mlocked to hold text/code sections
> after a reboot). So they shouldn't matter, and I highly doubt the
> regressions are because of them.
Ok.
> > This other data point has me stumped. Comparing medians, there is a
> > 1.5% reduction in anon refaults and a 4.8% increase in file
> > refaults. And indeed, there is more files and less anon being scanned.
> > I think this could explain the PSI delta, since AFAIK you have zram or
> > zswap, and anon decompression loads are cheaper than filesystem IO.
> >
> > The above patches don't do anything that directly influences the
> > anon-file reclaim balance. However, if file THPs fall back to 4k file
> > pages more, that *might* be able to explain a shift in reclaim
> > balance, if some hot subpages in those THPs were protecting colder
> > subpages from being reclaimed and refaulting.
> >
> > In that case, the root cause would still be a simple THP success rate
> > regression. To confirm this theory, could you run the baseline and the
> > patched sets both with THP disabled entirely?
>
> Will try this. And is bisecting within this series possible?
Yes. I built and put each commit incrementally through my test
machinery before sending them out. I can't vouch for all
configurations, of course, but I'd expect it to work.
> > Can you elaborate more on what the workload is doing exactly?
>
> These are simple benchmarks that measure the system and foreground
> app/tab performance under memory pressure, e.g., [1]. They open a
> bunch of apps/tabs (respectively on Android/ChromeOS) and switch
> between them. At a given time, one of them is foreground and the rest
> are background, obviously. When an app/tab has been in the background
> for a while, the userspace may call madvise(PAGEOUT) to reclaim (most
> of) its LRU pages, leaving unmovable kernel memory there. This
> strategy allows client systems to cache more apps/tabs in the
> background and reduce their startup/switch time. But it's also a major
> source of fragmentation (I'm sure you get why so I won't go into
> details here. And userspace also tries to make a better decision
> between reclaim/compact/kill based on fragmentation, but it's not
> easy.)
Thanks for the detailed explanation.
That last bit is interesting: how does it determine "fragmentation"?
The series might well affect this metric.
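As an aside, one hypothetical way a userspace policy could estimate
fragmentation is from /proc/buddyinfo, e.g. the share of free memory
sitting in blocks of at least pageblock order; a rough sketch, not the
actual Android/ChromeOS heuristic:

/*
 * Rough sketch: parse /proc/buddyinfo and report, per zone, how much of
 * the free memory is in blocks of order >= 9 (the usual pageblock order
 * with 4K pages). A low share suggests external fragmentation.
 */
#include <stdio.h>

#define MAX_ORDERS 16   /* more than the 11 columns typically shown */

int main(void)
{
        char line[512];
        FILE *f = fopen("/proc/buddyinfo", "r");

        if (!f) {
                perror("/proc/buddyinfo");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                char node[32], zone[32];
                unsigned long counts[MAX_ORDERS] = { 0 };
                unsigned long free_pages = 0, big_pages = 0;
                int order = 0, n;
                char *p = line;

                /* Lines look like: "Node 0, zone   Normal  <count per order>..." */
                if (sscanf(p, "Node %31[^,], zone %31s%n", node, zone, &n) != 2)
                        continue;
                p += n;

                while (order < MAX_ORDERS &&
                       sscanf(p, "%lu%n", &counts[order], &n) == 1) {
                        p += n;
                        order++;
                }

                for (int o = 0; o < order; o++) {
                        free_pages += counts[o] << o;
                        if (o >= 9)
                                big_pages += counts[o] << o;
                }

                if (free_pages)
                        printf("node %s, zone %-8s: %5.1f%% of free memory in >=order-9 blocks\n",
                               node, zone, 100.0 * big_pages / free_pages);
        }

        fclose(f);
        return 0;
}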
> [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/refs/heads/main/src/go.chromium.org/tast-tests/cros/local/bundles/cros/platform/memory_pressure.go
>
> > What are
> > the parameters of the test machine (CPUs, memory size)? It'd be
> > helpful if I could reproduce this locally as well.
>
> The data I shared previously is from an Intel i7-1255U + 4GB Chromebook.
>
> More data attached -- it contains vmstat, zoneinfo and pagetypeinfo
> files collected before the benchmark (after fresh reboots) and after
> the benchmark.
Thanks, I'll take a look.