2024-03-20 02:42:44

by kaiyang2

Subject: [RFC PATCH 0/7] mm: providing ample physical memory contiguity by confining unmovable allocations

From: Kaiyang Zhao <[email protected]>

Memory capacity has increased dramatically over the last decades.
Meanwhile, TLB capacity has stagnated, causing a significant virtual
address translation overhead. As a collaboration between Carnegie Mellon
University and Meta, we investigated the issue at Meta’s datacenters and
found that about 20% of CPU cycles are spent doing page walks [1], and
similar results are also reported by Google [2].

To tackle the overhead, we need widespread use of huge pages. And huge
pages, when they can actually be created, work wonders: they provide up
to 18% higher performance for Meta’s production workloads in our
experiments [1].

However, we observed that huge pages through THP are unreliable because
sufficient physical contiguity may not exist and compaction to recover
from memory fragmentation frequently fails. To ensure workloads get a
reasonable number of huge pages, Meta could not rely on THP and had to
use reserved huge pages. Proposals to add 1GB THP support [5] are even
more dependent on ample availability of physical contiguity.

A major reason for the lack of physical contiguity is the mixing of
unmovable and movable allocations, causing compaction to fail. Quoting
from [3], “in a broad sample of Meta servers, we find that unmovable
allocations make up less than 7% of total memory on average, yet occupy
34% of the 2M blocks in the system. We also found that this effect isn't
correlated with high uptimes, and that servers can get heavily
fragmented within the first hour of running a workload.”

Our proposed solution is to confine the unmovable allocations to a
separate region in physical memory. We experimented with using a CMA
region for the movable allocations, but in this version we use
ZONE_MOVABLE for movable and all other zones for unmovable allocations.
Movable allocations can temporarily reside in the unmovable zones, but
will be proactively moved out by compaction.

To resize ZONE_MOVABLE, we still rely on memory hotplug interfaces. We
export the number of pages scanned on behalf of movable or unmovable
allocations during reclaim to approximate the memory pressure in two
parts of physical memory, and a userspace tool can monitor the metrics
and make resizing decisions. Previously we augmented the PSI interface
to break down memory pressure into movable and unmovable allocation
types, but that approach enlarges the scheduler cacheline footprint.
From our preliminary observations, the per-allocation-type scan
counters, with a little tuning, are sufficient to tell whether there is
enough memory for unmovable allocations and to drive resizing
decisions.
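
As a rough illustration only (not part of this series), such a
userspace tool could poll the new scan counters in /proc/vmstat and
react when reclaim is repeatedly done on behalf of unmovable
allocations. The counter name and sysfs paths below come from this
series and the existing memory hotplug interface; the polling interval,
threshold and policy are made up:

/* Hypothetical sketch of a ZONE_MOVABLE resizing monitor. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read one named counter, e.g. "pgscan_by_unmovable", from /proc/vmstat. */
static unsigned long read_vmstat(const char *name)
{
	char line[256], key[128];
	unsigned long v, val = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%127s %lu", key, &v) == 2 &&
		    !strcmp(key, name)) {
			val = v;
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long prev = read_vmstat("pgscan_by_unmovable");

	for (;;) {
		unsigned long cur, delta;

		sleep(10);
		cur = read_vmstat("pgscan_by_unmovable");
		delta = cur - prev;
		prev = cur;

		/*
		 * Made-up threshold: sustained scanning on behalf of
		 * unmovable allocations suggests the unmovable zones are
		 * too small.  A real tool would pick a memory block next
		 * to the boundary reported by
		 * /sys/devices/system/node/nodeN/movable_zone and write
		 * "offline" followed by "online_kernel" (or
		 * "online_movable" for the opposite direction) to
		 * /sys/devices/system/memory/memoryM/state.
		 */
		if (delta > 10000)
			printf("unmovable pressure: %lu pages scanned\n",
			       delta);
	}
	return 0;
}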

This patch extends the idea of migratetype isolation at pageblock
granularity posted earlier [3] by Johannes Weiner to an
as-large-as-needed region to better support huge pages of bigger sizes
and hardware TLB coalescing. We’re looking for feedback on the overall
direction, particularly in relation to the recent THP allocator
optimization proposal [4].

The patches are based on 6.4 and are also available on github at
https://github.com/magickaiyang/kernel-contiguous/tree/per_alloc_type_reclaim_counters_oct052023

Kaiyang Zhao (7):
sysfs interface for the boundary of movable zone
Disallows high-order movable allocations in other zones if
ZONE_MOVABLE is populated
compaction accepts a destination zone
vmstat counter for pages migrated across zones
proactively move pages out of unmovable zones in kcompactd
pass gfp mask of the allocation that woke kswapd to track number of
pages scanned on behalf of each alloc type
exports the number of pages scanned on behalf of movable/unmovable
allocations

drivers/base/memory.c | 2 +-
drivers/base/node.c | 32 ++++++
include/linux/compaction.h | 4 +-
include/linux/memory.h | 1 +
include/linux/mmzone.h | 1 +
include/linux/vm_event_item.h | 6 +
mm/compaction.c | 209 ++++++++++++++++++++++++++--------
mm/internal.h | 1 +
mm/page_alloc.c | 10 ++
mm/vmscan.c | 28 ++++-
mm/vmstat.c | 14 ++-
11 files changed, 249 insertions(+), 59 deletions(-)

--
2.40.1



2024-03-20 02:42:48

by kaiyang2

Subject: [RFC PATCH 1/7] sysfs interface for the boundary of movable zone

From: Kaiyang Zhao <[email protected]>

Exports the pfn and memory block id of the ZONE_MOVABLE boundary via a
new per-node sysfs attribute.
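
For illustration, reading the new attribute gives output of the
following form (the values are made up; on this hypothetical node
ZONE_NORMAL ends where ZONE_MOVABLE starts):

$ cat /sys/devices/system/node/node0/movable_zone
movable_zone_start_pfn 9437184
movable_zone_start_block_id 288
unmovable_zone_end_pfn 9437184
unmovable_zone_end_block_id 288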

Signed-off-by: Kaiyang Zhao <[email protected]>
---
drivers/base/memory.c | 2 +-
drivers/base/node.c | 32 ++++++++++++++++++++++++++++++++
include/linux/memory.h | 1 +
3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b456ac213610..281b229d7019 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -55,7 +55,7 @@ static inline unsigned long memory_block_id(unsigned long section_nr)
return section_nr / sections_per_block;
}

-static inline unsigned long pfn_to_block_id(unsigned long pfn)
+unsigned long pfn_to_block_id(unsigned long pfn)
{
return memory_block_id(pfn_to_section_nr(pfn));
}
diff --git a/drivers/base/node.c b/drivers/base/node.c
index b46db17124f3..f29ce07565ba 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -486,6 +486,37 @@ static ssize_t node_read_meminfo(struct device *dev,
#undef K
static DEVICE_ATTR(meminfo, 0444, node_read_meminfo, NULL);

+static ssize_t node_read_movable_zone(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int len = 0;
+ struct zone *unmovable_zone;
+ unsigned long movable_start_pfn, unmovable_end_pfn;
+ unsigned long movable_start_block_id, unmovable_end_block_id;
+
+ movable_start_pfn = NODE_DATA(dev->id)->node_zones[ZONE_MOVABLE].zone_start_pfn;
+ movable_start_block_id = pfn_to_block_id(movable_start_pfn);
+
+ if (populated_zone(&(NODE_DATA(dev->id)->node_zones[ZONE_NORMAL])))
+ unmovable_zone = &(NODE_DATA(dev->id)->node_zones[ZONE_NORMAL]);
+ else
+ unmovable_zone = &(NODE_DATA(dev->id)->node_zones[ZONE_DMA32]);
+
+ unmovable_end_pfn = zone_end_pfn(unmovable_zone);
+ unmovable_end_block_id = pfn_to_block_id(unmovable_end_pfn);
+
+ len = sysfs_emit_at(buf, len,
+ "movable_zone_start_pfn %lu\n"
+ "movable_zone_start_block_id %lu\n"
+ "unmovable_zone_end_pfn %lu\n"
+ "unmovable_zone_end_block_id %lu\n",
+ movable_start_pfn, movable_start_block_id,
+ unmovable_end_pfn, unmovable_end_block_id);
+
+ return len;
+}
+static DEVICE_ATTR(movable_zone, 0444, node_read_movable_zone, NULL);
+
static ssize_t node_read_numastat(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -565,6 +596,7 @@ static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);

static struct attribute *node_dev_attrs[] = {
&dev_attr_meminfo.attr,
+ &dev_attr_movable_zone.attr,
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 31343566c221..17a92a5c1ae5 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -92,6 +92,7 @@ struct memory_block {
int arch_get_memory_phys_device(unsigned long start_pfn);
unsigned long memory_block_size_bytes(void);
int set_memory_block_size_order(unsigned int order);
+unsigned long pfn_to_block_id(unsigned long pfn);

/* These states are exposed to userspace as text strings in sysfs */
#define MEM_ONLINE (1<<0) /* exposed to userspace */
--
2.40.1


2024-03-20 02:42:48

by kaiyang2

Subject: [RFC PATCH 2/7] Disallows high-order movable allocations in other zones if ZONE_MOVABLE is populated

From: Kaiyang Zhao <[email protected]>

Use ZONE_MOVABLE exclusively for movable allocations of non-zero order
when it is populated on the node.

Signed-off-by: Kaiyang Zhao <[email protected]>
---
mm/page_alloc.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..9ad9357e340a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3403,6 +3403,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
struct page *page;
unsigned long mark;

+ /*
+ * Disallows high-order movable allocations in other zones if
+ * ZONE_MOVABLE is populated on this node.
+ */
+ if (ac->highest_zoneidx >= ZONE_MOVABLE &&
+ order > 0 &&
+ zone_idx(zone) != ZONE_MOVABLE &&
+ populated_zone(&(zone->zone_pgdat->node_zones[ZONE_MOVABLE])))
+ continue;
+
if (cpusets_enabled() &&
(alloc_flags & ALLOC_CPUSET) &&
!__cpuset_zone_allowed(zone, gfp_mask))
--
2.40.1


2024-03-20 02:43:07

by kaiyang2

Subject: [RFC PATCH 4/7] vmstat counter for pages migrated across zones

From: Kaiyang Zhao <[email protected]>

Add a counter for the number of pages migrated across zones in vmstat
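
The counter shows up in /proc/vmstat, for example (value is made up
for illustration):

compact_cross_zone_migrated 12345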

Signed-off-by: Kaiyang Zhao <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/compaction.c | 2 ++
mm/vmstat.c | 1 +
3 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..be88819085b6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -79,6 +79,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
KCOMPACTD_WAKE,
KCOMPACTD_MIGRATE_SCANNED, KCOMPACTD_FREE_SCANNED,
+ COMPACT_CROSS_ZONE_MIGRATED,
#endif
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/compaction.c b/mm/compaction.c
index 03b5c4debc17..dea10ad8ec64 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2552,6 +2552,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)

count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
+ if (dst_zone != cc->zone)
+ count_compact_events(COMPACT_CROSS_ZONE_MIGRATED, nr_succeeded);

trace_mm_compaction_end(cc, start_pfn, end_pfn, sync, ret);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c28046371b45..98af82e65ad9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1324,6 +1324,7 @@ const char * const vmstat_text[] = {
"compact_daemon_wake",
"compact_daemon_migrate_scanned",
"compact_daemon_free_scanned",
+ "compact_cross_zone_migrated",
#endif

#ifdef CONFIG_HUGETLB_PAGE
--
2.40.1


2024-03-20 02:43:18

by kaiyang2

Subject: [RFC PATCH 3/7] compaction accepts a destination zone

From: Kaiyang Zhao <[email protected]>

Distinguishes between the source and destination zones in compaction,
so that free pages can be taken from a different zone than the one
being compacted.

Signed-off-by: Kaiyang Zhao <[email protected]>
---
include/linux/compaction.h | 4 +-
mm/compaction.c | 106 +++++++++++++++++++++++--------------
mm/internal.h | 1 +
mm/vmscan.c | 4 +-
4 files changed, 70 insertions(+), 45 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a6e512cfb670..11f5a1a83abb 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -90,7 +90,7 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern enum compact_result compaction_suitable(struct zone *zone, int order,
- unsigned int alloc_flags, int highest_zoneidx);
+ unsigned int alloc_flags, int highest_zoneidx, struct zone *dst_zone);

extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -180,7 +180,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}

static inline enum compact_result compaction_suitable(struct zone *zone, int order,
- int alloc_flags, int highest_zoneidx)
+ int alloc_flags, int highest_zoneidx, struct zone *dst_zone)
{
return COMPACT_SKIPPED;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index c8bcdea15f5f..03b5c4debc17 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -435,7 +435,7 @@ static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long pfn)
{
- struct zone *zone = cc->zone;
+ struct zone *dst_zone = cc->dst_zone ? cc->dst_zone : cc->zone;

if (cc->no_set_skip_hint)
return;
@@ -446,8 +446,8 @@ static void update_pageblock_skip(struct compact_control *cc,
set_pageblock_skip(page);

/* Update where async and sync compaction should restart */
- if (pfn < zone->compact_cached_free_pfn)
- zone->compact_cached_free_pfn = pfn;
+ if (pfn < dst_zone->compact_cached_free_pfn)
+ dst_zone->compact_cached_free_pfn = pfn;
}
#else
static inline bool isolation_suitable(struct compact_control *cc,
@@ -550,6 +550,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
bool locked = false;
unsigned long blockpfn = *start_pfn;
unsigned int order;
+ struct zone *dst_zone = cc->dst_zone ? cc->dst_zone : cc->zone;

/* Strict mode is for isolation, speed is secondary */
if (strict)
@@ -568,7 +569,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
* pending.
*/
if (!(blockpfn % COMPACT_CLUSTER_MAX)
- && compact_unlock_should_abort(&cc->zone->lock, flags,
+ && compact_unlock_should_abort(&dst_zone->lock, flags,
&locked, cc))
break;

@@ -596,7 +597,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,

/* If we already hold the lock, we can skip some rechecking. */
if (!locked) {
- locked = compact_lock_irqsave(&cc->zone->lock,
+ locked = compact_lock_irqsave(&dst_zone->lock,
&flags, cc);

/* Recheck this is a buddy page under lock */
@@ -634,7 +635,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
}

if (locked)
- spin_unlock_irqrestore(&cc->zone->lock, flags);
+ spin_unlock_irqrestore(&dst_zone->lock, flags);

/*
* There is a tiny chance that we have read bogus compound_order(),
@@ -683,11 +684,12 @@ isolate_freepages_range(struct compact_control *cc,
{
unsigned long isolated, pfn, block_start_pfn, block_end_pfn;
LIST_HEAD(freelist);
+ struct zone *dst_zone = cc->dst_zone ? cc->dst_zone : cc->zone;

pfn = start_pfn;
block_start_pfn = pageblock_start_pfn(pfn);
- if (block_start_pfn < cc->zone->zone_start_pfn)
- block_start_pfn = cc->zone->zone_start_pfn;
+ if (block_start_pfn < dst_zone->zone_start_pfn)
+ block_start_pfn = dst_zone->zone_start_pfn;
block_end_pfn = pageblock_end_pfn(pfn);

for (; pfn < end_pfn; pfn += isolated,
@@ -710,7 +712,7 @@ isolate_freepages_range(struct compact_control *cc,
}

if (!pageblock_pfn_to_page(block_start_pfn,
- block_end_pfn, cc->zone))
+ block_end_pfn, dst_zone))
break;

isolated = isolate_freepages_block(cc, &isolate_start_pfn,
@@ -1359,6 +1361,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
{
unsigned long start_pfn, end_pfn;
struct page *page;
+ struct zone *dst_zone = cc->dst_zone ? cc->dst_zone : cc->zone;

/* Do not search around if there are enough pages already */
if (cc->nr_freepages >= cc->nr_migratepages)
@@ -1369,10 +1372,10 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
return;

/* Pageblock boundaries */
- start_pfn = max(pageblock_start_pfn(pfn), cc->zone->zone_start_pfn);
- end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone));
+ start_pfn = max(pageblock_start_pfn(pfn), dst_zone->zone_start_pfn);
+ end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(dst_zone));

- page = pageblock_pfn_to_page(start_pfn, end_pfn, cc->zone);
+ page = pageblock_pfn_to_page(start_pfn, end_pfn, dst_zone);
if (!page)
return;

@@ -1414,6 +1417,7 @@ fast_isolate_freepages(struct compact_control *cc)
struct page *page = NULL;
bool scan_start = false;
int order;
+ struct zone *dst_zone = cc->dst_zone ? cc->dst_zone : cc->zone;

/* Full compaction passes in a negative order */
if (cc->order <= 0)
@@ -1423,7 +1427,7 @@ fast_isolate_freepages(struct compact_control *cc)
* If starting the scan, use a deeper search and use the highest
* PFN found if a suitable one is not found.
*/
- if (cc->free_pfn >= cc->zone->compact_init_free_pfn) {
+ if (cc->free_pfn >= dst_zone->compact_init_free_pfn) {
limit = pageblock_nr_pages >> 1;
scan_start = true;
}
@@ -1448,7 +1452,7 @@ fast_isolate_freepages(struct compact_control *cc)
for (order = cc->search_order;
!page && order >= 0;
order = next_search_order(cc, order)) {
- struct free_area *area = &cc->zone->free_area[order];
+ struct free_area *area = &dst_zone->free_area[order];
struct list_head *freelist;
struct page *freepage;
unsigned long flags;
@@ -1458,7 +1462,7 @@ fast_isolate_freepages(struct compact_control *cc)
if (!area->nr_free)
continue;

- spin_lock_irqsave(&cc->zone->lock, flags);
+ spin_lock_irqsave(&dst_zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
list_for_each_entry_reverse(freepage, freelist, lru) {
unsigned long pfn;
@@ -1469,7 +1473,7 @@ fast_isolate_freepages(struct compact_control *cc)

if (pfn >= highest)
highest = max(pageblock_start_pfn(pfn),
- cc->zone->zone_start_pfn);
+ dst_zone->zone_start_pfn);

if (pfn >= low_pfn) {
cc->fast_search_fail = 0;
@@ -1516,7 +1520,7 @@ fast_isolate_freepages(struct compact_control *cc)
}
}

- spin_unlock_irqrestore(&cc->zone->lock, flags);
+ spin_unlock_irqrestore(&dst_zone->lock, flags);

/*
* Smaller scan on next order so the total scan is related
@@ -1541,17 +1545,17 @@ fast_isolate_freepages(struct compact_control *cc)
if (cc->direct_compaction && pfn_valid(min_pfn)) {
page = pageblock_pfn_to_page(min_pfn,
min(pageblock_end_pfn(min_pfn),
- zone_end_pfn(cc->zone)),
- cc->zone);
+ zone_end_pfn(dst_zone)),
+ dst_zone);
cc->free_pfn = min_pfn;
}
}
}
}

- if (highest && highest >= cc->zone->compact_cached_free_pfn) {
+ if (highest && highest >= dst_zone->compact_cached_free_pfn) {
highest -= pageblock_nr_pages;
- cc->zone->compact_cached_free_pfn = highest;
+ dst_zone->compact_cached_free_pfn = highest;
}

cc->total_free_scanned += nr_scanned;
@@ -1569,7 +1573,7 @@ fast_isolate_freepages(struct compact_control *cc)
*/
static void isolate_freepages(struct compact_control *cc)
{
- struct zone *zone = cc->zone;
+ struct zone *zone = cc->dst_zone ? cc->dst_zone : cc->zone;
struct page *page;
unsigned long block_start_pfn; /* start of current pageblock */
unsigned long isolate_start_pfn; /* exact pfn we start at */
@@ -2089,11 +2093,19 @@ static enum compact_result __compact_finished(struct compact_control *cc)
unsigned int order;
const int migratetype = cc->migratetype;
int ret;
+ struct zone *dst_zone = cc->dst_zone ? cc->dst_zone : cc->zone;

- /* Compaction run completes if the migrate and free scanner meet */
- if (compact_scanners_met(cc)) {
+ /*
+ * Compaction run completes if the migrate and free scanner meet
+ * or when either the src or dst zone has been completely scanned
+ */
+ if (compact_scanners_met(cc) ||
+ cc->migrate_pfn >= zone_end_pfn(cc->zone) ||
+ cc->free_pfn < dst_zone->zone_start_pfn) {
/* Let the next compaction start anew. */
reset_cached_positions(cc->zone);
+ if (cc->dst_zone)
+ reset_cached_positions(cc->dst_zone);

/*
* Mark that the PG_migrate_skip information should be cleared
@@ -2196,10 +2208,13 @@ static enum compact_result compact_finished(struct compact_control *cc)
static enum compact_result __compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
int highest_zoneidx,
- unsigned long wmark_target)
+ unsigned long wmark_target, struct zone *dst_zone)
{
unsigned long watermark;

+ if (!dst_zone)
+ dst_zone = zone;
+
if (is_via_compact_memory(order))
return COMPACT_CONTINUE;

@@ -2227,9 +2242,9 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
* suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
- low_wmark_pages(zone) : min_wmark_pages(zone);
+ low_wmark_pages(dst_zone) : min_wmark_pages(dst_zone);
watermark += compact_gap(order);
- if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
+ if (!__zone_watermark_ok(dst_zone, 0, watermark, highest_zoneidx,
ALLOC_CMA, wmark_target))
return COMPACT_SKIPPED;

@@ -2245,13 +2260,16 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
*/
enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
- int highest_zoneidx)
+ int highest_zoneidx, struct zone *dst_zone)
{
enum compact_result ret;
int fragindex;

+ if (!dst_zone)
+ dst_zone = zone;
+
ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
- zone_page_state(zone, NR_FREE_PAGES));
+ zone_page_state(dst_zone, NR_FREE_PAGES), dst_zone);
/*
* fragmentation index determines if allocation failures are due to
* low memory or external fragmentation
@@ -2305,7 +2323,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
compact_result = __compaction_suitable(zone, order, alloc_flags,
- ac->highest_zoneidx, available);
+ ac->highest_zoneidx, available, NULL);
if (compact_result == COMPACT_CONTINUE)
return true;
}
@@ -2317,8 +2335,9 @@ static enum compact_result
compact_zone(struct compact_control *cc, struct capture_control *capc)
{
enum compact_result ret;
+ struct zone *dst_zone = cc->dst_zone ? cc->dst_zone : cc->zone;
unsigned long start_pfn = cc->zone->zone_start_pfn;
- unsigned long end_pfn = zone_end_pfn(cc->zone);
+ unsigned long end_pfn = zone_end_pfn(dst_zone);
unsigned long last_migrated_pfn;
const bool sync = cc->mode != MIGRATE_ASYNC;
bool update_cached;
@@ -2337,7 +2356,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)

cc->migratetype = gfp_migratetype(cc->gfp_mask);
ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
- cc->highest_zoneidx);
+ cc->highest_zoneidx, dst_zone);
/* Compaction is likely to fail */
if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
return ret;
@@ -2346,14 +2365,19 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
* Clear pageblock skip if there were failures recently and compaction
* is about to be retried after being deferred.
*/
- if (compaction_restarting(cc->zone, cc->order))
+ if (compaction_restarting(cc->zone, cc->order)) {
__reset_isolation_suitable(cc->zone);
+ if (dst_zone != cc->zone)
+ __reset_isolation_suitable(dst_zone);
+ }

/*
* Setup to move all movable pages to the end of the zone. Used cached
* information on where the scanners should start (unless we explicitly
* want to compact the whole zone), but check that it is initialised
* by ensuring the values are within zone boundaries.
+ *
+ * If a destination zone is provided, use it for free pages.
*/
cc->fast_start_pfn = 0;
if (cc->whole_zone) {
@@ -2361,12 +2385,12 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
} else {
cc->migrate_pfn = cc->zone->compact_cached_migrate_pfn[sync];
- cc->free_pfn = cc->zone->compact_cached_free_pfn;
- if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
+ cc->free_pfn = dst_zone->compact_cached_free_pfn;
+ if (cc->free_pfn < dst_zone->zone_start_pfn || cc->free_pfn >= end_pfn) {
cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
- cc->zone->compact_cached_free_pfn = cc->free_pfn;
+ dst_zone->compact_cached_free_pfn = cc->free_pfn;
}
- if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
+ if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= zone_end_pfn(cc->zone)) {
cc->migrate_pfn = start_pfn;
cc->zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
cc->zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
@@ -2522,8 +2546,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
* Only go back, not forward. The cached pfn might have been
* already reset to zone end in compact_finished()
*/
- if (free_pfn > cc->zone->compact_cached_free_pfn)
- cc->zone->compact_cached_free_pfn = free_pfn;
+ if (free_pfn > dst_zone->compact_cached_free_pfn)
+ dst_zone->compact_cached_free_pfn = free_pfn;
}

count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
@@ -2834,7 +2858,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
continue;

if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
- highest_zoneidx) == COMPACT_CONTINUE)
+ highest_zoneidx, NULL) == COMPACT_CONTINUE)
return true;
}

@@ -2871,7 +2895,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
if (compaction_deferred(zone, cc.order))
continue;

- if (compaction_suitable(zone, cc.order, 0, zoneid) !=
+ if (compaction_suitable(zone, cc.order, 0, zoneid, NULL) !=
COMPACT_CONTINUE)
continue;

diff --git a/mm/internal.h b/mm/internal.h
index 68410c6d97ac..349223cc0359 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -465,6 +465,7 @@ struct compact_control {
unsigned long migrate_pfn;
unsigned long fast_start_pfn; /* a pfn to start linear scan from */
struct zone *zone;
+ struct zone *dst_zone; /* use another zone as the destination */
unsigned long total_migrate_scanned;
unsigned long total_free_scanned;
unsigned short fast_search_fail;/* failures to use free list searches */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5bf98d0a22c9..aa21da983804 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6383,7 +6383,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
if (!managed_zone(zone))
continue;

- switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
+ switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx, NULL)) {
case COMPACT_SUCCESS:
case COMPACT_CONTINUE:
return false;
@@ -6580,7 +6580,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
unsigned long watermark;
enum compact_result suitable;

- suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx);
+ suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx, NULL);
if (suitable == COMPACT_SUCCESS)
/* Allocation should succeed already. Don't reclaim. */
return true;
--
2.40.1


2024-03-20 02:43:49

by kaiyang2

Subject: [RFC PATCH 6/7] pass gfp mask of the allocation that woke kswapd to track number of pages scanned on behalf of each alloc type

From: Kaiyang Zhao <[email protected]>

In preparation for exporting the number of pages scanned for each alloc
type

Signed-off-by: Kaiyang Zhao <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/vmscan.c | 13 +++++++++++--
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4889c9d4055..abc9f1623c82 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1288,6 +1288,7 @@ typedef struct pglist_data {
struct task_struct *kswapd; /* Protected by kswapd_lock */
int kswapd_order;
enum zone_type kswapd_highest_zoneidx;
+ gfp_t kswapd_gfp;

int kswapd_failures; /* Number of 'reclaimed == 0' runs */

diff --git a/mm/vmscan.c b/mm/vmscan.c
index aa21da983804..ed0f47e2e810 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7330,7 +7330,7 @@ clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx)
* or lower is eligible for reclaim until at least one usable zone is
* balanced.
*/
-static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx, gfp_t gfp_mask)
{
int i;
unsigned long nr_soft_reclaimed;
@@ -7345,6 +7345,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
.order = order,
.may_unmap = 1,
};
+ if (is_migrate_movable(gfp_migratetype(gfp_mask)))
+ sc.gfp_mask |= __GFP_MOVABLE;

set_task_reclaim_state(current, &sc.reclaim_state);
psi_memstall_enter(&pflags);
@@ -7659,6 +7661,7 @@ static int kswapd(void *p)
pg_data_t *pgdat = (pg_data_t *)p;
struct task_struct *tsk = current;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+ gfp_t gfp_mask;

if (!cpumask_empty(cpumask))
set_cpus_allowed_ptr(tsk, cpumask);
@@ -7680,6 +7683,7 @@ static int kswapd(void *p)

WRITE_ONCE(pgdat->kswapd_order, 0);
WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+ WRITE_ONCE(pgdat->kswapd_gfp, 0);
atomic_set(&pgdat->nr_writeback_throttled, 0);
for ( ; ; ) {
bool ret;
@@ -7687,6 +7691,7 @@ static int kswapd(void *p)
alloc_order = reclaim_order = READ_ONCE(pgdat->kswapd_order);
highest_zoneidx = kswapd_highest_zoneidx(pgdat,
highest_zoneidx);
+ gfp_mask = READ_ONCE(pgdat->kswapd_gfp);

kswapd_try_sleep:
kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
@@ -7696,8 +7701,10 @@ static int kswapd(void *p)
alloc_order = READ_ONCE(pgdat->kswapd_order);
highest_zoneidx = kswapd_highest_zoneidx(pgdat,
highest_zoneidx);
+ gfp_mask = READ_ONCE(pgdat->kswapd_gfp);
WRITE_ONCE(pgdat->kswapd_order, 0);
WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+ WRITE_ONCE(pgdat->kswapd_gfp, 0);

ret = try_to_freeze();
if (kthread_should_stop())
@@ -7721,7 +7728,7 @@ static int kswapd(void *p)
trace_mm_vmscan_kswapd_wake(pgdat->node_id, highest_zoneidx,
alloc_order);
reclaim_order = balance_pgdat(pgdat, alloc_order,
- highest_zoneidx);
+ highest_zoneidx, gfp_mask);
if (reclaim_order < alloc_order)
goto kswapd_try_sleep;
}
@@ -7759,6 +7766,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
if (READ_ONCE(pgdat->kswapd_order) < order)
WRITE_ONCE(pgdat->kswapd_order, order);

+ WRITE_ONCE(pgdat->kswapd_gfp, gfp_flags);
+
if (!waitqueue_active(&pgdat->kswapd_wait))
return;

--
2.40.1


2024-03-20 02:53:20

by kaiyang2

Subject: [RFC PATCH 7/7] exports the number of pages scanned on behalf of movable/unmovable allocations

From: Kaiyang Zhao <[email protected]>

Exports the number of pages scanned on behalf of movable/unmovable
allocations in vmstat.
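
The counters appear in /proc/vmstat, for example (values are made up
for illustration):

pgscan_by_movable 5259220
pgscan_by_unmovable 173503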

Signed-off-by: Kaiyang Zhao <[email protected]>
---
include/linux/vm_event_item.h | 2 ++
mm/vmscan.c | 11 +++++++++++
mm/vmstat.c | 2 ++
3 files changed, 15 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index c9183117c8f7..dcfff56c6d29 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -50,6 +50,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGSCAN_DIRECT_THROTTLE,
PGSCAN_ANON,
PGSCAN_FILE,
+ PGSCAN_MOVABLE, /* number of pages scanned on behalf of a movable allocation */
+ PGSCAN_UNMOVABLE,
PGSTEAL_ANON,
PGSTEAL_FILE,
#ifdef CONFIG_NUMA
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ed0f47e2e810..4eadf0254918 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -904,6 +904,12 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
cond_resched();
}

+ /* Arbitrarily consider 16 pages scanned */
+ if (is_migrate_movable(gfp_migratetype(shrinkctl->gfp_mask)))
+ count_vm_events(PGSCAN_MOVABLE, 16);
+ else
+ count_vm_events(PGSCAN_UNMOVABLE, 16);
+
/*
* The deferred work is increased by any new work (delta) that wasn't
* done, decreased by old deferred work that was done now.
@@ -2580,6 +2586,11 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
__count_vm_events(PGSCAN_ANON + file, nr_scanned);

+ if (is_migrate_movable(gfp_migratetype(sc->gfp_mask)))
+ __count_vm_events(PGSCAN_MOVABLE, nr_scanned);
+ else
+ __count_vm_events(PGSCAN_UNMOVABLE, nr_scanned);
+
spin_unlock_irq(&lruvec->lru_lock);

if (nr_taken == 0)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 444740605f2f..56062d53a36c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1281,6 +1281,8 @@ const char * const vmstat_text[] = {
"pgscan_direct_throttle",
"pgscan_anon",
"pgscan_file",
+ "pgscan_by_movable",
+ "pgscan_by_unmovable",
"pgsteal_anon",
"pgsteal_file",

--
2.40.1


2024-03-20 02:54:35

by Zi Yan

Subject: Re: [RFC PATCH 0/7] mm: providing ample physical memory contiguity by confining unmovable allocations

On 19 Mar 2024, at 22:42, [email protected] wrote:

> From: Kaiyang Zhao <[email protected]>
>
> Memory capacity has increased dramatically over the last decades.
> Meanwhile, TLB capacity has stagnated, causing a significant virtual
> address translation overhead. As a collaboration between Carnegie Mellon
> University and Meta, we investigated the issue at Meta’s datacenters and
> found that about 20% of CPU cycles are spent doing page walks [1], and
> similar results are also reported by Google [2].
>
> To tackle the overhead, we need widespread uses of huge pages. And huge
> pages, when they can actually be created, work wonders: they provide up
> to 18% higher performance for Meta’s production workloads in our
> experiments [1].
>
> However, we observed that huge pages through THP are unreliable because
> sufficient physical contiguity may not exist and compaction to recover
> from memory fragmentation frequently fails. To ensure workloads get a
> reasonable number of huge pages, Meta could not rely on THP and had to
> use reserved huge pages. Proposals to add 1GB THP support [5] are even
> more dependent on ample availability of physical contiguity.
>
> A major reason for the lack of physical contiguity is the mixing of
> unmovable and movable allocations, causing compaction to fail. Quoting
> from [3], “in a broad sample of Meta servers, we find that unmovable
> allocations make up less than 7% of total memory on average, yet occupy
> 34% of the 2M blocks in the system. We also found that this effect isn't
> correlated with high uptimes, and that servers can get heavily
> fragmented within the first hour of running a workload.”
>
> Our proposed solution is to confine the unmovable allocations to a
> separate region in physical memory. We experimented with using a CMA
> region for the movable allocations, but in this version we use
> ZONE_MOVABLE for movable and all other zones for unmovable allocations.
> Movable allocations can temporarily reside in the unmovable zones, but
> will be proactively moved out by compaction.
>
> To resize ZONE_MOVABLE, we still rely on memory hotplug interfaces. We
> export the number of pages scanned on behalf of movable or unmovable
> allocations during reclaim to approximate the memory pressure in two
> parts of physical memory, and a userspace tool can monitor the metrics
> and make resizing decisions. Previously we augmented the PSI interface
> to break down memory pressure into movable and unmovable allocation
> types, but that approach enlarges the scheduler cacheline footprint.
> From our preliminary observations, just looking at the per-allocation
> type scanned counters and with a little tuning, it is sufficient to tell
> if there is not enough memory for unmovable allocations and make
> resizing decisions.
>
> This patch extends the idea of migratetype isolation at pageblock
> granularity posted earlier [3] by Johannes Weiner to an
> as-large-as-needed region to better support huge pages of bigger sizes
> and hardware TLB coalescing. We’re looking for feedback on the overall
> direction, particularly in relation to the recent THP allocator
> optimization proposal [4].
>
> The patches are based on 6.4 and are also available on github at
> https://github.com/magickaiyang/kernel-contiguous/tree/per_alloc_type_reclaim_counters_oct052023

Your reference links (1 to 4) are missing.

--
Best Regards,
Yan, Zi



2024-03-20 03:02:00

by kaiyang2

Subject: [RFC PATCH 5/7] proactively move pages out of unmovable zones in kcompactd

From: Kaiyang Zhao <[email protected]>

Proactively move pages out of unmovable zones in kcompactd.
Adds counters for cross-zone compaction starts and scans.
Debug only: zone start and end pfn are printed in /proc/zoneinfo.
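
For illustration, each zone stanza in /proc/zoneinfo then additionally
reports its pfn range right after the cma field, e.g. (values are made
up):

        cma      0
        start    2359296
        end      9437184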

Signed-off-by: Kaiyang Zhao <[email protected]>
---
include/linux/vm_event_item.h | 3 +
mm/compaction.c | 101 +++++++++++++++++++++++++++++++---
mm/vmstat.c | 11 +++-
3 files changed, 104 insertions(+), 11 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index be88819085b6..c9183117c8f7 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -80,6 +80,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KCOMPACTD_WAKE,
KCOMPACTD_MIGRATE_SCANNED, KCOMPACTD_FREE_SCANNED,
COMPACT_CROSS_ZONE_MIGRATED,
+ KCOMPACTD_CROSS_ZONE_START,
+ COMPACT_CROSS_ZONE_MIGRATE_SCANNED,
+ COMPACT_CROSS_ZONE_FREE_SCANNED,
#endif
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/compaction.c b/mm/compaction.c
index dea10ad8ec64..94ce1282f17b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1436,7 +1436,10 @@ fast_isolate_freepages(struct compact_control *cc)
* Preferred point is in the top quarter of the scan space but take
* a pfn from the top half if the search is problematic.
*/
- distance = (cc->free_pfn - cc->migrate_pfn);
+ if (cc->zone != dst_zone)
+ distance = (cc->free_pfn - dst_zone->zone_start_pfn) >> 1;
+ else
+ distance = (cc->free_pfn - cc->migrate_pfn);
low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));

@@ -1602,7 +1605,10 @@ static void isolate_freepages(struct compact_control *cc)
block_start_pfn = pageblock_start_pfn(isolate_start_pfn);
block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
zone_end_pfn(zone));
- low_pfn = pageblock_end_pfn(cc->migrate_pfn);
+ if (cc->dst_zone && cc->zone != cc->dst_zone)
+ low_pfn = pageblock_end_pfn(cc->dst_zone->zone_start_pfn);
+ else
+ low_pfn = pageblock_end_pfn(cc->migrate_pfn);
stride = cc->mode == MIGRATE_ASYNC ? COMPACT_CLUSTER_MAX : 1;

/*
@@ -1822,7 +1828,11 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
* within the first eighth to reduce the chances that a migration
* target later becomes a source.
*/
- distance = (cc->free_pfn - cc->migrate_pfn) >> 1;
+ if (cc->dst_zone && cc->zone != cc->dst_zone)
+ distance = (zone_end_pfn(cc->zone) - cc->migrate_pfn) >> 1;
+ else
+ distance = (cc->free_pfn - cc->migrate_pfn) >> 1;
+
if (cc->migrate_pfn != cc->zone->zone_start_pfn)
distance >>= 2;
high_pfn = pageblock_start_pfn(cc->migrate_pfn + distance);
@@ -1897,7 +1907,7 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
{
unsigned long block_start_pfn;
unsigned long block_end_pfn;
- unsigned long low_pfn;
+ unsigned long low_pfn, high_pfn;
struct page *page;
const isolate_mode_t isolate_mode =
(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
@@ -1924,11 +1934,16 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
/* Only scan within a pageblock boundary */
block_end_pfn = pageblock_end_pfn(low_pfn);

+ if (cc->dst_zone && cc->zone != cc->dst_zone)
+ high_pfn = zone_end_pfn(cc->zone);
+ else
+ high_pfn = cc->free_pfn;
+
/*
* Iterate over whole pageblocks until we find the first suitable.
* Do not cross the free scanner.
*/
- for (; block_end_pfn <= cc->free_pfn;
+ for (; block_end_pfn <= high_pfn;
fast_find_block = false,
cc->migrate_pfn = low_pfn = block_end_pfn,
block_start_pfn = block_end_pfn,
@@ -1954,6 +1969,7 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
* before making it "skip" so other compaction instances do
* not scan the same block.
*/
+
if (pageblock_aligned(low_pfn) &&
!fast_find_block && !isolation_suitable(cc, page))
continue;
@@ -1976,6 +1992,10 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
isolate_mode))
return ISOLATE_ABORT;

+ /* free_pfn may have changed. update high_pfn. */
+ if (!cc->dst_zone || cc->zone == cc->dst_zone)
+ high_pfn = cc->free_pfn;
+
/*
* Either we isolated something and proceed with migration. Or
* we failed and compact_zone should decide if we should
@@ -2141,7 +2161,9 @@ static enum compact_result __compact_finished(struct compact_control *cc)
goto out;
}

- if (is_via_compact_memory(cc->order))
+ /* Don't check if a suitable page is free if doing cross zone compaction. */
+ if (is_via_compact_memory(cc->order) ||
+ (cc->dst_zone && cc->dst_zone != cc->zone))
return COMPACT_CONTINUE;

/*
@@ -2224,7 +2246,8 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
* should be no need for compaction at all.
*/
if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
- alloc_flags))
+ alloc_flags) &&
+ dst_zone == zone)
return COMPACT_SUCCESS;

/*
@@ -2270,6 +2293,11 @@ enum compact_result compaction_suitable(struct zone *zone, int order,

ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
zone_page_state(dst_zone, NR_FREE_PAGES), dst_zone);
+
+ /* Allow migrating movable pages to ZONE_MOVABLE regardless of frag index */
+ if (ret == COMPACT_CONTINUE && dst_zone != zone)
+ return ret;
+
/*
* fragmentation index determines if allocation failures are due to
* low memory or external fragmentation
@@ -2841,6 +2869,14 @@ void compaction_unregister_node(struct node *node)
}
#endif /* CONFIG_SYSFS && CONFIG_NUMA */

+static inline bool should_compact_unmovable_zones(pg_data_t *pgdat)
+{
+ if (populated_zone(&pgdat->node_zones[ZONE_MOVABLE]))
+ return true;
+ else
+ return false;
+}
+
static inline bool kcompactd_work_requested(pg_data_t *pgdat)
{
return pgdat->kcompactd_max_order > 0 || kthread_should_stop() ||
@@ -2942,6 +2978,48 @@ static void kcompactd_do_work(pg_data_t *pgdat)
pgdat->kcompactd_highest_zoneidx = pgdat->nr_zones - 1;
}

+static void kcompactd_clean_unmovable_zones(pg_data_t *pgdat)
+{
+ int zoneid;
+ struct zone *zone;
+ struct compact_control cc = {
+ .order = 0,
+ .search_order = 0,
+ .highest_zoneidx = ZONE_MOVABLE,
+ .mode = MIGRATE_SYNC,
+ .ignore_skip_hint = true,
+ .gfp_mask = GFP_KERNEL,
+ .dst_zone = &pgdat->node_zones[ZONE_MOVABLE],
+ .whole_zone = true
+ };
+ count_compact_event(KCOMPACTD_CROSS_ZONE_START);
+
+ for (zoneid = 0; zoneid < ZONE_MOVABLE; zoneid++) {
+ int status;
+
+ zone = &pgdat->node_zones[zoneid];
+ if (!populated_zone(zone))
+ continue;
+
+ if (compaction_suitable(zone, cc.order, 0, zoneid, cc.dst_zone) !=
+ COMPACT_CONTINUE)
+ continue;
+
+ if (kthread_should_stop())
+ return;
+
+ /* Not participating in compaction defer. */
+
+ cc.zone = zone;
+ status = compact_zone(&cc, NULL);
+
+ count_compact_events(COMPACT_CROSS_ZONE_MIGRATE_SCANNED,
+ cc.total_migrate_scanned);
+ count_compact_events(COMPACT_CROSS_ZONE_FREE_SCANNED,
+ cc.total_free_scanned);
+ }
+}
+
void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx)
{
if (!order)
@@ -2994,9 +3072,10 @@ static int kcompactd(void *p)

/*
* Avoid the unnecessary wakeup for proactive compaction
- * when it is disabled.
+ * and cleanup of unmovable zones
+ * when they are disabled.
*/
- if (!sysctl_compaction_proactiveness)
+ if (!sysctl_compaction_proactiveness && !should_compact_unmovable_zones(pgdat))
timeout = MAX_SCHEDULE_TIMEOUT;
trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
@@ -3017,6 +3096,10 @@ static int kcompactd(void *p)
continue;
}

+ /* Migrates movable pages out of unmovable zones if ZONE_MOVABLE exists */
+ if (should_compact_unmovable_zones(pgdat))
+ kcompactd_clean_unmovable_zones(pgdat);
+
/*
* Start the proactive work with default timeout. Based
* on the fragmentation score, this timeout is updated.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 98af82e65ad9..444740605f2f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1325,6 +1325,9 @@ const char * const vmstat_text[] = {
"compact_daemon_migrate_scanned",
"compact_daemon_free_scanned",
"compact_cross_zone_migrated",
+ "compact_cross_zone_start",
+ "compact_cross_zone_migrate_scanned",
+ "compact_cross_zone_free_scanned",
#endif

#ifdef CONFIG_HUGETLB_PAGE
@@ -1692,7 +1695,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n spanned %lu"
"\n present %lu"
"\n managed %lu"
- "\n cma %lu",
+ "\n cma %lu"
+ "\n start %lu"
+ "\n end %lu",
zone_page_state(zone, NR_FREE_PAGES),
zone->watermark_boost,
min_wmark_pages(zone),
@@ -1701,7 +1706,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
zone->spanned_pages,
zone->present_pages,
zone_managed_pages(zone),
- zone_cma_pages(zone));
+ zone_cma_pages(zone),
+ zone->zone_start_pfn,
+ zone_end_pfn(zone));

seq_printf(m,
"\n protection: (%ld",
--
2.40.1