Recently, I got a report that Android gets slow due to order-2 page
allocations. After some investigation, I found that compaction usually
fails and many pages are reclaimed to make an order-2 freepage. I can't
analyze the detailed reason why compaction fails because I don't have a
reproducible environment and the compaction code has changed a lot since
that version, v3.10. But this report inspired me to start thinking about
the limitations of the current compaction algorithm.
Limitations of the current compaction algorithm:
1) The migration scanner can't scan beyond the free scanner, because
the scanners start at opposite ends of the zone and move toward each
other. When they meet, compaction stops and the scanners' positions are
reset to the ends of the zone again. From my experience, the migration
scanner usually doesn't scan beyond half of the zone range.
2) Compaction capability highly depends on the amount of free memory.
If there is 50 MB of free memory on a 4 GB system, the migration scanner
can migrate at most 50 MB of used pages before it meets the free scanner.
If compaction can't make enough high-order freepages during this amount
of work, compaction fails. There is no way to escape this failure in the
current algorithm; it scans the same region and fails again and again.
It then falls into the compaction deferring logic and is deferred for
some time.
3) Compaction capability also highly depends on the migratetype of
memory, because the freepage scanner doesn't scan unmovable pageblocks.
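To illustrate limitations 1) and 2), below is a minimal user-space
sketch (not kernel code; the zone and free-memory sizes are hypothetical
and chosen only for illustration). It models the two scanners starting
at opposite ends of the zone: the migration scanner can move at most as
many pages as there are freepages before the two positions cross.

#include <stdio.h>

int main(void)
{
	unsigned long zone_pages = 1UL << 20;	/* ~4 GB zone with 4 KB pages */
	unsigned long free_pages = 50UL << 8;	/* ~50 MB of freepages */
	unsigned long migrate_pfn = 0;		/* scans forward from zone start */
	unsigned long free_pfn = zone_pages;	/* scans backward from zone end */
	unsigned long migrated = 0;

	/* Each migrated page consumes one freepage from the free scanner. */
	while (migrate_pfn < free_pfn && migrated < free_pages) {
		migrate_pfn++;		/* migration scanner takes a used page */
		free_pfn--;		/* free scanner supplies a free page */
		migrated++;
	}

	/* Compaction stops: either the scanners met or freepages ran out. */
	printf("migration scanner stopped at pfn %lu of %lu (%lu%% of zone)\n",
	       migrate_pfn, zone_pages, migrate_pfn * 100 / zone_pages);
	return 0;
}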
To investigate these limitations, I made some compaction benchmarks.
The base environment of these benchmarks is fragmented memory. Before
testing, 25% of total memory is allocated. With some tricks, these
allocations are evenly distributed over the whole memory range. So,
after the allocation finishes, memory is highly fragmented and the
possibility of a successful order-3 allocation is very low; roughly
1500 order-3 allocations can succeed. The tests attempt an excessive
number of allocation requests, that is, 3000, to find out the
algorithm's limits.
There are two variations.
pageblock type (unmovable / movable):
In one case most pageblocks have unmovable migratetype and in the other
most pageblocks have movable migratetype.
memory usage (memory hogger 200 MB / kernel build with -j8):
Memory hogger means that 200 MB of free memory is occupied by a hogger.
Kernel build means that a kernel build is running in the background; it
consumes free memory, but the amount of consumption fluctuates a lot.
With these variations, I made 4 test cases by combining them.
hogger-frag-unmovable
hogger-frag-movable
build-frag-unmovable
build-frag-movable
All tests are conducted on a 512 MB QEMU virtual machine with 8 CPUs.
I can easily check the weaknesses of the compaction algorithm with the
following tests.
To check 1), the hogger-frag-movable benchmark is used. The result is
as follows.
bzImage-improve-base
compact_free_scanned 5240676
compact_isolated 75048
compact_migrate_scanned 2468387
compact_stall 710
compact_success 98
pgmigrate_success 34869
Success: 25
Success(N): 53
Columns 'Success' and 'Success(N)' are calculated by the following
equations.
Success = successful allocations * 100 / attempts
Success(N) = successful allocations * 100 /
number of order-3 allocation candidates
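For example (with purely illustrative numbers), 750 successful
allocations out of 3000 attempts would give Success = 750 * 100 / 3000
= 25 and, measured against roughly 1500 order-3 candidates,
Success(N) = 750 * 100 / 1500 = 50.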
As mentioned above, there are roughly 1500 high-order page candidates,
but compaction recovers just 53% of them. With the new compaction
approach, this can be increased to 94%. See the results at the end of
this cover letter.
To check 2), the hogger-frag-movable benchmark is used again, but with
a tweak: the amount of memory allocated by the memory hogger varies.
bzImage-improve-base
Hogger: 150MB 200MB 250MB 300MB
Success: 41 25 17 9
Success(N): 87 53 37 22
As background knowledge, up to 250MB, there is enough memory for all
order-3 allocation attempts to succeed. In the 300MB case, available
memory before starting the allocation attempts is just 57MB, so not all
attempts can succeed.
Anyway, as free memory decreases, the compaction success rate also
decreases. It is better to remove this dependency to get stable
compaction results in any case.
To check 3), the build-frag-unmovable/movable benchmarks are used.
All factors are the same except the pageblock migratetype.
Test: build-frag-unmovable
bzImage-improve-base
compact_free_scanned 5032378
compact_isolated 53368
compact_migrate_scanned 1456516
compact_stall 538
compact_success 93
pgmigrate_success 19926
Success: 15
Success(N): 33
Test: build-frag-movable
bzImage-improve-base
compact_free_scanned 3059086
compact_isolated 129085
compact_migrate_scanned 5029856
compact_stall 388
compact_success 99
pgmigrate_success 52898
Success: 38
Success(N): 82
Pageblock migratetype makes a big difference in success rate, and 3)
would be one of the reasons for this result. Because the freepage
scanner doesn't scan non-movable pageblocks, compaction can't get
enough freepages for migration and easily fails. This patchset tries to
solve this by allowing the freepage scanner to scan non-movable
pageblocks.
The results show that we cannot get all possible high-order pages with
the current compaction algorithm, and, when the pageblock migratetype
is unmovable, the success rate gets worse. Although we could solve
problem 3) within the current algorithm, limitations 1) and 2) are
unsolvable there, so I'd like to change the compaction algorithm.
This patchset tries to solve these limitations by introducing a new
compaction approach. The main changes of this patchset are as follows:
1) Make the freepage scanner scan non-movable pageblocks
The watermark check doesn't consider how many pages are in non-movable
pageblocks. To fully utilize existing freepages, the freepage scanner
should scan non-movable pageblocks.
2) Introduce a compaction depletion state
The compaction algorithm is changed to scan the whole zone range. In
this approach, compaction inevitably does back and forth migration
between different iterations. If this back and forth migration can make
a high-order freepage, it is justified; but when compaction possibility
is depleted, it just causes unnecessary overhead. The compaction
depletion state is introduced to avoid this useless back and forth
migration by detecting that compaction possibility is depleted.
3) Change the scanners' behaviour
The migration scanner is changed to scan the whole zone range
regardless of the freepage scanner's position. The freepage scanner
also scans the whole zone, from zone_start_pfn to zone_end_pfn. To
prevent back and forth migration within one compaction iteration, the
freepage scanner marks the skip-bit when scanning a pageblock, and the
migration scanner skips such marked pageblocks. The finish condition is
very simple: if the migration scanner reaches the end of the zone,
compaction finishes. If the freepage scanner reaches the end of the
zone first, it restarts at zone_start_pfn. This removes the dependency
on the amount of free memory.
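As a rough illustration of this new flow, here is a minimal user-space
sketch (not the actual kernel implementation; PAGEBLOCK_PAGES,
ZONE_PAGES and the cached free pfn are hypothetical). It only models
the control flow: the migration scanner walks the whole zone, the free
scanner walks forward and wraps at the zone end, and pageblocks used by
the free scanner get a skip-bit so they are not used as a migration
source again in the same iteration.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGEBLOCK_PAGES	512UL
#define ZONE_PAGES	(PAGEBLOCK_PAGES * 64)	/* hypothetical small zone */
#define NR_BLOCKS	(ZONE_PAGES / PAGEBLOCK_PAGES)

static bool skip[NR_BLOCKS];	/* skip-bit set by the free scanner */

/* Free scanner: marks the block it used and wraps at the zone end. */
static unsigned long free_scan(unsigned long free_pfn)
{
	skip[free_pfn / PAGEBLOCK_PAGES] = true;
	free_pfn += PAGEBLOCK_PAGES;
	if (free_pfn >= ZONE_PAGES)
		free_pfn = 0;		/* restart at zone_start_pfn */
	return free_pfn;
}

int main(void)
{
	unsigned long migrate_pfn = 0;
	unsigned long free_pfn = ZONE_PAGES / 2;	/* hypothetical cached pfn */
	int i, sources = 0;

	memset(skip, 0, sizeof(skip));

	/* Finish condition: the migration scanner reaches the zone end. */
	while (migrate_pfn < ZONE_PAGES) {
		/* Skip pageblocks the free scanner already used. */
		if (!skip[migrate_pfn / PAGEBLOCK_PAGES])
			free_pfn = free_scan(free_pfn);

		migrate_pfn += PAGEBLOCK_PAGES;
	}

	for (i = 0; i < NR_BLOCKS; i++)
		sources += skip[i];
	printf("finished at pfn %lu; %d pageblocks were freepage sources\n",
	       migrate_pfn, sources);
	return 0;
}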
Following are all the test results of this patchset.
Test: hogger-frag-unmovable
base nonmovable redesign threshold
compact_free_scanned 2800710 5615427 6441095 2235764
compact_isolated 58323 114183 2711081 647701
compact_migrate_scanned 1078970 2437597 4175464 1697292
compact_stall 341 1066 2059 2092
compact_success 80 123 207 210
pgmigrate_success 27034 53832 1348113 318395
Success: 22 29 44 40
Success(N): 46 61 90 83
Test: hogger-frag-movable
base nonmovable redesign threshold
compact_free_scanned 5240676 5883401 8103231 1860428
compact_isolated 75048 83201 3108978 427602
compact_migrate_scanned 2468387 2755690 4316163 1474287
compact_stall 710 664 2117 1964
compact_success 98 102 234 183
pgmigrate_success 34869 38663 1547318 208629
Success: 25 26 45 44
Success(N): 53 56 94 92
Test: build-frag-unmovable
base nonmovable redesign threshold
compact_free_scanned 5032378 4110920 2538420 1891170
compact_isolated 53368 330762 1020908 534680
compact_migrate_scanned 1456516 6164677 4809150 2667823
compact_stall 538 746 2609 2500
compact_success 93 350 438 403
pgmigrate_success 19926 152754 491609 251977
Success: 15 31 39 40
Success(N): 33 65 80 81
Test: build-frag-movable
base nonmovable redesign threshold
compact_free_scanned 3059086 3852269 2359553 1461131
compact_isolated 129085 238856 907515 387373
compact_migrate_scanned 5029856 5051868 3785605 2177090
compact_stall 388 540 2195 2157
compact_success 99 218 247 225
pgmigrate_success 52898 110021 439739 182366
Success: 38 37 43 43
Success(N): 82 77 89 90
Test: hogger-frag-movable with free memory variation
Hogger: 150MB 200MB 250MB 300MB
bzImage-improve-base
Success: 41 25 17 9
Success(N): 87 53 37 22
bzImage-improve-threshold
Success: 44 44 42 37
Success(N): 94 92 91 80
Test: stress-highalloc in mmtests
(tweaks to request order-7 unmovable allocation)
base nonmovable redesign threshold
Ops 1 30.00 8.33 84.67 78.00
Ops 2 32.33 26.67 84.33 79.00
Ops 3 91.67 92.00 95.00 94.00
Compaction stalls 5110 5581 10296 10475
Compaction success 1787 1807 5173 4744
Compaction failures 3323 3774 5123 5731
Compaction pages isolated 6370911 15421622 30534650 11825921
Compaction migrate scanned 52681405 83721428 150444732 53517273
Compaction free scanned 418049611 579768237 310629538 139433577
Compaction cost 3745 8822 17324 6628
The results show that much of the improvement comes from the redesigned
algorithm, but it causes too much overhead. However, further
optimization greatly reduces this overhead with a small degradation in
success rate.
We can observe a regression in some cases from the patch that allows
scanning on non-movable pageblocks. Although this regression is bad,
there is also much improvement in other cases, where most pageblocks
have non-movable migratetype. IMHO, that patch can be justified by the
improved cases. Moreover, this regression disappears after applying the
following patches, so we don't need to worry about it.
Please see the result of "hogger-frag-movable with free memory
variation". It shows that the patched version solves the limitations of
the current compaction algorithm and that almost all possible order-3
candidates can be allocated regardless of the amount of free memory.
This patchset is based on next-20150515.
Feel free to comment. :)
Thanks.
Joonsoo Kim (10):
mm/compaction: update skip-bit if whole pageblock is really scanned
mm/compaction: skip useless pfn for scanner's cached pfn
mm/compaction: always update cached pfn
mm/compaction: clean-up restarting condition check
mm/compaction: make freepage scanner scans non-movable pageblock
mm/compaction: introduce compaction depleted state on zone
mm/compaction: limit compaction activity in compaction depleted state
mm/compaction: remove compaction deferring
mm/compaction: redesign compaction
mm/compaction: new threshold for compaction depleted zone
include/linux/compaction.h | 14 +-
include/linux/mmzone.h | 6 +-
include/trace/events/compaction.h | 30 ++--
mm/compaction.c | 353 ++++++++++++++++++++++----------------
mm/internal.h | 1 +
mm/page_alloc.c | 2 +-
mm/vmscan.c | 4 +-
7 files changed, 229 insertions(+), 181 deletions(-)
--
1.9.1
Scanning a pageblock is stopped in the middle of the pageblock if
enough pages are isolated. On the next run, scanning begins again at
this position, and if no isolation candidate is found from the middle
of the pageblock to its end, the skip-bit is updated. In this case, the
scanner didn't start at the beginning of the pageblock, so it is not
appropriate to set the skip-bit. This patch fixes this situation so
that the skip-bit is updated only when the whole pageblock has really
been scanned.
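As an illustration (with pageblock_nr_pages assumed to be 512 and
hypothetical pfns): for a pageblock spanning pfns 1024..1535 with
end_pfn = 1536, the skip-bit is now set only when the scan started at
start_pfn == round_down(end_pfn - 1, pageblock_nr_pages) == 1024 and
the current position reached end_pfn == 1536.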
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 6ef2fdf..4397bf7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -261,7 +261,8 @@ void reset_isolation_suitable(pg_data_t *pgdat)
*/
static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
- bool migrate_scanner)
+ unsigned long start_pfn, unsigned long end_pfn,
+ unsigned long curr_pfn, bool migrate_scanner)
{
struct zone *zone = cc->zone;
unsigned long pfn;
@@ -275,6 +276,13 @@ static void update_pageblock_skip(struct compact_control *cc,
if (nr_isolated)
return;
+ /* Update the pageblock-skip if the whole pageblock was scanned */
+ if (curr_pfn != end_pfn)
+ return;
+
+ if (start_pfn != round_down(end_pfn - 1, pageblock_nr_pages))
+ return;
+
set_pageblock_skip(page);
pfn = page_to_pfn(page);
@@ -300,7 +308,8 @@ static inline bool isolation_suitable(struct compact_control *cc,
static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
- bool migrate_scanner)
+ unsigned long start_pfn, unsigned long end_pfn,
+ unsigned long curr_pfn, bool migrate_scanner)
{
}
#endif /* CONFIG_COMPACTION */
@@ -493,9 +502,6 @@ isolate_fail:
trace_mm_compaction_isolate_freepages(*start_pfn, blockpfn,
nr_scanned, total_isolated);
- /* Record how far we have got within the block */
- *start_pfn = blockpfn;
-
/*
* If strict isolation is requested by CMA then check that all the
* pages requested were isolated. If there were any failures, 0 is
@@ -507,9 +513,11 @@ isolate_fail:
if (locked)
spin_unlock_irqrestore(&cc->zone->lock, flags);
- /* Update the pageblock-skip if the whole pageblock was scanned */
- if (blockpfn == end_pfn)
- update_pageblock_skip(cc, valid_page, total_isolated, false);
+ update_pageblock_skip(cc, valid_page, total_isolated,
+ *start_pfn, end_pfn, blockpfn, false);
+
+ /* Record how far we have got within the block */
+ *start_pfn = blockpfn;
count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
if (total_isolated)
@@ -806,12 +814,8 @@ isolate_success:
if (locked)
spin_unlock_irqrestore(&zone->lru_lock, flags);
- /*
- * Update the pageblock-skip information and cached scanner pfn,
- * if the whole pageblock was scanned without isolating any page.
- */
- if (low_pfn == end_pfn)
- update_pageblock_skip(cc, valid_page, nr_isolated, true);
+ update_pageblock_skip(cc, valid_page, nr_isolated,
+ start_pfn, end_pfn, low_pfn, true);
trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
nr_scanned, nr_isolated);
--
1.9.1
The scanner's cached pfn is used to determine the scanner's start
position on the next compaction run. Currently, the cached pfn points
at the skipped pageblock, so we uselessly check whether the pageblock
is valid for compaction and whether the skip-bit is set. If we set the
scanner's cached pfn past the skipped pageblock, to the pfn where the
next scan should continue, we don't need to do this check.
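For illustration (assuming pageblock_nr_pages = 512 and hypothetical
pfns): if the migration scanner skips the pageblock spanning pfns
1024..1535, its cached pfn is now set to end_pfn = 1536 rather than to
a pfn inside the skipped block, so the next run starts right after the
block instead of re-checking it; the free scanner's cached pfn is
likewise set to the block's start_pfn, since it scans backward at this
point in the series.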
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 4397bf7..9c5d43c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -265,7 +265,6 @@ static void update_pageblock_skip(struct compact_control *cc,
unsigned long curr_pfn, bool migrate_scanner)
{
struct zone *zone = cc->zone;
- unsigned long pfn;
if (cc->ignore_skip_hint)
return;
@@ -285,18 +284,16 @@ static void update_pageblock_skip(struct compact_control *cc,
set_pageblock_skip(page);
- pfn = page_to_pfn(page);
-
/* Update where async and sync compaction should restart */
if (migrate_scanner) {
- if (pfn > zone->compact_cached_migrate_pfn[0])
- zone->compact_cached_migrate_pfn[0] = pfn;
+ if (end_pfn > zone->compact_cached_migrate_pfn[0])
+ zone->compact_cached_migrate_pfn[0] = end_pfn;
if (cc->mode != MIGRATE_ASYNC &&
- pfn > zone->compact_cached_migrate_pfn[1])
- zone->compact_cached_migrate_pfn[1] = pfn;
+ end_pfn > zone->compact_cached_migrate_pfn[1])
+ zone->compact_cached_migrate_pfn[1] = end_pfn;
} else {
- if (pfn < zone->compact_cached_free_pfn)
- zone->compact_cached_free_pfn = pfn;
+ if (start_pfn < zone->compact_cached_free_pfn)
+ zone->compact_cached_free_pfn = start_pfn;
}
}
#else
--
1.9.1
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/mm/compaction.c b/mm/compaction.c
index 9c5d43c..2d8e211 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -510,6 +510,10 @@ isolate_fail:
if (locked)
spin_unlock_irqrestore(&cc->zone->lock, flags);
+ if (blockpfn == end_pfn &&
+ blockpfn > cc->zone->compact_cached_free_pfn)
+ cc->zone->compact_cached_free_pfn = blockpfn;
+
update_pageblock_skip(cc, valid_page, total_isolated,
*start_pfn, end_pfn, blockpfn, false);
@@ -811,6 +815,13 @@ isolate_success:
if (locked)
spin_unlock_irqrestore(&zone->lru_lock, flags);
+ if (low_pfn == end_pfn && cc->mode != MIGRATE_ASYNC) {
+ int sync = cc->mode != MIGRATE_ASYNC;
+
+ if (low_pfn > zone->compact_cached_migrate_pfn[sync])
+ zone->compact_cached_migrate_pfn[sync] = low_pfn;
+ }
+
update_pageblock_skip(cc, valid_page, nr_isolated,
start_pfn, end_pfn, low_pfn, true);
--
1.9.1
Rename the check function and move one outer condition check into it.
There is no functional change.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 2d8e211..dd2063b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -188,8 +188,11 @@ void compaction_defer_reset(struct zone *zone, int order,
}
/* Returns true if restarting compaction after many failures */
-bool compaction_restarting(struct zone *zone, int order)
+static bool compaction_direct_restarting(struct zone *zone, int order)
{
+ if (current_is_kswapd())
+ return false;
+
if (order < zone->compact_order_failed)
return false;
@@ -1327,7 +1330,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
* is about to be retried after being deferred. kswapd does not do
* this reset as it'll reset the cached information when going to sleep.
*/
- if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+ if (compaction_direct_restarting(zone, cc->order))
__reset_isolation_suitable(zone);
/*
--
1.9.1
Currently, the freepage scanner doesn't scan non-movable pageblocks,
because if freepages in a non-movable pageblock are exhausted, another
movable pageblock could be used for non-movable allocations and that
would cause fragmentation.
But the watermark check for compaction doesn't take this into account.
So, if all freepages are in non-movable pageblocks, the freepage
scanner can't get any freepage and compaction fails, even though the
system has enough freepages and the watermark check passes. There is no
way to get a precise count of freepages in movable pageblocks and no
way to reclaim only the used pages in movable pageblocks. Therefore, I
think the best way to overcome this situation is to let compaction use
freepages in non-movable pageblocks.
My test setup for this situation is:
Memory is artificially fragmented to make order-3 allocation hard, and
most pageblocks are changed to unmovable migratetype.
System: 512 MB with 32 MB Zram
Memory: 25% of memory is allocated to create fragmentation and a kernel
build is running in the background.
Fragmentation: successful order-3 allocation candidates may be around
1500 roughly.
Allocation attempts: roughly 3000 order-3 allocation attempts with
GFP_NORETRY. This value is chosen to saturate allocation success.
Below is the result of this test.
Test: build-frag-unmovable
base nonmovable
compact_free_scanned 5032378 4110920
compact_isolated 53368 330762
compact_migrate_scanned 1456516 6164677
compact_stall 538 746
compact_success 93 350
pgmigrate_success 19926 152754
Success: 15 31
Success(N): 33 65
Columns 'Success' and 'Success(N)' are calculated by the following
equations.
Success = successful allocations * 100 / attempts
Success(N) = successful allocations * 100 / order-3 candidates
The result shows that the success rate is doubled in this case because
we can search more area.
But, we can observe a regression in another case.
Test: stress-highalloc in mmtests
(tweaks to request order-7 unmovable allocation)
base nonmovable
Ops 1 30.00 8.33
Ops 2 32.33 26.67
Ops 3 91.67 92.00
Compaction stalls 5110 5581
Compaction success 1787 1807
Compaction failures 3323 3774
Compaction pages isolated 6370911 15421622
Compaction migrate scanned 52681405 83721428
Compaction free scanned 418049611 579768237
Compaction cost 3745 8822
Although this regression is bad, there is also much improvement in
other cases, where most pageblocks have non-movable migratetype. IMHO,
this patch can be justified by that improvement. Moreover, this
regression disappears after applying the following patches, so we don't
need to worry about it much.
The migration scanner already scans non-movable pageblocks and makes
some freepages in those pageblocks through migration. So, even if the
freepage scanner scans non-movable pageblocks and uses the freepages in
them, the number of freepages in non-movable pageblocks wouldn't
diminish much and wouldn't cause much fragmentation.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index dd2063b..8d1b3b5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -905,12 +905,8 @@ static bool suitable_migration_target(struct page *page)
return false;
}
- /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
- if (migrate_async_suitable(get_pageblock_migratetype(page)))
- return true;
-
- /* Otherwise skip the block */
- return false;
+ /* Otherwise scan the block */
+ return true;
}
/*
--
1.9.1
Further compaction attempts are deferred when some compaction attempts
have already failed. But, after a number of trials have been skipped,
compaction restarts to check whether compaction is now possible or not.
It scans the whole zone range to determine this possibility, and if
compaction possibility hasn't recovered, this whole-range scan is quite
a big overhead. As a first step to reduce this overhead, this patch
implements a compaction depleted state on the zone.
The way to determine depletion of compaction possibility is to check
the number of successes during the previous compaction attempts. If the
number of successful compactions is below a specified threshold, we
guess that compaction won't succeed next time either, so we mark the
zone as compaction depleted. In this patch, the threshold is chosen as
1 to imitate the current compaction deferring algorithm. In a following
patch, the compaction algorithm will be changed and this threshold will
be adjusted to that change.
In this patch, only the state definition is implemented. There is no
action for this new state yet, so there is no functional change, but a
following patch will add some handling for it.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/compaction.c | 38 +++++++++++++++++++++++++++++++++++---
2 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 754c259..bd9f1a5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -517,6 +517,7 @@ struct zone {
unsigned int compact_considered;
unsigned int compact_defer_shift;
int compact_order_failed;
+ unsigned long compact_success;
#endif
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -543,6 +544,7 @@ enum zone_flags {
* many pages under writeback
*/
ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */
+ ZONE_COMPACTION_DEPLETED, /* compaction possiblity depleted */
};
static inline unsigned long zone_end_pfn(const struct zone *zone)
diff --git a/mm/compaction.c b/mm/compaction.c
index 8d1b3b5..9f259b9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -129,6 +129,23 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6
+#define COMPACT_MIN_DEPLETE_THRESHOLD 1UL
+
+static bool compaction_depleted(struct zone *zone)
+{
+ unsigned long threshold;
+ unsigned long success = zone->compact_success;
+
+ /*
+ * Now, to imitate current compaction deferring approach,
+ * choose threshold to 1. It will be changed in the future.
+ */
+ threshold = COMPACT_MIN_DEPLETE_THRESHOLD;
+ if (success >= threshold)
+ return false;
+
+ return true;
+}
/*
* Compaction is deferred when compaction fails to result in a page
@@ -226,6 +243,10 @@ static void __reset_isolation_suitable(struct zone *zone)
zone->compact_cached_free_pfn = end_pfn;
zone->compact_blockskip_flush = false;
+ if (compaction_depleted(zone))
+ set_bit(ZONE_COMPACTION_DEPLETED, &zone->flags);
+ zone->compact_success = 0;
+
/* Walk the zone and mark every pageblock as suitable for isolation */
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
struct page *page;
@@ -1197,22 +1218,28 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
bool can_steal;
/* Job done if page is free of the right migratetype */
- if (!list_empty(&area->free_list[migratetype]))
+ if (!list_empty(&area->free_list[migratetype])) {
+ zone->compact_success++;
return COMPACT_PARTIAL;
+ }
#ifdef CONFIG_CMA
/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
if (migratetype == MIGRATE_MOVABLE &&
- !list_empty(&area->free_list[MIGRATE_CMA]))
+ !list_empty(&area->free_list[MIGRATE_CMA])) {
+ zone->compact_success++;
return COMPACT_PARTIAL;
+ }
#endif
/*
* Job done if allocation would steal freepages from
* other migratetype buddy lists.
*/
if (find_suitable_fallback(area, order, migratetype,
- true, &can_steal) != -1)
+ true, &can_steal) != -1) {
+ zone->compact_success++;
return COMPACT_PARTIAL;
+ }
}
return COMPACT_NO_SUITABLE_PAGE;
@@ -1452,6 +1479,11 @@ out:
trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync, ret);
+ if (test_bit(ZONE_COMPACTION_DEPLETED, &zone->flags)) {
+ if (!compaction_depleted(zone))
+ clear_bit(ZONE_COMPACTION_DEPLETED, &zone->flags);
+ }
+
return ret;
}
--
1.9.1
Compaction deferring was introduced to reduce the overhead of
compaction when a compaction attempt is expected to fail. But, it has a
problem: the whole zone is rescanned after some compaction attempts
have been deferred, and this rescan overhead is quite big. Moreover, it
imposes a large latency on one random requestor while the others get
nearly zero latency, failing immediately due to deferred compaction.
This patch handles this situation differently to solve these problems.
First, we should know when compaction will fail. The previous patch
defines the compaction depleted state. In this state, compaction
failure is highly expected, so we don't need to put much effort into
compaction. Therefore, this patch forces the migration scanner to scan
a restricted number of pages in this state. This way, we can evenly
distribute the compaction overhead across all compaction requestors.
And, since there is a way to escape from the compaction depleted state,
we don't need to unconditionally defer a specific number of compaction
attempts once compaction possibility recovers.
In this patch, the migration scanner limit is chosen to imitate the
current compaction deferring approach, but we can tune it easily if
this overhead doesn't look appropriate. That would be further work.
There could be a situation where the compaction depleted state is
maintained for a long time. In this case, repeated compaction attempts
would continually cause useless overhead. To optimize this case, this
patch introduces the compaction depletion depth and diminishes the
migration scanner limit according to this depth. It effectively reduces
compaction overhead in this situation.
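As a purely illustrative example of the limit calculation (the zone
size is hypothetical): for a zone with about 131072 managed pages
(roughly 512 MB with 4 KB pages), once the zone is marked depleted with
a non-zero depletion depth, sync compaction's limit starts from
131072 >> 6 = 2048 pages, is doubled to 4096 as extra credit for sync
compaction (async compaction is skipped entirely in this state), and is
then shifted right by the depletion depth: 2048 at depth 1, 1024 at
depth 2, and so on, never dropping below COMPACT_CLUSTER_MAX.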
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/compaction.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++--
mm/internal.h | 1 +
3 files changed, 61 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd9f1a5..700e9b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -518,6 +518,7 @@ struct zone {
unsigned int compact_defer_shift;
int compact_order_failed;
unsigned long compact_success;
+ unsigned long compact_depletion_depth;
#endif
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/mm/compaction.c b/mm/compaction.c
index 9f259b9..aff536f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -130,6 +130,7 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6
#define COMPACT_MIN_DEPLETE_THRESHOLD 1UL
+#define COMPACT_MIN_SCAN_LIMIT (pageblock_nr_pages)
static bool compaction_depleted(struct zone *zone)
{
@@ -147,6 +148,48 @@ static bool compaction_depleted(struct zone *zone)
return true;
}
+static void set_migration_scan_limit(struct compact_control *cc)
+{
+ struct zone *zone = cc->zone;
+ int order = cc->order;
+ unsigned long limit;
+
+ cc->migration_scan_limit = LONG_MAX;
+ if (order < 0)
+ return;
+
+ if (!test_bit(ZONE_COMPACTION_DEPLETED, &zone->flags))
+ return;
+
+ if (!zone->compact_depletion_depth)
+ return;
+
+ /* Stop async migration if depleted */
+ if (cc->mode == MIGRATE_ASYNC) {
+ cc->migration_scan_limit = -1;
+ return;
+ }
+
+ /*
+ * Deferred compaction restart compaction every 64 compaction
+ * attempts and it rescans whole zone range. If we limit
+ * migration scanner to scan 1/64 range when depleted, 64
+ * compaction attempts will rescan whole zone range as same
+ * as deferred compaction.
+ */
+ limit = zone->managed_pages >> 6;
+
+ /*
+ * We don't do async compaction. Instead, give extra credit
+ * to sync compaction
+ */
+ limit <<= 1;
+ limit = max(limit, COMPACT_MIN_SCAN_LIMIT);
+
+ /* Degradation scan limit according to depletion depth. */
+ limit >>= zone->compact_depletion_depth;
+ cc->migration_scan_limit = max(limit, COMPACT_CLUSTER_MAX);
+}
/*
* Compaction is deferred when compaction fails to result in a page
* allocation success. 1 << compact_defer_limit compactions are skipped up
@@ -243,8 +286,14 @@ static void __reset_isolation_suitable(struct zone *zone)
zone->compact_cached_free_pfn = end_pfn;
zone->compact_blockskip_flush = false;
- if (compaction_depleted(zone))
- set_bit(ZONE_COMPACTION_DEPLETED, &zone->flags);
+ if (compaction_depleted(zone)) {
+ if (test_bit(ZONE_COMPACTION_DEPLETED, &zone->flags))
+ zone->compact_depletion_depth++;
+ else {
+ set_bit(ZONE_COMPACTION_DEPLETED, &zone->flags);
+ zone->compact_depletion_depth = 0;
+ }
+ }
zone->compact_success = 0;
/* Walk the zone and mark every pageblock as suitable for isolation */
@@ -839,6 +888,7 @@ isolate_success:
if (locked)
spin_unlock_irqrestore(&zone->lru_lock, flags);
+ cc->migration_scan_limit -= nr_scanned;
if (low_pfn == end_pfn && cc->mode != MIGRATE_ASYNC) {
int sync = cc->mode != MIGRATE_ASYNC;
@@ -1198,6 +1248,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
return COMPACT_COMPLETE;
}
+ if (cc->migration_scan_limit < 0)
+ return COMPACT_PARTIAL;
+
/*
* order == -1 is expected when compacting via
* /proc/sys/vm/compact_memory
@@ -1373,6 +1426,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
}
+ set_migration_scan_limit(cc);
+ if (cc->migration_scan_limit < 0)
+ return COMPACT_SKIPPED;
+
trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync);
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1..a427695 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -182,6 +182,7 @@ struct compact_control {
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
+ long migration_scan_limit; /* Limit migration scanner activity */
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
int order; /* order a direct compactor needs */
--
1.9.1
Now, we have a way to determine the compaction depleted state, and
compaction activity is limited according to this state and the
depletion depth, so compaction overhead is well controlled without
compaction deferring. So, this patch removes compaction deferring
completely.
Various functions are renamed and tracepoint output is changed due to
this removal.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/compaction.h | 14 +-------
include/linux/mmzone.h | 3 +-
include/trace/events/compaction.h | 30 +++++++---------
mm/compaction.c | 74 ++++++++++-----------------------------
mm/page_alloc.c | 2 +-
mm/vmscan.c | 4 +--
6 files changed, 37 insertions(+), 90 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index aa8f61c..8d98f3c 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -45,11 +45,8 @@ extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx);
-extern void defer_compaction(struct zone *zone, int order);
-extern bool compaction_deferred(struct zone *zone, int order);
-extern void compaction_defer_reset(struct zone *zone, int order,
+extern void compaction_failed_reset(struct zone *zone, int order,
bool alloc_success);
-extern bool compaction_restarting(struct zone *zone, int order);
#else
static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
@@ -74,15 +71,6 @@ static inline unsigned long compaction_suitable(struct zone *zone, int order,
return COMPACT_SKIPPED;
}
-static inline void defer_compaction(struct zone *zone, int order)
-{
-}
-
-static inline bool compaction_deferred(struct zone *zone, int order)
-{
- return true;
-}
-
#endif /* CONFIG_COMPACTION */
#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 700e9b5..e13b732 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -514,8 +514,7 @@ struct zone {
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
*/
- unsigned int compact_considered;
- unsigned int compact_defer_shift;
+ int compact_failed;
int compact_order_failed;
unsigned long compact_success;
unsigned long compact_depletion_depth;
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 9a6a3fe..323e614 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -239,7 +239,7 @@ DEFINE_EVENT(mm_compaction_suitable_template, mm_compaction_suitable,
);
#ifdef CONFIG_COMPACTION
-DECLARE_EVENT_CLASS(mm_compaction_defer_template,
+DECLARE_EVENT_CLASS(mm_compaction_deplete_template,
TP_PROTO(struct zone *zone, int order),
@@ -249,8 +249,9 @@ DECLARE_EVENT_CLASS(mm_compaction_defer_template,
__field(int, nid)
__field(char *, name)
__field(int, order)
- __field(unsigned int, considered)
- __field(unsigned int, defer_shift)
+ __field(unsigned long, success)
+ __field(unsigned long, depletion_depth)
+ __field(int, failed)
__field(int, order_failed)
),
@@ -258,35 +259,30 @@ DECLARE_EVENT_CLASS(mm_compaction_defer_template,
__entry->nid = zone_to_nid(zone);
__entry->name = (char *)zone->name;
__entry->order = order;
- __entry->considered = zone->compact_considered;
- __entry->defer_shift = zone->compact_defer_shift;
+ __entry->success = zone->compact_success;
+ __entry->depletion_depth = zone->compact_depletion_depth;
+ __entry->failed = zone->compact_failed;
__entry->order_failed = zone->compact_order_failed;
),
- TP_printk("node=%d zone=%-8s order=%d order_failed=%d consider=%u limit=%lu",
+ TP_printk("node=%d zone=%-8s order=%d failed=%d order_failed=%d consider=%lu depth=%lu",
__entry->nid,
__entry->name,
__entry->order,
+ __entry->failed,
__entry->order_failed,
- __entry->considered,
- 1UL << __entry->defer_shift)
+ __entry->success,
+ __entry->depletion_depth)
);
-DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred,
+DEFINE_EVENT(mm_compaction_deplete_template, mm_compaction_fail_compaction,
TP_PROTO(struct zone *zone, int order),
TP_ARGS(zone, order)
);
-DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_compaction,
-
- TP_PROTO(struct zone *zone, int order),
-
- TP_ARGS(zone, order)
-);
-
-DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_reset,
+DEFINE_EVENT(mm_compaction_deplete_template, mm_compaction_failed_reset,
TP_PROTO(struct zone *zone, int order),
diff --git a/mm/compaction.c b/mm/compaction.c
index aff536f..649fca2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -128,7 +128,7 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
#ifdef CONFIG_COMPACTION
/* Do not skip compaction more than 64 times */
-#define COMPACT_MAX_DEFER_SHIFT 6
+#define COMPACT_MAX_FAILED 4
#define COMPACT_MIN_DEPLETE_THRESHOLD 1UL
#define COMPACT_MIN_SCAN_LIMIT (pageblock_nr_pages)
@@ -190,61 +190,28 @@ static void set_migration_scan_limit(struct compact_control *cc)
limit >>= zone->compact_depletion_depth;
cc->migration_scan_limit = max(limit, COMPACT_CLUSTER_MAX);
}
-/*
- * Compaction is deferred when compaction fails to result in a page
- * allocation success. 1 << compact_defer_limit compactions are skipped up
- * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
- */
-void defer_compaction(struct zone *zone, int order)
-{
- zone->compact_considered = 0;
- zone->compact_defer_shift++;
-
- if (order < zone->compact_order_failed)
- zone->compact_order_failed = order;
-
- if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
- zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
- trace_mm_compaction_defer_compaction(zone, order);
-}
-
-/* Returns true if compaction should be skipped this time */
-bool compaction_deferred(struct zone *zone, int order)
+void fail_compaction(struct zone *zone, int order)
{
- unsigned long defer_limit = 1UL << zone->compact_defer_shift;
-
- if (order < zone->compact_order_failed)
- return false;
-
- /* Avoid possible overflow */
- if (++zone->compact_considered > defer_limit)
- zone->compact_considered = defer_limit;
-
- if (zone->compact_considered >= defer_limit)
- return false;
-
- trace_mm_compaction_deferred(zone, order);
+ if (order < zone->compact_order_failed) {
+ zone->compact_failed = 0;
+ zone->compact_order_failed = order;
+ } else
+ zone->compact_failed++;
- return true;
+ trace_mm_compaction_fail_compaction(zone, order);
}
-/*
- * Update defer tracking counters after successful compaction of given order,
- * which means an allocation either succeeded (alloc_success == true) or is
- * expected to succeed.
- */
-void compaction_defer_reset(struct zone *zone, int order,
+void compaction_failed_reset(struct zone *zone, int order,
bool alloc_success)
{
- if (alloc_success) {
- zone->compact_considered = 0;
- zone->compact_defer_shift = 0;
- }
+ if (alloc_success)
+ zone->compact_failed = 0;
+
if (order >= zone->compact_order_failed)
zone->compact_order_failed = order + 1;
- trace_mm_compaction_defer_reset(zone, order);
+ trace_mm_compaction_failed_reset(zone, order);
}
/* Returns true if restarting compaction after many failures */
@@ -256,8 +223,7 @@ static bool compaction_direct_restarting(struct zone *zone, int order)
if (order < zone->compact_order_failed)
return false;
- return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
- zone->compact_considered >= 1UL << zone->compact_defer_shift;
+ return zone->compact_failed < COMPACT_MAX_FAILED ? false : true;
}
/* Returns true if the pageblock should be scanned for pages to isolate. */
@@ -295,6 +261,7 @@ static void __reset_isolation_suitable(struct zone *zone)
}
}
zone->compact_success = 0;
+ zone->compact_failed = 0;
/* Walk the zone and mark every pageblock as suitable for isolation */
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -1610,9 +1577,6 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int status;
int zone_contended;
- if (compaction_deferred(zone, order))
- continue;
-
status = compact_zone_order(zone, order, gfp_mask, mode,
&zone_contended, alloc_flags,
ac->classzone_idx);
@@ -1632,7 +1596,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
* will repeat this with true if allocation indeed
* succeeds in this zone.
*/
- compaction_defer_reset(zone, order, false);
+ compaction_failed_reset(zone, order, false);
/*
* It is possible that async compaction aborted due to
* need_resched() and the watermarks were ok thanks to
@@ -1653,7 +1617,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
* so we defer compaction there. If it ends up
* succeeding after all, it will be reset.
*/
- defer_compaction(zone, order);
+ fail_compaction(zone, order);
}
/*
@@ -1715,13 +1679,13 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
if (cc->order == -1)
__reset_isolation_suitable(zone);
- if (cc->order == -1 || !compaction_deferred(zone, cc->order))
+ if (cc->order == -1)
compact_zone(zone, cc);
if (cc->order > 0) {
if (zone_watermark_ok(zone, cc->order,
low_wmark_pages(zone), 0, 0))
- compaction_defer_reset(zone, cc->order, false);
+ compaction_failed_reset(zone, cc->order, false);
}
VM_BUG_ON(!list_empty(&cc->freepages));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index afd5459..f53d764 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2821,7 +2821,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zone *zone = page_zone(page);
zone->compact_blockskip_flush = false;
- compaction_defer_reset(zone, order, true);
+ compaction_failed_reset(zone, order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 37e90db..a561b5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2469,10 +2469,10 @@ static inline bool compaction_ready(struct zone *zone, int order)
watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
/*
- * If compaction is deferred, reclaim up to a point where
+ * If compaction is depleted, reclaim up to a point where
* compaction will have a chance of success when re-enabled
*/
- if (compaction_deferred(zone, order))
+ if (test_bit(ZONE_COMPACTION_DEPLETED, &zone->flags))
return watermark_ok;
/*
--
1.9.1
Currently, compaction works as follows:
1) The migration scanner scans from zone_start_pfn to zone_end_pfn
to find migratable pages.
2) The free scanner scans from zone_end_pfn to zone_start_pfn to
find free pages.
3) If the scanners cross, compaction is finished.
This algorithm has some drawbacks. 1) The back of the zone cannot be
scanned by the migration scanner because the migration scanner can't
pass over the freepage scanner. So, although there are some high-order
page candidates at the back of the zone, we can't utilize them.
Another weakness is 2) compaction's success highly depends on the
amount of freepages. Compaction can migrate used pages only up to the
amount of freepages at maximum. If we can't make a high-order page with
this effort, the scanners meet and compaction fails.
We can easily observe problem 1) with the following test.
Memory is artificially fragmented to make order-3 allocation hard.
System: 512 MB with 32 MB Zram
Memory: 25% of memory is allocated to create fragmentation and 200 MB
is occupied by a memory hogger. Most pageblocks are movable
migratetype.
Fragmentation: successful order-3 allocation candidates may be around
1500 roughly.
Allocation attempts: roughly 3000 order-3 allocation attempts with
GFP_NORETRY. This value is chosen to saturate allocation success.
Test: hogger-frag-movable
nonmovable
compact_free_scanned 5883401
compact_isolated 83201
compact_migrate_scanned 2755690
compact_stall 664
compact_success 102
pgmigrate_success 38663
Success: 26
Success(N): 56
Columns 'Success' and 'Success(N)' are calculated by the following
equations.
Success = successful allocations * 100 / attempts
Success(N) = successful allocations * 100 /
number of order-3 allocation candidates
As mentioned above, there are roughly 1500 high-order page candidates,
but compaction recovers just 56% of them, because the migration scanner
can't pass over the freepage scanner. With the new compaction approach
of this patch, it can be increased to 94%.
To check 2), the hogger-frag-movable benchmark is used again, but with
a tweak: the amount of memory allocated by the memory hogger varies.
Test: hogger-frag-movable with free memory variation
bzImage-improve-base
Hogger: 150MB 200MB 250MB 300MB
Success: 41 25 17 9
Success(N): 87 53 37 22
As background knowledge, up to 250MB, there is enough memory for all
order-3 allocation attempts to succeed. In the 300MB case, available
memory before starting the allocation attempts is just 57MB, so not all
attempts can succeed.
Anyway, as free memory decreases, the compaction success rate also
decreases. It is better to remove this dependency to get stable
compaction results in any case.
This patch solves the problems mentioned above.
The freepage scanner is greatly changed: it now scans the zone from
zone_start_pfn to zone_end_pfn. And, with this change, the compaction
finish condition is also changed: compaction finishes when the
migration scanner reaches zone_end_pfn. With these changes, the
migration scanner can traverse anywhere in the zone.
To prevent back and forth migration within one compaction iteration,
the freepage scanner marks the skip-bit when scanning a pageblock. The
migration scanner checks it and skips such marked pageblocks, so back
and forth migration is not possible within one compaction iteration.
If the freepage scanner reaches the end of the zone, it restarts at
zone_start_pfn. At this point, the freepage scanner would scan
pageblocks where the migration scanner tried to migrate some pages but
failed to make a high-order page. These left-over freepages cannot
become a high-order page due to fragmentation, so they are a good
source for the freepage scanner.
With this change, the above test result becomes:
Test: hogger-frag-movable
nonmovable redesign
compact_free_scanned 5883401 8103231
compact_isolated 83201 3108978
compact_migrate_scanned 2755690 4316163
compact_stall 664 2117
compact_success 102 234
pgmigrate_success 38663 1547318
Success: 26 45
Success(N): 56 94
Test: hogger-frag-movable with free memory variation
Hogger: 150MB 200MB 250MB 300MB
bzImage-improve-base
Success: 41 25 17 9
Success(N): 87 53 37 22
bzImage-improve-threshold
Success: 44 44 42 37
Success(N): 94 92 91 80
Compaction now gives us almost all possible high-order pages. The
overhead is highly increased, but a further patch will greatly reduce
it by adjusting the depletion check to this new algorithm.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 134 ++++++++++++++++++++++++++------------------------------
1 file changed, 63 insertions(+), 71 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 649fca2..99f533f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -53,17 +53,17 @@ static const char *const compaction_status_string[] = {
static unsigned long release_freepages(struct list_head *freelist)
{
struct page *page, *next;
- unsigned long high_pfn = 0;
+ unsigned long low_pfn = ULONG_MAX;
list_for_each_entry_safe(page, next, freelist, lru) {
unsigned long pfn = page_to_pfn(page);
list_del(&page->lru);
__free_page(page);
- if (pfn > high_pfn)
- high_pfn = pfn;
+ if (pfn < low_pfn)
+ low_pfn = pfn;
}
- return high_pfn;
+ return low_pfn;
}
static void map_pages(struct list_head *list)
@@ -249,7 +249,7 @@ static void __reset_isolation_suitable(struct zone *zone)
zone->compact_cached_migrate_pfn[0] = start_pfn;
zone->compact_cached_migrate_pfn[1] = start_pfn;
- zone->compact_cached_free_pfn = end_pfn;
+ zone->compact_cached_free_pfn = start_pfn;
zone->compact_blockskip_flush = false;
if (compaction_depleted(zone)) {
@@ -322,18 +322,18 @@ static void update_pageblock_skip(struct compact_control *cc,
if (start_pfn != round_down(end_pfn - 1, pageblock_nr_pages))
return;
- set_pageblock_skip(page);
-
/* Update where async and sync compaction should restart */
if (migrate_scanner) {
+ set_pageblock_skip(page);
+
if (end_pfn > zone->compact_cached_migrate_pfn[0])
zone->compact_cached_migrate_pfn[0] = end_pfn;
if (cc->mode != MIGRATE_ASYNC &&
- end_pfn > zone->compact_cached_migrate_pfn[1])
+ end_pfn > zone->compact_cached_migrate_pfn[1])
zone->compact_cached_migrate_pfn[1] = end_pfn;
} else {
- if (start_pfn < zone->compact_cached_free_pfn)
- zone->compact_cached_free_pfn = start_pfn;
+ if (end_pfn > zone->compact_cached_free_pfn)
+ zone->compact_cached_free_pfn = end_pfn;
}
}
#else
@@ -955,12 +955,13 @@ static void isolate_freepages(struct compact_control *cc)
{
struct zone *zone = cc->zone;
struct page *page;
+ unsigned long pfn;
unsigned long block_start_pfn; /* start of current pageblock */
- unsigned long isolate_start_pfn; /* exact pfn we start at */
unsigned long block_end_pfn; /* end of current pageblock */
- unsigned long low_pfn; /* lowest pfn scanner is able to scan */
struct list_head *freelist = &cc->freepages;
+ unsigned long nr_isolated;
+retry:
/*
* Initialise the free scanner. The starting point is where we last
* successfully isolated from, zone-cached value, or the end of the
@@ -972,22 +973,21 @@ static void isolate_freepages(struct compact_control *cc)
* The low boundary is the end of the pageblock the migration scanner
* is using.
*/
- isolate_start_pfn = cc->free_pfn;
- block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
- block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
- zone_end_pfn(zone));
- low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
+ pfn = cc->free_pfn;
- /*
- * Isolate free pages until enough are available to migrate the
- * pages on cc->migratepages. We stop searching if the migrate
- * and free page scanners meet or enough free pages are isolated.
- */
- for (; block_start_pfn >= low_pfn &&
- cc->nr_migratepages > cc->nr_freepages;
- block_end_pfn = block_start_pfn,
- block_start_pfn -= pageblock_nr_pages,
- isolate_start_pfn = block_start_pfn) {
+ for (; pfn < zone_end_pfn(zone) &&
+ cc->nr_migratepages > cc->nr_freepages;) {
+
+ block_start_pfn = pfn & ~(pageblock_nr_pages-1);
+ block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+ block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+ /* Skip the pageblock where migration scan is */
+ if (block_start_pfn ==
+ (cc->migrate_pfn & ~(pageblock_nr_pages-1))) {
+ pfn = block_end_pfn;
+ continue;
+ }
/*
* This can iterate a massively long zone without finding any
@@ -998,35 +998,25 @@ static void isolate_freepages(struct compact_control *cc)
&& compact_should_abort(cc))
break;
- page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
- zone);
- if (!page)
+ page = pageblock_pfn_to_page(pfn, block_end_pfn, zone);
+ if (!page) {
+ pfn = block_end_pfn;
continue;
+ }
/* Check the block is suitable for migration */
- if (!suitable_migration_target(page))
- continue;
-
- /* If isolation recently failed, do not retry */
- if (!isolation_suitable(cc, page))
+ if (!suitable_migration_target(page)) {
+ pfn = block_end_pfn;
continue;
+ }
/* Found a block suitable for isolating free pages from. */
- isolate_freepages_block(cc, &isolate_start_pfn,
+ nr_isolated = isolate_freepages_block(cc, &pfn,
block_end_pfn, freelist, false);
- /*
- * Remember where the free scanner should restart next time,
- * which is where isolate_freepages_block() left off.
- * But if it scanned the whole pageblock, isolate_start_pfn
- * now points at block_end_pfn, which is the start of the next
- * pageblock.
- * In that case we will however want to restart at the start
- * of the previous pageblock.
- */
- cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
- isolate_start_pfn :
- block_start_pfn - pageblock_nr_pages;
+ /* To prevent back and forth migration */
+ if (nr_isolated)
+ set_pageblock_skip(page);
/*
* isolate_freepages_block() might have aborted due to async
@@ -1039,12 +1029,13 @@ static void isolate_freepages(struct compact_control *cc)
/* split_free_page does not map the pages */
map_pages(freelist);
- /*
- * If we crossed the migrate scanner, we want to keep it that way
- * so that compact_finished() may detect this
- */
- if (block_start_pfn < low_pfn)
- cc->free_pfn = cc->migrate_pfn;
+ cc->free_pfn = pfn;
+ if (cc->free_pfn >= zone_end_pfn(zone)) {
+ cc->free_pfn = zone->zone_start_pfn;
+ zone->compact_cached_free_pfn = cc->free_pfn;
+ if (cc->nr_freepages == 0)
+ goto retry;
+ }
}
/*
@@ -1130,8 +1121,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
* Iterate over whole pageblocks until we find the first suitable.
* Do not cross the free scanner.
*/
- for (; end_pfn <= cc->free_pfn;
- low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
+ for (; low_pfn < zone_end_pfn(zone); low_pfn = end_pfn) {
+ end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
+ end_pfn = min(end_pfn, zone_end_pfn(zone));
/*
* This can potentially iterate a massively long zone with
@@ -1177,12 +1169,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
}
acct_isolated(zone, cc);
- /*
- * Record where migration scanner will be restarted. If we end up in
- * the same pageblock as the free scanner, make the scanners fully
- * meet so that compact_finished() terminates compaction.
- */
- cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;
+ cc->migrate_pfn = low_pfn;
return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}
@@ -1197,11 +1184,15 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
return COMPACT_PARTIAL;
/* Compaction run completes if the migrate and free scanner meet */
- if (cc->free_pfn <= cc->migrate_pfn) {
+ if (cc->migrate_pfn >= zone_end_pfn(zone)) {
+ /* Stop the async compaction */
+ zone->compact_cached_migrate_pfn[0] = zone_end_pfn(zone);
+ if (cc->mode == MIGRATE_ASYNC)
+ return COMPACT_PARTIAL;
+
/* Let the next compaction start anew. */
zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
- zone->compact_cached_free_pfn = zone_end_pfn(zone);
/*
* Mark that the PG_migrate_skip information should be cleared
@@ -1383,11 +1374,14 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
*/
cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
cc->free_pfn = zone->compact_cached_free_pfn;
- if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {
- cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1);
+ if (cc->mode == MIGRATE_ASYNC && cc->migrate_pfn >= end_pfn)
+ return COMPACT_SKIPPED;
+
+ if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
+ cc->free_pfn = start_pfn;
zone->compact_cached_free_pfn = cc->free_pfn;
}
- if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
+ if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
cc->migrate_pfn = start_pfn;
zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
@@ -1439,7 +1433,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
* migrate_pages() may return -ENOMEM when scanners meet
* and we want compact_finished() to detect it
*/
- if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
+ if (err == -ENOMEM) {
ret = COMPACT_PARTIAL;
goto out;
}
@@ -1490,13 +1484,11 @@ out:
cc->nr_freepages = 0;
VM_BUG_ON(free_pfn == 0);
- /* The cached pfn is always the first in a pageblock */
- free_pfn &= ~(pageblock_nr_pages-1);
/*
* Only go back, not forward. The cached pfn might have been
* already reset to zone end in compact_finished()
*/
- if (free_pfn > zone->compact_cached_free_pfn)
+ if (free_pfn < zone->compact_cached_free_pfn)
zone->compact_cached_free_pfn = free_pfn;
}
--
1.9.1
Now, the compaction algorithm has become powerful: the migration
scanner traverses the whole zone range. So, the old threshold for a
depleted zone, which was designed to imitate the compaction deferring
approach, isn't appropriate for the current compaction algorithm. If we
stick with the current threshold, 1, we can't avoid excessive
compaction overhead, because one compaction run for a low-order
allocation would easily succeed in almost any situation.
This patch re-implements the threshold calculation based on the zone
size and the requested allocation order. We judge whether a zone's
compaction possibility is depleted or not by the number of successful
compactions. Roughly, 1/100 of the area to be scanned should be
allocated as high-order pages during one compaction iteration for the
zone's compaction possibility to be considered not depleted.
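As a purely illustrative example of the new threshold (the zone size is
hypothetical): for a zone with about 131072 managed pages (roughly
512 MB with 4 KB pages) and compact_order_failed == 3, the number of
possible high-order pages is 131072 >> 3 = 16384; assuming the
migration scanner covers more than 1/4 of the zone, this becomes
16384 >> 2 = 4096, and requiring roughly 1/100 (here 1/128) of that to
succeed gives a threshold of 4096 >> 7 = 32 successful compactions,
never less than COMPACT_MIN_DEPLETE_THRESHOLD.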
Below is the test result with the following setup.
Memory is artificially fragmented to make order-3 allocation hard, and
most pageblocks are changed to unmovable migratetype.
System: 512 MB with 32 MB Zram
Memory: 25% of memory is allocated to create fragmentation and 200 MB
is occupied by a memory hogger. Most pageblocks are unmovable
migratetype.
Fragmentation: successful order-3 allocation candidates may be around
1500 roughly.
Allocation attempts: roughly 3000 order-3 allocation attempts with
GFP_NORETRY. This value is chosen to saturate allocation success.
Test: hogger-frag-unmovable
redesign threshold
compact_free_scanned 6441095 2235764
compact_isolated 2711081 647701
compact_migrate_scanned 4175464 1697292
compact_stall 2059 2092
compact_success 207 210
pgmigrate_success 1348113 318395
Success: 44 40
Success(N): 90 83
This change greatly decreases compaction overhead when a zone's
compaction possibility is nearly depleted. But, I should admit that it's
not perfect, because the compaction success rate decreases. A more
precisely tuned threshold would restore this regression, but it highly
depends on the workload so I'm not doing it here.
The other tests don't show any regression.
System: 512 MB with 32 MB Zram
Memory: 25% of memory is allocated to create fragmentation and a kernel
build is running in the background. Most pageblocks are of movable
migratetype.
Fragmentation: successful order-3 allocation candidates are roughly 1500.
Allocation attempts: roughly 3000 order-3 allocation attempts with
GFP_NORETRY. This value is chosen to saturate allocation success.
Test: build-frag-movable
redesign threshold
compact_free_scanned 2359553 1461131
compact_isolated 907515 387373
compact_migrate_scanned 3785605 2177090
compact_stall 2195 2157
compact_success 247 225
pgmigrate_success 439739 182366
Success: 43 43
Success(N): 89 90
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 99f533f..63702b3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -129,19 +129,24 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_FAILED 4
-#define COMPACT_MIN_DEPLETE_THRESHOLD 1UL
+#define COMPACT_MIN_DEPLETE_THRESHOLD 4UL
#define COMPACT_MIN_SCAN_LIMIT (pageblock_nr_pages)
static bool compaction_depleted(struct zone *zone)
{
- unsigned long threshold;
+ unsigned long nr_possible;
unsigned long success = zone->compact_success;
+ unsigned long threshold;
- /*
- * Now, to imitate current compaction deferring approach,
- * choose threshold to 1. It will be changed in the future.
- */
- threshold = COMPACT_MIN_DEPLETE_THRESHOLD;
+ nr_possible = zone->managed_pages >> zone->compact_order_failed;
+
+ /* Migration scanner can scan more than 1/4 of the zone range */
+ nr_possible >>= 2;
+
+ /* We hope to succeed more than 1/100 roughly */
+ threshold = nr_possible >> 7;
+
+ threshold = max(threshold, COMPACT_MIN_DEPLETE_THRESHOLD);
if (success >= threshold)
return false;
--
1.9.1
I can has commit log? :)
On 06/25/2015 02:45 AM, Joonsoo Kim wrote:
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/compaction.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9c5d43c..2d8e211 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -510,6 +510,10 @@ isolate_fail:
> if (locked)
> spin_unlock_irqrestore(&cc->zone->lock, flags);
>
> + if (blockpfn == end_pfn &&
> + blockpfn > cc->zone->compact_cached_free_pfn)
> + cc->zone->compact_cached_free_pfn = blockpfn;
> +
> update_pageblock_skip(cc, valid_page, total_isolated,
> *start_pfn, end_pfn, blockpfn, false);
>
> @@ -811,6 +815,13 @@ isolate_success:
> if (locked)
> spin_unlock_irqrestore(&zone->lru_lock, flags);
>
> + if (low_pfn == end_pfn && cc->mode != MIGRATE_ASYNC) {
> + int sync = cc->mode != MIGRATE_ASYNC;
> +
> + if (low_pfn > zone->compact_cached_migrate_pfn[sync])
> + zone->compact_cached_migrate_pfn[sync] = low_pfn;
> + }
> +
> update_pageblock_skip(cc, valid_page, nr_isolated,
> start_pfn, end_pfn, low_pfn, true);
>
>
On Thu, Jun 25, 2015 at 09:45:11AM +0900, Joonsoo Kim wrote:
> Recently, I got a report that android get slow due to order-2 page
> allocation. With some investigation, I found that compaction usually
> fails and many pages are reclaimed to make order-2 freepage. I can't
> analyze detailed reason that causes compaction fail because I don't
> have reproducible environment and compaction code is changed so much
> from that version, v3.10. But, I was inspired by this report and started
> to think limitation of current compaction algorithm.
>
> Limitation of current compaction algorithm:
>
I didn't review the individual patches unfortunately but have a few comments
about things to watch out for.
> 1) Migrate scanner can't scan behind of free scanner, because
> each scanner starts at both side of zone and go toward each other. If
> they meet at some point, compaction is stopped and scanners' position
> is reset to both side of zone again. From my experience, migrate scanner
> usually doesn't scan beyond of half of the zone range.
>
This was deliberate because if the scanners cross then they can undo each
other's work. The free scanner can locate a pageblock that pages were
recently migrated from. Finishing compaction when the scanners met was
the easiest way of avoiding the problem without maintaining global
state.
Global state is required because there can be parallel compaction
attempts. The global state requires locking to avoid two parallel
compaction attempts selecting the same pageblock for migrating to and
from.
This global state then needs to be reset on each compaction cycle. The
difficulty then is that there is a potential ping-pong effect. A pageblock
that was previously a migration target for the free scanner may become a
migration source for the migration scanner. Having the scanners operate
in opposite directions and meet in the middle avoided this problem.
I'm not saying the current design is perfect but it avoids a number of
problems that are worth keeping in mind. Regressions in this area will
look like higher system CPU time with most of the additional time spent
in compaction.
> 2) Compaction capability is highly depends on amount of free memory.
> If there is 50 MB free memory on 4 GB system, migrate scanner can
> migrate 50 MB used pages at maximum and then will meet free scanner.
> If compaction can't make enough high order freepages during this
> amount of work, compaction would fail. There is no way to escape this
> failure situation in current algorithm and it will scan same region and
> fail again and again. And then, it goes into compaction deferring logic
> and will be deferred for some times.
>
This is why reclaim/compaction exists. When this situation occurs, the
kernel is meant to reclaim some order-0 pages and try again. Initially
it was lumpy reclaim that was used but it severely disrupted the system.
> 3) Compaction capability is highly depends on migratetype of memory,
> because freepage scanner doesn't scan unmovable pageblock.
>
For a very good reason. Unmovable allocation requests that fallback to
other pageblocks are the worst in terms of fragmentation avoidance. The
more of these events there are, the more the system will decay. If there
are many of these events then a compaction benchmark may start with high
success rates but decay over time.
Very broadly speaking, the more the mm_page_alloc_extfrag tracepoint
triggers with alloc_migratetype == MIGRATE_UNMOVABLE, the faster the
system is decaying. Having the freepage scanner select unmovable
pageblocks will trigger this event more frequently.
The unfortunate impact is that selecting unmovable blocks from the free
scanner will improve compaction success rates for high-order kernel
allocations early in the lifetime of the system but later fail high-order
allocation requests as more pageblocks get converted to unmovable. It
might be ok for kernel allocations but THP will eventually have a 100%
failure rate.
> To investigate compaction limitations, I made some compaction benchmarks.
> Base environment of this benchmark is fragmented memory. Before testing,
> 25% of total size of memory is allocated. With some tricks, these
> allocations are evenly distributed to whole memory range. So, after
> allocation is finished, memory is highly fragmented and possibility of
> successful order-3 allocation is very low. Roughly 1500 order-3 allocation
> can be successful. Tests attempt excessive amount of allocation request,
> that is, 3000, to find out algorithm limitation.
>
Did you look at forcing MIGRATE_SYNC for these allocation
requests? Compaction was originally designed for huge page allocations and
high-order kernel allocations up to PAGE_ALLOC_COSTLY_ORDER were meant to
free naturally. MIGRATE_SYNC considers more pageblocks for migration and
should improve success rates at the cost of longer stalls.
> <SNIP>
> Column 'Success' and 'Success(N) are calculated by following equations.
>
> Success = successful allocation * 100 / attempts
> Success(N) = successful allocation * 100 /
> number of successful order-3 allocation
>
I don't quite get this.
Success = success * 100 / attempts
Success(N) = success * 100 / success
Can you try explaining Success(N) again? It's not clear how the first
"success allocation" differs from the "successful order-3" allocations.
Is it all allocation attempts or any order in the system or what?
> <SNIP>
>
> Anyway, as free memory decreases, compaction success rate also decreases.
> It is better to remove this dependency to get stable compaction result
> in any case.
>
BTW, this is also expected. Early in the existence of compaction it had
much higher success rates -- high 90%s around kernel 3.0. This has
dropped over time because the *cost* of granting those allocations was
so high. These are timing results from a high-order allocation stress
test
stress-highalloc
3.0.0 3.0.101 3.12.38
Ops 1 89.00 ( 0.00%) 84.00 ( 5.62%) 11.00 ( 87.64%)
Ops 2 91.00 ( 0.00%) 71.00 ( 21.98%) 11.00 ( 87.91%)
Ops 3 93.00 ( 0.00%) 89.00 ( 4.30%) 80.00 ( 13.98%)
3.0.0 3.0.101 3.12.38
vanilla vanilla vanilla
User 2904.90 2280.92 2873.42
System 630.53 624.25 510.87
Elapsed 3869.95 1291.28 1232.83
Ops 1 and 2 are allocation attempts under heavy load and note how kernel
3.0 and 3.0.101 had success rates of over 80%. However, look at how long
3.0.0 took -- over an hour vs 20 minutes in later kernels.
Later kernels avoid any expensive step even though it means the success
rates are lower. Your figures look impressive but remember that the
success rates could be due to very long stalls.
> Pageblock migratetype makes big difference on success rate. 3) would be
> one of reason related to this result. Because freepage scanner doesn't
> scan non-movable pageblock, compaction can't get enough freepage for
> migration and compaction easily fails. This patchset try to solve it
> by allowing freepage scanner to scan on non-movable pageblock.
>
I really think that using unmovable blocks as a freepage scanner target
will cause all THP allocation attempts to eventually fail.
> Result show that we cannot get all possible high order page through
> current compaction algorithm. And, in case that migratetype of
> pageblock is unmovable, success rate get worse. Although we can solve
> problem 3) in current algorithm, there is unsolvable limitations, 1), 2),
> so I'd like to change compaction algorithm.
>
> This patchset try to solve these limitations by introducing new compaction
> approach. Main changes of this patchset are as following:
>
> 1) Make freepage scanner scans non-movable pageblock
> Watermark check doesn't consider how many pages in non-movable pageblock.
This was to avoid expensive checks in compaction. It was for huge page
allocations so if compaction took too long then it offset any benefit
from using huge pages.
>
> 2) Introduce compaction depletion state
> Compaction algorithm will be changed to scan whole zone range. In this
> approach, compaction inevitably do back and forth migration between
> different iterations. If back and forth migration can make highorder
> freepage, it can be justified. But, in case of depletion of compaction
> possiblity, this back and forth migration causes unnecessary overhead.
> Compaction depleteion state is introduced to avoid this useless
> back and forth migration by detecting depletion of compaction possibility.
>
Interesting.
> 3) Change scanner's behaviour
> Migration scanner is changed to scan whole zone range regardless freepage
> scanner position. Freepage scanner also scans whole zone from
> zone_start_pfn to zone_end_pfn. To prevent back and forth migration
> within one compaction iteration, freepage scanner marks skip-bit when
> scanning pageblock. Migration scanner will skip this marked pageblock.
At each iteration, this could ping pong with migration sources becoming
migration targets and vice-versa. Keep an eye on the overall time spent
in compaction and consider forcing MIGRATE_SYNC as a possible
alternative.
> Test: hogger-frag-unmovable
> base nonmovable redesign threshold
> compact_free_scanned 2800710 5615427 6441095 2235764
> compact_isolated 58323 114183 2711081 647701
> compact_migrate_scanned 1078970 2437597 4175464 1697292
> compact_stall 341 1066 2059 2092
> compact_success 80 123 207 210
> pgmigrate_success 27034 53832 1348113 318395
> Success: 22 29 44 40
> Success(N): 46 61 90 83
>
>
> Test: hogger-frag-movable
> base nonmovable redesign threshold
> compact_free_scanned 5240676 5883401 8103231 1860428
> compact_isolated 75048 83201 3108978 427602
> compact_migrate_scanned 2468387 2755690 4316163 1474287
> compact_stall 710 664 2117 1964
> compact_success 98 102 234 183
> pgmigrate_success 34869 38663 1547318 208629
> Success: 25 26 45 44
> Success(N): 53 56 94 92
>
>
> Test: build-frag-unmovable
> base nonmovable redesign threshold
> compact_free_scanned 5032378 4110920 2538420 1891170
> compact_isolated 53368 330762 1020908 534680
> compact_migrate_scanned 1456516 6164677 4809150 2667823
> compact_stall 538 746 2609 2500
> compact_success 93 350 438 403
> pgmigrate_success 19926 152754 491609 251977
> Success: 15 31 39 40
> Success(N): 33 65 80 81
>
>
> Test: build-frag-movable
> base nonmovable redesign threshold
> compact_free_scanned 3059086 3852269 2359553 1461131
> compact_isolated 129085 238856 907515 387373
> compact_migrate_scanned 5029856 5051868 3785605 2177090
> compact_stall 388 540 2195 2157
> compact_success 99 218 247 225
> pgmigrate_success 52898 110021 439739 182366
> Success: 38 37 43 43
> Success(N): 82 77 89 90
>
The success rates look impressive. If you could, also report the total
time and system CPU time for the test. Ideally also report just the time
spent in compaction. Tony Jones posted a patch for perf that might
help with this
https://lkml.org/lkml/2015/5/26/18
Ideally also monitor the trace_mm_page_alloc_extfrag tracepoint and see
how many externally fragmenting events are occurring, particularly ones
for MIGRATE_UNMOVABLE requests.
--
Mel Gorman
SUSE Labs
On 06/25/2015 02:45 AM, Joonsoo Kim wrote:
> Recently, I got a report that android get slow due to order-2 page
> allocation. With some investigation, I found that compaction usually
> fails and many pages are reclaimed to make order-2 freepage. I can't
> analyze detailed reason that causes compaction fail because I don't
> have reproducible environment and compaction code is changed so much
> from that version, v3.10. But, I was inspired by this report and started
> to think limitation of current compaction algorithm.
>
> Limitation of current compaction algorithm:
>
> 1) Migrate scanner can't scan behind of free scanner, because
> each scanner starts at both side of zone and go toward each other. If
> they meet at some point, compaction is stopped and scanners' position
> is reset to both side of zone again. From my experience, migrate scanner
> usually doesn't scan beyond of half of the zone range.
Yes, I've also pointed this out in the RFC for pivot changing compaction.
> 2) Compaction capability is highly depends on amount of free memory.
> If there is 50 MB free memory on 4 GB system, migrate scanner can
> migrate 50 MB used pages at maximum and then will meet free scanner.
> If compaction can't make enough high order freepages during this
> amount of work, compaction would fail. There is no way to escape this
> failure situation in current algorithm and it will scan same region and
> fail again and again. And then, it goes into compaction deferring logic
> and will be deferred for some times.
That's 1) again but in more detail.
> 3) Compaction capability is highly depends on migratetype of memory,
> because freepage scanner doesn't scan unmovable pageblock.
Yes, I've also observed this issue recently.
> To investigate compaction limitations, I made some compaction benchmarks.
> Base environment of this benchmark is fragmented memory. Before testing,
> 25% of total size of memory is allocated. With some tricks, these
> allocations are evenly distributed to whole memory range. So, after
> allocation is finished, memory is highly fragmented and possibility of
> successful order-3 allocation is very low. Roughly 1500 order-3 allocation
> can be successful. Tests attempt excessive amount of allocation request,
> that is, 3000, to find out algorithm limitation.
>
> There are two variations.
>
> pageblock type (unmovable / movable):
>
> One is that most pageblocks are unmovable migratetype and the other is
> that most pageblocks are movable migratetype.
>
> memory usage (memory hogger 200 MB / kernel build with -j8):
>
> Memory hogger means that 200 MB free memory is occupied by hogger.
> Kernel build means that kernel build is running on background and it
> will consume free memory, but, amount of consumption will be very
> fluctuated.
>
> With these variations, I made 4 test cases by mixing them.
>
> hogger-frag-unmovable
> hogger-frag-movable
> build-frag-unmovable
> build-frag-movable
>
> All tests are conducted on 512 MB QEMU virtual machine with 8 CPUs.
>
> I can easily check weakness of compaction algorithm by following test.
>
> To check 1), hogger-frag-movable benchmark is used. Result is as
> following.
>
> bzImage-improve-base
> compact_free_scanned 5240676
> compact_isolated 75048
> compact_migrate_scanned 2468387
> compact_stall 710
> compact_success 98
> pgmigrate_success 34869
> Success: 25
> Success(N): 53
>
> Column 'Success' and 'Success(N) are calculated by following equations.
>
> Success = successful allocation * 100 / attempts
> Success(N) = successful allocation * 100 /
> number of successful order-3 allocation
As Mel pointed out, this is a weird description. The one from patch 5
makes more sense and I hope it's correct:
Success = successful allocation * 100 / attempts
Success(N) = successful allocation * 100 / order 3 candidates
>
> As mentioned above, there are roughly 1500 high order page candidates,
> but, compaction just returns 53% of them. With new compaction approach,
> it can be increased to 94%. See result at the end of this cover-letter.
>
> To check 2), hogger-frag-movable benchmark is used again, but, with some
> tweaks. Amount of allocated memory by memory hogger varys.
>
> bzImage-improve-base
> Hogger: 150MB 200MB 250MB 300MB
> Success: 41 25 17 9
> Success(N): 87 53 37 22
>
> As background knowledge, up to 250MB, there is enough
> memory to succeed all order-3 allocation attempts. In 300MB case,
> available memory before starting allocation attempt is just 57MB,
> so all of attempts cannot succeed.
>
> Anyway, as free memory decreases, compaction success rate also decreases.
> It is better to remove this dependency to get stable compaction result
> in any case.
>
> To check 3), build-frag-unmovable/movable benchmarks are used.
> All factors are same except pageblock migratetypes.
>
> Test: build-frag-unmovable
>
> bzImage-improve-base
> compact_free_scanned 5032378
> compact_isolated 53368
> compact_migrate_scanned 1456516
> compact_stall 538
> compact_success 93
> pgmigrate_success 19926
> Success: 15
> Success(N): 33
>
> Test: build-frag-movable
>
> bzImage-improve-base
> compact_free_scanned 3059086
> compact_isolated 129085
> compact_migrate_scanned 5029856
> compact_stall 388
> compact_success 99
> pgmigrate_success 52898
> Success: 38
> Success(N): 82
>
> Pageblock migratetype makes big difference on success rate. 3) would be
> one of reason related to this result. Because freepage scanner doesn't
> scan non-movable pageblock, compaction can't get enough freepage for
> migration and compaction easily fails. This patchset try to solve it
> by allowing freepage scanner to scan on non-movable pageblock.
>
> Result show that we cannot get all possible high order page through
> current compaction algorithm. And, in case that migratetype of
> pageblock is unmovable, success rate get worse. Although we can solve
> problem 3) in current algorithm, there is unsolvable limitations, 1), 2),
> so I'd like to change compaction algorithm.
>
> This patchset try to solve these limitations by introducing new compaction
> approach. Main changes of this patchset are as following:
>
> 1) Make freepage scanner scans non-movable pageblock
> Watermark check doesn't consider how many pages in non-movable pageblock.
> To fully utilize existing freepage, freepage scanner should scan
> non-movable pageblock.
I share Mel's concerns here. The evaluation should consider long-term
fragmentation effects. Especially when you've already seen a regression
from this patch.
> 2) Introduce compaction depletion state
> Compaction algorithm will be changed to scan whole zone range. In this
> approach, compaction inevitably do back and forth migration between
> different iterations. If back and forth migration can make highorder
> freepage, it can be justified. But, in case of depletion of compaction
> possiblity, this back and forth migration causes unnecessary overhead.
> Compaction depleteion state is introduced to avoid this useless
> back and forth migration by detecting depletion of compaction possibility.
Interesting, but I'll need to study this in more detail to grasp it
completely. Limiting the scanning because of not enough success might
naturally lead to even less success and potentially never recover?
> 3) Change scanner's behaviour
> Migration scanner is changed to scan whole zone range regardless freepage
> scanner position. Freepage scanner also scans whole zone from
> zone_start_pfn to zone_end_pfn.
OK so you did propose this in response to my pivot changing compaction.
I've been testing an approach like this too, to see which is better, but
it differs in many details. E.g. I just disabled the pageblock skip bits
for the time being; the way you handle them probably makes sense. The
results did look nice (when the free scanner ignored pageblock
migratetype), but the cost was very high. Not ignoring migratetype looked
like a better compromise at a high level, but the inability of the free
scanner to find anything does suck. I wonder what compromise could exist
here.
You also posted an RFC that migrates everything out of newly made
unmovable pageblocks in the event of a fallback allocation, and letting
the free scanner use such pageblocks goes against that.
> To prevent back and forth migration
> within one compaction iteration, freepage scanner marks skip-bit when
> scanning pageblock. Migration scanner will skip this marked pageblock.
> Finish condition is very simple. If migration scanner reaches end of
> the zone, compaction will be finished. If freepage scanner reaches end of
> the zone first, it restart at zone_start_pfn. This helps us to overcome
> dependency on amount of free memory.
Isn't there a danger of the free scanner scanning the zone many times
during a single compaction run? Does anything prevent it from scanning
the same pageblock as the migration scanner at the same time? The skip
bit is set only after the free scanner is finished with a block AFAICS.
[snip]
2015-06-25 20:03 GMT+09:00 Mel Gorman <[email protected]>:
> On Thu, Jun 25, 2015 at 09:45:11AM +0900, Joonsoo Kim wrote:
>> Recently, I got a report that android get slow due to order-2 page
>> allocation. With some investigation, I found that compaction usually
>> fails and many pages are reclaimed to make order-2 freepage. I can't
>> analyze detailed reason that causes compaction fail because I don't
>> have reproducible environment and compaction code is changed so much
>> from that version, v3.10. But, I was inspired by this report and started
>> to think limitation of current compaction algorithm.
>>
>> Limitation of current compaction algorithm:
>>
>
> I didn't review the individual patches unfortunately but have a few comments
> about things to watch out for.
Hello, Mel.
Your comments always help me a lot.
Thanks in advance.
>> 1) Migrate scanner can't scan behind of free scanner, because
>> each scanner starts at both side of zone and go toward each other. If
>> they meet at some point, compaction is stopped and scanners' position
>> is reset to both side of zone again. From my experience, migrate scanner
>> usually doesn't scan beyond of half of the zone range.
>>
>
> This was deliberate because if the scanners cross then they can undo each
> others work. The free scanner can locate a pageblock that pages were
> recently migrated from. Finishing compaction when the scanners met was
> the easiest way of avoiding the problem without maintaining global
> state.
I have an internal patch that prevents this kind of undoing.
I didn't submit it because it doesn't have any notable effect on the
current benchmarks, but I can include it in the next version.
> Global state is required because there can be parallel compaction
> attempts. The global state requires locking to avoid two parallel
> compaction attempts selecting the same pageblock for migrating to and
> from.
I used the skip-bit to prevent selecting the same pageblock for
migrating to and from. If the freepage scanner isolates some pages, the
skip-bit is set on that pageblock. The migration scanner checks the
skip-bit before scanning and will avoid scanning that marked pageblock.
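For reference, a minimal sketch of the interaction described above.
set_pageblock_skip()/get_pageblock_skip() are the existing skip-bit
helpers; the wrapper functions and call sites are only illustrative,
not the actual patch:

	/*
	 * Sketch, not the real code: the free scanner marks a pageblock
	 * once it has isolated freepages from it, and the migration
	 * scanner consults the same bit before isolating pages from a
	 * pageblock, so a migration target can't also become a migration
	 * source within the same compaction iteration.
	 */
	static void mark_freepage_target(struct page *page)
	{
		/* free scanner: we took freepages from this block */
		set_pageblock_skip(page);
	}

	static bool may_use_as_migrate_source(struct page *page)
	{
		/* migration scanner: skip blocks already used as targets */
		return !get_pageblock_skip(page);
	}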
> This global state then needs to be reset on each compaction cycle. The
> difficulty then is that there is a potential ping-pong effect. A pageblock
> that was previously a migration target for the free scanner may become a
> migration source for the migration scanner. Having the scanners operate
> in opposite directions and meet in the middle avoided this problem.
I admit that this patchset causes a ping-pong effect between compaction
cycles, because the skip-bit is reset on each compaction cycle. But, I
think that we don't need to worry about it. We should make high order
pages up to PAGE_ALLOC_COSTLY_ORDER by any means. If compaction fails, we
need to reclaim some pages and this would cause file I/O, which is worse
than the ping-pong effect in compaction.
> I'm not saying the current design is perfect but it avoids a number of
> problems that are worth keeping in mind. Regressions in this area will
> look like higher system CPU time with most of the additional time spent
> in compaction.
>
>> 2) Compaction capability is highly depends on amount of free memory.
>> If there is 50 MB free memory on 4 GB system, migrate scanner can
>> migrate 50 MB used pages at maximum and then will meet free scanner.
>> If compaction can't make enough high order freepages during this
>> amount of work, compaction would fail. There is no way to escape this
>> failure situation in current algorithm and it will scan same region and
>> fail again and again. And then, it goes into compaction deferring logic
>> and will be deferred for some times.
>>
>
> This is why reclaim/compaction exists. When this situation occurs, the
> kernel is meant to reclaim some order-0 pages and try again. Initially
> it was lumpy reclaim that was used but it severely disrupted the system.
No, the current kernel implementation doesn't reclaim pages in this
situation. The order-0 watermark check passes in this case, so the
reclaim logic regards this state as compact_ready and sees no need to
reclaim. Even if we changed it to reclaim some pages in this case, there
are usually parallel tasks that want to use more memory, so the amount of
free memory wouldn't increase as much as we need and compaction wouldn't
succeed.
>> 3) Compaction capability is highly depends on migratetype of memory,
>> because freepage scanner doesn't scan unmovable pageblock.
>>
>
> For a very good reason. Unmovable allocation requests that fallback to
> other pageblocks are the worst in terms of fragmentation avoidance. The
> more of these events there are, the more the system will decay. If there
> are many of these events then a compaction benchmark may start with high
> success rates but decay over time.
>
> Very broadly speaking, the more the mm_page_alloc_extfrag tracepoint
> triggers with alloc_migratetype == MIGRATE_UNMOVABLE, the faster the
> system is decaying. Having the freepage scanner select unmovable
> pageblocks will trigger this event more frequently.
>
> The unfortunate impact is that selecting unmovable blocks from the free
> csanner will improve compaction success rates for high-order kernel
> allocations early in the lifetime of the system but later fail high-order
> allocation requests as more pageblocks get converted to unmovable. It
> might be ok for kernel allocations but THP will eventually have a 100%
> failure rate.
I wrote the rationale in the patch itself. We already use non-movable
pageblocks for the migration scanner. That empties non-movable
pageblocks, so the number of freepages in non-movable pageblocks will
increase. Using non-movable pageblocks for the freepage scanner negates
this effect, so the number of freepages in non-movable pageblocks will
stay balanced. Could you tell me in detail how the freepage scanner
selecting unmovable pageblocks will cause more fragmentation? Possibly, I
don't understand the effect of this patch correctly and need to do some
investigation. :)
>> To investigate compaction limitations, I made some compaction benchmarks.
>> Base environment of this benchmark is fragmented memory. Before testing,
>> 25% of total size of memory is allocated. With some tricks, these
>> allocations are evenly distributed to whole memory range. So, after
>> allocation is finished, memory is highly fragmented and possibility of
>> successful order-3 allocation is very low. Roughly 1500 order-3 allocation
>> can be successful. Tests attempt excessive amount of allocation request,
>> that is, 3000, to find out algorithm limitation.
>>
>
> Did you look at forcing MIGRATE_SYNC for these allocation
> requests? Compaction was originally designed for huge page allocations and
> high-order kernel allocations up to PAGE_ALLOC_COSTLY_ORDER were meant to
> free naturally. MIGRATE_SYNC considers more pageblocks for migration and
> should improve success rates at the cost of longer stalls.
This test uses unmovable allocations with GFP_NORETRY, so it falls back
to sync compaction after async compaction and direct reclaim fail. That
should have been mentioned but I missed it. Sorry about that. I will
include it in the next version.
>> <SNIP>
>> Column 'Success' and 'Success(N) are calculated by following equations.
>>
>> Success = successful allocation * 100 / attempts
>> Success(N) = successful allocation * 100 /
>> number of successful order-3 allocation
>>
>
> I don't quite get this.
>
> Success = success * 100 / attempts
> Success(N) = success * 100 / success
>
> Can you try explaining Success(N) again? It's not clear how the first
> "success allocation" differs from the "successful order-3" allocations.
> Is it all allocation attempts or any order in the system or what?
Sorry for the confusing wording.
Success = successful allocation * 100 / attempts
Success(N) = successful allocation * 100 / order 3 candidates
The number of order-3 candidates is limited to roughly 1500 in my test
setup because I used a trick to fragment memory. So, in my setup,
Success(N) is calculated as success * 100 / 1500 (the 1500 varies
slightly on each test).
>> <SNIP>
>>
>> Anyway, as free memory decreases, compaction success rate also decreases.
>> It is better to remove this dependency to get stable compaction result
>> in any case.
>>
>
> BTW, this is also expected. Early in the existence of compaction it had
> much higher success rates -- high 90%s around kernel 3.0. This has
> dropped over time because the *cost* of granting those allocations was
> so high. These are timing results from a high-order allocation stress
> test
>
> stress-highalloc
> 3.0.0 3.0.101 3.12.38
> Ops 1 89.00 ( 0.00%) 84.00 ( 5.62%) 11.00 ( 87.64%)
> Ops 2 91.00 ( 0.00%) 71.00 ( 21.98%) 11.00 ( 87.91%)
> Ops 3 93.00 ( 0.00%) 89.00 ( 4.30%) 80.00 ( 13.98%)
>
> 3.0.0 3.0.101 3.12.38
> vanilla vanilla vanilla
> User 2904.90 2280.92 2873.42
> System 630.53 624.25 510.87
> Elapsed 3869.95 1291.28 1232.83
>
> Ops 1 and 2 are allocation attempts under heavy load and note how kernel
> 3.0 and 3.0.101 had success rates of over 80%. However, look at how long
> 3.0.0 took -- over an hour vs 20 minutes in later kernels.
I can attach the elapsed time from one of the stress-highalloc tests.
The others show the same tendency.
base threshold
Ops 1 24.00 ( 0.00%) 81.00 (-237.50%)
Ops 2 30.00 ( 0.00%) 83.00 (-176.67%)
Ops 3 91.00 ( 0.00%) 94.00 ( -3.30%)
User 5219.23 4168.99
System 1100.04 1018.23
Elapsed 1357.40 1488.90
Compaction stalls 5313 10250
Compaction success 1796 4893
Compaction failures 3517 5357
Compaction pages isolated 7069617 12330604
Compaction migrate scanned 64710910 59484302
Compaction free scanned 460910906 129035561
Compaction cost 4202 6934
Elapsed time increases slightly, but not as much as in 3.0.0.
With the patch, we get double the number of high order pages and roughly
double the pages isolated, so I think it is a reasonable trade-off.
> Later kernels avoid any expensive step even though it means the success
> rates are lower. Your figures look impressive but remember that the
> success rates could be due to very long stalls.
>
>> Pageblock migratetype makes big difference on success rate. 3) would be
>> one of reason related to this result. Because freepage scanner doesn't
>> scan non-movable pageblock, compaction can't get enough freepage for
>> migration and compaction easily fails. This patchset try to solve it
>> by allowing freepage scanner to scan on non-movable pageblock.
>>
>
> I really think that using unmovable blocks as a freepage scanner target
> will cause all THP allocation attempts to eventually fail.
Please elaborate more on it. I can't imagine how it happens.
>> Result show that we cannot get all possible high order page through
>> current compaction algorithm. And, in case that migratetype of
>> pageblock is unmovable, success rate get worse. Although we can solve
>> problem 3) in current algorithm, there is unsolvable limitations, 1), 2),
>> so I'd like to change compaction algorithm.
>>
>> This patchset try to solve these limitations by introducing new compaction
>> approach. Main changes of this patchset are as following:
>>
>> 1) Make freepage scanner scans non-movable pageblock
>> Watermark check doesn't consider how many pages in non-movable pageblock.
>
> This was to avoid expensive checks in compaction. It was for huge page
> allocations so if compaction took too long then it offset any benefit
> from using huge pages.
I also think that having compaction count the number of freepages in
non-movable pageblocks isn't an appropriate solution.
>>
>> 2) Introduce compaction depletion state
>> Compaction algorithm will be changed to scan whole zone range. In this
>> approach, compaction inevitably do back and forth migration between
>> different iterations. If back and forth migration can make highorder
>> freepage, it can be justified. But, in case of depletion of compaction
>> possiblity, this back and forth migration causes unnecessary overhead.
>> Compaction depleteion state is introduced to avoid this useless
>> back and forth migration by detecting depletion of compaction possibility.
>>
>
> Interesting.
>
>> 3) Change scanner's behaviour
>> Migration scanner is changed to scan whole zone range regardless freepage
>> scanner position. Freepage scanner also scans whole zone from
>> zone_start_pfn to zone_end_pfn. To prevent back and forth migration
>> within one compaction iteration, freepage scanner marks skip-bit when
>> scanning pageblock. Migration scanner will skip this marked pageblock.
>
> At each iteration, this could ping pong with migration sources becoming
> migration targets and vice-versa. Keep an eye on the overall time spent
> in compaction and consider forcing MIGRATE_SYNC as a possible
> alternative.
>
>> Test: hogger-frag-unmovable
>> base nonmovable redesign threshold
>> compact_free_scanned 2800710 5615427 6441095 2235764
>> compact_isolated 58323 114183 2711081 647701
>> compact_migrate_scanned 1078970 2437597 4175464 1697292
>> compact_stall 341 1066 2059 2092
>> compact_success 80 123 207 210
>> pgmigrate_success 27034 53832 1348113 318395
>> Success: 22 29 44 40
>> Success(N): 46 61 90 83
>>
>>
>> Test: hogger-frag-movable
>> base nonmovable redesign threshold
>> compact_free_scanned 5240676 5883401 8103231 1860428
>> compact_isolated 75048 83201 3108978 427602
>> compact_migrate_scanned 2468387 2755690 4316163 1474287
>> compact_stall 710 664 2117 1964
>> compact_success 98 102 234 183
>> pgmigrate_success 34869 38663 1547318 208629
>> Success: 25 26 45 44
>> Success(N): 53 56 94 92
>>
>>
>> Test: build-frag-unmovable
>> base nonmovable redesign threshold
>> compact_free_scanned 5032378 4110920 2538420 1891170
>> compact_isolated 53368 330762 1020908 534680
>> compact_migrate_scanned 1456516 6164677 4809150 2667823
>> compact_stall 538 746 2609 2500
>> compact_success 93 350 438 403
>> pgmigrate_success 19926 152754 491609 251977
>> Success: 15 31 39 40
>> Success(N): 33 65 80 81
>>
>>
>> Test: build-frag-movable
>> base nonmovable redesign threshold
>> compact_free_scanned 3059086 3852269 2359553 1461131
>> compact_isolated 129085 238856 907515 387373
>> compact_migrate_scanned 5029856 5051868 3785605 2177090
>> compact_stall 388 540 2195 2157
>> compact_success 99 218 247 225
>> pgmigrate_success 52898 110021 439739 182366
>> Success: 38 37 43 43
>> Success(N): 82 77 89 90
>>
>
> The success rates look impressive. If you could, also report the total
> time and system CPU time for the test. Ideally also report just the time
> spent in compaction. Tony Jones posted a patch for perf that might
> help with this
> https://lkml.org/lkml/2015/5/26/18
Looks good. Maybe the compaction related stats already show some of
this, but I will check.
> Ideally also monitor the trace_mm_page_alloc_extfrag tracepoint and see
> how many externally fragmenting events are occuring, particularly ones
> for MIGRATE_UNMOVABLE requests.
Okay. Will do.
Thanks for the detailed review.
Thanks.
On Fri, Jun 26, 2015 at 02:11:17AM +0900, Joonsoo Kim wrote:
> > Global state is required because there can be parallel compaction
> > attempts. The global state requires locking to avoid two parallel
> > compaction attempts selecting the same pageblock for migrating to and
> > from.
>
> I used skip-bit to prevent selecting same pageblock for migrating to
> and from. If freepage scanner isolates some pages, skip-bit is set
> on that pageblock. Migration scanner checks skip-bit before scanning
> and will avoid to scan that marked pageblock.
>
That will need locking, or the migration scanner could start just before
the skip bit is set.
>
> > This global state then needs to be reset on each compaction cycle. The
> > difficulty then is that there is a potential ping-pong effect. A pageblock
> > that was previously a migration target for the free scanner may become a
> > migration source for the migration scanner. Having the scanners operate
> > in opposite directions and meet in the middle avoided this problem.
>
> I admit that this patchset causes ping-pong effect between each compaction
> cycle, because skip-bit is reset on each compaction cycle. But, I think that
> we don't need to worry about it. We should make high order page up to
> PAGE_COSTLY_ORDER by any means. If compaction fails, we need to
> reclaim some pages and this would cause file I/O. It is more bad than
> ping-pong effect on compaction.
>
That's debatable because the assumption is that the compaction will
definitely allow forward progress. Copying pages back and forth without
forward progress will chew CPU. There is a cost with reclaiming to allow
compaction but that's the price to pay if high-order kernel allocations
are required. In the case of THP, we can give up quickly at least.
> > I'm not saying the current design is perfect but it avoids a number of
> > problems that are worth keeping in mind. Regressions in this area will
> > look like higher system CPU time with most of the additional time spent
> > in compaction.
> >
> >> 2) Compaction capability is highly depends on amount of free memory.
> >> If there is 50 MB free memory on 4 GB system, migrate scanner can
> >> migrate 50 MB used pages at maximum and then will meet free scanner.
> >> If compaction can't make enough high order freepages during this
> >> amount of work, compaction would fail. There is no way to escape this
> >> failure situation in current algorithm and it will scan same region and
> >> fail again and again. And then, it goes into compaction deferring logic
> >> and will be deferred for some times.
> >>
> >
> > This is why reclaim/compaction exists. When this situation occurs, the
> > kernel is meant to reclaim some order-0 pages and try again. Initially
> > it was lumpy reclaim that was used but it severely disrupted the system.
>
> No, current kernel implementation doesn't reclaim pages in this situation.
> Watermark check for order 0 would be passed in this case and reclaim logic
> regards this state as compact_ready and there is no need to reclaim. Even if
> we change it to reclaim some pages in this case, there are usually parallel
> tasks who want to use more memory so free memory size wouldn't increase
> as much as we need and compaction wouldn't succeed.
>
It could though. Reclaim/compaction is entered for orders higher than
PAGE_ALLOC_COSTLY_ORDER and when scan priority is sufficiently high.
That could be adjusted if you have a viable case where orders <
PAGE_ALLOC_COSTLY_ORDER must succeed and currently requires excessive
reclaim instead of relying on compaction.
> >> 3) Compaction capability is highly depends on migratetype of memory,
> >> because freepage scanner doesn't scan unmovable pageblock.
> >>
> >
> > For a very good reason. Unmovable allocation requests that fallback to
> > other pageblocks are the worst in terms of fragmentation avoidance. The
> > more of these events there are, the more the system will decay. If there
> > are many of these events then a compaction benchmark may start with high
> > success rates but decay over time.
> >
> > Very broadly speaking, the more the mm_page_alloc_extfrag tracepoint
> > triggers with alloc_migratetype == MIGRATE_UNMOVABLE, the faster the
> > system is decaying. Having the freepage scanner select unmovable
> > pageblocks will trigger this event more frequently.
> >
> > The unfortunate impact is that selecting unmovable blocks from the free
> > csanner will improve compaction success rates for high-order kernel
> > allocations early in the lifetime of the system but later fail high-order
> > allocation requests as more pageblocks get converted to unmovable. It
> > might be ok for kernel allocations but THP will eventually have a 100%
> > failure rate.
>
> I wrote rationale in the patch itself. We already use non-movable pageblock
> for migration scanner. It empties non-movable pageblock so number of
> freepage on non-movable pageblock will increase. Using non-movable
> pageblock for freepage scanner negates this effect so number of freepage
> on non-movable pageblock will be balanced. Could you tell me in detail
> how freepage scanner select unmovable pageblocks will cause
> more fragmentation? Possibly, I don't understand effect of this patch
> correctly and need some investigation. :)
>
The long-term success rate of fragmentation avoidance depends on
minimising the number of UNMOVABLE allocation requests that use a
pageblock belonging to another migratetype. Once such a fallback occurs,
that pageblock potentially can never be used for a THP allocation again.
Lets say there is an unmovable pageblock with 500 free pages in it. If
the freepage scanner uses that pageblock and allocates all 500 free
pages then the next unmovable allocation request needs a new pageblock.
If one is not completely free then it will fallback to using a
RECLAIMABLE or MOVABLE pageblock forever contaminating it.
Do that enough times and fragmentation avoidance breaks down.
Your scheme of migrating to UNMOVABLE blocks may allow order-3 allocations
to succeed as long as there are enough MOVABLE pageblocks to move pages
from but eventually it'll stop working. THP-sized allocations would be the
first to notice. That might not matter on a mobile but it matters elsewhere.
--
Mel Gorman
SUSE Labs
2015-06-25 22:35 GMT+09:00 Vlastimil Babka <[email protected]>:
> On 06/25/2015 02:45 AM, Joonsoo Kim wrote:
>>
>> Recently, I got a report that android get slow due to order-2 page
>> allocation. With some investigation, I found that compaction usually
>> fails and many pages are reclaimed to make order-2 freepage. I can't
>> analyze detailed reason that causes compaction fail because I don't
>> have reproducible environment and compaction code is changed so much
>> from that version, v3.10. But, I was inspired by this report and started
>> to think limitation of current compaction algorithm.
>>
>> Limitation of current compaction algorithm:
>>
>> 1) Migrate scanner can't scan behind of free scanner, because
>> each scanner starts at both side of zone and go toward each other. If
>> they meet at some point, compaction is stopped and scanners' position
>> is reset to both side of zone again. From my experience, migrate scanner
>> usually doesn't scan beyond of half of the zone range.
>
>
> Yes, I've also pointed this out in the RFC for pivot changing compaction.
Hello, Vlastimil.
Yes, you did. :)
>> 2) Compaction capability is highly depends on amount of free memory.
>> If there is 50 MB free memory on 4 GB system, migrate scanner can
>> migrate 50 MB used pages at maximum and then will meet free scanner.
>> If compaction can't make enough high order freepages during this
>> amount of work, compaction would fail. There is no way to escape this
>> failure situation in current algorithm and it will scan same region and
>> fail again and again. And then, it goes into compaction deferring logic
>> and will be deferred for some times.
>
>
> That's 1) again but in more detail.
Hmm... I don't think this is the same issue. 1) is about which
pageblocks the migration scanner can scan over many iterations. But, 2)
is about how many pages the migration scanner can scan in one iteration.
Think about your pivot change approach. It can change the pivot so the
migration scanner can eventually scan the whole zone range. But, in each
iteration, it can scan only a limited number of pages according to the
amount of free memory. If free memory is low, the pivot approach needs
more iterations to scan the whole zone range. I'd like to remove this
dependency.
>> 3) Compaction capability is highly depends on migratetype of memory,
>> because freepage scanner doesn't scan unmovable pageblock.
>
>
> Yes, I've also observed this issue recently.
Cool!!
>> To investigate compaction limitations, I made some compaction benchmarks.
>> Base environment of this benchmark is fragmented memory. Before testing,
>> 25% of total size of memory is allocated. With some tricks, these
>> allocations are evenly distributed to whole memory range. So, after
>> allocation is finished, memory is highly fragmented and possibility of
>> successful order-3 allocation is very low. Roughly 1500 order-3 allocation
>> can be successful. Tests attempt excessive amount of allocation request,
>> that is, 3000, to find out algorithm limitation.
>>
>> There are two variations.
>>
>> pageblock type (unmovable / movable):
>>
>> One is that most pageblocks are unmovable migratetype and the other is
>> that most pageblocks are movable migratetype.
>>
>> memory usage (memory hogger 200 MB / kernel build with -j8):
>>
>> Memory hogger means that 200 MB free memory is occupied by hogger.
>> Kernel build means that kernel build is running on background and it
>> will consume free memory, but, amount of consumption will be very
>> fluctuated.
>>
>> With these variations, I made 4 test cases by mixing them.
>>
>> hogger-frag-unmovable
>> hogger-frag-movable
>> build-frag-unmovable
>> build-frag-movable
>>
>> All tests are conducted on 512 MB QEMU virtual machine with 8 CPUs.
>>
>> I can easily check weakness of compaction algorithm by following test.
>>
>> To check 1), hogger-frag-movable benchmark is used. Result is as
>> following.
>>
>> bzImage-improve-base
>> compact_free_scanned 5240676
>> compact_isolated 75048
>> compact_migrate_scanned 2468387
>> compact_stall 710
>> compact_success 98
>> pgmigrate_success 34869
>> Success: 25
>> Success(N): 53
>>
>> Column 'Success' and 'Success(N) are calculated by following equations.
>>
>> Success = successful allocation * 100 / attempts
>> Success(N) = successful allocation * 100 /
>> number of successful order-3 allocation
>
>
> As Mel pointed out, this is a weird description. The one from patch 5 makes
> more sense and I hope it's correct:
>
> Success = successful allocation * 100 / attempts
> Success(N) = successful allocation * 100 / order 3 candidates
Thanks. I used this description to reply to Mel's question.
>
>>
>> As mentioned above, there are roughly 1500 high order page candidates,
>> but, compaction just returns 53% of them. With new compaction approach,
>> it can be increased to 94%. See result at the end of this cover-letter.
>>
>> To check 2), hogger-frag-movable benchmark is used again, but, with some
>> tweaks. Amount of allocated memory by memory hogger varys.
>>
>> bzImage-improve-base
>> Hogger: 150MB 200MB 250MB 300MB
>> Success: 41 25 17 9
>> Success(N): 87 53 37 22
>>
>> As background knowledge, up to 250MB, there is enough
>> memory to succeed all order-3 allocation attempts. In 300MB case,
>> available memory before starting allocation attempt is just 57MB,
>> so all of attempts cannot succeed.
>>
>> Anyway, as free memory decreases, compaction success rate also decreases.
>> It is better to remove this dependency to get stable compaction result
>> in any case.
>>
>> To check 3), build-frag-unmovable/movable benchmarks are used.
>> All factors are same except pageblock migratetypes.
>>
>> Test: build-frag-unmovable
>>
>> bzImage-improve-base
>> compact_free_scanned 5032378
>> compact_isolated 53368
>> compact_migrate_scanned 1456516
>> compact_stall 538
>> compact_success 93
>> pgmigrate_success 19926
>> Success: 15
>> Success(N): 33
>>
>> Test: build-frag-movable
>>
>> bzImage-improve-base
>> compact_free_scanned 3059086
>> compact_isolated 129085
>> compact_migrate_scanned 5029856
>> compact_stall 388
>> compact_success 99
>> pgmigrate_success 52898
>> Success: 38
>> Success(N): 82
>>
>> Pageblock migratetype makes big difference on success rate. 3) would be
>> one of reason related to this result. Because freepage scanner doesn't
>> scan non-movable pageblock, compaction can't get enough freepage for
>> migration and compaction easily fails. This patchset try to solve it
>> by allowing freepage scanner to scan on non-movable pageblock.
>>
>> Result show that we cannot get all possible high order page through
>> current compaction algorithm. And, in case that migratetype of
>> pageblock is unmovable, success rate get worse. Although we can solve
>> problem 3) in current algorithm, there is unsolvable limitations, 1), 2),
>> so I'd like to change compaction algorithm.
>>
>> This patchset try to solve these limitations by introducing new compaction
>> approach. Main changes of this patchset are as following:
>>
>> 1) Make freepage scanner scans non-movable pageblock
>> Watermark check doesn't consider how many pages in non-movable pageblock.
>> To fully utilize existing freepage, freepage scanner should scan
>> non-movable pageblock.
>
>
> I share Mel's concerns here. The evaluation should consider long-term
> fragmentation effects. Especially when you've already seen a regression from
> this patch.
Okay. I will evaluate it. Anyway, do you have any scenario where this
causes more fragmentation? It would be helpful to know the scenario
before investigating.
>> 2) Introduce compaction depletion state
>> Compaction algorithm will be changed to scan whole zone range. In this
>> approach, compaction inevitably do back and forth migration between
>> different iterations. If back and forth migration can make highorder
>> freepage, it can be justified. But, in case of depletion of compaction
>> possiblity, this back and forth migration causes unnecessary overhead.
>> Compaction depleteion state is introduced to avoid this useless
>> back and forth migration by detecting depletion of compaction possibility.
>
>
> Interesting, but I'll need to study this in more detail to grasp it
> completely.
No problem.
> Limiting the scanning because of not enough success might
> naturally lead to even less success and potentially never recover?
Yes, that is possible and some fine tuning may be needed. But, we can't
estimate the exact state of memory in advance, so it is best to use a
heuristic.
In fact, I think that we should limit scanning in sync compaction in any
case. Having sync compaction scan as much as possible until it finds that
compaction is really impossible is a dangerous and very time consuming
approach. This is related to the recent change in the network code to
prevent direct compaction.
We should balance the effort spent on compaction against other work such
as reclaim. Direct reclaim is limited to a certain number of pages, but
sync compaction doesn't have any limit. That is odd. The best strategy
would be to limit the scanning of sync compaction and implement
high-level control of how many sync compaction attempts are made
according to the amount of reclaim effort.
>> 3) Change scanner's behaviour
>> Migration scanner is changed to scan whole zone range regardless freepage
>> scanner position. Freepage scanner also scans whole zone from
>> zone_start_pfn to zone_end_pfn.
>
>
> OK so you did propose this in response to my pivot changing compaction. I've
> been testing approach like this too, to see which is better, but it differs
> in many details. E.g. I just disabled pageblock skip bits for the time
> being, the way you handle them probably makes sense. The results did look
> nice (when free scanner ignored pageblock migratetype), but the cost was
> very high. Not ignoring migratetype looked like a better compromise on
> high-level look, but the inability of free scanner to find anything does
> suck.
Yes. As mentioned elsewhere, I think that the inability of the free
scanner to find freepages is a bigger problem than fragmentation.
> I wonder what compromise could exist here.
I wonder about that too. :)
I will investigate more.
> You did also post a RFC
> that migrates everything out of newly made unmovable pageblocks in the event
> of fallback allocation, and letting free scanner use such pageblocks goes
> against that.
I think this patchset is more important than that one. I will reconsider
it after seeing this patchset's conclusion.
>> To prevent back and forth migration
>> within one compaction iteration, freepage scanner marks skip-bit when
>> scanning pageblock. Migration scanner will skip this marked pageblock.
>> Finish condition is very simple. If migration scanner reaches end of
>> the zone, compaction will be finished. If freepage scanner reaches end of
>> the zone first, it restart at zone_start_pfn. This helps us to overcome
>> dependency on amount of free memory.
>
>
> Isn't there a danger of the free scanner scanning the zone many times during
> single compaction run?
I found this in another test with CMA after sending it. Maybe,
there is some corner case. I need to fix it.
> Does anything prevent it from scanning the same
> pageblock as the migration scanner at the same time? The skip bit is set
> only after free scanner is finished with a block AFAICS.
There is a check in isolate_freepages() that prevents the free scanner
from scanning the same pageblock that the migration scanner is currently
scanning. There is no such guard in the case of parallel compaction, but
I guess the race window is small enough because the freepage scanner
scans and marks the skip-bit quickly.
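Just to illustrate the kind of guard meant here (the function and
variable names are mine and may not match the actual isolate_freepages()):

	/*
	 * Illustrative only: the free scanner refuses to take freepages
	 * from the pageblock the migration scanner is currently working
	 * on, so the two scanners never touch the same block within one
	 * compaction run.
	 */
	static bool block_used_by_migrate_scanner(struct compact_control *cc,
						  unsigned long block_start_pfn)
	{
		unsigned long migrate_block;

		migrate_block = cc->migrate_pfn & ~(pageblock_nr_pages - 1);
		return (block_start_pfn & ~(pageblock_nr_pages - 1)) == migrate_block;
	}

The free scanner would then simply skip any block for which this returns
true.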
Thanks for the detailed and quick review. :)
Thanks.
2015-06-26 2:25 GMT+09:00 Mel Gorman <[email protected]>:
> On Fri, Jun 26, 2015 at 02:11:17AM +0900, Joonsoo Kim wrote:
>> > Global state is required because there can be parallel compaction
>> > attempts. The global state requires locking to avoid two parallel
>> > compaction attempts selecting the same pageblock for migrating to and
>> > from.
>>
>> I used skip-bit to prevent selecting same pageblock for migrating to
>> and from. If freepage scanner isolates some pages, skip-bit is set
>> on that pageblock. Migration scanner checks skip-bit before scanning
>> and will avoid to scan that marked pageblock.
>>
>
> That will need locking or migration scanner could start just before the
> skip bit is set.
Yes, that's possible, but the race window is very small and the worst
effect is that a small amount of compaction work is infrequently
undone. The migration scanner checks the skip-bit before starting
isolation, so at most one isolation batch of 32 migrated pages could
be undone.
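As a rough sketch of the migration-scanner side of the scheme (again,
the name is mine rather than the patch's): the skip-bit is consulted
once per pageblock, before a batch of at most COMPACT_CLUSTER_MAX (32)
pages is isolated from it, which is what bounds the work the race can
undo.

/*
 * Sketch only: if the free scanner already set the skip-bit (the block
 * was used as a migration target), the migration scanner skips the
 * whole pageblock.
 */
static bool migrate_block_should_skip(struct page *block_first_page)
{
	/* Set by the free scanner after it isolated freepages here. */
	return get_pageblock_skip(block_first_page);
}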
>>
>> > This global state then needs to be reset on each compaction cycle. The
>> > difficulty then is that there is a potential ping-pong effect. A pageblock
>> > that was previously a migration target for the free scanner may become a
>> > migration source for the migration scanner. Having the scanners operate
>> > in opposite directions and meet in the middle avoided this problem.
>>
>> I admit that this patchset causes ping-pong effect between each compaction
>> cycle, because skip-bit is reset on each compaction cycle. But, I think that
>> we don't need to worry about it. We should make high order page up to
>> PAGE_COSTLY_ORDER by any means. If compaction fails, we need to
>> reclaim some pages and this would cause file I/O. It is more bad than
>> ping-pong effect on compaction.
>>
>
> That's debatable because the assumption is that the compaction will
> definitly allow forward progress. Copying pages back and forth without
> forward progress will chew CPU. There is a cost with reclaiming to allow
> compaction but that's the price to pay if high-order kernel allocations
> are required. In the case of THP, we can give up quickly at least.
There is compaction limit logic in this patchset and it effectively
limits compaction when forward progress isn't guaranteed.
>> > I'm not saying the current design is perfect but it avoids a number of
>> > problems that are worth keeping in mind. Regressions in this area will
>> > look like higher system CPU time with most of the additional time spent
>> > in compaction.
>> >
>> >> 2) Compaction capability is highly depends on amount of free memory.
>> >> If there is 50 MB free memory on 4 GB system, migrate scanner can
>> >> migrate 50 MB used pages at maximum and then will meet free scanner.
>> >> If compaction can't make enough high order freepages during this
>> >> amount of work, compaction would fail. There is no way to escape this
>> >> failure situation in current algorithm and it will scan same region and
>> >> fail again and again. And then, it goes into compaction deferring logic
>> >> and will be deferred for some times.
>> >>
>> >
>> > This is why reclaim/compaction exists. When this situation occurs, the
>> > kernel is meant to reclaim some order-0 pages and try again. Initially
>> > it was lumpy reclaim that was used but it severely disrupted the system.
>>
>> No, current kernel implementation doesn't reclaim pages in this situation.
>> Watermark check for order 0 would be passed in this case and reclaim logic
>> regards this state as compact_ready and there is no need to reclaim. Even if
>> we change it to reclaim some pages in this case, there are usually parallel
>> tasks who want to use more memory so free memory size wouldn't increase
>> as much as we need and compaction wouldn't succeed.
>>
>
> It could though. Reclaim/compaction is entered for orders higher than
> PAGE_ALLOC_COSTLY_ORDER and when scan priority is sufficiently high.
> That could be adjusted if you have a viable case where orders <
> PAGE_ALLOC_COSTLY_ORDER must succeed and currently requires excessive
> reclaim instead of relying on compaction.
Yes. I saw this problem in a real situation. On ARM, an order-2
allocation is requested in fork(), so it should succeed. But there
were not enough order-2 freepages, so reclaim/compaction began.
Compaction failed repeatedly, although I didn't check the exact
reason. Anyway, the system reclaimed repeatedly to satisfy this
order-2 allocation request. To make matters worse, there was no free
swap space, so anon pages could not be reclaimed. In this situation,
anon pages act as fragmentation providers and reclaim doesn't produce
an order-2 freepage even when far too many file pages are reclaimed.
If compaction worked properly, the order-2 allocation would succeed
easily, so I started this patchset. Maybe there is another reason for
compaction to fail, but it is true that current compaction has the
limitations mentioned above, so I'd like to fix them this time.
>> >> 3) Compaction capability is highly depends on migratetype of memory,
>> >> because freepage scanner doesn't scan unmovable pageblock.
>> >>
>> >
>> > For a very good reason. Unmovable allocation requests that fallback to
>> > other pageblocks are the worst in terms of fragmentation avoidance. The
>> > more of these events there are, the more the system will decay. If there
>> > are many of these events then a compaction benchmark may start with high
>> > success rates but decay over time.
>> >
>> > Very broadly speaking, the more the mm_page_alloc_extfrag tracepoint
>> > triggers with alloc_migratetype == MIGRATE_UNMOVABLE, the faster the
>> > system is decaying. Having the freepage scanner select unmovable
>> > pageblocks will trigger this event more frequently.
>> >
>> > The unfortunate impact is that selecting unmovable blocks from the free
>> > csanner will improve compaction success rates for high-order kernel
>> > allocations early in the lifetime of the system but later fail high-order
>> > allocation requests as more pageblocks get converted to unmovable. It
>> > might be ok for kernel allocations but THP will eventually have a 100%
>> > failure rate.
>>
>> I wrote rationale in the patch itself. We already use non-movable pageblock
>> for migration scanner. It empties non-movable pageblock so number of
>> freepage on non-movable pageblock will increase. Using non-movable
>> pageblock for freepage scanner negates this effect so number of freepage
>> on non-movable pageblock will be balanced. Could you tell me in detail
>> how freepage scanner select unmovable pageblocks will cause
>> more fragmentation? Possibly, I don't understand effect of this patch
>> correctly and need some investigation. :)
>>
>
> The long-term success rate of fragmentation avoidance depends on
> minimsing the number of UNMOVABLE allocation requests that use a
> pageblock belonging to another migratetype. Once such a fallback occurs,
> that pageblock potentially can never be used for a THP allocation again.
>
> Lets say there is an unmovable pageblock with 500 free pages in it. If
> the freepage scanner uses that pageblock and allocates all 500 free
> pages then the next unmovable allocation request needs a new pageblock.
> If one is not completely free then it will fallback to using a
> RECLAIMABLE or MOVABLE pageblock forever contaminating it.
Yes, I can imagine that situation. But, as I said above, we already
use non-movable pageblocks for the migration scanner. While the
unmovable pageblock with 500 free pages fills up, some other unmovable
pageblock containing movable pages will be emptied. The number of
freepages in non-movable pageblocks would be maintained, so the
fallback doesn't happen.
Anyway, it is better to investigate this effect. I will do it and
attach the result to the next submission.
> Do that enough times and fragmentation avoidance breaks down.
>
> Your scheme of migrating to UNMOVABLE blocks may allow order-3 allocations
> to success as long as there are enough MOVABLE pageblocks to move pages
> from but eventually it'll stop working. THP-sized allocations would be the
> first to notice. That might not matter on a mobile but it matters elsewhere.
I don't get it. Could you explain why it would stop working? Maybe an
example would help me. :)
Thanks.
On Fri, Jun 26, 2015 at 03:14:39AM +0900, Joonsoo Kim wrote:
> > It could though. Reclaim/compaction is entered for orders higher than
> > PAGE_ALLOC_COSTLY_ORDER and when scan priority is sufficiently high.
> > That could be adjusted if you have a viable case where orders <
> > PAGE_ALLOC_COSTLY_ORDER must succeed and currently requires excessive
> > reclaim instead of relying on compaction.
>
> Yes. I saw this problem in real situation. In ARM, order-2 allocation
> is requested
> in fork(), so it should be succeed. But, there is not enough order-2 freepage,
> so reclaim/compaction begins. Compaction fails repeatedly although
> I didn't check exact reason.
That should be identified and repaired prior to reimplementing
compaction because it's important.
> >> >> 3) Compaction capability is highly depends on migratetype of memory,
> >> >> because freepage scanner doesn't scan unmovable pageblock.
> >> >>
> >> >
> >> > For a very good reason. Unmovable allocation requests that fallback to
> >> > other pageblocks are the worst in terms of fragmentation avoidance. The
> >> > more of these events there are, the more the system will decay. If there
> >> > are many of these events then a compaction benchmark may start with high
> >> > success rates but decay over time.
> >> >
> >> > Very broadly speaking, the more the mm_page_alloc_extfrag tracepoint
> >> > triggers with alloc_migratetype == MIGRATE_UNMOVABLE, the faster the
> >> > system is decaying. Having the freepage scanner select unmovable
> >> > pageblocks will trigger this event more frequently.
> >> >
> >> > The unfortunate impact is that selecting unmovable blocks from the free
> >> > csanner will improve compaction success rates for high-order kernel
> >> > allocations early in the lifetime of the system but later fail high-order
> >> > allocation requests as more pageblocks get converted to unmovable. It
> >> > might be ok for kernel allocations but THP will eventually have a 100%
> >> > failure rate.
> >>
> >> I wrote rationale in the patch itself. We already use non-movable pageblock
> >> for migration scanner. It empties non-movable pageblock so number of
> >> freepage on non-movable pageblock will increase. Using non-movable
> >> pageblock for freepage scanner negates this effect so number of freepage
> >> on non-movable pageblock will be balanced. Could you tell me in detail
> >> how freepage scanner select unmovable pageblocks will cause
> >> more fragmentation? Possibly, I don't understand effect of this patch
> >> correctly and need some investigation. :)
> >>
> >
> > The long-term success rate of fragmentation avoidance depends on
> > minimsing the number of UNMOVABLE allocation requests that use a
> > pageblock belonging to another migratetype. Once such a fallback occurs,
> > that pageblock potentially can never be used for a THP allocation again.
> >
> > Lets say there is an unmovable pageblock with 500 free pages in it. If
> > the freepage scanner uses that pageblock and allocates all 500 free
> > pages then the next unmovable allocation request needs a new pageblock.
> > If one is not completely free then it will fallback to using a
> > RECLAIMABLE or MOVABLE pageblock forever contaminating it.
>
> Yes, I can imagine that situation. But, as I said above, we already use
> non-movable pageblock for migration scanner. While unmovable
> pageblock with 500 free pages fills, some other unmovable pageblock
> with some movable pages will be emptied. Number of freepage
> on non-movable would be maintained so fallback doesn't happen.
>
> Anyway, it is better to investigate this effect. I will do it and attach
> result on next submission.
>
Lets say we have X unmovable pageblocks and Y pageblocks overall. If the
migration scanner takes movable pages from X then there is more space for
unmovable allocations without having to increase X -- this is good. If
the free scanner uses the X pageblocks as targets then they can fill. The
next unmovable allocation then falls back to another pageblock and we
either have X+1 unmovable pageblocks (full steal) or a mixed pageblock
(partial steal) that cannot be used for THP. Do this enough times and
X == Y and all THP allocations fail.
--
Mel Gorman
SUSE Labs
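To make the ratchet Mel describes above concrete, here is a toy
userspace model (my own illustration, not anything from the thread;
the block counts and request size are arbitrary). Each round the free
scanner fills whatever space is free in unmovable pageblocks with
migrated movable pages, so the next unmovable allocation finds no room
there and steals a movable pageblock, and X creeps toward Y.

#include <stdio.h>

#define BLOCKS            100	/* Y: total pageblocks */
#define PAGES_PER_BLOCK   512
#define UNMOVABLE_REQUEST 8	/* unmovable pages requested per round */

int main(void)
{
	int unmovable_blocks = 10;	/* X */
	int free_in_unmovable = 200;	/* free pages spread over X */
	int round;

	for (round = 1; unmovable_blocks < BLOCKS; round++) {
		/* Free scanner uses unmovable blocks as migration targets,
		 * filling all of their free space with movable pages. */
		free_in_unmovable = 0;

		/* The next unmovable allocation finds no room and steals
		 * a movable pageblock (full or partial steal). */
		if (free_in_unmovable < UNMOVABLE_REQUEST) {
			unmovable_blocks++;
			free_in_unmovable = PAGES_PER_BLOCK - UNMOVABLE_REQUEST;
		}

		printf("round %3d: unmovable pageblocks = %d / %d\n",
		       round, unmovable_blocks, BLOCKS);
	}
	return 0;
}

Joonsoo's counter-argument below is essentially that the migration
scanner keeps re-emptying those blocks; this one-sided model
deliberately ignores that effect.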
On 25.6.2015 20:14, Joonsoo Kim wrote:
>> The long-term success rate of fragmentation avoidance depends on
>> > minimsing the number of UNMOVABLE allocation requests that use a
>> > pageblock belonging to another migratetype. Once such a fallback occurs,
>> > that pageblock potentially can never be used for a THP allocation again.
>> >
>> > Lets say there is an unmovable pageblock with 500 free pages in it. If
>> > the freepage scanner uses that pageblock and allocates all 500 free
>> > pages then the next unmovable allocation request needs a new pageblock.
>> > If one is not completely free then it will fallback to using a
>> > RECLAIMABLE or MOVABLE pageblock forever contaminating it.
> Yes, I can imagine that situation. But, as I said above, we already use
> non-movable pageblock for migration scanner. While unmovable
> pageblock with 500 free pages fills, some other unmovable pageblock
> with some movable pages will be emptied. Number of freepage
> on non-movable would be maintained so fallback doesn't happen.
There's nothing that guarantees that the migration scanner will be emptying
unmovable pageblock, or am I missing something? Worse, those pageblocks would be
marked to skip by the free scanner if it isolated free pages from them, so
migration scanner would skip them.
2015-06-26 3:41 GMT+09:00 Mel Gorman <[email protected]>:
> On Fri, Jun 26, 2015 at 03:14:39AM +0900, Joonsoo Kim wrote:
>> > It could though. Reclaim/compaction is entered for orders higher than
>> > PAGE_ALLOC_COSTLY_ORDER and when scan priority is sufficiently high.
>> > That could be adjusted if you have a viable case where orders <
>> > PAGE_ALLOC_COSTLY_ORDER must succeed and currently requires excessive
>> > reclaim instead of relying on compaction.
>>
>> Yes. I saw this problem in real situation. In ARM, order-2 allocation
>> is requested
>> in fork(), so it should be succeed. But, there is not enough order-2 freepage,
>> so reclaim/compaction begins. Compaction fails repeatedly although
>> I didn't check exact reason.
>
> That should be identified and repaired prior to reimplementing
> compaction because it's important.
Unfortunately, I got that report a long time ago and I don't have any
real environment to reproduce it. What I remember is that there were
too many unmovable allocations from a graphics driver and zram, and
they really fragmented memory. At that time, the problem was solved by
an ad-hoc approach such as killing many apps. But that is sub-optimal
and loses a lot of performance, so I imitated this effect in my
benchmark and am trying to solve it with this patchset.
>> >> >> 3) Compaction capability is highly depends on migratetype of memory,
>> >> >> because freepage scanner doesn't scan unmovable pageblock.
>> >> >>
>> >> >
>> >> > For a very good reason. Unmovable allocation requests that fallback to
>> >> > other pageblocks are the worst in terms of fragmentation avoidance. The
>> >> > more of these events there are, the more the system will decay. If there
>> >> > are many of these events then a compaction benchmark may start with high
>> >> > success rates but decay over time.
>> >> >
>> >> > Very broadly speaking, the more the mm_page_alloc_extfrag tracepoint
>> >> > triggers with alloc_migratetype == MIGRATE_UNMOVABLE, the faster the
>> >> > system is decaying. Having the freepage scanner select unmovable
>> >> > pageblocks will trigger this event more frequently.
>> >> >
>> >> > The unfortunate impact is that selecting unmovable blocks from the free
>> >> > csanner will improve compaction success rates for high-order kernel
>> >> > allocations early in the lifetime of the system but later fail high-order
>> >> > allocation requests as more pageblocks get converted to unmovable. It
>> >> > might be ok for kernel allocations but THP will eventually have a 100%
>> >> > failure rate.
>> >>
>> >> I wrote rationale in the patch itself. We already use non-movable pageblock
>> >> for migration scanner. It empties non-movable pageblock so number of
>> >> freepage on non-movable pageblock will increase. Using non-movable
>> >> pageblock for freepage scanner negates this effect so number of freepage
>> >> on non-movable pageblock will be balanced. Could you tell me in detail
>> >> how freepage scanner select unmovable pageblocks will cause
>> >> more fragmentation? Possibly, I don't understand effect of this patch
>> >> correctly and need some investigation. :)
>> >>
>> >
>> > The long-term success rate of fragmentation avoidance depends on
>> > minimsing the number of UNMOVABLE allocation requests that use a
>> > pageblock belonging to another migratetype. Once such a fallback occurs,
>> > that pageblock potentially can never be used for a THP allocation again.
>> >
>> > Lets say there is an unmovable pageblock with 500 free pages in it. If
>> > the freepage scanner uses that pageblock and allocates all 500 free
>> > pages then the next unmovable allocation request needs a new pageblock.
>> > If one is not completely free then it will fallback to using a
>> > RECLAIMABLE or MOVABLE pageblock forever contaminating it.
>>
>> Yes, I can imagine that situation. But, as I said above, we already use
>> non-movable pageblock for migration scanner. While unmovable
>> pageblock with 500 free pages fills, some other unmovable pageblock
>> with some movable pages will be emptied. Number of freepage
>> on non-movable would be maintained so fallback doesn't happen.
>>
>> Anyway, it is better to investigate this effect. I will do it and attach
>> result on next submission.
>>
>
> Lets say we have X unmovable pageblocks and Y pageblocks overall. If the
> migration scanner takes movable pages from X then there is more space for
> unmovable allocations without having to increase X -- this is good. If
> the free scanner uses the X pageblocks as targets then they can fill. The
> next unmovable allocation then falls back to another pageblock and we
> either have X+1 unmovable pageblocks (full steal) or a mixed pageblock
> (partial steal) that cannot be used for THP. Do this enough times and
> X == Y and all THP allocations fail.
This was similar to my understanding, but I reach a different
conclusion.
As the number of unmovable pageblocks, X, that get filled with movable
pages due to this compaction change increases, the pages that can be
reclaimed or migrated out of them also increase. Then further
unmovable allocation requests will use this freed space and eventually
those pageblocks become entirely filled by unmovable allocations.
Therefore, I guess, in the long term the growth of X saturates and
X == Y will not happen.
Thanks.
2015-06-26 3:56 GMT+09:00 Vlastimil Babka <[email protected]>:
> On 25.6.2015 20:14, Joonsoo Kim wrote:
>>> The long-term success rate of fragmentation avoidance depends on
>>> > minimsing the number of UNMOVABLE allocation requests that use a
>>> > pageblock belonging to another migratetype. Once such a fallback occurs,
>>> > that pageblock potentially can never be used for a THP allocation again.
>>> >
>>> > Lets say there is an unmovable pageblock with 500 free pages in it. If
>>> > the freepage scanner uses that pageblock and allocates all 500 free
>>> > pages then the next unmovable allocation request needs a new pageblock.
>>> > If one is not completely free then it will fallback to using a
>>> > RECLAIMABLE or MOVABLE pageblock forever contaminating it.
>> Yes, I can imagine that situation. But, as I said above, we already use
>> non-movable pageblock for migration scanner. While unmovable
>> pageblock with 500 free pages fills, some other unmovable pageblock
>> with some movable pages will be emptied. Number of freepage
>> on non-movable would be maintained so fallback doesn't happen.
>
> There's nothing that guarantees that the migration scanner will be emptying
> unmovable pageblock, or am I missing something?
As I replied to Mel's comment, as the number of unmovable pageblocks
filled with movable pages due to this compaction change increases, the
candidate reclaimable/migratable pages in them also increase. So, at
some point, the amount of memory used by the free scanner and the
amount migrated out by the migration scanner would balance out.
> Worse, those pageblocks would be
> marked to skip by the free scanner if it isolated free pages from them, so
> migration scanner would skip them.
Yes, but the next iteration will move the movable pages out of that
pageblock and the freed pages will be used for further unmovable
allocations. So, in the long term, this doesn't cause much more
fragmentation.
Thanks.
On Fri, Jun 26, 2015 at 11:07:47AM +0900, Joonsoo Kim wrote:
> >> > The long-term success rate of fragmentation avoidance depends on
> >> > minimsing the number of UNMOVABLE allocation requests that use a
> >> > pageblock belonging to another migratetype. Once such a fallback occurs,
> >> > that pageblock potentially can never be used for a THP allocation again.
> >> >
> >> > Lets say there is an unmovable pageblock with 500 free pages in it. If
> >> > the freepage scanner uses that pageblock and allocates all 500 free
> >> > pages then the next unmovable allocation request needs a new pageblock.
> >> > If one is not completely free then it will fallback to using a
> >> > RECLAIMABLE or MOVABLE pageblock forever contaminating it.
> >>
> >> Yes, I can imagine that situation. But, as I said above, we already use
> >> non-movable pageblock for migration scanner. While unmovable
> >> pageblock with 500 free pages fills, some other unmovable pageblock
> >> with some movable pages will be emptied. Number of freepage
> >> on non-movable would be maintained so fallback doesn't happen.
> >>
> >> Anyway, it is better to investigate this effect. I will do it and attach
> >> result on next submission.
> >>
> >
> > Lets say we have X unmovable pageblocks and Y pageblocks overall. If the
> > migration scanner takes movable pages from X then there is more space for
> > unmovable allocations without having to increase X -- this is good. If
> > the free scanner uses the X pageblocks as targets then they can fill. The
> > next unmovable allocation then falls back to another pageblock and we
> > either have X+1 unmovable pageblocks (full steal) or a mixed pageblock
> > (partial steal) that cannot be used for THP. Do this enough times and
> > X == Y and all THP allocations fail.
>
> This was similar with my understanding but different conclusion.
>
> As number of unmovable pageblocks, X, which is filled by movable pages
> due to this compaction change increases, reclaimed/migrated out pages
> from them also increase.
There is no guarantee of that; it's timing sensitive, and the kernel
spends more time copying data in and out of the same pageblocks, which
is wasteful.
> And, then, further unmovable allocation request
> will use this free space and eventually these pageblocks are totally filled
> by unmovable allocation. Therefore, I guess, in the long-term, increasing X
> is saturated and X == Y will not happen.
>
The whole reason we avoid migrating to unmovable blocks is because it
did happen and quite quickly. Do not use unmovable blocks as migration
targets. If high-order kernel allocations are required then some reclaim
is necessary for compaction to work with.
--
Mel Gorman
SUSE Labs
On 06/26/2015 04:14 AM, Joonsoo Kim wrote:
> 2015-06-26 3:56 GMT+09:00 Vlastimil Babka <[email protected]>:
>>> on non-movable would be maintained so fallback doesn't happen.
>>
>> There's nothing that guarantees that the migration scanner will be emptying
>> unmovable pageblock, or am I missing something?
>
> As replied to Mel's comment, as number of unmovable pageblocks, which is
> filled by movable pages due to this compaction change increases,
> possible candidate reclaimable/migratable pages from them also increase.
> So, at some time, amount of used page by free scanner and amount of
> migrated page by migration scanner would be balanced.
>
>> Worse, those pageblocks would be
>> marked to skip by the free scanner if it isolated free pages from them, so
>> migration scanner would skip them.
>
> Yes, but, next iteration will move out movable pages from that pageblock
> and freed pages will be used for further unmovable allocation.
> So, in the long term, this doesn't make much more fragmentation.
Theoretically, maybe. I guess there's not much point discussing it
further, until there's data from experiments evaluating the long-term
fragmentation (think of e.g. the number of mixed pageblocks you already
checked in different experiments).
> Thanks.
>
On Fri, Jun 26, 2015 at 11:22:41AM +0100, Mel Gorman wrote:
> On Fri, Jun 26, 2015 at 11:07:47AM +0900, Joonsoo Kim wrote:
> > >> > The long-term success rate of fragmentation avoidance depends on
> > >> > minimsing the number of UNMOVABLE allocation requests that use a
> > >> > pageblock belonging to another migratetype. Once such a fallback occurs,
> > >> > that pageblock potentially can never be used for a THP allocation again.
> > >> >
> > >> > Lets say there is an unmovable pageblock with 500 free pages in it. If
> > >> > the freepage scanner uses that pageblock and allocates all 500 free
> > >> > pages then the next unmovable allocation request needs a new pageblock.
> > >> > If one is not completely free then it will fallback to using a
> > >> > RECLAIMABLE or MOVABLE pageblock forever contaminating it.
> > >>
> > >> Yes, I can imagine that situation. But, as I said above, we already use
> > >> non-movable pageblock for migration scanner. While unmovable
> > >> pageblock with 500 free pages fills, some other unmovable pageblock
> > >> with some movable pages will be emptied. Number of freepage
> > >> on non-movable would be maintained so fallback doesn't happen.
> > >>
> > >> Anyway, it is better to investigate this effect. I will do it and attach
> > >> result on next submission.
> > >>
> > >
> > > Lets say we have X unmovable pageblocks and Y pageblocks overall. If the
> > > migration scanner takes movable pages from X then there is more space for
> > > unmovable allocations without having to increase X -- this is good. If
> > > the free scanner uses the X pageblocks as targets then they can fill. The
> > > next unmovable allocation then falls back to another pageblock and we
> > > either have X+1 unmovable pageblocks (full steal) or a mixed pageblock
> > > (partial steal) that cannot be used for THP. Do this enough times and
> > > X == Y and all THP allocations fail.
> >
> > This was similar with my understanding but different conclusion.
> >
> > As number of unmovable pageblocks, X, which is filled by movable pages
> > due to this compaction change increases, reclaimed/migrated out pages
> > from them also increase.
>
> There is no guarantee of that, it's timing sensitive and the kernel sepends
> more time copying data in/out of the same pageblocks which is wasteful.
>
> > And, then, further unmovable allocation request
> > will use this free space and eventually these pageblocks are totally filled
> > by unmovable allocation. Therefore, I guess, in the long-term, increasing X
> > is saturated and X == Y will not happen.
> >
>
> The whole reason we avoid migrating to unmovable blocks is because it
> did happen and quite quickly. Do not use unmovable blocks as migration
> targets. If high-order kernel allocations are required then some reclaim
> is necessary for compaction to work with.
Hello, Mel and Vlastimil.
Sorry for the late response. I needed some time to get the numbers,
and it took so long due to bugs in page owner. Before discussing this
patchset, I should mention that the result of my previous patchset on
active fragmentation avoidance that you reviewed is wrong. The
incorrect result was caused by a page owner bug, and the corrected
result shows only a slight improvement rather than a dramatic one.
https://lkml.org/lkml/2015/4/27/92
Back to our discussion: indeed, you are right. As you expected,
fragmentation increases due to this patch. It's not much by itself,
but adding the other changes of this patchset accelerates
fragmentation further, so it's not tolerable in the end.
Below is the number of *non-mixed* pageblocks measured by page owner
after running a modified stress-highalloc test that repeats the test
3 times without rebooting, as Vlastimil did.
pb[n] means the measurement taken after the n-th run of the
stress-highalloc test without rebooting. The numbers are averaged over
3 runs.
base nonmovable redesign revert-nonmovable
pb[1]:DMA32:movable: 1359 1333 1303 1380
pb[1]:Normal:movable: 368 341 356 364
pb[2]:DMA32:movable: 1306 1277 1216 1322
pb[2]:Normal:movable: 359 345 325 349
pb[3]:DMA32:movable: 1265 1240 1179 1276
pb[3]:Normal:movable: 330 330 312 332
Allowing scanning of non-movable pageblocks increases fragmentation,
so the number of non-mixed pageblocks is reduced by roughly 2~3%. The
whole patchset bumps this reduction up to roughly 6%. But with the
non-movable patch reverted, it is restored and looks better than
before.
Nevertheless, I'd still like to change the freepage scanner's
behaviour because there are systems where most pageblocks are
unmovable. On this kind of system, without this change, compaction
does not work well, as my build-frag-unmovable experiment showed, and
essential high-order allocations fail.
I have no idea how to overcome this situation without this kind of
change. If you have such an idea, please let me know.
Here is a similar idea that handles this situation without causing
more fragmentation (a rough sketch follows the list below). The
changes are as follows:
1. The freepage scanner scans only movable pageblocks.
2. If the freepage scanner doesn't find any freepage in movable
pageblocks and the whole zone range has been scanned, the freepage
scanner starts to scan non-movable pageblocks.
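A minimal sketch of that two-pass policy, assuming a helper and a flag
of my own invention rather than anything from a posted patch:

/*
 * Sketch only: during the first pass the free scanner accepts movable
 * pageblocks only; after a full pass over the zone that isolated
 * nothing, the caller sets movable_pass_exhausted and non-movable
 * pageblocks become acceptable targets as well.
 */
static bool free_scanner_block_suitable(struct page *block_first_page,
					bool movable_pass_exhausted)
{
	if (get_pageblock_migratetype(block_first_page) == MIGRATE_MOVABLE)
		return true;

	return movable_pass_exhausted;
}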
Here is the result.
new-idea
pb[1]:DMA32:movable: 1371
pb[1]:Normal:movable: 384
pb[2]:DMA32:movable: 1322
pb[2]:Normal:movable: 372
pb[3]:DMA32:movable: 1273
pb[3]:Normal:movable: 358
The result is better than the revert-nonmovable case. Although I
didn't attach the full results, this one is also better than the
revert case in terms of success rate.
Before starting to optimize this idea, I'd like to hear your opinion
about this change.
I think this change is essential because failing high-order
allocations up to PAGE_ALLOC_COSTLY_ORDER is a functional failure and
MM should guarantee their success. Since lumpy reclaim was removed,
this kind of allocation unavoidably relies on compaction doing its
work. We can't prevent movable pageblocks from being turned into
unmovable pageblocks because that is highly workload dependent.
Thanks.
On 07/08/2015 10:24 AM, Joonsoo Kim wrote:
> On Fri, Jun 26, 2015 at 11:22:41AM +0100, Mel Gorman wrote:
>> On Fri, Jun 26, 2015 at 11:07:47AM +0900, Joonsoo Kim wrote:
>>
>> The whole reason we avoid migrating to unmovable blocks is because it
>> did happen and quite quickly. Do not use unmovable blocks as migration
>> targets. If high-order kernel allocations are required then some reclaim
>> is necessary for compaction to work with.
>
> Hello, Mel and Vlastimil.
>
> Sorry for late response. I need some time to get the number and it takes
> so long due to bugs on page owner. Before mentioning about this patchset,
> I should mention that result of my previous patchset about active
> fragmentation avoidance that you have reviewed is wrong. Incorrect result
> is caused by page owner bug and correct result shows just slight
> improvement rather than dramatical improvment.
>
> https://lkml.org/lkml/2015/4/27/92
Doh, glad you found the bug.
BTW I still think patch 1 of that series would make sense and it's a
code cleanup too. Patch 2 would depend on the corrected measurements.
Patch 3 also, and the active anti-fragmentation work could be done by
kcompactd if the idea of that thread floats.
> Back to our discussion, indeed, you are right. As you expected,
> fragmentation increases due to this patch. It's not much but adding
> other changes of this patchset accelerates fragmentation more so
> it's not tolerable in the end.
>
> Below is number of *non-mixed* pageblock measured by page owner
> after running modified stress-highalloc test that repeats test 3 times
> without rebooting like as Vlastimil did.
>
> pb[n] means that it is measured after n times runs of stress-highalloc
> test without rebooting. They are averaged by 3 runs.
>
> base nonmovable redesign revert-nonmovable
> pb[1]:DMA32:movable: 1359 1333 1303 1380
> pb[1]:Normal:movable: 368 341 356 364
>
> pb[2]:DMA32:movable: 1306 1277 1216 1322
> pb[2]:Normal:movable: 359 345 325 349
>
> pb[3]:DMA32:movable: 1265 1240 1179 1276
> pb[3]:Normal:movable: 330 330 312 332
>
> Allowing scanning on nonmovable pageblock increases fragmentation so
> non-mixed pageblock is reduced by rougly 2~3%. Whole of this patchset
> bumps this reduction up to roughly 6%. But, with reverting nonmovable
> patch, it get restored and looks better than before.
Hm that's somewhat strange. Why is it only the *combination* of
"nonmovable" and "redesign" that makes it so bad?
> Nevertheless, still, I'd like to change freepage scanner's behaviour
> because there are systems that most of pageblocks are unmovable pageblock.
> In this kind of system, without this change, compaction would not
> work well as my experiment, build-frag-unmovable, showed, and essential
> high-order allocation fails.
>
> I have no idea how to overcome this situation without this kind of change.
> If you have such a idea, please let me know.
Hm it's a tough one :/
> Here is similar idea to handle this situation without causing more
> fragmentation. Changes as following:
>
> 1. Freepage scanner just scan only movable pageblocks.
> 2. If freepage scanner doesn't find any freepage on movable pageblocks
> and whole zone range is scanned, freepage scanner start to scan on
> non-movable pageblocks.
>
> Here is the result.
> new-idea
> pb[1]:DMA32:movable: 1371
> pb[1]:Normal:movable: 384
>
> pb[2]:DMA32:movable: 1322
> pb[2]:Normal:movable: 372
>
> pb[3]:DMA32:movable: 1273
> pb[3]:Normal:movable: 358
>
> Result is better than revert-nonmovable case. Although I didn't attach
> the whole result, this one is better than revert one in term of success
> rate.
>
> Before starting to optimize this idea, I'd like to hear your opinion
> about this change.
Well, it might be better than nothing. One optimization could be
remembering, from the first pass, which pageblock was the emptiest.
But that would make the first pass more involved, so I'm not sure.
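Something along these lines, as a sketch of that suggestion (structure
and names are mine): while the movable-only pass runs, remember the
pageblock with the most free pages so a fallback pass could start
there instead of rescanning the whole zone.

/* Sketch only: track the emptiest pageblock seen during the first pass. */
struct free_scan_hint {
	unsigned long best_block_pfn;	/* start pfn of emptiest block seen */
	unsigned long best_free_pages;	/* free pages counted in that block */
};

static void note_block_emptiness(struct free_scan_hint *hint,
				 unsigned long block_start_pfn,
				 unsigned long nr_free_in_block)
{
	if (nr_free_in_block > hint->best_free_pages) {
		hint->best_free_pages = nr_free_in_block;
		hint->best_block_pfn = block_start_pfn;
	}
}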
> I think this change is essential because fail on high-order allocation
> up to PAGE_COSTLY_ORDER is functional failure and MM should guarantee
> it's success. After lumpy recliam is removed, this kind of allocation
> unavoidably rely on work of compaction. We can't prevent that movable
> pageblocks are turned into unmovable pageblock because it is highly
> workload dependant.
>
> Thanks.
>
On Tue, Jul 21, 2015 at 11:27:54AM +0200, Vlastimil Babka wrote:
> On 07/08/2015 10:24 AM, Joonsoo Kim wrote:
> >On Fri, Jun 26, 2015 at 11:22:41AM +0100, Mel Gorman wrote:
> >>On Fri, Jun 26, 2015 at 11:07:47AM +0900, Joonsoo Kim wrote:
> >>
> >>The whole reason we avoid migrating to unmovable blocks is because it
> >>did happen and quite quickly. Do not use unmovable blocks as migration
> >>targets. If high-order kernel allocations are required then some reclaim
> >>is necessary for compaction to work with.
> >
> >Hello, Mel and Vlastimil.
> >
> >Sorry for late response. I need some time to get the number and it takes
> >so long due to bugs on page owner. Before mentioning about this patchset,
> >I should mention that result of my previous patchset about active
> >fragmentation avoidance that you have reviewed is wrong. Incorrect result
> >is caused by page owner bug and correct result shows just slight
> >improvement rather than dramatical improvment.
> >
> >https://lkml.org/lkml/2015/4/27/92
>
> Doh, glad you found the bug.
> BTW I still think patch 1 of that series would make sense and it's a
> code cleanup too. Patch 2 would depend on the corrected
> measurements. Patch 3 also, and the active anti-fragmentation work
> could be done by kcompactd if the idea of that thread floats.
Yes, I haven't given up on those patches. :)
>
> >Back to our discussion, indeed, you are right. As you expected,
> >fragmentation increases due to this patch. It's not much but adding
> >other changes of this patchset accelerates fragmentation more so
> >it's not tolerable in the end.
> >
> >Below is number of *non-mixed* pageblock measured by page owner
> >after running modified stress-highalloc test that repeats test 3 times
> >without rebooting like as Vlastimil did.
> >
> >pb[n] means that it is measured after n times runs of stress-highalloc
> >test without rebooting. They are averaged by 3 runs.
> >
> > base nonmovable redesign revert-nonmovable
> >pb[1]:DMA32:movable: 1359 1333 1303 1380
> >pb[1]:Normal:movable: 368 341 356 364
> >
> >pb[2]:DMA32:movable: 1306 1277 1216 1322
> >pb[2]:Normal:movable: 359 345 325 349
> >
> >pb[3]:DMA32:movable: 1265 1240 1179 1276
> >pb[3]:Normal:movable: 330 330 312 332
> >
> >Allowing scanning on nonmovable pageblock increases fragmentation so
> >non-mixed pageblock is reduced by rougly 2~3%. Whole of this patchset
> >bumps this reduction up to roughly 6%. But, with reverting nonmovable
> >patch, it get restored and looks better than before.
>
> Hm that's somewhat strange. Why is it only the *combination* of
> "nonmovable" and "redesign" that makes it so bad?
I guess that the freepage scanner scans a limited zone range in the
nonmovable-only case, so the bad effect is also limited.
> >Nevertheless, still, I'd like to change freepage scanner's behaviour
> >because there are systems that most of pageblocks are unmovable pageblock.
> >In this kind of system, without this change, compaction would not
> >work well as my experiment, build-frag-unmovable, showed, and essential
> >high-order allocation fails.
> >
> >I have no idea how to overcome this situation without this kind of change.
> >If you have such a idea, please let me know.
>
> Hm it's a tough one :/
>
> >Here is similar idea to handle this situation without causing more
> >fragmentation. Changes as following:
> >
> >1. Freepage scanner just scan only movable pageblocks.
> >2. If freepage scanner doesn't find any freepage on movable pageblocks
> >and whole zone range is scanned, freepage scanner start to scan on
> >non-movable pageblocks.
> >
> >Here is the result.
> > new-idea
> >pb[1]:DMA32:movable: 1371
> >pb[1]:Normal:movable: 384
> >
> >pb[2]:DMA32:movable: 1322
> >pb[2]:Normal:movable: 372
> >
> >pb[3]:DMA32:movable: 1273
> >pb[3]:Normal:movable: 358
> >
> >Result is better than revert-nonmovable case. Although I didn't attach
> >the whole result, this one is better than revert one in term of success
> >rate.
> >
> >Before starting to optimize this idea, I'd like to hear your opinion
> >about this change.
>
> Well, it might be better than nothing. Optimization could be
> remembering from the first pass which pageblock was the emptiest?
> But that would make the first pass more involved, so I'm not sure.
Right now I don't have an idea for it. I need to think about it more.
Thanks.