Major changes from v2:
o Split the patchset into two parts: one for replacing compaction
deferring with a compaction limit, and the other for changing
scanner activity
o Add some fixes and cleanup for current defer logic
o Fix opposite direction problem in "skip useless pfn when..." patch
o Reuse current defer logic for compaction limit
o Provide proper argument when calling __reset_isolation_suitable()
o Prevent async compaction while compaction limit is activated
The previous cover letter isn't appropriate for this part 1 patchset,
so I just append a link to it.
https://lkml.org/lkml/2015/8/23/182
New description:
Compaction deferring effectively reduces compaction overhead when
compaction success isn't expected, but it is implemented by skipping
a number of compaction requests until compaction is re-enabled. Due to
this implementation, one unfortunate compaction requestor pays the whole
compaction overhead while the others pay nothing. And, once deferring
starts to work, we keep skipping compaction for some number of requests
even if the possibility of compaction success has been restored.

This patchset tries to solve the above problem by using a compaction
limit. Instead of imposing the whole compaction overhead on one
unfortunate requestor, the compaction limit distributes the overhead
across all compaction requestors. Every requestor gets a chance to
migrate some amount of pages, and compaction is stopped once the limit
is exhausted. This distributes the overhead fairly. And, because
compaction requests are no longer deferred, someone will succeed in
compacting as soon as the possibility of compaction success is restored.
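To make this concrete, here is a minimal userspace sketch (not kernel
code) contrasting the two schemes. The request count, full-scan cost and
per-request budget are made-up numbers chosen only to show how the same
total overhead is distributed:

#include <stdio.h>

#define REQUESTS    64
#define FULL_SCAN   64000	/* assumed cost of one full, failing scan */
#define SCAN_LIMIT  1000	/* assumed per-request migration scan budget */

int main(void)
{
	int considered = 0, defer_limit = 64;

	puts("request  defer_cost  limit_cost");
	for (int i = 0; i < REQUESTS; i++) {
		long defer_cost = 0, limit_cost = SCAN_LIMIT;

		/*
		 * Deferring: skip requests until the counter expires, then
		 * one unlucky requestor pays for the whole failing scan.
		 */
		if (++considered >= defer_limit) {
			defer_cost = FULL_SCAN;
			considered = 0;
		}

		/* Limit: every requestor scans up to its budget and stops. */
		printf("%7d  %10ld  %10ld\n", i, defer_cost, limit_cost);
	}
	return 0;
}

With these numbers the total work is the same in both columns; only its
distribution differs, which is what the benchmark below shows.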
I tested this patchset on my compaction benchmark and found that
high-order allocation latency is evenly distributed and there are no
latency spikes in situations where compaction success isn't possible.

The following is the latency (ns) of each high-order allocation attempt;
each row shows the attempt number, result and latency for Base (left)
and Limit (right).

Base vs Limit
9807 failure 825 9807 failure 10839
9808 failure 820 9808 failure 9762
9809 failure 827 9809 failure 8585
9810 failure 3751 9810 failure 14052
9811 failure 881 9811 failure 10781
9812 failure 827 9812 failure 9906
9813 failure 2447430 9813 failure 8925
9814 failure 8632 9814 failure 9185
9815 failure 1172 9815 failure 9076
9816 failure 1045 9816 failure 10860
9817 failure 1044 9817 failure 10571
9818 failure 1043 9818 failure 8789
9819 failure 979 9819 failure 9086
9820 failure 4338 9820 failure 43681
9821 failure 1001 9821 failure 9361
9822 failure 875 9822 failure 15175
9823 failure 822 9823 failure 9394
9824 failure 827 9824 failure 334341
9825 failure 829 9825 failure 15404
9826 failure 823 9826 failure 10419
9827 failure 824 9827 failure 11375
9828 failure 827 9828 failure 9416
9829 failure 822 9829 failure 9303
9830 failure 3646 9830 failure 18514
9831 failure 869 9831 failure 11064
9832 failure 820 9832 failure 9626
9833 failure 832 9833 failure 8794
9834 failure 820 9834 failure 10576
9835 failure 2450955 9835 failure 12260
9836 failure 9428 9836 failure 9049
9837 failure 1067 9837 failure 10346
9838 failure 968 9838 failure 8793
9839 failure 984 9839 failure 8932
9840 failure 4262 9840 failure 18436
9841 failure 964 9841 failure 11429
9842 failure 937 9842 failure 9433
9843 failure 828 9843 failure 8838
9844 failure 827 9844 failure 8948
9845 failure 822 9845 failure 13017
9846 failure 827 9846 failure 10795
As you can see, Base has periodic latency spikes, but with Limit the
latency is distributed evenly.

This patchset is based on linux-next-20151106 +
"restore COMPACT_CLUSTER_MAX to 32" + "__compact_pgdat() code cleanuup",
which I sent today.
Thanks.
Joonsoo Kim (7):
mm/compaction: skip useless pfn when updating cached pfn
mm/compaction: remove unused defer_compaction() in compaction.h
mm/compaction: initialize compact_order_failed to MAX_ORDER
mm/compaction: update defer counter when allocation is expected to
succeed
mm/compaction: respect compaction order when updating defer counter
mm/compaction: introduce migration scan limit
mm/compaction: replace compaction deferring with compaction limit
include/linux/compaction.h | 3 -
include/linux/mmzone.h | 6 +-
include/trace/events/compaction.h | 7 +-
mm/compaction.c | 188 ++++++++++++++++++++++++--------------
mm/internal.h | 1 +
mm/page_alloc.c | 4 +-
6 files changed, 125 insertions(+), 84 deletions(-)
--
1.9.1
The cached pfn is used to determine the scanner's start position at the
next compaction run. Currently, the cached pfn points to the skipped
pageblock, so we uselessly check whether the pageblock is valid for
compaction and whether its skip bit is set. If we set the scanner's
cached pfn to the pfn just past the skipped pageblock, we don't need to
do this check.

This patch moves update_pageblock_skip() to
isolate_(freepages|migratepages). Updating pageblock skip information
isn't relevant to CMA, so these are more appropriate places to update
this information.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 37 ++++++++++++++++++++-----------------
1 file changed, 20 insertions(+), 17 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 01b1e5e..564047c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -256,10 +256,9 @@ void reset_isolation_suitable(pg_data_t *pgdat)
*/
static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
- bool migrate_scanner)
+ unsigned long pfn, bool migrate_scanner)
{
struct zone *zone = cc->zone;
- unsigned long pfn;
if (cc->ignore_skip_hint)
return;
@@ -272,8 +271,6 @@ static void update_pageblock_skip(struct compact_control *cc,
set_pageblock_skip(page);
- pfn = page_to_pfn(page);
-
/* Update where async and sync compaction should restart */
if (migrate_scanner) {
if (pfn > zone->compact_cached_migrate_pfn[0])
@@ -295,7 +292,7 @@ static inline bool isolation_suitable(struct compact_control *cc,
static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
- bool migrate_scanner)
+ unsigned long pfn, bool migrate_scanner)
{
}
#endif /* CONFIG_COMPACTION */
@@ -527,10 +524,6 @@ isolate_fail:
if (locked)
spin_unlock_irqrestore(&cc->zone->lock, flags);
- /* Update the pageblock-skip if the whole pageblock was scanned */
- if (blockpfn == end_pfn)
- update_pageblock_skip(cc, valid_page, total_isolated, false);
-
count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
if (total_isolated)
count_compact_events(COMPACTISOLATED, total_isolated);
@@ -832,13 +825,6 @@ isolate_success:
if (locked)
spin_unlock_irqrestore(&zone->lru_lock, flags);
- /*
- * Update the pageblock-skip information and cached scanner pfn,
- * if the whole pageblock was scanned without isolating any page.
- */
- if (low_pfn == end_pfn)
- update_pageblock_skip(cc, valid_page, nr_isolated, true);
-
trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
nr_scanned, nr_isolated);
@@ -947,6 +933,7 @@ static void isolate_freepages(struct compact_control *cc)
unsigned long block_end_pfn; /* end of current pageblock */
unsigned long low_pfn; /* lowest pfn scanner is able to scan */
struct list_head *freelist = &cc->freepages;
+ unsigned long nr_isolated;
/*
* Initialise the free scanner. The starting point is where we last
@@ -998,10 +985,18 @@ static void isolate_freepages(struct compact_control *cc)
continue;
/* Found a block suitable for isolating free pages from. */
- isolate_freepages_block(cc, &isolate_start_pfn,
+ nr_isolated = isolate_freepages_block(cc, &isolate_start_pfn,
block_end_pfn, freelist, false);
/*
+ * Update the pageblock-skip if the whole pageblock
+ * was scanned
+ */
+ if (isolate_start_pfn == block_end_pfn)
+ update_pageblock_skip(cc, page, nr_isolated,
+ block_start_pfn - pageblock_nr_pages, false);
+
+ /*
* If we isolated enough freepages, or aborted due to async
* compaction being contended, terminate the loop.
* Remember where the free scanner should restart next time,
@@ -1172,6 +1167,14 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
cc->last_migrated_pfn = isolate_start_pfn;
/*
+ * Update the pageblock-skip if the whole pageblock
+ * was scanned without isolating any page.
+ */
+ if (low_pfn == end_pfn)
+ update_pageblock_skip(cc, page, cc->nr_migratepages,
+ end_pfn, true);
+
+ /*
* Either we isolated something and proceed with migration. Or
* we failed and compact_zone should decide if we should
* continue or not.
--
1.9.1
defer_compaction() isn't used externally, so remove its declaration from
compaction.h and make it static.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/compaction.h | 1 -
mm/compaction.c | 2 +-
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4cd4ddf..359b07a 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -46,7 +46,6 @@ extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx);
-extern void defer_compaction(struct zone *zone, int order);
extern bool compaction_deferred(struct zone *zone, int order);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
diff --git a/mm/compaction.c b/mm/compaction.c
index 564047c..f144494 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -124,7 +124,7 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
* allocation success. 1 << compact_defer_limit compactions are skipped up
* to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
*/
-void defer_compaction(struct zone *zone, int order)
+static void defer_compaction(struct zone *zone, int order)
{
zone->compact_considered = 0;
zone->compact_defer_shift++;
--
1.9.1
If compact_order_failed is initialized to 0 and order-9 compaction
continually fails, the defer counters are updated to activate deferring.
Although the other defer counters are properly updated,
compact_order_failed is not, because the failed order can never be lower
than compact_order_failed, which is 0. In this case, low-order
compaction, such as order 2 or 3, could be deferred due to this wrongly
initialized compact_order_failed value. This patch removes this
possibility by initializing it to MAX_ORDER.
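A small userspace sketch (not kernel code) of the scenario above; the
condition mirrors the "order < compact_order_failed" check in
compaction_deferred(), and MAX_ORDER == 11 is only an assumed default
value for illustration:

#include <stdio.h>
#include <stdbool.h>

static bool deferred(int order, int compact_order_failed, bool defer_active)
{
	if (order < compact_order_failed)
		return false;	/* orders below the failed one are never deferred */
	return defer_active;
}

int main(void)
{
	/*
	 * compact_order_failed == 0: order-9 failures activate deferring,
	 * and unrelated order-2 requests get deferred as well.
	 */
	printf("init to 0:         order-2 deferred? %d\n",
	       deferred(2, 0, true));
	/* compact_order_failed == MAX_ORDER: low orders keep their chance. */
	printf("init to MAX_ORDER: order-2 deferred? %d\n",
	       deferred(2, 11, true));
	return 0;
}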
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0499ff..7002c66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5273,6 +5273,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;
zone_pcp_init(zone);
+#ifdef CONFIG_COMPACTION
+ zone->compact_order_failed = MAX_ORDER;
+#endif
/* For bootup, initialized properly in watermark setup */
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
--
1.9.1
It's rather strange that compact_considered and compact_defer_shift
aren't updated, but compact_order_failed is, when an allocation is
expected to succeed. Regardless of actual allocation success, deferring
for the current order will be disabled, so it doesn't make much
difference to compaction behaviour.

Moreover, in the past, there was a gap between the expectation of
allocation success in compaction and actual success in the page
allocator. Now, this gap is diminished because classzone_idx and
alloc_flags are provided to the watermark check in compaction and the
watermark check criteria for high-order allocations have changed.
Therefore, it's not a big problem to update the defer counters when an
allocation is expected to succeed. This change helps simplify the defer
logic.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/compaction.h | 2 --
mm/compaction.c | 27 ++++++++-------------------
mm/page_alloc.c | 1 -
3 files changed, 8 insertions(+), 22 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 359b07a..4761611 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -47,8 +47,6 @@ extern unsigned long compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx);
extern bool compaction_deferred(struct zone *zone, int order);
-extern void compaction_defer_reset(struct zone *zone, int order,
- bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);
#else
diff --git a/mm/compaction.c b/mm/compaction.c
index f144494..67b8d90 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -158,18 +158,12 @@ bool compaction_deferred(struct zone *zone, int order)
return true;
}
-/*
- * Update defer tracking counters after successful compaction of given order,
- * which means an allocation either succeeded (alloc_success == true) or is
- * expected to succeed.
- */
-void compaction_defer_reset(struct zone *zone, int order,
- bool alloc_success)
+/* Update defer tracking counters after successful compaction of given order */
+static void compaction_defer_reset(struct zone *zone, int order)
{
- if (alloc_success) {
- zone->compact_considered = 0;
- zone->compact_defer_shift = 0;
- }
+ zone->compact_considered = 0;
+ zone->compact_defer_shift = 0;
+
if (order >= zone->compact_order_failed)
zone->compact_order_failed = order + 1;
@@ -1568,13 +1562,8 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
/* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
ac->classzone_idx, alloc_flags)) {
- /*
- * We think the allocation will succeed in this zone,
- * but it is not certain, hence the false. The caller
- * will repeat this with true if allocation indeed
- * succeeds in this zone.
- */
- compaction_defer_reset(zone, order, false);
+ compaction_defer_reset(zone, order);
+
/*
* It is possible that async compaction aborted due to
* need_resched() and the watermarks were ok thanks to
@@ -1669,7 +1658,7 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
if (zone_watermark_ok(zone, cc->order,
low_wmark_pages(zone), 0, 0))
- compaction_defer_reset(zone, cc->order, false);
+ compaction_defer_reset(zone, cc->order);
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7002c66..f3605fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2815,7 +2815,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zone *zone = page_zone(page);
zone->compact_blockskip_flush = false;
- compaction_defer_reset(zone, order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}
--
1.9.1
It doesn't make sense to reset the defer counters in
compaction_defer_reset() when a compaction request below
compact_order_failed succeeds. Fix it.

And, it does make sense to give the newly updated failed order enough
chances before deferring kicks in, so when the failing order is lower
than compact_order_failed, reset compact_defer_shift instead of
increasing it. Change it.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 67b8d90..1a75a6e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -126,11 +126,14 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
*/
static void defer_compaction(struct zone *zone, int order)
{
- zone->compact_considered = 0;
- zone->compact_defer_shift++;
-
- if (order < zone->compact_order_failed)
+ if (order < zone->compact_order_failed) {
+ zone->compact_considered = 0;
+ zone->compact_defer_shift = 0;
zone->compact_order_failed = order;
+ } else {
+ zone->compact_considered = 0;
+ zone->compact_defer_shift++;
+ }
if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
@@ -161,11 +164,11 @@ bool compaction_deferred(struct zone *zone, int order)
/* Update defer tracking counters after successful compaction of given order */
static void compaction_defer_reset(struct zone *zone, int order)
{
- zone->compact_considered = 0;
- zone->compact_defer_shift = 0;
-
- if (order >= zone->compact_order_failed)
+ if (order >= zone->compact_order_failed) {
+ zone->compact_considered = 0;
+ zone->compact_defer_shift = 0;
zone->compact_order_failed = order + 1;
+ }
trace_mm_compaction_defer_reset(zone, order);
}
--
1.9.1
This is a preparation step for replacing compaction deferring with a
compaction limit. The full reasoning for the replacement is given in the
following patch.

In this patch, migration_scan_limit is assigned and accounted, but not
yet checked to finish compaction, so there is no functional change.

Currently, the amount of migration_scan_limit is chosen to imitate the
compaction deferring logic. We can tune it easily if the overhead looks
insane, but that is left for further work. Also, migration_scan_limit is
scaled by compact_defer_shift: more failures increase
compact_defer_shift, which limits compaction further.

There are two interesting changes. One is that the cached pfn is always
updated while the limit is active; otherwise, we would scan the same
range over and over. The second is that async compaction is skipped
while the limit is active, for algorithm correctness. Until now, even in
the failure case, sync compaction continued working until both scanners
met, so COMPACT_COMPLETE usually happened in sync compaction. With the
limit applied, sync compaction finishes when the limit is exhausted, so
COMPACT_COMPLETE would usually happen in async compaction. Because we
don't treat async COMPACT_COMPLETE as an actual failure (we reset the
cached scanner pfns instead of deferring), the defer mechanism wouldn't
work well. And async compaction would not easily succeed in this case,
so skipping it doesn't make much difference.
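For reference, here is a userspace sketch (not kernel code) of the limit
arithmetic described above; the zone size (8 GiB of 4 KiB pages) and the
pageblock_nr_pages value (512) are assumptions made only for the
example:

#include <stdio.h>

int main(void)
{
	long managed_pages = 2097152;	/* assumed 8 GiB zone of 4 KiB pages */
	long min_limit = 512;		/* assumed pageblock_nr_pages */

	for (int defer_shift = 1; defer_shift <= 6; defer_shift++) {
		long limit = managed_pages >> 2; /* scanner covers ~1/4 of pages */

		limit >>= 6;			/* imitate 1-in-64 deferred restart */
		limit >>= defer_shift;		/* scale down with repeated failures */
		if (limit < min_limit)
			limit = min_limit;
		printf("defer_shift=%d -> migration_scan_limit=%ld pages\n",
		       defer_shift, limit);
	}
	return 0;
}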
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/compaction.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++--------
mm/internal.h | 1 +
2 files changed, 78 insertions(+), 11 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 1a75a6e..b23f6d9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -116,6 +116,67 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
#ifdef CONFIG_COMPACTION
+/*
+ * order == -1 is expected when compacting via
+ * /proc/sys/vm/compact_memory
+ */
+static inline bool is_via_compact_memory(int order)
+{
+ return order == -1;
+}
+
+#define COMPACT_MIN_SCAN_LIMIT (pageblock_nr_pages)
+
+static bool excess_migration_scan_limit(struct compact_control *cc)
+{
+ /* Disable scan limit for now */
+ return false;
+}
+
+static void set_migration_scan_limit(struct compact_control *cc)
+{
+ struct zone *zone = cc->zone;
+ int order = cc->order;
+ unsigned long limit = zone->managed_pages;
+
+ cc->migration_scan_limit = LONG_MAX;
+ if (is_via_compact_memory(order))
+ return;
+
+ if (order < zone->compact_order_failed)
+ return;
+
+ if (!zone->compact_defer_shift)
+ return;
+
+ /*
+ * Do not allow async compaction during limit work. In this case,
+ * async compaction would not be easy to succeed and we need to
+ * ensure that COMPACT_COMPLETE occurs by sync compaction for
+ * algorithm correctness and prevention of async compaction will
+ * lead it.
+ */
+ if (cc->mode == MIGRATE_ASYNC) {
+ cc->migration_scan_limit = -1;
+ return;
+ }
+
+ /* Migration scanner usually scans less than 1/4 pages */
+ limit >>= 2;
+
+ /*
+ * Deferred compaction restart compaction every 64 compaction
+ * attempts and it rescans whole zone range. To imitate it,
+ * we set limit to 1/64 of scannable range.
+ */
+ limit >>= 6;
+
+ /* Degradation scan limit according to defer shift */
+ limit >>= zone->compact_defer_shift;
+
+ cc->migration_scan_limit = max(limit, COMPACT_MIN_SCAN_LIMIT);
+}
+
/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6
@@ -263,10 +324,15 @@ static void update_pageblock_skip(struct compact_control *cc,
if (!page)
return;
- if (nr_isolated)
+ /*
+ * Always update cached_pfn if compaction has scan_limit,
+ * otherwise we would scan same range over and over.
+ */
+ if (cc->migration_scan_limit == LONG_MAX && nr_isolated)
return;
- set_pageblock_skip(page);
+ if (!nr_isolated)
+ set_pageblock_skip(page);
/* Update where async and sync compaction should restart */
if (migrate_scanner) {
@@ -822,6 +888,8 @@ isolate_success:
if (locked)
spin_unlock_irqrestore(&zone->lru_lock, flags);
+ cc->migration_scan_limit -= nr_scanned;
+
trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
nr_scanned, nr_isolated);
@@ -1186,15 +1254,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}
-/*
- * order == -1 is expected when compacting via
- * /proc/sys/vm/compact_memory
- */
-static inline bool is_via_compact_memory(int order)
-{
- return order == -1;
-}
-
static int __compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
@@ -1224,6 +1283,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
if (is_via_compact_memory(cc->order))
return COMPACT_CONTINUE;
+ if (excess_migration_scan_limit(cc))
+ return COMPACT_PARTIAL;
+
/* Compaction run is not finished if the watermark is not met */
watermark = low_wmark_pages(zone);
@@ -1382,6 +1444,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
}
cc->last_migrated_pfn = 0;
+ set_migration_scan_limit(cc);
+ if (excess_migration_scan_limit(cc))
+ return COMPACT_SKIPPED;
+
trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync);
diff --git a/mm/internal.h b/mm/internal.h
index dbe0436..bb8225c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -164,6 +164,7 @@ struct compact_control {
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
+ long migration_scan_limit; /* Limit migration scanner activity */
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
int order; /* order a direct compactor needs */
--
1.9.1
Compaction deferring effectively reduces compaction overhead when
compaction success isn't expected, but it is implemented by skipping
a number of compaction requests until compaction is re-enabled. Due to
this implementation, one unfortunate compaction requestor pays the whole
compaction overhead while the others pay nothing. And, once deferring
starts to work, we keep skipping compaction for some number of requests
even if the possibility of compaction success has been restored.

This patch tries to solve the above problem by using a compaction limit.
Instead of imposing the whole compaction overhead on one unfortunate
requestor, the compaction limit distributes the overhead across all
compaction requestors. Every requestor gets a chance to migrate some
amount of pages, and compaction is stopped once the limit is exhausted.
This distributes the overhead fairly. And, because compaction requests
are no longer deferred, someone will succeed in compacting as soon as
the possibility of compaction success is restored.
The following is the whole workflow enabled by this change:

- if sync compaction fails, compact_order_failed is set to the current order
- if it fails again, compact_defer_shift is increased
- with a positive compact_defer_shift, migration_scan_limit is assigned
  and the compaction limit is activated
- while the compaction limit is active, compaction is stopped when
  migration_scan_limit is exhausted
- on success, compact_defer_shift and compact_order_failed are reset and
  the compaction limit is deactivated
- compact_defer_shift can grow up to COMPACT_MAX_DEFER_SHIFT

Most of the changes are mechanical ones to remove compact_considered,
which is no longer needed. Note that, after a restart,
compact_defer_shift is decreased by 1 (to COMPACT_MAX_DEFER_SHIFT - 1)
to avoid invoking __reset_isolation_suitable() repeatedly.
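The workflow above can be summarized with a small userspace sketch (not
kernel code); COMPACT_MAX_DEFER_SHIFT matches the kernel value, while
the event sequence is an assumption chosen to show the restart handling:

#include <stdio.h>
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

static unsigned int defer_shift;	/* compaction limit active while non-zero */

static void fail(void)			/* sync compaction failed at the order */
{
	if (++defer_shift > COMPACT_MAX_DEFER_SHIFT)
		defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

static bool restarting(void)
{
	return defer_shift == COMPACT_MAX_DEFER_SHIFT;
}

int main(void)
{
	for (int i = 1; i <= 8; i++) {
		if (restarting())	/* reset skip bits, then back off by one */
			defer_shift = COMPACT_MAX_DEFER_SHIFT - 1;
		fail();
		printf("fail %d: defer_shift=%u\n", i, defer_shift);
	}
	defer_shift = 0;		/* success: limit deactivated */
	printf("success: defer_shift=%u\n", defer_shift);
	return 0;
}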
I tested this patch on my compaction benchmark and found that high-order
allocation latency is evenly distributed and there are no latency spikes
in situations where compaction success isn't possible.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 6 ++---
include/trace/events/compaction.h | 7 ++----
mm/compaction.c | 47 +++++++++++++--------------------------
3 files changed, 20 insertions(+), 40 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..ebb6400 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -511,11 +511,9 @@ struct zone {
#ifdef CONFIG_COMPACTION
/*
- * On compaction failure, 1<<compact_defer_shift compactions
- * are skipped before trying again. The number attempted since
- * last failure is tracked with compact_considered.
+ * On compaction failure, compaction will be limited by
+ * compact_defer_shift.
*/
- unsigned int compact_considered;
unsigned int compact_defer_shift;
int compact_order_failed;
#endif
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index c92d1e1..ab7bed1 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -305,7 +305,6 @@ DECLARE_EVENT_CLASS(mm_compaction_defer_template,
__field(int, nid)
__field(enum zone_type, idx)
__field(int, order)
- __field(unsigned int, considered)
__field(unsigned int, defer_shift)
__field(int, order_failed)
),
@@ -314,18 +313,16 @@ DECLARE_EVENT_CLASS(mm_compaction_defer_template,
__entry->nid = zone_to_nid(zone);
__entry->idx = zone_idx(zone);
__entry->order = order;
- __entry->considered = zone->compact_considered;
__entry->defer_shift = zone->compact_defer_shift;
__entry->order_failed = zone->compact_order_failed;
),
- TP_printk("node=%d zone=%-8s order=%d order_failed=%d consider=%u limit=%lu",
+ TP_printk("node=%d zone=%-8s order=%d order_failed=%d defer=%u",
__entry->nid,
__print_symbolic(__entry->idx, ZONE_TYPE),
__entry->order,
__entry->order_failed,
- __entry->considered,
- 1UL << __entry->defer_shift)
+ __entry->defer_shift)
);
DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred,
diff --git a/mm/compaction.c b/mm/compaction.c
index b23f6d9..f3f9dc0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -129,8 +129,7 @@ static inline bool is_via_compact_memory(int order)
static bool excess_migration_scan_limit(struct compact_control *cc)
{
- /* Disable scan limit for now */
- return false;
+ return cc->migration_scan_limit < 0 ? true : false;
}
static void set_migration_scan_limit(struct compact_control *cc)
@@ -143,10 +142,7 @@ static void set_migration_scan_limit(struct compact_control *cc)
if (is_via_compact_memory(order))
return;
- if (order < zone->compact_order_failed)
- return;
-
- if (!zone->compact_defer_shift)
+ if (!compaction_deferred(zone, order))
return;
/*
@@ -188,13 +184,10 @@ static void set_migration_scan_limit(struct compact_control *cc)
static void defer_compaction(struct zone *zone, int order)
{
if (order < zone->compact_order_failed) {
- zone->compact_considered = 0;
zone->compact_defer_shift = 0;
zone->compact_order_failed = order;
- } else {
- zone->compact_considered = 0;
+ } else
zone->compact_defer_shift++;
- }
if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
@@ -202,19 +195,13 @@ static void defer_compaction(struct zone *zone, int order)
trace_mm_compaction_defer_compaction(zone, order);
}
-/* Returns true if compaction should be skipped this time */
+/* Returns true if compaction is limited */
bool compaction_deferred(struct zone *zone, int order)
{
- unsigned long defer_limit = 1UL << zone->compact_defer_shift;
-
if (order < zone->compact_order_failed)
return false;
- /* Avoid possible overflow */
- if (++zone->compact_considered > defer_limit)
- zone->compact_considered = defer_limit;
-
- if (zone->compact_considered >= defer_limit)
+ if (!zone->compact_defer_shift)
return false;
trace_mm_compaction_deferred(zone, order);
@@ -226,7 +213,6 @@ bool compaction_deferred(struct zone *zone, int order)
static void compaction_defer_reset(struct zone *zone, int order)
{
if (order >= zone->compact_order_failed) {
- zone->compact_considered = 0;
zone->compact_defer_shift = 0;
zone->compact_order_failed = order + 1;
}
@@ -240,8 +226,7 @@ bool compaction_restarting(struct zone *zone, int order)
if (order < zone->compact_order_failed)
return false;
- return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
- zone->compact_considered >= 1UL << zone->compact_defer_shift;
+ return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT;
}
/* Returns true if the pageblock should be scanned for pages to isolate. */
@@ -266,7 +251,7 @@ static void reset_cached_positions(struct zone *zone)
* should be skipped for page isolation when the migrate and free page scanner
* meet.
*/
-static void __reset_isolation_suitable(struct zone *zone)
+static void __reset_isolation_suitable(struct zone *zone, bool restart)
{
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
@@ -274,6 +259,11 @@ static void __reset_isolation_suitable(struct zone *zone)
zone->compact_blockskip_flush = false;
+ if (restart) {
+ /* To prevent restart at next compaction attempt */
+ zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT - 1;
+ }
+
/* Walk the zone and mark every pageblock as suitable for isolation */
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
struct page *page;
@@ -304,7 +294,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
/* Only flush if a full compaction finished recently */
if (zone->compact_blockskip_flush)
- __reset_isolation_suitable(zone);
+ __reset_isolation_suitable(zone, false);
}
}
@@ -1424,7 +1414,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
* this reset as it'll reset the cached information when going to sleep.
*/
if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
- __reset_isolation_suitable(zone);
+ __reset_isolation_suitable(zone, true);
/*
* Setup to move all movable pages to the end of the zone. Used cached
@@ -1615,9 +1605,6 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int status;
int zone_contended;
- if (compaction_deferred(zone, order))
- continue;
-
status = compact_zone_order(zone, order, gfp_mask, mode,
&zone_contended, alloc_flags,
ac->classzone_idx);
@@ -1713,11 +1700,9 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
* cached scanner positions.
*/
if (is_via_compact_memory(cc->order))
- __reset_isolation_suitable(zone);
+ __reset_isolation_suitable(zone, false);
- if (is_via_compact_memory(cc->order) ||
- !compaction_deferred(zone, cc->order))
- compact_zone(zone, cc);
+ compact_zone(zone, cc);
VM_BUG_ON(!list_empty(&cc->freepages));
VM_BUG_ON(!list_empty(&cc->migratepages));
--
1.9.1
On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> Cached pfn is used to determine the start position of scanner
> at next compaction run. Current cached pfn points the skipped pageblock
> so we uselessly checks whether pageblock is valid for compaction and
> skip-bit is set or not. If we set scanner's cached pfn to next pfn of
> skipped pageblock, we don't need to do this check.
>
> This patch moved update_pageblock_skip() to
> isolate_(freepages|migratepages). Updating pageblock skip information
> isn't relevant to CMA so they are more appropriate place
> to update this information.
That's a step in a good direction, yeah. But why not go as far as some
variant of my (not resubmitted) patch "mm, compaction: decouple updating
pageblock_skip and cached pfn" [1]? Now the overloading of
update_pageblock_skip() is just too much - a struct page pointer for the
skip bits, and a pfn of a different page for the cached pfn update;
that's just more complex than it should be.
(I also suspect the pageblock_flags manipulation functions could be simpler if
they accepted zone pointer and pfn instead of struct page)
Also recently in Aaron's report we found a possible scenario where pageblocks
are being skipped without entering the isolate_*_block() functions, and it would
make sense to update the cached pfn's in that case, independently of updating
pageblock skip bits.
But this might be too out of scope of your series, so if you want I can
separately look at reviving some useful parts of [1] and the simpler
pageblock_flags manipulations.
[1] https://lkml.org/lkml/2015/6/10/237
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> It's not used externally. Remove it in compaction.h.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> If compact_order_failed is initialized to 0 and order-9
> compaction is continually failed, defer counter will be updated
> to activate deferring. Although other defer counters will be properly
> updated, compact_order_failed will not be updated because failed order
> cannot be lower than compact_order_failed, 0. In this case,
> low order compaction such as 2, 3 could be deferred due to
> this wrongly initialized compact_order_failed value. This patch
> removes this possibility by initializing it to MAX_ORDER.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Good catch.
Acked-by: Vlastimil Babka <[email protected]>
> ---
> mm/page_alloc.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0499ff..7002c66 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5273,6 +5273,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> zone_seqlock_init(zone);
> zone->zone_pgdat = pgdat;
> zone_pcp_init(zone);
> +#ifdef CONFIG_COMPACTION
> + zone->compact_order_failed = MAX_ORDER;
> +#endif
>
> /* For bootup, initialized properly in watermark setup */
> mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
>
On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> It's rather strange that compact_considered and compact_defer_shift aren't
> updated but compact_order_failed is updated when allocation is expected
> to succeed. Regardless actual allocation success, deferring for current
> order will be disabled so it doesn't result in much difference to
> compaction behaviour.
The difference is that if the defer reset was wrong, the next compaction
attempt that fails would resume the deferred counters?
> Moreover, in the past, there is a gap between expectation for allocation
> succeess in compaction and actual success in page allocator. But, now,
> this gap would be diminished due to providing classzone_idx and alloc_flags
> to watermark check in compaction and changed watermark check criteria
> for high-order allocation. Therfore, it's not a big problem to update
> defer counter when allocation is expected to succeed. This change
> will help to simplify defer logic.
I guess that's true. But at least some experiment would be better.
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/compaction.h | 2 --
> mm/compaction.c | 27 ++++++++-------------------
> mm/page_alloc.c | 1 -
> 3 files changed, 8 insertions(+), 22 deletions(-)
>
...
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7002c66..f3605fd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2815,7 +2815,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> struct zone *zone = page_zone(page);
>
> zone->compact_blockskip_flush = false;
While we are here, I wonder if this is useful at all?
compact_blockskip_flush is set true when scanners meet. That typically
means the compaction wasn't successful. Rarely it can be, but I doubt
this will make much difference, so we could remove this line as well.
> - compaction_defer_reset(zone, order, true);
> count_vm_event(COMPACTSUCCESS);
> return page;
> }
>
On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> It doesn't make sense that we reset defer counter
> in compaction_defer_reset() when compaction request under the order of
> compact_order_failed succeed. Fix it.
Right.
> And, it does make sense that giving enough chance for updated failed
> order compaction before deferring. Change it.
Sorry, can't understand the meaning here. From the code it seems that
you want to reset defer_shift to 0 instead of increasing it, when the
current order is lower than the failed one? That makes sense, yeah.
How about this?
"On the other hand, when deferring compaction for an order lower than
the current compact_order_failed, we can assume the lower order will
recover more quickly, so we should reset the progress made previously on
compact_defer_shift with the higher order."
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> mm/compaction.c | 19 +++++++++++--------
> 1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 67b8d90..1a75a6e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -126,11 +126,14 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> */
> static void defer_compaction(struct zone *zone, int order)
> {
> - zone->compact_considered = 0;
> - zone->compact_defer_shift++;
> -
> - if (order < zone->compact_order_failed)
> + if (order < zone->compact_order_failed) {
> + zone->compact_considered = 0;
> + zone->compact_defer_shift = 0;
> zone->compact_order_failed = order;
> + } else {
> + zone->compact_considered = 0;
> + zone->compact_defer_shift++;
> + }
>
> if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
> zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
> @@ -161,11 +164,11 @@ bool compaction_deferred(struct zone *zone, int order)
> /* Update defer tracking counters after successful compaction of given order */
> static void compaction_defer_reset(struct zone *zone, int order)
> {
> - zone->compact_considered = 0;
> - zone->compact_defer_shift = 0;
> -
> - if (order >= zone->compact_order_failed)
> + if (order >= zone->compact_order_failed) {
> + zone->compact_considered = 0;
> + zone->compact_defer_shift = 0;
> zone->compact_order_failed = order + 1;
> + }
>
> trace_mm_compaction_defer_reset(zone, order);
> }
>
On Thu, Dec 03, 2015 at 11:36:52AM +0100, Vlastimil Babka wrote:
> On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> > Cached pfn is used to determine the start position of scanner
> > at next compaction run. Current cached pfn points the skipped pageblock
> > so we uselessly checks whether pageblock is valid for compaction and
> > skip-bit is set or not. If we set scanner's cached pfn to next pfn of
> > skipped pageblock, we don't need to do this check.
> >
> > This patch moved update_pageblock_skip() to
> > isolate_(freepages|migratepages). Updating pageblock skip information
> > isn't relevant to CMA so they are more appropriate place
> > to update this information.
>
> That's step in a good direction, yeah. But why not go as far as some variant of
> my (not resubmitted) patch "mm, compaction: decouple updating pageblock_skip and
> cached pfn" [1]. Now the overloading of update_pageblock_skip() is just too much
> - a struct page pointer for the skip bits, and a pfn of different page for the
> cached pfn update, that's just more complex than it should be.
>
> (I also suspect the pageblock_flags manipulation functions could be simpler if
> they accepted zone pointer and pfn instead of struct page)
Okay.
> Also recently in Aaron's report we found a possible scenario where pageblocks
> are being skipped without entering the isolate_*_block() functions, and it would
> make sense to update the cached pfn's in that case, independently of updating
> pageblock skip bits.
>
> But this might be too out of scope of your series, so if you want I can
> separately look at reviving some useful parts of [1] and the simpler
> pageblock_flags manipulations.
I will cherry-pick some useful parts of that with your authorship and
respin this series after you finish reviewing all the patches.
Thanks.
On Fri, Dec 04, 2015 at 05:52:21PM +0100, Vlastimil Babka wrote:
> On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> >It's rather strange that compact_considered and compact_defer_shift aren't
> >updated but compact_order_failed is updated when allocation is expected
> >to succeed. Regardless actual allocation success, deferring for current
> >order will be disabled so it doesn't result in much difference to
> >compaction behaviour.
>
> The difference is that if the defer reset was wrong, the next
> compaction attempt that fails would resume the deferred counters?
Right. But, perhaps, if we wrongly reset order_failed due to the
difference in check criteria, it could happen again and again on the
next compaction attempts, so deferring would not work as intended.
>
> >Moreover, in the past, there is a gap between expectation for allocation
> >succeess in compaction and actual success in page allocator. But, now,
> >this gap would be diminished due to providing classzone_idx and alloc_flags
> >to watermark check in compaction and changed watermark check criteria
> >for high-order allocation. Therfore, it's not a big problem to update
> >defer counter when allocation is expected to succeed. This change
> >will help to simplify defer logic.
>
> I guess that's true. But at least some experiment would be better.
Yeah, I tested it today and found that there is a difference. Actual
allocation is slightly more successful (really minor, 0.25%) than the
check in compaction predicts. The reason is that the watermark check in
try_to_compact_pages() uses low_wmark_pages but get_page_from_freelist()
after direct compaction uses min_wmark_pages. When I changed
low_wmark_pages to min_wmark_pages, I couldn't find any difference. It
seems reasonable to change low_wmark_pages to min_wmark_pages in the
places where we check the compaction finish condition. I will add that
patch in the next spin.
>
> >Signed-off-by: Joonsoo Kim <[email protected]>
> >---
> > include/linux/compaction.h | 2 --
> > mm/compaction.c | 27 ++++++++-------------------
> > mm/page_alloc.c | 1 -
> > 3 files changed, 8 insertions(+), 22 deletions(-)
> >
>
> ...
>
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 7002c66..f3605fd 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2815,7 +2815,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> > struct zone *zone = page_zone(page);
> >
> > zone->compact_blockskip_flush = false;
>
> While we are here, I wonder if this is useful at all?
I think it's useful. We still have some cases where premature compaction
completion happens (e.g. async compaction). In that case, if the next
sync compaction succeeds, compact_blockskip_flush is cleared and the
pageblock skip bits will not be reset, so overhead is reduced.
On Fri, Dec 04, 2015 at 06:15:02PM +0100, Vlastimil Babka wrote:
> On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> >It doesn't make sense that we reset defer counter
> >in compaction_defer_reset() when compaction request under the order of
> >compact_order_failed succeed. Fix it.
>
> Right.
>
> >And, it does make sense that giving enough chance for updated failed
> >order compaction before deferring. Change it.
>
> Sorry, can't understand the meaning here. From the code it seems
> that you want to reset defer_shift to 0 instead of increasing it,
> when the current order is lower than the failed one? That makes
> sense, yeah.
You correctly understand my intention. :)
> How about this?
>
> "On the other hand, when deferring compaction for an order lower
> than the current compact_order_failed, we can assume the lower order
> will recover more quickly, so we should reset the progress made
> previously on compact_defer_shift with the higher order."
Will add it.
>
> >Signed-off-by: Joonsoo Kim <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
Thanks.
>
> >---
> > mm/compaction.c | 19 +++++++++++--------
> > 1 file changed, 11 insertions(+), 8 deletions(-)
> >
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index 67b8d90..1a75a6e 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -126,11 +126,14 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> > */
> > static void defer_compaction(struct zone *zone, int order)
> > {
> >- zone->compact_considered = 0;
> >- zone->compact_defer_shift++;
> >-
> >- if (order < zone->compact_order_failed)
> >+ if (order < zone->compact_order_failed) {
> >+ zone->compact_considered = 0;
> >+ zone->compact_defer_shift = 0;
> > zone->compact_order_failed = order;
> >+ } else {
> >+ zone->compact_considered = 0;
> >+ zone->compact_defer_shift++;
> >+ }
> >
> > if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
> > zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
> >@@ -161,11 +164,11 @@ bool compaction_deferred(struct zone *zone, int order)
> > /* Update defer tracking counters after successful compaction of given order */
> > static void compaction_defer_reset(struct zone *zone, int order)
> > {
> >- zone->compact_considered = 0;
> >- zone->compact_defer_shift = 0;
> >-
> >- if (order >= zone->compact_order_failed)
> >+ if (order >= zone->compact_order_failed) {
> >+ zone->compact_considered = 0;
> >+ zone->compact_defer_shift = 0;
> > zone->compact_order_failed = order + 1;
> >+ }
> >
> > trace_mm_compaction_defer_reset(zone, order);
> > }
> >
>
On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> This is preparation step to replace compaction deferring with compaction
> limit. Whole reason why we need to replace it will be mentioned in
> the following patch.
>
> In this patch, migration_scan_limit is assigned and accounted, but, not
> checked to finish. So, there is no functional change.
>
> Currently, amount of migration_scan_limit is chosen to imitate compaction
> deferring logic. We can tune it easily if overhead looks insane, but,
> it would be further work.
> Also, amount of migration_scan_limit is adapted by compact_defer_shift.
> More fails increase compact_defer_shift and this will limit compaction
> more.
>
> There are two interesting changes. One is that cached pfn is always
> updated while limit is activated. Otherwise, we would scan same range
> over and over. Second one is that async compaction is skipped while
> limit is activated, for algorithm correctness. Until now, even if
> failure case, sync compaction continue to work when both scanner is met
> so COMPACT_COMPLETE usually happens in sync compaction. But, limit is
> applied, sync compaction is finished if limit is exhausted so
> COMPACT_COMPLETE usually happens in async compaction. Because we don't
> consider async COMPACT_COMPLETE as actual fail while we reset cached
> scanner pfn
I don't see where compaction being sync/async applies to "reset cached
scanner pfn". I assume you actually meant the call to defer_compaction()
in try_to_compact_pages, which only happens for async compaction?
> defer mechanism doesn't work well. And, async compaction
> would not be easy to succeed in this case so skipping async compaction
> doesn't result in much difference.
So, the alternative to avoiding async compaction would be to call
defer_compaction() also when async compaction completes, right? Which
doesn't sound as scary when deferring isn't an on/off thing, but applies
a limit.
This would also help the issue with THP fault compactions being only
async and thus never deferring anything, which I think showed itself in
Aaron's reports. This current patch wouldn't help there I think, as
without sync compaction the system would never start to apply the limit
in the first place, and would be stuck with the ill-defined contended
compaction detection based on need_resched etc.
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/compaction.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> mm/internal.h | 1 +
> 2 files changed, 78 insertions(+), 11 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 1a75a6e..b23f6d9 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -116,6 +116,67 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
>
> #ifdef CONFIG_COMPACTION
>
> +/*
> + * order == -1 is expected when compacting via
> + * /proc/sys/vm/compact_memory
> + */
> +static inline bool is_via_compact_memory(int order)
> +{
> + return order == -1;
> +}
> +
> +#define COMPACT_MIN_SCAN_LIMIT (pageblock_nr_pages)
> +
> +static bool excess_migration_scan_limit(struct compact_control *cc)
> +{
> + /* Disable scan limit for now */
> + return false;
> +}
> +
> +static void set_migration_scan_limit(struct compact_control *cc)
> +{
> + struct zone *zone = cc->zone;
> + int order = cc->order;
> + unsigned long limit = zone->managed_pages;
> +
> + cc->migration_scan_limit = LONG_MAX;
> + if (is_via_compact_memory(order))
> + return;
> +
> + if (order < zone->compact_order_failed)
> + return;
> +
> + if (!zone->compact_defer_shift)
> + return;
> +
> + /*
> + * Do not allow async compaction during limit work. In this case,
> + * async compaction would not be easy to succeed and we need to
> + * ensure that COMPACT_COMPLETE occurs by sync compaction for
> + * algorithm correctness and prevention of async compaction will
> + * lead it.
> + */
> + if (cc->mode == MIGRATE_ASYNC) {
> + cc->migration_scan_limit = -1;
> + return;
> + }
> +
> + /* Migration scanner usually scans less than 1/4 pages */
> + limit >>= 2;
> +
> + /*
> + * Deferred compaction restart compaction every 64 compaction
> + * attempts and it rescans whole zone range. To imitate it,
> + * we set limit to 1/64 of scannable range.
> + */
> + limit >>= 6;
> +
> + /* Degradation scan limit according to defer shift */
> + limit >>= zone->compact_defer_shift;
> +
> + cc->migration_scan_limit = max(limit, COMPACT_MIN_SCAN_LIMIT);
> +}
> +
> /* Do not skip compaction more than 64 times */
> #define COMPACT_MAX_DEFER_SHIFT 6
>
> @@ -263,10 +324,15 @@ static void update_pageblock_skip(struct compact_control *cc,
> if (!page)
> return;
>
> - if (nr_isolated)
> + /*
> + * Always update cached_pfn if compaction has scan_limit,
> + * otherwise we would scan same range over and over.
> + */
> + if (cc->migration_scan_limit == LONG_MAX && nr_isolated)
> return;
>
> - set_pageblock_skip(page);
> + if (!nr_isolated)
> + set_pageblock_skip(page);
>
> /* Update where async and sync compaction should restart */
> if (migrate_scanner) {
> @@ -822,6 +888,8 @@ isolate_success:
> if (locked)
> spin_unlock_irqrestore(&zone->lru_lock, flags);
>
> + cc->migration_scan_limit -= nr_scanned;
> +
> trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
> nr_scanned, nr_isolated);
>
> @@ -1186,15 +1254,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
> }
>
> -/*
> - * order == -1 is expected when compacting via
> - * /proc/sys/vm/compact_memory
> - */
> -static inline bool is_via_compact_memory(int order)
> -{
> - return order == -1;
> -}
> -
> static int __compact_finished(struct zone *zone, struct compact_control *cc,
> const int migratetype)
> {
> @@ -1224,6 +1283,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
> if (is_via_compact_memory(cc->order))
> return COMPACT_CONTINUE;
>
> + if (excess_migration_scan_limit(cc))
> + return COMPACT_PARTIAL;
> +
> /* Compaction run is not finished if the watermark is not met */
> watermark = low_wmark_pages(zone);
>
> @@ -1382,6 +1444,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> }
> cc->last_migrated_pfn = 0;
>
> + set_migration_scan_limit(cc);
> + if (excess_migration_scan_limit(cc))
> + return COMPACT_SKIPPED;
> +
> trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
> cc->free_pfn, end_pfn, sync);
>
> diff --git a/mm/internal.h b/mm/internal.h
> index dbe0436..bb8225c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -164,6 +164,7 @@ struct compact_control {
> unsigned long free_pfn; /* isolate_freepages search base */
> unsigned long migrate_pfn; /* isolate_migratepages search base */
> unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
> + long migration_scan_limit; /* Limit migration scanner activity */
> enum migrate_mode mode; /* Async or sync migration mode */
> bool ignore_skip_hint; /* Scan blocks even if marked skip */
> int order; /* order a direct compactor needs */
>
On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> Compaction deferring effectively reduces compaction overhead if
> compaction success isn't expected. But, it is implemented that
> skipping a number of compaction requests until compaction is re-enabled.
> Due to this implementation, unfortunate compaction requestor will get
> whole compaction overhead unlike others have zero overhead. And, after
> deferring start to work, even if compaction success possibility is
> restored, we should skip to compaction in some number of times.
>
> This patch try to solve above problem by using compaction limit.
> Instead of imposing compaction overhead to one unfortunate requestor,
> compaction limit distributes overhead to all compaction requestors.
> All requestors have a chance to migrate some amount of pages and
> after limit is exhausted compaction will be stopped. This will fairly
> distributes overhead to all compaction requestors. And, because we don't
> defer compaction request, someone will succeed to compact as soon as
> possible if compaction success possiblility is restored.
>
> Following is whole workflow enabled by this change.
>
> - if sync compaction fails, compact_order_failed is set to current order
> - if it fails again, compact_defer_shift is adjusted
> - with positive compact_defer_shift, migration_scan_limit is assigned and
> compaction limit is activated
> - if compaction limit is activated, compaction would be stopped when
> migration_scan_limit is exhausted
> - when success, compact_defer_shift and compact_order_failed is reset and
> compaction limit is deactivated
> - compact_defer_shift can be grown up to COMPACT_MAX_DEFER_SHIFT
>
> Most of changes are mechanical ones to remove compact_considered which
> is not needed now. Note that, after restart, compact_defer_shift is
> subtracted by 1 to avoid invoking __reset_isolation_suitable()
> repeatedly.
>
> I tested this patch on my compaction benchmark and found that high-order
> allocation latency is evenly distributed and there is no latency spike
> in the situation where compaction success isn't possible.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Looks fine overall, looking forward to the next version :) (due to
changes expected in the preceding patches, I didn't fully review the
code now).
On Mon, Dec 14, 2015 at 10:34:50AM +0100, Vlastimil Babka wrote:
> On 12/03/2015 08:11 AM, Joonsoo Kim wrote:
> >This is preparation step to replace compaction deferring with compaction
> >limit. Whole reason why we need to replace it will be mentioned in
> >the following patch.
> >
> >In this patch, migration_scan_limit is assigned and accounted, but, not
> >checked to finish. So, there is no functional change.
> >
> >Currently, amount of migration_scan_limit is chosen to imitate compaction
> >deferring logic. We can tune it easily if overhead looks insane, but,
> >it would be further work.
> >Also, amount of migration_scan_limit is adapted by compact_defer_shift.
> >More fails increase compact_defer_shift and this will limit compaction
> >more.
> >
> >There are two interesting changes. One is that cached pfn is always
> >updated while limit is activated. Otherwise, we would scan same range
> >over and over. Second one is that async compaction is skipped while
> >limit is activated, for algorithm correctness. Until now, even if
> >failure case, sync compaction continue to work when both scanner is met
> >so COMPACT_COMPLETE usually happens in sync compaction. But, limit is
> >applied, sync compaction is finished if limit is exhausted so
> >COMPACT_COMPLETE usually happens in async compaction. Because we don't
> >consider async COMPACT_COMPLETE as actual fail while we reset cached
> >scanner pfn
>
> I don't see where compaction being sync/async applies to "reset
> cached scanner pfn". I assume you actually meant the call to
> defer_compaction() in try_to_compact_pages, which only happens for
> async compaction?
What I wanted to say is that reset_cached_positions() is called in
__compact_finished() for async compaction and defer_compaction() isn't
called for this case.
>
> >defer mechanism doesn't work well. And, async compaction
> >would not be easy to succeed in this case so skipping async compaction
> >doesn't result in much difference.
>
> So, the alternative to avoiding async compaction would be to call
> defer_compaction() also when async compaction completes, right?
> Which doesn't sound as scary when deferring isn't an on/off thing,
> but applies a limit.
Yeah, that would be one alternative, but I'm not sure it works well. I
can think of one scenario where it doesn't:

1) Assume that most pageblocks are non-movable and the limit is activated.
2) Async compaction skips non-movable pageblocks, so the scanners easily
meet without compaction success. Then the cached pfns are reset.
3) Sync compaction scans a few pageblocks at the front part of the zone
and fails due to the reset cached pfns and the limit.
4) 2) and 3) happen again on the next compaction request.

If we allow async compaction's migration scanner to scan non-movable
pageblocks in this case, everything will work fine.
How about allowing async compaction's migration scanner to scan
non-movable pageblock *always*? Reason that async compaction doesn't
scan it is to succeed compaction without much stall but it makes
compaction too complicated. For example, defer_compaction() cannot be
called for async compaction and compaction works different according to
pageblock type distribution on the system. We already have a logic to
control stall so stall would not matter now. If it doesn't work well,
we can change it by always applying scan limit to async compaction.
It could cause lower success rate on async compaction but I don't
think it causes a problem because it's not that hard to succeed to
make high-order page up to PAGE_ALLOC_COSTLY_ORDER even in non-movable
pageblock. For request more than PAGE_ALLOC_COSTLY order, we don't
need to consider success rate much because it isn't easy to succeed
on async compaction.
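To be concrete, the check I'd like to drop (or relax) is the one in
isolate_migratepages(), which reads roughly like this:

        /* For async compaction, also only scan in MOVABLE blocks */
        if (cc->mode == MIGRATE_ASYNC &&
            !migrate_async_suitable(get_pageblock_migratetype(page)))
                continue;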
> This would also help the issue with THP fault compactions being only
> async and thus never deferring anything, which I think showed itself
> in Aaron's reports. This current patch wouldn't help there I think,
> as without sync compaction the system would never start to apply the
Yes, if async compaction called defer_compaction(), Aaron's problem
would be mitigated.
> limit in the first place, and would be stuck with the ill-defined
> contended compaction detection based on need_resched etc.
I also think that contended compaction detection based on need_resched()
should be changed. If there is just one runnable task on the CPU, it is
never triggered. It would be better to apply the scan limit in this case
as well.
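For example, the current detection is basically this (simplified):

        static inline bool compact_should_abort(struct compact_control *cc)
        {
                /* async compaction aborts if contended */
                if (need_resched()) {
                        if (cc->mode == MIGRATE_ASYNC) {
                                cc->contended = COMPACT_CONTENDED_SCHED;
                                return true;
                        }
                        cond_resched();
                }
                return false;
        }

If the compacting task is the only runnable task on the CPU,
need_resched() never fires, so async compaction is never treated as
contended no matter how long it scans; a scan limit would bound it in
that case, too.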
Thanks.
>
> >Signed-off-by: Joonsoo Kim <[email protected]>
> >---
> > mm/compaction.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> > mm/internal.h | 1 +
> > 2 files changed, 78 insertions(+), 11 deletions(-)
> >
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index 1a75a6e..b23f6d9 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -116,6 +116,67 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> >
> > #ifdef CONFIG_COMPACTION
> >
> >+/*
> >+ * order == -1 is expected when compacting via
> >+ * /proc/sys/vm/compact_memory
> >+ */
> >+static inline bool is_via_compact_memory(int order)
> >+{
> >+ return order == -1;
> >+}
> >+
> >+#define COMPACT_MIN_SCAN_LIMIT (pageblock_nr_pages)
> >+
> >+static bool excess_migration_scan_limit(struct compact_control *cc)
> >+{
> >+ /* Disable scan limit for now */
> >+ return false;
> >+}
> >+
> >+static void set_migration_scan_limit(struct compact_control *cc)
> >+{
> >+ struct zone *zone = cc->zone;
> >+ int order = cc->order;
> >+ unsigned long limit = zone->managed_pages;
> >+
> >+ cc->migration_scan_limit = LONG_MAX;
> >+ if (is_via_compact_memory(order))
> >+ return;
> >+
> >+ if (order < zone->compact_order_failed)
> >+ return;
> >+
> >+ if (!zone->compact_defer_shift)
> >+ return;
> >+
> >+ /*
> >+ * Do not allow async compaction during limit work. In this case,
> >+ * async compaction would not be easy to succeed and we need to
> >+ * ensure that COMPACT_COMPLETE occurs by sync compaction for
> >+ * algorithm correctness and prevention of async compaction will
> >+ * lead it.
> >+ */
> >+ if (cc->mode == MIGRATE_ASYNC) {
> >+ cc->migration_scan_limit = -1;
> >+ return;
> >+ }
> >+
> >+ /* Migration scanner usually scans less than 1/4 pages */
> >+ limit >>= 2;
> >+
> >+ /*
> >+ * Deferred compaction restart compaction every 64 compaction
> >+ * attempts and it rescans whole zone range. To imitate it,
> >+ * we set limit to 1/64 of scannable range.
> >+ */
> >+ limit >>= 6;
> >+
> >+ /* Degradation scan limit according to defer shift */
> >+ limit >>= zone->compact_defer_shift;
> >+
> >+ cc->migration_scan_limit = max(limit, COMPACT_MIN_SCAN_LIMIT);
> >+}
> >+
> > /* Do not skip compaction more than 64 times */
> > #define COMPACT_MAX_DEFER_SHIFT 6
> >
> >@@ -263,10 +324,15 @@ static void update_pageblock_skip(struct compact_control *cc,
> > if (!page)
> > return;
> >
> >- if (nr_isolated)
> >+ /*
> >+ * Always update cached_pfn if compaction has scan_limit,
> >+ * otherwise we would scan same range over and over.
> >+ */
> >+ if (cc->migration_scan_limit == LONG_MAX && nr_isolated)
> > return;
> >
> >- set_pageblock_skip(page);
> >+ if (!nr_isolated)
> >+ set_pageblock_skip(page);
> >
> > /* Update where async and sync compaction should restart */
> > if (migrate_scanner) {
> >@@ -822,6 +888,8 @@ isolate_success:
> > if (locked)
> > spin_unlock_irqrestore(&zone->lru_lock, flags);
> >
> >+ cc->migration_scan_limit -= nr_scanned;
> >+
> > trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
> > nr_scanned, nr_isolated);
> >
> >@@ -1186,15 +1254,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> > return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
> > }
> >
> >-/*
> >- * order == -1 is expected when compacting via
> >- * /proc/sys/vm/compact_memory
> >- */
> >-static inline bool is_via_compact_memory(int order)
> >-{
> >- return order == -1;
> >-}
> >-
> > static int __compact_finished(struct zone *zone, struct compact_control *cc,
> > const int migratetype)
> > {
> >@@ -1224,6 +1283,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
> > if (is_via_compact_memory(cc->order))
> > return COMPACT_CONTINUE;
> >
> >+ if (excess_migration_scan_limit(cc))
> >+ return COMPACT_PARTIAL;
> >+
> > /* Compaction run is not finished if the watermark is not met */
> > watermark = low_wmark_pages(zone);
> >
> >@@ -1382,6 +1444,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > }
> > cc->last_migrated_pfn = 0;
> >
> >+ set_migration_scan_limit(cc);
> >+ if (excess_migration_scan_limit(cc))
> >+ return COMPACT_SKIPPED;
> >+
> > trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
> > cc->free_pfn, end_pfn, sync);
> >
> >diff --git a/mm/internal.h b/mm/internal.h
> >index dbe0436..bb8225c 100644
> >--- a/mm/internal.h
> >+++ b/mm/internal.h
> >@@ -164,6 +164,7 @@ struct compact_control {
> > unsigned long free_pfn; /* isolate_freepages search base */
> > unsigned long migrate_pfn; /* isolate_migratepages search base */
> > unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
> >+ long migration_scan_limit; /* Limit migration scanner activity */
> > enum migrate_mode mode; /* Async or sync migration mode */
> > bool ignore_skip_hint; /* Scan blocks even if marked skip */
> > int order; /* order a direct compactor needs */
> >
>