2014-10-07 15:34:03

by Vlastimil Babka

Subject: [PATCH 0/5] Further compaction tuning

Based on next-20141007. Patch 5 needs "mm: introduce single zone pcplists
drain" from https://lkml.org/lkml/2014/10/2/375

OK, time to reset the "days without a compaction series" counter back to 0.
This series is mostly what was postponed in the previous one (but sadly not
all of it), along with some smaller changes.

Patch 1 tries to solve the mismatch between watermark checking in compaction and
in allocations by adding the missing pieces to compact_control. This mainly
allows simplifying deferred compaction handling in Patch 2. The change in Patch 2
was suggested by Joonsoo when reviewing the previous series, but was not possible
without Patch 1. Patch 3 is a rather cosmetic change to deferred compaction.

Patch 4 removes probably the last occurrence of the compaction scanners rescanning
some pages when they are restarted in the middle of the zone.

Patch 5 is a posthumous child of patch "mm, compaction: try to capture the
just-created high-order freepage" which was removed from the previous series.
Thanks to Joonsoo's objections we found out that the improvements of the
capture patch were mainly due to better lru_add cache and pcplists draining.
The remaining delta wrt success rates between this patch and page capture was
due to different (questionable) watermark checking in the capture mechanism.
So this patch brings most of the improvements without the questionable parts
and complexity that capture had.

Vlastimil Babka (5):
mm, compaction: pass classzone_idx and alloc_flags to watermark
checking
mm, compaction: simplify deferred compaction
mm, compaction: defer only on COMPACT_COMPLETE
mm, compaction: always update cached scanner positions
mm, compaction: more focused lru and pcplists draining

include/linux/compaction.h | 10 +++---
mm/compaction.c | 89 +++++++++++++++++++++++++++++-----------------
mm/internal.h | 7 ++--
mm/page_alloc.c | 15 +-------
mm/vmscan.c | 12 +++----
5 files changed, 71 insertions(+), 62 deletions(-)

--
1.8.4.5


2014-10-07 15:34:07

by Vlastimil Babka

Subject: [PATCH 2/5] mm, compaction: simplify deferred compaction

Since commit ("mm, compaction: defer each zone individually instead of
preferred zone"), compaction is deferred for each zone where sync direct
compaction fails, and reset where it succeeds. However, it was observed
that for DMA zone compaction often appeared to succeed while subsequent
allocation attempt would not, due to different outcome of watermark check.
In order to properly defer compaction in this zone, the candidate zone has
to be passed back to __alloc_pages_direct_compact() and compaction deferred
in the zone after the allocation attempt fails.

The large source of mismatch between watermark check in compaction and
allocation was the lack of alloc_flags and classzone_idx values in compaction,
which has been fixed in the previous patch. So with this problem fixed, we
can simplify the code by removing the candidate_zone parameter and deferring
in __alloc_pages_direct_compact().

After this patch, the compaction activity during stress-highalloc benchmark is
still somewhat increased, but it's negligible compared to the increase that
occurred without the better watermark checking. This suggests that it is still
possible to apparently succeed in compaction but fail to allocate, possibly
due to parallel allocation activity.

Suggested-by: Joonsoo Kim <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
include/linux/compaction.h | 6 ++----
mm/compaction.c | 5 +----
mm/page_alloc.c | 12 +-----------
3 files changed, 4 insertions(+), 19 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index d896765..58c293d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -33,8 +33,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
enum migrate_mode mode, int *contended,
- int alloc_flags, int classzone_idx,
- struct zone **candidate_zone);
+ int alloc_flags, int classzone_idx);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order,
@@ -105,8 +104,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, int *contended,
- int alloc_flags, int classzone_idx,
- struct zone **candidate_zone)
+ int alloc_flags, int classzone_idx);
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index dba8891..4b3e0bd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,15 +1276,13 @@ int sysctl_extfrag_threshold = 500;
* @mode: The migration mode for async, sync light, or sync migration
* @contended: Return value that determines if compaction was aborted due to
* need_resched() or lock contention
- * @candidate_zone: Return the zone where we think allocation should succeed
*
* This is the main entry point for direct page compaction.
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, int *contended,
- int alloc_flags, int classzone_idx,
- struct zone **candidate_zone)
+ int alloc_flags, int classzone_idx)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1321,7 +1319,6 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
/* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
classzone_idx, alloc_flags)) {
- *candidate_zone = zone;
/*
* We think the allocation will succeed in this zone,
* but it is not certain, hence the false. The caller
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8d143a0..5a4506f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2328,7 +2328,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int classzone_idx, int migratetype, enum migrate_mode mode,
int *contended_compaction, bool *deferred_compaction)
{
- struct zone *last_compact_zone = NULL;
unsigned long compact_result;
struct page *page;

@@ -2339,8 +2338,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
compact_result = try_to_compact_pages(zonelist, order, gfp_mask,
nodemask, mode,
contended_compaction,
- alloc_flags, classzone_idx,
- &last_compact_zone);
+ alloc_flags, classzone_idx);
current->flags &= ~PF_MEMALLOC;

switch (compact_result) {
@@ -2378,14 +2376,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
}

/*
- * last_compact_zone is where try_to_compact_pages thought allocation
- * should succeed, so it did not defer compaction. But here we know
- * that it didn't succeed, so we do the defer.
- */
- if (last_compact_zone && mode != MIGRATE_ASYNC)
- defer_compaction(last_compact_zone, order);
-
- /*
* It's bad if compaction run occurs and fails. The most likely reason
* is that pages exist, but not enough to satisfy watermarks.
*/
--
1.8.4.5

2014-10-07 15:34:05

by Vlastimil Babka

Subject: [PATCH 3/5] mm, compaction: defer only on COMPACT_COMPLETE

Deferred compaction is employed to avoid compacting zone where sync direct
compaction has recently failed. As such, it makes sense to only defer when
a full zone was scanned, which is when compact_zone returns with
COMPACT_COMPLETE. It's less useful to defer when compact_zone returns with
apparent success (COMPACT_PARTIAL), followed by a watermark check failure,
which can happen due to parallel allocation activity. It also does not make
much sense to defer compaction which was completely skipped (COMPACT_SKIPPED) for
being unsuitable in the first place.
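
For context, "deferring" here means bumping a per-zone backoff counter so that
subsequent compaction attempts are skipped for a while. Roughly (a simplified
sketch of defer_compaction(); not the exact code):

	void defer_compaction(struct zone *zone, int order)
	{
		/* reset the "attempts considered" counter and back off further */
		zone->compact_considered = 0;
		zone->compact_defer_shift++;

		/* remember the lowest order that failed */
		if (order < zone->compact_order_failed)
			zone->compact_order_failed = order;

		/* cap the backoff at 1 << COMPACT_MAX_DEFER_SHIFT skipped attempts */
		if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
			zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
	}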

This patch therefore makes deferred compaction trigger only when
COMPACT_COMPLETE is returned from compact_zone(). Results of stress-highalloc
benchmark show the difference is within measurement error, so the issue is
rather cosmetic.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 4b3e0bd..9107588 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1340,7 +1340,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
goto break_loop;
}

- if (mode != MIGRATE_ASYNC) {
+ if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
/*
* We think that allocation won't succeed in this zone
* so we defer compaction there. If it ends up
--
1.8.4.5

2014-10-07 15:34:51

by Vlastimil Babka

Subject: [PATCH 5/5] mm, compaction: more focused lru and pcplists draining

The goal of memory compaction is to create high-order freepages through page
migration. Page migration however puts pages on the per-cpu lru_add cache,
which is later flushed to per-cpu pcplists, and only after pcplists are
drained the pages can actually merge. This can happen due to the per-cpu
caches becoming full through further freeing, or explicitly.

During direct compaction, it is useful to do the draining explicitly so that
pages merge as soon as possible and compaction can detect success immediately
and keep the latency impact at minimum. However the current implementation is
far from ideal. Draining is done only in __alloc_pages_direct_compact(),
after all zones were already compacted, and the decisions to continue or stop
compaction in individual zones were made without the last batch of migrations
being merged. It is also missing the draining of lru_add cache before the
pcplists.

This patch moves the draining for direct compaction into compact_zone(). It
adds the missing lru_cache draining and uses the newly introduced single zone
pcplists draining to reduce overhead and avoid impact on unrelated zones.
Draining is only performed when it can actually lead to merging of a page of
desired order (passed by cc->order). This means it is only done when migration
occurred in the previously scanned cc->order aligned block(s) and the
migration scanner is now pointing to the next cc->order aligned block.

The patch has been tested with stress-highalloc benchmark from mmtests.
Although overall allocation success rates of the benchmark were not affected,
the number of detected compaction successes has doubled. This suggests that
allocations were previously successful due to implicit merging caused by
background activity, making a later allocation attempt succeed immediately,
but not attributing the success to compaction. Since stress-highalloc always
tries to allocate almost the whole memory, it cannot show the improvement in
its reported success rate metric. However after this patch, compaction should
detect success and terminate earlier, reducing the direct compaction latencies
in a real scenario.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 41 ++++++++++++++++++++++++++++++++++++++++-
mm/page_alloc.c | 4 ----
2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8fa888d..41b49d7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1179,6 +1179,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
while ((ret = compact_finished(zone, cc, migratetype)) ==
COMPACT_CONTINUE) {
int err;
+ unsigned long last_migrated_pfn = 0;

switch (isolate_migratepages(zone, cc)) {
case ISOLATE_ABORT:
@@ -1187,7 +1188,12 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
cc->nr_migratepages = 0;
goto out;
case ISOLATE_NONE:
- continue;
+ /*
+ * We haven't isolated and migrated anything, but
+ * there might still be unflushed migrations from
+ * previous cc->order aligned block.
+ */
+ goto check_drain;
case ISOLATE_SUCCESS:
;
}
@@ -1212,6 +1218,39 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
goto out;
}
}
+
+ /*
+ * Record where we have freed pages by migration and not yet
+ * flushed them to buddy allocator. Subtract 1, because often
+ * we finish a pageblock and migrate_pfn points to the first
+ * page* of the next one. In that case we want the drain below
+ * to happen immediately.
+ */
+ if (!last_migrated_pfn)
+ last_migrated_pfn = cc->migrate_pfn - 1;
+
+check_drain:
+ /*
+ * Have we moved away from the previous cc->order aligned block
+ * where we migrated from? If yes, flush the pages that were
+ * freed, so that they can merge and compact_finished() can
+ * detect immediately if allocation should succeed.
+ */
+ if (cc->order > 0 && last_migrated_pfn) {
+ int cpu;
+ unsigned long current_block_start =
+ cc->migrate_pfn & ~((1UL << cc->order) - 1);
+
+ if (last_migrated_pfn < current_block_start) {
+ cpu = get_cpu();
+ lru_add_drain_cpu(cpu);
+ drain_local_pages(zone);
+ put_cpu();
+ /* No more flushing until we migrate again */
+ last_migrated_pfn = 0;
+ }
+ }
+
}

out:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5a4506f..14fac43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2357,10 +2357,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
*/
count_vm_event(COMPACTSTALL);

- /* Page migration frees to the PCP lists but we want merging */
- drain_pages(get_cpu());
- put_cpu();
-
page = get_page_from_freelist(gfp_mask, nodemask,
order, zonelist, high_zoneidx,
alloc_flags & ~ALLOC_NO_WATERMARKS,
--
1.8.4.5

2014-10-07 15:35:10

by Vlastimil Babka

Subject: [PATCH 1/5] mm, compaction: pass classzone_idx and alloc_flags to watermark checking

Compaction relies on zone watermark checks for decisions such as if it's worth
to start compacting in compaction_suitable() or whether compaction should stop
in compact_finished(). The watermark checks take classzone_idx and alloc_flags
parameters, which are related to the memory allocation request. But from the
context of compaction they are currently passed as 0, including the direct
compaction which is invoked to satisfy the allocation request, and could
therefore know the proper values.

The lack of proper values can lead to mismatch between decisions taken during
compaction and decisions related to the allocation request. Lack of proper
classzone_idx value means that lowmem_reserve is not taken into account.
This has manifested (during recent changes to deferred compaction) when DMA
zone was used as fallback for preferred Normal zone. compaction_suitable()
without proper classzone_idx would think that the watermarks are already
satisfied, but watermark check in get_page_from_freelist() would fail. Because
of this problem, deferring compaction has extra complexity that can be removed
in the following patch.

The issue (not confirmed in practice) with missing alloc_flags is opposite in
nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or ALLOC_CMA in
alloc_flags (the last includes all MOVABLE allocations on CMA-enabled systems)
the watermark checking in compaction with 0 passed will be stricter than in
get_page_from_freelist(). In these cases compaction might be running for a
longer time than is really needed.
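
For illustration, the order-0 part of the watermark check being discussed does
roughly the following with these two parameters (a simplified sketch modeled on
__zone_watermark_ok(); the per-order freelist checks and other details are
omitted):

	/*
	 * Simplified sketch, not the exact kernel code. classzone_idx selects
	 * which lowmem_reserve entry is added on top of the watermark, while
	 * alloc_flags can relax the watermark (ALLOC_HIGH, ALLOC_HARDER) or
	 * allow counting free CMA pages (ALLOC_CMA). Passing 0 for both thus
	 * ignores lowmem_reserve and performs the strictest possible check.
	 */
	long min = mark;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;
	if (!(alloc_flags & ALLOC_CMA))
		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

	if (free_pages - (1 << order) + 1 <= min + z->lowmem_reserve[classzone_idx])
		return false;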

This patch fixes these problems by adding alloc_flags and classzone_idx to
struct compact_control and related functions involved in direct compaction and
watermark checking. Where possible, all other callers of compaction_suitable()
pass proper values where those are known. This is currently limited to
classzone_idx, which is sometimes known in kswapd context. However, the direct
reclaim callers should_continue_reclaim() and compaction_ready() do not
currently know the proper values, so the coordination between reclaim and
compaction may still not be as accurate as it could be. This can be fixed later,
if it's shown to be an issue.

The effect of this patch should be slightly better high-order allocation
success rates and/or less compaction overhead, depending on the type of
allocations and presence of CMA. It allows simplifying deferred compaction
code in a followup patch.

When testing with stress-highalloc, there was some slight improvement (which
might be just due to variance) in success rates of non-THP-like allocations.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
include/linux/compaction.h | 8 ++++++--
mm/compaction.c | 29 +++++++++++++++--------------
mm/internal.h | 2 ++
mm/page_alloc.c | 1 +
mm/vmscan.c | 12 ++++++------
5 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 60bdf8d..d896765 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -33,10 +33,12 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
enum migrate_mode mode, int *contended,
+ int alloc_flags, int classzone_idx,
struct zone **candidate_zone);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern unsigned long compaction_suitable(struct zone *zone, int order);
+extern unsigned long compaction_suitable(struct zone *zone, int order,
+ int alloc_flags, int classzone_idx);

/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6
@@ -103,6 +105,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, int *contended,
+ int alloc_flags, int classzone_idx,
struct zone **candidate_zone)
{
return COMPACT_CONTINUE;
@@ -116,7 +119,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
{
}

-static inline unsigned long compaction_suitable(struct zone *zone, int order)
+static inline unsigned long compaction_suitable(struct zone *zone, int order,
+ int alloc_flags, int classzone_idx)
{
return COMPACT_SKIPPED;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index edba18a..dba8891 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1069,9 +1069,9 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,

/* Compaction run is not finished if the watermark is not met */
watermark = low_wmark_pages(zone);
- watermark += (1 << cc->order);

- if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+ if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
+ cc->alloc_flags))
return COMPACT_CONTINUE;

/* Direct compactor: Is a suitable page free? */
@@ -1097,7 +1097,8 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
* COMPACT_PARTIAL - If the allocation would succeed without compaction
* COMPACT_CONTINUE - If compaction should run now
*/
-unsigned long compaction_suitable(struct zone *zone, int order)
+unsigned long compaction_suitable(struct zone *zone, int order,
+ int alloc_flags, int classzone_idx)
{
int fragindex;
unsigned long watermark;
@@ -1134,7 +1135,7 @@ unsigned long compaction_suitable(struct zone *zone, int order)
return COMPACT_SKIPPED;

if (fragindex == -1000 && zone_watermark_ok(zone, order, watermark,
- 0, 0))
+ classzone_idx, alloc_flags))
return COMPACT_PARTIAL;

return COMPACT_CONTINUE;
@@ -1148,7 +1149,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
const bool sync = cc->mode != MIGRATE_ASYNC;

- ret = compaction_suitable(zone, cc->order);
+ ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
+ cc->classzone_idx);
switch (ret) {
case COMPACT_PARTIAL:
case COMPACT_SKIPPED:
@@ -1237,7 +1239,8 @@ out:
}

static unsigned long compact_zone_order(struct zone *zone, int order,
- gfp_t gfp_mask, enum migrate_mode mode, int *contended)
+ gfp_t gfp_mask, enum migrate_mode mode, int *contended,
+ int alloc_flags, int classzone_idx)
{
unsigned long ret;
struct compact_control cc = {
@@ -1247,6 +1250,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
.gfp_mask = gfp_mask,
.zone = zone,
.mode = mode,
+ .alloc_flags = alloc_flags,
+ .classzone_idx = classzone_idx,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -1278,6 +1283,7 @@ int sysctl_extfrag_threshold = 500;
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, int *contended,
+ int alloc_flags, int classzone_idx,
struct zone **candidate_zone)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1286,7 +1292,6 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zoneref *z;
struct zone *zone;
int rc = COMPACT_DEFERRED;
- int alloc_flags = 0;
int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */

*contended = COMPACT_CONTENDED_NONE;
@@ -1295,10 +1300,6 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
if (!order || !may_enter_fs || !may_perform_io)
return COMPACT_SKIPPED;

-#ifdef CONFIG_CMA
- if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
- alloc_flags |= ALLOC_CMA;
-#endif
/* Compact each zone in the list */
for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
nodemask) {
@@ -1309,7 +1310,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
continue;

status = compact_zone_order(zone, order, gfp_mask, mode,
- &zone_contended);
+ &zone_contended, alloc_flags, classzone_idx);
rc = max(status, rc);
/*
* It takes at least one zone that wasn't lock contended
@@ -1318,8 +1319,8 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
all_zones_contended &= zone_contended;

/* If a normal allocation would succeed, stop compacting */
- if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
- alloc_flags)) {
+ if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
+ classzone_idx, alloc_flags)) {
*candidate_zone = zone;
/*
* We think the allocation will succeed in this zone,
diff --git a/mm/internal.h b/mm/internal.h
index 8293040..3cc9b0a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -143,6 +143,8 @@ struct compact_control {

int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
+ const int alloc_flags; /* alloc flags of a direct compactor */
+ const int classzone_idx; /* zone index of a direct compactor */
struct zone *zone;
int contended; /* Signal need_sched() or lock
* contention detected during
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e758159..8d143a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2339,6 +2339,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
compact_result = try_to_compact_pages(zonelist, order, gfp_mask,
nodemask, mode,
contended_compaction,
+ alloc_flags, classzone_idx,
&last_compact_zone);
current->flags &= ~PF_MEMALLOC;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcb4707..19ba76d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2249,7 +2249,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
return true;

/* If compaction would go ahead or the allocation would succeed, stop */
- switch (compaction_suitable(zone, sc->order)) {
+ switch (compaction_suitable(zone, sc->order, 0, 0)) {
case COMPACT_PARTIAL:
case COMPACT_CONTINUE:
return false;
@@ -2346,7 +2346,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
* If compaction is not ready to start and allocation is not likely
* to succeed without it, then keep reclaiming.
*/
- if (compaction_suitable(zone, order) == COMPACT_SKIPPED)
+ if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
return false;

return watermark_ok;
@@ -2824,8 +2824,8 @@ static bool zone_balanced(struct zone *zone, int order,
balance_gap, classzone_idx, 0))
return false;

- if (IS_ENABLED(CONFIG_COMPACTION) && order &&
- compaction_suitable(zone, order) == COMPACT_SKIPPED)
+ if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
+ order, 0, classzone_idx) == COMPACT_SKIPPED)
return false;

return true;
@@ -2952,8 +2952,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
* from memory. Do not reclaim more than needed for compaction.
*/
if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
- compaction_suitable(zone, sc->order) !=
- COMPACT_SKIPPED)
+ compaction_suitable(zone, sc->order, 0, classzone_idx)
+ != COMPACT_SKIPPED)
testorder = 0;

/*
--
1.8.4.5

2014-10-07 15:35:08

by Vlastimil Babka

Subject: [PATCH 4/5] mm, compaction: always update cached scanner positions

Compaction caches the migration and free scanner positions between compaction
invocations, so that the whole zone gets eventually scanned and there is no
bias towards the initial scanner positions at the beginning/end of the zone.

The cached positions are continuously updated as scanners progress and the
updating stops as soon as a page is successfully isolated. The reasoning
behind this is that a pageblock where isolation succeeded is likely to succeed
again in near future and it should be worth revisiting it.

However, the downside is that potentially many pages are rescanned without
successful isolation. At worst, there might be a page where isolation from LRU
succeeds but migration fails (potentially always). So upon encountering this
page, cached position would always stop being updated for no good reason.
It might have been useful to let such page be rescanned with sync compaction
after async one failed, but this is now handled by caching scanner position
for async and sync mode separately since commit 35979ef33931 ("mm, compaction:
add per-zone migration pfn cache for async compaction").
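
For reference, the per-mode cache mentioned above roughly looks like this (a
simplified sketch of the relevant struct zone fields and of how compact_zone()
picks its starting positions; not the exact code):

	/* in struct zone (simplified): one cached migration restart pfn per mode */
	unsigned long compact_cached_migrate_pfn[2];	/* [0] async, [1] sync */
	unsigned long compact_cached_free_pfn;

	/* in compact_zone() (simplified): resume scanning from the cached positions */
	const bool sync = cc->mode != MIGRATE_ASYNC;

	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
	cc->free_pfn = zone->compact_cached_free_pfn;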

After this patch, cached positions are updated unconditionally. In
stress-highalloc benchmark, this has decreased the numbers of scanned pages
by few percent, without affecting allocation success rates.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 14 --------------
mm/internal.h | 5 -----
2 files changed, 19 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9107588..8fa888d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -195,16 +195,12 @@ static void update_pageblock_skip(struct compact_control *cc,

/* Update where async and sync compaction should restart */
if (migrate_scanner) {
- if (cc->finished_update_migrate)
- return;
if (pfn > zone->compact_cached_migrate_pfn[0])
zone->compact_cached_migrate_pfn[0] = pfn;
if (cc->mode != MIGRATE_ASYNC &&
pfn > zone->compact_cached_migrate_pfn[1])
zone->compact_cached_migrate_pfn[1] = pfn;
} else {
- if (cc->finished_update_free)
- return;
if (pfn < zone->compact_cached_free_pfn)
zone->compact_cached_free_pfn = pfn;
}
@@ -705,7 +701,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
del_page_from_lru_list(page, lruvec, page_lru(page));

isolate_success:
- cc->finished_update_migrate = true;
list_add(&page->lru, migratelist);
cc->nr_migratepages++;
nr_isolated++;
@@ -876,15 +871,6 @@ static void isolate_freepages(struct compact_control *cc)
block_start_pfn - pageblock_nr_pages;

/*
- * Set a flag that we successfully isolated in this pageblock.
- * In the next loop iteration, zone->compact_cached_free_pfn
- * will not be updated and thus it will effectively contain the
- * highest pageblock we isolated pages from.
- */
- if (isolated)
- cc->finished_update_free = true;
-
- /*
* isolate_freepages_block() might have aborted due to async
* compaction being contended
*/
diff --git a/mm/internal.h b/mm/internal.h
index 3cc9b0a..4928beb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -136,11 +136,6 @@ struct compact_control {
unsigned long migrate_pfn; /* isolate_migratepages search base */
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
- bool finished_update_free; /* True when the zone cached pfns are
- * no longer being updated
- */
- bool finished_update_migrate;
-
int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
const int alloc_flags; /* alloc flags of a direct compactor */
--
1.8.4.5

2014-10-15 22:32:14

by Andrew Morton

Subject: Re: [PATCH 2/5] mm, compaction: simplify deferred compaction

On Tue, 7 Oct 2014 17:33:36 +0200 Vlastimil Babka <[email protected]> wrote:

> Since commit ("mm, compaction: defer each zone individually instead of
> preferred zone"), compaction is deferred for each zone where sync direct
> compaction fails, and reset where it succeeds. However, it was observed
> that for DMA zone compaction often appeared to succeed while subsequent
> allocation attempt would not, due to different outcome of watermark check.
> In order to properly defer compaction in this zone, the candidate zone has
> to be passed back to __alloc_pages_direct_compact() and compaction deferred
> in the zone after the allocation attempt fails.
>
> The large source of mismatch between watermark check in compaction and
> allocation was the lack of alloc_flags and classzone_idx values in compaction,
> which has been fixed in the previous patch. So with this problem fixed, we
> can simplify the code by removing the candidate_zone parameter and deferring
> in __alloc_pages_direct_compact().
>
> After this patch, the compaction activity during stress-highalloc benchmark is
> still somewhat increased, but it's negligible compared to the increase that
> occurred without the better watermark checking. This suggests that it is still
> possible to apparently succeed in compaction but fail to allocate, possibly
> due to parallel allocation activity.
>
> ...
>
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -33,8 +33,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> enum migrate_mode mode, int *contended,
> - int alloc_flags, int classzone_idx,
> - struct zone **candidate_zone);
> + int alloc_flags, int classzone_idx);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern unsigned long compaction_suitable(struct zone *zone, int order,
> @@ -105,8 +104,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> enum migrate_mode mode, int *contended,
> - int alloc_flags, int classzone_idx,
> - struct zone **candidate_zone)
> + int alloc_flags, int classzone_idx);
> {
> return COMPACT_CONTINUE;
> }

--- a/include/linux/compaction.h~mm-compaction-simplify-deferred-compaction-fix
+++ a/include/linux/compaction.h
@@ -104,7 +104,7 @@ static inline bool compaction_restarting
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, int *contended,
- int alloc_flags, int classzone_idx);
+ int alloc_flags, int classzone_idx)
{
return COMPACT_CONTINUE;
}

It clearly wasn't tested with this config. Please do so and let us
know the result?

2014-10-16 15:11:27

by Vlastimil Babka

Subject: Re: [PATCH 2/5] mm, compaction: simplify deferred compaction

On 10/16/2014 12:32 AM, Andrew Morton wrote:
> On Tue, 7 Oct 2014 17:33:36 +0200 Vlastimil Babka <[email protected]> wrote:
>> @@ -105,8 +104,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
>> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> int order, gfp_t gfp_mask, nodemask_t *nodemask,
>> enum migrate_mode mode, int *contended,
>> - int alloc_flags, int classzone_idx,
>> - struct zone **candidate_zone)
>> + int alloc_flags, int classzone_idx);
>> {
>> return COMPACT_CONTINUE;
>> }
>
> --- a/include/linux/compaction.h~mm-compaction-simplify-deferred-compaction-fix
> +++ a/include/linux/compaction.h
> @@ -104,7 +104,7 @@ static inline bool compaction_restarting
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> enum migrate_mode mode, int *contended,
> - int alloc_flags, int classzone_idx);
> + int alloc_flags, int classzone_idx)
> {
> return COMPACT_CONTINUE;
> }
>
> It clearly wasn't tested with this config. Please do so and let us
> know the result?

Sorry, I forgot. Hopefully this will get better next time, since I learned
about the undertaker/vampyr tool [1] today.

Your patch does fix the compilation, thanks. Boot+stress-highalloc tests
are now running through the series but I don't expect any surprises -
the series is basically a no-op with CONFIG_COMPACTION disabled.

[1] http://lwn.net/Articles/616098/

2014-10-20 15:19:10

by Rik van Riel

Subject: Re: [PATCH 3/5] mm, compaction: defer only on COMPACT_COMPLETE

On 10/07/2014 11:33 AM, Vlastimil Babka wrote:
> Deferred compaction is employed to avoid compacting zone where sync direct
> compaction has recently failed. As such, it makes sense to only defer when
> a full zone was scanned, which is when compact_zone returns with
> COMPACT_COMPLETE. It's less useful to defer when compact_zone returns with
> apparent success (COMPACT_PARTIAL), followed by a watermark check failure,
> which can happen due to parallel allocation activity. It also does not make
> much sense to defer compaction which was completely skipped (COMPACT_SKIPPED) for
> being unsuitable in the first place.
>
> This patch therefore makes deferred compaction trigger only when
> COMPACT_COMPLETE is returned from compact_zone(). Results of stress-highalloc
> benchmark show the difference is within measurement error, so the issue is
> rather cosmetic.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: Rik van Riel <[email protected]>

2014-10-20 15:27:19

by Rik van Riel

Subject: Re: [PATCH 4/5] mm, compaction: always update cached scanner positions

On 10/07/2014 11:33 AM, Vlastimil Babka wrote:
> Compaction caches the migration and free scanner positions between compaction
> invocations, so that the whole zone gets eventually scanned and there is no
> bias towards the initial scanner positions at the beginning/end of the zone.
>
> The cached positions are continuously updated as scanners progress and the
> updating stops as soon as a page is successfully isolated. The reasoning
> behind this is that a pageblock where isolation succeeded is likely to succeed
> again in near future and it should be worth revisiting it.
>
> However, the downside is that potentially many pages are rescanned without
> successful isolation. At worst, there might be a page where isolation from LRU
> succeeds but migration fails (potentially always). So upon encountering this
> page, cached position would always stop being updated for no good reason.
> It might have been useful to let such page be rescanned with sync compaction
> after async one failed, but this is now handled by caching scanner position
> for async and sync mode separately since commit 35979ef33931 ("mm, compaction:
> add per-zone migration pfn cache for async compaction").
>
> After this patch, cached positions are updated unconditionally. In
> stress-highalloc benchmark, this has decreased the numbers of scanned pages
> by few percent, without affecting allocation success rates.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: Rik van Riel <[email protected]>

2014-10-20 15:44:42

by Rik van Riel

Subject: Re: [PATCH 5/5] mm, compaction: more focused lru and pcplists draining

On 10/07/2014 11:33 AM, Vlastimil Babka wrote:
> The goal of memory compaction is to create high-order freepages through page
> migration. Page migration however puts pages on the per-cpu lru_add cache,
> which is later flushed to per-cpu pcplists, and only after pcplists are
> drained the pages can actually merge. This can happen due to the per-cpu
> caches becoming full through further freeing, or explicitly.
>
> During direct compaction, it is useful to do the draining explicitly so that
> pages merge as soon as possible and compaction can detect success immediately
> and keep the latency impact at minimum. However the current implementation is
> far from ideal. Draining is done only in __alloc_pages_direct_compact(),
> after all zones were already compacted, and the decisions to continue or stop
> compaction in individual zones were made without the last batch of migrations
> being merged. It is also missing the draining of lru_add cache before the
> pcplists.
>
> This patch moves the draining for direct compaction into compact_zone(). It
> adds the missing lru_cache draining and uses the newly introduced single zone
> pcplists draining to reduce overhead and avoid impact on unrelated zones.
> Draining is only performed when it can actually lead to merging of a page of
> desired order (passed by cc->order). This means it is only done when migration
> occurred in the previously scanned cc->order aligned block(s) and the
> migration scanner is now pointing to the next cc->order aligned block.


Acked-by: Rik van Riel <[email protected]>

2014-10-20 15:45:36

by Rik van Riel

Subject: Re: [PATCH 1/5] mm, compaction: pass classzone_idx and alloc_flags to watermark checking

On 10/07/2014 11:33 AM, Vlastimil Babka wrote:
> Compaction relies on zone watermark checks for decisions such as if it's worth
> to start compacting in compaction_suitable() or whether compaction should stop
> in compact_finished(). The watermark checks take classzone_idx and alloc_flags
> parameters, which are related to the memory allocation request. But from the
> context of compaction they are currently passed as 0, including the direct
> compaction which is invoked to satisfy the allocation request, and could
> therefore know the proper values.
>
> The lack of proper values can lead to mismatch between decisions taken during
> compaction and decisions related to the allocation request. Lack of proper
> classzone_idx value means that lowmem_reserve is not taken into account.
> This has manifested (during recent changes to deferred compaction) when DMA
> zone was used as fallback for preferred Normal zone. compaction_suitable()
> without proper classzone_idx would think that the watermarks are already
> satisfied, but watermark check in get_page_from_freelist() would fail. Because
> of this problem, deferring compaction has extra complexity that can be removed
> in the following patch.
>
> The issue (not confirmed in practice) with missing alloc_flags is opposite in
> nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or ALLOC_CMA in
> alloc_flags (the last includes all MOVABLE allocations on CMA-enabled systems)
> the watermark checking in compaction with 0 passed will be stricter than in
> get_page_from_freelist(). In these cases compaction might be running for a
> longer time than is really needed.
>
> This patch fixes these problems by adding alloc_flags and classzone_idx to
> struct compact_control and related functions involved in direct compaction and
> watermark checking. Where possible, all other callers of compaction_suitable()
> pass proper values where those are known. This is currently limited to
> classzone_idx, which is sometimes known in kswapd context. However, the direct
> reclaim callers should_continue_reclaim() and compaction_ready() do not
> currently know the proper values, so the coordination between reclaim and
> compaction may still not be as accurate as it could be. This can be fixed later,
> if it's shown to be an issue.
>
> The effect of this patch should be slightly better high-order allocation
> success rates and/or less compaction overhead, depending on the type of
> allocations and presence of CMA. It allows simplifying deferred compaction
> code in a followup patch.
>
> When testing with stress-highalloc, there was some slight improvement (which
> might be just due to variance) in success rates of non-THP-like allocations.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: Rik van Riel <[email protected]>

2014-10-27 06:45:39

by Joonsoo Kim

Subject: Re: [PATCH 1/5] mm, compaction: pass classzone_idx and alloc_flags to watermark checking

On Tue, Oct 07, 2014 at 05:33:35PM +0200, Vlastimil Babka wrote:
> Compaction relies on zone watermark checks for decisions such as if it's worth
> to start compacting in compaction_suitable() or whether compaction should stop
> in compact_finished(). The watermark checks take classzone_idx and alloc_flags
> parameters, which are related to the memory allocation request. But from the
> context of compaction they are currently passed as 0, including the direct
> compaction which is invoked to satisfy the allocation request, and could
> therefore know the proper values.
>
> The lack of proper values can lead to mismatch between decisions taken during
> compaction and decisions related to the allocation request. Lack of proper
> classzone_idx value means that lowmem_reserve is not taken into account.
> This has manifested (during recent changes to deferred compaction) when DMA
> zone was used as fallback for preferred Normal zone. compaction_suitable()
> without proper classzone_idx would think that the watermarks are already
> satisfied, but watermark check in get_page_from_freelist() would fail. Because
> of this problem, deferring compaction has extra complexity that can be removed
> in the following patch.
>
> The issue (not confirmed in practice) with missing alloc_flags is opposite in
> nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or ALLOC_CMA in
> alloc_flags (the last includes all MOVABLE allocations on CMA-enabled systems)
> the watermark checking in compaction with 0 passed will be stricter than in
> get_page_from_freelist(). In these cases compaction might be running for a
> longer time than is really needed.
>
> This patch fixes these problems by adding alloc_flags and classzone_idx to
> struct compact_control and related functions involved in direct compaction and
> watermark checking. Where possible, all other callers of compaction_suitable()
> pass proper values where those are known. This is currently limited to
> classzone_idx, which is sometimes known in kswapd context. However, the direct
> reclaim callers should_continue_reclaim() and compaction_ready() do not
> currently know the proper values, so the coordination between reclaim and
> compaction may still not be as accurate as it could be. This can be fixed later,
> if it's shown to be an issue.
>
> The effect of this patch should be slightly better high-order allocation
> success rates and/or less compaction overhead, depending on the type of
> allocations and presence of CMA. It allows simplifying deferred compaction
> code in a followup patch.
>
> When testing with stress-highalloc, there was some slight improvement (which
> might be just due to variance) in success rates of non-THP-like allocations.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> include/linux/compaction.h | 8 ++++++--
> mm/compaction.c | 29 +++++++++++++++--------------
> mm/internal.h | 2 ++
> mm/page_alloc.c | 1 +
> mm/vmscan.c | 12 ++++++------
> 5 files changed, 30 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 60bdf8d..d896765 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -33,10 +33,12 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> enum migrate_mode mode, int *contended,
> + int alloc_flags, int classzone_idx,
> struct zone **candidate_zone);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> -extern unsigned long compaction_suitable(struct zone *zone, int order);
> +extern unsigned long compaction_suitable(struct zone *zone, int order,
> + int alloc_flags, int classzone_idx);
>
> /* Do not skip compaction more than 64 times */
> #define COMPACT_MAX_DEFER_SHIFT 6
> @@ -103,6 +105,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> enum migrate_mode mode, int *contended,
> + int alloc_flags, int classzone_idx,
> struct zone **candidate_zone)
> {
> return COMPACT_CONTINUE;
> @@ -116,7 +119,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
> {
> }
>
> -static inline unsigned long compaction_suitable(struct zone *zone, int order)
> +static inline unsigned long compaction_suitable(struct zone *zone, int order,
> + int alloc_flags, int classzone_idx)
> {
> return COMPACT_SKIPPED;
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index edba18a..dba8891 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1069,9 +1069,9 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
>
> /* Compaction run is not finished if the watermark is not met */
> watermark = low_wmark_pages(zone);
> - watermark += (1 << cc->order);
>
> - if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> + if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
> + cc->alloc_flags))
> return COMPACT_CONTINUE;
>
> /* Direct compactor: Is a suitable page free? */
> @@ -1097,7 +1097,8 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
> * COMPACT_PARTIAL - If the allocation would succeed without compaction
> * COMPACT_CONTINUE - If compaction should run now
> */
> -unsigned long compaction_suitable(struct zone *zone, int order)
> +unsigned long compaction_suitable(struct zone *zone, int order,
> + int alloc_flags, int classzone_idx)
> {
> int fragindex;
> unsigned long watermark;
> @@ -1134,7 +1135,7 @@ unsigned long compaction_suitable(struct zone *zone, int order)
> return COMPACT_SKIPPED;
>
> if (fragindex == -1000 && zone_watermark_ok(zone, order, watermark,
> - 0, 0))
> + classzone_idx, alloc_flags))
> return COMPACT_PARTIAL;

Hello,

compaction_suitable() has one more zone_watermark_ok(). Why is it
unchanged?

Thanks.

2014-10-27 07:34:10

by Joonsoo Kim

Subject: Re: [PATCH 4/5] mm, compaction: always update cached scanner positions

On Tue, Oct 07, 2014 at 05:33:38PM +0200, Vlastimil Babka wrote:
> Compaction caches the migration and free scanner positions between compaction
> invocations, so that the whole zone gets eventually scanned and there is no
> bias towards the initial scanner positions at the beginning/end of the zone.
>
> The cached positions are continuously updated as scanners progress and the
> updating stops as soon as a page is successfully isolated. The reasoning
> behind this is that a pageblock where isolation succeeded is likely to succeed
> again in near future and it should be worth revisiting it.
>
> However, the downside is that potentially many pages are rescanned without
> successful isolation. At worst, there might be a page where isolation from LRU
> succeeds but migration fails (potentially always). So upon encountering this
> page, cached position would always stop being updated for no good reason.
> It might have been useful to let such page be rescanned with sync compaction
> after async one failed, but this is now handled by caching scanner position
> for async and sync mode separately since commit 35979ef33931 ("mm, compaction:
> add per-zone migration pfn cache for async compaction").

Hmm... I'm not sure that this patch is good thing.

In asynchronous compaction, compaction could be easily failed and
isolated freepages are returned to the buddy. In this case, next
asynchronous compaction would skip those returned freepages and
both scanners could meet prematurely.

And, I guess that pageblock skip feature effectively disable pageblock
rescanning if there is no freepage during rescan. This patch would
eliminate effect of pageblock skip feature.

IIUC, compaction logic assume that there are many temporary failure
conditions. Retrying from others would reduce effect of this temporary
failure so implementation looks as is.

If what we want is scanning each page once in each epoch, we can
implement compaction logic differently.

Please let me know if I'm missing something.

Thanks.

2014-10-27 07:39:59

by Joonsoo Kim

Subject: Re: [PATCH 5/5] mm, compaction: more focused lru and pcplists draining

On Tue, Oct 07, 2014 at 05:33:39PM +0200, Vlastimil Babka wrote:
> The goal of memory compaction is to create high-order freepages through page
> migration. Page migration however puts pages on the per-cpu lru_add cache,
> which is later flushed to per-cpu pcplists, and only after pcplists are
> drained the pages can actually merge. This can happen due to the per-cpu
> caches becoming full through further freeing, or explicitly.
>
> During direct compaction, it is useful to do the draining explicitly so that
> pages merge as soon as possible and compaction can detect success immediately
> and keep the latency impact at minimum. However the current implementation is
> far from ideal. Draining is done only in __alloc_pages_direct_compact(),
> after all zones were already compacted, and the decisions to continue or stop
> compaction in individual zones were made without the last batch of migrations
> being merged. It is also missing the draining of lru_add cache before the
> pcplists.
>
> This patch moves the draining for direct compaction into compact_zone(). It
> adds the missing lru_cache draining and uses the newly introduced single zone
> pcplists draining to reduce overhead and avoid impact on unrelated zones.
> Draining is only performed when it can actually lead to merging of a page of
> desired order (passed by cc->order). This means it is only done when migration
> occurred in the previously scanned cc->order aligned block(s) and the
> migration scanner is now pointing to the next cc->order aligned block.
>
> The patch has been tested with stress-highalloc benchmark from mmtests.
> Although overall allocation success rates of the benchmark were not affected,
> the number of detected compaction successes has doubled. This suggests that
> allocations were previously successful due to implicit merging caused by
> background activity, making a later allocation attempt succeed immediately,
> but not attributing the success to compaction. Since stress-highalloc always
> tries to allocate almost the whole memory, it cannot show the improvement in
> its reported success rate metric. However after this patch, compaction should
> detect success and terminate earlier, reducing the direct compaction latencies
> in a real scenario.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> mm/compaction.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> mm/page_alloc.c | 4 ----
> 2 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8fa888d..41b49d7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1179,6 +1179,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> while ((ret = compact_finished(zone, cc, migratetype)) ==
> COMPACT_CONTINUE) {
> int err;
> + unsigned long last_migrated_pfn = 0;

I think that this definition looks odd.
In every iteration, last_migrated_pfn is re-defined as 0.
Maybe it should be moved outside of the loop.

>
> switch (isolate_migratepages(zone, cc)) {
> case ISOLATE_ABORT:
> @@ -1187,7 +1188,12 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> cc->nr_migratepages = 0;
> goto out;
> case ISOLATE_NONE:
> - continue;
> + /*
> + * We haven't isolated and migrated anything, but
> + * there might still be unflushed migrations from
> + * previous cc->order aligned block.
> + */
> + goto check_drain;
> case ISOLATE_SUCCESS:
> ;
> }
> @@ -1212,6 +1218,39 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> goto out;
> }
> }
> +
> + /*
> + * Record where we have freed pages by migration and not yet
> + * flushed them to buddy allocator. Subtract 1, because often
> + * we finish a pageblock and migrate_pfn points to the first
> + * page* of the next one. In that case we want the drain below
> + * to happen immediately.
> + */
> + if (!last_migrated_pfn)
> + last_migrated_pfn = cc->migrate_pfn - 1;

And, I wonder why last_migrated_pfn is set after isolate_migratepages().

Thanks.

2014-10-27 09:11:38

by Vlastimil Babka

Subject: Re: [PATCH 1/5] mm, compaction: pass classzone_idx and alloc_flags to watermark checking

On 10/27/2014 07:46 AM, Joonsoo Kim wrote:
> On Tue, Oct 07, 2014 at 05:33:35PM +0200, Vlastimil Babka wrote:
>
> Hello,
>
> compaction_suitable() has one more zone_watermark_ok(). Why is it
> unchanged?

Hi,

it's a check of whether there are enough free pages to perform compaction at
all, i.e. enough free pages to serve as migration targets and as temporary
copies during migration. These allocations are not affected by the flags of the
process that makes the high-order allocation.
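
Roughly, the check in question looks like this (simplified; the exact form may
differ slightly between kernel versions):

	/*
	 * compaction_suitable(): order-0 check that the zone has enough free
	 * pages to host the isolated migration targets and temporary copies,
	 * deliberately keeping classzone_idx and alloc_flags at 0.
	 */
	watermark = low_wmark_pages(zone) + (2UL << order);
	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
		return COMPACT_SKIPPED;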

> Thanks.
>

2014-10-27 09:39:13

by Vlastimil Babka

Subject: Re: [PATCH 4/5] mm, compaction: always update cached scanner positions

On 10/27/2014 08:35 AM, Joonsoo Kim wrote:
> On Tue, Oct 07, 2014 at 05:33:38PM +0200, Vlastimil Babka wrote:
>> Compaction caches the migration and free scanner positions between compaction
>> invocations, so that the whole zone gets eventually scanned and there is no
>> bias towards the initial scanner positions at the beginning/end of the zone.
>>
>> The cached positions are continuously updated as scanners progress and the
>> updating stops as soon as a page is successfully isolated. The reasoning
>> behind this is that a pageblock where isolation succeeded is likely to succeed
>> again in near future and it should be worth revisiting it.
>>
>> However, the downside is that potentially many pages are rescanned without
>> successful isolation. At worst, there might be a page where isolation from LRU
>> succeeds but migration fails (potentially always). So upon encountering this
>> page, cached position would always stop being updated for no good reason.
>> It might have been useful to let such page be rescanned with sync compaction
>> after async one failed, but this is now handled by caching scanner position
>> for async and sync mode separately since commit 35979ef33931 ("mm, compaction:
>> add per-zone migration pfn cache for async compaction").
>
> Hmm... I'm not sure that this patch is good thing.
>
> In asynchronous compaction, compaction could be easily failed and
> isolated freepages are returned to the buddy. In this case, next
> asynchronous compaction would skip those returned freepages and
> both scanners could meet prematurely.

If migration fails, free pages now remain isolated until next migration
attempt, which should happen within the same compaction when it isolates
new migratepages - it won't fail completely just because of failed
migration. It might run out of time due to need_resched and then yeah,
some free pages might be skipped. That's some tradeoff but at least my
tests don't seem to show reduced success rates.
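
To illustrate: migrate_pages() gets its target pages from compaction via
compaction_alloc(), which pops them off cc->freepages, and whatever is not
consumed (e.g. because migration failed and the target page was put back)
simply stays on that list for the next migration attempt within the same
compact_zone() run. A simplified sketch (details may differ slightly from
the current tree):

	static struct page *compaction_alloc(struct page *migratepage,
						unsigned long data,
						int **result)
	{
		struct compact_control *cc = (struct compact_control *)data;
		struct page *freepage;

		/* Refill from the free scanner only when the list is empty */
		if (list_empty(&cc->freepages)) {
			if (!cc->contended)
				isolate_freepages(cc);
			if (list_empty(&cc->freepages))
				return NULL;
		}

		freepage = list_entry(cc->freepages.next, struct page, lru);
		list_del(&freepage->lru);
		cc->nr_freepages--;

		return freepage;
	}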

> And, I guess that pageblock skip feature effectively disable pageblock
> rescanning if there is no freepage during rescan.

If there's no freepage during rescan, then the cached free_pfn also
won't point to the pageblock anymore. Regardless of the pageblock skip bit
being set, there will not be a second rescan. But there will still be the
first rescan to determine there are no freepages.

> This patch would
> eliminate effect of pageblock skip feature.

I don't think so (as explained above). Also if free pages were isolated
(and then returned and skipped over), the pageblock should remain
without skip bit, so after scanners meet and positions reset (which
doesn't go hand in hand with skip bit reset), the next round will skip
over the blocks without freepages and find quickly the blocks where free
pages were skipped in the previous round.

> IIUC, compaction logic assume that there are many temporary failure
> conditions. Retrying from others would reduce effect of this temporary
> failure so implementation looks as is.

The implementation of pfn caching was written at a time when we did not
keep isolated free pages between migration attempts in a single
compaction run. And the idea of async compaction is to try with minimal
effort (thus latency), and if there's a failure, try somewhere else.
Making sure we don't skip anything doesn't seem productive.

> If what we want is scanning each page once in each epoch, we can
> implement compaction logic differently.

Well I'm open to suggestions :) Can't say the current set of heuristics
is straightforward to reason about.

> Please let me know if I'm missing something.
>
> Thanks.
>

2014-10-28 07:07:08

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm, compaction: always update cached scanner positions

On Mon, Oct 27, 2014 at 10:39:01AM +0100, Vlastimil Babka wrote:
> On 10/27/2014 08:35 AM, Joonsoo Kim wrote:
> > On Tue, Oct 07, 2014 at 05:33:38PM +0200, Vlastimil Babka wrote:
> > Hmm... I'm not sure that this patch is good thing.
> >
> > In asynchronous compaction, compaction could be easily failed and
> > isolated freepages are returned to the buddy. In this case, next
> > asynchronous compaction would skip those returned freepages and
> > both scanners could meet prematurely.
>
> If migration fails, free pages now remain isolated until next migration
> attempt, which should happen within the same compaction when it isolates
> new migratepages - it won't fail completely just because of failed
> migration. It might run out of time due to need_resched and then yeah,
> some free pages might be skipped. That's some tradeoff but at least my
> tests don't seem to show reduced success rates.

I was thinking of the latter one, the need_resched case.

Your test is a really high-order allocation test, so its success
rate wouldn't be affected by this skipping. But a different result could be
possible for mid-order allocations.

>
> > And, I guess that pageblock skip feature effectively disable pageblock
> > rescanning if there is no freepage during rescan.
>
> If there's no freepage during rescan, then the cached free_pfn also
> won't be pointed to the pageblock anymore. Regardless of pageblock skip
> being set, there will not be second rescan. But there will still be the
> first rescan to determine there are no freepages.

Yes, what I'd like to say is that these would work well. Just decreasing
the number of scanned pages by a few percent doesn't look like enough to
justify this patch to me, because there are already facilities to reduce
rescan overhead, and compaction is a fundamentally time-consuming process.
Moreover, failure of compaction could cause a serious system crash in some
cases.

> > This patch would
> > eliminate effect of pageblock skip feature.
>
> I don't think so (as explained above). Also if free pages were isolated
> (and then returned and skipped over), the pageblock should remain
> without skip bit, so after scanners meet and positions reset (which
> doesn't go hand in hand with skip bit reset), the next round will skip
> over the blocks without freepages and find quickly the blocks where free
> pages were skipped in the previous round.
>
> > IIUC, compaction logic assume that there are many temporary failure
> > conditions. Retrying from others would reduce effect of this temporary
> > failure so implementation looks as is.
>
> The implementation of pfn caching was written at time when we did not
> keep isolated free pages between migration attempts in a single
> compaction run. And the idea of async compaction is to try with minimal
> effort (thus latency), and if there's a failure, try somewhere else.
> Making sure we don't skip anything doesn't seem productive.

free_pfn is shared by async/sync compaction, and unconditional updating
causes sync compaction to stop prematurely, too.

And, if this patch makes the migrate/freepage scanners meet more frequently,
there is one problematic scenario.

compact_finished() doesn't check how much work we did. It just checks
whether both scanners have met. Even if we failed to allocate the high-order
page because little work was done, compaction would be deferred for later
users. This scenario wouldn't happen frequently if updating of the cached pfn
is limited, but this patch may enlarge the possibility of this problem.

This is another problem of the current logic, and should be fixed, but not
now.
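
For reference, the termination check in compact_finished() is roughly this
(simplified sketch, details may differ):

	/*
	 * Compaction run completes when the migrate and free scanners meet,
	 * regardless of how much work was actually done before that.
	 */
	if (cc->free_pfn <= cc->migrate_pfn) {
		/* Let the next compaction start anew. */
		zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
		zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
		zone->compact_cached_free_pfn = zone_end_pfn(zone);
		return COMPACT_COMPLETE;
	}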

Thanks.

2014-10-28 07:15:10

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm, compaction: pass classzone_idx and alloc_flags to watermark checking

On Mon, Oct 27, 2014 at 10:11:31AM +0100, Vlastimil Babka wrote:
> On 10/27/2014 07:46 AM, Joonsoo Kim wrote:
> > On Tue, Oct 07, 2014 at 05:33:35PM +0200, Vlastimil Babka wrote:
> >
> > Hello,
> >
> > compaction_suitable() has one more zone_watermark_ok(). Why is it
> > unchanged?
>
> Hi,
>
> it's a check whether there are enough free pages to perform compaction,
> which means enough migration targets and temporary copies during
> migration. These allocations are not affected by the flags of the
> process that makes the high-order allocation.

Hmm...

Checking whether there are enough free pages also needs the zone index and
alloc flags. What we need to ignore is just the order information, IMO.
If there are not enough free pages in that zone, compaction progress
doesn't have any meaning. The allocation will still fail due to a shortage
of free pages even after successful compaction.

I guess that __isolate_free_page() is also a good candidate for needing this
information, in order to prevent compaction from isolating too many
freepages in a low memory condition.
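
For reference, the watermark check in __isolate_free_page() is roughly this
(simplified sketch), and it also passes zeroes today:

	/* Obey watermarks as if the page was being allocated */
	watermark = low_wmark_pages(zone) + (1 << order);
	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
		return 0;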

Thanks.

2014-10-29 13:51:18

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm, compaction: pass classzone_idx and alloc_flags to watermark checking

On 10/28/2014 08:16 AM, Joonsoo Kim wrote:
> On Mon, Oct 27, 2014 at 10:11:31AM +0100, Vlastimil Babka wrote:
>> On 10/27/2014 07:46 AM, Joonsoo Kim wrote:
>>> On Tue, Oct 07, 2014 at 05:33:35PM +0200, Vlastimil Babka wrote:
>>>
>>> Hello,
>>>
>>> compaction_suitable() has one more zone_watermark_ok(). Why is it
>>> unchanged?
>>
>> Hi,
>>
>> it's a check whether there are enough free pages to perform compaction,
>> which means enough migration targets and temporary copies during
>> migration. These allocations are not affected by the flags of the
>> process that makes the high-order allocation.
>
> Hmm...
>
> To check whether enough free page is there or not needs zone index and
> alloc flag. What we need to ignore is just order information, IMO.
> If there is not enough free page in that zone, compaction progress
> doesn't have any meaning. It will fail due to shortage of free page
> after successful compaction.

I thought that the second check in compaction_suitable() makes sure of
this, but now I see it's in fact not.
But I'm not sure if we should just put the flags in the first check, as
IMHO the flags should only affect the final high-order allocation, not
also the temporary pages needed for migration?

BTW now I'm not even sure that the 2UL << order part makes sense
anymore. The number of pages migrated at once is always restricted by
COMPACT_CLUSTER_MAX, so why would we need more than that to cover migration?
Also the order of checks seems wrong. It should return COMPACT_PARTIAL
"If the allocation would succeed without compaction", but that can only
happen after passing the check that the zone has the extra 1UL << order
for migration. Do you agree?

> I guess that __isolate_free_page() is also good candidate to need this
> information in order to prevent compaction from isolating too many
> freepage in low memory condition.

I don't see how it would help here. It's temporary allocations for page
migration. How would passing classzone_idx and alloc_flags prevent
isolating too many?

> Thanks.
>

2014-10-31 07:48:18

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm, compaction: pass classzone_idx and alloc_flags to watermark checking

On Wed, Oct 29, 2014 at 02:51:11PM +0100, Vlastimil Babka wrote:
> On 10/28/2014 08:16 AM, Joonsoo Kim wrote:
> > On Mon, Oct 27, 2014 at 10:11:31AM +0100, Vlastimil Babka wrote:
> >> On 10/27/2014 07:46 AM, Joonsoo Kim wrote:
> >>> On Tue, Oct 07, 2014 at 05:33:35PM +0200, Vlastimil Babka wrote:
> >>>
> >>> Hello,
> >>>
> >>> compaction_suitable() has one more zone_watermark_ok(). Why is it
> >>> unchanged?
> >>
> >> Hi,
> >>
> >> it's a check whether there are enough free pages to perform compaction,
> >> which means enough migration targets and temporary copies during
> >> migration. These allocations are not affected by the flags of the
> >> process that makes the high-order allocation.
> >
> > Hmm...
> >
> > To check whether enough free page is there or not needs zone index and
> > alloc flag. What we need to ignore is just order information, IMO.
> > If there is not enough free page in that zone, compaction progress
> > doesn't have any meaning. It will fail due to shortage of free page
> > after successful compaction.
>
> I thought that the second check in compaction_suitable() makes sure
> of this, but now I see it's in fact not.
> But i'm not sure if we should just put the flags in the first check,
> as IMHO the flags should only affect the final high-order
> allocation, not also the temporary pages needed for migration?

I don't think so.
As mentioned before, if we don't have enough freepages, compaction
will fail at the final high-order watermark check due to a shortage of
freepages. Maybe it fails due to not enough freepages rather than a lack
of ordered freepages. Proper flags and index let us avoid useless compaction,
so I prefer to put the flags in the first check.

>
> BTW now I'm not even sure that the 2UL << order part makes sense
> anymore. The number of pages migrated at once is always restricted
> by COMPACT_CLUSTER_MAX, so why would we need more than that to cover
> migration?

In fact, any value seems to be wrong. We can isolate a high-order freepage
for this temporary use. I don't have any idea what the proper value is.

> Also the order of checks seems wrong. It should return
> COMPACT_PARTIAL "If the allocation would succeed without compaction"
> but that only can happen after passing the check if the zone has the
> extra 1UL << order for migration. Do you agree?

Yes, agree!

> > I guess that __isolate_free_page() is also good candidate to need this
> > information in order to prevent compaction from isolating too many
> > freepage in low memory condition.
>
> I don't see how it would help here. It's temporary allocations for
> page migration. How would passing classzone_idx and alloc_flags
> prevent isolating too many?

It is a temporary allocation, but anyway it could hold many freepages
for some duration. As mentioned above, if we isolate a high-order freepage,
we can hold 1MB or more of freepages at once. I guess that passing the flags
helps system stability.

Thanks.

2014-10-31 15:53:56

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm, compaction: always update cached scanner positions

On 10/28/2014 08:08 AM, Joonsoo Kim wrote:
>>
>>> And, I guess that pageblock skip feature effectively disable pageblock
>>> rescanning if there is no freepage during rescan.
>>
>> If there's no freepage during rescan, then the cached free_pfn also
>> won't be pointed to the pageblock anymore. Regardless of pageblock skip
>> being set, there will not be second rescan. But there will still be the
>> first rescan to determine there are no freepages.
>
> Yes, What I'd like to say is that these would work well. Just decreasing
> few percent of scanning page doesn't look good to me to validate this
> patch, because there is some facilities to reduce rescan overhead and

The mechanisms have a tradeoff, while this patch didn't seem to have
negative consequences.

> compaction is fundamentally time-consuming process. Moreover, failure of
> compaction could cause serious system crash in some cases.

Relying on a successful high-order allocation for not crashing is
dangerous; success is never guaranteed. Such a critical allocation should
try harder rather than fail due to a single compaction attempt. With this
argument you could aim to remove all the overhead-reducing heuristics.

>>> This patch would
>>> eliminate effect of pageblock skip feature.
>>
>> I don't think so (as explained above). Also if free pages were isolated
>> (and then returned and skipped over), the pageblock should remain
>> without skip bit, so after scanners meet and positions reset (which
>> doesn't go hand in hand with skip bit reset), the next round will skip
>> over the blocks without freepages and find quickly the blocks where free
>> pages were skipped in the previous round.
>>
>>> IIUC, compaction logic assume that there are many temporary failure
>>> conditions. Retrying from others would reduce effect of this temporary
>>> failure so implementation looks as is.
>>
>> The implementation of pfn caching was written at time when we did not
>> keep isolated free pages between migration attempts in a single
>> compaction run. And the idea of async compaction is to try with minimal
>> effort (thus latency), and if there's a failure, try somewhere else.
>> Making sure we don't skip anything doesn't seem productive.
>
> free_pfn is shared by async/sync compaction and unconditional updating
> causes sync compaction to stop prematurely, too.
>
> And, if this patch makes migrate/freepage scanner meet more frequently,
> there is one problematic scenario.

OK, so you don't find a problem with how this patch changes migration
scanner caching, just the free scanner, right?
So how about making release_freepages() return the highest freepage pfn
it encountered (could perhaps do without comparing individual pfn's, the
list should be ordered so it could be just the pfn of first or last page
in the list, but need to check that) and updating cached free pfn with
that? That should ensure rescanning only when needed.

> compact_finished() doesn't check how many work we did. It just check
> if both scanners meet. Even if we failed to allocate high order page
> due to little work, compaction would be deffered for later user.
> This scenario wouldn't happen frequently if updating cached pfn is
> limited. But, this patch may enlarge the possibility of this problem.

I doubt it changes the possibility substantially, but nevermind.

> This is another problem of current logic, and, should be fixed, but,
> there is now.

If something needs the high-order allocation to succeed that badly, then
the proper GFP flags should result in further reclaim and compaction
attempts (hopefully) and not give up after the first sync compaction failure.
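
(Purely illustrative, not from this series: such a caller would do something
like the following, so the page allocator itself keeps retrying reclaim and
compaction rather than depending on one direct compaction pass.)

	/* hypothetical caller that must try much harder */
	struct page *page = alloc_pages(GFP_KERNEL | __GFP_REPEAT, order);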

> Thanks.
>

2014-11-03 08:12:45

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm, compaction: more focused lru and pcplists draining

On 10/27/2014 08:41 AM, Joonsoo Kim wrote:
> On Tue, Oct 07, 2014 at 05:33:39PM +0200, Vlastimil Babka wrote:
>> The goal of memory compaction is to create high-order freepages through page
>> migration. Page migration however puts pages on the per-cpu lru_add cache,
>> which is later flushed to per-cpu pcplists, and only after pcplists are
>> drained the pages can actually merge. This can happen due to the per-cpu
>> caches becoming full through further freeing, or explicitly.
>>
>> During direct compaction, it is useful to do the draining explicitly so that
>> pages merge as soon as possible and compaction can detect success immediately
>> and keep the latency impact at minimum. However the current implementation is
>> far from ideal. Draining is done only in __alloc_pages_direct_compact(),
>> after all zones were already compacted, and the decisions to continue or stop
>> compaction in individual zones was done without the last batch of migrations
>> being merged. It is also missing the draining of lru_add cache before the
>> pcplists.
>>
>> This patch moves the draining for direct compaction into compact_zone(). It
>> adds the missing lru_cache draining and uses the newly introduced single zone
>> pcplists draining to reduce overhead and avoid impact on unrelated zones.
>> Draining is only performed when it can actually lead to merging of a page of
>> desired order (passed by cc->order). This means it is only done when migration
>> occurred in the previously scanned cc->order aligned block(s) and the
>> migration scanner is now pointing to the next cc->order aligned block.
>>
>> The patch has been tested with stress-highalloc benchmark from mmtests.
>> Although overal allocation success rates of the benchmark were not affected,
>> the number of detected compaction successes has doubled. This suggests that
>> allocations were previously successful due to implicit merging caused by
>> background activity, making a later allocation attempt succeed immediately,
>> but not attributing the success to compaction. Since stress-highalloc always
>> tries to allocate almost the whole memory, it cannot show the improvement in
>> its reported success rate metric. However after this patch, compaction should
>> detect success and terminate earlier, reducing the direct compaction latencies
>> in a real scenario.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Joonsoo Kim <[email protected]>
>> Cc: Michal Nazarewicz <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> ---
>> mm/compaction.c | 41 ++++++++++++++++++++++++++++++++++++++++-
>> mm/page_alloc.c | 4 ----
>> 2 files changed, 40 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 8fa888d..41b49d7 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1179,6 +1179,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>> while ((ret = compact_finished(zone, cc, migratetype)) ==
>> COMPACT_CONTINUE) {
>> int err;
>> + unsigned long last_migrated_pfn = 0;
>
> I think that this definition looks odd.
> In every iteration, last_migrated_pfn is re-defined as 0.
> Maybe, it is on outside of the loop.

Oops you're right, that's a mistake and it makes the code miss some of
the drain points (a minority I think but anyway).

>>
>> switch (isolate_migratepages(zone, cc)) {
>> case ISOLATE_ABORT:
>> @@ -1187,7 +1188,12 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>> cc->nr_migratepages = 0;
>> goto out;
>> case ISOLATE_NONE:
>> - continue;
>> + /*
>> + * We haven't isolated and migrated anything, but
>> + * there might still be unflushed migrations from
>> + * previous cc->order aligned block.
>> + */
>> + goto check_drain;
>> case ISOLATE_SUCCESS:
>> ;
>> }
>> @@ -1212,6 +1218,39 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>> goto out;
>> }
>> }
>> +
>> + /*
>> + * Record where we have freed pages by migration and not yet
>> + * flushed them to buddy allocator. Subtract 1, because often
>> + * we finish a pageblock and migrate_pfn points to the first
>> + * page* of the next one. In that case we want the drain below
>> + * to happen immediately.
>> + */
>> + if (!last_migrated_pfn)
>> + last_migrated_pfn = cc->migrate_pfn - 1;
>
> And, I wonder why last_migrated_pfn is set after isolate_migratepages().

Not sure I understand your question. With the mistake above, it cannot
currently be set at the point isolate_migratepages() is called, so you
might question the goto check_drain in the ISOLATE_NONE case, if that's
what you are wondering about.

When I correct that, it might be set when COMPACT_CLUSTER_MAX pages are
isolated and migrated in the middle of a pageblock, and then the rest of
the pageblock contains no pages that could be isolated, so the last
isolate_migratepages() attempt in the pageblock returns with
ISOLATE_NONE. Still, there were some migrations that produced free pages
that should be drained at that point.
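
To make the intended behaviour concrete, the check_drain part at the end of
the loop is meant to look roughly like this (sketch, modulo the fix for the
misplaced declaration discussed above):

	check_drain:
		if (cc->order > 0 && last_migrated_pfn) {
			int cpu;
			unsigned long current_block_start =
				cc->migrate_pfn & ~((1UL << cc->order) - 1);

			/* Did migration move past the cc->order aligned block? */
			if (last_migrated_pfn < current_block_start) {
				cpu = get_cpu();
				lru_add_drain_cpu(cpu);
				drain_local_pages(zone);
				put_cpu();
				/* No more flushing until we migrate again */
				last_migrated_pfn = 0;
			}
		}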

>
> Thanks.
>

2014-11-04 00:27:14

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm, compaction: always update cached scanner positions

On Fri, Oct 31, 2014 at 04:53:44PM +0100, Vlastimil Babka wrote:
> On 10/28/2014 08:08 AM, Joonsoo Kim wrote:
> >>
> >>>And, I guess that pageblock skip feature effectively disable pageblock
> >>>rescanning if there is no freepage during rescan.
> >>
> >>If there's no freepage during rescan, then the cached free_pfn also
> >>won't be pointed to the pageblock anymore. Regardless of pageblock skip
> >>being set, there will not be second rescan. But there will still be the
> >>first rescan to determine there are no freepages.
> >
> >Yes, What I'd like to say is that these would work well. Just decreasing
> >few percent of scanning page doesn't look good to me to validate this
> >patch, because there is some facilities to reduce rescan overhead and
>
> The mechanisms have a tradeoff, while this patch didn't seem to have
> negative consequences.
>
> >compaction is fundamentally time-consuming process. Moreover, failure of
> >compaction could cause serious system crash in some cases.
>
> Relying on successful high-order allocation for not crashing is
> dangerous, success is never guaranteed. Such critical allocation
> should try harder than fail due to a single compaction attempt. With
> this argument you could aim to remove all the overhead reducing
> heuristics.
>
> >>>This patch would
> >>>eliminate effect of pageblock skip feature.
> >>
> >>I don't think so (as explained above). Also if free pages were isolated
> >>(and then returned and skipped over), the pageblock should remain
> >>without skip bit, so after scanners meet and positions reset (which
> >>doesn't go hand in hand with skip bit reset), the next round will skip
> >>over the blocks without freepages and find quickly the blocks where free
> >>pages were skipped in the previous round.
> >>
> >>>IIUC, compaction logic assume that there are many temporary failure
> >>>conditions. Retrying from others would reduce effect of this temporary
> >>>failure so implementation looks as is.
> >>
> >>The implementation of pfn caching was written at time when we did not
> >>keep isolated free pages between migration attempts in a single
> >>compaction run. And the idea of async compaction is to try with minimal
> >>effort (thus latency), and if there's a failure, try somewhere else.
> >>Making sure we don't skip anything doesn't seem productive.
> >
> >free_pfn is shared by async/sync compaction and unconditional updating
> >causes sync compaction to stop prematurely, too.
> >
> >And, if this patch makes migrate/freepage scanner meet more frequently,
> >there is one problematic scenario.
>
> OK, so you don't find a problem with how this patch changes
> migration scanner caching, just the free scanner, right?
> So how about making release_freepages() return the highest freepage
> pfn it encountered (could perhaps do without comparing individual
> pfn's, the list should be ordered so it could be just the pfn of
> first or last page in the list, but need to check that) and updating
> cached free pfn with that? That should ensure rescanning only when
> needed.

Hello,

Updating cached free pfn in release_freepages() looks good to me.

In fact, I guess that migration scanner also has similar problems, but,
it's just my guess. I admit your following arguments in patch description.

However, the downside is that potentially many pages are rescanned without
successful isolation. At worst, there might be a page where isolation from LRU
succeeds but migration fails (potentially always).

So, I'm okay if you update cached free pfn in release_freepages().

Thanks.

2014-11-04 00:35:57

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm, compaction: more focused lru and pcplists draining

On Mon, Nov 03, 2014 at 09:12:33AM +0100, Vlastimil Babka wrote:
> On 10/27/2014 08:41 AM, Joonsoo Kim wrote:
> >On Tue, Oct 07, 2014 at 05:33:39PM +0200, Vlastimil Babka wrote:
> >>The goal of memory compaction is to create high-order freepages through page
> >>migration. Page migration however puts pages on the per-cpu lru_add cache,
> >>which is later flushed to per-cpu pcplists, and only after pcplists are
> >>drained the pages can actually merge. This can happen due to the per-cpu
> >>caches becoming full through further freeing, or explicitly.
> >>
> >>During direct compaction, it is useful to do the draining explicitly so that
> >>pages merge as soon as possible and compaction can detect success immediately
> >>and keep the latency impact at minimum. However the current implementation is
> >>far from ideal. Draining is done only in __alloc_pages_direct_compact(),
> >>after all zones were already compacted, and the decisions to continue or stop
> >>compaction in individual zones was done without the last batch of migrations
> >>being merged. It is also missing the draining of lru_add cache before the
> >>pcplists.
> >>
> >>This patch moves the draining for direct compaction into compact_zone(). It
> >>adds the missing lru_cache draining and uses the newly introduced single zone
> >>pcplists draining to reduce overhead and avoid impact on unrelated zones.
> >>Draining is only performed when it can actually lead to merging of a page of
> >>desired order (passed by cc->order). This means it is only done when migration
> >>occurred in the previously scanned cc->order aligned block(s) and the
> >>migration scanner is now pointing to the next cc->order aligned block.
> >>
> >>The patch has been tested with stress-highalloc benchmark from mmtests.
> >>Although overal allocation success rates of the benchmark were not affected,
> >>the number of detected compaction successes has doubled. This suggests that
> >>allocations were previously successful due to implicit merging caused by
> >>background activity, making a later allocation attempt succeed immediately,
> >>but not attributing the success to compaction. Since stress-highalloc always
> >>tries to allocate almost the whole memory, it cannot show the improvement in
> >>its reported success rate metric. However after this patch, compaction should
> >>detect success and terminate earlier, reducing the direct compaction latencies
> >>in a real scenario.
> >>
> >>Signed-off-by: Vlastimil Babka <[email protected]>
> >>Cc: Minchan Kim <[email protected]>
> >>Cc: Mel Gorman <[email protected]>
> >>Cc: Joonsoo Kim <[email protected]>
> >>Cc: Michal Nazarewicz <[email protected]>
> >>Cc: Naoya Horiguchi <[email protected]>
> >>Cc: Christoph Lameter <[email protected]>
> >>Cc: Rik van Riel <[email protected]>
> >>Cc: David Rientjes <[email protected]>
> >>---
> >> mm/compaction.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> >> mm/page_alloc.c | 4 ----
> >> 2 files changed, 40 insertions(+), 5 deletions(-)
> >>
> >>diff --git a/mm/compaction.c b/mm/compaction.c
> >>index 8fa888d..41b49d7 100644
> >>--- a/mm/compaction.c
> >>+++ b/mm/compaction.c
> >>@@ -1179,6 +1179,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >> while ((ret = compact_finished(zone, cc, migratetype)) ==
> >> COMPACT_CONTINUE) {
> >> int err;
> >>+ unsigned long last_migrated_pfn = 0;
> >
> >I think that this definition looks odd.
> >In every iteration, last_migrated_pfn is re-defined as 0.
> >Maybe, it is on outside of the loop.
>
> Oops you're right, that's a mistake and it makes the code miss some
> of the drain points (a minority I think but anyway).
>
> >>
> >> switch (isolate_migratepages(zone, cc)) {
> >> case ISOLATE_ABORT:
> >>@@ -1187,7 +1188,12 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >> cc->nr_migratepages = 0;
> >> goto out;
> >> case ISOLATE_NONE:
> >>- continue;
> >>+ /*
> >>+ * We haven't isolated and migrated anything, but
> >>+ * there might still be unflushed migrations from
> >>+ * previous cc->order aligned block.
> >>+ */
> >>+ goto check_drain;
> >> case ISOLATE_SUCCESS:
> >> ;
> >> }
> >>@@ -1212,6 +1218,39 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >> goto out;
> >> }
> >> }
> >>+
> >>+ /*
> >>+ * Record where we have freed pages by migration and not yet
> >>+ * flushed them to buddy allocator. Subtract 1, because often
> >>+ * we finish a pageblock and migrate_pfn points to the first
> >>+ * page* of the next one. In that case we want the drain below
> >>+ * to happen immediately.
> >>+ */
> >>+ if (!last_migrated_pfn)
> >>+ last_migrated_pfn = cc->migrate_pfn - 1;
> >
> >And, I wonder why last_migrated_pfn is set after isolate_migratepages().
>
> Not sure I understand your question. With the mistake above, it
> cannot currently be set at the point isolate_migratepages() is
> called, so you might question the goto check_drain in the
> ISOLATE_NONE case, if that's what you are wondering about.
>
> When I correct that, it might be set when COMPACT_CLUSTER_MAX pages
> are isolated and migrated the middle of a pageblock, and then the
> rest of the pageblock contains no pages that could be isolated, so
> the last isolate_migratepages() attempt in the pageblock returns
> with ISOLATE_NONE. Still there were some migrations that produced
> free pages that should be drained at that point.

To clarify my question, I attach pseudo code of what I thought was correct.

static int compact_zone()
{
	unsigned long last_migrated_pfn = 0;

	...

	compaction_suitable();

	...

	while (compact_finished()) {
		if (!last_migrated_pfn)
			last_migrated_pfn = cc->migrate_pfn - 1;

		isolate_migratepages();
		switch case
			migrate_pages();
		...

check_drain: (at the end of loop)
		do flush and reset last_migrated_pfn if needed
	}
}

We should record last_migrated_pfn before isolate_migratepages() and
then compare it with cc->migrate_pfn after isolate_migratepages() to
know if we moved away from the previous cc->order aligned block.
Am I missing something?

Thanks.

2014-11-13 12:47:14

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm, compaction: more focused lru and pcplists draining

On 11/04/2014 01:37 AM, Joonsoo Kim wrote:
> On Mon, Nov 03, 2014 at 09:12:33AM +0100, Vlastimil Babka wrote:
>> On 10/27/2014 08:41 AM, Joonsoo Kim wrote:
>>> On Tue, Oct 07, 2014 at 05:33:39PM +0200, Vlastimil Babka wrote:
>>>
>>> And, I wonder why last_migrated_pfn is set after isolate_migratepages().
>>
>> Not sure I understand your question. With the mistake above, it
>> cannot currently be set at the point isolate_migratepages() is
>> called, so you might question the goto check_drain in the
>> ISOLATE_NONE case, if that's what you are wondering about.
>>
>> When I correct that, it might be set when COMPACT_CLUSTER_MAX pages
>> are isolated and migrated the middle of a pageblock, and then the
>> rest of the pageblock contains no pages that could be isolated, so
>> the last isolate_migratepages() attempt in the pageblock returns
>> with ISOLATE_NONE. Still there were some migrations that produced
>> free pages that should be drained at that point.
>
> To clarify my question, I attach psuedo code that I thought correct.

Sorry for the late reply.

> static int compact_zone()
> {
> unsigned long last_migrated_pfn = 0;
>
> ...
>
> compaction_suitable();
>
> ...
>
> while (compact_finished()) {
> if (!last_migrated_pfn)
> last_migrated_pfn = cc->migrate_pfn - 1;
>
> isolate_migratepages();
> switch case
> migrate_pages();
> ...
>
> check_drain: (at the end of loop)
> do flush and reset last_migrated_pfn if needed
> }
> }
>
> We should record last_migrated_pfn before isolate_migratepages() and
> then compare it with cc->migrate_pfn after isolate_migratepages() to
> know if we moved away from the previous cc->order aligned block.
> Am I missing something?

What about this scenario, with pageblock order:

- record cc->migrate_pfn pointing to pageblock X
- isolate_migratepages() skips the pageblock due to e.g. skip bit, or
the pageblock being a THP already...
- loop to pageblock X+1, last_migrated_pfn is still set to pfn of
pageblock X (more precisely the pfn is (X << pageblock_order) - 1 per
your code, but doesn't matter)
- isolate_migratepages isolates something, but ends up somewhere in the
middle of pageblock due to COMPACT_CLUSTER_MAX
- cc->migrate_pfn points to pageblock X+1 (plus some pages it scanned)
- so it will decide that it has fully migrated pageblock X and it's time
to drain. But the drain is most likely useless - we didn't migrate
anything in pageblock X, we skipped it. And in X+1 we didn't migrate
everything yet, so we should drain only after finishing the other part
of the pageblock.

In short, "last_migrated_pfn" is not "last position of migrate scanner"
but "last block where we *actually* migrated".

I think if you would try to fix the scenario above, you would end up
with something like my patch :)

Vlastimil

> Thanks.
>

2014-11-14 07:02:44

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm, compaction: more focused lru and pcplists draining

On Thu, Nov 13, 2014 at 01:47:08PM +0100, Vlastimil Babka wrote:
> On 11/04/2014 01:37 AM, Joonsoo Kim wrote:
> >On Mon, Nov 03, 2014 at 09:12:33AM +0100, Vlastimil Babka wrote:
> >>On 10/27/2014 08:41 AM, Joonsoo Kim wrote:
> >>>On Tue, Oct 07, 2014 at 05:33:39PM +0200, Vlastimil Babka wrote:
> >>>
> >>>And, I wonder why last_migrated_pfn is set after isolate_migratepages().
> >>
> >>Not sure I understand your question. With the mistake above, it
> >>cannot currently be set at the point isolate_migratepages() is
> >>called, so you might question the goto check_drain in the
> >>ISOLATE_NONE case, if that's what you are wondering about.
> >>
> >>When I correct that, it might be set when COMPACT_CLUSTER_MAX pages
> >>are isolated and migrated the middle of a pageblock, and then the
> >>rest of the pageblock contains no pages that could be isolated, so
> >>the last isolate_migratepages() attempt in the pageblock returns
> >>with ISOLATE_NONE. Still there were some migrations that produced
> >>free pages that should be drained at that point.
> >
> >To clarify my question, I attach psuedo code that I thought correct.
>
> Sorry for the late reply.
>
> >static int compact_zone()
> >{
> > unsigned long last_migrated_pfn = 0;
> >
> > ...
> >
> > compaction_suitable();
> >
> > ...
> >
> > while (compact_finished()) {
> > if (!last_migrated_pfn)
> > last_migrated_pfn = cc->migrate_pfn - 1;
> >
> > isolate_migratepages();
> > switch case
> > migrate_pages();
> > ...
> >
> > check_drain: (at the end of loop)
> > do flush and reset last_migrated_pfn if needed
> > }
> >}
> >
> >We should record last_migrated_pfn before isolate_migratepages() and
> >then compare it with cc->migrate_pfn after isolate_migratepages() to
> >know if we moved away from the previous cc->order aligned block.
> >Am I missing something?
>
> What about this scenario, with pageblock order:
>
> - record cc->migrate_pfn pointing to pageblock X
> - isolate_migratepages() skips the pageblock due to e.g. skip bit,
> or the pageblock being a THP already...
> - loop to pageblock X+1, last_migrated_pfn is still set to pfn of
> pageblock X (more precisely the pfn is (X << pageblock_order) - 1
> per your code, but doesn't matter)
> - isolate_migratepages isolates something, but ends up somewhere in
> the middle of pageblock due to COMPACT_CLUSTER_MAX
> - cc->migrate_pfn points to pageblock X+1 (plus some pages it scanned)
> - so it will decide that it has fully migrated pageblock X and it's
> time to drain. But the drain is most likely useless - we didn't
> migrate anything in pageblock X, we skipped it. And in X+1 we didn't
> migrate everything yet, so we should drain only after finishing the
> other part of the pageblock.

Yes, but, it can be easily fixed.

while (compact_finished()) {
	unsigned long prev_migrate_pfn = cc->migrate_pfn;

	isolate_migratepages()
	switch case {
	NONE:
		goto check_drain;
	SUCCESS:
		if (!last_migrated_pfn)
			last_migrated_pfn = prev_migrate_pfn;
	}

	...

check_drain: (at the end of loop)
	...
}

> In short, "last_migrated_pfn" is not "last position of migrate
> scanner" but "last block where we *actually* migrated".

Okay. Now I get it.
Nevertheless, I'd like to change the logic as above.

One problem with your approach is that it can't detect some cases.

Let's think about the following case.
'|' denotes an aligned block boundary.
'^' denotes migrate_pfn at a certain time.

Assume that last_migrated_pfn = 0;

|--------------|-------------|--------------|
       ^               ^
before isolate   after isolate

In this case, your code just records the position of the second '^' in
last_migrated_pfn and skips the flush. But a flush is needed if we
migrated some pages, because we have moved away from the previous aligned
block.

Thanks.

2014-11-14 08:52:38

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm, compaction: pass classzone_idx and alloc_flags to watermark checking

On 10/31/2014 08:49 AM, Joonsoo Kim wrote:
> On Wed, Oct 29, 2014 at 02:51:11PM +0100, Vlastimil Babka wrote:
>> On 10/28/2014 08:16 AM, Joonsoo Kim wrote:
>> > On Mon, Oct 27, 2014 at 10:11:31AM +0100, Vlastimil Babka wrote:
>>
>> I thought that the second check in compaction_suitable() makes sure
>> of this, but now I see it's in fact not.
>> But i'm not sure if we should just put the flags in the first check,
>> as IMHO the flags should only affect the final high-order
>> allocation, not also the temporary pages needed for migration?
>
> I don't think so.
> As mentioned before, if we don't have not enough freepages, compaction
> will fail due to shortage of freepage at final high-order watermark
> check. Maybe it failes due to not enough freepage rather than ordered
> freepage. Proper flags and index make us avoid useless compaction so
> I prefer put the flags in the first check.
>
>>
>> BTW now I'm not even sure that the 2UL << order part makes sense
>> anymore. The number of pages migrated at once is always restricted
>> by COMPACT_CLUSTER_MAX, so why would we need more than that to cover
>> migration?
>
> In fact, any values seems to be wrong. We can isolate high order freepage
> for this temporary use. I don't have any idea what the proper value is.
>
>> Also the order of checks seems wrong. It should return
>> COMPACT_PARTIAL "If the allocation would succeed without compaction"
>> but that only can happen after passing the check if the zone has the
>> extra 1UL << order for migration. Do you agree?
>
> Yes, agree!
>
>>> I guess that __isolate_free_page() is also good candidate to need this
>>> information in order to prevent compaction from isolating too many
>>> freepage in low memory condition.
>>
>> I don't see how it would help here. It's temporary allocations for
>> page migration. How would passing classzone_idx and alloc_flags
>> prevent isolating too many?
>
> It is temporary allocation, but, anyway, it could holds many freepage
> in some duration. As mentioned above, if we isolate high order freepage,
> we can hold 1MB or more freepage at once. I guess that passing flags helps
> system stability.

OK, here's a patch-fix to address everything discussed above. My testing
didn't show much difference, but I know it's limited anyway.

------8<------
From: Vlastimil Babka <[email protected]>
Date: Mon, 3 Nov 2014 15:26:40 +0100
Subject: mm-compaction-pass-classzone_idx-and-alloc_flags-to-watermark-checking-fix

This patch-fix changes zone watermark checking in compaction_suitable()
as discussed with Joonsoo Kim. First, it moves up the watermark check that
answers the question "does the zone need compaction at all?". Before this
change, that check is preceded by a check answering the question "does the
zone have enough free pages to succeed compaction". So it might happen that
there is already a high-order page available, but not enough pages for
performing the compaction (which assumes extra pages for the migrations).
Before the patch, compaction_suitable() would return COMPACT_SKIPPED, which
means "compaction cannot succeed, reclaim more"; after this change it returns
COMPACT_PARTIAL, which means "compaction not needed, try allocating".

Second, the check for answering "can the compaction succeed?" now also
includes classzone_idx and alloc_flags parameters. This prevents starting
compaction that would not lead to successful allocation due to not having
enough free pages. The addition of extra pages for migration is left in the
check. Although these are temporary allocations and thus should not be
affected by alloc_flags and classzone_idx, it only matters when we are close
to the watermark, where it does not hurt to be a bit pessimistic.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>

---

When squashed, full changelog of the patch+fix should be:

Compaction relies on zone watermark checks for decisions such as if it's
worth to start compacting in compaction_suitable() or whether compaction
should stop in compact_finished(). The watermark checks take
classzone_idx and alloc_flags parameters, which are related to the memory
allocation request. But from the context of compaction they are currently
passed as 0, including the direct compaction which is invoked to satisfy
the allocation request, and could therefore know the proper values.

The lack of proper values can lead to mismatch between decisions taken
during compaction and decisions related to the allocation request. Lack
of proper classzone_idx value means that lowmem_reserve is not taken into
account. This has manifested (during recent changes to deferred
compaction) when DMA zone was used as fallback for preferred Normal zone.
compaction_suitable() without proper classzone_idx would think that the
watermarks are already satisfied, but watermark check in
get_page_from_freelist() would fail. Because of this problem, deferring
compaction has extra complexity that can be removed in the following
patch.

The issue (not confirmed in practice) with missing alloc_flags is opposite
in nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or
ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
CMA-enabled systems) the watermark checking in compaction with 0 passed
will be stricter than in get_page_from_freelist(). In these cases
compaction might be running for a longer time than is really needed.

Another issue with compaction_suitable() is that the check for "does the zone
need compaction at all?" comes only after the check "does the zone have enough
free pages to succeed compaction". The latter considers extra pages for
migration and can therefore in some situations fail and return COMPACT_SKIPPED,
although the high-order allocation would succeed and we should return
COMPACT_PARTIAL.

This patch fixes these problems by adding alloc_flags and classzone_idx to
struct compact_control and related functions involved in direct compaction
and watermark checking. Where possible, all other callers of
compaction_suitable() pass proper values where those are known. This is
currently limited to classzone_idx, which is sometimes known in kswapd
context. However, the direct reclaim callers should_continue_reclaim()
and compaction_ready() do not currently know the proper values, so the
coordination between reclaim and compaction may still not be as accurate
as it could. This can be fixed later, if it's shown to be an issue.

Additionally, the checks in compaction_suitable() are reordered to address the
second issue described above.

The effect of this patch should be slightly better high-order allocation
success rates and/or less compaction overhead, depending on the type of
allocations and presence of CMA. It allows simplifying deferred
compaction code in a followup patch.

When testing with stress-highalloc, there was some slight improvement
(which might be just due to variance) in success rates of non-THP-like
allocations.

mm/compaction.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ba8bdb9..f15e8e5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1113,21 +1113,30 @@ unsigned long compaction_suitable(struct zone *zone, int order,
if (order == -1)
return COMPACT_CONTINUE;

+ watermark = low_wmark_pages(zone);
+ /*
+ * If watermarks for high-order allocation are already met, there
+ * should be no need for compaction at all.
+ */
+ if (zone_watermark_ok(zone, order, watermark, classzone_idx,
+ alloc_flags))
+ return COMPACT_PARTIAL;
+
/*
* Watermarks for order-0 must be met for compaction. Note the 2UL.
* This is because during migration, copies of pages need to be
* allocated and for a short time, the footprint is higher
*/
- watermark = low_wmark_pages(zone) + (2UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ watermark += (2UL << order);
+ if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
return COMPACT_SKIPPED;

/*
* fragmentation index determines if allocation failures are due to
* low memory or external fragmentation
*
- * index of -1000 implies allocations might succeed depending on
- * watermarks
+ * index of -1000 would imply allocations might succeed depending on
+ * watermarks, but we already failed the high-order watermark check
* index towards 0 implies failure is due to lack of memory
* index towards 1000 implies failure is due to fragmentation
*
@@ -1137,10 +1146,6 @@ unsigned long compaction_suitable(struct zone *zone, int order,
if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
return COMPACT_SKIPPED;

- if (fragindex == -1000 && zone_watermark_ok(zone, order, watermark,
- classzone_idx, alloc_flags))
- return COMPACT_PARTIAL;
-
return COMPACT_CONTINUE;
}

--
2.1.2


2014-11-14 08:57:46

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 4/5] mm, compaction: always update cached scanner positions

On 11/04/2014 01:28 AM, Joonsoo Kim wrote:
> On Fri, Oct 31, 2014 at 04:53:44PM +0100, Vlastimil Babka wrote:
>> On 10/28/2014 08:08 AM, Joonsoo Kim wrote:
>>
>> OK, so you don't find a problem with how this patch changes
>> migration scanner caching, just the free scanner, right?
>> So how about making release_freepages() return the highest freepage
>> pfn it encountered (could perhaps do without comparing individual
>> pfn's, the list should be ordered so it could be just the pfn of
>> first or last page in the list, but need to check that) and updating
>> cached free pfn with that? That should ensure rescanning only when
>> needed.
>
> Hello,
>
> Updating cached free pfn in release_freepages() looks good to me.
>
> In fact, I guess that migration scanner also has similar problems, but,
> it's just my guess. I admit your following arguments in patch description.
>
> However, the downside is that potentially many pages are rescanned without
> successful isolation. At worst, there might be a page where isolation from LRU
> succeeds but migration fails (potentially always).
>
> So, I'm okay if you update cached free pfn in release_freepages().

Hi, here's the patch-fix to update cached free pfn based on release_freepages().
No significant difference in my testing.

------8<------
From: Vlastimil Babka <[email protected]>
Date: Wed, 12 Nov 2014 16:37:21 +0100
Subject: [PATCH] mm-compaction-always-update-cached-scanner-positions-fix

This patch-fix addresses Joonsoo Kim's concerns about free pages potentially
being skipped when they are isolated and then returned due to migration
failure. It does so by setting the cached scanner pfn to the pageblock where
the free page with the highest pfn of all returned free pages resides.
A small downside is that release_freepages() no longer returns the number of
freed pages, which was used in a VM_BUG_ON check. I don't think the check
was important enough to warrant a more complex solution.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>

---

When squashed, the following paragraph should be appended to the fixed patch's
changelog:

To prevent free scanner from leaving free pages behind after they are returned
due to page migration failure, the cached scanner pfn is changed to point to
the pageblock of the returned free page with the highest pfn, before leaving
compact_zone().

mm/compaction.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e0befc3..3669f92 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -41,15 +41,17 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
static unsigned long release_freepages(struct list_head *freelist)
{
struct page *page, *next;
- unsigned long count = 0;
+ unsigned long high_pfn = 0;

list_for_each_entry_safe(page, next, freelist, lru) {
+ unsigned long pfn = page_to_pfn(page);
list_del(&page->lru);
__free_page(page);
- count++;
+ if (pfn > high_pfn)
+ high_pfn = pfn;
}

- return count;
+ return high_pfn;
}

static void map_pages(struct list_head *list)
@@ -1223,9 +1225,24 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
}

out:
- /* Release free pages and check accounting */
- cc->nr_freepages -= release_freepages(&cc->freepages);
- VM_BUG_ON(cc->nr_freepages != 0);
+ /*
+ * Release free pages and update where the free scanner should restart,
+ * so we don't leave any returned pages behind in the next attempt.
+ */
+ if (cc->nr_freepages > 0) {
+ unsigned long free_pfn = release_freepages(&cc->freepages);
+ cc->nr_freepages = 0;
+
+ VM_BUG_ON(free_pfn == 0);
+ /* The cached pfn is always the first in a pageblock */
+ free_pfn &= ~(pageblock_nr_pages-1);
+ /*
+ * Only go back, not forward. The cached pfn might have been
+ * already reset to zone end in compact_finished()
+ */
+ if (free_pfn > zone->compact_cached_free_pfn)
+ zone->compact_cached_free_pfn = free_pfn;
+ }

trace_mm_compaction_end(ret);

--
2.1.2


2014-11-19 22:53:42

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm, compaction: more focused lru and pcplists draining

On 11/14/2014 08:05 AM, Joonsoo Kim wrote:
>> What about this scenario, with pageblock order:
>>
>> - record cc->migrate_pfn pointing to pageblock X
>> - isolate_migratepages() skips the pageblock due to e.g. skip bit,
>> or the pageblock being a THP already...
>> - loop to pageblock X+1, last_migrated_pfn is still set to pfn of
>> pageblock X (more precisely the pfn is (X << pageblock_order) - 1
>> per your code, but doesn't matter)
>> - isolate_migratepages isolates something, but ends up somewhere in
>> the middle of pageblock due to COMPACT_CLUSTER_MAX
>> - cc->migrate_pfn points to pageblock X+1 (plus some pages it scanned)
>> - so it will decide that it has fully migrated pageblock X and it's
>> time to drain. But the drain is most likely useless - we didn't
>> migrate anything in pageblock X, we skipped it. And in X+1 we didn't
>> migrate everything yet, so we should drain only after finishing the
>> other part of the pageblock.
>
> Yes, but, it can be easily fixed.
>
> while (compact_finished()) {
> unsigned long prev_migrate_pfn = cc->migrate_pfn;
>
> isolate_migratepages()
> switch case {
> NONE:
> goto check_drain;
> SUCCESS:
> if (!last_migrated_pfn)
> last_migrated_pfn = prev_migrate_pfn;
> }
>
> ...
>
> check_drain: (at the end of loop)
> ...
> }

Good suggestion, also gets rid of the awkward subtraction of 1 in the
current patch. Thanks.

>> In short, "last_migrated_pfn" is not "last position of migrate
>> scanner" but "last block where we *actually* migrated".
>
> Okay. Now I get it.
> Nevertheless, I'd like to change logic like above.
>
> One problem of your approach is that it can't detect some cases.
>
> Let's think about following case.
> '|' denotes aligned block boundary.
> '^' denotes migrate_pfn at certain time.
>
> Assume that last_migrated_pfn = 0;
>
> |--------------|-------------|--------------|
> ^ ^
> before isolate after isolate
>
> In this case, your code just records position of second '^' to
> last_migrated_pfn and skip to flush. But, flush is needed if we
> migrate some pages because we move away from previous aligned block.
>
> Thanks.
>

Right, so the patch below implements your suggestion, and the last_migrated_pfn
initialization fix. I named the variable "isolate_start_pfn" instead of
prev_migrate_pfn, as it's where the migrate scanner isolation starts, and having
both prev_migrate_pfn and last_migrated_pfn would be more confusing I think.

------8<------
From: Vlastimil Babka <[email protected]>
Date: Mon, 3 Nov 2014 15:28:01 +0100
Subject: [PATCH] mm-compaction-more-focused-lru-and-pcplists-draining-fix

As Joonsoo Kim pointed out, last_migrated_pfn was reset to 0 by mistake at each
iteration in compact_zone(). This mistake could result in failing to recognize
immediate draining points for orders smaller than the pageblock order.
Joonsoo has also suggested an improvement to detecting the cc->order aligned
block where migration might have occurred - before this fix, some of the drain
opportunities might have been missed.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index fe43e60..100e6e8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1144,6 +1144,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
unsigned long end_pfn = zone_end_pfn(zone);
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
const bool sync = cc->mode != MIGRATE_ASYNC;
+ unsigned long last_migrated_pfn = 0;

ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
cc->classzone_idx);
@@ -1189,7 +1190,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
while ((ret = compact_finished(zone, cc, migratetype)) ==
COMPACT_CONTINUE) {
int err;
- unsigned long last_migrated_pfn = 0;
+ unsigned long isolate_start_pfn = cc->migrate_pfn;

switch (isolate_migratepages(zone, cc)) {
case ISOLATE_ABORT:
@@ -1230,21 +1231,22 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
}

/*
- * Record where we have freed pages by migration and not yet
- * flushed them to buddy allocator. Subtract 1, because often
- * we finish a pageblock and migrate_pfn points to the first
- * page* of the next one. In that case we want the drain below
- * to happen immediately.
+ * Record where we could have freed pages by migration and not
+ * yet flushed them to buddy allocator. We use the pfn that
+ * isolate_migratepages() started from in this loop iteration
+ * - this is the lowest page that could have been isolated and
+ * then freed by migration.
*/
if (!last_migrated_pfn)
- last_migrated_pfn = cc->migrate_pfn - 1;
+ last_migrated_pfn = isolate_start_pfn;

check_drain:
/*
- * Have we moved away from the previous cc->order aligned block
- * where we migrated from? If yes, flush the pages that were
- * freed, so that they can merge and compact_finished() can
- * detect immediately if allocation should succeed.
+ * Has the migration scanner moved away from the previous
+ * cc->order aligned block where we migrated from? If yes,
+ * flush the pages that were freed, so that they can merge and
+ * compact_finished() can detect immediately if allocation
+ * would succeed.
*/
if (cc->order > 0 && last_migrated_pfn) {
int cpu;
--
2.1.2