2014-06-20 15:50:18

by Vlastimil Babka

Subject: [PATCH v3 00/13] compaction: balancing overhead and success rates

Based on next-20140620.

This is v3 of a series (the first with a proper cover letter) that works
simultaneously towards two competing goals in memory compaction -
reducing overhead and improving success rates. It includes some cleanups and
more or less trivial (micro-)optimizations, hopefully more intelligent lock
contention management, and some preparation patches that culminate in the
last two patches, which should improve success rates and minimize work that
is not likely to result in a successful allocation for a THP page fault.
There are 3 new patches since the last posting, and many have been reworked.

Patch 1: a simple change that makes khugepaged not uselessly hold mmap_sem
(new) during potentially long sync compaction. I saw more opportunities for
improvement there, but those will be for another series. This is rather
trivial but can still reduce latencies for m(un)map-heavy workloads.

Patch 2: fine-grained per-zone deferred compaction management, which should
(new) result in more accurate decisions about when to compact a particular zone

Patch 3: A cleanup/micro-optimization. No change since v2.

Patch 4: Another cleanup/optimization. Surprisingly there's still low hanging
(new) fruit in functionality that was changed quite recently. Anything that
simplifies isolate_migratepages_range() is a good thing...

Patch 5: First step towards not relying on need_resched() to limit the amount
of work done by async compaction. Incorporated feedback since v2 and
reworked how lock contention is reported when multiple zones are
compacted, so that it's no longer accidental.

Patch 6: Prevent running for long time with IRQs disabled, and improve lock
contention detection. Incorporated feedback from David.

Patch 7: Micro-optimization made possible by patch 6. No changes since v2.

Patch 8: Reduce some useless rescanning in the free scanner. I made quite major
changes based on feedback, so I'd rather not keep the Reviewed-by tags
(thanks Minchan and Zhang though).

Patch 9: Reduce some iterations in the migration scanner, and make Patch 13
possible. Based on discussions with David, I made page_order_unsafe()
a #define so there will be no doubts about inlining behavior.

Patch 10: Cleanup, from David, no changes.

Patch 11: Prerequisite for Patch 13, from David, no changes.

Patch 12: Improve compaction success rates by grabbing the page freed by
migration ASAP. Since v2, I've removed the impact on allocation fast paths
per Minchan's feedback and changed the rules for when capture is allowed.

Patch 13: Minimize work done in page fault direct compaction (i.e. THP) that
(RFC) would not lead to a successful allocation. Move on to the next
cc->order aligned block of pages as soon as the scanner encounters a
page that is not free and cannot be isolated for migration.
The only change since v2 is some cleanup moved to Patch 4 where it fits
better. Still an RFC because I see this patch making a difference in a
stress-highalloc setting that doesn't use __GFP_NO_KSWAPD and thus
shouldn't be affected. So there is either a bug or an unforeseen
side-effect.

The only thorough evaluation was done with the series based on a pre-3.16-rc1
kernel, with the mmtests stress-highalloc benchmark allocating order-9 pages
without __GFP_NO_KSWAPD. Patches 1, 2 and 4 were not yet in the series. This is
not a benchmark where micro-optimizations would be visible, and the settings
mean it uses sync compaction and should not benefit from Patch 13 (but it did,
which is weird). It has however shown improvements in vmstat figures with
patches 8, 9 and 12, as documented in the commit messages. I hope David can
test whether it fixes his issues. Patch 1 was tested separately on another
machine, as documented. I'll run further tests with stress-highalloc settings
that mimic THP page faults (i.e. __GFP_NO_KSWAPD).

David Rientjes (2):
mm: rename allocflags_to_migratetype for clarity
mm, compaction: pass gfp mask to compact_control

Vlastimil Babka (11):
mm, THP: don't hold mmap_sem in khugepaged when allocating THP
mm, compaction: defer each zone individually instead of preferred zone
mm, compaction: do not recheck suitable_migration_target under lock
mm, compaction: move pageblock checks up from
isolate_migratepages_range()
mm, compaction: report compaction as contended only due to lock
contention
mm, compaction: periodically drop lock and restore IRQs in scanners
mm, compaction: skip rechecks when lock was already held
mm, compaction: remember position within pageblock in free pages
scanner
mm, compaction: skip buddy pages by their order in the migrate scanner
mm, compaction: try to capture the just-created high-order freepage
mm, compaction: do not migrate pages when that cannot satisfy page
fault allocation

include/linux/compaction.h | 10 +-
include/linux/gfp.h | 2 +-
mm/compaction.c | 569 +++++++++++++++++++++++++++++++++------------
mm/huge_memory.c | 20 +-
mm/internal.h | 38 ++-
mm/page_alloc.c | 122 +++++++---
6 files changed, 554 insertions(+), 207 deletions(-)

--
1.8.4.5


2014-06-20 15:50:20

by Vlastimil Babka

Subject: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

isolate_migratepages_range() is the main function of the compaction scanner,
called either on a single pageblock by isolate_migratepages() during regular
compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
It currently performs two pageblock-wide compaction suitability checks, and
because of the CMA callpath, it tracks whether it crossed a pageblock boundary
in order to repeat those checks.

However, closer inspection shows that those checks are always true for CMA:
- isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
- migrate_async_suitable() check is skipped because CMA uses sync compaction

We can therefore move the checks to isolate_migratepages(), reducing variables
and simplifying isolate_migratepages_range(). The update_pageblock_skip()
function also no longer needs the set_unsuitable parameter.

Furthermore, going back to compact_zone() and compact_finished() when a
pageblock is unsuitable is wasteful - the checks are meant to skip pageblocks
quickly. The patch therefore also introduces a simple loop into
isolate_migratepages() so that it does not return immediately on pageblock
checks, but keeps going until isolate_migratepages_range() gets called once.
Similarly to isolate_freepages(), the function periodically checks whether it
needs to reschedule or abort async compaction.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 112 +++++++++++++++++++++++++++++---------------------------
1 file changed, 59 insertions(+), 53 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 3064a7f..ebe30c9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
*/
static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
- bool set_unsuitable, bool migrate_scanner)
+ bool migrate_scanner)
{
struct zone *zone = cc->zone;
unsigned long pfn;
@@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
if (nr_isolated)
return;

- /*
- * Only skip pageblocks when all forms of compaction will be known to
- * fail in the near future.
- */
- if (set_unsuitable)
- set_pageblock_skip(page);
+ set_pageblock_skip(page);

pfn = page_to_pfn(page);

@@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,

static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
- bool set_unsuitable, bool migrate_scanner)
+ bool migrate_scanner)
{
}
#endif /* CONFIG_COMPACTION */
@@ -345,8 +340,7 @@ isolate_fail:

/* Update the pageblock-skip if the whole pageblock was scanned */
if (blockpfn == end_pfn)
- update_pageblock_skip(cc, valid_page, total_isolated, true,
- false);
+ update_pageblock_skip(cc, valid_page, total_isolated, false);

count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
if (total_isolated)
@@ -474,14 +468,12 @@ unsigned long
isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
{
- unsigned long last_pageblock_nr = 0, pageblock_nr;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct list_head *migratelist = &cc->migratepages;
struct lruvec *lruvec;
unsigned long flags;
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
- bool set_unsuitable = true;
const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
ISOLATE_ASYNC_MIGRATE : 0) |
(unevictable ? ISOLATE_UNEVICTABLE : 0);
@@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
if (!valid_page)
valid_page = page;

- /* If isolation recently failed, do not retry */
- pageblock_nr = low_pfn >> pageblock_order;
- if (last_pageblock_nr != pageblock_nr) {
- int mt;
-
- last_pageblock_nr = pageblock_nr;
- if (!isolation_suitable(cc, page))
- goto next_pageblock;
-
- /*
- * For async migration, also only scan in MOVABLE
- * blocks. Async migration is optimistic to see if
- * the minimum amount of work satisfies the allocation
- */
- mt = get_pageblock_migratetype(page);
- if (cc->mode == MIGRATE_ASYNC &&
- !migrate_async_suitable(mt)) {
- set_unsuitable = false;
- goto next_pageblock;
- }
- }
-
/*
* Skip if free. page_order cannot be used without zone->lock
* as nothing prevents parallel allocations or buddy merging.
@@ -668,8 +638,7 @@ next_pageblock:
* if the whole pageblock was scanned without isolating any page.
*/
if (low_pfn == end_pfn)
- update_pageblock_skip(cc, valid_page, nr_isolated,
- set_unsuitable, true);
+ update_pageblock_skip(cc, valid_page, nr_isolated, true);

trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);

@@ -840,34 +809,74 @@ typedef enum {
} isolate_migrate_t;

/*
- * Isolate all pages that can be migrated from the block pointed to by
- * the migrate scanner within compact_control.
+ * Isolate all pages that can be migrated from the first suitable block,
+ * starting at the block pointed to by the migrate scanner pfn within
+ * compact_control.
*/
static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
{
unsigned long low_pfn, end_pfn;
+ struct page *page;

- /* Do not scan outside zone boundaries */
- low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+ /* Start at where we last stopped, or beginning of the zone */
+ low_pfn = cc->migrate_pfn;

/* Only scan within a pageblock boundary */
end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);

- /* Do not cross the free scanner or scan within a memory hole */
- if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
- cc->migrate_pfn = end_pfn;
- return ISOLATE_NONE;
- }
+ /*
+ * Iterate over whole pageblocks until we find the first suitable.
+ * Do not cross the free scanner.
+ */
+ for (; end_pfn <= cc->free_pfn;
+ low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
+
+ /*
+ * This can potentially iterate a massively long zone with
+ * many pageblocks unsuitable, so periodically check if we
+ * need to schedule, or even abort async compaction.
+ */
+ if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
+ && compact_should_abort(cc))
+ break;

- /* Perform the isolation */
- low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
- if (!low_pfn || cc->contended)
- return ISOLATE_ABORT;
+ /* Do not scan within a memory hole */
+ if (!pfn_valid(low_pfn))
+ continue;
+
+ page = pfn_to_page(low_pfn);
+ /* If isolation recently failed, do not retry */
+ if (!isolation_suitable(cc, page))
+ continue;

+ /*
+ * For async compaction, also only scan in MOVABLE blocks.
+ * Async compaction is optimistic to see if the minimum amount
+ * of work satisfies the allocation.
+ */
+ if (cc->mode == MIGRATE_ASYNC &&
+ !migrate_async_suitable(get_pageblock_migratetype(page)))
+ continue;
+
+ /* Perform the isolation */
+ low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
+ end_pfn, false);
+ if (!low_pfn || cc->contended)
+ return ISOLATE_ABORT;
+
+ /*
+ * Either we isolated something and proceed with migration. Or
+ * we failed and compact_zone should decide if we should
+ * continue or not.
+ */
+ break;
+ }
+
+ /* Record where migration scanner will be restarted */
cc->migrate_pfn = low_pfn;

- return ISOLATE_SUCCESS;
+ return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

static int compact_finished(struct zone *zone,
@@ -1040,9 +1049,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
;
}

- if (!cc->nr_migratepages)
- continue;
-
err = migrate_pages(&cc->migratepages, compaction_alloc,
compaction_free, (unsigned long)cc, cc->mode,
MR_COMPACTION);
--
1.8.4.5

2014-06-20 15:50:17

by Vlastimil Babka

Subject: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

When allocating a huge page for collapsing, khugepaged currently holds
mmap_sem for reading on the mm where collapsing occurs. Afterwards the read
lock is dropped before the write lock is taken on the same mmap_sem.

Holding mmap_sem during the whole huge page allocation is therefore useless;
the vma needs to be rechecked after taking the write lock anyway. Furthermore,
huge page allocation might involve a rather long sync compaction, and thus
block any mmap_sem writers and e.g. affect workloads that perform frequent
m(un)map or mprotect operations.

This patch simply releases the read lock before allocating a huge page. It
also deletes an outdated comment that assumed vma must be stable, as it was
using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
("mm: thp: khugepaged: add policy for finding target node").

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/huge_memory.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d562a9..59ddc61 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2295,23 +2295,17 @@ static struct page
int node)
{
VM_BUG_ON_PAGE(*hpage, *hpage);
+
/*
- * Allocate the page while the vma is still valid and under
- * the mmap_sem read mode so there is no memory allocation
- * later when we take the mmap_sem in write mode. This is more
- * friendly behavior (OTOH it may actually hide bugs) to
- * filesystems in userland with daemons allocating memory in
- * the userland I/O paths. Allocating memory with the
- * mmap_sem in read mode is good idea also to allow greater
- * scalability.
+ * Before allocating the hugepage, release the mmap_sem read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_sem during
+ * that. We will recheck the vma after taking it again in write mode.
*/
+ up_read(&mm->mmap_sem);
+
*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
- /*
- * After allocating the hugepage, release the mmap_sem read lock in
- * preparation for taking it in write mode.
- */
- up_read(&mm->mmap_sem);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
--
1.8.4.5

2014-06-20 15:50:15

by Vlastimil Babka

Subject: [PATCH v3 03/13] mm, compaction: do not recheck suitable_migration_target under lock

isolate_freepages_block() rechecks if the pageblock is suitable to be a target
for migration after it has taken the zone->lock. However, the check has been
optimized to occur only once per pageblock, and compact_checklock_irqsave()
might be dropping and reacquiring lock, which means somebody else might have
changed the pageblock's migratetype meanwhile.

Furthermore, nothing prevents the migratetype from changing right after
isolate_freepages_block() has finished isolating. Given how imperfect this is,
it's simpler to just rely on the check done in isolate_freepages() without the
lock, and not pretend that the recheck under lock guarantees anything. It is
just a heuristic after all.

Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
mm/compaction.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 7c491d0..3064a7f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -276,7 +276,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
struct page *cursor, *valid_page = NULL;
unsigned long flags;
bool locked = false;
- bool checked_pageblock = false;

cursor = pfn_to_page(blockpfn);

@@ -307,18 +306,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
if (!locked)
break;

- /* Recheck this is a suitable migration target under lock */
- if (!strict && !checked_pageblock) {
- /*
- * We need to check suitability of pageblock only once
- * and this isolate_freepages_block() is called with
- * pageblock range, so just check once is sufficient.
- */
- checked_pageblock = true;
- if (!suitable_migration_target(page))
- break;
- }
-
/* Recheck this is a buddy page under lock */
if (!PageBuddy(page))
goto isolate_fail;
--
1.8.4.5

2014-06-20 15:50:14

by Vlastimil Babka

Subject: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

When direct sync compaction is often unsuccessful, it may become deferred for
some time to avoid further useless attempts, both sync and async. Successful
high-order allocations un-defer compaction, while further unsuccessful
compaction attempts prolong the compaction deferred period.

Currently, checking and setting the deferred status is performed only on the
preferred zone of the allocation that invoked direct compaction. But compaction
itself is attempted on all eligible zones in the zonelist, so the behavior is
suboptimal and may lead to scenarios where 1) compaction is attempted
uselessly, or 2) it's not attempted despite good chances of succeeding, as
shown in the examples below:

1) A direct compaction with Normal preferred zone failed and set deferred
compaction for the Normal zone. Another unrelated direct compaction with
DMA32 as preferred zone will attempt to compact DMA32 zone even though
the first compaction attempt also included DMA32 zone.

In another scenario, compaction with Normal preferred zone failed to compact
Normal zone, but succeeded in the DMA32 zone, so it will not defer
compaction. In the next attempt, it will try Normal zone which will fail
again, instead of skipping Normal zone and trying DMA32 directly.

2) Kswapd will balance the DMA32 zone and reset the deferred status because
the watermarks look good. A direct compaction with preferred Normal zone will
skip compaction of all zones including DMA32 because Normal was still
deferred. The allocation might have succeeded in DMA32, but won't.

This patch makes compaction deferring work on an individual zone basis instead
of only the preferred zone. For each zone, it checks compaction_deferred() to
decide whether the
zone should be skipped. If watermarks fail after compacting the zone,
defer_compaction() is called. The zone where watermarks passed can still be
deferred when the allocation attempt is unsuccessful. When allocation is
successful, compaction_defer_reset() is called for the zone containing the
allocated page. This approach should approximate calling defer_compaction()
only on zones where compaction was attempted and did not yield allocated page.
There might be corner cases, but that is inevitable as long as the decision
to stop compacting does not guarantee that a page will be allocated.

During testing on a two-node machine with a single very small Normal zone on
node 1, this patch has improved success rates in the stress-highalloc mmtests
benchmark. The success rates here were previously made worse by commit
3a025760fc ("mm: page_alloc: spill to remote nodes before waking kswapd"), as
kswapd was no longer resetting the deferred compaction for the Normal zone
often enough, and the DMA32 zones on both nodes were thus not considered for
compaction.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 29 ++++++++++++++++++++++++-----
mm/page_alloc.c | 33 ++++++++++++++++++---------------
3 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 01e3132..76f9beb 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- enum migrate_mode mode, bool *contended);
+ enum migrate_mode mode, bool *contended, bool *deferred,
+ struct zone **candidate_zone);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order);
@@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- enum migrate_mode mode, bool *contended)
+ enum migrate_mode mode, bool *contended, bool *deferred,
+ struct zone **candidate_zone)
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 5175019..7c491d0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1122,13 +1122,15 @@ int sysctl_extfrag_threshold = 500;
* @nodemask: The allowed nodes to allocate from
* @mode: The migration mode for async, sync light, or sync migration
* @contended: Return value that is true if compaction was aborted due to lock contention
- * @page: Optionally capture a free page of the requested order during compaction
+ * @deferred: Return value that is true if compaction was deferred in all zones
+ * @candidate_zone: Return the zone where we think allocation should succeed
*
* This is the main entry point for direct page compaction.
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- enum migrate_mode mode, bool *contended)
+ enum migrate_mode mode, bool *contended, bool *deferred,
+ struct zone **candidate_zone)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1142,8 +1144,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
if (!order || !may_enter_fs || !may_perform_io)
return rc;

- count_compact_event(COMPACTSTALL);
-
+ *deferred = true;
#ifdef CONFIG_CMA
if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
@@ -1153,16 +1154,34 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

+ if (compaction_deferred(zone, order))
+ continue;
+
+ *deferred = false;
+
status = compact_zone_order(zone, order, gfp_mask, mode,
contended);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
- alloc_flags))
+ alloc_flags)) {
+ *candidate_zone = zone;
break;
+ } else if (mode != MIGRATE_ASYNC) {
+ /*
+ * We think that allocation won't succeed in this zone
+ * so we defer compaction there. If it ends up
+ * succeeding after all, it will be reset.
+ */
+ defer_compaction(zone, order);
+ }
}

+ /* If at least one zone wasn't deferred, we count a compaction stall */
+ if (!*deferred)
+ count_compact_event(COMPACTSTALL);
+
return rc;
}

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee92384..6593f79 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2238,18 +2238,17 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
- if (!order)
- return NULL;
+ struct zone *last_compact_zone = NULL;

- if (compaction_deferred(preferred_zone, order)) {
- *deferred_compaction = true;
+ if (!order)
return NULL;
- }

current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
nodemask, mode,
- contended_compaction);
+ contended_compaction,
+ deferred_compaction,
+ &last_compact_zone);
current->flags &= ~PF_MEMALLOC;

if (*did_some_progress != COMPACT_SKIPPED) {
@@ -2263,27 +2262,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
order, zonelist, high_zoneidx,
alloc_flags & ~ALLOC_NO_WATERMARKS,
preferred_zone, classzone_idx, migratetype);
+
if (page) {
- preferred_zone->compact_blockskip_flush = false;
- compaction_defer_reset(preferred_zone, order, true);
+ struct zone *zone = page_zone(page);
+
+ zone->compact_blockskip_flush = false;
+ compaction_defer_reset(zone, order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}

/*
+ * last_compact_zone is where try_to_compact_pages thought
+ * allocation should succeed, so it did not defer compaction.
+ * But now we know that it didn't succeed, so we do the defer.
+ */
+ if (last_compact_zone && mode != MIGRATE_ASYNC)
+ defer_compaction(last_compact_zone, order);
+
+ /*
* It's bad if compaction run occurs and fails.
* The most likely reason is that pages exist,
* but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);

- /*
- * As async compaction considers a subset of pageblocks, only
- * defer if the failure was a sync compaction failure.
- */
- if (mode != MIGRATE_ASYNC)
- defer_compaction(preferred_zone, order);
-
cond_resched();
}

--
1.8.4.5

2014-06-20 15:52:54

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH v3 12/13] mm, compaction: try to capture the just-created high-order freepage

Compaction uses watermark checking to determine if it succeeded in creating
a high-order free page. My testing has shown that this is quite racy and it
can happen that watermark checking in compaction succeeds, and moments later
the watermark checking in page allocation fails, even though the number of
free pages has increased meanwhile.

It should be more reliable if direct compaction captures the high-order free
page as soon as it detects it, and passes it back to the allocation path. This
would also reduce the window for somebody else to allocate the free page.

Capture has been implemented before by 1fb3f8ca0e92 ("mm: compaction: capture
a suitable high-order page immediately when it is made available"), but later
reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
high-order page") due to a bug.

This patch differs from the previous attempt in two aspects:

1) The previous patch scanned free lists to capture the page. In this patch,
only the cc->order aligned block that the migration scanner just finished
is considered, but only if pages were actually isolated for migration in
that block. Tracking cc->order aligned blocks also has benefits for the
following patch that skips blocks where non-migratable pages were found.

2) The operations done in buffered_rmqueue() and get_page_from_freelist() are
closely followed, so that page capture mimics normal page allocation as much
as possible. This includes operations such as prep_new_page() and setting
page->pfmemalloc (which was missing in the previous attempt); zone
statistics are updated, etc. Due to subtleties with IRQ disabling and
enabling, this cannot be simply factored out from the normal allocation
functions without affecting the fastpath.

This patch has tripled compaction success rates (as recorded in vmstat) in the
stress-highalloc mmtests benchmark, although allocation success rates increased
only by a few percent. Closer inspection shows that due to the racy watermark
checking and the lack of lru_add_drain(), the allocations that resulted in
direct compactions were often failing, but later allocations succeeded in the
fast path. So the benefit of the patch to allocation success rates may be
limited, but it improves fairness in the sense that whoever spent the time
compacting has a higher chance of benefiting from it, and can also stop
compacting sooner, as page availability is detected immediately. With better
success detection, the contribution of compaction to high-order allocation
success rates is also no longer understated by the vmstats.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
include/linux/compaction.h | 8 ++-
mm/compaction.c | 119 +++++++++++++++++++++++++++++++++++++++++----
mm/internal.h | 5 +-
mm/page_alloc.c | 85 +++++++++++++++++++++++++++-----
4 files changed, 192 insertions(+), 25 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 76f9beb..be26cdc 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -10,6 +10,8 @@
#define COMPACT_PARTIAL 2
/* The full zone was compacted */
#define COMPACT_COMPLETE 3
+/* Captured a high-order free page in direct compaction */
+#define COMPACT_CAPTURED 4

#ifdef CONFIG_COMPACTION
extern int sysctl_compact_memory;
@@ -23,7 +25,8 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
enum migrate_mode mode, bool *contended, bool *deferred,
- struct zone **candidate_zone);
+ struct zone **candidate_zone,
+ struct page **captured_page);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order);
@@ -93,7 +96,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, bool *contended, bool *deferred,
- struct zone **candidate_zone)
+ struct zone **candidate_zone,
+ struct page **captured_page);
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index d4e0c13..89eed1e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -509,6 +509,7 @@ static bool too_many_isolated(struct zone *zone)
* @low_pfn: The first PFN of the range.
* @end_pfn: The one-past-the-last PFN of the range.
* @unevictable: true if it allows to isolate unevictable pages
+ * @capture: True if page capturing is allowed
*
* Isolate all pages that can be migrated from the range specified by
* [low_pfn, end_pfn). Returns zero if there is a fatal signal
@@ -524,7 +525,8 @@ static bool too_many_isolated(struct zone *zone)
*/
unsigned long
isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
- unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
+ unsigned long low_pfn, unsigned long end_pfn, bool unevictable,
+ bool capture)
{
unsigned long nr_scanned = 0, nr_isolated = 0;
struct list_head *migratelist = &cc->migratepages;
@@ -535,6 +537,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
ISOLATE_ASYNC_MIGRATE : 0) |
(unevictable ? ISOLATE_UNEVICTABLE : 0);
+ unsigned long capture_pfn = 0; /* current candidate for capturing */
+ unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+
+ if (cc->order > 0 && cc->order <= pageblock_order && capture) {
+ /* This may be outside the zone, but we check that later */
+ capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
+ next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+ }

/*
* Ensure that there are not too many pages isolated from the LRU
@@ -556,7 +566,27 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
return 0;

/* Time to isolate some pages for migration */
- for (; low_pfn < end_pfn; low_pfn++) {
+ for (; low_pfn <= end_pfn; low_pfn++) {
+ if (low_pfn == next_capture_pfn) {
+ /*
+ * We have a capture candidate if we isolated something
+ * during the last cc->order aligned block of pages.
+ */
+ if (nr_isolated &&
+ capture_pfn >= zone->zone_start_pfn) {
+ cc->capture_page = pfn_to_page(capture_pfn);
+ break;
+ }
+
+ /* Prepare for a new capture candidate */
+ capture_pfn = next_capture_pfn;
+ next_capture_pfn += (1UL << cc->order);
+ }
+
+ /* We check that here, in case low_pfn == next_capture_pfn */
+ if (low_pfn == end_pfn)
+ break;
+
/*
* Periodically drop the lock (if held) regardless of its
* contention, to give chance to IRQs. Abort async compaction
@@ -576,6 +606,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
if (!pfn_valid(low_pfn)) {
low_pfn += MAX_ORDER_NR_PAGES - 1;
+ if (next_capture_pfn)
+ next_capture_pfn = low_pfn + 1;
continue;
}
}
@@ -611,8 +643,12 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
* a valid page order. Consider only values in the
* valid order range to prevent low_pfn overflow.
*/
- if (freepage_order > 0 && freepage_order < MAX_ORDER)
+ if (freepage_order > 0 && freepage_order < MAX_ORDER) {
low_pfn += (1UL << freepage_order) - 1;
+ if (next_capture_pfn)
+ next_capture_pfn = ALIGN(low_pfn + 1,
+ (1UL << cc->order));
+ }
continue;
}

@@ -645,6 +681,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
if (!locked)
goto next_pageblock;
low_pfn += (1 << compound_order(page)) - 1;
+ if (next_capture_pfn)
+ next_capture_pfn =
+ ALIGN(low_pfn + 1, (1UL << cc->order));
continue;
}

@@ -669,6 +708,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
continue;
if (PageTransHuge(page)) {
low_pfn += (1 << compound_order(page)) - 1;
+ next_capture_pfn = low_pfn + 1;
continue;
}
}
@@ -700,6 +740,8 @@ isolate_success:

next_pageblock:
low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
+ if (next_capture_pfn)
+ next_capture_pfn = low_pfn + 1;
}

/*
@@ -910,7 +952,7 @@ typedef enum {
* compact_control.
*/
static isolate_migrate_t isolate_migratepages(struct zone *zone,
- struct compact_control *cc)
+ struct compact_control *cc, const int migratetype)
{
unsigned long low_pfn, end_pfn;
struct page *page;
@@ -927,6 +969,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
*/
for (; end_pfn <= cc->free_pfn;
low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
+ int pageblock_mt;

/*
* This can potentially iterate a massively long zone with
@@ -951,13 +994,15 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
* Async compaction is optimistic to see if the minimum amount
* of work satisfies the allocation.
*/
+ pageblock_mt = get_pageblock_migratetype(page);
if (cc->mode == MIGRATE_ASYNC &&
- !migrate_async_suitable(get_pageblock_migratetype(page)))
+ !migrate_async_suitable(pageblock_mt))
continue;

/* Perform the isolation */
low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
- end_pfn, false);
+ end_pfn, false, pageblock_mt == migratetype);
+
if (!low_pfn || cc->contended)
return ISOLATE_ABORT;

@@ -975,6 +1020,44 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

+/*
+ * When called, cc->capture_page is just a candidate. This function will either
+ * successfully capture the page, or reset it to NULL.
+ */
+static bool compact_capture_page(struct compact_control *cc)
+{
+ struct page *page = cc->capture_page;
+ int cpu;
+
+ /* Unsafe check if it's worth to try acquiring the zone->lock at all */
+ if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+ goto try_capture;
+
+ /*
+ * There's a good chance that we have just put free pages on this CPU's
+ * lru cache and pcplists after the page migrations. Drain them to
+ * allow merging.
+ */
+ cpu = get_cpu();
+ lru_add_drain_cpu(cpu);
+ drain_local_pages(NULL);
+ put_cpu();
+
+ /* Did the draining help? */
+ if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+ goto try_capture;
+
+ goto fail;
+
+try_capture:
+ if (capture_free_page(page, cc->order))
+ return true;
+
+fail:
+ cc->capture_page = NULL;
+ return false;
+}
+
static int compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
@@ -1003,6 +1086,10 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
return COMPACT_COMPLETE;
}

+ /* Did we just finish a pageblock that was capture candidate? */
+ if (cc->capture_page && compact_capture_page(cc))
+ return COMPACT_CAPTURED;
+
/*
* order == -1 is expected when compacting via
* /proc/sys/vm/compact_memory
@@ -1135,7 +1222,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
COMPACT_CONTINUE) {
int err;

- switch (isolate_migratepages(zone, cc)) {
+ switch (isolate_migratepages(zone, cc, migratetype)) {
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
putback_movable_pages(&cc->migratepages);
@@ -1180,7 +1267,8 @@ out:
}

static unsigned long compact_zone_order(struct zone *zone, int order,
- gfp_t gfp_mask, enum migrate_mode mode, bool *contended)
+ gfp_t gfp_mask, enum migrate_mode mode, bool *contended,
+ struct page **captured_page)
{
unsigned long ret;
struct compact_control cc = {
@@ -1196,6 +1284,9 @@ static unsigned long compact_zone_order(struct zone *zone, int order,

ret = compact_zone(zone, &cc);

+ if (ret == COMPACT_CAPTURED)
+ *captured_page = cc.capture_page;
+
VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));

@@ -1216,13 +1307,15 @@ int sysctl_extfrag_threshold = 500;
* @contended: Return value that is true if compaction was aborted due to lock contention
* @deferred: Return value that is true if compaction was deferred in all zones
* @candidate_zone: Return the zone where we think allocation should succeed
+ * @captured_page: If successful, return the page captured during compaction
*
* This is the main entry point for direct page compaction.
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, bool *contended, bool *deferred,
- struct zone **candidate_zone)
+ struct zone **candidate_zone,
+ struct page **captured_page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1254,10 +1347,16 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
*deferred = false;

status = compact_zone_order(zone, order, gfp_mask, mode,
- &zone_contended);
+ &zone_contended, captured_page);
rc = max(status, rc);
all_zones_contended &= zone_contended;

+ /* If we captured a page, stop compacting */
+ if (*captured_page) {
+ *candidate_zone = zone;
+ break;
+ }
+
/* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
alloc_flags)) {
diff --git a/mm/internal.h b/mm/internal.h
index dd17a40..b15b89f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -110,6 +110,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
*/
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
+extern bool capture_free_page(struct page *page, unsigned int order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
#endif
@@ -155,6 +156,7 @@ struct compact_control {
* contention detected during
* compaction
*/
+ struct page *capture_page; /* Free page captured by compaction */
};

unsigned long
@@ -162,7 +164,8 @@ isolate_freepages_range(struct compact_control *cc,
unsigned long start_pfn, unsigned long end_pfn);
unsigned long
isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
- unsigned long low_pfn, unsigned long end_pfn, bool unevictable);
+ unsigned long low_pfn, unsigned long end_pfn, bool unevictable,
+ bool capture);

#endif

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 70b8297..e568e86 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1471,9 +1471,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
{
unsigned long watermark;
struct zone *zone;
+ struct free_area *area;
int mt;
+ unsigned int freepage_order = page_order(page);

- BUG_ON(!PageBuddy(page));
+ VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);

zone = page_zone(page);
mt = get_pageblock_migratetype(page);
@@ -1488,9 +1490,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
}

/* Remove page from free list */
+ area = &zone->free_area[freepage_order];
list_del(&page->lru);
- zone->free_area[order].nr_free--;
+ area->nr_free--;
rmv_page_order(page);
+ if (freepage_order != order)
+ expand(zone, page, order, freepage_order, area, mt);

/* Set the pageblock if the isolated page is at least a pageblock */
if (order >= pageblock_order - 1) {
@@ -1533,6 +1538,29 @@ int split_free_page(struct page *page)
return nr_pages;
}

+bool capture_free_page(struct page *page, unsigned int order)
+{
+ struct zone *zone = page_zone(page);
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ if (!PageBuddy(page) || page_order(page) < order
+ || !__isolate_free_page(page, order)) {
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return false;
+ }
+
+ spin_unlock(&zone->lock);
+
+ /* Mimic what buffered_rmqueue() does */
+ __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+ __count_zone_vm_events(PGALLOC, zone, 1 << order);
+ local_irq_restore(flags);
+
+ return true;
+}
+
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
@@ -2239,6 +2267,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned long *did_some_progress)
{
struct zone *last_compact_zone = NULL;
+ struct page *page = NULL;

if (!order)
return NULL;
@@ -2248,20 +2277,52 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
nodemask, mode,
contended_compaction,
deferred_compaction,
- &last_compact_zone);
+ &last_compact_zone, &page);
current->flags &= ~PF_MEMALLOC;

if (*did_some_progress != COMPACT_SKIPPED) {
- struct page *page;

- /* Page migration frees to the PCP lists but we want merging */
- drain_pages(get_cpu());
- put_cpu();
+ /* Did we capture a page? */
+ if (page) {
+ struct zone *zone;
+ unsigned long flags;
+ /*
+ * Mimic what buffered_rmqueue() does and
+ * capture_new_page() has not yet done.
+ */
+ zone = page_zone(page);
+
+ local_irq_save(flags);
+ zone_statistics(preferred_zone, zone, gfp_mask);
+ local_irq_restore(flags);

- page = get_page_from_freelist(gfp_mask, nodemask,
- order, zonelist, high_zoneidx,
- alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, classzone_idx, migratetype);
+ VM_BUG_ON_PAGE(bad_range(zone, page), page);
+ if (!prep_new_page(page, order, gfp_mask))
+ /*
+ * This is usually done in
+ * get_page_from_freelist()
+ */
+ page->pfmemalloc = !!(alloc_flags &
+ ALLOC_NO_WATERMARKS);
+ else
+ page = NULL;
+ }
+
+ /* No capture but let's try allocating anyway */
+ if (!page) {
+ /*
+ * Page migration frees to the PCP lists but we want
+ * merging
+ */
+ drain_pages(get_cpu());
+ put_cpu();
+
+ page = get_page_from_freelist(gfp_mask, nodemask,
+ order, zonelist, high_zoneidx,
+ alloc_flags & ~ALLOC_NO_WATERMARKS,
+ preferred_zone, classzone_idx,
+ migratetype);
+ }

if (page) {
struct zone *zone = page_zone(page);
@@ -6255,7 +6316,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
if (list_empty(&cc->migratepages)) {
cc->nr_migratepages = 0;
pfn = isolate_migratepages_range(cc->zone, cc,
- pfn, end, true);
+ pfn, end, true, false);
if (!pfn) {
ret = -EINTR;
break;
--
1.8.4.5

2014-06-20 15:52:52

by Vlastimil Babka

Subject: [RFC PATCH v3 13/13] mm, compaction: do not migrate pages when that cannot satisfy page fault allocation

In direct compaction for a page fault, we want to allocate the high-order page
as soon as possible, so migrating from a cc->order aligned block of pages that
also contains unmigratable pages just adds to the page fault latency.

This patch therefore makes the migration scanner skip to the next cc->order
aligned block of pages as soon as it cannot isolate a non-free page. Everything
isolated up to that point is put back.

In this mode, the nr_isolated limit of COMPACT_CLUSTER_MAX is not observed,
allowing the scanner to scan a whole block at once, instead of migrating
COMPACT_CLUSTER_MAX pages and then finding an unmigratable page in the next
call. This might, however, have some implications for direct reclaimers through
too_many_isolated().

In preliminary tests with the stress-highalloc benchmark, this reduced the
numbers of scanned, isolated and migrated pages by about 10%, while the
allocation success rates dropped by only a few percent.

[[email protected]: skip_on_failure based on THP page faults]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 51 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 89eed1e..4577445 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -539,11 +539,20 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
(unevictable ? ISOLATE_UNEVICTABLE : 0);
unsigned long capture_pfn = 0; /* current candidate for capturing */
unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+ bool skip_on_failure = false; /* skip block when isolation fails */

if (cc->order > 0 && cc->order <= pageblock_order && capture) {
/* This may be outside the zone, but we check that later */
capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+ /*
+ * It is too expensive for compaction to migrate pages from a
+ * cc->order block of pages on page faults, unless the entire
+ * block can become free. But hugepaged should try anyway for
+ * THP so that general defragmentation happens.
+ */
+ skip_on_failure = (cc->gfp_mask & __GFP_NO_KSWAPD)
+ && !(current->flags & PF_KTHREAD);
}

/*
@@ -613,7 +622,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
}

if (!pfn_valid_within(low_pfn))
- continue;
+ goto isolation_failed;
nr_scanned++;

/*
@@ -624,7 +633,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
*/
page = pfn_to_page(low_pfn);
if (page_zone(page) != zone)
- continue;
+ goto isolation_failed;

if (!valid_page)
valid_page = page;
@@ -664,7 +673,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
goto isolate_success;
}
}
- continue;
+ goto isolation_failed;
}

/*
@@ -684,7 +693,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
if (next_capture_pfn)
next_capture_pfn =
ALIGN(low_pfn + 1, (1UL << cc->order));
- continue;
+ goto isolation_failed;
}

/*
@@ -694,7 +703,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
*/
if (!page_mapping(page) &&
page_count(page) > page_mapcount(page))
- continue;
+ goto isolation_failed;

/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
@@ -705,11 +714,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,

/* Recheck PageLRU and PageTransHuge under lock */
if (!PageLRU(page))
- continue;
+ goto isolation_failed;
if (PageTransHuge(page)) {
low_pfn += (1 << compound_order(page)) - 1;
next_capture_pfn = low_pfn + 1;
- continue;
+ goto isolation_failed;
}
}

@@ -717,7 +726,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,

/* Try isolate the page */
if (__isolate_lru_page(page, mode) != 0)
- continue;
+ goto isolation_failed;

VM_BUG_ON_PAGE(PageTransCompound(page), page);

@@ -727,11 +736,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
isolate_success:
cc->finished_update_migrate = true;
list_add(&page->lru, migratelist);
- cc->nr_migratepages++;
nr_isolated++;

- /* Avoid isolating too much */
- if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
+ /*
+ * Avoid isolating too much, except if we try to capture a
+ * free page and want to find out at once if it can be done
+ * or we should skip to the next block.
+ */
+ if (!skip_on_failure && nr_isolated == COMPACT_CLUSTER_MAX) {
++low_pfn;
break;
}
@@ -742,6 +754,20 @@ next_pageblock:
low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
if (next_capture_pfn)
next_capture_pfn = low_pfn + 1;
+
+isolation_failed:
+ if (skip_on_failure) {
+ if (nr_isolated) {
+ if (locked) {
+ spin_unlock_irqrestore(&zone->lru_lock,
+ flags);
+ locked = false;
+ }
+ putback_movable_pages(migratelist);
+ nr_isolated = 0;
+ }
+ low_pfn = next_capture_pfn - 1;
+ }
}

/*
@@ -751,6 +777,7 @@ next_pageblock:
if (unlikely(low_pfn > end_pfn))
low_pfn = end_pfn;

+ cc->nr_migratepages = nr_isolated;
acct_isolated(zone, locked, cc);

if (locked)
@@ -760,7 +787,7 @@ next_pageblock:
* Update the pageblock-skip information and cached scanner pfn,
* if the whole pageblock was scanned without isolating any page.
*/
- if (low_pfn == end_pfn)
+ if (low_pfn == end_pfn && !skip_on_failure)
update_pageblock_skip(cc, valid_page, nr_isolated, true);

trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
--
1.8.4.5

2014-06-20 15:53:39

by Vlastimil Babka

Subject: [PATCH v3 11/13] mm, compaction: pass gfp mask to compact_control

From: David Rientjes <[email protected]>

struct compact_control currently converts the gfp mask to a migratetype, but we
need the entire gfp mask in a follow-up patch.

Pass the entire gfp mask as part of struct compact_control.

Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
---
mm/compaction.c | 12 +++++++-----
mm/internal.h | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 32c768b..d4e0c13 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -975,8 +975,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

-static int compact_finished(struct zone *zone,
- struct compact_control *cc)
+static int compact_finished(struct zone *zone, struct compact_control *cc,
+ const int migratetype)
{
unsigned int order;
unsigned long watermark;
@@ -1022,7 +1022,7 @@ static int compact_finished(struct zone *zone,
struct free_area *area = &zone->free_area[order];

/* Job done if page is free of the right migratetype */
- if (!list_empty(&area->free_list[cc->migratetype]))
+ if (!list_empty(&area->free_list[migratetype]))
return COMPACT_PARTIAL;

/* Job done if allocation would set block type */
@@ -1088,6 +1088,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
int ret;
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
+ const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
const bool sync = cc->mode != MIGRATE_ASYNC;

ret = compaction_suitable(zone, cc->order);
@@ -1130,7 +1131,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

migrate_prep_local();

- while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
+ while ((ret = compact_finished(zone, cc, migratetype)) ==
+ COMPACT_CONTINUE) {
int err;

switch (isolate_migratepages(zone, cc)) {
@@ -1185,7 +1187,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
.nr_freepages = 0,
.nr_migratepages = 0,
.order = order,
- .migratetype = gfpflags_to_migratetype(gfp_mask),
+ .gfp_mask = gfp_mask,
.zone = zone,
.mode = mode,
};
diff --git a/mm/internal.h b/mm/internal.h
index 584cd69..dd17a40 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -149,7 +149,7 @@ struct compact_control {
bool finished_update_migrate;

int order; /* order a direct compactor needs */
- int migratetype; /* MOVABLE, RECLAIMABLE etc */
+ const gfp_t gfp_mask; /* gfp mask of a direct compactor */
struct zone *zone;
enum compact_contended contended; /* Signal need_sched() or lock
* contention detected during
--
1.8.4.5

2014-06-20 15:53:57

by Vlastimil Babka

Subject: [PATCH v3 07/13] mm, compaction: skip rechecks when lock was already held

Compaction scanners take the zone locks as late as possible by opportunistically
checking many page or pageblock properties without the lock, and skipping pages
found unsuitable. For pages that pass the initial checks, some properties have
to be checked again safely under lock. However, if the lock was already held
from a previous iteration during the initial checks, the rechecks are
unnecessary.

This patch therefore skips the rechecks when the lock was already held. This is
now possible since we no longer (potentially) drop and reacquire the lock
between the initial checks and the safe rechecks.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
mm/compaction.c | 53 +++++++++++++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 40da812..9f6e857 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -324,22 +324,30 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
goto isolate_fail;

/*
- * The zone lock must be held to isolate freepages.
- * Unfortunately this is a very coarse lock and can be
- * heavily contended if there are parallel allocations
- * or parallel compactions. For async compaction do not
- * spin on the lock and we acquire the lock as late as
- * possible.
+ * If we already hold the lock, we can skip some rechecking.
+ * Note that if we hold the lock now, checked_pageblock was
+ * already set in some previous iteration (or strict is true),
+ * so it is correct to skip the suitable migration target
+ * recheck as well.
*/
- if (!locked)
+ if (!locked) {
+ /*
+ * The zone lock must be held to isolate freepages.
+ * Unfortunately this is a very coarse lock and can be
+ * heavily contended if there are parallel allocations
+ * or parallel compactions. For async compaction do not
+ * spin on the lock and we acquire the lock as late as
+ * possible.
+ */
locked = compact_trylock_irqsave(&cc->zone->lock,
&flags, cc);
- if (!locked)
- break;
+ if (!locked)
+ break;

- /* Recheck this is a buddy page under lock */
- if (!PageBuddy(page))
- goto isolate_fail;
+ /* Recheck this is a buddy page under lock */
+ if (!PageBuddy(page))
+ goto isolate_fail;
+ }

/* Found a free page, break it into order-0 pages */
isolated = split_free_page(page);
@@ -623,19 +631,20 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
page_count(page) > page_mapcount(page))
continue;

- /* If the lock is not held, try to take it */
- if (!locked)
+ /* If we already hold the lock, we can skip some rechecking */
+ if (!locked) {
locked = compact_trylock_irqsave(&zone->lru_lock,
&flags, cc);
- if (!locked)
- break;
+ if (!locked)
+ break;

- /* Recheck PageLRU and PageTransHuge under lock */
- if (!PageLRU(page))
- continue;
- if (PageTransHuge(page)) {
- low_pfn += (1 << compound_order(page)) - 1;
- continue;
+ /* Recheck PageLRU and PageTransHuge under lock */
+ if (!PageLRU(page))
+ continue;
+ if (PageTransHuge(page)) {
+ low_pfn += (1 << compound_order(page)) - 1;
+ continue;
+ }
}

lruvec = mem_cgroup_page_lruvec(page, zone);
--
1.8.4.5

2014-06-20 15:53:55

by Vlastimil Babka

Subject: [PATCH v3 08/13] mm, compaction: remember position within pageblock in free pages scanner

Unlike the migration scanner, the free scanner remembers only the beginning of
the last scanned pageblock in cc->free_pfn. It might therefore uselessly rescan
pages when called several times during a single compaction. This might have
been useful when pages were returned to the buddy allocator after a failed
migration, but that is no longer the case.

This patch changes the meaning of cc->free_pfn so that if it points into the
middle of a pageblock, that pageblock is scanned only from cc->free_pfn to its
end. isolate_freepages_block() will record the pfn of the last page it looked
at, which is then used to update cc->free_pfn.

In the mmtests stress-highalloc benchmark, this lowered the ratio of pages
scanned by the two scanners from 2.5 to 2.25 free pages per migrate page,
without affecting success rates.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Zhang Yanfei <[email protected]>
---
mm/compaction.c | 40 +++++++++++++++++++++++++++++++---------
1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9f6e857..41c7005 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -287,7 +287,7 @@ static bool suitable_migration_target(struct page *page)
* (even though it may still end up isolating some pages).
*/
static unsigned long isolate_freepages_block(struct compact_control *cc,
- unsigned long blockpfn,
+ unsigned long *start_pfn,
unsigned long end_pfn,
struct list_head *freelist,
bool strict)
@@ -296,6 +296,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
struct page *cursor, *valid_page = NULL;
unsigned long flags;
bool locked = false;
+ unsigned long blockpfn = *start_pfn;

cursor = pfn_to_page(blockpfn);

@@ -369,6 +370,9 @@ isolate_fail:
break;
}

+ /* Record how far we have got within the block */
+ *start_pfn = blockpfn;
+
trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);

/*
@@ -413,6 +417,9 @@ isolate_freepages_range(struct compact_control *cc,
LIST_HEAD(freelist);

for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
+ /* Protect pfn from changing by isolate_freepages_block */
+ unsigned long isolate_start_pfn = pfn;
+
if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
break;

@@ -423,8 +430,8 @@ isolate_freepages_range(struct compact_control *cc,
block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
block_end_pfn = min(block_end_pfn, end_pfn);

- isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
- &freelist, true);
+ isolated = isolate_freepages_block(cc, &isolate_start_pfn,
+ block_end_pfn, &freelist, true);

/*
* In strict mode, isolate_freepages_block() returns 0 if
@@ -708,6 +715,7 @@ static void isolate_freepages(struct zone *zone,
{
struct page *page;
unsigned long block_start_pfn; /* start of current pageblock */
+ unsigned long isolate_start_pfn; /* exact pfn we start at */
unsigned long block_end_pfn; /* end of current pageblock */
unsigned long low_pfn; /* lowest pfn scanner is able to scan */
int nr_freepages = cc->nr_freepages;
@@ -716,14 +724,15 @@ static void isolate_freepages(struct zone *zone,
/*
* Initialise the free scanner. The starting point is where we last
* successfully isolated from, zone-cached value, or the end of the
- * zone when isolating for the first time. We need this aligned to
- * the pageblock boundary, because we do
+ * zone when isolating for the first time. For looping we also need
+ * this pfn aligned down to the pageblock boundary, because we do
* block_start_pfn -= pageblock_nr_pages in the for loop.
* For ending point, take care when isolating in last pageblock of a
* a zone which ends in the middle of a pageblock.
* The low boundary is the end of the pageblock the migration scanner
* is using.
*/
+ isolate_start_pfn = cc->free_pfn;
block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
zone_end_pfn(zone));
@@ -736,7 +745,8 @@ static void isolate_freepages(struct zone *zone,
*/
for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
block_end_pfn = block_start_pfn,
- block_start_pfn -= pageblock_nr_pages) {
+ block_start_pfn -= pageblock_nr_pages,
+ isolate_start_pfn = block_start_pfn) {
unsigned long isolated;

/*
@@ -770,13 +780,25 @@ static void isolate_freepages(struct zone *zone,
if (!isolation_suitable(cc, page))
continue;

- /* Found a block suitable for isolating free pages from */
- cc->free_pfn = block_start_pfn;
- isolated = isolate_freepages_block(cc, block_start_pfn,
+ /* Found a block suitable for isolating free pages from. */
+ isolated = isolate_freepages_block(cc, &isolate_start_pfn,
block_end_pfn, freelist, false);
nr_freepages += isolated;

/*
+ * Remember where the free scanner should restart next time,
+ * which is where isolate_freepages_block() left off.
+ * But if it scanned the whole pageblock, isolate_start_pfn
+ * now points at block_end_pfn, which is the start of the next
+ * pageblock.
+ * In that case we will however want to restart at the start
+ * of the previous pageblock.
+ */
+ cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
+ isolate_start_pfn :
+ block_start_pfn - pageblock_nr_pages;
+
+ /*
* Set a flag that we successfully isolated in this pageblock.
* In the next loop iteration, zone->compact_cached_free_pfn
* will not be updated and thus it will effectively contain the
--
1.8.4.5

2014-06-20 15:53:54

by Vlastimil Babka

Subject: [PATCH v3 09/13] mm, compaction: skip buddy pages by their order in the migrate scanner

The migration scanner skips PageBuddy pages, but does not consider their order
as checking page_order() is generally unsafe without holding the zone->lock,
and acquiring the lock just for the check wouldn't be a good tradeoff.

Still, this could avoid some iterations over the rest of the buddy page, and
if we are careful, the race window between the PageBuddy() check and page_order()
is small, and the worst that can happen is that we skip too much and miss some
isolation candidates. This is not that bad, as compaction can already fail for
many other reasons like parallel allocations, and those have a much larger race
window.

This patch therefore makes the migration scanner obtain the buddy page order
and use it to skip the whole buddy page, if the order appears to be in the
valid range.

It's important that page_order() is read only once, so that the value used in
the checks and in the pfn calculation is the same. But in theory the compiler
can replace the local variable with multiple invocations of page_order().
Therefore, the patch introduces page_order_unsafe(), which uses ACCESS_ONCE to
prevent this.

Testing with stress-highalloc from mmtests shows a 15% reduction in the number
of pages scanned by the migration scanner. This change is also a prerequisite
for a later patch which detects when a cc->order block of pages contains
non-buddy pages that cannot be isolated, so that the scanner can skip to the
next block immediately.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 36 +++++++++++++++++++++++++++++++-----
mm/internal.h | 16 +++++++++++++++-
2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 41c7005..df0961b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -270,8 +270,15 @@ static inline bool compact_should_abort(struct compact_control *cc)
static bool suitable_migration_target(struct page *page)
{
/* If the page is a large free page, then disallow migration */
- if (PageBuddy(page) && page_order(page) >= pageblock_order)
- return false;
+ if (PageBuddy(page)) {
+ /*
+ * We are checking page_order without zone->lock taken. But
+ * the only small danger is that we skip a potentially suitable
+ * pageblock, so it's not worth to check order for valid range.
+ */
+ if (page_order_unsafe(page) >= pageblock_order)
+ return false;
+ }

/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
if (migrate_async_suitable(get_pageblock_migratetype(page)))
@@ -591,11 +598,23 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
valid_page = page;

/*
- * Skip if free. page_order cannot be used without zone->lock
- * as nothing prevents parallel allocations or buddy merging.
+ * Skip if free. We read page order here without zone lock
+ * which is generally unsafe, but the race window is small and
+ * the worst thing that can happen is that we skip some
+ * potential isolation targets.
*/
- if (PageBuddy(page))
+ if (PageBuddy(page)) {
+ unsigned long freepage_order = page_order_unsafe(page);
+
+ /*
+ * Without lock, we cannot be sure that what we got is
+ * a valid page order. Consider only values in the
+ * valid order range to prevent low_pfn overflow.
+ */
+ if (freepage_order > 0 && freepage_order < MAX_ORDER)
+ low_pfn += (1UL << freepage_order) - 1;
continue;
+ }

/*
* Check may be lockless but that's ok as we recheck later.
@@ -683,6 +702,13 @@ next_pageblock:
low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
}

+ /*
+ * The PageBuddy() check could have potentially brought us outside
+ * the range to be scanned.
+ */
+ if (unlikely(low_pfn > end_pfn))
+ low_pfn = end_pfn;
+
acct_isolated(zone, locked, cc);

if (locked)
diff --git a/mm/internal.h b/mm/internal.h
index 2c187d2..584cd69 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -171,7 +171,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
* general, page_zone(page)->lock must be held by the caller to prevent the
* page from being allocated in parallel and returning garbage as the order.
* If a caller does not hold page_zone(page)->lock, it must guarantee that the
- * page cannot be allocated or merged in parallel.
+ * page cannot be allocated or merged in parallel. Alternatively, it must
+ * handle invalid values gracefully, and use page_order_unsafe() below.
*/
static inline unsigned long page_order(struct page *page)
{
@@ -179,6 +180,19 @@ static inline unsigned long page_order(struct page *page)
return page_private(page);
}

+/*
+ * Like page_order(), but for callers who cannot afford to hold the zone lock.
+ * PageBuddy() should be checked first by the caller to minimize race window,
+ * and invalid values must be handled gracefully.
+ *
+ * ACCESS_ONCE is used so that if the caller assigns the result into a local
+ * variable and e.g. tests it for valid range before using, the compiler cannot
+ * decide to remove the variable and inline the page_private(page) multiple
+ * times, potentially observing different values in the tests and the actual
+ * use of the result.
+ */
+#define page_order_unsafe(page) ACCESS_ONCE(page_private(page))
+
static inline bool is_cow_mapping(vm_flags_t flags)
{
return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
--
1.8.4.5

2014-06-20 15:53:53

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH v3 10/13] mm: rename allocflags_to_migratetype for clarity

From: David Rientjes <[email protected]>

The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
ALLOC_CPUSET) that have separate semantics.

The function allocflags_to_migratetype() actually takes gfp flags, not alloc
flags, and returns a migratetype. Rename it to gfpflags_to_migratetype().

Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
---
include/linux/gfp.h | 2 +-
mm/compaction.c | 4 ++--
mm/page_alloc.c | 6 +++---
3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5e7219d..41b30fd 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -156,7 +156,7 @@ struct vm_area_struct;
#define GFP_DMA32 __GFP_DMA32

/* Convert GFP flags to their corresponding migrate type */
-static inline int allocflags_to_migratetype(gfp_t gfp_flags)
+static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);

diff --git a/mm/compaction.c b/mm/compaction.c
index df0961b..32c768b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1185,7 +1185,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
.nr_freepages = 0,
.nr_migratepages = 0,
.order = order,
- .migratetype = allocflags_to_migratetype(gfp_mask),
+ .migratetype = gfpflags_to_migratetype(gfp_mask),
.zone = zone,
.mode = mode,
};
@@ -1237,7 +1237,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,

*deferred = true;
#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
/* Compact each zone in the list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6593f79..70b8297 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2473,7 +2473,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_NO_WATERMARKS;
}
#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
return alloc_flags;
@@ -2716,7 +2716,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zone *preferred_zone;
struct zoneref *preferred_zoneref;
struct page *page = NULL;
- int migratetype = allocflags_to_migratetype(gfp_mask);
+ int migratetype = gfpflags_to_migratetype(gfp_mask);
unsigned int cpuset_mems_cookie;
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
int classzone_idx;
@@ -2750,7 +2750,7 @@ retry_cpuset:
classzone_idx = zonelist_zone_idx(preferred_zoneref);

#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
retry:
--
1.8.4.5

2014-06-20 15:55:24

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

Compaction scanners regularly check for lock contention and need_resched()
through the compact_checklock_irqsave() function. However, if there is no
contention, the lock can be held and IRQ disabled for potentially long time.

This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
time IRQs are disabled while isolating pages for migration") for the migration
scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
acquire the zone->lru_lock as late as possible") has changed the conditions so
that the lock is dropped only when there's contention on the lock or
need_resched() is true. Also, need_resched() is checked only when the lock is
already held. The comment "give a chance to irqs before checking need_resched"
is therefore misleading, as IRQs remain disabled when the check is done.

This patch restores the behavior intended by commit b2eef8c0d0 and also tries
to better balance, and make more deterministic, the time spent checking for
contention versus the time the scanners run between the checks. It also
avoids situations where checking has not been done often enough before. The
result should be avoiding both too frequent and too infrequent contention
checking, and especially the potentially long-running scans with IRQs disabled
and no checking of need_resched() or for fatal signal pending, which can happen
when many consecutive pages or pageblocks fail the preliminary tests and do not
reach the later call site to compact_checklock_irqsave(), as explained below.

Before the patch:

In the migration scanner, compact_checklock_irqsave() was called on each loop
iteration, if reached. If not reached, some lower-frequency checking could
still be done if the lock was already held, but this would not result in
aborting contended async compaction until reaching compact_checklock_irqsave()
or the end of the pageblock. In the free scanner, the situation was similar but
completely without the periodic checking, so the lock could potentially be held
until the end of the pageblock.

After the patch, in both scanners:

The periodic check is done as the first thing in the loop on each
SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
function, which always unlocks the lock (if locked) and aborts async compaction
if scheduling is needed. It also aborts any type of compaction when a fatal
signal is pending.

The compact_checklock_irqsave() function is replaced with a slightly different
compact_trylock_irqsave(). The biggest difference is that the function is not
called at all if the lock is already held. The periodic need_resched()
checking is left solely to compact_unlock_should_abort(). The lock contention
avoidance for async compaction is achieved by the periodic unlock in
compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave(),
aborting when the trylock fails. Sync compaction does not use trylock.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 114 ++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 73 insertions(+), 41 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e8cfac9..40da812 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -180,54 +180,72 @@ static void update_pageblock_skip(struct compact_control *cc,
}
#endif /* CONFIG_COMPACTION */

-enum compact_contended should_release_lock(spinlock_t *lock)
+/*
+ * Compaction requires the taking of some coarse locks that are potentially
+ * very heavily contended. For async compaction, back out if the lock cannot
+ * be taken immediately. For sync compaction, spin on the lock if needed.
+ *
+ * Returns true if the lock is held
+ * Returns false if the lock is not held and compaction should abort
+ */
+static bool compact_trylock_irqsave(spinlock_t *lock,
+ unsigned long *flags, struct compact_control *cc)
{
- if (spin_is_contended(lock))
- return COMPACT_CONTENDED_LOCK;
- else if (need_resched())
- return COMPACT_CONTENDED_SCHED;
- else
- return COMPACT_CONTENDED_NONE;
+ if (cc->mode == MIGRATE_ASYNC) {
+ if (!spin_trylock_irqsave(lock, *flags)) {
+ cc->contended = COMPACT_CONTENDED_LOCK;
+ return false;
+ }
+ } else {
+ spin_lock_irqsave(lock, *flags);
+ }
+
+ return true;
}

/*
* Compaction requires the taking of some coarse locks that are potentially
- * very heavily contended. Check if the process needs to be scheduled or
- * if the lock is contended. For async compaction, back out in the event
- * if contention is severe. For sync compaction, schedule.
+ * very heavily contended. The lock should be periodically unlocked to avoid
+ * having disabled IRQs for a long time, even when there is nobody waiting on
+ * the lock. It might also be that allowing the IRQs will result in
+ * need_resched() becoming true. If scheduling is needed, async compaction
+ * aborts. Sync compaction schedules.
+ * Either compaction type will also abort if a fatal signal is pending.
+ * In either case if the lock was locked, it is dropped and not regained.
*
- * Returns true if the lock is held.
- * Returns false if the lock is released and compaction should abort
+ * Returns true if compaction should abort due to fatal signal pending, or
+ * async compaction due to need_resched()
+ * Returns false when compaction can continue (sync compaction might have
+ * scheduled)
*/
-static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
- bool locked, struct compact_control *cc)
+static bool compact_unlock_should_abort(spinlock_t *lock,
+ unsigned long flags, bool *locked, struct compact_control *cc)
{
- enum compact_contended contended = should_release_lock(lock);
+ if (*locked) {
+ spin_unlock_irqrestore(lock, flags);
+ *locked = false;
+ }

- if (contended) {
- if (locked) {
- spin_unlock_irqrestore(lock, *flags);
- locked = false;
- }
+ if (fatal_signal_pending(current)) {
+ cc->contended = COMPACT_CONTENDED_SCHED;
+ return true;
+ }

- /* async aborts if taking too long or contended */
+ if (need_resched()) {
if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = contended;
- return false;
+ cc->contended = COMPACT_CONTENDED_SCHED;
+ return true;
}
-
cond_resched();
}

- if (!locked)
- spin_lock_irqsave(lock, *flags);
- return true;
+ return false;
}

/*
* Aside from avoiding lock contention, compaction also periodically checks
* need_resched() and either schedules in sync compaction or aborts async
- * compaction. This is similar to what compact_checklock_irqsave() does, but
+ * compaction. This is similar to what compact_unlock_should_abort() does, but
* is used where no lock is concerned.
*
* Returns false when no scheduling was needed, or sync compaction scheduled.
@@ -286,6 +304,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
int isolated, i;
struct page *page = cursor;

+ /*
+ * Periodically drop the lock (if held) regardless of its
+ * contention, to give chance to IRQs. Abort async compaction
+ * if contended.
+ */
+ if (!(blockpfn % SWAP_CLUSTER_MAX)
+ && compact_unlock_should_abort(&cc->zone->lock, flags,
+ &locked, cc))
+ break;
+
nr_scanned++;
if (!pfn_valid_within(blockpfn))
goto isolate_fail;
@@ -303,8 +331,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
* spin on the lock and we acquire the lock as late as
* possible.
*/
- locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
- locked, cc);
+ if (!locked)
+ locked = compact_trylock_irqsave(&cc->zone->lock,
+ &flags, cc);
if (!locked)
break;

@@ -506,13 +535,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,

/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
- /* give a chance to irqs before checking need_resched() */
- if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
- if (should_release_lock(&zone->lru_lock)) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- locked = false;
- }
- }
+ /*
+ * Periodically drop the lock (if held) regardless of its
+ * contention, to give chance to IRQs. Abort async compaction
+ * if contended.
+ */
+ if (!(low_pfn % SWAP_CLUSTER_MAX)
+ && compact_unlock_should_abort(&zone->lru_lock, flags,
+ &locked, cc))
+ break;

/*
* migrate_pfn does not necessarily start aligned to a
@@ -592,10 +623,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
page_count(page) > page_mapcount(page))
continue;

- /* Check if it is ok to still hold the lock */
- locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
- locked, cc);
- if (!locked || fatal_signal_pending(current))
+ /* If the lock is not held, try to take it */
+ if (!locked)
+ locked = compact_trylock_irqsave(&zone->lru_lock,
+ &flags, cc);
+ if (!locked)
break;

/* Recheck PageLRU and PageTransHuge under lock */
--
1.8.4.5

2014-06-20 15:55:23

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

Async compaction aborts when it detects zone lock contention or need_resched()
is true. David Rientjes has reported that in practice, most direct async
compactions for THP allocation abort due to need_resched(). This means that a
second direct compaction is never attempted, which might be OK for a page
fault, but khugepaged is intended to attempt a sync compaction in such a case,
and with this behavior it won't.

This patch replaces "bool contended" in compact_control with an enum that
distinguishes between aborting due to need_resched() and aborting due to lock
contention. This allows propagating the abort through all compaction functions
as before, but declaring the direct compaction as contended only when lock
contention has been detected.

A second problem is that try_to_compact_pages() did not act upon the reported
contention (either need_resched() or lock contention) and could proceed with
another zone from the zonelist. When need_resched() is true, that means
initializing compaction for another zone, only to check need_resched() again in
isolate_migratepages() and abort. For zone lock contention, the unintended
consequence is that the contended status reported back to the allocator
is decided from the last zone where compaction was attempted, which is rather
arbitrary.

This patch fixes the problem in the following way:
- need_resched() being true after async compaction returned from a zone means
that further zones should not be tried. We do a cond_resched() so that we
do not hog the CPU, and abort. "contended" is reported as false, since we
did not fail due to lock contention.
- aborting zone compaction due to lock contention means we can still try
another zone, since it has different locks. We report back "contended" as
true only if compaction aborted due to lock contention in *all* zones where
it was attempted.

As a result of these fixes, khugepaged will proceed with second sync compaction
as intended, when the preceding async compaction aborted due to need_resched().
Page fault compactions aborting due to need_resched() will spare some cycles
previously wasted by initializing another zone compaction only to abort again.
Lock contention will be reported only when compaction in all zones aborted due
to lock contention, and therefore it's not a good idea to try again after
reclaim.

Reported-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
---
mm/compaction.c | 48 +++++++++++++++++++++++++++++++++++++++---------
mm/internal.h | 15 +++++++++++----
2 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ebe30c9..e8cfac9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -180,9 +180,14 @@ static void update_pageblock_skip(struct compact_control *cc,
}
#endif /* CONFIG_COMPACTION */

-static inline bool should_release_lock(spinlock_t *lock)
+enum compact_contended should_release_lock(spinlock_t *lock)
{
- return need_resched() || spin_is_contended(lock);
+ if (spin_is_contended(lock))
+ return COMPACT_CONTENDED_LOCK;
+ else if (need_resched())
+ return COMPACT_CONTENDED_SCHED;
+ else
+ return COMPACT_CONTENDED_NONE;
}

/*
@@ -197,7 +202,9 @@ static inline bool should_release_lock(spinlock_t *lock)
static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
bool locked, struct compact_control *cc)
{
- if (should_release_lock(lock)) {
+ enum compact_contended contended = should_release_lock(lock);
+
+ if (contended) {
if (locked) {
spin_unlock_irqrestore(lock, *flags);
locked = false;
@@ -205,7 +212,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,

/* async aborts if taking too long or contended */
if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = true;
+ cc->contended = contended;
return false;
}

@@ -231,7 +238,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
/* async compaction aborts if contended */
if (need_resched()) {
if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = true;
+ cc->contended = COMPACT_CONTENDED_SCHED;
return true;
}

@@ -1101,7 +1108,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));

- *contended = cc.contended;
+ /* We only signal lock contention back to the allocator */
+ *contended = cc.contended == COMPACT_CONTENDED_LOCK;
return ret;
}

@@ -1132,6 +1140,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zone *zone;
int rc = COMPACT_SKIPPED;
int alloc_flags = 0;
+ bool all_zones_contended = true;

/* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
@@ -1146,6 +1155,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
nodemask) {
int status;
+ bool zone_contended;

if (compaction_deferred(zone, order))
continue;
@@ -1153,8 +1163,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
*deferred = false;

status = compact_zone_order(zone, order, gfp_mask, mode,
- contended);
+ &zone_contended);
rc = max(status, rc);
+ all_zones_contended &= zone_contended;

/* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
@@ -1168,12 +1179,31 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
* succeeding after all, it will be reset.
*/
defer_compaction(zone, order);
+ /*
+ * If we stopped compacting due to need_resched(), do
+ * not try further zones and yield the CPU.
+ */
+ if (need_resched()) {
+ /*
+ * We might not have tried all the zones, so
+ * be conservative and assume they are not
+ * all lock contended.
+ */
+ all_zones_contended = false;
+ cond_resched();
+ break;
+ }
}
}

- /* If at least one zone wasn't deferred, we count a compaction stall */
- if (!*deferred)
+ /*
+ * If at least one zone wasn't deferred, we count a compaction stall
+ * and we report if all zones that were tried were contended.
+ */
+ if (!*deferred) {
count_compact_event(COMPACTSTALL);
+ *contended = all_zones_contended;
+ }

return rc;
}
diff --git a/mm/internal.h b/mm/internal.h
index a1b651b..2c187d2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -117,6 +117,13 @@ extern int user_min_free_kbytes;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA

+/* Used to signal whether compaction detected need_sched() or lock contention */
+enum compact_contended {
+ COMPACT_CONTENDED_NONE = 0, /* no contention detected */
+ COMPACT_CONTENDED_SCHED, /* need_sched() was true */
+ COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
+};
+
/*
* in mm/compaction.c
*/
@@ -144,10 +151,10 @@ struct compact_control {
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
- bool contended; /* True if a lock was contended, or
- * need_resched() true during async
- * compaction
- */
+ enum compact_contended contended; /* Signal need_sched() or lock
+ * contention detected during
+ * compaction
+ */
};

unsigned long
--
1.8.4.5

2014-06-20 17:46:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
> for reading on the mm where collapsing occurs. Afterwards the read lock is
> dropped before write lock is taken on the same mmap_sem.
>
> Holding mmap_sem during whole huge page allocation is therefore useless, the
> vma needs to be rechecked after taking the write lock anyway. Furthemore, huge
> page allocation might involve a rather long sync compaction, and thus block
> any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
> or mprotect oterations.
>
> This patch simply releases the read lock before allocating a huge page. It
> also deletes an outdated comment that assumed vma must be stable, as it was
> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
> ("mm: thp: khugepaged: add policy for finding target node").

There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
all. Please, move up_read() outside khugepaged_alloc_page().

--
Kirill A. Shutemov

2014-06-23 01:38:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

Hello Vlastimil,

On Fri, Jun 20, 2014 at 05:49:35PM +0200, Vlastimil Babka wrote:
> Async compaction aborts when it detects zone lock contention or need_resched()
> is true. David Rientjes has reported that in practice, most direct async
> compactions for THP allocation abort due to need_resched(). This means that a
> second direct compaction is never attempted, which might be OK for a page
> fault, but khugepaged is intended to attempt a sync compaction in such case and
> in these cases it won't.
>
> This patch replaces "bool contended" in compact_control with an enum that
> distinguishes between aborting due to need_resched() and aborting due to lock
> contention. This allows propagating the abort through all compaction functions
> as before, but declaring the direct compaction as contended only when lock
> contention has been detected.
>
> A second problem is that try_to_compact_pages() did not act upon the reported
> contention (both need_resched() or lock contention) and could proceed with
> another zone from the zonelist. When need_resched() is true, that means
> initializing another zone compaction, only to check again need_resched() in
> isolate_migratepages() and aborting. For zone lock contention, the unintended
> consequence is that the contended status reported back to the allocator
> is decided from the last zone where compaction was attempted, which is rather
> arbitrary.
>
> This patch fixes the problem in the following way:
> - need_resched() being true after async compaction returned from a zone means
> that further zones should not be tried. We do a cond_resched() so that we
> do not hog the CPU, and abort. "contended" is reported as false, since we
> did not fail due to lock contention.
> - aborting zone compaction due to lock contention means we can still try
> another zone, since it has different locks. We report back "contended" as
> true only if compaction aborted due to lock contention in *all* zones where
> it was attempted.
>
> As a result of these fixes, khugepaged will proceed with second sync compaction
> as intended, when the preceding async compaction aborted due to need_resched().
> Page fault compactions aborting due to need_resched() will spare some cycles
> previously wasted by initializing another zone compaction only to abort again.
> Lock contention will be reported only when compaction in all zones aborted due
> to lock contention, and therefore it's not a good idea to try again after
> reclaim.
>
> Reported-by: David Rientjes <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> ---
> mm/compaction.c | 48 +++++++++++++++++++++++++++++++++++++++---------
> mm/internal.h | 15 +++++++++++----
> 2 files changed, 50 insertions(+), 13 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index ebe30c9..e8cfac9 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -180,9 +180,14 @@ static void update_pageblock_skip(struct compact_control *cc,
> }
> #endif /* CONFIG_COMPACTION */
>
> -static inline bool should_release_lock(spinlock_t *lock)
> +enum compact_contended should_release_lock(spinlock_t *lock)
> {
> - return need_resched() || spin_is_contended(lock);
> + if (spin_is_contended(lock))
> + return COMPACT_CONTENDED_LOCK;
> + else if (need_resched())
> + return COMPACT_CONTENDED_SCHED;
> + else
> + return COMPACT_CONTENDED_NONE;

If you intentionally give lock contention higher priority than need_resched(),
please write that down in a comment.

> }
>
> /*
> @@ -197,7 +202,9 @@ static inline bool should_release_lock(spinlock_t *lock)
> static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> bool locked, struct compact_control *cc)
> {
> - if (should_release_lock(lock)) {
> + enum compact_contended contended = should_release_lock(lock);
> +
> + if (contended) {
> if (locked) {
> spin_unlock_irqrestore(lock, *flags);
> locked = false;
> @@ -205,7 +212,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>
> /* async aborts if taking too long or contended */
> if (cc->mode == MIGRATE_ASYNC) {
> - cc->contended = true;
> + cc->contended = contended;
> return false;
> }


>
> @@ -231,7 +238,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
> /* async compaction aborts if contended */
> if (need_resched()) {
> if (cc->mode == MIGRATE_ASYNC) {
> - cc->contended = true;
> + cc->contended = COMPACT_CONTENDED_SCHED;
> return true;
> }
>
> @@ -1101,7 +1108,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> VM_BUG_ON(!list_empty(&cc.freepages));
> VM_BUG_ON(!list_empty(&cc.migratepages));
>
> - *contended = cc.contended;
> + /* We only signal lock contention back to the allocator */
> + *contended = cc.contended == COMPACT_CONTENDED_LOCK;

Please write down *WHY* as well; the intention itself we can already see by looking at the code.

> return ret;
> }
>
> @@ -1132,6 +1140,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> struct zone *zone;
> int rc = COMPACT_SKIPPED;
> int alloc_flags = 0;
> + bool all_zones_contended = true;
>
> /* Check if the GFP flags allow compaction */
> if (!order || !may_enter_fs || !may_perform_io)
> @@ -1146,6 +1155,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> nodemask) {
> int status;
> + bool zone_contended;
>
> if (compaction_deferred(zone, order))
> continue;
> @@ -1153,8 +1163,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> *deferred = false;
>
> status = compact_zone_order(zone, order, gfp_mask, mode,
> - contended);
> + &zone_contended);
> rc = max(status, rc);
> + all_zones_contended &= zone_contended;
>
> /* If a normal allocation would succeed, stop compacting */
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> @@ -1168,12 +1179,31 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> * succeeding after all, it will be reset.
> */
> defer_compaction(zone, order);
> + /*
> + * If we stopped compacting due to need_resched(), do
> + * not try further zones and yield the CPU.
> + */

For what reason? Stating it would make your claim clearer.

> + if (need_resched()) {

compact_zone_order() reports contended as true only for lock contention, so
it cannot report contention caused by need_resched(); that is why you added
the need_resched() check here. That looks fragile to me, because the value of
need_resched() here may not reflect the result of the preceding
compact_zone_order() call. It would be clearer for compact_zone_order() to
return zone_contended as an enum, not a bool, and to check that enum here.

In other words, return the enum from compact_zone_order() and convert the
result to a bool in try_to_compact_pages().

> + /*
> + * We might not have tried all the zones, so
> + * be conservative and assume they are not
> + * all lock contended.
> + */
> + all_zones_contended = false;
> + cond_resched();
> + break;
> + }
> }
> }
>
> - /* If at least one zone wasn't deferred, we count a compaction stall */
> - if (!*deferred)
> + /*
> + * If at least one zone wasn't deferred, we count a compaction stall
> + * and we report if all zones that were tried were contended.
> + */
> + if (!*deferred) {
> count_compact_event(COMPACTSTALL);
> + *contended = all_zones_contended;

Why not initialize *contended to false at the start of the function?

> + }
>
> return rc;
> }
> diff --git a/mm/internal.h b/mm/internal.h
> index a1b651b..2c187d2 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>
> +/* Used to signal whether compaction detected need_resched() or lock contention */
> +enum compact_contended {
> + COMPACT_CONTENDED_NONE = 0, /* no contention detected */
> + COMPACT_CONTENDED_SCHED, /* need_resched() was true */
> + COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
> +};
> +
> /*
> * in mm/compaction.c
> */
> @@ -144,10 +151,10 @@ struct compact_control {
> int order; /* order a direct compactor needs */
> int migratetype; /* MOVABLE, RECLAIMABLE etc */
> struct zone *zone;
> - bool contended; /* True if a lock was contended, or
> - * need_resched() true during async
> - * compaction
> - */
> + enum compact_contended contended; /* Signal need_resched() or lock
> + * contention detected during
> + * compaction
> + */
> };
>
> unsigned long
> --

Anyway, my biggest concern is that you are changing the current behavior, as
I said earlier.

The old behavior on a THP page fault, when the task had used up its
timeslice, was simply to abort and fall back to a 4K page. With your patch,
the new behavior is to take a rest when need_resched() is found, and then go
another round with async, not sync, compaction. I'm not sure we need another
round of async compaction at the cost of increased latency, rather than just
falling back to a 4K page.

It might be okay if the VMA has MADV_HUGEPAGE, which is a good hint that the
VMA is not temporary, so the latency would be a fair trade-off; but it is not
okay for a temporary big memory allocation on a HUGEPAGE_ALWAYS system.

If you really want to go this way, could you show us numbers?

1. How much more successful can direct compaction be with this patch?
2. How much does latency increase for a temporary allocation
on a HUGEPAGE_ALWAYS system?

--
Kind regards,
Minchan Kim

2014-06-23 02:23:57

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

On Fri, Jun 20, 2014 at 05:49:32PM +0200, Vlastimil Babka wrote:
> When direct sync compaction is often unsuccessful, it may become deferred for
> some time to avoid further useless attempts, both sync and async. Successful
> high-order allocations un-defer compaction, while further unsuccessful
> compaction attempts prolong the compaction deferred period.
>
> Currently the checking and setting deferred status is performed only on the
> preferred zone of the allocation that invoked direct compaction. But compaction
> itself is attempted on all eligible zones in the zonelist, so the behavior is
> suboptimal and may lead both to scenarios where 1) compaction is attempted
> uselessly, or 2) where it's not attempted despite good chances of succeeding,
> as shown on the examples below:
>
> 1) A direct compaction with Normal preferred zone failed and set deferred
> compaction for the Normal zone. Another unrelated direct compaction with
> DMA32 as preferred zone will attempt to compact DMA32 zone even though
> the first compaction attempt also included DMA32 zone.
>
> In another scenario, compaction with Normal preferred zone failed to compact
> Normal zone, but succeeded in the DMA32 zone, so it will not defer
> compaction. In the next attempt, it will try Normal zone which will fail
> again, instead of skipping Normal zone and trying DMA32 directly.
>
> 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
> looking good. A direct compaction with preferred Normal zone will skip
> compaction of all zones including DMA32 because Normal was still deferred.
> The allocation might have succeeded in DMA32, but won't.
>
> This patch makes compaction deferring work on individual zone basis instead of
> preferred zone. For each zone, it checks compaction_deferred() to decide if the
> zone should be skipped. If watermarks fail after compacting the zone,
> defer_compaction() is called. The zone where watermarks passed can still be
> deferred when the allocation attempt is unsuccessful. When allocation is
> successful, compaction_defer_reset() is called for the zone containing the
> allocated page. This approach should approximate calling defer_compaction()
> only on zones where compaction was attempted and did not yield allocated page.
> There might be corner cases but that is inevitable as long as the decision
> to stop compacting does not guarantee that a page will be allocated.
>
> During testing on a two-node machine with a single very small Normal zone on
> node 1, this patch has improved success rates in stress-highalloc mmtests
> benchmark. The success rate here was previously made worse by commit 3a025760fc
> ("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was
> no longer resetting often enough the deferred compaction for the Normal zone,
> and DMA32 zones on both nodes were thus not considered for compaction.
>
> Signed-off-by: Vlastimil Babka <[email protected]>

Nice job!

Acked-by: Minchan Kim <[email protected]>

Below is just a nitpick.

> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> include/linux/compaction.h | 6 ++++--
> mm/compaction.c | 29 ++++++++++++++++++++++++-----
> mm/page_alloc.c | 33 ++++++++++++++++++---------------
> 3 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 01e3132..76f9beb 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> - enum migrate_mode mode, bool *contended);
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern unsigned long compaction_suitable(struct zone *zone, int order);
> @@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> #else
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone)
> {
> return COMPACT_CONTINUE;
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5175019..7c491d0 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1122,13 +1122,15 @@ int sysctl_extfrag_threshold = 500;
> * @nodemask: The allowed nodes to allocate from
> * @mode: The migration mode for async, sync light, or sync migration
> * @contended: Return value that is true if compaction was aborted due to lock contention
> - * @page: Optionally capture a free page of the requested order during compaction
> + * @deferred: Return value that is true if compaction was deferred in all zones
> + * @candidate_zone: Return the zone where we think allocation should succeed
> *
> * This is the main entry point for direct page compaction.
> */
> unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone)
> {
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> int may_enter_fs = gfp_mask & __GFP_FS;
> @@ -1142,8 +1144,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> if (!order || !may_enter_fs || !may_perform_io)
> return rc;
>
> - count_compact_event(COMPACTSTALL);
> -
> + *deferred = true;
> #ifdef CONFIG_CMA
> if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> @@ -1153,16 +1154,34 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> nodemask) {
> int status;
>
> + if (compaction_deferred(zone, order))
> + continue;
> +
> + *deferred = false;
> +
> status = compact_zone_order(zone, order, gfp_mask, mode,
> contended);
> rc = max(status, rc);
>
> /* If a normal allocation would succeed, stop compacting */
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> - alloc_flags))
> + alloc_flags)) {
> + *candidate_zone = zone;
> break;
> + } else if (mode != MIGRATE_ASYNC) {
> + /*
> + * We think that allocation won't succeed in this zone
> + * so we defer compaction there. If it ends up
> + * succeeding after all, it will be reset.
> + */
> + defer_compaction(zone, order);
> + }
> }
>
> + /* If at least one zone wasn't deferred, we count a compaction stall */

I prefer a positive sentence:

/* Once we have tried to compact at least one zone, count a compaction stall */


> + if (!*deferred)
> + count_compact_event(COMPACTSTALL);
> +
> return rc;
> }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee92384..6593f79 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2238,18 +2238,17 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> bool *contended_compaction, bool *deferred_compaction,
> unsigned long *did_some_progress)
> {
> - if (!order)
> - return NULL;
> + struct zone *last_compact_zone = NULL;
>
> - if (compaction_deferred(preferred_zone, order)) {
> - *deferred_compaction = true;
> + if (!order)
> return NULL;
> - }
>
> current->flags |= PF_MEMALLOC;
> *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
> nodemask, mode,
> - contended_compaction);
> + contended_compaction,
> + deferred_compaction,
> + &last_compact_zone);
> current->flags &= ~PF_MEMALLOC;
>
> if (*did_some_progress != COMPACT_SKIPPED) {
> @@ -2263,27 +2262,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> order, zonelist, high_zoneidx,
> alloc_flags & ~ALLOC_NO_WATERMARKS,
> preferred_zone, classzone_idx, migratetype);
> +
> if (page) {
> - preferred_zone->compact_blockskip_flush = false;
> - compaction_defer_reset(preferred_zone, order, true);
> + struct zone *zone = page_zone(page);
> +
> + zone->compact_blockskip_flush = false;
> + compaction_defer_reset(zone, order, true);
> count_vm_event(COMPACTSUCCESS);
> return page;
> }
>
> /*
> + * last_compact_zone is where try_to_compact_pages thought
> + * allocation should succeed, so it did not defer compaction.
> + * But now we know that it didn't succeed, so we do the defer.
> + */
> + if (last_compact_zone && mode != MIGRATE_ASYNC)
> + defer_compaction(last_compact_zone, order);
> +
> + /*
> * It's bad if compaction run occurs and fails.
> * The most likely reason is that pages exist,
> * but not enough to satisfy watermarks.
> */
> count_vm_event(COMPACTFAIL);
>
> - /*
> - * As async compaction considers a subset of pageblocks, only
> - * defer if the failure was a sync compaction failure.
> - */
> - if (mode != MIGRATE_ASYNC)
> - defer_compaction(preferred_zone, order);
> -
> cond_resched();
> }
>
> --
> 1.8.4.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2014-06-23 02:52:50

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

On Fri, Jun 20, 2014 at 05:49:36PM +0200, Vlastimil Babka wrote:
> Compaction scanners regularly check for lock contention and need_resched()
> through the compact_checklock_irqsave() function. However, if there is no
> contention, the lock can be held and IRQ disabled for potentially long time.
>
> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
> time IRQs are disabled while isolating pages for migration") for the migration
> scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
> acquire the zone->lru_lock as late as possible") has changed the conditions so
> that the lock is dropped only when there's contention on the lock or
> need_resched() is true. Also, need_resched() is checked only when the lock is
> already held. The comment "give a chance to irqs before checking need_resched"
> is therefore misleading, as IRQs remain disabled when the check is done.
>
> This patch restores the behavior intended by commit b2eef8c0d0 and also tries
> to better balance and make more deterministic the time spent by checking for
> contention vs the time the scanners might run between the checks. It also
> avoids situations where checking has not been done often enough before. The
> result should be avoiding both too frequent and too infrequent contention
> checking, and especially the potentially long-running scans with IRQs disabled
> and no checking of need_resched() or for fatal signal pending, which can happen
> when many consecutive pages or pageblocks fail the preliminary tests and do not
> reach the later call site to compact_checklock_irqsave(), as explained below.
>
> Before the patch:
>
> In the migration scanner, compact_checklock_irqsave() was called each loop, if
> reached. If not reached, some lower-frequency checking could still be done if
> the lock was already held, but this would not result in aborting contended
> async compaction until reaching compact_checklock_irqsave() or end of
> pageblock. In the free scanner, it was similar but completely without the
> periodical checking, so lock can be potentially held until reaching the end of
> pageblock.
>
> After the patch, in both scanners:
>
> The periodical check is done as the first thing in the loop on each
> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
> function, which always unlocks the lock (if locked) and aborts async compaction
> if scheduling is needed. It also aborts any type of compaction when a fatal
> signal is pending.
>
> The compact_checklock_irqsave() function is replaced with a slightly different
> compact_trylock_irqsave(). The biggest difference is that the function is not
> called at all if the lock is already held. The periodical need_resched()
> checking is left solely to compact_unlock_should_abort(). The lock contention
> avoidance for async compaction is achieved by the periodical unlock by
> compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave()
> and aborting when trylock fails. Sync compaction does not use trylock.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> mm/compaction.c | 114 ++++++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 73 insertions(+), 41 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e8cfac9..40da812 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -180,54 +180,72 @@ static void update_pageblock_skip(struct compact_control *cc,
> }
> #endif /* CONFIG_COMPACTION */
>
> -enum compact_contended should_release_lock(spinlock_t *lock)
> +/*
> + * Compaction requires the taking of some coarse locks that are potentially
> + * very heavily contended. For async compaction, back out if the lock cannot
> + * be taken immediately. For sync compaction, spin on the lock if needed.
> + *
> + * Returns true if the lock is held
> + * Returns false if the lock is not held and compaction should abort
> + */
> +static bool compact_trylock_irqsave(spinlock_t *lock,
> + unsigned long *flags, struct compact_control *cc)
> {
> - if (spin_is_contended(lock))
> - return COMPACT_CONTENDED_LOCK;
> - else if (need_resched())
> - return COMPACT_CONTENDED_SCHED;
> - else
> - return COMPACT_CONTENDED_NONE;
> + if (cc->mode == MIGRATE_ASYNC) {
> + if (!spin_trylock_irqsave(lock, *flags)) {
> + cc->contended = COMPACT_CONTENDED_LOCK;
> + return false;
> + }
> + } else {
> + spin_lock_irqsave(lock, *flags);
> + }
> +
> + return true;
> }
>
> /*
> * Compaction requires the taking of some coarse locks that are potentially
> - * very heavily contended. Check if the process needs to be scheduled or
> - * if the lock is contended. For async compaction, back out in the event
> - * if contention is severe. For sync compaction, schedule.
> + * very heavily contended. The lock should be periodically unlocked to avoid
> + * having disabled IRQs for a long time, even when there is nobody waiting on
> + * the lock. It might also be that allowing the IRQs will result in
> + * need_resched() becoming true. If scheduling is needed, async compaction
> + * aborts. Sync compaction schedules.
> + * Either compaction type will also abort if a fatal signal is pending.
> + * In either case if the lock was locked, it is dropped and not regained.
> *
> - * Returns true if the lock is held.
> - * Returns false if the lock is released and compaction should abort
> + * Returns true if compaction should abort due to fatal signal pending, or
> + * async compaction due to need_resched()
> + * Returns false when compaction can continue (sync compaction might have
> + * scheduled)
> */
> -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> - bool locked, struct compact_control *cc)
> +static bool compact_unlock_should_abort(spinlock_t *lock,
> + unsigned long flags, bool *locked, struct compact_control *cc)
> {
> - enum compact_contended contended = should_release_lock(lock);
> + if (*locked) {
> + spin_unlock_irqrestore(lock, flags);
> + *locked = false;
> + }
>
> - if (contended) {
> - if (locked) {
> - spin_unlock_irqrestore(lock, *flags);
> - locked = false;
> - }
> + if (fatal_signal_pending(current)) {
> + cc->contended = COMPACT_CONTENDED_SCHED;
> + return true;
> + }


Generally, this patch looks really good to me, but I wonder what happens
if we bail out due to a fatal signal. Will every path handle that correctly
and bail out of the direct compaction path? I don't think so, but anyway,
that would be another patch, so will you handle it later or include it in
this patch series?

If you want to handle it later, please add an XXX/TODO comment.
Anyway,

Acked-by: Minchan Kim <[email protected]>

>
> - /* async aborts if taking too long or contended */
> + if (need_resched()) {
> if (cc->mode == MIGRATE_ASYNC) {
> - cc->contended = contended;
> - return false;
> + cc->contended = COMPACT_CONTENDED_SCHED;
> + return true;
> }
> -
> cond_resched();
> }
>
> - if (!locked)
> - spin_lock_irqsave(lock, *flags);
> - return true;
> + return false;
> }
>
> /*
> * Aside from avoiding lock contention, compaction also periodically checks
> * need_resched() and either schedules in sync compaction or aborts async
> - * compaction. This is similar to what compact_checklock_irqsave() does, but
> + * compaction. This is similar to what compact_unlock_should_abort() does, but
> * is used where no lock is concerned.
> *
> * Returns false when no scheduling was needed, or sync compaction scheduled.
> @@ -286,6 +304,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> int isolated, i;
> struct page *page = cursor;
>
> + /*
> + * Periodically drop the lock (if held) regardless of its
> + * contention, to give chance to IRQs. Abort async compaction
> + * if contended.
> + */
> + if (!(blockpfn % SWAP_CLUSTER_MAX)
> + && compact_unlock_should_abort(&cc->zone->lock, flags,
> + &locked, cc))
> + break;
> +
> nr_scanned++;
> if (!pfn_valid_within(blockpfn))
> goto isolate_fail;
> @@ -303,8 +331,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> * spin on the lock and we acquire the lock as late as
> * possible.
> */
> - locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> - locked, cc);
> + if (!locked)
> + locked = compact_trylock_irqsave(&cc->zone->lock,
> + &flags, cc);
> if (!locked)
> break;
>
> @@ -506,13 +535,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>
> /* Time to isolate some pages for migration */
> for (; low_pfn < end_pfn; low_pfn++) {
> - /* give a chance to irqs before checking need_resched() */
> - if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
> - if (should_release_lock(&zone->lru_lock)) {
> - spin_unlock_irqrestore(&zone->lru_lock, flags);
> - locked = false;
> - }
> - }
> + /*
> + * Periodically drop the lock (if held) regardless of its
> + * contention, to give chance to IRQs. Abort async compaction
> + * if contended.
> + */
> + if (!(low_pfn % SWAP_CLUSTER_MAX)
> + && compact_unlock_should_abort(&zone->lru_lock, flags,
> + &locked, cc))
> + break;
>
> /*
> * migrate_pfn does not necessarily start aligned to a
> @@ -592,10 +623,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> page_count(page) > page_mapcount(page))
> continue;
>
> - /* Check if it is ok to still hold the lock */
> - locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> - locked, cc);
> - if (!locked || fatal_signal_pending(current))
> + /* If the lock is not held, try to take it */
> + if (!locked)
> + locked = compact_trylock_irqsave(&zone->lru_lock,
> + &flags, cc);
> + if (!locked)
> break;
>
> /* Recheck PageLRU and PageTransHuge under lock */
> --
> 1.8.4.5
>
> --

--
Kind regards,
Minchan Kim

2014-06-23 03:04:01

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 08/13] mm, compaction: remember position within pageblock in free pages scanner

On Fri, Jun 20, 2014 at 05:49:38PM +0200, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
>
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
>
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Acked-by: David Rientjes <[email protected]>
Acked-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2014-06-23 03:05:05

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 09/13] mm, compaction: skip buddy pages by their order in the migrate scanner

On Fri, Jun 20, 2014 at 05:49:39PM +0200, Vlastimil Babka wrote:
> The migration scanner skips PageBuddy pages, but does not consider their order
> as checking page_order() is generally unsafe without holding the zone->lock,
> and acquiring the lock just for the check wouldn't be a good tradeoff.
>
> Still, this could avoid some iterations over the rest of the buddy page, and
> if we are careful, the race window between PageBuddy() check and page_order()
> is small, and the worst thing that can happen is that we skip too much and miss
> some isolation candidates. This is not that bad, as compaction can already fail
> for many other reasons like parallel allocations, and those have much larger
> race window.
>
> This patch therefore makes the migration scanner obtain the buddy page order
> and use it to skip the whole buddy page, if the order appears to be in the
> valid range.
>
> It's important that the page_order() is read only once, so that the value used
> in the checks and in the pfn calculation is the same. But in theory the
> compiler can replace the local variable by multiple inlines of page_order().
> Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
> prevent this.
>
> Testing with stress-highalloc from mmtests shows a 15% reduction in number of
> pages scanned by migration scanner. This change is also a prerequisite for a
> later patch which is detecting when a cc->order block of pages contains
> non-buddy pages that cannot be isolated, and the scanner should thus skip to
> the next block immediately.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2014-06-23 03:05:22

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 11/13] mm, compaction: pass gfp mask to compact_control

On Fri, Jun 20, 2014 at 05:49:41PM +0200, Vlastimil Babka wrote:
> From: David Rientjes <[email protected]>
>
> struct compact_control currently converts the gfp mask to a migratetype, but we
> need the entire gfp mask in a follow-up patch.
>
> Pass the entire gfp mask as part of struct compact_control.
>
> Signed-off-by: David Rientjes <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>

Acked-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2014-06-23 05:39:42

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

Hello

On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>> dropped before write lock is taken on the same mmap_sem.
>>
>> Holding mmap_sem during whole huge page allocation is therefore useless, the
>> vma needs to be rechecked after taking the write lock anyway. Furthermore, huge
>> page allocation might involve a rather long sync compaction, and thus block
>> any mmap_sem writers, i.e. affect workloads that perform frequent m(un)map
>> or mprotect operations.
>>
>> This patch simply releases the read lock before allocating a huge page. It
>> also deletes an outdated comment that assumed vma must be stable, as it was
>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>> ("mm: thp: khugepaged: add policy for finding target node").
>
> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
> all. Please, move up_read() outside khugepaged_alloc_page().
>

I might be wrong, but if we up_read() in khugepaged_scan_pmd() and then go
around the for loop again to get the next vma and handle it, do we do that
without holding mmap_sem in any mode?

And when the loop ends, we have another up_read() in breakouterloop. What if
we have already released mmap_sem in collapse_huge_page()?

--
Thanks.
Zhang Yanfei

2014-06-23 06:26:47

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> When direct sync compaction is often unsuccessful, it may become deferred for
> some time to avoid further useless attempts, both sync and async. Successful
> high-order allocations un-defer compaction, while further unsuccessful
> compaction attempts prolong the compaction deferred period.
>
> Currently the checking and setting deferred status is performed only on the
> preferred zone of the allocation that invoked direct compaction. But compaction
> itself is attempted on all eligible zones in the zonelist, so the behavior is
> suboptimal and may lead both to scenarios where 1) compaction is attempted
> uselessly, or 2) where it's not attempted despite good chances of succeeding,
> as shown on the examples below:
>
> 1) A direct compaction with Normal preferred zone failed and set deferred
> compaction for the Normal zone. Another unrelated direct compaction with
> DMA32 as preferred zone will attempt to compact DMA32 zone even though
> the first compaction attempt also included DMA32 zone.
>
> In another scenario, compaction with Normal preferred zone failed to compact
> Normal zone, but succeeded in the DMA32 zone, so it will not defer
> compaction. In the next attempt, it will try Normal zone which will fail
> again, instead of skipping Normal zone and trying DMA32 directly.
>
> 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
> looking good. A direct compaction with preferred Normal zone will skip
> compaction of all zones including DMA32 because Normal was still deferred.
> The allocation might have succeeded in DMA32, but won't.
>
> This patch makes compaction deferring work on individual zone basis instead of
> preferred zone. For each zone, it checks compaction_deferred() to decide if the
> zone should be skipped. If watermarks fail after compacting the zone,
> defer_compaction() is called. The zone where watermarks passed can still be
> deferred when the allocation attempt is unsuccessful. When allocation is
> successful, compaction_defer_reset() is called for the zone containing the
> allocated page. This approach should approximate calling defer_compaction()
> only on zones where compaction was attempted and did not yield an allocated
> page.
> There might be corner cases but that is inevitable as long as the decision
> to stop compacting does not guarantee that a page will be allocated.
>
> During testing on a two-node machine with a single very small Normal zone on
> node 1, this patch has improved success rates in stress-highalloc mmtests
> benchmark. The success rates here were previously made worse by commit
> 3a025760fc ("mm: page_alloc: spill to remote nodes before waking kswapd"), as
> kswapd was no longer resetting deferred compaction for the Normal zone often
> enough, and the DMA32 zones on both nodes were thus not considered for
> compaction.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Really good.

Reviewed-by: Zhang Yanfei <[email protected]>

> ---
> include/linux/compaction.h | 6 ++++--
> mm/compaction.c | 29 ++++++++++++++++++++++++-----
> mm/page_alloc.c | 33 ++++++++++++++++++---------------
> 3 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 01e3132..76f9beb 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> - enum migrate_mode mode, bool *contended);
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern unsigned long compaction_suitable(struct zone *zone, int order);
> @@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> #else
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone)
> {
> return COMPACT_CONTINUE;
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5175019..7c491d0 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1122,13 +1122,15 @@ int sysctl_extfrag_threshold = 500;
> * @nodemask: The allowed nodes to allocate from
> * @mode: The migration mode for async, sync light, or sync migration
> * @contended: Return value that is true if compaction was aborted due to lock contention
> - * @page: Optionally capture a free page of the requested order during compaction
> + * @deferred: Return value that is true if compaction was deferred in all zones
> + * @candidate_zone: Return the zone where we think allocation should succeed
> *
> * This is the main entry point for direct page compaction.
> */
> unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone)
> {
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> int may_enter_fs = gfp_mask & __GFP_FS;
> @@ -1142,8 +1144,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> if (!order || !may_enter_fs || !may_perform_io)
> return rc;
>
> - count_compact_event(COMPACTSTALL);
> -
> + *deferred = true;
> #ifdef CONFIG_CMA
> if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> @@ -1153,16 +1154,34 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> nodemask) {
> int status;
>
> + if (compaction_deferred(zone, order))
> + continue;
> +
> + *deferred = false;
> +
> status = compact_zone_order(zone, order, gfp_mask, mode,
> contended);
> rc = max(status, rc);
>
> /* If a normal allocation would succeed, stop compacting */
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> - alloc_flags))
> + alloc_flags)) {
> + *candidate_zone = zone;
> break;
> + } else if (mode != MIGRATE_ASYNC) {
> + /*
> + * We think that allocation won't succeed in this zone
> + * so we defer compaction there. If it ends up
> + * succeeding after all, it will be reset.
> + */
> + defer_compaction(zone, order);
> + }
> }
>
> + /* If at least one zone wasn't deferred, we count a compaction stall */
> + if (!*deferred)
> + count_compact_event(COMPACTSTALL);
> +
> return rc;
> }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee92384..6593f79 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2238,18 +2238,17 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> bool *contended_compaction, bool *deferred_compaction,
> unsigned long *did_some_progress)
> {
> - if (!order)
> - return NULL;
> + struct zone *last_compact_zone = NULL;
>
> - if (compaction_deferred(preferred_zone, order)) {
> - *deferred_compaction = true;
> + if (!order)
> return NULL;
> - }
>
> current->flags |= PF_MEMALLOC;
> *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
> nodemask, mode,
> - contended_compaction);
> + contended_compaction,
> + deferred_compaction,
> + &last_compact_zone);
> current->flags &= ~PF_MEMALLOC;
>
> if (*did_some_progress != COMPACT_SKIPPED) {
> @@ -2263,27 +2262,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> order, zonelist, high_zoneidx,
> alloc_flags & ~ALLOC_NO_WATERMARKS,
> preferred_zone, classzone_idx, migratetype);
> +
> if (page) {
> - preferred_zone->compact_blockskip_flush = false;
> - compaction_defer_reset(preferred_zone, order, true);
> + struct zone *zone = page_zone(page);
> +
> + zone->compact_blockskip_flush = false;
> + compaction_defer_reset(zone, order, true);
> count_vm_event(COMPACTSUCCESS);
> return page;
> }
>
> /*
> + * last_compact_zone is where try_to_compact_pages thought
> + * allocation should succeed, so it did not defer compaction.
> + * But now we know that it didn't succeed, so we do the defer.
> + */
> + if (last_compact_zone && mode != MIGRATE_ASYNC)
> + defer_compaction(last_compact_zone, order);
> +
> + /*
> * It's bad if compaction run occurs and fails.
> * The most likely reason is that pages exist,
> * but not enough to satisfy watermarks.
> */
> count_vm_event(COMPACTFAIL);
>
> - /*
> - * As async compaction considers a subset of pageblocks, only
> - * defer if the failure was a sync compaction failure.
> - */
> - if (mode != MIGRATE_ASYNC)
> - defer_compaction(preferred_zone, order);
> -
> cond_resched();
> }
>
>


--
Thanks.
Zhang Yanfei

2014-06-23 06:58:14

by Zhang Yanfei

Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> isolate_migratepages_range() is the main function of the compaction scanner,
> called either on a single pageblock by isolate_migratepages() during regular
> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> It currently performs two pageblock-wide compaction suitability checks, and
> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> order to repeat those checks.
>
> However, closer inspection shows that those checks are always true for CMA:
> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> - migrate_async_suitable() check is skipped because CMA uses sync compaction
>
> We can therefore move the checks to isolate_migratepages(), reducing variables
> and simplifying isolate_migratepages_range(). The update_pageblock_skip()
> function also no longer needs the set_unsuitable parameter.
>
> Furthermore, going back to compact_zone() and compact_finished() when a
> pageblock is unsuitable is wasteful - the checks are meant to skip pageblocks
> quickly. The patch therefore also introduces a simple loop into
> isolate_migratepages() so that it does not return immediately on pageblock
> checks, but keeps going until isolate_migratepages_range() gets called once.
> Similarly to isolate_freepages(), the function periodically checks if it
> needs to reschedule or abort async compaction.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

I think this is a good clean-up to make code more clear.

Reviewed-by: Zhang Yanfei <[email protected]>

Only a tiny nit-pick below.

> ---
> mm/compaction.c | 112 +++++++++++++++++++++++++++++---------------------------
> 1 file changed, 59 insertions(+), 53 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 3064a7f..ebe30c9 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
> */
> static void update_pageblock_skip(struct compact_control *cc,
> struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
> {
> struct zone *zone = cc->zone;
> unsigned long pfn;
> @@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
> if (nr_isolated)
> return;
>
> - /*
> - * Only skip pageblocks when all forms of compaction will be known to
> - * fail in the near future.
> - */
> - if (set_unsuitable)
> - set_pageblock_skip(page);
> + set_pageblock_skip(page);
>
> pfn = page_to_pfn(page);
>
> @@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,
>
> static void update_pageblock_skip(struct compact_control *cc,
> struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
> {
> }
> #endif /* CONFIG_COMPACTION */
> @@ -345,8 +340,7 @@ isolate_fail:
>
> /* Update the pageblock-skip if the whole pageblock was scanned */
> if (blockpfn == end_pfn)
> - update_pageblock_skip(cc, valid_page, total_isolated, true,
> - false);
> + update_pageblock_skip(cc, valid_page, total_isolated, false);
>
> count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
> if (total_isolated)
> @@ -474,14 +468,12 @@ unsigned long
> isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
> {
> - unsigned long last_pageblock_nr = 0, pageblock_nr;
> unsigned long nr_scanned = 0, nr_isolated = 0;
> struct list_head *migratelist = &cc->migratepages;
> struct lruvec *lruvec;
> unsigned long flags;
> bool locked = false;
> struct page *page = NULL, *valid_page = NULL;
> - bool set_unsuitable = true;
> const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
> ISOLATE_ASYNC_MIGRATE : 0) |
> (unevictable ? ISOLATE_UNEVICTABLE : 0);
> @@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> if (!valid_page)
> valid_page = page;
>
> - /* If isolation recently failed, do not retry */
> - pageblock_nr = low_pfn >> pageblock_order;
> - if (last_pageblock_nr != pageblock_nr) {
> - int mt;
> -
> - last_pageblock_nr = pageblock_nr;
> - if (!isolation_suitable(cc, page))
> - goto next_pageblock;
> -
> - /*
> - * For async migration, also only scan in MOVABLE
> - * blocks. Async migration is optimistic to see if
> - * the minimum amount of work satisfies the allocation
> - */
> - mt = get_pageblock_migratetype(page);
> - if (cc->mode == MIGRATE_ASYNC &&
> - !migrate_async_suitable(mt)) {
> - set_unsuitable = false;
> - goto next_pageblock;
> - }
> - }
> -
> /*
> * Skip if free. page_order cannot be used without zone->lock
> * as nothing prevents parallel allocations or buddy merging.
> @@ -668,8 +638,7 @@ next_pageblock:
> * if the whole pageblock was scanned without isolating any page.
> */
> if (low_pfn == end_pfn)
> - update_pageblock_skip(cc, valid_page, nr_isolated,
> - set_unsuitable, true);
> + update_pageblock_skip(cc, valid_page, nr_isolated, true);
>
> trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>
> @@ -840,34 +809,74 @@ typedef enum {
> } isolate_migrate_t;
>
> /*
> - * Isolate all pages that can be migrated from the block pointed to by
> - * the migrate scanner within compact_control.
> + * Isolate all pages that can be migrated from the first suitable block,
> + * starting at the block pointed to by the migrate scanner pfn within
> + * compact_control.
> */
> static isolate_migrate_t isolate_migratepages(struct zone *zone,
> struct compact_control *cc)
> {
> unsigned long low_pfn, end_pfn;
> + struct page *page;
>
> - /* Do not scan outside zone boundaries */
> - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
> + /* Start at where we last stopped, or beginning of the zone */
> + low_pfn = cc->migrate_pfn;

This is ok since cc->migrate_pfn has been restricted to be inside the zone.
But the comment here may be confusing...

Thanks.

>
> /* Only scan within a pageblock boundary */
> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>
> - /* Do not cross the free scanner or scan within a memory hole */
> - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> - cc->migrate_pfn = end_pfn;
> - return ISOLATE_NONE;
> - }
> + /*
> + * Iterate over whole pageblocks until we find the first suitable.
> + * Do not cross the free scanner.
> + */
> + for (; end_pfn <= cc->free_pfn;
> + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
> +
> + /*
> + * This can potentially iterate a massively long zone with
> + * many pageblocks unsuitable, so periodically check if we
> + * need to schedule, or even abort async compaction.
> + */
> + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
> + && compact_should_abort(cc))
> + break;
>
> - /* Perform the isolation */
> - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
> - if (!low_pfn || cc->contended)
> - return ISOLATE_ABORT;
> + /* Do not scan within a memory hole */
> + if (!pfn_valid(low_pfn))
> + continue;
> +
> + page = pfn_to_page(low_pfn);
> + /* If isolation recently failed, do not retry */
> + if (!isolation_suitable(cc, page))
> + continue;
>
> + /*
> + * For async compaction, also only scan in MOVABLE blocks.
> + * Async compaction is optimistic to see if the minimum amount
> + * of work satisfies the allocation.
> + */
> + if (cc->mode == MIGRATE_ASYNC &&
> + !migrate_async_suitable(get_pageblock_migratetype(page)))
> + continue;
> +
> + /* Perform the isolation */
> + low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
> + end_pfn, false);
> + if (!low_pfn || cc->contended)
> + return ISOLATE_ABORT;
> +
> + /*
> + * Either we isolated something and proceed with migration. Or
> + * we failed and compact_zone should decide if we should
> + * continue or not.
> + */
> + break;
> + }
> +
> + /* Record where migration scanner will be restarted */
> cc->migrate_pfn = low_pfn;
>
> - return ISOLATE_SUCCESS;
> + return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
> }
>
> static int compact_finished(struct zone *zone,
> @@ -1040,9 +1049,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> ;
> }
>
> - if (!cc->nr_migratepages)
> - continue;
> -
> err = migrate_pages(&cc->migratepages, compaction_alloc,
> compaction_free, (unsigned long)cc, cc->mode,
> MR_COMPACTION);
>


--
Thanks.
Zhang Yanfei

2014-06-23 08:56:11

by Zhang Yanfei

Subject: Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

Hello Minchan,

On 06/23/2014 09:39 AM, Minchan Kim wrote:
> Hello Vlastimil,
>
> On Fri, Jun 20, 2014 at 05:49:35PM +0200, Vlastimil Babka wrote:
>> Async compaction aborts when it detects zone lock contention or need_resched()
>> is true. David Rientjes has reported that in practice, most direct async
>> compactions for THP allocation abort due to need_resched(). This means that a
>> second direct compaction is never attempted, which might be OK for a page
>> fault, but khugepaged is intended to attempt a sync compaction in such a
>> case, and in these cases it won't.
>>
>> This patch replaces "bool contended" in compact_control with an enum that
>> distinguishes between aborting due to need_resched() and aborting due to lock
>> contention. This allows propagating the abort through all compaction functions
>> as before, but declaring the direct compaction as contended only when lock
>> contention has been detected.
>>
>> A second problem is that try_to_compact_pages() did not act upon the reported
>> contention (both need_resched() or lock contention) and could proceed with
>> another zone from the zonelist. When need_resched() is true, that means
>> initializing another zone compaction, only to check again need_resched() in
>> isolate_migratepages() and aborting. For zone lock contention, the unintended
>> consequence is that the contended status reported back to the allocator
>> is decided from the last zone where compaction was attempted, which is rather
>> arbitrary.
>>
>> This patch fixes the problem in the following way:
>> - need_resched() being true after async compaction returned from a zone means
>> that further zones should not be tried. We do a cond_resched() so that we
>> do not hog the CPU, and abort. "contended" is reported as false, since we
>> did not fail due to lock contention.
>> - aborting zone compaction due to lock contention means we can still try
>> another zone, since it has different locks. We report back "contended" as
>> true only if *all* zones where compaction was attempted, it aborted due to
>> lock contention.
>>
>> As a result of these fixes, khugepaged will proceed with second sync compaction
>> as intended, when the preceding async compaction aborted due to need_resched().
>> Page fault compactions aborting due to need_resched() will spare some cycles
>> previously wasted by initializing another zone compaction only to abort again.
>> Lock contention will be reported only when compaction in all zones aborted due
>> to lock contention, and therefore it's not a good idea to try again after
>> reclaim.
>>
>> Reported-by: David Rientjes <[email protected]>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Joonsoo Kim <[email protected]>
>> Cc: Michal Nazarewicz <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> ---
>> mm/compaction.c | 48 +++++++++++++++++++++++++++++++++++++++---------
>> mm/internal.h | 15 +++++++++++----
>> 2 files changed, 50 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index ebe30c9..e8cfac9 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -180,9 +180,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>> }
>> #endif /* CONFIG_COMPACTION */
>>
>> -static inline bool should_release_lock(spinlock_t *lock)
>> +enum compact_contended should_release_lock(spinlock_t *lock)
>> {
>> - return need_resched() || spin_is_contended(lock);
>> + if (spin_is_contended(lock))
>> + return COMPACT_CONTENDED_LOCK;
>> + else if (need_resched())
>> + return COMPACT_CONTENDED_SCHED;
>> + else
>> + return COMPACT_CONTENDED_NONE;
>
> If you intentionally want lock contention to take priority over
> need_resched(), please write that down in a comment.
>
>> }
>>
>> /*
>> @@ -197,7 +202,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>> static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>> bool locked, struct compact_control *cc)
>> {
>> - if (should_release_lock(lock)) {
>> + enum compact_contended contended = should_release_lock(lock);
>> +
>> + if (contended) {
>> if (locked) {
>> spin_unlock_irqrestore(lock, *flags);
>> locked = false;
>> @@ -205,7 +212,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>
>> /* async aborts if taking too long or contended */
>> if (cc->mode == MIGRATE_ASYNC) {
>> - cc->contended = true;
>> + cc->contended = contended;
>> return false;
>> }
>
>
>>
>> @@ -231,7 +238,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>> /* async compaction aborts if contended */
>> if (need_resched()) {
>> if (cc->mode == MIGRATE_ASYNC) {
>> - cc->contended = true;
>> + cc->contended = COMPACT_CONTENDED_SCHED;
>> return true;
>> }
>>
>> @@ -1101,7 +1108,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>> VM_BUG_ON(!list_empty(&cc.freepages));
>> VM_BUG_ON(!list_empty(&cc.migratepages));
>>
>> - *contended = cc.contended;
>> + /* We only signal lock contention back to the allocator */
>> + *contended = cc.contended == COMPACT_CONTENDED_LOCK;
>
> Please write down *WHY* as well; the intention itself we can see by looking
> at the code.
>
>> return ret;
>> }
>>
>> @@ -1132,6 +1140,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> struct zone *zone;
>> int rc = COMPACT_SKIPPED;
>> int alloc_flags = 0;
>> + bool all_zones_contended = true;
>>
>> /* Check if the GFP flags allow compaction */
>> if (!order || !may_enter_fs || !may_perform_io)
>> @@ -1146,6 +1155,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
>> nodemask) {
>> int status;
>> + bool zone_contended;
>>
>> if (compaction_deferred(zone, order))
>> continue;
>> @@ -1153,8 +1163,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> *deferred = false;
>>
>> status = compact_zone_order(zone, order, gfp_mask, mode,
>> - contended);
>> + &zone_contended);
>> rc = max(status, rc);
>> + all_zones_contended &= zone_contended;
>>
>> /* If a normal allocation would succeed, stop compacting */
>> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
>> @@ -1168,12 +1179,31 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> * succeeding after all, it will be reset.
>> */
>> defer_compaction(zone, order);
>> + /*
>> + * If we stopped compacting due to need_resched(), do
>> + * not try further zones and yield the CPU.
>> + */
>
> For what? It would make your claim more clear.
>
>> + if (need_resched()) {
>
> compact_zone_order() reports contention only when it was lock contention, so
> it can't report a true contended state caused by need_resched(), which is why
> you added the need_resched() check here. That's fragile to me, because the
> condition might not be a result of the preceding compact_zone_order() call.
>
> It would be clearer if compact_zone_order() returned zone_contended as an
> enum, not a bool, and you checked it here, converting the result to bool only
> in try_to_compact_pages().
>
>> + /*
>> + * We might not have tried all the zones, so
>> + * be conservative and assume they are not
>> + * all lock contended.
>> + */
>> + all_zones_contended = false;
>> + cond_resched();
>> + break;
>> + }
>> }
>> }
>>
>> - /* If at least one zone wasn't deferred, we count a compaction stall */
>> - if (!*deferred)
>> + /*
>> + * If at least one zone wasn't deferred, we count a compaction stall
>> + * and we report if all zones that were tried were contended.
>> + */
>> + if (!*deferred) {
>> count_compact_event(COMPACTSTALL);
>> + *contended = all_zones_contended;
>
> Why don't you initialize contended to *false* at the start of the function?
>
>> + }
>>
>> return rc;
>> }
>> diff --git a/mm/internal.h b/mm/internal.h
>> index a1b651b..2c187d2 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>
>> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>
>> +/* Used to signal whether compaction detected need_sched() or lock contention */
>> +enum compact_contended {
>> + COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>> + COMPACT_CONTENDED_SCHED, /* need_sched() was true */
>> + COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
>> +};
>> +
>> /*
>> * in mm/compaction.c
>> */
>> @@ -144,10 +151,10 @@ struct compact_control {
>> int order; /* order a direct compactor needs */
>> int migratetype; /* MOVABLE, RECLAIMABLE etc */
>> struct zone *zone;
>> - bool contended; /* True if a lock was contended, or
>> - * need_resched() true during async
>> - * compaction
>> - */
>> + enum compact_contended contended; /* Signal need_sched() or lock
>> + * contention detected during
>> + * compaction
>> + */
>> };
>>
>> unsigned long
>> --
>
> Anyway, my biggest concern is that you are changing the current behavior, as
> I said earlier.
>
> The old behavior in a THP page fault, when it consumed its timeslice, was to
> just abort and fall back to a 4K page, but with your patch the new behavior
> is to take a rest when it finds need_resched() and go for another round of
> async, not sync, compaction. I'm not sure we need another round of async
> compaction at the cost of increased latency rather than falling back to a
> 4K page.

I don't see the new behavior working the way you describe. If need_resched()
is true, it calls cond_resched() and, after the rest, it just breaks out of
the loop. Why would there be another round of async compaction?

Thanks.

>
> It might be okay if the VMA has MADV_HUGEPAGE, which is a good hint of a
> long-lived VMA where the latency is a worthwhile trade-off, but that's not
> the case for a short-lived big memory allocation on a HUGEPAGE_ALWAYS system.
>
> If you really want to go this, could you show us numbers?
>
> 1. How much more often could direct compaction succeed with this patch?
> 2. How much latency could it add for a short-lived allocation
> on a HUGEPAGE_ALWAYS system?
>


--
Thanks.
Zhang Yanfei

2014-06-23 09:13:46

by Zhang Yanfei

Subject: Re: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Compaction scanners regularly check for lock contention and need_resched()
> through the compact_checklock_irqsave() function. However, if there is no
> contention, the lock can be held and IRQ disabled for potentially long time.
>
> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
> time IRQs are disabled while isolating pages for migration") for the migration
> scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
> acquire the zone->lru_lock as late as possible") has changed the conditions so
> that the lock is dropped only when there's contention on the lock or
> need_resched() is true. Also, need_resched() is checked only when the lock is
> already held. The comment "give a chance to irqs before checking need_resched"
> is therefore misleading, as IRQs remain disabled when the check is done.
>
> This patch restores the behavior intended by commit b2eef8c0d0 and also tries
> to better balance and make more deterministic the time spent by checking for
> contention vs the time the scanners might run between the checks. It also
> avoids situations where checking has not been done often enough before. The
> result should be avoiding both too frequent and too infrequent contention
> checking, and especially the potentially long-running scans with IRQs disabled
> and no checking of need_resched() or for fatal signal pending, which can happen
> when many consecutive pages or pageblocks fail the preliminary tests and do not
> reach the later call site to compact_checklock_irqsave(), as explained below.
>
> Before the patch:
>
> In the migration scanner, compact_checklock_irqsave() was called each loop, if
> reached. If not reached, some lower-frequency checking could still be done if
> the lock was already held, but this would not result in aborting contended
> async compaction until reaching compact_checklock_irqsave() or end of
> pageblock. In the free scanner, it was similar but completely without the
> periodical checking, so lock can be potentially held until reaching the end of
> pageblock.
>
> After the patch, in both scanners:
>
> The periodical check is done as the first thing in the loop on each
> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
> function, which always unlocks the lock (if locked) and aborts async compaction
> if scheduling is needed. It also aborts any type of compaction when a fatal
> signal is pending.
>
> The compact_checklock_irqsave() function is replaced with a slightly different
> compact_trylock_irqsave(). The biggest difference is that the function is not
> called at all if the lock is already held. The periodical need_resched()
> checking is left solely to compact_unlock_should_abort(). The lock contention
> avoidance for async compaction is achieved by the periodical unlock by
> compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave()
> and aborting when trylock fails. Sync compaction does not use trylock.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Reviewed-by: Zhang Yanfei <[email protected]>

> ---
> mm/compaction.c | 114 ++++++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 73 insertions(+), 41 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e8cfac9..40da812 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -180,54 +180,72 @@ static void update_pageblock_skip(struct compact_control *cc,
> }
> #endif /* CONFIG_COMPACTION */
>
> -enum compact_contended should_release_lock(spinlock_t *lock)
> +/*
> + * Compaction requires the taking of some coarse locks that are potentially
> + * very heavily contended. For async compaction, back out if the lock cannot
> + * be taken immediately. For sync compaction, spin on the lock if needed.
> + *
> + * Returns true if the lock is held
> + * Returns false if the lock is not held and compaction should abort
> + */
> +static bool compact_trylock_irqsave(spinlock_t *lock,
> + unsigned long *flags, struct compact_control *cc)
> {
> - if (spin_is_contended(lock))
> - return COMPACT_CONTENDED_LOCK;
> - else if (need_resched())
> - return COMPACT_CONTENDED_SCHED;
> - else
> - return COMPACT_CONTENDED_NONE;
> + if (cc->mode == MIGRATE_ASYNC) {
> + if (!spin_trylock_irqsave(lock, *flags)) {
> + cc->contended = COMPACT_CONTENDED_LOCK;
> + return false;
> + }
> + } else {
> + spin_lock_irqsave(lock, *flags);
> + }
> +
> + return true;
> }
>
> /*
> * Compaction requires the taking of some coarse locks that are potentially
> - * very heavily contended. Check if the process needs to be scheduled or
> - * if the lock is contended. For async compaction, back out in the event
> - * if contention is severe. For sync compaction, schedule.
> + * very heavily contended. The lock should be periodically unlocked to avoid
> + * having disabled IRQs for a long time, even when there is nobody waiting on
> + * the lock. It might also be that allowing the IRQs will result in
> + * need_resched() becoming true. If scheduling is needed, async compaction
> + * aborts. Sync compaction schedules.
> + * Either compaction type will also abort if a fatal signal is pending.
> + * In either case if the lock was locked, it is dropped and not regained.
> *
> - * Returns true if the lock is held.
> - * Returns false if the lock is released and compaction should abort
> + * Returns true if compaction should abort due to fatal signal pending, or
> + * async compaction due to need_resched()
> + * Returns false when compaction can continue (sync compaction might have
> + * scheduled)
> */
> -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> - bool locked, struct compact_control *cc)
> +static bool compact_unlock_should_abort(spinlock_t *lock,
> + unsigned long flags, bool *locked, struct compact_control *cc)
> {
> - enum compact_contended contended = should_release_lock(lock);
> + if (*locked) {
> + spin_unlock_irqrestore(lock, flags);
> + *locked = false;
> + }
>
> - if (contended) {
> - if (locked) {
> - spin_unlock_irqrestore(lock, *flags);
> - locked = false;
> - }
> + if (fatal_signal_pending(current)) {
> + cc->contended = COMPACT_CONTENDED_SCHED;
> + return true;
> + }
>
> - /* async aborts if taking too long or contended */
> + if (need_resched()) {
> if (cc->mode == MIGRATE_ASYNC) {
> - cc->contended = contended;
> - return false;
> + cc->contended = COMPACT_CONTENDED_SCHED;
> + return true;
> }
> -
> cond_resched();
> }
>
> - if (!locked)
> - spin_lock_irqsave(lock, *flags);
> - return true;
> + return false;
> }
>
> /*
> * Aside from avoiding lock contention, compaction also periodically checks
> * need_resched() and either schedules in sync compaction or aborts async
> - * compaction. This is similar to what compact_checklock_irqsave() does, but
> + * compaction. This is similar to what compact_unlock_should_abort() does, but
> * is used where no lock is concerned.
> *
> * Returns false when no scheduling was needed, or sync compaction scheduled.
> @@ -286,6 +304,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> int isolated, i;
> struct page *page = cursor;
>
> + /*
> + * Periodically drop the lock (if held) regardless of its
> + * contention, to give chance to IRQs. Abort async compaction
> + * if contended.
> + */
> + if (!(blockpfn % SWAP_CLUSTER_MAX)
> + && compact_unlock_should_abort(&cc->zone->lock, flags,
> + &locked, cc))
> + break;
> +
> nr_scanned++;
> if (!pfn_valid_within(blockpfn))
> goto isolate_fail;
> @@ -303,8 +331,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> * spin on the lock and we acquire the lock as late as
> * possible.
> */
> - locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> - locked, cc);
> + if (!locked)
> + locked = compact_trylock_irqsave(&cc->zone->lock,
> + &flags, cc);
> if (!locked)
> break;
>
> @@ -506,13 +535,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>
> /* Time to isolate some pages for migration */
> for (; low_pfn < end_pfn; low_pfn++) {
> - /* give a chance to irqs before checking need_resched() */
> - if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
> - if (should_release_lock(&zone->lru_lock)) {
> - spin_unlock_irqrestore(&zone->lru_lock, flags);
> - locked = false;
> - }
> - }
> + /*
> + * Periodically drop the lock (if held) regardless of its
> + * contention, to give chance to IRQs. Abort async compaction
> + * if contended.
> + */
> + if (!(low_pfn % SWAP_CLUSTER_MAX)
> + && compact_unlock_should_abort(&zone->lru_lock, flags,
> + &locked, cc))
> + break;
>
> /*
> * migrate_pfn does not necessarily start aligned to a
> @@ -592,10 +623,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> page_count(page) > page_mapcount(page))
> continue;
>
> - /* Check if it is ok to still hold the lock */
> - locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> - locked, cc);
> - if (!locked || fatal_signal_pending(current))
> + /* If the lock is not held, try to take it */
> + if (!locked)
> + locked = compact_trylock_irqsave(&zone->lru_lock,
> + &flags, cc);
> + if (!locked)
> break;
>
> /* Recheck PageLRU and PageTransHuge under lock */
>


--
Thanks.
Zhang Yanfei

2014-06-23 09:16:27

by Zhang Yanfei

Subject: Re: [PATCH v3 07/13] mm, compaction: skip rechecks when lock was already held

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Compaction scanners try to lock zone locks as late as possible by checking
> many page or pageblock properties opportunistically without lock and skipping
> them if unsuitable. For pages that pass the initial checks, some properties
> have to be checked again safely under lock. However, if the lock was already
> held from a previous iteration in the initial checks, the rechecks are
> unnecessary.
>
> This patch therefore skips the rechecks when the lock was already held. This is
> now possible to do, since we don't (potentially) drop and reacquire the lock
> between the initial checks and the safe rechecks anymore.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Acked-by: David Rientjes <[email protected]>

Reviewed-by: Zhang Yanfei <[email protected]>

> ---
> mm/compaction.c | 53 +++++++++++++++++++++++++++++++----------------------
> 1 file changed, 31 insertions(+), 22 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 40da812..9f6e857 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -324,22 +324,30 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> goto isolate_fail;
>
> /*
> - * The zone lock must be held to isolate freepages.
> - * Unfortunately this is a very coarse lock and can be
> - * heavily contended if there are parallel allocations
> - * or parallel compactions. For async compaction do not
> - * spin on the lock and we acquire the lock as late as
> - * possible.
> + * If we already hold the lock, we can skip some rechecking.
> + * Note that if we hold the lock now, checked_pageblock was
> + * already set in some previous iteration (or strict is true),
> + * so it is correct to skip the suitable migration target
> + * recheck as well.
> */
> - if (!locked)
> + if (!locked) {
> + /*
> + * The zone lock must be held to isolate freepages.
> + * Unfortunately this is a very coarse lock and can be
> + * heavily contended if there are parallel allocations
> + * or parallel compactions. For async compaction do not
> + * spin on the lock and we acquire the lock as late as
> + * possible.
> + */
> locked = compact_trylock_irqsave(&cc->zone->lock,
> &flags, cc);
> - if (!locked)
> - break;
> + if (!locked)
> + break;
>
> - /* Recheck this is a buddy page under lock */
> - if (!PageBuddy(page))
> - goto isolate_fail;
> + /* Recheck this is a buddy page under lock */
> + if (!PageBuddy(page))
> + goto isolate_fail;
> + }
>
> /* Found a free page, break it into order-0 pages */
> isolated = split_free_page(page);
> @@ -623,19 +631,20 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> page_count(page) > page_mapcount(page))
> continue;
>
> - /* If the lock is not held, try to take it */
> - if (!locked)
> + /* If we already hold the lock, we can skip some rechecking */
> + if (!locked) {
> locked = compact_trylock_irqsave(&zone->lru_lock,
> &flags, cc);
> - if (!locked)
> - break;
> + if (!locked)
> + break;
>
> - /* Recheck PageLRU and PageTransHuge under lock */
> - if (!PageLRU(page))
> - continue;
> - if (PageTransHuge(page)) {
> - low_pfn += (1 << compound_order(page)) - 1;
> - continue;
> + /* Recheck PageLRU and PageTransHuge under lock */
> + if (!PageLRU(page))
> + continue;
> + if (PageTransHuge(page)) {
> + low_pfn += (1 << compound_order(page)) - 1;
> + continue;
> + }
> }
>
> lruvec = mem_cgroup_page_lruvec(page, zone);
>


--
Thanks.
Zhang Yanfei

2014-06-23 09:17:41

by Zhang Yanfei

Subject: Re: [PATCH v3 08/13] mm, compaction: remember position within pageblock in free pages scanner

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might therefore be rescanning pages
> uselessly when called several times during a single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
>
> This patch changes the meaning of cc->free_pfn so that if it points to the
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
>
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Zhang Yanfei <[email protected]>

Reviewed-by: Zhang Yanfei <[email protected]>

> ---
> mm/compaction.c | 40 +++++++++++++++++++++++++++++++---------
> 1 file changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9f6e857..41c7005 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -287,7 +287,7 @@ static bool suitable_migration_target(struct page *page)
> * (even though it may still end up isolating some pages).
> */
> static unsigned long isolate_freepages_block(struct compact_control *cc,
> - unsigned long blockpfn,
> + unsigned long *start_pfn,
> unsigned long end_pfn,
> struct list_head *freelist,
> bool strict)
> @@ -296,6 +296,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> struct page *cursor, *valid_page = NULL;
> unsigned long flags;
> bool locked = false;
> + unsigned long blockpfn = *start_pfn;
>
> cursor = pfn_to_page(blockpfn);
>
> @@ -369,6 +370,9 @@ isolate_fail:
> break;
> }
>
> + /* Record how far we have got within the block */
> + *start_pfn = blockpfn;
> +
> trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
>
> /*
> @@ -413,6 +417,9 @@ isolate_freepages_range(struct compact_control *cc,
> LIST_HEAD(freelist);
>
> for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> + /* Protect pfn from changing by isolate_freepages_block */
> + unsigned long isolate_start_pfn = pfn;
> +
> if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
> break;
>
> @@ -423,8 +430,8 @@ isolate_freepages_range(struct compact_control *cc,
> block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
> block_end_pfn = min(block_end_pfn, end_pfn);
>
> - isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
> - &freelist, true);
> + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> + block_end_pfn, &freelist, true);
>
> /*
> * In strict mode, isolate_freepages_block() returns 0 if
> @@ -708,6 +715,7 @@ static void isolate_freepages(struct zone *zone,
> {
> struct page *page;
> unsigned long block_start_pfn; /* start of current pageblock */
> + unsigned long isolate_start_pfn; /* exact pfn we start at */
> unsigned long block_end_pfn; /* end of current pageblock */
> unsigned long low_pfn; /* lowest pfn scanner is able to scan */
> int nr_freepages = cc->nr_freepages;
> @@ -716,14 +724,15 @@ static void isolate_freepages(struct zone *zone,
> /*
> * Initialise the free scanner. The starting point is where we last
> * successfully isolated from, zone-cached value, or the end of the
> - * zone when isolating for the first time. We need this aligned to
> - * the pageblock boundary, because we do
> + * zone when isolating for the first time. For looping we also need
> + * this pfn aligned down to the pageblock boundary, because we do
> * block_start_pfn -= pageblock_nr_pages in the for loop.
> * For ending point, take care when isolating in last pageblock of a
> * a zone which ends in the middle of a pageblock.
> * The low boundary is the end of the pageblock the migration scanner
> * is using.
> */
> + isolate_start_pfn = cc->free_pfn;
> block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
> block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
> zone_end_pfn(zone));
> @@ -736,7 +745,8 @@ static void isolate_freepages(struct zone *zone,
> */
> for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
> block_end_pfn = block_start_pfn,
> - block_start_pfn -= pageblock_nr_pages) {
> + block_start_pfn -= pageblock_nr_pages,
> + isolate_start_pfn = block_start_pfn) {
> unsigned long isolated;
>
> /*
> @@ -770,13 +780,25 @@ static void isolate_freepages(struct zone *zone,
> if (!isolation_suitable(cc, page))
> continue;
>
> - /* Found a block suitable for isolating free pages from */
> - cc->free_pfn = block_start_pfn;
> - isolated = isolate_freepages_block(cc, block_start_pfn,
> + /* Found a block suitable for isolating free pages from. */
> + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> block_end_pfn, freelist, false);
> nr_freepages += isolated;
>
> /*
> + * Remember where the free scanner should restart next time,
> + * which is where isolate_freepages_block() left off.
> + * But if it scanned the whole pageblock, isolate_start_pfn
> + * now points at block_end_pfn, which is the start of the next
> + * pageblock.
> + * In that case we will however want to restart at the start
> + * of the previous pageblock.
> + */
> + cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
> + isolate_start_pfn :
> + block_start_pfn - pageblock_nr_pages;
> +
> + /*
> * Set a flag that we successfully isolated in this pageblock.
> * In the next loop iteration, zone->compact_cached_free_pfn
> * will not be updated and thus it will effectively contain the
>


--
Thanks.
Zhang Yanfei

2014-06-23 09:29:42

by Zhang Yanfei

Subject: Re: [PATCH v3 09/13] mm, compaction: skip buddy pages by their order in the migrate scanner

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> The migration scanner skips PageBuddy pages, but does not consider their order
> as checking page_order() is generally unsafe without holding the zone->lock,
> and acquiring the lock just for the check wouldn't be a good tradeoff.
>
> Still, this could avoid some iterations over the rest of the buddy page, and
> if we are careful, the race window between the PageBuddy() check and page_order()
> is small, and the worst thing that can happen is that we skip too much and miss
> some isolation candidates. This is not that bad, as compaction can already fail
> for many other reasons like parallel allocations, and those have a much larger
> race window.
>
> This patch therefore makes the migration scanner obtain the buddy page order
> and use it to skip the whole buddy page, if the order appears to be in the
> valid range.
>
> It's important that the page_order() is read only once, so that the value used
> in the checks and in the pfn calculation is the same. But in theory the
> compiler can replace the local variable with multiple evaluations of page_order().
> Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
> prevent this.
>
> Testing with stress-highalloc from mmtests shows a 15% reduction in number of
> pages scanned by migration scanner. This change is also a prerequisite for a
> later patch which is detecting when a cc->order block of pages contains
> non-buddy pages that cannot be isolated, and the scanner should thus skip to
> the next block immediately.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Fair enough.

Reviewed-by: Zhang Yanfei <[email protected]>

> ---
> mm/compaction.c | 36 +++++++++++++++++++++++++++++++-----
> mm/internal.h | 16 +++++++++++++++-
> 2 files changed, 46 insertions(+), 6 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 41c7005..df0961b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -270,8 +270,15 @@ static inline bool compact_should_abort(struct compact_control *cc)
> static bool suitable_migration_target(struct page *page)
> {
> /* If the page is a large free page, then disallow migration */
> - if (PageBuddy(page) && page_order(page) >= pageblock_order)
> - return false;
> + if (PageBuddy(page)) {
> + /*
> + * We are checking page_order without zone->lock taken. But
> + * the only small danger is that we skip a potentially suitable
> + * pageblock, so it's not worth to check order for valid range.
> + */
> + if (page_order_unsafe(page) >= pageblock_order)
> + return false;
> + }
>
> /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
> if (migrate_async_suitable(get_pageblock_migratetype(page)))
> @@ -591,11 +598,23 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> valid_page = page;
>
> /*
> - * Skip if free. page_order cannot be used without zone->lock
> - * as nothing prevents parallel allocations or buddy merging.
> + * Skip if free. We read page order here without zone lock
> + * which is generally unsafe, but the race window is small and
> + * the worst thing that can happen is that we skip some
> + * potential isolation targets.
> */
> - if (PageBuddy(page))
> + if (PageBuddy(page)) {
> + unsigned long freepage_order = page_order_unsafe(page);
> +
> + /*
> + * Without lock, we cannot be sure that what we got is
> + * a valid page order. Consider only values in the
> + * valid order range to prevent low_pfn overflow.
> + */
> + if (freepage_order > 0 && freepage_order < MAX_ORDER)
> + low_pfn += (1UL << freepage_order) - 1;
> continue;
> + }
>
> /*
> * Check may be lockless but that's ok as we recheck later.
> @@ -683,6 +702,13 @@ next_pageblock:
> low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
> }
>
> + /*
> + * The PageBuddy() check could have potentially brought us outside
> + * the range to be scanned.
> + */
> + if (unlikely(low_pfn > end_pfn))
> + low_pfn = end_pfn;
> +
> acct_isolated(zone, locked, cc);
>
> if (locked)
> diff --git a/mm/internal.h b/mm/internal.h
> index 2c187d2..584cd69 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -171,7 +171,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> * general, page_zone(page)->lock must be held by the caller to prevent the
> * page from being allocated in parallel and returning garbage as the order.
> * If a caller does not hold page_zone(page)->lock, it must guarantee that the
> - * page cannot be allocated or merged in parallel.
> + * page cannot be allocated or merged in parallel. Alternatively, it must
> + * handle invalid values gracefully, and use page_order_unsafe() below.
> */
> static inline unsigned long page_order(struct page *page)
> {
> @@ -179,6 +180,19 @@ static inline unsigned long page_order(struct page *page)
> return page_private(page);
> }
>
> +/*
> + * Like page_order(), but for callers who cannot afford to hold the zone lock.
> + * PageBuddy() should be checked first by the caller to minimize race window,
> + * and invalid values must be handled gracefully.
> + *
> + * ACCESS_ONCE is used so that if the caller assigns the result into a local
> + * variable and e.g. tests it for valid range before using, the compiler cannot
> + * decide to remove the variable and inline the page_private(page) multiple
> + * times, potentially observing different values in the tests and the actual
> + * use of the result.
> + */
> +#define page_order_unsafe(page) ACCESS_ONCE(page_private(page))
> +
> static inline bool is_cow_mapping(vm_flags_t flags)
> {
> return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>


--
Thanks.
Zhang Yanfei

2014-06-23 09:32:04

by Zhang Yanfei

Subject: Re: [PATCH v3 11/13] mm, compaction: pass gfp mask to compact_control

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> From: David Rientjes <[email protected]>
>
> struct compact_control currently converts the gfp mask to a migratetype, but we
> need the entire gfp mask in a follow-up patch.
>
> Pass the entire gfp mask as part of struct compact_control.
>
> Signed-off-by: David Rientjes <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>

Reviewed-by: Zhang Yanfei <[email protected]>

> ---
> mm/compaction.c | 12 +++++++-----
> mm/internal.h | 2 +-
> 2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 32c768b..d4e0c13 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -975,8 +975,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
> }
>
> -static int compact_finished(struct zone *zone,
> - struct compact_control *cc)
> +static int compact_finished(struct zone *zone, struct compact_control *cc,
> + const int migratetype)
> {
> unsigned int order;
> unsigned long watermark;
> @@ -1022,7 +1022,7 @@ static int compact_finished(struct zone *zone,
> struct free_area *area = &zone->free_area[order];
>
> /* Job done if page is free of the right migratetype */
> - if (!list_empty(&area->free_list[cc->migratetype]))
> + if (!list_empty(&area->free_list[migratetype]))
> return COMPACT_PARTIAL;
>
> /* Job done if allocation would set block type */
> @@ -1088,6 +1088,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> int ret;
> unsigned long start_pfn = zone->zone_start_pfn;
> unsigned long end_pfn = zone_end_pfn(zone);
> + const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
> const bool sync = cc->mode != MIGRATE_ASYNC;
>
> ret = compaction_suitable(zone, cc->order);
> @@ -1130,7 +1131,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>
> migrate_prep_local();
>
> - while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
> + while ((ret = compact_finished(zone, cc, migratetype)) ==
> + COMPACT_CONTINUE) {
> int err;
>
> switch (isolate_migratepages(zone, cc)) {
> @@ -1185,7 +1187,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> .nr_freepages = 0,
> .nr_migratepages = 0,
> .order = order,
> - .migratetype = gfpflags_to_migratetype(gfp_mask),
> + .gfp_mask = gfp_mask,
> .zone = zone,
> .mode = mode,
> };
> diff --git a/mm/internal.h b/mm/internal.h
> index 584cd69..dd17a40 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -149,7 +149,7 @@ struct compact_control {
> bool finished_update_migrate;
>
> int order; /* order a direct compactor needs */
> - int migratetype; /* MOVABLE, RECLAIMABLE etc */
> + const gfp_t gfp_mask; /* gfp mask of a direct compactor */
> struct zone *zone;
> enum compact_contended contended; /* Signal need_sched() or lock
> * contention detected during
>


--
Thanks.
Zhang Yanfei

2014-06-23 09:52:34

by Vlastimil Babka

Subject: Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

On 06/23/2014 07:39 AM, Zhang Yanfei wrote:
> Hello
>
> On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
>> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>>> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
>>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>>> dropped before write lock is taken on the same mmap_sem.
>>>
>>> Holding mmap_sem during whole huge page allocation is therefore useless, the
>>> vma needs to be rechecked after taking the write lock anyway. Furthermore, huge
>>> page allocation might involve a rather long sync compaction, and thus block
>>> any mmap_sem writers and e.g. affect workloads that perform frequent m(un)map
>>> or mprotect operations.
>>>
>>> This patch simply releases the read lock before allocating a huge page. It
>>> also deletes an outdated comment that assumed vma must be stable, as it was
>>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>>> ("mm: thp: khugepaged: add policy for finding target node").
>>
>> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
>> all. Please, move up_read() outside khugepaged_alloc_page().
>>

Well there's also currently no point in passing several parameters to
khugepaged_alloc_page(). So I could clean it up as well, but I imagine
later we would perhaps reintroduce them, as I don't think the
current situation is ideal for at least two reasons.

1. If you read commit 9f1b868a13 ("mm: thp: khugepaged: add policy for
finding target node"), it's based on a report where somebody found that
mempolicy is not observed properly when collapsing THP's. But the
'policy' introduced by the commit isn't based on real mempolicy, it
might just, under certain conditions, result in an interleave, which
happens to be what the reporter was trying.

So ideally, it should be making node allocation decisions based on where
the original 4KB pages are located. For example, allocate a THP only if
all the 4KB pages are on the same node. That would also automatically
obey any policy that has led to the allocation of those 4KB pages.

And for this, it will need again the parameters and mmap_sem in read
mode. It would be however still a good idea to drop mmap_sem before the
allocation itself, since compaction/reclaim might take some time...

2. (less related) I'd expect khugepaged to first allocate a hugepage and
then scan for collapsing. Yes there's khugepaged_prealloc_page, but that
only does something on !NUMA systems and these are not the future.
Although I don't have the data, I expect allocating a hugepage is a
bigger issue than finding something that could be collapsed. So why scan
for collapsing if in the end I cannot allocate a hugepage? And if I
really cannot find something to collapse, would e.g. caching a single
hugepage per node be a big hit? Also, if there's really nothing to
collapse, then it means khugepaged won't compact. And since khugepaged
is becoming the only source of sync compaction that doesn't give up
easily and tries to e.g. migrate movable pages out of unmovable
pageblocks, this might have bad effects on fragmentation.
I believe this could be done smarter.

> I might be wrong. If we up_read() in khugepaged_scan_pmd() and then loop around
> to get the next vma and handle it, do we do this without holding
> the mmap_sem in any mode?
>
> And if the loop ends, we have another up_read() in breakouterloop. What if we have
> already released the mmap_sem in collapse_huge_page()?

collapse_huge_page() is only called from khugepaged_scan_pmd() in the if
(ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for
the return value of khugepaged_scan_pmd() to break out of the loop (and
not doing up_read() again). So I think this is correct and moving
up_read from khugepaged_alloc_page() to collapse_huge_page() wouldn't
change this?

2014-06-23 10:40:49

by Zhang Yanfei

Subject: Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

On 06/23/2014 05:52 PM, Vlastimil Babka wrote:
> On 06/23/2014 07:39 AM, Zhang Yanfei wrote:
>> Hello
>>
>> On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
>>> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>>>> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
>>>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>>>> dropped before write lock is taken on the same mmap_sem.
>>>>
>>>> Holding mmap_sem during whole huge page allocation is therefore useless, the
>>>> vma needs to be rechecked after taking the write lock anyway. Furthermore, huge
>>>> page allocation might involve a rather long sync compaction, and thus block
>>>> any mmap_sem writers and e.g. affect workloads that perform frequent m(un)map
>>>> or mprotect operations.
>>>>
>>>> This patch simply releases the read lock before allocating a huge page. It
>>>> also deletes an outdated comment that assumed vma must be stable, as it was
>>>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>>>> ("mm: thp: khugepaged: add policy for finding target node").
>>>
>>> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
>>> all. Please, move up_read() outside khugepaged_alloc_page().
>>>
>
> Well there's also currently no point in passing several parameters to khugepaged_alloc_page(). So I could clean it up as well, but I imagine later we would perhaps reintroduce them, as I don't think the current situation is ideal for at least two reasons.
>
> 1. If you read commit 9f1b868a13 ("mm: thp: khugepaged: add policy for finding target node"), it's based on a report where somebody found that mempolicy is not observed properly when collapsing THP's. But the 'policy' introduced by the commit isn't based on real mempolicy, it might just, under certain conditions, result in an interleave, which happens to be what the reporter was trying.
>
> So ideally, it should be making node allocation decisions based on where the original 4KB pages are located. For example, allocate a THP only if all the 4KB pages are on the same node. That would also automatically obey any policy that has led to the allocation of those 4KB pages.
>
> And for this, it will need again the parameters and mmap_sem in read mode. It would be however still a good idea to drop mmap_sem before the allocation itself, since compaction/reclaim might take some time...
>
> 2. (less related) I'd expect khugepaged to first allocate a hugepage and then scan for collapsing. Yes there's khugepaged_prealloc_page, but that only does something on !NUMA systems and these are not the future.
> Although I don't have the data, I expect allocating a hugepage is a bigger issue than finding something that could be collapsed. So why scan for collapsing if in the end I cannot allocate a hugepage? And if I really cannot find something to collapse, would e.g. caching a single hugepage per node be a big hit? Also, if there's really nothing to collapse, then it means khugepaged won't compact. And since khugepaged is becoming the only source of sync compaction that doesn't give up easily and tries to e.g. migrate movable pages out of unmovable pageblocks, this might have bad effects on fragmentation.
> I believe this could be done smarter.
>
>> I might be wrong. If we up_read() in khugepaged_scan_pmd() and then loop around
>> to get the next vma and handle it, do we do this without holding
>> the mmap_sem in any mode?
>>
>> And if the loop ends, we have another up_read() in breakouterloop. What if we have
>> already released the mmap_sem in collapse_huge_page()?
>
> collapse_huge_page() is only called from khugepaged_scan_pmd() in the if (ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for the return value of khugepaged_scan_pmd() to break out of the loop (and not doing up_read() again). So I think this is correct and moving up_read from khugepaged_alloc_page() to collapse_huge_page() wouldn't
> change this?

Ah, right.

>
>
> .
>


--
Thanks.
Zhang Yanfei

2014-06-23 23:34:19

by Minchan Kim

Subject: Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

Hello Zhang,

On Mon, Jun 23, 2014 at 04:55:55PM +0800, Zhang Yanfei wrote:
> Hello Minchan,
>
> On 06/23/2014 09:39 AM, Minchan Kim wrote:
> > Hello Vlastimil,
> >
> > On Fri, Jun 20, 2014 at 05:49:35PM +0200, Vlastimil Babka wrote:
> >> Async compaction aborts when it detects zone lock contention or need_resched()
> >> is true. David Rientjes has reported that in practice, most direct async
> >> compactions for THP allocation abort due to need_resched(). This means that a
> >> second direct compaction is never attempted, which might be OK for a page
> >> fault, but khugepaged is intended to attempt a sync compaction in such case and
> >> in these cases it won't.
> >>
> >> This patch replaces "bool contended" in compact_control with an enum that
> >> distinguishes between aborting due to need_resched() and aborting due to lock
> >> contention. This allows propagating the abort through all compaction functions
> >> as before, but declaring the direct compaction as contended only when lock
> >> contention has been detected.
> >>
> >> A second problem is that try_to_compact_pages() did not act upon the reported
> >> contention (both need_resched() or lock contention) and could proceed with
> >> another zone from the zonelist. When need_resched() is true, that means
> >> initializing compaction of another zone, only to check need_resched() again in
> >> isolate_migratepages() and abort. For zone lock contention, the unintended
> >> consequence is that the contended status reported back to the allocator
> >> is decided from the last zone where compaction was attempted, which is rather
> >> arbitrary.
> >>
> >> This patch fixes the problem in the following way:
> >> - need_resched() being true after async compaction returned from a zone means
> >> that further zones should not be tried. We do a cond_resched() so that we
> >> do not hog the CPU, and abort. "contended" is reported as false, since we
> >> did not fail due to lock contention.
> >> - aborting zone compaction due to lock contention means we can still try
> >> another zone, since it has different locks. We report back "contended" as
> >> true only if compaction aborted due to lock contention in *all* zones where
> >> it was attempted.
> >>
> >> As a result of these fixes, khugepaged will proceed with second sync compaction
> >> as intended, when the preceding async compaction aborted due to need_resched().
> >> Page fault compactions aborting due to need_resched() will spare some cycles
> >> previously wasted by initializing another zone compaction only to abort again.
> >> Lock contention will be reported only when compaction in all zones aborted due
> >> to lock contention, and therefore it's not a good idea to try again after
> >> reclaim.
> >>
> >> Reported-by: David Rientjes <[email protected]>
> >> Signed-off-by: Vlastimil Babka <[email protected]>
> >> Cc: Minchan Kim <[email protected]>
> >> Cc: Mel Gorman <[email protected]>
> >> Cc: Joonsoo Kim <[email protected]>
> >> Cc: Michal Nazarewicz <[email protected]>
> >> Cc: Naoya Horiguchi <[email protected]>
> >> Cc: Christoph Lameter <[email protected]>
> >> Cc: Rik van Riel <[email protected]>
> >> ---
> >> mm/compaction.c | 48 +++++++++++++++++++++++++++++++++++++++---------
> >> mm/internal.h | 15 +++++++++++----
> >> 2 files changed, 50 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index ebe30c9..e8cfac9 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -180,9 +180,14 @@ static void update_pageblock_skip(struct compact_control *cc,
> >> }
> >> #endif /* CONFIG_COMPACTION */
> >>
> >> -static inline bool should_release_lock(spinlock_t *lock)
> >> +enum compact_contended should_release_lock(spinlock_t *lock)
> >> {
> >> - return need_resched() || spin_is_contended(lock);
> >> + if (spin_is_contended(lock))
> >> + return COMPACT_CONTENDED_LOCK;
> >> + else if (need_resched())
> >> + return COMPACT_CONTENDED_SCHED;
> >> + else
> >> + return COMPACT_CONTENDED_NONE;
> >
> > If you intentionally give lock contention higher priority than need_resched(),
> > please write that down in a comment.
> >
> >> }
> >>
> >> /*
> >> @@ -197,7 +202,9 @@ static inline bool should_release_lock(spinlock_t *lock)
> >> static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >> bool locked, struct compact_control *cc)
> >> {
> >> - if (should_release_lock(lock)) {
> >> + enum compact_contended contended = should_release_lock(lock);
> >> +
> >> + if (contended) {
> >> if (locked) {
> >> spin_unlock_irqrestore(lock, *flags);
> >> locked = false;
> >> @@ -205,7 +212,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>
> >> /* async aborts if taking too long or contended */
> >> if (cc->mode == MIGRATE_ASYNC) {
> >> - cc->contended = true;
> >> + cc->contended = contended;
> >> return false;
> >> }
> >
> >
> >>
> >> @@ -231,7 +238,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
> >> /* async compaction aborts if contended */
> >> if (need_resched()) {
> >> if (cc->mode == MIGRATE_ASYNC) {
> >> - cc->contended = true;
> >> + cc->contended = COMPACT_CONTENDED_SCHED;
> >> return true;
> >> }
> >>
> >> @@ -1101,7 +1108,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> >> VM_BUG_ON(!list_empty(&cc.freepages));
> >> VM_BUG_ON(!list_empty(&cc.migratepages));
> >>
> >> - *contended = cc.contended;
> >> + /* We only signal lock contention back to the allocator */
> >> + *contended = cc.contended == COMPACT_CONTENDED_LOCK;
> >
> > Please write down the *why* as well as your intention; we cannot know it just by looking at the code.
> >
> >> return ret;
> >> }
> >>
> >> @@ -1132,6 +1140,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> struct zone *zone;
> >> int rc = COMPACT_SKIPPED;
> >> int alloc_flags = 0;
> >> + bool all_zones_contended = true;
> >>
> >> /* Check if the GFP flags allow compaction */
> >> if (!order || !may_enter_fs || !may_perform_io)
> >> @@ -1146,6 +1155,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> >> nodemask) {
> >> int status;
> >> + bool zone_contended;
> >>
> >> if (compaction_deferred(zone, order))
> >> continue;
> >> @@ -1153,8 +1163,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> *deferred = false;
> >>
> >> status = compact_zone_order(zone, order, gfp_mask, mode,
> >> - contended);
> >> + &zone_contended);
> >> rc = max(status, rc);
> >> + all_zones_contended &= zone_contended;
> >>
> >> /* If a normal allocation would succeed, stop compacting */
> >> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> >> @@ -1168,12 +1179,31 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> * succeeding after all, it will be reset.
> >> */
> >> defer_compaction(zone, order);
> >> + /*
> >> + * If we stopped compacting due to need_resched(), do
> >> + * not try further zones and yield the CPU.
> >> + */
> >
> > For what? Saying why would make your claim clearer.
> >
> >> + if (need_resched()) {
> >
> > compact_zone_order() reports "contended" as true only for lock contention,
> > so it cannot report contention caused by need_resched(); that is why you added
> > the need_resched() check here. It looks fragile to me, because the need_resched()
> > state here might not be a result of the preceding compact_zone_order() call. It
> > would be clearer for compact_zone_order() to return zone_contended as the enum,
> > not a bool, and to check it here.
> >
> > That is, compact_zone_order() could return the enum, and try_to_compact_pages()
> > could reduce the result to a bool.
> >
> >> + /*
> >> + * We might not have tried all the zones, so
> >> + * be conservative and assume they are not
> >> + * all lock contended.
> >> + */
> >> + all_zones_contended = false;
> >> + cond_resched();
> >> + break;
> >> + }
> >> }
> >> }
> >>
> >> - /* If at least one zone wasn't deferred, we count a compaction stall */
> >> - if (!*deferred)
> >> + /*
> >> + * If at least one zone wasn't deferred, we count a compaction stall
> >> + * and we report if all zones that were tried were contended.
> >> + */
> >> + if (!*deferred) {
> >> count_compact_event(COMPACTSTALL);
> >> + *contended = all_zones_contended;
> >
> > Why don't you initialize contended as *false* at the start of the function?
> >
> >> + }
> >>
> >> return rc;
> >> }
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index a1b651b..2c187d2 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
> >>
> >> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> >>
> >> +/* Used to signal whether compaction detected need_sched() or lock contention */
> >> +enum compact_contended {
> >> + COMPACT_CONTENDED_NONE = 0, /* no contention detected */
> >> + COMPACT_CONTENDED_SCHED, /* need_sched() was true */
> >> + COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
> >> +};
> >> +
> >> /*
> >> * in mm/compaction.c
> >> */
> >> @@ -144,10 +151,10 @@ struct compact_control {
> >> int order; /* order a direct compactor needs */
> >> int migratetype; /* MOVABLE, RECLAIMABLE etc */
> >> struct zone *zone;
> >> - bool contended; /* True if a lock was contended, or
> >> - * need_resched() true during async
> >> - * compaction
> >> - */
> >> + enum compact_contended contended; /* Signal need_sched() or lock
> >> + * contention detected during
> >> + * compaction
> >> + */
> >> };
> >>
> >> unsigned long
> >> --
> >
> > Anyway, my biggest concern is that you are changing the current behavior, as
> > I said earlier.
> >
> > The old behavior on a THP page fault, once it consumed its own timeslice, was to
> > just abort and fall back to 4K pages; but with your patch, the new behavior is to
> > take a rest when it finds need_resched() and go for another round of
> > async, not sync, compaction. I'm not sure we need another round of
> > async compaction at the cost of increased latency rather than falling back
> > to 4K pages.
>
> I don't see the new behavior working like you said. If need_resched()
> is true, it calls cond_resched() and after a rest it just breaks the loop.
> Why would there be another round of async compaction?

One example goes

Old:
page fault
huge page allocation
__alloc_pages_slowpath
__alloc_pages_direct_compact
compact_zone_order
isolate_migratepages
compact_checklock_irqsave
need_resched is true
cc->contended = true;
return ISOLATE_ABORT
return COMPACT_PARTIAL with *contended = cc.contended;
COMPACTFAIL
if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
goto nopage;

New:

page fault
huge page allocation
__alloc_pages_slowpath
__alloc_pages_direct_compact
compact_zone_order
isolate_migratepages
compact_unlock_should_abort
need_resched is true
cc->contended = COMPACT_CONTENDED_SCHED;
return true;
return ISOLATE_ABORT
return COMPACT_PARTIAL with *contended = cc.contended == COMPACT_CONTENDED_LOCK (1)
COMPACTFAIL
if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
no goto nopage because contended_compaction was false by (1)

__alloc_pages_direct_reclaim
if (should_alloc_retry)
else
__alloc_pages_direct_compact again with ASYNC_MODE


>
> Thanks.
>
> >
> > It might be okay if the VMA has MADV_HUGEPAGE, which is a good hint of a
> > non-temporal VMA, so the latency would be a fair trade-off; but it's not okay
> > for a temporary big memory allocation on a HUGEPAGE_ALWAYS system.
> >
> > If you really want to go this, could you show us numbers?
> >
> > 1. How much more often does direct compaction succeed with this patch?
> > 2. How much latency do we add for temporary allocations on a
> > HUGEPAGE_ALWAYS system?
> >
>
>
> --
> Thanks.
> Zhang Yanfei
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

--
Kind regards,
Minchan Kim

2014-06-24 01:07:41

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

Hello Minchan

Thank you for your explanation. Actually, I was reading an old kernel
version; the latest upstream kernel behaves as you described below.
Oops, it has been a while since I followed the buddy allocator changes.

Thanks.

On 06/24/2014 07:35 AM, Minchan Kim wrote:
>>> Anyway, most big concern is that you are changing current behavior as
>>> > > I said earlier.
>>> > >
>>> > > Old behavior in THP page fault when it consumes own timeslot was just
>>> > > abort and fallback 4K page but with your patch, new behavior is
>>> > > take a rest when it founds need_resched and goes to another round with
>>> > > async, not sync compaction. I'm not sure we need another round with
>>> > > async compaction at the cost of increasing latency rather than fallback
>>> > > 4 page.
>> >
>> > I don't see the new behavior works like what you said. If need_resched
>> > is true, it calls cond_resched() and after a rest it just breaks the loop.
>> > Why there is another round with async compact?
> One example goes
>
> Old:
> page fault
> huge page allocation
> __alloc_pages_slowpath
> __alloc_pages_direct_compact
> compact_zone_order
> isolate_migratepages
> compact_checklock_irqsave
> need_resched is true
> cc->contended = true;
> return ISOLATE_ABORT
> return COMPACT_PARTIAL with *contended = cc.contended;
> COMPACTFAIL
> if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
> goto nopage;
>
> New:
>
> page fault
> huge page allocation
> __alloc_pages_slowpath
> __alloc_pages_direct_compact
> compact_zone_order
> isolate_migratepages
> compact_unlock_should_abort
> need_resched is true
> cc->contended = COMPACT_CONTENDED_SCHED;
> return true;
> return ISOLATE_ABORT
> return COMPACT_PARTIAL with *contended = cc.contended == COMPACT_CONTENDED_LOCK (1)
> COMPACTFAIL
> if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
> no goto nopage because contended_compaction was false by (1)
>
> __alloc_pages_direct_reclaim
> if (should_alloc_retry)
> else
> __alloc_pages_direct_compact again with ASYNC_MODE
>
>


--
Thanks.
Zhang Yanfei

2014-06-24 04:53:29

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Fri, Jun 20, 2014 at 05:49:34PM +0200, Vlastimil Babka wrote:
> isolate_migratepages_range() is the main function of the compaction scanner,
> called either on a single pageblock by isolate_migratepages() during regular
> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> It currently perfoms two pageblock-wide compaction suitability checks, and

(nit-picking) s/perfoms/performs/

> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> order to repeat those checks.
>
> However, closer inspection shows that those checks are always true for CMA:
> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> - migrate_async_suitable() check is skipped because CMA uses sync compaction
>
> We can therefore move the checks to isolate_migratepages(), reducing variables
> and simplifying isolate_migratepages_range(). The update_pageblock_skip()
> function also no longer needs set_unsuitable parameter.
>
> Furthermore, going back to compact_zone() and compact_finished() when pageblock
> is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
> The patch therefore also introduces a simple loop into isolate_migratepages()
> so that it does not return immediately on pageblock checks, but keeps going
> until isolate_migratepages_range() gets called once. Similarly to
> isolate_freepages(), the function periodically checks if it needs to reschedule
> or abort async compaction.

This looks like a good direction to me.
One thing below ...

> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> mm/compaction.c | 112 +++++++++++++++++++++++++++++---------------------------
> 1 file changed, 59 insertions(+), 53 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 3064a7f..ebe30c9 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
...
> @@ -840,34 +809,74 @@ typedef enum {
> } isolate_migrate_t;
>
> /*
> - * Isolate all pages that can be migrated from the block pointed to by
> - * the migrate scanner within compact_control.
> + * Isolate all pages that can be migrated from the first suitable block,
> + * starting at the block pointed to by the migrate scanner pfn within
> + * compact_control.
> */
> static isolate_migrate_t isolate_migratepages(struct zone *zone,
> struct compact_control *cc)
> {
> unsigned long low_pfn, end_pfn;
> + struct page *page;
>
> - /* Do not scan outside zone boundaries */
> - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
> + /* Start at where we last stopped, or beginning of the zone */
> + low_pfn = cc->migrate_pfn;
>
> /* Only scan within a pageblock boundary */
> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>
> - /* Do not cross the free scanner or scan within a memory hole */
> - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> - cc->migrate_pfn = end_pfn;
> - return ISOLATE_NONE;
> - }
> + /*
> + * Iterate over whole pageblocks until we find the first suitable.
> + * Do not cross the free scanner.
> + */
> + for (; end_pfn <= cc->free_pfn;
> + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
> +
> + /*
> + * This can potentially iterate a massively long zone with
> + * many pageblocks unsuitable, so periodically check if we
> + * need to schedule, or even abort async compaction.
> + */
> + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
> + && compact_should_abort(cc))
> + break;
>
> - /* Perform the isolation */
> - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
> - if (!low_pfn || cc->contended)
> - return ISOLATE_ABORT;
> + /* Do not scan within a memory hole */
> + if (!pfn_valid(low_pfn))
> + continue;
> +
> + page = pfn_to_page(low_pfn);

Can we move the (page_zone != zone) check here, as isolate_freepages() does?

Thanks,
Naoya Horiguchi

> + /* If isolation recently failed, do not retry */
> + if (!isolation_suitable(cc, page))
> + continue;
>
> + /*
> + * For async compaction, also only scan in MOVABLE blocks.
> + * Async compaction is optimistic to see if the minimum amount
> + * of work satisfies the allocation.
> + */
> + if (cc->mode == MIGRATE_ASYNC &&
> + !migrate_async_suitable(get_pageblock_migratetype(page)))
> + continue;
> +
> + /* Perform the isolation */
> + low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
> + end_pfn, false);
> + if (!low_pfn || cc->contended)
> + return ISOLATE_ABORT;
> +
> + /*
> + * Either we isolated something and proceed with migration. Or
> + * we failed and compact_zone should decide if we should
> + * continue or not.
> + */
> + break;
> + }
> +
> + /* Record where migration scanner will be restarted */
> cc->migrate_pfn = low_pfn;
>
> - return ISOLATE_SUCCESS;
> + return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
> }
>
> static int compact_finished(struct zone *zone,

2014-06-24 08:18:35

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

On Fri, Jun 20, 2014 at 05:49:32PM +0200, Vlastimil Babka wrote:
> When direct sync compaction is often unsuccessful, it may become deferred for
> some time to avoid further useless attempts, both sync and async. Successful
> high-order allocations un-defer compaction, while further unsuccessful
> compaction attempts prolong the compaction deferred period.
>
> Currently the checking and setting deferred status is performed only on the
> preferred zone of the allocation that invoked direct compaction. But compaction
> itself is attempted on all eligible zones in the zonelist, so the behavior is
> suboptimal and may lead both to scenarios where 1) compaction is attempted
> uselessly, or 2) where it's not attempted despite good chances of succeeding,
> as shown on the examples below:
>
> 1) A direct compaction with Normal preferred zone failed and set deferred
> compaction for the Normal zone. Another unrelated direct compaction with
> DMA32 as preferred zone will attempt to compact DMA32 zone even though
> the first compaction attempt also included DMA32 zone.
>
> In another scenario, compaction with Normal preferred zone failed to compact
> Normal zone, but succeeded in the DMA32 zone, so it will not defer
> compaction. In the next attempt, it will try Normal zone which will fail
> again, instead of skipping Normal zone and trying DMA32 directly.
>
> 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
> looking good. A direct compaction with preferred Normal zone will skip
> compaction of all zones including DMA32 because Normal was still deferred.
> The allocation might have succeeded in DMA32, but won't.
>
> This patch makes compaction deferring work on individual zone basis instead of
> preferred zone. For each zone, it checks compaction_deferred() to decide if the
> zone should be skipped. If watermarks fail after compacting the zone,
> defer_compaction() is called. The zone where watermarks passed can still be
> deferred when the allocation attempt is unsuccessful. When allocation is
> successful, compaction_defer_reset() is called for the zone containing the
> allocated page. This approach should approximate calling defer_compaction()
> only on zones where compaction was attempted and did not yield allocated page.
> There might be corner cases but that is inevitable as long as the decision
> to stop compacting does not guarantee that a page will be allocated.
>
> During testing on a two-node machine with a single very small Normal zone on
> node 1, this patch has improved success rates in stress-highalloc mmtests
> benchmark. The success here were previously made worse by commit 3a025760fc
> ("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was
> no longer resetting often enough the deferred compaction for the Normal zone,
> and DMA32 zones on both nodes were thus not considered for compaction.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> include/linux/compaction.h | 6 ++++--
> mm/compaction.c | 29 ++++++++++++++++++++++++-----
> mm/page_alloc.c | 33 ++++++++++++++++++---------------
> 3 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 01e3132..76f9beb 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> - enum migrate_mode mode, bool *contended);
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern unsigned long compaction_suitable(struct zone *zone, int order);
> @@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> #else
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone)
> {
> return COMPACT_CONTINUE;
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5175019..7c491d0 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1122,13 +1122,15 @@ int sysctl_extfrag_threshold = 500;
> * @nodemask: The allowed nodes to allocate from
> * @mode: The migration mode for async, sync light, or sync migration
> * @contended: Return value that is true if compaction was aborted due to lock contention
> - * @page: Optionally capture a free page of the requested order during compaction
> + * @deferred: Return value that is true if compaction was deferred in all zones
> + * @candidate_zone: Return the zone where we think allocation should succeed
> *
> * This is the main entry point for direct page compaction.
> */
> unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone)
> {
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> int may_enter_fs = gfp_mask & __GFP_FS;
> @@ -1142,8 +1144,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> if (!order || !may_enter_fs || !may_perform_io)
> return rc;
>
> - count_compact_event(COMPACTSTALL);
> -
> + *deferred = true;
> #ifdef CONFIG_CMA
> if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> @@ -1153,16 +1154,34 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> nodemask) {
> int status;
>
> + if (compaction_deferred(zone, order))
> + continue;
> +
> + *deferred = false;
> +
> status = compact_zone_order(zone, order, gfp_mask, mode,
> contended);
> rc = max(status, rc);
>
> /* If a normal allocation would succeed, stop compacting */
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> - alloc_flags))
> + alloc_flags)) {
> + *candidate_zone = zone;
> break;

How about doing compaction_defer_reset() here?

As you said before, although this check succeeds, it doesn't guarantee the
success of the high-order allocation, for some reason or other (e.g. a racy
allocation attempt stealing the page). But, at least, passing this check
means that we succeeded in compaction, and there is a good chance of exiting
compaction without scanning the whole zone range.

So, a high-order allocation failure doesn't mean that we should defer
compaction.

> + } else if (mode != MIGRATE_ASYNC) {
> + /*
> + * We think that allocation won't succeed in this zone
> + * so we defer compaction there. If it ends up
> + * succeeding after all, it will be reset.
> + */
> + defer_compaction(zone, order);
> + }
> }
>
> + /* If at least one zone wasn't deferred, we count a compaction stall */
> + if (!*deferred)
> + count_compact_event(COMPACTSTALL);
> +

Could you keep this counting in __alloc_pages_direct_compact()?
That would make it easier to understand how this statistic works.

> return rc;
> }

And if possible, it would be better to make "deferred" one of the compaction
statuses, like COMPACT_SKIPPED. It would make the code clearer.


>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee92384..6593f79 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2238,18 +2238,17 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> bool *contended_compaction, bool *deferred_compaction,
> unsigned long *did_some_progress)
> {
> - if (!order)
> - return NULL;
> + struct zone *last_compact_zone = NULL;
>
> - if (compaction_deferred(preferred_zone, order)) {
> - *deferred_compaction = true;
> + if (!order)
> return NULL;
> - }
>
> current->flags |= PF_MEMALLOC;
> *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
> nodemask, mode,
> - contended_compaction);
> + contended_compaction,
> + deferred_compaction,
> + &last_compact_zone);
> current->flags &= ~PF_MEMALLOC;
>
> if (*did_some_progress != COMPACT_SKIPPED) {
> @@ -2263,27 +2262,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> order, zonelist, high_zoneidx,
> alloc_flags & ~ALLOC_NO_WATERMARKS,
> preferred_zone, classzone_idx, migratetype);
> +
> if (page) {
> - preferred_zone->compact_blockskip_flush = false;
> - compaction_defer_reset(preferred_zone, order, true);
> + struct zone *zone = page_zone(page);
> +
> + zone->compact_blockskip_flush = false;
> + compaction_defer_reset(zone, order, true);
> count_vm_event(COMPACTSUCCESS);
> return page;

This snippet raises a thought for me.
Why don't we call compaction_defer_reset() when we succeed in allocating a
high-order page on the fastpath or some other path? If we succeed in
allocating it on some path other than here, it means that the state of
memory has changed, so this deferred check would be testing stale state.

> }
>
> /*
> + * last_compact_zone is where try_to_compact_pages thought
> + * allocation should succeed, so it did not defer compaction.
> + * But now we know that it didn't succeed, so we do the defer.
> + */
> + if (last_compact_zone && mode != MIGRATE_ASYNC)
> + defer_compaction(last_compact_zone, order);
> +
> + /*
> * It's bad if compaction run occurs and fails.
> * The most likely reason is that pages exist,
> * but not enough to satisfy watermarks.
> */
> count_vm_event(COMPACTFAIL);
>
> - /*
> - * As async compaction considers a subset of pageblocks, only
> - * defer if the failure was a sync compaction failure.
> - */
> - if (mode != MIGRATE_ASYNC)
> - defer_compaction(preferred_zone, order);
> -
> cond_resched();
> }
>
> --
> 1.8.4.5
>

2014-06-24 08:28:46

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Fri, Jun 20, 2014 at 05:49:34PM +0200, Vlastimil Babka wrote:
> isolate_migratepages_range() is the main function of the compaction scanner,
> called either on a single pageblock by isolate_migratepages() during regular
> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> It currently performs two pageblock-wide compaction suitability checks, and
> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> order to repeat those checks.
>
> However, closer inspection shows that those checks are always true for CMA:
> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> - migrate_async_suitable() check is skipped because CMA uses sync compaction
>
> We can therefore move the checks to isolate_migratepages(), reducing variables
> and simplifying isolate_migratepages_range(). The update_pageblock_skip()
> function also no longer needs set_unsuitable parameter.
>
> Furthermore, going back to compact_zone() and compact_finished() when pageblock
> is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
> The patch therefore also introduces a simple loop into isolate_migratepages()
> so that it does not return immediately on pageblock checks, but keeps going
> until isolate_migratepages_range() gets called once. Similarly to
> isolate_freepages(), the function periodically checks if it needs to reschedule
> or abort async compaction.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> mm/compaction.c | 112 +++++++++++++++++++++++++++++---------------------------
> 1 file changed, 59 insertions(+), 53 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 3064a7f..ebe30c9 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
> */
> static void update_pageblock_skip(struct compact_control *cc,
> struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
> {
> struct zone *zone = cc->zone;
> unsigned long pfn;
> @@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
> if (nr_isolated)
> return;
>
> - /*
> - * Only skip pageblocks when all forms of compaction will be known to
> - * fail in the near future.
> - */
> - if (set_unsuitable)
> - set_pageblock_skip(page);
> + set_pageblock_skip(page);
>
> pfn = page_to_pfn(page);
>
> @@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,
>
> static void update_pageblock_skip(struct compact_control *cc,
> struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
> {
> }
> #endif /* CONFIG_COMPACTION */
> @@ -345,8 +340,7 @@ isolate_fail:
>
> /* Update the pageblock-skip if the whole pageblock was scanned */
> if (blockpfn == end_pfn)
> - update_pageblock_skip(cc, valid_page, total_isolated, true,
> - false);
> + update_pageblock_skip(cc, valid_page, total_isolated, false);
>
> count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
> if (total_isolated)
> @@ -474,14 +468,12 @@ unsigned long
> isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
> {
> - unsigned long last_pageblock_nr = 0, pageblock_nr;
> unsigned long nr_scanned = 0, nr_isolated = 0;
> struct list_head *migratelist = &cc->migratepages;
> struct lruvec *lruvec;
> unsigned long flags;
> bool locked = false;
> struct page *page = NULL, *valid_page = NULL;
> - bool set_unsuitable = true;
> const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
> ISOLATE_ASYNC_MIGRATE : 0) |
> (unevictable ? ISOLATE_UNEVICTABLE : 0);
> @@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> if (!valid_page)
> valid_page = page;
>
> - /* If isolation recently failed, do not retry */
> - pageblock_nr = low_pfn >> pageblock_order;
> - if (last_pageblock_nr != pageblock_nr) {
> - int mt;
> -
> - last_pageblock_nr = pageblock_nr;
> - if (!isolation_suitable(cc, page))
> - goto next_pageblock;
> -
> - /*
> - * For async migration, also only scan in MOVABLE
> - * blocks. Async migration is optimistic to see if
> - * the minimum amount of work satisfies the allocation
> - */
> - mt = get_pageblock_migratetype(page);
> - if (cc->mode == MIGRATE_ASYNC &&
> - !migrate_async_suitable(mt)) {
> - set_unsuitable = false;
> - goto next_pageblock;
> - }
> - }
> -
> /*
> * Skip if free. page_order cannot be used without zone->lock
> * as nothing prevents parallel allocations or buddy merging.
> @@ -668,8 +638,7 @@ next_pageblock:
> * if the whole pageblock was scanned without isolating any page.
> */
> if (low_pfn == end_pfn)
> - update_pageblock_skip(cc, valid_page, nr_isolated,
> - set_unsuitable, true);
> + update_pageblock_skip(cc, valid_page, nr_isolated, true);
>
> trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>
> @@ -840,34 +809,74 @@ typedef enum {
> } isolate_migrate_t;
>
> /*
> - * Isolate all pages that can be migrated from the block pointed to by
> - * the migrate scanner within compact_control.
> + * Isolate all pages that can be migrated from the first suitable block,
> + * starting at the block pointed to by the migrate scanner pfn within
> + * compact_control.
> */
> static isolate_migrate_t isolate_migratepages(struct zone *zone,
> struct compact_control *cc)
> {
> unsigned long low_pfn, end_pfn;
> + struct page *page;
>
> - /* Do not scan outside zone boundaries */
> - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
> + /* Start at where we last stopped, or beginning of the zone */
> + low_pfn = cc->migrate_pfn;
>
> /* Only scan within a pageblock boundary */
> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>
> - /* Do not cross the free scanner or scan within a memory hole */
> - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> - cc->migrate_pfn = end_pfn;
> - return ISOLATE_NONE;
> - }
> + /*
> + * Iterate over whole pageblocks until we find the first suitable.
> + * Do not cross the free scanner.
> + */
> + for (; end_pfn <= cc->free_pfn;
> + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
> +
> + /*
> + * This can potentially iterate a massively long zone with
> + * many pageblocks unsuitable, so periodically check if we
> + * need to schedule, or even abort async compaction.
> + */
> + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
> + && compact_should_abort(cc))
> + break;
>
> - /* Perform the isolation */
> - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
> - if (!low_pfn || cc->contended)
> - return ISOLATE_ABORT;
> + /* Do not scan within a memory hole */
> + if (!pfn_valid(low_pfn))
> + continue;
> +
> + page = pfn_to_page(low_pfn);
> + /* If isolation recently failed, do not retry */
> + if (!isolation_suitable(cc, page))
> + continue;
>
> + /*
> + * For async compaction, also only scan in MOVABLE blocks.
> + * Async compaction is optimistic to see if the minimum amount
> + * of work satisfies the allocation.
> + */
> + if (cc->mode == MIGRATE_ASYNC &&
> + !migrate_async_suitable(get_pageblock_migratetype(page)))
> + continue;
> +
> + /* Perform the isolation */
> + low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
> + end_pfn, false);
> + if (!low_pfn || cc->contended)
> + return ISOLATE_ABORT;
> +
> + /*
> + * Either we isolated something and proceed with migration. Or
> + * we failed and compact_zone should decide if we should
> + * continue or not.
> + */
> + break;
> + }
> +
> + /* Record where migration scanner will be restarted */

If we make the isolate_migratepages* interface match isolate_freepages*,
we can get cleaner and slightly micro-optimized code. Because
isolate_migratepages_range() can handle an arbitrary range, and this patch
makes isolate_migratepages() also handle an arbitrary range, there is some
redundant code. :)

Thanks.
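The loop the patch adds to isolate_migratepages() is only pfn arithmetic, so it can be modeled in plain userspace C. The sketch below uses illustrative stand-ins for the kernel's constants and helpers (PAGEBLOCK_NR_PAGES, align_up(), walk_pageblocks() are not kernel names) and counts how many pageblocks are visited and how often the periodic compact_should_abort() check would fire:

```c
#include <assert.h>

/* Illustrative stand-ins for kernel constants (not the kernel's definitions). */
#define PAGEBLOCK_NR_PAGES 512UL	/* e.g. 2MB pageblocks with 4KB pages */
#define SWAP_CLUSTER_MAX   32UL

/* Round up to the next boundary, like the kernel's ALIGN() macro. */
static unsigned long align_up(unsigned long pfn, unsigned long a)
{
	return (pfn + a - 1) & ~(a - 1);
}

/*
 * Model of the new loop in isolate_migratepages(): walk whole pageblocks
 * from migrate_pfn towards free_pfn, counting the blocks visited and how
 * often the periodic abort/resched check would run.
 */
static void walk_pageblocks(unsigned long migrate_pfn, unsigned long free_pfn,
			    unsigned long *blocks, unsigned long *checks)
{
	unsigned long low_pfn = migrate_pfn;
	unsigned long end_pfn = align_up(low_pfn + 1, PAGEBLOCK_NR_PAGES);

	*blocks = *checks = 0;
	for (; end_pfn <= free_pfn;
	     low_pfn = end_pfn, end_pfn += PAGEBLOCK_NR_PAGES) {
		if (!(low_pfn % (SWAP_CLUSTER_MAX * PAGEBLOCK_NR_PAGES)))
			(*checks)++;	/* compact_should_abort() point */
		(*blocks)++;		/* candidate pageblock scanned */
	}
}
```

With these values the check fires once per SWAP_CLUSTER_MAX (32) pageblocks, matching the `low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)` condition in the patch.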

2014-06-24 15:29:34

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

On 06/24/2014 10:23 AM, Joonsoo Kim wrote:
> On Fri, Jun 20, 2014 at 05:49:32PM +0200, Vlastimil Babka wrote:
>> When direct sync compaction is often unsuccessful, it may become deferred for
>> some time to avoid further useless attempts, both sync and async. Successful
>> high-order allocations un-defer compaction, while further unsuccessful
>> compaction attempts prolong the compaction deferred period.
>>
>> Currently the checking and setting deferred status is performed only on the
>> preferred zone of the allocation that invoked direct compaction. But compaction
>> itself is attempted on all eligible zones in the zonelist, so the behavior is
>> suboptimal and may lead to scenarios where 1) compaction is attempted
>> uselessly, or 2) it's not attempted despite good chances of succeeding,
>> as shown in the examples below:
>>
>> 1) A direct compaction with Normal preferred zone failed and set deferred
>> compaction for the Normal zone. Another unrelated direct compaction with
>> DMA32 as preferred zone will attempt to compact DMA32 zone even though
>> the first compaction attempt also included DMA32 zone.
>>
>> In another scenario, compaction with Normal preferred zone failed to compact
>> Normal zone, but succeeded in the DMA32 zone, so it will not defer
>> compaction. In the next attempt, it will try Normal zone which will fail
>> again, instead of skipping Normal zone and trying DMA32 directly.
>>
>> 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
>> looking good. A direct compaction with preferred Normal zone will skip
>> compaction of all zones including DMA32 because Normal was still deferred.
>> The allocation might have succeeded in DMA32, but won't.
>>
>> This patch makes compaction deferring work on individual zone basis instead of
>> preferred zone. For each zone, it checks compaction_deferred() to decide if the
>> zone should be skipped. If watermarks fail after compacting the zone,
>> defer_compaction() is called. The zone where watermarks passed can still be
>> deferred when the allocation attempt is unsuccessful. When allocation is
>> successful, compaction_defer_reset() is called for the zone containing the
>> allocated page. This approach should approximate calling defer_compaction()
>> only on zones where compaction was attempted and did not yield an allocated
>> page. There might be corner cases, but that is inevitable as long as the
>> decision to stop compacting does not guarantee that a page will be allocated.
>>
>> During testing on a two-node machine with a single very small Normal zone on
>> node 1, this patch has improved success rates in stress-highalloc mmtests
>> benchmark. The success rates here were previously made worse by commit 3a025760fc
>> ("mm: page_alloc: spill to remote nodes before waking kswapd"), as kswapd was
>> no longer resetting the deferred compaction of the Normal zone often enough,
>> and the DMA32 zones on both nodes were thus not considered for compaction.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Joonsoo Kim <[email protected]>
>> Cc: Michal Nazarewicz <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> ---
>> include/linux/compaction.h | 6 ++++--
>> mm/compaction.c | 29 ++++++++++++++++++++++++-----
>> mm/page_alloc.c | 33 ++++++++++++++++++---------------
>> 3 files changed, 46 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 01e3132..76f9beb 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
>> extern int fragmentation_index(struct zone *zone, unsigned int order);
>> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> int order, gfp_t gfp_mask, nodemask_t *mask,
>> - enum migrate_mode mode, bool *contended);
>> + enum migrate_mode mode, bool *contended, bool *deferred,
>> + struct zone **candidate_zone);
>> extern void compact_pgdat(pg_data_t *pgdat, int order);
>> extern void reset_isolation_suitable(pg_data_t *pgdat);
>> extern unsigned long compaction_suitable(struct zone *zone, int order);
>> @@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
>> #else
>> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> int order, gfp_t gfp_mask, nodemask_t *nodemask,
>> - enum migrate_mode mode, bool *contended)
>> + enum migrate_mode mode, bool *contended, bool *deferred,
>> + struct zone **candidate_zone)
>> {
>> return COMPACT_CONTINUE;
>> }
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 5175019..7c491d0 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1122,13 +1122,15 @@ int sysctl_extfrag_threshold = 500;
>> * @nodemask: The allowed nodes to allocate from
>> * @mode: The migration mode for async, sync light, or sync migration
>> * @contended: Return value that is true if compaction was aborted due to lock contention
>> - * @page: Optionally capture a free page of the requested order during compaction
>> + * @deferred: Return value that is true if compaction was deferred in all zones
>> + * @candidate_zone: Return the zone where we think allocation should succeed
>> *
>> * This is the main entry point for direct page compaction.
>> */
>> unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> int order, gfp_t gfp_mask, nodemask_t *nodemask,
>> - enum migrate_mode mode, bool *contended)
>> + enum migrate_mode mode, bool *contended, bool *deferred,
>> + struct zone **candidate_zone)
>> {
>> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>> int may_enter_fs = gfp_mask & __GFP_FS;
>> @@ -1142,8 +1144,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> if (!order || !may_enter_fs || !may_perform_io)
>> return rc;
>>
>> - count_compact_event(COMPACTSTALL);
>> -
>> + *deferred = true;
>> #ifdef CONFIG_CMA
>> if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
>> alloc_flags |= ALLOC_CMA;
>> @@ -1153,16 +1154,34 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> nodemask) {
>> int status;
>>
>> + if (compaction_deferred(zone, order))
>> + continue;
>> +
>> + *deferred = false;
>> +
>> status = compact_zone_order(zone, order, gfp_mask, mode,
>> contended);
>> rc = max(status, rc);
>>
>> /* If a normal allocation would succeed, stop compacting */
>> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
>> - alloc_flags))
>> + alloc_flags)) {
>> + *candidate_zone = zone;
>> break;
>
> How about doing compaction_defer_reset() here?
>
> As you said before, even when this check succeeds, it doesn't ensure
> success of the high-order allocation, for reasons outside our control
> (e.g. a racy allocation attempt steals the page). But, at least, passing
> this check means that compaction succeeded and there is a good chance of
> exiting compaction without searching the whole zone range.

Well another reason is that the check is racy wrt NR_FREE counters
drift. But I think it tends to be false negative rather than false
positive. So it could work, and with page capture it would be quite
accurate. I'll try.
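For reference, the defer bookkeeping being discussed is small enough to model in userspace. This is a simplified sketch of the exponential backoff; the field names follow the kernel's zone fields, but the struct, initialization and details are illustrative, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6	/* cap on the backoff exponent */

/* Simplified userspace model of the per-zone defer state. */
struct zone_defer {
	unsigned int considered;	/* models compact_considered */
	unsigned int defer_shift;	/* models compact_defer_shift */
	int order_failed;		/* models compact_order_failed */
};

/* Compaction failed at this order: back off exponentially. */
static void defer_compaction(struct zone_defer *z, int order)
{
	z->considered = 0;
	z->defer_shift++;
	if (z->defer_shift > COMPACT_MAX_DEFER_SHIFT)
		z->defer_shift = COMPACT_MAX_DEFER_SHIFT;
	if (order < z->order_failed)
		z->order_failed = order;
}

/* Should a compaction attempt at this order be skipped right now? */
static bool compaction_deferred(struct zone_defer *z, int order)
{
	unsigned int limit = 1U << z->defer_shift;

	if (order < z->order_failed)
		return false;		/* smaller orders never failed */
	if (++z->considered > limit)
		z->considered = limit;	/* avoid counter overflow */
	return z->considered < limit;
}

/* A successful allocation resets the backoff for this zone. */
static void compaction_defer_reset(struct zone_defer *z, int order)
{
	z->considered = 0;
	z->defer_shift = 0;
	if (order >= z->order_failed)
		z->order_failed = order + 1;
}
```

After one failure the zone is skipped for a window of attempts; each further failure doubles the window up to 2^COMPACT_MAX_DEFER_SHIFT, and a success at the failed order resets everything.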

> So, high-order allocation failure doesn't mean that we should defer
> compaction.
>
>> + } else if (mode != MIGRATE_ASYNC) {
>> + /*
>> + * We think that allocation won't succeed in this zone
>> + * so we defer compaction there. If it ends up
>> + * succeeding after all, it will be reset.
>> + */
>> + defer_compaction(zone, order);
>> + }
>> }
>>
>> + /* If at least one zone wasn't deferred, we count a compaction stall */
>> + if (!*deferred)
>> + count_compact_event(COMPACTSTALL);
>> +
>
> Could you keep this counting in __alloc_pages_direct_compact()?
> It will help to understand how this statistic works.

Well, count_compact_event() is defined in compaction.c and this would be
used in page_alloc.c. I'm not sure moving it there helps.

>> return rc;
>> }
>
> And if possible, it would be better to make deferred one of the compaction
> statuses, like COMPACT_SKIPPED. It would make the code clearer.

That could work inside try_to_compact_pages() as well.
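A sketch of what folding deferred into the return value could look like. COMPACT_DEFERRED is the hypothetical new code, the other values mirror the existing compact_result codes, and the zonelist loop is reduced to an array of per-zone flags; this is an illustration, not the proposed patch:

```c
#include <assert.h>
#include <stdbool.h>

/* COMPACT_DEFERRED is hypothetical; the rest mirror existing codes. */
enum compact_result {
	COMPACT_DEFERRED,	/* every eligible zone had compaction deferred */
	COMPACT_SKIPPED,
	COMPACT_CONTINUE,
	COMPACT_PARTIAL,
};

/*
 * Model of the zonelist loop in try_to_compact_pages(): instead of a
 * bool *deferred out-parameter, "all zones deferred" collapses into the
 * return code itself.
 */
static enum compact_result try_zones(const bool *zone_deferred, int nr)
{
	enum compact_result rc = COMPACT_DEFERRED;
	int i;

	for (i = 0; i < nr; i++) {
		if (zone_deferred[i])
			continue;
		/* stand-in for rc = max(compact_zone_order(...), rc) */
		if (COMPACT_PARTIAL > rc)
			rc = COMPACT_PARTIAL;
	}
	return rc;
}
```

The caller then tests `rc == COMPACT_DEFERRED` where it previously read the out-parameter, and the COMPACTSTALL accounting can key off the same comparison.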

>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ee92384..6593f79 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2238,18 +2238,17 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>> bool *contended_compaction, bool *deferred_compaction,
>> unsigned long *did_some_progress)
>> {
>> - if (!order)
>> - return NULL;
>> + struct zone *last_compact_zone = NULL;
>>
>> - if (compaction_deferred(preferred_zone, order)) {
>> - *deferred_compaction = true;
>> + if (!order)
>> return NULL;
>> - }
>>
>> current->flags |= PF_MEMALLOC;
>> *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
>> nodemask, mode,
>> - contended_compaction);
>> + contended_compaction,
>> + deferred_compaction,
>> + &last_compact_zone);
>> current->flags &= ~PF_MEMALLOC;
>>
>> if (*did_some_progress != COMPACT_SKIPPED) {
>> @@ -2263,27 +2262,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>> order, zonelist, high_zoneidx,
>> alloc_flags & ~ALLOC_NO_WATERMARKS,
>> preferred_zone, classzone_idx, migratetype);
>> +
>> if (page) {
>> - preferred_zone->compact_blockskip_flush = false;
>> - compaction_defer_reset(preferred_zone, order, true);
>> + struct zone *zone = page_zone(page);
>> +
>> + zone->compact_blockskip_flush = false;
>> + compaction_defer_reset(zone, order, true);
>> count_vm_event(COMPACTSUCCESS);
>> return page;
>
> This snippet raises a thought for me.
> Why don't we call compaction_defer_reset() when we succeed in allocating
> a high-order page on the fastpath or some other path? If we succeed in
> allocating it on some other path rather than here, it means that the
> status of memory has changed, so this deferred check would be stale.

Hm, not sure if we want to do that in fast paths. As long as somebody
succeeds, that means nobody has to try checking for deferred compaction
and it doesn't matter. When they stop succeeding, then it may be stale,
yes. But is it worth polluting fast paths with defer resets?

>
>> }
>>
>> /*
>> + * last_compact_zone is where try_to_compact_pages thought
>> + * allocation should succeed, so it did not defer compaction.
>> + * But now we know that it didn't succeed, so we do the defer.
>> + */
>> + if (last_compact_zone && mode != MIGRATE_ASYNC)
>> + defer_compaction(last_compact_zone, order);
>> +
>> + /*
>> * It's bad if compaction run occurs and fails.
>> * The most likely reason is that pages exist,
>> * but not enough to satisfy watermarks.
>> */
>> count_vm_event(COMPACTFAIL);
>>
>> - /*
>> - * As async compaction considers a subset of pageblocks, only
>> - * defer if the failure was a sync compaction failure.
>> - */
>> - if (mode != MIGRATE_ASYNC)
>> - defer_compaction(preferred_zone, order);
>> -
>> cond_resched();
>> }
>>
>> --
>> 1.8.4.5
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: [email protected]

2014-06-24 15:34:34

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On 06/24/2014 06:52 AM, Naoya Horiguchi wrote:
>> - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
>> - if (!low_pfn || cc->contended)
>> - return ISOLATE_ABORT;
>> + /* Do not scan within a memory hole */
>> + if (!pfn_valid(low_pfn))
>> + continue;
>> +
>> + page = pfn_to_page(low_pfn);
>
> Can we move the (page_zone != zone) check here, as isolate_freepages() does?

Duplicate perhaps, not sure about move. Does CMA make sure that all
pages are in the same zone? Common sense tells me it would be useless
otherwise, but I haven't checked if we can rely on it.

> Thanks,
> Naoya Horiguchi
>

2014-06-24 15:39:49

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

On Fri, Jun 20, 2014 at 05:49:36PM +0200, Vlastimil Babka wrote:
> Compaction scanners regularly check for lock contention and need_resched()
> through the compact_checklock_irqsave() function. However, if there is no
> contention, the lock can be held and IRQs disabled for a potentially long time.
>
> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
> time IRQs are disabled while isolating pages for migration") for the migration
> scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
> acquire the zone->lru_lock as late as possible") has changed the conditions so

You seem to refer to the incorrect commit, maybe you meant commit 2a1402aa044b?

> that the lock is dropped only when there's contention on the lock or
> need_resched() is true. Also, need_resched() is checked only when the lock is
> already held. The comment "give a chance to irqs before checking need_resched"
> is therefore misleading, as IRQs remain disabled when the check is done.
>
> This patch restores the behavior intended by commit b2eef8c0d0 and also tries
> to better balance and make more deterministic the time spent by checking for
> contention vs the time the scanners might run between the checks. It also
> avoids situations where checking has not been done often enough before. The
> result should be avoiding both too frequent and too infrequent contention
> checking, and especially the potentially long-running scans with IRQs disabled
> and no checking of need_resched() or for fatal signal pending, which can happen
> when many consecutive pages or pageblocks fail the preliminary tests and do not
> reach the later call site to compact_checklock_irqsave(), as explained below.
>
> Before the patch:
>
> In the migration scanner, compact_checklock_irqsave() was called each loop, if
> reached. If not reached, some lower-frequency checking could still be done if
> the lock was already held, but this would not result in aborting contended
> async compaction until reaching compact_checklock_irqsave() or end of
> pageblock. In the free scanner, it was similar but completely without the
> periodical checking, so lock can be potentially held until reaching the end of
> pageblock.
>
> After the patch, in both scanners:
>
> The periodical check is done as the first thing in the loop on each
> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
> function, which always unlocks the lock (if locked) and aborts async compaction
> if scheduling is needed. It also aborts any type of compaction when a fatal
> signal is pending.
>
> The compact_checklock_irqsave() function is replaced with a slightly different
> compact_trylock_irqsave(). The biggest difference is that the function is not
> called at all if the lock is already held. The periodical need_resched()
> checking is left solely to compact_unlock_should_abort(). The lock contention
> avoidance for async compaction is achieved by the periodical unlock by
> compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave()
> and aborting when trylock fails. Sync compaction does not use trylock.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Looks OK to me.

Reviewed-by: Naoya Horiguchi <[email protected]>

> ---
> mm/compaction.c | 114 ++++++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 73 insertions(+), 41 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e8cfac9..40da812 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -180,54 +180,72 @@ static void update_pageblock_skip(struct compact_control *cc,
> }
> #endif /* CONFIG_COMPACTION */
>
> -enum compact_contended should_release_lock(spinlock_t *lock)
> +/*
> + * Compaction requires the taking of some coarse locks that are potentially
> + * very heavily contended. For async compaction, back out if the lock cannot
> + * be taken immediately. For sync compaction, spin on the lock if needed.
> + *
> + * Returns true if the lock is held
> + * Returns false if the lock is not held and compaction should abort
> + */
> +static bool compact_trylock_irqsave(spinlock_t *lock,
> + unsigned long *flags, struct compact_control *cc)
> {
> - if (spin_is_contended(lock))
> - return COMPACT_CONTENDED_LOCK;
> - else if (need_resched())
> - return COMPACT_CONTENDED_SCHED;
> - else
> - return COMPACT_CONTENDED_NONE;
> + if (cc->mode == MIGRATE_ASYNC) {
> + if (!spin_trylock_irqsave(lock, *flags)) {
> + cc->contended = COMPACT_CONTENDED_LOCK;
> + return false;
> + }
> + } else {
> + spin_lock_irqsave(lock, *flags);
> + }
> +
> + return true;
> }
>
> /*
> * Compaction requires the taking of some coarse locks that are potentially
> - * very heavily contended. Check if the process needs to be scheduled or
> - * if the lock is contended. For async compaction, back out in the event
> - * if contention is severe. For sync compaction, schedule.
> + * very heavily contended. The lock should be periodically unlocked to avoid
> + * having disabled IRQs for a long time, even when there is nobody waiting on
> + * the lock. It might also be that allowing the IRQs will result in
> + * need_resched() becoming true. If scheduling is needed, async compaction
> + * aborts. Sync compaction schedules.
> + * Either compaction type will also abort if a fatal signal is pending.
> + * In either case if the lock was locked, it is dropped and not regained.
> *
> - * Returns true if the lock is held.
> - * Returns false if the lock is released and compaction should abort
> + * Returns true if compaction should abort due to fatal signal pending, or
> + * async compaction due to need_resched()
> + * Returns false when compaction can continue (sync compaction might have
> + * scheduled)
> */
> -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> - bool locked, struct compact_control *cc)
> +static bool compact_unlock_should_abort(spinlock_t *lock,
> + unsigned long flags, bool *locked, struct compact_control *cc)
> {
> - enum compact_contended contended = should_release_lock(lock);
> + if (*locked) {
> + spin_unlock_irqrestore(lock, flags);
> + *locked = false;
> + }
>
> - if (contended) {
> - if (locked) {
> - spin_unlock_irqrestore(lock, *flags);
> - locked = false;
> - }
> + if (fatal_signal_pending(current)) {
> + cc->contended = COMPACT_CONTENDED_SCHED;
> + return true;
> + }
>
> - /* async aborts if taking too long or contended */
> + if (need_resched()) {
> if (cc->mode == MIGRATE_ASYNC) {
> - cc->contended = contended;
> - return false;
> + cc->contended = COMPACT_CONTENDED_SCHED;
> + return true;
> }
> -
> cond_resched();
> }
>
> - if (!locked)
> - spin_lock_irqsave(lock, *flags);
> - return true;
> + return false;
> }
>
> /*
> * Aside from avoiding lock contention, compaction also periodically checks
> * need_resched() and either schedules in sync compaction or aborts async
> - * compaction. This is similar to what compact_checklock_irqsave() does, but
> + * compaction. This is similar to what compact_unlock_should_abort() does, but
> * is used where no lock is concerned.
> *
> * Returns false when no scheduling was needed, or sync compaction scheduled.
> @@ -286,6 +304,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> int isolated, i;
> struct page *page = cursor;
>
> + /*
> + * Periodically drop the lock (if held) regardless of its
> + * contention, to give chance to IRQs. Abort async compaction
> + * if contended.
> + */
> + if (!(blockpfn % SWAP_CLUSTER_MAX)
> + && compact_unlock_should_abort(&cc->zone->lock, flags,
> + &locked, cc))
> + break;
> +
> nr_scanned++;
> if (!pfn_valid_within(blockpfn))
> goto isolate_fail;
> @@ -303,8 +331,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> * spin on the lock and we acquire the lock as late as
> * possible.
> */
> - locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> - locked, cc);
> + if (!locked)
> + locked = compact_trylock_irqsave(&cc->zone->lock,
> + &flags, cc);
> if (!locked)
> break;
>
> @@ -506,13 +535,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>
> /* Time to isolate some pages for migration */
> for (; low_pfn < end_pfn; low_pfn++) {
> - /* give a chance to irqs before checking need_resched() */
> - if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
> - if (should_release_lock(&zone->lru_lock)) {
> - spin_unlock_irqrestore(&zone->lru_lock, flags);
> - locked = false;
> - }
> - }
> + /*
> + * Periodically drop the lock (if held) regardless of its
> + * contention, to give chance to IRQs. Abort async compaction
> + * if contended.
> + */
> + if (!(low_pfn % SWAP_CLUSTER_MAX)
> + && compact_unlock_should_abort(&zone->lru_lock, flags,
> + &locked, cc))
> + break;
>
> /*
> * migrate_pfn does not necessarily start aligned to a
> @@ -592,10 +623,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> page_count(page) > page_mapcount(page))
> continue;
>
> - /* Check if it is ok to still hold the lock */
> - locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> - locked, cc);
> - if (!locked || fatal_signal_pending(current))
> + /* If the lock is not held, try to take it */
> + if (!locked)
> + locked = compact_trylock_irqsave(&zone->lru_lock,
> + &flags, cc);
> + if (!locked)
> break;
>
> /* Recheck PageLRU and PageTransHuge under lock */
> --
> 1.8.4.5
>

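The unlock-then-maybe-abort discipline that compact_unlock_should_abort() introduces can be modeled without the kernel. The sketch below replaces the spinlock and the signal/resched state with plain flags (struct scan_state and its fields are illustrative, not kernel types) and counts how often the periodic check drops the "lock" during a scan:

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_CLUSTER_MAX 32UL

struct scan_state {
	bool locked;
	unsigned long drops;	/* how many times the lock was released */
	bool fatal;		/* models fatal_signal_pending() */
	bool async;		/* models cc->mode == MIGRATE_ASYNC */
	bool need_resched;	/* models need_resched() */
};

/* Model of compact_unlock_should_abort(): always unlock, maybe abort. */
static bool unlock_should_abort(struct scan_state *s)
{
	if (s->locked) {
		s->locked = false;
		s->drops++;
	}
	if (s->fatal)
		return true;		/* any compaction aborts */
	if (s->need_resched && s->async)
		return true;		/* async aborts; sync would resched */
	return false;
}

/* Scan [0, nr_pfns), taking the "lock" per page as the scanner might. */
static unsigned long scan(struct scan_state *s, unsigned long nr_pfns)
{
	unsigned long pfn;

	for (pfn = 0; pfn < nr_pfns; pfn++) {
		if (!(pfn % SWAP_CLUSTER_MAX) && unlock_should_abort(s))
			break;
		s->locked = true;	/* stand-in for compact_trylock_irqsave() */
	}
	return pfn;
}
```

The key property the patch argues for is visible here: the lock is dropped on every SWAP_CLUSTER_MAX boundary regardless of contention, so IRQ-disabled sections are bounded even when nobody is waiting on the lock.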
2014-06-24 15:42:55

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On 06/24/2014 10:33 AM, Joonsoo Kim wrote:
> On Fri, Jun 20, 2014 at 05:49:34PM +0200, Vlastimil Babka wrote:
>> isolate_migratepages_range() is the main function of the compaction scanner,
>> called either on a single pageblock by isolate_migratepages() during regular
>> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
>> It currently performs two pageblock-wide compaction suitability checks, and
>> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
>> order to repeat those checks.
>>
>> However, closer inspection shows that those checks are always true for CMA:
>> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
>> - migrate_async_suitable() check is skipped because CMA uses sync compaction
>>
>> We can therefore move the checks to isolate_migratepages(), reducing variables
>> and simplifying isolate_migratepages_range(). The update_pageblock_skip()
>> function also no longer needs set_unsuitable parameter.
>>
>> Furthermore, going back to compact_zone() and compact_finished() when pageblock
>> is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
>> The patch therefore also introduces a simple loop into isolate_migratepages()
>> so that it does not return immediately on pageblock checks, but keeps going
>> until isolate_migratepages_range() gets called once. Similarly to
>> isolate_freepages(), the function periodically checks if it needs to reschedule
>> or abort async compaction.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Joonsoo Kim <[email protected]>
>> Cc: Michal Nazarewicz <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> ---
>> mm/compaction.c | 112 +++++++++++++++++++++++++++++---------------------------
>> 1 file changed, 59 insertions(+), 53 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 3064a7f..ebe30c9 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>> */
>> static void update_pageblock_skip(struct compact_control *cc,
>> struct page *page, unsigned long nr_isolated,
>> - bool set_unsuitable, bool migrate_scanner)
>> + bool migrate_scanner)
>> {
>> struct zone *zone = cc->zone;
>> unsigned long pfn;
>> @@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
>> if (nr_isolated)
>> return;
>>
>> - /*
>> - * Only skip pageblocks when all forms of compaction will be known to
>> - * fail in the near future.
>> - */
>> - if (set_unsuitable)
>> - set_pageblock_skip(page);
>> + set_pageblock_skip(page);
>>
>> pfn = page_to_pfn(page);
>>
>> @@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,
>>
>> static void update_pageblock_skip(struct compact_control *cc,
>> struct page *page, unsigned long nr_isolated,
>> - bool set_unsuitable, bool migrate_scanner)
>> + bool migrate_scanner)
>> {
>> }
>> #endif /* CONFIG_COMPACTION */
>> @@ -345,8 +340,7 @@ isolate_fail:
>>
>> /* Update the pageblock-skip if the whole pageblock was scanned */
>> if (blockpfn == end_pfn)
>> - update_pageblock_skip(cc, valid_page, total_isolated, true,
>> - false);
>> + update_pageblock_skip(cc, valid_page, total_isolated, false);
>>
>> count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
>> if (total_isolated)
>> @@ -474,14 +468,12 @@ unsigned long
>> isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>> unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
>> {
>> - unsigned long last_pageblock_nr = 0, pageblock_nr;
>> unsigned long nr_scanned = 0, nr_isolated = 0;
>> struct list_head *migratelist = &cc->migratepages;
>> struct lruvec *lruvec;
>> unsigned long flags;
>> bool locked = false;
>> struct page *page = NULL, *valid_page = NULL;
>> - bool set_unsuitable = true;
>> const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
>> ISOLATE_ASYNC_MIGRATE : 0) |
>> (unevictable ? ISOLATE_UNEVICTABLE : 0);
>> @@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>> if (!valid_page)
>> valid_page = page;
>>
>> - /* If isolation recently failed, do not retry */
>> - pageblock_nr = low_pfn >> pageblock_order;
>> - if (last_pageblock_nr != pageblock_nr) {
>> - int mt;
>> -
>> - last_pageblock_nr = pageblock_nr;
>> - if (!isolation_suitable(cc, page))
>> - goto next_pageblock;
>> -
>> - /*
>> - * For async migration, also only scan in MOVABLE
>> - * blocks. Async migration is optimistic to see if
>> - * the minimum amount of work satisfies the allocation
>> - */
>> - mt = get_pageblock_migratetype(page);
>> - if (cc->mode == MIGRATE_ASYNC &&
>> - !migrate_async_suitable(mt)) {
>> - set_unsuitable = false;
>> - goto next_pageblock;
>> - }
>> - }
>> -
>> /*
>> * Skip if free. page_order cannot be used without zone->lock
>> * as nothing prevents parallel allocations or buddy merging.
>> @@ -668,8 +638,7 @@ next_pageblock:
>> * if the whole pageblock was scanned without isolating any page.
>> */
>> if (low_pfn == end_pfn)
>> - update_pageblock_skip(cc, valid_page, nr_isolated,
>> - set_unsuitable, true);
>> + update_pageblock_skip(cc, valid_page, nr_isolated, true);
>>
>> trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>>
>> @@ -840,34 +809,74 @@ typedef enum {
>> } isolate_migrate_t;
>>
>> /*
>> - * Isolate all pages that can be migrated from the block pointed to by
>> - * the migrate scanner within compact_control.
>> + * Isolate all pages that can be migrated from the first suitable block,
>> + * starting at the block pointed to by the migrate scanner pfn within
>> + * compact_control.
>> */
>> static isolate_migrate_t isolate_migratepages(struct zone *zone,
>> struct compact_control *cc)
>> {
>> unsigned long low_pfn, end_pfn;
>> + struct page *page;
>>
>> - /* Do not scan outside zone boundaries */
>> - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
>> + /* Start at where we last stopped, or beginning of the zone */
>> + low_pfn = cc->migrate_pfn;
>>
>> /* Only scan within a pageblock boundary */
>> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>>
>> - /* Do not cross the free scanner or scan within a memory hole */
>> - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
>> - cc->migrate_pfn = end_pfn;
>> - return ISOLATE_NONE;
>> - }
>> + /*
>> + * Iterate over whole pageblocks until we find the first suitable.
>> + * Do not cross the free scanner.
>> + */
>> + for (; end_pfn <= cc->free_pfn;
>> + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
>> +
>> + /*
>> + * This can potentially iterate a massively long zone with
>> + * many pageblocks unsuitable, so periodically check if we
>> + * need to schedule, or even abort async compaction.
>> + */
>> + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
>> + && compact_should_abort(cc))
>> + break;
>>
>> - /* Perform the isolation */
>> - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
>> - if (!low_pfn || cc->contended)
>> - return ISOLATE_ABORT;
>> + /* Do not scan within a memory hole */
>> + if (!pfn_valid(low_pfn))
>> + continue;
>> +
>> + page = pfn_to_page(low_pfn);
>> + /* If isolation recently failed, do not retry */
>> + if (!isolation_suitable(cc, page))
>> + continue;
>>
>> + /*
>> + * For async compaction, also only scan in MOVABLE blocks.
>> + * Async compaction is optimistic to see if the minimum amount
>> + * of work satisfies the allocation.
>> + */
>> + if (cc->mode == MIGRATE_ASYNC &&
>> + !migrate_async_suitable(get_pageblock_migratetype(page)))
>> + continue;
>> +
>> + /* Perform the isolation */
>> + low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
>> + end_pfn, false);
>> + if (!low_pfn || cc->contended)
>> + return ISOLATE_ABORT;
>> +
>> + /*
>> + * Either we isolated something and proceed with migration. Or
>> + * we failed and compact_zone should decide if we should
>> + * continue or not.
>> + */
>> + break;
>> + }
>> +
>> + /* Record where migration scanner will be restarted */
>
> If we make the isolate_migratepages* interface like isolate_freepages*,
> we can get cleaner and micro-optimized code. Because
> isolate_migratepages_range() can handle an arbitrary range and this patch
> makes isolate_migratepages() also handle an arbitrary range, there would
> be some redundant code. :)

I'm not sure it's worth it yet. Where is the arbitrary range adding
overhead? I can only imagine that the next_pageblock: label could do a
'break;' instead of setting up next_capture_pfn, but that's about it AFAICS.

> Thanks.
>

2014-06-24 15:44:18

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

On 06/24/2014 05:39 PM, Naoya Horiguchi wrote:
> On Fri, Jun 20, 2014 at 05:49:36PM +0200, Vlastimil Babka wrote:
>> Compaction scanners regularly check for lock contention and need_resched()
>> through the compact_checklock_irqsave() function. However, if there is no
>> contention, the lock can be held and IRQ disabled for potentially long time.
>>
>> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
>> time IRQs are disabled while isolating pages for migration") for the migration
>> scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
>> acquire the zone->lru_lock as late as possible") has changed the conditions so
>
> You seem to refer to the incorrect commit, maybe you meant commit 2a1402aa044b?

Oops, right. Thanks!

2014-06-24 17:34:38

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Tue, Jun 24, 2014 at 05:34:32PM +0200, Vlastimil Babka wrote:
> On 06/24/2014 06:52 AM, Naoya Horiguchi wrote:
> >>- low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
> >>- if (!low_pfn || cc->contended)
> >>- return ISOLATE_ABORT;
> >>+ /* Do not scan within a memory hole */
> >>+ if (!pfn_valid(low_pfn))
> >>+ continue;
> >>+
> >>+ page = pfn_to_page(low_pfn);
> >
> >Can we move (page_zone != zone) check here as isolate_freepages() does?
>
> Duplicate perhaps, not sure about move.

Sorry for my unclearness.
I meant that we had better do this check in the per-pageblock loop (as the
free scanner does) instead of in the per-pfn loop (as we do now).

> Does CMA make sure that all pages
> are in the same zone?

It seems not; CMA just specifies a start pfn and an end pfn, so it can cover
multiple zones.
And we also have a case of node overlapping as commented in commit dc9086004
"mm: compaction: check for overlapping nodes during isolation for migration".
So we need this check on the compaction side.

Thanks,
Naoya Horiguchi

> Common sense tells me it would be useless otherwise,
> but I haven't checked if we can rely on it.

2014-06-24 18:56:12

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v3 07/13] mm, compaction: skip rechecks when lock was already held

On Fri, Jun 20, 2014 at 05:49:37PM +0200, Vlastimil Babka wrote:
> Compaction scanners try to lock zone locks as late as possible by checking
> many page or pageblock properties opportunistically without lock and skipping
> them if not unsuitable. For pages that pass the initial checks, some properties
> have to be checked again safely under lock. However, if the lock was already
> held from a previous iteration in the initial checks, the rechecks are
> unnecessary.
>
> This patch therefore skips the rechecks when the lock was already held. This is
> now possible to do, since we don't (potentially) drop and reacquire the lock
> between the initial checks and the safe rechecks anymore.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Acked-by: David Rientjes <[email protected]>

Reviewed-by: Naoya Horiguchi <[email protected]>

> ---
> mm/compaction.c | 53 +++++++++++++++++++++++++++++++----------------------
> 1 file changed, 31 insertions(+), 22 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 40da812..9f6e857 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -324,22 +324,30 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> goto isolate_fail;
>
> /*
> - * The zone lock must be held to isolate freepages.
> - * Unfortunately this is a very coarse lock and can be
> - * heavily contended if there are parallel allocations
> - * or parallel compactions. For async compaction do not
> - * spin on the lock and we acquire the lock as late as
> - * possible.
> + * If we already hold the lock, we can skip some rechecking.
> + * Note that if we hold the lock now, checked_pageblock was
> + * already set in some previous iteration (or strict is true),
> + * so it is correct to skip the suitable migration target
> + * recheck as well.
> */
> - if (!locked)
> + if (!locked) {
> + /*
> + * The zone lock must be held to isolate freepages.
> + * Unfortunately this is a very coarse lock and can be
> + * heavily contended if there are parallel allocations
> + * or parallel compactions. For async compaction do not
> + * spin on the lock and we acquire the lock as late as
> + * possible.
> + */
> locked = compact_trylock_irqsave(&cc->zone->lock,
> &flags, cc);
> - if (!locked)
> - break;
> + if (!locked)
> + break;
>
> - /* Recheck this is a buddy page under lock */
> - if (!PageBuddy(page))
> - goto isolate_fail;
> + /* Recheck this is a buddy page under lock */
> + if (!PageBuddy(page))
> + goto isolate_fail;
> + }
>
> /* Found a free page, break it into order-0 pages */
> isolated = split_free_page(page);
> @@ -623,19 +631,20 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> page_count(page) > page_mapcount(page))
> continue;
>
> - /* If the lock is not held, try to take it */
> - if (!locked)
> + /* If we already hold the lock, we can skip some rechecking */
> + if (!locked) {
> locked = compact_trylock_irqsave(&zone->lru_lock,
> &flags, cc);
> - if (!locked)
> - break;
> + if (!locked)
> + break;
>
> - /* Recheck PageLRU and PageTransHuge under lock */
> - if (!PageLRU(page))
> - continue;
> - if (PageTransHuge(page)) {
> - low_pfn += (1 << compound_order(page)) - 1;
> - continue;
> + /* Recheck PageLRU and PageTransHuge under lock */
> + if (!PageLRU(page))
> + continue;
> + if (PageTransHuge(page)) {
> + low_pfn += (1 << compound_order(page)) - 1;
> + continue;
> + }
> }
>
> lruvec = mem_cgroup_page_lruvec(page, zone);
> --
> 1.8.4.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]
>

2014-06-24 19:09:34

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v3 08/13] mm, compaction: remember position within pageblock in free pages scanner

On Fri, Jun 20, 2014 at 05:49:38PM +0200, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might therefore be rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
>
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
>
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Zhang Yanfei <[email protected]>

Reviewed-by: Naoya Horiguchi <[email protected]>

> ---
> mm/compaction.c | 40 +++++++++++++++++++++++++++++++---------
> 1 file changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9f6e857..41c7005 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -287,7 +287,7 @@ static bool suitable_migration_target(struct page *page)
> * (even though it may still end up isolating some pages).
> */
> static unsigned long isolate_freepages_block(struct compact_control *cc,
> - unsigned long blockpfn,
> + unsigned long *start_pfn,
> unsigned long end_pfn,
> struct list_head *freelist,
> bool strict)
> @@ -296,6 +296,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> struct page *cursor, *valid_page = NULL;
> unsigned long flags;
> bool locked = false;
> + unsigned long blockpfn = *start_pfn;
>
> cursor = pfn_to_page(blockpfn);
>
> @@ -369,6 +370,9 @@ isolate_fail:
> break;
> }
>
> + /* Record how far we have got within the block */
> + *start_pfn = blockpfn;
> +
> trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
>
> /*
> @@ -413,6 +417,9 @@ isolate_freepages_range(struct compact_control *cc,
> LIST_HEAD(freelist);
>
> for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> + /* Protect pfn from changing by isolate_freepages_block */
> + unsigned long isolate_start_pfn = pfn;
> +
> if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
> break;
>
> @@ -423,8 +430,8 @@ isolate_freepages_range(struct compact_control *cc,
> block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
> block_end_pfn = min(block_end_pfn, end_pfn);
>
> - isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
> - &freelist, true);
> + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> + block_end_pfn, &freelist, true);
>
> /*
> * In strict mode, isolate_freepages_block() returns 0 if
> @@ -708,6 +715,7 @@ static void isolate_freepages(struct zone *zone,
> {
> struct page *page;
> unsigned long block_start_pfn; /* start of current pageblock */
> + unsigned long isolate_start_pfn; /* exact pfn we start at */
> unsigned long block_end_pfn; /* end of current pageblock */
> unsigned long low_pfn; /* lowest pfn scanner is able to scan */
> int nr_freepages = cc->nr_freepages;
> @@ -716,14 +724,15 @@ static void isolate_freepages(struct zone *zone,
> /*
> * Initialise the free scanner. The starting point is where we last
> * successfully isolated from, zone-cached value, or the end of the
> - * zone when isolating for the first time. We need this aligned to
> - * the pageblock boundary, because we do
> + * zone when isolating for the first time. For looping we also need
> + * this pfn aligned down to the pageblock boundary, because we do
> * block_start_pfn -= pageblock_nr_pages in the for loop.
> * For ending point, take care when isolating in the last pageblock
> * of a zone which ends in the middle of a pageblock.
> * The low boundary is the end of the pageblock the migration scanner
> * is using.
> */
> + isolate_start_pfn = cc->free_pfn;
> block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
> block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
> zone_end_pfn(zone));
> @@ -736,7 +745,8 @@ static void isolate_freepages(struct zone *zone,
> */
> for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
> block_end_pfn = block_start_pfn,
> - block_start_pfn -= pageblock_nr_pages) {
> + block_start_pfn -= pageblock_nr_pages,
> + isolate_start_pfn = block_start_pfn) {
> unsigned long isolated;
>
> /*
> @@ -770,13 +780,25 @@ static void isolate_freepages(struct zone *zone,
> if (!isolation_suitable(cc, page))
> continue;
>
> - /* Found a block suitable for isolating free pages from */
> - cc->free_pfn = block_start_pfn;
> - isolated = isolate_freepages_block(cc, block_start_pfn,
> + /* Found a block suitable for isolating free pages from. */
> + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> block_end_pfn, freelist, false);
> nr_freepages += isolated;
>
> /*
> + * Remember where the free scanner should restart next time,
> + * which is where isolate_freepages_block() left off.
> + * But if it scanned the whole pageblock, isolate_start_pfn
> + * now points at block_end_pfn, which is the start of the next
> + * pageblock.
> + * In that case we will however want to restart at the start
> + * of the previous pageblock.
> + */
> + cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
> + isolate_start_pfn :
> + block_start_pfn - pageblock_nr_pages;
> +
> + /*
> * Set a flag that we successfully isolated in this pageblock.
> * In the next loop iteration, zone->compact_cached_free_pfn
> * will not be updated and thus it will effectively contain the
> --
> 1.8.4.5
>

2014-06-24 20:35:19

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] mm: rename allocflags_to_migratetype for clarity

On Fri, Jun 20, 2014 at 05:49:40PM +0200, Vlastimil Babka wrote:
> From: David Rientjes <[email protected]>
>
> The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
> ALLOC_CPUSET) that have separate semantics.
>
> The function allocflags_to_migratetype() actually takes gfp flags, not alloc
> flags, and returns a migratetype. Rename it to gfpflags_to_migratetype().
>
> Signed-off-by: David Rientjes <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Zhang Yanfei <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>

Reviewed-by: Naoya Horiguchi <[email protected]>

> ---
> include/linux/gfp.h | 2 +-
> mm/compaction.c | 4 ++--
> mm/page_alloc.c | 6 +++---
> 3 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 5e7219d..41b30fd 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -156,7 +156,7 @@ struct vm_area_struct;
> #define GFP_DMA32 __GFP_DMA32
>
> /* Convert GFP flags to their corresponding migrate type */
> -static inline int allocflags_to_migratetype(gfp_t gfp_flags)
> +static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
> {
> WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index df0961b..32c768b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1185,7 +1185,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> .nr_freepages = 0,
> .nr_migratepages = 0,
> .order = order,
> - .migratetype = allocflags_to_migratetype(gfp_mask),
> + .migratetype = gfpflags_to_migratetype(gfp_mask),
> .zone = zone,
> .mode = mode,
> };
> @@ -1237,7 +1237,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>
> *deferred = true;
> #ifdef CONFIG_CMA
> - if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> + if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> #endif
> /* Compact each zone in the list */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6593f79..70b8297 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2473,7 +2473,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> alloc_flags |= ALLOC_NO_WATERMARKS;
> }
> #ifdef CONFIG_CMA
> - if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> + if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> #endif
> return alloc_flags;
> @@ -2716,7 +2716,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> struct zone *preferred_zone;
> struct zoneref *preferred_zoneref;
> struct page *page = NULL;
> - int migratetype = allocflags_to_migratetype(gfp_mask);
> + int migratetype = gfpflags_to_migratetype(gfp_mask);
> unsigned int cpuset_mems_cookie;
> int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
> int classzone_idx;
> @@ -2750,7 +2750,7 @@ retry_cpuset:
> classzone_idx = zonelist_zone_idx(preferred_zoneref);
>
> #ifdef CONFIG_CMA
> - if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> + if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> #endif
> retry:
> --
> 1.8.4.5
>

2014-06-25 00:48:58

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Tue, Jun 24, 2014 at 05:42:50PM +0200, Vlastimil Babka wrote:
> On 06/24/2014 10:33 AM, Joonsoo Kim wrote:
> >On Fri, Jun 20, 2014 at 05:49:34PM +0200, Vlastimil Babka wrote:
> >>isolate_migratepages_range() is the main function of the compaction scanner,
> >>called either on a single pageblock by isolate_migratepages() during regular
> >>compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> >>It currently performs two pageblock-wide compaction suitability checks, and
> >>because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> >>order to repeat those checks.
> >>
> >>However, closer inspection shows that those checks are always true for CMA:
> >>- isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> >>- migrate_async_suitable() check is skipped because CMA uses sync compaction
> >>
> >>We can therefore move the checks to isolate_migratepages(), reducing variables
> >>and simplifying isolate_migratepages_range(). The update_pageblock_skip()
> >>function also no longer needs set_unsuitable parameter.
> >>
> >>Furthermore, going back to compact_zone() and compact_finished() when pageblock
> >>is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
> >>[remainder of changelog and diff snipped; quoted in full earlier in the thread]
> >>+ /* Record where migration scanner will be restarted */
> >
> >If we make the isolate_migratepages* interface like isolate_freepages*,
> >we can get cleaner and micro-optimized code. Because
> >isolate_migratepages_range() can handle an arbitrary range and this patch
> >makes isolate_migratepages() also handle an arbitrary range, there would
> >be some redundant code. :)
>
> I'm not sure it's worth it yet. Where is the arbitrary range
> adding overhead? I can only imagine that the next_pageblock: label
> could do a 'break;' instead of setting up next_capture_pfn, but that's
> about it AFAICS.

In fact, there is just minor overhead: pfn_valid().
And the isolate_freepages variants seem to do this correctly. :)

Someone could wonder why there are two isolate_migratepages variants
with arbitrary-range compaction ability. IMHO, one
isolate_migratepage_xxx for a pageblock range and two
isolate_migratepage_yyy/zzz for compaction and CMA would be a better
architecture.

And, one additional note. You can move update_pageblock_skip() to
isolate_migratepages() now.

Thanks.

2014-06-25 00:58:03

by Joonsoo Kim

Subject: Re: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

On Tue, Jun 24, 2014 at 05:29:27PM +0200, Vlastimil Babka wrote:
> On 06/24/2014 10:23 AM, Joonsoo Kim wrote:
> >On Fri, Jun 20, 2014 at 05:49:32PM +0200, Vlastimil Babka wrote:
> >>When direct sync compaction is often unsuccessful, it may become deferred for
> >>some time to avoid further useless attempts, both sync and async. Successful
> >>high-order allocations un-defer compaction, while further unsuccessful
> >>compaction attempts prolong the compaction deferred period.
> >>
> >>Currently the checking and setting deferred status is performed only on the
> >>preferred zone of the allocation that invoked direct compaction. But compaction
> >>itself is attempted on all eligible zones in the zonelist, so the behavior is
> >>suboptimal and may lead both to scenarios where 1) compaction is attempted
> >>uselessly, or 2) where it's not attempted despite good chances of succeeding,
> >>as shown on the examples below:
> >>
> >>1) A direct compaction with Normal preferred zone failed and set deferred
> >> compaction for the Normal zone. Another unrelated direct compaction with
> >> DMA32 as preferred zone will attempt to compact DMA32 zone even though
> >> the first compaction attempt also included DMA32 zone.
> >>
> >> In another scenario, compaction with Normal preferred zone failed to compact
> >> Normal zone, but succeeded in the DMA32 zone, so it will not defer
> >> compaction. In the next attempt, it will try Normal zone which will fail
> >> again, instead of skipping Normal zone and trying DMA32 directly.
> >>
> >>2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
> >> looking good. A direct compaction with preferred Normal zone will skip
> >> compaction of all zones including DMA32 because Normal was still deferred.
> >> The allocation might have succeeded in DMA32, but won't.
> >>
> >>This patch makes compaction deferring work on individual zone basis instead of
> >>preferred zone. For each zone, it checks compaction_deferred() to decide if the
> >>zone should be skipped. If watermarks fail after compacting the zone,
> >>defer_compaction() is called. The zone where watermarks passed can still be
> >>deferred when the allocation attempt is unsuccessful. When allocation is
> >>successful, compaction_defer_reset() is called for the zone containing the
> >>allocated page. This approach should approximate calling defer_compaction()
> >>only on zones where compaction was attempted and did not yield an allocated page.
> >>There might be corner cases but that is inevitable as long as the decision
> >>to stop compacting does not guarantee that a page will be allocated.
> >>
> >>During testing on a two-node machine with a single very small Normal zone on
> >>node 1, this patch has improved success rates in stress-highalloc mmtests
> >>benchmark. The success rates here were previously made worse by commit 3a025760fc
> >>("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was
> >>no longer resetting often enough the deferred compaction for the Normal zone,
> >>and DMA32 zones on both nodes were thus not considered for compaction.
> >>
> >>Signed-off-by: Vlastimil Babka <[email protected]>
> >>Cc: Minchan Kim <[email protected]>
> >>Cc: Mel Gorman <[email protected]>
> >>Cc: Joonsoo Kim <[email protected]>
> >>Cc: Michal Nazarewicz <[email protected]>
> >>Cc: Naoya Horiguchi <[email protected]>
> >>Cc: Christoph Lameter <[email protected]>
> >>Cc: Rik van Riel <[email protected]>
> >>Cc: David Rientjes <[email protected]>
> >>---
> >> include/linux/compaction.h | 6 ++++--
> >> mm/compaction.c | 29 ++++++++++++++++++++++++-----
> >> mm/page_alloc.c | 33 ++++++++++++++++++---------------
> >> 3 files changed, 46 insertions(+), 22 deletions(-)
> >>
> >>diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> >>index 01e3132..76f9beb 100644
> >>--- a/include/linux/compaction.h
> >>+++ b/include/linux/compaction.h
> >>@@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> >> extern int fragmentation_index(struct zone *zone, unsigned int order);
> >> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> int order, gfp_t gfp_mask, nodemask_t *mask,
> >>- enum migrate_mode mode, bool *contended);
> >>+ enum migrate_mode mode, bool *contended, bool *deferred,
> >>+ struct zone **candidate_zone);
> >> extern void compact_pgdat(pg_data_t *pgdat, int order);
> >> extern void reset_isolation_suitable(pg_data_t *pgdat);
> >> extern unsigned long compaction_suitable(struct zone *zone, int order);
> >>@@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> >> #else
> >> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> >>- enum migrate_mode mode, bool *contended)
> >>+ enum migrate_mode mode, bool *contended, bool *deferred,
> >>+ struct zone **candidate_zone)
> >> {
> >> return COMPACT_CONTINUE;
> >> }
> >>diff --git a/mm/compaction.c b/mm/compaction.c
> >>index 5175019..7c491d0 100644
> >>--- a/mm/compaction.c
> >>+++ b/mm/compaction.c
> >>@@ -1122,13 +1122,15 @@ int sysctl_extfrag_threshold = 500;
> >> * @nodemask: The allowed nodes to allocate from
> >> * @mode: The migration mode for async, sync light, or sync migration
> >> * @contended: Return value that is true if compaction was aborted due to lock contention
> >>- * @page: Optionally capture a free page of the requested order during compaction
> >>+ * @deferred: Return value that is true if compaction was deferred in all zones
> >>+ * @candidate_zone: Return the zone where we think allocation should succeed
> >> *
> >> * This is the main entry point for direct page compaction.
> >> */
> >> unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> >>- enum migrate_mode mode, bool *contended)
> >>+ enum migrate_mode mode, bool *contended, bool *deferred,
> >>+ struct zone **candidate_zone)
> >> {
> >> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> >> int may_enter_fs = gfp_mask & __GFP_FS;
> >>@@ -1142,8 +1144,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> if (!order || !may_enter_fs || !may_perform_io)
> >> return rc;
> >>
> >>- count_compact_event(COMPACTSTALL);
> >>-
> >>+ *deferred = true;
> >> #ifdef CONFIG_CMA
> >> if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> >> alloc_flags |= ALLOC_CMA;
> >>@@ -1153,16 +1154,34 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >> nodemask) {
> >> int status;
> >>
> >>+ if (compaction_deferred(zone, order))
> >>+ continue;
> >>+
> >>+ *deferred = false;
> >>+
> >> status = compact_zone_order(zone, order, gfp_mask, mode,
> >> contended);
> >> rc = max(status, rc);
> >>
> >> /* If a normal allocation would succeed, stop compacting */
> >> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> >>- alloc_flags))
> >>+ alloc_flags)) {
> >>+ *candidate_zone = zone;
> >> break;
> >
> >How about doing compaction_defer_reset() here?
> >
> >As you said before, although this check is successful, it doesn't ensure
> >success of highorder allocation, because of some unknown reason(ex: racy
> >allocation attempt steals this page). But, at least, passing this check
> >means that we succeed compaction and there is much possibility to exit
> >compaction without searching whole zone range.
>
> Well another reason is that the check is racy wrt NR_FREE counters
> drift. But I think it tends to be false negative rather than false
> positive. So it could work, and with page capture it would be quite
> accurate. I'll try.
>
> >So, highorder allocation failure doesn't means that we should defer
> >compaction.
> >
> >>+ } else if (mode != MIGRATE_ASYNC) {
> >>+ /*
> >>+ * We think that allocation won't succeed in this zone
> >>+ * so we defer compaction there. If it ends up
> >>+ * succeeding after all, it will be reset.
> >>+ */
> >>+ defer_compaction(zone, order);
> >>+ }
> >> }
> >>
> >>+ /* If at least one zone wasn't deferred, we count a compaction stall */
> >>+ if (!*deferred)
> >>+ count_compact_event(COMPACTSTALL);
> >>+
> >
> >Could you keep this counting in __alloc_pages_direct_compact()?
> >It will help to understand how this statistic works.
>
> Well, count_compact_event is defined in compaction.c and this would
> be usage in page_alloc.c. I'm not sure if it helps.

Yes, it is defined in compaction.c. But others, such as COMPACTSUCCESS/FAIL,
are counted by __alloc_pages_direct_compact() in page_alloc.c, so counting
this one there as well would be better, IMHO.

>
> >> return rc;
> >> }
> >
> >And if possible, it would be better to make deferred one of the compaction
> >statuses, like COMPACT_SKIPPED. It makes the code more clear.
>
> That could work inside try_to_compact_pages() as well.
>
> >>
> >>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>index ee92384..6593f79 100644
> >>--- a/mm/page_alloc.c
> >>+++ b/mm/page_alloc.c
> >>@@ -2238,18 +2238,17 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> >> bool *contended_compaction, bool *deferred_compaction,
> >> unsigned long *did_some_progress)
> >> {
> >>- if (!order)
> >>- return NULL;
> >>+ struct zone *last_compact_zone = NULL;
> >>
> >>- if (compaction_deferred(preferred_zone, order)) {
> >>- *deferred_compaction = true;
> >>+ if (!order)
> >> return NULL;
> >>- }
> >>
> >> current->flags |= PF_MEMALLOC;
> >> *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
> >> nodemask, mode,
> >>- contended_compaction);
> >>+ contended_compaction,
> >>+ deferred_compaction,
> >>+ &last_compact_zone);
> >> current->flags &= ~PF_MEMALLOC;
> >>
> >> if (*did_some_progress != COMPACT_SKIPPED) {
> >>@@ -2263,27 +2262,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> >> order, zonelist, high_zoneidx,
> >> alloc_flags & ~ALLOC_NO_WATERMARKS,
> >> preferred_zone, classzone_idx, migratetype);
> >>+
> >> if (page) {
> >>- preferred_zone->compact_blockskip_flush = false;
> >>- compaction_defer_reset(preferred_zone, order, true);
> >>+ struct zone *zone = page_zone(page);
> >>+
> >>+ zone->compact_blockskip_flush = false;
> >>+ compaction_defer_reset(zone, order, true);
> >> count_vm_event(COMPACTSUCCESS);
> >> return page;
> >
> >This snippet raise a though to me.
> >Why don't we call compaction_defer_reset() if we succeed in allocating a
> >highorder page on the fastpath or some other path? If we succeed in
> >allocating it on some other path rather than here, it means that the status of
> >memory has changed. So this deferred check would be a stale test.
>
> Hm, not sure if we want to do that in fast paths. As long as
> somebody succeeds, that means nobody has to try checking for
> deferred compaction and it doesn't matter. When they stop
> succeeding, then it may be stale, yes. But is it worth polluting
> fast paths with defer resets?

Yes, I don't think it is worth polluting fast paths with it.
It is just my quick thought and I just wanted to share the problem I
realized. If you don't have a good solution to this, please skip
this comment. :)

Thanks.

2014-06-25 01:58:17

by Naoya Horiguchi

Subject: Re: [PATCH v3 12/13] mm, compaction: try to capture the just-created high-order freepage

On Fri, Jun 20, 2014 at 05:49:42PM +0200, Vlastimil Babka wrote:
> Compaction uses watermark checking to determine if it succeeded in creating
> a high-order free page. My testing has shown that this is quite racy and it
> can happen that watermark checking in compaction succeeds, and moments later
> the watermark checking in page allocation fails, even though the number of
> free pages has increased meanwhile.
>
> It should be more reliable if direct compaction captured the high-order free
> page as soon as it detects it, and pass it back to allocation. This would
> also reduce the window for somebody else to allocate the free page.
>
> Capture has been implemented before by 1fb3f8ca0e92 ("mm: compaction: capture
> a suitable high-order page immediately when it is made available"), but later
> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
> high-order page") due to a bug.
>
> This patch differs from the previous attempt in two aspects:
>
> 1) The previous patch scanned free lists to capture the page. In this patch,
> only the cc->order aligned block that the migration scanner just finished
> is considered, but only if pages were actually isolated for migration in
> that block. Tracking cc->order aligned blocks also has benefits for the
> following patch that skips blocks where non-migratable pages were found.
>
> 2) The operations done in buffered_rmqueue() and get_page_from_freelist() are
> closely followed so that page capture mimics normal page allocation as much
> as possible. This includes operations such as prep_new_page() and
> page->pfmemalloc setting (that was missing in the previous attempt), zone
> statistics are updated etc. Due to subtleties with IRQ disabling and
> enabling this cannot be simply factored out from the normal allocation
> functions without affecting the fastpath.
>
> This patch has tripled compaction success rates (as recorded in vmstat) in
> stress-highalloc mmtests benchmark, although allocation success rates increased
> only by a few percent. Closer inspection shows that due to the racy watermark
> checking and lack of lru_add_drain(), the allocations that resulted in direct
> compactions were often failing, but later allocations succeeded in the fast
> path. So the benefit of the patch to allocation success rates may be limited,
> but it improves the fairness in the sense that whoever spent the time
> compacting has a higher chance of benefiting from it, and also can stop
> compacting sooner, as page availability is detected immediately. With better
> success detection, the contribution of compaction to high-order allocation
> success rates is also no longer understated by the vmstats.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
...
> @@ -669,6 +708,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> continue;
> if (PageTransHuge(page)) {
> low_pfn += (1 << compound_order(page)) - 1;
> + next_capture_pfn = low_pfn + 1;

Don't we need if (next_capture_pfn) here?

Thanks,
Naoya Horiguchi

> continue;
> }
> }

2014-06-25 08:50:58

by Vlastimil Babka

Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On 06/24/2014 06:58 PM, Naoya Horiguchi wrote:
> On Tue, Jun 24, 2014 at 05:34:32PM +0200, Vlastimil Babka wrote:
>> On 06/24/2014 06:52 AM, Naoya Horiguchi wrote:
>>>> - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
>>>> - if (!low_pfn || cc->contended)
>>>> - return ISOLATE_ABORT;
>>>> + /* Do not scan within a memory hole */
>>>> + if (!pfn_valid(low_pfn))
>>>> + continue;
>>>> +
>>>> + page = pfn_to_page(low_pfn);
>>>
>>> Can we move (page_zone != zone) check here as isolate_freepages() does?
>>
>> Duplicate perhaps, not sure about move.
>
> Sorry for my unclearness.
> I meant that we had better do this check in per-pageblock loop (as the free
> scanner does) instead of in per-pfn loop (as we do now.)

Hm I see, the migration and free scanners really do this differently.
The free scanner checks per-pageblock, but the migration scanner per-page.
Can we assume that zones will never overlap within a single pageblock?
The example in dc9086004 seems to be overlapping at even higher alignment,
so it should be safe to check only the first page in a pageblock.
And if that weren't the case, then I guess the freepage scanner would
already hit some errors on such a system?

But if that's true, why does page_is_buddy test if pages are in the same
zone?

>> Does CMA make sure that all pages
>> are in the same zone?
>
> It seems not, CMA just specifies start pfn and end pfn, so it can cover
> multiple zones.
> And we also have a case of node overlapping as commented in commit dc9086004
> "mm: compaction: check for overlapping nodes during isolation for migration".
> So we need this check in compaction side.
>
> Thanks,
> Naoya Horiguchi
>
>> Common sense tells me it would be useless otherwise,
>> but I haven't checked if we can rely on it.

2014-06-25 08:57:42

by Vlastimil Babka

Subject: Re: [PATCH v3 12/13] mm, compaction: try to capture the just-created high-order freepage

On 06/25/2014 03:57 AM, Naoya Horiguchi wrote:
> On Fri, Jun 20, 2014 at 05:49:42PM +0200, Vlastimil Babka wrote:
>> Compaction uses watermark checking to determine if it succeeded in creating
>> a high-order free page. My testing has shown that this is quite racy and it
>> can happen that watermark checking in compaction succeeds, and moments later
>> the watermark checking in page allocation fails, even though the number of
>> free pages has increased meanwhile.
>>
>> It should be more reliable if direct compaction captured the high-order free
>> page as soon as it detects it, and pass it back to allocation. This would
>> also reduce the window for somebody else to allocate the free page.
>>
>> Capture has been implemented before by 1fb3f8ca0e92 ("mm: compaction: capture
>> a suitable high-order page immediately when it is made available"), but later
>> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
>> high-order page") due to a bug.
>>
>> This patch differs from the previous attempt in two aspects:
>>
>> 1) The previous patch scanned free lists to capture the page. In this patch,
>> only the cc->order aligned block that the migration scanner just finished
>> is considered, but only if pages were actually isolated for migration in
>> that block. Tracking cc->order aligned blocks also has benefits for the
>> following patch that skips blocks where non-migratable pages were found.
>>
>> 2) The operations done in buffered_rmqueue() and get_page_from_freelist() are
>> closely followed so that page capture mimics normal page allocation as much
>> as possible. This includes operations such as prep_new_page() and
>> page->pfmemalloc setting (that was missing in the previous attempt), zone
>> statistics are updated etc. Due to subtleties with IRQ disabling and
>> enabling this cannot be simply factored out from the normal allocation
>> functions without affecting the fastpath.
>>
>> This patch has tripled compaction success rates (as recorded in vmstat) in
>> stress-highalloc mmtests benchmark, although allocation success rates increased
>> only by a few percent. Closer inspection shows that due to the racy watermark
>> checking and lack of lru_add_drain(), the allocations that resulted in direct
>> compactions were often failing, but later allocations succeeded in the fast
>> path. So the benefit of the patch to allocation success rates may be limited,
>> but it improves the fairness in the sense that whoever spent the time
>> compacting has a higher chance of benefiting from it, and also can stop
>> compacting sooner, as page availability is detected immediately. With better
>> success detection, the contribution of compaction to high-order allocation
>> success rates is also no longer understated by the vmstats.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Joonsoo Kim <[email protected]>
>> Cc: Michal Nazarewicz <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> ---
> ...
>> @@ -669,6 +708,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>> continue;
>> if (PageTransHuge(page)) {
>> low_pfn += (1 << compound_order(page)) - 1;
>> + next_capture_pfn = low_pfn + 1;
>
> Don't we need if (next_capture_pfn) here?

Good catch, thanks! It should also use ALIGN properly, as the non-locked
test above does.

> Thanks,
> Naoya Horiguchi
>
>> continue;
>> }
>> }

2014-06-25 08:59:25

by Vlastimil Babka

Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On 06/25/2014 02:53 AM, Joonsoo Kim wrote:
> On Tue, Jun 24, 2014 at 05:42:50PM +0200, Vlastimil Babka wrote:
>> On 06/24/2014 10:33 AM, Joonsoo Kim wrote:
>>> On Fri, Jun 20, 2014 at 05:49:34PM +0200, Vlastimil Babka wrote:
>>>> isolate_migratepages_range() is the main function of the compaction scanner,
>>>> called either on a single pageblock by isolate_migratepages() during regular
>>>> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> >>>>It currently performs two pageblock-wide compaction suitability checks, and
>>>> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
>>>> order to repeat those checks.
>>>>
>>>> However, closer inspection shows that those checks are always true for CMA:
>>>> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
>>>> - migrate_async_suitable() check is skipped because CMA uses sync compaction
>>>>
>>>> We can therefore move the checks to isolate_migratepages(), reducing variables
>>>> and simplifying isolate_migratepages_range(). The update_pageblock_skip()
>>>> function also no longer needs set_unsuitable parameter.
>>>>
>>>> Furthermore, going back to compact_zone() and compact_finished() when pageblock
>>>> is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
>>>> The patch therefore also introduces a simple loop into isolate_migratepages()
>>>> so that it does not return immediately on pageblock checks, but keeps going
> >>>>until isolate_migratepages_range() gets called once. Similarly to
>>>> isolate_freepages(), the function periodically checks if it needs to reschedule
>>>> or abort async compaction.
>>>>
>>>> Signed-off-by: Vlastimil Babka <[email protected]>
>>>> Cc: Minchan Kim <[email protected]>
>>>> Cc: Mel Gorman <[email protected]>
>>>> Cc: Joonsoo Kim <[email protected]>
>>>> Cc: Michal Nazarewicz <[email protected]>
>>>> Cc: Naoya Horiguchi <[email protected]>
>>>> Cc: Christoph Lameter <[email protected]>
>>>> Cc: Rik van Riel <[email protected]>
>>>> Cc: David Rientjes <[email protected]>
>>>> ---
>>>> mm/compaction.c | 112 +++++++++++++++++++++++++++++---------------------------
>>>> 1 file changed, 59 insertions(+), 53 deletions(-)
>>>>
>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>> index 3064a7f..ebe30c9 100644
>>>> --- a/mm/compaction.c
>>>> +++ b/mm/compaction.c
>>>> @@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>>>> */
>>>> static void update_pageblock_skip(struct compact_control *cc,
>>>> struct page *page, unsigned long nr_isolated,
>>>> - bool set_unsuitable, bool migrate_scanner)
>>>> + bool migrate_scanner)
>>>> {
>>>> struct zone *zone = cc->zone;
>>>> unsigned long pfn;
>>>> @@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
>>>> if (nr_isolated)
>>>> return;
>>>>
>>>> - /*
>>>> - * Only skip pageblocks when all forms of compaction will be known to
>>>> - * fail in the near future.
>>>> - */
>>>> - if (set_unsuitable)
>>>> - set_pageblock_skip(page);
>>>> + set_pageblock_skip(page);
>>>>
>>>> pfn = page_to_pfn(page);
>>>>
>>>> @@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,
>>>>
>>>> static void update_pageblock_skip(struct compact_control *cc,
>>>> struct page *page, unsigned long nr_isolated,
>>>> - bool set_unsuitable, bool migrate_scanner)
>>>> + bool migrate_scanner)
>>>> {
>>>> }
>>>> #endif /* CONFIG_COMPACTION */
>>>> @@ -345,8 +340,7 @@ isolate_fail:
>>>>
>>>> /* Update the pageblock-skip if the whole pageblock was scanned */
>>>> if (blockpfn == end_pfn)
>>>> - update_pageblock_skip(cc, valid_page, total_isolated, true,
>>>> - false);
>>>> + update_pageblock_skip(cc, valid_page, total_isolated, false);
>>>>
>>>> count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
>>>> if (total_isolated)
>>>> @@ -474,14 +468,12 @@ unsigned long
>>>> isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>>>> unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
>>>> {
>>>> - unsigned long last_pageblock_nr = 0, pageblock_nr;
>>>> unsigned long nr_scanned = 0, nr_isolated = 0;
>>>> struct list_head *migratelist = &cc->migratepages;
>>>> struct lruvec *lruvec;
>>>> unsigned long flags;
>>>> bool locked = false;
>>>> struct page *page = NULL, *valid_page = NULL;
>>>> - bool set_unsuitable = true;
>>>> const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
>>>> ISOLATE_ASYNC_MIGRATE : 0) |
>>>> (unevictable ? ISOLATE_UNEVICTABLE : 0);
>>>> @@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>>>> if (!valid_page)
>>>> valid_page = page;
>>>>
>>>> - /* If isolation recently failed, do not retry */
>>>> - pageblock_nr = low_pfn >> pageblock_order;
>>>> - if (last_pageblock_nr != pageblock_nr) {
>>>> - int mt;
>>>> -
>>>> - last_pageblock_nr = pageblock_nr;
>>>> - if (!isolation_suitable(cc, page))
>>>> - goto next_pageblock;
>>>> -
>>>> - /*
>>>> - * For async migration, also only scan in MOVABLE
>>>> - * blocks. Async migration is optimistic to see if
>>>> - * the minimum amount of work satisfies the allocation
>>>> - */
>>>> - mt = get_pageblock_migratetype(page);
>>>> - if (cc->mode == MIGRATE_ASYNC &&
>>>> - !migrate_async_suitable(mt)) {
>>>> - set_unsuitable = false;
>>>> - goto next_pageblock;
>>>> - }
>>>> - }
>>>> -
>>>> /*
>>>> * Skip if free. page_order cannot be used without zone->lock
>>>> * as nothing prevents parallel allocations or buddy merging.
>>>> @@ -668,8 +638,7 @@ next_pageblock:
>>>> * if the whole pageblock was scanned without isolating any page.
>>>> */
>>>> if (low_pfn == end_pfn)
>>>> - update_pageblock_skip(cc, valid_page, nr_isolated,
>>>> - set_unsuitable, true);
>>>> + update_pageblock_skip(cc, valid_page, nr_isolated, true);
>>>>
>>>> trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>>>>
>>>> @@ -840,34 +809,74 @@ typedef enum {
>>>> } isolate_migrate_t;
>>>>
>>>> /*
>>>> - * Isolate all pages that can be migrated from the block pointed to by
>>>> - * the migrate scanner within compact_control.
>>>> + * Isolate all pages that can be migrated from the first suitable block,
>>>> + * starting at the block pointed to by the migrate scanner pfn within
>>>> + * compact_control.
>>>> */
>>>> static isolate_migrate_t isolate_migratepages(struct zone *zone,
>>>> struct compact_control *cc)
>>>> {
>>>> unsigned long low_pfn, end_pfn;
>>>> + struct page *page;
>>>>
>>>> - /* Do not scan outside zone boundaries */
>>>> - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
>>>> + /* Start at where we last stopped, or beginning of the zone */
>>>> + low_pfn = cc->migrate_pfn;
>>>>
>>>> /* Only scan within a pageblock boundary */
>>>> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>>>>
>>>> - /* Do not cross the free scanner or scan within a memory hole */
>>>> - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
>>>> - cc->migrate_pfn = end_pfn;
>>>> - return ISOLATE_NONE;
>>>> - }
>>>> + /*
>>>> + * Iterate over whole pageblocks until we find the first suitable.
>>>> + * Do not cross the free scanner.
>>>> + */
>>>> + for (; end_pfn <= cc->free_pfn;
>>>> + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
>>>> +
>>>> + /*
>>>> + * This can potentially iterate a massively long zone with
>>>> + * many pageblocks unsuitable, so periodically check if we
>>>> + * need to schedule, or even abort async compaction.
>>>> + */
>>>> + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
>>>> + && compact_should_abort(cc))
>>>> + break;
>>>>
>>>> - /* Perform the isolation */
>>>> - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
>>>> - if (!low_pfn || cc->contended)
>>>> - return ISOLATE_ABORT;
>>>> + /* Do not scan within a memory hole */
>>>> + if (!pfn_valid(low_pfn))
>>>> + continue;
>>>> +
>>>> + page = pfn_to_page(low_pfn);
>>>> + /* If isolation recently failed, do not retry */
>>>> + if (!isolation_suitable(cc, page))
>>>> + continue;
>>>>
>>>> + /*
>>>> + * For async compaction, also only scan in MOVABLE blocks.
>>>> + * Async compaction is optimistic to see if the minimum amount
>>>> + * of work satisfies the allocation.
>>>> + */
>>>> + if (cc->mode == MIGRATE_ASYNC &&
>>>> + !migrate_async_suitable(get_pageblock_migratetype(page)))
>>>> + continue;
>>>> +
>>>> + /* Perform the isolation */
>>>> + low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
>>>> + end_pfn, false);
>>>> + if (!low_pfn || cc->contended)
>>>> + return ISOLATE_ABORT;
>>>> +
>>>> + /*
>>>> + * Either we isolated something and proceed with migration. Or
>>>> + * we failed and compact_zone should decide if we should
>>>> + * continue or not.
>>>> + */
>>>> + break;
>>>> + }
>>>> +
>>>> + /* Record where migration scanner will be restarted */
>>>
>>> If we make the isolate_migratepages* interface like isolate_freepages*,
>>> we can get cleaner and more micro-optimized code. Because
>>> isolate_migratepages_range() can handle an arbitrary range and this patch
>>> makes isolate_migratepages() also handle an arbitrary range, there would
>>> be some redundant code. :)
>>
>> I'm not sure if it's worth it yet. Where is the arbitrary range
>> adding overhead? I can only imagine that next_pageblock: label could
>> do a 'break;' instead of setting up next_capture_pfn, but that's
>> about it AFAICS.
>
> In fact, there is just minor overhead, pfn_valid().
> And the isolate_freepages variants seem to do this correctly. :)
>
> Someone could wonder why there are two isolate_migratepages variants
> with arbitrary-range compaction ability. IMHO, one
> isolate_migratepage_xxx for a pageblock range and two
> isolate_migratepage_yyy/zzz wrappers for compaction and CMA is a better
> architecture.

OK, I will try. But we need to resolve the "where to test for
page_zone() == cc->zone?" question.

> And, one additional note. You can move update_pageblock_skip() to
> isolate_migratepages() now.

Right, thanks.

> Thanks.
>

2014-06-25 15:47:17

by Naoya Horiguchi

Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Wed, Jun 25, 2014 at 10:50:51AM +0200, Vlastimil Babka wrote:
> On 06/24/2014 06:58 PM, Naoya Horiguchi wrote:
> >On Tue, Jun 24, 2014 at 05:34:32PM +0200, Vlastimil Babka wrote:
> >>On 06/24/2014 06:52 AM, Naoya Horiguchi wrote:
> >>>>- low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
> >>>>- if (!low_pfn || cc->contended)
> >>>>- return ISOLATE_ABORT;
> >>>>+ /* Do not scan within a memory hole */
> >>>>+ if (!pfn_valid(low_pfn))
> >>>>+ continue;
> >>>>+
> >>>>+ page = pfn_to_page(low_pfn);
> >>>
> >>>Can we move (page_zone != zone) check here as isolate_freepages() does?
> >>
> >>Duplicate perhaps, not sure about move.
> >
> >Sorry for my unclearness.
> >I meant that we had better do this check in per-pageblock loop (as the free
> >scanner does) instead of in per-pfn loop (as we do now.)
>
> Hm I see, the migration and free scanners really do this differently. The
> free scanner checks per-pageblock, but the migration scanner per-page.
> Can we assume that zones will never overlap within a single pageblock?

Maybe not, we have no such assumption.

> The example dc9086004 seems to be overlapping at even higher alignment, so it
> should be safe to check only the first page in a pageblock.
> And if that weren't the case, then I guess the freepage scanner would already
> hit some errors on such a system?

That's right. Such a system might be rare, so nobody has detected it, I guess.
So I was wrong, and the page_zone check should be done in the per-pfn loop in
both scanners?

I just think that it might be good if we had an iterator that runs over
pfns only in a given zone (rather than checking the page zone of each page),
but it introduces some more complexity in the scanners, so at this time
we don't have to do it in this series.

> But if that's true, why does page_is_buddy test if pages are in the same
> zone?

Yeah, this is why we think we can't make the above-mentioned assumption.

Thanks,
Naoya Horiguchi
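The per-pfn page_zone() check being debated can be illustrated with a small
userspace model. This is only a sketch of the scanning policy under discussion;
zone_of[] and scan_zone_pfns() are hypothetical stand-ins (a real scanner would
call page_zone(pfn_to_page(pfn)) under the kernel's memory model):

```c
#include <assert.h>

/*
 * Userspace model of the discussion above: since zones may in principle
 * interleave at sub-pageblock granularity, the migration scanner cannot
 * rely on checking only the first pfn of each pageblock.  zone_of[] is a
 * hypothetical stand-in for page_zone(pfn_to_page(pfn)).
 */
enum { NR_PFNS = 16 };

/* Hypothetical layout where two zones interleave within one pageblock. */
static const int zone_of[NR_PFNS] = {
	0, 0, 0, 1,  1, 0, 0, 0,
	0, 0, 1, 1,  0, 0, 0, 0,
};

/*
 * Count the pfns a scanner restricted to target_zone would actually
 * visit when it checks the zone of every pfn (the conservative per-pfn
 * approach discussed above).
 */
static int scan_zone_pfns(int target_zone, unsigned long start,
			  unsigned long end)
{
	int visited = 0;

	for (unsigned long pfn = start; pfn < end; pfn++) {
		if (zone_of[pfn] != target_zone)
			continue;	/* page belongs to another zone: skip */
		visited++;
	}
	return visited;
}
```

With the layout above, a per-pageblock check on only the first pfn would wrongly
attribute all 16 pfns to zone 0, while the per-pfn check visits only the 12 that
really belong there.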

2014-06-27 05:52:18

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Wed, Jun 25, 2014 at 10:59:19AM +0200, Vlastimil Babka wrote:
> On 06/25/2014 02:53 AM, Joonsoo Kim wrote:
> >On Tue, Jun 24, 2014 at 05:42:50PM +0200, Vlastimil Babka wrote:
> >>On 06/24/2014 10:33 AM, Joonsoo Kim wrote:
> >>>On Fri, Jun 20, 2014 at 05:49:34PM +0200, Vlastimil Babka wrote:
> >>>>isolate_migratepages_range() is the main function of the compaction scanner,
> >>>>called either on a single pageblock by isolate_migratepages() during regular
> >>>>compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> >>>>It currently performs two pageblock-wide compaction suitability checks, and
> >>>>because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> >>>>order to repeat those checks.
> >>>>
> >>>>However, closer inspection shows that those checks are always true for CMA:
> >>>>- isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> >>>>- migrate_async_suitable() check is skipped because CMA uses sync compaction
> >>>>
> >>>>We can therefore move the checks to isolate_migratepages(), reducing variables
> >>>>and simplifying isolate_migratepages_range(). The update_pageblock_skip()
> >>>>function also no longer needs set_unsuitable parameter.
> >>>>
> >>>>Furthermore, going back to compact_zone() and compact_finished() when pageblock
> >>>>is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
> >>>>The patch therefore also introduces a simple loop into isolate_migratepages()
> >>>>so that it does not return immediately on pageblock checks, but keeps going
> >>>>until isolate_migratepages_range() gets called once. Similarly to
> >>>>isolate_freepages(), the function periodically checks if it needs to reschedule
> >>>>or abort async compaction.
> >>>>
> >>>>Signed-off-by: Vlastimil Babka <[email protected]>
> >>>>Cc: Minchan Kim <[email protected]>
> >>>>Cc: Mel Gorman <[email protected]>
> >>>>Cc: Joonsoo Kim <[email protected]>
> >>>>Cc: Michal Nazarewicz <[email protected]>
> >>>>Cc: Naoya Horiguchi <[email protected]>
> >>>>Cc: Christoph Lameter <[email protected]>
> >>>>Cc: Rik van Riel <[email protected]>
> >>>>Cc: David Rientjes <[email protected]>
> >>>>---
> >>>> mm/compaction.c | 112 +++++++++++++++++++++++++++++---------------------------
> >>>> 1 file changed, 59 insertions(+), 53 deletions(-)
> >>>>
> >>>>diff --git a/mm/compaction.c b/mm/compaction.c
> >>>>index 3064a7f..ebe30c9 100644
> >>>>--- a/mm/compaction.c
> >>>>+++ b/mm/compaction.c
> >>>>@@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
> >>>> */
> >>>> static void update_pageblock_skip(struct compact_control *cc,
> >>>> struct page *page, unsigned long nr_isolated,
> >>>>- bool set_unsuitable, bool migrate_scanner)
> >>>>+ bool migrate_scanner)
> >>>> {
> >>>> struct zone *zone = cc->zone;
> >>>> unsigned long pfn;
> >>>>@@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
> >>>> if (nr_isolated)
> >>>> return;
> >>>>
> >>>>- /*
> >>>>- * Only skip pageblocks when all forms of compaction will be known to
> >>>>- * fail in the near future.
> >>>>- */
> >>>>- if (set_unsuitable)
> >>>>- set_pageblock_skip(page);
> >>>>+ set_pageblock_skip(page);
> >>>>
> >>>> pfn = page_to_pfn(page);
> >>>>
> >>>>@@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,
> >>>>
> >>>> static void update_pageblock_skip(struct compact_control *cc,
> >>>> struct page *page, unsigned long nr_isolated,
> >>>>- bool set_unsuitable, bool migrate_scanner)
> >>>>+ bool migrate_scanner)
> >>>> {
> >>>> }
> >>>> #endif /* CONFIG_COMPACTION */
> >>>>@@ -345,8 +340,7 @@ isolate_fail:
> >>>>
> >>>> /* Update the pageblock-skip if the whole pageblock was scanned */
> >>>> if (blockpfn == end_pfn)
> >>>>- update_pageblock_skip(cc, valid_page, total_isolated, true,
> >>>>- false);
> >>>>+ update_pageblock_skip(cc, valid_page, total_isolated, false);
> >>>>
> >>>> count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
> >>>> if (total_isolated)
> >>>>@@ -474,14 +468,12 @@ unsigned long
> >>>> isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >>>> unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
> >>>> {
> >>>>- unsigned long last_pageblock_nr = 0, pageblock_nr;
> >>>> unsigned long nr_scanned = 0, nr_isolated = 0;
> >>>> struct list_head *migratelist = &cc->migratepages;
> >>>> struct lruvec *lruvec;
> >>>> unsigned long flags;
> >>>> bool locked = false;
> >>>> struct page *page = NULL, *valid_page = NULL;
> >>>>- bool set_unsuitable = true;
> >>>> const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
> >>>> ISOLATE_ASYNC_MIGRATE : 0) |
> >>>> (unevictable ? ISOLATE_UNEVICTABLE : 0);
> >>>>@@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >>>> if (!valid_page)
> >>>> valid_page = page;
> >>>>
> >>>>- /* If isolation recently failed, do not retry */
> >>>>- pageblock_nr = low_pfn >> pageblock_order;
> >>>>- if (last_pageblock_nr != pageblock_nr) {
> >>>>- int mt;
> >>>>-
> >>>>- last_pageblock_nr = pageblock_nr;
> >>>>- if (!isolation_suitable(cc, page))
> >>>>- goto next_pageblock;
> >>>>-
> >>>>- /*
> >>>>- * For async migration, also only scan in MOVABLE
> >>>>- * blocks. Async migration is optimistic to see if
> >>>>- * the minimum amount of work satisfies the allocation
> >>>>- */
> >>>>- mt = get_pageblock_migratetype(page);
> >>>>- if (cc->mode == MIGRATE_ASYNC &&
> >>>>- !migrate_async_suitable(mt)) {
> >>>>- set_unsuitable = false;
> >>>>- goto next_pageblock;
> >>>>- }
> >>>>- }
> >>>>-
> >>>> /*
> >>>> * Skip if free. page_order cannot be used without zone->lock
> >>>> * as nothing prevents parallel allocations or buddy merging.
> >>>>@@ -668,8 +638,7 @@ next_pageblock:
> >>>> * if the whole pageblock was scanned without isolating any page.
> >>>> */
> >>>> if (low_pfn == end_pfn)
> >>>>- update_pageblock_skip(cc, valid_page, nr_isolated,
> >>>>- set_unsuitable, true);
> >>>>+ update_pageblock_skip(cc, valid_page, nr_isolated, true);
> >>>>
> >>>> trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
> >>>>
> >>>>@@ -840,34 +809,74 @@ typedef enum {
> >>>> } isolate_migrate_t;
> >>>>
> >>>> /*
> >>>>- * Isolate all pages that can be migrated from the block pointed to by
> >>>>- * the migrate scanner within compact_control.
> >>>>+ * Isolate all pages that can be migrated from the first suitable block,
> >>>>+ * starting at the block pointed to by the migrate scanner pfn within
> >>>>+ * compact_control.
> >>>> */
> >>>> static isolate_migrate_t isolate_migratepages(struct zone *zone,
> >>>> struct compact_control *cc)
> >>>> {
> >>>> unsigned long low_pfn, end_pfn;
> >>>>+ struct page *page;
> >>>>
> >>>>- /* Do not scan outside zone boundaries */
> >>>>- low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
> >>>>+ /* Start at where we last stopped, or beginning of the zone */
> >>>>+ low_pfn = cc->migrate_pfn;
> >>>>
> >>>> /* Only scan within a pageblock boundary */
> >>>> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
> >>>>
> >>>>- /* Do not cross the free scanner or scan within a memory hole */
> >>>>- if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> >>>>- cc->migrate_pfn = end_pfn;
> >>>>- return ISOLATE_NONE;
> >>>>- }
> >>>>+ /*
> >>>>+ * Iterate over whole pageblocks until we find the first suitable.
> >>>>+ * Do not cross the free scanner.
> >>>>+ */
> >>>>+ for (; end_pfn <= cc->free_pfn;
> >>>>+ low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
> >>>>+
> >>>>+ /*
> >>>>+ * This can potentially iterate a massively long zone with
> >>>>+ * many pageblocks unsuitable, so periodically check if we
> >>>>+ * need to schedule, or even abort async compaction.
> >>>>+ */
> >>>>+ if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
> >>>>+ && compact_should_abort(cc))
> >>>>+ break;
> >>>>
> >>>>- /* Perform the isolation */
> >>>>- low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
> >>>>- if (!low_pfn || cc->contended)
> >>>>- return ISOLATE_ABORT;
> >>>>+ /* Do not scan within a memory hole */
> >>>>+ if (!pfn_valid(low_pfn))
> >>>>+ continue;
> >>>>+
> >>>>+ page = pfn_to_page(low_pfn);
> >>>>+ /* If isolation recently failed, do not retry */
> >>>>+ if (!isolation_suitable(cc, page))
> >>>>+ continue;
> >>>>
> >>>>+ /*
> >>>>+ * For async compaction, also only scan in MOVABLE blocks.
> >>>>+ * Async compaction is optimistic to see if the minimum amount
> >>>>+ * of work satisfies the allocation.
> >>>>+ */
> >>>>+ if (cc->mode == MIGRATE_ASYNC &&
> >>>>+ !migrate_async_suitable(get_pageblock_migratetype(page)))
> >>>>+ continue;
> >>>>+
> >>>>+ /* Perform the isolation */
> >>>>+ low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
> >>>>+ end_pfn, false);
> >>>>+ if (!low_pfn || cc->contended)
> >>>>+ return ISOLATE_ABORT;
> >>>>+
> >>>>+ /*
> >>>>+ * Either we isolated something and proceed with migration. Or
> >>>>+ * we failed and compact_zone should decide if we should
> >>>>+ * continue or not.
> >>>>+ */
> >>>>+ break;
> >>>>+ }
> >>>>+
> >>>>+ /* Record where migration scanner will be restarted */
> >>>
> >>>If we make isolate_migratepages* interface like as isolate_freepages*,
> >>>we can get more clean and micro optimized code. Because
> >>>isolate_migratepages_range() can handle arbitrary range and this patch
> >>>make isolate_migratepages() also handle arbitrary range, there would
> >>>be some redundant codes. :)
> >>
> >>I'm not sure if it's worth already. Where is the arbitrary range
> >>adding overhead? I can only imagine that next_pageblock: label could
> >>do a 'break;' instead of setting up next_capture_pfn, but that's
> >>about it AFAICS.
> >
> >In fact, there is just minor overhead: pfn_valid().
> >And the isolate_freepages variants seem to do this correctly. :)
> >
> >Someone could wonder why there are two isolate_migratepages variants
> >with arbitrary-range compaction ability. IMHO, one
> >isolate_migratepage_xxx for a pageblock range and two
> >isolate_migratepage_yyy/zzz for compaction and CMA would be a better
> >architecture.
>
> OK, I will try. But we need to resolve the "where to test for
> page_zone() == cc->zone?" question.
>

Hello,

I have no idea here. :)
I saw the discussion between you and Naoya, and I also wonder when
we should check page_zone == cc->zone. Maybe we need help from more
experienced developers such as Andrew and others.

Thanks.
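The periodic abort check in the loop added by this patch (the
`low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)` test) can be sketched as a
small userspace model. The constants and count_abort_checks() below are purely
illustrative, not taken from a real kernel configuration:

```c
#include <assert.h>

/*
 * Userspace model of the periodic-check pattern in the patch above: the
 * migration scanner tests "low_pfn % (SWAP_CLUSTER_MAX *
 * pageblock_nr_pages) == 0" so that compact_should_abort() runs only once
 * per SWAP_CLUSTER_MAX pageblocks, not on every loop iteration.
 */
enum {
	PAGEBLOCK_NR_PAGES = 512,	/* e.g. order-9 pageblocks */
	SWAP_CLUSTER_MAX   = 32,
};

/*
 * Count how many times the abort check would fire while the scanner walks
 * pageblock-aligned pfns in [start, end).
 */
static int count_abort_checks(unsigned long start, unsigned long end)
{
	int checks = 0;

	for (unsigned long pfn = start; pfn < end; pfn += PAGEBLOCK_NR_PAGES) {
		if (!(pfn % (SWAP_CLUSTER_MAX * PAGEBLOCK_NR_PAGES)))
			checks++;	/* would call compact_should_abort() */
	}
	return checks;
}
```

So over a long run of unsuitable pageblocks the scheduling/abort check fires
only once every SWAP_CLUSTER_MAX pageblocks, keeping the loop cheap while still
bounding the time spent without rescheduling.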

2014-07-11 08:28:17

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

On 06/23/2014 03:39 AM, Minchan Kim wrote:
> Hello Vlastimil,
>
> On Fri, Jun 20, 2014 at 05:49:35PM +0200, Vlastimil Babka wrote:
>> Async compaction aborts when it detects zone lock contention or need_resched()
>> is true. David Rientjes has reported that in practice, most direct async
>> compactions for THP allocation abort due to need_resched(). This means that a
>> second direct compaction is never attempted, which might be OK for a page
>> fault, but khugepaged is intended to attempt a sync compaction in such a
>> case, and now it won't.
>>
>> This patch replaces "bool contended" in compact_control with an enum that
>> distinguishes between aborting due to need_resched() and aborting due to lock
>> contention. This allows propagating the abort through all compaction functions
>> as before, but declaring the direct compaction as contended only when lock
>> contention has been detected.
>>
>> A second problem is that try_to_compact_pages() did not act upon the reported
>> contention (both need_resched() or lock contention) and could proceed with
>> another zone from the zonelist. When need_resched() is true, that means
>> initializing another zone compaction, only to check again need_resched() in
>> isolate_migratepages() and aborting. For zone lock contention, the unintended
>> consequence is that the contended status reported back to the allocator
>> is decided from the last zone where compaction was attempted, which is rather
>> arbitrary.
>>
>> This patch fixes the problem in the following way:
>> - need_resched() being true after async compaction returned from a zone means
>> that further zones should not be tried. We do a cond_resched() so that we
>> do not hog the CPU, and abort. "contended" is reported as false, since we
>> did not fail due to lock contention.
>> - aborting zone compaction due to lock contention means we can still try
>> another zone, since it has different locks. We report back "contended" as
>> true only if compaction aborted due to lock contention in *all* zones
>> where it was attempted.
>>
>> As a result of these fixes, khugepaged will proceed with second sync compaction
>> as intended, when the preceding async compaction aborted due to need_resched().
>> Page fault compactions aborting due to need_resched() will spare some cycles
>> previously wasted by initializing another zone compaction only to abort again.
>> Lock contention will be reported only when compaction in all zones aborted due
>> to lock contention, and therefore it's not a good idea to try again after
>> reclaim.
>>
>> Reported-by: David Rientjes <[email protected]>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Joonsoo Kim <[email protected]>
>> Cc: Michal Nazarewicz <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> ---
>> mm/compaction.c | 48 +++++++++++++++++++++++++++++++++++++++---------
>> mm/internal.h | 15 +++++++++++----
>> 2 files changed, 50 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index ebe30c9..e8cfac9 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -180,9 +180,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>> }
>> #endif /* CONFIG_COMPACTION */
>>
>> -static inline bool should_release_lock(spinlock_t *lock)
>> +enum compact_contended should_release_lock(spinlock_t *lock)
>> {
>> - return need_resched() || spin_is_contended(lock);
>> + if (spin_is_contended(lock))
>> + return COMPACT_CONTENDED_LOCK;
>> + else if (need_resched())
>> + return COMPACT_CONTENDED_SCHED;
>> + else
>> + return COMPACT_CONTENDED_NONE;
>
> If you intentionally give lock contention higher priority than
> need_resched(), please write that down in a comment.
>
>> }
>>
>> /*
>> @@ -197,7 +202,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>> static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>> bool locked, struct compact_control *cc)
>> {
>> - if (should_release_lock(lock)) {
>> + enum compact_contended contended = should_release_lock(lock);
>> +
>> + if (contended) {
>> if (locked) {
>> spin_unlock_irqrestore(lock, *flags);
>> locked = false;
>> @@ -205,7 +212,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>
>> /* async aborts if taking too long or contended */
>> if (cc->mode == MIGRATE_ASYNC) {
>> - cc->contended = true;
>> + cc->contended = contended;
>> return false;
>> }
>
>
>>
>> @@ -231,7 +238,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>> /* async compaction aborts if contended */
>> if (need_resched()) {
>> if (cc->mode == MIGRATE_ASYNC) {
>> - cc->contended = true;
>> + cc->contended = COMPACT_CONTENDED_SCHED;
>> return true;
>> }
>>
>> @@ -1101,7 +1108,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>> VM_BUG_ON(!list_empty(&cc.freepages));
>> VM_BUG_ON(!list_empty(&cc.migratepages));
>>
>> - *contended = cc.contended;
>> + /* We only signal lock contention back to the allocator */
>> + *contended = cc.contended == COMPACT_CONTENDED_LOCK;
>
> Please write down *WHY* as well; we can see your intention by looking at the code.
>
>> return ret;
>> }
>>
>> @@ -1132,6 +1140,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> struct zone *zone;
>> int rc = COMPACT_SKIPPED;
>> int alloc_flags = 0;
>> + bool all_zones_contended = true;
>>
>> /* Check if the GFP flags allow compaction */
>> if (!order || !may_enter_fs || !may_perform_io)
>> @@ -1146,6 +1155,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
>> nodemask) {
>> int status;
>> + bool zone_contended;
>>
>> if (compaction_deferred(zone, order))
>> continue;
>> @@ -1153,8 +1163,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> *deferred = false;
>>
>> status = compact_zone_order(zone, order, gfp_mask, mode,
>> - contended);
>> + &zone_contended);
>> rc = max(status, rc);
>> + all_zones_contended &= zone_contended;
>>
>> /* If a normal allocation would succeed, stop compacting */
>> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
>> @@ -1168,12 +1179,31 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> * succeeding after all, it will be reset.
>> */
>> defer_compaction(zone, order);
>> + /*
>> + * If we stopped compacting due to need_resched(), do
>> + * not try further zones and yield the CPU.
>> + */
>
> For what? It would make your claim more clear.
>
>> + if (need_resched()) {
>
> compact_zone_order() reports contended as true only for lock contention,
> so it can't report contention caused by need_resched(), which is why you
> added the need_resched() check here. That's fragile, because it might not
> reflect the result of the preceding compact_zone_order() call. A cleaner
> approach is for compact_zone_order() to return zone_contended as an enum,
> not a bool, and to check it here.
>
> That means compact_zone_order() could return the enum, and
> try_to_compact_pages() could convert the result to a bool.
>
>> + /*
>> + * We might not have tried all the zones, so
>> + * be conservative and assume they are not
>> + * all lock contended.
>> + */
>> + all_zones_contended = false;
>> + cond_resched();
>> + break;
>> + }
>> }
>> }
>>
>> - /* If at least one zone wasn't deferred, we count a compaction stall */
>> - if (!*deferred)
>> + /*
>> + * If at least one zone wasn't deferred, we count a compaction stall
>> + * and we report if all zones that were tried were contended.
>> + */
>> + if (!*deferred) {
>> count_compact_event(COMPACTSTALL);
>> + *contended = all_zones_contended;
>
> Why don't you initialize contended as *false* at the start of the function?
>
>> + }
>>
>> return rc;
>> }
>> diff --git a/mm/internal.h b/mm/internal.h
>> index a1b651b..2c187d2 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>
>> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>
>> +/* Used to signal whether compaction detected need_sched() or lock contention */
>> +enum compact_contended {
>> + COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>> + COMPACT_CONTENDED_SCHED, /* need_sched() was true */
>> + COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
>> +};
>> +
>> /*
>> * in mm/compaction.c
>> */
>> @@ -144,10 +151,10 @@ struct compact_control {
>> int order; /* order a direct compactor needs */
>> int migratetype; /* MOVABLE, RECLAIMABLE etc */
>> struct zone *zone;
>> - bool contended; /* True if a lock was contended, or
>> - * need_resched() true during async
>> - * compaction
>> - */
>> + enum compact_contended contended; /* Signal need_sched() or lock
>> + * contention detected during
>> + * compaction
>> + */
>> };
>>
>> unsigned long
>> --
>
> Anyway, my biggest concern is that you are changing the current behavior,
> as I said earlier.

Well, that behavior is pretty new, introduced recently by David's patches,
and now he himself is saying that it makes page fault THP allocations
very unsuccessful and that need_resched() is not a good termination
criterion. But OK, I can restrict this to khugepaged...

> The old behavior in a THP page fault, when it consumed its own timeslice,
> was to just abort and fall back to 4K pages, but with your patch the new
> behavior is to take a rest when it finds need_resched() and go for another
> round with async, not sync, compaction. I'm not sure we need another round
> of async compaction at the cost of increased latency rather than falling
> back to 4K pages.
>
> It might be okay if the VMA has MADV_HUGEPAGE, which is a good hint
> indicating a non-temporary VMA, so the latency would be a trade-off, but
> not for a temporary big memory allocation on a HUGEPAGE_ALWAYS system.

Yeah, that could be a useful hint when we eventually move away from
need_resched() to a fixed/tunable amount of work per compaction attempt. But
I'd rather not introduce it in this series.

> If you really want to go this, could you show us numbers?
>
> 1. How much more successful could direct compaction be with this patch?

Doesn't seem to help. I'll drop it.

> 2. How much could we increase latency for a temporary allocation
> on a HUGEPAGE_ALWAYS system?

Don't have the data.
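Minchan's suggestion above (have compact_zone_order() return the contention
reason as an enum, and let try_to_compact_pages() derive the final bool) could
look roughly like the userspace sketch below. compact_zone_order_stub() and
try_zones() are illustrative stand-ins, not the kernel functions; only the enum
mirrors the one added to mm/internal.h by the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* The enum from the patch, mirroring mm/internal.h. */
enum compact_contended {
	COMPACT_CONTENDED_NONE = 0,	/* no contention detected */
	COMPACT_CONTENDED_SCHED,	/* need_resched() was true */
	COMPACT_CONTENDED_LOCK,		/* zone lock or lru_lock was contended */
};

/* Illustrative stub standing in for one zone's compaction attempt. */
static enum compact_contended
compact_zone_order_stub(enum compact_contended result)
{
	return result;
}

/*
 * Aggregate per-zone results the way the patch intends: report
 * "contended" to the allocator only if *all* attempted zones aborted on
 * lock contention, and stop trying further zones once need_resched()
 * caused an abort.
 */
static bool try_zones(const enum compact_contended *zones, int nr)
{
	bool all_zones_contended = true;

	for (int i = 0; i < nr; i++) {
		enum compact_contended c = compact_zone_order_stub(zones[i]);

		if (c == COMPACT_CONTENDED_SCHED) {
			/* Remaining zones untried: be conservative. */
			all_zones_contended = false;
			break;
		}
		all_zones_contended &= (c == COMPACT_CONTENDED_LOCK);
	}
	return all_zones_contended;
}
```

With this shape the enum reaches the zonelist loop intact, so the
need_resched() case is decided from the actual per-zone result rather than
re-checked separately, which addresses the fragility Minchan points out.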

2014-07-11 09:38:13

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

On 07/11/2014 10:28 AM, Vlastimil Babka wrote:
> On 06/23/2014 03:39 AM, Minchan Kim wrote:
>> Hello Vlastimil,
>>
>> On Fri, Jun 20, 2014 at 05:49:35PM +0200, Vlastimil Babka wrote:
>>> Async compaction aborts when it detects zone lock contention or need_resched()
>>> is true. David Rientjes has reported that in practice, most direct async
>>> compactions for THP allocation abort due to need_resched(). This means that a
>>> second direct compaction is never attempted, which might be OK for a page
>>> fault, but khugepaged is intended to attempt a sync compaction in such a
>>> case, and now it won't.
>>>
>>> This patch replaces "bool contended" in compact_control with an enum that
>>> distinguishes between aborting due to need_resched() and aborting due to lock
>>> contention. This allows propagating the abort through all compaction functions
>>> as before, but declaring the direct compaction as contended only when lock
>>> contention has been detected.
>>>
>>> A second problem is that try_to_compact_pages() did not act upon the reported
>>> contention (both need_resched() or lock contention) and could proceed with
>>> another zone from the zonelist. When need_resched() is true, that means
>>> initializing another zone compaction, only to check again need_resched() in
>>> isolate_migratepages() and aborting. For zone lock contention, the unintended
>>> consequence is that the contended status reported back to the allocator
>>> is decided from the last zone where compaction was attempted, which is rather
>>> arbitrary.
>>>
>>> This patch fixes the problem in the following way:
>>> - need_resched() being true after async compaction returned from a zone means
>>> that further zones should not be tried. We do a cond_resched() so that we
>>> do not hog the CPU, and abort. "contended" is reported as false, since we
>>> did not fail due to lock contention.
>>> - aborting zone compaction due to lock contention means we can still try
>>> another zone, since it has different locks. We report back "contended" as
>>> true only if compaction aborted due to lock contention in *all* zones
>>> where it was attempted.
>>>
>>> As a result of these fixes, khugepaged will proceed with second sync compaction
>>> as intended, when the preceding async compaction aborted due to need_resched().
>>> Page fault compactions aborting due to need_resched() will spare some cycles
>>> previously wasted by initializing another zone compaction only to abort again.
>>> Lock contention will be reported only when compaction in all zones aborted due
>>> to lock contention, and therefore it's not a good idea to try again after
>>> reclaim.
>>>
>>> Reported-by: David Rientjes <[email protected]>
>>> Signed-off-by: Vlastimil Babka <[email protected]>
>>> Cc: Minchan Kim <[email protected]>
>>> Cc: Mel Gorman <[email protected]>
>>> Cc: Joonsoo Kim <[email protected]>
>>> Cc: Michal Nazarewicz <[email protected]>
>>> Cc: Naoya Horiguchi <[email protected]>
>>> Cc: Christoph Lameter <[email protected]>
>>> Cc: Rik van Riel <[email protected]>
>>> ---
>>> mm/compaction.c | 48 +++++++++++++++++++++++++++++++++++++++---------
>>> mm/internal.h | 15 +++++++++++----
>>> 2 files changed, 50 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>> index ebe30c9..e8cfac9 100644
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -180,9 +180,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>>> }
>>> #endif /* CONFIG_COMPACTION */
>>>
>>> -static inline bool should_release_lock(spinlock_t *lock)
>>> +enum compact_contended should_release_lock(spinlock_t *lock)
>>> {
>>> - return need_resched() || spin_is_contended(lock);
>>> + if (spin_is_contended(lock))
>>> + return COMPACT_CONTENDED_LOCK;
>>> + else if (need_resched())
>>> + return COMPACT_CONTENDED_SCHED;
>>> + else
>>> + return COMPACT_CONTENDED_NONE;
>>
>> If you intentionally give lock contention higher priority than
>> need_resched(), please write that down in a comment.
>>
>>> }
>>>
>>> /*
>>> @@ -197,7 +202,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>>> static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>> bool locked, struct compact_control *cc)
>>> {
>>> - if (should_release_lock(lock)) {
>>> + enum compact_contended contended = should_release_lock(lock);
>>> +
>>> + if (contended) {
>>> if (locked) {
>>> spin_unlock_irqrestore(lock, *flags);
>>> locked = false;
>>> @@ -205,7 +212,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>>
>>> /* async aborts if taking too long or contended */
>>> if (cc->mode == MIGRATE_ASYNC) {
>>> - cc->contended = true;
>>> + cc->contended = contended;
>>> return false;
>>> }
>>
>>
>>>
>>> @@ -231,7 +238,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>>> /* async compaction aborts if contended */
>>> if (need_resched()) {
>>> if (cc->mode == MIGRATE_ASYNC) {
>>> - cc->contended = true;
>>> + cc->contended = COMPACT_CONTENDED_SCHED;
>>> return true;
>>> }
>>>
>>> @@ -1101,7 +1108,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>>> VM_BUG_ON(!list_empty(&cc.freepages));
>>> VM_BUG_ON(!list_empty(&cc.migratepages));
>>>
>>> - *contended = cc.contended;
>>> + /* We only signal lock contention back to the allocator */
>>> + *contended = cc.contended == COMPACT_CONTENDED_LOCK;
>>
>> Please write down *WHY* as well; we can see your intention by looking at the code.
>>
>>> return ret;
>>> }
>>>
>>> @@ -1132,6 +1140,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>>> struct zone *zone;
>>> int rc = COMPACT_SKIPPED;
>>> int alloc_flags = 0;
>>> + bool all_zones_contended = true;
>>>
>>> /* Check if the GFP flags allow compaction */
>>> if (!order || !may_enter_fs || !may_perform_io)
>>> @@ -1146,6 +1155,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>>> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
>>> nodemask) {
>>> int status;
>>> + bool zone_contended;
>>>
>>> if (compaction_deferred(zone, order))
>>> continue;
>>> @@ -1153,8 +1163,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>>> *deferred = false;
>>>
>>> status = compact_zone_order(zone, order, gfp_mask, mode,
>>> - contended);
>>> + &zone_contended);
>>> rc = max(status, rc);
>>> + all_zones_contended &= zone_contended;
>>>
>>> /* If a normal allocation would succeed, stop compacting */
>>> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
>>> @@ -1168,12 +1179,31 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>>> * succeeding after all, it will be reset.
>>> */
>>> defer_compaction(zone, order);
>>> + /*
>>> + * If we stopped compacting due to need_resched(), do
>>> + * not try further zones and yield the CPU.
>>> + */
>>
>> For what? It would make your claim more clear.
>>
>>> + if (need_resched()) {
>>
>> compact_zone_order() reports contended as true only for lock contention,
>> so it can't report contention caused by need_resched(), which is why you
>> added the need_resched() check here. That's fragile, because it might not
>> reflect the result of the preceding compact_zone_order() call. A cleaner
>> approach is for compact_zone_order() to return zone_contended as an enum,
>> not a bool, and to check it here.
>>
>> That means compact_zone_order() could return the enum, and
>> try_to_compact_pages() could convert the result to a bool.
>>
>>> + /*
>>> + * We might not have tried all the zones, so
>>> + * be conservative and assume they are not
>>> + * all lock contended.
>>> + */
>>> + all_zones_contended = false;
>>> + cond_resched();
>>> + break;
>>> + }
>>> }
>>> }
>>>
>>> - /* If at least one zone wasn't deferred, we count a compaction stall */
>>> - if (!*deferred)
>>> + /*
>>> + * If at least one zone wasn't deferred, we count a compaction stall
>>> + * and we report if all zones that were tried were contended.
>>> + */
>>> + if (!*deferred) {
>>> count_compact_event(COMPACTSTALL);
>>> + *contended = all_zones_contended;
>>
>> Why don't you initialize contended as *false* at the start of the function?
>>
>>> + }
>>>
>>> return rc;
>>> }
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index a1b651b..2c187d2 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>>
>>> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>>
>>> +/* Used to signal whether compaction detected need_sched() or lock contention */
>>> +enum compact_contended {
>>> + COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>>> + COMPACT_CONTENDED_SCHED, /* need_sched() was true */
>>> + COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
>>> +};
>>> +
>>> /*
>>> * in mm/compaction.c
>>> */
>>> @@ -144,10 +151,10 @@ struct compact_control {
>>> int order; /* order a direct compactor needs */
>>> int migratetype; /* MOVABLE, RECLAIMABLE etc */
>>> struct zone *zone;
>>> - bool contended; /* True if a lock was contended, or
>>> - * need_resched() true during async
>>> - * compaction
>>> - */
>>> + enum compact_contended contended; /* Signal need_sched() or lock
>>> + * contention detected during
>>> + * compaction
>>> + */
>>> };
>>>
>>> unsigned long
>>> --
>>
>> Anyway, my biggest concern is that you are changing the current behavior,
>> as I said earlier.
>
> Well, that behavior is pretty new, introduced recently by David's patches,
> and now he himself is saying that it makes page fault THP allocations
> very unsuccessful and that need_resched() is not a good termination
> criterion. But OK, I can restrict this to khugepaged...
>
>> The old behavior in a THP page fault, when it consumed its own timeslice,
>> was to just abort and fall back to 4K pages, but with your patch the new
>> behavior is to take a rest when it finds need_resched() and go for another
>> round with async, not sync, compaction. I'm not sure we need another round
>> of async compaction at the cost of increased latency rather than falling
>> back to 4K pages.
> >
>> It might be okay if the VMA has MADV_HUGEPAGE which is good hint to
>> indicate non-temporal VMA so latency would be trade-off but it's not
>> for temporal big memory allocation in HUGEPAGE_ALWAYS system.
>
> Yeah, that could be a useful hint when we eventually move away from
> need_resched() to a fixed/tunable amount of work per compaction attempt.
> But I'd rather not introduce it in this series.
>
>> If you really want to go this, could you show us numbers?
>>
>> 1. How much more successful can direct compaction be with this patch?
>
> Doesn't seem to help. I'll drop it.

Actually the patch was wrong, as the need_resched() check occurred under
the if (mode != MIGRATE_ASYNC) branch, while need_resched() contention is
only considered in MIGRATE_ASYNC mode. Oops.

>> 2. How much could we increase latency for a temporary allocation on a
>> HUGEPAGE_ALWAYS system?
>
> Don't have the data.
>

2014-07-11 12:03:56

by Vlastimil Babka

Subject: Re: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

On 06/23/2014 04:53 AM, Minchan Kim wrote:
> On Fri, Jun 20, 2014 at 05:49:36PM +0200, Vlastimil Babka wrote:
>> Compaction scanners regularly check for lock contention and need_resched()
>> through the compact_checklock_irqsave() function. However, if there is no
>> contention, the lock can be held and IRQ disabled for potentially long time.
>>
>> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
>> time IRQs are disabled while isolating pages for migration") for the migration
>> scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
>> acquire the zone->lru_lock as late as possible") has changed the conditions so
>> that the lock is dropped only when there's contention on the lock or
>> need_resched() is true. Also, need_resched() is checked only when the lock is
>> already held. The comment "give a chance to irqs before checking need_resched"
>> is therefore misleading, as IRQs remain disabled when the check is done.
>>
>> This patch restores the behavior intended by commit b2eef8c0d0 and also tries
>> to better balance and make more deterministic the time spent by checking for
>> contention vs the time the scanners might run between the checks. It also
>> avoids situations where checking has not been done often enough before. The
>> result should be avoiding both too frequent and too infrequent contention
>> checking, and especially the potentially long-running scans with IRQs disabled
>> and no checking of need_resched() or for fatal signal pending, which can happen
>> when many consecutive pages or pageblocks fail the preliminary tests and do not
>> reach the later call site to compact_checklock_irqsave(), as explained below.
>>
>> Before the patch:
>>
>> In the migration scanner, compact_checklock_irqsave() was called each loop, if
>> reached. If not reached, some lower-frequency checking could still be done if
>> the lock was already held, but this would not result in aborting contended
>> async compaction until reaching compact_checklock_irqsave() or end of
>> pageblock. In the free scanner, it was similar but completely without the
>> periodical checking, so the lock could potentially be held until reaching the
>> end of a pageblock.
>>
>> After the patch, in both scanners:
>>
>> The periodical check is done as the first thing in the loop on each
>> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
>> function, which always unlocks the lock (if locked) and aborts async compaction
>> if scheduling is needed. It also aborts any type of compaction when a fatal
>> signal is pending.
>>
>> The compact_checklock_irqsave() function is replaced with a slightly different
>> compact_trylock_irqsave(). The biggest difference is that the function is not
>> called at all if the lock is already held. The periodical need_resched()
>> checking is left solely to compact_unlock_should_abort(). The lock contention
>> avoidance for async compaction is achieved by the periodical unlock by
>> compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave()
>> and aborting when trylock fails. Sync compaction does not use trylock.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Michal Nazarewicz <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: David Rientjes <[email protected]>
>> ---
>> mm/compaction.c | 114 ++++++++++++++++++++++++++++++++++++--------------------
>> 1 file changed, 73 insertions(+), 41 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index e8cfac9..40da812 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -180,54 +180,72 @@ static void update_pageblock_skip(struct compact_control *cc,
>> }
>> #endif /* CONFIG_COMPACTION */
>>
>> -enum compact_contended should_release_lock(spinlock_t *lock)
>> +/*
>> + * Compaction requires the taking of some coarse locks that are potentially
>> + * very heavily contended. For async compaction, back out if the lock cannot
>> + * be taken immediately. For sync compaction, spin on the lock if needed.
>> + *
>> + * Returns true if the lock is held
>> + * Returns false if the lock is not held and compaction should abort
>> + */
>> +static bool compact_trylock_irqsave(spinlock_t *lock,
>> + unsigned long *flags, struct compact_control *cc)
>> {
>> - if (spin_is_contended(lock))
>> - return COMPACT_CONTENDED_LOCK;
>> - else if (need_resched())
>> - return COMPACT_CONTENDED_SCHED;
>> - else
>> - return COMPACT_CONTENDED_NONE;
>> + if (cc->mode == MIGRATE_ASYNC) {
>> + if (!spin_trylock_irqsave(lock, *flags)) {
>> + cc->contended = COMPACT_CONTENDED_LOCK;
>> + return false;
>> + }
>> + } else {
>> + spin_lock_irqsave(lock, *flags);
>> + }
>> +
>> + return true;
>> }
>>
>> /*
>> * Compaction requires the taking of some coarse locks that are potentially
>> - * very heavily contended. Check if the process needs to be scheduled or
>> - * if the lock is contended. For async compaction, back out in the event
>> - * if contention is severe. For sync compaction, schedule.
>> + * very heavily contended. The lock should be periodically unlocked to avoid
>> + * having disabled IRQs for a long time, even when there is nobody waiting on
>> + * the lock. It might also be that allowing the IRQs will result in
>> + * need_resched() becoming true. If scheduling is needed, async compaction
>> + * aborts. Sync compaction schedules.
>> + * Either compaction type will also abort if a fatal signal is pending.
>> + * In either case if the lock was locked, it is dropped and not regained.
>> *
>> - * Returns true if the lock is held.
>> - * Returns false if the lock is released and compaction should abort
>> + * Returns true if compaction should abort due to fatal signal pending, or
>> + * async compaction due to need_resched()
>> + * Returns false when compaction can continue (sync compaction might have
>> + * scheduled)
>> */
>> -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>> - bool locked, struct compact_control *cc)
>> +static bool compact_unlock_should_abort(spinlock_t *lock,
>> + unsigned long flags, bool *locked, struct compact_control *cc)
>> {
>> - enum compact_contended contended = should_release_lock(lock);
>> + if (*locked) {
>> + spin_unlock_irqrestore(lock, flags);
>> + *locked = false;
>> + }
>>
>> - if (contended) {
>> - if (locked) {
>> - spin_unlock_irqrestore(lock, *flags);
>> - locked = false;
>> - }
>> + if (fatal_signal_pending(current)) {
>> + cc->contended = COMPACT_CONTENDED_SCHED;
>> + return true;
>> + }
>
>
> Generally, this patch is really good for me, but I wonder what happens
> if we bail out on a fatal signal. Will every path handle it correctly
> and bail out of the direct compaction path?

Hm right, try_to_compact_pages() should check it before trying another
zone. Then it will be OK.

> I don't think so, but anyway, it would be another patch, so will you
> handle it later or include it in this patch series?

A good place to fix it is the previous patch. Thanks.

> If you want to handle it later, please put the XXX for TODO.
> Anyway,
>
> Acked-by: Minchan Kim <[email protected]>
>
>>
>> - /* async aborts if taking too long or contended */
>> + if (need_resched()) {
>> if (cc->mode == MIGRATE_ASYNC) {
>> - cc->contended = contended;
>> - return false;
>> + cc->contended = COMPACT_CONTENDED_SCHED;
>> + return true;
>> }
>> -
>> cond_resched();
>> }
>>
>> - if (!locked)
>> - spin_lock_irqsave(lock, *flags);
>> - return true;
>> + return false;
>> }
>>
>> /*
>> * Aside from avoiding lock contention, compaction also periodically checks
>> * need_resched() and either schedules in sync compaction or aborts async
>> - * compaction. This is similar to what compact_checklock_irqsave() does, but
>> + * compaction. This is similar to what compact_unlock_should_abort() does, but
>> * is used where no lock is concerned.
>> *
>> * Returns false when no scheduling was needed, or sync compaction scheduled.
>> @@ -286,6 +304,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>> int isolated, i;
>> struct page *page = cursor;
>>
>> + /*
>> + * Periodically drop the lock (if held) regardless of its
>> + * contention, to give chance to IRQs. Abort async compaction
>> + * if contended.
>> + */
>> + if (!(blockpfn % SWAP_CLUSTER_MAX)
>> + && compact_unlock_should_abort(&cc->zone->lock, flags,
>> + &locked, cc))
>> + break;
>> +
>> nr_scanned++;
>> if (!pfn_valid_within(blockpfn))
>> goto isolate_fail;
>> @@ -303,8 +331,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>> * spin on the lock and we acquire the lock as late as
>> * possible.
>> */
>> - locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
>> - locked, cc);
>> + if (!locked)
>> + locked = compact_trylock_irqsave(&cc->zone->lock,
>> + &flags, cc);
>> if (!locked)
>> break;
>>
>> @@ -506,13 +535,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>>
>> /* Time to isolate some pages for migration */
>> for (; low_pfn < end_pfn; low_pfn++) {
>> - /* give a chance to irqs before checking need_resched() */
>> - if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
>> - if (should_release_lock(&zone->lru_lock)) {
>> - spin_unlock_irqrestore(&zone->lru_lock, flags);
>> - locked = false;
>> - }
>> - }
>> + /*
>> + * Periodically drop the lock (if held) regardless of its
>> + * contention, to give chance to IRQs. Abort async compaction
>> + * if contended.
>> + */
>> + if (!(low_pfn % SWAP_CLUSTER_MAX)
>> + && compact_unlock_should_abort(&zone->lru_lock, flags,
>> + &locked, cc))
>> + break;
>>
>> /*
>> * migrate_pfn does not necessarily start aligned to a
>> @@ -592,10 +623,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>> page_count(page) > page_mapcount(page))
>> continue;
>>
>> - /* Check if it is ok to still hold the lock */
>> - locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
>> - locked, cc);
>> - if (!locked || fatal_signal_pending(current))
>> + /* If the lock is not held, try to take it */
>> + if (!locked)
>> + locked = compact_trylock_irqsave(&zone->lru_lock,
>> + &flags, cc);
>> + if (!locked)
>> break;
>>
>> /* Recheck PageLRU and PageTransHuge under lock */
>> --
>> 1.8.4.5
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: [email protected]
>