2014-07-28 13:15:16

by Vlastimil Babka

Subject: [PATCH v5 00/14] compaction: balancing overhead and success rates

Based on next-20140728.

Here's v5 of the series. I declared v4 ready on Friday, so obviously I had to
immediately realize there's a bug in one of the patches. So here's a new posting
with these changes:

- Patch 14 had two issues in v4. The major issue was leaking captured pages
when called from compact_pgdat() by kswapd. I thought that checking for
cc->order was enough to recognize a direct compaction, but missed that kswapd
also uses an order > 0 and does not consume any page. Fixed by adopting the
approach of the original capture commit 1fb3f8ca0e92: cc->capture_page is no
longer a "struct page *" but a "struct page **", and only when it is non-NULL
(in direct compaction) does it mean the caller wants a captured page pointer
filled there.

The minor issue was that during the changes between v3 and v4, a rebase snafu
led to removing the check whether to capture pages, based on matching pageblock
migratetype. So it could e.g. capture a page for an UNMOVABLE allocation inside
a MOVABLE pageblock, which is undesired. So in v5 the check is back, and
slightly improved by observing that if we want the whole pageblock, it does
not matter what its migratetype is.
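The pointer-to-pointer protocol described above can be sketched in plain
userspace C (the types and helper name below are simplified stand-ins, not
the actual kernel structures):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct page { unsigned long pfn; };

struct compact_control {
	int order;
	/*
	 * NULL when the caller (e.g. kswapd via compact_pgdat()) does not
	 * want a captured page, even though cc->order > 0; non-NULL only
	 * for direct compaction, pointing at the caller's "struct page *"
	 * to be filled in.
	 */
	struct page **capture_page;
};

/* Record a capture candidate only if the caller asked for one. */
static void maybe_capture(struct compact_control *cc, struct page *candidate)
{
	if (cc->capture_page)		/* direct compaction */
		*cc->capture_page = candidate;
	/* kswapd path: cc->capture_page == NULL, candidate is ignored */
}
```

Checking the pointer itself, rather than cc->order, is what distinguishes the
two callers.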

- Patch 15 dropped for now (was always RFC), will appear in a future series, as
it was negatively affecting extfrag. The (so far theoretical) explanation is
that by aggressively skipping a pageblock as soon as it cannot be completely
compacted, we fail to create free pages of order lower than 9 (the pageblock
order) that could otherwise still be created. Since page stealing works by
stealing the highest-order-available page and then satisfying potentially many
allocations from that, a lack of high-order pages means more stealing events in
potentially more different pageblocks, so that many MOVABLE pageblocks are
polluted, each by only a few UNMOVABLE pages, which is still enough to kill the
pageblocks for the purposes of THP(-like) allocation.

- Patch 7 has benchmark data included.

- Checkpatch fixes applied, Acked-by's from Mel added.

David Rientjes (2):
mm: rename allocflags_to_migratetype for clarity
mm, compaction: pass gfp mask to compact_control

Vlastimil Babka (12):
mm, THP: don't hold mmap_sem in khugepaged when allocating THP
mm, compaction: defer each zone individually instead of preferred zone
mm, compaction: do not count compact_stall if all zones skipped
compaction
mm, compaction: do not recheck suitable_migration_target under lock
mm, compaction: move pageblock checks up from
isolate_migratepages_range()
mm, compaction: reduce zone checking frequency in the migration
scanner
mm, compaction: khugepaged should not give up due to need_resched()
mm, compaction: periodically drop lock and restore IRQs in scanners
mm, compaction: skip rechecks when lock was already held
mm, compaction: remember position within pageblock in free pages
scanner
mm, compaction: skip buddy pages by their order in the migrate scanner
mm, compaction: try to capture the just-created high-order freepage

include/linux/compaction.h | 28 +-
include/linux/gfp.h | 2 +-
mm/compaction.c | 757 ++++++++++++++++++++++++++++++++-------------
mm/huge_memory.c | 20 +-
mm/internal.h | 28 +-
mm/page_alloc.c | 189 ++++++++---
6 files changed, 733 insertions(+), 291 deletions(-)

--
1.8.4.5


2014-07-28 13:12:10

by Vlastimil Babka

Subject: [PATCH v5 01/14] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

When allocating huge page for collapsing, khugepaged currently holds mmap_sem
for reading on the mm where collapsing occurs. Afterwards the read lock is
dropped before write lock is taken on the same mmap_sem.

Holding mmap_sem during the whole huge page allocation is therefore useless;
the vma needs to be rechecked after taking the write lock anyway. Furthermore,
huge page allocation might involve a rather long sync compaction, and thus
block any mmap_sem writers and, e.g., affect workloads that perform frequent
m(un)map or mprotect operations.

This patch simply releases the read lock before allocating a huge page. It
also deletes an outdated comment that assumed vma must be stable, as it was
using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13ac
("mm: thp: khugepaged: add policy for finding target node").

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/huge_memory.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..7cfc325 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2319,23 +2319,17 @@ static struct page
int node)
{
VM_BUG_ON_PAGE(*hpage, *hpage);
+
/*
- * Allocate the page while the vma is still valid and under
- * the mmap_sem read mode so there is no memory allocation
- * later when we take the mmap_sem in write mode. This is more
- * friendly behavior (OTOH it may actually hide bugs) to
- * filesystems in userland with daemons allocating memory in
- * the userland I/O paths. Allocating memory with the
- * mmap_sem in read mode is good idea also to allow greater
- * scalability.
+ * Before allocating the hugepage, release the mmap_sem read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_sem during
+ * that. We will recheck the vma after taking it again in write mode.
*/
+ up_read(&mm->mmap_sem);
+
*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
- /*
- * After allocating the hugepage, release the mmap_sem read lock in
- * preparation for taking it in write mode.
- */
- up_read(&mm->mmap_sem);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
--
1.8.4.5

2014-07-28 13:12:16

by Vlastimil Babka

Subject: [PATCH v5 14/14] mm, compaction: try to capture the just-created high-order freepage

Compaction uses watermark checking to determine if it succeeded in creating
a high-order free page. My testing has shown that this is quite racy and it
can happen that watermark checking in compaction succeeds, and moments later
the watermark checking in page allocation fails, even though the number of
free pages has increased meanwhile.

It should be more reliable if direct compaction captured the high-order free
page as soon as it detected it, and passed it back to allocation. This would
also reduce the window for somebody else to allocate the free page.

Capture has been implemented before by 1fb3f8ca0e92 ("mm: compaction: capture
a suitable high-order page immediately when it is made available"), but later
reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
high-order page") due to a bug.

This patch differs from the previous attempt in two aspects:

1) The previous patch scanned free lists to capture the page. In this patch,
only the cc->order aligned block that the migration scanner just finished
is considered, but only if pages were actually isolated for migration in
that block. Tracking cc->order aligned blocks also has benefits for the
following patch that skips blocks where non-migratable pages were found.

2) The operations done in buffered_rmqueue() and get_page_from_freelist() are
closely followed so that page capture mimics normal page allocation as much
as possible. This includes operations such as prep_new_page() and
page->pfmemalloc setting (that was missing in the previous attempt), zone
statistics are updated etc. Due to subtleties with IRQ disabling and
enabling this cannot be simply factored out from the normal allocation
functions without affecting the fastpath.
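The cc->order aligned block tracking in aspect 1) comes down to simple PFN
arithmetic; a standalone sketch (ALIGN is reproduced as in the kernel, the
helper name is made up for illustration):

```c
#include <assert.h>

/* ALIGN() as in the kernel: round x up to a multiple of a (a power of 2). */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

/*
 * Given the PFN the migration scanner is at and the cc->order of the
 * allocation, compute the start of the current aligned candidate block
 * and the start of the next one, as isolate_migratepages_block() does.
 */
static void capture_candidates(unsigned long low_pfn, int order,
			       unsigned long *capture_pfn,
			       unsigned long *next_capture_pfn)
{
	*capture_pfn = low_pfn & ~((1UL << order) - 1);	  /* round down */
	*next_capture_pfn = ALIGN(low_pfn + 1, 1UL << order); /* round up */
}
```

When the scanner reaches next_capture_pfn having isolated pages from the block
starting at capture_pfn, that block becomes the capture candidate.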

This patch has tripled compaction success rates (as recorded in vmstat) in the
stress-highalloc mmtests benchmark, although allocation success rates increased
only by a few percent. Closer inspection shows that due to the racy watermark
checking and lack of lru_add_drain(), the allocations that resulted in direct
compactions were often failing, but later allocations succeeded in the fast
path. So the benefit of the patch to allocation success rates may be limited,
but it improves fairness in the sense that whoever spent the time
compacting has a higher chance of benefiting from it, and can also stop
compacting sooner, as page availability is detected immediately. With better
success detection, the contribution of compaction to high-order allocation
success rates is also no longer understated by the vmstats.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
include/linux/compaction.h | 8 ++-
mm/compaction.c | 134 +++++++++++++++++++++++++++++++++++++++++----
mm/internal.h | 4 +-
mm/page_alloc.c | 81 +++++++++++++++++++++++----
4 files changed, 201 insertions(+), 26 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 60bdf8d..b83c142 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -12,6 +12,8 @@
#define COMPACT_PARTIAL 3
/* The full zone was compacted */
#define COMPACT_COMPLETE 4
+/* Captured a high-order free page in direct compaction */
+#define COMPACT_CAPTURED 5

/* Used to signal whether compaction detected need_sched() or lock contention */
/* No contention detected */
@@ -33,7 +35,8 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
enum migrate_mode mode, int *contended,
- struct zone **candidate_zone);
+ struct zone **candidate_zone,
+ struct page **captured_page);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order);
@@ -103,7 +106,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, int *contended,
- struct zone **candidate_zone)
+ struct zone **candidate_zone,
+ struct page **captured_page)
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index dd3e4db..bfe56ee 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -548,6 +548,7 @@ static bool too_many_isolated(struct zone *zone)
* @low_pfn: The first PFN to isolate
* @end_pfn: The one-past-the-last PFN to isolate, within same pageblock
* @isolate_mode: Isolation mode to be used.
+ * @capture: True if page capturing is allowed
*
* Isolate all pages that can be migrated from the range specified by
* [low_pfn, end_pfn). The range is expected to be within same pageblock.
@@ -561,7 +562,8 @@ static bool too_many_isolated(struct zone *zone)
*/
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
- unsigned long end_pfn, isolate_mode_t isolate_mode)
+ unsigned long end_pfn, isolate_mode_t isolate_mode,
+ bool capture)
{
struct zone *zone = cc->zone;
unsigned long nr_scanned = 0, nr_isolated = 0;
@@ -570,6 +572,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long flags;
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
+ unsigned long capture_pfn = 0; /* current candidate for capturing */
+ unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+
+ if (cc->order > 0 && cc->order <= pageblock_order && capture) {
+ /* This may be outside the zone, but we check that later */
+ capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
+ next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+ }

/*
* Ensure that there are not too many pages isolated from the LRU
@@ -591,7 +601,27 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
return 0;

/* Time to isolate some pages for migration */
- for (; low_pfn < end_pfn; low_pfn++) {
+ for (; low_pfn <= end_pfn; low_pfn++) {
+ if (low_pfn == next_capture_pfn) {
+ /*
+ * We have a capture candidate if we isolated something
+ * during the last cc->order aligned block of pages.
+ */
+ if (nr_isolated &&
+ capture_pfn >= zone->zone_start_pfn) {
+ *cc->capture_page = pfn_to_page(capture_pfn);
+ break;
+ }
+
+ /* Prepare for a new capture candidate */
+ capture_pfn = next_capture_pfn;
+ next_capture_pfn += (1UL << cc->order);
+ }
+
+ /* We check that here, in case low_pfn == next_capture_pfn */
+ if (low_pfn == end_pfn)
+ break;
+
/*
* Periodically drop the lock (if held) regardless of its
* contention, to give chance to IRQs. Abort async compaction
@@ -625,8 +655,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* a valid page order. Consider only values in the
* valid order range to prevent low_pfn overflow.
*/
- if (freepage_order > 0 && freepage_order < MAX_ORDER)
+ if (freepage_order > 0 && freepage_order < MAX_ORDER) {
low_pfn += (1UL << freepage_order) - 1;
+ if (next_capture_pfn)
+ next_capture_pfn = ALIGN(low_pfn + 1,
+ (1UL << cc->order));
+ }
continue;
}

@@ -662,6 +696,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
else
low_pfn += (1 << compound_order(page)) - 1;

+ if (next_capture_pfn)
+ next_capture_pfn =
+ ALIGN(low_pfn + 1, (1UL << cc->order));
continue;
}

@@ -686,6 +723,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
continue;
if (PageTransHuge(page)) {
low_pfn += (1 << compound_order(page)) - 1;
+ if (next_capture_pfn)
+ next_capture_pfn = ALIGN(low_pfn + 1,
+ (1UL << cc->order));
continue;
}
}
@@ -770,7 +810,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
continue;

pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
- ISOLATE_UNEVICTABLE);
+ ISOLATE_UNEVICTABLE, false);

/*
* In case of fatal failure, release everything that might
@@ -958,7 +998,7 @@ typedef enum {
* compact_control.
*/
static isolate_migrate_t isolate_migratepages(struct zone *zone,
- struct compact_control *cc)
+ struct compact_control *cc, const int migratetype)
{
unsigned long low_pfn, end_pfn;
struct page *page;
@@ -981,6 +1021,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
for (; end_pfn <= cc->free_pfn;
low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {

+ int pageblock_mt;
+ bool capture;
+
/*
* This can potentially iterate a massively long zone with
* many pageblocks unsuitable, so periodically check if we
@@ -1003,13 +1046,22 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
* Async compaction is optimistic to see if the minimum amount
* of work satisfies the allocation.
*/
+ pageblock_mt = get_pageblock_migratetype(page);
if (cc->mode == MIGRATE_ASYNC &&
- !migrate_async_suitable(get_pageblock_migratetype(page)))
+ !migrate_async_suitable(pageblock_mt))
continue;

+ /*
+ * Capture page only if the caller requested it, and either the
+ * pageblock has our desired migratetype, or we would take it
+ * completely.
+ */
+ capture = cc->capture_page && ((pageblock_mt == migratetype)
+ || (cc->order == pageblock_order));
+
/* Perform the isolation */
low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
- isolate_mode);
+ isolate_mode, capture);

if (!low_pfn || cc->contended)
return ISOLATE_ABORT;
@@ -1028,6 +1080,48 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

+/*
+ * When called, cc->capture_page must be non-NULL. Then *cc->capture_page is
+ * just a candidate, or NULL (no candidate). This function will either
+ * successfully capture the page, or reset *cc->capture_page to NULL.
+ */
+static bool compact_capture_page(struct compact_control *cc)
+{
+ struct page *page = *cc->capture_page;
+ int cpu;
+
+ if (!page)
+ return false;
+
+ /* Unsafe check if it's worth to try acquiring the zone->lock at all */
+ if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+ goto try_capture;
+
+ /*
+ * There's a good chance that we have just put free pages on this CPU's
+ * lru cache and pcplists after the page migrations. Drain them to
+ * allow merging.
+ */
+ cpu = get_cpu();
+ lru_add_drain_cpu(cpu);
+ drain_local_pages(NULL);
+ put_cpu();
+
+ /* Did the draining help? */
+ if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+ goto try_capture;
+
+ goto fail;
+
+try_capture:
+ if (capture_free_page(page, cc->order))
+ return true;
+
+fail:
+ *cc->capture_page = NULL;
+ return false;
+}
+
static int compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
@@ -1056,6 +1150,10 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
return COMPACT_COMPLETE;
}

+ /* Did we just finish a pageblock that was capture candidate? */
+ if (cc->capture_page && compact_capture_page(cc))
+ return COMPACT_CAPTURED;
+
/*
* order == -1 is expected when compacting via
* /proc/sys/vm/compact_memory
@@ -1188,7 +1286,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
COMPACT_CONTINUE) {
int err;

- switch (isolate_migratepages(zone, cc)) {
+ switch (isolate_migratepages(zone, cc, migratetype)) {
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
putback_movable_pages(&cc->migratepages);
@@ -1227,13 +1325,18 @@ out:
cc->nr_freepages -= release_freepages(&cc->freepages);
VM_BUG_ON(cc->nr_freepages != 0);

+ /* Remove any candidate page if it was not captured */
+ if (cc->capture_page && ret != COMPACT_CAPTURED)
+ *cc->capture_page = NULL;
+
trace_mm_compaction_end(ret);

return ret;
}

static unsigned long compact_zone_order(struct zone *zone, int order,
- gfp_t gfp_mask, enum migrate_mode mode, int *contended)
+ gfp_t gfp_mask, enum migrate_mode mode, int *contended,
+ struct page **captured_page)
{
unsigned long ret;
struct compact_control cc = {
@@ -1243,6 +1346,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
.gfp_mask = gfp_mask,
.zone = zone,
.mode = mode,
+ .capture_page = captured_page,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -1268,13 +1372,15 @@ int sysctl_extfrag_threshold = 500;
* @contended: Return value that determines if compaction was aborted due to
* need_resched() or lock contention
* @candidate_zone: Return the zone where we think allocation should succeed
+ * @captured_page: If successful, return the page captured during compaction
*
* This is the main entry point for direct page compaction.
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
enum migrate_mode mode, int *contended,
- struct zone **candidate_zone)
+ struct zone **candidate_zone,
+ struct page **captured_page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1305,7 +1411,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
continue;

status = compact_zone_order(zone, order, gfp_mask, mode,
- &zone_contended);
+ &zone_contended, captured_page);
rc = max(status, rc);
/*
* It takes at least one zone that wasn't lock contended
@@ -1314,6 +1420,12 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
all_zones_lock_contended &=
(zone_contended == COMPACT_CONTENDED_LOCK);

+ /* If we captured a page, stop compacting */
+ if (*captured_page) {
+ *candidate_zone = zone;
+ break;
+ }
+
/* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
alloc_flags)) {
diff --git a/mm/internal.h b/mm/internal.h
index 8293040..f2d625f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -110,6 +110,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
*/
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
+extern bool capture_free_page(struct page *page, unsigned int order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
#endif
@@ -148,6 +149,7 @@ struct compact_control {
* contention detected during
* compaction
*/
+ struct page **capture_page; /* Free page captured by compaction */
};

unsigned long
@@ -155,7 +157,7 @@ isolate_freepages_range(struct compact_control *cc,
unsigned long start_pfn, unsigned long end_pfn);
unsigned long
isolate_migratepages_range(struct compact_control *cc,
- unsigned long low_pfn, unsigned long end_pfn);
+ unsigned long low_pfn, unsigned long end_pfn);

#endif

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbdb10f..af9ed36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1489,9 +1489,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
{
unsigned long watermark;
struct zone *zone;
+ struct free_area *area;
int mt;
+ unsigned int freepage_order = page_order(page);

- BUG_ON(!PageBuddy(page));
+ VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);

zone = page_zone(page);
mt = get_pageblock_migratetype(page);
@@ -1506,9 +1508,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
}

/* Remove page from free list */
+ area = &zone->free_area[freepage_order];
list_del(&page->lru);
- zone->free_area[order].nr_free--;
+ area->nr_free--;
rmv_page_order(page);
+ if (freepage_order != order)
+ expand(zone, page, order, freepage_order, area, mt);

/* Set the pageblock if the isolated page is at least a pageblock */
if (order >= pageblock_order - 1) {
@@ -1551,6 +1556,29 @@ int split_free_page(struct page *page)
return nr_pages;
}

+bool capture_free_page(struct page *page, unsigned int order)
+{
+ struct zone *zone = page_zone(page);
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ if (!PageBuddy(page) || page_order(page) < order
+ || !__isolate_free_page(page, order)) {
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return false;
+ }
+
+ spin_unlock(&zone->lock);
+
+ /* Mimic what buffered_rmqueue() does */
+ __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+ __count_zone_vm_events(PGALLOC, zone, 1 << order);
+ local_irq_restore(flags);
+
+ return true;
+}
+
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
@@ -2300,7 +2328,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned long *did_some_progress)
{
struct zone *last_compact_zone = NULL;
- struct page *page;
+ struct page *page = NULL;

if (!order)
return NULL;
@@ -2309,7 +2337,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
nodemask, mode,
contended_compaction,
- &last_compact_zone);
+ &last_compact_zone, &page);
current->flags &= ~PF_MEMALLOC;

switch (*did_some_progress) {
@@ -2328,14 +2356,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
*/
count_vm_event(COMPACTSTALL);

- /* Page migration frees to the PCP lists but we want merging */
- drain_pages(get_cpu());
- put_cpu();
+ /* Did we capture a page? */
+ if (page) {
+ struct zone *zone;
+ unsigned long flags;
+ /*
+ * Mimic what buffered_rmqueue() does and capture_new_page()
+ * has not yet done.
+ */
+ zone = page_zone(page);
+
+ local_irq_save(flags);
+ zone_statistics(preferred_zone, zone, gfp_mask);
+ local_irq_restore(flags);

- page = get_page_from_freelist(gfp_mask, nodemask,
- order, zonelist, high_zoneidx,
- alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, classzone_idx, migratetype);
+ VM_BUG_ON_PAGE(bad_range(zone, page), page);
+ if (!prep_new_page(page, order, gfp_mask))
+ /* This is normally done in get_page_from_freelist() */
+ page->pfmemalloc = !!(alloc_flags &
+ ALLOC_NO_WATERMARKS);
+ else
+ page = NULL;
+ }
+
+ /* No capture but let's try allocating anyway */
+ if (!page) {
+ /*
+ * Page migration frees to the PCP lists but we want
+ * merging
+ */
+ drain_pages(get_cpu());
+ put_cpu();
+
+ page = get_page_from_freelist(gfp_mask, nodemask, order,
+ zonelist, high_zoneidx,
+ alloc_flags & ~ALLOC_NO_WATERMARKS,
+ preferred_zone, classzone_idx, migratetype);
+ }

if (page) {
struct zone *zone = page_zone(page);
@@ -6313,7 +6370,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,

if (list_empty(&cc->migratepages)) {
cc->nr_migratepages = 0;
- pfn = isolate_migratepages_range(cc, pfn, end);
+ pfn = isolate_migratepages_range(cc, pfn, end, false);
if (!pfn) {
ret = -EINTR;
break;
--
1.8.4.5

2014-07-28 13:12:36

by Vlastimil Babka

Subject: [PATCH v5 13/14] mm, compaction: pass gfp mask to compact_control

From: David Rientjes <[email protected]>

struct compact_control currently converts the gfp mask to a migratetype, but we
need the entire gfp mask in a follow-up patch.

Pass the entire gfp mask as part of struct compact_control.

Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
---
mm/compaction.c | 12 +++++++-----
mm/internal.h | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e8645fb..dd3e4db 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1028,8 +1028,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

-static int compact_finished(struct zone *zone,
- struct compact_control *cc)
+static int compact_finished(struct zone *zone, struct compact_control *cc,
+ const int migratetype)
{
unsigned int order;
unsigned long watermark;
@@ -1075,7 +1075,7 @@ static int compact_finished(struct zone *zone,
struct free_area *area = &zone->free_area[order];

/* Job done if page is free of the right migratetype */
- if (!list_empty(&area->free_list[cc->migratetype]))
+ if (!list_empty(&area->free_list[migratetype]))
return COMPACT_PARTIAL;

/* Job done if allocation would set block type */
@@ -1141,6 +1141,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
int ret;
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
+ const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
const bool sync = cc->mode != MIGRATE_ASYNC;

ret = compaction_suitable(zone, cc->order);
@@ -1183,7 +1184,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

migrate_prep_local();

- while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
+ while ((ret = compact_finished(zone, cc, migratetype)) ==
+ COMPACT_CONTINUE) {
int err;

switch (isolate_migratepages(zone, cc)) {
@@ -1238,7 +1240,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
.nr_freepages = 0,
.nr_migratepages = 0,
.order = order,
- .migratetype = gfpflags_to_migratetype(gfp_mask),
+ .gfp_mask = gfp_mask,
.zone = zone,
.mode = mode,
};
diff --git a/mm/internal.h b/mm/internal.h
index 86ae964..8293040 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -142,7 +142,7 @@ struct compact_control {
bool finished_update_migrate;

int order; /* order a direct compactor needs */
- int migratetype; /* MOVABLE, RECLAIMABLE etc */
+ const gfp_t gfp_mask; /* gfp mask of a direct compactor */
struct zone *zone;
int contended; /* Signal need_sched() or lock
* contention detected during
--
1.8.4.5

2014-07-28 13:12:41

by Vlastimil Babka

Subject: [PATCH v5 12/14] mm: rename allocflags_to_migratetype for clarity

From: David Rientjes <[email protected]>

The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
ALLOC_CPUSET) that have separate semantics.

The function allocflags_to_migratetype() actually takes gfp flags, not alloc
flags, and returns a migratetype. Rename it to gfpflags_to_migratetype().

Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
---
include/linux/gfp.h | 2 +-
mm/compaction.c | 4 ++--
mm/page_alloc.c | 6 +++---
3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5e7219d..41b30fd 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -156,7 +156,7 @@ struct vm_area_struct;
#define GFP_DMA32 __GFP_DMA32

/* Convert GFP flags to their corresponding migrate type */
-static inline int allocflags_to_migratetype(gfp_t gfp_flags)
+static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);

diff --git a/mm/compaction.c b/mm/compaction.c
index 320f339..e8645fb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1238,7 +1238,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
.nr_freepages = 0,
.nr_migratepages = 0,
.order = order,
- .migratetype = allocflags_to_migratetype(gfp_mask),
+ .migratetype = gfpflags_to_migratetype(gfp_mask),
.zone = zone,
.mode = mode,
};
@@ -1290,7 +1290,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
return rc;

#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
/* Compact each zone in the list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3c633b..bbdb10f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2523,7 +2523,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_NO_WATERMARKS;
}
#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
return alloc_flags;
@@ -2787,7 +2787,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zone *preferred_zone;
struct zoneref *preferred_zoneref;
struct page *page = NULL;
- int migratetype = allocflags_to_migratetype(gfp_mask);
+ int migratetype = gfpflags_to_migratetype(gfp_mask);
unsigned int cpuset_mems_cookie;
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
int classzone_idx;
@@ -2821,7 +2821,7 @@ retry_cpuset:
classzone_idx = zonelist_zone_idx(preferred_zoneref);

#ifdef CONFIG_CMA
- if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
/* First allocation attempt */
--
1.8.4.5

2014-07-28 13:12:43

by Vlastimil Babka

Subject: [PATCH v5 09/14] mm, compaction: skip rechecks when lock was already held

Compaction scanners try to take zone locks as late as possible by checking
many page or pageblock properties opportunistically without lock and skipping
them if not suitable. For pages that pass the initial checks, some properties
have to be checked again safely under lock. However, if the lock was already
held from a previous iteration in the initial checks, the rechecks are
unnecessary.

This patch therefore skips the rechecks when the lock was already held. This is
now possible to do, since we don't (potentially) drop and reacquire the lock
between the initial checks and the safe rechecks anymore.
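The lock-then-recheck pattern can be illustrated with a minimal userspace
sketch (a plain flag stands in for zone->lock, an array for the "is this a
buddy page" property; all names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for zone->lock; a trylock that never blocks. */
static int zone_lock_taken;
static bool zone_trylock(void) { if (zone_lock_taken) return false; zone_lock_taken = 1; return true; }
static void zone_unlock(void) { zone_lock_taken = 0; }

/* Stand-in for the per-page property checked by the scanner. */
static bool page_free[8] = { true, false, true, true, false, true, false, true };

/* Returns the number of pages "isolated" from [start, end). */
static size_t scan(size_t start, size_t end)
{
    bool locked = false;
    size_t isolated = 0;

    for (size_t pfn = start; pfn < end; pfn++) {
        if (!page_free[pfn])            /* opportunistic, lock-free check */
            continue;

        if (!locked) {
            if (!zone_trylock())
                break;                  /* lock contended: abort the scan */
            locked = true;
            /* The property may have changed between the lock-free check
             * and taking the lock, so recheck it once under the lock. */
            if (!page_free[pfn])
                continue;
        }
        /* If we reach here with the lock held from a previous iteration,
         * nothing could have changed the property meanwhile, so the
         * recheck above is safely skipped - the point of this patch. */
        isolated++;
    }
    if (locked)
        zone_unlock();
    return isolated;
}
```

The sketch isolates the five free pages of the array in a single pass, taking
the lock once and rechecking only on the iteration that acquired it.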

Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
mm/compaction.c | 53 +++++++++++++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 1756ed8..a9965ab 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -367,22 +367,30 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
goto isolate_fail;

/*
- * The zone lock must be held to isolate freepages.
- * Unfortunately this is a very coarse lock and can be
- * heavily contended if there are parallel allocations
- * or parallel compactions. For async compaction do not
- * spin on the lock and we acquire the lock as late as
- * possible.
+ * If we already hold the lock, we can skip some rechecking.
+ * Note that if we hold the lock now, checked_pageblock was
+ * already set in some previous iteration (or strict is true),
+ * so it is correct to skip the suitable migration target
+ * recheck as well.
*/
- if (!locked)
+ if (!locked) {
+ /*
+ * The zone lock must be held to isolate freepages.
+ * Unfortunately this is a very coarse lock and can be
+ * heavily contended if there are parallel allocations
+ * or parallel compactions. For async compaction do not
+ * spin on the lock and we acquire the lock as late as
+ * possible.
+ */
locked = compact_trylock_irqsave(&cc->zone->lock,
&flags, cc);
- if (!locked)
- break;
+ if (!locked)
+ break;

- /* Recheck this is a buddy page under lock */
- if (!PageBuddy(page))
- goto isolate_fail;
+ /* Recheck this is a buddy page under lock */
+ if (!PageBuddy(page))
+ goto isolate_fail;
+ }

/* Found a free page, break it into order-0 pages */
isolated = split_free_page(page);
@@ -644,19 +652,20 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
page_count(page) > page_mapcount(page))
continue;

- /* If the lock is not held, try to take it */
- if (!locked)
+ /* If we already hold the lock, we can skip some rechecking */
+ if (!locked) {
locked = compact_trylock_irqsave(&zone->lru_lock,
&flags, cc);
- if (!locked)
- break;
+ if (!locked)
+ break;

- /* Recheck PageLRU and PageTransHuge under lock */
- if (!PageLRU(page))
- continue;
- if (PageTransHuge(page)) {
- low_pfn += (1 << compound_order(page)) - 1;
- continue;
+ /* Recheck PageLRU and PageTransHuge under lock */
+ if (!PageLRU(page))
+ continue;
+ if (PageTransHuge(page)) {
+ low_pfn += (1 << compound_order(page)) - 1;
+ continue;
+ }
}

lruvec = mem_cgroup_page_lruvec(page, zone);
--
1.8.4.5

2014-07-28 13:12:49

by Vlastimil Babka

Subject: [PATCH v5 10/14] mm, compaction: remember position within pageblock in free pages scanner

Unlike the migration scanner, the free scanner remembers the beginning of the
last scanned pageblock in cc->free_pfn. It might therefore rescan pages
uselessly when called several times during a single compaction. This might have
been useful when pages were returned to the buddy allocator after a failed
migration, but this is no longer the case.

This patch changes the meaning of cc->free_pfn so that if it points to the
middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
end. isolate_freepages_block() will record the pfn of the last page it looked
at, which is then used to update cc->free_pfn.
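The restart position update can be expressed as a small pure function; a
sketch under the assumption of a 512-page pageblock (the real value is
arch-dependent), with names matching the patch but otherwise simplified:

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES 512UL  /* illustrative; arch-dependent in reality */

/* isolate_freepages_block() advances *start_pfn to the last pfn it looked
 * at.  If it reached block_end_pfn, the whole pageblock was scanned and the
 * free scanner (which moves backwards) should restart one pageblock lower;
 * otherwise it resumes exactly where it left off within the block. */
static unsigned long next_free_pfn(unsigned long isolate_start_pfn,
                                   unsigned long block_start_pfn,
                                   unsigned long block_end_pfn)
{
    if (isolate_start_pfn < block_end_pfn)
        return isolate_start_pfn;                 /* resume mid-pageblock */
    return block_start_pfn - PAGEBLOCK_NR_PAGES;  /* previous pageblock */
}
```

For a pageblock spanning pfns [1024, 1536), stopping at 1100 resumes at 1100,
while scanning the full block restarts at 512.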

In the mmtests stress-highalloc benchmark, this has resulted in lowering the
ratio between pages scanned by both scanners, from 2.5 free pages per migrate
page, to 2.25 free pages per migrate page, without affecting success rates.

With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio (2.1
instead of 1.8), but page migration successes increased by 10%, so this could
mean that more useful work can be done until need_resched() aborts this kind
of compaction.

Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Zhang Yanfei <[email protected]>
---
mm/compaction.c | 42 ++++++++++++++++++++++++++++++------------
1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a9965ab..5892d8b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -330,7 +330,7 @@ static bool suitable_migration_target(struct page *page)
* (even though it may still end up isolating some pages).
*/
static unsigned long isolate_freepages_block(struct compact_control *cc,
- unsigned long blockpfn,
+ unsigned long *start_pfn,
unsigned long end_pfn,
struct list_head *freelist,
bool strict)
@@ -339,6 +339,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
struct page *cursor, *valid_page = NULL;
unsigned long flags;
bool locked = false;
+ unsigned long blockpfn = *start_pfn;

cursor = pfn_to_page(blockpfn);

@@ -412,6 +413,9 @@ isolate_fail:
break;
}

+ /* Record how far we have got within the block */
+ *start_pfn = blockpfn;
+
trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);

/*
@@ -460,14 +464,13 @@ isolate_freepages_range(struct compact_control *cc,

for (; pfn < end_pfn; pfn += isolated,
block_end_pfn += pageblock_nr_pages) {
+ /* Protect pfn from changing by isolate_freepages_block */
+ unsigned long isolate_start_pfn = pfn;

block_end_pfn = min(block_end_pfn, end_pfn);

- if (!pageblock_within_zone(pfn, block_end_pfn, cc->zone))
- break;
-
- isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
- &freelist, true);
+ isolated = isolate_freepages_block(cc, &isolate_start_pfn,
+ block_end_pfn, &freelist, true);

/*
* In strict mode, isolate_freepages_block() returns 0 if
@@ -769,6 +772,7 @@ static void isolate_freepages(struct compact_control *cc)
struct zone *zone = cc->zone;
struct page *page;
unsigned long block_start_pfn; /* start of current pageblock */
+ unsigned long isolate_start_pfn; /* exact pfn we start at */
unsigned long block_end_pfn; /* end of current pageblock */
unsigned long low_pfn; /* lowest pfn scanner is able to scan */
int nr_freepages = cc->nr_freepages;
@@ -777,14 +781,15 @@ static void isolate_freepages(struct compact_control *cc)
/*
* Initialise the free scanner. The starting point is where we last
* successfully isolated from, zone-cached value, or the end of the
- * zone when isolating for the first time. We need this aligned to
- * the pageblock boundary, because we do
+ * zone when isolating for the first time. For looping we also need
+ * this pfn aligned down to the pageblock boundary, because we do
* block_start_pfn -= pageblock_nr_pages in the for loop.
* For ending point, take care when isolating in last pageblock of a
* a zone which ends in the middle of a pageblock.
* The low boundary is the end of the pageblock the migration scanner
* is using.
*/
+ isolate_start_pfn = cc->free_pfn;
block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
zone_end_pfn(zone));
@@ -797,7 +802,8 @@ static void isolate_freepages(struct compact_control *cc)
*/
for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
block_end_pfn = block_start_pfn,
- block_start_pfn -= pageblock_nr_pages) {
+ block_start_pfn -= pageblock_nr_pages,
+ isolate_start_pfn = block_start_pfn) {
unsigned long isolated;

/*
@@ -822,13 +828,25 @@ static void isolate_freepages(struct compact_control *cc)
if (!isolation_suitable(cc, page))
continue;

- /* Found a block suitable for isolating free pages from */
- cc->free_pfn = block_start_pfn;
- isolated = isolate_freepages_block(cc, block_start_pfn,
+ /* Found a block suitable for isolating free pages from. */
+ isolated = isolate_freepages_block(cc, &isolate_start_pfn,
block_end_pfn, freelist, false);
nr_freepages += isolated;

/*
+ * Remember where the free scanner should restart next time,
+ * which is where isolate_freepages_block() left off.
+ * But if it scanned the whole pageblock, isolate_start_pfn
+ * now points at block_end_pfn, which is the start of the next
+ * pageblock.
+ * In that case we will however want to restart at the start
+ * of the previous pageblock.
+ */
+ cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
+ isolate_start_pfn :
+ block_start_pfn - pageblock_nr_pages;
+
+ /*
* Set a flag that we successfully isolated in this pageblock.
* In the next loop iteration, zone->compact_cached_free_pfn
* will not be updated and thus it will effectively contain the
--
1.8.4.5

2014-07-28 13:12:47

by Vlastimil Babka

Subject: [PATCH v5 02/14] mm, compaction: defer each zone individually instead of preferred zone

When direct sync compaction is often unsuccessful, it may become deferred for
some time to avoid further useless attempts, both sync and async. Successful
high-order allocations un-defer compaction, while further unsuccessful
compaction attempts prolong the compaction deferred period.

Currently the checking and setting deferred status is performed only on the
preferred zone of the allocation that invoked direct compaction. But compaction
itself is attempted on all eligible zones in the zonelist, so the behavior is
suboptimal and may lead both to scenarios where 1) compaction is attempted
uselessly, or 2) where it's not attempted despite good chances of succeeding,
as shown on the examples below:

1) A direct compaction with Normal preferred zone failed and set deferred
compaction for the Normal zone. Another unrelated direct compaction with
DMA32 as preferred zone will attempt to compact DMA32 zone even though
the first compaction attempt also included DMA32 zone.

In another scenario, compaction with Normal preferred zone failed to compact
Normal zone, but succeeded in the DMA32 zone, so it will not defer
compaction. In the next attempt, it will try Normal zone which will fail
again, instead of skipping Normal zone and trying DMA32 directly.

2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
looking good. A direct compaction with preferred Normal zone will skip
compaction of all zones including DMA32 because Normal was still deferred.
The allocation might have succeeded in DMA32, but won't.

This patch makes compaction deferring work on an individual zone basis instead
of just the preferred zone. For each zone, it checks compaction_deferred() to decide if the
zone should be skipped. If watermarks fail after compacting the zone,
defer_compaction() is called. The zone where watermarks passed can still be
deferred when the allocation attempt is unsuccessful. When allocation is
successful, compaction_defer_reset() is called for the zone containing the
allocated page. This approach should approximate calling defer_compaction()
only on zones where compaction was attempted and did not yield allocated page.
There might be corner cases but that is inevitable as long as the decision
to stop compacting does not guarantee that a page will be allocated.
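The per-zone deferral state machine that this patch consults and updates can
be modelled in a few lines; a simplified userspace approximation of the
kernel's defer_compaction()/compaction_deferred() helpers (the max shift of 6
matches COMPACT_MAX_DEFER_SHIFT, the rest of the bookkeeping is elided):

```c
#include <assert.h>
#include <stdbool.h>

/* Each zone tracks consecutive failures; the deferral window doubles with
 * every failure, up to a cap.  Simplified: the real helpers also track the
 * failed order. */
struct zone {
    unsigned int defer_shift;      /* grows with repeated failures */
    unsigned long considered;      /* attempts skipped since last failure */
};

static void defer_compaction(struct zone *z)
{
    z->considered = 0;
    if (z->defer_shift < 6)        /* COMPACT_MAX_DEFER_SHIFT */
        z->defer_shift++;
}

static bool compaction_deferred(struct zone *z)
{
    unsigned long limit = 1UL << z->defer_shift;

    if (++z->considered > limit)   /* avoid counter overflow */
        z->considered = limit;
    return z->considered < limit;  /* still inside the deferral window */
}

static void compaction_defer_reset(struct zone *z)
{
    z->defer_shift = 0;
    z->considered = 0;
}
```

A fresh zone is not deferred; after one failure, exactly one subsequent
attempt is skipped before compaction is tried again, and a successful
allocation resets the state entirely.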

During testing on a two-node machine with a single very small Normal zone on
node 1, this patch has improved success rates in stress-highalloc mmtests
benchmark. The success rates here were previously made worse by commit 3a025760fc15
("mm: page_alloc: spill to remote nodes before waking kswapd"), as kswapd was
no longer resetting the deferred compaction for the Normal zone often enough,
and DMA32 zones on both nodes were thus not considered for compaction.
On a different machine, success rates were improved with __GFP_NO_KSWAPD
allocations.

Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
include/linux/compaction.h | 16 ++++++++++------
mm/compaction.c | 30 ++++++++++++++++++++++++------
mm/page_alloc.c | 39 +++++++++++++++++++++++----------------
3 files changed, 57 insertions(+), 28 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 01e3132..b2e4c92 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,14 +2,16 @@
#define _LINUX_COMPACTION_H

/* Return values for compact_zone() and try_to_compact_pages() */
+/* compaction didn't start as it was deferred due to past failures */
+#define COMPACT_DEFERRED 0
/* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED 0
+#define COMPACT_SKIPPED 1
/* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE 1
+#define COMPACT_CONTINUE 2
/* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL 2
+#define COMPACT_PARTIAL 3
/* The full zone was compacted */
-#define COMPACT_COMPLETE 3
+#define COMPACT_COMPLETE 4

#ifdef CONFIG_COMPACTION
extern int sysctl_compact_memory;
@@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- enum migrate_mode mode, bool *contended);
+ enum migrate_mode mode, bool *contended,
+ struct zone **candidate_zone);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order);
@@ -91,7 +94,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- enum migrate_mode mode, bool *contended)
+ enum migrate_mode mode, bool *contended,
+ struct zone **candidate_zone)
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 5175019..f3ae2ec 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1122,28 +1122,27 @@ int sysctl_extfrag_threshold = 500;
* @nodemask: The allowed nodes to allocate from
* @mode: The migration mode for async, sync light, or sync migration
* @contended: Return value that is true if compaction was aborted due to lock contention
- * @page: Optionally capture a free page of the requested order during compaction
+ * @candidate_zone: Return the zone where we think allocation should succeed
*
* This is the main entry point for direct page compaction.
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- enum migrate_mode mode, bool *contended)
+ enum migrate_mode mode, bool *contended,
+ struct zone **candidate_zone)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
- int rc = COMPACT_SKIPPED;
+ int rc = COMPACT_DEFERRED;
int alloc_flags = 0;

/* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
return rc;

- count_compact_event(COMPACTSTALL);
-
#ifdef CONFIG_CMA
if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
@@ -1153,14 +1152,33 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

+ if (compaction_deferred(zone, order))
+ continue;
+
status = compact_zone_order(zone, order, gfp_mask, mode,
contended);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
- alloc_flags))
+ alloc_flags)) {
+ *candidate_zone = zone;
+ /*
+ * We think the allocation will succeed in this zone,
+ * but it is not certain, hence the false. The caller
+ * will repeat this with true if allocation indeed
+ * succeeds in this zone.
+ */
+ compaction_defer_reset(zone, order, false);
break;
+ } else if (mode != MIGRATE_ASYNC) {
+ /*
+ * We think that allocation won't succeed in this zone
+ * so we defer compaction there. If it ends up
+ * succeeding after all, it will be reset.
+ */
+ defer_compaction(zone, order);
+ }
}

return rc;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b99643d4..a14efeb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2299,21 +2299,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
- if (!order)
- return NULL;
+ struct zone *last_compact_zone = NULL;

- if (compaction_deferred(preferred_zone, order)) {
- *deferred_compaction = true;
+ if (!order)
return NULL;
- }

current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
nodemask, mode,
- contended_compaction);
+ contended_compaction,
+ &last_compact_zone);
current->flags &= ~PF_MEMALLOC;

- if (*did_some_progress != COMPACT_SKIPPED) {
+ if (*did_some_progress > COMPACT_DEFERRED)
+ count_vm_event(COMPACTSTALL);
+ else
+ *deferred_compaction = true;
+
+ if (*did_some_progress > COMPACT_SKIPPED) {
struct page *page;

/* Page migration frees to the PCP lists but we want merging */
@@ -2324,27 +2327,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
order, zonelist, high_zoneidx,
alloc_flags & ~ALLOC_NO_WATERMARKS,
preferred_zone, classzone_idx, migratetype);
+
if (page) {
- preferred_zone->compact_blockskip_flush = false;
- compaction_defer_reset(preferred_zone, order, true);
+ struct zone *zone = page_zone(page);
+
+ zone->compact_blockskip_flush = false;
+ compaction_defer_reset(zone, order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}

/*
+ * last_compact_zone is where try_to_compact_pages thought
+ * allocation should succeed, so it did not defer compaction.
+ * But now we know that it didn't succeed, so we do the defer.
+ */
+ if (last_compact_zone && mode != MIGRATE_ASYNC)
+ defer_compaction(last_compact_zone, order);
+
+ /*
* It's bad if compaction run occurs and fails.
* The most likely reason is that pages exist,
* but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);

- /*
- * As async compaction considers a subset of pageblocks, only
- * defer if the failure was a sync compaction failure.
- */
- if (mode != MIGRATE_ASYNC)
- defer_compaction(preferred_zone, order);
-
cond_resched();
}

--
1.8.4.5

2014-07-28 13:12:39

by Vlastimil Babka

Subject: [PATCH v5 11/14] mm, compaction: skip buddy pages by their order in the migrate scanner

The migration scanner skips PageBuddy pages, but does not consider their order
as checking page_order() is generally unsafe without holding the zone->lock,
and acquiring the lock just for the check wouldn't be a good tradeoff.

Still, this could avoid some iterations over the rest of the buddy page, and
if we are careful, the race window between the PageBuddy() check and page_order()
is small, and the worst thing that can happen is that we skip too much and miss
some isolation candidates. This is not that bad, as compaction can already fail
for many other reasons, such as parallel allocations, and those have a much
larger race window.

This patch therefore makes the migration scanner obtain the buddy page order
and use it to skip the whole buddy page, if the order appears to be in the
valid range.
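The skip-and-clamp arithmetic is easy to get subtly wrong, so here is a
self-contained sketch of it (MAX_ORDER of 11 is a typical kernel value, used
here only for illustration; the -1 accounts for the scanner loop's own +1):

```c
#include <assert.h>

#define MAX_ORDER 11  /* typical kernel value, illustrative here */

/* Given a possibly-stale order read from a PageBuddy page, advance low_pfn
 * past the whole buddy, but only when the order is in the valid range (a
 * racing allocation may have left garbage there), and clamp to end_pfn since
 * the skip may overshoot the range being scanned. */
static unsigned long skip_buddy(unsigned long low_pfn, unsigned long end_pfn,
                                unsigned long freepage_order)
{
    if (freepage_order > 0 && freepage_order < MAX_ORDER)
        low_pfn += (1UL << freepage_order) - 1;
    if (low_pfn > end_pfn)      /* the PageBuddy skip may overshoot */
        low_pfn = end_pfn;
    return low_pfn;
}
```

An order-3 buddy skips 7 extra pfns; a garbage order (0, or >= MAX_ORDER) is
ignored rather than risking a low_pfn overflow; a skip near the range end is
clamped.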

It's important that page_order() is read only once, so that the value used
in the checks and in the pfn calculation is the same. But in theory the
compiler can replace the local variable with multiple inlined calls to
page_order(). Therefore, the patch introduces page_order_unsafe(), which uses
ACCESS_ONCE to prevent this.

Testing with stress-highalloc from mmtests shows a 15% reduction in number of
pages scanned by migration scanner. The reduction is >60% with __GFP_NO_KSWAPD
allocations, along with success rates better by few percent.
This change is also a prerequisite for a later patch which is detecting when
a cc->order block of pages contains non-buddy pages that cannot be isolated,
and the scanner should thus skip to the next block immediately.

Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 36 +++++++++++++++++++++++++++++++-----
mm/internal.h | 16 +++++++++++++++-
2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 5892d8b..320f339 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -313,8 +313,15 @@ static inline bool compact_should_abort(struct compact_control *cc)
static bool suitable_migration_target(struct page *page)
{
/* If the page is a large free page, then disallow migration */
- if (PageBuddy(page) && page_order(page) >= pageblock_order)
- return false;
+ if (PageBuddy(page)) {
+ /*
+ * We are checking page_order without zone->lock taken. But
+ * the only small danger is that we skip a potentially suitable
+ * pageblock, so it's not worth to check order for valid range.
+ */
+ if (page_order_unsafe(page) >= pageblock_order)
+ return false;
+ }

/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
if (migrate_async_suitable(get_pageblock_migratetype(page)))
@@ -605,11 +612,23 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
valid_page = page;

/*
- * Skip if free. page_order cannot be used without zone->lock
- * as nothing prevents parallel allocations or buddy merging.
+ * Skip if free. We read page order here without zone lock
+ * which is generally unsafe, but the race window is small and
+ * the worst thing that can happen is that we skip some
+ * potential isolation targets.
*/
- if (PageBuddy(page))
+ if (PageBuddy(page)) {
+ unsigned long freepage_order = page_order_unsafe(page);
+
+ /*
+ * Without lock, we cannot be sure that what we got is
+ * a valid page order. Consider only values in the
+ * valid order range to prevent low_pfn overflow.
+ */
+ if (freepage_order > 0 && freepage_order < MAX_ORDER)
+ low_pfn += (1UL << freepage_order) - 1;
continue;
+ }

/*
* Check may be lockless but that's ok as we recheck later.
@@ -695,6 +714,13 @@ isolate_success:
}
}

+ /*
+ * The PageBuddy() check could have potentially brought us outside
+ * the range to be scanned.
+ */
+ if (unlikely(low_pfn > end_pfn))
+ low_pfn = end_pfn;
+
acct_isolated(zone, locked, cc);

if (locked)
diff --git a/mm/internal.h b/mm/internal.h
index 4c1d604..86ae964 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -164,7 +164,8 @@ isolate_migratepages_range(struct compact_control *cc,
* general, page_zone(page)->lock must be held by the caller to prevent the
* page from being allocated in parallel and returning garbage as the order.
* If a caller does not hold page_zone(page)->lock, it must guarantee that the
- * page cannot be allocated or merged in parallel.
+ * page cannot be allocated or merged in parallel. Alternatively, it must
+ * handle invalid values gracefully, and use page_order_unsafe() below.
*/
static inline unsigned long page_order(struct page *page)
{
@@ -172,6 +173,19 @@ static inline unsigned long page_order(struct page *page)
return page_private(page);
}

+/*
+ * Like page_order(), but for callers who cannot afford to hold the zone lock.
+ * PageBuddy() should be checked first by the caller to minimize race window,
+ * and invalid values must be handled gracefully.
+ *
+ * ACCESS_ONCE is used so that if the caller assigns the result into a local
+ * variable and e.g. tests it for valid range before using, the compiler cannot
+ * decide to remove the variable and inline the page_private(page) multiple
+ * times, potentially observing different values in the tests and the actual
+ * use of the result.
+ */
+#define page_order_unsafe(page) ACCESS_ONCE(page_private(page))
+
static inline bool is_cow_mapping(vm_flags_t flags)
{
return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
--
1.8.4.5

2014-07-28 13:14:05

by Vlastimil Babka

Subject: [PATCH v5 06/14] mm, compaction: reduce zone checking frequency in the migration scanner

The unification of the migrate and free scanner families of functions has
highlighted a difference in how the scanners ensure they only isolate pages
of the intended zone. This is important for taking the zone lock or lru lock of
the correct zone. Due to nodes overlapping, it is however possible to
encounter a different zone within the range of the zone being compacted.

The free scanner, since its inception by commit 748446bb6b5a ("mm: compaction:
memory compaction core"), has been checking the zone of the first valid page
in a pageblock, and skipping the whole pageblock if the zone does not match.

This checking was completely missing from the migration scanner at first, and
later added by commit dc9086004b3d ("mm: compaction: check for overlapping
nodes during isolation for migration") in a reaction to a bug report.
But the zone comparison in the migration scanner is done once per scanned
page, which is more defensive and thus more costly than a check per pageblock.

This patch unifies the checking done in both scanners to once per pageblock,
through a new pageblock_within_zone() function, which also includes pfn_valid()
checks. It is more defensive than the current free scanner checks, as it checks
both the first and last page of the pageblock, but less defensive than the
migration scanner's per-page checks. It assumes that node overlapping may result
(on some architectures) in a boundary between two nodes falling into the middle
of a pageblock, but that there cannot be a node0 node1 node0 interleaving
within a single pageblock.
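The boundary-pages-only check can be demonstrated with a toy pfn-to-zone
lookup table standing in for page_zone() (a contrived layout with a node
boundary at pfn 3; everything here is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy memory map: pfns 0-2 in zone 0, pfns 3-7 in zone 1. */
static const int zone_of_pfn[8] = { 0, 0, 0, 1, 1, 1, 1, 1 };

/* Checks that [start_pfn, end_pfn) lies entirely in 'zone', by comparing
 * only the first and last page.  This suffices under the assumption that a
 * node boundary can fall inside a pageblock, but zones cannot interleave
 * within one pageblock. */
static bool pageblock_within_zone(unsigned long start_pfn,
                                  unsigned long end_pfn, int zone)
{
    end_pfn--;  /* end_pfn is one past the range being checked */

    if (zone_of_pfn[start_pfn] != zone)
        return false;
    /* No per-page zone lookups needed in between. */
    return zone_of_pfn[start_pfn] == zone_of_pfn[end_pfn];
}
```

A range entirely inside one zone passes; a range straddling the boundary at
pfn 3 is rejected by comparing just its two end pages.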

The result is more code being shared and a bit less per-page CPU cost in the
migration scanner.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 91 ++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 57 insertions(+), 34 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index bac6e37..76a9775 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -67,6 +67,49 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+/*
+ * Check that the whole (or subset of) a pageblock given by the interval of
+ * [start_pfn, end_pfn) is valid and within the same zone, before scanning it
+ * with the migration of free compaction scanner. The scanners then need to
+ * use only pfn_valid_within() check for arches that allow holes within
+ * pageblocks.
+ *
+ * Return struct page pointer of start_pfn, or NULL if checks were not passed.
+ *
+ * It's possible on some configurations to have a setup like node0 node1 node0
+ * i.e. it's possible that all pages within a zones range of pages do not
+ * belong to a single zone. We assume that a border between node0 and node1
+ * can occur within a single pageblock, but not a node0 node1 node0
+ * interleaving within a single pageblock. It is therefore sufficient to check
+ * the first and last page of a pageblock and avoid checking each individual
+ * page in a pageblock.
+ */
+static struct page *pageblock_within_zone(unsigned long start_pfn,
+ unsigned long end_pfn, struct zone *zone)
+{
+ struct page *start_page;
+ struct page *end_page;
+
+ /* end_pfn is one past the range we are checking */
+ end_pfn--;
+
+ if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
+ return NULL;
+
+ start_page = pfn_to_page(start_pfn);
+
+ if (page_zone(start_page) != zone)
+ return NULL;
+
+ end_page = pfn_to_page(end_pfn);
+
+ /* This gives a shorter code than deriving page_zone(end_page) */
+ if (page_zone_id(start_page) != page_zone_id(end_page))
+ return NULL;
+
+ return start_page;
+}
+
#ifdef CONFIG_COMPACTION
/* Returns true if the pageblock should be scanned for pages to isolate. */
static inline bool isolation_suitable(struct compact_control *cc,
@@ -368,17 +411,17 @@ isolate_freepages_range(struct compact_control *cc,
unsigned long isolated, pfn, block_end_pfn;
LIST_HEAD(freelist);

- for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
- if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
- break;
+ pfn = start_pfn;
+ block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+
+ for (; pfn < end_pfn; pfn += isolated,
+ block_end_pfn += pageblock_nr_pages) {

- /*
- * On subsequent iterations ALIGN() is actually not needed,
- * but we keep it that we not to complicate the code.
- */
- block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
block_end_pfn = min(block_end_pfn, end_pfn);

+ if (!pageblock_within_zone(pfn, block_end_pfn, cc->zone))
+ break;
+
isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
&freelist, true);

@@ -507,15 +550,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
continue;
nr_scanned++;

- /*
- * Get the page and ensure the page is within the same zone.
- * See the comment in isolate_freepages about overlapping
- * nodes. It is deliberate that the new zone lock is not taken
- * as memory compaction should not move pages between nodes.
- */
page = pfn_to_page(low_pfn);
- if (page_zone(page) != zone)
- continue;

if (!valid_page)
valid_page = page;
@@ -654,8 +689,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,

block_end_pfn = min(block_end_pfn, end_pfn);

- /* Skip whole pageblock in case of a memory hole */
- if (!pfn_valid(pfn))
+ if (!pageblock_within_zone(pfn, block_end_pfn, cc->zone))
continue;

pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
@@ -727,18 +761,9 @@ static void isolate_freepages(struct compact_control *cc)
&& compact_should_abort(cc))
break;

- if (!pfn_valid(block_start_pfn))
- continue;
-
- /*
- * Check for overlapping nodes/zones. It's possible on some
- * configurations to have a setup like
- * node0 node1 node0
- * i.e. it's possible that all pages within a zones range of
- * pages do not belong to a single zone.
- */
- page = pfn_to_page(block_start_pfn);
- if (page_zone(page) != zone)
+ page = pageblock_within_zone(block_start_pfn, block_end_pfn,
+ zone);
+ if (!page)
continue;

/* Check the block is suitable for migration */
@@ -873,12 +898,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
&& compact_should_abort(cc))
break;

- /* Skip whole pageblock in case of a memory hole */
- if (!pfn_valid(low_pfn))
+ page = pageblock_within_zone(low_pfn, end_pfn, zone);
+ if (!page)
continue;

- page = pfn_to_page(low_pfn);
-
/* If isolation recently failed, do not retry */
if (!isolation_suitable(cc, page))
continue;
--
1.8.4.5

2014-07-28 13:14:03

by Vlastimil Babka

Subject: [PATCH v5 08/14] mm, compaction: periodically drop lock and restore IRQs in scanners

Compaction scanners regularly check for lock contention and need_resched()
through the compact_checklock_irqsave() function. However, if there is no
contention, the lock can be held and IRQs disabled for a potentially long time.

This has been addressed by commit b2eef8c0d091 ("mm: compaction: minimise the
time IRQs are disabled while isolating pages for migration") for the migration
scanner. However, the refactoring done by commit 2a1402aa044b ("mm: compaction:
acquire the zone->lru_lock as late as possible") has changed the conditions so
that the lock is dropped only when there's contention on the lock or
need_resched() is true. Also, need_resched() is checked only when the lock is
already held. The comment "give a chance to irqs before checking need_resched"
is therefore misleading, as IRQs remain disabled when the check is done.

This patch restores the behavior intended by commit b2eef8c0d091 and also tries
to better balance, and make more deterministic, the time spent checking for
contention versus the time the scanners might run between checks. It also
avoids situations where the checks were previously not done often enough. The
result should be avoiding both too frequent and too infrequent contention
checking, and especially the potentially long-running scans with IRQs disabled
and no checking for need_resched() or a pending fatal signal, which can happen
when many consecutive pages or pageblocks fail the preliminary tests and never
reach the later call site of compact_checklock_irqsave(), as explained below.

Before the patch:

In the migration scanner, compact_checklock_irqsave() was called on each loop
iteration, if reached. If not reached, some lower-frequency checking could
still be done if the lock was already held, but this would not result in
aborting contended async compaction until compact_checklock_irqsave() or the
end of the pageblock was reached. The free scanner was similar, but entirely
without the periodic checking, so the lock could potentially be held until the
end of the pageblock was reached.

After the patch, in both scanners:

The periodical check is done as the first thing in the loop on each
SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
function, which always unlocks the lock (if locked) and aborts async compaction
if scheduling is needed. It also aborts any type of compaction when a fatal
signal is pending.

The compact_checklock_irqsave() function is replaced with a slightly different
compact_trylock_irqsave(). The biggest difference is that the function is not
called at all if the lock is already held. The periodical need_resched()
checking is left solely to compact_unlock_should_abort(). The lock contention
avoidance for async compaction is achieved by the periodical unlock by
compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave()
and aborting when trylock fails. Sync compaction does not use trylock.

Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 121 ++++++++++++++++++++++++++++++++++----------------------
1 file changed, 73 insertions(+), 48 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 2b8b6d8..1756ed8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -223,61 +223,72 @@ static void update_pageblock_skip(struct compact_control *cc,
}
#endif /* CONFIG_COMPACTION */

-static int should_release_lock(spinlock_t *lock)
+/*
+ * Compaction requires the taking of some coarse locks that are potentially
+ * very heavily contended. For async compaction, back out if the lock cannot
+ * be taken immediately. For sync compaction, spin on the lock if needed.
+ *
+ * Returns true if the lock is held
+ * Returns false if the lock is not held and compaction should abort
+ */
+static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
+ struct compact_control *cc)
{
- /*
- * Sched contention has higher priority here as we may potentially
- * have to abort whole compaction ASAP. Returning with lock contention
- * means we will try another zone, and further decisions are
- * influenced only when all zones are lock contended. That means
- * potentially missing a lock contention is less critical.
- */
- if (need_resched())
- return COMPACT_CONTENDED_SCHED;
- else if (spin_is_contended(lock))
- return COMPACT_CONTENDED_LOCK;
- else
- return COMPACT_CONTENDED_NONE;
+ if (cc->mode == MIGRATE_ASYNC) {
+ if (!spin_trylock_irqsave(lock, *flags)) {
+ cc->contended = COMPACT_CONTENDED_LOCK;
+ return false;
+ }
+ } else {
+ spin_lock_irqsave(lock, *flags);
+ }
+
+ return true;
}

/*
* Compaction requires the taking of some coarse locks that are potentially
- * very heavily contended. Check if the process needs to be scheduled or
- * if the lock is contended. For async compaction, back out in the event
- * if contention is severe. For sync compaction, schedule.
+ * very heavily contended. The lock should be periodically unlocked to avoid
+ * having disabled IRQs for a long time, even when there is nobody waiting on
+ * the lock. It might also be that allowing the IRQs will result in
+ * need_resched() becoming true. If scheduling is needed, async compaction
+ * aborts. Sync compaction schedules.
+ * Either compaction type will also abort if a fatal signal is pending.
+ * In either case if the lock was locked, it is dropped and not regained.
*
- * Returns true if the lock is held.
- * Returns false if the lock is released and compaction should abort
+ * Returns true if compaction should abort due to fatal signal pending, or
+ * async compaction due to need_resched()
+ * Returns false when compaction can continue (sync compaction might have
+ * scheduled)
*/
-static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
- bool locked, struct compact_control *cc)
+static bool compact_unlock_should_abort(spinlock_t *lock,
+ unsigned long flags, bool *locked, struct compact_control *cc)
{
- int contended = should_release_lock(lock);
+ if (*locked) {
+ spin_unlock_irqrestore(lock, flags);
+ *locked = false;
+ }

- if (contended) {
- if (locked) {
- spin_unlock_irqrestore(lock, *flags);
- locked = false;
- }
+ if (fatal_signal_pending(current)) {
+ cc->contended = COMPACT_CONTENDED_SCHED;
+ return true;
+ }

- /* async aborts if taking too long or contended */
+ if (need_resched()) {
if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = contended;
- return false;
+ cc->contended = COMPACT_CONTENDED_SCHED;
+ return true;
}
-
cond_resched();
}

- if (!locked)
- spin_lock_irqsave(lock, *flags);
- return true;
+ return false;
}

/*
* Aside from avoiding lock contention, compaction also periodically checks
* need_resched() and either schedules in sync compaction or aborts async
- * compaction. This is similar to what compact_checklock_irqsave() does, but
+ * compaction. This is similar to what compact_unlock_should_abort() does, but
* is used where no lock is concerned.
*
* Returns false when no scheduling was needed, or sync compaction scheduled.
@@ -336,6 +347,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
int isolated, i;
struct page *page = cursor;

+ /*
+ * Periodically drop the lock (if held) regardless of its
+ * contention, to give chance to IRQs. Abort async compaction
+ * if contended.
+ */
+ if (!(blockpfn % SWAP_CLUSTER_MAX)
+ && compact_unlock_should_abort(&cc->zone->lock, flags,
+ &locked, cc))
+ break;
+
nr_scanned++;
if (!pfn_valid_within(blockpfn))
goto isolate_fail;
@@ -353,8 +374,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
* spin on the lock and we acquire the lock as late as
* possible.
*/
- locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
- locked, cc);
+ if (!locked)
+ locked = compact_trylock_irqsave(&cc->zone->lock,
+ &flags, cc);
if (!locked)
break;

@@ -552,13 +574,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,

/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
- /* give a chance to irqs before checking need_resched() */
- if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
- if (should_release_lock(&zone->lru_lock)) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- locked = false;
- }
- }
+ /*
+ * Periodically drop the lock (if held) regardless of its
+ * contention, to give chance to IRQs. Abort async compaction
+ * if contended.
+ */
+ if (!(low_pfn % SWAP_CLUSTER_MAX)
+ && compact_unlock_should_abort(&zone->lru_lock, flags,
+ &locked, cc))
+ break;

if (!pfn_valid_within(low_pfn))
continue;
@@ -620,10 +644,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
page_count(page) > page_mapcount(page))
continue;

- /* Check if it is ok to still hold the lock */
- locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
- locked, cc);
- if (!locked || fatal_signal_pending(current))
+ /* If the lock is not held, try to take it */
+ if (!locked)
+ locked = compact_trylock_irqsave(&zone->lru_lock,
+ &flags, cc);
+ if (!locked)
break;

/* Recheck PageLRU and PageTransHuge under lock */
--
1.8.4.5

2014-07-28 13:14:01

by Vlastimil Babka

Subject: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

Async compaction aborts when it detects zone lock contention or need_resched()
is true. David Rientjes has reported that in practice, most direct async
compactions for THP allocation abort due to need_resched(). This means that a
second direct compaction is never attempted, which might be OK for a page
fault, but khugepaged is intended to attempt a sync compaction in such cases,
and currently it won't.

This patch replaces "bool contended" in compact_control with an int that
distinguishes between aborting due to need_resched() and aborting due to lock
contention. This allows propagating the abort through all compaction functions
as before, but passing the abort reason up to __alloc_pages_slowpath() which
decides when to continue with direct reclaim and another compaction attempt.

Another problem is that try_to_compact_pages() did not act upon the reported
contention (whether need_resched() or lock contention) immediately, and would
proceed with another zone from the zonelist. When need_resched() is true, that
means initializing compaction of another zone, only to check need_resched()
again in isolate_migratepages() and abort. For zone lock contention, the
unintended consequence is that the lock contended status reported back to the
allocator is determined by the last zone where compaction was attempted, which
is rather arbitrary.

This patch fixes the problem in the following way:
- async compaction of a zone aborting due to need_resched() or fatal signal
pending means that further zones should not be tried. We report
COMPACT_CONTENDED_SCHED to the allocator.
- aborting zone compaction due to lock contention means we can still try
another zone, since it has different set of locks. We report back
COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted,
it was aborted due to lock contention.

As a result of these fixes, khugepaged will proceed with second sync compaction
as intended, when the preceding async compaction aborted due to need_resched().
Page fault compactions aborting due to need_resched() will spare some cycles
previously wasted by initializing another zone compaction only to abort again.
Lock contention will be reported only when compaction in all zones aborted due
to lock contention, and therefore it's not a good idea to try again after
reclaim.

In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this has
improved number of THP collapse allocations by 10%, which shows positive
effect on khugepaged. The benchmark's own success rates are unchanged, as the
benchmark is not recognized as khugepaged. Numbers of compact_stall and
compact_fail events
have however decreased by 20%, with compact_success still a bit improved,
which is good. With the benchmark configured not to use __GFP_NO_KSWAPD, there
is a 6% improvement in THP collapse allocations, and only a slight improvement
in stalls and failures.

Reported-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
---
include/linux/compaction.h | 12 +++++--
mm/compaction.c | 89 +++++++++++++++++++++++++++++++++++++++-------
mm/internal.h | 4 +--
mm/page_alloc.c | 43 ++++++++++++++++------
4 files changed, 121 insertions(+), 27 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b2e4c92..60bdf8d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -13,6 +13,14 @@
/* The full zone was compacted */
#define COMPACT_COMPLETE 4

+/* Used to signal whether compaction detected need_sched() or lock contention */
+/* No contention detected */
+#define COMPACT_CONTENDED_NONE 0
+/* Either need_sched() was true or fatal signal pending */
+#define COMPACT_CONTENDED_SCHED 1
+/* Zone lock or lru_lock was contended in async compaction */
+#define COMPACT_CONTENDED_LOCK 2
+
#ifdef CONFIG_COMPACTION
extern int sysctl_compact_memory;
extern int sysctl_compaction_handler(struct ctl_table *table, int write,
@@ -24,7 +32,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- enum migrate_mode mode, bool *contended,
+ enum migrate_mode mode, int *contended,
struct zone **candidate_zone);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
@@ -94,7 +102,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- enum migrate_mode mode, bool *contended,
+ enum migrate_mode mode, int *contended,
struct zone **candidate_zone)
{
return COMPACT_CONTINUE;
diff --git a/mm/compaction.c b/mm/compaction.c
index 76a9775..2b8b6d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -223,9 +223,21 @@ static void update_pageblock_skip(struct compact_control *cc,
}
#endif /* CONFIG_COMPACTION */

-static inline bool should_release_lock(spinlock_t *lock)
+static int should_release_lock(spinlock_t *lock)
{
- return need_resched() || spin_is_contended(lock);
+ /*
+ * Sched contention has higher priority here as we may potentially
+ * have to abort whole compaction ASAP. Returning with lock contention
+ * means we will try another zone, and further decisions are
+ * influenced only when all zones are lock contended. That means
+ * potentially missing a lock contention is less critical.
+ */
+ if (need_resched())
+ return COMPACT_CONTENDED_SCHED;
+ else if (spin_is_contended(lock))
+ return COMPACT_CONTENDED_LOCK;
+ else
+ return COMPACT_CONTENDED_NONE;
}

/*
@@ -240,7 +252,9 @@ static inline bool should_release_lock(spinlock_t *lock)
static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
bool locked, struct compact_control *cc)
{
- if (should_release_lock(lock)) {
+ int contended = should_release_lock(lock);
+
+ if (contended) {
if (locked) {
spin_unlock_irqrestore(lock, *flags);
locked = false;
@@ -248,7 +262,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,

/* async aborts if taking too long or contended */
if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = true;
+ cc->contended = contended;
return false;
}

@@ -274,7 +288,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
/* async compaction aborts if contended */
if (need_resched()) {
if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = true;
+ cc->contended = COMPACT_CONTENDED_SCHED;
return true;
}

@@ -1139,7 +1153,7 @@ out:
}

static unsigned long compact_zone_order(struct zone *zone, int order,
- gfp_t gfp_mask, enum migrate_mode mode, bool *contended)
+ gfp_t gfp_mask, enum migrate_mode mode, int *contended)
{
unsigned long ret;
struct compact_control cc = {
@@ -1154,11 +1168,11 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
INIT_LIST_HEAD(&cc.migratepages);

ret = compact_zone(zone, &cc);
+ *contended = cc.contended;

VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));

- *contended = cc.contended;
return ret;
}

@@ -1171,14 +1185,15 @@ int sysctl_extfrag_threshold = 500;
* @gfp_mask: The GFP mask of the current allocation
* @nodemask: The allowed nodes to allocate from
* @mode: The migration mode for async, sync light, or sync migration
- * @contended: Return value that is true if compaction was aborted due to lock contention
+ * @contended: Return value that determines if compaction was aborted due to
+ * need_resched() or lock contention
* @candidate_zone: Return the zone where we think allocation should succeed
*
* This is the main entry point for direct page compaction.
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- enum migrate_mode mode, bool *contended,
+ enum migrate_mode mode, int *contended,
struct zone **candidate_zone)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1188,6 +1203,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zone *zone;
int rc = COMPACT_DEFERRED;
int alloc_flags = 0;
+ bool all_zones_lock_contended = true; /* init true for &= operation */
+
+ *contended = COMPACT_CONTENDED_NONE;

/* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
@@ -1201,13 +1219,20 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
nodemask) {
int status;
+ int zone_contended;

if (compaction_deferred(zone, order))
continue;

status = compact_zone_order(zone, order, gfp_mask, mode,
- contended);
+ &zone_contended);
rc = max(status, rc);
+ /*
+ * It takes at least one zone that wasn't lock contended
+ * to turn all_zones_lock_contended to false.
+ */
+ all_zones_lock_contended &=
+ (zone_contended == COMPACT_CONTENDED_LOCK);

/* If a normal allocation would succeed, stop compacting */
if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
@@ -1220,8 +1245,20 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
* succeeds in this zone.
*/
compaction_defer_reset(zone, order, false);
- break;
- } else if (mode != MIGRATE_ASYNC) {
+ /*
+ * It is possible that async compaction aborted due to
+ * need_resched() and the watermarks were ok thanks to
+ * somebody else freeing memory. The allocation can
+ * however still fail so we better signal the
+ * need_resched() contention anyway.
+ */
+ if (zone_contended == COMPACT_CONTENDED_SCHED)
+ *contended = COMPACT_CONTENDED_SCHED;
+
+ goto break_loop;
+ }
+
+ if (mode != MIGRATE_ASYNC) {
/*
* We think that allocation won't succeed in this zone
* so we defer compaction there. If it ends up
@@ -1229,8 +1266,36 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
*/
defer_compaction(zone, order);
}
+
+ /*
+ * We might have stopped compacting due to need_resched() in
+ * async compaction, or due to a fatal signal detected. In that
+ * case do not try further zones and signal need_resched()
+ * contention.
+ */
+ if ((zone_contended == COMPACT_CONTENDED_SCHED)
+ || fatal_signal_pending(current)) {
+ *contended = COMPACT_CONTENDED_SCHED;
+ goto break_loop;
+ }
+
+ continue;
+break_loop:
+ /*
+ * We might not have tried all the zones, so be conservative
+ * and assume they are not all lock contended.
+ */
+ all_zones_lock_contended = false;
+ break;
}

+ /*
+ * If at least one zone wasn't deferred or skipped, we report if all
+ * zones that were tried were contended.
+ */
+ if (rc > COMPACT_SKIPPED && all_zones_lock_contended)
+ *contended = COMPACT_CONTENDED_LOCK;
+
return rc;
}

diff --git a/mm/internal.h b/mm/internal.h
index 5a0738f..4c1d604 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -144,8 +144,8 @@ struct compact_control {
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
- bool contended; /* True if a lock was contended, or
- * need_resched() true during async
+ int contended; /* Signal need_sched() or lock
+ * contention detected during
* compaction
*/
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f424752..e3c633b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2296,7 +2296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int classzone_idx, int migratetype, enum migrate_mode mode,
- bool *contended_compaction, bool *deferred_compaction,
+ int *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
struct zone *last_compact_zone = NULL;
@@ -2547,7 +2547,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
bool deferred_compaction = false;
- bool contended_compaction = false;
+ int contended_compaction = COMPACT_CONTENDED_NONE;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -2660,15 +2660,36 @@ rebalance:
if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
migration_mode = MIGRATE_SYNC_LIGHT;

- /*
- * If compaction is deferred for high-order allocations, it is because
- * sync compaction recently failed. In this is the case and the caller
- * requested a movable allocation that does not heavily disrupt the
- * system then fail the allocation instead of entering direct reclaim.
- */
- if ((deferred_compaction || contended_compaction) &&
- (gfp_mask & __GFP_NO_KSWAPD))
- goto nopage;
+ /* Checks for THP-specific high-order allocations */
+ if (gfp_mask & __GFP_NO_KSWAPD) {
+ /*
+ * If compaction is deferred for high-order allocations, it is
+ * because sync compaction recently failed. If this is the case
+ * and the caller requested a THP allocation, we do not want
+ * to heavily disrupt the system, so we fail the allocation
+ * instead of entering direct reclaim.
+ */
+ if (deferred_compaction)
+ goto nopage;
+
+ /*
+ * In all zones where compaction was attempted (and not
+ * deferred or skipped), lock contention has been detected.
+ * For THP allocation we do not want to disrupt the others
+ * so we fallback to base pages instead.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_LOCK)
+ goto nopage;
+
+ /*
+ * If compaction was aborted due to need_resched(), we do not
+ * want to further increase allocation latency, unless it is
+ * khugepaged trying to collapse.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_SCHED
+ && !(current->flags & PF_KTHREAD))
+ goto nopage;
+ }

/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
--
1.8.4.5

2014-07-28 13:15:18

by Vlastimil Babka

Subject: [PATCH v5 04/14] mm, compaction: do not recheck suitable_migration_target under lock

isolate_freepages_block() rechecks if the pageblock is suitable to be a target
for migration after it has taken the zone->lock. However, the check has been
optimized to occur only once per pageblock, and compact_checklock_irqsave()
might drop and reacquire the lock, which means somebody else might have
changed the pageblock's migratetype in the meantime.

Furthermore, nothing prevents the migratetype from changing right after
isolate_freepages_block() has finished isolating. Given how imperfect this is,
it's simpler to just rely on the check done in isolate_freepages() without
lock, and not pretend that the recheck under lock guarantees anything. It is
just a heuristic after all.

Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
mm/compaction.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index f3ae2ec..0a871e5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -276,7 +276,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
struct page *cursor, *valid_page = NULL;
unsigned long flags;
bool locked = false;
- bool checked_pageblock = false;

cursor = pfn_to_page(blockpfn);

@@ -307,18 +306,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
if (!locked)
break;

- /* Recheck this is a suitable migration target under lock */
- if (!strict && !checked_pageblock) {
- /*
- * We need to check suitability of pageblock only once
- * and this isolate_freepages_block() is called with
- * pageblock range, so just check once is sufficient.
- */
- checked_pageblock = true;
- if (!suitable_migration_target(page))
- break;
- }
-
/* Recheck this is a buddy page under lock */
if (!PageBuddy(page))
goto isolate_fail;
--
1.8.4.5

2014-07-28 13:15:14

by Vlastimil Babka

Subject: [PATCH v5 03/14] mm, compaction: do not count compact_stall if all zones skipped compaction

The compact_stall vmstat counter counts the number of allocations stalled by
direct compaction. It does not count when all attempted zones had deferred
compaction, but it does count when all zones skipped compaction. The skipping
is decided by a very early check of compaction_suitable(), based on
watermarks and memory fragmentation. Therefore it makes sense not to count
skipped compactions as stalls. Moreover, compact_success and compact_fail are
already not counted when compaction is skipped, so this patch changes the
compact_stall counting to match the other two.

Additionally, restructure __alloc_pages_direct_compact() code for better
readability.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++--------------------------
1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a14efeb..91191fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2300,6 +2300,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned long *did_some_progress)
{
struct zone *last_compact_zone = NULL;
+ struct page *page;

if (!order)
return NULL;
@@ -2311,49 +2312,55 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
&last_compact_zone);
current->flags &= ~PF_MEMALLOC;

- if (*did_some_progress > COMPACT_DEFERRED)
- count_vm_event(COMPACTSTALL);
- else
+ switch (*did_some_progress) {
+ case COMPACT_DEFERRED:
*deferred_compaction = true;
+ /* fall-through */
+ case COMPACT_SKIPPED:
+ return NULL;
+ default:
+ break;
+ }

- if (*did_some_progress > COMPACT_SKIPPED) {
- struct page *page;
+ /*
+ * At least in one zone compaction wasn't deferred or skipped, so let's
+ * count a compaction stall
+ */
+ count_vm_event(COMPACTSTALL);

- /* Page migration frees to the PCP lists but we want merging */
- drain_pages(get_cpu());
- put_cpu();
+ /* Page migration frees to the PCP lists but we want merging */
+ drain_pages(get_cpu());
+ put_cpu();

- page = get_page_from_freelist(gfp_mask, nodemask,
- order, zonelist, high_zoneidx,
- alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, classzone_idx, migratetype);
+ page = get_page_from_freelist(gfp_mask, nodemask,
+ order, zonelist, high_zoneidx,
+ alloc_flags & ~ALLOC_NO_WATERMARKS,
+ preferred_zone, classzone_idx, migratetype);

- if (page) {
- struct zone *zone = page_zone(page);
+ if (page) {
+ struct zone *zone = page_zone(page);

- zone->compact_blockskip_flush = false;
- compaction_defer_reset(zone, order, true);
- count_vm_event(COMPACTSUCCESS);
- return page;
- }
+ zone->compact_blockskip_flush = false;
+ compaction_defer_reset(zone, order, true);
+ count_vm_event(COMPACTSUCCESS);
+ return page;
+ }

- /*
- * last_compact_zone is where try_to_compact_pages thought
- * allocation should succeed, so it did not defer compaction.
- * But now we know that it didn't succeed, so we do the defer.
- */
- if (last_compact_zone && mode != MIGRATE_ASYNC)
- defer_compaction(last_compact_zone, order);
+ /*
+ * last_compact_zone is where try_to_compact_pages thought allocation
+ * should succeed, so it did not defer compaction. But here we know
+ * that it didn't succeed, so we do the defer.
+ */
+ if (last_compact_zone && mode != MIGRATE_ASYNC)
+ defer_compaction(last_compact_zone, order);

- /*
- * It's bad if compaction run occurs and fails.
- * The most likely reason is that pages exist,
- * but not enough to satisfy watermarks.
- */
- count_vm_event(COMPACTFAIL);
+ /*
+ * It's bad if compaction run occurs and fails. The most likely reason
+ * is that pages exist, but not enough to satisfy watermarks.
+ */
+ count_vm_event(COMPACTFAIL);

- cond_resched();
- }
+ cond_resched();

return NULL;
}
--
1.8.4.5

2014-07-28 13:15:11

by Vlastimil Babka

Subject: [PATCH v5 05/14] mm, compaction: move pageblock checks up from isolate_migratepages_range()

isolate_migratepages_range() is the main function of the compaction scanner,
called either on a single pageblock by isolate_migratepages() during regular
compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
It currently performs two pageblock-wide compaction suitability checks, and
because of the CMA callpath, it tracks if it crossed a pageblock boundary in
order to repeat those checks.

However, closer inspection shows that those checks are always true for CMA:
- isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
- migrate_async_suitable() check is skipped because CMA uses sync compaction

We can therefore move the compaction-specific checks to isolate_migratepages()
and simplify isolate_migratepages_range(). Furthermore, we can mimic the
freepage scanner family of functions, where isolate_freepages_block() is
called both by compaction from isolate_freepages() and by CMA from
isolate_freepages_range(), and each use case adds its own specific glue code.
This allows further code simplification.

Therefore, we rename isolate_migratepages_range() to
isolate_migratepages_block() and limit its functionality to a single pageblock
(or its subset). For CMA, a new, different isolate_migratepages_range() is
created as a CMA-specific wrapper for the _block() function. The checks
specific to compaction are moved to isolate_migratepages(). As part of the
unification of these two families of functions, we remove the redundant zone
parameter where applicable, since the zone pointer is already passed in
cc->zone.

Furthermore, going back to compact_zone() and compact_finished() when pageblock
is found unsuitable (now by isolate_migratepages()) is wasteful - the checks
are meant to skip pageblocks quickly. The patch therefore also introduces a
simple loop into isolate_migratepages() so that it does not return immediately
on failed pageblock checks, but keeps going until isolate_migratepages_block()
gets called once. Similarly to isolate_freepages(), the function periodically
checks if it needs to reschedule or abort async compaction.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/compaction.c | 234 +++++++++++++++++++++++++++++++++-----------------------
mm/internal.h | 4 +-
mm/page_alloc.c | 3 +-
3 files changed, 140 insertions(+), 101 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 0a871e5..bac6e37 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
*/
static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
- bool set_unsuitable, bool migrate_scanner)
+ bool migrate_scanner)
{
struct zone *zone = cc->zone;
unsigned long pfn;
@@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
if (nr_isolated)
return;

- /*
- * Only skip pageblocks when all forms of compaction will be known to
- * fail in the near future.
- */
- if (set_unsuitable)
- set_pageblock_skip(page);
+ set_pageblock_skip(page);

pfn = page_to_pfn(page);

@@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,

static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
- bool set_unsuitable, bool migrate_scanner)
+ bool migrate_scanner)
{
}
#endif /* CONFIG_COMPACTION */
@@ -345,8 +340,7 @@ isolate_fail:

/* Update the pageblock-skip if the whole pageblock was scanned */
if (blockpfn == end_pfn)
- update_pageblock_skip(cc, valid_page, total_isolated, true,
- false);
+ update_pageblock_skip(cc, valid_page, total_isolated, false);

count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
if (total_isolated)
@@ -451,40 +445,34 @@ static bool too_many_isolated(struct zone *zone)
}

/**
- * isolate_migratepages_range() - isolate all migrate-able pages in range.
- * @zone: Zone pages are in.
+ * isolate_migratepages_block() - isolate all migrate-able pages within
+ * a single pageblock
* @cc: Compaction control structure.
- * @low_pfn: The first PFN of the range.
- * @end_pfn: The one-past-the-last PFN of the range.
- * @unevictable: true if it allows to isolate unevictable pages
+ * @low_pfn: The first PFN to isolate
+ * @end_pfn: The one-past-the-last PFN to isolate, within same pageblock
+ * @isolate_mode: Isolation mode to be used.
*
* Isolate all pages that can be migrated from the range specified by
- * [low_pfn, end_pfn). Returns zero if there is a fatal signal
- * pending), otherwise PFN of the first page that was not scanned
- * (which may be both less, equal to or more then end_pfn).
- *
- * Assumes that cc->migratepages is empty and cc->nr_migratepages is
- * zero.
+ * [low_pfn, end_pfn). The range is expected to be within same pageblock.
+ * Returns zero if there is a fatal signal pending, otherwise PFN of the
+ * first page that was not scanned (which may be both less, equal to or more
+ * than end_pfn).
*
- * Apart from cc->migratepages and cc->nr_migratetypes this function
- * does not modify any cc's fields, in particular it does not modify
- * (or read for that matter) cc->migrate_pfn.
+ * The pages are isolated on cc->migratepages list (not required to be empty),
+ * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
+ * is neither read nor updated.
*/
-unsigned long
-isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
- unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
+static unsigned long
+isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
+ unsigned long end_pfn, isolate_mode_t isolate_mode)
{
- unsigned long last_pageblock_nr = 0, pageblock_nr;
+ struct zone *zone = cc->zone;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct list_head *migratelist = &cc->migratepages;
struct lruvec *lruvec;
unsigned long flags;
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
- bool set_unsuitable = true;
- const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
- ISOLATE_ASYNC_MIGRATE : 0) |
- (unevictable ? ISOLATE_UNEVICTABLE : 0);

/*
* Ensure that there are not too many pages isolated from the LRU
@@ -515,19 +503,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
}
}

- /*
- * migrate_pfn does not necessarily start aligned to a
- * pageblock. Ensure that pfn_valid is called when moving
- * into a new MAX_ORDER_NR_PAGES range in case of large
- * memory holes within the zone
- */
- if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
- if (!pfn_valid(low_pfn)) {
- low_pfn += MAX_ORDER_NR_PAGES - 1;
- continue;
- }
- }
-
if (!pfn_valid_within(low_pfn))
continue;
nr_scanned++;
@@ -545,28 +520,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
if (!valid_page)
valid_page = page;

- /* If isolation recently failed, do not retry */
- pageblock_nr = low_pfn >> pageblock_order;
- if (last_pageblock_nr != pageblock_nr) {
- int mt;
-
- last_pageblock_nr = pageblock_nr;
- if (!isolation_suitable(cc, page))
- goto next_pageblock;
-
- /*
- * For async migration, also only scan in MOVABLE
- * blocks. Async migration is optimistic to see if
- * the minimum amount of work satisfies the allocation
- */
- mt = get_pageblock_migratetype(page);
- if (cc->mode == MIGRATE_ASYNC &&
- !migrate_async_suitable(mt)) {
- set_unsuitable = false;
- goto next_pageblock;
- }
- }
-
/*
* Skip if free. page_order cannot be used without zone->lock
* as nothing prevents parallel allocations or buddy merging.
@@ -601,8 +554,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
*/
if (PageTransHuge(page)) {
if (!locked)
- goto next_pageblock;
- low_pfn += (1 << compound_order(page)) - 1;
+ low_pfn = ALIGN(low_pfn + 1,
+ pageblock_nr_pages) - 1;
+ else
+ low_pfn += (1 << compound_order(page)) - 1;
+
continue;
}

@@ -632,7 +588,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
lruvec = mem_cgroup_page_lruvec(page, zone);

/* Try isolate the page */
- if (__isolate_lru_page(page, mode) != 0)
+ if (__isolate_lru_page(page, isolate_mode) != 0)
continue;

VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -651,11 +607,6 @@ isolate_success:
++low_pfn;
break;
}
-
- continue;
-
-next_pageblock:
- low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
}

acct_isolated(zone, locked, cc);
@@ -668,8 +619,7 @@ next_pageblock:
* if the whole pageblock was scanned without isolating any page.
*/
if (low_pfn == end_pfn)
- update_pageblock_skip(cc, valid_page, nr_isolated,
- set_unsuitable, true);
+ update_pageblock_skip(cc, valid_page, nr_isolated, true);

trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);

@@ -680,15 +630,61 @@ next_pageblock:
return low_pfn;
}

+/**
+ * isolate_migratepages_range() - isolate migrate-able pages in a PFN range
+ * @start_pfn: The first PFN to start isolating.
+ * @end_pfn: The one-past-last PFN.
+ *
+ * Returns zero if isolation fails fatally due to e.g. pending signal.
+ * Otherwise, function returns one-past-the-last PFN of isolated page
+ * (which may be greater than end_pfn if end fell in a middle of a THP page).
+ */
+unsigned long
+isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned long pfn, block_end_pfn;
+
+ /* Scan block by block. First and last block may be incomplete */
+ pfn = start_pfn;
+ block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+
+ for (; pfn < end_pfn; pfn = block_end_pfn,
+ block_end_pfn += pageblock_nr_pages) {
+
+ block_end_pfn = min(block_end_pfn, end_pfn);
+
+ /* Skip whole pageblock in case of a memory hole */
+ if (!pfn_valid(pfn))
+ continue;
+
+ pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
+ ISOLATE_UNEVICTABLE);
+
+ /*
+ * In case of fatal failure, release everything that might
+ * have been isolated in the previous iteration, and signal
+ * the failure back to caller.
+ */
+ if (!pfn) {
+ putback_movable_pages(&cc->migratepages);
+ cc->nr_migratepages = 0;
+ break;
+ }
+ }
+
+ return pfn;
+}
+
#endif /* CONFIG_COMPACTION || CONFIG_CMA */
#ifdef CONFIG_COMPACTION
/*
* Based on information in the current compact_control, find blocks
* suitable for isolating free pages from and then isolate them.
*/
-static void isolate_freepages(struct zone *zone,
- struct compact_control *cc)
+static void isolate_freepages(struct compact_control *cc)
{
+ struct zone *zone = cc->zone;
struct page *page;
unsigned long block_start_pfn; /* start of current pageblock */
unsigned long block_end_pfn; /* end of current pageblock */
@@ -806,7 +802,7 @@ static struct page *compaction_alloc(struct page *migratepage,
*/
if (list_empty(&cc->freepages)) {
if (!cc->contended)
- isolate_freepages(cc->zone, cc);
+ isolate_freepages(cc);

if (list_empty(&cc->freepages))
return NULL;
@@ -840,34 +836,81 @@ typedef enum {
} isolate_migrate_t;

/*
- * Isolate all pages that can be migrated from the block pointed to by
- * the migrate scanner within compact_control.
+ * Isolate all pages that can be migrated from the first suitable block,
+ * starting at the block pointed to by the migrate scanner pfn within
+ * compact_control.
*/
static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
{
unsigned long low_pfn, end_pfn;
+ struct page *page;
+ const isolate_mode_t isolate_mode =
+ (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);

- /* Do not scan outside zone boundaries */
- low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+ /*
+ * Start at where we last stopped, or beginning of the zone as
+ * initialized by compact_zone()
+ */
+ low_pfn = cc->migrate_pfn;

/* Only scan within a pageblock boundary */
end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);

- /* Do not cross the free scanner or scan within a memory hole */
- if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
- cc->migrate_pfn = end_pfn;
- return ISOLATE_NONE;
- }
+ /*
+ * Iterate over whole pageblocks until we find the first suitable.
+ * Do not cross the free scanner.
+ */
+ for (; end_pfn <= cc->free_pfn;
+ low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
+
+ /*
+ * This can potentially iterate a massively long zone with
+ * many pageblocks unsuitable, so periodically check if we
+ * need to schedule, or even abort async compaction.
+ */
+ if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
+ && compact_should_abort(cc))
+ break;
+
+ /* Skip whole pageblock in case of a memory hole */
+ if (!pfn_valid(low_pfn))
+ continue;
+
+ page = pfn_to_page(low_pfn);
+
+ /* If isolation recently failed, do not retry */
+ if (!isolation_suitable(cc, page))
+ continue;
+
+ /*
+ * For async compaction, also only scan in MOVABLE blocks.
+ * Async compaction is optimistic to see if the minimum amount
+ * of work satisfies the allocation.
+ */
+ if (cc->mode == MIGRATE_ASYNC &&
+ !migrate_async_suitable(get_pageblock_migratetype(page)))
+ continue;
+
+ /* Perform the isolation */
+ low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
+ isolate_mode);
+
+ if (!low_pfn || cc->contended)
+ return ISOLATE_ABORT;

- /* Perform the isolation */
- low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
- if (!low_pfn || cc->contended)
- return ISOLATE_ABORT;
+ /*
+ * Either we isolated something and proceed with migration. Or
+ * we failed and compact_zone should decide if we should
+ * continue or not.
+ */
+ break;
+ }

+ /* Record where migration scanner will be restarted */
cc->migrate_pfn = low_pfn;

- return ISOLATE_SUCCESS;
+ return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

static int compact_finished(struct zone *zone,
@@ -1040,9 +1083,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
;
}

- if (!cc->nr_migratepages)
- continue;
-
err = migrate_pages(&cc->migratepages, compaction_alloc,
compaction_free, (unsigned long)cc, cc->mode,
MR_COMPACTION);
diff --git a/mm/internal.h b/mm/internal.h
index a1b651b..5a0738f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -154,8 +154,8 @@ unsigned long
isolate_freepages_range(struct compact_control *cc,
unsigned long start_pfn, unsigned long end_pfn);
unsigned long
-isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
- unsigned long low_pfn, unsigned long end_pfn, bool unevictable);
+isolate_migratepages_range(struct compact_control *cc,
+ unsigned long low_pfn, unsigned long end_pfn);

#endif

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91191fb..f424752 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6292,8 +6292,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,

if (list_empty(&cc->migratepages)) {
cc->nr_migratepages = 0;
- pfn = isolate_migratepages_range(cc->zone, cc,
- pfn, end, true);
+ pfn = isolate_migratepages_range(cc, pfn, end);
if (!pfn) {
ret = -EINTR;
break;
--
1.8.4.5

2014-07-28 23:39:10

by David Rientjes

Subject: Re: [PATCH v5 01/14] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

On Mon, 28 Jul 2014, Vlastimil Babka wrote:

> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
> for reading on the mm where collapsing occurs. Afterwards the read lock is
> dropped before write lock is taken on the same mmap_sem.
>
> Holding mmap_sem during whole huge page allocation is therefore useless, the
> vma needs to be rechecked after taking the write lock anyway. Furthermore, huge
> page allocation might involve a rather long sync compaction, and thus block
> any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
> or mprotect operations.
>
> This patch simply releases the read lock before allocating a huge page. It
> also deletes an outdated comment that assumed vma must be stable, as it was
> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13ac
> ("mm: thp: khugepaged: add policy for finding target node").
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: David Rientjes <[email protected]>

2014-07-28 23:59:47

by David Rientjes

Subject: Re: [PATCH v5 02/14] mm, compaction: defer each zone individually instead of preferred zone

On Mon, 28 Jul 2014, Vlastimil Babka wrote:

> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 01e3132..b2e4c92 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -2,14 +2,16 @@
> #define _LINUX_COMPACTION_H
>
> /* Return values for compact_zone() and try_to_compact_pages() */
> +/* compaction didn't start as it was deferred due to past failures */
> +#define COMPACT_DEFERRED 0
> /* compaction didn't start as it was not possible or direct reclaim was more suitable */
> -#define COMPACT_SKIPPED 0
> +#define COMPACT_SKIPPED 1
> /* compaction should continue to another pageblock */
> -#define COMPACT_CONTINUE 1
> +#define COMPACT_CONTINUE 2
> /* direct compaction partially compacted a zone and there are suitable pages */
> -#define COMPACT_PARTIAL 2
> +#define COMPACT_PARTIAL 3
> /* The full zone was compacted */
> -#define COMPACT_COMPLETE 3
> +#define COMPACT_COMPLETE 4
>
> #ifdef CONFIG_COMPACTION
> extern int sysctl_compact_memory;
> @@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> - enum migrate_mode mode, bool *contended);
> + enum migrate_mode mode, bool *contended,
> + struct zone **candidate_zone);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern unsigned long compaction_suitable(struct zone *zone, int order);
> @@ -91,7 +94,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> #else
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended,
> + struct zone **candidate_zone)
> {
> return COMPACT_CONTINUE;
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5175019..f3ae2ec 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1122,28 +1122,27 @@ int sysctl_extfrag_threshold = 500;
> * @nodemask: The allowed nodes to allocate from
> * @mode: The migration mode for async, sync light, or sync migration
> * @contended: Return value that is true if compaction was aborted due to lock contention
> - * @page: Optionally capture a free page of the requested order during compaction

Never noticed this non-existent formal before.

> + * @candidate_zone: Return the zone where we think allocation should succeed
> *
> * This is the main entry point for direct page compaction.
> */
> unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended,
> + struct zone **candidate_zone)
> {
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> int may_enter_fs = gfp_mask & __GFP_FS;
> int may_perform_io = gfp_mask & __GFP_IO;
> struct zoneref *z;
> struct zone *zone;
> - int rc = COMPACT_SKIPPED;
> + int rc = COMPACT_DEFERRED;
> int alloc_flags = 0;
>
> /* Check if the GFP flags allow compaction */
> if (!order || !may_enter_fs || !may_perform_io)
> return rc;
>

It doesn't seem right that if we called try_to_compact_pages() in a
context where it is useless (order-0 or non-GFP_KERNEL allocation) we
would return COMPACT_DEFERRED. I think the existing semantics before the
patch, that is

- deferred: compaction was tried but failed, so avoid subsequent calls in
the near future that may be potentially expensive, and

- skipped: compaction wasn't tried because it will be useless

is correct and deferred shouldn't take on another meaning, which now will
set deferred_compaction == true in the page allocator. It probably
doesn't matter right now because the only check of deferred_compaction is
effective for __GFP_NO_KSWAPD, i.e. it is both high-order and GFP_KERNEL,
but it seems returning COMPACT_SKIPPED here would also work fine and be
more appropriate.

> - count_compact_event(COMPACTSTALL);
> -

I was originally going to object to moving this to the page allocator, but
it does indeed seem correct given the new return values and since deferred
compaction is now checked in memory compaction instead of the page
allocator.

> #ifdef CONFIG_CMA
> if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> @@ -1153,14 +1152,33 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> nodemask) {
> int status;
>
> + if (compaction_deferred(zone, order))
> + continue;
> +
> status = compact_zone_order(zone, order, gfp_mask, mode,
> contended);
> rc = max(status, rc);
>
> /* If a normal allocation would succeed, stop compacting */
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> - alloc_flags))
> + alloc_flags)) {
> + *candidate_zone = zone;
> + /*
> + * We think the allocation will succeed in this zone,
> + * but it is not certain, hence the false. The caller
> + * will repeat this with true if allocation indeed
> + * succeeds in this zone.
> + */
> + compaction_defer_reset(zone, order, false);
> break;
> + } else if (mode != MIGRATE_ASYNC) {
> + /*
> + * We think that allocation won't succeed in this zone
> + * so we defer compaction there. If it ends up
> + * succeeding after all, it will be reset.
> + */
> + defer_compaction(zone, order);
> + }
> }
>
> return rc;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b99643d4..a14efeb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2299,21 +2299,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> bool *contended_compaction, bool *deferred_compaction,
> unsigned long *did_some_progress)
> {
> - if (!order)
> - return NULL;
> + struct zone *last_compact_zone = NULL;
>
> - if (compaction_deferred(preferred_zone, order)) {
> - *deferred_compaction = true;
> + if (!order)
> return NULL;
> - }
>
> current->flags |= PF_MEMALLOC;
> *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
> nodemask, mode,
> - contended_compaction);
> + contended_compaction,
> + &last_compact_zone);
> current->flags &= ~PF_MEMALLOC;
>
> - if (*did_some_progress != COMPACT_SKIPPED) {
> + if (*did_some_progress > COMPACT_DEFERRED)
> + count_vm_event(COMPACTSTALL);
> + else
> + *deferred_compaction = true;
> +
> + if (*did_some_progress > COMPACT_SKIPPED) {
> struct page *page;
>
> /* Page migration frees to the PCP lists but we want merging */
> @@ -2324,27 +2327,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> order, zonelist, high_zoneidx,
> alloc_flags & ~ALLOC_NO_WATERMARKS,
> preferred_zone, classzone_idx, migratetype);
> +
> if (page) {
> - preferred_zone->compact_blockskip_flush = false;
> - compaction_defer_reset(preferred_zone, order, true);
> + struct zone *zone = page_zone(page);
> +
> + zone->compact_blockskip_flush = false;
> + compaction_defer_reset(zone, order, true);
> count_vm_event(COMPACTSUCCESS);
> return page;
> }
>
> /*
> + * last_compact_zone is where try_to_compact_pages thought
> + * allocation should succeed, so it did not defer compaction.
> + * But now we know that it didn't succeed, so we do the defer.
> + */
> + if (last_compact_zone && mode != MIGRATE_ASYNC)
> + defer_compaction(last_compact_zone, order);
> +
> + /*
> * It's bad if compaction run occurs and fails.
> * The most likely reason is that pages exist,
> * but not enough to satisfy watermarks.
> */
> count_vm_event(COMPACTFAIL);
>
> - /*
> - * As async compaction considers a subset of pageblocks, only
> - * defer if the failure was a sync compaction failure.
> - */
> - if (mode != MIGRATE_ASYNC)
> - defer_compaction(preferred_zone, order);
> -
> cond_resched();
> }
>

2014-07-29 00:04:41

by David Rientjes

Subject: Re: [PATCH v5 03/14] mm, compaction: do not count compact_stall if all zones skipped compaction

On Mon, 28 Jul 2014, Vlastimil Babka wrote:

> The compact_stall vmstat counter counts the number of allocations stalled by
> direct compaction. It does not count when all attempted zones had deferred
> compaction, but it does count when all zones skipped compaction. The skipping
> is decided based on very early check of compaction_suitable(), based on
> watermarks and memory fragmentation. Therefore it makes sense not to count
> skipped compactions as stalls. Moreover, compact_success or compact_fail is
> also already not being counted when compaction was skipped, so this patch
> changes the compact_stall counting to match the other two.
>
> Additionally, restructure __alloc_pages_direct_compact() code for better
> readability.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: David Rientjes <[email protected]>

This makes the second patch in the series more understandable but I still
renew my suggestion that you should be doing the following as part of the
second patch.
---
diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1136,7 +1136,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
- int rc = COMPACT_DEFERRED;
+ int rc = COMPACT_SKIPPED;
int alloc_flags = 0;

/* Check if the GFP flags allow compaction */
@@ -1147,6 +1147,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
+ rc = COMPACT_DEFERRED;
/* Compact each zone in the list */
for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
nodemask) {

2014-07-29 00:29:45

by David Rientjes

Subject: Re: [PATCH v5 05/14] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Mon, 28 Jul 2014, Vlastimil Babka wrote:

> isolate_migratepages_range() is the main function of the compaction scanner,
> called either on a single pageblock by isolate_migratepages() during regular
> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> It currently performs two pageblock-wide compaction suitability checks, and
> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> order to repeat those checks.
>
> However, closer inspection shows that those checks are always true for CMA:
> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> - migrate_async_suitable() check is skipped because CMA uses sync compaction
>
> We can therefore move the compaction-specific checks to isolate_migratepages()
> and simplify isolate_migratepages_range(). Furthermore, we can mimic the
> freepage scanner family of functions, which has isolate_freepages_block()
> function called both by compaction from isolate_freepages() and by CMA from
> isolate_freepages_range(), where each use-case adds its own specific glue code.
> This allows further code simplification.
>
> Therefore, we rename isolate_migratepages_range() to isolate_freepages_block()

s/isolate_freepages_block/isolate_migratepages_block/

I read your commit description before looking at the patch and was very
nervous about the direction you were going if that was true :) I'm
relieved to see it was just a typo.

> and limit its functionality to a single pageblock (or its subset). For CMA,
> a new different isolate_migratepages_range() is created as a CMA-specific
> wrapper for the _block() function. The checks specific to compaction are moved
> to isolate_migratepages(). As part of the unification of these two families of
> functions, we remove the redundant zone parameter where applicable, since zone
> pointer is already passed in cc->zone.
>
> Furthermore, going back to compact_zone() and compact_finished() when pageblock
> is found unsuitable (now by isolate_migratepages()) is wasteful - the checks
> are meant to skip pageblocks quickly. The patch therefore also introduces a
> simple loop into isolate_migratepages() so that it does not return immediately
> on failed pageblock checks, but keeps going until isolate_migratepages_range()
> gets called once. Similarly to isolate_freepages(), the function periodically
> checks if it needs to reschedule or abort async compaction.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> mm/compaction.c | 234 +++++++++++++++++++++++++++++++++-----------------------
> mm/internal.h | 4 +-
> mm/page_alloc.c | 3 +-
> 3 files changed, 140 insertions(+), 101 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 0a871e5..bac6e37 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
> */
> static void update_pageblock_skip(struct compact_control *cc,
> struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
> {
> struct zone *zone = cc->zone;
> unsigned long pfn;
> @@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
> if (nr_isolated)
> return;
>
> - /*
> - * Only skip pageblocks when all forms of compaction will be known to
> - * fail in the near future.
> - */
> - if (set_unsuitable)
> - set_pageblock_skip(page);
> + set_pageblock_skip(page);
>
> pfn = page_to_pfn(page);
>
> @@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,
>
> static void update_pageblock_skip(struct compact_control *cc,
> struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
> {
> }
> #endif /* CONFIG_COMPACTION */
> @@ -345,8 +340,7 @@ isolate_fail:
>
> /* Update the pageblock-skip if the whole pageblock was scanned */
> if (blockpfn == end_pfn)
> - update_pageblock_skip(cc, valid_page, total_isolated, true,
> - false);
> + update_pageblock_skip(cc, valid_page, total_isolated, false);
>
> count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
> if (total_isolated)
> @@ -451,40 +445,34 @@ static bool too_many_isolated(struct zone *zone)
> }
>
> /**
> - * isolate_migratepages_range() - isolate all migrate-able pages in range.
> - * @zone: Zone pages are in.
> + * isolate_migratepages_block() - isolate all migrate-able pages within
> + * a single pageblock
> * @cc: Compaction control structure.
> - * @low_pfn: The first PFN of the range.
> - * @end_pfn: The one-past-the-last PFN of the range.
> - * @unevictable: true if it allows to isolate unevictable pages
> + * @low_pfn: The first PFN to isolate
> + * @end_pfn: The one-past-the-last PFN to isolate, within same pageblock
> + * @isolate_mode: Isolation mode to be used.
> *
> * Isolate all pages that can be migrated from the range specified by
> - * [low_pfn, end_pfn). Returns zero if there is a fatal signal
> - * pending), otherwise PFN of the first page that was not scanned
> - * (which may be both less, equal to or more then end_pfn).
> - *
> - * Assumes that cc->migratepages is empty and cc->nr_migratepages is
> - * zero.
> + * [low_pfn, end_pfn). The range is expected to be within same pageblock.
> + * Returns zero if there is a fatal signal pending, otherwise PFN of the
> + * first page that was not scanned (which may be both less, equal to or more
> + * than end_pfn).
> *
> - * Apart from cc->migratepages and cc->nr_migratetypes this function
> - * does not modify any cc's fields, in particular it does not modify
> - * (or read for that matter) cc->migrate_pfn.
> + * The pages are isolated on cc->migratepages list (not required to be empty),
> + * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
> + * is neither read nor updated.
> */
> -unsigned long
> -isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> - unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
> +static unsigned long
> +isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> + unsigned long end_pfn, isolate_mode_t isolate_mode)
> {
> - unsigned long last_pageblock_nr = 0, pageblock_nr;
> + struct zone *zone = cc->zone;
> unsigned long nr_scanned = 0, nr_isolated = 0;
> struct list_head *migratelist = &cc->migratepages;
> struct lruvec *lruvec;
> unsigned long flags;
> bool locked = false;
> struct page *page = NULL, *valid_page = NULL;
> - bool set_unsuitable = true;
> - const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
> - ISOLATE_ASYNC_MIGRATE : 0) |
> - (unevictable ? ISOLATE_UNEVICTABLE : 0);
>
> /*
> * Ensure that there are not too many pages isolated from the LRU
> @@ -515,19 +503,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> }
> }
>
> - /*
> - * migrate_pfn does not necessarily start aligned to a
> - * pageblock. Ensure that pfn_valid is called when moving
> - * into a new MAX_ORDER_NR_PAGES range in case of large
> - * memory holes within the zone
> - */
> - if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
> - if (!pfn_valid(low_pfn)) {
> - low_pfn += MAX_ORDER_NR_PAGES - 1;
> - continue;
> - }
> - }
> -
> if (!pfn_valid_within(low_pfn))
> continue;
> nr_scanned++;
> @@ -545,28 +520,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> if (!valid_page)
> valid_page = page;
>
> - /* If isolation recently failed, do not retry */
> - pageblock_nr = low_pfn >> pageblock_order;
> - if (last_pageblock_nr != pageblock_nr) {
> - int mt;
> -
> - last_pageblock_nr = pageblock_nr;
> - if (!isolation_suitable(cc, page))
> - goto next_pageblock;
> -
> - /*
> - * For async migration, also only scan in MOVABLE
> - * blocks. Async migration is optimistic to see if
> - * the minimum amount of work satisfies the allocation
> - */
> - mt = get_pageblock_migratetype(page);
> - if (cc->mode == MIGRATE_ASYNC &&
> - !migrate_async_suitable(mt)) {
> - set_unsuitable = false;
> - goto next_pageblock;
> - }
> - }
> -
> /*
> * Skip if free. page_order cannot be used without zone->lock
> * as nothing prevents parallel allocations or buddy merging.
> @@ -601,8 +554,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> */
> if (PageTransHuge(page)) {
> if (!locked)
> - goto next_pageblock;
> - low_pfn += (1 << compound_order(page)) - 1;
> + low_pfn = ALIGN(low_pfn + 1,
> + pageblock_nr_pages) - 1;
> + else
> + low_pfn += (1 << compound_order(page)) - 1;
> +

Hmm, any reason not to always advance and align low_pfn to
pageblock_nr_pages? I don't see how pageblock_order > HPAGE_PMD_ORDER
would make sense if encountering thp.
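For reference, a userspace sketch of the two advancement strategies being compared (the pageblock size and PFN values here are made up, and ALIGN mirrors the kernel macro):

```c
#include <assert.h>

#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

static const unsigned long pageblock_nr_pages = 512; /* 2MB blocks, 4K pages */

/* The !locked branch: skip to the last pfn of the current pageblock,
 * so the loop's low_pfn++ moves on to the next block. */
static unsigned long skip_rest_of_pageblock(unsigned long low_pfn)
{
	return ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
}

/* The locked branch: skip to the last pfn of a compound page. */
static unsigned long skip_compound(unsigned long low_pfn, unsigned int order)
{
	return low_pfn + (1UL << order) - 1;
}
```

Since a THP is naturally aligned to its order, the two land on the same pfn whenever the THP order equals pageblock_order (the common case); they only diverge if pageblock_order were larger, which is the configuration the question above doubts exists.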

> continue;
> }
>
> @@ -632,7 +588,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> lruvec = mem_cgroup_page_lruvec(page, zone);
>
> /* Try isolate the page */
> - if (__isolate_lru_page(page, mode) != 0)
> + if (__isolate_lru_page(page, isolate_mode) != 0)
> continue;
>
> VM_BUG_ON_PAGE(PageTransCompound(page), page);
> @@ -651,11 +607,6 @@ isolate_success:
> ++low_pfn;
> break;
> }
> -
> - continue;
> -
> -next_pageblock:
> - low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
> }
>
> acct_isolated(zone, locked, cc);
> @@ -668,8 +619,7 @@ next_pageblock:
> * if the whole pageblock was scanned without isolating any page.
> */
> if (low_pfn == end_pfn)
> - update_pageblock_skip(cc, valid_page, nr_isolated,
> - set_unsuitable, true);
> + update_pageblock_skip(cc, valid_page, nr_isolated, true);
>
> trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>
> @@ -680,15 +630,61 @@ next_pageblock:
> return low_pfn;
> }
>
> +/**
> + * isolate_migratepages_range() - isolate migrate-able pages in a PFN range
> + * @start_pfn: The first PFN to start isolating.
> + * @end_pfn: The one-past-last PFN.

Need to specify @cc?

> + *
> + * Returns zero if isolation fails fatally due to e.g. pending signal.
> + * Otherwise, function returns one-past-the-last PFN of isolated page
> + * (which may be greater than end_pfn if end fell in a middle of a THP page).
> + */
> +unsigned long
> +isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
> + unsigned long end_pfn)
> +{
> + unsigned long pfn, block_end_pfn;
> +
> + /* Scan block by block. First and last block may be incomplete */
> + pfn = start_pfn;
> + block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
> +
> + for (; pfn < end_pfn; pfn = block_end_pfn,
> + block_end_pfn += pageblock_nr_pages) {
> +
> + block_end_pfn = min(block_end_pfn, end_pfn);
> +
> + /* Skip whole pageblock in case of a memory hole */
> + if (!pfn_valid(pfn))
> + continue;
> +
> + pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
> + ISOLATE_UNEVICTABLE);
> +
> + /*
> + * In case of fatal failure, release everything that might
> + * have been isolated in the previous iteration, and signal
> + * the failure back to caller.
> + */
> + if (!pfn) {
> + putback_movable_pages(&cc->migratepages);
> + cc->nr_migratepages = 0;
> + break;
> + }
> + }
> +
> + return pfn;
> +}
> +
> #endif /* CONFIG_COMPACTION || CONFIG_CMA */
> #ifdef CONFIG_COMPACTION
> /*
> * Based on information in the current compact_control, find blocks
> * suitable for isolating free pages from and then isolate them.
> */
> -static void isolate_freepages(struct zone *zone,
> - struct compact_control *cc)
> +static void isolate_freepages(struct compact_control *cc)
> {
> + struct zone *zone = cc->zone;
> struct page *page;
> unsigned long block_start_pfn; /* start of current pageblock */
> unsigned long block_end_pfn; /* end of current pageblock */
> @@ -806,7 +802,7 @@ static struct page *compaction_alloc(struct page *migratepage,
> */
> if (list_empty(&cc->freepages)) {
> if (!cc->contended)
> - isolate_freepages(cc->zone, cc);
> + isolate_freepages(cc);
>
> if (list_empty(&cc->freepages))
> return NULL;
> @@ -840,34 +836,81 @@ typedef enum {
> } isolate_migrate_t;
>
> /*
> - * Isolate all pages that can be migrated from the block pointed to by
> - * the migrate scanner within compact_control.
> + * Isolate all pages that can be migrated from the first suitable block,
> + * starting at the block pointed to by the migrate scanner pfn within
> + * compact_control.
> */
> static isolate_migrate_t isolate_migratepages(struct zone *zone,
> struct compact_control *cc)
> {
> unsigned long low_pfn, end_pfn;
> + struct page *page;
> + const isolate_mode_t isolate_mode =
> + (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
>
> - /* Do not scan outside zone boundaries */
> - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
> + /*
> + * Start at where we last stopped, or beginning of the zone as
> + * initialized by compact_zone()
> + */
> + low_pfn = cc->migrate_pfn;
>
> /* Only scan within a pageblock boundary */
> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>
> - /* Do not cross the free scanner or scan within a memory hole */
> - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> - cc->migrate_pfn = end_pfn;
> - return ISOLATE_NONE;
> - }
> + /*
> + * Iterate over whole pageblocks until we find the first suitable.
> + * Do not cross the free scanner.
> + */
> + for (; end_pfn <= cc->free_pfn;
> + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
> +
> + /*
> + * This can potentially iterate a massively long zone with
> + * many pageblocks unsuitable, so periodically check if we
> + * need to schedule, or even abort async compaction.
> + */
> + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
> + && compact_should_abort(cc))
> + break;
> +
> + /* Skip whole pageblock in case of a memory hole */
> + if (!pfn_valid(low_pfn))
> + continue;
> +
> + page = pfn_to_page(low_pfn);
> +
> + /* If isolation recently failed, do not retry */
> + if (!isolation_suitable(cc, page))
> + continue;
> +
> + /*
> + * For async compaction, also only scan in MOVABLE blocks.
> + * Async compaction is optimistic to see if the minimum amount
> + * of work satisfies the allocation.
> + */
> + if (cc->mode == MIGRATE_ASYNC &&
> + !migrate_async_suitable(get_pageblock_migratetype(page)))
> + continue;
> +
> + /* Perform the isolation */
> + low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
> + isolate_mode);

Hmm, why would we want to unconditionally set pageblock_skip if no pages
could be isolated from a pageblock when
isolate_mode == ISOLATE_ASYNC_MIGRATE? It seems like it erroneously skips
pageblocks for cases when isolate_mode == 0.

> +
> + if (!low_pfn || cc->contended)
> + return ISOLATE_ABORT;
>
> - /* Perform the isolation */
> - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
> - if (!low_pfn || cc->contended)
> - return ISOLATE_ABORT;
> + /*
> + * Either we isolated something and proceed with migration. Or
> + * we failed and compact_zone should decide if we should
> + * continue or not.
> + */
> + break;
> + }
>
> + /* Record where migration scanner will be restarted */
> cc->migrate_pfn = low_pfn;
>
> - return ISOLATE_SUCCESS;
> + return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
> }
>
> static int compact_finished(struct zone *zone,
> @@ -1040,9 +1083,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> ;
> }
>
> - if (!cc->nr_migratepages)
> - continue;
> -
> err = migrate_pages(&cc->migratepages, compaction_alloc,
> compaction_free, (unsigned long)cc, cc->mode,
> MR_COMPACTION);
> diff --git a/mm/internal.h b/mm/internal.h
> index a1b651b..5a0738f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -154,8 +154,8 @@ unsigned long
> isolate_freepages_range(struct compact_control *cc,
> unsigned long start_pfn, unsigned long end_pfn);
> unsigned long
> -isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> - unsigned long low_pfn, unsigned long end_pfn, bool unevictable);
> +isolate_migratepages_range(struct compact_control *cc,
> + unsigned long low_pfn, unsigned long end_pfn);
>
> #endif
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91191fb..f424752 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6292,8 +6292,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>
> if (list_empty(&cc->migratepages)) {
> cc->nr_migratepages = 0;
> - pfn = isolate_migratepages_range(cc->zone, cc,
> - pfn, end, true);
> + pfn = isolate_migratepages_range(cc, pfn, end);
> if (!pfn) {
> ret = -EINTR;
> break;

2014-07-29 00:44:28

by David Rientjes

Subject: Re: [PATCH v5 06/14] mm, compaction: reduce zone checking frequency in the migration scanner

On Mon, 28 Jul 2014, Vlastimil Babka wrote:

> The unification of the migrate and free scanner families of function has
> highlighted a difference in how the scanners ensure they only isolate pages
> of the intended zone. This is important for taking zone lock or lru lock of
> the correct zone. Due to nodes overlapping, it is however possible to
> encounter a different zone within the range of the zone being compacted.
>
> The free scanner, since its inception by commit 748446bb6b5a ("mm: compaction:
> memory compaction core"), has been checking the zone of the first valid page
> in a pageblock, and skipping the whole pageblock if the zone does not match.
>
> This checking was completely missing from the migration scanner at first, and
> later added by commit dc9086004b3d ("mm: compaction: check for overlapping
> nodes during isolation for migration") in a reaction to a bug report.
> But the zone comparison in the migration scanner is done once per scanned
> page, which is more defensive and thus more costly than a check per pageblock.
>
> This patch unifies the checking done in both scanners to once per pageblock,
> through a new pageblock_within_zone() function, which also includes pfn_valid()
> checks. It is more defensive than the current free scanner checks, as it checks
> both the first and last page of the pageblock, but less defensive than the
> migration scanner's per-page checks. It assumes that node overlapping may result
> (on some architecture) in a boundary between two nodes falling into the middle
> of a pageblock, but that there cannot be a node0 node1 node0 interleaving
> within a single pageblock.
>
> The result is more code being shared and a bit less per-page CPU cost in the
> migration scanner.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: David Rientjes <[email protected]>

Minor comments below.

> ---
> mm/compaction.c | 91 ++++++++++++++++++++++++++++++++++++---------------------
> 1 file changed, 57 insertions(+), 34 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index bac6e37..76a9775 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -67,6 +67,49 @@ static inline bool migrate_async_suitable(int migratetype)
> return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
> }
>
> +/*
> + * Check that the whole (or subset of) a pageblock given by the interval of
> + * [start_pfn, end_pfn) is valid and within the same zone, before scanning it
> + * with the migration or free compaction scanner. The scanners then need to
> + * use only pfn_valid_within() check for arches that allow holes within
> + * pageblocks.
> + *
> + * Return struct page pointer of start_pfn, or NULL if checks were not passed.
> + *
> + * It's possible on some configurations to have a setup like node0 node1 node0
> + * i.e. it's possible that all pages within a zones range of pages do not
> + * belong to a single zone. We assume that a border between node0 and node1
> + * can occur within a single pageblock, but not a node0 node1 node0
> + * interleaving within a single pageblock. It is therefore sufficient to check
> + * the first and last page of a pageblock and avoid checking each individual
> + * page in a pageblock.
> + */
> +static struct page *pageblock_within_zone(unsigned long start_pfn,
> + unsigned long end_pfn, struct zone *zone)

The name of this function is quite strange, it's returning a pointer to
the actual start page but the name implies it would be a boolean.

> +{
> + struct page *start_page;
> + struct page *end_page;
> +
> + /* end_pfn is one past the range we are checking */
> + end_pfn--;
> +

With the given implementation, yes, but I'm not sure if that should be
assumed for any class of callers. It seems better to call with
end_pfn - 1.

> + if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
> + return NULL;
> +

Ok, so even with this check, we still need to check pfn_valid_within() for
all pfns between start_pfn and end_pfn if there are memory holes. I
checked that both the migration and freeing scanners do that before
reading your comment above the function, looks good.
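A hypothetical userspace model of that first-and-last-page rule may make the assumption concrete (the toy zone map, this pfn_valid() and the boolean return are all invented for illustration; the real function returns a struct page pointer):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy memory map: zone id per pfn. Models a node0|node1 boundary
 * falling in the middle of one 8-page "pageblock". */
static const int zone_of_pfn[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };

static bool pfn_valid(unsigned long pfn)
{
	return pfn < 8;
}

/* Accept [start_pfn, end_pfn) only if both boundary pages are valid
 * and belong to the requested zone. Per-page checks are skipped on
 * the assumption that zones do not interleave within a pageblock. */
static bool pageblock_within_zone(unsigned long start_pfn,
				  unsigned long end_pfn, int zone)
{
	unsigned long last_pfn = end_pfn - 1; /* end_pfn is one past */

	if (!pfn_valid(start_pfn) || !pfn_valid(last_pfn))
		return false;
	return zone_of_pfn[start_pfn] == zone &&
	       zone_of_pfn[last_pfn] == zone;
}
```

With this map, a block that straddles the node boundary fails for either zone while the halves on each side pass, which is exactly the behavior the commit message describes.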

2014-07-29 00:59:31

by David Rientjes

Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On Mon, 28 Jul 2014, Vlastimil Babka wrote:

> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index b2e4c92..60bdf8d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -13,6 +13,14 @@
> /* The full zone was compacted */
> #define COMPACT_COMPLETE 4
>
> +/* Used to signal whether compaction detected need_sched() or lock contention */
> +/* No contention detected */
> +#define COMPACT_CONTENDED_NONE 0
> +/* Either need_sched() was true or fatal signal pending */
> +#define COMPACT_CONTENDED_SCHED 1
> +/* Zone lock or lru_lock was contended in async compaction */
> +#define COMPACT_CONTENDED_LOCK 2
> +

Make this an enum?
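As a sketch, the enum form being suggested (same names, and the implicit values match the #defines in the patch, so callers storing the value in an int are unaffected):

```c
#include <assert.h>

/* Contention states signalled back from compaction, as an enum.
 * Implicit enumeration yields the same 0, 1, 2 as the #defines. */
enum compact_contended {
	COMPACT_CONTENDED_NONE,	 /* no contention detected */
	COMPACT_CONTENDED_SCHED, /* need_resched() or fatal signal pending */
	COMPACT_CONTENDED_LOCK,	 /* zone lock or lru_lock contended */
};
```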

> #ifdef CONFIG_COMPACTION
> extern int sysctl_compact_memory;
> extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> @@ -24,7 +32,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> - enum migrate_mode mode, bool *contended,
> + enum migrate_mode mode, int *contended,
> struct zone **candidate_zone);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> @@ -94,7 +102,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> #else
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended,
> + enum migrate_mode mode, int *contended,
> struct zone **candidate_zone)
> {
> return COMPACT_CONTINUE;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 76a9775..2b8b6d8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -223,9 +223,21 @@ static void update_pageblock_skip(struct compact_control *cc,
> }
> #endif /* CONFIG_COMPACTION */
>
> -static inline bool should_release_lock(spinlock_t *lock)
> +static int should_release_lock(spinlock_t *lock)
> {
> - return need_resched() || spin_is_contended(lock);
> + /*
> + * Sched contention has higher priority here as we may potentially
> + * have to abort whole compaction ASAP. Returning with lock contention
> + * means we will try another zone, and further decisions are
> + * influenced only when all zones are lock contended. That means
> + * potentially missing a lock contention is less critical.
> + */
> + if (need_resched())
> + return COMPACT_CONTENDED_SCHED;
> + else if (spin_is_contended(lock))
> + return COMPACT_CONTENDED_LOCK;
> + else
> + return COMPACT_CONTENDED_NONE;

I would avoid the last else statement and just return
COMPACT_CONTENDED_NONE.

> }
>
> /*
> @@ -240,7 +252,9 @@ static inline bool should_release_lock(spinlock_t *lock)
> static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> bool locked, struct compact_control *cc)
> {
> - if (should_release_lock(lock)) {
> + int contended = should_release_lock(lock);
> +
> + if (contended) {
> if (locked) {
> spin_unlock_irqrestore(lock, *flags);
> locked = false;
> @@ -248,7 +262,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>
> /* async aborts if taking too long or contended */
> if (cc->mode == MIGRATE_ASYNC) {
> - cc->contended = true;
> + cc->contended = contended;
> return false;
> }
>
> @@ -274,7 +288,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
> /* async compaction aborts if contended */
> if (need_resched()) {
> if (cc->mode == MIGRATE_ASYNC) {
> - cc->contended = true;
> + cc->contended = COMPACT_CONTENDED_SCHED;
> return true;
> }
>
> @@ -1139,7 +1153,7 @@ out:
> }
>
> static unsigned long compact_zone_order(struct zone *zone, int order,
> - gfp_t gfp_mask, enum migrate_mode mode, bool *contended)
> + gfp_t gfp_mask, enum migrate_mode mode, int *contended)
> {
> unsigned long ret;
> struct compact_control cc = {
> @@ -1154,11 +1168,11 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> INIT_LIST_HEAD(&cc.migratepages);
>
> ret = compact_zone(zone, &cc);
> + *contended = cc.contended;
>
> VM_BUG_ON(!list_empty(&cc.freepages));
> VM_BUG_ON(!list_empty(&cc.migratepages));
>
> - *contended = cc.contended;

Not sure why, but ok :)

> return ret;
> }
>
> @@ -1171,14 +1185,15 @@ int sysctl_extfrag_threshold = 500;
> * @gfp_mask: The GFP mask of the current allocation
> * @nodemask: The allowed nodes to allocate from
> * @mode: The migration mode for async, sync light, or sync migration
> - * @contended: Return value that is true if compaction was aborted due to lock contention
> + * @contended: Return value that determines if compaction was aborted due to
> + * need_resched() or lock contention
> * @candidate_zone: Return the zone where we think allocation should succeed
> *
> * This is the main entry point for direct page compaction.
> */
> unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended,
> + enum migrate_mode mode, int *contended,
> struct zone **candidate_zone)
> {
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> @@ -1188,6 +1203,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> struct zone *zone;
> int rc = COMPACT_DEFERRED;
> int alloc_flags = 0;
> + bool all_zones_lock_contended = true; /* init true for &= operation */
> +
> + *contended = COMPACT_CONTENDED_NONE;
>
> /* Check if the GFP flags allow compaction */
> if (!order || !may_enter_fs || !may_perform_io)
> @@ -1201,13 +1219,20 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> nodemask) {
> int status;
> + int zone_contended;
>
> if (compaction_deferred(zone, order))
> continue;
>
> status = compact_zone_order(zone, order, gfp_mask, mode,
> - contended);
> + &zone_contended);
> rc = max(status, rc);
> + /*
> + * It takes at least one zone that wasn't lock contended
> + * to turn all_zones_lock_contended to false.
> + */
> + all_zones_lock_contended &=
> + (zone_contended == COMPACT_CONTENDED_LOCK);

Eek, does this always work? COMPACT_CONTENDED_LOCK is 0x2 and
all_zones_lock_contended is a bool initialized to true. I'm fairly
certain you'd get better code generation if you defined
all_zones_lock_contended as

int all_zones_lock_contended = COMPACT_CONTENDED_LOCK

from the start and the source code would be more clear.
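On the "does this always work?" question: it does, because the == comparison evaluates to 0 or 1 before the &= ever sees it, so the 0x2 value of COMPACT_CONTENDED_LOCK never reaches the bool. A standalone sketch of the accumulation (the array stands in for the per-zone loop results):

```c
#include <assert.h>
#include <stdbool.h>

#define COMPACT_CONTENDED_NONE	0
#define COMPACT_CONTENDED_SCHED	1
#define COMPACT_CONTENDED_LOCK	2

/* Fold per-zone contention results the way the patch does. The ==
 * yields 0 or 1, so the &= on a bool is well defined even though
 * COMPACT_CONTENDED_LOCK itself is 0x2. */
static bool all_zones_lock_contended(const int *zone_contended, int nr)
{
	bool all = true; /* init true for &= operation */
	int i;

	for (i = 0; i < nr; i++)
		all &= (zone_contended[i] == COMPACT_CONTENDED_LOCK);
	return all;
}
```

The code-generation and readability points stand regardless; this only shows the existing form is not incorrect.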

>
> /* If a normal allocation would succeed, stop compacting */
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> @@ -1220,8 +1245,20 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> * succeeds in this zone.
> */
> compaction_defer_reset(zone, order, false);
> - break;
> - } else if (mode != MIGRATE_ASYNC) {
> + /*
> + * It is possible that async compaction aborted due to
> + * need_resched() and the watermarks were ok thanks to
> + * somebody else freeing memory. The allocation can
> + * however still fail so we better signal the
> + * need_resched() contention anyway.
> + */

Makes sense because the page allocator still tries to allocate the page
even if this is returned, but it's not very clear in the comment.

> + if (zone_contended == COMPACT_CONTENDED_SCHED)
> + *contended = COMPACT_CONTENDED_SCHED;
> +
> + goto break_loop;
> + }
> +
> + if (mode != MIGRATE_ASYNC) {
> /*
> * We think that allocation won't succeed in this zone
> * so we defer compaction there. If it ends up
> @@ -1229,8 +1266,36 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> */
> defer_compaction(zone, order);
> }
> +
> + /*
> + * We might have stopped compacting due to need_resched() in
> + * async compaction, or due to a fatal signal detected. In that
> + * case do not try further zones and signal need_resched()
> + * contention.
> + */
> + if ((zone_contended == COMPACT_CONTENDED_SCHED)
> + || fatal_signal_pending(current)) {
> + *contended = COMPACT_CONTENDED_SCHED;
> + goto break_loop;
> + }
> +
> + continue;
> +break_loop:
> + /*
> + * We might not have tried all the zones, so be conservative
> + * and assume they are not all lock contended.
> + */
> + all_zones_lock_contended = false;
> + break;
> }
>
> + /*
> + * If at least one zone wasn't deferred or skipped, we report if all
> + * zones that were tried were contended.
> + */
> + if (rc > COMPACT_SKIPPED && all_zones_lock_contended)
> + *contended = COMPACT_CONTENDED_LOCK;
> +
> return rc;
> }
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 5a0738f..4c1d604 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -144,8 +144,8 @@ struct compact_control {
> int order; /* order a direct compactor needs */
> int migratetype; /* MOVABLE, RECLAIMABLE etc */
> struct zone *zone;
> - bool contended; /* True if a lock was contended, or
> - * need_resched() true during async
> + int contended; /* Signal need_sched() or lock
> + * contention detected during
> * compaction
> */
> };
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f424752..e3c633b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2296,7 +2296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> struct zonelist *zonelist, enum zone_type high_zoneidx,
> nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
> int classzone_idx, int migratetype, enum migrate_mode mode,
> - bool *contended_compaction, bool *deferred_compaction,
> + int *contended_compaction, bool *deferred_compaction,
> unsigned long *did_some_progress)
> {
> struct zone *last_compact_zone = NULL;
> @@ -2547,7 +2547,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> unsigned long did_some_progress;
> enum migrate_mode migration_mode = MIGRATE_ASYNC;
> bool deferred_compaction = false;
> - bool contended_compaction = false;
> + int contended_compaction = COMPACT_CONTENDED_NONE;
>
> /*
> * In the slowpath, we sanity check order to avoid ever trying to
> @@ -2660,15 +2660,36 @@ rebalance:
> if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
> migration_mode = MIGRATE_SYNC_LIGHT;
>
> - /*
> - * If compaction is deferred for high-order allocations, it is because
> - * sync compaction recently failed. In this is the case and the caller
> - * requested a movable allocation that does not heavily disrupt the
> - * system then fail the allocation instead of entering direct reclaim.
> - */
> - if ((deferred_compaction || contended_compaction) &&
> - (gfp_mask & __GFP_NO_KSWAPD))
> - goto nopage;

Hmm, this check will have unfortunately changed in the latest mmotm due to
mm-thp-restructure-thp-avoidance-of-light-synchronous-migration.patch.

> + /* Checks for THP-specific high-order allocations */
> + if (gfp_mask & __GFP_NO_KSWAPD) {
> + /*
> + * If compaction is deferred for high-order allocations, it is
> + * because sync compaction recently failed. If this is the case
> + * and the caller requested a THP allocation, we do not want
> + * to heavily disrupt the system, so we fail the allocation
> + * instead of entering direct reclaim.
> + */
> + if (deferred_compaction)
> + goto nopage;
> +
> + /*
> + * In all zones where compaction was attempted (and not
> + * deferred or skipped), lock contention has been detected.
> + * For THP allocation we do not want to disrupt the others
> + * so we fallback to base pages instead.
> + */
> + if (contended_compaction == COMPACT_CONTENDED_LOCK)
> + goto nopage;
> +
> + /*
> + * If compaction was aborted due to need_resched(), we do not
> + * want to further increase allocation latency, unless it is
> + * khugepaged trying to collapse.
> + */
> + if (contended_compaction == COMPACT_CONTENDED_SCHED
> + && !(current->flags & PF_KTHREAD))
> + goto nopage;
> + }
>
> /* Try direct reclaim and then allocating */
> page = __alloc_pages_direct_reclaim(gfp_mask, order,

2014-07-29 01:03:31

by David Rientjes

Subject: Re: [PATCH v5 08/14] mm, compaction: periodically drop lock and restore IRQs in scanners

On Mon, 28 Jul 2014, Vlastimil Babka wrote:

> Compaction scanners regularly check for lock contention and need_resched()
> through the compact_checklock_irqsave() function. However, if there is no
> contention, the lock can be held and IRQ disabled for potentially long time.
>
> This has been addressed by commit b2eef8c0d091 ("mm: compaction: minimise the
> time IRQs are disabled while isolating pages for migration") for the migration
> scanner. However, the refactoring done by commit 2a1402aa044b ("mm: compaction:
> acquire the zone->lru_lock as late as possible") has changed the conditions so
> that the lock is dropped only when there's contention on the lock or
> need_resched() is true. Also, need_resched() is checked only when the lock is
> already held. The comment "give a chance to irqs before checking need_resched"
> is therefore misleading, as IRQs remain disabled when the check is done.
>
> This patch restores the behavior intended by commit b2eef8c0d091 and also tries
> to better balance and make more deterministic the time spent by checking for
> contention vs the time the scanners might run between the checks. It also
> avoids situations where checking has not been done often enough before. The
> result should be avoiding both too frequent and too infrequent contention
> checking, and especially the potentially long-running scans with IRQs disabled
> and no checking of need_resched() or for fatal signal pending, which can happen
> when many consecutive pages or pageblocks fail the preliminary tests and do not
> reach the later call site to compact_checklock_irqsave(), as explained below.
>
> Before the patch:
>
> In the migration scanner, compact_checklock_irqsave() was called each loop, if
> reached. If not reached, some lower-frequency checking could still be done if
> the lock was already held, but this would not result in aborting contended
> async compaction until reaching compact_checklock_irqsave() or end of
> pageblock. In the free scanner, it was similar but completely without the
> periodical checking, so lock can be potentially held until reaching the end of
> pageblock.
>
> After the patch, in both scanners:
>
> The periodical check is done as the first thing in the loop on each
> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
> function, which always unlocks the lock (if locked) and aborts async compaction
> if scheduling is needed. It also aborts any type of compaction when a fatal
> signal is pending.
>
> The compact_checklock_irqsave() function is replaced with a slightly different
> compact_trylock_irqsave(). The biggest difference is that the function is not
> called at all if the lock is already held. The periodical need_resched()
> checking is left solely to compact_unlock_should_abort(). The lock contention
> avoidance for async compaction is achieved by the periodical unlock by
> compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave()
> and aborting when trylock fails. Sync compaction does not use trylock.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Zhang Yanfei <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: David Rientjes <[email protected]>

Minor comment below.

> ---
> mm/compaction.c | 121 ++++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 73 insertions(+), 48 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 2b8b6d8..1756ed8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -223,61 +223,72 @@ static void update_pageblock_skip(struct compact_control *cc,
> }
> #endif /* CONFIG_COMPACTION */
>
> -static int should_release_lock(spinlock_t *lock)
> +/*
> + * Compaction requires the taking of some coarse locks that are potentially
> + * very heavily contended. For async compaction, back out if the lock cannot
> + * be taken immediately. For sync compaction, spin on the lock if needed.
> + *
> + * Returns true if the lock is held
> + * Returns false if the lock is not held and compaction should abort
> + */
> +static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
> + struct compact_control *cc)
> {
> - /*
> - * Sched contention has higher priority here as we may potentially
> - * have to abort whole compaction ASAP. Returning with lock contention
> - * means we will try another zone, and further decisions are
> - * influenced only when all zones are lock contended. That means
> - * potentially missing a lock contention is less critical.
> - */
> - if (need_resched())
> - return COMPACT_CONTENDED_SCHED;
> - else if (spin_is_contended(lock))
> - return COMPACT_CONTENDED_LOCK;
> - else
> - return COMPACT_CONTENDED_NONE;
> + if (cc->mode == MIGRATE_ASYNC) {
> + if (!spin_trylock_irqsave(lock, *flags)) {
> + cc->contended = COMPACT_CONTENDED_LOCK;
> + return false;
> + }
> + } else {
> + spin_lock_irqsave(lock, *flags);
> + }
> +
> + return true;
> }
>
> /*
> * Compaction requires the taking of some coarse locks that are potentially
> - * very heavily contended. Check if the process needs to be scheduled or
> - * if the lock is contended. For async compaction, back out in the event
> - * if contention is severe. For sync compaction, schedule.
> + * very heavily contended. The lock should be periodically unlocked to avoid
> + * having disabled IRQs for a long time, even when there is nobody waiting on
> + * the lock. It might also be that allowing the IRQs will result in
> + * need_resched() becoming true. If scheduling is needed, async compaction
> + * aborts. Sync compaction schedules.
> + * Either compaction type will also abort if a fatal signal is pending.
> + * In either case if the lock was locked, it is dropped and not regained.
> *
> - * Returns true if the lock is held.
> - * Returns false if the lock is released and compaction should abort
> + * Returns true if compaction should abort due to fatal signal pending, or
> + * async compaction due to need_resched()
> + * Returns false when compaction can continue (sync compaction might have
> + * scheduled)
> */
> -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> - bool locked, struct compact_control *cc)
> +static bool compact_unlock_should_abort(spinlock_t *lock,
> + unsigned long flags, bool *locked, struct compact_control *cc)
> {
> - int contended = should_release_lock(lock);
> + if (*locked) {
> + spin_unlock_irqrestore(lock, flags);
> + *locked = false;
> + }
>
> - if (contended) {
> - if (locked) {
> - spin_unlock_irqrestore(lock, *flags);
> - locked = false;
> - }
> + if (fatal_signal_pending(current)) {
> + cc->contended = COMPACT_CONTENDED_SCHED;
> + return true;
> + }
>
> - /* async aborts if taking too long or contended */
> + if (need_resched()) {
> if (cc->mode == MIGRATE_ASYNC) {
> - cc->contended = contended;
> - return false;
> + cc->contended = COMPACT_CONTENDED_SCHED;
> + return true;
> }
> -
> cond_resched();
> }
>
> - if (!locked)
> - spin_lock_irqsave(lock, *flags);
> - return true;
> + return false;
> }
>
> /*
> * Aside from avoiding lock contention, compaction also periodically checks
> * need_resched() and either schedules in sync compaction or aborts async
> - * compaction. This is similar to what compact_checklock_irqsave() does, but
> + * compaction. This is similar to what compact_unlock_should_abort() does, but
> * is used where no lock is concerned.
> *
> * Returns false when no scheduling was needed, or sync compaction scheduled.
> @@ -336,6 +347,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> int isolated, i;
> struct page *page = cursor;
>
> + /*
> + * Periodically drop the lock (if held) regardless of its
> + * contention, to give chance to IRQs. Abort async compaction
> + * if contended.
> + */

I think this comment is out of date since compact_unlock_should_abort() no
longer returns true if spin_is_contended(), which is good.

> + if (!(blockpfn % SWAP_CLUSTER_MAX)
> + && compact_unlock_should_abort(&cc->zone->lock, flags,
> + &locked, cc))
> + break;
> +
> nr_scanned++;
> if (!pfn_valid_within(blockpfn))
> goto isolate_fail;
> @@ -353,8 +374,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> * spin on the lock and we acquire the lock as late as
> * possible.
> */
> - locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> - locked, cc);
> + if (!locked)
> + locked = compact_trylock_irqsave(&cc->zone->lock,
> + &flags, cc);
> if (!locked)
> break;
>
> @@ -552,13 +574,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>
> /* Time to isolate some pages for migration */
> for (; low_pfn < end_pfn; low_pfn++) {
> - /* give a chance to irqs before checking need_resched() */
> - if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
> - if (should_release_lock(&zone->lru_lock)) {
> - spin_unlock_irqrestore(&zone->lru_lock, flags);
> - locked = false;
> - }
> - }
> + /*
> + * Periodically drop the lock (if held) regardless of its
> + * contention, to give chance to IRQs. Abort async compaction
> + * if contended.
> + */
> + if (!(low_pfn % SWAP_CLUSTER_MAX)
> + && compact_unlock_should_abort(&zone->lru_lock, flags,
> + &locked, cc))
> + break;
>
> if (!pfn_valid_within(low_pfn))
> continue;
> @@ -620,10 +644,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> page_count(page) > page_mapcount(page))
> continue;
>
> - /* Check if it is ok to still hold the lock */
> - locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> - locked, cc);
> - if (!locked || fatal_signal_pending(current))
> + /* If the lock is not held, try to take it */
> + if (!locked)
> + locked = compact_trylock_irqsave(&zone->lru_lock,
> + &flags, cc);
> + if (!locked)
> break;
>
> /* Recheck PageLRU and PageTransHuge under lock */

2014-07-29 01:05:38

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 11/14] mm, compaction: skip buddy pages by their order in the migrate scanner

On Mon, 28 Jul 2014, Vlastimil Babka wrote:

> The migration scanner skips PageBuddy pages, but does not consider their order
> as checking page_order() is generally unsafe without holding the zone->lock,
> and acquiring the lock just for the check wouldn't be a good tradeoff.
>
> Still, this could avoid some iterations over the rest of the buddy page, and
> if we are careful, the race window between PageBuddy() check and page_order()
> is small, and the worst thing that can happen is that we skip too much and miss
> some isolation candidates. This is not that bad, as compaction can already fail
> for many other reasons like parallel allocations, and those have a much larger
> race window.
>
> This patch therefore makes the migration scanner obtain the buddy page order
> and use it to skip the whole buddy page, if the order appears to be in the
> valid range.
>
> It's important that the page_order() is read only once, so that the value used
> in the checks and in the pfn calculation is the same. But in theory the
> compiler can replace the local variable by multiple inlines of page_order().
> Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
> prevent this.
>
> Testing with stress-highalloc from mmtests shows a 15% reduction in number of
> pages scanned by migration scanner. The reduction is >60% with __GFP_NO_KSWAPD
> allocations, along with success rates better by a few percent.
> This change is also a prerequisite for a later patch which is detecting when
> a cc->order block of pages contains non-buddy pages that cannot be isolated,
> and the scanner should thus skip to the next block immediately.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Zhang Yanfei <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: David Rientjes <[email protected]>

Seems I'm overruled in the definition of page_order_unsafe(). Oh well, you
have more than one caller so I guess it makes sense.

2014-07-29 06:31:54

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v5 02/14] mm, compaction: defer each zone individually instead of preferred zone

On Mon, Jul 28, 2014 at 03:11:29PM +0200, Vlastimil Babka wrote:
> When direct sync compaction is often unsuccessful, it may become deferred for
> some time to avoid further useless attempts, both sync and async. Successful
> high-order allocations un-defer compaction, while further unsuccessful
> compaction attempts prolong the compaction deferral period.
>
> Currently the checking and setting deferred status is performed only on the
> preferred zone of the allocation that invoked direct compaction. But compaction
> itself is attempted on all eligible zones in the zonelist, so the behavior is
> suboptimal and may lead to scenarios where 1) compaction is attempted
> uselessly, or 2) it's not attempted despite good chances of succeeding,
> as shown on the examples below:
>
> 1) A direct compaction with Normal preferred zone failed and set deferred
> compaction for the Normal zone. Another unrelated direct compaction with
> DMA32 as preferred zone will attempt to compact DMA32 zone even though
> the first compaction attempt also included DMA32 zone.
>
> In another scenario, compaction with Normal preferred zone failed to compact
> Normal zone, but succeeded in the DMA32 zone, so it will not defer
> compaction. In the next attempt, it will try Normal zone which will fail
> again, instead of skipping Normal zone and trying DMA32 directly.
>
> 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
> looking good. A direct compaction with preferred Normal zone will skip
> compaction of all zones including DMA32 because Normal was still deferred.
> The allocation might have succeeded in DMA32, but won't.
>
> This patch makes compaction deferring work on individual zone basis instead of
> preferred zone. For each zone, it checks compaction_deferred() to decide if the
> zone should be skipped. If watermarks fail after compacting the zone,
> defer_compaction() is called. The zone where watermarks passed can still be
> deferred when the allocation attempt is unsuccessful. When allocation is
> successful, compaction_defer_reset() is called for the zone containing the
> allocated page. This approach should approximate calling defer_compaction()
> only on zones where compaction was attempted and did not yield allocated page.
> There might be corner cases but that is inevitable as long as the decision
> to stop compacting does not guarantee that a page will be allocated.
>
> During testing on a two-node machine with a single very small Normal zone on
> node 1, this patch has improved success rates in stress-highalloc mmtests
> benchmark. The success rates here were previously made worse by commit 3a025760fc15
> ("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was
> no longer resetting often enough the deferred compaction for the Normal zone,
> and DMA32 zones on both nodes were thus not considered for compaction.
> On a different machine, success rates were improved with __GFP_NO_KSWAPD
> allocations.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Acked-by: Minchan Kim <[email protected]>
> Reviewed-by: Zhang Yanfei <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> include/linux/compaction.h | 16 ++++++++++------
> mm/compaction.c | 30 ++++++++++++++++++++++++------
> mm/page_alloc.c | 39 +++++++++++++++++++++++----------------
> 3 files changed, 57 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 01e3132..b2e4c92 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -2,14 +2,16 @@
> #define _LINUX_COMPACTION_H
>
> /* Return values for compact_zone() and try_to_compact_pages() */
> +/* compaction didn't start as it was deferred due to past failures */
> +#define COMPACT_DEFERRED 0
> /* compaction didn't start as it was not possible or direct reclaim was more suitable */
> -#define COMPACT_SKIPPED 0
> +#define COMPACT_SKIPPED 1

Hello,

This change makes some users of compaction_suitable() fail
unintentionally, because they assume that COMPACT_SKIPPED is 0.
Please update them to match this change.

> /* compaction should continue to another pageblock */
> -#define COMPACT_CONTINUE 1
> +#define COMPACT_CONTINUE 2
> /* direct compaction partially compacted a zone and there are suitable pages */
> -#define COMPACT_PARTIAL 2
> +#define COMPACT_PARTIAL 3
> /* The full zone was compacted */
> -#define COMPACT_COMPLETE 3
> +#define COMPACT_COMPLETE 4
>
> #ifdef CONFIG_COMPACTION
> extern int sysctl_compact_memory;
> @@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> - enum migrate_mode mode, bool *contended);
> + enum migrate_mode mode, bool *contended,
> + struct zone **candidate_zone);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern unsigned long compaction_suitable(struct zone *zone, int order);
> @@ -91,7 +94,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> #else
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended,
> + struct zone **candidate_zone)
> {
> return COMPACT_CONTINUE;
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5175019..f3ae2ec 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1122,28 +1122,27 @@ int sysctl_extfrag_threshold = 500;
> * @nodemask: The allowed nodes to allocate from
> * @mode: The migration mode for async, sync light, or sync migration
> * @contended: Return value that is true if compaction was aborted due to lock contention
> - * @page: Optionally capture a free page of the requested order during compaction
> + * @candidate_zone: Return the zone where we think allocation should succeed
> *
> * This is the main entry point for direct page compaction.
> */
> unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended,
> + struct zone **candidate_zone)
> {
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> int may_enter_fs = gfp_mask & __GFP_FS;
> int may_perform_io = gfp_mask & __GFP_IO;
> struct zoneref *z;
> struct zone *zone;
> - int rc = COMPACT_SKIPPED;
> + int rc = COMPACT_DEFERRED;
> int alloc_flags = 0;
>
> /* Check if the GFP flags allow compaction */
> if (!order || !may_enter_fs || !may_perform_io)
> return rc;
>
> - count_compact_event(COMPACTSTALL);
> -
> #ifdef CONFIG_CMA
> if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> @@ -1153,14 +1152,33 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> nodemask) {
> int status;
>
> + if (compaction_deferred(zone, order))
> + continue;
> +
> status = compact_zone_order(zone, order, gfp_mask, mode,
> contended);
> rc = max(status, rc);
>
> /* If a normal allocation would succeed, stop compacting */
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> - alloc_flags))
> + alloc_flags)) {
> + *candidate_zone = zone;
> + /*
> + * We think the allocation will succeed in this zone,
> + * but it is not certain, hence the false. The caller
> + * will repeat this with true if allocation indeed
> + * succeeds in this zone.
> + */
> + compaction_defer_reset(zone, order, false);
> break;
> + } else if (mode != MIGRATE_ASYNC) {
> + /*
> + * We think that allocation won't succeed in this zone
> + * so we defer compaction there. If it ends up
> + * succeeding after all, it will be reset.
> + */
> + defer_compaction(zone, order);
> + }
> }
>
> return rc;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b99643d4..a14efeb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2299,21 +2299,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> bool *contended_compaction, bool *deferred_compaction,
> unsigned long *did_some_progress)
> {
> - if (!order)
> - return NULL;
> + struct zone *last_compact_zone = NULL;
>
> - if (compaction_deferred(preferred_zone, order)) {
> - *deferred_compaction = true;
> + if (!order)
> return NULL;
> - }
>
> current->flags |= PF_MEMALLOC;
> *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
> nodemask, mode,
> - contended_compaction);
> + contended_compaction,
> + &last_compact_zone);
> current->flags &= ~PF_MEMALLOC;
>
> - if (*did_some_progress != COMPACT_SKIPPED) {
> + if (*did_some_progress > COMPACT_DEFERRED)
> + count_vm_event(COMPACTSTALL);
> + else
> + *deferred_compaction = true;
> +
> + if (*did_some_progress > COMPACT_SKIPPED) {
> struct page *page;
>
> /* Page migration frees to the PCP lists but we want merging */
> @@ -2324,27 +2327,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> order, zonelist, high_zoneidx,
> alloc_flags & ~ALLOC_NO_WATERMARKS,
> preferred_zone, classzone_idx, migratetype);
> +
> if (page) {
> - preferred_zone->compact_blockskip_flush = false;
> - compaction_defer_reset(preferred_zone, order, true);
> + struct zone *zone = page_zone(page);
> +
> + zone->compact_blockskip_flush = false;
> + compaction_defer_reset(zone, order, true);
> count_vm_event(COMPACTSUCCESS);
> return page;
> }
>
> /*
> + * last_compact_zone is where try_to_compact_pages thought
> + * allocation should succeed, so it did not defer compaction.
> + * But now we know that it didn't succeed, so we do the defer.
> + */
> + if (last_compact_zone && mode != MIGRATE_ASYNC)
> + defer_compaction(last_compact_zone, order);

I still don't understand why defer_compaction() is needed here.
defer_compaction() is intended to avoid struggling with compaction on a
zone where we have already tried compacting and found it unsuitable. An
allocation failure doesn't tell us that we have tried compaction over the
whole zone range, so we shouldn't carelessly decide here to defer
compaction on this zone.

Thanks.

2014-07-29 06:46:39

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On Mon, Jul 28, 2014 at 03:11:34PM +0200, Vlastimil Babka wrote:
> Async compaction aborts when it detects zone lock contention or need_resched()
> is true. David Rientjes has reported that in practice, most direct async
> compactions for THP allocation abort due to need_resched(). This means that a
> second direct compaction is never attempted, which might be OK for a page
> fault, but khugepaged is intended to attempt a sync compaction in such case and
> in these cases it won't.

I have a silly question here.
Why is need_resched() the criterion for stopping async compaction?
need_resched() is flagged when the time slice runs out, among other
reasons. It means that we stop async compaction at an arbitrary time,
because the process can be in the compaction code at an arbitrary moment.
I don't think that is reasonable, and it doesn't guarantee anything.
Instead of this approach, how about doing async compaction on a certain
number of pageblocks?

Thanks.

2014-07-29 07:28:14

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v5 14/14] mm, compaction: try to capture the just-created high-order freepage

On Mon, Jul 28, 2014 at 03:11:41PM +0200, Vlastimil Babka wrote:
> Compaction uses watermark checking to determine if it succeeded in creating
> a high-order free page. My testing has shown that this is quite racy and it
> can happen that watermark checking in compaction succeeds, and moments later
> the watermark checking in page allocation fails, even though the number of
> free pages has increased meanwhile.
>
> It should be more reliable if direct compaction captured the high-order free
> page as soon as it detects it, and pass it back to allocation. This would
> also reduce the window for somebody else to allocate the free page.
>
> Capture has been implemented before by 1fb3f8ca0e92 ("mm: compaction: capture
> a suitable high-order page immediately when it is made available"), but later
> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
> high-order page") due to a bug.
>
> This patch differs from the previous attempt in two aspects:
>
> 1) The previous patch scanned free lists to capture the page. In this patch,
> only the cc->order aligned block that the migration scanner just finished
> is considered, but only if pages were actually isolated for migration in
> that block. Tracking cc->order aligned blocks also has benefits for the
> following patch that skips blocks where non-migratable pages were found.
>
> 2) The operations done in buffered_rmqueue() and get_page_from_freelist() are
> closely followed so that page capture mimics normal page allocation as much
> as possible. This includes operations such as prep_new_page() and
> page->pfmemalloc setting (that was missing in the previous attempt), zone
> statistics are updated etc. Due to subtleties with IRQ disabling and
> enabling this cannot be simply factored out from the normal allocation
> functions without affecting the fastpath.

I haven't looked at it in detail, but it looks heavily duplicated and hard
to maintain. From my experience, this is really error-prone. Consider the
freepage counting bugs reported by my recent patchset: freepage counting
is handled at different places for performance reasons, and bugs ended up
there. IMHO, making a common function and using it is better than this
approach, even if we touch the fastpath.

>
> This patch has tripled compaction success rates (as recorded in vmstat) in
> stress-highalloc mmtests benchmark, although allocation success rates increased
> only by a few percent. Closer inspection shows that due to the racy watermark
> checking and lack of lru_add_drain(), the allocations that resulted in direct
> compactions were often failing, but later allocations succeeded in the fast
> path. So the benefit of the patch to allocation success rates may be limited,
> but it improves the fairness in the sense that whoever spent the time
> compacting has a higher chance of benefiting from it, and also can stop
> compacting sooner, as page availability is detected immediately. With better
> success detection, the contribution of compaction to high-order allocation
> success rates is also no longer understated by the vmstats.

Could you separate this patch from this patchset?
I think this patch hasn't received as much review from other developers
as the other patches.

>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Michal Nazarewicz <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> include/linux/compaction.h | 8 ++-
> mm/compaction.c | 134 +++++++++++++++++++++++++++++++++++++++++----
> mm/internal.h | 4 +-
> mm/page_alloc.c | 81 +++++++++++++++++++++++----
> 4 files changed, 201 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 60bdf8d..b83c142 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -12,6 +12,8 @@
> #define COMPACT_PARTIAL 3
> /* The full zone was compacted */
> #define COMPACT_COMPLETE 4
> +/* Captured a high-order free page in direct compaction */
> +#define COMPACT_CAPTURED 5
>
> /* Used to signal whether compaction detected need_sched() or lock contention */
> /* No contention detected */
> @@ -33,7 +35,8 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> enum migrate_mode mode, int *contended,
> - struct zone **candidate_zone);
> + struct zone **candidate_zone,
> + struct page **captured_page);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern unsigned long compaction_suitable(struct zone *zone, int order);
> @@ -103,7 +106,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
> static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> enum migrate_mode mode, int *contended,
> - struct zone **candidate_zone)
> + struct zone **candidate_zone,
> + struct page **captured_page);
> {
> return COMPACT_CONTINUE;
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index dd3e4db..bfe56ee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -548,6 +548,7 @@ static bool too_many_isolated(struct zone *zone)
> * @low_pfn: The first PFN to isolate
> * @end_pfn: The one-past-the-last PFN to isolate, within same pageblock
> * @isolate_mode: Isolation mode to be used.
> + * @capture: True if page capturing is allowed
> *
> * Isolate all pages that can be migrated from the range specified by
> * [low_pfn, end_pfn). The range is expected to be within same pageblock.
> @@ -561,7 +562,8 @@ static bool too_many_isolated(struct zone *zone)
> */
> static unsigned long
> isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> - unsigned long end_pfn, isolate_mode_t isolate_mode)
> + unsigned long end_pfn, isolate_mode_t isolate_mode,
> + bool capture)
> {
> struct zone *zone = cc->zone;
> unsigned long nr_scanned = 0, nr_isolated = 0;
> @@ -570,6 +572,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> unsigned long flags;
> bool locked = false;
> struct page *page = NULL, *valid_page = NULL;
> + unsigned long capture_pfn = 0; /* current candidate for capturing */
> + unsigned long next_capture_pfn = 0; /* next candidate for capturing */
> +
> + if (cc->order > 0 && cc->order <= pageblock_order && capture) {
> + /* This may be outside the zone, but we check that later */
> + capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
> + next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
> + }

Instead of inserting capture logic into common code (shared by compaction
and CMA), could you add it only to compaction code such as
isolate_migratepages()? Capture logic needs too many hooks, as you can see
in the snippets below, and it makes the code much more complicated.

>
> /*
> * Ensure that there are not too many pages isolated from the LRU
> @@ -591,7 +601,27 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> return 0;
>
> /* Time to isolate some pages for migration */
> - for (; low_pfn < end_pfn; low_pfn++) {
> + for (; low_pfn <= end_pfn; low_pfn++) {
> + if (low_pfn == next_capture_pfn) {
> + /*
> + * We have a capture candidate if we isolated something
> + * during the last cc->order aligned block of pages.
> + */
> + if (nr_isolated &&
> + capture_pfn >= zone->zone_start_pfn) {
> + *cc->capture_page = pfn_to_page(capture_pfn);
> + break;
> + }
> +
> + /* Prepare for a new capture candidate */
> + capture_pfn = next_capture_pfn;
> + next_capture_pfn += (1UL << cc->order);
> + }
> +
> + /* We check that here, in case low_pfn == next_capture_pfn */
> + if (low_pfn == end_pfn)
> + break;
> +
> /*
> * Periodically drop the lock (if held) regardless of its
> * contention, to give chance to IRQs. Abort async compaction
> @@ -625,8 +655,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> * a valid page order. Consider only values in the
> * valid order range to prevent low_pfn overflow.
> */
> - if (freepage_order > 0 && freepage_order < MAX_ORDER)
> + if (freepage_order > 0 && freepage_order < MAX_ORDER) {
> low_pfn += (1UL << freepage_order) - 1;
> + if (next_capture_pfn)
> + next_capture_pfn = ALIGN(low_pfn + 1,
> + (1UL << cc->order));
> + }
> continue;
> }
>
> @@ -662,6 +696,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> else
> low_pfn += (1 << compound_order(page)) - 1;
>
> + if (next_capture_pfn)
> + next_capture_pfn =
> + ALIGN(low_pfn + 1, (1UL << cc->order));
> continue;
> }
>
> @@ -686,6 +723,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> continue;
> if (PageTransHuge(page)) {
> low_pfn += (1 << compound_order(page)) - 1;
> + if (next_capture_pfn)
> + next_capture_pfn = ALIGN(low_pfn + 1,
> + (1UL << cc->order));
> continue;
> }
> }
> @@ -770,7 +810,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
> continue;
>
> pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
> - ISOLATE_UNEVICTABLE);
> + ISOLATE_UNEVICTABLE, false);
>
> /*
> * In case of fatal failure, release everything that might
> @@ -958,7 +998,7 @@ typedef enum {
> * compact_control.
> */
> static isolate_migrate_t isolate_migratepages(struct zone *zone,
> - struct compact_control *cc)
> + struct compact_control *cc, const int migratetype)
> {
> unsigned long low_pfn, end_pfn;
> struct page *page;
> @@ -981,6 +1021,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> for (; end_pfn <= cc->free_pfn;
> low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
>
> + int pageblock_mt;
> + bool capture;
> +
> /*
> * This can potentially iterate a massively long zone with
> * many pageblocks unsuitable, so periodically check if we
> @@ -1003,13 +1046,22 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> * Async compaction is optimistic to see if the minimum amount
> * of work satisfies the allocation.
> */
> + pageblock_mt = get_pageblock_migratetype(page);
> if (cc->mode == MIGRATE_ASYNC &&
> - !migrate_async_suitable(get_pageblock_migratetype(page)))
> + !migrate_async_suitable(pageblock_mt))
> continue;
>
> + /*
> + * Capture page only if the caller requested it, and either the
> + * pageblock has our desired migratetype, or we would take it
> + * completely.
> + */
> + capture = cc->capture_page && ((pageblock_mt == migratetype)
> + || (cc->order == pageblock_order));
> +
> /* Perform the isolation */
> low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
> - isolate_mode);
> + isolate_mode, capture);
>
> if (!low_pfn || cc->contended)
> return ISOLATE_ABORT;
> @@ -1028,6 +1080,48 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
> }
>
> +/*
> + * When called, cc->capture_page must be non-NULL. Then *cc->capture_page is
> + * just a candidate, or NULL (no candidate). This function will either
> + * successfully capture the page, or reset *cc->capture_page to NULL.
> + */
> +static bool compact_capture_page(struct compact_control *cc)
> +{
> + struct page *page = *cc->capture_page;
> + int cpu;
> +
> + if (!page)
> + return false;
> +
> + /* Unsafe check if it's worth to try acquiring the zone->lock at all */
> + if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
> + goto try_capture;
> +
> + /*
> + * There's a good chance that we have just put free pages on this CPU's
> + * lru cache and pcplists after the page migrations. Drain them to
> + * allow merging.
> + */
> + cpu = get_cpu();
> + lru_add_drain_cpu(cpu);
> + drain_local_pages(NULL);
> + put_cpu();

Just out of curiosity:

If lru_add_drain_cpu() is cheap enough to use for capturing a high-order
page, why doesn't __alloc_pages_direct_compact() call it before
get_page_from_freelist()?

> +
> + /* Did the draining help? */
> + if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
> + goto try_capture;
> +
> + goto fail;
> +
> +try_capture:
> + if (capture_free_page(page, cc->order))
> + return true;
> +
> +fail:
> + *cc->capture_page = NULL;
> + return false;
> +}
> +
> static int compact_finished(struct zone *zone, struct compact_control *cc,
> const int migratetype)
> {
> @@ -1056,6 +1150,10 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
> return COMPACT_COMPLETE;
> }
>
> + /* Did we just finish a pageblock that was capture candidate? */
> + if (cc->capture_page && compact_capture_page(cc))
> + return COMPACT_CAPTURED;
> +
> /*
> * order == -1 is expected when compacting via
> * /proc/sys/vm/compact_memory
> @@ -1188,7 +1286,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> COMPACT_CONTINUE) {
> int err;
>
> - switch (isolate_migratepages(zone, cc)) {
> + switch (isolate_migratepages(zone, cc, migratetype)) {
> case ISOLATE_ABORT:
> ret = COMPACT_PARTIAL;
> putback_movable_pages(&cc->migratepages);
> @@ -1227,13 +1325,18 @@ out:
> cc->nr_freepages -= release_freepages(&cc->freepages);
> VM_BUG_ON(cc->nr_freepages != 0);
>
> + /* Remove any candidate page if it was not captured */
> + if (cc->capture_page && ret != COMPACT_CAPTURED)
> + *cc->capture_page = NULL;
> +
> trace_mm_compaction_end(ret);
>
> return ret;
> }
>
> static unsigned long compact_zone_order(struct zone *zone, int order,
> - gfp_t gfp_mask, enum migrate_mode mode, int *contended)
> + gfp_t gfp_mask, enum migrate_mode mode, int *contended,
> + struct page **captured_page)
> {
> unsigned long ret;
> struct compact_control cc = {
> @@ -1243,6 +1346,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> .gfp_mask = gfp_mask,
> .zone = zone,
> .mode = mode,
> + .capture_page = captured_page,
> };
> INIT_LIST_HEAD(&cc.freepages);
> INIT_LIST_HEAD(&cc.migratepages);
> @@ -1268,13 +1372,15 @@ int sysctl_extfrag_threshold = 500;
> * @contended: Return value that determines if compaction was aborted due to
> * need_resched() or lock contention
> * @candidate_zone: Return the zone where we think allocation should succeed
> + * @captured_page: If successful, return the page captured during compaction
> *
> * This is the main entry point for direct page compaction.
> */
> unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *nodemask,
> enum migrate_mode mode, int *contended,
> - struct zone **candidate_zone)
> + struct zone **candidate_zone,
> + struct page **captured_page)
> {
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> int may_enter_fs = gfp_mask & __GFP_FS;
> @@ -1305,7 +1411,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> continue;
>
> status = compact_zone_order(zone, order, gfp_mask, mode,
> - &zone_contended);
> + &zone_contended, captured_page);
> rc = max(status, rc);
> /*
> * It takes at least one zone that wasn't lock contended
> @@ -1314,6 +1420,12 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
> all_zones_lock_contended &=
> (zone_contended == COMPACT_CONTENDED_LOCK);
>
> + /* If we captured a page, stop compacting */
> + if (*captured_page) {
> + *candidate_zone = zone;
> + break;
> + }
> +
> /* If a normal allocation would succeed, stop compacting */
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
> alloc_flags)) {
> diff --git a/mm/internal.h b/mm/internal.h
> index 8293040..f2d625f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -110,6 +110,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> */
> extern void __free_pages_bootmem(struct page *page, unsigned int order);
> extern void prep_compound_page(struct page *page, unsigned long order);
> +extern bool capture_free_page(struct page *page, unsigned int order);
> #ifdef CONFIG_MEMORY_FAILURE
> extern bool is_free_buddy_page(struct page *page);
> #endif
> @@ -148,6 +149,7 @@ struct compact_control {
> * contention detected during
> * compaction
> */
> + struct page **capture_page; /* Free page captured by compaction */
> };
>
> unsigned long
> @@ -155,7 +157,7 @@ isolate_freepages_range(struct compact_control *cc,
> unsigned long start_pfn, unsigned long end_pfn);
> unsigned long
> isolate_migratepages_range(struct compact_control *cc,
> - unsigned long low_pfn, unsigned long end_pfn);
> + unsigned long low_pfn, unsigned long end_pfn);
>
> #endif
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bbdb10f..af9ed36 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1489,9 +1489,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
> {
> unsigned long watermark;
> struct zone *zone;
> + struct free_area *area;
> int mt;
> + unsigned int freepage_order = page_order(page);
>
> - BUG_ON(!PageBuddy(page));
> + VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);
>
> zone = page_zone(page);
> mt = get_pageblock_migratetype(page);
> @@ -1506,9 +1508,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
> }
>

In __isolate_free_page(), we check zone_watermark_ok() with order 0, but
the normal allocation path checks zone_watermark_ok() with the requested
order. Your capture logic uses __isolate_free_page(), so this difference
could affect the compaction success rate significantly. It also means that
the capture logic allocates high-order pages from the page allocator more
aggressively than other components, such as a normal high-order
allocation. Could you test this patch again after changing the order
passed to zone_watermark_ok() in __isolate_free_page()?

Thanks.

2014-07-29 07:31:18

by David Rientjes

Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On Tue, 29 Jul 2014, Joonsoo Kim wrote:

> I have a silly question here.
> Why is need_resched() the criterion for stopping async compaction?
> need_resched() is flagged when the time slice runs out, or for other
> reasons. It means that we stop async compaction at an arbitrary time,
> because the process can be in compaction code at any moment. I don't
> think that is reasonable, and it doesn't guarantee anything. Instead
> of this approach, how about doing async compaction over a certain
> number of pageblocks?
>

Not a silly question at all, I had the same feeling in
https://lkml.org/lkml/2014/5/21/730 and proposed it to be a tunable that
indicates how much work we are willing to do for thp in the pagefault
path. It suffers from the fact that past failure to isolate and/or
migrate memory to free an entire pageblock doesn't indicate that the next
pageblock will fail as well, but there has to be a cutoff at some point or
async compaction becomes unnecessarily expensive. We can always rely on
khugepaged later to do the collapse, assuming we're not faulting memory
and then immediately pinning it.

I think there are two ways to go about it:

- allow a single thp fault to be expensive and then rely on deferred
compaction to avoid subsequent calls in the near future, or

- try to make all thp faults as inexpensive as possible so that
the cumulative effect of faulting large amounts of memory doesn't end
up with lengthy stalls.

Both of these are complex because of the potential for concurrent calls to
memory compaction when faulting thp on several cpus.

I also think the second point from that email still applies, that we
should abort isolating pages within a pageblock for migration once it can
no longer allow a cc->order allocation to succeed.

2014-07-29 08:20:26

by Joonsoo Kim

Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On Tue, Jul 29, 2014 at 12:31:13AM -0700, David Rientjes wrote:
> On Tue, 29 Jul 2014, Joonsoo Kim wrote:
>
> > I have a silly question here.
> > Why is need_resched() the criterion for stopping async compaction?
> > need_resched() is flagged when the time slice runs out, or for other
> > reasons. It means that we stop async compaction at an arbitrary time,
> > because the process can be in compaction code at any moment. I don't
> > think that is reasonable, and it doesn't guarantee anything. Instead
> > of this approach, how about doing async compaction over a certain
> > number of pageblocks?
> >
>
> Not a silly question at all, I had the same feeling in
> https://lkml.org/lkml/2014/5/21/730 and proposed it to be a tunable that
> indicates how much work we are willing to do for thp in the pagefault
> path. It suffers from the fact that past failure to isolate and/or

Oh... you already suggested the same idea.

> migrate memory to free an entire pageblock doesn't indicate that the next
> pageblock will fail as well, but there has to be a cutoff at some point or
> async compaction becomes unnecessarily expensive. We can always rely on
> khugepaged later to do the collapse, assuming we're not faulting memory
> and then immediately pinning it.
>
> I think there are two ways to go about it:
>
> - allow a single thp fault to be expensive and then rely on deferred
> compaction to avoid subsequent calls in the near future, or
>
> - try to make all thp faults as inexpensive as possible so that
> the cumulative effect of faulting large amounts of memory doesn't end
> up with lengthy stalls.

Hmm, if thp faults want to pay as little cost as possible, how about
making thp faults skip async/sync compaction entirely?

Thanks.

2014-07-29 09:02:12

by Vlastimil Babka

Subject: Re: [PATCH v5 02/14] mm, compaction: defer each zone individually instead of preferred zone

On 07/29/2014 01:59 AM, David Rientjes wrote:
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1122,28 +1122,27 @@ int sysctl_extfrag_threshold = 500;
>> * @nodemask: The allowed nodes to allocate from
>> * @mode: The migration mode for async, sync light, or sync migration
>> * @contended: Return value that is true if compaction was aborted due to lock contention
>> - * @page: Optionally capture a free page of the requested order during compaction
>
> Never noticed this non-existant formal before.

It's a leftover from Mel's first page capture attempt, which was
partially reverted.

>> + * @candidate_zone: Return the zone where we think allocation should succeed
>> *
>> * This is the main entry point for direct page compaction.
>> */
>> unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> int order, gfp_t gfp_mask, nodemask_t *nodemask,
>> - enum migrate_mode mode, bool *contended)
>> + enum migrate_mode mode, bool *contended,
>> + struct zone **candidate_zone)
>> {
>> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>> int may_enter_fs = gfp_mask & __GFP_FS;
>> int may_perform_io = gfp_mask & __GFP_IO;
>> struct zoneref *z;
>> struct zone *zone;
>> - int rc = COMPACT_SKIPPED;
>> + int rc = COMPACT_DEFERRED;
>> int alloc_flags = 0;
>>
>> /* Check if the GFP flags allow compaction */
>> if (!order || !may_enter_fs || !may_perform_io)
>> return rc;
>>
>
> It doesn't seem right that if we called try_to_compact_pages() in a
> context where it is useless (order-0 or non-GFP_KERNEL allocation) that we
> would return COMPACT_DEFERRED. I think the existing semantics before the
> patch, that is
>
> - deferred: compaction was tried but failed, so avoid subsequent calls in
> the near future that may be potentially expensive, and
>
> - skipped: compaction wasn't tried because it will be useless
>
> is correct and deferred shouldn't take on another meaning, which now will
> set deferred_compaction == true in the page allocator. It probably
> doesn't matter right now because the only check of deferred_compaction is
> effective for __GFP_NO_KSWAPD, i.e. it is both high-order and GFP_KERNEL,
> but it seems returning COMPACT_SKIPPED here would also work fine and be
> more appropriate.

You're right. That was an oversight, not intentional. Thanks for catching
that.

2014-07-29 09:12:55

by Vlastimil Babka

Subject: Re: [PATCH v5 02/14] mm, compaction: defer each zone individually instead of preferred zone

On 07/29/2014 08:38 AM, Joonsoo Kim wrote:
>> /* Return values for compact_zone() and try_to_compact_pages() */
>> +/* compaction didn't start as it was deferred due to past failures */
>> +#define COMPACT_DEFERRED 0
>> /* compaction didn't start as it was not possible or direct reclaim was more suitable */
>> -#define COMPACT_SKIPPED 0
>> +#define COMPACT_SKIPPED 1
>
> Hello,
>
> This change makes some users of compaction_suitable() fail
> unintentionally, because they assume that COMPACT_SKIPPED is 0.
> Please fix them accordingly.

Oops, good catch. Thanks!

>> @@ -2324,27 +2327,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>> order, zonelist, high_zoneidx,
>> alloc_flags & ~ALLOC_NO_WATERMARKS,
>> preferred_zone, classzone_idx, migratetype);
>> +
>> if (page) {
>> - preferred_zone->compact_blockskip_flush = false;
>> - compaction_defer_reset(preferred_zone, order, true);
>> + struct zone *zone = page_zone(page);
>> +
>> + zone->compact_blockskip_flush = false;
>> + compaction_defer_reset(zone, order, true);
>> count_vm_event(COMPACTSUCCESS);
>> return page;
>> }
>>
>> /*
>> + * last_compact_zone is where try_to_compact_pages thought
>> + * allocation should succeed, so it did not defer compaction.
>> + * But now we know that it didn't succeed, so we do the defer.
>> + */
>> + if (last_compact_zone && mode != MIGRATE_ASYNC)
>> + defer_compaction(last_compact_zone, order);
>
> I still don't understand why defer_compaction() is needed here.
> defer_compaction() is intended to avoid struggling with compaction on
> a zone where we have already tried compaction and found that it
> isn't suitable. Allocation failure doesn't tell us that we have tried
> compaction over the whole zone range, so we shouldn't carelessly
> decide here to defer compaction on this zone.

OK, I can remove that; it should make the code nicer anyway. I also agree
with the "whole zone range" argument, and I realized that it isn't really
the case (both before and after this patch). I planned to fix that in the
future, but I can probably do it now.
The plan is to call defer_compaction() only when compaction returns
COMPACT_COMPLETE (and not COMPACT_PARTIAL), as that means the whole zone
was scanned. Otherwise the migration scanner is biased toward the
beginning of the zone: compaction gets deferred half-way through, the
cached pfns may be reset when it restarts, and the rest of the zone is
never scanned at all.

> Thanks.
>

2014-07-29 09:16:42

by David Rientjes

Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On Tue, 29 Jul 2014, Joonsoo Kim wrote:

> Hmm, if thp faults want to pay as little cost as possible, how about
> making thp faults skip async/sync compaction entirely?
>

You can certainly do that with /sys/kernel/mm/transparent_hugepage/defrag.

Without doing memory compaction, though, at least one of my customers will
have their thp ratio drop significantly, and for the vast majority of our
machines minimal compaction is all that is needed to allocate a hugepage.
I'm concerned primarily about the straggler that has very lengthy fault
times even for single hugepages.

This patchset will address some of those concerns, but I agree with you
that we should be terminating async compaction with another heuristic
rather than need_resched().

2014-07-29 09:27:44

by Vlastimil Babka

Subject: Re: [PATCH v5 05/14] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On 07/29/2014 02:29 AM, David Rientjes wrote:
> On Mon, 28 Jul 2014, Vlastimil Babka wrote:
>
>> isolate_migratepages_range() is the main function of the compaction scanner,
>> called either on a single pageblock by isolate_migratepages() during regular
>> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
>> It currently performs two pageblock-wide compaction suitability checks, and
>> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
>> order to repeat those checks.
>>
>> However, closer inspection shows that those checks are always true for CMA:
>> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
>> - migrate_async_suitable() check is skipped because CMA uses sync compaction
>>
>> We can therefore move the compaction-specific checks to isolate_migratepages()
>> and simplify isolate_migratepages_range(). Furthermore, we can mimic the
>> freepage scanner family of functions, which has isolate_freepages_block()
>> function called both by compaction from isolate_freepages() and by CMA from
>> isolate_freepages_range(), where each use-case adds own specific glue code.
>> This allows further code simplification.
>>
>> Therefore, we rename isolate_migratepages_range() to isolate_freepages_block()
>
> s/isolate_freepages_block/isolate_migratepages_block/
>
> I read your commit description before looking at the patch and was very
> nervous about the direction you were going if that was true :) I'm
> relieved to see it was just a typo.

Ah, thanks :)

>> @@ -601,8 +554,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>> */
>> if (PageTransHuge(page)) {
>> if (!locked)
>> - goto next_pageblock;
>> - low_pfn += (1 << compound_order(page)) - 1;
>> + low_pfn = ALIGN(low_pfn + 1,
>> + pageblock_nr_pages) - 1;
>> + else
>> + low_pfn += (1 << compound_order(page)) - 1;
>> +
>
> Hmm, any reason not to always advance and align low_pfn to
> pageblock_nr_pages? I don't see how pageblock_order > HPAGE_PMD_ORDER
> would make sense if encountering thp.

I think PageTransHuge() might be true even for non-THP compound pages,
which may actually be of lower order, and we wouldn't want to skip the
whole pageblock.

>> @@ -680,15 +630,61 @@ next_pageblock:
>> return low_pfn;
>> }
>>
>> +/**
>> + * isolate_migratepages_range() - isolate migrate-able pages in a PFN range
>> + * @start_pfn: The first PFN to start isolating.
>> + * @end_pfn: The one-past-last PFN.
>
> Need to specify @cc?

OK.

>>
>> /*
>> - * Isolate all pages that can be migrated from the block pointed to by
>> - * the migrate scanner within compact_control.
>> + * Isolate all pages that can be migrated from the first suitable block,
>> + * starting at the block pointed to by the migrate scanner pfn within
>> + * compact_control.
>> */
>> static isolate_migrate_t isolate_migratepages(struct zone *zone,
>> struct compact_control *cc)
>> {
>> unsigned long low_pfn, end_pfn;
>> + struct page *page;
>> + const isolate_mode_t isolate_mode =
>> + (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
>>
>> - /* Do not scan outside zone boundaries */
>> - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
>> + /*
>> + * Start at where we last stopped, or beginning of the zone as
>> + * initialized by compact_zone()
>> + */
>> + low_pfn = cc->migrate_pfn;
>>
>> /* Only scan within a pageblock boundary */
>> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>>
>> - /* Do not cross the free scanner or scan within a memory hole */
>> - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
>> - cc->migrate_pfn = end_pfn;
>> - return ISOLATE_NONE;
>> - }
>> + /*
>> + * Iterate over whole pageblocks until we find the first suitable.
>> + * Do not cross the free scanner.
>> + */
>> + for (; end_pfn <= cc->free_pfn;
>> + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
>> +
>> + /*
>> + * This can potentially iterate a massively long zone with
>> + * many pageblocks unsuitable, so periodically check if we
>> + * need to schedule, or even abort async compaction.
>> + */
>> + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
>> + && compact_should_abort(cc))
>> + break;
>> +
>> + /* Skip whole pageblock in case of a memory hole */
>> + if (!pfn_valid(low_pfn))
>> + continue;
>> +
>> + page = pfn_to_page(low_pfn);
>> +
>> + /* If isolation recently failed, do not retry */
>> + if (!isolation_suitable(cc, page))
>> + continue;
>> +
>> + /*
>> + * For async compaction, also only scan in MOVABLE blocks.
>> + * Async compaction is optimistic to see if the minimum amount
>> + * of work satisfies the allocation.
>> + */
>> + if (cc->mode == MIGRATE_ASYNC &&
>> + !migrate_async_suitable(get_pageblock_migratetype(page)))
>> + continue;
>> +
>> + /* Perform the isolation */
>> + low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
>> + isolate_mode);
>
> Hmm, why would we want to unconditionally set pageblock_skip if no pages
> could be isolated from a pageblock when
> isolate_mode == ISOLATE_ASYNC_MIGRATE? It seems like it erroneously skip
> pageblocks for cases when isolate_mode == 0.

Well, pageblock_skip is a single bit, and you don't know whether the next
attempt will be async or sync. As it is now, you might skip needlessly if
the next attempt is sync; if we changed that, you wouldn't skip if the
next attempt were async again. One way could be better than the other,
but I'm not sure, and I would consider it separately.
The former patch 15 (quickly skip a pageblock that won't be fully
migrated) could perhaps change the balance here.

But I hope this patch doesn't change this particular behavior, right?

2014-07-29 09:31:54

by Vlastimil Babka

Subject: Re: [PATCH v5 06/14] mm, compaction: reduce zone checking frequency in the migration scanner

On 07/29/2014 02:44 AM, David Rientjes wrote:
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Acked-by: Mel Gorman <[email protected]>
>> Cc: Joonsoo Kim <[email protected]>
>> Cc: Michal Nazarewicz <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> Cc: Rik van Riel <[email protected]>
>> Cc: David Rientjes <[email protected]>
>
> Acked-by: David Rientjes <[email protected]>
>
> Minor comments below.

Thanks,

>> +/*
>> + * Check that the whole (or subset of) a pageblock given by the interval of
>> + * [start_pfn, end_pfn) is valid and within the same zone, before scanning it
>> + * with the migration or free compaction scanner. The scanners then need to
>> + * use only pfn_valid_within() check for arches that allow holes within
>> + * pageblocks.
>> + *
>> + * Return struct page pointer of start_pfn, or NULL if checks were not passed.
>> + *
>> + * It's possible on some configurations to have a setup like node0 node1 node0
>> + * i.e. it's possible that all pages within a zones range of pages do not
>> + * belong to a single zone. We assume that a border between node0 and node1
>> + * can occur within a single pageblock, but not a node0 node1 node0
>> + * interleaving within a single pageblock. It is therefore sufficient to check
>> + * the first and last page of a pageblock and avoid checking each individual
>> + * page in a pageblock.
>> + */
>> +static struct page *pageblock_within_zone(unsigned long start_pfn,
>> + unsigned long end_pfn, struct zone *zone)
>
> The name of this function is quite strange, it's returning a pointer to
> the actual start page but the name implies it would be a boolean.

Yeah but I couldn't think of a better name that wouldn't be long and ugly :(

>> +{
>> + struct page *start_page;
>> + struct page *end_page;
>> +
>> + /* end_pfn is one past the range we are checking */
>> + end_pfn--;
>> +
>
> With the given implementation, yes, but I'm not sure if that should be
> assumed for any class of callers. It seems better to call with
> end_pfn - 1.

Well, I think the rest of the compaction functions assume one-past-the-end
parameters, so this would make it an exception. Better to hide the
exception in the implementation than expose it to callers?

>> + if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
>> + return NULL;
>> +
>
> Ok, so even with this check, we still need to check pfn_valid_within() for
> all pfns between start_pfn and end_pfn if there are memory holes. I
> checked that both the migration and freeing scanners do that before
> reading your comment above the function, looks good.

Yeah, and thankfully pfn_valid_within() is a no-op on many archs :)

2014-07-29 09:45:25

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On 07/29/2014 02:59 AM, David Rientjes wrote:
> On Mon, 28 Jul 2014, Vlastimil Babka wrote:
>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index b2e4c92..60bdf8d 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -13,6 +13,14 @@
>> /* The full zone was compacted */
>> #define COMPACT_COMPLETE 4
>>
>> +/* Used to signal whether compaction detected need_sched() or lock contention */
>> +/* No contention detected */
>> +#define COMPACT_CONTENDED_NONE 0
>> +/* Either need_sched() was true or fatal signal pending */
>> +#define COMPACT_CONTENDED_SCHED 1
>> +/* Zone lock or lru_lock was contended in async compaction */
>> +#define COMPACT_CONTENDED_LOCK 2
>> +
>
> Make this an enum?

I tried that originally, but then I would have to define it elsewhere
(mm/internal.h, I think) together with compact_control. I didn't think it
was worth the extra pollution of a shared header when the return codes
are also #defines, and we might still get rid of need_resched() one day.

>> @@ -223,9 +223,21 @@ static void update_pageblock_skip(struct compact_control *cc,
>> }
>> #endif /* CONFIG_COMPACTION */
>>
>> -static inline bool should_release_lock(spinlock_t *lock)
>> +static int should_release_lock(spinlock_t *lock)
>> {
>> - return need_resched() || spin_is_contended(lock);
>> + /*
>> + * Sched contention has higher priority here as we may potentially
>> + * have to abort whole compaction ASAP. Returning with lock contention
>> + * means we will try another zone, and further decisions are
>> + * influenced only when all zones are lock contended. That means
>> + * potentially missing a lock contention is less critical.
>> + */
>> + if (need_resched())
>> + return COMPACT_CONTENDED_SCHED;
>> + else if (spin_is_contended(lock))
>> + return COMPACT_CONTENDED_LOCK;
>> + else
>> + return COMPACT_CONTENDED_NONE;
>
> I would avoid the last else statement and just return
> COMPACT_CONTENDED_NONE.

OK, checkpatch agrees.

>> @@ -1154,11 +1168,11 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>> INIT_LIST_HEAD(&cc.migratepages);
>>
>> ret = compact_zone(zone, &cc);
>> + *contended = cc.contended;
>>
>> VM_BUG_ON(!list_empty(&cc.freepages));
>> VM_BUG_ON(!list_empty(&cc.migratepages));
>>
>> - *contended = cc.contended;
>
> Not sure why, but ok :)

Oversight due to how this patch evolved :)

>> return ret;
>> }
>>
>> @@ -1171,14 +1185,15 @@ int sysctl_extfrag_threshold = 500;
>> * @gfp_mask: The GFP mask of the current allocation
>> * @nodemask: The allowed nodes to allocate from
>> * @mode: The migration mode for async, sync light, or sync migration
>> - * @contended: Return value that is true if compaction was aborted due to lock contention
>> + * @contended: Return value that determines if compaction was aborted due to
>> + * need_resched() or lock contention
>> * @candidate_zone: Return the zone where we think allocation should succeed
>> *
>> * This is the main entry point for direct page compaction.
>> */
>> unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> int order, gfp_t gfp_mask, nodemask_t *nodemask,
>> - enum migrate_mode mode, bool *contended,
>> + enum migrate_mode mode, int *contended,
>> struct zone **candidate_zone)
>> {
>> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>> @@ -1188,6 +1203,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> struct zone *zone;
>> int rc = COMPACT_DEFERRED;
>> int alloc_flags = 0;
>> + bool all_zones_lock_contended = true; /* init true for &= operation */
>> +
>> + *contended = COMPACT_CONTENDED_NONE;
>>
>> /* Check if the GFP flags allow compaction */
>> if (!order || !may_enter_fs || !may_perform_io)
>> @@ -1201,13 +1219,20 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
>> nodemask) {
>> int status;
>> + int zone_contended;
>>
>> if (compaction_deferred(zone, order))
>> continue;
>>
>> status = compact_zone_order(zone, order, gfp_mask, mode,
>> - contended);
>> + &zone_contended);
>> rc = max(status, rc);
>> + /*
>> + * It takes at least one zone that wasn't lock contended
>> + * to turn all_zones_lock_contended to false.
>> + */
>> + all_zones_lock_contended &=
>> + (zone_contended == COMPACT_CONTENDED_LOCK);
>
> Eek, does this always work? COMPACT_CONTENDED_LOCK is 0x2 and

(zone_contended == COMPACT_CONTENDED_LOCK) is a bool so it should work?

> all_zones_lock_contended is a bool initialized to true. I'm fairly
> certain you'd get better code generation if you defined
> all_zones_lock_contended as
>
> int all_zones_lock_contended = COMPACT_CONTENDED_LOCK
>
> from the start and the source code would be more clear.

Yeah that would look nicer, thanks for the suggestion.

>>
>> /* If a normal allocation would succeed, stop compacting */
>> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
>> @@ -1220,8 +1245,20 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>> * succeeds in this zone.
>> */
>> compaction_defer_reset(zone, order, false);
>> - break;
>> - } else if (mode != MIGRATE_ASYNC) {
>> + /*
>> + * It is possible that async compaction aborted due to
>> + * need_resched() and the watermarks were ok thanks to
>> + * somebody else freeing memory. The allocation can
>> + * however still fail so we better signal the
>> + * need_resched() contention anyway.
>> + */
>
> Makes sense because the page allocator still tries to allocate the page
> even if this is returned, but it's not very clear in the comment.

OK

>> @@ -2660,15 +2660,36 @@ rebalance:
>> if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
>> migration_mode = MIGRATE_SYNC_LIGHT;
>>
>> - /*
>> - * If compaction is deferred for high-order allocations, it is because
>> - * sync compaction recently failed. In this is the case and the caller
>> - * requested a movable allocation that does not heavily disrupt the
>> - * system then fail the allocation instead of entering direct reclaim.
>> - */
>> - if ((deferred_compaction || contended_compaction) &&
>> - (gfp_mask & __GFP_NO_KSWAPD))
>> - goto nopage;
>
> Hmm, this check will have unfortunately changed in the latest mmotm due to
> mm-thp-restructure-thp-avoidance-of-light-synchronous-migration.patch.

I think you were changing (and moving around) a different check so there
would be merge conflicts but no semantic problem.

>> + /* Checks for THP-specific high-order allocations */
>> + if (gfp_mask & __GFP_NO_KSWAPD) {
>> + /*
>> + * If compaction is deferred for high-order allocations, it is
>> + * because sync compaction recently failed. If this is the case
>> + * and the caller requested a THP allocation, we do not want
>> + * to heavily disrupt the system, so we fail the allocation
>> + * instead of entering direct reclaim.
>> + */
>> + if (deferred_compaction)
>> + goto nopage;
>> +
>> + /*
>> + * In all zones where compaction was attempted (and not
>> + * deferred or skipped), lock contention has been detected.
>> + * For THP allocation we do not want to disrupt the others
>> + * so we fallback to base pages instead.
>> + */
>> + if (contended_compaction == COMPACT_CONTENDED_LOCK)
>> + goto nopage;
>> +
>> + /*
>> + * If compaction was aborted due to need_resched(), we do not
>> + * want to further increase allocation latency, unless it is
>> + * khugepaged trying to collapse.
>> + */
>> + if (contended_compaction == COMPACT_CONTENDED_SCHED
>> + && !(current->flags & PF_KTHREAD))
>> + goto nopage;
>> + }
>>
>> /* Try direct reclaim and then allocating */
>> page = __alloc_pages_direct_reclaim(gfp_mask, order,

2014-07-29 09:49:32

by Vlastimil Babka

Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On 07/29/2014 09:31 AM, David Rientjes wrote:
> On Tue, 29 Jul 2014, Joonsoo Kim wrote:
>
>> I have a silly question here.
>> Why is need_resched() the criterion for stopping async compaction?
>> need_resched() is flagged when the time slice runs out, or for other
>> reasons. It means that we stop async compaction at an arbitrary time,
>> because the process can be in compaction code at any moment. I don't
>> think that is reasonable, and it doesn't guarantee anything. Instead
>> of this approach, how about doing async compaction over a certain
>> number of pageblocks?
>>
>
> Not a silly question at all, I had the same feeling in
> https://lkml.org/lkml/2014/5/21/730 and proposed it to be a tunable that
> indicates how much work we are willing to do for thp in the pagefault
> path. It suffers from the fact that past failure to isolate and/or
> migrate memory to free an entire pageblock doesn't indicate that the next
> pageblock will fail as well, but there has to be a cutoff at some point or
> async compaction becomes unnecessarily expensive. We can always rely on
> khugepaged later to do the collapse, assuming we're not faulting memory
> and then immediately pinning it.
>
> I think there are two ways to go about it:
>
> - allow a single thp fault to be expensive and then rely on deferred
> compaction to avoid subsequent calls in the near future, or
>
> - try to make all thp faults as inexpensive as possible so that
> the cumulative effect of faulting large amounts of memory doesn't end
> up with lengthy stalls.
>
> Both of these are complex because of the potential for concurrent calls to
> memory compaction when faulting thp on several cpus.
>
> I also think the second point from that email still applies, that we
> should abort isolating pages within a pageblock for migration once it can
> no longer allow a cc->order allocation to succeed.

That was the RFC patch 15, I hope to reintroduce it soon. You could
still test it meanwhile to see if you see the same extfrag regression as
me. In my tests, kswapd/khugepaged wasn't doing enough work to
defragment the pageblocks that the stress-highalloc benchmark
(configured to behave like thp page fault) was skipping.

2014-07-29 15:34:43

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v5 14/14] mm, compaction: try to capture the just-created high-order freepage

On 07/29/2014 09:34 AM, Joonsoo Kim wrote:
> I haven't looked at it in detail, but it looks heavily duplicated and hard
> to maintain. In my experience this is really error-prone; think of the
> freepage counting bugs reported by my recent patchset. Freepage counting
> is handled at different places for performance reasons, and in the end the
> bugs were there. IMHO, making a common function and using it
> is better than this approach, even if we touch the fastpath.

OK, so opposite opinion than Minchan's :)

> Could you separate this patch from this patchset?
> I think that this patch hasn't been reviewed as much by other developers
> as the other patches.

Yeah I will.

>> @@ -570,6 +572,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>> unsigned long flags;
>> bool locked = false;
>> struct page *page = NULL, *valid_page = NULL;
>> + unsigned long capture_pfn = 0; /* current candidate for capturing */
>> + unsigned long next_capture_pfn = 0; /* next candidate for capturing */
>> +
>> + if (cc->order > 0 && cc->order <= pageblock_order && capture) {
>> + /* This may be outside the zone, but we check that later */
>> + capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
>> + next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
>> + }
>
> Instead of inserting capture logic to common code (compaction and
> CMA), could you add it only to compaction code such as
> isolate_migratepages(). Capture logic needs too many hooks as you see
> on below snippets. And it makes code so much complicated.

Could do it in isolate_migratepages() for whole pageblocks only (as
David's patch did), but that restricts the usefulness. Or maybe do it
fine grained by calling isolate_migratepages_block() multiple times. But
the overhead of multiple calls would probably suck even more for
lower-order compactions. For CMA the added overhead is basically only
checks for next_capture_pfn that will always be false, so predictable.
And mostly just in branches where isolation is failing, which is not the
CMA's "fast path" I guess?

But I see you're talking about "complicated", not overhead. Well it's 4
hunks inside the isolate_migratepages_block() for loop. I don't think
it's *that* bad, thanks to how the function was cleaned up by the
previous patches.
Hmm, but you made me realize I could make it nicer by doing a "goto
isolation_fail" which would handle the next_capture_pfn update at a
single place.

>> +static bool compact_capture_page(struct compact_control *cc)
>> +{
>> + struct page *page = *cc->capture_page;
>> + int cpu;
>> +
>> + if (!page)
>> + return false;
>> +
>> + /* Unsafe check if it's worth to try acquiring the zone->lock at all */
>> + if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
>> + goto try_capture;
>> +
>> + /*
>> + * There's a good chance that we have just put free pages on this CPU's
>> + * lru cache and pcplists after the page migrations. Drain them to
>> + * allow merging.
>> + */
>> + cpu = get_cpu();
>> + lru_add_drain_cpu(cpu);
>> + drain_local_pages(NULL);
>> + put_cpu();
>
> Just for curiosity.
>
> If lru_add_drain_cpu() is cheap enough to capture high order page, why
> __alloc_pages_direct_compact() doesn't call it before
> get_page_from_freelist()?

No idea. I guess it wasn't noticed at the time that page migration uses
putback_lru_page() on the page that was freed, which puts it into the
lru_add cache, only to be freed. I think it would be better to free the
page immediately in this case, and use lru_add cache only for pages that
will really go to lru.

Heck, it could be even better to tell page migration to skip pcplists as
well, to avoid drain_local_pages. Often you migrate because you want to
use the original page for something. NUMA balancing migrations are
different, I guess.

>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1489,9 +1489,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>> {
>> unsigned long watermark;
>> struct zone *zone;
>> + struct free_area *area;
>> int mt;
>> + unsigned int freepage_order = page_order(page);
>>
>> - BUG_ON(!PageBuddy(page));
>> + VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);
>>
>> zone = page_zone(page);
>> mt = get_pageblock_migratetype(page);
>> @@ -1506,9 +1508,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>> }
>>
>
> In __isolate_free_page(), we check zone_watermark_ok() with order 0.
> But normal allocation logic would check zone_watermark_ok() with the requested
> order. Your capture logic uses __isolate_free_page(), so it would
> affect the compaction success rate significantly. And it means that
> the capture logic allocates high-order pages from the page allocator
> too aggressively compared to other components such as normal high-order

It's either that, or the extra lru drain that makes the difference. But
the "aggressiveness" would in fact mean better accuracy. Watermark
checking may be inaccurate. Especially when memory is close to the
watermark and there is only a single high-order page that would satisfy
the allocation.

> allocation. Could you test this patch again after changing order for
> zone_watermark_ok() in __isolate_free_page()?

I can do that. If that makes capture significantly worse, it just
highlights the watermark checking inaccuracy.

> Thanks.
>

2014-07-29 22:53:39

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On Tue, 29 Jul 2014, Vlastimil Babka wrote:

> > I think there are two ways to go about it:
> >
> > - allow a single thp fault to be expensive and then rely on deferred
> > compaction to avoid subsequent calls in the near future, or
> >
> > - try to make all thp faults as inexpensive as possible so that
> > the cumulative effect of faulting large amounts of memory doesn't end
> > up with lengthy stalls.
> >
> > Both of these are complex because of the potential for concurrent calls to
> > memory compaction when faulting thp on several cpus.
> >
> > I also think the second point from that email still applies, that we
> > should abort isolating pages within a pageblock for migration once it can
> > no longer allow a cc->order allocation to succeed.
>
> That was the RFC patch 15, I hope to reintroduce it soon.

Which of the points above are you planning on addressing in another patch?
I think the approach would cause the above to be mutually exclusive
options.

> You could still test
> it meanwhile to see if you see the same extfrag regression as me. In my tests,
> kswapd/khugepaged wasn't doing enough work to defragment the pageblocks that
> the stress-highalloc benchmark (configured to behave like thp page fault) was
> skipping.
>

The initial regression that I encountered was on a 128GB machine where
async compaction would cause faulting 64MB of transparent hugepages to
stall excessively, and I don't see how kswapd can address this if there's
no memory pressure, or how khugepaged can address it with its default
settings, which are very slow.

Another idea I had is to only do async memory compaction for thp on local
zones and avoid defragmenting remotely since, in my experimentation,
remote thp memory causes a performance degradation over regular pages. If
that solution were to involve zone_reclaim_mode and a test of
node_distance() > RECLAIM_DISTANCE, I think that would be acceptable as
well.

2014-07-29 22:57:45

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On Tue, 29 Jul 2014, Vlastimil Babka wrote:

> > > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > > index b2e4c92..60bdf8d 100644
> > > --- a/include/linux/compaction.h
> > > +++ b/include/linux/compaction.h
> > > @@ -13,6 +13,14 @@
> > > /* The full zone was compacted */
> > > #define COMPACT_COMPLETE 4
> > >
> > > +/* Used to signal whether compaction detected need_sched() or lock
> > > contention */
> > > +/* No contention detected */
> > > +#define COMPACT_CONTENDED_NONE 0
> > > +/* Either need_sched() was true or fatal signal pending */
> > > +#define COMPACT_CONTENDED_SCHED 1
> > > +/* Zone lock or lru_lock was contended in async compaction */
> > > +#define COMPACT_CONTENDED_LOCK 2
> > > +
> >
> > Make this an enum?
>
> I tried originally, but then I would have to define it elsewhere
> (mm/internal.h I think) together with compact_control. I didn't think it was
> worth the extra pollution of a shared header, when the return codes are also
> #defines and we might still get rid of need_resched() one day.
>

Ok.

[...]

> > > @@ -2660,15 +2660,36 @@ rebalance:
> > > if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
> > > migration_mode = MIGRATE_SYNC_LIGHT;
> > >
> > > - /*
> > > - * If compaction is deferred for high-order allocations, it is because
> > > - * sync compaction recently failed. In this is the case and the caller
> > > - * requested a movable allocation that does not heavily disrupt the
> > > - * system then fail the allocation instead of entering direct reclaim.
> > > - */
> > > - if ((deferred_compaction || contended_compaction) &&
> > > - (gfp_mask & __GFP_NO_KSWAPD))
> > > - goto nopage;
> >
> > Hmm, this check will have unfortunately changed in the latest mmotm due to
> > mm-thp-restructure-thp-avoidance-of-light-synchronous-migration.patch.
>
> I think you were changing (and moving around) a different check so there would
> be merge conflicts but no semantic problem.
>

The idea is the same, though, I think the check should not rely on
__GFP_NO_KSWAPD and rather rely on
(gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE. In other words, all the
possibilities under your new test for gfp_mask & __GFP_NO_KSWAPD are thp
specific and not for the other allocators that pass __GFP_NO_KSWAPD. This
patch would be a significant change in logic for those users, which doesn't
seem helpful.

2014-07-29 23:02:13

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 05/14] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Tue, 29 Jul 2014, Vlastimil Babka wrote:

> > > @@ -601,8 +554,11 @@ isolate_migratepages_range(struct zone *zone, struct
> > > compact_control *cc,
> > > */
> > > if (PageTransHuge(page)) {
> > > if (!locked)
> > > - goto next_pageblock;
> > > - low_pfn += (1 << compound_order(page)) - 1;
> > > + low_pfn = ALIGN(low_pfn + 1,
> > > + pageblock_nr_pages) - 1;
> > > + else
> > > + low_pfn += (1 << compound_order(page)) - 1;
> > > +
> >
> > Hmm, any reason not to always advance and align low_pfn to
> > pageblock_nr_pages? I don't see how pageblock_order > HPAGE_PMD_ORDER
> > would make sense if encountering thp.
>
> I think PageTransHuge() might be true even for non-THP compound pages, which
> might actually be of lower order, and we wouldn't want to skip the whole
> pageblock.
>

Hmm, I'm confused at how that could be true, could you explain what
memory other than thp can return true for PageTransHuge()? Are you simply
referring to possible checks on tail pages where we would need to look at
PageHead() instead? If so, I'm not sure how we could possibly encounter
such a condition within this iteration.

> > > @@ -680,15 +630,61 @@ next_pageblock:
> > > return low_pfn;
> > > }
> > >
> > > +/**
> > > + * isolate_migratepages_range() - isolate migrate-able pages in a PFN
> > > range
> > > + * @start_pfn: The first PFN to start isolating.
> > > + * @end_pfn: The one-past-last PFN.
> >
> > Need to specify @cc?
>
> OK.
>
> > >
> > > /*
> > > - * Isolate all pages that can be migrated from the block pointed to by
> > > - * the migrate scanner within compact_control.
> > > + * Isolate all pages that can be migrated from the first suitable block,
> > > + * starting at the block pointed to by the migrate scanner pfn within
> > > + * compact_control.
> > > */
> > > static isolate_migrate_t isolate_migratepages(struct zone *zone,
> > > struct compact_control *cc)
> > > {
> > > unsigned long low_pfn, end_pfn;
> > > + struct page *page;
> > > + const isolate_mode_t isolate_mode =
> > > + (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
> > >
> > > - /* Do not scan outside zone boundaries */
> > > - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
> > > + /*
> > > + * Start at where we last stopped, or beginning of the zone as
> > > + * initialized by compact_zone()
> > > + */
> > > + low_pfn = cc->migrate_pfn;
> > >
> > > /* Only scan within a pageblock boundary */
> > > end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
> > >
> > > - /* Do not cross the free scanner or scan within a memory hole */
> > > - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> > > - cc->migrate_pfn = end_pfn;
> > > - return ISOLATE_NONE;
> > > - }
> > > + /*
> > > + * Iterate over whole pageblocks until we find the first suitable.
> > > + * Do not cross the free scanner.
> > > + */
> > > + for (; end_pfn <= cc->free_pfn;
> > > + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
> > > +
> > > + /*
> > > + * This can potentially iterate a massively long zone with
> > > + * many pageblocks unsuitable, so periodically check if we
> > > + * need to schedule, or even abort async compaction.
> > > + */
> > > + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
> > > + && compact_should_abort(cc))
> > > + break;
> > > +
> > > + /* Skip whole pageblock in case of a memory hole */
> > > + if (!pfn_valid(low_pfn))
> > > + continue;
> > > +
> > > + page = pfn_to_page(low_pfn);
> > > +
> > > + /* If isolation recently failed, do not retry */
> > > + if (!isolation_suitable(cc, page))
> > > + continue;
> > > +
> > > + /*
> > > + * For async compaction, also only scan in MOVABLE blocks.
> > > + * Async compaction is optimistic to see if the minimum amount
> > > + * of work satisfies the allocation.
> > > + */
> > > + if (cc->mode == MIGRATE_ASYNC &&
> > > + !migrate_async_suitable(get_pageblock_migratetype(page)))
> > > + continue;
> > > +
> > > + /* Perform the isolation */
> > > + low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
> > > + isolate_mode);
> >
> > Hmm, why would we want to unconditionally set pageblock_skip if no pages
> > could be isolated from a pageblock when
> > isolate_mode == ISOLATE_ASYNC_MIGRATE? It seems like it erroneously skip
> > pageblocks for cases when isolate_mode == 0.
>
> Well pageblock_skip is a single bit and you don't know if the next attempt
> will be async or sync. So now you would maybe skip needlessly if the next
> attempt would be sync. If we changed that, you wouldn't skip if the next
> attempt would be async again. Could be that one way is better than other but
> I'm not sure, and would consider it separately.
> The former patch 15 (quick skip pageblock that won't be fully migrated) could
> perhaps change the balance here.
>

That's why we have two separate per-zone cached start pfns, though, right?
The next call to async compaction should start from where the previous
caller left off so there would be no need to set pageblock skip in that
case until we have checked all memory. Or are you considering the case of
concurrent async compaction?

2014-07-29 23:22:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v5 05/14] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Tue, Jul 29, 2014 at 04:02:09PM -0700, David Rientjes wrote:
> Hmm, I'm confused at how that could be true, could you explain what
> memory other than thp can return true for PageTransHuge()?

PageTransHuge() will be true for any head of compound page if THP is
enabled compile time: hugetlbfs, slab, whatever.

--
Kirill A. Shutemov

2014-07-29 23:51:42

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 05/14] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On Wed, 30 Jul 2014, Kirill A. Shutemov wrote:

> > Hmm, I'm confused at how that could be true, could you explain what
> > memory other than thp can return true for PageTransHuge()?
>
> PageTransHuge() will be true for any head of compound page if THP is
> enabled compile time: hugetlbfs, slab, whatever.
>

I meant in the context of the patch :) Since PageLRU is set, that
discounts slab, so we're left with thp or hugetlbfs. Logically, both
should have sizes that are >= the size of the pageblock itself, so I'm not
sure why we don't unconditionally align up to pageblock_nr_pages here. Is
there a legitimate configuration where a pageblock will span multiple
pages of HPAGE_PMD_ORDER?

2014-07-30 08:32:30

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v5 14/14] mm, compaction: try to capture the just-created high-order freepage

On Tue, Jul 29, 2014 at 05:34:37PM +0200, Vlastimil Babka wrote:
> >>@@ -570,6 +572,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >> unsigned long flags;
> >> bool locked = false;
> >> struct page *page = NULL, *valid_page = NULL;
> >>+ unsigned long capture_pfn = 0; /* current candidate for capturing */
> >>+ unsigned long next_capture_pfn = 0; /* next candidate for capturing */
> >>+
> >>+ if (cc->order > 0 && cc->order <= pageblock_order && capture) {
> >>+ /* This may be outside the zone, but we check that later */
> >>+ capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
> >>+ next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
> >>+ }
> >
> >Instead of inserting capture logic to common code (compaction and
> >CMA), could you add it only to compaction code such as
> >isolate_migratepages(). Capture logic needs too many hooks as you see
> >on below snippets. And it makes code so much complicated.
>
> Could do it in isolate_migratepages() for whole pageblocks only (as
> David's patch did), but that restricts the usefulness. Or maybe do
> it fine grained by calling isolate_migratepages_block() multiple
> times. But the overhead of multiple calls would probably suck even
> more for lower-order compactions. For CMA the added overhead is
> basically only checks for next_capture_pfn that will always be
> false, so predictable. And mostly just in branches where isolation
> is failing, which is not the CMA's "fast path" I guess?

You can do it fine grained with compact_control's migratepages list
or a new private list. If some pages are isolated and added to this list,
you can check the pfn of each page on the list and determine an appropriate
capture candidate page. This approach gives us more flexibility for
choosing the capture candidate without adding more complexity to the
common function. For example, you could choose a capture candidate when
there are XX isolated pages in a certain range.

>
> But I see you're talking about "complicated", not overhead. Well
> it's 4 hunks inside the isolate_migratepages_block() for loop. I
> don't think it's *that* bad, thanks to how the function was cleaned
> up by the previous patches.
> Hmm but you made me realize I could make it nicer by doing a "goto
> isolation_fail" which would handle the next_capture_pfn update at a
> single place.
>
> >>+static bool compact_capture_page(struct compact_control *cc)
> >>+{
> >>+ struct page *page = *cc->capture_page;
> >>+ int cpu;
> >>+
> >>+ if (!page)
> >>+ return false;
> >>+
> >>+ /* Unsafe check if it's worth to try acquiring the zone->lock at all */
> >>+ if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
> >>+ goto try_capture;
> >>+
> >>+ /*
> >>+ * There's a good chance that we have just put free pages on this CPU's
> >>+ * lru cache and pcplists after the page migrations. Drain them to
> >>+ * allow merging.
> >>+ */
> >>+ cpu = get_cpu();
> >>+ lru_add_drain_cpu(cpu);
> >>+ drain_local_pages(NULL);
> >>+ put_cpu();
> >
> >Just for curiosity.
> >
> >If lru_add_drain_cpu() is cheap enough to capture high order page, why
> >__alloc_pages_direct_compact() doesn't call it before
> >get_page_from_freelist()?
>
> No idea. I guess it wasn't noticed at the time that page migration
> uses putback_lru_page() on the page that was freed, which puts it
> into the lru_add cache, only to be freed. I think it would be better
> to free the page immediately in this case, and use lru_add cache
> only for pages that will really go to lru.
>
> Heck, it could be even better to tell page migration to skip
> pcplists as well, to avoid drain_local_pages. Often you migrate
> because you want to use the original page for something. NUMA
> balancing migrations are different, I guess.
>
> >>--- a/mm/page_alloc.c
> >>+++ b/mm/page_alloc.c
> >>@@ -1489,9 +1489,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
> >> {
> >> unsigned long watermark;
> >> struct zone *zone;
> >>+ struct free_area *area;
> >> int mt;
> >>+ unsigned int freepage_order = page_order(page);
> >>
> >>- BUG_ON(!PageBuddy(page));
> >>+ VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);
> >>
> >> zone = page_zone(page);
> >> mt = get_pageblock_migratetype(page);
> >>@@ -1506,9 +1508,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
> >> }
> >>
> >
> >In __isolate_free_page(), we check zone_watermark_ok() with order 0.
> >But normal allocation logic would check zone_watermark_ok() with the requested
> >order. Your capture logic uses __isolate_free_page(), so it would
> >affect the compaction success rate significantly. And it means that
> >the capture logic allocates high-order pages from the page allocator
> >too aggressively compared to other components such as normal high-order
>
> It's either that, or the extra lru drain that makes the difference.
> But the "aggressiveness" would in fact mean better accuracy.
> Watermark checking may be inaccurate. Especially when memory is
> close to the watermark and there is only a single high-order page
> that would satisfy the allocation.

If this "aggressiveness" means better accuracy, fixing the general
function watermark_ok() is better than adding capture logic.

But I guess that there is a reason that watermark_ok() is so
conservative. If the page allocator aggressively provides high-order pages,
future atomic high-order page requests cannot succeed easily. To
prevent this situation, watermark_ok() should stay conservative.

Thanks.

2014-07-30 09:08:34

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v5 07/14] mm, compaction: khugepaged should not give up due to need_resched()

On 07/30/2014 12:53 AM, David Rientjes wrote:
> On Tue, 29 Jul 2014, Vlastimil Babka wrote:
>
>>> I think there are two ways to go about it:
>>>
>>> - allow a single thp fault to be expensive and then rely on deferred
>>> compaction to avoid subsequent calls in the near future, or
>>>
>>> - try to make all thp faults as inexpensive as possible so that
>>> the cumulative effect of faulting large amounts of memory doesn't end
>>> up with lengthy stalls.
>>>
>>> Both of these are complex because of the potential for concurrent calls to
>>> memory compaction when faulting thp on several cpus.
>>>
>>> I also think the second point from that email still applies, that we
>>> should abort isolating pages within a pageblock for migration once it can
>>> no longer allow a cc->order allocation to succeed.
>>
>> That was the RFC patch 15, I hope to reintroduce it soon.
>
> Which of the points above are you planning on addressing in another patch?
> I think the approach would cause the above to be mutually exclusive
> options.

Oh I meant the quick abort of a pageblock that's not going to succeed.
That was the RFC patch. As for the single expensive fault + defer vs
lots of inexpensive faults, I would favor the latter. I'd rather avoid
bug reports such as "It works fine for a while and then we get this
weird few seconds of stall", which is exactly what you were dealing with
IIRC?

>> You could still test
>> it meanwhile to see if you see the same extfrag regression as me. In my tests,
>> kswapd/khugepaged wasn't doing enough work to defragment the pageblocks that
>> the stress-highalloc benchmark (configured to behave like thp page fault) was
>> skipping.
>>
>
> The initial regression that I encountered was on a 128GB machine where
> async compaction would cause faulting 64MB of transparent hugepages to
> stall excessively, and I don't see how kswapd can address this if there's
> no memory pressure, or how khugepaged can address it with its default
> settings, which are very slow.

Hm I see. I have been thinking about somehow connecting compaction with
the extfrag (page stealing) events. For example, if it's about to
allocate an UNMOVABLE/RECLAIMABLE page in a MOVABLE pageblock, then try to
compact the pageblock first, which will hopefully free enough of it to
have it remarked as UNMOVABLE/RECLAIMABLE and satisfy many such
allocations without having to steal from another one.

> Another idea I had is to only do async memory compaction for thp on local
> zones and avoid defragmenting remotely since, in my experimentation,
> remote thp memory causes a performance degradation over regular pages. If
> that solution were to involve zone_reclaim_mode and a test of
> node_distance() > RECLAIM_DISTANCE, I think that would be acceptable as
> well.

Yes, not compacting remote zones on page fault definitely makes sense.
Maybe even without zone_reclaim_mode...

2014-07-30 09:27:59

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v5 05/14] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On 07/30/2014 01:51 AM, David Rientjes wrote:
> On Wed, 30 Jul 2014, Kirill A. Shutemov wrote:
>
>>> Hmm, I'm confused at how that could be true, could you explain what
>>> memory other than thp can return true for PageTransHuge()?
>>
>> PageTransHuge() will be true for any head of compound page if THP is
>> enabled compile time: hugetlbfs, slab, whatever.
>>
>
> I meant in the context of the patch :) Since PageLRU is set, that
> discounts slab, so we're left with thp or hugetlbfs. Logically, both
> should have sizes that are >= the size of the pageblock itself, so I'm not
> sure why we don't unconditionally align up to pageblock_nr_pages here. Is
> there a legitimate configuration where a pageblock will span multiple
> pages of HPAGE_PMD_ORDER?

I think Joonsoo mentioned in some previous iteration that some arches
may have this, but I have no idea.
Perhaps we could use HPAGE_PMD_ORDER instead of compound_order()?

In the locked case we know that PageLRU could not change so it still has
to be a huge page so we know it's possible order.

In the !locked case, I'm now not even sure the current code is safe
enough. What if we pass the PageLRU check, but before the PageTransHuge
check a compound page (THP or otherwise) materializes and we are at one
of the tail pages? Then, in a DEBUG_VM configuration, the
VM_BUG_ON_PAGE(PageTail(page), page) check in PageTransHuge() could fire.

2014-07-30 09:40:00

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v5 05/14] mm, compaction: move pageblock checks up from isolate_migratepages_range()

On 07/30/2014 01:02 AM, David Rientjes wrote:
>>>>
>>>> /*
>>>> - * Isolate all pages that can be migrated from the block pointed to by
>>>> - * the migrate scanner within compact_control.
>>>> + * Isolate all pages that can be migrated from the first suitable block,
>>>> + * starting at the block pointed to by the migrate scanner pfn within
>>>> + * compact_control.
>>>> */
>>>> static isolate_migrate_t isolate_migratepages(struct zone *zone,
>>>> struct compact_control *cc)
>>>> {
>>>> unsigned long low_pfn, end_pfn;
>>>> + struct page *page;
>>>> + const isolate_mode_t isolate_mode =
>>>> + (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
>>>>
>>>> - /* Do not scan outside zone boundaries */
>>>> - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
>>>> + /*
>>>> + * Start at where we last stopped, or beginning of the zone as
>>>> + * initialized by compact_zone()
>>>> + */
>>>> + low_pfn = cc->migrate_pfn;
>>>>
>>>> /* Only scan within a pageblock boundary */
>>>> end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>>>>
>>>> - /* Do not cross the free scanner or scan within a memory hole */
>>>> - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
>>>> - cc->migrate_pfn = end_pfn;
>>>> - return ISOLATE_NONE;
>>>> - }
>>>> + /*
>>>> + * Iterate over whole pageblocks until we find the first suitable.
>>>> + * Do not cross the free scanner.
>>>> + */
>>>> + for (; end_pfn <= cc->free_pfn;
>>>> + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
>>>> +
>>>> + /*
>>>> + * This can potentially iterate a massively long zone with
>>>> + * many pageblocks unsuitable, so periodically check if we
>>>> + * need to schedule, or even abort async compaction.
>>>> + */
>>>> + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
>>>> + && compact_should_abort(cc))
>>>> + break;
>>>> +
>>>> + /* Skip whole pageblock in case of a memory hole */
>>>> + if (!pfn_valid(low_pfn))
>>>> + continue;
>>>> +
>>>> + page = pfn_to_page(low_pfn);
>>>> +
>>>> + /* If isolation recently failed, do not retry */
>>>> + if (!isolation_suitable(cc, page))
>>>> + continue;
>>>> +
>>>> + /*
>>>> + * For async compaction, also only scan in MOVABLE blocks.
>>>> + * Async compaction is optimistic to see if the minimum amount
>>>> + * of work satisfies the allocation.
>>>> + */
>>>> + if (cc->mode == MIGRATE_ASYNC &&
>>>> + !migrate_async_suitable(get_pageblock_migratetype(page)))
>>>> + continue;
>>>> +
>>>> + /* Perform the isolation */
>>>> + low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
>>>> + isolate_mode);
>>>
>>> Hmm, why would we want to unconditionally set pageblock_skip if no pages
>>> could be isolated from a pageblock when
>>> isolate_mode == ISOLATE_ASYNC_MIGRATE? It seems like it erroneously skip
>>> pageblocks for cases when isolate_mode == 0.
>>
>> Well pageblock_skip is a single bit and you don't know if the next attempt
>> will be async or sync. So now you would maybe skip needlessly if the next
>> attempt would be sync. If we changed that, you wouldn't skip if the next
>> attempt would be async again. Could be that one way is better than the
>> other, but I'm not sure, and would consider it separately.
>> The former patch 15 (quick skip pageblock that won't be fully migrated) could
>> perhaps change the balance here.
>>
>
> That's why we have two separate per-zone cached start pfns, though, right?
> The next call to async compaction should start from where the previous
> caller left off so there would be no need to set pageblock skip in that
> case until we have checked all memory. Or are you considering the case of
> concurrent async compaction?

Ah, well the lifecycle of cached pfn's and pageblock_skip is not
generally in sync. It may be that cached pfn's are reset, but
pageblock_skip bits remain. So this would be one async pass setting
hints for the next async pass.

But maybe we've already reduced the impact of sync compaction enough so
it could now be ignoring pageblock_skip completely, and leave those
hints only for async compaction.
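The crux of the discussion above is that pageblock_skip is a single bit per
pageblock, shared by async and sync passes. A minimal model of that semantics
(hypothetical structure and helper names, not the kernel's actual code) could
look like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the single shared skip bit: an async pass that
 * fails to isolate anything sets the bit, and a later sync pass -- which
 * might well have succeeded -- then skips the pageblock too, because the
 * bit does not record which compaction mode set it. */
struct pageblock {
	bool skip;		/* models the pageblock_skip bit */
	bool async_isolatable;	/* could async compaction isolate here? */
	bool sync_isolatable;	/* could sync compaction isolate here? */
};

static bool try_isolate(struct pageblock *pb, bool sync)
{
	if (pb->skip)
		return false;	/* honor the hint, whoever set it */
	if (sync ? pb->sync_isolatable : pb->async_isolatable)
		return true;
	pb->skip = true;	/* isolation failure sets the shared bit */
	return false;
}
```

A failed async pass thus poisons the hint for a subsequent sync pass, which is
why making only sync compaction ignore the bit is being floated above.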

2014-07-30 09:57:04

by Vlastimil Babka

Subject: Re: [PATCH v5 14/14] mm, compaction: try to capture the just-created high-order freepage

On 07/30/2014 10:39 AM, Joonsoo Kim wrote:
> On Tue, Jul 29, 2014 at 05:34:37PM +0200, Vlastimil Babka wrote:
>> Could do it in isolate_migratepages() for whole pageblocks only (as
>> David's patch did), but that restricts the usefulness. Or maybe do
>> it fine grained by calling isolate_migratepages_block() multiple
>> times. But the overhead of multiple calls would probably suck even
>> more for lower-order compactions. For CMA the added overhead is
>> basically only checks for next_capture_pfn that will be always
>> false, so predictable. And mostly just in branches where isolation
>> is failing, which is not the CMA's "fast path" I guess?
>
> You can do it fine grained with compact_control's migratepages list
> or new private list. If some pages are isolated and added to this list,
> you can check pfn of page on this list and determine appropriate capture
> candidate page. This approach can give us more flexibility for
> choosing capture candidate without adding more complexity to
> common function. For example, you can choose capture candidate if
> there are XX isolated pages in certain range.

Hm I see. But the logic added by page capture was also a prerequisite
for the "[RFC PATCH V4 15/15] mm, compaction: do not migrate pages when
that cannot satisfy page fault allocation"
http://marc.info/?l=linux-mm&m=140551859423716&w=2

And that could be hardly done by a post-isolation inspection of the
migratepages list. And I haven't given up on that idea yet :)

>>> In __isolate_free_page(), we check zone_watermark_ok() with order 0.
>>> But normal allocation logic would check zone_watermark_ok() with requested
>>> order. Your capture logic uses __isolate_free_page() and it would
>>> affect compaction success rate significantly. And it means that
>>> capture logic allocates high order page on page allocator
>>> too aggressively compared to other component such as normal high order
>>
>> It's either that, or the extra lru drain that makes the difference.
>> But the "aggressiveness" would in fact mean better accuracy.
>> Watermark checking may be inaccurate. Especially when memory is
>> close to the watermark and there is only a single high-order page
>> that would satisfy the allocation.
>
> If this "aggressiveness" means better accuracy, fixing general
> function, watermark_ok() is better than adding capture logic.

That's if fixing the function wouldn't add significant overhead to all
the callers. And making it non-racy and not prone to per-cpu counter
drifts would certainly do that :(

> But, I guess that there is a reason that watermark_ok() is so
> conservative. If page allocator aggressively provides high order page,
> future atomic high order page request cannot succeed easily. For
> preventing this situation, watermark_ok() should be conservative.

I don't think it's intentionally conservative, just unreliable. It tests
two things together:

1) are there enough free pages for the allocation wrt watermarks?
2) does it look like there is a free page of the requested order?

Check 1) works fine and my patch won't change that by passing order=0.
The problem is with 2), which is unreliable, especially when close to
the watermarks. Note that it's not trying to keep some reserves for
atomic requests; that's what MIGRATE_RESERVE is for. It's just
unreliable at deciding whether a high-order page is available, even
though its allocation would preserve the watermarks, so there is no
good reason to prevent the allocation. So it will often pass when
deciding to stop compaction, and then fail when allocating.
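For reference, the order-aware part of the check looks roughly like this (a
simplified sketch in the spirit of __zone_watermark_ok() at the time, not the
verbatim kernel source; the per-order discounting is what makes check 2)
unreliable near the mark):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified sketch of the order-aware watermark check. free_pages comes
 * from a vmstat counter subject to per-cpu drift, and the loop below only
 * estimates whether a page of the requested order exists by discounting
 * pages that sit in too-small blocks. */
static bool watermark_ok_sketch(unsigned long free_pages,
				unsigned long min,
				const unsigned long nr_free[],
				int order)
{
	int o;

	if (free_pages <= min)		/* check 1): overall watermark */
		return false;

	for (o = 0; o < order; o++) {
		/* check 2): discount blocks unusable for this order;
		 * nr_free[o] counts free blocks of 2^o pages */
		free_pages -= nr_free[o] << o;
		min >>= 1;		/* relax the mark for higher orders */
		if (free_pages <= min)
			return false;
	}
	return true;
}
```

Near the watermark, the discounted estimate can flip either way between the
compaction-stop check and the actual allocation, which matches the
"pass when deciding to stop compaction, then fail when allocating" symptom.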

> Thanks.
>

2014-07-30 14:19:16

by Joonsoo Kim

Subject: Re: [PATCH v5 14/14] mm, compaction: try to capture the just-created high-order freepage

Oops... resend because of omitting everyone on CC.

2014-07-30 18:56 GMT+09:00 Vlastimil Babka <[email protected]>:
> On 07/30/2014 10:39 AM, Joonsoo Kim wrote:
>>
>> On Tue, Jul 29, 2014 at 05:34:37PM +0200, Vlastimil Babka wrote:
>>>
>>> Could do it in isolate_migratepages() for whole pageblocks only (as
>>> David's patch did), but that restricts the usefulness. Or maybe do
>>> it fine grained by calling isolate_migratepages_block() multiple
>>> times. But the overhead of multiple calls would probably suck even
>>> more for lower-order compactions. For CMA the added overhead is
>>> basically only checks for next_capture_pfn that will be always
>>> false, so predictable. And mostly just in branches where isolation
>>> is failing, which is not the CMA's "fast path" I guess?
>>
>>
>> You can do it fine grained with compact_control's migratepages list
>> or new private list. If some pages are isolated and added to this list,
>> you can check pfn of page on this list and determine appropriate capture
>> candidate page. This approach can give us more flexibility for
>> choosing capture candidate without adding more complexity to
>> common function. For example, you can choose capture candidate if
>> there are XX isolated pages in certain range.
>
>
> Hm I see. But the logic added by page capture was also a prerequisite for
> the "[RFC PATCH V4 15/15] mm, compaction: do not migrate pages when that
> cannot satisfy page fault allocation"
> http://marc.info/?l=linux-mm&m=140551859423716&w=2
>
> And that could be hardly done by a post-isolation inspection of the
> migratepages list. And I haven't given up on that idea yet :)

Okay. I didn't look at that patch yet. I will try later :)

>
>>>> In __isolate_free_page(), we check zone_watermark_ok() with order 0.
>>>> But normal allocation logic would check zone_watermark_ok() with
>>>> requested
>>>> order. Your capture logic uses __isolate_free_page() and it would
>>>> affect compaction success rate significantly. And it means that
>>>> capture logic allocates high order page on page allocator
>>>> too aggressively compared to other component such as normal high order
>>>
>>>
>>> It's either that, or the extra lru drain that makes the difference.
>>> But the "aggressiveness" would in fact mean better accuracy.
>>> Watermark checking may be inaccurate. Especially when memory is
>>> close to the watermark and there is only a single high-order page
>>> that would satisfy the allocation.
>>
>>
>> If this "aggressiveness" means better accuracy, fixing general
>> function, watermark_ok() is better than adding capture logic.
>
>
> That's if fixing the function wouldn't add significant overhead to all the
> callers. And making it non-racy and not prone to per-cpu counter drifts
> would certainly do that :(
>
>
>> But, I guess that there is a reason that watermark_ok() is so
>> conservative. If page allocator aggressively provides high order page,
>> future atomic high order page request cannot succeed easily. For
>> preventing this situation, watermark_ok() should be conservative.
>
>
> I don't think it's intentionally conservative, just unreliable. It tests two
> things together:
>
> 1) are there enough free pages for the allocation wrt watermarks?
> 2) does it look like there is a free page of the requested order?

I don't think that watermark_ok()'s intention is checking if there is a free
page of the requested order. If we wanted to know that, we could use an
easier way, something like below.

X = total number of free pages - number of free pages of order lower
than the requested order.
If X is positive, we can conclude that there is at least one free page
of the requested order, and this equation is easy to compute.

But, watermark_ok() doesn't do that. Instead, it uses the mark value to
determine if we can go further. I guess this means that the
allocation/reclaim logic wants to preserve a certain level of high-order
free pages according to system memory size, although I don't know what
the reason is exactly. So the "aggressiveness" of the capture logic here
could break what allocation/reclaim want.
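That simpler computation could be sketched like this (a hypothetical helper
over per-order free counts, in the spirit of zone->free_area[].nr_free; not
actual kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the "X" check described above: given per-order
 * free block counts, decide whether at least one free page of the
 * requested order (or higher) exists. This is deliberately simpler than
 * watermark_ok(), which is the point of the comparison being made. */
static int high_order_page_available(const unsigned long nr_free[],
				     int max_order, int requested_order)
{
	unsigned long total = 0, lower = 0;
	int order;

	for (order = 0; order < max_order; order++) {
		/* nr_free[order] counts free blocks of 2^order pages */
		unsigned long pages = nr_free[order] << order;

		total += pages;
		if (order < requested_order)
			lower += pages;
	}
	/* X = total free pages - free pages in blocks of lower order */
	return total - lower > 0;
}
```

Unlike this direct count, watermark_ok() folds the question into the mark
value, which is what the paragraph above reads as an intentional reserve.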

Thanks.

2014-07-30 15:05:44

by Vlastimil Babka

Subject: Re: [PATCH v5 14/14] mm, compaction: try to capture the just-created high-order freepage

On 07/30/2014 04:19 PM, Joonsoo Kim wrote:
>>> But, I guess that there is a reason that watermark_ok() is so
>>> conservative. If page allocator aggressively provides high order page,
>>> future atomic high order page request cannot succeed easily. For
>>> preventing this situation, watermark_ok() should be conservative.
>>
>>
>> I don't think it's intentionally conservative, just unreliable. It tests two
>> things together:
>>
>> 1) are there enough free pages for the allocation wrt watermarks?
>> 2) does it look like there is a free page of the requested order?
>
> I don't think that watermark_ok()'s intention is checking if there is a free
> page of the requested order. If we wanted to know that, we could use an
> easier way, something like below.
>
> X = total number of free pages - number of free pages of order lower
> than the requested order.
> If X is positive, we can conclude that there is at least one free page
> of the requested order, and this equation is easy to compute.

I thought that's basically what it does, but...

> But, watermark_ok() doesn't do that. Instead, it uses the mark value to
> determine if we can go further. I guess this means that the
> allocation/reclaim logic wants to preserve a certain level of high-order
> free pages according to system memory size, although I don't know what
> the reason is exactly. So the "aggressiveness" of the capture logic here
> could break what allocation/reclaim want.

Hm I see your point. So OK, I will check whether the order=0 makes a
difference for page capture or not.

> Thanks.
>

2014-07-30 16:22:42

by Vlastimil Babka

Subject: Re: [PATCH v5 02/14] mm, compaction: defer each zone individually instead of preferred zone

On 07/29/2014 11:12 AM, Vlastimil Babka wrote:
> On 07/29/2014 08:38 AM, Joonsoo Kim wrote:
>>> /* Return values for compact_zone() and try_to_compact_pages() */
>>> +/* compaction didn't start as it was deferred due to past failures */
>>> +#define COMPACT_DEFERRED 0
>>> /* compaction didn't start as it was not possible or direct reclaim was more suitable */
>>> -#define COMPACT_SKIPPED 0
>>> +#define COMPACT_SKIPPED 1
>>
>> Hello,
>>
>> This change makes some users of compaction_suitable() fail
>> unintentionally, because they assume that COMPACT_SKIPPED is 0.
>> Please fix them according to this change.
>
> Oops, good catch. Thanks!
>
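For illustration, the hazard Joonsoo caught could look like this (hypothetical
stub names, not actual kernel callers):

```c
#include <assert.h>

/* Hypothetical illustration of the breakage: callers that truth-test the
 * return value, relying on COMPACT_SKIPPED being 0, keep compiling but
 * silently change meaning once the constants are renumbered. */
#define COMPACT_DEFERRED 0	/* new value introduced by the patch */
#define COMPACT_SKIPPED  1	/* was 0 before the patch */

/* Stub standing in for compaction_suitable() on an unsuitable zone. */
static int compaction_suitable_stub(void)
{
	return COMPACT_SKIPPED;
}

/* Old idiom: "did it return anything but skipped?" -- with the
 * renumbering, this is now wrongly true for a skipped zone. */
static int looks_suitable(void)
{
	return compaction_suitable_stub() != 0;
}
```

Such callers need to compare against the named constants explicitly after the
renumbering.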
>>> @@ -2324,27 +2327,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>>> order, zonelist, high_zoneidx,
>>> alloc_flags & ~ALLOC_NO_WATERMARKS,
>>> preferred_zone, classzone_idx, migratetype);
>>> +
>>> if (page) {
>>> - preferred_zone->compact_blockskip_flush = false;
>>> - compaction_defer_reset(preferred_zone, order, true);
>>> + struct zone *zone = page_zone(page);
>>> +
>>> + zone->compact_blockskip_flush = false;
>>> + compaction_defer_reset(zone, order, true);
>>> count_vm_event(COMPACTSUCCESS);
>>> return page;
>>> }
>>>
>>> /*
>>> + * last_compact_zone is where try_to_compact_pages thought
>>> + * allocation should succeed, so it did not defer compaction.
>>> + * But now we know that it didn't succeed, so we do the defer.
>>> + */
>>> + if (last_compact_zone && mode != MIGRATE_ASYNC)
>>> + defer_compaction(last_compact_zone, order);
>>
>> I still don't understand why defer_compaction() is needed here.
>> defer_compaction() is intended to avoid struggling with compaction on
>> a zone where we have already tried compaction and found that it
>> isn't suitable. Allocation failure doesn't tell us that we have tried
>> compaction over the whole zone range, so we shouldn't carelessly
>> make a decision here to defer compaction on this zone.
>
> OK I can remove that, it should make the code nicer anyway.

Weird, that removal of this defer_compaction() call seems to have
quadrupled the compact_stall and compact_fail counts. The scanner pages
counters however increased by only 10%, which could indicate that the
problem is occurring only in a small zone such as DMA. Could be another
case of mismatch between watermark checking in compaction and
allocation? Perhaps the lack of a proper classzone_idx in the compaction
check? Sigh.

> I also agree
> with the argument "for all the zone range" and I also realized that it's
> not (both before and after this patch) really the case. I planned to fix
> that in the future, but I can probably do it now.
> The plan is to call defer_compaction() only when compaction returned
> COMPACT_COMPLETE (and not COMPACT_PARTIAL) as it means the whole zone
> was scanned. Otherwise there will be bias towards the beginning of the
> zone in the migration scanner - compaction will be deferred half-way and
> then cached pfn's might be reset when it restarts, and the rest of the
> zone won't be scanned at all.

Hm, despite expectations, this didn't seem to make much difference. But
maybe it will once I have some idea what happened to those stalls.

>> Thanks.
>>
>