2012-08-07 12:31:22

by Mel Gorman

Subject: [RFC PATCH 0/6] Improve hugepage allocation success rates under load

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case. There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].

This series aims to improve the allocation success rates without regressing
the benefits of commit fe2c2a10. The series is based on 3.5 and includes
commit 7db8889a to illustrate what impact it has on success rates.

Patch 1 updates a stale comment seeing as I was in the general area.

Patch 2 updates reclaim/compaction to reclaim pages scaled by the number
of recent failures.

Patch 3 has kswapd use similar logic to direct reclaim when deciding whether
to continue reclaiming for reclaim/compaction or not.

Patch 4 captures suitable high-order pages freed by compaction to reduce
races with parallel allocation requests.

Patch 5 is an upstream commit that has compaction restart free page scanning
from an old position instead of always starting from the end of the
zone.

Patch 6 adjusts patch 5 to restore allocation success rates.

STRESS-HIGHALLOC
3.5.0-vanilla patches:1-2 patches:1-3 patches:1-4 patches:1-5 patches:1-6
Pass 1 36.00 ( 0.00%) 61.00 (25.00%) 49.00 (13.00%) 57.00 (21.00%) 0.00 (-36.00%) 62.00 (26.00%)
Pass 2 46.00 ( 0.00%) 61.00 (15.00%) 55.00 ( 9.00%) 62.00 (16.00%) 0.00 (-46.00%) 63.00 (17.00%)
while Rested 84.00 ( 0.00%) 85.00 ( 1.00%) 84.00 ( 0.00%) 86.00 ( 2.00%) 86.00 ( 2.00%) 86.00 ( 2.00%)

From
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
I know that the allocation success rate in 3.3.6 was 78% in comparison
to 36% in 3.5. With the full series applied, the success rate is up
to 62%, which is still much lower, but the series does not reclaim excessively.
Note what patch 5, which is the upstream commit 7db8889a, did to allocation
success rates.

MMTests Statistics: vmstat
Page Ins 3037580 3167260 3002720 3120080 2885540 3159024
Page Outs 8026888 8028472 8023292 8031056 8025324 8026676
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 8

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
there were 71881 pages swapped out.

Direct pages scanned 97106 59600 43926 108327 2109 171530
Kswapd pages scanned 1231288 1419472 1388888 1443504 1180916 1377362
Kswapd pages reclaimed 1231221 1419248 1358130 1427561 1164936 1372875
Direct pages reclaimed 97100 59486 24233 88990 2109 171235
Kswapd efficiency 99% 99% 97% 98% 98% 99%
Kswapd velocity 1001.153 1129.622 1098.647 1080.758 955.967 1084.657
Direct efficiency 99% 99% 55% 82% 100% 99%
Direct velocity 78.956 47.430 34.747 81.105 1.707 135.078

kswapd velocity stays at around 1000 pages/second which is reasonable. In
kernel 3.3.6, it was 8140 pages/second.

include/linux/compaction.h | 4 +-
include/linux/mm.h | 1 +
include/linux/mmzone.h | 4 ++
mm/compaction.c | 142 +++++++++++++++++++++++++++++++++++++-------
mm/internal.h | 7 +++
mm/page_alloc.c | 68 ++++++++++++++++-----
mm/vmscan.c | 29 ++++++++-
7 files changed, 213 insertions(+), 42 deletions(-)

--
1.7.9.2


2012-08-07 12:31:23

by Mel Gorman

Subject: [PATCH 1/6] mm: compaction: Update comment in try_to_compact_pages

The comment about order applied when the check was
order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
[c5a73c3d: thp: use compaction for all allocation orders]. Fixing
the comment while I'm in the general area.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/compaction.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b39ede1..95ca967 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -759,11 +759,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zone *zone;
int rc = COMPACT_SKIPPED;

- /*
- * Check whether it is worth even starting compaction. The order check is
- * made because an assumption is made that the page allocator can satisfy
- * the "cheaper" orders without taking special steps
- */
+ /* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
return rc;

--
1.7.9.2

2012-08-07 12:31:26

by Mel Gorman

Subject: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages

commit [7db8889a: mm: have order > 0 compaction start off where it left]
introduced a caching mechanism to reduce the amount of work the free page
scanner does in compaction. However, it has a problem. Consider two processes
simultaneously scanning free pages

C
Process A M S F
|---------------------------------------|
Process B M FS

C is zone->compact_cached_free_pfn
S is cc->start_free_pfn
M is cc->migrate_pfn
F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

if (cc->order > 0 && (!cc->wrapped ||
zone->compact_cached_free_pfn >
cc->start_free_pfn))
pfn = min(pfn, zone->compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner skips
almost the entire space it should have scanned. When there are multiple
processes compacting, it can end up in a situation where the entire zone is
not being scanned at all. Further, it is possible for two processes to
ping-pong updates to compact_cached_free_pfn, which is effectively random.

Overall, the end result wrecks allocation success rates.

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic because it's not clear that it matters
if two free scanners happen to be in the same block, but with racing updates
it's too easy for the scanner to skip over blocks it should not.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. Each time a wrapped scanner isolates a page, it
updates compact_cached_free_pfn. The intention is that after
wrapping, the compact_cached_free_pfn will be at the highest
pageblock with free pages when compaction completes.

If a scanner has not wrapped when compaction completes and
compact_cached_free_pfn is set to the end of the zone, initialise
it once.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/compaction.c | 54 ++++++++++++++++++++++++++++--------------------------
1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index be310f1..df50f73 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -419,6 +419,20 @@ static bool suitable_migration_target(struct page *page)
}

/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+
+/*
* Based on information in the current compact_control, find blocks
* suitable for isolating free pages from and then isolate them.
*/
@@ -457,17 +471,6 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;

- /*
- * Skip ahead if another thread is compacting in the area
- * simultaneously. If we wrapped around, we can only skip
- * ahead if zone->compact_cached_free_pfn also wrapped to
- * above our starting point.
- */
- if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
- pfn = min(pfn, zone->compact_cached_free_pfn);
-
if (!pfn_valid(pfn))
continue;

@@ -510,7 +513,15 @@ static void isolate_freepages(struct zone *zone,
*/
if (isolated) {
high_pfn = max(high_pfn, pfn);
- if (cc->order > 0)
+
+ /*
+ * If the free scanner has wrapped, update
+ * compact_cached_free_pfn to point to the highest
+ * pageblock with free pages. This reduces excessive
+ * scanning of full pageblocks near the end of the
+ * zone
+ */
+ if (cc->order > 0 && cc->wrapped)
zone->compact_cached_free_pfn = high_pfn;
}
}
@@ -520,6 +531,11 @@ static void isolate_freepages(struct zone *zone,

cc->free_pfn = high_pfn;
cc->nr_freepages = nr_freepages;
+
+ /* If compact_cached_free_pfn is reset then set it now */
+ if (cc->order > 0 && !cc->wrapped &&
+ zone->compact_cached_free_pfn == start_free_pfn(zone))
+ zone->compact_cached_free_pfn = high_pfn;
}

/*
@@ -607,20 +623,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

-/*
- * Returns the start pfn of the last page block in a zone. This is the starting
- * point for full compaction of a zone. Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
- unsigned long free_pfn;
- free_pfn = zone->zone_start_pfn + zone->spanned_pages;
- free_pfn &= ~(pageblock_nr_pages-1);
- return free_pfn;
-}
-
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
--
1.7.9.2

2012-08-07 12:31:44

by Mel Gorman

Subject: [PATCH 5/6] mm: have order > 0 compaction start off where it left

From: Rik van Riel <[email protected]>

This commit is already upstream as [7db8889a: mm: have order > 0 compaction
start off where it left]. It's included in this series to provide context
to the next patch as the series is based on 3.5.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.

This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction. This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially; in that
case we restart from the end of the zone and try once more.

Forced full (order == -1) compactions are left alone.

[[email protected]: checkpatch fixes]
[[email protected]: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <[email protected]>
Reported-by: Jim Schutt <[email protected]>
Tested-by: Jim Schutt <[email protected]>
Cc: Minchan Kim <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 4 +++
mm/compaction.c | 63 ++++++++++++++++++++++++++++++++++++++++++++----
mm/internal.h | 6 +++++
mm/page_alloc.c | 5 ++++
4 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 68c569f..6340f38 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,10 @@ struct zone {
*/
spinlock_t lock;
int all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ /* pfn where the last incremental compaction isolated free pages */
+ unsigned long compact_cached_free_pfn;
+#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
diff --git a/mm/compaction.c b/mm/compaction.c
index 63af8d2..be310f1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -457,6 +457,17 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;

+ /*
+ * Skip ahead if another thread is compacting in the area
+ * simultaneously. If we wrapped around, we can only skip
+ * ahead if zone->compact_cached_free_pfn also wrapped to
+ * above our starting point.
+ */
+ if (cc->order > 0 && (!cc->wrapped ||
+ zone->compact_cached_free_pfn >
+ cc->start_free_pfn))
+ pfn = min(pfn, zone->compact_cached_free_pfn);
+
if (!pfn_valid(pfn))
continue;

@@ -497,8 +508,11 @@ static void isolate_freepages(struct zone *zone,
* looking for free pages, the search will restart here as
* page migration may have returned some pages to the allocator
*/
- if (isolated)
+ if (isolated) {
high_pfn = max(high_pfn, pfn);
+ if (cc->order > 0)
+ zone->compact_cached_free_pfn = high_pfn;
+ }
}

/* split_free_page does not map the pages */
@@ -593,6 +607,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

+/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
@@ -601,8 +629,26 @@ static int compact_finished(struct zone *zone,
if (fatal_signal_pending(current))
return COMPACT_PARTIAL;

- /* Compaction run completes if the migrate and free scanner meet */
- if (cc->free_pfn <= cc->migrate_pfn)
+ /*
+ * A full (order == -1) compaction run starts at the beginning and
+ * end of a zone; it completes when the migrate and free scanner meet.
+ * A partial (order > 0) compaction can start with the free scanner
+ * at a random point in the zone, and may have to restart.
+ */
+ if (cc->free_pfn <= cc->migrate_pfn) {
+ if (cc->order > 0 && !cc->wrapped) {
+ /* We started partway through; restart at the end. */
+ unsigned long free_pfn = start_free_pfn(zone);
+ zone->compact_cached_free_pfn = free_pfn;
+ cc->free_pfn = free_pfn;
+ cc->wrapped = 1;
+ return COMPACT_CONTINUE;
+ }
+ return COMPACT_COMPLETE;
+ }
+
+ /* We wrapped around and ended up where we started. */
+ if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
return COMPACT_COMPLETE;

/*
@@ -708,8 +754,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

/* Setup to move all movable pages to the end of the zone */
cc->migrate_pfn = zone->zone_start_pfn;
- cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
- cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ if (cc->order > 0) {
+ /* Incremental compaction. Start where the last one stopped. */
+ cc->free_pfn = zone->compact_cached_free_pfn;
+ cc->start_free_pfn = cc->free_pfn;
+ } else {
+ /* Order == -1 starts at the end of the zone. */
+ cc->free_pfn = start_free_pfn(zone);
+ }

migrate_prep_local();

diff --git a/mm/internal.h b/mm/internal.h
index 9156714..064f6ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -118,8 +118,14 @@ struct compact_control {
unsigned long nr_freepages; /* Number of isolated free pages */
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long start_free_pfn; /* where we started the search */
unsigned long migrate_pfn; /* isolate_migratepages search base */
bool sync; /* Synchronous migration */
+ bool wrapped; /* Order > 0 compactions are
+ incremental, once free_pfn
+ and migrate_pfn meet, we restart
+ from the top of the zone;
+ remember we wrapped around. */

int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adc3aa8..781d6e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,6 +4425,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->spanned_pages = size;
zone->present_pages = realsize;
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ zone->compact_cached_free_pfn = zone->zone_start_pfn +
+ zone->spanned_pages;
+ zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
+#endif
#ifdef CONFIG_NUMA
zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
--
1.7.9.2

2012-08-07 12:31:45

by Mel Gorman

Subject: [PATCH 4/6] mm: compaction: Capture a suitable high-order page immediately when it is made available

While compaction is moving pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/compaction.h | 4 +--
include/linux/mm.h | 1 +
mm/compaction.c | 71 +++++++++++++++++++++++++++++++++++++-------
mm/internal.h | 1 +
mm/page_alloc.c | 63 +++++++++++++++++++++++++++++----------
5 files changed, 111 insertions(+), 29 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 51a90b7..5673459 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- bool sync);
+ bool sync, struct page **page);
extern int compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);

@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
return COMPACT_CONTINUE;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..0812e86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -454,6 +454,7 @@ void put_pages_list(struct list_head *pages);

void split_page(struct page *page, unsigned int order);
int split_free_page(struct page *page);
+int capture_free_page(struct page *page, int alloc_order, int migratetype);

/*
* Compound pages have a destructor function. Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 95ca967..63af8d2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,41 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+static void compact_capture_page(struct compact_control *cc)
+{
+ unsigned long flags;
+ int mtype;
+
+ if (!cc->page || *cc->page)
+ return;
+
+ /* Speculatively examine the free lists without zone lock */
+ for (mtype = 0; mtype < MIGRATE_PCPTYPES; mtype++) {
+ int order;
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ struct page *page;
+ struct free_area *area;
+ area = &(cc->zone->free_area[order]);
+ if (list_empty(&area->free_list[mtype]))
+ continue;
+
+ /* Take the lock and attempt capture of the page */
+ spin_lock_irqsave(&cc->zone->lock, flags);
+ if (!list_empty(&area->free_list[mtype])) {
+ page = list_entry(area->free_list[mtype].next,
+ struct page, lru);
+ if (capture_free_page(page, cc->order, mtype)) {
+ spin_unlock_irqrestore(&cc->zone->lock,
+ flags);
+ *cc->page = page;
+ return;
+ }
+ }
+ spin_unlock_irqrestore(&cc->zone->lock, flags);
+ }
+ }
+}
+
/*
* Isolate free pages onto a private freelist. Caller must hold zone->lock.
* If @strict is true, will abort returning 0 on any invalid PFNs or non-free
@@ -561,7 +596,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
- unsigned int order;
unsigned long watermark;

if (fatal_signal_pending(current))
@@ -586,14 +620,22 @@ static int compact_finished(struct zone *zone,
return COMPACT_CONTINUE;

/* Direct compactor: Is a suitable page free? */
- for (order = cc->order; order < MAX_ORDER; order++) {
- /* Job done if page is free of the right migratetype */
- if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
- return COMPACT_PARTIAL;
-
- /* Job done if allocation would set block type */
- if (order >= pageblock_order && zone->free_area[order].nr_free)
+ if (cc->page) {
+ /* Was a suitable page captured? */
+ if (*cc->page)
return COMPACT_PARTIAL;
+ } else {
+ unsigned int order;
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ struct free_area *area = &zone->free_area[cc->order];
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&area->free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (order >= pageblock_order && area->nr_free)
+ return COMPACT_PARTIAL;
+ }
}

return COMPACT_CONTINUE;
@@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
goto out;
}
}
+
+ /* Capture a page now if it is a suitable size */
+ if (cc->migratetype == MIGRATE_MOVABLE)
+ compact_capture_page(cc);
}

out:
@@ -720,7 +766,7 @@ out:

static unsigned long compact_zone_order(struct zone *zone,
int order, gfp_t gfp_mask,
- bool sync)
+ bool sync, struct page **page)
{
struct compact_control cc = {
.nr_freepages = 0,
@@ -729,6 +775,7 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
+ .page = page,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -750,7 +797,7 @@ int sysctl_extfrag_threshold = 500;
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -770,7 +817,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

- status = compact_zone_order(zone, order, gfp_mask, sync);
+ status = compact_zone_order(zone, order, gfp_mask, sync, page);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */
@@ -825,6 +872,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
struct compact_control cc = {
.order = order,
.sync = false,
+ .page = NULL,
};

return __compact_pgdat(pgdat, &cc);
@@ -835,6 +883,7 @@ static int compact_node(int nid)
struct compact_control cc = {
.order = -1,
.sync = true,
+ .page = NULL,
};

return __compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..9156714 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -124,6 +124,7 @@ struct compact_control {
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
+ struct page **page; /* Page captured of requested size */
};

unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..adc3aa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1374,16 +1374,11 @@ void split_page(struct page *page, unsigned int order)
}

/*
- * Similar to split_page except the page is already free. As this is only
- * being used for migration, the migratetype of the block also changes.
- * As this is called with interrupts disabled, the caller is responsible
- * for calling arch_alloc_page() and kernel_map_page() after interrupts
- * are enabled.
- *
- * Note: this is probably too low level an operation for use in drivers.
- * Please consult with lkml before using this in your driver.
+ * Similar to the split_page family of functions except that the page
+ * is required at the given order and is being isolated now to prevent
+ * races with parallel allocators
*/
-int split_free_page(struct page *page)
+int capture_free_page(struct page *page, int alloc_order, int migratetype)
{
unsigned int order;
unsigned long watermark;
@@ -1405,10 +1400,11 @@ int split_free_page(struct page *page)
rmv_page_order(page);
__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));

- /* Split into individual pages */
- set_page_refcounted(page);
- split_page(page, order);
+ if (alloc_order != order)
+ expand(zone, page, alloc_order, order,
+ &zone->free_area[order], migratetype);

+ /* Set the pageblock if the captured page is at least a pageblock */
if (order >= pageblock_order - 1) {
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
@@ -1419,7 +1415,35 @@ int split_free_page(struct page *page)
}
}

- return 1 << order;
+ return 1UL << order;
+}
+
+/*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_page() after interrupts
+ * are enabled.
+ *
+ * Note: this is probably too low level an operation for use in drivers.
+ * Please consult with lkml before using this in your driver.
+ */
+int split_free_page(struct page *page)
+{
+ unsigned int order;
+ int nr_pages;
+
+ BUG_ON(!PageBuddy(page));
+ order = page_order(page);
+
+ nr_pages = capture_free_page(page, order, 0);
+ if (!nr_pages)
+ return 0;
+
+ /* Split into individual pages */
+ set_page_refcounted(page);
+ split_page(page, order);
+ return nr_pages;
}

/*
@@ -2065,7 +2089,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
bool *deferred_compaction,
unsigned long *did_some_progress)
{
- struct page *page;
+ struct page *page = NULL;

if (!order)
return NULL;
@@ -2077,10 +2101,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
- nodemask, sync_migration);
+ nodemask, sync_migration, &page);
current->flags &= ~PF_MEMALLOC;
- if (*did_some_progress != COMPACT_SKIPPED) {

+ /* If compaction captured a page, prep and use it */
+ if (page) {
+ prep_new_page(page, order, gfp_mask);
+ goto got_page;
+ }
+
+ if (*did_some_progress != COMPACT_SKIPPED) {
/* Page migration frees to the PCP lists but we want merging */
drain_pages(get_cpu());
put_cpu();
@@ -2090,6 +2120,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
alloc_flags, preferred_zone,
migratetype);
if (page) {
+got_page:
preferred_zone->compact_considered = 0;
preferred_zone->compact_defer_shift = 0;
if (order >= preferred_zone->compact_order_failed)
--
1.7.9.2

2012-08-07 12:32:26

by Mel Gorman

Subject: [PATCH 3/6] mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed

When direct reclaim is running reclaim/compaction, there is a minimum
number of pages it reclaims. As it must be under the low watermark to be
in direct reclaim it has also woken kswapd to do some work. This patch
has kswapd use the same logic as direct reclaim to reclaim a minimum
number of pages so compaction can run later.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0cb2593..afdec93 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1701,7 +1701,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
* calls try_to_compact_zone() that it will have enough free pages to succeed.
* It will give up earlier than that if there is difficulty reclaiming pages.
*/
-static inline bool should_continue_reclaim(struct lruvec *lruvec,
+static bool should_continue_reclaim(struct lruvec *lruvec,
unsigned long nr_reclaimed,
unsigned long nr_scanned,
struct scan_control *sc)
@@ -1768,6 +1768,17 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
}
}

+static inline bool should_continue_reclaim_zone(struct zone *zone,
+ unsigned long nr_reclaimed,
+ unsigned long nr_scanned,
+ struct scan_control *sc)
+{
+ struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+
+ return should_continue_reclaim(lruvec, nr_reclaimed, nr_scanned, sc);
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -2496,8 +2507,10 @@ loop_again:
*/
testorder = order;
if (COMPACTION_BUILD && order &&
- compaction_suitable(zone, order) !=
- COMPACT_SKIPPED)
+ !should_continue_reclaim_zone(zone,
+ nr_soft_reclaimed,
+ sc.nr_scanned - nr_soft_scanned,
+ &sc))
testorder = 0;

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
--
1.7.9.2

2012-08-07 12:32:47

by Mel Gorman

Subject: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

If allocation fails after compaction then compaction may be deferred for
a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This patch
uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..0cb2593 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ struct zone *zone;

/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))
@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
+
+ /*
+ * If compaction is deferred for this order then scale the number of
+ * pages reclaimed based on the number of consecutive allocation
+ * failures
+ */
+ zone = lruvec_zone(lruvec);
+ if (zone->compact_order_failed >= sc->order)
+ pages_for_compaction <<= zone->compact_defer_shift;
inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
if (nr_swap_pages > 0)
inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
--
1.7.9.2

2012-08-07 13:19:37

by Rik van Riel

Subject: Re: [PATCH 1/6] mm: compaction: Update comment in try_to_compact_pages

On 08/07/2012 08:31 AM, Mel Gorman wrote:
> The comment about order applied when the check was
> order> PAGE_ALLOC_COSTLY_ORDER which has not been the case since
> [c5a73c3d: thp: use compaction for all allocation orders]. Fixing
> the comment while I'm in the general area.
>
> Signed-off-by: Mel Gorman<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2012-08-07 13:23:48

by Rik van Riel

Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On 08/07/2012 08:31 AM, Mel Gorman wrote:
> If allocation fails after compaction then compaction may be deferred for
> a number of allocation attempts. If there are subsequent failures,
> compact_defer_shift is increased to defer for longer periods. This patch
> uses that information to scale the number of pages reclaimed with
> compact_defer_shift until allocations succeed again.
>
> Signed-off-by: Mel Gorman<[email protected]>

Acked-by: Rik van Riel <[email protected]>


--
All rights reversed

2012-08-07 13:26:24

by Rik van Riel

Subject: Re: [PATCH 3/6] mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed

On 08/07/2012 08:31 AM, Mel Gorman wrote:
> When direct reclaim is running reclaim/compaction, there is a minimum
> number of pages it reclaims. As it must be under the low watermark to be
> in direct reclaim it has also woken kswapd to do some work. This patch
> has kswapd use the same logic as direct reclaim to reclaim a minimum
> number of pages so compaction can run later.
>
> Signed-off-by: Mel Gorman<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2012-08-07 13:30:23

by Rik van Riel

Subject: Re: [PATCH 4/6] mm: compaction: Capture a suitable high-order page immediately when it is made available

On 08/07/2012 08:31 AM, Mel Gorman wrote:
> While compaction is moving pages to free up large contiguous blocks for
> allocation it races with other allocation requests that may steal these
> blocks or break them up. This patch alters direct compaction to capture a
> suitable free page as soon as it becomes available to reduce this race. It
> uses similar logic to split_free_page() to ensure that watermarks are
> still obeyed.
>
> Signed-off-by: Mel Gorman<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2012-08-07 14:45:36

by Rik van Riel

Subject: Re: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages

On 08/07/2012 08:31 AM, Mel Gorman wrote:
> commit [7db8889a: mm: have order> 0 compaction start off where it left]
> introduced a caching mechanism to reduce the amount work the free page
> scanner does in compaction. However, it has a problem. Consider two process
> simultaneously scanning free pages
>
> C
> Process A M S F
> |---------------------------------------|
> Process B M FS

Argh. Good spotting.

> This is not optimal and it can still race but the compact_cached_free_pfn
> will be pointing to or very near a pageblock with free pages.

Agreed on the "not optimal", but I also cannot think of a better
idea right now. Getting this fixed for 3.6 is important, we can
think of future optimizations in San Diego.

> Signed-off-by: Mel Gorman<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2012-08-07 14:52:42

by Mel Gorman

Subject: Re: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages

On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:
> On 08/07/2012 08:31 AM, Mel Gorman wrote:
> >commit [7db8889a: mm: have order> 0 compaction start off where it left]
> >introduced a caching mechanism to reduce the amount work the free page
> >scanner does in compaction. However, it has a problem. Consider two process
> >simultaneously scanning free pages
> >
> > C
> >Process A M S F
> > |---------------------------------------|
> >Process B M FS
>
> Argh. Good spotting.
>
> >This is not optimal and it can still race but the compact_cached_free_pfn
> >will be pointing to or very near a pageblock with free pages.
>
> Agreed on the "not optimal", but I also cannot think of a better
> idea right now. Getting this fixed for 3.6 is important, we can
> think of future optimizations in San Diego.
>

Sounds like a plan.

> >Signed-off-by: Mel Gorman<[email protected]>
>
> Reviewed-by: Rik van Riel <[email protected]>
>

Thanks very much.

Jim, what are the chances of getting this series tested with your large
data workload? As it's on top of 3.5, it should be less scary than
testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
test with just this patch on top.

--
Mel Gorman
SUSE Labs

2012-08-07 15:38:27

by Jim Schutt

Subject: Re: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages

On 08/07/2012 08:52 AM, Mel Gorman wrote:
> On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:
>> On 08/07/2012 08:31 AM, Mel Gorman wrote:
>>> commit [7db8889a: mm: have order> 0 compaction start off where it left]
>>> introduced a caching mechanism to reduce the amount work the free page
>>> scanner does in compaction. However, it has a problem. Consider two process
>>> simultaneously scanning free pages
>>>
>>> C
>>> Process A M S F
>>> |---------------------------------------|
>>> Process B M FS
>>
>> Argh. Good spotting.
>>
>>> This is not optimal and it can still race but the compact_cached_free_pfn
>>> will be pointing to or very near a pageblock with free pages.
>>
>> Agreed on the "not optimal", but I also cannot think of a better
>> idea right now. Getting this fixed for 3.6 is important, we can
>> think of future optimizations in San Diego.
>>
>
> Sounds like a plan.
>
>>> Signed-off-by: Mel Gorman<[email protected]>
>>
>> Reviewed-by: Rik van Riel<[email protected]>
>>
>
> Thanks very much.
>
> Jim, what are the chances of getting this series tested with your large
> data workload? As it's on top of 3.5, it should be less scary than
> testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
> test with just this patch on top.
>

As it turns out I'm already testing 3.6-rc1, as I'm on
the trail of a Ceph client messaging bug. I think I've
about got that figured out, and am working on a patch, but
I need it fixed in order to generate enough load to trigger
the problem that your patch addresses.

Which is a long-winded way of saying: no problem, I'll
roll this into my current testing, but I'll need another
day or two before I'm likely to be able to generate a
high enough load to test effectively. OK?

Also FWIW, it occurs to me that you might be interested
to know that my load also involves lots of network load
where I'm using jumbo frames. I suspect that puts even
more stress on higher page order allocations, right?

-- Jim

2012-08-07 15:45:35

by Mel Gorman

Subject: Re: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages

On Tue, Aug 07, 2012 at 09:20:08AM -0600, Jim Schutt wrote:
> On 08/07/2012 08:52 AM, Mel Gorman wrote:
> >On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:
> >>On 08/07/2012 08:31 AM, Mel Gorman wrote:
> >>>commit [7db8889a: mm: have order> 0 compaction start off where it left]
> >>>introduced a caching mechanism to reduce the amount work the free page
> >>>scanner does in compaction. However, it has a problem. Consider two process
> >>>simultaneously scanning free pages
> >>>
> >>> C
> >>>Process A M S F
> >>> |---------------------------------------|
> >>>Process B M FS
> >>
> >>Argh. Good spotting.
> >>
> >>>This is not optimal and it can still race but the compact_cached_free_pfn
> >>>will be pointing to or very near a pageblock with free pages.
> >>
> >>Agreed on the "not optimal", but I also cannot think of a better
> >>idea right now. Getting this fixed for 3.6 is important, we can
> >>think of future optimizations in San Diego.
> >>
> >
> >Sounds like a plan.
> >
> >>>Signed-off-by: Mel Gorman<[email protected]>
> >>
> >>Reviewed-by: Rik van Riel<[email protected]>
> >>
> >
> >Thanks very much.
> >
> >Jim, what are the chances of getting this series tested with your large
> >data workload? As it's on top of 3.5, it should be less scary than
> >testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
> >test with just this patch on top.
> >
>
> As it turns out I'm already testing 3.6-rc1, as I'm on
> the trail of a Ceph client messaging bug. I think I've
> about got that figured out, and am working on a patch, but
> I need it fixed in order to generate enough load to trigger
> the problem that your patch addresses.
>

Grand, good luck with the Ceph bug.

> Which is a long-winded way of saying: no problem, I'll
> roll this into my current testing, but I'll need another
> day or two before I'm likely to be able to generate a
> high enough load to test effectively. OK?
>

That is perfectly reasonable, thanks.

> Also FWIW, it occurs to me that you might be interested
> to know that my load also involves lots of network load
> where I'm using jumbo frames. I suspect that puts even
> more stress on higher page order allocations, right?
>

It might. It depends on whether the underlying driver needs contiguous
pages to handle jumbo frames, whether it can do scatter/gather IO, or some
combination like trying for a contiguous page but using scatter/gather as
a fallback. Certainly it is interesting and I will keep it in mind.

--
Mel Gorman
SUSE Labs

2012-08-07 23:24:29

by Minchan Kim

Subject: Re: [PATCH 1/6] mm: compaction: Update comment in try_to_compact_pages

On Tue, Aug 07, 2012 at 01:31:12PM +0100, Mel Gorman wrote:
> The comment about order applied when the check was
> order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
> [c5a73c3d: thp: use compaction for all allocation orders]. Fixing
> the comment while I'm in the general area.
>
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2012-08-08 01:46:57

by Minchan Kim

Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

Hi Mel,

Just out of curiosity.
What's the problem did you see? (ie, What's the problem do this patch solve?)
AFAIUC, it seem to solve consecutive allocation success ratio through
getting several free pageblocks all at once in a process/kswapd
reclaim context. Right?

On Tue, Aug 07, 2012 at 01:31:13PM +0100, Mel Gorman wrote:
> If allocation fails after compaction then compaction may be deferred for
> a number of allocation attempts. If there are subsequent failures,
> compact_defer_shift is increased to defer for longer periods. This patch
> uses that information to scale the number of pages reclaimed with
> compact_defer_shift until allocations succeed again.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 66e4310..0cb2593 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> {
> unsigned long pages_for_compaction;
> unsigned long inactive_lru_pages;
> + struct zone *zone;
>
> /* If not in reclaim/compaction mode, stop */
> if (!in_reclaim_compaction(sc))
> @@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> * inactive lists are large enough, continue reclaiming
> */
> pages_for_compaction = (2UL << sc->order);
> +
> + /*
> + * If compaction is deferred for this order then scale the number of
> + * pages reclaimed based on the number of consecutive allocation
> + * failures
> + */
> + zone = lruvec_zone(lruvec);
> + if (zone->compact_order_failed >= sc->order)
> + pages_for_compaction <<= zone->compact_defer_shift;
> inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> if (nr_swap_pages > 0)
> inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> --
> 1.7.9.2
>

--
Kind regards,
Minchan Kim

2012-08-08 02:06:20

by Minchan Kim

Subject: Re: [PATCH 3/6] mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed

On Tue, Aug 07, 2012 at 01:31:14PM +0100, Mel Gorman wrote:
> When direct reclaim is running reclaim/compaction, there is a minimum
> number of pages it reclaims. As it must be under the low watermark to be
> in direct reclaim it has also woken kswapd to do some work. This patch
> has kswapd use the same logic as direct reclaim to reclaim a minimum
> number of pages so compaction can run later.

-ENOPARSE by my stupid brain.
Could you elaborate a bit more?

>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0cb2593..afdec93 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1701,7 +1701,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> * calls try_to_compact_zone() that it will have enough free pages to succeed.
> * It will give up earlier than that if there is difficulty reclaiming pages.
> */
> -static inline bool should_continue_reclaim(struct lruvec *lruvec,
> +static bool should_continue_reclaim(struct lruvec *lruvec,
> unsigned long nr_reclaimed,
> unsigned long nr_scanned,
> struct scan_control *sc)
> @@ -1768,6 +1768,17 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> }
> }
>
> +static inline bool should_continue_reclaim_zone(struct zone *zone,
> + unsigned long nr_reclaimed,
> + unsigned long nr_scanned,
> + struct scan_control *sc)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
> + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +
> + return should_continue_reclaim(lruvec, nr_reclaimed, nr_scanned, sc);
> +}
> +
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> @@ -2496,8 +2507,10 @@ loop_again:
> */
> testorder = order;
> if (COMPACTION_BUILD && order &&
> - compaction_suitable(zone, order) !=
> - COMPACT_SKIPPED)
> + !should_continue_reclaim_zone(zone,
> + nr_soft_reclaimed,

nr_soft_reclaimed is always zero with !CONFIG_MEMCG.
So should_continue_reclaim_zone would return normally true in case of
non-__GFP_REPEAT allocation. Is it intentional?


> + sc.nr_scanned - nr_soft_scanned,
> + &sc))
> testorder = 0;
>
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> --
> 1.7.9.2
>

--
Kind regards,
Minchan Kim

2012-08-08 04:34:31

by Minchan Kim

Subject: Re: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages

On Tue, Aug 07, 2012 at 01:31:17PM +0100, Mel Gorman wrote:
> commit [7db8889a: mm: have order > 0 compaction start off where it left]
> introduced a caching mechanism to reduce the amount work the free page
> scanner does in compaction. However, it has a problem. Consider two process
> simultaneously scanning free pages
>
> C
> Process A M S F
> |---------------------------------------|
> Process B M FS
>
> C is zone->compact_cached_free_pfn
> S is cc->start_pfree_pfn
> M is cc->migrate_pfn
> F is cc->free_pfn
>
> In this diagram, Process A has just reached its migrate scanner, wrapped
> around and updated compact_cached_free_pfn accordingly.
>
> Simultaneously, Process B finishes isolating in a block and updates
> compact_cached_free_pfn again to the location of its free scanner.
>
> Process A moves to "end_of_zone - one_pageblock" and runs this check
>
> if (cc->order > 0 && (!cc->wrapped ||
> zone->compact_cached_free_pfn >
> cc->start_free_pfn))
> pfn = min(pfn, zone->compact_cached_free_pfn);
>
> compact_cached_free_pfn is above where it started so the free scanner skips
> almost the entire space it should have scanned. When there are multiple
> processes compacting it can end in a situation where the entire zone is
> not being scanned at all. Further, it is possible for two processes to
> ping-pong update to compact_cached_free_pfn which is just random.
>
> Overall, the end result wrecks allocation success rates.
>
> There is not an obvious way around this problem without introducing new
> locking and state so this patch takes a different approach.
>
> First, it gets rid of the skip logic because it's not clear that it matters
> if two free scanners happen to be in the same block but with racing updates
> it's too easy for it to skip over blocks it should not.

Okay.

>
> Second, it updates compact_cached_free_pfn in a more limited set of
> circumstances.
>
> If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> of the zone. Each time a wrapped scanner isoaltes a page, it
> updates compact_cached_free_pfn. The intention is that after
> wrapping, the compact_cached_free_pfn will be at the highest
> pageblock with free pages when compaction completes.

Okay.

>
> If a scanner has not wrapped when compaction completes and

Compaction complete?
Your code seem to do it in isolate_freepages.
Isn't it compaction complete?

> compact_cached_free_pfn is set the end of the the zone, initialise
> it once.

I can't understad this part.
Could you elaborate a bit more?

>
> This is not optimal and it can still race but the compact_cached_free_pfn
> will be pointing to or very near a pageblock with free pages.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/compaction.c | 54 ++++++++++++++++++++++++++++--------------------------
> 1 file changed, 28 insertions(+), 26 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index be310f1..df50f73 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -419,6 +419,20 @@ static bool suitable_migration_target(struct page *page)
> }
>
> /*
> + * Returns the start pfn of the last page block in a zone. This is the starting
> + * point for full compaction of a zone. Compaction searches for free pages from
> + * the end of each zone, while isolate_freepages_block scans forward inside each
> + * page block.
> + */
> +static unsigned long start_free_pfn(struct zone *zone)
> +{
> + unsigned long free_pfn;
> + free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> + free_pfn &= ~(pageblock_nr_pages-1);
> + return free_pfn;
> +}
> +
> +/*
> * Based on information in the current compact_control, find blocks
> * suitable for isolating free pages from and then isolate them.
> */
> @@ -457,17 +471,6 @@ static void isolate_freepages(struct zone *zone,
> pfn -= pageblock_nr_pages) {
> unsigned long isolated;
>
> - /*
> - * Skip ahead if another thread is compacting in the area
> - * simultaneously. If we wrapped around, we can only skip
> - * ahead if zone->compact_cached_free_pfn also wrapped to
> - * above our starting point.
> - */
> - if (cc->order > 0 && (!cc->wrapped ||
> - zone->compact_cached_free_pfn >
> - cc->start_free_pfn))
> - pfn = min(pfn, zone->compact_cached_free_pfn);
> -
> if (!pfn_valid(pfn))
> continue;
>
> @@ -510,7 +513,15 @@ static void isolate_freepages(struct zone *zone,
> */
> if (isolated) {
> high_pfn = max(high_pfn, pfn);
> - if (cc->order > 0)
> +
> + /*
> + * If the free scanner has wrapped, update
> + * compact_cached_free_pfn to point to the highest
> + * pageblock with free pages. This reduces excessive
> + * scanning of full pageblocks near the end of the
> + * zone
> + */
> + if (cc->order > 0 && cc->wrapped)
> zone->compact_cached_free_pfn = high_pfn;
> }
> }
> @@ -520,6 +531,11 @@ static void isolate_freepages(struct zone *zone,
>
> cc->free_pfn = high_pfn;
> cc->nr_freepages = nr_freepages;
> +
> + /* If compact_cached_free_pfn is reset then set it now */
> + if (cc->order > 0 && !cc->wrapped &&
> + zone->compact_cached_free_pfn == start_free_pfn(zone))
> + zone->compact_cached_free_pfn = high_pfn;
> }
>
> /*
> @@ -607,20 +623,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
> return ISOLATE_SUCCESS;
> }
>
> -/*
> - * Returns the start pfn of the last page block in a zone. This is the starting
> - * point for full compaction of a zone. Compaction searches for free pages from
> - * the end of each zone, while isolate_freepages_block scans forward inside each
> - * page block.
> - */
> -static unsigned long start_free_pfn(struct zone *zone)
> -{
> - unsigned long free_pfn;
> - free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> - free_pfn &= ~(pageblock_nr_pages-1);
> - return free_pfn;
> -}
> -
> static int compact_finished(struct zone *zone,
> struct compact_control *cc)
> {
> --
> 1.7.9.2
>

--
Kind regards,
Minchan Kim

2012-08-08 07:55:33

by Mel Gorman

Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> Hi Mel,
>
> Just out of curiosity.
> What's the problem did you see? (ie, What's the problem do this patch solve?)

Everything in this series is related to the problem in the leader - high
order allocation success rates are lower. This patch increases the success
rates when allocating under load.

> AFAIUC, it seem to solve consecutive allocation success ratio through
> getting several free pageblocks all at once in a process/kswapd
> reclaim context. Right?

It only reclaims whole pageblocks if the allocation is order-9 on x86; in
general it reclaims an amount that depends on the allocation size. This only
happens in reclaim/compaction context when we know that a high-order
allocation has recently failed. The objective is to reclaim enough order-0
pages so that compaction can succeed again.
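
To make the scaling concrete, below is a small user-space sketch (not kernel
code) of the formula patch 2 adds to should_continue_reclaim(). The order-9
allocation, the 4096-byte page size and the maximum defer shift of 6 are
assumptions for illustration only.

#include <stdio.h>

/*
 * Illustrative sketch of the reclaim target used by reclaim/compaction
 * after patch 2: pages_for_compaction = (2UL << order), shifted left by
 * zone->compact_defer_shift. Order 9, a 4096-byte page and a maximum
 * shift of 6 are assumed example values, not measurements.
 */
int main(void)
{
        const int order = 9;
        const unsigned long page_size = 4096;
        int defer_shift;

        for (defer_shift = 0; defer_shift <= 6; defer_shift++) {
                unsigned long pages = (2UL << order) << defer_shift;

                printf("compact_defer_shift=%d -> target %5lu pages (%3lu MB)\n",
                       defer_shift, pages, pages * page_size >> 20);
        }
        return 0;
}

With those assumptions the target grows from 1024 pages (4MB) with no
deferral to 65536 pages (256MB) after repeated failures, and it falls back
to the minimum once an allocation succeeds and compact_defer_shift is reset.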

--
Mel Gorman
SUSE Labs

2012-08-08 08:26:12

by Minchan Kim

Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Wed, Aug 08, 2012 at 08:55:26AM +0100, Mel Gorman wrote:
> On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> > Hi Mel,
> >
> > Just out of curiosity.
> > What's the problem did you see? (ie, What's the problem do this patch solve?)
>
> Everythign in this series is related to the problem in the leader - high
> order allocation success rates are lower. This patch increases the success
> rates when allocating under load.
>
> > AFAIUC, it seem to solve consecutive allocation success ratio through
> > getting several free pageblocks all at once in a process/kswapd
> > reclaim context. Right?
>
> Only pageblocks if it is order-9 on x86, it reclaims an amount that depends
> on an allocation size. This only happens during reclaim/compaction context
> when we know that a high-order allocation has recently failed. The objective
> is to reclaim enough order-0 pages so that compaction can succeed again.

Your patch increases the number of pages to be reclaimed with considering
the number of fail case during deferring period and your test proved it's
really good. Without your patch, why can't VM reclaim enough pages?
Other processes steal the pages reclaimed?
Why I ask a question is that I want to know what's the problem at current
VM.


>
> --
> Mel Gorman
> SUSE Labs
>

--
Kind regards,
Minchan Kim

2012-08-08 08:51:20

by Mel Gorman

Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Wed, Aug 08, 2012 at 05:27:38PM +0900, Minchan Kim wrote:
> On Wed, Aug 08, 2012 at 08:55:26AM +0100, Mel Gorman wrote:
> > On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> > > Hi Mel,
> > >
> > > Just out of curiosity.
> > > What's the problem did you see? (ie, What's the problem do this patch solve?)
> >
> > Everythign in this series is related to the problem in the leader - high
> > order allocation success rates are lower. This patch increases the success
> > rates when allocating under load.
> >
> > > AFAIUC, it seem to solve consecutive allocation success ratio through
> > > getting several free pageblocks all at once in a process/kswapd
> > > reclaim context. Right?
> >
> > Only pageblocks if it is order-9 on x86, it reclaims an amount that depends
> > on an allocation size. This only happens during reclaim/compaction context
> > when we know that a high-order allocation has recently failed. The objective
> > is to reclaim enough order-0 pages so that compaction can succeed again.
>
> Your patch increases the number of pages to be reclaimed with considering
> the number of fail case during deferring period and your test proved it's
> really good. Without your patch, why can't VM reclaim enough pages?

It could reclaim enough pages but it doesn't. nr_to_reclaim is
SWAP_CLUSTER_MAX and that gets short-circuited in direct reclaim, at least
by

if (sc->nr_reclaimed >= sc->nr_to_reclaim)
        goto out;

I could of course set nr_to_reclaim in try_to_free_pages() and drive
it from there, but that would just be different, not better. If driven from
do_try_to_free_pages(), it is also possible that priorities will rise.
When they reach DEF_PRIORITY-2, reclaim will also start stalling and setting
pages for immediate reclaim, which is more disruptive than desirable
in this case. That is a more wide-reaching change than I would expect for
this problem and could cause another regression related to THP requests
causing interactive jitter.
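
As an illustration of the gap, here is a small userspace model of the stopping
behaviour described above; it is a sketch, not the kernel loop, and simply assumes
each pass frees SWAP_CLUSTER_MAX pages while should_continue_reclaim() keeps the
loop going until compaction has enough order-0 pages:

#include <stdbool.h>
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

/*
 * Toy model of the behaviour described above, not the kernel code: plain
 * direct reclaim bails out once nr_to_reclaim (SWAP_CLUSTER_MAX) pages are
 * freed; reclaim/compaction keeps reclaiming until roughly 2UL << order
 * pages exist for compaction at the requested order.
 */
static unsigned long model_reclaim(int order, bool reclaim_compaction)
{
        unsigned long nr_to_reclaim = SWAP_CLUSTER_MAX;
        unsigned long pages_for_compaction = 2UL << order;
        unsigned long nr_reclaimed = 0;

        for (;;) {
                nr_reclaimed += SWAP_CLUSTER_MAX;       /* one shrink pass */

                /* reclaim/compaction delays the stop check */
                if (reclaim_compaction &&
                    nr_reclaimed < pages_for_compaction)
                        continue;

                /* the short-cut quoted above */
                if (nr_reclaimed >= nr_to_reclaim)
                        break;
        }

        return nr_reclaimed;
}

int main(void)
{
        printf("order-9, plain reclaim:       %lu pages\n",
               model_reclaim(9, false));
        printf("order-9, reclaim/compaction:  %lu pages\n",
               model_reclaim(9, true));
        return 0;
}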

> Other processes steal the pages reclaimed?

Or the pages it reclaimed were in pageblocks that could not be used.

> Why I ask a question is that I want to know what's the problem at current
> VM.
>

We cannot reliably tell in advance whether compaction is going to succeed
without doing a full scan of the zone, which would be both very heavy and
racy with respect to parallel allocation requests. Compaction needs free
pages to succeed, so the intention is to scale the number of pages reclaimed
with the number of recent compaction failures.

--
Mel Gorman
SUSE Labs

2012-08-08 09:08:09

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed

On Wed, Aug 08, 2012 at 11:07:49AM +0900, Minchan Kim wrote:
> On Tue, Aug 07, 2012 at 01:31:14PM +0100, Mel Gorman wrote:
> > When direct reclaim is running reclaim/compaction, there is a minimum
> > number of pages it reclaims. As it must be under the low watermark to be
> > in direct reclaim it has also woken kswapd to do some work. This patch
> > has kswapd use the same logic as direct reclaim to reclaim a minimum
> > number of pages so compaction can run later.
>
> -ENOPARSE by my stupid brain.
> Could you elaborate a bit more?
>

Which part did not make sense so I know which part to elaborate on? Lets
try again randomly with this;

When direct reclaim is running reclaim/compaction for high-order allocations,
it aims to reclaim a minimum number of pages for compaction as controlled
by should_continue_reclaim. Before it entered direct reclaim, kswapd was
woken to reclaim pages at the same order. This patch forces kswapd to use
the same logic as direct reclaim to reclaim a minimum number of pages so
that subsequent allocation requests are less likely to enter direct reclaim.

> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 19 ++++++++++++++++---
> > 1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0cb2593..afdec93 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1701,7 +1701,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> > * calls try_to_compact_zone() that it will have enough free pages to succeed.
> > * It will give up earlier than that if there is difficulty reclaiming pages.
> > */
> > -static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > +static bool should_continue_reclaim(struct lruvec *lruvec,
> > unsigned long nr_reclaimed,
> > unsigned long nr_scanned,
> > struct scan_control *sc)
> > @@ -1768,6 +1768,17 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > }
> > }
> >
> > +static inline bool should_continue_reclaim_zone(struct zone *zone,
> > + unsigned long nr_reclaimed,
> > + unsigned long nr_scanned,
> > + struct scan_control *sc)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +
> > + return should_continue_reclaim(lruvec, nr_reclaimed, nr_scanned, sc);
> > +}
> > +
> > /*
> > * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> > */
> > @@ -2496,8 +2507,10 @@ loop_again:
> > */
> > testorder = order;
> > if (COMPACTION_BUILD && order &&
> > - compaction_suitable(zone, order) !=
> > - COMPACT_SKIPPED)
> > + !should_continue_reclaim_zone(zone,
> > + nr_soft_reclaimed,
>
> nr_soft_reclaimed is always zero with !CONFIG_MEMCG.
> So should_continue_reclaim_zone would return normally true in case of
> non-__GFP_REPEAT allocation. Is it intentional?
>

It was intentional at the time but asking me about it made me reconsider,
thanks. In too many cases, this is a no-op and any apparent increase of
kswapd activity is likely a co-incidence. This is untested but is what I
intended.

---8<---
mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed

When direct reclaim is running reclaim/compaction for high-order allocations,
it aims to reclaim a minimum number of pages for compaction as controlled
by should_continue_reclaim. Before it entered direct reclaim, kswapd was
woken to reclaim pages at the same order. This patch forces kswapd to use
the same logic as direct reclaim to reclaim a minimum number of pages so
that subsequent allocation requests are less likely to enter direct reclaim.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 81 ++++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 50 insertions(+), 31 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0cb2593..6840218 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1696,14 +1696,11 @@ static bool in_reclaim_compaction(struct scan_control *sc)

/*
* Reclaim/compaction is used for high-order allocation requests. It reclaims
- * order-0 pages before compacting the zone. should_continue_reclaim() returns
+ * order-0 pages before compacting the zone. __should_continue_reclaim() returns
* true if more pages should be reclaimed such that when the page allocator
* calls try_to_compact_zone() that it will have enough free pages to succeed.
- * It will give up earlier than that if there is difficulty reclaiming pages.
*/
-static inline bool should_continue_reclaim(struct lruvec *lruvec,
- unsigned long nr_reclaimed,
- unsigned long nr_scanned,
+static bool __should_continue_reclaim(struct lruvec *lruvec,
struct scan_control *sc)
{
unsigned long pages_for_compaction;
@@ -1714,29 +1711,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
if (!in_reclaim_compaction(sc))
return false;

- /* Consider stopping depending on scan and reclaim activity */
- if (sc->gfp_mask & __GFP_REPEAT) {
- /*
- * For __GFP_REPEAT allocations, stop reclaiming if the
- * full LRU list has been scanned and we are still failing
- * to reclaim pages. This full LRU scan is potentially
- * expensive but a __GFP_REPEAT caller really wants to succeed
- */
- if (!nr_reclaimed && !nr_scanned)
- return false;
- } else {
- /*
- * For non-__GFP_REPEAT allocations which can presumably
- * fail without consequence, stop if we failed to reclaim
- * any pages from the last SWAP_CLUSTER_MAX number of
- * pages that were scanned. This will return to the
- * caller faster at the risk reclaim/compaction and
- * the resulting allocation attempt fails
- */
- if (!nr_reclaimed)
- return false;
- }
-
/*
* If we have not reclaimed enough pages for compaction and the
* inactive lists are large enough, continue reclaiming
@@ -1768,6 +1742,51 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
}
}

+/* Looks up the lruvec before calling __should_continue_reclaim */
+static inline bool should_kswapd_continue_reclaim(struct zone *zone,
+ struct scan_control *sc)
+{
+ struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+
+ return __should_continue_reclaim(lruvec, sc);
+}
+
+/*
+ * This uses __should_continue_reclaim at its core but will also give up
+ * earlier than that if there is difficulty reclaiming pages.
+ */
+static inline bool should_direct_continue_reclaim(struct lruvec *lruvec,
+ unsigned long nr_reclaimed,
+ unsigned long nr_scanned,
+ struct scan_control *sc)
+{
+ /* Consider stopping depending on scan and reclaim activity */
+ if (sc->gfp_mask & __GFP_REPEAT) {
+ /*
+ * For __GFP_REPEAT allocations, stop reclaiming if the
+ * full LRU list has been scanned and we are still failing
+ * to reclaim pages. This full LRU scan is potentially
+ * expensive but a __GFP_REPEAT caller really wants to succeed
+ */
+ if (!nr_reclaimed && !nr_scanned)
+ return false;
+ } else {
+ /*
+ * For non-__GFP_REPEAT allocations which can presumably
+ * fail without consequence, stop if we failed to reclaim
+ * any pages from the last SWAP_CLUSTER_MAX number of
+ * pages that were scanned. This will return to the
+ * caller faster at the risk reclaim/compaction and
+ * the resulting allocation attempt fails
+ */
+ if (!nr_reclaimed)
+ return false;
+ }
+
+ return __should_continue_reclaim(lruvec, sc);
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -1822,7 +1841,7 @@ restart:
sc, LRU_ACTIVE_ANON);

/* reclaim/compaction might need reclaim to continue */
- if (should_continue_reclaim(lruvec, nr_reclaimed,
+ if (should_direct_continue_reclaim(lruvec, nr_reclaimed,
sc->nr_scanned - nr_scanned, sc))
goto restart;

@@ -2496,8 +2515,8 @@ loop_again:
*/
testorder = order;
if (COMPACTION_BUILD && order &&
- compaction_suitable(zone, order) !=
- COMPACT_SKIPPED)
+ !should_kswapd_continue_reclaim(zone,
+ &sc))
testorder = 0;

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||

2012-08-08 09:58:09

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed

On Wed, Aug 08, 2012 at 10:07:57AM +0100, Mel Gorman wrote:
> > <SNIP>
>
> It was intentional at the time but asking me about it made me reconsider,
> thanks. In too many cases, this is a no-op and any apparent increase of
> kswapd activity is likely a co-incidence. This is untested but is what I
> intended.
>
> ---8<---
> mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed
>

Considering this further, it would partially regress fe2c2a10
and be too aggressive. I'm dropping this patch completely for now and will
revisit it in the future.

Thanks Minchan.

--
Mel Gorman
SUSE Labs

2012-08-08 10:18:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages

On Wed, Aug 08, 2012 at 01:36:00PM +0900, Minchan Kim wrote:
> >
> > Second, it updates compact_cached_free_pfn in a more limited set of
> > circumstances.
> >
> > If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> > of the zone. Each time a wrapped scanner isoaltes a page, it
> > updates compact_cached_free_pfn. The intention is that after
> > wrapping, the compact_cached_free_pfn will be at the highest
> > pageblock with free pages when compaction completes.
>
> Okay.
>
> >
> > If a scanner has not wrapped when compaction completes and
>
> Compaction complete?
> Your code seem to do it in isolate_freepages.
> Isn't it compaction complete?
>

s/compaction/free page isolation/

> > compact_cached_free_pfn is set the end of the the zone, initialise
> > it once.
>

> I can't understad this part.
> Could you elaborate a bit more?
>

Is this better?

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. When a wrapped scanner isolates a page, it updates
compact_cached_free_pfn to point to the highest pageblock it
can isolate pages from.

If a scanner has not wrapped when it has finished isolating pages, it
checks whether compact_cached_free_pfn is pointing to the end of the
zone. If so, the value is updated to point to the highest
pageblock that pages were isolated from. This value will not
be updated again until a free page scanner wraps and resets
compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.
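
As a userspace sketch of that update rule (the struct and helpers below are toys
standing in for struct zone and the free page scanner, not kernel code):

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the fields of struct zone used by the rule above */
struct toy_zone {
        unsigned long zone_end_pfn;
        unsigned long compact_cached_free_pfn;
};

/* The free page scanner wrapped: reset the hint to the end of the zone */
static void scanner_wrapped(struct toy_zone *zone)
{
        zone->compact_cached_free_pfn = zone->zone_end_pfn;
}

/* The scanner isolated free pages from the pageblock at @pfn */
static void isolated_free_pages(struct toy_zone *zone, unsigned long pfn,
                                bool wrapped)
{
        if (wrapped) {
                /* A wrapped scanner updates the hint on every isolation */
                zone->compact_cached_free_pfn = pfn;
        } else if (zone->compact_cached_free_pfn == zone->zone_end_pfn) {
                /* Not wrapped: initialise the hint once, then leave it */
                zone->compact_cached_free_pfn = pfn;
        }
}

int main(void)
{
        struct toy_zone zone = { .zone_end_pfn = 262144,
                                 .compact_cached_free_pfn = 262144 };

        isolated_free_pages(&zone, 200000, false);  /* sets the hint once */
        isolated_free_pages(&zone, 100000, false);  /* does not move it again */
        printf("hint after unwrapped pass: %lu\n", zone.compact_cached_free_pfn);

        scanner_wrapped(&zone);                     /* back to the zone end */
        isolated_free_pages(&zone, 150000, true);   /* every isolation updates it */
        printf("hint after wrapped pass:   %lu\n", zone.compact_cached_free_pfn);
        return 0;
}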

--
Mel Gorman
SUSE Labs

2012-08-08 23:49:55

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Wed, Aug 08, 2012 at 09:51:12AM +0100, Mel Gorman wrote:
> On Wed, Aug 08, 2012 at 05:27:38PM +0900, Minchan Kim wrote:
> > On Wed, Aug 08, 2012 at 08:55:26AM +0100, Mel Gorman wrote:
> > > On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> > > > Hi Mel,
> > > >
> > > > Just out of curiosity.
> > > > What's the problem did you see? (ie, What's the problem do this patch solve?)
> > >
> > > Everythign in this series is related to the problem in the leader - high
> > > order allocation success rates are lower. This patch increases the success
> > > rates when allocating under load.
> > >
> > > > AFAIUC, it seem to solve consecutive allocation success ratio through
> > > > getting several free pageblocks all at once in a process/kswapd
> > > > reclaim context. Right?
> > >
> > > Only pageblocks if it is order-9 on x86, it reclaims an amount that depends
> > > on an allocation size. This only happens during reclaim/compaction context
> > > when we know that a high-order allocation has recently failed. The objective
> > > is to reclaim enough order-0 pages so that compaction can succeed again.
> >
> > Your patch increases the number of pages to be reclaimed with considering
> > the number of fail case during deferring period and your test proved it's
> > really good. Without your patch, why can't VM reclaim enough pages?
>
> It could reclaim enough pages but it doesn't. nr_to_reclaim is
> SWAP_CLUSTER_MAX and that gets short-cutted in direct reclaim at least
> by
>
> if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> goto out;
>
> I could set nr_to_reclaim in try_to_free_pages() of course and drive
> it from there but that's just different, not better. If driven from
> do_try_to_free_pages(), it is also possible that priorities will rise.
> When they reach DEF_PRIORITY-2, it will also start stalling and setting
> pages for immediate reclaim which is more disruptive than not desirable
> in this case. That is a more wide-reaching change than I would expect for
> this problem and could cause another regression related to THP requests
> causing interactive jitter.

Agreed.
I hope this gets added to the changelog.

>
> > Other processes steal the pages reclaimed?
>
> Or the page it reclaimed were in pageblocks that could not be used.
>
> > Why I ask a question is that I want to know what's the problem at current
> > VM.
> >
>
> We cannot reliably tell in advance whether compaction is going to succeed
> in the future without doing a full scan of the zone which would be both
> very heavy and race with any allocation requests. Compaction needs free
> pages to succeed so the intention is to scale the number of pages reclaimed
> with the number of recent compaction failures.

> If allocation fails after compaction then compaction may be deferred for
> a number of allocation attempts. If there are subsequent failures,
> compact_defer_shift is increased to defer for longer periods. This patch
> uses that information to scale the number of pages reclaimed with
> compact_defer_shift until allocations succeed again.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 66e4310..0cb2593 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> {
> unsigned long pages_for_compaction;
> unsigned long inactive_lru_pages;
> + struct zone *zone;
>
> /* If not in reclaim/compaction mode, stop */
> if (!in_reclaim_compaction(sc))
> @@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> * inactive lists are large enough, continue reclaiming
> */
> pages_for_compaction = (2UL << sc->order);
> +
> + /*
> + * If compaction is deferred for this order then scale the number of

this order? sc->order?

> + * pages reclaimed based on the number of consecutive allocation
> + * failures
> + */
> + zone = lruvec_zone(lruvec);
> + if (zone->compact_order_failed >= sc->order)

I can't understand this part.
Since aff62249 we don't defer compaction for orders lower than compact_order_failed.
Do you mean a lower-order compaction context should be sacrificed so that a
deferred higher-order allocation request can succeed? That doesn't seem fair,
and I still can't understand the rationale for scaling the number of pages
reclaimed with the number of recent compaction failures.
Your changelog only says what we have to do, not why we have to do it.


> + pages_for_compaction <<= zone->compact_defer_shift;


> inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> if (nr_swap_pages > 0)
> inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> --
> 1.7.9.2
>


>
> --
> Mel Gorman
> SUSE Labs
>

--
Kind regards,
Minchan Kim

2012-08-09 07:49:57

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Thu, Aug 09, 2012 at 08:51:27AM +0900, Minchan Kim wrote:
> > > > > Just out of curiosity.
> > > > > What's the problem did you see? (ie, What's the problem do this patch solve?)
> > > >
> > > > Everythign in this series is related to the problem in the leader - high
> > > > order allocation success rates are lower. This patch increases the success
> > > > rates when allocating under load.
> > > >
> > > > > AFAIUC, it seem to solve consecutive allocation success ratio through
> > > > > getting several free pageblocks all at once in a process/kswapd
> > > > > reclaim context. Right?
> > > >
> > > > Only pageblocks if it is order-9 on x86, it reclaims an amount that depends
> > > > on an allocation size. This only happens during reclaim/compaction context
> > > > when we know that a high-order allocation has recently failed. The objective
> > > > is to reclaim enough order-0 pages so that compaction can succeed again.
> > >
> > > Your patch increases the number of pages to be reclaimed with considering
> > > the number of fail case during deferring period and your test proved it's
> > > really good. Without your patch, why can't VM reclaim enough pages?
> >
> > It could reclaim enough pages but it doesn't. nr_to_reclaim is
> > SWAP_CLUSTER_MAX and that gets short-cutted in direct reclaim at least
> > by
> >
> > if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > goto out;
> >
> > I could set nr_to_reclaim in try_to_free_pages() of course and drive
> > it from there but that's just different, not better. If driven from
> > do_try_to_free_pages(), it is also possible that priorities will rise.
> > When they reach DEF_PRIORITY-2, it will also start stalling and setting
> > pages for immediate reclaim which is more disruptive than not desirable
> > in this case. That is a more wide-reaching change than I would expect for
> > this problem and could cause another regression related to THP requests
> > causing interactive jitter.
>
> Agreed.
> I hope it should be added by changelog.
>

I guess, but it's not really part of this patch, is it? The decision on
where to drive should_continue_reclaim from was made in commit [3e7d3449:
mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim].

Anyway, the changelog now reads as follows:

If allocation fails after compaction then compaction may be deferred
for a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This
patch uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again. The rationale is
that reclaiming the normal number of pages still allowed compaction to
fail, and compaction's success depends on the number of free pages
available. If it's failing, reclaim more pages until it succeeds again.

Note that this is not implying that VM reclaim is not reclaiming enough
pages or that its logic is broken. try_to_free_pages() always asks for
SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
what it does. Direct reclaim stops normally with this check.

if (sc->nr_reclaimed >= sc->nr_to_reclaim)
        goto out;

should_continue_reclaim delays when that check is made until a minimum number
of pages for reclaim/compaction have been reclaimed. It is possible that this
patch could instead set nr_to_reclaim in try_to_free_pages() and drive it from
there, but that behaves differently and not necessarily for the better.
If driven from do_try_to_free_pages(), it is also possible that priorities
will rise. When they reach DEF_PRIORITY-2, reclaim will also start stalling
and setting pages for immediate reclaim, which is more disruptive than
desirable in this case. That is a more wide-reaching change that could
cause another regression related to THP requests causing interactive jitter.

> >
> > > Other processes steal the pages reclaimed?
> >
> > Or the page it reclaimed were in pageblocks that could not be used.
> >
> > > Why I ask a question is that I want to know what's the problem at current
> > > VM.
> > >
> >
> > We cannot reliably tell in advance whether compaction is going to succeed
> > in the future without doing a full scan of the zone which would be both
> > very heavy and race with any allocation requests. Compaction needs free
> > pages to succeed so the intention is to scale the number of pages reclaimed
> > with the number of recent compaction failures.
>
> > If allocation fails after compaction then compaction may be deferred for
> > a number of allocation attempts. If there are subsequent failures,
> > compact_defer_shift is increased to defer for longer periods. This patch
> > uses that information to scale the number of pages reclaimed with
> > compact_defer_shift until allocations succeed again.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 66e4310..0cb2593 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > {
> > unsigned long pages_for_compaction;
> > unsigned long inactive_lru_pages;
> > + struct zone *zone;
> >
> > /* If not in reclaim/compaction mode, stop */
> > if (!in_reclaim_compaction(sc))
> > @@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > * inactive lists are large enough, continue reclaiming
> > */
> > pages_for_compaction = (2UL << sc->order);
> > +
> > + /*
> > + * If compaction is deferred for this order then scale the number of
>
> this order? sc->order?
>

Yes, the comment has been changed to clarify this.

> > + * pages reclaimed based on the number of consecutive allocation
> > + * failures
> > + */
> > + zone = lruvec_zone(lruvec);
> > + if (zone->compact_order_failed >= sc->order)
>
> I can't understand this part.
> We don't defer lower order than compact_order_failed by aff62249.
> Do you mean lower order compaction context should be a lamb for
> deferred higher order allocation request success? I think it's not fair
> and even I can't understand rationale why it has to scale the number of pages
> reclaimed with the number of recent compaction failture.
> Your changelog just says "What we have to do, NOT Why we have to do".
>

I'm a moron, that should be <=, not >=. All my tests were based on order == 9
and that was the only order using reclaim/compaction, so it happened to
work as expected. Thanks! I fixed that and added the following
clarification to the changelog:

The rationale is that reclaiming the normal number of pages still allowed
compaction to fail, and compaction's success depends on the number of free
pages available. If it's failing, reclaim more pages until it succeeds again.

Does that make more sense?

>
> > + pages_for_compaction <<= zone->compact_defer_shift;
>
>
> > inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> > if (nr_swap_pages > 0)
> > inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
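
For reference, a tiny standalone check of why the wrong comparison slipped
through testing when order == 9 (purely illustrative; deferral applies to
orders of at least compact_order_failed, so the extra reclaim should kick
in when compact_order_failed <= sc->order):

#include <stdio.h>

/*
 * order == 9 was the only order tested, which is the one case where
 * the buggy >= check and the fixed <= check happen to agree.
 */
static int scales(int compact_order_failed, int order, int fixed)
{
        return fixed ? compact_order_failed <= order
                     : compact_order_failed >= order;
}

int main(void)
{
        printf("order=9, failed at order 4: buggy >= %d, fixed <= %d\n",
               scales(4, 9, 0), scales(4, 9, 1));
        printf("order=9, failed at order 9: buggy >= %d, fixed <= %d\n",
               scales(9, 9, 0), scales(9, 9, 1));
        return 0;
}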

--
Mel Gorman
SUSE Labs

2012-08-09 08:25:45

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Thu, Aug 09, 2012 at 08:49:50AM +0100, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 08:51:27AM +0900, Minchan Kim wrote:
> > > > > > Just out of curiosity.
> > > > > > What's the problem did you see? (ie, What's the problem do this patch solve?)
> > > > >
> > > > > Everythign in this series is related to the problem in the leader - high
> > > > > order allocation success rates are lower. This patch increases the success
> > > > > rates when allocating under load.
> > > > >
> > > > > > AFAIUC, it seem to solve consecutive allocation success ratio through
> > > > > > getting several free pageblocks all at once in a process/kswapd
> > > > > > reclaim context. Right?
> > > > >
> > > > > Only pageblocks if it is order-9 on x86, it reclaims an amount that depends
> > > > > on an allocation size. This only happens during reclaim/compaction context
> > > > > when we know that a high-order allocation has recently failed. The objective
> > > > > is to reclaim enough order-0 pages so that compaction can succeed again.
> > > >
> > > > Your patch increases the number of pages to be reclaimed with considering
> > > > the number of fail case during deferring period and your test proved it's
> > > > really good. Without your patch, why can't VM reclaim enough pages?
> > >
> > > It could reclaim enough pages but it doesn't. nr_to_reclaim is
> > > SWAP_CLUSTER_MAX and that gets short-cutted in direct reclaim at least
> > > by
> > >
> > > if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > > goto out;
> > >
> > > I could set nr_to_reclaim in try_to_free_pages() of course and drive
> > > it from there but that's just different, not better. If driven from
> > > do_try_to_free_pages(), it is also possible that priorities will rise.
> > > When they reach DEF_PRIORITY-2, it will also start stalling and setting
> > > pages for immediate reclaim which is more disruptive than not desirable
> > > in this case. That is a more wide-reaching change than I would expect for
> > > this problem and could cause another regression related to THP requests
> > > causing interactive jitter.
> >
> > Agreed.
> > I hope it should be added by changelog.
> >
>
> I guess but it's not really part of this patch is it? The decision on
> where to drive should_continue_reclaim from was made in commit [3e7d3449:
> mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim].
>
> Anyway changelog now reads as
>
> If allocation fails after compaction then compaction may be deferred
> for a number of allocation attempts. If there are subsequent failures,
> compact_defer_shift is increased to defer for longer periods. This
> patch uses that information to scale the number of pages reclaimed with
> compact_defer_shift until allocations succeed again. The rationale is
> that reclaiming the normal number of pages still allowed compaction to
> fail and its success depends on the number of pages. If it's failing,
> reclaim more pages until it succeeds again.
>
> Note that this is not implying that VM reclaim is not reclaiming enough
> pages or that its logic is broken. try_to_free_pages() always asks for
> SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
> what it does. Direct reclaim stops normally with this check.
>
> if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> goto out;
>
> should_continue_reclaim delays when that check is made until a minimum number
> of pages for reclaim/compaction are reclaimed. It is possible that this patch
> could instead set nr_to_reclaim in try_to_free_pages() and drive it from
> there but that's behaves differently and not necessarily for the better.
> If driven from do_try_to_free_pages(), it is also possible that priorities
> will rise. When they reach DEF_PRIORITY-2, it will also start stalling
> and setting pages for immediate reclaim which is more disruptive than not
> desirable in this case. That is a more wide-reaching change that could
> cause another regression related to THP requests causing interactive jitter.
>
> > >
> > > > Other processes steal the pages reclaimed?
> > >
> > > Or the page it reclaimed were in pageblocks that could not be used.
> > >
> > > > Why I ask a question is that I want to know what's the problem at current
> > > > VM.
> > > >
> > >
> > > We cannot reliably tell in advance whether compaction is going to succeed
> > > in the future without doing a full scan of the zone which would be both
> > > very heavy and race with any allocation requests. Compaction needs free
> > > pages to succeed so the intention is to scale the number of pages reclaimed
> > > with the number of recent compaction failures.
> >
> > > If allocation fails after compaction then compaction may be deferred for
> > > a number of allocation attempts. If there are subsequent failures,
> > > compact_defer_shift is increased to defer for longer periods. This patch
> > > uses that information to scale the number of pages reclaimed with
> > > compact_defer_shift until allocations succeed again.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > mm/vmscan.c | 10 ++++++++++
> > > 1 file changed, 10 insertions(+)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 66e4310..0cb2593 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > > {
> > > unsigned long pages_for_compaction;
> > > unsigned long inactive_lru_pages;
> > > + struct zone *zone;
> > >
> > > /* If not in reclaim/compaction mode, stop */
> > > if (!in_reclaim_compaction(sc))
> > > @@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > > * inactive lists are large enough, continue reclaiming
> > > */
> > > pages_for_compaction = (2UL << sc->order);
> > > +
> > > + /*
> > > + * If compaction is deferred for this order then scale the number of
> >
> > this order? sc->order?
> >
>
> yes. Comment changed to clarify.
>
> > > + * pages reclaimed based on the number of consecutive allocation
> > > + * failures
> > > + */
> > > + zone = lruvec_zone(lruvec);
> > > + if (zone->compact_order_failed >= sc->order)
> >
> > I can't understand this part.
> > We don't defer lower order than compact_order_failed by aff62249.
> > Do you mean lower order compaction context should be a lamb for
> > deferred higher order allocation request success? I think it's not fair
> > and even I can't understand rationale why it has to scale the number of pages
> > reclaimed with the number of recent compaction failture.
> > Your changelog just says "What we have to do, NOT Why we have to do".
> >
>
> I'm a moron, that should be <=, not >=. All my tests were based on order==9
> and that was the only order using reclaim/compaction so it happened to
> work as expected. Thanks! I fixed that and added the following
> clarification to the changelog
>
> The rationale is that reclaiming the normal number of pages still allowed
> compaction to fail and its success depends on the number of pages. If it's
> failing, reclaim more pages until it succeeds again.
>
> Does that make more sense?

If compaction is deferred, requestors fail to get a high-order page and
they normally fall back to order-0 or something similar.
In that context, if they don't rely on the fallback and instead retry the
higher-order allocation, your patch makes sense to me, because your algorithm
is based on the past allocation failure rate.
Am I missing something?


>
> >
> > > + pages_for_compaction <<= zone->compact_defer_shift;
> >
> >
> > > inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> > > if (nr_swap_pages > 0)
> > > inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
>
> --
> Mel Gorman
> SUSE Labs
>

--
Kind regards,
Minchan Kim

2012-08-09 09:20:41

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Thu, Aug 09, 2012 at 05:27:15PM +0900, Minchan Kim wrote:
> > > > + * pages reclaimed based on the number of consecutive allocation
> > > > + * failures
> > > > + */
> > > > + zone = lruvec_zone(lruvec);
> > > > + if (zone->compact_order_failed >= sc->order)
> > >
> > > I can't understand this part.
> > > We don't defer lower order than compact_order_failed by aff62249.
> > > Do you mean lower order compaction context should be a lamb for
> > > deferred higher order allocation request success? I think it's not fair
> > > and even I can't understand rationale why it has to scale the number of pages
> > > reclaimed with the number of recent compaction failture.
> > > Your changelog just says "What we have to do, NOT Why we have to do".
> > >
> >
> > I'm a moron, that should be <=, not >=. All my tests were based on order==9
> > and that was the only order using reclaim/compaction so it happened to
> > work as expected. Thanks! I fixed that and added the following
> > clarification to the changelog
> >
> > The rationale is that reclaiming the normal number of pages still allowed
> > compaction to fail and its success depends on the number of pages. If it's
> > failing, reclaim more pages until it succeeds again.
> >
> > Does that make more sense?
>
> If compaction is defered, requestors fails to get high-order page and
> they normally do fallback by order-0 or something.

Yes. At least, one hopes they fell back to order-0.

> In this context, if they don't depends on fallback and retrying higher order
> allocation, your patch makes sense to me because your algorithm is based on
> past allocation request fail rate.
> Do I miss something?

Your question is difficult to parse but I think you are making an implicit
assumption that it is the same caller retrying the high-order allocation.
That is not the case, nor do I want it to be, because that would be similar
to the caller using __GFP_REPEAT. Retrying with more reclaim until the
allocation succeeds would both stall and reclaim excessively.

The intention is that an allocation can fail but each subsequent attempt will
try harder until there is success. Each allocation request does a portion
of the necessary work to spread the cost between multiple requests. Take
THP for example where there is a constant request for THP allocations
for whatever reason (heavy fork workload, large buffer allocation being
populated etc.). Some of those allocations fail but if they do, future
THP requests will reclaim more pages. When compaction resumes again, it
will be more likely to succeed and compact_defer_shift gets reset. In the
specific case of THP there will be allocations that fail but khugepaged
will promote them later if the process is long-lived.

--
Mel Gorman
SUSE Labs

2012-08-09 20:31:51

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On 08/09/2012 05:20 AM, Mel Gorman wrote:

> The intention is that an allocation can fail but each subsequent attempt will
> try harder until there is success. Each allocation request does a portion
> of the necessary work to spread the cost between multiple requests.

At some point we need to stop doing that work, though.

Otherwise we could end up back at the problem where
way too much memory gets evicted, and we get swap
storms.

2012-08-09 23:25:56

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Thu, Aug 09, 2012 at 10:20:35AM +0100, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 05:27:15PM +0900, Minchan Kim wrote:
> > > > > + * pages reclaimed based on the number of consecutive allocation
> > > > > + * failures
> > > > > + */
> > > > > + zone = lruvec_zone(lruvec);
> > > > > + if (zone->compact_order_failed >= sc->order)
> > > >
> > > > I can't understand this part.
> > > > We don't defer lower order than compact_order_failed by aff62249.
> > > > Do you mean lower order compaction context should be a lamb for
> > > > deferred higher order allocation request success? I think it's not fair
> > > > and even I can't understand rationale why it has to scale the number of pages
> > > > reclaimed with the number of recent compaction failture.
> > > > Your changelog just says "What we have to do, NOT Why we have to do".
> > > >
> > >
> > > I'm a moron, that should be <=, not >=. All my tests were based on order==9
> > > and that was the only order using reclaim/compaction so it happened to
> > > work as expected. Thanks! I fixed that and added the following
> > > clarification to the changelog
> > >
> > > The rationale is that reclaiming the normal number of pages still allowed
> > > compaction to fail and its success depends on the number of pages. If it's
> > > failing, reclaim more pages until it succeeds again.
> > >
> > > Does that make more sense?
> >
> > If compaction is defered, requestors fails to get high-order page and
> > they normally do fallback by order-0 or something.
>
> Yes. At least, one hopes they fell back to order-0.
>
> > In this context, if they don't depends on fallback and retrying higher order
> > allocation, your patch makes sense to me because your algorithm is based on
> > past allocation request fail rate.
> > Do I miss something?
>
> Your question is difficult to parse but I think you are making an implicit
> assumption that it's the same caller retrying the high order allocation.
> That is not the case, not do I want it to be because that would be similar
> to the caller using __GFP_REPEAT. Retrying with more reclaim until the
> allocation succeeds would both stall and reclaim excessively.
>
> The intention is that an allocation can fail but each subsequent attempt will
> try harder until there is success. Each allocation request does a portion
> of the necessary work to spread the cost between multiple requests. Take
> THP for example where there is a constant request for THP allocations
> for whatever reason (heavy fork workload, large buffer allocation being
> populated etc.). Some of those allocations fail but if they do, future
> THP requests will reclaim more pages. When compaction resumes again, it
> will be more likely to succeed and compact_defer_shift gets reset. In the
> specific case of THP there will be allocations that fail but khugepaged
> will promote them later if the process is long-lived.

You assume high-order allocations are *constant*, and I guess your test environment
is optimal for that. I would agree with your patch if we could be sure such high-order
allocations are always constant, but is that true? Otherwise, your patch could reclaim
too many pages unnecessarily, and that could reduce system performance through eviction
of page cache and swap-out of part of the working set. That is a concern to me.
In summary, I think your patch is rather aggressive, so how about this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..0cb2593 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ struct zone *zone;

/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))
@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
+
+ /*
+ * If compaction is deferred for this order then scale the number of
+ * pages reclaimed based on the number of consecutive allocation
+ * failures
+ */
+ zone = lruvec_zone(lruvec);
+ if (zone->compact_order_failed <= sc->order) {
+ if (zone->compact_defer_shift)
+ /*
+ * We can't make sure deferred requests will come again
+ * The probability is 50:50.
+ */
+ pages_for_compaction <<= (zone->compact_defer_shift - 1);
+ }
inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
if (nr_swap_pages > 0)
inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);


>
> --
> Mel Gorman
> SUSE Labs
>

--
Kind regards,
Minchan Kim

2012-08-10 08:14:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Thu, Aug 09, 2012 at 04:29:57PM -0400, Rik van Riel wrote:
> On 08/09/2012 05:20 AM, Mel Gorman wrote:
>
> >The intention is that an allocation can fail but each subsequent attempt will
> >try harder until there is success. Each allocation request does a portion
> >of the necessary work to spread the cost between multiple requests.
>
> At some point we need to stop doing that work, though.
>
> Otherwise we could end up back at the problem where
> way too much memory gets evicted, and we get swap
> storms.

That's the case without this patch too, as it'll still be running
reclaim/compaction, just less aggressively. For it to continually try like
that, the system must either be continually under load, preventing compaction
from ever working (which may be undesirable for order-3 and the like), or so
badly fragmented that it cannot recover (I am not aware of a situation where
this has happened).

You could add a separate patch that checks whether
defer_shift == COMPACT_MAX_DEFER_SHIFT and disables reclaim/compaction in
that case, but then enough SWAP_CLUSTER_MAX-sized reclaims would have to
happen over time, or a large process would have to exit, before compaction
succeeds again.

I would expect allocation success rates under load to be very low with such
a patch applied.
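
A sketch of the kind of check described above (illustrative only;
COMPACT_MAX_DEFER_SHIFT mirrors the existing deferral limit and the struct
stands in for the relevant field of struct zone, this is not a real patch):

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the existing compaction deferral limit; illustrative constant */
#define COMPACT_MAX_DEFER_SHIFT 6

/* Toy stand-in for the relevant field of struct zone */
struct zone_compact_state {
        int compact_defer_shift;
};

/* Skip reclaim/compaction once deferral has hit its maximum backoff */
static bool worth_reclaim_compaction(const struct zone_compact_state *z)
{
        return z->compact_defer_shift != COMPACT_MAX_DEFER_SHIFT;
}

int main(void)
{
        struct zone_compact_state z = { COMPACT_MAX_DEFER_SHIFT };

        printf("reclaim/compaction still worthwhile? %s\n",
               worth_reclaim_compaction(&z) ? "yes" : "no");
        return 0;
}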

--
Mel Gorman
SUSE Labs

2012-08-10 08:34:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Fri, Aug 10, 2012 at 08:27:33AM +0900, Minchan Kim wrote:
> > <SNIP>
> >
> > The intention is that an allocation can fail but each subsequent attempt will
> > try harder until there is success. Each allocation request does a portion
> > of the necessary work to spread the cost between multiple requests. Take
> > THP for example where there is a constant request for THP allocations
> > for whatever reason (heavy fork workload, large buffer allocation being
> > populated etc.). Some of those allocations fail but if they do, future
> > THP requests will reclaim more pages. When compaction resumes again, it
> > will be more likely to succeed and compact_defer_shift gets reset. In the
> > specific case of THP there will be allocations that fail but khugepaged
> > will promote them later if the process is long-lived.
>
> You assume high-order allocation are *constant* and I guess your test enviroment
> is optimal for it.

OK, my example said they were constant because that was the easiest way to
illustrate it, but that does not necessarily have to be the case. The high-order
allocation requests can be separated by any length of time; with a read or
write stream running in the background applying a small amount of memory
pressure, the same scenario applies.

> I agree your patch if we can make sure such high-order
> allocation are always constant. But, is it true? Otherwise, your patch could reclaim
> too many pages unnecessary and it could reduce system performance by eviction

The "too many pages unnecessarily" is unlikely. For compact_defer_shift to be
elevated there has to have been recent failures by try_to_compact_pages(). If
compact_defer_shift is elevated and a large process exited then
try_to_compact_pages() may succeed and reset compact_defer_shift without
calling direct reclaim and entering this path at all.

> of page cache and swap out of workingset part. That's a concern to me.
> In summary, I think your patch is rather agressive so how about this?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 66e4310..0cb2593 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> {
> unsigned long pages_for_compaction;
> unsigned long inactive_lru_pages;
> + struct zone *zone;
>
> /* If not in reclaim/compaction mode, stop */
> if (!in_reclaim_compaction(sc))
> @@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> * inactive lists are large enough, continue reclaiming
> */
> pages_for_compaction = (2UL << sc->order);
> +
> + /*
> + * If compaction is deferred for this order then scale the number of
> + * pages reclaimed based on the number of consecutive allocation
> + * failures
> + */
> + zone = lruvec_zone(lruvec);
> + if (zone->compact_order_failed <= sc->order) {
> + if (zone->compact_defer_shift)
> + /*
> + * We can't make sure deferred requests will come again
> + * The probability is 50:50.
> + */
> + pages_for_compaction <<= (zone->compact_defer_shift - 1);

This patch is not doing anything radically different to my own patch.
compact_defer_shift == 0 if allocations succeeded recently with
reclaim/compaction at its normal level. Functionally, the only difference
is that you delay by one failure the point at which more pages get reclaimed.

Was that what you intended? If so, it's not clear why you think this patch
is better or how you concluded that the probability of another failure was
"50:50".

> }
> inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> if (nr_swap_pages > 0)
> inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
>
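
To put numbers on "delay ... by one failure", here is a quick standalone
comparison of the two scalings for an order-9 request (illustrative only):

#include <stdio.h>

/*
 * The patch shifts the base target by compact_defer_shift; the variant
 * above shifts by compact_defer_shift - 1 when it is non-zero.
 */
int main(void)
{
        unsigned long base = 2UL << 9;  /* pages_for_compaction before scaling */
        int shift;

        for (shift = 0; shift <= 6; shift++) {
                unsigned long patch = base << shift;
                unsigned long variant = shift ? base << (shift - 1) : base;

                printf("compact_defer_shift=%d: patch %6lu pages, variant %6lu pages\n",
                       shift, patch, variant);
        }
        return 0;
}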

--
Mel Gorman
SUSE Labs

2012-08-10 08:46:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

Hi Mel,

On Fri, Aug 10, 2012 at 09:34:38AM +0100, Mel Gorman wrote:
> On Fri, Aug 10, 2012 at 08:27:33AM +0900, Minchan Kim wrote:
> > > <SNIP>
> > >
> > > The intention is that an allocation can fail but each subsequent attempt will
> > > try harder until there is success. Each allocation request does a portion
> > > of the necessary work to spread the cost between multiple requests. Take
> > > THP for example where there is a constant request for THP allocations
> > > for whatever reason (heavy fork workload, large buffer allocation being
> > > populated etc.). Some of those allocations fail but if they do, future
> > > THP requests will reclaim more pages. When compaction resumes again, it
> > > will be more likely to succeed and compact_defer_shift gets reset. In the
> > > specific case of THP there will be allocations that fail but khugepaged
> > > will promote them later if the process is long-lived.
> >
> > You assume high-order allocation are *constant* and I guess your test enviroment
> > is optimal for it.
>
> Ok, my example stated they were constant because it was the easiest to
> illustrate but it does not necessarily have to be the case. The high-order
> allocation requests can be separated by any length of time with a read or
> write stream running in the background applying a small amount of memory
> pressure and the same scenario applies.
>
> > I agree your patch if we can make sure such high-order
> > allocation are always constant. But, is it true? Otherwise, your patch could reclaim
> > too many pages unnecessary and it could reduce system performance by eviction
>
> The "too many pages unnecessarily" is unlikely. For compact_defer_shift to be
> elevated there has to have been recent failures by try_to_compact_pages(). If
> compact_defer_shift is elevated and a large process exited then
> try_to_compact_pages() may succeed and reset compact_defer_shift without
> calling direct reclaim and entering this path at all.
>
> > of page cache and swap out of workingset part. That's a concern to me.
> > In summary, I think your patch is rather agressive so how about this?
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 66e4310..0cb2593 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > {
> > unsigned long pages_for_compaction;
> > unsigned long inactive_lru_pages;
> > + struct zone *zone;
> >
> > /* If not in reclaim/compaction mode, stop */
> > if (!in_reclaim_compaction(sc))
> > @@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > * inactive lists are large enough, continue reclaiming
> > */
> > pages_for_compaction = (2UL << sc->order);
> > +
> > + /*
> > + * If compaction is deferred for this order then scale the number of
> > + * pages reclaimed based on the number of consecutive allocation
> > + * failures
> > + */
> > + zone = lruvec_zone(lruvec);
> > + if (zone->compact_order_failed <= sc->order) {
> > + if (zone->compact_defer_shift)
> > + /*
> > + * We can't make sure deferred requests will come again
> > + * The probability is 50:50.
> > + */
> > + pages_for_compaction <<= (zone->compact_defer_shift - 1);
>
> This patch is not doing anything radically different to my own patch.
> compact_defer_shift == 0 if allocations succeeded recently using
> reclaim/compaction at its normal level. Functionally the only difference
> is that you delay when more pages get reclaim by one failure.
>
> Was that what you intended? If so, it's not clear why you think this patch
> is better or how you concluded that the probability of another failure was
> "50:50".

Please ignore my comment about this patch.
I got confused between compact_considered and compact_defer_shift.
compact_defer_shift is an indication of constant high-order page
allocation failures, so I have no objection any more.
Sorry for the noise. :(

--
Kind regards,
Minchan Kim