2015-12-03 08:11:49

by Vlastimil Babka

Subject: [RFC 0/3] reduce latency of direct async compaction

The goal is to reduce the latency (and increase the success rate) of direct
async compaction by making it focus more on the goal of creating a high-order
page, at the expense of thoroughness. This should be useful for example for THP
allocations, where we still get reports of it being too expensive, most
recently [2].

This is based on an older attempt [1] which I didn't finish, as it seemed to
increase longer-term fragmentation. Now it seems it doesn't, but I'll have
to test more properly. Patch 2 makes the migration scanner skip whole
order-aligned blocks where isolation fails, as it takes just one unmigrated
page to prevent a high-order page from forming by merging.

Patch 3 tries to reduce the excessive freepage scanning (such as in [3]) by
allocating migration targets directly from the freelists. We just need to be sure that the
pages are not from the same block as the migrated pages. This is also limited
to direct async compaction and is not meant to replace a (potentially
redesigned) free scanner for other scenarios.

Early tests with stress-highalloc configured to simulate THP allocations:

                          4.4-rc2             4.4-rc2             4.4-rc2             4.4-rc2
                           0-test              1-test              2-test              3-test
Success 1 Min       1.00 (   0.00%)     2.00 (-100.00%)     2.00 (-100.00%)     3.00 (-200.00%)
Success 1 Mean      3.00 (   0.00%)     3.00 (   0.00%)     2.80 (   6.67%)     4.80 ( -60.00%)
Success 1 Max       6.00 (   0.00%)     4.00 (  33.33%)     5.00 (  16.67%)     7.00 ( -16.67%)
Success 2 Min       1.00 (   0.00%)     3.00 (-200.00%)     4.00 (-300.00%)     8.00 (-700.00%)
Success 2 Mean      3.80 (   0.00%)     4.00 (  -5.26%)     5.20 ( -36.84%)    11.00 (-189.47%)
Success 2 Max       8.00 (   0.00%)     7.00 (  12.50%)     6.00 (  25.00%)    13.00 ( -62.50%)
Success 3 Min      58.00 (   0.00%)    69.00 ( -18.97%)    53.00 (   8.62%)    66.00 ( -13.79%)
Success 3 Mean     67.40 (   0.00%)    74.00 (  -9.79%)    58.20 (  13.65%)    68.80 (  -2.08%)
Success 3 Max      74.00 (   0.00%)    78.00 (  -5.41%)    70.00 (   5.41%)    72.00 (   2.70%)

              4.4-rc2     4.4-rc2     4.4-rc2     4.4-rc2
               0-test      1-test      2-test      3-test
User          3167.23     3140.58     3198.77     3049.85
System        1166.65     1158.64     1171.06     1140.18
Elapsed       1827.63     1737.69     1750.62     1793.82

4.4-rc2 4.4-rc2 4.4-rc2 4.4-rc2
0-test 1-test 2-test 3-test
Minor Faults 107184766 107311664 107366319 108425875
Major Faults 753 730 746 817
Swap Ins 188 346 243 287
Swap Outs 7278 6186 6226 5702
Allocation stalls 988 868 1104 846
DMA allocs 25 18 15 13
DMA32 allocs 75074785 75104070 75131502 76260816
Normal allocs 26112454 26193770 26142374 26291337
Movable allocs 0 0 0 0
Direct pages scanned 83996 82251 80523 93509
Kswapd pages scanned 2122511 2107947 2110599 2121951
Kswapd pages reclaimed 2031597 2006468 2011184 2052483
Direct pages reclaimed 83806 82162 80315 93275
Kswapd efficiency 95% 95% 95% 96%
Kswapd velocity 1217.211 1202.789 1211.116 1189.075
Direct efficiency 99% 99% 99% 99%
Direct velocity 48.170 46.932 46.206 52.400
Percentage direct scans 3% 3% 3% 4%
Zone normal velocity 301.196 301.273 297.286 308.598
Zone dma32 velocity 964.185 948.448 960.036 932.877
Zone dma velocity 0.000 0.000 0.000 0.000
Page writes by reclaim 7296.200 6187.400 6226.800 5702.600
Page writes file 18 1 0 0
Page writes anon 7278 6186 6226 5702
Page reclaim immediate 259 225 41 180
Sector Reads 4132945 4074422 4099737 4291996
Sector Writes 11066128 11057103 11066448 11083256
Page rescued immediate 0 0 0 0
Slabs scanned 1539471 1521153 1518145 1776426
Direct inode steals 8482 3717 6096 9832
Kswapd inode steals 37735 42700 39976 43492
Kswapd skipped wait 0 0 0 0
THP fault alloc 593 610 680 778
THP collapse alloc 340 294 335 393
THP splits 4 2 4 3
THP fault fallback 751 748 705 626
THP collapse fail 14 16 14 12
Compaction stalls 6464 6373 6743 6451
Compaction success 518 688 575 972
Compaction failures 5945 5684 6167 5479
Page migrate success 318176 313488 239637 595224
Page migrate failure 40983 46106 12171 2587
Compaction pages isolated 733684 735737 564719 713799
Compaction migrate scanned 1101427 1056870 603977 969346
Compaction free scanned 17736383 15328486 11999748 5269641
Compaction cost 352 347 263 638
NUMA alloc hit 99632716 99690283 99753018 100771746
NUMA alloc miss 0 0 0 0
NUMA interleave hit 0 0 0 0
NUMA alloc local 99632716 99690283 99753018 100771746
NUMA base PTE updates 0 0 0 0
NUMA huge PMD updates 0 0 0 0
NUMA page range updates 0 0 0 0
NUMA hint faults 0 0 0 0
NUMA hint local faults 0 0 0 0
NUMA hint local percent 100 100 100 100
NUMA pages migrated 0 0 0 0
AutoNUMA cost 0% 0% 0% 0%

Migrate scanned pages are reduced by patch 2 as expected, thanks to the skipping.
Patch 3 reduces free scanned pages significantly, and improves compaction
success and THP fault allocs (of the interfering activity, not the alloc test
itself). That results in more migrate scanner activity, as more success means
less deferring, and time previously spent in the free scanner can now be used
by the migration scanner.

"Success 3" is indication of long-term fragmentation (the interference is
ceased in this phase) and it looks quite unstable overall (there shouldn't be
such difference between base and patch 1) but it doesn't seem decreased. I'm
suspecting it's the lack of reset_isolation_suitable() when the only activity
is async compaction. Needs more evaluation.

Aaron, could you try this on your testcase?

[1] https://lkml.org/lkml/2014/7/16/988
[2] http://www.spinics.net/lists/linux-mm/msg97378.html
[3] http://www.spinics.net/lists/linux-mm/msg97475.html

Vlastimil Babka (3):
mm, compaction: reduce spurious pcplist drains
mm, compaction: make async direct compaction skip blocks where
isolation fails
mm, compaction: direct freepage allocation for async direct compaction

include/linux/vm_event_item.h | 1 +
mm/compaction.c | 122 +++++++++++++++++++++++++++++++++++-------
mm/internal.h | 4 ++
mm/page_alloc.c | 27 ++++++++++
mm/vmstat.c | 2 +
5 files changed, 137 insertions(+), 19 deletions(-)

--
2.6.3


2015-12-03 08:11:07

by Vlastimil Babka

Subject: [RFC 1/3] mm, compaction: reduce spurious pcplist drains

Compaction drains the local pcplists each time the migration scanner moves away
from a cc->order aligned block where it isolated pages for migration, so that
the pages freed by migrations can merge into higher orders.

The detection is currently coarser than it could be. The cc->last_migrated_pfn
variable should track the lowest pfn that was isolated for migration. But it
is set to the pfn where isolate_migratepages_block() starts scanning, which is
typically the first pfn of the pageblock. There, the scanner might fail to
isolate in several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX
pages in another block. This would cause the pcplists drain to be performed,
although the scanner has not yet finished the block it isolated from.

This patch thus makes cc->last_migrated_pfn handling more accurate, by setting
it to the pfn of an actually isolated page in isolate_migratepages_block().
Although the practical effects of this patch are likely small, it arguably makes
the intent of the code more obvious. Also, the next patch will make async direct
compaction skip blocks more aggressively, and draining pcplists due to skipped
blocks is wasteful.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/compaction.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index de3e1e71cd9f..9c14d10ad3e5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -815,6 +815,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
cc->nr_migratepages++;
nr_isolated++;

+ /*
+ * Record where we could have freed pages by migration and not
+ * yet flushed them to buddy allocator.
+ * - this is the lowest page that was isolated and will likely
+ * then be freed by migration.
+ */
+ if (!cc->last_migrated_pfn)
+ cc->last_migrated_pfn = low_pfn;
+
/* Avoid isolating too much */
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
++low_pfn;
@@ -1104,7 +1113,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
{
unsigned long low_pfn, end_pfn;
- unsigned long isolate_start_pfn;
struct page *page;
const isolate_mode_t isolate_mode =
(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
@@ -1153,7 +1161,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
continue;

/* Perform the isolation */
- isolate_start_pfn = low_pfn;
low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
isolate_mode);

@@ -1163,15 +1170,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
}

/*
- * Record where we could have freed pages by migration and not
- * yet flushed them to buddy allocator.
- * - this is the lowest page that could have been isolated and
- * then freed by migration.
- */
- if (cc->nr_migratepages && !cc->last_migrated_pfn)
- cc->last_migrated_pfn = isolate_start_pfn;
-
- /*
* Either we isolated something and proceed with migration. Or
* we failed and compact_zone should decide if we should
* continue or not.
--
2.6.3

2015-12-03 08:11:06

by Vlastimil Babka

Subject: [RFC 2/3] mm, compaction: make async direct compaction skip blocks where isolation fails

The goal of direct compaction is to quickly make a high-order page available.
Within an aligned block of pages of the desired order, a single allocated page
that cannot be isolated for migration means that the block cannot fulfill the
allocation request. Therefore we can reduce the latency by skipping such blocks
immediately on isolation failure. For async compaction, this also means a
higher chance of succeeding before the operation becomes contended.

We shouldn't, however, completely sacrifice the second objective of compaction,
which is to reduce overall long-term memory fragmentation. As a compromise,
perform the eager skipping only in direct async compaction, while sync
compaction remains thorough.

Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/compaction.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-------
mm/internal.h | 1 +
2 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9c14d10ad3e5..f94518b5b1c9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -672,6 +672,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
unsigned long start_pfn = low_pfn;
+ bool skip_on_failure = false;
+ unsigned long next_skip_pfn = 0;

/*
* Ensure that there are not too many pages isolated from the LRU
@@ -692,10 +694,24 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (compact_should_abort(cc))
return 0;

+ if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC))
+ skip_on_failure = true;
+
/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
bool is_lru;

+ if (skip_on_failure && low_pfn >= next_skip_pfn) {
+ if (nr_isolated)
+ break;
+ /*
+ * low_pfn might have been incremented by arbitrary
+ * number due to skipping a compound or a high-order
+ * buddy page in the previous iteration
+ */
+ next_skip_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+ }
+
/*
* Periodically drop the lock (if held) regardless of its
* contention, to give chance to IRQs. Abort async compaction
@@ -707,7 +723,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
break;

if (!pfn_valid_within(low_pfn))
- continue;
+ goto isolate_fail;
nr_scanned++;

page = pfn_to_page(low_pfn);
@@ -762,11 +778,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (likely(comp_order < MAX_ORDER))
low_pfn += (1UL << comp_order) - 1;

- continue;
+ goto isolate_fail;
}

if (!is_lru)
- continue;
+ goto isolate_fail;

/*
* Migration will fail if an anonymous page is pinned in memory,
@@ -775,7 +791,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (!page_mapping(page) &&
page_count(page) > page_mapcount(page))
- continue;
+ goto isolate_fail;

/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
@@ -786,7 +802,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,

/* Recheck PageLRU and PageCompound under lock */
if (!PageLRU(page))
- continue;
+ goto isolate_fail;

/*
* Page become compound since the non-locked check,
@@ -795,7 +811,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (unlikely(PageCompound(page))) {
low_pfn += (1UL << compound_order(page)) - 1;
- continue;
+ goto isolate_fail;
}
}

@@ -803,7 +819,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,

/* Try isolate the page */
if (__isolate_lru_page(page, isolate_mode) != 0)
- continue;
+ goto isolate_fail;

VM_BUG_ON_PAGE(PageCompound(page), page);

@@ -829,6 +845,30 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
++low_pfn;
break;
}
+
+ continue;
+isolate_fail:
+ if (!skip_on_failure)
+ continue;
+
+ if (nr_isolated) {
+ if (locked) {
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ locked = false;
+ }
+ putback_movable_pages(migratelist);
+ nr_isolated = 0;
+ cc->last_migrated_pfn = 0;
+ }
+
+ if (low_pfn < next_skip_pfn) {
+ low_pfn = next_skip_pfn - 1;
+ /*
+ * The check near the loop beginning would have updated
+ * next_skip_pfn too, but this is a bit simpler.
+ */
+ next_skip_pfn += 1UL << cc->order;
+ }
}

/*
@@ -1495,6 +1535,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
.mode = mode,
.alloc_flags = alloc_flags,
.classzone_idx = classzone_idx,
+ .direct_compaction = true,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
diff --git a/mm/internal.h b/mm/internal.h
index 38e24b89e4c4..079ba14afe55 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -205,6 +205,7 @@ struct compact_control {
unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
+ bool direct_compaction; /* Low latency over thoroughness */
int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
const int alloc_flags; /* alloc flags of a direct compactor */
--
2.6.3

2015-12-03 08:11:08

by Vlastimil Babka

Subject: [RFC 3/3] mm, compaction: direct freepage allocation for async direct compaction

The goal of direct compaction is to quickly make a high-order page available.
The free page scanner can add significant latency when searching for free
pages, although for the compaction to succeed, the only important constraint on
the free pages used as migration targets is that they must not come from the
same order-aligned block as the migration sources.

This patch therefore makes direct async compaction allocate freepages directly
from the freelists. Pages that do come from the same block (which we cannot simply
exclude from the allocation) are put on a separate list and released afterwards
to facilitate merging.

Another advantage is that we split larger free pages only when necessary, while
the free scanner can split potentially up to order-1. However, we still likely
sacrifice some of the long-term anti-fragmentation features of a thorough
compaction, hence the limiting of this approach to direct async compaction.

Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/compaction.c | 47 ++++++++++++++++++++++++++++++++++++++++++-
mm/internal.h | 3 +++
mm/page_alloc.c | 27 +++++++++++++++++++++++++
mm/vmstat.c | 2 ++
5 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index e623d392db0c..614291613408 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -50,6 +50,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#endif
#ifdef CONFIG_COMPACTION
COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+ COMPACTFREE_DIRECT, COMPACTFREE_DIRECT_MISS,
COMPACTISOLATED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index f94518b5b1c9..74b5b5ddebb0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1088,6 +1088,40 @@ static void isolate_freepages(struct compact_control *cc)
cc->free_pfn = isolate_start_pfn;
}

+static void isolate_freepages_direct(struct compact_control *cc)
+{
+ unsigned long nr_pages;
+ unsigned long flags;
+
+ nr_pages = cc->nr_migratepages - cc->nr_freepages;
+
+ if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc))
+ return;
+
+ while (nr_pages) {
+ struct page *page;
+ unsigned long pfn;
+
+ page = alloc_pages_zone(cc->zone, 0, MIGRATE_MOVABLE);
+ if (!page)
+ break;
+ pfn = page_to_pfn(page);
+
+ /* Is the free page in the block we are migrating from? */
+ if (pfn >> cc->order == (cc->migrate_pfn - 1) >> cc->order) {
+ list_add(&page->lru, &cc->culled_freepages);
+ count_compact_event(COMPACTFREE_DIRECT_MISS);
+ } else {
+ list_add(&page->lru, &cc->freepages);
+ cc->nr_freepages++;
+ nr_pages--;
+ count_compact_event(COMPACTFREE_DIRECT);
+ }
+ }
+
+ spin_unlock_irqrestore(&cc->zone->lock, flags);
+}
+
/*
* This is a migrate-callback that "allocates" freepages by taking pages
* from the isolated freelists in the block we are migrating to.
@@ -1104,7 +1138,12 @@ static struct page *compaction_alloc(struct page *migratepage,
* contention.
*/
if (list_empty(&cc->freepages)) {
- if (!cc->contended)
+ if (cc->contended)
+ return NULL;
+
+ if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC))
+ isolate_freepages_direct(cc);
+ else
isolate_freepages(cc);

if (list_empty(&cc->freepages))
@@ -1481,6 +1520,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
cc->migrate_pfn & ~((1UL << cc->order) - 1);

if (cc->last_migrated_pfn < current_block_start) {
+ if (!list_empty(&cc->culled_freepages))
+ release_freepages(&cc->culled_freepages);
cpu = get_cpu();
lru_add_drain_cpu(cpu);
drain_local_pages(zone);
@@ -1511,6 +1552,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
if (free_pfn > zone->compact_cached_free_pfn)
zone->compact_cached_free_pfn = free_pfn;
}
+ if (!list_empty(&cc->culled_freepages))
+ release_freepages(&cc->culled_freepages);

trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync, ret);
@@ -1539,6 +1582,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
+ INIT_LIST_HEAD(&cc.culled_freepages);

ret = compact_zone(zone, &cc);

@@ -1684,6 +1728,7 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
cc->zone = zone;
INIT_LIST_HEAD(&cc->freepages);
INIT_LIST_HEAD(&cc->migratepages);
+ INIT_LIST_HEAD(&cc->culled_freepages);

/*
* When called via /proc/sys/vm/compact_memory
diff --git a/mm/internal.h b/mm/internal.h
index 079ba14afe55..cb6a3f6ca631 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -175,6 +175,8 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
}

extern int __isolate_free_page(struct page *page, unsigned int order);
+extern struct page * alloc_pages_zone(struct zone *zone, unsigned int order,
+ int migratetype);
extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
unsigned int order);
extern void prep_compound_page(struct page *page, unsigned int order);
@@ -198,6 +200,7 @@ extern int user_min_free_kbytes;
struct compact_control {
struct list_head freepages; /* List of free pages to migrate to */
struct list_head migratepages; /* List of pages being migrated */
+ struct list_head culled_freepages;
unsigned long nr_freepages; /* Number of isolated free pages */
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66639a9..715f0e6047c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2185,6 +2185,33 @@ int split_free_page(struct page *page)
}

/*
+ * Like split_free_page, but given the zone, it will grab a free page from
+ * the freelists.
+ */
+struct page *
+alloc_pages_zone(struct zone *zone, unsigned int order, int migratetype)
+{
+ struct page *page;
+ unsigned long watermark;
+
+ watermark = low_wmark_pages(zone) + (1 << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ return NULL;
+
+ page = __rmqueue(zone, order, migratetype, 0);
+ if (!page)
+ return NULL;
+
+ __mod_zone_freepage_state(zone, -(1 << order),
+ get_pcppage_migratetype(page));
+
+ set_page_owner(page, order, __GFP_MOVABLE);
+ set_page_refcounted(page);
+
+ return page;
+}
+
+/*
* Allocate a page from the given zone. Use pcplists for order-0 allocations.
*/
static inline
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 879a2be23325..20e2affdc08e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -819,6 +819,8 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_COMPACTION
"compact_migrate_scanned",
"compact_free_scanned",
+ "compact_free_direct",
+ "compact_free_direct_miss",
"compact_isolated",
"compact_stall",
"compact_fail",
--
2.6.3

2015-12-03 09:25:31

by Aaron Lu

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> Aaron, could you try this on your testcase?

The test result is placed at:
https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U

For some reason, the patches made the performance worse. The base tree is
today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
performance is about 1000MB/s. After applying this patch series, the
performance drops to 720MB/s.

Please let me know if you need more information, thanks.

2015-12-03 09:38:55

by Vlastimil Babka

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On 12/03/2015 10:25 AM, Aaron Lu wrote:
> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
>> Aaron, could you try this on your testcase?
>
> The test result is placed at:
> https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U
>
> For some reason, the patches made the performace worse. The base tree is
> today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
> performace is about 1000MB/s. After applying this patch series, the
> performace drops to 720MB/s.
>
> Please let me know if you need more information, thanks.

Hm, compaction stats are at 0. The code in the patches isn't even running.
Can you provide the same data also for the base tree?

2015-12-03 11:35:12

by Aaron Lu

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:
> On 12/03/2015 10:25 AM, Aaron Lu wrote:
> > On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> >> Aaron, could you try this on your testcase?
> >
> > The test result is placed at:
> > https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U
> >
> > For some reason, the patches made the performace worse. The base tree is
> > today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
> > performace is about 1000MB/s. After applying this patch series, the
> > performace drops to 720MB/s.
> >
> > Please let me know if you need more information, thanks.
>
> Hm, compaction stats are at 0. The code in the patches isn't even running.
> Can you provide the same data also for the base tree?

My bad, I uploaded the wrong data :-/
I uploaded again:
https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E

And I just ran the base tree with trace-cmd and found that its
performance drops significantly (from 1000MB/s to 6xxMB/s); is it that
trace-cmd impacts performance a lot? Any suggestions on how to run
the test regarding trace-cmd? i.e. should I always run usemem under
trace-cmd or only when necessary?

Thanks,
Aaron

2015-12-03 11:53:10

by Aaron Lu

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Thu, Dec 03, 2015 at 07:35:08PM +0800, Aaron Lu wrote:
> On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:
> > On 12/03/2015 10:25 AM, Aaron Lu wrote:
> > > On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> > >> Aaron, could you try this on your testcase?
> > >
> > > The test result is placed at:
> > > https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U
> > >
> > > For some reason, the patches made the performace worse. The base tree is
> > > today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
> > > performace is about 1000MB/s. After applying this patch series, the
> > > performace drops to 720MB/s.
> > >
> > > Please let me know if you need more information, thanks.
> >
> > Hm, compaction stats are at 0. The code in the patches isn't even running.
> > Can you provide the same data also for the base tree?
>
> My bad, I uploaded the wrong data :-/
> I uploaded again:
> https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E
>
> And I just run the base tree with trace-cmd and found that its
> performace drops significantly(from 1000MB/s to 6xxMB/s), is it that
> trace-cmd will impact performace a lot? Any suggestions on how to run
> the test regarding trace-cmd? i.e. should I aways run usemem under
> trace-cmd or only when necessary?

I just ran the test with the base tree and with this patch series
applied (head); I didn't use trace-cmd this time.

The throughput for the base tree is 963MB/s while the head is 815MB/s. I
have attached pagetypeinfo/proc-vmstat/perf-profile for them.


Attachments:
(No filename) (1.54 kB)
base.tar (160.00 kB)
head.tar (180.00 kB)

2015-12-04 06:26:05

by Aaron Lu

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> Aaron, could you try this on your testcase?

A single run isn't stable enough, so I did 9 runs for each commit;
here are the results:

base: 25364a9e54fb8296837061bf684b76d20eec01fb
head: 7433b1009ff5a02e1e9f3444802daba2cf385d27
(head = base + this_patch_serie)

The always-always case(transparent_hugepage set to always and defrag set
to always):

Result for base:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 100000622592
100000622592 transferred in 103 seconds, throughput: 925 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99999559680
99999559680 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99996171264
99996171264 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100005663744
100005663744 transferred in 150 seconds, throughput: 635 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100002966528
100002966528 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99995784192
99995784192 transferred in 131 seconds, throughput: 727 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100003731456
100003731456 transferred in 97 seconds, throughput: 983 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100006440960
100006440960 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99998813184
99998813184 transferred in 122 seconds, throughput: 781 MB/s
Max: 1096 MB/s
Min: 635 MB/s
Avg: 899 MB/s

Result for head:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 100003163136
100003163136 transferred in 105 seconds, throughput: 908 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99998524416
99998524416 transferred in 78 seconds, throughput: 1222 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99993646080
99993646080 transferred in 108 seconds, throughput: 882 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99998936064
99998936064 transferred in 114 seconds, throughput: 836 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100002204672
100002204672 transferred in 73 seconds, throughput: 1306 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99998140416
99998140416 transferred in 146 seconds, throughput: 653 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100002941952
100002941952 transferred in 78 seconds, throughput: 1222 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99996917760
99996917760 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100001405952
100001405952 transferred in 96 seconds, throughput: 993 MB/s
Max: 1306 MB/s
Min: 653 MB/s
Avg: 988 MB/s

Result for v4.3 as a reference:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 100002459648
100002459648 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99997375488
99997375488 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99999028224
99999028224 transferred in 107 seconds, throughput: 891 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100000137216
100000137216 transferred in 91 seconds, throughput: 1047 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100003835904
100003835904 transferred in 80 seconds, throughput: 1192 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100000143360
100000143360 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100020593664
100020593664 transferred in 101 seconds, throughput: 944 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100005805056
100005805056 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100008360960
100008360960 transferred in 74 seconds, throughput: 1288 MB/s
Max: 1288 MB/s
Min: 891 MB/s
Avg: 1048 MB/s

The always-never case:

Result for head:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 100003940352
100003940352 transferred in 71 seconds, throughput: 1343 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100007411712
100007411712 transferred in 62 seconds, throughput: 1538 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100001875968
100001875968 transferred in 64 seconds, throughput: 1490 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100003912704
100003912704 transferred in 62 seconds, throughput: 1538 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100002238464
100002238464 transferred in 66 seconds, throughput: 1444 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100003670016
100003670016 transferred in 65 seconds, throughput: 1467 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99998364672
99998364672 transferred in 68 seconds, throughput: 1402 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100005417984
100005417984 transferred in 70 seconds, throughput: 1362 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100005304320
100005304320 transferred in 64 seconds, throughput: 1490 MB/s
Max: 1538 MB/s
Min: 1343 MB/s
Avg: 1452 MB/s

2015-12-04 12:34:15

by Vlastimil Babka

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On 12/03/2015 12:52 PM, Aaron Lu wrote:
> On Thu, Dec 03, 2015 at 07:35:08PM +0800, Aaron Lu wrote:
>> On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:
>>> On 12/03/2015 10:25 AM, Aaron Lu wrote:
>>>> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
>>
>> My bad, I uploaded the wrong data :-/
>> I uploaded again:
>> https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E
>>
>> And I just run the base tree with trace-cmd and found that its
>> performace drops significantly(from 1000MB/s to 6xxMB/s), is it that
>> trace-cmd will impact performace a lot?

Yeah it has some overhead depending on how many events it has to
process. Your workload is quite sensitive to that.

>> Any suggestions on how to run
>> the test regarding trace-cmd? i.e. should I aways run usemem under
>> trace-cmd or only when necessary?

I'd run it with tracing only when the goal is to collect traces, but not
for any performance comparisons. Also it's not useful to collect perf
data while also tracing.

> I just run the test with the base tree and with this patch series
> applied(head), I didn't use trace-cmd this time.
>
> The throughput for base tree is 963MB/s while the head is 815MB/s, I
> have attached pagetypeinfo/proc-vmstat/perf-profile for them.

The compact stats improvements look fine, perhaps better than in my tests:

base: compact_migrate_scanned 3476360
head: compact_migrate_scanned 1020827

- that's the eager skipping of patch 2

base: compact_free_scanned 5924928
head: compact_free_scanned 0
compact_free_direct 918813
compact_free_direct_miss 500308

As your workload does exclusively async direct compaction through THP
faults, the traditional free scanner isn't used at all. Direct
allocations should be much cheaper, although the "miss" ratio (the
allocations that were from the same pageblock as the one we are
compacting) is quite high. I should probably look into making migration
release pages to the tails of the freelists - could be that it's
grabbing the very pages that were just freed in the previous
COMPACT_CLUSTER_MAX cycle (modulo pcplist buffering).

I however find it strange that your original stats (4.3?) differ from
the base so much:

compact_migrate_scanned 1982396
compact_free_scanned 40576943

That was an order of magnitude more free scanned on 4.3, and half the
migrate scanned. But your throughput figures in the other mail suggested
a regression from 4.3 to 4.4, which would be the opposite of what the
stats say. And anyway, compaction code didn't change between 4.3 and 4.4
except changes to tracepoint format...

moving on...
base:
compact_isolated 731304
compact_stall 10561
compact_fail 9459
compact_success 1102

head:
compact_isolated 921087
compact_stall 14451
compact_fail 12550
compact_success 1901

More success in both isolation and compaction results.

base:
thp_fault_alloc 45337
thp_fault_fallback 2349

head:
thp_fault_alloc 45564
thp_fault_fallback 2120

Somehow the extra compact success didn't fully translate to thp alloc
success... But given how many of the allocs didn't even involve a
compact_stall (two thirds of them), that interpretation could also be
easily misleading. So, hard to say.

Looking at the perf profiles...
base:
    54.55%  54.55%  :1550  [kernel.kallsyms]  [k] pageblock_pfn_to_page

head:
    40.13%  40.13%  :1551  [kernel.kallsyms]  [k] pageblock_pfn_to_page

Since the freepage allocation doesn't hit this code anymore, it shows
that the bulk was actually from the migration scanner, although the perf
callgraph and vmstats suggested otherwise. However, vmstats count only
when the scanner actually enters the pageblock, and there are numerous
reasons why it wouldn't... For example the pageblock_skip bitmap. Could
it make sense to look at the bitmap before doing the pfn_to_page
translation?
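
To make that concrete, a rough sketch of how the migration scanner loop could
consult the bitmap first (not a real patch; get_pfnblock_skip() below is a
hypothetical pfn-based lookup, since the existing skip-bit test in
isolation_suitable() needs the very struct page we are trying to avoid looking
up):

	/*
	 * Check the pageblock skip bit before paying for the struct page
	 * translation (get_pfnblock_skip() is hypothetical).
	 */
	if (!cc->ignore_skip_hint && get_pfnblock_skip(zone, low_pfn))
		continue;

	page = pageblock_pfn_to_page(low_pfn, end_pfn, zone);
	if (!page)
		continue;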

I don't see much else in the profiles. I guess the remaining problem of
compaction here is that deferring compaction doesn't trigger for async
compaction, and this testcase doesn't hit sync compaction at all.

2015-12-04 12:38:37

by Vlastimil Babka

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On 12/04/2015 07:25 AM, Aaron Lu wrote:
> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
>> Aaron, could you try this on your testcase?
>
> One time result isn't stable enough, so I did 9 runs for each commit,
> here is the result:
>
> base: 25364a9e54fb8296837061bf684b76d20eec01fb
> head: 7433b1009ff5a02e1e9f3444802daba2cf385d27
> (head = base + this_patch_serie)
>
> The always-always case(transparent_hugepage set to always and defrag set
> to always):
>
> Result for base:
> $ cat {0..8}/swap
> cmdline: /lkp/aaron/src/bin/usemem 100000622592
> 100000622592 transferred in 103 seconds, throughput: 925 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99999559680
> 99999559680 transferred in 92 seconds, throughput: 1036 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99996171264
> 99996171264 transferred in 92 seconds, throughput: 1036 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100005663744
> 100005663744 transferred in 150 seconds, throughput: 635 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100002966528
> 100002966528 transferred in 87 seconds, throughput: 1096 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99995784192
> 99995784192 transferred in 131 seconds, throughput: 727 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100003731456
> 100003731456 transferred in 97 seconds, throughput: 983 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100006440960
> 100006440960 transferred in 109 seconds, throughput: 874 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99998813184
> 99998813184 transferred in 122 seconds, throughput: 781 MB/s
> Max: 1096 MB/s
> Min: 635 MB/s
> Avg: 899 MB/s
>
> Result for head:
> $ cat {0..8}/swap
> cmdline: /lkp/aaron/src/bin/usemem 100003163136
> 100003163136 transferred in 105 seconds, throughput: 908 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99998524416
> 99998524416 transferred in 78 seconds, throughput: 1222 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99993646080
> 99993646080 transferred in 108 seconds, throughput: 882 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99998936064
> 99998936064 transferred in 114 seconds, throughput: 836 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100002204672
> 100002204672 transferred in 73 seconds, throughput: 1306 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99998140416
> 99998140416 transferred in 146 seconds, throughput: 653 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100002941952
> 100002941952 transferred in 78 seconds, throughput: 1222 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99996917760
> 99996917760 transferred in 109 seconds, throughput: 874 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100001405952
> 100001405952 transferred in 96 seconds, throughput: 993 MB/s
> Max: 1306 MB/s
> Min: 653 MB/s
> Avg: 988 MB/s

Ok that looks better than the first results :) The series either helped,
or it's just noise. But hopefully not worse.

> Result for v4.3 as a reference:
> $ cat {0..8}/swap
> cmdline: /lkp/aaron/src/bin/usemem 100002459648
> 100002459648 transferred in 96 seconds, throughput: 993 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99997375488
> 99997375488 transferred in 96 seconds, throughput: 993 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99999028224
> 99999028224 transferred in 107 seconds, throughput: 891 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100000137216
> 100000137216 transferred in 91 seconds, throughput: 1047 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100003835904
> 100003835904 transferred in 80 seconds, throughput: 1192 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100000143360
> 100000143360 transferred in 96 seconds, throughput: 993 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100020593664
> 100020593664 transferred in 101 seconds, throughput: 944 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100005805056
> 100005805056 transferred in 87 seconds, throughput: 1096 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100008360960
> 100008360960 transferred in 74 seconds, throughput: 1288 MB/s
> Max: 1288 MB/s
> Min: 891 MB/s
> Avg: 1048 MB/s

Hard to say if there's an actual regression from 4.3 to 4.4; it's too
noisy. More iterations could help, but then the eventual bisection would
need them too.

> The always-never case:
>
> Result for head:
> $ cat {0..8}/swap
> cmdline: /lkp/aaron/src/bin/usemem 100003940352
> 100003940352 transferred in 71 seconds, throughput: 1343 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100007411712
> 100007411712 transferred in 62 seconds, throughput: 1538 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100001875968
> 100001875968 transferred in 64 seconds, throughput: 1490 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100003912704
> 100003912704 transferred in 62 seconds, throughput: 1538 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100002238464
> 100002238464 transferred in 66 seconds, throughput: 1444 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100003670016
> 100003670016 transferred in 65 seconds, throughput: 1467 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 99998364672
> 99998364672 transferred in 68 seconds, throughput: 1402 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100005417984
> 100005417984 transferred in 70 seconds, throughput: 1362 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100005304320
> 100005304320 transferred in 64 seconds, throughput: 1490 MB/s
> Max: 1538 MB/s
> Min: 1343 MB/s
> Avg: 1452 MB/s
>

2015-12-07 03:14:19

by Aaron Lu

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On 12/04/2015 08:38 PM, Vlastimil Babka wrote:
> On 12/04/2015 07:25 AM, Aaron Lu wrote:
>> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
>>> Aaron, could you try this on your testcase?
>>
>> One time result isn't stable enough, so I did 9 runs for each commit,
>> here is the result:
>>
>> base: 25364a9e54fb8296837061bf684b76d20eec01fb
>> head: 7433b1009ff5a02e1e9f3444802daba2cf385d27
>> (head = base + this_patch_serie)
>>
>> The always-always case(transparent_hugepage set to always and defrag set
>> to always):
>>
>> Result for base:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 100000622592
>> 100000622592 transferred in 103 seconds, throughput: 925 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99999559680
>> 99999559680 transferred in 92 seconds, throughput: 1036 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99996171264
>> 99996171264 transferred in 92 seconds, throughput: 1036 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100005663744
>> 100005663744 transferred in 150 seconds, throughput: 635 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100002966528
>> 100002966528 transferred in 87 seconds, throughput: 1096 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99995784192
>> 99995784192 transferred in 131 seconds, throughput: 727 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100003731456
>> 100003731456 transferred in 97 seconds, throughput: 983 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100006440960
>> 100006440960 transferred in 109 seconds, throughput: 874 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99998813184
>> 99998813184 transferred in 122 seconds, throughput: 781 MB/s
>> Max: 1096 MB/s
>> Min: 635 MB/s
>> Avg: 899 MB/s
>>
>> Result for head:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 100003163136
>> 100003163136 transferred in 105 seconds, throughput: 908 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99998524416
>> 99998524416 transferred in 78 seconds, throughput: 1222 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99993646080
>> 99993646080 transferred in 108 seconds, throughput: 882 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99998936064
>> 99998936064 transferred in 114 seconds, throughput: 836 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100002204672
>> 100002204672 transferred in 73 seconds, throughput: 1306 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99998140416
>> 99998140416 transferred in 146 seconds, throughput: 653 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100002941952
>> 100002941952 transferred in 78 seconds, throughput: 1222 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99996917760
>> 99996917760 transferred in 109 seconds, throughput: 874 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100001405952
>> 100001405952 transferred in 96 seconds, throughput: 993 MB/s
>> Max: 1306 MB/s
>> Min: 653 MB/s
>> Avg: 988 MB/s
>
> Ok that looks better than the first results :) The series either helped,
> or it's just noise. But hopefully not worse.

Well, it looks to be the case :-)

>
>> Result for v4.3 as a reference:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 100002459648
>> 100002459648 transferred in 96 seconds, throughput: 993 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99997375488
>> 99997375488 transferred in 96 seconds, throughput: 993 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99999028224
>> 99999028224 transferred in 107 seconds, throughput: 891 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100000137216
>> 100000137216 transferred in 91 seconds, throughput: 1047 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100003835904
>> 100003835904 transferred in 80 seconds, throughput: 1192 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100000143360
>> 100000143360 transferred in 96 seconds, throughput: 993 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100020593664
>> 100020593664 transferred in 101 seconds, throughput: 944 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100005805056
>> 100005805056 transferred in 87 seconds, throughput: 1096 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100008360960
>> 100008360960 transferred in 74 seconds, throughput: 1288 MB/s
>> Max: 1288 MB/s
>> Min: 891 MB/s
>> Avg: 1048 MB/s
>
> Hard to say if there's actual regression from 4.3 to 4.4, it's too
> noisy. More iterations could help, but then the eventual bisection would
> need them too.

One thing that puzzles me most is that once compaction is involved, the
results become nondeterministic, i.e. the result could be as high
as 1xxx MB/s or as low as 6xx MB/s. The always-never case is much
better in this regard.

Thanks,
Aaron

>
>> The always-never case:
>>
>> Result for head:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 100003940352
>> 100003940352 transferred in 71 seconds, throughput: 1343 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100007411712
>> 100007411712 transferred in 62 seconds, throughput: 1538 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100001875968
>> 100001875968 transferred in 64 seconds, throughput: 1490 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100003912704
>> 100003912704 transferred in 62 seconds, throughput: 1538 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100002238464
>> 100002238464 transferred in 66 seconds, throughput: 1444 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100003670016
>> 100003670016 transferred in 65 seconds, throughput: 1467 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 99998364672
>> 99998364672 transferred in 68 seconds, throughput: 1402 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100005417984
>> 100005417984 transferred in 70 seconds, throughput: 1362 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100005304320
>> 100005304320 transferred in 64 seconds, throughput: 1490 MB/s
>> Max: 1538 MB/s
>> Min: 1343 MB/s
>> Avg: 1452 MB/s
>>
>

2015-12-07 07:34:17

by Joonsoo Kim

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Fri, Dec 04, 2015 at 01:34:09PM +0100, Vlastimil Babka wrote:
> On 12/03/2015 12:52 PM, Aaron Lu wrote:
> >On Thu, Dec 03, 2015 at 07:35:08PM +0800, Aaron Lu wrote:
> >>On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:
> >>>On 12/03/2015 10:25 AM, Aaron Lu wrote:
> >>>>On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> >>
> >>My bad, I uploaded the wrong data :-/
> >>I uploaded again:
> >>https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E
> >>
> >>And I just run the base tree with trace-cmd and found that its
> >>performace drops significantly(from 1000MB/s to 6xxMB/s), is it that
> >>trace-cmd will impact performace a lot?
>
> Yeah it has some overhead depending on how many events it has to
> process. Your workload is quite sensitive to that.
>
> >>Any suggestions on how to run
> >>the test regarding trace-cmd? i.e. should I aways run usemem under
> >>trace-cmd or only when necessary?
>
> I'd run it with tracing only when the goal is to collect traces, but
> not for any performance comparisons. Also it's not useful to collect
> perf data while also tracing.
>
> >I just run the test with the base tree and with this patch series
> >applied(head), I didn't use trace-cmd this time.
> >
> >The throughput for base tree is 963MB/s while the head is 815MB/s, I
> >have attached pagetypeinfo/proc-vmstat/perf-profile for them.
>
> The compact stats improvements look fine, perhaps better than in my tests:
>
> base: compact_migrate_scanned 3476360
> head: compact_migrate_scanned 1020827
>
> - that's the eager skipping of patch 2
>
> base: compact_free_scanned 5924928
> head: compact_free_scanned 0
> compact_free_direct 918813
> compact_free_direct_miss 500308
>
> As your workload does exclusively async direct compaction through
> THP faults, the traditional free scanner isn't used at all. Direct
> allocations should be much cheaper, although the "miss" ratio (the
> allocations that were from the same pageblock as the one we are
> compacting) is quite high. I should probably look into making
> migration release pages to the tails of the freelists - could be
> that it's grabbing the very pages that were just freed in the
> previous COMPACT_CLUSTER_MAX cycle (modulo pcplist buffering).
>
> I however find it strange that your original stats (4.3?) differ
> from the base so much:
>
> compact_migrate_scanned 1982396
> compact_free_scanned 40576943
>
> That was order of magnitude more free scanned on 4.3, and half the
> migrate scanned. But your throughput figures in the other mail
> suggested a regression from 4.3 to 4.4, which would be the opposite
> of what the stats say. And anyway, compaction code didn't change
> between 4.3 and 4.4 except changes to tracepoint format...
>
> moving on...
> base:
> compact_isolated 731304
> compact_stall 10561
> compact_fail 9459
> compact_success 1102
>
> head:
> compact_isolated 921087
> compact_stall 14451
> compact_fail 12550
> compact_success 1901
>
> More success in both isolation and compaction results.
>
> base:
> thp_fault_alloc 45337
> thp_fault_fallback 2349
>
> head:
> thp_fault_alloc 45564
> thp_fault_fallback 2120
>
> Somehow the extra compact success didn't fully translate to thp
> alloc success... But given how many of the alloc's didn't even
> involve a compact_stall (two thirds of them), that interpretation
> could also be easily misleading. So, hard to say.
>
> Looking at the perf profiles...
> base:
> 54.55% 54.55% :1550 [kernel.kallsyms] [k]
> pageblock_pfn_to_page
>
> head:
> 40.13% 40.13% :1551 [kernel.kallsyms] [k]
> pageblock_pfn_to_page
>
> Since the freepage allocation doesn't hit this code anymore, it
> shows that the bulk was actually from the migration scanner,
> although the perf callgraph and vmstats suggested otherwise.

It looks like overhead still remains. I guess that the migration scanner
would call pageblock_pfn_to_page() over a more extended range, so the
overhead still remains.

I have an idea to solve this problem. Aaron, could you test the following patch
on top of base? It tries to skip calling pageblock_pfn_to_page()
if we check that the zone is contiguous at the initialization stage.

Thanks.

---->8----
>From 9c4fbf8f8ed37eb88a04a97908e76ba2437404a2 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
contiguous zone

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/compaction.c | 35 ++++++++++++++++++++++++++++++++++-
2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ int contiguous; /* 0: not yet checked, 1: contiguous, -1: has hole */
/* Set to true when the PG_migrate_skip bits should be cleared */
bool compact_blockskip_flush;
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 67b8d90..f4e8c89 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
* the first and last page of a pageblock and avoid checking each individual
* page in a pageblock.
*/
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
unsigned long end_pfn, struct zone *zone)
{
struct page *start_page;
@@ -114,6 +114,37 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
return start_page;
}

+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+ unsigned long end_pfn, struct zone *zone)
+{
+ if (zone->contiguous == 1)
+ return pfn_to_page(start_pfn);
+
+ return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+ unsigned long pfn = zone->zone_start_pfn;
+ unsigned long end_pfn = zone_end_pfn(zone);
+
+ /* Already checked */
+ if (zone->contiguous)
+ return;
+
+ pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+ for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ if (!__pageblock_pfn_to_page(pfn, end_pfn, zone)) {
+ /* We have hole */
+ zone->contiguous = -1;
+ return;
+ }
+ }
+
+ /* We don't have hole */
+ zone->contiguous = 1;
+}
+
#ifdef CONFIG_COMPACTION

/* Do not skip compaction more than 64 times */
@@ -1353,6 +1384,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
;
}

+ check_zone_contiguous(zone);
+
/*
* Clear pageblock skip if there were failures recently and compaction
* is about to be retried after being deferred. kswapd does not do
--
1.9.1

2015-12-07 09:00:24

by Aaron Lu

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> It looks like overhead still remain. I guess that migration scanner
> would call pageblock_pfn_to_page() for more extended range so
> overhead still remain.
>
> I have an idea to solve his problem. Aaron, could you test following patch
> on top of base? It tries to skip calling pageblock_pfn_to_page()

It doesn't apply cleanly on top of 25364a9e54fb8296837061bf684b76d20eec01fb,
so I made some changes to make it apply; the result is:
https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63

A problem occurred right after the test started:
[ 58.080962] BUG: unable to handle kernel paging request at ffffea0082000018
[ 58.089124] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
[ 58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
[ 58.101569] Oops: 0000 [#1] SMP

The full dmesg is attached.

Regards,
Aaron

> if we check that zone is contiguous at initialization stage.
>
> Thanks.
>
> ---->8----
> From 9c4fbf8f8ed37eb88a04a97908e76ba2437404a2 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <[email protected]>
> Date: Mon, 7 Dec 2015 14:51:42 +0900
> Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
> contiguous zone
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/compaction.c | 35 ++++++++++++++++++++++++++++++++++-
> 2 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e23a9e7..573f9a9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -521,6 +521,7 @@ struct zone {
> #endif
>
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> + int contiguous;
> /* Set to true when the PG_migrate_skip bits should be cleared */
> bool compact_blockskip_flush;
> #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 67b8d90..f4e8c89 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
> * the first and last page of a pageblock and avoid checking each individual
> * page in a pageblock.
> */
> -static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> +static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> unsigned long end_pfn, struct zone *zone)
> {
> struct page *start_page;
> @@ -114,6 +114,37 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> return start_page;
> }
>
> +static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> + unsigned long end_pfn, struct zone *zone)
> +{
> + if (zone->contiguous == 1)
> + return pfn_to_page(start_pfn);
> +
> + return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
> +}
> +
> +static void check_zone_contiguous(struct zone *zone)
> +{
> + unsigned long pfn = zone->zone_start_pfn;
> + unsigned long end_pfn = zone_end_pfn(zone);
> +
> + /* Already checked */
> + if (zone->contiguous)
> + return;
> +
> + pfn = ALIGN(pfn + 1, pageblock_nr_pages);
> + for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
> + if (!__pageblock_pfn_to_page(pfn, end_pfn, zone)) {
> + /* We have hole */
> + zone->contiguous = -1;
> + return;
> + }
> + }
> +
> + /* We don't have hole */
> + zone->contiguous = 1;
> +}
> +
> #ifdef CONFIG_COMPACTION
>
> /* Do not skip compaction more than 64 times */
> @@ -1353,6 +1384,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> ;
> }
>
> + check_zone_contiguous(zone);
> +
> /*
> * Clear pageblock skip if there were failures recently and compaction
> * is about to be retried after being deferred. kswapd does not do
> --
> 1.9.1
>
>


Attachments:
(No filename) (4.00 kB)
dmesg (162.63 kB)

2015-12-08 00:40:10

by Joonsoo Kim

Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > It looks like overhead still remain. I guess that migration scanner
> > would call pageblock_pfn_to_page() for more extended range so
> > overhead still remain.
> >
> > I have an idea to solve his problem. Aaron, could you test following patch
> > on top of base? It tries to skip calling pageblock_pfn_to_page()
>
> It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> cleanly, so I made some changes to make it apply and the result is:
> https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63

Yes, that's okay. I made it on my working branch, so the differences only
affect applying the patch, not its behavior.

>
> There is a problem occured right after the test starts:
> [ 58.080962] BUG: unable to handle kernel paging request at ffffea0082000018
> [ 58.089124] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> [ 58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> [ 58.101569] Oops: 0000 [#1] SMP

I made a mistake. Please test the following patch. It is also made
on my working branch, so you will need to resolve a conflict, but it should be
trivial.

I inserted some log messages to report whether the zone is contiguous.
Please check that the Normal zone is reported as contiguous after testing.

Thanks.

------>8------
From 4a1a08d8ab3fb165b87ad2ec0a2000ff6892330f Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
contiguous zone

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/compaction.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ int contiguous;
/* Set to true when the PG_migrate_skip bits should be cleared */
bool compact_blockskip_flush;
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 67b8d90..cb5c7a2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
* the first and last page of a pageblock and avoid checking each individual
* page in a pageblock.
*/
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
unsigned long end_pfn, struct zone *zone)
{
struct page *start_page;
@@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
return start_page;
}

+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+ unsigned long end_pfn, struct zone *zone)
+{
+ if (zone->contiguous == 1)
+ return pfn_to_page(start_pfn);
+
+ return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+ unsigned long block_start_pfn = zone->zone_start_pfn;
+ unsigned long block_end_pfn;
+ unsigned long pfn;
+
+ /* Already checked */
+ if (zone->contiguous)
+ return;
+
+ printk("%s: %s\n", __func__, zone->name);
+ block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+ for (; block_start_pfn < zone_end_pfn(zone);
+ block_start_pfn = block_end_pfn,
+ block_end_pfn += pageblock_nr_pages) {
+
+ block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+ if (!__pageblock_pfn_to_page(block_start_pfn,
+ block_end_pfn, zone)) {
+ /* We have hole */
+ zone->contiguous = -1;
+ printk("%s: %s: uncontiguous\n", __func__, zone->name);
+ return;
+ }
+
+ /* Check validity of pfn within pageblock */
+ for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
+ if (!pfn_valid_within(pfn)) {
+ zone->contiguous = -1;
+ printk("%s: %s: uncontiguous\n", __func__, zone->name);
+ return;
+ }
+ }
+ }
+
+ /* We don't have hole */
+ zone->contiguous = 1;
+ printk("%s: %s: contiguous\n", __func__, zone->name);
+}
+
#ifdef CONFIG_COMPACTION

/* Do not skip compaction more than 64 times */
@@ -1353,6 +1403,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
;
}

+ check_zone_contiguous(zone);
+
/*
* Clear pageblock skip if there were failures recently and compaction
* is about to be retried after being deferred. kswapd does not do
--
1.9.1

2015-12-08 05:14:45

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > It looks like overhead still remain. I guess that migration scanner
> > > would call pageblock_pfn_to_page() for more extended range so
> > > overhead still remain.
> > >
> > > I have an idea to solve his problem. Aaron, could you test following patch
> > > on top of base? It tries to skip calling pageblock_pfn_to_page()
> >
> > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > cleanly, so I made some changes to make it apply and the result is:
> > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
>
> Yes, that's okay. I made it on my working branch but it will not result in
> any problem except applying.
>
> >
> > There is a problem occured right after the test starts:
> > [ 58.080962] BUG: unable to handle kernel paging request at ffffea0082000018
> > [ 58.089124] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> > [ 58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > [ 58.101569] Oops: 0000 [#1] SMP
>
> I did some mistake. Please test following patch. It is also made
> on my working branch so you need to resolve conflict but it would be
> trivial.
>
> I inserted some logs to check whether zone is contiguous or not.
> Please check that normal zone is set to contiguous after testing.

Yes, it is contiguous, but unfortunately the problem remains:
[ 56.536930] check_zone_contiguous: Normal
[ 56.543467] check_zone_contiguous: Normal: contiguous
[ 56.549640] BUG: unable to handle kernel paging request at ffffea0082000018
[ 56.557717] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
[ 56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0

Full dmesg attached.

Thanks,
Aaron

>
> Thanks.
>
> ------>8------
> From 4a1a08d8ab3fb165b87ad2ec0a2000ff6892330f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <[email protected]>
> Date: Mon, 7 Dec 2015 14:51:42 +0900
> Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
> contiguous zone
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/compaction.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 54 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e23a9e7..573f9a9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -521,6 +521,7 @@ struct zone {
> #endif
>
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> + int contiguous;
> /* Set to true when the PG_migrate_skip bits should be cleared */
> bool compact_blockskip_flush;
> #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 67b8d90..cb5c7a2 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
> * the first and last page of a pageblock and avoid checking each individual
> * page in a pageblock.
> */
> -static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> +static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> unsigned long end_pfn, struct zone *zone)
> {
> struct page *start_page;
> @@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> return start_page;
> }
>
> +static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> + unsigned long end_pfn, struct zone *zone)
> +{
> + if (zone->contiguous == 1)
> + return pfn_to_page(start_pfn);
> +
> + return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
> +}
> +
> +static void check_zone_contiguous(struct zone *zone)
> +{
> + unsigned long block_start_pfn = zone->zone_start_pfn;
> + unsigned long block_end_pfn;
> + unsigned long pfn;
> +
> + /* Already checked */
> + if (zone->contiguous)
> + return;
> +
> + printk("%s: %s\n", __func__, zone->name);
> + block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
> + for (; block_start_pfn < zone_end_pfn(zone);
> + block_start_pfn = block_end_pfn,
> + block_end_pfn += pageblock_nr_pages) {
> +
> + block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
> +
> + if (!__pageblock_pfn_to_page(block_start_pfn,
> + block_end_pfn, zone)) {
> + /* We have hole */
> + zone->contiguous = -1;
> + printk("%s: %s: uncontiguous\n", __func__, zone->name);
> + return;
> + }
> +
> + /* Check validity of pfn within pageblock */
> + for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
> + if (!pfn_valid_within(pfn)) {
> + zone->contiguous = -1;
> + printk("%s: %s: uncontiguous\n", __func__, zone->name);
> + return;
> + }
> + }
> + }
> +
> + /* We don't have hole */
> + zone->contiguous = 1;
> + printk("%s: %s: contiguous\n", __func__, zone->name);
> +}
> +
> #ifdef CONFIG_COMPACTION
>
> /* Do not skip compaction more than 64 times */
> @@ -1353,6 +1403,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> ;
> }
>
> + check_zone_contiguous(zone);
> +
> /*
> * Clear pageblock skip if there were failures recently and compaction
> * is about to be retried after being deferred. kswapd does not do
> --
> 1.9.1
>


Attachments:
(No filename) (5.79 kB)
dmesg.xz (21.57 kB)

2015-12-08 06:50:07

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
> On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> > On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > > It looks like overhead still remain. I guess that migration scanner
> > > > would call pageblock_pfn_to_page() for more extended range so
> > > > overhead still remain.
> > > >
> > > > I have an idea to solve his problem. Aaron, could you test following patch
> > > > on top of base? It tries to skip calling pageblock_pfn_to_page()
> > >
> > > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > > cleanly, so I made some changes to make it apply and the result is:
> > > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> >
> > Yes, that's okay. I made it on my working branch but it will not result in
> > any problem except applying.
> >
> > >
> > > There is a problem occured right after the test starts:
> > > [ 58.080962] BUG: unable to handle kernel paging request at ffffea0082000018
> > > [ 58.089124] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> > > [ 58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > [ 58.101569] Oops: 0000 [#1] SMP
> >
> > I did some mistake. Please test following patch. It is also made
> > on my working branch so you need to resolve conflict but it would be
> > trivial.
> >
> > I inserted some logs to check whether zone is contiguous or not.
> > Please check that normal zone is set to contiguous after testing.
>
> Yes it is contiguous, but unfortunately, the problem remains:
> [ 56.536930] check_zone_contiguous: Normal
> [ 56.543467] check_zone_contiguous: Normal: contiguous
> [ 56.549640] BUG: unable to handle kernel paging request at ffffea0082000018
> [ 56.557717] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> [ 56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
>

Maybe I found the reason: cc->free_pfn can be initialized to an invalid pfn
that is never checked, so the optimized pageblock_pfn_to_page() causes the BUG().

I added a work-around for this problem in isolate_freepages(). Please test
the following one.

Thanks.
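To make the suspected failure mode concrete: once the zone is marked contiguous, pageblock_pfn_to_page() returns pfn_to_page(start_pfn) with no range check, so a caller handing it a pfn at or beyond zone_end_pfn() dereferences a struct page past the valid memmap, which matches the bad paging request in compaction_alloc() above. A small user-space analogue follows (hypothetical names and sizes; the kernel path is simplified):

#include <assert.h>
#include <stdio.h>

#define NR_PAGES	(1UL << 20)	/* hypothetical zone size in pages */

struct demo_page { int flags; };

static struct demo_page demo_memmap[NR_PAGES];	/* backs pfns 0 .. NR_PAGES-1 */

/* Fast path used when the zone is known contiguous: no bounds check by design. */
static struct demo_page *pfn_to_page_fast(unsigned long pfn)
{
	return &demo_memmap[pfn];
}

int main(void)
{
	unsigned long zone_end_pfn = NR_PAGES;
	/* A free scanner position initialized to the zone end, as described above. */
	unsigned long free_pfn = zone_end_pfn;

	struct demo_page *p = pfn_to_page_fast(free_pfn);

	/* One-past-the-end pointer: dereferencing p->flags here would be the bug. */
	assert(p == demo_memmap + NR_PAGES);
	printf("page pointer is %ld entries past the start of the memmap\n",
	       (long)(p - demo_memmap));
	return 0;
}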

---------->8---------------
From 7e954a68fb555a868acc5860627a1ad8dadbe3bf Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
contiguous zone

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/compaction.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ int contiguous;
/* Set to true when the PG_migrate_skip bits should be cleared */
bool compact_blockskip_flush;
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index de3e1e7..ff5fb04 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
* the first and last page of a pageblock and avoid checking each individual
* page in a pageblock.
*/
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
unsigned long end_pfn, struct zone *zone)
{
struct page *start_page;
@@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
return start_page;
}

+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+ unsigned long end_pfn, struct zone *zone)
+{
+ if (zone->contiguous == 1)
+ return pfn_to_page(start_pfn);
+
+ return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+ unsigned long block_start_pfn = zone->zone_start_pfn;
+ unsigned long block_end_pfn;
+ unsigned long pfn;
+
+ /* Already checked */
+ if (zone->contiguous)
+ return;
+
+ printk("%s: %s\n", __func__, zone->name);
+ block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+ for (; block_start_pfn < zone_end_pfn(zone);
+ block_start_pfn = block_end_pfn,
+ block_end_pfn += pageblock_nr_pages) {
+
+ block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+ if (!__pageblock_pfn_to_page(block_start_pfn,
+ block_end_pfn, zone)) {
+ /* We have hole */
+ zone->contiguous = -1;
+ printk("%s: %s: uncontiguous\n", __func__, zone->name);
+ return;
+ }
+
+ /* Check validity of pfn within pageblock */
+ for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
+ if (!pfn_valid_within(pfn)) {
+ zone->contiguous = -1;
+ printk("%s: %s: uncontiguous\n", __func__, zone->name);
+ return;
+ }
+ }
+ }
+
+ /* We don't have hole */
+ zone->contiguous = 1;
+ printk("%s: %s: contiguous\n", __func__, zone->name);
+}
+
#ifdef CONFIG_COMPACTION

/* Do not skip compaction more than 64 times */
@@ -948,6 +998,12 @@ static void isolate_freepages(struct compact_control *cc)
unsigned long low_pfn; /* lowest pfn scanner is able to scan */
struct list_head *freelist = &cc->freepages;

+ /* Work-around */
+ if (zone->contiguous == 1 &&
+ cc->free_pfn == zone_end_pfn(zone) &&
+ cc->free_pfn == cc->free_pfn & ~(pageblock_nr_pages-1))
+ cc->free_pfn -= pageblock_nr_pages;
+
/*
* Initialise the free scanner. The starting point is where we last
* successfully isolated from, zone-cached value, or the end of the
@@ -1356,6 +1412,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
;
}

+ check_zone_contiguous(zone);
+
/*
* Clear pageblock skip if there were failures recently and compaction
* is about to be retried after being deferred. kswapd does not do
--
1.9.1

2015-12-08 08:52:47

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
> > On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> > > On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > > > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > > > It looks like overhead still remain. I guess that migration scanner
> > > > > would call pageblock_pfn_to_page() for more extended range so
> > > > > overhead still remain.
> > > > >
> > > > > I have an idea to solve his problem. Aaron, could you test following patch
> > > > > on top of base? It tries to skip calling pageblock_pfn_to_page()
> > > >
> > > > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > > > cleanly, so I made some changes to make it apply and the result is:
> > > > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> > >
> > > Yes, that's okay. I made it on my working branch but it will not result in
> > > any problem except applying.
> > >
> > > >
> > > > There is a problem occured right after the test starts:
> > > > [ 58.080962] BUG: unable to handle kernel paging request at ffffea0082000018
> > > > [ 58.089124] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> > > > [ 58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > > [ 58.101569] Oops: 0000 [#1] SMP
> > >
> > > I did some mistake. Please test following patch. It is also made
> > > on my working branch so you need to resolve conflict but it would be
> > > trivial.
> > >
> > > I inserted some logs to check whether zone is contiguous or not.
> > > Please check that normal zone is set to contiguous after testing.
> >
> > Yes it is contiguous, but unfortunately, the problem remains:
> > [ 56.536930] check_zone_contiguous: Normal
> > [ 56.543467] check_zone_contiguous: Normal: contiguous
> > [ 56.549640] BUG: unable to handle kernel paging request at ffffea0082000018
> > [ 56.557717] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> > [ 56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> >
>
> Maybe, I find the reason. cc->free_pfn can be initialized to invalid pfn
> that isn't checked so optimized pageblock_pfn_to_page() causes BUG().
>
> I add work-around for this problem at isolate_freepages(). Please test
> following one.

Still no luck and the error is about the same:

[ 64.727792] check_zone_contiguous: Normal
[ 64.733950] check_zone_contiguous: Normal: contiguous
[ 64.741610] BUG: unable to handle kernel paging request at ffffea0082000018
[ 64.749708] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
[ 64.756806] PGD 107ffd6067 PUD 207f7d5067 PMD 0
[ 64.762302] Oops: 0000 [#1] SMP
[ 64.766294] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver netconsole sg sd_mod x86_pkg_temp_thermal coretemp kvm_intel kvm mgag200 irqbypass crct10dif_pclmul ttm crc32_pclmul crc32c_intel drm_kms_helper ahci syscopyarea sysfillrect sysimgblt snd_pcm libahci fb_sys_fops snd_timer snd sb_edac aesni_intel soundcore lrw drm gf128mul pcspkr edac_core ipmi_devintf glue_helper ablk_helper cryptd libata ipmi_si shpchp wmi ipmi_msghandler acpi_power_meter acpi_pad
[ 64.816579] CPU: 19 PID: 1526 Comm: usemem Not tainted 4.4.0-rc3-00025-gf60ea5f #1
[ 64.825419] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[ 64.837483] task: ffff88168a0aca80 ti: ffff88168a564000 task.ti:ffff88168a564000
[ 64.846264] RIP: 0010:[<ffffffff81193f29>] [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
[ 64.856147] RSP: 0000:ffff88168a567940 EFLAGS: 00010286
[ 64.862520] RAX: ffff88207ffdcd80 RBX: ffff88168a567ac0 RCX: ffff88207ffdcd80
[ 64.870944] RDX: 0000000002080000 RSI: ffff88168a567ac0 RDI: ffff88168a567ac0
[ 64.879377] RBP: ffff88168a567990 R08: ffffea0082000000 R09: 0000000000000000
[ 64.887813] R10: 0000000000000000 R11: 000000000001ae88 R12: ffffea0082000000
[ 64.896254] R13: ffffea0059f20780 R14: 0000000002080000 R15: 0000000002080000
[ 64.904704] FS: 00007f2d4e6e8700(0000) GS:ffff882034440000(0000) knlGS:0000000000000000
[ 64.914232] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 64.921151] CR2: ffffea0082000018 CR3: 0000002015771000 CR4: 00000000001406e0
[ 64.929635] Stack:
[ 64.932413] ffff88168a568000 000000000167ca00 ffffffff81193196 ffff88207ffdcd80
[ 64.941292] 0000000002080000 ffffea0059f207c0 ffff88168a567ac0 ffffea0059f20780
[ 64.950179] ffffea0059f207e0 ffff88207ffdcd80 ffff88168a567a20 ffffffff811d097e
[ 64.959071] Call Trace:
[ 64.962364] [<ffffffff81193196>] ? update_pageblock_skip+0x56/0xa0
[ 64.969939] [<ffffffff811d097e>] migrate_pages+0x28e/0x7b0
[ 64.976728] [<ffffffff811931e0>] ? update_pageblock_skip+0xa0/0xa0
[ 64.984312] [<ffffffff81193e30>] ? __pageblock_pfn_to_page+0xe0/0xe0
[ 64.992093] [<ffffffff811954da>] compact_zone+0x38a/0x8e0
[ 64.998811] [<ffffffff81195a9d>] compact_zone_order+0x6d/0x90
[ 65.005926] [<ffffffff81174f44>] ? get_page_from_freelist+0xd4/0xa20
[ 65.013861] [<ffffffff81195d2c>] try_to_compact_pages+0xec/0x210
[ 65.021212] [<ffffffffa00d0c72>] ? sdebug_queuecommand_lock_or_not+0x22/0x60 [scsi_debug]
[ 65.030984] [<ffffffff811758cd>] __alloc_pages_direct_compact+0x3d/0x110
[ 65.039106] [<ffffffff81175f06>] __alloc_pages_nodemask+0x566/0xb40
[ 65.046739] [<ffffffff811c02c1>] alloc_pages_vma+0x1d1/0x230
[ 65.053690] [<ffffffff811d5d77>] do_huge_pmd_anonymous_page+0x107/0x3f0
[ 65.061713] [<ffffffff8119ed2a>] handle_mm_fault+0x178a/0x1940
[ 65.068859] [<ffffffff811a6614>] ? change_protection+0x14/0x20
[ 65.075999] [<ffffffff8109d8a2>] ? __might_sleep+0x52/0xb0
[ 65.082750] [<ffffffff81063c4d>] __do_page_fault+0x1ad/0x410
[ 65.089690] [<ffffffff81063edf>] do_page_fault+0x2f/0x80
[ 65.096242] [<ffffffff818c8008>] page_fault+0x28/0x30
[ 65.102491] Code: 90 00 00 00 48 8b 45 c8 4d 89 e0 83 b8 50 05 00 00 01 74 12 48 8b 55 c8 4c 89 f6 4c 89 ff e8 2f fe ff ff 49 89 c0 4d 85 c0 74 47 <41> 8b 40 18 83 f8 80 75 0a 49 8b 40 30 48 83 f8 08 77 34 48 b8
[ 65.125421] RIP [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
[ 65.132777] RSP <ffff88168a567940>
[ 65.141419] ---[ end trace c17c6b894e4340a8 ]---
[ 65.149001] Kernel panic - not syncing: Fatal exception


Thanks,
Aaron

>
> Thanks.
>
> ---------->8---------------
> From 7e954a68fb555a868acc5860627a1ad8dadbe3bf Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <[email protected]>
> Date: Mon, 7 Dec 2015 14:51:42 +0900
> Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
> contiguous zone
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/compaction.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 60 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e23a9e7..573f9a9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -521,6 +521,7 @@ struct zone {
> #endif
>
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> + int contiguous;
> /* Set to true when the PG_migrate_skip bits should be cleared */
> bool compact_blockskip_flush;
> #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index de3e1e7..ff5fb04 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
> * the first and last page of a pageblock and avoid checking each individual
> * page in a pageblock.
> */
> -static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> +static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> unsigned long end_pfn, struct zone *zone)
> {
> struct page *start_page;
> @@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> return start_page;
> }
>
> +static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> + unsigned long end_pfn, struct zone *zone)
> +{
> + if (zone->contiguous == 1)
> + return pfn_to_page(start_pfn);
> +
> + return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
> +}
> +
> +static void check_zone_contiguous(struct zone *zone)
> +{
> + unsigned long block_start_pfn = zone->zone_start_pfn;
> + unsigned long block_end_pfn;
> + unsigned long pfn;
> +
> + /* Already checked */
> + if (zone->contiguous)
> + return;
> +
> + printk("%s: %s\n", __func__, zone->name);
> + block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
> + for (; block_start_pfn < zone_end_pfn(zone);
> + block_start_pfn = block_end_pfn,
> + block_end_pfn += pageblock_nr_pages) {
> +
> + block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
> +
> + if (!__pageblock_pfn_to_page(block_start_pfn,
> + block_end_pfn, zone)) {
> + /* We have hole */
> + zone->contiguous = -1;
> + printk("%s: %s: uncontiguous\n", __func__, zone->name);
> + return;
> + }
> +
> + /* Check validity of pfn within pageblock */
> + for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
> + if (!pfn_valid_within(pfn)) {
> + zone->contiguous = -1;
> + printk("%s: %s: uncontiguous\n", __func__, zone->name);
> + return;
> + }
> + }
> + }
> +
> + /* We don't have hole */
> + zone->contiguous = 1;
> + printk("%s: %s: contiguous\n", __func__, zone->name);
> +}
> +
> #ifdef CONFIG_COMPACTION
>
> /* Do not skip compaction more than 64 times */
> @@ -948,6 +998,12 @@ static void isolate_freepages(struct compact_control *cc)
> unsigned long low_pfn; /* lowest pfn scanner is able to scan */
> struct list_head *freelist = &cc->freepages;
>
> + /* Work-around */
> + if (zone->contiguous == 1 &&
> + cc->free_pfn == zone_end_pfn(zone) &&
> + cc->free_pfn == cc->free_pfn & ~(pageblock_nr_pages-1))
> + cc->free_pfn -= pageblock_nr_pages;
> +
> /*
> * Initialise the free scanner. The starting point is where we last
> * successfully isolated from, zone-cached value, or the end of the
> @@ -1356,6 +1412,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> ;
> }
>
> + check_zone_contiguous(zone);
> +
> /*
> * Clear pageblock skip if there were failures recently and compaction
> * is about to be retried after being deferred. kswapd does not do
> --
> 1.9.1
>

2015-12-09 00:32:41

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
> On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> > On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
> > > On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> > > > On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > > > > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > > > > It looks like overhead still remain. I guess that migration scanner
> > > > > > would call pageblock_pfn_to_page() for more extended range so
> > > > > > overhead still remain.
> > > > > >
> > > > > > I have an idea to solve his problem. Aaron, could you test following patch
> > > > > > on top of base? It tries to skip calling pageblock_pfn_to_page()
> > > > >
> > > > > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > > > > cleanly, so I made some changes to make it apply and the result is:
> > > > > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> > > >
> > > > Yes, that's okay. I made it on my working branch but it will not result in
> > > > any problem except applying.
> > > >
> > > > >
> > > > > There is a problem occured right after the test starts:
> > > > > [ 58.080962] BUG: unable to handle kernel paging request at ffffea0082000018
> > > > > [ 58.089124] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> > > > > [ 58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > > > [ 58.101569] Oops: 0000 [#1] SMP
> > > >
> > > > I did some mistake. Please test following patch. It is also made
> > > > on my working branch so you need to resolve conflict but it would be
> > > > trivial.
> > > >
> > > > I inserted some logs to check whether zone is contiguous or not.
> > > > Please check that normal zone is set to contiguous after testing.
> > >
> > > Yes it is contiguous, but unfortunately, the problem remains:
> > > [ 56.536930] check_zone_contiguous: Normal
> > > [ 56.543467] check_zone_contiguous: Normal: contiguous
> > > [ 56.549640] BUG: unable to handle kernel paging request at ffffea0082000018
> > > [ 56.557717] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> > > [ 56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > >
> >
> > Maybe, I find the reason. cc->free_pfn can be initialized to invalid pfn
> > that isn't checked so optimized pageblock_pfn_to_page() causes BUG().
> >
> > I add work-around for this problem at isolate_freepages(). Please test
> > following one.
>
> Still no luck and the error is about the same:

There is a mistake... Could you add parentheses around
cc->free_pfn & ~(pageblock_nr_pages-1), like the following?

cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))

Thanks.
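The parentheses matter because == binds more tightly than & in C, so the work-around's test parsed as (cc->free_pfn == cc->free_pfn) & ~(pageblock_nr_pages-1), i.e. 1 & ~(pageblock_nr_pages-1), which is always 0 for any pageblock larger than one page; the condition therefore never fired and free_pfn was never pulled back inside the zone. A stand-alone snippet illustrating the difference (hypothetical pfn value, pageblock_nr_pages assumed to be 512):

#include <stdio.h>

int main(void)
{
	unsigned long pageblock_nr_pages = 512;
	unsigned long free_pfn = 512 * 1000;	/* some pageblock-aligned pfn */

	/* As written in the work-around: '==' binds tighter than '&', always 0. */
	unsigned long as_written = free_pfn == free_pfn & ~(pageblock_nr_pages - 1);

	/* As intended: is free_pfn pageblock-aligned? */
	unsigned long as_intended = free_pfn == (free_pfn & ~(pageblock_nr_pages - 1));

	printf("as written:  %lu\n", as_written);	/* 0, never true */
	printf("as intended: %lu\n", as_intended);	/* 1 for an aligned pfn */
	return 0;
}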

>
> [ 64.727792] check_zone_contiguous: Normal
> [ 64.733950] check_zone_contiguous: Normal: contiguous
> [ 64.741610] BUG: unable to handle kernel paging request at ffffea0082000018
> [ 64.749708] IP: [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> [ 64.756806] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> [ 64.762302] Oops: 0000 [#1] SMP
> [ 64.766294] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver netconsole sg sd_mod x86_pkg_temp_thermal coretemp kvm_intel kvm mgag200 irqbypass crct10dif_pclmul ttm crc32_pclmul crc32c_intel drm_kms_helper ahci syscopyarea sysfillrect sysimgblt snd_pcm libahci fb_sys_fops snd_timer snd sb_edac aesni_intel soundcore lrw drm gf128mul pcspkr edac_core ipmi_devintf glue_helper ablk_helper cryptd libata ipmi_si shpchp wmi ipmi_msghandler acpi_power_meter acpi_pad
> [ 64.816579] CPU: 19 PID: 1526 Comm: usemem Not tainted 4.4.0-rc3-00025-gf60ea5f #1
> [ 64.825419] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
> [ 64.837483] task: ffff88168a0aca80 ti: ffff88168a564000 task.ti:ffff88168a564000
> [ 64.846264] RIP: 0010:[<ffffffff81193f29>] [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> [ 64.856147] RSP: 0000:ffff88168a567940 EFLAGS: 00010286
> [ 64.862520] RAX: ffff88207ffdcd80 RBX: ffff88168a567ac0 RCX: ffff88207ffdcd80
> [ 64.870944] RDX: 0000000002080000 RSI: ffff88168a567ac0 RDI: ffff88168a567ac0
> [ 64.879377] RBP: ffff88168a567990 R08: ffffea0082000000 R09: 0000000000000000
> [ 64.887813] R10: 0000000000000000 R11: 000000000001ae88 R12: ffffea0082000000
> [ 64.896254] R13: ffffea0059f20780 R14: 0000000002080000 R15: 0000000002080000
> [ 64.904704] FS: 00007f2d4e6e8700(0000) GS:ffff882034440000(0000) knlGS:0000000000000000
> [ 64.914232] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 64.921151] CR2: ffffea0082000018 CR3: 0000002015771000 CR4: 00000000001406e0
> [ 64.929635] Stack:
> [ 64.932413] ffff88168a568000 000000000167ca00 ffffffff81193196 ffff88207ffdcd80
> [ 64.941292] 0000000002080000 ffffea0059f207c0 ffff88168a567ac0 ffffea0059f20780
> [ 64.950179] ffffea0059f207e0 ffff88207ffdcd80 ffff88168a567a20 ffffffff811d097e
> [ 64.959071] Call Trace:
> [ 64.962364] [<ffffffff81193196>] ? update_pageblock_skip+0x56/0xa0
> [ 64.969939] [<ffffffff811d097e>] migrate_pages+0x28e/0x7b0
> [ 64.976728] [<ffffffff811931e0>] ? update_pageblock_skip+0xa0/0xa0
> [ 64.984312] [<ffffffff81193e30>] ? __pageblock_pfn_to_page+0xe0/0xe0
> [ 64.992093] [<ffffffff811954da>] compact_zone+0x38a/0x8e0
> [ 64.998811] [<ffffffff81195a9d>] compact_zone_order+0x6d/0x90
> [ 65.005926] [<ffffffff81174f44>] ? get_page_from_freelist+0xd4/0xa20
> [ 65.013861] [<ffffffff81195d2c>] try_to_compact_pages+0xec/0x210
> [ 65.021212] [<ffffffffa00d0c72>] ? sdebug_queuecommand_lock_or_not+0x22/0x60 [scsi_debug]
> [ 65.030984] [<ffffffff811758cd>] __alloc_pages_direct_compact+0x3d/0x110
> [ 65.039106] [<ffffffff81175f06>] __alloc_pages_nodemask+0x566/0xb40
> [ 65.046739] [<ffffffff811c02c1>] alloc_pages_vma+0x1d1/0x230
> [ 65.053690] [<ffffffff811d5d77>] do_huge_pmd_anonymous_page+0x107/0x3f0
> [ 65.061713] [<ffffffff8119ed2a>] handle_mm_fault+0x178a/0x1940
> [ 65.068859] [<ffffffff811a6614>] ? change_protection+0x14/0x20
> [ 65.075999] [<ffffffff8109d8a2>] ? __might_sleep+0x52/0xb0
> [ 65.082750] [<ffffffff81063c4d>] __do_page_fault+0x1ad/0x410
> [ 65.089690] [<ffffffff81063edf>] do_page_fault+0x2f/0x80
> [ 65.096242] [<ffffffff818c8008>] page_fault+0x28/0x30
> [ 65.102491] Code: 90 00 00 00 48 8b 45 c8 4d 89 e0 83 b8 50 05 00 00 01 74 12 48 8b 55 c8 4c 89 f6 4c 89 ff e8 2f fe ff ff 49 89 c0 4d 85 c0 74 47 <41> 8b 40 18 83 f8 80 75 0a 49 8b 40 30 48 83 f8 08 77 34 48 b8
> [ 65.125421] RIP [<ffffffff81193f29>] compaction_alloc+0xf9/0x270
> [ 65.132777] RSP <ffff88168a567940>
> [ 65.141419] ---[ end trace c17c6b894e4340a8 ]---
> [ 65.149001] Kernel panic - not syncing: Fatal exception
>
>
> Thanks,
> Aaron
>
> >
> > Thanks.
> >
> > ---------->8---------------
> > From 7e954a68fb555a868acc5860627a1ad8dadbe3bf Mon Sep 17 00:00:00 2001
> > From: Joonsoo Kim <[email protected]>
> > Date: Mon, 7 Dec 2015 14:51:42 +0900
> > Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
> > contiguous zone
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > ---
> > include/linux/mmzone.h | 1 +
> > mm/compaction.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > 2 files changed, 60 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index e23a9e7..573f9a9 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -521,6 +521,7 @@ struct zone {
> > #endif
> >
> > #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> > + int contiguous;
> > /* Set to true when the PG_migrate_skip bits should be cleared */
> > bool compact_blockskip_flush;
> > #endif
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index de3e1e7..ff5fb04 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
> > * the first and last page of a pageblock and avoid checking each individual
> > * page in a pageblock.
> > */
> > -static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> > +static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> > unsigned long end_pfn, struct zone *zone)
> > {
> > struct page *start_page;
> > @@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> > return start_page;
> > }
> >
> > +static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> > + unsigned long end_pfn, struct zone *zone)
> > +{
> > + if (zone->contiguous == 1)
> > + return pfn_to_page(start_pfn);
> > +
> > + return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
> > +}
> > +
> > +static void check_zone_contiguous(struct zone *zone)
> > +{
> > + unsigned long block_start_pfn = zone->zone_start_pfn;
> > + unsigned long block_end_pfn;
> > + unsigned long pfn;
> > +
> > + /* Already checked */
> > + if (zone->contiguous)
> > + return;
> > +
> > + printk("%s: %s\n", __func__, zone->name);
> > + block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
> > + for (; block_start_pfn < zone_end_pfn(zone);
> > + block_start_pfn = block_end_pfn,
> > + block_end_pfn += pageblock_nr_pages) {
> > +
> > + block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
> > +
> > + if (!__pageblock_pfn_to_page(block_start_pfn,
> > + block_end_pfn, zone)) {
> > + /* We have hole */
> > + zone->contiguous = -1;
> > + printk("%s: %s: uncontiguous\n", __func__, zone->name);
> > + return;
> > + }
> > +
> > + /* Check validity of pfn within pageblock */
> > + for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
> > + if (!pfn_valid_within(pfn)) {
> > + zone->contiguous = -1;
> > + printk("%s: %s: uncontiguous\n", __func__, zone->name);
> > + return;
> > + }
> > + }
> > + }
> > +
> > + /* We don't have hole */
> > + zone->contiguous = 1;
> > + printk("%s: %s: contiguous\n", __func__, zone->name);
> > +}
> > +
> > #ifdef CONFIG_COMPACTION
> >
> > /* Do not skip compaction more than 64 times */
> > @@ -948,6 +998,12 @@ static void isolate_freepages(struct compact_control *cc)
> > unsigned long low_pfn; /* lowest pfn scanner is able to scan */
> > struct list_head *freelist = &cc->freepages;
> >
> > + /* Work-around */
> > + if (zone->contiguous == 1 &&
> > + cc->free_pfn == zone_end_pfn(zone) &&
> > + cc->free_pfn == cc->free_pfn & ~(pageblock_nr_pages-1))
> > + cc->free_pfn -= pageblock_nr_pages;
> > +
> > /*
> > * Initialise the free scanner. The starting point is where we last
> > * successfully isolated from, zone-cached value, or the end of the
> > @@ -1356,6 +1412,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > ;
> > }
> >
> > + check_zone_contiguous(zone);
> > +
> > /*
> > * Clear pageblock skip if there were failures recently and compaction
> > * is about to be retried after being deferred. kswapd does not do
> > --
> > 1.9.1
> >
>

2015-12-09 05:40:10

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
> On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
> > On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> > > I add work-around for this problem at isolate_freepages(). Please test
> > > following one.
> >
> > Still no luck and the error is about the same:
>
> There is a mistake... Could you insert () for
> cc->free_pfn & ~(pageblock_nr_pages-1) like as following?
>
> cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))

Oh right, of course.

Good news, the result is much better now:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 100064603136
100064603136 transferred in 72 seconds, throughput: 1325 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100072049664
100072049664 transferred in 74 seconds, throughput: 1289 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100070246400
100070246400 transferred in 92 seconds, throughput: 1037 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100069545984
100069545984 transferred in 81 seconds, throughput: 1178 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100058895360
100058895360 transferred in 78 seconds, throughput: 1223 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100066074624
100066074624 transferred in 94 seconds, throughput: 1015 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100062855168
100062855168 transferred in 77 seconds, throughput: 1239 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100060990464
100060990464 transferred in 73 seconds, throughput: 1307 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100064996352
100064996352 transferred in 84 seconds, throughput: 1136 MB/s
Max: 1325 MB/s
Min: 1015 MB/s
Avg: 1194 MB/s

The base result for reference:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 100000622592
100000622592 transferred in 103 seconds, throughput: 925 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99999559680
99999559680 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99996171264
99996171264 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100005663744
100005663744 transferred in 150 seconds, throughput: 635 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100002966528
100002966528 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99995784192
99995784192 transferred in 131 seconds, throughput: 727 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100003731456
100003731456 transferred in 97 seconds, throughput: 983 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100006440960
100006440960 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 99998813184
99998813184 transferred in 122 seconds, throughput: 781 MB/s
Max: 1096 MB/s
Min: 635 MB/s
Avg: 899 MB/s
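
For reference: 1194 / 899 ≈ 1.33, i.e. roughly a 33% improvement in average throughput, and the worst case improves from 635 MB/s to 1015 MB/s.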

2015-12-10 04:46:03

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On Wed, Dec 09, 2015 at 01:40:06PM +0800, Aaron Lu wrote:
> On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
> > On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
> > > On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> > > > I add work-around for this problem at isolate_freepages(). Please test
> > > > following one.
> > >
> > > Still no luck and the error is about the same:
> >
> > There is a mistake... Could you insert () for
> > cc->free_pfn & ~(pageblock_nr_pages-1) like as following?
> >
> > cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))
>
> Oh right, of course.
>
> Good news, the result is much better now:
> $ cat {0..8}/swap
> cmdline: /lkp/aaron/src/bin/usemem 100064603136
> 100064603136 transferred in 72 seconds, throughput: 1325 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100072049664
> 100072049664 transferred in 74 seconds, throughput: 1289 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100070246400
> 100070246400 transferred in 92 seconds, throughput: 1037 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100069545984
> 100069545984 transferred in 81 seconds, throughput: 1178 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100058895360
> 100058895360 transferred in 78 seconds, throughput: 1223 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100066074624
> 100066074624 transferred in 94 seconds, throughput: 1015 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100062855168
> 100062855168 transferred in 77 seconds, throughput: 1239 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100060990464
> 100060990464 transferred in 73 seconds, throughput: 1307 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100064996352
> 100064996352 transferred in 84 seconds, throughput: 1136 MB/s
> Max: 1325 MB/s
> Min: 1015 MB/s
> Avg: 1194 MB/s

Nice result! Thanks for testing.
I will prepare a properly formatted patch soon.

So, is your concern resolved? I think the performance of
always-always on this test case can't match that of always-never,
because the migration cost of making hugepages is additionally
charged to the always-always case. On the other hand, it will have more hugepage
mappings, which may result in better performance in some situations.
I guess that is the intention of that option.

Thanks.

2015-12-10 06:15:54

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC 0/3] reduce latency of direct async compaction

On 12/10/2015 12:35 PM, Joonsoo Kim wrote:
> On Wed, Dec 09, 2015 at 01:40:06PM +0800, Aaron Lu wrote:
>> On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
>>> On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
>>>> On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
>>>>> I add work-around for this problem at isolate_freepages(). Please test
>>>>> following one.
>>>>
>>>> Still no luck and the error is about the same:
>>>
>>> There is a mistake... Could you insert () for
>>> cc->free_pfn & ~(pageblock_nr_pages-1) like as following?
>>>
>>> cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))
>>
>> Oh right, of course.
>>
>> Good news, the result is much better now:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 100064603136
>> 100064603136 transferred in 72 seconds, throughput: 1325 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100072049664
>> 100072049664 transferred in 74 seconds, throughput: 1289 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100070246400
>> 100070246400 transferred in 92 seconds, throughput: 1037 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100069545984
>> 100069545984 transferred in 81 seconds, throughput: 1178 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100058895360
>> 100058895360 transferred in 78 seconds, throughput: 1223 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100066074624
>> 100066074624 transferred in 94 seconds, throughput: 1015 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100062855168
>> 100062855168 transferred in 77 seconds, throughput: 1239 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100060990464
>> 100060990464 transferred in 73 seconds, throughput: 1307 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100064996352
>> 100064996352 transferred in 84 seconds, throughput: 1136 MB/s
>> Max: 1325 MB/s
>> Min: 1015 MB/s
>> Avg: 1194 MB/s
>
> Nice result! Thanks for testing.
> I will make a proper formatted patch soon.

Thanks for the nice work.

>
> Then, your concern is solved? I think that performance of

I think so.

> always-always on this test case can't follow up performance of
> always-never because migration cost to make hugepage is additionally
> charged to always-always case. Instead, it will have more hugepage
> mapping and it may result in better performance in some situation.
> I guess that it is intention of that option.

OK, I see.

Regards,
Aaron