2012-08-09 13:50:47

by Mel Gorman

Subject: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

Changelog since V2
o Capture !MIGRATE_MOVABLE pages where possible
o Document the treatment of MIGRATE_MOVABLE pages while capturing
o Expand changelogs

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case. There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].

This series aims to improve the allocation success rates without regressing
the benefits of commit fe2c2a10. The series is based on 3.5 and includes
the commit 7db8889a to illustrate what impact it has to success rates.

Patch 1 updates a stale comment seeing as I was in the general area.

Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
of recent failures.

Patch 3 captures suitable high-order pages freed by compaction to reduce
races with parallel allocation requests.

Patch 4 is an upstream commit that has compaction restart free page scanning
from an old position instead of always starting from the end of the
zone.

Patch 5 adjusts patch 4 to restore allocation success rates.

STRESS-HIGHALLOC
                3.5.0-vanilla     patches:1-2       patches:1-3       patches:1-5
Pass 1         36.00 ( 0.00%)    56.00 (20.00%)    63.00 (27.00%)    58.00 (22.00%)
Pass 2         46.00 ( 0.00%)    64.00 (18.00%)    63.00 (17.00%)    58.00 (12.00%)
while Rested   84.00 ( 0.00%)    86.00 ( 2.00%)    85.00 ( 1.00%)    84.00 ( 0.00%)

From
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
I know that the allocation success rate in 3.3.6 was 78% in comparison
to 36% in 3.5. With the full series applied, the success rates are up to
around 60% with some variability in the results. This is not as high a
success rate, but the series does not reclaim excessively, which is a key point.

Previous tests on V1 of this series showed that patch 4 on its own adversely
affected high-order allocation success rates.

MMTests Statistics: vmstat
Page Ins 3037580 2979316 2988160 2957716
Page Outs 8026888 8027300 8031232 8041696
Swap Ins 0 0 0 0
Swap Outs 0 0 0 0

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
there were 71881 pages swapped out.

Direct pages scanned 97106 110003 80319 130947
Kswapd pages scanned 1231288 1372523 1498003 1392390
Kswapd pages reclaimed 1231221 1321591 1439185 1342106
Direct pages reclaimed 97100 102174 56267 125401
Kswapd efficiency 99% 96% 96% 96%
Kswapd velocity 1001.153 1060.896 1131.567 1103.189
Direct efficiency 99% 92% 70% 95%
Direct velocity 78.956 85.027 60.672 103.749

The direct reclaim and kswapd velocities change very little. kswapd velocity
is around the 1000 pages/sec mark whereas in kernel 3.3.6 with the high
allocation success rates it was 8140 pages/second.
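(For reference, velocity in these reports is pages scanned per second of
elapsed test time, assuming the usual mmtests definition; 1231288 kswapd
pages scanned over the roughly 1230 second vanilla run is where the
~1001 pages/sec figure comes from.)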

include/linux/compaction.h | 4 +-
include/linux/mm.h | 1 +
include/linux/mmzone.h | 4 ++
mm/compaction.c | 159 ++++++++++++++++++++++++++++++++++++++------
mm/internal.h | 7 ++
mm/page_alloc.c | 68 ++++++++++++++-----
mm/vmscan.c | 10 +++
7 files changed, 214 insertions(+), 39 deletions(-)

--
1.7.9.2


2012-08-09 13:49:30

by Mel Gorman

Subject: [PATCH 1/5] mm: compaction: Update comment in try_to_compact_pages

The comment about order applied when the check was
order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
[c5a73c3d: thp: use compaction for all allocation orders]. Fixing
the comment while I'm in the general area.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
mm/compaction.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b39ede1..95ca967 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -759,11 +759,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zone *zone;
int rc = COMPACT_SKIPPED;

- /*
- * Check whether it is worth even starting compaction. The order check is
- * made because an assumption is made that the page allocator can satisfy
- * the "cheaper" orders without taking special steps
- */
+ /* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
return rc;

--
1.7.9.2

2012-08-09 13:49:33

by Mel Gorman

Subject: [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available

While compaction is migrating pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.
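
From the allocator side, the intent is sketched below (names match the
page_alloc.c hunk later in this patch; this is a summary, not the change
itself):

    struct page *page = NULL;

    *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
                                nodemask, sync_migration, &page);
    /* If compaction captured a suitable page, prep it and return it
     * directly instead of racing other allocators for it */
    if (page)
        prep_new_page(page, order, gfp_mask);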

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
include/linux/compaction.h | 4 +-
include/linux/mm.h | 1 +
mm/compaction.c | 88 ++++++++++++++++++++++++++++++++++++++------
mm/internal.h | 1 +
mm/page_alloc.c | 63 +++++++++++++++++++++++--------
5 files changed, 128 insertions(+), 29 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 51a90b7..5673459 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- bool sync);
+ bool sync, struct page **page);
extern int compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);

@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
return COMPACT_CONTINUE;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..0812e86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -454,6 +454,7 @@ void put_pages_list(struct list_head *pages);

void split_page(struct page *page, unsigned int order);
int split_free_page(struct page *page);
+int capture_free_page(struct page *page, int alloc_order, int migratetype);

/*
* Compound pages have a destructor function. Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 95ca967..384164e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,59 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+static void compact_capture_page(struct compact_control *cc)
+{
+ unsigned long flags;
+ int mtype, mtype_low, mtype_high;
+
+ if (!cc->page || *cc->page)
+ return;
+
+ /*
+ * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP
+ * regardless of the migratetype of the freelist it is captured from.
+ * This is fine because the order for a high-order MIGRATE_MOVABLE
+ * allocation is typically at least a pageblock size and overall
+ * fragmentation is not impaired. Other allocation types must
+ * capture pages from their own migratelist because otherwise they
+ * could pollute other pageblocks like MIGRATE_MOVABLE with
+ * difficult-to-move pages, making fragmentation worse overall.
+ */
+ if (cc->migratetype == MIGRATE_MOVABLE) {
+ mtype_low = 0;
+ mtype_high = MIGRATE_PCPTYPES;
+ } else {
+ mtype_low = cc->migratetype;
+ mtype_high = cc->migratetype + 1;
+ }
+
+ /* Speculatively examine the free lists without zone lock */
+ for (mtype = mtype_low; mtype < mtype_high; mtype++) {
+ int order;
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ struct page *page;
+ struct free_area *area;
+ area = &(cc->zone->free_area[order]);
+ if (list_empty(&area->free_list[mtype]))
+ continue;
+
+ /* Take the lock and attempt capture of the page */
+ spin_lock_irqsave(&cc->zone->lock, flags);
+ if (!list_empty(&area->free_list[mtype])) {
+ page = list_entry(area->free_list[mtype].next,
+ struct page, lru);
+ if (capture_free_page(page, cc->order, mtype)) {
+ spin_unlock_irqrestore(&cc->zone->lock,
+ flags);
+ *cc->page = page;
+ return;
+ }
+ }
+ spin_unlock_irqrestore(&cc->zone->lock, flags);
+ }
+ }
+}
+
/*
* Isolate free pages onto a private freelist. Caller must hold zone->lock.
* If @strict is true, will abort returning 0 on any invalid PFNs or non-free
@@ -561,7 +614,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
- unsigned int order;
unsigned long watermark;

if (fatal_signal_pending(current))
@@ -586,14 +638,22 @@ static int compact_finished(struct zone *zone,
return COMPACT_CONTINUE;

/* Direct compactor: Is a suitable page free? */
- for (order = cc->order; order < MAX_ORDER; order++) {
- /* Job done if page is free of the right migratetype */
- if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
- return COMPACT_PARTIAL;
-
- /* Job done if allocation would set block type */
- if (order >= pageblock_order && zone->free_area[order].nr_free)
+ if (cc->page) {
+ /* Was a suitable page captured? */
+ if (*cc->page)
return COMPACT_PARTIAL;
+ } else {
+ unsigned int order;
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ struct free_area *area = &zone->free_area[order];
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&area->free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (order >= pageblock_order && area->nr_free)
+ return COMPACT_PARTIAL;
+ }
}

return COMPACT_CONTINUE;
@@ -708,6 +768,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
goto out;
}
}
+
+ /* Capture a page now if it is a suitable size */
+ compact_capture_page(cc);
}

out:
@@ -720,7 +783,7 @@ out:

static unsigned long compact_zone_order(struct zone *zone,
int order, gfp_t gfp_mask,
- bool sync)
+ bool sync, struct page **page)
{
struct compact_control cc = {
.nr_freepages = 0,
@@ -729,6 +792,7 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
+ .page = page,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -750,7 +814,7 @@ int sysctl_extfrag_threshold = 500;
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -770,7 +834,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

- status = compact_zone_order(zone, order, gfp_mask, sync);
+ status = compact_zone_order(zone, order, gfp_mask, sync, page);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */
@@ -825,6 +889,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
struct compact_control cc = {
.order = order,
.sync = false,
+ .page = NULL,
};

return __compact_pgdat(pgdat, &cc);
@@ -835,6 +900,7 @@ static int compact_node(int nid)
struct compact_control cc = {
.order = -1,
.sync = true,
+ .page = NULL,
};

return __compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..9156714 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -124,6 +124,7 @@ struct compact_control {
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
+ struct page **page; /* Page captured of requested size */
};

unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..adc3aa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1374,16 +1374,11 @@ void split_page(struct page *page, unsigned int order)
}

/*
- * Similar to split_page except the page is already free. As this is only
- * being used for migration, the migratetype of the block also changes.
- * As this is called with interrupts disabled, the caller is responsible
- * for calling arch_alloc_page() and kernel_map_page() after interrupts
- * are enabled.
- *
- * Note: this is probably too low level an operation for use in drivers.
- * Please consult with lkml before using this in your driver.
+ * Similar to the split_page family of functions except that the page
+ * is required at the given order and is being isolated now to prevent
+ * races with parallel allocators
*/
-int split_free_page(struct page *page)
+int capture_free_page(struct page *page, int alloc_order, int migratetype)
{
unsigned int order;
unsigned long watermark;
@@ -1405,10 +1400,11 @@ int split_free_page(struct page *page)
rmv_page_order(page);
__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));

- /* Split into individual pages */
- set_page_refcounted(page);
- split_page(page, order);
+ if (alloc_order != order)
+ expand(zone, page, alloc_order, order,
+ &zone->free_area[order], migratetype);

+ /* Set the pageblock if the captured page is at least a pageblock */
if (order >= pageblock_order - 1) {
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
@@ -1419,7 +1415,35 @@ int split_free_page(struct page *page)
}
}

- return 1 << order;
+ return 1UL << order;
+}
+
+/*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_page() after interrupts
+ * are enabled.
+ *
+ * Note: this is probably too low level an operation for use in drivers.
+ * Please consult with lkml before using this in your driver.
+ */
+int split_free_page(struct page *page)
+{
+ unsigned int order;
+ int nr_pages;
+
+ BUG_ON(!PageBuddy(page));
+ order = page_order(page);
+
+ nr_pages = capture_free_page(page, order, 0);
+ if (!nr_pages)
+ return 0;
+
+ /* Split into individual pages */
+ set_page_refcounted(page);
+ split_page(page, order);
+ return nr_pages;
}

/*
@@ -2065,7 +2089,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
bool *deferred_compaction,
unsigned long *did_some_progress)
{
- struct page *page;
+ struct page *page = NULL;

if (!order)
return NULL;
@@ -2077,10 +2101,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
- nodemask, sync_migration);
+ nodemask, sync_migration, &page);
current->flags &= ~PF_MEMALLOC;
- if (*did_some_progress != COMPACT_SKIPPED) {

+ /* If compaction captured a page, prep and use it */
+ if (page) {
+ prep_new_page(page, order, gfp_mask);
+ goto got_page;
+ }
+
+ if (*did_some_progress != COMPACT_SKIPPED) {
/* Page migration frees to the PCP lists but we want merging */
drain_pages(get_cpu());
put_cpu();
@@ -2090,6 +2120,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
alloc_flags, preferred_zone,
migratetype);
if (page) {
+got_page:
preferred_zone->compact_considered = 0;
preferred_zone->compact_defer_shift = 0;
if (order >= preferred_zone->compact_order_failed)
--
1.7.9.2

2012-08-09 13:49:46

by Mel Gorman

Subject: [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages

commit [7db8889a: mm: have order > 0 compaction start off where it left]
introduced a caching mechanism to reduce the amount work the free page
scanner does in compaction. However, it has a problem. Consider two processes
simultaneously scanning free pages

C
Process A M S F
|---------------------------------------|
Process B M FS

C is zone->compact_cached_free_pfn
S is cc->start_free_pfn
M is cc->migrate_pfn
F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

if (cc->order > 0 && (!cc->wrapped ||
zone->compact_cached_free_pfn >
cc->start_free_pfn))
pfn = min(pfn, zone->compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner skips
almost the entire space it should have scanned. When there are multiple
processes compacting it can end in a situation where the entire zone is
not being scanned at all. Further, it is possible for two processes to
ping-pong updates to compact_cached_free_pfn, which is just random.

Overall, the end result wrecks allocation success rates.

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic because it's not clear that it matters
if two free scanners happen to be in the same block, but with racing updates
it's too easy for it to skip over blocks it should not.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. When a wrapped scanner isolates a page, it updates
compact_cached_free_pfn to point to the highest pageblock it
can isolate pages from.

If a scanner has not wrapped when it has finished isolating pages, it
checks if compact_cached_free_pfn is pointing to the end of the
zone. If so, the value is updated to point to the highest
pageblock that pages were isolated from. This value will not
be updated again until a free page scanner wraps and resets
compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.
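
Expressed as a rough sketch (this mirrors the hunks below rather than
adding anything new), the free scanner's update rules become:

    /* After isolating pages in isolate_freepages() */
    if (isolated) {
        high_pfn = max(high_pfn, pfn);
        /* a wrapped scanner tracks the highest pageblock it
         * isolated pages from */
        if (cc->order > 0 && cc->wrapped)
            zone->compact_cached_free_pfn = high_pfn;
    }

    /* Once the scan finishes */
    if (cc->order > 0 && !cc->wrapped &&
        zone->compact_cached_free_pfn == start_free_pfn(zone)) {
        /* an unwrapped scanner only moves the cached pfn off the
         * end-of-zone reset value */
        zone->compact_cached_free_pfn = high_pfn;
    }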

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
mm/compaction.c | 54 ++++++++++++++++++++++++++++--------------------------
1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a806a9c..c2d0958 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -437,6 +437,20 @@ static bool suitable_migration_target(struct page *page)
}

/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+
+/*
* Based on information in the current compact_control, find blocks
* suitable for isolating free pages from and then isolate them.
*/
@@ -475,17 +489,6 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;

- /*
- * Skip ahead if another thread is compacting in the area
- * simultaneously. If we wrapped around, we can only skip
- * ahead if zone->compact_cached_free_pfn also wrapped to
- * above our starting point.
- */
- if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
- pfn = min(pfn, zone->compact_cached_free_pfn);
-
if (!pfn_valid(pfn))
continue;

@@ -528,7 +531,15 @@ static void isolate_freepages(struct zone *zone,
*/
if (isolated) {
high_pfn = max(high_pfn, pfn);
- if (cc->order > 0)
+
+ /*
+ * If the free scanner has wrapped, update
+ * compact_cached_free_pfn to point to the highest
+ * pageblock with free pages. This reduces excessive
+ * scanning of full pageblocks near the end of the
+ * zone
+ */
+ if (cc->order > 0 && cc->wrapped)
zone->compact_cached_free_pfn = high_pfn;
}
}
@@ -538,6 +549,11 @@ static void isolate_freepages(struct zone *zone,

cc->free_pfn = high_pfn;
cc->nr_freepages = nr_freepages;
+
+ /* If compact_cached_free_pfn is reset then set it now */
+ if (cc->order > 0 && !cc->wrapped &&
+ zone->compact_cached_free_pfn == start_free_pfn(zone))
+ zone->compact_cached_free_pfn = high_pfn;
}

/*
@@ -625,20 +641,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

-/*
- * Returns the start pfn of the last page block in a zone. This is the starting
- * point for full compaction of a zone. Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
- unsigned long free_pfn;
- free_pfn = zone->zone_start_pfn + zone->spanned_pages;
- free_pfn &= ~(pageblock_nr_pages-1);
- return free_pfn;
-}
-
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
--
1.7.9.2

2012-08-09 13:49:48

by Mel Gorman

Subject: [PATCH 4/5] mm: have order > 0 compaction start off where it left

From: Rik van Riel <[email protected]>

This commit is already upstream as [7db8889a: mm: have order > 0 compaction
start off where it left]. It's included in this series to provide context
to the next patch as the series is based on 3.5.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.

This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction. This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially; in that
case we restart from the end of the zone and try once more.

Forced full (order == -1) compactions are left alone.
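
The free scanner setup in compact_zone() then looks roughly like this
(a sketch of the hunk below, not an additional change):

    if (cc->order > 0) {
        /* Incremental compaction: resume the free scanner where
         * the last run stopped */
        cc->free_pfn = zone->compact_cached_free_pfn;
        cc->start_free_pfn = cc->free_pfn;
    } else {
        /* Full (order == -1) compaction starts at the end of the
         * zone as before */
        cc->free_pfn = start_free_pfn(zone);
    }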

[[email protected]: checkpatch fixes]
[[email protected]: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <[email protected]>
Reported-by: Jim Schutt <[email protected]>
Tested-by: Jim Schutt <[email protected]>
Cc: Minchan Kim <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 4 +++
mm/compaction.c | 63 ++++++++++++++++++++++++++++++++++++++++++++----
mm/internal.h | 6 +++++
mm/page_alloc.c | 5 ++++
4 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 68c569f..6340f38 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,10 @@ struct zone {
*/
spinlock_t lock;
int all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ /* pfn where the last incremental compaction isolated free pages */
+ unsigned long compact_cached_free_pfn;
+#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
diff --git a/mm/compaction.c b/mm/compaction.c
index 384164e..a806a9c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -475,6 +475,17 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;

+ /*
+ * Skip ahead if another thread is compacting in the area
+ * simultaneously. If we wrapped around, we can only skip
+ * ahead if zone->compact_cached_free_pfn also wrapped to
+ * above our starting point.
+ */
+ if (cc->order > 0 && (!cc->wrapped ||
+ zone->compact_cached_free_pfn >
+ cc->start_free_pfn))
+ pfn = min(pfn, zone->compact_cached_free_pfn);
+
if (!pfn_valid(pfn))
continue;

@@ -515,8 +526,11 @@ static void isolate_freepages(struct zone *zone,
* looking for free pages, the search will restart here as
* page migration may have returned some pages to the allocator
*/
- if (isolated)
+ if (isolated) {
high_pfn = max(high_pfn, pfn);
+ if (cc->order > 0)
+ zone->compact_cached_free_pfn = high_pfn;
+ }
}

/* split_free_page does not map the pages */
@@ -611,6 +625,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

+/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
@@ -619,8 +647,26 @@ static int compact_finished(struct zone *zone,
if (fatal_signal_pending(current))
return COMPACT_PARTIAL;

- /* Compaction run completes if the migrate and free scanner meet */
- if (cc->free_pfn <= cc->migrate_pfn)
+ /*
+ * A full (order == -1) compaction run starts at the beginning and
+ * end of a zone; it completes when the migrate and free scanner meet.
+ * A partial (order > 0) compaction can start with the free scanner
+ * at a random point in the zone, and may have to restart.
+ */
+ if (cc->free_pfn <= cc->migrate_pfn) {
+ if (cc->order > 0 && !cc->wrapped) {
+ /* We started partway through; restart at the end. */
+ unsigned long free_pfn = start_free_pfn(zone);
+ zone->compact_cached_free_pfn = free_pfn;
+ cc->free_pfn = free_pfn;
+ cc->wrapped = 1;
+ return COMPACT_CONTINUE;
+ }
+ return COMPACT_COMPLETE;
+ }
+
+ /* We wrapped around and ended up where we started. */
+ if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
return COMPACT_COMPLETE;

/*
@@ -726,8 +772,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

/* Setup to move all movable pages to the end of the zone */
cc->migrate_pfn = zone->zone_start_pfn;
- cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
- cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ if (cc->order > 0) {
+ /* Incremental compaction. Start where the last one stopped. */
+ cc->free_pfn = zone->compact_cached_free_pfn;
+ cc->start_free_pfn = cc->free_pfn;
+ } else {
+ /* Order == -1 starts at the end of the zone. */
+ cc->free_pfn = start_free_pfn(zone);
+ }

migrate_prep_local();

diff --git a/mm/internal.h b/mm/internal.h
index 9156714..064f6ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -118,8 +118,14 @@ struct compact_control {
unsigned long nr_freepages; /* Number of isolated free pages */
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long start_free_pfn; /* where we started the search */
unsigned long migrate_pfn; /* isolate_migratepages search base */
bool sync; /* Synchronous migration */
+ bool wrapped; /* Order > 0 compactions are
+ incremental, once free_pfn
+ and migrate_pfn meet, we restart
+ from the top of the zone;
+ remember we wrapped around. */

int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adc3aa8..781d6e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,6 +4425,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->spanned_pages = size;
zone->present_pages = realsize;
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ zone->compact_cached_free_pfn = zone->zone_start_pfn +
+ zone->spanned_pages;
+ zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
+#endif
#ifdef CONFIG_NUMA
zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
--
1.7.9.2

2012-08-09 13:50:29

by Mel Gorman

Subject: [PATCH 2/5] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

If allocation fails after compaction then compaction may be deferred for
a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This patch
uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again. The rationale is
that reclaiming the normal number of pages still allowed compaction to
fail and its success depends on the number of pages. If it's failing,
reclaim more pages until it succeeds again.
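
As a worked example (numbers purely illustrative): for a THP-sized request
with sc->order = 9, the baseline target of 2UL << 9 = 1024 pages scales to
1024 << 4 = 16384 pages once compact_defer_shift has reached 4 after
repeated failures.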

Note that this is not implying that VM reclaim is not reclaiming enough
pages or that its logic is broken. try_to_free_pages() always asks for
SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
what it does. Direct reclaim stops normally with this check.

if (sc->nr_reclaimed >= sc->nr_to_reclaim)
goto out;

should_continue_reclaim delays when that check is made until a minimum number
of pages for reclaim/compaction are reclaimed. It is possible that this patch
could instead set nr_to_reclaim in try_to_free_pages() and drive it from
there, but that behaves differently and not necessarily for the better. If
driven from do_try_to_free_pages(), it is also possible that priorities
will rise. When they reach DEF_PRIORITY-2, it will also start stalling
and setting pages for immediate reclaim, which is more disruptive than is
desirable in this case. That is a more wide-reaching change that could
cause another regression related to THP requests causing interactive jitter.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
mm/vmscan.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..7a43fd8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ struct zone *zone;

/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))
@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
+
+ /*
+ * If compaction is deferred for sc->order then scale the number of
+ * pages reclaimed based on the number of consecutive allocation
+ * failures
+ */
+ zone = lruvec_zone(lruvec);
+ if (zone->compact_order_failed <= sc->order)
+ pages_for_compaction <<= zone->compact_defer_shift;
inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
if (nr_swap_pages > 0)
inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
--
1.7.9.2

2012-08-09 14:36:38

by Jim Schutt

Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

Hi Mel,

On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order > 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under load -
> to 0% in one case. There is a proposed change to that patch in this series
> and it would be ideal if Jim Schutt could retest the workload that led to
> commit [7db8889a: mm: have order > 0 compaction start off where it left].

I was successful at resolving my Ceph issue on 3.6-rc1, but ran
into some other issue that isn't immediately obvious, and prevents
me from testing your patch with 3.6-rc1. Today I will apply your
patch series to 3.5 and test that way.

Sorry for the delay.

-- Jim

2012-08-09 14:51:46

by Mel Gorman

Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

On Thu, Aug 09, 2012 at 08:36:12AM -0600, Jim Schutt wrote:
> Hi Mel,
>
> On 08/09/2012 07:49 AM, Mel Gorman wrote:
> >Changelog since V2
> >o Capture !MIGRATE_MOVABLE pages where possible
> >o Document the treatment of MIGRATE_MOVABLE pages while capturing
> >o Expand changelogs
> >
> >Changelog since V1
> >o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> >o Expanded changelogs a little
> >
> >Allocation success rates have been far lower since 3.4 due to commit
> >[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> >commit was introduced for good reasons and it was known in advance that
> >the success rates would suffer but it was justified on the grounds that
> >the high allocation success rates were achieved by aggressive reclaim.
> >Success rates are expected to suffer even more in 3.6 due to commit
> >[7db8889a: mm: have order > 0 compaction start off where it left] which
> >testing has shown to severely reduce allocation success rates under load -
> >to 0% in one case. There is a proposed change to that patch in this series
> >and it would be ideal if Jim Schutt could retest the workload that led to
> >commit [7db8889a: mm: have order > 0 compaction start off where it left].
>
> I was successful at resolving my Ceph issue on 3.6-rc1, but ran
> into some other issue that isn't immediately obvious, and prevents
> me from testing your patch with 3.6-rc1. Today I will apply your
> patch series to 3.5 and test that way.
>
> Sorry for the delay.
>

No need to be sorry at all. I appreciate you taking the time and as
there were revisions since V1 you were better off waiting even if you
did not have the Ceph issue!

Thanks.

--
Mel Gorman
SUSE Labs

2012-08-09 18:16:48

by Jim Schutt

Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order > 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under load -
> to 0% in one case. There is a proposed change to that patch in this series
> and it would be ideal if Jim Schutt could retest the workload that led to
> commit [7db8889a: mm: have order > 0 compaction start off where it left].

On my first test of this patch series on top of 3.5, I ran into an
instance of what I think is the sort of thing that patch 4/5 was
fixing. Here's what vmstat had to say during that period:

----------

2012-08-09 11:58:04.107-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
20 14 0 235884 576 38916072 0 0 12 17047 171 133 3 8 85 4 0
18 17 0 220272 576 38955912 0 0 86 2131838 200142 162956 12 38 31 19 0
17 9 0 244284 576 38955328 0 0 19 2179562 213775 167901 13 43 26 18 0
27 15 0 223036 576 38952640 0 0 24 2202816 217996 158390 14 47 25 15 0
17 16 0 233124 576 38959908 0 0 5 2268815 224647 165728 14 50 21 15 0
16 13 0 225840 576 38995740 0 0 52 2253829 216797 160551 14 47 23 16 0
22 13 0 260584 576 38982908 0 0 92 2196737 211694 140924 14 53 19 15 0
16 10 0 235784 576 38917128 0 0 22 2157466 210022 137630 14 54 19 14 0
12 13 0 214300 576 38923848 0 0 31 2187735 213862 142711 14 52 20 14 0
25 12 0 219528 576 38919540 0 0 11 2066523 205256 142080 13 49 23 15 0
26 14 0 229460 576 38913704 0 0 49 2108654 200692 135447 13 51 21 15 0
11 11 0 220376 576 38862456 0 0 45 2136419 207493 146813 13 49 22 16 0
36 12 0 229860 576 38869784 0 0 7 2163463 212223 151812 14 47 25 14 0
16 13 0 238356 576 38891496 0 0 67 2251650 221728 154429 14 52 20 14 0
65 15 0 211536 576 38922108 0 0 59 2237925 224237 156587 14 53 19 14 0
24 13 0 585024 576 38634024 0 0 37 2240929 229040 148192 15 61 14 10 0

2012-08-09 11:59:04.714-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
43 8 0 794392 576 38382316 0 0 11 20491 576 420 3 10 82 4 0
127 6 0 579328 576 38422156 0 0 21 2006775 205582 119660 12 70 11 7 0
44 5 0 492860 576 38512360 0 0 46 1536525 173377 85320 10 78 7 4 0
218 9 0 585668 576 38271320 0 0 39 1257266 152869 64023 8 83 7 3 0
101 6 0 600168 576 38128104 0 0 10 1438705 160769 68374 9 84 5 3 0
62 5 0 597004 576 38098972 0 0 93 1376841 154012 63912 8 82 7 4 0
61 11 0 850396 576 37808772 0 0 46 1186816 145731 70453 7 78 9 6 0
124 7 0 437388 576 38126320 0 0 15 1208434 149736 57142 7 86 4 3 0
204 11 0 1105816 576 37309532 0 0 20 1327833 145979 52718 7 87 4 2 0
29 8 0 751020 576 37360332 0 0 8 1405474 169916 61982 9 85 4 2 0
38 7 0 626448 576 37333244 0 0 14 1328415 174665 74214 8 84 5 3 0
23 5 0 650040 576 37134280 0 0 28 1351209 179220 71631 8 85 5 2 0
40 10 0 610988 576 37054292 0 0 104 1272527 167530 73527 7 85 5 3 0
79 22 0 2076836 576 35487340 0 0 750 1249934 175420 70124 7 88 3 2 0
58 6 0 431068 576 36934140 0 0 1000 1366234 169675 72524 8 84 5 3 0
134 9 0 574692 576 36784980 0 0 1049 1305543 152507 62639 8 84 4 4 0

2012-08-09 12:00:09.137-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0
104 14 0 3140508 576 33522616 0 0 299 1414709 160879 51422 9 89 1 1 0
100 11 0 1323036 576 35337740 0 0 429 1637733 175817 94471 9 73 10 8 0
91 11 0 673320 576 35918084 0 0 562 1477100 157069 67951 8 83 5 4 0
35 15 0 3486592 576 32983244 0 0 384 1574186 189023 82135 9 81 5 5 0
51 16 0 1428108 576 34962112 0 0 394 1573231 160575 76632 9 76 9 7 0
55 6 0 719548 576 35621284 0 0 425 1483962 160335 79991 8 74 10 7 0
96 7 0 1226852 576 35062608 0 0 803 1531041 164923 70820 9 78 7 6 0
97 8 0 862500 576 35332496 0 0 536 1177949 155969 80769 7 74 13 7 0
23 5 0 6096372 576 30115776 0 0 367 919949 124993 81755 6 62 24 8 0
13 5 0 7427860 576 28368292 0 0 399 915331 153895 102186 6 53 32 9 0

----------

And here's a perf report, captured/displayed with
perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5
sometime during that period just after 12:00:09, when
the run queue was > 100.

----------

Processed 0 events and LOST 1175296!

Check IO/CPU overload!

# Events: 208K cycles
#
# Overhead      Symbol
# ........      ......
#
34.63% [k] _raw_spin_lock_irqsave
|
|--97.30%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--87.39%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --12.61%-- memcpy
--2.70%-- [...]

14.31% [k] _raw_spin_lock_irq
|
|--98.08%-- isolate_migratepages_range
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--83.93%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --16.07%-- memcpy
--1.92%-- [...]

5.48% [k] isolate_freepages_block
|
|--99.96%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--86.01%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --13.99%-- memcpy
--0.04%-- [...]

5.34% [.] ceph_crc32c_le
|
|--99.95%-- 0xb8057558d0065990
--0.05%-- [...]

----------

If I understand what this is telling me, skb_copy_datagram_iovec
is responsible for triggering the calls to isolate_freepages_block,
isolate_migratepages_range, and isolate_freepages?

FWIW, I'm using a Chelsio T4 NIC in these hosts, with jumbo frames
and the Linux TCP stack (i.e., no stateful TCP offload).

-- Jim

2012-08-09 20:46:37

by Mel Gorman

Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:
> On 08/09/2012 07:49 AM, Mel Gorman wrote:
> >Changelog since V2
> >o Capture !MIGRATE_MOVABLE pages where possible
> >o Document the treatment of MIGRATE_MOVABLE pages while capturing
> >o Expand changelogs
> >
> >Changelog since V1
> >o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> >o Expanded changelogs a little
> >
> >Allocation success rates have been far lower since 3.4 due to commit
> >[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> >commit was introduced for good reasons and it was known in advance that
> >the success rates would suffer but it was justified on the grounds that
> >the high allocation success rates were achieved by aggressive reclaim.
> >Success rates are expected to suffer even more in 3.6 due to commit
> >[7db8889a: mm: have order > 0 compaction start off where it left] which
> >testing has shown to severely reduce allocation success rates under load -
> >to 0% in one case. There is a proposed change to that patch in this series
> >and it would be ideal if Jim Schutt could retest the workload that led to
> >commit [7db8889a: mm: have order > 0 compaction start off where it left].
>
> On my first test of this patch series on top of 3.5, I ran into an
> instance of what I think is the sort of thing that patch 4/5 was
> fixing. Here's what vmstat had to say during that period:
>
> <SNIP>

My conclusion looking at the vmstat data is that everything is looking ok
until system CPU usage goes through the roof. I'm assuming that's what we
are all still looking at.

I am still concerned that what patch 4/5 was actually doing was bypassing
compaction almost entirely in the contended case, which "works" but is not
exactly the expected behaviour.

> And here's a perf report, captured/displayed with
> perf record -g -a sleep 10
> perf report --sort symbol --call-graph fractal,5
> sometime during that period just after 12:00:09, when
> the run queue was > 100.
>
> ----------
>
> Processed 0 events and LOST 1175296!
>
> <SNIP>
> #
> 34.63% [k] _raw_spin_lock_irqsave
> |
> |--97.30%-- isolate_freepages
> | compaction_alloc
> | unmap_and_move
> | migrate_pages
> | compact_zone
> | compact_zone_order
> | try_to_compact_pages
> | __alloc_pages_direct_compact
> | __alloc_pages_slowpath
> | __alloc_pages_nodemask
> | alloc_pages_vma
> | do_huge_pmd_anonymous_page
> | handle_mm_fault
> | do_page_fault
> | page_fault
> | |
> | |--87.39%-- skb_copy_datagram_iovec
> | | tcp_recvmsg
> | | inet_recvmsg
> | | sock_recvmsg
> | | sys_recvfrom
> | | system_call
> | | __recv
> | | |
> | | --100.00%-- (nil)
> | |
> | --12.61%-- memcpy
> --2.70%-- [...]

So lets just consider this. My interpretation of that is that we are
receiving data from the network and copying it into a buffer that is
faulted for the first time and backed by THP.

All good so far *BUT* we are contending like crazy on the zone lock and
probably blocking normal page allocations in the meantime.

>
> 14.31% [k] _raw_spin_lock_irq
> |
> |--98.08%-- isolate_migratepages_range

This is a variation of the same problem but on the LRU lock this time.

> <SNIP>
>
> ----------
>
> If I understand what this is telling me, skb_copy_datagram_iovec
> is responsible for triggering the calls to isolate_freepages_block,
> isolate_migratepages_range, and isolate_freepages?
>

Sort of. I do not think it's the jumbo frames that are doing it, it's the
faulting of the buffer it copies to.

> FWIW, I'm using a Chelsio T4 NIC in these hosts, with jumbo frames
> and the Linux TCP stack (i.e., no stateful TCP offload).
>

Ok, this is an untested hack and I expect it would drop allocation success
rates again under load (but not as much). Can you test again and see what
effect, if any, it has please?

---8<---
mm: compaction: back out if contended

---
include/linux/compaction.h | 4 ++--
mm/compaction.c | 45 ++++++++++++++++++++++++++++++++++++++------
mm/internal.h | 1 +
mm/page_alloc.c | 13 +++++++++----
4 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5673459..9c94cba 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- bool sync, struct page **page);
+ bool sync, bool *contended, struct page **page);
extern int compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);

@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync, struct page **page)
+ bool sync, bool *contended, struct page **page)
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index c2d0958..8e290d2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,27 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+/*
+ * Compaction requires the taking of some coarse locks that are potentially
+ * very heavily contended. For async compaction, back out in the event there
+ * is contention.
+ */
+static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
+ struct compact_control *cc)
+{
+ if (cc->sync) {
+ spin_lock_irqsave(lock, *flags);
+ } else {
+ if (!spin_trylock_irqsave(lock, *flags)) {
+ if (cc->contended)
+ *cc->contended = true;
+ return false;
+ }
+ }
+
+ return true;
+}
+
static void compact_capture_page(struct compact_control *cc)
{
unsigned long flags;
@@ -87,7 +108,8 @@ static void compact_capture_page(struct compact_control *cc)
continue;

/* Take the lock and attempt capture of the page */
- spin_lock_irqsave(&cc->zone->lock, flags);
+ if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc))
+ return;
if (!list_empty(&area->free_list[mtype])) {
page = list_entry(area->free_list[mtype].next,
struct page, lru);
@@ -514,7 +536,16 @@ static void isolate_freepages(struct zone *zone,
* are disabled
*/
isolated = 0;
- spin_lock_irqsave(&zone->lock, flags);
+
+ /*
+ * The zone lock must be held to isolate freepages.
+ * Unfortunately this is a very coarse lock and can be
+ * heavily contended if there are parallel allocations
+ * or parallel compactions. For async compaction do not
+ * spin on the lock
+ */
+ if (!compact_trylock_irqsave(&zone->lock, &flags, cc))
+ break;
if (suitable_migration_target(page)) {
end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
trace_mm_compaction_freepage_scanpfn(pfn);
@@ -837,8 +868,8 @@ out:
}

static unsigned long compact_zone_order(struct zone *zone,
- int order, gfp_t gfp_mask,
- bool sync, struct page **page)
+ int order, gfp_t gfp_mask, bool sync,
+ bool *contended, struct page **page)
{
struct compact_control cc = {
.nr_freepages = 0,
@@ -848,6 +879,7 @@ static unsigned long compact_zone_order(struct zone *zone,
.zone = zone,
.sync = sync,
.page = page,
+ .contended = contended,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -869,7 +901,7 @@ int sysctl_extfrag_threshold = 500;
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync, struct page **page)
+ bool sync, bool *contended, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -889,7 +921,8 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

- status = compact_zone_order(zone, order, gfp_mask, sync, page);
+ status = compact_zone_order(zone, order, gfp_mask, sync,
+ contended, page);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */
diff --git a/mm/internal.h b/mm/internal.h
index 064f6ef..344b555 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,6 +130,7 @@ struct compact_control {
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
+ bool *contended; /* True if a lock was contended */
struct page **page; /* Page captured of requested size */
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 781d6e4..75b30ea 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2086,7 +2086,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, bool sync_migration,
- bool *deferred_compaction,
+ bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
struct page *page = NULL;
@@ -2101,7 +2101,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
- nodemask, sync_migration, &page);
+ nodemask, sync_migration,
+ contended_compaction, &page);
current->flags &= ~PF_MEMALLOC;

/* If compaction captured a page, prep and use it */
@@ -2154,7 +2155,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, bool sync_migration,
- bool *deferred_compaction,
+ bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
return NULL;
@@ -2318,6 +2319,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
bool sync_migration = false;
bool deferred_compaction = false;
+ bool contended_compaction = false;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -2399,6 +2401,7 @@ rebalance:
nodemask,
alloc_flags, preferred_zone,
migratetype, sync_migration,
+ &contended_compaction,
&deferred_compaction,
&did_some_progress);
if (page)
@@ -2411,7 +2414,8 @@ rebalance:
* has requested the system not be heavily disrupted, fail the
* allocation now instead of entering direct reclaim
*/
- if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+ if ((deferred_compaction || contended_compaction) &&
+ (gfp_mask & __GFP_NO_KSWAPD))
goto nopage;

/* Try direct reclaim and then allocating */
@@ -2482,6 +2486,7 @@ rebalance:
nodemask,
alloc_flags, preferred_zone,
migratetype, sync_migration,
+ &contended_compaction,
&deferred_compaction,
&did_some_progress);
if (page)

2012-08-09 22:38:51

by Jim Schutt

Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

On 08/09/2012 02:46 PM, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:
>> On 08/09/2012 07:49 AM, Mel Gorman wrote:
>>> Changelog since V2
>>> o Capture !MIGRATE_MOVABLE pages where possible
>>> o Document the treatment of MIGRATE_MOVABLE pages while capturing
>>> o Expand changelogs
>>>
>>> Changelog since V1
>>> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
>>> o Expanded changelogs a little
>>>
>>> Allocation success rates have been far lower since 3.4 due to commit
>>> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
>>> commit was introduced for good reasons and it was known in advance that
>>> the success rates would suffer but it was justified on the grounds that
>>> the high allocation success rates were achieved by aggressive reclaim.
>>> Success rates are expected to suffer even more in 3.6 due to commit
>>> [7db8889a: mm: have order > 0 compaction start off where it left] which
>>> testing has shown to severely reduce allocation success rates under load -
>>> to 0% in one case. There is a proposed change to that patch in this series
>>> and it would be ideal if Jim Schutt could retest the workload that led to
>>> commit [7db8889a: mm: have order > 0 compaction start off where it left].
>>
>> On my first test of this patch series on top of 3.5, I ran into an
>> instance of what I think is the sort of thing that patch 4/5 was
>> fixing. Here's what vmstat had to say during that period:
>>
>> <SNIP>
>
> My conclusion looking at the vmstat data is that everything is looking ok
> until system CPU usage goes through the roof. I'm assuming that's what we
> are all still looking at.

I'm concerned about both the high CPU usage and the
reduction in write-out rate, but I've been assuming the latter
is caused by the former.

<snip>

>
> Ok, this is an untested hack and I expect it would drop allocation success
> rates again under load (but not as much). Can you test again and see what
> effect, if any, it has please?
>
> ---8<---
> mm: compaction: back out if contended
>
> ---

<snip>

Initial testing with this patch looks very good from
my perspective; CPU utilization stays reasonable,
write-out rate stays high, no signs of stress.
Here's an example after ~10 minutes under my test load:

2012-08-09 16:26:07.550-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
21 19 0 351628 576 37835440 0 0 17 44394 1241 653 6 20 64 9 0
11 11 0 365520 576 37893060 0 0 124 2121508 203450 170957 12 46 25 17 0
13 16 0 359888 576 37954456 0 0 98 2185033 209473 171571 13 44 25 18 0
17 15 0 353728 576 38010536 0 0 89 2170971 208052 167988 13 43 26 18 0
17 16 0 349732 576 38048284 0 0 135 2217752 218754 174170 13 49 21 16 0
43 13 0 343280 576 38046500 0 0 153 2207135 217872 179519 13 47 23 18 0
26 13 0 350968 576 37937184 0 0 147 2189822 214276 176697 13 47 23 17 0
4 12 0 350080 576 37958364 0 0 226 2145212 207077 172163 12 44 24 20 0
15 13 0 353124 576 37921040 0 0 145 2078422 197231 166381 12 41 30 17 0
14 15 0 348964 576 37949588 0 0 107 2020853 188192 164064 12 39 30 20 0
21 9 0 354784 576 37951228 0 0 117 2148090 204307 165609 13 48 22 18 0
36 16 0 347368 576 37989824 0 0 166 2208681 216392 178114 13 47 24 16 0
28 15 0 300656 576 38060912 0 0 164 2181681 214618 175132 13 45 24 18 0
9 16 0 295484 576 38092184 0 0 153 2156909 218993 180289 13 43 27 17 0
17 16 0 346760 576 37979008 0 0 165 2124168 198730 173455 12 44 27 18 0
14 17 0 360988 576 37957136 0 0 142 2092248 197430 168199 12 42 29 17 0

I'll continue testing tomorrow to be sure nothing
shows up after continued testing.

If this passes your allocation success rate testing,
I'm happy with this performance for 3.6 - if not, I'll
be happy to test any further patches.

I really appreciate getting the chance to test out
your patchset.

Thanks -- Jim

2012-08-09 23:33:26

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available

On Thu, Aug 09, 2012 at 02:49:23PM +0100, Mel Gorman wrote:
> While compaction is migrating pages to free up large contiguous blocks for
> allocation it races with other allocation requests that may steal these
> blocks or break them up. This patch alters direct compaction to capture a
> suitable free page as soon as it becomes available to reduce this race. It
> uses similar logic to split_free_page() to ensure that watermarks are
> still obeyed.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
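
For anyone skimming the thread, my rough mental model of the capture path is
sketched below. It is only an illustration of the changelog above;
capture_free_page_sketch() is a made-up name, and the real logic lives in
compact_capture_page() with the captured page ending up in cc->page for the
allocator to prep and use directly.

	/*
	 * Illustration only: a made-up helper showing the idea of pulling a
	 * suitable high-order page straight off the free list, subject to
	 * the same watermark principle as split_free_page().
	 */
	static struct page *capture_free_page_sketch(struct zone *zone,
						     int order, int migratetype)
	{
		struct free_area *area = &zone->free_area[order];
		struct page *page = NULL;
		unsigned long flags;

		spin_lock_irqsave(&zone->lock, flags);
		if (!list_empty(&area->free_list[migratetype]) &&
		    zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
			/* Take the page before a racing allocation can */
			page = list_entry(area->free_list[migratetype].next,
					  struct page, lru);
			list_del(&page->lru);
			area->nr_free--;
		}
		spin_unlock_irqrestore(&zone->lock, flags);

		return page;
	}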

--
Kind regards,
Minchan Kim

2012-08-10 08:48:03

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures

On Thu, Aug 09, 2012 at 02:49:22PM +0100, Mel Gorman wrote:
> If allocation fails after compaction then compaction may be deferred for
> a number of allocation attempts. If there are subsequent failures,
> compact_defer_shift is increased to defer for longer periods. This patch
> uses that information to scale the number of pages reclaimed with
> compact_defer_shift until allocations succeed again. The rationale is
> that reclaiming the normal number of pages still allowed compaction to
> fail and its success depends on the number of pages. If it's failing,
> reclaim more pages until it succeeds again.
>
> Note that this is not implying that VM reclaim is not reclaiming enough
> pages or that its logic is broken. try_to_free_pages() always asks for
> SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
> what it does. Direct reclaim stops normally with this check.
>
> if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> goto out;
>
> should_continue_reclaim delays when that check is made until a minimum number
> of pages for reclaim/compaction are reclaimed. It is possible that this patch
> could instead set nr_to_reclaim in try_to_free_pages() and drive it from
> there but that's behaves differently and not necessarily for the better. If
> driven from do_try_to_free_pages(), it is also possible that priorities
> will rise. When they reach DEF_PRIORITY-2, it will also start stalling
> and setting pages for immediate reclaim which is more disruptive than not
> desirable in this case. That is a more wide-reaching change that could
> cause another regression related to THP requests causing interactive jitter.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
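
Just to spell out the scaling for anyone skimming the thread: as I read the
patch, the target that should_continue_reclaim() waits for grows with the
deferral shift, roughly along the lines of the sketch below (my paraphrase of
the idea, not a quote of the hunk).

	unsigned long pages_for_compaction = 2UL << sc->order;

	/*
	 * Each time compaction is deferred, compact_defer_shift grows, so
	 * reclaim/compaction asks for progressively more pages before it
	 * gives up and lets compaction be retried.
	 */
	if (compaction_deferred(zone, sc->order))
		pages_for_compaction <<= zone->compact_defer_shift;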

--
Kind regards,
Minchan Kim

2012-08-10 11:02:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
> >><SNIP>
> >
> >My conclusion looking at the vmstat data is that everything is looking ok
> >until system CPU usage goes through the roof. I'm assuming that's what we
> >are all still looking at.
>
> I'm concerned about both the high CPU usage as well as the
> reduction in write-out rate, but I've been assuming the latter
> is caused by the former.
>

Almost certainly.

> <snip>
>
> >
> >Ok, this is an untested hack and I expect it would drop allocation success
> >rates again under load (but not as much). Can you test again and see what
> >effect, if any, it has please?
> >
> >---8<---
> >mm: compaction: back out if contended
> >
> >---
>
> <snip>
>
> Initial testing with this patch looks very good from
> my perspective; CPU utilization stays reasonable,
> write-out rate stays high, no signs of stress.
> Here's an example after ~10 minutes under my test load:
>

Excellent, so it is contention that is the problem.

> <SNIP>
> I'll continue testing tomorrow to be sure nothing
> shows up after continued testing.
>
> If this passes your allocation success rate testing,
> I'm happy with this performance for 3.6 - if not, I'll
> be happy to test any further patches.
>

It does impair allocation success rates as I expected (they're still ok
but not as high as I'd like) so I implemented the following instead. It
attempts to back off when contention is detected or compaction is taking
too long. It does not back off as quickly as the first prototype did, so
I'd like to see if it addresses your problem or not.

> I really appreciate getting the chance to test out
> your patchset.
>

I appreciate that you have a workload that demonstrates the problem and
will test patches. I will not abuse this and hope to keep the revisions
to a minimum.

Thanks.

---8<---
mm: compaction: Abort async compaction if locks are contended or taking too long

Jim Schutt reported a problem that pointed at compaction contending
heavily on locks. The workload is straightforward and, in his own words:

The systems in question have 24 SAS drives spread across 3 HBAs,
running 24 Ceph OSD instances, one per drive. FWIW these servers
are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
Ceph Linux clients doing dd simultaneously to a Ceph file system
backed by 12 of these servers.

Early in the test everything looks fine

procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

and then it goes to pot

procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

Note that system CPU usage is very high and blocks being written out have
dropped by 42%. He analysed this with perf and found

perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5
34.63% [k] _raw_spin_lock_irqsave
|
|--97.30%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--87.39%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --12.61%-- memcpy
--2.70%-- [...]

There was other data but primarily it all pointed to compaction contending
heavily on the zone->lock and zone->lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.
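
To make the intended calling pattern explicit (this is just an illustration;
the real callers are in the hunks below), a scan loop that previously spun on
the lock now re-checks it on every iteration and bails out of the scan when
async compaction hits contention:

	/* Check if it is ok to still hold the lock, abort the scan if not */
	locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
						locked, cc);
	if (!locked)
		break;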

Reported-by: Jim Schutt <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/compaction.h | 4 +-
mm/compaction.c | 91 +++++++++++++++++++++++++++++++++++---------
mm/internal.h | 1 +
mm/page_alloc.c | 13 +++++--
4 files changed, 84 insertions(+), 25 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5673459..9c94cba 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- bool sync, struct page **page);
+ bool sync, bool *contended, struct page **page);
extern int compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);

@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync, struct page **page)
+ bool sync, bool *contended, struct page **page)
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index c2d0958..1827d9a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,47 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+/*
+ * Compaction requires the taking of some coarse locks that are potentially
+ * very heavily contended. Check if the process needs to be scheduled or
+ * if the lock is contended. For async compaction, back out if contention
+ * is severe. For sync compaction, schedule.
+ *
+ * Returns true if the lock is held.
+ * Returns false if the lock is released and compaction should abort.
+ */
+static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
+ bool locked, struct compact_control *cc)
+{
+ if (need_resched() || spin_is_contended(lock)) {
+ if (locked) {
+ spin_unlock_irqrestore(lock, *flags);
+ locked = false;
+ }
+
+ /* async aborts if taking too long or contended */
+ if (!cc->sync) {
+ if (cc->contended)
+ *cc->contended = true;
+ return false;
+ }
+
+ cond_resched();
+ if (fatal_signal_pending(current))
+ return false;
+ }
+
+ if (!locked)
+ spin_lock_irqsave(lock, *flags);
+ return true;
+}
+
+static inline bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
+ struct compact_control *cc)
+{
+ return compact_checklock_irqsave(lock, flags, false, cc);
+}
+
static void compact_capture_page(struct compact_control *cc)
{
unsigned long flags;
@@ -87,7 +128,8 @@ static void compact_capture_page(struct compact_control *cc)
continue;

/* Take the lock and attempt capture of the page */
- spin_lock_irqsave(&cc->zone->lock, flags);
+ if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc))
+ return;
if (!list_empty(&area->free_list[mtype])) {
page = list_entry(area->free_list[mtype].next,
struct page, lru);
@@ -281,6 +323,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
struct list_head *migratelist = &cc->migratepages;
isolate_mode_t mode = 0;
struct lruvec *lruvec;
+ unsigned long flags;
+ bool locked;

/*
* Ensure that there are not too many pages isolated from the LRU
@@ -300,25 +344,22 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,

/* Time to isolate some pages for migration */
cond_resched();
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irqsave(&zone->lru_lock, flags);
+ locked = true;
for (; low_pfn < end_pfn; low_pfn++) {
struct page *page;
- bool locked = true;

/* give a chance to irqs before checking need_resched() */
if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
locked = false;
}
- if (need_resched() || spin_is_contended(&zone->lru_lock)) {
- if (locked)
- spin_unlock_irq(&zone->lru_lock);
- cond_resched();
- spin_lock_irq(&zone->lru_lock);
- if (fatal_signal_pending(current))
- break;
- } else if (!locked)
- spin_lock_irq(&zone->lru_lock);
+
+ /* Check if it is ok to still hold the lock */
+ locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
+ locked, cc);
+ if (!locked)
+ break;

/*
* migrate_pfn does not necessarily start aligned to a
@@ -404,7 +445,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,

acct_isolated(zone, cc);

- spin_unlock_irq(&zone->lru_lock);
+ if (locked)
+ spin_unlock_irqrestore(&zone->lru_lock, flags);

trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);

@@ -514,7 +556,16 @@ static void isolate_freepages(struct zone *zone,
* are disabled
*/
isolated = 0;
- spin_lock_irqsave(&zone->lock, flags);
+
+ /*
+ * The zone lock must be held to isolate freepages. Unfortunately
+ * this is a very coarse lock and can be
+ * heavily contended if there are parallel allocations
+ * or parallel compactions. For async compaction do not
+ * spin on the lock
+ */
+ if (!compact_trylock_irqsave(&zone->lock, &flags, cc))
+ break;
if (suitable_migration_target(page)) {
end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
trace_mm_compaction_freepage_scanpfn(pfn);
@@ -837,8 +888,8 @@ out:
}

static unsigned long compact_zone_order(struct zone *zone,
- int order, gfp_t gfp_mask,
- bool sync, struct page **page)
+ int order, gfp_t gfp_mask, bool sync,
+ bool *contended, struct page **page)
{
struct compact_control cc = {
.nr_freepages = 0,
@@ -848,6 +899,7 @@ static unsigned long compact_zone_order(struct zone *zone,
.zone = zone,
.sync = sync,
.page = page,
+ .contended = contended,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -869,7 +921,7 @@ int sysctl_extfrag_threshold = 500;
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync, struct page **page)
+ bool sync, bool *contended, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -889,7 +941,8 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

- status = compact_zone_order(zone, order, gfp_mask, sync, page);
+ status = compact_zone_order(zone, order, gfp_mask, sync,
+ contended, page);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */
diff --git a/mm/internal.h b/mm/internal.h
index 064f6ef..344b555 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,6 +130,7 @@ struct compact_control {
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
+ bool *contended; /* True if a lock was contended */
struct page **page; /* Page captured of requested size */
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 781d6e4..75b30ea 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2086,7 +2086,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, bool sync_migration,
- bool *deferred_compaction,
+ bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
struct page *page = NULL;
@@ -2101,7 +2101,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
- nodemask, sync_migration, &page);
+ nodemask, sync_migration,
+ contended_compaction, &page);
current->flags &= ~PF_MEMALLOC;

/* If compaction captured a page, prep and use it */
@@ -2154,7 +2155,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, bool sync_migration,
- bool *deferred_compaction,
+ bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
return NULL;
@@ -2318,6 +2319,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
bool sync_migration = false;
bool deferred_compaction = false;
+ bool contended_compaction = false;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -2399,6 +2401,7 @@ rebalance:
nodemask,
alloc_flags, preferred_zone,
migratetype, sync_migration,
+ &contended_compaction,
&deferred_compaction,
&did_some_progress);
if (page)
@@ -2411,7 +2414,8 @@ rebalance:
* has requested the system not be heavily disrupted, fail the
* allocation now instead of entering direct reclaim
*/
- if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+ if ((deferred_compaction || contended_compaction) &&
+ (gfp_mask & __GFP_NO_KSWAPD))
goto nopage;

/* Try direct reclaim and then allocating */
@@ -2482,6 +2486,7 @@ rebalance:
nodemask,
alloc_flags, preferred_zone,
migratetype, sync_migration,
+ &contended_compaction,
&deferred_compaction,
&did_some_progress);
if (page)

2012-08-10 17:20:36

by Jim Schutt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

On 08/10/2012 05:02 AM, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:

>>>
>>> Ok, this is an untested hack and I expect it would drop allocation success
>>> rates again under load (but not as much). Can you test again and see what
>>> effect, if any, it has please?
>>>
>>> ---8<---
>>> mm: compaction: back out if contended
>>>
>>> ---
>>
>> <snip>
>>
>> Initial testing with this patch looks very good from
>> my perspective; CPU utilization stays reasonable,
>> write-out rate stays high, no signs of stress.
>> Here's an example after ~10 minutes under my test load:
>>

Hmmm, I wonder if I should have tested this patch longer,
in view of the trouble I ran into testing the new patch?
See below.

>
> Excellent, so it is contention that is the problem.
>
>> <SNIP>
>> I'll continue testing tomorrow to be sure nothing
>> shows up after continued testing.
>>
>> If this passes your allocation success rate testing,
>> I'm happy with this performance for 3.6 - if not, I'll
>> be happy to test any further patches.
>>
>
> It does impair allocation success rates as I expected (they're still ok
> but not as high as I'd like) so I implemented the following instead. It
> attempts to back off when contention is detected or compaction is taking
> too long. It does not back off as quickly as the first prototype did, so
> I'd like to see if it addresses your problem or not.
>
>> I really appreciate getting the chance to test out
>> your patchset.
>>
>
> I appreciate that you have a workload that demonstrates the problem and
> will test patches. I will not abuse this and hope to keep the revisions
> to a minimum.
>
> Thanks.
>
> ---8<---
> mm: compaction: Abort async compaction if locks are contended or taking too long


Hmmm, while testing this patch, a couple of my servers got
stuck after ~30 minutes or so, like this:

[ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
[ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2515.884447] ceph-osd D 0000000000000000 0 30375 1 0x00000000
[ 2515.891531] ffff8802e1a99e38 0000000000000082 ffff88056b38e298 ffff8802e1a99fd8
[ 2515.899013] ffff8802e1a98010 ffff8802e1a98000 ffff8802e1a98000 ffff8802e1a98000
[ 2515.906482] ffff8802e1a99fd8 ffff8802e1a98000 ffff880697d31700 ffff8802e1a84500
[ 2515.913968] Call Trace:
[ 2515.916433] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2515.921417] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2515.927938] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2515.934195] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2515.940934] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2515.946244] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2515.951640] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2515.957646] INFO: task ceph-osd:95698 blocked for more than 120 seconds.
[ 2515.964330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2515.972141] ceph-osd D 0000000000000000 0 95698 1 0x00000000
[ 2515.979223] ffff8802b049fe38 0000000000000082 ffff88056b38e2a0 ffff8802b049ffd8
[ 2515.986700] ffff8802b049e010 ffff8802b049e000 ffff8802b049e000 ffff8802b049e000
[ 2515.994176] ffff8802b049ffd8 ffff8802b049e000 ffff8809832ddc00 ffff880611592e00
[ 2516.001653] Call Trace:
[ 2516.004111] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.009072] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.015589] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.021861] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.028555] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.033859] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.039248] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.045248] INFO: task ceph-osd:95699 blocked for more than 120 seconds.
[ 2516.051934] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.059753] ceph-osd D 0000000000000000 0 95699 1 0x00000000
[ 2516.066832] ffff880c022d3dc8 0000000000000082 ffff880c022d2000 ffff880c022d3fd8
[ 2516.074302] ffff880c022d2010 ffff880c022d2000 ffff880c022d2000 ffff880c022d2000
[ 2516.081784] ffff880c022d3fd8 ffff880c022d2000 ffff8806224cc500 ffff88096b64dc00
[ 2516.089254] Call Trace:
[ 2516.091702] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.096656] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.103176] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.109443] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.116134] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.121442] [<ffffffff8111362e>] vm_mmap_pgoff+0x6e/0xb0
[ 2516.126861] [<ffffffff8112486a>] sys_mmap_pgoff+0x18a/0x190
[ 2516.132552] [<ffffffff8124bd6e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 2516.138985] [<ffffffff81006b22>] sys_mmap+0x22/0x30
[ 2516.143945] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.149949] INFO: task ceph-osd:95816 blocked for more than 120 seconds.
[ 2516.156632] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.164444] ceph-osd D 0000000000000000 0 95816 1 0x00000000
[ 2516.171521] ffff880332991e38 0000000000000082 ffff880332991de8 ffff880332991fd8
[ 2516.178992] ffff880332990010 ffff880332990000 ffff880332990000 ffff880332990000
[ 2516.186466] ffff880332991fd8 ffff880332990000 ffff880697d31700 ffff880a92c32e00
[ 2516.193937] Call Trace:
[ 2516.196396] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.201354] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.207886] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.214138] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.220843] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.226145] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.231548] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.237545] INFO: task ceph-osd:95838 blocked for more than 120 seconds.
[ 2516.244248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.252067] ceph-osd D 0000000000000000 0 95838 1 0x00000000
[ 2516.259159] ffff8803f8281e38 0000000000000082 ffff88056b38e2a8 ffff8803f8281fd8
[ 2516.266627] ffff8803f8280010 ffff8803f8280000 ffff8803f8280000 ffff8803f8280000
[ 2516.274094] ffff8803f8281fd8 ffff8803f8280000 ffff8809a45f8000 ffff880691d41700
[ 2516.281573] Call Trace:
[ 2516.284028] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.289000] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.295513] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.301764] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.308450] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.313753] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.319157] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.325154] INFO: task ceph-osd:95861 blocked for more than 120 seconds.
[ 2516.331844] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.339665] ceph-osd D 0000000000000000 0 95861 1 0x00000000
[ 2516.346742] ffff8805026e9e38 0000000000000082 ffff88056b38e2a0 ffff8805026e9fd8
[ 2516.354221] ffff8805026e8010 ffff8805026e8000 ffff8805026e8000 ffff8805026e8000
[ 2516.361698] ffff8805026e9fd8 ffff8805026e8000 ffff880611592e00 ffff880948df0000
[ 2516.369174] Call Trace:
[ 2516.371623] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.376582] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.383149] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.389404] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.396091] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.401397] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.406818] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.412868] INFO: task ceph-osd:95899 blocked for more than 120 seconds.
[ 2516.419557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.427371] ceph-osd D 0000000000000000 0 95899 1 0x00000000
[ 2516.434466] ffff8801eaa9dd50 0000000000000082 0000000000000000 ffff8801eaa9dfd8
[ 2516.442020] ffff8801eaa9c010 ffff8801eaa9c000 ffff8801eaa9c000 ffff8801eaa9c000
[ 2516.449594] ffff8801eaa9dfd8 ffff8801eaa9c000 ffff8800865e5c00 ffff8802b356c500
[ 2516.457079] Call Trace:
[ 2516.459534] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.464519] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.471044] [<ffffffff81480b95>] rwsem_down_read_failed+0x15/0x17
[ 2516.477222] [<ffffffff8124bca4>] call_rwsem_down_read_failed+0x14/0x30
[ 2516.483830] [<ffffffff8147ee07>] ? down_read+0x37/0x40
[ 2516.489050] [<ffffffff81484c49>] do_page_fault+0x239/0x4a0
[ 2516.494627] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 2516.501143] [<ffffffff8148154f>] page_fault+0x1f/0x30


I tried to capture a perf trace while this was going on, but it
never completed. "ps" on this system reports lots of kernel threads
and some user-space stuff, but hangs part way through - no ceph
executables in the output, oddly.

I can retest your earlier patch for a longer period, to
see if it does the same thing, or I can do some other thing
if you tell me what it is.

Also, FWIW I sorted a little through SysRq-T output from such
a system; these bits looked interesting:

[ 3663.685097] INFO: rcu_sched self-detected stall on CPU { 17} (t=60000 jiffies)
[ 3663.685099] sending NMI to all CPUs:
[ 3663.685101] NMI backtrace for cpu 0
[ 3663.685102] CPU 0 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685138]
[ 3663.685140] Pid: 100027, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685142] RIP: 0010:[<ffffffff81480ed5>] [<ffffffff81480ed5>] _raw_spin_lock_irqsave+0x45/0x60
[ 3663.685148] RSP: 0018:ffff880a08191898 EFLAGS: 00000012
[ 3663.685149] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000c5
[ 3663.685149] RDX: 00000000000000bf RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685150] RBP: ffff880a081918a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685151] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685152] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685153] FS: 00007fff90ae0700(0000) GS:ffff880627c00000(0000) knlGS:0000000000000000
[ 3663.685154] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685155] CR2: ffffffffff600400 CR3: 00000002b8fbe000 CR4: 00000000000007f0
[ 3663.685156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685157] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685158] Process ceph-osd (pid: 100027, threadinfo ffff880a08190000, task ffff880a9a29ae00)
[ 3663.685158] Stack:
[ 3663.685159] 000000000000130a 0000000000000000 ffff880a08191948 ffffffff8111a760
[ 3663.685162] ffffffff81a13420 0000000000000009 ffffea000004c240 0000000000000000
[ 3663.685165] ffff88063fffcba0 000000003fffcb98 ffff880a08191a18 0000000000001600
[ 3663.685168] Call Trace:
[ 3663.685169] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685173] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685175] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685178] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685180] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685182] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685187] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685190] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685192] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685195] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685199] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685202] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685205] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685208] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685211] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685213] Code: 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 0f b6 13 <38> d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66 2e 0f 1f
[ 3663.685238] NMI backtrace for cpu 3
[ 3663.685239] CPU 3 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685273]
[ 3663.685274] Pid: 101503, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685276] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685280] RSP: 0018:ffff8806bce17898 EFLAGS: 00000006
[ 3663.685280] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000cb
[ 3663.685281] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685282] RBP: ffff8806bce178a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685283] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685284] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685285] FS: 00007fffc8e60700(0000) GS:ffff880627c60000(0000) knlGS:0000000000000000
[ 3663.685286] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685287] CR2: ffffffffff600400 CR3: 00000002cbd8c000 CR4: 00000000000007e0
[ 3663.685287] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685288] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685289] Process ceph-osd (pid: 101503, threadinfo ffff8806bce16000, task ffff880c06580000)
[ 3663.685290] Stack:
[ 3663.685290] 0000000000001212 0000000000000000 ffff8806bce17948 ffffffff8111a760
[ 3663.685294] ffff8806244d5c00 0000000000000009 ffffea0000048440 0000000000000000
[ 3663.685297] ffff88063fffcba0 000000003fffcb98 ffff8806bce17a18 0000000000001600
[ 3663.685300] Call Trace:
[ 3663.685301] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685304] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685306] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685308] [<ffffffff814018c4>] ? ip_finish_output+0x274/0x300
[ 3663.685311] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685314] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685316] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685319] [<ffffffff813b655b>] ? release_sock+0x6b/0x80
[ 3663.685322] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685325] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685327] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685330] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685332] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685335] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685337] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685340] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685343] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685347] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685349] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685352] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685378] NMI backtrace for cpu 6
[ 3663.685379] CPU 6 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core[ 3663.685402] Uhhuh. NMI received for unknown reason 3d on CPU 3.
[ 3663.685403] mpt2sas[ 3663.685404] Do you have a strange power saving mode enabled?
[ 3663.685405] scsi_transport_sas[ 3663.685406] Dazed and confused, but trying to continue
[ 3663.685407] raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685420]
[ 3663.685422] Pid: 102943, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685424] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685430] RSP: 0018:ffff88065c111898 EFLAGS: 00000006
[ 3663.685430] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d9
[ 3663.685431] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685432] RBP: ffff88065c1118a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685433] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685433] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685434] FS: 00007fffc693b700(0000) GS:ffff880c3fc00000(0000) knlGS:0000000000000000
[ 3663.685435] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685436] CR2: ffffffffff600400 CR3: 000000048d1b1000 CR4: 00000000000007e0
[ 3663.685437] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685438] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685439] Process ceph-osd (pid: 102943, threadinfo ffff88065c110000, task ffff880737b9ae00)
[ 3663.685439] Stack:
[ 3663.685440] 0000000000001d31 0000000000000000 ffff88065c111948 ffffffff8111a760
[ 3663.685444] ffff8806245b2e00 ffff88065c1118c8 0000000000000006 0000000000000000
[ 3663.685447] ffff88063fffcba0 000000003fffcb98 ffff88065c111a18 0000000000002000
[ 3663.685450] Call Trace:
[ 3663.685451] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685455] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685458] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685460] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685462] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685464] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685469] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685471] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685474] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685477] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685481] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685483] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685487] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685490] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685493] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685497] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685500] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685502] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685527] NMI backtrace for cpu 1
[ 3663.685528] CPU 1 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685562]
[ 3663.685563] Pid: 30029, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685565] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685569] RSP: 0018:ffff880563ae1898 EFLAGS: 00000006
[ 3663.685569] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d6
[ 3663.685570] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685571] RBP: ffff880563ae18a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685572] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685573] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685574] FS: 00007fffe86c9700(0000) GS:ffff880627c20000(0000) knlGS:0000000000000000
[ 3663.685575] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685576] CR2: ffffffffff600400 CR3: 00000002cc584000 CR4: 00000000000007e0
[ 3663.685577] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685577] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685578] Process ceph-osd (pid: 30029, threadinfo ffff880563ae0000, task ffff880563adc500)
[ 3663.685579] Stack:
[ 3663.685579] 000000000000167f 0000000000000000 ffff880563ae1948 ffffffff8111a760
[ 3663.685583] ffff88063fffcc38 ffff88063fffcb98 000000000000256b 0000000000000000
[ 3663.685586] ffff88063fffcba0 0000000000000004 ffff880563ae1a18 0000000000001a00
[ 3663.685589] Call Trace:
[ 3663.685590] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685593] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685595] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685597] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685599] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685601] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685604] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685607] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685609] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685612] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685614] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685616] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685619] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685621] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685623] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685626] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685628] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685630] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685656] NMI backtrace for cpu 12
[ 3663.685656] CPU 12 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685687]
[ 3663.685688] Pid: 97037, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685690] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685693] RSP: 0018:ffff880092839898 EFLAGS: 00000016
[ 3663.685694] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d4
[ 3663.685694] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685695] RBP: ffff8800928398a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685696] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685697] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685698] FS: 00007fffcb183700(0000) GS:ffff880627cc0000(0000) knlGS:0000000000000000
[ 3663.685699] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685700] CR2: ffffffffff600400 CR3: 0000000411741000 CR4: 00000000000007e0
[ 3663.685701] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685702] Uhhuh. NMI received for unknown reason 3d on CPU 6.
[ 3663.685703] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685704] Do you have a strange power saving mode enabled?
[ 3663.685705] Process ceph-osd (pid: 97037, threadinfo ffff880092838000, task ffff8805d127dc00)
[ 3663.685706] Dazed and confused, but trying to continue
[ 3663.685707] Stack:
[ 3663.685707] 000000000000358a 0000000000000000 ffff880092839948 ffffffff8111a760
[ 3663.685711] ffff8806245c4500 ffff8800928398c8 000000000000000c 0000000000000000
[ 3663.685714] ffff88063fffcba0 000000003fffcb98 ffff880092839a18 0000000000003800
[ 3663.685717] Call Trace:
[ 3663.685717] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685720] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685722] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685724] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685727] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685729] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685731] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685734] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685736] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685738] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685740] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685743] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685745] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685747] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685749] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685752] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685754] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685756] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685781] NMI backtrace for cpu 14
[ 3663.685782] CPU 14 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685815]
[ 3663.685816] Pid: 97590, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685818] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685821] RSP: 0018:ffff8803f97a9898 EFLAGS: 00000002
[ 3663.685822] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000c6
[ 3663.685823] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685823] RBP: ffff8803f97a98a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685824] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685825] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685826] FS: 00007fffca577700(0000) GS:ffff880627d00000(0000) knlGS:0000000000000000
[ 3663.685827] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685828] CR2: ffffffffff600400 CR3: 00000002e0986000 CR4: 00000000000007e0
[ 3663.685828] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685829] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685830] Process ceph-osd (pid: 97590, threadinfo ffff8803f97a8000, task ffff88045554c500)
[ 3663.685831] Stack:
[ 3663.685831] 0000000000001cc3 0000000000000000 ffff8803f97a9948 ffffffff8111a760
[ 3663.685834] ffff8806245d8000 ffff8803f97a98c8 000000000000000e 0000000000000000
[ 3663.685838] ffff88063fffcba0 000000003fffcb98 ffff8803f97a9a18 0000000000002000
[ 3663.685841] Call Trace:
[ 3663.685842] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685844] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685847] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685849] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685851] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685853] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685856] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685859] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685861] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685864] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685866] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685868] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685871] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685873] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685875] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685878] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685880] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685882] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685907] NMI backtrace for cpu 2
[ 3663.685908] CPU 2 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685939]
[ 3663.685941] Pid: 100053, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685943] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685946] RSP: 0018:ffff8808da685898 EFLAGS: 00000012
[ 3663.685947] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d3
[ 3663.685948] RDX: 00000000000000c6 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685948] RBP: ffff8808da6858a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685949] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685950] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685951] FS: 00007fff92c01700(0000) GS:ffff880627c40000(0000) knlGS:0000000000000000
[ 3663.685952] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685953] CR2: ffffffffff600400 CR3: 00000002b8fbe000 CR4: 00000000000007e0
[ 3663.685954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685954] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685955] Process ceph-osd (pid: 100053, threadinfo ffff8808da684000, task ffff880a05a92e00)
[ 3663.685956] Stack:
[ 3663.685956] 000000000000119b 0000000000000000 ffff8808da685948 ffffffff8111a760
[ 3663.685959] ffff8806244d4500 ffff8808da6858c8 0000000000000002 0000000000000000
[ 3663.685962] ffff88063fffcba0 000000003fffcb98 ffff8808da685a18 0000000000001400
[ 3663.685966] Call Trace:
[ 3663.685966] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685969] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685971] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685973] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685976] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685978] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685981] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685983] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685986] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685988] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685990] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685992] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685995] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685997] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685999] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.686001] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.686028] NMI backtrace for cpu 11
[ 3663.686028] CPU 11 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.686062]
[ 3663.686064] Pid: 97756, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.686066] RIP: 0010:[<ffffffff81480ed5>] [<ffffffff81480ed5>] _raw_spin_lock_irqsave+0x45/0x60
[ 3663.686069] RSP: 0018:ffff880b11ecd898 EFLAGS: 00000006
[ 3663.686070] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d8
[ 3663.686070] RDX: 00000000000000c6 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.686071] RBP: ffff880b11ecd8a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.686072] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.686073] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.686074] FS: 00007ffff36df700(0000) GS:ffff880c3fca0000(0000) knlGS:0000000000000000
[ 3663.686075] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.686076] CR2: ffffffffff600400 CR3: 00000002cae55000 CR4: 00000000000007e0
[ 3663.686077] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.686078] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.686079] Process ceph-osd (pid: 97756, threadinfo ffff880b11ecc000, task ffff880a79a51700)
[ 3663.686079] Stack:
[ 3663.686080] 0000000000001b3e 0000000000000000 ffff880b11ecd948 ffffffff8111a760
[ 3663.686083] ffff8806245c2e00 ffff880b11ecd8c8 000000000000000b 0000000000000000
[ 3663.686086] ffff88063fffcba0 000000003fffcb98 ffff880b11ecda18 0000000000001e00
[ 3663.686089] Call Trace:
[ 3663.686090] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.686093] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.686095] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.686097] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.686099] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.686102] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.686105] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.686107] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.686110] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.686112] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.686114] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.686117] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.686119] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.686121] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.686124] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.686126] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.686129] Code: 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 0f b6 13 <38> d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66 2e 0f 1f
[ 3663.686155] NMI backtrace for cpu 20
[ 3663.686155] CPU 20 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.686189]
[ 3663.686190] Pid: 97755, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.686193] RIP: 0010:[<ffffffff81480ed5>] [<ffffffff81480ed5>] _raw_spin_lock_irqsave+0x45/0x60
[ 3663.686196] RSP: 0018:ffff88066d5af898 EFLAGS: 00000002
[ 3663.686196] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000cd
[ 3663.686197] RDX: 00000000000000c6 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.686198] RBP: ffff88066d5af8a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.686199] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.686199] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.686200] Uhhuh. NMI received for unknown reason 2d on CPU 11.
[ 3663.686201] FS: 00007ffff3ee0700(0000) GS:ffff880c3fd00000(0000) knlGS:0000000000000000
[ 3663.686202] Do you have a strange power saving mode enabled?
[ 3663.686203] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.686203] Dazed and confused, but trying to continue
[ 3663.686204] CR2: ffffffffff600400 CR3: 00000002cae55000 CR4: 00000000000007e0
[ 3663.686205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.686206] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.686207] Process ceph-osd (pid: 97755, threadinfo ffff88066d5ae000, task ffff880a79a52e00)
[ 3663.686207] Stack:
[ 3663.686208] 0000000000001cbf 0000000000000000 ffff88066d5af948 ffffffff8111a760
[ 3663.686211] ffff8806245e9700 ffff88066d5af8c8 0000000000000014 0000000000000000
[ 3663.686214] ffff88063fffcba0 000000003fffcb98 ffff88066d5afa18 0000000000002000
[ 3663.686217] Call Trace:
[ 3663.686218] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.686221] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.686223] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.686225] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.686228] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.686230] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.686233] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.686236] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.686238] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.686240] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.686243] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.686245] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.686247] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.686250] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.686252] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.686254] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.686257] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.686259] Code: 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 0f b6 13 <38> d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66 2e 0f 1f
[ 3663.686284] NMI backtrace for cpu 13
[ 3663.686285] CPU 13 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel[ 3663.686300] Uhhuh. NMI received for unknown reason 2d on CPU 12.
[ 3663.686300] ghash_clmulni_intel[ 3663.686301] Do you have a strange power saving mode enabled?
[ 3663.686301] aesni_intel[ 3663.686302] Dazed and confused, but trying to continue
[ 3663.686302] cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.686318]
[ 3663.686319] Pid: 98427, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.686321] RIP: 0010:[<ffffffff81480ed0>] [<ffffffff81480ed0>] _raw_spin_lock_irqsave+0x40/0x60
[ 3663.686324] RSP: 0018:ffff880356409898 EFLAGS: 00000016
[ 3663.686324] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d2
[ 3663.686325] RDX: 00000000000000c6 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.686326] RBP: ffff8803564098a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.686327] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.686327] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.686328] FS: 00007fffc794b700(0000) GS:ffff880627ce0000(0000) knlGS:0000000000000000
[ 3663.686329] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.686330] CR2: ffffffffff600400 CR3: 00000002bc512000 CR4: 00000000000007e0
[ 3663.686331] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.686332] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.686333] Process ceph-osd (pid: 98427, threadinfo ffff880356408000, task ffff880027de5c00)
[ 3663.686333] Stack:
[ 3663.686333] 0000000000001061 0000000000000000 ffff880356409948 ffffffff8111a760
[ 3663.686337] ffff8806245c5c00 ffff8803564098c8 000000000000000d 0000000000000000
[ 3663.686340] ffff88063fffcba0 000000003fffcb98 ffff880356409a18 0000000000001400
[ 3663.686343] Call Trace:
[ 3663.686343] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.686346] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.686348] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.686350] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.686352] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.686354] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.686357] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.686360] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.686362] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.686364] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.686366] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.686368] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.686370] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.686373] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.686375] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.686377] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.686379] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.686381] Code: 6a c5 ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 <f3> 90 0f b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66


Please let me know what I can do next to help sort this out.

Thanks -- Jim

2012-08-12 20:23:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

On Fri, Aug 10, 2012 at 11:20:07AM -0600, Jim Schutt wrote:
> On 08/10/2012 05:02 AM, Mel Gorman wrote:
> >On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
>
> >>>
> >>>Ok, this is an untested hack and I expect it would drop allocation success
> >>>rates again under load (but not as much). Can you test again and see what
> >>>effect, if any, it has please?
> >>>
> >>>---8<---
> >>>mm: compaction: back out if contended
> >>>
> >>>---
> >>
> >><snip>
> >>
> >>Initial testing with this patch looks very good from
> >>my perspective; CPU utilization stays reasonable,
> >>write-out rate stays high, no signs of stress.
> >>Here's an example after ~10 minutes under my test load:
> >>
>
> Hmmm, I wonder if I should have tested this patch longer,
> in view of the trouble I ran into testing the new patch?
> See below.
>

The two patches are quite different in what they do. I think it's
unlikely they would share a common bug.

> > <SNIP>
> >---8<---
> >mm: compaction: Abort async compaction if locks are contended or taking too long
>
>
> Hmmm, while testing this patch, a couple of my servers got
> stuck after ~30 minutes or so, like this:
>
> [ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
> [ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2515.884447] ceph-osd D 0000000000000000 0 30375 1 0x00000000
> [ 2515.891531] ffff8802e1a99e38 0000000000000082 ffff88056b38e298 ffff8802e1a99fd8
> [ 2515.899013] ffff8802e1a98010 ffff8802e1a98000 ffff8802e1a98000 ffff8802e1a98000
> [ 2515.906482] ffff8802e1a99fd8 ffff8802e1a98000 ffff880697d31700 ffff8802e1a84500
> [ 2515.913968] Call Trace:
> [ 2515.916433] [<ffffffff8147fded>] schedule+0x5d/0x60
> [ 2515.921417] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
> [ 2515.927938] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
> [ 2515.934195] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
> [ 2515.940934] [<ffffffff8147edc5>] ? down_write+0x45/0x50
> [ 2515.946244] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
> [ 2515.951640] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
> <SNIP>
>
> I tried to capture a perf trace while this was going on, but it
> never completed. "ps" on this system reports lots of kernel threads
> and some user-space stuff, but hangs part way through - no ceph
> executables in the output, oddly.
>

ps is probably locking up because it's trying to access a proc file for
a process that is not releasing the mmap_sem.

> I can retest your earlier patch for a longer period, to
> see if it does the same thing, or I can do some other thing
> if you tell me what it is.
>
> Also, FWIW I sorted a little through SysRq-T output from such
> a system; these bits looked interesting:
>
> [ 3663.685097] INFO: rcu_sched self-detected stall on CPU { 17} (t=60000 jiffies)
> [ 3663.685099] sending NMI to all CPUs:
> [ 3663.685101] NMI backtrace for cpu 0
> [ 3663.685102] CPU 0 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
> [ 3663.685138]
> [ 3663.685140] Pid: 100027, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
> [ 3663.685142] RIP: 0010:[<ffffffff81480ed5>] [<ffffffff81480ed5>] _raw_spin_lock_irqsave+0x45/0x60
> [ 3663.685148] RSP: 0018:ffff880a08191898 EFLAGS: 00000012
> [ 3663.685149] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000c5
> [ 3663.685149] RDX: 00000000000000bf RSI: 000000000000015a RDI: ffff88063fffcb00
> [ 3663.685150] RBP: ffff880a081918a8 R08: 0000000000000000 R09: 0000000000000000
> [ 3663.685151] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
> [ 3663.685152] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
> [ 3663.685153] FS: 00007fff90ae0700(0000) GS:ffff880627c00000(0000) knlGS:0000000000000000
> [ 3663.685154] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 3663.685155] CR2: ffffffffff600400 CR3: 00000002b8fbe000 CR4: 00000000000007f0
> [ 3663.685156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3663.685157] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 3663.685158] Process ceph-osd (pid: 100027, threadinfo ffff880a08190000, task ffff880a9a29ae00)
> [ 3663.685158] Stack:
> [ 3663.685159] 000000000000130a 0000000000000000 ffff880a08191948 ffffffff8111a760
> [ 3663.685162] ffffffff81a13420 0000000000000009 ffffea000004c240 0000000000000000
> [ 3663.685165] ffff88063fffcba0 000000003fffcb98 ffff880a08191a18 0000000000001600
> [ 3663.685168] Call Trace:
> [ 3663.685169] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
> [ 3663.685173] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
> [ 3663.685175] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
> [ 3663.685178] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
> [ 3663.685180] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
> [ 3663.685182] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
> [ 3663.685187] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
> [ 3663.685190] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
> [ 3663.685192] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
> [ 3663.685195] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
> [ 3663.685199] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
> [ 3663.685202] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
> [ 3663.685205] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
> [ 3663.685208] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
> [ 3663.685211] [<ffffffff8148154f>] page_fault+0x1f/0x30

I went through the patch again but only found the following, which is a
weak candidate. Still, can you retest with the following patch on top and
CONFIG_PROVE_LOCKING set, please?

---8<---
diff --git a/mm/compaction.c b/mm/compaction.c
index 1827d9a..d4a51c6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -64,7 +64,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
{
if (need_resched() || spin_is_contended(lock)) {
if (locked) {
- spin_unlock_irq(lock);
+ spin_unlock_irqrestore(lock, *flags);
locked = false;
}

@@ -276,8 +276,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
list_for_each_entry(page, &cc->migratepages, lru)
count[!!page_is_file_cache(page)]++;

- __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+ mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
+ mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
}

/* Similar to reclaim, but different enough that they don't share logic */
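
For what it's worth, the reasoning behind the two hunks: spin_lock_irqsave()
records the caller's interrupt state and spin_unlock_irqrestore() puts it
back, whereas spin_unlock_irq() re-enables interrupts unconditionally even if
they were already disabled when the lock was taken. The second hunk switches
to the interrupt-safe vmstat helpers, presumably because with the contention
backoff acct_isolated() can now be reached with the lock already dropped and
interrupts enabled, which __mod_zone_page_state() does not tolerate. The toy
model below (stand-alone C with invented names, not kernel code) shows the
first problem in isolation:

/*
 * Toy model of the irqsave/irqrestore contract. Build with: gcc toy.c
 */
#include <stdbool.h>
#include <stdio.h>

static bool irqs_enabled = false;	/* caller already disabled interrupts */

static unsigned long toy_lock_irqsave(void)
{
	unsigned long flags = irqs_enabled;	/* remember previous state */
	irqs_enabled = false;			/* disable for the critical section */
	return flags;
}

static void toy_unlock_irqrestore(unsigned long flags)
{
	irqs_enabled = flags;			/* restore whatever the caller had */
}

static void toy_unlock_irq(void)
{
	irqs_enabled = true;			/* unconditionally re-enable: wrong here */
}

int main(void)
{
	unsigned long flags = toy_lock_irqsave();

	toy_unlock_irq();			/* the bug: interrupts now on behind the caller's back */
	printf("after unlock_irq:        irqs_enabled=%d (expected 0)\n", irqs_enabled);

	irqs_enabled = false;
	flags = toy_lock_irqsave();
	toy_unlock_irqrestore(flags);		/* the fix: caller's state preserved */
	printf("after unlock_irqrestore: irqs_enabled=%d (expected 0)\n", irqs_enabled);
	return 0;
}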

2012-08-13 20:36:31

by Jim Schutt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

Hi Mel,

On 08/12/2012 02:22 PM, Mel Gorman wrote:

>
> I went through the patch again but only found the following which is a
> weak candidate. Still, can you retest with the following patch on top and
> CONFIG_PROVE_LOCKING set please?
>

I've gotten in several hours of testing on this patch with
no issues at all, and no output from CONFIG_PROVE_LOCKING
(I'm assuming it would show up on a serial console). So,
it seems to me this patch has done the trick.

CPU utilization is staying under control, and write-out rate
is good.

You can add my Tested-by: as you see fit. If you work
up any refinements and would like me to test, please
let me know.

Thanks -- Jim

2012-08-14 09:24:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

On Mon, Aug 13, 2012 at 02:35:46PM -0600, Jim Schutt wrote:
> Hi Mel,
>
> On 08/12/2012 02:22 PM, Mel Gorman wrote:
>
> >
> >I went through the patch again but only found the following which is a
> >weak candidate. Still, can you retest with the following patch on top and
> >CONFIG_PROVE_LOCKING set please?
> >
>
> I've gotten in several hours of testing on this patch with
> no issues at all, and no output from CONFIG_PROVE_LOCKING
> (I'm assuming it would show up on a serial console). So,
> it seems to me this patch has done the trick.
>

Super.

> CPU utilization is staying under control, and write-out rate
> is good.
>

Even better.

> You can add my Tested-by: as you see fit. If you work
> up any refinements and would like me to test, please
> let me know.
>

I'll be adding your Tested-by and I'll keep you cc'd on the series. It'll
look a little different because I expect to adjust it slightly to match
Andrew's tree, but there should be no major surprises, and my expectation
is that testing an -rc kernel after it gets merged is all that will be
necessary. I'm planning to backport this to -stable, but it remains to be
seen whether I can convince the relevant maintainers that it should be
merged.

Thanks.

--
Mel Gorman
SUSE Labs