2011-05-30 13:13:08

by Mel Gorman

Subject: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Asynchronous compaction is used when promoting to huge pages. This is
all very nice but if there are a number of processes in compacting
memory, a large number of pages can be isolated. An "asynchronous"
process can stall for long periods of time as a result with a user
reporting that firefox can stall for 10s of seconds. This patch aborts
asynchronous compaction if too many pages are isolated as it's better to
fail a hugepage promotion than stall a process.

If accepted, this should also be considered for 2.6.39-stable. It should
also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
writeback] would be applied to 2.6.38 before consideration.

Reported-and-Tested-by: Ury Stankevich <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/compaction.c | 32 ++++++++++++++++++++++++++------
1 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 021a296..331a2ee 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -240,11 +240,20 @@ static bool too_many_isolated(struct zone *zone)
return isolated > (inactive + active) / 2;
}

+/* possible outcome of isolate_migratepages */
+typedef enum {
+ ISOLATE_ABORT, /* Abort compaction now */
+ ISOLATE_NONE, /* No pages isolated, continue scanning */
+ ISOLATE_SUCCESS, /* Pages isolated, migrate */
+} isolate_migrate_t;
+
/*
* Isolate all pages that can be migrated from the block pointed to by
* the migrate scanner within compact_control.
+ *
+ * Returns false if compaction should abort at this point due to congestion.
*/
-static unsigned long isolate_migratepages(struct zone *zone,
+static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
{
unsigned long low_pfn, end_pfn;
@@ -261,7 +270,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
/* Do not cross the free scanner or scan within a memory hole */
if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
cc->migrate_pfn = end_pfn;
- return 0;
+ return ISOLATE_NONE;
}

/*
@@ -270,10 +279,14 @@ static unsigned long isolate_migratepages(struct zone *zone,
* delay for some time until fewer pages are isolated
*/
while (unlikely(too_many_isolated(zone))) {
+ /* async migration should just abort */
+ if (!cc->sync)
+ return ISOLATE_ABORT;
+
congestion_wait(BLK_RW_ASYNC, HZ/10);

if (fatal_signal_pending(current))
- return 0;
+ return ISOLATE_ABORT;
}

/* Time to isolate some pages for migration */
@@ -358,7 +371,7 @@ static unsigned long isolate_migratepages(struct zone *zone,

trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);

- return cc->nr_migratepages;
+ return ISOLATE_SUCCESS;
}

/*
@@ -522,9 +535,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
unsigned long nr_migrate, nr_remaining;
int err;

- if (!isolate_migratepages(zone, cc))
+ switch (isolate_migratepages(zone, cc)) {
+ case ISOLATE_ABORT:
+ goto out;
+ case ISOLATE_NONE:
continue;
-
+ case ISOLATE_SUCCESS:
+ ;
+ }
+
nr_migrate = cc->nr_migratepages;
err = migrate_pages(&cc->migratepages, compaction_alloc,
(unsigned long)cc, false,
@@ -547,6 +566,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

}

+out:
/* Release free pages and check accounting */
cc->nr_freepages -= release_freepages(&cc->freepages);
VM_BUG_ON(cc->nr_freepages != 0);


2011-05-30 14:31:41

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Mel and everyone,

On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.

Is this supposed to fix the stall with khugepaged in D state and other
processes in D state?

zoneinfo showed a nr_isolated_file = -1, I don't think that meant
compaction had 4g pages isolated really considering it moves from
-1,0, 1. So I'm unsure if this fix could be right if the problem is
the hang with khugepaged in D state reported, so far that looked more
like a bug with PREEMPT in the vmstat accounting of nr_isolated_file
that trips in too_many_isolated of both vmscan.c and compaction.c with
PREEMPT=y. Or are you fixing a different problem?

Or how do you explain this -1 value out of nr_isolated_file? Clearly
when that value goes to -1, compaction.c:too_many_isolated will hang,
I think we should fix the -1 value before worrying about the rest...
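[For context, not part of the original mail: the compaction check being discussed looked roughly like the following in 2.6.38/39 (the return statement is visible in the patch context above); the body here is a paraphrase. With the counter underflowed, a !CONFIG_SMP kernel reads the wrapped value back, so the check never becomes false.]

static bool too_many_isolated(struct zone *zone)
{
        unsigned long active, inactive, isolated;

        inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
                                zone_page_state(zone, NR_INACTIVE_ANON);
        active = zone_page_state(zone, NR_ACTIVE_FILE) +
                                zone_page_state(zone, NR_ACTIVE_ANON);
        isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
                                zone_page_state(zone, NR_ISOLATED_ANON);

        /* An underflowed isolated counter reads back as ~4294967295 on
         * !CONFIG_SMP, so this stays true and the caller keeps looping
         * in congestion_wait(). */
        return isolated > (inactive + active) / 2;
}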

grep nr_isolated_file zoneinfo-khugepaged
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0

2011-05-30 14:51:02

by Greg KH

Subject: Re: [stable] [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.
>
> Reported-and-Tested-by: Ury Stankevich <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/compaction.c | 32 ++++++++++++++++++++++++++------
> 1 files changed, 26 insertions(+), 6 deletions(-)


<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read Documentation/stable_kernel_rules.txt
for how to do this properly.

</formletter>

2011-05-30 15:37:56

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 04:31:09PM +0200, Andrea Arcangeli wrote:
> Hi Mel and everyone,
>
> On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> > Asynchronous compaction is used when promoting to huge pages. This is
> > all very nice but if there are a number of processes in compacting
> > memory, a large number of pages can be isolated. An "asynchronous"
> > process can stall for long periods of time as a result with a user
> > reporting that firefox can stall for 10s of seconds. This patch aborts
> > asynchronous compaction if too many pages are isolated as it's better to
> > fail a hugepage promotion than stall a process.
> >
> > If accepted, this should also be considered for 2.6.39-stable. It should
> > also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> > compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> > writeback] would be applied to 2.6.38 before consideration.
>
> Is this supposed to fix the stall with khugepaged in D state and other
> processes in D state?
>

Other processes. khugepaged might be getting stuck in the same loop but
I do not have a specific case in mind.

> zoneinfo showed a nr_isolated_file = -1, I don't think that meant
> compaction had 4g pages isolated really considering it moves from
> -1,0, 1. So I'm unsure if this fix could be right if the problem is
> the hang with khugepaged in D state reported, so far that looked more
> like a bug with PREEMPT in the vmstat accounting of nr_isolated_file
> that trips in too_many_isolated of both vmscan.c and compaction.c with
> PREEMPT=y. Or are you fixing a different problem?
>

I'm not familiar with this problem. I either missed it or forgot about
it entirely. I was considering only Ury's report whereby firefox was
getting stalled for 10s of seconds in congestion_wait. It's possible the
root cause was isolated counters being broken but I didn't pick up on
it.

> Or how do you explain this -1 value out of nr_isolated_file? Clearly
> when that value goes to -1, compaction.c:too_many_isolated will hang,
> I think we should fix the -1 value before worrying about the rest...
>
> grep nr_isolated_file zoneinfo-khugepaged
> nr_isolated_file 1
> nr_isolated_file 4294967295

Can you point me at the thread that this file appears on and what the
conditions were? If vmstat is going to -1, it is indeed a problem
because it implies an imbalance in increments and decrements to the
isolated counters. Even with that fixed though, this patch still makes
sense: why would an asynchronous user of compaction stall on
congestion_wait?

--
Mel Gorman
SUSE Labs

2011-05-30 16:14:24

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.
>
> Reported-and-Tested-by: Ury Stankevich <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

I have a nitpick below.
Otherwise, looks good to me.

> ---
> mm/compaction.c | 32 ++++++++++++++++++++++++++------
> 1 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 021a296..331a2ee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -240,11 +240,20 @@ static bool too_many_isolated(struct zone *zone)
> return isolated > (inactive + active) / 2;
> }
>
> +/* possible outcome of isolate_migratepages */
> +typedef enum {
> + ISOLATE_ABORT, /* Abort compaction now */
> + ISOLATE_NONE, /* No pages isolated, continue scanning */
> + ISOLATE_SUCCESS, /* Pages isolated, migrate */
> +} isolate_migrate_t;
> +
> /*
> * Isolate all pages that can be migrated from the block pointed to by
> * the migrate scanner within compact_control.
> + *
> + * Returns false if compaction should abort at this point due to congestion.

false? I think it would be better to use explicit word, ISOLATE_ABORT.

--
Kind regards
Minchan Kim

2011-05-30 16:55:52

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 04:37:49PM +0100, Mel Gorman wrote:
> > Or how do you explain this -1 value out of nr_isolated_file? Clearly
> > when that value goes to -1, compaction.c:too_many_isolated will hang,
> > I think we should fix the -1 value before worrying about the rest...
> >
> > grep nr_isolated_file zoneinfo-khugepaged
> > nr_isolated_file 1
> > nr_isolated_file 4294967295
>
> Can you point me at the thread that this file appears on and what the
> conditions were? If vmstat is going to -1, it is indeed a problem
> because it implies an imbalance in increments and decrements to the
> isolated counters.

Even with drift issues, -1 there should be "impossible". Assuming this
is a zoneinfo file, that figure is based on global_page_state() which
looks like

static inline unsigned long global_page_state(enum zone_stat_item item)
{
        long x = atomic_long_read(&vm_stat[item]);
#ifdef CONFIG_SMP
        if (x < 0)
                x = 0;
#endif
        return x;
}

So even if isolated counts were going negative for short periods of
time, the returned value should be 0. As this is an inline returning
unsigned long, and callers are using unsigned long, is there any
possibility the "if (x < 0)" is being optimised out? If you aware
of users reporting this problem (like the users in thread "iotop:
khugepaged at 99.99% (2.6.38.3)"), do you know if they had a particular
compiler in common?
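[For context, not from the thread: with !CONFIG_SMP the #ifdef block above is removed by the preprocessor, so the function effectively reduces to the sketch below and a negative vm_stat entry is returned unclamped.]

static inline unsigned long global_page_state(enum zone_stat_item item)
{
        /* No CONFIG_SMP: the "if (x < 0) x = 0;" clamp is compiled out */
        long x = atomic_long_read(&vm_stat[item]);
        return x;       /* -1 comes back as 4294967295 on 32-bit */
}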

--
Mel Gorman
SUSE Labs

2011-05-30 17:54:11

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 05:55:46PM +0100, Mel Gorman wrote:
> Even with drift issues, -1 there should be "impossible". Assuming this
> is a zoneinfo file, that figure is based on global_page_state() which
> looks like

The two cases reproducing this long hang in D state had SMP=n and
PREEMPT=y. Clearly not a common config these days. Also it didn't seem
apparent that any task was running in a code path that kept pages
isolated.

> unsigned long, and callers are using unsigned long, is there any
> possibility the "if (x < 0)" is being optimised out? If you aware

It was eliminated by cpp.

> of users reporting this problem (like the users in thread "iotop:
> khugepaged at 99.99% (2.6.38.3)"), do you know if they had a particular
> compiler in common?

I had no reason to worry about the compiler yet but that's always a good
idea to keep in mind. The thread where the bug is reported is the
"iotop" one you mentioned, and there's a tarball attached to one of
the last emails of the thread with the debug data I grepped. It was
/proc/zoneinfo file yes. That's the file I asked when I noticed
something had to be wrong with too_many_isolated and I expected either
nr_isolated or nr_inactive going wrong, it turned out it was
nr_isolated (apparently, I don't have full picture on the problem
yet). I added you in CC to a few emails but you weren't in all
replies.

The debug data you can find on lkml in this email: Message-Id:
<[email protected]>.

The other relevant sysrq+t here http://pastebin.com/raw.php?i=VG28YRbi

better save the latter (I did) as I'm worried it has a timeout on it.

Your patch was for reports with CONFIG_SMP=y? I'd prefer to clear out
this error before improving the too_many_isolated, in fact while
reviewing this code I was not impressed by too_many_isolated. For
vmscan.c, if there's a huge nr_active* list and a tiny nr_inactive
(like after a truncate of filebacked pages or munmap of anon memory)
there's no reason to stall, it's better to go ahead and let it refile
more active pages. The too_many_isolated in compaction.c looks a whole
lot better than the vmscan.c one as that takes into account the active
pages too... But I refrained to make any change in this area as I
don't think the bug is in too_many_isolated itself.
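[For comparison, a rough paraphrase of the vmscan.c check from the same era, not quoted in the thread: it only weighs the inactive list of the given type, which is the behaviour criticised above, while the compaction version also counts the active lists.]

/* Approximation of mm/vmscan.c:too_many_isolated() circa 2.6.38 */
static int too_many_isolated(struct zone *zone, int file,
                        struct scan_control *sc)
{
        unsigned long inactive, isolated;

        if (current_is_kswapd())
                return 0;

        if (!scanning_global_lru(sc))
                return 0;

        if (file) {
                inactive = zone_page_state(zone, NR_INACTIVE_FILE);
                isolated = zone_page_state(zone, NR_ISOLATED_FILE);
        } else {
                inactive = zone_page_state(zone, NR_INACTIVE_ANON);
                isolated = zone_page_state(zone, NR_ISOLATED_ANON);
        }

        /* Ignores the active lists entirely */
        return isolated > inactive;
}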

I noticed the count[] array is unsigned int, but it looks ok
(especially for 32bit ;) because the isolation is limited.

Both bugs were reported on 32bit x86 UP builds with PREEMPT=y. The
stat accounting seems to use atomics on UP so irqs on/off or
PREEMPT=y/n shouldn't matter if the increment is 1 insn long (plus no
irq code should ever mess with nr_isolated)... If it wasn't atomic and
irqs or preempt aren't disabled it could be preempt. To avoid
confusion: it's not proven that PREEMPT is related, it may be an
accident both .config had it on. I'm also unsure why it moves from
-1,0,1 I wouldn't expect a single page to be isolated like -1 pages to
be isolated, it just looks weird...
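[For context, a sketch of what the UP accounting path boils down to, paraphrased rather than quoted: with !CONFIG_SMP the *_zone_page_state() helpers end up in roughly the following, and on UP x86 each atomic_long_add() compiles to a single add without a lock prefix, which is the "1 insn long" point above.]

static inline void zone_page_state_add(long x, struct zone *zone,
                                 enum zone_stat_item item)
{
        /* Plain per-zone and global adds; nothing here can wrap the
         * counter on its own, so an imbalance of inc/dec callers is
         * the only way to reach -1. */
        atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_stat[item]);
}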

2011-05-31 04:55:31

by Kamezawa Hiroyuki

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, 30 May 2011 14:13:00 +0100
Mel Gorman <[email protected]> wrote:

> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.
>
> Reported-and-Tested-by: Ury Stankevich <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>



Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

BTW, I'm surprised to see both of vmscan.c and compaction.c has too_many_isolated()..
in different logic ;)

BTW, compaction ignores UNEVICTABLE LRU ?

Thanks,
-Kame


> ---
> mm/compaction.c | 32 ++++++++++++++++++++++++++------
> 1 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 021a296..331a2ee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -240,11 +240,20 @@ static bool too_many_isolated(struct zone *zone)
> return isolated > (inactive + active) / 2;
> }
>
> +/* possible outcome of isolate_migratepages */
> +typedef enum {
> + ISOLATE_ABORT, /* Abort compaction now */
> + ISOLATE_NONE, /* No pages isolated, continue scanning */
> + ISOLATE_SUCCESS, /* Pages isolated, migrate */
> +} isolate_migrate_t;
> +
> /*
> * Isolate all pages that can be migrated from the block pointed to by
> * the migrate scanner within compact_control.
> + *
> + * Returns false if compaction should abort at this point due to congestion.
> */
> -static unsigned long isolate_migratepages(struct zone *zone,
> +static isolate_migrate_t isolate_migratepages(struct zone *zone,
> struct compact_control *cc)
> {
> unsigned long low_pfn, end_pfn;
> @@ -261,7 +270,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
> /* Do not cross the free scanner or scan within a memory hole */
> if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> cc->migrate_pfn = end_pfn;
> - return 0;
> + return ISOLATE_NONE;
> }
>
> /*
> @@ -270,10 +279,14 @@ static unsigned long isolate_migratepages(struct zone *zone,
> * delay for some time until fewer pages are isolated
> */
> while (unlikely(too_many_isolated(zone))) {
> + /* async migration should just abort */
> + if (!cc->sync)
> + return ISOLATE_ABORT;
> +
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> if (fatal_signal_pending(current))
> - return 0;
> + return ISOLATE_ABORT;
> }
>
> /* Time to isolate some pages for migration */
> @@ -358,7 +371,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
>
> trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>
> - return cc->nr_migratepages;
> + return ISOLATE_SUCCESS;
> }
>
> /*
> @@ -522,9 +535,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> unsigned long nr_migrate, nr_remaining;
> int err;
>
> - if (!isolate_migratepages(zone, cc))
> + switch (isolate_migratepages(zone, cc)) {
> + case ISOLATE_ABORT:
> + goto out;
> + case ISOLATE_NONE:
> continue;
> -
> + case ISOLATE_SUCCESS:
> + ;
> + }
> +
> nr_migrate = cc->nr_migratepages;
> err = migrate_pages(&cc->migratepages, compaction_alloc,
> (unsigned long)cc, false,
> @@ -547,6 +566,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>
> }
>
> +out:
> /* Release free pages and check accounting */
> cc->nr_freepages -= release_freepages(&cc->freepages);
> VM_BUG_ON(cc->nr_freepages != 0);
>

2011-05-31 05:38:11

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Kame,

On Tue, May 31, 2011 at 01:48:35PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 30 May 2011 14:13:00 +0100
> Mel Gorman <[email protected]> wrote:
>
> > Asynchronous compaction is used when promoting to huge pages. This is
> > all very nice but if there are a number of processes in compacting
> > memory, a large number of pages can be isolated. An "asynchronous"
> > process can stall for long periods of time as a result with a user
> > reporting that firefox can stall for 10s of seconds. This patch aborts
> > asynchronous compaction if too many pages are isolated as it's better to
> > fail a hugepage promotion than stall a process.
> >
> > If accepted, this should also be considered for 2.6.39-stable. It should
> > also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> > compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> > writeback] would be applied to 2.6.38 before consideration.
> >
> > Reported-and-Tested-by: Ury Stankevich <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
>
>
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>
> BTW, I'm surprised to see both of vmscan.c and compaction.c has too_many_isolated()..
> in different logic ;)
>
> BTW, compaction ignores UNEVICTABLE LRU ?

Good point.
Yes, compaction doesn't work with the unevictable LRU at the moment, but I think
there is no reason it couldn't handle unevictable pages.
If we don't support the unevictable LRU, it could be a problem for workloads
with lots of mlocked pages.
It would be a good enhancement point for compaction.

>
> Thanks,
> -Kame
--
Kind regards
Minchan Kim

2011-05-31 07:14:49

by KOSAKI Motohiro

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

(2011/05/30 22:13), Mel Gorman wrote:
> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.
>
> Reported-and-Tested-by: Ury Stankevich <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>


2011-05-31 08:32:23

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 01:14:15AM +0900, Minchan Kim wrote:
> On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> > Asynchronous compaction is used when promoting to huge pages. This is
> > all very nice but if there are a number of processes in compacting
> > memory, a large number of pages can be isolated. An "asynchronous"
> > process can stall for long periods of time as a result with a user
> > reporting that firefox can stall for 10s of seconds. This patch aborts
> > asynchronous compaction if too many pages are isolated as it's better to
> > fail a hugepage promotion than stall a process.
> >
> > If accepted, this should also be considered for 2.6.39-stable. It should
> > also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> > compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> > writeback] would be applied to 2.6.38 before consideration.
> >
> > Reported-and-Tested-by: Ury Stankevich <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>

Thanks

> I have a nitpick below.
> Otherwise, looks good to me.
>
> > ---
> > mm/compaction.c | 32 ++++++++++++++++++++++++++------
> > 1 files changed, 26 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 021a296..331a2ee 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -240,11 +240,20 @@ static bool too_many_isolated(struct zone *zone)
> > return isolated > (inactive + active) / 2;
> > }
> >
> > +/* possible outcome of isolate_migratepages */
> > +typedef enum {
> > + ISOLATE_ABORT, /* Abort compaction now */
> > + ISOLATE_NONE, /* No pages isolated, continue scanning */
> > + ISOLATE_SUCCESS, /* Pages isolated, migrate */
> > +} isolate_migrate_t;
> > +
> > /*
> > * Isolate all pages that can be migrated from the block pointed to by
> > * the migrate scanner within compact_control.
> > + *
> > + * Returns false if compaction should abort at this point due to congestion.
>
> false? I think it would be better to use explicit word, ISOLATE_ABORT.
>

Oops, thanks for pointing that out. I'll post a V2 once it has been
figured out why NR_ISOLATE_* is getting screwed up on !CONFIG_SMP.

--
Mel Gorman
SUSE Labs

2011-05-31 12:16:29

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Andrea,

On Mon, May 30, 2011 at 07:53:34PM +0200, Andrea Arcangeli wrote:
> On Mon, May 30, 2011 at 05:55:46PM +0100, Mel Gorman wrote:
> > Even with drift issues, -1 there should be "impossible". Assuming this
> > is a zoneinfo file, that figure is based on global_page_state() which
> > looks like
>
> The two cases reproducing this long hang in D state, had from SMP=n
> PREEMPT=y. Clearly not common config these days. Also it didn't seem
> apparent that any task was running in a code path that kept pages
> isolated.
>
> > unsigned long, and callers are using unsigned long, is there any
> > possibility the "if (x < 0)" is being optimised out? If you aware
>
> It was eliminated by cpp.
>
> > of users reporting this problem (like the users in thread "iotop:
> > khugepaged at 99.99% (2.6.38.3)"), do you know if they had a particular
> > compiler in common?
>
> I had no reason to worry about the compiler yet but that's always good
> idea to keep in mind. The thread were the bug is reported is the
> "iotop" one you mentioned, and there's a tarball attached to one of
> the last emails of the thread with the debug data I grepped. It was
> /proc/zoneinfo file yes. That's the file I asked when I noticed
> something had to be wrong with too_many_isolated and I expected either
> nr_isolated or nr_inactive going wrong, it turned out it was
> nr_isolated (apparently, I don't have full picture on the problem
> yet). I added you in CC to a few emails but you weren't in all
> replies.
>
> The debug data you can find on lkml in this email: Message-Id:
> <[email protected]>.
>
> The other relevant sysrq+t here http://pastebin.com/raw.php?i=VG28YRbi
>
> better save the latter (I did) as I'm worried it has a timeout on it.
>
> Your patch was for reports with CONFIG_SMP=y? I'd prefer to clear out
> this error before improving the too_many_isolated, in fact while
> reviewing this code I was not impressed by too_many_isolated. For
> vmscan.c if there's an huge nr_active* list and a tiny nr_inactive
> (like after a truncate of filebacked pages or munmap of anon memory)
> there's no reason to stall, it's better to go ahead and let it refile
> more active pages. The too_many_isolated in compaction.c looks a whole
> lot better than the vmscan.c one as that takes into account the active
> pages too... But I refrained to make any change in this area as I
> don't think the bug is in too_many_isolated itself.
>
> I noticed the count[] array is unsigned int, but it looks ok
> (especially for 32bit ;) because the isolation is limited.
>
> Both bugs were reported on 32bit x86 UP builds with PREEMPT=y. The
> stat accounting seem to use atomics on UP so irqs on off or
> PREEMPT=y/n shouldn't matter if the increment is 1 insn long (plus no
> irq code should ever mess with nr_isolated)... If it wasn't atomic and
> irqs or preempt aren't disabled it could be preempt. To avoid
> confusion: it's not proven that PREEMPT is related, it may be an
> accident both .config had it on. I'm also unsure why it moves from
> -1,0,1 I wouldn't expect a single page to be isolated like -1 pages to
> be isolated, it just looks weird...

I am not sure this is related to the problem you have seen.
If he used hwpoison via madvise, it is possible.
Anyway, we can see a negative value from a count mismatch in a UP build.
Let's fix it.

From 1d3ebce2e8aa79dcc912da16b7a8d0611b6f9f1a Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 31 May 2011 21:11:58 +0900
Subject: [PATCH] Fix page isolated count mismatch

If migration fails, normally we call putback_lru_pages, which
decreases NR_ISOLATE_[ANON|FILE].
That means we should increase NR_ISOLATE_[ANON|FILE] before calling
putback_lru_pages, but soft_offline_page doesn't do it.

That can leave NR_ISOLATE_[ANON|FILE] with a negative value, and in a UP build
zone_page_state will then report a huge number of isolated pages, so the
too_many_isolated functions are deceived completely. In the end, some process
gets stuck in D state expecting the while loop around congestion_wait to
terminate. But it's a never ending story.

If it is right, it would be -stable stuff.

Cc: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/memory-failure.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5c8f7e0..eac0ba5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -52,6 +52,7 @@
#include <linux/swapops.h>
#include <linux/hugetlb.h>
#include <linux/memory_hotplug.h>
+#include <linux/mm_inline.h>
#include "internal.h"

int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -1468,7 +1469,8 @@ int soft_offline_page(struct page *page, int flags)
put_page(page);
if (!ret) {
LIST_HEAD(pagelist);
-
+ inc_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
0, true);
--
1.7.0.4
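[For context, a rough paraphrase (not from the thread) of the putback side the fix pairs with: mm/migrate.c's putback_lru_pages() at the time decremented NR_ISOLATED_* for every page handed back, so a caller that isolates pages without the matching inc_zone_page_state() drives the counter negative on migration failure.]

void putback_lru_pages(struct list_head *l)
{
        struct page *page;
        struct page *page2;

        list_for_each_entry_safe(page, page2, l, lru) {
                list_del(&page->lru);
                /* Assumes the caller accounted the page as isolated */
                dec_zone_page_state(page, NR_ISOLATED_ANON +
                                page_is_file_cache(page));
                putback_lru_page(page);
        }
}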

--
Kind regards
Minchan Kim

2011-05-31 12:25:16

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 09:16:20PM +0900, Minchan Kim wrote:
> I am not sure this is related to the problem you have seen.
> If he used hwpoison by madivse, it is possible.

CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set

> Anyway, we can see negative value by count mismatch in UP build.
> Let's fix it.

Definitely let's fix it, but it's probably not related to this one.

>
> From 1d3ebce2e8aa79dcc912da16b7a8d0611b6f9f1a Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Tue, 31 May 2011 21:11:58 +0900
> Subject: [PATCH] Fix page isolated count mismatch
>
> If migration is failed, normally we call putback_lru_pages which
> decreases NR_ISOLATE_[ANON|FILE].
> It means we should increase NR_ISOLATE_[ANON|FILE] before calling
> putback_lru_pages. But soft_offline_page dosn't it.
>
> It can make NR_ISOLATE_[ANON|FILE] with negative value and in UP build,
> zone_page_state will say huge isolated pages so too_many_isolated
> functions be deceived completely. At last, some process stuck in D state
> as it expect while loop ending with congestion_wait.
> But it's never ending story.
>
> If it is right, it would be -stable stuff.
>
> Cc: Mel Gorman <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> mm/memory-failure.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5c8f7e0..eac0ba5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -52,6 +52,7 @@
> #include <linux/swapops.h>
> #include <linux/hugetlb.h>
> #include <linux/memory_hotplug.h>
> +#include <linux/mm_inline.h>
> #include "internal.h"
>
> int sysctl_memory_failure_early_kill __read_mostly = 0;
> @@ -1468,7 +1469,8 @@ int soft_offline_page(struct page *page, int flags)
> put_page(page);
> if (!ret) {
> LIST_HEAD(pagelist);
> -
> + inc_zone_page_state(page, NR_ISOLATED_ANON +
> + page_is_file_cache(page));
> list_add(&page->lru, &pagelist);
> ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> 0, true);

Reviewed-by: Andrea Arcangeli <[email protected]>

Let's check all other migrate_pages callers too...

2011-05-31 13:33:51

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 02:24:37PM +0200, Andrea Arcangeli wrote:
> On Tue, May 31, 2011 at 09:16:20PM +0900, Minchan Kim wrote:
> > I am not sure this is related to the problem you have seen.
> > If he used hwpoison by madivse, it is possible.
>
> CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
> # CONFIG_MEMORY_FAILURE is not set
>
> > Anyway, we can see negative value by count mismatch in UP build.
> > Let's fix it.
>
> Definitely let's fix it, but it's probably not related to this one.
>
> >
> > From 1d3ebce2e8aa79dcc912da16b7a8d0611b6f9f1a Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <[email protected]>
> > Date: Tue, 31 May 2011 21:11:58 +0900
> > Subject: [PATCH] Fix page isolated count mismatch
> >
> > If migration is failed, normally we call putback_lru_pages which
> > decreases NR_ISOLATE_[ANON|FILE].
> > It means we should increase NR_ISOLATE_[ANON|FILE] before calling
> > putback_lru_pages. But soft_offline_page dosn't it.
> >
> > It can make NR_ISOLATE_[ANON|FILE] with negative value and in UP build,
> > zone_page_state will say huge isolated pages so too_many_isolated
> > functions be deceived completely. At last, some process stuck in D state
> > as it expect while loop ending with congestion_wait.
> > But it's never ending story.
> >
> > If it is right, it would be -stable stuff.
> >
> > Cc: Mel Gorman <[email protected]>
> > Cc: Andrea Arcangeli <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > mm/memory-failure.c | 4 +++-
> > 1 files changed, 3 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 5c8f7e0..eac0ba5 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -52,6 +52,7 @@
> > #include <linux/swapops.h>
> > #include <linux/hugetlb.h>
> > #include <linux/memory_hotplug.h>
> > +#include <linux/mm_inline.h>
> > #include "internal.h"
> >
> > int sysctl_memory_failure_early_kill __read_mostly = 0;
> > @@ -1468,7 +1469,8 @@ int soft_offline_page(struct page *page, int flags)
> > put_page(page);
> > if (!ret) {
> > LIST_HEAD(pagelist);
> > -
> > + inc_zone_page_state(page, NR_ISOLATED_ANON +
> > + page_is_file_cache(page));
> > list_add(&page->lru, &pagelist);
> > ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> > 0, true);
>
> Reviewed-by: Andrea Arcangeli <[email protected]>

Thanks, Andrea.

>
> Let's check all other migrate_pages callers too...

I checked them before sending patch but I got failed to find strange things. :(
Now I am checking the page's SwapBacked flag can be changed
between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
make mistake. I am approaching the failure, too. Hmm.


--
Kind regards
Minchan Kim

2011-05-31 14:14:37

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> I checked them before sending patch but I got failed to find strange things. :(

My review also doesn't show other bugs in migrate_pages callers like
that one.

> Now I am checking the page's SwapBacked flag can be changed
> between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> make mistake. I am approaching the failure, too. Hmm.

When I checked that, I noticed the ClearPageSwapBacked in swapcache if
radix insertion fails, but that happens before adding the page in the
LRU so it shouldn't have a chance to be isolated.

So far I only noticed an unsafe page_count in
vmscan.c:isolate_lru_pages but that should at worst result in a
invalid pointer dereference as random result from that page_count is
not going to hurt and I think it's only a theoretical issue.

2011-05-31 14:34:46

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 07:53:34PM +0200, Andrea Arcangeli wrote:
> On Mon, May 30, 2011 at 05:55:46PM +0100, Mel Gorman wrote:
> > Even with drift issues, -1 there should be "impossible". Assuming this
> > is a zoneinfo file, that figure is based on global_page_state() which
> > looks like
>
> The two cases reproducing this long hang in D state, had from SMP=n
> PREEMPT=y. Clearly not common config these days. Also it didn't seem
> apparent that any task was running in a code path that kept pages
> isolated.
>
> > unsigned long, and callers are using unsigned long, is there any
> > possibility the "if (x < 0)" is being optimised out? If you aware
>
> It was eliminated by cpp.
>

Yep, !CONFIG_SMP is important.

> > of users reporting this problem (like the users in thread "iotop:
> > khugepaged at 99.99% (2.6.38.3)"), do you know if they had a particular
> > compiler in common?
>
> I had no reason to worry about the compiler yet but that's always good
> idea to keep in mind.

I'm no longer considering it if CONFIG_SMP is not set.

> The thread were the bug is reported is the
> "iotop" one you mentioned, and there's a tarball attached to one of
> the last emails of the thread with the debug data I grepped. It was
> /proc/zoneinfo file yes. That's the file I asked when I noticed
> something had to be wrong with too_many_isolated and I expected either
> nr_isolated or nr_inactive going wrong, it turned out it was
> nr_isolated (apparently, I don't have full picture on the problem
> yet). I added you in CC to a few emails but you weren't in all
> replies.
>

I didn't pay as close attention as I should have either. I was out for a few
days when that thread happened and it had gone quiet by the time I came
back. I didn't check if it ever got resolved.

> The debug data you can find on lkml in this email: Message-Id:
> <[email protected]>.
>
> The other relevant sysrq+t here http://pastebin.com/raw.php?i=VG28YRbi
>
> better save the latter (I did) as I'm worried it has a timeout on it.
>
> Your patch was for reports with CONFIG_SMP=y?

No. Now that I look at the config, CONFIG_SMP was not set.

> I'd prefer to clear out
> this error before improving the too_many_isolated,

Agreed. So far, it has been impossible to reproduce but it's
happened to enough people that it must exist. I have a similar
config now and have applied a debugging patch to warn when the count
gets screwed up. It hasn't triggered yet so it must be due to some
difficult-to-hit race.

> in fact while
> reviewing this code I was not impressed by too_many_isolated. For
> vmscan.c if there's an huge nr_active* list and a tiny nr_inactive
> (like after a truncate of filebacked pages or munmap of anon memory)
> there's no reason to stall, it's better to go ahead and let it refile
> more active pages.

The rotating of lists should already have happened, but you could be
right that if a process was stalled in too_many_isolated for a long period
of time, the lists would become imbalanced. It'd be saved by kswapd
running at the same time and rotating the list, but it might need to be
revisited at some point. It's unrelated to the current problem though.

> The too_many_isolated in compaction.c looks a whole
> lot better than the vmscan.c one as that takes into account the active
> pages too... But I refrained to make any change in this area as I
> don't think the bug is in too_many_isolated itself.
>

Not a bug that would lock up the machine at least.

> I noticed the count[] array is unsigned int, but it looks ok
> (especially for 32bit ;) because the isolation is limited.
>

Agreed, this is not a wrapping problem.

> Both bugs were reported on 32bit x86 UP builds with PREEMPT=y. The
> stat accounting seem to use atomics on UP so irqs on off or
> PREEMPT=y/n shouldn't matter if the increment is 1 insn long (plus no
> irq code should ever mess with nr_isolated)... If it wasn't atomic and
> irqs or preempt aren't disabled it could be preempt. To avoid
> confusion: it's not proven that PREEMPT is related, it may be an
> accident both .config had it on. I'm also unsure why it moves from
> -1,0,1 I wouldn't expect a single page to be isolated like -1 pages to
> be isolated, it just looks weird...
>

Hmm, looking at the zoneinfo, we can probably rule out some race
related to page flags. If someone isolated a page as anon and put
it back as file, we'd decrement nr_isolated_file inappropriately to -1
but nr_isolated_anon would be stuck at 1 to match it.

I'm looking at __collapse_huge_page_isolate and it always assumes
it isolated anonymous pages. mmap_sem should be held as well as the
pagetable lock, and we hold the page lock while isolating from the LRU,
which seems fine, but I wonder whether there could be another condition
that allows a !PageSwapBacked page to sneak in there?

--
Mel Gorman
SUSE Labs

2011-05-31 14:37:45

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 04:14:02PM +0200, Andrea Arcangeli wrote:
> On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> > I checked them before sending patch but I got failed to find strange things. :(
>
> My review also doesn't show other bugs in migrate_pages callers like
> that one.
>
> > Now I am checking the page's SwapBacked flag can be changed
> > between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> > make mistake. I am approaching the failure, too. Hmm.
>
> When I checked that, I noticed the ClearPageSwapBacked in swapcache if
> radix insertion fails, but that happens before adding the page in the
> LRU so it shouldn't have a chance to be isolated.

True.

>
> So far I only noticed an unsafe page_count in
> vmscan.c:isolate_lru_pages but that should at worst result in a
> invalid pointer dereference as random result from that page_count is
> not going to hurt and I think it's only a theoretical issue.


Yes. You find a new BUG.
It seems to be related to this problem but it should be solved although
it's very rare case.

--
Kind regards
Minchan Kim

2011-05-31 14:38:39

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 11:37:35PM +0900, Minchan Kim wrote:
> On Tue, May 31, 2011 at 04:14:02PM +0200, Andrea Arcangeli wrote:
> > On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> > > I checked them before sending patch but I got failed to find strange things. :(
> >
> > My review also doesn't show other bugs in migrate_pages callers like
> > that one.
> >
> > > Now I am checking the page's SwapBacked flag can be changed
> > > between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> > > make mistake. I am approaching the failure, too. Hmm.
> >
> > When I checked that, I noticed the ClearPageSwapBacked in swapcache if
> > radix insertion fails, but that happens before adding the page in the
> > LRU so it shouldn't have a chance to be isolated.
>
> True.
>
> >
> > So far I only noticed an unsafe page_count in
> > vmscan.c:isolate_lru_pages but that should at worst result in a
> > invalid pointer dereference as random result from that page_count is
> > not going to hurt and I think it's only a theoretical issue.
>
>
> Yes. You find a new BUG.
> It seems to be related to this problem but it should be solved although

typo : It doesn't seem to be.


-
Kind regards
Minchan Kim

2011-06-01 00:57:55

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 04:14:02PM +0200, Andrea Arcangeli wrote:
> On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> > I checked them before sending patch but I got failed to find strange things. :(
>
> My review also doesn't show other bugs in migrate_pages callers like
> that one.
>
> > Now I am checking the page's SwapBacked flag can be changed
> > between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> > make mistake. I am approaching the failure, too. Hmm.
>
> When I checked that, I noticed the ClearPageSwapBacked in swapcache if
> radix insertion fails, but that happens before adding the page in the
> LRU so it shouldn't have a chance to be isolated.
>

After hammering three machines for several hours, I managed to trigger
this once on x86 !CONFIG_SMP CONFIG_PREEMPT HIGHMEM4G (so no PAE)
and caught the following trace.

May 31 23:45:37 arnold kernel: WARNING: at include/linux/vmstat.h:167 compact_zone+0xf8/0x53c()
May 31 23:45:37 arnold kernel: Hardware name:
May 31 23:45:37 arnold kernel: Modules linked in: 3c59x mii sr_mod forcedeth cdrom ext4 mbcache jbd2 crc16 sd_mod ata_generic pata_amd sata_nv libata scsi_mod
May 31 23:45:37 arnold kernel: Pid: 16172, comm: usemem Not tainted 2.6.38.4-autobuild #17
May 31 23:45:37 arnold kernel: Call Trace:
May 31 23:45:37 arnold kernel: [<c10277f5>] ? warn_slowpath_common+0x65/0x7a
May 31 23:45:37 arnold kernel: [<c1098b12>] ? compact_zone+0xf8/0x53c
May 31 23:45:37 arnold kernel: [<c1027819>] ? warn_slowpath_null+0xf/0x13
May 31 23:45:37 arnold kernel: [<c1098b12>] ? compact_zone+0xf8/0x53c
May 31 23:45:37 arnold kernel: [<c1098fe3>] ? compact_zone_order+0x8d/0x95
May 31 23:45:37 arnold kernel: [<c1099068>] ? try_to_compact_pages+0x7d/0xc8
May 31 23:45:37 arnold kernel: [<c107ba56>] ? __alloc_pages_direct_compact+0x71/0x102
May 31 23:45:37 arnold kernel: [<c107be15>] ? __alloc_pages_nodemask+0x32e/0x57d
May 31 23:45:37 arnold kernel: [<c10914a6>] ? anon_vma_prepare+0x13/0x109
May 31 23:45:37 arnold kernel: [<c109fb01>] ? do_huge_pmd_anonymous_page+0xc9/0x285
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346
May 31 23:45:37 arnold kernel: [<c108cb5e>] ? handle_mm_fault+0x7b/0x13a
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346
May 31 23:45:37 arnold kernel: [<c1019298>] ? do_page_fault+0x32e/0x346
May 31 23:45:37 arnold kernel: [<c104a234>] ? trace_hardirqs_off+0xb/0xd
May 31 23:45:37 arnold kernel: [<c100482c>] ? do_softirq+0x9f/0xb5
May 31 23:45:37 arnold kernel: [<c12a5dee>] ? restore_all_notrace+0x0/0x18
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346
May 31 23:45:37 arnold kernel: [<c104e91e>] ? trace_hardirqs_on_caller+0xfd/0x11e
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346
May 31 23:45:37 arnold kernel: [<c12a61d9>] ? error_code+0x5d/0x64
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346

This is triggering in compaction's too_many_isolated() where the
NR_ISOLATED_FILE counter has gone negative, so the damage was already
done. Most likely, the damage was caused when compaction called
putback_lru_pages() on pages that failed migration and that were
accounted as isolated anon during isolation but magically put back as
isolated file.
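[For context, a rough paraphrase of the isolation-time accounting in 2.6.38-era mm/compaction.c, not quoted in the thread: the anon/file split is decided once from the page flags when the pages are isolated, and decided again from PageSwapBacked when putback_lru_pages() returns them, which is the window the theory above relies on.]

static void acct_isolated(struct zone *zone, struct compact_control *cc)
{
        struct page *page;
        unsigned int count[NR_LRU_LISTS] = { 0, };

        /* anon vs file is decided here, at isolation time */
        list_for_each_entry(page, &cc->migratepages, lru)
                count[page_lru_base_type(page)]++;

        cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
        cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
        __mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
        __mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
}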

It's almost 2am so I'm wiped but the first thing in the morning
I want to check is if http://lkml.org/lkml/2010/8/26/32 is
relevant. Specifically, if during transparent hugepage collapsing
or splitting we are not protected by the anon_vma lock allowing an
imbalance to occur while calling release_pte_pages(). This seems a
bit far-reached though as I'd think at least the anon counter would
be corrupted by that.

A related possibility is that if the wrong anon_vma is being locked
then there is a race between collapse_huge_page and when migration
drops to 0 allowing release_pte_pages() to miss the page entirely.
Again, it's the wrong counter being corrupted, you'd think.

Another possibility is that, because this is !PAE, the !SMP version
of native_pmdp_get_and_clear is somehow insufficient, although I can't
think how it might be - unless the lack of a barrier with preemption
enabled is somehow a factor. Again, it's a reach because one would
expect the anon counter to get messed up, not the file one.

I can't formulate a theory as to how PageSwapBacked gets cleared during
migration that would cause compaction's putback_lru_pages to decrement
the wrong counter. Maybe sleep will figure it out :(
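[For reference: the helper that decides the anon/file direction of the decrement is just a test of PageSwapBacked, so a page whose PageSwapBacked flag were cleared between isolation and putback would indeed be decremented against NR_ISOLATED_FILE despite having been accounted as anon.]

static inline int page_is_file_cache(struct page *page)
{
        return !PageSwapBacked(page);
}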

--
Mel Gorman
SUSE Labs

2011-06-01 09:25:04

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Wed, Jun 01, 2011 at 01:57:47AM +0100, Mel Gorman wrote:
> It's almost 2am so I'm wiped but the first thing in the morning
> I want to check is if http://lkml.org/lkml/2010/8/26/32 is
> relevant.

It's not. The patch I really meant was
https://lkml.org/lkml/2011/5/28/155 and it's irrelevant to 2.6.38.4
which is already doing the right thing of rechecking page->mapping under
lock.

--
Mel Gorman
SUSE Labs

2011-06-01 17:58:15

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Wed, Jun 01, 2011 at 01:57:47AM +0100, Mel Gorman wrote:
> On Tue, May 31, 2011 at 04:14:02PM +0200, Andrea Arcangeli wrote:
> > On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> > > I checked them before sending patch but I got failed to find strange things. :(
> >
> > My review also doesn't show other bugs in migrate_pages callers like
> > that one.
> >
> > > Now I am checking the page's SwapBacked flag can be changed
> > > between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> > > make mistake. I am approaching the failure, too. Hmm.
> >
> > When I checked that, I noticed the ClearPageSwapBacked in swapcache if
> > radix insertion fails, but that happens before adding the page in the
> > LRU so it shouldn't have a chance to be isolated.
> >
>
> After hammering three machines for several hours, I managed to trigger
> this once on x86 !CONFIG_SMP CONFIG_PREEMPT HIGHMEM4G (so no PAE)
> and caught the following trace.
>

Umm, HIGHMEM4G implies a two-level pagetable layout so where are
things like _PAGE_BIT_SPLITTING being set when THP is enabled?

--
Mel Gorman
SUSE Labs

2011-06-01 19:15:45

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
> Umm, HIGHMEM4G implies a two-level pagetable layout so where are
> things like _PAGE_BIT_SPLITTING being set when THP is enabled?

They should be set on the pgd, pud_offset/pgd_offset will just bypass.
The splitting bit shouldn't be special about it, the present bit
should work the same.

2011-06-01 21:40:25

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote:
> On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
> > Umm, HIGHMEM4G implies a two-level pagetable layout so where are
> > things like _PAGE_BIT_SPLITTING being set when THP is enabled?
>
> They should be set on the pgd, pud_offset/pgd_offset will just bypass.
> The splitting bit shouldn't be special about it, the present bit
> should work the same.

This comment is misleading at best then.

#define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */

At the PGD level, it can have PSE set obviously but it's not a
PMD. I confess I haven't checked the manual to see if it's safe to
use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I
found that the bug is far harder to reproduce with 3 pagetable levels
than with 2 but that is just timing. So far it has proven impossible
on x86-64 at least within 27 hours so that has me looking at how
pagetable management between x86 and x86-64 differ.

Barriers are a big difference between 32-bit !SMP and x86-64 but I
don't know yet which one is relevant or if this is even the right
direction.

--
Mel Gorman
SUSE Labs

2011-06-01 23:31:19

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Mel,

On Wed, Jun 01, 2011 at 10:40:18PM +0100, Mel Gorman wrote:
> On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote:
> > On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
> > > Umm, HIGHMEM4G implies a two-level pagetable layout so where are
> > > things like _PAGE_BIT_SPLITTING being set when THP is enabled?
> >
> > They should be set on the pgd, pud_offset/pgd_offset will just bypass.
> > The splitting bit shouldn't be special about it, the present bit
> > should work the same.
>
> This comment is misleading at best then.
>
> #define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */

From common code point of view it's set in the pmd, the comment can be
extended to specify it's actually the pgd in case of 32bit noPAE but I
didn't think it was too misleading as we think in common code terms
all over the code, the fact it's a bypass is pretty clear across the
whole archs.

> At the PGD level, it can have PSE set obviously but it's not a
> PMD. I confess I haven't checked the manual to see if it's safe to
> use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I

To be sure I re-checked on 253668.pdf page 113/114 noPAE and page 122
PAE, on x86 32bit/64 all ptes/pmd/pgd (32bit/64bit PAE/noPAE) have bit
9-11 "Avail" to software. So I think we should be safe here.

> found that the bug is far harder to reproduce with 3 pagetable levels
> than with 2 but that is just timing. So far it has proven impossible
> on x86-64 at least within 27 hours so that has me looking at how
> pagetable management between x86 and x86-64 differ.

Weird.

However I could see it screwing the nr_inactive/active_* stats, but
the nr_isolated should never go below zero, and especially not anon
even if split_huge_page does the accounting wrong (and
migrate/compaction won't mess with THP), or at least I'd expect things
to fall apart in other ways and not with just a fairly innocuous and
not-memory corrupting nr_isolated_ counter going off just by one.

The khugepaged nr_isolated_anon increment couldn't affect the file one,
and we hold mmap_sem in write mode there to prevent the pte from changing
under us, in addition to the PT and anon_vma lock. The anon_vma lock
being wrong sounds unlikely too, and even if it was, it should screw
the nr_isolated_anon counter; it's impossible to screw nr_isolated_file
with khugepaged.

Where did you put your bugcheck? It looked like you put it in the < 0
reader; can you add it to all _inc/dec/mod (even _inc just in case) so
we may get a stack trace including the culprit? (Not guaranteed, but a
better chance.)
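
For illustration, here is a minimal sketch of the kind of bugcheck being
suggested, using a hypothetical wrapper rather than the real vmstat helpers:
trap at the point of modification so the stack trace includes the caller that
drives the counter negative, instead of whichever reader notices it later.

	#include <linux/bug.h>

	/* hypothetical wrapper; the real counters live in the per-cpu vmstat code */
	static inline void isolated_stat_mod(long *counter, long delta)
	{
		long val = *counter + delta;

		/* catch the culprit in the act rather than in a later reader */
		BUG_ON(val < 0);
		*counter = val;
	}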

> Barriers are a big different between how 32-bit !SMP and X86-64 but
> don't know yet which one is relevant or if this is even the right
> direction.

The difference is we need xchg on SMP to avoid losing the dirty
bit. Otherwise, if we do pmd_t pmd = *pmdp; *pmdp = 0; the dirty bit
may have been set in between the two by another thread running in
userland on a different CPU, while the pmd was still "present". As
long as interrupts don't write to read-write userland memory with the
pte dirty bit clear, we shouldn't need xchg on !SMP.
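
As a hedged illustration of that race (simplified userspace C with
hypothetical names, not the kernel's pmd helpers): between the load and the
store of a non-atomic clear, another CPU can still set the dirty bit in the
still-present entry and that update is lost; an atomic exchange closes the
window.

	#include <stdatomic.h>

	typedef unsigned long pmdval_t;

	/* Racy on SMP: a dirty bit set after the load is silently dropped. */
	static pmdval_t pmd_get_and_clear_racy(pmdval_t *pmdp)
	{
		pmdval_t old = *pmdp;	/* another CPU may still mark the entry dirty here */
		*pmdp = 0;
		return old;
	}

	/* Safe: the read and the clear are a single atomic operation (xchg on x86). */
	static pmdval_t pmd_get_and_clear_atomic(_Atomic pmdval_t *pmdp)
	{
		return atomic_exchange(pmdp, 0);
	}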

On PAE we also need to write 0 into pmd_low before worrying about
pmd_high, so the present bit is cleared before clearing the high part
of the 32-bit PAE pte, and we rely on xchg's implicit lock to avoid an
smp_wmb() in between the two writes.

I'm unsure if any of this could be relevant to our problem; also, there
can't be more than one writer at once on the pmd, as nobody can modify
it without the page_table_lock held. xchg there is just to be safe for
the dirty bit (or we'd corrupt memory with threads running in userland
and writing to memory on other cpus while we ptep_clear_flush).

I've been wondering about the lack of "lock" on the bus in atomic.h
too, but I can't see how it can possibly matter on !SMP: vmstat
modifications should execute as a single asm insn, so preempt or an irq
can't interrupt them.

2011-06-02 01:03:58

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 01:30:36AM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> On Wed, Jun 01, 2011 at 10:40:18PM +0100, Mel Gorman wrote:
> > On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote:
> > > On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
> > > > Umm, HIGHMEM4G implies a two-level pagetable layout so where are
> > > > things like _PAGE_BIT_SPLITTING being set when THP is enabled?
> > >
> > > They should be set on the pgd, pud_offset/pgd_offset will just bypass.
> > > The splitting bit shouldn't be special about it, the present bit
> > > should work the same.
> >
> > This comment is misleading at best then.
> >
> > #define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
>
> From common code point of view it's set in the pmd, the comment can be
> extended to specify it's actually the pgd in case of 32bit noPAE but I
> didn't think it was too misleading as we think in common code terms
> all over the code, the fact it's a bypass is pretty clear across the
> whole archs.
>

Fair point.

> > At the PGD level, it can have PSE set obviously but it's not a
> > PMD. I confess I haven't checked the manual to see if it's safe to
> > use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I
>
> To be sure I re-checked on 253668.pdf page 113/114 noPAE and page 122
> PAE, on x86 32bit/64 all ptes/pmd/pgd (32bit/64bit PAE/noPAE) have bit
> 9-11 "Avail" to software. So I think we should be safe here.
>

Good stuff. I was reasonably sure this was the case but as this was
already "impossible", it needed to be ruled out.

> > found that the bug is far harder to reproduce with 3 pagetable levels
> > than with 2 but that is just timing. So far it has proven impossible
> > on x86-64 at least within 27 hours so that has me looking at how
> > pagetable management between x86 and x86-64 differ.
>
> Weird.
>
> However I could see it screwing the nr_inactive/active_* stats, but
> the nr_isolated should never go below zero, and especially not anon
> even if split_huge_page does the accounting wrong (and
> migrate/compaction won't mess with THP), or at least I'd expect things
> to fall apart in other ways and not with just a fairly innocuous and
> not-memory corrupting nr_isolated_ counter going off just by one.
>

Again, agreed. I found it hard to come up with a reason why file would
get messed up particularly as PageSwapBacked does not get cleared in the
ordinary case until the page is freed. If we were using pages after
being freed due to bad refcounting, it would show up in all sorts of bad
ways.

> The khugepaged nr_isolated_anon increment couldn't affect the file one
> and we hold mmap_sem write mode there to prevent the pte to change
> from under us, in addition to the PT and anon_vma lock. Anon_vma lock
> being wrong sounds unlikely too, and even if it was it should screw
> the nr_isolated_anon counter, impossible to screw the nr_isolated_file
> with khugepaged.
>

After reviewing, I still could not find a problem with the locking that
might explain this. I thought last night anon_vma might be bust in some
way but today I couldn't find a problem.

> Where did you put your bugcheck? It looked like you put it in the < 0
> reader, can you add it to all _inc/dec/mod (even _inc just in case) so
> we may get a stack trace including the culprit? (not guaranteed but
> better chance)
>

Did that, didn't really help other than showing the corruption happens
early in the process lifetime while huge PMDs are being faulted. This
made me think the problem might be on or near fork.

> > Barriers are a big different between how 32-bit !SMP and X86-64 but
> > don't know yet which one is relevant or if this is even the right
> > direction.
>
> The difference is we need xchg on SMP to avoid losing the dirty
> bit. Otherwise if we do pmd_t pmd = *pmdp; *pmdp = 0; the dirty bit
> may have been set in between the two by another thread running in
> userland in a different CPU, while the pmd was still "present". As
> long as interrupts don't write to read-write userland memory with the
> pte dirty bit clear, we shouldn't need xchg on !SMP.
>

Yep.

> On PAE we also need to write 0 into pmd_low before worrying about
> pmd_high so the present bit is cleared before clearing the high part
> of the 32bit PAE pte, and we relay on xchg implicit lock to avoid a
> smp_wmb() in between the two writes.
>

Yep.

> I'm unsure if any of this could be relevant to our problem, also there

I concluded after a while that it wasn't: partly from reasoning about
it and partly from testing, forcing the use of the SMP versions and
finding the bug was still reproducible.

> can't be more than one writer at once in the pmd, as nobody can modify
> it without the page_table_lock held. xchg there is just to be safe for
> the dirty bit (or we'd corrupt memory with threads running in userland
> and writing to memory on other cpus while we ptep_clear_flush).
>
> I've been wondering about the lack of "lock" on the bus in atomic.h
> too, but I can't see how it can possibly matter on !SMP, vmstat
> modifications should execute only 1 asm insn so preempt or irq can't
> interrupt it.

To be honest, I haven't fully figured out yet why it makes such
a difference on !SMP. I have a vague notion that it's because
the page table page and the data are visible before the bit set by
SetPageSwapBacked on the struct page is visible, but I haven't reasoned
it out yet. If this were the case, it might allow an "anon" page to
be treated as a file page by compaction for accounting purposes and push
the counter negative, but you'd think then that the anon isolation count
would be positive, so it's something else.

As I thought fork() might be an issue, I looked closer at what we do
there. We are calling pmd_alloc at copy_pmd_range which is a no-op
when PMD is folded and copy_huge_pmd() is calling pte_alloc_one()
which also has no barrier. I haven't checked this fully (it's very
late again as I wasn't able to work on this during most of the day)
but I wonder if it's then possible the PMD setup is not visible before
insertion into the page table leading to weirdness? Why it matters to
SMP is unclear unless this is a preemption thing I'm not thinking of.

In a similar vein, during collapse_huge_page() we use a barrier
to ensure the data copy is visible before the PMD insertion, but in
__do_huge_pmd_anonymous_page() we assume that the "spinlocking to take the
lru_lock inside page_add_new_anon_rmap() acts as a full memory barrier".
Thing is, it's calling lru_cache_add_lru(), which adds the page to a
pagevec and does not necessarily take the LRU lock, and !SMP leaves a
big enough race before the pagevec gets drained to cause a problem.
Of course, maybe it *is* happening on SMP but the negative counters
are being reported as zero :)

To see if this is along the right lines, I'm currently testing with
this patch against 2.6.38.4. It hasn't blown up in 35 minutes, which
is an improvement over getting into trouble after 5, so I'll leave
it running overnight and see if I can convince myself of what is going
on tomorrow.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2d29c9a..65fa251 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -631,12 +631,14 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
entry = mk_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
+
/*
- * The spinlocking to take the lru_lock inside
- * page_add_new_anon_rmap() acts as a full memory
- * barrier to be sure clear_huge_page writes become
- * visible after the set_pmd_at() write.
+ * Need a write barrier to ensure the writes from
+ * clear_huge_page become visible before the
+ * set_pmd_at
*/
+ smp_wmb();
+
page_add_new_anon_rmap(page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
prepare_pmd_huge_pte(pgtable, mm);
@@ -753,6 +755,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
+
+ /*
+ * Write barrier to make sure the setup for the PMD is fully visible
+ * before the set_pmd_at
+ */
+ smp_wmb();
+
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
prepare_pmd_huge_pte(pgtable, dst_mm);

2011-06-02 08:34:11

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 2, 2011 at 10:03 AM, Mel Gorman <[email protected]> wrote:
> On Thu, Jun 02, 2011 at 01:30:36AM +0200, Andrea Arcangeli wrote:
>> Hi Mel,
>>
>> On Wed, Jun 01, 2011 at 10:40:18PM +0100, Mel Gorman wrote:
>> > On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote:
>> > > On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
>> > > > Umm, HIGHMEM4G implies a two-level pagetable layout so where are
>> > > > things like _PAGE_BIT_SPLITTING being set when THP is enabled?
>> > >
>> > > They should be set on the pgd, pud_offset/pgd_offset will just bypass.
>> > > The splitting bit shouldn't be special about it, the present bit
>> > > should work the same.
>> >
>> > This comment is misleading at best then.
>> >
>> > #define _PAGE_BIT_SPLITTING     _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
>>
>> From common code point of view it's set in the pmd, the comment can be
>> extended to specify it's actually the pgd in case of 32bit noPAE but I
>> didn't think it was too misleading as we think in common code terms
>> all over the code, the fact it's a bypass is pretty clear across the
>> whole archs.
>>
>
> Fair point.
>
>> > At the PGD level, it can have PSE set obviously but it's not a
>> > PMD. I confess I haven't checked the manual to see if it's safe to
>> > use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I
>>
>> To be sure I re-checked on 253668.pdf page 113/114 noPAE and page 122
>> PAE, on x86 32bit/64 all ptes/pmd/pgd (32bit/64bit PAE/noPAE) have bit
>> 9-11 "Avail" to software. So I think we should be safe here.
>>
>
> Good stuff. I was reasonably sure this was the case but as this was
> already "impossible", it needed to be ruled out.
>
>> > found that the bug is far harder to reproduce with 3 pagetable levels
>> > than with 2 but that is just timing. So far it has proven impossible
>> > on x86-64 at least within 27 hours so that has me looking at how
>> > pagetable management between x86 and x86-64 differ.
>>
>> Weird.
>>
>> However I could see it screwing the nr_inactive/active_* stats, but
>> the nr_isolated should never go below zero, and especially not anon
>> even if split_huge_page does the accounting wrong (and
>> migrate/compaction won't mess with THP), or at least I'd expect things
>> to fall apart in other ways and not with just a fairly innocuous and
>> not-memory corrupting nr_isolated_ counter going off just by one.
>>
>
> Again, agreed. I found it hard to come up with a reason why file would
> get messed up particularly as PageSwapBacked does not get cleared in the
> ordinary case until the page is freed. If we were using pages after
> being freed due to bad refcounting, it would show up in all sorts of bad
> ways.
>
>> The khugepaged nr_isolated_anon increment couldn't affect the file one
>> and we hold mmap_sem write mode there to prevent the pte to change
>> from under us, in addition to the PT and anon_vma lock. Anon_vma lock
>> being wrong sounds unlikely too, and even if it was it should screw
>> the nr_isolated_anon counter, impossible to screw the nr_isolated_file
>> with khugepaged.
>>
>
> After reviewing, I still could not find a problem with the locking that
> might explain this. I thought last night anon_vma might be bust in some
> way but today I couldn't find a problem.
>
>> Where did you put your bugcheck? It looked like you put it in the < 0
>> reader, can you add it to all _inc/dec/mod (even _inc just in case) so
>> we may get a stack trace including the culprit? (not guaranteed but
>> better chance)
>>
>
> Did that, didn't really help other than showing the corruption happens
> early in the process lifetime while huge PMDs are being faulted. This
> made me think the problem might be on or near fork.
>
>> > Barriers are a big different between how 32-bit !SMP and X86-64 but
>> > don't know yet which one is relevant or if this is even the right
>> > direction.
>>
>> The difference is we need xchg on SMP to avoid losing the dirty
>> bit. Otherwise if we do pmd_t pmd = *pmdp; *pmdp = 0; the dirty bit
>> may have been set in between the two by another thread running in
>> userland in a different CPU, while the pmd was still "present". As
>> long as interrupts don't write to read-write userland memory with the
>> pte dirty bit clear, we shouldn't need xchg on !SMP.
>>
>
> Yep.
>
>> On PAE we also need to write 0 into pmd_low before worrying about
>> pmd_high so the present bit is cleared before clearing the high part
>> of the 32bit PAE pte, and we relay on xchg implicit lock to avoid a
>> smp_wmb() in between the two writes.
>>
>
> Yep.
>
>> I'm unsure if any of this could be relevant to our problem, also there
>
> I concluded after a while that it wasn't. Partially from reasoning about
> it and part by testing forcing the use of the SMP versions and finding
> the bug was still reproducible.
>
>> can't be more than one writer at once in the pmd, as nobody can modify
>> it without the page_table_lock held. xchg there is just to be safe for
>> the dirty bit (or we'd corrupt memory with threads running in userland
>> and writing to memory on other cpus while we ptep_clear_flush).
>>
>> I've been wondering about the lack of "lock" on the bus in atomic.h
>> too, but I can't see how it can possibly matter on !SMP, vmstat
>> modifications should execute only 1 asm insn so preempt or irq can't
>> interrupt it.
>
> To be honest, I haven't fully figured out yet why it makes such
> a difference on !SMP. I have a vague notion that it's because
> the page table page and the data is visible before the bit set by
> SetPageSwapBacked on the struct page is visible but haven't reasoned
> it out yet. If this was the case, it might allow an "anon" page to
> be treated as a file by compaction for accounting purposes and push
> the counter negative but you'd think then the anon isolation would
> be positive so it's something else.
>
> As I thought fork() be an issue, I looked closer at what we do
> there. We are calling pmd_alloc at copy_pmd_range which is a no-op
> when PMD is folded and copy_huge_pmd() is calling pte_alloc_one()
> which also has no barrier. I haven't checked this fully (it's very
> late again as I wasn't able to work on this during most of the day)
> but I wonder if it's then possible the PMD setup is not visible before
> insertion into the page table leading to weirdness? Why it matters to
> SMP is unclear unless this is a preemption thing I'm not thinking of.
>
> On a similar vein, during collapse_huge_page(), we use a barrier
> to ensure the data copy is visible before the PMD insertion but in
> __do_huge_pmd_anonymous_page(), we assume the "spinlocking to take the
> lru_lock inside page_add_new_anon_rmap() acts as a full memory". Thing
> is, it's calling lru_cache_add_lru() adding the page to a pagevec
> which is not necessarily taking the LRU lock and !SMP is leaving a
> big enough race before the pagevec gets drained to cause a problem.
> Of course, maybe it *is* happening on SMP but the negative counters
> are being reported as zero :)

Yes. Although we have an atomic_inc in get_page, it doesn't imply a full
memory barrier, so we need an explicit memory barrier. I think you're right.
But I can't see how it's related to this problem (UP, preemption).


--
Kind regards,
Minchan Kim

2011-06-02 13:30:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 02:03:52AM +0100, Mel Gorman wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2d29c9a..65fa251 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -631,12 +631,14 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> entry = mk_pmd(page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> entry = pmd_mkhuge(entry);
> +
> /*
> - * The spinlocking to take the lru_lock inside
> - * page_add_new_anon_rmap() acts as a full memory
> - * barrier to be sure clear_huge_page writes become
> - * visible after the set_pmd_at() write.
> + * Need a write barrier to ensure the writes from
> + * clear_huge_page become visible before the
> + * set_pmd_at
> */
> + smp_wmb();
> +

On x86 at least this is a noop because of the
spin_lock(&page_table_lock) after clear_huge_page. But I'm not against
adding this in case other archs support THP later.

But smp_wmb() is optimized away at build time by cpp, so this can't
possibly help if you're reproducing on !SMP.

> page_add_new_anon_rmap(page, vma, haddr);
> set_pmd_at(mm, haddr, pmd, entry);
> prepare_pmd_huge_pte(pgtable, mm);
> @@ -753,6 +755,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>
> pmdp_set_wrprotect(src_mm, addr, src_pmd);
> pmd = pmd_mkold(pmd_wrprotect(pmd));
> +
> + /*
> + * Write barrier to make sure the setup for the PMD is fully visible
> + * before the set_pmd_at
> + */
> + smp_wmb();
> +
> set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> prepare_pmd_huge_pte(pgtable, dst_mm);

This part seems superfluous to me; it's also a noop for !SMP. Only wmb()
would stay. The pmd is perfectly fine staying in a register; not even
a compiler barrier is needed, even less an smp serialization.

2011-06-02 14:50:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 03:29:54PM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 02, 2011 at 02:03:52AM +0100, Mel Gorman wrote:
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 2d29c9a..65fa251 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -631,12 +631,14 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> > entry = mk_pmd(page, vma->vm_page_prot);
> > entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> > entry = pmd_mkhuge(entry);
> > +
> > /*
> > - * The spinlocking to take the lru_lock inside
> > - * page_add_new_anon_rmap() acts as a full memory
> > - * barrier to be sure clear_huge_page writes become
> > - * visible after the set_pmd_at() write.
> > + * Need a write barrier to ensure the writes from
> > + * clear_huge_page become visible before the
> > + * set_pmd_at
> > */
> > + smp_wmb();
> > +
>
> On x86 at least this is noop because of the
> spin_lock(&page_table_lock) after clear_huge_page. But I'm not against
> adding this in case other archs supports THP later.
>

I thought spin lock acquisition was one-way, where loads/stores
preceding the lock are allowed to leak into the protected region
but not the other way around?

So we have

clear_huge_page()
__SetPageUptodate(page);
spin_lock(&mm->page_table_lock);
...
set_pmd_at(mm, haddr, pmd, entry);

This spinlock itself does not guarantee that writes from
clear_huge_page are complete before that set_pmd_at().

Whether this is right or wrong, why is the same not true in
collapse_huge_page()? There we are

__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
....
smp_wmb();
spin_lock(&mm->page_table_lock);
...
set_pmd_at(mm, address, pmd, _pmd);

with the comment stressing that this is necessary.
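
To make the question concrete, here is a hedged userspace sketch in C11
atomics (stand-in names, not the kernel primitives): an acquire-only lock
does not, by itself, order the earlier plain stores against the store inside
the critical section, which is exactly what the smp_wmb()/release fence
would add.

	#include <stdatomic.h>
	#include <string.h>

	static char page_data[4096];			/* stands in for the huge page */
	static _Atomic unsigned long pmd_entry;		/* stands in for the pmd */
	static atomic_flag ptl = ATOMIC_FLAG_INIT;	/* stands in for page_table_lock */

	static void fault_path(void)
	{
		memset(page_data, 0, sizeof(page_data));	/* "clear_huge_page()" */

		/* Without this release fence there is only acquire ordering below,
		 * and nothing prevents the pmd store from becoming visible before
		 * the cleared page contents on a weakly ordered machine. */
		atomic_thread_fence(memory_order_release);	/* the smp_wmb() */

		while (atomic_flag_test_and_set_explicit(&ptl, memory_order_acquire))
			;					/* "spin_lock()" */
		atomic_store_explicit(&pmd_entry, 1, memory_order_relaxed); /* "set_pmd_at()" */
		atomic_flag_clear_explicit(&ptl, memory_order_release);     /* "spin_unlock()" */
	}

Whether the x86 spin_lock implementation happens to provide stronger
ordering in practice is the point being discussed below.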

> But smp_wmb() is optimized away at build time by cpp so this can't
> possibly help if you're reproducing !SMP.
>

On X86 !SMP, this is still a barrier() which on gcc is

#define barrier() __asm__ __volatile__("": : :"memory")

so it's a compiler barrier. I'm not working on this at the
moment but when I get to it, I'll compare the object files and see
if there are relevant differences. Could be tomorrow before I get
the chance again.

> > page_add_new_anon_rmap(page, vma, haddr);
> > set_pmd_at(mm, haddr, pmd, entry);
> > prepare_pmd_huge_pte(pgtable, mm);
> > @@ -753,6 +755,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >
> > pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > pmd = pmd_mkold(pmd_wrprotect(pmd));
> > +
> > + /*
> > + * Write barrier to make sure the setup for the PMD is fully visible
> > + * before the set_pmd_at
> > + */
> > + smp_wmb();
> > +
> > set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > prepare_pmd_huge_pte(pgtable, dst_mm);
>
> This part seems superfluous to me, it's also noop for !SMP.

Other than being a compiler barrier.

> Only wmb()
> would stay. the pmd is perfectly fine to stay in a register, not even
> a compiler barrier is needed, even less a smp serialization.

There is an explanation in here somewhere, because as I write this
the test machine has survived 14 hours under continual stress without
the isolated counters going negative, with over 128 million pages
successfully migrated and a million failed migrations over the course
of direct compaction being called 80,000 times. It's possible it's a
coincidence but it's some coincidence!

--
Mel Gorman
SUSE Labs

2011-06-02 15:38:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 03:50:19PM +0100, Mel Gorman wrote:
> I thought spin lock acquisition was one-way where loads/stores
> preceeding the lock are allowed to leak into the protected region
> but not the other way around?

That's true for ia64, but not x86 AFAIK.

> So we have
>
> clear_huge_page()
> __SetPageUptodate(page);
> spin_lock(&mm->page_table_lock);
> ...
> set_pmd_at(mm, haddr, pmd, entry);
>
> This spinlock itself does not guarantee that writes from
> clear_huge_page are complete before that set_pmd_at().

It does on x86.

> Whether this is right or wrong, why is the same not true in
> collapse_huge_page()? There we are
>
> __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
> ....
> smp_wmb();
> spin_lock(&mm->page_table_lock);
> ...
> set_pmd_at(mm, address, pmd, _pmd);
>
> with the comment stressing that this is necessary.

So the first part of your patch is right, but it should be only a
theoretical improvement.

> > But smp_wmb() is optimized away at build time by cpp so this can't
> > possibly help if you're reproducing !SMP.
> >
>
> On X86 !SMP, this is still a barrier() which on gcc is
>
> #define barrier() __asm__ __volatile__("": : :"memory")
>
> so it's a compiler barrier. I'm not working on this at this at the
> moment but when I get to it, I'll compare the object files and see
> if there are relevant differences. Could be tomorrow before I get
> the chance again.

clear_huge_page called by do_huge_pmd_anonymous_page is an external
function (not static, so gcc can't make assumptions about it) and that is
fully equivalent to a barrier() after the function returns, so the
relevance of an smp_wmb on x86, SMP or !SMP, would be zero (unless
X86_OOSTORE is set, which I think it is not, and that would only apply to
SMP).
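
As a hedged sketch of that point (hypothetical function names, not the
kernel's): a call to a function whose body the compiler cannot see forces
it to assume arbitrary memory may be read or written, so it already acts
as a compiler barrier and stores cannot be reordered across the call at
compile time.

	/* Declared but not defined in this translation unit, so opaque to gcc. */
	extern void opaque_clear(void *addr, unsigned long size);

	void example(unsigned long *pmdp, void *page)
	{
		opaque_clear(page, 4096);	/* same compile-time effect as barrier() */
		*pmdp = 1;			/* cannot be hoisted above the call */
	}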

> > > page_add_new_anon_rmap(page, vma, haddr);
> > > set_pmd_at(mm, haddr, pmd, entry);
> > > prepare_pmd_huge_pte(pgtable, mm);
> > > @@ -753,6 +755,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > > pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > > pmd = pmd_mkold(pmd_wrprotect(pmd));
> > > +
> > > + /*
> > > + * Write barrier to make sure the setup for the PMD is fully visible
> > > + * before the set_pmd_at
> > > + */
> > > + smp_wmb();
> > > +
> > > set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > > prepare_pmd_huge_pte(pgtable, dst_mm);
> >
> > This part seems superfluous to me, it's also noop for !SMP.
>
> Other than being a compiler barrier.

Yes, but my point is this is ok to be cached in registers; the pmd
setup doesn't need to hit main memory to be safe, it's local.

pmdp_set_wrprotect is done with a clear_bit, and the dependency of the
code will require reading that after the clear_bit on !SMP. Not sure
how a barrier() above can possibly matter.

> > Only wmb()
> > would stay. the pmd is perfectly fine to stay in a register, not even
> > a compiler barrier is needed, even less a smp serialization.
>
> There is an explanation in here somewhere because as I write this,
> the test machine has survived 14 hours under continual stress without
> the isolated counters going negative with over 128 million pages
> successfully migrated and a million pages failed to migrate due to
> direct compaction being called 80,000 times. It's possible it's a
> co-incidence but it's some co-incidence!

No idea...

2011-06-02 18:23:36

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 11:38:30PM +0900, Minchan Kim wrote:
> > Yes. You find a new BUG.
> > It seems to be related to this problem but it should be solved although
>
> typo : It doesn't seem to be.

This should fix it, but I doubt it matters for this problem.

===
Subject: mm: no page_count without a page pin

From: Andrea Arcangeli <[email protected]>

It's unsafe to run page_count during the physical pfn scan.

Signed-off-by: Andrea Arcangeli <[email protected]>
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index faa0a08..e41e78a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1124,8 +1124,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
nr_lumpy_dirty++;
scan++;
} else {
- /* the page is freed already. */
- if (!page_count(cursor_page))
+ /*
+ * We can't use page_count() as that
+ * requires compound_head and we don't
+ * have a pin on the page here. If a
+ * page is tail, we may or may not
+ * have isolated the head, so assume
+ * it's not free, it'd be tricky to
+ * track the head status without a
+ * page pin.
+ */
+ if (!PageTail(cursor_page) &&
+ !atomic_read(&cursor_page->_count))
continue;
break;
}

2011-06-02 20:22:10

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 08:23:02PM +0200, Andrea Arcangeli wrote:
> On Tue, May 31, 2011 at 11:38:30PM +0900, Minchan Kim wrote:
> > > Yes. You find a new BUG.
> > > It seems to be related to this problem but it should be solved although
> >
> > typo : It doesn't seem to be.
>
> This should fix it, but I doubt it matters for this problem.
>
> ===
> Subject: mm: no page_count without a page pin
>
> From: Andrea Arcangeli <[email protected]>
>
> It's unsafe to run page_count during the physical pfn scan.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index faa0a08..e41e78a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1124,8 +1124,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> nr_lumpy_dirty++;
> scan++;
> } else {
> - /* the page is freed already. */
> - if (!page_count(cursor_page))
> + /*
> + * We can't use page_count() as that
> + * requires compound_head and we don't
> + * have a pin on the page here. If a
> + * page is tail, we may or may not
> + * have isolated the head, so assume
> + * it's not free, it'd be tricky to

Isn't it rather aggressive?
I think the cursor page is likely to be PageTail rather than PageHead.
Could we handle it simply with the code below?

get_page(cursor_page)
/* The page is freed already */
if (1 == page_count(cursor_page)) {
put_page(cursor_page)
continue;
}
put_page(cursor_page);


> + * track the head status without a
> + * page pin.
> + */
> + if (!PageTail(cursor_page) &&
> + !atomic_read(&cursor_page->_count))
> continue;
> break;

> }

--
Kind regards
Minchan Kim

2011-06-02 20:59:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 05:21:56AM +0900, Minchan Kim wrote:
> On Thu, Jun 02, 2011 at 08:23:02PM +0200, Andrea Arcangeli wrote:
> > On Tue, May 31, 2011 at 11:38:30PM +0900, Minchan Kim wrote:
> > > > Yes. You find a new BUG.
> > > > It seems to be related to this problem but it should be solved although
> > >
> > > typo : It doesn't seem to be.
> >
> > This should fix it, but I doubt it matters for this problem.
> >
> > ===
> > Subject: mm: no page_count without a page pin
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > It's unsafe to run page_count during the physical pfn scan.
> >
> > Signed-off-by: Andrea Arcangeli <[email protected]>
> > ---
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index faa0a08..e41e78a 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1124,8 +1124,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > nr_lumpy_dirty++;
> > scan++;
> > } else {
> > - /* the page is freed already. */
> > - if (!page_count(cursor_page))
> > + /*
> > + * We can't use page_count() as that
> > + * requires compound_head and we don't
> > + * have a pin on the page here. If a
> > + * page is tail, we may or may not
> > + * have isolated the head, so assume
> > + * it's not free, it'd be tricky to
>
> Isn't it rather aggressive?
> I think cursor page is likely to be PageTail rather than PageHead.
> Could we handle it simply with below code?
>
> get_page(cursor_page)
> /* The page is freed already */
> if (1 == page_count(cursor_page)) {
> put_page(cursor_page)
> continue;
> }
> put_page(cursor_page);
>

Now that I look at the code more, it would hit the VM_BUG_ON in get_page if the page is really
freed. I think if we hold zone->lock to prevent racing with prep_new_page, it would be okay.
But it's rather overkill, so I will add my sign-off to your patch if we don't have a better idea
by tomorrow. :)

--
Kind regards
Minchan Kim

2011-06-02 21:41:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hello Minchan,

On Fri, Jun 03, 2011 at 05:21:56AM +0900, Minchan Kim wrote:
> Isn't it rather aggressive?
> I think cursor page is likely to be PageTail rather than PageHead.
> Could we handle it simply with below code?

It's not so likely: there is a small percentage of compound pages that
aren't THP compared to the rest, which is either regular pagecache,
regular anon, anon THP or regular shm. If it's THP, chances are we
isolated the head and it's useless to insist on more tail pages (at
least for a large page size like on x86). Plus we've got compaction, so
insisting and screwing up lru ordering isn't worth it; better to be
permissive and abort... in fact I wouldn't mind removing the
entire lumpy logic when COMPACTION_BUILD is true, but that alters the
trace too...

> get_page(cursor_page)
> /* The page is freed already */
> if (1 == page_count(cursor_page)) {
> put_page(cursor_page)
> continue;
> }
> put_page(cursor_page);

We can't call get_page on a tail page or we break split_huge_page;
only an isolated lru page can have its count boosted. If we take the
lru_lock and we check the page is on the lru, then we can isolate and
pin it safely.
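
For reference, a hedged, kernel-context sketch of the safe pattern being
described (hypothetical helper name, assuming the 2.6.39-era lru APIs;
error handling omitted): confirm the page is still on an lru under the
lru_lock before taking the reference.

	#include <linux/mm.h>
	#include <linux/mm_inline.h>

	static int pin_lru_page(struct zone *zone, struct page *page)
	{
		int pinned = 0;

		spin_lock_irq(&zone->lru_lock);
		if (PageLRU(page)) {
			/* The page is on the lru, so it is neither free nor a
			 * tail page; taking a reference here is safe. */
			get_page(page);
			ClearPageLRU(page);
			del_page_from_lru_list(zone, page, page_lru(page));
			pinned = 1;
		}
		spin_unlock_irq(&zone->lru_lock);
		return pinned;
	}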

2011-06-02 22:04:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 05:59:13AM +0900, Minchan Kim wrote:
> Now that I look code more, it would meet VM_BUG_ON of get_page if the page is really
> freed. I think if we hold zone->lock to prevent prep_new_page racing, it would be okay.

There would be problems with split_huge_page too: we can't even use
get_page_unless_zero unless it's an lru page and we hold the lru_lock,
and that's a hot lock too.

> But it's rather overkill so I will add my sign to your patch if we don't have better idea
> until tomorrow. :)

Things like compound_trans_head are made to protect against
split_huge_page, like in ksm, not exactly to get to the head page when
the page is being freed, so it's a little tricky. If we could get to
the head page safely starting from a tail page it'd solve some issues
for memory-failure too, which is currently using compound_head unsafely
as well, but at least that's running after a catastrophic hardware failure
so the safer the better, and the little race is unlikely to ever be an
issue for memory-failure (and it's the same issue for hugetlbfs and slub
order 3).

2011-06-02 22:23:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Andrea,

On Fri, Jun 3, 2011 at 6:40 AM, Andrea Arcangeli <[email protected]> wrote:
> Hello Minchan,
>
> On Fri, Jun 03, 2011 at 05:21:56AM +0900, Minchan Kim wrote:
>> Isn't it rather aggressive?
>> I think cursor page is likely to be PageTail rather than PageHead.
>> Could we handle it simply with below code?
>
> It's not so likely, there is small percentage of compound pages that
> aren't THP compared to the rest that is either regular pagecache or
> anon regular or anon THP or regular shm. If it's THP chances are we

I mean we have more tail pages than head pages, so I think we are likely to
meet tail pages. Of course, compared to all pages (page cache, anon and
so on), compound pages would be a very small percentage.

> isolated the head and it's useless to insist on more tail pages (at
> least for large page size like on x86). Plus we've compaction so

I can't understand your point. Could you elaborate it?

> insisting and screwing lru ordering isn't worth it, better to be
> permissive and abort... in fact I wouldn't dislike to remove the
> entire lumpy logic when COMPACTION_BUILD is true, but that alters the
> trace too...

AFAIK, that's the final destination, as compaction will not break lru
ordering once my patch (inorder-putback) is merged.

>
>> get_page(cursor_page)
>> /* The page is freed already */
>> if (1 == page_count(cursor_page)) {
>>       put_page(cursor_page)
>>       continue;
>> }
>> put_page(cursor_page);
>
> We can't call get_page on an tail page or we break split_huge_page,

Why don't we call get_page on a tail page if the tail page isn't free?
Maybe I need to investigate split_huge_page.

> only an isolated lru can be boosted, if we take the lru_lock and we
> check the page is in lru, then we can isolate and pin it safely.
>

Thanks.


--
Kind regards,
Minchan Kim

2011-06-02 22:33:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
> I mean we have more tail pages than head pages. So I think we are likely to
> meet tail pages. Of course, compared to all pages(page cache, anon and
> so on), compound pages would be very small percentage.

Yes that's my point, that being a small percentage it's no big deal to
break the loop early.

> > isolated the head and it's useless to insist on more tail pages (at
> > least for large page size like on x86). Plus we've compaction so
>
> I can't understand your point. Could you elaborate it?

What I meant is that if we already isolated the head page of the THP,
we don't need to try to free the tail pages, and breaking the loop
early will still give us a chance to free a whole 2m because we
isolated the head page (it'll involve some work and swapping, but if it
was a compound trans page we're ok to break the loop and we're not
making the logic any worse). Provided the PMD_SIZE is quite large, like
2/4m...

The only way this patch makes things worse is for a slub order 3 page in
the process of being freed. But tail pages aren't generally free anyway,
so I doubt this really makes any difference; plus the tail bit is getting
cleared as soon as the page reaches the buddy, so it's probably
unnoticeable, as this then makes a difference only during a race (plus
the tail page can't be isolated; only head pages can be part of lrus,
and only if they're THP).

> > insisting and screwing lru ordering isn't worth it, better to be
> > permissive and abort... in fact I wouldn't dislike to remove the
> > entire lumpy logic when COMPACTION_BUILD is true, but that alters the
> > trace too...
>
> AFAIK, it's final destination to go as compaction will not break lru
> ordering if my patch(inorder-putback) is merged.

Agreed. I like your patchset, sorry for not having reviewed it in
detail yet but there were other issues popping up in the last few
days.

> >> get_page(cursor_page)
> >> /* The page is freed already */
> >> if (1 == page_count(cursor_page)) {
> >>       put_page(cursor_page)
> >>       continue;
> >> }
> >> put_page(cursor_page);
> >
> > We can't call get_page on an tail page or we break split_huge_page,
>
> Why don't we call get_page on tail page if tail page isn't free?
> Maybe I need investigating split_huge_page.

Yes, it's split_huge_page: only gup is allowed to increase the tail
page count, because we're guaranteed that while gup_fast does it,
split_huge_page_refcount isn't running yet, because the pmd wasn't
set as splitting and the irqs were disabled (or we'd be holding the
page_table_lock for the gup slow version after checking again that the pmd
wasn't splitting, and so __split_huge_page_refcount will wait).

Thanks,
Andrea

2011-06-02 23:01:46

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
>> I mean we have more tail pages than head pages. So I think we are likely to
>> meet tail pages. Of course, compared to all pages(page cache, anon and
>> so on), compound pages would be very small percentage.
>
> Yes that's my point, that being a small percentage it's no big deal to
> break the loop early.

Indeed.

>
>> > isolated the head and it's useless to insist on more tail pages (at
>> > least for large page size like on x86). Plus we've compaction so
>>
>> I can't understand your point. Could you elaborate it?
>
> What I meant is that if we already isolated the head page of the THP,
> we don't need to try to free the tail pages and breaking the loop
> early, will still give us a chance to free a whole 2m because we
> isolated the head page (it'll involve some work and swapping but if it
> was a compoundtranspage we're ok to break the loop and we're not
> making the logic any worse). Provided the PMD_SIZE is quite large like
> 2/4m...

Do you want this? (it's almost pseudo-code)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7a4469b..9d7609f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1017,7 +1017,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
struct page *page;
unsigned long pfn;
- unsigned long end_pfn;
+ unsigned long start_pfn, end_pfn;
unsigned long page_pfn;
int zone_id;

@@ -1057,9 +1057,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
*/
zone_id = page_zone_id(page);
page_pfn = page_to_pfn(page);
- pfn = page_pfn & ~((1 << order) - 1);
+ start_pfn = pfn = page_pfn & ~((1 << order) - 1);
end_pfn = pfn + (1 << order);
- for (; pfn < end_pfn; pfn++) {
+ while (pfn < end_pfn) {
struct page *cursor_page;

/* The target page is in the block, ignore it. */
@@ -1086,17 +1086,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
break;

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ int isolated_pages;
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
- nr_taken += hpage_nr_pages(page);
+ isolated_pages = hpage_nr_pages(page);
+ nr_taken += isolated_pages;
+ /* if we isolated pages enough, let's break early */
+ if (nr_taken > end_pfn - start_pfn)
+ break;
+ pfn += isolated_pages;
nr_lumpy_taken++;
if (PageDirty(cursor_page))
nr_lumpy_dirty++;
scan++;
} else {
/* the page is freed already. */
- if (!page_count(cursor_page))
+ if (!page_count(cursor_page)) {
+ pfn++;
continue;
+ }
break;
}
}

>
> The only way this patch makes things worse is for slub order 3 in the
> process of being freed. But tail pages aren't generally free anyway so
> I doubt this really makes any difference plus the tail is getting
> cleared as soon as the page reaches the buddy so it's probably

Okay. Considering that PG_tail gets cleared as soon as a slub order-3 page
is freed, it would be a very rare case.

> unnoticeable as this then makes a difference only during a race (plus
> the tail page can't be isolated, only head page can be part of lrus
> and only if they're THP).
>
>> > insisting and screwing lru ordering isn't worth it, better to be
>> > permissive and abort... in fact I wouldn't dislike to remove the
>> > entire lumpy logic when COMPACTION_BUILD is true, but that alters the
>> > trace too...
>>
>> AFAIK, it's final destination to go as compaction will not break lru
>> ordering if my patch(inorder-putback) is merged.
>
> Agreed. I like your patchset, sorry for not having reviewed it in
> detail yet but there were other issues popping up in the last few
> days.

No problem. It's more urgent than mine. :)

>
>> >> get_page(cursor_page)
>> >> /* The page is freed already */
>> >> if (1 == page_count(cursor_page)) {
>> >>       put_page(cursor_page)
>> >>       continue;
>> >> }
>> >> put_page(cursor_page);
>> >
>> > We can't call get_page on an tail page or we break split_huge_page,
>>
>> Why don't we call get_page on tail page if tail page isn't free?
>> Maybe I need investigating split_huge_page.
>
> Yes it's split_huge_page, only gup is allowed to increase the tail
> page because we're guaranteed while gup_fast does it,
> split_huge_page_refcount isn't running yet, because the pmd wasn't
> set as splitting and the irqs were disabled (or we'd be holding the
> page_table_lock for gup slow version after checking again the pmd
> wasn't splitting and so __split_huge_page_refcount will wait).

Thanks. I will take some time to understand your point by reviewing
split_huge_page and this comment of yours.

You convinced me and made me think about things I hadn't thought of, which
are good points.
Thanks, Andrea.

>
> Thanks,
> Andrea
>



--
Kind regards,
Minchan Kim

2011-06-02 23:02:37

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 3, 2011 at 3:23 AM, Andrea Arcangeli <[email protected]> wrote:
> On Tue, May 31, 2011 at 11:38:30PM +0900, Minchan Kim wrote:
>> > Yes. You find a new BUG.
>> > It seems to be related to this problem but it should be solved although
>>
>>  typo : It doesn't seem to be.
>
> This should fix it, but I doubt it matters for this problem.
>
> ===
> Subject: mm: no page_count without a page pin
>
> From: Andrea Arcangeli <[email protected]>
>
> It's unsafe to run page_count during the physical pfn scan.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Nitpick:
I want to keep the "/* the page is freed already. */" comment.

--
Kind regards,
Minchan Kim

2011-06-03 02:09:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > There is an explanation in here somewhere because as I write this,
> > the test machine has survived 14 hours under continual stress without
> > the isolated counters going negative with over 128 million pages
> > successfully migrated and a million pages failed to migrate due to
> > direct compaction being called 80,000 times. It's possible it's a
> > co-incidence but it's some co-incidence!
>
> No idea...

I wasn't able to work on this for most of the day, but I was looking at it
more closely again this evening and I think I might have come up with
another theory for what could be causing this problem.

When THP is isolating pages, it accounts for the pages isolated against
the zone of course. If it backs out, it finds the pages from the PTEs.
On !SMP but PREEMPT, we may not have adequate protection against a new
page from a different zone being inserted into the PTE causing us to
decrement against the wrong zone. While the global counter is fine,
the per-zone counters look corrupted. You'd still think it was the
anon counter that got screwed rather than the file one if it really was
THP, unfortunately, so it's not the full picture. I'm going to start
a test monitoring both zoneinfo and vmstat to see if vmstat looks
fine while the per-zone counters that are negative are offset by a
positive count on the other zones that when added together become 0.
Hopefully it'll actually trigger overnight :/

--
Mel Gorman
SUSE Labs

2011-06-03 14:49:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 03:09:20AM +0100, Mel Gorman wrote:
> On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > > There is an explanation in here somewhere because as I write this,
> > > the test machine has survived 14 hours under continual stress without
> > > the isolated counters going negative with over 128 million pages
> > > successfully migrated and a million pages failed to migrate due to
> > > direct compaction being called 80,000 times. It's possible it's a
> > > co-incidence but it's some co-incidence!
> >
> > No idea...
>
> I wasn't able to work on this most of the day but was looking at this
> closer this evening again and I think I might have thought of another
> theory that could cause this problem.
>
> When THP is isolating pages, it accounts for the pages isolated against
> the zone of course. If it backs out, it finds the pages from the PTEs.
> On !SMP but PREEMPT, we may not have adequate protection against a new
> page from a different zone being inserted into the PTE causing us to
> decrement against the wrong zone. While the global counter is fine,
> the per-zone counters look corrupted. You'd still think it was the
> anon counter tht got screwed rather than the file one if it really was
> THP unfortunately so it's not the full picture. I'm going to start
> a test monitoring both zoneinfo and vmstat to see if vmstat looks
> fine while the per-zone counters that are negative are offset by a
> positive count on the other zones that when added together become 0.
> Hopefully it'll actually trigger overnight :/
>

Right idea about the wrong zone being accounted, but the wrong place. I
think the following patch should fix the problem:

==== CUT HERE ===
mm: compaction: Ensure that the compaction free scanner does not move to the next zone

Compaction works with two scanners, a migration and a free
scanner. When the scanners crossover, migration within the zone is
complete. The location of the scanner is recorded on each cycle to
avoid excessive scanning.

When a zone is small and mostly reserved, it's very easy for the
migration scanner to be close to the end of the zone. Then the following
situation can occur:

o migration scanner isolates some pages near the end of the zone
o free scanner starts at the end of the zone but finds that the
migration scanner is already there
o free scanner gets reinitialised for the next cycle as
cc->migrate_pfn + pageblock_nr_pages
moving the free scanner into the next zone
o migration scanner moves into the next zone but continues accounting
against the old zone

When this happens, NR_ISOLATED accounting goes haywire because some
of the accounting happens against the wrong zone. One zone's counter
remains positive while the other goes negative, even though the overall
global count is accurate. This was reported on X86-32 with !SMP because
!SMP allows the negative counters to be visible. The fact that it is
difficult to reproduce on X86-64 is probably just a coincidence, as
the bug should theoretically be possible there.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/compaction.c | 13 ++++++++++++-
1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a4337bc..ec1ed3b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -144,9 +144,20 @@ static void isolate_freepages(struct zone *zone,
int nr_freepages = cc->nr_freepages;
struct list_head *freelist = &cc->freepages;

+ /*
+ * Initialise the free scanner. The starting point is where we last
+ * scanned from (or the end of the zone if starting). The low point
+ * is the end of the pageblock the migration scanner is using.
+ */
pfn = cc->free_pfn;
low_pfn = cc->migrate_pfn + pageblock_nr_pages;
- high_pfn = low_pfn;
+
+ /*
+ * Take care that if the migration scanner is at the end of the zone
+ * that the free scanner does not accidentally move to the next zone
+ * in the next isolation cycle.
+ */
+ high_pfn = min(low_pfn, pfn);

/*
* Isolate free pages until enough are available to migrate the

2011-06-03 15:46:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> Right idea of the wrong zone being accounted for but wrong place. I
> think the following patch should fix the problem;

Looks good thanks.

During my debugging I also found this bug, which made NR_SHMEM underflow.

===
Subject: migrate: don't account swapcache as shmem

From: Andrea Arcangeli <[email protected]>

swapcache will reach the below code path in migrate_page_move_mapping,
and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
NR_SHMEM.

Signed-off-by: Andrea Arcangeli <[email protected]>
---

diff --git a/mm/migrate.c b/mm/migrate.c
index e4a5c91..2597a27 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -288,7 +288,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
*/
__dec_zone_page_state(page, NR_FILE_PAGES);
__inc_zone_page_state(newpage, NR_FILE_PAGES);
- if (PageSwapBacked(page)) {
+ if (mapping != &swapper_space && PageSwapBacked(page)) {
__dec_zone_page_state(page, NR_SHMEM);
__inc_zone_page_state(newpage, NR_SHMEM);
}

2011-06-03 17:37:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> Do you want this? (it's almost pseudo-code)

Yes, that's a good idea, so we at least take into account whether we isolated
something big, and it's pointless to insist on wasting CPU on the tail
pages and even trace a failure because of tail pages after it.

I introduced a __page_count to increase readability. It's still
hackish to work on subpages in vmscan.c, but at least I added a comment,
and until we serialize destroy_compound_page vs compound_head, I guess
there's no better way. I didn't attempt to add out-of-order
serialization similar to what exists for split_huge_page vs
compound_trans_head yet, as the page can be allocated or go away from
under us; in split_huge_page vs compound_trans_head it's simpler
because both callers are required to hold a pin on the page, so the
page can't be reallocated and destroyed under it.

===
Subject: mm: no page_count without a page pin

From: Andrea Arcangeli <[email protected]>

It's unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading page->first_page if
the compound page is being freed by another CPU. Also properly take into
account if we isolated a compound page during the scan and break the loop if
we've isolated enough. Introduce __page_count to clean up some open-coded
atomic_read(&page->_count) uses in common code.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/powerpc/mm/gup.c | 2 -
arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
arch/x86/mm/gup.c | 2 -
fs/nilfs2/page.c | 2 -
include/linux/mm.h | 13 +++++++---
mm/huge_memory.c | 4 +--
mm/internal.h | 2 -
mm/page_alloc.c | 6 ++--
mm/swap.c | 4 +--
mm/vmscan.c | 33 ++++++++++++++++++++-------
10 files changed, 46 insertions(+), 24 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
struct page *page;
unsigned long pfn;
- unsigned long end_pfn;
+ unsigned long start_pfn, end_pfn;
unsigned long page_pfn;
int zone_id;

@@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
*/
zone_id = page_zone_id(page);
page_pfn = page_to_pfn(page);
- pfn = page_pfn & ~((1 << order) - 1);
+ start_pfn = page_pfn & ~((1 << order) - 1);
- end_pfn = pfn + (1 << order);
+ end_pfn = start_pfn + (1 << order);
- for (; pfn < end_pfn; pfn++) {
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
struct page *cursor_page;

/* The target page is in the block, ignore it. */
@@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
break;

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ unsigned int isolated_pages;
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
- nr_taken += hpage_nr_pages(page);
- nr_lumpy_taken++;
+ isolated_pages = hpage_nr_pages(page);
+ nr_taken += isolated_pages;
+ nr_lumpy_taken += isolated_pages;
if (PageDirty(cursor_page))
- nr_lumpy_dirty++;
+ nr_lumpy_dirty += isolated_pages;
scan++;
+ pfn += isolated_pages-1;
+ VM_BUG_ON(!isolated_pages);
+ VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
} else {
- /* the page is freed already. */
- if (!page_count(cursor_page))
+ /*
+ * Check if the page is freed already.
+ *
+ * We can't use page_count() as that
+ * requires compound_head and we don't
+ * have a pin on the page here. If a
+ * page is tail, we may or may not
+ * have isolated the head, so assume
+ * it's not free, it'd be tricky to
+ * track the head status without a
+ * page pin.
+ */
+ if (!PageTail(cursor_page) &&
+ !__page_count(cursor_page))
continue;
break;
}
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -271,7 +271,7 @@ struct inode;
*/
static inline int put_page_testzero(struct page *page)
{
- VM_BUG_ON(atomic_read(&page->_count) == 0);
+ VM_BUG_ON(__page_count(page) == 0);
return atomic_dec_and_test(&page->_count);
}

@@ -355,9 +355,14 @@ static inline struct page *compound_head
return page;
}

+static inline int __page_count(struct page *page)
+{
+ return atomic_read(&page->_count);
+}
+
static inline int page_count(struct page *page)
{
- return atomic_read(&compound_head(page)->_count);
+ return __page_count(compound_head(page));
}

static inline void get_page(struct page *page)
@@ -370,7 +375,7 @@ static inline void get_page(struct page
* bugcheck only verifies that the page->_count isn't
* negative.
*/
- VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+ VM_BUG_ON(__page_count(page) < !PageTail(page));
atomic_inc(&page->_count);
/*
* Getting a tail page will elevate both the head and tail
@@ -382,7 +387,7 @@ static inline void get_page(struct page
* __split_huge_page_refcount can't run under
* get_page().
*/
- VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+ VM_BUG_ON(__page_count(page->first_page) <= 0);
atomic_inc(&page->first_page->_count);
}
}
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1203,10 +1203,10 @@ static void __split_huge_page_refcount(s
struct page *page_tail = page + i;

/* tail_page->_count cannot change */
- atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+ atomic_sub(__page_count(page_tail), &page->_count);
BUG_ON(page_count(page) <= 0);
atomic_add(page_mapcount(page) + 1, &page_tail->_count);
- BUG_ON(atomic_read(&page_tail->_count) <= 0);
+ BUG_ON(__page_count(page_tail) <= 0);

/* after clearing PageTail the gup refcount can be released */
smp_mb();
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,7 +22,7 @@ static inline void get_huge_page_tail(st
* __split_huge_page_refcount() cannot run
* from under us.
*/
- VM_BUG_ON(atomic_read(&page->_count) < 0);
+ VM_BUG_ON(__page_count(page) < 0);
atomic_inc(&page->_count);
}

--- a/arch/powerpc/platforms/512x/mpc512x_shared.c
+++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
@@ -200,7 +200,7 @@ static inline void mpc512x_free_bootmem(
{
__ClearPageReserved(page);
BUG_ON(PageTail(page));
- BUG_ON(atomic_read(&page->_count) > 1);
+ BUG_ON(__page_count(page) > 1);
atomic_set(&page->_count, 1);
__free_page(page);
totalram_pages++;
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,7 +114,7 @@ static inline void get_huge_page_tail(st
* __split_huge_page_refcount() cannot run
* from under us.
*/
- VM_BUG_ON(atomic_read(&page->_count) < 0);
+ VM_BUG_ON(__page_count(page) < 0);
atomic_inc(&page->_count);
}

--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -181,7 +181,7 @@ void nilfs_page_bug(struct page *page)

printk(KERN_CRIT "NILFS_PAGE_BUG(%p): cnt=%d index#=%llu flags=0x%lx "
"mapping=%p ino=%lu\n",
- page, atomic_read(&page->_count),
+ page, __page_count(page),
(unsigned long long)page->index, page->flags, m, ino);

if (page_has_buffers(page)) {
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -28,7 +28,7 @@ static inline void set_page_count(struct
static inline void set_page_refcounted(struct page *page)
{
VM_BUG_ON(PageTail(page));
- VM_BUG_ON(atomic_read(&page->_count));
+ VM_BUG_ON(__page_count(page));
set_page_count(page, 1);
}

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -568,7 +568,7 @@ static inline int free_pages_check(struc
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
+ (__page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
@@ -758,7 +758,7 @@ static inline int check_new_page(struct
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
+ (__page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
@@ -5739,7 +5739,7 @@ void dump_page(struct page *page)
{
printk(KERN_ALERT
"page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
- page, atomic_read(&page->_count), page_mapcount(page),
+ page, __page_count(page), page_mapcount(page),
page->mapping, page->index);
dump_page_flags(page->flags);
mem_cgroup_print_bad_page(page);
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -128,9 +128,9 @@ static void put_compound_page(struct pag
if (put_page_testzero(page_head))
VM_BUG_ON(1);
/* __split_huge_page_refcount will wait now */
- VM_BUG_ON(atomic_read(&page->_count) <= 0);
+ VM_BUG_ON(__page_count(page) <= 0);
atomic_dec(&page->_count);
- VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+ VM_BUG_ON(__page_count(page_head) <= 0);
compound_unlock_irqrestore(page_head, flags);
if (put_page_testzero(page_head)) {
if (PageHead(page_head))
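
For readers diffing the two postings: the resend in the next message shuffles
three things relative to this version, all visible by comparing the hunks.
__page_count() is moved above put_page_testzero() in include/linux/mm.h so its
first user can see it, end_pfn is computed from the new start_pfn rather than
the no-longer-initialised pfn, and __page_count() is passed cursor_page instead
of &cursor_page:

	/* this version */
	end_pfn = pfn + (1 << order);
	if (!PageTail(cursor_page) && !__page_count(&cursor_page))

	/* resend */
	end_pfn = start_pfn + (1 << order);
	if (!PageTail(cursor_page) && !__page_count(cursor_page))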

2011-06-03 18:08:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 07:37:07PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > Do you want this? (it's almost pseudo-code)
>
> Yes that's good idea so we at least take into account if we isolated
> something big, and it's pointless to insist wasting CPU on the tail
> pages and even trace a fail because of tail pages after it.
>
> I introduced a __page_count to increase readability. It's still
> hackish to work on subpages in vmscan.c but at least I added a comment
> and until we serialize destroy_compound_page vs compound_head, I guess
> there's no better way. I didn't attempt to add out of order
> serialization similar to what exists for split_huge_page vs
> compound_trans_head yet, as the page can be allocated or go away from
> under us, in split_huge_page vs compound_trans_head it's simpler
> because both callers are required to hold a pin on the page so the
> page can't go be reallocated and destroyed under it.

Sent too fast... had to shuffle a few things around... trying again.

===
Subject: mm: no page_count without a page pin

From: Andrea Arcangeli <[email protected]>

It's unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading page->first_page if
the compound page is being freed by another CPU. Also properly take into
account if we isolated a compound page during the scan and break the loop if
we've isolated enough. Introduce __page_count to clean up some atomic_read calls on
&page->_count in common code.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/powerpc/mm/gup.c | 2 -
arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
arch/x86/mm/gup.c | 2 -
fs/nilfs2/page.c | 2 -
include/linux/mm.h | 13 ++++++----
mm/huge_memory.c | 4 +--
mm/internal.h | 2 -
mm/page_alloc.c | 6 ++--
mm/swap.c | 4 +--
mm/vmscan.c | 35 ++++++++++++++++++++-------
10 files changed, 47 insertions(+), 25 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
struct page *page;
unsigned long pfn;
- unsigned long end_pfn;
+ unsigned long start_pfn, end_pfn;
unsigned long page_pfn;
int zone_id;

@@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
*/
zone_id = page_zone_id(page);
page_pfn = page_to_pfn(page);
- pfn = page_pfn & ~((1 << order) - 1);
- end_pfn = pfn + (1 << order);
- for (; pfn < end_pfn; pfn++) {
+ start_pfn = page_pfn & ~((1 << order) - 1);
+ end_pfn = start_pfn + (1 << order);
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
struct page *cursor_page;

/* The target page is in the block, ignore it. */
@@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
break;

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ unsigned int isolated_pages;
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
- nr_taken += hpage_nr_pages(page);
- nr_lumpy_taken++;
+ isolated_pages = hpage_nr_pages(page);
+ nr_taken += isolated_pages;
+ nr_lumpy_taken += isolated_pages;
if (PageDirty(cursor_page))
- nr_lumpy_dirty++;
+ nr_lumpy_dirty += isolated_pages;
scan++;
+ pfn += isolated_pages-1;
+ VM_BUG_ON(!isolated_pages);
+ VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
} else {
- /* the page is freed already. */
- if (!page_count(cursor_page))
+ /*
+ * Check if the page is freed already.
+ *
+ * We can't use page_count() as that
+ * requires compound_head and we don't
+ * have a pin on the page here. If a
+ * page is tail, we may or may not
+ * have isolated the head, so assume
+ * it's not free, it'd be tricky to
+ * track the head status without a
+ * page pin.
+ */
+ if (!PageTail(cursor_page) &&
+ !__page_count(cursor_page))
continue;
break;
}
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -266,12 +266,17 @@ struct inode;
* routine so they can be sure the page doesn't go away from under them.
*/

+static inline int __page_count(struct page *page)
+{
+ return atomic_read(&page->_count);
+}
+
/*
* Drop a ref, return true if the refcount fell to zero (the page has no users)
*/
static inline int put_page_testzero(struct page *page)
{
- VM_BUG_ON(atomic_read(&page->_count) == 0);
+ VM_BUG_ON(__page_count(page) == 0);
return atomic_dec_and_test(&page->_count);
}

@@ -357,7 +362,7 @@ static inline struct page *compound_head

static inline int page_count(struct page *page)
{
- return atomic_read(&compound_head(page)->_count);
+ return __page_count(compound_head(page));
}

static inline void get_page(struct page *page)
@@ -370,7 +375,7 @@ static inline void get_page(struct page
* bugcheck only verifies that the page->_count isn't
* negative.
*/
- VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+ VM_BUG_ON(__page_count(page) < !PageTail(page));
atomic_inc(&page->_count);
/*
* Getting a tail page will elevate both the head and tail
@@ -382,7 +387,7 @@ static inline void get_page(struct page
* __split_huge_page_refcount can't run under
* get_page().
*/
- VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+ VM_BUG_ON(__page_count(page->first_page) <= 0);
atomic_inc(&page->first_page->_count);
}
}
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1203,10 +1203,10 @@ static void __split_huge_page_refcount(s
struct page *page_tail = page + i;

/* tail_page->_count cannot change */
- atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+ atomic_sub(__page_count(page_tail), &page->_count);
BUG_ON(page_count(page) <= 0);
atomic_add(page_mapcount(page) + 1, &page_tail->_count);
- BUG_ON(atomic_read(&page_tail->_count) <= 0);
+ BUG_ON(__page_count(page_tail) <= 0);

/* after clearing PageTail the gup refcount can be released */
smp_mb();
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,7 +22,7 @@ static inline void get_huge_page_tail(st
* __split_huge_page_refcount() cannot run
* from under us.
*/
- VM_BUG_ON(atomic_read(&page->_count) < 0);
+ VM_BUG_ON(__page_count(page) < 0);
atomic_inc(&page->_count);
}

--- a/arch/powerpc/platforms/512x/mpc512x_shared.c
+++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
@@ -200,7 +200,7 @@ static inline void mpc512x_free_bootmem(
{
__ClearPageReserved(page);
BUG_ON(PageTail(page));
- BUG_ON(atomic_read(&page->_count) > 1);
+ BUG_ON(__page_count(page) > 1);
atomic_set(&page->_count, 1);
__free_page(page);
totalram_pages++;
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,7 +114,7 @@ static inline void get_huge_page_tail(st
* __split_huge_page_refcount() cannot run
* from under us.
*/
- VM_BUG_ON(atomic_read(&page->_count) < 0);
+ VM_BUG_ON(__page_count(page) < 0);
atomic_inc(&page->_count);
}

--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -181,7 +181,7 @@ void nilfs_page_bug(struct page *page)

printk(KERN_CRIT "NILFS_PAGE_BUG(%p): cnt=%d index#=%llu flags=0x%lx "
"mapping=%p ino=%lu\n",
- page, atomic_read(&page->_count),
+ page, __page_count(page),
(unsigned long long)page->index, page->flags, m, ino);

if (page_has_buffers(page)) {
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -28,7 +28,7 @@ static inline void set_page_count(struct
static inline void set_page_refcounted(struct page *page)
{
VM_BUG_ON(PageTail(page));
- VM_BUG_ON(atomic_read(&page->_count));
+ VM_BUG_ON(__page_count(page));
set_page_count(page, 1);
}

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -568,7 +568,7 @@ static inline int free_pages_check(struc
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
+ (__page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
@@ -758,7 +758,7 @@ static inline int check_new_page(struct
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
+ (__page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
@@ -5739,7 +5739,7 @@ void dump_page(struct page *page)
{
printk(KERN_ALERT
"page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
- page, atomic_read(&page->_count), page_mapcount(page),
+ page, __page_count(page), page_mapcount(page),
page->mapping, page->index);
dump_page_flags(page->flags);
mem_cgroup_print_bad_page(page);
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -128,9 +128,9 @@ static void put_compound_page(struct pag
if (put_page_testzero(page_head))
VM_BUG_ON(1);
/* __split_huge_page_refcount will wait now */
- VM_BUG_ON(atomic_read(&page->_count) <= 0);
+ VM_BUG_ON(__page_count(page) <= 0);
atomic_dec(&page->_count);
- VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+ VM_BUG_ON(__page_count(page_head) <= 0);
compound_unlock_irqrestore(page_head, flags);
if (put_page_testzero(page_head)) {
if (PageHead(page_head))
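
To make the changelog's point concrete, here is a minimal sketch in kernel
style (condensed from the vmscan.c hunk above; the helper name
cursor_page_is_free() is made up) of why the raw read plus PageTail check
replaces page_count() here:

/*
 * cursor_page comes from pfn_to_page() during the lumpy pfn scan, so we
 * hold no reference pin on it.
 */
static bool cursor_page_is_free(struct page *cursor_page)
{
	/*
	 * Unsafe: page_count() goes through compound_head(), which may
	 * follow a dangling cursor_page->first_page pointer if another
	 * CPU is freeing the compound page at this very moment.
	 *
	 *	if (!page_count(cursor_page))
	 *		return true;
	 */

	/*
	 * Safe: read the raw refcount of this exact struct page and never
	 * dereference a possible head page.  Tail pages are conservatively
	 * treated as "not free".
	 */
	return !PageTail(cursor_page) && !__page_count(cursor_page);
}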

2011-06-04 06:59:05

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 03:09:20AM +0100, Mel Gorman wrote:
> > On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > > > There is an explanation in here somewhere because as I write this,
> > > > the test machine has survived 14 hours under continual stress without
> > > > the isolated counters going negative with over 128 million pages
> > > > successfully migrated and a million pages failed to migrate due to
> > > > direct compaction being called 80,000 times. It's possible it's a
> > > > co-incidence but it's some co-incidence!
> > >
> > > No idea...
> >
> > I wasn't able to work on this most of the day but was looking at this
> > closer this evening again and I think I might have thought of another
> > theory that could cause this problem.
> >
> > When THP is isolating pages, it accounts for the pages isolated against
> > the zone of course. If it backs out, it finds the pages from the PTEs.
> > On !SMP but PREEMPT, we may not have adequate protection against a new
> > page from a different zone being inserted into the PTE causing us to
> > decrement against the wrong zone. While the global counter is fine,
> > the per-zone counters look corrupted. You'd still think it was the
> > anon counter tht got screwed rather than the file one if it really was
> > THP unfortunately so it's not the full picture. I'm going to start
> > a test monitoring both zoneinfo and vmstat to see if vmstat looks
> > fine while the per-zone counters that are negative are offset by a
> > positive count on the other zones that when added together become 0.
> > Hopefully it'll actually trigger overnight :/
> >
>
> Right idea of the wrong zone being accounted for but wrong place. I
> think the following patch should fix the problem;
>
> ==== CUT HERE ===
> mm: compaction: Ensure that the compaction free scanner does not move to the next zone
>
> Compaction works with two scanners, a migration and a free
> scanner. When the scanners crossover, migration within the zone is
> complete. The location of the scanner is recorded on each cycle to
> avoid excesive scanning.
>
> When a zone is small and mostly reserved, it's very easy for the
> migration scanner to be close to the end of the zone. Then the following
> situation can occurs
>
> o migration scanner isolates some pages near the end of the zone
> o free scanner starts at the end of the zone but finds that the
> migration scanner is already there
> o free scanner gets reinitialised for the next cycle as
> cc->migrate_pfn + pageblock_nr_pages
> moving the free scanner into the next zone
> o migration scanner moves into the next zone but continues accounting
> against the old zone
>
> When this happens, NR_ISOLATED accounting goes haywire because some
> of the accounting happens against the wrong zone. One zones counter
> remains positive while the other goes negative even though the overall
> global count is accurate. This was reported on X86-32 with !SMP because
> !SMP allows the negative counters to be visible. The fact that it is
> difficult to reproduce on X86-64 is probably just a co-incidence as

I guess it's related to zone sizes.
X86-64 has small DMA and large DMA32 zones for fallback of NORMAL while
x86 has just a small DMA(16M) zone.

I think DMA zone in x86 is easily full of non-LRU or non-movable pages.
So isolate_migratepages continues to scan for pages which are migratable
and then it reaches near the end of the zone.


> the bug should theoritically be possible there.

Finally, you found it. Congratulations!
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

While we were debugging this problem, we found a few bugs and points to enhance,
and submitted patches. It was a very good chance to improve the Linux VM.

Thanks, Mel.

--
Kind regards
Minchan Kim
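
To put numbers on the scanner-crossing scenario from the quoted changelog,
here is a small standalone sketch. The values are made up, and the clamp at
the end only illustrates the kind of bound the fix needs; it is not the exact
code of Mel's patch:

static void scanner_crossing_example(void)
{
	/* made-up layout: a small zone spanning pfns [16, 4096) */
	unsigned long zone_end_pfn = 4096;
	unsigned long block = 512;	/* pageblock_nr_pages with 2MB blocks */

	/* the migration scanner has nearly reached the end of the zone */
	unsigned long migrate_pfn = 4000;

	/* restart point recorded for the free scanner's next cycle */
	unsigned long free_pfn = migrate_pfn + block;	/* 4512 */

	/*
	 * 4512 is beyond zone_end_pfn: the free scanner would start in the
	 * next zone, and the migration scanner chasing it keeps accounting
	 * NR_ISOLATED against the old zone, so one zone's counter drifts
	 * positive while the other goes negative.  Any fix has to keep the
	 * scanners inside the zone, for instance:
	 */
	if (free_pfn >= zone_end_pfn)
		free_pfn = zone_end_pfn - block;
}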

2011-06-04 07:25:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 05:45:54PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > Right idea of the wrong zone being accounted for but wrong place. I
> > think the following patch should fix the problem;
>
> Looks good thanks.
>
> I also found this bug during my debugging that made NR_SHMEM underflow.
>
> ===
> Subject: migrate: don't account swapcache as shmem
>
> From: Andrea Arcangeli <[email protected]>
>
> swapcache will reach the below code path in migrate_page_move_mapping,
> and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
> NR_SHMEM.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Nice catch!
--
Kind regards
Minchan Kim

2011-06-04 07:59:36

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 08:07:30PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 07:37:07PM +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > > Do you want this? (it's almost pseudo-code)
> >
> > Yes that's good idea so we at least take into account if we isolated
> > something big, and it's pointless to insist wasting CPU on the tail
> > pages and even trace a fail because of tail pages after it.
> >
> > I introduced a __page_count to increase readability. It's still
> > hackish to work on subpages in vmscan.c but at least I added a comment
> > and until we serialize destroy_compound_page vs compound_head, I guess
> > there's no better way. I didn't attempt to add out of order
> > serialization similar to what exists for split_huge_page vs
> > compound_trans_head yet, as the page can be allocated or go away from
> > under us, in split_huge_page vs compound_trans_head it's simpler
> > because both callers are required to hold a pin on the page so the
> > page can't go be reallocated and destroyed under it.
>
> Sent too fast... had to shuffle a few things around... trying again.
>
> ===
> Subject: mm: no page_count without a page pin
>
> From: Andrea Arcangeli <[email protected]>
>
> It's unsafe to run page_count during the physical pfn scan because
> compound_head could trip on a dangling pointer when reading page->first_page if
> the compound page is being freed by another CPU. Also properly take into
> account if we isolated a compound page during the scan and break the loop if
> we've isolated enough. Introduce __page_count to clean up some atomic_read calls on
> &page->_count in common code.
>

The patch looks good to me.
I have a question; please see the bottom of this mail.

In addition, I think this patch should be divided into 4 patches:

1. fix the accounting of nr_lumpy_taken and nr_lumpy_dirty for hpages
2. break out of isolate_lru_pages early once enough pages have been isolated
3. introduce __page_count and clean up
4. fix the page_count usage on subpages in vmscan.c

> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/powerpc/mm/gup.c | 2 -
> arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
> arch/x86/mm/gup.c | 2 -
> fs/nilfs2/page.c | 2 -
> include/linux/mm.h | 13 ++++++----
> mm/huge_memory.c | 4 +--
> mm/internal.h | 2 -
> mm/page_alloc.c | 6 ++--
> mm/swap.c | 4 +--
> mm/vmscan.c | 35 ++++++++++++++++++++-------
> 10 files changed, 47 insertions(+), 25 deletions(-)
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> struct page *page;
> unsigned long pfn;
> - unsigned long end_pfn;
> + unsigned long start_pfn, end_pfn;
> unsigned long page_pfn;
> int zone_id;
>
> @@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
> */
> zone_id = page_zone_id(page);
> page_pfn = page_to_pfn(page);
> - pfn = page_pfn & ~((1 << order) - 1);
> - end_pfn = pfn + (1 << order);
> - for (; pfn < end_pfn; pfn++) {
> + start_pfn = page_pfn & ~((1 << order) - 1);
> + end_pfn = start_pfn + (1 << order);
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> struct page *cursor_page;
>
> /* The target page is in the block, ignore it. */
> @@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
> break;
>
> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> + unsigned int isolated_pages;
> list_move(&cursor_page->lru, dst);
> mem_cgroup_del_lru(cursor_page);
> - nr_taken += hpage_nr_pages(page);
> - nr_lumpy_taken++;
> + isolated_pages = hpage_nr_pages(page);
> + nr_taken += isolated_pages;
> + nr_lumpy_taken += isolated_pages;
> if (PageDirty(cursor_page))
> - nr_lumpy_dirty++;
> + nr_lumpy_dirty += isolated_pages;
> scan++;
> + pfn += isolated_pages-1;
> + VM_BUG_ON(!isolated_pages);
> + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);

What's the point of these VM_BUG_ONs?
Could you explain what you expect them to catch?

--
Kind regards
Minchan Kim

2011-06-06 10:16:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> > On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
> >> I mean we have more tail pages than head pages. So I think we are likely to
> >> meet tail pages. Of course, compared to all pages(page cache, anon and
> >> so on), compound pages would be very small percentage.
> >
> > Yes that's my point, that being a small percentage it's no big deal to
> > break the loop early.
>
> Indeed.
>
> >
> >> > isolated the head and it's useless to insist on more tail pages (at
> >> > least for large page size like on x86). Plus we've compaction so
> >>
> >> I can't understand your point. Could you elaborate it?
> >
> > What I meant is that if we already isolated the head page of the THP,
> > we don't need to try to free the tail pages and breaking the loop
> > early, will still give us a chance to free a whole 2m because we
> > isolated the head page (it'll involve some work and swapping but if it
> > was a compoundtranspage we're ok to break the loop and we're not
> > making the logic any worse). Provided the PMD_SIZE is quite large like
> > 2/4m...
>
> Do you want this? (it's almost pseudo-code)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7a4469b..9d7609f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1017,7 +1017,7 @@ static unsigned long isolate_lru_pages(unsigned
> long nr_to_scan,
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> struct page *page;
> unsigned long pfn;
> - unsigned long end_pfn;
> + unsigned long start_pfn, end_pfn;
> unsigned long page_pfn;
> int zone_id;
>
> @@ -1057,9 +1057,9 @@ static unsigned long isolate_lru_pages(unsigned
> long nr_to_scan,
> */
> zone_id = page_zone_id(page);
> page_pfn = page_to_pfn(page);
> - pfn = page_pfn & ~((1 << order) - 1);
> + start_pfn = pfn = page_pfn & ~((1 << order) - 1);
> end_pfn = pfn + (1 << order);
> - for (; pfn < end_pfn; pfn++) {
> + while (pfn < end_pfn) {
> struct page *cursor_page;
>
> /* The target page is in the block, ignore it. */
> @@ -1086,17 +1086,25 @@ static unsigned long
> isolate_lru_pages(unsigned long nr_to_scan,
> break;
>
> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> + int isolated_pages;
> list_move(&cursor_page->lru, dst);
> mem_cgroup_del_lru(cursor_page);
> - nr_taken += hpage_nr_pages(page);
> + isolated_pages = hpage_nr_pages(page);
> + nr_taken += isolated_pages;
> + /* if we isolated pages enough, let's
> break early */
> + if (nr_taken > end_pfn - start_pfn)
> + break;
> + pfn += isolated_pages;

I think this condition is somewhat unlikely. We are scanning within
aligned blocks in this linear scanner. Huge pages are always aligned
so the only situation where we'll encounter a hugepage in the middle
of this linear scan is when the requested order is larger than a huge
page. This is exceptionally rare.

Did I miss something?

> nr_lumpy_taken++;
> if (PageDirty(cursor_page))
> nr_lumpy_dirty++;
> scan++;
> } else {
> /* the page is freed already. */
> - if (!page_count(cursor_page))
> + if (!page_count(cursor_page)) {
> + pfn++;
> continue;
> + }
> break;
> }
> }
>
> >
> > The only way this patch makes things worse is for slub order 3 in the
> > process of being freed. But tail pages aren't generally free anyway so
> > I doubt this really makes any difference plus the tail is getting
> > cleared as soon as the page reaches the buddy so it's probably
>
> Okay. Considering getting clear PG_tail as soon as slub order 3 is
> freed, it would be very rare case.
>
> > unnoticeable as this then makes a difference only during a race (plus
> > the tail page can't be isolated, only head page can be part of lrus
> > and only if they're THP).
> >
> >> > insisting and screwing lru ordering isn't worth it, better to be
> >> > permissive and abort... in fact I wouldn't dislike to remove the
> >> > entire lumpy logic when COMPACTION_BUILD is true, but that alters the
> >> > trace too...
> >>
> >> AFAIK, it's final destination to go as compaction will not break lru
> >> ordering if my patch(inorder-putback) is merged.
> >
> > Agreed. I like your patchset, sorry for not having reviewed it in
> > detail yet but there were other issues popping up in the last few
> > days.
>
> No problem. it's urgent than mine. :)
>

I'm going to take the opportunity to apologise for not reviewing that
series yet. I've been kept too busy with other bugs to set aside the
few hours I need to review the series. I'm hoping to get to it this
week if everything goes well.

--
Mel Gorman
SUSE Labs
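
Mel's alignment argument above can be checked with a bit of pfn arithmetic.
A standalone sketch with made-up numbers, assuming 4K pages and 2MB THP
(i.e. HPAGE_PMD_ORDER == 9):

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* order of a 2MB THP in 4K pages */

int main(void)
{
	unsigned long pfn = 1234567;	/* some pfn handed to the scanner */
	unsigned int order = 5;		/* lumpy reclaim request, <= 9    */

	unsigned long block_start = pfn & ~((1UL << order) - 1);
	unsigned long thp_start   = pfn & ~((1UL << HPAGE_PMD_ORDER) - 1);

	/*
	 * For order <= HPAGE_PMD_ORDER the scan block is no larger than a
	 * THP and both are power-of-two aligned, so the block cannot
	 * straddle a THP boundary: either it contains no THP head at all,
	 * or the head sits exactly at block_start (the target page itself).
	 * Only a request with order > HPAGE_PMD_ORDER can meet a huge page
	 * part-way through the linear scan, which is exceptionally rare.
	 */
	printf("scan block starts at %lu, enclosing THP starts at %lu\n",
	       block_start, thp_start);
	return 0;
}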

2011-06-06 10:26:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:15:57AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> > > On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
> > >> I mean we have more tail pages than head pages. So I think we are likely to
> > >> meet tail pages. Of course, compared to all pages(page cache, anon and
> > >> so on), compound pages would be very small percentage.
> > >
> > > Yes that's my point, that being a small percentage it's no big deal to
> > > break the loop early.
> >
> > Indeed.
> >
> > >
> > >> > isolated the head and it's useless to insist on more tail pages (at
> > >> > least for large page size like on x86). Plus we've compaction so
> > >>
> > >> I can't understand your point. Could you elaborate it?
> > >
> > > What I meant is that if we already isolated the head page of the THP,
> > > we don't need to try to free the tail pages and breaking the loop
> > > early, will still give us a chance to free a whole 2m because we
> > > isolated the head page (it'll involve some work and swapping but if it
> > > was a compoundtranspage we're ok to break the loop and we're not
> > > making the logic any worse). Provided the PMD_SIZE is quite large like
> > > 2/4m...
> >
> > Do you want this? (it's almost pseudo-code)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7a4469b..9d7609f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1017,7 +1017,7 @@ static unsigned long isolate_lru_pages(unsigned
> > long nr_to_scan,
> > for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> > struct page *page;
> > unsigned long pfn;
> > - unsigned long end_pfn;
> > + unsigned long start_pfn, end_pfn;
> > unsigned long page_pfn;
> > int zone_id;
> >
> > @@ -1057,9 +1057,9 @@ static unsigned long isolate_lru_pages(unsigned
> > long nr_to_scan,
> > */
> > zone_id = page_zone_id(page);
> > page_pfn = page_to_pfn(page);
> > - pfn = page_pfn & ~((1 << order) - 1);
> > + start_pfn = pfn = page_pfn & ~((1 << order) - 1);
> > end_pfn = pfn + (1 << order);
> > - for (; pfn < end_pfn; pfn++) {
> > + while (pfn < end_pfn) {
> > struct page *cursor_page;
> >
> > /* The target page is in the block, ignore it. */
> > @@ -1086,17 +1086,25 @@ static unsigned long
> > isolate_lru_pages(unsigned long nr_to_scan,
> > break;
> >
> > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > + int isolated_pages;
> > list_move(&cursor_page->lru, dst);
> > mem_cgroup_del_lru(cursor_page);
> > - nr_taken += hpage_nr_pages(page);
> > + isolated_pages = hpage_nr_pages(page);
> > + nr_taken += isolated_pages;
> > + /* if we isolated pages enough, let's
> > break early */
> > + if (nr_taken > end_pfn - start_pfn)
> > + break;
> > + pfn += isolated_pages;
>
> I think this condition is somewhat unlikely. We are scanning within
> aligned blocks in this linear scanner. Huge pages are always aligned
> so the only situation where we'll encounter a hugepage in the middle
> of this linear scan is when the requested order is larger than a huge
> page. This is exceptionally rare.
>
> Did I miss something?
>

I forgot to mention the "pfn += isolated_pages" but I'm also worried
about it. It's a performance gain only if we are encountering huge pages
during the linear scan, which I think is rare. I also think this is
now skipping pages in the linear scan because we now have

for (; pfn < end_pfn; pfn++) {
	if (isolate page) {
		isolated_pages = hpage_nr_pages(page);
		pfn += isolated_pages;
	}
}

hpage_nr_pages is returning 1 for order-0 LRU pages so now the loop is
effectively

for (; pfn < end_pfn; pfn += 2)

Did you mean

pfn += isolated_pages - 1;

?

--
Mel Gorman
SUSE Labs
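
The double-stepping Mel describes is easy to see in a stripped-down version
of the loop (a sketch only; isolated_ok stands in for the __isolate_lru_page()
check and the real code does the isolation work and accounting in between):

/* order-0 case: hpage_nr_pages() returns 1 */
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	if (isolated_ok) {
		unsigned int isolated_pages = hpage_nr_pages(page);

		/* pfn += isolated_pages;      advances by 2 per order-0
		 *                             page once the for loop adds
		 *                             its own +1                  */
		pfn += isolated_pages - 1;  /* the for loop supplies the +1 */
	}
}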

2011-06-06 10:32:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 08:07:30PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 07:37:07PM +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > > Do you want this? (it's almost pseudo-code)
> >
> > Yes that's good idea so we at least take into account if we isolated
> > something big, and it's pointless to insist wasting CPU on the tail
> > pages and even trace a fail because of tail pages after it.
> >
> > I introduced a __page_count to increase readability. It's still
> > hackish to work on subpages in vmscan.c but at least I added a comment
> > and until we serialize destroy_compound_page vs compound_head, I guess
> > there's no better way. I didn't attempt to add out of order
> > serialization similar to what exists for split_huge_page vs
> > compound_trans_head yet, as the page can be allocated or go away from
> > under us, in split_huge_page vs compound_trans_head it's simpler
> > because both callers are required to hold a pin on the page so the
> > page can't go be reallocated and destroyed under it.
>
> Sent too fast... had to shuffle a few things around... trying again.
>
> ===
> Subject: mm: no page_count without a page pin
>
> From: Andrea Arcangeli <[email protected]>
>
> It's unsafe to run page_count during the physical pfn scan because
> compound_head could trip on a dangling pointer when reading page->first_page if
> the compound page is being freed by another CPU. Also properly take into
> account if we isolated a compound page during the scan and break the loop if
> we've isolated enough. Introduce __page_count to clean up some atomic_read calls on
> &page->_count in common code.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>

This patch is pulling in stuff from Minchan. Minimally his patch should
be kept separate to preserve history or his Signed-off should be
included on this patch.

> ---
> arch/powerpc/mm/gup.c | 2 -
> arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
> arch/x86/mm/gup.c | 2 -
> fs/nilfs2/page.c | 2 -
> include/linux/mm.h | 13 ++++++----
> mm/huge_memory.c | 4 +--
> mm/internal.h | 2 -
> mm/page_alloc.c | 6 ++--
> mm/swap.c | 4 +--
> mm/vmscan.c | 35 ++++++++++++++++++++-------
> 10 files changed, 47 insertions(+), 25 deletions(-)
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> struct page *page;
> unsigned long pfn;
> - unsigned long end_pfn;
> + unsigned long start_pfn, end_pfn;
> unsigned long page_pfn;
> int zone_id;
>
> @@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
> */
> zone_id = page_zone_id(page);
> page_pfn = page_to_pfn(page);
> - pfn = page_pfn & ~((1 << order) - 1);
> - end_pfn = pfn + (1 << order);
> - for (; pfn < end_pfn; pfn++) {
> + start_pfn = page_pfn & ~((1 << order) - 1);
> + end_pfn = start_pfn + (1 << order);
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> struct page *cursor_page;
>
> /* The target page is in the block, ignore it. */
> @@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
> break;
>
> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> + unsigned int isolated_pages;
> list_move(&cursor_page->lru, dst);
> mem_cgroup_del_lru(cursor_page);
> - nr_taken += hpage_nr_pages(page);
> - nr_lumpy_taken++;
> + isolated_pages = hpage_nr_pages(page);
> + nr_taken += isolated_pages;
> + nr_lumpy_taken += isolated_pages;
> if (PageDirty(cursor_page))
> - nr_lumpy_dirty++;
> + nr_lumpy_dirty += isolated_pages;
> scan++;
> + pfn += isolated_pages-1;

Ah, here is the isolated_pages - 1 which is necessary. Should have read
the whole thread before responding to anything :).

I still think this optimisation is rare and only applies if we are
encountering huge pages during the linear scan. How often are we doing
that really?

> + VM_BUG_ON(!isolated_pages);

This BUG_ON is overkill. hpage_nr_pages would have to return 0.

> + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);

This would require order > MAX_ORDER_NR_PAGES to be passed into
isolate_lru_pages or for a huge page to be unaligned to a power of
two. The former is very unlikely and the latter is not supported by
any CPU.

> } else {
> - /* the page is freed already. */
> - if (!page_count(cursor_page))
> + /*
> + * Check if the page is freed already.
> + *
> + * We can't use page_count() as that
> + * requires compound_head and we don't
> + * have a pin on the page here. If a
> + * page is tail, we may or may not
> + * have isolated the head, so assume
> + * it's not free, it'd be tricky to
> + * track the head status without a
> + * page pin.
> + */
> + if (!PageTail(cursor_page) &&
> + !__page_count(cursor_page))
> continue;
> break;

Ack to this part.

I'm not keen on __page_count() as __ normally means the "unlocked"
version of a function although I realise that rule isn't universal
either. I can't think of a better name though.

> }
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -266,12 +266,17 @@ struct inode;
> * routine so they can be sure the page doesn't go away from under them.
> */
>
> +static inline int __page_count(struct page *page)
> +{
> + return atomic_read(&page->_count);
> +}
> +
> /*
> * Drop a ref, return true if the refcount fell to zero (the page has no users)
> */
> static inline int put_page_testzero(struct page *page)
> {
> - VM_BUG_ON(atomic_read(&page->_count) == 0);
> + VM_BUG_ON(__page_count(page) == 0);
> return atomic_dec_and_test(&page->_count);
> }
>
> @@ -357,7 +362,7 @@ static inline struct page *compound_head
>
> static inline int page_count(struct page *page)
> {
> - return atomic_read(&compound_head(page)->_count);
> + return __page_count(compound_head(page));
> }
>
> static inline void get_page(struct page *page)
> @@ -370,7 +375,7 @@ static inline void get_page(struct page
> * bugcheck only verifies that the page->_count isn't
> * negative.
> */
> - VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
> + VM_BUG_ON(__page_count(page) < !PageTail(page));
> atomic_inc(&page->_count);
> /*
> * Getting a tail page will elevate both the head and tail
> @@ -382,7 +387,7 @@ static inline void get_page(struct page
> * __split_huge_page_refcount can't run under
> * get_page().
> */
> - VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> + VM_BUG_ON(__page_count(page->first_page) <= 0);
> atomic_inc(&page->first_page->_count);
> }
> }
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1203,10 +1203,10 @@ static void __split_huge_page_refcount(s
> struct page *page_tail = page + i;
>
> /* tail_page->_count cannot change */
> - atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> + atomic_sub(__page_count(page_tail), &page->_count);
> BUG_ON(page_count(page) <= 0);
> atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> - BUG_ON(atomic_read(&page_tail->_count) <= 0);
> + BUG_ON(__page_count(page_tail) <= 0);
>
> /* after clearing PageTail the gup refcount can be released */
> smp_mb();
> --- a/arch/powerpc/mm/gup.c
> +++ b/arch/powerpc/mm/gup.c
> @@ -22,7 +22,7 @@ static inline void get_huge_page_tail(st
> * __split_huge_page_refcount() cannot run
> * from under us.
> */
> - VM_BUG_ON(atomic_read(&page->_count) < 0);
> + VM_BUG_ON(__page_count(page) < 0);
> atomic_inc(&page->_count);
> }
>
> --- a/arch/powerpc/platforms/512x/mpc512x_shared.c
> +++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
> @@ -200,7 +200,7 @@ static inline void mpc512x_free_bootmem(
> {
> __ClearPageReserved(page);
> BUG_ON(PageTail(page));
> - BUG_ON(atomic_read(&page->_count) > 1);
> + BUG_ON(__page_count(page) > 1);
> atomic_set(&page->_count, 1);
> __free_page(page);
> totalram_pages++;
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -114,7 +114,7 @@ static inline void get_huge_page_tail(st
> * __split_huge_page_refcount() cannot run
> * from under us.
> */
> - VM_BUG_ON(atomic_read(&page->_count) < 0);
> + VM_BUG_ON(__page_count(page) < 0);
> atomic_inc(&page->_count);
> }
>
> --- a/fs/nilfs2/page.c
> +++ b/fs/nilfs2/page.c
> @@ -181,7 +181,7 @@ void nilfs_page_bug(struct page *page)
>
> printk(KERN_CRIT "NILFS_PAGE_BUG(%p): cnt=%d index#=%llu flags=0x%lx "
> "mapping=%p ino=%lu\n",
> - page, atomic_read(&page->_count),
> + page, __page_count(page),
> (unsigned long long)page->index, page->flags, m, ino);
>
> if (page_has_buffers(page)) {
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -28,7 +28,7 @@ static inline void set_page_count(struct
> static inline void set_page_refcounted(struct page *page)
> {
> VM_BUG_ON(PageTail(page));
> - VM_BUG_ON(atomic_read(&page->_count));
> + VM_BUG_ON(__page_count(page));
> set_page_count(page, 1);
> }
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -568,7 +568,7 @@ static inline int free_pages_check(struc
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> - (atomic_read(&page->_count) != 0) |
> + (__page_count(page) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_FREE) |
> (mem_cgroup_bad_page_check(page)))) {
> bad_page(page);
> @@ -758,7 +758,7 @@ static inline int check_new_page(struct
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> - (atomic_read(&page->_count) != 0) |
> + (__page_count(page) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
> (mem_cgroup_bad_page_check(page)))) {
> bad_page(page);
> @@ -5739,7 +5739,7 @@ void dump_page(struct page *page)
> {
> printk(KERN_ALERT
> "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> - page, atomic_read(&page->_count), page_mapcount(page),
> + page, __page_count(page), page_mapcount(page),
> page->mapping, page->index);
> dump_page_flags(page->flags);
> mem_cgroup_print_bad_page(page);
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -128,9 +128,9 @@ static void put_compound_page(struct pag
> if (put_page_testzero(page_head))
> VM_BUG_ON(1);
> /* __split_huge_page_refcount will wait now */
> - VM_BUG_ON(atomic_read(&page->_count) <= 0);
> + VM_BUG_ON(__page_count(page) <= 0);
> atomic_dec(&page->_count);
> - VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
> + VM_BUG_ON(__page_count(page_head) <= 0);
> compound_unlock_irqrestore(page_head, flags);
> if (put_page_testzero(page_head)) {
> if (PageHead(page_head))

--
Mel Gorman
SUSE Labs

2011-06-06 10:39:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 05:45:54PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > Right idea of the wrong zone being accounted for but wrong place. I
> > think the following patch should fix the problem;
>
> Looks good thanks.
>
> I also found this bug during my debugging that made NR_SHMEM underflow.
>
> ===
> Subject: migrate: don't account swapcache as shmem
>
> From: Andrea Arcangeli <[email protected]>
>
> swapcache will reach the below code path in migrate_page_move_mapping,
> and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
> NR_SHMEM.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>

Well spotted.

Acked-by: Mel Gorman <[email protected]>

Minor nit. swapper_space is rarely referred to outside of the swap
code. Might it be more readable to use

/*
 * swapcache is accounted as NR_FILE_PAGES but it is not
 * accounted as NR_SHMEM
 */
if (PageSwapBacked(page) && !PageSwapCache(page))

?

> ---
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e4a5c91..2597a27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -288,7 +288,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> */
> __dec_zone_page_state(page, NR_FILE_PAGES);
> __inc_zone_page_state(newpage, NR_FILE_PAGES);
> - if (PageSwapBacked(page)) {
> + if (mapping != &swapper_space && PageSwapBacked(page)) {
> __dec_zone_page_state(page, NR_SHMEM);
> __inc_zone_page_state(newpage, NR_SHMEM);
> }
>
>

--
Mel Gorman
SUSE Labs

2011-06-06 10:43:50

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Sat, Jun 04, 2011 at 03:58:53PM +0900, Minchan Kim wrote:
> On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > On Fri, Jun 03, 2011 at 03:09:20AM +0100, Mel Gorman wrote:
> > > On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > > > > There is an explanation in here somewhere because as I write this,
> > > > > the test machine has survived 14 hours under continual stress without
> > > > > the isolated counters going negative with over 128 million pages
> > > > > successfully migrated and a million pages failed to migrate due to
> > > > > direct compaction being called 80,000 times. It's possible it's a
> > > > > co-incidence but it's some co-incidence!
> > > >
> > > > No idea...
> > >
> > > I wasn't able to work on this most of the day but was looking at this
> > > closer this evening again and I think I might have thought of another
> > > theory that could cause this problem.
> > >
> > > When THP is isolating pages, it accounts for the pages isolated against
> > > the zone of course. If it backs out, it finds the pages from the PTEs.
> > > On !SMP but PREEMPT, we may not have adequate protection against a new
> > > page from a different zone being inserted into the PTE causing us to
> > > decrement against the wrong zone. While the global counter is fine,
> > > the per-zone counters look corrupted. You'd still think it was the
> > > anon counter tht got screwed rather than the file one if it really was
> > > THP unfortunately so it's not the full picture. I'm going to start
> > > a test monitoring both zoneinfo and vmstat to see if vmstat looks
> > > fine while the per-zone counters that are negative are offset by a
> > > positive count on the other zones that when added together become 0.
> > > Hopefully it'll actually trigger overnight :/
> > >
> >
> > Right idea of the wrong zone being accounted for but wrong place. I
> > think the following patch should fix the problem;
> >
> > ==== CUT HERE ===
> > mm: compaction: Ensure that the compaction free scanner does not move to the next zone
> >
> > Compaction works with two scanners, a migration and a free
> > scanner. When the scanners crossover, migration within the zone is
> > complete. The location of the scanner is recorded on each cycle to
> > avoid excesive scanning.
> >
> > When a zone is small and mostly reserved, it's very easy for the
> > migration scanner to be close to the end of the zone. Then the following
> > situation can occurs
> >
> > o migration scanner isolates some pages near the end of the zone
> > o free scanner starts at the end of the zone but finds that the
> > migration scanner is already there
> > o free scanner gets reinitialised for the next cycle as
> > cc->migrate_pfn + pageblock_nr_pages
> > moving the free scanner into the next zone
> > o migration scanner moves into the next zone but continues accounting
> > against the old zone
> >
> > When this happens, NR_ISOLATED accounting goes haywire because some
> > of the accounting happens against the wrong zone. One zones counter
> > remains positive while the other goes negative even though the overall
> > global count is accurate. This was reported on X86-32 with !SMP because
> > !SMP allows the negative counters to be visible. The fact that it is
> > difficult to reproduce on X86-64 is probably just a co-incidence as
>
> I guess it's related to zone sizes.
> X86-64 has small DMA and large DMA32 zones for fallback of NORMAL while
> x86 has just a small DMA(16M) zone.
>

Yep, this is a possibility as well as the use of lowmem reserves.

> I think DMA zone in x86 is easily full of non-LRU or non-movable pages.

Maybe not full, but it has more PageReserved pages than anywhere else
and few MIGRATE_MOVABLE blocks. MIGRATE_MOVABLE gets skipped during
async compaction we could easily reach the end of the DMA zone quickly.

> So isolate_migratepages continues to scan for pages which are migratable
> and then it reaches near the end of the zone.
>
> > the bug should theoritically be possible there.
>
> Finally, you found it. Congratulations!
> > Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>
> While we were debugging this problem, we found a few bugs and points to enhance,
> and submitted patches. It was a very good chance to improve the Linux VM.
>

Thanks.

--
Mel Gorman
SUSE Labs

2011-06-06 12:39:08

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:39:24AM +0100, Mel Gorman wrote:
> Well spotted.
>
> Acked-by: Mel Gorman <[email protected]>
>
> Minor nit. swapper_space is rarely referred to outside of the swap
> code. Might it be more readable to use
>
> /*
>  * swapcache is accounted as NR_FILE_PAGES but it is not
>  * accounted as NR_SHMEM
>  */
> if (PageSwapBacked(page) && !PageSwapCache(page))

I thought the comparison against swapper_space would be faster, as it is a
register vs immediate comparison in the CPU instead of forcing a memory
access; otherwise I would have used the above. Now, the test_bit is written
in C and is lockless, so it's not likely to be very different considering
the cacheline is hot in the CPU, but it still references memory instead of
being a register vs immediate comparison.
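
For reference, the two candidate checks being compared are, side by side (a
sketch; both forms appear earlier in this thread, the first in Andrea's patch
and the second in Mel's suggestion):

	/*
	 * Variant from the patch: compare the mapping pointer against the
	 * address of swapper_space, which the compiler can typically emit
	 * as a pointer compare against a link-time constant.
	 */
	if (mapping != &swapper_space && PageSwapBacked(page)) {
		__dec_zone_page_state(page, NR_SHMEM);
		__inc_zone_page_state(newpage, NR_SHMEM);
	}

	/*
	 * Variant Mel suggests: test the swapcache page flag, which reads
	 * page->flags from memory, but that cacheline is hot here and the
	 * form is more familiar outside the swap code.
	 */
	if (PageSwapBacked(page) && !PageSwapCache(page)) {
		__dec_zone_page_state(page, NR_SHMEM);
		__inc_zone_page_state(newpage, NR_SHMEM);
	}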

2011-06-06 12:40:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:43:45AM +0100, Mel Gorman wrote:
> Maybe not full, but it has more PageReserved pages than anywhere else
> and few MIGRATE_MOVABLE blocks. MIGRATE_MOVABLE gets skipped during
> async compaction we could easily reach the end of the DMA zone quickly.

Debug data has nr_isolated_file in dma zone 1 and nr_isolated_file in
normal zone being -1.

2011-06-06 12:50:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:32:16AM +0100, Mel Gorman wrote:
> This patch is pulling in stuff from Minchan. Minimally his patch should
> be kept separate to preserve history or his Signed-off should be
> included on this patch.

Well I didn't apply Minchan's patch, just improved it as he suggested
from pseudocode, but I can add his signed-off-by no prob.

> I still think this optimisation is rare and only applies if we are
> encountering huge pages during the linear scan. How often are we doing
> that really?

Well, it's so fast to do that it looks worthwhile. You probably noticed
that initially I suggested only the fix for the (theoretical) page_count
oops, and I argued we could improve some more bits, but then it was kind
of obvious to improve the upper side of the loop too, following the
pseudocode.

>
> > + VM_BUG_ON(!isolated_pages);
>
> This BUG_ON is overkill. hpage_nr_pages would have to return 0.
>
> > + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
>
> This would require order > MAX_ORDER_NR_PAGES to be passed into
> isolate_lru_pages or for a huge page to be unaligned to a power of
> two. The former is very unlikely and the latter is not supported by
> any CPU.

Minchan also disliked the VM_BUG_ON; it's clearly way overkill. But frankly
the physical pfn scans are tricky enough, and if there's a race and the
order is wrong for whatever reason (no compound page, or a driver messing
with subpages overwrote it) we'll just trip over some weird pointer on the
next iteration (or maybe not, and it'll go ahead unnoticed if it's not
beyond the range), and in that case I'd like to notice immediately.

But it's probably too paranoid even for a VM_BUG_ON, so I surely can
remove it...

>
> > } else {
> > - /* the page is freed already. */
> > - if (!page_count(cursor_page))
> > + /*
> > + * Check if the page is freed already.
> > + *
> > + * We can't use page_count() as that
> > + * requires compound_head and we don't
> > + * have a pin on the page here. If a
> > + * page is tail, we may or may not
> > + * have isolated the head, so assume
> > + * it's not free, it'd be tricky to
> > + * track the head status without a
> > + * page pin.
> > + */
> > + if (!PageTail(cursor_page) &&
> > + !__page_count(cursor_page))
> > continue;
> > break;
>
> Ack to this part.

This is also the only important part that fixes the potential oops.

> I'm not keen on __page_count() as __ normally means the "unlocked"
> version of a function although I realise that rule isn't universal
> either. I can't think of a better name though.

If a better suggestion comes to mind I can change it... Or I can also
use atomic_read like in the first patch... it's up to you. I figured it
wasn't so nice to call atomic_read directly, and there are other places in
huge_memory.c that used it for bugchecks which can be cleaned up with
__page_count. The _ prefix on _count is what makes it look like a more
private field that generic VM code shouldn't touch directly, so the raw
value can be altered without changing all callers of __page_count,
similar to _mapcount.

2011-06-06 13:23:33

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:43:45AM +0100, Mel Gorman wrote:
> On Sat, Jun 04, 2011 at 03:58:53PM +0900, Minchan Kim wrote:
> > On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > > On Fri, Jun 03, 2011 at 03:09:20AM +0100, Mel Gorman wrote:
> > > > On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > > > > > There is an explanation in here somewhere because as I write this,
> > > > > > the test machine has survived 14 hours under continual stress without
> > > > > > the isolated counters going negative with over 128 million pages
> > > > > > successfully migrated and a million pages failed to migrate due to
> > > > > > direct compaction being called 80,000 times. It's possible it's a
> > > > > > co-incidence but it's some co-incidence!
> > > > >
> > > > > No idea...
> > > >
> > > > I wasn't able to work on this most of the day but was looking at this
> > > > closer this evening again and I think I might have thought of another
> > > > theory that could cause this problem.
> > > >
> > > > When THP is isolating pages, it accounts for the pages isolated against
> > > > the zone of course. If it backs out, it finds the pages from the PTEs.
> > > > On !SMP but PREEMPT, we may not have adequate protection against a new
> > > > page from a different zone being inserted into the PTE causing us to
> > > > decrement against the wrong zone. While the global counter is fine,
> > > > the per-zone counters look corrupted. You'd still think it was the
> > > > anon counter tht got screwed rather than the file one if it really was
> > > > THP unfortunately so it's not the full picture. I'm going to start
> > > > a test monitoring both zoneinfo and vmstat to see if vmstat looks
> > > > fine while the per-zone counters that are negative are offset by a
> > > > positive count on the other zones that when added together become 0.
> > > > Hopefully it'll actually trigger overnight :/
> > > >
> > >
> > > Right idea of the wrong zone being accounted for but wrong place. I
> > > think the following patch should fix the problem;
> > >
> > > ==== CUT HERE ===
> > > mm: compaction: Ensure that the compaction free scanner does not move to the next zone
> > >
> > > Compaction works with two scanners, a migration and a free
> > > scanner. When the scanners crossover, migration within the zone is
> > > complete. The location of the scanner is recorded on each cycle to
> > > avoid excesive scanning.
> > >
> > > When a zone is small and mostly reserved, it's very easy for the
> > > migration scanner to be close to the end of the zone. Then the following
> > > situation can occurs
> > >
> > > o migration scanner isolates some pages near the end of the zone
> > > o free scanner starts at the end of the zone but finds that the
> > > migration scanner is already there
> > > o free scanner gets reinitialised for the next cycle as
> > > cc->migrate_pfn + pageblock_nr_pages
> > > moving the free scanner into the next zone
> > > o migration scanner moves into the next zone but continues accounting
> > > against the old zone
> > >
> > > When this happens, NR_ISOLATED accounting goes haywire because some
> > > of the accounting happens against the wrong zone. One zones counter
> > > remains positive while the other goes negative even though the overall
> > > global count is accurate. This was reported on X86-32 with !SMP because
> > > !SMP allows the negative counters to be visible. The fact that it is
> > > difficult to reproduce on X86-64 is probably just a co-incidence as
> >
> > I guess it's related to zone sizes.
> > X86-64 has small DMA and large DMA32 zones for fallback of NORMAL while
> > x86 has just a small DMA(16M) zone.
> >
>
> Yep, this is a possibility as well as the use of lowmem reserves.
>
> > I think DMA zone in x86 is easily full of non-LRU or non-movable pages.
>
> Maybe not full, but it has more PageReserved pages than anywhere else

Yeb. It's very possible.

> and few MIGRATE_MOVABLE blocks. MIGRATE_MOVABLE gets skipped during

That should read "non-MIGRATE_MOVABLE gets skipped during".
To be clear for someone reading this in the future, let's fix that typo.

> async compaction we could easily reach the end of the DMA zone quickly.

--
Kind regards
Minchan Kim

2011-06-06 13:27:22

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 02:40:25PM +0200, Andrea Arcangeli wrote:
> On Mon, Jun 06, 2011 at 11:43:45AM +0100, Mel Gorman wrote:
> > Maybe not full, but it has more PageReserved pages than anywhere else
> > and few MIGRATE_MOVABLE blocks. MIGRATE_MOVABLE gets skipped during
> > async compaction we could easily reach the end of the DMA zone quickly.
>
> Debug data has nr_isolated_file in dma zone 1 and nr_isolated_file in
> normal zone being -1.

It exactly matches Mel's case.
Thanks for proving it, Andrea.

--
Kind regards
Minchan Kim

2011-06-06 14:01:31

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:15:57AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> > > On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
> > >> I mean we have more tail pages than head pages, so I think we are likely to
> > >> meet tail pages. Of course, compared to all pages (page cache, anon and
> > >> so on), compound pages would be a very small percentage.
> > >
> > > Yes that's my point, that being a small percentage it's no big deal to
> > > break the loop early.
> >
> > Indeed.
> >
> > >
> > >> > isolated the head and it's useless to insist on more tail pages (at
> > >> > least for large page size like on x86). Plus we've compaction so
> > >>
> > >> I can't understand your point. Could you elaborate it?
> > >
> > > What I meant is that if we already isolated the head page of the THP,
> > > we don't need to try to free the tail pages; breaking the loop
> > > early will still give us a chance to free a whole 2M because we
> > > isolated the head page (it'll involve some work and swapping, but if it
> > > was a compound trans page we're OK to break the loop and we're not
> > > making the logic any worse). Provided the PMD_SIZE is quite large, like
> > > 2/4M...
> >
> > Do you want this? (it's almost pseudo-code)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7a4469b..9d7609f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1017,7 +1017,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> > struct page *page;
> > unsigned long pfn;
> > - unsigned long end_pfn;
> > + unsigned long start_pfn, end_pfn;
> > unsigned long page_pfn;
> > int zone_id;
> >
> > @@ -1057,9 +1057,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > */
> > zone_id = page_zone_id(page);
> > page_pfn = page_to_pfn(page);
> > - pfn = page_pfn & ~((1 << order) - 1);
> > + start_pfn = pfn = page_pfn & ~((1 << order) - 1);
> > end_pfn = pfn + (1 << order);
> > - for (; pfn < end_pfn; pfn++) {
> > + while (pfn < end_pfn) {
> > struct page *cursor_page;
> >
> > /* The target page is in the block, ignore it. */
> > @@ -1086,17 +1086,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > break;
> >
> > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > + int isolated_pages;
> > list_move(&cursor_page->lru, dst);
> > mem_cgroup_del_lru(cursor_page);
> > - nr_taken += hpage_nr_pages(page);
> > + isolated_pages = hpage_nr_pages(page);
> > + nr_taken += isolated_pages;
> > + /* if we isolated enough pages, let's break early */
> > + if (nr_taken > end_pfn - start_pfn)
> > + break;
> > + pfn += isolated_pages;
>
> I think this condition is somewhat unlikely. We are scanning within
> aligned blocks in this linear scanner. Huge pages are always aligned
> so the only situation where we'll encounter a hugepage in the middle
> of this linear scan is when the requested order is larger than a huge
> page. This is exceptionally rare.
>
> Did I miss something?

Not at all, you're absolutely right.
I don't have systems with lots of hugepages,
but I have heard of people tuning MAX_ORDER (whether that's good or bad is off-topic).
Anyway, it would help on such systems, though I admit it would be rare.
I don't feel strongly about this pseudo-patch.

--
Kind regards
Minchan Kim

2011-06-06 14:08:09

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:32:16AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 08:07:30PM +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 03, 2011 at 07:37:07PM +0200, Andrea Arcangeli wrote:
> > > On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > > > Do you want this? (it's almost pseudo-code)
> > >
> > > Yes, that's a good idea: we at least take into account whether we
> > > isolated something big, and it's pointless to insist on wasting CPU on
> > > the tail pages and even trace a failure because of tail pages after it.
> > >
> > > I introduced a __page_count to increase readability. It's still
> > > hackish to work on subpages in vmscan.c, but at least I added a comment,
> > > and until we serialize destroy_compound_page vs compound_head I guess
> > > there's no better way. I didn't attempt to add out-of-order
> > > serialization similar to what exists for split_huge_page vs
> > > compound_trans_head yet, as the page can be allocated or go away from
> > > under us. In split_huge_page vs compound_trans_head it's simpler
> > > because both callers are required to hold a pin on the page, so the
> > > page can't be reallocated and destroyed under it.
> >
> > Sent too fast... had to shuffle a few things around... trying again.
> >
> > ===
> > Subject: mm: no page_count without a page pin
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > It's unsafe to run page_count during the physical pfn scan because
> > compound_head could trip on a dangling pointer when reading page->first_page if
> > the compound page is being freed by another CPU. Also properly take into
> > account if we isolated a compound page during the scan and break the loop if
> > we've isolated enough. Introduce __page_count to clean up some atomic_read
> > calls on &page->_count in common code.
> >
> > Signed-off-by: Andrea Arcangeli <[email protected]>
>
> This patch is pulling in stuff from Minchan. Minimally his patch should
> be kept separate to preserve history or his Signed-off should be
> included on this patch.
>
> > ---
> > arch/powerpc/mm/gup.c | 2 -
> > arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
> > arch/x86/mm/gup.c | 2 -
> > fs/nilfs2/page.c | 2 -
> > include/linux/mm.h | 13 ++++++----
> > mm/huge_memory.c | 4 +--
> > mm/internal.h | 2 -
> > mm/page_alloc.c | 6 ++--
> > mm/swap.c | 4 +--
> > mm/vmscan.c | 35 ++++++++++++++++++++-------
> > 10 files changed, 47 insertions(+), 25 deletions(-)
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
> > for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> > struct page *page;
> > unsigned long pfn;
> > - unsigned long end_pfn;
> > + unsigned long start_pfn, end_pfn;
> > unsigned long page_pfn;
> > int zone_id;
> >
> > @@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
> > */
> > zone_id = page_zone_id(page);
> > page_pfn = page_to_pfn(page);
> > - pfn = page_pfn & ~((1 << order) - 1);
> > - end_pfn = pfn + (1 << order);
> > - for (; pfn < end_pfn; pfn++) {
> > + start_pfn = page_pfn & ~((1 << order) - 1);
> > + end_pfn = start_pfn + (1 << order);
> > + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > struct page *cursor_page;
> >
> > /* The target page is in the block, ignore it. */
> > @@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
> > break;
> >
> > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > + unsigned int isolated_pages;
> > list_move(&cursor_page->lru, dst);
> > mem_cgroup_del_lru(cursor_page);
> > - nr_taken += hpage_nr_pages(page);
> > - nr_lumpy_taken++;
> > + isolated_pages = hpage_nr_pages(page);
> > + nr_taken += isolated_pages;
> > + nr_lumpy_taken += isolated_pages;
> > if (PageDirty(cursor_page))
> > - nr_lumpy_dirty++;
> > + nr_lumpy_dirty += isolated_pages;
> > scan++;
> > + pfn += isolated_pages-1;
>
> Ah, here is the isolated_pages - 1 which is necessary. Should have read
> the whole thread before responding to anything :).
>
> I still think this optimisation is rare and only applies if we are
> encountering huge pages during the linear scan. How often are we doing
> that really?
>
> > + VM_BUG_ON(!isolated_pages);
>
> This BUG_ON is overkill. hpage_nr_pages would have to return 0.
>
> > + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
>
> This would require order > MAX_ORDER_NR_PAGES to be passed into
> isolate_lru_pages or for a huge page to be unaligned to a power of
> two. The former is very unlikely and the latter is not supported by
> any CPU.
>
> > } else {
> > - /* the page is freed already. */
> > - if (!page_count(cursor_page))
> > + /*
> > + * Check if the page is freed already.
> > + *
> > + * We can't use page_count() as that
> > + * requires compound_head and we don't
> > + * have a pin on the page here. If a
> > + * page is tail, we may or may not
> > + * have isolated the head, so assume
> > + * it's not free, it'd be tricky to
> > + * track the head status without a
> > + * page pin.
> > + */
> > + if (!PageTail(cursor_page) &&
> > + !__page_count(cursor_page))
> > continue;
> > break;
>
> Ack to this part.
>
> I'm not keen on __page_count() as __ normally means the "unlocked"
> version of a function although I realise that rule isn't universal

Yes, it's not universal.
I have thought of it as just a private or utility function
(i.e., not exported),
so I don't mind the name.

> either. I can't think of a better name though.

Me neither.

--
Kind regards
Minchan Kim
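
For anyone reading this without the full patch in front of them, the
helper being debated is presumably just a raw read of the reference
count of this exact struct page, skipping compound_head(); a minimal
sketch under that assumption (not copied from the actual patch):

/*
 * Assumed shape of __page_count: read _count of exactly this struct
 * page without going through compound_head(), which is what makes it
 * usable during a physical pfn scan where no pin on the page is held.
 */
static inline int __page_count(struct page *page)
{
        return atomic_read(&page->_count);
}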

2011-06-06 14:20:06

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:39:24AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 05:45:54PM +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > > Right idea of the wrong zone being accounted for but wrong place. I
> > > think the following patch should fix the problem;
> >
> > Looks good thanks.
> >
> > I also found this bug during my debugging that made NR_SHMEM underflow.
> >
> > ===
> > Subject: migrate: don't account swapcache as shmem
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > swapcache will reach the below code path in migrate_page_move_mapping,
> > and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
> > NR_SHMEM.
> >
> > Signed-off-by: Andrea Arcangeli <[email protected]>
>
> Well spotted.
>
> Acked-by: Mel Gorman <[email protected]>
>
> Minor nit. swapper_space is rarely referred to outside of the swap
> code. Might it be more readable to use
>
> /*
> * swapcache is accounted as NR_FILE_PAGES but it is not
> * accounted as NR_SHMEM
> */
> if (PageSwapBacked(page) && !PageSwapCache(page))

I like this, but as it's an "and" operation the CPU has to execute two comparisons.
How about the below instead?
if (!PageSwapCache(page) && PageSwapBacked(page))

PageSwapCache implies PageSwapBacked, so we can handle non-swap-backed pages with just one comparison.

>
> ?
>
> > ---
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index e4a5c91..2597a27 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -288,7 +288,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> > */
> > __dec_zone_page_state(page, NR_FILE_PAGES);
> > __inc_zone_page_state(newpage, NR_FILE_PAGES);
> > - if (PageSwapBacked(page)) {
> > + if (mapping != &swapper_space && PageSwapBacked(page)) {
> > __dec_zone_page_state(page, NR_SHMEM);
> > __inc_zone_page_state(newpage, NR_SHMEM);
> > }
> >
> >
>
> --
> Mel Gorman
> SUSE Labs

--
Kind regards
Minchan Kim
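
For readers comparing the two alternatives discussed above, they look
like this side by side (illustrative only; the surrounding context is
the migrate_page_move_mapping hunk quoted in the diff):

        /* Mel's suggestion: arguably more readable, tests page flags only */
        if (PageSwapBacked(page) && !PageSwapCache(page)) {
                __dec_zone_page_state(page, NR_SHMEM);
                __inc_zone_page_state(newpage, NR_SHMEM);
        }

        /*
         * Andrea's original: compares the mapping pointer against
         * &swapper_space, which compiles to an immediate/register
         * comparison and does not touch page->flags.
         */
        if (mapping != &swapper_space && PageSwapBacked(page)) {
                __dec_zone_page_state(page, NR_SHMEM);
                __inc_zone_page_state(newpage, NR_SHMEM);
        }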

2011-06-06 14:26:34

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:15:57AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> > > On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:

<snip>

> > >> AFAIK, it's the final destination to aim for, as compaction will not break LRU
> > >> ordering once my patch (inorder-putback) is merged.
> > >
> > > Agreed. I like your patchset, sorry for not having reviewed it in
> > > detail yet but there were other issues popping up in the last few
> > > days.
> >
> > No problem. It's more urgent than mine. :)
> >
>
> I'm going to take the opportunity to apologise for not reviewing that
> series yet. I've been kept too busy with other bugs to set aside the
> few hours I need to review the series. I'm hoping to get to it this
> week if everything goes well.

I am refactoring the migration code.
I will probably resend it tomorrow.
I hope you guys can review it. :)

Thanks, Mel.

>
> --
> Mel Gorman
> SUSE Labs

--
Kind regards
Minchan Kim

2011-06-06 14:47:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 02:49:54PM +0200, Andrea Arcangeli wrote:
> On Mon, Jun 06, 2011 at 11:32:16AM +0100, Mel Gorman wrote:
> > This patch is pulling in stuff from Minchan. Minimally his patch should
> > be kept separate to preserve history or his Signed-off should be
> > included on this patch.
>
> Well I didn't apply Minchan's patch, just improved it as he suggested
> from pseudocode, but I can add his signed-off-by no prob.
>

My bad; the pseudo-code was close enough to being a patch that I felt it at
least merited a mention in the patch.

> > I still think this optimisation is rare and only applies if we are
> > encountering huge pages during the linear scan. How often are we doing
> > that really?
>
> Well, it's so fast to do that it looks worthwhile. You probably
> noticed that initially I suggested only the fix for the (theoretical)
> page_count oops, and I argued we could improve some more bits, but
> then it was kind of obvious to improve the upper side of the loop too
> according to the pseudocode.
>

I don't feel very strongly about it. I don't think there is much of a
boost because of how rarely we'll encounter this situation but there is
no harm either. I think the page_count fix is more important.

> >
> > > + VM_BUG_ON(!isolated_pages);
> >
> > This BUG_ON is overkill. hpage_nr_pages would have to return 0.
> >
> > > + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
> >
> > This would require order > MAX_ORDER_NR_PAGES to be passed into
> > isolate_lru_pages or for a huge page to be unaligned to a power of
> > two. The former is very unlikely and the latter is not supported by
> > any CPU.
>
> Minchan also disliked the VM_BUG_ON; it's clearly way overkill, but
> frankly the physical pfn scans are tricky enough, and if there's
> a race and the order is wrong for whatever reason (no compound page, or
> a driver messing with subpages overwrote it) we'll just trip over some
> weird pointer on the next iteration (or maybe not, and it'll go ahead
> unnoticed if it's not beyond the range), and in that case I'd like to
> notice immediately.
>

I guess there is always the chance that an out-of-tree driver will
do something utterly insane with a transparent hugepage and while
this BUG_ON is "impossible", it doesn't hurt either.

> But probably it's too paranoid even of a VM_BUG_ON so I surely can
> remove it...
>
> >
> > > } else {
> > > - /* the page is freed already. */
> > > - if (!page_count(cursor_page))
> > > + /*
> > > + * Check if the page is freed already.
> > > + *
> > > + * We can't use page_count() as that
> > > + * requires compound_head and we don't
> > > + * have a pin on the page here. If a
> > > + * page is tail, we may or may not
> > > + * have isolated the head, so assume
> > > + * it's not free, it'd be tricky to
> > > + * track the head status without a
> > > + * page pin.
> > > + */
> > > + if (!PageTail(cursor_page) &&
> > > + !__page_count(cursor_page))
> > > continue;
> > > break;
> >
> > Ack to this part.
>
> This is also the only important part that fixes the potential oops.
>

Agreed.

> > I'm not keen on __page_count() as __ normally means the "unlocked"
> > version of a function although I realise that rule isn't universal
> > either. I can't think of a better name though.
>
> If better suggestions comes to mind I can change it... Or I can also
> use atomic_read like in the first patch... it's up to you.

The atomic_read is not an improvement. What you have is better than
adding another open-coded atomic_read of the page count.

> I figured
> it wasn't so nice to call atomic_read directly, and there are other places in
> huge_memory.c that use it for bug checks which can be cleaned up
> with __page_count. The _ prefix on _count is what makes it look
> like a more private field that generic VM code shouldn't use directly, so the
> raw value can be altered without changing all callers of __page_count,
> similar to _mapcount.

There is that. Go with __page_count because it's better than an
atomic_read. I'm happy to ack the __page_count part of this patch, so
please split it out because it is a -stable candidate whereas the
potential optimisation and VM_BUG_ONs are not.

--
Mel Gorman
SUSE Labs

2011-06-06 14:55:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 02:38:51PM +0200, Andrea Arcangeli wrote:
> On Mon, Jun 06, 2011 at 11:39:24AM +0100, Mel Gorman wrote:
> > Well spotted.
> >
> > Acked-by: Mel Gorman <[email protected]>
> >
> > Minor nit. swapper_space is rarely referred to outside of the swap
> > code. Might it be more readable to use
> >
> > /*
> > * swapcache is accounted as NR_FILE_PAGES but it is not
> > * accounted as NR_SHMEM
> > */
> > if (PageSwapBacked(page) && !PageSwapCache(page))
>
> I thought the comparison on swapper_space would be faster as it is
> an immediate-vs-register comparison in the CPU, instead of forcing a memory
> access. Otherwise I would have used the above. Now, the test_bit is
> written in C and lockless, so it's not likely to be very different
> considering the cacheline is hot in the CPU, but it's still referencing
> memory instead of doing a register-vs-immediate comparison.

Ok, I had not considered that. It is a micro-optimisation but it's
there. I thought my version was more readable, and migration is not
really a fast path, but yours is still better.

--
Mel Gorman
SUSE Labs

2011-06-06 22:33:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, 3 Jun 2011 17:45:54 +0200
Andrea Arcangeli <[email protected]> wrote:

> On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > Right idea of the wrong zone being accounted for but wrong place. I
> > think the following patch should fix the problem;
>
> Looks good thanks.
>
> I also found this bug during my debugging that made NR_SHMEM underflow.
>
> ===
> Subject: migrate: don't account swapcache as shmem
>
> From: Andrea Arcangeli <[email protected]>
>
> swapcache will reach the below code path in migrate_page_move_mapping,
> and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
> NR_SHMEM.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e4a5c91..2597a27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -288,7 +288,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> */
> __dec_zone_page_state(page, NR_FILE_PAGES);
> __inc_zone_page_state(newpage, NR_FILE_PAGES);
> - if (PageSwapBacked(page)) {
> + if (mapping != &swapper_space && PageSwapBacked(page)) {
> __dec_zone_page_state(page, NR_SHMEM);
> __inc_zone_page_state(newpage, NR_SHMEM);
> }

fyi, this was the only patch I applied from this whole thread. Once the
dust has settled, could people please resend whatever they have,
including any acked-by's and reviewed-by's?

Thanks.