2011-05-30 13:13:08

by Mel Gorman

Subject: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Asynchronous compaction is used when promoting to huge pages. This is
all very nice but if there are a number of processes in compacting
memory, a large number of pages can be isolated. An "asynchronous"
process can stall for long periods of time as a result with a user
reporting that firefox can stall for 10s of seconds. This patch aborts
asynchronous compaction if too many pages are isolated as it's better to
fail a hugepage promotion than stall a process.

If accepted, this should also be considered for 2.6.39-stable. It should
also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
writeback] would be applied to 2.6.38 before consideration.

Reported-and-Tested-by: Ury Stankevich <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/compaction.c | 32 ++++++++++++++++++++++++++------
1 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 021a296..331a2ee 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -240,11 +240,20 @@ static bool too_many_isolated(struct zone *zone)
return isolated > (inactive + active) / 2;
}

+/* possible outcome of isolate_migratepages */
+typedef enum {
+ ISOLATE_ABORT, /* Abort compaction now */
+ ISOLATE_NONE, /* No pages isolated, continue scanning */
+ ISOLATE_SUCCESS, /* Pages isolated, migrate */
+} isolate_migrate_t;
+
/*
* Isolate all pages that can be migrated from the block pointed to by
* the migrate scanner within compact_control.
+ *
+ * Returns false if compaction should abort at this point due to congestion.
*/
-static unsigned long isolate_migratepages(struct zone *zone,
+static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
{
unsigned long low_pfn, end_pfn;
@@ -261,7 +270,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
/* Do not cross the free scanner or scan within a memory hole */
if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
cc->migrate_pfn = end_pfn;
- return 0;
+ return ISOLATE_NONE;
}

/*
@@ -270,10 +279,14 @@ static unsigned long isolate_migratepages(struct zone *zone,
* delay for some time until fewer pages are isolated
*/
while (unlikely(too_many_isolated(zone))) {
+ /* async migration should just abort */
+ if (!cc->sync)
+ return ISOLATE_ABORT;
+
congestion_wait(BLK_RW_ASYNC, HZ/10);

if (fatal_signal_pending(current))
- return 0;
+ return ISOLATE_ABORT;
}

/* Time to isolate some pages for migration */
@@ -358,7 +371,7 @@ static unsigned long isolate_migratepages(struct zone *zone,

trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);

- return cc->nr_migratepages;
+ return ISOLATE_SUCCESS;
}

/*
@@ -522,9 +535,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
unsigned long nr_migrate, nr_remaining;
int err;

- if (!isolate_migratepages(zone, cc))
+ switch (isolate_migratepages(zone, cc)) {
+ case ISOLATE_ABORT:
+ goto out;
+ case ISOLATE_NONE:
continue;
-
+ case ISOLATE_SUCCESS:
+ ;
+ }
+
nr_migrate = cc->nr_migratepages;
err = migrate_pages(&cc->migratepages, compaction_alloc,
(unsigned long)cc, false,
@@ -547,6 +566,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

}

+out:
/* Release free pages and check accounting */
cc->nr_freepages -= release_freepages(&cc->freepages);
VM_BUG_ON(cc->nr_freepages != 0);


2011-05-30 14:31:41

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Mel and everyone,

On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.

Is this supposed to fix the stall with khugepaged in D state and other
processes in D state?

zoneinfo showed a nr_isolated_file = -1, I don't think that meant
compaction had 4g pages isolated really considering it moves from
-1,0, 1. So I'm unsure if this fix could be right if the problem is
the hang with khugepaged in D state reported, so far that looked more
like a bug with PREEMPT in the vmstat accounting of nr_isolated_file
that trips in too_many_isolated of both vmscan.c and compaction.c with
PREEMPT=y. Or are you fixing a different problem?

Or how do you explain this -1 value out of nr_isolated_file? Clearly
when that value goes to -1, compaction.c:too_many_isolated will hang,
I think we should fix the -1 value before worrying about the rest...
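[For context, not part of the original mail: the compaction check being discussed looked roughly like the following in 2.6.38/39 (the return statement is visible in the patch context above); the body here is a paraphrase. With the counter underflowed, a !CONFIG_SMP kernel reads the wrapped value back, so the check never becomes false.]

static bool too_many_isolated(struct zone *zone)
{
        unsigned long active, inactive, isolated;

        inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
                                zone_page_state(zone, NR_INACTIVE_ANON);
        active = zone_page_state(zone, NR_ACTIVE_FILE) +
                                zone_page_state(zone, NR_ACTIVE_ANON);
        isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
                                zone_page_state(zone, NR_ISOLATED_ANON);

        /* An underflowed isolated counter reads back as ~4294967295 on
         * !CONFIG_SMP, so this stays true and the caller keeps looping
         * in congestion_wait(). */
        return isolated > (inactive + active) / 2;
}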

grep nr_isolated_file zoneinfo-khugepaged
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0
nr_isolated_file 1
nr_isolated_file 4294967295
nr_isolated_file 0

2011-05-30 14:51:02

by Greg KH

Subject: Re: [stable] [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.
>
> Reported-and-Tested-by: Ury Stankevich <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/compaction.c | 32 ++++++++++++++++++++++++++------
> 1 files changed, 26 insertions(+), 6 deletions(-)


<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read Documentation/stable_kernel_rules.txt
for how to do this properly.

</formletter>

2011-05-30 15:37:56

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 04:31:09PM +0200, Andrea Arcangeli wrote:
> Hi Mel and everyone,
>
> On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> > Asynchronous compaction is used when promoting to huge pages. This is
> > all very nice but if there are a number of processes in compacting
> > memory, a large number of pages can be isolated. An "asynchronous"
> > process can stall for long periods of time as a result with a user
> > reporting that firefox can stall for 10s of seconds. This patch aborts
> > asynchronous compaction if too many pages are isolated as it's better to
> > fail a hugepage promotion than stall a process.
> >
> > If accepted, this should also be considered for 2.6.39-stable. It should
> > also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> > compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> > writeback] would be applied to 2.6.38 before consideration.
>
> Is this supposed to fix the stall with khugepaged in D state and other
> processes in D state?
>

Other processes. khugepaged might be getting stuck in the same loop but
I do not have a specific case in mind.

> zoneinfo showed a nr_isolated_file = -1, I don't think that meant
> compaction had 4g pages isolated really considering it moves from
> -1,0, 1. So I'm unsure if this fix could be right if the problem is
> the hang with khugepaged in D state reported, so far that looked more
> like a bug with PREEMPT in the vmstat accounting of nr_isolated_file
> that trips in too_many_isolated of both vmscan.c and compaction.c with
> PREEMPT=y. Or are you fixing a different problem?
>

I'm not familiar with this problem. I either missed it or forgot about
it entirely. I was considering only Ury's report whereby firefox was
getting stalled for 10s of seconds in congestion_wait. It's possible the
root cause was isolated counters being broken but I didn't pick up on
it.

> Or how do you explain this -1 value out of nr_isolated_file? Clearly
> when that value goes to -1, compaction.c:too_many_isolated will hang,
> I think we should fix the -1 value before worrying about the rest...
>
> grep nr_isolated_file zoneinfo-khugepaged
> nr_isolated_file 1
> nr_isolated_file 4294967295

Can you point me at the thread that this file appears on and what the
conditions were? If vmstat is going to -1, it is indeed a problem
because it implies an imbalance in increments and decrements to the
isolated counters. Even with that fixed though, this patch still makes
sense: why would an asynchronous user of compaction stall on
congestion_wait?

--
Mel Gorman
SUSE Labs

2011-05-30 16:14:24

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.
>
> Reported-and-Tested-by: Ury Stankevich <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

I have a nitpick below.
Otherwise, looks good to me.

> ---
> mm/compaction.c | 32 ++++++++++++++++++++++++++------
> 1 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 021a296..331a2ee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -240,11 +240,20 @@ static bool too_many_isolated(struct zone *zone)
> return isolated > (inactive + active) / 2;
> }
>
> +/* possible outcome of isolate_migratepages */
> +typedef enum {
> + ISOLATE_ABORT, /* Abort compaction now */
> + ISOLATE_NONE, /* No pages isolated, continue scanning */
> + ISOLATE_SUCCESS, /* Pages isolated, migrate */
> +} isolate_migrate_t;
> +
> /*
> * Isolate all pages that can be migrated from the block pointed to by
> * the migrate scanner within compact_control.
> + *
> + * Returns false if compaction should abort at this point due to congestion.

false? I think it would be better to use explicit word, ISOLATE_ABORT.

--
Kind regards
Minchan Kim

2011-05-30 16:55:52

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 04:37:49PM +0100, Mel Gorman wrote:
> > Or how do you explain this -1 value out of nr_isolated_file? Clearly
> > when that value goes to -1, compaction.c:too_many_isolated will hang,
> > I think we should fix the -1 value before worrying about the rest...
> >
> > grep nr_isolated_file zoneinfo-khugepaged
> > nr_isolated_file 1
> > nr_isolated_file 4294967295
>
> Can you point me at the thread that this file appears on and what the
> conditions were? If vmstat is going to -1, it is indeed a problem
> because it implies an imbalance in increments and decrements to the
> isolated counters.

Even with drift issues, -1 there should be "impossible". Assuming this
is a zoneinfo file, that figure is based on global_page_state() which
looks like

static inline unsigned long global_page_state(enum zone_stat_item item)
{
        long x = atomic_long_read(&vm_stat[item]);
#ifdef CONFIG_SMP
        if (x < 0)
                x = 0;
#endif
        return x;
}

So even if isolated counts were going negative for short periods of
time, the returned value should be 0. As this is an inline returning
unsigned long, and callers are using unsigned long, is there any
possibility the "if (x < 0)" is being optimised out? If you aware
of users reporting this problem (like the users in thread "iotop:
khugepaged at 99.99% (2.6.38.3)"), do you know if they had a particular
compiler in common?
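[For context, not from the thread: with !CONFIG_SMP the #ifdef block above is removed by the preprocessor, so the function effectively reduces to the sketch below and a negative vm_stat entry is returned unclamped.]

static inline unsigned long global_page_state(enum zone_stat_item item)
{
        /* No CONFIG_SMP: the "if (x < 0) x = 0;" clamp is compiled out */
        long x = atomic_long_read(&vm_stat[item]);
        return x;       /* -1 comes back as 4294967295 on 32-bit */
}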

--
Mel Gorman
SUSE Labs

2011-05-30 17:54:11

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 05:55:46PM +0100, Mel Gorman wrote:
> Even with drift issues, -1 there should be "impossible". Assuming this
> is a zoneinfo file, that figure is based on global_page_state() which
> looks like

The two cases reproducing this long hang in D state had SMP=n and
PREEMPT=y. Clearly not a common config these days. Also it didn't seem
apparent that any task was running in a code path that kept pages
isolated.

> unsigned long, and callers are using unsigned long, is there any
> possibility the "if (x < 0)" is being optimised out? If you aware

It was eliminated by cpp.

> of users reporting this problem (like the users in thread "iotop:
> khugepaged at 99.99% (2.6.38.3)"), do you know if they had a particular
> compiler in common?

I had no reason to worry about the compiler yet but that's always a good
idea to keep in mind. The thread where the bug is reported is the
"iotop" one you mentioned, and there's a tarball attached to one of
the last emails of the thread with the debug data I grepped. It was
/proc/zoneinfo file yes. That's the file I asked when I noticed
something had to be wrong with too_many_isolated and I expected either
nr_isolated or nr_inactive going wrong, it turned out it was
nr_isolated (apparently, I don't have full picture on the problem
yet). I added you in CC to a few emails but you weren't in all
replies.

The debug data you can find on lkml in this email: Message-Id:
<[email protected]>.

The other relevant sysrq+t here http://pastebin.com/raw.php?i=VG28YRbi

better save the latter (I did) as I'm worried it has a timeout on it.

Your patch was for reports with CONFIG_SMP=y? I'd prefer to clear out
this error before improving the too_many_isolated, in fact while
reviewing this code I was not impressed by too_many_isolated. For
vmscan.c, if there's a huge nr_active* list and a tiny nr_inactive
(like after a truncate of filebacked pages or munmap of anon memory)
there's no reason to stall, it's better to go ahead and let it refile
more active pages. The too_many_isolated in compaction.c looks a whole
lot better than the vmscan.c one as that takes into account the active
pages too... But I refrained to make any change in this area as I
don't think the bug is in too_many_isolated itself.
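[For comparison, a rough paraphrase of the vmscan.c check from the same era, not quoted in the thread: it only weighs the inactive list of the given type, which is the behaviour criticised above, while the compaction version also counts the active lists.]

/* Approximation of mm/vmscan.c:too_many_isolated() circa 2.6.38 */
static int too_many_isolated(struct zone *zone, int file,
                        struct scan_control *sc)
{
        unsigned long inactive, isolated;

        if (current_is_kswapd())
                return 0;

        if (!scanning_global_lru(sc))
                return 0;

        if (file) {
                inactive = zone_page_state(zone, NR_INACTIVE_FILE);
                isolated = zone_page_state(zone, NR_ISOLATED_FILE);
        } else {
                inactive = zone_page_state(zone, NR_INACTIVE_ANON);
                isolated = zone_page_state(zone, NR_ISOLATED_ANON);
        }

        /* Ignores the active lists entirely */
        return isolated > inactive;
}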

I noticed the count[] array is unsigned int, but it looks ok
(especially for 32bit ;) because the isolation is limited.

Both bugs were reported on 32bit x86 UP builds with PREEMPT=y. The
stat accounting seems to use atomics on UP so irqs on/off or
PREEMPT=y/n shouldn't matter if the increment is 1 insn long (plus no
irq code should ever mess with nr_isolated)... If it wasn't atomic and
irqs or preempt aren't disabled it could be preempt. To avoid
confusion: it's not proven that PREEMPT is related, it may be an
accident both .config had it on. I'm also unsure why it moves from
-1,0,1 I wouldn't expect a single page to be isolated like -1 pages to
be isolated, it just looks weird...
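[For context, a sketch of what the UP accounting path boils down to, paraphrased rather than quoted: with !CONFIG_SMP the *_zone_page_state() helpers end up in roughly the following, and on UP x86 each atomic_long_add() compiles to a single add without a lock prefix, which is the "1 insn long" point above.]

static inline void zone_page_state_add(long x, struct zone *zone,
                                 enum zone_stat_item item)
{
        /* Plain per-zone and global adds; nothing here can wrap the
         * counter on its own, so an imbalance of inc/dec callers is
         * the only way to reach -1. */
        atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_stat[item]);
}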

2011-05-31 04:55:31

by Kamezawa Hiroyuki

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, 30 May 2011 14:13:00 +0100
Mel Gorman <[email protected]> wrote:

> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.
>
> Reported-and-Tested-by: Ury Stankevich <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>



Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

BTW, I'm surprised to see both of vmscan.c and compaction.c has too_many_isolated()..
in different logic ;)

BTW, compaction ignores UNEVICTABLE LRU ?

Thanks,
-Kame


> ---
> mm/compaction.c | 32 ++++++++++++++++++++++++++------
> 1 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 021a296..331a2ee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -240,11 +240,20 @@ static bool too_many_isolated(struct zone *zone)
> return isolated > (inactive + active) / 2;
> }
>
> +/* possible outcome of isolate_migratepages */
> +typedef enum {
> + ISOLATE_ABORT, /* Abort compaction now */
> + ISOLATE_NONE, /* No pages isolated, continue scanning */
> + ISOLATE_SUCCESS, /* Pages isolated, migrate */
> +} isolate_migrate_t;
> +
> /*
> * Isolate all pages that can be migrated from the block pointed to by
> * the migrate scanner within compact_control.
> + *
> + * Returns false if compaction should abort at this point due to congestion.
> */
> -static unsigned long isolate_migratepages(struct zone *zone,
> +static isolate_migrate_t isolate_migratepages(struct zone *zone,
> struct compact_control *cc)
> {
> unsigned long low_pfn, end_pfn;
> @@ -261,7 +270,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
> /* Do not cross the free scanner or scan within a memory hole */
> if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> cc->migrate_pfn = end_pfn;
> - return 0;
> + return ISOLATE_NONE;
> }
>
> /*
> @@ -270,10 +279,14 @@ static unsigned long isolate_migratepages(struct zone *zone,
> * delay for some time until fewer pages are isolated
> */
> while (unlikely(too_many_isolated(zone))) {
> + /* async migration should just abort */
> + if (!cc->sync)
> + return ISOLATE_ABORT;
> +
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> if (fatal_signal_pending(current))
> - return 0;
> + return ISOLATE_ABORT;
> }
>
> /* Time to isolate some pages for migration */
> @@ -358,7 +371,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
>
> trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>
> - return cc->nr_migratepages;
> + return ISOLATE_SUCCESS;
> }
>
> /*
> @@ -522,9 +535,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> unsigned long nr_migrate, nr_remaining;
> int err;
>
> - if (!isolate_migratepages(zone, cc))
> + switch (isolate_migratepages(zone, cc)) {
> + case ISOLATE_ABORT:
> + goto out;
> + case ISOLATE_NONE:
> continue;
> -
> + case ISOLATE_SUCCESS:
> + ;
> + }
> +
> nr_migrate = cc->nr_migratepages;
> err = migrate_pages(&cc->migratepages, compaction_alloc,
> (unsigned long)cc, false,
> @@ -547,6 +566,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>
> }
>
> +out:
> /* Release free pages and check accounting */
> cc->nr_freepages -= release_freepages(&cc->freepages);
> VM_BUG_ON(cc->nr_freepages != 0);
>

2011-05-31 05:38:11

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Kame,

On Tue, May 31, 2011 at 01:48:35PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 30 May 2011 14:13:00 +0100
> Mel Gorman <[email protected]> wrote:
>
> > Asynchronous compaction is used when promoting to huge pages. This is
> > all very nice but if there are a number of processes in compacting
> > memory, a large number of pages can be isolated. An "asynchronous"
> > process can stall for long periods of time as a result with a user
> > reporting that firefox can stall for 10s of seconds. This patch aborts
> > asynchronous compaction if too many pages are isolated as it's better to
> > fail a hugepage promotion than stall a process.
> >
> > If accepted, this should also be considered for 2.6.39-stable. It should
> > also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> > compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> > writeback] would be applied to 2.6.38 before consideration.
> >
> > Reported-and-Tested-by: Ury Stankevich <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
>
>
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>
> BTW, I'm surprised to see both of vmscan.c and compaction.c has too_many_isolated()..
> in different logic ;)
>
> BTW, compaction ignores UNEVICTABLE LRU ?

Good point.
Yes, compaction doesn't work with the unevictable LRU at the moment, but I think
there is no reason it couldn't handle unevictable pages.
If we don't support the unevictable LRU, it could be a problem for workloads
with lots of mlocked pages.
It would be a good enhancement point for compaction.

>
> Thanks,
> -Kame
--
Kind regards
Minchan Kim

2011-05-31 07:14:49

by KOSAKI Motohiro

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

(2011/05/30 22:13), Mel Gorman wrote:
> Asynchronous compaction is used when promoting to huge pages. This is
> all very nice but if there are a number of processes in compacting
> memory, a large number of pages can be isolated. An "asynchronous"
> process can stall for long periods of time as a result with a user
> reporting that firefox can stall for 10s of seconds. This patch aborts
> asynchronous compaction if too many pages are isolated as it's better to
> fail a hugepage promotion than stall a process.
>
> If accepted, this should also be considered for 2.6.39-stable. It should
> also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> writeback] would be applied to 2.6.38 before consideration.
>
> Reported-and-Tested-by: Ury Stankevich <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>


2011-05-31 08:32:23

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 01:14:15AM +0900, Minchan Kim wrote:
> On Mon, May 30, 2011 at 02:13:00PM +0100, Mel Gorman wrote:
> > Asynchronous compaction is used when promoting to huge pages. This is
> > all very nice but if there are a number of processes in compacting
> > memory, a large number of pages can be isolated. An "asynchronous"
> > process can stall for long periods of time as a result with a user
> > reporting that firefox can stall for 10s of seconds. This patch aborts
> > asynchronous compaction if too many pages are isolated as it's better to
> > fail a hugepage promotion than stall a process.
> >
> > If accepted, this should also be considered for 2.6.39-stable. It should
> > also be considered for 2.6.38-stable but ideally [11bc82d6: mm:
> > compaction: Use async migration for __GFP_NO_KSWAPD and enforce no
> > writeback] would be applied to 2.6.38 before consideration.
> >
> > Reported-and-Tested-by: Ury Stankevich <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>

Thanks

> I have a nitpick below.
> Otherwise, looks good to me.
>
> > ---
> > mm/compaction.c | 32 ++++++++++++++++++++++++++------
> > 1 files changed, 26 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 021a296..331a2ee 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -240,11 +240,20 @@ static bool too_many_isolated(struct zone *zone)
> > return isolated > (inactive + active) / 2;
> > }
> >
> > +/* possible outcome of isolate_migratepages */
> > +typedef enum {
> > + ISOLATE_ABORT, /* Abort compaction now */
> > + ISOLATE_NONE, /* No pages isolated, continue scanning */
> > + ISOLATE_SUCCESS, /* Pages isolated, migrate */
> > +} isolate_migrate_t;
> > +
> > /*
> > * Isolate all pages that can be migrated from the block pointed to by
> > * the migrate scanner within compact_control.
> > + *
> > + * Returns false if compaction should abort at this point due to congestion.
>
> false? I think it would be better to use explicit word, ISOLATE_ABORT.
>

Oops, thanks for pointing that out. I'll post a V2 once it has been
figured out why NR_ISOLATE_* is getting screwed up on !CONFIG_SMP.

--
Mel Gorman
SUSE Labs

2011-05-31 12:16:29

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Andrea,

On Mon, May 30, 2011 at 07:53:34PM +0200, Andrea Arcangeli wrote:
> On Mon, May 30, 2011 at 05:55:46PM +0100, Mel Gorman wrote:
> > Even with drift issues, -1 there should be "impossible". Assuming this
> > is a zoneinfo file, that figure is based on global_page_state() which
> > looks like
>
> The two cases reproducing this long hang in D state, had from SMP=n
> PREEMPT=y. Clearly not common config these days. Also it didn't seem
> apparent that any task was running in a code path that kept pages
> isolated.
>
> > unsigned long, and callers are using unsigned long, is there any
> > possibility the "if (x < 0)" is being optimised out? If you aware
>
> It was eliminated by cpp.
>
> > of users reporting this problem (like the users in thread "iotop:
> > khugepaged at 99.99% (2.6.38.3)"), do you know if they had a particular
> > compiler in common?
>
> I had no reason to worry about the compiler yet but that's always good
> idea to keep in mind. The thread were the bug is reported is the
> "iotop" one you mentioned, and there's a tarball attached to one of
> the last emails of the thread with the debug data I grepped. It was
> /proc/zoneinfo file yes. That's the file I asked when I noticed
> something had to be wrong with too_many_isolated and I expected either
> nr_isolated or nr_inactive going wrong, it turned out it was
> nr_isolated (apparently, I don't have full picture on the problem
> yet). I added you in CC to a few emails but you weren't in all
> replies.
>
> The debug data you can find on lkml in this email: Message-Id:
> <[email protected]>.
>
> The other relevant sysrq+t here http://pastebin.com/raw.php?i=VG28YRbi
>
> better save the latter (I did) as I'm worried it has a timeout on it.
>
> Your patch was for reports with CONFIG_SMP=y? I'd prefer to clear out
> this error before improving the too_many_isolated, in fact while
> reviewing this code I was not impressed by too_many_isolated. For
> vmscan.c if there's an huge nr_active* list and a tiny nr_inactive
> (like after a truncate of filebacked pages or munmap of anon memory)
> there's no reason to stall, it's better to go ahead and let it refile
> more active pages. The too_many_isolated in compaction.c looks a whole
> lot better than the vmscan.c one as that takes into account the active
> pages too... But I refrained to make any change in this area as I
> don't think the bug is in too_many_isolated itself.
>
> I noticed the count[] array is unsigned int, but it looks ok
> (especially for 32bit ;) because the isolation is limited.
>
> Both bugs were reported on 32bit x86 UP builds with PREEMPT=y. The
> stat accounting seem to use atomics on UP so irqs on off or
> PREEMPT=y/n shouldn't matter if the increment is 1 insn long (plus no
> irq code should ever mess with nr_isolated)... If it wasn't atomic and
> irqs or preempt aren't disabled it could be preempt. To avoid
> confusion: it's not proven that PREEMPT is related, it may be an
> accident both .config had it on. I'm also unsure why it moves from
> -1,0,1 I wouldn't expect a single page to be isolated like -1 pages to
> be isolated, it just looks weird...

I am not sure this is related to the problem you have seen.
If he used hwpoison via madvise, it is possible.
Anyway, we can see a negative value from a count mismatch in a UP build.
Let's fix it.

From 1d3ebce2e8aa79dcc912da16b7a8d0611b6f9f1a Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 31 May 2011 21:11:58 +0900
Subject: [PATCH] Fix page isolated count mismatch

If migration fails, normally we call putback_lru_pages, which
decreases NR_ISOLATE_[ANON|FILE].
That means we should increase NR_ISOLATE_[ANON|FILE] before calling
putback_lru_pages, but soft_offline_page doesn't do it.

That can leave NR_ISOLATE_[ANON|FILE] with a negative value, and in a UP build
zone_page_state will then report a huge number of isolated pages, so the
too_many_isolated functions are deceived completely. In the end, some process
gets stuck in D state expecting the while loop around congestion_wait to
terminate. But it's a never ending story.

If it is right, it would be -stable stuff.

Cc: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/memory-failure.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5c8f7e0..eac0ba5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -52,6 +52,7 @@
#include <linux/swapops.h>
#include <linux/hugetlb.h>
#include <linux/memory_hotplug.h>
+#include <linux/mm_inline.h>
#include "internal.h"

int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -1468,7 +1469,8 @@ int soft_offline_page(struct page *page, int flags)
put_page(page);
if (!ret) {
LIST_HEAD(pagelist);
-
+ inc_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
0, true);
--
1.7.0.4
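[For context, a rough paraphrase (not from the thread) of the putback side the fix pairs with: mm/migrate.c's putback_lru_pages() at the time decremented NR_ISOLATED_* for every page handed back, so a caller that isolates pages without the matching inc_zone_page_state() drives the counter negative on migration failure.]

void putback_lru_pages(struct list_head *l)
{
        struct page *page;
        struct page *page2;

        list_for_each_entry_safe(page, page2, l, lru) {
                list_del(&page->lru);
                /* Assumes the caller accounted the page as isolated */
                dec_zone_page_state(page, NR_ISOLATED_ANON +
                                page_is_file_cache(page));
                putback_lru_page(page);
        }
}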

--
Kind regards
Minchan Kim

2011-05-31 12:25:16

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 09:16:20PM +0900, Minchan Kim wrote:
> I am not sure this is related to the problem you have seen.
> If he used hwpoison by madivse, it is possible.

CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set

> Anyway, we can see negative value by count mismatch in UP build.
> Let's fix it.

Definitely let's fix it, but it's probably not related to this one.

>
> From 1d3ebce2e8aa79dcc912da16b7a8d0611b6f9f1a Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Tue, 31 May 2011 21:11:58 +0900
> Subject: [PATCH] Fix page isolated count mismatch
>
> If migration is failed, normally we call putback_lru_pages which
> decreases NR_ISOLATE_[ANON|FILE].
> It means we should increase NR_ISOLATE_[ANON|FILE] before calling
> putback_lru_pages. But soft_offline_page dosn't it.
>
> It can make NR_ISOLATE_[ANON|FILE] with negative value and in UP build,
> zone_page_state will say huge isolated pages so too_many_isolated
> functions be deceived completely. At last, some process stuck in D state
> as it expect while loop ending with congestion_wait.
> But it's never ending story.
>
> If it is right, it would be -stable stuff.
>
> Cc: Mel Gorman <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> mm/memory-failure.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5c8f7e0..eac0ba5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -52,6 +52,7 @@
> #include <linux/swapops.h>
> #include <linux/hugetlb.h>
> #include <linux/memory_hotplug.h>
> +#include <linux/mm_inline.h>
> #include "internal.h"
>
> int sysctl_memory_failure_early_kill __read_mostly = 0;
> @@ -1468,7 +1469,8 @@ int soft_offline_page(struct page *page, int flags)
> put_page(page);
> if (!ret) {
> LIST_HEAD(pagelist);
> -
> + inc_zone_page_state(page, NR_ISOLATED_ANON +
> + page_is_file_cache(page));
> list_add(&page->lru, &pagelist);
> ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> 0, true);

Reviewed-by: Andrea Arcangeli <[email protected]>

Let's check all other migrate_pages callers too...

2011-05-31 13:33:51

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 02:24:37PM +0200, Andrea Arcangeli wrote:
> On Tue, May 31, 2011 at 09:16:20PM +0900, Minchan Kim wrote:
> > I am not sure this is related to the problem you have seen.
> > If he used hwpoison by madivse, it is possible.
>
> CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
> # CONFIG_MEMORY_FAILURE is not set
>
> > Anyway, we can see negative value by count mismatch in UP build.
> > Let's fix it.
>
> Definitely let's fix it, but it's probably not related to this one.
>
> >
> > From 1d3ebce2e8aa79dcc912da16b7a8d0611b6f9f1a Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <[email protected]>
> > Date: Tue, 31 May 2011 21:11:58 +0900
> > Subject: [PATCH] Fix page isolated count mismatch
> >
> > If migration is failed, normally we call putback_lru_pages which
> > decreases NR_ISOLATE_[ANON|FILE].
> > It means we should increase NR_ISOLATE_[ANON|FILE] before calling
> > putback_lru_pages. But soft_offline_page dosn't it.
> >
> > It can make NR_ISOLATE_[ANON|FILE] with negative value and in UP build,
> > zone_page_state will say huge isolated pages so too_many_isolated
> > functions be deceived completely. At last, some process stuck in D state
> > as it expect while loop ending with congestion_wait.
> > But it's never ending story.
> >
> > If it is right, it would be -stable stuff.
> >
> > Cc: Mel Gorman <[email protected]>
> > Cc: Andrea Arcangeli <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > mm/memory-failure.c | 4 +++-
> > 1 files changed, 3 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 5c8f7e0..eac0ba5 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -52,6 +52,7 @@
> > #include <linux/swapops.h>
> > #include <linux/hugetlb.h>
> > #include <linux/memory_hotplug.h>
> > +#include <linux/mm_inline.h>
> > #include "internal.h"
> >
> > int sysctl_memory_failure_early_kill __read_mostly = 0;
> > @@ -1468,7 +1469,8 @@ int soft_offline_page(struct page *page, int flags)
> > put_page(page);
> > if (!ret) {
> > LIST_HEAD(pagelist);
> > -
> > + inc_zone_page_state(page, NR_ISOLATED_ANON +
> > + page_is_file_cache(page));
> > list_add(&page->lru, &pagelist);
> > ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> > 0, true);
>
> Reviewed-by: Andrea Arcangeli <[email protected]>

Thanks, Andrea.

>
> Let's check all other migrate_pages callers too...

I checked them before sending patch but I got failed to find strange things. :(
Now I am checking the page's SwapBacked flag can be changed
between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
make mistake. I am approaching the failure, too. Hmm.


--
Kind regards
Minchan Kim

2011-05-31 14:14:37

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> I checked them before sending patch but I got failed to find strange things. :(

My review also doesn't show other bugs in migrate_pages callers like
that one.

> Now I am checking the page's SwapBacked flag can be changed
> between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> make mistake. I am approaching the failure, too. Hmm.

When I checked that, I noticed the ClearPageSwapBacked in swapcache if
radix insertion fails, but that happens before adding the page in the
LRU so it shouldn't have a chance to be isolated.

So far I only noticed an unsafe page_count in
vmscan.c:isolate_lru_pages but that should at worst result in a
invalid pointer dereference as random result from that page_count is
not going to hurt and I think it's only a theoretical issue.

2011-05-31 14:34:46

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, May 30, 2011 at 07:53:34PM +0200, Andrea Arcangeli wrote:
> On Mon, May 30, 2011 at 05:55:46PM +0100, Mel Gorman wrote:
> > Even with drift issues, -1 there should be "impossible". Assuming this
> > is a zoneinfo file, that figure is based on global_page_state() which
> > looks like
>
> The two cases reproducing this long hang in D state, had from SMP=n
> PREEMPT=y. Clearly not common config these days. Also it didn't seem
> apparent that any task was running in a code path that kept pages
> isolated.
>
> > unsigned long, and callers are using unsigned long, is there any
> > possibility the "if (x < 0)" is being optimised out? If you aware
>
> It was eliminated by cpp.
>

Yep, !CONFIG_SMP is important.

> > of users reporting this problem (like the users in thread "iotop:
> > khugepaged at 99.99% (2.6.38.3)"), do you know if they had a particular
> > compiler in common?
>
> I had no reason to worry about the compiler yet but that's always good
> idea to keep in mind.

I'm no longer considering it if CONFIG_SMP is not set.

> The thread were the bug is reported is the
> "iotop" one you mentioned, and there's a tarball attached to one of
> the last emails of the thread with the debug data I grepped. It was
> /proc/zoneinfo file yes. That's the file I asked when I noticed
> something had to be wrong with too_many_isolated and I expected either
> nr_isolated or nr_inactive going wrong, it turned out it was
> nr_isolated (apparently, I don't have full picture on the problem
> yet). I added you in CC to a few emails but you weren't in all
> replies.
>

I didn't pay as close attention as I should have either. I was out for a few
days when that thread happened and it had gone quiet by the time I came
back. I didn't check if it ever got resolved.

> The debug data you can find on lkml in this email: Message-Id:
> <[email protected]>.
>
> The other relevant sysrq+t here http://pastebin.com/raw.php?i=VG28YRbi
>
> better save the latter (I did) as I'm worried it has a timeout on it.
>
> Your patch was for reports with CONFIG_SMP=y?

No. Now that I look at the config, CONFIG_SMP was not set.

> I'd prefer to clear out
> this error before improving the too_many_isolated,

Agreed. So far, it has been impossible to reproduce but it's
happened to enough people that it must exist. I have a similar
config now and have applied a debugging patch to warn when the count
gets screwed up. It hasn't triggered yet so it must be due to some
difficult-to-hit race.

> in fact while
> reviewing this code I was not impressed by too_many_isolated. For
> vmscan.c if there's an huge nr_active* list and a tiny nr_inactive
> (like after a truncate of filebacked pages or munmap of anon memory)
> there's no reason to stall, it's better to go ahead and let it refile
> more active pages.

The rotating of lists should already have happened, but you could be
right that if a process was stalled in too_many_isolated for a long period
of time, the lists would become imbalanced. It'd be saved by kswapd
running at the same time and rotating the list, but it might need to be
revisited at some point. It's unrelated to the current problem though.

> The too_many_isolated in compaction.c looks a whole
> lot better than the vmscan.c one as that takes into account the active
> pages too... But I refrained to make any change in this area as I
> don't think the bug is in too_many_isolated itself.
>

Not a bug that would lock up the machine at least.

> I noticed the count[] array is unsigned int, but it looks ok
> (especially for 32bit ;) because the isolation is limited.
>

Agreed, this is not a wrapping problem.

> Both bugs were reported on 32bit x86 UP builds with PREEMPT=y. The
> stat accounting seem to use atomics on UP so irqs on off or
> PREEMPT=y/n shouldn't matter if the increment is 1 insn long (plus no
> irq code should ever mess with nr_isolated)... If it wasn't atomic and
> irqs or preempt aren't disabled it could be preempt. To avoid
> confusion: it's not proven that PREEMPT is related, it may be an
> accident both .config had it on. I'm also unsure why it moves from
> -1,0,1 I wouldn't expect a single page to be isolated like -1 pages to
> be isolated, it just looks weird...
>

Hmm, looking at the zoneinfo, we can probably rule out some race
related to page flags. If someone isolated a page as anon and put
it back as file, we'd decrement nr_isolated_file inappropriately to -1
but nr_isolated_anon would be stuck at 1 to match it.

I'm looking at __collapse_huge_page_isolate and it always assumes
it isolated anonymous pages. mmap_sem should be held as well as the
pagetable lock, and we hold the page lock while isolating from the LRU,
which seems fine, but I wonder whether there could be another condition
that allows a !PageSwapBacked page to sneak in there?

--
Mel Gorman
SUSE Labs

2011-05-31 14:37:45

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 04:14:02PM +0200, Andrea Arcangeli wrote:
> On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> > I checked them before sending patch but I got failed to find strange things. :(
>
> My review also doesn't show other bugs in migrate_pages callers like
> that one.
>
> > Now I am checking the page's SwapBacked flag can be changed
> > between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> > make mistake. I am approaching the failure, too. Hmm.
>
> When I checked that, I noticed the ClearPageSwapBacked in swapcache if
> radix insertion fails, but that happens before adding the page in the
> LRU so it shouldn't have a chance to be isolated.

True.

>
> So far I only noticed an unsafe page_count in
> vmscan.c:isolate_lru_pages but that should at worst result in a
> invalid pointer dereference as random result from that page_count is
> not going to hurt and I think it's only a theoretical issue.


Yes. You find a new BUG.
It seems to be related to this problem but it should be solved although
it's very rare case.

--
Kind regards
Minchan Kim

2011-05-31 14:38:39

by Minchan Kim

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 11:37:35PM +0900, Minchan Kim wrote:
> On Tue, May 31, 2011 at 04:14:02PM +0200, Andrea Arcangeli wrote:
> > On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> > > I checked them before sending patch but I got failed to find strange things. :(
> >
> > My review also doesn't show other bugs in migrate_pages callers like
> > that one.
> >
> > > Now I am checking the page's SwapBacked flag can be changed
> > > between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> > > make mistake. I am approaching the failure, too. Hmm.
> >
> > When I checked that, I noticed the ClearPageSwapBacked in swapcache if
> > radix insertion fails, but that happens before adding the page in the
> > LRU so it shouldn't have a chance to be isolated.
>
> True.
>
> >
> > So far I only noticed an unsafe page_count in
> > vmscan.c:isolate_lru_pages but that should at worst result in a
> > invalid pointer dereference as random result from that page_count is
> > not going to hurt and I think it's only a theoretical issue.
>
>
> Yes. You find a new BUG.
> It seems to be related to this problem but it should be solved although

typo : It doesn't seem to be.


-
Kind regards
Minchan Kim

2011-06-01 00:57:55

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 04:14:02PM +0200, Andrea Arcangeli wrote:
> On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> > I checked them before sending patch but I got failed to find strange things. :(
>
> My review also doesn't show other bugs in migrate_pages callers like
> that one.
>
> > Now I am checking the page's SwapBacked flag can be changed
> > between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> > make mistake. I am approaching the failure, too. Hmm.
>
> When I checked that, I noticed the ClearPageSwapBacked in swapcache if
> radix insertion fails, but that happens before adding the page in the
> LRU so it shouldn't have a chance to be isolated.
>

After hammering three machines for several hours, I managed to trigger
this once on x86 !CONFIG_SMP CONFIG_PREEMPT HIGHMEM4G (so no PAE)
and caught the following trace.

May 31 23:45:37 arnold kernel: WARNING: at include/linux/vmstat.h:167 compact_zone+0xf8/0x53c()
May 31 23:45:37 arnold kernel: Hardware name:
May 31 23:45:37 arnold kernel: Modules linked in: 3c59x mii sr_mod forcedeth cdrom ext4 mbcache jbd2 crc16 sd_mod ata_generic pata_amd sata_nv libata scsi_mod
May 31 23:45:37 arnold kernel: Pid: 16172, comm: usemem Not tainted 2.6.38.4-autobuild #17
May 31 23:45:37 arnold kernel: Call Trace:
May 31 23:45:37 arnold kernel: [<c10277f5>] ? warn_slowpath_common+0x65/0x7a
May 31 23:45:37 arnold kernel: [<c1098b12>] ? compact_zone+0xf8/0x53c
May 31 23:45:37 arnold kernel: [<c1027819>] ? warn_slowpath_null+0xf/0x13
May 31 23:45:37 arnold kernel: [<c1098b12>] ? compact_zone+0xf8/0x53c
May 31 23:45:37 arnold kernel: [<c1098fe3>] ? compact_zone_order+0x8d/0x95
May 31 23:45:37 arnold kernel: [<c1099068>] ? try_to_compact_pages+0x7d/0xc8
May 31 23:45:37 arnold kernel: [<c107ba56>] ? __alloc_pages_direct_compact+0x71/0x102
May 31 23:45:37 arnold kernel: [<c107be15>] ? __alloc_pages_nodemask+0x32e/0x57d
May 31 23:45:37 arnold kernel: [<c10914a6>] ? anon_vma_prepare+0x13/0x109
May 31 23:45:37 arnold kernel: [<c109fb01>] ? do_huge_pmd_anonymous_page+0xc9/0x285
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346
May 31 23:45:37 arnold kernel: [<c108cb5e>] ? handle_mm_fault+0x7b/0x13a
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346
May 31 23:45:37 arnold kernel: [<c1019298>] ? do_page_fault+0x32e/0x346
May 31 23:45:37 arnold kernel: [<c104a234>] ? trace_hardirqs_off+0xb/0xd
May 31 23:45:37 arnold kernel: [<c100482c>] ? do_softirq+0x9f/0xb5
May 31 23:45:37 arnold kernel: [<c12a5dee>] ? restore_all_notrace+0x0/0x18
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346
May 31 23:45:37 arnold kernel: [<c104e91e>] ? trace_hardirqs_on_caller+0xfd/0x11e
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346
May 31 23:45:37 arnold kernel: [<c12a61d9>] ? error_code+0x5d/0x64
May 31 23:45:37 arnold kernel: [<c1018f6a>] ? do_page_fault+0x0/0x346

This is triggering in compaction's too_many_isolated() where the
NR_ISOLATED_FILE counter has gone negative, so the damage was already
done. Most likely, the damage was caused when compaction called
putback_lru_pages() on pages that failed migration and that were
accounted as isolated anon during isolation but magically put back as
isolated file.
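[For context, a rough paraphrase of the isolation-time accounting in 2.6.38-era mm/compaction.c, not quoted in the thread: the anon/file split is decided once from the page flags when the pages are isolated, and decided again from PageSwapBacked when putback_lru_pages() returns them, which is the window the theory above relies on.]

static void acct_isolated(struct zone *zone, struct compact_control *cc)
{
        struct page *page;
        unsigned int count[NR_LRU_LISTS] = { 0, };

        /* anon vs file is decided here, at isolation time */
        list_for_each_entry(page, &cc->migratepages, lru)
                count[page_lru_base_type(page)]++;

        cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
        cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
        __mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
        __mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
}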

It's almost 2am so I'm wiped but the first thing in the morning
I want to check is if http://lkml.org/lkml/2010/8/26/32 is
relevant. Specifically, if during transparent hugepage collapsing
or splitting we are not protected by the anon_vma lock allowing an
imbalance to occur while calling release_pte_pages(). This seems a
bit far-reached though as I'd think at least the anon counter would
be corrupted by that.

A related possibility is that if the wrong anon_vma is being locked
then there is a race between collapse_huge_page and when migration
drops to 0 allowing release_pte_pages() to miss the page entirely.
Again, it's the wrong counter being corrupted, you'd think.

Another possibility is that, because this is !PAE, the !SMP version
of native_pmdp_get_and_clear is somehow insufficient, although I can't
think how it might be - unless the lack of a barrier with preemption
enabled is somehow a factor. Again, it's a reach because one would
expect the anon counter to get messed up, not the file one.

I can't formulate a theory as to how PageSwapBacked gets cleared during
migration that would cause compaction's putback_lru_pages to decrement
the wrong counter. Maybe sleep will figure it out :(
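[For reference: the helper that decides the anon/file direction of the decrement is just a test of PageSwapBacked, so a page whose PageSwapBacked flag were cleared between isolation and putback would indeed be decremented against NR_ISOLATED_FILE despite having been accounted as anon.]

static inline int page_is_file_cache(struct page *page)
{
        return !PageSwapBacked(page);
}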

--
Mel Gorman
SUSE Labs

2011-06-01 09:25:04

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Wed, Jun 01, 2011 at 01:57:47AM +0100, Mel Gorman wrote:
> It's almost 2am so I'm wiped but the first thing in the morning
> I want to check is if http://lkml.org/lkml/2010/8/26/32 is
> relevant.

It's not. The patch I really meant was
https://lkml.org/lkml/2011/5/28/155 and it's irrelevant to 2.6.38.4
which is already doing the right thing of rechecking page->mapping under
lock.

--
Mel Gorman
SUSE Labs

2011-06-01 17:58:15

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Wed, Jun 01, 2011 at 01:57:47AM +0100, Mel Gorman wrote:
> On Tue, May 31, 2011 at 04:14:02PM +0200, Andrea Arcangeli wrote:
> > On Tue, May 31, 2011 at 10:33:40PM +0900, Minchan Kim wrote:
> > > I checked them before sending patch but I got failed to find strange things. :(
> >
> > My review also doesn't show other bugs in migrate_pages callers like
> > that one.
> >
> > > Now I am checking the page's SwapBacked flag can be changed
> > > between before and after of migrate_pages so accounting of NR_ISOLATED_XX can
> > > make mistake. I am approaching the failure, too. Hmm.
> >
> > When I checked that, I noticed the ClearPageSwapBacked in swapcache if
> > radix insertion fails, but that happens before adding the page in the
> > LRU so it shouldn't have a chance to be isolated.
> >
>
> After hammering three machines for several hours, I managed to trigger
> this once on x86 !CONFIG_SMP CONFIG_PREEMPT HIGHMEM4G (so no PAE)
> and caught the following trace.
>

Umm, HIGHMEM4G implies a two-level pagetable layout so where are
things like _PAGE_BIT_SPLITTING being set when THP is enabled?

--
Mel Gorman
SUSE Labs

2011-06-01 19:15:45

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
> Umm, HIGHMEM4G implies a two-level pagetable layout so where are
> things like _PAGE_BIT_SPLITTING being set when THP is enabled?

They should be set on the pgd, pud_offset/pgd_offset will just bypass.
The splitting bit shouldn't be special about it, the present bit
should work the same.

2011-06-01 21:40:25

by Mel Gorman

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote:
> On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
> > Umm, HIGHMEM4G implies a two-level pagetable layout so where are
> > things like _PAGE_BIT_SPLITTING being set when THP is enabled?
>
> They should be set on the pgd, pud_offset/pgd_offset will just bypass.
> The splitting bit shouldn't be special about it, the present bit
> should work the same.

This comment is misleading at best then.

#define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */

At the PGD level, it can have PSE set obviously but it's not a
PMD. I confess I haven't checked the manual to see if it's safe to
use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I
found that the bug is far harder to reproduce with 3 pagetable levels
than with 2 but that is just timing. So far it has proven impossible
on x86-64 at least within 27 hours so that has me looking at how
pagetable management between x86 and x86-64 differ.

Barriers are a big difference between 32-bit !SMP and x86-64 but I
don't know yet which one is relevant or if this is even the right
direction.

--
Mel Gorman
SUSE Labs

2011-06-01 23:31:19

by Andrea Arcangeli

Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Mel,

On Wed, Jun 01, 2011 at 10:40:18PM +0100, Mel Gorman wrote:
> On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote:
> > On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
> > > Umm, HIGHMEM4G implies a two-level pagetable layout so where are
> > > things like _PAGE_BIT_SPLITTING being set when THP is enabled?
> >
> > They should be set on the pgd, pud_offset/pgd_offset will just bypass.
> > The splitting bit shouldn't be special about it, the present bit
> > should work the same.
>
> This comment is misleading at best then.
>
> #define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */

From common code point of view it's set in the pmd, the comment can be
extended to specify it's actually the pgd in case of 32bit noPAE but I
didn't think it was too misleading as we think in common code terms
all over the code, the fact it's a bypass is pretty clear across the
whole archs.

> At the PGD level, it can have PSE set obviously but it's not a
> PMD. I confess I haven't checked the manual to see if it's safe to
> use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I

To be sure I re-checked on 253668.pdf page 113/114 noPAE and page 122
PAE, on x86 32bit/64 all ptes/pmd/pgd (32bit/64bit PAE/noPAE) have bit
9-11 "Avail" to software. So I think we should be safe here.

> found that the bug is far harder to reproduce with 3 pagetable levels
> than with 2 but that is just timing. So far it has proven impossible
> on x86-64 at least within 27 hours so that has me looking at how
> pagetable management between x86 and x86-64 differ.

Weird.

However I could see it screwing the nr_inactive/active_* stats, but
the nr_isolated should never go below zero, and especially not anon
even if split_huge_page does the accounting wrong (and
migrate/compaction won't mess with THP), or at least I'd expect things
to fall apart in other ways and not with just a fairly innocuous and
not-memory corrupting nr_isolated_ counter going off just by one.

The khugepaged nr_isolated_anon increment couldn't affect the file one,
and we hold mmap_sem in write mode there to prevent the pte from changing
under us, in addition to the PT and anon_vma lock. The anon_vma lock
being wrong sounds unlikely too, and even if it was, it should screw
the nr_isolated_anon counter; it's impossible to screw nr_isolated_file
with khugepaged.

Where did you put your bugcheck? It looked like you put it in the < 0
reader; can you add it to all _inc/dec/mod (even _inc just in case) so
we may get a stack trace including the culprit? (Not guaranteed, but a
better chance.)
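
For illustration, here is a minimal sketch of the kind of bugcheck being
suggested, using a hypothetical wrapper rather than the real vmstat helpers:
trap at the point of modification so the stack trace includes the caller that
drives the counter negative, instead of whichever reader notices it later.

	#include <linux/bug.h>

	/* hypothetical wrapper; the real counters live in the per-cpu vmstat code */
	static inline void isolated_stat_mod(long *counter, long delta)
	{
		long val = *counter + delta;

		/* catch the culprit in the act rather than in a later reader */
		BUG_ON(val < 0);
		*counter = val;
	}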

> Barriers are a big different between how 32-bit !SMP and X86-64 but
> don't know yet which one is relevant or if this is even the right
> direction.

The difference is we need xchg on SMP to avoid losing the dirty
bit. Otherwise, if we do pmd_t pmd = *pmdp; *pmdp = 0; the dirty bit
may have been set in between the two by another thread running in
userland on a different CPU, while the pmd was still "present". As
long as interrupts don't write to read-write userland memory with the
pte dirty bit clear, we shouldn't need xchg on !SMP.
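
As a hedged illustration of that race (simplified userspace C with
hypothetical names, not the kernel's pmd helpers): between the load and the
store of a non-atomic clear, another CPU can still set the dirty bit in the
still-present entry and that update is lost; an atomic exchange closes the
window.

	#include <stdatomic.h>

	typedef unsigned long pmdval_t;

	/* Racy on SMP: a dirty bit set after the load is silently dropped. */
	static pmdval_t pmd_get_and_clear_racy(pmdval_t *pmdp)
	{
		pmdval_t old = *pmdp;	/* another CPU may still mark the entry dirty here */
		*pmdp = 0;
		return old;
	}

	/* Safe: the read and the clear are a single atomic operation (xchg on x86). */
	static pmdval_t pmd_get_and_clear_atomic(_Atomic pmdval_t *pmdp)
	{
		return atomic_exchange(pmdp, 0);
	}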

On PAE we also need to write 0 into pmd_low before worrying about
pmd_high, so the present bit is cleared before clearing the high part
of the 32-bit PAE pte, and we rely on xchg's implicit lock to avoid an
smp_wmb() in between the two writes.

I'm unsure if any of this could be relevant to our problem; also, there
can't be more than one writer at once on the pmd, as nobody can modify
it without the page_table_lock held. xchg there is just to be safe for
the dirty bit (or we'd corrupt memory with threads running in userland
and writing to memory on other cpus while we ptep_clear_flush).

I've been wondering about the lack of "lock" on the bus in atomic.h
too, but I can't see how it can possibly matter on !SMP: vmstat
modifications should execute as a single asm insn, so preempt or an irq
can't interrupt them.

2011-06-02 01:03:58

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 01:30:36AM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> On Wed, Jun 01, 2011 at 10:40:18PM +0100, Mel Gorman wrote:
> > On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote:
> > > On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
> > > > Umm, HIGHMEM4G implies a two-level pagetable layout so where are
> > > > things like _PAGE_BIT_SPLITTING being set when THP is enabled?
> > >
> > > They should be set on the pgd, pud_offset/pgd_offset will just bypass.
> > > The splitting bit shouldn't be special about it, the present bit
> > > should work the same.
> >
> > This comment is misleading at best then.
> >
> > #define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
>
> From common code point of view it's set in the pmd, the comment can be
> extended to specify it's actually the pgd in case of 32bit noPAE but I
> didn't think it was too misleading as we think in common code terms
> all over the code, the fact it's a bypass is pretty clear across the
> whole archs.
>

Fair point.

> > At the PGD level, it can have PSE set obviously but it's not a
> > PMD. I confess I haven't checked the manual to see if it's safe to
> > use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I
>
> To be sure I re-checked on 253668.pdf page 113/114 noPAE and page 122
> PAE, on x86 32bit/64 all ptes/pmd/pgd (32bit/64bit PAE/noPAE) have bit
> 9-11 "Avail" to software. So I think we should be safe here.
>

Good stuff. I was reasonably sure this was the case but as this was
already "impossible", it needed to be ruled out.

> > found that the bug is far harder to reproduce with 3 pagetable levels
> > than with 2 but that is just timing. So far it has proven impossible
> > on x86-64 at least within 27 hours so that has me looking at how
> > pagetable management between x86 and x86-64 differ.
>
> Weird.
>
> However I could see it screwing the nr_inactive/active_* stats, but
> the nr_isolated should never go below zero, and especially not anon
> even if split_huge_page does the accounting wrong (and
> migrate/compaction won't mess with THP), or at least I'd expect things
> to fall apart in other ways and not with just a fairly innocuous and
> not-memory corrupting nr_isolated_ counter going off just by one.
>

Again, agreed. I found it hard to come up with a reason why file would
get messed up particularly as PageSwapBacked does not get cleared in the
ordinary case until the page is freed. If we were using pages after
being freed due to bad refcounting, it would show up in all sorts of bad
ways.

> The khugepaged nr_isolated_anon increment couldn't affect the file one
> and we hold mmap_sem write mode there to prevent the pte to change
> from under us, in addition to the PT and anon_vma lock. Anon_vma lock
> being wrong sounds unlikely too, and even if it was it should screw
> the nr_isolated_anon counter, impossible to screw the nr_isolated_file
> with khugepaged.
>

After reviewing, I still could not find a problem with the locking that
might explain this. I thought last night anon_vma might be bust in some
way but today I couldn't find a problem.

> Where did you put your bugcheck? It looked like you put it in the < 0
> reader, can you add it to all _inc/dec/mod (even _inc just in case) so
> we may get a stack trace including the culprit? (not guaranteed but
> better chance)
>

Did that, didn't really help other than showing the corruption happens
early in the process lifetime while huge PMDs are being faulted. This
made me think the problem might be on or near fork.

> > Barriers are a big different between how 32-bit !SMP and X86-64 but
> > don't know yet which one is relevant or if this is even the right
> > direction.
>
> The difference is we need xchg on SMP to avoid losing the dirty
> bit. Otherwise if we do pmd_t pmd = *pmdp; *pmdp = 0; the dirty bit
> may have been set in between the two by another thread running in
> userland in a different CPU, while the pmd was still "present". As
> long as interrupts don't write to read-write userland memory with the
> pte dirty bit clear, we shouldn't need xchg on !SMP.
>

Yep.

> On PAE we also need to write 0 into pmd_low before worrying about
> pmd_high so the present bit is cleared before clearing the high part
> of the 32bit PAE pte, and we relay on xchg implicit lock to avoid a
> smp_wmb() in between the two writes.
>

Yep.

> I'm unsure if any of this could be relevant to our problem, also there

I concluded after a while that it wasn't: partly from reasoning about
it and partly from testing, forcing the use of the SMP versions and
finding the bug was still reproducible.

> can't be more than one writer at once in the pmd, as nobody can modify
> it without the page_table_lock held. xchg there is just to be safe for
> the dirty bit (or we'd corrupt memory with threads running in userland
> and writing to memory on other cpus while we ptep_clear_flush).
>
> I've been wondering about the lack of "lock" on the bus in atomic.h
> too, but I can't see how it can possibly matter on !SMP, vmstat
> modifications should execute only 1 asm insn so preempt or irq can't
> interrupt it.

To be honest, I haven't fully figured out yet why it makes such
a difference on !SMP. I have a vague notion that it's because
the page table page and the data are visible before the bit set by
SetPageSwapBacked on the struct page is visible, but I haven't reasoned
it out yet. If this were the case, it might allow an "anon" page to
be treated as a file page by compaction for accounting purposes and push
the counter negative, but you'd think then that the anon isolation count
would be positive, so it's something else.

As I thought fork() might be an issue, I looked closer at what we do
there. We are calling pmd_alloc at copy_pmd_range which is a no-op
when PMD is folded and copy_huge_pmd() is calling pte_alloc_one()
which also has no barrier. I haven't checked this fully (it's very
late again as I wasn't able to work on this during most of the day)
but I wonder if it's then possible the PMD setup is not visible before
insertion into the page table leading to weirdness? Why it matters to
SMP is unclear unless this is a preemption thing I'm not thinking of.

In a similar vein, during collapse_huge_page() we use a barrier
to ensure the data copy is visible before the PMD insertion, but in
__do_huge_pmd_anonymous_page() we assume that the "spinlocking to take the
lru_lock inside page_add_new_anon_rmap() acts as a full memory barrier".
Thing is, it's calling lru_cache_add_lru(), which adds the page to a
pagevec and does not necessarily take the LRU lock, and !SMP leaves a
big enough race before the pagevec gets drained to cause a problem.
Of course, maybe it *is* happening on SMP but the negative counters
are being reported as zero :)

To see if this is along the right lines, I'm currently testing with
this patch against 2.6.38.4. It hasn't blown up in 35 minutes, which
is an improvement over getting into trouble after 5, so I'll leave
it running overnight and see if I can convince myself of what is going
on tomorrow.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2d29c9a..65fa251 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -631,12 +631,14 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
entry = mk_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
+
/*
- * The spinlocking to take the lru_lock inside
- * page_add_new_anon_rmap() acts as a full memory
- * barrier to be sure clear_huge_page writes become
- * visible after the set_pmd_at() write.
+ * Need a write barrier to ensure the writes from
+ * clear_huge_page become visible before the
+ * set_pmd_at
*/
+ smp_wmb();
+
page_add_new_anon_rmap(page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
prepare_pmd_huge_pte(pgtable, mm);
@@ -753,6 +755,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
+
+ /*
+ * Write barrier to make sure the setup for the PMD is fully visible
+ * before the set_pmd_at
+ */
+ smp_wmb();
+
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
prepare_pmd_huge_pte(pgtable, dst_mm);

2011-06-02 08:34:11

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 2, 2011 at 10:03 AM, Mel Gorman <[email protected]> wrote:
> On Thu, Jun 02, 2011 at 01:30:36AM +0200, Andrea Arcangeli wrote:
>> Hi Mel,
>>
>> On Wed, Jun 01, 2011 at 10:40:18PM +0100, Mel Gorman wrote:
>> > On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote:
>> > > On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote:
>> > > > Umm, HIGHMEM4G implies a two-level pagetable layout so where are
>> > > > things like _PAGE_BIT_SPLITTING being set when THP is enabled?
>> > >
>> > > They should be set on the pgd, pud_offset/pgd_offset will just bypass.
>> > > The splitting bit shouldn't be special about it, the present bit
>> > > should work the same.
>> >
>> > This comment is misleading at best then.
>> >
>> > #define _PAGE_BIT_SPLITTING     _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
>>
>> From common code point of view it's set in the pmd, the comment can be
>> extended to specify it's actually the pgd in case of 32bit noPAE but I
>> didn't think it was too misleading as we think in common code terms
>> all over the code, the fact it's a bypass is pretty clear across the
>> whole archs.
>>
>
> Fair point.
>
>> > At the PGD level, it can have PSE set obviously but it's not a
>> > PMD. I confess I haven't checked the manual to see if it's safe to
>> > use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I
>>
>> To be sure I re-checked on 253668.pdf page 113/114 noPAE and page 122
>> PAE, on x86 32bit/64 all ptes/pmd/pgd (32bit/64bit PAE/noPAE) have bit
>> 9-11 "Avail" to software. So I think we should be safe here.
>>
>
> Good stuff. I was reasonably sure this was the case but as this was
> already "impossible", it needed to be ruled out.
>
>> > found that the bug is far harder to reproduce with 3 pagetable levels
>> > than with 2 but that is just timing. So far it has proven impossible
>> > on x86-64 at least within 27 hours so that has me looking at how
>> > pagetable management between x86 and x86-64 differ.
>>
>> Weird.
>>
>> However I could see it screwing the nr_inactive/active_* stats, but
>> the nr_isolated should never go below zero, and especially not anon
>> even if split_huge_page does the accounting wrong (and
>> migrate/compaction won't mess with THP), or at least I'd expect things
>> to fall apart in other ways and not with just a fairly innocuous and
>> not-memory corrupting nr_isolated_ counter going off just by one.
>>
>
> Again, agreed. I found it hard to come up with a reason why file would
> get messed up particularly as PageSwapBacked does not get cleared in the
> ordinary case until the page is freed. If we were using pages after
> being freed due to bad refcounting, it would show up in all sorts of bad
> ways.
>
>> The khugepaged nr_isolated_anon increment couldn't affect the file one
>> and we hold mmap_sem write mode there to prevent the pte to change
>> from under us, in addition to the PT and anon_vma lock. Anon_vma lock
>> being wrong sounds unlikely too, and even if it was it should screw
>> the nr_isolated_anon counter, impossible to screw the nr_isolated_file
>> with khugepaged.
>>
>
> After reviewing, I still could not find a problem with the locking that
> might explain this. I thought last night anon_vma might be bust in some
> way but today I couldn't find a problem.
>
>> Where did you put your bugcheck? It looked like you put it in the < 0
>> reader, can you add it to all _inc/dec/mod (even _inc just in case) so
>> we may get a stack trace including the culprit? (not guaranteed but
>> better chance)
>>
>
> Did that, didn't really help other than showing the corruption happens
> early in the process lifetime while huge PMDs are being faulted. This
> made me think the problem might be on or near fork.
>
>> > Barriers are a big different between how 32-bit !SMP and X86-64 but
>> > don't know yet which one is relevant or if this is even the right
>> > direction.
>>
>> The difference is we need xchg on SMP to avoid losing the dirty
>> bit. Otherwise if we do pmd_t pmd = *pmdp; *pmdp = 0; the dirty bit
>> may have been set in between the two by another thread running in
>> userland in a different CPU, while the pmd was still "present". As
>> long as interrupts don't write to read-write userland memory with the
>> pte dirty bit clear, we shouldn't need xchg on !SMP.
>>
>
> Yep.
>
>> On PAE we also need to write 0 into pmd_low before worrying about
>> pmd_high so the present bit is cleared before clearing the high part
>> of the 32bit PAE pte, and we relay on xchg implicit lock to avoid a
>> smp_wmb() in between the two writes.
>>
>
> Yep.
>
>> I'm unsure if any of this could be relevant to our problem, also there
>
> I concluded after a while that it wasn't. Partially from reasoning about
> it and part by testing forcing the use of the SMP versions and finding
> the bug was still reproducible.
>
>> can't be more than one writer at once in the pmd, as nobody can modify
>> it without the page_table_lock held. xchg there is just to be safe for
>> the dirty bit (or we'd corrupt memory with threads running in userland
>> and writing to memory on other cpus while we ptep_clear_flush).
>>
>> I've been wondering about the lack of "lock" on the bus in atomic.h
>> too, but I can't see how it can possibly matter on !SMP, vmstat
>> modifications should execute only 1 asm insn so preempt or irq can't
>> interrupt it.
>
> To be honest, I haven't fully figured out yet why it makes such
> a difference on !SMP. I have a vague notion that it's because
> the page table page and the data is visible before the bit set by
> SetPageSwapBacked on the struct page is visible but haven't reasoned
> it out yet. If this was the case, it might allow an "anon" page to
> be treated as a file by compaction for accounting purposes and push
> the counter negative but you'd think then the anon isolation would
> be positive so it's something else.
>
> As I thought fork() be an issue, I looked closer at what we do
> there. We are calling pmd_alloc at copy_pmd_range which is a no-op
> when PMD is folded and copy_huge_pmd() is calling pte_alloc_one()
> which also has no barrier. I haven't checked this fully (it's very
> late again as I wasn't able to work on this during most of the day)
> but I wonder if it's then possible the PMD setup is not visible before
> insertion into the page table leading to weirdness? Why it matters to
> SMP is unclear unless this is a preemption thing I'm not thinking of.
>
> On a similar vein, during collapse_huge_page(), we use a barrier
> to ensure the data copy is visible before the PMD insertion but in
> __do_huge_pmd_anonymous_page(), we assume the "spinlocking to take the
> lru_lock inside page_add_new_anon_rmap() acts as a full memory". Thing
> is, it's calling lru_cache_add_lru() adding the page to a pagevec
> which is not necessarily taking the LRU lock and !SMP is leaving a
> big enough race before the pagevec gets drained to cause a problem.
> Of course, maybe it *is* happening on SMP but the negative counters
> are being reported as zero :)

Yes. Although we have an atomic_inc in get_page, it doesn't imply a full
memory barrier, so we need an explicit memory barrier. I think you're right.
But I can't see how it's related to this problem (UP, preemption).


--
Kind regards,
Minchan Kim

2011-06-02 13:30:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 02:03:52AM +0100, Mel Gorman wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2d29c9a..65fa251 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -631,12 +631,14 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> entry = mk_pmd(page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> entry = pmd_mkhuge(entry);
> +
> /*
> - * The spinlocking to take the lru_lock inside
> - * page_add_new_anon_rmap() acts as a full memory
> - * barrier to be sure clear_huge_page writes become
> - * visible after the set_pmd_at() write.
> + * Need a write barrier to ensure the writes from
> + * clear_huge_page become visible before the
> + * set_pmd_at
> */
> + smp_wmb();
> +

On x86 at least this is a noop because of the
spin_lock(&page_table_lock) after clear_huge_page. But I'm not against
adding this in case other archs support THP later.

But smp_wmb() is optimized away at build time by cpp, so this can't
possibly help if you're reproducing on !SMP.

> page_add_new_anon_rmap(page, vma, haddr);
> set_pmd_at(mm, haddr, pmd, entry);
> prepare_pmd_huge_pte(pgtable, mm);
> @@ -753,6 +755,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>
> pmdp_set_wrprotect(src_mm, addr, src_pmd);
> pmd = pmd_mkold(pmd_wrprotect(pmd));
> +
> + /*
> + * Write barrier to make sure the setup for the PMD is fully visible
> + * before the set_pmd_at
> + */
> + smp_wmb();
> +
> set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> prepare_pmd_huge_pte(pgtable, dst_mm);

This part seems superfluous to me; it's also a noop for !SMP. Only wmb()
would stay. The pmd is perfectly fine staying in a register; not even
a compiler barrier is needed, even less an smp serialization.

2011-06-02 14:50:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 03:29:54PM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 02, 2011 at 02:03:52AM +0100, Mel Gorman wrote:
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 2d29c9a..65fa251 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -631,12 +631,14 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> > entry = mk_pmd(page, vma->vm_page_prot);
> > entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> > entry = pmd_mkhuge(entry);
> > +
> > /*
> > - * The spinlocking to take the lru_lock inside
> > - * page_add_new_anon_rmap() acts as a full memory
> > - * barrier to be sure clear_huge_page writes become
> > - * visible after the set_pmd_at() write.
> > + * Need a write barrier to ensure the writes from
> > + * clear_huge_page become visible before the
> > + * set_pmd_at
> > */
> > + smp_wmb();
> > +
>
> On x86 at least this is noop because of the
> spin_lock(&page_table_lock) after clear_huge_page. But I'm not against
> adding this in case other archs supports THP later.
>

I thought spin lock acquisition was one-way, where loads/stores
preceding the lock are allowed to leak into the protected region
but not the other way around?

So we have

clear_huge_page()
__SetPageUptodate(page);
spin_lock(&mm->page_table_lock);
...
set_pmd_at(mm, haddr, pmd, entry);

This spinlock itself does not guarantee that writes from
clear_huge_page are complete before that set_pmd_at().

Whether this is right or wrong, why is the same not true in
collapse_huge_page()? There we are

__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
....
smp_wmb();
spin_lock(&mm->page_table_lock);
...
set_pmd_at(mm, address, pmd, _pmd);

with the comment stressing that this is necessary.
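
To make the question concrete, here is a hedged userspace sketch in C11
atomics (stand-in names, not the kernel primitives): an acquire-only lock
does not, by itself, order the earlier plain stores against the store inside
the critical section, which is exactly what the smp_wmb()/release fence
would add.

	#include <stdatomic.h>
	#include <string.h>

	static char page_data[4096];			/* stands in for the huge page */
	static _Atomic unsigned long pmd_entry;		/* stands in for the pmd */
	static atomic_flag ptl = ATOMIC_FLAG_INIT;	/* stands in for page_table_lock */

	static void fault_path(void)
	{
		memset(page_data, 0, sizeof(page_data));	/* "clear_huge_page()" */

		/* Without this release fence there is only acquire ordering below,
		 * and nothing prevents the pmd store from becoming visible before
		 * the cleared page contents on a weakly ordered machine. */
		atomic_thread_fence(memory_order_release);	/* the smp_wmb() */

		while (atomic_flag_test_and_set_explicit(&ptl, memory_order_acquire))
			;					/* "spin_lock()" */
		atomic_store_explicit(&pmd_entry, 1, memory_order_relaxed); /* "set_pmd_at()" */
		atomic_flag_clear_explicit(&ptl, memory_order_release);     /* "spin_unlock()" */
	}

Whether the x86 spin_lock implementation happens to provide stronger
ordering in practice is the point being discussed below.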

> But smp_wmb() is optimized away at build time by cpp so this can't
> possibly help if you're reproducing !SMP.
>

On X86 !SMP, this is still a barrier() which on gcc is

#define barrier() __asm__ __volatile__("": : :"memory")

so it's a compiler barrier. I'm not working on this at the
moment but when I get to it, I'll compare the object files and see
if there are relevant differences. Could be tomorrow before I get
the chance again.

> > page_add_new_anon_rmap(page, vma, haddr);
> > set_pmd_at(mm, haddr, pmd, entry);
> > prepare_pmd_huge_pte(pgtable, mm);
> > @@ -753,6 +755,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >
> > pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > pmd = pmd_mkold(pmd_wrprotect(pmd));
> > +
> > + /*
> > + * Write barrier to make sure the setup for the PMD is fully visible
> > + * before the set_pmd_at
> > + */
> > + smp_wmb();
> > +
> > set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > prepare_pmd_huge_pte(pgtable, dst_mm);
>
> This part seems superfluous to me, it's also noop for !SMP.

Other than being a compiler barrier.

> Only wmb()
> would stay. the pmd is perfectly fine to stay in a register, not even
> a compiler barrier is needed, even less a smp serialization.

There is an explanation in here somewhere, because as I write this
the test machine has survived 14 hours under continual stress without
the isolated counters going negative, with over 128 million pages
successfully migrated and a million failed migrations over the course
of direct compaction being called 80,000 times. It's possible it's a
coincidence but it's some coincidence!

--
Mel Gorman
SUSE Labs

2011-06-02 15:38:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 03:50:19PM +0100, Mel Gorman wrote:
> I thought spin lock acquisition was one-way where loads/stores
> preceeding the lock are allowed to leak into the protected region
> but not the other way around?

That's true for ia64, but not x86 AFAIK.

> So we have
>
> clear_huge_page()
> __SetPageUptodate(page);
> spin_lock(&mm->page_table_lock);
> ...
> set_pmd_at(mm, haddr, pmd, entry);
>
> This spinlock itself does not guarantee that writes from
> clear_huge_page are complete before that set_pmd_at().

It does on x86.

> Whether this is right or wrong, why is the same not true in
> collapse_huge_page()? There we are
>
> __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
> ....
> smp_wmb();
> spin_lock(&mm->page_table_lock);
> ...
> set_pmd_at(mm, address, pmd, _pmd);
>
> with the comment stressing that this is necessary.

So the first part of your patch is right, but it should be only a
theoretical improvement.

> > But smp_wmb() is optimized away at build time by cpp so this can't
> > possibly help if you're reproducing !SMP.
> >
>
> On X86 !SMP, this is still a barrier() which on gcc is
>
> #define barrier() __asm__ __volatile__("": : :"memory")
>
> so it's a compiler barrier. I'm not working on this at this at the
> moment but when I get to it, I'll compare the object files and see
> if there are relevant differences. Could be tomorrow before I get
> the chance again.

clear_huge_page called by do_huge_pmd_anonymous_page is an external
function (not static, so gcc can't make assumptions about it) and that is
fully equivalent to a barrier() after the function returns, so the
relevance of an smp_wmb on x86, SMP or !SMP, would be zero (unless
X86_OOSTORE is set, which I think it is not, and that would only apply to
SMP).
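
As a hedged sketch of that point (hypothetical function names, not the
kernel's): a call to a function whose body the compiler cannot see forces
it to assume arbitrary memory may be read or written, so it already acts
as a compiler barrier and stores cannot be reordered across the call at
compile time.

	/* Declared but not defined in this translation unit, so opaque to gcc. */
	extern void opaque_clear(void *addr, unsigned long size);

	void example(unsigned long *pmdp, void *page)
	{
		opaque_clear(page, 4096);	/* same compile-time effect as barrier() */
		*pmdp = 1;			/* cannot be hoisted above the call */
	}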

> > > page_add_new_anon_rmap(page, vma, haddr);
> > > set_pmd_at(mm, haddr, pmd, entry);
> > > prepare_pmd_huge_pte(pgtable, mm);
> > > @@ -753,6 +755,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > > pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > > pmd = pmd_mkold(pmd_wrprotect(pmd));
> > > +
> > > + /*
> > > + * Write barrier to make sure the setup for the PMD is fully visible
> > > + * before the set_pmd_at
> > > + */
> > > + smp_wmb();
> > > +
> > > set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > > prepare_pmd_huge_pte(pgtable, dst_mm);
> >
> > This part seems superfluous to me, it's also noop for !SMP.
>
> Other than being a compiler barrier.

Yes, but my point is this is ok to be cached in registers; the pmd
setup doesn't need to hit main memory to be safe, it's local.

pmdp_set_wrprotect is done with a clear_bit, and the dependency of the
code will require reading that after the clear_bit on !SMP. Not sure
how a barrier() above can possibly matter.

> > Only wmb()
> > would stay. the pmd is perfectly fine to stay in a register, not even
> > a compiler barrier is needed, even less a smp serialization.
>
> There is an explanation in here somewhere because as I write this,
> the test machine has survived 14 hours under continual stress without
> the isolated counters going negative with over 128 million pages
> successfully migrated and a million pages failed to migrate due to
> direct compaction being called 80,000 times. It's possible it's a
> co-incidence but it's some co-incidence!

No idea...

2011-06-02 18:23:36

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Tue, May 31, 2011 at 11:38:30PM +0900, Minchan Kim wrote:
> > Yes. You find a new BUG.
> > It seems to be related to this problem but it should be solved although
>
> typo : It doesn't seem to be.

This should fix it, but I doubt it matters for this problem.

===
Subject: mm: no page_count without a page pin

From: Andrea Arcangeli <[email protected]>

It's unsafe to run page_count during the physical pfn scan.

Signed-off-by: Andrea Arcangeli <[email protected]>
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index faa0a08..e41e78a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1124,8 +1124,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
nr_lumpy_dirty++;
scan++;
} else {
- /* the page is freed already. */
- if (!page_count(cursor_page))
+ /*
+ * We can't use page_count() as that
+ * requires compound_head and we don't
+ * have a pin on the page here. If a
+ * page is tail, we may or may not
+ * have isolated the head, so assume
+ * it's not free, it'd be tricky to
+ * track the head status without a
+ * page pin.
+ */
+ if (!PageTail(cursor_page) &&
+ !atomic_read(&cursor_page->_count))
continue;
break;
}

2011-06-02 20:22:10

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 08:23:02PM +0200, Andrea Arcangeli wrote:
> On Tue, May 31, 2011 at 11:38:30PM +0900, Minchan Kim wrote:
> > > Yes. You find a new BUG.
> > > It seems to be related to this problem but it should be solved although
> >
> > typo : It doesn't seem to be.
>
> This should fix it, but I doubt it matters for this problem.
>
> ===
> Subject: mm: no page_count without a page pin
>
> From: Andrea Arcangeli <[email protected]>
>
> It's unsafe to run page_count during the physical pfn scan.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index faa0a08..e41e78a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1124,8 +1124,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> nr_lumpy_dirty++;
> scan++;
> } else {
> - /* the page is freed already. */
> - if (!page_count(cursor_page))
> + /*
> + * We can't use page_count() as that
> + * requires compound_head and we don't
> + * have a pin on the page here. If a
> + * page is tail, we may or may not
> + * have isolated the head, so assume
> + * it's not free, it'd be tricky to

Isn't it rather aggressive?
I think the cursor page is likely to be PageTail rather than PageHead.
Could we handle it simply with the code below?

get_page(cursor_page)
/* The page is freed already */
if (1 == page_count(cursor_page)) {
put_page(cursor_page)
continue;
}
put_page(cursor_page);


> + * track the head status without a
> + * page pin.
> + */
> + if (!PageTail(cursor_page) &&
> + !atomic_read(&cursor_page->_count))
> continue;
> break;

> }

--
Kind regards
Minchan Kim

2011-06-02 20:59:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 05:21:56AM +0900, Minchan Kim wrote:
> On Thu, Jun 02, 2011 at 08:23:02PM +0200, Andrea Arcangeli wrote:
> > On Tue, May 31, 2011 at 11:38:30PM +0900, Minchan Kim wrote:
> > > > Yes. You find a new BUG.
> > > > It seems to be related to this problem but it should be solved although
> > >
> > > typo : It doesn't seem to be.
> >
> > This should fix it, but I doubt it matters for this problem.
> >
> > ===
> > Subject: mm: no page_count without a page pin
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > It's unsafe to run page_count during the physical pfn scan.
> >
> > Signed-off-by: Andrea Arcangeli <[email protected]>
> > ---
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index faa0a08..e41e78a 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1124,8 +1124,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > nr_lumpy_dirty++;
> > scan++;
> > } else {
> > - /* the page is freed already. */
> > - if (!page_count(cursor_page))
> > + /*
> > + * We can't use page_count() as that
> > + * requires compound_head and we don't
> > + * have a pin on the page here. If a
> > + * page is tail, we may or may not
> > + * have isolated the head, so assume
> > + * it's not free, it'd be tricky to
>
> Isn't it rather aggressive?
> I think cursor page is likely to be PageTail rather than PageHead.
> Could we handle it simply with below code?
>
> get_page(cursor_page)
> /* The page is freed already */
> if (1 == page_count(cursor_page)) {
> put_page(cursor_page)
> continue;
> }
> put_page(cursor_page);
>

Now that I look at the code more, it would hit the VM_BUG_ON in get_page if the page is really
freed. I think if we hold zone->lock to prevent racing with prep_new_page, it would be okay.
But it's rather overkill, so I will add my sign-off to your patch if we don't have a better idea
by tomorrow. :)

--
Kind regards
Minchan Kim

2011-06-02 21:41:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hello Minchan,

On Fri, Jun 03, 2011 at 05:21:56AM +0900, Minchan Kim wrote:
> Isn't it rather aggressive?
> I think cursor page is likely to be PageTail rather than PageHead.
> Could we handle it simply with below code?

It's not so likely: there is a small percentage of compound pages that
aren't THP compared to the rest, which is either regular pagecache,
regular anon, anon THP or regular shm. If it's THP, chances are we
isolated the head and it's useless to insist on more tail pages (at
least for a large page size like on x86). Plus we've got compaction, so
insisting and screwing up lru ordering isn't worth it; better to be
permissive and abort... in fact I wouldn't mind removing the
entire lumpy logic when COMPACTION_BUILD is true, but that alters the
trace too...

> get_page(cursor_page)
> /* The page is freed already */
> if (1 == page_count(cursor_page)) {
> put_page(cursor_page)
> continue;
> }
> put_page(cursor_page);

We can't call get_page on a tail page or we break split_huge_page;
only an isolated lru page can have its count boosted. If we take the
lru_lock and we check the page is on the lru, then we can isolate and
pin it safely.
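
For reference, a hedged, kernel-context sketch of the safe pattern being
described (hypothetical helper name, assuming the 2.6.39-era lru APIs;
error handling omitted): confirm the page is still on an lru under the
lru_lock before taking the reference.

	#include <linux/mm.h>
	#include <linux/mm_inline.h>

	static int pin_lru_page(struct zone *zone, struct page *page)
	{
		int pinned = 0;

		spin_lock_irq(&zone->lru_lock);
		if (PageLRU(page)) {
			/* The page is on the lru, so it is neither free nor a
			 * tail page; taking a reference here is safe. */
			get_page(page);
			ClearPageLRU(page);
			del_page_from_lru_list(zone, page, page_lru(page));
			pinned = 1;
		}
		spin_unlock_irq(&zone->lru_lock);
		return pinned;
	}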

2011-06-02 22:04:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 05:59:13AM +0900, Minchan Kim wrote:
> Now that I look code more, it would meet VM_BUG_ON of get_page if the page is really
> freed. I think if we hold zone->lock to prevent prep_new_page racing, it would be okay.

There would be problems with split_huge_page too: we can't even use
get_page_unless_zero unless it's an lru page and we hold the lru_lock,
and that's a hot lock too.

> But it's rather overkill so I will add my sign to your patch if we don't have better idea
> until tomorrow. :)

Things like compound_trans_head are made to protect against
split_huge_page, like in ksm, not exactly to get to the head page when
the page is being freed, so it's a little tricky. If we could get to
the head page safely starting from a tail page it'd solve some issues
for memory-failure too, which is currently using compound_head unsafely
as well, but at least that's running after a catastrophic hardware failure
so the safer the better, and the little race is unlikely to ever be an
issue for memory-failure (and it's the same issue for hugetlbfs and slub
order 3).

2011-06-02 22:23:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

Hi Andrea,

On Fri, Jun 3, 2011 at 6:40 AM, Andrea Arcangeli <[email protected]> wrote:
> Hello Minchan,
>
> On Fri, Jun 03, 2011 at 05:21:56AM +0900, Minchan Kim wrote:
>> Isn't it rather aggressive?
>> I think cursor page is likely to be PageTail rather than PageHead.
>> Could we handle it simply with below code?
>
> It's not so likely, there is small percentage of compound pages that
> aren't THP compared to the rest that is either regular pagecache or
> anon regular or anon THP or regular shm. If it's THP chances are we

I mean we have more tail pages than head pages, so I think we are likely to
meet tail pages. Of course, compared to all pages (page cache, anon and
so on), compound pages would be a very small percentage.

> isolated the head and it's useless to insist on more tail pages (at
> least for large page size like on x86). Plus we've compaction so

I can't understand your point. Could you elaborate it?

> insisting and screwing lru ordering isn't worth it, better to be
> permissive and abort... in fact I wouldn't dislike to remove the
> entire lumpy logic when COMPACTION_BUILD is true, but that alters the
> trace too...

AFAIK, that's the final destination, as compaction will not break lru
ordering once my patch (inorder-putback) is merged.

>
>> get_page(cursor_page)
>> /* The page is freed already */
>> if (1 == page_count(cursor_page)) {
>>       put_page(cursor_page)
>>       continue;
>> }
>> put_page(cursor_page);
>
> We can't call get_page on an tail page or we break split_huge_page,

Why don't we call get_page on a tail page if the tail page isn't free?
Maybe I need to investigate split_huge_page.

> only an isolated lru can be boosted, if we take the lru_lock and we
> check the page is in lru, then we can isolate and pin it safely.
>

Thanks.


--
Kind regards,
Minchan Kim

2011-06-02 22:33:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
> I mean we have more tail pages than head pages. So I think we are likely to
> meet tail pages. Of course, compared to all pages(page cache, anon and
> so on), compound pages would be very small percentage.

Yes that's my point, that being a small percentage it's no big deal to
break the loop early.

> > isolated the head and it's useless to insist on more tail pages (at
> > least for large page size like on x86). Plus we've compaction so
>
> I can't understand your point. Could you elaborate it?

What I meant is that if we already isolated the head page of the THP,
we don't need to try to free the tail pages, and breaking the loop
early will still give us a chance to free a whole 2m because we
isolated the head page (it'll involve some work and swapping, but if it
was a compound trans page we're ok to break the loop and we're not
making the logic any worse). Provided the PMD_SIZE is quite large, like
2/4m...

The only way this patch makes things worse is for a slub order 3 page in
the process of being freed. But tail pages aren't generally free anyway,
so I doubt this really makes any difference; plus the tail bit is getting
cleared as soon as the page reaches the buddy, so it's probably
unnoticeable, as this then makes a difference only during a race (plus
the tail page can't be isolated; only head pages can be part of lrus,
and only if they're THP).

> > insisting and screwing lru ordering isn't worth it, better to be
> > permissive and abort... in fact I wouldn't dislike to remove the
> > entire lumpy logic when COMPACTION_BUILD is true, but that alters the
> > trace too...
>
> AFAIK, it's final destination to go as compaction will not break lru
> ordering if my patch(inorder-putback) is merged.

Agreed. I like your patchset, sorry for not having reviewed it in
detail yet but there were other issues popping up in the last few
days.

> >> get_page(cursor_page)
> >> /* The page is freed already */
> >> if (1 == page_count(cursor_page)) {
> >>       put_page(cursor_page)
> >>       continue;
> >> }
> >> put_page(cursor_page);
> >
> > We can't call get_page on an tail page or we break split_huge_page,
>
> Why don't we call get_page on tail page if tail page isn't free?
> Maybe I need investigating split_huge_page.

Yes, it's split_huge_page: only gup is allowed to increase the tail
page count, because we're guaranteed that while gup_fast does it,
split_huge_page_refcount isn't running yet, because the pmd wasn't
set as splitting and the irqs were disabled (or we'd be holding the
page_table_lock for the gup slow version after checking again that the pmd
wasn't splitting, and so __split_huge_page_refcount will wait).

Thanks,
Andrea

2011-06-02 23:01:46

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
>> I mean we have more tail pages than head pages. So I think we are likely to
>> meet tail pages. Of course, compared to all pages(page cache, anon and
>> so on), compound pages would be very small percentage.
>
> Yes that's my point, that being a small percentage it's no big deal to
> break the loop early.

Indeed.

>
>> > isolated the head and it's useless to insist on more tail pages (at
>> > least for large page size like on x86). Plus we've compaction so
>>
>> I can't understand your point. Could you elaborate it?
>
> What I meant is that if we already isolated the head page of the THP,
> we don't need to try to free the tail pages and breaking the loop
> early, will still give us a chance to free a whole 2m because we
> isolated the head page (it'll involve some work and swapping but if it
> was a compoundtranspage we're ok to break the loop and we're not
> making the logic any worse). Provided the PMD_SIZE is quite large like
> 2/4m...

Do you want this? (it's almost pseudo-code)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7a4469b..9d7609f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1017,7 +1017,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
struct page *page;
unsigned long pfn;
- unsigned long end_pfn;
+ unsigned long start_pfn, end_pfn;
unsigned long page_pfn;
int zone_id;

@@ -1057,9 +1057,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
*/
zone_id = page_zone_id(page);
page_pfn = page_to_pfn(page);
- pfn = page_pfn & ~((1 << order) - 1);
+ start_pfn = pfn = page_pfn & ~((1 << order) - 1);
end_pfn = pfn + (1 << order);
- for (; pfn < end_pfn; pfn++) {
+ while (pfn < end_pfn) {
struct page *cursor_page;

/* The target page is in the block, ignore it. */
@@ -1086,17 +1086,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
break;

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ int isolated_pages;
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
- nr_taken += hpage_nr_pages(page);
+ isolated_pages = hpage_nr_pages(page);
+ nr_taken += isolated_pages;
+ /* if we isolated pages enough, let's break early */
+ if (nr_taken > end_pfn - start_pfn)
+ break;
+ pfn += isolated_pages;
nr_lumpy_taken++;
if (PageDirty(cursor_page))
nr_lumpy_dirty++;
scan++;
} else {
/* the page is freed already. */
- if (!page_count(cursor_page))
+ if (!page_count(cursor_page)) {
+ pfn++;
continue;
+ }
break;
}
}

>
> The only way this patch makes things worse is for slub order 3 in the
> process of being freed. But tail pages aren't generally free anyway so
> I doubt this really makes any difference plus the tail is getting
> cleared as soon as the page reaches the buddy so it's probably

Okay. Considering that PG_tail gets cleared as soon as a slub order-3 page
is freed, it would be a very rare case.

> unnoticeable as this then makes a difference only during a race (plus
> the tail page can't be isolated, only head page can be part of lrus
> and only if they're THP).
>
>> > insisting and screwing lru ordering isn't worth it, better to be
>> > permissive and abort... in fact I wouldn't dislike to remove the
>> > entire lumpy logic when COMPACTION_BUILD is true, but that alters the
>> > trace too...
>>
>> AFAIK, it's final destination to go as compaction will not break lru
>> ordering if my patch(inorder-putback) is merged.
>
> Agreed. I like your patchset, sorry for not having reviewed it in
> detail yet but there were other issues popping up in the last few
> days.

No problem. It's more urgent than mine. :)

>
>> >> get_page(cursor_page)
>> >> /* The page is freed already */
>> >> if (1 == page_count(cursor_page)) {
>> >>       put_page(cursor_page)
>> >>       continue;
>> >> }
>> >> put_page(cursor_page);
>> >
>> > We can't call get_page on an tail page or we break split_huge_page,
>>
>> Why don't we call get_page on tail page if tail page isn't free?
>> Maybe I need investigating split_huge_page.
>
> Yes it's split_huge_page, only gup is allowed to increase the tail
> page because we're guaranteed while gup_fast does it,
> split_huge_page_refcount isn't running yet, because the pmd wasn't
> set as splitting and the irqs were disabled (or we'd be holding the
> page_table_lock for gup slow version after checking again the pmd
> wasn't splitting and so __split_huge_page_refcount will wait).

Thanks. I will take some time to understand your point by reviewing
split_huge_page and this comment of yours.

You convinced me and made me think about things I hadn't thought of, which
are good points.
Thanks, Andrea.

>
> Thanks,
> Andrea
>



--
Kind regards,
Minchan Kim

2011-06-02 23:02:37

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 3, 2011 at 3:23 AM, Andrea Arcangeli <[email protected]> wrote:
> On Tue, May 31, 2011 at 11:38:30PM +0900, Minchan Kim wrote:
>> > Yes. You find a new BUG.
>> > It seems to be related to this problem but it should be solved although
>>
>>  typo : It doesn't seem to be.
>
> This should fix it, but I doubt it matters for this problem.
>
> ===
> Subject: mm: no page_count without a page pin
>
> From: Andrea Arcangeli <[email protected]>
>
> It's unsafe to run page_count during the physical pfn scan.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Nitpick:
I want to keep the "/* the page is freed already. */" comment.

--
Kind regards,
Minchan Kim

2011-06-03 02:09:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > There is an explanation in here somewhere because as I write this,
> > the test machine has survived 14 hours under continual stress without
> > the isolated counters going negative with over 128 million pages
> > successfully migrated and a million pages failed to migrate due to
> > direct compaction being called 80,000 times. It's possible it's a
> > co-incidence but it's some co-incidence!
>
> No idea...

I wasn't able to work on this for most of the day, but I was looking at it
more closely again this evening and I think I might have come up with
another theory for what could be causing this problem.

When THP is isolating pages, it accounts for the pages isolated against
the zone of course. If it backs out, it finds the pages from the PTEs.
On !SMP but PREEMPT, we may not have adequate protection against a new
page from a different zone being inserted into the PTE causing us to
decrement against the wrong zone. While the global counter is fine,
the per-zone counters look corrupted. You'd still think it was the
anon counter that got screwed rather than the file one if it really was
THP, unfortunately, so it's not the full picture. I'm going to start
a test monitoring both zoneinfo and vmstat to see if vmstat looks
fine while the per-zone counters that are negative are offset by a
positive count on the other zones that when added together become 0.
Hopefully it'll actually trigger overnight :/

--
Mel Gorman
SUSE Labs

2011-06-03 14:49:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 03:09:20AM +0100, Mel Gorman wrote:
> On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > > There is an explanation in here somewhere because as I write this,
> > > the test machine has survived 14 hours under continual stress without
> > > the isolated counters going negative with over 128 million pages
> > > successfully migrated and a million pages failed to migrate due to
> > > direct compaction being called 80,000 times. It's possible it's a
> > > co-incidence but it's some co-incidence!
> >
> > No idea...
>
> I wasn't able to work on this most of the day but was looking at this
> closer this evening again and I think I might have thought of another
> theory that could cause this problem.
>
> When THP is isolating pages, it accounts for the pages isolated against
> the zone of course. If it backs out, it finds the pages from the PTEs.
> On !SMP but PREEMPT, we may not have adequate protection against a new
> page from a different zone being inserted into the PTE causing us to
> decrement against the wrong zone. While the global counter is fine,
> the per-zone counters look corrupted. You'd still think it was the
> anon counter tht got screwed rather than the file one if it really was
> THP unfortunately so it's not the full picture. I'm going to start
> a test monitoring both zoneinfo and vmstat to see if vmstat looks
> fine while the per-zone counters that are negative are offset by a
> positive count on the other zones that when added together become 0.
> Hopefully it'll actually trigger overnight :/
>

Right idea about the wrong zone being accounted, but the wrong place. I
think the following patch should fix the problem:

==== CUT HERE ===
mm: compaction: Ensure that the compaction free scanner does not move to the next zone

Compaction works with two scanners, a migration and a free
scanner. When the scanners crossover, migration within the zone is
complete. The location of the scanner is recorded on each cycle to
avoid excessive scanning.

When a zone is small and mostly reserved, it's very easy for the
migration scanner to be close to the end of the zone. Then the following
situation can occur:

o migration scanner isolates some pages near the end of the zone
o free scanner starts at the end of the zone but finds that the
migration scanner is already there
o free scanner gets reinitialised for the next cycle as
cc->migrate_pfn + pageblock_nr_pages
moving the free scanner into the next zone
o migration scanner moves into the next zone but continues accounting
against the old zone

When this happens, NR_ISOLATED accounting goes haywire because some
of the accounting happens against the wrong zone. One zone's counter
remains positive while the other goes negative, even though the overall
global count is accurate. This was reported on X86-32 with !SMP because
!SMP allows the negative counters to be visible. The fact that it is
difficult to reproduce on X86-64 is probably just a coincidence, as
the bug should theoretically be possible there.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/compaction.c | 13 ++++++++++++-
1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a4337bc..ec1ed3b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -144,9 +144,20 @@ static void isolate_freepages(struct zone *zone,
int nr_freepages = cc->nr_freepages;
struct list_head *freelist = &cc->freepages;

+ /*
+ * Initialise the free scanner. The starting point is where we last
+ * scanned from (or the end of the zone if starting). The low point
+ * is the end of the pageblock the migration scanner is using.
+ */
pfn = cc->free_pfn;
low_pfn = cc->migrate_pfn + pageblock_nr_pages;
- high_pfn = low_pfn;
+
+ /*
+ * Take care that if the migration scanner is at the end of the zone
+ * that the free scanner does not accidentally move to the next zone
+ * in the next isolation cycle.
+ */
+ high_pfn = min(low_pfn, pfn);

/*
* Isolate free pages until enough are available to migrate the

2011-06-03 15:46:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> Right idea of the wrong zone being accounted for but wrong place. I
> think the following patch should fix the problem;

Looks good thanks.

During my debugging I also found this bug, which made NR_SHMEM underflow.

===
Subject: migrate: don't account swapcache as shmem

From: Andrea Arcangeli <[email protected]>

swapcache will reach the below code path in migrate_page_move_mapping,
and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
NR_SHMEM.

Signed-off-by: Andrea Arcangeli <[email protected]>
---

diff --git a/mm/migrate.c b/mm/migrate.c
index e4a5c91..2597a27 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -288,7 +288,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
*/
__dec_zone_page_state(page, NR_FILE_PAGES);
__inc_zone_page_state(newpage, NR_FILE_PAGES);
- if (PageSwapBacked(page)) {
+ if (mapping != &swapper_space && PageSwapBacked(page)) {
__dec_zone_page_state(page, NR_SHMEM);
__inc_zone_page_state(newpage, NR_SHMEM);
}

2011-06-03 17:37:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> Do you want this? (it's almost pseudo-code)

Yes, that's a good idea, so we at least take into account whether we isolated
something big, and it's pointless to insist on wasting CPU on the tail
pages and even trace a failure because of tail pages after it.

I introduced a __page_count to increase readability. It's still
hackish to work on subpages in vmscan.c, but at least I added a comment,
and until we serialize destroy_compound_page vs compound_head, I guess
there's no better way. I didn't attempt to add out-of-order
serialization similar to what exists for split_huge_page vs
compound_trans_head yet, as the page can be allocated or go away from
under us; in split_huge_page vs compound_trans_head it's simpler
because both callers are required to hold a pin on the page, so the
page can't be reallocated and destroyed under it.

===
Subject: mm: no page_count without a page pin

From: Andrea Arcangeli <[email protected]>

It's unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading page->first_page if
the compound page is being freed by another CPU. Also properly take into
account if we isolated a compound page during the scan and break the loop if
we've isolated enough. Introduce __page_count to clean up some open-coded
atomic_read(&page->_count) uses in common code.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/powerpc/mm/gup.c | 2 -
arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
arch/x86/mm/gup.c | 2 -
fs/nilfs2/page.c | 2 -
include/linux/mm.h | 13 +++++++---
mm/huge_memory.c | 4 +--
mm/internal.h | 2 -
mm/page_alloc.c | 6 ++--
mm/swap.c | 4 +--
mm/vmscan.c | 33 ++++++++++++++++++++-------
10 files changed, 46 insertions(+), 24 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
struct page *page;
unsigned long pfn;
- unsigned long end_pfn;
+ unsigned long start_pfn, end_pfn;
unsigned long page_pfn;
int zone_id;

@@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
*/
zone_id = page_zone_id(page);
page_pfn = page_to_pfn(page);
- pfn = page_pfn & ~((1 << order) - 1);
+ start_pfn = page_pfn & ~((1 << order) - 1);
- end_pfn = pfn + (1 << order);
+ end_pfn = start_pfn + (1 << order);
- for (; pfn < end_pfn; pfn++) {
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
struct page *cursor_page;

/* The target page is in the block, ignore it. */
@@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
break;

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ unsigned int isolated_pages;
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
- nr_taken += hpage_nr_pages(page);
- nr_lumpy_taken++;
+ isolated_pages = hpage_nr_pages(page);
+ nr_taken += isolated_pages;
+ nr_lumpy_taken += isolated_pages;
if (PageDirty(cursor_page))
- nr_lumpy_dirty++;
+ nr_lumpy_dirty += isolated_pages;
scan++;
+ pfn += isolated_pages-1;
+ VM_BUG_ON(!isolated_pages);
+ VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
} else {
- /* the page is freed already. */
- if (!page_count(cursor_page))
+ /*
+ * Check if the page is freed already.
+ *
+ * We can't use page_count() as that
+ * requires compound_head and we don't
+ * have a pin on the page here. If a
+ * page is tail, we may or may not
+ * have isolated the head, so assume
+ * it's not free, it'd be tricky to
+ * track the head status without a
+ * page pin.
+ */
+ if (!PageTail(cursor_page) &&
+ !__page_count(cursor_page))
continue;
break;
}
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -271,7 +271,7 @@ struct inode;
*/
static inline int put_page_testzero(struct page *page)
{
- VM_BUG_ON(atomic_read(&page->_count) == 0);
+ VM_BUG_ON(__page_count(page) == 0);
return atomic_dec_and_test(&page->_count);
}

@@ -355,9 +355,14 @@ static inline struct page *compound_head
return page;
}

+static inline int __page_count(struct page *page)
+{
+ return atomic_read(&page->_count);
+}
+
static inline int page_count(struct page *page)
{
- return atomic_read(&compound_head(page)->_count);
+ return __page_count(compound_head(page));
}

static inline void get_page(struct page *page)
@@ -370,7 +375,7 @@ static inline void get_page(struct page
* bugcheck only verifies that the page->_count isn't
* negative.
*/
- VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+ VM_BUG_ON(__page_count(page) < !PageTail(page));
atomic_inc(&page->_count);
/*
* Getting a tail page will elevate both the head and tail
@@ -382,7 +387,7 @@ static inline void get_page(struct page
* __split_huge_page_refcount can't run under
* get_page().
*/
- VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+ VM_BUG_ON(__page_count(page->first_page) <= 0);
atomic_inc(&page->first_page->_count);
}
}
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1203,10 +1203,10 @@ static void __split_huge_page_refcount(s
struct page *page_tail = page + i;

/* tail_page->_count cannot change */
- atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+ atomic_sub(__page_count(page_tail), &page->_count);
BUG_ON(page_count(page) <= 0);
atomic_add(page_mapcount(page) + 1, &page_tail->_count);
- BUG_ON(atomic_read(&page_tail->_count) <= 0);
+ BUG_ON(__page_count(page_tail) <= 0);

/* after clearing PageTail the gup refcount can be released */
smp_mb();
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,7 +22,7 @@ static inline void get_huge_page_tail(st
* __split_huge_page_refcount() cannot run
* from under us.
*/
- VM_BUG_ON(atomic_read(&page->_count) < 0);
+ VM_BUG_ON(__page_count(page) < 0);
atomic_inc(&page->_count);
}

--- a/arch/powerpc/platforms/512x/mpc512x_shared.c
+++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
@@ -200,7 +200,7 @@ static inline void mpc512x_free_bootmem(
{
__ClearPageReserved(page);
BUG_ON(PageTail(page));
- BUG_ON(atomic_read(&page->_count) > 1);
+ BUG_ON(__page_count(page) > 1);
atomic_set(&page->_count, 1);
__free_page(page);
totalram_pages++;
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,7 +114,7 @@ static inline void get_huge_page_tail(st
* __split_huge_page_refcount() cannot run
* from under us.
*/
- VM_BUG_ON(atomic_read(&page->_count) < 0);
+ VM_BUG_ON(__page_count(page) < 0);
atomic_inc(&page->_count);
}

--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -181,7 +181,7 @@ void nilfs_page_bug(struct page *page)

printk(KERN_CRIT "NILFS_PAGE_BUG(%p): cnt=%d index#=%llu flags=0x%lx "
"mapping=%p ino=%lu\n",
- page, atomic_read(&page->_count),
+ page, __page_count(page),
(unsigned long long)page->index, page->flags, m, ino);

if (page_has_buffers(page)) {
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -28,7 +28,7 @@ static inline void set_page_count(struct
static inline void set_page_refcounted(struct page *page)
{
VM_BUG_ON(PageTail(page));
- VM_BUG_ON(atomic_read(&page->_count));
+ VM_BUG_ON(__page_count(page));
set_page_count(page, 1);
}

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -568,7 +568,7 @@ static inline int free_pages_check(struc
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
+ (__page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
@@ -758,7 +758,7 @@ static inline int check_new_page(struct
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
+ (__page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
@@ -5739,7 +5739,7 @@ void dump_page(struct page *page)
{
printk(KERN_ALERT
"page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
- page, atomic_read(&page->_count), page_mapcount(page),
+ page, __page_count(page), page_mapcount(page),
page->mapping, page->index);
dump_page_flags(page->flags);
mem_cgroup_print_bad_page(page);
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -128,9 +128,9 @@ static void put_compound_page(struct pag
if (put_page_testzero(page_head))
VM_BUG_ON(1);
/* __split_huge_page_refcount will wait now */
- VM_BUG_ON(atomic_read(&page->_count) <= 0);
+ VM_BUG_ON(__page_count(page) <= 0);
atomic_dec(&page->_count);
- VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+ VM_BUG_ON(__page_count(page_head) <= 0);
compound_unlock_irqrestore(page_head, flags);
if (put_page_testzero(page_head)) {
if (PageHead(page_head))
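
For readers diffing the two postings: the resend in the next message shuffles
three things relative to this version, all visible by comparing the hunks.
__page_count() is moved above put_page_testzero() in include/linux/mm.h so its
first user can see it, end_pfn is computed from the new start_pfn rather than
the no-longer-initialised pfn, and __page_count() is passed cursor_page instead
of &cursor_page:

	/* this version */
	end_pfn = pfn + (1 << order);
	if (!PageTail(cursor_page) && !__page_count(&cursor_page))

	/* resend */
	end_pfn = start_pfn + (1 << order);
	if (!PageTail(cursor_page) && !__page_count(cursor_page))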

2011-06-03 18:08:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 07:37:07PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > Do you want this? (it's almost pseudo-code)
>
> Yes that's good idea so we at least take into account if we isolated
> something big, and it's pointless to insist wasting CPU on the tail
> pages and even trace a fail because of tail pages after it.
>
> I introduced a __page_count to increase readability. It's still
> hackish to work on subpages in vmscan.c but at least I added a comment
> and until we serialize destroy_compound_page vs compound_head, I guess
> there's no better way. I didn't attempt to add out of order
> serialization similar to what exists for split_huge_page vs
> compound_trans_head yet, as the page can be allocated or go away from
> under us, in split_huge_page vs compound_trans_head it's simpler
> because both callers are required to hold a pin on the page so the
> page can't go be reallocated and destroyed under it.

Sent too fast... had to shuffle a few things around... trying again.

===
Subject: mm: no page_count without a page pin

From: Andrea Arcangeli <[email protected]>

It's unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading page->first_page if
the compound page is being freed by another CPU. Also properly take into
account if we isolated a compound page during the scan and break the loop if
we've isolated enough. Introduce __page_count to clean up some atomic_read calls on
&page->_count in common code.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/powerpc/mm/gup.c | 2 -
arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
arch/x86/mm/gup.c | 2 -
fs/nilfs2/page.c | 2 -
include/linux/mm.h | 13 ++++++----
mm/huge_memory.c | 4 +--
mm/internal.h | 2 -
mm/page_alloc.c | 6 ++--
mm/swap.c | 4 +--
mm/vmscan.c | 35 ++++++++++++++++++++-------
10 files changed, 47 insertions(+), 25 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
struct page *page;
unsigned long pfn;
- unsigned long end_pfn;
+ unsigned long start_pfn, end_pfn;
unsigned long page_pfn;
int zone_id;

@@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
*/
zone_id = page_zone_id(page);
page_pfn = page_to_pfn(page);
- pfn = page_pfn & ~((1 << order) - 1);
- end_pfn = pfn + (1 << order);
- for (; pfn < end_pfn; pfn++) {
+ start_pfn = page_pfn & ~((1 << order) - 1);
+ end_pfn = start_pfn + (1 << order);
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
struct page *cursor_page;

/* The target page is in the block, ignore it. */
@@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
break;

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ unsigned int isolated_pages;
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
- nr_taken += hpage_nr_pages(page);
- nr_lumpy_taken++;
+ isolated_pages = hpage_nr_pages(page);
+ nr_taken += isolated_pages;
+ nr_lumpy_taken += isolated_pages;
if (PageDirty(cursor_page))
- nr_lumpy_dirty++;
+ nr_lumpy_dirty += isolated_pages;
scan++;
+ pfn += isolated_pages-1;
+ VM_BUG_ON(!isolated_pages);
+ VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
} else {
- /* the page is freed already. */
- if (!page_count(cursor_page))
+ /*
+ * Check if the page is freed already.
+ *
+ * We can't use page_count() as that
+ * requires compound_head and we don't
+ * have a pin on the page here. If a
+ * page is tail, we may or may not
+ * have isolated the head, so assume
+ * it's not free, it'd be tricky to
+ * track the head status without a
+ * page pin.
+ */
+ if (!PageTail(cursor_page) &&
+ !__page_count(cursor_page))
continue;
break;
}
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -266,12 +266,17 @@ struct inode;
* routine so they can be sure the page doesn't go away from under them.
*/

+static inline int __page_count(struct page *page)
+{
+ return atomic_read(&page->_count);
+}
+
/*
* Drop a ref, return true if the refcount fell to zero (the page has no users)
*/
static inline int put_page_testzero(struct page *page)
{
- VM_BUG_ON(atomic_read(&page->_count) == 0);
+ VM_BUG_ON(__page_count(page) == 0);
return atomic_dec_and_test(&page->_count);
}

@@ -357,7 +362,7 @@ static inline struct page *compound_head

static inline int page_count(struct page *page)
{
- return atomic_read(&compound_head(page)->_count);
+ return __page_count(compound_head(page));
}

static inline void get_page(struct page *page)
@@ -370,7 +375,7 @@ static inline void get_page(struct page
* bugcheck only verifies that the page->_count isn't
* negative.
*/
- VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+ VM_BUG_ON(__page_count(page) < !PageTail(page));
atomic_inc(&page->_count);
/*
* Getting a tail page will elevate both the head and tail
@@ -382,7 +387,7 @@ static inline void get_page(struct page
* __split_huge_page_refcount can't run under
* get_page().
*/
- VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+ VM_BUG_ON(__page_count(page->first_page) <= 0);
atomic_inc(&page->first_page->_count);
}
}
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1203,10 +1203,10 @@ static void __split_huge_page_refcount(s
struct page *page_tail = page + i;

/* tail_page->_count cannot change */
- atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+ atomic_sub(__page_count(page_tail), &page->_count);
BUG_ON(page_count(page) <= 0);
atomic_add(page_mapcount(page) + 1, &page_tail->_count);
- BUG_ON(atomic_read(&page_tail->_count) <= 0);
+ BUG_ON(__page_count(page_tail) <= 0);

/* after clearing PageTail the gup refcount can be released */
smp_mb();
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,7 +22,7 @@ static inline void get_huge_page_tail(st
* __split_huge_page_refcount() cannot run
* from under us.
*/
- VM_BUG_ON(atomic_read(&page->_count) < 0);
+ VM_BUG_ON(__page_count(page) < 0);
atomic_inc(&page->_count);
}

--- a/arch/powerpc/platforms/512x/mpc512x_shared.c
+++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
@@ -200,7 +200,7 @@ static inline void mpc512x_free_bootmem(
{
__ClearPageReserved(page);
BUG_ON(PageTail(page));
- BUG_ON(atomic_read(&page->_count) > 1);
+ BUG_ON(__page_count(page) > 1);
atomic_set(&page->_count, 1);
__free_page(page);
totalram_pages++;
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,7 +114,7 @@ static inline void get_huge_page_tail(st
* __split_huge_page_refcount() cannot run
* from under us.
*/
- VM_BUG_ON(atomic_read(&page->_count) < 0);
+ VM_BUG_ON(__page_count(page) < 0);
atomic_inc(&page->_count);
}

--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -181,7 +181,7 @@ void nilfs_page_bug(struct page *page)

printk(KERN_CRIT "NILFS_PAGE_BUG(%p): cnt=%d index#=%llu flags=0x%lx "
"mapping=%p ino=%lu\n",
- page, atomic_read(&page->_count),
+ page, __page_count(page),
(unsigned long long)page->index, page->flags, m, ino);

if (page_has_buffers(page)) {
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -28,7 +28,7 @@ static inline void set_page_count(struct
static inline void set_page_refcounted(struct page *page)
{
VM_BUG_ON(PageTail(page));
- VM_BUG_ON(atomic_read(&page->_count));
+ VM_BUG_ON(__page_count(page));
set_page_count(page, 1);
}

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -568,7 +568,7 @@ static inline int free_pages_check(struc
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
+ (__page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
@@ -758,7 +758,7 @@ static inline int check_new_page(struct
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (atomic_read(&page->_count) != 0) |
+ (__page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
@@ -5739,7 +5739,7 @@ void dump_page(struct page *page)
{
printk(KERN_ALERT
"page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
- page, atomic_read(&page->_count), page_mapcount(page),
+ page, __page_count(page), page_mapcount(page),
page->mapping, page->index);
dump_page_flags(page->flags);
mem_cgroup_print_bad_page(page);
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -128,9 +128,9 @@ static void put_compound_page(struct pag
if (put_page_testzero(page_head))
VM_BUG_ON(1);
/* __split_huge_page_refcount will wait now */
- VM_BUG_ON(atomic_read(&page->_count) <= 0);
+ VM_BUG_ON(__page_count(page) <= 0);
atomic_dec(&page->_count);
- VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+ VM_BUG_ON(__page_count(page_head) <= 0);
compound_unlock_irqrestore(page_head, flags);
if (put_page_testzero(page_head)) {
if (PageHead(page_head))
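
To make the changelog's point concrete, here is a minimal sketch in kernel
style (condensed from the vmscan.c hunk above; the helper name
cursor_page_is_free() is made up) of why the raw read plus PageTail check
replaces page_count() here:

/*
 * cursor_page comes from pfn_to_page() during the lumpy pfn scan, so we
 * hold no reference pin on it.
 */
static bool cursor_page_is_free(struct page *cursor_page)
{
	/*
	 * Unsafe: page_count() goes through compound_head(), which may
	 * follow a dangling cursor_page->first_page pointer if another
	 * CPU is freeing the compound page at this very moment.
	 *
	 *	if (!page_count(cursor_page))
	 *		return true;
	 */

	/*
	 * Safe: read the raw refcount of this exact struct page and never
	 * dereference a possible head page.  Tail pages are conservatively
	 * treated as "not free".
	 */
	return !PageTail(cursor_page) && !__page_count(cursor_page);
}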

2011-06-04 06:59:05

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 03:09:20AM +0100, Mel Gorman wrote:
> > On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > > > There is an explanation in here somewhere because as I write this,
> > > > the test machine has survived 14 hours under continual stress without
> > > > the isolated counters going negative with over 128 million pages
> > > > successfully migrated and a million pages failed to migrate due to
> > > > direct compaction being called 80,000 times. It's possible it's a
> > > > co-incidence but it's some co-incidence!
> > >
> > > No idea...
> >
> > I wasn't able to work on this most of the day but was looking at this
> > closer this evening again and I think I might have thought of another
> > theory that could cause this problem.
> >
> > When THP is isolating pages, it accounts for the pages isolated against
> > the zone of course. If it backs out, it finds the pages from the PTEs.
> > On !SMP but PREEMPT, we may not have adequate protection against a new
> > page from a different zone being inserted into the PTE causing us to
> > decrement against the wrong zone. While the global counter is fine,
> > the per-zone counters look corrupted. You'd still think it was the
> > anon counter tht got screwed rather than the file one if it really was
> > THP unfortunately so it's not the full picture. I'm going to start
> > a test monitoring both zoneinfo and vmstat to see if vmstat looks
> > fine while the per-zone counters that are negative are offset by a
> > positive count on the other zones that when added together become 0.
> > Hopefully it'll actually trigger overnight :/
> >
>
> Right idea of the wrong zone being accounted for but wrong place. I
> think the following patch should fix the problem;
>
> ==== CUT HERE ===
> mm: compaction: Ensure that the compaction free scanner does not move to the next zone
>
> Compaction works with two scanners, a migration and a free
> scanner. When the scanners crossover, migration within the zone is
> complete. The location of the scanner is recorded on each cycle to
> avoid excesive scanning.
>
> When a zone is small and mostly reserved, it's very easy for the
> migration scanner to be close to the end of the zone. Then the following
> situation can occurs
>
> o migration scanner isolates some pages near the end of the zone
> o free scanner starts at the end of the zone but finds that the
> migration scanner is already there
> o free scanner gets reinitialised for the next cycle as
> cc->migrate_pfn + pageblock_nr_pages
> moving the free scanner into the next zone
> o migration scanner moves into the next zone but continues accounting
> against the old zone
>
> When this happens, NR_ISOLATED accounting goes haywire because some
> of the accounting happens against the wrong zone. One zones counter
> remains positive while the other goes negative even though the overall
> global count is accurate. This was reported on X86-32 with !SMP because
> !SMP allows the negative counters to be visible. The fact that it is
> difficult to reproduce on X86-64 is probably just a co-incidence as

I guess it's related to zone sizes.
X86-64 has small DMA and large DMA32 zones for fallback of NORMAL while
x86 has just a small DMA(16M) zone.

I think DMA zone in x86 is easily full of non-LRU or non-movable pages.
So isolate_migratepages continues to scan for pages which are migratable
and then it reaches near the end of the zone.


> the bug should theoritically be possible there.

Finally, you found it. Congratulations!
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

While we were debugging this problem, we found a few bugs and points to enhance,
and submitted patches. It was a very good chance to improve the Linux VM.

Thanks, Mel.

--
Kind regards
Minchan Kim
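
To put numbers on the scanner-crossing scenario from the quoted changelog,
here is a small standalone sketch. The values are made up, and the clamp at
the end only illustrates the kind of bound the fix needs; it is not the exact
code of Mel's patch:

static void scanner_crossing_example(void)
{
	/* made-up layout: a small zone spanning pfns [16, 4096) */
	unsigned long zone_end_pfn = 4096;
	unsigned long block = 512;	/* pageblock_nr_pages with 2MB blocks */

	/* the migration scanner has nearly reached the end of the zone */
	unsigned long migrate_pfn = 4000;

	/* restart point recorded for the free scanner's next cycle */
	unsigned long free_pfn = migrate_pfn + block;	/* 4512 */

	/*
	 * 4512 is beyond zone_end_pfn: the free scanner would start in the
	 * next zone, and the migration scanner chasing it keeps accounting
	 * NR_ISOLATED against the old zone, so one zone's counter drifts
	 * positive while the other goes negative.  Any fix has to keep the
	 * scanners inside the zone, for instance:
	 */
	if (free_pfn >= zone_end_pfn)
		free_pfn = zone_end_pfn - block;
}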

2011-06-04 07:25:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 05:45:54PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > Right idea of the wrong zone being accounted for but wrong place. I
> > think the following patch should fix the problem;
>
> Looks good thanks.
>
> I also found this bug during my debugging that made NR_SHMEM underflow.
>
> ===
> Subject: migrate: don't account swapcache as shmem
>
> From: Andrea Arcangeli <[email protected]>
>
> swapcache will reach the below code path in migrate_page_move_mapping,
> and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
> NR_SHMEM.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Nice catch!
--
Kind regards
Minchan Kim

2011-06-04 07:59:36

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 08:07:30PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 07:37:07PM +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > > Do you want this? (it's almost pseudo-code)
> >
> > Yes that's good idea so we at least take into account if we isolated
> > something big, and it's pointless to insist wasting CPU on the tail
> > pages and even trace a fail because of tail pages after it.
> >
> > I introduced a __page_count to increase readability. It's still
> > hackish to work on subpages in vmscan.c but at least I added a comment
> > and until we serialize destroy_compound_page vs compound_head, I guess
> > there's no better way. I didn't attempt to add out of order
> > serialization similar to what exists for split_huge_page vs
> > compound_trans_head yet, as the page can be allocated or go away from
> > under us, in split_huge_page vs compound_trans_head it's simpler
> > because both callers are required to hold a pin on the page so the
> > page can't go be reallocated and destroyed under it.
>
> Sent too fast... had to shuffle a few things around... trying again.
>
> ===
> Subject: mm: no page_count without a page pin
>
> From: Andrea Arcangeli <[email protected]>
>
> It's unsafe to run page_count during the physical pfn scan because
> compound_head could trip on a dangling pointer when reading page->first_page if
> the compound page is being freed by another CPU. Also properly take into
> account if we isolated a compound page during the scan and break the loop if
> we've isolated enough. Introduce __page_count to clean up some atomic_read calls on
> &page->_count in common code.
>

The patch looks good to me.
I have a question; please see the bottom of this mail.

In addition, I think this patch should be divided into 4 patches:

1. fix the accounting of nr_lumpy_taken and nr_lumpy_dirty for hpages
2. break out of isolate_lru_pages early once enough pages have been isolated
3. introduce __page_count and clean up
4. fix the page_count usage on subpages in vmscan.c

> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/powerpc/mm/gup.c | 2 -
> arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
> arch/x86/mm/gup.c | 2 -
> fs/nilfs2/page.c | 2 -
> include/linux/mm.h | 13 ++++++----
> mm/huge_memory.c | 4 +--
> mm/internal.h | 2 -
> mm/page_alloc.c | 6 ++--
> mm/swap.c | 4 +--
> mm/vmscan.c | 35 ++++++++++++++++++++-------
> 10 files changed, 47 insertions(+), 25 deletions(-)
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> struct page *page;
> unsigned long pfn;
> - unsigned long end_pfn;
> + unsigned long start_pfn, end_pfn;
> unsigned long page_pfn;
> int zone_id;
>
> @@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
> */
> zone_id = page_zone_id(page);
> page_pfn = page_to_pfn(page);
> - pfn = page_pfn & ~((1 << order) - 1);
> - end_pfn = pfn + (1 << order);
> - for (; pfn < end_pfn; pfn++) {
> + start_pfn = page_pfn & ~((1 << order) - 1);
> + end_pfn = start_pfn + (1 << order);
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> struct page *cursor_page;
>
> /* The target page is in the block, ignore it. */
> @@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
> break;
>
> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> + unsigned int isolated_pages;
> list_move(&cursor_page->lru, dst);
> mem_cgroup_del_lru(cursor_page);
> - nr_taken += hpage_nr_pages(page);
> - nr_lumpy_taken++;
> + isolated_pages = hpage_nr_pages(page);
> + nr_taken += isolated_pages;
> + nr_lumpy_taken += isolated_pages;
> if (PageDirty(cursor_page))
> - nr_lumpy_dirty++;
> + nr_lumpy_dirty += isolated_pages;
> scan++;
> + pfn += isolated_pages-1;
> + VM_BUG_ON(!isolated_pages);
> + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);

What's the point of these VM_BUG_ONs?
Could you explain what you expect them to catch?

--
Kind regards
Minchan Kim

2011-06-06 10:16:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> > On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
> >> I mean we have more tail pages than head pages. So I think we are likely to
> >> meet tail pages. Of course, compared to all pages(page cache, anon and
> >> so on), compound pages would be very small percentage.
> >
> > Yes that's my point, that being a small percentage it's no big deal to
> > break the loop early.
>
> Indeed.
>
> >
> >> > isolated the head and it's useless to insist on more tail pages (at
> >> > least for large page size like on x86). Plus we've compaction so
> >>
> >> I can't understand your point. Could you elaborate it?
> >
> > What I meant is that if we already isolated the head page of the THP,
> > we don't need to try to free the tail pages and breaking the loop
> > early, will still give us a chance to free a whole 2m because we
> > isolated the head page (it'll involve some work and swapping but if it
> > was a compoundtranspage we're ok to break the loop and we're not
> > making the logic any worse). Provided the PMD_SIZE is quite large like
> > 2/4m...
>
> Do you want this? (it's almost pseudo-code)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7a4469b..9d7609f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1017,7 +1017,7 @@ static unsigned long isolate_lru_pages(unsigned
> long nr_to_scan,
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> struct page *page;
> unsigned long pfn;
> - unsigned long end_pfn;
> + unsigned long start_pfn, end_pfn;
> unsigned long page_pfn;
> int zone_id;
>
> @@ -1057,9 +1057,9 @@ static unsigned long isolate_lru_pages(unsigned
> long nr_to_scan,
> */
> zone_id = page_zone_id(page);
> page_pfn = page_to_pfn(page);
> - pfn = page_pfn & ~((1 << order) - 1);
> + start_pfn = pfn = page_pfn & ~((1 << order) - 1);
> end_pfn = pfn + (1 << order);
> - for (; pfn < end_pfn; pfn++) {
> + while (pfn < end_pfn) {
> struct page *cursor_page;
>
> /* The target page is in the block, ignore it. */
> @@ -1086,17 +1086,25 @@ static unsigned long
> isolate_lru_pages(unsigned long nr_to_scan,
> break;
>
> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> + int isolated_pages;
> list_move(&cursor_page->lru, dst);
> mem_cgroup_del_lru(cursor_page);
> - nr_taken += hpage_nr_pages(page);
> + isolated_pages = hpage_nr_pages(page);
> + nr_taken += isolated_pages;
> + /* if we isolated pages enough, let's
> break early */
> + if (nr_taken > end_pfn - start_pfn)
> + break;
> + pfn += isolated_pages;

I think this condition is somewhat unlikely. We are scanning within
aligned blocks in this linear scanner. Huge pages are always aligned
so the only situation where we'll encounter a hugepage in the middle
of this linear scan is when the requested order is larger than a huge
page. This is exceptionally rare.

Did I miss something?

> nr_lumpy_taken++;
> if (PageDirty(cursor_page))
> nr_lumpy_dirty++;
> scan++;
> } else {
> /* the page is freed already. */
> - if (!page_count(cursor_page))
> + if (!page_count(cursor_page)) {
> + pfn++;
> continue;
> + }
> break;
> }
> }
>
> >
> > The only way this patch makes things worse is for slub order 3 in the
> > process of being freed. But tail pages aren't generally free anyway so
> > I doubt this really makes any difference plus the tail is getting
> > cleared as soon as the page reaches the buddy so it's probably
>
> Okay. Considering getting clear PG_tail as soon as slub order 3 is
> freed, it would be very rare case.
>
> > unnoticeable as this then makes a difference only during a race (plus
> > the tail page can't be isolated, only head page can be part of lrus
> > and only if they're THP).
> >
> >> > insisting and screwing lru ordering isn't worth it, better to be
> >> > permissive and abort... in fact I wouldn't dislike to remove the
> >> > entire lumpy logic when COMPACTION_BUILD is true, but that alters the
> >> > trace too...
> >>
> >> AFAIK, it's final destination to go as compaction will not break lru
> >> ordering if my patch(inorder-putback) is merged.
> >
> > Agreed. I like your patchset, sorry for not having reviewed it in
> > detail yet but there were other issues popping up in the last few
> > days.
>
> No problem. it's urgent than mine. :)
>

I'm going to take the opportunity to apologise for not reviewing that
series yet. I've been kept too busy with other bugs to set aside the
few hours I need to review the series. I'm hoping to get to it this
week if everything goes well.

--
Mel Gorman
SUSE Labs
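
Mel's alignment argument above can be checked with a bit of pfn arithmetic.
A standalone sketch with made-up numbers, assuming 4K pages and 2MB THP
(i.e. HPAGE_PMD_ORDER == 9):

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* order of a 2MB THP in 4K pages */

int main(void)
{
	unsigned long pfn = 1234567;	/* some pfn handed to the scanner */
	unsigned int order = 5;		/* lumpy reclaim request, <= 9    */

	unsigned long block_start = pfn & ~((1UL << order) - 1);
	unsigned long thp_start   = pfn & ~((1UL << HPAGE_PMD_ORDER) - 1);

	/*
	 * For order <= HPAGE_PMD_ORDER the scan block is no larger than a
	 * THP and both are power-of-two aligned, so the block cannot
	 * straddle a THP boundary: either it contains no THP head at all,
	 * or the head sits exactly at block_start (the target page itself).
	 * Only a request with order > HPAGE_PMD_ORDER can meet a huge page
	 * part-way through the linear scan, which is exceptionally rare.
	 */
	printf("scan block starts at %lu, enclosing THP starts at %lu\n",
	       block_start, thp_start);
	return 0;
}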

2011-06-06 10:26:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:15:57AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> > > On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
> > >> I mean we have more tail pages than head pages. So I think we are likely to
> > >> meet tail pages. Of course, compared to all pages(page cache, anon and
> > >> so on), compound pages would be very small percentage.
> > >
> > > Yes that's my point, that being a small percentage it's no big deal to
> > > break the loop early.
> >
> > Indeed.
> >
> > >
> > >> > isolated the head and it's useless to insist on more tail pages (at
> > >> > least for large page size like on x86). Plus we've compaction so
> > >>
> > >> I can't understand your point. Could you elaborate it?
> > >
> > > What I meant is that if we already isolated the head page of the THP,
> > > we don't need to try to free the tail pages and breaking the loop
> > > early, will still give us a chance to free a whole 2m because we
> > > isolated the head page (it'll involve some work and swapping but if it
> > > was a compoundtranspage we're ok to break the loop and we're not
> > > making the logic any worse). Provided the PMD_SIZE is quite large like
> > > 2/4m...
> >
> > Do you want this? (it's almost pseudo-code)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7a4469b..9d7609f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1017,7 +1017,7 @@ static unsigned long isolate_lru_pages(unsigned
> > long nr_to_scan,
> > for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> > struct page *page;
> > unsigned long pfn;
> > - unsigned long end_pfn;
> > + unsigned long start_pfn, end_pfn;
> > unsigned long page_pfn;
> > int zone_id;
> >
> > @@ -1057,9 +1057,9 @@ static unsigned long isolate_lru_pages(unsigned
> > long nr_to_scan,
> > */
> > zone_id = page_zone_id(page);
> > page_pfn = page_to_pfn(page);
> > - pfn = page_pfn & ~((1 << order) - 1);
> > + start_pfn = pfn = page_pfn & ~((1 << order) - 1);
> > end_pfn = pfn + (1 << order);
> > - for (; pfn < end_pfn; pfn++) {
> > + while (pfn < end_pfn) {
> > struct page *cursor_page;
> >
> > /* The target page is in the block, ignore it. */
> > @@ -1086,17 +1086,25 @@ static unsigned long
> > isolate_lru_pages(unsigned long nr_to_scan,
> > break;
> >
> > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > + int isolated_pages;
> > list_move(&cursor_page->lru, dst);
> > mem_cgroup_del_lru(cursor_page);
> > - nr_taken += hpage_nr_pages(page);
> > + isolated_pages = hpage_nr_pages(page);
> > + nr_taken += isolated_pages;
> > + /* if we isolated pages enough, let's
> > break early */
> > + if (nr_taken > end_pfn - start_pfn)
> > + break;
> > + pfn += isolated_pages;
>
> I think this condition is somewhat unlikely. We are scanning within
> aligned blocks in this linear scanner. Huge pages are always aligned
> so the only situation where we'll encounter a hugepage in the middle
> of this linear scan is when the requested order is larger than a huge
> page. This is exceptionally rare.
>
> Did I miss something?
>

I forgot to mention the "pfn += isolated_pages" but I'm also worried
about it. It's a performance gain only if we are encountering huge pages
during the linear scan, which I think is rare. I also think this is
now skipping pages in the linear scan because we now have

for (; pfn < end_pfn; pfn++) {
	if (isolate page) {
		isolated_pages = hpage_nr_pages(page);
		pfn += isolated_pages;
	}
}

hpage_nr_pages is returning 1 for order-0 LRU pages so now the loop is
effectively

for (; pfn < end_pfn; pfn += 2)

Did you mean

pfn += isolated_pages - 1;

?

--
Mel Gorman
SUSE Labs
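
The double-stepping Mel describes is easy to see in a stripped-down version
of the loop (a sketch only; isolated_ok stands in for the __isolate_lru_page()
check and the real code does the isolation work and accounting in between):

/* order-0 case: hpage_nr_pages() returns 1 */
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	if (isolated_ok) {
		unsigned int isolated_pages = hpage_nr_pages(page);

		/* pfn += isolated_pages;      advances by 2 per order-0
		 *                             page once the for loop adds
		 *                             its own +1                  */
		pfn += isolated_pages - 1;  /* the for loop supplies the +1 */
	}
}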

2011-06-06 10:32:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 08:07:30PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 07:37:07PM +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > > Do you want this? (it's almost pseudo-code)
> >
> > Yes that's good idea so we at least take into account if we isolated
> > something big, and it's pointless to insist wasting CPU on the tail
> > pages and even trace a fail because of tail pages after it.
> >
> > I introduced a __page_count to increase readability. It's still
> > hackish to work on subpages in vmscan.c but at least I added a comment
> > and until we serialize destroy_compound_page vs compound_head, I guess
> > there's no better way. I didn't attempt to add out of order
> > serialization similar to what exists for split_huge_page vs
> > compound_trans_head yet, as the page can be allocated or go away from
> > under us, in split_huge_page vs compound_trans_head it's simpler
> > because both callers are required to hold a pin on the page so the
> > page can't go be reallocated and destroyed under it.
>
> Sent too fast... had to shuffle a few things around... trying again.
>
> ===
> Subject: mm: no page_count without a page pin
>
> From: Andrea Arcangeli <[email protected]>
>
> It's unsafe to run page_count during the physical pfn scan because
> compound_head could trip on a dangling pointer when reading page->first_page if
> the compound page is being freed by another CPU. Also properly take into
> account if we isolated a compound page during the scan and break the loop if
> we've isolated enough. Introduce __page_count to clean up some atomic_read calls on
> &page->_count in common code.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>

This patch is pulling in stuff from Minchan. Minimally his patch should
be kept separate to preserve history or his Signed-off should be
included on this patch.

> ---
> arch/powerpc/mm/gup.c | 2 -
> arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
> arch/x86/mm/gup.c | 2 -
> fs/nilfs2/page.c | 2 -
> include/linux/mm.h | 13 ++++++----
> mm/huge_memory.c | 4 +--
> mm/internal.h | 2 -
> mm/page_alloc.c | 6 ++--
> mm/swap.c | 4 +--
> mm/vmscan.c | 35 ++++++++++++++++++++-------
> 10 files changed, 47 insertions(+), 25 deletions(-)
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
> for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> struct page *page;
> unsigned long pfn;
> - unsigned long end_pfn;
> + unsigned long start_pfn, end_pfn;
> unsigned long page_pfn;
> int zone_id;
>
> @@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
> */
> zone_id = page_zone_id(page);
> page_pfn = page_to_pfn(page);
> - pfn = page_pfn & ~((1 << order) - 1);
> - end_pfn = pfn + (1 << order);
> - for (; pfn < end_pfn; pfn++) {
> + start_pfn = page_pfn & ~((1 << order) - 1);
> + end_pfn = start_pfn + (1 << order);
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> struct page *cursor_page;
>
> /* The target page is in the block, ignore it. */
> @@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
> break;
>
> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> + unsigned int isolated_pages;
> list_move(&cursor_page->lru, dst);
> mem_cgroup_del_lru(cursor_page);
> - nr_taken += hpage_nr_pages(page);
> - nr_lumpy_taken++;
> + isolated_pages = hpage_nr_pages(page);
> + nr_taken += isolated_pages;
> + nr_lumpy_taken += isolated_pages;
> if (PageDirty(cursor_page))
> - nr_lumpy_dirty++;
> + nr_lumpy_dirty += isolated_pages;
> scan++;
> + pfn += isolated_pages-1;

Ah, here is the isolated_pages - 1 which is necessary. Should have read
the whole thread before responding to anything :).

I still think this optimisation is rare and only applies if we are
encountering huge pages during the linear scan. How often are we doing
that really?

> + VM_BUG_ON(!isolated_pages);

This BUG_ON is overkill. hpage_nr_pages would have to return 0.

> + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);

This would require order > MAX_ORDER_NR_PAGES to be passed into
isolate_lru_pages or for a huge page to be unaligned to a power of
two. The former is very unlikely and the latter is not supported by
any CPU.

> } else {
> - /* the page is freed already. */
> - if (!page_count(cursor_page))
> + /*
> + * Check if the page is freed already.
> + *
> + * We can't use page_count() as that
> + * requires compound_head and we don't
> + * have a pin on the page here. If a
> + * page is tail, we may or may not
> + * have isolated the head, so assume
> + * it's not free, it'd be tricky to
> + * track the head status without a
> + * page pin.
> + */
> + if (!PageTail(cursor_page) &&
> + !__page_count(cursor_page))
> continue;
> break;

Ack to this part.

I'm not keen on __page_count() as __ normally means the "unlocked"
version of a function although I realise that rule isn't universal
either. I can't think of a better name though.

> }
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -266,12 +266,17 @@ struct inode;
> * routine so they can be sure the page doesn't go away from under them.
> */
>
> +static inline int __page_count(struct page *page)
> +{
> + return atomic_read(&page->_count);
> +}
> +
> /*
> * Drop a ref, return true if the refcount fell to zero (the page has no users)
> */
> static inline int put_page_testzero(struct page *page)
> {
> - VM_BUG_ON(atomic_read(&page->_count) == 0);
> + VM_BUG_ON(__page_count(page) == 0);
> return atomic_dec_and_test(&page->_count);
> }
>
> @@ -357,7 +362,7 @@ static inline struct page *compound_head
>
> static inline int page_count(struct page *page)
> {
> - return atomic_read(&compound_head(page)->_count);
> + return __page_count(compound_head(page));
> }
>
> static inline void get_page(struct page *page)
> @@ -370,7 +375,7 @@ static inline void get_page(struct page
> * bugcheck only verifies that the page->_count isn't
> * negative.
> */
> - VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
> + VM_BUG_ON(__page_count(page) < !PageTail(page));
> atomic_inc(&page->_count);
> /*
> * Getting a tail page will elevate both the head and tail
> @@ -382,7 +387,7 @@ static inline void get_page(struct page
> * __split_huge_page_refcount can't run under
> * get_page().
> */
> - VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> + VM_BUG_ON(__page_count(page->first_page) <= 0);
> atomic_inc(&page->first_page->_count);
> }
> }
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1203,10 +1203,10 @@ static void __split_huge_page_refcount(s
> struct page *page_tail = page + i;
>
> /* tail_page->_count cannot change */
> - atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> + atomic_sub(__page_count(page_tail), &page->_count);
> BUG_ON(page_count(page) <= 0);
> atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> - BUG_ON(atomic_read(&page_tail->_count) <= 0);
> + BUG_ON(__page_count(page_tail) <= 0);
>
> /* after clearing PageTail the gup refcount can be released */
> smp_mb();
> --- a/arch/powerpc/mm/gup.c
> +++ b/arch/powerpc/mm/gup.c
> @@ -22,7 +22,7 @@ static inline void get_huge_page_tail(st
> * __split_huge_page_refcount() cannot run
> * from under us.
> */
> - VM_BUG_ON(atomic_read(&page->_count) < 0);
> + VM_BUG_ON(__page_count(page) < 0);
> atomic_inc(&page->_count);
> }
>
> --- a/arch/powerpc/platforms/512x/mpc512x_shared.c
> +++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
> @@ -200,7 +200,7 @@ static inline void mpc512x_free_bootmem(
> {
> __ClearPageReserved(page);
> BUG_ON(PageTail(page));
> - BUG_ON(atomic_read(&page->_count) > 1);
> + BUG_ON(__page_count(page) > 1);
> atomic_set(&page->_count, 1);
> __free_page(page);
> totalram_pages++;
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -114,7 +114,7 @@ static inline void get_huge_page_tail(st
> * __split_huge_page_refcount() cannot run
> * from under us.
> */
> - VM_BUG_ON(atomic_read(&page->_count) < 0);
> + VM_BUG_ON(__page_count(page) < 0);
> atomic_inc(&page->_count);
> }
>
> --- a/fs/nilfs2/page.c
> +++ b/fs/nilfs2/page.c
> @@ -181,7 +181,7 @@ void nilfs_page_bug(struct page *page)
>
> printk(KERN_CRIT "NILFS_PAGE_BUG(%p): cnt=%d index#=%llu flags=0x%lx "
> "mapping=%p ino=%lu\n",
> - page, atomic_read(&page->_count),
> + page, __page_count(page),
> (unsigned long long)page->index, page->flags, m, ino);
>
> if (page_has_buffers(page)) {
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -28,7 +28,7 @@ static inline void set_page_count(struct
> static inline void set_page_refcounted(struct page *page)
> {
> VM_BUG_ON(PageTail(page));
> - VM_BUG_ON(atomic_read(&page->_count));
> + VM_BUG_ON(__page_count(page));
> set_page_count(page, 1);
> }
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -568,7 +568,7 @@ static inline int free_pages_check(struc
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> - (atomic_read(&page->_count) != 0) |
> + (__page_count(page) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_FREE) |
> (mem_cgroup_bad_page_check(page)))) {
> bad_page(page);
> @@ -758,7 +758,7 @@ static inline int check_new_page(struct
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> - (atomic_read(&page->_count) != 0) |
> + (__page_count(page) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
> (mem_cgroup_bad_page_check(page)))) {
> bad_page(page);
> @@ -5739,7 +5739,7 @@ void dump_page(struct page *page)
> {
> printk(KERN_ALERT
> "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> - page, atomic_read(&page->_count), page_mapcount(page),
> + page, __page_count(page), page_mapcount(page),
> page->mapping, page->index);
> dump_page_flags(page->flags);
> mem_cgroup_print_bad_page(page);
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -128,9 +128,9 @@ static void put_compound_page(struct pag
> if (put_page_testzero(page_head))
> VM_BUG_ON(1);
> /* __split_huge_page_refcount will wait now */
> - VM_BUG_ON(atomic_read(&page->_count) <= 0);
> + VM_BUG_ON(__page_count(page) <= 0);
> atomic_dec(&page->_count);
> - VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
> + VM_BUG_ON(__page_count(page_head) <= 0);
> compound_unlock_irqrestore(page_head, flags);
> if (put_page_testzero(page_head)) {
> if (PageHead(page_head))

--
Mel Gorman
SUSE Labs

2011-06-06 10:39:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, Jun 03, 2011 at 05:45:54PM +0200, Andrea Arcangeli wrote:
> On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > Right idea of the wrong zone being accounted for but wrong place. I
> > think the following patch should fix the problem;
>
> Looks good thanks.
>
> I also found this bug during my debugging that made NR_SHMEM underflow.
>
> ===
> Subject: migrate: don't account swapcache as shmem
>
> From: Andrea Arcangeli <[email protected]>
>
> swapcache will reach the below code path in migrate_page_move_mapping,
> and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
> NR_SHMEM.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>

Well spotted.

Acked-by: Mel Gorman <[email protected]>

Minor nit. swapper_space is rarely referred to outside of the swap
code. Might it be more readable to use

/*
 * swapcache is accounted as NR_FILE_PAGES but it is not
 * accounted as NR_SHMEM
 */
if (PageSwapBacked(page) && !PageSwapCache(page))

?

> ---
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e4a5c91..2597a27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -288,7 +288,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> */
> __dec_zone_page_state(page, NR_FILE_PAGES);
> __inc_zone_page_state(newpage, NR_FILE_PAGES);
> - if (PageSwapBacked(page)) {
> + if (mapping != &swapper_space && PageSwapBacked(page)) {
> __dec_zone_page_state(page, NR_SHMEM);
> __inc_zone_page_state(newpage, NR_SHMEM);
> }
>
>

--
Mel Gorman
SUSE Labs

2011-06-06 10:43:50

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Sat, Jun 04, 2011 at 03:58:53PM +0900, Minchan Kim wrote:
> On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > On Fri, Jun 03, 2011 at 03:09:20AM +0100, Mel Gorman wrote:
> > > On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > > > > There is an explanation in here somewhere because as I write this,
> > > > > the test machine has survived 14 hours under continual stress without
> > > > > the isolated counters going negative with over 128 million pages
> > > > > successfully migrated and a million pages failed to migrate due to
> > > > > direct compaction being called 80,000 times. It's possible it's a
> > > > > co-incidence but it's some co-incidence!
> > > >
> > > > No idea...
> > >
> > > I wasn't able to work on this most of the day but was looking at this
> > > closer this evening again and I think I might have thought of another
> > > theory that could cause this problem.
> > >
> > > When THP is isolating pages, it accounts for the pages isolated against
> > > the zone of course. If it backs out, it finds the pages from the PTEs.
> > > On !SMP but PREEMPT, we may not have adequate protection against a new
> > > page from a different zone being inserted into the PTE causing us to
> > > decrement against the wrong zone. While the global counter is fine,
> > > the per-zone counters look corrupted. You'd still think it was the
> > > anon counter tht got screwed rather than the file one if it really was
> > > THP unfortunately so it's not the full picture. I'm going to start
> > > a test monitoring both zoneinfo and vmstat to see if vmstat looks
> > > fine while the per-zone counters that are negative are offset by a
> > > positive count on the other zones that when added together become 0.
> > > Hopefully it'll actually trigger overnight :/
> > >
> >
> > Right idea of the wrong zone being accounted for but wrong place. I
> > think the following patch should fix the problem;
> >
> > ==== CUT HERE ===
> > mm: compaction: Ensure that the compaction free scanner does not move to the next zone
> >
> > Compaction works with two scanners, a migration and a free
> > scanner. When the scanners crossover, migration within the zone is
> > complete. The location of the scanner is recorded on each cycle to
> > avoid excesive scanning.
> >
> > When a zone is small and mostly reserved, it's very easy for the
> > migration scanner to be close to the end of the zone. Then the following
> > situation can occurs
> >
> > o migration scanner isolates some pages near the end of the zone
> > o free scanner starts at the end of the zone but finds that the
> > migration scanner is already there
> > o free scanner gets reinitialised for the next cycle as
> > cc->migrate_pfn + pageblock_nr_pages
> > moving the free scanner into the next zone
> > o migration scanner moves into the next zone but continues accounting
> > against the old zone
> >
> > When this happens, NR_ISOLATED accounting goes haywire because some
> > of the accounting happens against the wrong zone. One zones counter
> > remains positive while the other goes negative even though the overall
> > global count is accurate. This was reported on X86-32 with !SMP because
> > !SMP allows the negative counters to be visible. The fact that it is
> > difficult to reproduce on X86-64 is probably just a co-incidence as
>
> I guess it's related to zone sizes.
> X86-64 has small DMA and large DMA32 zones for fallback of NORMAL while
> x86 has just a small DMA(16M) zone.
>

Yep, this is a possibility as well as the use of lowmem reserves.

> I think DMA zone in x86 is easily full of non-LRU or non-movable pages.

Maybe not full, but it has more PageReserved pages than anywhere else
and few MIGRATE_MOVABLE blocks. MIGRATE_MOVABLE gets skipped during
async compaction we could easily reach the end of the DMA zone quickly.

> So isolate_migratepages continues to scan for pages which are migratable
> and then it reaches near the end of the zone.
>
> > the bug should theoritically be possible there.
>
> Finally, you found it. Congratulations!
> > Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>
> While we were debugging this problem, we found a few bugs and points to enhance,
> and submitted patches. It was a very good chance to improve the Linux VM.
>

Thanks.

--
Mel Gorman
SUSE Labs

2011-06-06 12:39:08

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:39:24AM +0100, Mel Gorman wrote:
> Well spotted.
>
> Acked-by: Mel Gorman <[email protected]>
>
> Minor nit. swapper_space is rarely referred to outside of the swap
> code. Might it be more readable to use
>
> /*
>  * swapcache is accounted as NR_FILE_PAGES but it is not
>  * accounted as NR_SHMEM
>  */
> if (PageSwapBacked(page) && !PageSwapCache(page))

I thought the comparison against swapper_space would be faster, as it is a
register vs immediate comparison in the CPU instead of forcing a memory
access; otherwise I would have used the above. Now, the test_bit is written
in C and is lockless, so it's not likely to be very different considering
the cacheline is hot in the CPU, but it still references memory instead of
being a register vs immediate comparison.
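
For reference, the two candidate checks being compared are, side by side (a
sketch; both forms appear earlier in this thread, the first in Andrea's patch
and the second in Mel's suggestion):

	/*
	 * Variant from the patch: compare the mapping pointer against the
	 * address of swapper_space, which the compiler can typically emit
	 * as a pointer compare against a link-time constant.
	 */
	if (mapping != &swapper_space && PageSwapBacked(page)) {
		__dec_zone_page_state(page, NR_SHMEM);
		__inc_zone_page_state(newpage, NR_SHMEM);
	}

	/*
	 * Variant Mel suggests: test the swapcache page flag, which reads
	 * page->flags from memory, but that cacheline is hot here and the
	 * form is more familiar outside the swap code.
	 */
	if (PageSwapBacked(page) && !PageSwapCache(page)) {
		__dec_zone_page_state(page, NR_SHMEM);
		__inc_zone_page_state(newpage, NR_SHMEM);
	}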

2011-06-06 12:40:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:43:45AM +0100, Mel Gorman wrote:
> Maybe not full, but it has more PageReserved pages than anywhere else
> and few MIGRATE_MOVABLE blocks. MIGRATE_MOVABLE gets skipped during
> async compaction we could easily reach the end of the DMA zone quickly.

Debug data has nr_isolated_file in dma zone 1 and nr_isolated_file in
normal zone being -1.

2011-06-06 12:50:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:32:16AM +0100, Mel Gorman wrote:
> This patch is pulling in stuff from Minchan. Minimally his patch should
> be kept separate to preserve history or his Signed-off should be
> included on this patch.

Well I didn't apply Minchan's patch, just improved it as he suggested
from pseudocode, but I can add his signed-off-by no prob.

> I still think this optimisation is rare and only applies if we are
> encountering huge pages during the linear scan. How often are we doing
> that really?

Well, it's so fast to do that it looks worthwhile. You probably noticed
that initially I suggested only the fix for the (theoretical) page_count
oops, and I argued we could improve some more bits, but then it was kind
of obvious to improve the upper side of the loop too, following the
pseudocode.

>
> > + VM_BUG_ON(!isolated_pages);
>
> This BUG_ON is overkill. hpage_nr_pages would have to return 0.
>
> > + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
>
> This would require order > MAX_ORDER_NR_PAGES to be passed into
> isolate_lru_pages or for a huge page to be unaligned to a power of
> two. The former is very unlikely and the latter is not supported by
> any CPU.

Minchan also disliked the VM_BUG_ON; it's clearly way overkill. But frankly
the physical pfn scans are tricky enough, and if there's a race and the
order is wrong for whatever reason (no compound page, or a driver messing
with subpages overwrote it) we'll just trip over some weird pointer on the
next iteration (or maybe not, and it'll go ahead unnoticed if it's not
beyond the range), and in that case I'd like to notice immediately.

But it's probably too paranoid even for a VM_BUG_ON, so I surely can
remove it...

>
> > } else {
> > - /* the page is freed already. */
> > - if (!page_count(cursor_page))
> > + /*
> > + * Check if the page is freed already.
> > + *
> > + * We can't use page_count() as that
> > + * requires compound_head and we don't
> > + * have a pin on the page here. If a
> > + * page is tail, we may or may not
> > + * have isolated the head, so assume
> > + * it's not free, it'd be tricky to
> > + * track the head status without a
> > + * page pin.
> > + */
> > + if (!PageTail(cursor_page) &&
> > + !__page_count(cursor_page))
> > continue;
> > break;
>
> Ack to this part.

This is also the only important part that fixes the potential oops.

> I'm not keen on __page_count() as __ normally means the "unlocked"
> version of a function although I realise that rule isn't universal
> either. I can't think of a better name though.

If a better suggestion comes to mind I can change it... Or I can also
use atomic_read like in the first patch... it's up to you. I figured it
wasn't so nice to call atomic_read directly, and there are other places in
huge_memory.c that used it for bugchecks which can be cleaned up with
__page_count. The _ prefix on _count is what makes it look like a more
private field that generic VM code shouldn't touch directly, so the raw
value can be altered without changing all callers of __page_count,
similar to _mapcount.

2011-06-06 13:23:33

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:43:45AM +0100, Mel Gorman wrote:
> On Sat, Jun 04, 2011 at 03:58:53PM +0900, Minchan Kim wrote:
> > On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > > On Fri, Jun 03, 2011 at 03:09:20AM +0100, Mel Gorman wrote:
> > > > On Thu, Jun 02, 2011 at 05:37:54PM +0200, Andrea Arcangeli wrote:
> > > > > > There is an explanation in here somewhere because as I write this,
> > > > > > the test machine has survived 14 hours under continual stress without
> > > > > > the isolated counters going negative with over 128 million pages
> > > > > > successfully migrated and a million pages failed to migrate due to
> > > > > > direct compaction being called 80,000 times. It's possible it's a
> > > > > > co-incidence but it's some co-incidence!
> > > > >
> > > > > No idea...
> > > >
> > > > I wasn't able to work on this most of the day but was looking at this
> > > > closer this evening again and I think I might have thought of another
> > > > theory that could cause this problem.
> > > >
> > > > When THP is isolating pages, it accounts for the pages isolated against
> > > > the zone of course. If it backs out, it finds the pages from the PTEs.
> > > > On !SMP but PREEMPT, we may not have adequate protection against a new
> > > > page from a different zone being inserted into the PTE causing us to
> > > > decrement against the wrong zone. While the global counter is fine,
> > > > the per-zone counters look corrupted. You'd still think it was the
> > > > anon counter tht got screwed rather than the file one if it really was
> > > > THP unfortunately so it's not the full picture. I'm going to start
> > > > a test monitoring both zoneinfo and vmstat to see if vmstat looks
> > > > fine while the per-zone counters that are negative are offset by a
> > > > positive count on the other zones that when added together become 0.
> > > > Hopefully it'll actually trigger overnight :/
> > > >
> > >
> > > Right idea of the wrong zone being accounted for but wrong place. I
> > > think the following patch should fix the problem;
> > >
> > > ==== CUT HERE ===
> > > mm: compaction: Ensure that the compaction free scanner does not move to the next zone
> > >
> > > Compaction works with two scanners, a migration and a free
> > > scanner. When the scanners crossover, migration within the zone is
> > > complete. The location of the scanner is recorded on each cycle to
> > > avoid excesive scanning.
> > >
> > > When a zone is small and mostly reserved, it's very easy for the
> > > migration scanner to be close to the end of the zone. Then the following
> > > situation can occurs
> > >
> > > o migration scanner isolates some pages near the end of the zone
> > > o free scanner starts at the end of the zone but finds that the
> > > migration scanner is already there
> > > o free scanner gets reinitialised for the next cycle as
> > > cc->migrate_pfn + pageblock_nr_pages
> > > moving the free scanner into the next zone
> > > o migration scanner moves into the next zone but continues accounting
> > > against the old zone
> > >
> > > When this happens, NR_ISOLATED accounting goes haywire because some
> > > of the accounting happens against the wrong zone. One zones counter
> > > remains positive while the other goes negative even though the overall
> > > global count is accurate. This was reported on X86-32 with !SMP because
> > > !SMP allows the negative counters to be visible. The fact that it is
> > > difficult to reproduce on X86-64 is probably just a co-incidence as
> >
> > I guess it's related to zone sizes.
> > X86-64 has small DMA and large DMA32 zones for fallback of NORMAL while
> > x86 has just a small DMA(16M) zone.
> >
>
> Yep, this is a possibility as well as the use of lowmem reserves.
>
> > I think DMA zone in x86 is easily full of non-LRU or non-movable pages.
>
> Maybe not full, but it has more PageReserved pages than anywhere else

Yeb. It's very possible.

> and few MIGRATE_MOVABLE blocks. MIGRATE_MOVABLE gets skipped during

That should read "non-MIGRATE_MOVABLE gets skipped during".
To be clear for someone reading this in the future, let's fix that typo.

> async compaction we could easily reach the end of the DMA zone quickly.

--
Kind regards
Minchan Kim

2011-06-06 13:27:22

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 02:40:25PM +0200, Andrea Arcangeli wrote:
> On Mon, Jun 06, 2011 at 11:43:45AM +0100, Mel Gorman wrote:
> > Maybe not full, but it has more PageReserved pages than anywhere else
> > and few MIGRATE_MOVABLE blocks. MIGRATE_MOVABLE gets skipped during
> > async compaction we could easily reach the end of the DMA zone quickly.
>
> Debug data has nr_isolated_file in dma zone 1 and nr_isolated_file in
> normal zone being -1.

It exactly matches Mel's case.
Thanks for proving it, Andrea.

--
Kind regards
Minchan Kim

2011-06-06 14:01:31

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:15:57AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> > > On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:
> > >> I mean we have more tail pages than head pages, so I think we are likely to
> > >> meet tail pages. Of course, compared to all pages (page cache, anon and
> > >> so on), compound pages would be a very small percentage.
> > >
> > > Yes that's my point, that being a small percentage it's no big deal to
> > > break the loop early.
> >
> > Indeed.
> >
> > >
> > >> > isolated the head and it's useless to insist on more tail pages (at
> > >> > least for large page size like on x86). Plus we've compaction so
> > >>
> > >> I can't understand your point. Could you elaborate it?
> > >
> > > What I meant is that if we already isolated the head page of the THP,
> > > we don't need to try to free the tail pages; breaking the loop
> > > early will still give us a chance to free a whole 2M because we
> > > isolated the head page (it'll involve some work and swapping, but if it
> > > was a compound trans page we're OK to break the loop and we're not
> > > making the logic any worse). Provided the PMD_SIZE is quite large, like
> > > 2/4M...
> >
> > Do you want this? (it's almost pseudo-code)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7a4469b..9d7609f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1017,7 +1017,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> > struct page *page;
> > unsigned long pfn;
> > - unsigned long end_pfn;
> > + unsigned long start_pfn, end_pfn;
> > unsigned long page_pfn;
> > int zone_id;
> >
> > @@ -1057,9 +1057,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > */
> > zone_id = page_zone_id(page);
> > page_pfn = page_to_pfn(page);
> > - pfn = page_pfn & ~((1 << order) - 1);
> > + start_pfn = pfn = page_pfn & ~((1 << order) - 1);
> > end_pfn = pfn + (1 << order);
> > - for (; pfn < end_pfn; pfn++) {
> > + while (pfn < end_pfn) {
> > struct page *cursor_page;
> >
> > /* The target page is in the block, ignore it. */
> > @@ -1086,17 +1086,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > break;
> >
> > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > + int isolated_pages;
> > list_move(&cursor_page->lru, dst);
> > mem_cgroup_del_lru(cursor_page);
> > - nr_taken += hpage_nr_pages(page);
> > + isolated_pages = hpage_nr_pages(page);
> > + nr_taken += isolated_pages;
> > + /* if we isolated enough pages, let's break early */
> > + if (nr_taken > end_pfn - start_pfn)
> > + break;
> > + pfn += isolated_pages;
>
> I think this condition is somewhat unlikely. We are scanning within
> aligned blocks in this linear scanner. Huge pages are always aligned
> so the only situation where we'll encounter a hugepage in the middle
> of this linear scan is when the requested order is larger than a huge
> page. This is exceptionally rare.
>
> Did I miss something?

Not at all, you're absolutely right.
I don't have systems with lots of hugepages,
but I have heard of people tuning MAX_ORDER (whether that's good or bad is off-topic).
Anyway, it would help on such systems, though I admit it would be rare.
I don't feel strongly about this pseudo-patch.

--
Kind regards
Minchan Kim

2011-06-06 14:08:09

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:32:16AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 08:07:30PM +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 03, 2011 at 07:37:07PM +0200, Andrea Arcangeli wrote:
> > > On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > > > Do you want this? (it's almost pseudo-code)
> > >
> > > Yes, that's a good idea: we at least take into account whether we
> > > isolated something big, and it's pointless to insist on wasting CPU on
> > > the tail pages and even trace a failure because of tail pages after it.
> > >
> > > I introduced a __page_count to increase readability. It's still
> > > hackish to work on subpages in vmscan.c, but at least I added a comment,
> > > and until we serialize destroy_compound_page vs compound_head I guess
> > > there's no better way. I didn't attempt to add out-of-order
> > > serialization similar to what exists for split_huge_page vs
> > > compound_trans_head yet, as the page can be allocated or go away from
> > > under us. In split_huge_page vs compound_trans_head it's simpler
> > > because both callers are required to hold a pin on the page, so the
> > > page can't be reallocated and destroyed under it.
> >
> > Sent too fast... had to shuffle a few things around... trying again.
> >
> > ===
> > Subject: mm: no page_count without a page pin
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > It's unsafe to run page_count during the physical pfn scan because
> > compound_head could trip on a dangling pointer when reading page->first_page if
> > the compound page is being freed by another CPU. Also properly take into
> > account if we isolated a compound page during the scan and break the loop if
> > we've isolated enough. Introduce __page_count to clean up some atomic_read
> > calls on &page->_count in common code.
> >
> > Signed-off-by: Andrea Arcangeli <[email protected]>
>
> This patch is pulling in stuff from Minchan. Minimally his patch should
> be kept separate to preserve history or his Signed-off should be
> included on this patch.
>
> > ---
> > arch/powerpc/mm/gup.c | 2 -
> > arch/powerpc/platforms/512x/mpc512x_shared.c | 2 -
> > arch/x86/mm/gup.c | 2 -
> > fs/nilfs2/page.c | 2 -
> > include/linux/mm.h | 13 ++++++----
> > mm/huge_memory.c | 4 +--
> > mm/internal.h | 2 -
> > mm/page_alloc.c | 6 ++--
> > mm/swap.c | 4 +--
> > mm/vmscan.c | 35 ++++++++++++++++++++-------
> > 10 files changed, 47 insertions(+), 25 deletions(-)
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,7 +1047,7 @@ static unsigned long isolate_lru_pages(u
> > for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> > struct page *page;
> > unsigned long pfn;
> > - unsigned long end_pfn;
> > + unsigned long start_pfn, end_pfn;
> > unsigned long page_pfn;
> > int zone_id;
> >
> > @@ -1087,9 +1087,9 @@ static unsigned long isolate_lru_pages(u
> > */
> > zone_id = page_zone_id(page);
> > page_pfn = page_to_pfn(page);
> > - pfn = page_pfn & ~((1 << order) - 1);
> > - end_pfn = pfn + (1 << order);
> > - for (; pfn < end_pfn; pfn++) {
> > + start_pfn = page_pfn & ~((1 << order) - 1);
> > + end_pfn = start_pfn + (1 << order);
> > + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > struct page *cursor_page;
> >
> > /* The target page is in the block, ignore it. */
> > @@ -1116,16 +1116,33 @@ static unsigned long isolate_lru_pages(u
> > break;
> >
> > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > + unsigned int isolated_pages;
> > list_move(&cursor_page->lru, dst);
> > mem_cgroup_del_lru(cursor_page);
> > - nr_taken += hpage_nr_pages(page);
> > - nr_lumpy_taken++;
> > + isolated_pages = hpage_nr_pages(page);
> > + nr_taken += isolated_pages;
> > + nr_lumpy_taken += isolated_pages;
> > if (PageDirty(cursor_page))
> > - nr_lumpy_dirty++;
> > + nr_lumpy_dirty += isolated_pages;
> > scan++;
> > + pfn += isolated_pages-1;
>
> Ah, here is the isolated_pages - 1 which is necessary. Should have read
> the whole thread before responding to anything :).
>
> I still think this optimisation is rare and only applies if we are
> encountering huge pages during the linear scan. How often are we doing
> that really?
>
> > + VM_BUG_ON(!isolated_pages);
>
> This BUG_ON is overkill. hpage_nr_pages would have to return 0.
>
> > + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
>
> This would require order > MAX_ORDER_NR_PAGES to be passed into
> isolate_lru_pages or for a huge page to be unaligned to a power of
> two. The former is very unlikely and the latter is not supported by
> any CPU.
>
> > } else {
> > - /* the page is freed already. */
> > - if (!page_count(cursor_page))
> > + /*
> > + * Check if the page is freed already.
> > + *
> > + * We can't use page_count() as that
> > + * requires compound_head and we don't
> > + * have a pin on the page here. If a
> > + * page is tail, we may or may not
> > + * have isolated the head, so assume
> > + * it's not free, it'd be tricky to
> > + * track the head status without a
> > + * page pin.
> > + */
> > + if (!PageTail(cursor_page) &&
> > + !__page_count(cursor_page))
> > continue;
> > break;
>
> Ack to this part.
>
> I'm not keen on __page_count() as __ normally means the "unlocked"
> version of a function although I realise that rule isn't universal

Yes, it's not universal.
I have thought of it as just a private or utility function
(i.e., not exported),
so I don't mind the name.

> either. I can't think of a better name though.

Me neither.

--
Kind regards
Minchan Kim
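
For anyone reading this without the full patch in front of them, the
helper being debated is presumably just a raw read of the reference
count of this exact struct page, skipping compound_head(); a minimal
sketch under that assumption (not copied from the actual patch):

/*
 * Assumed shape of __page_count: read _count of exactly this struct
 * page without going through compound_head(), which is what makes it
 * usable during a physical pfn scan where no pin on the page is held.
 */
static inline int __page_count(struct page *page)
{
        return atomic_read(&page->_count);
}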

2011-06-06 14:20:06

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:39:24AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 05:45:54PM +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > > Right idea of the wrong zone being accounted for but wrong place. I
> > > think the following patch should fix the problem;
> >
> > Looks good thanks.
> >
> > I also found this bug during my debugging that made NR_SHMEM underflow.
> >
> > ===
> > Subject: migrate: don't account swapcache as shmem
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > swapcache will reach the below code path in migrate_page_move_mapping,
> > and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
> > NR_SHMEM.
> >
> > Signed-off-by: Andrea Arcangeli <[email protected]>
>
> Well spotted.
>
> Acked-by: Mel Gorman <[email protected]>
>
> Minor nit. swapper_space is rarely referred to outside of the swap
> code. Might it be more readable to use
>
> /*
> * swapcache is accounted as NR_FILE_PAGES but it is not
> * accounted as NR_SHMEM
> */
> if (PageSwapBacked(page) && !PageSwapCache(page))

I like this, but as it's an "and" operation the CPU has to execute two comparisons.
How about the below instead?
if (!PageSwapCache(page) && PageSwapBacked(page))

PageSwapCache implies PageSwapBacked, so we can handle non-swap-backed pages with just one comparison.

>
> ?
>
> > ---
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index e4a5c91..2597a27 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -288,7 +288,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> > */
> > __dec_zone_page_state(page, NR_FILE_PAGES);
> > __inc_zone_page_state(newpage, NR_FILE_PAGES);
> > - if (PageSwapBacked(page)) {
> > + if (mapping != &swapper_space && PageSwapBacked(page)) {
> > __dec_zone_page_state(page, NR_SHMEM);
> > __inc_zone_page_state(newpage, NR_SHMEM);
> > }
> >
> >
>
> --
> Mel Gorman
> SUSE Labs

--
Kind regards
Minchan Kim
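
For readers comparing the two alternatives discussed above, they look
like this side by side (illustrative only; the surrounding context is
the migrate_page_move_mapping hunk quoted in the diff):

        /* Mel's suggestion: arguably more readable, tests page flags only */
        if (PageSwapBacked(page) && !PageSwapCache(page)) {
                __dec_zone_page_state(page, NR_SHMEM);
                __inc_zone_page_state(newpage, NR_SHMEM);
        }

        /*
         * Andrea's original: compares the mapping pointer against
         * &swapper_space, which compiles to an immediate/register
         * comparison and does not touch page->flags.
         */
        if (mapping != &swapper_space && PageSwapBacked(page)) {
                __dec_zone_page_state(page, NR_SHMEM);
                __inc_zone_page_state(newpage, NR_SHMEM);
        }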

2011-06-06 14:26:34

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 11:15:57AM +0100, Mel Gorman wrote:
> On Fri, Jun 03, 2011 at 08:01:44AM +0900, Minchan Kim wrote:
> > On Fri, Jun 3, 2011 at 7:32 AM, Andrea Arcangeli <[email protected]> wrote:
> > > On Fri, Jun 03, 2011 at 07:23:48AM +0900, Minchan Kim wrote:

<snip>

> > >> AFAIK, it's the final destination to aim for, as compaction will not break LRU
> > >> ordering once my patch (inorder-putback) is merged.
> > >
> > > Agreed. I like your patchset, sorry for not having reviewed it in
> > > detail yet but there were other issues popping up in the last few
> > > days.
> >
> > No problem. It's more urgent than mine. :)
> >
>
> I'm going to take the opportunity to apologise for not reviewing that
> series yet. I've been kept too busy with other bugs to set aside the
> few hours I need to review the series. I'm hoping to get to it this
> week if everything goes well.

I am refactoring the migration code.
I will probably resend it tomorrow.
I hope you guys can review it. :)

Thanks, Mel.

>
> --
> Mel Gorman
> SUSE Labs

--
Kind regards
Minchan Kim

2011-06-06 14:47:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 02:49:54PM +0200, Andrea Arcangeli wrote:
> On Mon, Jun 06, 2011 at 11:32:16AM +0100, Mel Gorman wrote:
> > This patch is pulling in stuff from Minchan. Minimally his patch should
> > be kept separate to preserve history or his Signed-off should be
> > included on this patch.
>
> Well I didn't apply Minchan's patch, just improved it as he suggested
> from pseudocode, but I can add his signed-off-by no prob.
>

My bad; the pseudo-code was close enough to being a patch that I felt it at
least merited a mention in the patch.

> > I still think this optimisation is rare and only applies if we are
> > encountering huge pages during the linear scan. How often are we doing
> > that really?
>
> Well, it's so fast to do that it looks worthwhile. You probably
> noticed that initially I suggested only the fix for the (theoretical)
> page_count oops, and I argued we could improve some more bits, but
> then it was kind of obvious to improve the upper side of the loop too
> according to the pseudocode.
>

I don't feel very strongly about it. I don't think there is much of a
boost because of how rarely we'll encounter this situation but there is
no harm either. I think the page_count fix is more important.

> >
> > > + VM_BUG_ON(!isolated_pages);
> >
> > This BUG_ON is overkill. hpage_nr_pages would have to return 0.
> >
> > > + VM_BUG_ON(isolated_pages > MAX_ORDER_NR_PAGES);
> >
> > This would require order > MAX_ORDER_NR_PAGES to be passed into
> > isolate_lru_pages or for a huge page to be unaligned to a power of
> > two. The former is very unlikely and the latter is not supported by
> > any CPU.
>
> Minchan also disliked the VM_BUG_ON; it's clearly way overkill, but
> frankly the physical pfn scans are tricky enough, and if there's
> a race and the order is wrong for whatever reason (no compound page, or
> a driver messing with subpages overwrote it) we'll just trip over some
> weird pointer on the next iteration (or maybe not, and it'll go ahead
> unnoticed if it's not beyond the range), and in that case I'd like to
> notice immediately.
>

I guess there is always the chance that an out-of-tree driver will
do something utterly insane with a transparent hugepage and while
this BUG_ON is "impossible", it doesn't hurt either.

> But probably it's too paranoid even of a VM_BUG_ON so I surely can
> remove it...
>
> >
> > > } else {
> > > - /* the page is freed already. */
> > > - if (!page_count(cursor_page))
> > > + /*
> > > + * Check if the page is freed already.
> > > + *
> > > + * We can't use page_count() as that
> > > + * requires compound_head and we don't
> > > + * have a pin on the page here. If a
> > > + * page is tail, we may or may not
> > > + * have isolated the head, so assume
> > > + * it's not free, it'd be tricky to
> > > + * track the head status without a
> > > + * page pin.
> > > + */
> > > + if (!PageTail(cursor_page) &&
> > > + !__page_count(cursor_page))
> > > continue;
> > > break;
> >
> > Ack to this part.
>
> This is also the only important part that fixes the potential oops.
>

Agreed.

> > I'm not keen on __page_count() as __ normally means the "unlocked"
> > version of a function although I realise that rule isn't universal
> > either. I can't think of a better name though.
>
> If better suggestions comes to mind I can change it... Or I can also
> use atomic_read like in the first patch... it's up to you.

The atomic_read is not an improvement. What you have is better than
adding another open-coded atomic_read of the page count.

> I figured
> it wasn't so nice to call atomic_read directly, and there are other places in
> huge_memory.c that use it for bug checks which can be cleaned up
> with __page_count. The _ prefix on _count is what makes it look
> like a more private field that generic VM code shouldn't use directly, so the
> raw value can be altered without changing all callers of __page_count,
> similar to _mapcount.

There is that. Go with __page_count because it's better than an
atomic_read. I'm happy to ack the __page_count part of this patch, so
please split it out because it is a -stable candidate whereas the
potential optimisation and VM_BUG_ONs are not.

--
Mel Gorman
SUSE Labs

2011-06-06 14:55:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Mon, Jun 06, 2011 at 02:38:51PM +0200, Andrea Arcangeli wrote:
> On Mon, Jun 06, 2011 at 11:39:24AM +0100, Mel Gorman wrote:
> > Well spotted.
> >
> > Acked-by: Mel Gorman <[email protected]>
> >
> > Minor nit. swapper_space is rarely referred to outside of the swap
> > code. Might it be more readable to use
> >
> > /*
> > * swapcache is accounted as NR_FILE_PAGES but it is not
> > * accounted as NR_SHMEM
> > */
> > if (PageSwapBacked(page) && !PageSwapCache(page))
>
> I thought the comparison on swapper_space would be faster as it is
> an immediate-vs-register comparison in the CPU, instead of forcing a memory
> access. Otherwise I would have used the above. Now, the test_bit is
> written in C and lockless, so it's not likely to be very different
> considering the cacheline is hot in the CPU, but it's still referencing
> memory instead of doing a register-vs-immediate comparison.

Ok, I had not considered that. It is a micro-optimisation but it's
there. I thought my version was more readable, and migration is not
really a fast path, but yours is still better.

--
Mel Gorman
SUSE Labs

2011-06-06 22:33:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous

On Fri, 3 Jun 2011 17:45:54 +0200
Andrea Arcangeli <[email protected]> wrote:

> On Fri, Jun 03, 2011 at 03:49:41PM +0100, Mel Gorman wrote:
> > Right idea of the wrong zone being accounted for but wrong place. I
> > think the following patch should fix the problem;
>
> Looks good thanks.
>
> I also found this bug during my debugging that made NR_SHMEM underflow.
>
> ===
> Subject: migrate: don't account swapcache as shmem
>
> From: Andrea Arcangeli <[email protected]>
>
> swapcache will reach the below code path in migrate_page_move_mapping,
> and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
> NR_SHMEM.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e4a5c91..2597a27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -288,7 +288,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> */
> __dec_zone_page_state(page, NR_FILE_PAGES);
> __inc_zone_page_state(newpage, NR_FILE_PAGES);
> - if (PageSwapBacked(page)) {
> + if (mapping != &swapper_space && PageSwapBacked(page)) {
> __dec_zone_page_state(page, NR_SHMEM);
> __inc_zone_page_state(newpage, NR_SHMEM);
> }

fyi, this was the only patch I applied from this whole thread. Once the
dust has settled, could people please resend whatever they have,
including any acked-by's and reviewed-by's?

Thanks.