2011-02-09 15:46:27

by Johannes Weiner

Subject: [patch] vmscan: fix zone shrinking exit when scan work is done

Hi,

I think this should fix the problem of processes getting stuck in
reclaim that has been reported several times. Kent actually
single-stepped through this code and noted that it was never exiting
shrink_zone(), which really narrowed it down a lot, considering the
tons of nested loops from the allocator down to the list shrinking.

Hannes

---
From: Johannes Weiner <[email protected]>
Subject: vmscan: fix zone shrinking exit when scan work is done

'3e7d344 mm: vmscan: reclaim order-0 and use compaction instead of
lumpy reclaim' introduced an indefinite loop in shrink_zone().

It meant to break out of this loop when no pages had been reclaimed
and not a single page was even scanned. The way it would detect the
latter is by taking a snapshot of sc->nr_scanned at the beginning of
the function and comparing it against the new sc->nr_scanned after the
scan loop. But it would re-iterate without updating that snapshot,
looping forever if sc->nr_scanned changed at least once since
shrink_zone() was invoked.

This is not the sole condition that would exit that loop, but the
remaining exit conditions require other processes to change the zone
state, which the stuck reclaimer itself obviously can no longer do.

This is only happening for higher-order allocations, where reclaim is
run back to back with compaction.

Reported-by: Michal Hocko <[email protected]>
Reported-by: Kent Overstreet <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..17497d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1882,12 +1882,12 @@ static void shrink_zone(int priority, struct zone *zone,
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
enum lru_list l;
- unsigned long nr_reclaimed;
+ unsigned long nr_reclaimed, nr_scanned;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
- unsigned long nr_scanned = sc->nr_scanned;

restart:
nr_reclaimed = 0;
+ nr_scanned = sc->nr_scanned;
get_scan_count(zone, sc, nr, priority);

while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
--
1.7.4
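
As a simplified sketch (not the actual kernel code, details elided), the
restart loop in shrink_zone() looks roughly like this after the fix, which
shows why the snapshot has to be retaken on every pass:

static void shrink_zone(int priority, struct zone *zone,
			struct scan_control *sc)
{
	unsigned long nr_reclaimed, nr_scanned;

restart:
	nr_reclaimed = 0;
	/* the snapshot is now taken per iteration, not once per call */
	nr_scanned = sc->nr_scanned;

	/* ... scan the LRU lists; sc->nr_scanned grows, pages may be reclaimed ... */

	/*
	 * Before the fix, the snapshot above was taken before the restart
	 * label, so sc->nr_scanned - nr_scanned stayed non-zero forever once
	 * anything had been scanned and this call could loop indefinitely.
	 */
	if (should_continue_reclaim(zone, nr_reclaimed,
				    sc->nr_scanned - nr_scanned, sc))
		goto restart;

	throttle_vm_writeout(sc->gfp_mask);
}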


2011-02-09 15:55:04

by Kent Overstreet

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On 02/09/2011 07:46 AM, Johannes Weiner wrote:
> Hi,
>
> I think this should fix the problem of processes getting stuck in
> reclaim that has been reported several times. Kent actually
> single-stepped through this code and noted that it was never exiting
> shrink_zone(), which really narrowed it down a lot, considering the
> tons of nested loops from the allocator down to the list shrinking.
>
> Hannes

I was able to trigger this in just a few minutes stress testing bcache,
and now it's been going for half an hour working beautifully. Thanks!

>
> ---
> From: Johannes Weiner<[email protected]>
> Subject: vmscan: fix zone shrinking exit when scan work is done
>
> '3e7d344 mm: vmscan: reclaim order-0 and use compaction instead of
> lumpy reclaim' introduced an indefinite loop in shrink_zone().
>
> It meant to break out of this loop when no pages had been reclaimed
> and not a single page was even scanned. The way it would detect the
> latter is by taking a snapshot of sc->nr_scanned at the beginning of
> the function and comparing it against the new sc->nr_scanned after the
> scan loop. But it would re-iterate without updating that snapshot,
> looping forever if sc->nr_scanned changed at least once since
> shrink_zone() was invoked.
>
> This is not the sole condition that would exit that loop, but the
> remaining exit conditions require other processes to change the zone
> state, which the stuck reclaimer itself obviously can no longer do.
>
> This is only happening for higher-order allocations, where reclaim is
> run back to back with compaction.
>
> Reported-by: Michal Hocko<[email protected]>
> Reported-by: Kent Overstreet<[email protected]>
> Signed-off-by: Johannes Weiner<[email protected]>

Tested-by: Kent Overstreet<[email protected]>

> ---
> mm/vmscan.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 148c6e6..17497d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1882,12 +1882,12 @@ static void shrink_zone(int priority, struct zone *zone,
> unsigned long nr[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> enum lru_list l;
> - unsigned long nr_reclaimed;
> + unsigned long nr_reclaimed, nr_scanned;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> - unsigned long nr_scanned = sc->nr_scanned;
>
> restart:
> nr_reclaimed = 0;
> + nr_scanned = sc->nr_scanned;
> get_scan_count(zone, sc, nr, priority);
>
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||

2011-02-09 16:47:24

by Mel Gorman

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> Hi,
>
> I think this should fix the problem of processes getting stuck in
> reclaim that has been reported several times.

I don't think it's the only source but I'm basing this on seeing
constant looping in balance_pgdat() and calling congestion_wait() a few
weeks ago that I haven't rechecked since. However, this looks like a
real fix for a real problem.

> Kent actually
> single-stepped through this code and noted that it was never exiting
> shrink_zone(), which really narrowed it down a lot, considering the
> tons of nested loops from the allocator down to the list shrinking.
>
> Hannes
>
> ---
> From: Johannes Weiner <[email protected]>
> Subject: vmscan: fix zone shrinking exit when scan work is done
>
> '3e7d344 mm: vmscan: reclaim order-0 and use compaction instead of
> lumpy reclaim' introduced an indefinite loop in shrink_zone().
>
> It meant to break out of this loop when no pages had been reclaimed
> and not a single page was even scanned. The way it would detect the
> latter is by taking a snapshot of sc->nr_scanned at the beginning of
> the function and comparing it against the new sc->nr_scanned after the
> scan loop. But it would re-iterate without updating that snapshot,
> looping forever if sc->nr_scanned changed at least once since
> shrink_zone() was invoked.
>
> This is not the sole condition that would exit that loop, but the
> remaining exit conditions require other processes to change the zone
> state, which the stuck reclaimer itself obviously can no longer do.
>
> This is only happening for higher-order allocations, where reclaim is
> run back to back with compaction.
>
> Reported-by: Michal Hocko <[email protected]>
> Reported-by: Kent Overstreet <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>

Well spotted.

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2011-02-09 18:29:56

by Andrea Arcangeli

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > Hi,
> >
> > I think this should fix the problem of processes getting stuck in
> > reclaim that has been reported several times.
>
> I don't think it's the only source but I'm basing this on seeing
> constant looping in balance_pgdat() and calling congestion_wait() a few
> weeks ago that I haven't rechecked since. However, this looks like a
> real fix for a real problem.

Agreed. Just yesterday I spent some time on the lumpy compaction
changes after wondering about Michal's khugepaged 100% report, and I
expected some fix was needed in this area (as I couldn't find any bug
in khugepaged yet, so the lumpy compaction looked the next candidate
for bugs).

I've also been wondering about the !nr_scanned check in
should_continue_reclaim too but I didn't look too much into the caller
(I was tempted to remove it altogether). I don't see how checking
nr_scanned can be safe even after we fix the caller to avoid passing
non-zero values if "goto restart".

nr_scanned is incremented even for !page_evictable... so it's not
really useful to insist, just because we scanned something, in my
view. It looks bogus... So my proposal would be below.

====
Subject: mm: stop checking nr_scanned in should_continue_reclaim

From: Andrea Arcangeli <[email protected]>

nr_scanned is incremented even for !page_evictable... so it's not
really useful to insist, just because we scanned something.

Signed-off-by: Andrea Arcangeli <[email protected]>
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..9741884 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1831,7 +1831,6 @@ out:
*/
static inline bool should_continue_reclaim(struct zone *zone,
unsigned long nr_reclaimed,
- unsigned long nr_scanned,
struct scan_control *sc)
{
unsigned long pages_for_compaction;
@@ -1841,15 +1840,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
return false;

- /*
- * If we failed to reclaim and have scanned the full list, stop.
- * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
- * faster but obviously would be less likely to succeed
- * allocation. If this is desirable, use GFP_REPEAT to decide
- * if both reclaimed and scanned should be checked or just
- * reclaimed
- */
- if (!nr_reclaimed && !nr_scanned)
+ /* If we failed to reclaim stop. */
+ if (!nr_reclaimed)
return false;

/*
@@ -1884,7 +1876,6 @@ static void shrink_zone(int priority, struct zone *zone,
enum lru_list l;
unsigned long nr_reclaimed;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
- unsigned long nr_scanned = sc->nr_scanned;

restart:
nr_reclaimed = 0;
@@ -1923,8 +1914,7 @@ restart:
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

/* reclaim/compaction might need reclaim to continue */
- if (should_continue_reclaim(zone, nr_reclaimed,
- sc->nr_scanned - nr_scanned, sc))
+ if (should_continue_reclaim(zone, nr_reclaimed, sc))
goto restart;

throttle_vm_writeout(sc->gfp_mask);

2011-02-09 20:06:47

by Andrew Morton

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Wed, 9 Feb 2011 19:28:46 +0100
Andrea Arcangeli <[email protected]> wrote:

> On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> > On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > > Hi,
> > >
> > > I think this should fix the problem of processes getting stuck in
> > > reclaim that has been reported several times.
> >
> > I don't think it's the only source but I'm basing this on seeing
> > constant looping in balance_pgdat() and calling congestion_wait() a few
> > weeks ago that I haven't rechecked since. However, this looks like a
> > real fix for a real problem.
>
> Agreed. Just yesterday I spent some time on the lumpy compaction
> changes after wondering about Michal's khugepaged 100% report, and I
> expected some fix was needed in this area (as I couldn't find any bug
> in khugepaged yet, so the lumpy compaction looked the next candidate
> for bugs).
>
> I've also been wondering about the !nr_scanned check in
> should_continue_reclaim too but I didn't look too much into the caller
> (I was tempted to remove it altogether). I don't see how checking
> nr_scanned can be safe even after we fix the caller to avoid passing
> non-zero values if "goto restart".
>
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something, in my
> view. It looks bogus... So my proposal would be below.
>
> ====
> Subject: mm: stop checking nr_scanned in should_continue_reclaim
>
> From: Andrea Arcangeli <[email protected]>
>
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something.

So if reclaim has scanned 100% !page_evictable pages,
should_continue_reclaim() can return true and we keep on scanning?

That sounds like it's both good and bad :( Is this actually a problem?
What sort of behaviour could it cause and under what circumstances?

Johannes's patch is an obvious bugfix and I'll run with it for now, but
please let's have a further think about the impact of the
!page_evictable pages.

2011-02-10 04:04:53

by Minchan Kim

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Thu, Feb 10, 2011 at 12:46 AM, Johannes Weiner <[email protected]> wrote:
> Hi,
>
> I think this should fix the problem of processes getting stuck in
> reclaim that has been reported several times.  Kent actually
> single-stepped through this code and noted that it was never exiting
> shrink_zone(), which really narrowed it down a lot, considering the
> tons of nested loops from the allocator down to the list shrinking.
>
>        Hannes
>
> ---
> From: Johannes Weiner <[email protected]>
> Subject: vmscan: fix zone shrinking exit when scan work is done
>
> '3e7d344 mm: vmscan: reclaim order-0 and use compaction instead of
> lumpy reclaim' introduced an indefinite loop in shrink_zone().
>
> It meant to break out of this loop when no pages had been reclaimed
> and not a single page was even scanned.  The way it would detect the
> latter is by taking a snapshot of sc->nr_scanned at the beginning of
> the function and comparing it against the new sc->nr_scanned after the
> scan loop.  But it would re-iterate without updating that snapshot,
> looping forever if sc->nr_scanned changed at least once since
> shrink_zone() was invoked.
>
> This is not the sole condition that would exit that loop, but the
> remaining exit conditions require other processes to change the zone
> state, which the stuck reclaimer itself obviously can no longer do.
>
> This is only happening for higher-order allocations, where reclaim is
> run back to back with compaction.
>
> Reported-by: Michal Hocko <[email protected]>
> Reported-by: Kent Overstreet <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-02-10 10:21:40

by Mel Gorman

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Wed, Feb 09, 2011 at 07:28:46PM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> > On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > > Hi,
> > >
> > > I think this should fix the problem of processes getting stuck in
> > > reclaim that has been reported several times.
> >
> > I don't think it's the only source but I'm basing this on seeing
> > constant looping in balance_pgdat() and calling congestion_wait() a few
> > weeks ago that I haven't rechecked since. However, this looks like a
> > real fix for a real problem.
>
> Agreed. Just yesterday I spent some time on the lumpy compaction
> changes after wondering about Michal's khugepaged 100% report, and I
> expected some fix was needed in this area (as I couldn't find any bug
> in khugepaged yet, so the lumpy compaction looked the next candidate
> for bugs).
>

Michal did report that disabling defrag did not help but the stack trace
also showed that it was stuck in shrink_zone() which is what Johannes'
patch targets. It's not unreasonable to test if Johannes' patch solves
Michal's problem. Michal, I know that your workload is a bit random and
may not be reproducible but do you think it'd be possible to determine
if Johannes' patch helps?

> I've also been wondering about the !nr_scanned check in
> should_continue_reclaim too but I didn't look too much into the caller
> (I was tempted to remove it altogether). I don't see how checking
> nr_scanned can be safe even after we fix the caller to avoid passing
> non-zero values if "goto restart".
>
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something, in my
> view. It looks bogus... So my proposal would be below.
>

We should not be ending up in a situation where the LRU list contains only
!page_evictable pages and that situation persists, causing excessive (or
infinite) looping. As unevictable pages are encountered on the LRU list,
they should be moved to the unevictable lists by putback_lru_page(). Are you
aware of a situation where this becomes broken?

I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
all get moved. In this case, nr_scanned is positive and we continue
to scan but this is expected and desirable: Reclaim/compaction needs more
pages to be freed before it starts compaction. If it stops scanning early,
then it would just fail the allocation later. This is what the "NOTE" is about.

I prefer Johannes' fix for the observed problem.

> ====
> Subject: mm: stop checking nr_scanned in should_continue_reclaim
>
> From: Andrea Arcangeli <[email protected]>
>
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 148c6e6..9741884 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1831,7 +1831,6 @@ out:
> */
> static inline bool should_continue_reclaim(struct zone *zone,
> unsigned long nr_reclaimed,
> - unsigned long nr_scanned,
> struct scan_control *sc)
> {
> unsigned long pages_for_compaction;
> @@ -1841,15 +1840,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
> if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
> return false;
>
> - /*
> - * If we failed to reclaim and have scanned the full list, stop.
> - * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
> - * faster but obviously would be less likely to succeed
> - * allocation. If this is desirable, use GFP_REPEAT to decide
> - * if both reclaimed and scanned should be checked or just
> - * reclaimed
> - */
> - if (!nr_reclaimed && !nr_scanned)
> + /* If we failed to reclaim stop. */
> + if (!nr_reclaimed)
> return false;
>
> /*
> @@ -1884,7 +1876,6 @@ static void shrink_zone(int priority, struct zone *zone,
> enum lru_list l;
> unsigned long nr_reclaimed;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> - unsigned long nr_scanned = sc->nr_scanned;
>
> restart:
> nr_reclaimed = 0;
> @@ -1923,8 +1914,7 @@ restart:
> shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>
> /* reclaim/compaction might need reclaim to continue */
> - if (should_continue_reclaim(zone, nr_reclaimed,
> - sc->nr_scanned - nr_scanned, sc))
> + if (should_continue_reclaim(zone, nr_reclaimed, sc))
> goto restart;
>
> throttle_vm_writeout(sc->gfp_mask);
>
>

--
Mel Gorman

2011-02-10 10:41:15

by Michal Hocko

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Thu 10-02-11 10:21:10, Mel Gorman wrote:
> On Wed, Feb 09, 2011 at 07:28:46PM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> > > On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > > > Hi,
> > > >
> > > > I think this should fix the problem of processes getting stuck in
> > > > reclaim that has been reported several times.
> > >
> > > I don't think it's the only source but I'm basing this on seeing
> > > constant looping in balance_pgdat() and calling congestion_wait() a few
> > > weeks ago that I haven't rechecked since. However, this looks like a
> > > real fix for a real problem.
> >
> > Agreed. Just yesterday I spent some time on the lumpy compaction
> > changes after wondering about Michal's khugepaged 100% report, and I
> > expected some fix was needed in this area (as I couldn't find any bug
> > in khugepaged yet, so the lumpy compaction looked the next candidate
> > for bugs).
> >
>
> Michal did report that disabling defrag did not help but the stack trace
> also showed that it was stuck in shrink_zone() which is what Johannes'
> patch targets. It's not unreasonable to test if Johannes' patch solves
> Michal's problem. Michal, I know that your workload is a bit random and
> may not be reproducible but do you think it'd be possible to determine
> if Johannes' patch helps?

Sure, I can test it. Nevertheless, I haven't seen the problem again. I
have tried to make some memory pressure on the machine but no "luck".

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-02-10 12:49:41

by Andrea Arcangeli

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Thu, Feb 10, 2011 at 10:21:10AM +0000, Mel Gorman wrote:
> We should not be ending up in a situation where the LRU list contains only
> !page_evictable pages and that situation persists, causing excessive (or
> infinite) looping. As unevictable pages are encountered on the LRU list,
> they should be moved to the unevictable lists by putback_lru_page(). Are you
> aware of a situation where this becomes broken?
>
> I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
> all get moved. In this case, nr_scanned is positive and we continue
> to scan but this is expected and desirable: Reclaim/compaction needs more
> pages to be freed before it starts compaction. If it stops scanning early,
> then it would just fail the allocation later. This is what the "NOTE" is about.
>
> I prefer Johannes' fix for the observed problem.

should_continue_reclaim is only needed for compaction. It tries to
free enough pages so that compaction can succeed in its defrag
attempt. So breaking the loop faster isn't going to cause failures for
0 order pages. My worry is that we loop too much in shrink_zone just
for compaction even when we don't do any progress. shrink_zone would
never scan more than SWAP_CLUSTER_MAX pages, before this change. Now
it can loop over the whole lru as long as we're scanning stuff. Ok to
overboost shrink_zone if we're making progress to allow compaction at
the next round, but if we don't visibly make progress, I'm concerned
that it may be too aggressive to scan the whole list. The performance
benefit of having a hugepage isn't as huge as scanning all pages in
the lru when before we would have broken the loop and declared failure
after only SWAP_CLUSTER_MAX pages, and then we would have fallen back
to an order-0 allocation. The fix may help of course, maybe it's enough
for his case I don't know, but I don't see it making a whole lot of
difference, except now it will stop when the lru is practically empty
which clearly is an improvement. I think we shouldn't be so worried
about succeeding compaction, the important thing is we don't waste
time in compaction if there's not enough free memory but
compaction_suitable used by both logics should be enough for that.

I'd rather prefer that if hugetlbfs has special needs it uses a __GFP_
flag or similar that increases how compaction is strict in succeeding,
up to scanning the whole lru in one go in order to make some free
memory for compaction to succeed.

Going ahead with the scan until compaction_suitable is true instead
makes sense when there's absence of memory pressure and nr_reclaimed
is never zero.

Maybe we should try a bit more times than just nr_reclaim but going
over the whole lru, sounds a bit extreme.

The issue isn't just for unevictable pages, that will be refiled
during the scan but it will also happen in presence of lots of
referenced pages. For example if we don't apply my fix, the current
code can take down all young bits in all ptes in one go in the whole
system before returning from shrink_zone, that is too much in my view,
and losing all that information in one go (not even to tell the cost
associated with losing it) can hardly be offseted by the improvement
given by 1 more hugepage.

But please let me know if I've misread something...

Thanks,
Andrea

2011-02-10 13:33:52

by Mel Gorman

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Thu, Feb 10, 2011 at 01:48:38PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2011 at 10:21:10AM +0000, Mel Gorman wrote:
> > We should not be ending up in a situation where the LRU list contains only
> > !page_evictable pages and that situation persists, causing excessive (or
> > infinite) looping. As unevictable pages are encountered on the LRU list,
> > they should be moved to the unevictable lists by putback_lru_page(). Are you
> > aware of a situation where this becomes broken?
> >
> > I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
> > all get moved. In this case, nr_scanned is positive and we continue
> > to scan but this is expected and desirable: Reclaim/compaction needs more
> > pages to be freed before it starts compaction. If it stops scanning early,
> > then it would just fail the allocation later. This is what the "NOTE" is about.
> >
> > I prefer Johannes' fix for the observed problem.
>
> should_continue_reclaim is only needed for compaction. It tries to
> free enough pages so that compaction can succeed in its defrag
> attempt.

Correct.

> So breaking the loop faster isn't going to cause failures for
> 0 order pages.

Also true, I commented on this in the "Note" your patch deletes, suggesting
that an alternative would be to break early unless GFP_REPEAT.

> My worry is that we loop too much in shrink_zone just
> for compaction even when we don't do any progress. shrink_zone would
> never scan more than SWAP_CLUSTER_MAX pages, before this change.

Sortof. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so
scanning was still pretty high. The other costs of lumpy reclaim would hide
it of course.

> Now
> it can loop over the whole lru as long as we're scanning stuff.

True, the alternative being failing the allocation. Returning sooner is of
course an option, but it would be preferable to see a case where the logic
after Johannes' patch is failing.

> Ok to
> overboost shrink_zone if we're making progress to allow compaction at
> the next round, but if we don't visibly make progress, I'm concerned
> that it may be too aggressive to scan the whole list. The performance
> benefit of having a hugepage isn't as huge as scanning all pages in
> the lru when before we would have broken the loop and declared failure
> after only SWAP_CLUSTER_MAX pages, and then we would have fallen back
> to an order-0 allocation.

What about other cases such as order-1 allocations for stack or order-3
allocations for those network cards using jumbo frames without
scatter/gather?

Don't get me wrong, I see your point but I'm wondering if there really are
cases where we routinely scan an entire LRU list of unevictable pages that
are somehow not being migrated properly to the unevictable lists. If
this is happening, we are also in trouble for reclaiming for order-0
pages, right?

> The fix may help of course, maybe it's enough
> for his case I don't know, but I don't see it making a whole lot of
> difference, except now it will stop when the lru is practically empty
> which clearly is an improvement. I think we shouldn't be so worried
> about succeeding compaction, the important thing is we don't waste
> time in compaction if there's not enough free memory but
> compaction_suitable used by both logics should be enough for that.
>
> I'd rather prefer that if hugetlbfs has special needs it uses a __GFP_

It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE"
as an alternative to how we could break early while still being aggressive
when required. The only reason it's not that way now is because a) I didn't
consider an LRU mostly full of unevictable pages to be the normal case and b)
for allocations such as order-3 that are preferable not to fail.

> flag or similar that increases how compaction is strict in succeeding,
> up to scanning the whole lru in one go in order to make some free
> memory for compaction to succeed.
>
> Going ahead with the scan until compaction_suitable is true instead
> makes sense when there's absence of memory pressure and nr_reclaimed
> is never zero.
>
> Maybe we should try a bit more times than just nr_reclaim but going
> over the whole lru, sounds a bit extreme.
>

Where should we draw the line? We could come up with a ratio of the lists
depending on priority but it'd be hard to measure the gain or loss
without having a profile of a problem case to look at.

> The issue isn't just for unevictable pages, that will be refiled
> during the scan but it will also happen in presence of lots of
> referenced pages. For example if we don't apply my fix, the current
> code can take down all young bits in all ptes in one go in the whole
> system before returning from shrink_zone, that is too much in my view,
> and losing all that information in one go (not to mention the cost
> associated with losing it) can hardly be offset by the improvement
> given by 1 more hugepage.
>
> But please let me know if I've misread something...
>

I don't think you have misread anything but if we're going to weaken
this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.
preserve being aggressive if set. I'm also not convinced we routinely get
into a situation where the LRU consists of almost all unevictable pages
and if we are in this situation, that is a serious problem on its own. It
would also be preferable if we could get latency figures on alloc_pages for
hugepage-sized allocations and a count of how many are succeeding or failing
to measure the impact (if any).

--
Mel Gorman
SUSE Labs

2011-02-10 14:15:24

by Andrea Arcangeli

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Thu, Feb 10, 2011 at 01:33:24PM +0000, Mel Gorman wrote:
> Also true, I commented on this in the "Note" your patch deletes, suggesting
> that an alternative would be to break early unless GFP_REPEAT.

Yep noticed that ;), doing that with __GFP_REPEAT sounds just fine to me.

> Sortof. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so
> scanning was still pretty high. The other costs of lumpy reclaim would hide
> it of course.

Ok but we know lumpy reclaim was not ok to start with.

> What about other cases such as order-1 allocations for stack or order-3
> allocations for those network cards using jumbo frames without
> scatter/gather?

stack order 1 is one of the few cases that come to mind where failing
an allocation becomes fatal. Maybe we should use __GFP_REPEAT there
too.

But we probably need a way to discriminate callers that can gracefully
fallback. I'd be extremely surprised if the cost of looping all over
the lru taking down all young bits could ever be offset by the jumbo
frame. In fact the jumbo frame I'm afraid might be better off without
using compaction at all because it's probably very latency
sensitive. We need a 'lowlatency' version of compaction for these
users where the improvement of having a compound page instead of a
regular page isn't very significant.

On a separate topic, I'm currently trying to use the new async
compaction code upstream with jumbo frames. I'm also wondering if I'll
have to set sync=0 by default unless __GFP_REPEAT is set. It seems
adding compaction to jumbo frames is increasing latency to certain
workloads in a measurable way. Things were fine when compaction was
only used by THP and not for all order allocations (but I didn't test
the async mode yet plus the other optimizations for compaction you did
recently, I hope they're enough for jumbo frames).

> Don't get me wrong, I see your point but I'm wondering if there really are
> cases where we routinely scan an entire LRU list of unevictable pages that
> are somehow not being migrated properly to the unevictable lists. If
> this is happening, we are also in trouble for reclaiming for order-0
> pages, right?

Well unevictable pages are just an example and like you said they last
only one round of the loop at most. But other caching bits like the
referenced bits and all young bits will get all taken down during all
later loops too. We definitely don't want to swap just to allow
compaction to succeed! I think this argument explains it pretty well,
if you take down all young bits in a constant loop, then the system may end
up swapping. That's definitely something we don't want.

Things may be different if this is a stack allocation without
fallback, or if it's hugetlbfs again without kernel fallback (only
userland fallback).

> It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE"
> as an alternative to how we could break early while still being aggressive
> when required. The only reason it's not that way now is because a) I didn't
> consider an LRU mostly full of unevictable pages to be the normal case and b)
> for allocations such as order-3 that are preferable not to fail.

Ok.

> Where should we draw the line? We could come up with a ratio of the lists
> depending on priority but it'd be hard to measure the gain or loss
> without having a profile of a problem case to look at.

I would just stick to !nr_reclaimed to break the loop, and ignore the
nr_scanned unless __GFP_REPEAT is set, in which case you're welcome to
scan everything.

Then we've to decide if to add __GFP_REPEAT to the stack allocation...

> I don't think you have misread anything but if we're going to weaken
> this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.

I see the point of __GFP_REPEAT, that sounds the best, I should have
just followed your comment but I felt scanning everything was too
heavyweight regardless, but ok I see you want as much accuracy as
possible in that case, even if that ends up in a swap storm.

> preserve being aggressive if set. I'm also not convinced we routinely get
> into a situation where the LRU consists of almost all unevictable pages
> and if we are in this situation, that is a serious problem on its own. It
> would also be preferable if we could get latency figures on alloc_pages for
> hugepage-sized allocations and a count of how many are succeeding or failing
> to measure the impact (if any).

I think I did not pick the best example by talking about unevictable pages. I
said that because the code is like below, but to me all the goto
paths following the !page_evictable check were also relevant for
this shrink_zone loop. The real life issue is avoiding a swap storm (or
an expensive loop flooding the whole system with IPIs to take down all young
bits in all ptes) just to allocate a hugepage or jumboframe that has a
graceful fallback that performs not hugely slower than the
hugepage/jumboframe.

sc->nr_scanned++;

if (unlikely(!page_evictable(page, NULL)))
goto cull_mlocked;

I think making the !nr_scanned check conditional on __GFP_REPEAT, as
the comment suggested, is for now the best way to go.

2011-02-10 14:58:42

by Mel Gorman

Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done

On Thu, Feb 10, 2011 at 03:14:47PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2011 at 01:33:24PM +0000, Mel Gorman wrote:
> > Also true, I commented on this in the "Note" your patch deletes, suggesting
> > that an alternative would be to break early unless GFP_REPEAT.
>
> Yep noticed that ;), doing that with __GFP_REPEAT sounds just fine to me.
>

Great.

> > Sortof. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so
> > scanning was still pretty high. The other costs of lumpy reclaim would hide
> > it of course.
>
> Ok but we know lumpy reclaim was not ok to start with.
>

Sure.

> > What about other cases such as order-1 allocations for stack or order-3
> > allocations for those network cards using jumbo frames without
> > scatter/gather?
>
> stack order 1 is one of the few cases that come to mind where failing
> an allocation becomes fatal. Maybe we should use __GFP_REPEAT there
> too.
>

Actually, there shouldn't be a need. Small allocations such as order-1
effectively loop indefinitely due to the check in should_alloc_retry().
This means that even if reclaim/compaction breaks earlier than it
should, it'll get tried again.
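
A much-simplified sketch of the kind of check being referred to; the exact
conditions are an assumption from memory of the allocator around this time,
not the verbatim should_alloc_retry():

/* sketch only: why small orders keep retrying in the allocator slow path */
static int should_alloc_retry_sketch(gfp_t gfp_mask, unsigned int order,
				     unsigned long pages_reclaimed)
{
	/* the caller explicitly asked not to retry */
	if (gfp_mask & __GFP_NORETRY)
		return 0;

	/* small orders (up to PAGE_ALLOC_COSTLY_ORDER) retry indefinitely */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

	/* costly orders retry only for __GFP_REPEAT while progress is plausible */
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1UL << order))
		return 1;

	return 0;
}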

> But we probably need a way to discriminate callers that can gracefully
> fallback. I'd be extremely surprised if the cost of looping all over
> the lru taking down all young bits could ever be offset by the jumbo
> frame. In fact the jumbo frame I'm afraid might be better off without
> using compaction at all because it's probably very latency
> sensitive.

It depends entirely on whether the jumbo frame can be received with
order-0 pages. If not, it means it's dropping packets, which has worse
latency.

> We need a 'lowlatency' version of compaction for these
> users where the improvement of having a compound page instead of a
> regular page isn't very significant.
>

It's not impossible to pass this information in once the cases where it
is required are identified.

> On a separate topic, I'm currently trying to use the new async
> compaction code upstream with jumbo frames. I'm also wondering if I'll
> have to set sync=0 by default unless __GFP_REPEAT is set. It seems
> adding compaction to jumbo frames is increasing latency to certain
> workloads in a measurable way.

This is interesting. Any profiles showing where the time is being spent?
In the event an order-3 allocation fails with the particular network
card, is it able to fall back to order-0 pages?

> Things were fine when compaction was
> only used by THP and not for all order allocations (but I didn't test
> the async mode yet plus the other optimizations for compaction you did
> recently, I hope they're enough for jumbo frames).
>

Wish I had your test rig :/

> > Don't get me wrong, I see your point but I'm wondering if there really are
> > cases where we routinely scan an entire LRU list of unevictable pages that
> > are somehow not being migrated properly to the unevictable lists. If
> > this is happening, we are also in trouble for reclaiming for order-0
> > pages, right?
>
> Well unevictable pages are just an example and like you said they last
> only one round of the loop at most. But other caching bits like the
> referenced bits and all young bits will get all taken down during all
> later loops too. We definitely don't want to swap just to allow
> compaction to succeed! I think this argument explains it pretty well,
> if you take down all young bits in a constant loop, then the system may end
> up swapping. That's definitely something we don't want.
>

Avoiding the clearing of young bits is a much stronger argument.

> Things may be different if this is a stack allocation without
> fallback, or if it's hugetlbfs again without kernel fallback (only
> userland fallback).
>
> > It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE"
> > as an alternative to how we could break early while still being aggressive
> > when required. The only reason it's not that way now is because a) I didn't
> > consider an LRU mostly full of unevictable pages to be the normal case and b)
> > for allocations such as order-3 that are preferable not to fail.
>
> Ok.
>
> > Where should we draw the line? We could come up with a ratio of the lists
> > depending on priority but it'd be hard to measure the gain or loss
> > without having a profile of a problem case to look at.
>
> I would just stick to !nr_reclaimed to break the loop, and ignore the
> nr_scanned unless __GFP_REPEAT is set, in which case you're welcome to
> scan everything.
>

Patch that should do this is below.

> Then we've to decide if to add __GFP_REPEAT to the stack allocation...
>

It shouldn't be necessary as the allocator will continue looping.

> > I don't think you have misread anything but if we're going to weaken
> > this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.
>
> I see the point of __GFP_REPEAT, that sounds the best, I should have
> just followed your comment but I felt scanning everything was too
> heavyweight regardless, but ok I see you want as much accuracy as
> possible in that case, even if that ends up in a swap storm.
>
> > preserve being aggressive if set. I'm also not convinced we routinely get
> > into a situation where the LRU consists of almost all unevictable pages
> > and if we are in this situation, that is a serious problem on its own. It
> > would also be preferable if we could get latency figures on alloc_pages for
> > hugepage-sized allocations and a count of how many are succeeding or failing
> > to measure the impact (if any).
>
> I think I did not pick the best example by talking about unevictable pages. I
> said that because the code is like below, but to me all the goto
> paths following the !page_evictable check were also relevant for
> this shrink_zone loop. The real life issue is avoiding a swap storm (or
> an expensive loop flooding the whole system with IPIs to take down all young
> bits in all ptes) just to allocate a hugepage or jumboframe that has a
> graceful fallback that performs not hugely slower than the
> hugepage/jumboframe.
>
> sc->nr_scanned++;
>
> if (unlikely(!page_evictable(page, NULL)))
> goto cull_mlocked;
>
> I think making the !nr_scanned check conditional on __GFP_REPEAT, as
> the comment suggested, is for now the best way to go.
>

Ok, here is a patch that should do that. This does *not* replace Johannes'
patch which I think should still be merged. However, I am unable to test this
at the moment. My laptop and test machines are 200KM away and inaccessible
until next Tuesday at the earliest. The machine I'm typing this mail from
is unsuitable for testing with. Are you in the position to test THP with it
applied for me please?

==== CUT HERE ====
mm: vmscan: Stop reclaim/compaction when reclaim is failing for !__GFP_REPEAT allocations

should_continue_reclaim() for reclaim/compaction allows scanning to continue
even if pages are not being reclaimed until the full list is scanned. In
terms of allocation success, this makes sense but potentially it introduces
unwanted latency for transparent hugepages and network jumbo frames that
would prefer to fail the allocation attempt and fallback to order-0 pages.
Worse, there is a potential that the full LRU scan will clear all the young
bits, distort page aging information and potentially push pages into swap
that would have otherwise remained resident.

This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
list may still be scanned.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 32 ++++++++++++++++++++++----------
1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..591b907 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1841,16 +1841,28 @@ static inline bool should_continue_reclaim(struct zone *zone,
if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
return false;

- /*
- * If we failed to reclaim and have scanned the full list, stop.
- * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
- * faster but obviously would be less likely to succeed
- * allocation. If this is desirable, use GFP_REPEAT to decide
- * if both reclaimed and scanned should be checked or just
- * reclaimed
- */
- if (!nr_reclaimed && !nr_scanned)
- return false;
+ /* Consider stopping depending on scan and reclaim activity */
+ if (sc->gfp_mask & __GFP_REPEAT) {
+ /*
+ * For GFP_REPEAT allocations, stop reclaiming if the
+ * full LRU list has been scanned and we are still failing
+ * to reclaim pages. This full LRU scan is potentially
+ * expensive but a GFP_REPEAT caller really wants to succeed
+ */
+ if (!nr_reclaimed && !nr_scanned)
+ return false;
+ } else {
+ /*
+ * For non-GFP_REPEAT allocations which can presumably
+ * fail without consequence, stop if we failed to reclaim
+ * any pages from the last SWAP_CLUSTER_MAX number of
+ * pages that were scanned. This will return to the
+ * caller faster at the risk that reclaim/compaction and
+ * the resulting allocation attempt will fail
+ */
+ if (!nr_reclaimed)
+ return false;
+ }

/*
* If we have not reclaimed enough pages for compaction and the

2011-02-16 09:51:22

by Mel Gorman

Subject: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

should_continue_reclaim() for reclaim/compaction allows scanning to continue
even if pages are not being reclaimed until the full list is scanned. In
terms of allocation success, this makes sense but potentially it introduces
unwanted latency for high-order allocations such as transparent hugepages
and network jumbo frames that would prefer to fail the allocation attempt
and fallback to order-0 pages. Worse, there is a potential that the full
LRU scan will clear all the young bits, distort page aging information and
potentially push pages into swap that would have otherwise remained resident.

This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
list may still be scanned.

To test this, a tool was developed based on ftrace that tracked the latency of
high-order allocations while transparent hugepage support was enabled and three
benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with Johannes's
patch "vmscan: fix zone shrinking exit when scan work is done" applied.

STREAM Highorder Allocation Latency Statistics
fix-infinite break-early
1 :: Count 10298 10229
1 :: Min 0.4560 0.4640
1 :: Mean 1.0589 1.0183
1 :: Max 14.5990 11.7510
1 :: Stddev 0.5208 0.4719
2 :: Count 2 1
2 :: Min 1.8610 3.7240
2 :: Mean 3.4325 3.7240
2 :: Max 5.0040 3.7240
2 :: Stddev 1.5715 0.0000
9 :: Count 111696 111694
9 :: Min 0.5230 0.4110
9 :: Mean 10.5831 10.5718
9 :: Max 38.4480 43.2900
9 :: Stddev 1.1147 1.1325

Mean time for order-1 allocations is reduced. order-2 looks increased
but with so few allocations, it's not particularly significant. THP mean
allocation latency is also reduced. That said, allocation time varies so
significantly that the reductions are within noise.

Max allocation time is reduced by a significant amount for low-order
allocations but increased for THP allocations, which presumably are now
breaking before reclaim has done enough work.

SysBench Highorder Allocation Latency Statistics
fix-infinite break-early
1 :: Count 15745 15677
1 :: Min 0.4250 0.4550
1 :: Mean 1.1023 1.0810
1 :: Max 14.4590 10.8220
1 :: Stddev 0.5117 0.5100
2 :: Count 1 1
2 :: Min 3.0040 2.1530
2 :: Mean 3.0040 2.1530
2 :: Max 3.0040 2.1530
2 :: Stddev 0.0000 0.0000
9 :: Count 2017 1931
9 :: Min 0.4980 0.7480
9 :: Mean 10.4717 10.3840
9 :: Max 24.9460 26.2500
9 :: Stddev 1.1726 1.1966

Again, mean time for order-1 allocations is reduced while order-2 allocations
are too few to draw conclusions from. The mean time for THP allocations is
also slightly reduced, although the reductions are within variance.

Once again, our maximum allocation time is significantly reduced for
low-order allocations and slightly increased for THP allocations.

Anon stream mmap reference Highorder Allocation Latency Statistics
1 :: Count 1376 1790
1 :: Min 0.4940 0.5010
1 :: Mean 1.0289 0.9732
1 :: Max 6.2670 4.2540
1 :: Stddev 0.4142 0.2785
2 :: Count 1 -
2 :: Min 1.9060 -
2 :: Mean 1.9060 -
2 :: Max 1.9060 -
2 :: Stddev 0.0000 -
9 :: Count 11266 11257
9 :: Min 0.4990 0.4940
9 :: Mean 27250.4669 24256.1919
9 :: Max 11439211.0000 6008885.0000
9 :: Stddev 226427.4624 186298.1430

This benchmark creates one thread per CPU which references an amount of
anonymous memory 1.5 times the size of physical RAM. This pounds swap quite
heavily and is intended to exercise THP a bit.

Mean allocation time for order-1 is reduced as before. It's also reduced
for THP allocations but the variations here are pretty massive due to swap.
As before, maximum allocation times are significantly reduced.

Overall, the patch reduces the mean and maximum allocation latencies for
the smaller high-order allocations. This was with SLAB configured so it
would be expected to be more significant with SLUB, which uses these
allocation sizes more aggressively.

The mean allocation times for THP allocations are also slightly reduced.
The maximum latency was slightly increased as predicted by the comments due
to reclaim/compaction breaking early. However, workloads care more about the
latency of lower-order allocations than THP so it's an acceptable trade-off.
Please consider merging for 2.6.38.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 32 ++++++++++++++++++++++----------
1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..591b907 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1841,16 +1841,28 @@ static inline bool should_continue_reclaim(struct zone *zone,
if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
return false;

- /*
- * If we failed to reclaim and have scanned the full list, stop.
- * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
- * faster but obviously would be less likely to succeed
- * allocation. If this is desirable, use GFP_REPEAT to decide
- * if both reclaimed and scanned should be checked or just
- * reclaimed
- */
- if (!nr_reclaimed && !nr_scanned)
- return false;
+ /* Consider stopping depending on scan and reclaim activity */
+ if (sc->gfp_mask & __GFP_REPEAT) {
+ /*
+ * For GFP_REPEAT allocations, stop reclaiming if the
+ * full LRU list has been scanned and we are still failing
+ * to reclaim pages. This full LRU scan is potentially
+ * expensive but a GFP_REPEAT caller really wants to succeed
+ */
+ if (!nr_reclaimed && !nr_scanned)
+ return false;
+ } else {
+ /*
+ * For non-GFP_REPEAT allocations which can presumably
+ * fail without consequence, stop if we failed to reclaim
+ * any pages from the last SWAP_CLUSTER_MAX number of
+ * pages that were scanned. This will return to the
+ * caller faster at the risk that reclaim/compaction and
+ * the resulting allocation attempt fails
+ */
+ if (!nr_reclaimed)
+ return false;
+ }

/*
* If we have not reclaimed enough pages for compaction and the

2011-02-16 10:14:08

by Andrea Arcangeli

Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On Wed, Feb 16, 2011 at 09:50:49AM +0000, Mel Gorman wrote:
> The mean allocation times for THP allocations are also slightly reduced.
> The maximum latency was slightly increased as predicted by the comments due
> to reclaim/compaction breaking early. However, workloads care more about the
> latency of lower-order allocations than THP so it's an acceptable trade-off.
> Please consider merging for 2.6.38.

Full agreement. I'm currently dealing with latency issues (nothing
major! but still not nice to see a reproducible regression, even if a
small one only visible in the benchmark) with compaction and jumbo
frames. This won't be enough to close them completely though because I
didn't backport the change to vmscan.c and should_continue_reclaim (I
backported all the other compaction improvements though, so this
practically is the only missing bit). I also suspected the e1000
driver, that sets the NAPI latency to bulk_latency when it uses jumbo
frames, so I wonder if it could be that with compaction we get more
jumbo frames and the latency then gets reduced by the driver as side
effect. Not sure yet.

I like the above because it's less likely to give us compaction issues
with jumbo frames when I add should_continue_reclaim on top. It seems
anonymous memory allocations are orders of magnitude longer-lived
than jumbo frames could ever be, so at this point I'm not even
entirely certain it's ok to enable compaction at all for jumbo
frames. But I still like the above regardless of my current issue
(just because of the young bits going nuked in one go the lumpy hammer
way, even if it actually increases latency a bit for THP allocations).

One issue with compaction for jumbo frames is the potentially very
long loop for the scan in isolate_migratepages. I added a counter to
break the loop after 1024 pages scanned. This is extreme but this is a
debug patch for now, I also did if (retval == bulk_latency) retval =
low_latency in the e1000* drivers to see if it makes a difference. If
any of the two will help I will track down how much each change
contributes to lowering the network latency to pre-compaction
levels. It may very well be only a compaction issue, or only a driver
issue, I don't know yet (the latter less likely because this very
compaction loop spikes at the top of oprofile output, but maybe that
only affects throughput and the driver is to blame for the latency
reduction... this is what I'm going to find pretty soon). Also this
isolate_migratepages loop I think needs a cond_resched() (I didn't add
that yet ;). 1024 pages scanned is too few, I just want to see how it
behaves with an extremely permissive setting. I'll let you know when I
come to some more reliable conclusion.
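
A rough sketch of the kind of debug change described above; the function name
and placement are assumptions, not the actual isolate_migratepages() code:

/* sketch only: cap the migrate scanner and yield the CPU periodically */
static unsigned long scan_pfn_range_capped(unsigned long low_pfn,
					   unsigned long end_pfn)
{
	unsigned long nr_checked = 0;

	for (; low_pfn < end_pfn; low_pfn++) {
		/* arbitrary debug cap (1024 in the mail) */
		if (++nr_checked > 1024)
			break;

		/* the cond_resched() mentioned above, every so often */
		if ((nr_checked % 64) == 0)
			cond_resched();

		/* ... the existing per-page isolation checks would go here ... */
	}

	return nr_checked;
}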

2011-02-16 11:23:04

by Mel Gorman

Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On Wed, Feb 16, 2011 at 11:13:01AM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 16, 2011 at 09:50:49AM +0000, Mel Gorman wrote:
> > The mean allocation times for THP allocations are also slightly reduced.
> > The maximum latency was slightly increased as predicted by the comments due
> > to reclaim/compaction breaking early. However, workloads care more about the
> > latency of lower-order allocations than THP so it's an acceptable trade-off.
> > Please consider merging for 2.6.38.
>
> Full agreement. I'm currently dealing with latency issues (nothing
> major! but still not nice to see a reproducible regression, even if a
> small one only visible in the benchmark) with compaction and jumbo
> frames.

Out of curiosity, what are you measuring the latency of and how? I used
a combination of the function_graph ftrace analyser and the mm_page_alloc
tracepoint myself to avoid any additional patching and it was easier than
cobbling together something with kprobes. A perl script configures ftrace
and then parses the contents of trace_pipe - crude but does the job without
patching the kernel.
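
The perl tooling itself is not shown in the thread. As a rough illustration of
the plumbing it builds on, here is a hedged C sketch that enables the
mm_page_alloc tracepoint and streams trace_pipe, printing only the high-order
allocations; the paths assume debugfs is mounted at /sys/kernel/debug, and the
function_graph latency measurement is omitted:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[1024];
	FILE *f;

	/* enable the mm_page_alloc tracepoint (kmem event group) */
	f = fopen("/sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable", "w");
	if (!f)
		return 1;
	fputs("1", f);
	fclose(f);

	/* stream trace events and keep only order > 0 allocations */
	f = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		char *order = strstr(line, "order=");

		if (order && order[6] != '0')
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}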

> This won't be enough to close them completely though because I
> didn't backport the change to vmscan.c and should_continue_reclaim (I
> backported all the other compaction improvements though, so this
> practically is the only missing bit).

How big are the discrepancies?

> I also suspected the e1000
> driver, that sets the NAPI latency to bulk_latency when it uses jumbo
> frames, so I wonder if it could be that with compaction we get more
> jumbo frames and the latency then gets reduced by the driver as side
> effect. Not sure yet.
>

No idea.

> I like the above because it's less likely to give us compaction issues
> with jumbo frames when I add should_continue_reclaim on top. It seems
> anonymous memory allocations are orders of magnitude longer-lived
> than jumbo frames could ever be, so at this point I'm not even
> entirely certain it's ok to enable compaction at all for jumbo
> frames. But I still like the above regardless of my current issue
> (just because of the young bits going nuked in one go the lumpy hammer
> way, even if it actually increases latency a bit for THP allocations).
>

Can I have your ack on the patch then? Even if it doesn't resolve the
jumbo frame problems, it's in the right direction. Measuring how it
currently behaves and what direction should be taken may be something
still worth discussing at LSF/MM.

> One issue with compaction for jumbo frames is the potentially very
> long loop for the scan in isolate_migratepages.

Yes, the scanner is poor. The scanner for free pages is potentially just
as bad. I prototyped some designs that should have been faster but they
didn't make any significant difference so they got discarded.

> I added a counter to
> break the loop after 1024 pages scanned. This is extreme but this is a
> debug patch for now, I also did if (retval == bulk_latency) retval =
> low_latency in the e1000* drivers to see if it makes a difference. If
> any of the two will help I will track down how much each change
> contributes to lowering the network latency to pre-compaction
> levels. It may very well be only a compaction issue, or only a driver
> issue, I don't know yet (the latter less likely because this very
> compaction loop spikes at the top of oprofile output, but maybe that
> only affects throughput and the driver is to blame for the latency
> reduction... this is what I'm going to find pretty soon). Also this
> isolate_migratepages loop I think needs a cond_resched()

This surprises me. In my own tests at least, the compaction stuff was
way down in the profile and I wouldn't have expected scanning to take so
long as to require a cond_resched. I was depending on the cond_resched()
in migrate_pages() to yield the processor as necessary.

> (I didn't add
> that yet ;). 1024 pages scanned is too few, I just want to see how it
> behaves with an extremely permissive setting. I'll let you know when I
> come to some more reliable conclusion.
>

Thanks.

--
Mel Gorman
SUSE Labs

2011-02-16 12:04:50

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On Wed, Feb 16, 2011 at 09:50:49AM +0000, Mel Gorman wrote:
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 32 ++++++++++++++++++++++----------
> 1 files changed, 22 insertions(+), 10 deletions(-)

Acked-by: Andrea Arcangeli <[email protected]>

2011-02-16 12:12:26

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On 02/16/2011 04:50 AM, Mel Gorman wrote:

> This patch will stop reclaim/compaction if no pages were reclaimed in the
> last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
> hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
> list may still be scanned.

> Signed-off-by: Mel Gorman<[email protected]>

Acked-by: Rik van Riel <[email protected]>

2011-02-16 12:39:41

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On Wed, Feb 16, 2011 at 09:50:49AM +0000, Mel Gorman wrote:
> should_continue_reclaim() for reclaim/compaction allows scanning to continue
> even if pages are not being reclaimed until the full list is scanned. In
> terms of allocation success, this makes sense but potentially it introduces
> unwanted latency for high-order allocations such as transparent hugepages
> and network jumbo frames that would prefer to fail the allocation attempt
> and fallback to order-0 pages. Worse, there is a potential that the full
> LRU scan will clear all the young bits, distort page aging information and
> potentially push pages into swap that would have otherwise remained resident.
>
> This patch will stop reclaim/compaction if no pages were reclaimed in the
> last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
> hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
> list may still be scanned.
>
> To test this, a tool was developed based on ftrace that tracked the latency of
> high-order allocations while transparent hugepage support was enabled and three
> benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with Johannes's
> patch "vmscan: fix zone shrinking exit when scan work is done" applied.
>
> STREAM Highorder Allocation Latency Statistics
>               fix-infinite     break-early
> 1 :: Count            10298           10229
> 1 :: Min             0.4560          0.4640
> 1 :: Mean            1.0589          1.0183
> 1 :: Max            14.5990         11.7510
> 1 :: Stddev          0.5208          0.4719
> 2 :: Count                2               1
> 2 :: Min             1.8610          3.7240
> 2 :: Mean            3.4325          3.7240
> 2 :: Max             5.0040          3.7240
> 2 :: Stddev          1.5715          0.0000
> 9 :: Count           111696          111694
> 9 :: Min             0.5230          0.4110
> 9 :: Mean           10.5831         10.5718
> 9 :: Max            38.4480         43.2900
> 9 :: Stddev          1.1147          1.1325
>
> Mean time for order-1 allocations is reduced. order-2 looks increased
> but with so few allocations, it's not particularly significant. THP mean
> allocation latency is also reduced. That said, allocation time varies so
> significantly that the reductions are within noise.
>
> Max allocation time is reduced by a significant amount for low-order
> allocations but increased for THP allocations which presumably are now
> breaking before reclaim has done enough work.
>
> SysBench Highorder Allocation Latency Statistics
>               fix-infinite     break-early
> 1 :: Count            15745           15677
> 1 :: Min             0.4250          0.4550
> 1 :: Mean            1.1023          1.0810
> 1 :: Max            14.4590         10.8220
> 1 :: Stddev          0.5117          0.5100
> 2 :: Count                1               1
> 2 :: Min             3.0040          2.1530
> 2 :: Mean            3.0040          2.1530
> 2 :: Max             3.0040          2.1530
> 2 :: Stddev          0.0000          0.0000
> 9 :: Count             2017            1931
> 9 :: Min             0.4980          0.7480
> 9 :: Mean           10.4717         10.3840
> 9 :: Max            24.9460         26.2500
> 9 :: Stddev          1.1726          1.1966
>
> Again, mean time for order-1 allocations is reduced while order-2 allocations
> are too few to draw conclusions from. The mean time for THP allocations is
> also slightly reduced albeit the reductions are within variances.
>
> Once again, our maximum allocation time is significantly reduced for
> low-order allocations and slightly increased for THP allocations.
>
> Anon stream mmap reference Highorder Allocation Latency Statistics
> 1 :: Count             1376            1790
> 1 :: Min             0.4940          0.5010
> 1 :: Mean            1.0289          0.9732
> 1 :: Max             6.2670          4.2540
> 1 :: Stddev          0.4142          0.2785
> 2 :: Count                1               -
> 2 :: Min             1.9060               -
> 2 :: Mean            1.9060               -
> 2 :: Max             1.9060               -
> 2 :: Stddev          0.0000               -
> 9 :: Count            11266           11257
> 9 :: Min             0.4990          0.4940
> 9 :: Mean        27250.4669      24256.1919
> 9 :: Max      11439211.0000    6008885.0000
> 9 :: Stddev     226427.4624     186298.1430
>
> This benchmark creates one thread per CPU which references an amount of
> anonymous memory 1.5 times the size of physical RAM. This pounds swap quite
> heavily and is intended to exercise THP a bit.
>
> Mean allocation time for order-1 is reduced as before. It's also reduced
> for THP allocations but the variations here are pretty massive due to swap.
> As before, maximum allocation times are significantly reduced.
>
> Overall, the patch reduces the mean and maximum allocation latencies for
> the smaller high-order allocations. This was with Slab configured so it
> would be expected to be more significant with Slub which uses these size
> allocations more aggressively.
>
> The mean allocation times for THP allocations are also slightly reduced.
> The maximum latency was slightly increased as predicted by the comments due
> to reclaim/compaction breaking early. However, workloads care more about the
> latency of lower-order allocations than THP so it's an acceptable trade-off.
> Please consider merging for 2.6.38.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2011-02-16 14:45:12

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On Wed, Feb 16, 2011 at 11:22:32AM +0000, Mel Gorman wrote:
> Out of curiosity, what are you measuring the latency of, and how? I used
> a combination of the function_graph ftrace analyser and the mm_page_alloc
> tracepoint myself to avoid any additional patching and it was easier than
> cobbling together something with kprobes. A perl script configures ftrace
> and then parses the contents of trace_pipe - crude but does the job without
> patching the kernel.

It's some complex benchmark that measures the latency from userland; I
think the latency is measured from the clients (not from the server
running compaction).

> How big are the discrepancies?

Latency in msec/op goes up from 1.1 to 5.4 starting from half the peak
load. But then latency stays flat with compaction, and eventually the
peak-load latency is similar. It just jumps from 1.1 to 5.4 in the
middle, and it's slightly higher even for the light-load runs.

> No idea.

I guess it's very hard to tell unless we try. I just nuked bulk_latency
for the jumbo frames and forced the driver to always stay in low_latency
mode (in the NAPI ->poll method of the driver), just in case it's not
compaction to blame but rather a side effect of compaction providing
jumbo frames much more frequently to the driver.
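
Concretely, the driver-side hack amounts to no more than something like
this at the end of the interrupt throttling update (illustrative only,
using the names mentioned above, not the exact patch):

	/* debug hack: never drop back to bulk_latency for jumbo traffic */
	if (retval == bulk_latency)
		retval = low_latency;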

> Can I have your ack on the patch then? Even if it doesn't resolve the

Sure, I acked it explicitly in a separate email ;).

> jumbo frame problems, it's in the right direction. Measuring how it
> currently behaves and what direction should be taken may be something
> still worth discussing at LSF/MM.

Agreed!

> > One issue with compaction for jumbo frames, is the potentially very
> > long loop, for the scan in isolated_migratepages.
>
> Yes, the scanner is poor. The scanner for free pages is potentially just
> as bad. I prototyped some designs that should have been faster but they
> didn't make any significant difference so they got discarded.

But the scanner for free pages has a nr_scanned countdown and breaks the
loop way sooner. Also, most of the >order-0 allocations must have a
fallback, so scanning everything so that an order-0 allocation succeeds
is much more obviously safe than scanning everything to provide an
order-2 allocation when the order-0 allocation could be satisfied
immediately without scanning anything. It's not a trivial problem when
we deal with short-lived allocations. Also, the throughput is equal or a
little higher (not necessarily related to compaction though); the
latency is the real measurable regression.

> This surprises me. In my own tests at least, the compaction stuff was
> way down in the profile and I wouldn't have expected scanning to take so
> long as to require a cond_resched. I was depending on the cond_resched()
> in migrate_pages() to yield the processor as necessary.

If migrate_pages runs often, we likely won't need to scan too many pages
in the first place. I think cond_resched() is a good idea in that loop
considering the current possible worst case.
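
The shape of the debug change is roughly the following (a simplified
sketch, not the actual isolate_migratepages() code; the 1024 cut-off is
arbitrary as said):

	unsigned long nr_scanned = 0;

	for (; low_pfn < end_pfn; low_pfn++) {
		/* debug cut-off: give up after 1024 pages scanned */
		if (++nr_scanned > 1024)
			break;

		/* the scan can be long, let the scheduler run */
		if (!(nr_scanned % SWAP_CLUSTER_MAX))
			cond_resched();

		/* ... existing isolation checks and list handling ... */
	}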

This is the profiling. This is with basically the 2.6.37 compaction
code, so compaction is only enabled for THP-sized allocations, not for
order <= PAGE_ALLOC_COSTLY_ORDER and not for kswapd.

    Samples  % of Total  Cum. Samples  Cum. % of Total  module:function
-------------------------------------------------------------------------------------------------
     177786       6.178        177786            6.178  sunrpc:svc_recv
     128779       4.475        306565           10.654  sunrpc:svc_xprt_enqueue
      80786       2.807        387351           13.462  vmlinux:__d_lookup
      62272       2.164        449623           15.626  ext4:ext4_htree_store_dirent
      55896       1.942        505519           17.569  jbd2:journal_clean_one_cp_list
      43868       1.524        549387           19.093  vmlinux:task_rq_lock
      43572       1.514        592959           20.608  vmlinux:kfree
      37620       1.307        630579           21.915  vmlinux:mwait_idle
      36169       1.257        666748           23.172  vmlinux:schedule
      34037       1.182        700785           24.355  e1000:e1000_clean
      31945       1.110        732730           25.465  vmlinux:find_busiest_group
      31491       1.094        764221           26.560  qla2xxx:qla24xx_intr_handler
      30681       1.066        794902           27.626  vmlinux:_atomic_dec_and_lock
       7425       0.258        xxxxxx           xxxxxx  vmlinux:get_page_from_freelist

This is with the 2.6.38 compaction code enabled for all allocation
orders, in both direct compaction and kswapd (it includes async
compaction/migrate and the preferred pageblock selection in !cc->sync
mode). It basically only doesn't include the should_continue_reclaim
loop, as that could only potentially increase the latency even further,
so I skipped it for now (I'll add it later with your new __GFP_RECLAIM
patch).

    Samples  % of Total  Cum. Samples  Cum. % of Total  module:function
-------------------------------------------------------------------------------------------------
    1182928      17.358       1182928           17.358  vmlinux:get_page_from_freelist
     657802       9.652       1840730           27.011  vmlinux:free_pcppages_bulk
     579976       8.510       2420706           35.522  sunrpc:svc_xprt_enqueue
     508953       7.468       2929659           42.991  sunrpc:svc_recv
     490538       7.198       3420197           50.189  vmlinux:compaction_alloc
     188620       2.767       3608817           52.957  vmlinux:tg_shares_up
      97527       1.431       3706344           54.388  vmlinux:__d_lookup
      85670       1.257       3792014           55.646  jbd2:journal_clean_one_cp_list
      71738       1.052       3863752           56.698  vmlinux:mutex_spin_on_owner
      71037       1.042       3934789           57.741  vmlinux:kfree

Basically it was my patch enabling compaction for all allocation orders,
and in kswapd as well, that started this, but I think I only exposed the
problem: if jumbo frames were order 4 instead of order 1/2/3, it'd
happen regardless of my patch. Later I'm also going to check whether
it's the kswapd invocation that causes the problem (by trying with only
direct compaction), but I doubt it'll help.

2011-02-16 23:26:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On Wed, Feb 16, 2011 at 6:50 PM, Mel Gorman <[email protected]> wrote:
> should_continue_reclaim() for reclaim/compaction allows scanning to continue
> even if pages are not being reclaimed until the full list is scanned. In
> terms of allocation success, this makes sense but potentially it introduces
> unwanted latency for high-order allocations such as transparent hugepages
> and network jumbo frames that would prefer to fail the allocation attempt
> and fallback to order-0 pages.  Worse, there is a potential that the full
> LRU scan will clear all the young bits, distort page aging information and
> potentially push pages into swap that would have otherwise remained resident.
>
> This patch will stop reclaim/compaction if no pages were reclaimed in the
> last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
> hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
> list may still be scanned.
>
> To test this, a tool was developed based on ftrace that tracked the latency of
> high-order allocations while transparent hugepage support was enabled and three
> benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with Johannes's
> patch "vmscan: fix zone shrinking exit when scan work is done" applied.
>
> STREAM Highorder Allocation Latency Statistics
>               fix-infinite     break-early
> 1 :: Count            10298           10229
> 1 :: Min             0.4560          0.4640
> 1 :: Mean            1.0589          1.0183
> 1 :: Max            14.5990         11.7510
> 1 :: Stddev          0.5208          0.4719
> 2 :: Count                2               1
> 2 :: Min             1.8610          3.7240
> 2 :: Mean            3.4325          3.7240
> 2 :: Max             5.0040          3.7240
> 2 :: Stddev          1.5715          0.0000
> 9 :: Count           111696          111694
> 9 :: Min             0.5230          0.4110
> 9 :: Mean           10.5831         10.5718
> 9 :: Max            38.4480         43.2900
> 9 :: Stddev          1.1147          1.1325
>
> Mean time for order-1 allocations is reduced. order-2 looks increased
> but with so few allocations, it's not particularly significant. THP mean
> allocation latency is also reduced. That said, allocation time varies so
> significantly that the reductions are within noise.
>
> Max allocation time is reduced by a significant amount for low-order
> allocations but increased for THP allocations which presumably are now
> breaking before reclaim has done enough work.
>
> SysBench Highorder Allocation Latency Statistics
>               fix-infinite     break-early
> 1 :: Count            15745           15677
> 1 :: Min             0.4250          0.4550
> 1 :: Mean            1.1023          1.0810
> 1 :: Max            14.4590         10.8220
> 1 :: Stddev          0.5117          0.5100
> 2 :: Count                1               1
> 2 :: Min             3.0040          2.1530
> 2 :: Mean            3.0040          2.1530
> 2 :: Max             3.0040          2.1530
> 2 :: Stddev          0.0000          0.0000
> 9 :: Count             2017            1931
> 9 :: Min             0.4980          0.7480
> 9 :: Mean           10.4717         10.3840
> 9 :: Max            24.9460         26.2500
> 9 :: Stddev          1.1726          1.1966
>
> Again, mean time for order-1 allocations is reduced while order-2 allocations
> are too few to draw conclusions from. The mean time for THP allocations is
> also slightly reduced albeit the reductions are within variances.
>
> Once again, our maximum allocation time is significantly reduced for
> low-order allocations and slightly increased for THP allocations.
>
> Anon stream mmap reference Highorder Allocation Latency Statistics
> 1 :: Count             1376            1790
> 1 :: Min             0.4940          0.5010
> 1 :: Mean            1.0289          0.9732
> 1 :: Max             6.2670          4.2540
> 1 :: Stddev          0.4142          0.2785
> 2 :: Count                1               -
> 2 :: Min             1.9060               -
> 2 :: Mean            1.9060               -
> 2 :: Max             1.9060               -
> 2 :: Stddev          0.0000               -
> 9 :: Count            11266           11257
> 9 :: Min             0.4990          0.4940
> 9 :: Mean        27250.4669      24256.1919
> 9 :: Max      11439211.0000    6008885.0000
> 9 :: Stddev     226427.4624     186298.1430
>
> This benchmark creates one thread per CPU which references an amount of
> anonymous memory 1.5 times the size of physical RAM. This pounds swap quite
> heavily and is intended to exercise THP a bit.
>
> Mean allocation time for order-1 is reduced as before. It's also reduced
> for THP allocations but the variations here are pretty massive due to swap.
> As before, maximum allocation times are significantly reduced.
>
> Overall, the patch reduces the mean and maximum allocation latencies for
> the smaller high-order allocations. This was with Slab configured so it
> would be expected to be more significant with Slub which uses these size
> allocations more aggressively.
>
> The mean allocation times for THP allocations are also slightly reduced.
> The maximum latency was slightly increased as predicted by the comments due
> to reclaim/compaction breaking early. However, workloads care more about the
> latency of lower-order allocations than THP so it's an acceptable trade-off.
> Please consider merging for 2.6.38.
>
> Signed-off-by: Mel Gorman <[email protected]>

> ---
>  mm/vmscan.c |   32 ++++++++++++++++++++++----------
>  1 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 148c6e6..591b907 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1841,16 +1841,28 @@ static inline bool should_continue_reclaim(struct zone *zone,
>        if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
>                return false;
>
> -       /*
> -        * If we failed to reclaim and have scanned the full list, stop.
> -        * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
> -        *       faster but obviously would be less likely to succeed
> -        *       allocation. If this is desirable, use GFP_REPEAT to decide

Typo. __GFP_REPEAT

Otherwise, looks good to me.
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-02-17 22:23:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On Wed, 16 Feb 2011 09:50:49 +0000
Mel Gorman <[email protected]> wrote:

> should_continue_reclaim() for reclaim/compaction allows scanning to continue
> even if pages are not being reclaimed until the full list is scanned. In
> terms of allocation success, this makes sense but potentially it introduces
> unwanted latency for high-order allocations such as transparent hugepages
> and network jumbo frames that would prefer to fail the allocation attempt
> and fallback to order-0 pages. Worse, there is a potential that the full
> LRU scan will clear all the young bits, distort page aging information and
> potentially push pages into swap that would have otherwise remained resident.

afaict the patch affects order-0 allocations as well. What are the
implications of this?

Also, what might be the downsides of this change, and did you test for
them?

> This patch will stop reclaim/compaction if no pages were reclaimed in the
> last SWAP_CLUSTER_MAX pages that were considered.

a) Why SWAP_CLUSTER_MAX? Is (SWAP_CLUSTER_MAX+7) better or worse?

b) The sentence doesn't seem even vaguely accurate. shrink_zone()
will scan vastly more than SWAP_CLUSTER_MAX pages before calling
should_continue_reclaim(). Confused.

c) The patch doesn't "stop reclaim/compaction" fully. It stops it
against one zone. reclaim will then advance on to any other
eligible zones.

2011-02-18 12:22:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT

On Thu, Feb 17, 2011 at 02:22:09PM -0800, Andrew Morton wrote:
> On Wed, 16 Feb 2011 09:50:49 +0000
> Mel Gorman <[email protected]> wrote:
>
> > should_continue_reclaim() for reclaim/compaction allows scanning to continue
> > even if pages are not being reclaimed until the full list is scanned. In
> > terms of allocation success, this makes sense but potentially it introduces
> > unwanted latency for high-order allocations such as transparent hugepages
> > and network jumbo frames that would prefer to fail the allocation attempt
> > and fallback to order-0 pages. Worse, there is a potential that the full
> > LRU scan will clear all the young bits, distort page aging information and
> > potentially push pages into swap that would have otherwise remained resident.
>
> afaict the patch affects order-0 allocations as well. What are the
> implications of this?
>

order-0 allocations should not be affected because RECLAIM_MODE_COMPACTION
is not set for them, so the following check means the gfp_mask is never
examined:

	if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
		return false;

> Also, what might be the downsides of this change, and did you test for
> them?
>

The main downside that I predict is that the worst-case latencies for
successful transparent hugepage allocations will be increased as there will
be more looping in do_try_to_free_pages() at higher priorities. I would also
not be surprised if there were fewer successful allocations.

Latencies did seem to be worse for order-9 allocations in testing but it was
offset by lower latencies for lower orders and seemed an acceptable trade-off.

Other major consequences did not spring to mind.

> > This patch will stop reclaim/compaction if no pages were reclaimed in the
> > last SWAP_CLUSTER_MAX pages that were considered.
>
> a) Why SWAP_CLUSTER_MAX? Is (SWAP_CLUSTER_MAX+7) better or worse?
>

SWAP_CLUSTER_MAX is the standard "unit of reclaim" and that's what I had
in mind when writing the comment but it's wrong and misleading. More on
this below.

> b) The sentence doesn't seem even vaguely accurate. shrink_zone()
> will scan vastly more than SWAP_CLUSTER_MAX pages before calling
> should_continue_reclaim(). Confused.
>
> c) The patch doesn't "stop reclaim/compaction" fully. It stops it
> against one zone. reclaim will then advance on to any other
> eligible zones.

You're right on both counts and this comment is inaccurate. It should
have read;

This patch will stop reclaim/compaction for the current zone in shrink_zone()
if there were no pages reclaimed in the last batch of scanning at the
current priority. For allocations such as hugetlbfs that use __GFP_REPEAT
and have fewer fallback options, the full LRU list may still be scanned.

The comment in the code itself then becomes

+ /*
+ * For non-__GFP_REPEAT allocations which can presumably
+ * fail without consequence, stop if we failed to reclaim
+ * any pages from the last batch of pages that were scanned.
+ * This will return to the caller faster at the risk that
+ * reclaim/compaction and the resulting allocation attempt
+ * fails
+ */
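
In context, the check in should_continue_reclaim() then looks roughly
like this (a sketch of the resulting logic rather than the exact hunk):

	/* Consider stopping depending on scan and reclaim activity */
	if (sc->gfp_mask & __GFP_REPEAT) {
		/*
		 * __GFP_REPEAT callers have few fallback options, so
		 * allow the full LRU list to be scanned before giving up.
		 */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/*
		 * For non-__GFP_REPEAT allocations which can presumably
		 * fail without consequence, stop if we failed to reclaim
		 * any pages from the last batch of pages that were scanned.
		 */
		if (!nr_reclaimed)
			return false;
	}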

--
Mel Gorman
SUSE Labs