LinuxLists.cc - kswapd craziness in 3.7

2012-11-27 20:49:42

Subject: kswapd craziness in 3.7

Hi everyone,

I hope I included everybody that participated in the various threads
on kswapd getting stuck / exhibiting high CPU usage. We were looking
at at least three root causes as far as I can see, so it's not really
clear who observed which problem. Please correct me if the
reported-by, tested-by, bisected-by tags are incomplete.

One problem was, as it seems, overly aggressive reclaim due to scaling
up reclaim goals based on compaction failures. This one was reverted
in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
reclaim/compaction based on failures".

Another one was an accounting problem where a freed higher order page
was underreported, and so kswapd had trouble restoring watermarks.
This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
(appears like memory leak).

The third one is a problem with small zones, like the DMA zone, where
the high watermark is lower than the low watermark plus compaction gap
(2 * allocation size). The zonelist reclaim in kswapd would do
nothing because all high watermarks are met, but the compaction logic
would find its own requirements unmet and loop over the zones again.
Indefinitely, until some third party would free enough memory to help
meet the higher compaction watermark. The problematic code has been
there since the 3.4 merge window for non-THP higher order allocations
but has been more prominent since the 3.7 merge window, where kswapd
is also woken up for the much more common THP allocations.

The following patch should fix the third issue by making both reclaim
and compaction code in kswapd use the same predicate to determine
whether a zone is balanced or not.

Hopefully, the sum of all three fixes should tame kswapd enough for
3.7.

Johannes

2012-11-27 20:49:50

by Johannes Weiner

[permalink] [raw]

Subject: [patch] mm: vmscan: fix kswapd endless loop on higher order allocation

Kswapd does not in all places have the same criteria for a balanced
zone. Zones are only being reclaimed when their high watermark is
breached, but compaction checks loop over the zonelist again when the
zone does not meet the low watermark plus two times the size of the
allocation. This gets kswapd stuck in an endless loop over a small
zone, like the DMA zone, where the high watermark is smaller than the
compaction requirement.

Add a function, zone_balanced(), that checks the watermark, and, for
higher order allocations, if compaction has enough free memory. Then
use it uniformly to check for balanced zones.

This makes sure that when the compaction watermark is not met, at
least reclaim happens and progress is made - or the zone is declared
unreclaimable at some point and skipped entirely.

Reported-by: George Spelvin <[email protected]>
Reported-by: Johannes Hirte <[email protected]>
Reported-by: Tomas Racek <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Tested-by: Johannes Hirte <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/vmscan.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..3b0aef4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2397,6 +2397,19 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
} while (memcg);
}

+static bool zone_balanced(struct zone *zone, int order,
+ unsigned long balance_gap, int classzone_idx)
+{
+ if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
+ balance_gap, classzone_idx, 0))
+ return false;
+
+ if (COMPACTION_BUILD && order && !compaction_suitable(zone, order))
+ return false;
+
+ return true;
+}
+
/*
* pgdat_balanced is used when checking if a node is balanced for high-order
* allocations. Only zones that meet watermarks and are in a zone allowed
@@ -2475,8 +2488,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
continue;
}

- if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
- i, 0))
+ if (!zone_balanced(zone, order, 0, i))
all_zones_ok = false;
else
balanced += zone->present_pages;
@@ -2585,8 +2597,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
break;
}

- if (!zone_watermark_ok_safe(zone, order,
- high_wmark_pages(zone), 0, 0)) {
+ if (!zone_balanced(zone, order, 0, 0)) {
end_zone = i;
break;
} else {
@@ -2662,9 +2673,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
testorder = 0;

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
- !zone_watermark_ok_safe(zone, testorder,
- high_wmark_pages(zone) + balance_gap,
- end_zone, 0)) {
+ !zone_balanced(zone, testorder,
+ balance_gap, end_zone)) {
shrink_zone(zone, &sc);

reclaim_state->reclaimed_slab = 0;
@@ -2691,8 +2701,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
continue;
}

- if (!zone_watermark_ok_safe(zone, testorder,
- high_wmark_pages(zone), end_zone, 0)) {
+ if (!zone_balanced(zone, testorder, 0, end_zone)) {
all_zones_ok = 0;
/*
* We are still under min water mark. This
--
1.7.11.7

2012-11-27 20:58:41

by Linus Torvalds

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

Note that in the meantime, I've also applied (through Andrew) the
patch that reverts commit c654345924f7 (see commit 82b212f40059
'Revert "mm: remove __GFP_NO_KSWAPD"').

I wonder if that revert may be bogus, and a result of this same issue.
Maybe that revert should be reverted, and replaced with your patch?

Mel? Zdenek? What's the status here?

Linus

On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner <[email protected]> wrote:
> Hi everyone,
>
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage. We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem. Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.
>
> One problem was, as it seems, overly aggressive reclaim due to scaling
> up reclaim goals based on compaction failures. This one was reverted
> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures".
>
> Another one was an accounting problem where a freed higher order page
> was underreported, and so kswapd had trouble restoring watermarks.
> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
> (appears like memory leak).
>
> The third one is a problem with small zones, like the DMA zone, where
> the high watermark is lower than the low watermark plus compaction gap
> (2 * allocation size). The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark. The problematic code has been
> there since the 3.4 merge window for non-THP higher order allocations
> but has been more prominent since the 3.7 merge window, where kswapd
> is also woken up for the much more common THP allocations.
>
> The following patch should fix the third issue by making both reclaim
> and compaction code in kswapd use the same predicate to determine
> whether a zone is balanced or not.
>
> Hopefully, the sum of all three fixes should tame kswapd enough for
> 3.7.
>
> Johannes
>

2012-11-27 21:18:42

by Rik van Riel

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?
>
> Mel? Zdenek? What's the status here?

Mel posted several patches to fix the kswapd issue. This one is
slightly more risky than the outright revert, but probably preferred
from a performance point of view:

https://lkml.org/lkml/2012/11/12/151

It works by skipping the kswapd wakeup for THP allocations, only
if compaction is deferred or contended.

--
All rights reversed

2012-11-27 21:30:13

by Johannes Weiner

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Tue, Nov 27, 2012 at 12:58:18PM -0800, Linus Torvalds wrote:
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?

The __GFP_NO_KSWAPD removal woke kswapd for THP reclaim and so it
exposed all these bugs that accumulated in there when higher order
kswapd reclaim was excercised less often.

The revert will hide the problem again, but doesn't make it go away
entirely, so I think we need my fix either way.

Whether you want to put the full THP weight back on the freshly fixed
higher order kswapd code for 3.7 is a different matter :-) At least we
would see quickly if it's still not working correctly...

2012-11-27 21:50:34

by Johannes Weiner

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
> On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> >Note that in the meantime, I've also applied (through Andrew) the
> >patch that reverts commit c654345924f7 (see commit 82b212f40059
> >'Revert "mm: remove __GFP_NO_KSWAPD"').
> >
> >I wonder if that revert may be bogus, and a result of this same issue.
> >Maybe that revert should be reverted, and replaced with your patch?
> >
> >Mel? Zdenek? What's the status here?
>
> Mel posted several patches to fix the kswapd issue. This one is
> slightly more risky than the outright revert, but probably preferred
> from a performance point of view:
>
> https://lkml.org/lkml/2012/11/12/151
>
> It works by skipping the kswapd wakeup for THP allocations, only
> if compaction is deferred or contended.

Just to clarify, this would be a replacement strictly for the
__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
up for higher order allocations like THP.

My patch is to fix how kswapd actually does higher order reclaim, and
it is required either way.

[ But isn't the _reason_ why the "wake up kswapd more carefully for
THP" patch was written kind of moot now since it was developed
against a crazy kswapd? It would certainly need to be re-evaluated.
My (limited) testing didn't show any issues anymore with waking
kswapd unconditionally once it's fixed. ]

2012-11-27 22:04:19

by Rik van Riel

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On 11/27/2012 04:49 PM, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
>> On 11/27/2012 03:58 PM, Linus Torvalds wrote:
>>> Note that in the meantime, I've also applied (through Andrew) the
>>> patch that reverts commit c654345924f7 (see commit 82b212f40059
>>> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>>>
>>> I wonder if that revert may be bogus, and a result of this same issue.
>>> Maybe that revert should be reverted, and replaced with your patch?
>>>
>>> Mel? Zdenek? What's the status here?
>>
>> Mel posted several patches to fix the kswapd issue. This one is
>> slightly more risky than the outright revert, but probably preferred
>> from a performance point of view:
>>
>> https://lkml.org/lkml/2012/11/12/151
>>
>> It works by skipping the kswapd wakeup for THP allocations, only
>> if compaction is deferred or contended.
>
> Just to clarify, this would be a replacement strictly for the
> __GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
> up for higher order allocations like THP.
>
> My patch is to fix how kswapd actually does higher order reclaim, and
> it is required either way.
>
> [ But isn't the _reason_ why the "wake up kswapd more carefully for
> THP" patch was written kind of moot now since it was developed
> against a crazy kswapd? It would certainly need to be re-evaluated.
> My (limited) testing didn't show any issues anymore with waking
> kswapd unconditionally once it's fixed. ]

Kswapd going crazy is certainly a large part of the problem.

However, that leaves the issue of page_alloc.c waking up
kswapd when the system is not actually low on memory.

Instead, kswapd is woken up because memory compaction failed,
potentially even due to lock contention during compaction!

Ideally the allocation code would only wake up kswapd if
memory needs to be freed, or in order for kswapd to do
memory compaction (so the allocator does not have to).

--
All rights reversed

2012-11-27 22:27:41

by Johannes Weiner

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> On 11/27/2012 04:49 PM, Johannes Weiner wrote:
> >On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
> >>On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> >>>Note that in the meantime, I've also applied (through Andrew) the
> >>>patch that reverts commit c654345924f7 (see commit 82b212f40059
> >>>'Revert "mm: remove __GFP_NO_KSWAPD"').
> >>>
> >>>I wonder if that revert may be bogus, and a result of this same issue.
> >>>Maybe that revert should be reverted, and replaced with your patch?
> >>>
> >>>Mel? Zdenek? What's the status here?
> >>
> >>Mel posted several patches to fix the kswapd issue. This one is
> >>slightly more risky than the outright revert, but probably preferred
> >>from a performance point of view:
> >>
> >>https://lkml.org/lkml/2012/11/12/151
> >>
> >>It works by skipping the kswapd wakeup for THP allocations, only
> >>if compaction is deferred or contended.
> >
> >Just to clarify, this would be a replacement strictly for the
> >__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
> >up for higher order allocations like THP.
> >
> >My patch is to fix how kswapd actually does higher order reclaim, and
> >it is required either way.
> >
> >[ But isn't the _reason_ why the "wake up kswapd more carefully for
> > THP" patch was written kind of moot now since it was developed
> > against a crazy kswapd? It would certainly need to be re-evaluated.
> > My (limited) testing didn't show any issues anymore with waking
> > kswapd unconditionally once it's fixed. ]
>
> Kswapd going crazy is certainly a large part of the problem.
>
> However, that leaves the issue of page_alloc.c waking up
> kswapd when the system is not actually low on memory.
>
> Instead, kswapd is woken up because memory compaction failed,
> potentially even due to lock contention during compaction!
>
> Ideally the allocation code would only wake up kswapd if
> memory needs to be freed, or in order for kswapd to do
> memory compaction (so the allocator does not have to).

Maybe I missed something, but shouldn't this be solved with my patch?

The first scan over the zones finds the higher order watermark
breached, but the reclaim scan over the zones tests against order-0
(testorder) watermarks when compaction is suitable, i.e. no reclaim if
there are enough order-0 pages for compaction to work. It should just
fall through to that zones_need_compaction condition at the end and
run compaction.

As such, it should always be approriate to wake kswapd if allocations
fail.

2012-11-27 23:20:05

by Linus Torvalds

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner <[email protected]> wrote:
> On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>>
>> Kswapd going crazy is certainly a large part of the problem.
>>
>> However, that leaves the issue of page_alloc.c waking up
>> kswapd when the system is not actually low on memory.
>>
>> Instead, kswapd is woken up because memory compaction failed,
>> potentially even due to lock contention during compaction!
>>
>> Ideally the allocation code would only wake up kswapd if
>> memory needs to be freed, or in order for kswapd to do
>> memory compaction (so the allocator does not have to).
>
> Maybe I missed something, but shouldn't this be solved with my patch?

Ok, guys. Cage fight!

The rules are simple: two men enter, one man leaves.

And the one who comes out gets to explain to me which patch(es) I
should apply, and which I should revert, if any.

My current guess is that I should apply the one Johannes just sent
("mm: vmscan: fix kswapd endless loop on higher order allocation")
after having added the cc to stable to it, and then revert the recent
revert (commit 82b212f40059).

But I await the Thunderdome. <Cue Tina Turner "We Don't Need Another Hero">

Linus

2012-11-28 09:45:22

by Mel Gorman

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

(Adding Thorsten to cc)

On Tue, Nov 27, 2012 at 03:48:34PM -0500, Johannes Weiner wrote:
> Hi everyone,
>
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage. We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem. Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.
>
> One problem was, as it seems, overly aggressive reclaim due to scaling
> up reclaim goals based on compaction failures. This one was reverted
> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures".
>

This particular one would have been made worse by the accounting bug and
if kswapd was staying awake longer than necessary. As scaling the amount
of reclaim only for direct reclaim helped this problem a lot, I strongly
suspect the accounting bug was a factor.

However the benefit for this is marginal -- it primarily affects how
many THP pages we can allocate under stress. There is already a graceful
fallback path and a system under heavy reclaim pressure is not going to
notice the performance benefit of THP.

> Another one was an accounting problem where a freed higher order page
> was underreported, and so kswapd had trouble restoring watermarks.
> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
> (appears like memory leak).
>

This almost certainly also requires the follow-on fix at
https://lkml.org/lkml/2012/11/26/225 for reasons I explained in
https://lkml.org/lkml/2012/11/27/190 .

> The third one is a problem with small zones, like the DMA zone, where
> the high watermark is lower than the low watermark plus compaction gap
> (2 * allocation size). The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark. The problematic code has been
> there since the 3.4 merge window for non-THP higher order allocations
> but has been more prominent since the 3.7 merge window, where kswapd
> is also woken up for the much more common THP allocations.
>

Yes.

> The following patch should fix the third issue by making both reclaim
> and compaction code in kswapd use the same predicate to determine
> whether a zone is balanced or not.
>
> Hopefully, the sum of all three fixes should tame kswapd enough for
> 3.7.
>

Not exactly sure of that. With just those patches it is possible for
allocations for THP entering the slow path to keep kswapd continually awake
doing busy work. This was an alternative to the revert that covered that
https://lkml.org/lkml/2012/11/12/151 but it was not enough because kswapd
would stay awake due to the bug you identified and fixed.

I went with the __GFP_NO_KSWAPD patch in this cycle because 3.6 was/is
very poor in how it handles THP after the removal of lumpy reclaim. 3.7
was shaping up to be even worse with multiple root causes too close to the
release date. Taking kswapd out of the equation covered some of the
problems (yes, by hiding them) so it could be revisited but Johannes may
have finally squashed it.

However, if we revert the revert then I strongly recommend that it be
replaced with "Avoid waking kswapd for THP allocations when compaction is
deferred or contended".

--
Mel Gorman
SUSE Labs

2012-11-28 10:14:07

by Mel Gorman

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
> On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner <[email protected]> wrote:
> > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> >>
> >> Kswapd going crazy is certainly a large part of the problem.
> >>
> >> However, that leaves the issue of page_alloc.c waking up
> >> kswapd when the system is not actually low on memory.
> >>
> >> Instead, kswapd is woken up because memory compaction failed,
> >> potentially even due to lock contention during compaction!
> >>
> >> Ideally the allocation code would only wake up kswapd if
> >> memory needs to be freed, or in order for kswapd to do
> >> memory compaction (so the allocator does not have to).
> >
> > Maybe I missed something, but shouldn't this be solved with my patch?
>
> Ok, guys. Cage fight!
>
> The rules are simple: two men enter, one man leaves.
>

I'm fairly scorch damaged from this whole cycle already. I won't need a
prop master to look the part for a thunderdome match.

> And the one who comes out gets to explain to me which patch(es) I
> should apply, and which I should revert, if any.
>

Based on the reports I've seen I expect the following to work for 3.7

Keep
96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)

Revert
82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"

Merge
mm: vmscan: fix kswapd endless loop on higher order allocation
mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
think we should also avoid waking kswapd for THP allocations if compaction
is deferred. Johannes' patch might mean that kswapd goes quickly go back
to sleep but it's still busy work.

3.6 is still known to be screwed in terms of THP because of the amount of
time it can spend in compaction after lumpy reclaim was removed. This is
my old list of patches I felt needed to be backported after 3.7 came out.
They are not tagged -stable, I'll be sending it to Greg manually.

e64c523 mm: compaction: abort compaction loop if lock is contended or run too long
3cc668f mm: compaction: move fatal signal check out of compact_checklock_irqsave
661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
f40d1e4 mm: compaction: acquire the zone->lock as late as possible
753341a revert "mm: have order > 0 compaction start off where it left"
bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were isolated
c89511a mm: compaction: Restart compaction from near where it left off
6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA

Only Johannes' patch needs to be added to this list. kswapd is not woken
for THP in 3.6 but as it calls compaction for other high-order allocations
it still makes sense.

--
Mel Gorman
SUSE Labs

2012-11-28 10:51:23

by Thorsten Leemhuis

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

Mel Gorman wrote on 28.11.2012 11:13:
> On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
>> On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner <[email protected]> wrote:
>> > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>
>> And the one who comes out gets to explain to me which patch(es) I
>> should apply, and which I should revert, if any.
>
> Based on the reports I've seen I expect the following to work for 3.7
>
> Keep
> 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>
> Revert
> 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>
> Merge
> mm: vmscan: fix kswapd endless loop on higher order allocation
> mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

I'll build a kernel with this combination and will give it a try. Maybe
one of those people that reported problems in
https://bugzilla.redhat.com/show_bug.cgi?id=866988 can try them, too.
There two people recently reported their problems were gone with kernels
that contained 82b212f4.

> Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> think we should also avoid waking kswapd for THP allocations if compaction
> is deferred. Johannes' patch might mean that kswapd goes quickly go back
> to sleep but it's still busy work.

Is there a way to trigger (some benchmark?) and detect (something in
/proc/vmstat ?) the problem Hannes patch tries to fix?

Background: The two main problems that got me into this discussion
vanished thx to 9671009 (mm: revert "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures") and ef6c5be (fix
incorrect NR_FREE_PAGES accounting (appears like memory leak)). I
thought all my problems had gone, but after a few days of uptime
(suspended and resumed the particular machine a few times in between, as
I was using it just in the evenings) kswap now and then started
consuming nearly 100% of one cpu core for 10 to 15 seconds intervals (it
seems watching a YouTube video triggered it; and the machine was using a
little bit swap space). I just had started debugging this, but due to
some stupid mistake
(https://plus.google.com/107616711159256259828/posts/GXuhf1LTien ) then
rebooted the machine :-/ So maybe I hit the problem Hannes patch tries
to solve, but I'm not sure; and I have no easy way to verify quickly if
the proposed patch combination helps.

Thorsten

2012-11-28 13:37:40

by Zdenek Kabelac

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

Dne 27.11.2012 21:58, Linus Torvalds napsal(a):
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?
>
> Mel? Zdenek? What's the status here?
>

I've tried for longer term:

https://lkml.org/lkml/2012/11/5/308
https://lkml.org/lkml/2012/11/12/113

these 2 seems to be now merge in -rc7
(since they disappeared after my git rebase)

and added slightly modified patch from Jiri
(https://lkml.org/lkml/2012/11/15/950
(Unsure where it still applies for -rc7??)

Also I've Jan Kara <[email protected]>
fs: Fix imbalance in freeze protection in mark_files_ro()
(which is still not applied to upstream)

And I think I'm NOT seeing huge load from kswapd0.
(At least related to my not really long uptimes)

But also I'm now frequent victim of my other report:

https://lkml.org/lkml/2012/11/15/369

Which turns into a problem, that if my T61 docking station
has enabled support for 'old hw' for docking in BIOS - i.e. serial output'
it becomes unstable and either 1st. or 2nd. resume deadlocks
machine - and serial port gives just garbage)

Zdenek

> Linus
>
> On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner <[email protected]> wrote:
>> Hi everyone,
>>
>> I hope I included everybody that participated in the various threads
>> on kswapd getting stuck / exhibiting high CPU usage. We were looking
>> at at least three root causes as far as I can see, so it's not really
>> clear who observed which problem. Please correct me if the
>> reported-by, tested-by, bisected-by tags are incomplete.
>>
>> One problem was, as it seems, overly aggressive reclaim due to scaling
>> up reclaim goals based on compaction failures. This one was reverted
>> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
>> reclaim/compaction based on failures".
>>
>> Another one was an accounting problem where a freed higher order page
>> was underreported, and so kswapd had trouble restoring watermarks.
>> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
>> (appears like memory leak).
>>
>> The third one is a problem with small zones, like the DMA zone, where
>> the high watermark is lower than the low watermark plus compaction gap
>> (2 * allocation size). The zonelist reclaim in kswapd would do
>> nothing because all high watermarks are met, but the compaction logic
>> would find its own requirements unmet and loop over the zones again.
>> Indefinitely, until some third party would free enough memory to help
>> meet the higher compaction watermark. The problematic code has been
>> there since the 3.4 merge window for non-THP higher order allocations
>> but has been more prominent since the 3.7 merge window, where kswapd
>> is also woken up for the much more common THP allocations.
>>
>> The following patch should fix the third issue by making both reclaim
>> and compaction code in kswapd use the same predicate to determine
>> whether a zone is balanced or not.
>>
>> Hopefully, the sum of all three fixes should tame kswapd enough for
>> 3.7.
>>
>> Johannes
>>

2012-11-28 14:05:08

by Jiri Slaby

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On 11/28/2012 02:35 PM, Zdenek Kabelac wrote:
> and added slightly modified patch from Jiri
> (https://lkml.org/lkml/2012/11/15/950
> (Unsure where it still applies for -rc7??)

It is needed for -next only. And if you have recent -next, it's already
there...

thanks,
--
js
suse labs

2012-11-28 16:42:30

by Mel Gorman

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Wed, Nov 28, 2012 at 10:13:59AM +0000, Mel Gorman wrote:
> On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
> > On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner <[email protected]> wrote:
> > > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> > >>
> > >> Kswapd going crazy is certainly a large part of the problem.
> > >>
> > >> However, that leaves the issue of page_alloc.c waking up
> > >> kswapd when the system is not actually low on memory.
> > >>
> > >> Instead, kswapd is woken up because memory compaction failed,
> > >> potentially even due to lock contention during compaction!
> > >>
> > >> Ideally the allocation code would only wake up kswapd if
> > >> memory needs to be freed, or in order for kswapd to do
> > >> memory compaction (so the allocator does not have to).
> > >
> > > Maybe I missed something, but shouldn't this be solved with my patch?
> >
> > Ok, guys. Cage fight!
> >
> > The rules are simple: two men enter, one man leaves.
> >
>
> I'm fairly scorch damaged from this whole cycle already. I won't need a
> prop master to look the part for a thunderdome match.
>
> > And the one who comes out gets to explain to me which patch(es) I
> > should apply, and which I should revert, if any.
> >
>
> Based on the reports I've seen I expect the following to work for 3.7
>
> Keep
> 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>
> Revert
> 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>
> Merge
> mm: vmscan: fix kswapd endless loop on higher order allocation
> mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
>

and
mm: compaction: Fix return value of capture_free_page

but this one may already be in flight from Andrew's tree as he picked it
up already.

--
Mel Gorman
SUSE Labs

2012-11-28 22:52:18

by Andrew Morton

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Wed, 28 Nov 2012 10:13:59 +0000
Mel Gorman <[email protected]> wrote:

> Based on the reports I've seen I expect the following to work for 3.7
>
> Keep
> 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>
> Revert
> 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>
> Merge
> mm: vmscan: fix kswapd endless loop on higher order allocation
> mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

"mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
myself" and when Zdenek tested it he hit an unexplained oom.

> Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> think we should also avoid waking kswapd for THP allocations if compaction
> is deferred. Johannes' patch might mean that kswapd goes quickly go back
> to sleep but it's still busy work.
>
> 3.6 is still known to be screwed in terms of THP because of the amount of
> time it can spend in compaction after lumpy reclaim was removed. This is
> my old list of patches I felt needed to be backported after 3.7 came out.
> They are not tagged -stable, I'll be sending it to Greg manually.
>
> e64c523 mm: compaction: abort compaction loop if lock is contended or run too long
> 3cc668f mm: compaction: move fatal signal check out of compact_checklock_irqsave
> 661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
> 2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
> f40d1e4 mm: compaction: acquire the zone->lock as late as possible
> 753341a revert "mm: have order > 0 compaction start off where it left"
> bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were isolated
> c89511a mm: compaction: Restart compaction from near where it left off
> 6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
> 0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA
>
> Only Johannes' patch needs to be added to this list. kswapd is not woken
> for THP in 3.6 but as it calls compaction for other high-order allocations
> it still makes sense.

Please identify "Johannes' patch"?

2012-11-28 23:54:19

by Mel Gorman

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> On Wed, 28 Nov 2012 10:13:59 +0000
> Mel Gorman <[email protected]> wrote:
>
> > Based on the reports I've seen I expect the following to work for 3.7
> >
> > Keep
> > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> >
> > Revert
> > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> >
> > Merge
> > mm: vmscan: fix kswapd endless loop on higher order allocation
> > mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
>
> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> myself" and when Zdenek tested it he hit an unexplained oom.
>

I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
Further, when he hit that OOM, it looked like a genuine OOM. He had no
swap configured and inactive/active file pages were very low. Finally,
the free pages for Normal looked off and could also have been affected by
the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
here. Are you thinking of something else?

I have not tested with the patch admittedly but Thorsten has and seemed
to be ok with it https://lkml.org/lkml/2012/11/23/276.

> > Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> > think we should also avoid waking kswapd for THP allocations if compaction
> > is deferred. Johannes' patch might mean that kswapd goes quickly go back
> > to sleep but it's still busy work.
> >
> > 3.6 is still known to be screwed in terms of THP because of the amount of
> > time it can spend in compaction after lumpy reclaim was removed. This is
> > my old list of patches I felt needed to be backported after 3.7 came out.
> > They are not tagged -stable, I'll be sending it to Greg manually.
> >
> > e64c523 mm: compaction: abort compaction loop if lock is contended or run too long
> > 3cc668f mm: compaction: move fatal signal check out of compact_checklock_irqsave
> > 661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
> > 2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
> > f40d1e4 mm: compaction: acquire the zone->lock as late as possible
> > 753341a revert "mm: have order > 0 compaction start off where it left"
> > bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were isolated
> > c89511a mm: compaction: Restart compaction from near where it left off
> > 6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
> > 0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA
> >
> > Only Johannes' patch needs to be added to this list. kswapd is not woken
> > for THP in 3.6 but as it calls compaction for other high-order allocations
> > it still makes sense.
>
> Please identify "Johannes' patch"?

mm: vmscan: fix kswapd endless loop on higher order allocation

--
Mel Gorman
SUSE Labs

2012-11-29 00:14:55

by Andrew Morton

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Wed, 28 Nov 2012 23:54:12 +0000
Mel Gorman <[email protected]> wrote:

> On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> > On Wed, 28 Nov 2012 10:13:59 +0000
> > Mel Gorman <[email protected]> wrote:
> >
> > > Based on the reports I've seen I expect the following to work for 3.7
> > >
> > > Keep
> > > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> > > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> > >
> > > Revert
> > > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> > >
> > > Merge
> > > mm: vmscan: fix kswapd endless loop on higher order allocation
> > > mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
> >
> > "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> > myself" and when Zdenek tested it he hit an unexplained oom.
> >
>
> I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> Further, when he hit that OOM, it looked like a genuine OOM. He had no
> swap configured and inactive/active file pages were very low. Finally,
> the free pages for Normal looked off and could also have been affected by
> the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> here. Are you thinking of something else?

who, me, think? I was trying to work out why I hadn't merged or queued
a patch which you felt was important. Turned out it was because it
didn't look very tested and final.

> I have not tested with the patch admittedly but Thorsten has and seemed
> to be ok with it https://lkml.org/lkml/2012/11/23/276.

OK, I'll queue revert-revert-mm-remove-__gfp_no_kswapd.patch and the
patch from https://patchwork.kernel.org/patch/1728081/.

So what I'm currently sitting on for 3.7 is

mm-compaction-fix-return-value-of-capture_free_page.patch
mm-vmemmap-fix-wrong-use-of-virt_to_page.patch
mm-vmscan-fix-endless-loop-in-kswapd-balancing.patch
revert-revert-mm-remove-__gfp_no_kswapd.patch
mm-avoid-waking-kswapd-for-thp-allocations-when-compaction-is-deferred-or-contended.patch
mm-soft-offline-split-thp-at-the-beginning-of-soft_offline_page.patch

> > Please identify "Johannes' patch"?
>
> mm: vmscan: fix kswapd endless loop on higher order allocation

OK, we have that. I'll start a round of testing, do another -next drop
and send the above Linuswards tomorrow.

2012-11-29 15:26:48

by Thorsten Leemhuis

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

Mel Gorman wrote on 29.11.2012 00:54:
> On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
>> On Wed, 28 Nov 2012 10:13:59 +0000
>> Mel Gorman <[email protected]> wrote:
>>
>> > Based on the reports I've seen I expect the following to work for 3.7
>> > Keep
>> > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
>> > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>> > Revert
>> > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>> > Merge
>> > mm: vmscan: fix kswapd endless loop on higher order allocation
>> > mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
>> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
>> myself" and when Zdenek tested it he hit an unexplained oom.
> I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> Further, when he hit that OOM, it looked like a genuine OOM. He had no
> swap configured and inactive/active file pages were very low. Finally,
> the free pages for Normal looked off and could also have been affected by
> the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> here. Are you thinking of something else?
>
> I have not tested with the patch admittedly but Thorsten has and seemed
> to be ok with it https://lkml.org/lkml/2012/11/23/276.

Yeah, on my two main work horses a few different kernels based on rc6 or
rc7 worked fine with this patch. But sorry, it seems the patch doesn't
fix the problems Fedora user John Ellson sees, who tried kernels I built
in the Fedora buildsystem. Details:

In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c35 he mentioned
his machine worked fine with a rc6 based kernel I built that contained
82b212f4 (Revert "mm: remove __GFP_NO_KSWAPD"). Before that he had tried
a kernel with the same baseline that contained "Avoid waking kswapd for
THP allocations when […]" instead and reported it didn't help on his
i686 machine (seems it helped the x86-64 one):
https://bugzilla.redhat.com/show_bug.cgi?id=866988#c33

He now tried a recent mainline kernel I built 20 hours ago that is based
on a git checkout from round about two days ago, reverts 82b212f4, and had
* fix-kswapd-endless-loop-on-higher-order-allocation.patch
* Avoid-waking-kswapd-for-THP-allocations-when.patch
* mm-compaction-Fix-return-value-of-capture_free_page.patch
applied. In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c39 and
comment 41 he reported that this kernel on his i686 host showed 100%cpu
usage by kswapd0 :-/

Build log for said kernel rpms (I quite sure I applied the patches
properly, but you know: mistakes happen, so be careful, maybe I did
something stupid somewhere...):
http://kojipkgs.fedoraproject.org//work/tasks/8253/4738253/build.log

I know, this makes things more complicated again; but I wanted to let
you guys know that some problem might still be lurking somewhere. Side
note: right now it seems John with kernels that contain
"Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
quicker (or only?) on i686 than on x86-64.

CU
Thorsten

2012-11-29 17:06:15

by Johannes Weiner

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote:
> Mel Gorman wrote on 29.11.2012 00:54:
> > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> >> On Wed, 28 Nov 2012 10:13:59 +0000
> >> Mel Gorman <[email protected]> wrote:
> >>
> >> > Based on the reports I've seen I expect the following to work for 3.7
> >> > Keep
> >> > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> >> > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> >> > Revert
> >> > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> >> > Merge
> >> > mm: vmscan: fix kswapd endless loop on higher order allocation
> >> > mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
> >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> >> myself" and when Zdenek tested it he hit an unexplained oom.
> > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> > Further, when he hit that OOM, it looked like a genuine OOM. He had no
> > swap configured and inactive/active file pages were very low. Finally,
> > the free pages for Normal looked off and could also have been affected by
> > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> > here. Are you thinking of something else?
> >
> > I have not tested with the patch admittedly but Thorsten has and seemed
> > to be ok with it https://lkml.org/lkml/2012/11/23/276.
>
> Yeah, on my two main work horses a few different kernels based on rc6 or
> rc7 worked fine with this patch. But sorry, it seems the patch doesn't
> fix the problems Fedora user John Ellson sees, who tried kernels I built
> in the Fedora buildsystem. Details:
>
> In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c35 he mentioned
> his machine worked fine with a rc6 based kernel I built that contained
> 82b212f4 (Revert "mm: remove __GFP_NO_KSWAPD"). Before that he had tried
> a kernel with the same baseline that contained "Avoid waking kswapd for
> THP allocations when […]" instead and reported it didn't help on his
> i686 machine (seems it helped the x86-64 one):
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c33
>
> He now tried a recent mainline kernel I built 20 hours ago that is based
> on a git checkout from round about two days ago, reverts 82b212f4, and had
> * fix-kswapd-endless-loop-on-higher-order-allocation.patch
> * Avoid-waking-kswapd-for-THP-allocations-when.patch
> * mm-compaction-Fix-return-value-of-capture_free_page.patch
> applied. In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c39 and
> comment 41 he reported that this kernel on his i686 host showed 100%cpu
> usage by kswapd0 :-/
>
> Build log for said kernel rpms (I quite sure I applied the patches
> properly, but you know: mistakes happen, so be careful, maybe I did
> something stupid somewhere...):
> http://kojipkgs.fedoraproject.org//work/tasks/8253/4738253/build.log
>
> I know, this makes things more complicated again; but I wanted to let
> you guys know that some problem might still be lurking somewhere. Side
> note: right now it seems John with kernels that contain
> "Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
> quicker (or only?) on i686 than on x86-64.

Humm, highmem... Could this be the lowmem protection forcing kswapd
to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every
time it's woken up?

This requires somebody to wake up kswapd regularly, though and from
his report it's not quite clear to me if kswapd gets stuck or just has
really high CPU usage while the system is still under load. The
initial post says he would expect "<5% cpu when idling" but his top
snippet in there shows there are other tasks running as well. So does
it happen while the system is busy or when it's otherwise idle?

[ On the other hand, not waking kswapd from THP allocations seems to
not show this problem on his i686 machine. But it could also just
be a tiny window of conditions aligning perfectly that drops kswapd
in an endless loop, and the increased wakeups increase the
probability of hitting it. So, yeah, this would be good to know. ]

As the system is still responsive when this happens, any chance he
could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
haywire?

Or even run perf record -a -g sleep 5; perf report > kswapd.txt?

Preferrably with this patch applied, to rule out faulty lowmem
protection:

buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
when figuring out whether the zone is balanced and so priority levels
are not descended and no progress is ever made.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..73c4f5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2400,6 +2400,14 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
static bool zone_balanced(struct zone *zone, int order,
unsigned long balance_gap, int classzone_idx)
{
+ /*
+ * If the number of buffer_heads in the machine exceeds the
+ * maximum allowed level and this node has a highmem zone,
+ * force kswapd to reclaim from it to relieve lowmem pressure.
+ */
+ if (is_highmem(zone) && buffer_heads_over_limit)
+ return false;
+
if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
balance_gap, classzone_idx, 0))
return false;
@@ -2586,17 +2594,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
age_active_anon(zone, &sc);

- /*
- * If the number of buffer_heads in the machine
- * exceeds the maximum allowed level and this node
- * has a highmem zone, force kswapd to reclaim from
- * it to relieve lowmem pressure.
- */
- if (buffer_heads_over_limit && is_highmem_idx(i)) {
- end_zone = i;
- break;
- }
-
if (!zone_balanced(zone, order, 0, 0)) {
end_zone = i;
break;
@@ -2672,8 +2669,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
COMPACT_SKIPPED)
testorder = 0;

- if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
- !zone_balanced(zone, testorder,
+ if (!zone_balanced(zone, testorder,
balance_gap, end_zone)) {
shrink_zone(zone, &sc);

2012-11-30 12:35:39

by Thorsten Leemhuis

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

Johannes Weiner wrote on 29.11.2012 18:05:
> On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote:
>> Mel Gorman wrote on 29.11.2012 00:54:
>> > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
>> >> On Wed, 28 Nov 2012 10:13:59 +0000
>> >> Mel Gorman <[email protected]> wrote:
>> >> > Based on the reports I've seen I expect the following to work for 3.7
>> >> > Keep
>> >> > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
>> >> > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>> >> > Revert
>> >> > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>> >> > Merge
>> >> > mm: vmscan: fix kswapd endless loop on higher order allocation
>> >> > mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
>> >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
>> >> myself" and when Zdenek tested it he hit an unexplained oom.
>> > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
>> > Further, when he hit that OOM, it looked like a genuine OOM. He had no
>> > swap configured and inactive/active file pages were very low. Finally,
>> > the free pages for Normal looked off and could also have been affected by
>> > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
>> > here. Are you thinking of something else?
>> > I have not tested with the patch admittedly but Thorsten has and seemed
>> > to be ok with it https://lkml.org/lkml/2012/11/23/276.
>> Yeah, on my two main work horses a few different kernels based on rc6 or
>> rc7 worked fine with this patch. But sorry, it seems the patch doesn't
>> fix the problems Fedora user John Ellson sees, who tried kernels I built
>> in the Fedora buildsystem. Details:
> [...]
>> I know, this makes things more complicated again; but I wanted to let
>> you guys know that some problem might still be lurking somewhere. Side
>> note: right now it seems John with kernels that contain
>> "Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
>> quicker (or only?) on i686 than on x86-64.
>
> Humm, highmem... Could this be the lowmem protection forcing kswapd
> to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every
> time it's woken up?
>
> This requires somebody to wake up kswapd regularly, though and from
> his report it's not quite clear to me if kswapd gets stuck or just has
> really high CPU usage while the system is still under load. The
> initial post says he would expect "<5% cpu when idling" but his top
> snippet in there shows there are other tasks running as well. So does
> it happen while the system is busy or when it's otherwise idle?
>
> [ On the other hand, not waking kswapd from THP allocations seems to
> not show this problem on his i686 machine. But it could also just
> be a tiny window of conditions aligning perfectly that drops kswapd
> in an endless loop, and the increased wakeups increase the
> probability of hitting it. So, yeah, this would be good to know. ]
>
> As the system is still responsive when this happens, any chance he
> could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
> haywire?
>
> Or even run perf record -a -g sleep 5; perf report > kswapd.txt?
>
> Preferrably with this patch applied, to rule out faulty lowmem
> protection:
>
> buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
> when figuring out whether the zone is balanced and so priority levels
> are not descended and no progress is ever made.

/me wonders how to elegantly get out of his man-in-the-middle position

John was able to reproduce the problem quickly with a kernel that
contained the patch from your mail. For details see

https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later

He provided the informations there. Parts of it:

/proc/vmstat while kswad0 at 100%cpu

nr_free_pages 196858
nr_inactive_anon 15804
nr_active_anon 65
nr_inactive_file 20792
nr_active_file 11307
nr_unevictable 0
nr_mlock 0
nr_anon_pages 14385
nr_mapped 2393
nr_file_pages 32563
nr_dirty 5
nr_writeback 0
nr_slab_reclaimable 3113
nr_slab_unreclaimable 4725
nr_page_table_pages 271
nr_kernel_stack 96
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1487
nr_vmscan_immediate_reclaim 3
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 381
nr_dirtied 388323
nr_written 361128
nr_anon_transparent_hugepages 1
nr_free_cma 0
nr_dirty_threshold 38188
nr_dirty_background_threshold 19094
pgpgin 1057223
pgpgout 1552306
pswpin 8
pswpout 1487
pgalloc_dma 5548
pgalloc_normal 10651864
pgalloc_high 2191246
pgalloc_movable 0
pgfree 13055503
pgactivate 440358
pgdeactivate 259724
pgfault 31423675
pgmajfault 3760
pgrefill_dma 2174
pgrefill_normal 212914
pgrefill_high 51755
pgrefill_movable 0
pgsteal_kswapd_dma 1
pgsteal_kswapd_normal 202106
pgsteal_kswapd_high 36515
pgsteal_kswapd_movable 0
pgsteal_direct_dma 18
pgsteal_direct_normal 0
pgsteal_direct_high 3818
pgsteal_direct_movable 0
pgscan_kswapd_dma 1
pgscan_kswapd_normal 203044
pgscan_kswapd_high 40407
pgscan_kswapd_movable 0
pgscan_direct_dma 18
pgscan_direct_normal 0
pgscan_direct_high 4409
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 0
slabs_scanned 264192
kswapd_inodesteal 171676
kswapd_low_wmark_hit_quickly 0
kswapd_high_wmark_hit_quickly 26
kswapd_skip_congestion_wait 0
pageoutrun 117729182
allocstall 5
pgrotated 1628
compact_blocks_moved 313
compact_pages_moved 7192
compact_pagemigrate_failed 265
compact_stall 13
compact_fail 9
compact_success 4
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 2985
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1877
unevictable_pgs_mlocked 3965
unevictable_pgs_munlocked 3965
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 636
thp_fault_fallback 10
thp_collapse_alloc 342
thp_collapse_alloc_failed 2
thp_split 6

/proc/zoneinfo with kswapd0 at 100% cpu

Node 0, zone DMA
pages free 1655
min 196
low 245
high 294
scanned 0
spanned 4080
present 3951
nr_free_pages 1655
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 3
nr_slab_unreclaimable 1
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 315
nr_written 315
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 861, 1000, 1000)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 2
all_unreclaimable: 1
start_pfn: 16
inactive_ratio: 1
Node 0, zone Normal
pages free 186234
min 10953
low 13691
high 16429
scanned 0
spanned 222206
present 220470
nr_free_pages 186234
nr_inactive_anon 3147
nr_active_anon 2
nr_inactive_file 14064
nr_active_file 4672
nr_unevictable 0
nr_mlock 0
nr_anon_pages 3028
nr_mapped 216
nr_file_pages 18857
nr_dirty 8
nr_writeback 0
nr_slab_reclaimable 3110
nr_slab_unreclaimable 4729
nr_page_table_pages 62
nr_kernel_stack 96
nr_unstable 0
nr_bounce 0
nr_vmscan_write 311
nr_vmscan_immediate_reclaim 2
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 114
nr_dirtied 339809
nr_written 315061
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 0, 1111, 1111)
pagesets
cpu: 0
count: 81
high: 186
batch: 31
vm stats threshold: 8
all_unreclaimable: 0
start_pfn: 4096
inactive_ratio: 1
Node 0, zone HighMem
pages free 8983
min 34
low 475
high 917
scanned 0
spanned 35840
present 35560
nr_free_pages 8983
nr_inactive_anon 12661
nr_active_anon 64
nr_inactive_file 6849
nr_active_file 6500
nr_unevictable 0
nr_mlock 0
nr_anon_pages 11357
nr_mapped 2177
nr_file_pages 13692
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 209
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1176
nr_vmscan_immediate_reclaim 1
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 267
nr_dirtied 48189
nr_written 45739
nr_anon_transparent_hugepages 1
nr_free_cma 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 20
high: 42
batch: 7
vm stats threshold: 4
all_unreclaimable: 0
start_pfn: 226302
inactive_ratio: 1

First few lines of /proc/vmstat while kswad0 at 100%cpu

# ========
# captured on: Fri Nov 30 07:22:00 2012
# hostname : rawhide
# os release : 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.i686
# perf version : 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.i686
# arch : i686
# nrcpus online : 1
# nrcpus avail : 1
# cpudesc : QEMU Virtual CPU version 1.0.1
# cpuid : AuthenticAMD,6,2,3
# total memory : 1027716 kB
# cmdline : /usr/bin/perf record -g -a sleep 5
# event : name = cpu-clock, type = 1, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, id = { 7 }
# HEADER_CPU_TOPOLOGY info available, use -I to display
# pmu mappings: software = 1, tracepoint = 2, breakpoint = 5
# ========
#
# Samples: 20K of event 'cpu-clock'
# Event count (approx.): 20016
#
# Overhead Command Shared Object Symbol
# ........ ........... ......................... ...................................
#
16.52% kswapd0 [kernel.kallsyms] [k] idr_get_next
|
--- idr_get_next
|
|--99.76%-- css_get_next
| mem_cgroup_iter
| |
| |--50.49%-- shrink_zone
| | kswapd
| | kthread
| | ret_from_kernel_thread
| |
| --49.51%-- kswapd
| kthread
| ret_from_kernel_thread
--0.24%-- [...]

11.23% kswapd0 [kernel.kallsyms] [k] prune_super
|
--- prune_super
|
|--86.74%-- shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
--13.26%-- kswapd
kthread
ret_from_kernel_thread

10.73% kswapd0 [kernel.kallsyms] [k] shrink_slab
|
--- shrink_slab
|
|--99.58%-- kswapd
| kthread
| ret_from_kernel_thread
--0.42%-- [...]

7.36% kswapd0 [kernel.kallsyms] [k] grab_super_passive
|
--- grab_super_passive
|
|--92.46%-- prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
--7.54%-- shrink_slab
kswapd
kthread
ret_from_kernel_thread

5.82% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--34.28%-- put_super
| drop_super
| prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
|--30.50%-- grab_super_passive
| prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
|--17.27%-- prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
|--16.15%-- drop_super
| prune_super
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
|--1.20%-- mb_cache_shrink_fn
| shrink_slab
| kswapd
| kthread
| ret_from_kernel_thread
|
--0.60%-- shrink_slab
kswapd
kthread
ret_from_kernel_thread

4.43% kswapd0 [kernel.kallsyms] [k] fill_contig_page_info
|
--- fill_contig_page_info
|
|--99.10%-- fragmentation_index
| compaction_suitable
| kswapd
| kthread
| ret_from_kernel_thread
|
--0.90%-- compaction_suitable
kswapd
kthread
ret_from_kernel_thread

3.81% kswapd0 [kernel.kallsyms] [k] shrink_lruvec
|
--- shrink_lruvec
|
|--99.34%-- shrink_zone
| kswapd
| kthread
| ret_from_kernel_thread
|
--0.66%-- kswapd
kthread
ret_from_kernel_thread

The rest at https://bugzilla.redhat.com/attachment.cgi?id=654977

CU
Thorsten

2012-12-01 00:46:37

by Johannes Weiner

[permalink] [raw]

Subject: Re: kswapd craziness in 3.7

Hi Thorsten,

On Fri, Nov 30, 2012 at 01:39:03PM +0100, Thorsten Leemhuis wrote:
> /me wonders how to elegantly get out of his man-in-the-middle position

You control the mighty koji :-)

But seriously, this is very helpful, thank you! John now also Cc'd
directly.

> John was able to reproduce the problem quickly with a kernel that
> contained the patch from your mail. For details see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later
>
> He provided the informations there. Parts of it:

> /proc/vmstat while kswad0 at 100%cpu
> /proc/zoneinfo with kswapd0 at 100% cpu
> perf profile

Thanks.

I'm quoting the interesting bits in order of the cars on my possibly
derailing train of thought:

> pageoutrun 117729182
> allocstall 5

Okay, so kswapd is stupidly looping but it's still managing to do it's
actual job; nobody is dropping into direct reclaim.

> pgsteal_kswapd_dma 1
> pgsteal_kswapd_normal 202106
> pgsteal_kswapd_high 36515
> pgsteal_kswapd_movable 0

> pgscan_kswapd_dma 1
> pgscan_kswapd_normal 203044
> pgscan_kswapd_high 40407
> pgscan_kswapd_movable 0

Does not seem excessive, so apparently it also does not overreclaim.

> Node 0, zone DMA
> pages free 1655
> min 196
> low 245
> high 294

> Node 0, zone Normal
> pages free 186234
> min 10953
> low 13691
> high 16429

> Node 0, zone HighMem
> pages free 8983
> min 34
> low 475
> high 917

These are all well above their watermarks, yet kswapd is definitely
finding something wrong with one of these as it actually does drop
into the reclaim loop, so zone_balanced() must be returning false:

> 16.52% kswapd0 [kernel.kallsyms] [k] idr_get_next
> |
> --- idr_get_next
> |
> |--99.76%-- css_get_next
> | mem_cgroup_iter
> | |
> | |--50.49%-- shrink_zone
> | | kswapd
> | | kthread
> | | ret_from_kernel_thread
> | |
> | --49.51%-- kswapd
> | kthread
> | ret_from_kernel_thread
> --0.24%-- [...]
>
> 11.23% kswapd0 [kernel.kallsyms] [k] prune_super
> |
> --- prune_super
> |
> |--86.74%-- shrink_slab
> | kswapd
> | kthread
> | ret_from_kernel_thread
> |
> --13.26%-- kswapd
> kthread
> ret_from_kernel_thread

Spending so much time in shrink_zone and shrink_slab without
overreclaiming a zone, I would say that a) this always stays on the
DEF_PRIORITY and b) only loops on the DMA zone. At DEF_PRIORITY, the
scan goal for filepages in the other zones would be > 0 e.g.

As the DMA zone watermarks are fine, it must be the fragmentation
index that indicates a lack of memory. Filling in the 1655 free pages
into the fragmentation index formula indicates lack of free memory
when these 1655 pages are lumped together in less than 9 page blocks.
Not unrealistic, I think: on my desktop machine, the DMA zone's free
3975 pages are lumped together in only 12 blocks. But on my system,
the DMA zone is either never used and there is always at least one
page block available that could satisfy a huge page allocation
(fragmentation index == -1000). Unless the system gets really close
to OOM, at which point the DMA zone is highly fragmented. And keep in
mind that if the priority level goes below DEF_PRIORITY, as it does
close to OOM, the unreclaimable DMA zone is ignored anyway. But the
DMA zone here is just barely used:

> Node 0, zone DMA
[...]
> nr_slab_reclaimable 3
> nr_slab_unreclaimable 1
[...]
> nr_dirtied 315
> nr_written 315

which could explain a fragmentation index that asks for more free
memory while the watermarks are fine.

Why this all loops: there is one more inconsistency where the
conditions for reclaim and the conditions for compaction contradict
each other: reclaim also does not consider the DMA zone balanced, but
it needs only 25% of the whole node to be balanced, while compaction
requires every single zone to be balanced individually.

So these strict per-zone checks for compaction at the end of
balance_pgdat() are likely to be the culprits that keep kswapd looping
forever on this machine, trying to balance DMA for compaction while
reclaim decides it has enough balanced memory in the node overall.

I think we can just remove them: whenever the compaction code is
reached, the reclaim code balanced 25% of the memory available for the
classzone to be suitable for compaction.

Mel? Rik?

---
From: Johannes Weiner <[email protected]>
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
to individual uncompactable zones

When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the
node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about
when kswapd was desparately trying to balance tiny zones when all
bigger zones in the node had plenty of free memory. Arguably, the
same should apply to compaction: if a significant part of the node is
balanced enough to run compaction, do not get hung up on that tiny
zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced). Remove the individual zone checks
that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 16 ----------------
1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
if (!populated_zone(zone))
continue;

- if (zone->all_unreclaimable &&
- sc.priority != DEF_PRIORITY)
- continue;
-
- /* Would compaction fail due to lack of free memory? */
- if (COMPACTION_BUILD &&
- compaction_suitable(zone, order) == COMPACT_SKIPPED)
- goto loop_again;
-
- /* Confirm the zone is balanced for order-0 */
- if (!zone_watermark_ok(zone, 0,
- high_wmark_pages(zone), 0, 0)) {
- order = sc.order = 0;
- goto loop_again;
- }
-
/* Check if the memory needs to be defragmented. */
if (zone_watermark_ok(zone, order,
low_wmark_pages(zone), *classzone_idx, 0))
--
1.7.11.7