2017-03-07 13:58:27

by Michal Hocko

Subject: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

From: Michal Hocko <[email protected]>

Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
in the too_many_isolated loop basically forever because the last few pages
on the LRU lists are isolated by kswapd, which is stuck on fs locks
when doing the pageout or slab reclaim. This in turn means that there is
nobody to actually trigger the oom killer and the system is basically
unusable.

too_many_isolated was introduced by commit 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature oom killer invocations, because back then no reclaim progress
could indeed trigger the OOM killer too early. But since the oom detection
rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the allocation/reclaim
retry loop considers all the reclaimable pages and throttles the allocation
at that layer, so we can loosen the direct reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and return
immediately when the situation hasn't resolved after the first sleep.
Replace congestion_wait with a simple schedule_timeout_interruptible because
we are not really waiting on IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for the reclaim
which makes progress only very slowly. This would be obvious from the oom
report as the number of isolated pages is printed there. If we ever hit
this, should_reclaim_retry should consider those numbers in its evaluation
in one way or another.

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]

Signed-off-by: Michal Hocko <[email protected]>
---

Hi,
Tetsuo helped to test this patch [3] and couldn't reproduce the hang
inside the page allocator anymore. Thanks! He was able to see a
different lockup though. This time it is more related to how XFS does
the inode reclaim from the WQ context. This is being discussed [4]
and I believe it is unrelated to this change.

I believe this change is still an improvement because it reduces the chances
of an unbound loop inside the reclaim path, so we have a) more reliable
detection of the lockup from the allocator path and b) more deterministic
retry loop logic.

Thoughts/complaints/suggestions?

[3] http://lkml.kernel.org/r/[email protected]
[4] http://lkml.kernel.org/r/[email protected]

mm/vmscan.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+ bool stalled = false;

while (unlikely(too_many_isolated(pgdat, file, sc))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ if (stalled)
+ return 0;
+
+ /* wait a bit for the reclaimer. */
+ schedule_timeout_interruptible(HZ/10);
+ stalled = true;

/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
--
2.11.0


2017-03-07 20:02:16

by Rik van Riel

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo Handa has reported [1][2] that direct reclaimers might get
> stuck
> in too_many_isolated loop basically for ever because the last few
> pages
> on the LRU lists are isolated by the kswapd which is stuck on fs
> locks
> when doing the pageout or slab reclaim. This in turn means that there
> is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan:
> throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the
> direct
> reclaim throttling.

It only does this to some extent.  If reclaim made
no progress, for example due to immediately bailing
out because the number of already isolated pages is
too high (due to many parallel reclaimers), the code
could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
test without ever looking at the number of reclaimable
pages.

Could that create problems if we have many concurrent
reclaimers?

It may be OK, I just do not understand all the implications.

I like the general direction your patch takes the code in,
but I would like to understand it better...

--
All rights reversed



2017-03-08 09:54:29

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue 07-03-17 14:52:36, Rik van Riel wrote:
> On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Tetsuo Handa has reported [1][2] that direct reclaimers might get
> > stuck
> > in too_many_isolated loop basically for ever because the last few
> > pages
> > on the LRU lists are isolated by the kswapd which is stuck on fs
> > locks
> > when doing the pageout or slab reclaim. This in turn means that there
> > is
> > nobody to actually trigger the oom killer and the system is basically
> > unusable.
> >
> > too_many_isolated has been introduced by 35cd78156c49 ("vmscan:
> > throttle
> > direct reclaim when too many pages are isolated already") to prevent
> > from pre-mature oom killer invocations because back then no reclaim
> > progress could indeed trigger the OOM killer too early. But since the
> > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > the allocation/reclaim retry loop considers all the reclaimable pages
> > and throttles the allocation at that layer so we can loosen the
> > direct
> > reclaim throttling.
>
> It only does this to some extent.  If reclaim made
> no progress, for example due to immediately bailing
> out because the number of already isolated pages is
> too high (due to many parallel reclaimers), the code
> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> test without ever looking at the number of reclaimable
> pages.
>
> Could that create problems if we have many concurrent
> reclaimers?

As the changelog mentions, this might theoretically cause a premature oom
killer invocation. We could easily see that from the oom report by checking
the isolated counters. My testing didn't trigger that though, and I was
hammering the page allocator path from many threads.

I suspect some artificial tests can trigger that; I am not so sure about
reasonable workloads. If we see this happening, though, then the fix would
be to resurrect my previous attempt to track NR_ISOLATED* per zone and
use them in the allocator retry logic.
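
For illustration only, a minimal sketch of what using those counters in the
retry logic could look like. The helper below is made up, and it uses the
existing per-node NR_ISOLATED_* counters as a stand-in for the per-zone
variant mentioned above:

/*
 * Hypothetical sketch, not the actual proposal: fold pages that other
 * reclaimers have temporarily isolated back into the "reclaimable"
 * estimate, so that the allocator retry logic does not give up just
 * because many pages are currently off the LRU lists.
 */
static unsigned long reclaimable_including_isolated(struct pglist_data *pgdat)
{
        unsigned long reclaimable;

        reclaimable  = node_page_state(pgdat, NR_INACTIVE_FILE);
        reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);
        /* isolated pages will either be reclaimed or put back */
        reclaimable += node_page_state(pgdat, NR_ISOLATED_FILE);
        reclaimable += node_page_state(pgdat, NR_ISOLATED_ANON);

        return reclaimable;
}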

--
Michal Hocko
SUSE Labs

2017-03-08 16:04:13

by Rik van Riel

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:

> > Could that create problems if we have many concurrent
> > reclaimers?
>
> As the changelog mentions it might cause a premature oom killer
> invocation theoretically. We could easily see that from the oom
> report
> by checking isolated counters. My testing didn't trigger that though
> and I was hammering the page allocator path from many threads.
>
> I suspect some artificial tests can trigger that, I am not so sure
> about
> reasonabel workloads. If we see this happening though then the fix
> would
> be to resurrect my previous attempt to track NR_ISOLATED* per zone
> and
> use them in the allocator retry logic.

I am not sure the workload in question is "artificial".
A heavily forking (or multi-threaded) server running out
of physical memory could easily get hundreds of tasks
doing direct reclaim simultaneously.

In fact, false OOM kills with that kind of workload are
how we ended up getting the "too many isolated" logic
in the first place.

I am perfectly fine with moving the retry logic up like
you did, but I think it may make sense to check the number
of reclaimable pages if we have too many isolated pages,
instead of risking a too-early OOM kill.

2017-03-09 09:23:52

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:
>
> > > Could that create problems if we have many concurrent
> > > reclaimers?
> >
> > As the changelog mentions it might cause a premature oom killer
> > invocation theoretically. We could easily see that from the oom
> > report
> > by checking isolated counters. My testing didn't trigger that though
> > and I was hammering the page allocator path from many threads.
> >
> > I suspect some artificial tests can trigger that, I am not so sure
> > about
> > reasonabel workloads. If we see this happening though then the fix
> > would
> > be to resurrect my previous attempt to track NR_ISOLATED* per zone
> > and
> > use them in the allocator retry logic.
>
> I am not sure the workload in question is "artificial".
> A heavily forking (or multi-threaded) server running out
> of physical memory could easily get hundreds of tasks
> doing direct reclaim simultaneously.

Yes, some of my OOM tests (fork many short-lived processes while there
is strong memory pressure and a lot of IO going on) are doing this, and
I haven't hit a premature OOM yet. It is hard to tune those tests to be
almost-but-not-quite OOM, though. Usually you either find a steady state
or really run out of memory.

> In fact, false OOM kills with that kind of workload is
> how we ended up getting the "too many isolated" logic
> in the first place.

Right, but the retry logic was considerably different from what we
have these days. should_reclaim_retry considers the amount of reclaimable
memory. As I've said earlier, if we see a report where the oom hits
prematurely with many NR_ISOLATED* pages, we know how to fix that.

> I am perfectly fine with moving the retry logic up like
> you did, but think it may make sense to check the number
> of reclaimable pages if we have too many isolated pages,
> instead of risking a too-early OOM kill.

Actually, that was my initial attempt, but for that we would need per-zone
NR_ISOLATED* counters. Mel was against that and wanted to start with a
simpler approach if it works reasonably well, which it seems to do from
my experience so far (but reality can surprise us, as I've seen so many
times already).
--
Michal Hocko
SUSE Labs

2017-03-09 14:16:30

by Rik van Riel

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> On Wed 08-03-17 10:54:57, Rik van Riel wrote:

> > In fact, false OOM kills with that kind of workload is
> > how we ended up getting the "too many isolated" logic
> > in the first place.
> Right, but the retry logic was considerably different than what we
> have these days. should_reclaim_retry considers amount of reclaimable
> memory. As I've said earlier if we see a report where the oom hits
> prematurely with many NR_ISOLATED* we know how to fix that.

Would it be enough to simply reset no_progress_loops
in this check inside should_reclaim_retry, if we know
pageout IO is pending?

                        if (!did_some_progress) {
                                unsigned long write_pending;

                                write_pending = zone_page_state_snapshot(zone,
                                                        NR_ZONE_WRITE_PENDING);

                                if (2 * write_pending > reclaimable) {
                                        congestion_wait(BLK_RW_ASYNC, HZ/10);
                                        return true;
                                }
                        }

--
All rights reversed



2017-03-09 14:32:28

by Mel Gorman

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue, Mar 07, 2017 at 02:30:57PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
>
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
>
> [1] http://lkml.kernel.org/r/[email protected]
> [2] http://lkml.kernel.org/r/[email protected]
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2017-03-09 14:59:42

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu 09-03-17 09:16:25, Rik van Riel wrote:
> On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> > On Wed 08-03-17 10:54:57, Rik van Riel wrote:
>
> > > In fact, false OOM kills with that kind of workload is
> > > how we ended up getting the "too many isolated" logic
> > > in the first place.
> > Right, but the retry logic was considerably different than what we
> > have these days. should_reclaim_retry considers amount of reclaimable
> > memory. As I've said earlier if we see a report where the oom hits
> > prematurely with many NR_ISOLATED* we know how to fix that.
>
> Would it be enough to simply reset no_progress_loops
> in this check inside should_reclaim_retry, if we know
> pageout IO is pending?
>
>                         if (!did_some_progress) {
>                                 unsigned long write_pending;
>
>                                 write_pending = zone_page_state_snapshot(zone,
>                                                         NR_ZONE_WRITE_PENDING);
>
>                                 if (2 * write_pending > reclaimable) {
>                                         congestion_wait(BLK_RW_ASYNC, HZ/10);
>                                         return true;
>                                 }
>                         }

I am not really sure what problem we are trying to solve right now, to be
honest. I would prefer to keep the logic simpler rather than over-engineer
something that is not even needed.

--
Michal Hocko
SUSE Labs

2017-03-09 18:11:42

by Johannes Weiner

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> It only does this to some extent.  If reclaim made
> no progress, for example due to immediately bailing
> out because the number of already isolated pages is
> too high (due to many parallel reclaimers), the code
> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> test without ever looking at the number of reclaimable
> pages.

Hm, there is no early return there, actually. We bump the loop counter
every time it happens, but then *do* look at the reclaimable pages.

> Could that create problems if we have many concurrent
> reclaimers?

With increased concurrency, the likelihood of OOM will go up if we
remove the unlimited wait for isolated pages, that much is true.

I'm not sure that's a bad thing, however, because we want the OOM
killer to be predictable and timely. So a reasonable wait time in
between 0 and forever before an allocating thread gives up under
extreme concurrency makes sense to me.

> It may be OK, I just do not understand all the implications.
>
> I like the general direction your patch takes the code in,
> but I would like to understand it better...

I feel the same way. The throttling logic doesn't seem to be very well
thought out at the moment, making it hard to reason about what happens
in certain scenarios.

In that sense, this patch isn't really an overall improvement to the
way things work. It patches a hole that seems to be exploitable only
from an artificial OOM torture test, at the risk of regressing high
concurrency workloads that may or may not be artificial.

Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
behind this patch. Can we think about a general model to deal with
allocation concurrency? Unlimited parallel direct reclaim is kinda
bonkers in the first place. How about checking for excessive isolation
counts from the page allocator and putting allocations on a waitqueue?
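
A rough sketch of that idea, purely for illustration (all names below are
invented; a real version would presumably use a per-node waitqueue and wake
waiters from the path that puts isolated pages back on the LRU):

/*
 * Illustrative only: allocating tasks wait until the number of isolated
 * pages drops, instead of each of them spinning on too_many_isolated()
 * inside the reclaim path.
 */
static DECLARE_WAIT_QUEUE_HEAD(isolated_wq);

static bool isolation_ok(struct pglist_data *pgdat)
{
        unsigned long inactive, isolated;

        inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
                   node_page_state(pgdat, NR_INACTIVE_ANON);
        isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
                   node_page_state(pgdat, NR_ISOLATED_ANON);

        return isolated <= inactive;
}

/* called from the allocator slow path before entering direct reclaim */
static int wait_for_isolation(struct pglist_data *pgdat)
{
        return wait_event_killable(isolated_wq, isolation_ok(pgdat));
}

/* called by reclaimers after putting isolated pages back on the LRU */
static void isolation_putback_done(void)
{
        if (waitqueue_active(&isolated_wq))
                wake_up(&isolated_wq);
}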

2017-03-09 22:18:18

by Rik van Riel

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> >
> > It only does this to some extent.  If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
> Hm, there is no early return there, actually. We bump the loop
> counter
> every time it happens, but then *do* look at the reclaimable pages.

Am I looking at an old tree?  I see this code
before we look at the reclaimable pages.

        /*
         * Make sure we converge to OOM if we cannot make any progress
         * several times in the row.
         */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust highatomic_reserve */
                return unreserve_highatomic_pageblock(ac, true);
        }

> > Could that create problems if we have many concurrent
> > reclaimers?
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.

That is a fair point; a faster OOM kill is preferable
to a system that is livelocked.

> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency? Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive
> isolation
> counts from the page allocator and putting allocations on a
> waitqueue?

The (limited) number of reclaimers can still do a
relatively fast OOM kill, if none of them manage
to make progress.

That should avoid the potential issue you and I
both pointed out, and, as a bonus, it might actually
be faster than letting all the tasks in the system
into the direct reclaim code simultaneously.

--
All rights reversed



2017-03-10 10:21:25

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > It only does this to some extent.  If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
>
> Hm, there is no early return there, actually. We bump the loop counter
> every time it happens, but then *do* look at the reclaimable pages.
>
> > Could that create problems if we have many concurrent
> > reclaimers?
>
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.
>
> > It may be OK, I just do not understand all the implications.
> >
> > I like the general direction your patch takes the code in,
> > but I would like to understand it better...
>
> I feel the same way. The throttling logic doesn't seem to be very well
> thought out at the moment, making it hard to reason about what happens
> in certain scenarios.
>
> In that sense, this patch isn't really an overall improvement to the
> way things work. It patches a hole that seems to be exploitable only
> from an artificial OOM torture test, at the risk of regressing high
> concurrency workloads that may or may not be artificial.
>
> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency?

I am definitely not against. There is no reason to rush the patch in.
My main point behind this patch was to reduce unbound loops inside
the reclaim path and push any throttling up the call chain to the
page allocator path, because I believe that it is easier to reason
about it at that level. The direct reclaim should be as simple as
possible, without too many side effects, otherwise we end up with highly
unpredictable behavior. This was a first step in that direction and my
testing so far didn't show any regressions.

> Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive isolation
> counts from the page allocator and putting allocations on a waitqueue?

I would be interested in details here.
--
Michal Hocko
SUSE Labs

2017-03-10 10:28:03

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu 09-03-17 17:18:00, Rik van Riel wrote:
> On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > >
> > > It only does this to some extent.  If reclaim made
> > > no progress, for example due to immediately bailing
> > > out because the number of already isolated pages is
> > > too high (due to many parallel reclaimers), the code
> > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > test without ever looking at the number of reclaimable
> > > pages.
> > Hm, there is no early return there, actually. We bump the loop
> > counter
> > every time it happens, but then *do* look at the reclaimable pages.
>
> Am I looking at an old tree?  I see this code
> before we look at the reclaimable pages.
>
>         /*
>          * Make sure we converge to OOM if we cannot make any progress
>          * several times in the row.
>          */
>         if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
>                 /* Before OOM, exhaust highatomic_reserve */
>                 return unreserve_highatomic_pageblock(ac, true);
>         }

I believe that Johannes meant cases where we do not exhaust all the
reclaim retries but instead fail early because there are no reclaimable
pages left when the watermark check is done.
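
As a simplified model of that check (the real logic lives in
should_reclaim_retry() and uses __zone_watermark_ok(); this only illustrates
why the loop can fail before MAX_RECLAIM_RETRIES is reached):

/* illustrative only, not the kernel code */
static bool worth_retrying(unsigned long free, unsigned long reclaimable,
                           unsigned long min_wmark)
{
        /*
         * If even reclaiming everything that looks reclaimable cannot get
         * us back above the watermark, give up right away instead of
         * burning through the remaining no_progress_loops iterations.
         */
        return free + reclaimable > min_wmark;
}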

> > > Could that create problems if we have many concurrent
> > > reclaimers?
> > With increased concurrency, the likelihood of OOM will go up if we
> > remove the unlimited wait for isolated pages, that much is true.
> >
> > I'm not sure that's a bad thing, however, because we want the OOM
> > killer to be predictable and timely. So a reasonable wait time in
> > between 0 and forever before an allocating thread gives up under
> > extreme concurrency makes sense to me.
>
> That is a fair point, a faster OOM kill is preferable
> to a system that is livelocked.
>
> > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > behind this patch. Can we think about a general model to deal with
> > allocation concurrency? Unlimited parallel direct reclaim is kinda
> > bonkers in the first place. How about checking for excessive
> > isolation
> > counts from the page allocator and putting allocations on a
> > waitqueue?
>
> The (limited) number of reclaimers can still do a
> relatively fast OOM kill, if none of them manage
> to make progress.

Well, we can estimate how much memory those relatively few reclaimers can
isolate and try to reclaim. Even if we have hundreds of them, which is
already a large number to me, we are at 100*SWAP_CLUSTER_MAX pages, which
is not all that much. And we are effectively OOM if there is no other
reclaimable memory left. All we need is to put some upper bound in place.
We already have throttle_direct_reclaim but it doesn't really throttle the
maximum number of reclaimers.
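
For scale: SWAP_CLUSTER_MAX is 32, so a hundred concurrent reclaimers pin at
most around 3200 pages, i.e. roughly 12.5MB with 4kB pages. A hypothetical
upper bound could be as simple as the sketch below (names made up; this is
not what throttle_direct_reclaim does today):

/*
 * Made-up illustration of bounding the number of direct reclaimers.
 * throttle_direct_reclaim() currently throttles on the pfmemalloc
 * watermarks and does not limit the number of concurrent reclaimers.
 */
static atomic_t nr_direct_reclaimers = ATOMIC_INIT(0);
#define MAX_DIRECT_RECLAIMERS   100

static bool try_enter_direct_reclaim(void)
{
        if (atomic_inc_return(&nr_direct_reclaimers) > MAX_DIRECT_RECLAIMERS) {
                atomic_dec(&nr_direct_reclaimers);
                return false;   /* caller waits or retries the allocation */
        }
        return true;
}

static void leave_direct_reclaim(void)
{
        atomic_dec(&nr_direct_reclaimers);
}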
--
Michal Hocko
SUSE Labs

2017-03-10 11:45:26

by Tetsuo Handa

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Michal Hocko wrote:
> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > It only does this to some extent. If reclaim made
> > > no progress, for example due to immediately bailing
> > > out because the number of already isolated pages is
> > > too high (due to many parallel reclaimers), the code
> > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > test without ever looking at the number of reclaimable
> > > pages.
> >
> > Hm, there is no early return there, actually. We bump the loop counter
> > every time it happens, but then *do* look at the reclaimable pages.
> >
> > > Could that create problems if we have many concurrent
> > > reclaimers?
> >
> > With increased concurrency, the likelihood of OOM will go up if we
> > remove the unlimited wait for isolated pages, that much is true.
> >
> > I'm not sure that's a bad thing, however, because we want the OOM
> > killer to be predictable and timely. So a reasonable wait time in
> > between 0 and forever before an allocating thread gives up under
> > extreme concurrency makes sense to me.
> >
> > > It may be OK, I just do not understand all the implications.
> > >
> > > I like the general direction your patch takes the code in,
> > > but I would like to understand it better...
> >
> > I feel the same way. The throttling logic doesn't seem to be very well
> > thought out at the moment, making it hard to reason about what happens
> > in certain scenarios.
> >
> > In that sense, this patch isn't really an overall improvement to the
> > way things work. It patches a hole that seems to be exploitable only
> > from an artificial OOM torture test, at the risk of regressing high
> > concurrency workloads that may or may not be artificial.
> >
> > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > behind this patch. Can we think about a general model to deal with
> > allocation concurrency?
>
> I am definitely not against. There is no reason to rush the patch in.

I'm not in a hurry if we can check, using a watchdog, whether this problem
is occurring in the real world. I have to test corner cases because such a
watchdog is missing.

> My main point behind this patch was to reduce unbound loops from inside
> the reclaim path and push any throttling up the call chain to the
> page allocator path because I believe that it is easier to reason
> about them at that level. The direct reclaim should be as simple as
> possible without too many side effects otherwise we end up in a highly
> unpredictable behavior. This was a first step in that direction and my
> testing so far didn't show any regressions.
>
> > Unlimited parallel direct reclaim is kinda
> > bonkers in the first place. How about checking for excessive isolation
> > counts from the page allocator and putting allocations on a waitqueue?
>
> I would be interested in details here.

That will help implementing __GFP_KILLABLE.
https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15

2017-03-21 10:37:43

by Tetsuo Handa

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On 2017/03/10 20:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
>>>> It may be OK, I just do not understand all the implications.
>>>>
>>>> I like the general direction your patch takes the code in,
>>>> but I would like to understand it better...
>>>
>>> I feel the same way. The throttling logic doesn't seem to be very well
>>> thought out at the moment, making it hard to reason about what happens
>>> in certain scenarios.
>>>
>>> In that sense, this patch isn't really an overall improvement to the
>>> way things work. It patches a hole that seems to be exploitable only
>>> from an artificial OOM torture test, at the risk of regressing high
>>> concurrency workloads that may or may not be artificial.
>>>
>>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
>>> behind this patch. Can we think about a general model to deal with
>>> allocation concurrency?
>>
>> I am definitely not against. There is no reason to rush the patch in.
>
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.

Today I tested linux-next-20170321 with not-so-insane stress, and I again
hit this problem. Thus, I think this problem might occur in the real world.

http://I-love.SAKURA.ne.jp/tmp/serial-20170321.txt.xz (Logs up to before swapoff are eliminated.)
----------
[ 2250.175109] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2257.535653] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2319.806880] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2320.722282] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2381.243393] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2389.777052] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2450.878287] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2459.386321] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2520.500633] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2529.042088] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
----------

2017-04-23 10:24:34

by Tetsuo Handa

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On 2017/03/10 20:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
>>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
>>>> It only does this to some extent. If reclaim made
>>>> no progress, for example due to immediately bailing
>>>> out because the number of already isolated pages is
>>>> too high (due to many parallel reclaimers), the code
>>>> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
>>>> test without ever looking at the number of reclaimable
>>>> pages.
>>>
>>> Hm, there is no early return there, actually. We bump the loop counter
>>> every time it happens, but then *do* look at the reclaimable pages.
>>>
>>>> Could that create problems if we have many concurrent
>>>> reclaimers?
>>>
>>> With increased concurrency, the likelihood of OOM will go up if we
>>> remove the unlimited wait for isolated pages, that much is true.
>>>
>>> I'm not sure that's a bad thing, however, because we want the OOM
>>> killer to be predictable and timely. So a reasonable wait time in
>>> between 0 and forever before an allocating thread gives up under
>>> extreme concurrency makes sense to me.
>>>
>>>> It may be OK, I just do not understand all the implications.
>>>>
>>>> I like the general direction your patch takes the code in,
>>>> but I would like to understand it better...
>>>
>>> I feel the same way. The throttling logic doesn't seem to be very well
>>> thought out at the moment, making it hard to reason about what happens
>>> in certain scenarios.
>>>
>>> In that sense, this patch isn't really an overall improvement to the
>>> way things work. It patches a hole that seems to be exploitable only
>>> from an artificial OOM torture test, at the risk of regressing high
>>> concurrency workloads that may or may not be artificial.
>>>
>>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
>>> behind this patch. Can we think about a general model to deal with
>>> allocation concurrency?
>>
>> I am definitely not against. There is no reason to rush the patch in.
>
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.
>
Ping?

This problem can occur even immediately after the first invocation of
the OOM killer. I believe this problem can occur in the real world.
When are we going to apply this patch or the watchdog patch?

----------------------------------------
[ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
(...snipped...)
CentOS Linux 7 (Core)
Kernel 4.11.0-rc7-next-20170421+ on an x86_64

ccsecurity login: [ 32.406723] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 32.852917] Ebtables v2.0 registered
[ 33.034402] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 33.467929] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 33.475728] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready
[ 33.478910] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready
[ 33.950365] Netfilter messages via NETLINK v0.30.
[ 33.983449] ip_set: protocol 6
[ 37.335966] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 37.337587] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready
[ 37.339925] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready
[ 39.940942] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead.
[ 98.926202] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[ 98.932977] a.out cpuset=/ mems_allowed=0
[ 98.934780] CPU: 1 PID: 2972 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588
[ 98.937988] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 98.942193] Call Trace:
[ 98.942942] ? dump_stack+0x5c/0x7d
[ 98.943907] ? dump_header+0x97/0x233
[ 98.945334] ? ktime_get+0x30/0x90
[ 98.946290] ? delayacct_end+0x35/0x60
[ 98.947319] ? do_try_to_free_pages+0x2ca/0x370
[ 98.948554] ? oom_kill_process+0x223/0x3e0
[ 98.949715] ? has_capability_noaudit+0x17/0x20
[ 98.950948] ? oom_badness+0xeb/0x160
[ 98.951962] ? out_of_memory+0x10b/0x490
[ 98.953030] ? __alloc_pages_slowpath+0x701/0x8e2
[ 98.954313] ? __alloc_pages_nodemask+0x1ed/0x210
[ 98.956242] ? alloc_pages_vma+0x9f/0x220
[ 98.957486] ? __handle_mm_fault+0xc22/0x11e0
[ 98.958673] ? handle_mm_fault+0xc5/0x220
[ 98.959766] ? __do_page_fault+0x21e/0x4b0
[ 98.960906] ? do_page_fault+0x2b/0x70
[ 98.961977] ? page_fault+0x28/0x30
[ 98.963861] Mem-Info:
[ 98.965330] active_anon:372765 inactive_anon:2097 isolated_anon:0
[ 98.965330] active_file:182 inactive_file:214 isolated_file:32
[ 98.965330] unevictable:0 dirty:6 writeback:6 unstable:0
[ 98.965330] slab_reclaimable:2011 slab_unreclaimable:11291
[ 98.965330] mapped:623 shmem:2162 pagetables:8582 bounce:0
[ 98.965330] free:13278 free_pcp:117 free_cma:0
[ 98.978473] Node 0 active_anon:1491060kB inactive_anon:8388kB active_file:728kB inactive_file:856kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2492kB dirty:24kB writeback:24kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1241088kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 98.987555] Node 0 DMA free:7176kB min:408kB low:508kB high:608kB active_anon:8672kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 98.998904] lowmem_reserve[]: 0 1696 1696 1696
[ 99.001205] Node 0 DMA32 free:45936kB min:44644kB low:55804kB high:66964kB active_anon:1482048kB inactive_anon:8388kB active_file:232kB inactive_file:1000kB unevictable:0kB writepending:48kB present:2080640kB managed:1756232kB mlocked:0kB slab_reclaimable:8044kB slab_unreclaimable:45132kB kernel_stack:22128kB pagetables:34304kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
[ 99.009428] lowmem_reserve[]: 0 0 0 0
[ 99.010816] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 2*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (M) 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7176kB
[ 99.014262] Node 0 DMA32: 909*4kB (UE) 548*8kB (UME) 190*16kB (UME) 99*32kB (UME) 37*64kB (UME) 14*128kB (UME) 5*256kB (UME) 3*512kB (E) 2*1024kB (UM) 1*2048kB (M) 5*4096kB (M) = 45780kB
[ 99.018848] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 99.021288] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 99.023758] 2752 total pagecache pages
[ 99.025196] 0 pages in swap cache
[ 99.026538] Swap cache stats: add 0, delete 0, find 0/0
[ 99.028521] Free swap = 0kB
[ 99.029923] Total swap = 0kB
[ 99.031212] 524157 pages RAM
[ 99.032458] 0 pages HighMem/MovableOnly
[ 99.033812] 81123 pages reserved
[ 99.035255] 0 pages cma reserved
[ 99.036729] 0 pages hwpoisoned
[ 99.037898] Out of memory: Kill process 2973 (a.out) score 999 or sacrifice child
[ 99.039902] Killed process 2973 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 99.043953] oom_reaper: reaped process 2973 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 173.285686] sysrq: SysRq : Show State
(...snipped...)
[ 173.899630] kswapd0 D 0 51 2 0x00000000
[ 173.900935] Call Trace:
[ 173.901706] ? __schedule+0x1d2/0x5a0
[ 173.902906] ? schedule+0x2d/0x80
[ 173.904034] ? schedule_timeout+0x192/0x240
[ 173.905437] ? __down_common+0xc0/0x128
[ 173.906549] ? down+0x36/0x40
[ 173.907433] ? xfs_buf_lock+0x1d/0x40 [xfs]
[ 173.908574] ? _xfs_buf_find+0x2ad/0x580 [xfs]
[ 173.909734] ? xfs_buf_get_map+0x1d/0x140 [xfs]
[ 173.910886] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 173.912045] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 173.913381] ? xfs_read_agf+0x8d/0x120 [xfs]
[ 173.914725] ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[ 173.916225] ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[ 173.917491] ? __radix_tree_lookup+0x80/0xf0
[ 173.918593] ? __radix_tree_lookup+0x80/0xf0
[ 173.920091] ? xfs_alloc_vextent+0x148/0x460 [xfs]
[ 173.921549] ? xfs_bmap_btalloc+0x45e/0x8a0 [xfs]
[ 173.922728] ? xfs_bmapi_write+0x768/0x1250 [xfs]
[ 173.923904] ? kmem_cache_alloc+0x11c/0x130
[ 173.925030] ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[ 173.926592] ? xfs_map_blocks+0x181/0x230 [xfs]
[ 173.927854] ? xfs_do_writepage+0x1db/0x630 [xfs]
[ 173.929046] ? xfs_setfilesize_trans_alloc.isra.26+0x35/0x80 [xfs]
[ 173.930665] ? xfs_vm_writepage+0x31/0x70 [xfs]
[ 173.931915] ? pageout.isra.47+0x188/0x280
[ 173.933005] ? shrink_page_list+0x79d/0xbb0
[ 173.934138] ? shrink_inactive_list+0x1c2/0x3d0
[ 173.935609] ? radix_tree_gang_lookup_tag+0xe3/0x160
[ 173.937100] ? shrink_node_memcg+0x33a/0x740
[ 173.938335] ? _cond_resched+0x10/0x20
[ 173.939443] ? _cond_resched+0x10/0x20
[ 173.940470] ? shrink_node+0xe0/0x320
[ 173.941483] ? kswapd+0x2b4/0x660
[ 173.942424] ? kthread+0xf2/0x130
[ 173.943396] ? mem_cgroup_shrink_node+0xb0/0xb0
[ 173.944578] ? kthread_park+0x60/0x60
[ 173.945613] ? ret_from_fork+0x26/0x40
(...snipped...)
[ 195.183281] Showing busy workqueues and worker pools:
[ 195.184626] workqueue events_freezable_power_: flags=0x84
[ 195.186013] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 195.187596] in-flight: 24:disk_events_workfn
[ 195.188832] workqueue writeback: flags=0x4e
[ 195.189919] pwq 256: cpus=0-127 flags=0x4 nice=0 active=1/256
[ 195.191826] in-flight: 370:wb_workfn
[ 195.194105] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 129 63
[ 195.195883] pool 256: cpus=0-127 flags=0x4 nice=0 hung=96s workers=31 idle: 371 369 368 367 366 365 364 363 362 361 360 359 358 357 356 355 354 353 352 351 350 349 348 347 346 249 253 5 53 372
[ 243.365293] sysrq: SysRq : Resetting
----------------------------------------

2017-04-24 12:40:04

by Stanislaw Gruszka

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> On 2017/03/10 20:44, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> >>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> >>>> It only does this to some extent. If reclaim made
> >>>> no progress, for example due to immediately bailing
> >>>> out because the number of already isolated pages is
> >>>> too high (due to many parallel reclaimers), the code
> >>>> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> >>>> test without ever looking at the number of reclaimable
> >>>> pages.
> >>>
> >>> Hm, there is no early return there, actually. We bump the loop counter
> >>> every time it happens, but then *do* look at the reclaimable pages.
> >>>
> >>>> Could that create problems if we have many concurrent
> >>>> reclaimers?
> >>>
> >>> With increased concurrency, the likelihood of OOM will go up if we
> >>> remove the unlimited wait for isolated pages, that much is true.
> >>>
> >>> I'm not sure that's a bad thing, however, because we want the OOM
> >>> killer to be predictable and timely. So a reasonable wait time in
> >>> between 0 and forever before an allocating thread gives up under
> >>> extreme concurrency makes sense to me.
> >>>
> >>>> It may be OK, I just do not understand all the implications.
> >>>>
> >>>> I like the general direction your patch takes the code in,
> >>>> but I would like to understand it better...
> >>>
> >>> I feel the same way. The throttling logic doesn't seem to be very well
> >>> thought out at the moment, making it hard to reason about what happens
> >>> in certain scenarios.
> >>>
> >>> In that sense, this patch isn't really an overall improvement to the
> >>> way things work. It patches a hole that seems to be exploitable only
> >>> from an artificial OOM torture test, at the risk of regressing high
> >>> concurrency workloads that may or may not be artificial.
> >>>
> >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> >>> behind this patch. Can we think about a general model to deal with
> >>> allocation concurrency?
> >>
> >> I am definitely not against. There is no reason to rush the patch in.
> >
> > I don't hurry if we can check using watchdog whether this problem is occurring
> > in the real world. I have to test corner cases because watchdog is missing.
> >
> Ping?
>
> This problem can occur even immediately after the first invocation of
> the OOM killer. I believe this problem can occur in the real world.
> When are we going to apply this patch or watchdog patch?
>
> ----------------------------------------
> [ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1

Are you debugging a memory corruption problem?

FWIW, if you use debug_guardpage_minorder= you can expect all sorts of
memory allocation problems. This option is intended for debugging memory
corruption bugs and it shrinks the available memory in an artificial way.
Given that, I don't think it is reasonable to justify any patch by a
problem that happened while debug_guardpage_minorder= was used.

Stanislaw

2017-04-24 13:06:43

by Tetsuo Handa

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Stanislaw Gruszka wrote:
> On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > >> I am definitely not against. There is no reason to rush the patch in.
> > >
> > > I don't hurry if we can check using watchdog whether this problem is occurring
> > > in the real world. I have to test corner cases because watchdog is missing.
> > >
> > Ping?
> >
> > This problem can occur even immediately after the first invocation of
> > the OOM killer. I believe this problem can occur in the real world.
> > When are we going to apply this patch or watchdog patch?
> >
> > ----------------------------------------
> > [ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> > [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
>
> Are you debugging memory corruption problem?

No. Just random testing, trying to find out how we can avoid flooding of
warn_alloc_stall() warning messages while also avoiding ratelimiting.

>
> FWIW, if you use debug_guardpage_minorder= you can expect any
> allocation memory problems. This option is intended to debug
> memory corruption bugs and it shrinks available memory in
> artificial way. Taking that, I don't think justifying any
> patch, by problem happened when debug_guardpage_minorder= is
> used, is reasonable.
>
> Stanislaw

This problem occurs without the debug_guardpage_minorder= parameter as well:

----------
[ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
(...snipped...)
CentOS Linux 7 (Core)
Kernel 4.11.0-rc7-next-20170421+ on an x86_64

ccsecurity login: [ 31.882531] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 32.550187] Ebtables v2.0 registered
[ 32.730371] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 32.926518] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready
[ 32.928310] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 32.930960] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready
[ 33.741378] Netfilter messages via NETLINK v0.30.
[ 33.807350] ip_set: protocol 6
[ 37.581002] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead.
[ 38.072689] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready
[ 38.074419] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 38.077222] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready
[ 92.753140] gmain invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[ 92.763445] gmain cpuset=/ mems_allowed=0
[ 92.767634] CPU: 2 PID: 2733 Comm: gmain Not tainted 4.11.0-rc7-next-20170421+ #588
[ 92.773624] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 92.781790] Call Trace:
[ 92.782630] ? dump_stack+0x5c/0x7d
[ 92.783902] ? dump_header+0x97/0x233
[ 92.785427] ? ktime_get+0x30/0x90
[ 92.786390] ? delayacct_end+0x35/0x60
[ 92.787433] ? do_try_to_free_pages+0x2ca/0x370
[ 92.789157] ? oom_kill_process+0x223/0x3e0
[ 92.790502] ? has_capability_noaudit+0x17/0x20
[ 92.791761] ? oom_badness+0xeb/0x160
[ 92.792783] ? out_of_memory+0x10b/0x490
[ 92.793872] ? __alloc_pages_slowpath+0x701/0x8e2
[ 92.795603] ? __alloc_pages_nodemask+0x1ed/0x210
[ 92.796902] ? alloc_pages_current+0x7a/0x100
[ 92.798115] ? filemap_fault+0x2e9/0x5e0
[ 92.799204] ? filemap_map_pages+0x185/0x3a0
[ 92.800402] ? xfs_filemap_fault+0x2f/0x50 [xfs]
[ 92.801678] ? __do_fault+0x15/0x70
[ 92.802651] ? __handle_mm_fault+0xb0f/0x11e0
[ 92.805141] ? handle_mm_fault+0xc5/0x220
[ 92.807261] ? __do_page_fault+0x21e/0x4b0
[ 92.809203] ? do_page_fault+0x2b/0x70
[ 92.811018] ? do_syscall_64+0x137/0x140
[ 92.812554] ? page_fault+0x28/0x30
[ 92.813855] Mem-Info:
[ 92.815009] active_anon:437483 inactive_anon:2097 isolated_anon:0
[ 92.815009] active_file:0 inactive_file:104 isolated_file:41
[ 92.815009] unevictable:0 dirty:10 writeback:0 unstable:0
[ 92.815009] slab_reclaimable:2439 slab_unreclaimable:11018
[ 92.815009] mapped:405 shmem:2162 pagetables:8704 bounce:0
[ 92.815009] free:13168 free_pcp:58 free_cma:0
[ 92.825444] Node 0 active_anon:1749932kB inactive_anon:8388kB active_file:0kB inactive_file:592kB unevictable:0kB isolated(anon):0kB isolated(file):164kB mapped:1620kB dirty:40kB writeback:0kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1519616kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 92.832175] Node 0 DMA free:8148kB min:352kB low:440kB high:528kB active_anon:7696kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 92.840217] lowmem_reserve[]: 0 1952 1952 1952
[ 92.841799] Node 0 DMA32 free:45028kB min:44700kB low:55872kB high:67044kB active_anon:1742236kB inactive_anon:8388kB active_file:0kB inactive_file:992kB unevictable:0kB writepending:40kB present:2080640kB managed:2018376kB mlocked:0kB slab_reclaimable:9756kB slab_unreclaimable:44040kB kernel_stack:22192kB pagetables:34788kB bounce:0kB free_pcp:672kB local_pcp:0kB free_cma:0kB
[ 92.850458] lowmem_reserve[]: 0 0 0 0
[ 92.851881] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (M) 2*32kB (UM) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8148kB
[ 92.855530] Node 0 DMA32: 1023*4kB (UME) 591*8kB (UME) 220*16kB (UME) 223*32kB (UME) 156*64kB (UME) 38*128kB (UME) 12*256kB (UME) 10*512kB (UME) 2*1024kB (M) 0*2048kB 0*4096kB = 44564kB
[ 92.860735] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 92.863216] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 92.865714] 2994 total pagecache pages
[ 92.867201] 0 pages in swap cache
[ 92.868575] Swap cache stats: add 0, delete 0, find 0/0
[ 92.870309] Free swap = 0kB
[ 92.871579] Total swap = 0kB
[ 92.873000] 524157 pages RAM
[ 92.874351] 0 pages HighMem/MovableOnly
[ 92.875809] 15587 pages reserved
[ 92.877151] 0 pages cma reserved
[ 92.878513] 0 pages hwpoisoned
[ 92.879948] Out of memory: Kill process 2983 (a.out) score 998 or sacrifice child
[ 92.882182] Killed process 2983 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 92.886190] oom_reaper: reaped process 2983 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 96.072996] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[ 96.076683] a.out cpuset=/ mems_allowed=0
[ 96.078329] CPU: 3 PID: 2982 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588
[ 96.080583] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 96.083254] Call Trace:
[ 96.084404] ? dump_stack+0x5c/0x7d
[ 96.085855] ? dump_header+0x97/0x233
[ 96.087393] ? oom_kill_process+0x223/0x3e0
[ 96.089059] ? has_capability_noaudit+0x17/0x20
[ 96.090567] ? oom_badness+0xeb/0x160
[ 96.092133] ? out_of_memory+0x10b/0x490
[ 96.093920] ? __alloc_pages_slowpath+0x701/0x8e2
[ 96.095732] ? __alloc_pages_nodemask+0x1ed/0x210
[ 96.097544] ? alloc_pages_vma+0x9f/0x220
[ 96.099133] ? __handle_mm_fault+0xc22/0x11e0
[ 96.100668] ? handle_mm_fault+0xc5/0x220
[ 96.102387] ? __do_page_fault+0x21e/0x4b0
[ 96.103824] ? do_page_fault+0x2b/0x70
[ 96.105351] ? page_fault+0x28/0x30
[ 96.106759] Mem-Info:
[ 96.107908] active_anon:438003 inactive_anon:2097 isolated_anon:0
[ 96.107908] active_file:91 inactive_file:265 isolated_file:6
[ 96.107908] unevictable:0 dirty:1 writeback:121 unstable:0
[ 96.107908] slab_reclaimable:2439 slab_unreclaimable:11273
[ 96.107908] mapped:382 shmem:2162 pagetables:8698 bounce:0
[ 96.107908] free:13166 free_pcp:0 free_cma:0
[ 96.119325] Node 0 active_anon:1752012kB inactive_anon:8388kB active_file:364kB inactive_file:1060kB unevictable:0kB isolated(anon):0kB isolated(file):24kB mapped:1528kB dirty:4kB writeback:484kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1519616kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 96.125753] Node 0 DMA free:8148kB min:352kB low:440kB high:528kB active_anon:7696kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 96.133203] lowmem_reserve[]: 0 1952 1952 1952
[ 96.135013] Node 0 DMA32 free:44516kB min:44700kB low:55872kB high:67044kB active_anon:1743720kB inactive_anon:8388kB active_file:336kB inactive_file:792kB unevictable:0kB writepending:488kB present:2080640kB managed:2018376kB mlocked:0kB slab_reclaimable:9756kB slab_unreclaimable:45060kB kernel_stack:22192kB pagetables:34764kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 96.143814] lowmem_reserve[]: 0 0 0 0
[ 96.145371] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (M) 2*32kB (UM) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8148kB
[ 96.148956] Node 0 DMA32: 1052*4kB (UME) 599*8kB (UME) 212*16kB (UME) 237*32kB (UME) 155*64kB (UME) 39*128kB (UME) 12*256kB (UME) 10*512kB (UME) 2*1024kB (M) 0*2048kB 0*4096kB = 45128kB
[ 96.153861] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 96.156374] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 96.158817] 2598 total pagecache pages
[ 96.160434] 0 pages in swap cache
[ 96.161904] Swap cache stats: add 0, delete 0, find 0/0
[ 96.163762] Free swap = 0kB
[ 96.165142] Total swap = 0kB
[ 96.166507] 524157 pages RAM
[ 96.167839] 0 pages HighMem/MovableOnly
[ 96.169374] 15587 pages reserved
[ 96.170834] 0 pages cma reserved
[ 96.172247] 0 pages hwpoisoned
[ 96.173569] Out of memory: Kill process 2984 (a.out) score 998 or sacrifice child
[ 96.176242] Killed process 2984 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 96.182342] oom_reaper: reaped process 2984 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 242.498498] sysrq: SysRq : Show State
[ 242.503329] task PC stack pid father
[ 242.509822] systemd D 0 1 0 0x00000000
[ 242.515791] Call Trace:
[ 242.519807] ? __schedule+0x1d2/0x5a0
[ 242.526263] ? schedule+0x2d/0x80
[ 242.530940] ? schedule_timeout+0x16d/0x240
[ 242.536135] ? del_timer_sync+0x40/0x40
[ 242.541458] ? io_schedule_timeout+0x14/0x40
[ 242.543661] ? congestion_wait+0x79/0xd0
[ 242.545748] ? prepare_to_wait_event+0xf0/0xf0
[ 242.548051] ? shrink_inactive_list+0x388/0x3d0
[ 242.550323] ? shrink_node_memcg+0x33a/0x740
[ 242.552505] ? _cond_resched+0x10/0x20
[ 242.554743] ? _cond_resched+0x10/0x20
[ 242.556952] ? shrink_node+0xe0/0x320
[ 242.558962] ? do_try_to_free_pages+0xdc/0x370
[ 242.561168] ? try_to_free_pages+0xbe/0x100
[ 242.563309] ? __alloc_pages_slowpath+0x387/0x8e2
[ 242.565581] ? __wake_up_common+0x4c/0x80
[ 242.567759] ? __alloc_pages_nodemask+0x1ed/0x210
[ 242.570064] ? alloc_pages_current+0x7a/0x100
[ 242.572092] ? __do_page_cache_readahead+0xe9/0x250
[ 242.573707] ? radix_tree_lookup_slot+0x1e/0x50
[ 242.575081] ? find_get_entry+0x14/0x100
[ 242.576414] ? pagecache_get_page+0x21/0x200
[ 242.577678] ? filemap_fault+0x23a/0x5e0
[ 242.578859] ? filemap_map_pages+0x185/0x3a0
[ 242.580093] ? xfs_filemap_fault+0x2f/0x50 [xfs]
[ 242.581398] ? __do_fault+0x15/0x70
[ 242.582468] ? __handle_mm_fault+0xb0f/0x11e0
[ 242.583665] ? ep_ptable_queue_proc+0x90/0x90
[ 242.584831] ? handle_mm_fault+0xc5/0x220
[ 242.585993] ? __do_page_fault+0x21e/0x4b0
[ 242.587257] ? do_page_fault+0x2b/0x70
[ 242.589145] ? page_fault+0x28/0x30
(...snipped...)
[ 243.105826] kswapd0 D 0 51 2 0x00000000
[ 243.107344] Call Trace:
[ 243.108113] ? __schedule+0x1d2/0x5a0
[ 243.109114] ? schedule+0x2d/0x80
[ 243.110052] ? schedule_timeout+0x192/0x240
[ 243.111190] ? check_preempt_curr+0x7f/0x90
[ 243.112260] ? __down_common+0xc0/0x128
[ 243.113329] ? down+0x36/0x40
[ 243.114296] ? xfs_buf_lock+0x1d/0x40 [xfs]
[ 243.115473] ? _xfs_buf_find+0x2ad/0x580 [xfs]
[ 243.116785] ? xfs_buf_get_map+0x1d/0x140 [xfs]
[ 243.118052] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 243.119310] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 243.120655] ? _cond_resched+0x10/0x20
[ 243.122831] ? xfs_read_agf+0x8d/0x120 [xfs]
[ 243.124181] ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[ 243.125616] ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[ 243.127093] ? __radix_tree_lookup+0x80/0xf0
[ 243.128235] ? __radix_tree_lookup+0x80/0xf0
[ 243.129357] ? xfs_alloc_vextent+0x148/0x460 [xfs]
[ 243.130596] ? xfs_bmap_btalloc+0x45e/0x8a0 [xfs]
[ 243.131804] ? xfs_bmapi_write+0x768/0x1250 [xfs]
[ 243.133032] ? kmem_cache_alloc+0x11c/0x130
[ 243.134160] ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[ 243.135503] ? xfs_map_blocks+0x181/0x230 [xfs]
[ 243.136802] ? xfs_do_writepage+0x1db/0x630 [xfs]
[ 243.138030] ? xfs_vm_writepage+0x31/0x70 [xfs]
[ 243.139396] ? pageout.isra.47+0x188/0x280
[ 243.140490] ? shrink_page_list+0x79d/0xbb0
[ 243.141619] ? shrink_inactive_list+0x1c2/0x3d0
[ 243.142831] ? radix_tree_gang_lookup_tag+0xe3/0x160
[ 243.144072] ? shrink_node_memcg+0x33a/0x740
[ 243.145188] ? _cond_resched+0x10/0x20
[ 243.146410] ? _cond_resched+0x10/0x20
[ 243.147746] ? shrink_node+0xe0/0x320
[ 243.148754] ? kswapd+0x2b4/0x660
[ 243.149691] ? kthread+0xf2/0x130
[ 243.150690] ? mem_cgroup_shrink_node+0xb0/0xb0
[ 243.151887] ? kthread_park+0x60/0x60
[ 243.152909] ? ret_from_fork+0x26/0x40
(...snipped...)
[ 273.216540] Showing busy workqueues and worker pools:
[ 273.218084] workqueue events_freezable_power_: flags=0x84
[ 273.219707] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 273.221259] in-flight: 381:disk_events_workfn
[ 273.222576] workqueue writeback: flags=0x4e
[ 273.223721] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 273.225240] in-flight: 344:wb_workfn wb_workfn
[ 273.227485] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 63 17
[ 273.229266] pool 256: cpus=0-127 flags=0x4 nice=0 hung=180s workers=34 idle: 343 342 341 340 339 338 337 336 335 334 333 332 331 329 330 328 327 326 325 324 323 322 321 320 319 318 317 248 280 53 345 5 348
[ 340.690056] sysrq: SysRq : Resetting
----------

This problem also occurs with only 4 parallel writers.

----------
[ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
(...snipped...)
[ 383.692506] Out of memory: Kill process 3391 (a.out) score 999 or sacrifice child
[ 383.694476] Killed process 3391 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 383.699008] oom_reaper: reaped process 3391 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 445.711383] sysrq: SysRq : Show State
[ 445.718193] task PC stack pid father
(...snipped...)
[ 446.272860] kswapd0 D 0 51 2 0x00000000
[ 446.274148] Call Trace:
[ 446.274890] ? __schedule+0x1d2/0x5a0
[ 446.275847] ? schedule+0x2d/0x80
[ 446.276736] ? rwsem_down_read_failed+0x108/0x180
[ 446.278223] ? call_rwsem_down_read_failed+0x14/0x30
[ 446.280076] ? down_read+0x17/0x30
[ 446.281297] ? xfs_map_blocks+0x8f/0x230 [xfs]
[ 446.282685] ? xfs_do_writepage+0x1db/0x630 [xfs]
[ 446.283985] ? xfs_vm_writepage+0x31/0x70 [xfs]
[ 446.285124] ? pageout.isra.47+0x188/0x280
[ 446.286192] ? shrink_page_list+0x79d/0xbb0
[ 446.287296] ? shrink_inactive_list+0x1c2/0x3d0
[ 446.288442] ? radix_tree_gang_lookup_tag+0xe3/0x160
[ 446.289808] ? shrink_node_memcg+0x33a/0x740
[ 446.291027] ? _cond_resched+0x10/0x20
[ 446.292038] ? _cond_resched+0x10/0x20
[ 446.293089] ? shrink_node+0xe0/0x320
[ 446.294069] ? kswapd+0x2b4/0x660
[ 446.295036] ? kthread+0xf2/0x130
[ 446.296211] ? mem_cgroup_shrink_node+0xb0/0xb0
[ 446.297367] ? kthread_park+0x60/0x60
[ 446.298353] ? ret_from_fork+0x26/0x40
(...snipped...)
[ 448.285791] a.out D 0 3387 2847 0x00000080
[ 448.287194] Call Trace:
[ 448.287975] ? __schedule+0x1d2/0x5a0
[ 448.288975] ? schedule+0x2d/0x80
[ 448.289910] ? schedule_timeout+0x16d/0x240
[ 448.291072] ? del_timer_sync+0x40/0x40
[ 448.292097] ? io_schedule_timeout+0x14/0x40
[ 448.293294] ? congestion_wait+0x79/0xd0
[ 448.294327] ? prepare_to_wait_event+0xf0/0xf0
[ 448.295476] ? shrink_inactive_list+0x388/0x3d0
[ 448.296650] ? shrink_node_memcg+0x33a/0x740
[ 448.298016] ? _cond_resched+0x10/0x20
[ 448.299027] ? _cond_resched+0x10/0x20
[ 448.300032] ? shrink_node+0xe0/0x320
[ 448.301068] ? do_try_to_free_pages+0xdc/0x370
[ 448.302247] ? try_to_free_pages+0xbe/0x100
[ 448.303325] ? __alloc_pages_slowpath+0x387/0x8e2
[ 448.304492] ? __lock_page_or_retry+0x1b8/0x300
[ 448.305628] ? __alloc_pages_nodemask+0x1ed/0x210
[ 448.306809] ? alloc_pages_vma+0x9f/0x220
[ 448.307874] ? __handle_mm_fault+0xc22/0x11e0
[ 448.308984] ? handle_mm_fault+0xc5/0x220
[ 448.310228] ? __do_page_fault+0x21e/0x4b0
[ 448.311500] ? do_page_fault+0x2b/0x70
[ 448.312609] ? page_fault+0x28/0x30
[ 448.313926] a.out D 0 3388 3387 0x00000086
[ 448.315461] Call Trace:
[ 448.316339] ? __schedule+0x1d2/0x5a0
[ 448.317348] ? schedule+0x2d/0x80
[ 448.318291] ? schedule_timeout+0x192/0x240
[ 448.319372] ? sched_clock_cpu+0xc/0xa0
[ 448.320417] ? __down_common+0xc0/0x128
[ 448.321583] ? down+0x36/0x40
[ 448.322463] ? xfs_buf_lock+0x1d/0x40 [xfs]
[ 448.323572] ? _xfs_buf_find+0x2ad/0x580 [xfs]
[ 448.324698] ? xfs_buf_get_map+0x1d/0x140 [xfs]
[ 448.325885] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 448.327045] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 448.328303] ? xfs_read_agf+0x8d/0x120 [xfs]
[ 448.329384] ? xfs_trans_read_buf_map+0x178/0x2f0 [xfs]
[ 448.330906] ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[ 448.332401] ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[ 448.333738] ? xfs_btree_rec_addr+0x9/0x10 [xfs]
[ 448.335180] ? _cond_resched+0x10/0x20
[ 448.336628] ? __kmalloc+0x114/0x180
[ 448.337783] ? xfs_buf_rele+0x57/0x3b0 [xfs]
[ 448.339143] ? __radix_tree_lookup+0x80/0xf0
[ 448.340406] ? xfs_free_extent_fix_freelist+0x67/0xc0 [xfs]
[ 448.341889] ? xfs_free_extent+0x6f/0x210 [xfs]
[ 448.343210] ? xfs_trans_free_extent+0x27/0x90 [xfs]
[ 448.344565] ? xfs_extent_free_finish_item+0x1c/0x30 [xfs]
[ 448.346042] ? xfs_defer_finish+0x125/0x280 [xfs]
[ 448.348145] ? xfs_itruncate_extents+0x1a2/0x3c0 [xfs]
[ 448.349999] ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[ 448.351680] ? xfs_release+0x135/0x160 [xfs]
[ 448.353278] ? __fput+0xc8/0x1c0
[ 448.354355] ? task_work_run+0x6e/0x90
[ 448.355646] ? do_exit+0x2b6/0xab0
[ 448.356761] ? do_group_exit+0x34/0xa0
[ 448.357901] ? get_signal+0x17c/0x4f0
[ 448.359039] ? __do_fault+0x15/0x70
[ 448.360139] ? do_signal+0x31/0x610
[ 448.361238] ? handle_mm_fault+0xc5/0x220
[ 448.362487] ? __do_page_fault+0x21e/0x4b0
[ 448.363752] ? exit_to_usermode_loop+0x35/0x70
[ 448.365109] ? prepare_exit_to_usermode+0x39/0x40
[ 448.366475] ? retint_user+0x8/0x13
[ 448.367640] a.out D 0 3389 3387 0x00000086
[ 448.369260] Call Trace:
[ 448.370151] ? __schedule+0x1d2/0x5a0
[ 448.371220] ? schedule+0x2d/0x80
[ 448.372181] ? schedule_timeout+0x16d/0x240
[ 448.373277] ? del_timer_sync+0x40/0x40
[ 448.374309] ? io_schedule_timeout+0x14/0x40
[ 448.375414] ? congestion_wait+0x79/0xd0
[ 448.376460] ? prepare_to_wait_event+0xf0/0xf0
[ 448.377590] ? shrink_inactive_list+0x388/0x3d0
[ 448.378788] ? pick_next_task_fair+0x39c/0x480
[ 448.380269] ? shrink_node_memcg+0x33a/0x740
[ 448.381981] ? mem_cgroup_iter+0x127/0x2b0
[ 448.383266] ? shrink_node+0xe0/0x320
[ 448.384342] ? do_try_to_free_pages+0xdc/0x370
[ 448.385569] ? try_to_free_pages+0xbe/0x100
[ 448.386680] ? __alloc_pages_slowpath+0x387/0x8e2
[ 448.387909] ? __alloc_pages_nodemask+0x1ed/0x210
[ 448.389163] ? alloc_pages_current+0x7a/0x100
[ 448.390369] ? xfs_buf_allocate_memory+0x16a/0x2ad [xfs]
[ 448.391731] ? xfs_buf_get_map+0xeb/0x140 [xfs]
[ 448.392931] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 448.394114] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 448.395421] ? xfs_btree_read_buf_block.constprop.37+0x72/0xc0 [xfs]
[ 448.397007] ? xfs_btree_lookup_get_block+0x7f/0x160 [xfs]
[ 448.398671] ? xfs_btree_lookup+0xc9/0x3f0 [xfs]
[ 448.399927] ? xfs_bmap_del_extent+0x1a0/0xbb0 [xfs]
[ 448.401357] ? __xfs_bunmapi+0x3bb/0xb70 [xfs]
[ 448.402679] ? xfs_bunmapi+0x26/0x40 [xfs]
[ 448.403907] ? xfs_itruncate_extents+0x18a/0x3c0 [xfs]
[ 448.405339] ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[ 448.406688] ? xfs_release+0x135/0x160 [xfs]
[ 448.407911] ? __fput+0xc8/0x1c0
[ 448.408939] ? task_work_run+0x6e/0x90
[ 448.410061] ? do_exit+0x2b6/0xab0
[ 448.411156] ? do_group_exit+0x34/0xa0
[ 448.412301] ? get_signal+0x17c/0x4f0
[ 448.413526] ? __do_fault+0x15/0x70
[ 448.415066] ? do_signal+0x31/0x610
[ 448.416174] ? handle_mm_fault+0xc5/0x220
[ 448.417490] ? __do_page_fault+0x21e/0x4b0
[ 448.418729] ? exit_to_usermode_loop+0x35/0x70
[ 448.419976] ? prepare_exit_to_usermode+0x39/0x40
[ 448.421336] ? retint_user+0x8/0x13
[ 448.422414] a.out D 0 3391 3387 0x00000086
[ 448.423873] Call Trace:
[ 448.424755] ? __schedule+0x1d2/0x5a0
[ 448.425857] ? schedule+0x2d/0x80
[ 448.426918] ? schedule_timeout+0x192/0x240
[ 448.428143] ? mempool_alloc+0x64/0x170
[ 448.429318] ? __down_common+0xc0/0x128
[ 448.430401] ? down+0x36/0x40
[ 448.431561] ? xfs_buf_lock+0x1d/0x40 [xfs]
[ 448.432727] ? _xfs_buf_find+0x2ad/0x580 [xfs]
[ 448.433976] ? xfs_buf_get_map+0x1d/0x140 [xfs]
[ 448.435216] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 448.436545] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 448.437901] ? xfs_read_agf+0x8d/0x120 [xfs]
[ 448.439111] ? xfs_trans_read_buf_map+0x178/0x2f0 [xfs]
[ 448.440624] ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[ 448.441958] ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[ 448.443524] ? xfs_btree_rec_addr+0x9/0x10 [xfs]
[ 448.444800] ? _cond_resched+0x10/0x20
[ 448.445933] ? __kmalloc+0x114/0x180
[ 448.447319] ? xfs_buf_rele+0x57/0x3b0 [xfs]
[ 448.448657] ? __radix_tree_lookup+0x80/0xf0
[ 448.449934] ? xfs_free_extent_fix_freelist+0x67/0xc0 [xfs]
[ 448.451445] ? xfs_free_extent+0x6f/0x210 [xfs]
[ 448.452608] ? xfs_trans_free_extent+0x27/0x90 [xfs]
[ 448.453874] ? xfs_extent_free_finish_item+0x1c/0x30 [xfs]
[ 448.455203] ? xfs_defer_finish+0x125/0x280 [xfs]
[ 448.456410] ? xfs_itruncate_extents+0x1a2/0x3c0 [xfs]
[ 448.457682] ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[ 448.458937] ? xfs_release+0x135/0x160 [xfs]
[ 448.460060] ? __fput+0xc8/0x1c0
[ 448.461081] ? task_work_run+0x6e/0x90
[ 448.462103] ? do_exit+0x2b6/0xab0
[ 448.463064] ? do_group_exit+0x34/0xa0
[ 448.464347] ? get_signal+0x17c/0x4f0
[ 448.465402] ? do_signal+0x31/0x610
[ 448.466373] ? xfs_file_write_iter+0x88/0x120 [xfs]
[ 448.467614] ? __vfs_write+0xe5/0x140
[ 448.468613] ? exit_to_usermode_loop+0x35/0x70
[ 448.469747] ? do_syscall_64+0x12a/0x140
[ 448.470827] ? entry_SYSCALL64_slow_path+0x25/0x25
[ 448.472399] a.out D 0 3392 3387 0x00000080
[ 448.473757] Call Trace:
[ 448.474567] ? __schedule+0x1d2/0x5a0
[ 448.475598] ? schedule+0x2d/0x80
[ 448.476566] ? schedule_timeout+0x16d/0x240
[ 448.477688] ? del_timer_sync+0x40/0x40
[ 448.478709] ? io_schedule_timeout+0x14/0x40
[ 448.480159] ? congestion_wait+0x79/0xd0
[ 448.481998] ? prepare_to_wait_event+0xf0/0xf0
[ 448.483679] ? shrink_inactive_list+0x388/0x3d0
[ 448.485113] ? shrink_node_memcg+0x33a/0x740
[ 448.486310] ? xfs_reclaim_inodes_count+0x2d/0x40 [xfs]
[ 448.487609] ? mem_cgroup_iter+0x127/0x2b0
[ 448.488719] ? shrink_node+0xe0/0x320
[ 448.489747] ? do_try_to_free_pages+0xdc/0x370
[ 448.490926] ? try_to_free_pages+0xbe/0x100
[ 448.492122] ? __alloc_pages_slowpath+0x387/0x8e2
[ 448.493347] ? __alloc_pages_nodemask+0x1ed/0x210
[ 448.494633] ? alloc_pages_current+0x7a/0x100
[ 448.495800] ? xfs_buf_allocate_memory+0x16a/0x2ad [xfs]
[ 448.497170] ? xfs_buf_get_map+0xeb/0x140 [xfs]
[ 448.498710] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 448.499861] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 448.501200] ? xfs_btree_read_buf_block.constprop.37+0x72/0xc0 [xfs]
[ 448.502698] ? xfs_btree_lookup_get_block+0x7f/0x160 [xfs]
[ 448.504021] ? xfs_btree_lookup+0xc9/0x3f0 [xfs]
[ 448.505242] ? xfs_iext_remove_direct+0x64/0xd0 [xfs]
[ 448.506495] ? xfs_bmap_add_extent_delay_real+0x4f9/0x18e0 [xfs]
[ 448.507930] ? _cond_resched+0x10/0x20
[ 448.508972] ? kmem_cache_alloc+0x11c/0x130
[ 448.510132] ? kmem_zone_alloc+0x84/0xf0 [xfs]
[ 448.511366] ? xfs_bmapi_write+0x826/0x1250 [xfs]
[ 448.512572] ? kmem_cache_alloc+0x11c/0x130
[ 448.514112] ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[ 448.515920] ? xfs_map_blocks+0x181/0x230 [xfs]
[ 448.517136] ? xfs_do_writepage+0x1db/0x630 [xfs]
[ 448.518381] ? invalid_page_referenced_vma+0x80/0x80
[ 448.519640] ? write_cache_pages+0x205/0x400
[ 448.520831] ? xfs_vm_set_page_dirty+0x1c0/0x1c0 [xfs]
[ 448.522203] ? iomap_apply+0xe3/0x120
[ 448.523271] ? xfs_vm_writepages+0x5f/0xa0 [xfs]
[ 448.524523] ? __filemap_fdatawrite_range+0xc0/0xf0
[ 448.525866] ? filemap_write_and_wait_range+0x20/0x50
[ 448.527157] ? xfs_file_fsync+0x41/0x160 [xfs]
[ 448.528319] ? do_fsync+0x33/0x60
[ 448.529273] ? SyS_fsync+0x7/0x10
[ 448.530267] ? do_syscall_64+0x5c/0x140
[ 448.531609] ? entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[ 580.304479] Showing busy workqueues and worker pools:
[ 580.306114] workqueue events_freezable_power_: flags=0x84
[ 580.307522] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 580.309059] in-flight: 99:disk_events_workfn
[ 580.310365] workqueue writeback: flags=0x4e
[ 580.312273] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 580.313966] in-flight: 342:wb_workfn wb_workfn
[ 580.316378] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=2s workers=3 idle: 24 3095
[ 580.318281] pool 256: cpus=0-127 flags=0x4 nice=0 hung=198s workers=3 idle: 341 340
[ 595.909943] sysrq: SysRq : Resetting
----------

This problem is very much dependent on timing, and warn_alloc_stall() cannot
catch it.

2017-04-25 06:33:14

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Mon, Apr 24, 2017 at 10:06:32PM +0900, Tetsuo Handa wrote:
> Stanislaw Gruszka wrote:
> > On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > >> I am definitely not against. There is no reason to rush the patch in.
> > > >
> > > > I don't hurry if we can check using watchdog whether this problem is occurring
> > > > in the real world. I have to test corner cases because watchdog is missing.
> > > >
> > > Ping?
> > >
> > > This problem can occur even immediately after the first invocation of
> > > the OOM killer. I believe this problem can occur in the real world.
> > > When are we going to apply this patch or watchdog patch?
> > >
> > > ----------------------------------------
> > > [ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> > > [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
> >
> > Are you debugging memory corruption problem?
>
> No. Just random testing, trying to find how we can avoid flooding of
> warn_alloc_stall() warning messages while also avoiding ratelimiting.

This is not the right way to stress the mm subsystem; the debug_guardpage_minorder=
option is for _debug_ purposes. Use mem= instead if you want to limit the
available memory.
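
For example (illustrative values only, reusing the boot entry from the reported
command line above), something like this limits the usable RAM via mem= instead
of relying on the debug option:

----------
BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro mem=1024M sysrq_always_enabled console=ttyS0,115200n8 console=tty0
----------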

> > FWIW, if you use debug_guardpage_minorder= you can expect all sorts of
> > memory allocation problems. This option is intended to debug
> > memory corruption bugs and it shrinks the available memory in an
> > artificial way. Taking that into account, I don't think justifying any
> > patch by a problem that happened when debug_guardpage_minorder= is
> > used is reasonable.
> >
> > Stanislaw
>
> This problem occurs without debug_guardpage_minorder= parameter and

So please justify your patches by that.

Thanks
Stanislaw

2017-06-30 00:14:30

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > > It only does this to some extent. If reclaim made
> > > > no progress, for example due to immediately bailing
> > > > out because the number of already isolated pages is
> > > > too high (due to many parallel reclaimers), the code
> > > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > > test without ever looking at the number of reclaimable
> > > > pages.
> > >
> > > Hm, there is no early return there, actually. We bump the loop counter
> > > every time it happens, but then *do* look at the reclaimable pages.
> > >
> > > > Could that create problems if we have many concurrent
> > > > reclaimers?
> > >
> > > With increased concurrency, the likelihood of OOM will go up if we
> > > remove the unlimited wait for isolated pages, that much is true.
> > >
> > > I'm not sure that's a bad thing, however, because we want the OOM
> > > killer to be predictable and timely. So a reasonable wait time in
> > > between 0 and forever before an allocating thread gives up under
> > > extreme concurrency makes sense to me.
> > >
> > > > It may be OK, I just do not understand all the implications.
> > > >
> > > > I like the general direction your patch takes the code in,
> > > > but I would like to understand it better...
> > >
> > > I feel the same way. The throttling logic doesn't seem to be very well
> > > thought out at the moment, making it hard to reason about what happens
> > > in certain scenarios.
> > >
> > > In that sense, this patch isn't really an overall improvement to the
> > > way things work. It patches a hole that seems to be exploitable only
> > > from an artificial OOM torture test, at the risk of regressing high
> > > concurrency workloads that may or may not be artificial.
> > >
> > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > > behind this patch. Can we think about a general model to deal with
> > > allocation concurrency?
> >
> > I am definitely not against. There is no reason to rush the patch in.
>
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.
>
> > My main point behind this patch was to reduce unbound loops from inside
> > the reclaim path and push any throttling up the call chain to the
> > page allocator path because I believe that it is easier to reason
> > about them at that level. The direct reclaim should be as simple as
> > possible without too many side effects otherwise we end up in a highly
> > unpredictable behavior. This was a first step in that direction and my
> > testing so far didn't show any regressions.
> >
> > > Unlimited parallel direct reclaim is kinda
> > > bonkers in the first place. How about checking for excessive isolation
> > > counts from the page allocator and putting allocations on a waitqueue?
> >
> > I would be interested in details here.
>
> That will help implementing __GFP_KILLABLE.
> https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15
>
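
For reference, the allocator-side check mentioned above ("no_progress_loops >
MAX_RECLAIM_RETRIES") can be modelled roughly as below. This is only a simplified
userspace sketch of the decision, not the actual mm/page_alloc.c code, and the
struct and helper names are made up for illustration:

----------
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16	/* the kernel's limit in mm/internal.h */

struct retry_state {
	int no_progress_loops;		/* reclaim rounds without progress */
	unsigned long free_pages;
	unsigned long reclaimable_pages;
	unsigned long watermark;
};

/* Illustrative model of the retry decision discussed above. */
static bool should_retry_reclaim(struct retry_state *st, bool made_progress)
{
	/* Every round without progress bumps the loop counter ... */
	st->no_progress_loops = made_progress ? 0 : st->no_progress_loops + 1;

	if (st->no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;	/* give up and head towards the OOM killer */

	/*
	 * ... but the reclaimable pages are still looked at: keep retrying
	 * as long as reclaiming everything left could satisfy the watermark.
	 */
	return st->free_pages + st->reclaimable_pages > st->watermark;
}
----------
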
Ping? Ping? When are we going to apply this patch or the watchdog patch?
This problem occurs with not-so-insane stress like the one shown below.
I can't test an almost-OOM situation because the test likely falls into either
the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	int i;

	/* Spawn 10 easy-to-kill writers which keep filesystem writeback busy. */
	for (i = 0; i < 10; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			if (!i)
				pause();
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
				fsync(fd);
			_exit(0);
		}
	}
	/* Grab as much anonymous memory as overcommit allows. */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	return 0;
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170629-3.txt.xz .

[ 190.924887] a.out D13296 2191 2172 0x00000080
[ 190.927121] Call Trace:
[ 190.928304] __schedule+0x23f/0x5d0
[ 190.929843] schedule+0x31/0x80
[ 190.931261] schedule_timeout+0x189/0x290
[ 190.933068] ? del_timer_sync+0x40/0x40
[ 190.934722] io_schedule_timeout+0x19/0x40
[ 190.936467] ? io_schedule_timeout+0x19/0x40
[ 190.938272] congestion_wait+0x7d/0xd0
[ 190.939919] ? wait_woken+0x80/0x80
[ 190.941452] shrink_inactive_list+0x3e3/0x4d0
[ 190.943281] shrink_node_memcg+0x360/0x780
[ 190.945023] ? check_preempt_curr+0x7d/0x90
[ 190.946794] ? try_to_wake_up+0x23b/0x3c0
[ 190.948741] shrink_node+0xdc/0x310
[ 190.950285] ? shrink_node+0xdc/0x310
[ 190.951870] do_try_to_free_pages+0xea/0x370
[ 190.953661] try_to_free_pages+0xc3/0x100
[ 190.955644] __alloc_pages_slowpath+0x441/0xd50
[ 190.957714] __alloc_pages_nodemask+0x20c/0x250
[ 190.959598] alloc_pages_vma+0x83/0x1e0
[ 190.961244] __handle_mm_fault+0xc2c/0x1030
[ 190.963006] handle_mm_fault+0xf4/0x220
[ 190.964871] __do_page_fault+0x25b/0x4a0
[ 190.966611] do_page_fault+0x30/0x80
[ 190.968169] page_fault+0x28/0x30

[ 190.987135] a.out D11896 2193 2191 0x00000086
[ 190.989636] Call Trace:
[ 190.990855] __schedule+0x23f/0x5d0
[ 190.992384] schedule+0x31/0x80
[ 190.993797] schedule_timeout+0x1c1/0x290
[ 190.995578] ? init_object+0x64/0xa0
[ 190.997133] __down+0x85/0xd0
[ 190.998476] ? __down+0x85/0xd0
[ 190.999879] ? deactivate_slab.isra.83+0x160/0x4b0
[ 191.001843] down+0x3c/0x50
[ 191.003116] ? down+0x3c/0x50
[ 191.004460] xfs_buf_lock+0x21/0x50 [xfs]
[ 191.006146] _xfs_buf_find+0x3cd/0x640 [xfs]
[ 191.007924] xfs_buf_get_map+0x25/0x150 [xfs]
[ 191.009736] xfs_buf_read_map+0x25/0xc0 [xfs]
[ 191.011891] xfs_trans_read_buf_map+0xef/0x2f0 [xfs]
[ 191.013990] xfs_read_agf+0x86/0x110 [xfs]
[ 191.015758] xfs_alloc_read_agf+0x3e/0x140 [xfs]
[ 191.017675] xfs_alloc_fix_freelist+0x3e8/0x4e0 [xfs]
[ 191.019725] ? kmem_zone_alloc+0x8a/0x110 [xfs]
[ 191.021613] ? set_track+0x6b/0x140
[ 191.023452] ? init_object+0x64/0xa0
[ 191.025049] ? ___slab_alloc+0x1b6/0x590
[ 191.026870] ? ___slab_alloc+0x1b6/0x590
[ 191.028581] xfs_free_extent_fix_freelist+0x78/0xe0 [xfs]
[ 191.030768] xfs_free_extent+0x6a/0x1d0 [xfs]
[ 191.032577] xfs_trans_free_extent+0x2c/0xb0 [xfs]
[ 191.034534] xfs_extent_free_finish_item+0x21/0x40 [xfs]
[ 191.036695] xfs_defer_finish+0x143/0x2b0 [xfs]
[ 191.038622] xfs_itruncate_extents+0x1a5/0x3d0 [xfs]
[ 191.040686] xfs_free_eofblocks+0x1a8/0x200 [xfs]
[ 191.042945] xfs_release+0x13f/0x160 [xfs]
[ 191.044811] xfs_file_release+0x10/0x20 [xfs]
[ 191.046674] __fput+0xda/0x1e0
[ 191.048077] ____fput+0x9/0x10
[ 191.049479] task_work_run+0x7b/0xa0
[ 191.051063] do_exit+0x2c5/0xb30
[ 191.052522] do_group_exit+0x3e/0xb0
[ 191.054103] get_signal+0x1dd/0x4f0
[ 191.055663] ? __do_fault+0x19/0xf0
[ 191.057790] do_signal+0x32/0x650
[ 191.059421] ? handle_mm_fault+0xf4/0x220
[ 191.061108] ? __do_page_fault+0x25b/0x4a0
[ 191.062818] exit_to_usermode_loop+0x5a/0x90
[ 191.064588] prepare_exit_to_usermode+0x40/0x50
[ 191.066468] retint_user+0x8/0x10

[ 191.085459] a.out D11576 2194 2191 0x00000086
[ 191.087652] Call Trace:
[ 191.088883] __schedule+0x23f/0x5d0
[ 191.090437] schedule+0x31/0x80
[ 191.091830] schedule_timeout+0x189/0x290
[ 191.093541] ? del_timer_sync+0x40/0x40
[ 191.095166] io_schedule_timeout+0x19/0x40
[ 191.096881] ? io_schedule_timeout+0x19/0x40
[ 191.098657] congestion_wait+0x7d/0xd0
[ 191.100254] ? wait_woken+0x80/0x80
[ 191.101758] shrink_inactive_list+0x3e3/0x4d0
[ 191.103574] shrink_node_memcg+0x360/0x780
[ 191.105599] ? check_preempt_curr+0x7d/0x90
[ 191.107402] ? try_to_wake_up+0x23b/0x3c0
[ 191.109087] shrink_node+0xdc/0x310
[ 191.110590] ? shrink_node+0xdc/0x310
[ 191.112153] do_try_to_free_pages+0xea/0x370
[ 191.113948] try_to_free_pages+0xc3/0x100
[ 191.115639] __alloc_pages_slowpath+0x441/0xd50
[ 191.117508] __alloc_pages_nodemask+0x20c/0x250
[ 191.119374] alloc_pages_current+0x65/0xd0
[ 191.121179] xfs_buf_allocate_memory+0x172/0x2d0 [xfs]
[ 191.123262] xfs_buf_get_map+0xbe/0x150 [xfs]
[ 191.125077] xfs_buf_read_map+0x25/0xc0 [xfs]
[ 191.126909] xfs_trans_read_buf_map+0xef/0x2f0 [xfs]
[ 191.128924] xfs_btree_read_buf_block.constprop.36+0x6d/0xc0 [xfs]
[ 191.131358] xfs_btree_lookup_get_block+0x85/0x180 [xfs]
[ 191.133529] xfs_btree_lookup+0x125/0x460 [xfs]
[ 191.135562] ? xfs_allocbt_init_cursor+0x43/0x130 [xfs]
[ 191.137674] xfs_free_ag_extent+0x9f/0x870 [xfs]
[ 191.139579] xfs_free_extent+0xb5/0x1d0 [xfs]
[ 191.141419] xfs_trans_free_extent+0x2c/0xb0 [xfs]
[ 191.143387] xfs_extent_free_finish_item+0x21/0x40 [xfs]
[ 191.145538] xfs_defer_finish+0x143/0x2b0 [xfs]
[ 191.147446] xfs_itruncate_extents+0x1a5/0x3d0 [xfs]
[ 191.149485] xfs_free_eofblocks+0x1a8/0x200 [xfs]
[ 191.151630] xfs_release+0x13f/0x160 [xfs]
[ 191.153373] xfs_file_release+0x10/0x20 [xfs]
[ 191.155248] __fput+0xda/0x1e0
[ 191.156637] ____fput+0x9/0x10
[ 191.158011] task_work_run+0x7b/0xa0
[ 191.159563] do_exit+0x2c5/0xb30
[ 191.161013] do_group_exit+0x3e/0xb0
[ 191.162557] get_signal+0x1dd/0x4f0
[ 191.164071] do_signal+0x32/0x650
[ 191.165526] ? handle_mm_fault+0xf4/0x220
[ 191.167429] ? __do_page_fault+0x283/0x4a0
[ 191.169254] exit_to_usermode_loop+0x5a/0x90
[ 191.171070] prepare_exit_to_usermode+0x40/0x50
[ 191.172976] retint_user+0x8/0x10

2017-06-30 13:32:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
[...]
> Ping? Ping? When are we going to apply this patch or the watchdog patch?
> This problem occurs with not-so-insane stress like the one shown below.
> I can't test an almost-OOM situation because the test likely falls into either
> the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.

So you are saying that the patch fixes this issue. Do I understand you
correctly? And you do not see any other negative side effects with it
applied?

I am sorry I didn't have much time to think about feedback from Johannes
yet. A more robust throttling method is surely due but also not trivial.
So I am not sure how to proceed. It is true that your last test case
with only 10 processes fighting resembles the reality much better than
hundreds (AFAIR) that you were using previously.

Rik, Johannes what do you think? Should we go with the simpler approach
for now and think of a better plan longterm?
--
Michal Hocko
SUSE Labs

2017-06-30 16:00:15

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Michal Hocko wrote:
> On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> [...]
> > Ping? Ping? When are we going to apply this patch or the watchdog patch?
> > This problem occurs with not-so-insane stress like the one shown below.
> > I can't test an almost-OOM situation because the test likely falls into either
> > the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.
>
> So you are saying that the patch fixes this issue. Do I understand you
> correctly? And you do not see any other negative side effects with it
> applied?

I hit this problem using http://lkml.kernel.org/r/[email protected]
on next-20170628. We won't be able to test whether the patch fixes this issue without
any other negative side effects unless this patch is sent to linux-next.git.
But at least we know that even if this patch is sent to linux-next.git, we will still see
bugs like http://lkml.kernel.org/r/[email protected] .

>
> I am sorry I didn't have much time to think about feedback from Johannes
> yet. A more robust throttling method is surely due but also not trivial.
> So I am not sure how to proceed. It is true that your last test case
> with only 10 processes fighting resembles the reality much better than
> hundreds (AFAIR) that you were using previously.

Even if hundreds are running, most of them are simply blocked inside open()
at down_write() (like the example from serial-20170423-2.txt.xz shown below).
The actual number of processes fighting for memory is always less than 100.

? __schedule+0x1d2/0x5a0
? schedule+0x2d/0x80
? rwsem_down_write_failed+0x1f9/0x370
? walk_component+0x43/0x270
? call_rwsem_down_write_failed+0x13/0x20
? down_write+0x24/0x40
? path_openat+0x670/0x1210
? do_filp_open+0x8c/0x100
? getname_flags+0x47/0x1e0
? do_sys_open+0x121/0x200
? do_syscall_64+0x5c/0x140
? entry_SYSCALL64_slow_path+0x25/0x25

>
> Rik, Johannes what do you think? Should we go with the simpler approach
> for now and think of a better plan longterm?

I don't hurry if we can check using watchdog whether this problem is occurring
in the real world. I have to test corner cases because watchdog is missing.

The watchdog does not introduce negative side effects; it will avoid soft lockups like
http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com ,
will avoid console_unlock() vs. oom_lock mutex lockups due to warn_alloc(),
and will catch similar bugs which people are failing to reproduce.

2017-06-30 16:19:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Sat 01-07-17 00:59:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> > [...]
> > > Ping? Ping? When are we going to apply this patch or the watchdog patch?
> > > This problem occurs with not-so-insane stress like the one shown below.
> > > I can't test an almost-OOM situation because the test likely falls into either
> > > the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.
> >
> > So you are saying that the patch fixes this issue. Do I understand you
> > correctly? And you do not see any other negative side effects with it
> > applied?
>
> I hit this problem using http://lkml.kernel.org/r/[email protected]
> on next-20170628. We won't be able to test whether the patch fixes this issue without
> any other negative side effects unless this patch is sent to linux-next.git.
> But at least we know that even if this patch is sent to linux-next.git, we will still see
> bugs like http://lkml.kernel.org/r/[email protected] .

It is really hard to pursue this half solution when there is no clear
indication it helps in your testing. So could you try to test with only
this patch on top of the current linux-next tree (or Linus tree) and see
if you can reproduce the problem?

It is possible that there are other potential problems but we at least
need to know whether it is worth going with the patch now.

[...]
> > Rik, Johannes what do you think? Should we go with the simpler approach
> > for now and think of a better plan longterm?
>
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.
>
> The watchdog does not introduce negative side effects; it will avoid soft lockups like
> http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com ,
> will avoid console_unlock() vs. oom_lock mutex lockups due to warn_alloc(),
> and will catch similar bugs which people are failing to reproduce.

this way of pushing your patch is really annoying. Please do realize
that repeating the same thing all around will not make a patch more
likely to merge. You have proposed something, nobody has nacked it
so it waits for people to actually find it important enough to justify
the additional code. So please stop this.

I really do appreciate your testing because it uncovers corner cases
most people do not test for and we can actually make the code better in
the end.
--
Michal Hocko
SUSE Labs

2017-07-01 11:44:09

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Michal Hocko wrote:
> I really do appreciate your testing because it uncovers corner cases
> most people do not test for and we can actually make the code better in
> the end.

That statement does not get to my heart at all. Collision between your
approach and my approach is wasting both your time and my time.

I've reported this too_many_isolated() trap three years ago at
http://lkml.kernel.org/r/[email protected] .
Do you know that we already wasted 3 years without any attention?

You are rejecting serialization under OOM without giving a chance to test
the side effects of serialization under OOM in linux-next.git. I call such an attitude
"speculation", which you never accept.

Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
In the first place, if the system is under global OOM (which is a more
serious situation than memcg OOM), the delay caused by serialization will not
matter. Rather, I consider making sure that the system does not get
locked up to be more important. I'm reporting that serialization helps
facilitate the OOM killer/reaper operations, avoids lockups, and
solves the global OOM situation smoothly. But you are refusing my report without
giving it a chance to be tested in linux-next.git for whatever side effects pop up.
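
As a purely illustrative userspace model of what "serialization under OOM" means
here (a sketch of the idea only, not the kernel's oom_lock code; all names are
made up):

----------
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t oom_lock_model = PTHREAD_MUTEX_INITIALIZER;

/* Stubs standing in for the real checks and actions. */
static bool oom_situation(void) { return true; }
static void handle_oom_or_wait_for_victim(void) { puts("one task drives the OOM handling"); }

static void alloc_slowpath_model(void)
{
	if (!oom_situation())
		return;
	/*
	 * Serialize: one task at a time performs the OOM killer/reaper
	 * work while the other allocating tasks wait here instead of
	 * piling up inside the reclaim path.
	 */
	pthread_mutex_lock(&oom_lock_model);
	handle_oom_or_wait_for_victim();
	pthread_mutex_unlock(&oom_lock_model);
}

int main(void)
{
	alloc_slowpath_model();
	return 0;
}
----------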

Knowledge about OOM situations is hardly shared among Linux developers and users,
and is far from being an object of concern. As shown by the cgroup-aware OOM killer proposal,
what will happen if we restrict 0 <= oom_victims <= 1 is not shared among developers.

How many developers joined my OOM watchdog proposal? Every single time it is a
confrontation between you and me. You, as effectively the only participant,
showing a negative attitude is effectively a Nacked-by: response without an
alternative proposal.

Not everybody can afford testing with the absolutely latest upstream kernels.
Not being prepared to obtain information for analysis from distributor kernels makes
it impossible to tell whether a user's problems are already fixed in upstream
kernels, makes it impossible to identify patches which need to be backported to
distributor kernels, and is bad for customers using distributor kernels. Of course,
it is possible that distributors decide not to allow users to obtain information
for analysis, but such a decision cannot become a reason we cannot prepare to obtain
information for analysis in upstream kernels.

Suppose I take a step back and tolerate the burden of sitting in front of the console
24 hours a day, every day of the year, so that users can press SysRq when something
goes wrong; how nice it would be if all in-flight allocation requests were printed
upon SysRq. show_workqueue_state() being called upon SysRq-t is to some degree useful.

In fact, my proposal was such an approach before I started serializing using a kernel thread
(e.g. http://lkml.kernel.org/r/[email protected]
which I proposed two and a half years ago). Though, while my proposal was left ignored,
I learned that showing only the current thread is not sufficient and updated my watchdog
to show other threads (e.g. kswapd) using serialization.

A patch at http://lkml.kernel.org/r/[email protected]
which I posted two years ago also includes a proposal for handling the infinite
shrink_inactive_list() problem. After all, this shrink_inactive_list() problem was
ignored for three years without even getting a chance to be tested in linux-next.git.
Sigh...

I know my proposals might not be the best. But you cannot afford to show alternative proposals
because you are giving higher priority to other problems. And other developers cannot afford
to participate because they are not interested in, or do not share knowledge of, this problem.

My proposals do not constrain future kernels. We can revert them when they are
no longer needed. My proposals are meaningful as an interim approach, but you never accept
approaches which do not match your will (or desire). Without even giving people a chance to
test what side effects will crop up, how can your "I really do appreciate your testing"
statement get to my heart?

My watchdog allows detecting problems which were previously overlooked unless someone took on
an unrealistic burden (e.g. standing by 24 hours a day, every day of the year). You ask people
to prove that it is an MM problem, but I am dissatisfied that you are leaving alone the very
proposals which would help judge whether it is an MM problem.

> this way of pushing your patch is really annoying. Please do realize
> that repeating the same thing all around will not make a patch more
> likely to merge. You have proposed something, nobody has nacked it
> so it waits for people to actually find it important enough to justify
> the additional code. So please stop this.

When will people find time to judge it? We already wasted three years, and
knowledge about OOM situations is hardly shared among Linux developers and users,
and will unlikely become an object of concern. How many more years (or decades) will we
waste? The MM subsystem will change in the meantime and we will just ignore old kernels.

If you do want me to stop bringing up the watchdog here and there, please do show
an alternative approach which I can tolerate. If you cannot afford that, please allow
me to involve people (e.g. by making a call for people to join my proposals, since
you are asking me to wait until people find time to judge them).
Please do realize that just repeatedly saying "wait patiently" helps nothing.

> It is really hard to pursue this half solution when there is no clear
> indication it helps in your testing. So could you try to test with only
> this patch on top of the current linux-next tree (or Linus tree) and see
> if you can reproduce the problem?

With this patch on top of next-20170630, I no longer hit this problem.
(Of course, this is because this patch eliminates the infinite loop.)

2017-07-05 08:20:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

[this is getting tangential again and I will not respond any further if
this turns into yet another flame]

On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I really do appreciate your testing because it uncovers corner cases
> > most people do not test for and we can actually make the code better in
> > the end.
>
> That statement does not get to my heart at all. Collision between your
> approach and my approach is wasting both your time and my time.
>
> I've reported this too_many_isolated() trap three years ago at
> http://lkml.kernel.org/r/[email protected] .
> Do you know that we already wasted 3 years without any attention?

And how many real bugs have we seen in those three years? Well, zero
AFAIR, except for your corner case testing. So while I never dismissed
the problem, I've been saying this is not that trivial to fix, as my
attempt to address this and the review feedback I've received show.

> You are rejecting serialization under OOM without giving a chance to test
> the side effects of serialization under OOM in linux-next.git. I call such an attitude
> "speculation", which you never accept.

No, I am rejecting abusing the lock for a purpose it is not aimed for.

> Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
> In the first place, if the system is under global OOM (which is a more
> serious situation than memcg OOM), the delay caused by serialization will not
> matter. Rather, I consider making sure that the system does not get
> locked up to be more important. I'm reporting that serialization helps
> facilitate the OOM killer/reaper operations, avoids lockups, and
> solves the global OOM situation smoothly. But you are refusing my report without
> giving it a chance to be tested in linux-next.git for whatever side effects pop up.

You are mixing oranges with apples here. We do synchronize memcg oom
killer the same way as the global one.

> Knowledge about OOM situations is hardly shared among Linux developers and users,
> and is far from being an object of concern. As shown by the cgroup-aware OOM killer proposal,
> what will happen if we restrict 0 <= oom_victims <= 1 is not shared among developers.
>
> How many developers joined my OOM watchdog proposal? Every single time it is a
> confrontation between you and me. You, as effectively the only participant,
> showing a negative attitude is effectively a Nacked-by: response without an
> alternative proposal.

This is something all of us have to fight with. There are only so many
MM developers. You have to justify your changes in order to attract other
developers/users. You are basing your changes on speculations and what-ifs
for workloads that most developers consider borderline and
misconfigurations already.

> Not everybody can afford testing with the absolutely latest upstream kernels.
> Not being prepared to obtain information for analysis from distributor kernels makes
> it impossible to tell whether a user's problems are already fixed in upstream
> kernels, makes it impossible to identify patches which need to be backported to
> distributor kernels, and is bad for customers using distributor kernels. Of course,
> it is possible that distributors decide not to allow users to obtain information
> for analysis, but such a decision cannot become a reason we cannot prepare to obtain
> information for analysis in upstream kernels.

If you have to work with distribution kernels then talk to distribution
people. It is that simple. You are surely not using those systems just
because of a fancy logo...

[...]

> > this way of pushing your patch is really annoying. Please do realize
> > that repeating the same thing all around will not make a patch more
> > likely to merge. You have proposed something, nobody has nacked it
> > so it waits for people to actually find it important enough to justify
> > the additional code. So please stop this.
>
> When will people find time to judge it? We already wasted three years, and
> knowledge about OOM situations is hardly shared among Linux developers and users,
> and will unlikely become an object of concern. How many more years (or decades) will we
> waste? The MM subsystem will change in the meantime and we will just ignore old kernels.
>
> If you do want me to stop bringing up the watchdog here and there, please do show
> an alternative approach which I can tolerate. If you cannot afford that, please allow
> me to involve people (e.g. by making a call for people to join my proposals, since
> you are asking me to wait until people find time to judge them).
> Please do realize that just repeatedly saying "wait patiently" helps nothing.

You really have to realize that there will hardly be more interest in
your reports when they do not reflect real life situations. I have said
(several times) that those issues should be addressed eventually but
there are more pressing issues which do trigger in real life and
they take precedence.

Should we add a lot of code for something that doesn't bother many
users? I do not think so. As explained earlier (several times), this code
will have a maintenance cost and can also lead to other problems (false
positives etc.; just consider how easy it is to get false positive
lockup splats - I am facing reports of those very often on our
distribution kernels on large boxes).

I said I appreciate your testing regardless because I really mean it. We
really want to have a more robust out of memory handling long term. And
as you have surely noticed, quite some changes have been made in that
direction over the last few years. There are still many unaddressed ones, no
question about that. We do not have to jump into the first approach we
come up with for those, though. Cost/benefit evaluation has to be done
every time for each proposal. I am really not sure what is so hard to
understand about this.

[...]
--
Michal Hocko
SUSE Labs

2017-07-05 08:20:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > It is really hard to pursue this half solution when there is no clear
> > indication it helps in your testing. So could you try to test with only
> > this patch on top of the current linux-next tree (or Linus tree) and see
> > if you can reproduce the problem?
>
> With this patch on top of next-20170630, I no longer hit this problem.
> (Of course, this is because this patch eliminates the infinite loop.)

I assume you haven't seen other negative side effects, like unexpected
OOMs etc... Are you willing to give your Tested-by?
--
Michal Hocko
SUSE Labs

2017-07-06 10:48:55

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Michal Hocko wrote:
> On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > It is really hard to pursue this half solution when there is no clear
> > > indication it helps in your testing. So could you try to test with only
> > > this patch on top of the current linux-next tree (or Linus tree) and see
> > > if you can reproduce the problem?
> >
> > With this patch on top of next-20170630, I no longer hit this problem.
> > (Of course, this is because this patch eliminates the infinite loop.)
>
> I assume you haven't seen other negative side effects, like unexpected
> OOMs etc... Are you willing to give your Tested-by?

I didn't see other negative side effects.

Tested-by: Tetsuo Handa <[email protected]>

We need a long time for testing this patch in linux-next.git (and I give up
using this handy bug for finding other bugs under almost-OOM situations).