2017-03-07 13:58:27

by Michal Hocko

Subject: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

From: Michal Hocko <[email protected]>

Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
in the too_many_isolated loop basically forever because the last few pages
on the LRU lists are isolated by kswapd, which is stuck on fs locks
when doing the pageout or slab reclaim. This in turn means that there is
nobody to actually trigger the oom killer and the system is basically
unusable.

too_many_isolated was introduced by commit 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature oom killer invocations, because back then no reclaim progress
could indeed trigger the OOM killer too early. But since the oom detection
rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the allocation/reclaim
retry loop considers all the reclaimable pages and throttles the allocation
at that layer, so we can loosen the direct reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and return
immediately when the situation hasn't resolved after the first sleep.
Replace congestion_wait with a simple schedule_timeout_interruptible because
we are not really waiting on IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for the reclaim
which makes progress only very slowly. This would be obvious from the oom
report as the number of isolated pages is printed there. If we ever hit
this, should_reclaim_retry should consider those numbers in its evaluation
in one way or another.

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]

Signed-off-by: Michal Hocko <[email protected]>
---

Hi,
Tetsuo helped to test this patch [3] and couldn't reproduce the hang
inside the page allocator anymore. Thanks! He was able to see a
different lockup though. This time it is more related to how XFS does
the inode reclaim from the WQ context. This is being discussed [4]
and I believe it is unrelated to this change.

I believe this change is still an improvement because it reduces the chances
of an unbound loop inside the reclaim path, so we have a) more reliable
detection of the lockup from the allocator path and b) more deterministic
retry loop logic.

Thoughts/complaints/suggestions?

[3] http://lkml.kernel.org/r/[email protected]
[4] http://lkml.kernel.org/r/[email protected]

mm/vmscan.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+ bool stalled = false;

while (unlikely(too_many_isolated(pgdat, file, sc))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ if (stalled)
+ return 0;
+
+ /* wait a bit for the reclaimer. */
+ schedule_timeout_interruptible(HZ/10);
+ stalled = true;

/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
--
2.11.0


2017-03-07 20:02:16

by Rik van Riel

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo Handa has reported [1][2] that direct reclaimers might get
> stuck
> in too_many_isolated loop basically for ever because the last few
> pages
> on the LRU lists are isolated by the kswapd which is stuck on fs
> locks
> when doing the pageout or slab reclaim. This in turn means that there
> is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan:
> throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the
> direct
> reclaim throttling.

It only does this to some extent.  If reclaim made
no progress, for example due to immediately bailing
out because the number of already isolated pages is
too high (due to many parallel reclaimers), the code
could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
test without ever looking at the number of reclaimable
pages.

Could that create problems if we have many concurrent
reclaimers?

It may be OK, I just do not understand all the implications.

I like the general direction your patch takes the code in,
but I would like to understand it better...

--
All rights reversed



2017-03-08 09:54:29

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue 07-03-17 14:52:36, Rik van Riel wrote:
> On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Tetsuo Handa has reported [1][2] that direct reclaimers might get
> > stuck
> > in too_many_isolated loop basically for ever because the last few
> > pages
> > on the LRU lists are isolated by the kswapd which is stuck on fs
> > locks
> > when doing the pageout or slab reclaim. This in turn means that there
> > is
> > nobody to actually trigger the oom killer and the system is basically
> > unusable.
> >
> > too_many_isolated has been introduced by 35cd78156c49 ("vmscan:
> > throttle
> > direct reclaim when too many pages are isolated already") to prevent
> > from pre-mature oom killer invocations because back then no reclaim
> > progress could indeed trigger the OOM killer too early. But since the
> > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > the allocation/reclaim retry loop considers all the reclaimable pages
> > and throttles the allocation at that layer so we can loosen the
> > direct
> > reclaim throttling.
>
> It only does this to some extent.  If reclaim made
> no progress, for example due to immediately bailing
> out because the number of already isolated pages is
> too high (due to many parallel reclaimers), the code
> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> test without ever looking at the number of reclaimable
> pages.
>
> Could that create problems if we have many concurrent
> reclaimers?

As the changelog mentions, this might theoretically cause a premature oom
killer invocation. We could easily see that from the oom report by checking
the isolated counters. My testing didn't trigger that though, and I was
hammering the page allocator path from many threads.

I suspect some artificial tests can trigger that; I am not so sure about
reasonable workloads. If we see this happening, though, then the fix would
be to resurrect my previous attempt to track NR_ISOLATED* per zone and
use them in the allocator retry logic.
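
For illustration only, a minimal sketch of what using those counters in the
retry logic could look like. The helper below is made up, and it uses the
existing per-node NR_ISOLATED_* counters as a stand-in for the per-zone
variant mentioned above:

/*
 * Hypothetical sketch, not the actual proposal: fold pages that other
 * reclaimers have temporarily isolated back into the "reclaimable"
 * estimate, so that the allocator retry logic does not give up just
 * because many pages are currently off the LRU lists.
 */
static unsigned long reclaimable_including_isolated(struct pglist_data *pgdat)
{
        unsigned long reclaimable;

        reclaimable  = node_page_state(pgdat, NR_INACTIVE_FILE);
        reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);
        /* isolated pages will either be reclaimed or put back */
        reclaimable += node_page_state(pgdat, NR_ISOLATED_FILE);
        reclaimable += node_page_state(pgdat, NR_ISOLATED_ANON);

        return reclaimable;
}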

--
Michal Hocko
SUSE Labs

2017-03-08 16:04:13

by Rik van Riel

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:

> > Could that create problems if we have many concurrent
> > reclaimers?
>
> As the changelog mentions it might cause a premature oom killer
> invocation theoretically. We could easily see that from the oom
> report
> by checking isolated counters. My testing didn't trigger that though
> and I was hammering the page allocator path from many threads.
>
> I suspect some artificial tests can trigger that, I am not so sure
> about
> reasonabel workloads. If we see this happening though then the fix
> would
> be to resurrect my previous attempt to track NR_ISOLATED* per zone
> and
> use them in the allocator retry logic.

I am not sure the workload in question is "artificial".
A heavily forking (or multi-threaded) server running out
of physical memory could easily get hundreds of tasks
doing direct reclaim simultaneously.

In fact, false OOM kills with that kind of workload are
how we ended up getting the "too many isolated" logic
in the first place.

I am perfectly fine with moving the retry logic up like
you did, but I think it may make sense to check the number
of reclaimable pages if we have too many isolated pages,
instead of risking a too-early OOM kill.

2017-03-09 09:23:52

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Wed 08-03-17 10:54:57, Rik van Riel wrote:
> On Wed, 2017-03-08 at 10:21 +0100, Michal Hocko wrote:
>
> > > Could that create problems if we have many concurrent
> > > reclaimers?
> >
> > As the changelog mentions it might cause a premature oom killer
> > invocation theoretically. We could easily see that from the oom
> > report
> > by checking isolated counters. My testing didn't trigger that though
> > and I was hammering the page allocator path from many threads.
> >
> > I suspect some artificial tests can trigger that, I am not so sure
> > about
> > reasonabel workloads. If we see this happening though then the fix
> > would
> > be to resurrect my previous attempt to track NR_ISOLATED* per zone
> > and
> > use them in the allocator retry logic.
>
> I am not sure the workload in question is "artificial".
> A heavily forking (or multi-threaded) server running out
> of physical memory could easily get hundreds of tasks
> doing direct reclaim simultaneously.

Yes, some of my OOM tests (fork many short-lived processes while there
is strong memory pressure and a lot of IO going on) are doing this, and
I haven't hit a premature OOM yet. It is hard to tune those tests to be
almost-but-not-quite OOM, though. Usually you either find a steady state
or really run out of memory.

> In fact, false OOM kills with that kind of workload is
> how we ended up getting the "too many isolated" logic
> in the first place.

Right, but the retry logic was considerably different from what we
have these days. should_reclaim_retry considers the amount of reclaimable
memory. As I've said earlier, if we see a report where the oom hits
prematurely with many NR_ISOLATED* pages, we know how to fix that.

> I am perfectly fine with moving the retry logic up like
> you did, but think it may make sense to check the number
> of reclaimable pages if we have too many isolated pages,
> instead of risking a too-early OOM kill.

Actually, that was my initial attempt, but for that we would need per-zone
NR_ISOLATED* counters. Mel was against that and wanted to start with a
simpler approach if it works reasonably well, which it seems to do from
my experience so far (but reality can surprise us, as I've seen so many
times already).
--
Michal Hocko
SUSE Labs

2017-03-09 14:16:30

by Rik van Riel

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> On Wed 08-03-17 10:54:57, Rik van Riel wrote:

> > In fact, false OOM kills with that kind of workload is
> > how we ended up getting the "too many isolated" logic
> > in the first place.
> Right, but the retry logic was considerably different than what we
> have these days. should_reclaim_retry considers amount of reclaimable
> memory. As I've said earlier if we see a report where the oom hits
> prematurely with many NR_ISOLATED* we know how to fix that.

Would it be enough to simply reset no_progress_loops
in this check inside should_reclaim_retry, if we know
pageout IO is pending?

                        if (!did_some_progress) {
                                unsigned long write_pending;

                                write_pending = zone_page_state_snapshot(zone,
                                                        NR_ZONE_WRITE_PENDING);

                                if (2 * write_pending > reclaimable) {
                                        congestion_wait(BLK_RW_ASYNC, HZ/10);
                                        return true;
                                }
                        }

--
All rights reversed



2017-03-09 14:32:28

by Mel Gorman

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue, Mar 07, 2017 at 02:30:57PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.
>
> Make shrink_inactive_list loop over too_many_isolated bounded and returns
> immediately when the situation hasn't resolved after the first sleep.
> Replace congestion_wait by a simple schedule_timeout_interruptible because
> we are not really waiting on the IO congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for the reclaim
> which makes progress only very slowly. This would be obvious from the oom
> report as the number of isolated pages are printed there. If we ever hit
> this should_reclaim_retry should consider those numbers in the evaluation
> in one way or another.
>
> [1] http://lkml.kernel.org/r/[email protected]
> [2] http://lkml.kernel.org/r/[email protected]
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2017-03-09 14:59:42

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu 09-03-17 09:16:25, Rik van Riel wrote:
> On Thu, 2017-03-09 at 10:12 +0100, Michal Hocko wrote:
> > On Wed 08-03-17 10:54:57, Rik van Riel wrote:
>
> > > In fact, false OOM kills with that kind of workload is
> > > how we ended up getting the "too many isolated" logic
> > > in the first place.
> > Right, but the retry logic was considerably different than what we
> > have these days. should_reclaim_retry considers amount of reclaimable
> > memory. As I've said earlier if we see a report where the oom hits
> > prematurely with many NR_ISOLATED* we know how to fix that.
>
> Would it be enough to simply reset no_progress_loops
> in this check inside should_reclaim_retry, if we know
> pageout IO is pending?
>
>                         if (!did_some_progress) {
>                                 unsigned long write_pending;
>
>                                 write_pending = zone_page_state_snapshot(zone,
>                                                         NR_ZONE_WRITE_PENDING);
>
>                                 if (2 * write_pending > reclaimable) {
>                                         congestion_wait(BLK_RW_ASYNC, HZ/10);
>                                         return true;
>                                 }
>                         }

I am not really sure what problem we are trying to solve right now, to be
honest. I would prefer to keep the logic simpler rather than over-engineer
something that is not even needed.

--
Michal Hocko
SUSE Labs

2017-03-09 18:11:42

by Johannes Weiner

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> It only does this to some extent.  If reclaim made
> no progress, for example due to immediately bailing
> out because the number of already isolated pages is
> too high (due to many parallel reclaimers), the code
> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> test without ever looking at the number of reclaimable
> pages.

Hm, there is no early return there, actually. We bump the loop counter
every time it happens, but then *do* look at the reclaimable pages.

> Could that create problems if we have many concurrent
> reclaimers?

With increased concurrency, the likelihood of OOM will go up if we
remove the unlimited wait for isolated pages, that much is true.

I'm not sure that's a bad thing, however, because we want the OOM
killer to be predictable and timely. So a reasonable wait time in
between 0 and forever before an allocating thread gives up under
extreme concurrency makes sense to me.

> It may be OK, I just do not understand all the implications.
>
> I like the general direction your patch takes the code in,
> but I would like to understand it better...

I feel the same way. The throttling logic doesn't seem to be very well
thought out at the moment, making it hard to reason about what happens
in certain scenarios.

In that sense, this patch isn't really an overall improvement to the
way things work. It patches a hole that seems to be exploitable only
from an artificial OOM torture test, at the risk of regressing high
concurrency workloads that may or may not be artificial.

Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
behind this patch. Can we think about a general model to deal with
allocation concurrency? Unlimited parallel direct reclaim is kinda
bonkers in the first place. How about checking for excessive isolation
counts from the page allocator and putting allocations on a waitqueue?
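
A rough sketch of that idea, purely for illustration (all names below are
invented; a real version would presumably use a per-node waitqueue and wake
waiters from the path that puts isolated pages back on the LRU):

/*
 * Illustrative only: allocating tasks wait until the number of isolated
 * pages drops, instead of each of them spinning on too_many_isolated()
 * inside the reclaim path.
 */
static DECLARE_WAIT_QUEUE_HEAD(isolated_wq);

static bool isolation_ok(struct pglist_data *pgdat)
{
        unsigned long inactive, isolated;

        inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
                   node_page_state(pgdat, NR_INACTIVE_ANON);
        isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
                   node_page_state(pgdat, NR_ISOLATED_ANON);

        return isolated <= inactive;
}

/* called from the allocator slow path before entering direct reclaim */
static int wait_for_isolation(struct pglist_data *pgdat)
{
        return wait_event_killable(isolated_wq, isolation_ok(pgdat));
}

/* called by reclaimers after putting isolated pages back on the LRU */
static void isolation_putback_done(void)
{
        if (waitqueue_active(&isolated_wq))
                wake_up(&isolated_wq);
}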

2017-03-09 22:18:18

by Rik van Riel

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> >
> > It only does this to some extent.  If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
> Hm, there is no early return there, actually. We bump the loop
> counter
> every time it happens, but then *do* look at the reclaimable pages.

Am I looking at an old tree?  I see this code
before we look at the reclaimable pages.

        /*
         * Make sure we converge to OOM if we cannot make any progress
         * several times in the row.
         */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust highatomic_reserve */
                return unreserve_highatomic_pageblock(ac, true);
        }

> > Could that create problems if we have many concurrent
> > reclaimers?
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.

That is a fair point; a faster OOM kill is preferable
to a system that is livelocked.

> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency? Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive
> isolation
> counts from the page allocator and putting allocations on a
> waitqueue?

The (limited) number of reclaimers can still do a
relatively fast OOM kill, if none of them manage
to make progress.

That should avoid the potential issue you and I
both pointed out, and, as a bonus, it might actually
be faster than letting all the tasks in the system
into the direct reclaim code simultaneously.

--
All rights reversed



2017-03-10 10:21:25

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > It only does this to some extent.  If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
>
> Hm, there is no early return there, actually. We bump the loop counter
> every time it happens, but then *do* look at the reclaimable pages.
>
> > Could that create problems if we have many concurrent
> > reclaimers?
>
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.
>
> > It may be OK, I just do not understand all the implications.
> >
> > I like the general direction your patch takes the code in,
> > but I would like to understand it better...
>
> I feel the same way. The throttling logic doesn't seem to be very well
> thought out at the moment, making it hard to reason about what happens
> in certain scenarios.
>
> In that sense, this patch isn't really an overall improvement to the
> way things work. It patches a hole that seems to be exploitable only
> from an artificial OOM torture test, at the risk of regressing high
> concurrency workloads that may or may not be artificial.
>
> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency?

I am definitely not against. There is no reason to rush the patch in.
My main point behind this patch was to reduce unbound loops inside
the reclaim path and push any throttling up the call chain to the
page allocator path, because I believe that it is easier to reason
about it at that level. The direct reclaim should be as simple as
possible, without too many side effects, otherwise we end up with highly
unpredictable behavior. This was a first step in that direction and my
testing so far didn't show any regressions.

> Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive isolation
> counts from the page allocator and putting allocations on a waitqueue?

I would be interested in details here.
--
Michal Hocko
SUSE Labs

2017-03-10 10:28:03

by Michal Hocko

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Thu 09-03-17 17:18:00, Rik van Riel wrote:
> On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > >
> > > It only does this to some extent.  If reclaim made
> > > no progress, for example due to immediately bailing
> > > out because the number of already isolated pages is
> > > too high (due to many parallel reclaimers), the code
> > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > test without ever looking at the number of reclaimable
> > > pages.
> > Hm, there is no early return there, actually. We bump the loop
> > counter
> > every time it happens, but then *do* look at the reclaimable pages.
>
> Am I looking at an old tree?  I see this code
> before we look at the reclaimable pages.
>
>         /*
>          * Make sure we converge to OOM if we cannot make any progress
>          * several times in the row.
>          */
>         if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
>                 /* Before OOM, exhaust highatomic_reserve */
>                 return unreserve_highatomic_pageblock(ac, true);
>         }

I believe that Johannes meant cases where we do not exhaust all the
reclaim retries but instead fail early because there are no reclaimable
pages left when the watermark check is done.
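
As a simplified model of that check (the real logic lives in
should_reclaim_retry() and uses __zone_watermark_ok(); this only illustrates
why the loop can fail before MAX_RECLAIM_RETRIES is reached):

/* illustrative only, not the kernel code */
static bool worth_retrying(unsigned long free, unsigned long reclaimable,
                           unsigned long min_wmark)
{
        /*
         * If even reclaiming everything that looks reclaimable cannot get
         * us back above the watermark, give up right away instead of
         * burning through the remaining no_progress_loops iterations.
         */
        return free + reclaimable > min_wmark;
}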

> > > Could that create problems if we have many concurrent
> > > reclaimers?
> > With increased concurrency, the likelihood of OOM will go up if we
> > remove the unlimited wait for isolated pages, that much is true.
> >
> > I'm not sure that's a bad thing, however, because we want the OOM
> > killer to be predictable and timely. So a reasonable wait time in
> > between 0 and forever before an allocating thread gives up under
> > extreme concurrency makes sense to me.
>
> That is a fair point, a faster OOM kill is preferable
> to a system that is livelocked.
>
> > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > behind this patch. Can we think about a general model to deal with
> > allocation concurrency? Unlimited parallel direct reclaim is kinda
> > bonkers in the first place. How about checking for excessive
> > isolation
> > counts from the page allocator and putting allocations on a
> > waitqueue?
>
> The (limited) number of reclaimers can still do a
> relatively fast OOM kill, if none of them manage
> to make progress.

Well, we can estimate how much memory those relatively few reclaimers can
isolate and try to reclaim. Even if we have hundreds of them, which is
already a large number to me, we are at 100*SWAP_CLUSTER_MAX pages, which
is not all that much. And we are effectively OOM if there is no other
reclaimable memory left. All we need is to put some upper bound in place.
We already have throttle_direct_reclaim but it doesn't really throttle the
maximum number of reclaimers.
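
For scale: SWAP_CLUSTER_MAX is 32, so a hundred concurrent reclaimers pin at
most around 3200 pages, i.e. roughly 12.5MB with 4kB pages. A hypothetical
upper bound could be as simple as the sketch below (names made up; this is
not what throttle_direct_reclaim does today):

/*
 * Made-up illustration of bounding the number of direct reclaimers.
 * throttle_direct_reclaim() currently throttles on the pfmemalloc
 * watermarks and does not limit the number of concurrent reclaimers.
 */
static atomic_t nr_direct_reclaimers = ATOMIC_INIT(0);
#define MAX_DIRECT_RECLAIMERS   100

static bool try_enter_direct_reclaim(void)
{
        if (atomic_inc_return(&nr_direct_reclaimers) > MAX_DIRECT_RECLAIMERS) {
                atomic_dec(&nr_direct_reclaimers);
                return false;   /* caller waits or retries the allocation */
        }
        return true;
}

static void leave_direct_reclaim(void)
{
        atomic_dec(&nr_direct_reclaimers);
}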
--
Michal Hocko
SUSE Labs

2017-03-10 11:45:26

by Tetsuo Handa

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Michal Hocko wrote:
> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > It only does this to some extent. If reclaim made
> > > no progress, for example due to immediately bailing
> > > out because the number of already isolated pages is
> > > too high (due to many parallel reclaimers), the code
> > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > test without ever looking at the number of reclaimable
> > > pages.
> >
> > Hm, there is no early return there, actually. We bump the loop counter
> > every time it happens, but then *do* look at the reclaimable pages.
> >
> > > Could that create problems if we have many concurrent
> > > reclaimers?
> >
> > With increased concurrency, the likelihood of OOM will go up if we
> > remove the unlimited wait for isolated pages, that much is true.
> >
> > I'm not sure that's a bad thing, however, because we want the OOM
> > killer to be predictable and timely. So a reasonable wait time in
> > between 0 and forever before an allocating thread gives up under
> > extreme concurrency makes sense to me.
> >
> > > It may be OK, I just do not understand all the implications.
> > >
> > > I like the general direction your patch takes the code in,
> > > but I would like to understand it better...
> >
> > I feel the same way. The throttling logic doesn't seem to be very well
> > thought out at the moment, making it hard to reason about what happens
> > in certain scenarios.
> >
> > In that sense, this patch isn't really an overall improvement to the
> > way things work. It patches a hole that seems to be exploitable only
> > from an artificial OOM torture test, at the risk of regressing high
> > concurrency workloads that may or may not be artificial.
> >
> > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > behind this patch. Can we think about a general model to deal with
> > allocation concurrency?
>
> I am definitely not against. There is no reason to rush the patch in.

I'm not in a hurry if we can check, using a watchdog, whether this problem
is occurring in the real world. I have to test corner cases because such a
watchdog is missing.

> My main point behind this patch was to reduce unbound loops from inside
> the reclaim path and push any throttling up the call chain to the
> page allocator path because I believe that it is easier to reason
> about them at that level. The direct reclaim should be as simple as
> possible without too many side effects otherwise we end up in a highly
> unpredictable behavior. This was a first step in that direction and my
> testing so far didn't show any regressions.
>
> > Unlimited parallel direct reclaim is kinda
> > bonkers in the first place. How about checking for excessive isolation
> > counts from the page allocator and putting allocations on a waitqueue?
>
> I would be interested in details here.

That will help implementing __GFP_KILLABLE.
https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15

2017-03-21 10:37:43

by Tetsuo Handa

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On 2017/03/10 20:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
>>>> It may be OK, I just do not understand all the implications.
>>>>
>>>> I like the general direction your patch takes the code in,
>>>> but I would like to understand it better...
>>>
>>> I feel the same way. The throttling logic doesn't seem to be very well
>>> thought out at the moment, making it hard to reason about what happens
>>> in certain scenarios.
>>>
>>> In that sense, this patch isn't really an overall improvement to the
>>> way things work. It patches a hole that seems to be exploitable only
>>> from an artificial OOM torture test, at the risk of regressing high
>>> concurrency workloads that may or may not be artificial.
>>>
>>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
>>> behind this patch. Can we think about a general model to deal with
>>> allocation concurrency?
>>
>> I am definitely not against. There is no reason to rush the patch in.
>
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.

Today I tested linux-next-20170321 with not-so-insane stress, and I again
hit this problem. Thus, I think this problem might occur in the real world.

http://I-love.SAKURA.ne.jp/tmp/serial-20170321.txt.xz (Logs up to before swapoff are eliminated.)
----------
[ 2250.175109] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2257.535653] MemAlloc-Info: stalling=16 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2319.806880] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2320.722282] MemAlloc-Info: stalling=19 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2381.243393] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2389.777052] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2450.878287] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2459.386321] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2520.500633] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
[ 2529.042088] MemAlloc-Info: stalling=20 dying=0 exiting=4 victim=0 oom_count=1155386
----------

2017-04-23 10:24:34

by Tetsuo Handa

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On 2017/03/10 20:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
>>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
>>>> It only does this to some extent. If reclaim made
>>>> no progress, for example due to immediately bailing
>>>> out because the number of already isolated pages is
>>>> too high (due to many parallel reclaimers), the code
>>>> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
>>>> test without ever looking at the number of reclaimable
>>>> pages.
>>>
>>> Hm, there is no early return there, actually. We bump the loop counter
>>> every time it happens, but then *do* look at the reclaimable pages.
>>>
>>>> Could that create problems if we have many concurrent
>>>> reclaimers?
>>>
>>> With increased concurrency, the likelihood of OOM will go up if we
>>> remove the unlimited wait for isolated pages, that much is true.
>>>
>>> I'm not sure that's a bad thing, however, because we want the OOM
>>> killer to be predictable and timely. So a reasonable wait time in
>>> between 0 and forever before an allocating thread gives up under
>>> extreme concurrency makes sense to me.
>>>
>>>> It may be OK, I just do not understand all the implications.
>>>>
>>>> I like the general direction your patch takes the code in,
>>>> but I would like to understand it better...
>>>
>>> I feel the same way. The throttling logic doesn't seem to be very well
>>> thought out at the moment, making it hard to reason about what happens
>>> in certain scenarios.
>>>
>>> In that sense, this patch isn't really an overall improvement to the
>>> way things work. It patches a hole that seems to be exploitable only
>>> from an artificial OOM torture test, at the risk of regressing high
>>> concurrency workloads that may or may not be artificial.
>>>
>>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
>>> behind this patch. Can we think about a general model to deal with
>>> allocation concurrency?
>>
>> I am definitely not against. There is no reason to rush the patch in.
>
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.
>
Ping?

This problem can occur even immediately after the first invocation of
the OOM killer. I believe this problem can occur in the real world.
When are we going to apply this patch or the watchdog patch?

----------------------------------------
[ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
(...snipped...)
CentOS Linux 7 (Core)
Kernel 4.11.0-rc7-next-20170421+ on an x86_64

ccsecurity login: [ 32.406723] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 32.852917] Ebtables v2.0 registered
[ 33.034402] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 33.467929] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 33.475728] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready
[ 33.478910] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready
[ 33.950365] Netfilter messages via NETLINK v0.30.
[ 33.983449] ip_set: protocol 6
[ 37.335966] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 37.337587] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready
[ 37.339925] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready
[ 39.940942] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead.
[ 98.926202] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[ 98.932977] a.out cpuset=/ mems_allowed=0
[ 98.934780] CPU: 1 PID: 2972 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588
[ 98.937988] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 98.942193] Call Trace:
[ 98.942942] ? dump_stack+0x5c/0x7d
[ 98.943907] ? dump_header+0x97/0x233
[ 98.945334] ? ktime_get+0x30/0x90
[ 98.946290] ? delayacct_end+0x35/0x60
[ 98.947319] ? do_try_to_free_pages+0x2ca/0x370
[ 98.948554] ? oom_kill_process+0x223/0x3e0
[ 98.949715] ? has_capability_noaudit+0x17/0x20
[ 98.950948] ? oom_badness+0xeb/0x160
[ 98.951962] ? out_of_memory+0x10b/0x490
[ 98.953030] ? __alloc_pages_slowpath+0x701/0x8e2
[ 98.954313] ? __alloc_pages_nodemask+0x1ed/0x210
[ 98.956242] ? alloc_pages_vma+0x9f/0x220
[ 98.957486] ? __handle_mm_fault+0xc22/0x11e0
[ 98.958673] ? handle_mm_fault+0xc5/0x220
[ 98.959766] ? __do_page_fault+0x21e/0x4b0
[ 98.960906] ? do_page_fault+0x2b/0x70
[ 98.961977] ? page_fault+0x28/0x30
[ 98.963861] Mem-Info:
[ 98.965330] active_anon:372765 inactive_anon:2097 isolated_anon:0
[ 98.965330] active_file:182 inactive_file:214 isolated_file:32
[ 98.965330] unevictable:0 dirty:6 writeback:6 unstable:0
[ 98.965330] slab_reclaimable:2011 slab_unreclaimable:11291
[ 98.965330] mapped:623 shmem:2162 pagetables:8582 bounce:0
[ 98.965330] free:13278 free_pcp:117 free_cma:0
[ 98.978473] Node 0 active_anon:1491060kB inactive_anon:8388kB active_file:728kB inactive_file:856kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2492kB dirty:24kB writeback:24kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1241088kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 98.987555] Node 0 DMA free:7176kB min:408kB low:508kB high:608kB active_anon:8672kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 98.998904] lowmem_reserve[]: 0 1696 1696 1696
[ 99.001205] Node 0 DMA32 free:45936kB min:44644kB low:55804kB high:66964kB active_anon:1482048kB inactive_anon:8388kB active_file:232kB inactive_file:1000kB unevictable:0kB writepending:48kB present:2080640kB managed:1756232kB mlocked:0kB slab_reclaimable:8044kB slab_unreclaimable:45132kB kernel_stack:22128kB pagetables:34304kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
[ 99.009428] lowmem_reserve[]: 0 0 0 0
[ 99.010816] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 2*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (M) 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7176kB
[ 99.014262] Node 0 DMA32: 909*4kB (UE) 548*8kB (UME) 190*16kB (UME) 99*32kB (UME) 37*64kB (UME) 14*128kB (UME) 5*256kB (UME) 3*512kB (E) 2*1024kB (UM) 1*2048kB (M) 5*4096kB (M) = 45780kB
[ 99.018848] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 99.021288] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 99.023758] 2752 total pagecache pages
[ 99.025196] 0 pages in swap cache
[ 99.026538] Swap cache stats: add 0, delete 0, find 0/0
[ 99.028521] Free swap = 0kB
[ 99.029923] Total swap = 0kB
[ 99.031212] 524157 pages RAM
[ 99.032458] 0 pages HighMem/MovableOnly
[ 99.033812] 81123 pages reserved
[ 99.035255] 0 pages cma reserved
[ 99.036729] 0 pages hwpoisoned
[ 99.037898] Out of memory: Kill process 2973 (a.out) score 999 or sacrifice child
[ 99.039902] Killed process 2973 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 99.043953] oom_reaper: reaped process 2973 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 173.285686] sysrq: SysRq : Show State
(...snipped...)
[ 173.899630] kswapd0 D 0 51 2 0x00000000
[ 173.900935] Call Trace:
[ 173.901706] ? __schedule+0x1d2/0x5a0
[ 173.902906] ? schedule+0x2d/0x80
[ 173.904034] ? schedule_timeout+0x192/0x240
[ 173.905437] ? __down_common+0xc0/0x128
[ 173.906549] ? down+0x36/0x40
[ 173.907433] ? xfs_buf_lock+0x1d/0x40 [xfs]
[ 173.908574] ? _xfs_buf_find+0x2ad/0x580 [xfs]
[ 173.909734] ? xfs_buf_get_map+0x1d/0x140 [xfs]
[ 173.910886] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 173.912045] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 173.913381] ? xfs_read_agf+0x8d/0x120 [xfs]
[ 173.914725] ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[ 173.916225] ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[ 173.917491] ? __radix_tree_lookup+0x80/0xf0
[ 173.918593] ? __radix_tree_lookup+0x80/0xf0
[ 173.920091] ? xfs_alloc_vextent+0x148/0x460 [xfs]
[ 173.921549] ? xfs_bmap_btalloc+0x45e/0x8a0 [xfs]
[ 173.922728] ? xfs_bmapi_write+0x768/0x1250 [xfs]
[ 173.923904] ? kmem_cache_alloc+0x11c/0x130
[ 173.925030] ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[ 173.926592] ? xfs_map_blocks+0x181/0x230 [xfs]
[ 173.927854] ? xfs_do_writepage+0x1db/0x630 [xfs]
[ 173.929046] ? xfs_setfilesize_trans_alloc.isra.26+0x35/0x80 [xfs]
[ 173.930665] ? xfs_vm_writepage+0x31/0x70 [xfs]
[ 173.931915] ? pageout.isra.47+0x188/0x280
[ 173.933005] ? shrink_page_list+0x79d/0xbb0
[ 173.934138] ? shrink_inactive_list+0x1c2/0x3d0
[ 173.935609] ? radix_tree_gang_lookup_tag+0xe3/0x160
[ 173.937100] ? shrink_node_memcg+0x33a/0x740
[ 173.938335] ? _cond_resched+0x10/0x20
[ 173.939443] ? _cond_resched+0x10/0x20
[ 173.940470] ? shrink_node+0xe0/0x320
[ 173.941483] ? kswapd+0x2b4/0x660
[ 173.942424] ? kthread+0xf2/0x130
[ 173.943396] ? mem_cgroup_shrink_node+0xb0/0xb0
[ 173.944578] ? kthread_park+0x60/0x60
[ 173.945613] ? ret_from_fork+0x26/0x40
(...snipped...)
[ 195.183281] Showing busy workqueues and worker pools:
[ 195.184626] workqueue events_freezable_power_: flags=0x84
[ 195.186013] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 195.187596] in-flight: 24:disk_events_workfn
[ 195.188832] workqueue writeback: flags=0x4e
[ 195.189919] pwq 256: cpus=0-127 flags=0x4 nice=0 active=1/256
[ 195.191826] in-flight: 370:wb_workfn
[ 195.194105] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 129 63
[ 195.195883] pool 256: cpus=0-127 flags=0x4 nice=0 hung=96s workers=31 idle: 371 369 368 367 366 365 364 363 362 361 360 359 358 357 356 355 354 353 352 351 350 349 348 347 346 249 253 5 53 372
[ 243.365293] sysrq: SysRq : Resetting
----------------------------------------

2017-04-24 12:40:04

by Stanislaw Gruszka

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> On 2017/03/10 20:44, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> >> On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> >>> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> >>>> It only does this to some extent. If reclaim made
> >>>> no progress, for example due to immediately bailing
> >>>> out because the number of already isolated pages is
> >>>> too high (due to many parallel reclaimers), the code
> >>>> could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> >>>> test without ever looking at the number of reclaimable
> >>>> pages.
> >>>
> >>> Hm, there is no early return there, actually. We bump the loop counter
> >>> every time it happens, but then *do* look at the reclaimable pages.
> >>>
> >>>> Could that create problems if we have many concurrent
> >>>> reclaimers?
> >>>
> >>> With increased concurrency, the likelihood of OOM will go up if we
> >>> remove the unlimited wait for isolated pages, that much is true.
> >>>
> >>> I'm not sure that's a bad thing, however, because we want the OOM
> >>> killer to be predictable and timely. So a reasonable wait time in
> >>> between 0 and forever before an allocating thread gives up under
> >>> extreme concurrency makes sense to me.
> >>>
> >>>> It may be OK, I just do not understand all the implications.
> >>>>
> >>>> I like the general direction your patch takes the code in,
> >>>> but I would like to understand it better...
> >>>
> >>> I feel the same way. The throttling logic doesn't seem to be very well
> >>> thought out at the moment, making it hard to reason about what happens
> >>> in certain scenarios.
> >>>
> >>> In that sense, this patch isn't really an overall improvement to the
> >>> way things work. It patches a hole that seems to be exploitable only
> >>> from an artificial OOM torture test, at the risk of regressing high
> >>> concurrency workloads that may or may not be artificial.
> >>>
> >>> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> >>> behind this patch. Can we think about a general model to deal with
> >>> allocation concurrency?
> >>
> >> I am definitely not against. There is no reason to rush the patch in.
> >
> > I don't hurry if we can check using watchdog whether this problem is occurring
> > in the real world. I have to test corner cases because watchdog is missing.
> >
> Ping?
>
> This problem can occur even immediately after the first invocation of
> the OOM killer. I believe this problem can occur in the real world.
> When are we going to apply this patch or watchdog patch?
>
> ----------------------------------------
> [ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1

Are you debugging a memory corruption problem?

FWIW, if you use debug_guardpage_minorder= you can expect all sorts of
memory allocation problems. This option is intended for debugging memory
corruption bugs and it shrinks the available memory in an artificial way.
Given that, I don't think it is reasonable to justify any patch by a
problem that happened while debug_guardpage_minorder= was used.

Stanislaw

2017-04-24 13:06:43

by Tetsuo Handa

Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Stanislaw Gruszka wrote:
> On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > >> I am definitely not against. There is no reason to rush the patch in.
> > >
> > > I don't hurry if we can check using watchdog whether this problem is occurring
> > > in the real world. I have to test corner cases because watchdog is missing.
> > >
> > Ping?
> >
> > This problem can occur even immediately after the first invocation of
> > the OOM killer. I believe this problem can occur in the real world.
> > When are we going to apply this patch or watchdog patch?
> >
> > ----------------------------------------
> > [ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> > [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
>
> Are you debugging memory corruption problem?

No. Just random testing, trying to find out how we can avoid flooding of
warn_alloc_stall() warning messages while also avoiding ratelimiting.

>
> FWIW, if you use debug_guardpage_minorder= you can expect any
> allocation memory problems. This option is intended to debug
> memory corruption bugs and it shrinks available memory in
> artificial way. Taking that, I don't think justifying any
> patch, by problem happened when debug_guardpage_minorder= is
> used, is reasonable.
>
> Stanislaw

This problem occurs without the debug_guardpage_minorder= parameter as well:

----------
[ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8
(...snipped...)
CentOS Linux 7 (Core)
Kernel 4.11.0-rc7-next-20170421+ on an x86_64

ccsecurity login: [ 31.882531] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 32.550187] Ebtables v2.0 registered
[ 32.730371] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 32.926518] IPv6: ADDRCONF(NETDEV_UP): ens32: link is not ready
[ 32.928310] e1000: ens32 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 32.930960] IPv6: ADDRCONF(NETDEV_CHANGE): ens32: link becomes ready
[ 33.741378] Netfilter messages via NETLINK v0.30.
[ 33.807350] ip_set: protocol 6
[ 37.581002] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead.
[ 38.072689] IPv6: ADDRCONF(NETDEV_UP): ens35: link is not ready
[ 38.074419] e1000: ens35 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 38.077222] IPv6: ADDRCONF(NETDEV_CHANGE): ens35: link becomes ready
[ 92.753140] gmain invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[ 92.763445] gmain cpuset=/ mems_allowed=0
[ 92.767634] CPU: 2 PID: 2733 Comm: gmain Not tainted 4.11.0-rc7-next-20170421+ #588
[ 92.773624] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 92.781790] Call Trace:
[ 92.782630] ? dump_stack+0x5c/0x7d
[ 92.783902] ? dump_header+0x97/0x233
[ 92.785427] ? ktime_get+0x30/0x90
[ 92.786390] ? delayacct_end+0x35/0x60
[ 92.787433] ? do_try_to_free_pages+0x2ca/0x370
[ 92.789157] ? oom_kill_process+0x223/0x3e0
[ 92.790502] ? has_capability_noaudit+0x17/0x20
[ 92.791761] ? oom_badness+0xeb/0x160
[ 92.792783] ? out_of_memory+0x10b/0x490
[ 92.793872] ? __alloc_pages_slowpath+0x701/0x8e2
[ 92.795603] ? __alloc_pages_nodemask+0x1ed/0x210
[ 92.796902] ? alloc_pages_current+0x7a/0x100
[ 92.798115] ? filemap_fault+0x2e9/0x5e0
[ 92.799204] ? filemap_map_pages+0x185/0x3a0
[ 92.800402] ? xfs_filemap_fault+0x2f/0x50 [xfs]
[ 92.801678] ? __do_fault+0x15/0x70
[ 92.802651] ? __handle_mm_fault+0xb0f/0x11e0
[ 92.805141] ? handle_mm_fault+0xc5/0x220
[ 92.807261] ? __do_page_fault+0x21e/0x4b0
[ 92.809203] ? do_page_fault+0x2b/0x70
[ 92.811018] ? do_syscall_64+0x137/0x140
[ 92.812554] ? page_fault+0x28/0x30
[ 92.813855] Mem-Info:
[ 92.815009] active_anon:437483 inactive_anon:2097 isolated_anon:0
[ 92.815009] active_file:0 inactive_file:104 isolated_file:41
[ 92.815009] unevictable:0 dirty:10 writeback:0 unstable:0
[ 92.815009] slab_reclaimable:2439 slab_unreclaimable:11018
[ 92.815009] mapped:405 shmem:2162 pagetables:8704 bounce:0
[ 92.815009] free:13168 free_pcp:58 free_cma:0
[ 92.825444] Node 0 active_anon:1749932kB inactive_anon:8388kB active_file:0kB inactive_file:592kB unevictable:0kB isolated(anon):0kB isolated(file):164kB mapped:1620kB dirty:40kB writeback:0kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1519616kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 92.832175] Node 0 DMA free:8148kB min:352kB low:440kB high:528kB active_anon:7696kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 92.840217] lowmem_reserve[]: 0 1952 1952 1952
[ 92.841799] Node 0 DMA32 free:45028kB min:44700kB low:55872kB high:67044kB active_anon:1742236kB inactive_anon:8388kB active_file:0kB inactive_file:992kB unevictable:0kB writepending:40kB present:2080640kB managed:2018376kB mlocked:0kB slab_reclaimable:9756kB slab_unreclaimable:44040kB kernel_stack:22192kB pagetables:34788kB bounce:0kB free_pcp:672kB local_pcp:0kB free_cma:0kB
[ 92.850458] lowmem_reserve[]: 0 0 0 0
[ 92.851881] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (M) 2*32kB (UM) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8148kB
[ 92.855530] Node 0 DMA32: 1023*4kB (UME) 591*8kB (UME) 220*16kB (UME) 223*32kB (UME) 156*64kB (UME) 38*128kB (UME) 12*256kB (UME) 10*512kB (UME) 2*1024kB (M) 0*2048kB 0*4096kB = 44564kB
[ 92.860735] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 92.863216] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 92.865714] 2994 total pagecache pages
[ 92.867201] 0 pages in swap cache
[ 92.868575] Swap cache stats: add 0, delete 0, find 0/0
[ 92.870309] Free swap = 0kB
[ 92.871579] Total swap = 0kB
[ 92.873000] 524157 pages RAM
[ 92.874351] 0 pages HighMem/MovableOnly
[ 92.875809] 15587 pages reserved
[ 92.877151] 0 pages cma reserved
[ 92.878513] 0 pages hwpoisoned
[ 92.879948] Out of memory: Kill process 2983 (a.out) score 998 or sacrifice child
[ 92.882182] Killed process 2983 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 92.886190] oom_reaper: reaped process 2983 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 96.072996] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[ 96.076683] a.out cpuset=/ mems_allowed=0
[ 96.078329] CPU: 3 PID: 2982 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588
[ 96.080583] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 96.083254] Call Trace:
[ 96.084404] ? dump_stack+0x5c/0x7d
[ 96.085855] ? dump_header+0x97/0x233
[ 96.087393] ? oom_kill_process+0x223/0x3e0
[ 96.089059] ? has_capability_noaudit+0x17/0x20
[ 96.090567] ? oom_badness+0xeb/0x160
[ 96.092133] ? out_of_memory+0x10b/0x490
[ 96.093920] ? __alloc_pages_slowpath+0x701/0x8e2
[ 96.095732] ? __alloc_pages_nodemask+0x1ed/0x210
[ 96.097544] ? alloc_pages_vma+0x9f/0x220
[ 96.099133] ? __handle_mm_fault+0xc22/0x11e0
[ 96.100668] ? handle_mm_fault+0xc5/0x220
[ 96.102387] ? __do_page_fault+0x21e/0x4b0
[ 96.103824] ? do_page_fault+0x2b/0x70
[ 96.105351] ? page_fault+0x28/0x30
[ 96.106759] Mem-Info:
[ 96.107908] active_anon:438003 inactive_anon:2097 isolated_anon:0
[ 96.107908] active_file:91 inactive_file:265 isolated_file:6
[ 96.107908] unevictable:0 dirty:1 writeback:121 unstable:0
[ 96.107908] slab_reclaimable:2439 slab_unreclaimable:11273
[ 96.107908] mapped:382 shmem:2162 pagetables:8698 bounce:0
[ 96.107908] free:13166 free_pcp:0 free_cma:0
[ 96.119325] Node 0 active_anon:1752012kB inactive_anon:8388kB active_file:364kB inactive_file:1060kB unevictable:0kB isolated(anon):0kB isolated(file):24kB mapped:1528kB dirty:4kB writeback:484kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1519616kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 96.125753] Node 0 DMA free:8148kB min:352kB low:440kB high:528kB active_anon:7696kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 96.133203] lowmem_reserve[]: 0 1952 1952 1952
[ 96.135013] Node 0 DMA32 free:44516kB min:44700kB low:55872kB high:67044kB active_anon:1743720kB inactive_anon:8388kB active_file:336kB inactive_file:792kB unevictable:0kB writepending:488kB present:2080640kB managed:2018376kB mlocked:0kB slab_reclaimable:9756kB slab_unreclaimable:45060kB kernel_stack:22192kB pagetables:34764kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 96.143814] lowmem_reserve[]: 0 0 0 0
[ 96.145371] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (M) 2*32kB (UM) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8148kB
[ 96.148956] Node 0 DMA32: 1052*4kB (UME) 599*8kB (UME) 212*16kB (UME) 237*32kB (UME) 155*64kB (UME) 39*128kB (UME) 12*256kB (UME) 10*512kB (UME) 2*1024kB (M) 0*2048kB 0*4096kB = 45128kB
[ 96.153861] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 96.156374] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 96.158817] 2598 total pagecache pages
[ 96.160434] 0 pages in swap cache
[ 96.161904] Swap cache stats: add 0, delete 0, find 0/0
[ 96.163762] Free swap = 0kB
[ 96.165142] Total swap = 0kB
[ 96.166507] 524157 pages RAM
[ 96.167839] 0 pages HighMem/MovableOnly
[ 96.169374] 15587 pages reserved
[ 96.170834] 0 pages cma reserved
[ 96.172247] 0 pages hwpoisoned
[ 96.173569] Out of memory: Kill process 2984 (a.out) score 998 or sacrifice child
[ 96.176242] Killed process 2984 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 96.182342] oom_reaper: reaped process 2984 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 242.498498] sysrq: SysRq : Show State
[ 242.503329] task PC stack pid father
[ 242.509822] systemd D 0 1 0 0x00000000
[ 242.515791] Call Trace:
[ 242.519807] ? __schedule+0x1d2/0x5a0
[ 242.526263] ? schedule+0x2d/0x80
[ 242.530940] ? schedule_timeout+0x16d/0x240
[ 242.536135] ? del_timer_sync+0x40/0x40
[ 242.541458] ? io_schedule_timeout+0x14/0x40
[ 242.543661] ? congestion_wait+0x79/0xd0
[ 242.545748] ? prepare_to_wait_event+0xf0/0xf0
[ 242.548051] ? shrink_inactive_list+0x388/0x3d0
[ 242.550323] ? shrink_node_memcg+0x33a/0x740
[ 242.552505] ? _cond_resched+0x10/0x20
[ 242.554743] ? _cond_resched+0x10/0x20
[ 242.556952] ? shrink_node+0xe0/0x320
[ 242.558962] ? do_try_to_free_pages+0xdc/0x370
[ 242.561168] ? try_to_free_pages+0xbe/0x100
[ 242.563309] ? __alloc_pages_slowpath+0x387/0x8e2
[ 242.565581] ? __wake_up_common+0x4c/0x80
[ 242.567759] ? __alloc_pages_nodemask+0x1ed/0x210
[ 242.570064] ? alloc_pages_current+0x7a/0x100
[ 242.572092] ? __do_page_cache_readahead+0xe9/0x250
[ 242.573707] ? radix_tree_lookup_slot+0x1e/0x50
[ 242.575081] ? find_get_entry+0x14/0x100
[ 242.576414] ? pagecache_get_page+0x21/0x200
[ 242.577678] ? filemap_fault+0x23a/0x5e0
[ 242.578859] ? filemap_map_pages+0x185/0x3a0
[ 242.580093] ? xfs_filemap_fault+0x2f/0x50 [xfs]
[ 242.581398] ? __do_fault+0x15/0x70
[ 242.582468] ? __handle_mm_fault+0xb0f/0x11e0
[ 242.583665] ? ep_ptable_queue_proc+0x90/0x90
[ 242.584831] ? handle_mm_fault+0xc5/0x220
[ 242.585993] ? __do_page_fault+0x21e/0x4b0
[ 242.587257] ? do_page_fault+0x2b/0x70
[ 242.589145] ? page_fault+0x28/0x30
(...snipped...)
[ 243.105826] kswapd0 D 0 51 2 0x00000000
[ 243.107344] Call Trace:
[ 243.108113] ? __schedule+0x1d2/0x5a0
[ 243.109114] ? schedule+0x2d/0x80
[ 243.110052] ? schedule_timeout+0x192/0x240
[ 243.111190] ? check_preempt_curr+0x7f/0x90
[ 243.112260] ? __down_common+0xc0/0x128
[ 243.113329] ? down+0x36/0x40
[ 243.114296] ? xfs_buf_lock+0x1d/0x40 [xfs]
[ 243.115473] ? _xfs_buf_find+0x2ad/0x580 [xfs]
[ 243.116785] ? xfs_buf_get_map+0x1d/0x140 [xfs]
[ 243.118052] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 243.119310] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 243.120655] ? _cond_resched+0x10/0x20
[ 243.122831] ? xfs_read_agf+0x8d/0x120 [xfs]
[ 243.124181] ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[ 243.125616] ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[ 243.127093] ? __radix_tree_lookup+0x80/0xf0
[ 243.128235] ? __radix_tree_lookup+0x80/0xf0
[ 243.129357] ? xfs_alloc_vextent+0x148/0x460 [xfs]
[ 243.130596] ? xfs_bmap_btalloc+0x45e/0x8a0 [xfs]
[ 243.131804] ? xfs_bmapi_write+0x768/0x1250 [xfs]
[ 243.133032] ? kmem_cache_alloc+0x11c/0x130
[ 243.134160] ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[ 243.135503] ? xfs_map_blocks+0x181/0x230 [xfs]
[ 243.136802] ? xfs_do_writepage+0x1db/0x630 [xfs]
[ 243.138030] ? xfs_vm_writepage+0x31/0x70 [xfs]
[ 243.139396] ? pageout.isra.47+0x188/0x280
[ 243.140490] ? shrink_page_list+0x79d/0xbb0
[ 243.141619] ? shrink_inactive_list+0x1c2/0x3d0
[ 243.142831] ? radix_tree_gang_lookup_tag+0xe3/0x160
[ 243.144072] ? shrink_node_memcg+0x33a/0x740
[ 243.145188] ? _cond_resched+0x10/0x20
[ 243.146410] ? _cond_resched+0x10/0x20
[ 243.147746] ? shrink_node+0xe0/0x320
[ 243.148754] ? kswapd+0x2b4/0x660
[ 243.149691] ? kthread+0xf2/0x130
[ 243.150690] ? mem_cgroup_shrink_node+0xb0/0xb0
[ 243.151887] ? kthread_park+0x60/0x60
[ 243.152909] ? ret_from_fork+0x26/0x40
(...snipped...)
[ 273.216540] Showing busy workqueues and worker pools:
[ 273.218084] workqueue events_freezable_power_: flags=0x84
[ 273.219707] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 273.221259] in-flight: 381:disk_events_workfn
[ 273.222576] workqueue writeback: flags=0x4e
[ 273.223721] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 273.225240] in-flight: 344:wb_workfn wb_workfn
[ 273.227485] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 63 17
[ 273.229266] pool 256: cpus=0-127 flags=0x4 nice=0 hung=180s workers=34 idle: 343 342 341 340 339 338 337 336 335 334 333 332 331 329 330 328 327 326 325 324 323 322 321 320 319 318 317 248 280 53 345 5 348
[ 340.690056] sysrq: SysRq : Resetting
----------

This problem also occurs with only 4 parallel writers.

----------
[ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
(...snipped...)
[ 383.692506] Out of memory: Kill process 3391 (a.out) score 999 or sacrifice child
[ 383.694476] Killed process 3391 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 383.699008] oom_reaper: reaped process 3391 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 445.711383] sysrq: SysRq : Show State
[ 445.718193] task PC stack pid father
(...snipped...)
[ 446.272860] kswapd0 D 0 51 2 0x00000000
[ 446.274148] Call Trace:
[ 446.274890] ? __schedule+0x1d2/0x5a0
[ 446.275847] ? schedule+0x2d/0x80
[ 446.276736] ? rwsem_down_read_failed+0x108/0x180
[ 446.278223] ? call_rwsem_down_read_failed+0x14/0x30
[ 446.280076] ? down_read+0x17/0x30
[ 446.281297] ? xfs_map_blocks+0x8f/0x230 [xfs]
[ 446.282685] ? xfs_do_writepage+0x1db/0x630 [xfs]
[ 446.283985] ? xfs_vm_writepage+0x31/0x70 [xfs]
[ 446.285124] ? pageout.isra.47+0x188/0x280
[ 446.286192] ? shrink_page_list+0x79d/0xbb0
[ 446.287296] ? shrink_inactive_list+0x1c2/0x3d0
[ 446.288442] ? radix_tree_gang_lookup_tag+0xe3/0x160
[ 446.289808] ? shrink_node_memcg+0x33a/0x740
[ 446.291027] ? _cond_resched+0x10/0x20
[ 446.292038] ? _cond_resched+0x10/0x20
[ 446.293089] ? shrink_node+0xe0/0x320
[ 446.294069] ? kswapd+0x2b4/0x660
[ 446.295036] ? kthread+0xf2/0x130
[ 446.296211] ? mem_cgroup_shrink_node+0xb0/0xb0
[ 446.297367] ? kthread_park+0x60/0x60
[ 446.298353] ? ret_from_fork+0x26/0x40
(...snipped...)
[ 448.285791] a.out D 0 3387 2847 0x00000080
[ 448.287194] Call Trace:
[ 448.287975] ? __schedule+0x1d2/0x5a0
[ 448.288975] ? schedule+0x2d/0x80
[ 448.289910] ? schedule_timeout+0x16d/0x240
[ 448.291072] ? del_timer_sync+0x40/0x40
[ 448.292097] ? io_schedule_timeout+0x14/0x40
[ 448.293294] ? congestion_wait+0x79/0xd0
[ 448.294327] ? prepare_to_wait_event+0xf0/0xf0
[ 448.295476] ? shrink_inactive_list+0x388/0x3d0
[ 448.296650] ? shrink_node_memcg+0x33a/0x740
[ 448.298016] ? _cond_resched+0x10/0x20
[ 448.299027] ? _cond_resched+0x10/0x20
[ 448.300032] ? shrink_node+0xe0/0x320
[ 448.301068] ? do_try_to_free_pages+0xdc/0x370
[ 448.302247] ? try_to_free_pages+0xbe/0x100
[ 448.303325] ? __alloc_pages_slowpath+0x387/0x8e2
[ 448.304492] ? __lock_page_or_retry+0x1b8/0x300
[ 448.305628] ? __alloc_pages_nodemask+0x1ed/0x210
[ 448.306809] ? alloc_pages_vma+0x9f/0x220
[ 448.307874] ? __handle_mm_fault+0xc22/0x11e0
[ 448.308984] ? handle_mm_fault+0xc5/0x220
[ 448.310228] ? __do_page_fault+0x21e/0x4b0
[ 448.311500] ? do_page_fault+0x2b/0x70
[ 448.312609] ? page_fault+0x28/0x30
[ 448.313926] a.out D 0 3388 3387 0x00000086
[ 448.315461] Call Trace:
[ 448.316339] ? __schedule+0x1d2/0x5a0
[ 448.317348] ? schedule+0x2d/0x80
[ 448.318291] ? schedule_timeout+0x192/0x240
[ 448.319372] ? sched_clock_cpu+0xc/0xa0
[ 448.320417] ? __down_common+0xc0/0x128
[ 448.321583] ? down+0x36/0x40
[ 448.322463] ? xfs_buf_lock+0x1d/0x40 [xfs]
[ 448.323572] ? _xfs_buf_find+0x2ad/0x580 [xfs]
[ 448.324698] ? xfs_buf_get_map+0x1d/0x140 [xfs]
[ 448.325885] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 448.327045] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 448.328303] ? xfs_read_agf+0x8d/0x120 [xfs]
[ 448.329384] ? xfs_trans_read_buf_map+0x178/0x2f0 [xfs]
[ 448.330906] ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[ 448.332401] ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[ 448.333738] ? xfs_btree_rec_addr+0x9/0x10 [xfs]
[ 448.335180] ? _cond_resched+0x10/0x20
[ 448.336628] ? __kmalloc+0x114/0x180
[ 448.337783] ? xfs_buf_rele+0x57/0x3b0 [xfs]
[ 448.339143] ? __radix_tree_lookup+0x80/0xf0
[ 448.340406] ? xfs_free_extent_fix_freelist+0x67/0xc0 [xfs]
[ 448.341889] ? xfs_free_extent+0x6f/0x210 [xfs]
[ 448.343210] ? xfs_trans_free_extent+0x27/0x90 [xfs]
[ 448.344565] ? xfs_extent_free_finish_item+0x1c/0x30 [xfs]
[ 448.346042] ? xfs_defer_finish+0x125/0x280 [xfs]
[ 448.348145] ? xfs_itruncate_extents+0x1a2/0x3c0 [xfs]
[ 448.349999] ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[ 448.351680] ? xfs_release+0x135/0x160 [xfs]
[ 448.353278] ? __fput+0xc8/0x1c0
[ 448.354355] ? task_work_run+0x6e/0x90
[ 448.355646] ? do_exit+0x2b6/0xab0
[ 448.356761] ? do_group_exit+0x34/0xa0
[ 448.357901] ? get_signal+0x17c/0x4f0
[ 448.359039] ? __do_fault+0x15/0x70
[ 448.360139] ? do_signal+0x31/0x610
[ 448.361238] ? handle_mm_fault+0xc5/0x220
[ 448.362487] ? __do_page_fault+0x21e/0x4b0
[ 448.363752] ? exit_to_usermode_loop+0x35/0x70
[ 448.365109] ? prepare_exit_to_usermode+0x39/0x40
[ 448.366475] ? retint_user+0x8/0x13
[ 448.367640] a.out D 0 3389 3387 0x00000086
[ 448.369260] Call Trace:
[ 448.370151] ? __schedule+0x1d2/0x5a0
[ 448.371220] ? schedule+0x2d/0x80
[ 448.372181] ? schedule_timeout+0x16d/0x240
[ 448.373277] ? del_timer_sync+0x40/0x40
[ 448.374309] ? io_schedule_timeout+0x14/0x40
[ 448.375414] ? congestion_wait+0x79/0xd0
[ 448.376460] ? prepare_to_wait_event+0xf0/0xf0
[ 448.377590] ? shrink_inactive_list+0x388/0x3d0
[ 448.378788] ? pick_next_task_fair+0x39c/0x480
[ 448.380269] ? shrink_node_memcg+0x33a/0x740
[ 448.381981] ? mem_cgroup_iter+0x127/0x2b0
[ 448.383266] ? shrink_node+0xe0/0x320
[ 448.384342] ? do_try_to_free_pages+0xdc/0x370
[ 448.385569] ? try_to_free_pages+0xbe/0x100
[ 448.386680] ? __alloc_pages_slowpath+0x387/0x8e2
[ 448.387909] ? __alloc_pages_nodemask+0x1ed/0x210
[ 448.389163] ? alloc_pages_current+0x7a/0x100
[ 448.390369] ? xfs_buf_allocate_memory+0x16a/0x2ad [xfs]
[ 448.391731] ? xfs_buf_get_map+0xeb/0x140 [xfs]
[ 448.392931] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 448.394114] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 448.395421] ? xfs_btree_read_buf_block.constprop.37+0x72/0xc0 [xfs]
[ 448.397007] ? xfs_btree_lookup_get_block+0x7f/0x160 [xfs]
[ 448.398671] ? xfs_btree_lookup+0xc9/0x3f0 [xfs]
[ 448.399927] ? xfs_bmap_del_extent+0x1a0/0xbb0 [xfs]
[ 448.401357] ? __xfs_bunmapi+0x3bb/0xb70 [xfs]
[ 448.402679] ? xfs_bunmapi+0x26/0x40 [xfs]
[ 448.403907] ? xfs_itruncate_extents+0x18a/0x3c0 [xfs]
[ 448.405339] ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[ 448.406688] ? xfs_release+0x135/0x160 [xfs]
[ 448.407911] ? __fput+0xc8/0x1c0
[ 448.408939] ? task_work_run+0x6e/0x90
[ 448.410061] ? do_exit+0x2b6/0xab0
[ 448.411156] ? do_group_exit+0x34/0xa0
[ 448.412301] ? get_signal+0x17c/0x4f0
[ 448.413526] ? __do_fault+0x15/0x70
[ 448.415066] ? do_signal+0x31/0x610
[ 448.416174] ? handle_mm_fault+0xc5/0x220
[ 448.417490] ? __do_page_fault+0x21e/0x4b0
[ 448.418729] ? exit_to_usermode_loop+0x35/0x70
[ 448.419976] ? prepare_exit_to_usermode+0x39/0x40
[ 448.421336] ? retint_user+0x8/0x13
[ 448.422414] a.out D 0 3391 3387 0x00000086
[ 448.423873] Call Trace:
[ 448.424755] ? __schedule+0x1d2/0x5a0
[ 448.425857] ? schedule+0x2d/0x80
[ 448.426918] ? schedule_timeout+0x192/0x240
[ 448.428143] ? mempool_alloc+0x64/0x170
[ 448.429318] ? __down_common+0xc0/0x128
[ 448.430401] ? down+0x36/0x40
[ 448.431561] ? xfs_buf_lock+0x1d/0x40 [xfs]
[ 448.432727] ? _xfs_buf_find+0x2ad/0x580 [xfs]
[ 448.433976] ? xfs_buf_get_map+0x1d/0x140 [xfs]
[ 448.435216] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 448.436545] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 448.437901] ? xfs_read_agf+0x8d/0x120 [xfs]
[ 448.439111] ? xfs_trans_read_buf_map+0x178/0x2f0 [xfs]
[ 448.440624] ? xfs_alloc_read_agf+0x39/0x130 [xfs]
[ 448.441958] ? xfs_alloc_fix_freelist+0x369/0x430 [xfs]
[ 448.443524] ? xfs_btree_rec_addr+0x9/0x10 [xfs]
[ 448.444800] ? _cond_resched+0x10/0x20
[ 448.445933] ? __kmalloc+0x114/0x180
[ 448.447319] ? xfs_buf_rele+0x57/0x3b0 [xfs]
[ 448.448657] ? __radix_tree_lookup+0x80/0xf0
[ 448.449934] ? xfs_free_extent_fix_freelist+0x67/0xc0 [xfs]
[ 448.451445] ? xfs_free_extent+0x6f/0x210 [xfs]
[ 448.452608] ? xfs_trans_free_extent+0x27/0x90 [xfs]
[ 448.453874] ? xfs_extent_free_finish_item+0x1c/0x30 [xfs]
[ 448.455203] ? xfs_defer_finish+0x125/0x280 [xfs]
[ 448.456410] ? xfs_itruncate_extents+0x1a2/0x3c0 [xfs]
[ 448.457682] ? xfs_free_eofblocks+0x1c5/0x230 [xfs]
[ 448.458937] ? xfs_release+0x135/0x160 [xfs]
[ 448.460060] ? __fput+0xc8/0x1c0
[ 448.461081] ? task_work_run+0x6e/0x90
[ 448.462103] ? do_exit+0x2b6/0xab0
[ 448.463064] ? do_group_exit+0x34/0xa0
[ 448.464347] ? get_signal+0x17c/0x4f0
[ 448.465402] ? do_signal+0x31/0x610
[ 448.466373] ? xfs_file_write_iter+0x88/0x120 [xfs]
[ 448.467614] ? __vfs_write+0xe5/0x140
[ 448.468613] ? exit_to_usermode_loop+0x35/0x70
[ 448.469747] ? do_syscall_64+0x12a/0x140
[ 448.470827] ? entry_SYSCALL64_slow_path+0x25/0x25
[ 448.472399] a.out D 0 3392 3387 0x00000080
[ 448.473757] Call Trace:
[ 448.474567] ? __schedule+0x1d2/0x5a0
[ 448.475598] ? schedule+0x2d/0x80
[ 448.476566] ? schedule_timeout+0x16d/0x240
[ 448.477688] ? del_timer_sync+0x40/0x40
[ 448.478709] ? io_schedule_timeout+0x14/0x40
[ 448.480159] ? congestion_wait+0x79/0xd0
[ 448.481998] ? prepare_to_wait_event+0xf0/0xf0
[ 448.483679] ? shrink_inactive_list+0x388/0x3d0
[ 448.485113] ? shrink_node_memcg+0x33a/0x740
[ 448.486310] ? xfs_reclaim_inodes_count+0x2d/0x40 [xfs]
[ 448.487609] ? mem_cgroup_iter+0x127/0x2b0
[ 448.488719] ? shrink_node+0xe0/0x320
[ 448.489747] ? do_try_to_free_pages+0xdc/0x370
[ 448.490926] ? try_to_free_pages+0xbe/0x100
[ 448.492122] ? __alloc_pages_slowpath+0x387/0x8e2
[ 448.493347] ? __alloc_pages_nodemask+0x1ed/0x210
[ 448.494633] ? alloc_pages_current+0x7a/0x100
[ 448.495800] ? xfs_buf_allocate_memory+0x16a/0x2ad [xfs]
[ 448.497170] ? xfs_buf_get_map+0xeb/0x140 [xfs]
[ 448.498710] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 448.499861] ? xfs_trans_read_buf_map+0xe5/0x2f0 [xfs]
[ 448.501200] ? xfs_btree_read_buf_block.constprop.37+0x72/0xc0 [xfs]
[ 448.502698] ? xfs_btree_lookup_get_block+0x7f/0x160 [xfs]
[ 448.504021] ? xfs_btree_lookup+0xc9/0x3f0 [xfs]
[ 448.505242] ? xfs_iext_remove_direct+0x64/0xd0 [xfs]
[ 448.506495] ? xfs_bmap_add_extent_delay_real+0x4f9/0x18e0 [xfs]
[ 448.507930] ? _cond_resched+0x10/0x20
[ 448.508972] ? kmem_cache_alloc+0x11c/0x130
[ 448.510132] ? kmem_zone_alloc+0x84/0xf0 [xfs]
[ 448.511366] ? xfs_bmapi_write+0x826/0x1250 [xfs]
[ 448.512572] ? kmem_cache_alloc+0x11c/0x130
[ 448.514112] ? xfs_iomap_write_allocate+0x175/0x360 [xfs]
[ 448.515920] ? xfs_map_blocks+0x181/0x230 [xfs]
[ 448.517136] ? xfs_do_writepage+0x1db/0x630 [xfs]
[ 448.518381] ? invalid_page_referenced_vma+0x80/0x80
[ 448.519640] ? write_cache_pages+0x205/0x400
[ 448.520831] ? xfs_vm_set_page_dirty+0x1c0/0x1c0 [xfs]
[ 448.522203] ? iomap_apply+0xe3/0x120
[ 448.523271] ? xfs_vm_writepages+0x5f/0xa0 [xfs]
[ 448.524523] ? __filemap_fdatawrite_range+0xc0/0xf0
[ 448.525866] ? filemap_write_and_wait_range+0x20/0x50
[ 448.527157] ? xfs_file_fsync+0x41/0x160 [xfs]
[ 448.528319] ? do_fsync+0x33/0x60
[ 448.529273] ? SyS_fsync+0x7/0x10
[ 448.530267] ? do_syscall_64+0x5c/0x140
[ 448.531609] ? entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[ 580.304479] Showing busy workqueues and worker pools:
[ 580.306114] workqueue events_freezable_power_: flags=0x84
[ 580.307522] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 580.309059] in-flight: 99:disk_events_workfn
[ 580.310365] workqueue writeback: flags=0x4e
[ 580.312273] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 580.313966] in-flight: 342:wb_workfn wb_workfn
[ 580.316378] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=2s workers=3 idle: 24 3095
[ 580.318281] pool 256: cpus=0-127 flags=0x4 nice=0 hung=198s workers=3 idle: 341 340
[ 595.909943] sysrq: SysRq : Resetting
----------

This problem is very much dependent on timing, and warn_alloc_stall() cannot
catch it.

2017-04-25 06:33:14

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Mon, Apr 24, 2017 at 10:06:32PM +0900, Tetsuo Handa wrote:
> Stanislaw Gruszka wrote:
> > On Sun, Apr 23, 2017 at 07:24:21PM +0900, Tetsuo Handa wrote:
> > > On 2017/03/10 20:44, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > >> I am definitely not against. There is no reason to rush the patch in.
> > > >
> > > > I don't hurry if we can check using watchdog whether this problem is occurring
> > > > in the real world. I have to test corner cases because watchdog is missing.
> > > >
> > > Ping?
> > >
> > > This problem can occur even immediately after the first invocation of
> > > the OOM killer. I believe this problem can occur in the real world.
> > > When are we going to apply this patch or watchdog patch?
> > >
> > > ----------------------------------------
> > > [ 0.000000] Linux version 4.11.0-rc7-next-20170421+ (root@ccsecurity) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #588 SMP Sun Apr 23 17:38:02 JST 2017
> > > [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro vconsole.keymap=jp106 crashkernel=256M vconsole.font=latarcyrheb-sun16 security=none sysrq_always_enabled console=ttyS0,115200n8 console=tty0 LANG=en_US.UTF-8 debug_guardpage_minorder=1
> >
> > Are you debugging memory corruption problem?
>
> No. Just random testing, trying to find how we can avoid flooding of
> warn_alloc_stall() warning messages while also avoiding ratelimiting.

This is not the right way to stress the mm subsystem; the debug_guardpage_minorder=
option is for _debug_ purposes. Use mem= instead if you want to limit the
available memory.
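
For example (illustrative values only, reusing the boot entry from the reported
command line above), something like this limits the usable RAM via mem= instead
of relying on the debug option:

----------
BOOT_IMAGE=/boot/vmlinuz-4.11.0-rc7-next-20170421+ root=UUID=17c3c28f-a70a-4666-95fa-ecf6acd901e4 ro mem=1024M sysrq_always_enabled console=ttyS0,115200n8 console=tty0
----------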

> > FWIW, if you use debug_guardpage_minorder= you can expect all sorts of
> > memory allocation problems. This option is intended to debug
> > memory corruption bugs and it shrinks the available memory in an
> > artificial way. Taking that into account, I don't think justifying any
> > patch by a problem that happened when debug_guardpage_minorder= is
> > used is reasonable.
> >
> > Stanislaw
>
> This problem occurs without debug_guardpage_minorder= parameter and

So please justify your patches by that.

Thanks
Stanislaw

2017-06-30 00:14:30

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 09-03-17 13:05:40, Johannes Weiner wrote:
> > > On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> > > > It only does this to some extent. If reclaim made
> > > > no progress, for example due to immediately bailing
> > > > out because the number of already isolated pages is
> > > > too high (due to many parallel reclaimers), the code
> > > > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > > > test without ever looking at the number of reclaimable
> > > > pages.
> > >
> > > Hm, there is no early return there, actually. We bump the loop counter
> > > every time it happens, but then *do* look at the reclaimable pages.
> > >
> > > > Could that create problems if we have many concurrent
> > > > reclaimers?
> > >
> > > With increased concurrency, the likelihood of OOM will go up if we
> > > remove the unlimited wait for isolated pages, that much is true.
> > >
> > > I'm not sure that's a bad thing, however, because we want the OOM
> > > killer to be predictable and timely. So a reasonable wait time in
> > > between 0 and forever before an allocating thread gives up under
> > > extreme concurrency makes sense to me.
> > >
> > > > It may be OK, I just do not understand all the implications.
> > > >
> > > > I like the general direction your patch takes the code in,
> > > > but I would like to understand it better...
> > >
> > > I feel the same way. The throttling logic doesn't seem to be very well
> > > thought out at the moment, making it hard to reason about what happens
> > > in certain scenarios.
> > >
> > > In that sense, this patch isn't really an overall improvement to the
> > > way things work. It patches a hole that seems to be exploitable only
> > > from an artificial OOM torture test, at the risk of regressing high
> > > concurrency workloads that may or may not be artificial.
> > >
> > > Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> > > behind this patch. Can we think about a general model to deal with
> > > allocation concurrency?
> >
> > I am definitely not against. There is no reason to rush the patch in.
>
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.
>
> > My main point behind this patch was to reduce unbound loops from inside
> > the reclaim path and push any throttling up the call chain to the
> > page allocator path because I believe that it is easier to reason
> > about them at that level. The direct reclaim should be as simple as
> > possible without too many side effects otherwise we end up in a highly
> > unpredictable behavior. This was a first step in that direction and my
> > testing so far didn't show any regressions.
> >
> > > Unlimited parallel direct reclaim is kinda
> > > bonkers in the first place. How about checking for excessive isolation
> > > counts from the page allocator and putting allocations on a waitqueue?
> >
> > I would be interested in details here.
>
> That will help implementing __GFP_KILLABLE.
> https://bugzilla.kernel.org/show_bug.cgi?id=192981#c15
>
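
For reference, the allocator-side check mentioned above ("no_progress_loops >
MAX_RECLAIM_RETRIES") can be modelled roughly as below. This is only a simplified
userspace sketch of the decision, not the actual mm/page_alloc.c code, and the
struct and helper names are made up for illustration:

----------
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16	/* the kernel's limit in mm/internal.h */

struct retry_state {
	int no_progress_loops;		/* reclaim rounds without progress */
	unsigned long free_pages;
	unsigned long reclaimable_pages;
	unsigned long watermark;
};

/* Illustrative model of the retry decision discussed above. */
static bool should_retry_reclaim(struct retry_state *st, bool made_progress)
{
	/* Every round without progress bumps the loop counter ... */
	st->no_progress_loops = made_progress ? 0 : st->no_progress_loops + 1;

	if (st->no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;	/* give up and head towards the OOM killer */

	/*
	 * ... but the reclaimable pages are still looked at: keep retrying
	 * as long as reclaiming everything left could satisfy the watermark.
	 */
	return st->free_pages + st->reclaimable_pages > st->watermark;
}
----------
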
Ping? Ping? When are we going to apply this patch or the watchdog patch?
This problem occurs with not-so-insane stress like the one shown below.
I can't test an almost-OOM situation because the test likely falls into either
the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	int i;

	/* Spawn 10 easy-to-kill writers which keep filesystem writeback busy. */
	for (i = 0; i < 10; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			if (!i)
				pause();
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
				fsync(fd);
			_exit(0);
		}
	}
	/* Grab as much anonymous memory as overcommit allows. */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	return 0;
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170629-3.txt.xz .

[ 190.924887] a.out D13296 2191 2172 0x00000080
[ 190.927121] Call Trace:
[ 190.928304] __schedule+0x23f/0x5d0
[ 190.929843] schedule+0x31/0x80
[ 190.931261] schedule_timeout+0x189/0x290
[ 190.933068] ? del_timer_sync+0x40/0x40
[ 190.934722] io_schedule_timeout+0x19/0x40
[ 190.936467] ? io_schedule_timeout+0x19/0x40
[ 190.938272] congestion_wait+0x7d/0xd0
[ 190.939919] ? wait_woken+0x80/0x80
[ 190.941452] shrink_inactive_list+0x3e3/0x4d0
[ 190.943281] shrink_node_memcg+0x360/0x780
[ 190.945023] ? check_preempt_curr+0x7d/0x90
[ 190.946794] ? try_to_wake_up+0x23b/0x3c0
[ 190.948741] shrink_node+0xdc/0x310
[ 190.950285] ? shrink_node+0xdc/0x310
[ 190.951870] do_try_to_free_pages+0xea/0x370
[ 190.953661] try_to_free_pages+0xc3/0x100
[ 190.955644] __alloc_pages_slowpath+0x441/0xd50
[ 190.957714] __alloc_pages_nodemask+0x20c/0x250
[ 190.959598] alloc_pages_vma+0x83/0x1e0
[ 190.961244] __handle_mm_fault+0xc2c/0x1030
[ 190.963006] handle_mm_fault+0xf4/0x220
[ 190.964871] __do_page_fault+0x25b/0x4a0
[ 190.966611] do_page_fault+0x30/0x80
[ 190.968169] page_fault+0x28/0x30

[ 190.987135] a.out D11896 2193 2191 0x00000086
[ 190.989636] Call Trace:
[ 190.990855] __schedule+0x23f/0x5d0
[ 190.992384] schedule+0x31/0x80
[ 190.993797] schedule_timeout+0x1c1/0x290
[ 190.995578] ? init_object+0x64/0xa0
[ 190.997133] __down+0x85/0xd0
[ 190.998476] ? __down+0x85/0xd0
[ 190.999879] ? deactivate_slab.isra.83+0x160/0x4b0
[ 191.001843] down+0x3c/0x50
[ 191.003116] ? down+0x3c/0x50
[ 191.004460] xfs_buf_lock+0x21/0x50 [xfs]
[ 191.006146] _xfs_buf_find+0x3cd/0x640 [xfs]
[ 191.007924] xfs_buf_get_map+0x25/0x150 [xfs]
[ 191.009736] xfs_buf_read_map+0x25/0xc0 [xfs]
[ 191.011891] xfs_trans_read_buf_map+0xef/0x2f0 [xfs]
[ 191.013990] xfs_read_agf+0x86/0x110 [xfs]
[ 191.015758] xfs_alloc_read_agf+0x3e/0x140 [xfs]
[ 191.017675] xfs_alloc_fix_freelist+0x3e8/0x4e0 [xfs]
[ 191.019725] ? kmem_zone_alloc+0x8a/0x110 [xfs]
[ 191.021613] ? set_track+0x6b/0x140
[ 191.023452] ? init_object+0x64/0xa0
[ 191.025049] ? ___slab_alloc+0x1b6/0x590
[ 191.026870] ? ___slab_alloc+0x1b6/0x590
[ 191.028581] xfs_free_extent_fix_freelist+0x78/0xe0 [xfs]
[ 191.030768] xfs_free_extent+0x6a/0x1d0 [xfs]
[ 191.032577] xfs_trans_free_extent+0x2c/0xb0 [xfs]
[ 191.034534] xfs_extent_free_finish_item+0x21/0x40 [xfs]
[ 191.036695] xfs_defer_finish+0x143/0x2b0 [xfs]
[ 191.038622] xfs_itruncate_extents+0x1a5/0x3d0 [xfs]
[ 191.040686] xfs_free_eofblocks+0x1a8/0x200 [xfs]
[ 191.042945] xfs_release+0x13f/0x160 [xfs]
[ 191.044811] xfs_file_release+0x10/0x20 [xfs]
[ 191.046674] __fput+0xda/0x1e0
[ 191.048077] ____fput+0x9/0x10
[ 191.049479] task_work_run+0x7b/0xa0
[ 191.051063] do_exit+0x2c5/0xb30
[ 191.052522] do_group_exit+0x3e/0xb0
[ 191.054103] get_signal+0x1dd/0x4f0
[ 191.055663] ? __do_fault+0x19/0xf0
[ 191.057790] do_signal+0x32/0x650
[ 191.059421] ? handle_mm_fault+0xf4/0x220
[ 191.061108] ? __do_page_fault+0x25b/0x4a0
[ 191.062818] exit_to_usermode_loop+0x5a/0x90
[ 191.064588] prepare_exit_to_usermode+0x40/0x50
[ 191.066468] retint_user+0x8/0x10

[ 191.085459] a.out D11576 2194 2191 0x00000086
[ 191.087652] Call Trace:
[ 191.088883] __schedule+0x23f/0x5d0
[ 191.090437] schedule+0x31/0x80
[ 191.091830] schedule_timeout+0x189/0x290
[ 191.093541] ? del_timer_sync+0x40/0x40
[ 191.095166] io_schedule_timeout+0x19/0x40
[ 191.096881] ? io_schedule_timeout+0x19/0x40
[ 191.098657] congestion_wait+0x7d/0xd0
[ 191.100254] ? wait_woken+0x80/0x80
[ 191.101758] shrink_inactive_list+0x3e3/0x4d0
[ 191.103574] shrink_node_memcg+0x360/0x780
[ 191.105599] ? check_preempt_curr+0x7d/0x90
[ 191.107402] ? try_to_wake_up+0x23b/0x3c0
[ 191.109087] shrink_node+0xdc/0x310
[ 191.110590] ? shrink_node+0xdc/0x310
[ 191.112153] do_try_to_free_pages+0xea/0x370
[ 191.113948] try_to_free_pages+0xc3/0x100
[ 191.115639] __alloc_pages_slowpath+0x441/0xd50
[ 191.117508] __alloc_pages_nodemask+0x20c/0x250
[ 191.119374] alloc_pages_current+0x65/0xd0
[ 191.121179] xfs_buf_allocate_memory+0x172/0x2d0 [xfs]
[ 191.123262] xfs_buf_get_map+0xbe/0x150 [xfs]
[ 191.125077] xfs_buf_read_map+0x25/0xc0 [xfs]
[ 191.126909] xfs_trans_read_buf_map+0xef/0x2f0 [xfs]
[ 191.128924] xfs_btree_read_buf_block.constprop.36+0x6d/0xc0 [xfs]
[ 191.131358] xfs_btree_lookup_get_block+0x85/0x180 [xfs]
[ 191.133529] xfs_btree_lookup+0x125/0x460 [xfs]
[ 191.135562] ? xfs_allocbt_init_cursor+0x43/0x130 [xfs]
[ 191.137674] xfs_free_ag_extent+0x9f/0x870 [xfs]
[ 191.139579] xfs_free_extent+0xb5/0x1d0 [xfs]
[ 191.141419] xfs_trans_free_extent+0x2c/0xb0 [xfs]
[ 191.143387] xfs_extent_free_finish_item+0x21/0x40 [xfs]
[ 191.145538] xfs_defer_finish+0x143/0x2b0 [xfs]
[ 191.147446] xfs_itruncate_extents+0x1a5/0x3d0 [xfs]
[ 191.149485] xfs_free_eofblocks+0x1a8/0x200 [xfs]
[ 191.151630] xfs_release+0x13f/0x160 [xfs]
[ 191.153373] xfs_file_release+0x10/0x20 [xfs]
[ 191.155248] __fput+0xda/0x1e0
[ 191.156637] ____fput+0x9/0x10
[ 191.158011] task_work_run+0x7b/0xa0
[ 191.159563] do_exit+0x2c5/0xb30
[ 191.161013] do_group_exit+0x3e/0xb0
[ 191.162557] get_signal+0x1dd/0x4f0
[ 191.164071] do_signal+0x32/0x650
[ 191.165526] ? handle_mm_fault+0xf4/0x220
[ 191.167429] ? __do_page_fault+0x283/0x4a0
[ 191.169254] exit_to_usermode_loop+0x5a/0x90
[ 191.171070] prepare_exit_to_usermode+0x40/0x50
[ 191.172976] retint_user+0x8/0x10

2017-06-30 13:32:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
[...]
> Ping? Ping? When are we going to apply this patch or the watchdog patch?
> This problem occurs with not-so-insane stress like the one shown below.
> I can't test an almost-OOM situation because the test likely falls into either
> the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.

So you are saying that the patch fixes this issue. Do I understand you
correctly? And you do not see any other negative side effects with it
applied?

I am sorry I didn't have much time to think about feedback from Johannes
yet. A more robust throttling method is surely due but also not trivial.
So I am not sure how to proceed. It is true that your last test case
with only 10 processes fighting resembles the reality much better than
hundreds (AFAIR) that you were using previously.

Rik, Johannes what do you think? Should we go with the simpler approach
for now and think of a better plan longterm?
--
Michal Hocko
SUSE Labs

2017-06-30 16:00:15

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Michal Hocko wrote:
> On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> [...]
> > Ping? Ping? When are we going to apply this patch or the watchdog patch?
> > This problem occurs with not-so-insane stress like the one shown below.
> > I can't test an almost-OOM situation because the test likely falls into either
> > the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.
>
> So you are saying that the patch fixes this issue. Do I understand you
> correctly? And you do not see any other negative side effects with it
> applied?

I hit this problem using http://lkml.kernel.org/r/[email protected]
on next-20170628. We won't be able to test whether the patch fixes this issue without
any other negative side effects unless this patch is sent to linux-next.git.
But at least we know that even if this patch is sent to linux-next.git, we will still see
bugs like http://lkml.kernel.org/r/[email protected] .

>
> I am sorry I didn't have much time to think about feedback from Johannes
> yet. A more robust throttling method is surely due but also not trivial.
> So I am not sure how to proceed. It is true that your last test case
> with only 10 processes fighting resembles the reality much better than
> hundreds (AFAIR) that you were using previously.

Even if hundreds are running, most of them are simply blocked inside open()
at down_write() (like the example from serial-20170423-2.txt.xz shown below).
The actual number of processes fighting for memory is always less than 100.

? __schedule+0x1d2/0x5a0
? schedule+0x2d/0x80
? rwsem_down_write_failed+0x1f9/0x370
? walk_component+0x43/0x270
? call_rwsem_down_write_failed+0x13/0x20
? down_write+0x24/0x40
? path_openat+0x670/0x1210
? do_filp_open+0x8c/0x100
? getname_flags+0x47/0x1e0
? do_sys_open+0x121/0x200
? do_syscall_64+0x5c/0x140
? entry_SYSCALL64_slow_path+0x25/0x25

>
> Rik, Johannes what do you think? Should we go with the simpler approach
> for now and think of a better plan longterm?

I don't hurry if we can check using watchdog whether this problem is occurring
in the real world. I have to test corner cases because watchdog is missing.

The watchdog does not introduce negative side effects; it will avoid soft lockups like
http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com ,
will avoid console_unlock() vs. oom_lock mutex lockups due to warn_alloc(),
and will catch similar bugs which people are failing to reproduce.

2017-06-30 16:19:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Sat 01-07-17 00:59:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 30-06-17 09:14:22, Tetsuo Handa wrote:
> > [...]
> > > Ping? Ping? When are we going to apply this patch or the watchdog patch?
> > > This problem occurs with not-so-insane stress like the one shown below.
> > > I can't test an almost-OOM situation because the test likely falls into either
> > > the printk() vs. oom_lock lockup problem or this too_many_isolated() problem.
> >
> > So you are saying that the patch fixes this issue. Do I understand you
> > correctly? And you do not see any other negative side effects with it
> > applied?
>
> I hit this problem using http://lkml.kernel.org/r/[email protected]
> on next-20170628. We won't be able to test whether the patch fixes this issue without
> any other negative side effects unless this patch is sent to linux-next.git.
> But at least we know that even if this patch is sent to linux-next.git, we will still see
> bugs like http://lkml.kernel.org/r/[email protected] .

It is really hard to pursue this half solution when there is no clear
indication it helps in your testing. So could you try to test with only
this patch on top of the current linux-next tree (or Linus tree) and see
if you can reproduce the problem?

It is possible that there are other potential problems but we at least
need to know whether it is worth going with the patch now.

[...]
> > Rik, Johannes what do you think? Should we go with the simpler approach
> > for now and think of a better plan longterm?
>
> I don't hurry if we can check using watchdog whether this problem is occurring
> in the real world. I have to test corner cases because watchdog is missing.
>
> The watchdog does not introduce negative side effects; it will avoid soft lockups like
> http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com ,
> will avoid console_unlock() vs. oom_lock mutex lockups due to warn_alloc(),
> and will catch similar bugs which people are failing to reproduce.

this way of pushing your patch is really annoying. Please do realize
that repeating the same thing all around will not make a patch more
likely to merge. You have proposed something, nobody has nacked it
so it waits for people to actually find it important enough to justify
the additional code. So please stop this.

I really do appreciate your testing because it uncovers corner cases
most people do not test for and we can actually make the code better in
the end.
--
Michal Hocko
SUSE Labs

2017-07-01 11:44:09

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Michal Hocko wrote:
> I really do appreciate your testing because it uncovers corner cases
> most people do not test for and we can actually make the code better in
> the end.

That statement does not get to my heart at all. Collision between your
approach and my approach is wasting both your time and my time.

I've reported this too_many_isolated() trap three years ago at
http://lkml.kernel.org/r/[email protected] .
Do you know that we already wasted 3 years without any attention?

You are rejecting serialization under OOM without giving a chance to test
the side effects of serialization under OOM in linux-next.git. I call such an attitude
"speculation", which you never accept.

Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
In the first place, if the system is under global OOM (which is a more
serious situation than memcg OOM), the delay caused by serialization will not
matter. Rather, I consider making sure that the system does not get
locked up to be more important. I'm reporting that serialization helps
facilitate the OOM killer/reaper operations, avoids lockups, and
solves the global OOM situation smoothly. But you are refusing my report without
giving it a chance to be tested in linux-next.git for whatever side effects pop up.
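
As a purely illustrative userspace model of what "serialization under OOM" means
here (a sketch of the idea only, not the kernel's oom_lock code; all names are
made up):

----------
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t oom_lock_model = PTHREAD_MUTEX_INITIALIZER;

/* Stubs standing in for the real checks and actions. */
static bool oom_situation(void) { return true; }
static void handle_oom_or_wait_for_victim(void) { puts("one task drives the OOM handling"); }

static void alloc_slowpath_model(void)
{
	if (!oom_situation())
		return;
	/*
	 * Serialize: one task at a time performs the OOM killer/reaper
	 * work while the other allocating tasks wait here instead of
	 * piling up inside the reclaim path.
	 */
	pthread_mutex_lock(&oom_lock_model);
	handle_oom_or_wait_for_victim();
	pthread_mutex_unlock(&oom_lock_model);
}

int main(void)
{
	alloc_slowpath_model();
	return 0;
}
----------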

Knowledge about OOM situations is hardly shared among Linux developers and users,
and is far from being an object of concern. As shown by the cgroup-aware OOM killer proposal,
what will happen if we restrict 0 <= oom_victims <= 1 is not shared among developers.

How many developers joined my OOM watchdog proposal? Every single time it is a
confrontation between you and me. You, as effectively the only participant,
showing a negative attitude is effectively a Nacked-by: response without an
alternative proposal.

Not everybody can afford testing with the absolutely latest upstream kernels.
Not being prepared to obtain information for analysis from distributor kernels makes
it impossible to tell whether a user's problems are already fixed in upstream
kernels, makes it impossible to identify patches which need to be backported to
distributor kernels, and is bad for customers using distributor kernels. Of course,
it is possible that distributors decide not to allow users to obtain information
for analysis, but such a decision cannot become a reason we cannot prepare to obtain
information for analysis in upstream kernels.

Suppose I take a step back and tolerate the burden of sitting in front of the console
24 hours a day, every day of the year, so that users can press SysRq when something
goes wrong; how nice it would be if all in-flight allocation requests were printed
upon SysRq. show_workqueue_state() being called upon SysRq-t is to some degree useful.

In fact, my proposal was such an approach before I started serializing using a kernel thread
(e.g. http://lkml.kernel.org/r/[email protected]
which I proposed two and a half years ago). Though, while my proposal was left ignored,
I learned that showing only the current thread is not sufficient and updated my watchdog
to show other threads (e.g. kswapd) using serialization.

A patch at http://lkml.kernel.org/r/[email protected]
which I posted two years ago also includes a proposal for handling the infinite
shrink_inactive_list() problem. After all, this shrink_inactive_list() problem was
ignored for three years without even getting a chance to be tested in linux-next.git.
Sigh...

I know my proposals might not be the best. But you cannot afford to show alternative proposals
because you are giving higher priority to other problems. And other developers cannot afford
to participate because they are not interested in, or do not share knowledge of, this problem.

My proposals do not constrain future kernels. We can revert them when they are
no longer needed. My proposals are meaningful as an interim approach, but you never accept
approaches which do not match your will (or desire). Without even giving people a chance to
test what side effects will crop up, how can your "I really do appreciate your testing"
statement get to my heart?

My watchdog allows detecting problems which were previously overlooked unless someone took on
an unrealistic burden (e.g. standing by 24 hours a day, every day of the year). You ask people
to prove that it is an MM problem, but I am dissatisfied that you are leaving alone the very
proposals which would help judge whether it is an MM problem.

> this way of pushing your patch is really annoying. Please do realize
> that repeating the same thing all around will not make a patch more
> likely to merge. You have proposed something, nobody has nacked it
> so it waits for people to actually find it important enough to justify
> the additional code. So please stop this.

When will people find time to judge it? We already wasted three years, and
knowledge about OOM situations is hardly shared among Linux developers and users,
and will unlikely become an object of concern. How many more years (or decades) will we
waste? The MM subsystem will change in the meantime and we will just ignore old kernels.

If you do want me to stop bringing up the watchdog here and there, please do show
an alternative approach which I can tolerate. If you cannot afford that, please allow
me to involve people (e.g. by making a call for people to join my proposals, since
you are asking me to wait until people find time to judge them).
Please do realize that just repeatedly saying "wait patiently" helps nothing.

> It is really hard to pursue this half solution when there is no clear
> indication it helps in your testing. So could you try to test with only
> this patch on top of the current linux-next tree (or Linus tree) and see
> if you can reproduce the problem?

With this patch on top of next-20170630, I no longer hit this problem.
(Of course, this is because this patch eliminates the infinite loop.)

2017-07-05 08:20:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

[this is getting tangential again and I will not respond any further if
this turns into yet another flame]

On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I really do appreciate your testing because it uncovers corner cases
> > most people do not test for and we can actually make the code better in
> > the end.
>
> That statement does not get to my heart at all. Collision between your
> approach and my approach is wasting both your time and my time.
>
> I've reported this too_many_isolated() trap three years ago at
> http://lkml.kernel.org/r/[email protected] .
> Do you know that we already wasted 3 years without any attention?

And how many real bugs have we seen in those three years? Well, zero
AFAIR, except for your corner case testing. So while I never dismissed
the problem, I've been saying this is not that trivial to fix, as my
attempt to address this and the review feedback I've received show.

> You are rejecting serialization under OOM without giving a chance to test
> the side effects of serialization under OOM in linux-next.git. I call such an attitude
> "speculation", which you never accept.

No, I am rejecting abusing the lock for a purpose it is not aimed for.

> Look at mem_cgroup_out_of_memory(). Memcg OOM does use serialization.
> In the first place, if the system is under global OOM (which is a more
> serious situation than memcg OOM), the delay caused by serialization will not
> matter. Rather, I consider making sure that the system does not get
> locked up to be more important. I'm reporting that serialization helps
> facilitate the OOM killer/reaper operations, avoids lockups, and
> solves the global OOM situation smoothly. But you are refusing my report without
> giving it a chance to be tested in linux-next.git for whatever side effects pop up.

You are mixing oranges with apples here. We do synchronize memcg oom
killer the same way as the global one.

> Knowledge about OOM situations is hardly shared among Linux developers and users,
> and is far from being an object of concern. As shown by the cgroup-aware OOM killer proposal,
> what will happen if we restrict 0 <= oom_victims <= 1 is not shared among developers.
>
> How many developers joined my OOM watchdog proposal? Every single time it is a
> confrontation between you and me. You, as effectively the only participant,
> showing a negative attitude is effectively a Nacked-by: response without an
> alternative proposal.

This is something all of us have to fight with. There are only so many
MM developers. You have to justify your changes in order to attract other
developers/users. You are basing your changes on speculations and what-ifs
for workloads that most developers consider borderline and
misconfigurations already.

> Not everybody can afford testing with the absolutely latest upstream kernels.
> Not being prepared to obtain information for analysis from distributor kernels makes
> it impossible to tell whether a user's problems are already fixed in upstream
> kernels, makes it impossible to identify patches which need to be backported to
> distributor kernels, and is bad for customers using distributor kernels. Of course,
> it is possible that distributors decide not to allow users to obtain information
> for analysis, but such a decision cannot become a reason we cannot prepare to obtain
> information for analysis in upstream kernels.

If you have to work with distribution kernels then talk to distribution
people. It is that simple. You are surely not using those systems just
because of a fancy logo...

[...]

> > this way of pushing your patch is really annoying. Please do realize
> > that repeating the same thing all around will not make a patch more
> > likely to merge. You have proposed something, nobody has nacked it
> > so it waits for people to actually find it important enough to justify
> > the additional code. So please stop this.
>
> When will people find time to judge it? We already wasted three years, and
> knowledge about OOM situations is hardly shared among Linux developers and users,
> and will unlikely become an object of concern. How many more years (or decades) will we
> waste? The MM subsystem will change in the meantime and we will just ignore old kernels.
>
> If you do want me to stop bringing up the watchdog here and there, please do show
> an alternative approach which I can tolerate. If you cannot afford that, please allow
> me to involve people (e.g. by making a call for people to join my proposals, since
> you are asking me to wait until people find time to judge them).
> Please do realize that just repeatedly saying "wait patiently" helps nothing.

You really have to realize that there will hardly be more interest in
your reports when they do not reflect real life situations. I have said
(several times) that those issues should be addressed eventually but
there are more pressing issues which do trigger in real life and
they take precedence.

Should we add a lot of code for something that doesn't bother many
users? I do not think so. As explained earlier (several times), this code
will have a maintenance cost and can also lead to other problems (false
positives etc.; just consider how easy it is to get false positive
lockup splats - I am facing reports of those very often on our
distribution kernels on large boxes).

I said I appreciate your testing regardless because I really mean it. We
really want to have a more robust out of memory handling long term. And
as you have surely noticed, quite some changes have been made in that
direction over the last few years. There are still many unaddressed ones, no
question about that. We do not have to jump into the first approach we
come up with for those, though. Cost/benefit evaluation has to be done
every time for each proposal. I am really not sure what is so hard to
understand about this.

[...]
--
Michal Hocko
SUSE Labs

2017-07-05 08:20:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > It is really hard to pursue this half solution when there is no clear
> > indication it helps in your testing. So could you try to test with only
> > this patch on top of the current linux-next tree (or Linus tree) and see
> > if you can reproduce the problem?
>
> With this patch on top of next-20170630, I no longer hit this problem.
> (Of course, this is because this patch eliminates the infinite loop.)

I assume you haven't seen other negative side effects, like unexpected
OOMs etc... Are you willing to give your Tested-by?
--
Michal Hocko
SUSE Labs

2017-07-06 10:48:55

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Michal Hocko wrote:
> On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > It is really hard to pursue this half solution when there is no clear
> > > indication it helps in your testing. So could you try to test with only
> > > this patch on top of the current linux-next tree (or Linus tree) and see
> > > if you can reproduce the problem?
> >
> > With this patch on top of next-20170630, I no longer hit this problem.
> > (Of course, this is because this patch eliminates the infinite loop.)
>
> I assume you haven't seen other negative side effects, like unexpected
> OOMs etc... Are you willing to give your Tested-by?

I didn't see other negative side effects.

Tested-by: Tetsuo Handa <[email protected]>

We need a long time for testing this patch in linux-next.git (and I give up
using this handy bug for finding other bugs under almost-OOM situations).