From: Michal Hocko <[email protected]>
Tetsuo Handa has reported [1][2][3] that direct reclaimers might get
stuck in the too_many_isolated loop basically forever because the last
few pages on the LRU lists are isolated by kswapd, which is stuck on fs
locks when doing the pageout or slab reclaim. This in turn means that
there is nobody to actually trigger the OOM killer and the system is
basically unusable.

too_many_isolated was introduced by commit 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature OOM killer invocations, because back then a lack of reclaim
progress could indeed trigger the OOM killer too early. But since the
OOM detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection"),
the allocation/reclaim retry loop considers all the reclaimable pages
and throttles the allocation at that layer, so we can loosen the direct
reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and
return immediately when the situation hasn't resolved after the first
sleep. Replace congestion_wait with a simple
schedule_timeout_interruptible because we are not really waiting on IO
congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for a reclaim that
makes progress only very slowly. This would be obvious from the OOM
report, as the number of isolated pages is printed there. If we ever
hit this, should_reclaim_retry should consider those numbers in the
evaluation in one way or another.
[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/[email protected]
Acked-by: Mel Gorman <[email protected]>
Tested-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
Hi,
I am resubmitting this patch previously sent here
http://lkml.kernel.org/r/[email protected].
Johannes and Rik had some concerns that this could lead to premature
OOM kills. I agree with them that we need a better throttling
mechanism. Until now we didn't give the issue described above a high
priority because it usually required a really insane workload to
trigger. But it seems that the issue can be reproduced also without
having an insane number of competing threads [3].
Moreover, the issue also triggers very often while testing heavy memory
pressure and so prevents further hardening work in that area
(http://lkml.kernel.org/r/[email protected]).
Tetsuo hasn't seen any negative effect of this patch in his OOM stress
tests, so I think we should go with this simple patch for now and think
about something more robust in the long term.

That being said, I suggest merging this (after it spends a full release
cycle in linux-next) for the time being, until we come up with a more
clever solution.
mm/vmscan.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..4ae069060ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+ bool stalled = false;
while (unlikely(too_many_isolated(pgdat, file, sc))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ if (stalled)
+ return 0;
+
+ /* wait a bit for the reclaimer. */
+ schedule_timeout_interruptible(HZ/10);
+ stalled = true;
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
--
2.11.0
On 07/10/2017 09:48 AM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get
> stuck in the too_many_isolated loop basically forever because the last
> few pages on the LRU lists are isolated by kswapd, which is stuck on fs
> locks when doing the pageout or slab reclaim. This in turn means that
> there is nobody to actually trigger the OOM killer and the system is
> basically unusable.
>
> too_many_isolated was introduced by commit 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> premature OOM killer invocations, because back then a lack of reclaim
> progress could indeed trigger the OOM killer too early. But since the
> OOM detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection"),
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer, so we can loosen the direct
> reclaim throttling.
>
> Make the shrink_inactive_list loop over too_many_isolated bounded and
> return immediately when the situation hasn't resolved after the first
> sleep. Replace congestion_wait with a simple
> schedule_timeout_interruptible because we are not really waiting on IO
> congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for a reclaim that
> makes progress only very slowly. This would be obvious from the OOM
> report, as the number of isolated pages is printed there. If we ever
> hit this, should_reclaim_retry should consider those numbers in the
> evaluation in one way or another.
>
> [1] http://lkml.kernel.org/r/[email protected]
> [2] http://lkml.kernel.org/r/[email protected]
> [3] http://lkml.kernel.org/r/[email protected]
>
> Acked-by: Mel Gorman <[email protected]>
> Tested-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
Let's hope there won't be premature OOMs then.
Acked-by: Vlastimil Babka <[email protected]>
On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
> Johannes and Rik had some concerns that this could lead to premature
> OOM kills. I agree with them that we need a better throttling
> mechanism. Until now we didn't give the issue described above a high
> priority because it usually required a really insane workload to
> trigger. But it seems that the issue can be reproduced also without
> having an insane number of competing threads [3].
My worries stand, but let's fix the real observed bug, and not worry
too much about the theoretical bug for now.
Acked-by: Rik van Riel <[email protected]>
On Mon, Jul 10, 2017 at 09:58:03AM -0400, Rik van Riel wrote:
> On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
>
> > Johannes and Rik had some concerns that this could lead to premature
> > OOM kills. I agree with them that we need a better throttling
> > mechanism. Until now we didn't give the issue described above a high
> > priority because it usually required a really insane workload to
> > trigger. But it seems that the issue can be reproduced also without
> > having an insane number of competing threads [3].
>
> My worries stand, but let's fix the real observed bug, and not worry
> too much about the theoretical bug for now.
>
> Acked-by: Rik van Riel <[email protected]>
I agree with this.
Acked-by: Johannes Weiner <[email protected]>
On Mon 10-07-17 12:58:59, Johannes Weiner wrote:
> On Mon, Jul 10, 2017 at 09:58:03AM -0400, Rik van Riel wrote:
> > On Mon, 2017-07-10 at 09:48 +0200, Michal Hocko wrote:
> >
> > > Johannes and Rik had some concerns that this could lead to premature
> > > OOM kills. I agree with them that we need a better throttling
> > > mechanism. Until now we didn't give the issue described above a high
> > > priority because it usually required a really insane workload to
> > > trigger. But it seems that the issue can be reproduced also without
> > > having an insane number of competing threads [3].
> >
> > My worries stand, but let's fix the real observed bug, and not worry
> > too much about the theoretical bug for now.
> >
> > Acked-by: Rik van Riel <[email protected]>
>
> I agree with this.
>
> Acked-by: Johannes Weiner <[email protected]>
Thanks to both of you. Just to make it clear: I really do want to
address the throttling problem properly in the long term. I do not have
any great ideas to be honest, and I am busy with other things, so it
might be quite some time before I come up with something.
--
Michal Hocko
SUSE Labs
On Mon, 10 Jul 2017 09:48:42 +0200 Michal Hocko <[email protected]> wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get
> stuck in the too_many_isolated loop basically forever because the last
> few pages on the LRU lists are isolated by kswapd, which is stuck on fs
> locks when doing the pageout or slab reclaim. This in turn means that
> there is nobody to actually trigger the OOM killer and the system is
> basically unusable.
>
> too_many_isolated was introduced by commit 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> premature OOM killer invocations, because back then a lack of reclaim
> progress could indeed trigger the OOM killer too early. But since the
> OOM detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection"),
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer, so we can loosen the direct
> reclaim throttling.
>
> Make the shrink_inactive_list loop over too_many_isolated bounded and
> return immediately when the situation hasn't resolved after the first
> sleep. Replace congestion_wait with a simple
> schedule_timeout_interruptible because we are not really waiting on IO
> congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for a reclaim that
> makes progress only very slowly. This would be obvious from the OOM
> report, as the number of isolated pages is printed there. If we ever
> hit this, should_reclaim_retry should consider those numbers in the
> evaluation in one way or another.
Need to figure out which kernels to patch. Maybe just 4.13-rc after a
week or two?
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> int file = is_file_lru(lru);
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> + bool stalled = false;
>
> while (unlikely(too_many_isolated(pgdat, file, sc))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + if (stalled)
> + return 0;
> +
> + /* wait a bit for the reclaimer. */
> + schedule_timeout_interruptible(HZ/10);
a) if this task has signal_pending(), this falls straight through
and I suspect the code breaks?
b) replacing congestion_wait() with schedule_timeout_interruptible()
means this task no longer contributes to load average here and it's
a (slightly) user-visible change.
c) msleep_interruptible() is nicer
d) IOW, methinks we should be using msleep() here?
> + stalled = true;
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
(Gets distracted by the thought that we should do
s/msleep/msleep_uninterruptible/g)
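For reference, here is a minimal kernel-style sketch of the difference behind
points (a)-(d) above. It is illustrative only and not part of the patch; the
function name is made up and the three calls are simply shown side by side:

#include <linux/backing-dev.h>	/* congestion_wait() */
#include <linux/blkdev.h>
#include <linux/delay.h>	/* msleep() */
#include <linux/sched.h>	/* schedule_timeout_interruptible() */

/* Illustrative comparison of the three throttling sleeps discussed above. */
static void throttle_sleep_examples(void)
{
	/*
	 * Old code: uninterruptible wait of up to HZ/10 (100ms) which may
	 * return earlier when the backing device stops being congested;
	 * the sleeper counts towards the load average.
	 */
	congestion_wait(BLK_RW_ASYNC, HZ/10);

	/*
	 * Patch as posted: interruptible sleep, so a task with a pending
	 * signal falls straight through without sleeping at all, and the
	 * task does not show up in the load average while waiting.
	 */
	schedule_timeout_interruptible(HZ/10);

	/*
	 * Andrew's suggestion: plain uninterruptible 100ms sleep; signals
	 * are ignored and the sleeper still contributes to the load average.
	 */
	msleep(100);
}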
On Mon, 10 Jul 2017, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Tetsuo Handa has reported [1][2][3] that direct reclaimers might get
> stuck in the too_many_isolated loop basically forever because the last
> few pages on the LRU lists are isolated by kswapd, which is stuck on fs
> locks when doing the pageout or slab reclaim. This in turn means that
> there is nobody to actually trigger the OOM killer and the system is
> basically unusable.
>
> too_many_isolated was introduced by commit 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> premature OOM killer invocations, because back then a lack of reclaim
> progress could indeed trigger the OOM killer too early. But since the
> OOM detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection"),
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer, so we can loosen the direct
> reclaim throttling.
>
> Make the shrink_inactive_list loop over too_many_isolated bounded and
> return immediately when the situation hasn't resolved after the first
> sleep. Replace congestion_wait with a simple
> schedule_timeout_interruptible because we are not really waiting on IO
> congestion in this path.
>
> Please note that this patch can theoretically cause the OOM killer to
> trigger earlier while there are many pages isolated for a reclaim that
> makes progress only very slowly. This would be obvious from the OOM
> report, as the number of isolated pages is printed there. If we ever
> hit this, should_reclaim_retry should consider those numbers in the
> evaluation in one way or another.
>
> [1] http://lkml.kernel.org/r/[email protected]
> [2] http://lkml.kernel.org/r/[email protected]
> [3] http://lkml.kernel.org/r/[email protected]
>
> Acked-by: Mel Gorman <[email protected]>
> Tested-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> Hi,
> I am resubmitting this patch previously sent here
> http://lkml.kernel.org/r/[email protected].
>
> Johannes and Rik had some concerns that this could lead to premature
> OOM kills. I agree with them that we need a better throttling
> mechanism. Until now we didn't give the issue described above a high
> priority because it usually required a really insane workload to
> trigger. But it seems that the issue can be reproduced also without
> having an insane number of competing threads [3].
>
> Moreover, the issue also triggers very often while testing heavy memory
> pressure and so prevents further hardening work in that area
> (http://lkml.kernel.org/r/[email protected]).
> Tetsuo hasn't seen any negative effect of this patch in his OOM stress
> tests, so I think we should go with this simple patch for now and think
> about something more robust in the long term.
>
> That being said, I suggest merging this (after it spends a full release
> cycle in linux-next) for the time being, until we come up with a more
> clever solution.
>
> mm/vmscan.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c15b2e4c47ca..4ae069060ae5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> int file = is_file_lru(lru);
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> + bool stalled = false;
>
> while (unlikely(too_many_isolated(pgdat, file, sc))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + if (stalled)
> + return 0;
> +
> + /* wait a bit for the reclaimer. */
> + schedule_timeout_interruptible(HZ/10);
> + stalled = true;
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
> --
You probably won't welcome getting into alternatives at this late stage;
but after hacking around it one way or another because of its pointless
lockups, I lost patience with that too_many_isolated() loop a few months
back (on realizing the enormous number of pages that may be isolated via
migrate_pages(2)), and we've been running nicely since with something like:
bool got_mutex = false;
if (unlikely(too_many_isolated(pgdat, file, sc))) {
if (mutex_lock_killable(&pgdat->too_many_isolated))
return SWAP_CLUSTER_MAX;
got_mutex = true;
}
...
if (got_mutex)
mutex_unlock(&pgdat->too_many_isolated);
Using a mutex to provide the intended throttling, without an infinite
loop or an arbitrary delay; and without having to worry (as we often did)
about whether those numbers in too_many_isolated() are really appropriate.
No premature OOMs complained of yet.
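To make the shape of that alternative concrete, here is a rough,
illustrative-only expansion of the preview (the too_many_isolated mutex is a
proposed new field in struct pglist_data, not something in the current tree,
and the body of the function is elided):

static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
		     struct scan_control *sc, enum lru_list lru)
{
	int file = is_file_lru(lru);
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	unsigned long nr_reclaimed = 0;
	bool got_mutex = false;

	if (unlikely(too_many_isolated(pgdat, file, sc))) {
		/*
		 * Sleep until an earlier direct reclaimer drops the mutex;
		 * a fatal signal aborts the wait, so there is no unbounded
		 * loop and no arbitrary timeout to tune.
		 */
		if (mutex_lock_killable(&pgdat->too_many_isolated))
			return SWAP_CLUSTER_MAX;
		got_mutex = true;
	}

	/* ... isolate the pages, shrink_page_list(), putback, stats ... */

	if (got_mutex)
		mutex_unlock(&pgdat->too_many_isolated);

	return nr_reclaimed;
}

The corresponding mutex_init() would have to happen wherever the per-node
pglist_data is set up.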
But that was on a different kernel, and there I did have to make sure
that PF_MEMALLOC always prevented us from nesting: I'm not certain of
that in the current kernel (but do remember Johannes changing the memcg
end to make it use PF_MEMALLOC too). I offer the preview above, to see
if you're interested in that alternative: if you are, then I'll go ahead
and make it into an actual patch against v4.13-rc.
Hugh
On Wed 19-07-17 15:20:14, Andrew Morton wrote:
> On Mon, 10 Jul 2017 09:48:42 +0200 Michal Hocko <[email protected]> wrote:
>
> > From: Michal Hocko <[email protected]>
> >
> > Tetsuo Handa has reported [1][2][3] that direct reclaimers might get
> > stuck in the too_many_isolated loop basically forever because the last
> > few pages on the LRU lists are isolated by kswapd, which is stuck on fs
> > locks when doing the pageout or slab reclaim. This in turn means that
> > there is nobody to actually trigger the OOM killer and the system is
> > basically unusable.
> >
> > too_many_isolated was introduced by commit 35cd78156c49 ("vmscan: throttle
> > direct reclaim when too many pages are isolated already") to prevent
> > premature OOM killer invocations, because back then a lack of reclaim
> > progress could indeed trigger the OOM killer too early. But since the
> > OOM detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection"),
> > the allocation/reclaim retry loop considers all the reclaimable pages
> > and throttles the allocation at that layer, so we can loosen the direct
> > reclaim throttling.
> >
> > Make the shrink_inactive_list loop over too_many_isolated bounded and
> > return immediately when the situation hasn't resolved after the first
> > sleep. Replace congestion_wait with a simple
> > schedule_timeout_interruptible because we are not really waiting on IO
> > congestion in this path.
> >
> > Please note that this patch can theoretically cause the OOM killer to
> > trigger earlier while there are many pages isolated for a reclaim that
> > makes progress only very slowly. This would be obvious from the OOM
> > report, as the number of isolated pages is printed there. If we ever
> > hit this, should_reclaim_retry should consider those numbers in the
> > evaluation in one way or another.
>
> Need to figure out which kernels to patch. Maybe just 4.13-rc after a
> week or two?
I do not think we need to rush it and the next merge window should be
just OK.
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > int file = is_file_lru(lru);
> > struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > + bool stalled = false;
> >
> > while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + if (stalled)
> > + return 0;
> > +
> > + /* wait a bit for the reclaimer. */
> > + schedule_timeout_interruptible(HZ/10);
>
> a) if this task has signal_pending(), this falls straight through
> and I suspect the code breaks?
It will not break. It will return to the allocation path more quickly
but no over-reclaim will happen and it will/should get throttled there.
So nothing critical.
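To illustrate what "throttled there" means, here is a simplified standalone
model of the allocator-side check added by the oom detection rework; this is
not the actual should_reclaim_retry() code, and all names and types below are
illustrative:

#include <stdbool.h>

#define MAX_RECLAIM_RETRIES	16	/* bound on no-progress retry loops */

struct node_state {
	unsigned long free_pages;
	unsigned long reclaimable_pages;	/* LRU plus reclaimable slab */
	unsigned long watermark;
};

/*
 * Keep retrying direct reclaim only while the sum of free and reclaimable
 * pages could still satisfy the watermark, and give up (hand over to the
 * OOM killer) after a bounded number of no-progress iterations.  This is
 * the layer that now absorbs tasks bailing out of the too_many_isolated
 * loop early.
 */
static bool should_retry_reclaim(const struct node_state *node,
				 bool made_progress, int *no_progress_loops)
{
	if (made_progress)
		*no_progress_loops = 0;
	else
		(*no_progress_loops)++;

	if (*no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;

	return node->free_pages + node->reclaimable_pages > node->watermark;
}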
> b) replacing congestion_wait() with schedule_timeout_interruptible()
> means this task no longer contributes to load average here and it's
> a (slightly) user-visible change.
you are right. I am not sure it matters but it might be visible.
> c) msleep_interruptible() is nicer
>
> d) IOW, methinks we should be using msleep() here?
OK, I do not have objections. Are you going to squash this in or want a
separate patch explaining all the above?
--
Michal Hocko
SUSE Labs
Hugh Dickins wrote:
> You probably won't welcome getting into alternatives at this late stage;
> but after hacking around it one way or another because of its pointless
> lockups, I lost patience with that too_many_isolated() loop a few months
> back (on realizing the enormous number of pages that may be isolated via
> migrate_pages(2)), and we've been running nicely since with something like:
>
> bool got_mutex = false;
>
> if (unlikely(too_many_isolated(pgdat, file, sc))) {
> if (mutex_lock_killable(&pgdat->too_many_isolated))
> return SWAP_CLUSTER_MAX;
> got_mutex = true;
> }
> ...
> if (got_mutex)
> mutex_unlock(&pgdat->too_many_isolated);
>
> Using a mutex to provide the intended throttling, without an infinite
> loop or an arbitrary delay; and without having to worry (as we often did)
> about whether those numbers in too_many_isolated() are really appropriate.
> No premature OOMs complained of yet.
Roughly speaking, there is a moment where shrink_inactive_list() acts
like below.
bool got_mutex = false;
if (!current_is_kswapd()) {
if (mutex_lock_killable(&pgdat->too_many_isolated))
return SWAP_CLUSTER_MAX;
got_mutex = true;
}
// kswapd is blocked here waiting for !current_is_kswapd().
if (got_mutex)
mutex_unlock(&pgdat->too_many_isolated);
>
> But that was on a different kernel, and there I did have to make sure
> that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> that in the current kernel (but do remember Johannes changing the memcg
> end to make it use PF_MEMALLOC too). I offer the preview above, to see
> if you're interested in that alternative: if you are, then I'll go ahead
> and make it into an actual patch against v4.13-rc.
I don't know what your actual patch looks like, but the problem is that
pgdat->too_many_isolated waits for kswapd while kswapd waits for
pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated
once we hit it.
On Wed 19-07-17 18:54:40, Hugh Dickins wrote:
[...]
> You probably won't welcome getting into alternatives at this late stage;
> but after hacking around it one way or another because of its pointless
> lockups, I lost patience with that too_many_isolated() loop a few months
> back (on realizing the enormous number of pages that may be isolated via
> migrate_pages(2)), and we've been running nicely since with something like:
>
> bool got_mutex = false;
>
> if (unlikely(too_many_isolated(pgdat, file, sc))) {
> if (mutex_lock_killable(&pgdat->too_many_isolated))
> return SWAP_CLUSTER_MAX;
> got_mutex = true;
> }
> ...
> if (got_mutex)
> mutex_unlock(&pgdat->too_many_isolated);
>
> Using a mutex to provide the intended throttling, without an infinite
> loop or an arbitrary delay; and without having to worry (as we often did)
> about whether those numbers in too_many_isolated() are really appropriate.
> No premature OOMs complained of yet.
>
> But that was on a different kernel, and there I did have to make sure
> that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> that in the current kernel (but do remember Johannes changing the memcg
> end to make it use PF_MEMALLOC too). I offer the preview above, to see
> if you're interested in that alternative: if you are, then I'll go ahead
> and make it into an actual patch against v4.13-rc.
I would rather get rid of any additional locking here; my ultimate
goal is to do the throttling at the page allocator layer rather than
inside the reclaim.
--
Michal Hocko
SUSE Labs
On Thu, 20 Jul 2017 08:56:26 +0200 Michal Hocko <[email protected]> wrote:
>
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > int file = is_file_lru(lru);
> > > struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > > + bool stalled = false;
> > >
> > > while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > + if (stalled)
> > > + return 0;
> > > +
> > > + /* wait a bit for the reclaimer. */
> > > + schedule_timeout_interruptible(HZ/10);
> >
> > a) if this task has signal_pending(), this falls straight through
> > and I suspect the code breaks?
>
> It will not break. It will return to the allocation path more quickly
> but no over-reclaim will happen and it will/should get throttled there.
> So nothing critical.
>
> > b) replacing congestion_wait() with schedule_timeout_interruptible()
> > means this task no longer contributes to load average here and it's
> > a (slightly) user-visible change.
>
> you are right. I am not sure it matters but it might be visible.
>
> > c) msleep_interruptible() is nicer
> >
> > d) IOW, methinks we should be using msleep() here?
>
> OK, I do not have objections. Are you going to squash this in or want a
> separate patch explaining all the above?
I'd prefer to have a comment explaining why interruptible sleep is
being used, because that "what if signal_pending()" case is rather a
red flag.
Is it the case that fall-through-if-signal_pending() is the
*preferred* behaviour? If so, the comment should explain this. If it
isn't the preferred behaviour then using uninterruptible sleep sounds
better to me, if only because it saves us from having to test a rather
tricky and rare case.
On Fri 21-07-17 16:01:04, Andrew Morton wrote:
> On Thu, 20 Jul 2017 08:56:26 +0200 Michal Hocko <[email protected]> wrote:
> >
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1713,9 +1713,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > > int file = is_file_lru(lru);
> > > > struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > > > + bool stalled = false;
> > > >
> > > > while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > + if (stalled)
> > > > + return 0;
> > > > +
> > > > + /* wait a bit for the reclaimer. */
> > > > + schedule_timeout_interruptible(HZ/10);
> > >
> > > a) if this task has signal_pending(), this falls straight through
> > > and I suspect the code breaks?
> >
> > It will not break. It will return to the allocation path more quickly
> > but no over-reclaim will happen and it will/should get throttled there.
> > So nothing critical.
> >
> > > b) replacing congestion_wait() with schedule_timeout_interruptible()
> > > means this task no longer contributes to load average here and it's
> > > a (slightly) user-visible change.
> >
> > you are right. I am not sure it matters but it might be visible.
> >
> > > c) msleep_interruptible() is nicer
> > >
> > > d) IOW, methinks we should be using msleep() here?
> >
> > OK, I do not have objections. Are you going to squash this in or want a
> > separate patch explaining all the above?
>
> I'd prefer to have a comment explaining why interruptible sleep is
> being used, because that "what if signal_pending()" case is rather a
> red flag.
I didn't really consider interruptible vs. uninterruptible sleep, so it
wasn't really a deliberate decision. Now that you have brought up the
above points, I am OK with changing that to uninterruptible.

Here is a fix-up. I am fine with it either folded in or as a separate
patch.
---
From 4b295fc1625d499a2e1283b036aea005158b33cc Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 24 Jul 2017 08:43:23 +0200
Subject: [PATCH] mm, vmscan: throttle too_many_isolated by uninterruptible
sleep
"mm, vmscan: do not loop on too_many_isolated for ever" has added an
interruptible sleep into shrink_inactive_list when we have isolated too
many pages. Andrew has noticed that we used to sleep in uninterruptible
sleep previously (in congestion_wait) and that changing that is wrong
for at least two reasons
- waiting task would not participate in the load average anymore
- task with a pending signal will not sleep and bail out
immediately
While none of those issues are critical in any way but they are
unnecessary. The interruptible sleep was more an oversight than a
deliberate decision. Fix this by using msleep instead.
Spotted-by: Andrew Morton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7ba9467b6d58..ed0c29a3db16 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1749,7 +1749,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
return 0;
/* wait a bit for the reclaimer. */
- schedule_timeout_interruptible(HZ/10);
+ msleep(100);
stalled = true;
/* We are about to die and free our memory. Return now. */
--
2.13.2
--
Michal Hocko
SUSE Labs
On Thu, 20 Jul 2017, Tetsuo Handa wrote:
> Hugh Dickins wrote:
> > You probably won't welcome getting into alternatives at this late stage;
> > but after hacking around it one way or another because of its pointless
> > lockups, I lost patience with that too_many_isolated() loop a few months
> > back (on realizing the enormous number of pages that may be isolated via
> > migrate_pages(2)), and we've been running nicely since with something like:
> >
> > bool got_mutex = false;
> >
> > if (unlikely(too_many_isolated(pgdat, file, sc))) {
> > if (mutex_lock_killable(&pgdat->too_many_isolated))
> > return SWAP_CLUSTER_MAX;
> > got_mutex = true;
> > }
> > ...
> > if (got_mutex)
> > mutex_unlock(&pgdat->too_many_isolated);
> >
> > Using a mutex to provide the intended throttling, without an infinite
> > loop or an arbitrary delay; and without having to worry (as we often did)
> > about whether those numbers in too_many_isolated() are really appropriate.
> > No premature OOMs complained of yet.
>
> Roughly speaking, there is a moment where shrink_inactive_list() acts
> like below.
>
> bool got_mutex = false;
>
> if (!current_is_kswapd()) {
> if (mutex_lock_killable(&pgdat->too_many_isolated))
> return SWAP_CLUSTER_MAX;
> got_mutex = true;
> }
>
> // kswapd is blocked here waiting for !current_is_kswapd().
That would be a shame, for kswapd to wait for !current_is_kswapd()!
But seriously, I think I understand what you mean by that, you're
thinking that kswapd would be waiting on some other task to clear
the too_many_isolated() condition?
No, it does not work that way: kswapd (never seeing too_many_isolated()
because that always says false when current_is_kswapd()) never tries to
take the pgdat->too_many_isolated mutex itself: it does not wait there
at all, although other tasks may be waiting there at the time.
Perhaps my naming the mutex "too_many_isolated", same as the function,
is actually confusing, when I had intended it to be helpful.
>
> if (got_mutex)
> mutex_unlock(&pgdat->too_many_isolated);
>
> >
> > But that was on a different kernel, and there I did have to make sure
> > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > that in the current kernel (but do remember Johannes changing the memcg
> > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > if you're interested in that alternative: if you are, then I'll go ahead
> > and make it into an actual patch against v4.13-rc.
>
> I don't know what your actual patch looks like, but the problem is that
> pgdat->too_many_isolated waits for kswapd while kswapd waits for
> pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated
> once we hit it.
Not so (and we'd hardly be finding it a useful patch if that were so).
Hugh
On Thu, 20 Jul 2017, Michal Hocko wrote:
> On Wed 19-07-17 18:54:40, Hugh Dickins wrote:
> [...]
> > You probably won't welcome getting into alternatives at this late stage;
> > but after hacking around it one way or another because of its pointless
> > lockups, I lost patience with that too_many_isolated() loop a few months
> > back (on realizing the enormous number of pages that may be isolated via
> > migrate_pages(2)), and we've been running nicely since with something like:
> >
> > bool got_mutex = false;
> >
> > if (unlikely(too_many_isolated(pgdat, file, sc))) {
> > if (mutex_lock_killable(&pgdat->too_many_isolated))
> > return SWAP_CLUSTER_MAX;
> > got_mutex = true;
> > }
> > ...
> > if (got_mutex)
> > mutex_unlock(&pgdat->too_many_isolated);
> >
> > Using a mutex to provide the intended throttling, without an infinite
> > loop or an arbitrary delay; and without having to worry (as we often did)
> > about whether those numbers in too_many_isolated() are really appropriate.
> > No premature OOMs complained of yet.
> >
> > But that was on a different kernel, and there I did have to make sure
> > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > that in the current kernel (but do remember Johannes changing the memcg
> > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > if you're interested in that alternative: if you are, then I'll go ahead
> > and make it into an actual patch against v4.13-rc.
>
> I would rather get rid of any additional locking here; my ultimate
> goal is to do the throttling at the page allocator layer rather than
> inside the reclaim.
Fair enough, I'm certainly in no hurry to send the patch,
but thought it worth mentioning.
Hugh
Hugh Dickins wrote:
> On Thu, 20 Jul 2017, Tetsuo Handa wrote:
> > Hugh Dickins wrote:
> > > You probably won't welcome getting into alternatives at this late stage;
> > > but after hacking around it one way or another because of its pointless
> > > lockups, I lost patience with that too_many_isolated() loop a few months
> > > back (on realizing the enormous number of pages that may be isolated via
> > > migrate_pages(2)), and we've been running nicely since with something like:
> > >
> > > bool got_mutex = false;
> > >
> > > if (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > if (mutex_lock_killable(&pgdat->too_many_isolated))
> > > return SWAP_CLUSTER_MAX;
> > > got_mutex = true;
> > > }
> > > ...
> > > if (got_mutex)
> > > mutex_unlock(&pgdat->too_many_isolated);
> > >
> > > Using a mutex to provide the intended throttling, without an infinite
> > > loop or an arbitrary delay; and without having to worry (as we often did)
> > > about whether those numbers in too_many_isolated() are really appropriate.
> > > No premature OOMs complained of yet.
> >
> > Roughly speaking, there is a moment where shrink_inactive_list() acts
> > like below.
> >
> > bool got_mutex = false;
> >
> > if (!current_is_kswapd()) {
> > if (mutex_lock_killable(&pgdat->too_many_isolated))
> > return SWAP_CLUSTER_MAX;
> > got_mutex = true;
> > }
> >
> > // kswapd is blocked here waiting for !current_is_kswapd().
>
> That would be a shame, for kswapd to wait for !current_is_kswapd()!
Yes, but the current code (independent of your patch) does allow kswapd to
wait for memory allocations of !current_is_kswapd() threads to complete.
>
> But seriously, I think I understand what you mean by that, you're
> thinking that kswapd would be waiting on some other task to clear
> the too_many_isolated() condition?
Yes.
>
> No, it does not work that way: kswapd (never seeing too_many_isolated()
> because that always says false when current_is_kswapd()) never tries to
> take the pgdat->too_many_isolated mutex itself: it does not wait there
> at all, although other tasks may be waiting there at the time.
I know. I described the behavior of your patch assuming my guess (that
your "..." part corresponds to kswapd doing writepage) is correct.
>
> Perhaps my naming the mutex "too_many_isolated", same as the function,
> is actually confusing, when I had intended it to be helpful.
Not confusing at all. It is helpful.
I just wanted to confirm what comes in your "..." part.
>
> >
> > if (got_mutex)
> > mutex_unlock(&pgdat->too_many_isolated);
> >
> > >
> > > But that was on a different kernel, and there I did have to make sure
> > > that PF_MEMALLOC always prevented us from nesting: I'm not certain of
> > > that in the current kernel (but do remember Johannes changing the memcg
> > > end to make it use PF_MEMALLOC too). I offer the preview above, to see
> > > if you're interested in that alternative: if you are, then I'll go ahead
> > > and make it into an actual patch against v4.13-rc.
> >
> > I don't know what your actual patch looks like, but the problem is that
> > pgdat->too_many_isolated waits for kswapd while kswapd waits for
> > pgdat->too_many_isolated; nobody can unlock pgdat->too_many_isolated
> > once we hit it.
>
> Not so (and we'd hardly be finding it a useful patch if that were so).
The current code allows kswapd to wait for memory allocations of
!current_is_kswapd() threads, and thus !current_is_kswapd() threads wait
for current_is_kswapd() threads while current_is_kswapd() threads wait
for !current_is_kswapd() threads; nobody can make too_many_isolated()
false once we hit it. Hence this patch is proposed.
Thanks.