2015-11-11 13:48:33

by Michal Hocko

[permalink] [raw]
Subject: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

From: Michal Hocko <[email protected]>

__GFP_NOFAIL is a big hammer used to ensure that the allocation
request can never fail. This is a strong requirement and as such
it also deserves a special treatment when the system is OOM. The
primary problem here is that the allocation request might have
come with some locks held and the oom victim might be blocked
on the same locks. This is basically an OOM deadlock situation.

This patch tries to reduce the risk of such a deadlocks by giving
__GFP_NOFAIL allocations a special treatment and let them dive into
memory reserves after oom killer invocation. This should help them
to make a progress and release resources they are holding. The OOM
victim should compensate for the reserves consumption.

Signed-off-by: Michal Hocko <[email protected]>
---

Hi,
this has been posted previously as a part of larger GFP_NOFS related
patch set (http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko%40kernel.org)
but Andrea was asking basically the same thing at LSF early this year
(I cannot seem to find it in any public archive though). I think the
patch makes some sense on its own.

Comments?

mm/page_alloc.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8034909faad2..d30bce9d7ac8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2766,8 +2766,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
goto out;
}
/* Exhausted what can be done so it's blamo time */
- if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+ if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
*did_some_progress = 1;
+
+ if (gfp_mask & __GFP_NOFAIL) {
+ page = get_page_from_freelist(gfp_mask, order,
+ ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+ WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
+ " Consider increasing min_free_kbytes.\n");
+ }
+ }
out:
mutex_unlock(&oom_lock);
return page;
--
2.6.2


2015-11-11 15:55:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Wed, Nov 11, 2015 at 02:48:17PM +0100, [email protected] wrote:
> From: Michal Hocko <[email protected]>
>
> __GFP_NOFAIL is a big hammer used to ensure that the allocation
> request can never fail. This is a strong requirement and as such
> it also deserves a special treatment when the system is OOM. The
> primary problem here is that the allocation request might have
> come with some locks held and the oom victim might be blocked
> on the same locks. This is basically an OOM deadlock situation.
>
> This patch tries to reduce the risk of such a deadlocks by giving
> __GFP_NOFAIL allocations a special treatment and let them dive into
> memory reserves after oom killer invocation. This should help them
> to make a progress and release resources they are holding. The OOM
> victim should compensate for the reserves consumption.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
>
> Hi,
> this has been posted previously as a part of larger GFP_NOFS related
> patch set (http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko%40kernel.org)
> but Andrea was asking basically the same thing at LSF early this year
> (I cannot seem to find it in any public archive though). I think the
> patch makes some sense on its own.

I sent this right after LSF based on Andrea's suggestion:

https://lkml.org/lkml/2015/3/25/37

2015-11-12 08:51:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Wed 11-11-15 10:54:46, Johannes Weiner wrote:
> On Wed, Nov 11, 2015 at 02:48:17PM +0100, [email protected] wrote:
> > From: Michal Hocko <[email protected]>
> >
> > __GFP_NOFAIL is a big hammer used to ensure that the allocation
> > request can never fail. This is a strong requirement and as such
> > it also deserves a special treatment when the system is OOM. The
> > primary problem here is that the allocation request might have
> > come with some locks held and the oom victim might be blocked
> > on the same locks. This is basically an OOM deadlock situation.
> >
> > This patch tries to reduce the risk of such a deadlocks by giving
> > __GFP_NOFAIL allocations a special treatment and let them dive into
> > memory reserves after oom killer invocation. This should help them
> > to make a progress and release resources they are holding. The OOM
> > victim should compensate for the reserves consumption.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> >
> > Hi,
> > this has been posted previously as a part of larger GFP_NOFS related
> > patch set (http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko%40kernel.org)
> > but Andrea was asking basically the same thing at LSF early this year
> > (I cannot seem to find it in any public archive though). I think the
> > patch makes some sense on its own.
>
> I sent this right after LSF based on Andrea's suggestion:
>
> https://lkml.org/lkml/2015/3/25/37

Ohh, I completely forgot as it was part of a larger series.
Thanks for the pointer.

--
Michal Hocko
SUSE Labs

2015-11-22 12:55:35

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On 11.11.2015 14:48, [email protected] wrote:
> mm/page_alloc.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8034909faad2..d30bce9d7ac8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2766,8 +2766,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> goto out;
> }
> /* Exhausted what can be done so it's blamo time */
> - if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> + if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> *did_some_progress = 1;
> +
> + if (gfp_mask & __GFP_NOFAIL) {
> + page = get_page_from_freelist(gfp_mask, order,
> + ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> + WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
> + " Consider increasing min_free_kbytes.\n");

It seems redundant to me to keep the WARN_ON_ONCE also above in the if () part?
Also s/gfp_nofail/GFP_NOFAIL/ for consistency?

Hm and probably out of scope of your patch, but I understand the WARN_ONCE
(WARN_ON_ONCE) to be _ONCE just to prevent a flood from a single task looping
here. But for distinct tasks and potentially far away in time, wouldn't we want
to see all the warnings? Would that be feasible to implement?

> + }
> + }
> out:
> mutex_unlock(&oom_lock);
> return page;
>

2015-11-23 09:29:38

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Sun 22-11-15 13:55:31, Vlastimil Babka wrote:
> On 11.11.2015 14:48, [email protected] wrote:
> > mm/page_alloc.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8034909faad2..d30bce9d7ac8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2766,8 +2766,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > goto out;
> > }
> > /* Exhausted what can be done so it's blamo time */
> > - if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > + if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> > *did_some_progress = 1;
> > +
> > + if (gfp_mask & __GFP_NOFAIL) {
> > + page = get_page_from_freelist(gfp_mask, order,
> > + ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> > + WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
> > + " Consider increasing min_free_kbytes.\n");
>
> It seems redundant to me to keep the WARN_ON_ONCE also above in the if () part?

They are warning about two different things. The first one catches a
buggy code which uses __GFP_NOFAIL from oom disabled context while the
second one tries to help the administrator with a hint that memory
reserves are too small.

> Also s/gfp_nofail/GFP_NOFAIL/ for consistency?

Fair enough, changed.

> Hm and probably out of scope of your patch, but I understand the WARN_ONCE
> (WARN_ON_ONCE) to be _ONCE just to prevent a flood from a single task looping
> here. But for distinct tasks and potentially far away in time, wouldn't we want
> to see all the warnings? Would that be feasible to implement?

I was thinking about that as well some time ago but it was quite
hard to find a good enough API to tell when to warn again. The first
WARN_ON_ONCE should trigger for all different _code paths_ no matter
how frequently they appear to catch all the buggy callers. The second
one would benefit from a new warning after min_free_kbytes was updated
because it would tell the administrator that the last update was not
sufficient for the workload.

>
> > + }
> > + }
> > out:
> > mutex_unlock(&oom_lock);
> > return page;
> >

Thanks!
--
Michal Hocko
SUSE Labs

2015-11-23 09:43:45

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On 11/23/2015 10:29 AM, Michal Hocko wrote:
> On Sun 22-11-15 13:55:31, Vlastimil Babka wrote:
>> On 11.11.2015 14:48, [email protected] wrote:
>>> mm/page_alloc.c | 10 +++++++++-
>>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 8034909faad2..d30bce9d7ac8 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2766,8 +2766,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>>> goto out;
>>> }
>>> /* Exhausted what can be done so it's blamo time */
>>> - if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
>>> + if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
>>> *did_some_progress = 1;
>>> +
>>> + if (gfp_mask & __GFP_NOFAIL) {
>>> + page = get_page_from_freelist(gfp_mask, order,
>>> + ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
>>> + WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
>>> + " Consider increasing min_free_kbytes.\n");
>>
>> It seems redundant to me to keep the WARN_ON_ONCE also above in the if () part?
>
> They are warning about two different things. The first one catches a
> buggy code which uses __GFP_NOFAIL from oom disabled context while the

Ah, I see, I misinterpreted what the return values of out_of_memory()
mean. But now that I look at its code, it seems to only return false
when oom_killer_disabled is set to true. Which is a global thing and
nothing to do with the context of the __GFP_NOFAIL allocation?

> second one tries to help the administrator with a hint that memory
> reserves are too small.
>
>> Also s/gfp_nofail/GFP_NOFAIL/ for consistency?
>
> Fair enough, changed.
>
>> Hm and probably out of scope of your patch, but I understand the WARN_ONCE
>> (WARN_ON_ONCE) to be _ONCE just to prevent a flood from a single task looping
>> here. But for distinct tasks and potentially far away in time, wouldn't we want
>> to see all the warnings? Would that be feasible to implement?
>
> I was thinking about that as well some time ago but it was quite
> hard to find a good enough API to tell when to warn again. The first
> WARN_ON_ONCE should trigger for all different _code paths_ no matter
> how frequently they appear to catch all the buggy callers. The second
> one would benefit from a new warning after min_free_kbytes was updated
> because it would tell the administrator that the last update was not
> sufficient for the workload.

Hm, what about adding a flag to the struct alloc_context, so that when
the particular allocation attempt emits the warning, it sets a flag in
the alloc_context so that it won't emit them again as long as it keeps
looping and attempting oom. Other allocations will warn independently.

We could also print the same info as the "allocation failed" warnings
do, since it's very similar, except we can't fail - but the admin/bug
reporter should be interested in the same details as for an allocation
failure that is allowed to fail. But it's also true that we have
probably just printed the info during out_of_memory()... except when we
skipped that for some reason?

>>
>>> + }
>>> + }
>>> out:
>>> mutex_unlock(&oom_lock);
>>> return page;
>>>
>
> Thanks!
>

2015-11-23 10:13:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Mon 23-11-15 10:43:42, Vlastimil Babka wrote:
> On 11/23/2015 10:29 AM, Michal Hocko wrote:
> >On Sun 22-11-15 13:55:31, Vlastimil Babka wrote:
> >>On 11.11.2015 14:48, [email protected] wrote:
> >>> mm/page_alloc.c | 10 +++++++++-
> >>> 1 file changed, 9 insertions(+), 1 deletion(-)
> >>>
> >>>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>index 8034909faad2..d30bce9d7ac8 100644
> >>>--- a/mm/page_alloc.c
> >>>+++ b/mm/page_alloc.c
> >>>@@ -2766,8 +2766,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >>> goto out;
> >>> }
> >>> /* Exhausted what can be done so it's blamo time */
> >>>- if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> >>>+ if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> >>> *did_some_progress = 1;
> >>>+
> >>>+ if (gfp_mask & __GFP_NOFAIL) {
> >>>+ page = get_page_from_freelist(gfp_mask, order,
> >>>+ ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> >>>+ WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
> >>>+ " Consider increasing min_free_kbytes.\n");
> >>
> >>It seems redundant to me to keep the WARN_ON_ONCE also above in the if () part?
> >
> >They are warning about two different things. The first one catches a
> >buggy code which uses __GFP_NOFAIL from oom disabled context while the
>
> Ah, I see, I misinterpreted what the return values of out_of_memory() mean.
> But now that I look at its code, it seems to only return false when
> oom_killer_disabled is set to true. Which is a global thing and nothing to
> do with the context of the __GFP_NOFAIL allocation?

I am not sure I follow you here. The point of the warning is to warn
when the oom killer is disbaled (out_of_memory returns false) _and_ the
request is __GFP_NOFAIL because we simply cannot guarantee any forward
progress and just a use of the allocation flag is not supproted.

[...]
> >>Hm and probably out of scope of your patch, but I understand the WARN_ONCE
> >>(WARN_ON_ONCE) to be _ONCE just to prevent a flood from a single task looping
> >>here. But for distinct tasks and potentially far away in time, wouldn't we want
> >>to see all the warnings? Would that be feasible to implement?
> >
> >I was thinking about that as well some time ago but it was quite
> >hard to find a good enough API to tell when to warn again. The first
> >WARN_ON_ONCE should trigger for all different _code paths_ no matter
> >how frequently they appear to catch all the buggy callers. The second
> >one would benefit from a new warning after min_free_kbytes was updated
> >because it would tell the administrator that the last update was not
> >sufficient for the workload.
>
> Hm, what about adding a flag to the struct alloc_context, so that when the
> particular allocation attempt emits the warning, it sets a flag in the
> alloc_context so that it won't emit them again as long as it keeps looping
> and attempting oom. Other allocations will warn independently.

That could still trigger a flood of messages. Say you have many
concurrent users from the same call path...

I am not really sure making the code more complicating for this warning
is really worth it. If anything we can use ratelimited variant.

> We could also print the same info as the "allocation failed" warnings do,
> since it's very similar, except we can't fail - but the admin/bug reporter
> should be interested in the same details as for an allocation failure that
> is allowed to fail. But it's also true that we have probably just printed
> the info during out_of_memory()... except when we skipped that for some
> reason?

The first WARN_ON_ONCE happens when OOM killer doesn't trigger so a
memory situation might be worth considering. The later one might have
seen the OOM report which is the likely case. So if anyting the first
one should dump the info.

--
Michal Hocko
SUSE Labs

2015-11-23 21:26:53

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Mon, 23 Nov 2015, Michal Hocko wrote:

> > >>>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > >>>index 8034909faad2..d30bce9d7ac8 100644
> > >>>--- a/mm/page_alloc.c
> > >>>+++ b/mm/page_alloc.c
> > >>>@@ -2766,8 +2766,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >>> goto out;
> > >>> }
> > >>> /* Exhausted what can be done so it's blamo time */
> > >>>- if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > >>>+ if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> > >>> *did_some_progress = 1;
> > >>>+
> > >>>+ if (gfp_mask & __GFP_NOFAIL) {
> > >>>+ page = get_page_from_freelist(gfp_mask, order,
> > >>>+ ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> > >>>+ WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
> > >>>+ " Consider increasing min_free_kbytes.\n");
> > >>
> > >>It seems redundant to me to keep the WARN_ON_ONCE also above in the if () part?
> > >
> > >They are warning about two different things. The first one catches a
> > >buggy code which uses __GFP_NOFAIL from oom disabled context while the
> >
> > Ah, I see, I misinterpreted what the return values of out_of_memory() mean.
> > But now that I look at its code, it seems to only return false when
> > oom_killer_disabled is set to true. Which is a global thing and nothing to
> > do with the context of the __GFP_NOFAIL allocation?
>
> I am not sure I follow you here. The point of the warning is to warn
> when the oom killer is disbaled (out_of_memory returns false) _and_ the
> request is __GFP_NOFAIL because we simply cannot guarantee any forward
> progress and just a use of the allocation flag is not supproted.
>

I don't think the WARN_ONCE() above is helpful for a few reasons:

- it suggests that min_free_kbytes is the best way to work around such
issues and gives kernel developers a free pass to just say "raise
min_free_kbytes" rather than reducing their reliance on __GFP_NOFAIL,

- raising min_free_kbytes is not immediately actionable without memory
freeing to fix any oom issue, and

- it relies on the earlier warning to dump the state of memory and
doesn't add any significant information to help understand how seperate
occurrences are similar or different.

I think the WARN_ONCE() should just be removed.

2015-11-24 09:47:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Mon 23-11-15 13:26:49, David Rientjes wrote:
> On Mon, 23 Nov 2015, Michal Hocko wrote:
[...]
> > I am not sure I follow you here. The point of the warning is to warn
> > when the oom killer is disbaled (out_of_memory returns false) _and_ the
> > request is __GFP_NOFAIL because we simply cannot guarantee any forward
> > progress and just a use of the allocation flag is not supproted.
> >
>
> I don't think the WARN_ONCE() above is helpful for a few reasons:
>
> - it suggests that min_free_kbytes is the best way to work around such
> issues and gives kernel developers a free pass to just say "raise
> min_free_kbytes" rather than reducing their reliance on __GFP_NOFAIL,

I disagree. Users are quite sensitive to warnings with backtraces in the
log from my experience and they report them. And while the log shows the
code path which triggers the issue which can help us to change the code
it also gives a useful hint on how to reduce this issue until we are
able to either fix a bug or a permanent configuration if we are not able
to get rid of it for whatever reason.

Besides that there is no other reliable warning that we are getting
_really_ short on memory unlike when the allocation failure is
allowed. OOM killer report might be missing because there was no actual
killing happening.

> - raising min_free_kbytes is not immediately actionable without memory
> freeing to fix any oom issue, and

true but it can be done to reduce chances for the issue to reappear.

> - it relies on the earlier warning to dump the state of memory and
> doesn't add any significant information to help understand how seperate
> occurrences are similar or different.

The information is quite valuable even without OOM killer report IMHO.

> I think the WARN_ONCE() should just be removed.

I do not insist on keeping it but I really think it might be useful
while it doesn't seem to cause any confusion IMHO. So unless there is a
strong reason to not include it I would prefer keeping it.
--
Michal Hocko
SUSE Labs

2015-11-24 16:26:16

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Tue, Nov 24, 2015 at 10:47:09AM +0100, Michal Hocko wrote:
> Besides that there is no other reliable warning that we are getting
> _really_ short on memory unlike when the allocation failure is
> allowed. OOM killer report might be missing because there was no actual
> killing happening.

This is why I would like to see that warning generalized, and not just
for __GFP_NOFAIL. We have allocations other than explicit __GFP_NOFAIL
that will loop forever in the allocator, and when this deadlocks the
machine all we see is other tasks hanging, but not the culprit. If we
were to get a backtrace of some task in the allocator that is known to
hold locks, suddenly all the other hung tasks will make sense, and it
will clearly distinguish such an allocator deadlock from other issues.

Do you remember the patch you proposed at LSF about failing requests
after looping a certain (configurable) number of times? Well, instead
of failing them, it would be good to start WARNING after a certain #
of loops when we know we won't quit (implicit or explicit NOFAIL).

[ Kind of like fs/xfs/kmem::kmem_alloc() does, only that that is
currently dead code due to our looping inside the allocator. ]

2015-11-24 17:02:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Tue 24-11-15 11:26:04, Johannes Weiner wrote:
> On Tue, Nov 24, 2015 at 10:47:09AM +0100, Michal Hocko wrote:
> > Besides that there is no other reliable warning that we are getting
> > _really_ short on memory unlike when the allocation failure is
> > allowed. OOM killer report might be missing because there was no actual
> > killing happening.
>
> This is why I would like to see that warning generalized, and not just
> for __GFP_NOFAIL. We have allocations other than explicit __GFP_NOFAIL
> that will loop forever in the allocator,

Yes but does it make sense to warn for all of them? Wouldn't it be
sufficient to warn about those which cannot allocate anything even
though they are doing ALLOC_NO_WATERMARKS? We could still hint the
administrator to increase min_free_kbytes for his particular workload.
Such a situation should be really exceptional and warn_once should be
sufficient.

> and when this deadlocks the
> machine all we see is other tasks hanging, but not the culprit. If we
> were to get a backtrace of some task in the allocator that is known to
> hold locks, suddenly all the other hung tasks will make sense, and it
> will clearly distinguish such an allocator deadlock from other issues.

Tetsuo was suggesting a more sophisticated infrastructure for tracking
allocations [1] which take too long without making progress. I haven't
seen his patch because I was too busy with other stuff but maybe this is
what you would like to see?

Anyway I would like to make some progress on this patch. Do you think
that it would be acceptable in the current form without the warning or
you preffer a different way?

[1] http://lkml.kernel.org/r/201510182105.AGA00839.FHVFFStLQOMOOJ%40I-love.SAKURA.ne.jp
--
Michal Hocko
SUSE Labs

2015-11-24 19:57:23

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Tue, Nov 24, 2015 at 06:02:39PM +0100, Michal Hocko wrote:
> On Tue 24-11-15 11:26:04, Johannes Weiner wrote:
> > On Tue, Nov 24, 2015 at 10:47:09AM +0100, Michal Hocko wrote:
> > > Besides that there is no other reliable warning that we are getting
> > > _really_ short on memory unlike when the allocation failure is
> > > allowed. OOM killer report might be missing because there was no actual
> > > killing happening.
> >
> > This is why I would like to see that warning generalized, and not just
> > for __GFP_NOFAIL. We have allocations other than explicit __GFP_NOFAIL
> > that will loop forever in the allocator,
>
> Yes but does it make sense to warn for all of them? Wouldn't it be
> sufficient to warn about those which cannot allocate anything even
> though they are doing ALLOC_NO_WATERMARKS?

Why is it important whether they can do ALLOC_NO_WATERMARKS or not?

I'm worried about all those that can loop forever with locks held.

> > and when this deadlocks the
> > machine all we see is other tasks hanging, but not the culprit. If we
> > were to get a backtrace of some task in the allocator that is known to
> > hold locks, suddenly all the other hung tasks will make sense, and it
> > will clearly distinguish such an allocator deadlock from other issues.
>
> Tetsuo was suggesting a more sophisticated infrastructure for tracking
> allocations [1] which take too long without making progress. I haven't
> seen his patch because I was too busy with other stuff but maybe this is
> what you would like to see?

That seems a bit excessive. I was thinking something more like this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 05ef7fb..fbfc581 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3004,6 +3004,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
enum migrate_mode migration_mode = MIGRATE_ASYNC;
bool deferred_compaction = false;
int contended_compaction = COMPACT_CONTENDED_NONE;
+ unsigned int nr_tries = 0;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3033,6 +3034,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto nopage;

retry:
+ if (++nr_retries % 100 == 0)
+ warn_alloc_failed(gfp_mask, order, "Potential GFP deadlock\n");
+
if (gfp_mask & __GFP_KSWAPD_RECLAIM)
wake_all_kswapds(order, ac);

> Anyway I would like to make some progress on this patch. Do you think
> that it would be acceptable in the current form without the warning or
> you preffer a different way?

Oh, I have nothing against your patch, please go ahead with it. I just
wondered out loud when you proposed a warning about deadlocking NOFAIL
allocations but limited it to explicit __GFP_NOFAIL allocations, when
those obviously aren't the only ones that can deadlock in that way.

2015-11-25 09:33:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

On Tue 24-11-15 14:57:10, Johannes Weiner wrote:
> On Tue, Nov 24, 2015 at 06:02:39PM +0100, Michal Hocko wrote:
> > On Tue 24-11-15 11:26:04, Johannes Weiner wrote:
> > > On Tue, Nov 24, 2015 at 10:47:09AM +0100, Michal Hocko wrote:
> > > > Besides that there is no other reliable warning that we are getting
> > > > _really_ short on memory unlike when the allocation failure is
> > > > allowed. OOM killer report might be missing because there was no actual
> > > > killing happening.
> > >
> > > This is why I would like to see that warning generalized, and not just
> > > for __GFP_NOFAIL. We have allocations other than explicit __GFP_NOFAIL
> > > that will loop forever in the allocator,
> >
> > Yes but does it make sense to warn for all of them? Wouldn't it be
> > sufficient to warn about those which cannot allocate anything even
> > though they are doing ALLOC_NO_WATERMARKS?
>
> Why is it important whether they can do ALLOC_NO_WATERMARKS or not?

Well, the idea was that ALLOC_NO_WATERMARKS failures mean that memory
reserves are not sufficient to handle given workload. min_free_kbytes
is auto-tuned and it might be not sufficient - especially now that we
are adding a new class of consumers of the reserves. I find a warning
as an appropriate way to tell administrator that the auto-tuning was
too optimistic for the particular load.

> I'm worried about all those that can loop forever with locks held.

I can see your point here but I think that looping endlessly without
any progress is a different class of issue. It is hard to tune for it.
You can change tunning and still can end up looping because the lock
vs. reclaim dependencies will be still there.

> > > and when this deadlocks the
> > > machine all we see is other tasks hanging, but not the culprit. If we
> > > were to get a backtrace of some task in the allocator that is known to
> > > hold locks, suddenly all the other hung tasks will make sense, and it
> > > will clearly distinguish such an allocator deadlock from other issues.
> >
> > Tetsuo was suggesting a more sophisticated infrastructure for tracking
> > allocations [1] which take too long without making progress. I haven't
> > seen his patch because I was too busy with other stuff but maybe this is
> > what you would like to see?
>
> That seems a bit excessive. I was thinking something more like this:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 05ef7fb..fbfc581 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3004,6 +3004,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> enum migrate_mode migration_mode = MIGRATE_ASYNC;
> bool deferred_compaction = false;
> int contended_compaction = COMPACT_CONTENDED_NONE;
> + unsigned int nr_tries = 0;
>
> /*
> * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3033,6 +3034,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> goto nopage;
>
> retry:
> + if (++nr_retries % 100 == 0)
> + warn_alloc_failed(gfp_mask, order, "Potential GFP deadlock\n");
> +

I am not against this in principle. It might be too noisy but
warn_alloc_failed is throttled already so it should handle too many
parallel requests already. I still think that ALLOC_NO_WATERMARKS
failures deserve a special treatment because we can tune for that.
Care to send a patch?

> if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> wake_all_kswapds(order, ac);
>
> > Anyway I would like to make some progress on this patch. Do you think
> > that it would be acceptable in the current form without the warning or
> > you preffer a different way?
>
> Oh, I have nothing against your patch, please go ahead with it. I just
> wondered out loud when you proposed a warning about deadlocking NOFAIL
> allocations but limited it to explicit __GFP_NOFAIL allocations, when
> those obviously aren't the only ones that can deadlock in that way.

As mentioned above I have added it merely because this would be a new
consumer of the reserves which might lead to its depletion without
an explicit warning. But the more I think about that the more I am
convinced that this should be generalized to ALLOC_NO_WATERMARKS.

I will remove the warning from the patch and prepare a separate one
which will warn ALLOC_NO_WATERMARKS so that we can discuss that
separately.

Thanks

--
Michal Hocko
SUSE Labs