Commit b9921ecdee66 ("mm: add a helper function to check may oom
condition") was added because the gfp criteria for oom killing were
checked in both the page allocator and memcg.
That was true for about nine months, but then commit 0029e19ebf84 ("mm:
memcontrol: remove explicit OOM parameter in charge path") removed the
memcg usecase.
Fold the implementation into its only caller.
Signed-off-by: David Rientjes <[email protected]>
---
include/linux/oom.h | 5 -----
mm/page_alloc.c | 2 +-
2 files changed, 1 insertion(+), 6 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -85,11 +85,6 @@ static inline void oom_killer_enable(void)
oom_killer_disabled = false;
}
-static inline bool oom_gfp_allowed(gfp_t gfp_mask)
-{
- return (gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY);
-}
-
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
/* sysctls */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2706,7 +2706,7 @@ rebalance:
* running out of options and have to consider going OOM
*/
if (!did_some_progress) {
- if (oom_gfp_allowed(gfp_mask)) {
+ if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
if (oom_killer_disabled)
goto nopage;
/* Coredumps can quickly deplete all memory reserves */
On Wed 26-11-14 14:17:32, David Rientjes wrote:
> Commit b9921ecdee66 ("mm: add a helper function to check may oom
> condition") was added because the gfp criteria for oom killing was
> checked in both the page allocator and memcg.
>
> That was true for about nine months, but then commit 0029e19ebf84 ("mm:
> memcontrol: remove explicit OOM parameter in charge path") removed the
> memcg usecase.
>
> Fold the implementation into its only caller.
I don't care much whether the check is open-coded or hidden behind the
helper, but I would really appreciate a comment explaining why we care
about these two particular gfp flags. The code has been like that for
ages - excavation work leads us back to 2002 and 2003, respectively.
Let's save other people's time and not repeat the same exercise again.
What about a comment like the following?
> Signed-off-by: David Rientjes <[email protected]>
> ---
> include/linux/oom.h | 5 -----
> mm/page_alloc.c | 2 +-
> 2 files changed, 1 insertion(+), 6 deletions(-)
>
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2706,7 +2706,7 @@ rebalance:
> * running out of options and have to consider going OOM
> */
> if (!did_some_progress) {
> - if (oom_gfp_allowed(gfp_mask)) {
/*
* Do not attempt to trigger OOM killer for !__GFP_FS
* allocations because it would be premature to kill
* anything just because the reclaim is stuck on
* dirty/writeback pages.
* __GFP_NORETRY allocations might fail and so the OOM
* would be more harmful than useful.
*/
> + if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> if (oom_killer_disabled)
> goto nopage;
> /* Coredumps can quickly deplete all memory reserves */
--
Michal Hocko
SUSE Labs
On Wed, Nov 26, 2014 at 02:17:32PM -0800, David Rientjes wrote:
> Commit b9921ecdee66 ("mm: add a helper function to check may oom
> condition") was added because the gfp criteria for oom killing was
> checked in both the page allocator and memcg.
>
> That was true for about nine months, but then commit 0029e19ebf84 ("mm:
> memcontrol: remove explicit OOM parameter in charge path") removed the
> memcg usecase.
>
> Fold the implementation into its only caller.
>
> Signed-off-by: David Rientjes <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Thu, Nov 27, 2014 at 11:25:47AM +0100, Michal Hocko wrote:
> On Wed 26-11-14 14:17:32, David Rientjes wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2706,7 +2706,7 @@ rebalance:
> > * running out of options and have to consider going OOM
> > */
> > if (!did_some_progress) {
> > - if (oom_gfp_allowed(gfp_mask)) {
> /*
> * Do not attempt to trigger OOM killer for !__GFP_FS
> * allocations because it would be premature to kill
> * anything just because the reclaim is stuck on
> * dirty/writeback pages.
> * __GFP_NORETRY allocations might fail and so the OOM
> * would be more harmful than useful.
> */
I don't think we need to explain the individual flags, but it would
indeed be useful to remark here that we shouldn't OOM kill from
allocation contexts with (severely) limited reclaim abilities.
On Mon 01-12-14 18:30:40, Johannes Weiner wrote:
> On Thu, Nov 27, 2014 at 11:25:47AM +0100, Michal Hocko wrote:
> > On Wed 26-11-14 14:17:32, David Rientjes wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2706,7 +2706,7 @@ rebalance:
> > > * running out of options and have to consider going OOM
> > > */
> > > if (!did_some_progress) {
> > > - if (oom_gfp_allowed(gfp_mask)) {
> > /*
> > * Do not attempt to trigger OOM killer for !__GFP_FS
> > * allocations because it would be premature to kill
> > * anything just because the reclaim is stuck on
> > * dirty/writeback pages.
> > * __GFP_NORETRY allocations might fail and so the OOM
> > * would be more harmful than useful.
> > */
>
> I don't think we need to explain the individual flags, but it would
> indeed be useful to remark here that we shouldn't OOM kill from
> allocations contexts with (severely) limited reclaim abilities.
Is __GFP_NORETRY really related to limited reclaim abilities? I thought
it was merely a way to tell the allocator to fail rather than spend too
much time reclaiming. If you are referring to the __GFP_FS part then I
have no objections to being less specific, of course, but __GFP_IO would
fall into the same category, yet we are not checking for it. I have no
idea why we consider the former and not the latter, to be honest...
--
Michal Hocko
SUSE Labs
On Wed, Dec 03, 2014 at 04:52:22PM +0100, Michal Hocko wrote:
> On Mon 01-12-14 18:30:40, Johannes Weiner wrote:
> > On Thu, Nov 27, 2014 at 11:25:47AM +0100, Michal Hocko wrote:
> > > On Wed 26-11-14 14:17:32, David Rientjes wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2706,7 +2706,7 @@ rebalance:
> > > > * running out of options and have to consider going OOM
> > > > */
> > > > if (!did_some_progress) {
> > > > - if (oom_gfp_allowed(gfp_mask)) {
> > > /*
> > > * Do not attempt to trigger OOM killer for !__GFP_FS
> > > * allocations because it would be premature to kill
> > > * anything just because the reclaim is stuck on
> > > * dirty/writeback pages.
> > > * __GFP_NORETRY allocations might fail and so the OOM
> > > * would be more harmful than useful.
> > > */
> >
> > I don't think we need to explain the individual flags, but it would
> > indeed be useful to remark here that we shouldn't OOM kill from
> > allocations contexts with (severely) limited reclaim abilities.
>
> Is __GFP_NORETRY really related to limited reclaim abilities? I thought
> it was merely a way to tell the allocator to fail rather than spend too
> much time reclaiming.
And you wouldn't call that "limited reclaim ability"? I guess it's a
matter of phrasing, but the point is that we don't want anybody to OOM
kill who hasn't exhausted all the other options that are usually
available to allocators. This includes the ability to enter the FS, the
ability to do IO in general, and the ability to retry reclaim. Possibly
more.
> If you are referring to __GFP_FS part then I have
> no objections to be less specific, of course, but __GFP_IO would fall
> into the same category but we are not checking for it. I have no idea
> why we consider the first and not the later one, to be honest...
Which proves my point that we should document high-level intent rather
than implementation. Suddenly, that missing __GFP_IO is sticking out
like a sore thumb...
On Wed, 3 Dec 2014 16:52:22 +0100 Michal Hocko <[email protected]> wrote:
> On Mon 01-12-14 18:30:40, Johannes Weiner wrote:
> > On Thu, Nov 27, 2014 at 11:25:47AM +0100, Michal Hocko wrote:
> > > On Wed 26-11-14 14:17:32, David Rientjes wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2706,7 +2706,7 @@ rebalance:
> > > > * running out of options and have to consider going OOM
> > > > */
> > > > if (!did_some_progress) {
> > > > - if (oom_gfp_allowed(gfp_mask)) {
> > > /*
> > > * Do not attempt to trigger OOM killer for !__GFP_FS
> > > * allocations because it would be premature to kill
> > > * anything just because the reclaim is stuck on
> > > * dirty/writeback pages.
> > > * __GFP_NORETRY allocations might fail and so the OOM
> > > * would be more harmful than useful.
> > > */
> >
> > I don't think we need to explain the individual flags, but it would
> > indeed be useful to remark here that we shouldn't OOM kill from
> > allocations contexts with (severely) limited reclaim abilities.
>
> Is __GFP_NORETRY really related to limited reclaim abilities? I thought
> it was merely a way to tell the allocator to fail rather than spend too
> much time reclaiming.
That's my understanding of __GFP_NORETRY. However, it seems that I
really didn't have a plan:
: commit 75908778d91e92ca3c9ed587c4550866f4c903fc
: Author: Andrew Morton <[email protected]>
: Date: Sun Apr 20 00:28:12 2003 -0700
:
: [PATCH] implement __GFP_REPEAT, __GFP_NOFAIL, __GFP_NORETRY
:
: This is a cleanup patch.
:
: There are quite a lot of places in the kernel which will infinitely retry a
: memory allocation.
:
: Generally, they get it wrong. Some do yield(), the semantics of which have
: changed over time. Some do schedule(), which can lock up if the caller is
: SCHED_FIFO/RR. Some do schedule_timeout(), etc.
:
: And often it is unnecessary, because the page allocator will do the retry
: internally anyway. But we cannot rely on that - this behaviour may change
: (-aa and -rmap kernels do not do this, for instance).
:
: So it is good to formalise and to centralise this operation. If an
: allocation specifies __GFP_REPEAT then the page allocator must infinitely
: retry the allocation.
:
: The semantics of __GFP_REPEAT are "try harder". The allocation _may_ fail
: (the 2.4 -aa and -rmap VM's do not retry infinitely by default).
:
: The semantics of __GFP_NOFAIL are "cannot fail". It is a no-op in this VM,
: but needs to be honoured (or fix up the callers) if the VM is changed to not
: retry infinitely by default.
:
: The semantics of __GFP_NOREPEAT are "try once, don't loop". This isn't used
: at present (although perhaps it should be, in swapoff). It is mainly for
: completeness.
(that's a braino in the changelog: it should be
s/__GFP_NOREPEAT/__GFP_NORETRY/)
> If you are referring to __GFP_FS part then I have
> no objections to be less specific, of course, but __GFP_IO would fall
> into the same category but we are not checking for it. I have no idea
> why we consider the first and not the later one, to be honest...
(__GFP_FS && !__GFP_IO) doesn't make much sense and probably doesn't
happen. "__GFP_FS implies __GFP_IO" is OK.
Anyway, yes, this particular piece of __alloc_pages_slowpath() sorely
needs documenting, please. Once we manage to work out why we're doing
what we're doing!
On Wed 03-12-14 13:15:09, Johannes Weiner wrote:
> On Wed, Dec 03, 2014 at 04:52:22PM +0100, Michal Hocko wrote:
> > On Mon 01-12-14 18:30:40, Johannes Weiner wrote:
> > > On Thu, Nov 27, 2014 at 11:25:47AM +0100, Michal Hocko wrote:
> > > > On Wed 26-11-14 14:17:32, David Rientjes wrote:
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -2706,7 +2706,7 @@ rebalance:
> > > > > * running out of options and have to consider going OOM
> > > > > */
> > > > > if (!did_some_progress) {
> > > > > - if (oom_gfp_allowed(gfp_mask)) {
> > > > /*
> > > > * Do not attempt to trigger OOM killer for !__GFP_FS
> > > > * allocations because it would be premature to kill
> > > > * anything just because the reclaim is stuck on
> > > > * dirty/writeback pages.
> > > > * __GFP_NORETRY allocations might fail and so the OOM
> > > > * would be more harmful than useful.
> > > > */
> > >
> > > I don't think we need to explain the individual flags, but it would
> > > indeed be useful to remark here that we shouldn't OOM kill from
> > > allocations contexts with (severely) limited reclaim abilities.
> >
> > Is __GFP_NORETRY really related to limited reclaim abilities? I thought
> > it was merely a way to tell the allocator to fail rather than spend too
> > much time reclaiming.
>
> And you wouldn't call that "limited reclaim ability"?
I really do not want to go into language lawyering here. But to me the
reclaim ability is what the reclaim is capable of doing with the given
gfp. And __GFP_NORETRY is completely irrelevant to the reclaim. It tells
the allocator how hard it should try (similar to __GFP_REPEAT or
__GFP_NOFAIL), unlike __GFP_FS, which restricts the reclaim in its
operation.
> I guess it's a
> matter of phrasing, but the point is that we don't want anybody to OOM
> kill that didn't exhaust all other options that are usually available
> to allocators. This includes the ability to enter the FS, the ability
> to do IO in general, and the ability to retry reclaim. Possibly more.
Right.
> > If you are referring to __GFP_FS part then I have
> > no objections to be less specific, of course, but __GFP_IO would fall
> > into the same category but we are not checking for it. I have no idea
> > why we consider the first and not the later one, to be honest...
>
> Which proves my point that we should document high-level intent rather
> than implementation. Suddenly, that missing __GFP_IO is sticking out
> like a sore thumb...
I am obviously not insisting on the above wording. I am for anything
that clarifies the test and does not force me to go through several hops
of git blame to find the original intention again in a year, when I will
have forgotten it.
--
Michal Hocko
SUSE Labs
On Thu, Dec 04, 2014 at 04:17:58PM +0100, Michal Hocko wrote:
> On Wed 03-12-14 13:15:09, Johannes Weiner wrote:
> > On Wed, Dec 03, 2014 at 04:52:22PM +0100, Michal Hocko wrote:
> > > On Mon 01-12-14 18:30:40, Johannes Weiner wrote:
> > > > On Thu, Nov 27, 2014 at 11:25:47AM +0100, Michal Hocko wrote:
> > > > > On Wed 26-11-14 14:17:32, David Rientjes wrote:
> > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > --- a/mm/page_alloc.c
> > > > > > +++ b/mm/page_alloc.c
> > > > > > @@ -2706,7 +2706,7 @@ rebalance:
> > > > > > * running out of options and have to consider going OOM
> > > > > > */
> > > > > > if (!did_some_progress) {
> > > > > > - if (oom_gfp_allowed(gfp_mask)) {
> > > > > /*
> > > > > * Do not attempt to trigger OOM killer for !__GFP_FS
> > > > > * allocations because it would be premature to kill
> > > > > * anything just because the reclaim is stuck on
> > > > > * dirty/writeback pages.
> > > > > * __GFP_NORETRY allocations might fail and so the OOM
> > > > > * would be more harmful than useful.
> > > > > */
> > > >
> > > > I don't think we need to explain the individual flags, but it would
> > > > indeed be useful to remark here that we shouldn't OOM kill from
> > > > allocations contexts with (severely) limited reclaim abilities.
> > >
> > > Is __GFP_NORETRY really related to limited reclaim abilities? I thought
> > > it was merely a way to tell the allocator to fail rather than spend too
> > > much time reclaiming.
> >
> > And you wouldn't call that "limited reclaim ability"?
>
> I really do not want to go into language lawyering here. But to me the
> reclaim ability is what the reclaim is capable to do with the given gfp.
I was explicitly talking about the reclaim abilities of the
allocation context, in the context of the allocator. Surely that
includes how often the task can call try_to_free_pages()?
> And __GFP_NORETRY is completely irrelevant for the reclaim. It tells the
> allocator how hard it should try (similar like __GFP_REPEAT or
> __GFP_NOFAIL) unlike __GFP_FS which restricts the reclaim in its
> operation.
Tells the allocator how hard to try *what*? It's not just looking at
the freelists in a tight loop. It's reclaiming memory.
You're still thinking about the implementation and get distracted by
the detail that the allocator's reclaim logic is living in a separate
code file. Look at the architecture. The GFP flags apply to the
freelist manager and the reclaim code as one big allocator machine
that goes off to get pages, from wherever. And anything that is
taking pages that are not already on the freelist must be reclaiming
them from somewhere else, by definition. Thus, the allocation
context's reclaim ability includes whether it can free or migrate used
pages on its own (__GFP_WAIT), whether it can scan and migrate pages
indefinitely or not (__GFP_NOFAIL and __GFP_NORETRY), and whether it
can write and wait for the pages it looks at (__GFP_FS and __GFP_IO).
OOM killing is just another way to reclaim, but the point here is that
it's so disruptive that we want to avoid it unless all other means to
reclaim memory are exhausted. And yes, continuing to scan pages *is*
a means for the allocator to reclaim memory, and we don't OOM kill
when that is still making progress!
That being said, you ARE right that __GFP_FS and __GFP_NORETRY apply
to different components of the allocator. While they should have the
same effect on OOM-killing, the fact that they are checked together in
the same branch is a strong signal of how weird this code actually is.
How about the following? It changes the code flow to clarify what's
actually going on there and gets rid of oom_gfp_allowed() altogether,
instead of awkwardly trying to explain something that has no meaning.
Btw, it looks like there is a bug with oom_killer_disabled, because the
allocator will return NULL even for __GFP_NOFAIL allocations.
---
From: Johannes Weiner <[email protected]>
Subject: [patch] mm: page_alloc: embed OOM killing naturally into allocation
slowpath
The OOM killing invocation does a lot of duplicative checks against
the task's allocation context. Rework it to take advantage of the
existing checks in the allocator slowpath.
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/oom.h | 5 ----
mm/page_alloc.c | 80 +++++++++++++++++++++++------------------------------
2 files changed, 35 insertions(+), 50 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index e8d6e1058723..4971874f54db 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -85,11 +85,6 @@ static inline void oom_killer_enable(void)
oom_killer_disabled = false;
}
-static inline bool oom_gfp_allowed(gfp_t gfp_mask)
-{
- return (gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY);
-}
-
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
/* sysctls */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 616a2c956b4b..2df99ca56e28 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2232,12 +2232,21 @@ static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, struct zone *preferred_zone,
- int classzone_idx, int migratetype)
+ int classzone_idx, int migratetype, unsigned long *did_some_progress)
{
struct page *page;
- /* Acquire the per-zone oom lock for each zone */
+ *did_some_progress = 0;
+
+ if (oom_killer_disabled)
+ return NULL;
+
+ /*
+ * Acquire the per-zone oom lock for each zone. If that
+ * fails, somebody else is making progress for us.
+ */
if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
+ *did_some_progress = 1;
schedule_timeout_uninterruptible(1);
return NULL;
}
@@ -2263,12 +2272,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
goto out;
if (!(gfp_mask & __GFP_NOFAIL)) {
+ /* Coredumps can quickly deplete all memory reserves */
+ if (current->flags & PF_DUMPCORE)
+ goto out;
/* The OOM killer will not help higher order allocs */
if (order > PAGE_ALLOC_COSTLY_ORDER)
goto out;
/* The OOM killer does not needlessly kill tasks for lowmem */
if (high_zoneidx < ZONE_NORMAL)
goto out;
+ /* The OOM killer does not compensate for light reclaim */
+ if (!(gfp_mask & __GFP_FS))
+ goto out;
/*
* GFP_THISNODE contains __GFP_NORETRY and we never hit this.
* Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -2281,7 +2296,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
}
/* Exhausted what can be done so it's blamo time */
out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+ *did_some_progress = 1;
out:
oom_zonelist_unlock(zonelist, gfp_mask);
return page;
@@ -2571,7 +2586,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
(gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;
-restart:
if (!(gfp_mask & __GFP_NO_KSWAPD))
wake_all_kswapds(order, zonelist, high_zoneidx,
preferred_zone, nodemask);
@@ -2701,51 +2715,27 @@ rebalance:
if (page)
goto got_pg;
- /*
- * If we failed to make any progress reclaiming, then we are
- * running out of options and have to consider going OOM
- */
- if (!did_some_progress) {
- if (oom_gfp_allowed(gfp_mask)) {
- if (oom_killer_disabled)
- goto nopage;
- /* Coredumps can quickly deplete all memory reserves */
- if ((current->flags & PF_DUMPCORE) &&
- !(gfp_mask & __GFP_NOFAIL))
- goto nopage;
- page = __alloc_pages_may_oom(gfp_mask, order,
- zonelist, high_zoneidx,
- nodemask, preferred_zone,
- classzone_idx, migratetype);
- if (page)
- goto got_pg;
-
- if (!(gfp_mask & __GFP_NOFAIL)) {
- /*
- * The oom killer is not called for high-order
- * allocations that may fail, so if no progress
- * is being made, there are no other options and
- * retrying is unlikely to help.
- */
- if (order > PAGE_ALLOC_COSTLY_ORDER)
- goto nopage;
- /*
- * The oom killer is not called for lowmem
- * allocations to prevent needlessly killing
- * innocent tasks.
- */
- if (high_zoneidx < ZONE_NORMAL)
- goto nopage;
- }
-
- goto restart;
- }
- }
-
/* Check if we should retry the allocation */
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, did_some_progress,
pages_reclaimed)) {
+ /*
+ * If we fail to make progress by freeing individual
+ * pages, but the allocation wants us to keep going,
+ * start OOM killing tasks.
+ */
+ if (!did_some_progress) {
+ page = __alloc_pages_may_oom(gfp_mask, order, zonelist,
+ high_zoneidx, nodemask,
+ preferred_zone, classzone_idx,
+ migratetype,&did_some_progress);
+ if (page)
+ goto got_pg;
+ if (!did_some_progress) {
+ BUG_ON(gfp_mask & __GFP_NOFAIL);
+ goto nopage;
+ }
+ }
/* Wait for some write requests to complete then retry */
wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
goto rebalance;
--
2.1.3
On Thu 04-12-14 15:19:05, Johannes Weiner wrote:
[...]
> How about the following? It changes the code flow to clarify what's
> actually going on there and gets rid of oom_gfp_allowed() altogether,
> instead of awkwardly trying to explain something that has no meaning.
Yes this makes a lot of sense.
> Btw, it looks like there is a bug with oom_killer_disabled, because it
> will return NULL for __GFP_NOFAIL.
Right! A __GFP_NOFAIL allocation cannot be guaranteed after the OOM
killer is disabled.
> ---
> From: Johannes Weiner <[email protected]>
> Subject: [patch] mm: page_alloc: embed OOM killing naturally into allocation
> slowpath
>
> The OOM killing invocation does a lot of duplicative checks against
> the task's allocation context. Rework it to take advantage of the
> existing checks in the allocator slowpath.
Nice! Just document the __GFP_NOFAIL fix here as well, please.
> Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
I will rebase my oom vs pm-freezer patch
(http://marc.info/?l=linux-mm&m=141634503316543&w=2), which touches this
area, on top of your patch.
Thanks!
> ---
> include/linux/oom.h | 5 ----
> mm/page_alloc.c | 80 +++++++++++++++++++++++------------------------------
> 2 files changed, 35 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index e8d6e1058723..4971874f54db 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -85,11 +85,6 @@ static inline void oom_killer_enable(void)
> oom_killer_disabled = false;
> }
>
> -static inline bool oom_gfp_allowed(gfp_t gfp_mask)
> -{
> - return (gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY);
> -}
> -
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> /* sysctls */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 616a2c956b4b..2df99ca56e28 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2232,12 +2232,21 @@ static inline struct page *
> __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> struct zonelist *zonelist, enum zone_type high_zoneidx,
> nodemask_t *nodemask, struct zone *preferred_zone,
> - int classzone_idx, int migratetype)
> + int classzone_idx, int migratetype, unsigned long *did_some_progress)
> {
> struct page *page;
>
> - /* Acquire the per-zone oom lock for each zone */
> + *did_some_progress = 0;
> +
> + if (oom_killer_disabled)
> + return NULL;
> +
> + /*
> + * Acquire the per-zone oom lock for each zone. If that
> + * fails, somebody else is making progress for us.
> + */
> if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
> + *did_some_progress = 1;
> schedule_timeout_uninterruptible(1);
> return NULL;
> }
> @@ -2263,12 +2272,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> goto out;
>
> if (!(gfp_mask & __GFP_NOFAIL)) {
> + /* Coredumps can quickly deplete all memory reserves */
> + if (current->flags & PF_DUMPCORE)
> + goto out;
> /* The OOM killer will not help higher order allocs */
> if (order > PAGE_ALLOC_COSTLY_ORDER)
> goto out;
> /* The OOM killer does not needlessly kill tasks for lowmem */
> if (high_zoneidx < ZONE_NORMAL)
> goto out;
> + /* The OOM killer does not compensate for light reclaim */
> + if (!(gfp_mask & __GFP_FS))
> + goto out;
> /*
> * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> @@ -2281,7 +2296,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> }
> /* Exhausted what can be done so it's blamo time */
> out_of_memory(zonelist, gfp_mask, order, nodemask, false);
> -
> + *did_some_progress = 1;
> out:
> oom_zonelist_unlock(zonelist, gfp_mask);
> return page;
> @@ -2571,7 +2586,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> goto nopage;
>
> -restart:
> if (!(gfp_mask & __GFP_NO_KSWAPD))
> wake_all_kswapds(order, zonelist, high_zoneidx,
> preferred_zone, nodemask);
> @@ -2701,51 +2715,27 @@ rebalance:
> if (page)
> goto got_pg;
>
> - /*
> - * If we failed to make any progress reclaiming, then we are
> - * running out of options and have to consider going OOM
> - */
> - if (!did_some_progress) {
> - if (oom_gfp_allowed(gfp_mask)) {
> - if (oom_killer_disabled)
> - goto nopage;
> - /* Coredumps can quickly deplete all memory reserves */
> - if ((current->flags & PF_DUMPCORE) &&
> - !(gfp_mask & __GFP_NOFAIL))
> - goto nopage;
> - page = __alloc_pages_may_oom(gfp_mask, order,
> - zonelist, high_zoneidx,
> - nodemask, preferred_zone,
> - classzone_idx, migratetype);
> - if (page)
> - goto got_pg;
> -
> - if (!(gfp_mask & __GFP_NOFAIL)) {
> - /*
> - * The oom killer is not called for high-order
> - * allocations that may fail, so if no progress
> - * is being made, there are no other options and
> - * retrying is unlikely to help.
> - */
> - if (order > PAGE_ALLOC_COSTLY_ORDER)
> - goto nopage;
> - /*
> - * The oom killer is not called for lowmem
> - * allocations to prevent needlessly killing
> - * innocent tasks.
> - */
> - if (high_zoneidx < ZONE_NORMAL)
> - goto nopage;
> - }
> -
> - goto restart;
> - }
> - }
> -
> /* Check if we should retry the allocation */
> pages_reclaimed += did_some_progress;
> if (should_alloc_retry(gfp_mask, order, did_some_progress,
> pages_reclaimed)) {
> + /*
> + * If we fail to make progress by freeing individual
> + * pages, but the allocation wants us to keep going,
> + * start OOM killing tasks.
> + */
> + if (!did_some_progress) {
> + page = __alloc_pages_may_oom(gfp_mask, order, zonelist,
> + high_zoneidx, nodemask,
> + preferred_zone, classzone_idx,
> + migratetype,&did_some_progress);
> + if (page)
> + goto got_pg;
> + if (!did_some_progress) {
> + BUG_ON(gfp_mask & __GFP_NOFAIL);
> + goto nopage;
> + }
> + }
> /* Wait for some write requests to complete then retry */
> wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
> goto rebalance;
> --
> 2.1.3
>
--
Michal Hocko
SUSE Labs