From: Michal Hocko <[email protected]>
David has noticed that a THP memcg charge can trigger the oom killer
since 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations"). We previously used an explicit __GFP_NORETRY,
which ruled out the OOM killer automagically.
The memcg charge path should be semantically consistent with the
allocation path, and that means that if we do not trigger the OOM killer
for costly orders in the allocation path, we should not do so in the
memcg charge path either. Otherwise we are forcing callers to
distinguish the two and use different gfp masks, which is both
non-intuitive and bug prone, not to mention the maintenance burden.
Teach mem_cgroup_oom to bail out on costly order requests to fix the THP
issue as well as any other costly OOM-eligible allocations added in the
future.
Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Reported-by: David Rientjes <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
Hi,
this is an alternative patch to [1]. I strongly believe that using
different gfp masks for the allocation and memcg charging is to be
avoided as much as possible. There doesn't seem to be any good reason
why THP charges should be an exception here.
I would be tempted to mark this for stable, even though we haven't seen
any unexpected memcg OOM killer reports since 4.8, which is quite some
time.
[1] http://lkml.kernel.org/r/[email protected]
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1a917b5b7b7..08accbcd1a18 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- if (!current->memcg_may_oom)
+ if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
return;
/*
* We are in the middle of the charge context here, so we
--
2.16.2
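For readers skimming the hunk above, the added condition can be read as a
simple predicate: a charge may invoke the memcg OOM killer only if the
task allows it and the request is not a costly order. A standalone
userspace sketch of that predicate (the helper name and the test values
are made up for illustration; PAGE_ALLOC_COSTLY_ORDER is 3 in the kernel):
#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3       /* same value the kernel uses */

/* Hypothetical helper mirroring the check the patch adds to
 * mem_cgroup_oom(): costly orders bail out instead of OOM killing. */
static bool memcg_oom_eligible(bool memcg_may_oom, int order)
{
        return memcg_may_oom && order <= PAGE_ALLOC_COSTLY_ORDER;
}

int main(void)
{
        /* An order-9 charge (a 2MB THP with 4KB pages) is costly and no
         * longer OOM kills; an order-0 charge still can. */
        printf("THP  (order-9): %d\n", memcg_oom_eligible(true, 9));
        printf("base (order-0): %d\n", memcg_oom_eligible(true, 0));
        return 0;
}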
On Wed, 21 Mar 2018, Michal Hocko wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d1a917b5b7b7..08accbcd1a18 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>
> static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> {
> - if (!current->memcg_may_oom)
> + if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> return;
> /*
> * We are in the middle of the charge context here, so we
What bug reports have you received about order-4 and higher order non thp
charges that this fixes?
The patch title and the changelog specifically single out thp, which I've
fixed, since it has sane fallback behavior and everything else uses
__GFP_NORETRY. I think this is misusing a page allocator heuristic that
hasn't been applied to the memcg charge path before to address a thp
regression but generalizing it for all charges.
PAGE_ALLOC_COSTLY_ORDER is a heuristic used by the page allocator because
it cannot free high-order contiguous memory. Memcg just needs to reclaim
a number of pages. Two order-3 charges can cause a memcg oom kill but now
an order-4 charge cannot. It's an unfair bias against high-order charges
that are not explicitly using __GFP_NORETRY.
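To put numbers on the order-3/order-4 comparison above (assuming 4KB
pages, so order-3 is 32KB, order-4 is 64KB, and the costly cutoff of
order > 3 falls right between the two), here is a minimal, purely
illustrative userspace sketch of the arithmetic:
#include <stdio.h>

#define PAGE_SIZE               4096UL  /* assuming 4KB pages */
#define PAGE_ALLOC_COSTLY_ORDER 3

int main(void)
{
        /* Two order-3 charges account the same 64KB as one order-4 charge,
         * yet only the former may still trigger a memcg oom kill. */
        for (int order = 0; order <= 4; order++) {
                unsigned long kb = (PAGE_SIZE << order) / 1024;

                printf("order-%d = %3lu KB -> %s\n", order, kb,
                       order > PAGE_ALLOC_COSTLY_ORDER ?
                       "costly, skips memcg OOM with the patch" :
                       "not costly, memcg OOM still possible");
        }
        return 0;
}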
On Wed 21-03-18 14:22:13, David Rientjes wrote:
> On Wed, 21 Mar 2018, Michal Hocko wrote:
>
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d1a917b5b7b7..08accbcd1a18 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> >
> > static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > {
> > - if (!current->memcg_may_oom)
> > + if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> > return;
> > /*
> > * We are in the middle of the charge context here, so we
>
> What bug reports have you received about order-4 and higher order non thp
> charges that this fixes?
We do not have any costly _OOM killable_ allocations but THP AFAIR. Or
am I missing any?
> The patch title and the changelog specifically single out thp, which I've
> fixed, since it has sane fallback behavior and everything else uses
> __GFP_NORETRY. I think this is misusing a page allocator heuristic that
> hasn't been applied to the memcg charge path before to address a thp
> regression but generalizing it for all charges.
Yes, which is the whole point! We do not want a THP-specific workaround.
Just look at the bug your original patch was fixing. The regression was
caused by a change which generalized the gfp masks for THP because
different policies imply a different effort. As a side effect, THP
charges became OOM killable. I would call that quite non-intuitive and
error prone.
> PAGE_ALLOC_COSTLY_ORDER is a heuristic used by the page allocator because
> it cannot free high-order contiguous memory. Memcg just needs to reclaim
> a number of pages. Two order-3 charges can cause a memcg oom kill but now
> an order-4 charge cannot. It's an unfair bias against high-order charges
> that are not explicitly using __GFP_NORETRY.
PAGE_ALLOC_COSTLY_ORDER is documented and people know what to expect
from such a request. Diverging from that behavior just comes as a
surprise. There is no reason for that and as the above outlines it is
error prone.
--
Michal Hocko
SUSE Labs
On Wed, 21 Mar 2018, Michal Hocko wrote:
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index d1a917b5b7b7..08accbcd1a18 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > >
> > > static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > > {
> > > - if (!current->memcg_may_oom)
> > > + if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> > > return;
> > > /*
> > > * We are in the middle of the charge context here, so we
> >
> > What bug reports have you received about order-4 and higher order non thp
> > charges that this fixes?
>
> We do not have any costly _OOM killable_ allocations but THP AFAIR. Or
> am I missing any?
>
So now you're making a generalized policy change to the memcg charge path
to fix what is obviously only thp and caused by removing the __GFP_NORETRY
from thp allocations in commit 2516035499b9? I don't know what orders
people enforce for slub_min_order. I assume that people who don't want to
cause a memcg oom kill are using __GFP_NORETRY because that's how it has
always worked. The fact that the page allocator got more sophisticated
logic for the various thp fault and defrag policies doesn't change that.
You're implementing the exact same behavior that commit 2516035499b9 was
trying to avoid; it's trying to avoid special-casing thp in general logic.
order > PAGE_ALLOC_COSTLY_ORDER is a terrible heuristic to identify thp
allocations.
> > PAGE_ALLOC_COSTLY_ORDER is a heuristic used by the page allocator because
> > it cannot free high-order contiguous memory. Memcg just needs to reclaim
> > a number of pages. Two order-3 charges can cause a memcg oom kill but now
> > an order-4 charge cannot. It's an unfair bias against high-order charges
> > that are not explicitly using __GFP_NORETRY.
>
> PAGE_ALLOC_COSTLY_ORDER is documented and people know what to expect
> from such a request. Diverging from that behavior just comes as a
> surprise. There is no reason for that and as the above outlines it is
> error prone.
>
You're diverging from it because the memcg charge path has never had this
heuristic. I'm somewhat stunned this has to be repeated:
PAGE_ALLOC_COSTLY_ORDER is about the ability to allocate _contiguous_
memory, it's not about the _amount_ of memory. Changing the memcg charge
path to factor order into oom kill decisions is new, and should be
proposed as a follow-up patch to my bug fix to describe what else is being
impacted by your patch and what is fixed by it.
Yours is a heuristic change, mine is a bug fix.
Look, commit 2516035499b9 pulled off __GFP_NORETRY for GFP_TRANSHUGE and
forgot to fix it up for memcg charging. I'm setting the bit again to
prevent the oom kill. It's what should be merged for rc7. I can't make a
stable case for it because the stable rules want it to impact more than
one user and I haven't seen other bug reports. It can be backported if
others are affected to meet the rules.
Your change is broken and I wouldn't push it to Linus for rc7 if my life
depended on it. What is the response when someone complains that they
start getting a ton of MEMCG_OOM notifications for every thp fallback?
They will, because yours is a broken implementation.
I'm trying to fix the problem introduced by commit 2516035499b9 wrt how
memcg charges treat high order non-__GFP_NORETRY allocations, and fix it
directly with something that is obviously right. I'm specifically not
trying to change heuristics as a bug fix. Please feel free to send a
follow-up patch for 4.17 that lays out why memcg doesn't want to oom kill
for order-4 and above (why does memcg fail for 64KB charges when the
caller specifically left off __GFP_NORETRY again?) as a policy change and
why that is helpful.
Respectfully, allow the bugfix to fix what was obviously left off from
commit 2516035499b9. I don't have time to write 100 emails on it, but
Andrew can be assured if he chooses to send it for rc7 that my code (1) is
actually tested, (2) has users that depend on it, and (3) won't cause
undesired side-effects like yours wrt oom notifications.
On Thu 22-03-18 01:26:13, David Rientjes wrote:
> On Wed, 21 Mar 2018, Michal Hocko wrote:
>
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index d1a917b5b7b7..08accbcd1a18 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > > >
> > > > static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > > > {
> > > > - if (!current->memcg_may_oom)
> > > > + if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> > > > return;
> > > > /*
> > > > * We are in the middle of the charge context here, so we
> > >
> > > What bug reports have you received about order-4 and higher order non thp
> > > charges that this fixes?
> >
> > We do not have any costly _OOM killable_ allocations but THP AFAIR. Or
> > am I missing any?
> >
>
> So now you're making a generalized policy change to the memcg charge path
> to fix what is obviously only thp and caused by removing the __GFP_NORETRY
> from thp allocations in commit 2516035499b9?
Yes, because relying on __GFP_NORETRY for the oom handling has proven to
be subtle and error prone. And as I've repeated a few times already,
there is _no_ reason why the oom policy for a memcg charge should be any
different from the allocator's.
> I don't know what orders
> people enforce for slub_min_order. I assume that people who don't want to
> cause a memcg oom kill are using __GFP_NORETRY because that's how it has
> always worked. The fact that the page allocator got more sophisticated
> logic for the various thp fault and defrag policies doesn't change that.
They simply cannot, because kmalloc performs the change under the
covers. So you would have to use kmalloc(gfp|__GFP_NORETRY) to be
absolutely sure not to trigger _any_ oom killer. This is just the wrong
thing to do.
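A hedged kernel-context sketch of what that forces on callers
(illustrative only; the function name and the size are made up): to be
sure a costly slab allocation never reaches any OOM path, the caller has
to spell out __GFP_NORETRY itself, because the gfp mask that kmalloc
ultimately charges with is assembled inside the slab allocator:
#include <linux/slab.h>
#include <linux/gfp.h>

/* Hypothetical caller: a costly request that must never trigger the OOM
 * killer has to carry __GFP_NORETRY explicitly. */
static void *alloc_big_buffer(void)
{
        return kmalloc(64 * 1024, GFP_KERNEL | __GFP_NORETRY);
}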
> You're implementing the exact same behavior that commit 2516035499b9 was
> trying to avoid; it's trying to avoid special-casing thp in general logic.
It is not trying to special case THP. It special cases _all_ costly
charges.
> order > PAGE_ALLOC_COSTLY_ORDER is a terrible heuristic to identify thp
> allocations.
>
> > > PAGE_ALLOC_COSTLY_ORDER is a heuristic used by the page allocator because
> > > it cannot free high-order contiguous memory. Memcg just needs to reclaim
> > > a number of pages. Two order-3 charges can cause a memcg oom kill but now
> > > an order-4 charge cannot. It's an unfair bias against high-order charges
> > > that are not explicitly using __GFP_NORETRY.
> >
> > PAGE_ALLOC_COSTLY_ORDER is documented and people know what to expect
> > from such a request. Diverging from that behavior just comes as a
> > surprise. There is no reason for that and as the above outlines it is
> > error prone.
> >
>
> You're diverging from it because the memcg charge path has never had this
> heuristic.
Which is arguably a bug which just didn't matter because we do not
have costly order oom eligible charges in general and THP was subtly
different and turned out to be error prone.
> I'm somewhat stunned this has to be repeated:
> PAGE_ALLOC_COSTLY_ORDER is about the ability to allocate _contiguous_
> memory, it's not about the _amount_ of memory. Changing the memcg charge
> path to factor order into oom kill decisions is new, and should be
> proposed as a follow-up patch to my bug fix to describe what else is being
> impacted by your patch and what is fixed by it.
>
> Yours is a heuristic change, mine is a bug fix.
Nobody is really arguing about this. I have just pointed out my
reservation that your bug fix is adding more special casing and a more
generic solution is due. If you absolutely believe that your bugfix is
so important to make it to rc7, I will not object. It is however strange
that we haven't seen a _single_ bug report in the last two years about
this being a problem. So I am not really sure the urgency is warranted.
> Look, commit 2516035499b9 pulled off __GFP_NORETRY for GFP_TRANSHUGE and
> forgot to fix it up for memcg charging. I'm setting the bit again to
> prevent the oom kill. It's what should be merged for rc7. I can't make a
> stable case for it because the stable rules want it to impact more than
> one user and I haven't seen other bug reports. It can be backported if
> others are affected to meet the rules.
Exactly, so why the urgency, and why merge a half fix that will preserve
the subtle behavior?
> Your change is broken and I wouldn't push it to Linus for rc7 if my life
> depended on it. What is the response when someone complains that they
> start getting a ton of MEMCG_OOM notifications for every thp fallback?
> They will, because yours is a broken implementation.
I fail to see what is broken. Could you be more specific?
> I'm trying to fix the problem introduced by commit 2516035499b9 wrt how
> memcg charges treat high order non-__GFP_NORETRY allocations, and fix it
> directly with something that is obviously right. I'm specifically not
> trying to change heuristics as a bug fix. Please feel free to send a
> follow-up patch for 4.17 that lays out why memcg doesn't want to oom kill
> for order-4 and above (why does memcg fail for 64KB charges when the
> caller specifically left off __GFP_NORETRY again?) as a policy change and
> why that is helpful.
>
> Respectfully, allow the bugfix to fix what was obviously left off from
> commit 2516035499b9.
I haven't nacked the patch AFAIR so nothing really prevents it from
being merged.
--
Michal Hocko
SUSE Labs
On Thu, 22 Mar 2018, Michal Hocko wrote:
> > So now you're making a generalized policy change to the memcg charge path
> > to fix what is obviously only thp and caused by removing the __GFP_NORETRY
> > from thp allocations in commit 2516035499b9?
>
> Yes, because relying on __GFP_NORETRY for the oom handling has proven to
> be subtle and error prone. And as I've repeated a few times already,
> there is _no_ reason why the oom policy for a memcg charge should be any
> different from the allocator's.
>
The PAGE_ALLOC_COSTLY_ORDER oom heuristic in the page allocator is about
the unlikelihood of freeing contiguous memory. It was added around the
same time I added the oom heuristic to prevent killing processes for
lowmem because of the unlikelihood of freeing lowmem.
If lowmem is accounted to a memcg, would you avoid oom kill in that
scenario too, just for the sake of matching the page allocator? Of course
not.
The argument is absurd, I'm sorry. Charging 32KB of memory allows the
memcg oom killer, but magically a 64KB charge automatically fails, and
the caller has no ability to override that because you've made it part
of the charge path. There is no __GFP_RETRY. Making blanket claims that
high-order charges should never oom kill, that you know better than the
caller, and that you don't think the caller knows what they are doing
wrt __GFP_NORETRY, is more dangerous imo than any potential benefit from
what you are proposing.
> They simply cannot, because kmalloc performs the change under the
> covers. So you would have to use kmalloc(gfp|__GFP_NORETRY) to be
> absolutely sure not to trigger _any_ oom killer. This is just the wrong
> thing to do.
>
Examples of where this isn't already done? It certainly wasn't a problem
before __GFP_NORETRY was dropped in commit 2516035499b9 but you suspect
it's a problem now.
> > You're diverging from it because the memcg charge path has never had this
> > heuristic.
>
> Which is arguably a bug which just didn't matter because we do not
> have costly order oom eligible charges in general and THP was subtly
> different and turned out to be error prone.
>
It was inadvertently dropped from commit 2516035499b9. There were no
high-order charge oom kill problems before this commit. People know how
to use __GFP_NORETRY or leave it off, which you don't trust them to do
because you're hardcoding a heuristic in the charge path. You also acked
the commit that introduced this "error prone" problem. Before you start
to advertise that you know better than what previously worked just fine,
let's fix the issue that was introduced by that commit and then you can
propose a follow-up patch that changes the charge heuristic and it can
stand on its own merits.
> > I'm somewhat stunned this has to be repeated:
> > PAGE_ALLOC_COSTLY_ORDER is about the ability to allocate _contiguous_
> > memory, it's not about the _amount_ of memory. Changing the memcg charge
> > path to factor order into oom kill decisions is new, and should be
> > proposed as a follow-up patch to my bug fix to describe what else is being
> > impacted by your patch and what is fixed by it.
> >
> > Yours is a heuristic change, mine is a bug fix.
>
> Nobody is really arguing about this. I have just pointed out my
> reservation that your bug fix is adding more special casing and a more
> generic solution is due.
Dude, it's setting a bit that the problem commit dropped. That's it. I'm
setting a bit.
> If you absolutely believe that your bugfix is
> so important to make it to rc7, I will not object. It is however strange
> that we haven't seen a _single_ bug report in the last two years about
> this being a problem. So I am not really sure the urgency is warranted.
>
You're not really sure about the urgency but you were "tempted to mark
this for stable" for your heuristic "fix"?
> > Your change is broken and I wouldn't push it to Linus for rc7 if my life
> > depended on it. What is the response when someone complains that they
> > start getting a ton of MEMCG_OOM notifications for every thp fallback?
> > They will, because yours is a broken implementation.
>
> I fail to see what is broken. Could you be more specific?
>
I said MEMCG_OOM notifications on thp fallback. You modified
mem_cgroup_oom(). What is called before mem_cgroup_oom()?
mem_cgroup_event(mem_over_limit, MEMCG_OOM). That increments the
MEMCG_OOM event and anybody waiting on the events file gets notified it
changed. They read a MEMCG_OOM event. It's thp fallback, it's not memcg
oom.
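In other words, a paraphrased sketch of the charge-failure ordering being
described (not verbatim kernel code; the names are taken from the
paragraph above):
        /* tail of the charge failure path, roughly: */
        mem_cgroup_event(mem_over_limit, MEMCG_OOM);     /* event counted, eventfd waiters notified */
        mem_cgroup_oom(mem_over_limit, gfp_mask, order); /* with the costly-order check, bails out here */
        /* the charge still fails and the THP fault falls back */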
Really, I can't continue to write 100 emails in this thread. I'm sorry,
but there are only so many hours in a day. I can't read 20 emails about
why Tetsuo shouldn't emit a stack trace when the oom reaper fails, which
happens 0.04% of the time on production workloads with data I provided. I
can't continue to reiterate why we added the PAGE_ALLOC_COSTLY_ORDER
heuristic to the allocator. I'm done.
> > Respectfully, allow the bugfix to fix what was obviously left off from
> > commit 2516035499b9.
>
> I haven't nacked the patch AFAIR so nothing really prevents it from
> being merged.
>
It should be merged for rc7. Please send any follow-up policy change wrt
high-order charges as a separate patch for 4.17.
On Thu 22-03-18 13:29:37, David Rientjes wrote:
> On Thu, 22 Mar 2018, Michal Hocko wrote:
[...]
> > They simply cannot, because kmalloc performs the change under the
> > covers. So you would have to use kmalloc(gfp|__GFP_NORETRY) to be
> > absolutely sure not to trigger _any_ oom killer. This is just the wrong
> > thing to do.
> >
>
> Examples of where this isn't already done? It certainly wasn't a problem
> before __GFP_NORETRY was dropped in commit 2516035499b9 but you suspect
> it's a problem now.
It is not a problem _right now_, as I've already pointed out a few
times. We do not trigger the OOM killer for anything but the #PF path.
But this is an implementation detail which can change in the future, and
there is actually some demand for the change. Once we start triggering
the oom killer for all charges, we do not really want to have the
disparity.
> > > You're diverging from it because the memcg charge path has never had this
> > > heuristic.
> >
> > Which is arguably a bug which just didn't matter because we do not
> > have costly order oom eligible charges in general and THP was subtly
> > different and turned out to be error prone.
> >
>
> It was inadvertently dropped from commit 2516035499b9. There were no
> high-order charge oom kill problems before this commit. People know how
> to use __GFP_NORETRY or leave it off, which you don't trust them to do
> because you're hardcoding a heuristic in the charge path.
No. Just read what I wrote. I am worried that the current disparity
between the page allocator and memcg charging will _force_ callers to
resort to hacks, and sometimes (e.g. kmalloc) they will have no option
but to use __GFP_NORETRY even when that is not really needed and has a
different semantic than they would like.
Behavior with and without memcgs should be as similar as possible;
otherwise you will see different sets of bugs when running under a memcg
and without one. I really fail to see what is so hard to understand
about this.
[...]
> > > Your change is broken and I wouldn't push it to Linus for rc7 if my life
> > > depended on it. What is the response when someone complains that they
> > > start getting a ton of MEMCG_OOM notifications for every thp fallback?
> > > They will, because yours is a broken implementation.
> >
> > I fail to see what is broken. Could you be more specific?
> >
>
> I said MEMCG_OOM notifications on thp fallback. You modified
> mem_cgroup_oom(). What is called before mem_cgroup_oom()?
> mem_cgroup_event(mem_over_limit, MEMCG_OOM). That increments the
> MEMCG_OOM event and anybody waiting on the events file gets notified it
> changed. They read a MEMCG_OOM event. It's thp fallback, it's not memcg
> oom.
MEMCG_OOM doesn't count the number of oom killer invocations. That has
never been the case.
> Really, I can't continue to write 100 emails in this thread.
Then try to read, and even try to understand, the concerns expressed in
those emails. Repeating the same set of arguments and ignoring the rest
is not really helpful.
--
Michal Hocko
SUSE Labs
On Fri, 23 Mar 2018, Michal Hocko wrote:
> > Examples of where this isn't already done? It certainly wasn't a problem
> > before __GFP_NORETRY was dropped in commit 2516035499b9 but you suspect
> > it's a problem now.
>
> It is not a problem _right now_, as I've already pointed out a few
> times. We do not trigger the OOM killer for anything but the #PF path.
> But this is an implementation detail which can change in the future, and
> there is actually some demand for the change. Once we start triggering
> the oom killer for all charges, we do not really want to have the
> disparity.
>
Ok, my patch is only addressing the code as it sits today, not any
theoretical code in the future. The fact remains that the
PAGE_ALLOC_COSTLY_ORDER check and the high_zoneidx test for lowmem
allocations in the allocation path exist because oom killing is unlikely
to free contiguous pages and lowmem, respectively. We wouldn't avoid an
oom kill in memcg just because a charge is __GFP_DMA. We shouldn't avoid
an oom kill in memcg just because the order is above
PAGE_ALLOC_COSTLY_ORDER: it's about contiguous memory, not about the
amount of memory. I believe you understand that, and so I'm optimistic
that we are good to close this thread out. Thanks.
On Wed, Mar 21, 2018 at 02:22:13PM -0700, David Rientjes wrote:
> PAGE_ALLOC_COSTLY_ORDER is a heuristic used by the page allocator because
> it cannot free high-order contiguous memory. Memcg just needs to reclaim
> a number of pages. Two order-3 charges can cause a memcg oom kill but now
> an order-4 charge cannot. It's an unfair bias against high-order charges
> that are not explicitly using __GFP_NORETRY.
I agree with your premise: unlike the page allocator, memcg could OOM
kill to help satisfy higher order allocations.
Technically.
But the semantics and expectations matter too.
Because of the allocator's restriction, we've been telling and
teaching callsites to fail gracefully and implement fallbacks forever,
and that makes costly-order allocations inherently speculative and to
a certain extent often optional. They've been written with that in
mind forever.
OOM is not graceful failure, though. We don't want to OOM kill when the
callsite can easily fall back to smaller allocations, trigger a packet
resend, fail the syscall, what have you.
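A sketch of the fallback pattern being described (illustrative
kernel-context code, not from any particular callsite; the function name
is made up): the costly attempt is opportunistic and the caller degrades
gracefully instead of relying on an OOM kill:
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Hypothetical caller: try the large contiguous buffer opportunistically,
 * then fall back rather than depend on the OOM killer to make room. */
static void *alloc_table(size_t size)
{
        void *buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);

        if (!buf)
                buf = vmalloc(size);    /* non-contiguous fallback */
        return buf;
}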
We could argue what the default should be if callsites aren't
specifically annotated - and whether we should change it.
But the page allocator has established the default behavior already,
and this is a bugfix. I prefer this fix not fundamentally change
semantics for costly-order allocations.
On Wed, Mar 21, 2018 at 09:59:28PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> David has noticed that a THP memcg charge can trigger the oom killer
> since 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> madvised allocations"). We previously used an explicit __GFP_NORETRY,
> which ruled out the OOM killer automagically.
>
> The memcg charge path should be semantically consistent with the
> allocation path, and that means that if we do not trigger the OOM killer
> for costly orders in the allocation path, we should not do so in the
> memcg charge path either. Otherwise we are forcing callers to
> distinguish the two and use different gfp masks, which is both
> non-intuitive and bug prone, not to mention the maintenance burden.
>
> Teach mem_cgroup_oom to bail out on costly order requests to fix the THP
> issue as well as any other costly OOM-eligible allocations added in the
> future.
>
> Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
> Reported-by: David Rientjes <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
I also prefer this fix over having separate OOM behaviors (which is
user-visible, and not just about technical ability to satisfy the
allocation) between the allocator and memcg.
Acked-by: Johannes Weiner <[email protected]>
On Tue 03-04-18 10:58:53, Johannes Weiner wrote:
> On Wed, Mar 21, 2018 at 09:59:28PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > David has noticed that a THP memcg charge can trigger the oom killer
> > since 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> > madvised allocations"). We previously used an explicit __GFP_NORETRY,
> > which ruled out the OOM killer automagically.
> >
> > The memcg charge path should be semantically consistent with the
> > allocation path, and that means that if we do not trigger the OOM killer
> > for costly orders in the allocation path, we should not do so in the
> > memcg charge path either. Otherwise we are forcing callers to
> > distinguish the two and use different gfp masks, which is both
> > non-intuitive and bug prone, not to mention the maintenance burden.
> >
> > Teach mem_cgroup_oom to bail out on costly order requests to fix the THP
> > issue as well as any other costly OOM-eligible allocations added in the
> > future.
> >
> > Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
> > Reported-by: David Rientjes <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
>
> I also prefer this fix over having separate OOM behaviors (which is
> user-visible, and not just about technical ability to satisfy the
> allocation) between the allocator and memcg.
>
> Acked-by: Johannes Weiner <[email protected]>
I will repost the patch with the currently merged THP-specific handling
reverted (see below). While 9d3c3354bb85 might have been an appropriate
quick fix, we shouldn't keep it long term for 4.17+ IMHO.
Does your ack apply to that patch as well?
---
From 3e1b127dd6107a0e51654e90d9faef7f196f5a10 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 21 Mar 2018 10:10:37 +0100
Subject: [PATCH] memcg, thp: do not invoke oom killer on thp charges
David has noticed that a THP memcg charge can trigger the oom killer
since 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations"). We previously used an explicit __GFP_NORETRY,
which ruled out the OOM killer automagically.
The memcg charge path should be semantically consistent with the
allocation path, and that means that if we do not trigger the OOM killer
for costly orders in the allocation path, we should not do so in the
memcg charge path either. Otherwise we are forcing callers to
distinguish the two and use different gfp masks, which is both
non-intuitive and bug prone. As soon as we get a costly high-order
kmalloc user, we do not even have a means to pass a memcg-specific gfp
mask to prevent the OOM, because the charging happens deep within the
guts of the slab allocator.
The unexpected memcg OOM on THP has already been fixed upstream by
9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a
one-off fix rather than a generic solution. Teach mem_cgroup_oom to bail
out on costly order requests to fix the THP issue as well as any other
costly OOM-eligible allocations added in the future.
Also revert 9d3c3354bb85 because the special gfp mask for THP is no
longer needed.
Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Reported-by: David Rientjes <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/huge_memory.c | 5 ++---
mm/khugepaged.c | 8 ++------
mm/memcontrol.c | 2 +-
3 files changed, 5 insertions(+), 10 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2297dd9cc7c3..0cc62405de9c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,8 +555,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
VM_BUG_ON_PAGE(!PageCompound(page), page);
- if (mem_cgroup_try_charge(page, vma->vm_mm, gfp | __GFP_NORETRY, &memcg,
- true)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
@@ -1317,7 +1316,7 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
}
if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
- huge_gfp | __GFP_NORETRY, &memcg, true))) {
+ huge_gfp, &memcg, true))) {
put_page(new_page);
split_huge_pmd(vma, vmf->pmd, vmf->address);
if (page)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 214e614b62b0..b7e2268dfc9a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -960,9 +960,7 @@ static void collapse_huge_page(struct mm_struct *mm,
goto out_nolock;
}
- /* Do not oom kill for khugepaged charges */
- if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp | __GFP_NORETRY,
- &memcg, true))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out_nolock;
}
@@ -1321,9 +1319,7 @@ static void collapse_shmem(struct mm_struct *mm,
goto out;
}
- /* Do not oom kill for khugepaged charges */
- if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp | __GFP_NORETRY,
- &memcg, true))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1a917b5b7b7..08accbcd1a18 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- if (!current->memcg_may_oom)
+ if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
return;
/*
* We are in the middle of the charge context here, so we
--
2.16.3
--
Michal Hocko
SUSE Labs
On Tue, Apr 03, 2018 at 05:55:09PM +0200, Michal Hocko wrote:
> On Tue 03-04-18 10:58:53, Johannes Weiner wrote:
> > On Wed, Mar 21, 2018 at 09:59:28PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > David has noticed that a THP memcg charge can trigger the oom killer
> > > since 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> > > madvised allocations"). We previously used an explicit __GFP_NORETRY,
> > > which ruled out the OOM killer automagically.
> > >
> > > The memcg charge path should be semantically consistent with the
> > > allocation path, and that means that if we do not trigger the OOM killer
> > > for costly orders in the allocation path, we should not do so in the
> > > memcg charge path either. Otherwise we are forcing callers to
> > > distinguish the two and use different gfp masks, which is both
> > > non-intuitive and bug prone, not to mention the maintenance burden.
> > >
> > > Teach mem_cgroup_oom to bail out on costly order requests to fix the THP
> > > issue as well as any other costly OOM-eligible allocations added in the
> > > future.
> > >
> > > Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
> > > Reported-by: David Rientjes <[email protected]>
> > > Signed-off-by: Michal Hocko <[email protected]>
> >
> > I also prefer this fix over having separate OOM behaviors (which is
> > user-visible, and not just about technical ability to satisfy the
> > allocation) between the allocator and memcg.
> >
> > Acked-by: Johannes Weiner <[email protected]>
>
> I will repost the patch with the currently merged THP-specific handling
> reverted (see below). While 9d3c3354bb85 might have been an appropriate
> quick fix, we shouldn't keep it long term for 4.17+ IMHO.
>
> Does your ack apply to that patch as well?
Yep, looks good to me.
> From: Michal Hocko <[email protected]>
> Date: Wed, 21 Mar 2018 10:10:37 +0100
> Subject: [PATCH] memcg, thp: do not invoke oom killer on thp charges
>
> David has noticed that a THP memcg charge can trigger the oom killer
> since 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> madvised allocations"). We previously used an explicit __GFP_NORETRY,
> which ruled out the OOM killer automagically.
>
> The memcg charge path should be semantically consistent with the
> allocation path, and that means that if we do not trigger the OOM killer
> for costly orders in the allocation path, we should not do so in the
> memcg charge path either. Otherwise we are forcing callers to
> distinguish the two and use different gfp masks, which is both
> non-intuitive and bug prone. As soon as we get a costly high-order
> kmalloc user, we do not even have a means to pass a memcg-specific gfp
> mask to prevent the OOM, because the charging happens deep within the
> guts of the slab allocator.
>
> The unexpected memcg OOM on THP has already been fixed upstream by
> 9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a
> one-off fix rather than a generic solution. Teach mem_cgroup_oom to bail
> out on costly order requests to fix the THP issue as well as any other
> costly OOM-eligible allocations added in the future.
>
> Also revert 9d3c3354bb85 because the special gfp mask for THP is no
> longer needed.
>
> Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
> Reported-by: David Rientjes <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>