2017-06-01 18:37:01

by Roman Gushchin

Subject: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument

The select_bad_process() function will later be used
to select a process to kill in the victim cgroup.
This cgroup doesn't necessarily match oc->memcg,
which is the cgroup whose limits caused the cgroup-wide OOM
(or NULL in case of a global OOM).

So, refactor select_bad_process() to take a pointer to
a cgroup to iterate over as an argument.
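
E.g. a later patch in the series can first pick a victim cgroup and only
then pick a task inside it, roughly like this (illustrative sketch only;
oom_select_victim_memcg() is a hypothetical helper introduced later):

/*
 * Sketch of a future caller: oom_select_victim_memcg() is a hypothetical
 * helper; the point is only that the second argument doesn't have to be
 * oc->memcg.
 */
struct mem_cgroup *victim_memcg;

victim_memcg = oom_select_victim_memcg(oc);
select_bad_process(oc, victim_memcg ? victim_memcg : oc->memcg);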

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/oom_kill.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..f8b0fb1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -343,10 +343,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
  * Simple selection loop. We choose the process with the highest number of
  * 'points'. In case scan was aborted, oc->chosen is set to -1.
  */
-static void select_bad_process(struct oom_control *oc)
+static void select_bad_process(struct oom_control *oc,
+				struct mem_cgroup *memcg)
 {
-	if (is_memcg_oom(oc))
-		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	if (memcg)
+		mem_cgroup_scan_tasks(memcg, oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -1032,7 +1033,7 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	select_bad_process(oc);
+	select_bad_process(oc, oc->memcg);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
--
2.7.4


2017-06-04 19:26:05

by Vladimir Davydov

Subject: Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument

On Thu, Jun 01, 2017 at 07:35:09PM +0100, Roman Gushchin wrote:
> The select_bad_process() function will later be used
> to select a process to kill in the victim cgroup.
> This cgroup doesn't necessarily match oc->memcg,
> which is the cgroup whose limits caused the cgroup-wide OOM
> (or NULL in case of a global OOM).
>
> So, refactor select_bad_process() to take a pointer to
> a cgroup to iterate over as an argument.

IMHO this patch, as well as patches 2-5, doesn't deserve to be submitted
separately: none of them make sense as a separate change; worse, patches
4 and 5 introduce user API that doesn't do anything without patch 6. All
of the changes are relatively small and singling them out doesn't really
facilitate review, so I'd merge them all in patch 6.

2017-06-04 22:52:08

by David Rientjes

Subject: Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument

We use a heavily modified system and memcg oom killer and I'm wondering
if there is some opportunity for collaboration because we may have some
shared goals.

I can summarize how we currently use the oom killer at a high level so
that it is not overwhelming with implementation details and give some
rationale for why we have converged onto this strategy over the period of
a few years.

For victim selection, we have strict priority based oom killing both at
the memcg level and the process level.

Each process has its own "badness" value that is independent of
oom_score_adj, although some conversion is done if a third-party thread
chooses to disable itself from oom killing for backwards compatibility.
Lower values are more preferable victims, but that detail should not
matter significantly. If two processes share the same badness value,
tiebreaks are done by selecting the largest rss.
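
For concreteness, the per-process comparison is roughly the following
sketch (the badness field and the rss helper are illustrative names only,
not our exact implementation):

/*
 * Illustrative sketch only: "badness" and task_rss() are made-up names.
 * Lower badness is a more preferable victim; ties go to the larger rss.
 */
static bool prefer_as_victim(struct task_struct *a, struct task_struct *b)
{
	if (a->signal->badness != b->signal->badness)
		return a->signal->badness < b->signal->badness;
	return task_rss(a) > task_rss(b);
}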

Each memcg in a hierarchy also has its own badness value which
semantically means the same as the per-process value, although it
considers the entire memcg as a unit, similar to your approach, when
iterating the hierarchy to choose a process. The benefit of the
per-memcg and per-process approach is that you can kill the lowest
priority process from the lowest priority memcg.

The above scoring is enabled with a VM sysctl for the system and is used
for both system (global) and memcg oom kills. For system overcommit,
this means we can kill the lowest priority job on the system; for memcg,
we can allow users to define their oom kill priorities at each level of
their own hierarchy.

When the system or root of an oom memcg hierarchy encounters its limit,
we iterate each level of the memcg hierarchy to find the lowest priority
job. This is done by comparing the badness of the sibling memcgs at
each level, finding the lowest, and iterating that subtree. If there are
lower priority processes per the per-process badness value compared to
all sibling memcgs, that process is killed.
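
In rough sketch form, the walk looks like this (everything below is
illustrative: memcg_badness(), for_each_child_memcg() and pick_victim_task()
are made-up names standing in for our out-of-tree code):

/*
 * Illustrative sketch of the hierarchy walk; helper names are made up.
 * At each level, pick the sibling memcg with the lowest badness and
 * descend into it; tasks attached directly at a level also compete with
 * the sibling memcgs (omitted here for brevity). The task inside the
 * final memcg is chosen with a comparison like prefer_as_victim() above.
 */
static struct task_struct *select_victim(struct mem_cgroup *root)
{
	struct mem_cgroup *pos = root;

	for (;;) {
		struct mem_cgroup *child, *lowest = NULL;

		for_each_child_memcg(child, pos)
			if (!lowest || memcg_badness(child) < memcg_badness(lowest))
				lowest = child;

		if (!lowest)
			break;		/* leaf memcg reached */
		pos = lowest;
	}

	return pick_victim_task(pos);
}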

We also have complete userspace oom handling support. This complements
the existing memory.oom_control notification when a memcg is oom with a
separate notifier that notifies when the kernel has oom killed a process.
It is possible to delay the oom killer from killing a process for memcg
oom kills with a configurable, per-memcg, oom delay. If set, the kernel
will wait for userspace to respond to its oom notification and effect its
own policy decisions until memory is uncharged from that memcg hierarchy,
or the oom delay expires. If the oom delay expires, the kernel oom
killer kills a process based on badness.
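
Roughly (illustrative only; oom_delay_expires is a made-up per-memcg field
used just for this sketch):

/*
 * Illustrative sketch: oom_delay_expires is a made-up per-memcg field
 * holding the jiffies value at which the configured oom delay runs out.
 * While it is in the future, the kernel holds off and lets userspace act.
 */
static bool memcg_oom_delay_active(struct mem_cgroup *memcg)
{
	if (!memcg->oom_delay_expires)
		return false;	/* no delay configured */
	return time_before(jiffies, memcg->oom_delay_expires);
}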

Our oom kill notification file used to get an fd to register with
cgroup.event_control also provides oom kill statistics based on system,
memcg, local, hierarchical, and user-induced oom kills when read().

We also have a convenient "kill all" knob that userspace can write when
handling oom conditions to iterate all threads attached to a particular
memcg and kill them. This is merely to prevent races where userspace
does the oom killing itself, which is not problematic in itself, but
additional tasks continue to be attached to an oom memcg.
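
The kernel side of that knob is small because mem_cgroup_scan_tasks()
already exists; roughly (the callback and the write-handler wiring are
illustrative only):

/*
 * Illustrative sketch: SIGKILL every task charged to the memcg hierarchy.
 * mem_cgroup_scan_tasks() is the existing iterator (also used by the oom
 * killer); the callback name is made up, and hooking this up to a cgroup
 * control file is omitted.
 */
static int kill_memcg_task(struct task_struct *task, void *arg)
{
	send_sig(SIGKILL, task, 0);
	return 0;	/* keep iterating over all tasks */
}

static void memcg_kill_all_tasks(struct mem_cgroup *memcg)
{
	mem_cgroup_scan_tasks(memcg, kill_memcg_task, NULL);
}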

A caveat here is that we also support fully inclusive kmem accounting to
memcg hierarchies, so we call the oom killer as part of the memcg charge
path rather than only upon returning from fault with VM_FAULT_OOM. We
have our own oom killer livelock detection that isn't necessarily
important in this thread, but we haven't encountered a situation where we
livelock by calling the oom killer during charge, and this is a
requirement for memcg charging as part of slab allocation.

I could post many patches to implement all of this functionality that we
have used for a few years, but I first wanted to send this email to see
if there is any common ground or to better understand your methodology
for using the kernel oom killer for both system and memcg oom kills.

Otherwise, very interesting stuff!

2017-06-06 16:21:01

by Roman Gushchin

Subject: Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument

On Sun, Jun 04, 2017 at 03:50:37PM -0700, David Rientjes wrote:
> We use a heavily modified system and memcg oom killer and I'm wondering
> if there is some opportunity for collaboration because we may have some
> shared goals.
>
> I can summarize how we currently use the oom killer at a high level so
> that it is not overwhelming with implementation details and give some
> rationale for why we have converged onto this strategy over the period of
> a few years.
>
> For victim selection, we have strict priority based oom killing both at
> the memcg level and the process level.
>
> Each process has its own "badness" value that is independent of
> oom_score_adj, although some conversion is done if a third-party thread
> chooses to disable itself from oom killing for backwards compatibility.
> Lower values are more preferable victims, but that detail should not
> matter significantly. If two processes share the same badness value,
> tiebreaks are done by selecting the largest rss.
>
> Each memcg in a hierarchy also has its own badness value which
> semantically means the same as the per-process value, although it
> considers the entire memcg as a unit, similar to your approach, when
> iterating the hierarchy to choose a process. The benefit of the
> per-memcg and per-process approach is that you can kill the lowest
> priority process from the lowest priority memcg.
>
> The above scoring is enabled with a VM sysctl for the system and is used
> for both system (global) and memcg oom kills. For system overcommit,
> this means we can kill the lowest priority job on the system; for memcg,
> we can allow users to define their oom kill priorities at each level of
> their own hierarchy.
>
> When the system or root of an oom memcg hierarchy encounters its limit,
> we iterate each level of the memcg hierarchy to find the lowest priority
> job. This is done by comparing the badness of the sibling memcgs at
> each level, finding the lowest, and iterating that subtree. If there are
> lower priority processes per the per-process badness value compared to
> all sibling memcgs, that process is killed.
>
> We also have complete userspace oom handling support. This complements
> the existing memory.oom_control notification when a memcg is oom with a
> separate notifier that notifies when the kernel has oom killed a process.
> It is possible to delay the oom killer from killing a process for memcg
> oom kills with a configurable, per-memcg, oom delay. If set, the kernel
> will wait for userspace to respond to its oom notification and effect its
> own policy decisions until memory is uncharged from that memcg hierarchy,
> or the oom delay expires. If the oom delay expires, the kernel oom
> killer kills a process based on badness.
>
> Our oom kill notification file used to get an fd to register with
> cgroup.event_control also provides oom kill statistics based on system,
> memcg, local, hierarchical, and user-induced oom kills when read().
>
> We also have a convenient "kill all" knob that userspace can write when
> handling oom conditions to iterate all threads attached to a particular
> memcg and kill them. This is merely to prevent races where userspace
> does the oom killing itself, which is not problematic in itself, but
> additional tasks continue to be attached to an oom memcg.
>
> A caveat here is that we also support fully inclusive kmem accounting to
> memcg hierarchies, so we call the oom killer as part of the memcg charge
> path rather than only upon returning from fault with VM_FAULT_OOM. We
> have our own oom killer livelock detection that isn't necessarily
> important in this thread, but we haven't encountered a situation where we
> livelock by calling the oom killer during charge, and this is a
> requirement for memcg charging as part of slab allocation.
>
> I could post many patches to implement all of this functionality that we
> have used for a few years, but I first wanted to send this email to see
> if there is any common ground or to better understand your methodology
> for using the kernel oom killer for both system and memcg oom kills.
>
> Otherwise, very interesting stuff!

Hi David!

Thank you for sharing this!

It's very interesting, and it looks like it's not that far
from what I've suggested.

So we definitely need to come up with some common solution.

Thanks!

Roman

2017-06-06 20:42:34

by David Rientjes

Subject: Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument

On Tue, 6 Jun 2017, Roman Gushchin wrote:

> Hi David!
>
> Thank you for sharing this!
>
> It's very interesting, and it looks like it's not that far
> from what I've suggested.
>
> So we definitely need to come up with some common solution.
>

Hi Roman,

Yes, definitely. I could post a series of patches to do everything that
was listed in my email sans the fully inclusive kmem accounting, which may
be pursued at a later date, if it would be helpful to see where there is
common ground?

Another question is what you think about userspace oom handling? We
implement our own oom kill policies in userspace for both the system and
for user-controlled memcg hierarchies because it often does not match the
kernel implementation and there is some action that can be taken other
than killing a process. Have you tried to implement functionality to do
userspace oom handling, or are you considering it? This is the main
motivation behind allowing an oom delay to be configured.

2017-06-08 16:01:02

by Roman Gushchin

Subject: Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument

On Tue, Jun 06, 2017 at 01:42:29PM -0700, David Rientjes wrote:
> On Tue, 6 Jun 2017, Roman Gushchin wrote:
>
> > Hi David!
> >
> > Thank you for sharing this!
> >
> > It's very interesting, and it looks like it's not that far
> > from what I've suggested.
> >
> > So we definitely need to come up with some common solution.
> >
>

Hi David,

> Yes, definitely. I could post a series of patches to do everything that
> was listed in my email sans the fully inclusive kmem accounting, which may
> be pursued at a later date, if it would be helpful to see where there is
> common ground?
>
> Another question is what you think about userspace oom handling? We
> implement our own oom kill policies in userspace for both the system and
> for user-controlled memcg hierarchies because it often does not match the
> kernel implementation and there is some action that can be taken other
> than killing a process. Have you tried to implement functionality to do
> userspace oom handling, or are you considering it? This is the main
> motivation behind allowing an oom delay to be configured.

The cgroup v2 memory controller is built on the idea of preventing OOMs
by using the memory.high limit. This allows a userspace app to get notified
before an OOM happens (by watching the memory.events control file), so there is
(hopefully) no need for things like an oom delay.
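
For example, a userspace daemon can watch memory.events with poll() roughly
like this (a minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup
and the cgroup of interest is called "job"; error handling is mostly omitted):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[512];
	ssize_t n;
	int fd = open("/sys/fs/cgroup/job/memory.events", O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	if (fd < 0)
		return 1;

	/* the initial read arms the notification; the file has to be
	 * re-read from the beginning each time poll() reports a change */
	n = read(fd, buf, sizeof(buf) - 1);

	while (poll(&pfd, 1, -1) > 0) {
		lseek(fd, 0, SEEK_SET);
		n = read(fd, buf, sizeof(buf) - 1);
		if (n <= 0)
			break;
		buf[n] = '\0';
		/* counters include "low", "high", "max" and "oom" */
		printf("memory.events changed:\n%s", buf);
	}
	close(fd);
	return 0;
}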

Actually, I'm trying to implement some minimal functionality in the kernel
that will simplify the userspace part of the job and make it more consistent.
But, of course, the main goal of the patchset is to fix the unfairness
of the current victim selection.

Thanks!

Roman