2018-08-02 00:33:50

by Roman Gushchin

Subject: [PATCH v2 0/3] introduce memory.oom.group

This is a tiny implementation of a cgroup-aware OOM killer,
which adds the ability to kill a cgroup as a single unit
and so guarantee the integrity of the workload.

Although it has only limited functionality in comparison
to what now resides in the mm tree (it doesn't change
the victim task selection algorithm, doesn't look
at memory stats on the cgroup level, etc), it's also much
simpler and more straightforward. So, hopefully, we can
avoid the long debates here that we had with the full
implementation.

As it doesn't prevent any further development,
and implements a useful and complete feature,
it looks like a sane way forward.

v1->v2:
- added dmesg message about killing all tasks in cgroup
- removed an unnecessary check for memcg being NULL pointer
- adjusted docs and commit message
- rebased to linus/master

--

This patchset is against Linus's tree to avoid conflicts
with the cgroup-aware OOM killer patchset in the mm tree.
It's intended to replace it.

The first two patches are already in the mm tree.
The first one ("mm: introduce mem_cgroup_put() helper")
is fine as-is.
The commit message of the second one has to be changed to reflect
that it's no longer part of the old patchset.

Roman Gushchin (3):
mm: introduce mem_cgroup_put() helper
mm, oom: refactor oom_kill_process()
mm, oom: introduce memory.oom.group

Documentation/admin-guide/cgroup-v2.rst | 18 ++++
include/linux/memcontrol.h | 27 ++++++
mm/memcontrol.c | 93 +++++++++++++++++++
mm/oom_kill.c | 153 ++++++++++++++++++++------------
4 files changed, 233 insertions(+), 58 deletions(-)

--
2.14.4



2018-08-02 00:33:57

by Roman Gushchin

Subject: [PATCH v2 2/3] mm, oom: refactor oom_kill_process()

oom_kill_process() consists of two logical parts: the first one is
responsible for considering the task's children as potential victims and
printing the debug information. The second part is responsible for
sending SIGKILL to all tasks sharing the mm struct with the given victim.

This commit splits oom_kill_process() with the intention to re-use the
second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks belonging to the
victim cgroup. We don't need to print the debug information for each
task, or to play with task selection (considering the task's children),
so we can't use the existing oom_kill_process().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..8bded6b3205b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -835,68 +835,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
}

-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
{
- struct task_struct *p = oc->chosen;
- unsigned int points = oc->chosen_points;
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t;
+ struct task_struct *p;
struct mm_struct *mm;
- unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;

- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just give it access to memory reserves
- * so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
- return;
- }
- task_unlock(p);
-
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
-
- pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
-
- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest oom_badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- read_lock(&tasklist_lock);
- for_each_thread(p, t) {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
-
- if (process_shares_mm(child, p->mm))
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child,
- oc->memcg, oc->nodemask, oc->totalpages);
- if (child_points > victim_points) {
- put_task_struct(victim);
- victim = child;
- victim_points = child_points;
- get_task_struct(victim);
- }
- }
- }
- read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -970,6 +914,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
#undef K

+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+ struct task_struct *p = oc->chosen;
+ unsigned int points = oc->chosen_points;
+ struct task_struct *victim = p;
+ struct task_struct *child;
+ struct task_struct *t;
+ unsigned int victim_points = 0;
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just give it access to memory reserves
+ * so it can die quickly
+ */
+ task_lock(p);
+ if (task_will_free_mem(p)) {
+ mark_oom_victim(p);
+ wake_oom_reaper(p);
+ task_unlock(p);
+ put_task_struct(p);
+ return;
+ }
+ task_unlock(p);
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, p);
+
+ pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ read_lock(&tasklist_lock);
+ for_each_thread(p, t) {
+ list_for_each_entry(child, &t->children, sibling) {
+ unsigned int child_points;
+
+ if (process_shares_mm(child, p->mm))
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child,
+ oc->memcg, oc->nodemask, oc->totalpages);
+ if (child_points > victim_points) {
+ put_task_struct(victim);
+ victim = child;
+ victim_points = child_points;
+ get_task_struct(victim);
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ __oom_kill_process(victim);
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
2.14.4


2018-08-02 00:34:08

by Roman Gushchin

Subject: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

For some workloads an intervention from the OOM killer
can be painful. Killing a random task can bring
the workload into an inconsistent state.

Historically, there are two common solutions for this
problem:
1) enabling panic_on_oom,
2) using a userspace daemon to monitor OOMs and kill
all outstanding processes.

Both approaches have their downsides:
rebooting on each OOM is an obvious waste of capacity,
and handling it all in userspace is tricky and requires
a userspace agent that monitors all cgroups
for OOMs.

In most cases an in-kernel post-OOM cleanup
mechanism can eliminate the need for enabling
panic_on_oom. Also, it can simplify cgroup
management for userspace applications.

This commit introduces a new knob for cgroup v2 memory
controller: memory.oom.group. The knob determines
whether the cgroup should be treated as an indivisible
workload by the OOM killer. If set, all tasks belonging
to the cgroup or to its descendants (if the memory cgroup
is not a leaf cgroup) are killed together or not at all.

To determine which cgroup has to be killed, we
traverse the cgroup hierarchy from the victim task's
cgroup up to the OOMing cgroup (or root), looking
for the highest-level cgroup with memory.oom.group set.

Tasks with OOM protection (oom_score_adj set to -1000)
are treated as an exception and are never killed.

This patch doesn't change the OOM victim selection algorithm.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Tejun Heo <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 18 +++++++
include/linux/memcontrol.h | 18 +++++++
mm/memcontrol.c | 93 +++++++++++++++++++++++++++++++++
mm/oom_kill.c | 30 +++++++++++
4 files changed, 159 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d5c53b..7b4364962fbb 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1069,6 +1069,24 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.

+ memory.oom.group
+ A read-write single value file which exists on non-root
+ cgroups. The default value is "0".
+
+ Determines whether the cgroup should be treated as
+ an indivisible workload by the OOM killer. If set,
+ all tasks belonging to the cgroup or to its descendants
+ (if the memory cgroup is not a leaf cgroup) are killed
+ together or not at all. This can be used to avoid
+ partial kills to guarantee workload integrity.
+
+ Tasks with the OOM protection (oom_score_adj set to -1000)
+ are treated as an exception and are never killed.
+
+ If the OOM killer is invoked in a cgroup, it's not going
+ to kill any tasks outside of this cgroup, regardless of
+ the memory.oom.group values of ancestor cgroups.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e53e00cdbe3f..5b26ab460565 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -213,6 +213,11 @@ struct mem_cgroup {
*/
bool use_hierarchy;

+ /*
+ * Should the OOM killer kill all tasks in this cgroup if it kills one?
+ */
+ bool oom_group;
+
/* protected by memcg_oom_lock */
bool oom_lock;
int under_oom;
@@ -517,6 +522,9 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
}

bool mem_cgroup_oom_synchronize(bool wait);
+struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
+ struct mem_cgroup *oom_domain);
+void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);

#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
@@ -951,6 +959,16 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
return false;
}

+static inline struct mem_cgroup *mem_cgroup_get_oom_group(
+ struct task_struct *victim, struct mem_cgroup *oom_domain)
+{
+ return NULL;
+}
+
+static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
+{
+}
+
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
int idx)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c0280b3143e..23045398ad21 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1577,6 +1577,62 @@ bool mem_cgroup_oom_synchronize(bool handle)
return true;
}

+/**
+ * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
+ * @victim: task to be killed by the OOM killer
+ * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
+ *
+ * Returns a pointer to a memory cgroup, which has to be cleaned up
+ * by killing all belonging OOM-killable tasks.
+ *
+ * Caller has to call mem_cgroup_put() on the returned non-NULL memcg.
+ */
+struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
+ struct mem_cgroup *oom_domain)
+{
+ struct mem_cgroup *oom_group = NULL;
+ struct mem_cgroup *memcg;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return NULL;
+
+ if (!oom_domain)
+ oom_domain = root_mem_cgroup;
+
+ rcu_read_lock();
+
+ memcg = mem_cgroup_from_task(victim);
+ if (memcg == root_mem_cgroup)
+ goto out;
+
+ /*
+ * Traverse the memory cgroup hierarchy from the victim task's
+ * cgroup up to the OOMing cgroup (or root) to find the
+ * highest-level memory cgroup with oom.group set.
+ */
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ if (memcg->oom_group)
+ oom_group = memcg;
+
+ if (memcg == oom_domain)
+ break;
+ }
+
+ if (oom_group)
+ css_get(&oom_group->css);
+out:
+ rcu_read_unlock();
+
+ return oom_group;
+}
+
+void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
+{
+ pr_info("Tasks in ");
+ pr_cont_cgroup_path(memcg->css.cgroup);
+ pr_cont(" are going to be killed due to memory.oom.group set\n");
+}
+
/**
* lock_page_memcg - lock a page->mem_cgroup binding
* @page: the page
@@ -5328,6 +5384,37 @@ static int memory_stat_show(struct seq_file *m, void *v)
return 0;
}

+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+ seq_printf(m, "%d\n", memcg->oom_group);
+
+ return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int ret, oom_group;
+
+ buf = strstrip(buf);
+ if (!buf)
+ return -EINVAL;
+
+ ret = kstrtoint(buf, 0, &oom_group);
+ if (ret)
+ return ret;
+
+ if (oom_group != 0 && oom_group != 1)
+ return -EINVAL;
+
+ memcg->oom_group = oom_group;
+
+ return nbytes;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -5369,6 +5456,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
+ {
+ .name = "oom.group",
+ .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_oom_group_show,
+ .write = memory_oom_group_write,
+ },
{ } /* terminate */
};

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8bded6b3205b..f10eb301f6bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -914,6 +914,19 @@ static void __oom_kill_process(struct task_struct *victim)
}
#undef K

+/*
+ * Kill provided task unless it's secured by setting
+ * oom_score_adj to OOM_SCORE_ADJ_MIN.
+ */
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+ if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+ get_task_struct(task);
+ __oom_kill_process(task);
+ }
+ return 0;
+}
+
static void oom_kill_process(struct oom_control *oc, const char *message)
{
struct task_struct *p = oc->chosen;
@@ -921,6 +934,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t;
+ struct mem_cgroup *oom_group;
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
@@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
read_unlock(&tasklist_lock);

+ /*
+ * Do we need to kill the entire memory cgroup?
+ * Or even one of the ancestor memory cgroups?
+ * Check this out before killing the victim task.
+ */
+ oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
+
__oom_kill_process(victim);
+
+ /*
+ * If necessary, kill all tasks in the selected memory cgroup.
+ */
+ if (oom_group) {
+ mem_cgroup_print_oom_group(oom_group);
+ mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
+ mem_cgroup_put(oom_group);
+ }
}

/*
--
2.14.4


2018-08-02 00:35:03

by Roman Gushchin

Subject: [PATCH v2 1/3] mm: introduce mem_cgroup_put() helper

Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
---
include/linux/memcontrol.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c6fb116e925..e53e00cdbe3f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -375,6 +375,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}

+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)

@@ -837,6 +842,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}

+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
--
2.14.4


2018-08-02 00:37:57

by Stephen Rothwell

Subject: Re: [PATCH v2 1/3] mm: introduce mem_cgroup_put() helper

Hi Roman,

On Wed, 1 Aug 2018 17:31:59 -0700 Roman Gushchin <[email protected]> wrote:
>
> Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
> memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Roman Gushchin <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
> Reviewed-by: Andrew Morton <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Stephen Rothwell <[email protected]>

I have no idea why my Signed-off-by is attached to this patch (or
Andrew's for that matter) ...

--
Cheers,
Stephen Rothwell



2018-08-02 00:48:28

by Shakeel Butt

Subject: Re: [PATCH v2 1/3] mm: introduce mem_cgroup_put() helper

On Wed, Aug 1, 2018 at 5:37 PM Stephen Rothwell <[email protected]> wrote:
>
> Hi Roman,
>
> On Wed, 1 Aug 2018 17:31:59 -0700 Roman Gushchin <[email protected]> wrote:
> >
> > Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
> > memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
> >
> > Link: http://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Reviewed-by: Shakeel Butt <[email protected]>
> > Reviewed-by: Andrew Morton <[email protected]>
> > Acked-by: Johannes Weiner <[email protected]>
> > Acked-by: Michal Hocko <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > Signed-off-by: Stephen Rothwell <[email protected]>
>
> I have no idea why my Signed-off-by is attached to this patch (or
> Andrew's for that matter) ...
>

Roman might have picked this patch from linux-next and sent as it is.

Shakeel

2018-08-02 00:52:06

by Roman Gushchin

Subject: Re: [PATCH v2 1/3] mm: introduce mem_cgroup_put() helper

On Thu, Aug 02, 2018 at 10:36:48AM +1000, Stephen Rothwell wrote:
> Hi Roman,
>
> On Wed, 1 Aug 2018 17:31:59 -0700 Roman Gushchin <[email protected]> wrote:
> >
> > Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
> > memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
> >
> > Link: http://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Reviewed-by: Shakeel Butt <[email protected]>
> > Reviewed-by: Andrew Morton <[email protected]>
> > Acked-by: Johannes Weiner <[email protected]>
> > Acked-by: Michal Hocko <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > Signed-off-by: Stephen Rothwell <[email protected]>
>
> I have no idea why my Signed-off-by is attached to this patch (or
> Andrew's for that matter) ...

Hi Stephen!

I've cherry-picked this patch from the next tree,
so it got your signed-off-by.

Sorry for that!

Thanks!

Roman

2018-08-02 10:54:29

by Tetsuo Handa

Subject: Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

On 2018/08/02 9:32, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.
>
> Historically, there are two common solutions for this
> problem:
> 1) enabling panic_on_oom,
> 2) using a userspace daemon to monitor OOMs and kill
> all outstanding processes.
>
> Both approaches have their downsides:
> rebooting on each OOM is an obvious waste of capacity,
> and handling all in userspace is tricky and requires
> a userspace agent, which will monitor all cgroups
> for OOMs.

We could start a one-time userspace agent which handles
a cgroup OOM event and then terminates...



> +/**
> + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> + * @victim: task to be killed by the OOM killer
> + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> + *
> + * Returns a pointer to a memory cgroup, which has to be cleaned up
> + * by killing all belonging OOM-killable tasks.
> + *
> + * Caller has to call mem_cgroup_put() on the returned non-NULL memcg.
> + */
> +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> + struct mem_cgroup *oom_domain)
> +{
> + struct mem_cgroup *oom_group = NULL;
> + struct mem_cgroup *memcg;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return NULL;
> +
> + if (!oom_domain)
> + oom_domain = root_mem_cgroup;
> +
> + rcu_read_lock();
> +
> + memcg = mem_cgroup_from_task(victim);

Isn't this racy? I guess that memcg of this "victim" can change to
somewhere else from the one as of determining the final candidate.
This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().
This "victim" might be moving to a memcg which is different from the one
determining the final candidate.

> + if (memcg == root_mem_cgroup)
> + goto out;
> +
> + /*
> + * Traverse the memory cgroup hierarchy from the victim task's
> + * cgroup up to the OOMing cgroup (or root) to find the
> + * highest-level memory cgroup with oom.group set.
> + */
> + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> + if (memcg->oom_group)
> + oom_group = memcg;
> +
> + if (memcg == oom_domain)
> + break;
> + }
> +
> + if (oom_group)
> + css_get(&oom_group->css);
> +out:
> + rcu_read_unlock();
> +
> + return oom_group;
> +}



> @@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> }
> read_unlock(&tasklist_lock);
>
> + /*
> + * Do we need to kill the entire memory cgroup?
> + * Or even one of the ancestor memory cgroups?
> + * Check this out before killing the victim task.
> + */
> + oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> +
> __oom_kill_process(victim);
> +
> + /*
> + * If necessary, kill all tasks in the selected memory cgroup.
> + */
> + if (oom_group) {

Isn't "killing a child process of the biggest memory hog" and "killing all
processes which belongs to a memcg which the child process of the biggest
memory hog belongs to" strange? The intent of selecting a child is to try
to minimize lost work while the intent of oom_cgroup is to try to discard
all work. If oom_cgroup is enabled, I feel that we should

pr_err("%s: Kill all processes in ", message);
pr_cont_cgroup_path(memcg->css.cgroup);
pr_cont(" due to memory.oom.group set\n");

without

pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", message, task_pid_nr(p), p->comm, points);

(I mean, don't try to select a child).

> + mem_cgroup_print_oom_group(oom_group);
> + mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> + mem_cgroup_put(oom_group);
> + }
> }
>
> /*


2018-08-02 11:22:24

by Michal Hocko

Subject: Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

On Thu 02-08-18 19:53:13, Tetsuo Handa wrote:
> On 2018/08/02 9:32, Roman Gushchin wrote:
[...]
> > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > + struct mem_cgroup *oom_domain)
> > +{
> > + struct mem_cgroup *oom_group = NULL;
> > + struct mem_cgroup *memcg;
> > +
> > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > + return NULL;
> > +
> > + if (!oom_domain)
> > + oom_domain = root_mem_cgroup;
> > +
> > + rcu_read_lock();
> > +
> > + memcg = mem_cgroup_from_task(victim);
>
> Isn't this racy? I guess that memcg of this "victim" can change to
> somewhere else from the one as of determining the final candidate.

How is this any different from the existing code? We select a victim and
then kill it. The victim might move away and won't be part of the oom
memcg anymore but we will still kill it. I do not remember this ever
being a problem. Migration is a privileged operation. If you lose this
restriction you shouldn't allow moving outside of the oom domain.

> This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().

Why does this matter? The victim hasn't been killed yet so if it exits
on its own I do not think we really have to tear the whole cgroup down.

> This "victim" might be moving to a memcg which is different from the one
> determining the final candidate.
>
> > + if (memcg == root_mem_cgroup)
> > + goto out;
> > +
> > + /*
> > + * Traverse the memory cgroup hierarchy from the victim task's
> > + * cgroup up to the OOMing cgroup (or root) to find the
> > + * highest-level memory cgroup with oom.group set.
> > + */
> > + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > + if (memcg->oom_group)
> > + oom_group = memcg;
> > +
> > + if (memcg == oom_domain)
> > + break;
> > + }
> > +
> > + if (oom_group)
> > + css_get(&oom_group->css);
> > +out:
> > + rcu_read_unlock();
> > +
> > + return oom_group;
> > +}
>
>
>
> > @@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > }
> > read_unlock(&tasklist_lock);
> >
> > + /*
> > + * Do we need to kill the entire memory cgroup?
> > + * Or even one of the ancestor memory cgroups?
> > + * Check this out before killing the victim task.
> > + */
> > + oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> > +
> > __oom_kill_process(victim);
> > +
> > + /*
> > + * If necessary, kill all tasks in the selected memory cgroup.
> > + */
> > + if (oom_group) {
>
> Isn't "killing a child process of the biggest memory hog" and "killing all
> processes which belongs to a memcg which the child process of the biggest
> memory hog belongs to" strange? The intent of selecting a child is to try
> to minimize lost work while the intent of oom_cgroup is to try to discard
> all work. If oom_cgroup is enabled, I feel that we should
>
> pr_err("%s: Kill all processes in ", message);
> pr_cont_cgroup_path(memcg->css.cgroup);
> pr_cont(" due to memory.oom.group set\n");
>
> without
>
> pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", message, task_pid_nr(p), p->comm, points);
>
> (I mean, don't try to select a child).

Well, the child can belong into a different memcg. Whether the heuristic
to pick up the child is sensible is another question and I do not think
it is related to this patchset. The code works as intended, albeit being
questionable.
--
Michal Hocko
SUSE Labs

2018-08-02 11:55:05

by Tetsuo Handa

Subject: Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

On 2018/08/02 20:21, Michal Hocko wrote:
> On Thu 02-08-18 19:53:13, Tetsuo Handa wrote:
>> On 2018/08/02 9:32, Roman Gushchin wrote:
> [...]
>>> +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
>>> + struct mem_cgroup *oom_domain)
>>> +{
>>> + struct mem_cgroup *oom_group = NULL;
>>> + struct mem_cgroup *memcg;
>>> +
>>> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>>> + return NULL;
>>> +
>>> + if (!oom_domain)
>>> + oom_domain = root_mem_cgroup;
>>> +
>>> + rcu_read_lock();
>>> +
>>> + memcg = mem_cgroup_from_task(victim);
>>
>> Isn't this racy? I guess that memcg of this "victim" can change to
>> somewhere else from the one as of determining the final candidate.
>
> How is this any different from the existing code? We select a victim and
> then kill it. The victim might move away and won't be part of the oom
> memcg anymore but we will still kill it. I do not remember this ever
> being a problem. Migration is a privileged operation. If you lose this
> restriction you shouldn't allow moving outside of the oom domain.

The existing code kills one process (plus other processes sharing its mm, if any).
But oom_cgroup kills multiple processes. Thus, whether we made the decision based
on the correct memcg becomes important.

>
>> This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().
>
> Why does this matter? The victim hasn't been killed yet so if it exits
> on its own I do not think we really have to tear the whole cgroup down.

The existing code does not send SIGKILL if find_lock_task_mm() failed. Who can
guarantee that the victim is not already inside do_exit() when this code is executed?


2018-08-02 12:25:51

by Michal Hocko

Subject: Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

On Thu 02-08-18 20:53:14, Tetsuo Handa wrote:
> On 2018/08/02 20:21, Michal Hocko wrote:
> > On Thu 02-08-18 19:53:13, Tetsuo Handa wrote:
> >> On 2018/08/02 9:32, Roman Gushchin wrote:
> > [...]
> >>> +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> >>> + struct mem_cgroup *oom_domain)
> >>> +{
> >>> + struct mem_cgroup *oom_group = NULL;
> >>> + struct mem_cgroup *memcg;
> >>> +
> >>> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> >>> + return NULL;
> >>> +
> >>> + if (!oom_domain)
> >>> + oom_domain = root_mem_cgroup;
> >>> +
> >>> + rcu_read_lock();
> >>> +
> >>> + memcg = mem_cgroup_from_task(victim);
> >>
> >> Isn't this racy? I guess that memcg of this "victim" can change to
> >> somewhere else from the one as of determining the final candidate.
> >
> > How is this any different from the existing code? We select a victim and
> > then kill it. The victim might move away and won't be part of the oom
> > memcg anymore but we will still kill it. I do not remember this ever
> > being a problem. Migration is a privileged operation. If you lose this
> > restriction you shouldn't allow moving outside of the oom domain.
>
> The existing code kills one process (plus other processes sharing its mm, if any).
> But oom_cgroup kills multiple processes. Thus, whether we made the decision based
> on the correct memcg becomes important.

Yes but a proper configuration should already mitigate the harm because
you shouldn't be able to migrate the task outside of the oom domain.
        A (oom.group = 1)
       / \
      B   C

moving a task between B and C should be harmless, while moving it out of
the A subtree completely is a dubious configuration.

> >> This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().
> >
> > Why does this matter? The victim hasn't been killed yet so if it exists
> > by its own I do not think we really have to tear the whole cgroup down.
>
> The existing code does not send SIGKILL if find_lock_task_mm() failed. Who can
> guarantee that the victim is not already inside do_exit() when this code is executed?

I do not follow. Why does this matter at all?

--
Michal Hocko
SUSE Labs

2018-08-02 18:10:58

by Roman Gushchin

Subject: Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

On Thu, Aug 02, 2018 at 07:53:13PM +0900, Tetsuo Handa wrote:
> On 2018/08/02 9:32, Roman Gushchin wrote:
> > For some workloads an intervention from the OOM killer
> > can be painful. Killing a random task can bring
> > the workload into an inconsistent state.
> >
> > Historically, there are two common solutions for this
> > problem:
> > 1) enabling panic_on_oom,
> > 2) using a userspace daemon to monitor OOMs and kill
> > all outstanding processes.
> >
> > Both approaches have their downsides:
> > rebooting on each OOM is an obvious waste of capacity,
> > and handling all in userspace is tricky and requires
> > a userspace agent, which will monitor all cgroups
> > for OOMs.
>
> We could start a one-time userspace agent which handles
> a cgroup OOM event and then terminates...

That might not be so trivial if there is a shortage of memory.

>
>
>
> > +/**
> > + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> > + * @victim: task to be killed by the OOM killer
> > + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> > + *
> > + * Returns a pointer to a memory cgroup, which has to be cleaned up
> > + * by killing all belonging OOM-killable tasks.
> > + *
> > + * Caller has to call mem_cgroup_put() on the returned non-NULL memcg.
> > + */
> > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > + struct mem_cgroup *oom_domain)
> > +{
> > + struct mem_cgroup *oom_group = NULL;
> > + struct mem_cgroup *memcg;
> > +
> > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > + return NULL;
> > +
> > + if (!oom_domain)
> > + oom_domain = root_mem_cgroup;
> > +
> > + rcu_read_lock();
> > +
> > + memcg = mem_cgroup_from_task(victim);
>
> Isn't this racy? I guess that memcg of this "victim" can change to
> somewhere else from the one as of determining the final candidate.
> This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().
> This "victim" might be moving to a memcg which is different from the one
> determining the final candidate.

It is, as well as _all_ OOM handling code.
E.g. what if a user sets oom_score_adj to -1000 at the last moment?

It really doesn't matter: the OOM killer should guarantee
forward progress without making too stupid decisions.
It doesn't provide any strict guarantees and really
shouldn't.

>
> > + if (memcg == root_mem_cgroup)
> > + goto out;
> > +
> > + /*
> > + * Traverse the memory cgroup hierarchy from the victim task's
> > + * cgroup up to the OOMing cgroup (or root) to find the
> > + * highest-level memory cgroup with oom.group set.
> > + */
> > + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > + if (memcg->oom_group)
> > + oom_group = memcg;
> > +
> > + if (memcg == oom_domain)
> > + break;
> > + }
> > +
> > + if (oom_group)
> > + css_get(&oom_group->css);
> > +out:
> > + rcu_read_unlock();
> > +
> > + return oom_group;
> > +}
>
>
>
> > @@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > }
> > read_unlock(&tasklist_lock);
> >
> > + /*
> > + * Do we need to kill the entire memory cgroup?
> > + * Or even one of the ancestor memory cgroups?
> > + * Check this out before killing the victim task.
> > + */
> > + oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> > +
> > __oom_kill_process(victim);
> > +
> > + /*
> > + * If necessary, kill all tasks in the selected memory cgroup.
> > + */
> > + if (oom_group) {
>
> Isn't "killing a child process of the biggest memory hog" and "killing all
> processes which belong to a memcg which the child process of the biggest
> memory hog belongs to" strange? The intent of selecting a child is to try
> to minimize lost work, while the intent of oom_cgroup is to try to discard
> all work. If oom_cgroup is enabled, I feel that we should
>
> pr_err("%s: Kill all processes in ", message);
> pr_cont_cgroup_path(memcg->css.cgroup);
> pr_cont(" due to memory.oom.group set\n");
>
> without
>
> pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", message, task_pid_nr(p), p->comm, points);
>
> (I mean, don't try to select a child).

We can do this optimization, but I would be careful about changing the
dmesg output format. Although it was never part of the ABI, I wonder
how many users are relying on something like a "kill process [0-9]+ or
sacrifice child" regexp?

Thanks!

2018-08-02 18:50:57

by Tejun Heo

Subject: Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

On Wed, Aug 01, 2018 at 05:32:01PM -0700, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.

For patches 1-3,

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2018-08-06 21:52:47

by David Rientjes

Subject: Re: [PATCH v2 1/3] mm: introduce mem_cgroup_put() helper

On Wed, 1 Aug 2018, Roman Gushchin wrote:

> Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
> memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Roman Gushchin <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
> Reviewed-by: Andrew Morton <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Stephen Rothwell <[email protected]>

Acked-by: David Rientjes <[email protected]>