This is a tiny implementation of cgroup-aware OOM killer,
which adds an ability to kill a cgroup as a single unit
and so guarantee the integrity of the workload.
Although it has only a limited functionality in comparison
to what now resides in the mm tree (it doesn't change
the victim task selection algorithm, doesn't look
at memory stas on cgroup level, etc), it's also much
simpler and more straightforward. So, hopefully, we can
avoid having long debates here, as we had with the full
implementation.
As it doesn't prevent any futher development,
and implements an useful and complete feature,
it looks as a sane way forward.
This patchset is against Linus's tree to avoid conflicts
with the cgroup-aware OOM killer patchset in the mm tree.
Two first patches are already in the mm tree.
The first one ("mm: introduce mem_cgroup_put() helper")
is totally fine, and the second's commit message has to be
changed to reflect that it's not a part of old patchset
anymore.
Roman Gushchin (3):
mm: introduce mem_cgroup_put() helper
mm, oom: refactor oom_kill_process()
mm, oom: introduce memory.oom.group
Documentation/admin-guide/cgroup-v2.rst | 16 ++++
include/linux/memcontrol.h | 22 +++++
mm/memcontrol.c | 84 ++++++++++++++++++
mm/oom_kill.c | 152 ++++++++++++++++++++------------
4 files changed, 216 insertions(+), 58 deletions(-)
--
2.14.4
oom_kill_process() consists of two logical parts: the first one is
responsible for considering task's children as a potential victim and
printing the debug information. The second half is responsible for
sending SIGKILL to all tasks sharing the mm struct with the given victim.
This commit splits oom_kill_process() with an intention to re-use the the
second half: __oom_kill_process().
The cgroup-aware OOM killer will kill multiple tasks belonging to the
victim cgroup. We don't need to print the debug information for the each
task, as well as play with task selection (considering task's children),
so we can't use the existing oom_kill_process().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 58 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..8bded6b3205b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -835,68 +835,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
}
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
{
- struct task_struct *p = oc->chosen;
- unsigned int points = oc->chosen_points;
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t;
+ struct task_struct *p;
struct mm_struct *mm;
- unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just give it access to memory reserves
- * so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
- return;
- }
- task_unlock(p);
-
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
-
- pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
-
- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest oom_badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- read_lock(&tasklist_lock);
- for_each_thread(p, t) {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
-
- if (process_shares_mm(child, p->mm))
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child,
- oc->memcg, oc->nodemask, oc->totalpages);
- if (child_points > victim_points) {
- put_task_struct(victim);
- victim = child;
- victim_points = child_points;
- get_task_struct(victim);
- }
- }
- }
- read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -970,6 +914,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
#undef K
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+ struct task_struct *p = oc->chosen;
+ unsigned int points = oc->chosen_points;
+ struct task_struct *victim = p;
+ struct task_struct *child;
+ struct task_struct *t;
+ unsigned int victim_points = 0;
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just give it access to memory reserves
+ * so it can die quickly
+ */
+ task_lock(p);
+ if (task_will_free_mem(p)) {
+ mark_oom_victim(p);
+ wake_oom_reaper(p);
+ task_unlock(p);
+ put_task_struct(p);
+ return;
+ }
+ task_unlock(p);
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, p);
+
+ pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ read_lock(&tasklist_lock);
+ for_each_thread(p, t) {
+ list_for_each_entry(child, &t->children, sibling) {
+ unsigned int child_points;
+
+ if (process_shares_mm(child, p->mm))
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child,
+ oc->memcg, oc->nodemask, oc->totalpages);
+ if (child_points > victim_points) {
+ put_task_struct(victim);
+ victim = child;
+ victim_points = child_points;
+ get_task_struct(victim);
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ __oom_kill_process(victim);
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
2.14.4
For some workloads an intervention from the OOM killer
can be painful. Killing a random task can bring
the workload into an inconsistent state.
Historically, there are two common solutions for this
problem:
1) enabling panic_on_oom,
2) using a userspace daemon to monitor OOMs and kill
all outstanding processes.
Both approaches have their downsides:
rebooting on each OOM is an obvious waste of capacity,
and handling all in userspace is tricky and requires
a userspace agent, which will monitor all cgroups
for OOMs.
In most cases an in-kernel after-OOM cleaning-up
mechanism can eliminate the necessity of enabling
panic_on_oom. Also, it can simplify the cgroup
management for userspace applications.
This commit introduces a new knob for cgroup v2 memory
controller: memory.oom.group. The knob determines
whether the cgroup should be treated as a single
unit by the OOM killer. If set, the cgroup and its
descendants are killed together or not at all.
To determine which cgroup has to be killed, we do
traverse the cgroup hierarchy from the victim task's
cgroup up to the OOMing cgroup (or root) and looking
for the highest-level cgroup with memory.oom.group set.
Tasks with the OOM protection (oom_score_adj set to -1000)
are treated as an exception and are never killed.
This patch doesn't change the OOM victim selection algorithm.
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Tejun Heo <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 16 +++++++
include/linux/memcontrol.h | 13 +++++
mm/memcontrol.c | 84 +++++++++++++++++++++++++++++++++
mm/oom_kill.c | 29 ++++++++++++
4 files changed, 142 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d5c53b..f11de1b78cfc 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1069,6 +1069,22 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
+ memory.oom.group
+ A read-write single value file which exists on non-root
+ cgroups. The default value is "0".
+
+ Determines whether the cgroup should be treated as a single
+ unit by the OOM killer. If set, the cgroup and its descendants
+ are killed together or not at all. This can be used to avoid
+ partial kills to guarantee workload integrity.
+
+ Tasks with the OOM protection (oom_score_adj set to -1000)
+ are treated as an exception and are never killed.
+
+ If the OOM killer is invoked in a cgroup, it's not going
+ to kill any tasks outside of this cgroup, regardless
+ memory.oom.group values of ancestor cgroups.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e53e00cdbe3f..05c9c867a3ab 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -213,6 +213,11 @@ struct mem_cgroup {
*/
bool use_hierarchy;
+ /*
+ * Should the OOM killer kill all belonging tasks, had it kill one?
+ */
+ bool oom_group;
+
/* protected by memcg_oom_lock */
bool oom_lock;
int under_oom;
@@ -517,6 +522,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
}
bool mem_cgroup_oom_synchronize(bool wait);
+struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
+ struct mem_cgroup *oom_domain);
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
@@ -951,6 +958,12 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
return false;
}
+static inline struct mem_cgroup *mem_cgroup_get_oom_group(
+ struct task_struct *victim, struct mem_cgroup *oom_domain)
+{
+ return NULL;
+}
+
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
int idx)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c0280b3143e..ade832571091 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1577,6 +1577,53 @@ bool mem_cgroup_oom_synchronize(bool handle)
return true;
}
+/**
+ * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
+ * @victim: task to be killed by the OOM killer
+ * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
+ *
+ * Returns a pointer to a memory cgroup, which has to be cleaned up
+ * by killing all belonging OOM-killable tasks.
+ */
+struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
+ struct mem_cgroup *oom_domain)
+{
+ struct mem_cgroup *oom_group = NULL;
+ struct mem_cgroup *memcg;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return NULL;
+
+ if (!oom_domain)
+ oom_domain = root_mem_cgroup;
+
+ rcu_read_lock();
+
+ memcg = mem_cgroup_from_task(victim);
+ if (!memcg || memcg == root_mem_cgroup)
+ goto out;
+
+ /*
+ * Traverse the memory cgroup hierarchy from the victim task's
+ * cgroup up to the OOMing cgroup (or root) to find the
+ * highest-level memory cgroup with oom.group set.
+ */
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ if (memcg->oom_group)
+ oom_group = memcg;
+
+ if (memcg == oom_domain)
+ break;
+ }
+
+ if (oom_group)
+ css_get(&oom_group->css);
+out:
+ rcu_read_unlock();
+
+ return oom_group;
+}
+
/**
* lock_page_memcg - lock a page->mem_cgroup binding
* @page: the page
@@ -5328,6 +5375,37 @@ static int memory_stat_show(struct seq_file *m, void *v)
return 0;
}
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+ seq_printf(m, "%d\n", memcg->oom_group);
+
+ return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int ret, oom_group;
+
+ buf = strstrip(buf);
+ if (!buf)
+ return -EINVAL;
+
+ ret = kstrtoint(buf, 0, &oom_group);
+ if (ret)
+ return ret;
+
+ if (oom_group != 0 && oom_group != 1)
+ return -EINVAL;
+
+ memcg->oom_group = oom_group;
+
+ return nbytes;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -5369,6 +5447,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
+ {
+ .name = "oom.group",
+ .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_oom_group_show,
+ .write = memory_oom_group_write,
+ },
{ } /* terminate */
};
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8bded6b3205b..08f30ed5abed 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -914,6 +914,19 @@ static void __oom_kill_process(struct task_struct *victim)
}
#undef K
+/*
+ * Kill provided task unless it's secured by setting
+ * oom_score_adj to OOM_SCORE_ADJ_MIN.
+ */
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+ if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+ get_task_struct(task);
+ __oom_kill_process(task);
+ }
+ return 0;
+}
+
static void oom_kill_process(struct oom_control *oc, const char *message)
{
struct task_struct *p = oc->chosen;
@@ -921,6 +934,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t;
+ struct mem_cgroup *oom_group;
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
@@ -974,7 +988,22 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
read_unlock(&tasklist_lock);
+ /*
+ * Do we need to kill the entire memory cgroup?
+ * Or even one of the ancestor memory cgroups?
+ * Check this out before killing the victim task.
+ */
+ oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
+
__oom_kill_process(victim);
+
+ /*
+ * If necessary, kill all tasks in the selected memory cgroup.
+ */
+ if (oom_group) {
+ mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
+ mem_cgroup_put(oom_group);
+ }
}
/*
--
2.14.4
Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
---
include/linux/memcontrol.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c6fb116e925..e53e00cdbe3f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -375,6 +375,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
@@ -837,6 +842,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
--
2.14.4
On Mon, 30 Jul 2018, Roman Gushchin wrote:
> This is a tiny implementation of cgroup-aware OOM killer,
> which adds an ability to kill a cgroup as a single unit
> and so guarantee the integrity of the workload.
>
> Although it has only a limited functionality in comparison
> to what now resides in the mm tree (it doesn't change
> the victim task selection algorithm, doesn't look
> at memory stas on cgroup level, etc), it's also much
> simpler and more straightforward. So, hopefully, we can
> avoid having long debates here, as we had with the full
> implementation.
>
> As it doesn't prevent any futher development,
> and implements an useful and complete feature,
> it looks as a sane way forward.
>
> This patchset is against Linus's tree to avoid conflicts
> with the cgroup-aware OOM killer patchset in the mm tree.
>
> Two first patches are already in the mm tree.
> The first one ("mm: introduce mem_cgroup_put() helper")
> is totally fine, and the second's commit message has to be
> changed to reflect that it's not a part of old patchset
> anymore.
>
What's the plan with the cgroup aware oom killer? It has been sitting in
the -mm tree for ages with no clear path to being merged.
Are you suggesting this patchset as a preliminary series so the cgroup
aware oom killer should be removed from the -mm tree and this should be
merged instead? If so, what is the plan going forward for the cgroup
aware oom killer?
Are you planning on reviewing the patchset to fix the cgroup aware oom
killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has
been waiting for feedback since March?
On Mon 30-07-18 11:00:58, Roman Gushchin wrote:
> Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
> memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
Is there any reason for this to be a separate patch? I usually do not
like to add helpers without their users because this makes review
harder. This one is quite trivial to fit into Patch3 easilly.
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Roman Gushchin <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
> Reviewed-by: Andrew Morton <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Stephen Rothwell <[email protected]>
> ---
> include/linux/memcontrol.h | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c6fb116e925..e53e00cdbe3f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -375,6 +375,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
> return css ? container_of(css, struct mem_cgroup, css) : NULL;
> }
>
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> + css_put(&memcg->css);
> +}
> +
> #define mem_cgroup_from_counter(counter, member) \
> container_of(counter, struct mem_cgroup, member)
>
> @@ -837,6 +842,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
> return true;
> }
>
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
> static inline struct mem_cgroup *
> mem_cgroup_iter(struct mem_cgroup *root,
> struct mem_cgroup *prev,
> --
> 2.14.4
>
--
Michal Hocko
SUSE Labs
On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.
>
> Historically, there are two common solutions for this
> problem:
> 1) enabling panic_on_oom,
> 2) using a userspace daemon to monitor OOMs and kill
> all outstanding processes.
>
> Both approaches have their downsides:
> rebooting on each OOM is an obvious waste of capacity,
> and handling all in userspace is tricky and requires
> a userspace agent, which will monitor all cgroups
> for OOMs.
>
> In most cases an in-kernel after-OOM cleaning-up
> mechanism can eliminate the necessity of enabling
> panic_on_oom. Also, it can simplify the cgroup
> management for userspace applications.
>
> This commit introduces a new knob for cgroup v2 memory
> controller: memory.oom.group. The knob determines
> whether the cgroup should be treated as a single
> unit by the OOM killer. If set, the cgroup and its
> descendants are killed together or not at all.
I do not want to nit pick on wording but unit is not really a good
description. I would expect that to mean that the oom killer will
consider the unit also when selecting the task and that is not the case.
I would be more explicit about this being a single killable entity
because it forms an indivisible workload.
You can reuse http://lkml.kernel.org/r/[email protected]
if you want.
[...]
> +/**
> + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> + * @victim: task to be killed by the OOM killer
> + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> + *
> + * Returns a pointer to a memory cgroup, which has to be cleaned up
> + * by killing all belonging OOM-killable tasks.
Caller has to call mem_cgroup_put on the returned non-null memcg.
> + */
> +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> + struct mem_cgroup *oom_domain)
> +{
> + struct mem_cgroup *oom_group = NULL;
> + struct mem_cgroup *memcg;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return NULL;
> +
> + if (!oom_domain)
> + oom_domain = root_mem_cgroup;
> +
> + rcu_read_lock();
> +
> + memcg = mem_cgroup_from_task(victim);
> + if (!memcg || memcg == root_mem_cgroup)
> + goto out;
When can we have memcg == NULL? victim should be always non-NULL.
Also why do you need to special case the root_mem_cgroup here. The loop
below should handle that just fine no?
> +
> + /*
> + * Traverse the memory cgroup hierarchy from the victim task's
> + * cgroup up to the OOMing cgroup (or root) to find the
> + * highest-level memory cgroup with oom.group set.
> + */
> + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> + if (memcg->oom_group)
> + oom_group = memcg;
> +
> + if (memcg == oom_domain)
> + break;
> + }
> +
> + if (oom_group)
> + css_get(&oom_group->css);
> +out:
> + rcu_read_unlock();
> +
> + return oom_group;
> +}
> +
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8bded6b3205b..08f30ed5abed 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -914,6 +914,19 @@ static void __oom_kill_process(struct task_struct *victim)
> }
> #undef K
>
> +/*
> + * Kill provided task unless it's secured by setting
> + * oom_score_adj to OOM_SCORE_ADJ_MIN.
> + */
> +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> +{
> + if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> + get_task_struct(task);
> + __oom_kill_process(task);
> + }
> + return 0;
> +}
> +
> static void oom_kill_process(struct oom_control *oc, const char *message)
> {
> struct task_struct *p = oc->chosen;
> @@ -921,6 +934,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> struct task_struct *victim = p;
> struct task_struct *child;
> struct task_struct *t;
> + struct mem_cgroup *oom_group;
> unsigned int victim_points = 0;
> static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> DEFAULT_RATELIMIT_BURST);
> @@ -974,7 +988,22 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> }
> read_unlock(&tasklist_lock);
>
> + /*
> + * Do we need to kill the entire memory cgroup?
> + * Or even one of the ancestor memory cgroups?
> + * Check this out before killing the victim task.
> + */
> + oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> +
> __oom_kill_process(victim);
> +
> + /*
> + * If necessary, kill all tasks in the selected memory cgroup.
> + */
> + if (oom_group) {
we want a printk explaining that we are going to tear down the whole
oom_group here.
> + mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> + mem_cgroup_put(oom_group);
> + }
> }
Other than that looks good to me. My concern that the previous
implementation was more consistent because we were comparing memcgs
still holds but if there is no way forward that direction this should be
acceptable as well.
After above small things are addressed you can add
Acked-by: Michal Hocko <[email protected]>
Thanks!
--
Michal Hocko
SUSE Labs
On Tue, Jul 31, 2018 at 1:45 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 30-07-18 11:00:58, Roman Gushchin wrote:
> > Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
> > memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
>
> Is there any reason for this to be a separate patch? I usually do not
> like to add helpers without their users because this makes review
> harder. This one is quite trivial to fit into Patch3 easilly.
>
The helper function introduced in this change is also used in the
remote charging patches, so, I asked Roman to separate this change out
and thus can be merged independently.
Shakeel
> > Link: http://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Reviewed-by: Shakeel Butt <[email protected]>
> > Reviewed-by: Andrew Morton <[email protected]>
> > Cc: Shakeel Butt <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > Signed-off-by: Stephen Rothwell <[email protected]>
> > ---
> > include/linux/memcontrol.h | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 6c6fb116e925..e53e00cdbe3f 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -375,6 +375,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
> > return css ? container_of(css, struct mem_cgroup, css) : NULL;
> > }
> >
> > +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > +{
> > + css_put(&memcg->css);
> > +}
> > +
> > #define mem_cgroup_from_counter(counter, member) \
> > container_of(counter, struct mem_cgroup, member)
> >
> > @@ -837,6 +842,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
> > return true;
> > }
> >
> > +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > +{
> > +}
> > +
> > static inline struct mem_cgroup *
> > mem_cgroup_iter(struct mem_cgroup *root,
> > struct mem_cgroup *prev,
> > --
> > 2.14.4
> >
>
> --
> Michal Hocko
> SUSE Labs
On Mon, Jul 30, 2018 at 06:49:31PM -0700, David Rientjes wrote:
> On Mon, 30 Jul 2018, Roman Gushchin wrote:
>
> > This is a tiny implementation of cgroup-aware OOM killer,
> > which adds an ability to kill a cgroup as a single unit
> > and so guarantee the integrity of the workload.
> >
> > Although it has only a limited functionality in comparison
> > to what now resides in the mm tree (it doesn't change
> > the victim task selection algorithm, doesn't look
> > at memory stas on cgroup level, etc), it's also much
> > simpler and more straightforward. So, hopefully, we can
> > avoid having long debates here, as we had with the full
> > implementation.
> >
> > As it doesn't prevent any futher development,
> > and implements an useful and complete feature,
> > it looks as a sane way forward.
> >
> > This patchset is against Linus's tree to avoid conflicts
> > with the cgroup-aware OOM killer patchset in the mm tree.
> >
> > Two first patches are already in the mm tree.
> > The first one ("mm: introduce mem_cgroup_put() helper")
> > is totally fine, and the second's commit message has to be
> > changed to reflect that it's not a part of old patchset
> > anymore.
>
> What's the plan with the cgroup aware oom killer? It has been sitting in
> the -mm tree for ages with no clear path to being merged.
>
> Are you suggesting this patchset as a preliminary series so the cgroup
> aware oom killer should be removed from the -mm tree and this should be
> merged instead? If so, what is the plan going forward for the cgroup
> aware oom killer?
>
> Are you planning on reviewing the patchset to fix the cgroup aware oom
> killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has
> been waiting for feedback since March?
I would say it's been waiting for feedback since March because every
person that could give meaningful feedback on it, incl. the memcg and
cgroup maintainers, is happy with Roman's current patches in -mm, is
not convinced by the arguments you have made over months of
discussions, and has serious reservations about the configurable OOM
policies you propose as follow-up fix to Roman's patches.
This pared-down version of the patches proposed here is to accomodate
you and table discussions about policy for now. But your patchset is
highly unlikely to gain any sort of traction in the future.
Also I don't think there is any controversy here over what the way
forward should be. Roman's patches should have been merged months ago.
On Mon, Jul 30, 2018 at 06:49:31PM -0700, David Rientjes wrote:
> On Mon, 30 Jul 2018, Roman Gushchin wrote:
>
> > This is a tiny implementation of cgroup-aware OOM killer,
> > which adds an ability to kill a cgroup as a single unit
> > and so guarantee the integrity of the workload.
> >
> > Although it has only a limited functionality in comparison
> > to what now resides in the mm tree (it doesn't change
> > the victim task selection algorithm, doesn't look
> > at memory stas on cgroup level, etc), it's also much
> > simpler and more straightforward. So, hopefully, we can
> > avoid having long debates here, as we had with the full
> > implementation.
> >
> > As it doesn't prevent any futher development,
> > and implements an useful and complete feature,
> > it looks as a sane way forward.
> >
> > This patchset is against Linus's tree to avoid conflicts
> > with the cgroup-aware OOM killer patchset in the mm tree.
> >
> > Two first patches are already in the mm tree.
> > The first one ("mm: introduce mem_cgroup_put() helper")
> > is totally fine, and the second's commit message has to be
> > changed to reflect that it's not a part of old patchset
> > anymore.
> >
>
> What's the plan with the cgroup aware oom killer? It has been sitting in
> the -mm tree for ages with no clear path to being merged.
It's because your nack, isn't it?
Everybody else seem to be fine with it.
>
> Are you suggesting this patchset as a preliminary series so the cgroup
> aware oom killer should be removed from the -mm tree and this should be
> merged instead? If so, what is the plan going forward for the cgroup
> aware oom killer?
Answered below.
>
> Are you planning on reviewing the patchset to fix the cgroup aware oom
> killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has
> been waiting for feedback since March?
>
I already did.
As I said, I find the proposed oom_policy interface confusing.
I'm not sure I understand why some memcg OOMs should be handled
by memcg-aware OOMs, while other by the traditional per-process
logic; and why this should be set on the OOMing memcg.
IMO this adds nothing but confusion.
If it's just a way to get rid of mount option,
it doesn't look nice to me (neither I'm fan of the mount option).
If you need an option to evaluate a cgroup as a whole, but kill
only one task inside (the ability we've discussed before),
let's make it clear. It's possible with the new memory.oom.group.
Despite mentioning the lack of priority tuning in the list of
problems, you do not propose anything. I agree it's hard, but
why mentioning then?
Patches which adjust root memory cgroup accounting and NUMA
handling should be handled separately, they are really not
about the interface. I've nothing against them.
Again, I don't like the proposed interface, it doesn't feel
clear. I think, this is the reason, why your patchset didn't
collect any acks since March.
I'm not blocking any progress here, it's not on me.
Anyway, at this point I really think that this patch (memory.oom.group)
is a reasonable way forward. It implements a useful and complete feature,
doesn't block any further development and has a clean interface.
So, you can build memory.oom.policy on top of it.
Does this sound good?
On Tue, Jul 31, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> > For some workloads an intervention from the OOM killer
> > can be painful. Killing a random task can bring
> > the workload into an inconsistent state.
> >
> > Historically, there are two common solutions for this
> > problem:
> > 1) enabling panic_on_oom,
> > 2) using a userspace daemon to monitor OOMs and kill
> > all outstanding processes.
> >
> > Both approaches have their downsides:
> > rebooting on each OOM is an obvious waste of capacity,
> > and handling all in userspace is tricky and requires
> > a userspace agent, which will monitor all cgroups
> > for OOMs.
> >
> > In most cases an in-kernel after-OOM cleaning-up
> > mechanism can eliminate the necessity of enabling
> > panic_on_oom. Also, it can simplify the cgroup
> > management for userspace applications.
> >
> > This commit introduces a new knob for cgroup v2 memory
> > controller: memory.oom.group. The knob determines
> > whether the cgroup should be treated as a single
> > unit by the OOM killer. If set, the cgroup and its
> > descendants are killed together or not at all.
>
> I do not want to nit pick on wording but unit is not really a good
> description. I would expect that to mean that the oom killer will
> consider the unit also when selecting the task and that is not the case.
> I would be more explicit about this being a single killable entity
> because it forms an indivisible workload.
>
> You can reuse http://lkml.kernel.org/r/[email protected]
> if you want.
Ok, I'll do my best to make it clearer.
>
> [...]
> > +/**
> > + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> > + * @victim: task to be killed by the OOM killer
> > + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> > + *
> > + * Returns a pointer to a memory cgroup, which has to be cleaned up
> > + * by killing all belonging OOM-killable tasks.
>
> Caller has to call mem_cgroup_put on the returned non-null memcg.
Added.
>
> > + */
> > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > + struct mem_cgroup *oom_domain)
> > +{
> > + struct mem_cgroup *oom_group = NULL;
> > + struct mem_cgroup *memcg;
> > +
> > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > + return NULL;
> > +
> > + if (!oom_domain)
> > + oom_domain = root_mem_cgroup;
> > +
> > + rcu_read_lock();
> > +
> > + memcg = mem_cgroup_from_task(victim);
> > + if (!memcg || memcg == root_mem_cgroup)
> > + goto out;
>
> When can we have memcg == NULL? victim should be always non-NULL.
> Also why do you need to special case the root_mem_cgroup here. The loop
> below should handle that just fine no?
Idk, I prefer to keep an explicit root_mem_cgroup check,
rather than traversing the tree and relying on an inability
to set oom_group on the root.
!memcg check removed, you're right.
>
> > +
> > + /*
> > + * Traverse the memory cgroup hierarchy from the victim task's
> > + * cgroup up to the OOMing cgroup (or root) to find the
> > + * highest-level memory cgroup with oom.group set.
> > + */
> > + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > + if (memcg->oom_group)
> > + oom_group = memcg;
> > +
> > + if (memcg == oom_domain)
> > + break;
> > + }
> > +
> > + if (oom_group)
> > + css_get(&oom_group->css);
> > +out:
> > + rcu_read_unlock();
> > +
> > + return oom_group;
> > +}
> > +
> [...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 8bded6b3205b..08f30ed5abed 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -914,6 +914,19 @@ static void __oom_kill_process(struct task_struct *victim)
> > }
> > #undef K
> >
> > +/*
> > + * Kill provided task unless it's secured by setting
> > + * oom_score_adj to OOM_SCORE_ADJ_MIN.
> > + */
> > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > +{
> > + if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> > + get_task_struct(task);
> > + __oom_kill_process(task);
> > + }
> > + return 0;
> > +}
> > +
> > static void oom_kill_process(struct oom_control *oc, const char *message)
> > {
> > struct task_struct *p = oc->chosen;
> > @@ -921,6 +934,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > struct task_struct *victim = p;
> > struct task_struct *child;
> > struct task_struct *t;
> > + struct mem_cgroup *oom_group;
> > unsigned int victim_points = 0;
> > static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > DEFAULT_RATELIMIT_BURST);
> > @@ -974,7 +988,22 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > }
> > read_unlock(&tasklist_lock);
> >
> > + /*
> > + * Do we need to kill the entire memory cgroup?
> > + * Or even one of the ancestor memory cgroups?
> > + * Check this out before killing the victim task.
> > + */
> > + oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> > +
> > __oom_kill_process(victim);
> > +
> > + /*
> > + * If necessary, kill all tasks in the selected memory cgroup.
> > + */
> > + if (oom_group) {
>
> we want a printk explaining that we are going to tear down the whole
> oom_group here.
Does this looks good?
Or it's better to remove "memory." prefix?
[ 52.835327] Out of memory: Kill process 1221 (allocate) score 241 or sacrifice child
[ 52.836625] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
[ 52.841431] Tasks in /A1 are going to be killed due to memory.oom.group set
[ 52.869439] Killed process 1217 (allocate) total-vm:2052344kB, anon-rss:1704036kB, file-rss:0kB, shmem-rss:0kB
[ 52.875601] Killed process 1218 (allocate) total-vm:106668kB, anon-rss:24668kB, file-rss:0kB, shmem-rss:0kB
[ 52.882914] Killed process 1219 (allocate) total-vm:106668kB, anon-rss:21528kB, file-rss:0kB, shmem-rss:0kB
[ 52.891806] Killed process 1220 (allocate) total-vm:2257144kB, anon-rss:1984120kB, file-rss:4kB, shmem-rss:0kB
[ 52.903770] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
[ 52.905574] Killed process 1222 (allocate) total-vm:2257144kB, anon-rss:2063640kB, file-rss:0kB, shmem-rss:0kB
[ 53.202153] oom_reaper: reaped process 1222 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> > + mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> > + mem_cgroup_put(oom_group);
> > + }
> > }
>
> Other than that looks good to me. My concern that the previous
> implementation was more consistent because we were comparing memcgs
> still holds but if there is no way forward that direction this should be
> acceptable as well.
>
> After above small things are addressed you can add
> Acked-by: Michal Hocko <[email protected]>
Thank you!
On Tue 31-07-18 07:58:00, Shakeel Butt wrote:
> On Tue, Jul 31, 2018 at 1:45 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 30-07-18 11:00:58, Roman Gushchin wrote:
> > > Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
> > > memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
> >
> > Is there any reason for this to be a separate patch? I usually do not
> > like to add helpers without their users because this makes review
> > harder. This one is quite trivial to fit into Patch3 easilly.
> >
>
> The helper function introduced in this change is also used in the
> remote charging patches, so, I asked Roman to separate this change out
> and thus can be merged independently.
Ok, that was not clear from the description. Then this is ok
Acked-by: Michal Hocko <[email protected]>
> Shakeel
>
> > > Link: http://lkml.kernel.org/r/[email protected]
> > > Signed-off-by: Roman Gushchin <[email protected]>
> > > Reviewed-by: Shakeel Butt <[email protected]>
> > > Reviewed-by: Andrew Morton <[email protected]>
> > > Cc: Shakeel Butt <[email protected]>
> > > Cc: Johannes Weiner <[email protected]>
> > > Cc: Michal Hocko <[email protected]>
> > > Signed-off-by: Andrew Morton <[email protected]>
> > > Signed-off-by: Stephen Rothwell <[email protected]>
> > > ---
> > > include/linux/memcontrol.h | 9 +++++++++
> > > 1 file changed, 9 insertions(+)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 6c6fb116e925..e53e00cdbe3f 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -375,6 +375,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
> > > return css ? container_of(css, struct mem_cgroup, css) : NULL;
> > > }
> > >
> > > +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > > +{
> > > + css_put(&memcg->css);
> > > +}
> > > +
> > > #define mem_cgroup_from_counter(counter, member) \
> > > container_of(counter, struct mem_cgroup, member)
> > >
> > > @@ -837,6 +842,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
> > > return true;
> > > }
> > >
> > > +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > > +{
> > > +}
> > > +
> > > static inline struct mem_cgroup *
> > > mem_cgroup_iter(struct mem_cgroup *root,
> > > struct mem_cgroup *prev,
> > > --
> > > 2.14.4
> > >
> >
> > --
> > Michal Hocko
> > SUSE Labs
--
Michal Hocko
SUSE Labs
On Tue 31-07-18 18:14:48, Roman Gushchin wrote:
> On Tue, Jul 31, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> > > For some workloads an intervention from the OOM killer
> > > can be painful. Killing a random task can bring
> > > the workload into an inconsistent state.
> > >
> > > Historically, there are two common solutions for this
> > > problem:
> > > 1) enabling panic_on_oom,
> > > 2) using a userspace daemon to monitor OOMs and kill
> > > all outstanding processes.
> > >
> > > Both approaches have their downsides:
> > > rebooting on each OOM is an obvious waste of capacity,
> > > and handling all in userspace is tricky and requires
> > > a userspace agent, which will monitor all cgroups
> > > for OOMs.
> > >
> > > In most cases an in-kernel after-OOM cleaning-up
> > > mechanism can eliminate the necessity of enabling
> > > panic_on_oom. Also, it can simplify the cgroup
> > > management for userspace applications.
> > >
> > > This commit introduces a new knob for cgroup v2 memory
> > > controller: memory.oom.group. The knob determines
> > > whether the cgroup should be treated as a single
> > > unit by the OOM killer. If set, the cgroup and its
> > > descendants are killed together or not at all.
> >
> > I do not want to nit pick on wording but unit is not really a good
> > description. I would expect that to mean that the oom killer will
> > consider the unit also when selecting the task and that is not the case.
> > I would be more explicit about this being a single killable entity
> > because it forms an indivisible workload.
> >
> > You can reuse http://lkml.kernel.org/r/[email protected]
> > if you want.
>
> Ok, I'll do my best to make it clearer.
>
> >
> > [...]
> > > +/**
> > > + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> > > + * @victim: task to be killed by the OOM killer
> > > + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> > > + *
> > > + * Returns a pointer to a memory cgroup, which has to be cleaned up
> > > + * by killing all belonging OOM-killable tasks.
> >
> > Caller has to call mem_cgroup_put on the returned non-null memcg.
>
> Added.
>
> >
> > > + */
> > > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > > + struct mem_cgroup *oom_domain)
> > > +{
> > > + struct mem_cgroup *oom_group = NULL;
> > > + struct mem_cgroup *memcg;
> > > +
> > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > + return NULL;
> > > +
> > > + if (!oom_domain)
> > > + oom_domain = root_mem_cgroup;
> > > +
> > > + rcu_read_lock();
> > > +
> > > + memcg = mem_cgroup_from_task(victim);
> > > + if (!memcg || memcg == root_mem_cgroup)
> > > + goto out;
> >
> > When can we have memcg == NULL? victim should be always non-NULL.
> > Also why do you need to special case the root_mem_cgroup here. The loop
> > below should handle that just fine no?
>
> Idk, I prefer to keep an explicit root_mem_cgroup check,
> rather than traversing the tree and relying on an inability
> to set oom_group on the root.
I will not insist but this just makes the code harder to read.
[...]
> > > + if (oom_group) {
> >
> > we want a printk explaining that we are going to tear down the whole
> > oom_group here.
>
> Does this looks good?
> Or it's better to remove "memory." prefix?
>
> [ 52.835327] Out of memory: Kill process 1221 (allocate) score 241 or sacrifice child
> [ 52.836625] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
> [ 52.841431] Tasks in /A1 are going to be killed due to memory.oom.group set
Yes, looks good to me.
> [ 52.869439] Killed process 1217 (allocate) total-vm:2052344kB, anon-rss:1704036kB, file-rss:0kB, shmem-rss:0kB
> [ 52.875601] Killed process 1218 (allocate) total-vm:106668kB, anon-rss:24668kB, file-rss:0kB, shmem-rss:0kB
> [ 52.882914] Killed process 1219 (allocate) total-vm:106668kB, anon-rss:21528kB, file-rss:0kB, shmem-rss:0kB
> [ 52.891806] Killed process 1220 (allocate) total-vm:2257144kB, anon-rss:1984120kB, file-rss:4kB, shmem-rss:0kB
> [ 52.903770] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
> [ 52.905574] Killed process 1222 (allocate) total-vm:2257144kB, anon-rss:2063640kB, file-rss:0kB, shmem-rss:0kB
> [ 53.202153] oom_reaper: reaped process 1222 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> >
> > > + mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> > > + mem_cgroup_put(oom_group);
> > > + }
> > > }
> >
> > Other than that looks good to me. My concern that the previous
> > implementation was more consistent because we were comparing memcgs
> > still holds but if there is no way forward that direction this should be
> > acceptable as well.
> >
> > After above small things are addressed you can add
> > Acked-by: Michal Hocko <[email protected]>
>
> Thank you!
--
Michal Hocko
SUSE Labs
On Mon, Jul 30, 2018 at 11:00:59AM -0700, Roman Gushchin wrote:
> oom_kill_process() consists of two logical parts: the first one is
> responsible for considering task's children as a potential victim and
> printing the debug information. The second half is responsible for
> sending SIGKILL to all tasks sharing the mm struct with the given victim.
>
> This commit splits oom_kill_process() with an intention to re-use the the
> second half: __oom_kill_process().
>
> The cgroup-aware OOM killer will kill multiple tasks belonging to the
> victim cgroup. We don't need to print the debug information for the each
> task, as well as play with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Roman Gushchin <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
This is pretty straight-forward.
Acked-by: Johannes Weiner <[email protected]>
On Mon, Jul 30, 2018 at 11:00:58AM -0700, Roman Gushchin wrote:
> Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
> memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Roman Gushchin <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
> Reviewed-by: Andrew Morton <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Stephen Rothwell <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Wed, Aug 01, 2018 at 07:55:03AM +0200, Michal Hocko wrote:
> On Tue 31-07-18 18:14:48, Roman Gushchin wrote:
> > On Tue, Jul 31, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > > On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> > > > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > > > + struct mem_cgroup *oom_domain)
> > > > +{
> > > > + struct mem_cgroup *oom_group = NULL;
> > > > + struct mem_cgroup *memcg;
> > > > +
> > > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > + return NULL;
> > > > +
> > > > + if (!oom_domain)
> > > > + oom_domain = root_mem_cgroup;
> > > > +
> > > > + rcu_read_lock();
> > > > +
> > > > + memcg = mem_cgroup_from_task(victim);
> > > > + if (!memcg || memcg == root_mem_cgroup)
> > > > + goto out;
> > >
> > > When can we have memcg == NULL? victim should be always non-NULL.
> > > Also why do you need to special case the root_mem_cgroup here. The loop
> > > below should handle that just fine no?
> >
> > Idk, I prefer to keep an explicit root_mem_cgroup check,
> > rather than traversing the tree and relying on an inability
> > to set oom_group on the root.
>
> I will not insist but this just makes the code harder to read.
Just FYI, I'd prefer the explicit check. The loop would do the right
thing, but it's a little too implicit and subtle for my taste...
On Mon, Jul 30, 2018 at 11:01:00AM -0700, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.
>
> Historically, there are two common solutions for this
> problem:
> 1) enabling panic_on_oom,
> 2) using a userspace daemon to monitor OOMs and kill
> all outstanding processes.
>
> Both approaches have their downsides:
> rebooting on each OOM is an obvious waste of capacity,
> and handling all in userspace is tricky and requires
> a userspace agent, which will monitor all cgroups
> for OOMs.
>
> In most cases an in-kernel after-OOM cleaning-up
> mechanism can eliminate the necessity of enabling
> panic_on_oom. Also, it can simplify the cgroup
> management for userspace applications.
>
> This commit introduces a new knob for cgroup v2 memory
> controller: memory.oom.group. The knob determines
> whether the cgroup should be treated as a single
> unit by the OOM killer. If set, the cgroup and its
> descendants are killed together or not at all.
>
> To determine which cgroup has to be killed, we do
> traverse the cgroup hierarchy from the victim task's
> cgroup up to the OOMing cgroup (or root) and looking
> for the highest-level cgroup with memory.oom.group set.
>
> Tasks with the OOM protection (oom_score_adj set to -1000)
> are treated as an exception and are never killed.
>
> This patch doesn't change the OOM victim selection algorithm.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
> Cc: Tejun Heo <[email protected]>
The semantics make sense to me and the code is straight-forward. With
Michal's other feedback incorporated, please feel free to add:
Acked-by: Johannes Weiner <[email protected]>
On Tue, 31 Jul 2018, Roman Gushchin wrote:
> > What's the plan with the cgroup aware oom killer? It has been sitting in
> > the -mm tree for ages with no clear path to being merged.
>
> It's because your nack, isn't it?
> Everybody else seem to be fine with it.
>
If they are fine with it, I'm not sure they have tested it :) Killing
entire cgroups needlessly for mempolicy oom kills that will not free
memory on target nodes is the first regression they may notice. It also
unnecessarily uses oom_score_adj settings only when attached to the root
mem cgroup. That may be fine in very specialized usecases but your bash
shell being considered equal to a 96GB cgroup isn't very useful. These
are all fixed in my follow-up patch series which you say you have reviewed
later in this email.
> > Are you planning on reviewing the patchset to fix the cgroup aware oom
> > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has
> > been waiting for feedback since March?
> >
>
> I already did.
> As I said, I find the proposed oom_policy interface confusing.
> I'm not sure I understand why some memcg OOMs should be handled
> by memcg-aware OOMs, while other by the traditional per-process
> logic; and why this should be set on the OOMing memcg.
> IMO this adds nothing but confusion.
>
If your entire review was the email to a single patch, I misinterpreted
that as the entire review not being done, sorry. I volunteered to
separate out the logic to determine if a cgroup should be considered on
its own (kill the largest cgroup on the system) or whether to consider
subtree usage as well into its own tunable. I haven't received an
answer, yet, but it's a trivial patch on top of my series if you prefer.
Just let me know so we can make progress.
> it doesn't look nice to me (neither I'm fan of the mount option).
> If you need an option to evaluate a cgroup as a whole, but kill
> only one task inside (the ability we've discussed before),
> let's make it clear. It's possible with the new memory.oom.group.
>
The purpose is for subtrees delegated to users so that they can continue
to expect the same process being oom killed, with oom_score_adj
respected, even though the ancestor oom policy is for cgroup aware
targeting. It is perfectly legitimate, and necessary, for a user who
controls their own subtree to prefer killing of the single largest process
as it has always been done. Secondary to that is their ability to
influence the decision with oom_score_adj, which they lose without my
patches.
> Patches which adjust root memory cgroup accounting and NUMA
> handling should be handled separately, they are really not
> about the interface. I've nothing against them.
>
That's good to know, it would be helpful if you would ack the patches that
you are not objecting to. Your feedback about the overloading of "cgroup"
and "tree" is well received and I can easily separate that into a tunable
as I said. I do not know of any user that would want to specify "tree"
without having cgroup aware behavior, however. If you would prefer this,
please let me know!
> Anyway, at this point I really think that this patch (memory.oom.group)
> is a reasonable way forward. It implements a useful and complete feature,
> doesn't block any further development and has a clean interface.
> So, you can build memory.oom.policy on top of it.
> Does this sound good?
>
I have no objection to this series, of course. The functionality of group
oom was unchanged in my series. I'd very much appreciate a review of my
patchset, though, so the cgroup-aware policy can be merged as well.
On Wed, Aug 01, 2018 at 02:51:25PM -0700, David Rientjes wrote:
> On Tue, 31 Jul 2018, Roman Gushchin wrote:
>
> > > What's the plan with the cgroup aware oom killer? It has been sitting in
> > > the -mm tree for ages with no clear path to being merged.
> >
> > It's because your nack, isn't it?
> > Everybody else seem to be fine with it.
> >
>
> If they are fine with it, I'm not sure they have tested it :) Killing
> entire cgroups needlessly for mempolicy oom kills that will not free
> memory on target nodes is the first regression they may notice. It also
> unnecessarily uses oom_score_adj settings only when attached to the root
> mem cgroup. That may be fine in very specialized usecases but your bash
> shell being considered equal to a 96GB cgroup isn't very useful. These
> are all fixed in my follow-up patch series which you say you have reviewed
> later in this email.
>
> > > Are you planning on reviewing the patchset to fix the cgroup aware oom
> > > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has
> > > been waiting for feedback since March?
> > >
> >
> > I already did.
> > As I said, I find the proposed oom_policy interface confusing.
> > I'm not sure I understand why some memcg OOMs should be handled
> > by memcg-aware OOMs, while other by the traditional per-process
> > logic; and why this should be set on the OOMing memcg.
> > IMO this adds nothing but confusion.
> >
>
> If your entire review was the email to a single patch, I misinterpreted
> that as the entire review not being done, sorry. I volunteered to
> separate out the logic to determine if a cgroup should be considered on
> its own (kill the largest cgroup on the system) or whether to consider
> subtree usage as well into its own tunable. I haven't received an
> answer, yet, but it's a trivial patch on top of my series if you prefer.
> Just let me know so we can make progress.
>
> > it doesn't look nice to me (neither I'm fan of the mount option).
> > If you need an option to evaluate a cgroup as a whole, but kill
> > only one task inside (the ability we've discussed before),
> > let's make it clear. It's possible with the new memory.oom.group.
> >
>
> The purpose is for subtrees delegated to users so that they can continue
> to expect the same process being oom killed, with oom_score_adj
> respected, even though the ancestor oom policy is for cgroup aware
> targeting. It is perfectly legitimate, and necessary, for a user who
> controls their own subtree to prefer killing of the single largest process
> as it has always been done. Secondary to that is their ability to
> influence the decision with oom_score_adj, which they lose without my
> patches.
>
> > Patches which adjust root memory cgroup accounting and NUMA
> > handling should be handled separately, they are really not
> > about the interface. I've nothing against them.
> >
>
> That's good to know, it would be helpful if you would ack the patches that
> you are not objecting to. Your feedback about the overloading of "cgroup"
> and "tree" is well received and I can easily separate that into a tunable
> as I said. I do not know of any user that would want to specify "tree"
> without having cgroup aware behavior, however. If you would prefer this,
> please let me know!
>
> > Anyway, at this point I really think that this patch (memory.oom.group)
> > is a reasonable way forward. It implements a useful and complete feature,
> > doesn't block any further development and has a clean interface.
> > So, you can build memory.oom.policy on top of it.
> > Does this sound good?
> >
>
> I have no objection to this series, of course. The functionality of group
> oom was unchanged in my series. I'd very much appreciate a review of my
> patchset, though, so the cgroup-aware policy can be merged as well.
>
Ok, I think that what we'll do here:
1) drop the current cgroup-aware OOM killer implementation from the mm tree
2) land memory.oom.group to the mm tree (your ack will be appreciated)
3) discuss and, hopefully, agree on memory.oom.policy interface
4) land memory.oom.policy
Basically, with oom.group separated everything we need is another
boolean knob, which means that the memcg should be evaluated together.
Am I right? If so, the main problem to solve is how to handle the following
case:
A
/ \ B/memory.oom.evaluate_as_a_group* = 1
B C C/memory.oom.evaluate_as_a_group* = 0
/ \
D E * I do not propose to use this name, just for example.
In this case you have to compare tasks in C with cgroup B.
And this is what I'd like to avoid. Maybe it should be enforced on A's level?
I don't think it should be linked to the OOMing group, as in your patchset.
I would really prefer to discuss the interface first, without going
into code and implementation details code. It's not because I do not
appreciate your work, only because it's hard to think about the interface
when there are two big patchsets on top of each other.
Thank you!
On Wed 01-08-18 14:51:25, David Rientjes wrote:
> On Tue, 31 Jul 2018, Roman Gushchin wrote:
>
> > > What's the plan with the cgroup aware oom killer? It has been sitting in
> > > the -mm tree for ages with no clear path to being merged.
> >
> > It's because your nack, isn't it?
> > Everybody else seem to be fine with it.
> >
>
> If they are fine with it, I'm not sure they have tested it :) Killing
> entire cgroups needlessly for mempolicy oom kills that will not free
> memory on target nodes is the first regression they may notice.
I do not remember you would be mentioning this previously. Anyway the
older implementation has considered the nodemask in memcg_oom_badness.
You are right that a cpuset allocation could needlessly select a memcg
with small or no memory from the target nodemask which is something I
could have noticed during the review. If only I didn't have to spend all
my energy to go through repetitive arguments of yours. Anyway this would
be quite trivial to resolve in the same function by checking
node_isset(node, current->mems_allowed).
Thanks for your productive feedback again.
Skipping the rest which is yet again repeating same arguments and it
doesn't add anything new to the table.
--
Michal Hocko
SUSE Labs
On Wed, 1 Aug 2018, Roman Gushchin wrote:
> Ok, I think that what we'll do here:
> 1) drop the current cgroup-aware OOM killer implementation from the mm tree
> 2) land memory.oom.group to the mm tree (your ack will be appreciated)
> 3) discuss and, hopefully, agree on memory.oom.policy interface
> 4) land memory.oom.policy
>
Yes, I'm fine proceeding this way, there's a clear separation between the
policy and mechanism and they can be introduced independent of each other.
As I said in my patchset, we can also introduce policies independent of
each other and I have no objection to your design that addresses your
specific usecase, with your own policy decisions, with the added caveat
that we do so in a way that respects other usecases.
Specifically, I would ask that the following be respected:
- Subtrees delegated to users can still operate as they do today with
per-process selection (largest, or influenced by oom_score_adj) so
their victim selection is not changed out from under them. This
requires the entire hierarchy is not locked into a specific policy,
and also that a subtree is not locked in a specific policy. In other
words, if an oom condition occurs in a user-controlled subtree they
have the ability to get the same selection criteria as they do today.
- Policies are implemented in a way that has an extensible API so that
we do not unnecessarily limit or prohibit ourselves from making changes
in the future or from extending the functionality by introducing other
policy choices that are needed in the future.
I hope that I'm not being unrealistic in assuming that you're fine with
these since it can still preserve your goals.
> Basically, with oom.group separated everything we need is another
> boolean knob, which means that the memcg should be evaluated together.
In a cgroup-aware oom killer world, yes, we need the ability to specify
that the usage of the entire subtree should be compared as a single
entity with other cgroups. That is necessary for user subtrees but may
not be necessary for top-level cgroups depending on how you structure your
unified cgroup hierarchy. So it needs to be configurable, as you suggest,
and you are correct it can be different than oom.group.
That's not the only thing we need though, as I'm sure you were expecting
me to say :)
We need the ability to preserve existing behavior, i.e. process based and
not cgroup aware, for subtrees so that our users who have clear
expectations and tune their oom_score_adj accordingly based on how the oom
killer has always chosen processes for oom kill do not suddenly regress.
So we need to define the policy for a subtree that is oom, and I suggest
we do that as a characteristic of the cgroup that is oom ("process" vs
"cgroup", and process would be the default to preserve what currently
happens in a user subtree).
Now, as users who rely on process selection are well aware, we have
oom_score_adj to influence the decision of which process to oom kill. If
our oom subtree is cgroup aware, we should have the ability to likewise
influence that decision. For example, we have high priority applications
that run at the top-level that use a lot of memory and strictly oom
killing them in all scenarios because they use a lot of memory isn't
appropriate. We need to be able to adjust the comparison of a cgroup (or
subtree) when compared to other cgroups.
I've also suggested, but did not implement in my patchset because I was
trying to define the API and find common ground first, that we have a need
for priority based selection. In other words, define the priority of a
subtree regardless of cgroup usage.
So with these four things, we have
- an "oom.policy" tunable to define "cgroup" or "process" for that
subtree (and plans for "priority" in the future),
- your "oom.evaluate_as_group" tunable to account the usage of the
subtree as the cgroup's own usage for comparison with others,
- an "oom.adj" to adjust the usage of the cgroup (local or subtree)
to protect important applications and bias against unimportant
applications.
This adds several tunables, which I didn't like, so I tried to overload
oom.policy and oom.evaluate_as_group. When I referred to separating out
the subtree usage accounting into a separate tunable, that is what I have
referenced above.
So when a cgroup is oom, oom.policy defines the selection. The cgroup
here could be root for when the system is oom. If "process", nothing else
matters, we iterate and find the largest process (modulo oom_score_adj)
and kill it. We then look at oom.group and determine if additional
processes should be oom killed.
If "cgroup", we determine the local usage of each cgroup in the subtree.
If oom.evaluate_as_group is enabled for a cgroup, we add the usage from
each cgroup in the subtree to that cgroup. We then add oom.adj, which can
be positive or negative, for the cgroup's overall score. Each cgroup then
has a score that can be compared fairly to one another and the oom kill
can occur. We then look at oom.group and determine if additional
processes should be oom killed.
With plans for an oom.policy of "priority", I would define that priority
in oom.adj. Here, oom.evaluate_as_group can still be useful, which is
great. If smaller priorities means higher preference for oom kill, we
compare the oom.adj of all direct children and iterate the smallest. If
oom.evaluate_as_group is set, the smallest oom.adj from the subtree is
used.
This is how I envisioned the functionality of the cgroup aware oom killer
when I wrote my patchset and would be happy to hear your input or
suggestions on it.
On Mon, Aug 06, 2018 at 02:34:06PM -0700, David Rientjes wrote:
> On Wed, 1 Aug 2018, Roman Gushchin wrote:
>
> > Ok, I think that what we'll do here:
> > 1) drop the current cgroup-aware OOM killer implementation from the mm tree
> > 2) land memory.oom.group to the mm tree (your ack will be appreciated)
> > 3) discuss and, hopefully, agree on memory.oom.policy interface
> > 4) land memory.oom.policy
> >
>
> Yes, I'm fine proceeding this way, there's a clear separation between the
> policy and mechanism and they can be introduced independent of each other.
> As I said in my patchset, we can also introduce policies independent of
> each other and I have no objection to your design that addresses your
> specific usecase, with your own policy decisions, with the added caveat
> that we do so in a way that respects other usecases.
>
> Specifically, I would ask that the following be respected:
>
> - Subtrees delegated to users can still operate as they do today with
> per-process selection (largest, or influenced by oom_score_adj) so
> their victim selection is not changed out from under them. This
> requires the entire hierarchy is not locked into a specific policy,
> and also that a subtree is not locked in a specific policy. In other
> words, if an oom condition occurs in a user-controlled subtree they
> have the ability to get the same selection criteria as they do today.
>
> - Policies are implemented in a way that has an extensible API so that
> we do not unnecessarily limit or prohibit ourselves from making changes
> in the future or from extending the functionality by introducing other
> policy choices that are needed in the future.
>
> I hope that I'm not being unrealistic in assuming that you're fine with
> these since it can still preserve your goals.
>
> > Basically, with oom.group separated everything we need is another
> > boolean knob, which means that the memcg should be evaluated together.
>
> In a cgroup-aware oom killer world, yes, we need the ability to specify
> that the usage of the entire subtree should be compared as a single
> entity with other cgroups. That is necessary for user subtrees but may
> not be necessary for top-level cgroups depending on how you structure your
> unified cgroup hierarchy. So it needs to be configurable, as you suggest,
> and you are correct it can be different than oom.group.
>
> That's not the only thing we need though, as I'm sure you were expecting
> me to say :)
>
> We need the ability to preserve existing behavior, i.e. process based and
> not cgroup aware, for subtrees so that our users who have clear
> expectations and tune their oom_score_adj accordingly based on how the oom
> killer has always chosen processes for oom kill do not suddenly regress.
Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
this case? This basically means that if memcg is selected as target,
the process inside will be selected using traditional per-process approach.
> So we need to define the policy for a subtree that is oom, and I suggest
> we do that as a characteristic of the cgroup that is oom ("process" vs
> "cgroup", and process would be the default to preserve what currently
> happens in a user subtree).
I'm not entirely convinced here.
I do agree, that some sub-tree may have a well tuned oom_score_adj,
and it's preferable to keep the current behavior.
At the same time I don't like the idea to look at the policy of the OOMing
cgroup. Why exceeding of one limit should be handled different to exceeding
of another? This seems to be a property of workload, not a limit.
>
> Now, as users who rely on process selection are well aware, we have
> oom_score_adj to influence the decision of which process to oom kill. If
> our oom subtree is cgroup aware, we should have the ability to likewise
> influence that decision. For example, we have high priority applications
> that run at the top-level that use a lot of memory and strictly oom
> killing them in all scenarios because they use a lot of memory isn't
> appropriate. We need to be able to adjust the comparison of a cgroup (or
> subtree) when compared to other cgroups.
>
> I've also suggested, but did not implement in my patchset because I was
> trying to define the API and find common ground first, that we have a need
> for priority based selection. In other words, define the priority of a
> subtree regardless of cgroup usage.
>
> So with these four things, we have
>
> - an "oom.policy" tunable to define "cgroup" or "process" for that
> subtree (and plans for "priority" in the future),
>
> - your "oom.evaluate_as_group" tunable to account the usage of the
> subtree as the cgroup's own usage for comparison with others,
>
> - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> to protect important applications and bias against unimportant
> applications.
>
> This adds several tunables, which I didn't like, so I tried to overload
> oom.policy and oom.evaluate_as_group. When I referred to separating out
> the subtree usage accounting into a separate tunable, that is what I have
> referenced above.
IMO, merging multiple tunables into one doesn't make it saner.
The real question how to make a reasonable interface with fever tunables.
The reason behind introducing all these knobs is to provide
a generic solution to define OOM handling rules, but then the
question raises if the kernel is the best place for it.
I really doubt that an interface with so many knobs has any chances
to be merged.
IMO, there should be a compromise between the simplicity (basically,
the number of tunables and possible values) and functionality
of the interface. You nacked my previous version, and unfortunately
I don't have anything better so far.
Thanks!
On Mon, 6 Aug 2018, Roman Gushchin wrote:
> > In a cgroup-aware oom killer world, yes, we need the ability to specify
> > that the usage of the entire subtree should be compared as a single
> > entity with other cgroups. That is necessary for user subtrees but may
> > not be necessary for top-level cgroups depending on how you structure your
> > unified cgroup hierarchy. So it needs to be configurable, as you suggest,
> > and you are correct it can be different than oom.group.
> >
> > That's not the only thing we need though, as I'm sure you were expecting
> > me to say :)
> >
> > We need the ability to preserve existing behavior, i.e. process based and
> > not cgroup aware, for subtrees so that our users who have clear
> > expectations and tune their oom_score_adj accordingly based on how the oom
> > killer has always chosen processes for oom kill do not suddenly regress.
>
> Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> this case? This basically means that if memcg is selected as target,
> the process inside will be selected using traditional per-process approach.
>
No, that would overload the policy and mechanism. We want the ability to
consider user-controlled subtrees as a single entity for comparison with
other user subtrees to select which subtree to target. This does not
imply that users want their entire subtree oom killed.
> > So we need to define the policy for a subtree that is oom, and I suggest
> > we do that as a characteristic of the cgroup that is oom ("process" vs
> > "cgroup", and process would be the default to preserve what currently
> > happens in a user subtree).
>
> I'm not entirely convinced here.
> I do agree, that some sub-tree may have a well tuned oom_score_adj,
> and it's preferable to keep the current behavior.
>
> At the same time I don't like the idea to look at the policy of the OOMing
> cgroup. Why exceeding of one limit should be handled different to exceeding
> of another? This seems to be a property of workload, not a limit.
>
The limit is the property of the mem cgroup, so it's logical that the
policy when reaching that limit is a property of the same mem cgroup.
Using the user-controlled subtree example, if we have /david and /roman,
we can define our own policies on oom, we are not restricted to cgroup
aware selection on the entire hierarchy. /david/oom.policy can be
"process" so that I haven't regressed with earlier kernels, and
/roman/oom.policy can be "cgroup" to target the largest cgroup in your
subtree.
Something needs to be oom killed when a mem cgroup at any level in the
hierarchy is reached and reclaim has failed. What to do when that limit
is reached is a property of that cgroup.
> > Now, as users who rely on process selection are well aware, we have
> > oom_score_adj to influence the decision of which process to oom kill. If
> > our oom subtree is cgroup aware, we should have the ability to likewise
> > influence that decision. For example, we have high priority applications
> > that run at the top-level that use a lot of memory and strictly oom
> > killing them in all scenarios because they use a lot of memory isn't
> > appropriate. We need to be able to adjust the comparison of a cgroup (or
> > subtree) when compared to other cgroups.
> >
> > I've also suggested, but did not implement in my patchset because I was
> > trying to define the API and find common ground first, that we have a need
> > for priority based selection. In other words, define the priority of a
> > subtree regardless of cgroup usage.
> >
> > So with these four things, we have
> >
> > - an "oom.policy" tunable to define "cgroup" or "process" for that
> > subtree (and plans for "priority" in the future),
> >
> > - your "oom.evaluate_as_group" tunable to account the usage of the
> > subtree as the cgroup's own usage for comparison with others,
> >
> > - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> > to protect important applications and bias against unimportant
> > applications.
> >
> > This adds several tunables, which I didn't like, so I tried to overload
> > oom.policy and oom.evaluate_as_group. When I referred to separating out
> > the subtree usage accounting into a separate tunable, that is what I have
> > referenced above.
>
> IMO, merging multiple tunables into one doesn't make it saner.
> The real question how to make a reasonable interface with fever tunables.
>
> The reason behind introducing all these knobs is to provide
> a generic solution to define OOM handling rules, but then the
> question raises if the kernel is the best place for it.
>
> I really doubt that an interface with so many knobs has any chances
> to be merged.
>
This is why I attempted to overload oom.policy and oom.evaluate_as_group:
I could not think of a reasonable usecase where a subtree would be used to
account for cgroup usage but not use a cgroup aware policy itself. You've
objected to that, where memory.oom_policy == "tree" implied cgroup
awareness in my patchset, so I've separated that out.
> IMO, there should be a compromise between the simplicity (basically,
> the number of tunables and possible values) and functionality
> of the interface. You nacked my previous version, and unfortunately
> I don't have anything better so far.
>
If you do not agree with the overloading and have a preference for single
value tunables, then all three tunables are needed. This functionality
could be represented as two or one tunable if they are not single value,
but from the oom.group discussion you preferred single values.
I assume you'd also object to adding and removing files based on
oom.policy since oom.evaluate_as_group and oom.adj is only needed for
oom.policy of "cgroup" or "priority", and they do not need to exist for
the default oom.policy of "process".
On Tue 07-08-18 15:34:58, David Rientjes wrote:
> On Mon, 6 Aug 2018, Roman Gushchin wrote:
>
> > > In a cgroup-aware oom killer world, yes, we need the ability to specify
> > > that the usage of the entire subtree should be compared as a single
> > > entity with other cgroups. That is necessary for user subtrees but may
> > > not be necessary for top-level cgroups depending on how you structure your
> > > unified cgroup hierarchy. So it needs to be configurable, as you suggest,
> > > and you are correct it can be different than oom.group.
> > >
> > > That's not the only thing we need though, as I'm sure you were expecting
> > > me to say :)
> > >
> > > We need the ability to preserve existing behavior, i.e. process based and
> > > not cgroup aware, for subtrees so that our users who have clear
> > > expectations and tune their oom_score_adj accordingly based on how the oom
> > > killer has always chosen processes for oom kill do not suddenly regress.
> >
> > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > this case? This basically means that if memcg is selected as target,
> > the process inside will be selected using traditional per-process approach.
> >
>
> No, that would overload the policy and mechanism. We want the ability to
> consider user-controlled subtrees as a single entity for comparison with
> other user subtrees to select which subtree to target. This does not
> imply that users want their entire subtree oom killed.
Yeah, that's why oom.group == 0, no?
Anyway, can we separate this discussion from the current series please?
We are getting more and more tangent.
Or do you still see the current state to be not mergeable?
--
Michal Hocko
SUSE Labs
On Wed, 8 Aug 2018, Michal Hocko wrote:
> > > > In a cgroup-aware oom killer world, yes, we need the ability to specify
> > > > that the usage of the entire subtree should be compared as a single
> > > > entity with other cgroups. That is necessary for user subtrees but may
> > > > not be necessary for top-level cgroups depending on how you structure your
> > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest,
> > > > and you are correct it can be different than oom.group.
> > > >
> > > > That's not the only thing we need though, as I'm sure you were expecting
> > > > me to say :)
> > > >
> > > > We need the ability to preserve existing behavior, i.e. process based and
> > > > not cgroup aware, for subtrees so that our users who have clear
> > > > expectations and tune their oom_score_adj accordingly based on how the oom
> > > > killer has always chosen processes for oom kill do not suddenly regress.
> > >
> > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > > this case? This basically means that if memcg is selected as target,
> > > the process inside will be selected using traditional per-process approach.
> > >
> >
> > No, that would overload the policy and mechanism. We want the ability to
> > consider user-controlled subtrees as a single entity for comparison with
> > other user subtrees to select which subtree to target. This does not
> > imply that users want their entire subtree oom killed.
>
> Yeah, that's why oom.group == 0, no?
>
> Anyway, can we separate this discussion from the current series please?
> We are getting more and more tangent.
>
> Or do you still see the current state to be not mergeable?
I've said three times in this series that I am fine with it. Roman and I
are discussing the API for making forward progress with the cgroup aware
oom killer itself. When he responds, he can change the subject line if
that would be helpful to you.
On Thu 09-08-18 13:10:10, David Rientjes wrote:
> On Wed, 8 Aug 2018, Michal Hocko wrote:
>
> > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify
> > > > > that the usage of the entire subtree should be compared as a single
> > > > > entity with other cgroups. That is necessary for user subtrees but may
> > > > > not be necessary for top-level cgroups depending on how you structure your
> > > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest,
> > > > > and you are correct it can be different than oom.group.
> > > > >
> > > > > That's not the only thing we need though, as I'm sure you were expecting
> > > > > me to say :)
> > > > >
> > > > > We need the ability to preserve existing behavior, i.e. process based and
> > > > > not cgroup aware, for subtrees so that our users who have clear
> > > > > expectations and tune their oom_score_adj accordingly based on how the oom
> > > > > killer has always chosen processes for oom kill do not suddenly regress.
> > > >
> > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > > > this case? This basically means that if memcg is selected as target,
> > > > the process inside will be selected using traditional per-process approach.
> > > >
> > >
> > > No, that would overload the policy and mechanism. We want the ability to
> > > consider user-controlled subtrees as a single entity for comparison with
> > > other user subtrees to select which subtree to target. This does not
> > > imply that users want their entire subtree oom killed.
> >
> > Yeah, that's why oom.group == 0, no?
> >
> > Anyway, can we separate this discussion from the current series please?
> > We are getting more and more tangent.
> >
> > Or do you still see the current state to be not mergeable?
>
> I've said three times in this series that I am fine with it.
OK, that wasn't really clear to me because I haven't see any explicit
ack from you (well except for the trivial helper patch). So I was not
sure.
> Roman and I
> are discussing the API for making forward progress with the cgroup aware
> oom killer itself. When he responds, he can change the subject line if
> that would be helpful to you.
I do not insist of course but it would be easier to follow if that
discussion was separate.
--
Michal Hocko
SUSE Labs
Roman, have you had time to go through this?
On Tue, 7 Aug 2018, David Rientjes wrote:
> On Mon, 6 Aug 2018, Roman Gushchin wrote:
>
> > > In a cgroup-aware oom killer world, yes, we need the ability to specify
> > > that the usage of the entire subtree should be compared as a single
> > > entity with other cgroups. That is necessary for user subtrees but may
> > > not be necessary for top-level cgroups depending on how you structure your
> > > unified cgroup hierarchy. So it needs to be configurable, as you suggest,
> > > and you are correct it can be different than oom.group.
> > >
> > > That's not the only thing we need though, as I'm sure you were expecting
> > > me to say :)
> > >
> > > We need the ability to preserve existing behavior, i.e. process based and
> > > not cgroup aware, for subtrees so that our users who have clear
> > > expectations and tune their oom_score_adj accordingly based on how the oom
> > > killer has always chosen processes for oom kill do not suddenly regress.
> >
> > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > this case? This basically means that if memcg is selected as target,
> > the process inside will be selected using traditional per-process approach.
> >
>
> No, that would overload the policy and mechanism. We want the ability to
> consider user-controlled subtrees as a single entity for comparison with
> other user subtrees to select which subtree to target. This does not
> imply that users want their entire subtree oom killed.
>
> > > So we need to define the policy for a subtree that is oom, and I suggest
> > > we do that as a characteristic of the cgroup that is oom ("process" vs
> > > "cgroup", and process would be the default to preserve what currently
> > > happens in a user subtree).
> >
> > I'm not entirely convinced here.
> > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > and it's preferable to keep the current behavior.
> >
> > At the same time I don't like the idea to look at the policy of the OOMing
> > cgroup. Why exceeding of one limit should be handled different to exceeding
> > of another? This seems to be a property of workload, not a limit.
> >
>
> The limit is the property of the mem cgroup, so it's logical that the
> policy when reaching that limit is a property of the same mem cgroup.
> Using the user-controlled subtree example, if we have /david and /roman,
> we can define our own policies on oom, we are not restricted to cgroup
> aware selection on the entire hierarchy. /david/oom.policy can be
> "process" so that I haven't regressed with earlier kernels, and
> /roman/oom.policy can be "cgroup" to target the largest cgroup in your
> subtree.
>
> Something needs to be oom killed when a mem cgroup at any level in the
> hierarchy is reached and reclaim has failed. What to do when that limit
> is reached is a property of that cgroup.
>
> > > Now, as users who rely on process selection are well aware, we have
> > > oom_score_adj to influence the decision of which process to oom kill. If
> > > our oom subtree is cgroup aware, we should have the ability to likewise
> > > influence that decision. For example, we have high priority applications
> > > that run at the top-level that use a lot of memory and strictly oom
> > > killing them in all scenarios because they use a lot of memory isn't
> > > appropriate. We need to be able to adjust the comparison of a cgroup (or
> > > subtree) when compared to other cgroups.
> > >
> > > I've also suggested, but did not implement in my patchset because I was
> > > trying to define the API and find common ground first, that we have a need
> > > for priority based selection. In other words, define the priority of a
> > > subtree regardless of cgroup usage.
> > >
> > > So with these four things, we have
> > >
> > > - an "oom.policy" tunable to define "cgroup" or "process" for that
> > > subtree (and plans for "priority" in the future),
> > >
> > > - your "oom.evaluate_as_group" tunable to account the usage of the
> > > subtree as the cgroup's own usage for comparison with others,
> > >
> > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> > > to protect important applications and bias against unimportant
> > > applications.
> > >
> > > This adds several tunables, which I didn't like, so I tried to overload
> > > oom.policy and oom.evaluate_as_group. When I referred to separating out
> > > the subtree usage accounting into a separate tunable, that is what I have
> > > referenced above.
> >
> > IMO, merging multiple tunables into one doesn't make it saner.
> > The real question how to make a reasonable interface with fever tunables.
> >
> > The reason behind introducing all these knobs is to provide
> > a generic solution to define OOM handling rules, but then the
> > question raises if the kernel is the best place for it.
> >
> > I really doubt that an interface with so many knobs has any chances
> > to be merged.
> >
>
> This is why I attempted to overload oom.policy and oom.evaluate_as_group:
> I could not think of a reasonable usecase where a subtree would be used to
> account for cgroup usage but not use a cgroup aware policy itself. You've
> objected to that, where memory.oom_policy == "tree" implied cgroup
> awareness in my patchset, so I've separated that out.
>
> > IMO, there should be a compromise between the simplicity (basically,
> > the number of tunables and possible values) and functionality
> > of the interface. You nacked my previous version, and unfortunately
> > I don't have anything better so far.
> >
>
> If you do not agree with the overloading and have a preference for single
> value tunables, then all three tunables are needed. This functionality
> could be represented as two or one tunable if they are not single value,
> but from the oom.group discussion you preferred single values.
>
> I assume you'd also object to adding and removing files based on
> oom.policy since oom.evaluate_as_group and oom.adj is only needed for
> oom.policy of "cgroup" or "priority", and they do not need to exist for
> the default oom.policy of "process".
>
On Sun, Aug 19, 2018 at 04:26:50PM -0700, David Rientjes wrote:
> Roman, have you had time to go through this?
Hm, I thought we've finished this part of discussion, no?
Anyway, let me repeat my position: I don't like the interface
you've proposed in that follow-up patchset, and I explained why.
If you've a new proposal, please, rebase it to the current
mm tree, and we can discuss it separately.
Alternatively, we can discuss the interface first (without
the implementation), but, please, make a new thread with a
fresh description of a proposed interface.
Thanks!
>
>
> On Tue, 7 Aug 2018, David Rientjes wrote:
>
> > On Mon, 6 Aug 2018, Roman Gushchin wrote:
> >
> > > > In a cgroup-aware oom killer world, yes, we need the ability to specify
> > > > that the usage of the entire subtree should be compared as a single
> > > > entity with other cgroups. That is necessary for user subtrees but may
> > > > not be necessary for top-level cgroups depending on how you structure your
> > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest,
> > > > and you are correct it can be different than oom.group.
> > > >
> > > > That's not the only thing we need though, as I'm sure you were expecting
> > > > me to say :)
> > > >
> > > > We need the ability to preserve existing behavior, i.e. process based and
> > > > not cgroup aware, for subtrees so that our users who have clear
> > > > expectations and tune their oom_score_adj accordingly based on how the oom
> > > > killer has always chosen processes for oom kill do not suddenly regress.
> > >
> > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > > this case? This basically means that if memcg is selected as target,
> > > the process inside will be selected using traditional per-process approach.
> > >
> >
> > No, that would overload the policy and mechanism. We want the ability to
> > consider user-controlled subtrees as a single entity for comparison with
> > other user subtrees to select which subtree to target. This does not
> > imply that users want their entire subtree oom killed.
> >
> > > > So we need to define the policy for a subtree that is oom, and I suggest
> > > > we do that as a characteristic of the cgroup that is oom ("process" vs
> > > > "cgroup", and process would be the default to preserve what currently
> > > > happens in a user subtree).
> > >
> > > I'm not entirely convinced here.
> > > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > > and it's preferable to keep the current behavior.
> > >
> > > At the same time I don't like the idea to look at the policy of the OOMing
> > > cgroup. Why exceeding of one limit should be handled different to exceeding
> > > of another? This seems to be a property of workload, not a limit.
> > >
> >
> > The limit is the property of the mem cgroup, so it's logical that the
> > policy when reaching that limit is a property of the same mem cgroup.
> > Using the user-controlled subtree example, if we have /david and /roman,
> > we can define our own policies on oom, we are not restricted to cgroup
> > aware selection on the entire hierarchy. /david/oom.policy can be
> > "process" so that I haven't regressed with earlier kernels, and
> > /roman/oom.policy can be "cgroup" to target the largest cgroup in your
> > subtree.
> >
> > Something needs to be oom killed when a mem cgroup at any level in the
> > hierarchy is reached and reclaim has failed. What to do when that limit
> > is reached is a property of that cgroup.
> >
> > > > Now, as users who rely on process selection are well aware, we have
> > > > oom_score_adj to influence the decision of which process to oom kill. If
> > > > our oom subtree is cgroup aware, we should have the ability to likewise
> > > > influence that decision. For example, we have high priority applications
> > > > that run at the top-level that use a lot of memory and strictly oom
> > > > killing them in all scenarios because they use a lot of memory isn't
> > > > appropriate. We need to be able to adjust the comparison of a cgroup (or
> > > > subtree) when compared to other cgroups.
> > > >
> > > > I've also suggested, but did not implement in my patchset because I was
> > > > trying to define the API and find common ground first, that we have a need
> > > > for priority based selection. In other words, define the priority of a
> > > > subtree regardless of cgroup usage.
> > > >
> > > > So with these four things, we have
> > > >
> > > > - an "oom.policy" tunable to define "cgroup" or "process" for that
> > > > subtree (and plans for "priority" in the future),
> > > >
> > > > - your "oom.evaluate_as_group" tunable to account the usage of the
> > > > subtree as the cgroup's own usage for comparison with others,
> > > >
> > > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> > > > to protect important applications and bias against unimportant
> > > > applications.
> > > >
> > > > This adds several tunables, which I didn't like, so I tried to overload
> > > > oom.policy and oom.evaluate_as_group. When I referred to separating out
> > > > the subtree usage accounting into a separate tunable, that is what I have
> > > > referenced above.
> > >
> > > IMO, merging multiple tunables into one doesn't make it saner.
> > > The real question how to make a reasonable interface with fever tunables.
> > >
> > > The reason behind introducing all these knobs is to provide
> > > a generic solution to define OOM handling rules, but then the
> > > question raises if the kernel is the best place for it.
> > >
> > > I really doubt that an interface with so many knobs has any chances
> > > to be merged.
> > >
> >
> > This is why I attempted to overload oom.policy and oom.evaluate_as_group:
> > I could not think of a reasonable usecase where a subtree would be used to
> > account for cgroup usage but not use a cgroup aware policy itself. You've
> > objected to that, where memory.oom_policy == "tree" implied cgroup
> > awareness in my patchset, so I've separated that out.
> >
> > > IMO, there should be a compromise between the simplicity (basically,
> > > the number of tunables and possible values) and functionality
> > > of the interface. You nacked my previous version, and unfortunately
> > > I don't have anything better so far.
> > >
> >
> > If you do not agree with the overloading and have a preference for single
> > value tunables, then all three tunables are needed. This functionality
> > could be represented as two or one tunable if they are not single value,
> > but from the oom.group discussion you preferred single values.
> >
> > I assume you'd also object to adding and removing files based on
> > oom.policy since oom.evaluate_as_group and oom.adj is only needed for
> > oom.policy of "cgroup" or "priority", and they do not need to exist for
> > the default oom.policy of "process".
> >