2017-09-04 14:21:57

by Roman Gushchin

Subject: [v7 0/5] cgroup-aware OOM killer

This patchset makes the OOM killer cgroup-aware.

v7:
- __oom_kill_process() drops reference to the victim task
- oom_score_adj -1000 is always respected
- Renamed oom_kill_all to oom_group
- Dropped oom_prio range, converted from short to int
- Added a cgroup v2 mount option to disable cgroup-aware OOM killer
- Docs updated
- Rebased on top of mmotm

v6:
- Renamed oom_control.chosen to oom_control.chosen_task
- Renamed oom_kill_all_tasks to oom_kill_all
- Per-node NR_SLAB_UNRECLAIMABLE accounting
- Several minor fixes and cleanups
- Docs updated

v5:
- Rebased on top of Michal Hocko's patches, which have changed the
way OOM victims get access to the memory reserves.
Dropped the corresponding part of this patchset
- Separated the oom_kill_process() splitting into a standalone commit
- Added debug output (suggested by David Rientjes)
- Some minor fixes

v4:
- Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
- Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
- Memcg victim selection code is reworked, and
synchronization is based on finding tasks with the OOM victim marker,
rather than on a global counter
- Debug output is dropped
- Refactored TIF_MEMDIE usage

v3:
- Merged commits 1-4 into 6
- Separated oom_score_adj logic and debug output into separate commits
- Fixed swap accounting

v2:
- Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
- "Kill all tasks" is now an opt-in option, by default
only one process will be killed
- Added per-cgroup oom_score_adj
- Refined oom score calculations, suggested by Vladimir Davydov
- Converted to a patchset

v1:
https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Roman Gushchin (5):
mm, oom: refactor the oom_kill_process() function
mm, oom: cgroup-aware OOM killer
mm, oom: introduce oom_priority for memory cgroups
mm, oom, docs: describe the cgroup-aware OOM killer
mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

Documentation/admin-guide/kernel-parameters.txt | 1 +
Documentation/cgroup-v2.txt | 56 +++++
include/linux/memcontrol.h | 36 +++
include/linux/oom.h | 12 +-
mm/memcontrol.c | 293 ++++++++++++++++++++++++
mm/oom_kill.c | 209 +++++++++++------
6 files changed, 537 insertions(+), 70 deletions(-)

--
2.13.5


2017-09-04 14:22:09

by Roman Gushchin

Subject: [v7 3/5] mm, oom: introduce oom_priority for memory cgroups

Introduce a per-memory-cgroup oom_priority setting: an integer number,
which defines the order in which the OOM killer selects victim memory
cgroups.

The OOM killer prefers memory cgroups with higher priority if they
are populated with eligible tasks.

The oom_priority value is compared within sibling cgroups.

If two or more sibling cgroups have the same oom_priority,
the decision is based on their memory footprint.

The root cgroup has the oom_priority 0, which cannot be changed.
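The comparison rule above (highest oom_priority among siblings, memory footprint as a tie-breaker) can be sketched as a userspace toy model; struct memcg here is a hypothetical stand-in carrying only the two fields select_victim_memcg() compares:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for struct mem_cgroup: only the two
 * fields the sibling comparison looks at. */
struct memcg {
	int oom_priority;	/* memory.oom_priority */
	long oom_score;		/* cached memory footprint estimate */
};

/* Mirror of the selection step: among siblings, the highest
 * oom_priority wins; ties fall back to the larger oom_score.
 * Returns the index of the victim, or -1 if no sibling is
 * eligible (score == 0 means empty or non-eligible). */
static int pick_sibling(const struct memcg *c, int n)
{
	int best = -1, prio = INT_MIN;
	long score = LONG_MIN;
	int i;

	for (i = 0; i < n; i++) {
		if (c[i].oom_score == 0)	/* skip empty/non-eligible */
			continue;
		if (c[i].oom_priority > prio ||
		    (c[i].oom_priority == prio && c[i].oom_score > score)) {
			best = i;
			prio = c[i].oom_priority;
			score = c[i].oom_score;
		}
	}
	return best;
}
```

So with siblings {prio 0, score 500}, {prio 10, score 100}, {prio 10, score 300} and an empty {prio 20, score 0} cgroup, the prio-20 sibling is skipped as non-eligible, and the tie between the two prio-10 siblings is broken by the larger footprint.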

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 3 +++
mm/memcontrol.c | 49 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b5c2b89968e..73a0291948fd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -206,6 +206,9 @@ struct mem_cgroup {
/* cached OOM score */
long oom_score;

+ /* OOM killer priority */
+ int oom_priority;
+
/* handle for "memory.events" */
struct cgroup_file events_file;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 97813c56163b..d7dd293897ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2757,6 +2757,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
for (;;) {
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg = NULL;
+ int prio = INT_MIN;
long score = LONG_MIN;

css_for_each_child(css, &root->css) {
@@ -2768,7 +2769,12 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (iter->oom_score == 0)
continue;

- if (iter->oom_score > score) {
+ if (iter->oom_priority > prio) {
+ memcg = iter;
+ prio = iter->oom_priority;
+ score = iter->oom_score;
+ } else if (iter->oom_priority == prio &&
+ iter->oom_score > score) {
memcg = iter;
score = iter->oom_score;
}
@@ -2838,7 +2844,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
* For system-wide OOMs we should consider tasks in the root cgroup
* with oom_score larger than oc->chosen_points.
*/
- if (!oc->memcg) {
+ if (!oc->memcg && !(oc->chosen_memcg &&
+ oc->chosen_memcg->oom_priority > 0)) {
+ /*
+ * Root memcg has priority 0, so if chosen memcg has lower
+ * priority, any task in root cgroup is preferable.
+ */
+ if (oc->chosen_memcg && oc->chosen_memcg->oom_priority < 0)
+ oc->chosen_points = 0;
+
select_victim_root_cgroup_task(oc);

if (oc->chosen_task && oc->chosen_memcg) {
@@ -5480,6 +5494,31 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}

+static int memory_oom_priority_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+ seq_printf(m, "%d\n", memcg->oom_priority);
+
+ return 0;
+}
+
+static ssize_t memory_oom_priority_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_priority;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_priority);
+ if (err)
+ return err;
+
+ memcg->oom_priority = oom_priority;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5606,6 +5645,12 @@ static struct cftype memory_files[] = {
.write = memory_oom_group_write,
},
{
+ .name = "oom_priority",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_priority_show,
+ .write = memory_oom_priority_write,
+ },
+ {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
--
2.13.5

2017-09-04 14:22:17

by Roman Gushchin

Subject: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

The introduction of the cgroup-aware OOM killer changes the default
victim selection algorithm: instead of picking the largest process,
it picks the largest memcg and then the largest process inside it.

This affects only cgroup v2 users.

To provide a way to use cgroups v2 if the old OOM victim selection
algorithm is preferred for some reason, the nogroupoom mount option
is added.

If set, the OOM selection is performed in a "traditional" per-process
way. Both oom_priority and oom_group memcg knobs are ignored.
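For reference, here is a userspace sketch of the token parsing wired up in cgroup_memory() below. Note that, despite the subject line, the knob is a token of the cgroup.memory= kernel command-line parameter, combined with the existing nosocket and nokmem tokens by commas (e.g. cgroup.memory=nosocket,nogroupoom); the kernel code uses strsep(), strtok() is substituted here for portability:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace mirror of the cgroup_memory() boot
 * parameter parser: split the cgroup.memory= value on commas
 * and set the matching flags. */
struct cgroup_memory_flags {
	int nosocket, nokmem, nogroupoom;
};

static void parse_cgroup_memory(char *s, struct cgroup_memory_flags *f)
{
	char *token;

	/* strtok() stands in for the kernel's strsep() loop */
	for (token = strtok(s, ","); token; token = strtok(NULL, ",")) {
		if (!strcmp(token, "nosocket"))
			f->nosocket = 1;
		if (!strcmp(token, "nokmem"))
			f->nokmem = 1;
		if (!strcmp(token, "nogroupoom"))
			f->nogroupoom = 1;
	}
}
```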

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/admin-guide/kernel-parameters.txt | 1 +
mm/memcontrol.c | 8 ++++++++
2 files changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 28f1a0f84456..07891f1030aa 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -489,6 +489,7 @@
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
+ nogroupoom -- Disable cgroup-aware OOM killer.

checkreqprot [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7dd293897ca..6a8235dc41f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,9 @@ static bool cgroup_memory_nosocket;
/* Kernel memory accounting disabled? */
static bool cgroup_memory_nokmem;

+/* Cgroup-aware OOM disabled? */
+static bool cgroup_memory_nogroupoom;
+
/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
int do_swap_account __read_mostly;
@@ -2822,6 +2825,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (mem_cgroup_disabled())
return false;

+ if (cgroup_memory_nogroupoom)
+ return false;
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;

@@ -6188,6 +6194,8 @@ static int __init cgroup_memory(char *s)
cgroup_memory_nosocket = true;
if (!strcmp(token, "nokmem"))
cgroup_memory_nokmem = true;
+ if (!strcmp(token, "nogroupoom"))
+ cgroup_memory_nogroupoom = true;
}
return 0;
}
--
2.13.5

2017-09-04 14:22:06

by Roman Gushchin

Subject: [v7 4/5] mm, oom, docs: describe the cgroup-aware OOM killer

Update cgroups v2 docs.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/cgroup-v2.txt | 56 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index a86f3cb88125..5d21bd2e7d55 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -44,6 +44,7 @@ CONTENTS
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+ 5-2-4. OOM Killer
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -799,6 +800,33 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.

+ memory.oom_group
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ If set, the OOM killer will kill all processes attached to the cgroup
+ if it is selected as an OOM victim.
+
+ The OOM killer respects the /proc/pid/oom_score_adj value -1000,
+ and will never kill an unkillable task, even if memory.oom_group
+ is set.
+
+ memory.oom_priority
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ An integer number, which defines the order in which
+ the OOM killer selects victim memory cgroups.
+
+ The OOM killer prefers memory cgroups with higher priority if they
+ are populated with eligible tasks.
+
+ The oom_priority value is compared within sibling cgroups.
+
+ The root cgroup has the oom_priority 0, which cannot be changed.
+
memory.events

A read-only flat-keyed file which exists on non-root cgroups.
@@ -1028,6 +1056,34 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.


+5-2-4. OOM Killer
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which treats memory cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for a cgroup with the
+largest oom_priority. If sibling cgroups have the same priority,
+the OOM killer selects one which is the largest memory consumer.
+
+By default, the OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup oom_group option. If set, it causes the OOM killer
+to kill all processes attached to the cgroup.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (memory cgroups and
+other tasks in root cgroup).
+The root cgroup doesn't support the oom_group feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer uses the "traditional" process-based approach.
+
+
5-3. IO

The "io" controller regulates the distribution of IO resources. This
--
2.13.5

2017-09-04 14:23:04

by Roman Gushchin

Subject: [v7 1/5] mm, oom: refactor the oom_kill_process() function

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children
as potential victims and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention to re-use the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, or to play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().
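The shape of the split can be illustrated with a toy userspace model (hypothetical names, tasks reduced to an id and a badness score; the real code compares children's oom_badness() scores): the policy wrapper keeps the "sacrifice a child" heuristic, while the mechanism half unconditionally kills whatever task it is handed, so a cgroup-aware caller can invoke the mechanism once per task:

```c
#include <assert.h>

/* Toy model of the refactor: "policy" (child selection) is kept
 * in the wrapper, "mechanism" (the kill) is split out. */
struct task {
	int id;
	unsigned int badness;
};

static int killed[8];
static int nkilled;

/* Mechanism: unconditionally "kill" one task (the second half of
 * the original oom_kill_process()). */
static void kill_process(const struct task *t)
{
	killed[nkilled++] = t->id;
}

/* Policy wrapper: sacrifice the worst child if one is worse than
 * the chosen victim, then delegate to the mechanism. */
static void kill_process_or_child(const struct task *victim,
				  const struct task *children, int n)
{
	const struct task *t = victim;
	int i;

	for (i = 0; i < n; i++)
		if (children[i].badness > t->badness)
			t = &children[i];
	kill_process(t);
}
```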

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 99736e026712..f061b627092c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -804,68 +804,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
}

-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
{
- struct task_struct *p = oc->chosen;
- unsigned int points = oc->chosen_points;
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t;
+ struct task_struct *p;
struct mm_struct *mm;
- unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;

- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just give it access to memory reserves
- * so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
- return;
- }
- task_unlock(p);
-
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
-
- pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
-
- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest oom_badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- read_lock(&tasklist_lock);
- for_each_thread(p, t) {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
-
- if (process_shares_mm(child, p->mm))
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child,
- oc->memcg, oc->nodemask, oc->totalpages);
- if (child_points > victim_points) {
- put_task_struct(victim);
- victim = child;
- victim_points = child_points;
- get_task_struct(victim);
- }
- }
- }
- read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -939,6 +883,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
#undef K

+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+ struct task_struct *p = oc->chosen;
+ unsigned int points = oc->chosen_points;
+ struct task_struct *victim = p;
+ struct task_struct *child;
+ struct task_struct *t;
+ unsigned int victim_points = 0;
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just give it access to memory reserves
+ * so it can die quickly
+ */
+ task_lock(p);
+ if (task_will_free_mem(p)) {
+ mark_oom_victim(p);
+ wake_oom_reaper(p);
+ task_unlock(p);
+ put_task_struct(p);
+ return;
+ }
+ task_unlock(p);
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, p);
+
+ pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ read_lock(&tasklist_lock);
+ for_each_thread(p, t) {
+ list_for_each_entry(child, &t->children, sibling) {
+ unsigned int child_points;
+
+ if (process_shares_mm(child, p->mm))
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child,
+ oc->memcg, oc->nodemask, oc->totalpages);
+ if (child_points > victim_points) {
+ put_task_struct(victim);
+ victim = child;
+ victim_points = child_points;
+ get_task_struct(victim);
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ __oom_kill_process(victim);
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
2.13.5

2017-09-04 14:23:02

by Roman Gushchin

Subject: [v7 2/5] mm, oom: cgroup-aware OOM killer

Traditionally, the OOM killer operates at the process level:
under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in a case of a system-wide OOM.

3) Per-process oom_score_adj affects global OOM, so it's a breach
of isolation.

To address these issues, a cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory consumer
and free memory by killing the corresponding task(s). The difference
from the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers, as well as single processes.

By default, it will look for the biggest leaf cgroup, and kill
the largest task inside.

But a user can change this behavior by enabling the per-cgroup
oom_group option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. If it is
selected as an OOM victim, all tasks belonging to it will be killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_group feature.
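The two-level selection described above can be sketched as a userspace toy model (hypothetical struct leaf, flattened to the bits that matter: a badness score per leaf cgroup and the per-cgroup kill-all knob):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened model of a leaf memory cgroup. */
struct leaf {
	long score;	/* memcg_oom_badness() estimate, 0 = not eligible */
	int oom_group;	/* per-cgroup kill-all knob */
	int ntasks;
};

/* Step 1: pick the biggest eligible leaf cgroup.
 * Step 2: report how many tasks would die there: all of them if
 * oom_group is set, otherwise just the single largest task. */
static int victims_in_biggest_leaf(const struct leaf *l, int n)
{
	const struct leaf *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (l[i].score == 0)
			continue;
		if (!best || l[i].score > best->score)
			best = &l[i];
	}
	if (!best)
		return 0;
	return best->oom_group ? best->ntasks : 1;
}
```

With leaves {score 100}, {score 400, oom_group set, 3 tasks} and an empty one, the second leaf is chosen and all three of its tasks are counted as victims.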

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 33 +++++++
include/linux/oom.h | 12 ++-
mm/memcontrol.c | 240 +++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 92 ++++++++++++++---
4 files changed, 362 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..5b5c2b89968e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;

/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;

+ /* kill all tasks in the subtree in case of OOM */
+ bool oom_group;
+
+ /* cached OOM score */
+ long oom_score;
+
/* handle for "memory.events" */
struct cgroup_file events_file;

@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}

+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)

@@ -480,6 +492,13 @@ static inline bool task_in_memcg_oom(struct task_struct *p)

bool mem_cgroup_oom_synchronize(bool wait);

+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+ return memcg->oom_group;
+}
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -744,6 +763,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}

+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -936,6 +959,16 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */

/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
#include <linux/sched/coredump.h> /* MMF_* */
#include <linux/mm.h> /* VM_FAULT* */

+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
struct zonelist;
struct notifier_block;
struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {

/* Used by oom implementation, do not set */
unsigned long totalpages;
- struct task_struct *chosen;
+ struct task_struct *chosen_task;
+ struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
};

@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);

extern struct task_struct *find_lock_task_mm(struct task_struct *p);

+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a69d23082abf..97813c56163b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2649,6 +2649,213 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}

+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ long points = 0;
+ int nid;
+ pg_data_t *pgdat;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+ pgdat = NODE_DATA(nid);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
+
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+
+ return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the following:
+ * -1: there are inflight OOM victim tasks, belonging to the memcg
+ * 0: memcg is not eligible, e.g. all belonging tasks are protected
+ * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * >0: memcg is eligible, and the returned value is an estimation
+ * of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int eligible = 0;
+
+ /*
+ * Memcg is OOM eligible if there are OOM killable tasks inside.
+ *
+ * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * as unkillable.
+ *
+ * If there are inflight OOM victim tasks inside the memcg,
+ * we return -1.
+ */
+ css_task_iter_start(&memcg->css, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (!eligible &&
+ task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+ eligible = 1;
+
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+ eligible = -1;
+ break;
+ }
+ }
+ css_task_iter_end(&it);
+
+ if (eligible <= 0)
+ return eligible;
+
+ return memcg_oom_badness(memcg, nodemask);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+ struct mem_cgroup *iter, *parent;
+
+ for_each_mem_cgroup_tree(iter, root) {
+ if (memcg_has_children(iter)) {
+ iter->oom_score = 0;
+ continue;
+ }
+
+ iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
+
+ /*
+ * Ignore empty and non-eligible memory cgroups.
+ */
+ if (iter->oom_score == 0)
+ continue;
+
+ /*
+ * If there are inflight OOM victims, we don't need to look
+ * further for new victims.
+ */
+ if (iter->oom_score == -1) {
+ oc->chosen_memcg = INFLIGHT_VICTIM;
+ mem_cgroup_iter_break(root, iter);
+ return;
+ }
+
+ for (parent = parent_mem_cgroup(iter); parent && parent != root;
+ parent = parent_mem_cgroup(parent))
+ parent->oom_score += iter->oom_score;
+ }
+
+ for (;;) {
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *memcg = NULL;
+ long score = LONG_MIN;
+
+ css_for_each_child(css, &root->css) {
+ struct mem_cgroup *iter = mem_cgroup_from_css(css);
+
+ /*
+ * Ignore empty and non-eligible memory cgroups.
+ */
+ if (iter->oom_score == 0)
+ continue;
+
+ if (iter->oom_score > score) {
+ memcg = iter;
+ score = iter->oom_score;
+ }
+ }
+
+ if (!memcg) {
+ if (oc->memcg && root == oc->memcg) {
+ oc->chosen_memcg = oc->memcg;
+ css_get(&oc->chosen_memcg->css);
+ oc->chosen_points = oc->memcg->oom_score;
+ }
+ break;
+ }
+
+ if (memcg->oom_group || !memcg_has_children(memcg)) {
+ oc->chosen_memcg = memcg;
+ css_get(&oc->chosen_memcg->css);
+ oc->chosen_points = score;
+ break;
+ }
+
+ root = memcg;
+ }
+}
+
+static void select_victim_root_cgroup_task(struct oom_control *oc)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int ret = 0;
+
+ css_task_iter_start(&root_mem_cgroup->css, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = oom_evaluate_task(task, oc);
+ css_task_iter_end(&it);
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct mem_cgroup *root = root_mem_cgroup;
+
+ oc->chosen_task = NULL;
+ oc->chosen_memcg = NULL;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ if (oc->memcg)
+ root = oc->memcg;
+
+ rcu_read_lock();
+
+ select_victim_memcg(root, oc);
+
+ /*
+ * If existing OOM victims are found, no need to look further.
+ */
+ if (oc->chosen_memcg == INFLIGHT_VICTIM) {
+ rcu_read_unlock();
+ return true;
+ }
+
+ /*
+ * For system-wide OOMs we should consider tasks in the root cgroup
+ * with oom_score larger than oc->chosen_points.
+ */
+ if (!oc->memcg) {
+ select_victim_root_cgroup_task(oc);
+
+ if (oc->chosen_task && oc->chosen_memcg) {
+ /*
+ * If we've decided to kill a task in the root memcg,
+ * release chosen_memcg.
+ */
+ css_put(&oc->chosen_memcg->css);
+ oc->chosen_memcg = NULL;
+ }
+ }
+
+ rcu_read_unlock();
+
+ return oc->chosen_task || oc->chosen_memcg;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
@@ -5246,6 +5453,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}

+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ bool oom_group = memcg->oom_group;
+
+ seq_printf(m, "%d\n", oom_group);
+
+ return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_group;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_group);
+ if (err)
+ return err;
+
+ memcg->oom_group = oom_group;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5366,6 +5600,12 @@ static struct cftype memory_files[] = {
.write = memory_max_write,
},
{
+ .name = "oom_group",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_group_show,
+ .write = memory_oom_group_write,
+ },
+ {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f061b627092c..b90a41ec16a1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}

-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
@@ -322,26 +322,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto next;

/* Prefer thread group leaders for display purposes */
- if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+ if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
goto next;
select:
- if (oc->chosen)
- put_task_struct(oc->chosen);
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
get_task_struct(task);
- oc->chosen = task;
+ oc->chosen_task = task;
oc->chosen_points = points;
next:
return 0;
abort:
- if (oc->chosen)
- put_task_struct(oc->chosen);
- oc->chosen = (void *)-1UL;
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
+ oc->chosen_task = INFLIGHT_VICTIM;
return 1;
}

/*
* Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
*/
static void select_bad_process(struct oom_control *oc)
{
@@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct *victim)
struct mm_struct *mm;
bool can_oom_reap = true;

+ if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
+ return;
+
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -885,7 +888,7 @@ static void __oom_kill_process(struct task_struct *victim)

static void oom_kill_process(struct oom_control *oc, const char *message)
{
- struct task_struct *p = oc->chosen;
+ struct task_struct *p = oc->chosen_task;
unsigned int points = oc->chosen_points;
struct task_struct *victim = p;
struct task_struct *child;
@@ -946,6 +949,64 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
__oom_kill_process(victim);
}

+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+ if (!tsk_is_oom_victim(task)) {
+ get_task_struct(task);
+ __oom_kill_process(task);
+ }
+ return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ if (oc->chosen_task) {
+ if (oc->chosen_task == INFLIGHT_VICTIM)
+ return true;
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, oc->chosen_task);
+
+ __oom_kill_process(oc->chosen_task);
+
+ schedule_timeout_killable(1);
+ return true;
+
+ } else if (oc->chosen_memcg) {
+ if (oc->chosen_memcg == INFLIGHT_VICTIM)
+ return true;
+
+ /* Always begin with the biggest task */
+ oc->chosen_points = 0;
+ oc->chosen_task = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, oc->chosen_task);
+
+ __oom_kill_process(oc->chosen_task);
+
+ if (mem_cgroup_oom_group(oc->chosen_memcg))
+ mem_cgroup_scan_tasks(oc->chosen_memcg,
+ oom_kill_memcg_member,
+ NULL);
+ schedule_timeout_killable(1);
+ }
+
+ mem_cgroup_put(oc->chosen_memcg);
+ oc->chosen_memcg = NULL;
+ return oc->chosen_task;
+
+ } else {
+ oc->chosen_points = 0;
+ return false;
+ }
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -1042,18 +1103,21 @@ bool out_of_memory(struct oom_control *oc)
current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
- oc->chosen = current;
+ oc->chosen_task = current;
oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
return true;
}

+ if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+ return true;
+
select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
- if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+ if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
dump_header(oc, NULL);
panic("Out of memory and no killable processes...\n");
}
- if (oc->chosen && oc->chosen != (void *)-1UL) {
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
"Memory cgroup out of memory");
/*
@@ -1062,7 +1126,7 @@ bool out_of_memory(struct oom_control *oc)
*/
schedule_timeout_killable(1);
}
- return !!oc->chosen;
+ return !!oc->chosen_task;
}

/*
--
2.13.5

2017-09-04 17:32:41

by Shakeel Butt

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Mon, Sep 4, 2017 at 7:21 AM, Roman Gushchin <[email protected]> wrote:
> Introducing of cgroup-aware OOM killer changes the victim selection
> algorithm used by default: instead of picking the largest process,
> it will pick the largest memcg and then the largest process inside.
>
> This affects only cgroup v2 users.
>
> To provide a way to use cgroups v2 if the old OOM victim selection
> algorithm is preferred for some reason, the nogroupoom mount option
> is added.

Is this a mount option or a boot parameter? From the code, it seems like a
boot parameter.

2017-09-04 17:51:54

by Roman Gushchin

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Mon, Sep 04, 2017 at 10:32:37AM -0700, Shakeel Butt wrote:
> On Mon, Sep 4, 2017 at 7:21 AM, Roman Gushchin <[email protected]> wrote:
> > Introducing of cgroup-aware OOM killer changes the victim selection
> > algorithm used by default: instead of picking the largest process,
> > it will pick the largest memcg and then the largest process inside.
> >
> > This affects only cgroup v2 users.
> >
> > To provide a way to use cgroups v2 if the old OOM victim selection
> > algorithm is preferred for some reason, the nogroupoom mount option
> > is added.
>
> Is this mount option or boot parameter? From the code, it seems like a
> boot parameter.

Sure, you're right.

Fixed version below.

Thank you!

--

From 0b4757ec8d339fa883e17d4e25a92f45bf5565e0 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <[email protected]>
Date: Mon, 4 Sep 2017 12:08:52 +0100
Subject: [v7 5/5] mm, oom: allow disabling cgroup-aware OOM killer

The introduction of the cgroup-aware OOM killer changes the default
victim selection algorithm: instead of picking the largest process,
it will pick the largest memcg and then the largest process inside it.

This affects only cgroup v2 users.

To provide a way to use cgroups v2 if the old OOM victim selection
algorithm is preferred for some reason, the cgroup.memory=nogroupoom
boot option is added.

If set, the OOM selection is performed in a "traditional" per-process
way. Both oom_priority and oom_group memcg knobs are ignored.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/admin-guide/kernel-parameters.txt | 1 +
mm/memcontrol.c | 8 ++++++++
2 files changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 28f1a0f84456..07891f1030aa 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -489,6 +489,7 @@
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
+ nogroupoom -- Disable cgroup-aware OOM killer.

checkreqprot [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7dd293897ca..6a8235dc41f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,9 @@ static bool cgroup_memory_nosocket;
/* Kernel memory accounting disabled? */
static bool cgroup_memory_nokmem;

+/* Cgroup-aware OOM disabled? */
+static bool cgroup_memory_nogroupoom;
+
/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
int do_swap_account __read_mostly;
@@ -2822,6 +2825,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (mem_cgroup_disabled())
return false;

+ if (cgroup_memory_nogroupoom)
+ return false;
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;

@@ -6188,6 +6194,8 @@ static int __init cgroup_memory(char *s)
cgroup_memory_nosocket = true;
if (!strcmp(token, "nokmem"))
cgroup_memory_nokmem = true;
+ if (!strcmp(token, "nogroupoom"))
+ cgroup_memory_nogroupoom = true;
}
return 0;
}
--
2.13.5
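The cgroup_memory() handler in the patch splits the comma-separated
"cgroup.memory=" value with strsep() and sets one flag per token, so
"cgroup.memory=nokmem,nogroupoom" enables both options. A userspace model of
that loop (a sketch for illustration only, not the kernel code; the flag
names mirror mm/memcontrol.c) behaves like this:

```c
#define _DEFAULT_SOURCE
#include <stdbool.h>
#include <string.h>

static bool cgroup_memory_nosocket;
static bool cgroup_memory_nokmem;
static bool cgroup_memory_nogroupoom;

/* Mirrors the token loop in the cgroup_memory() __setup handler:
 * each comma-separated token independently sets one flag, and
 * unknown tokens are silently ignored. */
static int cgroup_memory(char *s)
{
	char *token;

	while ((token = strsep(&s, ",")) != NULL) {
		if (!*token)
			continue;
		if (!strcmp(token, "nosocket"))
			cgroup_memory_nosocket = true;
		if (!strcmp(token, "nokmem"))
			cgroup_memory_nokmem = true;
		if (!strcmp(token, "nogroupoom"))
			cgroup_memory_nogroupoom = true;
	}
	return 0;
}
```

Note that the buffer is modified in place by strsep(), which is why the
kernel handler also takes a mutable `char *s`.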

2017-09-05 13:34:47

by Michal Hocko

Subject: Re: [v7 1/5] mm, oom: refactor the oom_kill_process() function

On Mon 04-09-17 15:21:04, Roman Gushchin wrote:
> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
>
> This commit splits the oom_kill_process() function with
> an intention to re-use the the second half: __oom_kill_process().
>
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for the each task, as well as play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]

Looks good to me
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
> 1 file changed, 65 insertions(+), 58 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 99736e026712..f061b627092c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -804,68 +804,12 @@ static bool task_will_free_mem(struct task_struct *task)
> return ret;
> }
>
> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)
> {
> - struct task_struct *p = oc->chosen;
> - unsigned int points = oc->chosen_points;
> - struct task_struct *victim = p;
> - struct task_struct *child;
> - struct task_struct *t;
> + struct task_struct *p;
> struct mm_struct *mm;
> - unsigned int victim_points = 0;
> - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> - DEFAULT_RATELIMIT_BURST);
> bool can_oom_reap = true;
>
> - /*
> - * If the task is already exiting, don't alarm the sysadmin or kill
> - * its children or threads, just give it access to memory reserves
> - * so it can die quickly
> - */
> - task_lock(p);
> - if (task_will_free_mem(p)) {
> - mark_oom_victim(p);
> - wake_oom_reaper(p);
> - task_unlock(p);
> - put_task_struct(p);
> - return;
> - }
> - task_unlock(p);
> -
> - if (__ratelimit(&oom_rs))
> - dump_header(oc, p);
> -
> - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> - message, task_pid_nr(p), p->comm, points);
> -
> - /*
> - * If any of p's children has a different mm and is eligible for kill,
> - * the one with the highest oom_badness() score is sacrificed for its
> - * parent. This attempts to lose the minimal amount of work done while
> - * still freeing memory.
> - */
> - read_lock(&tasklist_lock);
> - for_each_thread(p, t) {
> - list_for_each_entry(child, &t->children, sibling) {
> - unsigned int child_points;
> -
> - if (process_shares_mm(child, p->mm))
> - continue;
> - /*
> - * oom_badness() returns 0 if the thread is unkillable
> - */
> - child_points = oom_badness(child,
> - oc->memcg, oc->nodemask, oc->totalpages);
> - if (child_points > victim_points) {
> - put_task_struct(victim);
> - victim = child;
> - victim_points = child_points;
> - get_task_struct(victim);
> - }
> - }
> - }
> - read_unlock(&tasklist_lock);
> -
> p = find_lock_task_mm(victim);
> if (!p) {
> put_task_struct(victim);
> @@ -939,6 +883,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> }
> #undef K
>
> +static void oom_kill_process(struct oom_control *oc, const char *message)
> +{
> + struct task_struct *p = oc->chosen;
> + unsigned int points = oc->chosen_points;
> + struct task_struct *victim = p;
> + struct task_struct *child;
> + struct task_struct *t;
> + unsigned int victim_points = 0;
> + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
> +
> + /*
> + * If the task is already exiting, don't alarm the sysadmin or kill
> + * its children or threads, just give it access to memory reserves
> + * so it can die quickly
> + */
> + task_lock(p);
> + if (task_will_free_mem(p)) {
> + mark_oom_victim(p);
> + wake_oom_reaper(p);
> + task_unlock(p);
> + put_task_struct(p);
> + return;
> + }
> + task_unlock(p);
> +
> + if (__ratelimit(&oom_rs))
> + dump_header(oc, p);
> +
> + pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> + message, task_pid_nr(p), p->comm, points);
> +
> + /*
> + * If any of p's children has a different mm and is eligible for kill,
> + * the one with the highest oom_badness() score is sacrificed for its
> + * parent. This attempts to lose the minimal amount of work done while
> + * still freeing memory.
> + */
> + read_lock(&tasklist_lock);
> + for_each_thread(p, t) {
> + list_for_each_entry(child, &t->children, sibling) {
> + unsigned int child_points;
> +
> + if (process_shares_mm(child, p->mm))
> + continue;
> + /*
> + * oom_badness() returns 0 if the thread is unkillable
> + */
> + child_points = oom_badness(child,
> + oc->memcg, oc->nodemask, oc->totalpages);
> + if (child_points > victim_points) {
> + put_task_struct(victim);
> + victim = child;
> + victim_points = child_points;
> + get_task_struct(victim);
> + }
> + }
> + }
> + read_unlock(&tasklist_lock);
> +
> + __oom_kill_process(victim);
> +}
> +
> /*
> * Determines whether the kernel must panic because of the panic_on_oom sysctl.
> */
> --
> 2.13.5

--
Michal Hocko
SUSE Labs

2017-09-05 13:44:20

by Michal Hocko

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

I will go and check patch 2 more deeply but this is something that I
wanted to sort out first.

On Mon 04-09-17 15:21:08, Roman Gushchin wrote:
> Introducing of cgroup-aware OOM killer changes the victim selection
> algorithm used by default: instead of picking the largest process,
> it will pick the largest memcg and then the largest process inside.
>
> This affects only cgroup v2 users.
>
> To provide a way to use cgroups v2 if the old OOM victim selection
> algorithm is preferred for some reason, the nogroupoom mount option
> is added.
>
> If set, the OOM selection is performed in a "traditional" per-process
> way. Both oom_priority and oom_group memcg knobs are ignored.

Why is this an opt-out rather than opt-in? IMHO the original oom logic
should be preserved by default and specific workloads should opt in for
the cgroup-aware logic. Changing the global behavior depending on
whether the cgroup v2 interface is in use is more than unexpected and
IMHO the wrong approach to take. I think we should instead go with
oom_strategy=[alloc_task,biggest_task,cgroup]

We currently have alloc_task (via sysctl_oom_kill_allocating_task) and
biggest_task, which is the default. You are adding cgroup, and the more
I think about it the more I agree that it doesn't really make sense to
try to fit the new semantic into the existing one (comparing tasks to
kill-all memcgs). Just introduce a new strategy and define a new
semantic from scratch. Memcg priority and kill-all are a natural
extension of this new strategy. This will make life easier and the
behavior easier for users to understand.

Does that make sense to you?
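For what it's worth, such a strategy switch could be modeled roughly as
follows. This is a purely hypothetical sketch: the enum and handler names
are assumptions, not anything from the patchset, and in the kernel the
handler would be registered via __setup("oom_strategy=", ...):

```c
#include <string.h>

/* Hypothetical strategies matching the proposed
 * oom_strategy=[alloc_task,biggest_task,cgroup] boot option. */
enum oom_strategy {
	OOM_STRATEGY_BIGGEST_TASK,	/* today's default heuristic */
	OOM_STRATEGY_ALLOC_TASK,	/* sysctl_oom_kill_allocating_task */
	OOM_STRATEGY_CGROUP,		/* cgroup-aware selection */
};

static enum oom_strategy oom_strategy = OOM_STRATEGY_BIGGEST_TASK;

/* Parse the boot option value; unrecognized values leave the
 * current setting untouched. */
static int setup_oom_strategy(const char *s)
{
	if (!strcmp(s, "alloc_task"))
		oom_strategy = OOM_STRATEGY_ALLOC_TASK;
	else if (!strcmp(s, "biggest_task"))
		oom_strategy = OOM_STRATEGY_BIGGEST_TASK;
	else if (!strcmp(s, "cgroup"))
		oom_strategy = OOM_STRATEGY_CGROUP;
	return 1;
}
```

With such a switch, the memcg knobs would only be consulted when
oom_strategy == OOM_STRATEGY_CGROUP, which makes the opt-in explicit.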

> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> Documentation/admin-guide/kernel-parameters.txt | 1 +
> mm/memcontrol.c | 8 ++++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 28f1a0f84456..07891f1030aa 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -489,6 +489,7 @@
> Format: <string>
> nosocket -- Disable socket memory accounting.
> nokmem -- Disable kernel memory accounting.
> + nogroupoom -- Disable cgroup-aware OOM killer.
>
> checkreqprot [SELINUX] Set initial checkreqprot flag value.
> Format: { "0" | "1" }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7dd293897ca..6a8235dc41f6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -87,6 +87,9 @@ static bool cgroup_memory_nosocket;
> /* Kernel memory accounting disabled? */
> static bool cgroup_memory_nokmem;
>
> +/* Cgroup-aware OOM disabled? */
> +static bool cgroup_memory_nogroupoom;
> +
> /* Whether the swap controller is active */
> #ifdef CONFIG_MEMCG_SWAP
> int do_swap_account __read_mostly;
> @@ -2822,6 +2825,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> if (mem_cgroup_disabled())
> return false;
>
> + if (cgroup_memory_nogroupoom)
> + return false;
> +
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return false;
>
> @@ -6188,6 +6194,8 @@ static int __init cgroup_memory(char *s)
> cgroup_memory_nosocket = true;
> if (!strcmp(token, "nokmem"))
> cgroup_memory_nokmem = true;
> + if (!strcmp(token, "nogroupoom"))
> + cgroup_memory_nogroupoom = true;
> }
> return 0;
> }
> --
> 2.13.5

--
Michal Hocko
SUSE Labs

2017-09-05 14:30:56

by Roman Gushchin

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> I will go and check patch 2 more deeply but this is something that I
> wanted to sort out first.
>
> On Mon 04-09-17 15:21:08, Roman Gushchin wrote:
> > Introducing of cgroup-aware OOM killer changes the victim selection
> > algorithm used by default: instead of picking the largest process,
> > it will pick the largest memcg and then the largest process inside.
> >
> > This affects only cgroup v2 users.
> >
> > To provide a way to use cgroups v2 if the old OOM victim selection
> > algorithm is preferred for some reason, the nogroupoom mount option
> > is added.
> >
> > If set, the OOM selection is performed in a "traditional" per-process
> > way. Both oom_priority and oom_group memcg knobs are ignored.
>
> Why is this an opt out rather than opt-in? IMHO the original oom logic
> should be preserved by default and specific workloads should opt in for
> the cgroup aware logic. Changing the global behavior depending on
> whether cgroup v2 interface is in use is more than unexpected and IMHO
> wrong approach to take. I think we should instead go with
> oom_strategy=[alloc_task,biggest_task,cgroup]
>
> we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> biggest_task which is the default. You are adding cgroup and the more I
> think about the more I agree that it doesn't really make sense to try to
> fit thew new semantic into the existing one (compare tasks to kill-all
> memcgs). Just introduce a new strategy and define a new semantic from
> scratch. Memcg priority and kill-all are a natural extension of this new
> strategy. This will make the life easier and easier to understand by
> users.
>
> Does that make sense to you?

Absolutely.

The only thing: I'm not sure that we have to preserve the existing logic
as the default option. For most users (except a few very specific use
cases), it should be at least as good as the existing one.

Making it opt-in means that the corresponding code will be executed only
by the few users who care. Then we should probably hide the corresponding
cgroup interface (the oom_group and oom_priority knobs) by default,
which feels like an unnecessary complication and is overall against the
cgroup v2 interface design.

> I think we should instead go with
> oom_strategy=[alloc_task,biggest_task,cgroup]

It would be a really nice interface, although I've no idea how to
implement it: "alloc_task" is an existing sysctl, which we have to
preserve, while "cgroup" depends on cgroup v2.

Thanks!

2017-09-05 14:57:06

by Michal Hocko

Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Mon 04-09-17 15:21:05, Roman Gushchin wrote:
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a69d23082abf..97813c56163b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2649,6 +2649,213 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> return ret;
> }
>
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));

Why don't you consider file LRUs here? What if there is a lot of page
cache which is not reclaimed because it is protected by memcg->low.
Should we hide that from the OOM killer?

[...]
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter, *parent;
> +
> + for_each_mem_cgroup_tree(iter, root) {
> + if (memcg_has_children(iter)) {
> + iter->oom_score = 0;
> + continue;
> + }

Do we really need this check? If it is a mere optimization then
we should probably check for tasks in the memcg rather than
descendant. More on that below.

> +
> + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> +
> + /*
> + * Ignore empty and non-eligible memory cgroups.
> + */
> + if (iter->oom_score == 0)
> + continue;
> +
> + /*
> + * If there are inflight OOM victims, we don't need to look
> + * further for new victims.
> + */
> + if (iter->oom_score == -1) {
> + oc->chosen_memcg = INFLIGHT_VICTIM;
> + mem_cgroup_iter_break(root, iter);
> + return;
> + }
> +
> + for (parent = parent_mem_cgroup(iter); parent && parent != root;
> + parent = parent_mem_cgroup(parent))
> + parent->oom_score += iter->oom_score;

Hmm. The changelog says "By default, it will look for the biggest leaf
cgroup, and kill the largest task inside." But you are accumulating
oom_score up the hierarchy, so parents will have a higher score than
the layer of their children, and the larger the sub-hierarchy the more
biased it will become. Say you have

         root
         /  \
        A    D
       / \
      B   C

B(5) and C(15), thus A(20) and D(20). Unless I am missing something, we
are going to go down the A path and then choose C even though D is the
largest leaf group, right?
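That bias can be reproduced with a toy model of the accumulation and the
downward walk. This is a sketch using the hypothetical scores above
(B=5, C=15, D=20), not the patch's code; ties are broken here by taking
the first child, which is enough to show the walk staying inside A:

```c
#include <stddef.h>

/* Toy memcg node: 'score' starts as the leaf badness, and inner
 * nodes start at 0, as select_victim_memcg() resets them. */
struct node {
	const char *name;
	long score;
	struct node *parent;
	struct node *child[2];
	int nr_children;
};

static struct node B = {"B", 5}, C = {"C", 15}, D = {"D", 20};
static struct node A = {"A", 0}, root = {"root", 0};

static void link_tree(void)
{
	root.child[0] = &A; root.child[1] = &D; root.nr_children = 2;
	A.child[0] = &B; A.child[1] = &C; A.nr_children = 2;
	A.parent = &root; D.parent = &root;
	B.parent = &A; C.parent = &A;
}

/* Propagate a leaf's score to every ancestor below the root,
 * like the for(parent = ...) loop in select_victim_memcg(). */
static void accumulate(struct node *leaf)
{
	struct node *p;

	for (p = leaf->parent; p && p != &root; p = p->parent)
		p->score += leaf->score;
}

/* Descend level by level, always into the child with the highest
 * accumulated score, until a leaf is reached. */
static struct node *select_victim(void)
{
	struct node *cur = &root;

	while (cur->nr_children) {
		struct node *best = NULL;
		int i;

		for (i = 0; i < cur->nr_children; i++)
			if (!best || cur->child[i]->score > best->score)
				best = cur->child[i];
		cur = best;
	}
	return cur;
}
```

After accumulation A and D both score 20, the walk descends into A and
ends at C(15), so the largest leaf D(20) is never chosen.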

> + }
> +
> + for (;;) {
> + struct cgroup_subsys_state *css;
> + struct mem_cgroup *memcg = NULL;
> + long score = LONG_MIN;
> +
> + css_for_each_child(css, &root->css) {
> + struct mem_cgroup *iter = mem_cgroup_from_css(css);
> +
> + /*
> + * Ignore empty and non-eligible memory cgroups.
> + */
> + if (iter->oom_score == 0)
> + continue;
> +
> + if (iter->oom_score > score) {
> + memcg = iter;
> + score = iter->oom_score;
> + }
> + }
> +
> + if (!memcg) {
> + if (oc->memcg && root == oc->memcg) {
> + oc->chosen_memcg = oc->memcg;
> + css_get(&oc->chosen_memcg->css);
> + oc->chosen_points = oc->memcg->oom_score;
> + }
> + break;
> + }
> +
> + if (memcg->oom_group || !memcg_has_children(memcg)) {
> + oc->chosen_memcg = memcg;
> + css_get(&oc->chosen_memcg->css);
> + oc->chosen_points = score;
> + break;
> + }
> +
> + root = memcg;
> + }
> +}
> +
[...]
> + /*
> + * For system-wide OOMs we should consider tasks in the root cgroup
> + * with oom_score larger than oc->chosen_points.
> + */
> + if (!oc->memcg) {
> + select_victim_root_cgroup_task(oc);

I do not understand why we have to handle the root cgroup specially here.
select_victim_memcg already iterates all memcgs in the oom hierarchy
(including root), so if the root memcg is the largest one then we should
simply consider it, no? You are skipping root there because of
memcg_has_children, but I suspect this and the whole
accumulate-up-the-hierarchy approach just make the whole thing more
complex than necessary. With a "tasks only in leaves" cgroup policy we
should only see pages on LRUs in the global root memcg and leaf cgroups.
The same applies to memcg stats. So why can't we simply do the tree walk,
calculate badness/check the priority, and select the largest memcg in
one go?
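The one-go selection suggested here could be modeled like this; a sketch
over a flat array standing in for the memcg iterator, with field names
that are assumptions rather than anything from the patchset:

```c
#include <stddef.h>

/* Stand-in for one memcg visited by the tree walk: 'badness' as
 * oom_evaluate_memcg() would compute it, plus the per-memcg
 * oom_priority knob. */
struct memcg {
	long badness;
	int oom_priority;
};

/* One pass over all eligible memcgs: compare priority first, then
 * badness, and remember the winner -- no accumulation up the
 * hierarchy and no special-casing of the root. */
static struct memcg *select_largest(struct memcg *v, size_t n)
{
	struct memcg *chosen = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (v[i].badness <= 0)
			continue;	/* empty or non-eligible */
		if (!chosen ||
		    v[i].oom_priority > chosen->oom_priority ||
		    (v[i].oom_priority == chosen->oom_priority &&
		     v[i].badness > chosen->badness))
			chosen = &v[i];
	}
	return chosen;
}
```

A higher oom_priority wins even over a larger badness, which matches the
idea of priority taking precedence in the victim-selection walk.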

> @@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct *victim)
> struct mm_struct *mm;
> bool can_oom_reap = true;
>
> + if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
> + return;
> +

This will leak a reference to the victim AFACS

> p = find_lock_task_mm(victim);
> if (!p) {
> put_task_struct(victim);

--
Michal Hocko
SUSE Labs

2017-09-05 15:13:07

by Michal Hocko

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Tue 05-09-17 15:30:21, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
[...]
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> >
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about the more I agree that it doesn't really make sense to try to
> > fit thew new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make the life easier and easier to understand by
> > users.
> >
> > Does that make sense to you?
>
> Absolutely.
>
> The only thing: I'm not sure that we have to preserve the existing logic
> as default option. For most users (except few very specific usecases),
> it should be at least as good, as the existing one.

But this is really an unexpected change. Users might not even know that
they are using cgroup v2 and that memcg is in use.

> Making it opt-in means that corresponding code will be executed only
> by few users, who cares.

Yeah, which is the way we should introduce new features, no?

> Then we should probably hide corresponding
> cgroup interface (oom_group and oom_priority knobs) by default,
> and it feels as unnecessary complication and is overall against
> cgroup v2 interface design.

Why? If we care enough, we could simply return EINVAL when those knobs
are written while the corresponding strategy is not used.

> > I think we should instead go with
> > oom_strategy=[alloc_task,biggest_task,cgroup]
>
> It would be a really nice interface; although I've no idea how to implement it:
> "alloc_task" is an existing sysctl, which we have to preserve;

I would argue that we should simply deprecate and later drop the sysctl.
I _strongly_ doubt anybody is using it. If somebody is, it is not that
hard to change the kernel command line rather than set the sysctl. The
deprecation process would be:
- warn when somebody writes to the sysctl and check both the boot
and sysctl values
[ wait some time ]
- keep the sysctl but return EINVAL
[ wait some time ]
- remove the sysctl

> while "cgroup" depends on cgroup v2.

Which is not a big deal either. Simply fall back to the default if
cgroup v2 is not in use. The implementation would have essentially the
same effect because there won't be any kill-all cgroups and so we will
select the largest task.
--
Michal Hocko
SUSE Labs

2017-09-05 19:16:45

by Roman Gushchin

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Tue, Sep 05, 2017 at 05:12:51PM +0200, Michal Hocko wrote:
> On Tue 05-09-17 15:30:21, Roman Gushchin wrote:
> > On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> [...]
> > > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > > should be preserved by default and specific workloads should opt in for
> > > the cgroup aware logic. Changing the global behavior depending on
> > > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > > wrong approach to take. I think we should instead go with
> > > oom_strategy=[alloc_task,biggest_task,cgroup]
> > >
> > > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > > biggest_task which is the default. You are adding cgroup and the more I
> > > think about the more I agree that it doesn't really make sense to try to
> > > fit thew new semantic into the existing one (compare tasks to kill-all
> > > memcgs). Just introduce a new strategy and define a new semantic from
> > > scratch. Memcg priority and kill-all are a natural extension of this new
> > > strategy. This will make the life easier and easier to understand by
> > > users.
> > >
> > > Does that make sense to you?
> >
> > Absolutely.
> >
> > The only thing: I'm not sure that we have to preserve the existing logic
> > as default option. For most users (except few very specific usecases),
> > it should be at least as good, as the existing one.
>
> But this is really an unexpected change. Users even might not know that
> they are using cgroup v2 and memcg is in use.
>
> > Making it opt-in means that corresponding code will be executed only
> > by few users, who cares.
>
> Yeah, which is the way we should introduce new features no?
>
> > Then we should probably hide corresponding
> > cgroup interface (oom_group and oom_priority knobs) by default,
> > and it feels as unnecessary complication and is overall against
> > cgroup v2 interface design.
>
> Why. If we care enough, we could simply return EINVAL when those knobs
> are written while the corresponding strategy is not used.

It doesn't look like a nice default interface.

>
> > > I think we should instead go with
> > > oom_strategy=[alloc_task,biggest_task,cgroup]
> >
> > It would be a really nice interface; although I've no idea how to implement it:
> > "alloc_task" is an existing sysctl, which we have to preserve;
>
> I would argue that we should simply deprecate and later drop the sysctl.
> I _strongly_ suspect anybody is using this. If yes it is not that hard
> to change the kernel command like rather than select the sysctl.

I agree. And if so, why do we need a new interface for a useless feature?

>
> > while "cgroup" depends on cgroup v2.
>
> Which is not a big deal either. Simply fall back to default if there are
> no cgroup v2. The implementation would have essentially the same effect
> because there won't be any kill-all cgroups and so we will select the
> largest task.

I'd agree with you if there were use cases (excluding pure legacy) where
the per-process algorithm is preferable to the cgroup-aware OOM killer.
I really doubt there are, and hope that with oom_priorities the
suggested algorithm covers almost all reasonable use cases.

Thanks!

2017-09-05 20:24:35

by Roman Gushchin

Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
> On Mon 04-09-17 15:21:05, Roman Gushchin wrote:
> [...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a69d23082abf..97813c56163b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2649,6 +2649,213 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> > return ret;
> > }
> >
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > + const nodemask_t *nodemask)
> > +{
> > + long points = 0;
> > + int nid;
> > + pg_data_t *pgdat;
> > +
> > + for_each_node_state(nid, N_MEMORY) {
> > + if (nodemask && !node_isset(nid, *nodemask))
> > + continue;
> > +
> > + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
>
> Why don't you consider file LRUs here? What if there is a lot of page
> cache which is not reclaimed because it is protected by memcg->low.
> Should we hide that from the OOM killer?

I'm not sure here.
I agree with your argument, although memcg->low should not cause OOMs
in the current implementation (which is a separate problem).
Also, I can imagine some edge cases with mlocked page cache belonging
to a process from a different cgroup.

I would suggest refining this later.

>
> [...]
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > + struct mem_cgroup *iter, *parent;
> > +
> > + for_each_mem_cgroup_tree(iter, root) {
> > + if (memcg_has_children(iter)) {
> > + iter->oom_score = 0;
> > + continue;
> > + }
>
> Do we really need this check? If it is a mere optimization then
> we should probably check for tasks in the memcg rather than
> descendant. More on that below.

The idea is to traverse the memcg tree only once: we reset the oom_score
of non-leaf cgroups, and for each leaf cgroup we calculate the score
and propagate it upwards.

>
> > +
> > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> > +
> > + /*
> > + * Ignore empty and non-eligible memory cgroups.
> > + */
> > + if (iter->oom_score == 0)
> > + continue;
> > +
> > + /*
> > + * If there are inflight OOM victims, we don't need to look
> > + * further for new victims.
> > + */
> > + if (iter->oom_score == -1) {
> > + oc->chosen_memcg = INFLIGHT_VICTIM;
> > + mem_cgroup_iter_break(root, iter);
> > + return;
> > + }
> > +
> > + for (parent = parent_mem_cgroup(iter); parent && parent != root;
> > + parent = parent_mem_cgroup(parent))
> > + parent->oom_score += iter->oom_score;
>
> Hmm. The changelog says "By default, it will look for the biggest leaf
> cgroup, and kill the largest task inside." But you are accumulating
> oom_score up the hierarchy and so parents will have higher score than
> the layer of their children and the larger the sub-hierarchy the more
> biased it will become. Say you have
> root
> /\
> / \
> A D
> / \
> B C
>
> B (5), C(15) thus A(20) and D(20). Unless I am missing something we are
> going to go down A path and then chose C even though D is the largest
> leaf group, right?

You're right, the changelog is not accurate, I'll fix it.
The behavior itself is correct, IMO.

>
> > + }
> > +
> > + for (;;) {
> > + struct cgroup_subsys_state *css;
> > + struct mem_cgroup *memcg = NULL;
> > + long score = LONG_MIN;
> > +
> > + css_for_each_child(css, &root->css) {
> > + struct mem_cgroup *iter = mem_cgroup_from_css(css);
> > +
> > + /*
> > + * Ignore empty and non-eligible memory cgroups.
> > + */
> > + if (iter->oom_score == 0)
> > + continue;
> > +
> > + if (iter->oom_score > score) {
> > + memcg = iter;
> > + score = iter->oom_score;
> > + }
> > + }
> > +
> > + if (!memcg) {
> > + if (oc->memcg && root == oc->memcg) {
> > + oc->chosen_memcg = oc->memcg;
> > + css_get(&oc->chosen_memcg->css);
> > + oc->chosen_points = oc->memcg->oom_score;
> > + }
> > + break;
> > + }
> > +
> > + if (memcg->oom_group || !memcg_has_children(memcg)) {
> > + oc->chosen_memcg = memcg;
> > + css_get(&oc->chosen_memcg->css);
> > + oc->chosen_points = score;
> > + break;
> > + }
> > +
> > + root = memcg;
> > + }
> > +}
> > +
> [...]
> > + /*
> > + * For system-wide OOMs we should consider tasks in the root cgroup
> > + * with oom_score larger than oc->chosen_points.
> > + */
> > + if (!oc->memcg) {
> > + select_victim_root_cgroup_task(oc);
>
> I do not understand why do we have to handle root cgroup specially here.
> select_victim_memcg already iterates all memcgs in the oom hierarchy
> (including root) so if the root memcg is the largest one then we
> should simply consider it no?

We don't have the necessary stats for the root cgroup, so we can't
calculate its oom_score.

> You are skipping root there because of
> memcg_has_children but I suspect this and the whole accumulate up the
> hierarchy approach just makes the whole thing more complex than necessary. With
> "tasks only in leafs" cgroup policy we should only see any pages on LRUs
> on the global root memcg and leaf cgroups. The same applies to memcg
> stats. So why cannot we simply do the tree walk, calculate
> badness/check the priority and select the largest memcg in one go?

We have to traverse from top to bottom to make the priority-based decision,
but the size-based oom_score is calculated as the sum of the descendant
leaf cgroup scores.

For example:
root
/\
/ \
A D
/ \
B C
A and D have the same priority; B has a larger priority than C.

In this case we need to calculate the size-based score for A, which
requires summing the oom_score of its sub-tree (B and C), even though
we don't need it for choosing between B and C.

Maybe I'm missing something, but I don't see how to implement it more optimally.

>
> > @@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct *victim)
> > struct mm_struct *mm;
> > bool can_oom_reap = true;
> >
> > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
> > + return;
> > +
>
> This will leak a reference to the victim AFACS

Good catch!
I didn't fix this up after moving the reference dropping into
__oom_kill_process(). Fixed.

Thanks!

2017-09-05 21:54:03

by Johannes Weiner

[permalink] [raw]
Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> Why is this an opt out rather than opt-in? IMHO the original oom logic
> should be preserved by default and specific workloads should opt in for
> the cgroup aware logic. Changing the global behavior depending on
> whether cgroup v2 interface is in use is more than unexpected and IMHO
> wrong approach to take. I think we should instead go with
> oom_strategy=[alloc_task,biggest_task,cgroup]
>
> we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> biggest_task which is the default. You are adding cgroup and the more I
> think about the more I agree that it doesn't really make sense to try to
> fit thew new semantic into the existing one (compare tasks to kill-all
> memcgs). Just introduce a new strategy and define a new semantic from
> scratch. Memcg priority and kill-all are a natural extension of this new
> strategy. This will make the life easier and easier to understand by
> users.

oom_kill_allocating_task is actually a really good example of why
cgroup-awareness *should* be the new default.

Before we had the oom killer victim selection, we simply killed the
faulting/allocating task. While a valid answer to the problem, it's
not very fair or representative of what the user wants or intends.

Then we added code to kill the biggest offender instead, which should
have been the case from the start and was hence made the new default.
The oom_kill_allocating_task was added on the off-chance that there
might be setups that, for historical reasons, rely on the old behavior.
But our default was chosen based on what behavior is fair, expected,
and most reflective of the user's intentions.

The cgroup-awareness in the OOM killer is exactly the same thing. It
should have been the default from the beginning, because the user
configures a group of tasks to be an interdependent, terminal unit of
memory consumption, and it's undesirable for the OOM killer to ignore
this intention and compare members across these boundaries.

We should go the same way here as with kill_alloc_task: the default
should be what's sane and expected by the vast majority of our users,
with a knob (I would prefer a sysctl here, actually) to switch back in
case somebody - unexpectedly - actually relies on the old behavior.

I don't see why we couldn't change user-visible behavior here if we don't
expect anyone (quirky setups aside) to rely on it AND we provide a
knob to revert in the field for the exceptions.

2017-09-06 08:29:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> >
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about the more I agree that it doesn't really make sense to try to
> > fit thew new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make the life easier and easier to understand by
> > users.
>
> oom_kill_allocating_task is actually a really good example of why
> cgroup-awareness *should* be the new default.
>
> Before we had the oom killer victim selection, we simply killed the
> faulting/allocating task. While a valid answer to the problem, it's
> not very fair or representative of what the user wants or intends.
>
> Then we added code to kill the biggest offender instead, which should
> have been the case from the start and was hence made the new default.
> The oom_kill_allocating_task was added on the off-chance that there
> might be setups who, for historical reasons, rely on the old behavior.
> But our default was chosen based on what behavior is fair, expected,
> and most reflective of the user's intentions.

I am not sure this is how things actually evolved. This is way before
my time, so my git log interpretation might be imprecise. We have had
the oom_badness heuristic since out_of_memory was introduced, and
oom_kill_allocating_task was introduced much later because of large
boxes with zillions of tasks (SGI, I suspect) which took too long to
select a victim, so David added this heuristic.

> The cgroup-awareness in the OOM killer is exactly the same thing. It
> should have been the default from the beginning, because the user
> configures a group of tasks to be an interdependent, terminal unit of
> memory consumption, and it's undesirable for the OOM killer to ignore
> this intention and compare members across these boundaries.

I would agree if that were true in general. I can completely see how
cgroup awareness is useful in e.g. containerized environments (especially
with kill-all enabled), but memcgs are used in a large variety of
usecases and I cannot really say that all of them demand the new
semantic. Say I have a workload which doesn't want to see reclaim
interference from others on the same machine. Why should I kill a
process from that particular memcg just because it is the largest one
when there is a memory hog/leak outside of this memcg?

From my point of view, the safest way (in the sense of least surprise)
is to go with opt-in for the new heuristic. I am pretty sure all who
would benefit from the new behavior will enable it, while others will not
regress in an unexpected way.

We can talk about the way _how_ to control these oom strategies, of
course. But I would be really reluctant to change the default which is
used for years and people got used to it.
--
Michal Hocko
SUSE Labs

2017-09-06 08:32:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
[...]
> > Hmm. The changelog says "By default, it will look for the biggest leaf
> > cgroup, and kill the largest task inside." But you are accumulating
> > oom_score up the hierarchy and so parents will have higher score than
> > the layer of their children and the larger the sub-hierarchy the more
> > biased it will become. Say you have
> > root
> > /\
> > / \
> > A D
> > / \
> > B C
> >
> > B (5), C(15) thus A(20) and D(20). Unless I am missing something we are
> > going to go down A path and then chose C even though D is the largest
> > leaf group, right?
>
> You're right, changelog is not accurate, I'll fix it.
> The behavior is correct, IMO.

Please explain why. This is really a non-intuitive semantic. Why should
larger hierarchies be punished more than shallow ones? I would
completely agree if the whole hierarchy were a killable entity (aka
A would be kill-all).

[...]
> > I do not understand why do we have to handle root cgroup specially here.
> > select_victim_memcg already iterates all memcgs in the oom hierarchy
> > (including root) so if the root memcg is the largest one then we
> > should simply consider it no?
>
> We don't have necessary stats for the root cgroup, so we can't calculate
> it's oom_score.

We used to charge pages to the root memcg as well, so we might resurrect
that idea. In any case this is something that could be hidden in
memcg_oom_badness rather than special-cased somewhere else.

> > You are skipping root there because of
> > memcg_has_children but I suspect this and the whole accumulate up the
> > hierarchy approach just makes the whole thing more complex than necessary. With
> > "tasks only in leafs" cgroup policy we should only see any pages on LRUs
> > on the global root memcg and leaf cgroups. The same applies to memcg
> > stats. So why cannot we simply do the tree walk, calculate
> > badness/check the priority and select the largest memcg in one go?
>
> We have to traverse from top to bottom to make priority-based decision,
> but size-based oom_score is calculated as sum of descending leaf cgroup scores.
>
> For example:
> root
> /\
> / \
> A D
> / \
> B C
> A and D have same priorities, B has larger priority than C.
>
> In this case we need to calculate size-based score for A, which requires
> summing oom_score of the sub-tree (B an C), despite we don't need it
> for choosing between B and C.
>
> Maybe I don't see it, but I don't know how to implement it more optimal.

I have to think about the priority-based oom killing some more, to be
honest. Do we really want to allow setting a priority on non-leaf
memcgs? How are you going to keep the whole tree consistent? Say your
above example has prio(A) < prio(D) && prio(C) > prio(D). Your current
implementation would kill D, right? Isn't that counter-intuitive
behavior again? If anything we should have prio(A) = max(tree_prio(A)).
Again, I could understand comparing priorities only on killable entities.
--
Michal Hocko
SUSE Labs

2017-09-06 08:34:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
[...]
> > > @@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct *victim)
> > > struct mm_struct *mm;
> > > bool can_oom_reap = true;
> > >
> > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
> > > + return;
> > > +
> >
> > This will leak a reference to the victim AFACS
>
> Good catch!
> I didn't fix this after moving reference dropping into __oom_kill_process().
> Fixed.

Btw. didn't you want to check
victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

here as well? Maybe I've missed something, but you can still kill a task
which is oom-disabled, which I thought we agreed is the wrong thing to
do.
--
Michal Hocko
SUSE Labs

2017-09-06 08:42:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Tue 05-09-17 20:16:09, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 05:12:51PM +0200, Michal Hocko wrote:
[...]
> > > Then we should probably hide corresponding
> > > cgroup interface (oom_group and oom_priority knobs) by default,
> > > and it feels as unnecessary complication and is overall against
> > > cgroup v2 interface design.
> >
> > Why. If we care enough, we could simply return EINVAL when those knobs
> > are written while the corresponding strategy is not used.
>
> It doesn't look as a nice default interface.

I do not have a strong opinion on this. A printk_once could explain why
the knob is ignored and instruct the admin how to enable the feature
completely.

> > > > I think we should instead go with
> > > > oom_strategy=[alloc_task,biggest_task,cgroup]
> > >
> > > It would be a really nice interface; although I've no idea how to implement it:
> > > "alloc_task" is an existing sysctl, which we have to preserve;
> >
> > I would argue that we should simply deprecate and later drop the sysctl.
> > I _strongly_ suspect anybody is using this. If yes it is not that hard
> > to change the kernel command like rather than select the sysctl.
>
> I agree. And if so, why do we need a new interface for an useless feature?

Well, I won't be opposed to just deprecating the sysctl and only adding a
"real" kill-allocate strategy if somebody explicitly asks for it.
--
Michal Hocko
SUSE Labs

2017-09-06 12:35:05

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Wed, Sep 06, 2017 at 10:34:13AM +0200, Michal Hocko wrote:
> On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> > On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
> [...]
> > > > @@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct *victim)
> > > > struct mm_struct *mm;
> > > > bool can_oom_reap = true;
> > > >
> > > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
> > > > + return;
> > > > +
> > >
> > > This will leak a reference to the victim AFACS
> >
> > Good catch!
> > I didn't fix this after moving reference dropping into __oom_kill_process().
> > Fixed.
>
> Btw. didn't you want to check
> victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
>
> here as well? Maybe I've missed something but you still can kill a task
> which is oom disabled which I thought we agreed is the wrong thing to
> do.

Added. Thanks!

2017-09-06 12:58:45

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Wed, Sep 06, 2017 at 10:31:58AM +0200, Michal Hocko wrote:
> On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> > On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
> [...]
> > > Hmm. The changelog says "By default, it will look for the biggest leaf
> > > cgroup, and kill the largest task inside." But you are accumulating
> > > oom_score up the hierarchy and so parents will have higher score than
> > > the layer of their children and the larger the sub-hierarchy the more
> > > biased it will become. Say you have
> > > root
> > > /\
> > > / \
> > > A D
> > > / \
> > > B C
> > >
> > > B (5), C(15) thus A(20) and D(20). Unless I am missing something we are
> > > going to go down A path and then chose C even though D is the largest
> > > leaf group, right?
> >
> > You're right, changelog is not accurate, I'll fix it.
> > The behavior is correct, IMO.
>
> Please explain why. This is really a non-intuitive semantic. Why should
> larger hierarchies be punished more than shallow ones? I would
> completely agree if the whole hierarchy would be a killable entity (aka
> A would be kill-all).

I think it's a reasonable and clear policy: we recursively look for the
memcg with the smallest oom_priority and the largest memory footprint.
Then we reclaim some memory from it (by killing the biggest process
or all processes, depending on memcg preferences).

In general, if there are two memcgs of equal importance (as defined
by their oom_priority), we choose the larger one, because there is a
better chance that it contains a leaking process. The same is true
today for processes.

I agree that for the size-based comparison we could use a different
policy: comparing leaf cgroups regardless of their level. But I don't
see a clever way to apply oom_priorities in that case. Comparing
oom_priority at each level is a simple and powerful policy, and it
works well for delegation.

>
> [...]
> > > I do not understand why do we have to handle root cgroup specially here.
> > > select_victim_memcg already iterates all memcgs in the oom hierarchy
> > > (including root) so if the root memcg is the largest one then we
> > > should simply consider it no?
> >
> > We don't have necessary stats for the root cgroup, so we can't calculate
> > it's oom_score.
>
> We used to charge pages to the root memcg as well so we might resurrect
> that idea. In any case this is something that could be hidden in
> memcg_oom_badness rather then special cased somewhere else.

In theory I agree, but I don't see a good way to calculate the root
memcg's oom_score.

>
> > > You are skipping root there because of
> > > memcg_has_children but I suspect this and the whole accumulate up the
> > > hierarchy approach just makes the whole thing more complex than necessary. With
> > > "tasks only in leafs" cgroup policy we should only see any pages on LRUs
> > > on the global root memcg and leaf cgroups. The same applies to memcg
> > > stats. So why cannot we simply do the tree walk, calculate
> > > badness/check the priority and select the largest memcg in one go?
> >
> > We have to traverse from top to bottom to make priority-based decision,
> > but size-based oom_score is calculated as sum of descending leaf cgroup scores.
> >
> > For example:
> > root
> > /\
> > / \
> > A D
> > / \
> > B C
> > A and D have same priorities, B has larger priority than C.
> >
> > In this case we need to calculate size-based score for A, which requires
> > summing oom_score of the sub-tree (B an C), despite we don't need it
> > for choosing between B and C.
> >
> > Maybe I don't see it, but I don't know how to implement it more optimal.
>
> I have to think about the priority based oom killing some more to be
> honest. Do we really want to allow setting a priority to non-leaf
> memcgs? How are you going to manage the whole tree consistency? Say your
> above example have prio(A) < prio(D) && prio(C) > prio(D). Your current
> implementation would kill D, right?

Right.

> Isn't that counter intuitive
> behavior again. If anything we should prio(A) = max(tree_prio(A)). Again
> I could understand comparing priorities only on killable entities.

Answered above.
Also, I don't think any per-memcg knob should have a global meaning,
regardless of parent memcg settings. That would break the delegation model.

Thanks!

Roman

2017-09-06 13:22:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Wed 06-09-17 13:57:50, Roman Gushchin wrote:
> On Wed, Sep 06, 2017 at 10:31:58AM +0200, Michal Hocko wrote:
> > On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> > > On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
> > [...]
> > > > Hmm. The changelog says "By default, it will look for the biggest leaf
> > > > cgroup, and kill the largest task inside." But you are accumulating
> > > > oom_score up the hierarchy and so parents will have higher score than
> > > > the layer of their children and the larger the sub-hierarchy the more
> > > > biased it will become. Say you have
> > > > root
> > > > /\
> > > > / \
> > > > A D
> > > > / \
> > > > B C
> > > >
> > > > B (5), C(15) thus A(20) and D(20). Unless I am missing something we are
> > > > going to go down A path and then chose C even though D is the largest
> > > > leaf group, right?
> > >
> > > You're right, changelog is not accurate, I'll fix it.
> > > The behavior is correct, IMO.
> >
> > Please explain why. This is really a non-intuitive semantic. Why should
> > larger hierarchies be punished more than shallow ones? I would
> > completely agree if the whole hierarchy would be a killable entity (aka
> > A would be kill-all).
>
> I think it's a reasonable and clear policy: we're looking for a memcg
> with the smallest oom_priority and largest memory footprint recursively.

But this can get really complex for non-trivial setups. Anything with
deeper and larger hierarchies will get quite complex IMHO.

Btw. do you have any specific usecase for the priority-based oom
killer? I remember David was asking for this because it _would_ be
useful, but you didn't have it initially. And I agree with that; I am
just not sure the semantic is thought through very well. I am wondering
whether it would be easier to push this forward without the priority
thing for now and add it later, with a clear example of the configuration,
how it should work, and a more thought-through semantic. Would that sound
acceptable? I believe the rest is quite useful to get merged on its own.

> Then we reclaim some memory from it (by killing the biggest process
> or all processes, depending on memcg preferences).
>
> In general, if there are two memcgs of equal importance (which is defined
> by setting the oom_priority), we're choosing the largest, because there
> are more chances that it contain a leaking process. The same is true
> right now for processes.

Yes, except this is not the case, as shown above. We can easily select a
smaller leaf memcg just because it is in a larger hierarchy, and that
sounds very dubious to me. Even when all the priorities are the same.

> I agree, that for size-based comparison we could use a different policy:
> comparing leaf cgroups despite their level. But I don't see a clever
> way to apply oom_priorities in this case. Comparing oom_priority
> on each level is a simple and powerful policy, and it works well
> for delegation.

You are already shaping the semantic around the implementation, and that
is a clear sign of a problem.

> > [...]
> > > > I do not understand why do we have to handle root cgroup specially here.
> > > > select_victim_memcg already iterates all memcgs in the oom hierarchy
> > > > (including root) so if the root memcg is the largest one then we
> > > > should simply consider it no?
> > >
> > > We don't have necessary stats for the root cgroup, so we can't calculate
> > > it's oom_score.
> >
> > We used to charge pages to the root memcg as well so we might resurrect
> > that idea. In any case this is something that could be hidden in
> > memcg_oom_badness rather then special cased somewhere else.
>
> In theory I agree, but I do not see a good way to calculate root memcg
> oom_score.

Why can't you emulate that with the largest task in the root, the same
way you actually do in select_victim_root_cgroup_task now?
--
Michal Hocko
SUSE Labs

2017-09-06 13:42:18

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Wed, Sep 06, 2017 at 03:22:49PM +0200, Michal Hocko wrote:
> On Wed 06-09-17 13:57:50, Roman Gushchin wrote:
> > On Wed, Sep 06, 2017 at 10:31:58AM +0200, Michal Hocko wrote:
> > > On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> > > > On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Hmm. The changelog says "By default, it will look for the biggest leaf
> > > > > cgroup, and kill the largest task inside." But you are accumulating
> > > > > oom_score up the hierarchy and so parents will have higher score than
> > > > > the layer of their children and the larger the sub-hierarchy the more
> > > > > biased it will become. Say you have
> > > > > root
> > > > > /\
> > > > > / \
> > > > > A D
> > > > > / \
> > > > > B C
> > > > >
> > > > > B (5), C(15) thus A(20) and D(20). Unless I am missing something we are
> > > > > going to go down A path and then chose C even though D is the largest
> > > > > leaf group, right?
> > > >
> > > > You're right, changelog is not accurate, I'll fix it.
> > > > The behavior is correct, IMO.
> > >
> > > Please explain why. This is really a non-intuitive semantic. Why should
> > > larger hierarchies be punished more than shallow ones? I would
> > > completely agree if the whole hierarchy would be a killable entity (aka
> > > A would be kill-all).
> >
> > I think it's a reasonable and clear policy: we're looking for a memcg
> > with the smallest oom_priority and largest memory footprint recursively.
>
> But this can get really complex for non-trivial setups. Anything with
> deeper and larger hierarchies will get quite complex IMHO.
>
> Btw. do you have any specific usecase for the priority based oom
> killer? I remember David was asking for this because it _would_ be
> useful but you didn't have it initially. And I agree with that I am
> just not sure the semantic is thought through wery well. I am thinking
> whether it would be easier to push this further without priority thing
> for now and add it later with a clear example of the configuration and
> how it should work and a more thought through semantic. Would that sound
> acceptable? I believe the rest is quite useful to get merged on its own.

Any way to set up which memcgs should be killed first,
and which last, will more or less suit me.
Initially I had something similar to a per-cgroup oom_score_adj,
but I really like David's idea here.
It's just much simpler and also more powerful:
clear semantics and a long priority range will allow implementing
policies in userspace.

All priority-related stuff except the docs is already separated out.
Of course, I can split the docs too.

That said, I don't think the whole thing is useful without any way
to adjust the memcg selection, so we can't postpone it for too long.
Anyway, if you think this is the way to go forward, let's do it.

>
> > Then we reclaim some memory from it (by killing the biggest process
> > or all processes, depending on memcg preferences).
> >
> > In general, if there are two memcgs of equal importance (which is defined
> > by setting the oom_priority), we're choosing the largest, because there
> > are more chances that it contain a leaking process. The same is true
> > right now for processes.
>
> Yes except this is not the case as shown above. We can easily select a
> smaller leaf memcg just because it is in a larger hierarchy and that
> sounds very dubious to me. Even when all the priorities are the same.
>
> > I agree, that for size-based comparison we could use a different policy:
> > comparing leaf cgroups despite their level. But I don't see a clever
> > way to apply oom_priorities in this case. Comparing oom_priority
> > on each level is a simple and powerful policy, and it works well
> > for delegation.
>
> You are already shaping semantic around the implementation and that is a
> clear sign of problem.
>
> > > [...]
> > > > > I do not understand why do we have to handle root cgroup specially here.
> > > > > select_victim_memcg already iterates all memcgs in the oom hierarchy
> > > > > (including root) so if the root memcg is the largest one then we
> > > > > should simply consider it no?
> > > >
> > > > We don't have necessary stats for the root cgroup, so we can't calculate
> > > > it's oom_score.
> > >
> > > We used to charge pages to the root memcg as well so we might resurrect
> > > that idea. In any case this is something that could be hidden in
> > > memcg_oom_badness rather then special cased somewhere else.
> >
> > In theory I agree, but I do not see a good way to calculate root memcg
> > oom_score.
>
> Why cannot you emulate that by the largest task in the root? The same
> way you actually do in select_victim_root_cgroup_task now?

Hm, this sounds good! Then I can apply the same policy to the root memcg,
treating it as a level-1 leaf memcg.

Thanks!

2017-09-06 14:10:39

by Michal Hocko

[permalink] [raw]
Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Wed 06-09-17 14:41:42, Roman Gushchin wrote:
[...]
> Although, I don't think the whole thing is useful without any way
> to adjust the memcg selection, so we can't postpone if for too long.
> Anyway, if you think it's a way to go forward, let's do it.

I am not really sure we are in a rush here. The whole oom_score_adj
fiasco has shown that most users only care about "never kill this
and that". A better fine-tuned oom control sounds useful at first, but
apart from very special usecases it turns out to be very impractical to
set up. At least that is my experience. There are special cases of
course, but we should target general use first.

Killing the whole memcg is a really useful feature on its own for proper
container cleanup.
--
Michal Hocko
SUSE Labs

2017-09-06 17:41:19

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Wed, Sep 06, 2017 at 10:42:42AM +0200, Michal Hocko wrote:
> On Tue 05-09-17 20:16:09, Roman Gushchin wrote:
> > On Tue, Sep 05, 2017 at 05:12:51PM +0200, Michal Hocko wrote:
> [...]
> > > > Then we should probably hide corresponding
> > > > cgroup interface (oom_group and oom_priority knobs) by default,
> > > > and it feels as unnecessary complication and is overall against
> > > > cgroup v2 interface design.
> > >
> > > Why. If we care enough, we could simply return EINVAL when those knobs
> > > are written while the corresponding strategy is not used.
> >
> > It doesn't look like a nice default interface.
>
> I do not have a strong opinion on this. A printk_once could explain why
> the knob is ignored and instruct the admin how to enable the feature
> completely.
>
> > > > > I think we should instead go with
> > > > > oom_strategy=[alloc_task,biggest_task,cgroup]
> > > >
> > > > It would be a really nice interface; although I've no idea how to implement it:
> > > > "alloc_task" is an existing sysctl, which we have to preserve;
> > >
> > > I would argue that we should simply deprecate and later drop the sysctl.
> > > I _strongly_ doubt anybody is using this. If yes it is not that hard
> > > to change the kernel command line rather than select the sysctl.
> >
> > I agree. And if so, why do we need a new interface for a useless feature?
>
> Well, I won't be opposed to just deprecating the sysctl and only adding a
> "real" kill-allocate strategy if somebody explicitly asks for it.


I think we should select this approach.
Let's check that nobody actually uses it.

Thanks!

--

From f6e2339926a07500834d86548f3f116af7335d71 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <[email protected]>
Date: Wed, 6 Sep 2017 17:43:44 +0100
Subject: [PATCH] mm, oom: first step towards oom_kill_allocating_task
deprecation

The oom_kill_allocating_task sysctl which causes the OOM killer
to simply kill the allocating task is useless. Killing a random
task is not the best idea.

Nobody likes it, and hopefully nobody uses it.
We want to completely deprecate it at some point.

To make a first step towards deprecation, let's warn potential
users about deprecation plans.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sysctl.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 655686d546cb..9158f1980584 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -220,6 +220,17 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,

#endif

+static int proc_oom_kill_allocating_tasks(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ pr_warn_once("The oom_kill_allocating_task sysctl will be deprecated.\n"
+ "If you're using it, please, report to "
+ "[email protected].\n");
+
+ return proc_dointvec(table, write, buffer, lenp, ppos);
+}
+
static struct ctl_table kern_table[];
static struct ctl_table vm_table[];
static struct ctl_table fs_table[];
@@ -1235,7 +1246,7 @@ static struct ctl_table vm_table[] = {
.data = &sysctl_oom_kill_allocating_task,
.maxlen = sizeof(sysctl_oom_kill_allocating_task),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = proc_oom_kill_allocating_tasks,
},
{
.procname = "oom_dump_tasks",
--
2.13.5
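The handler in the patch follows a common warn-once-and-delegate shape. A userspace sketch of the same pattern (names are illustrative stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stdio.h>

/* Sketch of the pr_warn_once() + delegate pattern from the patch above:
 * the first call through the wrapper emits the deprecation warning
 * exactly once; every call still reaches the original handler, so the
 * sysctl's behavior is unchanged. */
static int warned;

static int real_handler(int val)
{
	return val;	/* stands in for proc_dointvec() */
}

static int deprecated_handler(int val)
{
	if (!warned) {
		warned = 1;
		fprintf(stderr, "sysctl will be deprecated\n");
	}
	return real_handler(val);
}
```

The point of the pattern is that deprecation is announced without altering semantics, so existing users keep working while the warning surfaces in their logs.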

2017-09-06 17:59:25

by Michal Hocko

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Wed 06-09-17 18:40:43, Roman Gushchin wrote:
[...]
> From f6e2339926a07500834d86548f3f116af7335d71 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <[email protected]>
> Date: Wed, 6 Sep 2017 17:43:44 +0100
> Subject: [PATCH] mm, oom: first step towards oom_kill_allocating_task
> deprecation
>
> The oom_kill_allocating_task sysctl which causes the OOM killer
> to simply kill the allocating task is useless. Killing a random

useless is quite strong ;) I would say dubious.

> task is not the best idea.
>
> Nobody likes it, and hopefully nobody uses it.
> We want to completely deprecate it at some point.
>
> To make a first step towards deprecation, let's warn potential
> users about deprecation plans.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

Other than that
Acked-by: Michal Hocko <[email protected]>

> ---
> kernel/sysctl.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 655686d546cb..9158f1980584 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -220,6 +220,17 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>
> #endif
>
> +static int proc_oom_kill_allocating_tasks(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos)
> +{
> + pr_warn_once("The oom_kill_allocating_task sysctl will be deprecated.\n"
> + "If you're using it, please, report to "
> + "[email protected].\n");
> +
> + return proc_dointvec(table, write, buffer, lenp, ppos);
> +}
> +
> static struct ctl_table kern_table[];
> static struct ctl_table vm_table[];
> static struct ctl_table fs_table[];
> @@ -1235,7 +1246,7 @@ static struct ctl_table vm_table[] = {
> .data = &sysctl_oom_kill_allocating_task,
> .maxlen = sizeof(sysctl_oom_kill_allocating_task),
> .mode = 0644,
> - .proc_handler = proc_dointvec,
> + .proc_handler = proc_oom_kill_allocating_tasks,
> },
> {
> .procname = "oom_dump_tasks",
> --
> 2.13.5

--
Michal Hocko
SUSE Labs

2017-09-06 20:59:47

by David Rientjes

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Wed, 6 Sep 2017, Roman Gushchin wrote:

> From f6e2339926a07500834d86548f3f116af7335d71 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <[email protected]>
> Date: Wed, 6 Sep 2017 17:43:44 +0100
> Subject: [PATCH] mm, oom: first step towards oom_kill_allocating_task
> deprecation
>
> The oom_kill_allocating_task sysctl which causes the OOM killer
> to simply kill the allocating task is useless. Killing a random
> task is not the best idea.
>
> Nobody likes it, and hopefully nobody uses it.
> We want to completely deprecate it at some point.
>

SGI required it when it was introduced simply to avoid the very expensive
tasklist scan. Adding Christoph Lameter to the cc since he was involved
back then.

I attempted to deprecate the old /proc/pid/oom_adj in this same manner; we
warned about it for over a year and then finally removed it, one person
complained of breakage, and it was reverted with a strict policy that
Linux doesn't break userspace.

Although it would be good to do, I'm not sure that this is possible unless
it can be shown nobody is using it. Talking to SGI would be the first
step.

I'm not sure what this has to do with the overall patchset though :)

> To make a first step towards deprecation, let's warn potential
> users about deprecation plans.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sysctl.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 655686d546cb..9158f1980584 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -220,6 +220,17 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>
> #endif
>
> +static int proc_oom_kill_allocating_tasks(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos)
> +{
> + pr_warn_once("The oom_kill_allocating_task sysctl will be deprecated.\n"
> + "If you're using it, please, report to "
> + "[email protected].\n");
> +
> + return proc_dointvec(table, write, buffer, lenp, ppos);
> +}
> +
> static struct ctl_table kern_table[];
> static struct ctl_table vm_table[];
> static struct ctl_table fs_table[];
> @@ -1235,7 +1246,7 @@ static struct ctl_table vm_table[] = {
> .data = &sysctl_oom_kill_allocating_task,
> .maxlen = sizeof(sysctl_oom_kill_allocating_task),
> .mode = 0644,
> - .proc_handler = proc_dointvec,
> + .proc_handler = proc_oom_kill_allocating_tasks,
> },
> {
> .procname = "oom_dump_tasks",

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Wed, 6 Sep 2017, David Rientjes wrote:

> > The oom_kill_allocating_task sysctl which causes the OOM killer
> > to simply kill the allocating task is useless. Killing a random
> > task is not the best idea.
> >
> > Nobody likes it, and hopefully nobody uses it.
> > We want to completely deprecate it at some point.
> >
>
> SGI required it when it was introduced simply to avoid the very expensive
> tasklist scan. Adding Christoph Lameter to the cc since he was involved
> back then.

Really? From what I know and worked on way back when: The reason was to be
able to contain the affected application in a cpuset. Multiple apps may
have been running in multiple cpusets on a large NUMA machine and the OOM
condition in one cpuset should not affect the other. It also helped to
isolate the application behavior causing the oom in numerous cases.

Doesn't this requirement transfer to cgroups in the same way?

Left SGI in 2008 so adding Dimitri who may know about the current
situation. Robin Holt also left SGI as far as I know.


2017-09-07 14:53:17

by Roman Gushchin

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Thu, Sep 07, 2017 at 09:43:30AM -0500, Christopher Lameter wrote:
> On Wed, 6 Sep 2017, David Rientjes wrote:
>
> > > The oom_kill_allocating_task sysctl which causes the OOM killer
> > > to simply kill the allocating task is useless. Killing a random
> > > task is not the best idea.
> > >
> > > Nobody likes it, and hopefully nobody uses it.
> > > We want to completely deprecate it at some point.
> > >
> >
> > SGI required it when it was introduced simply to avoid the very expensive
> > tasklist scan. Adding Christoph Lameter to the cc since he was involved
> > back then.
>
> Really? From what I know and worked on way back when: The reason was to be
> able to contain the affected application in a cpuset. Multiple apps may
> have been running in multiple cpusets on a large NUMA machine and the OOM
> condition in one cpuset should not affect the other. It also helped to
> isolate the application behavior causing the oom in numerous cases.
>
> Doesn't this requirement transfer to cgroups in the same way?

We have per-node memory stats and plan to use them during the OOM victim
selection. Hopefully it can help.

>
> Left SGI in 2008 so adding Dimitri who may know about the current
> situation. Robin Holt also left SGI as far as I know.

Thanks!
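One way the per-node stats Roman mentions could feed victim selection, sketched with hypothetical names (not actual kernel code): when an allocation is constrained to a node set, only the memory a memcg holds on those nodes counts toward its score, so killing it is likely to free usable memory.

```c
#include <assert.h>

#define MAX_NODES 4

/* Illustrative per-node accounting: pages_on_node[] stands in for the
 * per-node memcg stats discussed above. */
struct memcg_stats {
	long pages_on_node[MAX_NODES];
};

/* Badness of a memcg for an allocation constrained to allowed_mask
 * (one bit per NUMA node): only memory on allowed nodes is counted. */
static long constrained_badness(const struct memcg_stats *mc,
				unsigned long allowed_mask)
{
	long score = 0;

	for (int nid = 0; nid < MAX_NODES; nid++)
		if (allowed_mask & (1UL << nid))
			score += mc->pages_on_node[nid];
	return score;
}
```

A memcg with all of its memory outside the allowed node set scores zero and would never be picked for such a constrained OOM, which addresses Christoph's concern.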

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Thu, 7 Sep 2017, Roman Gushchin wrote:

> > Really? From what I know and worked on way back when: The reason was to be
> > able to contain the affected application in a cpuset. Multiple apps may
> > have been running in multiple cpusets on a large NUMA machine and the OOM
> > condition in one cpuset should not affect the other. It also helped to
> > isolate the application behavior causing the oom in numerous cases.
> >
> > Doesn't this requirement transfer to cgroups in the same way?
>
> We have per-node memory stats and plan to use them during the OOM victim
> selection. Hopefully it can help.

One of the OOM causes could be that memory was restricted to a certain
node set. Killing the allocating task is (was?) the default behavior in that
case so that the task that has the restrictions is killed, not any task
that may not have the restrictions and would not experience OOM.

2017-09-07 16:15:15

by Johannes Weiner

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Wed, Sep 06, 2017 at 10:28:59AM +0200, Michal Hocko wrote:
> On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> > The cgroup-awareness in the OOM killer is exactly the same thing. It
> > should have been the default from the beginning, because the user
> > configures a group of tasks to be an interdependent, terminal unit of
> > memory consumption, and it's undesirable for the OOM killer to ignore
> > this intention and compare members across these boundaries.
>
> I would agree if that was true in general. I can completely see how the
> cgroup awareness is useful in e.g. containerized environments (especially
> with kill-all enabled) but memcgs are used in a large variety of
> usecases and I cannot really say all of them really demand the new
> semantic. Say I have a workload which doesn't want to see reclaim
> interference from others on the same machine. Why should I kill a
> process from that particular memcg just because it is the largest one
> when there is a memory hog/leak outside of this memcg?

Sure, it's always possible to come up with a config for which this
isn't the optimal behavior. But this is about picking a default that
makes sense to most users, and that type of cgroup usage just isn't
the common case.

> From my point of view the safest (in a sense of the least surprise)
> way is to go with opt-in for the new heuristic. I am pretty sure all who
> would benefit from the new behavior will enable it while others will not
> regress in unexpected way.

This thinking simply needs to be balanced against the need to make an
unsurprising and consistent final interface.

The current behavior breaks isolation by letting tasks in different
cgroups compete with each other during an OOM kill. While you can
rightfully argue that it's possible for usecases to rely on this, you
cannot tell me that this is the least-surprising thing we can offer
users; certainly not new users, but also not many/most existing ones.

> We can talk about the way _how_ to control these oom strategies, of
> course. But I would be really reluctant to change the default which is
> used for years and people got used to it.

I really doubt there are many cgroup users that rely on that
particular global OOM behavior.

We have to agree to disagree, I guess.

Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Mon, 4 Sep 2017, Roman Gushchin wrote:

> To address these issues, cgroup-aware OOM killer is introduced.

You are missing a major issue here. Processes may have allocation
constraints to memory nodes, special DMA zones etc etc. OOM conditions on
such resource-constrained allocations need to be dealt with. Killing
processes that do not allocate with the same restrictions may not do
anything to improve conditions.

> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it is selected
> as an OOM victim, all belonging tasks will be killed.

Sounds good in general. Unless the cgroup or processes therein run out of
memory due to memory access restrictions. How do you detect that, and how
is it handled?

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Tue, 5 Sep 2017, Michal Hocko wrote:

> I would argue that we should simply deprecate and later drop the sysctl.
> I _strongly_ doubt anybody is using this. If yes it is not that hard
> to change the kernel command line rather than select the sysctl. The
> deprecation process would be
> - warn when somebody writes to the sysctl and check both boot
> and sysctl values
> [ wait some time ]
> - keep the sysctl but return EINVAL
> [ wait some time ]
> - remove the sysctl

Note that the behavior that would be enabled by the sysctl is the default
behavior for the case of a constrained allocation. If a process does an
mbind to numa node 3 and it runs out of memory then that process should be
killed and the rest is fine.

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Wed, 6 Sep 2017, Michal Hocko wrote:

> I am not sure this is how things evolved actually. This is way before
> my time so my git log interpretation might be imprecise. We do have
> oom_badness heuristic since out_of_memory has been introduced and
> oom_kill_allocating_task has been introduced much later because of large
> boxes with zillions of tasks (SGI I suspect) which took too long to
> select a victim so David has added this heuristic.

Nope. The logic was required for tasks that run out of memory when the
restriction on the allocation did not allow the use of all of memory.
cpuset restrictions and memory policy restrictions were the prime
considerations at the time.

It has *nothing* to do with zillions of tasks. It's amusing that the SGI
ghost is still haunting the discussion here. The company died a couple of
years ago finally (ok somehow HP has an "SGI" brand now I believe). But
there are multiple companies that have large NUMA configurations and they
all have configurations where they want to restrict allocations of a
process to subset of system memory. This is even more important now that
we get new forms of memory (NVDIMM, PCI-E device memory etc). You need to
figure out what to do with allocations that fail because the *allowed*
memory pools are empty.


2017-09-07 16:43:28

by Roman Gushchin

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Thu, Sep 07, 2017 at 10:03:24AM -0500, Christopher Lameter wrote:
> On Thu, 7 Sep 2017, Roman Gushchin wrote:
>
> > > Really? From what I know and worked on way back when: The reason was to be
> > > able to contain the affected application in a cpuset. Multiple apps may
> > > have been running in multiple cpusets on a large NUMA machine and the OOM
> > > condition in one cpuset should not affect the other. It also helped to
> > > isolate the application behavior causing the oom in numerous cases.
> > >
> > > Doesn't this requirement transfer to cgroups in the same way?
> >
> > We have per-node memory stats and plan to use them during the OOM victim
> > selection. Hopefully it can help.
>
> One of the OOM causes could be that memory was restricted to a certain
> node set. Killing the allocating task is (was?) the default behavior in that
> case so that the task that has the restrictions is killed, not any task
> that may not have the restrictions and would not experience OOM.

As far as I can see, it's not the default behavior these days. If we have
a way to select a victim among the memcgs/tasks which are actually using
the corresponding type of memory, it's much better than killing
the allocating task.

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Thu, 7 Sep 2017, Roman Gushchin wrote:

> On Thu, Sep 07, 2017 at 10:03:24AM -0500, Christopher Lameter wrote:
> > On Thu, 7 Sep 2017, Roman Gushchin wrote:
> >
> > > > Really? From what I know and worked on way back when: The reason was to be
> > > > able to contain the affected application in a cpuset. Multiple apps may
> > > > have been running in multiple cpusets on a large NUMA machine and the OOM
> > > > condition in one cpuset should not affect the other. It also helped to
> > > > isolate the application behavior causing the oom in numerous cases.
> > > >
> > > > Doesn't this requirement transfer to cgroups in the same way?
> > >
> > > We have per-node memory stats and plan to use them during the OOM victim
> > > selection. Hopefully it can help.
> >
> > One of the OOM causes could be that memory was restricted to a certain
> > node set. Killing the allocating task is (was?) the default behavior in that
> > case so that the task that has the restrictions is killed, not any task
> > that may not have the restrictions and would not experience OOM.
>
> As far as I can see, it's not the default behavior these days. If we have
> a way to select a victim among the memcgs/tasks which are actually using
> the corresponding type of memory, it's much better than killing
> the allocating task.

Killing the whole set of processes constituting an app in a cgroup or so
sounds good to me.

2017-09-07 21:56:01

by David Rientjes

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Thu, 7 Sep 2017, Christopher Lameter wrote:

> > SGI required it when it was introduced simply to avoid the very expensive
> > tasklist scan. Adding Christoph Lameter to the cc since he was involved
> > back then.
>
> Really? From what I know and worked on way back when: The reason was to be
> able to contain the affected application in a cpuset. Multiple apps may
> have been running in multiple cpusets on a large NUMA machine and the OOM
> condition in one cpuset should not affect the other. It also helped to
> isolate the application behavior causing the oom in numerous cases.
>
> Doesn't this requirement transfer to cgroups in the same way?
>
> Left SGI in 2008 so adding Dimitri who may know about the current
> situation. Robin Holt also left SGI as far as I know.
>

It may have been Paul Jackson, but I remember the oom_kill_allocating_task
knob being required because the oom killer was very slow due to the very
lengthy iteration of the tasklist. It would be helpful if someone from SGI could
confirm whether or not they actively use this sysctl.

2017-09-07 22:03:31

by David Rientjes

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Thu, 7 Sep 2017, Christopher Lameter wrote:

> > I am not sure this is how things evolved actually. This is way before
> > my time so my git log interpretation might be imprecise. We do have
> > oom_badness heuristic since out_of_memory has been introduced and
> > oom_kill_allocating_task has been introduced much later because of large
> > boxes with zillions of tasks (SGI I suspect) which took too long to
> > select a victim so David has added this heuristic.
>
> Nope. The logic was required for tasks that run out of memory when the
> restriction on the allocation did not allow the use of all of memory.
> cpuset restrictions and memory policy restrictions were the prime
> considerations at the time.
>
> It has *nothing* to do with zillions of tasks. It's amusing that the SGI
> ghost is still haunting the discussion here. The company died a couple of
> years ago finally (ok somehow HP has an "SGI" brand now I believe). But
> there are multiple companies that have large NUMA configurations and they
> all have configurations where they want to restrict allocations of a
> process to subset of system memory. This is even more important now that
> we get new forms of memory (NVDIMM, PCI-E device memory etc). You need to
> figure out what to do with allocations that fail because the *allowed*
> memory pools are empty.
>

We already had CONSTRAINT_CPUSET at the time, this was requested by Paul
and acked by him in https://marc.info/?l=linux-mm&m=118306851418425.

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Thu, 7 Sep 2017, David Rientjes wrote:

> > It has *nothing* to do with zillions of tasks. It's amusing that the SGI
> > ghost is still haunting the discussion here. The company died a couple of
> > years ago finally (ok somehow HP has an "SGI" brand now I believe). But
> > there are multiple companies that have large NUMA configurations and they
> > all have configurations where they want to restrict allocations of a
> > process to subset of system memory. This is even more important now that
> > we get new forms of memory (NVDIMM, PCI-E device memory etc). You need to
> > figure out what to do with allocations that fail because the *allowed*
> > memory pools are empty.
> >
>
> We already had CONSTRAINT_CPUSET at the time, this was requested by Paul
> and acked by him in https://marc.info/?l=linux-mm&m=118306851418425.

Ok. Certainly there were scalability issues (lots of them) and the sysctl
may have helped there if set globally. But the ability to kill the
allocating tasks was primarily used in cpusets for constrained allocation.

The issue of scaling is irrelevant in the context of deciding what to do
about the sysctl. You can address the issue differently if it still
exists. The systems with super-high NUMA node counts (hundreds to a
thousand) have somehow fallen out of fashion a bit. So I doubt that this
is still an issue. And none of the old stakeholders are speaking up.

What is the current approach for an OOM occurring in a cpuset or cgroup
with a restricted numa node set?

2017-09-09 08:46:00

by David Rientjes

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Fri, 8 Sep 2017, Christopher Lameter wrote:

> Ok. Certainly there were scalability issues (lots of them) and the sysctl
> may have helped there if set globally. But the ability to kill the
> allocating tasks was primarily used in cpusets for constrained allocation.
>

I remember discussing it with him and he had some data with pretty extreme
numbers for how long the tasklist iteration was taking. Regardless, I
agree it's not pertinent to the discussion if anybody is actively using
the sysctl, just fun to try to remember the discussions from 10 years ago.

The problem I'm having with the removal, though, is that the kernel source
actually uses it itself in tools/testing/fault-injection/failcmd.sh.
That, to me, suggests there are people outside the kernel source who
probably also use it. We use it as part of our unit testing, although we
could convert away from it.

These are things that can probably be worked around, but I'm struggling to
see the whole benefit of it. It's only defined, there's generic sysctl
handling, and there's a single conditional in the oom killer. I wouldn't
risk the potential userspace breakage.

> The issue of scaling is irrelevant in the context of deciding what to do
> about the sysctl. You can address the issue differently if it still
> exists. The systems with super-high NUMA node counts (hundreds to a
> thousand) have somehow fallen out of fashion a bit. So I doubt that this
> is still an issue. And none of the old stakeholders are speaking up.
>
> What is the current approach for an OOM occurring in a cpuset or cgroup
> with a restricted numa node set?
>

It's always been shaky; we simply exclude potential kill victims based on
whether or not they share mempolicy nodes or cpuset mems with the
allocating process. Of course, this could result in no memory freeing
because a potential victim being allowed to allocate on a particular node
right now doesn't mean killing it will free memory on that node. It's
just more probable in practice. Nobody has complained about that
methodology, but we do have internal code that simply kills current for
mempolicy ooms. That is because we have priority based oom killing much
like this patchset implements and then extends it even further to
processes.
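The exclusion David describes can be modeled as a node-mask intersection test. This is a sketch with hypothetical names, not the kernel's actual helpers:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long nodemask_t;	/* one bit per NUMA node */

/* Model of the victim-exclusion rule described above: for a constrained
 * OOM, a task is only a kill candidate if its allowed node set
 * intersects the nodes the failing allocation could use.  Killing a
 * task with no access to those nodes is unlikely to free usable memory. */
static bool oom_candidate(nodemask_t task_allowed, nodemask_t alloc_nodes)
{
	return (task_allowed & alloc_nodes) != 0;
}
```

As the message notes, overlap is only a heuristic: being allowed to allocate on a node does not guarantee the task actually holds freeable memory there.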

2017-09-11 08:49:37

by Michal Hocko

Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

On Thu 07-09-17 11:18:18, Christopher Lameter wrote:
> On Mon, 4 Sep 2017, Roman Gushchin wrote:
>
> > To address these issues, cgroup-aware OOM killer is introduced.
>
> You are missing a major issue here. Processes may have allocation
> constraints to memory nodes, special DMA zones etc etc. OOM conditions on
> such resource-constrained allocations need to be dealt with. Killing
> processes that do not allocate with the same restrictions may not do
> anything to improve conditions.

memcg_oom_badness tries to be node aware - very similar to what the
oom_badness does for the regular oom killer. Or do you have anything
else in mind?
--
Michal Hocko
SUSE Labs

2017-09-11 09:06:05

by Michal Hocko

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Thu 07-09-17 12:14:57, Johannes Weiner wrote:
> On Wed, Sep 06, 2017 at 10:28:59AM +0200, Michal Hocko wrote:
> > On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> > > The cgroup-awareness in the OOM killer is exactly the same thing. It
> > > should have been the default from the beginning, because the user
> > > configures a group of tasks to be an interdependent, terminal unit of
> > > memory consumption, and it's undesirable for the OOM killer to ignore
> > > this intention and compare members across these boundaries.
> >
> > I would agree if that was true in general. I can completely see how the
> > cgroup awareness is useful in e.g. containerized environments (especially
> > with kill-all enabled) but memcgs are used in a large variety of
> > usecases and I cannot really say all of them really demand the new
> > semantic. Say I have a workload which doesn't want to see reclaim
> > interference from others on the same machine. Why should I kill a
> > process from that particular memcg just because it is the largest one
> > when there is a memory hog/leak outside of this memcg?
>
> Sure, it's always possible to come up with a config for which this
> isn't the optimal behavior. But this is about picking a default that
> makes sense to most users, and that type of cgroup usage just isn't
> the common case.

How can you tell, really? Even if cgroup2 is a new interface we still
want as many legacy (v1) users as possible to migrate to the new hierarchy.
I have seen quite different usecases over time and I have a hard time
telling which of them to call common enough.

> > From my point of view the safest (in a sense of the least surprise)
> > way is to go with opt-in for the new heuristic. I am pretty sure all who
> > would benefit from the new behavior will enable it while others will not
> > regress in unexpected way.
>
> This thinking simply needs to be balanced against the need to make an
> unsurprising and consistent final interface.

Sure. And I _think_ we can come up with a clear interface to configure
the oom behavior - e.g. a kernel command line parameter with a default
based on a config option.

> The current behavior breaks isolation by letting tasks in different
> cgroups compete with each other during an OOM kill. While you can
> rightfully argue that it's possible for usecases to rely on this, you
> cannot tell me that this is the least-surprising thing we can offer
> users; certainly not new users, but also not many/most existing ones.

I would argue that a global OOM has been always a special case and
people got used to "kill the largest task" strategy. I have seen
multiple reports where people were complaining when this wasn't the case
(e.g. when the NUMA policies were involved).

> > We can talk about the way _how_ to control these oom strategies, of
> > course. But I would be really reluctant to change the default which is
> > used for years and people got used to it.
>
> I really doubt there are many cgroup users that rely on that
> particular global OOM behavior.
>
> We have to agree to disagree, I guess.

Yes, I am afraid so. And I do not hear that this is a feature so many
users have been asking for such a long time that we could simply say "yeah
everybody wants that, make it a default". And as such I do not see a reason why we
should enforce it on all users. It is really trivial to enable it when
it is considered useful.
--
Michal Hocko
SUSE Labs

2017-09-11 12:50:50

by Roman Gushchin

Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Mon, Sep 11, 2017 at 11:05:59AM +0200, Michal Hocko wrote:
> On Thu 07-09-17 12:14:57, Johannes Weiner wrote:
> > On Wed, Sep 06, 2017 at 10:28:59AM +0200, Michal Hocko wrote:
> > > On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> > > > The cgroup-awareness in the OOM killer is exactly the same thing. It
> > > > should have been the default from the beginning, because the user
> > > > configures a group of tasks to be an interdependent, terminal unit of
> > > > memory consumption, and it's undesirable for the OOM killer to ignore
> > > > this intention and compare members across these boundaries.
> > >
> > > I would agree if that was true in general. I can completely see how the
> > > cgroup awareness is useful in e.g. containerized environments (especially
> > > with kill-all enabled), but memcgs are used in a large variety of
> > > usecases and I cannot really say that all of them demand the new
> > > semantics. Say I have a workload which doesn't want to see reclaim
> > > interference from others on the same machine. Why should I kill a
> > > process from that particular memcg just because it is the largest one,
> > > when there is a memory hog/leak outside of this memcg?
> >
> > Sure, it's always possible to come up with a config for which this
> > isn't the optimal behavior. But this is about picking a default that
> > makes sense to most users, and that type of cgroup usage just isn't
> > the common case.
>
> How can you tell, really? Even if cgroup2 is a new interface, we still
> want as many legacy (v1) users as possible to be migrated to the new
> hierarchy. I have seen quite different usecases over time and have a
> hard time telling which of them is common enough.
>
> > > From my point of view, the safest way (in the sense of least
> > > surprise) is to go with opt-in for the new heuristic. I am pretty sure
> > > all who would benefit from the new behavior will enable it, while others
> > > will not regress in unexpected ways.
> >
> > This thinking simply needs to be balanced against the need to make an
> > unsurprising and consistent final interface.
>
> Sure. And I _think_ we can come up with a clear interface to configure
> the oom behavior - e.g. a kernel command line parameter with a default
> based on a config option.

I would say a cgroup v2 mount option is better, because it allows changing
the behavior dynamically (without rebooting) and clearly reflects the
cgroup v2 dependency.

Also, it makes systemd (or whoever is mounting cgroupfs) responsible for
the default behavior, and makes the kernel-side default more or less
unimportant.
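As a sketch, the mount-option approach lets whoever mounts cgroupfs pick
the behavior, roughly like this. The option name below is assumed for
illustration; the real name is defined by patch 5/5:

```shell
# Hypothetical option name, for illustration only.
# Opt out of the cgroup-aware OOM killer when mounting cgroup v2:
mount -t cgroup2 -o nogroupoom none /sys/fs/cgroup

# Or flip the behavior later, without a reboot:
mount -o remount,nogroupoom /sys/fs/cgroup
```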

Thanks!