2017-08-14 18:33:24

by Roman Gushchin

[permalink] [raw]
Subject: [v5 1/4] mm, oom: refactor the oom_kill_process() function

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
an intention to re-use the the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for the each task, as well as play
with task selection (considering task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 53b44425ef35..5c29a3dd591b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -817,67 +817,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
}

-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
{
- struct task_struct *p = oc->chosen;
- unsigned int points = oc->chosen_points;
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t;
+ struct task_struct *p;
struct mm_struct *mm;
- unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;

- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just set TIF_MEMDIE so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
- return;
- }
- task_unlock(p);
-
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
-
- pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
-
- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest oom_badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- read_lock(&tasklist_lock);
- for_each_thread(p, t) {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
-
- if (process_shares_mm(child, p->mm))
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child,
- oc->memcg, oc->nodemask, oc->totalpages);
- if (child_points > victim_points) {
- put_task_struct(victim);
- victim = child;
- victim_points = child_points;
- get_task_struct(victim);
- }
- }
- }
- read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -947,10 +892,72 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
wake_oom_reaper(victim);

mmdrop(mm);
- put_task_struct(victim);
}
#undef K

+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+ struct task_struct *p = oc->chosen;
+ unsigned int points = oc->chosen_points;
+ struct task_struct *victim = p;
+ struct task_struct *child;
+ struct task_struct *t;
+ unsigned int victim_points = 0;
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just set TIF_MEMDIE so it can die quickly
+ */
+ task_lock(p);
+ if (task_will_free_mem(p)) {
+ mark_oom_victim(p);
+ wake_oom_reaper(p);
+ task_unlock(p);
+ put_task_struct(p);
+ return;
+ }
+ task_unlock(p);
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, p);
+
+ pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ read_lock(&tasklist_lock);
+ for_each_thread(p, t) {
+ list_for_each_entry(child, &t->children, sibling) {
+ unsigned int child_points;
+
+ if (process_shares_mm(child, p->mm))
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child,
+ oc->memcg, oc->nodemask, oc->totalpages);
+ if (child_points > victim_points) {
+ put_task_struct(victim);
+ victim = child;
+ victim_points = child_points;
+ get_task_struct(victim);
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ __oom_kill_process(victim);
+ put_task_struct(victim);
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
2.13.5


2017-08-14 18:32:56

by Roman Gushchin

[permalink] [raw]
Subject: [v5 0/4] cgroup-aware OOM killer

This patchset makes the OOM killer cgroup-aware.

v5:
- Rebased on top of Michal Hocko's patches, which have changed the
way how OOM victims becoming an access to the memory
reserves. Dropped corresponding part of this patchset
- Separated the oom_kill_process() splitting into a standalone commit
- Added debug output (suggested by David Rientjes)
- Some minor fixes

v4:
- Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
- Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
- Memcg victim selection code is reworked, and
synchronization is based on finding tasks with OOM victim marker,
rather then on global counter
- Debug output is dropped
- Refactored TIF_MEMDIE usage

v3:
- Merged commits 1-4 into 6
- Separated oom_score_adj logic and debug output into separate commits
- Fixed swap accounting

v2:
- Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
- "Kill all tasks" is now an opt-in option, by default
only one process will be killed
- Added per-cgroup oom_score_adj
- Refined oom score calculations, suggested by Vladimir Davydov
- Converted to a patchset

v1:
https://lkml.org/lkml/2017/5/18/969

Roman Gushchin (4):
mm, oom: refactor the oom_kill_process() function
mm, oom: cgroup-aware OOM killer
mm, oom: introduce oom_priority for memory cgroups
mm, oom, docs: describe the cgroup-aware OOM killer

Documentation/cgroup-v2.txt | 62 +++++++++++
include/linux/memcontrol.h | 36 ++++++
include/linux/oom.h | 3 +
mm/memcontrol.c | 259 ++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 181 +++++++++++++++++++++----------
5 files changed, 484 insertions(+), 57 deletions(-)

--
2.13.5

2017-08-14 18:33:05

by Roman Gushchin

[permalink] [raw]
Subject: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

Update cgroups v2 docs.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/cgroup-v2.txt | 62 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index dec5afdaa36d..22108f31e09d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+ 5-2-4. Cgroup-aware OOM Killer
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1002,6 +1003,37 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.

+ memory.oom_kill_all_tasks
+
+ A read-write single value file which exits on non-root
+ cgroups. The default is "0".
+
+ Defines whether the OOM killer should treat the cgroup
+ as a single entity during the victim selection.
+
+ If set, OOM killer will kill all belonging tasks in
+ corresponding cgroup is selected as an OOM victim.
+
+ Be default, OOM killer respect /proc/pid/oom_score_adj value
+ -1000, and will never kill the task, unless oom_kill_all_tasks
+ is set.
+
+ memory.oom_priority
+
+ A read-write single value file which exits on non-root
+ cgroups. The default is "0".
+
+ An integer number within the [-10000, 10000] range,
+ which defines the order in which the OOM killer selects victim
+ memory cgroups.
+
+ OOM killer prefers memory cgroups with larger priority if they
+ are populated with elegible tasks.
+
+ The oom_priority value is compared within sibling cgroups.
+
+ The root cgroup has the oom_priority 0, which cannot be changed.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1206,6 +1238,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.


+Cgroup-aware OOM Killer
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats memory cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choise of a victim, hierarchically looking for the largest memory
+consumer. By default, it will look for the biggest task in the
+biggest leaf cgroup.
+
+Be default, all cgroups have oom_priority 0, and OOM killer will
+chose the largest cgroup recursively on each level. For non-root
+cgroups it's possible to change the oom_priority, and it will cause
+the OOM killer to look athe the priority value first, and compare
+sizes only of cgroups with equal priority.
+
+But a user can change this behavior by enabling the per-cgroup
+oom_kill_all_tasks option. If set, it causes the OOM killer treat
+the whole cgroup as an indivisible memory consumer. In case if it's
+selected as on OOM victim, all belonging tasks will be killed.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (e.g. leaf cgroups).
+The root cgroup doesn't support the oom_kill_all_tasks feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
IO
--

--
2.13.5

2017-08-14 18:33:43

by Roman Gushchin

[permalink] [raw]
Subject: [v5 2/4] mm, oom: cgroup-aware OOM killer

Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit well the system with many running
containers:

1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in a case of a system-wide OOM.

3) Per-process oom_score_adj affects global OOM, so it's a breache
in the isolation.

To address these issues, cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory consumer,
and free memory by killing corresponding task(s). The difference
the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers as well as single processes.

By default, it will look for the biggest leaf cgroup, and kill
the largest task inside.

But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer treat
the whole cgroup as an indivisible memory consumer. In case if it's
selected as on OOM victim, all belonging tasks will be killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 33 +++++++
include/linux/oom.h | 3 +
mm/memcontrol.c | 208 +++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 62 +++++++++++++-
4 files changed, 305 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8556f1b86d40..796666dc3282 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;

/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;

+ /* kill all tasks in the subtree in case of OOM */
+ bool oom_kill_all_tasks;
+
+ /* cached OOM score */
+ long oom_score;
+
/* handle for "memory.events" */
struct cgroup_file events_file;

@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}

+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)

@@ -480,6 +492,13 @@ static inline bool task_in_memcg_oom(struct task_struct *p)

bool mem_cgroup_oom_synchronize(bool wait);

+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
+static inline bool mem_cgroup_oom_kill_all_tasks(struct mem_cgroup *memcg)
+{
+ return memcg->oom_kill_all_tasks;
+}
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -743,6 +762,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}

+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -930,6 +953,16 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_kill_all_tasks(struct mem_cgroup *memcg)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */

/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2be5a6..b7ec3bd441be 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+ struct mem_cgroup *chosen_memcg;
};

extern struct mutex oom_lock;
@@ -79,6 +80,8 @@ extern void oom_killer_enable(void);

extern struct task_struct *find_lock_task_mm(struct task_struct *p);

+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df6f63ee95d6..0b81dc55c6ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2639,6 +2639,181 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}

+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ long points = 0;
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+ }
+
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+
+ return points;
+}
+
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int elegible = 0;
+
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ /*
+ * If there are no tasks, or all tasks have oom_score_adj set
+ * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
+ * don't select this memory cgroup.
+ */
+ if (!elegible &&
+ (memcg->oom_kill_all_tasks ||
+ task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
+ elegible = 1;
+
+ /*
+ * If there are previously selected OOM victims,
+ * abort memcg selection.
+ */
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+ elegible = -1;
+ break;
+ }
+ }
+ css_task_iter_end(&it);
+
+ return elegible > 0 ? memcg_oom_badness(memcg, nodemask) : elegible;
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+ struct mem_cgroup *iter, *parent;
+
+ for_each_mem_cgroup_tree(iter, root) {
+ if (memcg_has_children(iter)) {
+ iter->oom_score = 0;
+ continue;
+ }
+
+ iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
+ if (iter->oom_score == -1) {
+ oc->chosen_memcg = (void *)-1UL;
+ mem_cgroup_iter_break(root, iter);
+ return;
+ }
+
+ if (!iter->oom_score)
+ continue;
+
+ for (parent = parent_mem_cgroup(iter); parent && parent != root;
+ parent = parent_mem_cgroup(parent))
+ parent->oom_score += iter->oom_score;
+ }
+
+ for (;;) {
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *memcg = NULL;
+ long score = LONG_MIN;
+
+ css_for_each_child(css, &root->css) {
+ struct mem_cgroup *iter = mem_cgroup_from_css(css);
+
+ if (iter->oom_score > score) {
+ memcg = iter;
+ score = iter->oom_score;
+ }
+ }
+
+ if (!memcg) {
+ if (oc->memcg && root == oc->memcg) {
+ oc->chosen_memcg = oc->memcg;
+ css_get(&oc->chosen_memcg->css);
+ oc->chosen_points = oc->memcg->oom_score;
+ }
+ break;
+ }
+
+ if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) {
+ oc->chosen_memcg = memcg;
+ css_get(&oc->chosen_memcg->css);
+ oc->chosen_points = score;
+ break;
+ }
+
+ root = memcg;
+ }
+}
+
+static void select_victim_root_cgroup_task(struct oom_control *oc)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int ret = 0;
+
+ css_task_iter_start(&root_mem_cgroup->css, 0, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = oom_evaluate_task(task, oc);
+ css_task_iter_end(&it);
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct mem_cgroup *root = root_mem_cgroup;
+
+ oc->chosen = NULL;
+ oc->chosen_memcg = NULL;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ if (oc->memcg)
+ root = oc->memcg;
+
+ rcu_read_lock();
+
+ select_victim_memcg(root, oc);
+ if (oc->chosen_memcg == (void *)-1UL) {
+ /* Existing OOM victims are found. */
+ rcu_read_unlock();
+ return true;
+ }
+
+ /*
+ * For system-wide OOMs we should consider tasks in the root cgroup
+ * with oom_score larger than oc->chosen_points.
+ */
+ if (!oc->memcg) {
+ select_victim_root_cgroup_task(oc);
+
+ if (oc->chosen && oc->chosen_memcg) {
+ /*
+ * If we've decided to kill a task in the root memcg,
+ * release chosen_memcg.
+ */
+ css_put(&oc->chosen_memcg->css);
+ oc->chosen_memcg = NULL;
+ }
+ }
+
+ rcu_read_unlock();
+
+ return !!oc->chosen || !!oc->chosen_memcg;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
@@ -5190,6 +5365,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}

+static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ bool oom_kill_all_tasks = memcg->oom_kill_all_tasks;
+
+ seq_printf(m, "%d\n", oom_kill_all_tasks);
+
+ return 0;
+}
+
+static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_kill_all_tasks;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks);
+ if (err)
+ return err;
+
+ memcg->oom_kill_all_tasks = !!oom_kill_all_tasks;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = {
.write = memory_max_write,
},
{
+ .name = "oom_kill_all_tasks",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_kill_all_tasks_show,
+ .write = memory_oom_kill_all_tasks_write,
+ },
+ {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c29a3dd591b..28e42a0d5eee 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}

-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
@@ -823,6 +823,9 @@ static void __oom_kill_process(struct task_struct *victim)
struct mm_struct *mm;
bool can_oom_reap = true;

+ if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
+ return;
+
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -958,6 +961,60 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
put_task_struct(victim);
}

+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+ if (!tsk_is_oom_victim(task))
+ __oom_kill_process(task);
+ return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ if (oc->chosen) {
+ if (oc->chosen != (void *)-1UL) {
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, oc->chosen);
+
+ __oom_kill_process(oc->chosen);
+ put_task_struct(oc->chosen);
+ schedule_timeout_killable(1);
+ }
+ return true;
+
+ } else if (oc->chosen_memcg) {
+ if (oc->chosen_memcg == (void *)-1UL)
+ return true;
+
+ /* Always begin with the biggest task */
+ oc->chosen_points = 0;
+ oc->chosen = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+ if (oc->chosen && oc->chosen != (void *)-1UL) {
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, oc->chosen);
+
+ __oom_kill_process(oc->chosen);
+ put_task_struct(oc->chosen);
+
+ if (mem_cgroup_oom_kill_all_tasks(oc->chosen_memcg))
+ mem_cgroup_scan_tasks(oc->chosen_memcg,
+ oom_kill_memcg_member,
+ NULL);
+ }
+
+ mem_cgroup_put(oc->chosen_memcg);
+ oc->chosen_memcg = NULL;
+ return !!oc->chosen;
+
+ } else {
+ oc->chosen_points = 0;
+ return false;
+ }
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -1059,6 +1116,9 @@ bool out_of_memory(struct oom_control *oc)
return true;
}

+ if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+ return true;
+
select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
--
2.13.5

2017-08-14 18:33:40

by Roman Gushchin

[permalink] [raw]
Subject: [v5 3/4] mm, oom: introduce oom_priority for memory cgroups

Introduce a per-memory-cgroup oom_priority setting: an integer number
within the [-10000, 10000] range, which defines the order in which
the OOM killer selects victim memory cgroups.

OOM killer prefers memory cgroups with larger priority if they are
populated with elegible tasks.

The oom_priority value is compared within sibling cgroups.

The root cgroup has the oom_priority 0, which cannot be changed.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 3 +++
mm/memcontrol.c | 55 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 796666dc3282..3c1ab3aedebe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -206,6 +206,9 @@ struct mem_cgroup {
/* cached OOM score */
long oom_score;

+ /* OOM killer priority */
+ short oom_priority;
+
/* handle for "memory.events" */
struct cgroup_file events_file;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b81dc55c6ac..f61e9a9c8bdc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2724,12 +2724,21 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
for (;;) {
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg = NULL;
+ short prio = SHRT_MIN;
long score = LONG_MIN;

css_for_each_child(css, &root->css) {
struct mem_cgroup *iter = mem_cgroup_from_css(css);

- if (iter->oom_score > score) {
+ if (iter->oom_score == 0)
+ continue;
+
+ if (iter->oom_priority > prio) {
+ memcg = iter;
+ prio = iter->oom_priority;
+ score = iter->oom_score;
+ } else if (iter->oom_priority == prio &&
+ iter->oom_score > score) {
memcg = iter;
score = iter->oom_score;
}
@@ -2796,7 +2805,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
* For system-wide OOMs we should consider tasks in the root cgroup
* with oom_score larger than oc->chosen_points.
*/
- if (!oc->memcg) {
+ if (!oc->memcg && !(oc->chosen_memcg &&
+ oc->chosen_memcg->oom_priority > 0)) {
+ /*
+ * Root memcg has priority 0, so if chosen memcg has lower
+ * priority, any task in root cgroup is preferable.
+ */
+ if (oc->chosen_memcg && oc->chosen_memcg->oom_priority < 0)
+ oc->chosen_points = 0;
+
select_victim_root_cgroup_task(oc);

if (oc->chosen && oc->chosen_memcg) {
@@ -5392,6 +5409,34 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
return nbytes;
}

+static int memory_oom_priority_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+ seq_printf(m, "%d\n", memcg->oom_priority);
+
+ return 0;
+}
+
+static ssize_t memory_oom_priority_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_priority;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_priority);
+ if (err)
+ return err;
+
+ if (oom_priority < -10000 || oom_priority > 10000)
+ return -EINVAL;
+
+ memcg->oom_priority = (short)oom_priority;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5518,6 +5563,12 @@ static struct cftype memory_files[] = {
.write = memory_oom_kill_all_tasks_write,
},
{
+ .name = "oom_priority",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_priority_show,
+ .write = memory_oom_priority_write,
+ },
+ {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
--
2.13.5

2017-08-14 22:00:48

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 1/4] mm, oom: refactor the oom_kill_process() function

On Mon, 14 Aug 2017, Roman Gushchin wrote:

> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
>
> This commit splits the oom_kill_process() function with
> an intention to re-use the the second half: __oom_kill_process().
>
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for the each task, as well as play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]

Acked-by: David Rientjes <[email protected]>

2017-08-14 22:42:58

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Mon, 14 Aug 2017, Roman Gushchin wrote:

> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 8a266e2be5a6..b7ec3bd441be 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -39,6 +39,7 @@ struct oom_control {
> unsigned long totalpages;
> struct task_struct *chosen;
> unsigned long chosen_points;
> + struct mem_cgroup *chosen_memcg;
> };
>
> extern struct mutex oom_lock;
> @@ -79,6 +80,8 @@ extern void oom_killer_enable(void);
>
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
> /* sysctls */
> extern int sysctl_oom_dump_tasks;
> extern int sysctl_oom_kill_allocating_task;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index df6f63ee95d6..0b81dc55c6ac 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2639,6 +2639,181 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> return ret;
> }
>
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}

I'm indifferent to the memcg evaluation criteria used to determine which
memcg should be selected over others with the same priority, others may
feel differently.

> +
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int elegible = 0;
> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + /*
> + * If there are no tasks, or all tasks have oom_score_adj set
> + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> + * don't select this memory cgroup.
> + */
> + if (!elegible &&
> + (memcg->oom_kill_all_tasks ||
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> + elegible = 1;

I'm curious about the decision made in this conditional and how
oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means
that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it
should otherwise be disabled.

It's undocumented in the changelog, but I'm questioning whether it's the
right decision. Doesn't it make sense to kill all tasks that are not oom
disabled, and allow the user to still protect certain processes by their
/proc/pid/oom_score_adj setting? Otherwise, there's no way to do that
protection without a sibling memcg and its own reservation of memory. I'm
thinking about a process that governs jobs inside the memcg and if there
is an oom kill, it wants to do logging and any cleanup necessary before
exiting itself. It seems like a powerful combination if coupled with oom
notification.

Also, s/elegible/eligible/

Otherwise, looks good!

2017-08-14 22:44:08

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 3/4] mm, oom: introduce oom_priority for memory cgroups

On Mon, 14 Aug 2017, Roman Gushchin wrote:

> Introduce a per-memory-cgroup oom_priority setting: an integer number
> within the [-10000, 10000] range, which defines the order in which
> the OOM killer selects victim memory cgroups.
>
> OOM killer prefers memory cgroups with larger priority if they are
> populated with elegible tasks.
>
> The oom_priority value is compared within sibling cgroups.
>
> The root cgroup has the oom_priority 0, which cannot be changed.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]

Tested-by: David Rientjes <[email protected]>

2017-08-14 22:52:32

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

On Mon, 14 Aug 2017, Roman Gushchin wrote:

> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index dec5afdaa36d..22108f31e09d 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
> 5-2-1. Memory Interface Files
> 5-2-2. Usage Guidelines
> 5-2-3. Memory Ownership
> + 5-2-4. Cgroup-aware OOM Killer

Random curiousness, why cgroup-aware oom killer and not memcg-aware oom
killer?

> 5-3. IO
> 5-3-1. IO Interface Files
> 5-3-2. Writeback
> @@ -1002,6 +1003,37 @@ PAGE_SIZE multiple when read back.
> high limit is used and monitored properly, this limit's
> utility is limited to providing the final safety net.
>
> + memory.oom_kill_all_tasks
> +
> + A read-write single value file which exits on non-root

s/exits/exists/

> + cgroups. The default is "0".
> +
> + Defines whether the OOM killer should treat the cgroup
> + as a single entity during the victim selection.

Isn't this true independent of the memory.oom_kill_all_tasks setting?
The cgroup aware oom killer will consider memcg's as logical units when
deciding what to kill with or without memory.oom_kill_all_tasks, right?

I think you cover this fact in the cgroup aware oom killer section below
so this might result in confusion if described alongside a setting of
memory.oom_kill_all_tasks.

> +
> + If set, OOM killer will kill all belonging tasks in
> + corresponding cgroup is selected as an OOM victim.

Maybe

"If set, the OOM killer will kill all threads attached to the memcg if
selected as an OOM victim."

is better?

> +
> + Be default, OOM killer respect /proc/pid/oom_score_adj value
> + -1000, and will never kill the task, unless oom_kill_all_tasks
> + is set.
> +
> + memory.oom_priority
> +
> + A read-write single value file which exits on non-root

s/exits/exists/

> + cgroups. The default is "0".
> +
> + An integer number within the [-10000, 10000] range,
> + which defines the order in which the OOM killer selects victim
> + memory cgroups.
> +
> + OOM killer prefers memory cgroups with larger priority if they
> + are populated with elegible tasks.

s/elegible/eligible/

> +
> + The oom_priority value is compared within sibling cgroups.
> +
> + The root cgroup has the oom_priority 0, which cannot be changed.
> +
> memory.events
> A read-only flat-keyed file which exists on non-root cgroups.
> The following entries are defined. Unless specified
> @@ -1206,6 +1238,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> belonging to the affected files to ensure correct memory ownership.
>
>
> +Cgroup-aware OOM Killer
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats memory cgroups as first class OOM entities.
> +
> +Under OOM conditions the memory controller tries to make the best
> +choise of a victim, hierarchically looking for the largest memory
> +consumer. By default, it will look for the biggest task in the
> +biggest leaf cgroup.
> +
> +Be default, all cgroups have oom_priority 0, and OOM killer will
> +chose the largest cgroup recursively on each level. For non-root
> +cgroups it's possible to change the oom_priority, and it will cause
> +the OOM killer to look athe the priority value first, and compare
> +sizes only of cgroups with equal priority.

Maybe some description of "largest" would be helpful here? I think you
could briefly describe what is accounted for in the decisionmaking.

s/athe/at the/

Reading through this, it makes me wonder if doing s/cgroup/memcg/ over
most of it would be better.

> +
> +But a user can change this behavior by enabling the per-cgroup
> +oom_kill_all_tasks option. If set, it causes the OOM killer treat
> +the whole cgroup as an indivisible memory consumer. In case if it's
> +selected as on OOM victim, all belonging tasks will be killed.
> +
> +Tasks in the root cgroup are treated as independent memory consumers,
> +and are compared with other memory consumers (e.g. leaf cgroups).
> +The root cgroup doesn't support the oom_kill_all_tasks feature.
> +
> +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> +the memory controller considers only cgroups belonging to the sub-tree
> +of the OOM'ing cgroup.
> +
> IO
> --
>

2017-08-15 12:16:27

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Mon, Aug 14, 2017 at 03:42:54PM -0700, David Rientjes wrote:
> On Mon, 14 Aug 2017, Roman Gushchin wrote:
> > +
> > +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> > + const nodemask_t *nodemask)
> > +{
> > + struct css_task_iter it;
> > + struct task_struct *task;
> > + int elegible = 0;
> > +
> > + css_task_iter_start(&memcg->css, 0, &it);
> > + while ((task = css_task_iter_next(&it))) {
> > + /*
> > + * If there are no tasks, or all tasks have oom_score_adj set
> > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> > + * don't select this memory cgroup.
> > + */
> > + if (!elegible &&
> > + (memcg->oom_kill_all_tasks ||
> > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> > + elegible = 1;
>
> I'm curious about the decision made in this conditional and how
> oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means
> that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it
> should otherwise be disabled.
>
> It's undocumented in the changelog, but I'm questioning whether it's the
> right decision. Doesn't it make sense to kill all tasks that are not oom
> disabled, and allow the user to still protect certain processes by their
> /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that
> protection without a sibling memcg and its own reservation of memory. I'm
> thinking about a process that governs jobs inside the memcg and if there
> is an oom kill, it wants to do logging and any cleanup necessary before
> exiting itself. It seems like a powerful combination if coupled with oom
> notification.

Good question!
I think, that an ability to override any oom_score_adj value and get all tasks
killed is more important, than an ability to kill all processes with some
exceptions.

In your example someone still needs to look after the remaining process,
and kill it after some timeout, if it will not quit by itself, right?

The special treatment of the -1000 value (without oom_kill_all_tasks)
is required only to not to break the existing setups.

Generally, oom_score_adj should have a meaning only on a cgroup level,
so extending it to the system level doesn't sound as a good idea.

>
> Also, s/elegible/eligible/

Shame on me :)
Will fix, thanks!

>
> Otherwise, looks good!

Great!
Thank you for the reviewing and testing.

Roman

2017-08-15 12:20:29

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On 08/15/2017 10:15 PM, Roman Gushchin wrote:
> Generally, oom_score_adj should have a meaning only on a cgroup level,
> so extending it to the system level doesn't sound as a good idea.

But wasn't the original purpose of oom_score (and oom_score_adj) to work
on a system level, aka "normal" OOM? Is there some peculiarity about
memcg OOM that I'm missing?

--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

2017-08-15 12:58:28

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Tue, Aug 15, 2017 at 10:20:18PM +1000, Aleksa Sarai wrote:
> On 08/15/2017 10:15 PM, Roman Gushchin wrote:
> > Generally, oom_score_adj should have a meaning only on a cgroup level,
> > so extending it to the system level doesn't sound as a good idea.
>
> But wasn't the original purpose of oom_score (and oom_score_adj) to work on
> a system level, aka "normal" OOM? Is there some peculiarity about memcg OOM
> that I'm missing?

I'm sorry, if it wasn't clear from my message, it's not about
the system-wide OOM vs the memcg-wide OOM, it's about the isolation.

In general, decision is made on memcg level first (based on oom_priority
and size), and only then on a task level (based on size and oom_score_adj).

Oom_score_adj affects which task inside the cgroup will be killed,
but we never compare tasks from different cgroups. This is what I mean,
when I'm saying, that oom_score_adj should not have a system-wide meaning.

Thanks!

2017-08-15 14:14:26

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

On Mon, Aug 14, 2017 at 03:52:26PM -0700, David Rientjes wrote:
> On Mon, 14 Aug 2017, Roman Gushchin wrote:
>
> > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > index dec5afdaa36d..22108f31e09d 100644
> > --- a/Documentation/cgroup-v2.txt
> > +++ b/Documentation/cgroup-v2.txt
> > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
> > 5-2-1. Memory Interface Files
> > 5-2-2. Usage Guidelines
> > 5-2-3. Memory Ownership
> > + 5-2-4. Cgroup-aware OOM Killer
>
> Random curiousness, why cgroup-aware oom killer and not memcg-aware oom
> killer?

I don't think we use the term "memcg" somewhere in v2 docs.
Do you think that "Memory cgroup-aware OOM killer" is better?

>
> > 5-3. IO
> > 5-3-1. IO Interface Files
> > 5-3-2. Writeback
> > @@ -1002,6 +1003,37 @@ PAGE_SIZE multiple when read back.
> > high limit is used and monitored properly, this limit's
> > utility is limited to providing the final safety net.
> >
> > + memory.oom_kill_all_tasks
> > +
> > + A read-write single value file which exits on non-root
>
> s/exits/exists/

Fixed. Thanks!

>
> > + cgroups. The default is "0".
> > +
> > + Defines whether the OOM killer should treat the cgroup
> > + as a single entity during the victim selection.
>
> Isn't this true independent of the memory.oom_kill_all_tasks setting?
> The cgroup aware oom killer will consider memcg's as logical units when
> deciding what to kill with or without memory.oom_kill_all_tasks, right?
>
> I think you cover this fact in the cgroup aware oom killer section below
> so this might result in confusion if described alongside a setting of
> memory.oom_kill_all_tasks.
>
> > +
> > + If set, OOM killer will kill all belonging tasks in
> > + corresponding cgroup is selected as an OOM victim.
>
> Maybe
>
> "If set, the OOM killer will kill all threads attached to the memcg if
> selected as an OOM victim."
>
> is better?

Fixed to the following (to conform with core v2 concepts):
If set, OOM killer will kill all processes attached to the cgroup
if selected as an OOM victim.

>
> > +
> > + Be default, OOM killer respect /proc/pid/oom_score_adj value
> > + -1000, and will never kill the task, unless oom_kill_all_tasks
> > + is set.
> > +
> > + memory.oom_priority
> > +
> > + A read-write single value file which exits on non-root
>
> s/exits/exists/

Fixed.

>
> > + cgroups. The default is "0".
> > +
> > + An integer number within the [-10000, 10000] range,
> > + which defines the order in which the OOM killer selects victim
> > + memory cgroups.
> > +
> > + OOM killer prefers memory cgroups with larger priority if they
> > + are populated with elegible tasks.
>
> s/elegible/eligible/

Fixed.

>
> > +
> > + The oom_priority value is compared within sibling cgroups.
> > +
> > + The root cgroup has the oom_priority 0, which cannot be changed.
> > +
> > memory.events
> > A read-only flat-keyed file which exists on non-root cgroups.
> > The following entries are defined. Unless specified
> > @@ -1206,6 +1238,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> > belonging to the affected files to ensure correct memory ownership.
> >
> >
> > +Cgroup-aware OOM Killer
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats memory cgroups as first class OOM entities.
> > +
> > +Under OOM conditions the memory controller tries to make the best
> > +choise of a victim, hierarchically looking for the largest memory
> > +consumer. By default, it will look for the biggest task in the
> > +biggest leaf cgroup.
> > +
> > +Be default, all cgroups have oom_priority 0, and OOM killer will
> > +chose the largest cgroup recursively on each level. For non-root
> > +cgroups it's possible to change the oom_priority, and it will cause
> > +the OOM killer to look athe the priority value first, and compare
> > +sizes only of cgroups with equal priority.
>
> Maybe some description of "largest" would be helpful here? I think you
> could briefly describe what is accounted for in the decisionmaking.

I'm afraid that it's too implementation-defined to be described.
Do you have an idea, how to describe it without going too much into details?

> s/athe/at the/

Fixed.

>
> Reading through this, it makes me wonder if doing s/cgroup/memcg/ over
> most of it would be better.

I don't think memcg is a good user term, but I agree, that it's necessary
to highlight the fact that a user should enable memory controller to get
this functionality.
Added a corresponding note.

Thanks!

Roman

2017-08-15 20:56:29

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

On Tue, 15 Aug 2017, Roman Gushchin wrote:

> > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > > index dec5afdaa36d..22108f31e09d 100644
> > > --- a/Documentation/cgroup-v2.txt
> > > +++ b/Documentation/cgroup-v2.txt
> > > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
> > > 5-2-1. Memory Interface Files
> > > 5-2-2. Usage Guidelines
> > > 5-2-3. Memory Ownership
> > > + 5-2-4. Cgroup-aware OOM Killer
> >
> > Random curiousness, why cgroup-aware oom killer and not memcg-aware oom
> > killer?
>
> I don't think we use the term "memcg" somewhere in v2 docs.
> Do you think that "Memory cgroup-aware OOM killer" is better?
>

I think it would be better to not describe it as its own entity, but
rather a part of how the memory cgroup works, so simply describing it in
section 5-2, perhaps as its own subsection, as how the oom killer works
when using the memory cgroup is sufficient. I wouldn't separate it out as
a distinct cgroup feature in the documentation.

> > > + cgroups. The default is "0".
> > > +
> > > + Defines whether the OOM killer should treat the cgroup
> > > + as a single entity during the victim selection.
> >
> > Isn't this true independent of the memory.oom_kill_all_tasks setting?
> > The cgroup aware oom killer will consider memcg's as logical units when
> > deciding what to kill with or without memory.oom_kill_all_tasks, right?
> >
> > I think you cover this fact in the cgroup aware oom killer section below
> > so this might result in confusion if described alongside a setting of
> > memory.oom_kill_all_tasks.
> >

I assume this is fixed so that it's documented that memory cgroups are
considered logical units by the oom killer and that
memory.oom_kill_all_tasks is separate? The former defines the policy on
how a memory cgroup is targeted and the latter defines the mechanism it
uses to free memory.

> > > + If set, OOM killer will kill all belonging tasks in
> > > + corresponding cgroup is selected as an OOM victim.
> >
> > Maybe
> >
> > "If set, the OOM killer will kill all threads attached to the memcg if
> > selected as an OOM victim."
> >
> > is better?
>
> Fixed to the following (to conform with core v2 concepts):
> If set, OOM killer will kill all processes attached to the cgroup
> if selected as an OOM victim.
>

Thanks.

> > > +Cgroup-aware OOM Killer
> > > +~~~~~~~~~~~~~~~~~~~~~~~
> > > +
> > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > +It means that it treats memory cgroups as first class OOM entities.
> > > +
> > > +Under OOM conditions the memory controller tries to make the best
> > > +choise of a victim, hierarchically looking for the largest memory
> > > +consumer. By default, it will look for the biggest task in the
> > > +biggest leaf cgroup.
> > > +
> > > +Be default, all cgroups have oom_priority 0, and OOM killer will
> > > +chose the largest cgroup recursively on each level. For non-root
> > > +cgroups it's possible to change the oom_priority, and it will cause
> > > +the OOM killer to look athe the priority value first, and compare
> > > +sizes only of cgroups with equal priority.
> >
> > Maybe some description of "largest" would be helpful here? I think you
> > could briefly describe what is accounted for in the decisionmaking.
>
> I'm afraid that it's too implementation-defined to be described.
> Do you have an idea, how to describe it without going too much into details?
>

The point is that "largest cgroup" is ambiguous here: largest in what
sense? The cgroup with the largest number of processes attached? Using
the largest amount of memory?

I think the documentation should clearly define that the oom killer
selects the memory cgroup that has the most memory managed at each level.

2017-08-15 21:47:17

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Tue, 15 Aug 2017, Roman Gushchin wrote:

> > I'm curious about the decision made in this conditional and how
> > oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means
> > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it
> > should otherwise be disabled.
> >
> > It's undocumented in the changelog, but I'm questioning whether it's the
> > right decision. Doesn't it make sense to kill all tasks that are not oom
> > disabled, and allow the user to still protect certain processes by their
> > /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that
> > protection without a sibling memcg and its own reservation of memory. I'm
> > thinking about a process that governs jobs inside the memcg and if there
> > is an oom kill, it wants to do logging and any cleanup necessary before
> > exiting itself. It seems like a powerful combination if coupled with oom
> > notification.
>
> Good question!
> I think, that an ability to override any oom_score_adj value and get all tasks
> killed is more important, than an ability to kill all processes with some
> exceptions.
>

I'm disagreeing because getting all tasks killed is not necessarily
something that only the kernel can do. If all processes are oom disabled,
that's a configuration issue done by sysadmin and the kernel should decide
to kill the next largest memory cgroup or lower priority memory cgroup.
It's not killing things like sshd that intentionally oom disable
themselves.

You could argue that having an oom disabled process attached to these
memcgs in the first place is also a configuration issue, but the problem
is that in cgroup v2 with a restriction on processes only being attached
at the leaf cgroups that there is no competition for memory in this case.
I must assign memory resources to that sshd, or "Activity Manager"
described by the cgroup v1 documentation, just to prevent it from being
killed.

I think the latter of what you describe, killing all processes with some
exceptions, is actually quite powerful. I can guarantee that processes
that set themselves to oom disabled are really oom disabled and I don't
need to work around that in the cgroup hierarchy only because of this
caveat. I can also oom disable my Activity Manger that wants to wait on
oom notification and collect the oom kill logs, raise notifications, and
perhaps restart the process that it manages.

> In your example someone still needs to look after the remaining process,
> and kill it after some timeout, if it will not quit by itself, right?
>

No, it can restart the process that was oom killed; or it can be sshd and
I can still ssh into my machine.

> The special treatment of the -1000 value (without oom_kill_all_tasks)
> is required only to not to break the existing setups.
>

I think as a general principle that allowing an oom disabled process to be
oom killed is incorrect and if you really do want these to be killed, then
(1) either your oom_score_adj is already wrong or (2) you can wait on oom
notification and exit.

2017-08-16 14:44:19

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

On Tue, Aug 15, 2017 at 01:56:24PM -0700, David Rientjes wrote:
> On Tue, 15 Aug 2017, Roman Gushchin wrote:
>
> > > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > > > index dec5afdaa36d..22108f31e09d 100644
> > > > --- a/Documentation/cgroup-v2.txt
> > > > +++ b/Documentation/cgroup-v2.txt
> > > > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
> > > > 5-2-1. Memory Interface Files
> > > > 5-2-2. Usage Guidelines
> > > > 5-2-3. Memory Ownership
> > > > + 5-2-4. Cgroup-aware OOM Killer
> > >
> > > Random curiousness, why cgroup-aware oom killer and not memcg-aware oom
> > > killer?
> >
> > I don't think we use the term "memcg" somewhere in v2 docs.
> > Do you think that "Memory cgroup-aware OOM killer" is better?
> >
>
> I think it would be better to not describe it as its own entity, but
> rather a part of how the memory cgroup works, so simply describing it in
> section 5-2, perhaps as its own subsection, as how the oom killer works
> when using the memory cgroup is sufficient. I wouldn't separate it out as
> a distinct cgroup feature in the documentation.

Ok I've got the idea, let me look, what I can do.
I'll post an updated version soon.

>
> > > > + cgroups. The default is "0".
> > > > +
> > > > + Defines whether the OOM killer should treat the cgroup
> > > > + as a single entity during the victim selection.
> > >
> > > Isn't this true independent of the memory.oom_kill_all_tasks setting?
> > > The cgroup aware oom killer will consider memcg's as logical units when
> > > deciding what to kill with or without memory.oom_kill_all_tasks, right?
> > >
> > > I think you cover this fact in the cgroup aware oom killer section below
> > > so this might result in confusion if described alongside a setting of
> > > memory.oom_kill_all_tasks.
> > >
>
> I assume this is fixed so that it's documented that memory cgroups are
> considered logical units by the oom killer and that
> memory.oom_kill_all_tasks is separate? The former defines the policy on
> how a memory cgroup is targeted and the latter defines the mechanism it
> uses to free memory.

Yes, I've fixed this. Thanks!

> > > > +Cgroup-aware OOM Killer
> > > > +~~~~~~~~~~~~~~~~~~~~~~~
> > > > +
> > > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > > +It means that it treats memory cgroups as first class OOM entities.
> > > > +
> > > > +Under OOM conditions the memory controller tries to make the best
> > > > +choise of a victim, hierarchically looking for the largest memory
> > > > +consumer. By default, it will look for the biggest task in the
> > > > +biggest leaf cgroup.
> > > > +
> > > > +Be default, all cgroups have oom_priority 0, and OOM killer will
> > > > +chose the largest cgroup recursively on each level. For non-root
> > > > +cgroups it's possible to change the oom_priority, and it will cause
> > > > +the OOM killer to look athe the priority value first, and compare
> > > > +sizes only of cgroups with equal priority.
> > >
> > > Maybe some description of "largest" would be helpful here? I think you
> > > could briefly describe what is accounted for in the decisionmaking.
> >
> > I'm afraid that it's too implementation-defined to be described.
> > Do you have an idea, how to describe it without going too much into details?
> >
>
> The point is that "largest cgroup" is ambiguous here: largest in what
> sense? The cgroup with the largest number of processes attached? Using
> the largest amount of memory?
>
> I think the documentation should clearly define that the oom killer
> selects the memory cgroup that has the most memory managed at each level.

No problems, I'll add a clarification.

Thank you!

2017-08-16 15:43:58

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Tue, Aug 15, 2017 at 02:47:10PM -0700, David Rientjes wrote:
> On Tue, 15 Aug 2017, Roman Gushchin wrote:
>
> > > I'm curious about the decision made in this conditional and how
> > > oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means
> > > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it
> > > should otherwise be disabled.
> > >
> > > It's undocumented in the changelog, but I'm questioning whether it's the
> > > right decision. Doesn't it make sense to kill all tasks that are not oom
> > > disabled, and allow the user to still protect certain processes by their
> > > /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that
> > > protection without a sibling memcg and its own reservation of memory. I'm
> > > thinking about a process that governs jobs inside the memcg and if there
> > > is an oom kill, it wants to do logging and any cleanup necessary before
> > > exiting itself. It seems like a powerful combination if coupled with oom
> > > notification.
> >
> > Good question!
> > I think, that an ability to override any oom_score_adj value and get all tasks
> > killed is more important, than an ability to kill all processes with some
> > exceptions.
> >
>
> I'm disagreeing because getting all tasks killed is not necessarily
> something that only the kernel can do. If all processes are oom disabled,
> that's a configuration issue done by sysadmin and the kernel should decide
> to kill the next largest memory cgroup or lower priority memory cgroup.
> It's not killing things like sshd that intentionally oom disable
> themselves.
>
> You could argue that having an oom disabled process attached to these
> memcgs in the first place is also a configuration issue, but the problem
> is that in cgroup v2 with a restriction on processes only being attached
> at the leaf cgroups that there is no competition for memory in this case.
> I must assign memory resources to that sshd, or "Activity Manager"
> described by the cgroup v1 documentation, just to prevent it from being
> killed.
>
> I think the latter of what you describe, killing all processes with some
> exceptions, is actually quite powerful. I can guarantee that processes
> that set themselves to oom disabled are really oom disabled and I don't
> need to work around that in the cgroup hierarchy only because of this
> caveat. I can also oom disable my Activity Manger that wants to wait on
> oom notification and collect the oom kill logs, raise notifications, and
> perhaps restart the process that it manage.
>
> > In your example someone still needs to look after the remaining process,
> > and kill it after some timeout, if it will not quit by itself, right?
> >
>
> No, it can restart the process that was oom killed; or it can be sshd and
> I can still ssh into my machine.
>
> > The special treatment of the -1000 value (without oom_kill_all_tasks)
> > is required only to not to break the existing setups.
> >
>
> I think as a general principle that allowing an oom disabled process to be
> oom killed is incorrect and if you really do want these to be killed, then
> (1) either your oom_score_adj is already wrong or (2) you can wait on oom
> notification and exit.

It's natural to expect that inside a container there are their own sshd,
"activity manager" or some other stuff, which can play with oom_score_adj.
If it can override the upper cgroup-level settings, the whole delegation model
is broken.

You can think about the oom_kill_all_tasks like the panic_on_oom,
but on a cgroup level. It should _guarantee_, that in case of oom
the whole cgroup will be destroyed completely, and will not remain
in a non-consistent state.

The model you're describing is based on a trust given to these oom-unkillable
processes on system level. But we can't really trust some unknown processes
inside a cgroup that they will be able to do some useful work and finish
in a reasonable time; especially in case of a global memory shortage.
That means someone needs to look at cgroup after each OOM and check if there
are remaining tasks. If so, the whole functionality is useless.

If some complex post-oom handling is required, it should be performed
from another cgroup (protected by the lower oom_priority value).

So, for example:

root
|
A
/ \
B C

B: oom_priority=10, oom_kill_all_tasks=1, contains workload
C: oom_priority=0, oom_kill_all_tasks=0, contains control stuff

If B is killed by OOM, an "activity manager" in C can be notified
and perform some actions.

Thanks!

Roman

2017-08-17 12:20:24

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

Hi David!

Please, find an updated version of docs patch below.

Thanks!

Roman

--

>From 97805b3dcccb9420d2c4380e88e202164ead0e45 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <[email protected]>
Date: Fri, 2 Jun 2017 11:29:14 +0100
Subject: [PATCH 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

Update cgroups v2 docs.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/cgroup-v2.txt | 62 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index dec5afdaa36d..0e93c5b9cbd2 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+ 5-2-4. OOM Killer
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1002,6 +1003,34 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.

+ memory.oom_kill_all_tasks
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ If set, OOM killer will kill all processes attached to the cgroup
+ if selected as an OOM victim.
+
+ Be default, the OOM killer respects the /proc/pid/oom_score_adj
+ value -1000, and will never kill the task, unless oom_kill_all_tasks
+ is set.
+
+ memory.oom_priority
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ An integer number within the [-10000, 10000] range,
+ which defines the order in which the OOM killer selects victim
+ memory cgroups.
+
+ OOM killer prefers memory cgroups with larger priority if they
+ are populated with eligible tasks.
+
+ The oom_priority value is compared within sibling cgroups.
+
+ The root cgroup has the oom_priority 0, which cannot be changed.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1206,6 +1235,39 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.


+OOM Killer
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for the largest memory
+consumer. By default, it will look for the biggest task in the
+biggest leaf memory cgroup.
+
+By default, all memory cgroups have oom_priority 0, and OOM killer
+will choice the cgroup with the largest memory consuption recursively
+on each level. For non-root cgroups it's possible to change
+the oom_priority, and it will cause the OOM killer to look
+at the priority value first, and compare sizes only of memory
+cgroups with equal priority.
+
+A user can change this behavior by enabling the per-cgroup
+oom_kill_all_tasks option. If set, OOM killer will kill all processes
+attached to the cgroup if selected as an OOM victim.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (leaf memory cgroups).
+The root cgroup doesn't support the oom_kill_all_tasks feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+If there are no cgroups with the enabled memory controller,
+the OOM killer is using the "traditional" process-based approach.
+
IO
--

--
2.13.5

2017-08-21 00:41:30

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

On Thu, 17 Aug 2017, Roman Gushchin wrote:

> Hi David!
>
> Please, find an updated version of docs patch below.
>

Looks much better, thanks! I think the only pending issue is discussing
the relationship of memory.oom_kill_all_tasks with /proc/pid/oom_score_adj
== OOM_SCORE_ADJ_MIN.

2017-08-21 00:50:32

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Wed, 16 Aug 2017, Roman Gushchin wrote:

> It's natural to expect that inside a container there are their own sshd,
> "activity manager" or some other stuff, which can play with oom_score_adj.
> If it can override the upper cgroup-level settings, the whole delegation model
> is broken.
>

I don't think any delegation model related to core cgroups or memory
cgroup is broken, I think it's based on how memory.oom_kill_all_tasks is
defined. It could very well behave as memory.oom_kill_all_eligible_tasks
when enacted upon.

> You can think about the oom_kill_all_tasks like the panic_on_oom,
> but on a cgroup level. It should _guarantee_, that in case of oom
> the whole cgroup will be destroyed completely, and will not remain
> in a non-consistent state.
>

Only CAP_SYS_ADMIN has this ability to set /proc/pid/oom_score_adj to
OOM_SCORE_ADJ_MIN, so it preserves the ability to change that setting, if
needed, when it sets memory.oom_kill_all_tasks. If a user gains
permissions to change memory.oom_kill_all_tasks, I disagree it should
override the CAP_SYS_ADMIN setting of /proc/pid/oom_score_adj.

I would prefer not to exclude oom disabled processes to their own sibling
cgroups because they would require their own reservation with cgroup v2
and it makes the single hierarchy model much more difficult to arrange
alongside cpusets, for example.

> The model you're describing is based on a trust given to these oom-unkillable
> processes on system level. But we can't really trust some unknown processes
> inside a cgroup that they will be able to do some useful work and finish
> in a reasonable time; especially in case of a global memory shortage.

Yes, we prefer to panic instead of sshd, for example, being oom killed.
We trust that sshd, as well as our own activity manager and security
daemons are trusted to do useful work and that we never want the kernel to
do this. I'm not sure why you are describing processes that CAP_SYS_ADMIN
has set to be oom disabled as unknown processes.

I'd be interested in hearing the opinions of others related to a per-memcg
knob being allowed to override the setting of the sysadmin.

2017-08-21 09:48:09

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Sun, Aug 20, 2017 at 05:50:27PM -0700, David Rientjes wrote:
> On Wed, 16 Aug 2017, Roman Gushchin wrote:
>
> > It's natural to expect that inside a container there are their own sshd,
> > "activity manager" or some other stuff, which can play with oom_score_adj.
> > If it can override the upper cgroup-level settings, the whole delegation model
> > is broken.
> >
>
> I don't think any delegation model related to core cgroups or memory
> cgroup is broken, I think it's based on how memory.oom_kill_all_tasks is
> defined. It could very well behave as memory.oom_kill_all_eligible_tasks
> when enacted upon.
>
> > You can think about the oom_kill_all_tasks like the panic_on_oom,
> > but on a cgroup level. It should _guarantee_, that in case of oom
> > the whole cgroup will be destroyed completely, and will not remain
> > in a non-consistent state.
> >
>
> Only CAP_SYS_ADMIN has this ability to set /proc/pid/oom_score_adj to

CAP_SYS_RESOURCE

> OOM_SCORE_ADJ_MIN, so it preserves the ability to change that setting, if
> needed, when it sets memory.oom_kill_all_tasks. If a user gains
> permissions to change memory.oom_kill_all_tasks, I disagree it should
> override the CAP_SYS_ADMIN setting of /proc/pid/oom_score_adj.
>
> I would prefer not to exclude oom disabled processes to their own sibling
> cgroups because they would require their own reservation with cgroup v2
> and it makes the single hierarchy model much more difficult to arrange
> alongside cpusets, for example.
>
> > The model you're describing is based on a trust given to these oom-unkillable
> > processes on system level. But we can't really trust some unknown processes
> > inside a cgroup that they will be able to do some useful work and finish
> > in a reasonable time; especially in case of a global memory shortage.
>
> Yes, we prefer to panic instead of sshd, for example, being oom killed.
> We trust that sshd, as well as our own activity manager and security
> daemons are trusted to do useful work and that we never want the kernel to
> do this. I'm not sure why you are describing processes that CAP_SYS_ADMIN
> has set to be oom disabled as unknown processes.
>
> I'd be interested in hearing the opinions of others related to a per-memcg
> knob being allowed to override the setting of the sysadmin.

Sure, me too.

Thanks!

2017-08-22 17:04:01

by Johannes Weiner

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

Hi Roman,

great work! This looks mostly good to me now. Below are some nitpicks
concerning naming and code layout, but nothing major.

On Mon, Aug 14, 2017 at 07:32:11PM +0100, Roman Gushchin wrote:
> @@ -39,6 +39,7 @@ struct oom_control {
> unsigned long totalpages;
> struct task_struct *chosen;
> unsigned long chosen_points;
> + struct mem_cgroup *chosen_memcg;
> };

Please rename 'chosen' to 'chosen_task' to make the distinction to
chosen_memcg clearer.

The ordering is a little weird too, with chosen_points in between.

chosen_task
chosen_memcg
chosen_points

?

> @@ -2639,6 +2639,181 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> return ret;
> }
>
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);

NR_SLAB_UNRECLAIMABLE is now accounted per-lruvec, which takes
nodeness into account, and so would be more accurate here.

You can get it with mem_cgroup_lruvec() and lruvec_page_state().

> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}
> +
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int elegible = 0;

eligible

> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + /*
> + * If there are no tasks, or all tasks have oom_score_adj set
> + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> + * don't select this memory cgroup.
> + */
> + if (!elegible &&
> + (memcg->oom_kill_all_tasks ||
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> + elegible = 1;

This is a little awkward to read. How about something like this:

/*
* When killing individual tasks, we respect OOM score adjustments:
* at least one task in the group needs to be killable for the group
* to be oomable.
*
* Also check that previous OOM kills have finished, and abort if
* there are any pending OOM victims.
*/
oomable = memcg->oom_kill_all_tasks;
while ((task = css_task_iter_next(&it))) {
if (!oomable && task->signal_oom_score_adj != OOM_SCORE_ADJ_MIN)
oomable = 1;

> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + elegible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);

etc.

> +
> + return elegible > 0 ? memcg_oom_badness(memcg, nodemask) : elegible;

I find these much easier to read if broken up, even if it's more LOC:

if (eligible <= 0)
return eligible;

return memcg_oom_badness(memcg, nodemask);

> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter, *parent;
> +
> + for_each_mem_cgroup_tree(iter, root) {
> + if (memcg_has_children(iter)) {
> + iter->oom_score = 0;
> + continue;
> + }
> +
> + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> + if (iter->oom_score == -1) {

Please add comments to document the special returns. Maybe #defines
would be clearer, too.

> + oc->chosen_memcg = (void *)-1UL;
> + mem_cgroup_iter_break(root, iter);
> + return;
> + }
> +
> + if (!iter->oom_score)
> + continue;

Same here.

Maybe a switch would be suitable to handle the abort/no-score cases.

> + for (parent = parent_mem_cgroup(iter); parent && parent != root;
> + parent = parent_mem_cgroup(parent))
> + parent->oom_score += iter->oom_score;
> + }
> +
> + for (;;) {
> + struct cgroup_subsys_state *css;
> + struct mem_cgroup *memcg = NULL;
> + long score = LONG_MIN;
> +
> + css_for_each_child(css, &root->css) {
> + struct mem_cgroup *iter = mem_cgroup_from_css(css);
> +
> + if (iter->oom_score > score) {
> + memcg = iter;
> + score = iter->oom_score;
> + }
> + }
> +
> + if (!memcg) {
> + if (oc->memcg && root == oc->memcg) {
> + oc->chosen_memcg = oc->memcg;
> + css_get(&oc->chosen_memcg->css);
> + oc->chosen_points = oc->memcg->oom_score;
> + }
> + break;
> + }
> +
> + if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) {
> + oc->chosen_memcg = memcg;
> + css_get(&oc->chosen_memcg->css);
> + oc->chosen_points = score;
> + break;
> + }
> +
> + root = memcg;
> + }
> +}
> +
> +static void select_victim_root_cgroup_task(struct oom_control *oc)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int ret = 0;
> +
> + css_task_iter_start(&root_mem_cgroup->css, 0, &it);
> + while (!ret && (task = css_task_iter_next(&it)))
> + ret = oom_evaluate_task(task, oc);
> + css_task_iter_end(&it);
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + struct mem_cgroup *root = root_mem_cgroup;
> +
> + oc->chosen = NULL;
> + oc->chosen_memcg = NULL;
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + if (oc->memcg)
> + root = oc->memcg;
> +
> + rcu_read_lock();
> +
> + select_victim_memcg(root, oc);
> + if (oc->chosen_memcg == (void *)-1UL) {
> + /* Existing OOM victims are found. */
> + rcu_read_unlock();
> + return true;
> + }

It would be good to format this branch like the block below, with a
newline and the comment before the branch block rather than inside.

That would also set apart the call to select_victim_memcg(), which is
the main workhorse of this function.

> + /*
> + * For system-wide OOMs we should consider tasks in the root cgroup
> + * with oom_score larger than oc->chosen_points.
> + */
> + if (!oc->memcg) {
> + select_victim_root_cgroup_task(oc);
> +
> + if (oc->chosen && oc->chosen_memcg) {
> + /*
> + * If we've decided to kill a task in the root memcg,
> + * release chosen_memcg.
> + */
> + css_put(&oc->chosen_memcg->css);
> + oc->chosen_memcg = NULL;
> + }
> + }

^^ like this one.

> +
> + rcu_read_unlock();
> +
> + return !!oc->chosen || !!oc->chosen_memcg;

The !! are detrimental to readability and shouldn't be necessary.

> @@ -5190,6 +5365,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> + bool oom_kill_all_tasks = memcg->oom_kill_all_tasks;
> +
> + seq_printf(m, "%d\n", oom_kill_all_tasks);
> +
> + return 0;
> +}
> +
> +static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + int oom_kill_all_tasks;
> + int err;
> +
> + err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks);
> + if (err)
> + return err;
> +
> + memcg->oom_kill_all_tasks = !!oom_kill_all_tasks;
> +
> + return nbytes;
> +}
> +
> static int memory_events_show(struct seq_file *m, void *v)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = {
> .write = memory_max_write,
> },
> {
> + .name = "oom_kill_all_tasks",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_oom_kill_all_tasks_show,
> + .write = memory_oom_kill_all_tasks_write,
> + },

This name is quite a mouthful and reminiscent of the awkward v1
interface names. It doesn't really go well with the v2 names.

How about memory.oom_group?

> + {
> .name = "events",
> .flags = CFTYPE_NOT_ON_ROOT,
> .file_offset = offsetof(struct mem_cgroup, events_file),
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5c29a3dd591b..28e42a0d5eee 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
> return CONSTRAINT_NONE;
> }
>
> -static int oom_evaluate_task(struct task_struct *task, void *arg)
> +int oom_evaluate_task(struct task_struct *task, void *arg)
> {
> struct oom_control *oc = arg;
> unsigned long points;
> @@ -823,6 +823,9 @@ static void __oom_kill_process(struct task_struct *victim)
> struct mm_struct *mm;
> bool can_oom_reap = true;
>
> + if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
> + return;
> +
> p = find_lock_task_mm(victim);
> if (!p) {
> put_task_struct(victim);
> @@ -958,6 +961,60 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> put_task_struct(victim);
> }
>
> +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> +{
> + if (!tsk_is_oom_victim(task))
> + __oom_kill_process(task);
> + return 0;
> +}
> +
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
> +
> + if (oc->chosen) {
> + if (oc->chosen != (void *)-1UL) {
> + if (__ratelimit(&oom_rs))
> + dump_header(oc, oc->chosen);
> +
> + __oom_kill_process(oc->chosen);
> + put_task_struct(oc->chosen);
> + schedule_timeout_killable(1);
> + }
> + return true;
> +
> + } else if (oc->chosen_memcg) {
> + if (oc->chosen_memcg == (void *)-1UL)
> + return true;

Can you format the above chosen == (void *)-1UL the same way? That
makes it easier to see that it's checking the same thing.

Thanks

2017-08-22 17:07:08

by Johannes Weiner

[permalink] [raw]
Subject: Re: [v5 1/4] mm, oom: refactor the oom_kill_process() function

On Mon, Aug 14, 2017 at 07:32:09PM +0100, Roman Gushchin wrote:
> @@ -817,67 +817,12 @@ static bool task_will_free_mem(struct task_struct *task)
> return ret;
> }
>
> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)

oom_kill_task()?

2017-08-23 12:30:33

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 1/4] mm, oom: refactor the oom_kill_process() function

On Tue, Aug 22, 2017 at 01:06:55PM -0400, Johannes Weiner wrote:
> On Mon, Aug 14, 2017 at 07:32:09PM +0100, Roman Gushchin wrote:
> > @@ -817,67 +817,12 @@ static bool task_will_free_mem(struct task_struct *task)
> > return ret;
> > }
> >
> > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > +static void __oom_kill_process(struct task_struct *victim)
>
> oom_kill_task()?

Not sure, as it kills all tasks which are sharing mm with the given task.
Also, it will be confusing to have oom_kill_process() and oom_kill_task(),
where the actual difference is in how much verbose they are,
and if it's allowed to perfer a child process.

2017-08-23 16:21:06

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

Hi Johannes!

Thank you for review!

I do agree with most of the comments, and I will address them in v6.
I'll post it soon.

Please, find some comments below.

On Tue, Aug 22, 2017 at 01:03:44PM -0400, Johannes Weiner wrote:
> Hi Roman,
>
> great work! This looks mostly good to me now. Below are some nitpicks
> concerning naming and code layout, but nothing major.
>
> > +
> > + css_task_iter_start(&memcg->css, 0, &it);
> > + while ((task = css_task_iter_next(&it))) {
> > + /*
> > + * If there are no tasks, or all tasks have oom_score_adj set
> > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> > + * don't select this memory cgroup.
> > + */
> > + if (!elegible &&
> > + (memcg->oom_kill_all_tasks ||
> > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> > + elegible = 1;
>
> This is a little awkward to read. How about something like this:
>
> /*
> * When killing individual tasks, we respect OOM score adjustments:
> * at least one task in the group needs to be killable for the group
> * to be oomable.
> *
> * Also check that previous OOM kills have finished, and abort if
> * there are any pending OOM victims.
> */
> oomable = memcg->oom_kill_all_tasks;
> while ((task = css_task_iter_next(&it))) {
> if (!oomable && task->signal_oom_score_adj != OOM_SCORE_ADJ_MIN)
> oomable = 1;
>
> > + if (tsk_is_oom_victim(task) &&
> > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > + elegible = -1;
> > + break;
> > + }
> > + }
> > + css_task_iter_end(&it);

We ignore oom_score_adj if oom_kill_all_tasks is set, it's
not reflected in your version. Anyway, I've moved the comments block
outside and rephrased it to make more clear.

>
> etc.
>
> > +
> > + return elegible > 0 ? memcg_oom_badness(memcg, nodemask) : elegible;
>
> I find these much easier to read if broken up, even if it's more LOC:
>
> if (eligible <= 0)
> return eligible;
>
> return memcg_oom_badness(memcg, nodemask);
>
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > + struct mem_cgroup *iter, *parent;
> > +
> > + for_each_mem_cgroup_tree(iter, root) {
> > + if (memcg_has_children(iter)) {
> > + iter->oom_score = 0;
> > + continue;
> > + }
> > +
> > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> > + if (iter->oom_score == -1) {
>
> Please add comments to document the special returns. Maybe #defines
> would be clearer, too.
>
> > + oc->chosen_memcg = (void *)-1UL;
> > + mem_cgroup_iter_break(root, iter);
> > + return;
> > + }
> > +
> > + if (!iter->oom_score)
> > + continue;
>
> Same here.
>
> Maybe a switch would be suitable to handle the abort/no-score cases.

Not sure about switch/defines, but I've added several comment blocks
to describe possible return values, as well as their handling.
Hope, it will be enough.

> > static int memory_events_show(struct seq_file *m, void *v)
> > {
> > struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> > @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = {
> > .write = memory_max_write,
> > },
> > {
> > + .name = "oom_kill_all_tasks",
> > + .flags = CFTYPE_NOT_ON_ROOT,
> > + .seq_show = memory_oom_kill_all_tasks_show,
> > + .write = memory_oom_kill_all_tasks_write,
> > + },
>
> This name is quite a mouthful and reminiscent of the awkward v1
> interface names. It doesn't really go well with the v2 names.
>
> How about memory.oom_group?

I'd prefer to have something more obvious. I've renamed
memory.oom_kill_all_tasks to memory.oom_kill_all, which was earlier suggested
by Vladimir. Are you ok with it?

Thanks!

2017-08-23 17:25:01

by Johannes Weiner

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

Hi,

On Wed, Aug 23, 2017 at 05:20:31PM +0100, Roman Gushchin wrote:
> On Tue, Aug 22, 2017 at 01:03:44PM -0400, Johannes Weiner wrote:
> > > + css_task_iter_start(&memcg->css, 0, &it);
> > > + while ((task = css_task_iter_next(&it))) {
> > > + /*
> > > + * If there are no tasks, or all tasks have oom_score_adj set
> > > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> > > + * don't select this memory cgroup.
> > > + */
> > > + if (!elegible &&
> > > + (memcg->oom_kill_all_tasks ||
> > > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> > > + elegible = 1;
> >
> > This is a little awkward to read. How about something like this:
> >
> > /*
> > * When killing individual tasks, we respect OOM score adjustments:
> > * at least one task in the group needs to be killable for the group
> > * to be oomable.
> > *
> > * Also check that previous OOM kills have finished, and abort if
> > * there are any pending OOM victims.
> > */
> > oomable = memcg->oom_kill_all_tasks;
> > while ((task = css_task_iter_next(&it))) {
> > if (!oomable && task->signal_oom_score_adj != OOM_SCORE_ADJ_MIN)
> > oomable = 1;
> >
> > > + if (tsk_is_oom_victim(task) &&
> > > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > > + elegible = -1;
> > > + break;
> > > + }
> > > + }
> > > + css_task_iter_end(&it);
>
> We ignore oom_score_adj if oom_kill_all_tasks is set, it's
> not reflected in your version. Anyway, I've moved the comments block
> outside and rephrased it to make more clear.

Yes it is...? We only respect the score if !oomable, which is set to
oom_kill_all_tasks.

> > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > > +{
> > > + struct mem_cgroup *iter, *parent;
> > > +
> > > + for_each_mem_cgroup_tree(iter, root) {
> > > + if (memcg_has_children(iter)) {
> > > + iter->oom_score = 0;
> > > + continue;
> > > + }
> > > +
> > > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> > > + if (iter->oom_score == -1) {
> >
> > Please add comments to document the special returns. Maybe #defines
> > would be clearer, too.
> >
> > > + oc->chosen_memcg = (void *)-1UL;
> > > + mem_cgroup_iter_break(root, iter);
> > > + return;
> > > + }
> > > +
> > > + if (!iter->oom_score)
> > > + continue;
> >
> > Same here.
> >
> > Maybe a switch would be suitable to handle the abort/no-score cases.
>
> Not sure about switch/defines, but I've added several comment blocks
> to describe possible return values, as well as their handling.
> Hope, it will be enough.

Sounds good.

> > > static int memory_events_show(struct seq_file *m, void *v)
> > > {
> > > struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> > > @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = {
> > > .write = memory_max_write,
> > > },
> > > {
> > > + .name = "oom_kill_all_tasks",
> > > + .flags = CFTYPE_NOT_ON_ROOT,
> > > + .seq_show = memory_oom_kill_all_tasks_show,
> > > + .write = memory_oom_kill_all_tasks_write,
> > > + },
> >
> > This name is quite a mouthful and reminiscent of the awkward v1
> > interface names. It doesn't really go well with the v2 names.
> >
> > How about memory.oom_group?
>
> I'd prefer to have something more obvious. I've renamed
> memory.oom_kill_all_tasks to memory.oom_kill_all, which was earlier suggested
> by Vladimir. Are you ok with it?

No, we should be striving for short and sweet mnemonics that express a
concept (oom applies to group, not member tasks) instead of underscore
sentences that describe an implementation (upon oom, kill all tasks in
the group).

It's better to have newbies consult the documentation once than making
everybody deal with long and cumbersome names for the rest of time.

Like 'ls' being better than 'read_and_print_directory_contents'.

2017-08-23 18:05:33

by Roman Gushchin

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Wed, Aug 23, 2017 at 01:24:41PM -0400, Johannes Weiner wrote:
> Hi,
>
> On Wed, Aug 23, 2017 at 05:20:31PM +0100, Roman Gushchin wrote:
> > On Tue, Aug 22, 2017 at 01:03:44PM -0400, Johannes Weiner wrote:
> > > > + css_task_iter_start(&memcg->css, 0, &it);
> > > > + while ((task = css_task_iter_next(&it))) {
> > > > + /*
> > > > + * If there are no tasks, or all tasks have oom_score_adj set
> > > > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> > > > + * don't select this memory cgroup.
> > > > + */
> > > > + if (!elegible &&
> > > > + (memcg->oom_kill_all_tasks ||
> > > > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> > > > + elegible = 1;
> > >
> > > This is a little awkward to read. How about something like this:
> > >
> > > /*
> > > * When killing individual tasks, we respect OOM score adjustments:
> > > * at least one task in the group needs to be killable for the group
> > > * to be oomable.
> > > *
> > > * Also check that previous OOM kills have finished, and abort if
> > > * there are any pending OOM victims.
> > > */
> > > oomable = memcg->oom_kill_all_tasks;
> > > while ((task = css_task_iter_next(&it))) {
> > > if (!oomable && task->signal_oom_score_adj != OOM_SCORE_ADJ_MIN)
> > > oomable = 1;
> > >
> > > > + if (tsk_is_oom_victim(task) &&
> > > > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > > > + elegible = -1;
> > > > + break;
> > > > + }
> > > > + }
> > > > + css_task_iter_end(&it);
> >
> > We ignore oom_score_adj if oom_kill_all_tasks is set, it's
> > not reflected in your version. Anyway, I've moved the comments block
> > outside and rephrased it to make more clear.
>
> Yes it is...? We only respect the score if !oomable, which is set to
> oom_kill_all_tasks.

Sorry, haven't noticed this.

> > > > static int memory_events_show(struct seq_file *m, void *v)
> > > > {
> > > > struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> > > > @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = {
> > > > .write = memory_max_write,
> > > > },
> > > > {
> > > > + .name = "oom_kill_all_tasks",
> > > > + .flags = CFTYPE_NOT_ON_ROOT,
> > > > + .seq_show = memory_oom_kill_all_tasks_show,
> > > > + .write = memory_oom_kill_all_tasks_write,
> > > > + },
> > >
> > > This name is quite a mouthful and reminiscent of the awkward v1
> > > interface names. It doesn't really go well with the v2 names.
> > >
> > > How about memory.oom_group?
> >
> > I'd prefer to have something more obvious. I've renamed
> > memory.oom_kill_all_tasks to memory.oom_kill_all, which was earlier suggested
> > by Vladimir. Are you ok with it?
>
> No, we should be striving for short and sweet mnemonics that express a
> concept (oom applies to group, not member tasks) instead of underscore
> sentences that describe an implementation (upon oom, kill all tasks in
> the group).

Why do you call it implementation, it's definitely an user's intention
"if a cgroup is under OOM, all belonging processes should be killed".

How it can be implemented differently?

>
> It's better to have newbies consult the documentation once than making
> everybody deal with long and cumbersome names for the rest of time.
>
> Like 'ls' being better than 'read_and_print_directory_contents'.

I don't think it's a good argument here: realistically, nobody will type
the knob's name often. Your option is shorter only by 3 characters :)

Anyway, I'm ok with memory.oom_group too, if everybody else prefer it.
Michal, David?
What's your opinion?

Thanks!

2017-08-23 23:13:31

by David Rientjes

[permalink] [raw]
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Wed, 23 Aug 2017, Roman Gushchin wrote:

> > It's better to have newbies consult the documentation once than making
> > everybody deal with long and cumbersome names for the rest of time.
> >
> > Like 'ls' being better than 'read_and_print_directory_contents'.
>
> I don't think it's a good argument here: realistically, nobody will type
> the knob's name often. Your option is shorter only by 3 characters :)
>
> Anyway, I'm ok with memory.oom_group too, if everybody else prefer it.
> Michal, David?
> What's your opinion?
>

I'm probably the worst person in the world for succinctly naming stuff,
but I at least think the knob should have the word "kill" in it to
describe the behavior. ("oom_group", out of memory group, what exactly is
that?)