2017-06-01 18:37:30

by Roman Gushchin

Subject: [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer

Traditionally, the OOM killer operates on the process level.
Under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running
containers well. There are three main issues:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
them will be killed. In many cases it is much safer to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

3) Per-process oom_score_adj affects global OOM, so it's a breach
in the isolation.

To address these issues, a cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory consumer
and free memory by killing the corresponding task(s). The difference
from the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers, as well as single processes.

By default, it will look for the biggest leaf cgroup, and kill
the largest task inside.

But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. If the cgroup is
selected as an OOM victim, all of its tasks will be killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 13 ++++
include/linux/oom.h | 1 +
mm/memcontrol.c | 178 +++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 6 ++
4 files changed, 198 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 818a42e..67709a4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -34,6 +34,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;

/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -471,6 +472,9 @@ static inline bool task_in_memcg_oom(struct task_struct *p)

bool mem_cgroup_oom_synchronize(bool wait);

+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+bool mem_cgroup_kill_oom_victim(struct oom_control *oc);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -931,6 +935,15 @@ static inline void memcg_kmem_update_page_stat(struct page *page,
enum memcg_stat_item idx, int val)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
+static inline bool mem_cgroup_kill_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG && !CONFIG_SLOB */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index edf7a77..a6086a2 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+ struct mem_cgroup *chosen_memcg;
};

extern struct mutex oom_lock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f979ac7..855d335 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,184 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}

+static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ long points = 0;
+ int nid;
+ struct mem_cgroup *iter;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(iter, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+ }
+
+ points += mem_cgroup_get_nr_swap_pages(iter);
+ points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(iter, MEMCG_SLAB_UNRECLAIMABLE);
+ points += memcg_page_state(iter, MEMCG_SOCK);
+ }
+
+ return points;
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct cgroup_subsys_state *css = NULL;
+ struct mem_cgroup *iter = NULL;
+ struct mem_cgroup *chosen_memcg = NULL;
+ struct mem_cgroup *parent = root_mem_cgroup;
+ unsigned long totalpages = oc->totalpages;
+ long chosen_memcg_points = 0;
+ long points = 0;
+
+ oc->chosen = NULL;
+ oc->chosen_memcg = NULL;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ pr_info("Choosing a victim memcg because of the %s",
+ oc->memcg ?
+ "memory limit reached of cgroup " :
+ "system-wide OOM\n");
+ if (oc->memcg) {
+ pr_cont_cgroup_path(oc->memcg->css.cgroup);
+ pr_cont("\n");
+
+ chosen_memcg = oc->memcg;
+ parent = oc->memcg;
+ }
+
+ rcu_read_lock();
+
+ for (;;) {
+ css = css_next_child(css, &parent->css);
+ if (css) {
+ iter = mem_cgroup_from_css(css);
+
+ points = mem_cgroup_oom_badness(iter, oc->nodemask);
+ points += iter->oom_score_adj * (totalpages / 1000);
+
+ pr_info("Cgroup ");
+ pr_cont_cgroup_path(iter->css.cgroup);
+ pr_cont(": %ld\n", points);
+
+ if (points > chosen_memcg_points) {
+ chosen_memcg = iter;
+ chosen_memcg_points = points;
+ oc->chosen_points = points;
+ }
+
+ continue;
+ }
+
+ if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
+ /* Go deeper in the cgroup hierarchy */
+ totalpages = chosen_memcg_points;
+ chosen_memcg_points = 0;
+
+ parent = chosen_memcg;
+ chosen_memcg = NULL;
+
+ continue;
+ }
+
+ if (!chosen_memcg && parent != root_mem_cgroup)
+ chosen_memcg = parent;
+
+ break;
+ }
+
+ if (!oc->memcg) {
+ /*
+ * We should also consider tasks in the root cgroup
+ * with badness larger than oc->chosen_points
+ */
+
+ struct css_task_iter it;
+ struct task_struct *task;
+ int ret = 0;
+
+ css_task_iter_start(&root_mem_cgroup->css, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = oom_evaluate_task(task, oc);
+ css_task_iter_end(&it);
+ }
+
+ if (!oc->chosen && chosen_memcg) {
+ pr_info("Chosen cgroup ");
+ pr_cont_cgroup_path(chosen_memcg->css.cgroup);
+ pr_cont(": %ld\n", oc->chosen_points);
+
+ if (chosen_memcg->oom_kill_all_tasks) {
+ css_get(&chosen_memcg->css);
+ oc->chosen_memcg = chosen_memcg;
+ } else {
+ /*
+ * If we don't need to kill all tasks in the cgroup,
+ * let's select the biggest task.
+ */
+ oc->chosen_points = 0;
+ select_bad_process(oc, chosen_memcg);
+ }
+ } else if (oc->chosen)
+ pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
+ oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
+
+ rcu_read_unlock();
+
+ oc->chosen_points = 0;
+ return !!oc->chosen || !!oc->chosen_memcg;
+}
+
+static int __oom_kill_task(struct task_struct *tsk, void *arg)
+{
+ if (!is_global_init(tsk) && !(tsk->flags & PF_KTHREAD)) {
+ get_task_struct(tsk);
+ __oom_kill_process(tsk);
+ }
+ return 0;
+}
+
+bool mem_cgroup_kill_oom_victim(struct oom_control *oc)
+{
+ if (oc->chosen_memcg) {
+ /*
+ * Kill all tasks in the cgroup hierarchy
+ */
+ mem_cgroup_scan_tasks(oc->chosen_memcg,
+ __oom_kill_task, NULL);
+
+ /*
+ * Release oc->chosen_memcg
+ */
+ css_put(&oc->chosen_memcg->css);
+ oc->chosen_memcg = NULL;
+ }
+
+ if (oc->chosen && oc->chosen != (void *)-1UL) {
+ __oom_kill_process(oc->chosen);
+ return true;
+ }
+
+ /*
+ * Reset points before falling back to an old
+ * per-process OOM victim selection logic
+ */
+ oc->chosen_points = 0;
+
+ return !!oc->chosen;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8cf77fb..1346565 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1039,6 +1039,12 @@ bool out_of_memory(struct oom_control *oc)
return true;
}

+ if (mem_cgroup_select_oom_victim(oc) &&
+ mem_cgroup_kill_oom_victim(oc)) {
+ schedule_timeout_killable(1);
+ return true;
+ }
+
select_bad_process(oc, oc->memcg);
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
--
2.7.4


2017-06-04 20:43:47

by Vladimir Davydov

Subject: Re: [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer

On Thu, Jun 01, 2017 at 07:35:14PM +0100, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
>
> This behavior doesn't suit well the system with many running
> containers. There are two main issues:
>
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much more safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
>
> 3) Per-process oom_score_adj affects global OOM, so it's a breache
> in the isolation.
>
> To address these issues, cgroup-aware OOM killer is introduced.
>
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing corresponding task(s). The difference
> the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
>
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.
>
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer treat
> the whole cgroup as an indivisible memory consumer. In case if it's
> selected as on OOM victim, all belonging tasks will be killed.
>
> Tasks in the root cgroup are treated as independent memory consumers,
> and are compared with other memory consumers (e.g. leaf cgroups).
> The root cgroup doesn't support the oom_kill_all_tasks feature.
>
...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f979ac7..855d335 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2625,6 +2625,184 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> return ret;
> }
>
> +static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> + struct mem_cgroup *iter;
> +
> + for_each_mem_cgroup_tree(iter, memcg) {

AFAIU this function is called on every iteration over the cgroup tree,
which might be costly in case of a deep hierarchy, as it has quadratic
complexity at worst. We could eliminate the nested loop by computing
badness of all eligible cgroups before starting looking for a victim and
saving the values in struct mem_cgroup. Not sure if it's worth it, as
OOM is a pretty cold path.

> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(iter, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));

Hmm, is there a reason why we shouldn't take into account file pages?

> + }
> +
> + points += mem_cgroup_get_nr_swap_pages(iter);

AFAICS mem_cgroup_get_nr_swap_pages() returns the number of pages that
can still be charged to the cgroup. IIUC we want to account pages that
have already been charged to the cgroup, i.e. the value of the 'swap'
page counter or MEMCG_SWAP stat counter.

> + points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(iter, MEMCG_SLAB_UNRECLAIMABLE);
> + points += memcg_page_state(iter, MEMCG_SOCK);
> + }
> +
> + return points;
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + struct cgroup_subsys_state *css = NULL;
> + struct mem_cgroup *iter = NULL;
> + struct mem_cgroup *chosen_memcg = NULL;
> + struct mem_cgroup *parent = root_mem_cgroup;
> + unsigned long totalpages = oc->totalpages;
> + long chosen_memcg_points = 0;
> + long points = 0;
> +
> + oc->chosen = NULL;
> + oc->chosen_memcg = NULL;
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + pr_info("Choosing a victim memcg because of the %s",
> + oc->memcg ?
> + "memory limit reached of cgroup " :
> + "system-wide OOM\n");
> + if (oc->memcg) {
> + pr_cont_cgroup_path(oc->memcg->css.cgroup);
> + pr_cont("\n");
> +
> + chosen_memcg = oc->memcg;
> + parent = oc->memcg;
> + }
> +
> + rcu_read_lock();
> +
> + for (;;) {
> + css = css_next_child(css, &parent->css);
> + if (css) {
> + iter = mem_cgroup_from_css(css);
> +
> + points = mem_cgroup_oom_badness(iter, oc->nodemask);
> + points += iter->oom_score_adj * (totalpages / 1000);
> +
> + pr_info("Cgroup ");
> + pr_cont_cgroup_path(iter->css.cgroup);
> + pr_cont(": %ld\n", points);

Not sure if everyone wants to see these messages in the log.

> +
> + if (points > chosen_memcg_points) {
> + chosen_memcg = iter;
> + chosen_memcg_points = points;
> + oc->chosen_points = points;
> + }
> +
> + continue;
> + }
> +
> + if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
> + /* Go deeper in the cgroup hierarchy */
> + totalpages = chosen_memcg_points;

We set 'totalpages' to the target cgroup limit (or the total RAM
size) when computing a victim score. Why do you prefer to use
chosen_memcg_points here instead? Why not the limit of the chosen
cgroup?

> + chosen_memcg_points = 0;
> +
> + parent = chosen_memcg;
> + chosen_memcg = NULL;
> +
> + continue;
> + }
> +
> + if (!chosen_memcg && parent != root_mem_cgroup)
> + chosen_memcg = parent;
> +
> + break;
> + }
> +

> + if (!oc->memcg) {
> + /*
> + * We should also consider tasks in the root cgroup
> + * with badness larger than oc->chosen_points
> + */
> +
> + struct css_task_iter it;
> + struct task_struct *task;
> + int ret = 0;
> +
> + css_task_iter_start(&root_mem_cgroup->css, &it);
> + while (!ret && (task = css_task_iter_next(&it)))
> + ret = oom_evaluate_task(task, oc);
> + css_task_iter_end(&it);
> + }

IMHO it isn't quite correct to compare tasks from the root cgroup with
leaf cgroups, because they are at different levels. Shouldn't we compare
their scores only with the top level cgroups?

As an alternative approach, maybe we could remove this branch
altogether and ignore root tasks here (i.e. give any root task a higher
priority a priori)? Perhaps, it could be acceptable, because normally
the root cgroup only hosts kernel processes and init (at least this is
the default systemd setup IIRC).

> +
> + if (!oc->chosen && chosen_memcg) {
> + pr_info("Chosen cgroup ");
> + pr_cont_cgroup_path(chosen_memcg->css.cgroup);
> + pr_cont(": %ld\n", oc->chosen_points);
> +
> + if (chosen_memcg->oom_kill_all_tasks) {
> + css_get(&chosen_memcg->css);
> + oc->chosen_memcg = chosen_memcg;
> + } else {
> + /*
> + * If we don't need to kill all tasks in the cgroup,
> + * let's select the biggest task.
> + */
> + oc->chosen_points = 0;

> + select_bad_process(oc, chosen_memcg);

I think we'd better use mem_cgroup_scan_task() here directly, without
exporting select_bad_process() from oom_kill.c. IMHO it would be more
straightforward, because select_bad_process() has a branch handling the
global OOM, which isn't used in this case. Come to think of it, wouldn't
it be better to return the chosen cgroup in @oc and let out_of_memory()
select a process within it or kill it as a whole depending on the value
of the oom_kill_all_tasks flag?

Also, if the chosen cgroup has no tasks (which is perfectly possible if
all memory within the cgroup is consumed by shmem e.g.), shouldn't we
retry the cgroup selection?

> + }
> + } else if (oc->chosen)
> + pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
> + oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
> +
> + rcu_read_unlock();
> +
> + oc->chosen_points = 0;
> + return !!oc->chosen || !!oc->chosen_memcg;
> +}
> +
> +static int __oom_kill_task(struct task_struct *tsk, void *arg)
> +{
> + if (!is_global_init(tsk) && !(tsk->flags & PF_KTHREAD)) {
> + get_task_struct(tsk);
> + __oom_kill_process(tsk);
> + }
> + return 0;
> +}
> +
> +bool mem_cgroup_kill_oom_victim(struct oom_control *oc)

I think it'd be OK to define this function in oom_kill.c - we
have everything we need for that. We wouldn't have to export
__oom_kill_process without oom_kill_process then, which is kinda
ugly IMHO.

> +{
> + if (oc->chosen_memcg) {
> + /*
> + * Kill all tasks in the cgroup hierarchy
> + */
> + mem_cgroup_scan_tasks(oc->chosen_memcg,
> + __oom_kill_task, NULL);
> +
> + /*
> + * Release oc->chosen_memcg
> + */
> + css_put(&oc->chosen_memcg->css);
> + oc->chosen_memcg = NULL;
> + }
> +
> + if (oc->chosen && oc->chosen != (void *)-1UL) {

> + __oom_kill_process(oc->chosen);

Why don't you use oom_kill_process (without leading underscores) here?

> + return true;
> + }
> +
> + /*
> + * Reset points before falling back to an old
> + * per-process OOM victim selection logic
> + */
> + oc->chosen_points = 0;
> +
> + return !!oc->chosen;
> +}

2017-06-06 16:00:47

by Roman Gushchin

Subject: Re: [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer

On Sun, Jun 04, 2017 at 11:43:33PM +0300, Vladimir Davydov wrote:
> On Thu, Jun 01, 2017 at 07:35:14PM +0100, Roman Gushchin wrote:
> ...
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f979ac7..855d335 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2625,6 +2625,184 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> > return ret;
> > }
> >
> > +static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
> > + const nodemask_t *nodemask)
> > +{
> > + long points = 0;
> > + int nid;
> > + struct mem_cgroup *iter;
> > +
> > + for_each_mem_cgroup_tree(iter, memcg) {
>
> AFAIU this function is called on every iteration over the cgroup tree,
> which might be costly in case of a deep hierarchy, as it has quadratic
> complexity at worst. We could eliminate the nested loop by computing
> badness of all eligible cgroups before starting looking for a victim and
> saving the values in struct mem_cgroup. Not sure if it's worth it, as
> OOM is a pretty cold path.

I've thought about it, but it's really not obvious that we want to pay
with additional memory usage (and code complexity) for an optimization
of this path. So, I decided to keep it simple for now, and postpone
any optimizations until we agree on everything else.

>
> > + for_each_node_state(nid, N_MEMORY) {
> > + if (nodemask && !node_isset(nid, *nodemask))
> > + continue;
> > +
> > + points += mem_cgroup_node_nr_lru_pages(iter, nid,
> > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
>
> Hmm, is there a reason why we shouldn't take into account file pages?

Because under OOM conditions we should not have much page cache,
and killing a process is unlikely to help release any additional memory.
But maybe I'm missing something... Lazy free?

>
> > + }
> > +
> > + points += mem_cgroup_get_nr_swap_pages(iter);
>
> AFAICS mem_cgroup_get_nr_swap_pages() returns the number of pages that
> can still be charged to the cgroup. IIUC we want to account pages that
> have already been charged to the cgroup, i.e. the value of the 'swap'
> page counter or MEMCG_SWAP stat counter.

Ok, I'll check it. Thank you!

>
> > + points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
> > + (PAGE_SIZE / 1024);
> > + points += memcg_page_state(iter, MEMCG_SLAB_UNRECLAIMABLE);
> > + points += memcg_page_state(iter, MEMCG_SOCK);
> > + }
> > +
> > + return points;
> > +}
> > +
> > +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> > +{
> > + struct cgroup_subsys_state *css = NULL;
> > + struct mem_cgroup *iter = NULL;
> > + struct mem_cgroup *chosen_memcg = NULL;
> > + struct mem_cgroup *parent = root_mem_cgroup;
> > + unsigned long totalpages = oc->totalpages;
> > + long chosen_memcg_points = 0;
> > + long points = 0;
> > +
> > + oc->chosen = NULL;
> > + oc->chosen_memcg = NULL;
> > +
> > + if (mem_cgroup_disabled())
> > + return false;
> > +
> > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > + return false;
> > +
> > + pr_info("Choosing a victim memcg because of the %s",
> > + oc->memcg ?
> > + "memory limit reached of cgroup " :
> > + "system-wide OOM\n");
> > + if (oc->memcg) {
> > + pr_cont_cgroup_path(oc->memcg->css.cgroup);
> > + pr_cont("\n");
> > +
> > + chosen_memcg = oc->memcg;
> > + parent = oc->memcg;
> > + }
> > +
> > + rcu_read_lock();
> > +
> > + for (;;) {
> > + css = css_next_child(css, &parent->css);
> > + if (css) {
> > + iter = mem_cgroup_from_css(css);
> > +
> > + points = mem_cgroup_oom_badness(iter, oc->nodemask);
> > + points += iter->oom_score_adj * (totalpages / 1000);
> > +
> > + pr_info("Cgroup ");
> > + pr_cont_cgroup_path(iter->css.cgroup);
> > + pr_cont(": %ld\n", points);
>
> Not sure if everyone wants to see these messages in the log.

What do you suggest? Remove the debug output entirely (we probably still
want some), ratelimit it, or make it optional?

>
> > +
> > + if (points > chosen_memcg_points) {
> > + chosen_memcg = iter;
> > + chosen_memcg_points = points;
> > + oc->chosen_points = points;
> > + }
> > +
> > + continue;
> > + }
> > +
> > + if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
> > + /* Go deeper in the cgroup hierarchy */
> > + totalpages = chosen_memcg_points;
>
> We set 'totalpages' to the target cgroup limit (or the total RAM
> size) when computing a victim score. Why do you prefer to use
> chosen_memcg_points here instead? Why not the limit of the chosen
> cgroup?

Because I'm trying to implement a hierarchical oom_score_adj, so that if
a parent cgroup has oom_score_adj set to -1000, its successors will
(almost) never be selected.

> > + chosen_memcg_points = 0;
> > +
> > + parent = chosen_memcg;
> > + chosen_memcg = NULL;
> > +
> > + continue;
> > + }
> > +
> > + if (!chosen_memcg && parent != root_mem_cgroup)
> > + chosen_memcg = parent;
> > +
> > + break;
> > + }
> > +
>
> > + if (!oc->memcg) {
> > + /*
> > + * We should also consider tasks in the root cgroup
> > + * with badness larger than oc->chosen_points
> > + */
> > +
> > + struct css_task_iter it;
> > + struct task_struct *task;
> > + int ret = 0;
> > +
> > + css_task_iter_start(&root_mem_cgroup->css, &it);
> > + while (!ret && (task = css_task_iter_next(&it)))
> > + ret = oom_evaluate_task(task, oc);
> > + css_task_iter_end(&it);
> > + }
>
> IMHO it isn't quite correct to compare tasks from the root cgroup with
> leaf cgroups, because they are at different levels. Shouldn't we compare
> their scores only with the top level cgroups?

Not sure I follow your idea...
Of course, comparing tasks with cgroups is not really precise,
but hopefully it should be good enough for the task at hand.

> As an alternative approach, may be, we could remove this branch
> altogether and ignore root tasks here (i.e. have any root task a higher
> priority a priori)? Perhaps, it could be acceptable, because normally
> the root cgroup only hosts kernel processes and init (at least this is
> the default systemd setup IIRC).
>
> > +
> > + if (!oc->chosen && chosen_memcg) {
> > + pr_info("Chosen cgroup ");
> > + pr_cont_cgroup_path(chosen_memcg->css.cgroup);
> > + pr_cont(": %ld\n", oc->chosen_points);
> > +
> > + if (chosen_memcg->oom_kill_all_tasks) {
> > + css_get(&chosen_memcg->css);
> > + oc->chosen_memcg = chosen_memcg;
> > + } else {
> > + /*
> > + * If we don't need to kill all tasks in the cgroup,
> > + * let's select the biggest task.
> > + */
> > + oc->chosen_points = 0;
>
> > + select_bad_process(oc, chosen_memcg);
>
> I think we'd better use mem_cgroup_scan_task() here directly, without
> exporting select_bad_process() from oom_kill.c. IMHO it would be more
> straightforward, because select_bad_process() has a branch handling the
> global OOM, which isn't used in this case. Come to think of it, wouldn't
> it be better to return the chosen cgroup in @oc and let out_of_memory()
> select a process within it or kill it as a whole depending on the value
> of the oom_kill_all_tasks flag?
>
> Also, if the chosen cgroup has no tasks (which is perfectly possible if
> all memory within the cgroup is consumed by shmem e.g.), shouldn't we
> retry the cgroup selection?

Good point. Should we retry the cgroup selection or just ignore
non-populated cgroups during selection?

>
> > + }
> > + } else if (oc->chosen)
> > + pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
> > + oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
> > +
> > + rcu_read_unlock();
> > +
> > + oc->chosen_points = 0;
> > + return !!oc->chosen || !!oc->chosen_memcg;
> > +}
> > +
> > +static int __oom_kill_task(struct task_struct *tsk, void *arg)
> > +{
> > + if (!is_global_init(tsk) && !(tsk->flags & PF_KTHREAD)) {
> > + get_task_struct(tsk);
> > + __oom_kill_process(tsk);
> > + }
> > + return 0;
> > +}
> > +
> > +bool mem_cgroup_kill_oom_victim(struct oom_control *oc)
>
> I think it'd be OK to define this function in oom_kill.c - we
> have everything we need for that. We wouldn't have to export
> __oom_kill_process without oom_kill_process then, which is kinda
> ugly IMHO.
>
> > +{
> > + if (oc->chosen_memcg) {
> > + /*
> > + * Kill all tasks in the cgroup hierarchy
> > + */
> > + mem_cgroup_scan_tasks(oc->chosen_memcg,
> > + __oom_kill_task, NULL);
> > +
> > + /*
> > + * Release oc->chosen_memcg
> > + */
> > + css_put(&oc->chosen_memcg->css);
> > + oc->chosen_memcg = NULL;
> > + }
> > +
> > + if (oc->chosen && oc->chosen != (void *)-1UL) {
>
> > + __oom_kill_process(oc->chosen);
>
> Why don't you use oom_kill_process (without leading underscores) here?

Because oom_kill_process() has some unwanted side effects:
1) it can kill a process other than the specified one, and we don't need this optimization here;
2) bulky debug output.

Thank you for review!

Roman