There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:
(1) it allows users to evade the oom killer by creating subcontainers or
    using other controllers, since scoring is done per cgroup and not
    hierarchically,
(2) it unfairly compares the root mem cgroup using completely different
    criteria than leaf mem cgroups and allows wildly inaccurate results
    if oom_score_adj is used, and
(3) it does not allow the user to influence the decision making, so
    important subtrees cannot be preferred or biased.
This patchset aims to fix (1) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.
It preserves all functionality that currently exists in -mm and extends
it to be generally useful outside of very specialized use cases.
It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.
---
v3:
- updated documentation
- rebased to next-20180309
Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++-------------
include/linux/cgroup-defs.h | 5 ---
include/linux/memcontrol.h | 21 +++++++++
kernel/cgroup/cgroup.c | 13 +-----
mm/memcontrol.c | 64 +++++++++++++++++++++-----
5 files changed, 132 insertions(+), 61 deletions(-)
One of the three significant concerns brought up about the cgroup aware
oom killer is that its decision making is completely evaded by creating
subcontainers and attaching processes such that the ancestor's usage does
not exceed another cgroup on the system.
Consider the example from the previous patch where "memory" is set in
each mem cgroup's cgroup.controllers:
mem cgroup cgroup.procs
========== ============
/cg1 1 process consuming 250MB
/cg2 3 processes consuming 100MB each
/cg3/cg31 2 processes consuming 100MB each
/cg3/cg32 2 processes consuming 100MB each
If memory.oom_policy is "cgroup", a process from /cg2 is chosen because it
is in the single indivisible memory consumer with the greatest usage.
The true usage of /cg3 is actually 400MB, but a process from /cg2 is
chosen because cgroups are compared individually rather than
hierarchically.
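To make the comparison concrete, the approximate scores under the "cgroup"
policy (treating score as local usage, for simplicity) are:

	mem cgroup   score
	==========   =====
	/cg1         250MB
	/cg2         300MB   (chosen)
	/cg3/cg31    200MB
	/cg3/cg32    200MB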
Consider, for example, a system divided into two users:
mem cgroup memory.max
========== ==========
/userA 250MB
/userB 250MB
If /userA runs all processes attached to the local mem cgroup, whereas
/userB distributes their processes over a set of subcontainers under
/userB, /userA will be unfairly penalized.
There is incentive with cgroup v2 to distribute processes over a set of
subcontainers if those processes are to be constrained by other cgroup
controllers; this is a direct result of mandating a single, unified
hierarchy for cgroups. A user may also reasonably do this for mem cgroup
control or statistics. And, a user may do this to evade the cgroup-aware
oom killer selection logic.
This patch adds an oom policy, "tree", that accounts for hierarchical
usage when comparing cgroups if the cgroup aware oom killer is enabled by
an ancestor. This allows administrators, for example, to require users in
their own top-level mem cgroup subtree to be accounted for with
hierarchical usage. In other words, they can no longer evade the oom killer
by using other controllers or subcontainers.
If an oom policy of "tree" is in place for a subtree, such as /cg3 above,
the hierarchical usage is used for comparisons with other cgroups if
either "cgroup" or "tree" is the oom policy of the oom mem cgroup. Thus,
if /cg3/memory.oom_policy is "tree", one of the processes from /cg3's
subcontainers is chosen for oom kill.
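For the same hierarchy, with /cg3/memory.oom_policy set to "tree", the
comparison (again approximating score as usage) becomes:

	mem cgroup   score
	==========   =====
	/cg1         250MB
	/cg2         300MB
	/cg3         400MB   (chosen: hierarchical usage of cg31 + cg32)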
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 15 +++++++++++++--
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 14 ++++++++++----
3 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1080,6 +1080,10 @@ PAGE_SIZE multiple when read back.
memory consumers; that is, they will compare mem cgroup usage rather
than process memory footprint. See the "OOM Killer" section below.
+ If "tree", the OOM killer will compare mem cgroups and its subtree
+ as a single indivisible memory consumer. This policy cannot be set
+ on the root mem cgroup. See the "OOM Killer" section below.
+
When an OOM condition occurs, the policy is dictated by the mem
cgroup that is OOM (the root mem cgroup for a system-wide OOM
condition). If a descendant mem cgroup has a policy of "none", for
@@ -1087,6 +1091,10 @@ PAGE_SIZE multiple when read back.
the heuristic will still compare mem cgroups as indivisible memory
consumers.
+ When an OOM condition occurs in a mem cgroup with an OOM policy of
+ "cgroup" or "tree", the OOM killer will compare mem cgroups with
+ "cgroup" policy individually with "tree" policy subtrees.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1310,6 +1318,9 @@ There are currently two available oom policies:
subtree as an OOM victim and kill at least one process, depending on
memory.oom_group, from it.
+ - "tree": choose the cgroup with the largest memory footprint considering
+ itself and its subtree and kill at least one process.
+
When selecting a cgroup as a victim, the OOM killer will kill the process
with the largest memory footprint. A user can control this behavior by
enabling the per-cgroup memory.oom_group option. If set, it causes the
@@ -1323,8 +1334,8 @@ Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
significant memory footprint may affect OOM victim selection logic.
If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and enabling oom_group
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+and enabling oom_group on an ancestor layer.
IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -69,6 +69,11 @@ enum memcg_oom_policy {
* mem cgroup as an indivisible consumer
*/
MEMCG_OOM_POLICY_CGROUP,
+ /*
+ * Sum the usage of all descendant memcgs, treating each mem
+ * cgroup and its subtree as an indivisible consumer
+ */
+ MEMCG_OOM_POLICY_TREE,
};
struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2726,7 +2726,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
* The oom_score is calculated for leaf memory cgroups (including
* the root memcg).
- * Non-leaf oom_group cgroups accumulating score of descendant
+ * Cgroups with oom policy of "tree" accumulate the score of descendant
* leaf memory cgroups.
*/
rcu_read_lock();
@@ -2735,10 +2735,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
* We don't consider non-leaf non-oom_group memory cgroups
- * as OOM victims.
+ * without the oom policy of "tree" as OOM victims.
*/
if (memcg_has_children(iter) && iter != root_mem_cgroup &&
- !mem_cgroup_oom_group(iter))
+ !mem_cgroup_oom_group(iter) &&
+ iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
/*
@@ -2801,7 +2802,7 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
else
root = root_mem_cgroup;
- if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+ if (root->oom_policy == MEMCG_OOM_POLICY_NONE)
return false;
select_victim_memcg(root, oc);
@@ -5536,6 +5537,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
case MEMCG_OOM_POLICY_CGROUP:
seq_puts(m, "cgroup\n");
break;
+ case MEMCG_OOM_POLICY_TREE:
+ seq_puts(m, "tree\n");
+ break;
case MEMCG_OOM_POLICY_NONE:
default:
seq_puts(m, "none\n");
@@ -5554,6 +5558,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
+ else if (!memcmp("tree", buf, min(sizeof("tree")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_TREE;
else
ret = -EINVAL;
The cgroup aware oom killer is needlessly enforced for the entire system
by a mount option. It's unnecessary to force the system into a single
oom policy: either cgroup aware, or the traditional process aware.
This patch introduces a memory.oom_policy tunable for all mem cgroups.
It is currently a no-op: it can only be set to "none", which is its
default policy. It will be expanded in the next patch to define cgroup
aware oom killer behavior for its subtree.
This is an extensible interface that can be used to define cgroup aware
assessment of mem cgroup subtrees or the traditional process aware
assessment.
Another benefit of such an approach is that an admin can lock in a
certain policy for the system or for a mem cgroup subtree and can
delegate the policy decision to the user, who can then determine whether
the kill should originate from a subcontainer, treated as an indivisible
memory consumer, or be done per process.
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 11 +++++++++++
include/linux/memcontrol.h | 11 +++++++++++
mm/memcontrol.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1065,6 +1065,17 @@ PAGE_SIZE multiple when read back.
If cgroup-aware OOM killer is not enabled, ENOTSUPP error
is returned on attempt to access the file.
+ memory.oom_policy
+
+ A read-write single string file which exists on all cgroups. The
+ default value is "none".
+
+ If "none", the OOM killer will use the default policy to choose a
+ victim; that is, it will choose the single process with the largest
+ memory footprint adjusted by /proc/pid/oom_score_adj (see
+ Documentation/filesystems/proc.txt). This is the same policy as if
+ memory cgroups were not even mounted.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -58,6 +58,14 @@ enum memcg_event_item {
MEMCG_NR_EVENTS,
};
+enum memcg_oom_policy {
+ /*
+ * No special oom policy, process selection is determined by
+ * oom_badness()
+ */
+ MEMCG_OOM_POLICY_NONE,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -203,6 +211,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /* OOM policy for this subtree */
+ enum memcg_oom_policy oom_policy;
+
/*
* Treat the sub-tree as an indivisible memory consumer,
* kill all belonging tasks if the memory cgroup selected
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4415,6 +4415,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent) {
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
+ memcg->oom_policy = parent->oom_policy;
}
if (parent && parent->use_hierarchy) {
memcg->use_hierarchy = true;
@@ -5532,6 +5533,34 @@ static int memory_stat_show(struct seq_file *m, void *v)
return 0;
}
+static int memory_oom_policy_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
+
+ switch (policy) {
+ case MEMCG_OOM_POLICY_NONE:
+ default:
+ seq_puts(m, "none\n");
+ }
+ return 0;
+}
+
+static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ ssize_t ret = nbytes;
+
+ buf = strstrip(buf);
+ if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+ else
+ ret = -EINVAL;
+
+ return ret;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -5573,6 +5602,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
+ {
+ .name = "oom_policy",
+ .flags = CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_oom_policy_show,
+ .write = memory_oom_policy_write,
+ },
{ } /* terminate */
};
Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount to
change.
Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".
The heuristic used to select a process or cgroup to kill from is
controlled by the oom mem cgroup's memory.oom_policy. This means that if
a descendant mem cgroup has an oom policy of "none", for example, and an
oom condition originates in an ancestor with an oom policy of "cgroup",
the selection logic will treat all descendant cgroups as indivisible
memory consumers.
Consider an example where each mem cgroup has "memory" set
in cgroup.controllers:
mem cgroup cgroup.procs
========== ============
/cg1 1 process consuming 250MB
/cg2 3 processes consuming 100MB each
/cg3/cg31 2 processes consuming 100MB each
/cg3/cg32 2 processes consuming 100MB each
If the root mem cgroup's memory.oom_policy is "none", the process from
/cg1 is chosen as the victim. If memory.oom_policy is "cgroup", a process
from /cg2 is chosen because it is in the single indivisible memory
consumer with the greatest usage. This policy of "cgroup" is identical to
to the current "groupoom" mount option, now removed.
Note that /cg3 is not the chosen victim when the oom mem cgroup policy is
"cgroup" because cgroups are treated individually without regard to
hierarchical /cg3/memory.current usage. This will be addressed in a
follow-up patch.
This has the added benefit of allowing descendant cgroups to control their
own oom policies if they have memory.oom_policy file permissions without
being restricted to the system-wide policy. In the above example, /cg2
and /cg3 can be either "none" or "cgroup" with the same results: the
selection heuristic depends only on the policy of the oom mem cgroup. If
/cg2 or /cg3 themselves are oom, however, the policy is controlled by
their own oom policies, either process aware or cgroup aware.
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 78 +++++++++++++++++++------------------
include/linux/cgroup-defs.h | 5 ---
include/linux/memcontrol.h | 5 +++
kernel/cgroup/cgroup.c | 13 +------
mm/memcontrol.c | 17 ++++----
5 files changed, 55 insertions(+), 63 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1076,6 +1076,17 @@ PAGE_SIZE multiple when read back.
Documentation/filesystems/proc.txt). This is the same policy as if
memory cgroups were not even mounted.
+ If "cgroup", the OOM killer will compare mem cgroups as indivisible
+ memory consumers; that is, they will compare mem cgroup usage rather
+ than process memory footprint. See the "OOM Killer" section below.
+
+ When an OOM condition occurs, the policy is dictated by the mem
+ cgroup that is OOM (the root mem cgroup for a system-wide OOM
+ condition). If a descendant mem cgroup has a policy of "none", for
+ example, for an OOM condition in a mem cgroup with policy "cgroup",
+ the heuristic will still compare mem cgroups as indivisible memory
+ consumers.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1282,43 +1293,36 @@ belonging to the affected files to ensure correct memory ownership.
OOM Killer
~~~~~~~~~~
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
-
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
-
- # mount -o remount,groupoom $MOUNT_POINT
-
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
-
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
-
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
-
-Leaf cgroups and cgroups with oom_group option set are compared based
-on their cumulative memory usage. The root cgroup is treated as a
-leaf memory cgroup as well, so it is compared with other leaf memory
-cgroups. Due to internal implementation restrictions the size of
-the root cgroup is the cumulative sum of oom_badness of all its tasks
-(in other words oom_score_adj of each task is obeyed). Relying on
-oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can lead to over- or
-underestimation of the root cgroup consumption and it is therefore
-discouraged. This might change in the future, however.
-
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
+
+This policy is controlled by memory.oom_policy. When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill. Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
+
+There are currently two available oom policies:
+
+ - "none": default, choose the largest single memory hogging process to
+ oom kill, as the OOM killer has traditionally done.
+
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+ subtree as an OOM victim and kill at least one process, depending on
+ memory.oom_group, from it.
+
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint. A user can control this behavior by
+enabling the per-cgroup memory.oom_group option. If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+
+The root cgroup is treated as a leaf memory cgroup as well, so it is
+compared with other leaf memory cgroups. Due to internal implementation
+restrictions the size of the root cgroup is the cumulative sum of
+oom_badness of all its tasks (in other words oom_score_adj of each task
+is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
+lead to over- or underestimation of the root cgroup consumption and it is
+therefore discouraged. This might change in the future, however.
Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
* Enable cpuset controller in v1 cgroup to use v2 behavior.
*/
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
- /*
- * Enable cgroup-aware OOM killer.
- */
- CGRP_GROUP_OOM = (1 << 5),
};
/* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -64,6 +64,11 @@ enum memcg_oom_policy {
* oom_badness()
*/
MEMCG_OOM_POLICY_NONE,
+ /*
+ * Local cgroup usage is used to select a target cgroup, treating each
+ * mem cgroup as an indivisible consumer
+ */
+ MEMCG_OOM_POLICY_CGROUP,
};
struct mem_cgroup_reclaim_cookie {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1732,9 +1732,6 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
- } else if (!strcmp(token, "groupoom")) {
- *root_flags |= CGRP_GROUP_OOM;
- continue;
}
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1751,11 +1748,6 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
-
- if (root_flags & CGRP_GROUP_OOM)
- cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
- else
- cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
}
@@ -1763,8 +1755,6 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
{
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
- if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
- seq_puts(seq, ",groupoom");
return 0;
}
@@ -5932,8 +5922,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
- "groupoom\n");
+ return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2811,14 +2811,14 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return false;
-
if (oc->memcg)
root = oc->memcg;
else
root = root_mem_cgroup;
+ if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+ return false;
+
select_victim_memcg(root, oc);
return oc->chosen_memcg;
@@ -5425,9 +5425,6 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
bool oom_group = memcg->oom_group;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return -ENOTSUPP;
-
seq_printf(m, "%d\n", oom_group);
return 0;
@@ -5441,9 +5438,6 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
int oom_group;
int err;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return -ENOTSUPP;
-
err = kstrtoint(strstrip(buf), 0, &oom_group);
if (err)
return err;
@@ -5554,6 +5548,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
switch (policy) {
+ case MEMCG_OOM_POLICY_CGROUP:
+ seq_puts(m, "cgroup\n");
+ break;
case MEMCG_OOM_POLICY_NONE:
default:
seq_puts(m, "none\n");
@@ -5570,6 +5567,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
buf = strstrip(buf);
if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+ else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
else
ret = -EINVAL;
There are several downsides to the current implementation that compares
the root mem cgroup with leaf mem cgroups for the cgroup-aware oom killer.
For example, /proc/pid/oom_score_adj is accounted for processes attached
to the root mem cgroup but not leaves. This leads to wild inconsistencies
that unfairly bias for or against the root mem cgroup.
Assume a 728KB bash shell is attached to the root mem cgroup without any
other processes having a non-default /proc/pid/oom_score_adj. At the time
of system oom, the root mem cgroup evaluated to 43,474 pages after boot.
If the bash shell adjusts its /proc/self/oom_score_adj to 1000, however,
the root mem cgroup evaluates to 24,765,482 pages. It would take a
cgroup 95GB of memory to outweigh the root mem cgroup's evaluation.
The reverse is even more confusing: if the bash shell adjusts its
/proc/self/oom_score_adj to -999, the root mem cgroup evaluates to 42,268
pages, a basically meaningless transformation.
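The swing comes from how oom_badness() applies the adjustment; roughly,
simplified from mm/oom_kill.c (eligibility and sanity checks omitted):

	/*
	 * Per-process score, simplified: memory footprint in pages plus
	 * an adjustment proportional to the system's total pages.
	 */
	points = rss + swap + page_tables;
	points += (long)oom_score_adj * totalpages / 1000;

So with oom_score_adj of 1000, a single process adds roughly totalpages
(all of RAM plus swap) to the root mem cgroup's sum, consistent with the
~95GB figure above; with -999, nearly the process's entire footprint is
discounted.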
/proc/pid/oom_score_adj is discounted, however, for processes attached to
leaf mem cgroups. If a sole process using 250MB of memory is attached to
a mem cgroup, it evaluates to 250MB >> PAGE_SHIFT. If its
/proc/pid/oom_score_adj is changed to -999, or even 1000, the evaluation
remains the same for the mem cgroup.
The heuristic that is used for the root mem cgroup also differs from leaf
mem cgroups.
For the root mem cgroup, the evaluation is the sum of all processes'
/proc/pid/oom_score. Besides factoring in oom_score_adj, it is based on
the sum of rss + swap + page tables for all processes attached to it.
For leaf mem cgroups, it is based on the amount of anonymous or
unevictable memory + unreclaimable slab + kernel stack + sock + swap.
There's also an exemption for root mem cgroup processes that do not
intersect the allocating process's mems_allowed. Because the current
heuristic is based on oom_badness(), the evaluation of the root mem
cgroup disregards all processes attached to it that have disjoint
mems_allowed, making oom selection specifically dependent on the
allocating process for system oom conditions!
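That exemption is implemented in oom_unkillable_task(), which the badness
calculation consults first; approximately (from mm/oom_kill.c of this era):

	static bool oom_unkillable_task(struct task_struct *p,
			struct mem_cgroup *memcg, const nodemask_t *nodemask)
	{
		if (is_global_init(p))
			return true;
		if (p->flags & PF_KTHREAD)
			return true;

		/* When mem_cgroup_out_of_memory() and p is not member of the group */
		if (memcg && !task_in_mem_cgroup(p, memcg))
			return true;

		/* p may not have freeable memory in nodemask */
		if (!has_intersects_mems_allowed(p, nodemask))
			return true;

		return false;
	}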
This patch introduces completely fair comparison between the root mem
cgroup and leaf mem cgroups. It compares them with the same heuristic
and does not prefer one over the other. It disregards oom_score_adj
as the cgroup-aware oom killer should, if enabled by memory.oom_policy.
The goal is to target the most memory consuming cgroup on the system,
not consider per-process adjustment.
The fact that the evaluation of all mem cgroups depends on the mempolicy
of the allocating process, which is completely undocumented for the
cgroup-aware oom killer, will be addressed in a subsequent patch.
Signed-off-by: David Rientjes <[email protected]>
---
Based on top of oom policy patch series at
https://marc.info/?t=152090280800001
Documentation/cgroup-v2.txt | 7 +-
mm/memcontrol.c | 147 ++++++++++++++++++------------------
2 files changed, 74 insertions(+), 80 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1328,12 +1328,7 @@ OOM killer to kill all processes attached to the cgroup, except processes
with /proc/pid/oom_score_adj set to -1000 (oom disabled).
The root cgroup is treated as a leaf memory cgroup as well, so it is
-compared with other leaf memory cgroups. Due to internal implementation
-restrictions the size of the root cgroup is the cumulative sum of
-oom_badness of all its tasks (in other words oom_score_adj of each task
-is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
-lead to over- or underestimation of the root cgroup consumption and it is
-therefore discouraged. This might change in the future, however.
+compared with other leaf memory cgroups.
Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,8 @@ int do_swap_account __read_mostly;
#define do_swap_account 0
#endif
+static atomic_long_t total_sock_pages;
+
/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
@@ -2607,9 +2609,9 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
}
static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
{
+ const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
pg_data_t *pgdat;
@@ -2618,92 +2620,65 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
if (nodemask && !node_isset(nid, *nodemask))
continue;
- points += mem_cgroup_node_nr_lru_pages(memcg, nid,
- LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
-
pgdat = NODE_DATA(nid);
- points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
- NR_SLAB_UNRECLAIMABLE);
+ if (is_root_memcg) {
+ points += node_page_state(pgdat, NR_ACTIVE_ANON) +
+ node_page_state(pgdat, NR_INACTIVE_ANON);
+ points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+ } else {
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
}
- points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
- (PAGE_SIZE / 1024);
- points += memcg_page_state(memcg, MEMCG_SOCK);
- points += memcg_page_state(memcg, MEMCG_SWAP);
-
+ if (is_root_memcg) {
+ points += global_zone_page_state(NR_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += atomic_long_read(&total_sock_pages);
+ points += total_swap_pages - atomic_long_read(&nr_swap_pages);
+ } else {
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+ }
return points;
}
/*
- * Checks if the given memcg is a valid OOM victim and returns a number,
- * which means the folowing:
- * -1: there are inflight OOM victim tasks, belonging to the memcg
- * 0: memcg is not eligible, e.g. all belonging tasks are protected
- * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * Checks if the given non-root memcg has a valid OOM victim and returns a
+ * number, which means the following:
+ * -1: there is an inflight OOM victim process attached to the memcg
+ * 0: memcg is not eligible because all tasks attached are unkillable
+ * (kthreads or oom_score_adj set to OOM_SCORE_ADJ_MIN)
* >0: memcg is eligible, and the returned value is an estimation
* of the memory footprint
*/
static long oom_evaluate_memcg(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
{
struct css_task_iter it;
struct task_struct *task;
int eligible = 0;
/*
- * Root memory cgroup is a special case:
- * we don't have necessary stats to evaluate it exactly as
- * leaf memory cgroups, so we approximate it's oom_score
- * by summing oom_score of all belonging tasks, which are
- * owners of their mm structs.
- *
- * If there are inflight OOM victim tasks inside
- * the root memcg, we return -1.
- */
- if (memcg == root_mem_cgroup) {
- struct css_task_iter it;
- struct task_struct *task;
- long score = 0;
-
- css_task_iter_start(&memcg->css, 0, &it);
- while ((task = css_task_iter_next(&it))) {
- if (tsk_is_oom_victim(task) &&
- !test_bit(MMF_OOM_SKIP,
- &task->signal->oom_mm->flags)) {
- score = -1;
- break;
- }
-
- task_lock(task);
- if (!task->mm || task->mm->owner != task) {
- task_unlock(task);
- continue;
- }
- task_unlock(task);
-
- score += oom_badness(task, memcg, nodemask,
- totalpages);
- }
- css_task_iter_end(&it);
-
- return score;
- }
-
- /*
- * Memcg is OOM eligible if there are OOM killable tasks inside.
- *
- * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
- * as unkillable.
- *
- * If there are inflight OOM victim tasks inside the memcg,
- * we return -1.
+ * Memcg is eligible for oom kill if at least one process is eligible
+ * to be killed. Processes with oom_score_adj of OOM_SCORE_ADJ_MIN
+ * are unkillable.
*/
css_task_iter_start(&memcg->css, 0, &it);
while ((task = css_task_iter_next(&it))) {
+ task_lock(task);
+ if (!task->mm || task != task->mm->owner) {
+ task_unlock(task);
+ continue;
+ }
if (!eligible &&
task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
eligible = 1;
+ task_unlock(task);
if (tsk_is_oom_victim(task) &&
!test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
@@ -2716,13 +2691,14 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
- return memcg_oom_badness(memcg, nodemask, totalpages);
+ return memcg_oom_badness(memcg, nodemask);
}
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
{
struct mem_cgroup *iter, *group = NULL;
long group_score = 0;
+ long leaf_score = 0;
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
@@ -2748,12 +2724,18 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
for_each_mem_cgroup_tree(iter, root) {
long score;
+ /*
+ * Root memory cgroup will be considered after iteration,
+ * if eligible.
+ */
+ if (iter == root_mem_cgroup)
+ continue;
+
/*
* We don't consider non-leaf non-oom_group memory cgroups
* without the oom policy of "tree" as OOM victims.
*/
- if (memcg_has_children(iter) && iter != root_mem_cgroup &&
- !mem_cgroup_oom_group(iter) &&
+ if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter) &&
iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
@@ -2761,16 +2743,15 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
* If group is not set or we've ran out of the group's sub-tree,
* we should set group and reset group_score.
*/
- if (!group || group == root_mem_cgroup ||
- !mem_cgroup_is_descendant(iter, group)) {
+ if (!group || !mem_cgroup_is_descendant(iter, group)) {
group = iter;
group_score = 0;
}
- if (memcg_has_children(iter) && iter != root_mem_cgroup)
+ if (memcg_has_children(iter))
continue;
- score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+ score = oom_evaluate_memcg(iter, oc->nodemask);
/*
* Ignore empty and non-eligible memory cgroups.
@@ -2789,6 +2770,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
}
group_score += score;
+ leaf_score += score;
if (group_score > oc->chosen_points) {
oc->chosen_points = group_score;
@@ -2796,8 +2778,25 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
}
}
- if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
- css_get(&oc->chosen_memcg->css);
+ if (oc->chosen_memcg != INFLIGHT_VICTIM) {
+ if (root == root_mem_cgroup) {
+ group_score = oom_evaluate_memcg(root_mem_cgroup,
+ oc->nodemask);
+ if (group_score > leaf_score) {
+ /*
+ * Discount the sum of all leaf scores to find
+ * root score.
+ */
+ group_score -= leaf_score;
+ if (group_score > oc->chosen_points) {
+ oc->chosen_points = group_score;
+ oc->chosen_memcg = root_mem_cgroup;
+ }
+ }
+ }
+ if (oc->chosen_memcg)
+ css_get(&oc->chosen_memcg->css);
+ }
rcu_read_unlock();
}
Hello, David!
Overall I like this idea.
Some questions below.
On Tue, Mar 13, 2018 at 05:21:09PM -0700, David Rientjes wrote:
> There are several downsides to the current implementation that compares
> the root mem cgroup with leaf mem cgroups for the cgroup-aware oom killer.
>
> For example, /proc/pid/oom_score_adj is accounted for processes attached
> to the root mem cgroup but not leaves. This leads to wild inconsistencies
> that unfairly bias for or against the root mem cgroup.
>
> Assume a 728KB bash shell is attached to the root mem cgroup without any
> other processes having a non-default /proc/pid/oom_score_adj. At the time
> of system oom, the root mem cgroup evaluated to 43,474 pages after boot.
> If the bash shell adjusts its /proc/self/oom_score_adj to 1000, however,
> the root mem cgroup evaluates to 24,765,482 pages. It would take a
> cgroup 95GB of memory to outweigh the root mem cgroup's evaluation.
>
> The reverse is even more confusing: if the bash shell adjusts its
> /proc/self/oom_score_adj to -999, the root mem cgroup evaluates to 42,268
> pages, a basically meaningless transformation.
>
> /proc/pid/oom_score_adj is discounted, however, for processes attached to
> leaf mem cgroups. If a sole process using 250MB of memory is attached to
> a mem cgroup, it evaluates to 250MB >> PAGE_SHIFT. If its
> /proc/pid/oom_score_adj is changed to -999, or even 1000, the evaluation
> remains the same for the mem cgroup.
>
> The heuristic that is used for the root mem cgroup also differs from leaf
> mem cgroups.
>
> For the root mem cgroup, the evaluation is the sum of all processes'
> /proc/pid/oom_score. Besides factoring in oom_score_adj, it is based on
> the sum of rss + swap + page tables for all processes attached to it.
> For leaf mem cgroups, it is based on the amount of anonymous or
> unevictable memory + unreclaimable slab + kernel stack + sock + swap.
>
> There's also an exemption for root mem cgroup processes that do not
> intersect the allocating process's mems_allowed. Because the current
> heuristic is based on oom_badness(), the evaluation of the root mem
> cgroup disregards all processes attached to it that have disjoint
> mems_allowed, making oom selection specifically dependent on the
> allocating process for system oom conditions!
>
> This patch introduces completely fair comparison between the root mem
> cgroup and leaf mem cgroups. It compares them with the same heuristic
> and does not prefer one over the other. It disregards oom_score_adj
> as the cgroup-aware oom killer should, if enabled by memory.oom_policy.
> The goal is to target the most memory consuming cgroup on the system,
> not consider per-process adjustment.
>
> The fact that the evaluation of all mem cgroups depends on the mempolicy
> of the allocating process, which is completely undocumented for the
> cgroup-aware oom killer, will be addressed in a subsequent patch.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> Based on top of oom policy patch series at
> https://marc.info/?t=152090280800001
>
> Documentation/cgroup-v2.txt | 7 +-
> mm/memcontrol.c | 147 ++++++++++++++++++------------------
> 2 files changed, 74 insertions(+), 80 deletions(-)
>
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1328,12 +1328,7 @@ OOM killer to kill all processes attached to the cgroup, except processes
> with /proc/pid/oom_score_adj set to -1000 (oom disabled).
>
> The root cgroup is treated as a leaf memory cgroup as well, so it is
> -compared with other leaf memory cgroups. Due to internal implementation
> -restrictions the size of the root cgroup is the cumulative sum of
> -oom_badness of all its tasks (in other words oom_score_adj of each task
> -is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
> -lead to over- or underestimation of the root cgroup consumption and it is
> -therefore discouraged. This might change in the future, however.
> +compared with other leaf memory cgroups.
>
> Please, note that memory charges are not migrating if tasks
> are moved between different memory cgroups. Moving tasks with
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -94,6 +94,8 @@ int do_swap_account __read_mostly;
> #define do_swap_account 0
> #endif
>
> +static atomic_long_t total_sock_pages;
> +
> /* Whether legacy memory+swap accounting is active */
> static bool do_memsw_account(void)
> {
> @@ -2607,9 +2609,9 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> }
>
> static long memcg_oom_badness(struct mem_cgroup *memcg,
> - const nodemask_t *nodemask,
> - unsigned long totalpages)
> + const nodemask_t *nodemask)
> {
> + const bool is_root_memcg = memcg == root_mem_cgroup;
> long points = 0;
> int nid;
> pg_data_t *pgdat;
> @@ -2618,92 +2620,65 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
> if (nodemask && !node_isset(nid, *nodemask))
> continue;
>
> - points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> - LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> -
> pgdat = NODE_DATA(nid);
> - points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> - NR_SLAB_UNRECLAIMABLE);
> + if (is_root_memcg) {
> + points += node_page_state(pgdat, NR_ACTIVE_ANON) +
> + node_page_state(pgdat, NR_INACTIVE_ANON);
> + points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
> + } else {
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> }
>
> - points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> - (PAGE_SIZE / 1024);
> - points += memcg_page_state(memcg, MEMCG_SOCK);
> - points += memcg_page_state(memcg, MEMCG_SWAP);
> -
> + if (is_root_memcg) {
> + points += global_zone_page_state(NR_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += atomic_long_read(&total_sock_pages);
^^^^^^^^^^^^^^^^
BTW, where do we change this counter?
I also doubt that a global atomic variable can work here;
we probably need something that scales better.
Thanks!
On Mon, Mar 12, 2018 at 05:57:53PM -0700, David Rientjes wrote:
> The cgroup aware oom killer is needlessly enforced for the entire system
> by a mount option. It's unnecessary to force the system into a single
> oom policy: either cgroup aware, or the traditional process aware.
Can you, please, provide a real-life example, when using per-process
and cgroup-aware OOM killer depending on OOM scope is beneficial?
It might be quite confusing, depending on configuration.
From inside a container you can have different types of OOMs,
depending on the parent cgroup's configuration, which is not even
accessible for reading from inside.
Also, it's probably good to have an interface to show which policies
are available.
Thanks!
On Wed, 14 Mar 2018, Roman Gushchin wrote:
> > @@ -2618,92 +2620,65 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
> > if (nodemask && !node_isset(nid, *nodemask))
> > continue;
> >
> > - points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > - LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > -
> > pgdat = NODE_DATA(nid);
> > - points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > - NR_SLAB_UNRECLAIMABLE);
> > + if (is_root_memcg) {
> > + points += node_page_state(pgdat, NR_ACTIVE_ANON) +
> > + node_page_state(pgdat, NR_INACTIVE_ANON);
> > + points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
> > + } else {
> > + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > + LRU_ALL_ANON);
> > + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > + NR_SLAB_UNRECLAIMABLE);
> > + }
> > }
> >
> > - points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > - (PAGE_SIZE / 1024);
> > - points += memcg_page_state(memcg, MEMCG_SOCK);
> > - points += memcg_page_state(memcg, MEMCG_SWAP);
> > -
> > + if (is_root_memcg) {
> > + points += global_zone_page_state(NR_KERNEL_STACK_KB) /
> > + (PAGE_SIZE / 1024);
> > + points += atomic_long_read(&total_sock_pages);
> ^^^^^^^^^^^^^^^^
> BTW, where do we change this counter?
>
Seems like it was dropped from the patch somehow. It is intended to do
atomic_long_add(nr_pages) in mem_cgroup_charge_skmem() and
atomic_long_add(-nr_pages) in mem_cgroup_uncharge_skmem().
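For reference, a sketch of what the dropped hunks would look like against
the existing helpers (not the code as posted):

	bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg,
				     unsigned int nr_pages)
	{
		/* ... existing charging logic unchanged ... */
		atomic_long_add(nr_pages, &total_sock_pages);
		return true;
	}

	void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
				       unsigned int nr_pages)
	{
		/* ... existing uncharging logic unchanged ... */
		atomic_long_add(-nr_pages, &total_sock_pages);
	}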
> I also doubt that a global atomic variable can work here;
> we probably need something that scales better.
>
Why do you think an atomic_long_add() is too expensive when we're already
disabling irqs and doing try_charge()?
On Wed, 14 Mar 2018, Roman Gushchin wrote:
> > The cgroup aware oom killer is needlessly enforced for the entire system
> > by a mount option. It's unnecessary to force the system into a single
> > oom policy: either cgroup aware, or the traditional process aware.
>
> Can you, please, provide a real-life example, when using per-process
> and cgroup-aware OOM killer depending on OOM scope is beneficial?
>
Hi Roman,
This question is only about per-process vs cgroup-aware, not about the
need for individual cgroup vs hierarchical subtree, so I'll only focus on
that.
Three reasons:
- Does not lock the entire system into a single methodology. Users
working in a subtree can default to what they are used to: per-process
oom selection even though their subtree might be targeted by a system
   policy level decision at the root. This allows them flexibility to
organize their subtree intuitively for use with other controllers in a
single hierarchy.
The real-world example is a user who currently organizes their subtree
for this purpose and has defined oom_score_adj appropriately and now
regresses if the admin mounts with the needless "groupoom" option.
- Allows changing the oom policy at runtime without remounting the entire
cgroup fs. Depending on how cgroups are going to be used, per-process
vs cgroup-aware may be mandated separately. This is a trait only of
the mem cgroup controller, the root level oom policy is no different
from the subtree and depends directly on how the subtree is organized.
If other controllers are already being used, requiring a remount to
change the system-wide oom policy is an unnecessary burden.
The real-world example is systems software that either supports user
subtrees or strictly subtrees that it maintains itself. While other
controllers are used, the mem cgroup oom policy can be changed at
runtime rather than requiring a remount and reorganizing other
controllers exactly as before.
- Can be extended to cgroup v1 if necessary. There is no need for a
new cgroup v1 mount option and mem cgroup oom selection is not
   dependent on any functionality provided by cgroup v2. The policies
introduced here work exactly the same if used with cgroup v1.
The real-world example is a cgroup configuration that hasn't had
the ability to move to cgroup v2 yet and still would like to use
cgroup-aware oom selection with a very trivial change to add the
memory.oom_policy file to the cgroup v1 filesystem.
> It might be quite confusing, depending on configuration.
> From inside a container you can have different types of OOMs,
> depending on the parent cgroup's configuration, which is not even
> accessible for reading from inside.
>
Right, and the oom is the result of the parent's limit that is outside the
control of the user. That limit, and now oom policy, is defined by the
user controlling the ancestor of the subtree. The user need not be
concerned that it was singled out for oom kill: that policy decision is
outside his or her control. memory.oom_group can certainly be delegated
to the user, but the targeting cannot be changed or evaded.
However, this patchset also provides them with the ability to define their
own oom policy for subcontainers that they create themselves.
> Also, it's probably good to have an interface to show which policies
> are available.
>
This comes back to the user interface question. I'm very happy to address
any way that the interface can be made better, even though I think what is
currently proposed is satisfactory. I think your comment eludes to thp
like enabling where we have "[always] madvise never"? I'm speculating
that you may be happier with memory.oom_policy becoming
"[none] cgroup tree" and extended for additional policies later?
Otherwise the user would need to try the write and test the return value,
which purely depends on whether the policy is available or not. I'm
rather indifferent to either interface, but if you would prefer the
"[none] cgroup tree" appearance, I'll change to that.
On Wed, Mar 14, 2018 at 01:41:03PM -0700, David Rientjes wrote:
> On Wed, 14 Mar 2018, Roman Gushchin wrote:
>
> > > @@ -2618,92 +2620,65 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > if (nodemask && !node_isset(nid, *nodemask))
> > > continue;
> > >
> > > - points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > > - LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > > -
> > > pgdat = NODE_DATA(nid);
> > > - points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > > - NR_SLAB_UNRECLAIMABLE);
> > > + if (is_root_memcg) {
> > > + points += node_page_state(pgdat, NR_ACTIVE_ANON) +
> > > + node_page_state(pgdat, NR_INACTIVE_ANON);
> > > + points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
> > > + } else {
> > > + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > > + LRU_ALL_ANON);
> > > + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > > + NR_SLAB_UNRECLAIMABLE);
> > > + }
> > > }
> > >
> > > - points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > > - (PAGE_SIZE / 1024);
> > > - points += memcg_page_state(memcg, MEMCG_SOCK);
> > > - points += memcg_page_state(memcg, MEMCG_SWAP);
> > > -
> > > + if (is_root_memcg) {
> > > + points += global_zone_page_state(NR_KERNEL_STACK_KB) /
> > > + (PAGE_SIZE / 1024);
> > > + points += atomic_long_read(&total_sock_pages);
> > ^^^^^^^^^^^^^^^^
> > BTW, where do we change this counter?
> >
>
> Seems like it was dropped from the patch somehow. It is intended to do
> atomic_long_add(nr_pages) in mem_cgroup_charge_skmem() and
> atomic_long_add(-nr_pages) in mem_cgroup_uncharge_skmem().
>
> > I also doubt that a global atomic variable can work here;
> > we probably need something that scales better.
> >
>
> Why do you think an atomic_long_add() is too expensive when we're already
> disabling irqs and doing try_charge()?
Hard to say without having full code :)
try_charge() is batched, if you'll batch it too, it will probably work.
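Something along the lines of the charge batching could work here too; a
rough sketch of the idea (hypothetical, and it ignores the cpu migration
window a real version would need to handle):

	#define TOTAL_SOCK_PAGES_BATCH	64

	static DEFINE_PER_CPU(long, sock_pages_pcp);

	static void total_sock_pages_add(long nr_pages)
	{
		long val = this_cpu_add_return(sock_pages_pcp, nr_pages);

		/* only touch the global atomic once per batch */
		if (abs(val) >= TOTAL_SOCK_PAGES_BATCH) {
			atomic_long_add(val, &total_sock_pages);
			this_cpu_sub(sock_pages_pcp, val);
		}
	}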
Hello, David!
On Wed, Mar 14, 2018 at 01:58:59PM -0700, David Rientjes wrote:
> On Wed, 14 Mar 2018, Roman Gushchin wrote:
> - Does not lock the entire system into a single methodology. Users
> working in a subtree can default to what they are used to: per-process
> oom selection even though their subtree might be targeted by a system
>    policy level decision at the root. This allows them flexibility to
> organize their subtree intuitively for use with other controllers in a
> single hierarchy.
>
> The real-world example is a user who currently organizes their subtree
> for this purpose and has defined oom_score_adj appropriately and now
> regresses if the admin mounts with the needless "groupoom" option.
I find this extremely confusing.
The problem is that the OOM policy should independently define how an OOM
in the corresponding scope is handled, not how the cgroup would prefer
OOMs from above to be handled.
As I've said, if you're inside a container, you can have OOMs
of different types, depending on settings, which you don't even know about.
Sometimes oom_score_adj works, sometimes not.
Sometimes all processes are killed, sometimes not.
IMO, this adds nothing but mess.
The mount option (which I'm not a big fan of either) was added only
to provide 100% backward compatibility, which was forced by Michal.
But I doubt that mixing the per-process and per-cgroup approaches
makes any sense.
>
> - Allows changing the oom policy at runtime without remounting the entire
> cgroup fs. Depending on how cgroups are going to be used, per-process
> vs cgroup-aware may be mandated separately. This is a trait only of
> the mem cgroup controller, the root level oom policy is no different
> from the subtree and depends directly on how the subtree is organized.
> If other controllers are already being used, requiring a remount to
> change the system-wide oom policy is an unnecessary burden.
>
> The real-world example is systems software that either supports user
> subtrees or strictly subtrees that it maintains itself. While other
> controllers are used, the mem cgroup oom policy can be changed at
> runtime rather than requiring a remount and reorganizing other
> controllers exactly as before.
Btw, what's the problem with remounting? You don't have to re-create cgroups,
or something like this; the operation is as trivial as adding a flag.
>
> - Can be extended to cgroup v1 if necessary. There is no need for a
> new cgroup v1 mount option and mem cgroup oom selection is not
>    dependent on any functionality provided by cgroup v2. The policies
> introduced here work exactly the same if used with cgroup v1.
>
> The real-world example is a cgroup configuration that hasn't had
> the ability to move to cgroup v2 yet and still would like to use
> cgroup-aware oom selection with a very trivial change to add the
> memory.oom_policy file to the cgroup v1 filesystem.
I assume that v1 interface is frozen.
Thanks!
On Thu, 15 Mar 2018, Roman Gushchin wrote:
> > Seems like it was dropped from the patch somehow. It is intended to do
> > atomic_long_add(nr_pages) in mem_cgroup_charge_skmem() and
> > atomic_long_add(-nr_pages) in mem_cgroup_uncharge_skmem().
> >
> > > I also doubt that a global atomic variable can work here;
> > > we probably need something that scales better.
> > >
> >
> > Why do you think an atomic_long_add() is too expensive when we're already
> > disabling irqs and doing try_charge()?
>
> Hard to say without having full code :)
> try_charge() is batched, if you'll batch it too, it will probably work.
>
The full code is what's specified above: it does the
atomic_long_add(nr_pages) in mem_cgroup_charge_skmem() and
atomic_long_add(-nr_pages) in mem_cgroup_uncharge_skmem().
The patch is comparing the root mem cgroup and leaf mem cgroups fairly.
For this, it requires that we have stats that can be directly compared or
at least very close approximations. We don't want to get in a situation
where root and leaf mem cgroups are being compared based on different
stats.
On Thu, 15 Mar 2018, Roman Gushchin wrote:
> > - Does not lock the entire system into a single methodology. Users
> > working in a subtree can default to what they are used to: per-process
> > oom selection even though their subtree might be targeted by a system
> > policy level decision at the root. This allow them flexibility to
> > organize their subtree intuitively for use with other controllers in a
> > single hierarchy.
> >
> > The real-world example is a user who currently organizes their subtree
> > for this purpose and has defined oom_score_adj appropriately and now
> > regresses if the admin mounts with the needless "groupoom" option.
>
> I find this extremely confusing.
>
> The problem is that the OOM policy should independently define how an OOM
> in the corresponding scope is handled, not how the cgroup would prefer
> OOMs from above to be handled.
>
> As I've said, if you're inside a container, you can have OOMs
> of different types, depending on settings, which you don't even know about.
> Sometimes oom_score_adj works, sometimes not.
> Sometimes all processes are killed, sometimes not.
> IMO, this adds nothing but mess.
>
There are many additional problems with the cgroup aware oom killer in
-mm; yes, the fact that memory.oom_group is factored into the selection
logic is another problem. Users who prefer to account their subtree for
comparison (the only way to avoid allowing users to evade the oom killer
completely) should use the memory.oom_policy of "tree" introduced later.
memory.oom_group needs to be completely separated from the policy of
selecting a victim; it shall only be a mechanism that defines whether a
single process is oom killed or all processes attached to the victim mem
cgroup are, as a property of the workload.
> The mount option (which I'm not a big fan of either) was added only
> to provide 100% backward compatibility, which was forced by Michal.
> But I doubt that mixing the per-process and per-cgroup approaches
> makes any sense.
>
It makes absolute sense and has real users who can immediately use this if
it's merged. There is nothing wrong with a user preferring to kill the
largest process from their subtree on mem cgroup oom. It's what they've
always experienced, with cgroup v1 and v2. It's the difference between
users in a subtree being able to use /proc/pid/oom_score_adj or not.
Without it, their oom_score_adj values become entirely irrelevant. We
have users who tune their oom_score_adj and are running in a subtree they
control.
If an overcommitted ancestor is oom, which is up to the admin to define in
the organization of the hierarchy and imposing limits, the user does not
control which process or group of processes is oom killed. That's a
decision for the ancestor which controls all descendant cgroups, including
limits and oom policies.
> >
> > - Allows changing the oom policy at runtime without remounting the entire
> > cgroup fs. Depending on how cgroups are going to be used, per-process
> > vs cgroup-aware may be mandated separately. This is a trait only of
> > the mem cgroup controller, the root level oom policy is no different
> > from the subtree and depends directly on how the subtree is organized.
> > If other controllers are already being used, requiring a remount to
> > change the system-wide oom policy is an unnecessary burden.
> >
> > The real-world example is systems software that either supports user
> > subtrees or strictly subtrees that it maintains itself. While other
> > controllers are used, the mem cgroup oom policy can be changed at
> > runtime rather than requiring a remount and reorganizing other
> > controllers exactly as before.
>
> Btw, what's the problem with remounting? You don't have to re-create cgroups,
> or something like this; the operation is as trivial as adding a flag.
>
Remounting affects the entire mem cgroup hierarchy. The point of this
entire patchset is that different subtrees will have different policies;
the system cannot be locked into a single selection logic.
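A sketch of what per-subtree policies allow (again assuming a cgroup v2
mount at /sys/fs/cgroup):

  echo cgroup > /sys/fs/cgroup/memory.oom_policy      # system ooms compare cgroups
  echo none > /sys/fs/cgroup/userA/memory.oom_policy  # /userA's own ooms stay per process
  echo tree > /sys/fs/cgroup/userB/memory.oom_policy  # /userB is compared as one unit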
This completely avoids users being able to evade the cgroup-aware oom
killer by creating subcontainers.
Obviously I've been focused on users controlling subtrees in a lot of my
examples. Those users may already prefer oom killing the largest process
on the system (or their subtree). They can still do that with this patch
and opt out of cgroup awareness for their subtree.
It also provides all the functionality that the current implementation in
-mm provides.
> >
> > - Can be extended to cgroup v1 if necessary. There is no need for a
> > new cgroup v1 mount option and mem cgroup oom selection is not
> > dependent on any functionality provided by cgroup v2. The policies
> > introduced here work exactly the same if used with cgroup v1.
> >
> > The real-world example is a cgroup configuration that hasn't had
> > the ability to move to cgroup v2 yet and still would like to use
> > cgroup-aware oom selection with a very trivial change to add the
> > memory.oom_policy file to the cgroup v1 filesystem.
>
> I assume that v1 interface is frozen.
>
It requires adding the memory.oom_policy file to the cgroup v1 fs; that's
it. No other support is needed. If that's allowed upstream, great; if
not, it's a very simple patch to carry for a distribution.
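The only user-visible difference would be the mount point; a hypothetical
v1 equivalent, if the file were wired up:

  echo cgroup > /sys/fs/cgroup/memory/memory.oom_policy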
With the current implementation of the cgroup-aware oom killer,
memory.oom_group defines two behaviors:
- consider the footprint of the "group" consisting of the mem cgroup
itself and all descendants for comparison with other cgroups, and
- when selected as the victim mem cgroup, kill all processes attached to
it and its descendants that are eligible to be killed.
Now that the memory.oom_policy of "tree" considers the memory footprint of
the mem cgroup and all its descendants, separate the memory.oom_group
setting from the selection criteria.
Now, memory.oom_group only controls whether all processes attached to the
victim mem cgroup and its descendants are oom killed (when set to "1"), or
whether only the single largest memory consuming process attached to the
victim mem cgroup and its descendants is killed (when set to "0").
This is generally regarded as a property of the workload attached to the
subtree: it depends on whether the workload can continue running and be
useful if a single process is oom killed or whether it's better to kill
all attached processes.
Signed-off-by: David Rientjes <[email protected]>
---
Based on top of oom policy patch series at
https://marc.info/?t=152090280800001 and follow-up patch at
https://marc.info/?l=linux-kernel&m=152098687824112
Documentation/cgroup-v2.txt | 21 ++++-----------------
mm/memcontrol.c | 8 ++++----
2 files changed, 8 insertions(+), 21 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1045,25 +1045,12 @@ PAGE_SIZE multiple when read back.
A read-write single value file which exists on non-root
cgroups. The default is "0".
- If set, OOM killer will consider the memory cgroup as an
- indivisible memory consumers and compare it with other memory
- consumers by it's memory footprint.
- If such memory cgroup is selected as an OOM victim, all
- processes belonging to it or it's descendants will be killed.
+ If such memory cgroup is selected as an OOM victim, all processes
+ attached to it and its descendants that are eligible for oom kill
+ (their /proc/pid/oom_score_adj is not oom disabled) will be killed.
This applies to system-wide OOM conditions and reaching
the hard memory limit of the cgroup and their ancestor.
- If OOM condition happens in a descendant cgroup with it's own
- memory limit, the memory cgroup can't be considered
- as an OOM victim, and OOM killer will not kill all belonging
- tasks.
-
- Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
- and will never kill the unkillable task, even if memory.oom_group
- is set.
-
- If cgroup-aware OOM killer is not enabled, ENOTSUPP error
- is returned on attempt to access the file.
memory.oom_policy
@@ -1325,7 +1312,7 @@ When selecting a cgroup as a victim, the OOM killer will kill the process
with the largest memory footprint. A user can control this behavior by
enabling the per-cgroup memory.oom_group option. If set, it causes the
OOM killer to kill all processes attached to the cgroup, except processes
-with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+with /proc/pid/oom_score_adj set to OOM_SCORE_ADJ_MIN.
The root cgroup is treated as a leaf memory cgroup as well, so it is
compared with other leaf memory cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2732,11 +2732,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
continue;
/*
- * We don't consider non-leaf non-oom_group memory cgroups
- * without the oom policy of "tree" as OOM victims.
+ * We don't consider non-leaf memory cgroups without the oom
+ * policy of "tree" as OOM victims.
*/
- if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter) &&
- iter->oom_policy != MEMCG_OOM_POLICY_TREE)
+ if (iter->oom_policy != MEMCG_OOM_POLICY_TREE &&
+ memcg_has_children(iter))
continue;
/*
The cgroup-aware oom killer currently considers the set of allowed nodes
for the allocation that triggers the oom killer and discounts usage from
disallowed nodes when comparing cgroups.
If a cgroup has both the cpuset and memory controllers enabled, it may be
possible to restrict allocations to a subset of nodes, for example. Some
latency sensitive users use cpusets to allocate only local memory, almost
to the point of oom even though there is an abundance of available free
memory on other nodes.
The same is true for processes that mbind(2) their memory to a set of
allowed nodes.
Considering each mem cgroup's usage (and perhaps its subtree's) only for
the allocation's set of allowed nodes yields very inconsistent results.
Allocating a single page for a vma bound with mbind(2) to a
now-oom node can cause a cgroup that is restricted to that node by its
cpuset controller to be oom killed when other cgroups may have much higher
overall usage.
The cgroup-aware oom killer is described as killing the largest memory
consuming cgroup (or subtree) without mentioning the mempolicy of the
allocation. For now, disregard the allocation's mempolicy entirely. It
would be possible to add an
additional oom policy for NUMA awareness if it would be generally useful
later with the extensible interface.
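As a sketch of the kind of configuration that trips this up (v1 cpuset
shown; $JOB_PID is a hypothetical latency sensitive process):

  # restrict the job to node 0 only
  mkdir /sys/fs/cgroup/cpuset/latency
  echo 0 > /sys/fs/cgroup/cpuset/latency/cpuset.mems
  echo 0-7 > /sys/fs/cgroup/cpuset/latency/cpuset.cpus
  echo $JOB_PID > /sys/fs/cgroup/cpuset/latency/tasks
  # an oom triggered by this job currently discounts every other cgroup's
  # usage on nodes other than node 0 when comparing them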
Signed-off-by: David Rientjes <[email protected]>
---
Based on top of oom policy patch series at
https://marc.info/?t=152090280800001 and follow-up patch at
https://marc.info/?l=linux-kernel&m=152098687824112
mm/memcontrol.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2608,19 +2608,15 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}
-static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask)
+static long memcg_oom_badness(struct mem_cgroup *memcg)
{
const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
- pg_data_t *pgdat;
for_each_node_state(nid, N_MEMORY) {
- if (nodemask && !node_isset(nid, *nodemask))
- continue;
+ pg_data_t *pgdat = NODE_DATA(nid);
- pgdat = NODE_DATA(nid);
if (is_root_memcg) {
points += node_page_state(pgdat, NR_ACTIVE_ANON) +
node_page_state(pgdat, NR_INACTIVE_ANON);
@@ -2656,8 +2652,7 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
* >0: memcg is eligible, and the returned value is an estimation
* of the memory footprint
*/
-static long oom_evaluate_memcg(struct mem_cgroup *memcg,
- const nodemask_t *nodemask)
+static long oom_evaluate_memcg(struct mem_cgroup *memcg)
{
struct css_task_iter it;
struct task_struct *task;
@@ -2691,7 +2686,7 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
- return memcg_oom_badness(memcg, nodemask);
+ return memcg_oom_badness(memcg);
}
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
@@ -2751,7 +2746,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (memcg_has_children(iter))
continue;
- score = oom_evaluate_memcg(iter, oc->nodemask);
+ score = oom_evaluate_memcg(iter);
/*
* Ignore empty and non-eligible memory cgroups.
@@ -2780,8 +2775,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (oc->chosen_memcg != INFLIGHT_VICTIM) {
if (root == root_mem_cgroup) {
- group_score = oom_evaluate_memcg(root_mem_cgroup,
- oc->nodemask);
+ group_score = oom_evaluate_memcg(root_mem_cgroup);
if (group_score > leaf_score) {
/*
* Discount the sum of all leaf scores to find
On Mon, 12 Mar 2018, David Rientjes wrote:
> There are three significant concerns about the cgroup aware oom killer as
> it is implemented in -mm:
>
> (1) allows users to evade the oom killer by creating subcontainers or
> using other controllers since scoring is done per cgroup and not
> hierarchically,
>
> (2) unfairly compares the root mem cgroup using completely different
> criteria than leaf mem cgroups and allows wildly inaccurate results
> if oom_score_adj is used, and
>
> (3) does not allow the user to influence the decisionmaking, such that
> important subtrees cannot be preferred or biased.
>
> This patchset aims to fix (1) completely and, by doing so, introduces a
> completely extensible user interface that can be expanded in the future.
>
> It preserves all functionality that currently exists in -mm and extends
> it to be generally useful outside of very specialized usecases.
>
> It eliminates the mount option for the cgroup aware oom killer entirely
> since it is now enabled through the root mem cgroup's oom policy.
There are currently six patches in this series, since additional patches
have been proposed on top of it to fix several issues with the current
implementation in -mm. The six patches address (1) and (2) above, as well
as a couple other minor tweaks. I believe (3) can be subsequently
addressed after the feature has been merged since it builds upon what is
already here and shouldn't hold it back from being merged itself.
I plan on sending out the entire series once feedback is received for the
patches already sent.
Thanks.
The cgroup aware oom killer is needlessly enforced for the entire system
by a mount option. It's unnecessary to force the system into a single
oom policy: either cgroup aware, or the traditional process aware.
This patch introduces a memory.oom_policy tunable for all mem cgroups.
It is currently a no-op: it can only be set to "none", which is its
default policy. It will be expanded in the next patch to define cgroup
aware oom killer behavior for its subtree.
This is an extensible interface that can be used to define cgroup aware
assessment of mem cgroup subtrees or the traditional process aware
assessment.
Reading memory.oom_policy will specify the list of available policies.
Another benefit of such an approach is that an admin can lock in a
certain policy for the system or for a mem cgroup subtree, or can
delegate the policy decision to the user, who can determine whether the
kill should originate from a subcontainer (treating subcontainers as
indivisible memory consumers themselves) or be done per process.
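With only this patch applied, the interaction is trivial; a sketch (later
patches add more policies):

  $ cat memory.oom_policy
  [none]
  $ echo none > memory.oom_policy
  $ echo cgroup > memory.oom_policy
  sh: write error: Invalid argument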
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 11 +++++++++++
include/linux/memcontrol.h | 11 +++++++++++
mm/memcontrol.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1065,6 +1065,17 @@ PAGE_SIZE multiple when read back.
If cgroup-aware OOM killer is not enabled, ENOTSUPP error
is returned on attempt to access the file.
+ memory.oom_policy
+
+ A read-write single string file which exists on all cgroups. The
+ default value is "none".
+
+ If "none", the OOM killer will use the default policy to choose a
+ victim; that is, it will choose the single process with the largest
+ memory footprint adjusted by /proc/pid/oom_score_adj (see
+ Documentation/filesystems/proc.txt). This is the same policy as if
+ memory cgroups were not even mounted.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -58,6 +58,14 @@ enum memcg_event_item {
MEMCG_NR_EVENTS,
};
+enum memcg_oom_policy {
+ /*
+ * No special oom policy, process selection is determined by
+ * oom_badness()
+ */
+ MEMCG_OOM_POLICY_NONE,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -203,6 +211,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /* OOM policy for this subtree */
+ enum memcg_oom_policy oom_policy;
+
/*
* Treat the sub-tree as an indivisible memory consumer,
* kill all belonging tasks if the memory cgroup selected
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4430,6 +4430,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent) {
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
+ memcg->oom_policy = parent->oom_policy;
}
if (parent && parent->use_hierarchy) {
memcg->use_hierarchy = true;
@@ -5547,6 +5548,34 @@ static int memory_stat_show(struct seq_file *m, void *v)
return 0;
}
+static int memory_oom_policy_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
+
+ switch (policy) {
+ case MEMCG_OOM_POLICY_NONE:
+ default:
+ seq_puts(m, "[none]\n");
+ };
+ return 0;
+}
+
+static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ ssize_t ret = nbytes;
+
+ buf = strstrip(buf);
+ if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+ else
+ ret = -EINVAL;
+
+ return ret;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -5588,6 +5617,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
+ {
+ .name = "oom_policy",
+ .flags = CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_oom_policy_show,
+ .write = memory_oom_policy_write,
+ },
{ } /* terminate */
};
One of the three significant concerns brought up about the cgroup aware
oom killer is that its decisionmaking is completely evaded by creating
subcontainers and attaching processes such that the ancestor's usage does
not exceed another cgroup on the system.
Consider the example from the previous patch where "memory" is set in
each mem cgroup's cgroup.controllers:
mem cgroup cgroup.procs
========== ============
/cg1 1 process consuming 250MB
/cg2 3 processes consuming 100MB each
/cg3/cg31 2 processes consuming 100MB each
/cg3/cg32 2 processes consuming 100MB each
If memory.oom_policy is "cgroup", a process from /cg2 is chosen because it
is in the single indivisible memory consumer with the greatest usage.
The true usage of /cg3 is actually 400MB, but a process from /cg2 is
chosen because cgroups are compared individually rather than
hierarchically.
If a system is divided into two users, for example:
mem cgroup memory.max
========== ==========
/userA 250MB
/userB 250MB
If /userA runs all processes attached to the local mem cgroup, whereas
/userB distributes their processes over a set of subcontainers under
/userB, /userA will be unfairly penalized.
There is incentive with cgroup v2 to distribute processes over a set of
subcontainers if those processes shall be constrained by other cgroup
controllers; this is a direct result of mandating a single, unified
hierarchy for cgroups. A user may also reasonably do this for mem cgroup
control or statistics. And, a user may do this to evade the cgroup-aware
oom killer selection logic.
This patch adds an oom policy, "tree", that accounts for hierarchical
usage when comparing cgroups and the cgroup aware oom killer is enabled by
an ancestor. This allows administrators, for example, to require users in
their own top-level mem cgroup subtree to be accounted for with
hierarchical usage. In other words, they can no longer evade the oom killer
by using other controllers or subcontainers.
If an oom policy of "tree" is in place for a subtree, such as /cg3 above,
the hierarchical usage is used for comparisons with other cgroups if
either "cgroup" or "tree" is the oom policy of the oom mem cgroup. Thus,
if /cg3/memory.oom_policy is "tree", one of the processes from /cg3's
subcontainers is chosen for oom kill.
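A sketch of the administrator's side of the example above (assuming a
cgroup v2 mount at /sys/fs/cgroup):

  echo cgroup > /sys/fs/cgroup/memory.oom_policy    # oom memcg compares cgroups
  echo tree > /sys/fs/cgroup/cg3/memory.oom_policy  # /cg3 is evaluated hierarchically
  # /cg3 now scores 400MB against /cg1 (250MB) and /cg2 (300MB), so a
  # process from /cg3's subcontainers is chosen on system oom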
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 17 ++++++++++++++---
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 18 ++++++++++++------
3 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1080,6 +1080,10 @@ PAGE_SIZE multiple when read back.
memory consumers; that is, they will compare mem cgroup usage rather
than process memory footprint. See the "OOM Killer" section below.
+ If "tree", the OOM killer will compare mem cgroups and its subtree
+ as a single indivisible memory consumer. This policy cannot be set
+ on the root mem cgroup. See the "OOM Killer" section below.
+
When an OOM condition occurs, the policy is dictated by the mem
cgroup that is OOM (the root mem cgroup for a system-wide OOM
condition). If a descendant mem cgroup has a policy of "none", for
@@ -1087,6 +1091,10 @@ PAGE_SIZE multiple when read back.
the heuristic will still compare mem cgroups as indivisible memory
consumers.
+ When an OOM condition occurs in a mem cgroup with an OOM policy of
+ "cgroup" or "tree", the OOM killer will compare mem cgroups with
+ "cgroup" policy individually with "tree" policy subtrees.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1301,7 +1309,7 @@ out of memory, its memory.oom_policy will dictate how the OOM killer will
select a process, or cgroup, to kill. Likewise, when the system is OOM,
the policy is dictated by the root mem cgroup.
-There are currently two available oom policies:
+There are currently three available oom policies:
- "none": default, choose the largest single memory hogging process to
oom kill, as traditionally the OOM killer has always done.
@@ -1310,6 +1318,9 @@ There are currently two available oom policies:
subtree as an OOM victim and kill at least one process, depending on
memory.oom_group, from it.
+ - "tree": choose the cgroup with the largest memory footprint considering
+ itself and its subtree and kill at least one process.
+
When selecting a cgroup as a victim, the OOM killer will kill the process
with the largest memory footprint. A user can control this behavior by
enabling the per-cgroup memory.oom_group option. If set, it causes the
@@ -1328,8 +1339,8 @@ Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
significant memory footprint may affect OOM victim selection logic.
If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and enabling oom_group
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+and enabling oom_group on an ancestor layer.
IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -69,6 +69,11 @@ enum memcg_oom_policy {
* mem cgroup as an indivisible consumer
*/
MEMCG_OOM_POLICY_CGROUP,
+ /*
+ * Tree cgroup usage for all descendant memcg groups, treating each mem
+ * cgroup and its subtree as an indivisible consumer
+ */
+ MEMCG_OOM_POLICY_TREE,
};
struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2741,7 +2741,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
* The oom_score is calculated for leaf memory cgroups (including
* the root memcg).
- * Non-leaf oom_group cgroups accumulating score of descendant
+ * Cgroups with oom policy of "tree" accumulate the score of descendant
* leaf memory cgroups.
*/
rcu_read_lock();
@@ -2750,10 +2750,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
* We don't consider non-leaf non-oom_group memory cgroups
- * as OOM victims.
+ * without the oom policy of "tree" as OOM victims.
*/
if (memcg_has_children(iter) && iter != root_mem_cgroup &&
- !mem_cgroup_oom_group(iter))
+ !mem_cgroup_oom_group(iter) &&
+ iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
/*
@@ -2816,7 +2817,7 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
else
root = root_mem_cgroup;
- if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+ if (root->oom_policy == MEMCG_OOM_POLICY_NONE)
return false;
select_victim_memcg(root, oc);
@@ -5549,11 +5550,14 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
switch (policy) {
case MEMCG_OOM_POLICY_CGROUP:
- seq_puts(m, "none [cgroup]\n");
+ seq_puts(m, "none [cgroup] tree\n");
+ break;
+ case MEMCG_OOM_POLICY_TREE:
+ seq_puts(m, "none cgroup [tree]\n");
break;
case MEMCG_OOM_POLICY_NONE:
default:
- seq_puts(m, "[none] cgroup\n");
+ seq_puts(m, "[none] cgroup tree\n");
};
return 0;
}
@@ -5569,6 +5573,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
+ else if (!memcmp("tree", buf, min(sizeof("tree")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_TREE;
else
ret = -EINVAL;
There are several downsides to the current implementation that compares
the root mem cgroup with leaf mem cgroups for the cgroup-aware oom killer.
For example, /proc/pid/oom_score_adj is accounted for processes attached
to the root mem cgroup but not leaves. This leads to wild inconsistencies
that unfairly bias for or against the root mem cgroup.
Assume a 728KB bash shell is attached to the root mem cgroup without any
other processes having a non-default /proc/pid/oom_score_adj. At the time
of system oom, the root mem cgroup evaluated to 43,474 pages after boot.
If the bash shell adjusts its /proc/self/oom_score_adj to 1000, however,
the root mem cgroup evaluates to 24,765,482 pages. It would take a
cgroup 95GB of memory to outweigh the root mem cgroup's evaluation.
The reverse is even more confusing: if the bash shell adjusts its
/proc/self/oom_score_adj to -999, the root mem cgroup evaluates to 42,268
pages, a basically meaningless transformation.
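For reference, oom_badness() adds roughly oom_score_adj * totalpages /
1000 to a process's score, so an adj of 1000 adds the entire totalpages
value. Sanity checking the numbers above (totalpages of 24,722,008 pages,
about 94GB, is an assumption inferred from them):

  $ echo $((43474 + (1000 * 24722008) / 1000))
  24765482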
/proc/pid/oom_score_adj is discounted, however, for processes attached to
leaf mem cgroups. If a sole process using 250MB of memory is attached to
a mem cgroup, it evaluates to 250MB >> PAGE_SHIFT. If its
/proc/pid/oom_score_adj is changed to -999, or even 1000, the evaluation
remains the same for the mem cgroup.
The heuristic that is used for the root mem cgroup also differs from leaf
mem cgroups.
For the root mem cgroup, the evaluation is the sum of every attached
process's /proc/pid/oom_score. Besides factoring in oom_score_adj, it is
based on
the sum of rss + swap + page tables for all processes attached to it.
For leaf mem cgroups, it is based on the amount of anonymous or
unevictable memory + unreclaimable slab + kernel stack + sock + swap.
There's also an exemption for root mem cgroup processes that do not
intersect the allocating process's mems_allowed. Because the current
heuristic is based on oom_badness(), the evaluation of the root mem
cgroup disregards all processes attached to it that have disjoint
mems_allowed, making oom selection specifically dependent on the
allocating process for system oom conditions!
This patch introduces completely fair comparison between the root mem
cgroup and leaf mem cgroups. It compares them with the same heuristic
and does not prefer one over the other. It disregards oom_score_adj
as the cgroup-aware oom killer should, if enabled by memory.oom_policy.
The goal is to target the most memory consuming cgroup on the system,
not consider per-process adjustment.
The fact that the evaluation of all mem cgroups depends on the mempolicy
of the allocating process, which is completely undocumented for the
cgroup-aware oom killer, will be addressed in a subsequent patch.
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 7 +-
mm/memcontrol.c | 149 ++++++++++++++++++------------------
2 files changed, 76 insertions(+), 80 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1328,12 +1328,7 @@ OOM killer to kill all processes attached to the cgroup, except processes
with /proc/pid/oom_score_adj set to -1000 (oom disabled).
The root cgroup is treated as a leaf memory cgroup as well, so it is
-compared with other leaf memory cgroups. Due to internal implementation
-restrictions the size of the root cgroup is the cumulative sum of
-oom_badness of all its tasks (in other words oom_score_adj of each task
-is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
-lead to over- or underestimation of the root cgroup consumption and it is
-therefore discouraged. This might change in the future, however.
+compared with other leaf memory cgroups.
Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,8 @@ int do_swap_account __read_mostly;
#define do_swap_account 0
#endif
+static atomic_long_t total_sock_pages;
+
/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
@@ -2607,9 +2609,9 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
}
static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
{
+ const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
pg_data_t *pgdat;
@@ -2618,92 +2620,65 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
if (nodemask && !node_isset(nid, *nodemask))
continue;
- points += mem_cgroup_node_nr_lru_pages(memcg, nid,
- LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
-
pgdat = NODE_DATA(nid);
- points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
- NR_SLAB_UNRECLAIMABLE);
+ if (is_root_memcg) {
+ points += node_page_state(pgdat, NR_ACTIVE_ANON) +
+ node_page_state(pgdat, NR_INACTIVE_ANON);
+ points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+ } else {
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
}
- points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
- (PAGE_SIZE / 1024);
- points += memcg_page_state(memcg, MEMCG_SOCK);
- points += memcg_page_state(memcg, MEMCG_SWAP);
-
+ if (is_root_memcg) {
+ points += global_zone_page_state(NR_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += atomic_long_read(&total_sock_pages);
+ points += total_swap_pages - atomic_long_read(&nr_swap_pages);
+ } else {
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+ }
return points;
}
/*
- * Checks if the given memcg is a valid OOM victim and returns a number,
- * which means the folowing:
- * -1: there are inflight OOM victim tasks, belonging to the memcg
- * 0: memcg is not eligible, e.g. all belonging tasks are protected
- * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * Checks if the given non-root memcg has a valid OOM victim and returns a
+ * number, which means the following:
+ * -1: there is an inflight OOM victim process attached to the memcg
+ * 0: memcg is not eligible because all tasks attached are unkillable
+ * (kthreads or oom_score_adj set to OOM_SCORE_ADJ_MIN)
* >0: memcg is eligible, and the returned value is an estimation
* of the memory footprint
*/
static long oom_evaluate_memcg(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
{
struct css_task_iter it;
struct task_struct *task;
int eligible = 0;
/*
- * Root memory cgroup is a special case:
- * we don't have necessary stats to evaluate it exactly as
- * leaf memory cgroups, so we approximate it's oom_score
- * by summing oom_score of all belonging tasks, which are
- * owners of their mm structs.
- *
- * If there are inflight OOM victim tasks inside
- * the root memcg, we return -1.
- */
- if (memcg == root_mem_cgroup) {
- struct css_task_iter it;
- struct task_struct *task;
- long score = 0;
-
- css_task_iter_start(&memcg->css, 0, &it);
- while ((task = css_task_iter_next(&it))) {
- if (tsk_is_oom_victim(task) &&
- !test_bit(MMF_OOM_SKIP,
- &task->signal->oom_mm->flags)) {
- score = -1;
- break;
- }
-
- task_lock(task);
- if (!task->mm || task->mm->owner != task) {
- task_unlock(task);
- continue;
- }
- task_unlock(task);
-
- score += oom_badness(task, memcg, nodemask,
- totalpages);
- }
- css_task_iter_end(&it);
-
- return score;
- }
-
- /*
- * Memcg is OOM eligible if there are OOM killable tasks inside.
- *
- * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
- * as unkillable.
- *
- * If there are inflight OOM victim tasks inside the memcg,
- * we return -1.
+ * Memcg is eligible for oom kill if at least one process is eligible
+ * to be killed. Processes with oom_score_adj of OOM_SCORE_ADJ_MIN
+ * are unkillable.
*/
css_task_iter_start(&memcg->css, 0, &it);
while ((task = css_task_iter_next(&it))) {
+ task_lock(task);
+ if (!task->mm || task != task->mm->owner) {
+ task_unlock(task);
+ continue;
+ }
if (!eligible &&
task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
eligible = 1;
+ task_unlock(task);
if (tsk_is_oom_victim(task) &&
!test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
@@ -2716,13 +2691,14 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
- return memcg_oom_badness(memcg, nodemask, totalpages);
+ return memcg_oom_badness(memcg, nodemask);
}
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
{
struct mem_cgroup *iter, *group = NULL;
long group_score = 0;
+ long leaf_score = 0;
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
@@ -2748,12 +2724,18 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
for_each_mem_cgroup_tree(iter, root) {
long score;
+ /*
+ * Root memory cgroup will be considered after iteration,
+ * if eligible.
+ */
+ if (iter == root_mem_cgroup)
+ continue;
+
/*
* We don't consider non-leaf non-oom_group memory cgroups
* without the oom policy of "tree" as OOM victims.
*/
- if (memcg_has_children(iter) && iter != root_mem_cgroup &&
- !mem_cgroup_oom_group(iter) &&
+ if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter) &&
iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
@@ -2761,16 +2743,15 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
* If group is not set or we've ran out of the group's sub-tree,
* we should set group and reset group_score.
*/
- if (!group || group == root_mem_cgroup ||
- !mem_cgroup_is_descendant(iter, group)) {
+ if (!group || !mem_cgroup_is_descendant(iter, group)) {
group = iter;
group_score = 0;
}
- if (memcg_has_children(iter) && iter != root_mem_cgroup)
+ if (memcg_has_children(iter))
continue;
- score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+ score = oom_evaluate_memcg(iter, oc->nodemask);
/*
* Ignore empty and non-eligible memory cgroups.
@@ -2789,6 +2770,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
}
group_score += score;
+ leaf_score += score;
if (group_score > oc->chosen_points) {
oc->chosen_points = group_score;
@@ -2796,8 +2778,25 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
}
}
- if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
- css_get(&oc->chosen_memcg->css);
+ if (oc->chosen_memcg != INFLIGHT_VICTIM) {
+ if (root == root_mem_cgroup) {
+ group_score = oom_evaluate_memcg(root_mem_cgroup,
+ oc->nodemask);
+ if (group_score > leaf_score) {
+ /*
+ * Discount the sum of all leaf scores to find
+ * root score.
+ */
+ group_score -= leaf_score;
+ if (group_score > oc->chosen_points) {
+ oc->chosen_points = group_score;
+ oc->chosen_memcg = root_mem_cgroup;
+ }
+ }
+ }
+ if (oc->chosen_memcg)
+ css_get(&oc->chosen_memcg->css);
+ }
rcu_read_unlock();
}
@@ -6119,6 +6118,7 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
gfp_mask = GFP_NOWAIT;
mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
+ atomic_long_add(nr_pages, &total_sock_pages);
if (try_charge(memcg, gfp_mask, nr_pages) == 0)
return true;
@@ -6140,6 +6140,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
}
mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
+ atomic_long_add(-nr_pages, &total_sock_pages);
refill_stock(memcg, nr_pages);
}
There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:
(1) allows users to evade the oom killer by creating subcontainers or
using other controllers since scoring is done per cgroup and not
hierarchically,
(2) unfairly compares the root mem cgroup using completely different
criteria than leaf mem cgroups and allows wildly inaccurate results
if oom_score_adj is used, and
(3) does not allow the user to influence the decisionmaking, such that
important subtrees cannot be preferred or biased.
This patchset fixes (1) and (2) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.
Concern (3) could subsequently be addressed either before or after the
cgroup-aware oom killer feature is merged.
It preserves all functionality that currently exists in -mm and extends
it to be generally useful outside of very specialized usecases.
It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.
---
- Rebased to next-20180305
- Fixed issue where total_sock_pages was not being modified
- Changed output of memory.oom_policy to show all available policies
Documentation/cgroup-v2.txt | 100 ++++++++--------
include/linux/cgroup-defs.h | 5 -
include/linux/memcontrol.h | 21 ++++
kernel/cgroup/cgroup.c | 13 +--
mm/memcontrol.c | 221 +++++++++++++++++++++---------------
5 files changed, 204 insertions(+), 156 deletions(-)
Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option, which (1) forces the entire system, perhaps unnecessarily, perhaps
unexpectedly, into a single oom policy that differs from the traditional
per process selection, and (2) requires a remount to change.
Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".
The heuristic used to select a process or cgroup to kill from is
controlled by the oom mem cgroup's memory.oom_policy. This means that if
a descendant mem cgroup has an oom policy of "none", for example, and an
oom condition originates in an ancestor with an oom policy of "cgroup",
the selection logic will treat all descendant cgroups as indivisible
memory consumers.
For example, consider a configuration where each mem cgroup has "memory" set
in cgroup.controllers:
mem cgroup cgroup.procs
========== ============
/cg1 1 process consuming 250MB
/cg2 3 processes consuming 100MB each
/cg3/cg31 2 processes consuming 100MB each
/cg3/cg32 2 processes consuming 100MB each
If the root mem cgroup's memory.oom_policy is "none", the process from
/cg1 is chosen as the victim. If memory.oom_policy is "cgroup", a process
from /cg2 is chosen because it is in the single indivisible memory
consumer with the greatest usage. This policy of "cgroup" is identical
to the current "groupoom" mount option, now removed.
Note that /cg3 is not the chosen victim when the oom mem cgroup policy is
"cgroup" because cgroups are treated individually without regard to
hierarchical /cg3/memory.current usage. This will be addressed in a
follow-up patch.
This has the added benefit of allowing descendant cgroups to control their
own oom policies if they have memory.oom_policy file permissions without
being restricted to the system-wide policy. In the above example, /cg2
and /cg3 can be either "none" or "cgroup" with the same results: the
selection heuristic depends only on the policy of the oom mem cgroup. If
/cg2 or /cg3 themselves are oom, however, the policy is controlled by
their own oom policies, either process aware or cgroup aware.
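The mount option is thus replaced by a single write (assuming a cgroup v2
mount at /sys/fs/cgroup):

  # previously: mount -o remount,groupoom $MOUNT_POINT
  echo cgroup > /sys/fs/cgroup/memory.oom_policy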
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 78 +++++++++++++++++++------------------
include/linux/cgroup-defs.h | 5 ---
include/linux/memcontrol.h | 5 +++
kernel/cgroup/cgroup.c | 13 +------
mm/memcontrol.c | 19 +++++----
5 files changed, 56 insertions(+), 64 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1076,6 +1076,17 @@ PAGE_SIZE multiple when read back.
Documentation/filesystems/proc.txt). This is the same policy as if
memory cgroups were not even mounted.
+ If "cgroup", the OOM killer will compare mem cgroups as indivisible
+ memory consumers; that is, they will compare mem cgroup usage rather
+ than process memory footprint. See the "OOM Killer" section below.
+
+ When an OOM condition occurs, the policy is dictated by the mem
+ cgroup that is OOM (the root mem cgroup for a system-wide OOM
+ condition). If a descendant mem cgroup has a policy of "none", for
+ example, for an OOM condition in a mem cgroup with policy "cgroup",
+ the heuristic will still compare mem cgroups as indivisible memory
+ consumers.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1282,43 +1293,36 @@ belonging to the affected files to ensure correct memory ownership.
OOM Killer
~~~~~~~~~~
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
-
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
-
- # mount -o remount,groupoom $MOUNT_POINT
-
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
-
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
-
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
-
-Leaf cgroups and cgroups with oom_group option set are compared based
-on their cumulative memory usage. The root cgroup is treated as a
-leaf memory cgroup as well, so it is compared with other leaf memory
-cgroups. Due to internal implementation restrictions the size of
-the root cgroup is the cumulative sum of oom_badness of all its tasks
-(in other words oom_score_adj of each task is obeyed). Relying on
-oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can lead to over- or
-underestimation of the root cgroup consumption and it is therefore
-discouraged. This might change in the future, however.
-
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
+
+This policy is controlled by memory.oom_policy. When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill. Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
+
+There are currently two available oom policies:
+
+ - "none": default, choose the largest single memory hogging process to
+ oom kill, as traditionally the OOM killer has always done.
+
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+ subtree as an OOM victim and kill at least one process, depending on
+ memory.oom_group, from it.
+
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint. A user can control this behavior by
+enabling the per-cgroup memory.oom_group option. If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+
+The root cgroup is treated as a leaf memory cgroup as well, so it is
+compared with other leaf memory cgroups. Due to internal implementation
+restrictions the size of the root cgroup is the cumulative sum of
+oom_badness of all its tasks (in other words oom_score_adj of each task
+is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
+lead to over- or underestimation of the root cgroup consumption and it is
+therefore discouraged. This might change in the future, however.
Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
* Enable cpuset controller in v1 cgroup to use v2 behavior.
*/
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
- /*
- * Enable cgroup-aware OOM killer.
- */
- CGRP_GROUP_OOM = (1 << 5),
};
/* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -64,6 +64,11 @@ enum memcg_oom_policy {
* oom_badness()
*/
MEMCG_OOM_POLICY_NONE,
+ /*
+ * Local cgroup usage is used to select a target cgroup, treating each
+ * mem cgroup as an indivisible consumer
+ */
+ MEMCG_OOM_POLICY_CGROUP,
};
struct mem_cgroup_reclaim_cookie {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1732,9 +1732,6 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
- } else if (!strcmp(token, "groupoom")) {
- *root_flags |= CGRP_GROUP_OOM;
- continue;
}
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1751,11 +1748,6 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
-
- if (root_flags & CGRP_GROUP_OOM)
- cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
- else
- cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
}
@@ -1763,8 +1755,6 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
{
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
- if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
- seq_puts(seq, ",groupoom");
return 0;
}
@@ -5932,8 +5922,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
- "groupoom\n");
+ return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2811,14 +2811,14 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return false;
-
if (oc->memcg)
root = oc->memcg;
else
root = root_mem_cgroup;
+ if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+ return false;
+
select_victim_memcg(root, oc);
return oc->chosen_memcg;
@@ -5425,9 +5425,6 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
bool oom_group = memcg->oom_group;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return -ENOTSUPP;
-
seq_printf(m, "%d\n", oom_group);
return 0;
@@ -5441,9 +5438,6 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
int oom_group;
int err;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return -ENOTSUPP;
-
err = kstrtoint(strstrip(buf), 0, &oom_group);
if (err)
return err;
@@ -5554,9 +5548,12 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
switch (policy) {
+ case MEMCG_OOM_POLICY_CGROUP:
+ seq_puts(m, "none [cgroup]\n");
+ break;
case MEMCG_OOM_POLICY_NONE:
default:
- seq_puts(m, "[none]\n");
+ seq_puts(m, "[none] cgroup\n");
};
return 0;
}
@@ -5570,6 +5567,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
buf = strstrip(buf);
if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+ else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
else
ret = -EINVAL;
Hi David,
I love your patch! Yet something to improve:
[auto build test ERROR on next-20180309]
[cannot apply to linus/master v4.16-rc4 v4.16-rc3 v4.16-rc2 v4.16-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/David-Rientjes/rewrite-cgroup-aware-oom-killer-for-general-use/20180318-222124
config: i386-randconfig-x013-201811 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
mm/memcontrol.c: In function 'memcg_oom_badness':
>> mm/memcontrol.c:2640:50: error: 'nr_swap_pages' undeclared (first use in this function); did you mean 'do_swap_page'?
points += total_swap_pages - atomic_long_read(&nr_swap_pages);
^~~~~~~~~~~~~
do_swap_page
mm/memcontrol.c:2640:50: note: each undeclared identifier is reported only once for each function it appears in
vim +2640 mm/memcontrol.c
2610
2611 static long memcg_oom_badness(struct mem_cgroup *memcg,
2612 const nodemask_t *nodemask)
2613 {
2614 const bool is_root_memcg = memcg == root_mem_cgroup;
2615 long points = 0;
2616 int nid;
2617 pg_data_t *pgdat;
2618
2619 for_each_node_state(nid, N_MEMORY) {
2620 if (nodemask && !node_isset(nid, *nodemask))
2621 continue;
2622
2623 pgdat = NODE_DATA(nid);
2624 if (is_root_memcg) {
2625 points += node_page_state(pgdat, NR_ACTIVE_ANON) +
2626 node_page_state(pgdat, NR_INACTIVE_ANON);
2627 points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
2628 } else {
2629 points += mem_cgroup_node_nr_lru_pages(memcg, nid,
2630 LRU_ALL_ANON);
2631 points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
2632 NR_SLAB_UNRECLAIMABLE);
2633 }
2634 }
2635
2636 if (is_root_memcg) {
2637 points += global_zone_page_state(NR_KERNEL_STACK_KB) /
2638 (PAGE_SIZE / 1024);
2639 points += atomic_long_read(&total_sock_pages);
> 2640 points += total_swap_pages - atomic_long_read(&nr_swap_pages);
2641 } else {
2642 points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
2643 (PAGE_SIZE / 1024);
2644 points += memcg_page_state(memcg, MEMCG_SOCK);
2645 points += memcg_page_state(memcg, MEMCG_SWAP);
2646 }
2647 return points;
2648 }
2649
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi David,
I love your patch! Yet something to improve:
[auto build test ERROR on next-20180309]
[cannot apply to linus/master v4.16-rc4 v4.16-rc3 v4.16-rc2 v4.16-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/David-Rientjes/rewrite-cgroup-aware-oom-killer-for-general-use/20180318-222124
config: i386-randconfig-s0-201811 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
mm/memcontrol.c: In function 'memcg_oom_badness':
>> mm/memcontrol.c:2640:50: error: 'nr_swap_pages' undeclared (first use in this function)
points += total_swap_pages - atomic_long_read(&nr_swap_pages);
^~~~~~~~~~~~~
mm/memcontrol.c:2640:50: note: each undeclared identifier is reported only once for each function it appears in
At top level:
mm/memcontrol.c:706:27: warning: 'get_mem_cgroup' defined but not used [-Wunused-function]
static struct mem_cgroup *get_mem_cgroup(struct mem_cgroup *memcg)
^~~~~~~~~~~~~~
vim +/nr_swap_pages +2640 mm/memcontrol.c
There are several downsides to the current implementation that compares
the root mem cgroup with leaf mem cgroups for the cgroup-aware oom killer.
For example, /proc/pid/oom_score_adj is factored into the scores of
processes attached to the root mem cgroup but not those attached to
leaf mem cgroups. This leads to wild inconsistencies that unfairly bias
for or against the root mem cgroup.
Assume a 728KB bash shell is attached to the root mem cgroup without any
other processes having a non-default /proc/pid/oom_score_adj. At the time
of system oom, the root mem cgroup evaluated to 43,474 pages after boot.
If the bash shell adjusts its /proc/self/oom_score_adj to 1000, however,
the root mem cgroup evaluates to 24,765,482 pages. It would take a
cgroup 95GB of memory to outweigh the root mem cgroup's evaluation.
The reverse is even more confusing: if the bash shell adjusts its
/proc/self/oom_score_adj to -999, the root mem cgroup evaluates to 42,268
pages, a basically meaningless transformation.
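These numbers follow from the per-task heuristic: oom_badness() in
mm/oom_kill.c adds oom_score_adj scaled by the allowed pages to each
task's footprint. A simplified sketch of just that term (hypothetical
helper; the clamping and the rss/swap/page table terms are omitted):

	/* contribution of oom_score_adj to a task's score, in pages */
	static long adj_points(long oom_score_adj, unsigned long totalpages)
	{
		return oom_score_adj * (long)totalpages / 1000;
	}

With oom_score_adj == 1000 the shell's score grows by totalpages itself:
24,765,482 - 43,474 = 24,722,008 pages, roughly 94GB with 4KB pages,
which is where the 95GB figure above comes from.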
/proc/pid/oom_score_adj is discounted, however, for processes attached to
leaf mem cgroups. If a sole process using 250MB of memory is attached to
a mem cgroup, it evaluates to 250MB >> PAGE_SHIFT. If its
/proc/pid/oom_score_adj is changed to -999, or even 1000, the evaluation
remains the same for the mem cgroup.
The heuristic that is used for the root mem cgroup also differs from leaf
mem cgroups.
For the root mem cgroup, the evaluation is the sum of each attached
process's /proc/pid/oom_score. Besides factoring in oom_score_adj, this
is based on the sum of rss + swap + page tables for every process
attached to it.
For leaf mem cgroups, it is based on the amount of anonymous or
unevictable memory + unreclaimable slab + kernel stack + sock + swap.
There's also an exemption for root mem cgroup processes that do not
intersect the allocating process's mems_allowed. Because the current
heuristic is based on oom_badness(), the evaluation of the root mem
cgroup disregards all processes attached to it that have disjoint
mems_allowed, making oom selection specifically dependent on the
allocating process for system oom conditions!
This patch introduces completely fair comparison between the root mem
cgroup and leaf mem cgroups. It compares them with the same heuristic
and does not prefer one over the other. It disregards oom_score_adj
as the cgroup-aware oom killer should, if enabled by memory.oom_policy.
The goal is to target the most memory consuming cgroup on the system,
not consider per-process adjustment.
The fact that the evaluation of all mem cgroups depends on the mempolicy
of the allocating process, which is completely undocumented for the
cgroup-aware oom killer, will be addressed in a subsequent patch.
Signed-off-by: David Rientjes <[email protected]>
---
Fixed build error for !CONFIG_SWAP per kbuild test robot, the rest of
the series should still apply cleanly
Documentation/cgroup-v2.txt | 7 +-
mm/memcontrol.c | 149 ++++++++++++++++++------------------
2 files changed, 76 insertions(+), 80 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1328,12 +1328,7 @@ OOM killer to kill all processes attached to the cgroup, except processes
with /proc/pid/oom_score_adj set to -1000 (oom disabled).
The root cgroup is treated as a leaf memory cgroup as well, so it is
-compared with other leaf memory cgroups. Due to internal implementation
-restrictions the size of the root cgroup is the cumulative sum of
-oom_badness of all its tasks (in other words oom_score_adj of each task
-is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
-lead to over- or underestimation of the root cgroup consumption and it is
-therefore discouraged. This might change in the future, however.
+compared with other leaf memory cgroups.
Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,8 @@ int do_swap_account __read_mostly;
#define do_swap_account 0
#endif
+static atomic_long_t total_sock_pages;
+
/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
@@ -2607,9 +2609,9 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
}
static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
{
+ const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
pg_data_t *pgdat;
@@ -2618,92 +2620,65 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
if (nodemask && !node_isset(nid, *nodemask))
continue;
- points += mem_cgroup_node_nr_lru_pages(memcg, nid,
- LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
-
pgdat = NODE_DATA(nid);
- points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
- NR_SLAB_UNRECLAIMABLE);
+ if (is_root_memcg) {
+ points += node_page_state(pgdat, NR_ACTIVE_ANON) +
+ node_page_state(pgdat, NR_INACTIVE_ANON);
+ points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+ } else {
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
}
- points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
- (PAGE_SIZE / 1024);
- points += memcg_page_state(memcg, MEMCG_SOCK);
- points += memcg_page_state(memcg, MEMCG_SWAP);
-
+ if (is_root_memcg) {
+ points += global_zone_page_state(NR_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += atomic_long_read(&total_sock_pages);
+ points += total_swap_pages - get_nr_swap_pages();
+ } else {
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+ }
return points;
}
/*
- * Checks if the given memcg is a valid OOM victim and returns a number,
- * which means the folowing:
- * -1: there are inflight OOM victim tasks, belonging to the memcg
- * 0: memcg is not eligible, e.g. all belonging tasks are protected
- * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * Checks if the given non-root memcg has a valid OOM victim and returns a
+ * number, which means the following:
+ * -1: there is an inflight OOM victim process attached to the memcg
+ * 0: memcg is not eligible because all tasks attached are unkillable
+ * (kthreads or oom_score_adj set to OOM_SCORE_ADJ_MIN)
* >0: memcg is eligible, and the returned value is an estimation
* of the memory footprint
*/
static long oom_evaluate_memcg(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
{
struct css_task_iter it;
struct task_struct *task;
int eligible = 0;
/*
- * Root memory cgroup is a special case:
- * we don't have necessary stats to evaluate it exactly as
- * leaf memory cgroups, so we approximate it's oom_score
- * by summing oom_score of all belonging tasks, which are
- * owners of their mm structs.
- *
- * If there are inflight OOM victim tasks inside
- * the root memcg, we return -1.
- */
- if (memcg == root_mem_cgroup) {
- struct css_task_iter it;
- struct task_struct *task;
- long score = 0;
-
- css_task_iter_start(&memcg->css, 0, &it);
- while ((task = css_task_iter_next(&it))) {
- if (tsk_is_oom_victim(task) &&
- !test_bit(MMF_OOM_SKIP,
- &task->signal->oom_mm->flags)) {
- score = -1;
- break;
- }
-
- task_lock(task);
- if (!task->mm || task->mm->owner != task) {
- task_unlock(task);
- continue;
- }
- task_unlock(task);
-
- score += oom_badness(task, memcg, nodemask,
- totalpages);
- }
- css_task_iter_end(&it);
-
- return score;
- }
-
- /*
- * Memcg is OOM eligible if there are OOM killable tasks inside.
- *
- * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
- * as unkillable.
- *
- * If there are inflight OOM victim tasks inside the memcg,
- * we return -1.
+ * Memcg is eligible for oom kill if at least one process is eligible
+ * to be killed. Processes with oom_score_adj of OOM_SCORE_ADJ_MIN
+ * are unkillable.
*/
css_task_iter_start(&memcg->css, 0, &it);
while ((task = css_task_iter_next(&it))) {
+ task_lock(task);
+ if (!task->mm || task != task->mm->owner) {
+ task_unlock(task);
+ continue;
+ }
if (!eligible &&
task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
eligible = 1;
+ task_unlock(task);
if (tsk_is_oom_victim(task) &&
!test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
@@ -2716,13 +2691,14 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
- return memcg_oom_badness(memcg, nodemask, totalpages);
+ return memcg_oom_badness(memcg, nodemask);
}
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
{
struct mem_cgroup *iter, *group = NULL;
long group_score = 0;
+ long leaf_score = 0;
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
@@ -2748,12 +2724,18 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
for_each_mem_cgroup_tree(iter, root) {
long score;
+ /*
+ * Root memory cgroup will be considered after iteration,
+ * if eligible.
+ */
+ if (iter == root_mem_cgroup)
+ continue;
+
/*
* We don't consider non-leaf non-oom_group memory cgroups
* without the oom policy of "tree" as OOM victims.
*/
- if (memcg_has_children(iter) && iter != root_mem_cgroup &&
- !mem_cgroup_oom_group(iter) &&
+ if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter) &&
iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
@@ -2761,16 +2743,15 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
* If group is not set or we've ran out of the group's sub-tree,
* we should set group and reset group_score.
*/
- if (!group || group == root_mem_cgroup ||
- !mem_cgroup_is_descendant(iter, group)) {
+ if (!group || !mem_cgroup_is_descendant(iter, group)) {
group = iter;
group_score = 0;
}
- if (memcg_has_children(iter) && iter != root_mem_cgroup)
+ if (memcg_has_children(iter))
continue;
- score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+ score = oom_evaluate_memcg(iter, oc->nodemask);
/*
* Ignore empty and non-eligible memory cgroups.
@@ -2789,6 +2770,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
}
group_score += score;
+ leaf_score += score;
if (group_score > oc->chosen_points) {
oc->chosen_points = group_score;
@@ -2796,8 +2778,25 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
}
}
- if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
- css_get(&oc->chosen_memcg->css);
+ if (oc->chosen_memcg != INFLIGHT_VICTIM) {
+ if (root == root_mem_cgroup) {
+ group_score = oom_evaluate_memcg(root_mem_cgroup,
+ oc->nodemask);
+ if (group_score > leaf_score) {
+ /*
+ * Discount the sum of all leaf scores to find
+ * root score.
+ */
+ group_score -= leaf_score;
+ if (group_score > oc->chosen_points) {
+ oc->chosen_points = group_score;
+ oc->chosen_memcg = root_mem_cgroup;
+ }
+ }
+ }
+ if (oc->chosen_memcg)
+ css_get(&oc->chosen_memcg->css);
+ }
rcu_read_unlock();
}
@@ -6119,6 +6118,7 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
gfp_mask = GFP_NOWAIT;
mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
+ atomic_long_add(nr_pages, &total_sock_pages);
if (try_charge(memcg, gfp_mask, nr_pages) == 0)
return true;
@@ -6140,6 +6140,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
}
mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
+ atomic_long_add(-nr_pages, &total_sock_pages);
refill_stock(memcg, nr_pages);
}
With the current implementation of the cgroup-aware oom killer,
memory.oom_group defines two behaviors:
- consider the footprint of the "group" consisting of the mem cgroup
itself and all descendants for comparison with other cgroups, and
- when selected as the victim mem cgroup, kill all processes attached to
it and its descendants that are eligible to be killed.
Now that the memory.oom_policy of "tree" considers the memory footprint of
the mem cgroup and all its descendants, separate the memory.oom_group
setting from the selection criteria.
Now, memory.oom_group only controls the kill scope: when set to "1", all
eligible processes attached to the victim mem cgroup and its descendants
are oom killed; when set to "0", only the single largest memory consuming
process attached to the victim mem cgroup and its descendants is killed.
This is generally regarded as a property of the workload attached to the
subtree: it depends on whether the workload can continue running and be
useful if a single process is oom killed or whether it's better to kill
all attached processes.
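As a concrete sketch of the separation (hypothetical paths, assuming
cgroup2 is mounted at /sys/fs/cgroup), selection and kill scope are now
tuned independently:

  # echo tree > /sys/fs/cgroup/cg3/memory.oom_policy
  # echo 0 > /sys/fs/cgroup/cg3/memory.oom_group

Here /cg3 and its subtree are compared as one indivisible consumer, but
if selected, only the largest eligible process in the subtree is killed;
writing "1" to memory.oom_group instead kills them all.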
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 21 ++++-----------------
mm/memcontrol.c | 8 ++++----
2 files changed, 8 insertions(+), 21 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1045,25 +1045,12 @@ PAGE_SIZE multiple when read back.
A read-write single value file which exists on non-root
cgroups. The default is "0".
- If set, OOM killer will consider the memory cgroup as an
- indivisible memory consumers and compare it with other memory
- consumers by it's memory footprint.
- If such memory cgroup is selected as an OOM victim, all
- processes belonging to it or it's descendants will be killed.
+ If such memory cgroup is selected as an OOM victim, all processes
+ attached to it and its descendants that are eligible for oom kill
+ (their /proc/pid/oom_score_adj is not oom disabled) will be killed.
This applies to system-wide OOM conditions and reaching
the hard memory limit of the cgroup and their ancestor.
- If OOM condition happens in a descendant cgroup with it's own
- memory limit, the memory cgroup can't be considered
- as an OOM victim, and OOM killer will not kill all belonging
- tasks.
-
- Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
- and will never kill the unkillable task, even if memory.oom_group
- is set.
-
- If cgroup-aware OOM killer is not enabled, ENOTSUPP error
- is returned on attempt to access the file.
memory.oom_policy
@@ -1325,7 +1312,7 @@ When selecting a cgroup as a victim, the OOM killer will kill the process
with the largest memory footprint. A user can control this behavior by
enabling the per-cgroup memory.oom_group option. If set, it causes the
OOM killer to kill all processes attached to the cgroup, except processes
-with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+with /proc/pid/oom_score_adj set to OOM_SCORE_ADJ_MIN.
The root cgroup is treated as a leaf memory cgroup as well, so it is
compared with other leaf memory cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2732,11 +2732,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
continue;
/*
- * We don't consider non-leaf non-oom_group memory cgroups
- * without the oom policy of "tree" as OOM victims.
+ * We don't consider non-leaf memory cgroups without the oom
+ * policy of "tree" as OOM victims.
*/
- if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter) &&
- iter->oom_policy != MEMCG_OOM_POLICY_TREE)
+ if (iter->oom_policy != MEMCG_OOM_POLICY_TREE &&
+ memcg_has_children(iter))
continue;
/*
The cgroup-aware oom killer currently considers the set of allowed nodes
for the allocation that triggers the oom killer and discounts usage from
disallowed nodes when comparing cgroups.
If a cgroup has both the cpuset and memory controllers enabled, it may be
possible to restrict allocations to a subset of nodes, for example. Some
latency sensitive users use cpusets to allocate only local memory, almost
to the point of oom even though there is an abundance of available free
memory on other nodes.
The same is true for processes that mbind(2) their memory to a set of
allowed nodes.
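For the mbind(2) case, a minimal userspace sketch (hypothetical example;
mbind() is declared in <numaif.h>, link with -lnuma):

	#include <numaif.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1UL << 30;			/* 1GB mapping */
		unsigned long nodes = 1UL << 0;		/* allow node 0 only */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* faults on this vma must now be satisfied from node 0 */
		mbind(p, len, MPOL_BIND, &nodes, sizeof(nodes) * 8, 0);
		return 0;
	}

Faulting in such a mapping can oom node 0 and trigger victim selection
even while every other node has an abundance of free memory.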
This yields very inconsistent results by considering usage from each mem
cgroup (and perhaps its subtree) for the allocation's set of allowed nodes
for its mempolicy. Allocating a single page for a vma that is mbind to a
now-oom node can cause a cgroup that is restricted to that node by its
cpuset controller to be oom killed when other cgroups may have much higher
overall usage.
The cgroup-aware oom killer is described as killing the largest memory
consuming cgroup (or subtree) without mentioning the mempolicy of the
allocation. For now, discount it. It would be possible to add an
additional oom policy for NUMA awareness if it would be generally useful
later with the extensible interface.
Signed-off-by: David Rientjes <[email protected]>
---
mm/memcontrol.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2608,19 +2608,15 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}
-static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask)
+static long memcg_oom_badness(struct mem_cgroup *memcg)
{
const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
- pg_data_t *pgdat;
for_each_node_state(nid, N_MEMORY) {
- if (nodemask && !node_isset(nid, *nodemask))
- continue;
+ pg_data_t *pgdat = NODE_DATA(nid);
- pgdat = NODE_DATA(nid);
if (is_root_memcg) {
points += node_page_state(pgdat, NR_ACTIVE_ANON) +
node_page_state(pgdat, NR_INACTIVE_ANON);
@@ -2656,8 +2652,7 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
* >0: memcg is eligible, and the returned value is an estimation
* of the memory footprint
*/
-static long oom_evaluate_memcg(struct mem_cgroup *memcg,
- const nodemask_t *nodemask)
+static long oom_evaluate_memcg(struct mem_cgroup *memcg)
{
struct css_task_iter it;
struct task_struct *task;
@@ -2691,7 +2686,7 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
- return memcg_oom_badness(memcg, nodemask);
+ return memcg_oom_badness(memcg);
}
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
@@ -2751,7 +2746,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (memcg_has_children(iter))
continue;
- score = oom_evaluate_memcg(iter, oc->nodemask);
+ score = oom_evaluate_memcg(iter);
/*
* Ignore empty and non-eligible memory cgroups.
@@ -2780,8 +2775,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (oc->chosen_memcg != INFLIGHT_VICTIM) {
if (root == root_mem_cgroup) {
- group_score = oom_evaluate_memcg(root_mem_cgroup,
- oc->nodemask);
+ group_score = oom_evaluate_memcg(root_mem_cgroup);
if (group_score > leaf_score) {
/*
* Discount the sum of all leaf scores to find
rewrite cgroup aware oom killer for general use
There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:
(1) allows users to evade the oom killer by creating subcontainers or
using other controllers since scoring is done per cgroup and not
hierarchically,
(2) unfairly compares the root mem cgroup using completely different
criteria than leaf mem cgroups and allows wildly inaccurate results
if oom_score_adj is used, and
(3) does not allow the user to influence the decisionmaking, such that
important subtrees cannot be preferred or biased.
This patchset fixes (1) and (2) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.
Concern (3) could subsequently be addressed either before or after the
cgroup-aware oom killer feature is merged.
It preserves all functionality that currently exists in -mm and extends
it to be generally useful outside of very specialized usecases.
It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.
---
- Rebased to next-20180322
- Fixed get_nr_swap_pages() build bug found by kbuild test robot
Documentation/cgroup-v2.txt | 100 ++++++++--------
include/linux/cgroup-defs.h | 5 -
include/linux/memcontrol.h | 21 ++++
kernel/cgroup/cgroup.c | 13 +--
mm/memcontrol.c | 221 +++++++++++++++++++++---------------
5 files changed, 204 insertions(+), 156 deletions(-)
The cgroup aware oom killer is needlessly enforced for the entire system
by a mount option. It's unnecessary to force the system into a single
oom policy: either cgroup aware, or the traditional process aware.
This patch introduces a memory.oom_policy tunable for all mem cgroups.
It is currently a no-op: it can only be set to "none", which is its
default policy. It will be expanded in the next patch to define cgroup
aware oom killer behavior for its subtree.
This is an extensible interface that can be used to define cgroup aware
assessment of mem cgroup subtrees or the traditional process aware
assessment.
Reading memory.oom_policy lists the available policies, with the
currently enforced policy bracketed.
Another benefit of such an approach is that an admin can lock in a
certain policy for the system or for a mem cgroup subtree, or can
delegate the policy decision to the user: the user then determines
whether the kill should originate from a subcontainer, treating
subcontainers as indivisible memory consumers themselves, or whether
selection should be done per process.
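A hypothetical session with only this patch applied (cgroup2 mounted at
/sys/fs/cgroup):

  # cat /sys/fs/cgroup/memory.oom_policy
  [none]
  # echo none > /sys/fs/cgroup/memory.oom_policy
  # echo cgroup > /sys/fs/cgroup/memory.oom_policy
  -bash: echo: write error: Invalid argument

"cgroup" only becomes a valid policy with the next patch.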
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 11 +++++++++++
include/linux/memcontrol.h | 11 +++++++++++
mm/memcontrol.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1065,6 +1065,17 @@ PAGE_SIZE multiple when read back.
If cgroup-aware OOM killer is not enabled, ENOTSUPP error
is returned on attempt to access the file.
+ memory.oom_policy
+
+ A read-write single string file which exists on all cgroups. The
+ default value is "none".
+
+ If "none", the OOM killer will use the default policy to choose a
+ victim; that is, it will choose the single process with the largest
+ memory footprint adjusted by /proc/pid/oom_score_adj (see
+ Documentation/filesystems/proc.txt). This is the same policy as if
+ memory cgroups were not even mounted.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -58,6 +58,14 @@ enum memcg_event_item {
MEMCG_NR_EVENTS,
};
+enum memcg_oom_policy {
+ /*
+ * No special oom policy, process selection is determined by
+ * oom_badness()
+ */
+ MEMCG_OOM_POLICY_NONE,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -203,6 +211,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /* OOM policy for this subtree */
+ enum memcg_oom_policy oom_policy;
+
/*
* Treat the sub-tree as an indivisible memory consumer,
* kill all belonging tasks if the memory cgroup selected
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4430,6 +4430,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent) {
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
+ memcg->oom_policy = parent->oom_policy;
}
if (parent && parent->use_hierarchy) {
memcg->use_hierarchy = true;
@@ -5547,6 +5548,34 @@ static int memory_stat_show(struct seq_file *m, void *v)
return 0;
}
+static int memory_oom_policy_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
+
+ switch (policy) {
+ case MEMCG_OOM_POLICY_NONE:
+ default:
+ seq_puts(m, "[none]\n");
+ };
+ return 0;
+}
+
+static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ ssize_t ret = nbytes;
+
+ buf = strstrip(buf);
+ if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+ else
+ ret = -EINVAL;
+
+ return ret;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -5588,6 +5617,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
+ {
+ .name = "oom_policy",
+ .flags = CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_oom_policy_show,
+ .write = memory_oom_policy_write,
+ },
{ } /* terminate */
};
One of the three significant concerns brought up about the cgroup aware
oom killer is that its decisionmaking is completely evaded by creating
subcontainers and attaching processes such that the ancestor's usage does
not exceed another cgroup on the system.
Consider the example from the previous patch where "memory" is set in
each mem cgroup's cgroup.controllers:
mem cgroup cgroup.procs
========== ============
/cg1 1 process consuming 250MB
/cg2 3 processes consuming 100MB each
/cg3/cg31 2 processes consuming 100MB each
/cg3/cg32 2 processes consuming 100MB each
If memory.oom_policy is "cgroup", a process from /cg2 is chosen because it
is in the single indivisible memory consumer with the greatest usage.
The true usage of /cg3 is actually 400MB, but a process from /cg2 is
chosen because cgroups are compared individually rather than
hierarchically.
If a system is divided into two users, for example:
mem cgroup memory.max
========== ==========
/userA 250MB
/userB 250MB
If /userA runs all processes attached to the local mem cgroup, whereas
/userB distributes their processes over a set of subcontainers under
/userB, /userA will be unfairly penalized.
There is incentive with cgroup v2 to distribute processes over a set of
subcontainers if those processes shall be constrained by other cgroup
controllers; this is a direct result of mandating a single, unified
hierarchy for cgroups. A user may also reasonably do this for mem cgroup
control or statistics. And, a user may do this to evade the cgroup-aware
oom killer selection logic.
This patch adds an oom policy, "tree", that accounts for hierarchical
usage when comparing cgroups and the cgroup aware oom killer is enabled by
an ancestor. This allows administrators, for example, to require users in
their own top-level mem cgroup subtree to be accounted for with
hierarchical usage. In other words, they can no longer evade the oom
killer by using other controllers or subcontainers.
If an oom policy of "tree" is in place for a subtree, such as /cg3 above,
the hierarchical usage is used for comparisons with other cgroups if
either "cgroup" or "tree" is the oom policy of the oom mem cgroup. Thus,
if /cg3/memory.oom_policy is "tree", one of the processes from /cg3's
subcontainers is chosen for oom kill.
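In terms of the hierarchy above, that setup is (hypothetical paths,
cgroup2 mounted at /sys/fs/cgroup):

  # echo cgroup > /sys/fs/cgroup/memory.oom_policy
  # echo tree > /sys/fs/cgroup/cg3/memory.oom_policy

/cg3 is then evaluated at its hierarchical 400MB against 300MB for /cg2
and 250MB for /cg1, so the victim is chosen from /cg3's subcontainers.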
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 17 ++++++++++++++---
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 18 ++++++++++++------
3 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1080,6 +1080,10 @@ PAGE_SIZE multiple when read back.
memory consumers; that is, they will compare mem cgroup usage rather
than process memory footprint. See the "OOM Killer" section below.
+ If "tree", the OOM killer will compare mem cgroups and its subtree
+ as a single indivisible memory consumer. This policy cannot be set
+ on the root mem cgroup. See the "OOM Killer" section below.
+
When an OOM condition occurs, the policy is dictated by the mem
cgroup that is OOM (the root mem cgroup for a system-wide OOM
condition). If a descendant mem cgroup has a policy of "none", for
@@ -1087,6 +1091,10 @@ PAGE_SIZE multiple when read back.
the heuristic will still compare mem cgroups as indivisible memory
consumers.
+ When an OOM condition occurs in a mem cgroup with an OOM policy of
+ "cgroup" or "tree", the OOM killer will compare mem cgroups with
+ "cgroup" policy individually with "tree" policy subtrees.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1301,7 +1309,7 @@ out of memory, its memory.oom_policy will dictate how the OOM killer will
select a process, or cgroup, to kill. Likewise, when the system is OOM,
the policy is dictated by the root mem cgroup.
-There are currently two available oom policies:
+There are currently three available oom policies:
- "none": default, choose the largest single memory hogging process to
oom kill, as traditionally the OOM killer has always done.
@@ -1310,6 +1318,9 @@ There are currently two available oom policies:
subtree as an OOM victim and kill at least one process, depending on
memory.oom_group, from it.
+ - "tree": choose the cgroup with the largest memory footprint considering
+ itself and its subtree and kill at least one process.
+
When selecting a cgroup as a victim, the OOM killer will kill the process
with the largest memory footprint. A user can control this behavior by
enabling the per-cgroup memory.oom_group option. If set, it causes the
@@ -1328,8 +1339,8 @@ Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
significant memory footprint may affect OOM victim selection logic.
If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and enabling oom_group
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+and enabling oom_group on an ancestor layer.
IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -69,6 +69,11 @@ enum memcg_oom_policy {
* mem cgroup as an indivisible consumer
*/
MEMCG_OOM_POLICY_CGROUP,
+ /*
+ * Tree cgroup usage for all descendant memcg groups, treating each mem
+ * cgroup and its subtree as an indivisible consumer
+ */
+ MEMCG_OOM_POLICY_TREE,
};
struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2741,7 +2741,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
* The oom_score is calculated for leaf memory cgroups (including
* the root memcg).
- * Non-leaf oom_group cgroups accumulating score of descendant
+ * Cgroups with oom policy of "tree" accumulate the score of descendant
* leaf memory cgroups.
*/
rcu_read_lock();
@@ -2750,10 +2750,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
* We don't consider non-leaf non-oom_group memory cgroups
- * as OOM victims.
+ * without the oom policy of "tree" as OOM victims.
*/
if (memcg_has_children(iter) && iter != root_mem_cgroup &&
- !mem_cgroup_oom_group(iter))
+ !mem_cgroup_oom_group(iter) &&
+ iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
/*
@@ -2816,7 +2817,7 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
else
root = root_mem_cgroup;
- if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+ if (root->oom_policy == MEMCG_OOM_POLICY_NONE)
return false;
select_victim_memcg(root, oc);
@@ -5549,11 +5550,14 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
switch (policy) {
case MEMCG_OOM_POLICY_CGROUP:
- seq_puts(m, "none [cgroup]\n");
+ seq_puts(m, "none [cgroup] tree\n");
+ break;
+ case MEMCG_OOM_POLICY_TREE:
+ seq_puts(m, "none cgroup [tree]\n");
break;
case MEMCG_OOM_POLICY_NONE:
default:
- seq_puts(m, "[none] cgroup\n");
+ seq_puts(m, "[none] cgroup tree\n");
};
return 0;
}
@@ -5569,6 +5573,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
+ else if (!memcmp("tree", buf, min(sizeof("tree")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_TREE;
else
ret = -EINVAL;
Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount to
change.
Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".
The heuristic used to select a process or cgroup to kill from is
controlled by the oom mem cgroup's memory.oom_policy. This means that if
a descendant mem cgroup has an oom policy of "none", for example, and an
oom condition originates in an ancestor with an oom policy of "cgroup",
the selection logic will treat all descendant cgroups as indivisible
memory consumers.
For example, consider an example where each mem cgroup has "memory" set
in cgroup.controllers:
mem cgroup cgroup.procs
========== ============
/cg1 1 process consuming 250MB
/cg2 3 processes consuming 100MB each
/cg3/cg31 2 processes consuming 100MB each
/cg3/cg32 2 processes consuming 100MB each
If the root mem cgroup's memory.oom_policy is "none", the process from
/cg1 is chosen as the victim. If memory.oom_policy is "cgroup", a process
from /cg2 is chosen because it is in the single indivisible memory
consumer with the greatest usage. This policy of "cgroup" is identical
to the current "groupoom" mount option, now removed.
Note that /cg3 is not the chosen victim when the oom mem cgroup policy is
"cgroup" because cgroups are treated individually without regard to
hierarchical /cg3/memory.current usage. This will be addressed in a
follow-up patch.
This has the added benefit of allowing descendant cgroups to control their
own oom policies if they have memory.oom_policy file permissions without
being restricted to the system-wide policy. In the above example, /cg2
and /cg3 can be either "none" or "cgroup" with the same results: the
selection heuristic depends only on the policy of the oom mem cgroup. If
/cg2 or /cg3 themselves are oom, however, the policy is controlled by
their own oom policies, either process aware or cgroup aware.
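The migration away from the mount option is then, as a hypothetical
sketch, that instead of:

  # mount -o remount,groupoom $MOUNT_POINT

an admin now writes:

  # echo cgroup > /sys/fs/cgroup/memory.oom_policy

A delegated subtree such as /cg2 may still set "none" in its own
memory.oom_policy; per the above, that only takes effect for oom
conditions originating within /cg2 itself.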
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroup-v2.txt | 78 +++++++++++++++++++------------------
include/linux/cgroup-defs.h | 5 ---
include/linux/memcontrol.h | 5 +++
kernel/cgroup/cgroup.c | 13 +------
mm/memcontrol.c | 19 +++++----
5 files changed, 56 insertions(+), 64 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1076,6 +1076,17 @@ PAGE_SIZE multiple when read back.
Documentation/filesystems/proc.txt). This is the same policy as if
memory cgroups were not even mounted.
+ If "cgroup", the OOM killer will compare mem cgroups as indivisible
+ memory consumers; that is, they will compare mem cgroup usage rather
+ than process memory footprint. See the "OOM Killer" section below.
+
+ When an OOM condition occurs, the policy is dictated by the mem
+ cgroup that is OOM (the root mem cgroup for a system-wide OOM
+ condition). If a descendant mem cgroup has a policy of "none", for
+ example, for an OOM condition in a mem cgroup with policy "cgroup",
+ the heuristic will still compare mem cgroups as indivisible memory
+ consumers.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1282,43 +1293,36 @@ belonging to the affected files to ensure correct memory ownership.
OOM Killer
~~~~~~~~~~
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
-
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
-
- # mount -o remount,groupoom $MOUNT_POINT
-
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
-
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
-
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
-
-Leaf cgroups and cgroups with oom_group option set are compared based
-on their cumulative memory usage. The root cgroup is treated as a
-leaf memory cgroup as well, so it is compared with other leaf memory
-cgroups. Due to internal implementation restrictions the size of
-the root cgroup is the cumulative sum of oom_badness of all its tasks
-(in other words oom_score_adj of each task is obeyed). Relying on
-oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can lead to over- or
-underestimation of the root cgroup consumption and it is therefore
-discouraged. This might change in the future, however.
-
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
+
+This policy is controlled by memory.oom_policy. When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill. Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
+
+There are currently two available oom policies:
+
+ - "none": default, choose the largest single memory hogging process to
+ oom kill, as traditionally the OOM killer has always done.
+
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+ subtree as an OOM victim and kill at least one process, depending on
+ memory.oom_group, from it.
+
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint. A user can control this behavior by
+enabling the per-cgroup memory.oom_group option. If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+
+The root cgroup is treated as a leaf memory cgroup as well, so it is
+compared with other leaf memory cgroups. Due to internal implementation
+restrictions the size of the root cgroup is the cumulative sum of
+oom_badness of all its tasks (in other words oom_score_adj of each task
+is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
+lead to over- or underestimation of the root cgroup consumption and it is
+therefore discouraged. This might change in the future, however.
Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
* Enable cpuset controller in v1 cgroup to use v2 behavior.
*/
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
- /*
- * Enable cgroup-aware OOM killer.
- */
- CGRP_GROUP_OOM = (1 << 5),
};
/* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -64,6 +64,11 @@ enum memcg_oom_policy {
* oom_badness()
*/
MEMCG_OOM_POLICY_NONE,
+ /*
+ * Local cgroup usage is used to select a target cgroup, treating each
+ * mem cgroup as an indivisible consumer
+ */
+ MEMCG_OOM_POLICY_CGROUP,
};
struct mem_cgroup_reclaim_cookie {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1732,9 +1732,6 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
- } else if (!strcmp(token, "groupoom")) {
- *root_flags |= CGRP_GROUP_OOM;
- continue;
}
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1751,11 +1748,6 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
-
- if (root_flags & CGRP_GROUP_OOM)
- cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
- else
- cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
}
@@ -1763,8 +1755,6 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
{
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
- if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
- seq_puts(seq, ",groupoom");
return 0;
}
@@ -5925,8 +5915,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
- "groupoom\n");
+ return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2811,14 +2811,14 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return false;
-
if (oc->memcg)
root = oc->memcg;
else
root = root_mem_cgroup;
+ if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+ return false;
+
select_victim_memcg(root, oc);
return oc->chosen_memcg;
@@ -5425,9 +5425,6 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
bool oom_group = memcg->oom_group;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return -ENOTSUPP;
-
seq_printf(m, "%d\n", oom_group);
return 0;
@@ -5441,9 +5438,6 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
int oom_group;
int err;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return -ENOTSUPP;
-
err = kstrtoint(strstrip(buf), 0, &oom_group);
if (err)
return err;
@@ -5554,9 +5548,12 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
switch (policy) {
+ case MEMCG_OOM_POLICY_CGROUP:
+ seq_puts(m, "none [cgroup]\n");
+ break;
case MEMCG_OOM_POLICY_NONE:
default:
- seq_puts(m, "[none]\n");
+ seq_puts(m, "[none] cgroup\n");
};
return 0;
}
@@ -5570,6 +5567,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
buf = strstrip(buf);
if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+ else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
else
ret = -EINVAL;
There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:
(1) allows users to evade the oom killer by creating subcontainers or
using other controllers since scoring is done per cgroup and not
hierarchically,
(2) unfairly compares the root mem cgroup using completely different
criteria than leaf mem cgroups and allows wildly inaccurate results
if oom_score_adj is used, and
(3) does not allow the user to influence the decisionmaking, such that
important subtrees cannot be preferred or biased.
This patchset fixes (1) and (2) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.
Concern (3) could subsequently be addressed either before or after the
cgroup-aware oom killer feature is merged.
It preserves all functionality that currently exists in -mm and extends
it to be generally useful outside of very specialized usecases.
It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.
---
v3:
- Rebased to next-20180713
v2:
- Rebased to next-20180322
- Fixed get_nr_swap_pages() build bug found by kbuild test robot
Documentation/admin-guide/cgroup-v2.rst | 100 ++++++-----
include/linux/cgroup-defs.h | 5 -
include/linux/memcontrol.h | 21 +++
kernel/cgroup/cgroup.c | 13 +-
mm/memcontrol.c | 221 ++++++++++++++----------
5 files changed, 204 insertions(+), 156 deletions(-)
The cgroup-aware oom killer currently considers the set of allowed nodes
for the allocation that triggers the oom killer and discounts usage from
disallowed nodes when comparing cgroups.
If a cgroup has both the cpuset and memory controllers enabled, it may be
possible to restrict allocations to a subset of nodes, for example. Some
latency sensitive users use cpusets to allocate only local memory, almost
to the point of oom even though there is an abundance of available free
memory on other nodes.
The same is true for processes that mbind(2) their memory to a set of
allowed nodes.
This yields very inconsistent results because usage from each mem cgroup
(and perhaps its subtree) is considered only for the set of nodes allowed
by the allocation's mempolicy. Allocating a single page for a vma that is
bound by mbind(2) to a
now-oom node can cause a cgroup that is restricted to that node by its
cpuset controller to be oom killed when other cgroups may have much higher
overall usage.
The cgroup-aware oom killer is described as killing the largest memory
consuming cgroup (or subtree) without mentioning the mempolicy of the
allocation. For now, discount it. It would be possible to add an
additional oom policy for NUMA awareness if it would be generally useful
later with the extensible interface.
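As a purely illustrative sketch (cgroup names and paths are examples only),
a latency sensitive job bound to a single node might be set up like:

  # echo 0 > /sys/fs/cgroup/latency/cpuset.mems
  # numactl --membind=0 ./latency-sensitive-app

With the current behavior, an oom condition for such a node 0 allocation
compares all cgroups only by their node 0 usage, even if another cgroup has
far greater overall usage.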
Signed-off-by: David Rientjes <[email protected]>
---
mm/memcontrol.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2819,19 +2819,15 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}
-static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask)
+static long memcg_oom_badness(struct mem_cgroup *memcg)
{
const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
- pg_data_t *pgdat;
for_each_node_state(nid, N_MEMORY) {
- if (nodemask && !node_isset(nid, *nodemask))
- continue;
+ pg_data_t *pgdat = NODE_DATA(nid);
- pgdat = NODE_DATA(nid);
if (is_root_memcg) {
points += node_page_state(pgdat, NR_ACTIVE_ANON) +
node_page_state(pgdat, NR_INACTIVE_ANON);
@@ -2867,8 +2863,7 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
* >0: memcg is eligible, and the returned value is an estimation
* of the memory footprint
*/
-static long oom_evaluate_memcg(struct mem_cgroup *memcg,
- const nodemask_t *nodemask)
+static long oom_evaluate_memcg(struct mem_cgroup *memcg)
{
struct css_task_iter it;
struct task_struct *task;
@@ -2902,7 +2897,7 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
- return memcg_oom_badness(memcg, nodemask);
+ return memcg_oom_badness(memcg);
}
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
@@ -2962,7 +2957,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (memcg_has_children(iter))
continue;
- score = oom_evaluate_memcg(iter, oc->nodemask);
+ score = oom_evaluate_memcg(iter);
/*
* Ignore empty and non-eligible memory cgroups.
@@ -2991,8 +2986,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (oc->chosen_memcg != INFLIGHT_VICTIM) {
if (root == root_mem_cgroup) {
- group_score = oom_evaluate_memcg(root_mem_cgroup,
- oc->nodemask);
+ group_score = oom_evaluate_memcg(root_mem_cgroup);
if (group_score > leaf_score) {
/*
* Discount the sum of all leaf scores to find
With the current implementation of the cgroup-aware oom killer,
memory.oom_group defines two behaviors:
- consider the footprint of the "group" consisting of the mem cgroup
itself and all descendants for comparison with other cgroups, and
- when selected as the victim mem cgroup, kill all processes attached to
it and its descendants that are eligible to be killed.
Now that the memory.oom_policy of "tree" considers the memory footprint of
the mem cgroup and all its descendants, separate the memory.oom_group
setting from the selection criteria.
Now, memory.oom_group only controls whether all processes attached to the
victim mem cgroup and its descendants are oom killed (when set to "1") or
the single largest memory consuming process attached to the victim mem
cgroup and its descendants is killed.
This is generally regarded as a property of the workload attached to the
subtree: it depends on whether the workload can continue running and be
useful if a single process is oom killed or whether it's better to kill
all attached processes.
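A minimal sketch of the resulting behavior, assuming cgroup2 is mounted at
/sys/fs/cgroup and a victim subtree /cg3 as in the earlier examples:

  # echo 1 > /sys/fs/cgroup/cg3/memory.oom_group

now only controls that all eligible processes attached to /cg3 and its
descendants are killed if /cg3 is selected; with "0", only the single
largest memory consuming process in that subtree is killed.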
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 21 ++++-----------------
mm/memcontrol.c | 8 ++++----
2 files changed, 8 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1078,25 +1078,12 @@ PAGE_SIZE multiple when read back.
A read-write single value file which exists on non-root
cgroups. The default is "0".
- If set, OOM killer will consider the memory cgroup as an
- indivisible memory consumers and compare it with other memory
- consumers by it's memory footprint.
- If such memory cgroup is selected as an OOM victim, all
- processes belonging to it or it's descendants will be killed.
+ If such memory cgroup is selected as an OOM victim, all processes
+ attached to it and its descendants that are eligible for oom kill
+ (their /proc/pid/oom_score_adj is not oom disabled) will be killed.
This applies to system-wide OOM conditions and reaching
the hard memory limit of the cgroup and their ancestor.
- If OOM condition happens in a descendant cgroup with it's own
- memory limit, the memory cgroup can't be considered
- as an OOM victim, and OOM killer will not kill all belonging
- tasks.
-
- Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
- and will never kill the unkillable task, even if memory.oom_group
- is set.
-
- If cgroup-aware OOM killer is not enabled, ENOTSUPP error
- is returned on attempt to access the file.
memory.oom_policy
@@ -1379,7 +1366,7 @@ When selecting a cgroup as a victim, the OOM killer will kill the process
with the largest memory footprint. A user can control this behavior by
enabling the per-cgroup memory.oom_group option. If set, it causes the
OOM killer to kill all processes attached to the cgroup, except processes
-with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+with /proc/pid/oom_score_adj set to OOM_SCORE_ADJ_MIN.
The root cgroup is treated as a leaf memory cgroup as well, so it is
compared with other leaf memory cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2943,11 +2943,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
continue;
/*
- * We don't consider non-leaf non-oom_group memory cgroups
- * without the oom policy of "tree" as OOM victims.
+ * We don't consider non-leaf memory cgroups without the oom
+ * policy of "tree" as OOM victims.
*/
- if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter) &&
- iter->oom_policy != MEMCG_OOM_POLICY_TREE)
+ if (iter->oom_policy != MEMCG_OOM_POLICY_TREE &&
+ memcg_has_children(iter))
continue;
/*
The cgroup aware oom killer is needlessly enforced for the entire system
by a mount option. It's unnecessary to force the system into a single
oom policy: either cgroup aware, or the traditional process aware.
This patch introduces a memory.oom_policy tunable for all mem cgroups.
It is currently a no-op: it can only be set to "none", which is its
default policy. It will be expanded in the next patch to define cgroup
aware oom killer behavior for its subtree.
This is an extensible interface that can be used to define cgroup aware
assessment of mem cgroup subtrees or the traditional process aware
assessment.
Reading memory.oom_policy will specify the list of available policies.
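For example, with only this patch applied (the mount point below is an
assumption), the sole available policy is shown selected:

  # cat /sys/fs/cgroup/cg1/memory.oom_policy
  [none]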
Another benefit of such an approach is that an admin can lock in a
certain policy for the system or for a mem cgroup subtree and can
delegate the policy decision to the user to determine whether the kill
should originate from a subcontainer, treating subcontainers as indivisible
memory consumers themselves, or whether selection should be done per process.
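A minimal sketch of such delegation (the ownership model and path here are
assumptions):

  # chown user /sys/fs/cgroup/userB/memory.oom_policy

after which the owner of /userB can decide whether ooms scoped to that
subtree select victims per process or per cgroup.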
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 11 ++++++++
include/linux/memcontrol.h | 11 ++++++++
mm/memcontrol.c | 35 +++++++++++++++++++++++++
3 files changed, 57 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1098,6 +1098,17 @@ PAGE_SIZE multiple when read back.
If cgroup-aware OOM killer is not enabled, ENOTSUPP error
is returned on attempt to access the file.
+ memory.oom_policy
+
+ A read-write single string file which exists on all cgroups. The
+ default value is "none".
+
+ If "none", the OOM killer will use the default policy to choose a
+ victim; that is, it will choose the single process with the largest
+ memory footprint adjusted by /proc/pid/oom_score_adj (see
+ Documentation/filesystems/proc.txt). This is the same policy as if
+ memory cgroups were not even mounted.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -66,6 +66,14 @@ enum mem_cgroup_protection {
MEMCG_PROT_MIN,
};
+enum memcg_oom_policy {
+ /*
+ * No special oom policy, process selection is determined by
+ * oom_badness()
+ */
+ MEMCG_OOM_POLICY_NONE,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -234,6 +242,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /* OOM policy for this subtree */
+ enum memcg_oom_policy oom_policy;
+
/*
* Treat the sub-tree as an indivisible memory consumer,
* kill all belonging tasks if the memory cgroup selected
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4649,6 +4649,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent) {
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
+ memcg->oom_policy = parent->oom_policy;
}
if (parent && parent->use_hierarchy) {
memcg->use_hierarchy = true;
@@ -5810,6 +5811,34 @@ static int memory_stat_show(struct seq_file *m, void *v)
return 0;
}
+static int memory_oom_policy_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
+
+ switch (policy) {
+ case MEMCG_OOM_POLICY_NONE:
+ default:
+ seq_puts(m, "[none]\n");
+ };
+ return 0;
+}
+
+static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ ssize_t ret = nbytes;
+
+ buf = strstrip(buf);
+ if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+ else
+ ret = -EINVAL;
+
+ return ret;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -5857,6 +5886,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
+ {
+ .name = "oom_policy",
+ .flags = CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_oom_policy_show,
+ .write = memory_oom_policy_write,
+ },
{ } /* terminate */
};
One of the three significant concerns brought up about the cgroup aware
oom killer is that its decisionmaking is completely evaded by creating
subcontainers and attaching processes such that the ancestor's usage does
not exceed another cgroup on the system.
Consider the example from the previous patch where "memory" is set in
each mem cgroup's cgroup.controllers:
mem cgroup cgroup.procs
========== ============
/cg1 1 process consuming 250MB
/cg2 3 processes consuming 100MB each
/cg3/cg31 2 processes consuming 100MB each
/cg3/cg32 2 processes consuming 100MB each
If memory.oom_policy is "cgroup", a process from /cg2 is chosen because it
is in the single indivisible memory consumer with the greatest usage.
The true usage of /cg3 is actually 400MB, but a process from /cg2 is
chosen because cgroups are compared individually rather than
hierarchically.
If a system is divided into two users, for example:
mem cgroup memory.max
========== ==========
/userA 250MB
/userB 250MB
If /userA runs all processes attached to the local mem cgroup, whereas
/userB distributes their processes over a set of subcontainers under
/userB, /userA will be unfairly penalized.
There is incentive with cgroup v2 to distribute processes over a set of
subcontainers if those processes shall be constrained by other cgroup
controllers; this is a direct result of mandating a single, unified
hierarchy for cgroups. A user may also reasonably do this for mem cgroup
control or statistics. And, a user may do this to evade the cgroup-aware
oom killer selection logic.
This patch adds an oom policy, "tree", that accounts for hierarchical
usage when comparing cgroups and the cgroup aware oom killer is enabled by
an ancestor. This allows administrators, for example, to require users in
their own top-level mem cgroup subtree to be accounted for with
hierarchical usage. In other words, they can no longer evade the oom killer
by using other controllers or subcontainers.
If an oom policy of "tree" is in place for a subtree, such as /cg3 above,
the hierarchical usage is used for comparisons with other cgroups if
either "cgroup" or "tree" is the oom policy of the oom mem cgroup. Thus,
if /cg3/memory.oom_policy is "tree", one of the processes from /cg3's
subcontainers is chosen for oom kill.
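A sketch of the resulting interface for the example above, assuming cgroup2
is mounted at /sys/fs/cgroup:

  # echo tree > /sys/fs/cgroup/cg3/memory.oom_policy
  # cat /sys/fs/cgroup/cg3/memory.oom_policy
  none cgroup [tree]

/cg3 is then compared with /cg1 and /cg2 based on its 400MB hierarchical
usage rather than the individual usage of /cg3/cg31 or /cg3/cg32.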
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 17 ++++++++++++++---
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 18 ++++++++++++------
3 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1113,6 +1113,10 @@ PAGE_SIZE multiple when read back.
memory consumers; that is, they will compare mem cgroup usage rather
than process memory footprint. See the "OOM Killer" section below.
+ If "tree", the OOM killer will compare mem cgroups and its subtree
+ as a single indivisible memory consumer. This policy cannot be set
+ on the root mem cgroup. See the "OOM Killer" section below.
+
When an OOM condition occurs, the policy is dictated by the mem
cgroup that is OOM (the root mem cgroup for a system-wide OOM
condition). If a descendant mem cgroup has a policy of "none", for
@@ -1120,6 +1124,10 @@ PAGE_SIZE multiple when read back.
the heuristic will still compare mem cgroups as indivisible memory
consumers.
+ When an OOM condition occurs in a mem cgroup with an OOM policy of
+ "cgroup" or "tree", the OOM killer will compare mem cgroups with
+ "cgroup" policy individually with "tree" policy subtrees.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1355,7 +1363,7 @@ out of memory, its memory.oom_policy will dictate how the OOM killer will
select a process, or cgroup, to kill. Likewise, when the system is OOM,
the policy is dictated by the root mem cgroup.
-There are currently two available oom policies:
+There are currently three available oom policies:
- "none": default, choose the largest single memory hogging process to
oom kill, as traditionally the OOM killer has always done.
@@ -1364,6 +1372,9 @@ There are currently two available oom policies:
subtree as an OOM victim and kill at least one process, depending on
memory.oom_group, from it.
+ - "tree": choose the cgroup with the largest memory footprint considering
+ itself and its subtree and kill at least one process.
+
When selecting a cgroup as a victim, the OOM killer will kill the process
with the largest memory footprint. A user can control this behavior by
enabling the per-cgroup memory.oom_group option. If set, it causes the
@@ -1382,8 +1393,8 @@ Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
significant memory footprint may affect OOM victim selection logic.
If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and enabling oom_group
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+and enabling oom_group on an ancestor layer.
IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -77,6 +77,11 @@ enum memcg_oom_policy {
* mem cgroup as an indivisible consumer
*/
MEMCG_OOM_POLICY_CGROUP,
+ /*
+ * Tree cgroup usage for all descendant memcg groups, treating each mem
+ * cgroup and its subtree as an indivisible consumer
+ */
+ MEMCG_OOM_POLICY_TREE,
};
struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2952,7 +2952,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
* The oom_score is calculated for leaf memory cgroups (including
* the root memcg).
- * Non-leaf oom_group cgroups accumulating score of descendant
+ * Cgroups with oom policy of "tree" accumulate the score of descendant
* leaf memory cgroups.
*/
rcu_read_lock();
@@ -2961,10 +2961,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
* We don't consider non-leaf non-oom_group memory cgroups
- * as OOM victims.
+ * without the oom policy of "tree" as OOM victims.
*/
if (memcg_has_children(iter) && iter != root_mem_cgroup &&
- !mem_cgroup_oom_group(iter))
+ !mem_cgroup_oom_group(iter) &&
+ iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
/*
@@ -3027,7 +3028,7 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
else
root = root_mem_cgroup;
- if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+ if (root->oom_policy == MEMCG_OOM_POLICY_NONE)
return false;
select_victim_memcg(root, oc);
@@ -5812,11 +5813,14 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
switch (policy) {
case MEMCG_OOM_POLICY_CGROUP:
- seq_puts(m, "none [cgroup]\n");
+ seq_puts(m, "none [cgroup] tree\n");
+ break;
+ case MEMCG_OOM_POLICY_TREE:
+ seq_puts(m, "none cgroup [tree]\n");
break;
case MEMCG_OOM_POLICY_NONE:
default:
- seq_puts(m, "[none] cgroup\n");
+ seq_puts(m, "[none] cgroup tree\n");
};
return 0;
}
@@ -5832,6 +5836,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
+ else if (!memcmp("tree", buf, min(sizeof("tree")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_TREE;
else
ret = -EINVAL;
There are several downsides to the current implementation that compares
the root mem cgroup with leaf mem cgroups for the cgroup-aware oom killer.
For example, /proc/pid/oom_score_adj is accounted for processes attached
to the root mem cgroup but not leaves. This leads to wild inconsistencies
that unfairly bias for or against the root mem cgroup.
Assume a 728KB bash shell is attached to the root mem cgroup without any
other processes having a non-default /proc/pid/oom_score_adj. At the time
of system oom, the root mem cgroup evaluated to 43,474 pages after boot.
If the bash shell adjusts its /proc/self/oom_score_adj to 1000, however,
the root mem cgroup evaluates to 24,765,482 pages. It would take a cgroup
using 95GB of memory (24,765,482 pages at 4KB per page) to outweigh the root
mem cgroup's evaluation.
The reverse is even more confusing: if the bash shell adjusts its
/proc/self/oom_score_adj to -999, the root mem cgroup evaluates to 42,268
pages, a basically meaningless transformation.
/proc/pid/oom_score_adj is discounted, however, for processes attached to
leaf mem cgroups. If a sole process using 250MB of memory is attached to
a mem cgroup, it evaluates to 250MB >> PAGE_SHIFT. If its
/proc/pid/oom_score_adj is changed to -999, or even 1000, the evaluation
remains the same for the mem cgroup.
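For example, assuming a 4KB page size, 250MB >> PAGE_SHIFT is 64,000 pages
whether the attached process's /proc/pid/oom_score_adj is -999 or 1000.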
The heuristic that is used for the root mem cgroup also differs from leaf
mem cgroups.
For the root mem cgroup, the evaluation is the sum of all process's
/proc/pid/oom_score. Besides factoring in oom_score_adj, it is based on
the sum of rss + swap + page tables for all processes attached to it.
For leaf mem cgroups, it is based on the amount of anonymous or
unevictable memory + unreclaimable slab + kernel stack + sock + swap.
There's also an exemption for root mem cgroup processes that do not
intersect the allocating process's mems_allowed. Because the current
heuristic is based on oom_badness(), the evaluation of the root mem
cgroup disregards all processes attached to it that have disjoint
mems_allowed, making oom selection specifically dependent on the
allocating process for system oom conditions!
This patch introduces completely fair comparison between the root mem
cgroup and leaf mem cgroups. It compares them with the same heuristic
and does not prefer one over the other. It disregards oom_score_adj
as the cgroup-aware oom killer should, if enabled by memory.oom_policy.
The goal is to target the most memory consuming cgroup on the system,
not consider per-process adjustment.
The fact that the evaluation of all mem cgroups depends on the mempolicy
of the allocating process, which is completely undocumented for the
cgroup-aware oom killer, will be addressed in a subsequent patch.
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 7 +-
mm/memcontrol.c | 149 ++++++++++++------------
2 files changed, 76 insertions(+), 80 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1382,12 +1382,7 @@ OOM killer to kill all processes attached to the cgroup, except processes
with /proc/pid/oom_score_adj set to -1000 (oom disabled).
The root cgroup is treated as a leaf memory cgroup as well, so it is
-compared with other leaf memory cgroups. Due to internal implementation
-restrictions the size of the root cgroup is the cumulative sum of
-oom_badness of all its tasks (in other words oom_score_adj of each task
-is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
-lead to over- or underestimation of the root cgroup consumption and it is
-therefore discouraged. This might change in the future, however.
+compared with other leaf memory cgroups.
Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,8 @@ int do_swap_account __read_mostly;
#define do_swap_account 0
#endif
+static atomic_long_t total_sock_pages;
+
/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
@@ -2818,9 +2820,9 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
}
static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
{
+ const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
pg_data_t *pgdat;
@@ -2829,92 +2831,65 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
if (nodemask && !node_isset(nid, *nodemask))
continue;
- points += mem_cgroup_node_nr_lru_pages(memcg, nid,
- LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
-
pgdat = NODE_DATA(nid);
- points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
- NR_SLAB_UNRECLAIMABLE);
+ if (is_root_memcg) {
+ points += node_page_state(pgdat, NR_ACTIVE_ANON) +
+ node_page_state(pgdat, NR_INACTIVE_ANON);
+ points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+ } else {
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
}
- points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
- (PAGE_SIZE / 1024);
- points += memcg_page_state(memcg, MEMCG_SOCK);
- points += memcg_page_state(memcg, MEMCG_SWAP);
-
+ if (is_root_memcg) {
+ points += global_zone_page_state(NR_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += atomic_long_read(&total_sock_pages);
+ points += total_swap_pages - get_nr_swap_pages();
+ } else {
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+ }
return points;
}
/*
- * Checks if the given memcg is a valid OOM victim and returns a number,
- * which means the folowing:
- * -1: there are inflight OOM victim tasks, belonging to the memcg
- * 0: memcg is not eligible, e.g. all belonging tasks are protected
- * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * Checks if the given non-root memcg has a valid OOM victim and returns a
+ * number, which means the following:
+ * -1: there is an inflight OOM victim process attached to the memcg
+ * 0: memcg is not eligible because all tasks attached are unkillable
+ * (kthreads or oom_score_adj set to OOM_SCORE_ADJ_MIN)
* >0: memcg is eligible, and the returned value is an estimation
* of the memory footprint
*/
static long oom_evaluate_memcg(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
{
struct css_task_iter it;
struct task_struct *task;
int eligible = 0;
/*
- * Root memory cgroup is a special case:
- * we don't have necessary stats to evaluate it exactly as
- * leaf memory cgroups, so we approximate it's oom_score
- * by summing oom_score of all belonging tasks, which are
- * owners of their mm structs.
- *
- * If there are inflight OOM victim tasks inside
- * the root memcg, we return -1.
- */
- if (memcg == root_mem_cgroup) {
- struct css_task_iter it;
- struct task_struct *task;
- long score = 0;
-
- css_task_iter_start(&memcg->css, 0, &it);
- while ((task = css_task_iter_next(&it))) {
- if (tsk_is_oom_victim(task) &&
- !test_bit(MMF_OOM_SKIP,
- &task->signal->oom_mm->flags)) {
- score = -1;
- break;
- }
-
- task_lock(task);
- if (!task->mm) {
- task_unlock(task);
- continue;
- }
- task_unlock(task);
-
- score += oom_badness(task, memcg, nodemask,
- totalpages);
- }
- css_task_iter_end(&it);
-
- return score;
- }
-
- /*
- * Memcg is OOM eligible if there are OOM killable tasks inside.
- *
- * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
- * as unkillable.
- *
- * If there are inflight OOM victim tasks inside the memcg,
- * we return -1.
+ * Memcg is eligible for oom kill if at least one process is eligible
+ * to be killed. Processes with oom_score_adj of OOM_SCORE_ADJ_MIN
+ * are unkillable.
*/
css_task_iter_start(&memcg->css, 0, &it);
while ((task = css_task_iter_next(&it))) {
+ task_lock(task);
+ if (!task->mm) {
+ task_unlock(task);
+ continue;
+ }
if (!eligible &&
task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
eligible = 1;
+ task_unlock(task);
if (tsk_is_oom_victim(task) &&
!test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
@@ -2927,13 +2902,14 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
- return memcg_oom_badness(memcg, nodemask, totalpages);
+ return memcg_oom_badness(memcg, nodemask);
}
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
{
struct mem_cgroup *iter, *group = NULL;
long group_score = 0;
+ long leaf_score = 0;
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
@@ -2959,12 +2935,18 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
for_each_mem_cgroup_tree(iter, root) {
long score;
+ /*
+ * Root memory cgroup will be considered after iteration,
+ * if eligible.
+ */
+ if (iter == root_mem_cgroup)
+ continue;
+
/*
* We don't consider non-leaf non-oom_group memory cgroups
* without the oom policy of "tree" as OOM victims.
*/
- if (memcg_has_children(iter) && iter != root_mem_cgroup &&
- !mem_cgroup_oom_group(iter) &&
+ if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter) &&
iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
@@ -2972,16 +2954,15 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
* If group is not set or we've ran out of the group's sub-tree,
* we should set group and reset group_score.
*/
- if (!group || group == root_mem_cgroup ||
- !mem_cgroup_is_descendant(iter, group)) {
+ if (!group || !mem_cgroup_is_descendant(iter, group)) {
group = iter;
group_score = 0;
}
- if (memcg_has_children(iter) && iter != root_mem_cgroup)
+ if (memcg_has_children(iter))
continue;
- score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+ score = oom_evaluate_memcg(iter, oc->nodemask);
/*
* Ignore empty and non-eligible memory cgroups.
@@ -3000,6 +2981,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
}
group_score += score;
+ leaf_score += score;
if (group_score > oc->chosen_points) {
oc->chosen_points = group_score;
@@ -3007,8 +2989,25 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
}
}
- if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
- css_get(&oc->chosen_memcg->css);
+ if (oc->chosen_memcg != INFLIGHT_VICTIM) {
+ if (root == root_mem_cgroup) {
+ group_score = oom_evaluate_memcg(root_mem_cgroup,
+ oc->nodemask);
+ if (group_score > leaf_score) {
+ /*
+ * Discount the sum of all leaf scores to find
+ * root score.
+ */
+ group_score -= leaf_score;
+ if (group_score > oc->chosen_points) {
+ oc->chosen_points = group_score;
+ oc->chosen_memcg = root_mem_cgroup;
+ }
+ }
+ }
+ if (oc->chosen_memcg)
+ css_get(&oc->chosen_memcg->css);
+ }
rcu_read_unlock();
}
@@ -6491,6 +6490,7 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
gfp_mask = GFP_NOWAIT;
mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
+ atomic_long_add(nr_pages, &total_sock_pages);
if (try_charge(memcg, gfp_mask, nr_pages) == 0)
return true;
@@ -6512,6 +6512,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
}
mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
+ atomic_long_add(-nr_pages, &total_sock_pages);
refill_stock(memcg, nr_pages);
}
Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount to
change.
Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".
The heuristic used to select a process, or a cgroup to kill from, is
controlled by the oom mem cgroup's memory.oom_policy. This means that if
a descendant mem cgroup has an oom policy of "none", for example, and an
oom condition originates in an ancestor with an oom policy of "cgroup",
the selection logic will treat all descendant cgroups as indivisible
memory consumers.
For example, consider a system where each mem cgroup has "memory" set
in cgroup.controllers:
mem cgroup cgroup.procs
========== ============
/cg1 1 process consuming 250MB
/cg2 3 processes consuming 100MB each
/cg3/cg31 2 processes consuming 100MB each
/cg3/cg32 2 processes consuming 100MB each
If the root mem cgroup's memory.oom_policy is "none", the process from
/cg1 is chosen as the victim. If memory.oom_policy is "cgroup", a process
from /cg2 is chosen because it is in the single indivisible memory
consumer with the greatest usage. This policy of "cgroup" is identical to
the current "groupoom" mount option, now removed.
Note that /cg3 is not the chosen victim when the oom mem cgroup policy is
"cgroup" because cgroups are treated individually without regard to
hierarchical /cg3/memory.current usage. This will be addressed in a
follow-up patch.
This has the added benefit of allowing descendant cgroups to control their
own oom policies if they have memory.oom_policy file permissions without
being restricted to the system-wide policy. In the above example, /cg2
and /cg3 can be either "none" or "cgroup" with the same results: the
selection heuristic depends only on the policy of the oom mem cgroup. If
/cg2 or /cg3 themselves are oom, however, the policy is controlled by
their own oom policies, either process aware or cgroup aware.
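As a sketch, assuming cgroup2 is mounted at /sys/fs/cgroup, the old
mount-time enablement

  # mount -o remount,groupoom $MOUNT_POINT

is replaced by writing the policy of the root mem cgroup:

  # echo cgroup > /sys/fs/cgroup/memory.oom_policy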
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 78 +++++++++++++------------
include/linux/cgroup-defs.h | 5 --
include/linux/memcontrol.h | 5 ++
kernel/cgroup/cgroup.c | 13 +----
mm/memcontrol.c | 19 +++---
5 files changed, 56 insertions(+), 64 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1109,6 +1109,17 @@ PAGE_SIZE multiple when read back.
Documentation/filesystems/proc.txt). This is the same policy as if
memory cgroups were not even mounted.
+ If "cgroup", the OOM killer will compare mem cgroups as indivisible
+ memory consumers; that is, they will compare mem cgroup usage rather
+ than process memory footprint. See the "OOM Killer" section below.
+
+ When an OOM condition occurs, the policy is dictated by the mem
+ cgroup that is OOM (the root mem cgroup for a system-wide OOM
+ condition). If a descendant mem cgroup has a policy of "none", for
+ example, for an OOM condition in a mem cgroup with policy "cgroup",
+ the heuristic will still compare mem cgroups as indivisible memory
+ consumers.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1336,43 +1347,36 @@ belonging to the affected files to ensure correct memory ownership.
OOM Killer
~~~~~~~~~~
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
-
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
-
- # mount -o remount,groupoom $MOUNT_POINT
-
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
-
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
-
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
-
-Leaf cgroups and cgroups with oom_group option set are compared based
-on their cumulative memory usage. The root cgroup is treated as a
-leaf memory cgroup as well, so it is compared with other leaf memory
-cgroups. Due to internal implementation restrictions the size of
-the root cgroup is the cumulative sum of oom_badness of all its tasks
-(in other words oom_score_adj of each task is obeyed). Relying on
-oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can lead to over- or
-underestimation of the root cgroup consumption and it is therefore
-discouraged. This might change in the future, however.
-
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
+
+This policy is controlled by memory.oom_policy. When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill. Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
+
+There are currently two available oom policies:
+
+ - "none": default, choose the largest single memory hogging process to
+ oom kill, as traditionally the OOM killer has always done.
+
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+ subtree as an OOM victim and kill at least one process, depending on
+ memory.oom_group, from it.
+
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint. A user can control this behavior by
+enabling the per-cgroup memory.oom_group option. If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+
+The root cgroup is treated as a leaf memory cgroup as well, so it is
+compared with other leaf memory cgroups. Due to internal implementation
+restrictions the size of the root cgroup is the cumulative sum of
+oom_badness of all its tasks (in other words oom_score_adj of each task
+is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
+lead to over- or underestimation of the root cgroup consumption and it is
+therefore discouraged. This might change in the future, however.
Please, note that memory charges are not migrating if tasks
are moved between different memory cgroups. Moving tasks with
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
* Enable cpuset controller in v1 cgroup to use v2 behavior.
*/
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
- /*
- * Enable cgroup-aware OOM killer.
- */
- CGRP_GROUP_OOM = (1 << 5),
};
/* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -72,6 +72,11 @@ enum memcg_oom_policy {
* oom_badness()
*/
MEMCG_OOM_POLICY_NONE,
+ /*
+ * Local cgroup usage is used to select a target cgroup, treating each
+ * mem cgroup as an indivisible consumer
+ */
+ MEMCG_OOM_POLICY_CGROUP,
};
struct mem_cgroup_reclaim_cookie {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1744,9 +1744,6 @@ static int cgroup2_parse_option(struct fs_context *fc, char *token)
if (!strcmp(token, "nsdelegate")) {
ctx->flags |= CGRP_ROOT_NS_DELEGATE;
return 0;
- } else if (!strcmp(token, "groupoom")) {
- ctx->flags |= CGRP_GROUP_OOM;
- return 0;
}
return -EINVAL;
@@ -1757,8 +1754,6 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
if (current->nsproxy->cgroup_ns == &init_cgroup_ns) {
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
- if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
- seq_puts(seq, ",groupoom");
}
return 0;
}
@@ -1770,11 +1765,6 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
-
- if (root_flags & CGRP_GROUP_OOM)
- cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
- else
- cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
}
@@ -6012,8 +6002,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
- "groupoom\n");
+ return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3022,14 +3022,14 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return false;
-
if (oc->memcg)
root = oc->memcg;
else
root = root_mem_cgroup;
+ if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+ return false;
+
select_victim_memcg(root, oc);
return oc->chosen_memcg;
@@ -5683,9 +5683,6 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
bool oom_group = memcg->oom_group;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return -ENOTSUPP;
-
seq_printf(m, "%d\n", oom_group);
return 0;
@@ -5699,9 +5696,6 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
int oom_group;
int err;
- if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
- return -ENOTSUPP;
-
err = kstrtoint(strstrip(buf), 0, &oom_group);
if (err)
return err;
@@ -5817,9 +5811,12 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
switch (policy) {
+ case MEMCG_OOM_POLICY_CGROUP:
+ seq_puts(m, "none [cgroup]\n");
+ break;
case MEMCG_OOM_POLICY_NONE:
default:
- seq_puts(m, "[none]\n");
+ seq_puts(m, "[none] cgroup\n");
};
return 0;
}
@@ -5833,6 +5830,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
buf = strstrip(buf);
if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+ else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
+ memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
else
ret = -EINVAL;
On Fri, Jul 13, 2018 at 04:07:29PM -0700, David Rientjes wrote:
> One of the three significant concerns brought up about the cgroup aware
> oom killer is that its decisionmaking is completely evaded by creating
> subcontainers and attaching processes such that the ancestor's usage does
> not exceed another cgroup on the system.
>
> Consider the example from the previous patch where "memory" is set in
> each mem cgroup's cgroup.controllers:
>
> mem cgroup cgroup.procs
> ========== ============
> /cg1 1 process consuming 250MB
> /cg2 3 processes consuming 100MB each
> /cg3/cg31 2 processes consuming 100MB each
> /cg3/cg32 2 processes consuming 100MB each
>
> If memory.oom_policy is "cgroup", a process from /cg2 is chosen because it
> is in the single indivisible memory consumer with the greatest usage.
>
> The true usage of /cg3 is actually 400MB, but a process from /cg2 is
> chosen because cgroups are compared individually rather than
> hierarchically.
>
> If a system is divided into two users, for example:
>
> mem cgroup memory.max
> ========== ==========
> /userA 250MB
> /userB 250MB
>
> If /userA runs all processes attached to the local mem cgroup, whereas
> /userB distributes their processes over a set of subcontainers under
> /userB, /userA will be unfairly penalized.
>
> There is incentive with cgroup v2 to distribute processes over a set of
> subcontainers if those processes shall be constrained by other cgroup
> controllers; this is a direct result of mandating a single, unified
> hierarchy for cgroups. A user may also reasonably do this for mem cgroup
> control or statistics. And, a user may do this to evade the cgroup-aware
> oom killer selection logic.
>
> This patch adds an oom policy, "tree", that accounts for hierarchical
> usage when comparing cgroups and the cgroup aware oom killer is enabled by
> an ancestor. This allows administrators, for example, to require users in
> their own top-level mem cgroup subtree to be accounted for with
> hierarchical usage. In other words, they can no longer evade the oom killer
> by using other controllers or subcontainers.
>
> If an oom policy of "tree" is in place for a subtree, such as /cg3 above,
> the hierarchical usage is used for comparisons with other cgroups if
> either "cgroup" or "tree" is the oom policy of the oom mem cgroup. Thus,
> if /cg3/memory.oom_policy is "tree", one of the processes from /cg3's
> subcontainers is chosen for oom kill.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 17 ++++++++++++++---
> include/linux/memcontrol.h | 5 +++++
> mm/memcontrol.c | 18 ++++++++++++------
> 3 files changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1113,6 +1113,10 @@ PAGE_SIZE multiple when read back.
> memory consumers; that is, they will compare mem cgroup usage rather
> than process memory footprint. See the "OOM Killer" section below.
>
> + If "tree", the OOM killer will compare mem cgroups and its subtree
> + as a single indivisible memory consumer. This policy cannot be set
> + on the root mem cgroup. See the "OOM Killer" section below.
> +
> When an OOM condition occurs, the policy is dictated by the mem
> cgroup that is OOM (the root mem cgroup for a system-wide OOM
> condition). If a descendant mem cgroup has a policy of "none", for
> @@ -1120,6 +1124,10 @@ PAGE_SIZE multiple when read back.
> the heuristic will still compare mem cgroups as indivisible memory
> consumers.
>
> + When an OOM condition occurs in a mem cgroup with an OOM policy of
> + "cgroup" or "tree", the OOM killer will compare mem cgroups with
> + "cgroup" policy individually with "tree" policy subtrees.
> +
> memory.events
> A read-only flat-keyed file which exists on non-root cgroups.
> The following entries are defined. Unless specified
> @@ -1355,7 +1363,7 @@ out of memory, its memory.oom_policy will dictate how the OOM killer will
> select a process, or cgroup, to kill. Likewise, when the system is OOM,
> the policy is dictated by the root mem cgroup.
>
> -There are currently two available oom policies:
> +There are currently three available oom policies:
>
> - "none": default, choose the largest single memory hogging process to
> oom kill, as traditionally the OOM killer has always done.
> @@ -1364,6 +1372,9 @@ There are currently two available oom policies:
> subtree as an OOM victim and kill at least one process, depending on
> memory.oom_group, from it.
>
> + - "tree": choose the cgroup with the largest memory footprint considering
> + itself and its subtree and kill at least one process.
> +
> When selecting a cgroup as a victim, the OOM killer will kill the process
> with the largest memory footprint. A user can control this behavior by
> enabling the per-cgroup memory.oom_group option. If set, it causes the
> @@ -1382,8 +1393,8 @@ Please, note that memory charges are not migrating if tasks
> are moved between different memory cgroups. Moving tasks with
> significant memory footprint may affect OOM victim selection logic.
> If it's a case, please, consider creating a common ancestor for
> -the source and destination memory cgroups and enabling oom_group
> -on ancestor layer.
> +the source and destination memory cgroups and setting a policy of "tree"
> +and enabling oom_group on an ancestor layer.
>
>
> IO
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -77,6 +77,11 @@ enum memcg_oom_policy {
> * mem cgroup as an indivisible consumer
> */
> MEMCG_OOM_POLICY_CGROUP,
> + /*
> + * Tree cgroup usage for all descendant memcg groups, treating each mem
> + * cgroup and its subtree as an indivisible consumer
> + */
> + MEMCG_OOM_POLICY_TREE,
> };
>
> struct mem_cgroup_reclaim_cookie {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2952,7 +2952,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> /*
> * The oom_score is calculated for leaf memory cgroups (including
> * the root memcg).
> - * Non-leaf oom_group cgroups accumulating score of descendant
> + * Cgroups with oom policy of "tree" accumulate the score of descendant
> * leaf memory cgroups.
> */
> rcu_read_lock();
> @@ -2961,10 +2961,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>
> /*
> * We don't consider non-leaf non-oom_group memory cgroups
> - * as OOM victims.
> + * without the oom policy of "tree" as OOM victims.
> */
> if (memcg_has_children(iter) && iter != root_mem_cgroup &&
> - !mem_cgroup_oom_group(iter))
> + !mem_cgroup_oom_group(iter) &&
> + iter->oom_policy != MEMCG_OOM_POLICY_TREE)
> continue;
Hello, David!
I think that there is an inconsistency in the memory.oom_policy definition.
"none" and "cgroup" policies defining how the OOM scoped to this particular
memory cgroup (or system, if set on root) is handled. And all sub-tree
settings do not matter at all, right? Also, if a memory cgroup has no
memory.max set, there is no meaning in setting memory.oom_policy.
And "tree" is different. It actually changes how the selection algorithm works,
and sub-tree settings do matter in this case.
I find it very confusing.
Thanks!
On Mon, 16 Jul 2018, Roman Gushchin wrote:
> Hello, David!
>
> I think that there is an inconsistency in the memory.oom_policy definition.
> "none" and "cgroup" policies defining how the OOM scoped to this particular
> memory cgroup (or system, if set on root) is handled. And all sub-tree
> settings do not matter at all, right? Also, if a memory cgroup has no
> memory.max set, there is no meaning in setting memory.oom_policy.
>
Hi Roman,
The effective oom policy is based on the mem cgroup that is oom. That can
occur when memory.max is set, yes.
If a mem cgroup does not become oom itself, its oom policy doesn't do
anything until, well, it's oom :)
> And "tree" is different. It actually changes how the selection algorithm works,
> and sub-tree settings do matter in this case.
>
"Tree" is considering the entity as a single indivisible memory consumer,
it is compared with siblings based on its hierarhical usage. It has
cgroup oom policy.
It would be possible to separate this out, if you'd prefer, to account
an intermediate cgroup as the largest descendant or the sum of all
descendants. I hadn't found a usecase for that, however, but it doesn't
mean there isn't one. If you'd like, I can introduce another tunable.
On Mon, 16 Jul 2018, David Rientjes wrote:
> > And "tree" is different. It actually changes how the selection algorithm works,
> > and sub-tree settings do matter in this case.
> >
>
> "Tree" is considering the entity as a single indivisible memory consumer,
> it is compared with siblings based on its hierarhical usage. It has
> cgroup oom policy.
>
> It would be possible to separate this out, if you'd prefer, to account
> an intermediate cgroup as the largest descendant or the sum of all
> descendants. I hadn't found a usecase for that, however, but it doesn't
> mean there isn't one. If you'd like, I can introduce another tunable.
>
Roman, I'm trying to make progress so that the cgroup aware oom killer is
in a state that it can be merged. Would you prefer a second tunable here
to specify a cgroup's points includes memory from its subtree?
It would be helpful if you would also review the rest of the patchset.
On Mon, Jul 23, 2018 at 01:33:19PM -0700, David Rientjes wrote:
> On Mon, 16 Jul 2018, David Rientjes wrote:
>
> > > And "tree" is different. It actually changes how the selection algorithm works,
> > > and sub-tree settings do matter in this case.
> > >
> >
> > "Tree" is considering the entity as a single indivisible memory consumer,
> > it is compared with siblings based on its hierarhical usage. It has
> > cgroup oom policy.
> >
> > It would be possible to separate this out, if you'd prefer, to account
> > an intermediate cgroup as the largest descendant or the sum of all
> > descendants. I hadn't found a usecase for that, however, but it doesn't
> > mean there isn't one. If you'd like, I can introduce another tunable.
> >
>
> Roman, I'm trying to make progress so that the cgroup aware oom killer is
> in a state that it can be merged. Would you prefer a second tunable here
> to specify a cgroup's points includes memory from its subtree?
Hi, David!
It's hard to tell, because I don't have a clear picture of what you're
suggesting now. My biggest concern about your last version was that it's hard
to tell what oom_policy really defines. Each value has its own application
rules, which is a bit messy (some values are meaningful only for the OOMing
cgroup, others are read during hierarchy traversal).
If you know how to make it clear and non-contradictory,
please, describe the proposed interface.
>
> It would be helpful if you would also review the rest of the patchset.
I think that we should focus on interface semantics right now.
If we can't agree on how things should work, it makes no sense
to discuss the implementation.
Thanks!
On Mon, 23 Jul 2018, Roman Gushchin wrote:
> > Roman, I'm trying to make progress so that the cgroup aware oom killer is
> > in a state that it can be merged. Would you prefer a second tunable here
> > to specify a cgroup's points includes memory from its subtree?
>
> Hi, David!
>
> It's hard to tell, because I don't have a clear picture of what you're
> suggesting now.
Each patch specifies what it does rather elaborately. If there's
confusion on what this patch, or any of the patches in this patchset, is
motivated by or addresses, please call it out specifically.
> My biggest concern about your last version was that it's hard
> to tell what oom_policy really defines. Each value has its own application
> rules, which is a bit messy (some values are meaningful only for the OOMing
> cgroup, others are read during hierarchy traversal).
> If you know how to make it clear and non-contradictory,
> please, describe the proposed interface.
>
As my initial response stated, "tree" has cgroup aware properties but it
considers the subtree usage as its own. I do not know of any usecase,
today or in the future, that would want subtree usage accounted to its own
when being considered as a single indivisible memory consumer yet still
want per-process oom kill selection.
If you do not prefer that overloading, I can break the two out from one
another such that one tunable defines cgroup vs process, and another
defines subtree usage being considered or not. That's a perfectly fine
suggestion and I have no problem implementing it. The only reason I did
not was that I do not know of any user that would want process &&
subtree, and the overloading reduces the number of mem cgroup files by one.
If you'd like me to separate these out by adding another tunable, please
let me know. We will already have another tunable later (though, as the
cover letter states, it is not required for this to be merged) to allow the
user to adjust the calculation for a subtree such that it may protect
important cgroups that are allowed to use more memory than others.
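A purely hypothetical sketch of such a split (the second file name below is
invented here for illustration and is not part of this patchset):

  # echo cgroup > memory.oom_policy       (cgroup vs process selection)
  # echo 1 > memory.oom_subtree_usage     (hypothetical: account subtree usage)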
> > It would be helpful if you would also review the rest of the patchset.
>
> I think that we should focus on interface semantics right now.
> If we can't agree on how things should work, it makes no sense
> to discuss the implementation.
>
Yes, I have urged that we consider the interface in both the
memory.oom_group discussion as well as the discussion here, which is why
this patchset removes the mount option, does not lock down the entire
hierarchy into a single policy, and is extensible to be generally useful
outside of very special usecases.