Establish a new OOM score algorithm that supports a cgroup-level OOM
protection mechanism. When a global/memcg OOM event occurs, we treat
all processes in a cgroup as a whole, and the OOM killer selects the
process to kill based on the protection quota of the cgroup.
Changelog:
v3:
* Add "auto" option for memory.oom.protect. (patch 1)
* Fix division errors. (patch 1)
* Add observation indicator oom_kill_inherit. (patch 2)
v2:
* Modify the formula for the memcg protection quota that a process requests.
https://lore.kernel.org/linux-mm/[email protected]/
v1:
https://lore.kernel.org/linux-mm/[email protected]/
chengkaitao (2):
mm: memcontrol: protect the memory in cgroup from being oom killed
memcg: add oom_kill_inherit event indicator
Documentation/admin-guide/cgroup-v2.rst | 29 ++++-
fs/proc/base.c | 17 ++-
include/linux/memcontrol.h | 46 +++++++-
include/linux/oom.h | 3 +-
include/linux/page_counter.h | 6 +
mm/memcontrol.c | 199 ++++++++++++++++++++++++++++++++
mm/oom_kill.c | 25 ++--
mm/page_counter.c | 30 +++++
8 files changed, 334 insertions(+), 21 deletions(-)
--
2.14.1
From: chengkaitao <[email protected]>
We created a new interface, <memory.oom.protect>, for memory. If the
OOM killer runs under a parent memory cgroup, and the memory usage of a
child cgroup is within its effective oom.protect boundary, the cgroup's
tasks won't be OOM killed unless there are no unprotected tasks in the
other child cgroups. The inheritance semantics follow the logic of
<memory.min/low>.
It has the following advantages:
1. We gain the ability to protect more important processes when a
memcg OOM killer runs. The oom.protect value only takes effect in the
local memcg and does not affect the OOM killer of the host.
2. Historically, oom_score_adj has often been used to control a group
of processes. This requires that all processes in the group share a
common parent process, and the common parent's oom_score_adj must be
set before it forks all of its children, which makes it very difficult
to apply in other situations. oom.protect has no such restrictions, so
we can protect a cgroup of processes much more easily. The cgroup can
keep some memory even if the OOM killer has to be called.
If the memory.oom.protect of a non-leaf cgroup is set to "auto", it is
automatically calculated as the sum of the effective oom.protect values
of its child cgroups.
Signed-off-by: chengkaitao <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 25 ++++-
fs/proc/base.c | 17 ++-
include/linux/memcontrol.h | 45 +++++++-
include/linux/oom.h | 3 +-
include/linux/page_counter.h | 6 +
mm/memcontrol.c | 193 ++++++++++++++++++++++++++++++++
mm/oom_kill.c | 25 +++--
mm/page_counter.c | 30 +++++
8 files changed, 323 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..51e9a84d508a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1194,7 +1194,7 @@ PAGE_SIZE multiple when read back.
cgroup is within its effective low boundary, the cgroup's
memory won't be reclaimed unless there is no reclaimable
memory available in unprotected cgroups.
- Above the effective low boundary (or
+ Above the effective low boundary (or
effective min boundary if it is higher), pages are reclaimed
proportionally to the overage, reducing reclaim pressure for
smaller overages.
@@ -1295,6 +1295,27 @@ PAGE_SIZE multiple when read back.
to kill any tasks outside of this cgroup, regardless
memory.oom.group values of ancestor cgroups.
+ memory.oom.protect
+ A read-write single value file which exists on non-root
+ cgroups. The default value is "0".
+
+	If the OOM killer runs under the parent memory cgroup, and
+	the memory usage of a child cgroup is within its effective
+	oom.protect boundary, the cgroup's processes won't be OOM killed
+	unless there are no unprotected processes in other child
+	cgroups. The effective oom.protect boundary is assigned to each
+	process in this cgroup in proportion to its actual usage; this
+	factor is taken into account when calculating the oom_score.
+	The effective oom.protect boundary is limited by the
+	memory.oom.protect values of all ancestor cgroups. If there is
+	memory.oom.protect overcommitment (a child cgroup or cgroups
+	require more protected memory than the parent allows), each
+	child cgroup gets the part of the parent's protection
+	proportional to its actual memory usage below memory.oom.protect.
+	If the memory.oom.protect of a non-leaf cgroup is set to "auto",
+	it is automatically calculated as the sum of the effective
+	oom.protect values of its child cgroups.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1894,7 +1915,7 @@ of the two is enforced.
cgroup writeback requires explicit support from the underlying
filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
attributed to the root cgroup.
There are inherent differences in memory and writeback management
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 05452c3b9872..a34e007a7292 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -553,8 +553,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
unsigned long totalpages = totalram_pages() + total_swap_pages;
unsigned long points = 0;
long badness;
+#ifdef CONFIG_MEMCG
+ struct mem_cgroup *memcg;
- badness = oom_badness(task, totalpages);
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(task);
+ if (memcg && !css_tryget(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+
+ update_parent_oom_protection(root_mem_cgroup, memcg);
+	if (memcg)
+		css_put(&memcg->css);
+#endif
+ badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
/*
* Special case OOM_SCORE_ADJ_MIN for all others scale the
* badness value into [0, 2000] range which we have been
@@ -2657,7 +2668,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry,
return d_splice_alias(inode, dentry);
}
-static struct dentry *proc_pident_lookup(struct inode *dir,
+static struct dentry *proc_pident_lookup(struct inode *dir,
struct dentry *dentry,
const struct pid_entry *p,
const struct pid_entry *end)
@@ -2870,7 +2881,7 @@ static const struct pid_entry attr_dir_stuff[] = {
static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
{
- return proc_pident_readdir(file, ctx,
+ return proc_pident_readdir(file, ctx,
attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 00a88cf947e1..23ea28d98c0e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,11 @@ enum memcg_memory_event {
MEMCG_NR_MEMORY_EVENTS,
};
+enum memcg_oom_evaluate {
+ MEMCG_OOM_EVALUATE_NONE,
+ MEMCG_OOM_PROTECT,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
unsigned int generation;
@@ -622,6 +627,14 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
void mem_cgroup_calculate_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg);
+void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
+void update_parent_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
+unsigned long get_task_eoom_protect(struct task_struct *p, long points);
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
+bool is_root_oom_protect(void);
static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
struct mem_cgroup *memcg)
@@ -758,10 +771,6 @@ static inline struct lruvec *folio_lruvec(struct folio *folio)
return mem_cgroup_lruvec(memcg, folio_pgdat(folio));
}
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
-struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
-
struct lruvec *folio_lruvec_lock(struct folio *folio);
struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
@@ -822,6 +831,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
int mem_cgroup_scan_tasks(struct mem_cgroup *,
int (*)(struct task_struct *, void *), void *);
+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg);
static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
{
@@ -1231,6 +1242,16 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
{
}
+static inline void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+}
+
+static inline void update_parent_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+}
+
static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
struct mem_cgroup *memcg)
{
@@ -1248,6 +1269,16 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
return false;
}
+static inline unsigned long get_task_eoom_protect(struct task_struct *p, long points)
+{
+ return 0;
+}
+
+static inline bool is_root_oom_protect(void)
+{
+	return false;
+}
+
static inline int mem_cgroup_charge(struct folio *folio,
struct mm_struct *mm, gfp_t gfp)
{
@@ -1372,6 +1403,12 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return 0;
}
+static inline int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg)
+{
+ return 0;
+}
+
static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
{
return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7d0c9c48a0c5..04b6daca5a9c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -97,8 +97,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
return 0;
}
-long oom_badness(struct task_struct *p,
- unsigned long totalpages);
+long oom_badness(struct task_struct *p, unsigned long totalpages, int flag);
extern bool out_of_memory(struct oom_control *oc);
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index c141ea9a95ef..d730a7373c1d 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -25,6 +25,10 @@ struct page_counter {
atomic_long_t low_usage;
atomic_long_t children_low_usage;
+ unsigned long eoom_protect;
+ atomic_long_t oom_protect_usage;
+ atomic_long_t children_oom_protect_usage;
+
unsigned long watermark;
unsigned long failcnt;
@@ -35,6 +39,7 @@ struct page_counter {
unsigned long low;
unsigned long high;
unsigned long max;
+ unsigned long oom_protect;
struct page_counter *parent;
} ____cacheline_internodealigned_in_smp;
@@ -65,6 +70,7 @@ bool page_counter_try_charge(struct page_counter *counter,
void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_oom_protect(struct page_counter *counter, unsigned long nr_pages);
static inline void page_counter_set_high(struct page_counter *counter,
unsigned long nr_pages)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d31fb1e2cb33..5a382359ed49 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1288,6 +1288,52 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return ret;
}
+/**
+ * mem_cgroup_scan_tasks_update_eoom - iterate over tasks of a memory cgroup
+ * hierarchy and update memcg's eoom_protect
+ * @memcg: hierarchy root
+ * @fn: function to call for each task
+ * @arg: argument passed to @fn
+ *
+ * This function iterates over tasks attached to @memcg or to any of its
+ * descendants, updates each memcg's eoom_protect, and then calls @fn for each
+ * task. If @fn returns a non-zero value, the function breaks the iteration
+ * loop and returns the value. Otherwise, it will iterate over all tasks and
+ * return 0.
+ *
+ * This function may be called for the root memory cgroup.
+ */
+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg)
+{
+ struct mem_cgroup *iter;
+ int ret = 0;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ mem_cgroup_calculate_oom_protection(memcg, iter);
+ css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = fn(task, arg, MEMCG_OOM_PROTECT);
+ css_task_iter_end(&it);
+ if (ret) {
+ mem_cgroup_iter_break(memcg, iter);
+ break;
+ }
+ }
+ return ret;
+}
+
+bool is_root_oom_protect(void)
+{
+ if (mem_cgroup_disabled())
+		return false;
+
+ return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}
+
#ifdef CONFIG_DEBUG_VM
void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
{
@@ -6396,6 +6442,8 @@ static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
{
if (value == PAGE_COUNTER_MAX)
seq_puts(m, "max\n");
+ else if (value == PAGE_COUNTER_MAX + 1)
+ seq_puts(m, "auto\n");
else
seq_printf(m, "%llu\n", (u64)value * PAGE_SIZE);
@@ -6677,6 +6725,34 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_oom_protect_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->memory.oom_protect));
+}
+
+static ssize_t memory_oom_protect_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long oom_protect;
+ int err;
+
+ buf = strstrip(buf);
+ if (!strcmp(buf, "auto")) {
+ oom_protect = PAGE_COUNTER_MAX + 1;
+ goto set;
+ }
+
+ err = page_counter_memparse(buf, "max", &oom_protect);
+ if (err)
+ return err;
+
+set:
+ page_counter_set_oom_protect(&memcg->memory, oom_protect);
+ return nbytes;
+}
+
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -6782,6 +6858,12 @@ static struct cftype memory_files[] = {
.seq_show = memory_oom_group_show,
.write = memory_oom_group_write,
},
+ {
+ .name = "oom.protect",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_protect_show,
+ .write = memory_oom_protect_write,
+ },
{
.name = "reclaim",
.flags = CFTYPE_NS_DELEGATABLE,
@@ -6978,6 +7060,117 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
atomic_long_read(&parent->memory.children_low_usage)));
}
+static void __mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+ unsigned long usage, parent_usage;
+ struct mem_cgroup *parent;
+
+ usage = page_counter_read(&memcg->memory);
+ if (!usage)
+ return;
+
+ parent = parent_mem_cgroup(memcg);
+
+ if (parent == root) {
+ memcg->memory.eoom_protect = atomic_long_read(&memcg->memory.oom_protect_usage);
+ return;
+ }
+
+ parent_usage = page_counter_read(&parent->memory);
+
+ WRITE_ONCE(memcg->memory.eoom_protect, effective_protection(usage, parent_usage,
+ atomic_long_read(&memcg->memory.oom_protect_usage),
+ READ_ONCE(parent->memory.eoom_protect),
+ atomic_long_read(&parent->memory.children_oom_protect_usage)));
+}
+
+/**
+ * mem_cgroup_calculate_oom_protection - calculate the effective oom
+ * protection boundary of a memory cgroup
+ * @root: the top ancestor of the sub-tree being checked
+ * @memcg: the memory cgroup to check
+ *
+ * WARNING: This function is not stateless! It can only be used as part
+ * of a top-down tree iteration, not for isolated queries.
+ */
+void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!root)
+ root = root_mem_cgroup;
+
+ /*
+ * Effective values of the reclaim targets are ignored so they
+ * can be stale. Have a look at mem_cgroup_protection for more
+ * details.
+ * TODO: calculation should be more robust so that we do not need
+ * that special casing.
+ */
+ if (memcg == root)
+ return;
+
+ __mem_cgroup_calculate_oom_protection(root, memcg);
+}
+
+static void list_postorder_for_memcg_parent(
+		struct mem_cgroup *root, struct mem_cgroup *memcg,
+		void (*fn)(struct mem_cgroup *, struct mem_cgroup *))
+{
+	struct mem_cgroup *parent;
+
+	if (!memcg || memcg == root)
+		return;
+
+	parent = parent_mem_cgroup(memcg);
+	list_postorder_for_memcg_parent(root, parent, fn);
+	fn(root, memcg);
+}
+
+void update_parent_oom_protection(struct mem_cgroup *root,
+		struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!root)
+		root = root_mem_cgroup;
+
+	list_postorder_for_memcg_parent(root, memcg,
+			__mem_cgroup_calculate_oom_protection);
+}
+
+unsigned long get_task_eoom_protect(struct task_struct *p, long points)
+{
+ struct mem_cgroup *memcg;
+ unsigned long usage, eoom = 0;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(p);
+
+ if (mem_cgroup_unprotected(NULL, memcg))
+ goto end;
+
+ if (do_memsw_account())
+ usage = page_counter_read(&memcg->memsw);
+ else
+ usage = page_counter_read(&memcg->memory)
+ + page_counter_read(&memcg->swap);
+
+	/* Avoid division by zero when charges were not moved at task migration */
+ if (!usage)
+ goto end;
+
+ eoom = READ_ONCE(memcg->memory.eoom_protect) * points / usage;
+
+end:
+ rcu_read_unlock();
+ return eoom;
+}
+
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..25067145c26d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
* linux/mm/oom_kill.c
- *
+ *
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
@@ -193,15 +193,16 @@ static bool should_dump_unreclaim_slab(void)
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
* @totalpages: total present RAM allowed for page allocation
+ * @flag: whether to take memcg oom protection into account (MEMCG_OOM_PROTECT)
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages, int flag)
{
- long points;
- long adj;
+ long points, adj, val = 0;
+ unsigned long shmem;
if (oom_unkillable_task(p))
return LONG_MIN;
@@ -229,11 +230,15 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
*/
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+
+ shmem = get_mm_counter(p->mm, MM_SHMEMPAGES);
task_unlock(p);
+ if (flag == MEMCG_OOM_PROTECT)
+ val = get_task_eoom_protect(p, points - shmem);
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
- points += adj;
+ points = points + adj - val;
return points;
}
@@ -305,7 +310,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+static int oom_evaluate_task(struct task_struct *task, void *arg, int flag)
{
struct oom_control *oc = arg;
long points;
@@ -338,7 +343,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto select;
}
- points = oom_badness(task, oc->totalpages);
+ points = oom_badness(task, oc->totalpages, flag);
if (points == LONG_MIN || points < oc->chosen_points)
goto next;
@@ -365,14 +370,14 @@ static void select_bad_process(struct oom_control *oc)
{
oc->chosen_points = LONG_MIN;
- if (is_memcg_oom(oc))
- mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+ if (is_memcg_oom(oc) || is_root_oom_protect())
+ mem_cgroup_scan_tasks_update_eoom(oc->memcg, oom_evaluate_task, oc);
else {
struct task_struct *p;
rcu_read_lock();
for_each_process(p)
- if (oom_evaluate_task(p, oc))
+ if (oom_evaluate_task(p, oc, MEMCG_OOM_EVALUATE_NONE))
break;
rcu_read_unlock();
}
diff --git a/mm/page_counter.c b/mm/page_counter.c
index db20d6452b71..cd0cc2c8edfe 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -39,6 +39,19 @@ static void propagate_protected_usage(struct page_counter *c,
if (delta)
atomic_long_add(delta, &c->parent->children_low_usage);
}
+
+ protected = READ_ONCE(c->oom_protect);
+ if (protected == PAGE_COUNTER_MAX + 1)
+ protected = atomic_long_read(&c->children_oom_protect_usage);
+ else
+ protected = min(usage, protected);
+ old_protected = atomic_long_read(&c->oom_protect_usage);
+ if (protected != old_protected) {
+ old_protected = atomic_long_xchg(&c->oom_protect_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_oom_protect_usage);
+ }
}
/**
@@ -234,6 +247,23 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
propagate_protected_usage(c, atomic_long_read(&c->usage));
}
+/**
+ * page_counter_set_oom_protect - set the amount of oom protected memory
+ * @counter: counter
+ * @nr_pages: value to set
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+void page_counter_set_oom_protect(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+
+ WRITE_ONCE(counter->oom_protect, nr_pages);
+
+ for (c = counter; c; c = c->parent)
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
+}
+
/**
* page_counter_memparse - memparse() for page counter limits
* @buf: string to parse
--
2.14.1
Hi chengkaitao,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on tj-cgroup/for-next linus/master v6.3 next-20230505]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/chengkaitao/mm-memcontrol-protect-the-memory-in-cgroup-from-being-oom-killed/20230506-195043
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230506114948.6862-2-chengkaitao%40didiglobal.com
patch subject: [PATCH v3 1/2] mm: memcontrol: protect the memory in cgroup from being oom killed
config: i386-randconfig-a011-20230501 (https://download.01.org/0day-ci/archive/20230506/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/a2779b308166286f77728f04043cb7a17a16dd46
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review chengkaitao/mm-memcontrol-protect-the-memory-in-cgroup-from-being-oom-killed/20230506-195043
git checkout a2779b308166286f77728f04043cb7a17a16dd46
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> mm/page_counter.c:44:36: warning: overflow in expression; result is -2147483648 with type 'long' [-Winteger-overflow]
if (protected == PAGE_COUNTER_MAX + 1)
^
1 warning generated.
--
mm/memcontrol.c:1739:2: error: implicit declaration of function 'seq_buf_do_printk' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
seq_buf_do_printk(&s, KERN_INFO);
^
>> mm/memcontrol.c:6445:37: warning: overflow in expression; result is -2147483648 with type 'long' [-Winteger-overflow]
else if (value == PAGE_COUNTER_MAX + 1)
^
mm/memcontrol.c:6743:34: warning: overflow in expression; result is -2147483648 with type 'long' [-Winteger-overflow]
oom_protect = PAGE_COUNTER_MAX + 1;
^
2 warnings and 1 error generated.
vim +/long +44 mm/page_counter.c
15
16 static void propagate_protected_usage(struct page_counter *c,
17 unsigned long usage)
18 {
19 unsigned long protected, old_protected;
20 long delta;
21
22 if (!c->parent)
23 return;
24
25 protected = min(usage, READ_ONCE(c->min));
26 old_protected = atomic_long_read(&c->min_usage);
27 if (protected != old_protected) {
28 old_protected = atomic_long_xchg(&c->min_usage, protected);
29 delta = protected - old_protected;
30 if (delta)
31 atomic_long_add(delta, &c->parent->children_min_usage);
32 }
33
34 protected = min(usage, READ_ONCE(c->low));
35 old_protected = atomic_long_read(&c->low_usage);
36 if (protected != old_protected) {
37 old_protected = atomic_long_xchg(&c->low_usage, protected);
38 delta = protected - old_protected;
39 if (delta)
40 atomic_long_add(delta, &c->parent->children_low_usage);
41 }
42
43 protected = READ_ONCE(c->oom_protect);
> 44 if (protected == PAGE_COUNTER_MAX + 1)
45 protected = atomic_long_read(&c->children_oom_protect_usage);
46 else
47 protected = min(usage, protected);
48 old_protected = atomic_long_read(&c->oom_protect_usage);
49 if (protected != old_protected) {
50 old_protected = atomic_long_xchg(&c->oom_protect_usage, protected);
51 delta = protected - old_protected;
52 if (delta)
53 atomic_long_add(delta, &c->parent->children_oom_protect_usage);
54 }
55 }
56
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
On Sat 06-05-23 19:49:46, chengkaitao wrote:
> Establish a new OOM score algorithm, supports the cgroup level OOM
> protection mechanism. When an global/memcg oom event occurs, we treat
> all processes in the cgroup as a whole, and OOM killers need to select
> the process to kill based on the protection quota of the cgroup
Although your patch 1 briefly touches on some advantages of this
interface, there is a lack of an actual usecase. Arguing that oom_score_adj
is hard because it needs a parent process is rather weak, to be honest.
It is just trivial to create a thin wrapper, use systemd to launch
important services, or simply update the value after the fact. Now
oom_score_adj has its own downsides of course (most notably its
granularity and a lack of group protection).
That being said, make sure you describe your usecase more thoroughly.
Please also make sure you describe the intended heuristic of the knob.
It is not really clear from the description how this fits hierarchical
behavior of cgroups. I would be especially interested in the semantics
of non-leaf memcgs protection as they do not have any actual processes
to protect.
Also there have been concerns mentioned in v2 discussion and it would be
really appreciated to summarize how you have dealt with them.
Please also note that many people are going to be slow in responding
this week because of LSFMM conference
(https://events.linuxfoundation.org/lsfmm/)
--
Michal Hocko
SUSE Labs
At 2023-05-07 18:11:58, "Michal Hocko" <[email protected]> wrote:
>On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> Establish a new OOM score algorithm, supports the cgroup level OOM
>> protection mechanism. When an global/memcg oom event occurs, we treat
>> all processes in the cgroup as a whole, and OOM killers need to select
>> the process to kill based on the protection quota of the cgroup
>
>Although your patch 1 briefly touches on some advantages of this
>interface there is a lack of actual usecase. Arguing that oom_score_adj
>is hard because it needs a parent process is rather weak to be honest.
>It is just trivial to create a thin wrapper, use systemd to launch
>important services or simply update the value after the fact. Now
>oom_score_adj has its own downsides of course (most notably a
>granularity and a lack of group protection.
>
>That being said, make sure you describe your usecase more thoroughly.
>Please also make sure you describe the intended heuristic of the knob.
>It is not really clear from the description how this fits hierarchical
>behavior of cgroups. I would be especially interested in the semantics
>of non-leaf memcgs protection as they do not have any actual processes
>to protect.
>
>Also there have been concerns mentioned in v2 discussion and it would be
>really appreciated to summarize how you have dealt with them.
>
>Please also note that many people are going to be slow in responding
>this week because of LSFMM conference
>(https://events.linuxfoundation.org/lsfmm/)
Here is a more detailed comparison of the old oom_score_adj mechanism
and the new oom_protect mechanism:
1. The regulating granularity of oom_protect is finer than that of oom_score_adj.
On a 512G physical machine, the minimum granularity adjustable by oom_score_adj
is about 512M, while the minimum granularity adjustable by oom_protect is one page (4K).
2. It may be simple to create a lightweight parent process and uniformly set the
oom_score_adj of some important processes, but it is not simple to make
multi-level settings for tens of thousands of processes on a physical machine
through lightweight parent processes. We would need a huge table to record the
oom_score_adj values maintained by all the lightweight parent processes, and a
user process limited by such a parent process has no ability to change its own
oom_score_adj, because it does not know the details of that huge table. The new
patch adopts the cgroup mechanism instead. It does not need any parent process to
manage oom_score_adj, and the settings of each memcg are independent of each
other, making it easier to plan the OOM order of all processes. Due to the unique
nature of memory resources, service cloud vendors currently do not oversell
memory in their planning. I would like to use the new patch to explore the
possibility of overselling memory resources.
3. I conducted a test in which an excessive number of containers were deployed on
a physical machine. With the oom_score_adj value of all processes in a container
set to a positive number through dockerinit, even processes that occupy very
little memory in the container are easily killed, resulting in a large number of
pointless kills. If dockerinit itself is unfortunately killed, it triggers
container self-healing and the container is rebuilt, resulting in even more
severe memory oscillation. The new patch abandons the behavior of adding an equal
amount of oom_score_adj to each process in the container and instead adopts an
oom_protect quota shared by all processes in the container. If a process in the
container is killed, the remaining processes receive more of the oom_protect
quota, making them harder to kill. In my test case, the new patch reduced the
number of pointless kills by 70%.
4. oom_score_adj is a global configuration and cannot produce a kill order that
affects only a certain memcg OOM killer. The oom_protect mechanism, however, is
inherited downwards: a user can change the kill order only within its own memcg
OOM domain, and the kill order of the parent memcg OOM killer or the global OOM
killer is not affected.
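Point 1 above can be checked with a few lines of arithmetic. The sketch
below assumes a 512 GiB machine with 4 KiB pages; the function name is
illustrative, not a kernel symbol:

```c
#include <assert.h>

/*
 * oom_badness() scales oom_score_adj by totalpages / 1000, so one
 * oom_score_adj unit shifts the score by this many pages. By contrast,
 * oom.protect is accounted in single pages.
 */
static unsigned long adj_unit_pages(unsigned long totalpages)
{
	return totalpages / 1000;
}
```

On 512 GiB (134217728 pages of 4 KiB), one oom_score_adj unit is 134217
pages, roughly 512M, while the oom_protect granularity stays one page.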
In the final discussion of patch v2, we noted that although the adjustment range
of oom_score_adj is [-1000,1000], it essentially only supports two usecases
(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably; everything in between is
clumsy at best. To address this in the new patch, I introduced a new
indicator, oom_kill_inherit, which counts the number of times the local and
child cgroups have been selected by the OOM killer of an ancestor cgroup. By
observing the proportion of oom_kill_inherit in the parent cgroup, the value
of oom_protect can be tuned effectively.
About the semantics of non-leaf memcg protection:
if a non-leaf memcg's oom_protect quota is set, its leaf memcgs will proportionally
calculate their new effective oom_protect quota based on the non-leaf memcg's quota.
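The proportional calculation described above can be sketched as follows (a userspace toy model, not the patch's actual kernel code; the function name and the exact scaling rule are my assumptions based on this description):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the proportional semantics described above: a child's
 * effective protection is its own quota scaled by the ratio of the
 * parent's quota to the sum of all children's quotas. This scales the
 * children down when they collectively over-request the parent's quota,
 * and up when the parent's quota exceeds their sum. Names and integer
 * rounding are illustrative assumptions, not the kernel implementation.
 */
static uint64_t effective_oom_protect(uint64_t child_quota,
				      uint64_t parent_quota,
				      uint64_t children_quota_sum)
{
	/* No child requested protection: nothing to distribute. */
	if (children_quota_sum == 0)
		return 0;

	return child_quota * parent_quota / children_quota_sum;
}
```

For example, with a parent quota of 50M and two children requesting 100M each (sum 200M), each child's effective quota is reduced proportionally to 25M.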
--
Thanks for your comment!
chengkaitao
On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-07 18:11:58, "Michal Hocko" <[email protected]> wrote:
> >On Sat 06-05-23 19:49:46, chengkaitao wrote:
> >> Establish a new OOM score algorithm, supports the cgroup level OOM
> >> protection mechanism. When an global/memcg oom event occurs, we treat
> >> all processes in the cgroup as a whole, and OOM killers need to select
> >> the process to kill based on the protection quota of the cgroup
> >
> >Although your patch 1 briefly touches on some advantages of this
> >interface there is a lack of actual usecase. Arguing that oom_score_adj
> >is hard because it needs a parent process is rather weak to be honest.
> >It is just trivial to create a thin wrapper, use systemd to launch
> >important services or simply update the value after the fact. Now
> >oom_score_adj has its own downsides of course (most notably a
> >granularity and a lack of group protection.
> >
> >That being said, make sure you describe your usecase more thoroughly.
> >Please also make sure you describe the intended heuristic of the knob.
> >It is not really clear from the description how this fits hierarchical
> >behavior of cgroups. I would be especially interested in the semantics
> >of non-leaf memcgs protection as they do not have any actual processes
> >to protect.
> >
> >Also there have been concerns mentioned in v2 discussion and it would be
> >really appreciated to summarize how you have dealt with them.
> >
> >Please also note that many people are going to be slow in responding
> >this week because of LSFMM conference
> >(https://events.linuxfoundation.org/lsfmm/)
>
> Here is a more detailed comparison and introduction of the old oom_score_adj
> mechanism and the new oom_protect mechanism,
> 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
> On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
> is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
> 2. It may be simple to create a lightweight parent process and uniformly set the
> oom_score_adj of some important processes, but it is not a simple matter to make
> multi-level settings for tens of thousands of processes on the physical machine
> through the lightweight parent processes. We may need a huge table to record the
> value of oom_score_adj maintained by all lightweight parent processes, and the
> user process limited by the parent process has no ability to change its own
> oom_score_adj, because it does not know the details of the huge table. The new
> patch adopts the cgroup mechanism. It does not need any parent process to manage
> oom_score_adj. the settings between each memcg are independent of each other,
> making it easier to plan the OOM order of all processes. Due to the unique nature
> of memory resources, current Service cloud vendors are not oversold in memory
> planning. I would like to use the new patch to try to achieve the possibility of
> oversold memory resources.
OK, this is more specific about the usecase. Thanks! So essentially what
it boils down to is that you are handling many containers (memcgs from
our POV) and they have different priorities. You want to overcommit the
memory to the extent that global ooms are not an unexpected event. Once
that happens the total memory consumption of a specific memcg is less
important than its "priority". You define that priority by the excess of
the memory usage above a user defined threshold. Correct?
Your cover letter mentions treating "all processes in the cgroup as a
whole". That to me reads as the oom.group oom killer policy. But a brief
look into the patch suggests you are still looking at specific tasks, and
this has been a concern in the previous version of the patch because
memcg accounting and per-process accounting are detached.
> 3. I conducted a test and deployed an excessive number of containers on a physical
> machine, By setting the oom_score_adj value of all processes in the container to
> a positive number through dockerinit, even processes that occupy very little memory
> in the container are easily killed, resulting in a large number of invalid kill behaviors.
> If dockerinit is also killed unfortunately, it will trigger container self-healing, and the
> container will rebuild, resulting in more severe memory oscillations. The new patch
> abandons the behavior of adding an equal amount of oom_score_adj to each process
> in the container and adopts a shared oom_protect quota for all processes in the container.
> If a process in the container is killed, the remaining other processes will receive more
> oom_protect quota, making it more difficult for the remaining processes to be killed.
> In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
> 4. oom_score_adj is a global configuration that cannot achieve a kill order that only
> affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits
> downwards, and user can only change the kill order of its own memcg oom, but the
> kill order of their parent memcg-oom-killer or global-oom-killer will not be affected
Yes oom_score_adj has shortcomings.
> In the final discussion of patch v2, we discussed that although the adjustment range
> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
> clumsy at best. In order to solve this problem in the new patch, I introduced a new
> indicator oom_kill_inherit, which counts the number of times the local and child
> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
> value of oom_protect to achieve the best.
What does the best mean in this context?
> about the semantics of non-leaf memcgs protection,
> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
So the non-leaf memcg is never used as a target? What if the workload is
distributed over several sub-groups? Our current oom.group
implementation traverses the tree to find a common ancestor in the oom
domain with the oom.group.
All that being said, and with the usecase described more specifically, I
can see that memcg based oom victim selection makes some sense. That
means that it is always a memcg selected and all tasks within it killed.
Memcg based protection can be used to evaluate which memcg to choose and
the overall scheme should still be manageable. It would indeed resemble
memory protection for the regular reclaim.
One thing that is still not really clear to me is how group vs.
non-group ooms could be handled gracefully. Right now we can handle that
because the oom selection is still process based, but with the protection
this will become more problematic as explained previously. Essentially
we would need to enforce the oom selection to be memcg based for all
memcgs. Maybe a mount knob? What do you think?
--
Michal Hocko
SUSE Labs
At 2023-05-08 22:18:18, "Michal Hocko" <[email protected]> wrote:
>On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-07 18:11:58, "Michal Hocko" <[email protected]> wrote:
>> >On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> >
>> >That being said, make sure you describe your usecase more thoroughly.
>> >Please also make sure you describe the intended heuristic of the knob.
>> >It is not really clear from the description how this fits hierarchical
>> >behavior of cgroups. I would be especially interested in the semantics
>> >of non-leaf memcgs protection as they do not have any actual processes
>> >to protect.
>> >
>> >Also there have been concerns mentioned in v2 discussion and it would be
>> >really appreciated to summarize how you have dealt with them.
>> >
>> >Please also note that many people are going to be slow in responding
>> >this week because of LSFMM conference
>> >(https://events.linuxfoundation.org/lsfmm/)
>>
>> Here is a more detailed comparison and introduction of the old oom_score_adj
>> mechanism and the new oom_protect mechanism,
>> 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
>> On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
>> is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
>> 2. It may be simple to create a lightweight parent process and uniformly set the
>> oom_score_adj of some important processes, but it is not a simple matter to make
>> multi-level settings for tens of thousands of processes on the physical machine
>> through the lightweight parent processes. We may need a huge table to record the
>> value of oom_score_adj maintained by all lightweight parent processes, and the
>> user process limited by the parent process has no ability to change its own
>> oom_score_adj, because it does not know the details of the huge table. The new
>> patch adopts the cgroup mechanism. It does not need any parent process to manage
>> oom_score_adj. the settings between each memcg are independent of each other,
>> making it easier to plan the OOM order of all processes. Due to the unique nature
>> of memory resources, current Service cloud vendors are not oversold in memory
>> planning. I would like to use the new patch to try to achieve the possibility of
>> oversold memory resources.
>
>OK, this is more specific about the usecase. Thanks! So essentially what
>it boils down to is that you are handling many containers (memcgs from
>our POV) and they have different priorities. You want to overcommit the
>memory to the extend that global ooms are not an unexpected event. Once
>that happens the total memory consumption of a specific memcg is less
>important than its "priority". You define that priority by the excess of
>the memory usage above a user defined threshold. Correct?
That's correct.
>Your cover letter mentions that then "all processes in the cgroup as a
>whole". That to me reads as oom.group oom killer policy. But a brief
>look into the patch suggests you are still looking at specific tasks and
>this has been a concern in the previous version of the patch because
>memcg accounting and per-process accounting are detached.
I think memcg accounting may be more reasonable, as its memory
statistics are more comprehensive; for example, active page cache is
counted and also increases the probability of an OOM kill. In the new
patch, all shared memory also consumes the oom_protect quota of the
memcg, so each process's share of the memcg's oom_protect quota decreases.
>> 3. I conducted a test and deployed an excessive number of containers on a physical
>> machine, By setting the oom_score_adj value of all processes in the container to
>> a positive number through dockerinit, even processes that occupy very little memory
>> in the container are easily killed, resulting in a large number of invalid kill behaviors.
>> If dockerinit is also killed unfortunately, it will trigger container self-healing, and the
>> container will rebuild, resulting in more severe memory oscillations. The new patch
>> abandons the behavior of adding an equal amount of oom_score_adj to each process
>> in the container and adopts a shared oom_protect quota for all processes in the container.
>> If a process in the container is killed, the remaining other processes will receive more
>> oom_protect quota, making it more difficult for the remaining processes to be killed.
>> In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
>> 4. oom_score_adj is a global configuration that cannot achieve a kill order that only
>> affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits
>> downwards, and user can only change the kill order of its own memcg oom, but the
>> kill order of their parent memcg-oom-killer or global-oom-killer will not be affected
>
>Yes oom_score_adj has shortcomings.
>
>> In the final discussion of patch v2, we discussed that although the adjustment range
>> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
>> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
>> clumsy at best. In order to solve this problem in the new patch, I introduced a new
>> indicator oom_kill_inherit, which counts the number of times the local and child
>> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
>> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
>> value of oom_protect to achieve the best.
>
>What does the best mean in this context?
I have created a new indicator, oom_kill_inherit, that maintains a negative
correlation with memory.oom.protect, so we have a ruler to measure the
optimal value of memory.oom.protect.
>> about the semantics of non-leaf memcgs protection,
>> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
>> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>
>So the non-leaf memcg is never used as a target? What if the workload is
>distributed over several sub-groups? Our current oom.group
>implementation traverses the tree to find a common ancestor in the oom
>domain with the oom.group.
If the oom_protect quota of the parent non-leaf memcg is less than the sum of
its sub-groups' oom_protect quotas, the oom_protect quota of each sub-group will
be proportionally reduced.
If the oom_protect quota of the parent non-leaf memcg is greater than the sum
of its sub-groups' oom_protect quotas, the oom_protect quota of each sub-group
will be proportionally increased.
The purpose of this is that users can set oom_protect quotas according to
their own needs, while the system management process can set an appropriate
oom_protect quota on the parent non-leaf memcg as a final cover, so that
the system management process can indirectly manage all user processes.
>All that being said and with the usecase described more specifically. I
>can see that memcg based oom victim selection makes some sense. That
>menas that it is always a memcg selected and all tasks withing killed.
>Memcg based protection can be used to evaluate which memcg to choose and
>the overall scheme should be still manageable. It would indeed resemble
>memory protection for the regular reclaim.
>
>One thing that is still not really clear to me is to how group vs.
>non-group ooms could be handled gracefully. Right now we can handle that
>because the oom selection is still process based but with the protection
>this will become more problematic as explained previously. Essentially
>we would need to enforce the oom selection to be memcg based for all
>memcgs. Maybe a mount knob? What do you think?
There is a function in the patch to determine whether the oom_protect
mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
<is_root_oom_protect> returns false by default. The oom_protect mechanism only
takes effect when "root_mem_cgroup->memory.children_oom_protect_usage"
is not 0, and only memcgs with the memory.oom.protect node set are affected.
+bool is_root_oom_protect(void)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}
Please let me know if there are any problems with my understanding.
--
Thanks for your comment!
chengkaitao
[Sorry for a late reply but I was mostly offline last 2 weeks]
On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-08 22:18:18, "Michal Hocko" <[email protected]> wrote:
[...]
> >Your cover letter mentions that then "all processes in the cgroup as a
> >whole". That to me reads as oom.group oom killer policy. But a brief
> >look into the patch suggests you are still looking at specific tasks and
> >this has been a concern in the previous version of the patch because
> >memcg accounting and per-process accounting are detached.
>
> I think the memcg accounting may be more reasonable, as its memory
> statistics are more comprehensive, similar to active page cache, which
> also increases the probability of OOM-kill. In the new patch, all the
> shared memory will also consume the oom_protect quota of the memcg,
> and the process's oom_protect quota of the memcg will decrease.
I am sorry but I do not follow. Could you elaborate please? Are you
arguing for per memcg or per process metrics?
[...]
> >> In the final discussion of patch v2, we discussed that although the adjustment range
> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new
> >> indicator oom_kill_inherit, which counts the number of times the local and child
> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
> >> value of oom_protect to achieve the best.
> >
> >What does the best mean in this context?
>
> I have created a new indicator oom_kill_inherit that maintains a negative correlation
> with memory.oom.protect, so we have a ruler to measure the optimal value of
> memory.oom.protect.
An example might help here.
> >> about the semantics of non-leaf memcgs protection,
> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >
> >So the non-leaf memcg is never used as a target? What if the workload is
> >distributed over several sub-groups? Our current oom.group
> >implementation traverses the tree to find a common ancestor in the oom
> >domain with the oom.group.
>
> If the oom_protect quota of the parent non-leaf memcg is less than the sum of
> sub-groups oom_protect quota, the oom_protect quota of each sub-group will
> be proportionally reduced
> If the oom_protect quota of the parent non-leaf memcg is greater than the sum
> of sub-groups oom_protect quota, the oom_protect quota of each sub-group
> will be proportionally increased
> The purpose of doing so is that users can set oom_protect quota according to
> their own needs, and the system management process can set appropriate
> oom_protect quota on the parent non-leaf memcg as the final cover, so that
> the system management process can indirectly manage all user processes.
I guess that you are trying to say that the oom protection has a
standard hierarchical behavior. And that is fine; in fact it is
mandatory for any control knob to have sane hierarchical properties.
But that doesn't address my above question. Let me try again. When is a
non-leaf memcg potentially selected as the oom victim? It doesn't have
any tasks directly but it might be a suitable target to kill a multi-memcg
based workload (e.g. a full container).
> >All that being said and with the usecase described more specifically. I
> >can see that memcg based oom victim selection makes some sense. That
> >menas that it is always a memcg selected and all tasks withing killed.
> >Memcg based protection can be used to evaluate which memcg to choose and
> >the overall scheme should be still manageable. It would indeed resemble
> >memory protection for the regular reclaim.
> >
> >One thing that is still not really clear to me is to how group vs.
> >non-group ooms could be handled gracefully. Right now we can handle that
> >because the oom selection is still process based but with the protection
> >this will become more problematic as explained previously. Essentially
> >we would need to enforce the oom selection to be memcg based for all
> >memcgs. Maybe a mount knob? What do you think?
>
> There is a function in the patch to determine whether the oom_protect
> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
> <is_root_oom_protect> returns 0 by default.
How can an admin determine what is the current oom detection logic?
--
Michal Hocko
SUSE Labs
At 2023-05-22 21:03:50, "Michal Hocko" <[email protected]> wrote:
>[Sorry for a late reply but I was mostly offline last 2 weeks]
>
>On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-08 22:18:18, "Michal Hocko" <[email protected]> wrote:
>[...]
>> >Your cover letter mentions that then "all processes in the cgroup as a
>> >whole". That to me reads as oom.group oom killer policy. But a brief
>> >look into the patch suggests you are still looking at specific tasks and
>> >this has been a concern in the previous version of the patch because
>> >memcg accounting and per-process accounting are detached.
>>
>> I think the memcg accounting may be more reasonable, as its memory
>> statistics are more comprehensive, similar to active page cache, which
>> also increases the probability of OOM-kill. In the new patch, all the
>> shared memory will also consume the oom_protect quota of the memcg,
>> and the process's oom_protect quota of the memcg will decrease.
>
>I am sorry but I do not follow. Could you elaborate please? Are you
>arguing for per memcg or per process metrics?
You mentioned earlier that 'memcg accounting and per-process accounting
are detached', and I may have misunderstood your question. What I wanted
to express above is that memcg accounting is more comprehensive than
per-process accounting, and that using memcg accounting in the OOM-killer
mechanism would be more reasonable.
>[...]
>
>> >> In the final discussion of patch v2, we discussed that although the adjustment range
>> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
>> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
>> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new
>> >> indicator oom_kill_inherit, which counts the number of times the local and child
>> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
>> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
>> >> value of oom_protect to achieve the best.
>> >
>> >What does the best mean in this context?
>>
>> I have created a new indicator oom_kill_inherit that maintains a negative correlation
>> with memory.oom.protect, so we have a ruler to measure the optimal value of
>> memory.oom.protect.
>
>An example might help here.
In my test case, by adjusting memory.oom.protect, I was able to significantly
reduce the oom_kill_inherit of the corresponding cgroup. On a physical machine
with severely oversold memory, I divided all cgroups into three categories and
controlled their probability of being selected by the oom-killer to 0%, 20%,
and 80%, respectively.
>> >> about the semantics of non-leaf memcgs protection,
>> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
>> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>> >
>> >So the non-leaf memcg is never used as a target? What if the workload is
>> >distributed over several sub-groups? Our current oom.group
>> >implementation traverses the tree to find a common ancestor in the oom
>> >domain with the oom.group.
>>
>> If the oom_protect quota of the parent non-leaf memcg is less than the sum of
>> sub-groups oom_protect quota, the oom_protect quota of each sub-group will
>> be proportionally reduced
>> If the oom_protect quota of the parent non-leaf memcg is greater than the sum
>> of sub-groups oom_protect quota, the oom_protect quota of each sub-group
>> will be proportionally increased
>> The purpose of doing so is that users can set oom_protect quota according to
>> their own needs, and the system management process can set appropriate
>> oom_protect quota on the parent non-leaf memcg as the final cover, so that
>> the system management process can indirectly manage all user processes.
>
>I guess that you are trying to say that the oom protection has a
>standard hierarchical behavior. And that is fine, well, in fact it is
>mandatory for any control knob to have a sane hierarchical properties.
>But that doesn't address my above question. Let me try again. When is a
>non-leaf memcg potentially selected as the oom victim? It doesn't have
>any tasks directly but it might be a suitable target to kill a multi
>memcg based workload (e.g. a full container).
If a non-leaf memcg has higher memory usage and a smaller
memory.oom.protect, it will have a higher probability of being
selected by the killer. If a non-leaf memcg is selected as the oom
victim, the OOM killer will continue to select an appropriate child
memcg downwards until a leaf memcg is selected.
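The downward selection described above could look roughly like the following userspace sketch (a toy model; the struct layout and the excess-based score are my assumptions drawn from this thread, not the patch's actual code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the downward traversal described above: starting at the
 * chosen memcg, repeatedly descend into the child with the largest
 * unprotected excess (usage - effective oom.protect) until a leaf is
 * reached. Field names and the scoring rule are illustrative
 * assumptions, not the kernel implementation.
 */
struct memcg {
	int64_t usage;		/* charged memory */
	int64_t oom_protect;	/* effective protection quota */
	struct memcg *children;
	size_t nr_children;
};

static struct memcg *select_victim(struct memcg *mc)
{
	while (mc->nr_children) {
		struct memcg *best = NULL;
		int64_t best_excess = INT64_MIN;
		size_t i;

		/* Pick the child with the largest unprotected excess. */
		for (i = 0; i < mc->nr_children; i++) {
			struct memcg *c = &mc->children[i];
			int64_t excess = c->usage - c->oom_protect;

			if (excess > best_excess) {
				best_excess = excess;
				best = c;
			}
		}
		mc = best;	/* descend one level toward a leaf */
	}
	return mc;
}
```

A leaf memcg (no children) is returned as-is; a well-protected sibling (usage below its quota) has a negative excess and is only chosen if no better candidate exists.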
>> >All that being said and with the usecase described more specifically. I
>> >can see that memcg based oom victim selection makes some sense. That
>> >menas that it is always a memcg selected and all tasks withing killed.
>> >Memcg based protection can be used to evaluate which memcg to choose and
>> >the overall scheme should be still manageable. It would indeed resemble
>> >memory protection for the regular reclaim.
>> >
>> >One thing that is still not really clear to me is to how group vs.
>> >non-group ooms could be handled gracefully. Right now we can handle that
>> >because the oom selection is still process based but with the protection
>> >this will become more problematic as explained previously. Essentially
>> >we would need to enforce the oom selection to be memcg based for all
>> >memcgs. Maybe a mount knob? What do you think?
>>
>> There is a function in the patch to determine whether the oom_protect
>> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
>> <is_root_oom_protect> returns 0 by default.
>
>How can an admin determine what is the current oom detection logic?
The memory.oom.protect values are set by the administrators themselves, so they
must know what the current OOM policy is. Reading the memory.oom.protect
of the first-level cgroup directories and observing whether it is 0 can also
determine whether the oom.protect policy is enabled.
For a given process, the physical machine administrator, k8s administrator,
agent administrator, and container administrator each see a different effective
memory.oom.protect for the process, so they only need to pay attention
to the memory.oom.protect of their local cgroup directory. If an administrator
wants to know the OOM detection logic of all other administrators, I don't think
there is such a business requirement.
--
Thanks for your comment!
Chengkaitao
On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-22 21:03:50, "Michal Hocko" <[email protected]> wrote:
[...]
> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation
> >> with memory.oom.protect, so we have a ruler to measure the optimal value of
> >> memory.oom.protect.
> >
> >An example might help here.
>
> In my testing case, by adjusting memory.oom.protect, I was able to significantly
> reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine
> with severely oversold memory, I divided all cgroups into three categories and
> controlled their probability of being selected by the oom-killer to 0%,% 20,
> and 80%, respectively.
I might be just dense but I am lost. Can we focus on the barebones
semantics of the group oom selection and killing first? No magic
auto-tuning at this stage, please.
> >> >> about the semantics of non-leaf memcgs protection,
> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
> >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >> >
> >> >So the non-leaf memcg is never used as a target? What if the workload is
> >> >distributed over several sub-groups? Our current oom.group
> >> >implementation traverses the tree to find a common ancestor in the oom
> >> >domain with the oom.group.
> >>
> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of
> >> sub-groups oom_protect quota, the oom_protect quota of each sub-group will
> >> be proportionally reduced
> >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum
> >> of sub-groups oom_protect quota, the oom_protect quota of each sub-group
> >> will be proportionally increased
> >> The purpose of doing so is that users can set oom_protect quota according to
> >> their own needs, and the system management process can set appropriate
> >> oom_protect quota on the parent non-leaf memcg as the final cover, so that
> >> the system management process can indirectly manage all user processes.
> >
> >I guess that you are trying to say that the oom protection has a
> >standard hierarchical behavior. And that is fine, well, in fact it is
> >mandatory for any control knob to have a sane hierarchical properties.
> >But that doesn't address my above question. Let me try again. When is a
> >non-leaf memcg potentially selected as the oom victim? It doesn't have
> >any tasks directly but it might be a suitable target to kill a multi
> >memcg based workload (e.g. a full container).
>
> If nonleaf memcg have the higher memory usage and the smaller
> memory.oom.protect, it will have the higher the probability being
> selected by the killer. If the non-leaf memcg is selected as the oom
> victim, OOM-killer will continue to select the appropriate child
> memcg downwards until the leaf memcg is selected.
A parent memcg has at least as much memory charged as its child(ren) by
definition. Let me try to ask differently. Say you have the following
hierarchy
                  root
                 /    \
      container_A      container_B
    (oom.prot=100M)    (oom.prot=200M)
    (usage=120M)       (usage=180M)
      /    |    \
     A     B     C
                / \
              C1   C2
container_B is protected so it should be excluded. Correct? So we are at
container_A to choose from. There are multiple outcomes the system and
container admins might want to achieve:
1) system admin might want to shut down the whole container.
2) container admin might want to shut the whole container down
3) cont. admin might want to shut down a whole sub-group (e.g. C, as it
is a self-contained workload and killing a portion of it would put it into
an inconsistent state).
4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a
leaf memcg).
5) admin might want to kill a process in the most excess memcg.
Now we already have oom.group thingy that can drive the group killing
policy but it is not really clear how you want to incorporate that to
the protection.
Again, I think that an oom.protection makes sense but the semantic has
to be very carefully thought through because it is quite easy to create
corner cases and weird behavior. I also think that oom.group has to be
consistent with the protection.
> >> >All that being said and with the usecase described more specifically. I
> >> >can see that memcg based oom victim selection makes some sense. That
> >> >menas that it is always a memcg selected and all tasks withing killed.
> >> >Memcg based protection can be used to evaluate which memcg to choose and
> >> >the overall scheme should be still manageable. It would indeed resemble
> >> >memory protection for the regular reclaim.
> >> >
> >> >One thing that is still not really clear to me is to how group vs.
> >> >non-group ooms could be handled gracefully. Right now we can handle that
> >> >because the oom selection is still process based but with the protection
> >> >this will become more problematic as explained previously. Essentially
> >> >we would need to enforce the oom selection to be memcg based for all
> >> >memcgs. Maybe a mount knob? What do you think?
> >>
> >> There is a function in the patch to determine whether the oom_protect
> >> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
> >> <is_root_oom_protect> returns 0 by default.
> >
> >How can an admin determine what is the current oom detection logic?
>
> The memory.oom.protect values are set by administrators themselves, so they
> must know what the current OOM policy is. Reading the memory.oom.protect
> of the first-level cgroup directory and observing whether it is 0 can also
> determine whether the oom.protect policy is enabled.
How do you achieve that from within a container which doesn't have
full visibility to the cgroup tree?
--
Michal Hocko
SUSE Labs
At 2023-05-29 22:02:47, "Michal Hocko" <[email protected]> wrote:
>On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-22 21:03:50, "Michal Hocko" <[email protected]> wrote:
>[...]
>> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation
>> >> with memory.oom.protect, so we have a ruler to measure the optimal value of
>> >> memory.oom.protect.
>> >
>> >An example might help here.
>>
>> In my testing case, by adjusting memory.oom.protect, I was able to significantly
>> reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine
>> with severely oversold memory, I divided all cgroups into three categories and
>> controlled their probability of being selected by the oom-killer to 0%, 20%,
>> and 80%, respectively.
>
>I might be just dense but I am lost. Can we focus on the barebone
>semantic of the group oom selection and killing first. No magic
>auto-tuning at this stage please.
>
>> >> >> about the semantics of non-leaf memcgs protection,
>> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
>> >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>> >> >
>> >> >So the non-leaf memcg is never used as a target? What if the workload is
>> >> >distributed over several sub-groups? Our current oom.group
>> >> >implementation traverses the tree to find a common ancestor in the oom
>> >> >domain with the oom.group.
>> >>
>> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of
>> >> the sub-groups' oom_protect quota, the oom_protect quota of each sub-group will
>> >> be proportionally reduced.
>> >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum
>> >> of the sub-groups' oom_protect quota, the oom_protect quota of each sub-group
>> >> will be proportionally increased.
>> >> The purpose of doing so is that users can set oom_protect quota according to
>> >> their own needs, and the system management process can set appropriate
>> >> oom_protect quota on the parent non-leaf memcg as the final cover, so that
>> >> the system management process can indirectly manage all user processes.
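The proportional rescaling described in the quoted explanation can be sketched roughly as follows. This is an illustration only; the function and parameter names are mine, not the patch's actual code:

```c
/*
 * Illustrative sketch (not the patch's implementation): when a parent's
 * oom_protect quota differs from the sum of its children's quotas, each
 * child's effective quota is scaled by parent_quota / children_sum, so
 * both the "reduced" and "increased" cases fall out of one formula.
 */
static unsigned long effective_oom_protect(unsigned long child_quota,
					   unsigned long children_sum,
					   unsigned long parent_quota)
{
	if (children_sum == 0)
		return 0;
	/* widen to 64-bit before multiplying to avoid overflow */
	return (unsigned long)((unsigned long long)child_quota *
			       parent_quota / children_sum);
}
```

For example, a child asking for 100M under a parent whose children request 300M in total but whose own quota is only 150M would end up with an effective 50M.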
>> >
>> >I guess that you are trying to say that the oom protection has a
>> >standard hierarchical behavior. And that is fine, well, in fact it is
>> >mandatory for any control knob to have a sane hierarchical properties.
>> >But that doesn't address my above question. Let me try again. When is a
>> >non-leaf memcg potentially selected as the oom victim? It doesn't have
>> >any tasks directly but it might be a suitable target to kill a multi
>> >memcg based workload (e.g. a full container).
>>
>> If a non-leaf memcg has higher memory usage and a smaller
>> memory.oom.protect, it will have a higher probability of being
>> selected by the killer. If the non-leaf memcg is selected as the oom
>> victim, the OOM killer will continue to select the appropriate child
>> memcg downwards until a leaf memcg is selected.
>
>Parent memcg has at least as much memory charged as its child(ren) by
>definition. Let me try to ask differently. Say you have the following
>hierarchy
>
>             root
>            /    \
>   container_A        container_B
>   (oom.prot=100M)    (oom.prot=200M)
>   (usage=120M)       (usage=180M)
>    /  |  \
>   A   B   C
>          / \
>        C1   C2
>
>
>container_B is protected so it should be excluded. Correct? So we are at
>container_A to choose from. There are multiple outcomes the system and
>container admin might want to achieve.
>1) system admin might want to shut down the whole container.
>2) container admin might want to shut the whole container down
>3) cont. admin might want to shut down a whole sub group (e.g. C as it
> is a self contained workload and killing portion of it will put it into
> inconsistent state).
>4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a
> leaf memcg).
>5) admin might want to kill a process in the most excess memcg.
>
>Now we already have oom.group thingy that can drive the group killing
>policy but it is not really clear how you want to incorporate that to
>the protection.
>
>Again, I think that an oom.protection makes sense but the semantic has
>to be very carefully thought through because it is quite easy to create
>corner cases and weird behavior. I also think that oom.group has to be
>consistent with the protection.
The barebone semantics of the function implemented by my patch are
summarized as follows:
a memcg only allows processes in the memcg to be selected by their
ancestor's OOM killer when its memory usage exceeds "oom.protect".
It should be noted that "oom.protect" and "oom.group" are completely
different things, and kneading them together may make the explanation
more confusing.
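As a rough sketch of that barebone semantic (illustrative names only, not the patch's actual data structures), the eligibility check amounts to:

```c
#include <stdbool.h>

/*
 * Illustrative model: a memcg's tasks become eligible for an ancestor's
 * OOM killer only once the memcg's usage exceeds its effective
 * oom.protect quota. Field names are hypothetical.
 */
struct memcg_sketch {
	unsigned long usage;	/* current charged memory */
	unsigned long eprotect;	/* effective oom.protect quota */
};

static bool oom_eligible(const struct memcg_sketch *mcg)
{
	/* tasks stay protected while usage is within the quota */
	return mcg->usage > mcg->eprotect;
}
```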
>> >> >All that being said and with the usecase described more specifically. I
>> >> >can see that memcg based oom victim selection makes some sense. That
>> >> >means that it is always a memcg selected and all tasks within it killed.
>> >> >Memcg based protection can be used to evaluate which memcg to choose and
>> >> >the overall scheme should be still manageable. It would indeed resemble
>> >> >memory protection for the regular reclaim.
>> >> >
>> >> >One thing that is still not really clear to me is how group vs.
>> >> >non-group ooms could be handled gracefully. Right now we can handle that
>> >> >because the oom selection is still process based but with the protection
>> >> >this will become more problematic as explained previously. Essentially
>> >> >we would need to enforce the oom selection to be memcg based for all
>> >> >memcgs. Maybe a mount knob? What do you think?
>> >>
>> >> There is a function in the patch to determine whether the oom_protect
>> >> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
>> >> <is_root_oom_protect> returns 0 by default.
>> >
>> >How can an admin determine what is the current oom detection logic?
>>
>> The memory.oom.protect values are set by administrators themselves, so they
>> must know what the current OOM policy is. Reading the memory.oom.protect
>> of the first-level cgroup directory and observing whether it is 0 can also
>> determine whether the oom.protect policy is enabled.
>
>How do you achieve that from within a container which doesn't have a
>full visibility to the cgroup tree?
The container does not need full visibility into the cgroup tree;
"oom.protect" only requires it to focus on its local cgroup and subtree.
--
Thanks for your comment!
chengkaitao
On Mon, May 8, 2023 at 7:18 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
> > At 2023-05-07 18:11:58, "Michal Hocko" <[email protected]> wrote:
> > >On Sat 06-05-23 19:49:46, chengkaitao wrote:
> > >> Establish a new OOM score algorithm that supports a cgroup-level OOM
> > >> protection mechanism. When a global/memcg oom event occurs, we treat
> > >> all processes in the cgroup as a whole, and OOM killers need to select
> > >> the process to kill based on the protection quota of the cgroup.
> > >
> > >Although your patch 1 briefly touches on some advantages of this
> > >interface there is a lack of actual usecase. Arguing that oom_score_adj
> > >is hard because it needs a parent process is rather weak to be honest.
> > >It is just trivial to create a thin wrapper, use systemd to launch
> > >important services or simply update the value after the fact. Now
> > >oom_score_adj has its own downsides of course (most notably a
> > >granularity and a lack of group protection).
> > >
> > >That being said, make sure you describe your usecase more thoroughly.
> > >Please also make sure you describe the intended heuristic of the knob.
> > >It is not really clear from the description how this fits hierarchical
> > >behavior of cgroups. I would be especially interested in the semantics
> > >of non-leaf memcgs protection as they do not have any actual processes
> > >to protect.
> > >
> > >Also there have been concerns mentioned in v2 discussion and it would be
> > >really appreciated to summarize how you have dealt with them.
> > >
> > >Please also note that many people are going to be slow in responding
> > >this week because of LSFMM conference
> > >(https://events.linuxfoundation.org/lsfmm/)
> >
> > Here is a more detailed comparison and introduction of the old oom_score_adj
> > mechanism and the new oom_protect mechanism,
> > 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
> > On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
> > is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
> > 2. It may be simple to create a lightweight parent process and uniformly set the
> > oom_score_adj of some important processes, but it is not a simple matter to make
> > multi-level settings for tens of thousands of processes on the physical machine
> > through the lightweight parent processes. We may need a huge table to record the
> > value of oom_score_adj maintained by all lightweight parent processes, and the
> > user process limited by the parent process has no ability to change its own
> > oom_score_adj, because it does not know the details of the huge table. The new
> > patch adopts the cgroup mechanism. It does not need any parent process to manage
> > oom_score_adj; the settings of each memcg are independent of each other,
> > making it easier to plan the OOM order of all processes. Due to the unique nature
> > of memory resources, current cloud service vendors do not oversell memory in their
> > planning. I would like to use the new patch to try to make overselling memory
> > resources possible.
>
> OK, this is more specific about the usecase. Thanks! So essentially what
> it boils down to is that you are handling many containers (memcgs from
> our POV) and they have different priorities. You want to overcommit the
> memory to the extent that global ooms are not an unexpected event. Once
> that happens the total memory consumption of a specific memcg is less
> important than its "priority". You define that priority by the excess of
> the memory usage above a user defined threshold. Correct?
There has been a parallel discussion in the cover letter thread of v4
[1]. To summarize, at Google, we have been using OOM scores to
describe different job priorities in a more explicit way -- regardless
of memory usage. It is strictly priority-based OOM killing. Ties are
broken based on memory usage.
We understand that something like memory.oom.protect has an advantage
in the sense that you can skip killing a process if you know that it
won't free enough memory anyway, but for an environment where multiple
jobs of different priorities are running, we find it crucial to be
able to define strict ordering. Some jobs are simply more important
than others, regardless of their memory usage.
It would be great if we can arrive at an interface that serves this
use case as well.
Thanks!
[1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/
>
> Your cover letter mentions that then "all processes in the cgroup as a
> whole". That to me reads as oom.group oom killer policy. But a brief
> look into the patch suggests you are still looking at specific tasks and
> this has been a concern in the previous version of the patch because
> memcg accounting and per-process accounting are detached.
>
> > 3. I conducted a test and deployed an excessive number of containers on a physical
> > machine, By setting the oom_score_adj value of all processes in the container to
> > a positive number through dockerinit, even processes that occupy very little memory
> > in the container are easily killed, resulting in a large number of invalid kill behaviors.
> > If dockerinit is also killed unfortunately, it will trigger container self-healing, and the
> > container will rebuild, resulting in more severe memory oscillations. The new patch
> > abandons the behavior of adding an equal amount of oom_score_adj to each process
> > in the container and adopts a shared oom_protect quota for all processes in the container.
> > If a process in the container is killed, the remaining other processes will receive more
> > oom_protect quota, making it more difficult for the remaining processes to be killed.
> > In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
> > 4. oom_score_adj is a global configuration that cannot achieve a kill order that only
> > affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits
> > downwards, and a user can only change the kill order of their own memcg oom; the
> > kill order of their parent memcg-oom-killer or global-oom-killer will not be affected.
>
> Yes oom_score_adj has shortcomings.
>
> > In the final discussion of patch v2, we discussed that although the adjustment range
> > of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> > (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
> > clumsy at best. In order to solve this problem in the new patch, I introduced a new
> > indicator oom_kill_inherit, which counts the number of times the local and child
> > cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
> > the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
> > value of oom_protect to achieve the best.
>
> What does the best mean in this context?
>
> > about the semantics of non-leaf memcgs protection,
> > If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
> > calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>
> So the non-leaf memcg is never used as a target? What if the workload is
> distributed over several sub-groups? Our current oom.group
> implementation traverses the tree to find a common ancestor in the oom
> domain with the oom.group.
>
> All that being said and with the usecase described more specifically. I
> can see that memcg based oom victim selection makes some sense. That
> means that it is always a memcg selected and all tasks within it killed.
> Memcg based protection can be used to evaluate which memcg to choose and
> the overall scheme should be still manageable. It would indeed resemble
> memory protection for the regular reclaim.
>
> One thing that is still not really clear to me is how group vs.
> non-group ooms could be handled gracefully. Right now we can handle that
> because the oom selection is still process based but with the protection
> this will become more problematic as explained previously. Essentially
> we would need to enforce the oom selection to be memcg based for all
> memcgs. Maybe a mount knob? What do you think?
> --
> Michal Hocko
> SUSE Labs
+David Rientjes
On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <[email protected]> wrote:
>
> On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> [...]
> > There has been a parallel discussion in the cover letter thread of v4
> > [1]. To summarize, at Google, we have been using OOM scores to
> > describe different job priorities in a more explicit way -- regardless
> > of memory usage. It is strictly priority-based OOM killing. Ties are
> > broken based on memory usage.
> >
> > We understand that something like memory.oom.protect has an advantage
> > in the sense that you can skip killing a process if you know that it
> > won't free enough memory anyway, but for an environment where multiple
> > jobs of different priorities are running, we find it crucial to be
> > able to define strict ordering. Some jobs are simply more important
> > than others, regardless of their memory usage.
>
> I do remember that discussion. I am not a great fan of simple priority
> based interfaces TBH. It sounds as an easy interface but it hits
> complications as soon as you try to define a proper/sensible
> hierarchical semantic. I can see how they might work on leaf memcgs with
> statically assigned priorities but that sounds like a very narrow
> usecase IMHO.
Do you mind elaborating on the problem with the hierarchical semantics?
The way it works with our internal implementation is (imo) sensible
and straightforward from a hierarchy POV. Starting at the OOM memcg
(which can be root), we recursively compare the OOM scores of the
children memcgs and pick the one with the lowest score, until we
arrive at a leaf memcg. Within that leaf, we also define per-process
scores, but these are less important to us.
I am not sure I understand why this is not sensible from a hierarchy
POV, or why it is a very narrow use case. Note that all this is optional; by
default all memcgs are given the same score, and ties are broken based
on per-memcg (or per-process) usage.
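The recursive descent described above can be sketched as follows. This is an illustrative model only, not Google's internal implementation; the tie-breaking by usage is omitted for brevity:

```c
/*
 * Illustrative sketch: starting at the OOM memcg, repeatedly descend
 * into the child with the lowest OOM score until a leaf is reached.
 * The node layout is hypothetical.
 */
struct node {
	int score;		/* lower score = preferred victim */
	struct node *children;
	int nr_children;
};

static const struct node *pick_victim_memcg(const struct node *memcg)
{
	while (memcg->nr_children > 0) {
		const struct node *best = &memcg->children[0];

		for (int i = 1; i < memcg->nr_children; i++)
			if (memcg->children[i].score < best->score)
				best = &memcg->children[i];
		memcg = best;	/* descend into the lowest-score child */
	}
	return memcg;
}
```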
>
> I do not think we can afford a plethora of different OOM selection
> algorithms implemented in the kernel. Therefore we should really
> consider a control interface that is as extensible and in line
> with the existing interfaces as possible. That is why I am
> really open to the oom protection concept which fits reasonably well
> to the reclaim protection scheme. After all oom killer is just a very
> aggressive method of the memory reclaim.
>
> On the other hand I can see a need for customizable OOM victim selection
> functionality. We've been through that discussion on several other
> occasions and the best thing we could come up with was to allow to plug
> BPF into the victim selection process and allow to bypass the system
> default method. No code has ever materialized from those discussions
> though. Maybe this is the time to revive that idea again?
That definitely sounds interesting, and it was brought up before. It
does sound like BPF (or a different customization framework) can be
the answer here. Interested to hear what others think as well.
>
> > It would be great if we can arrive at an interface that serves this
> > use case as well.
> >
> > Thanks!
> >
> > [1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/
> --
> Michal Hocko
> SUSE Labs
On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
[...]
> There has been a parallel discussion in the cover letter thread of v4
> [1]. To summarize, at Google, we have been using OOM scores to
> describe different job priorities in a more explicit way -- regardless
> of memory usage. It is strictly priority-based OOM killing. Ties are
> broken based on memory usage.
>
> We understand that something like memory.oom.protect has an advantage
> in the sense that you can skip killing a process if you know that it
> won't free enough memory anyway, but for an environment where multiple
> jobs of different priorities are running, we find it crucial to be
> able to define strict ordering. Some jobs are simply more important
> than others, regardless of their memory usage.
I do remember that discussion. I am not a great fan of simple priority
based interfaces TBH. It sounds as an easy interface but it hits
complications as soon as you try to define a proper/sensible
hierarchical semantic. I can see how they might work on leaf memcgs with
statically assigned priorities but that sounds like a very narrow
usecase IMHO.
I do not think we can afford a plethora of different OOM selection
algorithms implemented in the kernel. Therefore we should really
consider a control interface that is as extensible and in line
with the existing interfaces as possible. That is why I am
really open to the oom protection concept which fits reasonably well
to the reclaim protection scheme. After all oom killer is just a very
aggressive method of the memory reclaim.
On the other hand I can see a need for customizable OOM victim selection
functionality. We've been through that discussion on several other
occasions and the best thing we could come up with was to allow to plug
BPF into the victim selection process and allow to bypass the system
default method. No code has ever materialized from those discussions
though. Maybe this is the time to revive that idea again?
> It would be great if we can arrive at an interface that serves this
> use case as well.
>
> Thanks!
>
> [1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/
--
Michal Hocko
SUSE Labs
Hello,
On Tue, Jun 13, 2023 at 10:27:32AM +0200, Michal Hocko wrote:
> On the other hand I can see a need for customizable OOM victim selection
> functionality. We've been through that discussion on several other
> occasions and the best thing we could come up with was to allow to plug
> BPF into the victim selection process and allow to bypass the system
> default method. No code has ever materialized from those discussions
> though. Maybe this is the time to revive that idea again?
Yeah, my 5 cents - trying to define a rigid interface for something complex
and flexible is a fool's errand.
Johannes knows a lot better than me but we (meta) are handling most OOMs
with oomd which gives more than sufficient policy flexibility. That said,
handling OOMs in userspace requires wholesale configuration changes (e.g.
working IO isolation) and being able to steer kernel OOM kills can be useful
for many folks. I haven't thought too much about it but the problem seems
pretty well fit for offloading policy decisions to a BPF program.
Thanks.
--
tejun
On Sun 04-06-23 08:05:53, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-29 22:02:47, "Michal Hocko" <[email protected]> wrote:
> >On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote:
> >> At 2023-05-22 21:03:50, "Michal Hocko" <[email protected]> wrote:
> >[...]
> >> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation
> >> >> with memory.oom.protect, so we have a ruler to measure the optimal value of
> >> >> memory.oom.protect.
> >> >
> >> >An example might help here.
> >>
> >> In my testing case, by adjusting memory.oom.protect, I was able to significantly
> >> reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine
> >> with severely oversold memory, I divided all cgroups into three categories and
> >> controlled their probability of being selected by the oom-killer to 0%, 20%,
> >> and 80%, respectively.
> >
> >I might be just dense but I am lost. Can we focus on the barebone
> >semantic of the group oom selection and killing first. No magic
> >auto-tuning at this stage please.
> >
> >> >> >> about the semantics of non-leaf memcgs protection,
> >> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
> >> >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >> >> >
> >> >> >So the non-leaf memcg is never used as a target? What if the workload is
> >> >> >distributed over several sub-groups? Our current oom.group
> >> >> >implementation traverses the tree to find a common ancestor in the oom
> >> >> >domain with the oom.group.
> >> >>
> >> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of
> >> >> the sub-groups' oom_protect quota, the oom_protect quota of each sub-group will
> >> >> be proportionally reduced.
> >> >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum
> >> >> of the sub-groups' oom_protect quota, the oom_protect quota of each sub-group
> >> >> will be proportionally increased.
> >> >> The purpose of doing so is that users can set oom_protect quota according to
> >> >> their own needs, and the system management process can set appropriate
> >> >> oom_protect quota on the parent non-leaf memcg as the final cover, so that
> >> >> the system management process can indirectly manage all user processes.
> >> >
> >> >I guess that you are trying to say that the oom protection has a
> >> >standard hierarchical behavior. And that is fine, well, in fact it is
> >> >mandatory for any control knob to have a sane hierarchical properties.
> >> >But that doesn't address my above question. Let me try again. When is a
> >> >non-leaf memcg potentially selected as the oom victim? It doesn't have
> >> >any tasks directly but it might be a suitable target to kill a multi
> >> >memcg based workload (e.g. a full container).
> >>
> >> If a non-leaf memcg has higher memory usage and a smaller
> >> memory.oom.protect, it will have a higher probability of being
> >> selected by the killer. If the non-leaf memcg is selected as the oom
> >> victim, the OOM killer will continue to select the appropriate child
> >> memcg downwards until a leaf memcg is selected.
> >
> >Parent memcg has at least as much memory charged as its child(ren) by
> >definition. Let me try to ask differently. Say you have the following
> >hierarchy
> >
> >             root
> >            /    \
> >   container_A        container_B
> >   (oom.prot=100M)    (oom.prot=200M)
> >   (usage=120M)       (usage=180M)
> >    /  |  \
> >   A   B   C
> >          / \
> >        C1   C2
> >
> >
> >container_B is protected so it should be excluded. Correct? So we are at
> >container_A to choose from. There are multiple outcomes the system and
> >container admin might want to achieve.
> >1) system admin might want to shut down the whole container.
> >2) container admin might want to shut the whole container down
> >3) cont. admin might want to shut down a whole sub group (e.g. C as it
> > is a self contained workload and killing portion of it will put it into
> > inconsistent state).
> >4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a
> > leaf memcg).
> >5) admin might want to kill a process in the most excess memcg.
> >
> >Now we already have oom.group thingy that can drive the group killing
> >policy but it is not really clear how you want to incorporate that to
> >the protection.
> >
> >Again, I think that an oom.protection makes sense but the semantic has
> >to be very carefully thought through because it is quite easy to create
> >corner cases and weird behavior. I also think that oom.group has to be
> >consistent with the protection.
>
> The barebone semantics of the function implemented by my patch are
> summarized as follows:
> a memcg only allows processes in the memcg to be selected by their
> ancestor's OOM killer when its memory usage exceeds "oom.protect".
I am sure you would need to break this expectation if there is no such
memcg with tasks available. Or do you panic the system in the global
case, and retry forever in the memcg oom case?
> It should be noted that "oom.protect" and "oom.group" are completely
> different things, and kneading them together may make the explanation
> more confusing.
I am not suggesting to tight those two together by any means. I am
merely saying that those two have to be mutually cooperative and still
represent a reasonable semantic. Please have a look at above example
usecases and try to explain how the memory protection fits in here as
you have defined and implemented it.
--
Michal Hocko
SUSE Labs
On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> +David Rientjes
>
> On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <[email protected]> wrote:
> >
> > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > [...]
> > > There has been a parallel discussion in the cover letter thread of v4
> > > [1]. To summarize, at Google, we have been using OOM scores to
> > > describe different job priorities in a more explicit way -- regardless
> > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > broken based on memory usage.
> > >
> > > We understand that something like memory.oom.protect has an advantage
> > > in the sense that you can skip killing a process if you know that it
> > > won't free enough memory anyway, but for an environment where multiple
> > > jobs of different priorities are running, we find it crucial to be
> > > able to define strict ordering. Some jobs are simply more important
> > > than others, regardless of their memory usage.
> >
> > I do remember that discussion. I am not a great fan of simple priority
> > based interfaces TBH. It sounds as an easy interface but it hits
> > complications as soon as you try to define a proper/sensible
> > hierarchical semantic. I can see how they might work on leaf memcgs with
> > statically assigned priorities but that sounds like a very narrow
> > usecase IMHO.
>
> Do you mind elaborating on the problem with the hierarchical semantics?
Well, let me be more specific. If you have a simple hierarchical numeric
enforcement (assume a higher priority is more likely to be chosen and the
effective priority to be max(self, max(parents))), then the semantic
itself is straightforward.
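That max(self, max(parents)) semantic can be sketched as a walk from the root down to the memcg. The code below is illustrative, not an existing kernel helper:

```c
/*
 * Illustrative sketch of max(self, max(parents)): given the priorities
 * along the path root -> ... -> memcg, the effective priority is the
 * running maximum over the whole path.
 */
static int effective_priority(const int *path_prios, int depth)
{
	int eff = path_prios[0];	/* root's priority */

	for (int i = 1; i < depth; i++)
		if (path_prios[i] > eff)
			eff = path_prios[i];
	return eff;
}
```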
I am not really sure about the practical manageability though. I have a
hard time imagining priority assignment on something like a shared
workload with a more complex hierarchy. For example:
           root
      /     |     \
  cont_A  cont_B  cont_C
each container running its workload with own hierarchy structures that
might be rather dynamic during the lifetime. In order to have a
predictable OOM behavior you need to watch and reassign priorities all
the time, no?
> The way it works with our internal implementation is (imo) sensible
> and straightforward from a hierarchy POV. Starting at the OOM memcg
> (which can be root), we recursively compare the OOM scores of the
> children memcgs and pick the one with the lowest score, until we
> arrive at a leaf memcg.
This approach has a strong requirement on the memcg hierarchy
organization. Siblings have to be directly comparable because you cut
off many potential sub-trees this way (e.g. is it easy to tell
whether you want to rule out all system or user slices?).
I can imagine usecases where this could work reasonably well e.g. a set
of workers of different priorities, all of them running under a shared
memcg parent. But more involved hierarchies seem more complex
because you always have to keep in mind how the hierarchy is organized
to get to your desired victim.
--
Michal Hocko
SUSE Labs
On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> > +David Rientjes
> >
> > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > > [...]
> > > > There has been a parallel discussion in the cover letter thread of v4
> > > > [1]. To summarize, at Google, we have been using OOM scores to
> > > > describe different job priorities in a more explicit way -- regardless
> > > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > > broken based on memory usage.
> > > >
> > > > We understand that something like memory.oom.protect has an advantage
> > > > in the sense that you can skip killing a process if you know that it
> > > > won't free enough memory anyway, but for an environment where multiple
> > > > jobs of different priorities are running, we find it crucial to be
> > > > able to define strict ordering. Some jobs are simply more important
> > > > than others, regardless of their memory usage.
> > >
> > > I do remember that discussion. I am not a great fan of simple priority
> > > based interfaces TBH. It sounds as an easy interface but it hits
> > > complications as soon as you try to define a proper/sensible
> > > hierarchical semantic. I can see how they might work on leaf memcgs with
> > > statically assigned priorities but that sounds like a very narrow
> > > usecase IMHO.
> >
> > Do you mind elaborating the problem with the hierarchical semantics?
>
> Well, let me be more specific. If you have a simple hierarchical numeric
> enforcement (assume a higher priority is more likely to be chosen, and the
> effective priority to be max(self, max(parents))), then the semantic
> itself is straightforward.
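The effective-priority semantic described in the quote above (the effective value is the maximum of a memcg's own priority and all of its ancestors') can be sketched in a few lines. The `Memcg` class and its fields below are hypothetical, purely to illustrate the semantic, not actual kernel code:

```python
# Sketch of the hierarchical effective-priority semantic:
# effective = max(own priority, priorities of all ancestors).
# The Memcg class and its fields are illustrative only.

class Memcg:
    def __init__(self, name, priority=0, parent=None):
        self.name = name
        self.priority = priority
        self.parent = parent

def effective_priority(memcg):
    """Walk up the hierarchy; the effective value is the max along the path."""
    prio = memcg.priority
    node = memcg.parent
    while node is not None:
        prio = max(prio, node.priority)
        node = node.parent
    return prio
```

With this semantic, raising a parent's priority automatically raises the effective priority of every descendant, which is what makes the hierarchical rule well defined.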
>
> I am not really sure about the practical manageability though. I have a
> hard time imagining priority assignment on something like a shared
> workload with a more complex hierarchy. For example:
> root
> / | \
> cont_A cont_B cont_C
>
> each container running its workload with its own hierarchy structures
> that might be rather dynamic during their lifetime. In order to have a
> predictable OOM behavior you need to watch and reassign priorities all
> the time, no?
In our case we don't really manage the entire hierarchy in a
centralized fashion. Each container gets a score based on its
relative priority, and each container is free to set scores within its
subcontainers if needed. Isn't this what the hierarchy is all about?
Each parent only cares about its direct children. On the system level,
we care about the priority ordering of containers. Ordering within
containers can be deferred to the containers themselves.
>
> > The way it works with our internal implementation is (imo) sensible
> > and straightforward from a hierarchy POV. Starting at the OOM memcg
> > (which can be root), we recursively compare the OOM scores of the
> > children memcgs and pick the one with the lowest score, until we
> > arrive at a leaf memcg.
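The top-down selection described in the quote above (starting at the OOM memcg, repeatedly descend into the child with the lowest score, breaking ties by memory usage, until a leaf is reached) might be modeled roughly as follows. This is an illustrative user-space model with hypothetical names, not the actual internal implementation:

```python
# Rough model of score-based victim selection: descend from the OOM memcg,
# always picking the child with the lowest score; among equal scores, the
# child with the larger memory usage is preferred as the victim.
# The Memcg class and its fields are illustrative only.

class Memcg:
    def __init__(self, name, score, usage=0):
        self.name = name
        self.score = score    # user-assigned OOM score (lower = killed first)
        self.usage = usage    # memory usage, used only to break ties
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return child

def select_victim_memcg(oom_root):
    """Walk from the OOM memcg down to a leaf, choosing at each level the
    child with the lowest score; ties go to the larger memory usage."""
    node = oom_root
    while node.children:
        node = min(node.children, key=lambda c: (c.score, -c.usage))
    return node
```

Note how this model makes the objection concrete: the choice at each level only ever compares siblings, so an entire subtree is ruled out at every step of the descent.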
>
> This approach has a strong requirement on the memcg hierarchy
> organization. Siblings have to be directly comparable because you cut
> off many potential sub-trees this way (e.g. is it easy to tell
> whether you want to rule out all system or user slices?).
>
> I can imagine usecases where this could work reasonably well, e.g. a set
> of workers of different priorities, all of them running under a shared
> memcg parent. But more involved hierarchies seem more complex
> because you always have to keep in mind how the hierarchy is organized
> to get to your desired victim.
I guess the main point is what I mentioned above, you don't need to
manage the entire tree, containers can manage their subtrees. The most
important thing is to provide the kernel with priority ordering among
containers, and optionally priority ordering within a container
(disregarding other containers).
>
> --
> Michal Hocko
> SUSE Labs
On Tue 13-06-23 13:24:24, Yosry Ahmed wrote:
> On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> > > +David Rientjes
> > >
> > > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > > > [...]
> > > > > There has been a parallel discussion in the cover letter thread of v4
> > > > > [1]. To summarize, at Google, we have been using OOM scores to
> > > > > describe different job priorities in a more explicit way -- regardless
> > > > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > > > broken based on memory usage.
> > > > >
> > > > > We understand that something like memory.oom.protect has an advantage
> > > > > in the sense that you can skip killing a process if you know that it
> > > > > won't free enough memory anyway, but for an environment where multiple
> > > > > jobs of different priorities are running, we find it crucial to be
> > > > > able to define strict ordering. Some jobs are simply more important
> > > > > than others, regardless of their memory usage.
> > > >
> > > > I do remember that discussion. I am not a great fan of simple priority
> > > > based interfaces TBH. It sounds as an easy interface but it hits
> > > > complications as soon as you try to define a proper/sensible
> > > > hierarchical semantic. I can see how they might work on leaf memcgs with
> > > > statically assigned priorities but that sounds like a very narrow
> > > > usecase IMHO.
> > >
> > > Do you mind elaborating the problem with the hierarchical semantics?
> >
> > Well, let me be more specific. If you have a simple hierarchical numeric
> > enforcement (assume a higher priority is more likely to be chosen, and the
> > effective priority to be max(self, max(parents))), then the semantic
> > itself is straightforward.
> >
> > I am not really sure about the practical manageability though. I have a
> > hard time imagining priority assignment on something like a shared
> > workload with a more complex hierarchy. For example:
> > root
> > / | \
> > cont_A cont_B cont_C
> >
> > each container running its workload with its own hierarchy structures
> > that might be rather dynamic during their lifetime. In order to have a
> > predictable OOM behavior you need to watch and reassign priorities all
> > the time, no?
>
> In our case we don't really manage the entire hierarchy in a
> centralized fashion. Each container gets a score based on their
> relative priority, and each container is free to set scores within its
> subcontainers if needed. Isn't this what the hierarchy is all about?
> Each parent only cares about its direct children. On the system level,
> we care about the priority ordering of containers. Ordering within
> containers can be deferred to containers.
This really depends on the workload. This might work for your setup,
but as I've said above, many workloads would struggle with
re-prioritizing as soon as a new workload is started, since OOM
priorities would need to be reorganized as a result. The setup is just
too static to be generally useful IMHO.
You could avoid that by essentially assigning no priority to the
mid-layers and relying only on leaf memcgs, which would be more
flexible. This is something even more complicated with the top-down
approach.
That being said, I can see workloads which could benefit from a
priority-based policy (essentially user-space controlled OOM
pre-selection). But there are many other policies like that which would
be usecase-specific and not generic enough, so I do not think this is
worth a generic interface; it would rather fall into BPF or similar
policy mechanisms.
--
Michal Hocko
SUSE Labs
On Thu, Jun 15, 2023 at 3:39 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 13-06-23 13:24:24, Yosry Ahmed wrote:
> > On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> > > > +David Rientjes
> > > >
> > > > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > > > > [...]
> > > > > > There has been a parallel discussion in the cover letter thread of v4
> > > > > > [1]. To summarize, at Google, we have been using OOM scores to
> > > > > > describe different job priorities in a more explicit way -- regardless
> > > > > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > > > > broken based on memory usage.
> > > > > >
> > > > > > We understand that something like memory.oom.protect has an advantage
> > > > > > in the sense that you can skip killing a process if you know that it
> > > > > > won't free enough memory anyway, but for an environment where multiple
> > > > > > jobs of different priorities are running, we find it crucial to be
> > > > > > able to define strict ordering. Some jobs are simply more important
> > > > > > than others, regardless of their memory usage.
> > > > >
> > > > > I do remember that discussion. I am not a great fan of simple priority
> > > > > based interfaces TBH. It sounds as an easy interface but it hits
> > > > > complications as soon as you try to define a proper/sensible
> > > > > hierarchical semantic. I can see how they might work on leaf memcgs with
> > > > > statically assigned priorities but that sounds like a very narrow
> > > > > usecase IMHO.
> > > >
> > > > Do you mind elaborating the problem with the hierarchical semantics?
> > >
> > > Well, let me be more specific. If you have a simple hierarchical numeric
> > > enforcement (assume a higher priority is more likely to be chosen, and the
> > > effective priority to be max(self, max(parents))), then the semantic
> > > itself is straightforward.
> > >
> > > I am not really sure about the practical manageability though. I have a
> > > hard time imagining priority assignment on something like a shared
> > > workload with a more complex hierarchy. For example:
> > > root
> > > / | \
> > > cont_A cont_B cont_C
> > >
> > > each container running its workload with its own hierarchy
> > > structures that might be rather dynamic during their lifetime. In
> > > order to have a predictable OOM behavior you need to watch and
> > > reassign priorities all the time, no?
> >
> > In our case we don't really manage the entire hierarchy in a
> > centralized fashion. Each container gets a score based on their
> > relative priority, and each container is free to set scores within its
> > subcontainers if needed. Isn't this what the hierarchy is all about?
> > Each parent only cares about its direct children. On the system level,
> > we care about the priority ordering of containers. Ordering within
> > containers can be deferred to containers.
>
> This really depends on the workload. This might work for your setup,
> but as I've said above, many workloads would struggle with
> re-prioritizing as soon as a new workload is started, since OOM
> priorities would need to be reorganized as a result. The setup is just
> too static to be generally useful IMHO.
> You could avoid that by essentially assigning no priority to the
> mid-layers and relying only on leaf memcgs, which would be more
> flexible. This is something even more complicated with the top-down
> approach.
I agree that other setups may find it more difficult if one entity
needs to manage the entire tree, although if the score range is large
enough, I don't really think it's that static. When a new workload is
started you decide what its priority is compared to the existing
workloads and set its score as such. We use a range of scores from 0
to 10,000 (and it can easily be larger), so it's easy to assign new
scores without reorganizing the existing ones.
>
> That being said, I can see workloads which could benefit from a
> priority-based policy (essentially user-space controlled OOM
> pre-selection). But there are many other policies like that which would
> be usecase-specific and not generic enough, so I do not think this is
> worth a generic interface; it would rather fall into BPF or similar
> policy mechanisms.
That's reasonable. I can't speak for other folks. Perhaps no single
policy will be generic enough, and we should focus on enabling
customized policies. Perhaps other userspace OOM agents could benefit
from this as well.
>
> --
> Michal Hocko
> SUSE Labs