Establish a new OOM score algorithm, supports the cgroup level OOM
protection mechanism. When an global/memcg oom event occurs, we treat
all processes in the cgroup as a whole, and OOM killers need to select
the process to kill based on the protection quota of the cgroup.
Here is a more detailed comparison and introduction of the old
oom_score_adj mechanism and the new oom_protect mechanism,
1. The regulating granularity of oom_protect is smaller than that of
oom_score_adj. On a 512G physical machine, the minimum granularity
adjusted by oom_score_adj is 512M, and the minimum granularity
adjusted by oom_protect is one page (4K)
2. It may be simple to create a lightweight parent process and uniformly
set the oom_score_adj of some important processes, but it is not a
simple matter to make multi-level settings for tens of thousands of
processes on the physical machine through the lightweight parent
processes. We may need a huge table to record the value of oom_score_adj
maintained by all lightweight parent processes, and the user process
limited by the parent process has no ability to change its own
oom_score_adj, because it does not know the details of the huge
table. on the other hand, we have to set the common parent process'
oom_score_adj, before it forks all children processes. We must strictly
follow this setting sequence, and once oom_score_adj is set, it cannot
be changed. To sum up, it is very difficult to apply oom_score_adj in
other situations. The new patch adopts the cgroup mechanism. It does not
need any parent process to manage oom_score_adj. the settings between
each memcg are independent of each other, making it easier to plan the
OOM order of all processes. Due to the unique nature of memory
resources, current Service cloud vendors are not oversold in memory
planning. I would like to use the new patch to try to achieve the
possibility of oversold memory resources.
3. I conducted a test and deployed an excessive number of containers on
a physical machine, By setting the oom_score_adj value of all processes
in the container to a positive number through dockerinit, even processes
that occupy very little memory in the container are easily killed,
resulting in a large number of invalid kill behaviors. If dockerinit is
also killed unfortunately, it will trigger container self-healing, and
the container will rebuild, resulting in more severe memory
oscillations. The new patch abandons the behavior of adding an equal
amount of oom_score_adj to each process in the container and adopts a
shared oom_protect quota for all processes in the container. If a
process in the container is killed, the remaining other processes will
receive more oom_protect quota, making it more difficult for the
remaining processes to be killed. In my test case, the new patch reduced
the number of invalid kill behaviors by 70%.
4. oom_score_adj is a global configuration that cannot achieve a kill
order that only affects a certain memcg-oom-killer. However, the
oom_protect mechanism inherits downwards (If the oom_protect quota of
the parent cgroup is less than the sum of sub-cgroups oom_protect quota,
the oom_protect quota of each sub-cgroup will be proportionally reduced.
If the oom_protect quota of the parent cgroup is greater than the sum of
sub-cgroups oom_protect quota, the oom_protect quota of each sub-cgroup
will be proportionally increased). The purpose of doing so is that users
can set oom_protect quota according to their own needs, and the system
management process can set appropriate oom_protect quota on the parent
memcg as the final cover. If the oom_protect of the parent cgroup is 0,
the kill order of memcg-oom or global-ooms will not be affected by user
specific settings.
5. Per-process accounting does not count shared memory, similar to
active page cache, which also increases the probability of OOM-kill.
However, the memcg accounting may be more reasonable, as its memory
statistics are more comprehensive. In the new patch, all the shared
memory will also consume the oom_protect quota of the memcg, and the
process's oom_protect quota of the memcg will decrease, the probability
of they being killed will increase.
6. In the final discussion of patch v2, we discussed that although the
adjustment range of oom_score_adj is [-1000,1000], but essentially it
only allows two usecases(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably.
Everything in between is clumsy at best. In order to solve this problem
in the new patch, I introduced a new indicator oom_kill_inherit, which
counts the number of times the local and child cgroups have been
selected by the OOM killer of the ancestor cgroup. oom_kill_inherit
maintains a negative correlation with memory.oom.protect, so we have a
ruler to measure the optimal value of memory.oom.protect. By observing
the proportion of oom_kill_inherit in the parent cgroup, I can
effectively adjust the value of oom_protect to achieve the best.
Changelog:
v4:
* Fix warning: overflow in expression. (patch 1)
* Supplementary commit information. (patch 0)
v3:
* Add "auto" option for memory.oom.protect. (patch 1)
* Fix division errors. (patch 1)
* Add observation indicator oom_kill_inherit. (patch 2)
https://lore.kernel.org/linux-mm/[email protected]/
v2:
* Modify the formula of the process request memcg protection quota.
https://lore.kernel.org/linux-mm/[email protected]/
v1:
https://lore.kernel.org/linux-mm/[email protected]/
chengkaitao (2):
mm: memcontrol: protect the memory in cgroup from being oom killed
memcg: add oom_kill_inherit event indicator
Documentation/admin-guide/cgroup-v2.rst | 29 ++++-
fs/proc/base.c | 17 ++-
include/linux/memcontrol.h | 46 +++++++-
include/linux/oom.h | 3 +-
include/linux/page_counter.h | 6 +
mm/memcontrol.c | 199 ++++++++++++++++++++++++++++++++
mm/oom_kill.c | 25 ++--
mm/page_counter.c | 30 +++++
8 files changed, 334 insertions(+), 21 deletions(-)
--
2.14.1
From: chengkaitao <[email protected]>
We created a new interface <memory.oom.protect> for memory, If there is
the OOM killer under parent memory cgroup, and the memory usage of a
child cgroup is within its effective oom.protect boundary, the cgroup's
tasks won't be OOM killed unless there is no unprotected tasks in other
children cgroups. It draws on the logic of <memory.min/low> in the
inheritance relationship.
It has the following advantages,
1. The minimum granularity adjusted by oom_protect is one page (4K).
2. We have the ability to protect more important processes, when there
is a memcg's OOM killer. The oom.protect only takes effect local memcg,
and does not affect the OOM killer of the host.
3. We can protect a cgroup of processes more easily. The cgroup can keep
some memory, even if the OOM killer has to be called. The new patch
provides new protection for machine memory oversold, while also
preventing some processes with low memory usage from being killed.
4. All the shared memory will also consume the oom_protect quota of the
memcg, and the process's oom_protect quota of the memcg will decrease,
the probability of they being killed will increase.
If the memory.oom.protect of a nonleaf cgroup is set to "auto", it will
automatically calculate the sum of the effective oom.protect of its own
subcgroups.
Signed-off-by: chengkaitao <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 25 ++++-
fs/proc/base.c | 17 ++-
include/linux/memcontrol.h | 45 +++++++-
include/linux/oom.h | 3 +-
include/linux/page_counter.h | 6 +
mm/memcontrol.c | 193 ++++++++++++++++++++++++++++++++
mm/oom_kill.c | 25 +++--
mm/page_counter.c | 30 +++++
8 files changed, 323 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..51e9a84d508a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1194,7 +1194,7 @@ PAGE_SIZE multiple when read back.
cgroup is within its effective low boundary, the cgroup's
memory won't be reclaimed unless there is no reclaimable
memory available in unprotected cgroups.
- Above the effective low boundary (or
+ Above the effective low boundary (or
effective min boundary if it is higher), pages are reclaimed
proportionally to the overage, reducing reclaim pressure for
smaller overages.
@@ -1295,6 +1295,27 @@ PAGE_SIZE multiple when read back.
to kill any tasks outside of this cgroup, regardless
memory.oom.group values of ancestor cgroups.
+ memory.oom.protect
+ A read-write single value file which exists on non-root
+ cgroups. The default value is "0".
+
+ If there is the OOM killer under parent memory cgroup, and
+ the memory usage of a child cgroup is within its effective
+ oom.protect boundary, the cgroup's processes won't be oom killed
+ unless there is no unprotected processes in other children
+ cgroups. About the effective oom.protect boundary, we assign it
+ to each process in this cgroup in proportion to the actual usage.
+ this factor will be taken into account when calculating the
+ oom_score. Effective oom.protect boundary is limited by
+ memory.oom.protect values of all ancestor cgroups. If there is
+ memory.oom.protect overcommitment (child cgroup or cgroups are
+ requiring more protected memory than parent will allow), then each
+ child cgroup will get the part of parent's protection proportional
+ to its actual memory usage below memory.oom.protect. If the
+ memory.oom.protect of a nonleaf cgroup is set to "auto", it will
+ automatically calculate the sum of the effective oom.protect of
+ its own subcgroups.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1894,7 +1915,7 @@ of the two is enforced.
cgroup writeback requires explicit support from the underlying
filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
attributed to the root cgroup.
There are inherent differences in memory and writeback management
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 05452c3b9872..a34e007a7292 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -553,8 +553,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
unsigned long totalpages = totalram_pages() + total_swap_pages;
unsigned long points = 0;
long badness;
+#ifdef CONFIG_MEMCG
+ struct mem_cgroup *memcg;
- badness = oom_badness(task, totalpages);
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(task);
+ if (memcg && !css_tryget(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+
+ update_parent_oom_protection(root_mem_cgroup, memcg);
+ css_put(&memcg->css);
+#endif
+ badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
/*
* Special case OOM_SCORE_ADJ_MIN for all others scale the
* badness value into [0, 2000] range which we have been
@@ -2657,7 +2668,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry,
return d_splice_alias(inode, dentry);
}
-static struct dentry *proc_pident_lookup(struct inode *dir,
+static struct dentry *proc_pident_lookup(struct inode *dir,
struct dentry *dentry,
const struct pid_entry *p,
const struct pid_entry *end)
@@ -2870,7 +2881,7 @@ static const struct pid_entry attr_dir_stuff[] = {
static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
{
- return proc_pident_readdir(file, ctx,
+ return proc_pident_readdir(file, ctx,
attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 00a88cf947e1..23ea28d98c0e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,11 @@ enum memcg_memory_event {
MEMCG_NR_MEMORY_EVENTS,
};
+enum memcg_oom_evaluate {
+ MEMCG_OOM_EVALUATE_NONE,
+ MEMCG_OOM_PROTECT,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
unsigned int generation;
@@ -622,6 +627,14 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
void mem_cgroup_calculate_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg);
+void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
+void update_parent_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
+unsigned long get_task_eoom_protect(struct task_struct *p, long points);
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
+bool is_root_oom_protect(void);
static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
struct mem_cgroup *memcg)
@@ -758,10 +771,6 @@ static inline struct lruvec *folio_lruvec(struct folio *folio)
return mem_cgroup_lruvec(memcg, folio_pgdat(folio));
}
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
-struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
-
struct lruvec *folio_lruvec_lock(struct folio *folio);
struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
@@ -822,6 +831,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
int mem_cgroup_scan_tasks(struct mem_cgroup *,
int (*)(struct task_struct *, void *), void *);
+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg);
static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
{
@@ -1231,6 +1242,16 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
{
}
+static inline void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+}
+
+static inline void update_parent_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+}
+
static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
struct mem_cgroup *memcg)
{
@@ -1248,6 +1269,16 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
return false;
}
+static inline unsigned long get_task_eoom_protect(struct task_struct *p, long points)
+{
+ return 0;
+}
+
+static inline bool is_root_oom_protect(void)
+{
+ return 0;
+}
+
static inline int mem_cgroup_charge(struct folio *folio,
struct mm_struct *mm, gfp_t gfp)
{
@@ -1372,6 +1403,12 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return 0;
}
+static inline int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg)
+{
+ return 0;
+}
+
static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
{
return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7d0c9c48a0c5..04b6daca5a9c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -97,8 +97,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
return 0;
}
-long oom_badness(struct task_struct *p,
- unsigned long totalpages);
+long oom_badness(struct task_struct *p, unsigned long totalpages, int flags);
extern bool out_of_memory(struct oom_control *oc);
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index c141ea9a95ef..d730a7373c1d 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -25,6 +25,10 @@ struct page_counter {
atomic_long_t low_usage;
atomic_long_t children_low_usage;
+ unsigned long eoom_protect;
+ atomic_long_t oom_protect_usage;
+ atomic_long_t children_oom_protect_usage;
+
unsigned long watermark;
unsigned long failcnt;
@@ -35,6 +39,7 @@ struct page_counter {
unsigned long low;
unsigned long high;
unsigned long max;
+ unsigned long oom_protect;
struct page_counter *parent;
} ____cacheline_internodealigned_in_smp;
@@ -65,6 +70,7 @@ bool page_counter_try_charge(struct page_counter *counter,
void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_oom_protect(struct page_counter *counter, unsigned long nr_pages);
static inline void page_counter_set_high(struct page_counter *counter,
unsigned long nr_pages)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d31fb1e2cb33..8ee67abb415f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1288,6 +1288,52 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return ret;
}
+/**
+ * mem_cgroup_scan_tasks_update_eoom - iterate over tasks of a memory cgroup
+ * hierarchy and update memcg's eoom_protect
+ * @memcg: hierarchy root
+ * @fn: function to call for each task
+ * @arg: argument passed to @fn
+ *
+ * This function iterates over tasks attached to @memcg or to any of its
+ * descendants and update all memcg's eoom_protect, then calls @fn for each
+ * task. If @fn returns a non-zero value, the function breaks the iteration
+ * loop and returns the value. Otherwise, it will iterate over all tasks and
+ * return 0.
+ *
+ * This function may be called for the root memory cgroup.
+ */
+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg)
+{
+ struct mem_cgroup *iter;
+ int ret = 0;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ mem_cgroup_calculate_oom_protection(memcg, iter);
+ css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = fn(task, arg, MEMCG_OOM_PROTECT);
+ css_task_iter_end(&it);
+ if (ret) {
+ mem_cgroup_iter_break(memcg, iter);
+ break;
+ }
+ }
+ return ret;
+}
+
+bool is_root_oom_protect(void)
+{
+ if (mem_cgroup_disabled())
+ return 0;
+
+ return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}
+
#ifdef CONFIG_DEBUG_VM
void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
{
@@ -6396,6 +6442,8 @@ static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
{
if (value == PAGE_COUNTER_MAX)
seq_puts(m, "max\n");
+ else if (value == PAGE_COUNTER_MAX - 1)
+ seq_puts(m, "auto\n");
else
seq_printf(m, "%llu\n", (u64)value * PAGE_SIZE);
@@ -6677,6 +6725,34 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_oom_protect_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->memory.oom_protect));
+}
+
+static ssize_t memory_oom_protect_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long oom_protect;
+ int err;
+
+ buf = strstrip(buf);
+ if (!strcmp(buf, "auto")) {
+ oom_protect = PAGE_COUNTER_MAX - 1;
+ goto set;
+ }
+
+ err = page_counter_memparse(buf, "max", &oom_protect);
+ if (err)
+ return err;
+
+set:
+ page_counter_set_oom_protect(&memcg->memory, oom_protect);
+ return nbytes;
+}
+
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -6782,6 +6858,12 @@ static struct cftype memory_files[] = {
.seq_show = memory_oom_group_show,
.write = memory_oom_group_write,
},
+ {
+ .name = "oom.protect",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_protect_show,
+ .write = memory_oom_protect_write,
+ },
{
.name = "reclaim",
.flags = CFTYPE_NS_DELEGATABLE,
@@ -6978,6 +7060,117 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
atomic_long_read(&parent->memory.children_low_usage)));
}
+static void __mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+ unsigned long usage, parent_usage;
+ struct mem_cgroup *parent;
+
+ usage = page_counter_read(&memcg->memory);
+ if (!usage)
+ return;
+
+ parent = parent_mem_cgroup(memcg);
+
+ if (parent == root) {
+ memcg->memory.eoom_protect = atomic_long_read(&memcg->memory.oom_protect_usage);
+ return;
+ }
+
+ parent_usage = page_counter_read(&parent->memory);
+
+ WRITE_ONCE(memcg->memory.eoom_protect, effective_protection(usage, parent_usage,
+ atomic_long_read(&memcg->memory.oom_protect_usage),
+ READ_ONCE(parent->memory.eoom_protect),
+ atomic_long_read(&parent->memory.children_oom_protect_usage)));
+}
+
+/**
+ * mem_cgroup_calculate_oom_protection - check if memory consumption is in the
+ * normal range of oom's protection
+ * @root: the top ancestor of the sub-tree being checked
+ * @memcg: the memory cgroup to check
+ *
+ * WARNING: This function is not stateless! It can only be used as part
+ * of a top-down tree iteration, not for isolated queries.
+ */
+void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!root)
+ root = root_mem_cgroup;
+
+ /*
+ * Effective values of the reclaim targets are ignored so they
+ * can be stale. Have a look at mem_cgroup_protection for more
+ * details.
+ * TODO: calculation should be more robust so that we do not need
+ * that special casing.
+ */
+ if (memcg == root)
+ return;
+
+ __mem_cgroup_calculate_oom_protection(root, memcg);
+}
+
+static void lsit_postorder_for_memcg_parent(
+ struct mem_cgroup *root, struct mem_cgroup *memcg,
+ void (*fn)(struct mem_cgroup *, struct mem_cgroup *))
+{
+ struct mem_cgroup *parent;
+
+ if (!memcg || memcg == root)
+ return;
+
+ parent = parent_mem_cgroup(memcg);
+ lsit_postorder_for_memcg_parent(root, parent, fn);
+ fn(root, memcg);
+}
+
+void update_parent_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!root)
+ root = root_mem_cgroup;
+
+ lsit_postorder_for_memcg_parent(root, memcg,
+ __mem_cgroup_calculate_oom_protection);
+}
+
+unsigned long get_task_eoom_protect(struct task_struct *p, long points)
+{
+ struct mem_cgroup *memcg;
+ unsigned long usage, eoom = 0;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(p);
+
+ if (mem_cgroup_unprotected(NULL, memcg))
+ goto end;
+
+ if (do_memsw_account())
+ usage = page_counter_read(&memcg->memsw);
+ else
+ usage = page_counter_read(&memcg->memory)
+ + page_counter_read(&memcg->swap);
+
+ /* To avoid division errors, when we don't move charge at immigrate */
+ if (!usage)
+ goto end;
+
+ eoom = READ_ONCE(memcg->memory.eoom_protect) * points / usage;
+
+end:
+ rcu_read_unlock();
+ return eoom;
+}
+
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..2a80e2d91afe 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
* linux/mm/oom_kill.c
- *
+ *
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
@@ -193,15 +193,16 @@ static bool should_dump_unreclaim_slab(void)
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
* @totalpages: total present RAM allowed for page allocation
+ * @flag: if you want to skip oom_protect function
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages, int flag)
{
- long points;
- long adj;
+ long points, adj, val = 0;
+ unsigned long shmem;
if (oom_unkillable_task(p))
return LONG_MIN;
@@ -229,11 +230,15 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
*/
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+
+ shmem = get_mm_counter(p->mm, MM_SHMEMPAGES);
task_unlock(p);
+ if (flag == MEMCG_OOM_PROTECT)
+ val = get_task_eoom_protect(p, points - shmem);
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
- points += adj;
+ points = points + adj - val;
return points;
}
@@ -305,7 +310,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+static int oom_evaluate_task(struct task_struct *task, void *arg, int flag)
{
struct oom_control *oc = arg;
long points;
@@ -338,7 +343,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto select;
}
- points = oom_badness(task, oc->totalpages);
+ points = oom_badness(task, oc->totalpages, flag);
if (points == LONG_MIN || points < oc->chosen_points)
goto next;
@@ -365,14 +370,14 @@ static void select_bad_process(struct oom_control *oc)
{
oc->chosen_points = LONG_MIN;
- if (is_memcg_oom(oc))
- mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+ if (is_memcg_oom(oc) || is_root_oom_protect())
+ mem_cgroup_scan_tasks_update_eoom(oc->memcg, oom_evaluate_task, oc);
else {
struct task_struct *p;
rcu_read_lock();
for_each_process(p)
- if (oom_evaluate_task(p, oc))
+ if (oom_evaluate_task(p, oc, MEMCG_OOM_EVALUATE_NONE))
break;
rcu_read_unlock();
}
diff --git a/mm/page_counter.c b/mm/page_counter.c
index db20d6452b71..72a6f3543008 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -39,6 +39,19 @@ static void propagate_protected_usage(struct page_counter *c,
if (delta)
atomic_long_add(delta, &c->parent->children_low_usage);
}
+
+ protected = READ_ONCE(c->oom_protect);
+ if (protected == PAGE_COUNTER_MAX - 1)
+ protected = atomic_long_read(&c->children_oom_protect_usage);
+ else
+ protected = min(usage, protected);
+ old_protected = atomic_long_read(&c->oom_protect_usage);
+ if (protected != old_protected) {
+ old_protected = atomic_long_xchg(&c->oom_protect_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_oom_protect_usage);
+ }
}
/**
@@ -234,6 +247,23 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
propagate_protected_usage(c, atomic_long_read(&c->usage));
}
+/**
+ * page_counter_set_oom_protect - set the amount of oom protected memory
+ * @counter: counter
+ * @nr_pages: value to set
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+void page_counter_set_oom_protect(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+
+ WRITE_ONCE(counter->oom_protect, nr_pages);
+
+ for (c = counter; c; c = c->parent)
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
+}
+
/**
* page_counter_memparse - memparse() for page counter limits
* @buf: string to parse
--
2.14.1
+David Rientjes
On Tue, May 16, 2023 at 8:20 PM chengkaitao <[email protected]> wrote:
>
> Establish a new OOM score algorithm, supports the cgroup level OOM
> protection mechanism. When an global/memcg oom event occurs, we treat
> all processes in the cgroup as a whole, and OOM killers need to select
> the process to kill based on the protection quota of the cgroup.
>
> Here is a more detailed comparison and introduction of the old
> oom_score_adj mechanism and the new oom_protect mechanism,
>
> 1. The regulating granularity of oom_protect is smaller than that of
> oom_score_adj. On a 512G physical machine, the minimum granularity
> adjusted by oom_score_adj is 512M, and the minimum granularity
> adjusted by oom_protect is one page (4K)
> 2. It may be simple to create a lightweight parent process and uniformly
> set the oom_score_adj of some important processes, but it is not a
> simple matter to make multi-level settings for tens of thousands of
> processes on the physical machine through the lightweight parent
> processes. We may need a huge table to record the value of oom_score_adj
> maintained by all lightweight parent processes, and the user process
> limited by the parent process has no ability to change its own
> oom_score_adj, because it does not know the details of the huge
> table. on the other hand, we have to set the common parent process'
> oom_score_adj, before it forks all children processes. We must strictly
> follow this setting sequence, and once oom_score_adj is set, it cannot
> be changed. To sum up, it is very difficult to apply oom_score_adj in
> other situations. The new patch adopts the cgroup mechanism. It does not
> need any parent process to manage oom_score_adj. the settings between
> each memcg are independent of each other, making it easier to plan the
> OOM order of all processes. Due to the unique nature of memory
> resources, current Service cloud vendors are not oversold in memory
> planning. I would like to use the new patch to try to achieve the
> possibility of oversold memory resources.
> 3. I conducted a test and deployed an excessive number of containers on
> a physical machine, By setting the oom_score_adj value of all processes
> in the container to a positive number through dockerinit, even processes
> that occupy very little memory in the container are easily killed,
> resulting in a large number of invalid kill behaviors. If dockerinit is
> also killed unfortunately, it will trigger container self-healing, and
> the container will rebuild, resulting in more severe memory
> oscillations. The new patch abandons the behavior of adding an equal
> amount of oom_score_adj to each process in the container and adopts a
> shared oom_protect quota for all processes in the container. If a
> process in the container is killed, the remaining other processes will
> receive more oom_protect quota, making it more difficult for the
> remaining processes to be killed. In my test case, the new patch reduced
> the number of invalid kill behaviors by 70%.
> 4. oom_score_adj is a global configuration that cannot achieve a kill
> order that only affects a certain memcg-oom-killer. However, the
> oom_protect mechanism inherits downwards (If the oom_protect quota of
> the parent cgroup is less than the sum of sub-cgroups oom_protect quota,
> the oom_protect quota of each sub-cgroup will be proportionally reduced.
> If the oom_protect quota of the parent cgroup is greater than the sum of
> sub-cgroups oom_protect quota, the oom_protect quota of each sub-cgroup
> will be proportionally increased). The purpose of doing so is that users
> can set oom_protect quota according to their own needs, and the system
> management process can set appropriate oom_protect quota on the parent
> memcg as the final cover. If the oom_protect of the parent cgroup is 0,
> the kill order of memcg-oom or global-ooms will not be affected by user
> specific settings.
> 5. Per-process accounting does not count shared memory, similar to
> active page cache, which also increases the probability of OOM-kill.
> However, the memcg accounting may be more reasonable, as its memory
> statistics are more comprehensive. In the new patch, all the shared
> memory will also consume the oom_protect quota of the memcg, and the
> process's oom_protect quota of the memcg will decrease, the probability
> of they being killed will increase.
> 6. In the final discussion of patch v2, we discussed that although the
> adjustment range of oom_score_adj is [-1000,1000], but essentially it
> only allows two usecases(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably.
> Everything in between is clumsy at best. In order to solve this problem
> in the new patch, I introduced a new indicator oom_kill_inherit, which
> counts the number of times the local and child cgroups have been
> selected by the OOM killer of the ancestor cgroup. oom_kill_inherit
> maintains a negative correlation with memory.oom.protect, so we have a
> ruler to measure the optimal value of memory.oom.protect. By observing
> the proportion of oom_kill_inherit in the parent cgroup, I can
> effectively adjust the value of oom_protect to achieve the best.
>
> Changelog:
> v4:
> * Fix warning: overflow in expression. (patch 1)
> * Supplementary commit information. (patch 0)
> v3:
> * Add "auto" option for memory.oom.protect. (patch 1)
> * Fix division errors. (patch 1)
> * Add observation indicator oom_kill_inherit. (patch 2)
> https://lore.kernel.org/linux-mm/[email protected]/
> v2:
> * Modify the formula of the process request memcg protection quota.
> https://lore.kernel.org/linux-mm/[email protected]/
> v1:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> chengkaitao (2):
> mm: memcontrol: protect the memory in cgroup from being oom killed
> memcg: add oom_kill_inherit event indicator
>
> Documentation/admin-guide/cgroup-v2.rst | 29 ++++-
> fs/proc/base.c | 17 ++-
> include/linux/memcontrol.h | 46 +++++++-
> include/linux/oom.h | 3 +-
> include/linux/page_counter.h | 6 +
> mm/memcontrol.c | 199 ++++++++++++++++++++++++++++++++
> mm/oom_kill.c | 25 ++--
> mm/page_counter.c | 30 +++++
> 8 files changed, 334 insertions(+), 21 deletions(-)
>
> --
> 2.14.1
>
>
Perhaps this is only slightly relevant, but at Google we do have a
different per-memcg approach to protect from OOM kills, or more
specifically tell the kernel how we would like the OOM killer to
behave.
We define an interface called memory.oom_score_badness, and we also
allow it to be specified per-process through a procfs interface,
similar to oom_score_adj.
These scores essentially tell the OOM killer the order in which we
prefer memcgs to be OOM'd, and the order in which we want processes in
the memcg to be OOM'd. By default, all processes and memcgs start with
the same score. Ties are broken based on the rss of the process or the
usage of the memcg (prefer to kill the process/memcg that will free
more memory) -- similar to the current OOM killer.
This has been brought up before in other discussions without much
interest [1], but just thought it may be relevant here.
[1]https://lore.kernel.org/lkml/CAHS8izN3ej1mqUpnNQ8c-1Bx5EeO7q5NOkh0qrY_4PLqc8rkHA@mail.gmail.com/#t
At 2023-05-17 14:59:06, "Yosry Ahmed" <[email protected]> wrote:
>+David Rientjes
>
>On Tue, May 16, 2023 at 8:20 PM chengkaitao <[email protected]> wrote:
>>
>> Establish a new OOM score algorithm, supports the cgroup level OOM
>> protection mechanism. When an global/memcg oom event occurs, we treat
>> all processes in the cgroup as a whole, and OOM killers need to select
>> the process to kill based on the protection quota of the cgroup.
>>
>
>Perhaps this is only slightly relevant, but at Google we do have a
>different per-memcg approach to protect from OOM kills, or more
>specifically tell the kernel how we would like the OOM killer to
>behave.
>
>We define an interface called memory.oom_score_badness, and we also
>allow it to be specified per-process through a procfs interface,
>similar to oom_score_adj.
>
>These scores essentially tell the OOM killer the order in which we
>prefer memcgs to be OOM'd, and the order in which we want processes in
>the memcg to be OOM'd. By default, all processes and memcgs start with
>the same score. Ties are broken based on the rss of the process or the
>usage of the memcg (prefer to kill the process/memcg that will free
>more memory) -- similar to the current OOM killer.
Thank you for providing a new application scenario. You have described a
new per-memcg approach, but a simple introduction cannot explain the
details of your approach clearly. If you could compare and analyze my
patches for possible defects, or if your new approach has advantages
that my patches do not have, I would greatly appreciate it.
>This has been brought up before in other discussions without much
>interest [1], but just thought it may be relevant here.
>
>[1]https://lore.kernel.org/lkml/CAHS8izN3ej1mqUpnNQ8c-1Bx5EeO7q5NOkh0qrY_4PLqc8rkHA@mail.gmail.com/#t
--
Thanks for your comment!
chengkaitao
On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
<[email protected]> wrote:
>
> At 2023-05-17 14:59:06, "Yosry Ahmed" <[email protected]> wrote:
> >+David Rientjes
> >
> >On Tue, May 16, 2023 at 8:20 PM chengkaitao <[email protected]> wrote:
> >>
> >> Establish a new OOM score algorithm, supports the cgroup level OOM
> >> protection mechanism. When an global/memcg oom event occurs, we treat
> >> all processes in the cgroup as a whole, and OOM killers need to select
> >> the process to kill based on the protection quota of the cgroup.
> >>
> >
> >Perhaps this is only slightly relevant, but at Google we do have a
> >different per-memcg approach to protect from OOM kills, or more
> >specifically tell the kernel how we would like the OOM killer to
> >behave.
> >
> >We define an interface called memory.oom_score_badness, and we also
> >allow it to be specified per-process through a procfs interface,
> >similar to oom_score_adj.
> >
> >These scores essentially tell the OOM killer the order in which we
> >prefer memcgs to be OOM'd, and the order in which we want processes in
> >the memcg to be OOM'd. By default, all processes and memcgs start with
> >the same score. Ties are broken based on the rss of the process or the
> >usage of the memcg (prefer to kill the process/memcg that will free
> >more memory) -- similar to the current OOM killer.
>
> Thank you for providing a new application scenario. You have described a
> new per-memcg approach, but a simple introduction cannot explain the
> details of your approach clearly. If you could compare and analyze my
> patches for possible defects, or if your new approach has advantages
> that my patches do not have, I would greatly appreciate it.
Sorry if I was not clear, I am not implying in any way that the
approach I am describing is better than your patches. I am guilty of
not conducting the proper analysis you are requesting.
I just saw the thread and thought it might be interesting to you or
others to know the approach that we have been using for years in our
production. I guess the target is the same, be able to tell the OOM
killer which memcgs/processes are more important to protect. The
fundamental difference is that instead of tuning this based on the
memory usage of the memcg (your approach), we essentially give the OOM
killer the ordering in which we want memcgs/processes to be OOM
killed. This maps to jobs priorities essentially.
If this approach works for you (or any other audience), that's great,
I can share more details and perhaps we can reach something that we
can both use :)
>
> >This has been brought up before in other discussions without much
> >interest [1], but just thought it may be relevant here.
> >
> >[1]https://lore.kernel.org/lkml/CAHS8izN3ej1mqUpnNQ8c-1Bx5EeO7q5NOkh0qrY_4PLqc8rkHA@mail.gmail.com/#t
>
> --
> Thanks for your comment!
> chengkaitao
>
At 2023-05-17 16:09:50, "Yosry Ahmed" <[email protected]> wrote:
>On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
><[email protected]> wrote:
>>
>> At 2023-05-17 14:59:06, "Yosry Ahmed" <[email protected]> wrote:
>> >+David Rientjes
>> >
>> >On Tue, May 16, 2023 at 8:20 PM chengkaitao <[email protected]> wrote:
>> >>
>> >> Establish a new OOM score algorithm, supports the cgroup level OOM
>> >> protection mechanism. When an global/memcg oom event occurs, we treat
>> >> all processes in the cgroup as a whole, and OOM killers need to select
>> >> the process to kill based on the protection quota of the cgroup.
>> >>
>> >
>> >Perhaps this is only slightly relevant, but at Google we do have a
>> >different per-memcg approach to protect from OOM kills, or more
>> >specifically tell the kernel how we would like the OOM killer to
>> >behave.
>> >
>> >We define an interface called memory.oom_score_badness, and we also
>> >allow it to be specified per-process through a procfs interface,
>> >similar to oom_score_adj.
>> >
>> >These scores essentially tell the OOM killer the order in which we
>> >prefer memcgs to be OOM'd, and the order in which we want processes in
>> >the memcg to be OOM'd. By default, all processes and memcgs start with
>> >the same score. Ties are broken based on the rss of the process or the
>> >usage of the memcg (prefer to kill the process/memcg that will free
>> >more memory) -- similar to the current OOM killer.
>>
>> Thank you for providing a new application scenario. You have described a
>> new per-memcg approach, but a simple introduction cannot explain the
>> details of your approach clearly. If you could compare and analyze my
>> patches for possible defects, or if your new approach has advantages
>> that my patches do not have, I would greatly appreciate it.
>
>Sorry if I was not clear, I am not implying in any way that the
>approach I am describing is better than your patches. I am guilty of
>not conducting the proper analysis you are requesting.
There is no perfect approach in the world, and I also seek your advice with
a learning attitude. You don't need to say sorry, I should say thank you.
>I just saw the thread and thought it might be interesting to you or
>others to know the approach that we have been using for years in our
>production. I guess the target is the same, be able to tell the OOM
>killer which memcgs/processes are more important to protect. The
>fundamental difference is that instead of tuning this based on the
>memory usage of the memcg (your approach), we essentially give the OOM
>killer the ordering in which we want memcgs/processes to be OOM
>killed. This maps to jobs priorities essentially.
Killing processes in order of memory usage cannot effectively protect
important processes. Killing processes in a user-defined priority order
will result in a large number of OOM events and still not being able to
release enough memory. I have been searching for a balance between
the two methods, so that their shortcomings are not too obvious.
The biggest advantage of memcg is its tree topology, and I also hope
to make good use of it.
>If this approach works for you (or any other audience), that's great,
>I can share more details and perhaps we can reach something that we
>can both use :)
If you have a good idea, please share more details or show some code.
I would greatly appreciate it
>>
>> >This has been brought up before in other discussions without much
>> >interest [1], but just thought it may be relevant here.
>> >
>> >[1]https://lore.kernel.org/lkml/CAHS8izN3ej1mqUpnNQ8c-1Bx5EeO7q5NOkh0qrY_4PLqc8rkHA@mail.gmail.com/#t
--
Thanks for your comment!
chengkaitao
On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
<[email protected]> wrote:
>
> At 2023-05-17 16:09:50, "Yosry Ahmed" <[email protected]> wrote:
> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
> ><[email protected]> wrote:
> >>
> >> At 2023-05-17 14:59:06, "Yosry Ahmed" <[email protected]> wrote:
> >> >+David Rientjes
> >> >
> >> >On Tue, May 16, 2023 at 8:20 PM chengkaitao <[email protected]> wrote:
> >> >>
> >> >> Establish a new OOM score algorithm, supports the cgroup level OOM
> >> >> protection mechanism. When an global/memcg oom event occurs, we treat
> >> >> all processes in the cgroup as a whole, and OOM killers need to select
> >> >> the process to kill based on the protection quota of the cgroup.
> >> >>
> >> >
> >> >Perhaps this is only slightly relevant, but at Google we do have a
> >> >different per-memcg approach to protect from OOM kills, or more
> >> >specifically tell the kernel how we would like the OOM killer to
> >> >behave.
> >> >
> >> >We define an interface called memory.oom_score_badness, and we also
> >> >allow it to be specified per-process through a procfs interface,
> >> >similar to oom_score_adj.
> >> >
> >> >These scores essentially tell the OOM killer the order in which we
> >> >prefer memcgs to be OOM'd, and the order in which we want processes in
> >> >the memcg to be OOM'd. By default, all processes and memcgs start with
> >> >the same score. Ties are broken based on the rss of the process or the
> >> >usage of the memcg (prefer to kill the process/memcg that will free
> >> >more memory) -- similar to the current OOM killer.
> >>
> >> Thank you for providing a new application scenario. You have described a
> >> new per-memcg approach, but a simple introduction cannot explain the
> >> details of your approach clearly. If you could compare and analyze my
> >> patches for possible defects, or if your new approach has advantages
> >> that my patches do not have, I would greatly appreciate it.
> >
> >Sorry if I was not clear, I am not implying in any way that the
> >approach I am describing is better than your patches. I am guilty of
> >not conducting the proper analysis you are requesting.
>
> There is no perfect approach in the world, and I also seek your advice with
> a learning attitude. You don't need to say sorry, I should say thank you.
>
> >I just saw the thread and thought it might be interesting to you or
> >others to know the approach that we have been using for years in our
> >production. I guess the target is the same, be able to tell the OOM
> >killer which memcgs/processes are more important to protect. The
> >fundamental difference is that instead of tuning this based on the
> >memory usage of the memcg (your approach), we essentially give the OOM
> >killer the ordering in which we want memcgs/processes to be OOM
> >killed. This maps to jobs priorities essentially.
>
> Killing processes in order of memory usage cannot effectively protect
> important processes. Killing processes in a user-defined priority order
> will result in a large number of OOM events and still not being able to
> release enough memory. I have been searching for a balance between
> the two methods, so that their shortcomings are not too obvious.
> The biggest advantage of memcg is its tree topology, and I also hope
> to make good use of it.
For us, killing processes in a user-defined priority order works well.
It seems like to tune memory.oom.protect you use oom_kill_inherit to
observe how many times this memcg has been killed due to a limit in an
ancestor. Wouldn't it be more straightforward to specify the priority
of protections among memcgs?
For example, if you observe multiple memcgs being OOM killed due to
hitting an ancestor limit, you will need to decide which of them to
increase memory.oom.protect for more, based on their importance.
Otherwise, if you increase all of them, then there is no point if all
the memory is protected, right?
In this case, wouldn't it be easier to just tell the OOM killer the
relative priority among the memcgs?
>
> >If this approach works for you (or any other audience), that's great,
> >I can share more details and perhaps we can reach something that we
> >can both use :)
>
> If you have a good idea, please share more details or show some code.
> I would greatly appreciate it
The code we have needs to be rebased onto a different version and
cleaned up before it can be shared, but essentially it is as
described.
(a) All processes and memcgs start with a default score.
(b) Userspace can specify scores for memcgs and processes. A higher
score means higher priority (aka less score gets killed first).
(c) The OOM killer essentially looks for the memcg with the lowest
scores to kill, then among this memcg, it looks for the process with
the lowest score. Ties are broken based on usage, so essentially if
all processes/memcgs have the default score, we fallback to the
current OOM behavior.
>
> >>
> >> >This has been brought up before in other discussions without much
> >> >interest [1], but just thought it may be relevant here.
> >> >
> >> >[1]https://lore.kernel.org/lkml/CAHS8izN3ej1mqUpnNQ8c-1Bx5EeO7q5NOkh0qrY_4PLqc8rkHA@mail.gmail.com/#t
>
> --
> Thanks for your comment!
> chengkaitao
>
At 2023-05-18 04:42:12, "Yosry Ahmed" <[email protected]> wrote:
>On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
><[email protected]> wrote:
>>
>> At 2023-05-17 16:09:50, "Yosry Ahmed" <[email protected]> wrote:
>> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
>> ><[email protected]> wrote:
>> >>
>> >> At 2023-05-17 14:59:06, "Yosry Ahmed" <[email protected]> wrote:
>> >> >+David Rientjes
>> >> >
>> >> >On Tue, May 16, 2023 at 8:20 PM chengkaitao <[email protected]> wrote:
>> >> >>
>> >> Thank you for providing a new application scenario. You have described a
>> >> new per-memcg approach, but a simple introduction cannot explain the
>> >> details of your approach clearly. If you could compare and analyze my
>> >> patches for possible defects, or if your new approach has advantages
>> >> that my patches do not have, I would greatly appreciate it.
>> >
>> >Sorry if I was not clear, I am not implying in any way that the
>> >approach I am describing is better than your patches. I am guilty of
>> >not conducting the proper analysis you are requesting.
>>
>> There is no perfect approach in the world, and I also seek your advice with
>> a learning attitude. You don't need to say sorry, I should say thank you.
>>
>> >I just saw the thread and thought it might be interesting to you or
>> >others to know the approach that we have been using for years in our
>> >production. I guess the target is the same, be able to tell the OOM
>> >killer which memcgs/processes are more important to protect. The
>> >fundamental difference is that instead of tuning this based on the
>> >memory usage of the memcg (your approach), we essentially give the OOM
>> >killer the ordering in which we want memcgs/processes to be OOM
>> >killed. This maps to jobs priorities essentially.
>>
>> Killing processes in order of memory usage cannot effectively protect
>> important processes. Killing processes in a user-defined priority order
>> will result in a large number of OOM events and still not being able to
>> release enough memory. I have been searching for a balance between
>> the two methods, so that their shortcomings are not too obvious.
>> The biggest advantage of memcg is its tree topology, and I also hope
>> to make good use of it.
>
>For us, killing processes in a user-defined priority order works well.
>
>It seems like to tune memory.oom.protect you use oom_kill_inherit to
>observe how many times this memcg has been killed due to a limit in an
>ancestor. Wouldn't it be more straightforward to specify the priority
>of protections among memcgs?
>
>For example, if you observe multiple memcgs being OOM killed due to
>hitting an ancestor limit, you will need to decide which of them to
>increase memory.oom.protect for more, based on their importance.
>Otherwise, if you increase all of them, then there is no point if all
>the memory is protected, right?
If all memory in memcg is protected, its meaning is similar to that of the
highest priority memcg in your approach, which is ultimately killed or
never killed.
>In this case, wouldn't it be easier to just tell the OOM killer the
>relative priority among the memcgs?
>
>>
>> >If this approach works for you (or any other audience), that's great,
>> >I can share more details and perhaps we can reach something that we
>> >can both use :)
>>
>> If you have a good idea, please share more details or show some code.
>> I would greatly appreciate it
>
>The code we have needs to be rebased onto a different version and
>cleaned up before it can be shared, but essentially it is as
>described.
>
>(a) All processes and memcgs start with a default score.
>(b) Userspace can specify scores for memcgs and processes. A higher
>score means higher priority (aka less score gets killed first).
>(c) The OOM killer essentially looks for the memcg with the lowest
>scores to kill, then among this memcg, it looks for the process with
>the lowest score. Ties are broken based on usage, so essentially if
>all processes/memcgs have the default score, we fallback to the
>current OOM behavior.
If memory oversold is severe, all processes of the lowest priority
memcg may be killed before selecting other memcg processes.
If there are 1000 processes with almost zero memory usage in
the lowest priority memcg, 1000 invalid kill events may occur.
To avoid this situation, even for the lowest priority memcg,
I will leave him a very small oom.protect quota.
If faced with two memcgs with the same total memory usage and
priority, memcg A has more processes but less memory usage per
single process, and memcg B has fewer processes but more
memory usage per single process, then when OOM occurs, the
processes in memcg B may continue to be killed until all processes
in memcg B are killed, which is unfair to memcg B because memcg A
also occupies a large amount of memory.
Dose your approach have these issues? Killing processes in a
user-defined priority is indeed easier and can work well in most cases,
but I have been trying to solve the cases that it cannot cover.
--
Thanks for your comment!
chengkaitao
On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng
<[email protected]> wrote:
>
> At 2023-05-18 04:42:12, "Yosry Ahmed" <[email protected]> wrote:
> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
> ><[email protected]> wrote:
> >>
> >> At 2023-05-17 16:09:50, "Yosry Ahmed" <[email protected]> wrote:
> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
> >> ><[email protected]> wrote:
> >> >>
> >> >> At 2023-05-17 14:59:06, "Yosry Ahmed" <[email protected]> wrote:
> >> >> >+David Rientjes
> >> >> >
> >> >> >On Tue, May 16, 2023 at 8:20 PM chengkaitao <[email protected]> wrote:
> >> >> >>
> >> >> Thank you for providing a new application scenario. You have described a
> >> >> new per-memcg approach, but a simple introduction cannot explain the
> >> >> details of your approach clearly. If you could compare and analyze my
> >> >> patches for possible defects, or if your new approach has advantages
> >> >> that my patches do not have, I would greatly appreciate it.
> >> >
> >> >Sorry if I was not clear, I am not implying in any way that the
> >> >approach I am describing is better than your patches. I am guilty of
> >> >not conducting the proper analysis you are requesting.
> >>
> >> There is no perfect approach in the world, and I also seek your advice with
> >> a learning attitude. You don't need to say sorry, I should say thank you.
> >>
> >> >I just saw the thread and thought it might be interesting to you or
> >> >others to know the approach that we have been using for years in our
> >> >production. I guess the target is the same, be able to tell the OOM
> >> >killer which memcgs/processes are more important to protect. The
> >> >fundamental difference is that instead of tuning this based on the
> >> >memory usage of the memcg (your approach), we essentially give the OOM
> >> >killer the ordering in which we want memcgs/processes to be OOM
> >> >killed. This maps to jobs priorities essentially.
> >>
> >> Killing processes in order of memory usage cannot effectively protect
> >> important processes. Killing processes in a user-defined priority order
> >> will result in a large number of OOM events and still not being able to
> >> release enough memory. I have been searching for a balance between
> >> the two methods, so that their shortcomings are not too obvious.
> >> The biggest advantage of memcg is its tree topology, and I also hope
> >> to make good use of it.
> >
> >For us, killing processes in a user-defined priority order works well.
> >
> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
> >observe how many times this memcg has been killed due to a limit in an
> >ancestor. Wouldn't it be more straightforward to specify the priority
> >of protections among memcgs?
> >
> >For example, if you observe multiple memcgs being OOM killed due to
> >hitting an ancestor limit, you will need to decide which of them to
> >increase memory.oom.protect for more, based on their importance.
> >Otherwise, if you increase all of them, then there is no point if all
> >the memory is protected, right?
>
> If all memory in memcg is protected, its meaning is similar to that of the
> highest priority memcg in your approach, which is ultimately killed or
> never killed.
Makes sense. I believe it gets a bit trickier when you want to
describe relative ordering between memcgs using memory.oom.protect.
>
> >In this case, wouldn't it be easier to just tell the OOM killer the
> >relative priority among the memcgs?
> >
> >>
> >> >If this approach works for you (or any other audience), that's great,
> >> >I can share more details and perhaps we can reach something that we
> >> >can both use :)
> >>
> >> If you have a good idea, please share more details or show some code.
> >> I would greatly appreciate it
> >
> >The code we have needs to be rebased onto a different version and
> >cleaned up before it can be shared, but essentially it is as
> >described.
> >
> >(a) All processes and memcgs start with a default score.
> >(b) Userspace can specify scores for memcgs and processes. A higher
> >score means higher priority (aka less score gets killed first).
> >(c) The OOM killer essentially looks for the memcg with the lowest
> >scores to kill, then among this memcg, it looks for the process with
> >the lowest score. Ties are broken based on usage, so essentially if
> >all processes/memcgs have the default score, we fallback to the
> >current OOM behavior.
>
> If memory oversold is severe, all processes of the lowest priority
> memcg may be killed before selecting other memcg processes.
> If there are 1000 processes with almost zero memory usage in
> the lowest priority memcg, 1000 invalid kill events may occur.
> To avoid this situation, even for the lowest priority memcg,
> I will leave him a very small oom.protect quota.
I checked internally, and this is indeed something that we see from
time to time. We try to avoid that with userspace OOM killing, but
it's not 100% effective.
>
> If faced with two memcgs with the same total memory usage and
> priority, memcg A has more processes but less memory usage per
> single process, and memcg B has fewer processes but more
> memory usage per single process, then when OOM occurs, the
> processes in memcg B may continue to be killed until all processes
> in memcg B are killed, which is unfair to memcg B because memcg A
> also occupies a large amount of memory.
I believe in this case we will kill one process in memcg B, then the
usage of memcg A will become higher, so we will pick a process from
memcg A next.
>
> Dose your approach have these issues? Killing processes in a
> user-defined priority is indeed easier and can work well in most cases,
> but I have been trying to solve the cases that it cannot cover.
The first issue is relatable with our approach. Let me dig more info
from our internal teams and get back to you with more details.
>
> --
> Thanks for your comment!
> chengkaitao
>
>
At 2023-05-20 06:04:26, "Yosry Ahmed" <[email protected]> wrote:
>On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng
><[email protected]> wrote:
>>
>> At 2023-05-18 04:42:12, "Yosry Ahmed" <[email protected]> wrote:
>> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
>> ><[email protected]> wrote:
>> >>
>> >> At 2023-05-17 16:09:50, "Yosry Ahmed" <[email protected]> wrote:
>> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
>> >> ><[email protected]> wrote:
>> >> >>
>> >>
>> >> Killing processes in order of memory usage cannot effectively protect
>> >> important processes. Killing processes in a user-defined priority order
>> >> will result in a large number of OOM events and still not being able to
>> >> release enough memory. I have been searching for a balance between
>> >> the two methods, so that their shortcomings are not too obvious.
>> >> The biggest advantage of memcg is its tree topology, and I also hope
>> >> to make good use of it.
>> >
>> >For us, killing processes in a user-defined priority order works well.
>> >
>> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
>> >observe how many times this memcg has been killed due to a limit in an
>> >ancestor. Wouldn't it be more straightforward to specify the priority
>> >of protections among memcgs?
>> >
>> >For example, if you observe multiple memcgs being OOM killed due to
>> >hitting an ancestor limit, you will need to decide which of them to
>> >increase memory.oom.protect for more, based on their importance.
>> >Otherwise, if you increase all of them, then there is no point if all
>> >the memory is protected, right?
>>
>> If all memory in memcg is protected, its meaning is similar to that of the
>> highest priority memcg in your approach, which is ultimately killed or
>> never killed.
>
>Makes sense. I believe it gets a bit trickier when you want to
>describe relative ordering between memcgs using memory.oom.protect.
Actually, my original intention was not to use memory.oom.protect to
achieve relative ordering between memcgs, it was just a feature that
happened to be achievable. My initial idea was to protect a certain
proportion of memory in memcg from being killed, and through the
method, physical memory can be reasonably planned. Both the physical
machine manager and container manager can add some unimportant
loads beyond the oom.protect limit, greatly improving the oversold
rate of memory. In the worst case scenario, the physical machine can
always provide all the memory limited by memory.oom.protect for memcg.
On the other hand, I also want to achieve relative ordering of internal
processes in memcg, not just a unified ordering of all memcgs on
physical machines.
>> >In this case, wouldn't it be easier to just tell the OOM killer the
>> >relative priority among the memcgs?
>> >
>> >>
>> >> >If this approach works for you (or any other audience), that's great,
>> >> >I can share more details and perhaps we can reach something that we
>> >> >can both use :)
>> >>
>> >> If you have a good idea, please share more details or show some code.
>> >> I would greatly appreciate it
>> >
>> >The code we have needs to be rebased onto a different version and
>> >cleaned up before it can be shared, but essentially it is as
>> >described.
>> >
>> >(a) All processes and memcgs start with a default score.
>> >(b) Userspace can specify scores for memcgs and processes. A higher
>> >score means higher priority (aka less score gets killed first).
>> >(c) The OOM killer essentially looks for the memcg with the lowest
>> >scores to kill, then among this memcg, it looks for the process with
>> >the lowest score. Ties are broken based on usage, so essentially if
>> >all processes/memcgs have the default score, we fallback to the
>> >current OOM behavior.
>>
>> If memory oversold is severe, all processes of the lowest priority
>> memcg may be killed before selecting other memcg processes.
>> If there are 1000 processes with almost zero memory usage in
>> the lowest priority memcg, 1000 invalid kill events may occur.
>> To avoid this situation, even for the lowest priority memcg,
>> I will leave him a very small oom.protect quota.
>
>I checked internally, and this is indeed something that we see from
>time to time. We try to avoid that with userspace OOM killing, but
>it's not 100% effective.
>
>>
>> If faced with two memcgs with the same total memory usage and
>> priority, memcg A has more processes but less memory usage per
>> single process, and memcg B has fewer processes but more
>> memory usage per single process, then when OOM occurs, the
>> processes in memcg B may continue to be killed until all processes
>> in memcg B are killed, which is unfair to memcg B because memcg A
>> also occupies a large amount of memory.
>
>I believe in this case we will kill one process in memcg B, then the
>usage of memcg A will become higher, so we will pick a process from
>memcg A next.
If there is only one process in memcg A and its memory usage is higher
than any other process in memcg B, but the total memory usage of
memcg A is lower than that of memcg B. In this case, if the OOM-killer
still chooses the process in memcg A. it may be unfair to memcg A.
>> Dose your approach have these issues? Killing processes in a
>> user-defined priority is indeed easier and can work well in most cases,
>> but I have been trying to solve the cases that it cannot cover.
>
>The first issue is relatable with our approach. Let me dig more info
>from our internal teams and get back to you with more details.
--
Thanks for your comment!
chengkaitao
On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng
<[email protected]> wrote:
>
> At 2023-05-20 06:04:26, "Yosry Ahmed" <[email protected]> wrote:
> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng
> ><[email protected]> wrote:
> >>
> >> At 2023-05-18 04:42:12, "Yosry Ahmed" <[email protected]> wrote:
> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
> >> ><[email protected]> wrote:
> >> >>
> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" <[email protected]> wrote:
> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
> >> >> ><[email protected]> wrote:
> >> >> >>
> >> >>
> >> >> Killing processes in order of memory usage cannot effectively protect
> >> >> important processes. Killing processes in a user-defined priority order
> >> >> will result in a large number of OOM events and still not being able to
> >> >> release enough memory. I have been searching for a balance between
> >> >> the two methods, so that their shortcomings are not too obvious.
> >> >> The biggest advantage of memcg is its tree topology, and I also hope
> >> >> to make good use of it.
> >> >
> >> >For us, killing processes in a user-defined priority order works well.
> >> >
> >> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
> >> >observe how many times this memcg has been killed due to a limit in an
> >> >ancestor. Wouldn't it be more straightforward to specify the priority
> >> >of protections among memcgs?
> >> >
> >> >For example, if you observe multiple memcgs being OOM killed due to
> >> >hitting an ancestor limit, you will need to decide which of them to
> >> >increase memory.oom.protect for more, based on their importance.
> >> >Otherwise, if you increase all of them, then there is no point if all
> >> >the memory is protected, right?
> >>
> >> If all memory in memcg is protected, its meaning is similar to that of the
> >> highest priority memcg in your approach, which is ultimately killed or
> >> never killed.
> >
> >Makes sense. I believe it gets a bit trickier when you want to
> >describe relative ordering between memcgs using memory.oom.protect.
>
> Actually, my original intention was not to use memory.oom.protect to
> achieve relative ordering between memcgs, it was just a feature that
> happened to be achievable. My initial idea was to protect a certain
> proportion of memory in memcg from being killed, and through the
> method, physical memory can be reasonably planned. Both the physical
> machine manager and container manager can add some unimportant
> loads beyond the oom.protect limit, greatly improving the oversold
> rate of memory. In the worst case scenario, the physical machine can
> always provide all the memory limited by memory.oom.protect for memcg.
>
> On the other hand, I also want to achieve relative ordering of internal
> processes in memcg, not just a unified ordering of all memcgs on
> physical machines.
For us, having a strict priority ordering-based selection is
essential. We have different tiers of jobs of different importance,
and a job of higher priority should not be killed before a lower
priority task if possible, no matter how much memory either of them is
using. Protecting memcgs solely based on their usage can be useful in
some scenarios, but not in a system where you have different tiers of
jobs running with strict priority ordering.
>
> >> >In this case, wouldn't it be easier to just tell the OOM killer the
> >> >relative priority among the memcgs?
> >> >
> >> >>
> >> >> >If this approach works for you (or any other audience), that's great,
> >> >> >I can share more details and perhaps we can reach something that we
> >> >> >can both use :)
> >> >>
> >> >> If you have a good idea, please share more details or show some code.
> >> >> I would greatly appreciate it
> >> >
> >> >The code we have needs to be rebased onto a different version and
> >> >cleaned up before it can be shared, but essentially it is as
> >> >described.
> >> >
> >> >(a) All processes and memcgs start with a default score.
> >> >(b) Userspace can specify scores for memcgs and processes. A higher
> >> >score means higher priority (aka less score gets killed first).
> >> >(c) The OOM killer essentially looks for the memcg with the lowest
> >> >scores to kill, then among this memcg, it looks for the process with
> >> >the lowest score. Ties are broken based on usage, so essentially if
> >> >all processes/memcgs have the default score, we fallback to the
> >> >current OOM behavior.
> >>
> >> If memory oversold is severe, all processes of the lowest priority
> >> memcg may be killed before selecting other memcg processes.
> >> If there are 1000 processes with almost zero memory usage in
> >> the lowest priority memcg, 1000 invalid kill events may occur.
> >> To avoid this situation, even for the lowest priority memcg,
> >> I will leave him a very small oom.protect quota.
> >
> >I checked internally, and this is indeed something that we see from
> >time to time. We try to avoid that with userspace OOM killing, but
> >it's not 100% effective.
> >
> >>
> >> If faced with two memcgs with the same total memory usage and
> >> priority, memcg A has more processes but less memory usage per
> >> single process, and memcg B has fewer processes but more
> >> memory usage per single process, then when OOM occurs, the
> >> processes in memcg B may continue to be killed until all processes
> >> in memcg B are killed, which is unfair to memcg B because memcg A
> >> also occupies a large amount of memory.
> >
> >I believe in this case we will kill one process in memcg B, then the
> >usage of memcg A will become higher, so we will pick a process from
> >memcg A next.
>
> If there is only one process in memcg A and its memory usage is higher
> than any other process in memcg B, but the total memory usage of
> memcg A is lower than that of memcg B. In this case, if the OOM-killer
> still chooses the process in memcg A. it may be unfair to memcg A.
>
> >> Dose your approach have these issues? Killing processes in a
> >> user-defined priority is indeed easier and can work well in most cases,
> >> but I have been trying to solve the cases that it cannot cover.
> >
> >The first issue is relatable with our approach. Let me dig more info
> >from our internal teams and get back to you with more details.
>
> --
> Thanks for your comment!
> chengkaitao
>
>
At 2023-05-24 06:02:55, "Yosry Ahmed" <[email protected]> wrote:
>On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng
><[email protected]> wrote:
>>
>> At 2023-05-20 06:04:26, "Yosry Ahmed" <[email protected]> wrote:
>> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng
>> ><[email protected]> wrote:
>> >>
>> >> At 2023-05-18 04:42:12, "Yosry Ahmed" <[email protected]> wrote:
>> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
>> >> ><[email protected]> wrote:
>> >> >>
>> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" <[email protected]> wrote:
>> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
>> >> >> ><[email protected]> wrote:
>> >> >> >>
>> >> >>
>> >> >> Killing processes in order of memory usage cannot effectively protect
>> >> >> important processes. Killing processes in a user-defined priority order
>> >> >> will result in a large number of OOM events and still not being able to
>> >> >> release enough memory. I have been searching for a balance between
>> >> >> the two methods, so that their shortcomings are not too obvious.
>> >> >> The biggest advantage of memcg is its tree topology, and I also hope
>> >> >> to make good use of it.
>> >> >
>> >> >For us, killing processes in a user-defined priority order works well.
>> >> >
>> >> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
>> >> >observe how many times this memcg has been killed due to a limit in an
>> >> >ancestor. Wouldn't it be more straightforward to specify the priority
>> >> >of protections among memcgs?
>> >> >
>> >> >For example, if you observe multiple memcgs being OOM killed due to
>> >> >hitting an ancestor limit, you will need to decide which of them to
>> >> >increase memory.oom.protect for more, based on their importance.
>> >> >Otherwise, if you increase all of them, then there is no point if all
>> >> >the memory is protected, right?
>> >>
>> >> If all memory in memcg is protected, its meaning is similar to that of the
>> >> highest priority memcg in your approach, which is ultimately killed or
>> >> never killed.
>> >
>> >Makes sense. I believe it gets a bit trickier when you want to
>> >describe relative ordering between memcgs using memory.oom.protect.
>>
>> Actually, my original intention was not to use memory.oom.protect to
>> achieve relative ordering between memcgs, it was just a feature that
>> happened to be achievable. My initial idea was to protect a certain
>> proportion of memory in memcg from being killed, and through the
>> method, physical memory can be reasonably planned. Both the physical
>> machine manager and container manager can add some unimportant
>> loads beyond the oom.protect limit, greatly improving the oversold
>> rate of memory. In the worst case scenario, the physical machine can
>> always provide all the memory limited by memory.oom.protect for memcg.
>>
>> On the other hand, I also want to achieve relative ordering of internal
>> processes in memcg, not just a unified ordering of all memcgs on
>> physical machines.
>
>For us, having a strict priority ordering-based selection is
>essential. We have different tiers of jobs of different importance,
>and a job of higher priority should not be killed before a lower
>priority task if possible, no matter how much memory either of them is
>using. Protecting memcgs solely based on their usage can be useful in
>some scenarios, but not in a system where you have different tiers of
>jobs running with strict priority ordering.
If you want to run with strict priority ordering, it can also be achieved,
but it may be quite troublesome. The directory structure shown below
can achieve the goal.
root
/ \
cgroup A cgroup B
(protect=max) (protect=0)
/ \
cgroup C cgroup D
(protect=max) (protect=0)
/ \
cgroup E cgroup F
(protect=max) (protect=0)
Oom kill order: F > E > C > A
As mentioned earlier, "running with strict priority ordering" may be
some extreme issues, that requires the manager to make a choice.
>>
>> >> >In this case, wouldn't it be easier to just tell the OOM killer the
>> >> >relative priority among the memcgs?
>> >> >
>> >> >>
>> >> >> >If this approach works for you (or any other audience), that's great,
>> >> >> >I can share more details and perhaps we can reach something that we
>> >> >> >can both use :)
>> >> >>
>> >> >> If you have a good idea, please share more details or show some code.
>> >> >> I would greatly appreciate it
>> >> >
>> >> >The code we have needs to be rebased onto a different version and
>> >> >cleaned up before it can be shared, but essentially it is as
>> >> >described.
>> >> >
>> >> >(a) All processes and memcgs start with a default score.
>> >> >(b) Userspace can specify scores for memcgs and processes. A higher
>> >> >score means higher priority (aka less score gets killed first).
>> >> >(c) The OOM killer essentially looks for the memcg with the lowest
>> >> >scores to kill, then among this memcg, it looks for the process with
>> >> >the lowest score. Ties are broken based on usage, so essentially if
>> >> >all processes/memcgs have the default score, we fallback to the
>> >> >current OOM behavior.
>> >>
>> >> If memory oversold is severe, all processes of the lowest priority
>> >> memcg may be killed before selecting other memcg processes.
>> >> If there are 1000 processes with almost zero memory usage in
>> >> the lowest priority memcg, 1000 invalid kill events may occur.
>> >> To avoid this situation, even for the lowest priority memcg,
>> >> I will leave him a very small oom.protect quota.
>> >
>> >I checked internally, and this is indeed something that we see from
>> >time to time. We try to avoid that with userspace OOM killing, but
>> >it's not 100% effective.
>> >
>> >>
>> >> If faced with two memcgs with the same total memory usage and
>> >> priority, memcg A has more processes but less memory usage per
>> >> single process, and memcg B has fewer processes but more
>> >> memory usage per single process, then when OOM occurs, the
>> >> processes in memcg B may continue to be killed until all processes
>> >> in memcg B are killed, which is unfair to memcg B because memcg A
>> >> also occupies a large amount of memory.
>> >
>> >I believe in this case we will kill one process in memcg B, then the
>> >usage of memcg A will become higher, so we will pick a process from
>> >memcg A next.
>>
>> If there is only one process in memcg A and its memory usage is higher
>> than any other process in memcg B, but the total memory usage of
>> memcg A is lower than that of memcg B. In this case, if the OOM-killer
>> still chooses the process in memcg A. it may be unfair to memcg A.
>>
>> >> Dose your approach have these issues? Killing processes in a
>> >> user-defined priority is indeed easier and can work well in most cases,
>> >> but I have been trying to solve the cases that it cannot cover.
>> >
>> >The first issue is relatable with our approach. Let me dig more info
>> >from our internal teams and get back to you with more details.
--
Thanks for your comment!
chengkaitao
On Thu, May 25, 2023 at 1:19 AM 程垲涛 Chengkaitao Cheng
<[email protected]> wrote:
>
> At 2023-05-24 06:02:55, "Yosry Ahmed" <[email protected]> wrote:
> >On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng
> ><[email protected]> wrote:
> >>
> >> At 2023-05-20 06:04:26, "Yosry Ahmed" <[email protected]> wrote:
> >> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng
> >> ><[email protected]> wrote:
> >> >>
> >> >> At 2023-05-18 04:42:12, "Yosry Ahmed" <[email protected]> wrote:
> >> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
> >> >> ><[email protected]> wrote:
> >> >> >>
> >> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" <[email protected]> wrote:
> >> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
> >> >> >> ><[email protected]> wrote:
> >> >> >> >>
> >> >> >>
> >> >> >> Killing processes in order of memory usage cannot effectively protect
> >> >> >> important processes. Killing processes in a user-defined priority order
> >> >> >> will result in a large number of OOM events and still not being able to
> >> >> >> release enough memory. I have been searching for a balance between
> >> >> >> the two methods, so that their shortcomings are not too obvious.
> >> >> >> The biggest advantage of memcg is its tree topology, and I also hope
> >> >> >> to make good use of it.
> >> >> >
> >> >> >For us, killing processes in a user-defined priority order works well.
> >> >> >
> >> >> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
> >> >> >observe how many times this memcg has been killed due to a limit in an
> >> >> >ancestor. Wouldn't it be more straightforward to specify the priority
> >> >> >of protections among memcgs?
> >> >> >
> >> >> >For example, if you observe multiple memcgs being OOM killed due to
> >> >> >hitting an ancestor limit, you will need to decide which of them to
> >> >> >increase memory.oom.protect for more, based on their importance.
> >> >> >Otherwise, if you increase all of them, then there is no point if all
> >> >> >the memory is protected, right?
> >> >>
> >> >> If all memory in memcg is protected, its meaning is similar to that of the
> >> >> highest priority memcg in your approach, which is ultimately killed or
> >> >> never killed.
> >> >
> >> >Makes sense. I believe it gets a bit trickier when you want to
> >> >describe relative ordering between memcgs using memory.oom.protect.
> >>
> >> Actually, my original intention was not to use memory.oom.protect to
> >> achieve relative ordering between memcgs, it was just a feature that
> >> happened to be achievable. My initial idea was to protect a certain
> >> proportion of memory in memcg from being killed, and through the
> >> method, physical memory can be reasonably planned. Both the physical
> >> machine manager and container manager can add some unimportant
> >> loads beyond the oom.protect limit, greatly improving the oversold
> >> rate of memory. In the worst case scenario, the physical machine can
> >> always provide all the memory limited by memory.oom.protect for memcg.
> >>
> >> On the other hand, I also want to achieve relative ordering of internal
> >> processes in memcg, not just a unified ordering of all memcgs on
> >> physical machines.
> >
> >For us, having a strict priority ordering-based selection is
> >essential. We have different tiers of jobs of different importance,
> >and a job of higher priority should not be killed before a lower
> >priority task if possible, no matter how much memory either of them is
> >using. Protecting memcgs solely based on their usage can be useful in
> >some scenarios, but not in a system where you have different tiers of
> >jobs running with strict priority ordering.
>
> If you want to run with strict priority ordering, it can also be achieved,
> but it may be quite troublesome. The directory structure shown below
> can achieve the goal.
>
> root
> / \
> cgroup A cgroup B
> (protect=max) (protect=0)
> / \
> cgroup C cgroup D
> (protect=max) (protect=0)
> / \
> cgroup E cgroup F
> (protect=max) (protect=0)
>
> Oom kill order: F > E > C > A
This requires restructuring the cgroup hierarchy which comes with a
lot of other factors, I don't think that's practically an option.
>
> As mentioned earlier, "running with strict priority ordering" may be
> some extreme issues, that requires the manager to make a choice.
We have been using strict priority ordering in our fleet for many
years now and we depend on it. Some jobs are simply more important
than others, regardless of their usage.
>
> >>
> >> >> >In this case, wouldn't it be easier to just tell the OOM killer the
> >> >> >relative priority among the memcgs?
> >> >> >
> >> >> >>
> >> >> >> >If this approach works for you (or any other audience), that's great,
> >> >> >> >I can share more details and perhaps we can reach something that we
> >> >> >> >can both use :)
> >> >> >>
> >> >> >> If you have a good idea, please share more details or show some code.
> >> >> >> I would greatly appreciate it
> >> >> >
> >> >> >The code we have needs to be rebased onto a different version and
> >> >> >cleaned up before it can be shared, but essentially it is as
> >> >> >described.
> >> >> >
> >> >> >(a) All processes and memcgs start with a default score.
> >> >> >(b) Userspace can specify scores for memcgs and processes. A higher
> >> >> >score means higher priority (aka less score gets killed first).
> >> >> >(c) The OOM killer essentially looks for the memcg with the lowest
> >> >> >scores to kill, then among this memcg, it looks for the process with
> >> >> >the lowest score. Ties are broken based on usage, so essentially if
> >> >> >all processes/memcgs have the default score, we fallback to the
> >> >> >current OOM behavior.
> >> >>
> >> >> If memory oversold is severe, all processes of the lowest priority
> >> >> memcg may be killed before selecting other memcg processes.
> >> >> If there are 1000 processes with almost zero memory usage in
> >> >> the lowest priority memcg, 1000 invalid kill events may occur.
> >> >> To avoid this situation, even for the lowest priority memcg,
> >> >> I will leave him a very small oom.protect quota.
> >> >
> >> >I checked internally, and this is indeed something that we see from
> >> >time to time. We try to avoid that with userspace OOM killing, but
> >> >it's not 100% effective.
> >> >
> >> >>
> >> >> If faced with two memcgs with the same total memory usage and
> >> >> priority, memcg A has more processes but less memory usage per
> >> >> single process, and memcg B has fewer processes but more
> >> >> memory usage per single process, then when OOM occurs, the
> >> >> processes in memcg B may continue to be killed until all processes
> >> >> in memcg B are killed, which is unfair to memcg B because memcg A
> >> >> also occupies a large amount of memory.
> >> >
> >> >I believe in this case we will kill one process in memcg B, then the
> >> >usage of memcg A will become higher, so we will pick a process from
> >> >memcg A next.
> >>
> >> If there is only one process in memcg A and its memory usage is higher
> >> than any other process in memcg B, but the total memory usage of
> >> memcg A is lower than that of memcg B. In this case, if the OOM-killer
> >> still chooses the process in memcg A. it may be unfair to memcg A.
> >>
> >> >> Dose your approach have these issues? Killing processes in a
> >> >> user-defined priority is indeed easier and can work well in most cases,
> >> >> but I have been trying to solve the cases that it cannot cover.
> >> >
> >> >The first issue is relatable with our approach. Let me dig more info
> >> >from our internal teams and get back to you with more details.
>
> --
> Thanks for your comment!
> chengkaitao
>
>