2022-11-30 07:22:58

by chengkaitao

Subject: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

From: chengkaitao <[email protected]>

We add a new interface, <memory.oom.protect>, to the memory controller.
When the OOM killer is invoked under a parent memory cgroup and the
memory usage of a child cgroup is within its effective oom.protect
boundary, the child cgroup's tasks won't be OOM killed unless there are
no unprotected tasks in the other children cgroups. The inheritance
semantics follow the logic of <memory.min/low>.

It has the following advantages:
1. We gain the ability to protect the more important processes when a
memcg OOM killer is invoked. The oom.protect only takes effect in the
local memcg and does not affect the OOM killer of the host.
2. Historically, oom_score_adj could be used to control a group of
processes, but it requires that all processes in the cgroup share a
common parent process, and that parent's oom_score_adj has to be set
before it forks all of the children. That makes it very difficult to
apply in other situations. oom.protect has no such restrictions, so we
can protect a cgroup of processes much more easily. The cgroup can keep
some memory even if the OOM killer has to be invoked.
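
A minimal usage sketch (the cgroup path and the 3G value below are only
illustrative; the file accepts the same memparse format as memory.min):

  # echo 3G > /sys/fs/cgroup/A/C/memory.oom.protect

With this set, an OOM killer invoked under /sys/fs/cgroup/A prefers
victims from unprotected sibling cgroups as long as A/C stays within its
effective oom.protect boundary.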

Signed-off-by: chengkaitao <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 22 +++-
fs/proc/base.c | 17 ++-
include/linux/memcontrol.h | 45 +++++++-
include/linux/oom.h | 3 +-
include/linux/page_counter.h | 6 ++
mm/memcontrol.c | 178 ++++++++++++++++++++++++++++++++
mm/oom_kill.c | 22 ++--
mm/page_counter.c | 26 +++++
8 files changed, 298 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index dc254a3cb956..f3542682fa15 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1191,7 +1191,7 @@ PAGE_SIZE multiple when read back.
cgroup is within its effective low boundary, the cgroup's
memory won't be reclaimed unless there is no reclaimable
memory available in unprotected cgroups.
- Above the effective low boundary (or
+ Above the effective low boundary (or
effective min boundary if it is higher), pages are reclaimed
proportionally to the overage, reducing reclaim pressure for
smaller overages.
@@ -1292,6 +1292,24 @@ PAGE_SIZE multiple when read back.
to kill any tasks outside of this cgroup, regardless
memory.oom.group values of ancestor cgroups.

+ memory.oom.protect
+ A read-write single value file which exists on non-root
+ cgroups. The default value is "0".
+
+ If the OOM killer is invoked under the parent memory cgroup, and
+ the memory usage of a child cgroup is within its effective
+ oom.protect boundary, the cgroup's processes won't be OOM killed
+ unless there are no unprotected processes in other children
+ cgroups. The effective oom.protect boundary is distributed to
+ each process in this cgroup in proportion to its actual usage,
+ and this factor is taken into account when calculating the
+ oom_score. The effective oom.protect boundary is limited by the
+ memory.oom.protect values of all ancestor cgroups. If there is
+ memory.oom.protect overcommitment (child cgroups are requiring
+ more protected memory than the parent will allow), then each
+ child cgroup will get the part of the parent's protection
+ proportional to its actual memory usage below memory.oom.protect.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1885,7 +1903,7 @@ of the two is enforced.

cgroup writeback requires explicit support from the underlying
filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
attributed to the root cgroup.

There are inherent differences in memory and writeback management
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9e479d7d202b..f169abcfbe21 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
unsigned long totalpages = totalram_pages() + total_swap_pages;
unsigned long points = 0;
long badness;
+#ifdef CONFIG_MEMCG
+ struct mem_cgroup *memcg;

- badness = oom_badness(task, totalpages);
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(task);
+ if (memcg && !css_tryget(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+
+ update_parent_oom_protection(root_mem_cgroup, memcg);
+ css_put(&memcg->css);
+#endif
+ badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
/*
* Special case OOM_SCORE_ADJ_MIN for all others scale the
* badness value into [0, 2000] range which we have been
@@ -2657,7 +2668,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry,
return d_splice_alias(inode, dentry);
}

-static struct dentry *proc_pident_lookup(struct inode *dir,
+static struct dentry *proc_pident_lookup(struct inode *dir,
struct dentry *dentry,
const struct pid_entry *p,
const struct pid_entry *end)
@@ -2870,7 +2881,7 @@ static const struct pid_entry attr_dir_stuff[] = {

static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
{
- return proc_pident_readdir(file, ctx,
+ return proc_pident_readdir(file, ctx,
attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
}

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1644a24009c..0ca96d764e45 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,11 @@ enum memcg_memory_event {
MEMCG_NR_MEMORY_EVENTS,
};

+enum memcg_oom_evaluate {
+ MEMCG_OOM_EVALUATE_NONE,
+ MEMCG_OOM_PROTECT,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
unsigned int generation;
@@ -614,6 +619,14 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,

void mem_cgroup_calculate_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg);
+void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
+void update_parent_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
+unsigned long get_task_eoom_protect(struct task_struct *p, long points);
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
+bool is_root_oom_protect(void);

static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
{
@@ -746,10 +759,6 @@ static inline struct lruvec *folio_lruvec(struct folio *folio)
return mem_cgroup_lruvec(memcg, folio_pgdat(folio));
}

-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
-struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
-
struct lruvec *folio_lruvec_lock(struct folio *folio);
struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
@@ -805,6 +814,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
int mem_cgroup_scan_tasks(struct mem_cgroup *,
int (*)(struct task_struct *, void *), void *);
+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg);

static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
{
@@ -1209,6 +1220,16 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
{
}

+static inline void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+}
+
+void update_parent_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+}
+
static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
{
return false;
@@ -1219,6 +1240,16 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
return false;
}

+unsigned long get_task_eoom_protect(struct task_struct *p, long points)
+{
+ return 0;
+}
+
+bool is_root_oom_protect(void)
+{
+ return 0;
+}
+
static inline int mem_cgroup_charge(struct folio *folio,
struct mm_struct *mm, gfp_t gfp)
{
@@ -1338,6 +1369,12 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return 0;
}

+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg)
+{
+ return 0;
+}
+
static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
{
return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7d0c9c48a0c5..04b6daca5a9c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -97,8 +97,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
return 0;
}

-long oom_badness(struct task_struct *p,
- unsigned long totalpages);
+long oom_badness(struct task_struct *p, unsigned long totalpages, int flag);

extern bool out_of_memory(struct oom_control *oc);

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index c141ea9a95ef..d730a7373c1d 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -25,6 +25,10 @@ struct page_counter {
atomic_long_t low_usage;
atomic_long_t children_low_usage;

+ unsigned long eoom_protect;
+ atomic_long_t oom_protect_usage;
+ atomic_long_t children_oom_protect_usage;
+
unsigned long watermark;
unsigned long failcnt;

@@ -35,6 +39,7 @@ struct page_counter {
unsigned long low;
unsigned long high;
unsigned long max;
+ unsigned long oom_protect;
struct page_counter *parent;
} ____cacheline_internodealigned_in_smp;

@@ -65,6 +70,7 @@ bool page_counter_try_charge(struct page_counter *counter,
void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_oom_protect(struct page_counter *counter, unsigned long nr_pages);

static inline void page_counter_set_high(struct page_counter *counter,
unsigned long nr_pages)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..6f0878619133 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1261,6 +1261,52 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return ret;
}

+/**
+ * mem_cgroup_scan_tasks_update_eoom - iterate over tasks of a memory cgroup
+ * hierarchy and update memcg's eoom_protect
+ * @memcg: hierarchy root
+ * @fn: function to call for each task
+ * @arg: argument passed to @fn
+ *
+ * This function iterates over tasks attached to @memcg or to any of its
+ * descendants, updates each memcg's eoom_protect, and then calls @fn for
+ * each task. If @fn returns a non-zero value, the function breaks the
+ * iteration loop and returns the value. Otherwise, it will iterate over
+ * all tasks and return 0.
+ *
+ * This function may be called for the root memory cgroup.
+ */
+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+ int (*fn)(struct task_struct *, void *, int), void *arg)
+{
+ struct mem_cgroup *iter;
+ int ret = 0;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ mem_cgroup_calculate_oom_protection(memcg, iter);
+ css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = fn(task, arg, MEMCG_OOM_PROTECT);
+ css_task_iter_end(&it);
+ if (ret) {
+ mem_cgroup_iter_break(memcg, iter);
+ break;
+ }
+ }
+ return ret;
+}
+
+bool is_root_oom_protect(void)
+{
+ if (mem_cgroup_disabled())
+ return 0;
+
+ return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}
+
#ifdef CONFIG_DEBUG_VM
void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
{
@@ -6569,6 +6615,29 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}

+static int memory_oom_protect_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->memory.oom_protect));
+}
+
+static ssize_t memory_oom_protect_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long oom_protect;
+ int err;
+
+ buf = strstrip(buf);
+ err = page_counter_memparse(buf, "max", &oom_protect);
+ if (err)
+ return err;
+
+ page_counter_set_oom_protect(&memcg->memory, oom_protect);
+
+ return nbytes;
+}
+
static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -6674,6 +6743,12 @@ static struct cftype memory_files[] = {
.seq_show = memory_oom_group_show,
.write = memory_oom_group_write,
},
+ {
+ .name = "oom.protect",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_protect_show,
+ .write = memory_oom_protect_write,
+ },
{
.name = "reclaim",
.flags = CFTYPE_NS_DELEGATABLE,
@@ -6870,6 +6945,109 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
atomic_long_read(&parent->memory.children_low_usage)));
}

+static void __mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+ unsigned long usage, parent_usage;
+ struct mem_cgroup *parent;
+
+ usage = page_counter_read(&memcg->memory);
+ if (!usage)
+ return;
+
+ parent = parent_mem_cgroup(memcg);
+
+ if (parent == root) {
+ memcg->memory.eoom_protect = READ_ONCE(memcg->memory.oom_protect);
+ return;
+ }
+
+ parent_usage = page_counter_read(&parent->memory);
+
+ WRITE_ONCE(memcg->memory.eoom_protect, effective_protection(usage, parent_usage,
+ READ_ONCE(memcg->memory.oom_protect),
+ READ_ONCE(parent->memory.eoom_protect),
+ atomic_long_read(&parent->memory.children_oom_protect_usage)));
+}
+
+/**
+ * mem_cgroup_calculate_oom_protection - calculate the effective oom.protect
+ * boundary of a memory cgroup
+ * @root: the top ancestor of the sub-tree being checked
+ * @memcg: the memory cgroup to check
+ *
+ * WARNING: This function is not stateless! It can only be used as part
+ * of a top-down tree iteration, not for isolated queries.
+ */
+void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!root)
+ root = root_mem_cgroup;
+
+ /*
+ * Effective values of the reclaim targets are ignored so they
+ * can be stale. Have a look at mem_cgroup_protection for more
+ * details.
+ * TODO: calculation should be more robust so that we do not need
+ * that special casing.
+ */
+ if (memcg == root)
+ return;
+
+ __mem_cgroup_calculate_oom_protection(root, memcg);
+}
+
+static void list_postorder_for_memcg_parent(
+ struct mem_cgroup *root, struct mem_cgroup *memcg,
+ void (*fn)(struct mem_cgroup *, struct mem_cgroup *))
+{
+ struct mem_cgroup *parent;
+
+ if (!memcg || memcg == root)
+ return;
+
+ parent = parent_mem_cgroup(memcg);
+ list_postorder_for_memcg_parent(root, parent, fn);
+ fn(root, memcg);
+}
+
+void update_parent_oom_protection(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!root)
+ root = root_mem_cgroup;
+
+ list_postorder_for_memcg_parent(root, memcg,
+ __mem_cgroup_calculate_oom_protection);
+}
+
+unsigned long get_task_eoom_protect(struct task_struct *p, long points)
+{
+ struct mem_cgroup *memcg;
+ unsigned long usage, eoom;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(p);
+
+ if (!memcg || !mem_cgroup_supports_protection(memcg)) {
+ rcu_read_unlock();
+ return 0;
+ }
+
+ usage = page_counter_read(&memcg->memory);
+ eoom = READ_ONCE(memcg->memory.eoom_protect) * points / usage;
+ rcu_read_unlock();
+
+ return eoom;
+}
+
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1276e49b31b0..16aa33323eff 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
* linux/mm/oom_kill.c
- *
+ *
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
@@ -193,15 +193,15 @@ static bool should_dump_unreclaim_slab(void)
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
* @totalpages: total present RAM allowed for page allocation
+ * @flag: pass MEMCG_OOM_PROTECT to take the memcg's oom.protect into account
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages, int flag)
{
- long points;
- long adj;
+ long points, adj, val = 0;

if (oom_unkillable_task(p))
return LONG_MIN;
@@ -231,9 +231,11 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
mm_pgtables_bytes(p->mm) / PAGE_SIZE;
task_unlock(p);

+ if (flag == MEMCG_OOM_PROTECT)
+ val = get_task_eoom_protect(p, points);
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
- points += adj;
+ points = points + adj - val;

return points;
}
@@ -305,7 +307,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}

-static int oom_evaluate_task(struct task_struct *task, void *arg)
+static int oom_evaluate_task(struct task_struct *task, void *arg, int flag)
{
struct oom_control *oc = arg;
long points;
@@ -338,7 +340,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto select;
}

- points = oom_badness(task, oc->totalpages);
+ points = oom_badness(task, oc->totalpages, flag);
if (points == LONG_MIN || points < oc->chosen_points)
goto next;

@@ -365,14 +367,14 @@ static void select_bad_process(struct oom_control *oc)
{
oc->chosen_points = LONG_MIN;

- if (is_memcg_oom(oc))
- mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+ if (is_memcg_oom(oc) || is_root_oom_protect())
+ mem_cgroup_scan_tasks_update_eoom(oc->memcg, oom_evaluate_task, oc);
else {
struct task_struct *p;

rcu_read_lock();
for_each_process(p)
- if (oom_evaluate_task(p, oc))
+ if (oom_evaluate_task(p, oc, MEMCG_OOM_EVALUATE_NONE))
break;
rcu_read_unlock();
}
diff --git a/mm/page_counter.c b/mm/page_counter.c
index db20d6452b71..43987cc59443 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -39,6 +39,15 @@ static void propagate_protected_usage(struct page_counter *c,
if (delta)
atomic_long_add(delta, &c->parent->children_low_usage);
}
+
+ protected = min(usage, READ_ONCE(c->oom_protect));
+ old_protected = atomic_long_read(&c->oom_protect_usage);
+ if (protected != old_protected) {
+ old_protected = atomic_long_xchg(&c->oom_protect_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_oom_protect_usage);
+ }
}

/**
@@ -234,6 +243,23 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
propagate_protected_usage(c, atomic_long_read(&c->usage));
}

+/**
+ * page_counter_set_oom_protect - set the amount of oom protected memory
+ * @counter: counter
+ * @nr_pages: value to set
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+void page_counter_set_oom_protect(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+
+ WRITE_ONCE(counter->oom_protect, nr_pages);
+
+ for (c = counter; c; c = c->parent)
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
+}
+
/**
* page_counter_memparse - memparse() for page counter limits
* @buf: string to parse
--
2.14.1


2022-11-30 08:58:50

by Bagas Sanjaya

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On 11/30/22 14:01, chengkaitao wrote:
> From: chengkaitao <[email protected]>
>

Yikes! Another patch from ZTE guys.

I'm suspicious of patches sent from them due to their bad reputation with
the kernel development community. First, they sent all patches via
[email protected] (listed in Cc), but Greg can't be sure these were really
sent by them ([1] & [2]). Then they tried to work around this by sending
from their personal Gmail accounts, again with the same response from him
[3]. And finally they sent spoofed emails (as he pointed out in [4]) -
they pretend to send from the ZTE domain but actually send from a
different domain (see the raw message and look for the
X-Google-Original-From: header).

I was about to review the documentation part of this patch, but due to
the concerns above, I have to write this reply instead. So I'm not going
to review it, sorry for the inconvenience.

PS: Adding Greg to Cc: list.

[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/
[3]: https://lore.kernel.org/lkml/[email protected]/
[4]: https://lore.kernel.org/lkml/[email protected]/

--
An old man doll... just what I always wanted! - Clara

2022-11-30 10:13:55

by kernel test robot

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

Hi chengkaitao,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/chengkaitao/mm-memcontrol-protect-the-memory-in-cgroup-from-being-oom-killed/20221130-150440
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20221130070158.44221-1-chengkaitao%40didiglobal.com
patch subject: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed
config: i386-tinyconfig
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/d2e2f936cb254b6976abddb53b9f46dfc9c9a134
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review chengkaitao/mm-memcontrol-protect-the-memory-in-cgroup-from-being-oom-killed/20221130-150440
git checkout d2e2f936cb254b6976abddb53b9f46dfc9c9a134
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All error/warnings (new ones prefixed by >>):

ld: init/do_mounts.o: in function `update_parent_oom_protection':
>> do_mounts.c:(.text+0x5): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: init/do_mounts.o: in function `get_task_eoom_protect':
>> do_mounts.c:(.text+0x6): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: init/do_mounts.o: in function `is_root_oom_protect':
>> do_mounts.c:(.text+0x9): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: init/do_mounts.o: in function `mem_cgroup_scan_tasks_update_eoom':
>> do_mounts.c:(.text+0xc): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/entry/vdso/vma.o: in function `update_parent_oom_protection':
vma.c:(.text+0x1db): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/entry/vdso/vma.o: in function `get_task_eoom_protect':
vma.c:(.text+0x1dc): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/entry/vdso/vma.o: in function `is_root_oom_protect':
vma.c:(.text+0x1df): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/entry/vdso/vma.o: in function `mem_cgroup_scan_tasks_update_eoom':
vma.c:(.text+0x1e2): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/kernel/setup.o: in function `update_parent_oom_protection':
setup.c:(.text+0x3): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/kernel/setup.o: in function `get_task_eoom_protect':
setup.c:(.text+0x4): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/kernel/setup.o: in function `is_root_oom_protect':
setup.c:(.text+0x7): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/kernel/setup.o: in function `mem_cgroup_scan_tasks_update_eoom':
setup.c:(.text+0xa): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/kernel/e820.o: in function `update_parent_oom_protection':
e820.c:(.text+0x134): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/kernel/e820.o: in function `get_task_eoom_protect':
e820.c:(.text+0x135): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/kernel/e820.o: in function `is_root_oom_protect':
e820.c:(.text+0x138): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/kernel/e820.o: in function `mem_cgroup_scan_tasks_update_eoom':
e820.c:(.text+0x13b): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/kernel/cpu/umwait.o: in function `update_parent_oom_protection':
umwait.c:(.text+0x2d): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/kernel/cpu/umwait.o: in function `get_task_eoom_protect':
umwait.c:(.text+0x2e): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/kernel/cpu/umwait.o: in function `is_root_oom_protect':
umwait.c:(.text+0x31): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/kernel/cpu/umwait.o: in function `mem_cgroup_scan_tasks_update_eoom':
umwait.c:(.text+0x34): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/kernel/reboot.o: in function `update_parent_oom_protection':
reboot.c:(.text+0x1a): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/kernel/reboot.o: in function `get_task_eoom_protect':
reboot.c:(.text+0x1b): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/kernel/reboot.o: in function `is_root_oom_protect':
reboot.c:(.text+0x1e): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/kernel/reboot.o: in function `mem_cgroup_scan_tasks_update_eoom':
reboot.c:(.text+0x21): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/mm/init.o: in function `update_parent_oom_protection':
init.c:(.text+0x0): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/mm/init.o: in function `get_task_eoom_protect':
init.c:(.text+0x1): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/mm/init.o: in function `is_root_oom_protect':
init.c:(.text+0x4): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/mm/init.o: in function `mem_cgroup_scan_tasks_update_eoom':
init.c:(.text+0x7): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/mm/init_32.o: in function `update_parent_oom_protection':
init_32.c:(.text+0x0): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/mm/init_32.o: in function `get_task_eoom_protect':
init_32.c:(.text+0x1): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/mm/init_32.o: in function `is_root_oom_protect':
init_32.c:(.text+0x4): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/mm/init_32.o: in function `mem_cgroup_scan_tasks_update_eoom':
init_32.c:(.text+0x7): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/mm/fault.o: in function `update_parent_oom_protection':
fault.c:(.text+0x8cd): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/mm/fault.o: in function `get_task_eoom_protect':
fault.c:(.text+0x8ce): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/mm/fault.o: in function `is_root_oom_protect':
fault.c:(.text+0x8d1): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/mm/fault.o: in function `mem_cgroup_scan_tasks_update_eoom':
fault.c:(.text+0x8d4): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/mm/ioremap.o: in function `update_parent_oom_protection':
ioremap.c:(.text+0x25c): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/mm/ioremap.o: in function `get_task_eoom_protect':
ioremap.c:(.text+0x25d): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/mm/ioremap.o: in function `is_root_oom_protect':
ioremap.c:(.text+0x260): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/mm/ioremap.o: in function `mem_cgroup_scan_tasks_update_eoom':
ioremap.c:(.text+0x263): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/mm/pgtable.o: in function `update_parent_oom_protection':
pgtable.c:(.text+0x4): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/mm/pgtable.o: in function `get_task_eoom_protect':
pgtable.c:(.text+0x5): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/mm/pgtable.o: in function `is_root_oom_protect':
pgtable.c:(.text+0x8): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/mm/pgtable.o: in function `mem_cgroup_scan_tasks_update_eoom':
pgtable.c:(.text+0xb): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: arch/x86/mm/pgtable_32.o: in function `update_parent_oom_protection':
pgtable_32.c:(.text+0x0): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: arch/x86/mm/pgtable_32.o: in function `get_task_eoom_protect':
pgtable_32.c:(.text+0x1): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: arch/x86/mm/pgtable_32.o: in function `is_root_oom_protect':
pgtable_32.c:(.text+0x4): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: arch/x86/mm/pgtable_32.o: in function `mem_cgroup_scan_tasks_update_eoom':
pgtable_32.c:(.text+0x7): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: kernel/fork.o: in function `update_parent_oom_protection':
fork.c:(.text+0x5bb): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: kernel/fork.o: in function `get_task_eoom_protect':
fork.c:(.text+0x5bc): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
ld: kernel/fork.o: in function `is_root_oom_protect':
fork.c:(.text+0x5bf): multiple definition of `is_root_oom_protect'; init/main.o:main.c:(.text+0x36): first defined here
ld: kernel/fork.o: in function `mem_cgroup_scan_tasks_update_eoom':
fork.c:(.text+0x5c2): multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:main.c:(.text+0x39): first defined here
ld: kernel/cpu.o: in function `update_parent_oom_protection':
cpu.c:(.text+0x161): multiple definition of `update_parent_oom_protection'; init/main.o:main.c:(.text+0x32): first defined here
ld: kernel/cpu.o: in function `get_task_eoom_protect':
cpu.c:(.text+0x162): multiple definition of `get_task_eoom_protect'; init/main.o:main.c:(.text+0x33): first defined here
--
In file included from include/linux/swap.h:9,
from include/linux/suspend.h:5,
from arch/x86/kernel/asm-offsets.c:13:
>> include/linux/memcontrol.h:1228:6: warning: no previous prototype for 'update_parent_oom_protection' [-Wmissing-prototypes]
1228 | void update_parent_oom_protection(struct mem_cgroup *root,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1243:15: warning: no previous prototype for 'get_task_eoom_protect' [-Wmissing-prototypes]
1243 | unsigned long get_task_eoom_protect(struct task_struct *p, long points)
| ^~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1248:6: warning: no previous prototype for 'is_root_oom_protect' [-Wmissing-prototypes]
1248 | bool is_root_oom_protect(void)
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1372:5: warning: no previous prototype for 'mem_cgroup_scan_tasks_update_eoom' [-Wmissing-prototypes]
1372 | int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
In file included from include/linux/rmap.h:12,
from init/main.c:62:
>> include/linux/memcontrol.h:1228:6: warning: no previous prototype for 'update_parent_oom_protection' [-Wmissing-prototypes]
1228 | void update_parent_oom_protection(struct mem_cgroup *root,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1243:15: warning: no previous prototype for 'get_task_eoom_protect' [-Wmissing-prototypes]
1243 | unsigned long get_task_eoom_protect(struct task_struct *p, long points)
| ^~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1248:6: warning: no previous prototype for 'is_root_oom_protect' [-Wmissing-prototypes]
1248 | bool is_root_oom_protect(void)
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1372:5: warning: no previous prototype for 'mem_cgroup_scan_tasks_update_eoom' [-Wmissing-prototypes]
1372 | int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
init/main.c:775:20: warning: no previous prototype for 'arch_post_acpi_subsys_init' [-Wmissing-prototypes]
775 | void __init __weak arch_post_acpi_subsys_init(void) { }
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
--
In file included from include/linux/swap.h:9,
from mm/shmem.c:36:
>> include/linux/memcontrol.h:1228:6: warning: no previous prototype for 'update_parent_oom_protection' [-Wmissing-prototypes]
1228 | void update_parent_oom_protection(struct mem_cgroup *root,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1243:15: warning: no previous prototype for 'get_task_eoom_protect' [-Wmissing-prototypes]
1243 | unsigned long get_task_eoom_protect(struct task_struct *p, long points)
| ^~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1248:6: warning: no previous prototype for 'is_root_oom_protect' [-Wmissing-prototypes]
1248 | bool is_root_oom_protect(void)
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1372:5: warning: no previous prototype for 'mem_cgroup_scan_tasks_update_eoom' [-Wmissing-prototypes]
1372 | int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/shmem.c:4148:13: warning: no previous prototype for 'shmem_init' [-Wmissing-prototypes]
4148 | void __init shmem_init(void)
| ^~~~~~~~~~
mm/shmem.c:4156:5: warning: no previous prototype for 'shmem_unuse' [-Wmissing-prototypes]
4156 | int shmem_unuse(unsigned int type)
| ^~~~~~~~~~~
mm/shmem.c:4161:5: warning: no previous prototype for 'shmem_lock' [-Wmissing-prototypes]
4161 | int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
| ^~~~~~~~~~
mm/shmem.c:4166:6: warning: no previous prototype for 'shmem_unlock_mapping' [-Wmissing-prototypes]
4166 | void shmem_unlock_mapping(struct address_space *mapping)
| ^~~~~~~~~~~~~~~~~~~~
mm/shmem.c:4171:15: warning: no previous prototype for 'shmem_get_unmapped_area' [-Wmissing-prototypes]
4171 | unsigned long shmem_get_unmapped_area(struct file *file,
| ^~~~~~~~~~~~~~~~~~~~~~~
mm/shmem.c:4179:6: warning: no previous prototype for 'shmem_truncate_range' [-Wmissing-prototypes]
4179 | void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
| ^~~~~~~~~~~~~~~~~~~~
mm/shmem.c:4239:14: warning: no previous prototype for 'shmem_kernel_file_setup' [-Wmissing-prototypes]
4239 | struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned long flags)
| ^~~~~~~~~~~~~~~~~~~~~~~
mm/shmem.c:4250:14: warning: no previous prototype for 'shmem_file_setup' [-Wmissing-prototypes]
4250 | struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags)
| ^~~~~~~~~~~~~~~~
mm/shmem.c:4263:14: warning: no previous prototype for 'shmem_file_setup_with_mnt' [-Wmissing-prototypes]
4263 | struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name,
| ^~~~~~~~~~~~~~~~~~~~~~~~~
mm/shmem.c:4274:5: warning: no previous prototype for 'shmem_zero_setup' [-Wmissing-prototypes]
4274 | int shmem_zero_setup(struct vm_area_struct *vma)
| ^~~~~~~~~~~~~~~~
mm/shmem.c:4312:14: warning: no previous prototype for 'shmem_read_mapping_page_gfp' [-Wmissing-prototypes]
4312 | struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
--
In file included from fs/pipe.c:26:
>> include/linux/memcontrol.h:1228:6: warning: no previous prototype for 'update_parent_oom_protection' [-Wmissing-prototypes]
1228 | void update_parent_oom_protection(struct mem_cgroup *root,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1243:15: warning: no previous prototype for 'get_task_eoom_protect' [-Wmissing-prototypes]
1243 | unsigned long get_task_eoom_protect(struct task_struct *p, long points)
| ^~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1248:6: warning: no previous prototype for 'is_root_oom_protect' [-Wmissing-prototypes]
1248 | bool is_root_oom_protect(void)
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1372:5: warning: no previous prototype for 'mem_cgroup_scan_tasks_update_eoom' [-Wmissing-prototypes]
1372 | int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/pipe.c:757:15: warning: no previous prototype for 'account_pipe_buffers' [-Wmissing-prototypes]
757 | unsigned long account_pipe_buffers(struct user_struct *user,
| ^~~~~~~~~~~~~~~~~~~~
fs/pipe.c:763:6: warning: no previous prototype for 'too_many_pipe_buffers_soft' [-Wmissing-prototypes]
763 | bool too_many_pipe_buffers_soft(unsigned long user_bufs)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
fs/pipe.c:770:6: warning: no previous prototype for 'too_many_pipe_buffers_hard' [-Wmissing-prototypes]
770 | bool too_many_pipe_buffers_hard(unsigned long user_bufs)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
fs/pipe.c:777:6: warning: no previous prototype for 'pipe_is_unprivileged_user' [-Wmissing-prototypes]
777 | bool pipe_is_unprivileged_user(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
fs/pipe.c:1253:5: warning: no previous prototype for 'pipe_resize_ring' [-Wmissing-prototypes]
1253 | int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots)
| ^~~~~~~~~~~~~~~~
--
In file included from include/net/sock.h:53,
from include/linux/tcp.h:19,
from include/linux/ipv6.h:93,
from include/net/addrconf.h:52,
from lib/vsprintf.c:40:
>> include/linux/memcontrol.h:1228:6: warning: no previous prototype for 'update_parent_oom_protection' [-Wmissing-prototypes]
1228 | void update_parent_oom_protection(struct mem_cgroup *root,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1243:15: warning: no previous prototype for 'get_task_eoom_protect' [-Wmissing-prototypes]
1243 | unsigned long get_task_eoom_protect(struct task_struct *p, long points)
| ^~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1248:6: warning: no previous prototype for 'is_root_oom_protect' [-Wmissing-prototypes]
1248 | bool is_root_oom_protect(void)
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1372:5: warning: no previous prototype for 'mem_cgroup_scan_tasks_update_eoom' [-Wmissing-prototypes]
1372 | int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/vsprintf.c: In function 'va_format':
lib/vsprintf.c:1685:9: warning: function 'va_format' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
1685 | buf += vsnprintf(buf, end > buf ? end - buf : 0, va_fmt->fmt, va);
| ^~~
--
In file included from include/linux/swap.h:9,
from include/asm-generic/tlb.h:15,
from arch/x86/include/asm/tlb.h:8,
from arch/x86/entry/vdso/vma.c:24:
>> include/linux/memcontrol.h:1228:6: warning: no previous prototype for 'update_parent_oom_protection' [-Wmissing-prototypes]
1228 | void update_parent_oom_protection(struct mem_cgroup *root,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1243:15: warning: no previous prototype for 'get_task_eoom_protect' [-Wmissing-prototypes]
1243 | unsigned long get_task_eoom_protect(struct task_struct *p, long points)
| ^~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1248:6: warning: no previous prototype for 'is_root_oom_protect' [-Wmissing-prototypes]
1248 | bool is_root_oom_protect(void)
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1372:5: warning: no previous prototype for 'mem_cgroup_scan_tasks_update_eoom' [-Wmissing-prototypes]
1372 | int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/entry/vdso/vma.c:35:19: warning: no previous prototype for 'arch_get_vdso_data' [-Wmissing-prototypes]
35 | struct vdso_data *arch_get_vdso_data(void *vvar_page)
| ^~~~~~~~~~~~~~~~~~
--
In file included from include/linux/swap.h:9,
from include/linux/suspend.h:5,
from kernel/sched/build_policy.c:28:
>> include/linux/memcontrol.h:1228:6: warning: no previous prototype for 'update_parent_oom_protection' [-Wmissing-prototypes]
1228 | void update_parent_oom_protection(struct mem_cgroup *root,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1243:15: warning: no previous prototype for 'get_task_eoom_protect' [-Wmissing-prototypes]
1243 | unsigned long get_task_eoom_protect(struct task_struct *p, long points)
| ^~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1248:6: warning: no previous prototype for 'is_root_oom_protect' [-Wmissing-prototypes]
1248 | bool is_root_oom_protect(void)
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1372:5: warning: no previous prototype for 'mem_cgroup_scan_tasks_update_eoom' [-Wmissing-prototypes]
1372 | int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from kernel/sched/build_policy.c:45:
kernel/sched/rt.c:9:18: warning: 'max_rt_runtime' defined but not used [-Wunused-const-variable=]
9 | static const u64 max_rt_runtime = MAX_BW;
| ^~~~~~~~~~~~~~
--
In file included from include/linux/swap.h:9,
from include/linux/suspend.h:5,
from arch/x86/kernel/asm-offsets.c:13:
>> include/linux/memcontrol.h:1228:6: warning: no previous prototype for 'update_parent_oom_protection' [-Wmissing-prototypes]
1228 | void update_parent_oom_protection(struct mem_cgroup *root,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1243:15: warning: no previous prototype for 'get_task_eoom_protect' [-Wmissing-prototypes]
1243 | unsigned long get_task_eoom_protect(struct task_struct *p, long points)
| ^~~~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1248:6: warning: no previous prototype for 'is_root_oom_protect' [-Wmissing-prototypes]
1248 | bool is_root_oom_protect(void)
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/memcontrol.h:1372:5: warning: no previous prototype for 'mem_cgroup_scan_tasks_update_eoom' [-Wmissing-prototypes]
1372 | int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (25.12 kB)
config (30.02 kB)

2022-11-30 12:15:54

by chengkaitao

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Wed, Nov 30, 2022 at 4:41 PM Bagas Sanjaya <[email protected]> wrote:
>
> On 11/30/22 14:01, chengkaitao wrote:
> > From: chengkaitao <[email protected]>
> >
>
> Yikes! Another patch from ZTE guys.
>
> I'm suspicious of patches sent from them due to their bad reputation with
> the kernel development community. First, they sent all patches via
> [email protected] (listed in Cc), but Greg can't be sure these were really
> sent by them ([1] & [2]). Then they tried to work around this by sending
> from their personal Gmail accounts, again with the same response from him
> [3]. And finally they sent spoofed emails (as he pointed out in [4]) -
> they pretend to send from the ZTE domain but actually send from a
> different domain (see the raw message and look for the
> X-Google-Original-From: header).

Hi Bagas Sanjaya,

I'm not an employee of ZTE, just an ordinary developer. I really don't know
all the details about community and ZTE, The reason why I cc [email protected]
is because the output of the script <get_maintainer.pl> has the
address <[email protected]>.

If there is any error in the format of the email, I will try my best
to correct it.

>
> I was about to review the documentation part of this patch, but due to
> the concerns above, I have to write this reply instead. So I'm not going
> to review it, sorry for the inconvenience.
>
> PS: Adding Greg to Cc: list.
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
> [2]: https://lore.kernel.org/lkml/[email protected]/
> [3]: https://lore.kernel.org/lkml/[email protected]/
> [4]: https://lore.kernel.org/lkml/[email protected]/
>
> --
> An old man doll... just what I always wanted! - Clara
>


--
Yours,
Kaitao Cheng

2022-11-30 13:22:56

by Michal Hocko

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Wed 30-11-22 15:01:58, chengkaitao wrote:
> From: chengkaitao <[email protected]>
>
> We add a new interface, <memory.oom.protect>, to the memory controller.
> When the OOM killer is invoked under a parent memory cgroup and the
> memory usage of a child cgroup is within its effective oom.protect
> boundary, the child cgroup's tasks won't be OOM killed unless there are
> no unprotected tasks in the other children cgroups. The inheritance
> semantics follow the logic of <memory.min/low>.

Could you be more specific about usecases? How do you tune oom.protect
wrt to other tunables? How does this interact with the oom_score_adj
tunining (e.g. a first hand oom victim with the score_adj 1000 sitting
in a oom protected memcg)?

I haven't really read through the whole patch but this struck me odd.

> @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> unsigned long totalpages = totalram_pages() + total_swap_pages;
> unsigned long points = 0;
> long badness;
> +#ifdef CONFIG_MEMCG
> + struct mem_cgroup *memcg;
>
> - badness = oom_badness(task, totalpages);
> + rcu_read_lock();
> + memcg = mem_cgroup_from_task(task);
> + if (memcg && !css_tryget(&memcg->css))
> + memcg = NULL;
> + rcu_read_unlock();
> +
> + update_parent_oom_protection(root_mem_cgroup, memcg);
> + css_put(&memcg->css);
> +#endif
> + badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);

the badness means different thing depending on which memcg hierarchy
subtree you look at. Scaling based on the global oom could get really
misleading.

--
Michal Hocko
SUSE Labs

2022-11-30 13:33:44

by Bagas Sanjaya

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Wed, Nov 30, 2022 at 07:33:01PM +0800, Tao pilgrim wrote:
> On Wed, Nov 30, 2022 at 4:41 PM Bagas Sanjaya <[email protected]> wrote:
> >
> > On 11/30/22 14:01, chengkaitao wrote:
> > > From: chengkaitao <[email protected]>
> > >
> >
> > Yikes! Another patch from ZTE guys.
> >
> > I'm suspicious of patches sent from them due to their bad reputation with
> > the kernel development community. First, they sent all patches via
> > [email protected] (listed in Cc), but Greg can't be sure these were really
> > sent by them ([1] & [2]). Then they tried to work around this by sending
> > from their personal Gmail accounts, again with the same response from him
> > [3]. And finally they sent spoofed emails (as he pointed out in [4]) -
> > they pretend to send from the ZTE domain but actually send from a
> > different domain (see the raw message and look for the
> > X-Google-Original-From: header).
>
> Hi Bagas Sanjaya,
>
> I'm not an employee of ZTE, just an ordinary developer. I really don't know
> all the details about community and ZTE, The reason why I cc [email protected]
> is because the output of the script <get_maintainer.pl> has the
> address <[email protected]>.
>
> If there is any error in the format of the email, I will try my best
> to correct it.
>

OK, thanks for clarification. At first I thought you were ZTE guys.
Sorry for inconvenience.

Now I ask: why does your email look spoofed (sending from your Gmail
account, but with an extra Gmail-specific header that makes you appear
to be "sending" from your corporate email address)? Wouldn't it be nice
(and appropriate) if you could send and receive email with the latter
address instead?

Thanks.

--
An old man doll... just what I always wanted! - Clara


Attachments:
(No filename) (1.71 kB)
signature.asc (235.00 B)
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On 2022/11/30 20:43,“Bagas Sanjaya”<[email protected]> wrote:
> On Wed, Nov 30, 2022 at 07:33:01PM +0800, Tao pilgrim wrote:
> > On Wed, Nov 30, 2022 at 4:41 PM Bagas Sanjaya <[email protected]> wrote:
> > >
> > > On 11/30/22 14:01, chengkaitao wrote:
> > > > From: chengkaitao <[email protected]>
> > > >
> > >
> > > Yikes! Another patch from ZTE guys.
> > >
> > > I'm suspicious of patches sent from them due to their bad reputation with
> > > the kernel development community. First, they sent all patches via
> > > [email protected] (listed in Cc), but Greg can't be sure these were really
> > > sent by them ([1] & [2]). Then they tried to work around this by sending
> > > from their personal Gmail accounts, again with the same response from him
> > > [3]. And finally they sent spoofed emails (as he pointed out in [4]) -
> > > they pretend to send from the ZTE domain but actually send from a
> > > different domain (see the raw message and look for the
> > > X-Google-Original-From: header).
> >
> > Hi Bagas Sanjaya,
> >
> > I'm not an employee of ZTE, just an ordinary developer. I really don't know
> > all the details about community and ZTE, The reason why I cc [email protected]
> > is because the output of the script <get_maintainer.pl> has the
> > address <[email protected]>.
> >
> > If there is any error in the format of the email, I will try my best
> > to correct it.
> >
>
> OK, thanks for clarification. At first I thought you were ZTE guys.
> Sorry for inconvenience.
>
> Now I ask: why does your email look spoofed (sending from your Gmail
> account, but with an extra Gmail-specific header that makes you appear
> to be "sending" from your corporate email address)? Wouldn't it be nice
> (and appropriate) if you could send and receive email with the latter
> address instead?
>
It may be caused by my previous habits.
Thanks for your advice. I'll do it.

Thanks.
--
> An old man doll... just what I always wanted! - Clara

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On 2022-11-30 21:15:06, "Michal Hocko" <[email protected]> wrote:
> On Wed 30-11-22 15:01:58, chengkaitao wrote:
> > From: chengkaitao <[email protected]>
> >
> > We add a new interface, <memory.oom.protect>, to the memory controller.
> > When the OOM killer is invoked under a parent memory cgroup and the
> > memory usage of a child cgroup is within its effective oom.protect
> > boundary, the child cgroup's tasks won't be OOM killed unless there are
> > no unprotected tasks in the other children cgroups. The inheritance
> > semantics follow the logic of <memory.min/low>.
>
> Could you be more specific about usecases? How do you tune oom.protect
> wrt to other tunables? How does this interact with the oom_score_adj
> tunining (e.g. a first hand oom victim with the score_adj 1000 sitting
> in a oom protected memcg)?

We prefer users to use score_adj and oom.protect independently. Score_adj is
a parameter applicable to host, and oom.protect is a parameter applicable to cgroup.
When the physical machine's memory size is particularly large, the score_adj
granularity is also very large. However, oom.protect can achieve more fine-grained
adjustment.
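
For example, oom_badness() normalizes the adjustment with
"adj *= totalpages / 1000", so on a host with 1TB of RAM plus swap a
single oom_score_adj step already corresponds to roughly 1GB, while
oom.protect is expressed in bytes.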

When the score_adj of the processes are the same, I list the following cases
for explanation,

              root
               |
            cgroup A
           /        \
     cgroup B     cgroup C
    (task m,n)   (task x,y)

score_adj(all task) = 0;
oom.protect(cgroup A) = 0;
oom.protect(cgroup B) = 0;
oom.protect(cgroup C) = 3G;
usage(task m) = 1G
usage(task n) = 2G
usage(task x) = 1G
usage(task y) = 2G

oom killer order of cgroup A: n > m > y > x
oom killer order of host: y = n > x = m

If cgroup A is a directory maintained by users, users can use oom.protect
to protect relatively important tasks x and y.

However, when score_adj and oom.protect are used at the same time, we
will also consider the impact of both, as expressed in the following formula.
but I have to admit that it is an unstable result.
score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage
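
As a purely illustrative calculation (with a smaller protection value,
oom.protect(cgroup C) = 1.5G, and score_adj = 0 everywhere), the
effective protection of cgroup C under an OOM in cgroup A is 1.5G and
its usage is 3G, so
score(x) = 1G - 1.5G * 1G / 3G = 0.5G
score(y) = 2G - 1.5G * 2G / 3G = 1G
while the unprotected tasks in cgroup B keep score(m) = 1G and
score(n) = 2G, giving the kill order n > y = m > x under cgroup A.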

> I haven't really read through the whole patch but this struck me odd.

> > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> > unsigned long totalpages = totalram_pages() + total_swap_pages;
> > unsigned long points = 0;
> > long badness;
> > +#ifdef CONFIG_MEMCG
> > + struct mem_cgroup *memcg;
> >
> > - badness = oom_badness(task, totalpages);
> > + rcu_read_lock();
> > + memcg = mem_cgroup_from_task(task);
> > + if (memcg && !css_tryget(&memcg->css))
> > + memcg = NULL;
> > + rcu_read_unlock();
> > +
> > + update_parent_oom_protection(root_mem_cgroup, memcg);
> > + css_put(&memcg->css);
> > +#endif
> > + badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
>
> the badness means different thing depending on which memcg hierarchy
> subtree you look at. Scaling based on the global oom could get really
> misleading.

I also took it into consideration. I planned to change "/proc/pid/oom_score"
to a writable node. When writing to different cgroup paths, different values
will be output. The default output is root cgroup. Do you think this idea is
feasible?
--
Chengkaitao
Best wishes

2022-11-30 17:36:03

by Michal Hocko

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote:
> On 2022-11-30 21:15:06, "Michal Hocko" <[email protected]> wrote:
> > On Wed 30-11-22 15:01:58, chengkaitao wrote:
> > > From: chengkaitao <[email protected]>
> > >
> > > We add a new interface, <memory.oom.protect>, to the memory controller.
> > > When the OOM killer is invoked under a parent memory cgroup and the
> > > memory usage of a child cgroup is within its effective oom.protect
> > > boundary, the child cgroup's tasks won't be OOM killed unless there are
> > > no unprotected tasks in the other children cgroups. The inheritance
> > > semantics follow the logic of <memory.min/low>.
> >
> > Could you be more specific about usecases?

This is a very important question to answer.

> > How do you tune oom.protect
> > wrt to other tunables? How does this interact with the oom_score_adj
> > tunining (e.g. a first hand oom victim with the score_adj 1000 sitting
> > in a oom protected memcg)?
>
> We prefer users to use score_adj and oom.protect independently. Score_adj is
> a parameter applicable to host, and oom.protect is a parameter applicable to cgroup.
> When the physical machine's memory size is particularly large, the score_adj
> granularity is also very large. However, oom.protect can achieve more fine-grained
> adjustment.

Let me clarify a bit. I am not trying to defend oom_score_adj. It has
its well known limitations and it is essentially unusable for many
situations other than - hide or auto-select potential oom victim.

> When the score_adj of the processes are the same, I list the following cases
> for explanation,
>
> root
> |
> cgroup A
> / \
> cgroup B cgroup C
> (task m,n) (task x,y)
>
> score_adj(all task) = 0;
> oom.protect(cgroup A) = 0;
> oom.protect(cgroup B) = 0;
> oom.protect(cgroup C) = 3G;

How can you enforce protection at the C level without any protection at the A
level? This would easily allow an arbitrary cgroup to hide from the oom
killer and spill over to other cgroups.

> usage(task m) = 1G
> usage(task n) = 2G
> usage(task x) = 1G
> usage(task y) = 2G
>
> oom killer order of cgroup A: n > m > y > x
> oom killer order of host: y = n > x = m
>
> If cgroup A is a directory maintained by users, users can use oom.protect
> to protect relatively important tasks x and y.
>
> However, when score_adj and oom.protect are used at the same time, we
> will also consider the impact of both, as expressed in the following formula.
> but I have to admit that it is an unstable result.
> score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage

I hope I am not misreading, but this has some rather unexpected
properties. First off, bigger memory consumers in a protected memcg are
protected more. Also, I would expect the protection discount to
be capped by the actual usage, otherwise an excessive protection
configuration could skew the results considerably.

> > I haven't really read through the whole patch but this struck me odd.
>
> > > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> > > unsigned long totalpages = totalram_pages() + total_swap_pages;
> > > unsigned long points = 0;
> > > long badness;
> > > +#ifdef CONFIG_MEMCG
> > > + struct mem_cgroup *memcg;
> > >
> > > - badness = oom_badness(task, totalpages);
> > > + rcu_read_lock();
> > > + memcg = mem_cgroup_from_task(task);
> > > + if (memcg && !css_tryget(&memcg->css))
> > > + memcg = NULL;
> > > + rcu_read_unlock();
> > > +
> > > + update_parent_oom_protection(root_mem_cgroup, memcg);
> > > + css_put(&memcg->css);
> > > +#endif
> > > + badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
> >
> > the badness means different thing depending on which memcg hierarchy
> > subtree you look at. Scaling based on the global oom could get really
> > misleading.
>
> I also took it into consideration. I planned to change "/proc/pid/oom_score"
> to a writable node. When writing to different cgroup paths, different values
> will be output. The default output is root cgroup. Do you think this idea is
> feasible?

I do not follow. Care to elaborate?
--
Michal Hocko
SUSE Labs

2022-11-30 22:59:18

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

Hi chengkaitao,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/chengkaitao/mm-memcontrol-protect-the-memory-in-cgroup-from-being-oom-killed/20221130-150440
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20221130070158.44221-1-chengkaitao%40didiglobal.com
patch subject: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed
config: um-i386_defconfig
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/d2e2f936cb254b6976abddb53b9f46dfc9c9a134
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review chengkaitao/mm-memcontrol-protect-the-memory-in-cgroup-from-being-oom-killed/20221130-150440
git checkout d2e2f936cb254b6976abddb53b9f46dfc9c9a134
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

ld: init/do_mounts.o: in function `update_parent_oom_protection':
>> include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: init/do_mounts.o: in function `get_task_eoom_protect':
>> include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: init/do_mounts.o: in function `is_root_oom_protect':
>> include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: init/do_mounts.o: in function `mem_cgroup_scan_tasks_update_eoom':
>> include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: arch/um/kernel/mem.o: in function `update_parent_oom_protection':
>> include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: arch/um/kernel/mem.o: in function `get_task_eoom_protect':
>> include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: arch/um/kernel/mem.o: in function `is_root_oom_protect':
>> include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: arch/um/kernel/mem.o: in function `mem_cgroup_scan_tasks_update_eoom':
>> include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: arch/um/kernel/process.o: in function `update_parent_oom_protection':
>> include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: arch/um/kernel/process.o: in function `get_task_eoom_protect':
>> include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: arch/um/kernel/process.o: in function `is_root_oom_protect':
>> include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: arch/um/kernel/process.o: in function `mem_cgroup_scan_tasks_update_eoom':
>> include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: arch/um/kernel/um_arch.o: in function `update_parent_oom_protection':
arch/um/kernel/um_arch.c:528: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: arch/um/kernel/um_arch.o: in function `get_task_eoom_protect':
>> include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: arch/um/kernel/um_arch.o: in function `is_root_oom_protect':
>> include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: arch/um/kernel/um_arch.o: in function `mem_cgroup_scan_tasks_update_eoom':
>> include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/fork.o: in function `update_parent_oom_protection':
>> include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/fork.o: in function `get_task_eoom_protect':
>> include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/fork.o: in function `is_root_oom_protect':
>> include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/fork.o: in function `mem_cgroup_scan_tasks_update_eoom':
>> include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/cpu.o: in function `update_parent_oom_protection':
>> include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/cpu.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/cpu.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/cpu.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/sysctl.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/sysctl.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/sysctl.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/sysctl.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/sys.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/sys.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/sys.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/sys.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/umh.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/umh.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/umh.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/umh.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/pid.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/pid.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/pid.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/pid.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/task_work.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/task_work.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/task_work.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/task_work.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/reboot.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/reboot.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/reboot.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/reboot.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/kmod.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/kmod.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/kmod.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/kmod.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/sched/core.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/sched/core.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/sched/core.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/sched/core.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/sched/build_policy.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/sched/build_policy.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/sched/build_policy.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/sched/build_policy.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/power/main.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/power/main.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/power/main.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/power/main.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/power/process.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/power/process.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/power/process.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here
ld: kernel/power/process.o: in function `mem_cgroup_scan_tasks_update_eoom':
include/linux/memcontrol.h:1376: multiple definition of `mem_cgroup_scan_tasks_update_eoom'; init/main.o:include/linux/memcontrol.h:1376: first defined here
ld: kernel/power/suspend.o: in function `update_parent_oom_protection':
include/linux/memcontrol.h:1231: multiple definition of `update_parent_oom_protection'; init/main.o:include/linux/memcontrol.h:1231: first defined here
ld: kernel/power/suspend.o: in function `get_task_eoom_protect':
include/linux/memcontrol.h:1246: multiple definition of `get_task_eoom_protect'; init/main.o:include/linux/memcontrol.h:1246: first defined here
ld: kernel/power/suspend.o: in function `is_root_oom_protect':
include/linux/memcontrol.h:1251: multiple definition of `is_root_oom_protect'; init/main.o:include/linux/memcontrol.h:1251: first defined here


vim +1231 include/linux/memcontrol.h

1227
1228 void update_parent_oom_protection(struct mem_cgroup *root,
1229 struct mem_cgroup *memcg)
1230 {
> 1231 }
1232
1233 static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
1234 {
1235 return false;
1236 }
1237
1238 static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
1239 {
1240 return false;
1241 }
1242
1243 unsigned long get_task_eoom_protect(struct task_struct *p, long points)
1244 {
1245 return 0;
> 1246 }
1247
1248 bool is_root_oom_protect(void)
1249 {
1250 return 0;
> 1251 }
1252
1253 static inline int mem_cgroup_charge(struct folio *folio,
1254 struct mm_struct *mm, gfp_t gfp)
1255 {
1256 return 0;
1257 }
1258
1259 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
1260 struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
1261 {
1262 return 0;
1263 }
1264
1265 static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
1266 {
1267 }
1268
1269 static inline void mem_cgroup_uncharge(struct folio *folio)
1270 {
1271 }
1272
1273 static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
1274 {
1275 }
1276
1277 static inline void mem_cgroup_migrate(struct folio *old, struct folio *new)
1278 {
1279 }
1280
1281 static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
1282 struct pglist_data *pgdat)
1283 {
1284 return &pgdat->__lruvec;
1285 }
1286
1287 static inline struct lruvec *folio_lruvec(struct folio *folio)
1288 {
1289 struct pglist_data *pgdat = folio_pgdat(folio);
1290 return &pgdat->__lruvec;
1291 }
1292
1293 static inline
1294 void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
1295 {
1296 }
1297
1298 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
1299 {
1300 return NULL;
1301 }
1302
1303 static inline bool mm_match_cgroup(struct mm_struct *mm,
1304 struct mem_cgroup *memcg)
1305 {
1306 return true;
1307 }
1308
1309 static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
1310 {
1311 return NULL;
1312 }
1313
1314 static inline
1315 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
1316 {
1317 return NULL;
1318 }
1319
1320 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
1321 {
1322 }
1323
1324 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
1325 {
1326 }
1327
1328 static inline struct lruvec *folio_lruvec_lock(struct folio *folio)
1329 {
1330 struct pglist_data *pgdat = folio_pgdat(folio);
1331
1332 spin_lock(&pgdat->__lruvec.lru_lock);
1333 return &pgdat->__lruvec;
1334 }
1335
1336 static inline struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
1337 {
1338 struct pglist_data *pgdat = folio_pgdat(folio);
1339
1340 spin_lock_irq(&pgdat->__lruvec.lru_lock);
1341 return &pgdat->__lruvec;
1342 }
1343
1344 static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
1345 unsigned long *flagsp)
1346 {
1347 struct pglist_data *pgdat = folio_pgdat(folio);
1348
1349 spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
1350 return &pgdat->__lruvec;
1351 }
1352
1353 static inline struct mem_cgroup *
1354 mem_cgroup_iter(struct mem_cgroup *root,
1355 struct mem_cgroup *prev,
1356 struct mem_cgroup_reclaim_cookie *reclaim)
1357 {
1358 return NULL;
1359 }
1360
1361 static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
1362 struct mem_cgroup *prev)
1363 {
1364 }
1365
1366 static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
1367 int (*fn)(struct task_struct *, void *), void *arg)
1368 {
1369 return 0;
1370 }
1371
1372 int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
1373 int (*fn)(struct task_struct *, void *, int), void *arg)
1374 {
1375 return 0;
> 1376 }
1377
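
The multiple-definition errors come from the CONFIG_MEMCG=n stubs above
being defined in the header without static inline, unlike the neighbouring
stubs. A sketch of the kind of change that should make these linker errors
go away (untested, for illustration only):

static inline void update_parent_oom_protection(struct mem_cgroup *root,
						struct mem_cgroup *memcg)
{
}

static inline unsigned long get_task_eoom_protect(struct task_struct *p,
						  long points)
{
	return 0;
}

static inline bool is_root_oom_protect(void)
{
	return false;
}

static inline int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
		int (*fn)(struct task_struct *, void *, int), void *arg)
{
	return 0;
}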

--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (20.90 kB)
config (43.09 kB)

2022-12-01 00:52:57

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Wed, Nov 30, 2022 at 03:01:58PM +0800, chengkaitao wrote:
> From: chengkaitao <[email protected]>
>
> We created a new interface <memory.oom.protect> for memory, If there is
> the OOM killer under parent memory cgroup, and the memory usage of a
> child cgroup is within its effective oom.protect boundary, the cgroup's
> tasks won't be OOM killed unless there is no unprotected tasks in other
> children cgroups. It draws on the logic of <memory.min/low> in the
> inheritance relationship.
>
> It has the following advantages,
> 1. We have the ability to protect more important processes, when there
> is a memcg's OOM killer. The oom.protect only takes effect local memcg,
> and does not affect the OOM killer of the host.
> 2. Historically, we can often use oom_score_adj to control a group of
> processes, It requires that all processes in the cgroup must have a
> common parent processes, we have to set the common parent process's
> oom_score_adj, before it forks all children processes. So that it is
> very difficult to apply it in other situations. Now oom.protect has no
> such restrictions, we can protect a cgroup of processes more easily. The
> cgroup can keep some memory, even if the OOM killer has to be called.

It reminds me of our attempts to provide a more sophisticated cgroup-aware oom
killer. The problem is that the decision of which process(es) to kill or
preserve is individual to a specific workload (and can even be time-dependent
for a given workload). So it's really hard to come up with an in-kernel
mechanism which is at the same time flexible enough to work for the majority
of users and reliable enough to serve as the last-resort oom measure (which
is the basic goal of the kernel oom killer).

Previously the consensus was to keep the in-kernel oom killer dumb and reliable
and implement complex policies in userspace (e.g. systemd-oomd etc).

Is there a reason why such an approach can't work in your case?

Thanks!

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

At 2022-12-01 00:27:54, "Michal Hocko" <[email protected]> wrote:
>On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote:
>> On 2022-11-30 21:15:06, "Michal Hocko" <[email protected]> wrote:
>> > On Wed 30-11-22 15:01:58, chengkaitao wrote:
>> > > From: chengkaitao <[email protected]>
>> > >
>> > > We created a new interface <memory.oom.protect> for memory, If there is
>> > > the OOM killer under parent memory cgroup, and the memory usage of a
>> > > child cgroup is within its effective oom.protect boundary, the cgroup's
>> > > tasks won't be OOM killed unless there is no unprotected tasks in other
>> > > children cgroups. It draws on the logic of <memory.min/low> in the
>> > > inheritance relationship.
>> >
>> > Could you be more specific about usecases?
>
>This is a very important question to answer.

Usecase 1: users want to protect an important process with high memory
consumption from being killed by the oom killer when a docker container
fails, so that more critical on-site information is retained or a
self-recovery mechanism can run. They suggest setting the score_adj of
this process to -1000, but I don't agree with that, because the docker
container is not more important than the other docker containers on the
same physical machine. If the score_adj of the process is set to -1000,
the probability of oom kills hitting processes in other containers will
increase.

Usecase 2: there are many business processes and agent processes mixed
together on a physical machine, and they need to be classified and
protected. However, some agents are the parents of business processes,
and some business processes are the parents of agent processes, so it is
troublesome to set different score_adj values for them; neither business
processes nor agents can determine which level their score_adj should be
at. If we create yet another agent to set every process's score_adj, we
have to cycle through all the processes on the physical machine
regularly, which looks clumsy.

>> > How do you tune oom.protect
>> > wrt to other tunables? How does this interact with the oom_score_adj
>> > tunining (e.g. a first hand oom victim with the score_adj 1000 sitting
>> > in a oom protected memcg)?
>>
>> We prefer users to use score_adj and oom.protect independently. Score_adj is
>> a parameter applicable to host, and oom.protect is a parameter applicable to cgroup.
>> When the physical machine's memory size is particularly large, the score_adj
>> granularity is also very large. However, oom.protect can achieve more fine-grained
>> adjustment.
>
>Let me clarify a bit. I am not trying to defend oom_score_adj. It has
>it's well known limitations and it is is essentially unusable for many
>situations other than - hide or auto-select potential oom victim.
>
>> When the score_adj of the processes are the same, I list the following cases
>> for explanation,
>>
>> root
>> |
>> cgroup A
>> / \
>> cgroup B cgroup C
>> (task m,n) (task x,y)
>>
>> score_adj(all task) = 0;
>> oom.protect(cgroup A) = 0;
>> oom.protect(cgroup B) = 0;
>> oom.protect(cgroup C) = 3G;
>
>How can you enforce protection at C level without any protection at A
>level?

The basic idea of this scheme is that all processes in the same cgroup are
equally important. If some processes need extra protection, a new cgroup
needs to be created for them with its own settings. I don't think it is
necessary to implement protection at the cgroup C level, because task x and
task y are equally important. It is only among the four processes (tasks m,
n, x and y) in cgroup A that there is a distinction between important and
secondary ones.

> This would easily allow arbitrary cgroup to hide from the oom
> killer and spill over to other cgroups.

I don't think this will happen, because the effective oom.protect is bounded
by the parent cgroup. If oom.protect(parent cgroup) = 0, then from the
perspective of the grandparent cgroup, tasks x and y will not be specially
protected.
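
A rough userspace sketch of the clamping I mean (illustration only; the
struct and field names are made up and the proportional split under
overcommit is ignored): ancestors strictly below the oom root cap the
effective protection, so oom.protect(A) = 0 gives C zero protection at the
host level, while C keeps min(3G, usage) when the oom happens under A.

#define MIN(a, b)	((a) < (b) ? (a) : (b))

struct memcg_stub {
	struct memcg_stub *parent;
	unsigned long oom_protect;	/* configured memory.oom.protect */
	unsigned long usage;		/* current memory usage */
};

/* Effective protection of @memcg for an oom scoped to @root. */
static unsigned long eoom_protect(const struct memcg_stub *memcg,
				  const struct memcg_stub *root)
{
	unsigned long e = MIN(memcg->oom_protect, memcg->usage);
	const struct memcg_stub *p;

	/* every ancestor strictly below the oom root caps the protection */
	for (p = memcg->parent; p && p != root; p = p->parent)
		e = MIN(e, MIN(p->oom_protect, p->usage));

	return e;
}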

>> usage(task m) = 1G
>> usage(task n) = 2G
>> usage(task x) = 1G
>> usage(task y) = 2G
>>
>> oom killer order of cgroup A: n > m > y > x
>> oom killer order of host: y = n > x = m
>>
>> If cgroup A is a directory maintained by users, users can use oom.protect
>> to protect relatively important tasks x and y.
>>
>> However, when score_adj and oom.protect are used at the same time, we
>> will also consider the impact of both, as expressed in the following formula.
>> but I have to admit that it is an unstable result.
>> score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage
>
>I hope I am not misreading but this has some rather unexpected
>properties. First off, bigger memory consumers in a protected memcg are
>protected more.

Since the cgroup needs to distribute its protection quota reasonably among
all processes in the cgroup, I think processes consuming more memory should
get a larger quota. This is still fair to processes consuming less memory:
even though processes consuming more memory get a larger quota, their
oom_score remains higher than that of the processes consuming less memory.
When the oom killer runs in the local cgroup, the oom kill order therefore
remains unchanged.

>Also I would expect the protection discount would
>be capped by the actual usage otherwise excessive protection
>configuration could skew the results considerably.

In the calculation we take the minimum of the memcg usage and oom.protect,
as sketched below.
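
Roughly, as a sketch (an illustrative helper, not the exact patch code),
the per-task discount would be computed like this:

/*
 * Illustration only: the protection used for the discount is capped by
 * the memcg's actual usage, and then split among tasks in proportion
 * to their usage.
 */
static unsigned long oom_protect_discount(unsigned long eoom_protect,
					  unsigned long memcg_usage,
					  unsigned long task_usage)
{
	unsigned long protect = eoom_protect < memcg_usage ?
				eoom_protect : memcg_usage;

	if (!memcg_usage)
		return 0;

	return protect * task_usage / memcg_usage;
}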

>> > I haven't really read through the whole patch but this struck me odd.
>>
>> > > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
>> > > unsigned long totalpages = totalram_pages() + total_swap_pages;
>> > > unsigned long points = 0;
>> > > long badness;
>> > > +#ifdef CONFIG_MEMCG
>> > > + struct mem_cgroup *memcg;
>> > >
>> > > - badness = oom_badness(task, totalpages);
>> > > + rcu_read_lock();
>> > > + memcg = mem_cgroup_from_task(task);
>> > > + if (memcg && !css_tryget(&memcg->css))
>> > > + memcg = NULL;
>> > > + rcu_read_unlock();
>> > > +
>> > > + update_parent_oom_protection(root_mem_cgroup, memcg);
>> > > + css_put(&memcg->css);
>> > > +#endif
>> > > + badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
>> >
>> > the badness means different thing depending on which memcg hierarchy
>> > subtree you look at. Scaling based on the global oom could get really
>> > misleading.
>>
>> I also took it into consideration. I planned to change "/proc/pid/oom_score"
>> to a writable node. When writing to different cgroup paths, different values
>> will be output. The default output is root cgroup. Do you think this idea is
>> feasible?
>
>I do not follow. Care to elaborate?

Take two examples,
cmd: cat /proc/pid/oom_score
output: Scaling based on the global oom

cmd: echo "/cgroupA/cgroupB" > /proc/pid/oom_score
output: Scaling based on the cgroupB oom
(If the task is not in the cgroupB's hierarchy subtree, output: invalid parameter)

Thanks for your comment!
--
Chengkaitao
Best wishes

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

At 2022-12-01 07:29:11, "Roman Gushchin" <[email protected]> wrote:
>On Wed, Nov 30, 2022 at 03:01:58PM +0800, chengkaitao wrote:
>> From: chengkaitao <[email protected]>
>>
>> We created a new interface <memory.oom.protect> for memory, If there is
>> the OOM killer under parent memory cgroup, and the memory usage of a
>> child cgroup is within its effective oom.protect boundary, the cgroup's
>> tasks won't be OOM killed unless there is no unprotected tasks in other
>> children cgroups. It draws on the logic of <memory.min/low> in the
>> inheritance relationship.
>>
>> It has the following advantages,
>> 1. We have the ability to protect more important processes, when there
>> is a memcg's OOM killer. The oom.protect only takes effect local memcg,
>> and does not affect the OOM killer of the host.
>> 2. Historically, we can often use oom_score_adj to control a group of
>> processes, It requires that all processes in the cgroup must have a
>> common parent processes, we have to set the common parent process's
>> oom_score_adj, before it forks all children processes. So that it is
>> very difficult to apply it in other situations. Now oom.protect has no
>> such restrictions, we can protect a cgroup of processes more easily. The
>> cgroup can keep some memory, even if the OOM killer has to be called.
>
>It reminds me our attempts to provide a more sophisticated cgroup-aware oom
>killer.

As you said, I also like simple strategies and concise code very much, so in
the design of oom.protect we reuse the evaluation method of oom_score and
draw on the logic of <memory.min/low> for the inheritance relationship.
memory.min/low have been proven in practice for a long time. I did it this
way to reduce the burden on the kernel.

>The problem is that the decision which process(es) to kill or preserve
>is individual to a specific workload (and can be even time-dependent
>for a given workload).

Killing the process with the largest memory consumption is correct, but it
may not be the most appropriate choice. I think the specific process to kill
needs to be decided by the user; I believe that was the original intention
of the score_adj design.

>So it's really hard to come up with an in-kernel
>mechanism which is at the same time flexible enough to work for the majority
>of users and reliable enough to serve as the last oom resort measure (which
>is the basic goal of the kernel oom killer).
>
Our goal is to find a method that is less intrusive to the existing
mechanisms of the kernel, and to provide a more reasonable supplement to, or
alternative for, score_adj given its limitations.

>Previously the consensus was to keep the in-kernel oom killer dumb and reliable
>and implement complex policies in userspace (e.g. systemd-oomd etc).
>
>Is there a reason why such approach can't work in your case?

I think that as kernel developers we should try our best to provide users
with simpler and more powerful interfaces. It is clear that the current oom
score mechanism has many limitations. Users have to do a lot of periodic
polling to accomplish work similar to what the oom score mechanism should
do, or develop a whole new mechanism just to bypass the imperfect oom score
mechanism. This is inefficient and forced behavior.

Thanks for your comment!

2022-12-01 09:06:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Thu 01-12-22 04:52:27, 程垲涛 Chengkaitao Cheng wrote:
> At 2022-12-01 00:27:54, "Michal Hocko" <[email protected]> wrote:
> >On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote:
> >> On 2022-11-30 21:15:06, "Michal Hocko" <[email protected]> wrote:
> >> > On Wed 30-11-22 15:01:58, chengkaitao wrote:
> >> > > From: chengkaitao <[email protected]>
> >> > >
> >> > > We created a new interface <memory.oom.protect> for memory, If there is
> >> > > the OOM killer under parent memory cgroup, and the memory usage of a
> >> > > child cgroup is within its effective oom.protect boundary, the cgroup's
> >> > > tasks won't be OOM killed unless there is no unprotected tasks in other
> >> > > children cgroups. It draws on the logic of <memory.min/low> in the
> >> > > inheritance relationship.
> >> >
> >> > Could you be more specific about usecases?
> >
> >This is a very important question to answer.
>
> usecases 1: users say that they want to protect an important process
> with high memory consumption from being killed by the oom in case
> of docker container failure, so as to retain more critical on-site
> information or a self recovery mechanism. At this time, they suggest
> setting the score_adj of this process to -1000, but I don't agree with
> it, because the docker container is not important to other docker
> containers of the same physical machine. If score_adj of the process
> is set to -1000, the probability of oom in other container processes will
> increase.
>
> usecases 2: There are many business processes and agent processes
> mixed together on a physical machine, and they need to be classified
> and protected. However, some agents are the parents of business
> processes, and some business processes are the parents of agent
> processes, It will be troublesome to set different score_adj for them.
> Business processes and agents cannot determine which level their
> score_adj should be at, If we create another agent to set all processes's
> score_adj, we have to cycle through all the processes on the physical
> machine regularly, which looks stupid.

I do agree that oom_score_adj is far from an ideal tool for these usecases.
But I also agree with Roman that these could be addressed by an oom
killer implementation in userspace which can have much better
tailored policies. OOM protection limits would require tuning and also
regular revisions (e.g. memory consumption by any workload might change
with different kernel versions) to provide what you are looking for.

> >> > How do you tune oom.protect
> >> > wrt to other tunables? How does this interact with the oom_score_adj
> >> > tunining (e.g. a first hand oom victim with the score_adj 1000 sitting
> >> > in a oom protected memcg)?
> >>
> >> We prefer users to use score_adj and oom.protect independently. Score_adj is
> >> a parameter applicable to host, and oom.protect is a parameter applicable to cgroup.
> >> When the physical machine's memory size is particularly large, the score_adj
> >> granularity is also very large. However, oom.protect can achieve more fine-grained
> >> adjustment.
> >
> >Let me clarify a bit. I am not trying to defend oom_score_adj. It has
> >it's well known limitations and it is is essentially unusable for many
> >situations other than - hide or auto-select potential oom victim.
> >
> >> When the score_adj of the processes are the same, I list the following cases
> >> for explanation,
> >>
> >> root
> >> |
> >> cgroup A
> >> / \
> >> cgroup B cgroup C
> >> (task m,n) (task x,y)
> >>
> >> score_adj(all task) = 0;
> >> oom.protect(cgroup A) = 0;
> >> oom.protect(cgroup B) = 0;
> >> oom.protect(cgroup C) = 3G;
> >
> >How can you enforce protection at C level without any protection at A
> >level?
>
> The basic idea of this scheme is that all processes in the same cgroup are
> equally important. If some processes need extra protection, a new cgroup
> needs to be created for unified settings. I don't think it is necessary to
> implement protection in cgroup C, because task x and task y are equally
> important. Only the four processes (task m, n, x and y) in cgroup A, have
> important and secondary differences.
>
> > This would easily allow arbitrary cgroup to hide from the oom
> > killer and spill over to other cgroups.
>
> I don't think this will happen, because eoom.protect only works on parent
> cgroup. If "oom.protect(parent cgroup) = 0", from perspective of
> grandpa cgroup, task x and y will not be specially protected.

Just to confirm I am on the same page: this means that there won't be
any protection in case of a global oom in the above example. So it is
effectively the same semantics as the low/min protection.

> >> usage(task m) = 1G
> >> usage(task n) = 2G
> >> usage(task x) = 1G
> >> usage(task y) = 2G
> >>
> >> oom killer order of cgroup A: n > m > y > x
> >> oom killer order of host: y = n > x = m
> >>
> >> If cgroup A is a directory maintained by users, users can use oom.protect
> >> to protect relatively important tasks x and y.
> >>
> >> However, when score_adj and oom.protect are used at the same time, we
> >> will also consider the impact of both, as expressed in the following formula.
> >> but I have to admit that it is an unstable result.
> >> score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage
> >
> >I hope I am not misreading but this has some rather unexpected
> >properties. First off, bigger memory consumers in a protected memcg are
> >protected more.
>
> Since cgroup needs to reasonably distribute the protection quota to all
> processes in the cgroup, I think that processes consuming more memory
> should get more quota. It is fair to processes consuming less memory
> too, even if processes consuming more memory get more quota, its
> oom_score is still higher than the processes consuming less memory.
> When the oom killer appears in local cgroup, the order of oom killer
> remains unchanged

Why cannot you simply discount the protection from all processes
equally? I do not follow why the task_usage has to play any role in
that.

>
> >Also I would expect the protection discount would
> >be capped by the actual usage otherwise excessive protection
> >configuration could skew the results considerably.
>
> In the calculation, we will select the minimum value of memcg_usage and
> oom.protect
>
> >> > I haven't really read through the whole patch but this struck me odd.
> >>
> >> > > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> >> > > unsigned long totalpages = totalram_pages() + total_swap_pages;
> >> > > unsigned long points = 0;
> >> > > long badness;
> >> > > +#ifdef CONFIG_MEMCG
> >> > > + struct mem_cgroup *memcg;
> >> > >
> >> > > - badness = oom_badness(task, totalpages);
> >> > > + rcu_read_lock();
> >> > > + memcg = mem_cgroup_from_task(task);
> >> > > + if (memcg && !css_tryget(&memcg->css))
> >> > > + memcg = NULL;
> >> > > + rcu_read_unlock();
> >> > > +
> >> > > + update_parent_oom_protection(root_mem_cgroup, memcg);
> >> > > + css_put(&memcg->css);
> >> > > +#endif
> >> > > + badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
> >> >
> >> > the badness means different thing depending on which memcg hierarchy
> >> > subtree you look at. Scaling based on the global oom could get really
> >> > misleading.
> >>
> >> I also took it into consideration. I planned to change "/proc/pid/oom_score"
> >> to a writable node. When writing to different cgroup paths, different values
> >> will be output. The default output is root cgroup. Do you think this idea is
> >> feasible?
> >
> >I do not follow. Care to elaborate?
>
> Take two example,
> cmd: cat /proc/pid/oom_score
> output: Scaling based on the global oom
>
> cmd: echo "/cgroupA/cgroupB" > /proc/pid/oom_score
> output: Scaling based on the cgroupB oom
> (If the task is not in the cgroupB's hierarchy subtree, output: invalid parameter)

This is a terrible interface. First of all it assumes a state for the
file without any way to guarantee atomicity. How do you deal with two
different callers accessing the file?

--
Michal Hocko
SUSE Labs

2022-12-01 09:07:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Thu 01-12-22 07:49:04, 程垲涛 Chengkaitao Cheng wrote:
> At 2022-12-01 07:29:11, "Roman Gushchin" <[email protected]> wrote:
[...]
> >The problem is that the decision which process(es) to kill or preserve
> >is individual to a specific workload (and can be even time-dependent
> >for a given workload).
>
> It is correct to kill a process with high workload, but it may not be the
> most appropriate. I think the specific process to kill needs to be decided
> by the user. I think it is the original intention of score_adj design.

I guess what Roman is trying to say here is that there is no obviously
_correct_ oom victim candidate. Well, except for the very narrow situation
where there is a memory leak that consumes most of the memory over time. But
that is really hard for the oom selection algorithm to identify in
general.

> >So it's really hard to come up with an in-kernel
> >mechanism which is at the same time flexible enough to work for the majority
> >of users and reliable enough to serve as the last oom resort measure (which
> >is the basic goal of the kernel oom killer).
> >
> Our goal is to find a method that is less intrusive to the existing
> mechanisms of the kernel, and find a more reasonable supplement
> or alternative to the limitations of score_adj.
>
> >Previously the consensus was to keep the in-kernel oom killer dumb and reliable
> >and implement complex policies in userspace (e.g. systemd-oomd etc).
> >
> >Is there a reason why such approach can't work in your case?
>
> I think that as kernel developers, we should try our best to provide
> users with simpler and more powerful interfaces. It is clear that the
> current oom score mechanism has many limitations. Users need to
> do a lot of timed loop detection in order to complete work similar
> to the oom score mechanism, or develop a new mechanism just to
> skip the imperfect oom score mechanism. This is an inefficient and
> forced behavior

You are right that it makes sense to address typical usecases in the
kernel if that is possible. But oom victim selection is really hard
without a deeper understanding of the actual workload. The more clever
we try to be the more corner cases we can produce. Please note that this
has proven to be the case over the long oom development history. We used
to sacrifice child processes instead of a large parent process to preserve
its work, or to prefer younger processes. Both of those strategies led to
problems.

A memcg protection based mechanism sounds like an interesting idea because
it mimics the reclaim protection scheme, but I am a bit sceptical that it
will be practically useful, mostly for 2 reasons. a) Memory reclaim
protection can be dynamically tuned based on reclaim/refault/psi metrics;
oom events are rare and mostly a failure situation, which limits any
feedback based approach IMHO. b) The hierarchical nature of the protection
will make it quite hard to configure properly with a predictable outcome.

--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

At 2022-12-01 16:49:27, "Michal Hocko" <[email protected]> wrote:
>On Thu 01-12-22 04:52:27, 程垲涛 Chengkaitao Cheng wrote:
>> At 2022-12-01 00:27:54, "Michal Hocko" <[email protected]> wrote:
>> >On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote:
>> >> On 2022-11-30 21:15:06, "Michal Hocko" <[email protected]> wrote:
>> >> > On Wed 30-11-22 15:01:58, chengkaitao wrote:
>> >> > > From: chengkaitao <[email protected]>
>> >> > >
>> >> > > We created a new interface <memory.oom.protect> for memory, If there is
>> >> > > the OOM killer under parent memory cgroup, and the memory usage of a
>> >> > > child cgroup is within its effective oom.protect boundary, the cgroup's
>> >> > > tasks won't be OOM killed unless there is no unprotected tasks in other
>> >> > > children cgroups. It draws on the logic of <memory.min/low> in the
>> >> > > inheritance relationship.
>> >> >
>> >> > Could you be more specific about usecases?
>> >
>> >This is a very important question to answer.
>>
>> usecases 1: users say that they want to protect an important process
>> with high memory consumption from being killed by the oom in case
>> of docker container failure, so as to retain more critical on-site
>> information or a self recovery mechanism. At this time, they suggest
>> setting the score_adj of this process to -1000, but I don't agree with
>> it, because the docker container is not important to other docker
>> containers of the same physical machine. If score_adj of the process
>> is set to -1000, the probability of oom in other container processes will
>> increase.
>>
>> usecases 2: There are many business processes and agent processes
>> mixed together on a physical machine, and they need to be classified
>> and protected. However, some agents are the parents of business
>> processes, and some business processes are the parents of agent
>> processes, It will be troublesome to set different score_adj for them.
>> Business processes and agents cannot determine which level their
>> score_adj should be at, If we create another agent to set all processes's
>> score_adj, we have to cycle through all the processes on the physical
>> machine regularly, which looks stupid.
>
>I do agree that oom_score_adj is far from ideal tool for these usecases.
>But I also agree with Roman that these could be addressed by an oom
>killer implementation in the userspace which can have much better
>tailored policies. OOM protection limits would require tuning and also
>regular revisions (e.g. memory consumption by any workload might change
>with different kernel versions) to provide what you are looking for.

There is a misunderstanding: oom.protect does not replace the user's
tailored policies. Its purpose is to make it easier and more efficient for
users to customize policies, or to avoid users completely abandoning
the oom score mechanism when formulating new policies.

>> >> > How do you tune oom.protect
>> >> > wrt to other tunables? How does this interact with the oom_score_adj
>> >> > tunining (e.g. a first hand oom victim with the score_adj 1000 sitting
>> >> > in a oom protected memcg)?
>> >>
>> >> We prefer users to use score_adj and oom.protect independently. Score_adj is
>> >> a parameter applicable to host, and oom.protect is a parameter applicable to cgroup.
>> >> When the physical machine's memory size is particularly large, the score_adj
>> >> granularity is also very large. However, oom.protect can achieve more fine-grained
>> >> adjustment.
>> >
>> >Let me clarify a bit. I am not trying to defend oom_score_adj. It has
>> >it's well known limitations and it is is essentially unusable for many
>> >situations other than - hide or auto-select potential oom victim.
>> >
>> >> When the score_adj of the processes are the same, I list the following cases
>> >> for explanation,
>> >>
>> >> root
>> >> |
>> >> cgroup A
>> >> / \
>> >> cgroup B cgroup C
>> >> (task m,n) (task x,y)
>> >>
>> >> score_adj(all task) = 0;
>> >> oom.protect(cgroup A) = 0;
>> >> oom.protect(cgroup B) = 0;
>> >> oom.protect(cgroup C) = 3G;
>> >
>> >How can you enforce protection at C level without any protection at A
>> >level?
>>
>> The basic idea of this scheme is that all processes in the same cgroup are
>> equally important. If some processes need extra protection, a new cgroup
>> needs to be created for unified settings. I don't think it is necessary to
>> implement protection in cgroup C, because task x and task y are equally
>> important. Only the four processes (task m, n, x and y) in cgroup A, have
>> important and secondary differences.
>>
>> > This would easily allow arbitrary cgroup to hide from the oom
>> > killer and spill over to other cgroups.
>>
>> I don't think this will happen, because eoom.protect only works on parent
>> cgroup. If "oom.protect(parent cgroup) = 0", from perspective of
>> grandpa cgroup, task x and y will not be specially protected.
>
>Just to confirm I am on the same page. This means that there won't be
>any protection in case of the global oom in the above example. So
>effectively the same semantic as the low/min protection.
>
>> >> usage(task m) = 1G
>> >> usage(task n) = 2G
>> >> usage(task x) = 1G
>> >> usage(task y) = 2G
>> >>
>> >> oom killer order of cgroup A: n > m > y > x
>> >> oom killer order of host: y = n > x = m
>> >>
>> >> If cgroup A is a directory maintained by users, users can use oom.protect
>> >> to protect relatively important tasks x and y.
>> >>
>> >> However, when score_adj and oom.protect are used at the same time, we
>> >> will also consider the impact of both, as expressed in the following formula.
>> >> but I have to admit that it is an unstable result.
>> >> score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage
>> >
>> >I hope I am not misreading but this has some rather unexpected
>> >properties. First off, bigger memory consumers in a protected memcg are
>> >protected more.
>>
>> Since cgroup needs to reasonably distribute the protection quota to all
>> processes in the cgroup, I think that processes consuming more memory
>> should get more quota. It is fair to processes consuming less memory
>> too, even if processes consuming more memory get more quota, its
>> oom_score is still higher than the processes consuming less memory.
>> When the oom killer appears in local cgroup, the order of oom killer
>> remains unchanged
>
>Why cannot you simply discount the protection from all processes
>equally? I do not follow why the task_usage has to play any role in
>that.

If all processes were protected equally, the oom protection of a cgroup
would be meaningless. For example, a cgroup with more processes could then
protect more memory, which is unfair to cgroups with fewer processes. So we
need to keep the total amount of memory that the processes in the cgroup
protect consistent with the value of eoom.protect, as in the sketch below.
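
A quick illustration (userspace sketch, not patch code): with the
proportional split, the per-task shares sum back to the capped protection
no matter how many tasks the cgroup contains.

#include <stdio.h>

int main(void)
{
	/* e.g. tasks x and y from the earlier example, values in GiB */
	double task_usage[] = { 1.0, 2.0 };
	double eoom_protect = 3.0;
	double memcg_usage = 0.0, sum = 0.0, protect;
	int i, nr = sizeof(task_usage) / sizeof(task_usage[0]);

	for (i = 0; i < nr; i++)
		memcg_usage += task_usage[i];

	/* cap the protection by the cgroup's actual usage */
	protect = eoom_protect < memcg_usage ? eoom_protect : memcg_usage;

	/* each task's share is proportional to its own usage */
	for (i = 0; i < nr; i++)
		sum += protect * task_usage[i] / memcg_usage;

	printf("sum of per-task shares: %.1fG, capped eoom.protect: %.1fG\n",
	       sum, protect);
	return 0;
}
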
>>
>> >Also I would expect the protection discount would
>> >be capped by the actual usage otherwise excessive protection
>> >configuration could skew the results considerably.
>>
>> In the calculation, we will select the minimum value of memcg_usage and
>> oom.protect
>>
>> >> > I haven't really read through the whole patch but this struck me odd.
>> >>
>> >> > > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
>> >> > > unsigned long totalpages = totalram_pages() + total_swap_pages;
>> >> > > unsigned long points = 0;
>> >> > > long badness;
>> >> > > +#ifdef CONFIG_MEMCG
>> >> > > + struct mem_cgroup *memcg;
>> >> > >
>> >> > > - badness = oom_badness(task, totalpages);
>> >> > > + rcu_read_lock();
>> >> > > + memcg = mem_cgroup_from_task(task);
>> >> > > + if (memcg && !css_tryget(&memcg->css))
>> >> > > + memcg = NULL;
>> >> > > + rcu_read_unlock();
>> >> > > +
>> >> > > + update_parent_oom_protection(root_mem_cgroup, memcg);
>> >> > > + css_put(&memcg->css);
>> >> > > +#endif
>> >> > > + badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
>> >> >
>> >> > the badness means different thing depending on which memcg hierarchy
>> >> > subtree you look at. Scaling based on the global oom could get really
>> >> > misleading.
>> >>
>> >> I also took it into consideration. I planned to change "/proc/pid/oom_score"
>> >> to a writable node. When writing to different cgroup paths, different values
>> >> will be output. The default output is root cgroup. Do you think this idea is
>> >> feasible?
>> >
>> >I do not follow. Care to elaborate?
>>
>> Take two example,
>> cmd: cat /proc/pid/oom_score
>> output: Scaling based on the global oom
>>
>> cmd: echo "/cgroupA/cgroupB" > /proc/pid/oom_score
>> output: Scaling based on the cgroupB oom
>> (If the task is not in the cgroupB's hierarchy subtree, output: invalid parameter)
>
>This is a terrible interface. First of all it assumes a state for the
>file without any way to guarantee atomicity. How do you deal with two
>different callers accessing the file?

When the echo command is executed, the kernel directly returns the
calculated oom_score; we do not need to perform an additional cat command,
and all temporary data is discarded immediately after the echo operation.
When the cat command is executed, the kernel treats the default value as the
root cgroup, so these two operations are atomic.
>
Thanks for your comment!
chengkaitao

2022-12-01 13:39:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Thu 01-12-22 10:52:35, 程垲涛 Chengkaitao Cheng wrote:
> At 2022-12-01 16:49:27, "Michal Hocko" <[email protected]> wrote:
> >On Thu 01-12-22 04:52:27, 程垲涛 Chengkaitao Cheng wrote:
> >> At 2022-12-01 00:27:54, "Michal Hocko" <[email protected]> wrote:
> >> >On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote:
> >> >> On 2022-11-30 21:15:06, "Michal Hocko" <[email protected]> wrote:
> >> >> > On Wed 30-11-22 15:01:58, chengkaitao wrote:
> >> >> > > From: chengkaitao <[email protected]>
> >> >> > >
> >> >> > > We created a new interface <memory.oom.protect> for memory, If there is
> >> >> > > the OOM killer under parent memory cgroup, and the memory usage of a
> >> >> > > child cgroup is within its effective oom.protect boundary, the cgroup's
> >> >> > > tasks won't be OOM killed unless there is no unprotected tasks in other
> >> >> > > children cgroups. It draws on the logic of <memory.min/low> in the
> >> >> > > inheritance relationship.
> >> >> >
> >> >> > Could you be more specific about usecases?
> >> >
> >> >This is a very important question to answer.
> >>
> >> usecases 1: users say that they want to protect an important process
> >> with high memory consumption from being killed by the oom in case
> >> of docker container failure, so as to retain more critical on-site
> >> information or a self recovery mechanism. At this time, they suggest
> >> setting the score_adj of this process to -1000, but I don't agree with
> >> it, because the docker container is not important to other docker
> >> containers of the same physical machine. If score_adj of the process
> >> is set to -1000, the probability of oom in other container processes will
> >> increase.
> >>
> >> usecases 2: There are many business processes and agent processes
> >> mixed together on a physical machine, and they need to be classified
> >> and protected. However, some agents are the parents of business
> >> processes, and some business processes are the parents of agent
> >> processes, It will be troublesome to set different score_adj for them.
> >> Business processes and agents cannot determine which level their
> >> score_adj should be at, If we create another agent to set all processes's
> >> score_adj, we have to cycle through all the processes on the physical
> >> machine regularly, which looks stupid.
> >
> >I do agree that oom_score_adj is far from ideal tool for these usecases.
> >But I also agree with Roman that these could be addressed by an oom
> >killer implementation in the userspace which can have much better
> >tailored policies. OOM protection limits would require tuning and also
> >regular revisions (e.g. memory consumption by any workload might change
> >with different kernel versions) to provide what you are looking for.
>
> There is a misunderstanding, oom.protect does not replace the user's
> tailed policies, Its purpose is to make it easier and more efficient for
> users to customize policies, or try to avoid users completely abandoning
> the oom score to formulate new policies.

Then you should focus on explaining how this makes those policies
easier and more efficient. I do not see it.

[...]

> >Why cannot you simply discount the protection from all processes
> >equally? I do not follow why the task_usage has to play any role in
> >that.
>
> If all processes are protected equally, the oom protection of cgroup is
> meaningless. For example, if there are more processes in the cgroup,
> the cgroup can protect more mems, it is unfair to cgroups with fewer
> processes. So we need to keep the total amount of memory that all
> processes in the cgroup need to protect consistent with the value of
> eoom.protect.

You are mixing two different concepts together, I am afraid. The
per-memcg protection should protect the cgroup (i.e. all processes in
that cgroup), while you want it to also be process aware. This results
in very unclear runtime behavior when a process from a more protected
memcg is selected based on its individual memory usage.
--
Michal Hocko
SUSE Labs

2022-12-01 14:05:34

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Thu 01-12-22 13:44:58, Michal Hocko wrote:
> On Thu 01-12-22 10:52:35, 程垲涛 Chengkaitao Cheng wrote:
> > At 2022-12-01 16:49:27, "Michal Hocko" <[email protected]> wrote:
[...]
> > >Why cannot you simply discount the protection from all processes
> > >equally? I do not follow why the task_usage has to play any role in
> > >that.
> >
> > If all processes are protected equally, the oom protection of cgroup is
> > meaningless. For example, if there are more processes in the cgroup,
> > the cgroup can protect more mems, it is unfair to cgroups with fewer
> > processes. So we need to keep the total amount of memory that all
> > processes in the cgroup need to protect consistent with the value of
> > eoom.protect.
>
> You are mixing two different concepts together I am afraid. The per
> memcg protection should protect the cgroup (i.e. all processes in that
> cgroup) while you want it to be also process aware. This results in a
> very unclear runtime behavior when a process from a more protected memcg
> is selected based on its individual memory usage.

Let me be more specific here. Although processes are the primary source
of memcg charges, the memory accounted for oom badness purposes is not
really comparable to the overall memcg charged memory. Kernel memory and
non-mapped memory can all generate rather interesting cornercases.
--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

At 2022-12-01 17:02:05, "Michal Hocko" <[email protected]> wrote:
>On Thu 01-12-22 07:49:04, 程垲涛 Chengkaitao Cheng wrote:
>> At 2022-12-01 07:29:11, "Roman Gushchin" <[email protected]> wrote:
>[...]
>> >The problem is that the decision which process(es) to kill or preserve
>> >is individual to a specific workload (and can be even time-dependent
>> >for a given workload).
>>
>> It is correct to kill a process with high workload, but it may not be the
>> most appropriate. I think the specific process to kill needs to be decided
>> by the user. I think it is the original intention of score_adj design.
>
>I guess what Roman tries to say here is that there is no obviously _correct_
>oom victim candidate. Well, except for a very narrow situation when
>there is a memory leak that consumes most of the memory over time. But
>that is really hard to identify by the oom selection algorithm in
>general.
>
>> >So it's really hard to come up with an in-kernel
>> >mechanism which is at the same time flexible enough to work for the majority
>> >of users and reliable enough to serve as the last oom resort measure (which
>> >is the basic goal of the kernel oom killer).
>> >
>> Our goal is to find a method that is less intrusive to the existing
>> mechanisms of the kernel, and find a more reasonable supplement
>> or alternative to the limitations of score_adj.
>>
>> >Previously the consensus was to keep the in-kernel oom killer dumb and reliable
>> >and implement complex policies in userspace (e.g. systemd-oomd etc).
>> >
>> >Is there a reason why such approach can't work in your case?
>>
>> I think that as kernel developers, we should try our best to provide
>> users with simpler and more powerful interfaces. It is clear that the
>> current oom score mechanism has many limitations. Users need to
>> do a lot of timed loop detection in order to complete work similar
>> to the oom score mechanism, or develop a new mechanism just to
>> skip the imperfect oom score mechanism. This is an inefficient and
>> forced behavior
>
>You are right that it makes sense to address typical usecases in the
>kernel if that is possible. But oom victim selection is really hard
>without a deeper understanding of the actual workload. The more clever
>we try to be the more corner cases we can produce. Please note that this
>has proven to be the case in the long oom development history. We used
>to sacrifice child processes over a large process to preserve work or
>prefer younger processes. Both those strategies led to problems.
>
>Memcg protection based mechanism sounds like an interesting idea because
>it mimics a reclaim protection scheme but I am a bit sceptical it will
>be practically useful. Most for 2 reasons. a) memory reclaim protection
>can be dynamically tuned because on reclaim/refault/psi metrics. oom
>events are rare and mostly a failure situation. This limits any feedback
>based approach IMHO. b) Hierarchical nature of the protection will make
>it quite hard to configure properly with predictable outcome.
>
More and more users want to save costs by setting memory.max to a very
small value. This results in a small number of oom events, which users
can tolerate, but they want to minimize the impact of those events. In
such scenarios, oom events are no longer abnormal or unpredictable, and
we need to provide convenient oom policies for users to choose from.

Users want a greater say in oom victim selection, but they cannot see
other users' workloads, so they cannot accurately formulate their own
oom policies; this is a real contradiction. Therefore, we hope that each
user's customized policy can be independent of the others and not
interfere with them.

>--
>Michal Hocko
>SUSE Labs

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

At 2022-12-01 21:08:26, "Michal Hocko" <[email protected]> wrote:
>On Thu 01-12-22 13:44:58, Michal Hocko wrote:
>> On Thu 01-12-22 10:52:35, 程垲涛 Chengkaitao Cheng wrote:
>> > At 2022-12-01 16:49:27, "Michal Hocko" <[email protected]> wrote:
>[...]
>> There is a misunderstanding, oom.protect does not replace the user's
>> tailed policies, Its purpose is to make it easier and more efficient for
>> users to customize policies, or try to avoid users completely abandoning
>> the oom score to formulate new policies.
>
> Then you should focus on explaining on how this makes those policies and
> easier and moe efficient. I do not see it.

In fact, some of this was covered earlier in the thread. If oom.protect
is applied, it has the following benefits:
1. Users only need to focus on managing their local cgroup, not on the
impact on other users' cgroups.
2. Users and the system do not need to spend extra time on complicated,
repeated scanning and configuration. They just need to configure the
oom.protect of the specific cgroups, which is a one-time task.

>> > >Why cannot you simply discount the protection from all processes
>> > >equally? I do not follow why the task_usage has to play any role in
>> > >that.
>> >
>> > If all processes are protected equally, the oom protection of cgroup is
>> > meaningless. For example, if there are more processes in the cgroup,
>> > the cgroup can protect more mems, it is unfair to cgroups with fewer
>> > processes. So we need to keep the total amount of memory that all
>> > processes in the cgroup need to protect consistent with the value of
>> > eoom.protect.
>>
>> You are mixing two different concepts together I am afraid. The per
>> memcg protection should protect the cgroup (i.e. all processes in that
>> cgroup) while you want it to be also process aware. This results in a
>> very unclear runtime behavior when a process from a more protected memcg
>> is selected based on its individual memory usage.
>
The correct statement here should be that each memcg's protection
protects the amount of memory specified by its oom.protect. For example,
if a cgroup's usage is 6G and its oom.protect is 2G, then when the OOM
killer runs, in the worst case we will only reduce the memory used by
this cgroup down to 2G.
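
To spell out my reading of those numbers (ignoring score_adj and any
memory that is not attributed to a task):

  discount(task) = oom.protect * task_usage / cgroup_usage
  4G task: 2G * 4/6 ~= 1.33G  ->  remaining badness ~= 2.67G
  2G task: 2G * 2/6 ~= 0.67G  ->  remaining badness ~= 1.33G

The discounts add up to the protected 2G, and once the cgroup's usage
has been pushed down to 2G the discount cancels each task's badness, so
its tasks stop being selected as long as unprotected tasks exist
elsewhere, at least under this simplified view.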

>Let me be more specific here. Although it is primarily processes which
>are the primary source of memcg charges the memory accounted for the oom
>badness purposes is not really comparable to the overal memcg charged
>memory. Kernel memory, non-mapped memory all that can generate rather
>interesting cornercases.

Sorry, I did not think carefully enough about some of the special memory
statistics. I will fix it in the next version.

Thanks for your comment!
chengkaitao

2022-12-01 16:43:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Thu 01-12-22 14:30:11, 程垲涛 Chengkaitao Cheng wrote:
> At 2022-12-01 21:08:26, "Michal Hocko" <[email protected]> wrote:
> >On Thu 01-12-22 13:44:58, Michal Hocko wrote:
> >> On Thu 01-12-22 10:52:35, 程垲涛 Chengkaitao Cheng wrote:
> >> > At 2022-12-01 16:49:27, "Michal Hocko" <[email protected]> wrote:
> >[...]
> >> There is a misunderstanding, oom.protect does not replace the user's
> >> tailed policies, Its purpose is to make it easier and more efficient for
> >> users to customize policies, or try to avoid users completely abandoning
> >> the oom score to formulate new policies.
> >
> > Then you should focus on explaining on how this makes those policies and
> > easier and moe efficient. I do not see it.
>
> In fact, there are some relevant contents in the previous chat records.
> If oom.protect is applied, it will have the following benefits
> 1. Users only need to focus on the management of the local cgroup, not the
> impact on other users' cgroups.

Protection based balancing cannot really work in isolation.

> 2. Users and system do not need to spend extra time on complicated and
> repeated scanning and configuration. They just need to configure the
> oom.protect of specific cgroups, which is a one-time task

This will not work, in the same way that memory reclaim protection
cannot work in isolation at the memcg level.

> >> > >Why cannot you simply discount the protection from all processes
> >> > >equally? I do not follow why the task_usage has to play any role in
> >> > >that.
> >> >
> >> > If all processes are protected equally, the oom protection of cgroup is
> >> > meaningless. For example, if there are more processes in the cgroup,
> >> > the cgroup can protect more mems, it is unfair to cgroups with fewer
> >> > processes. So we need to keep the total amount of memory that all
> >> > processes in the cgroup need to protect consistent with the value of
> >> > eoom.protect.
> >>
> >> You are mixing two different concepts together I am afraid. The per
> >> memcg protection should protect the cgroup (i.e. all processes in that
> >> cgroup) while you want it to be also process aware. This results in a
> >> very unclear runtime behavior when a process from a more protected memcg
> >> is selected based on its individual memory usage.
> >
> The correct statement here should be that each memcg protection should
> protect the number of mems specified by the oom.protect. For example,
> a cgroup's usage is 6G, and it's oom.protect is 2G, when an oom killer occurs,
> In the worst case, we will only reduce the memory used by this cgroup to 2G
> through the om killer.

I do not see how that could be guaranteed. Please keep in mind that a
non-trivial amount of memory resources could be completely independent
of any process life time (just consider tmpfs as a trivial example).

> >Let me be more specific here. Although it is primarily processes which
> >are the primary source of memcg charges the memory accounted for the oom
> >badness purposes is not really comparable to the overal memcg charged
> >memory. Kernel memory, non-mapped memory all that can generate rather
> >interesting cornercases.
>
> Sorry, I'm thoughtless enough about some special memory statistics. I will fix
> it in the next version

Let me just emphasise that we are talking about a fundamental disconnect.
RSS based accounting has been used for OOM killer selection because
the memory gets unmapped and _potentially_ freed when the process goes
away. Memcg charges are bound to the object life time and, as said, in
many cases there is no direct relation to any process life time.

Hope that clarifies.
--
Michal Hocko
SUSE Labs

2022-12-01 20:41:20

by Mina Almasry

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

On Wed, Nov 30, 2022 at 3:29 PM Roman Gushchin <[email protected]> wrote:
>
> On Wed, Nov 30, 2022 at 03:01:58PM +0800, chengkaitao wrote:
> > From: chengkaitao <[email protected]>
> >
> > We created a new interface <memory.oom.protect> for memory, If there is
> > the OOM killer under parent memory cgroup, and the memory usage of a
> > child cgroup is within its effective oom.protect boundary, the cgroup's
> > tasks won't be OOM killed unless there is no unprotected tasks in other
> > children cgroups. It draws on the logic of <memory.min/low> in the
> > inheritance relationship.
> >
> > It has the following advantages,
> > 1. We have the ability to protect more important processes, when there
> > is a memcg's OOM killer. The oom.protect only takes effect local memcg,
> > and does not affect the OOM killer of the host.
> > 2. Historically, we can often use oom_score_adj to control a group of
> > processes, It requires that all processes in the cgroup must have a
> > common parent processes, we have to set the common parent process's
> > oom_score_adj, before it forks all children processes. So that it is
> > very difficult to apply it in other situations. Now oom.protect has no
> > such restrictions, we can protect a cgroup of processes more easily. The
> > cgroup can keep some memory, even if the OOM killer has to be called.
>
> It reminds me our attempts to provide a more sophisticated cgroup-aware oom
> killer. The problem is that the decision which process(es) to kill or preserve
> is individual to a specific workload (and can be even time-dependent
> for a given workload). So it's really hard to come up with an in-kernel
> mechanism which is at the same time flexible enough to work for the majority
> of users and reliable enough to serve as the last oom resort measure (which
> is the basic goal of the kernel oom killer).
>
> Previously the consensus was to keep the in-kernel oom killer dumb and reliable
> and implement complex policies in userspace (e.g. systemd-oomd etc).
>
> Is there a reason why such approach can't work in your case?
>

FWIW we run into similar issues and the systemd-oomd approach doesn't
work reliably enough for us to disable the kernel oom-killer. The
issue, as I understand it, is that when the machine is under heavy memory
pressure our userspace oom-killer fails to run quickly enough to save
the machine from getting completely stuck. Why our oom-killer fails to
run is more nuanced. There are cases where it is itself stuck trying to
acquire memory to do the oom-killing, or stuck on some lock that needs
to be released by a process that is in turn stuck trying to acquire
memory to release the lock, etc.

When the kernel oom-killer does run we would like to shield the
important jobs from it and kill the batch jobs or restartable
processes instead. So we have a similar feature to what is proposed
here internally. Our design is a bit different: we pretty much let
userspace completely override the oom_badness score:

1. Every process has /proc/pid/oom_score_badness which overrides the
kernel's calculation if set.
2. Every memcg has a memory.oom_score_badness which indicates this
memcg's oom importance.

On global oom the kernel pretty much kills the baddest process in the
baddest memcg, so we can 'protect' the important jobs from
oom-killing that way.
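
To make that concrete, here is a rough userspace-style sketch of the
selection order described above (hypothetical types and names, not our
actual implementation):

struct oom_task {
        long kernel_badness;    /* the kernel's usual oom_badness() result */
        long badness_override;  /* value written to /proc/<pid>/oom_score_badness */
        int  has_override;
};

static long task_badness(const struct oom_task *t)
{
        /* a per-process override, if set, replaces the kernel's score */
        return t->has_override ? t->badness_override : t->kernel_badness;
}

/*
 * On global oom: first pick the memcg with the highest
 * memory.oom_score_badness, then kill the task with the highest
 * task_badness() inside it.
 */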

I haven't tried upstreaming this because I assume there would be
little appetite for it as a general-use feature, but if the general use
case is interesting to others it would be good to collaborate on some
way for folks who enable the kernel oom-killer to shield certain
important jobs.

> Thanks!
>

Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

At 2022-12-01 23:17:49, "Michal Hocko" <[email protected]> wrote:
>On Thu 01-12-22 14:30:11, 程垲涛 Chengkaitao Cheng wrote:
>> At 2022-12-01 21:08:26, "Michal Hocko" <[email protected]> wrote:
>> >On Thu 01-12-22 13:44:58, Michal Hocko wrote:
>> >> On Thu 01-12-22 10:52:35, 程垲涛 Chengkaitao Cheng wrote:
>> >> > At 2022-12-01 16:49:27, "Michal Hocko" <[email protected]> wrote:
>> >[...]
>> >> There is a misunderstanding, oom.protect does not replace the user's
>> >> tailed policies, Its purpose is to make it easier and more efficient for
>> >> users to customize policies, or try to avoid users completely abandoning
>> >> the oom score to formulate new policies.
>> >
>> > Then you should focus on explaining on how this makes those policies and
>> > easier and moe efficient. I do not see it.
>>
>> In fact, there are some relevant contents in the previous chat records.
>> If oom.protect is applied, it will have the following benefits
>> 1. Users only need to focus on the management of the local cgroup, not the
>> impact on other users' cgroups.
>
>Protection based balancing cannot really work in an isolation.

I think that a cgroup only needs to be concerned with the protection
values of its child cgroups, which makes it independent in a certain
sense.

>> 2. Users and system do not need to spend extra time on complicated and
>> repeated scanning and configuration. They just need to configure the
>> oom.protect of specific cgroups, which is a one-time task
>
>This will not work same way as the memory reclaim protection cannot work
>in an isolation on the memcg level.

The parent cgroup's oom.protect can change the child cgroup's actual
protected memory size, which is exactly what we need. Because of that,
the child cgroup can set its own oom.protect at will.
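
As an illustration of the inheritance I have in mind, here is a
userspace-style sketch modelled on the memory.min/low clamping (not the
patch code; the names are made up):

/*
 * A child's effective oom.protect is its own setting, capped by its
 * usage and by what the parent's effective protection can cover.  If
 * the children together claim more than the parent's effective
 * protection (overcommit), each child gets a share proportional to
 * its own claim.  siblings_claim is the sum of min(oom.protect, usage)
 * over all of the parent's children.
 */
static unsigned long effective_oom_protect(unsigned long protect,
                                           unsigned long usage,
                                           unsigned long parent_eprotect,
                                           unsigned long siblings_claim)
{
        unsigned long claim = protect < usage ? protect : usage;

        if (!parent_eprotect || !siblings_claim)
                return 0;
        if (siblings_claim <= parent_eprotect)
                return claim;

        /* overcommit: split the parent's protection proportionally */
        return parent_eprotect * claim / siblings_claim;
}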

>> >> > >Why cannot you simply discount the protection from all processes
>> >> > >equally? I do not follow why the task_usage has to play any role in
>> >> > >that.
>> >> >
>> >> > If all processes are protected equally, the oom protection of cgroup is
>> >> > meaningless. For example, if there are more processes in the cgroup,
>> >> > the cgroup can protect more mems, it is unfair to cgroups with fewer
>> >> > processes. So we need to keep the total amount of memory that all
>> >> > processes in the cgroup need to protect consistent with the value of
>> >> > eoom.protect.
>> >>
>> >> You are mixing two different concepts together I am afraid. The per
>> >> memcg protection should protect the cgroup (i.e. all processes in that
>> >> cgroup) while you want it to be also process aware. This results in a
>> >> very unclear runtime behavior when a process from a more protected memcg
>> >> is selected based on its individual memory usage.
>> >
>> The correct statement here should be that each memcg protection should
>> protect the number of mems specified by the oom.protect. For example,
>> a cgroup's usage is 6G, and it's oom.protect is 2G, when an oom killer occurs,
>> In the worst case, we will only reduce the memory used by this cgroup to 2G
>> through the om killer.
>
>I do not see how that could be guaranteed. Please keep in mind that a
>non-trivial amount of memory resources could be completely independent
>on any process life time (just consider tmpfs as a trivial example).
>
>> >Let me be more specific here. Although it is primarily processes which
>> >are the primary source of memcg charges the memory accounted for the oom
>> >badness purposes is not really comparable to the overal memcg charged
>> >memory. Kernel memory, non-mapped memory all that can generate rather
>> >interesting cornercases.
>>
>> Sorry, I'm thoughtless enough about some special memory statistics. I will fix
>> it in the next version
>
>Let me just emphasise that we are talking about fundamental disconnect.
>Rss based accounting has been used for the OOM killer selection because
>the memory gets unmapped and _potentially_ freed when the process goes
>away. Memcg changes are bound to the object life time and as said in
>many cases there is no direct relation with any process life time.

Based on your question, I would like to revise the formula as follows:

  score = task_usage + score_adj * totalpage
          - eoom.protect * (task_usage - task_rssshare)
            / (local_memcg_usage + local_memcg_swapcache)

After a process is killed, its unmapped page cache and shared memory
will not be released immediately, so they should not consume the
cgroup's protection quota. In extreme cases, memory that cannot be
released by the oom killer (i.e. memory not charged to any process) may
occupy a large share of the protection quota, but that is expected. Of
course, the idea may have problems that I haven't considered.
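
To make the revised calculation explicit, here is a userspace-style
sketch that follows the formula above literally (only the names used in
the formula are meaningful, the rest is illustrative):

/*
 * Only the part of the task's usage that would actually be freed by
 * killing it (task_usage - task_rssshare, i.e. excluding the shared
 * part) earns a protection discount, and the discount is scaled
 * against all of the cgroup's local charges including swap cache.
 */
static long revised_score(long task_usage, long task_rssshare,
                          long score_adj, long totalpage,
                          long eoom_protect,
                          long local_memcg_usage, long local_memcg_swapcache)
{
        long charged = local_memcg_usage + local_memcg_swapcache;
        long discount = 0;

        if (charged > 0)
                discount = eoom_protect * (task_usage - task_rssshare)
                           / charged;

        /* the score_adj term is kept exactly as written in the formula */
        return task_usage + score_adj * totalpage - discount;
}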

>
>Hope that clarifies.

Thanks for your comment!
chengkaitao