This patchset adds a new interface and uses it to select the victim when
the OOM killer is invoked. The main motivation is the need for customizable
OOM victim selection functionality.
The new interface is a BPF hook plugged into oom_evaluate_task(). It takes
oc and the currently iterated task as parameters and returns a result
indicating which of the two is selected by the attached BPF program.
There were several concerns, raised by Michal, that shaped the design of
this interface:
1. Hooking into oom_evaluate_task() keeps the global and memcg OOM
interfaces consistent. Besides, it seems the least disruptive to the
existing OOM killer implementation.
2. Userspace can handle a lot on its own and provide the input to the BPF
program to make a decision. Since the OOM scope iteration is already
implemented in the kernel, all the BPF program has to do is rank processes
or memcgs.
Previous discussion link:
[1]https://lore.kernel.org/lkml/[email protected]/
Chuyi Zhou (2):
mm, oom: Introduce bpf_select_task
bpf: Add OOM policy test
mm/oom_kill.c | 57 ++++++-
.../bpf/prog_tests/test_oom_policy.c | 140 ++++++++++++++++++
.../testing/selftests/bpf/progs/oom_policy.c | 77 ++++++++++
3 files changed, 267 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c
--
2.20.1
This patch adds a new hook, bpf_select_task(), in oom_evaluate_task(). It
takes oc and the currently iterated task as parameters and returns a result
indicating which of the two is selected by the attached BPF program.
Although bpf_select_task() is used to bypass the default method, some
existing rules should still be obeyed. Specifically, we skip "unkillable"
tasks (e.g., kthreads, tasks with MMF_OOM_SKIP set, or tasks in vfork()).
Consequently, we do not consider tasks with the lowest score returned by
oom_badness() unless that score was caused by OOM_SCORE_ADJ_MIN.
If a prog is attached to the hook, the interface is enabled only once we
have successfully chosen at least one valid candidate in a previous
iteration. This avoids ending up with no victim at all when the BPF program
rejects every task.
Signed-off-by: Chuyi Zhou <[email protected]>
---
mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 50 insertions(+), 7 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..aec4c55ed49a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@
* kernel subsystems and hints as to where to find out what things do.
*/
+#include <linux/bpf.h>
#include <linux/oom.h>
#include <linux/mm.h>
#include <linux/err.h>
@@ -210,6 +211,16 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
if (!p)
return LONG_MIN;
+ /*
+ * If task is allocating a lot of memory and has been marked to be
+ * killed first if it triggers an oom, then set points to LONG_MAX.
+ * It will be selected unless we keep oc->chosen through bpf interface.
+ */
+ if (oom_task_origin(p)) {
+ task_unlock(p);
+ return LONG_MAX;
+ }
+
/*
* Do not even consider tasks which are explicitly marked oom
* unkillable or have been already oom reaped or the are in
@@ -305,8 +316,30 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}
+enum bpf_select_ret {
+ BPF_SELECT_DISABLE,
+ BPF_SELECT_TASK,
+ BPF_SELECT_CHOSEN,
+};
+
+__weak noinline int bpf_select_task(struct oom_control *oc,
+ struct task_struct *task, long badness_points)
+{
+ return BPF_SELECT_DISABLE;
+}
+
+BTF_SET8_START(oom_bpf_fmodret_ids)
+BTF_ID_FLAGS(func, bpf_select_task)
+BTF_SET8_END(oom_bpf_fmodret_ids)
+
+static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
+ .owner = THIS_MODULE,
+ .set = &oom_bpf_fmodret_ids,
+};
+
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
+ enum bpf_select_ret bpf_ret = BPF_SELECT_DISABLE;
struct oom_control *oc = arg;
long points;
@@ -329,17 +362,23 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto abort;
}
+ points = oom_badness(task, oc->totalpages);
+
/*
- * If task is allocating a lot of memory and has been marked to be
- * killed first if it triggers an oom, then select it.
+ * Do not consider tasks with lowest score value except it was caused
+ * by OOM_SCORE_ADJ_MIN. Give these tasks a chance to be selected by
+ * bpf interface.
*/
- if (oom_task_origin(task)) {
- points = LONG_MAX;
+ if (points == LONG_MIN && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+ goto next;
+
+ if (oc->chosen)
+ bpf_ret = bpf_select_task(oc, task, points);
+
+ if (bpf_ret == BPF_SELECT_TASK)
goto select;
- }
- points = oom_badness(task, oc->totalpages);
- if (points == LONG_MIN || points < oc->chosen_points)
+ if (bpf_ret == BPF_SELECT_CHOSEN || points == LONG_MIN || points < oc->chosen_points)
goto next;
select:
@@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
static int __init oom_init(void)
{
+ int err;
oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
#ifdef CONFIG_SYSCTL
register_sysctl_init("vm", vm_oom_kill_table);
#endif
+ err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
+ if (err)
+ pr_warn("error while registering oom fmodret entrypoints: %d", err);
return 0;
}
subsys_initcall(oom_init)
--
2.20.1
This patch adds a test which implements a priority-based policy through
bpf_select_task().
The BPF program, oom_policy.c, compares the cgroup priorities of two tasks
and selects the one with the lower priority. The userspace program
test_oom_policy.c maintains a priority map using cgroup ids as keys and
priorities as values. We can protect certain cgroups from the OOM killer
by assigning them a higher priority.
Signed-off-by: Chuyi Zhou <[email protected]>
---
.../bpf/prog_tests/test_oom_policy.c | 140 ++++++++++++++++++
.../testing/selftests/bpf/progs/oom_policy.c | 77 ++++++++++
2 files changed, 217 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c
diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
new file mode 100644
index 000000000000..2400cc48ba83
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/stat.h>
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "oom_policy.skel.h"
+
+static int map_fd;
+static int cg_nr;
+struct {
+ const char *path;
+ int fd;
+ unsigned long long id;
+} cgs[] = {
+ { "/cg1" },
+ { "/cg2" },
+};
+
+
+static struct oom_policy *open_load_oom_policy_skel(void)
+{
+ struct oom_policy *skel;
+ int err;
+
+ skel = oom_policy__open();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ return NULL;
+
+ err = oom_policy__load(skel);
+ if (!ASSERT_OK(err, "skel_load"))
+ goto cleanup;
+
+ return skel;
+
+cleanup:
+ oom_policy__destroy(skel);
+ return NULL;
+}
+
+static void run_memory_consume(unsigned long long consume_size, int idx)
+{
+ char *buf;
+
+ join_parent_cgroup(cgs[idx].path);
+ buf = malloc(consume_size);
+ memset(buf, 0, consume_size);
+ sleep(2);
+ exit(0);
+}
+
+static int set_cgroup_prio(unsigned long long cg_id, int prio)
+{
+ int err;
+
+ err = bpf_map_update_elem(map_fd, &cg_id, &prio, BPF_ANY);
+ ASSERT_EQ(err, 0, "update_map");
+ return err;
+}
+
+static int prepare_cgroup_environment(void)
+{
+ int err;
+
+ err = setup_cgroup_environment();
+ if (err)
+ goto clean_cg_env;
+ for (int i = 0; i < cg_nr; i++) {
+ err = cgs[i].fd = create_and_get_cgroup(cgs[i].path);
+ if (!ASSERT_GE(cgs[i].fd, 0, "cg_create"))
+ goto clean_cg_env;
+ cgs[i].id = get_cgroup_id(cgs[i].path);
+ }
+ return 0;
+clean_cg_env:
+ cleanup_cgroup_environment();
+ return err;
+}
+
+void test_oom_policy(void)
+{
+ struct oom_policy *skel;
+ struct bpf_link *link;
+ int err;
+ int victim_pid;
+ unsigned long long victim_cg_id;
+
+ link = NULL;
+ cg_nr = ARRAY_SIZE(cgs);
+
+ skel = open_load_oom_policy_skel();
+ err = oom_policy__attach(skel);
+ if (!ASSERT_OK(err, "oom_policy__attach"))
+ goto cleanup;
+
+ map_fd = bpf_object__find_map_fd_by_name(skel->obj, "cg_map");
+ if (!ASSERT_GE(map_fd, 0, "find map"))
+ goto cleanup;
+
+ err = prepare_cgroup_environment();
+ if (!ASSERT_EQ(err, 0, "prepare cgroup env"))
+ goto cleanup;
+
+ write_cgroup_file("/", "memory.max", "10M");
+
+ /*
+ * Set higher priority to cg2 and lower to cg1, so we would select
+ * task under cg1 as victim.(see oom_policy.c)
+ */
+ set_cgroup_prio(cgs[0].id, 10);
+ set_cgroup_prio(cgs[1].id, 50);
+
+ victim_cg_id = cgs[0].id;
+ victim_pid = fork();
+
+ if (victim_pid == 0)
+ run_memory_consume(1024 * 1024 * 4, 0);
+
+ if (fork() == 0)
+ run_memory_consume(1024 * 1024 * 8, 1);
+
+ while (wait(NULL) > 0)
+ ;
+
+ ASSERT_EQ(skel->bss->victim_pid, victim_pid, "victim_pid");
+ ASSERT_EQ(skel->bss->victim_cg_id, victim_cg_id, "victim_cgid");
+
+cleanup:
+ bpf_link__destroy(link);
+ oom_policy__destroy(skel);
+ cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/oom_policy.c b/tools/testing/selftests/bpf/progs/oom_policy.c
new file mode 100644
index 000000000000..d269ea52bcb2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/oom_policy.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, int);
+ __type(value, int);
+ __uint(max_entries, 24);
+} cg_map SEC(".maps");
+
+unsigned int victim_pid;
+u64 victim_cg_id;
+
+enum bpf_select_ret {
+ BPF_SELECT_DISABLE,
+ BPF_SELECT_TASK,
+ BPF_SELECT_CHOSEN,
+};
+
+static __always_inline u64 task_cgroup_id(struct task_struct *task)
+{
+ struct kernfs_node *node;
+ struct task_group *tg;
+
+ if (!task)
+ return 0;
+
+ tg = task->sched_task_group;
+ node = tg->css.cgroup->kn;
+
+ return node->id;
+}
+
+SEC("fentry/oom_kill_process")
+int BPF_PROG(oom_kill_process_k, struct oom_control *oc, const char *message)
+{
+ struct task_struct *victim = oc->chosen;
+
+ if (!victim)
+ return 0;
+
+ victim_pid = victim->pid;
+ victim_cg_id = task_cgroup_id(victim);
+ return 0;
+}
+
+SEC("fmod_ret/bpf_select_task")
+int BPF_PROG(select_task_test, struct oom_control *oc, struct task_struct *task, long points)
+{
+ u64 chosen_cg_id, task_cg_id;
+ int chosen_cg_prio, task_cg_prio;
+ struct task_struct *chosen;
+ int *val;
+
+ chosen = oc->chosen;
+ chosen_cg_id = task_cgroup_id(chosen);
+ task_cg_id = task_cgroup_id(task);
+ chosen_cg_prio = task_cg_prio = 0;
+ val = bpf_map_lookup_elem(&cg_map, &chosen_cg_id);
+ if (val)
+ chosen_cg_prio = *val;
+ val = bpf_map_lookup_elem(&cg_map, &task_cg_id);
+ if (val)
+ task_cg_prio = *val;
+
+ if (chosen_cg_prio > task_cg_prio)
+ return BPF_SELECT_TASK;
+ if (chosen_cg_prio < task_cg_prio)
+ return BPF_SELECT_CHOSEN;
+
+ return BPF_SELECT_DISABLE;
+
+}
+
--
2.20.1
On Fri 04-08-23 17:38:03, Chuyi Zhou wrote:
> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
> takes oc and current iterating task as parameters and returns a result
> indicating which one is selected by bpf program.
>
> Although bpf_select_task is used to bypass the default method, there are
> some existing rules should be obeyed. Specifically, we skip these
> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
> consider tasks with lowest score returned by oom_badness except it was
> caused by OOM_SCORE_ADJ_MIN.
Is this really necessary? I do get why we need to preserve the
OOM_SCORE_ADJ_* semantics for the in-kernel OOM selection logic, but why
should an arbitrary OOM policy care? Look at it from the perspective of an
arbitrary userspace-based policy: it just picks a task or memcg and kills
tasks by sending the SIGKILL (or maybe SIGTERM first) signal. oom_score
constraints will not prevent anybody from doing that.
tsk_is_oom_victim (and MMF_OOM_SKIP) is a slightly different case, but
not too much. The primary motivation is to prevent new OOM victims
while there is one already being killed. This is a reasonable heuristic,
especially with the async OOM reclaim (oom_reaper). It also reduces the
amount of OOM emergency memory reserves used to some degree, but since
those are not absolute this is no longer the primary motivation. _But_ I
can imagine that some policies might be much more aggressive and allow
selecting new victims if preexisting ones are not being killed in time.
oom_unkillable_task is a general sanity check, so it should remain in
place.
I am not really sure about oom_task_origin. That is just a very weird
case and I guess it wouldn't hurt to keep it in generic path.
All that being said, I think we want something like the following (very
much pseudo-code). I have no idea what the proper way to define BPF hooks
is, though, so help from the BPF maintainers would be more than handy.
---
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 00982b133dc1..9f1743ee2b28 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -190,10 +190,6 @@ static inline bool trigger_all_cpu_backtrace(void)
{
return false;
}
-static inline bool trigger_allbutself_cpu_backtrace(void)
-{
- return false;
-}
static inline bool trigger_cpumask_backtrace(struct cpumask *mask)
{
return false;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..c9e04be52700 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -317,6 +317,22 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
goto next;
+ /*
+ * If task is allocating a lot of memory and has been marked to be
+ * killed first if it triggers an oom, then select it.
+ */
+ if (oom_task_origin(task)) {
+ points = LONG_MAX;
+ goto select;
+ }
+
+ switch (bpf_oom_evaluate_task(task, oc, &points)) {
+ case -EOPNOTSUPP: break; /* No BPF policy */
+ case -EBUSY: goto abort; /* abort search process */
+ case 0: goto next; /* ignore process */
+ default: goto select; /* note the task */
+ }
+
/*
* This task already has access to memory reserves and is being killed.
* Don't allow any other task to have access to the reserves unless
@@ -329,15 +345,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto abort;
}
- /*
- * If task is allocating a lot of memory and has been marked to be
- * killed first if it triggers an oom, then select it.
- */
- if (oom_task_origin(task)) {
- points = LONG_MAX;
- goto select;
- }
-
points = oom_badness(task, oc->totalpages);
if (points == LONG_MIN || points < oc->chosen_points)
goto next;
--
Michal Hocko
SUSE Labs
On 04/08/2023 10:38, Chuyi Zhou wrote:
> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
bpf_select_task() feels like too generic a name - bpf_oom_select_task()
might make the context clearer.
I'd also suggest adding a documentation patch for a new
Documentation/bpf/oom.rst or whatever to describe how it is all supposed
to work.
> takes oc and current iterating task as parameters and returns a result
> indicating which one is selected by bpf program.
>
> Although bpf_select_task is used to bypass the default method, there are
> some existing rules should be obeyed. Specifically, we skip these
> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
> consider tasks with lowest score returned by oom_badness except it was
> caused by OOM_SCORE_ADJ_MIN.
>
> If we attach a prog to the hook, the interface is enabled only when we have
> successfully chosen at least one valid candidate in previous iteraion. This
> is to avoid that we find nothing if bpf program rejects all tasks.
>
I don't know anything about OOM mechanisms, so maybe it's just me, but I
found this confusing. Relying on the previous iteration to control
current iteration behaviour seems risky - even if BPF found a victim in
iteration N, it's no guarantee it will in iteration N+1.
Naively I would have thought the right answer here would be to honour
the choice OOM would have made (in the absence of BPF execution) for
cases where BPF did not select a victim. Is that sort of scheme
workable? Does that make sense from the mm side, or would we actually
want to fall back to
pr_warn("Out of memory and no killable processes...\n");
...if BPF didn't select a process?
The danger here seems to be that the current non-BPF mechanism seems to
be guaranteed to find a chosen victim, but delegating to BPF is not. So
what is the right behaviour for such cases from the mm perspective?
(One thing that would probably be worth doing from the BPF side would be
to add a tracepoint to mark the scenario where nothing was chosen for
OOM kill via BPF; this would allow BPF programs to catch the fact that
their OOM selection mechanisms didn't work.)
Alan
> Signed-off-by: Chuyi Zhou <[email protected]>
> ---
> mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 50 insertions(+), 7 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 612b5597d3af..aec4c55ed49a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -18,6 +18,7 @@
> * kernel subsystems and hints as to where to find out what things do.
> */
>
> +#include <linux/bpf.h>
> #include <linux/oom.h>
> #include <linux/mm.h>
> #include <linux/err.h>
> @@ -210,6 +211,16 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
> if (!p)
> return LONG_MIN;
>
> + /*
> + * If task is allocating a lot of memory and has been marked to be
> + * killed first if it triggers an oom, then set points to LONG_MAX.
> + * It will be selected unless we keep oc->chosen through bpf interface.
> + */
> + if (oom_task_origin(p)) {
> + task_unlock(p);
> + return LONG_MAX;
> + }
> +
> /*
> * Do not even consider tasks which are explicitly marked oom
> * unkillable or have been already oom reaped or the are in
> @@ -305,8 +316,30 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
> return CONSTRAINT_NONE;
> }
>
> +enum bpf_select_ret {
> + BPF_SELECT_DISABLE,
> + BPF_SELECT_TASK,
> + BPF_SELECT_CHOSEN,
> +};
> +
> +__weak noinline int bpf_select_task(struct oom_control *oc,
> + struct task_struct *task, long badness_points)
> +{
> + return BPF_SELECT_DISABLE;
> +}
> +
> +BTF_SET8_START(oom_bpf_fmodret_ids)
> +BTF_ID_FLAGS(func, bpf_select_task)
> +BTF_SET8_END(oom_bpf_fmodret_ids)
> +
> +static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
> + .owner = THIS_MODULE,
> + .set = &oom_bpf_fmodret_ids,
> +};
> +
> static int oom_evaluate_task(struct task_struct *task, void *arg)
> {
> + enum bpf_select_ret bpf_ret = BPF_SELECT_DISABLE;
> struct oom_control *oc = arg;
> long points;
>
> @@ -329,17 +362,23 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> goto abort;
> }
>
> + points = oom_badness(task, oc->totalpages);
> +
> /*
> - * If task is allocating a lot of memory and has been marked to be
> - * killed first if it triggers an oom, then select it.
> + * Do not consider tasks with lowest score value except it was caused
> + * by OOM_SCORE_ADJ_MIN. Give these tasks a chance to be selected by
> + * bpf interface.
> */
> - if (oom_task_origin(task)) {
> - points = LONG_MAX;
> + if (points == LONG_MIN && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + goto next;
> +
> + if (oc->chosen)
> + bpf_ret = bpf_select_task(oc, task, points);
> +
> + if (bpf_ret == BPF_SELECT_TASK)
> goto select;
> - }
>
> - points = oom_badness(task, oc->totalpages);
> - if (points == LONG_MIN || points < oc->chosen_points)
> + if (bpf_ret == BPF_SELECT_CHOSEN || points == LONG_MIN || points < oc->chosen_points)
> goto next;
>
> select:
> @@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
>
> static int __init oom_init(void)
> {
> + int err;
> oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> #ifdef CONFIG_SYSCTL
> register_sysctl_init("vm", vm_oom_kill_table);
> #endif
probably worth having #ifdef CONFIG_BPF or similar here..
> + err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
> + if (err)
> + pr_warn("error while registering oom fmodret entrypoints: %d", err);
> return 0;
> }
> subsys_initcall(oom_init)
On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
[...]
> > + switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > + case -EOPNOTSUPP: break; /* No BPF policy */
> > + case -EBUSY: goto abort; /* abort search process */
> > + case 0: goto next; /* ignore process */
> > + default: goto select; /* note the task */
> > + }
>
> Why we need to change the *points* value if we do not care about oom_badness
> ? Is it used to record some state? If so, we could record it through bpf
> map.
Strictly speaking we do not need to. That would require BPF to keep the
state internally. Many will do I suppose but we have to keep track of
the victim so that the oom killer knows what to kill so I thought that
it doesn't hurt to keep track of an abstract concept of points as well.
If you think this is not needed then oc->points could be always 0 for
bpf selected victims. The value is not used anyway in the proposed
scheme.
Btw. we will need another hook or metadata for the reporting side of
things. Generally, dump_header() needs to know what the selection policy
has been.
--
Michal Hocko
SUSE Labs
Hello,
On 2023/8/4 19:29, Michal Hocko wrote:
> On Fri 04-08-23 17:38:03, Chuyi Zhou wrote:
>> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
>> takes oc and current iterating task as parameters and returns a result
>> indicating which one is selected by bpf program.
>>
>> Although bpf_select_task is used to bypass the default method, there are
>> some existing rules should be obeyed. Specifically, we skip these
>> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
>> consider tasks with lowest score returned by oom_badness except it was
>> caused by OOM_SCORE_ADJ_MIN.
>
> Is this really necessary? I do get why we need to preserve
> OOM_SCORE_ADJ_* semantic for in-kernel oom selection logic but why
> should an arbitrary oom policy care. Look at it from an arbitrary user
> space based policy. It just picks a task or memcg and kills taks by
> sending SIG_KILL (or maybe SIG_TERM first) signal. oom_score constrains
> will not prevent anybody from doing that.
Sorry, some of my expressions may have misled you.
I do agree the BPF interface should bypass the current OOM_SCORE_ADJ_*
logic. What I meant to say is that BPF can select a task even if it has
OOM_SCORE_ADJ_MIN set.
>
> tsk_is_oom_victim (and MMF_OOM_SKIP) is a slightly different case but
> not too much. The primary motivation is to prevent new oom victims
> while there is one already being killed. This is a reasonable heuristic
> especially with the async oom reclaim (oom_reaper). It also reduces
> amount of oom emergency memory reserves to some degree but since those
> are not absolute this is no longer the primary motivation. _But_ I can
> imagine that some policies might be much more aggresive and allow to
> select new victims if preexisting are not being killed in time.
>
> oom_unkillable_task is a general sanity check so it should remain in
> place.
>
> I am not really sure about oom_task_origin. That is just a very weird
> case and I guess it wouldn't hurt to keep it in generic path.
>
> All that being said I think we want something like the following (very
> pseudo-code). I have no idea what is the proper way how to define BPF
> hooks though so a help from BPF maintainers would be more then handy
> ---
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 00982b133dc1..9f1743ee2b28 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -190,10 +190,6 @@ static inline bool trigger_all_cpu_backtrace(void)
> {
> return false;
> }
> -static inline bool trigger_allbutself_cpu_backtrace(void)
> -{
> - return false;
> -}
> static inline bool trigger_cpumask_backtrace(struct cpumask *mask)
> {
> return false;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 612b5597d3af..c9e04be52700 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -317,6 +317,22 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
> goto next;
>
> + /*
> + * If task is allocating a lot of memory and has been marked to be
> + * killed first if it triggers an oom, then select it.
> + */
> + if (oom_task_origin(task)) {
> + points = LONG_MAX;
> + goto select;
> + }
> +
> + switch (bpf_oom_evaluate_task(task, oc, &points)) {
> + case -EOPNOTSUPP: break; /* No BPF policy */
> + case -EBUSY: goto abort; /* abort search process */
> + case 0: goto next; /* ignore process */
> + default: goto select; /* note the task */
> + }
Why do we need to change the *points* value if we do not care about
oom_badness()? Is it used to record some state? If so, we could record it
through a BPF map.
> +
> /*
> * This task already has access to memory reserves and is being killed.
> * Don't allow any other task to have access to the reserves unless
> @@ -329,15 +345,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> goto abort;
> }
>
> - /*
> - * If task is allocating a lot of memory and has been marked to be
> - * killed first if it triggers an oom, then select it.
> - */
> - if (oom_task_origin(task)) {
> - points = LONG_MAX;
> - goto select;
> - }
> -
> points = oom_badness(task, oc->totalpages);
> if (points == LONG_MIN || points < oc->chosen_points)
> goto next;
Thanks for your advice; I'm very glad to follow your suggestions in the
next version.
Hello,
On 2023/8/4 19:34, Alan Maguire wrote:
> On 04/08/2023 10:38, Chuyi Zhou wrote:
>> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
>
> bpf_select_task() feels like too generic a name - bpf_oom_select_task()
> might make the context clearer.
>
> I'd also suggest adding a documentation patch for a new
> Documentation/bpf/oom.rst or whatever to describe how it is all supposed
> to work.
Got it, I will add it in the next version.
>
>> takes oc and current iterating task as parameters and returns a result
>> indicating which one is selected by bpf program.
>>
>> Although bpf_select_task is used to bypass the default method, there are
>> some existing rules should be obeyed. Specifically, we skip these
>> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
>> consider tasks with lowest score returned by oom_badness except it was
>> caused by OOM_SCORE_ADJ_MIN.
>>
>> If we attach a prog to the hook, the interface is enabled only when we have
>> successfully chosen at least one valid candidate in previous iteraion. This
>> is to avoid that we find nothing if bpf program rejects all tasks.
>>
>
> I don't know anything about OOM mechanisms, so maybe it's just me, but I
> found this confusing. Relying on the previous iteration to control
> current iteration behaviour seems risky - even if BPF found a victim in
> iteration N, it's no guarantee it will in iteration N+1.
>
The current kernel's OOM selection actually works like this:
1. When we first find a valid candidate victim A in iteration N, we record
it in oc->chosen.
2. In iterations N+1, N+2, ..., we just compare oc->chosen with the
currently iterated task. If we think the current task B is better than
oc->chosen (A), we set oc->chosen = B and never consider A again.
IIUC, most policies work like this. We just need to find the *most*
suitable victim. Normally, once we drop A and select B in the current
iteration, A is never considered again.
> Naively I would have thought the right answer here would be to honour
> the choice OOM would have made (in the absence of BPF execution) for
> cases where BPF did not select a victim. Is that sort of scheme
> workable? Does that make sense from the mm side, or would we actually
> want to fall back to
>
> pr_warn("Out of memory and no killable processes...\n");
>
> ...if BPF didn't select a process?
>
My major concern was whether we should fully trust the correctness of the
user's BPF program, since an OOM that finds nothing to kill may end in a
kernel panic. Actually, the current non-BPF mechanism is not guaranteed to
find a victim either (if the user sets inappropriate oom_score_adj values,
we may find nothing).
It seems both you and Michal think we should honour the default OOM logic
here and not add anything extra to prevent the BPF program from finding
nothing.
> The danger here seems to be that the current non-BPF mechanism seems to
> be guaranteed to find a chosen victim, but delegating to BPF is not. So
> what is the right behaviour for such cases from the mm perspective?
>
> (One thing that would probably be worth doing from the BPF side would be
> to add a tracepoint to mark the scenario where nothing was chosen for
> OOM kill via BPF; this would allow BPF programs to catch the fact that
> their OOM selection mechanisms didn't work.)
Nice idea. Maybe we could emit this tracepoint when victim selection
finishes? That way BPF programs could easily observe the selection result
even when we do successfully find something.
>
> Alan
>
>> Signed-off-by: Chuyi Zhou <[email protected]>
>> ---
>> mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 50 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 612b5597d3af..aec4c55ed49a 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -18,6 +18,7 @@
>> * kernel subsystems and hints as to where to find out what things do.
>> */
>>
>> +#include <linux/bpf.h>
>> #include <linux/oom.h>
>> #include <linux/mm.h>
>> #include <linux/err.h>
>> @@ -210,6 +211,16 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
>> if (!p)
>> return LONG_MIN;
>>
>> + /*
>> + * If task is allocating a lot of memory and has been marked to be
>> + * killed first if it triggers an oom, then set points to LONG_MAX.
>> + * It will be selected unless we keep oc->chosen through bpf interface.
>> + */
>> + if (oom_task_origin(p)) {
>> + task_unlock(p);
>> + return LONG_MAX;
>> + }
>> +
>> /*
>> * Do not even consider tasks which are explicitly marked oom
>> * unkillable or have been already oom reaped or the are in
>> @@ -305,8 +316,30 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>> return CONSTRAINT_NONE;
>> }
>>
>> +enum bpf_select_ret {
>> + BPF_SELECT_DISABLE,
>> + BPF_SELECT_TASK,
>> + BPF_SELECT_CHOSEN,
>> +};
>> +
>> +__weak noinline int bpf_select_task(struct oom_control *oc,
>> + struct task_struct *task, long badness_points)
>> +{
>> + return BPF_SELECT_DISABLE;
>> +}
>> +
>> +BTF_SET8_START(oom_bpf_fmodret_ids)
>> +BTF_ID_FLAGS(func, bpf_select_task)
>> +BTF_SET8_END(oom_bpf_fmodret_ids)
>> +
>> +static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
>> + .owner = THIS_MODULE,
>> + .set = &oom_bpf_fmodret_ids,
>> +};
>> +
>> static int oom_evaluate_task(struct task_struct *task, void *arg)
>> {
>> + enum bpf_select_ret bpf_ret = BPF_SELECT_DISABLE;
>> struct oom_control *oc = arg;
>> long points;
>>
>> @@ -329,17 +362,23 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>> goto abort;
>> }
>>
>> + points = oom_badness(task, oc->totalpages);
>> +
>> /*
>> - * If task is allocating a lot of memory and has been marked to be
>> - * killed first if it triggers an oom, then select it.
>> + * Do not consider tasks with lowest score value except it was caused
>> + * by OOM_SCORE_ADJ_MIN. Give these tasks a chance to be selected by
>> + * bpf interface.
>> */
>> - if (oom_task_origin(task)) {
>> - points = LONG_MAX;
>> + if (points == LONG_MIN && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
>> + goto next;
>> +
>> + if (oc->chosen)
>> + bpf_ret = bpf_select_task(oc, task, points);
>> +
>> + if (bpf_ret == BPF_SELECT_TASK)
>> goto select;
>> - }
>>
>> - points = oom_badness(task, oc->totalpages);
>> - if (points == LONG_MIN || points < oc->chosen_points)
>> + if (bpf_ret == BPF_SELECT_CHOSEN || points == LONG_MIN || points < oc->chosen_points)
>> goto next;
>>
>> select:
>> @@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
>>
>> static int __init oom_init(void)
>> {
>> + int err;
>> oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
>> #ifdef CONFIG_SYSCTL
>> register_sysctl_init("vm", vm_oom_kill_table);
>> #endif
>
> probably worth having #ifdef CONFIG_BPF or similar here..
I see. Thanks for the reminder.
register_btf_fmodret_id_set() is controlled by CONFIG_BPF_SYSCALL, so
we can add #ifdef CONFIG_BPF_SYSCALL here.
Thanks for your advice; I will follow your suggestions in the next
version.
>
>> + err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
>> + if (err)
>> + pr_warn("error while registering oom fmodret entrypoints: %d", err);
>> return 0;
>> }
>> subsys_initcall(oom_init)
On 2023/8/4 21:34, Michal Hocko wrote:
> On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> [...]
>>> + switch (bpf_oom_evaluate_task(task, oc, &points)) {
>>> + case -EOPNOTSUPP: break; /* No BPF policy */
>>> + case -EBUSY: goto abort; /* abort search process */
>>> + case 0: goto next; /* ignore process */
>>> + default: goto select; /* note the task */
>>> + }
>>
>> Why we need to change the *points* value if we do not care about oom_badness
>> ? Is it used to record some state? If so, we could record it through bpf
>> map.
>
> Strictly speaking we do not need to. That would require BPF to keep the
> state internally. Many will do I suppose but we have to keep track of
> the victim so that the oom killer knows what to kill so I thought that
> it doesn't hurt to keep track of an abstract concept of points as well.
> If you think this is not needed then oc->points could be always 0 for
> bpf selected victims. The value is not used anyway in the proposed
> scheme.
>
> Btw. we will need another hook or metadata for the reporting side of
> things. Generally dump_header() to know what has been the selection
> policy.
>
OK. Maybe an integer like policy_type is enough to distinguish
different policies, with the default method being zero. Or we can let
BPF return a string like policy_name.
Which one should I start implementing in the next version? Do you have
a better idea?
Thanks.
On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
>
>
> On 2023/8/4 21:34, Michal Hocko wrote:
> > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > [...]
> > > > + switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > + case -EOPNOTSUPP: break; /* No BPF policy */
> > > > + case -EBUSY: goto abort; /* abort search process */
> > > > + case 0: goto next; /* ignore process */
> > > > + default: goto select; /* note the task */
> > > > + }
> > >
> > > Why we need to change the *points* value if we do not care about oom_badness
> > > ? Is it used to record some state? If so, we could record it through bpf
> > > map.
> >
> > Strictly speaking we do not need to. That would require BPF to keep the
> > state internally. Many will do I suppose but we have to keep track of
> > the victim so that the oom killer knows what to kill so I thought that
> > it doesn't hurt to keep track of an abstract concept of points as well.
> > If you think this is not needed then oc->points could be always 0 for
> > bpf selected victims. The value is not used anyway in the proposed
> > scheme.
> >
> > Btw. we will need another hook or metadata for the reporting side of
> > things. Generally dump_header() to know what has been the selection
> > policy.
> >
> OK. Maybe a integer like policy_type is enough to distinguish different
> policies and the default method is zero. Or we can let BPF return a string
> like policy_name.
>
> Which one should I start implementing in next version? Do you have a better
> idea?
String seems to be more descriptive.
--
Michal Hocko
SUSE Labs
On Sat 05-08-23 07:55:56, Chuyi Zhou wrote:
> Hello,
>
> 在 2023/8/4 19:34, Alan Maguire 写道:
[...]
> > I don't know anything about OOM mechanisms, so maybe it's just me, but I
> > found this confusing. Relying on the previous iteration to control
> > current iteration behaviour seems risky - even if BPF found a victim in
> > iteration N, it's no guarantee it will in iteration N+1.
> >
> The current kernel's OOM actually works like this:
>
> 1. if we first find a valid candidate victim A in iteration N, we would
> record it in oc->chosen.
>
> 2. In iteration N + 1, N+2..., we just compare oc->chosen with the current
> iterating task. Suppose we think current task B is better than
> oc->chosen(A), we would set oc->chosen = B and we would not consider A
> anymore.
>
> IIUC, most policies work like this. We just need to find the *most*
> suitable victim. Normally, if in the current iteration we drop A and
> select B, we would not consider A anymore.
Yes, we iterate over all tasks in the specific oom domain (all tasks for
global and all members of memcg tree for hard limit oom). The in-tree
oom policy has to iterate all tasks to achieve some of its goals (like
preventing overkilling while the previously selected victim is still on
the way out). Also oom_score_adj might change the final decision so you
have to really check all eligible tasks.
I can imagine a BPF-based policy could be less constrained and, as Roman
suggested, have pre-selected victims on standby. I do not see a problem
with having a break-like mode, similar to the current abort, without
canceling an already noted victim.
--
Michal Hocko
SUSE Labs
On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> >
> >
> > On 2023/8/4 21:34, Michal Hocko wrote:
> > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > [...]
> > > > > + switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > + case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > + case -EBUSY: goto abort; /* abort search process */
> > > > > + case 0: goto next; /* ignore process */
> > > > > + default: goto select; /* note the task */
> > > > > + }
To be honest, I can't say I like it. IMO it's not really using the full bpf
potential and is too attached to the current oom implementation.
First, I'm a bit concerned about the implicit restrictions we apply to bpf
programs which will be executed potentially thousands of times under very
heavy memory pressure. We will need to make sure that they don't allocate
(much) memory, don't take any locks which might deadlock with other memory
allocations, etc. It will potentially require hard restrictions on what
these programs can and can't do, and this is something that the bpf
community will have to maintain long-term.
Second, if we're introducing bpf here (of which I'm not yet convinced),
IMO we should use it in a more generic and expressive way.
Instead of adding hooks into the existing oom killer implementation, we can call
a bpf program before invoking the in-kernel oom killer and let it do whatever
it takes to free some memory. E.g. we can provide it with an API to kill individual
tasks as well as all tasks in a cgroup.
This approach is more generic and will allow solving certain problems which
can't be solved by the current oom killer, e.g. deleting files from a tmpfs
instead of killing tasks.
So I think the alternative approach is to provide some sort of an interface to
pre-select oom victims in advance. E.g. on memcg level it can look like:
echo PID >> memory.oom.victim_proc
If the list is empty, the default oom killer is invoked.
If there are tasks, the first one is killed on OOM.
A similar interface can exist to choose between sibling cgroups:
echo CGROUP_NAME >> memory.oom.victim_cgroup
This is just a rough idea.
Thanks!
On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> > On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > >
> > >
> > > On 2023/8/4 21:34, Michal Hocko wrote:
> > > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > > [...]
> > > > > > + switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > > + case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > > + case -EBUSY: goto abort; /* abort search process */
> > > > > > + case 0: goto next; /* ignore process */
> > > > > > + default: goto select; /* note the task */
> > > > > > + }
>
> To be honest, I can't say I like it. IMO it's not really using the full bpf
> potential and is too attached to the current oom implementation.
TBH I am not sure we are able to come up with an interface that would
use the full BPF potential at this stage, and I strongly believe that we
should start with something that is good enough.
> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> which will be executed potentially thousands times under a very heavy memory
> pressure. We will need to make sure that they don't allocate (much) memory, don't
> take any locks which might deadlock with other memory allocations etc.
> It will potentially require hard restrictions on what these programs can and can't
> do and this is something that the bpf community will have to maintain long-term.
Right, BPF callbacks operating under OOM situations will be really
constrained, but this is more or less by definition, isn't it?
> Second, if we're introducing bpf here (which I'm not yet convinced),
> IMO we should use it in a more generic and expressive way.
> Instead of adding hooks into the existing oom killer implementation, we can call
> a bpf program before invoking the in-kernel oom killer and let it do whatever
> it takes to free some memory. E.g. we can provide it with an API to kill individual
> tasks as well as all tasks in a cgroup.
> This approach is more generic and will allow to solve certain problems which
> can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> instead of killing tasks.
The aim of this proposal is to lift any heavy lifting stemming from
iterating tasks or cgroups which those BPF programs might need to make a
decision. There are other ways, of course, such as providing this
iteration functionality as library functions, but my BPF experience is
too limited to say how easy that is.
> So I think the alternative approach is to provide some sort of an interface to
> pre-select oom victims in advance. E.g. on memcg level it can look like:
>
> echo PID >> memory.oom.victim_proc
this is just a terrible interface TBH. Pids are very volatile objects.
At the time the oom killer reads this pid, it might be a completely
different process.
> If the list is empty, the default oom killer is invoked.
> If there are tasks, the first one is killed on OOM.
> A similar interface can exist to choose between sibling cgroups:
>
> echo CGROUP_NAME >> memory.oom.victim_cgroup
Slightly less volatile but not much better either.
> This is just a rough idea.
I am pretty sure that both policies could be implemented by the proposed
BPF interface, though, if you want something like that.
--
Michal Hocko
SUSE Labs
On Tue, Aug 08, 2023 at 10:18:39AM +0200, Michal Hocko wrote:
> On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> > On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> > > On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > > >
> > > >
> > > > On 2023/8/4 21:34, Michal Hocko wrote:
> > > > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > > > [...]
> > > > > > > + switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > > > + case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > > > + case -EBUSY: goto abort; /* abort search process */
> > > > > > > + case 0: goto next; /* ignore process */
> > > > > > > + default: goto select; /* note the task */
> > > > > > > + }
> >
> > To be honest, I can't say I like it. IMO it's not really using the full bpf
> > potential and is too attached to the current oom implementation.
>
> TBH I am not sure we are able to come up with an interface that would
> ise the full BPF potential at this stage and I strongly believe that we
> should start by something that is good enough.
>
> > First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> > which will be executed potentially thousands times under a very heavy memory
> > pressure. We will need to make sure that they don't allocate (much) memory, don't
> > take any locks which might deadlock with other memory allocations etc.
> > It will potentially require hard restrictions on what these programs can and can't
> > do and this is something that the bpf community will have to maintain long-term.
>
> Right, BPF callbacks operating under OOM situations will be really
> constrained but this is more or less by definition. Isn't it?
What do you mean?
In general, the bpf community is trying to make it as generic as possible
and keeps adding more and more features. Bpf programs are not as
constrained as they were when it all started.
>
> > Second, if we're introducing bpf here (which I'm not yet convinced),
> > IMO we should use it in a more generic and expressive way.
> > Instead of adding hooks into the existing oom killer implementation, we can call
> > a bpf program before invoking the in-kernel oom killer and let it do whatever
> > it takes to free some memory. E.g. we can provide it with an API to kill individual
> > tasks as well as all tasks in a cgroup.
> > This approach is more generic and will allow to solve certain problems which
> > can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> > instead of killing tasks.
>
> The aim of this proposal is to lift any heavy lifting steming from
> iterating tasks or cgroups which those BPF might need to make a
> decision. There are other ways of course and provide this iteration
> functionality as library functions but my BPF experience is very limited
> to say how easy is that.
>
> > So I think the alternative approach is to provide some sort of an interface to
> > pre-select oom victims in advance. E.g. on memcg level it can look like:
> >
> > echo PID >> memory.oom.victim_proc
>
> this is just a terrible interface TBH. Pids are very volatile objects.
> At the time oom killer reads this pid it might be a completely different
> process.
Well, we already have the cgroup.procs interface, which works ok.
Obviously if the task is dead (or is actually killed as a result of oom),
its pid is removed from the list.
>
> > If the list is empty, the default oom killer is invoked.
> > If there are tasks, the first one is killed on OOM.
> > A similar interface can exist to choose between sibling cgroups:
> >
> > echo CGROUP_NAME >> memory.oom.victim_cgroup
>
> Slightly less volatile but not much better either.
>
> > This is just a rough idea.
>
> I am pretty sure that both policies could be implemetd by the proposed
> BPF interface though if you want something like that.
As I said, I'm pretty concerned about how reliable (and effective) it will be.
I'm not convinced that executing a generic bpf program from the oom context
is safe (and we're talking about executing it potentially thousands of times).
If we're going this way, we need an explicit acknowledgement from the bpf
community and a long-term agreement on how we'll keep things safe.
It would also be nice to come up with some practical examples of bpf
programs: what are the meaningful scenarios which can be covered by the
proposed approach and are not covered now by oom_score_adj?
Thanks!
On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
> On Tue, Aug 08, 2023 at 10:18:39AM +0200, Michal Hocko wrote:
> > On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> > > On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> > > > On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > > > >
> > > > >
> > > > > On 2023/8/4 21:34, Michal Hocko wrote:
> > > > > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > > > > [...]
> > > > > > > > + switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > > > > + case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > > > > + case -EBUSY: goto abort; /* abort search process */
> > > > > > > > + case 0: goto next; /* ignore process */
> > > > > > > > + default: goto select; /* note the task */
> > > > > > > > + }
> > >
> > > To be honest, I can't say I like it. IMO it's not really using the full bpf
> > > potential and is too attached to the current oom implementation.
> >
> > TBH I am not sure we are able to come up with an interface that would
> > ise the full BPF potential at this stage and I strongly believe that we
> > should start by something that is good enough.
> >
> > > First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> > > which will be executed potentially thousands times under a very heavy memory
> > > pressure. We will need to make sure that they don't allocate (much) memory, don't
> > > take any locks which might deadlock with other memory allocations etc.
> > > It will potentially require hard restrictions on what these programs can and can't
> > > do and this is something that the bpf community will have to maintain long-term.
> >
> > Right, BPF callbacks operating under OOM situations will be really
> > constrained but this is more or less by definition. Isn't it?
>
> What do you mean?
Callbacks cannot depend on any direct or indirect memory allocations.
Dependencies on any sleeping locks (again, direct or indirect) are not
allowed, just to name the most important ones.
> In general, the bpf community is trying to make it as generic as possible and
> adding new and new features. Bpf programs are not as constrained as they were
> when it's all started.
Are the above ones somehow carved into BPF in general?
> > > Second, if we're introducing bpf here (which I'm not yet convinced),
> > > IMO we should use it in a more generic and expressive way.
> > > Instead of adding hooks into the existing oom killer implementation, we can call
> > > a bpf program before invoking the in-kernel oom killer and let it do whatever
> > > it takes to free some memory. E.g. we can provide it with an API to kill individual
> > > tasks as well as all tasks in a cgroup.
> > > This approach is more generic and will allow to solve certain problems which
> > > can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> > > instead of killing tasks.
> >
> > The aim of this proposal is to lift any heavy lifting steming from
> > iterating tasks or cgroups which those BPF might need to make a
> > decision. There are other ways of course and provide this iteration
> > functionality as library functions but my BPF experience is very limited
> > to say how easy is that.
> >
> > > So I think the alternative approach is to provide some sort of an interface to
> > > pre-select oom victims in advance. E.g. on memcg level it can look like:
> > >
> > > echo PID >> memory.oom.victim_proc
> >
> > this is just a terrible interface TBH. Pids are very volatile objects.
> > At the time oom killer reads this pid it might be a completely different
> > process.
>
> Well, we already have cgroup.procs interface, which works ok.
> Obviously if the task is dead (or is actually killed in a result of oom),
> it's pid is removed from the list.
Right, but writing the pid into the file has an immediate effect, and
recycled-pid issues would be rare unless the pid space is mostly
depleted. You are proposing an interface where the pid would be consumed
in a potentially very distant future. Such an approach would only work if
the pid is auto-removed, and then you need a notification mechanism to
replace it with something else.
> > > If the list is empty, the default oom killer is invoked.
> > > If there are tasks, the first one is killed on OOM.
> > > A similar interface can exist to choose between sibling cgroups:
> > >
> > > echo CGROUP_NAME >> memory.oom.victim_cgroup
> >
> > Slightly less volatile but not much better either.
> >
> > > This is just a rough idea.
> >
> > I am pretty sure that both policies could be implemetd by the proposed
> > BPF interface though if you want something like that.
>
> As I said, I'm pretty concerned about how reliable (and effective) it will be.
> I'm not convinced that executing a generic bpf program from the oom context
> is safe (and we're talking about executing it potentially thousands of times).
> If we're going this way, we need an explicit acknowledge from the bpf
> community and a long-term agreement on how we'll keep thing safe.
I do agree with that.
> It would be also nice to come up with some practical examples of bpf programs.
> What are meaningful scenarios which can be covered with the proposed approach
> and are not covered now with oom_score_adj.
Agreed here as well. This RFC serves the purpose of brainstorming on all
of this.
There is a fundamental question whether we need BPF for this task in the
first place. Are there any huge advantages over exporting the callback
and allowing a kernel module to hook into it?
--
Michal Hocko
SUSE Labs
On 8/9/23 3:53 PM, Michal Hocko wrote:
> On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
>> It would be also nice to come up with some practical examples of bpf programs.
>> What are meaningful scenarios which can be covered with the proposed approach
>> and are not covered now with oom_score_adj.
>
> Agreed here as well. This RFC serves purpose of brainstorming on all of
> this.
>
> There is a fundamental question whether we need BPF for this task in the
> first place. Are there any huge advantages to export the callback and
> allow a kernel module to hook into it?
The ancient oom-killer largely depends on memory usage when choosing
victims, which might not fit the needs of modern scenarios. It's common
nowadays that multiple workloads (tenants) with different 'priorities'
run together, and the decisions made by the oom-killer don't always
obey the service level agreements.
The oom_score_adj only adjusts the usage-based decisions, so it can
hardly be translated into a 'priority' semantic. How can we properly
configure it given that we don't know how much memory the workloads
will use? It's really hard for a static strategy to deal with dynamic
provisioning. IMHO the oom_score_adj is just another demon.
Reworking the oom-killer's internal algorithm or patching some random
metrics may satisfy the immediate needs, but for the next 10 years? I
doubt it. So I think we do need the flexibility to bypass the legacy
usage-based algorithm, through bpf or pre-select interfaces.
Regards,
Abel
>>>> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
>>>> which will be executed potentially thousands times under a very heavy memory
>>>> pressure. We will need to make sure that they don't allocate (much) memory, don't
>>>> take any locks which might deadlock with other memory allocations etc.
>>>> It will potentially require hard restrictions on what these programs can and can't
>>>> do and this is something that the bpf community will have to maintain long-term.
>>>
>>> Right, BPF callbacks operating under OOM situations will be really
>>> constrained but this is more or less by definition. Isn't it?
>>
>> What do you mean?
>
> Callbacks cannot depend on any direct or indirect memory allocations.
> Dependencies on any sleeping locks (again directly or indirectly) is not
> allowed just to name the most important ones.
>
>> In general, the bpf community is trying to make it as generic as possible and
>> adding new and new features. Bpf programs are not as constrained as they were
>> when it's all started.
bpf supports different running contexts. For example, only a
non-sleepable bpf prog is allowed to run at the NIC driver. A sleepable
bpf prog is only allowed to run at some bpf_lsm hooks that are known to
be safe for calling blocking bpf-helpers/kfuncs. From the bpf side, it
ensures a non-sleepable bpf prog cannot do things that may block.
fwiw, Dave has recently proposed something for iterating the task vma
(https://lore.kernel.org/bpf/[email protected]/).
Potentially, a similar iterator can be created for a bpf program to iterate
cgroups and tasks.
Hello,
On 2023/8/9 15:53, Michal Hocko wrote:
> On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
>> On Tue, Aug 08, 2023 at 10:18:39AM +0200, Michal Hocko wrote:
>>> On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
>>>> On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
>>>>> On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
>>>>>>
>>>>>>
>>>>>> On 2023/8/4 21:34, Michal Hocko wrote:
>>>>>>> On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
>>>>>>> [...]
>>>>>>>>> + switch (bpf_oom_evaluate_task(task, oc, &points)) {
>>>>>>>>> + case -EOPNOTSUPP: break; /* No BPF policy */
>>>>>>>>> + case -EBUSY: goto abort; /* abort search process */
>>>>>>>>> + case 0: goto next; /* ignore process */
>>>>>>>>> + default: goto select; /* note the task */
>>>>>>>>> + }
>>>>
>>>> To be honest, I can't say I like it. IMO it's not really using the full bpf
>>>> potential and is too attached to the current oom implementation.
>>>
>>> TBH I am not sure we are able to come up with an interface that would
>>> ise the full BPF potential at this stage and I strongly believe that we
>>> should start by something that is good enough.
>>>
>>>> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
>>>> which will be executed potentially thousands times under a very heavy memory
>>>> pressure. We will need to make sure that they don't allocate (much) memory, don't
>>>> take any locks which might deadlock with other memory allocations etc.
>>>> It will potentially require hard restrictions on what these programs can and can't
>>>> do and this is something that the bpf community will have to maintain long-term.
>>>
>>> Right, BPF callbacks operating under OOM situations will be really
>>> constrained but this is more or less by definition. Isn't it?
>>
>> What do you mean?
>
> Callbacks cannot depend on any direct or indirect memory allocations.
> Dependencies on any sleeping locks (again directly or indirectly) is not
> allowed just to name the most important ones.
>
>> In general, the bpf community is trying to make it as generic as possible and
>> adding new and new features. Bpf programs are not as constrained as they were
>> when it's all started.
>
> Are the above ones somehow carved into BPF in general?
>
>>>> Second, if we're introducing bpf here (which I'm not yet convinced),
>>>> IMO we should use it in a more generic and expressive way.
>>>> Instead of adding hooks into the existing oom killer implementation, we can call
>>>> a bpf program before invoking the in-kernel oom killer and let it do whatever
>>>> it takes to free some memory. E.g. we can provide it with an API to kill individual
>>>> tasks as well as all tasks in a cgroup.
>>>> This approach is more generic and will allow to solve certain problems which
>>>> can't be solved by the current oom killer, e.g. deleting files from a tmpfs
>>>> instead of killing tasks.
>>>
>>> The aim of this proposal is to lift any heavy lifting steming from
>>> iterating tasks or cgroups which those BPF might need to make a
>>> decision. There are other ways of course and provide this iteration
>>> functionality as library functions but my BPF experience is very limited
>>> to say how easy is that.
>>>
>>>> So I think the alternative approach is to provide some sort of an interface to
>>>> pre-select oom victims in advance. E.g. on memcg level it can look like:
>>>>
>>>> echo PID >> memory.oom.victim_proc
>>>
>>> this is just a terrible interface TBH. Pids are very volatile objects.
>>> At the time oom killer reads this pid it might be a completely different
>>> process.
>>
>> Well, we already have cgroup.procs interface, which works ok.
>> Obviously if the task is dead (or is actually killed in a result of oom),
>> it's pid is removed from the list.
>
> Right, but writing the pid into the file has an immediate effect and
> recycle pid issues would be rare unless the pid space is mostly
> depleted. You are proposing an interface where the pid would be consumed
> in potentially very distant future. Such an approach would only work if
> the pid is auto-removed and then you need a notification mechanism to
> replace it by something else.
>
>>>> If the list is empty, the default oom killer is invoked.
>>>> If there are tasks, the first one is killed on OOM.
>>>> A similar interface can exist to choose between sibling cgroups:
>>>>
>>>> echo CGROUP_NAME >> memory.oom.victim_cgroup
>>>
>>> Slightly less volatile but not much better either.
>>>
>>>> This is just a rough idea.
>>>
>>> I am pretty sure that both policies could be implemetd by the proposed
>>> BPF interface though if you want something like that.
>>
>> As I said, I'm pretty concerned about how reliable (and effective) it will be.
>> I'm not convinced that executing a generic bpf program from the oom context
>> is safe (and we're talking about executing it potentially thousands of times).
>> If we're going this way, we need an explicit acknowledge from the bpf
>> community and a long-term agreement on how we'll keep thing safe.
>
> I do agree with that.
>
>> It would be also nice to come up with some practical examples of bpf programs.
>> What are meaningful scenarios which can be covered with the proposed approach
>> and are not covered now with oom_score_adj.
>
Just like Abel said, oom_score_adj only adjusts the memory usage-based
decisions, and it is hard to translate it into other semantics. We have
seen that some userspace oom-killers like oomd implement policies based
on other semantics (e.g., memory growth, priority, psi pressure, etc.)
which can be useful in specific scenarios.
> Agreed here as well. This RFC serves purpose of brainstorming on all of
> this.
>
> There is a fundamental question whether we need BPF for this task in the
> first place. Are there any huge advantages to export the callback and
> allow a kernel module to hook into it?
If we export the callback to a kernel module and hook into it, we still
have the same problems (e.g., allocating much memory). Just like Martin
said, at least BPF supports some notion of running context, and some
unsafe behavior is restricted.