2023-08-10 08:59:11

by Chuyi Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 0/5] mm: Select victim using bpf_oom_evaluate_task

Changes
-------

This is v2 of the BPF OOM policy patchset.
v1 : https://lore.kernel.org/lkml/[email protected]/
v1 -> v2 changes:

- rename bpf_select_task to bpf_oom_evaluate_task and bypass the
tsk_is_oom_victim (and MMF_OOM_SKIP) logic. (Michal)

- add a new hook to set policy's name, so dump_header() can know
what has been the selection policy when reporting messages. (Michal)

- add a tracepoint when select_bad_process() find nothing. (Alan)

- add a doc to to describe how it is all supposed to work. (Alan)

================

This patchset adds a new interface and use it to select victim when OOM
is invoked. The mainly motivation is the need to customizable OOM victim
selection functionality.

The new interface is a bpf hook plugged in oom_evaluate_task. It takes oc
and current task as parameters and return a result indicating which one is
selected by the attached bpf program.

There are several conserns when designing this interface suggested by
Michal:

1. Hooking into oom_evaluate_task can keep the consistency of global and
memcg OOM interface. Besides, it seems the least disruptive to the existing
oom killer implementation.

2. Userspace can handle a lot on its own and provide the input to the BPF
program to make a decision. Since the oom scope iteration will be
implemented already in the kernel so all the BPF program has to do is to
rank processes or memcgs.

3. The new interface should better bypass the current heuristic rules
(e.g., tsk_is_oom_victim, and MMF_OOM_SKIP) to meet an arbitrary oom
policy's need.

Chuyi Zhou (5):
mm, oom: Introduce bpf_oom_evaluate_task
mm: Add policy_name to identify OOM policies
mm: Add a tracepoint when OOM victim selection is failed
bpf: Add a OOM policy test
bpf: Add a BPF OOM policy Doc

Documentation/bpf/oom.rst | 70 +++++++++
include/linux/oom.h | 7 +
include/trace/events/oom.h | 18 +++
mm/oom_kill.c | 100 +++++++++++--
.../bpf/prog_tests/test_oom_policy.c | 140 ++++++++++++++++++
.../testing/selftests/bpf/progs/oom_policy.c | 104 +++++++++++++
6 files changed, 428 insertions(+), 11 deletions(-)
create mode 100644 Documentation/bpf/oom.rst
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom_policy.c
create mode 100644 tools/testing/selftests/bpf/progs/oom_policy.c

--
2.20.1



2023-08-10 09:37:18

by Chuyi Zhou

[permalink] [raw]
Subject: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task

This patch adds a new hook bpf_oom_evaluate_task in oom_evaluate_task. It
takes oc and current iterating task as parameters and returns a result
indicating which one should be selected. We can use it to bypass the
current logic of oom_evaluate_task and implement customized OOM policies
in the attached BPF progams.

Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Chuyi Zhou <[email protected]>
---
mm/oom_kill.c | 59 +++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 50 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..255c9ef1d808 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@
* kernel subsystems and hints as to where to find out what things do.
*/

+#include <linux/bpf.h>
#include <linux/oom.h>
#include <linux/mm.h>
#include <linux/err.h>
@@ -305,6 +306,27 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}

+enum {
+ NO_BPF_POLICY,
+ BPF_EVAL_ABORT,
+ BPF_EVAL_NEXT,
+ BPF_EVAL_SELECT,
+};
+
+__weak noinline int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc)
+{
+ return NO_BPF_POLICY;
+}
+
+BTF_SET8_START(oom_bpf_fmodret_ids)
+BTF_ID_FLAGS(func, bpf_oom_evaluate_task)
+BTF_SET8_END(oom_bpf_fmodret_ids)
+
+static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
+ .owner = THIS_MODULE,
+ .set = &oom_bpf_fmodret_ids,
+};
+
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
@@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
goto next;

+ /*
+ * If task is allocating a lot of memory and has been marked to be
+ * killed first if it triggers an oom, then select it.
+ */
+ if (oom_task_origin(task)) {
+ points = LONG_MAX;
+ goto select;
+ }
+
+ switch (bpf_oom_evaluate_task(task, oc)) {
+ case BPF_EVAL_ABORT:
+ goto abort; /* abort search process */
+ case BPF_EVAL_NEXT:
+ goto next; /* ignore the task */
+ case BPF_EVAL_SELECT:
+ goto select; /* select the task */
+ default:
+ break; /* No BPF policy */
+ }
+
/*
* This task already has access to memory reserves and is being killed.
* Don't allow any other task to have access to the reserves unless
@@ -329,15 +371,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto abort;
}

- /*
- * If task is allocating a lot of memory and has been marked to be
- * killed first if it triggers an oom, then select it.
- */
- if (oom_task_origin(task)) {
- points = LONG_MAX;
- goto select;
- }
-
points = oom_badness(task, oc->totalpages);
if (points == LONG_MIN || points < oc->chosen_points)
goto next;
@@ -732,10 +765,18 @@ static struct ctl_table vm_oom_kill_table[] = {

static int __init oom_init(void)
{
+ int err;
oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
#ifdef CONFIG_SYSCTL
register_sysctl_init("vm", vm_oom_kill_table);
#endif
+
+#ifdef CONFIG_BPF_SYSCALL
+ err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
+ if (err)
+ pr_warn("error while registering oom fmodret entrypoints: %d", err);
+#endif
+
return 0;
}
subsys_initcall(oom_init)
--
2.20.1


2023-08-19 16:58:10

by Chuyi Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task

Hello,

在 2023/8/17 10:07, Alexei Starovoitov 写道:
> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <[email protected]> wrote:
>> static int oom_evaluate_task(struct task_struct *task, void *arg)
>> {
>> struct oom_control *oc = arg;
>> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
>> goto next;
>>
>> + /*
>> + * If task is allocating a lot of memory and has been marked to be
>> + * killed first if it triggers an oom, then select it.
>> + */
>> + if (oom_task_origin(task)) {
>> + points = LONG_MAX;
>> + goto select;
>> + }
>> +
>> + switch (bpf_oom_evaluate_task(task, oc)) {
>> + case BPF_EVAL_ABORT:
>> + goto abort; /* abort search process */
>> + case BPF_EVAL_NEXT:
>> + goto next; /* ignore the task */
>> + case BPF_EVAL_SELECT:
>> + goto select; /* select the task */
>> + default:
>> + break; /* No BPF policy */
>> + }
>> +
>
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

IIUC, We may need some new bpf features if we want to iterating
tasks/memcg in BPF, sush as:
bpf_for_each_task
bpf_for_each_memcg
bpf_for_each_task_in_memcg
...

It seems we have some work to do first in the BPF side.
Will these iterating features be useful in other BPF scenario except OOM
Policy?

Thanks.

2023-09-13 09:27:24

by Chuyi Zhou

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task

Hello, Bixuan.

在 2023/9/13 09:18, Bixuan Cui 写道:
>
>
> 在 2023/8/10 16:13, Chuyi Zhou 写道:
>> +#include <linux/bpf.h> #include <linux/oom.h> #include <linux/mm.h>
>> #include <linux/err.h> @@ -305,6 +306,27 @@ static enum oom_constraint
>> constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; }
>> +enum { + NO_BPF_POLICY, + BPF_EVAL_ABORT, + BPF_EVAL_NEXT, +
>> BPF_EVAL_SELECT, +}; +
>
> I saw that tools/testing/selftests/bpf/progs/oom_policy.c is also using
> NO_BPF_POLICY etc. I think
> +enum {
> +    NO_BPF_POLICY,
> +    BPF_EVAL_ABORT,
> +    BPF_EVAL_NEXT,
> +    BPF_EVAL_SELECT,
> +};
> +
> definitions can be placed in include/linux/oom.h
>

Thanks for your feedback!

Yes, Maybe we should move these enums to a more proper place so that
they can be generated by BTF and we can take them from vmlinux.h.

> Thanks
> Bixuan Cui

2023-09-13 09:47:10

by Bixuan Cui

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task



在 2023/8/10 16:13, Chuyi Zhou 写道:
> +#include <linux/bpf.h> #include <linux/oom.h> #include <linux/mm.h>
> #include <linux/err.h> @@ -305,6 +306,27 @@ static enum oom_constraint
> constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; }
> +enum { + NO_BPF_POLICY, + BPF_EVAL_ABORT, + BPF_EVAL_NEXT, +
> BPF_EVAL_SELECT, +}; +

I saw that tools/testing/selftests/bpf/progs/oom_policy.c is also using
NO_BPF_POLICY etc. I think
+enum {
+ NO_BPF_POLICY,
+ BPF_EVAL_ABORT,
+ BPF_EVAL_NEXT,
+ BPF_EVAL_SELECT,
+};
+
definitions can be placed in include/linux/oom.h

Thanks
Bixuan Cui

2023-09-13 13:31:02

by Bixuan Cui

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task



在 2023/8/10 16:13, Chuyi Zhou 写道:
> + +__weak noinline int bpf_oom_evaluate_task(struct task_struct *task,
> struct oom_control *oc) +{ + return NO_BPF_POLICY; +} +
> +BTF_SET8_START(oom_bpf_fmodret_ids) +BTF_ID_FLAGS(func,
> bpf_oom_evaluate_task) +BTF_SET8_END(oom_bpf_fmodret_ids)
I have a question here, why use __weak? Is there other modules that can
redefine bpf_oom_evaluate_task? why not use __bpf_kfunc
(Documentation/bpf/kfuncs.rst) ?

Thanks
Bixuan Cui