2021-09-16 16:36:14

by Roman Gushchin

Subject: [PATCH rfc 0/6] Scheduler BPF

There is a long history of distro people, system administrators, and
application owners tuning the CFS settings in /proc/sys, which are now
in debugfs. Looking at what these settings actually did, it turned out
that they boiled down to changing the likelihood of task preemption, or
disabling it by setting wakeup_granularity_ns to more than half of
latency_ns. The other settings didn't really do much for performance.

In other words, some of our workloads benefit from having long-running
tasks preempted by tasks handling short-running requests, while workloads
that run only short requests benefit from never being preempted.

This leads to a few observations and ideas:
- Different workloads want different policies. Being able to configure
the policy per workload could be useful.
- A workload that benefits from not being preempted itself could still
benefit from preempting (low priority) background system tasks.
- It would be useful to quickly (and safely) experiment with different
policies in production, without having to shut down applications or reboot
systems, to determine what the policies for different workloads should be.
- Only a few workloads are large and sensitive enough to merit their own
policy tweaks. CFS by itself should be good enough for everything else,
and we probably do not want policy tweaks to be a replacement for anything
CFS does.

This leads to BPF hooks, which have been successfully used in various
kernel subsystems to provide a way for external code to (safely)
change a few kernel decisions. BPF tooling makes this pretty easy to do,
and the people deploying BPF scripts are already quite used to updating them
for new kernel versions.

This patchset aims to start a discussion about potential applications of BPF
to the scheduler. It also aims to land some very basic BPF infrastructure
necessary to add new BPF hooks to the scheduler, a minimal set of useful
helpers, corresponding libbpf changes, etc.

Our very first experiments with using BPF in CFS look very promising. We're
at a very early stage, but we have already seen nice latency and ~1% RPS
wins for our (Facebook's) main web workload.

As far as I know, Google is working on a more radical approach [2]: they aim
to move the scheduling code into userspace. It seems that their core motivation
is somewhat similar: to make scheduler changes easier to develop, validate
and deploy. Even though their approach is different, they also use BPF for
speeding up some hot paths. I think the suggested infrastructure can serve
their purpose too.

An example of the userspace part, which loads some simple hooks, is available
here [3]. It's very simple and provided only to simplify experimenting with
the provided kernel patches.


[1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
[2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/
[3] https://github.com/rgushchin/atc


Roman Gushchin (6):
bpf: sched: basic infrastructure for scheduler bpf
bpf: sched: add convenient helpers to identify sched entities
bpf: sched: introduce bpf_sched_enable()
sched: cfs: add bpf hooks to control wakeup and tick preemption
libbpf: add support for scheduler bpf programs
bpftool: recognize scheduler programs

include/linux/bpf_sched.h | 53 ++++++++++++
include/linux/bpf_types.h | 3 +
include/linux/sched_hook_defs.h | 4 +
include/uapi/linux/bpf.h | 25 ++++++
kernel/bpf/btf.c | 1 +
kernel/bpf/syscall.c | 21 ++++-
kernel/bpf/trampoline.c | 1 +
kernel/bpf/verifier.c | 9 ++-
kernel/sched/Makefile | 1 +
kernel/sched/bpf_sched.c | 138 ++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 27 +++++++
scripts/bpf_doc.py | 2 +
tools/bpf/bpftool/common.c | 1 +
tools/bpf/bpftool/prog.c | 1 +
tools/include/uapi/linux/bpf.h | 25 ++++++
tools/lib/bpf/libbpf.c | 27 ++++++-
tools/lib/bpf/libbpf.h | 4 +
tools/lib/bpf/libbpf.map | 3 +
18 files changed, 341 insertions(+), 5 deletions(-)
create mode 100644 include/linux/bpf_sched.h
create mode 100644 include/linux/sched_hook_defs.h
create mode 100644 kernel/sched/bpf_sched.c

--
2.31.1


2021-09-16 16:36:46

by Roman Gushchin

Subject: [PATCH rfc 3/6] bpf: sched: introduce bpf_sched_enable()

Introduce a dedicated static key and the bpf_sched_enabled() wrapper
to guard all invocations of bpf programs in the scheduler code.

This helps to avoid any potential performance regression when no
scheduler bpf programs are attached.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/bpf_sched.h | 24 ++++++++++++++++++++++++
kernel/bpf/syscall.c | 7 +++++++
kernel/sched/bpf_sched.c | 2 ++
3 files changed, 33 insertions(+)

diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
index 0f8d3dae53df..6e773aecdff7 100644
--- a/include/linux/bpf_sched.h
+++ b/include/linux/bpf_sched.h
@@ -6,6 +6,8 @@

#ifdef CONFIG_BPF_SYSCALL

+#include <linux/jump_label.h>
+
#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) \
RET bpf_sched_##NAME(__VA_ARGS__);
#include <linux/sched_hook_defs.h>
@@ -14,6 +16,23 @@
int bpf_sched_verify_prog(struct bpf_verifier_log *vlog,
const struct bpf_prog *prog);

+DECLARE_STATIC_KEY_FALSE(bpf_sched_enabled_key);
+
+static inline bool bpf_sched_enabled(void)
+{
+ return static_branch_unlikely(&bpf_sched_enabled_key);
+}
+
+static inline void bpf_sched_inc(void)
+{
+ static_branch_inc(&bpf_sched_enabled_key);
+}
+
+static inline void bpf_sched_dec(void)
+{
+ static_branch_dec(&bpf_sched_enabled_key);
+}
+
#else /* CONFIG_BPF_SYSCALL */

#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) \
@@ -23,6 +42,11 @@ static inline RET bpf_sched_##NAME(__VA_ARGS__) \
}
#undef BPF_SCHED_HOOK

+static inline bool bpf_sched_enabled(void)
+{
+ return false;
+}
+
#endif /* CONFIG_BPF_SYSCALL */

#endif /* _BPF_SCHED_H */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 67e062376f22..aa5565110498 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -31,6 +31,7 @@
#include <linux/bpf-netns.h>
#include <linux/rcupdate_trace.h>
#include <linux/memcontrol.h>
+#include <linux/bpf_sched.h>

#define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
(map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -2602,6 +2603,9 @@ static void bpf_tracing_link_release(struct bpf_link *link)
struct bpf_tracing_link *tr_link =
container_of(link, struct bpf_tracing_link, link);

+ if (link->prog->type == BPF_PROG_TYPE_SCHED)
+ bpf_sched_dec();
+
WARN_ON_ONCE(bpf_trampoline_unlink_prog(link->prog,
tr_link->trampoline));

@@ -2804,6 +2808,9 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog,
goto out_unlock;
}

+ if (prog->type == BPF_PROG_TYPE_SCHED)
+ bpf_sched_inc();
+
link->tgt_prog = tgt_prog;
link->trampoline = tr;

diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c
index ead691dc6e85..bf92cfb5ecf4 100644
--- a/kernel/sched/bpf_sched.c
+++ b/kernel/sched/bpf_sched.c
@@ -6,6 +6,8 @@
#include <linux/btf_ids.h>
#include "sched.h"

+DEFINE_STATIC_KEY_FALSE(bpf_sched_enabled_key);
+
/*
* For every hook declare a nop function where a BPF program can be attached.
*/
--
2.31.1

2021-09-16 16:37:47

by Roman Gushchin

Subject: [PATCH rfc 1/6] bpf: sched: basic infrastructure for scheduler bpf

This commit introduces basic definitions and infrastructure for
scheduler bpf programs. It defines the BPF_PROG_TYPE_SCHED program
type and the BPF_SCHED attachment type.

The implementation is inspired by lsm bpf programs and is based on
kretprobes. This allows adding new hooks with minimal changes to the
kernel code and without any changes to libbpf/bpftool.
It's very convenient, as I anticipate a large number of private patches
being used for a long time before (or if at all) reaching upstream.

Sched programs are expected to return an int, whose meaning is
context-defined.

This patch doesn't add any real scheduler hooks (only a stub); those
will be added by the following patches in the series.

For now, scheduler bpf programs are very restricted in what they can do:
only the bpf_printk() helper is available. The scheduler context can
impose significant restrictions on what's safe and what's not, so let's
extend their abilities on a case-by-case basis when a need arises.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/bpf_sched.h | 28 +++++++++++++++
include/linux/bpf_types.h | 3 ++
include/linux/sched_hook_defs.h | 2 ++
include/uapi/linux/bpf.h | 2 ++
kernel/bpf/btf.c | 1 +
kernel/bpf/syscall.c | 14 ++++++--
kernel/bpf/trampoline.c | 1 +
kernel/bpf/verifier.c | 9 ++++-
kernel/sched/Makefile | 1 +
kernel/sched/bpf_sched.c | 62 +++++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 2 ++
11 files changed, 122 insertions(+), 3 deletions(-)
create mode 100644 include/linux/bpf_sched.h
create mode 100644 include/linux/sched_hook_defs.h
create mode 100644 kernel/sched/bpf_sched.c

diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
new file mode 100644
index 000000000000..0f8d3dae53df
--- /dev/null
+++ b/include/linux/bpf_sched.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BPF_SCHED_H
+#define _BPF_SCHED_H
+
+#include <linux/bpf.h>
+
+#ifdef CONFIG_BPF_SYSCALL
+
+#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) \
+ RET bpf_sched_##NAME(__VA_ARGS__);
+#include <linux/sched_hook_defs.h>
+#undef BPF_SCHED_HOOK
+
+int bpf_sched_verify_prog(struct bpf_verifier_log *vlog,
+ const struct bpf_prog *prog);
+
+#else /* CONFIG_BPF_SYSCALL */
+
+#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) \
+static inline RET bpf_sched_##NAME(__VA_ARGS__) \
+{ \
+ return DEFAULT; \
+}
+#undef BPF_SCHED_HOOK
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* _BPF_SCHED_H */
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 9c81724e4b98..ed6aac4368c0 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -79,6 +79,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
#endif
BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
void *, void *)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED, bpf_sched,
+ void *, void *)
+

BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
new file mode 100644
index 000000000000..14344004e335
--- /dev/null
+++ b/include/linux/sched_hook_defs.h
@@ -0,0 +1,2 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+BPF_SCHED_HOOK(int, 0, dummy, void)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d21326558d42..6dfbebb8fc8f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -949,6 +949,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LSM,
BPF_PROG_TYPE_SK_LOOKUP,
BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+ BPF_PROG_TYPE_SCHED,
};

enum bpf_attach_type {
@@ -994,6 +995,7 @@ enum bpf_attach_type {
BPF_SK_REUSEPORT_SELECT,
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
BPF_PERF_EVENT,
+ BPF_SCHED,
__MAX_BPF_ATTACH_TYPE
};

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index c3d605b22473..fb46e447a062 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -4884,6 +4884,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
return true;
t = btf_type_by_id(btf, t->type);
break;
+ case BPF_SCHED:
case BPF_MODIFY_RETURN:
/* For now the BPF_MODIFY_RETURN can only be attached to
* functions that return an int.
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e50c0bfdb7d..67e062376f22 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2026,6 +2026,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
case BPF_PROG_TYPE_LSM:
case BPF_PROG_TYPE_STRUCT_OPS:
case BPF_PROG_TYPE_EXT:
+ case BPF_PROG_TYPE_SCHED:
break;
default:
return -EINVAL;
@@ -2149,6 +2150,7 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
case BPF_PROG_TYPE_LSM:
case BPF_PROG_TYPE_STRUCT_OPS: /* has access to struct sock */
case BPF_PROG_TYPE_EXT: /* extends any prog */
+ case BPF_PROG_TYPE_SCHED:
return true;
default:
return false;
@@ -2682,6 +2684,12 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog,
goto out_put_prog;
}
break;
+ case BPF_PROG_TYPE_SCHED:
+ if (prog->expected_attach_type != BPF_SCHED) {
+ err = -EINVAL;
+ goto out_put_prog;
+ }
+ break;
default:
err = -EINVAL;
goto out_put_prog;
@@ -2740,13 +2748,14 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog,
*/
if (!prog->aux->dst_trampoline && !tgt_prog) {
/*
- * Allow re-attach for TRACING and LSM programs. If it's
+ * Allow re-attach for TRACING, LSM and SCHED programs. If it's
* currently linked, bpf_trampoline_link_prog will fail.
* EXT programs need to specify tgt_prog_fd, so they
* re-attach in separate code path.
*/
if (prog->type != BPF_PROG_TYPE_TRACING &&
- prog->type != BPF_PROG_TYPE_LSM) {
+ prog->type != BPF_PROG_TYPE_LSM &&
+ prog->type != BPF_PROG_TYPE_SCHED) {
err = -EINVAL;
goto out_unlock;
}
@@ -2996,6 +3005,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
case BPF_PROG_TYPE_TRACING:
case BPF_PROG_TYPE_EXT:
case BPF_PROG_TYPE_LSM:
+ case BPF_PROG_TYPE_SCHED:
if (attr->raw_tracepoint.name) {
/* The attach point for this category of programs
* should be specified via btf_id during program load.
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 39eaaff81953..980b878892a4 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -394,6 +394,7 @@ static enum bpf_tramp_prog_type bpf_attach_type_to_tramp(struct bpf_prog *prog)
switch (prog->expected_attach_type) {
case BPF_TRACE_FENTRY:
return BPF_TRAMP_FENTRY;
+ case BPF_SCHED:
case BPF_MODIFY_RETURN:
return BPF_TRAMP_MODIFY_RETURN;
case BPF_TRACE_FEXIT:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 047ac4b4703b..233445619084 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -22,6 +22,7 @@
#include <linux/error-injection.h>
#include <linux/bpf_lsm.h>
#include <linux/btf_ids.h>
+#include <linux/bpf_sched.h>

#include "disasm.h"

@@ -13477,6 +13478,7 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
case BPF_LSM_MAC:
case BPF_TRACE_FENTRY:
case BPF_TRACE_FEXIT:
+ case BPF_SCHED:
if (!btf_type_is_func(t)) {
bpf_log(log, "attach_btf_id %u is not a function\n",
btf_id);
@@ -13601,7 +13603,8 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)

if (prog->type != BPF_PROG_TYPE_TRACING &&
prog->type != BPF_PROG_TYPE_LSM &&
- prog->type != BPF_PROG_TYPE_EXT)
+ prog->type != BPF_PROG_TYPE_EXT &&
+ prog->type != BPF_PROG_TYPE_SCHED)
return 0;

ret = bpf_check_attach_target(&env->log, prog, tgt_prog, btf_id, &tgt_info);
@@ -13642,6 +13645,10 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
} else if (prog->type == BPF_PROG_TYPE_TRACING &&
btf_id_set_contains(&btf_id_deny, btf_id)) {
return -EINVAL;
+ } else if (prog->type == BPF_PROG_TYPE_SCHED) {
+ ret = bpf_sched_verify_prog(&env->log, prog);
+ if (ret < 0)
+ return ret;
}

key = bpf_trampoline_compute_key(tgt_prog, prog->aux->attach_btf, btf_id);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 978fcfca5871..efb2cad4651b 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
obj-$(CONFIG_PSI) += psi.o
obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_sched.o
diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c
new file mode 100644
index 000000000000..2f05c186cfd0
--- /dev/null
+++ b/kernel/sched/bpf_sched.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/cgroup.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf_sched.h>
+#include <linux/btf_ids.h>
+#include "sched.h"
+
+/*
+ * For every hook declare a nop function where a BPF program can be attached.
+ */
+#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) \
+noinline RET bpf_sched_##NAME(__VA_ARGS__) \
+{ \
+ return DEFAULT; \
+}
+
+#include <linux/sched_hook_defs.h>
+#undef BPF_SCHED_HOOK
+
+#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) BTF_ID(func, bpf_sched_##NAME)
+BTF_SET_START(bpf_sched_hooks)
+#include <linux/sched_hook_defs.h>
+#undef BPF_SCHED_HOOK
+BTF_SET_END(bpf_sched_hooks)
+
+int bpf_sched_verify_prog(struct bpf_verifier_log *vlog,
+ const struct bpf_prog *prog)
+{
+ if (!prog->gpl_compatible) {
+ bpf_log(vlog,
+ "sched programs must have a GPL compatible license\n");
+ return -EINVAL;
+ }
+
+ if (!btf_id_set_contains(&bpf_sched_hooks, prog->aux->attach_btf_id)) {
+ bpf_log(vlog, "attach_btf_id %u points to wrong type name %s\n",
+ prog->aux->attach_btf_id, prog->aux->attach_func_name);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static const struct bpf_func_proto *
+bpf_sched_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_trace_printk:
+ return bpf_get_trace_printk_proto();
+ default:
+ return NULL;
+ }
+}
+
+const struct bpf_prog_ops bpf_sched_prog_ops = {
+};
+
+const struct bpf_verifier_ops bpf_sched_verifier_ops = {
+ .get_func_proto = bpf_sched_func_proto,
+ .is_valid_access = btf_ctx_access,
+};
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d21326558d42..6dfbebb8fc8f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -949,6 +949,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LSM,
BPF_PROG_TYPE_SK_LOOKUP,
BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+ BPF_PROG_TYPE_SCHED,
};

enum bpf_attach_type {
@@ -994,6 +995,7 @@ enum bpf_attach_type {
BPF_SK_REUSEPORT_SELECT,
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
BPF_PERF_EVENT,
+ BPF_SCHED,
__MAX_BPF_ATTACH_TYPE
};

--
2.31.1

2021-09-16 21:17:31

by Roman Gushchin

Subject: [PATCH rfc 6/6] bpftool: recognize scheduler programs

Teach bpftool to recognize scheduler bpf programs.

Signed-off-by: Roman Gushchin <[email protected]>
---
tools/bpf/bpftool/common.c | 1 +
tools/bpf/bpftool/prog.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index d42d930a3ec4..c73d634f4e82 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -73,6 +73,7 @@ const char * const attach_type_name[__MAX_BPF_ATTACH_TYPE] = {
[BPF_XDP] = "xdp",
[BPF_SK_REUSEPORT_SELECT] = "sk_skb_reuseport_select",
[BPF_SK_REUSEPORT_SELECT_OR_MIGRATE] = "sk_skb_reuseport_select_or_migrate",
+ [BPF_SCHED] = "sched",
};

void p_err(const char *fmt, ...)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 9c3e343b7d87..78eb4e807a6b 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -67,6 +67,7 @@ const char * const prog_type_name[] = {
[BPF_PROG_TYPE_EXT] = "ext",
[BPF_PROG_TYPE_LSM] = "lsm",
[BPF_PROG_TYPE_SK_LOOKUP] = "sk_lookup",
+ [BPF_PROG_TYPE_SCHED] = "sched",
};

const size_t prog_type_name_size = ARRAY_SIZE(prog_type_name);
--
2.31.1

2021-09-16 21:17:32

by Roman Gushchin

Subject: [PATCH rfc 5/6] libbpf: add support for scheduler bpf programs

This patch adds support for loading and attaching scheduler bpf
programs.

Signed-off-by: Roman Gushchin <[email protected]>
---
tools/lib/bpf/libbpf.c | 27 +++++++++++++++++++++++++--
tools/lib/bpf/libbpf.h | 4 ++++
tools/lib/bpf/libbpf.map | 3 +++
3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 62a43c408d73..8374a8d4aafe 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2633,7 +2633,8 @@ static int bpf_object__finalize_btf(struct bpf_object *obj)
static bool prog_needs_vmlinux_btf(struct bpf_program *prog)
{
if (prog->type == BPF_PROG_TYPE_STRUCT_OPS ||
- prog->type == BPF_PROG_TYPE_LSM)
+ prog->type == BPF_PROG_TYPE_LSM ||
+ prog->type == BPF_PROG_TYPE_SCHED)
return true;

/* BPF_PROG_TYPE_TRACING programs which do not attach to other programs
@@ -6280,7 +6281,8 @@ int bpf_program__load(struct bpf_program *prog, char *license, __u32 kern_ver)

if ((prog->type == BPF_PROG_TYPE_TRACING ||
prog->type == BPF_PROG_TYPE_LSM ||
- prog->type == BPF_PROG_TYPE_EXT) && !prog->attach_btf_id) {
+ prog->type == BPF_PROG_TYPE_EXT ||
+ prog->type == BPF_PROG_TYPE_SCHED) && !prog->attach_btf_id) {
int btf_obj_fd = 0, btf_type_id = 0;

err = libbpf_find_attach_btf_id(prog, &btf_obj_fd, &btf_type_id);
@@ -7892,6 +7894,7 @@ BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
+BPF_PROG_TYPE_FNS(sched, BPF_PROG_TYPE_SCHED);

enum bpf_attach_type
bpf_program__get_expected_attach_type(const struct bpf_program *prog)
@@ -7950,6 +7953,7 @@ static struct bpf_link *attach_raw_tp(struct bpf_program *prog);
static struct bpf_link *attach_trace(struct bpf_program *prog);
static struct bpf_link *attach_lsm(struct bpf_program *prog);
static struct bpf_link *attach_iter(struct bpf_program *prog);
+static struct bpf_link *attach_sched(struct bpf_program *prog);

static const struct bpf_sec_def section_defs[] = {
BPF_PROG_SEC("socket", BPF_PROG_TYPE_SOCKET_FILTER),
@@ -8022,6 +8026,10 @@ static const struct bpf_sec_def section_defs[] = {
.attach_fn = attach_iter),
SEC_DEF("syscall", SYSCALL,
.is_sleepable = true),
+ SEC_DEF("sched/", SCHED,
+ .is_attach_btf = true,
+ .expected_attach_type = BPF_SCHED,
+ .attach_fn = attach_sched),
BPF_EAPROG_SEC("xdp_devmap/", BPF_PROG_TYPE_XDP,
BPF_XDP_DEVMAP),
BPF_EAPROG_SEC("xdp_cpumap/", BPF_PROG_TYPE_XDP,
@@ -8311,6 +8319,7 @@ static int bpf_object__collect_st_ops_relos(struct bpf_object *obj,
#define BTF_TRACE_PREFIX "btf_trace_"
#define BTF_LSM_PREFIX "bpf_lsm_"
#define BTF_ITER_PREFIX "bpf_iter_"
+#define BTF_SCHED_PREFIX "bpf_sched_"
#define BTF_MAX_NAME_SIZE 128

void btf_get_kernel_prefix_kind(enum bpf_attach_type attach_type,
@@ -8329,6 +8338,10 @@ void btf_get_kernel_prefix_kind(enum bpf_attach_type attach_type,
*prefix = BTF_ITER_PREFIX;
*kind = BTF_KIND_FUNC;
break;
+ case BPF_SCHED:
+ *prefix = BTF_SCHED_PREFIX;
+ *kind = BTF_KIND_FUNC;
+ break;
default:
*prefix = "";
*kind = BTF_KIND_FUNC;
@@ -9675,6 +9688,11 @@ struct bpf_link *bpf_program__attach_lsm(struct bpf_program *prog)
return bpf_program__attach_btf_id(prog);
}

+struct bpf_link *bpf_program__attach_sched(struct bpf_program *prog)
+{
+ return bpf_program__attach_btf_id(prog);
+}
+
static struct bpf_link *attach_trace(struct bpf_program *prog)
{
return bpf_program__attach_trace(prog);
@@ -9685,6 +9703,11 @@ static struct bpf_link *attach_lsm(struct bpf_program *prog)
return bpf_program__attach_lsm(prog);
}

+static struct bpf_link *attach_sched(struct bpf_program *prog)
+{
+ return bpf_program__attach_sched(prog);
+}
+
static struct bpf_link *
bpf_program__attach_fd(struct bpf_program *prog, int target_fd, int btf_id,
const char *target_name)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 2f6f0e15d1e7..42a3dfcca778 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -339,6 +339,8 @@ bpf_program__attach_xdp(struct bpf_program *prog, int ifindex);
LIBBPF_API struct bpf_link *
bpf_program__attach_freplace(struct bpf_program *prog,
int target_fd, const char *attach_func_name);
+LIBBPF_API struct bpf_link *
+bpf_program__attach_sched(struct bpf_program *prog);

struct bpf_map;

@@ -435,6 +437,7 @@ LIBBPF_API int bpf_program__set_tracing(struct bpf_program *prog);
LIBBPF_API int bpf_program__set_struct_ops(struct bpf_program *prog);
LIBBPF_API int bpf_program__set_extension(struct bpf_program *prog);
LIBBPF_API int bpf_program__set_sk_lookup(struct bpf_program *prog);
+LIBBPF_API int bpf_program__set_sched(struct bpf_program *prog);

LIBBPF_API enum bpf_prog_type bpf_program__get_type(const struct bpf_program *prog);
LIBBPF_API void bpf_program__set_type(struct bpf_program *prog,
@@ -463,6 +466,7 @@ LIBBPF_API bool bpf_program__is_tracing(const struct bpf_program *prog);
LIBBPF_API bool bpf_program__is_struct_ops(const struct bpf_program *prog);
LIBBPF_API bool bpf_program__is_extension(const struct bpf_program *prog);
LIBBPF_API bool bpf_program__is_sk_lookup(const struct bpf_program *prog);
+LIBBPF_API bool bpf_program__is_sched(const struct bpf_program *prog);

/*
* No need for __attribute__((packed)), all members of 'bpf_map_def'
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 9e649cf9e771..02f149aced5a 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -390,4 +390,7 @@ LIBBPF_0.5.0 {
LIBBPF_0.6.0 {
global:
btf__add_tag;
+ bpf_program__attach_sched;
+ bpf_program__is_sched;
+ bpf_program__set_sched;
} LIBBPF_0.5.0;
--
2.31.1

2021-09-16 21:17:33

by Roman Gushchin

Subject: [PATCH rfc 4/6] sched: cfs: add bpf hooks to control wakeup and tick preemption

This patch adds 3 hooks to control wakeup and tick preemption:
cfs_check_preempt_tick
cfs_check_preempt_wakeup
cfs_wakeup_preempt_entity

The first one allows forcing or suppressing a preemption from the tick
context. An obvious usage example is to minimize the number of
non-voluntary context switches and decrease the associated latency
penalty by (conditionally) providing tasks or task groups an extended
execution slice. It can be used instead of tweaking
sysctl_sched_min_granularity.

The second one is called from the wakeup preemption code and allows
redefining whether a newly woken task should preempt the execution
of the current task. This is useful to minimize the number of
preemptions of latency-sensitive tasks. To some extent it's a more
flexible analog of sysctl_sched_wakeup_granularity.

The third one is similar, but it tweaks the wakeup_preempt_entity()
function, which is called not only from the wakeup context but also
from pick_next_task(), which makes it possible to influence the decision
on which task will run next.

It's open for discussion whether we need both of these hooks or only
one of them: the second is more powerful, but depends more on the
current implementation. In any case, bpf hooks are not an ABI, so it's
not a deal breaker.

The idea of the wakeup_preempt_entity hook belongs to Rik van Riel. He
also contributed a lot to the whole patchset by providing his ideas,
recommendations and feedback on earlier (non-public) versions.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/bpf_sched.h | 1 +
include/linux/sched_hook_defs.h | 4 +++-
kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
index 6e773aecdff7..5c238aeb853c 100644
--- a/include/linux/bpf_sched.h
+++ b/include/linux/bpf_sched.h
@@ -40,6 +40,7 @@ static inline RET bpf_sched_##NAME(__VA_ARGS__) \
{ \
return DEFAULT; \
}
+#include <linux/sched_hook_defs.h>
#undef BPF_SCHED_HOOK

static inline bool bpf_sched_enabled(void)
diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
index 14344004e335..f075b32698cd 100644
--- a/include/linux/sched_hook_defs.h
+++ b/include/linux/sched_hook_defs.h
@@ -1,2 +1,4 @@
/* SPDX-License-Identifier: GPL-2.0 */
-BPF_SCHED_HOOK(int, 0, dummy, void)
+BPF_SCHED_HOOK(int, 0, cfs_check_preempt_tick, struct sched_entity *curr, unsigned long delta_exec)
+BPF_SCHED_HOOK(int, 0, cfs_check_preempt_wakeup, struct task_struct *curr, struct task_struct *p)
+BPF_SCHED_HOOK(int, 0, cfs_wakeup_preempt_entity, struct sched_entity *curr, struct sched_entity *se)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..35ea8911b25c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -21,6 +21,7 @@
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
*/
#include "sched.h"
+#include <linux/bpf_sched.h>

/*
* Targeted preemption latency for CPU-bound tasks:
@@ -4447,6 +4448,16 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)

ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+ if (bpf_sched_enabled()) {
+ int ret = bpf_sched_cfs_check_preempt_tick(curr, delta_exec);
+
+ if (ret < 0)
+ return;
+ else if (ret > 0)
+ resched_curr(rq_of(cfs_rq));
+ }
+
if (delta_exec > ideal_runtime) {
resched_curr(rq_of(cfs_rq));
/*
@@ -7083,6 +7094,13 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
s64 gran, vdiff = curr->vruntime - se->vruntime;

+ if (bpf_sched_enabled()) {
+ int ret = bpf_sched_cfs_wakeup_preempt_entity(curr, se);
+
+ if (ret)
+ return ret;
+ }
+
if (vdiff <= 0)
return -1;

@@ -7168,6 +7186,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
likely(!task_has_idle_policy(p)))
goto preempt;

+ if (bpf_sched_enabled()) {
+ int ret = bpf_sched_cfs_check_preempt_wakeup(current, p);
+
+ if (ret < 0)
+ return;
+ else if (ret > 0)
+ goto preempt;
+ }
+
/*
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
--
2.31.1

2021-09-16 21:18:17

by Roman Gushchin

Subject: [PATCH rfc 2/6] bpf: sched: add convenient helpers to identify sched entities

This patch adds 3 helpers useful for dealing with sched entities:
u64 bpf_sched_entity_to_tgidpid(struct sched_entity *se);
u64 bpf_sched_entity_to_cgrpid(struct sched_entity *se);
long bpf_sched_entity_belongs_to_cgrp(struct sched_entity *se, u64 cgrpid);

A sched entity is a basic structure used by the scheduler to represent
schedulable objects: tasks and cgroups (if CONFIG_FAIR_GROUP_SCHED
is enabled). It will be passed as an argument to many bpf hooks, so
scheduler bpf programs need a convenient way to deal with it.

bpf_sched_entity_to_tgidpid() and bpf_sched_entity_to_cgrpid() are
useful to identify a sched entity in userspace terms (pid, tgid and
cgroup id). bpf_sched_entity_belongs_to_cgrp() allows checking whether
a sched entity belongs to the sub-tree of a cgroup. This makes it
possible to write cgroup-specific scheduler policies even without
enabling the cgroup cpu controller.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/uapi/linux/bpf.h | 23 +++++++++++
kernel/sched/bpf_sched.c | 74 ++++++++++++++++++++++++++++++++++
scripts/bpf_doc.py | 2 +
tools/include/uapi/linux/bpf.h | 23 +++++++++++
4 files changed, 122 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6dfbebb8fc8f..199e4a92820d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4900,6 +4900,26 @@ union bpf_attr {
* **-EINVAL** if *flags* is not zero.
*
* **-ENOENT** if architecture does not support branch records.
+ *
+ * u64 bpf_sched_entity_to_tgidpid(struct sched_entity *se)
+ * Description
+ * Return task's encoded tgid and pid if the sched entity is a task.
+ * Return
+ * Tgid and pid encoded as tgid << 32 \| pid, if *se* is a task. (u64)-1 otherwise.
+ *
+ * u64 bpf_sched_entity_to_cgrpid(struct sched_entity *se)
+ * Description
+ * Return cgroup id if the given sched entity is a cgroup.
+ * Return
+ * Cgroup id, if *se* is a cgroup. (u64)-1 otherwise.
+ *
+ * long bpf_sched_entity_belongs_to_cgrp(struct sched_entity *se, u64 cgrpid)
+ * Description
+ * Checks whether the sched entity belongs to a cgroup or
+ * it's sub-tree. It doesn't require a cgroup CPU controller
+ * to be enabled.
+ * Return
+ * 1 if the sched entity belongs to the cgroup (or its sub-tree), 0 otherwise.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -5079,6 +5099,9 @@ union bpf_attr {
FN(get_attach_cookie), \
FN(task_pt_regs), \
FN(get_branch_snapshot), \
+ FN(sched_entity_to_tgidpid), \
+ FN(sched_entity_to_cgrpid), \
+ FN(sched_entity_belongs_to_cgrp), \
/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c
index 2f05c186cfd0..ead691dc6e85 100644
--- a/kernel/sched/bpf_sched.c
+++ b/kernel/sched/bpf_sched.c
@@ -42,12 +42,86 @@ int bpf_sched_verify_prog(struct bpf_verifier_log *vlog,
return 0;
}

+BPF_CALL_1(bpf_sched_entity_to_tgidpid, struct sched_entity *, se)
+{
+ if (entity_is_task(se)) {
+ struct task_struct *task = task_of(se);
+
+ return (u64) task->tgid << 32 | task->pid;
+ } else {
+ return (u64) -1;
+ }
+}
+
+BPF_CALL_1(bpf_sched_entity_to_cgrpid, struct sched_entity *, se)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ if (!entity_is_task(se))
+ return cgroup_id(se->cfs_rq->tg->css.cgroup);
+#endif
+ return (u64) -1;
+}
+
+BPF_CALL_2(bpf_sched_entity_belongs_to_cgrp, struct sched_entity *, se,
+ u64, cgrpid)
+{
+#ifdef CONFIG_CGROUPS
+ struct cgroup *cgrp;
+ int level;
+
+ if (entity_is_task(se))
+ cgrp = task_dfl_cgroup(task_of(se));
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ else
+ cgrp = se->cfs_rq->tg->css.cgroup;
+#endif
+
+ for (level = cgrp->level; level; level--)
+ if (cgrp->ancestor_ids[level] == cgrpid)
+ return 1;
+#endif
+ return 0;
+}
+
+BTF_ID_LIST_SINGLE(btf_sched_entity_ids, struct, sched_entity)
+
+static const struct bpf_func_proto bpf_sched_entity_to_tgidpid_proto = {
+ .func = bpf_sched_entity_to_tgidpid,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_BTF_ID,
+ .arg1_btf_id = &btf_sched_entity_ids[0],
+};
+
+static const struct bpf_func_proto bpf_sched_entity_to_cgrpid_proto = {
+ .func = bpf_sched_entity_to_cgrpid,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_BTF_ID,
+ .arg1_btf_id = &btf_sched_entity_ids[0],
+};
+
+static const struct bpf_func_proto bpf_sched_entity_belongs_to_cgrp_proto = {
+ .func = bpf_sched_entity_belongs_to_cgrp,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_BTF_ID,
+ .arg1_btf_id = &btf_sched_entity_ids[0],
+ .arg2_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *
bpf_sched_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
switch (func_id) {
case BPF_FUNC_trace_printk:
return bpf_get_trace_printk_proto();
+ case BPF_FUNC_sched_entity_to_tgidpid:
+ return &bpf_sched_entity_to_tgidpid_proto;
+ case BPF_FUNC_sched_entity_to_cgrpid:
+ return &bpf_sched_entity_to_cgrpid_proto;
+ case BPF_FUNC_sched_entity_belongs_to_cgrp:
+ return &bpf_sched_entity_belongs_to_cgrp_proto;
default:
return NULL;
}
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 00ac7b79cddb..84019ba5b67b 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -548,6 +548,7 @@ class PrinterHelpers(Printer):
'struct socket',
'struct file',
'struct bpf_timer',
+ 'struct sched_entity',
]
known_types = {
'...',
@@ -596,6 +597,7 @@ class PrinterHelpers(Printer):
'struct socket',
'struct file',
'struct bpf_timer',
+ 'struct sched_entity',
}
mapped_types = {
'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6dfbebb8fc8f..199e4a92820d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4900,6 +4900,26 @@ union bpf_attr {
* **-EINVAL** if *flags* is not zero.
*
* **-ENOENT** if architecture does not support branch records.
+ *
+ * u64 bpf_sched_entity_to_tgidpid(struct sched_entity *se)
+ * Description
+ * Return task's encoded tgid and pid if the sched entity is a task.
+ * Return
+ * Tgid and pid encoded as tgid << 32 \| pid, if *se* is a task. (u64)-1 otherwise.
+ *
+ * u64 bpf_sched_entity_to_cgrpid(struct sched_entity *se)
+ * Description
+ * Return cgroup id if the given sched entity is a cgroup.
+ * Return
+ * Cgroup id, if *se* is a cgroup. (u64)-1 otherwise.
+ *
+ * long bpf_sched_entity_belongs_to_cgrp(struct sched_entity *se, u64 cgrpid)
+ * Description
+ * Checks whether the sched entity belongs to a cgroup or
+ * its sub-tree. It doesn't require a cgroup CPU controller
+ * to be enabled.
+ * Return
+ * 1 if the sched entity belongs to a cgroup, 0 otherwise.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -5079,6 +5099,9 @@ union bpf_attr {
FN(get_attach_cookie), \
FN(task_pt_regs), \
FN(get_branch_snapshot), \
+ FN(sched_entity_to_tgidpid), \
+ FN(sched_entity_to_cgrpid), \
+ FN(sched_entity_belongs_to_cgrp), \
/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
--
2.31.1

2021-09-16 21:23:22

by Roman Gushchin

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

Hello!

I'm sorry, somehow the patchset didn't reach the mailing lists and at least
some recipients yesterday (I'm digging into why).

Resending with fewer people in the cc list, which probably was the reason.

Thanks!

On Thu, Sep 16, 2021 at 09:24:45AM -0700, Roman Gushchin wrote:
> There is a long history of distro people, system administrators, and
> application owners tuning the CFS settings in /proc/sys, which are now
> in debugfs. Looking at what these settings actually did, it ended up
> boiling down to changing the likelihood of task preemption, or
> disabling it by setting the wakeup_granularity_ns to more than half of
> the latency_ns. The other settings didn't really do much for
> performance.
>

2021-10-01 03:39:10

by Barry Song

Subject: Re: [PATCH rfc 4/6] sched: cfs: add bpf hooks to control wakeup and tick preemption

On Fri, Sep 17, 2021 at 4:36 AM Roman Gushchin <[email protected]> wrote:
>
> This patch adds 3 hooks to control wakeup and tick preemption:
> cfs_check_preempt_tick
> cfs_check_preempt_wakeup
> cfs_wakeup_preempt_entity
>
> The first one allows forcing or suppressing a preemption from the tick
> context. An obvious usage example is to minimize the number of
> non-voluntary context switches and decrease the associated latency
> penalty by (conditionally) providing tasks or task groups an extended
> execution slice. It can be used instead of tweaking
> sysctl_sched_min_granularity.
>
> The second one is called from the wakeup preemption code and allows
> redefining whether a newly woken task should preempt the execution
> of the current task. This is useful to minimize the number of
> preemptions of latency-sensitive tasks. To some extent it's a more
> flexible analog of sysctl_sched_wakeup_granularity.

This reminds me of Mel's recent work which might be relevant:
sched/fair: Scale wakeup granularity relative to nr_running
https://lore.kernel.org/lkml/[email protected]/

>
> The third one is similar, but it tweaks the wakeup_preempt_entity()
> function, which is called not only from a wakeup context, but also
> from pick_next_task(), which makes it possible to influence the decision
> on which task will run next.
>
> It's a point for discussion whether we need both of these hooks or only
> one of them: the second is more powerful, but depends more on the
> current implementation. In any case, bpf hooks are not an ABI, so it's
> not a deal breaker.

I am also curious whether a similar hook could benefit the
newidle_balance/sched_migration_cost tuning discussed in this thread:
https://lore.kernel.org/lkml/[email protected]/

It seems those static values are not universal; different topologies might need
different settings, but dynamically tuning them in the kernel seems to be
extremely difficult.

>
> The idea of the wakeup_preempt_entity hook belongs to Rik van Riel. He
> also contributed a lot to the whole patchset by providing ideas,
> recommendations and feedback on earlier (non-public) versions.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> include/linux/bpf_sched.h | 1 +
> include/linux/sched_hook_defs.h | 4 +++-
> kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
> 3 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
> index 6e773aecdff7..5c238aeb853c 100644
> --- a/include/linux/bpf_sched.h
> +++ b/include/linux/bpf_sched.h
> @@ -40,6 +40,7 @@ static inline RET bpf_sched_##NAME(__VA_ARGS__) \
> { \
> return DEFAULT; \
> }
> +#include <linux/sched_hook_defs.h>
> #undef BPF_SCHED_HOOK
>
> static inline bool bpf_sched_enabled(void)
> diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
> index 14344004e335..f075b32698cd 100644
> --- a/include/linux/sched_hook_defs.h
> +++ b/include/linux/sched_hook_defs.h
> @@ -1,2 +1,4 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> -BPF_SCHED_HOOK(int, 0, dummy, void)
> +BPF_SCHED_HOOK(int, 0, cfs_check_preempt_tick, struct sched_entity *curr, unsigned long delta_exec)
> +BPF_SCHED_HOOK(int, 0, cfs_check_preempt_wakeup, struct task_struct *curr, struct task_struct *p)
> +BPF_SCHED_HOOK(int, 0, cfs_wakeup_preempt_entity, struct sched_entity *curr, struct sched_entity *se)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..35ea8911b25c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -21,6 +21,7 @@
> * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
> */
> #include "sched.h"
> +#include <linux/bpf_sched.h>
>
> /*
> * Targeted preemption latency for CPU-bound tasks:
> @@ -4447,6 +4448,16 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>
> ideal_runtime = sched_slice(cfs_rq, curr);
> delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> +
> + if (bpf_sched_enabled()) {
> + int ret = bpf_sched_cfs_check_preempt_tick(curr, delta_exec);
> +
> + if (ret < 0)
> + return;
> + else if (ret > 0)
> + resched_curr(rq_of(cfs_rq));
> + }
> +
> if (delta_exec > ideal_runtime) {
> resched_curr(rq_of(cfs_rq));
> /*
> @@ -7083,6 +7094,13 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> {
> s64 gran, vdiff = curr->vruntime - se->vruntime;
>
> + if (bpf_sched_enabled()) {
> + int ret = bpf_sched_cfs_wakeup_preempt_entity(curr, se);
> +
> + if (ret)
> + return ret;
> + }
> +
> if (vdiff <= 0)
> return -1;
>
> @@ -7168,6 +7186,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> likely(!task_has_idle_policy(p)))
> goto preempt;
>
> + if (bpf_sched_enabled()) {
> + int ret = bpf_sched_cfs_check_preempt_wakeup(current, p);
> +
> + if (ret < 0)
> + return;
> + else if (ret > 0)
> + goto preempt;
> + }
> +
> /*
> * Batch and idle tasks do not preempt non-idle tasks (their preemption
> * is driven by the tick):
> --
> 2.31.1
>

Thanks
barry

2021-10-02 00:18:27

by Roman Gushchin

Subject: Re: [PATCH rfc 4/6] sched: cfs: add bpf hooks to control wakeup and tick preemption

On Fri, Oct 01, 2021 at 04:35:58PM +1300, Barry Song wrote:
> On Fri, Sep 17, 2021 at 4:36 AM Roman Gushchin <[email protected]> wrote:
> >
> > This patch adds 3 hooks to control wakeup and tick preemption:
> > cfs_check_preempt_tick
> > cfs_check_preempt_wakeup
> > cfs_wakeup_preempt_entity
> >
> > The first one allows to force or suppress a preemption from a tick
> > context. An obvious usage example is to minimize the number of
> > non-voluntary context switches and decrease an associated latency
> > penalty by (conditionally) providing tasks or task groups an extended
> > execution slice. It can be used instead of tweaking
> > sysctl_sched_min_granularity.
> >
> > The second one is called from the wakeup preemption code and allows
> > to redefine whether a newly woken task should preempt the execution
> > of the current task. This is useful to minimize a number of
> > preemptions of latency sensitive tasks. To some extent it's a more
> > flexible analog of a sysctl_sched_wakeup_granularity.
>
> This reminds me of Mel's recent work which might be relevant:
> sched/fair: Scale wakeup granularity relative to nr_running
> https://lore.kernel.org/lkml/[email protected]/

Oh, this is interesting, thank you for the link! This is a perfect example
of a case where bpf can be useful if a change is considered too special
to be accepted into the mainline code.

>
> >
> > The third one is similar, but it tweaks the wakeup_preempt_entity()
> > function, which is called not only from a wakeup context, but also
> > from pick_next_task(), which allows to influence the decision on which
> > task will be running next.
> >
> > It's a place for a discussion whether we need both these hooks or only
> > one of them: the second is more powerful, but depends more on the
> > current implementation. In any case, bpf hooks are not an ABI, so it's
> > not a deal breaker.
>
> I am also curious if similar hook can benefit
> newidle_balance/sched_migration_cost
> tuning things in this thread:
> https://lore.kernel.org/lkml/[email protected]/
>
> It seems those static values are not universal. different topology might need
> different settings. but dynamically tuning them in the kernel seems to be
> extremely difficult.

Absolutely! I'm already playing with newidle_balance (no specific results yet).
And sched_migration_cost is likely a good target too!

Thanks!

2021-10-06 16:43:22

by Qais Yousef

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

Hi Roman

On 09/16/21 09:24, Roman Gushchin wrote:
> There is a long history of distro people, system administrators, and
> application owners tuning the CFS settings in /proc/sys, which are now
> in debugfs. Looking at what these settings actually did, it ended up
> boiling down to changing the likelihood of task preemption, or
> disabling it by setting the wakeup_granularity_ns to more than half of
> the latency_ns. The other settings didn't really do much for
> performance.
>
> In other words, some of our workloads benefit from having long-running tasks
> preempted by tasks handling short-running requests, while other workloads,
> which run only short-term requests, benefit from never being preempted.

We had a discussion about introducing a latency-nice hint, but that discussion
didn't end up producing any new API. Your use case seems similar to Android's;
we want some tasks to run ASAP. There's an out-of-tree patch that puts these
tasks on an idle CPU (keep in mind energy-aware scheduling in this context),
which seems okay for its purpose. Having a more generic solution in mainline
would be nice.

https://lwn.net/Articles/820659/

>
> This leads to a few observations and ideas:
> - Different workloads want different policies. Being able to configure
> the policy per workload could be useful.
> - A workload that benefits from not being preempted itself could still
> benefit from preempting (low priority) background system tasks.

You can make these tasks SCHED_IDLE. There's a potential danger of starving
these tasks; but assuming they're background and there's idle time in the
system, that should be fine.

https://lwn.net/Articles/805317/

That of course assuming you can classify these background tasks..

If you can do the classification, you can also use cpu.shares to reduce how
much CPU time they get, or the CFS bandwidth controller:

https://lwn.net/Articles/844976/

I like Android's model of classifying tasks. I think we need this classification
done by other, non-Android systems too.

> - It would be useful to quickly (and safely) experiment with different
> policies in production, without having to shut down applications or reboot
> systems, to determine what the policies for different workloads should be.

Userspace should have the knobs that allow them to tune that without a reboot.
If you're doing kernel development, then it's part of the job spec I'd say :-)

I think one can still go with the workflow you suggest for development without
the hooks. You'd need to un-inline the function you're interested in; then you
can use kprobes to hook into it and force an early return. That should produce
the same effect, no?

> - Only a few workloads are large and sensitive enough to merit their own
> policy tweaks. CFS by itself should be good enough for everything else,
> and we probably do not want policy tweaks to be a replacement for anything
> CFS does.
>
> This leads to BPF hooks, which have been successfully used in various
> kernel subsystems to provide a way for external code to (safely)
> change a few kernel decisions. BPF tooling makes this pretty easy to do,
> and the people deploying BPF scripts are already quite used to updating them
> for new kernel versions.

I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
gets heavily modified by vendors and OEMs. We try very hard to understand the
problems they face and get the right set of solutions in mainline. Which would
ultimately help towards the goal of having a single Generic kernel Image [1]
that gives you what you'd expect out of the platform without any need for
additional cherries on top.

So my worry is that this will open the gate for these hooks to do more than
just micro-optimization in a platform-specific way. And that it will
discourage having the right discussion to fix real problems in the scheduler
because the easy path is to do whatever you want in userspace. I am not sure we
can control how these hooks are used.

The question is: why can't we fix any issues in the scheduler / make it better,
rather than having these hooks?

[1] https://arstechnica.com/gadgets/2021/09/android-to-take-an-upstream-first-development-model-for-the-linux-kernel/

Thanks

--
Qais Yousef

>
> This patchset aims to start a discussion about potential applications of BPF
> to the scheduler. It also aims to land some very basic BPF infrastructure
> necessary to add new BPF hooks to the scheduler, a minimal set of useful
> helpers, corresponding libbpf changes, etc.
>
> Our very first experiments with using BPF in CFS look very promising. We're
> at a very early stage; however, we have already seen nice latency and ~1% RPS
> wins for our (Facebook's) main web workload.
>
> As far as I know, Google is working on a more radical approach [2]: they aim to move
> the scheduling code into userspace. It seems that their core motivation is
> somewhat similar: to make the scheduler changes easier to develop, validate
> and deploy. Even though their approach is different, they also use BPF for
> speeding up some hot paths. I think the suggested infrastructure can serve
> their purpose too.
>
> An example of a userspace part, which loads some simple hooks, is available
> here [3]. It's very simple, provided only to simplify playing with the provided
> kernel patches.
>
>
> [1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
> [2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/
> [3] https://github.com/rgushchin/atc
>
>
> Roman Gushchin (6):
> bpf: sched: basic infrastructure for scheduler bpf
> bpf: sched: add convenient helpers to identify sched entities
> bpf: sched: introduce bpf_sched_enable()
> sched: cfs: add bpf hooks to control wakeup and tick preemption
> libbpf: add support for scheduler bpf programs
> bpftool: recognize scheduler programs
>
> include/linux/bpf_sched.h | 53 ++++++++++++
> include/linux/bpf_types.h | 3 +
> include/linux/sched_hook_defs.h | 4 +
> include/uapi/linux/bpf.h | 25 ++++++
> kernel/bpf/btf.c | 1 +
> kernel/bpf/syscall.c | 21 ++++-
> kernel/bpf/trampoline.c | 1 +
> kernel/bpf/verifier.c | 9 ++-
> kernel/sched/Makefile | 1 +
> kernel/sched/bpf_sched.c | 138 ++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 27 +++++++
> scripts/bpf_doc.py | 2 +
> tools/bpf/bpftool/common.c | 1 +
> tools/bpf/bpftool/prog.c | 1 +
> tools/include/uapi/linux/bpf.h | 25 ++++++
> tools/lib/bpf/libbpf.c | 27 ++++++-
> tools/lib/bpf/libbpf.h | 4 +
> tools/lib/bpf/libbpf.map | 3 +
> 18 files changed, 341 insertions(+), 5 deletions(-)
> create mode 100644 include/linux/bpf_sched.h
> create mode 100644 include/linux/sched_hook_defs.h
> create mode 100644 kernel/sched/bpf_sched.c
>
> --
> 2.31.1
>

2021-10-06 18:51:46

by Roman Gushchin

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> Hi Roman
>
> On 09/16/21 09:24, Roman Gushchin wrote:
> > There is a long history of distro people, system administrators, and
> > application owners tuning the CFS settings in /proc/sys, which are now
> > in debugfs. Looking at what these settings actually did, it ended up
> > boiling down to changing the likelihood of task preemption, or
> > disabling it by setting the wakeup_granularity_ns to more than half of
> > the latency_ns. The other settings didn't really do much for
> > performance.
> >
> > In other words, some our workloads benefit by having long running tasks
> > preempted by tasks handling short running requests, and some workloads
> > that run only short term requests which benefit from never being preempted.
>
> We had discussion about introducing latency-nice hint; but that discussion
> didn't end up producing any new API. Your use case seem similar to Android's;
> we want some tasks to run ASAP. There's an out of tree patch that puts these
> tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> which seem okay for its purpose. Having a more generic solution in mainline
> would be nice.
>
> https://lwn.net/Articles/820659/

Hello Qais!

Thank you for the link, I like it!

>
> >
> > This leads to a few observations and ideas:
> > - Different workloads want different policies. Being able to configure
> > the policy per workload could be useful.
> > - A workload that benefits from not being preempted itself could still
> > benefit from preempting (low priority) background system tasks.
>
> You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> these tasks; but assuming they're background and there's idle time in the
> system that should be fine.
>
> https://lwn.net/Articles/805317/
>
> That of course assuming you can classify these background tasks..
>
> If you can do the classification, you can also use cpu.shares to reduce how
> much cpu time they get. Or CFS bandwidth controller
>
> https://lwn.net/Articles/844976/

The problem with the cfs cgroup controller is that it gets quite expensive
with the increasing depth of the cgroup tree. This is why we had to disable
it for some of our primary workloads.

Still, being able to control latencies at the per-cgroup level is one of the
goals of this patchset.

>
> I like Androd's model of classifying tasks. I think we need this classification
> done by other non-android systems too.
>
> > - It would be useful to quickly (and safely) experiment with different
> > policies in production, without having to shut down applications or reboot
> > systems, to determine what the policies for different workloads should be.
>
> Userspace should have the knobs that allows them to tune that without reboot.
> If you're doing kernel development; then it's part of the job spec I'd say :-)

The problem here occurs because there is no comprehensive way to test any
scheduler change other than running it on many machines (sometimes 1000's)
running different production-like workloads.

If I'm able to test an idea by loading a bpf program (and btw have some sort of
safety guarantees: maybe the performance will be hurt, but at least no panics),
it can speed up the development process significantly. The alternative is way
more complex from the infrastructure's point of view: releasing a custom kernel,
testing it for safety, rebooting certain machines to it, pinning the kernel so
it isn't automatically updated, etc.

>
> I think one can still go with the workflow you suggest for development without
> the hooks. You'd need to un-inline the function you're interested in; then you
> can use kprobes to hook into it and force an early return. That should produce
> the same effect, no?

Basically it's exactly what I'm suggesting. My patchset just provides a
convenient way to define these hooks and some basic useful helper functions.

>
> > - Only a few workloads are large and sensitive enough to merit their own
> > policy tweaks. CFS by itself should be good enough for everything else,
> > and we probably do not want policy tweaks to be a replacement for anything
> > CFS does.
> >
> > This leads to BPF hooks, which have been successfully used in various
> > kernel subsystems to provide a way for external code to (safely)
> > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > and the people deploying BPF scripts are already quite used to updating them
> > for new kernel versions.
>
> I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> gets heavily modified by vendors and OEMs. We try very hard to understand the
> problems they face and get the right set of solutions in mainline. Which would
> ultimately help towards the goal of having a single Generic kernel Image [1]
> that gives you what you'd expect out of the platform without any need for
> additional cherries on top.

Wouldn't it make your life easier had they provided a set of bpf programs
instead of custom patches?

>
> So my worry is that this will open the gate for these hooks to get more than
> just micro-optimization done in a platform specific way. And that it will
> discourage having the right discussion to fix real problems in the scheduler
> because the easy path is to do whatever you want in userspace. I am not sure we
> can control how these hooks are used.

I totally understand your worry. I think we need to find the right balance
between allowing custom policies to be implemented and keeping the core
functionality working well enough for everybody without a need to tweak
anything.

It seems like an alternative to this "let's allow cfs customization via bpf"
approach is to completely move the scheduler code into userspace/bpf, something
that Google's ghOSt is aiming to do.

>
> The question is: why can't we fix any issues in the scheduler/make it better
> and must have these hooks instead?

Of course, if it's possible to implement an idea in a form which is suitable
for everybody and upstream it, that's the best outcome. The problem is that
not every idea is like that. A bpf program can leverage a priori knowledge
of a workload and its needs, something the generic scheduler code lacks
by definition.

Thanks!

2021-10-11 17:31:36

by Qais Yousef

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

Hi Roman

On 10/06/21 11:50, Roman Gushchin wrote:
> On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> > Hi Roman
> >
> > On 09/16/21 09:24, Roman Gushchin wrote:
> > > There is a long history of distro people, system administrators, and
> > > application owners tuning the CFS settings in /proc/sys, which are now
> > > in debugfs. Looking at what these settings actually did, it ended up
> > > boiling down to changing the likelihood of task preemption, or
> > > disabling it by setting the wakeup_granularity_ns to more than half of
> > > the latency_ns. The other settings didn't really do much for
> > > performance.
> > >
> > > In other words, some our workloads benefit by having long running tasks
> > > preempted by tasks handling short running requests, and some workloads
> > > that run only short term requests which benefit from never being preempted.
> >
> > We had discussion about introducing latency-nice hint; but that discussion
> > didn't end up producing any new API. Your use case seem similar to Android's;
> > we want some tasks to run ASAP. There's an out of tree patch that puts these
> > tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> > which seem okay for its purpose. Having a more generic solution in mainline
> > would be nice.
> >
> > https://lwn.net/Articles/820659/
>
> Hello Qais!
>
> Thank you for the link, I like it!
>
> >
> > >
> > > This leads to a few observations and ideas:
> > > - Different workloads want different policies. Being able to configure
> > > the policy per workload could be useful.
> > > - A workload that benefits from not being preempted itself could still
> > > benefit from preempting (low priority) background system tasks.
> >
> > You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> > these tasks; but assuming they're background and there's idle time in the
> > system that should be fine.
> >
> > https://lwn.net/Articles/805317/
> >
> > That of course assuming you can classify these background tasks..
> >
> > If you can do the classification, you can also use cpu.shares to reduce how
> > much cpu time they get. Or CFS bandwidth controller
> >
> > https://lwn.net/Articles/844976/
>
> The cfs cgroup controller is that it's getting quite expensive quickly with the
> increasing depth of the cgroup tree. This is why we had to disable it for some
> of our primary workloads.

I can understand that..

>
> Still being able to control latencies on per-cgroup level is one of the goals
> of this patchset.
>
> >
> > I like Androd's model of classifying tasks. I think we need this classification
> > done by other non-android systems too.
> >
> > > - It would be useful to quickly (and safely) experiment with different
> > > policies in production, without having to shut down applications or reboot
> > > systems, to determine what the policies for different workloads should be.
> >
> > Userspace should have the knobs that allows them to tune that without reboot.
> > If you're doing kernel development; then it's part of the job spec I'd say :-)
>
> The problem here occurs because there is no comprehensive way to test any
> scheduler change rather than run it on many machines (sometimes 1000's) running
> different production-alike workloads.
>
> If I'm able to test an idea by loading a bpf program (and btw have some sort of
> safety guarantees: maybe the performance will be hurt, but at least no panics),
> it can speed up the development process significantly. The alternative is way
> more complex from the infrastructure's point of view: releasing a custom kernel,
> test it for safety, reboot certain machines to it, pin the kernel from being
> automatically updated etc.

This process is unavoidable IMO. Assuming you have these hooks in, as soon as
you require a new hook you'll be forced to have a custom kernel with that new
hook introduced. Which, in my view, is no different than pushing a custom
kernel that forces the function of interest to be noinline. Right?

>
> >
> > I think one can still go with the workflow you suggest for development without
> > the hooks. You'd need to un-inline the function you're interested in; then you
> > can use kprobes to hook into it and force an early return. That should produce
> > the same effect, no?
>
> Basically it's exactly what I'm suggesting. My patchset just provides a
> convenient way to define these hooks and some basic useful helper functions.

Convenience only holds assuming you have a full, comprehensive list of hooks
so that you never require adding a new one. As I highlighted above, this
convenience is limited to the hooks that you added now.

Do people always want more hooks? Rhetorical question ;-)

>
> >
> > > - Only a few workloads are large and sensitive enough to merit their own
> > > policy tweaks. CFS by itself should be good enough for everything else,
> > > and we probably do not want policy tweaks to be a replacement for anything
> > > CFS does.
> > >
> > > This leads to BPF hooks, which have been successfully used in various
> > > kernel subsystems to provide a way for external code to (safely)
> > > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > > and the people deploying BPF scripts are already quite used to updating them
> > > for new kernel versions.
> >
> > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> > gets heavily modified by vendors and OEMs. We try very hard to understand the
> > problems they face and get the right set of solutions in mainline. Which would
> > ultimately help towards the goal of having a single Generic kernel Image [1]
> > that gives you what you'd expect out of the platform without any need for
> > additional cherries on top.
>
> Wouldn't it make your life easier had they provide a set of bpf programs instead
> of custom patches?

Not really.

Having consistent mainline behavior is important, and these customizations
contribute to fragmentation and can throw off userspace developers who find
they have to do extra work on some platforms to get the desired outcome. They
will be easy to misuse. We want to see the patches and find ways to improve the
mainline kernel instead.

That said, I can see the use case of being able to micro-optimize part of the
scheduler in a workload specific way. But then the way I see this support
happening (DISCLAIMER, personal opinion :-))

1. The hooks have to be about replacing a specific snippet, like Barry's
example, where it's an area for which it is hard to find a generic solution
that doesn't have a drawback for some class of workloads.

2. The set of bpf programs that modify each added hook live in the kernel
tree. Then we can reason about why the hook is there and
allow others to reap the benefit. It also lets us easily re-evaluate
whether users still need that hook after a potential
improvement that could render it unnecessary.

3. Out of tree bpf programs can only be loaded if a special CONFIG option
is set, so that a production kernel can only load known ones that the
community has seen and reasoned about.

4. Out of tree bpf programs will taint the kernel. A regression
reported with something funny loaded should be flagged as
potentially bogus.

IMHO this should tame the beast into something useful for addressing these
situations where a change required to improve one workload would harm others
and it's hard to come up with a good compromise. Then a hook, as you suggest,
could help implement that policy specifically for that platform/workload.

One can note that the behavior I suggest is similar to how modules work :)

>
> >
> > So my worry is that this will open the gate for these hooks to get more than
> > just micro-optimization done in a platform specific way. And that it will
> > discourage having the right discussion to fix real problems in the scheduler
> > because the easy path is to do whatever you want in userspace. I am not sure we
> > can control how these hooks are used.
>
> I totally understand your worry. I think we need to find a right balance between
> allowing to implement custom policies and keeping the core functionality
> working well enough for everybody without a need to tweak anything.
>
> It seems like an alternative to this "let's allow cfs customization via bpf"
> approach is to completely move the scheduler code into userspace/bpf, something
> that Google's ghOSt is aiming to do.

Why not ship a custom kernel instead then?

>
> >
> > The question is: why can't we fix any issues in the scheduler/make it better
> > and must have these hooks instead?
>
> Of course, if it's possible to implement an idea in a form which is suitable
> for everybody and upstream it, this is the best outcome. The problem is that
> not every idea is like that. A bpf program can leverage a priori knowledge
> of a workload and its needs, something the generic scheduler code lacks
> by definition.

Yep I see your point for certain aspects of the scheduler that are hard to tune
universally. We just need to be careful not to end up in a wild west or Anything
Can Happen Thursday situation :-)

Maybe the maintainers have a different opinion though.

Cheers

--
Qais Yousef

2021-10-11 18:12:47

by Roman Gushchin

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

On Mon, Oct 11, 2021 at 05:38:52PM +0100, Qais Yousef wrote:
> Hi Roman
>
> On 10/06/21 11:50, Roman Gushchin wrote:
> > On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> > > Hi Roman
> > >
> > > On 09/16/21 09:24, Roman Gushchin wrote:
> > > > There is a long history of distro people, system administrators, and
> > > > application owners tuning the CFS settings in /proc/sys, which are now
> > > > in debugfs. Looking at what these settings actually did, it ended up
> > > > boiling down to changing the likelihood of task preemption, or
> > > > disabling it by setting the wakeup_granularity_ns to more than half of
> > > > the latency_ns. The other settings didn't really do much for
> > > > performance.
> > > >
> > > > In other words, some of our workloads benefit from having long running tasks
> > > > preempted by tasks handling short running requests, and some workloads
> > > > run only short term requests and benefit from never being preempted.
> > >
> > > We had a discussion about introducing a latency-nice hint; but that discussion
> > > didn't end up producing any new API. Your use case seems similar to Android's;
> > > we want some tasks to run ASAP. There's an out of tree patch that puts these
> > > tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> > > which seems okay for its purpose. Having a more generic solution in mainline
> > > would be nice.
> > >
> > > https://lwn.net/Articles/820659/
> >
> > Hello Qais!
> >
> > Thank you for the link, I like it!
> >
> > >
> > > >
> > > > This leads to a few observations and ideas:
> > > > - Different workloads want different policies. Being able to configure
> > > > the policy per workload could be useful.
> > > > - A workload that benefits from not being preempted itself could still
> > > > benefit from preempting (low priority) background system tasks.
> > >
> > > You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> > > these tasks; but assuming they're background and there's idle time in the
> > > system that should be fine.
> > >
> > > https://lwn.net/Articles/805317/
> > >
> > > That of course assuming you can classify these background tasks..
> > >
> > > If you can do the classification, you can also use cpu.shares to reduce how
> > > much cpu time they get. Or CFS bandwidth controller
> > >
> > > https://lwn.net/Articles/844976/
> >
> > The problem with the cfs cgroup controller is that it quickly gets quite
> > expensive with increasing depth of the cgroup tree. This is why we had to
> > disable it for some of our primary workloads.
>
> I can understand that..
>
> >
> > Still being able to control latencies on per-cgroup level is one of the goals
> > of this patchset.
> >
> > >
> > > I like Android's model of classifying tasks. I think we need this classification
> > > done by other non-android systems too.
> > >
> > > > - It would be useful to quickly (and safely) experiment with different
> > > > policies in production, without having to shut down applications or reboot
> > > > systems, to determine what the policies for different workloads should be.
> > >
> > > Userspace should have the knobs that allow them to tune that without a reboot.
> > > If you're doing kernel development, then it's part of the job spec I'd say :-)
> >
> > The problem here occurs because there is no comprehensive way to test any
> > scheduler change other than running it on many machines (sometimes 1000's)
> > running different production-alike workloads.
> >
> > If I'm able to test an idea by loading a bpf program (and btw have some sort of
> > safety guarantees: maybe the performance will be hurt, but at least no panics),
> > it can speed up the development process significantly. The alternative is way
> > more complex from the infrastructure's point of view: releasing a custom
> > kernel, testing it for safety, rebooting certain machines to it, pinning the
> > kernel to prevent automatic updates, etc.
>
> This process is unavoidable IMO. Assuming you have these hooks in, as soon as
> you require a new hook you'll be forced to ship a custom kernel with that new
> hook introduced. Which, in my view, is no different from pushing a custom
> kernel that forces the function of interest to be noinline. Right?

I think a relatively small and stable set of hooks can cover a large percent
of potential customization ideas.

>
> >
> > >
> > > I think one can still go with the workflow you suggest for development without
> > > the hooks. You'd need to un-inline the function you're interested in; then you
> > > can use kprobes to hook into it and force an early return. That should produce
> > > the same effect, no?
> >
> > Basically it's exactly what I'm suggesting. My patchset just provides a
> > convenient way to define these hooks and some basic useful helper functions.
>
> Convenient will be only true assuming you have a full comprehensive list of
> hooks to never require adding a new one. As I highlighted above, this
> convenience is limited to hooks that you added now.
>
> Do people always want more hooks? Rhetorical question ;-)

Why do you think that the list of the hooks will be so large/dynamic?

I'm not saying we can figure it out on the first attempt, but I'm pretty sure
that after some initial phase it can be relatively stable, e.g. changing only
with some _major_ changes in the scheduler code.

>
> >
> > >
> > > > - Only a few workloads are large and sensitive enough to merit their own
> > > > policy tweaks. CFS by itself should be good enough for everything else,
> > > > and we probably do not want policy tweaks to be a replacement for anything
> > > > CFS does.
> > > >
> > > > This leads to BPF hooks, which have been successfully used in various
> > > > kernel subsystems to provide a way for external code to (safely)
> > > > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > > > and the people deploying BPF scripts are already quite used to updating them
> > > > for new kernel versions.
> > >
> > > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> > > gets heavily modified by vendors and OEMs. We try very hard to understand the
> > > problems they face and get the right set of solutions in mainline. Which would
> > > ultimately help towards the goal of having a single Generic kernel Image [1]
> > > that gives you what you'd expect out of the platform without any need for
> > > additional cherries on top.
> >
> > Wouldn't it make your life easier had they provided a set of bpf programs instead
> > of custom patches?
>
> Not really.
>
> Having consistent mainline behavior is important, and these customizations
> contribute to fragmentation and can throw off userspace developers, who find
> they have to do extra work on some platforms to get the desired outcome. They
> will be easy to misuse. We want to see the patches and find ways to improve
> the mainline kernel instead.
>
> That said, I can see the use case of being able to micro-optimize part of the
> scheduler in a workload specific way. But then the way I see this support
> happening (DISCLAIMER, personal opinion :-))
>
> 1. The hooks have to be about replacing a specific snippet, like Barry's
> example, where it's an area for which it is hard to find a generic solution
> that doesn't have a drawback for some class of workloads.

This makes sense to me, and this is a good topic to discuss: which hooks do we
really need. I don't think it necessarily has to replace something, but I
totally agree on the "hard to find a generic solution" part.

>
> 2. The set of bpf programs that modify each added hook live in the kernel
> tree. Then we can reason about why the hook is there and
> allow others to reap the benefit. It also lets us easily re-evaluate
> whether users still need that hook after a potential
> improvement that could render it unnecessary.
>
> 3. Out of tree bpf programs can only be loaded if a special CONFIG option
> is set, so that a production kernel can only load known ones that the
> community has seen and reasoned about.
>
> 4. Out of tree bpf programs will taint the kernel. A regression
> reported with something funny loaded should be flagged as
> potentially bogus.

2-4 look like generic bpf questions to me; I don't think there is anything
scheduler-specific. So I'd suggest bringing the bpf maintainers into the
discussion; their input can be very valuable.

>
> IMHO this should tame the beast into something useful for addressing these
> situations where a change required to improve one workload would harm others
> and it's hard to come up with a good compromise. Then a hook, as you suggest,
> could help implement that policy specifically for that platform/workload.
>
> One can note that the behavior I suggest is similar to how modules work :)

The important benefit of bpf is its safety guarantees.

>
> >
> > >
> > > So my worry is that this will open the gate for these hooks to get more than
> > > just micro-optimization done in a platform specific way. And that it will
> > > discourage having the right discussion to fix real problems in the scheduler
> > > because the easy path is to do whatever you want in userspace. I am not sure we
> > > can control how these hooks are used.
> >
> > I totally understand your worry. I think we need to find a right balance between
> > allowing to implement custom policies and keeping the core functionality
> > working well enough for everybody without a need to tweak anything.
> >
> > It seems like an alternative to this "let's allow cfs customization via bpf"
> > approach is to completely move the scheduler code into userspace/bpf, something
> > that Google's ghOSt is aiming to do.
>
> Why not ship a custom kernel instead then?

Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast.
Just for example, imagine the process of rebooting 1,000,000 machines running
1000's of different workloads, each with their own redundancy and capacity requirements.

This is what makes the ability to push scheduler changes without a reboot/kernel
upgrade so attractive.

Obviously, that's not the case when we're talking about a single kernel engineer
and their laptop/dev server/vm.

>
> >
> > >
> > > The question is: why can't we fix any issues in the scheduler/make it better
> > > and must have these hooks instead?
> >
> > Of course, if it's possible to implement an idea in a form which is suitable
> > for everybody and upstream it, this is the best outcome. The problem is that
> > not every idea is like that. A bpf program can leverage a priori knowledge
> > of a workload and its needs, something the generic scheduler code lacks
> > by definition.
>
> Yep I see your point for certain aspects of the scheduler that are hard to tune
> universally. We just need to be careful not to end up in a wild west or Anything
> Can Happen Thursday situation :-)

Totally agree!

Thanks!

2021-10-12 10:18:15

by Qais Yousef

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

On 10/11/21 11:09, Roman Gushchin wrote:
> > Convenience will only be true if you have a comprehensive list of
> > hooks such that you never require adding a new one. As I highlighted above, this
> > convenience is limited to the hooks that you add now.
> >
> > Do people always want more hooks? Rhetorical question ;-)
>
> Why do you think that the list of the hooks will be so large/dynamic?

It's not a fact. Just my thoughts/guess based on how things usually end up.
It's very likely this will grow. I could be wrong of course :)

> I'm not saying we can figure it out on the first attempt, but I'm pretty sure
> that after some initial phase it can be relatively stable, e.g. changing only
> with some _major_ changes in the scheduler code.

My point was that the speed up in workflow will be limited by what's
available. It might be enough for a large set of use cases as you say, but at
some point there will be a new bottleneck that you might think is worth
experimenting with, and the chances a suitable hook is available are 50:50 in
theory. So it's not a magical fix where one would *never* have to push a custom
kernel to all these systems to experiment with some scheduler changes.

> > > > So my worry is that this will open the gate for these hooks to get more than
> > > > just micro-optimization done in a platform specific way. And that it will
> > > > discourage having the right discussion to fix real problems in the scheduler
> > > > because the easy path is to do whatever you want in userspace. I am not sure we
> > > > can control how these hooks are used.
> > >
> > > I totally understand your worry. I think we need to find a right balance between
> > > allowing to implement custom policies and keeping the core functionality
> > > working well enough for everybody without a need to tweak anything.
> > >
> > > It seems like an alternative to this "let's allow cfs customization via bpf"
> > > approach is to completely move the scheduler code into userspace/bpf, something
> > > that Google's ghOSt is aiming to do.
> >
> > Why not ship a custom kernel instead then?
>
> Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast.
> Just for example, imagine the process of rebooting 1,000,000 machines running
> 1000's of different workloads, each with their own redundancy and capacity requirements.
>
> This is what makes the ability to push scheduler changes without a reboot/kernel
> upgrade so attractive.
>
> Obviously, that's not the case when we're talking about a single kernel engineer
> and their laptop/dev server/vm.

I think you're still referring to ghOSt here. I thought your 2 use cases were
different, as you mentioned they "completely move the scheduler code into
userspace/bpf"; but it could be just me misinterpreting what this means. That
didn't read to me like they want to micro-optimize a few specific decisions in
the scheduler, but rather replace it altogether, hence my question.

Anyway. My 2 cents here is that we should be careful not to introduce something
that encourages out-of-tree workarounds for real scheduler problems, nor have it
done in a way where we lose visibility over how these hooks are used and lose
the ability to share them with others who could benefit from the same
micro-optimization too.

Thanks!

--
Qais Yousef

2021-11-20 16:42:12

by Huichun Feng

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

Hi list,

Is there any reason why this patchset has only got a few comments so far?

IMHO, by using static keys, the patchset should add barely observable overhead
to the scheduler when no BPF progs are loaded. Or is it because of the point
that Qais Yousef mentioned earlier? I.e. sched BPF may encourage people to fix
sched problems with BPF progs instead of fixing the scheduler directly.

Thanks!

2021-11-25 06:03:34

by Yafang Shao

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

Hi Roman,

Scheduler BPF is a great idea.
Thanks for the work.

Scheduler BPF won't be a small feature, so I think we'd better first give a
summary of the possible hooks it may add.
We must have a *basic rule* to control what it will tend to be, to
avoid adding BPF hooks here and there.
I haven't found a clear rule yet, but maybe we can learn from
netfilter, which has 5 basic hooks.
Regarding the scheduler BPF hooks, some possible basic hooks may be:
- Hook for Enqueue
- Hook for Dequeue
- Hook for Put Prev Task
- Hook for Set Next Task


> An example of a userspace part, which loads some simple hooks, is available
> here [3]. It's very simple, provided only to simplify playing with the provided
> kernel patches.
>

You’d better add this userspace code into samples/bpf/.


[Some error occurred in my mail client, so I am resending it]


--
Thanks
Yafang

2021-11-25 06:11:32

by Yafang Shao

Subject: Re: [PATCH rfc 2/6] bpf: sched: add convenient helpers to identify sched entities



> On Sep 17, 2021, at 12:24 AM, Roman Gushchin <[email protected]> wrote:
>
> This patch adds 3 helpers useful for dealing with sched entities:
> u64 bpf_sched_entity_to_tgidpid(struct sched_entity *se);
> u64 bpf_sched_entity_to_cgrpid(struct sched_entity *se);
> long bpf_sched_entity_belongs_to_cgrp(struct sched_entity *se, u64 cgrpid);
>
> Sched entity is a basic structure used by the scheduler to represent
> schedulable objects: tasks and cgroups (if CONFIG_FAIR_GROUP_SCHED
> is enabled). It will be passed as an argument to many bpf hooks, so
> scheduler bpf programs need a convenient way to deal with it.
>
> bpf_sched_entity_to_tgidpid() and bpf_sched_entity_to_cgrpid() are
> useful to identify a sched entity in userspace terms (pid, tgid and
> cgroup id). bpf_sched_entity_belongs_to_cgrp() allows to check whether
> a sched entity belongs to sub-tree of a cgroup. It allows to write
> cgroup-specific scheduler policies even without enabling the cgroup
> cpu controller.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> include/uapi/linux/bpf.h | 23 +++++++++++
> kernel/sched/bpf_sched.c | 74 ++++++++++++++++++++++++++++++++++
> scripts/bpf_doc.py | 2 +
> tools/include/uapi/linux/bpf.h | 23 +++++++++++
> 4 files changed, 122 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 6dfbebb8fc8f..199e4a92820d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4900,6 +4900,26 @@ union bpf_attr {
> * **-EINVAL** if *flags* is not zero.
> *
> * **-ENOENT** if architecture does not support branch records.
> + *
> + * u64 bpf_sched_entity_to_tgidpid(struct sched_entity *se)
> + * Description
> + * Return task's encoded tgid and pid if the sched entity is a task.
> + * Return
> + * Tgid and pid encoded as tgid << 32 \| pid, if *se* is a task. (u64)-1 otherwise.
> + *
> + * u64 bpf_sched_entity_to_cgrpid(struct sched_entity *se)
> + * Description
> + * Return cgroup id if the given sched entity is a cgroup.
> + * Return
> + * Cgroup id, if *se* is a cgroup. (u64)-1 otherwise.
> + *
> + * long bpf_sched_entity_belongs_to_cgrp(struct sched_entity *se, u64 cgrpid)
> + * Description
> + * Checks whether the sched entity belongs to a cgroup or
> + * it's sub-tree. It doesn't require a cgroup CPU controller
> + * to be enabled.
> + * Return
> + * 1 if the sched entity belongs to a cgroup, 0 otherwise.
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -5079,6 +5099,9 @@ union bpf_attr {
> FN(get_attach_cookie), \
> FN(task_pt_regs), \
> FN(get_branch_snapshot), \
> + FN(sched_entity_to_tgidpid), \
> + FN(sched_entity_to_cgrpid), \
> + FN(sched_entity_belongs_to_cgrp), \
> /* */
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c
> index 2f05c186cfd0..ead691dc6e85 100644
> --- a/kernel/sched/bpf_sched.c
> +++ b/kernel/sched/bpf_sched.c
> @@ -42,12 +42,86 @@ int bpf_sched_verify_prog(struct bpf_verifier_log *vlog,
> return 0;
> }
>
> +BPF_CALL_1(bpf_sched_entity_to_tgidpid, struct sched_entity *, se)
> +{
> + if (entity_is_task(se)) {
> + struct task_struct *task = task_of(se);
> +
> + return (u64) task->tgid << 32 | task->pid;
> + } else {
> + return (u64) -1;
> + }
> +}
> +
> +BPF_CALL_1(bpf_sched_entity_to_cgrpid, struct sched_entity *, se)
> +{
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> + if (!entity_is_task(se))
> + return cgroup_id(se->cfs_rq->tg->css.cgroup);
> +#endif
> + return (u64) -1;
> +}
> +
> +BPF_CALL_2(bpf_sched_entity_belongs_to_cgrp, struct sched_entity *, se,
> + u64, cgrpid)
> +{
> +#ifdef CONFIG_CGROUPS
> + struct cgroup *cgrp;
> + int level;
> +
> + if (entity_is_task(se))
> + cgrp = task_dfl_cgroup(task_of(se));
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> + else
> + cgrp = se->cfs_rq->tg->css.cgroup;

It is incorrect.
It should use se->my_q->tg->css.cgroup, plus possibly a NULL check (for autogroup).
se->cfs_rq and se->my_q are different: se->my_q is the cfs_rq owned by this se
itself, while se->cfs_rq is the cfs_rq the se is queued on (the parent's).



> +#endif
> +
> + for (level = cgrp->level; level; level--)
> + if (cgrp->ancestor_ids[level] == cgrpid)
> + return 1;
> +#endif
> + return 0;
> +}
> +
> +BTF_ID_LIST_SINGLE(btf_sched_entity_ids, struct, sched_entity)
> +
> +static const struct bpf_func_proto bpf_sched_entity_to_tgidpid_proto = {
> + .func = bpf_sched_entity_to_tgidpid,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_BTF_ID,
> + .arg1_btf_id = &btf_sched_entity_ids[0],
> +};
> +
> +static const struct bpf_func_proto bpf_sched_entity_to_cgrpid_proto = {
> + .func = bpf_sched_entity_to_cgrpid,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_BTF_ID,
> + .arg1_btf_id = &btf_sched_entity_ids[0],
> +};
> +
> +static const struct bpf_func_proto bpf_sched_entity_belongs_to_cgrp_proto = {
> + .func = bpf_sched_entity_belongs_to_cgrp,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_BTF_ID,
> + .arg1_btf_id = &btf_sched_entity_ids[0],
> + .arg2_type = ARG_ANYTHING,
> +};
> +
> static const struct bpf_func_proto *
> bpf_sched_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> {
> switch (func_id) {
> case BPF_FUNC_trace_printk:
> return bpf_get_trace_printk_proto();
> + case BPF_FUNC_sched_entity_to_tgidpid:
> + return &bpf_sched_entity_to_tgidpid_proto;
> + case BPF_FUNC_sched_entity_to_cgrpid:
> + return &bpf_sched_entity_to_cgrpid_proto;
> + case BPF_FUNC_sched_entity_belongs_to_cgrp:
> + return &bpf_sched_entity_belongs_to_cgrp_proto;
> default:
> return NULL;
> }
> diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
> index 00ac7b79cddb..84019ba5b67b 100755
> --- a/scripts/bpf_doc.py
> +++ b/scripts/bpf_doc.py
> @@ -548,6 +548,7 @@ class PrinterHelpers(Printer):
> 'struct socket',
> 'struct file',
> 'struct bpf_timer',
> + 'struct sched_entity',
> ]
> known_types = {
> '...',
> @@ -596,6 +597,7 @@ class PrinterHelpers(Printer):
> 'struct socket',
> 'struct file',
> 'struct bpf_timer',
> + 'struct sched_entity',
> }
> mapped_types = {
> 'u8': '__u8',
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 6dfbebb8fc8f..199e4a92820d 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -4900,6 +4900,26 @@ union bpf_attr {
> * **-EINVAL** if *flags* is not zero.
> *
> * **-ENOENT** if architecture does not support branch records.
> + *
> + * u64 bpf_sched_entity_to_tgidpid(struct sched_entity *se)
> + * Description
> + * Return task's encoded tgid and pid if the sched entity is a task.
> + * Return
> + * Tgid and pid encoded as tgid << 32 \| pid, if *se* is a task. (u64)-1 otherwise.
> + *
> + * u64 bpf_sched_entity_to_cgrpid(struct sched_entity *se)
> + * Description
> + * Return cgroup id if the given sched entity is a cgroup.
> + * Return
> + * Cgroup id, if *se* is a cgroup. (u64)-1 otherwise.
> + *
> + * long bpf_sched_entity_belongs_to_cgrp(struct sched_entity *se, u64 cgrpid)
> + * Description
> + * Checks whether the sched entity belongs to a cgroup or
> + * it's sub-tree. It doesn't require a cgroup CPU controller
> + * to be enabled.
> + * Return
> + * 1 if the sched entity belongs to a cgroup, 0 otherwise.
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -5079,6 +5099,9 @@ union bpf_attr {
> FN(get_attach_cookie), \
> FN(task_pt_regs), \
> FN(get_branch_snapshot), \
> + FN(sched_entity_to_tgidpid), \
> + FN(sched_entity_to_cgrpid), \
> + FN(sched_entity_belongs_to_cgrp), \
> /* */
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> --
> 2.31.1
>
>


2021-11-26 19:49:09

by Roman Gushchin

Subject: Re: [PATCH rfc 0/6] Scheduler BPF

On Thu, Nov 25, 2021 at 02:00:04PM +0800, Yafang Shao wrote:
> Hi Roman,

Hi Yafang!

>
> Scheduler BPF is a great idea.
> Thanks for the work.

Thanks!

>
> Scheduler BPF won't be a small feature, so I think we'd better first give a
> summary of the possible hooks it may add.
> We must have a *basic rule* to control what it will tend to be, to
> avoid adding BPF hooks here and there.
> I haven't found a clear rule yet, but maybe we can learn from
> netfilter, which has 5 basic hooks.
> Regarding the scheduler BPF hooks, some possible basic hooks may be:
> - Hook for Enqueue
> - Hook for Dequeue
> - Hook for Put Prev Task
> - Hook for Set Next Task

I think it depends on what we want to achieve. There are several options:
we might aim to implement the whole scheduler logic in bpf, we might aim
to make some adjustments to the existing scheduler behavior, or a mix of
those approaches.

Bpf as it is now is not capable enough to implement a new scheduler class
without a substantial amount of new C code (in the form of helpers, maybe custom
maps, some verifier changes, etc). In particular, it's challenging to
provide strong safety guarantees: any scheduler bpf program loaded shouldn't
crash or deadlock the system (otherwise bpf isn't any better than a kernel
module). Also, performance margins are quite tight.

I'm not saying that providing such generic hooks is impossible or useless,
but it requires a lot of changes and support code and I'm not sure that we have
a good justification for them right now.

I think instead we might want to see bpf hooks as a better form of (sysctl)
tunables, which are more flexible (e.g. can be used for specific processes,
cgroups, cpus, being enabled depending on load, weather, etc) and do not create
an ABI (so are easier to maintain).

>
>
> > An example of a userspace part, which loads some simple hooks, is available
> > here [3]. It's very simple, provided only to simplify playing with the provided
> > kernel patches.
> >
>
> You’d better add this userspace code into samples/bpf/.

I thought samples/bpf was considered deprecated (in favor of selftests/bpf/),
but I'm gonna check with the bpf maintainers. Thanks for the idea!

2021-11-26 19:52:35

by Roman Gushchin

Subject: Re: [PATCH rfc 2/6] bpf: sched: add convenient helpers to identify sched entities

On Thu, Nov 25, 2021 at 02:09:00PM +0800, Yafang Shao wrote:
>
>
> > On Sep 17, 2021, at 12:24 AM, Roman Gushchin <[email protected]> wrote:
> >
> > This patch adds 3 helpers useful for dealing with sched entities:
> > u64 bpf_sched_entity_to_tgidpid(struct sched_entity *se);
> > u64 bpf_sched_entity_to_cgrpid(struct sched_entity *se);
> > long bpf_sched_entity_belongs_to_cgrp(struct sched_entity *se, u64 cgrpid);
> >
> > Sched entity is a basic structure used by the scheduler to represent
> > schedulable objects: tasks and cgroups (if CONFIG_FAIR_GROUP_SCHED
> > is enabled). It will be passed as an argument to many bpf hooks, so
> > scheduler bpf programs need a convenient way to deal with it.
> >
> > bpf_sched_entity_to_tgidpid() and bpf_sched_entity_to_cgrpid() are
> > useful to identify a sched entity in userspace terms (pid, tgid and
> > cgroup id). bpf_sched_entity_belongs_to_cgrp() allows to check whether
> > a sched entity belongs to sub-tree of a cgroup. It allows to write
> > cgroup-specific scheduler policies even without enabling the cgroup
> > cpu controller.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > ---
> > include/uapi/linux/bpf.h | 23 +++++++++++
> > kernel/sched/bpf_sched.c | 74 ++++++++++++++++++++++++++++++++++
> > scripts/bpf_doc.py | 2 +
> > tools/include/uapi/linux/bpf.h | 23 +++++++++++
> > 4 files changed, 122 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 6dfbebb8fc8f..199e4a92820d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -4900,6 +4900,26 @@ union bpf_attr {
> > * **-EINVAL** if *flags* is not zero.
> > *
> > * **-ENOENT** if architecture does not support branch records.
> > + *
> > + * u64 bpf_sched_entity_to_tgidpid(struct sched_entity *se)
> > + * Description
> > + * Return task's encoded tgid and pid if the sched entity is a task.
> > + * Return
> > + * Tgid and pid encoded as tgid << 32 \| pid, if *se* is a task. (u64)-1 otherwise.
> > + *
> > + * u64 bpf_sched_entity_to_cgrpid(struct sched_entity *se)
> > + * Description
> > + * Return cgroup id if the given sched entity is a cgroup.
> > + * Return
> > + * Cgroup id, if *se* is a cgroup. (u64)-1 otherwise.
> > + *
> > + * long bpf_sched_entity_belongs_to_cgrp(struct sched_entity *se, u64 cgrpid)
> > + * Description
> > + * Checks whether the sched entity belongs to a cgroup or
> > + * it's sub-tree. It doesn't require a cgroup CPU controller
> > + * to be enabled.
> > + * Return
> > + * 1 if the sched entity belongs to a cgroup, 0 otherwise.
> > */
> > #define __BPF_FUNC_MAPPER(FN) \
> > FN(unspec), \
> > @@ -5079,6 +5099,9 @@ union bpf_attr {
> > FN(get_attach_cookie), \
> > FN(task_pt_regs), \
> > FN(get_branch_snapshot), \
> > + FN(sched_entity_to_tgidpid), \
> > + FN(sched_entity_to_cgrpid), \
> > + FN(sched_entity_belongs_to_cgrp), \
> > /* */
> >
> > /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c
> > index 2f05c186cfd0..ead691dc6e85 100644
> > --- a/kernel/sched/bpf_sched.c
> > +++ b/kernel/sched/bpf_sched.c
> > @@ -42,12 +42,86 @@ int bpf_sched_verify_prog(struct bpf_verifier_log *vlog,
> > return 0;
> > }
> >
> > +BPF_CALL_1(bpf_sched_entity_to_tgidpid, struct sched_entity *, se)
> > +{
> > + if (entity_is_task(se)) {
> > + struct task_struct *task = task_of(se);
> > +
> > + return (u64) task->tgid << 32 | task->pid;
> > + } else {
> > + return (u64) -1;
> > + }
> > +}
> > +
> > +BPF_CALL_1(bpf_sched_entity_to_cgrpid, struct sched_entity *, se)
> > +{
> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > + if (!entity_is_task(se))
> > + return cgroup_id(se->cfs_rq->tg->css.cgroup);
> > +#endif
> > + return (u64) -1;
> > +}
> > +
> > +BPF_CALL_2(bpf_sched_entity_belongs_to_cgrp, struct sched_entity *, se,
> > + u64, cgrpid)
> > +{
> > +#ifdef CONFIG_CGROUPS
> > + struct cgroup *cgrp;
> > + int level;
> > +
> > + if (entity_is_task(se))
> > + cgrp = task_dfl_cgroup(task_of(se));
> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > + else
> > + cgrp = se->cfs_rq->tg->css.cgroup;
>
> This is incorrect.
> It should use se->my_q->tg->css.cgroup, with a NULL check where needed (for autogroup).
> se->cfs_rq and se->my_q are different: se->my_q is the cfs_rq owned by this se itself,
> while se->cfs_rq may be the parent's.

Indeed. Thanks, will fix in the next version.

2022-01-15 23:02:03

by Huichun Feng

[permalink] [raw]
Subject: Re: [PATCH rfc 0/6] Scheduler BPF

Hi Roman and the list,

I have a naive question regarding the BPF hooks for the scheduler.

Given that BPF can also be attached to tracepoints, why do we add a BPF prog
type specific to sched?

The only reason I can come up with is that a sched BPF prog can return a value
to drive the scheduling decision behind a static branch, whereas a tracepoint
prog is not able to do this. Is it mainly because of this, or anything else?


Thanks
Huichun

2022-01-21 11:10:24

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 0/6] Scheduler BPF

On Sat, Jan 15, 2022 at 04:29:24PM +0800, Huichun Feng wrote:
> Hi Roman and the list,

Hello Huichun!

>
> I have a naive question regarding BPF hook for sched.
>
> Given that BPF can also be attached to tracepoint, why do we add a BPF prog
> type specific to sched?

Tracing programs can have return values as well, see kretprobes.

>
> The reason I can come up with is that sched BPF can have retval to drive the
> scheduling decision in static branch, whereas tracepoint is not able to do this.
> Is it mainly because of this or anything else?

Well, you are right that right now there is no strict necessity to
introduce a new prog type (aside from the static branch mechanism you
mentioned), however I believe it's useful in the long run. Sched
programs might be able to use a different set of helpers, maybe there
will be some additional restrictions, etc. It's an RFC version of the
patchset, and any ideas, suggestions, and criticism are highly welcome!

Thanks!

2022-07-19 13:59:18

by Ren Zhijie

[permalink] [raw]
Subject: Re: [PATCH rfc 0/6] Scheduler BPF

Hi Roman and list,

We want to implement a programmable scheduler to meet the scheduling
requirements of different workloads.

Using BPF, we can easily deploy scheduling policies for specific
workloads and verify them quickly without modifying kernel code. This
greatly reduces the cost of deploying new scheduling policies in a
production environment.

Therefore, we want to continue to develop based on your patch. We plan
to merge it into the openEuler open-source community and use the
community to continuously evolve and maintain it.
(link: https://www.openeuler.org/en/)

We made some changes to your patch:
1. Adapted it to the openeuler-OLK-5.10 branch, which is mostly based
on the Linux 5.10 longterm branch.
2. Introduced the Kconfig option CONFIG_BPF_SCHED to isolate the
related code at compile time.
3. Modified the helpers bpf_sched_entity_to_cgrpid() and
bpf_sched_entity_belongs_to_cgrp() to obtain the task group to which
the sched entity belongs through se->my_q->tg->css.cgroup.

We have some ideas for the next iteration of Scheduler BPF that we would
like to share with you:
1. Add a tag field to struct task_struct and struct task_group. Users
can use a file system interface to mark specific workloads with
different tags, and the BPF prog reads the tags to distinguish between
workloads.
2. Add BPF hooks and helpers to scheduling paths such as select_task_rq
and pick_next_task to enable extensibility.

It's a new attempt, and there are bound to be a lot of problems along
the way, but it's exciting that it makes the scheduler programmable.

cheers,
Ren Zhijie


2022-07-19 23:27:59

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 0/6] Scheduler BPF

On Tue, Jul 19, 2022 at 09:17:24PM +0800, Ren Zhijie wrote:
> Hi Roman and list,
>
> We want to implement a programmable scheduler to meet the schedule
> requirements of different workloads.
>
> Using BPF, we can easily deploy schedule policies for specific workloads,
> quickly verifying without modifying the kernel code. This greatly reduces
> the cost of deploying new schedule policies in the production environment.
>
> Therefore, we want to continue to develop based on your patch. We plan to
> merge it into the openeuler open-source community and use the community to
> continuously evolve and maintain it.
> (link: https://www.openeuler.org/en/)
>
> We made some changes to your patch:
> 1. Adapt to the openeuler-OLK-5.10 branch, which is mostly based on the
> Linux 5.10 longterm branch.
> 2. Introduce the Kconfig CONFIG_BPF_SCHED to isolate related code at compile
> time.
> 3. helpers bpf_sched_entity_to_cgrpid() and
> bpf_sched_entity_belongs_to_cgrp() are modified to obtain the task group to
> which the sched entity belongs through se->my_q->tg->css.cgroup.
>
> We have some ideas for the next iteration of Scheduler BPF that we would
> like to share with you:
> 1.The tag field is added to struct task_struct and struct task_group. Users
> can use the file system interface to mark different tags for specific
> workloads. The bpf prog obtains the tags to detect different workloads.
> 2.Add BPF hook and helper to scheduling processes such as select_task_rq and
> pick_next_task to enable scalability.
>
> It's a new attempt, and there's bound to be a lot of problems later, but
> it's exciting that it makes the scheduler programmable.

Hi Ren!

Great to hear my work is useful and thank you for describing your plans!
I'm not actively working on it right now, but I might start again in the future.
Let me know if I can help you with this effort.

Thanks!