2022-04-22 18:46:25

by Namhyung Kim

Subject: [RFC RESEND 0/4] perf record: Implement off-cpu profiling with BPF (v1)

Hello,

(Resending with a quick fix for a missing header.)

This is the first version of off-cpu profiling support. Together with
(PMU-based) on-cpu profiling, it can show a holistic view of the
performance characteristics of your application or system.

With BPF, it can aggregate scheduling stats for the tasks and/or
states of interest and convert the data into perf sample records. I
chose the bpf-output event, which is a software event intended for
use by BPF programs, and renamed it "offcpu-time". So it requires no
change on the perf report side except for setting the sample type of
the bpf-output event.
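
For reference, the bpf-output event is a regular software event, so
its setup is roughly like this (a simplified sketch, not the exact
code in this series):

    struct perf_event_attr attr = {
        .type          = PERF_TYPE_SOFTWARE,
        .config        = PERF_COUNT_SW_BPF_OUTPUT,
        .size          = sizeof(attr),
        .sample_period = 1,
        .sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                         PERF_SAMPLE_PERIOD | PERF_SAMPLE_RAW |
                         PERF_SAMPLE_CALLCHAIN,
    };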

Basically it collects the userspace callstack for tasks, as that's
what users mostly want. Maybe we can add support for kernel stacks
too, but I'm afraid it'd cause more overhead. So the offcpu-time
event always has callchains regardless of the command line options,
and it enables the children mode in perf report by default.
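
To give a rough idea of the BPF side (a simplified sketch, with names
abbreviated from off_cpu.bpf.c in this series): at sched-out it saves
a timestamp and the user stack id, and at sched-in it adds the
elapsed time to a hash map that perf converts into offcpu-time
samples at the end of the record session.

    /* at sched-out of @prev */
    pelem->timestamp = bpf_ktime_get_ns();
    pelem->stack_id = bpf_get_stackid(ctx, &stacks,
                                      BPF_F_FAST_STACK_CMP |
                                      BPF_F_USER_STACK);

    /* at sched-in of @next, if it was seen scheduling out earlier */
    key.pid = next->pid;
    key.stack_id = pelem->stack_id;
    total = bpf_map_lookup_elem(&off_cpu, &key);
    if (total)
        *total += bpf_ktime_get_ns() - pelem->timestamp;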

It adds the --off-cpu option to perf record, like below:

$ sudo perf record -a --off-cpu -- perf bench sched messaging -l 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

Total time: 1.518 [sec]
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 5.313 MB perf.data (53341 samples) ]

Then we can run perf report as usual. The options below just skip
the less important parts.

$ sudo perf report --stdio --call-graph=no --percent-limit=2
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 52K of event 'cycles'
# Event count (approx.): 42522453276
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... ................ ..................................
#
9.58% 9.58% sched-messaging [kernel.vmlinux] [k] audit_filter_rules.constprop.0
8.46% 8.46% sched-messaging [kernel.vmlinux] [k] audit_filter_syscall
4.54% 4.54% sched-messaging [kernel.vmlinux] [k] copy_user_enhanced_fast_string
2.94% 2.94% sched-messaging [kernel.vmlinux] [k] unix_stream_read_generic
2.45% 2.45% sched-messaging [kernel.vmlinux] [k] memcg_slab_free_hook


# Samples: 983 of event 'offcpu-time'
# Event count (approx.): 684538813464
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... .................... ..........................
#
83.86% 0.00% sched-messaging libc-2.33.so [.] __libc_start_main
83.86% 0.00% sched-messaging perf [.] cmd_bench
83.86% 0.00% sched-messaging perf [.] main
83.86% 0.00% sched-messaging perf [.] run_builtin
83.64% 0.00% sched-messaging perf [.] bench_sched_messaging
41.35% 41.35% sched-messaging libpthread-2.33.so [.] __read
38.88% 38.88% sched-messaging libpthread-2.33.so [.] __write
3.41% 3.41% sched-messaging libc-2.33.so [.] __poll

The perf bench sched messaging workload created 400 processes to
send/receive messages through unix sockets. It spent a large portion
of its cpu cycles in the audit filter and in reading/copying the
messages, while most of the offcpu-time was spent in the read and
write calls. (Note that the offcpu-time event count is in
nanoseconds of sched-out time, so the number above amounts to roughly
684.5 seconds of off-cpu time aggregated over all tasks system-wide
during the ~1.5 second run.)

You can get the code from 'perf/offcpu-v1' branch in my tree at

git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Enjoy! :)

Thanks,
Namhyung


Namhyung Kim (4):
perf report: Do not extend sample type of bpf-output event
perf record: Enable off-cpu analysis with BPF
perf record: Implement basic filtering for off-cpu
perf record: Handle argument change in sched_switch

tools/perf/Makefile.perf | 1 +
tools/perf/builtin-record.c | 21 ++
tools/perf/util/Build | 1 +
tools/perf/util/bpf_off_cpu.c | 302 +++++++++++++++++++++++++
tools/perf/util/bpf_skel/off_cpu.bpf.c | 214 ++++++++++++++++++
tools/perf/util/evsel.c | 4 +-
tools/perf/util/off_cpu.h | 24 ++
7 files changed, 565 insertions(+), 2 deletions(-)
create mode 100644 tools/perf/util/bpf_off_cpu.c
create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
create mode 100644 tools/perf/util/off_cpu.h


base-commit: 41204da4c16071be9090940b18f566832d46becc
--
2.36.0.rc2.479.g8af0fa9b8e-goog


2022-04-22 19:53:52

by Namhyung Kim

Subject: [PATCH 4/4] perf record: Handle argument change in sched_switch

Recently the sched_switch tracepoint added a new argument for
prev_state, but it's hard to handle the change in a BPF program.
Instead, we can check the function prototype in BTF before loading
the program.

Thus I made two copies of the tracepoint handler and select one based
on the BTF info.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/bpf_off_cpu.c | 32 +++++++++++++++
tools/perf/util/bpf_skel/off_cpu.bpf.c | 55 ++++++++++++++++++++------
2 files changed, 76 insertions(+), 11 deletions(-)

diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 89f36229041d..38aeb13d3d25 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -86,6 +86,37 @@ static void off_cpu_finish(void *arg __maybe_unused)
off_cpu_bpf__destroy(skel);
}

+/* recent kernels added a prev_state arg, so we need to call the proper function */
+static void check_sched_switch_args(void)
+{
+ const struct btf *btf = bpf_object__btf(skel->obj);
+ const struct btf_type *t1, *t2, *t3;
+ u32 type_id;
+
+ type_id = btf__find_by_name_kind(btf, "bpf_trace_sched_switch",
+ BTF_KIND_TYPEDEF);
+ if ((s32)type_id < 0)
+ goto old_format;
+
+ t1 = btf__type_by_id(btf, type_id);
+ if (t1 == NULL)
+ goto old_format;
+
+ t2 = btf__type_by_id(btf, t1->type);
+ if (t2 == NULL || !btf_is_ptr(t2))
+ goto old_format;
+
+ t3 = btf__type_by_id(btf, t2->type);
+ if (t3 && btf_is_func_proto(t3) && btf_vlen(t3) == 4) {
+ /* new format: disable old functions */
+ bpf_program__set_autoload(skel->progs.on_switch3, false);
+ return;
+ }
+
+old_format:
+ bpf_program__set_autoload(skel->progs.on_switch4, false);
+}
+
int off_cpu_prepare(struct evlist *evlist, struct target *target)
{
int err, fd, i;
@@ -114,6 +145,7 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target)
}

set_max_rlimit();
+ check_sched_switch_args();

err = off_cpu_bpf__load(skel);
if (err) {
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 27425fe361e2..e11e198af86f 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -121,22 +121,13 @@ static inline int can_record(struct task_struct *t, int state)
return 1;
}

-SEC("tp_btf/sched_switch")
-int on_switch(u64 *ctx)
+static int on_switch(u64 *ctx, struct task_struct *prev,
+ struct task_struct *next, int state)
{
__u64 ts;
- int state;
__u32 pid, stack_id;
- struct task_struct *prev, *next;
struct tstamp_data elem, *pelem;

- if (!enabled)
- return 0;
-
- prev = (struct task_struct *)ctx[1];
- next = (struct task_struct *)ctx[2];
- state = get_task_state(prev);
-
ts = bpf_ktime_get_ns();

if (!can_record(prev, state))
@@ -178,4 +169,46 @@ int on_switch(u64 *ctx)
return 0;
}

+SEC("tp_btf/sched_switch")
+int on_switch3(u64 *ctx)
+{
+ struct task_struct *prev, *next;
+ int state;
+
+ if (!enabled)
+ return 0;
+
+ /*
+ * TP_PROTO(bool preempt, struct task_struct *prev,
+ * struct task_struct *next)
+ */
+ prev = (struct task_struct *)ctx[1];
+ next = (struct task_struct *)ctx[2];
+
+ state = get_task_state(prev);
+
+ return on_switch(ctx, prev, next, state);
+}
+
+SEC("tp_btf/sched_switch")
+int on_switch4(u64 *ctx)
+{
+ struct task_struct *prev, *next;
+ int prev_state;
+
+ if (!enabled)
+ return 0;
+
+ /*
+ * TP_PROTO(bool preempt, int prev_state,
+ * struct task_struct *prev,
+ * struct task_struct *next)
+ */
+ prev = (struct task_struct *)ctx[2];
+ next = (struct task_struct *)ctx[3];
+ prev_state = (int)ctx[1];
+
+ return on_switch(ctx, prev, next, prev_state);
+}
+
char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-22 20:49:17

by Namhyung Kim

Subject: [PATCH 1/4] perf report: Do not extend sample type of bpf-output event

Currently evsel__new_idx() sets more sample_type bits when it finds a
BPF-output event. But it should honor what's recorded in the perf
data file rather than blindly setting the bits. Otherwise it could
lead to a parse error when the data was recorded with a modified
sample_type.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/evsel.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 2a1729e7aee4..5f947adc16cb 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -269,8 +269,8 @@ struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
return NULL;
evsel__init(evsel, attr, idx);

- if (evsel__is_bpf_output(evsel)) {
- evsel->core.attr.sample_type |= (PERF_SAMPLE_RAW | PERF_SAMPLE_TIME |
+ if (evsel__is_bpf_output(evsel) && !attr->sample_type) {
+ evsel->core.attr.sample_type = (PERF_SAMPLE_RAW | PERF_SAMPLE_TIME |
PERF_SAMPLE_CPU | PERF_SAMPLE_PERIOD),
evsel->core.attr.sample_period = 1;
}
--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-22 21:04:31

by Namhyung Kim

Subject: [PATCH 3/4] perf record: Implement basic filtering for off-cpu

It should honor the cpu and task filtering given with the -a, -C or
-p, -t options.

Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-record.c | 2 +-
tools/perf/util/bpf_off_cpu.c | 78 +++++++++++++++++++++++---
tools/perf/util/bpf_skel/off_cpu.bpf.c | 52 +++++++++++++++--
tools/perf/util/off_cpu.h | 6 +-
4 files changed, 123 insertions(+), 15 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3d24d528ba8e..592384e058c3 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)

static int record__config_off_cpu(struct record *rec)
{
- return off_cpu_prepare(rec->evlist);
+ return off_cpu_prepare(rec->evlist, &rec->opts.target);
}

static bool record__kcore_readable(struct machine *machine)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 1f87d2a9b86d..89f36229041d 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -6,6 +6,9 @@
#include "util/off_cpu.h"
#include "util/perf-hooks.h"
#include "util/session.h"
+#include "util/target.h"
+#include "util/cpumap.h"
+#include "util/thread_map.h"
#include <bpf/bpf.h>

#include "bpf_skel/off_cpu.skel.h"
@@ -57,8 +60,23 @@ static int off_cpu_config(struct evlist *evlist)
return 0;
}

-static void off_cpu_start(void *arg __maybe_unused)
+static void off_cpu_start(void *arg)
{
+ struct evlist *evlist = arg;
+
+ /* update task filter for the given workload */
+ if (!skel->bss->has_cpu && !skel->bss->has_task &&
+ perf_thread_map__pid(evlist->core.threads, 0) != -1) {
+ int fd;
+ u32 pid;
+ u8 val = 1;
+
+ skel->bss->has_task = 1;
+ fd = bpf_map__fd(skel->maps.task_filter);
+ pid = perf_thread_map__pid(evlist->core.threads, 0);
+ bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
+ }
+
skel->bss->enabled = 1;
}

@@ -68,31 +86,75 @@ static void off_cpu_finish(void *arg __maybe_unused)
off_cpu_bpf__destroy(skel);
}

-int off_cpu_prepare(struct evlist *evlist)
+int off_cpu_prepare(struct evlist *evlist, struct target *target)
{
- int err;
+ int err, fd, i;
+ int ncpus = 1, ntasks = 1;

if (off_cpu_config(evlist) < 0) {
pr_err("Failed to config off-cpu BPF event\n");
return -1;
}

- set_max_rlimit();
-
- skel = off_cpu_bpf__open_and_load();
+ skel = off_cpu_bpf__open();
if (!skel) {
pr_err("Failed to open off-cpu skeleton\n");
return -1;
}

+ /* don't need to set cpu filter for system-wide mode */
+ if (target->cpu_list) {
+ ncpus = perf_cpu_map__nr(evlist->core.user_requested_cpus);
+ bpf_map__set_max_entries(skel->maps.cpu_filter, ncpus);
+ }
+
+ if (target__has_task(target)) {
+ ntasks = perf_thread_map__nr(evlist->core.threads);
+ bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
+ }
+
+ set_max_rlimit();
+
+ err = off_cpu_bpf__load(skel);
+ if (err) {
+ pr_err("Failed to load off-cpu skeleton\n");
+ goto out;
+ }
+
+ if (target->cpu_list) {
+ u32 cpu;
+ u8 val = 1;
+
+ skel->bss->has_cpu = 1;
+ fd = bpf_map__fd(skel->maps.cpu_filter);
+
+ for (i = 0; i < ncpus; i++) {
+ cpu = perf_cpu_map__cpu(evlist->core.user_requested_cpus, i).cpu;
+ bpf_map_update_elem(fd, &cpu, &val, BPF_ANY);
+ }
+ }
+
+ if (target__has_task(target)) {
+ u32 pid;
+ u8 val = 1;
+
+ skel->bss->has_task = 1;
+ fd = bpf_map__fd(skel->maps.task_filter);
+
+ for (i = 0; i < ntasks; i++) {
+ pid = perf_thread_map__pid(evlist->core.threads, i);
+ bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
+ }
+ }
+
err = off_cpu_bpf__attach(skel);
if (err) {
pr_err("Failed to attach off-cpu skeleton\n");
goto out;
}

- if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
- perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
+ if (perf_hooks__set_hook("record_start", off_cpu_start, evlist) ||
+ perf_hooks__set_hook("record_end", off_cpu_finish, evlist)) {
pr_err("Failed to attach off-cpu skeleton\n");
goto out;
}
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 2bc6f7cc59ea..27425fe361e2 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -49,12 +49,28 @@ struct {
__uint(max_entries, MAX_ENTRIES);
} off_cpu SEC(".maps");

+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(__u8));
+ __uint(max_entries, 1);
+} cpu_filter SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(__u8));
+ __uint(max_entries, 1);
+} task_filter SEC(".maps");
+
/* old kernel task_struct definition */
struct task_struct___old {
long state;
} __attribute__((preserve_access_index));

int enabled = 0;
+int has_cpu = 0;
+int has_task = 0;

/*
* recently task_struct->state renamed to __state so it made an incompatible
@@ -74,6 +90,37 @@ static inline int get_task_state(struct task_struct *t)
return BPF_CORE_READ(t_old, state);
}

+static inline int can_record(struct task_struct *t, int state)
+{
+ if (has_cpu) {
+ __u32 cpu = bpf_get_smp_processor_id();
+ __u8 *ok;
+
+ ok = bpf_map_lookup_elem(&cpu_filter, &cpu);
+ if (!ok)
+ return 0;
+ }
+
+ if (has_task) {
+ __u8 *ok;
+ __u32 pid = t->pid;
+
+ ok = bpf_map_lookup_elem(&task_filter, &pid);
+ if (!ok)
+ return 0;
+ }
+
+ /* kernel threads don't have user stack */
+ if (t->flags & PF_KTHREAD)
+ return 0;
+
+ if (state != TASK_INTERRUPTIBLE &&
+ state != TASK_UNINTERRUPTIBLE)
+ return 0;
+
+ return 1;
+}
+
SEC("tp_btf/sched_switch")
int on_switch(u64 *ctx)
{
@@ -92,10 +139,7 @@ int on_switch(u64 *ctx)

ts = bpf_ktime_get_ns();

- if (prev->flags & PF_KTHREAD)
- goto next;
- if (state != TASK_INTERRUPTIBLE &&
- state != TASK_UNINTERRUPTIBLE)
+ if (!can_record(prev, state))
goto next;

stack_id = bpf_get_stackid(ctx, &stacks,
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index d5e063083bae..9b018d300b6e 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -2,13 +2,15 @@
#define PERF_UTIL_OFF_CPU_H

struct evlist;
+struct target;
struct perf_session;

#ifdef HAVE_BPF_SKEL
-int off_cpu_prepare(struct evlist *evlist);
+int off_cpu_prepare(struct evlist *evlist, struct target *target);
int off_cpu_write(struct perf_session *session);
#else
-static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
+static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused,
+ struct target *target __maybe_unused)
{
return -1;
}
--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-27 10:43:18

by Andrii Nakryiko

Subject: Re: [PATCH 4/4] perf record: Handle argument change in sched_switch

On Fri, Apr 22, 2022 at 3:49 PM Namhyung Kim <[email protected]> wrote:
>
> Recently sched_switch tracepoint added a new argument for prev_state,
> but it's hard to handle the change in a BPF program. Instead, we can
> check the function prototype in BTF before loading the program.
>
> Thus I make two copies of the tracepoint handler and select one based
> on the BTF info.
>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> tools/perf/util/bpf_off_cpu.c | 32 +++++++++++++++
> tools/perf/util/bpf_skel/off_cpu.bpf.c | 55 ++++++++++++++++++++------
> 2 files changed, 76 insertions(+), 11 deletions(-)
>

[...]

>
> +SEC("tp_btf/sched_switch")
> +int on_switch3(u64 *ctx)
> +{
> + struct task_struct *prev, *next;
> + int state;
> +
> + if (!enabled)
> + return 0;
> +
> + /*
> + * TP_PROTO(bool preempt, struct task_struct *prev,
> + * struct task_struct *next)
> + */
> + prev = (struct task_struct *)ctx[1];
> + next = (struct task_struct *)ctx[2];


You don't have to have two BPF programs for this; you can use a
read-only variable to make this choice.

On BPF side

const volatile bool has_prev_state = false;

...

if (has_prev_state) {
prev = (struct task_struct *)ctx[2];
next = (struct task_struct *)ctx[3];
} else {
prev = (struct task_struct *)ctx[1];
next = (struct task_struct *)ctx[2];
}


And from the user-space side you do your detection and, before the
skeleton is loaded, set:

skel->rodata->has_prev_state = <whatever you detected>

But I'm still hoping that this prev_state argument can be moved to the
end ([0]) to make all this unnecessary.

[0] https://lore.kernel.org/lkml/[email protected]/
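
For completeness, gluing this together with the BTF check from this
patch could look like the sketch below; has_prev_state_arg() stands
in for the btf__find_by_name_kind() walk in check_sched_switch_args():

    skel = off_cpu_bpf__open();
    if (!skel)
        return -1;

    /* must be done before load so the verifier sees a frozen value */
    skel->rodata->has_prev_state =
        has_prev_state_arg(bpf_object__btf(skel->obj));

    err = off_cpu_bpf__load(skel);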


> +
> + state = get_task_state(prev);
> +
> + return on_switch(ctx, prev, next, state);
> +}
> +

[...]

2022-04-27 18:44:56

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 4/4] perf record: Handle argument change in sched_switch

Hello,

On Tue, Apr 26, 2022 at 4:55 PM Andrii Nakryiko
<[email protected]> wrote:
>
> On Fri, Apr 22, 2022 at 3:49 PM Namhyung Kim <[email protected]> wrote:
> >
> > Recently sched_switch tracepoint added a new argument for prev_state,
> > but it's hard to handle the change in a BPF program. Instead, we can
> > check the function prototype in BTF before loading the program.
> >
> > Thus I make two copies of the tracepoint handler and select one based
> > on the BTF info.
> >
> > Signed-off-by: Namhyung Kim <[email protected]>
> > ---
> > tools/perf/util/bpf_off_cpu.c | 32 +++++++++++++++
> > tools/perf/util/bpf_skel/off_cpu.bpf.c | 55 ++++++++++++++++++++------
> > 2 files changed, 76 insertions(+), 11 deletions(-)
> >
>
> [...]
>
> >
> > +SEC("tp_btf/sched_switch")
> > +int on_switch3(u64 *ctx)
> > +{
> > + struct task_struct *prev, *next;
> > + int state;
> > +
> > + if (!enabled)
> > + return 0;
> > +
> > + /*
> > + * TP_PROTO(bool preempt, struct task_struct *prev,
> > + * struct task_struct *next)
> > + */
> > + prev = (struct task_struct *)ctx[1];
> > + next = (struct task_struct *)ctx[2];
>
>
> you don't have to have two BPF programs for this, you can use
> read-only variable to make this choice.
>
> On BPF side
>
> const volatile bool has_prev_state = false;
>
> ...
>
> if (has_prev_state) {
> prev = (struct task_struct *)ctx[2];
> next = (struct task_struct *)ctx[3];
> } else {
> prev = (struct task_struct *)ctx[1];
> next = (struct task_struct *)ctx[2];
> }
>
>
> And from user-space side you do your detection and before skeleton is loaded:
>
> skel->rodata->has_prev_state = <whatever you detected>

Nice, thanks for the tip!

Actually I tried something similar but it was with a variable (in
.bss), so the verifier in an old kernel rejected it due to an invalid
arg access.

I guess now the const makes the verifier ignore the branch as if
it's dead, while the compiler still generates the code, right?

>
> But I'm still hoping that this prev_state argument can be moved to the
> end ([0]) to make all this unnecessary.
>
> [0] https://lore.kernel.org/lkml/[email protected]/

Yeah, that would make life easier. :)

Thanks,
Namhyung

2022-04-27 20:31:12

by Andrii Nakryiko

Subject: Re: [PATCH 4/4] perf record: Handle argument change in sched_switch

On Wed, Apr 27, 2022 at 11:15 AM Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> On Tue, Apr 26, 2022 at 4:55 PM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Fri, Apr 22, 2022 at 3:49 PM Namhyung Kim <[email protected]> wrote:
> > >
> > > Recently sched_switch tracepoint added a new argument for prev_state,
> > > but it's hard to handle the change in a BPF program. Instead, we can
> > > check the function prototype in BTF before loading the program.
> > >
> > > Thus I make two copies of the tracepoint handler and select one based
> > > on the BTF info.
> > >
> > > Signed-off-by: Namhyung Kim <[email protected]>
> > > ---
> > > tools/perf/util/bpf_off_cpu.c | 32 +++++++++++++++
> > > tools/perf/util/bpf_skel/off_cpu.bpf.c | 55 ++++++++++++++++++++------
> > > 2 files changed, 76 insertions(+), 11 deletions(-)
> > >
> >
> > [...]
> >
> > >
> > > +SEC("tp_btf/sched_switch")
> > > +int on_switch3(u64 *ctx)
> > > +{
> > > + struct task_struct *prev, *next;
> > > + int state;
> > > +
> > > + if (!enabled)
> > > + return 0;
> > > +
> > > + /*
> > > + * TP_PROTO(bool preempt, struct task_struct *prev,
> > > + * struct task_struct *next)
> > > + */
> > > + prev = (struct task_struct *)ctx[1];
> > > + next = (struct task_struct *)ctx[2];
> >
> >
> > you don't have to have two BPF programs for this, you can use
> > read-only variable to make this choice.
> >
> > On BPF side
> >
> > const volatile bool has_prev_state = false;
> >
> > ...
> >
> > if (has_prev_state) {
> > prev = (struct task_struct *)ctx[2];
> > next = (struct task_struct *)ctx[3];
> > } else {
> > prev = (struct task_struct *)ctx[1];
> > next = (struct task_struct *)ctx[2];
> > }
> >
> >
> > And from user-space side you do your detection and before skeleton is loaded:
> >
> > skel->rodata->has_prev_state = <whatever you detected>
>
> Nice, thanks for the tip!
>
> Actually I tried something similar but it was with a variable (in bss)
> so the verifier in an old kernel rejected it due to invalid arg access.
>
> I guess now the const makes the verifier ignore the branch as if
> it's dead but the compiler still generates the code, right?


yes, exactly

>
> >
> > But I'm still hoping that this prev_state argument can be moved to the
> > end ([0]) to make all this unnecessary.
> >
> > [0] https://lore.kernel.org/lkml/[email protected]/
>
> Yeah, that would make life easier. :)
>
> Thanks,
> Namhyung

2022-05-01 19:32:12

by Namhyung Kim

Subject: Re: [PATCH 4/4] perf record: Handle argument change in sched_switch

On Wed, Apr 27, 2022 at 12:26 PM Andrii Nakryiko
<[email protected]> wrote:
>
> On Wed, Apr 27, 2022 at 11:15 AM Namhyung Kim <[email protected]> wrote:
> > Actually I tried something similar but it was with a variable (in bss)
> > so the verifier in an old kernel rejected it due to invalid arg access.
> >
> > I guess now the const makes the verifier ignore the branch as if
> > it's dead but the compiler still generates the code, right?
>
>
> yes, exactly

Then I'm curious how it'd work on newer kernels. The verifier sees
the false branch and detects a type mismatch for the second argument;
wouldn't it then reject the program?

Thanks,
Namhyung

2022-05-08 14:37:07

by Andrii Nakryiko

Subject: Re: [PATCH 4/4] perf record: Handle argument change in sched_switch

On Thu, Apr 28, 2022 at 4:58 PM Namhyung Kim <[email protected]> wrote:
>
> On Wed, Apr 27, 2022 at 12:26 PM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Wed, Apr 27, 2022 at 11:15 AM Namhyung Kim <[email protected]> wrote:
> > > Actually I tried something similar but it was with a variable (in bss)
> > > so the verifier in an old kernel rejected it due to invalid arg access.
> > >
> > > I guess now the const makes the verifier ignore the branch as if
> > > it's dead but the compiler still generates the code, right?
> >
> >
> > yes, exactly
>
> Then I'm curious how it'd work on newer kernels.
> The verifier sees the false branch and detects type mismatch
> for the second argument then it'd reject the program?
>

The verifier will know which branch is never taken, and will just
ignore and remove any code in it as dead code.
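
I.e. the difference is roughly this (sketch):

    /* plain (.bss) variable: the value is unknown at verification
     * time, both branches stay live, and the verifier still checks
     * the ctx[3] access that's invalid on old kernels -> rejected
     */
    bool has_prev_state = false;

    /* const volatile (.rodata): the skeleton freezes the value
     * before load, the verifier constant-folds the condition and
     * prunes the never-taken branch, so the mismatched access is
     * never even checked
     */
    const volatile bool has_prev_state = false;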


> Thanks,
> Namhyung