Changes v2 => v3:
1. Fix the use of static_call. (Peter)
2. Limit the use to perfmon version >= 2. (Peter)
3. Modify intel_pmu_snapshot_branch_stack() to use intel_pmu_disable_all
and intel_pmu_enable_all().
Changes v1 => v2:
1. Rename the helper as bpf_get_branch_snapshot;
2. Fix/simplify the use of static_call;
3. Instead of percpu variables, let intel_pmu_snapshot_branch_stack output
branch records to an output argument of type perf_branch_snapshot.
Branch stack can be very useful in understanding software events. For
example, when a long function, e.g. sys_perf_event_open, returns an errno,
it is not obvious why the function failed. Branch stack could provide very
helpful information in this type of scenario.
This set adds support to read the branch stack with a new BPF helper,
bpf_get_branch_snapshot(). Currently, this is only supported on Intel
systems. It is also possible to support the same feature on PowerPC.
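A minimal sketch of the intended usage from a BPF program (mirroring the
selftest in patch 3; the attach point and buffer size are illustrative):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

struct perf_branch_entry entries[32] = {};

SEC("fexit/bpf_fexit_loop_test1")
int BPF_PROG(snapshot_branches, int n, int ret)
{
	long nr;

	/* copy the most recent branch records into the buffer */
	nr = bpf_get_branch_snapshot(entries, sizeof(entries));
	if (nr < 0)
		return 0;	/* hardware/kernel does not support this */
	/* entries[0..nr-1] now hold from/to addresses to analyze */
	return 0;
}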
The hardware that records the branch stack is not stopped automatically on
software events. Therefore, it is necessary to stop it in software soon
after the event triggers; otherwise, the hardware buffers/registers will be
flushed. One of the key design considerations in this set is to minimize
the number of branch record entries between the event trigger and the point
where the hardware recorder is stopped. Based on this goal, the current
design differs from the discussion in the original RFC [1]:
1) Static call is used when supported, to save a function pointer
dereference;
2) intel_pmu_lbr_disable_all is used instead of perf_pmu_disable(),
because the latter uses about 10 entries before stopping LBR.
With the current code, on Intel CPUs, LBR is stopped about 6 branch
entries after the fexit event triggers:
ID: 0 from intel_pmu_lbr_disable_all.part.10+37 to intel_pmu_lbr_disable_all.part.10+72
ID: 1 from intel_pmu_lbr_disable_all.part.10+33 to intel_pmu_lbr_disable_all.part.10+37
ID: 2 from intel_pmu_snapshot_branch_stack+46 to intel_pmu_lbr_disable_all.part.10+0
ID: 3 from __bpf_prog_enter+38 to intel_pmu_snapshot_branch_stack+0
ID: 4 from __bpf_prog_enter+8 to __bpf_prog_enter+38
ID: 5 from __brk_limit+477020214 to __bpf_prog_enter+0
ID: 6 from bpf_fexit_loop_test1+22 to __brk_limit+477020195
ID: 7 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 8 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
...
[1] https://lore.kernel.org/bpf/[email protected]/
Song Liu (3):
perf: enable branch record for software events
bpf: introduce helper bpf_get_branch_snapshot
selftests/bpf: add test for bpf_get_branch_snapshot
arch/x86/events/intel/core.c | 24 +++-
include/linux/bpf.h | 2 +
include/linux/filter.h | 3 +-
include/linux/perf_event.h | 24 ++++
include/uapi/linux/bpf.h | 16 +++
kernel/bpf/trampoline.c | 13 +++
kernel/bpf/verifier.c | 12 ++
kernel/events/core.c | 3 +
kernel/trace/bpf_trace.c | 43 +++++++
net/bpf/test_run.c | 15 ++-
tools/include/uapi/linux/bpf.h | 16 +++
.../bpf/prog_tests/get_branch_snapshot.c | 106 ++++++++++++++++++
.../selftests/bpf/progs/get_branch_snapshot.c | 41 +++++++
tools/testing/selftests/bpf/trace_helpers.c | 30 +++++
tools/testing/selftests/bpf/trace_helpers.h | 5 +
15 files changed, 349 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
create mode 100644 tools/testing/selftests/bpf/progs/get_branch_snapshot.c
--
2.30.2
The typical way to access branch records (e.g. Intel LBR) is via hardware
perf_event. For CPUs with FREEZE_LBRS_ON_PMI support, the PMI can capture
reliable LBR data. On the other hand, LBR can also be useful in non-PMI
scenarios. For example, in a kretprobe or BPF fexit program, LBR can
provide a lot of information on what happened inside the function. Add an
API to make branch records available for software use.
Note that, when the software event triggers, it is necessary to stop the
branch record hardware asap. Therefore, static_call is used to eliminate
some branch instructions in this process.
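A caller would use the interface roughly as follows (a sketch, not part of
this patch; the preempt_disable() pairing reflects that the x86
implementation below reads per-CPU PMU state):

	struct perf_branch_snapshot snap = { .nr = 0 };

	preempt_disable();
	static_call(perf_snapshot_branch_stack)(&snap);
	preempt_enable();
	/* on supported CPUs, snap.nr entries are now in snap.entries[];
	 * the default implementation is a RET0 stub that leaves snap
	 * untouched, so snap.nr stays 0
	 */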
Signed-off-by: Song Liu <[email protected]>
---
arch/x86/events/intel/core.c | 24 ++++++++++++++++++++++--
include/linux/perf_event.h | 24 ++++++++++++++++++++++++
kernel/events/core.c | 3 +++
3 files changed, 49 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ac6fd2dabf6a2..d28d0e12c112c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2155,9 +2155,9 @@ static void __intel_pmu_disable_all(void)
static void intel_pmu_disable_all(void)
{
+ intel_pmu_lbr_disable_all();
__intel_pmu_disable_all();
intel_pmu_pebs_disable_all();
- intel_pmu_lbr_disable_all();
}
static void __intel_pmu_enable_all(int added, bool pmi)
@@ -2186,6 +2186,20 @@ static void intel_pmu_enable_all(int added)
__intel_pmu_enable_all(added, false);
}
+static int
+intel_pmu_snapshot_branch_stack(struct perf_branch_snapshot *br_snapshot)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ intel_pmu_disable_all();
+ intel_pmu_lbr_read();
+ memcpy(br_snapshot->entries, cpuc->lbr_entries,
+ sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
+ br_snapshot->nr = x86_pmu.lbr_nr;
+ intel_pmu_enable_all(0);
+ return 0;
+}
+
/*
* Workaround for:
* Intel Errata AAK100 (model 26)
@@ -6283,9 +6297,15 @@ __init int intel_pmu_init(void)
x86_pmu.lbr_nr = 0;
}
- if (x86_pmu.lbr_nr)
+ if (x86_pmu.lbr_nr) {
pr_cont("%d-deep LBR, ", x86_pmu.lbr_nr);
+ /* only support branch_stack snapshot for perfmon >= v2 */
+ if (x86_pmu.disable_all == intel_pmu_disable_all)
+ static_call_update(perf_snapshot_branch_stack,
+ intel_pmu_snapshot_branch_stack);
+ }
+
intel_pmu_check_extra_regs(x86_pmu.extra_regs);
/* Support full width counters using alternative MSR range */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fe156a8170aa3..1f42e91668024 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -57,6 +57,7 @@ struct perf_guest_info_callbacks {
#include <linux/cgroup.h>
#include <linux/refcount.h>
#include <linux/security.h>
+#include <linux/static_call.h>
#include <asm/local.h>
struct perf_callchain_entry {
@@ -1612,4 +1613,27 @@ extern void __weak arch_perf_update_userpage(struct perf_event *event,
extern __weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr);
#endif
+/*
+ * Snapshot branch stack on software events.
+ *
+ * Branch stack can be very useful in understanding software events. For
+ * example, when a long function, e.g. sys_perf_event_open, returns an
+ * errno, it is not obvious why the function failed. Branch stack could
+ * provide very helpful information in this type of scenarios.
+ *
+ * On software event, it is necessary to stop the hardware branch recorder
+ * fast. Otherwise, the hardware register/buffer will be flushed with
+ * entries after the triggering event. Therefore, static call is used to
+ * stop the hardware recorder.
+ */
+#define MAX_BRANCH_SNAPSHOT 32
+
+struct perf_branch_snapshot {
+ unsigned int nr;
+ struct perf_branch_entry entries[MAX_BRANCH_SNAPSHOT];
+};
+
+typedef int (perf_snapshot_branch_stack_t)(struct perf_branch_snapshot *);
+DECLARE_STATIC_CALL(perf_snapshot_branch_stack, perf_snapshot_branch_stack_t);
+
#endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 011cc5069b7ba..22807864e913b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13437,3 +13437,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
.threaded = true,
};
#endif /* CONFIG_CGROUP_PERF */
+
+DEFINE_STATIC_CALL_RET0(perf_snapshot_branch_stack,
+ perf_snapshot_branch_stack_t);
--
2.30.2
Introduce bpf_get_branch_snapshot(), which allows a tracing program to get
the branch trace from hardware (e.g. Intel LBR). To use the feature, the
user needs to create a perf_event with proper branch_record filtering
on each cpu, and then call bpf_get_branch_snapshot in the BPF program.
On Intel CPUs, the VLBR event (raw event 0x1b00) can be used for this.
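For reference, the per-CPU perf_event setup in user space looks roughly
like this (a sketch mirroring the selftest in patch 3; error handling
omitted):

	struct perf_event_attr attr = {
		.size			= sizeof(attr),
		.type			= PERF_TYPE_RAW,
		.config			= 0x1b00,	/* Intel VLBR event */
		.sample_type		= PERF_SAMPLE_BRANCH_STACK,
		.branch_sample_type	= PERF_SAMPLE_BRANCH_KERNEL |
					  PERF_SAMPLE_BRANCH_USER |
					  PERF_SAMPLE_BRANCH_ANY,
	};
	int fd;

	/* one event per cpu, attached to all tasks (pid == -1) */
	fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1,
		     PERF_FLAG_FD_CLOEXEC);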
Signed-off-by: Song Liu <[email protected]>
---
include/linux/bpf.h | 2 ++
include/linux/filter.h | 3 ++-
include/uapi/linux/bpf.h | 16 +++++++++++++
kernel/bpf/trampoline.c | 13 ++++++++++
kernel/bpf/verifier.c | 12 ++++++++++
kernel/trace/bpf_trace.c | 43 ++++++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 16 +++++++++++++
7 files changed, 104 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f4c16f19f83e3..1868434dc519a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2220,4 +2220,6 @@ int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
u32 **bin_buf, u32 num_args);
void bpf_bprintf_cleanup(void);
+DECLARE_PER_CPU(struct perf_branch_snapshot, bpf_perf_branch_snapshot);
+
#endif /* _LINUX_BPF_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7d248941ecea3..75aa541c8a134 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -575,7 +575,8 @@ struct bpf_prog {
has_callchain_buf:1, /* callchain buffer allocated? */
enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
- call_get_func_ip:1; /* Do we call get_func_ip() */
+ call_get_func_ip:1, /* Do we call get_func_ip() */
+ call_get_branch:1; /* Do we call get_branch_snapshot() */
enum bpf_prog_type type; /* Type of BPF program */
enum bpf_attach_type expected_attach_type; /* For some prog types */
u32 len; /* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 791f31dd0abee..1fd4e85d123da 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4877,6 +4877,21 @@ union bpf_attr {
* Get the struct pt_regs associated with **task**.
* Return
* A pointer to struct pt_regs.
+ *
+ * long bpf_get_branch_snapshot(void *entries, u32 size)
+ * Description
+ * Get branch trace from hardware engines like Intel LBR. The
+ * branch trace is taken soon after the trigger point of the
+ * BPF program, so it may contain some entries after the
+ * trigger point. The user needs to filter these entries
+ * accordingly.
+ *
+ * The data is stored as struct perf_branch_entry into output
+ * buffer *entries*. *size* is the size of *entries* in bytes.
+ *
+ * Return
+ * > 0, number of valid output entries.
+ * **-EOPNOTSUPP**, the hardware/kernel does not support this function
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -5055,6 +5070,7 @@ union bpf_attr {
FN(get_func_ip), \
FN(get_attach_cookie), \
FN(task_pt_regs), \
+ FN(get_branch_snapshot), \
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index fe1e857324e66..137293fcceed7 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -10,6 +10,7 @@
#include <linux/rcupdate_trace.h>
#include <linux/rcupdate_wait.h>
#include <linux/module.h>
+#include <linux/static_call.h>
/* dummy _ops. The verifier will operate on target program's ops. */
const struct bpf_verifier_ops bpf_extension_verifier_ops = {
@@ -564,6 +565,18 @@ static void notrace inc_misses_counter(struct bpf_prog *prog)
u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
__acquires(RCU)
{
+#ifdef CONFIG_PERF_EVENTS
+ /* Calling migrate_disable costs two entries in the LBR. To save
+ * some entries, we call perf_snapshot_branch_stack before
+ * migrate_disable to save some entries. This is OK because we
+ * care about the branch trace before entering the BPF program.
+ * If migrate happens exactly here, there isn't much we can do to
+ * preserve the data.
+ */
+ if (prog->call_get_branch)
+ static_call(perf_snapshot_branch_stack)(
+ this_cpu_ptr(&bpf_perf_branch_snapshot));
+#endif
rcu_read_lock();
migrate_disable();
if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 206c221453cfa..72e8b49da0bf9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6446,6 +6446,18 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
env->prog->call_get_func_ip = true;
}
+ if (func_id == BPF_FUNC_get_branch_snapshot) {
+ if (env->prog->aux->sleepable) {
+ verbose(env, "sleepable progs cannot call get_branch_snapshot\n");
+ return -ENOTSUPP;
+ }
+ if (!IS_ENABLED(CONFIG_PERF_EVENTS)) {
+ verbose(env, "func %s#%d not supported without CONFIG_PERF_EVENTS\n",
+ func_id_name(func_id), func_id);
+ return -ENOTSUPP;
+ }
+ env->prog->call_get_branch = true;
+ }
if (changes_data)
clear_all_pkt_pointers(env);
return 0;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 8e2eb950aa829..a01f26b7877e6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1017,6 +1017,33 @@ static const struct bpf_func_proto bpf_get_attach_cookie_proto_pe = {
.arg1_type = ARG_PTR_TO_CTX,
};
+BPF_CALL_2(bpf_get_branch_snapshot, void *, buf, u32, size)
+{
+#ifdef CONFIG_PERF_EVENTS
+ u32 max_size;
+
+ if (this_cpu_ptr(&bpf_perf_branch_snapshot)->nr == 0)
+ return -EOPNOTSUPP;
+
+ max_size = this_cpu_ptr(&bpf_perf_branch_snapshot)->nr *
+ sizeof(struct perf_branch_entry);
+ memcpy(buf, this_cpu_ptr(&bpf_perf_branch_snapshot)->entries,
+ min_t(u32, size, max_size));
+
+ return this_cpu_ptr(&bpf_perf_branch_snapshot)->nr;
+#else
+ return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
+ .func = bpf_get_branch_snapshot,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg2_type = ARG_CONST_SIZE_OR_ZERO,
+};
+
static const struct bpf_func_proto *
bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -1132,6 +1159,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_snprintf_proto;
case BPF_FUNC_get_func_ip:
return &bpf_get_func_ip_proto_tracing;
+ case BPF_FUNC_get_branch_snapshot:
+ return &bpf_get_branch_snapshot_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -1863,9 +1892,23 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
preempt_enable();
}
+DEFINE_PER_CPU(struct perf_branch_snapshot, bpf_perf_branch_snapshot);
+
static __always_inline
void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
{
+#ifdef CONFIG_PERF_EVENTS
+ /* Calling migrate_disable costs two entries in the LBR. To save
+ * some entries, we call perf_snapshot_branch_stack before
+ * migrate_disable to save some entries. This is OK because we
+ * care about the branch trace before entering the BPF program.
+ * If migrate happens exactly here, there isn't much we can do to
+ * preserve the data.
+ */
+ if (prog->call_get_branch)
+ static_call(perf_snapshot_branch_stack)(
+ this_cpu_ptr(&bpf_perf_branch_snapshot));
+#endif
cant_sleep();
rcu_read_lock();
(void) bpf_prog_run(prog, args);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 791f31dd0abee..1fd4e85d123da 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4877,6 +4877,21 @@ union bpf_attr {
* Get the struct pt_regs associated with **task**.
* Return
* A pointer to struct pt_regs.
+ *
+ * long bpf_get_branch_snapshot(void *entries, u32 size)
+ * Description
+ * Get branch trace from hardware engines like Intel LBR. The
+ * branch trace is taken soon after the trigger point of the
+ * BPF program, so it may contain some entries after the
+ * trigger point. The user need to filter these entries
+ * accordingly.
+ *
+ * The data is stored as struct perf_branch_entry into output
+ * buffer *entries*. *size* is the size of *entries* in bytes.
+ *
+ * Return
+ * > 0, number of valid output entries.
+ * **-EOPNOTSUPP**, the hardware/kernel does not support this function
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -5055,6 +5070,7 @@ union bpf_attr {
FN(get_func_ip), \
FN(get_attach_cookie), \
FN(task_pt_regs), \
+ FN(get_branch_snapshot), \
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
--
2.30.2
This test uses bpf_get_branch_snapshot from a fexit program. The test uses
a target kernel function (bpf_fexit_loop_test1) and compares the records
against kallsyms. If there aren't enough records matching kallsyms, the
test fails.
Signed-off-by: Song Liu <[email protected]>
---
net/bpf/test_run.c | 15 ++-
.../bpf/prog_tests/get_branch_snapshot.c | 106 ++++++++++++++++++
.../selftests/bpf/progs/get_branch_snapshot.c | 41 +++++++
tools/testing/selftests/bpf/trace_helpers.c | 30 +++++
tools/testing/selftests/bpf/trace_helpers.h | 5 +
5 files changed, 196 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
create mode 100644 tools/testing/selftests/bpf/progs/get_branch_snapshot.c
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2eb0e55ef54d2..6cc179a532c9c 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -231,6 +231,18 @@ struct sock * noinline bpf_kfunc_call_test3(struct sock *sk)
return sk;
}
+noinline int bpf_fexit_loop_test1(int n)
+{
+ int i, sum = 0;
+
+ /* the primary goal of this test is to test LBR. Create a lot of
+ * branches in the function, so we can catch it easily.
+ */
+ for (i = 0; i < n; i++)
+ sum += i;
+ return sum;
+}
+
__diag_pop();
ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
@@ -293,7 +305,8 @@ int bpf_prog_test_run_tracing(struct bpf_prog *prog,
bpf_fentry_test5(11, (void *)12, 13, 14, 15) != 65 ||
bpf_fentry_test6(16, (void *)17, 18, 19, (void *)20, 21) != 111 ||
bpf_fentry_test7((struct bpf_fentry_test_t *)0) != 0 ||
- bpf_fentry_test8(&arg) != 0)
+ bpf_fentry_test8(&arg) != 0 ||
+ bpf_fexit_loop_test1(101) != 5050)
goto out;
break;
case BPF_MODIFY_RETURN:
diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
new file mode 100644
index 0000000000000..9bb16826418fb
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "get_branch_snapshot.skel.h"
+
+static int *pfd_array;
+static int cpu_cnt;
+
+static int create_perf_events(void)
+{
+ struct perf_event_attr attr = {0};
+ int cpu;
+
+ /* create perf event */
+ attr.size = sizeof(attr);
+ attr.type = PERF_TYPE_RAW;
+ attr.config = 0x1b00;
+ attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
+ attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
+ PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
+
+ cpu_cnt = libbpf_num_possible_cpus();
+ pfd_array = malloc(sizeof(int) * cpu_cnt);
+ if (!pfd_array) {
+ cpu_cnt = 0;
+ return 1;
+ }
+
+ for (cpu = 0; cpu < libbpf_num_possible_cpus(); cpu++) {
+ pfd_array[cpu] = syscall(__NR_perf_event_open, &attr,
+ -1, cpu, -1, PERF_FLAG_FD_CLOEXEC);
+ if (pfd_array[cpu] < 0)
+ break;
+ }
+
+ return cpu == 0;
+}
+
+static void close_perf_events(void)
+{
+ int cpu = 0;
+ int fd;
+
+ while (cpu < cpu_cnt) {
+ fd = pfd_array[cpu++];
+ if (fd < 0)
+ break;
+ close(fd);
+ }
+ free(pfd_array);
+}
+
+void test_get_branch_snapshot(void)
+{
+ struct get_branch_snapshot *skel;
+ int err, prog_fd;
+ __u32 retval;
+
+ if (create_perf_events()) {
+ test__skip(); /* system doesn't support LBR */
+ goto cleanup;
+ }
+
+ skel = get_branch_snapshot__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "get_branch_snapshot__open_and_load"))
+ goto cleanup;
+
+ err = kallsyms_find("bpf_fexit_loop_test1", &skel->bss->address_low);
+ if (!ASSERT_OK(err, "kallsyms_find"))
+ goto cleanup;
+
+ err = kallsyms_find_next("bpf_fexit_loop_test1", &skel->bss->address_high);
+ if (!ASSERT_OK(err, "kallsyms_find_next"))
+ goto cleanup;
+
+ err = get_branch_snapshot__attach(skel);
+ if (!ASSERT_OK(err, "get_branch_snapshot__attach"))
+ goto cleanup;
+
+ prog_fd = bpf_program__fd(skel->progs.test1);
+ err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
+ NULL, 0, &retval, NULL);
+
+ if (!ASSERT_OK(err, "bpf_prog_test_run"))
+ goto cleanup;
+
+ if (skel->bss->total_entries < 16) {
+ /* too few entries for the hit/waste test */
+ test__skip();
+ goto cleanup;
+ }
+
+ ASSERT_GT(skel->bss->test1_hits, 5, "find_test1_in_lbr");
+
+ /* Given we stop LBR in software, we will waste a few entries.
+ * But we should try to waste as few as possibleentries. We are at
+ * about 7 on x86_64 systems.
+ * Add a check for < 10 so that we get heads-up when something
+ * changes and wastes too many entries.
+ */
+ ASSERT_LT(skel->bss->wasted_entries, 10, "check_wasted_entries");
+
+cleanup:
+ get_branch_snapshot__destroy(skel);
+ close_perf_events();
+}
diff --git a/tools/testing/selftests/bpf/progs/get_branch_snapshot.c b/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
new file mode 100644
index 0000000000000..9c944e7480b95
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+__u64 test1_hits = 0;
+__u64 address_low = 0;
+__u64 address_high = 0;
+int wasted_entries = 0;
+long total_entries = 0;
+
+#define MAX_LBR_ENTRIES 32
+
+struct perf_branch_entry entries[MAX_LBR_ENTRIES] = {};
+
+
+static inline bool in_range(__u64 val)
+{
+ return (val >= address_low) && (val < address_high);
+}
+
+SEC("fexit/bpf_fexit_loop_test1")
+int BPF_PROG(test1, int n, int ret)
+{
+ long i;
+
+ total_entries = bpf_get_branch_snapshot(entries, sizeof(entries));
+
+ for (i = 0; i < MAX_LBR_ENTRIES; i++) {
+ if (i >= total_entries)
+ break;
+ if (in_range(entries[i].from) && in_range(entries[i].to))
+ test1_hits++;
+ else if (!test1_hits)
+ wasted_entries++;
+ }
+ return 0;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index e7a19b04d4eaf..2926a3b626821 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -117,6 +117,36 @@ int kallsyms_find(const char *sym, unsigned long long *addr)
return err;
}
+/* find the address of the next symbol, this can be used to determine the
+ * end of a function
+ */
+int kallsyms_find_next(const char *sym, unsigned long long *addr)
+{
+ char type, name[500];
+ unsigned long long value;
+ bool found = false;
+ int err = 0;
+ FILE *f;
+
+ f = fopen("/proc/kallsyms", "r");
+ if (!f)
+ return -EINVAL;
+
+ while (fscanf(f, "%llx %c %499s%*[^\n]\n", &value, &type, name) > 0) {
+ if (found) {
+ *addr = value;
+ goto out;
+ }
+ if (strcmp(name, sym) == 0)
+ found = true;
+ }
+ err = -ENOENT;
+
+out:
+ fclose(f);
+ return err;
+}
+
void read_trace_pipe(void)
{
int trace_fd;
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index d907b445524d5..bc8ed86105d94 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -16,6 +16,11 @@ long ksym_get_addr(const char *name);
/* open kallsyms and find addresses on the fly, faster than load + search. */
int kallsyms_find(const char *sym, unsigned long long *addr);
+/* find the address of the next symbol, this can be used to determine the
+ * end of a function
+ */
+int kallsyms_find_next(const char *sym, unsigned long long *addr);
+
void read_trace_pipe(void);
ssize_t get_uprobe_offset(const void *addr, ssize_t base);
--
2.30.2
On Mon, Aug 30, 2021 at 2:42 PM Song Liu <[email protected]> wrote:
>
> The typical way to access branch record (e.g. Intel LBR) is via hardware
> perf_event. For CPUs with FREEZE_LBRS_ON_PMI support, PMI could capture
> reliable LBR. On the other hand, LBR could also be useful in non-PMI
> scenario. For example, in kretprobe or bpf fexit program, LBR could
> provide a lot of information on what happened with the function. Add API
> to use branch record for software use.
>
> Note that, when the software event triggers, it is necessary to stop the
> branch record hardware asap. Therefore, static_call is used to remove some
> branch instructions in this process.
>
> Signed-off-by: Song Liu <[email protected]>
> ---
> arch/x86/events/intel/core.c | 24 ++++++++++++++++++++++--
> include/linux/perf_event.h | 24 ++++++++++++++++++++++++
> kernel/events/core.c | 3 +++
> 3 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index ac6fd2dabf6a2..d28d0e12c112c 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2155,9 +2155,9 @@ static void __intel_pmu_disable_all(void)
>
> static void intel_pmu_disable_all(void)
> {
> + intel_pmu_lbr_disable_all();
> __intel_pmu_disable_all();
> intel_pmu_pebs_disable_all();
> - intel_pmu_lbr_disable_all();
> }
>
> static void __intel_pmu_enable_all(int added, bool pmi)
> @@ -2186,6 +2186,20 @@ static void intel_pmu_enable_all(int added)
> __intel_pmu_enable_all(added, false);
> }
>
> +static int
> +intel_pmu_snapshot_branch_stack(struct perf_branch_snapshot *br_snapshot)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> + intel_pmu_disable_all();
> + intel_pmu_lbr_read();
> + memcpy(br_snapshot->entries, cpuc->lbr_entries,
> + sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
> + br_snapshot->nr = x86_pmu.lbr_nr;
> + intel_pmu_enable_all(0);
> + return 0;
> +}
> +
> /*
> * Workaround for:
> * Intel Errata AAK100 (model 26)
> @@ -6283,9 +6297,15 @@ __init int intel_pmu_init(void)
> x86_pmu.lbr_nr = 0;
> }
>
> - if (x86_pmu.lbr_nr)
> + if (x86_pmu.lbr_nr) {
> pr_cont("%d-deep LBR, ", x86_pmu.lbr_nr);
>
> + /* only support branch_stack snapshot for perfmon >= v2 */
> + if (x86_pmu.disable_all == intel_pmu_disable_all)
> + static_call_update(perf_snapshot_branch_stack,
> + intel_pmu_snapshot_branch_stack);
> + }
> +
> intel_pmu_check_extra_regs(x86_pmu.extra_regs);
>
> /* Support full width counters using alternative MSR range */
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index fe156a8170aa3..1f42e91668024 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -57,6 +57,7 @@ struct perf_guest_info_callbacks {
> #include <linux/cgroup.h>
> #include <linux/refcount.h>
> #include <linux/security.h>
> +#include <linux/static_call.h>
> #include <asm/local.h>
>
> struct perf_callchain_entry {
> @@ -1612,4 +1613,27 @@ extern void __weak arch_perf_update_userpage(struct perf_event *event,
> extern __weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr);
> #endif
>
> +/*
> + * Snapshot branch stack on software events.
> + *
> + * Branch stack can be very useful in understanding software events. For
> + * example, when a long function, e.g. sys_perf_event_open, returns an
> + * errno, it is not obvious why the function failed. Branch stack could
> + * provide very helpful information in this type of scenarios.
> + *
> + * On software event, it is necessary to stop the hardware branch recorder
> + * fast. Otherwise, the hardware register/buffer will be flushed with
> + * entries after the triggering event. Therefore, static call is used to
> + * stop the hardware recorder.
> + */
> +#define MAX_BRANCH_SNAPSHOT 32
Can you please make it an enum instead? It will make this available as
a constant in vmlinux.h nicely, without users having to #define it
every time.
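E.g. (illustrative sketch):

	enum {
		MAX_BRANCH_SNAPSHOT = 32
	};

Named enumerators end up in BTF, so the constant would show up in the
bpftool-generated vmlinux.h without a separate #define.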
> +
> +struct perf_branch_snapshot {
> + unsigned int nr;
> + struct perf_branch_entry entries[MAX_BRANCH_SNAPSHOT];
> +};
> +
> +typedef int (perf_snapshot_branch_stack_t)(struct perf_branch_snapshot *);
> +DECLARE_STATIC_CALL(perf_snapshot_branch_stack, perf_snapshot_branch_stack_t);
> +
> #endif /* _LINUX_PERF_EVENT_H */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 011cc5069b7ba..22807864e913b 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -13437,3 +13437,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
> .threaded = true,
> };
> #endif /* CONFIG_CGROUP_PERF */
> +
> +DEFINE_STATIC_CALL_RET0(perf_snapshot_branch_stack,
> + perf_snapshot_branch_stack_t);
> --
> 2.30.2
>
On Mon, Aug 30, 2021 at 2:42 PM Song Liu <[email protected]> wrote:
>
> Introduce bpf_get_branch_snapshot(), which allows tracing pogram to get
> branch trace from hardware (e.g. Intel LBR). To use the feature, the
> user need to create perf_event with proper branch_record filtering
> on each cpu, and then calls bpf_get_branch_snapshot in the bpf function.
> On Intel CPUs, VLBR event (raw event 0x1b00) can be use for this.
>
> Signed-off-by: Song Liu <[email protected]>
> ---
> include/linux/bpf.h | 2 ++
> include/linux/filter.h | 3 ++-
> include/uapi/linux/bpf.h | 16 +++++++++++++
> kernel/bpf/trampoline.c | 13 ++++++++++
> kernel/bpf/verifier.c | 12 ++++++++++
> kernel/trace/bpf_trace.c | 43 ++++++++++++++++++++++++++++++++++
> tools/include/uapi/linux/bpf.h | 16 +++++++++++++
> 7 files changed, 104 insertions(+), 1 deletion(-)
>
[...]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 206c221453cfa..72e8b49da0bf9 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -6446,6 +6446,18 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> env->prog->call_get_func_ip = true;
> }
>
> + if (func_id == BPF_FUNC_get_branch_snapshot) {
> + if (env->prog->aux->sleepable) {
> + verbose(env, "sleepable progs cannot call get_branch_snapshot\n");
> + return -ENOTSUPP;
> + }
> + if (!IS_ENABLED(CONFIG_PERF_EVENTS)) {
> + verbose(env, "func %s#%d not supported without CONFIG_PERF_EVENTS\n",
> + func_id_name(func_id), func_id);
> + return -ENOTSUPP;
> + }
> + env->prog->call_get_branch = true;
> + }
> if (changes_data)
> clear_all_pkt_pointers(env);
> return 0;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 8e2eb950aa829..a01f26b7877e6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1017,6 +1017,33 @@ static const struct bpf_func_proto bpf_get_attach_cookie_proto_pe = {
> .arg1_type = ARG_PTR_TO_CTX,
> };
>
> +BPF_CALL_2(bpf_get_branch_snapshot, void *, buf, u32, size)
I bet we'll need u64 flags over time, let's add it right now. It's
similar to bpf_read_branch_records().
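I.e., the proto would gain a third argument (sketch):

	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
	.arg3_type	= ARG_ANYTHING,		/* flags, require 0 for now */

with the helper switched to BPF_CALL_3(bpf_get_branch_snapshot, void *,
buf, u32, size, u64, flags) and an early
'if (unlikely(flags)) return -EINVAL;' check in the body.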
> +{
> +#ifdef CONFIG_PERF_EVENTS
> + u32 max_size;
> +
> + if (this_cpu_ptr(&bpf_perf_branch_snapshot)->nr == 0)
> + return -EOPNOTSUPP;
> +
> + max_size = this_cpu_ptr(&bpf_perf_branch_snapshot)->nr *
> + sizeof(struct perf_branch_entry);
> + memcpy(buf, this_cpu_ptr(&bpf_perf_branch_snapshot)->entries,
> + min_t(u32, size, max_size));
> +
Check bpf_read_branch_records() implementation and its argument
validation logic. Let's keep them consistent (e.g., it enforces that
size is a multiple of sizeof(struct perf_branch_entry)). Another
difference is that bpf_read_branch_records() returns number of bytes
filled, not number of records. That's consistent with accepting size
as number of bytes. Let's stick to this convention then, so bytes
everywhere.
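Roughly (an illustrative sketch of that convention):

	struct perf_branch_snapshot *br = this_cpu_ptr(&bpf_perf_branch_snapshot);
	u32 br_entry_size = sizeof(struct perf_branch_entry);
	u32 to_copy;

	if (unlikely(size % br_entry_size))
		return -EINVAL;
	if (unlikely(!br->nr))
		return -EOPNOTSUPP;
	to_copy = min_t(u32, size, br->nr * br_entry_size);
	memcpy(buf, br->entries, to_copy);
	return to_copy;	/* bytes copied, like bpf_read_branch_records() */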
> + return this_cpu_ptr(&bpf_perf_branch_snapshot)->nr;
> +#else
> + return -EOPNOTSUPP;
> +#endif
> +}
> +
> +static const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
> + .func = bpf_get_branch_snapshot,
> + .gpl_only = true,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_UNINIT_MEM,
> + .arg2_type = ARG_CONST_SIZE_OR_ZERO,
> +};
> +
[...]
On Mon, Aug 30, 2021 at 2:44 PM Song Liu <[email protected]> wrote:
>
> This test uses bpf_get_branch_snapshot from a fexit program. The test uses
> a target kernel function (bpf_fexit_loop_test1) and compares the record
> against kallsyms. If there isn't enough record matching kallsyms, the
> test fails.
>
> Signed-off-by: Song Liu <[email protected]>
> ---
> net/bpf/test_run.c | 15 ++-
> .../bpf/prog_tests/get_branch_snapshot.c | 106 ++++++++++++++++++
> .../selftests/bpf/progs/get_branch_snapshot.c | 41 +++++++
> tools/testing/selftests/bpf/trace_helpers.c | 30 +++++
> tools/testing/selftests/bpf/trace_helpers.h | 5 +
> 5 files changed, 196 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> create mode 100644 tools/testing/selftests/bpf/progs/get_branch_snapshot.c
>
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index 2eb0e55ef54d2..6cc179a532c9c 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -231,6 +231,18 @@ struct sock * noinline bpf_kfunc_call_test3(struct sock *sk)
> return sk;
> }
>
> +noinline int bpf_fexit_loop_test1(int n)
We have bpf_testmod as part of selftests now, which allows us to add
whatever in-kernel functionality we need, without polluting the kernel
itself. fentry/fexit attach to kernel functions works as well, so do
you think we can use that here for testing?
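E.g., something along these lines in
tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c could serve as the
attach target (a sketch; the function name is hypothetical):

	noinline int bpf_testmod_loop_test(int n)
	{
		int i, sum = 0;

		/* create a lot of branches for LBR to record */
		for (i = 0; i < n; i++)
			sum += i;
		return sum;
	}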
> +{
> + int i, sum = 0;
> +
> + /* the primary goal of this test is to test LBR. Create a lot of
> + * branches in the function, so we can catch it easily.
> + */
> + for (i = 0; i < n; i++)
> + sum += i;
> + return sum;
> +}
> +
> __diag_pop();
>
> ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
> @@ -293,7 +305,8 @@ int bpf_prog_test_run_tracing(struct bpf_prog *prog,
> bpf_fentry_test5(11, (void *)12, 13, 14, 15) != 65 ||
> bpf_fentry_test6(16, (void *)17, 18, 19, (void *)20, 21) != 111 ||
> bpf_fentry_test7((struct bpf_fentry_test_t *)0) != 0 ||
> - bpf_fentry_test8(&arg) != 0)
> + bpf_fentry_test8(&arg) != 0 ||
> + bpf_fexit_loop_test1(101) != 5050)
> goto out;
> break;
> case BPF_MODIFY_RETURN:
> diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> new file mode 100644
> index 0000000000000..9bb16826418fb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> @@ -0,0 +1,106 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021 Facebook */
> +#include <test_progs.h>
> +#include "get_branch_snapshot.skel.h"
> +
> +static int *pfd_array;
> +static int cpu_cnt;
> +
> +static int create_perf_events(void)
> +{
> + struct perf_event_attr attr = {0};
> + int cpu;
> +
> + /* create perf event */
> + attr.size = sizeof(attr);
> + attr.type = PERF_TYPE_RAW;
> + attr.config = 0x1b00;
> + attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
> + attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
> + PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
> +
> + cpu_cnt = libbpf_num_possible_cpus();
> + pfd_array = malloc(sizeof(int) * cpu_cnt);
> + if (!pfd_array) {
> + cpu_cnt = 0;
> + return 1;
> + }
> +
> + for (cpu = 0; cpu < libbpf_num_possible_cpus(); cpu++) {
nit: use cpu_cnt from above?
> + pfd_array[cpu] = syscall(__NR_perf_event_open, &attr,
> + -1, cpu, -1, PERF_FLAG_FD_CLOEXEC);
> + if (pfd_array[cpu] < 0)
> + break;
> + }
> +
> + return cpu == 0;
> +}
> +
> +static void close_perf_events(void)
> +{
> + int cpu = 0;
> + int fd;
> +
> + while (cpu++ < cpu_cnt) {
> + fd = pfd_array[cpu];
> + if (fd < 0)
> + break;
> + close(fd);
> + }
> + free(pfd_array);
> +}
> +
> +void test_get_branch_snapshot(void)
> +{
> + struct get_branch_snapshot *skel;
> + int err, prog_fd;
> + __u32 retval;
> +
> + if (create_perf_events()) {
> + test__skip(); /* system doesn't support LBR */
> + goto cleanup;
Cleanup inside create_perf_events() and just return here. Or at least
initialize skel to NULL above, otherwise __destroy() below will cause
SIGSEGV, most probably.
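E.g. (sketch): initialize the skeleton pointer up front,

	struct get_branch_snapshot *skel = NULL;

and have create_perf_events() release pfd_array itself on failure, so the
test can bail out early:

	if (create_perf_events()) {
		test__skip();	/* system doesn't support LBR */
		return;
	}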
> + }
> +
> + skel = get_branch_snapshot__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "get_branch_snapshot__open_and_load"))
> + goto cleanup;
> +
> + err = kallsyms_find("bpf_fexit_loop_test1", &skel->bss->address_low);
> + if (!ASSERT_OK(err, "kallsyms_find"))
> + goto cleanup;
> +
> + err = kallsyms_find_next("bpf_fexit_loop_test1", &skel->bss->address_high);
> + if (!ASSERT_OK(err, "kallsyms_find_next"))
> + goto cleanup;
> +
> + err = get_branch_snapshot__attach(skel);
> + if (!ASSERT_OK(err, "get_branch_snapshot__attach"))
> + goto cleanup;
> +
> + prog_fd = bpf_program__fd(skel->progs.test1);
> + err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
> + NULL, 0, &retval, NULL);
> +
> + if (!ASSERT_OK(err, "bpf_prog_test_run"))
> + goto cleanup;
> +
> + if (skel->bss->total_entries < 16) {
> + /* too few entries for the hit/waste test */
> + test__skip();
> + goto cleanup;
> + }
> +
> + ASSERT_GT(skel->bss->test1_hits, 5, "find_test1_in_lbr");
> +
> + /* Given we stop LBR in software, we will waste a few entries.
> + * But we should try to waste as few as possibleentries. We are at
s/possibleentries/possible entries/
> + * about 7 on x86_64 systems.
> + * Add a check for < 10 so that we get heads-up when something
> + * changes and wastes too many entries.
> + */
> + ASSERT_LT(skel->bss->wasted_entries, 10, "check_wasted_entries");
> +
> +cleanup:
> + get_branch_snapshot__destroy(skel);
> + close_perf_events();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/get_branch_snapshot.c b/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
> new file mode 100644
> index 0000000000000..9c944e7480b95
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
> @@ -0,0 +1,41 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021 Facebook */
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +__u64 test1_hits = 0;
> +__u64 address_low = 0;
> +__u64 address_high = 0;
> +int wasted_entries = 0;
> +long total_entries = 0;
> +
> +#define MAX_LBR_ENTRIES 32
see my comment on another patch, if kernel defines this constant as
enum, we'll automatically get it from vmlinux.h.
> +
> +struct perf_branch_entry entries[MAX_LBR_ENTRIES] = {};
> +
> +
> +static inline bool in_range(__u64 val)
> +{
> + return (val >= address_low) && (val < address_high);
> +}
> +
[...]
Hi Song,
I love your patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/0day-ci/linux/commits/Song-Liu/bpf-introduce-bpf_get_branch_snapshot/20210831-054455
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-a003-20210831 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/09548da8ea3e9a75d0d969b18c148ce17ed1c97d
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Song-Liu/bpf-introduce-bpf_get_branch_snapshot/20210831-054455
git checkout 09548da8ea3e9a75d0d969b18c148ce17ed1c97d
# save the attached .config to linux build tree
mkdir build_dir
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
ld: kernel/bpf/trampoline.o: in function `__bpf_prog_enter':
>> kernel/bpf/trampoline.c:578: undefined reference to `bpf_perf_branch_snapshot'
vim +578 kernel/bpf/trampoline.c
551
552 /* The logic is similar to bpf_prog_run(), but with an explicit
553 * rcu_read_lock() and migrate_disable() which are required
554 * for the trampoline. The macro is split into
555 * call __bpf_prog_enter
556 * call prog->bpf_func
557 * call __bpf_prog_exit
558 *
559 * __bpf_prog_enter returns:
560 * 0 - skip execution of the bpf prog
561 * 1 - execute bpf prog
562 * [2..MAX_U64] - execute bpf prog and record execution time.
563 * This is start time.
564 */
565 u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
566 __acquires(RCU)
567 {
568 #ifdef CONFIG_PERF_EVENTS
569 /* Calling migrate_disable costs two entries in the LBR. To save
570 * some entries, we call perf_snapshot_branch_stack before
571 * migrate_disable to save some entries. This is OK because we
572 * care about the branch trace before entering the BPF program.
573 * If migrate happens exactly here, there isn't much we can do to
574 * preserve the data.
575 */
576 if (prog->call_get_branch)
577 static_call(perf_snapshot_branch_stack)(
> 578 this_cpu_ptr(&bpf_perf_branch_snapshot));
579 #endif
580 rcu_read_lock();
581 migrate_disable();
582 if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
583 inc_misses_counter(prog);
584 return 0;
585 }
586 return bpf_prog_start_time();
587 }
588
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
On Mon, Aug 30, 2021 at 02:41:05PM -0700, Song Liu wrote:
> @@ -564,6 +565,18 @@ static void notrace inc_misses_counter(struct bpf_prog *prog)
> u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
> __acquires(RCU)
> {
preempt_disable_notrace();
> +#ifdef CONFIG_PERF_EVENTS
> + /* Calling migrate_disable costs two entries in the LBR. To save
> + * some entries, we call perf_snapshot_branch_stack before
> + * migrate_disable to save some entries. This is OK because we
> + * care about the branch trace before entering the BPF program.
> + * If migrate happens exactly here, there isn't much we can do to
> + * preserve the data.
> + */
> + if (prog->call_get_branch)
> + static_call(perf_snapshot_branch_stack)(
> + this_cpu_ptr(&bpf_perf_branch_snapshot));
Here the comment is accurate, but if you recall the calling context
requirements of perf_snapshot_branch_stack from the last patch, you'll
see it requires you have at the very least preemption disabled, which
you just violated.
I think you'll find that (on x86 at least) the suggested
preempt_disable_notrace() incurs no additional branches.
Still, there is the next point to consider...
> +#endif
> rcu_read_lock();
> migrate_disable();
preempt_enable_notrace();
> if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
> @@ -1863,9 +1892,23 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
> preempt_enable();
> }
>
> +DEFINE_PER_CPU(struct perf_branch_snapshot, bpf_perf_branch_snapshot);
> +
> static __always_inline
> void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
> {
> +#ifdef CONFIG_PERF_EVENTS
> + /* Calling migrate_disable costs two entries in the LBR. To save
> + * some entries, we call perf_snapshot_branch_stack before
> + * migrate_disable to save some entries. This is OK because we
> + * care about the branch trace before entering the BPF program.
> + * If migrate happens exactly here, there isn't much we can do to
> + * preserve the data.
> + */
> + if (prog->call_get_branch)
> + static_call(perf_snapshot_branch_stack)(
> + this_cpu_ptr(&bpf_perf_branch_snapshot));
> +#endif
> cant_sleep();
In the face of ^^^^^^ the comment makes no sense. Still, what are the
nesting rules for __bpf_trace_run() and __bpf_prog_enter() ? I'm
thinking the trace one can nest inside an occurrence of prog, at which
point you have pieces.
> rcu_read_lock();
> (void) bpf_prog_run(prog, args);
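Putting the two suggested preempt_*_notrace() insertions above together,
__bpf_prog_enter() would look roughly like this (a sketch of the
suggestion, not a tested patch):

	u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
		__acquires(RCU)
	{
		preempt_disable_notrace();
	#ifdef CONFIG_PERF_EVENTS
		if (prog->call_get_branch)
			static_call(perf_snapshot_branch_stack)(
				this_cpu_ptr(&bpf_perf_branch_snapshot));
	#endif
		rcu_read_lock();
		migrate_disable();
		preempt_enable_notrace();
		if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
			inc_misses_counter(prog);
			return 0;
		}
		return bpf_prog_start_time();
	}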
> On Aug 31, 2021, at 8:24 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Aug 30, 2021 at 02:41:04PM -0700, Song Liu wrote:
>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index ac6fd2dabf6a2..d28d0e12c112c 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2155,9 +2155,9 @@ static void __intel_pmu_disable_all(void)
>>
>> static void intel_pmu_disable_all(void)
>> {
>> + intel_pmu_lbr_disable_all();
>> __intel_pmu_disable_all();
>> intel_pmu_pebs_disable_all();
>> - intel_pmu_lbr_disable_all();
>> }
>
> Hurmph... I'm not sure about that, I'd rather you sprinkle a few
> __always_inline to ensure no actual function is called while you disable
> things in the correct order.
>
> You now still have a hole vs PMI.
Hmm... I will move this back and try some inlining. It may require moving
some functions from ds.c/lbr.c to arch/x86/events/perf_event.h. But I guess
that is OK, as there are similar functions in the header.
Thanks,
Song
> On Aug 31, 2021, at 8:32 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Aug 30, 2021 at 02:41:05PM -0700, Song Liu wrote:
>
>> @@ -564,6 +565,18 @@ static void notrace inc_misses_counter(struct bpf_prog *prog)
>> u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
>> __acquires(RCU)
>> {
> preempt_disable_notrace();
>
>> +#ifdef CONFIG_PERF_EVENTS
>> + /* Calling migrate_disable costs two entries in the LBR. To save
>> + * some entries, we call perf_snapshot_branch_stack before
>> + * migrate_disable to save some entries. This is OK because we
>> + * care about the branch trace before entering the BPF program.
>> + * If migrate happens exactly here, there isn't much we can do to
>> + * preserve the data.
>> + */
>> + if (prog->call_get_branch)
>> + static_call(perf_snapshot_branch_stack)(
>> + this_cpu_ptr(&bpf_perf_branch_snapshot));
>
> Here the comment is accurate, but if you recall the calling context
> requirements of perf_snapshot_branch_stack from the last patch, you'll
> see it requires you have at the very least preemption disabled, which
> you just violated.
>
> I think you'll find that (on x86 at least) the suggested
> preempt_disable_notrace() incurs no additional branches.
>
> Still, there is the next point to consider...
>
>> +#endif
>> rcu_read_lock();
>> migrate_disable();
>
> preempt_enable_notrace();
Do we want preempt_enable_notrace() after migrate_disable()? It feels a
little weird to me.
>
>> if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
>
>> @@ -1863,9 +1892,23 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
>> preempt_enable();
>> }
>>
>> +DEFINE_PER_CPU(struct perf_branch_snapshot, bpf_perf_branch_snapshot);
>> +
>> static __always_inline
>> void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>> {
>> +#ifdef CONFIG_PERF_EVENTS
>> + /* Calling migrate_disable costs two entries in the LBR. To save
>> + * some entries, we call perf_snapshot_branch_stack before
>> + * migrate_disable to save some entries. This is OK because we
>> + * care about the branch trace before entering the BPF program.
>> + * If migrate happens exactly here, there isn't much we can do to
>> + * preserve the data.
>> + */
>> + if (prog->call_get_branch)
>> + static_call(perf_snapshot_branch_stack)(
>> + this_cpu_ptr(&bpf_perf_branch_snapshot));
>> +#endif
>> cant_sleep();
>
> In the face of ^^^^^^ the comment makes no sense. Still, what are the
> nesting rules for __bpf_trace_run() and __bpf_prog_enter() ? I'm
> thinking the trace one can nest inside an occurence of prog, at which
> point you have pieces.
I think broken LBR records is something we cannot really avoid in case
of nesting. OTOH, these should be rare cases and will not hurt the results
in most use cases.
I should probably tighten the rules in the verifier to only apply it for
__bpf_prog_enter (where we have the primary use case). We can enable it
for other program types when there are other use cases.
Thanks,
Song
On Mon, Aug 30, 2021 at 02:41:04PM -0700, Song Liu wrote:
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index ac6fd2dabf6a2..d28d0e12c112c 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2155,9 +2155,9 @@ static void __intel_pmu_disable_all(void)
>
> static void intel_pmu_disable_all(void)
> {
> + intel_pmu_lbr_disable_all();
> __intel_pmu_disable_all();
> intel_pmu_pebs_disable_all();
> - intel_pmu_lbr_disable_all();
> }
Hurmph... I'm not sure about that, I'd rather you sprinkle a few
__always_inline to ensure no actual function is called while you disable
things in the correct order.
You now still have a hole vs PMI.
> +static int
> +intel_pmu_snapshot_branch_stack(struct perf_branch_snapshot *br_snapshot)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
Note that this requires preemption is disabled, then look at the
call-sites in your next patch and spot the problem...
> +
> + intel_pmu_disable_all();
> + intel_pmu_lbr_read();
> + memcpy(br_snapshot->entries, cpuc->lbr_entries,
> + sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
> + br_snapshot->nr = x86_pmu.lbr_nr;
> + intel_pmu_enable_all(0);
> + return 0;
> +}
> +
> /*
> * Workaround for:
> * Intel Errata AAK100 (model 26)
> @@ -6283,9 +6297,15 @@ __init int intel_pmu_init(void)
> x86_pmu.lbr_nr = 0;
> }
>
> - if (x86_pmu.lbr_nr)
> + if (x86_pmu.lbr_nr) {
> pr_cont("%d-deep LBR, ", x86_pmu.lbr_nr);
>
> + /* only support branch_stack snapshot for perfmon >= v2 */
> + if (x86_pmu.disable_all == intel_pmu_disable_all)
{
> + static_call_update(perf_snapshot_branch_stack,
> + intel_pmu_snapshot_branch_stack);
}
> + }
> +
> intel_pmu_check_extra_regs(x86_pmu.extra_regs);
>
> /* Support full width counters using alternative MSR range */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 011cc5069b7ba..22807864e913b 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -13437,3 +13437,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
> .threaded = true,
> };
> #endif /* CONFIG_CGROUP_PERF */
> +
> +DEFINE_STATIC_CALL_RET0(perf_snapshot_branch_stack,
> + perf_snapshot_branch_stack_t);
I'll squint and accept 82 characters :-)
> On Aug 31, 2021, at 9:41 AM, Song Liu <[email protected]> wrote:
>
>
>
>> On Aug 31, 2021, at 8:32 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> On Mon, Aug 30, 2021 at 02:41:05PM -0700, Song Liu wrote:
>>
>>> @@ -564,6 +565,18 @@ static void notrace inc_misses_counter(struct bpf_prog *prog)
>>> u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
>>> __acquires(RCU)
>>> {
>> preempt_disable_notrace();
>>
>>> +#ifdef CONFIG_PERF_EVENTS
>>> + /* Calling migrate_disable costs two entries in the LBR. To save
>>> + * some entries, we call perf_snapshot_branch_stack before
>>> + * migrate_disable to save some entries. This is OK because we
>>> + * care about the branch trace before entering the BPF program.
>>> + * If migrate happens exactly here, there isn't much we can do to
>>> + * preserve the data.
>>> + */
>>> + if (prog->call_get_branch)
>>> + static_call(perf_snapshot_branch_stack)(
>>> + this_cpu_ptr(&bpf_perf_branch_snapshot));
>>
>> Here the comment is accurate, but if you recall the calling context
>> requirements of perf_snapshot_branch_stack from the last patch, you'll
>> see it requires you have at the very least preemption disabled, which
>> you just violated.
>
>>
>> I think you'll find that (on x86 at least) the suggested
>> preempt_disable_notrace() incurs no additional branches.
>>
>> Still, there is the next point to consider...
>>
>>> +#endif
>>> rcu_read_lock();
>>> migrate_disable();
>>
>> preempt_enable_notrace();
>
> Do we want preempt_enable_notrace() after migrate_disable()? It feels a
> little weird to me.
>
>>
>>> if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
>>
>>> @@ -1863,9 +1892,23 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
>>> preempt_enable();
>>> }
>>>
>>> +DEFINE_PER_CPU(struct perf_branch_snapshot, bpf_perf_branch_snapshot);
>>> +
>>> static __always_inline
>>> void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>>> {
>>> +#ifdef CONFIG_PERF_EVENTS
>>> + /* Calling migrate_disable costs two entries in the LBR. To save
>>> + * some entries, we call perf_snapshot_branch_stack before
>>> + * migrate_disable to save some entries. This is OK because we
>>> + * care about the branch trace before entering the BPF program.
>>> + * If migrate happens exactly here, there isn't much we can do to
>>> + * preserve the data.
>>> + */
>>> + if (prog->call_get_branch)
>>> + static_call(perf_snapshot_branch_stack)(
>>> + this_cpu_ptr(&bpf_perf_branch_snapshot));
>>> +#endif
>>> cant_sleep();
>>
>> In the face of ^^^^^^ the comment makes no sense. Still, what are the
>> nesting rules for __bpf_trace_run() and __bpf_prog_enter() ? I'm
>> thinking the trace one can nest inside an occurence of prog, at which
>> point you have pieces.
>
> I think broken LBR records is something we cannot really avoid in case
> of nesting. OTOH, these should be rare cases and will not hurt the results
> in most the use cases.
>
> I should probably tighten the rules in verifier to only apply it for
> __bpf_prog_enter (where we have the primary use case). We can enable it
> for other program types when there are other use cases.
Update about some offline discussion with Alexei and Andrii. We are planning
to move static_call(perf_snapshot_branch_stack) to inside the helper
bpf_get_branch_snapshot (rough sketch below). This change has a few benefits:
1. No need for an extra check (prog->call_get_branch) before every program
(even when the program doesn't use the helper).
2. No need to duplicate code across the different BPF program hooks.
3. BPF programs always run with migrate_disable(), so it is not necessary to
add an extra preempt_disable_notrace().
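Roughly, the helper would become (a sketch of the planned shape, folding in
Andrii's earlier comments; not the final code):

BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
{
#ifdef CONFIG_PERF_EVENTS
	struct perf_branch_snapshot *br = this_cpu_ptr(&bpf_perf_branch_snapshot);
	u32 to_copy;

	if (unlikely(flags))
		return -EINVAL;
	/* snapshot the branch stack when the helper is invoked; BPF
	 * programs run with migration disabled, so the per-CPU buffer
	 * is stable here
	 */
	static_call(perf_snapshot_branch_stack)(br);
	if (!br->nr)
		return -EOPNOTSUPP;
	to_copy = min_t(u32, size, br->nr * sizeof(struct perf_branch_entry));
	memcpy(buf, br->entries, to_copy);
	return to_copy;	/* bytes copied */
#else
	return -EOPNOTSUPP;
#endif
}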
It does flush a few more LBR entries. But the result seems ok:
ID: 0 from intel_pmu_lbr_disable_all+58 to intel_pmu_lbr_disable_all+93
ID: 1 from intel_pmu_lbr_disable_all+54 to intel_pmu_lbr_disable_all+58
ID: 2 from intel_pmu_snapshot_branch_stack+88 to intel_pmu_lbr_disable_all+0
ID: 3 from bpf_get_branch_snapshot+28 to intel_pmu_snapshot_branch_stack+0
ID: 4 from <bpf_tramepoline> to bpf_get_branch_snapshot+0
ID: 5 from <bpf_tramepoline> to <bpf_tramepoline>
ID: 6 from __bpf_prog_enter+34 to <bpf_tramepoline>
ID: 7 from migrate_disable+60 to __bpf_prog_enter+9
ID: 8 from __bpf_prog_enter+4 to migrate_disable+0
ID: 9 from __bpf_prog_enter+4 to __bpf_prog_enter+0
ID: 10 from bpf_fexit_loop_test1+22 to __bpf_prog_enter+0
ID: 11 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 12 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 13 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 14 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 15 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
We can save more by inlining intel_pmu_lbr_disable_all(). But it is probably
not necessary at the moment.
Thanks,
Song
On 8/31/21 2:24 PM, Song Liu wrote:
>
>
>> On Aug 31, 2021, at 9:41 AM, Song Liu <[email protected]> wrote:
>>
>>
>>
>>> On Aug 31, 2021, at 8:32 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Mon, Aug 30, 2021 at 02:41:05PM -0700, Song Liu wrote:
>>>
>>>> @@ -564,6 +565,18 @@ static void notrace inc_misses_counter(struct bpf_prog *prog)
>>>> u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
>>>> __acquires(RCU)
>>>> {
>>> preempt_disable_notrace();
>>>
>>>> +#ifdef CONFIG_PERF_EVENTS
>>>> + /* Calling migrate_disable costs two entries in the LBR. To save
>>>> + * some entries, we call perf_snapshot_branch_stack before
>>>> + * migrate_disable to save some entries. This is OK because we
>>>> + * care about the branch trace before entering the BPF program.
>>>> + * If migrate happens exactly here, there isn't much we can do to
>>>> + * preserve the data.
>>>> + */
>>>> + if (prog->call_get_branch)
>>>> + static_call(perf_snapshot_branch_stack)(
>>>> + this_cpu_ptr(&bpf_perf_branch_snapshot));
>>>
>>> Here the comment is accurate, but if you recall the calling context
>>> requirements of perf_snapshot_branch_stack from the last patch, you'll
>>> see it requires you have at the very least preemption disabled, which
>>> you just violated.
>>
>>>
>>> I think you'll find that (on x86 at least) the suggested
>>> preempt_disable_notrace() incurs no additional branches.
>>>
>>> Still, there is the next point to consider...
>>>
>>>> +#endif
>>>> rcu_read_lock();
>>>> migrate_disable();
>>>
>>> preempt_enable_notrace();
>>
>> Do we want preempt_enable_notrace() after migrate_disable()? It feels a
>> little weird to me.
>>
>>>
>>>> if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
>>>
>>>> @@ -1863,9 +1892,23 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
>>>> preempt_enable();
>>>> }
>>>>
>>>> +DEFINE_PER_CPU(struct perf_branch_snapshot, bpf_perf_branch_snapshot);
>>>> +
>>>> static __always_inline
>>>> void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>>>> {
>>>> +#ifdef CONFIG_PERF_EVENTS
>>>> + /* Calling migrate_disable costs two entries in the LBR. To save
>>>> + * some entries, we call perf_snapshot_branch_stack before
>>>> + * migrate_disable to save some entries. This is OK because we
>>>> + * care about the branch trace before entering the BPF program.
>>>> + * If migrate happens exactly here, there isn't much we can do to
>>>> + * preserve the data.
>>>> + */
>>>> + if (prog->call_get_branch)
>>>> + static_call(perf_snapshot_branch_stack)(
>>>> + this_cpu_ptr(&bpf_perf_branch_snapshot));
>>>> +#endif
>>>> cant_sleep();
>>>
>>> In the face of ^^^^^^ the comment makes no sense. Still, what are the
>>> nesting rules for __bpf_trace_run() and __bpf_prog_enter() ? I'm
>>> thinking the trace one can nest inside an occurence of prog, at which
>>> point you have pieces.
>>
>> I think broken LBR records is something we cannot really avoid in case
>> of nesting. OTOH, these should be rare cases and will not hurt the results
>> in most the use cases.
>>
>> I should probably tighten the rules in verifier to only apply it for
>> __bpf_prog_enter (where we have the primary use case). We can enable it
>> for other program types when there are other use cases.
>
> Update about some offline discussion with Alexei and Andrii. We are planning
> to move static_call(perf_snapshot_branch_stack) to inside the helper
> bpf_get_branch_snapshot. This change has a few benefit:
>
> 1. No need for extra check (prog->call_get_branch) before every program (even
> when the program doesn't use the helper).
>
> 2. No need to duplicate code of different BPF program hook.
> 3. BPF program always run with migrate_disable(), so it is not necessary to
> run add extra preempt_disable_notrace.
>
> It does flushes a few more LBR entries. But the result seems ok:
>
> ID: 0 from intel_pmu_lbr_disable_all+58 to intel_pmu_lbr_disable_all+93
> ID: 1 from intel_pmu_lbr_disable_all+54 to intel_pmu_lbr_disable_all+58
> ID: 2 from intel_pmu_snapshot_branch_stack+88 to intel_pmu_lbr_disable_all+0
> ID: 3 from bpf_get_branch_snapshot+28 to intel_pmu_snapshot_branch_stack+0
> ID: 4 from <bpf_tramepoline> to bpf_get_branch_snapshot+0
> ID: 5 from <bpf_tramepoline> to <bpf_tramepoline>
> ID: 6 from __bpf_prog_enter+34 to <bpf_tramepoline>
> ID: 7 from migrate_disable+60 to __bpf_prog_enter+9
> ID: 8 from __bpf_prog_enter+4 to migrate_disable+0
If we make migrate_disable 'static inline' it will save these 2 entries.
It's probably worth doing regardless, since it will be an immediate
performance benefit for all bpf programs.
> ID: 9 from __bpf_prog_enter+4 to __bpf_prog_enter+0
> ID: 10 from bpf_fexit_loop_test1+22 to __bpf_prog_enter+0
> ID: 11 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
> ID: 12 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
> ID: 13 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
> ID: 14 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
> ID: 15 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
>
> We can save more by inlining intel_pmu_lbr_disable_all(). But it is probably
> not necessary at the moment.
>
> Thanks,
> Song
>
>
>