Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile-guided optimizations. perf has had
branch record support for a while, but the data collection can be a bit
coarse-grained.
We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.
Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.
Changes in v5:
- Rename bpf_perf_prog_read_branches() -> bpf_read_branch_records()
- Rename BPF_F_GET_BR_SIZE -> BPF_F_GET_BRANCH_RECORDS_SIZE
- Squash tools/ bpf.h sync into selftest commit
Changes in v4:
- Add BPF_F_GET_BR_SIZE flag
- Return -ENOENT on unsupported architectures
- Only accept initialized memory in helper
- Check buffer size is multiple of sizeof(struct perf_branch_entry)
- Use bpf skeleton in selftest
- Add commit messages
- Spelling and formatting
Changes in v3:
- Document filling unused buffer with zero
- Formatting fixes
- Rebase
Changes in v2:
- Change to a bpf helper instead of context access
- Avoid mentioning Intel specific things
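
As a rough illustration of the helper semantics that the v4 and v5 changes
converge on (size query flag, buffer size must be a multiple of the entry
size, copy at most buf_size bytes), here is a small userspace model. It is a
sketch only and not the kernel implementation: model_read_branch_records(),
struct model_branch_entry and MODEL_F_GET_BRANCH_RECORDS_SIZE are hypothetical
names, the three-word entry layout mirrors the fake entry struct used in the
selftest, and rejecting unknown flag bits with -EINVAL is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same bit as BPF_F_GET_BRANCH_RECORDS_SIZE in the patch. */
#define MODEL_F_GET_BRANCH_RECORDS_SIZE (1ULL << 0)

/* Hypothetical stand-in for struct perf_branch_entry: three 64-bit words. */
struct model_branch_entry {
	uint64_t from;
	uint64_t to;
	uint64_t flags;
};

/*
 * Userspace model of the documented helper behavior; not kernel code.
 * src/nr describe the branch records available for the current sample.
 */
static long model_read_branch_records(const struct model_branch_entry *src,
				      uint32_t nr, void *buf,
				      uint32_t buf_size, uint64_t flags)
{
	const uint32_t entry_size = sizeof(struct model_branch_entry);
	uint32_t avail = nr * entry_size;
	uint32_t to_copy;

	assert(src || nr == 0);

	/* Assumption: unknown flag bits count as invalid arguments. */
	if (flags & ~MODEL_F_GET_BRANCH_RECORDS_SIZE)
		return -EINVAL;
	/* Size query: report the bytes needed; buf may be NULL here. */
	if (flags & MODEL_F_GET_BRANCH_RECORDS_SIZE)
		return avail;
	/* The buffer must hold a whole number of entries. */
	if (!buf || buf_size % entry_size != 0)
		return -EINVAL;

	/* Copy as many whole entries as fit and report the bytes written. */
	to_copy = avail < buf_size ? avail : buf_size;
	if (to_copy)
		memcpy(buf, src, to_copy);
	return to_copy;
}
```

With three records available, a size query in this model reports 72 bytes, a
two-entry buffer receives 48 bytes, and a 10-byte buffer is rejected.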
Daniel Xu (2):
bpf: Add bpf_read_branch_records() helper
selftests/bpf: add bpf_read_branch_records() selftest
include/uapi/linux/bpf.h | 25 +++-
kernel/trace/bpf_trace.c | 41 +++++++
tools/include/uapi/linux/bpf.h | 25 +++-
.../selftests/bpf/prog_tests/perf_branches.c | 112 ++++++++++++++++++
.../selftests/bpf/progs/test_perf_branches.c | 74 ++++++++++++
5 files changed, 275 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_branches.c
create mode 100644 tools/testing/selftests/bpf/progs/test_perf_branches.c
--
2.21.1
Add a selftest to test:
* default bpf_read_branch_records() behavior
* BPF_F_GET_BRANCH_RECORDS_SIZE flag behavior
* using helper to write to stack
* using helper to write to map
Tested by running:
# ./test_progs -t perf_branches
#27 perf_branches:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
Also sync tools/include/uapi/linux/bpf.h.
Signed-off-by: Daniel Xu <[email protected]>
---
tools/include/uapi/linux/bpf.h | 25 +++-
.../selftests/bpf/prog_tests/perf_branches.c | 112 ++++++++++++++++++
.../selftests/bpf/progs/test_perf_branches.c | 74 ++++++++++++
3 files changed, 210 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_branches.c
create mode 100644 tools/testing/selftests/bpf/progs/test_perf_branches.c
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f1d74a2bd234..332aa433d045 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2892,6 +2892,25 @@ union bpf_attr {
* Obtain the 64bit jiffies
* Return
* The 64 bit jiffies
+ *
+ * int bpf_read_branch_records(struct bpf_perf_event_data *ctx, void *buf, u32 buf_size, u64 flags)
+ * Description
+ * For an eBPF program attached to a perf event, retrieve the
+ * branch records (struct perf_branch_entry) associated to *ctx*
+ * and store it in the buffer pointed by *buf* up to size
+ * *buf_size* bytes.
+ *
+ * The *flags* can be set to **BPF_F_GET_BRANCH_RECORDS_SIZE** to
+ * instead return the number of bytes required to store all the
+ * branch entries. If this flag is set, *buf* may be NULL.
+ * Return
+ * On success, number of bytes written to *buf*. On error, a
+ * negative value.
+ *
+ * **-EINVAL** if arguments invalid or **buf_size** not a multiple
+ * of sizeof(struct perf_branch_entry).
+ *
+ * **-ENOENT** if architecture does not support branch records.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3012,7 +3031,8 @@ union bpf_attr {
FN(probe_read_kernel_str), \
FN(tcp_send_ack), \
FN(send_signal_thread), \
- FN(jiffies64),
+ FN(jiffies64), \
+ FN(read_branch_records),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -3091,6 +3111,9 @@ enum bpf_func_id {
/* BPF_FUNC_sk_storage_get flags */
#define BPF_SK_STORAGE_GET_F_CREATE (1ULL << 0)
+/* BPF_FUNC_read_branch_records flags. */
+#define BPF_F_GET_BRANCH_RECORDS_SIZE (1ULL << 0)
+
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
diff --git a/tools/testing/selftests/bpf/prog_tests/perf_branches.c b/tools/testing/selftests/bpf/prog_tests/perf_branches.c
new file mode 100644
index 000000000000..54a982a6c513
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/perf_branches.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <pthread.h>
+#include <sched.h>
+#include <sys/socket.h>
+#include <test_progs.h>
+#include "bpf/libbpf_internal.h"
+#include "test_perf_branches.skel.h"
+
+struct output {
+ int required_size;
+ int written_stack;
+ int written_map;
+};
+
+static void on_sample(void *ctx, int cpu, void *data, __u32 size)
+{
+ int pbe_size = sizeof(struct perf_branch_entry);
+ int required_size = ((struct output *)data)->required_size;
+ int written_stack = ((struct output *)data)->written_stack;
+ int written_map = ((struct output *)data)->written_map;
+ int duration = 0;
+
+ /*
+ * It's hard to validate the contents of the branch entries b/c it
+ * would require some kind of disassembler and also encoding the
+ * valid jump instructions for supported architectures. So just check
+ * the easy stuff for now.
+ */
+ CHECK(required_size <= 0, "read_branches_size", "err %d\n", required_size);
+ CHECK(written_stack < 0, "read_branches_stack", "err %d\n", written_stack);
+ CHECK(written_stack % pbe_size != 0, "read_branches_stack",
+ "stack bytes written=%d not multiple of struct size=%d\n",
+ written_stack, pbe_size);
+ CHECK(written_map < 0, "read_branches_map", "err %d\n", written_map);
+ CHECK(written_map % pbe_size != 0, "read_branches_map",
+ "map bytes written=%d not multiple of struct size=%d\n",
+ written_map, pbe_size);
+ CHECK(written_map < written_stack, "read_branches_size",
+ "written_map=%d < written_stack=%d\n", written_map, written_stack);
+
+ *(int *)ctx = 1;
+}
+
+void test_perf_branches(void)
+{
+ int err, i, pfd = -1, duration = 0, ok = 0;
+ struct perf_buffer_opts pb_opts = {};
+ struct perf_event_attr attr = {};
+ struct perf_buffer *pb;
+ struct bpf_link *link;
+ volatile int j = 0;
+ cpu_set_t cpu_set;
+
+
+ struct test_perf_branches *skel;
+ skel = test_perf_branches__open_and_load();
+ if (CHECK(!skel, "test_perf_branches_load",
+ "perf_branches skeleton failed\n"))
+ goto out_destroy;
+
+ /* create perf event */
+ attr.size = sizeof(attr);
+ attr.type = PERF_TYPE_HARDWARE;
+ attr.config = PERF_COUNT_HW_CPU_CYCLES;
+ attr.freq = 1;
+ attr.sample_freq = 4000;
+ attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
+ attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
+ pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
+ if (CHECK(pfd < 0, "perf_event_open", "err %d\n", pfd))
+ goto out_destroy;
+
+ /* attach perf_event */
+ link = bpf_program__attach_perf_event(skel->progs.perf_branches, pfd);
+ if (CHECK(IS_ERR(link), "attach_perf_event", "err %ld\n", PTR_ERR(link)))
+ goto out_close_perf;
+
+ /* set up perf buffer */
+ pb_opts.sample_cb = on_sample;
+ pb_opts.ctx = &ok;
+ pb = perf_buffer__new(bpf_map__fd(skel->maps.perf_buf_map), 1, &pb_opts);
+ if (CHECK(IS_ERR(pb), "perf_buf__new", "err %ld\n", PTR_ERR(pb)))
+ goto out_detach;
+
+ /* generate some branches on cpu 0 */
+ CPU_ZERO(&cpu_set);
+ CPU_SET(0, &cpu_set);
+ err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
+ if (CHECK(err, "set_affinity", "cpu #0, err %d\n", err))
+ goto out_free_pb;
+ /* spin the loop for a while (random high number) */
+ for (i = 0; i < 1000000; ++i)
+ ++j;
+
+ /* read perf buffer */
+ err = perf_buffer__poll(pb, 500);
+ if (CHECK(err < 0, "perf_buffer__poll", "err %d\n", err))
+ goto out_free_pb;
+
+ if (CHECK(!ok, "ok", "not ok\n"))
+ goto out_free_pb;
+
+out_free_pb:
+ perf_buffer__free(pb);
+out_detach:
+ bpf_link__destroy(link);
+out_close_perf:
+ close(pfd);
+out_destroy:
+ test_perf_branches__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_perf_branches.c b/tools/testing/selftests/bpf/progs/test_perf_branches.c
new file mode 100644
index 000000000000..60327d512400
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_perf_branches.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+
+#include <stddef.h>
+#include <linux/ptrace.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_trace_helpers.h"
+
+struct fake_perf_branch_entry {
+ __u64 _a;
+ __u64 _b;
+ __u64 _c;
+};
+
+struct output {
+ int required_size;
+ int written_stack;
+ int written_map;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+ __uint(key_size, sizeof(int));
+ __uint(value_size, sizeof(int));
+} perf_buf_map SEC(".maps");
+
+typedef struct fake_perf_branch_entry fpbe_t[30];
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, fpbe_t);
+} scratch_map SEC(".maps");
+
+SEC("perf_event")
+int perf_branches(void *ctx)
+{
+ struct fake_perf_branch_entry entries[4] = {0};
+ struct output output = {0};
+ __u32 key = 0, *value;
+
+ /* write to stack */
+ output.written_stack = bpf_read_branch_records(ctx, entries,
+ sizeof(entries), 0);
+ /* ignore spurious events */
+ if (!output.written_stack)
+ return 1;
+
+ /* get required size */
+ output.required_size =
+ bpf_read_branch_records(ctx, NULL, 0,
+ BPF_F_GET_BRANCH_RECORDS_SIZE);
+
+ /* write to map */
+ value = bpf_map_lookup_elem(&scratch_map, &key);
+ if (value)
+ output.written_map =
+ bpf_read_branch_records(ctx,
+ value,
+ 30 * sizeof(struct fake_perf_branch_entry),
+ 0);
+
+ /* ignore spurious events */
+ if (!output.written_map)
+ return 1;
+
+ bpf_perf_event_output(ctx, &perf_buf_map, BPF_F_CURRENT_CPU,
+ &output, sizeof(output));
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.21.1
On Sat, Jan 25, 2020 at 2:32 PM Daniel Xu <[email protected]> wrote:
> + attr.type = PERF_TYPE_HARDWARE;
> + attr.config = PERF_COUNT_HW_CPU_CYCLES;
> + attr.freq = 1;
> + attr.sample_freq = 4000;
> + attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
> + attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
> + pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
> + if (CHECK(pfd < 0, "perf_event_open", "err %d\n", pfd))
> + goto out_destroy;
It's failing for me in kvm. Is there a way to make it work?
CIs will be vm based too. If this test requires a physical host,
it will keep failing in all such environments.
Folks will be annoyed and eventually will disable the test.
Can we figure out how to test in the vm from the start?
On Sat Jan 25, 2020 at 6:53 PM, Alexei Starovoitov wrote:
> On Sat, Jan 25, 2020 at 2:32 PM Daniel Xu <[email protected]> wrote:
> > + attr.type = PERF_TYPE_HARDWARE;
> > + attr.config = PERF_COUNT_HW_CPU_CYCLES;
> > + attr.freq = 1;
> > + attr.sample_freq = 4000;
> > + attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
> > + attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
> > + pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
> > + if (CHECK(pfd < 0, "perf_event_open", "err %d\n", pfd))
> > + goto out_destroy;
>
>
> It's failing for me in kvm. Is there a way to make it work?
> CIs will be vm based too. If this test requires physical host
> such test will keep failing in all such environments.
> Folks will be annoyed and eventually will disable the test.
> Can we figure out how to test in the vm from the start?
It seems there's a patchset that's adding LBR support to guests:
https://lkml.org/lkml/2019/8/6/215 . However it seems to be stuck in
review limbo. Is there anything we can do to help that set along?
As far as hacking around it goes, nothing really comes to mind. That
patchset seems to be our best hope.
On 1/25/20 8:10 PM, Daniel Xu wrote:
> On Sat Jan 25, 2020 at 6:53 PM, Alexei Starovoitov wrote:
>> On Sat, Jan 25, 2020 at 2:32 PM Daniel Xu <[email protected]> wrote:
>>> + attr.type = PERF_TYPE_HARDWARE;
>>> + attr.config = PERF_COUNT_HW_CPU_CYCLES;
>>> + attr.freq = 1;
>>> + attr.sample_freq = 4000;
>>> + attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
>>> + attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
>>> + pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
>>> + if (CHECK(pfd < 0, "perf_event_open", "err %d\n", pfd))
>>> + goto out_destroy;
>>
>>
>> It's failing for me in kvm. Is there a way to make it work?
>> CIs will be vm based too. If this test requires physical host
>> such test will keep failing in all such environments.
>> Folks will be annoyed and eventually will disable the test.
>> Can we figure out how to test in the vm from the start?
>
> It seems there's a patchset that's adding LBR support to guest hosts:
> https://lkml.org/lkml/2019/8/6/215 . However it seems to be stuck in
> review limbo. Is there anything we can do to help that set along?
>
> As far as hacking it, nothing really comes to mind. Seems that patchset
> is our best hope.
prog_tests/send_signal.c tests the send_signal helper under nmi with
hardware counters. It added a check to see whether the underlying
hardware counter is supported. If it is not, the test is
skipped.
Maybe we can use the same approach here. If perf_event_open with
PERF_TYPE_HARDWARE/PERF_SAMPLE_BRANCH_STACK fails,
we just mark the test as skipped instead of failing.
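
A minimal sketch of what such a check could look like, written as standalone
userspace C rather than as a change to the selftest. The attr setup is copied
from the patch; classify_perf_open(), probe_branch_stack() and the enum are
hypothetical names, and treating ENOENT and EOPNOTSUPP as "not supported
here" is an assumption about which errno values a VM would produce.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

enum probe_result { PROBE_OK, PROBE_SKIP, PROBE_FAIL };

/*
 * Classify a perf_event_open() result. fd is the syscall return value and
 * err is errno captured right after the call.
 */
static enum probe_result classify_perf_open(int fd, int err)
{
	/* A failed call must come with a positive errno value. */
	assert(fd >= 0 || err > 0);

	if (fd >= 0)
		return PROBE_OK;
	/* Assumption: these values mean branch sampling is unavailable. */
	if (err == ENOENT || err == EOPNOTSUPP)
		return PROBE_SKIP;
	return PROBE_FAIL;
}

/* Try to open the same branch-stack event the selftest uses. */
enum probe_result probe_branch_stack(void)
{
	struct perf_event_attr attr;
	enum probe_result res;
	int fd, err;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.freq = 1;
	attr.sample_freq = 4000;
	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
				  PERF_SAMPLE_BRANCH_ANY;

	fd = (int)syscall(__NR_perf_event_open, &attr, -1, 0, -1,
			  PERF_FLAG_FD_CLOEXEC);
	err = errno;
	res = classify_perf_open(fd, err);
	if (fd >= 0)
		close(fd);
	return res;
}
```

In the selftest itself, the PROBE_SKIP case would presumably print a SKIP line
and call test__skip() the way send_signal.c does, instead of going through
CHECK().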
On Sun, Jan 26, 2020 at 04:50:14AM +0000, Yonghong Song wrote:
>
>
> On 1/25/20 8:10 PM, Daniel Xu wrote:
> > On Sat Jan 25, 2020 at 6:53 PM, Alexei Starovoitov wrote:
> >> On Sat, Jan 25, 2020 at 2:32 PM Daniel Xu <[email protected]> wrote:
> >>> + attr.type = PERF_TYPE_HARDWARE;
> >>> + attr.config = PERF_COUNT_HW_CPU_CYCLES;
> >>> + attr.freq = 1;
> >>> + attr.sample_freq = 4000;
> >>> + attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
> >>> + attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
> >>> + pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
> >>> + if (CHECK(pfd < 0, "perf_event_open", "err %d\n", pfd))
> >>> + goto out_destroy;
> >>
> >>
> >> It's failing for me in kvm. Is there a way to make it work?
> >> CIs will be vm based too. If this test requires physical host
> >> such test will keep failing in all such environments.
> >> Folks will be annoyed and eventually will disable the test.
> >> Can we figure out how to test in the vm from the start?
> >
> > It seems there's a patchset that's adding LBR support to guest hosts:
> > https://lkml.org/lkml/2019/8/6/215 . However it seems to be stuck in
> > review limbo. Is there anything we can do to help that set along?
> >
> > As far as hacking it, nothing really comes to mind. Seems that patchset
> > is our best hope.
>
> prog_tests/send_signal.c tests send_signal helper under nmi with
> hardware counters. It added a check to see whether the underlying
> hardware counter is supported, if it is not, the test is
> skipped.
>
> Maybe we can use the same approach here. If perf_event_open with
> PERF_TYPE_HARDWARE/PERF_SAMPLE_BRANCH_STACK failed,
> we just mark the test as skipped instead of failing.
Instead of failing or skipping the test, how about making it test the error case?
For example, instead of an lbr perf_event, some other event can be passed into the bpf prog.
The new helper can still be called, and in that case it should return -EINVAL?
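
One way to picture the userspace side of such an error-case test: open an
event that never carries branch records, then expect the helper call in the
bpf prog to fail. The sketch below is only an illustration.
init_no_branch_attr() and helper_failed_as_expected() are hypothetical names,
the choice of PERF_TYPE_SOFTWARE with PERF_COUNT_SW_CPU_CLOCK is an assumption
(picked because software events are normally available in VMs), and -EINVAL as
the expected return value is taken from the suggestion above, not from a
tested kernel.

```c
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <string.h>

/*
 * Fill attr for a sampling event without branch records.
 * Assumption: a software cpu-clock event is usable inside a VM.
 */
static void init_no_branch_attr(struct perf_event_attr *attr)
{
	assert(attr);

	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_CPU_CLOCK;
	attr->freq = 1;
	attr->sample_freq = 4000;
	/* Deliberately no PERF_SAMPLE_BRANCH_STACK in sample_type. */
}

/*
 * Return 1 if the value reported back by the bpf prog matches the error
 * this test expects, 0 otherwise.
 */
static int helper_failed_as_expected(long ret)
{
	return ret == -EINVAL;
}
```

The sample callback would then check the required_size, written_stack and
written_map fields with helper_failed_as_expected() rather than requiring
them to be non-negative.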
On 1/25/20 8:52 PM, Alexei Starovoitov wrote:
> On Sun, Jan 26, 2020 at 04:50:14AM +0000, Yonghong Song wrote:
>>
>>
>> On 1/25/20 8:10 PM, Daniel Xu wrote:
>>> On Sat Jan 25, 2020 at 6:53 PM, Alexei Starovoitov wrote:
>>>> On Sat, Jan 25, 2020 at 2:32 PM Daniel Xu <[email protected]> wrote:
>>>>> + attr.type = PERF_TYPE_HARDWARE;
>>>>> + attr.config = PERF_COUNT_HW_CPU_CYCLES;
>>>>> + attr.freq = 1;
>>>>> + attr.sample_freq = 4000;
>>>>> + attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
>>>>> + attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
>>>>> + pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
>>>>> + if (CHECK(pfd < 0, "perf_event_open", "err %d\n", pfd))
>>>>> + goto out_destroy;
>>>>
>>>>
>>>> It's failing for me in kvm. Is there a way to make it work?
>>>> CIs will be vm based too. If this test requires physical host
>>>> such test will keep failing in all such environments.
>>>> Folks will be annoyed and eventually will disable the test.
>>>> Can we figure out how to test in the vm from the start?
>>>
>>> It seems there's a patchset that's adding LBR support to guest hosts:
>>> https://lkml.org/lkml/2019/8/6/215 . However it seems to be stuck in
>>> review limbo. Is there anything we can do to help that set along?
>>>
>>> As far as hacking it, nothing really comes to mind. Seems that patchset
>>> is our best hope.
>>
>> prog_tests/send_signal.c tests send_signal helper under nmi with
>> hardware counters. It added a check to see whether the underlying
>> hardware counter is supported, if it is not, the test is
>> skipped.
>>
>> Maybe we can use the same approach here. If perf_event_open with
>> PERF_TYPE_HARDWARE/PERF_SAMPLE_BRANCH_STACK failed,
>> we just mark the test as skipped instead of failing.
>
> Instead of failing and skipping the test how about making it test error case?
> Like instead of lbr perf_event some other event can be passed into bpf prog.
> New helper can still be called and in such case it should return einval?
We can have both, I think. Some people may have a test environment
where PERF_SAMPLE_BRANCH_STACK is available. If there is a breakage
there, it will be reported.