Changes RFC -> PATCH v1:
1. In perf-record, poll vip events in a separate thread;
2. Add tag to bpf prog name;
3. Small refactorings.
Original cover letter (with minor revisions):
This is a follow-up to Alexei's earlier effort to show bpf programs in perf:
https://www.spinics.net/lists/netdev/msg524232.html
In this version, PERF_RECORD_BPF_EVENT is introduced to send real-time BPF
load/unload events to user space. In user space, perf-record is modified
to listen to these events (through a dedicated ring buffer) and to generate
detailed information about each program (struct bpf_prog_info_event). Then,
perf-report translates these events into proper symbols.
With this set, perf-report will show bpf programs as:
18.49% 0.16% test [kernel.vmlinux] [k] ksys_write
18.01% 0.47% test [kernel.vmlinux] [k] vfs_write
17.02% 0.40% test bpf_prog [k] bpf_prog_07367f7ba80df72b_
16.97% 0.10% test [kernel.vmlinux] [k] __vfs_write
16.86% 0.12% test [kernel.vmlinux] [k] comm_write
16.67% 0.39% test [kernel.vmlinux] [k] bpf_probe_read
Note that the program name is still a work in progress; it will be cleaner
once function types are available in BTF.
Please share your comments on this.
Thanks,
Song
Song Liu (5):
perf, bpf: Introduce PERF_RECORD_BPF_EVENT
perf: sync tools/include/uapi/linux/perf_event.h
perf util: basic handling of PERF_RECORD_BPF_EVENT
perf util: introduce bpf_prog_info_event
perf util: generate bpf_prog_info_event for short-lived bpf programs
include/linux/perf_event.h | 5 +
include/uapi/linux/perf_event.h | 27 ++-
kernel/bpf/syscall.c | 4 +
kernel/events/core.c | 82 ++++++-
tools/include/uapi/linux/perf_event.h | 27 ++-
tools/perf/builtin-record.c | 91 +++++++-
tools/perf/builtin-report.c | 2 +
tools/perf/util/Build | 2 +
tools/perf/util/bpf-info.c | 305 ++++++++++++++++++++++++++
tools/perf/util/bpf-info.h | 29 +++
tools/perf/util/event.c | 21 ++
tools/perf/util/event.h | 29 +++
tools/perf/util/evlist.c | 58 ++++-
tools/perf/util/evlist.h | 6 +
tools/perf/util/evsel.c | 9 +
tools/perf/util/evsel.h | 3 +
tools/perf/util/machine.c | 10 +
tools/perf/util/machine.h | 2 +
tools/perf/util/session.c | 8 +
tools/perf/util/tool.h | 7 +-
20 files changed, 714 insertions(+), 13 deletions(-)
create mode 100644 tools/perf/util/bpf-info.c
create mode 100644 tools/perf/util/bpf-info.h
--
2.17.1
For better performance analysis of BPF programs, this patch introduces
PERF_RECORD_BPF_EVENT, a new perf_event_type that exposes BPF program
load/unload information to user space.
/*
* Record different types of bpf events:
* enum perf_bpf_event_type {
* PERF_BPF_EVENT_UNKNOWN = 0,
* PERF_BPF_EVENT_PROG_LOAD = 1,
* PERF_BPF_EVENT_PROG_UNLOAD = 2,
* };
*
* struct {
* struct perf_event_header header;
* u16 type;
* u16 flags;
* u32 id; // prog_id or map_id
* };
*/
PERF_RECORD_BPF_EVENT = 17,
PERF_RECORD_BPF_EVENT contains minimal information about the BPF program.
The perf utility (or other user space tools) should listen to this event and
fetch more details about the program via BPF syscalls
(BPF_PROG_GET_FD_BY_ID, BPF_OBJ_GET_INFO_BY_FD, etc.). We decided not to
include all details of bpf_prog_info in the perf ring buffer because
that interface is still under rapid development. The perf utility already
uses the BPF syscalls to gather information about loaded programs;
including similar information in the perf ring buffer would introduce a
second ABI.
We picked PERF_RECORD_BPF_EVENT over tracepoints because PERF_RECORD is a
stable ABI, while tracepoints are more likely to change in the future.
Currently, PERF_RECORD_BPF_EVENT only supports two events:
PERF_BPF_EVENT_PROG_LOAD and PERF_BPF_EVENT_PROG_UNLOAD, but it can be
easily extended to support more.
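As a rough illustration (not part of this patch), a user space consumer of
this record could fetch the full program details roughly like the sketch
below. It assumes libbpf's bpf_prog_get_fd_by_id() and
bpf_obj_get_info_by_fd() wrappers, the updated uapi header from this series,
and only reads the fixed-size part of struct bpf_prog_info:

#include <unistd.h>
#include <linux/perf_event.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>

/* Sketch: react to a PERF_RECORD_BPF_EVENT read from the ring buffer. */
static int handle_bpf_event(__u16 type, __u32 id)
{
        struct bpf_prog_info info = {};
        __u32 info_len = sizeof(info);
        int fd, err;

        if (type != PERF_BPF_EVENT_PROG_LOAD)
                return 0;       /* nothing to fetch on unload */

        fd = bpf_prog_get_fd_by_id(id);         /* BPF_PROG_GET_FD_BY_ID */
        if (fd < 0)
                return -1;      /* program may already be gone */

        err = bpf_obj_get_info_by_fd(fd, &info, &info_len);
        close(fd);
        return err;
}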
Signed-off-by: Song Liu <[email protected]>
---
include/linux/perf_event.h | 5 ++
include/uapi/linux/perf_event.h | 27 ++++++++++-
kernel/bpf/syscall.c | 4 ++
kernel/events/core.c | 82 ++++++++++++++++++++++++++++++++-
4 files changed, 116 insertions(+), 2 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 53c500f0ca79..a3126fd5b7f1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1113,6 +1113,9 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
}
extern void perf_event_mmap(struct vm_area_struct *vma);
+extern void perf_event_bpf_event(enum perf_bpf_event_type type,
+ u16 flags, u32 id);
+
extern struct perf_guest_info_callbacks *perf_guest_cbs;
extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
@@ -1333,6 +1336,8 @@ static inline int perf_unregister_guest_info_callbacks
(struct perf_guest_info_callbacks *callbacks) { return 0; }
static inline void perf_event_mmap(struct vm_area_struct *vma) { }
+static inline void perf_event_bpf_event(enum perf_bpf_event_type type,
+ u16 flags, u32 id) { }
static inline void perf_event_exec(void) { }
static inline void perf_event_comm(struct task_struct *tsk, bool exec) { }
static inline void perf_event_namespaces(struct task_struct *tsk) { }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index f35eb72739c0..72a7da2b713f 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -372,7 +372,8 @@ struct perf_event_attr {
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end to beginning */
namespaces : 1, /* include namespaces data */
- __reserved_1 : 35;
+ bpf_event : 1, /* include bpf events */
+ __reserved_1 : 34;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -963,9 +964,33 @@ enum perf_event_type {
*/
PERF_RECORD_NAMESPACES = 16,
+ /*
+ * Record different types of bpf events:
+ * enum perf_bpf_event_type {
+ * PERF_BPF_EVENT_UNKNOWN = 0,
+ * PERF_BPF_EVENT_PROG_LOAD = 1,
+ * PERF_BPF_EVENT_PROG_UNLOAD = 2,
+ * };
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u16 type;
+ * u16 flags;
+ * u32 id; // prog_id or map_id
+ * };
+ */
+ PERF_RECORD_BPF_EVENT = 17,
+
PERF_RECORD_MAX, /* non-ABI */
};
+enum perf_bpf_event_type {
+ PERF_BPF_EVENT_UNKNOWN = 0,
+ PERF_BPF_EVENT_PROG_LOAD = 1,
+ PERF_BPF_EVENT_PROG_UNLOAD = 2,
+ PERF_BPF_EVENT_MAX, /* non-ABI */
+};
+
#define PERF_MAX_STACK_DEPTH 127
#define PERF_MAX_CONTEXTS_PER_STACK 8
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 18e3be193a05..b37051a13be6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1101,9 +1101,12 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
{
if (atomic_dec_and_test(&prog->aux->refcnt)) {
+ int prog_id = prog->aux->id;
+
/* bpf_prog_free_id() must be called first */
bpf_prog_free_id(prog, do_idr_lock);
bpf_prog_kallsyms_del_all(prog);
+ perf_event_bpf_event(PERF_BPF_EVENT_PROG_UNLOAD, 0, prog_id);
call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
}
@@ -1441,6 +1444,7 @@ static int bpf_prog_load(union bpf_attr *attr)
}
bpf_prog_kallsyms_add(prog);
+ perf_event_bpf_event(PERF_BPF_EVENT_PROG_LOAD, 0, prog->aux->id);
return err;
free_used_maps:
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5a97f34bc14c..54667be6669b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -385,6 +385,7 @@ static atomic_t nr_namespaces_events __read_mostly;
static atomic_t nr_task_events __read_mostly;
static atomic_t nr_freq_events __read_mostly;
static atomic_t nr_switch_events __read_mostly;
+static atomic_t nr_bpf_events __read_mostly;
static LIST_HEAD(pmus);
static DEFINE_MUTEX(pmus_lock);
@@ -4235,7 +4236,7 @@ static bool is_sb_event(struct perf_event *event)
if (attr->mmap || attr->mmap_data || attr->mmap2 ||
attr->comm || attr->comm_exec ||
- attr->task ||
+ attr->task || attr->bpf_event ||
attr->context_switch)
return true;
return false;
@@ -4305,6 +4306,8 @@ static void unaccount_event(struct perf_event *event)
dec = true;
if (has_branch_stack(event))
dec = true;
+ if (event->attr.bpf_event)
+ atomic_dec(&nr_bpf_events);
if (dec) {
if (!atomic_add_unless(&perf_sched_count, -1, 1))
@@ -7650,6 +7653,81 @@ static void perf_log_throttle(struct perf_event *event, int enable)
perf_output_end(&handle);
}
+/*
+ * bpf load/unload tracking
+ */
+
+struct perf_bpf_event {
+ struct {
+ struct perf_event_header header;
+ u16 type;
+ u16 flags;
+ u32 id;
+ } event_id;
+};
+
+static int perf_event_bpf_match(struct perf_event *event)
+{
+ return event->attr.bpf_event;
+}
+
+static void perf_event_bpf_output(struct perf_event *event,
+ void *data)
+{
+ struct perf_bpf_event *bpf_event = data;
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int size = bpf_event->event_id.header.size;
+ int ret;
+
+ if (!perf_event_bpf_match(event))
+ return;
+
+ perf_event_header__init_id(&bpf_event->event_id.header, &sample, event);
+ ret = perf_output_begin(&handle, event,
+ bpf_event->event_id.header.size);
+ if (ret)
+ goto out;
+
+ perf_output_put(&handle, bpf_event->event_id);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+out:
+ bpf_event->event_id.header.size = size;
+}
+
+static void perf_event_bpf(struct perf_bpf_event *bpf_event)
+{
+ perf_iterate_sb(perf_event_bpf_output,
+ bpf_event,
+ NULL);
+}
+
+void perf_event_bpf_event(enum perf_bpf_event_type type, u16 flags, u32 id)
+{
+ struct perf_bpf_event bpf_event;
+
+ if (!atomic_read(&nr_bpf_events))
+ return;
+
+ if (type <= PERF_BPF_EVENT_UNKNOWN || type >= PERF_BPF_EVENT_MAX)
+ return;
+
+ bpf_event = (struct perf_bpf_event){
+ .event_id = {
+ .header = {
+ .type = PERF_RECORD_BPF_EVENT,
+ .size = sizeof(bpf_event.event_id),
+ },
+ .type = type,
+ .flags = flags,
+ .id = id,
+ },
+ };
+ perf_event_bpf(&bpf_event);
+}
+
void perf_event_itrace_started(struct perf_event *event)
{
event->attach_state |= PERF_ATTACH_ITRACE;
@@ -9871,6 +9949,8 @@ static void account_event(struct perf_event *event)
inc = true;
if (is_cgroup_event(event))
inc = true;
+ if (event->attr.bpf_event)
+ atomic_inc(&nr_bpf_events);
if (inc) {
/*
--
2.17.1
Sync changes for PERF_RECORD_BPF_EVENT.
Signed-off-by: Song Liu <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 27 ++++++++++++++++++++++++++-
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index f35eb72739c0..72a7da2b713f 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -372,7 +372,8 @@ struct perf_event_attr {
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end to beginning */
namespaces : 1, /* include namespaces data */
- __reserved_1 : 35;
+ bpf_event : 1, /* include bpf events */
+ __reserved_1 : 34;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -963,9 +964,33 @@ enum perf_event_type {
*/
PERF_RECORD_NAMESPACES = 16,
+ /*
+ * Record different types of bpf events:
+ * enum perf_bpf_event_type {
+ * PERF_BPF_EVENT_UNKNOWN = 0,
+ * PERF_BPF_EVENT_PROG_LOAD = 1,
+ * PERF_BPF_EVENT_PROG_UNLOAD = 2,
+ * };
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u16 type;
+ * u16 flags;
+ * u32 id; // prog_id or map_id
+ * };
+ */
+ PERF_RECORD_BPF_EVENT = 17,
+
PERF_RECORD_MAX, /* non-ABI */
};
+enum perf_bpf_event_type {
+ PERF_BPF_EVENT_UNKNOWN = 0,
+ PERF_BPF_EVENT_PROG_LOAD = 1,
+ PERF_BPF_EVENT_PROG_UNLOAD = 2,
+ PERF_BPF_EVENT_MAX, /* non-ABI */
+};
+
#define PERF_MAX_STACK_DEPTH 127
#define PERF_MAX_CONTEXTS_PER_STACK 8
--
2.17.1
This patch introduces struct bpf_prog_info_event to union perf_event.
struct bpf_prog_info_event {
struct perf_event_header header;
u32 prog_info_len;
u32 ksym_table_len;
u64 ksym_table;
struct bpf_prog_info prog_info;
char data[];
};
struct bpf_prog_info_event contains information about a bpf program.
These events are written to perf.data by perf-record and processed by
perf-report.
struct bpf_prog_info_event uses arrays for some of its data (ksym_table, and
the arrays in struct bpf_prog_info). To make these arrays easy to serialize,
we allocate one contiguous chunk of memory (data). The array pointers are
translated to offsets within bpf_prog_info_event before the event is written
to the file, and translated back when the event is read from the file.
This patch enables synthesizing these events at the beginning of a
perf-record run. The next patch will handle short-lived bpf programs that
are created during perf-record.
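To make the offset translation concrete, here is a minimal sketch of the
idea (an illustration, not the exact perf code). It assumes the array
pointer being rebased already points into the event's trailing data[] area:

#include <linux/types.h>

/* Sketch: before writing the event to perf.data, rebase a pointer stored
 * in a __u64 field against the event's own base address; after reading
 * the event back, add the (new) base address to recover a usable pointer.
 */
static __u64 ptr_to_file_offset(const void *event_base, __u64 ptr)
{
        return ptr - (__u64)(unsigned long)event_base;
}

static __u64 file_offset_to_ptr(const void *event_base, __u64 off)
{
        return off + (__u64)(unsigned long)event_base;
}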
Signed-off-by: Song Liu <[email protected]>
---
tools/perf/builtin-record.c | 8 +
tools/perf/builtin-report.c | 2 +
tools/perf/util/Build | 2 +
tools/perf/util/bpf-info.c | 305 ++++++++++++++++++++++++++++++++++++
tools/perf/util/bpf-info.h | 29 ++++
tools/perf/util/event.c | 1 +
tools/perf/util/event.h | 14 ++
tools/perf/util/session.c | 4 +
tools/perf/util/tool.h | 3 +-
9 files changed, 367 insertions(+), 1 deletion(-)
create mode 100644 tools/perf/util/bpf-info.c
create mode 100644 tools/perf/util/bpf-info.h
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0980dfe3396b..86dfba937e4e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -41,6 +41,7 @@
#include "util/perf-hooks.h"
#include "util/time-utils.h"
#include "util/units.h"
+#include "util/bpf-info.h"
#include "asm/bug.h"
#include <errno.h>
@@ -966,6 +967,12 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (err < 0)
goto out_child;
+ err = perf_event__synthesize_bpf_prog_info(
+ &rec->tool, process_synthesized_event,
+ &rec->session->machines.host);
+ if (err < 0)
+ goto out_child;
+
if (rec->realtime_prio) {
struct sched_param param;
@@ -1531,6 +1538,7 @@ static struct record record = {
.namespaces = perf_event__process_namespaces,
.mmap = perf_event__process_mmap,
.mmap2 = perf_event__process_mmap2,
+ .bpf_prog_info = perf_event__process_bpf_prog_info,
.ordered_events = true,
},
};
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index c0703979c51d..4a9a3e8da4e0 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -41,6 +41,7 @@
#include "util/auxtrace.h"
#include "util/units.h"
#include "util/branch.h"
+#include "util/bpf-info.h"
#include <dlfcn.h>
#include <errno.h>
@@ -981,6 +982,7 @@ int cmd_report(int argc, const char **argv)
.auxtrace_info = perf_event__process_auxtrace_info,
.auxtrace = perf_event__process_auxtrace,
.feature = process_feature_event,
+ .bpf_prog_info = perf_event__process_bpf_prog_info,
.ordered_events = true,
.ordering_requires_timestamps = true,
},
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index ecd9f9ceda77..624c7281217c 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -150,6 +150,8 @@ endif
libperf-y += perf-hooks.o
+libperf-$(CONFIG_LIBBPF) += bpf-info.o
+
libperf-$(CONFIG_CXX) += c++/
CFLAGS_config.o += -DETC_PERFCONFIG="BUILD_STR($(ETC_PERFCONFIG_SQ))"
diff --git a/tools/perf/util/bpf-info.c b/tools/perf/util/bpf-info.c
new file mode 100644
index 000000000000..2e0005677c43
--- /dev/null
+++ b/tools/perf/util/bpf-info.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 Facebook
+ */
+#include <errno.h>
+#include <stdio.h>
+#include <bpf/bpf.h>
+#include "bpf-info.h"
+#include "debug.h"
+#include "session.h"
+
+#define KSYM_NAME_LEN 128
+#define BPF_PROG_INFO_MIN_SIZE 128 /* minimum size that includes jited_func_lens */
+
+static inline __u64 ptr_to_u64(const void *ptr)
+{
+ return (__u64) (unsigned long) ptr;
+}
+
+static int snprintf_hex(char *buf, size_t size, unsigned char *data, size_t len)
+{
+ int ret = 0;
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ ret += snprintf(buf + ret, size - ret, "%02x", data[i]);
+ return ret;
+}
+
+/* fetch information of the bpf program via bpf syscall. */
+struct bpf_prog_info_event *perf_bpf_info__get_bpf_prog_info_event(int fd)
+{
+ struct bpf_prog_info_event *prog_info_event = NULL;
+ struct bpf_prog_info info = {};
+ u32 info_len = sizeof(info);
+ u32 event_len, i;
+ void *ptr;
+ int err;
+
+ err = bpf_obj_get_info_by_fd(fd, &info, &info_len);
+ if (err) {
+ pr_debug("can't get prog info: %s", strerror(errno));
+ return NULL;
+ }
+ if (info_len < BPF_PROG_INFO_MIN_SIZE) {
+ pr_debug("kernel is too old to support proper prog info\n");
+ return NULL;
+ }
+
+ /* calculate size of bpf_prog_info_event */
+ event_len = sizeof(struct bpf_prog_info_event);
+ event_len += info_len;
+ event_len -= sizeof(info);
+ event_len += info.jited_prog_len;
+ event_len += info.xlated_prog_len;
+ event_len += info.nr_map_ids * sizeof(u32);
+ event_len += info.nr_jited_ksyms * sizeof(u64);
+ event_len += info.nr_jited_func_lens * sizeof(u32);
+ event_len += info.nr_jited_ksyms * KSYM_NAME_LEN;
+
+ prog_info_event = (struct bpf_prog_info_event *) malloc(event_len);
+ if (!prog_info_event)
+ return NULL;
+
+ /* assign pointers for map_ids, jited_prog_insns, etc. */
+ ptr = prog_info_event->data;
+ info.map_ids = ptr_to_u64(ptr);
+ ptr += info.nr_map_ids * sizeof(u32);
+ info.jited_prog_insns = ptr_to_u64(ptr);
+ ptr += info.jited_prog_len;
+ info.xlated_prog_insns = ptr_to_u64(ptr);
+ ptr += info.xlated_prog_len;
+ info.jited_ksyms = ptr_to_u64(ptr);
+ ptr += info.nr_jited_ksyms * sizeof(u64);
+ info.jited_func_lens = ptr_to_u64(ptr);
+ ptr += info.nr_jited_func_lens * sizeof(u32);
+
+ err = bpf_obj_get_info_by_fd(fd, &info, &info_len);
+ if (err) {
+ pr_err("can't get prog info: %s\n", strerror(errno));
+ free(prog_info_event);
+ return NULL;
+ }
+
+ /* fill data in prog_info_event */
+ prog_info_event->header.type = PERF_RECORD_BPF_PROG_INFO;
+ prog_info_event->header.misc = 0;
+ prog_info_event->prog_info_len = info_len;
+
+ memcpy(&prog_info_event->prog_info, &info, info_len);
+
+ prog_info_event->ksym_table_len = 0;
+ prog_info_event->ksym_table = ptr_to_u64(ptr);
+
+ /* fill in fake symbol name for now, add real name after BTF */
+ if (info.nr_jited_func_lens == 1 && info.name) { /* only main prog */
+ size_t l;
+
+ assert(info.nr_jited_ksyms == 1);
+ l = snprintf(ptr, KSYM_NAME_LEN, "bpf_prog_");
+ l += snprintf_hex(ptr + l, KSYM_NAME_LEN - l,
+ prog_info_event->prog_info.tag,
+ BPF_TAG_SIZE);
+ l += snprintf(ptr + l, KSYM_NAME_LEN - l, "_%s", info.name);
+ prog_info_event->ksym_table_len += l + 1;
+ ptr += l + 1;
+
+ } else {
+ assert(info.nr_jited_ksyms == info.nr_jited_func_lens);
+
+ for (i = 0; i < info.nr_jited_ksyms; i++) {
+ size_t l;
+
+ l = snprintf(ptr, KSYM_NAME_LEN, "bpf_prog_");
+ l += snprintf_hex(ptr + l, KSYM_NAME_LEN - l,
+ prog_info_event->prog_info.tag,
+ BPF_TAG_SIZE);
+ l += snprintf(ptr + l, KSYM_NAME_LEN - l, "_F");
+ prog_info_event->ksym_table_len += l + 1;
+ ptr += l + 1;
+ }
+ }
+
+ prog_info_event->header.size = ptr - (void *)prog_info_event;
+
+ return prog_info_event;
+}
+
+static size_t fprintf_bpf_prog_info(
+ struct bpf_prog_info_event *prog_info_event, FILE *fp)
+{
+ struct bpf_prog_info *info = &prog_info_event->prog_info;
+ unsigned long *jited_ksyms = (unsigned long *)(info->jited_ksyms);
+ char *name_ptr = (char *) prog_info_event->ksym_table;
+ unsigned int i;
+ size_t ret;
+
+ ret = fprintf(fp, "bpf_prog: type: %u id: %u ", info->type, info->id);
+ ret += fprintf(fp, "nr_jited_ksyms: %u\n", info->nr_jited_ksyms);
+
+ for (i = 0; i < info->nr_jited_ksyms; i++) {
+ ret += fprintf(fp, "jited_ksyms[%u]: %lx %s\n",
+ i, jited_ksyms[i], name_ptr);
+ name_ptr += strlen(name_ptr) + 1;
+ }
+ return ret;
+}
+
+size_t perf_event__fprintf_bpf_prog_info(union perf_event *event, FILE *fp)
+{
+ return fprintf_bpf_prog_info(&event->bpf_prog_info, fp);
+}
+
+/*
+ * translate all array ptr to offset from base address, called before
+ * writing the event to file
+ */
+void perf_bpf_info__ptr_to_offset(
+ struct bpf_prog_info_event *prog_info_event)
+{
+ u64 base = ptr_to_u64(prog_info_event);
+
+ prog_info_event->ksym_table -= base;
+ prog_info_event->prog_info.jited_prog_insns -= base;
+ prog_info_event->prog_info.xlated_prog_insns -= base;
+ prog_info_event->prog_info.map_ids -= base;
+ prog_info_event->prog_info.jited_ksyms -= base;
+ prog_info_event->prog_info.jited_func_lens -= base;
+}
+
+/*
+ * translate offset from base address to array pointer, called after
+ * reading the event from file
+ */
+void perf_bpf_info__offset_to_ptr(
+ struct bpf_prog_info_event *prog_info_event)
+{
+ u64 base = ptr_to_u64(prog_info_event);
+
+ prog_info_event->ksym_table += base;
+ prog_info_event->prog_info.jited_prog_insns += base;
+ prog_info_event->prog_info.xlated_prog_insns += base;
+ prog_info_event->prog_info.map_ids += base;
+ prog_info_event->prog_info.jited_ksyms += base;
+ prog_info_event->prog_info.jited_func_lens += base;
+}
+
+int perf_event__synthesize_one_bpf_prog_info(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ int fd)
+{
+ struct bpf_prog_info_event *prog_info_event;
+
+ prog_info_event = perf_bpf_info__get_bpf_prog_info_event(fd);
+
+ if (!prog_info_event) {
+ pr_err("Failed to get prog_info_event\n");
+ return -1;
+ }
+ perf_bpf_info__ptr_to_offset(prog_info_event);
+
+ if (perf_tool__process_synth_event(
+ tool, (union perf_event *)prog_info_event,
+ machine, process) != 0) {
+ free(prog_info_event);
+ return -1;
+ }
+
+ free(prog_info_event);
+ return 0;
+}
+
+int perf_event__synthesize_bpf_prog_info(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine)
+{
+ __u32 id = 0;
+ int err = 0;
+ int fd;
+
+ while (true) {
+ err = bpf_prog_get_next_id(id, &id);
+ if (err) {
+ if (errno == ENOENT) {
+ err = 0;
+ break;
+ }
+ pr_err("can't get next program: %s%s",
+ strerror(errno),
+ errno == EINVAL ? " -- kernel too old?" : "");
+ err = -1;
+ break;
+ }
+ fd = bpf_prog_get_fd_by_id(id);
+ if (fd < 0) {
+ pr_debug("Failed to get fd for prog_id %u\n", id);
+ continue;
+ }
+
+ err = perf_event__synthesize_one_bpf_prog_info(
+ tool, process, machine, fd);
+ close(fd);
+ if (err)
+ break;
+ }
+ return err;
+}
+
+int perf_event__process_bpf_prog_info(struct perf_session *session,
+ union perf_event *event)
+{
+ struct machine *machine = &session->machines.host;
+ struct bpf_prog_info_event *prog_info_event;
+ struct bpf_prog_info *info;
+ struct symbol *sym;
+ struct map *map;
+ char *name_ptr;
+ int ret = 0;
+ u64 *addrs;
+ u32 *lens;
+ u32 i;
+
+ prog_info_event = (struct bpf_prog_info_event *)
+ malloc(event->header.size);
+ if (!prog_info_event)
+ return -ENOMEM;
+
+ /* copy the data to rw memory so we can modify it */
+ memcpy(prog_info_event, &event->bpf_prog_info, event->header.size);
+ info = &prog_info_event->prog_info;
+
+ perf_bpf_info__offset_to_ptr(prog_info_event);
+ name_ptr = (char *) prog_info_event->ksym_table;
+ addrs = (u64 *)info->jited_ksyms;
+ lens = (u32 *)info->jited_func_lens;
+ for (i = 0; i < info->nr_jited_ksyms; i++) {
+ u32 len = info->nr_jited_func_lens == 1 ?
+ info->jited_prog_len : lens[i];
+
+ map = map_groups__find(&machine->kmaps, addrs[i]);
+ if (!map) {
+ map = dso__new_map("bpf_prog");
+ if (!map) {
+ ret = -ENOMEM;
+ break;
+ }
+ map->start = addrs[i];
+ map->pgoff = map->start;
+ map->end = map->start + len;
+ map_groups__insert(&machine->kmaps, map);
+ }
+
+ sym = symbol__new(addrs[i], len, 0, 0, name_ptr);
+ if (!sym) {
+ ret = -ENOMEM;
+ break;
+ }
+ dso__insert_symbol(map->dso, sym);
+ name_ptr += strlen(name_ptr) + 1;
+ }
+
+ free(prog_info_event);
+ return ret;
+}
diff --git a/tools/perf/util/bpf-info.h b/tools/perf/util/bpf-info.h
new file mode 100644
index 000000000000..39d5da102017
--- /dev/null
+++ b/tools/perf/util/bpf-info.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __PERF_BPF_INFO_H
+#define __PERF_BPF_INFO_H
+
+#include "event.h"
+#include "machine.h"
+#include "tool.h"
+#include "symbol.h"
+
+struct bpf_prog_info_event *perf_bpf_info__get_bpf_prog_info_event(int fd);
+
+size_t perf_event__fprintf_bpf_prog_info(union perf_event *event, FILE *fp);
+
+int perf_event__synthesize_one_bpf_prog_info(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ int fd);
+
+int perf_event__synthesize_bpf_prog_info(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine);
+
+void perf_bpf_info__ptr_to_offset(struct bpf_prog_info_event *prog_info_event);
+void perf_bpf_info__offset_to_ptr(struct bpf_prog_info_event *prog_info_event);
+
+int perf_event__process_bpf_prog_info(struct perf_session *session,
+ union perf_event *event);
+
+#endif /* __PERF_BPF_INFO_H */
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 04104a4ffe93..de2a89faf445 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -62,6 +62,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_EVENT_UPDATE] = "EVENT_UPDATE",
[PERF_RECORD_TIME_CONV] = "TIME_CONV",
[PERF_RECORD_HEADER_FEATURE] = "FEATURE",
+ [PERF_RECORD_BPF_PROG_INFO] = "BPF_PROG_INFO",
};
static const char *perf_ns__names[] = {
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 13a0c64dd0ed..dc64d800eaa6 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -5,6 +5,7 @@
#include <limits.h>
#include <stdio.h>
#include <linux/kernel.h>
+#include <linux/bpf.h>
#include "../perf.h"
#include "build-id.h"
@@ -258,6 +259,7 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_EVENT_UPDATE = 78,
PERF_RECORD_TIME_CONV = 79,
PERF_RECORD_HEADER_FEATURE = 80,
+ PERF_RECORD_BPF_PROG_INFO = 81,
PERF_RECORD_HEADER_MAX
};
@@ -629,6 +631,17 @@ struct feature_event {
char data[];
};
+#define KSYM_NAME_LEN 128
+
+struct bpf_prog_info_event {
+ struct perf_event_header header;
+ u32 prog_info_len;
+ u32 ksym_table_len;
+ u64 ksym_table;
+ struct bpf_prog_info prog_info;
+ char data[];
+};
+
union perf_event {
struct perf_event_header header;
struct mmap_event mmap;
@@ -661,6 +674,7 @@ union perf_event {
struct time_conv_event time_conv;
struct feature_event feat;
struct bpf_event bpf_event;
+ struct bpf_prog_info_event bpf_prog_info;
};
void perf_event__print_totals(void);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index dffe5120d2d3..5365ee1dfbec 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -415,6 +415,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->time_conv = process_event_op2_stub;
if (tool->feature == NULL)
tool->feature = process_event_op2_stub;
+ if (tool->bpf_prog_info == NULL)
+ tool->bpf_prog_info = process_event_op2_stub;
}
static void swap_sample_id_all(union perf_event *event, void *data)
@@ -1397,6 +1399,8 @@ static s64 perf_session__process_user_event(struct perf_session *session,
return tool->time_conv(session, event);
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
+ case PERF_RECORD_BPF_PROG_INFO:
+ return tool->bpf_prog_info(session, event);
default:
return -EINVAL;
}
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 69ae898ca024..739a4b1188f7 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -70,7 +70,8 @@ struct perf_tool {
stat_config,
stat,
stat_round,
- feature;
+ feature,
+ bpf_prog_info;
event_op3 auxtrace;
bool ordered_events;
bool ordering_requires_timestamps;
--
2.17.1
This patch adds basic handling of PERF_RECORD_BPF_EVENT in perf util.
Following patches add more logic that leverages these events.
Signed-off-by: Song Liu <[email protected]>
---
tools/perf/util/event.c | 20 ++++++++++++++++++++
tools/perf/util/event.h | 15 +++++++++++++++
tools/perf/util/evsel.c | 1 +
tools/perf/util/machine.c | 10 ++++++++++
tools/perf/util/machine.h | 2 ++
tools/perf/util/session.c | 4 ++++
tools/perf/util/tool.h | 4 +++-
7 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 0cd42150f712..04104a4ffe93 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -24,6 +24,7 @@
#include "symbol/kallsyms.h"
#include "asm/bug.h"
#include "stat.h"
+#include "session.h"
static const char *perf_event__names[] = {
[0] = "TOTAL",
@@ -43,6 +44,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_SWITCH] = "SWITCH",
[PERF_RECORD_SWITCH_CPU_WIDE] = "SWITCH_CPU_WIDE",
[PERF_RECORD_NAMESPACES] = "NAMESPACES",
+ [PERF_RECORD_BPF_EVENT] = "BPF_EVENT",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
@@ -1333,6 +1335,14 @@ int perf_event__process_switch(struct perf_tool *tool __maybe_unused,
return machine__process_switch_event(machine, event);
}
+int perf_event__process_bpf_event(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused,
+ struct machine *machine)
+{
+ return machine__process_bpf_event(machine, event);
+}
+
size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
{
return fprintf(fp, " %d/%d: [%#" PRIx64 "(%#" PRIx64 ") @ %#" PRIx64 "]: %c %s\n",
@@ -1465,6 +1475,13 @@ static size_t perf_event__fprintf_lost(union perf_event *event, FILE *fp)
return fprintf(fp, " lost %" PRIu64 "\n", event->lost.lost);
}
+size_t perf_event__fprintf_bpf_event(union perf_event *event, FILE *fp)
+{
+ return fprintf(fp, " bpf event with type %u, flags %u, id %u\n",
+ event->bpf_event.type, event->bpf_event.flags,
+ event->bpf_event.id);
+}
+
size_t perf_event__fprintf(union perf_event *event, FILE *fp)
{
size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -1500,6 +1517,9 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
case PERF_RECORD_LOST:
ret += perf_event__fprintf_lost(event, fp);
break;
+ case PERF_RECORD_BPF_EVENT:
+ ret += perf_event__fprintf_bpf_event(event, fp);
+ break;
default:
ret += fprintf(fp, "\n");
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index bfa60bcafbde..13a0c64dd0ed 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -84,6 +84,15 @@ struct throttle_event {
u64 stream_id;
};
+
+/* Record different types of bpf events, see enum perf_bpf_event_type */
+struct bpf_event {
+ struct perf_event_header header;
+ u16 type;
+ u16 flags;
+ u32 id;
+};
+
#define PERF_SAMPLE_MASK \
(PERF_SAMPLE_IP | PERF_SAMPLE_TID | \
PERF_SAMPLE_TIME | PERF_SAMPLE_ADDR | \
@@ -651,6 +660,7 @@ union perf_event {
struct stat_round_event stat_round;
struct time_conv_event time_conv;
struct feature_event feat;
+ struct bpf_event bpf_event;
};
void perf_event__print_totals(void);
@@ -750,6 +760,10 @@ int perf_event__process_exit(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
+int perf_event__process_bpf_event(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine);
int perf_tool__process_synth_event(struct perf_tool *tool,
union perf_event *event,
struct machine *machine,
@@ -814,6 +828,7 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_namespaces(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_bpf_event(union perf_event *event, FILE *fp);
size_t perf_event__fprintf(union perf_event *event, FILE *fp);
int kallsyms__get_function_start(const char *kallsyms_filename,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index cb7f01059940..af9d539e4b6a 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1635,6 +1635,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(context_switch, p_unsigned);
PRINT_ATTRf(write_backward, p_unsigned);
PRINT_ATTRf(namespaces, p_unsigned);
+ PRINT_ATTRf(bpf_event, p_unsigned);
PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned);
PRINT_ATTRf(bp_type, p_unsigned);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 111ae858cbcb..74c295f6fdc4 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -681,6 +681,14 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
return 0;
}
+int machine__process_bpf_event(struct machine *machine __maybe_unused,
+ union perf_event *event)
+{
+ if (dump_trace)
+ perf_event__fprintf_bpf_event(event, stdout);
+ return 0;
+}
+
static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
{
const char *dup_filename;
@@ -1795,6 +1803,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
case PERF_RECORD_SWITCH:
case PERF_RECORD_SWITCH_CPU_WIDE:
ret = machine__process_switch_event(machine, event); break;
+ case PERF_RECORD_BPF_EVENT:
+ ret = machine__process_bpf_event(machine, event); break;
default:
ret = -1;
break;
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index d856b85862e2..0ed237d6fd0a 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -127,6 +127,8 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
struct perf_sample *sample);
int machine__process_mmap2_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
+int machine__process_bpf_event(struct machine *machine,
+ union perf_event *event);
int machine__process_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 7d2c8ce6cfad..dffe5120d2d3 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -371,6 +371,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->itrace_start = perf_event__process_itrace_start;
if (tool->context_switch == NULL)
tool->context_switch = perf_event__process_switch;
+ if (tool->bpf_event == NULL)
+ tool->bpf_event = perf_event__process_bpf_event;
if (tool->read == NULL)
tool->read = process_event_sample_stub;
if (tool->throttle == NULL)
@@ -1300,6 +1302,8 @@ static int machines__deliver_event(struct machines *machines,
case PERF_RECORD_SWITCH:
case PERF_RECORD_SWITCH_CPU_WIDE:
return tool->context_switch(tool, event, sample, machine);
+ case PERF_RECORD_BPF_EVENT:
+ return tool->bpf_event(tool, event, sample, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 56e4ca54020a..69ae898ca024 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -53,7 +53,9 @@ struct perf_tool {
itrace_start,
context_switch,
throttle,
- unthrottle;
+ unthrottle,
+ bpf_event;
+
event_attr_op attr;
event_attr_op event_update;
event_op2 tracing_data;
--
2.17.1
This patch enables perf-record to listen to bpf_event and to generate
bpf_prog_info_event for bpf programs loaded and unloaded during the
perf-record run.
To minimize the latency between a bpf_event and the BPF syscalls that follow
it, a separate mmap with a watermark of 1 is created to process these vip
events, and a separate dummy event is attached to that special mmap. A
dedicated thread polls only the bpf events.
By default, perf-record listens to bpf_event. The no-bpf-event option is
added in case the user wants to opt out.
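For reference, the kind of low-latency notification channel used here can be
illustrated directly with perf_event_open(2). The sketch below is only an
illustration (not part of this patch) and assumes the updated uapi header
with the new bpf_event attribute bit:

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Sketch: open a dummy software event that only delivers bpf events and
 * wakes up poll() after every single record (watermark of 1). */
static int open_bpf_tracker(int cpu)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_SOFTWARE;
        attr.config = PERF_COUNT_SW_DUMMY;
        attr.bpf_event = 1;             /* new attr bit from this series */
        attr.watermark = 1;
        attr.wakeup_watermark = 1;

        /* pid = -1, cpu = cpu: all tasks on one CPU (needs privileges) */
        return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}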
Signed-off-by: Song Liu <[email protected]>
---
tools/perf/builtin-record.c | 83 ++++++++++++++++++++++++++++++++++++-
tools/perf/util/evlist.c | 58 ++++++++++++++++++++++----
tools/perf/util/evlist.h | 6 +++
tools/perf/util/evsel.c | 8 ++++
tools/perf/util/evsel.h | 3 ++
5 files changed, 150 insertions(+), 8 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 86dfba937e4e..11e7a8e8597a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -54,6 +54,7 @@
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/time64.h>
+#include <bpf/bpf.h>
struct switch_output {
bool enabled;
@@ -80,8 +81,10 @@ struct record {
bool buildid_all;
bool timestamp_filename;
bool timestamp_boundary;
+ bool no_bpf_event;
struct switch_output switch_output;
unsigned long long samples;
+ pthread_mutex_t write_mutex;
};
static volatile int auxtrace_record__snapshot_started;
@@ -381,6 +384,8 @@ static int record__open(struct record *rec)
pos->tracking = 1;
pos->attr.enable_on_exec = 1;
}
+ if (!rec->no_bpf_event)
+ perf_evlist__add_bpf_tracker(evlist);
perf_evlist__config(evlist, opts, &callchain_param);
@@ -562,6 +567,58 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
return rc;
}
+static void record__process_bpf_event(struct record *rec,
+ union perf_event *event)
+{
+ int fd;
+
+ switch (event->bpf_event.type) {
+ case PERF_BPF_EVENT_PROG_LOAD:
+ fd = bpf_prog_get_fd_by_id(event->bpf_event.id);
+ if (fd > 0) {
+ perf_event__synthesize_one_bpf_prog_info(
+ &rec->tool, process_synthesized_event,
+ &rec->session->machines.host, fd);
+ close(fd);
+ }
+ /* fall through */
+ case PERF_BPF_EVENT_PROG_UNLOAD:
+ record__write(rec, NULL, event,
+ event->header.size);
+ break;
+ default:
+ break;
+ }
+}
+
+static int record__mmap_process_vip_events(struct record *rec)
+{
+ int i;
+
+ pthread_mutex_lock(&rec->write_mutex);
+ for (i = 0; i < rec->evlist->nr_mmaps; i++) {
+ struct perf_mmap *map = &rec->evlist->vip_mmap[i];
+ union perf_event *event;
+
+ perf_mmap__read_init(map);
+ while ((event = perf_mmap__read_event(map)) != NULL) {
+ pr_debug("processing vip event of type %d\n",
+ event->header.type);
+ switch (event->header.type) {
+ case PERF_RECORD_BPF_EVENT:
+ record__process_bpf_event(rec, event);
+ break;
+ default:
+ break;
+ }
+ perf_mmap__consume(map);
+ }
+ perf_mmap__read_done(map);
+ }
+ pthread_mutex_unlock(&rec->write_mutex);
+ return 0;
+}
+
static int record__mmap_read_all(struct record *rec)
{
int err;
@@ -855,6 +912,19 @@ static int record__synthesize(struct record *rec, bool tail)
return err;
}
+static void *vip_poll_thread(void *arg)
+{
+ struct record *rec = arg;
+
+ if (rec->no_bpf_event)
+ return NULL;
+ while (!done) {
+ perf_evlist__poll_vip(rec->evlist, 1000);
+ record__mmap_process_vip_events(rec);
+ }
+ return NULL;
+}
+
static int __cmd_record(struct record *rec, int argc, const char **argv)
{
int err;
@@ -867,6 +937,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
struct perf_session *session;
bool disabled = false, draining = false;
int fd;
+ pthread_t thread;
atexit(record__sig_exit);
signal(SIGCHLD, sig_handler);
@@ -1049,6 +1120,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
trigger_ready(&auxtrace_snapshot_trigger);
trigger_ready(&switch_output_trigger);
perf_hooks__invoke_record_start();
+
+ pthread_create(&thread, NULL, vip_poll_thread, rec);
for (;;) {
unsigned long long hits = rec->samples;
@@ -1063,7 +1136,10 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
if (trigger_is_hit(&switch_output_trigger) || done || draining)
perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);
- if (record__mmap_read_all(rec) < 0) {
+ pthread_mutex_lock(&rec->write_mutex);
+ err = record__mmap_read_all(rec);
+ pthread_mutex_unlock(&rec->write_mutex);
+ if (err < 0) {
trigger_error(&auxtrace_snapshot_trigger);
trigger_error(&switch_output_trigger);
err = -1;
@@ -1164,6 +1240,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
record__synthesize_workload(rec, true);
out_child:
+ pthread_join(thread, NULL);
+
if (forks) {
int exit_status;
@@ -1541,6 +1619,7 @@ static struct record record = {
.bpf_prog_info = perf_event__process_bpf_prog_info,
.ordered_events = true,
},
+ .write_mutex = PTHREAD_MUTEX_INITIALIZER,
};
const char record_callchain_help[] = CALLCHAIN_RECORD_HELP
@@ -1689,6 +1768,8 @@ static struct option __record_options[] = {
"signal"),
OPT_BOOLEAN(0, "dry-run", &dry_run,
"Parse options then exit"),
+ OPT_BOOLEAN(0, "no-bpf-event", &record.no_bpf_event,
+ "do not record event on bpf program load/unload"),
OPT_END()
};
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index be440df29615..eb0b12fe7658 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -45,6 +45,7 @@ void perf_evlist__init(struct perf_evlist *evlist, struct cpu_map *cpus,
for (i = 0; i < PERF_EVLIST__HLIST_SIZE; ++i)
INIT_HLIST_HEAD(&evlist->heads[i]);
INIT_LIST_HEAD(&evlist->entries);
+ INIT_LIST_HEAD(&evlist->vip_entries);
perf_evlist__set_maps(evlist, cpus, threads);
fdarray__init(&evlist->pollfd, 64);
evlist->workload.pid = -1;
@@ -177,6 +178,8 @@ void perf_evlist__add(struct perf_evlist *evlist, struct perf_evsel *entry)
{
entry->evlist = evlist;
list_add_tail(&entry->node, &evlist->entries);
+ if (entry->vip)
+ list_add_tail(&entry->vip_node, &evlist->vip_entries);
entry->idx = evlist->nr_entries;
entry->tracking = !entry->idx;
@@ -267,6 +270,27 @@ int perf_evlist__add_dummy(struct perf_evlist *evlist)
return 0;
}
+int perf_evlist__add_bpf_tracker(struct perf_evlist *evlist)
+{
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_SOFTWARE,
+ .config = PERF_COUNT_SW_DUMMY,
+ .watermark = 1,
+ .bpf_event = 1,
+ .wakeup_watermark = 1,
+ .size = sizeof(attr), /* to capture ABI version */
+ };
+ struct perf_evsel *evsel = perf_evsel__new_idx(&attr,
+ evlist->nr_entries);
+
+ if (evsel == NULL)
+ return -ENOMEM;
+
+ evsel->vip = true;
+ perf_evlist__add(evlist, evsel);
+ return 0;
+}
+
static int perf_evlist__add_attrs(struct perf_evlist *evlist,
struct perf_event_attr *attrs, size_t nr_attrs)
{
@@ -452,15 +476,18 @@ int perf_evlist__alloc_pollfd(struct perf_evlist *evlist)
}
static int __perf_evlist__add_pollfd(struct perf_evlist *evlist, int fd,
- struct perf_mmap *map, short revent)
+ struct perf_mmap *map, short revent,
+ bool vip)
{
- int pos = fdarray__add(&evlist->pollfd, fd, revent | POLLERR | POLLHUP);
+ struct fdarray *pollfd = vip ? &evlist->vip_pollfd : &evlist->pollfd;
+ int pos = fdarray__add(pollfd, fd, revent | POLLERR | POLLHUP);
+
/*
* Save the idx so that when we filter out fds POLLHUP'ed we can
* close the associated evlist->mmap[] entry.
*/
if (pos >= 0) {
- evlist->pollfd.priv[pos].ptr = map;
+ pollfd->priv[pos].ptr = map;
fcntl(fd, F_SETFL, O_NONBLOCK);
}
@@ -470,7 +497,7 @@ static int __perf_evlist__add_pollfd(struct perf_evlist *evlist, int fd,
int perf_evlist__add_pollfd(struct perf_evlist *evlist, int fd)
{
- return __perf_evlist__add_pollfd(evlist, fd, NULL, POLLIN);
+ return __perf_evlist__add_pollfd(evlist, fd, NULL, POLLIN, false);
}
static void perf_evlist__munmap_filtered(struct fdarray *fda, int fd,
@@ -493,6 +520,11 @@ int perf_evlist__poll(struct perf_evlist *evlist, int timeout)
return fdarray__poll(&evlist->pollfd, timeout);
}
+int perf_evlist__poll_vip(struct perf_evlist *evlist, int timeout)
+{
+ return fdarray__poll(&evlist->vip_pollfd, timeout);
+}
+
static void perf_evlist__id_hash(struct perf_evlist *evlist,
struct perf_evsel *evsel,
int cpu, int thread, u64 id)
@@ -770,6 +802,7 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
evlist__for_each_entry(evlist, evsel) {
+ struct perf_mmap *vip_maps = evlist->vip_mmap;
struct perf_mmap *maps = evlist->mmap;
int *output = _output;
int fd;
@@ -800,7 +833,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
fd = FD(evsel, cpu, thread);
- if (*output == -1) {
+ if (evsel->vip) {
+ if (perf_mmap__mmap(&vip_maps[idx], mp,
+ fd, evlist_cpu) < 0)
+ return -1;
+ } else if (*output == -1) {
*output = fd;
if (perf_mmap__mmap(&maps[idx], mp, *output, evlist_cpu) < 0)
@@ -822,8 +859,12 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
* Therefore don't add it for polling.
*/
if (!evsel->system_wide &&
- __perf_evlist__add_pollfd(evlist, fd, &maps[idx], revent) < 0) {
- perf_mmap__put(&maps[idx]);
+ __perf_evlist__add_pollfd(
+ evlist, fd,
+ evsel->vip ? &vip_maps[idx] : &maps[idx],
+ revent, evsel->vip) < 0) {
+ perf_mmap__put(evsel->vip ?
+ &vip_maps[idx] : &maps[idx]);
return -1;
}
@@ -1035,6 +1076,9 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
if (!evlist->mmap)
return -ENOMEM;
+ if (!evlist->vip_mmap)
+ evlist->vip_mmap = perf_evlist__alloc_mmap(evlist, false);
+
if (evlist->pollfd.entries == NULL && perf_evlist__alloc_pollfd(evlist) < 0)
return -ENOMEM;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dc66436add98..3ed19f9fbc97 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -26,6 +26,7 @@ struct record_opts;
struct perf_evlist {
struct list_head entries;
+ struct list_head vip_entries;
struct hlist_head heads[PERF_EVLIST__HLIST_SIZE];
int nr_entries;
int nr_groups;
@@ -42,7 +43,9 @@ struct perf_evlist {
pid_t pid;
} workload;
struct fdarray pollfd;
+ struct fdarray vip_pollfd;
struct perf_mmap *mmap;
+ struct perf_mmap *vip_mmap;
struct perf_mmap *overwrite_mmap;
struct thread_map *threads;
struct cpu_map *cpus;
@@ -84,6 +87,8 @@ int __perf_evlist__add_default_attrs(struct perf_evlist *evlist,
int perf_evlist__add_dummy(struct perf_evlist *evlist);
+int perf_evlist__add_bpf_tracker(struct perf_evlist *evlist);
+
int perf_evlist__add_newtp(struct perf_evlist *evlist,
const char *sys, const char *name, void *handler);
@@ -120,6 +125,7 @@ int perf_evlist__alloc_pollfd(struct perf_evlist *evlist);
int perf_evlist__filter_pollfd(struct perf_evlist *evlist, short revents_and_mask);
int perf_evlist__poll(struct perf_evlist *evlist, int timeout);
+int perf_evlist__poll_vip(struct perf_evlist *evlist, int timeout);
struct perf_evsel *perf_evlist__id2evsel(struct perf_evlist *evlist, u64 id);
struct perf_evsel *perf_evlist__id2evsel_strict(struct perf_evlist *evlist,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index af9d539e4b6a..94456a493607 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -235,6 +235,7 @@ void perf_evsel__init(struct perf_evsel *evsel,
evsel->evlist = NULL;
evsel->bpf_fd = -1;
INIT_LIST_HEAD(&evsel->node);
+ INIT_LIST_HEAD(&evsel->vip_node);
INIT_LIST_HEAD(&evsel->config_terms);
perf_evsel__object.init(evsel);
evsel->sample_size = __perf_evsel__sample_size(attr->sample_type);
@@ -1795,6 +1796,8 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
PERF_SAMPLE_BRANCH_NO_CYCLES);
if (perf_missing_features.group_read && evsel->attr.inherit)
evsel->attr.read_format &= ~(PERF_FORMAT_GROUP|PERF_FORMAT_ID);
+ if (perf_missing_features.bpf_event)
+ evsel->attr.bpf_event = 0;
retry_sample_id:
if (perf_missing_features.sample_id_all)
evsel->attr.sample_id_all = 0;
@@ -1939,6 +1942,11 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
perf_missing_features.exclude_guest = true;
pr_debug2("switching off exclude_guest, exclude_host\n");
goto fallback_missing_features;
+ } else if (!perf_missing_features.bpf_event &&
+ evsel->attr.bpf_event) {
+ perf_missing_features.bpf_event = true;
+ pr_debug2("switching off bpf_event\n");
+ goto fallback_missing_features;
} else if (!perf_missing_features.sample_id_all) {
perf_missing_features.sample_id_all = true;
pr_debug2("switching off sample_id_all\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 4107c39f4a54..82b1d3e42603 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -89,6 +89,7 @@ struct perf_stat_evsel;
*/
struct perf_evsel {
struct list_head node;
+ struct list_head vip_node;
struct perf_evlist *evlist;
struct perf_event_attr attr;
char *filter;
@@ -128,6 +129,7 @@ struct perf_evsel {
bool ignore_missing_thread;
bool forced_leader;
bool use_uncore_alias;
+ bool vip; /* vip events have their own mmap */
/* parse modifier helper */
int exclude_GH;
int nr_members;
@@ -163,6 +165,7 @@ struct perf_missing_features {
bool lbr_flags;
bool write_backward;
bool group_read;
+ bool bpf_event;
};
extern struct perf_missing_features perf_missing_features;
--
2.17.1
On Wed, Nov 21, 2018 at 11:55:02AM -0800, Song Liu wrote:
> This patch enables perf-record to listen to bpf_event and generate
> bpf_prog_info_event for bpf programs loaded and unloaded during
> perf-record run.
>
> To minimize latency between bpf_event and following bpf calls, separate
> mmap with watermark of 1 is created to process these vip events. Then
> a separate dummy event is attached to the special mmap. A separate thread
> is used to only poll bpf events.
>
> By default, perf-record will listen to bpf_event. Option no-bpf-event is
> added in case the user would opt out.
I think only default perf-record for sampling should include it.
perf record -e for tracepoints, kprobes and anything else should imply no-bpf-event.
On 11/21/18 2:35 PM, Song Liu wrote:
>
>
>> On Nov 21, 2018, at 2:11 PM, Alexei Starovoitov <[email protected]> wrote:
>>
>> On Wed, Nov 21, 2018 at 11:55:02AM -0800, Song Liu wrote:
>>> This patch enables perf-record to listen to bpf_event and generate
>>> bpf_prog_info_event for bpf programs loaded and unloaded during
>>> perf-record run.
>>>
>>> To minimize latency between bpf_event and following bpf calls, separate
>>> mmap with watermark of 1 is created to process these vip events. Then
>>> a separate dummy event is attached to the special mmap. A separate thread
>>> is used to only poll bpf events.
>>>
>>> By default, perf-record will listen to bpf_event. Option no-bpf-event is
>>> added in case the user would opt out.
>>
>> I think only default perf-record for sampling should include it.
>> perf record -e for tracepoints, kprobes and anything else should imply no-bpf-event.
>>
>
> Maybe something like
>
> if (software_event && ! -g)
> no_bpf_event = true;
>
> ?
>
> This will allow attaching kprobe on bpf helper and see which bpf program
> calls it.
right. good point about -g (I assume you mean callgraph collection).
Indeed, if bpf progs unload before perf-report they won't be seen in
callgraphs either, even for sw events, so we have to enable bpf_events
to see them.
> On Nov 21, 2018, at 2:11 PM, Alexei Starovoitov <[email protected]> wrote:
>
> On Wed, Nov 21, 2018 at 11:55:02AM -0800, Song Liu wrote:
>> This patch enables perf-record to listen to bpf_event and generate
>> bpf_prog_info_event for bpf programs loaded and unloaded during
>> perf-record run.
>>
>> To minimize latency between bpf_event and following bpf calls, separate
>> mmap with watermark of 1 is created to process these vip events. Then
>> a separate dummy event is attached to the special mmap. A separate thread
>> is used to only poll bpf events.
>>
>> By default, perf-record will listen to bpf_event. Option no-bpf-event is
>> added in case the user would opt out.
>
> I think only default perf-record for sampling should include it.
> perf record -e for tracepoints, kprobes and anything else should imply no-bpf-event.
>
Maybe something like
if (software_event && ! -g)
no_bpf_event = true;
?
This will allow attaching a kprobe to a bpf helper and seeing which bpf
program calls it.
Thanks,
Song
On Wed, Nov 21, 2018 at 11:54:57AM -0800, Song Liu wrote:
> Changes RFC -> PATCH v1:
>
> 1. In perf-record, poll vip events in a separate thread;
> 2. Add tag to bpf prog name;
> 3. Small refactorings.
>
> Original cover letter (with minor revisions):
>
> This is to follow up Alexei's early effort to show bpf programs
>
> https://www.spinics.net/lists/netdev/msg524232.html
>
> In this version, PERF_RECORD_BPF_EVENT is introduced to send real time BPF
> load/unload events to user space. In user space, perf-record is modified
> to listen to these events (through a dedicated ring buffer) and generate
> detailed information about the program (struct bpf_prog_info_event). Then,
> perf-report translates these events into proper symbols.
>
> With this set, perf-report will show bpf program as:
>
> 18.49% 0.16% test [kernel.vmlinux] [k] ksys_write
> 18.01% 0.47% test [kernel.vmlinux] [k] vfs_write
> 17.02% 0.40% test bpf_prog [k] bpf_prog_07367f7ba80df72b_
> 16.97% 0.10% test [kernel.vmlinux] [k] __vfs_write
> 16.86% 0.12% test [kernel.vmlinux] [k] comm_write
> 16.67% 0.39% test [kernel.vmlinux] [k] bpf_probe_read
>
> Note that, the program name is still work in progress, it will be cleaner
> with function types in BTF.
>
> Please share your comments on this.
So I see:
kernel/bpf/core.c:void bpf_prog_kallsyms_add(struct bpf_prog *fp)
which should already provide basic symbol information for extant eBPF
programs, right?
And (AFAIK) perf uses /proc/kcore for annotate on the current running
kernel (if not, it really should, given alternatives, jump_labels and
all the other self-modifying code).
So this fancy new stuff is only for the case where your profile spans
eBPF load/unload events (which should be relatively rare in the normal
case, right), or when you want source annotated asm output (I normally
don't bother with that).
That is; I would really like this fancy stuff to be an optional extra
that is typically not needed.
Does that make sense?
Hi Peter,
> On Nov 22, 2018, at 1:32 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Nov 21, 2018 at 11:54:57AM -0800, Song Liu wrote:
>> Changes RFC -> PATCH v1:
>>
>> 1. In perf-record, poll vip events in a separate thread;
>> 2. Add tag to bpf prog name;
>> 3. Small refactorings.
>>
>> Original cover letter (with minor revisions):
>>
>> This is to follow up Alexei's early effort to show bpf programs
>>
>> https://www.spinics.net/lists/netdev/msg524232.html
>>
>> In this version, PERF_RECORD_BPF_EVENT is introduced to send real time BPF
>> load/unload events to user space. In user space, perf-record is modified
>> to listen to these events (through a dedicated ring buffer) and generate
>> detailed information about the program (struct bpf_prog_info_event). Then,
>> perf-report translates these events into proper symbols.
>>
>> With this set, perf-report will show bpf program as:
>>
>> 18.49% 0.16% test [kernel.vmlinux] [k] ksys_write
>> 18.01% 0.47% test [kernel.vmlinux] [k] vfs_write
>> 17.02% 0.40% test bpf_prog [k] bpf_prog_07367f7ba80df72b_
>> 16.97% 0.10% test [kernel.vmlinux] [k] __vfs_write
>> 16.86% 0.12% test [kernel.vmlinux] [k] comm_write
>> 16.67% 0.39% test [kernel.vmlinux] [k] bpf_probe_read
>>
>> Note that, the program name is still work in progress, it will be cleaner
>> with function types in BTF.
>>
>> Please share your comments on this.
>
> So I see:
>
> kernel/bpf/core.c:void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>
> which should already provide basic symbol information for extant eBPF
> programs, right?
Right, if the BPF program is still loaded when perf-report runs, symbols
are available.
> And (AFAIK) perf uses /proc/kcore for annotate on the current running
> kernel (if not, it really should, given alternatives, jump_labels and
> all other other self-modifying code).
>
> So this fancy new stuff is only for the case where your profile spans
> eBPF load/unload events (which should be relatively rare in the normal
> case, right), or when you want source annotated asm output (I normally
> don't bother with that).
This patch set adds two pieces of information:
1. At the beginning of perf-record, save info about existing BPF programs;
2. Gather information about BPF programs loaded/unloaded during perf-record.
(1) is all in user space. It is necessary to show symbols of BPF programs
that are unloaded _after_ perf-record. (2) needs PERF_RECORD_BPF_EVENT
from the ring buffer. It covers BPF programs loaded during perf-record
(perf record -- bpf_test).
> That is; I would really like this fancy stuff to be an optional extra
> that is typically not needed.
>
> Does that make sense?
(1) above is always enabled with this set. I added the no-bpf-event option
to disable (2). I guess you would prefer that (2) is disabled by default
and enabled with an option?
Thanks,
Song
On 11/21/18 3:11 PM, Alexei Starovoitov wrote:
> On Wed, Nov 21, 2018 at 11:55:02AM -0800, Song Liu wrote:
>> This patch enables perf-record to listen to bpf_event and generate
>> bpf_prog_info_event for bpf programs loaded and unloaded during
>> perf-record run.
>>
>> To minimize latency between bpf_event and following bpf calls, separate
>> mmap with watermark of 1 is created to process these vip events. Then
>> a separate dummy event is attached to the special mmap. A separate thread
>> is used to only poll bpf events.
>>
>> By default, perf-record will listen to bpf_event. Option no-bpf-event is
>> added in case the user would opt out.
>
> I think only default perf-record for sampling should include it.
> perf record -e for tracepoints, kprobes and anything else should imply no-bpf-event.
>
It should default off for those, but as I have said before users should
be able to add bpf load and unload events to the mix of tracepoints,
probes, etc. that they are monitoring.
On Thu, Nov 22, 2018 at 06:13:32PM +0000, Song Liu wrote:
> Hi Peter,
>
> > On Nov 22, 2018, at 1:32 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Wed, Nov 21, 2018 at 11:54:57AM -0800, Song Liu wrote:
> >> Changes RFC -> PATCH v1:
> >>
> >> 1. In perf-record, poll vip events in a separate thread;
> >> 2. Add tag to bpf prog name;
> >> 3. Small refactorings.
> >>
> >> Original cover letter (with minor revisions):
> >>
> >> This is to follow up Alexei's early effort to show bpf programs
> >> In this version, PERF_RECORD_BPF_EVENT is introduced to send real time BPF
> >> load/unload events to user space. In user space, perf-record is modified
> >> to listen to these events (through a dedicated ring buffer) and generate
> >> detailed information about the program (struct bpf_prog_info_event). Then,
> >> perf-report translates these events into proper symbols.
> >>
> >> With this set, perf-report will show bpf program as:
> >>
> >> 18.49% 0.16% test [kernel.vmlinux] [k] ksys_write
> >> 18.01% 0.47% test [kernel.vmlinux] [k] vfs_write
> >> 17.02% 0.40% test bpf_prog [k] bpf_prog_07367f7ba80df72b_
> >> 16.97% 0.10% test [kernel.vmlinux] [k] __vfs_write
> >> 16.86% 0.12% test [kernel.vmlinux] [k] comm_write
> >> 16.67% 0.39% test [kernel.vmlinux] [k] bpf_probe_read
> >>
> >> Note that, the program name is still work in progress, it will be cleaner
> >> with function types in BTF.
> >>
> >> Please share your comments on this.
> >
> > So I see:
> >
> > kernel/bpf/core.c:void bpf_prog_kallsyms_add(struct bpf_prog *fp)
> >
> > which should already provide basic symbol information for extant eBPF
> > programs, right?
>
> Right, if the BPF program is still loaded when perf-report runs, symbols
> are available.
Good, that is not something that was clear. The Changelog seems to imply
we need this new stuff in order to observe symbols.
> > And (AFAIK) perf uses /proc/kcore for annotate on the current running
> > kernel (if not, it really should, given alternatives, jump_labels and
> > all other other self-modifying code).
> >
> > So this fancy new stuff is only for the case where your profile spans
> > eBPF load/unload events (which should be relatively rare in the normal
> > case, right), or when you want source annotated asm output (I normally
> > don't bother with that).
>
> This patch set adds two pieces of information:
> 1. At the beginning of perf-record, save info of existing BPF programs;
> 2. Gather information of BPF programs load/unload during perf-record.
>
> (1) is all in user space. It is necessary to show symbols of BPF program
> that are unloaded _after_ perf-record. (2) needs PERF_RECORD_BPF_EVENT
> from the ring buffer. It covers BPF program loaded during perf-record
> (perf record -- bpf_test).
I'm saying that if you give them symbols, most people won't need any of
that, ever.
And just tracking kallsyms is _much_ cheaper than either 1 or 2. Alexei
was talking about fairly big amounts of data per BPF prog. Dumping and
saving that sounds like pointless overhead for 99% of the users.
> > That is; I would really like this fancy stuff to be an optional extra
> > that is typically not needed.
> >
> > Does that make sense?
>
> (1) above is always enabled with this set. I added option no-bpf-events
> to disable (2). I guess you prefer the (2) is disabled by default, and
> enabled with an option?
I'm saying neither should be default enabled. Instead we should do
recording and tracking by default.
That gets people symbol information on BPF stuff, which is likely all
they ever need.
If they then decide they need more, then and only then can they enable
the fancy pants bpf-synchronous-syscall nonsense thingy. And like I said
before, I don't bother with source-annotated asm output for
perf-annotate.
And if the bpf prog is still loaded, kcore should contain the raw jitted
asm just fine; you again don't need the bpf syscall stuff.
Now, I'm not saying this patch set is useless; but I'm saying most
people should not need this, and it is massive overkill for the needs of
most people.
> On Nov 26, 2018, at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Nov 22, 2018 at 06:13:32PM +0000, Song Liu wrote:
>> Hi Peter,
>>
>>> On Nov 22, 2018, at 1:32 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Wed, Nov 21, 2018 at 11:54:57AM -0800, Song Liu wrote:
>>>> Changes RFC -> PATCH v1:
>>>>
>>>> 1. In perf-record, poll vip events in a separate thread;
>>>> 2. Add tag to bpf prog name;
>>>> 3. Small refactorings.
>>>>
>>>> Original cover letter (with minor revisions):
>>>>
>>>> This is to follow up Alexei's early effort to show bpf programs
>
>>>> In this version, PERF_RECORD_BPF_EVENT is introduced to send real time BPF
>>>> load/unload events to user space. In user space, perf-record is modified
>>>> to listen to these events (through a dedicated ring buffer) and generate
>>>> detailed information about the program (struct bpf_prog_info_event). Then,
>>>> perf-report translates these events into proper symbols.
>>>>
>>>> With this set, perf-report will show bpf program as:
>>>>
>>>> 18.49% 0.16% test [kernel.vmlinux] [k] ksys_write
>>>> 18.01% 0.47% test [kernel.vmlinux] [k] vfs_write
>>>> 17.02% 0.40% test bpf_prog [k] bpf_prog_07367f7ba80df72b_
>>>> 16.97% 0.10% test [kernel.vmlinux] [k] __vfs_write
>>>> 16.86% 0.12% test [kernel.vmlinux] [k] comm_write
>>>> 16.67% 0.39% test [kernel.vmlinux] [k] bpf_probe_read
>>>>
>>>> Note that, the program name is still work in progress, it will be cleaner
>>>> with function types in BTF.
>>>>
>>>> Please share your comments on this.
>>>
>>> So I see:
>>>
>>> kernel/bpf/core.c:void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>>>
>>> which should already provide basic symbol information for extant eBPF
>>> programs, right?
>>
>> Right, if the BPF program is still loaded when perf-report runs, symbols
>> are available.
>
> Good, that is not something that was clear. The Changelog seems to imply
> we need this new stuff in order to observe symbols.
I will clarify this in the next version.
>
>>> And (AFAIK) perf uses /proc/kcore for annotate on the current running
>>> kernel (if not, it really should, given alternatives, jump_labels and
>>> all other other self-modifying code).
>>>
>>> So this fancy new stuff is only for the case where your profile spans
>>> eBPF load/unload events (which should be relatively rare in the normal
>>> case, right), or when you want source annotated asm output (I normally
>>> don't bother with that).
>>
>> This patch set adds two pieces of information:
>> 1. At the beginning of perf-record, save info of existing BPF programs;
>> 2. Gather information of BPF programs load/unload during perf-record.
>>
>> (1) is all in user space. It is necessary to show symbols of BPF program
>> that are unloaded _after_ perf-record. (2) needs PERF_RECORD_BPF_EVENT
>> from the ring buffer. It covers BPF program loaded during perf-record
>> (perf record -- bpf_test).
>
> I'm saying that if you given them symbols; most people won't need any of
> that ever.
>
> And just tracking kallsyms is _much_ cheaper than either 1 or 2. Alexei
> was talking fairly big amounts of data per BPF prog. Dumping and saving
> that sounds like pointless overhead for 99% of the users.
Due to the kernel-module-like nature of BPF programs, I think it is
still necessary to cover cases where BPF programs are unloaded when
perf-record runs. How about we add another step that scans all
bpf_prog_XXXX entries from kallsyms and synthesizes symbols for them?
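As an illustration only (not the actual perf code), such a scan could
look like the sketch below. It assumes the usual /proc/kallsyms
'address type name' text format; note that addresses read back as 0
without sufficient privileges:

/* Sketch: synthesize symbols for bpf_prog_<tag>_<name> entries found in
 * /proc/kallsyms, so samples in already-loaded programs resolve by name.
 */
#include <stdio.h>
#include <string.h>

static void synthesize_bpf_prog_symbols(void)
{
        char line[512];
        FILE *f = fopen("/proc/kallsyms", "r");

        if (!f)
                return;

        while (fgets(line, sizeof(line), f)) {
                unsigned long long addr;
                char type, name[256];

                if (sscanf(line, "%llx %c %255s", &addr, &type, name) != 3)
                        continue;
                if (strncmp(name, "bpf_prog_", 9))
                        continue;
                /* hand (addr, name) to the symbol machinery here */
                printf("bpf symbol %s at 0x%llx\n", name, addr);
        }
        fclose(f);
}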
>
>>> That is; I would really like this fancy stuff to be an optional extra
>>> that is typically not needed.
>>>
>>> Does that make sense?
>>
>> (1) above is always enabled with this set. I added option no-bpf-events
>> to disable (2). I guess you prefer the (2) is disabled by default, and
>> enabled with an option?
>
> I'm saying neither should be default enabled. Instead we should do
> recording and tracking by default.
>
> That gets people symbol information on BPF stuff, which is likely all
> they ever need.
How about we extend PERF_RECORD_BPF_EVENT with basic symbol information
(name, addr, length)? By default, we just record these events in the
ring buffer, just like mmap events. This (plus scanning kallsyms before
record) will enable symbols for all BPF programs.
For users who need more detail, we add an option that gathers the extra
information (annotated asm, etc.) via a dedicated dummy event, thread,
and ring buffer.
Thanks,
Song
On Mon, Nov 26, 2018 at 03:50:04PM +0100, Peter Zijlstra wrote:
> Now, I'm not saying this patch set is useless; but I'm saying most
> people should not need this, and it is massive overkill for the needs of
> most people.
So, the comparison is sort of with kernel modules, which can come and go
while you're profiling/tracing. If that happens, then samples, in post
processing, are not resolvable, and that is the case for kernel modules
right now. Sure, you're right, that doesn't happen so frequently, so
nobody hollered at us so far (thankfully that is now verboten ;-)).
You need to have the load-kernel-bin/unload-kernel-bin events recorded,
and you need to somehow match those addresses to some symtab/src (for
people that want to have src mixed up with assembly), and you need that
jitted code, with timestamps of when it was loaded and when it was
unloaded.
People doing post-processing analysis of weird problems need all those
details.
Now I don't know how frequently those binary blobs get loaded/unloaded
in the brave new world of eBPF, but for completeness' sake, we need
those load/unload events and we need to grab a copy of the raw jitted
binary, etc.
- Arnaldo
> On Nov 26, 2018, at 7:27 AM, Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> Em Mon, Nov 26, 2018 at 03:50:04PM +0100, Peter Zijlstra escreveu:
>> Now, I'm not saying this patch set is useless; but I'm saying most
>> people should not need this, and it is massive overkill for the needs of
>> most people.
>
> So, the comparision is sort of with kernel modules, that can come and go
> while you're profiling/tracing, if that happens, then samples, in post
> processing, are not resolvable, and that is the case for kernel modules
> right now. Sure, you're right, that doesn't happen so frequently, so
> nobody hollered (thankfully that is now verbotten ;-)) at us so far.
>
> You need to have the load-kernel-bin/unload-kernel-bin events recorded,
> and you need to somehow match those addresses to some symtab/src(for
> people that want to have src mixed up with assembly) and you need that
> jitted code, with timestamps of when it was loaded and it was unloaded.
>
> People doing post processing analysis of weird problems need all those
> details.
>
> Now I don't know how frequently those binary blobs gets loaded/unloaded
> in the brave new world of eBPF, but for completeness sake, we need those
> load/unload events and we need to grab a copy of the raw jitted binary,
> etc.
BPF programs get loaded/unloaded more often than kernel modules. This is
because the BPF verifier makes sure a BPF program can never crash the
kernel, so BPF programs can often be debugged like user space apps.
A typical debug/tune cycle looks like:
perf record -- ./test_work
perf report
# make changes to bpf program and repeat
or even:
perf record -o perf_1.data -- ./test_v1
perf record -o perf_2.data -- ./test_v2
perf record -o perf_3.data -- ./test_v3
# perf report in 3 terminals, and compare the output
In such use cases, including all the details in perf.data is very
helpful, as /proc/kcore no longer has details of the program by the
time perf-report runs.
Thanks,
Song
> On Nov 26, 2018, at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Nov 22, 2018 at 06:13:32PM +0000, Song Liu wrote:
>> Hi Peter,
>>
>>> On Nov 22, 2018, at 1:32 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Wed, Nov 21, 2018 at 11:54:57AM -0800, Song Liu wrote:
>>>> Changes RFC -> PATCH v1:
>>>>
>>>> 1. In perf-record, poll vip events in a separate thread;
>>>> 2. Add tag to bpf prog name;
>>>> 3. Small refactorings.
>>>>
>>>> Original cover letter (with minor revisions):
>>>>
>>>> This is to follow up Alexei's early effort to show bpf programs
>
>>>> In this version, PERF_RECORD_BPF_EVENT is introduced to send real time BPF
>>>> load/unload events to user space. In user space, perf-record is modified
>>>> to listen to these events (through a dedicated ring buffer) and generate
>>>> detailed information about the program (struct bpf_prog_info_event). Then,
>>>> perf-report translates these events into proper symbols.
>>>>
>>>> With this set, perf-report will show bpf program as:
>>>>
>>>> 18.49% 0.16% test [kernel.vmlinux] [k] ksys_write
>>>> 18.01% 0.47% test [kernel.vmlinux] [k] vfs_write
>>>> 17.02% 0.40% test bpf_prog [k] bpf_prog_07367f7ba80df72b_
>>>> 16.97% 0.10% test [kernel.vmlinux] [k] __vfs_write
>>>> 16.86% 0.12% test [kernel.vmlinux] [k] comm_write
>>>> 16.67% 0.39% test [kernel.vmlinux] [k] bpf_probe_read
>>>>
>>>> Note that, the program name is still work in progress, it will be cleaner
>>>> with function types in BTF.
>>>>
>>>> Please share your comments on this.
>>>
>>> So I see:
>>>
>>> kernel/bpf/core.c:void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>>>
>>> which should already provide basic symbol information for extant eBPF
>>> programs, right?
>>
>> Right, if the BPF program is still loaded when perf-report runs, symbols
>> are available.
>
> Good, that is not something that was clear. The Changelog seems to imply
> we need this new stuff in order to observe symbols.
>
>>> And (AFAIK) perf uses /proc/kcore for annotate on the current running
>>> kernel (if not, it really should, given alternatives, jump_labels and
>>> all other other self-modifying code).
>>>
>>> So this fancy new stuff is only for the case where your profile spans
>>> eBPF load/unload events (which should be relatively rare in the normal
>>> case, right), or when you want source annotated asm output (I normally
>>> don't bother with that).
>>
>> This patch set adds two pieces of information:
>> 1. At the beginning of perf-record, save info of existing BPF programs;
>> 2. Gather information of BPF programs load/unload during perf-record.
>>
>> (1) is all in user space. It is necessary to show symbols of BPF program
>> that are unloaded _after_ perf-record. (2) needs PERF_RECORD_BPF_EVENT
>> from the ring buffer. It covers BPF program loaded during perf-record
>> (perf record -- bpf_test).
>
> I'm saying that if you given them symbols; most people won't need any of
> that ever.
>
> And just tracking kallsyms is _much_ cheaper than either 1 or 2. Alexei
> was talking fairly big amounts of data per BPF prog. Dumping and saving
> that sounds like pointless overhead for 99% of the users.
If annotation is not needed, we can reduce the amount of data per BPF
prog by not requesting the JITed binaries. Would a flag to tune this
address the concern about dumping and saving too much data?
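To make that concrete, here is a sketch (again, an illustration rather
than the actual perf code) of how the JITed image can be skipped:
bpf_obj_get_info_by_fd() only copies into buffers the caller points at,
so leaving jited_prog_insns at zero keeps the per-prog data small.
want_jited_image below is a stand-in for whatever option ends up
controlling this:

/* Sketch: fetch bpf_prog_info; pull the JITed image only when annotation
 * is requested. The first call learns the sizes, the second copies the body.
 */
#include <bpf/bpf.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <stdlib.h>

static int fetch_prog_info(int fd, int want_jited_image)
{
        struct bpf_prog_info info = {};
        __u32 len = sizeof(info);
        void *jited = NULL;

        if (bpf_obj_get_info_by_fd(fd, &info, &len))
                return -1;

        if (want_jited_image && info.jited_prog_len) {
                jited = calloc(1, info.jited_prog_len);
                if (!jited)
                        return -1;
                /* jited_prog_len already holds the size reported above */
                info.jited_prog_insns = (__u64)(uintptr_t)jited;
                if (bpf_obj_get_info_by_fd(fd, &info, &len)) {
                        free(jited);
                        return -1;
                }
        }
        /* write info (and the JITed image, if any) into perf.data here */
        free(jited);
        return 0;
}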
Thanks,
Song
> On Nov 26, 2018, at 12:00 PM, Song Liu <[email protected]> wrote:
>
>
>
>> On Nov 26, 2018, at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> On Thu, Nov 22, 2018 at 06:13:32PM +0000, Song Liu wrote:
>>> Hi Peter,
>>>
>>>> On Nov 22, 2018, at 1:32 AM, Peter Zijlstra <[email protected]> wrote:
>>>>
>>>> On Wed, Nov 21, 2018 at 11:54:57AM -0800, Song Liu wrote:
>>>>> Changes RFC -> PATCH v1:
>>>>>
>>>>> 1. In perf-record, poll vip events in a separate thread;
>>>>> 2. Add tag to bpf prog name;
>>>>> 3. Small refactorings.
>>>>>
>>>>> Original cover letter (with minor revisions):
>>>>>
>>>>> This is to follow up Alexei's early effort to show bpf programs
>>
>>>>> In this version, PERF_RECORD_BPF_EVENT is introduced to send real time BPF
>>>>> load/unload events to user space. In user space, perf-record is modified
>>>>> to listen to these events (through a dedicated ring buffer) and generate
>>>>> detailed information about the program (struct bpf_prog_info_event). Then,
>>>>> perf-report translates these events into proper symbols.
>>>>>
>>>>> With this set, perf-report will show bpf program as:
>>>>>
>>>>> 18.49% 0.16% test [kernel.vmlinux] [k] ksys_write
>>>>> 18.01% 0.47% test [kernel.vmlinux] [k] vfs_write
>>>>> 17.02% 0.40% test bpf_prog [k] bpf_prog_07367f7ba80df72b_
>>>>> 16.97% 0.10% test [kernel.vmlinux] [k] __vfs_write
>>>>> 16.86% 0.12% test [kernel.vmlinux] [k] comm_write
>>>>> 16.67% 0.39% test [kernel.vmlinux] [k] bpf_probe_read
>>>>>
>>>>> Note that, the program name is still work in progress, it will be cleaner
>>>>> with function types in BTF.
>>>>>
>>>>> Please share your comments on this.
>>>>
>>>> So I see:
>>>>
>>>> kernel/bpf/core.c:void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>>>>
>>>> which should already provide basic symbol information for extant eBPF
>>>> programs, right?
>>>
>>> Right, if the BPF program is still loaded when perf-report runs, symbols
>>> are available.
>>
>> Good, that is not something that was clear. The Changelog seems to imply
>> we need this new stuff in order to observe symbols.
>
> I will clarify this in next version.
>
>>
>>>> And (AFAIK) perf uses /proc/kcore for annotate on the current running
>>>> kernel (if not, it really should, given alternatives, jump_labels and
>>>> all other other self-modifying code).
>>>>
>>>> So this fancy new stuff is only for the case where your profile spans
>>>> eBPF load/unload events (which should be relatively rare in the normal
>>>> case, right), or when you want source annotated asm output (I normally
>>>> don't bother with that).
>>>
>>> This patch set adds two pieces of information:
>>> 1. At the beginning of perf-record, save info of existing BPF programs;
>>> 2. Gather information of BPF programs load/unload during perf-record.
>>>
>>> (1) is all in user space. It is necessary to show symbols of BPF program
>>> that are unloaded _after_ perf-record. (2) needs PERF_RECORD_BPF_EVENT
>>> from the ring buffer. It covers BPF program loaded during perf-record
>>> (perf record -- bpf_test).
>>
>> I'm saying that if you given them symbols; most people won't need any of
>> that ever.
>>
>> And just tracking kallsyms is _much_ cheaper than either 1 or 2. Alexei
>> was talking fairly big amounts of data per BPF prog. Dumping and saving
>> that sounds like pointless overhead for 99% of the users.
>
> Due to the kernel-module-like natural of BPF program, I think it is still
> necessary to cover cases that BPF programs are unloaded when perf-record
> runs. How about we add another step that scans all bpf_prog_XXXX from
> kallsyms, and synthesizes symbols for them?
>
>>
>>>> That is; I would really like this fancy stuff to be an optional extra
>>>> that is typically not needed.
>>>>
>>>> Does that make sense?
>>>
>>> (1) above is always enabled with this set. I added option no-bpf-events
>>> to disable (2). I guess you prefer the (2) is disabled by default, and
>>> enabled with an option?
>>
>> I'm saying neither should be default enabled. Instead we should do
>> recording and tracking by default.
>>
>> That gets people symbol information on BPF stuff, which is likely all
>> they ever need.
>
> How about we extend PERF_RECORD_BPF_EVENT with basic symbol information
> (name, addr, length)? By default, we just record these events in the ring
> buffer, just like mmap events. This (plus scanning kallsyms before record)
> will enable symbols for all BPF programs.
>
> For more information, we add an option to enable more information
> (annotated asm, etc.), with dedicated dummy event, thread, and ring buffer.
>
After syncing with Alexei offline, I realized this won't work cleanly.
My initial idea was to extend PERF_RECORD_BPF_EVENT like this:
struct bpf_event {
        struct perf_event_header header;
        u16 type;
        u16 flags;
        u32 id; /* bpf prog_id */
        u64 start_addr;
        u32 length;
        char name[KSYM_NAME_LEN];
};
This is a structure with variable length, which is OK.
However, I missed the fact that each BPF program could have up to 256
sub-programs. Fitting that into a single bpf_event needs a pretty ugly
hack:
struct bpf_sub_prog {
        u64 start_addr;
        u32 length;
        u32 name_len; /* length of name; struct bpf_event needs it */
        char name[KSYM_NAME_LEN];
};
struct bpf_event {
        struct perf_event_header header;
        u16 type;
        u16 flags;
        u32 id; /* bpf prog_id */
        u64 num_of_sub_prog;
        struct bpf_sub_prog sub_progs[];
};
In this case, bpf_sub_prog has variable length, thus struct bpf_event
has a variable number of variable-length members. It is not impossible
to handle this, but it will be ugly for sure.
One alternative to this approach is to keep one sub-program per
bpf_event, but generate multiple bpf_events (one for each sub-program).
Another solution is to generate separate records (same or different
PERF_RECORD_*) for bpf_event and bpf_sub_prog.
However, I don't think any of these yields as clean a perf ABI as the
current version.
In summary, I can think of the following directions:
1. Current design.
   Pros: simple perf ABI;
   Cons: more work in the perf tool.
2. One bpf_event per BPF prog, with a variable number of variable-length
   members.
   Pros: less work for the perf tool, if annotation is not needed;
   Cons: ugly perf ABI.
3. Multiple bpf_events per BPF prog (one per sub-program).
   Pros: less work for the perf tool, if annotation is not needed;
   Cons: need to coordinate multiple records (same prog_id, different
   sub-prog).
4. Separate bpf_event and bpf_sub_prog_event.
   Pros: less work for the perf tool, if annotation is not needed;
   Cons: more complicated perf ABI (two PERF_RECORD_* types, or a more
   complicated encoding for the same PERF_RECORD_*).
Overall, I think the current design gives the simplest ABI, and it keeps
the flexibility to tune what information the perf tool gets from the bpf
syscalls.
Peter and Arnaldo, what's your recommendation on these directions (or other
better alternatives)?
Thanks,
Song