This set catches symbols of all bpf programs loaded/unloaded
before/during/after a perf-record run, with two new perf record types:
PERF_RECORD_KSYMBOL and PERF_RECORD_BPF_EVENT.
PERF_RECORD_KSYMBOL and PERF_RECORD_BPF_EVENT include key information
about bpf program loads and unloads. They are sent through the perf
ring buffer and stored in perf.data. PERF_RECORD_KSYMBOL includes the
basic information needed for simple profiling, and is ON by default.
PERF_RECORD_BPF_EVENT carries the information (id, tag) needed to
gather more details about the bpf program; it is necessary for
perf-annotate of bpf programs.
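For reference, here are the layouts of the two new record types, as
defined in the uapi changes in the patches below:

  struct {				/* PERF_RECORD_KSYMBOL */
  	struct perf_event_header header;
  	u64	addr;
  	u64	len;
  	char	name[];
  	struct sample_id sample_id;
  };

  struct {				/* PERF_RECORD_BPF_EVENT */
  	struct perf_event_header header;
  	u16	type;	/* enum perf_bpf_event_type */
  	u16	flags;
  	u32	id;	/* prog id, usable with sys_bpf */
  	u8	tag[BPF_TAG_SIZE];
  	struct sample_id sample_id;
  };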
Before this set, perf-report is not able to recover symbols of bpf
programs once the programs are unloaded.
This is a follow-up to Alexei's earlier effort [2] to show bpf
programs via mmap events.
Thanks,
Song
Changes v4 -> PATCH v5:
1. Fixed build error reported by the kbuild test robot.
Changes v3 -> PATCH v4:
1. Split the information about a bpf program into PERF_RECORD_KSYMBOL
(with name, addr, len) and PERF_RECORD_BPF_EVENT (with id, tag);
2. Split the implementation between kernel and user space.
Changes v2 -> PATCH v3:
1. Rebase on the bpf-next tree, on top of the BPF sub program tag
patches [1], for the latest information in bpf_prog_info.
2. Complete handling and synthesizing of PERF_RECORD_BPF_EVENT in perf.
Changes v1 -> PATCH v2:
1. Only 3 of the 5 patches from v1, to focus on the ABI first;
2. Generate PERF_RECORD_BPF_EVENT per bpf sub program instead of per prog;
3. Modify PERF_RECORD_BPF_EVENT with more details (addr, len, name),
so that it can be used for basic profiling without calling sys_bpf.
Changes RFC -> PATCH v1:
1. In perf-record, poll vip events in a separate thread;
2. Add tag to bpf prog name;
3. Small refactorings.
[1] https://patchwork.ozlabs.org/project/netdev/list/?series=81037
[2] https://www.spinics.net/lists/netdev/msg524232.html
Song Liu (7):
perf, bpf: Introduce PERF_RECORD_KSYMBOL
sync tools/include/uapi/linux/perf_event.h
perf, bpf: introduce PERF_RECORD_BPF_EVENT
sync tools/include/uapi/linux/perf_event.h
perf util: handle PERF_RECORD_KSYMBOL
perf util: handle PERF_RECORD_BPF_EVENT
perf tools: synthesize PERF_RECORD_* for loaded BPF programs
include/linux/filter.h | 7 +
include/linux/perf_event.h | 18 +++
include/uapi/linux/perf_event.h | 50 +++++-
kernel/bpf/core.c | 2 +-
kernel/bpf/syscall.c | 2 +
kernel/events/core.c | 218 ++++++++++++++++++++++++-
tools/include/uapi/linux/perf_event.h | 50 +++++-
tools/perf/builtin-record.c | 7 +
tools/perf/perf.h | 1 +
tools/perf/util/Build | 2 +
tools/perf/util/bpf-event.c | 219 ++++++++++++++++++++++++++
tools/perf/util/bpf-event.h | 16 ++
tools/perf/util/event.c | 40 +++++
tools/perf/util/event.h | 34 ++++
tools/perf/util/evsel.c | 19 +++
tools/perf/util/evsel.h | 2 +
tools/perf/util/machine.c | 60 +++++++
tools/perf/util/machine.h | 5 +-
tools/perf/util/session.c | 8 +
tools/perf/util/tool.h | 5 +-
20 files changed, 759 insertions(+), 6 deletions(-)
create mode 100644 tools/perf/util/bpf-event.c
create mode 100644 tools/perf/util/bpf-event.h
--
2.17.1
For better performance analysis of BPF programs, this patch introduces
PERF_RECORD_BPF_EVENT, a new perf_event_type that exposes BPF program
load/unload information to user space.
Each BPF program may contain up to BPF_MAX_SUBPROGS (256) sub programs.
The following example shows kernel symbols for a BPF program with 7
sub programs:
ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F
ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F
ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F
ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F
ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F
ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F
ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F
ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi
When a bpf program is loaded, PERF_RECORD_KSYMBOL is generated for
each of these sub programs. Therefore, PERF_RECORD_BPF_EVENT is not
needed for simple profiling.
For annotation, user space needs to listen to PERF_RECORD_BPF_EVENT
and gather more information about these (sub) programs via sys_bpf.
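As a rough sketch of that flow (using the libbpf helpers this series
builds on; the handler name, error handling, and the exact fields
consumed are illustrative):

  #include <unistd.h>
  #include <bpf/bpf.h>

  /*
   * On PERF_BPF_EVENT_PROG_LOAD, use the prog id carried in the
   * record to query detailed info via sys_bpf.
   */
  static void handle_bpf_prog_load(__u32 id)
  {
  	struct bpf_prog_info info = {};
  	__u32 info_len = sizeof(info);
  	int fd;

  	fd = bpf_prog_get_fd_by_id(id);	/* may fail if prog is gone */
  	if (fd < 0)
  		return;
  	if (!bpf_obj_get_info_by_fd(fd, &info, &info_len)) {
  		/* info.name, info.jited_prog_len, etc. are now valid */
  	}
  	close(fd);
  }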
Signed-off-by: Song Liu <[email protected]>
---
include/linux/filter.h | 7 ++
include/linux/perf_event.h | 6 ++
include/uapi/linux/perf_event.h | 29 +++++++-
kernel/bpf/core.c | 2 +-
kernel/bpf/syscall.c | 2 +
kernel/events/core.c | 120 ++++++++++++++++++++++++++++++++
6 files changed, 164 insertions(+), 2 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 537e9e7c6e6f..45d23560f90b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -955,6 +955,7 @@ bpf_address_lookup(unsigned long addr, unsigned long *size,
void bpf_prog_kallsyms_add(struct bpf_prog *fp);
void bpf_prog_kallsyms_del(struct bpf_prog *fp);
+void bpf_get_prog_name(const struct bpf_prog *prog, char *sym);
#else /* CONFIG_BPF_JIT */
@@ -1010,6 +1011,12 @@ static inline void bpf_prog_kallsyms_add(struct bpf_prog *fp)
static inline void bpf_prog_kallsyms_del(struct bpf_prog *fp)
{
}
+
+static inline void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
+{
+ sym[0] = '\0';
+}
+
#endif /* CONFIG_BPF_JIT */
void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 037863e69bb2..8ea0ce650c6f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1118,6 +1118,9 @@ extern void perf_event_mmap(struct vm_area_struct *vma);
typedef int (perf_ksymbol_get_name_f)(char *name, int name_len, void *data);
extern void perf_event_ksymbol(int type, u64 addr, u64 len, bool unregister,
perf_ksymbol_get_name_f get_name, void *data);
+extern void perf_event_bpf_event(struct bpf_prog *prog,
+ enum perf_bpf_event_type type,
+ u16 flags);
extern struct perf_guest_info_callbacks *perf_guest_cbs;
extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
@@ -1345,6 +1348,9 @@ static inline void perf_event_ksymbol(int type, u64 addr, u64 len,
bool unregister,
perf_ksymbol_get_name_f get_name,
void *data) { }
+static inline void perf_event_bpf_event(struct bpf_prog *prog,
+ enum perf_bpf_event_type type,
+ u16 flags) { }
static inline void perf_event_exec(void) { }
static inline void perf_event_comm(struct task_struct *tsk, bool exec) { }
static inline void perf_event_namespaces(struct task_struct *tsk) { }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 6c9e327e87ed..68db04058408 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -373,7 +373,8 @@ struct perf_event_attr {
write_backward : 1, /* Write ring buffer from end to beginning */
namespaces : 1, /* include namespaces data */
ksymbol : 1, /* include ksymbol events */
- __reserved_1 : 34;
+ bpf_event : 1, /* include bpf events */
+ __reserved_1 : 33;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -986,9 +987,35 @@ enum perf_event_type {
*/
PERF_RECORD_KSYMBOL = 17,
+ /*
+ * Record bpf events:
+ * enum perf_bpf_event_type {
+ * PERF_BPF_EVENT_UNKNOWN = 0,
+ * PERF_BPF_EVENT_PROG_LOAD = 1,
+ * PERF_BPF_EVENT_PROG_UNLOAD = 2,
+ * };
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u16 type;
+ * u16 flags;
+ * u32 id;
+ * u8 tag[BPF_TAG_SIZE];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_BPF_EVENT = 18,
+
PERF_RECORD_MAX, /* non-ABI */
};
+enum perf_bpf_event_type {
+ PERF_BPF_EVENT_UNKNOWN = 0,
+ PERF_BPF_EVENT_PROG_LOAD = 1,
+ PERF_BPF_EVENT_PROG_UNLOAD = 2,
+ PERF_BPF_EVENT_MAX, /* non-ABI */
+};
+
#define PERF_MAX_STACK_DEPTH 127
#define PERF_MAX_CONTEXTS_PER_STACK 8
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5cdd8da0e7f2..2a8364294f11 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -496,7 +496,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
*symbol_end = addr + hdr->pages * PAGE_SIZE;
}
-static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
+void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
{
const char *end = sym + KSYM_NAME_LEN;
const struct btf_type *type;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0607db304def..4af63c8c95eb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1211,6 +1211,7 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
{
if (atomic_dec_and_test(&prog->aux->refcnt)) {
+ perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0);
/* bpf_prog_free_id() must be called first */
bpf_prog_free_id(prog, do_idr_lock);
bpf_prog_kallsyms_del_all(prog);
@@ -1554,6 +1555,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
}
bpf_prog_kallsyms_add(prog);
+ perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_LOAD, 0);
return err;
free_used_maps:
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c0ac6dee367c..04feb2b28c46 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -386,6 +386,7 @@ static atomic_t nr_task_events __read_mostly;
static atomic_t nr_freq_events __read_mostly;
static atomic_t nr_switch_events __read_mostly;
static atomic_t nr_ksymbol_events __read_mostly;
+static atomic_t nr_bpf_events __read_mostly;
static LIST_HEAD(pmus);
static DEFINE_MUTEX(pmus_lock);
@@ -4308,6 +4309,8 @@ static void unaccount_event(struct perf_event *event)
dec = true;
if (event->attr.ksymbol)
atomic_dec(&nr_ksymbol_events);
+ if (event->attr.bpf_event)
+ atomic_dec(&nr_bpf_events);
if (dec) {
if (!atomic_add_unless(&perf_sched_count, -1, 1))
@@ -7744,6 +7747,121 @@ void perf_event_ksymbol(int ksym_type, u64 addr, u64 len, bool unregister,
WARN_ONCE(1, "%s: Invalid KSYMBOL type 0x%x\n", __func__, ksym_type);
}
+/*
+ * bpf program load/unload tracking
+ */
+
+struct perf_bpf_event {
+ struct bpf_prog *prog;
+ struct {
+ struct perf_event_header header;
+ u16 type;
+ u16 flags;
+ u32 id;
+ u8 tag[BPF_TAG_SIZE];
+ } event_id;
+};
+
+static int perf_event_bpf_match(struct perf_event *event)
+{
+ return event->attr.bpf_event;
+}
+
+static void perf_event_bpf_output(struct perf_event *event, void *data)
+{
+ struct perf_bpf_event *bpf_event = data;
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ if (!perf_event_bpf_match(event))
+ return;
+
+ perf_event_header__init_id(&bpf_event->event_id.header,
+ &sample, event);
+ ret = perf_output_begin(&handle, event,
+ bpf_event->event_id.header.size);
+ if (ret)
+ return;
+
+ perf_output_put(&handle, bpf_event->event_id);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
+static int perf_event_bpf_get_name(char *name, int len, void *data)
+{
+ struct bpf_prog *prog = data;
+
+ bpf_get_prog_name(prog, name);
+ return 0;
+}
+
+static void perf_event_bpf_emit_ksymbols(struct bpf_prog *prog,
+ enum perf_bpf_event_type type)
+{
+ bool unregister = type == PERF_BPF_EVENT_PROG_UNLOAD;
+ int i;
+
+ if (prog->aux->func_cnt == 0) {
+ perf_event_ksymbol(PERF_RECORD_MISC_KSYMBOL_TYPE_BPF,
+ (u64)(unsigned long)prog->bpf_func,
+ prog->jited_len, unregister,
+ perf_event_bpf_get_name, prog);
+ } else {
+ for (i = 0; i < prog->aux->func_cnt; i++) {
+ struct bpf_prog *subprog = prog->aux->func[i];
+
+ perf_event_ksymbol(
+ PERF_RECORD_MISC_KSYMBOL_TYPE_BPF,
+ (u64)(unsigned long)subprog->bpf_func,
+ subprog->jited_len, unregister,
+ perf_event_bpf_get_name, subprog);
+ }
+ }
+}
+
+void perf_event_bpf_event(struct bpf_prog *prog,
+ enum perf_bpf_event_type type,
+ u16 flags)
+{
+ struct perf_bpf_event bpf_event;
+
+ if (type <= PERF_BPF_EVENT_UNKNOWN ||
+ type >= PERF_BPF_EVENT_MAX)
+ return;
+
+ switch (type) {
+ case PERF_BPF_EVENT_PROG_LOAD:
+ case PERF_BPF_EVENT_PROG_UNLOAD:
+ if (atomic_read(&nr_ksymbol_events))
+ perf_event_bpf_emit_ksymbols(prog, type);
+ break;
+ default:
+ break;
+ }
+
+ if (!atomic_read(&nr_bpf_events))
+ return;
+
+ bpf_event = (struct perf_bpf_event){
+ .prog = prog,
+ .event_id = {
+ .header = {
+ .type = PERF_RECORD_BPF_EVENT,
+ .size = sizeof(bpf_event.event_id),
+ },
+ .type = type,
+ .flags = flags,
+ .id = prog->aux->id,
+ },
+ };
+
+ memcpy(bpf_event.event_id.tag, prog->tag, BPF_TAG_SIZE);
+ perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
+}
+
void perf_event_itrace_started(struct perf_event *event)
{
event->attach_state |= PERF_ATTACH_ITRACE;
@@ -9996,6 +10114,8 @@ static void account_event(struct perf_event *event)
inc = true;
if (event->attr.ksymbol)
atomic_inc(&nr_ksymbol_events);
+ if (event->attr.bpf_event)
+ atomic_inc(&nr_bpf_events);
if (inc) {
/*
--
2.17.1
This patch handles PERF_RECORD_KSYMBOL in perf record/report.
Specifically, a map and a symbol are created on ksymbol register, and
removed on ksymbol unregister.
This patch also sets perf_event_attr.ksymbol properly. The flag is
ON by default.
Signed-off-by: Song Liu <[email protected]>
---
tools/perf/util/event.c | 20 ++++++++++++++
tools/perf/util/event.h | 18 +++++++++++++
tools/perf/util/evsel.c | 9 +++++++
tools/perf/util/evsel.h | 1 +
tools/perf/util/machine.c | 57 +++++++++++++++++++++++++++++++++++++++
tools/perf/util/machine.h | 5 +++-
tools/perf/util/session.c | 4 +++
tools/perf/util/tool.h | 4 ++-
8 files changed, 116 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index e9c108a6b1c3..4b47194eaa4b 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -24,6 +24,7 @@
#include "symbol/kallsyms.h"
#include "asm/bug.h"
#include "stat.h"
+#include "session.h"
static const char *perf_event__names[] = {
[0] = "TOTAL",
@@ -43,6 +44,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_SWITCH] = "SWITCH",
[PERF_RECORD_SWITCH_CPU_WIDE] = "SWITCH_CPU_WIDE",
[PERF_RECORD_NAMESPACES] = "NAMESPACES",
+ [PERF_RECORD_KSYMBOL] = "KSYMBOL",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
@@ -1335,6 +1337,14 @@ int perf_event__process_switch(struct perf_tool *tool __maybe_unused,
return machine__process_switch_event(machine, event);
}
+int perf_event__process_ksymbol(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused,
+ struct machine *machine)
+{
+ return machine__process_ksymbol(machine, event, sample);
+}
+
size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
{
return fprintf(fp, " %d/%d: [%#" PRIx64 "(%#" PRIx64 ") @ %#" PRIx64 "]: %c %s\n",
@@ -1467,6 +1477,13 @@ static size_t perf_event__fprintf_lost(union perf_event *event, FILE *fp)
return fprintf(fp, " lost %" PRIu64 "\n", event->lost.lost);
}
+size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp)
+{
+ return fprintf(fp, " ksymbol event with addr %lx len %lu name %s\n",
+ event->ksymbol_event.addr, event->ksymbol_event.len,
+ event->ksymbol_event.name);
+}
+
size_t perf_event__fprintf(union perf_event *event, FILE *fp)
{
size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -1502,6 +1519,9 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
case PERF_RECORD_LOST:
ret += perf_event__fprintf_lost(event, fp);
break;
+ case PERF_RECORD_KSYMBOL:
+ ret += perf_event__fprintf_ksymbol(event, fp);
+ break;
default:
ret += fprintf(fp, "\n");
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index bfa60bcafbde..100b59f1fdec 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -5,6 +5,7 @@
#include <limits.h>
#include <stdio.h>
#include <linux/kernel.h>
+#include <linux/bpf.h>
#include "../perf.h"
#include "build-id.h"
@@ -84,6 +85,17 @@ struct throttle_event {
u64 stream_id;
};
+#ifndef KSYM_NAME_LEN
+#define KSYM_NAME_LEN 256
+#endif
+
+struct ksymbol_event {
+ struct perf_event_header header;
+ u64 addr;
+ u64 len;
+ char name[KSYM_NAME_LEN];
+};
+
#define PERF_SAMPLE_MASK \
(PERF_SAMPLE_IP | PERF_SAMPLE_TID | \
PERF_SAMPLE_TIME | PERF_SAMPLE_ADDR | \
@@ -651,6 +663,7 @@ union perf_event {
struct stat_round_event stat_round;
struct time_conv_event time_conv;
struct feature_event feat;
+ struct ksymbol_event ksymbol_event;
};
void perf_event__print_totals(void);
@@ -750,6 +763,10 @@ int perf_event__process_exit(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
+int perf_event__process_ksymbol(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine);
int perf_tool__process_synth_event(struct perf_tool *tool,
union perf_event *event,
struct machine *machine,
@@ -814,6 +831,7 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_namespaces(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
size_t perf_event__fprintf(union perf_event *event, FILE *fp);
int kallsyms__get_function_start(const char *kallsyms_filename,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index dbc0466db368..de34ce875648 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1035,6 +1035,7 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
attr->mmap = track;
attr->mmap2 = track && !perf_missing_features.mmap2;
attr->comm = track;
+ attr->ksymbol = track && !perf_missing_features.ksymbol;
if (opts->record_namespaces)
attr->namespaces = track;
@@ -1652,6 +1653,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(context_switch, p_unsigned);
PRINT_ATTRf(write_backward, p_unsigned);
PRINT_ATTRf(namespaces, p_unsigned);
+ PRINT_ATTRf(ksymbol, p_unsigned);
PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned);
PRINT_ATTRf(bp_type, p_unsigned);
@@ -1811,6 +1813,8 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
PERF_SAMPLE_BRANCH_NO_CYCLES);
if (perf_missing_features.group_read && evsel->attr.inherit)
evsel->attr.read_format &= ~(PERF_FORMAT_GROUP|PERF_FORMAT_ID);
+ if (perf_missing_features.ksymbol)
+ evsel->attr.ksymbol = 0;
retry_sample_id:
if (perf_missing_features.sample_id_all)
evsel->attr.sample_id_all = 0;
@@ -1955,6 +1959,11 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
perf_missing_features.exclude_guest = true;
pr_debug2("switching off exclude_guest, exclude_host\n");
goto fallback_missing_features;
+ } else if (!perf_missing_features.ksymbol &&
+ evsel->attr.ksymbol) {
+ perf_missing_features.ksymbol = true;
+ pr_debug2("switching off ksymbol\n");
+ goto fallback_missing_features;
} else if (!perf_missing_features.sample_id_all) {
perf_missing_features.sample_id_all = true;
pr_debug2("switching off sample_id_all\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 3147ca76c6fc..c647c1963f7a 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -168,6 +168,7 @@ struct perf_missing_features {
bool lbr_flags;
bool write_backward;
bool group_read;
+ bool ksymbol;
};
extern struct perf_missing_features perf_missing_features;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 8f36ce813bc5..5b50442d77c9 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -681,6 +681,61 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
return 0;
}
+static int machine__process_ksymbol_register(
+ struct machine *machine,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused)
+{
+ struct symbol *sym;
+ struct map *map;
+
+ map = map_groups__find(&machine->kmaps, event->ksymbol_event.addr);
+ if (!map) {
+ map = dso__new_map("bpf_prog");
+ if (!map)
+ return -ENOMEM;
+
+ map->start = event->ksymbol_event.addr;
+ map->pgoff = map->start;
+ map->end = map->start + event->ksymbol_event.len;
+ map_groups__insert(&machine->kmaps, map);
+ }
+
+ sym = symbol__new(event->ksymbol_event.addr, event->ksymbol_event.len,
+ 0, 0, event->ksymbol_event.name);
+ if (!sym)
+ return -ENOMEM;
+ dso__insert_symbol(map->dso, sym);
+ return 0;
+}
+
+static int machine__process_ksymbol_unregister(
+ struct machine *machine,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused)
+{
+ struct map *map;
+
+ map = map_groups__find(&machine->kmaps, event->ksymbol_event.addr);
+ if (map)
+ map_groups__remove(&machine->kmaps, map);
+
+ return 0;
+}
+
+int machine__process_ksymbol(struct machine *machine __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample)
+{
+ if (dump_trace)
+ perf_event__fprintf_ksymbol(event, stderr);
+
+ if (event->header.misc & PERF_RECORD_MISC_KSYMBOL_UNREGISTER)
+ return machine__process_ksymbol_unregister(machine, event,
+ sample);
+ return machine__process_ksymbol_register(machine, event, sample);
+}
+
static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
{
const char *dup_filename;
@@ -1812,6 +1867,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
case PERF_RECORD_SWITCH:
case PERF_RECORD_SWITCH_CPU_WIDE:
ret = machine__process_switch_event(machine, event); break;
+ case PERF_RECORD_KSYMBOL:
+ ret = machine__process_ksymbol(machine, event, sample); break;
default:
ret = -1;
break;
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index d856b85862e2..e92728fe66cd 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -127,9 +127,12 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
struct perf_sample *sample);
int machine__process_mmap2_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
+
int machine__process_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
-
+int machine__process_ksymbol(struct machine *machine,
+ union perf_event *event,
+ struct perf_sample *sample);
typedef void (*machine__process_t)(struct machine *machine, void *data);
struct machines {
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 7d2c8ce6cfad..2c1976a66164 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -371,6 +371,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->itrace_start = perf_event__process_itrace_start;
if (tool->context_switch == NULL)
tool->context_switch = perf_event__process_switch;
+ if (tool->ksymbol == NULL)
+ tool->ksymbol = perf_event__process_ksymbol;
if (tool->read == NULL)
tool->read = process_event_sample_stub;
if (tool->throttle == NULL)
@@ -1300,6 +1302,8 @@ static int machines__deliver_event(struct machines *machines,
case PERF_RECORD_SWITCH:
case PERF_RECORD_SWITCH_CPU_WIDE:
return tool->context_switch(tool, event, sample, machine);
+ case PERF_RECORD_KSYMBOL:
+ return tool->ksymbol(tool, event, sample, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 56e4ca54020a..9c81ca2f3cf7 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -53,7 +53,9 @@ struct perf_tool {
itrace_start,
context_switch,
throttle,
- unthrottle;
+ unthrottle,
+ ksymbol;
+
event_attr_op attr;
event_attr_op event_update;
event_op2 tracing_data;
--
2.17.1
sync changes for PERF_RECORD_KSYMBOL
Signed-off-by: Song Liu <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 9de8780ac8d9..6c9e327e87ed 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -372,7 +372,8 @@ struct perf_event_attr {
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end to beginning */
namespaces : 1, /* include namespaces data */
- __reserved_1 : 35;
+ ksymbol : 1, /* include ksymbol events */
+ __reserved_1 : 34;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -648,11 +649,18 @@ struct perf_event_mmap_page {
* PERF_RECORD_MISC_COMM_EXEC - PERF_RECORD_COMM event
* PERF_RECORD_MISC_FORK_EXEC - PERF_RECORD_FORK event (perf internal)
* PERF_RECORD_MISC_SWITCH_OUT - PERF_RECORD_SWITCH* events
+ * PERF_RECORD_MISC_KSYMBOL_* - PERF_RECORD_KSYMBOL event
*/
#define PERF_RECORD_MISC_MMAP_DATA (1 << 13)
#define PERF_RECORD_MISC_COMM_EXEC (1 << 13)
#define PERF_RECORD_MISC_FORK_EXEC (1 << 13)
#define PERF_RECORD_MISC_SWITCH_OUT (1 << 13)
+
+#define PERF_RECORD_MISC_KSYMBOL_UNREGISTER (1 << 3)
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_MASK (7 << 4)
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_UNKNOWN (0 << 4)
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_BPF (1 << 4)
+
/*
* These PERF_RECORD_MISC_* flags below are safely reused
* for the following events:
@@ -965,6 +973,19 @@ enum perf_event_type {
*/
PERF_RECORD_NAMESPACES = 16,
+ /*
+ * Record ksymbol register/unregister events:
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 addr;
+ * u64 len;
+ * char name[];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_KSYMBOL = 17,
+
PERF_RECORD_MAX, /* non-ABI */
};
--
2.17.1
For better performance analysis of dynamically JITed and loaded kernel
functions, such as BPF programs, this patch introduces
PERF_RECORD_KSYMBOL, a new perf_event_type that exposes kernel symbol
register/unregister information to user space.
The following data structure is used for PERF_RECORD_KSYMBOL.
/*
* struct {
* struct perf_event_header header;
* u64 addr;
* u64 len;
* char name[];
* struct sample_id sample_id;
* };
*/
PERF_RECORD_KSYMBOL uses 4 more bits (bits 3-6) in header.misc:
bit 3 differentiates register vs. unregister, and bits 4-6 encode the
type of the ksymbol. These bits are assigned as follows:
PERF_RECORD_MISC_KSYMBOL_UNREGISTER	(1 << 3)
PERF_RECORD_MISC_KSYMBOL_TYPE_MASK	(7 << 4)
PERF_RECORD_MISC_KSYMBOL_TYPE_UNKNOWN	(0 << 4)
PERF_RECORD_MISC_KSYMBOL_TYPE_BPF	(1 << 4)
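For illustration, a consumer could decode these bits as follows (a
sketch; the local struct name is made up here and simply mirrors the
record layout above):

  #include <linux/perf_event.h>

  /* Sketch: decode a PERF_RECORD_KSYMBOL record from the ring buffer. */
  struct ksymbol_record {
  	struct perf_event_header header;
  	__u64	addr;
  	__u64	len;
  	char	name[];
  };

  static void handle_ksymbol(struct ksymbol_record *rec)
  {
  	int unregister = !!(rec->header.misc &
  			    PERF_RECORD_MISC_KSYMBOL_UNREGISTER);
  	int type = rec->header.misc & PERF_RECORD_MISC_KSYMBOL_TYPE_MASK;

  	if (type != PERF_RECORD_MISC_KSYMBOL_TYPE_BPF)
  		return;
  	if (unregister)
  		; /* drop symbol rec->name at rec->addr */
  	else
  		; /* add symbol rec->name covering [addr, addr + len) */
  }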
Signed-off-by: Song Liu <[email protected]>
---
include/linux/perf_event.h | 12 ++++
include/uapi/linux/perf_event.h | 23 +++++++-
kernel/events/core.c | 98 ++++++++++++++++++++++++++++++++-
3 files changed, 131 insertions(+), 2 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 53c500f0ca79..037863e69bb2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1113,6 +1113,12 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
}
extern void perf_event_mmap(struct vm_area_struct *vma);
+
+/* callback function to generate ksymbol name */
+typedef int (perf_ksymbol_get_name_f)(char *name, int name_len, void *data);
+extern void perf_event_ksymbol(int type, u64 addr, u64 len, bool unregister,
+ perf_ksymbol_get_name_f get_name, void *data);
+
extern struct perf_guest_info_callbacks *perf_guest_cbs;
extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
@@ -1333,6 +1339,12 @@ static inline int perf_unregister_guest_info_callbacks
(struct perf_guest_info_callbacks *callbacks) { return 0; }
static inline void perf_event_mmap(struct vm_area_struct *vma) { }
+
+typedef int (perf_ksymbol_get_name_f)(char *name, int name_len, void *data);
+static inline void perf_event_ksymbol(int type, u64 addr, u64 len,
+ bool unregister,
+ perf_ksymbol_get_name_f get_name,
+ void *data) { }
static inline void perf_event_exec(void) { }
static inline void perf_event_comm(struct task_struct *tsk, bool exec) { }
static inline void perf_event_namespaces(struct task_struct *tsk) { }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9de8780ac8d9..6c9e327e87ed 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -372,7 +372,8 @@ struct perf_event_attr {
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end to beginning */
namespaces : 1, /* include namespaces data */
- __reserved_1 : 35;
+ ksymbol : 1, /* include ksymbol events */
+ __reserved_1 : 34;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -648,11 +649,18 @@ struct perf_event_mmap_page {
* PERF_RECORD_MISC_COMM_EXEC - PERF_RECORD_COMM event
* PERF_RECORD_MISC_FORK_EXEC - PERF_RECORD_FORK event (perf internal)
* PERF_RECORD_MISC_SWITCH_OUT - PERF_RECORD_SWITCH* events
+ * PERF_RECORD_MISC_KSYMBOL_* - PERF_RECORD_KSYMBOL event
*/
#define PERF_RECORD_MISC_MMAP_DATA (1 << 13)
#define PERF_RECORD_MISC_COMM_EXEC (1 << 13)
#define PERF_RECORD_MISC_FORK_EXEC (1 << 13)
#define PERF_RECORD_MISC_SWITCH_OUT (1 << 13)
+
+#define PERF_RECORD_MISC_KSYMBOL_UNREGISTER (1 << 3)
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_MASK (7 << 4)
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_UNKNOWN (0 << 4)
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_BPF (1 << 4)
+
/*
* These PERF_RECORD_MISC_* flags below are safely reused
* for the following events:
@@ -965,6 +973,19 @@ enum perf_event_type {
*/
PERF_RECORD_NAMESPACES = 16,
+ /*
+ * Record ksymbol register/unregister events:
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 addr;
+ * u64 len;
+ * char name[];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_KSYMBOL = 17,
+
PERF_RECORD_MAX, /* non-ABI */
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84530ab358c3..c0ac6dee367c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -385,6 +385,7 @@ static atomic_t nr_namespaces_events __read_mostly;
static atomic_t nr_task_events __read_mostly;
static atomic_t nr_freq_events __read_mostly;
static atomic_t nr_switch_events __read_mostly;
+static atomic_t nr_ksymbol_events __read_mostly;
static LIST_HEAD(pmus);
static DEFINE_MUTEX(pmus_lock);
@@ -4235,7 +4236,7 @@ static bool is_sb_event(struct perf_event *event)
if (attr->mmap || attr->mmap_data || attr->mmap2 ||
attr->comm || attr->comm_exec ||
- attr->task ||
+ attr->task || attr->ksymbol ||
attr->context_switch)
return true;
return false;
@@ -4305,6 +4306,8 @@ static void unaccount_event(struct perf_event *event)
dec = true;
if (has_branch_stack(event))
dec = true;
+ if (event->attr.ksymbol)
+ atomic_dec(&nr_ksymbol_events);
if (dec) {
if (!atomic_add_unless(&perf_sched_count, -1, 1))
@@ -7650,6 +7653,97 @@ static void perf_log_throttle(struct perf_event *event, int enable)
perf_output_end(&handle);
}
+/*
+ * ksymbol register/unregister tracking
+ */
+
+struct perf_ksymbol_event {
+ const char *name;
+ int name_len;
+ struct {
+ struct perf_event_header header;
+ u64 addr;
+ u64 len;
+ } event_id;
+};
+
+static int perf_event_ksymbol_match(struct perf_event *event)
+{
+ return event->attr.ksymbol;
+}
+
+static void perf_event_ksymbol_output(struct perf_event *event, void *data)
+{
+ struct perf_ksymbol_event *ksymbol_event = data;
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ if (!perf_event_ksymbol_match(event))
+ return;
+
+ perf_event_header__init_id(&ksymbol_event->event_id.header,
+ &sample, event);
+ ret = perf_output_begin(&handle, event,
+ ksymbol_event->event_id.header.size);
+ if (ret)
+ return;
+
+ perf_output_put(&handle, ksymbol_event->event_id);
+ __output_copy(&handle, ksymbol_event->name, ksymbol_event->name_len);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
+void perf_event_ksymbol(int ksym_type, u64 addr, u64 len, bool unregister,
+ perf_ksymbol_get_name_f get_name, void *data)
+{
+ struct perf_ksymbol_event ksymbol_event;
+ char name[KSYM_NAME_LEN];
+ int name_len;
+
+ if (!atomic_read(&nr_ksymbol_events))
+ return;
+
+ if (ksym_type & ~PERF_RECORD_MISC_KSYMBOL_TYPE_MASK)
+ goto err;
+
+ switch (ksym_type & PERF_RECORD_MISC_KSYMBOL_TYPE_MASK) {
+ case PERF_RECORD_MISC_KSYMBOL_TYPE_BPF:
+ break;
+ case PERF_RECORD_MISC_KSYMBOL_TYPE_UNKNOWN:
+ default:
+ goto err;
+ }
+
+ get_name(name, KSYM_NAME_LEN, data);
+ name_len = strlen(name) + 1;
+ while (!IS_ALIGNED(name_len, sizeof(u64)))
+ name[name_len++] = '\0';
+ BUILD_BUG_ON(KSYM_NAME_LEN % sizeof(u64));
+
+ ksymbol_event = (struct perf_ksymbol_event){
+ .name = name,
+ .name_len = name_len,
+ .event_id = {
+ .header = {
+ .type = PERF_RECORD_KSYMBOL,
+ .misc = ksym_type,
+ .size = sizeof(ksymbol_event.event_id) +
+ name_len,
+ },
+ .addr = addr,
+ .len = len,
+ },
+ };
+
+ perf_iterate_sb(perf_event_ksymbol_output, &ksymbol_event, NULL);
+ return;
+err:
+ WARN_ONCE(1, "%s: Invalid KSYMBOL type 0x%x\n", __func__, ksym_type);
+}
+
void perf_event_itrace_started(struct perf_event *event)
{
event->attach_state |= PERF_ATTACH_ITRACE;
@@ -9900,6 +9994,8 @@ static void account_event(struct perf_event *event)
inc = true;
if (is_cgroup_event(event))
inc = true;
+ if (event->attr.ksymbol)
+ atomic_inc(&nr_ksymbol_events);
if (inc) {
/*
--
2.17.1
sync for PERF_RECORD_BPF_EVENT
Signed-off-by: Song Liu <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 29 ++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 6c9e327e87ed..68db04058408 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -373,7 +373,8 @@ struct perf_event_attr {
write_backward : 1, /* Write ring buffer from end to beginning */
namespaces : 1, /* include namespaces data */
ksymbol : 1, /* include ksymbol events */
- __reserved_1 : 34;
+ bpf_event : 1, /* include bpf events */
+ __reserved_1 : 33;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -986,9 +987,35 @@ enum perf_event_type {
*/
PERF_RECORD_KSYMBOL = 17,
+ /*
+ * Record bpf events:
+ * enum perf_bpf_event_type {
+ * PERF_BPF_EVENT_UNKNOWN = 0,
+ * PERF_BPF_EVENT_PROG_LOAD = 1,
+ * PERF_BPF_EVENT_PROG_UNLOAD = 2,
+ * };
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u16 type;
+ * u16 flags;
+ * u32 id;
+ * u8 tag[BPF_TAG_SIZE];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_BPF_EVENT = 18,
+
PERF_RECORD_MAX, /* non-ABI */
};
+enum perf_bpf_event_type {
+ PERF_BPF_EVENT_UNKNOWN = 0,
+ PERF_BPF_EVENT_PROG_LOAD = 1,
+ PERF_BPF_EVENT_PROG_UNLOAD = 2,
+ PERF_BPF_EVENT_MAX, /* non-ABI */
+};
+
#define PERF_MAX_STACK_DEPTH 127
#define PERF_MAX_CONTEXTS_PER_STACK 8
--
2.17.1
This patch adds basic handling of PERF_RECORD_BPF_EVENT.
Tracking of PERF_RECORD_BPF_EVENT is OFF by default. A new option,
--bpf-event, is added to turn it on.
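For example, "perf record --bpf-event -- sleep 10" would record bpf
events for programs loaded or unloaded during that window.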
Signed-off-by: Song Liu <[email protected]>
---
tools/perf/builtin-record.c | 1 +
tools/perf/perf.h | 1 +
tools/perf/util/Build | 2 ++
tools/perf/util/bpf-event.c | 15 +++++++++++++++
tools/perf/util/bpf-event.h | 11 +++++++++++
tools/perf/util/event.c | 20 ++++++++++++++++++++
tools/perf/util/event.h | 16 ++++++++++++++++
tools/perf/util/evsel.c | 10 ++++++++++
tools/perf/util/evsel.h | 1 +
tools/perf/util/machine.c | 3 +++
tools/perf/util/session.c | 4 ++++
tools/perf/util/tool.h | 3 ++-
12 files changed, 86 insertions(+), 1 deletion(-)
create mode 100644 tools/perf/util/bpf-event.c
create mode 100644 tools/perf/util/bpf-event.h
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 488779bc4c8d..8983ab2e60dd 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1604,6 +1604,7 @@ static struct option __record_options[] = {
OPT_BOOLEAN(0, "tail-synthesize", &record.opts.tail_synthesize,
"synthesize non-sample events at the end of output"),
OPT_BOOLEAN(0, "overwrite", &record.opts.overwrite, "use overwrite mode"),
+ OPT_BOOLEAN(0, "bpf-event", &record.opts.bpf_event, "record bpf events"),
OPT_BOOLEAN(0, "strict-freq", &record.opts.strict_freq,
"Fail if the specified frequency can't be used"),
OPT_CALLBACK('F', "freq", &record.opts, "freq or 'max'",
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 0ed4a34c74c4..a45a85ab38d4 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -66,6 +66,7 @@ struct record_opts {
bool ignore_missing_thread;
bool strict_freq;
bool sample_id;
+ bool bpf_event;
unsigned int freq;
unsigned int mmap_pages;
unsigned int auxtrace_mmap_pages;
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index b7bf201fe8a8..99455f30c264 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -151,6 +151,8 @@ endif
libperf-y += perf-hooks.o
+libperf-$(CONFIG_LIBBPF) += bpf-event.o
+
libperf-$(CONFIG_CXX) += c++/
CFLAGS_config.o += -DETC_PERFCONFIG="BUILD_STR($(ETC_PERFCONFIG_SQ))"
diff --git a/tools/perf/util/bpf-event.c b/tools/perf/util/bpf-event.c
new file mode 100644
index 000000000000..f24f75506f51
--- /dev/null
+++ b/tools/perf/util/bpf-event.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <errno.h>
+#include <bpf/bpf.h>
+#include "bpf-event.h"
+#include "debug.h"
+#include "symbol.h"
+
+int machine__process_bpf_event(struct machine *machine __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused)
+{
+ if (dump_trace)
+ perf_event__fprintf_bpf_event(event, stderr);
+ return 0;
+}
diff --git a/tools/perf/util/bpf-event.h b/tools/perf/util/bpf-event.h
new file mode 100644
index 000000000000..d5ca355dd298
--- /dev/null
+++ b/tools/perf/util/bpf-event.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __PERF_BPF_EVENT_H
+#define __PERF_BPF_EVENT_H
+
+#include "machine.h"
+
+int machine__process_bpf_event(struct machine *machine,
+ union perf_event *event,
+ struct perf_sample *sample);
+
+#endif
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 4b47194eaa4b..0e2eb3a3eab4 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -25,6 +25,7 @@
#include "asm/bug.h"
#include "stat.h"
#include "session.h"
+#include "bpf-event.h"
static const char *perf_event__names[] = {
[0] = "TOTAL",
@@ -45,6 +46,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_SWITCH_CPU_WIDE] = "SWITCH_CPU_WIDE",
[PERF_RECORD_NAMESPACES] = "NAMESPACES",
[PERF_RECORD_KSYMBOL] = "KSYMBOL",
+ [PERF_RECORD_BPF_EVENT] = "BPF_EVENT",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
@@ -1345,6 +1347,14 @@ int perf_event__process_ksymbol(struct perf_tool *tool __maybe_unused,
return machine__process_ksymbol(machine, event, sample);
}
+int perf_event__process_bpf_event(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused,
+ struct machine *machine)
+{
+ return machine__process_bpf_event(machine, event, sample);
+}
+
size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
{
return fprintf(fp, " %d/%d: [%#" PRIx64 "(%#" PRIx64 ") @ %#" PRIx64 "]: %c %s\n",
@@ -1484,6 +1494,13 @@ size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp)
event->ksymbol_event.name);
}
+size_t perf_event__fprintf_bpf_event(union perf_event *event, FILE *fp)
+{
+ return fprintf(fp, " bpf event with type %u, flags %u, id %u\n",
+ event->bpf_event.type, event->bpf_event.flags,
+ event->bpf_event.id);
+}
+
size_t perf_event__fprintf(union perf_event *event, FILE *fp)
{
size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -1522,6 +1539,9 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
case PERF_RECORD_KSYMBOL:
ret += perf_event__fprintf_ksymbol(event, fp);
break;
+ case PERF_RECORD_BPF_EVENT:
+ ret += perf_event__fprintf_bpf_event(event, fp);
+ break;
default:
ret += fprintf(fp, "\n");
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 100b59f1fdec..eef795e89bef 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -96,6 +96,16 @@ struct ksymbol_event {
char name[KSYM_NAME_LEN];
};
+struct bpf_event {
+ struct perf_event_header header;
+ u16 type;
+ u16 flags;
+ u32 id;
+
+ /* for bpf_prog types */
+ u8 tag[BPF_TAG_SIZE]; // prog tag
+};
+
#define PERF_SAMPLE_MASK \
(PERF_SAMPLE_IP | PERF_SAMPLE_TID | \
PERF_SAMPLE_TIME | PERF_SAMPLE_ADDR | \
@@ -664,6 +674,7 @@ union perf_event {
struct time_conv_event time_conv;
struct feature_event feat;
struct ksymbol_event ksymbol_event;
+ struct bpf_event bpf_event;
};
void perf_event__print_totals(void);
@@ -767,6 +778,10 @@ int perf_event__process_ksymbol(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
+int perf_event__process_bpf_event(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine);
int perf_tool__process_synth_event(struct perf_tool *tool,
union perf_event *event,
struct machine *machine,
@@ -832,6 +847,7 @@ size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_namespaces(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_bpf_event(union perf_event *event, FILE *fp);
size_t perf_event__fprintf(union perf_event *event, FILE *fp);
int kallsyms__get_function_start(const char *kallsyms_filename,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index de34ce875648..1e8b7d2897e3 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1036,6 +1036,8 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
attr->mmap2 = track && !perf_missing_features.mmap2;
attr->comm = track;
attr->ksymbol = track && !perf_missing_features.ksymbol;
+ attr->bpf_event = track && opts->bpf_event &&
+ !perf_missing_features.bpf_event;
if (opts->record_namespaces)
attr->namespaces = track;
@@ -1654,6 +1656,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(write_backward, p_unsigned);
PRINT_ATTRf(namespaces, p_unsigned);
PRINT_ATTRf(ksymbol, p_unsigned);
+ PRINT_ATTRf(bpf_event, p_unsigned);
PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned);
PRINT_ATTRf(bp_type, p_unsigned);
@@ -1815,6 +1818,8 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
evsel->attr.read_format &= ~(PERF_FORMAT_GROUP|PERF_FORMAT_ID);
if (perf_missing_features.ksymbol)
evsel->attr.ksymbol = 0;
+ if (perf_missing_features.bpf_event)
+ evsel->attr.bpf_event = 0;
retry_sample_id:
if (perf_missing_features.sample_id_all)
evsel->attr.sample_id_all = 0;
@@ -1964,6 +1969,11 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
perf_missing_features.ksymbol = true;
pr_debug2("switching off ksymbol\n");
goto fallback_missing_features;
+ } else if (!perf_missing_features.bpf_event &&
+ evsel->attr.bpf_event) {
+ perf_missing_features.bpf_event = true;
+ pr_debug2("switching off bpf_event\n");
+ goto fallback_missing_features;
} else if (!perf_missing_features.sample_id_all) {
perf_missing_features.sample_id_all = true;
pr_debug2("switching off sample_id_all\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index c647c1963f7a..490d7972e87a 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -169,6 +169,7 @@ struct perf_missing_features {
bool write_backward;
bool group_read;
bool ksymbol;
+ bool bpf_event;
};
extern struct perf_missing_features perf_missing_features;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 5b50442d77c9..61854e2ded95 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -21,6 +21,7 @@
#include "unwind.h"
#include "linux/hash.h"
#include "asm/bug.h"
+#include "bpf-event.h"
#include "sane_ctype.h"
#include <symbol/kallsyms.h>
@@ -1869,6 +1870,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
ret = machine__process_switch_event(machine, event); break;
case PERF_RECORD_KSYMBOL:
ret = machine__process_ksymbol(machine, event, sample); break;
+ case PERF_RECORD_BPF_EVENT:
+ ret = machine__process_bpf_event(machine, event, sample); break;
default:
ret = -1;
break;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 2c1976a66164..640c78871cb5 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -373,6 +373,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->context_switch = perf_event__process_switch;
if (tool->ksymbol == NULL)
tool->ksymbol = perf_event__process_ksymbol;
+ if (tool->bpf_event == NULL)
+ tool->bpf_event = perf_event__process_bpf_event;
if (tool->read == NULL)
tool->read = process_event_sample_stub;
if (tool->throttle == NULL)
@@ -1304,6 +1306,8 @@ static int machines__deliver_event(struct machines *machines,
return tool->context_switch(tool, event, sample, machine);
case PERF_RECORD_KSYMBOL:
return tool->ksymbol(tool, event, sample, machine);
+ case PERF_RECORD_BPF_EVENT:
+ return tool->bpf_event(tool, event, sample, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 9c81ca2f3cf7..250391672f9f 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -54,7 +54,8 @@ struct perf_tool {
context_switch,
throttle,
unthrottle,
- ksymbol;
+ ksymbol,
+ bpf_event;
event_attr_op attr;
event_attr_op event_update;
--
2.17.1
This patch synthesizes PERF_RECORD_KSYMBOL and PERF_RECORD_BPF_EVENT
for BPF programs loaded before perf-record starts. This is achieved by
gathering information about all BPF programs via sys_bpf.
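The core of that iteration, condensed from the loop added below (error
handling elided):

  __u32 id = 0;

  while (!bpf_prog_get_next_id(id, &id)) {	/* walk all prog ids */
  	int fd = bpf_prog_get_fd_by_id(id);

  	if (fd < 0)
  		continue;	/* the program may already be gone */
  	/* call bpf_obj_get_info_by_fd() and synthesize the records */
  	close(fd);
  }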
Signed-off-by: Song Liu <[email protected]>
---
tools/perf/builtin-record.c | 6 ++
tools/perf/util/bpf-event.c | 204 ++++++++++++++++++++++++++++++++++++
tools/perf/util/bpf-event.h | 5 +
3 files changed, 215 insertions(+)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 8983ab2e60dd..d239e8bf2f0f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -41,6 +41,7 @@
#include "util/perf-hooks.h"
#include "util/time-utils.h"
#include "util/units.h"
+#include "util/bpf-event.h"
#include "asm/bug.h"
#include <errno.h>
@@ -855,6 +856,11 @@ static int record__synthesize(struct record *rec, bool tail)
return err;
}
+ err = perf_event__synthesize_bpf_events(tool, process_synthesized_event,
+ machine, opts);
+ if (err < 0)
+ pr_warning("Couldn't synthesize bpf events.\n");
+
err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
process_synthesized_event, opts->sample_address,
opts->proc_map_timeout, 1);
diff --git a/tools/perf/util/bpf-event.c b/tools/perf/util/bpf-event.c
index f24f75506f51..98208b577533 100644
--- a/tools/perf/util/bpf-event.c
+++ b/tools/perf/util/bpf-event.c
@@ -1,10 +1,24 @@
// SPDX-License-Identifier: GPL-2.0
#include <errno.h>
#include <bpf/bpf.h>
+#include <bpf/btf.h>
+#include <linux/btf.h>
#include "bpf-event.h"
#include "debug.h"
#include "symbol.h"
+#define ptr_to_u64(ptr) ((__u64)(unsigned long)(ptr))
+
+static int snprintf_hex(char *buf, size_t size, unsigned char *data, size_t len)
+{
+ int ret = 0;
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ ret += snprintf(buf + ret, size - ret, "%02x", data[i]);
+ return ret;
+}
+
int machine__process_bpf_event(struct machine *machine __maybe_unused,
union perf_event *event,
struct perf_sample *sample __maybe_unused)
@@ -13,3 +27,193 @@ int machine__process_bpf_event(struct machine *machine __maybe_unused,
perf_event__fprintf_bpf_event(event, stderr);
return 0;
}
+
+/*
+ * Synthesize PERF_RECORD_KSYMBOL and PERF_RECORD_BPF_EVENT for one bpf
+ * program. One PERF_RECORD_BPF_EVENT is generated for the program. And
+ * one PERF_RECORD_KSYMBOL is generated for each sub program.
+ */
+static int perf_event__synthesize_one_bpf_prog(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ int fd,
+ union perf_event *event,
+ struct record_opts *opts)
+{
+ struct ksymbol_event *ksymbol_event = &event->ksymbol_event;
+ struct bpf_event *bpf_event = &event->bpf_event;
+ u32 sub_prog_cnt, i, func_info_rec_size;
+ u8 (*prog_tags)[BPF_TAG_SIZE] = NULL;
+ struct bpf_prog_info info = {};
+ u32 info_len = sizeof(info);
+ void *func_infos = NULL;
+ u64 *prog_addrs = NULL;
+ struct btf *btf = NULL;
+ u32 *prog_lens = NULL;
+ bool has_btf = false;
+ int err = 0;
+
+ /* Call bpf_obj_get_info_by_fd() to get sizes of arrays */
+ err = bpf_obj_get_info_by_fd(fd, &info, &info_len);
+
+ if (err || info_len < 192 /* need field prog_tags */)
+ return -1;
+
+ /* number of ksyms, func_lengths, and tags should match */
+ sub_prog_cnt = info.nr_jited_ksyms;
+ if (sub_prog_cnt != info.nr_prog_tags ||
+ sub_prog_cnt != info.nr_jited_func_lens)
+ return -1;
+
+ /* check BTF func info support */
+ if (info.btf_id && info.nr_func_info && info.func_info_rec_size) {
+ /* btf func info number should be same as sub_prog_cnt */
+ if (sub_prog_cnt != info.nr_func_info)
+ return -1;
+ if (btf__get_from_id(info.btf_id, &btf))
+ return -1;
+ func_info_rec_size = info.func_info_rec_size;
+ func_infos = malloc(sub_prog_cnt * func_info_rec_size);
+ if (!func_infos)
+ return -1;
+ has_btf = true;
+ }
+
+ /*
+ * We need address, length, and tag for each sub program.
+ * Allocate memory and call bpf_obj_get_info_by_fd() again
+ */
+ prog_addrs = (u64 *)malloc(sizeof(u64) * sub_prog_cnt);
+ prog_lens = (u32 *)malloc(sizeof(u64) * sub_prog_cnt);
+ prog_tags = malloc(sizeof(u8) * BPF_TAG_SIZE * sub_prog_cnt);
+
+ err = !prog_addrs || !prog_lens || !prog_tags;
+ if (err)
+ goto out;
+
+ memset(&info, 0, sizeof(info));
+ info.nr_jited_ksyms = sub_prog_cnt;
+ info.nr_jited_func_lens = sub_prog_cnt;
+ info.nr_prog_tags = sub_prog_cnt;
+ info.jited_ksyms = ptr_to_u64(prog_addrs);
+ info.jited_func_lens = ptr_to_u64(prog_lens);
+ info.prog_tags = ptr_to_u64(prog_tags);
+ info_len = sizeof(info);
+ if (has_btf) {
+ info.nr_func_info = sub_prog_cnt;
+ info.func_info_rec_size = func_info_rec_size;
+ info.func_info = ptr_to_u64(func_infos);
+ }
+
+ err = bpf_obj_get_info_by_fd(fd, &info, &info_len);
+ if (err)
+ goto out;
+
+ /* Synthesize PERF_RECORD_KSYMBOL */
+ for (i = 0; i < sub_prog_cnt; i++) {
+ const struct bpf_func_info *finfo;
+ const char *short_name = NULL;
+ const struct btf_type *t;
+ int name_len;
+
+ *ksymbol_event = (struct ksymbol_event){
+ .header = {
+ .type = PERF_RECORD_KSYMBOL,
+ .misc = PERF_RECORD_MISC_KSYMBOL_TYPE_BPF,
+ .size = sizeof(struct ksymbol_event),
+ },
+ .addr = prog_addrs[i],
+ .len = prog_lens[i],
+ };
+ name_len = snprintf(ksymbol_event->name, KSYM_NAME_LEN,
+ "bpf_prog_");
+ name_len += snprintf_hex(ksymbol_event->name + name_len,
+ KSYM_NAME_LEN - name_len,
+ prog_tags[i], BPF_TAG_SIZE);
+ if (has_btf) {
+ finfo = func_infos + i * info.func_info_rec_size;
+ t = btf__type_by_id(btf, finfo->type_id);
+ short_name = btf__name_by_offset(btf, t->name_off);
+ } else if (i == 0 && sub_prog_cnt == 1) {
+ /* no subprog */
+ if (info.name[0])
+ short_name = info.name;
+ } else
+ short_name = "F";
+ if (short_name)
+ name_len += snprintf(ksymbol_event->name + name_len,
+ KSYM_NAME_LEN - name_len,
+ "_%s", short_name);
+
+ ksymbol_event->header.size += PERF_ALIGN(name_len + 1,
+ sizeof(u64));
+ err = perf_tool__process_synth_event(tool, event,
+ machine, process);
+ }
+
+ /* Synthesize PERF_RECORD_BPF_EVENT */
+ if (opts->bpf_event) {
+ *bpf_event = (struct bpf_event){
+ .header = {
+ .type = PERF_RECORD_BPF_EVENT,
+ .size = sizeof(struct bpf_event),
+ },
+ .type = PERF_BPF_EVENT_PROG_LOAD,
+ .flags = 0,
+ .id = info.id,
+ };
+ memcpy(bpf_event->tag, prog_tags[i], BPF_TAG_SIZE);
+ err = perf_tool__process_synth_event(tool, event,
+ machine, process);
+ }
+
+out:
+ free(prog_tags);
+ free(prog_lens);
+ free(prog_addrs);
+ free(func_infos);
+ free(btf);
+ return err ? -1 : 0;
+}
+
+int perf_event__synthesize_bpf_events(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ struct record_opts *opts)
+{
+ union perf_event *event;
+ __u32 id = 0;
+ int err;
+ int fd;
+
+ event = malloc(sizeof(event->bpf_event) + KSYM_NAME_LEN);
+ if (!event)
+ return -1;
+ while (true) {
+ err = bpf_prog_get_next_id(id, &id);
+ if (err) {
+ if (errno == ENOENT) {
+ err = 0;
+ break;
+ }
+ pr_err("can't get next program: %s%s",
+ strerror(errno),
+ errno == EINVAL ? " -- kernel too old?" : "");
+ err = -1;
+ break;
+ }
+ fd = bpf_prog_get_fd_by_id(id);
+ if (fd < 0) {
+ pr_debug("Failed to get fd for prog_id %u\n", id);
+ continue;
+ }
+
+ err = perf_event__synthesize_one_bpf_prog(
+ tool, process, machine, fd, event, opts);
+ close(fd);
+ if (err)
+ break;
+ }
+ free(event);
+ return err;
+}
diff --git a/tools/perf/util/bpf-event.h b/tools/perf/util/bpf-event.h
index d5ca355dd298..38aee4040f12 100644
--- a/tools/perf/util/bpf-event.h
+++ b/tools/perf/util/bpf-event.h
@@ -8,4 +8,9 @@ int machine__process_bpf_event(struct machine *machine,
union perf_event *event,
struct perf_sample *sample);
+int perf_event__synthesize_bpf_events(struct perf_tool *tool,
+ perf_event__handler_t process,
+ struct machine *machine,
+ struct record_opts *opts);
+
#endif
--
2.17.1
Thanks Peter!
> On Jan 8, 2019, at 10:31 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Dec 20, 2018 at 10:28:58AM -0800, Song Liu wrote:
>
>> @@ -648,11 +649,18 @@ struct perf_event_mmap_page {
>> * PERF_RECORD_MISC_COMM_EXEC - PERF_RECORD_COMM event
>> * PERF_RECORD_MISC_FORK_EXEC - PERF_RECORD_FORK event (perf internal)
>> * PERF_RECORD_MISC_SWITCH_OUT - PERF_RECORD_SWITCH* events
>> + * PERF_RECORD_MISC_KSYMBOL_* - PERF_RECORD_KSYMBOL event
>> */
>> #define PERF_RECORD_MISC_MMAP_DATA (1 << 13)
>> #define PERF_RECORD_MISC_COMM_EXEC (1 << 13)
>> #define PERF_RECORD_MISC_FORK_EXEC (1 << 13)
>> #define PERF_RECORD_MISC_SWITCH_OUT (1 << 13)
>> +
>> +#define PERF_RECORD_MISC_KSYMBOL_UNREGISTER (1 << 3)
>> +#define PERF_RECORD_MISC_KSYMBOL_TYPE_MASK (7 << 4)
>> +#define PERF_RECORD_MISC_KSYMBOL_TYPE_UNKNOWN (0 << 4)
>> +#define PERF_RECORD_MISC_KSYMBOL_TYPE_BPF (1 << 4)
>
> So this gives us 8 possible types, of which 2 are claimed.
>
> I suppose we already know of FTRACE_TRAMPOLINE and MODULE, which
> accounts for half the space.
>
>> +
>> /*
>> * These PERF_RECORD_MISC_* flags below are safely reused
>> * for the following events:
>> @@ -965,6 +973,19 @@ enum perf_event_type {
>> */
>> PERF_RECORD_NAMESPACES = 16,
>>
>> + /*
>> + * Record ksymbol register/unregister events:
>> + *
>> + * struct {
>> + * struct perf_event_header header;
>> + * u64 addr;
>> + * u64 len;
>> + * char name[];
>> + * struct sample_id sample_id;
>> + * };
>> + */
>> + PERF_RECORD_KSYMBOL = 17,
>
> Do we really need u64 length? That is, are we willing to consider
> symbols larger than 4G ?
>
> Could we not reclaim the top 32 or 16 bits thereof for that type thing
> instead of using a few misc bits?
I don't think we need a u64 length; 32 bits should be more than
enough. How about we revise it as:
+ /*
+ * Record ksymbol register/unregister events:
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 addr;
+ * u32 len;
+ * u16 type;
+ * u16 reserved;
+ * char name[];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_KSYMBOL = 17,
[...]
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_UNKNOWN 0
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_BPF 1
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_MODULE 2
+#define PERF_RECORD_MISC_KSYMBOL_TYPE_FTRACE_TRAMPOLINE 3
Thanks again,
Song
On Thu, Dec 20, 2018 at 10:28:58AM -0800, Song Liu wrote:
> @@ -648,11 +649,18 @@ struct perf_event_mmap_page {
> * PERF_RECORD_MISC_COMM_EXEC - PERF_RECORD_COMM event
> * PERF_RECORD_MISC_FORK_EXEC - PERF_RECORD_FORK event (perf internal)
> * PERF_RECORD_MISC_SWITCH_OUT - PERF_RECORD_SWITCH* events
> + * PERF_RECORD_MISC_KSYMBOL_* - PERF_RECORD_KSYMBOL event
> */
> #define PERF_RECORD_MISC_MMAP_DATA (1 << 13)
> #define PERF_RECORD_MISC_COMM_EXEC (1 << 13)
> #define PERF_RECORD_MISC_FORK_EXEC (1 << 13)
> #define PERF_RECORD_MISC_SWITCH_OUT (1 << 13)
> +
> +#define PERF_RECORD_MISC_KSYMBOL_UNREGISTER (1 << 3)
> +#define PERF_RECORD_MISC_KSYMBOL_TYPE_MASK (7 << 4)
> +#define PERF_RECORD_MISC_KSYMBOL_TYPE_UNKNOWN (0 << 4)
> +#define PERF_RECORD_MISC_KSYMBOL_TYPE_BPF (1 << 4)
So this gives us 8 possible types, of which 2 are claimed.
I suppose we already know of FTRACE_TRAMPOLINE and MODULE, which
accounts for half the space.
> +
> /*
> * These PERF_RECORD_MISC_* flags below are safely reused
> * for the following events:
> @@ -965,6 +973,19 @@ enum perf_event_type {
> */
> PERF_RECORD_NAMESPACES = 16,
>
> + /*
> + * Record ksymbol register/unregister events:
> + *
> + * struct {
> + * struct perf_event_header header;
> + * u64 addr;
> + * u64 len;
> + * char name[];
> + * struct sample_id sample_id;
> + * };
> + */
> + PERF_RECORD_KSYMBOL = 17,
Do we really need u64 length? That is, are we willing to consider
symbols larger than 4G ?
Could we not reclaim the top 32 or 16 bits thereof for that type thing
instead of using a few misc bits?
On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
> @@ -986,9 +987,35 @@ enum perf_event_type {
> */
> PERF_RECORD_KSYMBOL = 17,
>
> + /*
> + * Record bpf events:
> + * enum perf_bpf_event_type {
> + * PERF_BPF_EVENT_UNKNOWN = 0,
> + * PERF_BPF_EVENT_PROG_LOAD = 1,
> + * PERF_BPF_EVENT_PROG_UNLOAD = 2,
> + * };
> + *
> + * struct {
> + * struct perf_event_header header;
> + * u16 type;
> + * u16 flags;
> + * u32 id;
> + * u8 tag[BPF_TAG_SIZE];
> + * struct sample_id sample_id;
> + * };
> + */
> + PERF_RECORD_BPF_EVENT = 18,
> +
Elsewhere today, I raised the point that by the time (however short the
interval) userspace gets around to reading this event, the actual
program could be gone again.
In this case the program has been with us for a very short period
indeed; but it could still have generated some samples or otherwise
generated trace data.
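To make the race concrete: the consumer side would typically turn the id
from this event into a bpf_prog_info using the existing
BPF_PROG_GET_FD_BY_ID and BPF_OBJ_GET_INFO_BY_FD commands, roughly as in
the sketch below (the helper name is invented). The GET_FD_BY_ID step is
exactly where a short-lived program is lost:

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: resolve the id carried in PERF_RECORD_BPF_EVENT. */
static int prog_info_by_id(__u32 id, struct bpf_prog_info *info)
{
	union bpf_attr attr;
	int fd, err;

	memset(&attr, 0, sizeof(attr));
	attr.prog_id = id;
	fd = syscall(__NR_bpf, BPF_PROG_GET_FD_BY_ID, &attr, sizeof(attr));
	if (fd < 0)
		return -1;	/* program already unloaded; info is gone */

	memset(info, 0, sizeof(*info));
	memset(&attr, 0, sizeof(attr));
	attr.info.bpf_fd = fd;
	attr.info.info_len = sizeof(*info);
	attr.info.info = (__u64)(unsigned long)info;
	err = syscall(__NR_bpf, BPF_OBJ_GET_INFO_BY_FD, &attr, sizeof(attr));
	close(fd);
	return err;
}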
It was suggested to allow pinning modules/programs to avoid this
situation, but that of course has other undesirable effects, such as a
trivial DoS.
A truly horrible hack would be to include an open filedesc in the event
that needs closing to release the resource, but I'm sorry for even
suggesting that **shudder**.
Do we have any sane ideas?
> On Jan 8, 2019, at 10:41 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
>> @@ -986,9 +987,35 @@ enum perf_event_type {
>> */
>> PERF_RECORD_KSYMBOL = 17,
>>
>> + /*
>> + * Record bpf events:
>> + * enum perf_bpf_event_type {
>> + * PERF_BPF_EVENT_UNKNOWN = 0,
>> + * PERF_BPF_EVENT_PROG_LOAD = 1,
>> + * PERF_BPF_EVENT_PROG_UNLOAD = 2,
>> + * };
>> + *
>> + * struct {
>> + * struct perf_event_header header;
>> + * u16 type;
>> + * u16 flags;
>> + * u32 id;
>> + * u8 tag[BPF_TAG_SIZE];
>> + * struct sample_id sample_id;
>> + * };
>> + */
>> + PERF_RECORD_BPF_EVENT = 18,
>> +
>
> Elsewhere today, I raised the point that by the time (however short the
> interval) userspace gets around to reading this event, the actual
> program could be gone again.
>
> In this case the program has been with us for a very short period
> indeed; but it could still have generated some samples or otherwise
> generated trace data.
Since we already have the separate KSYMBOL events, BPF_EVENT is only
required for advanced use cases, like annotation. So I guess missing
it for very short-lived programs should not be a huge problem?
> It was suggested to allow pinning modules/programs to avoid this
> situation, but that of course has other undesirable effects, such as a
> trivial DoS.
>
> A truly horrible hack would be to include an open filedesc in the event
> that needs closing to release the resource, but I'm sorry for even
> suggesting that **shudder**.
>
> Do we have any sane ideas?
How about we gate the open filedesc solution behind an option, and limit
that option to root only? If this still sounds hacky, maybe we should
just accept that short-lived programs are missed?
Thanks,
Song
On Tue, Jan 08, 2019 at 07:10:20PM +0000, Song Liu wrote:
> > On Jan 8, 2019, at 10:41 AM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
> >> @@ -986,9 +987,35 @@ enum perf_event_type {
> >> */
> >> PERF_RECORD_KSYMBOL = 17,
> >>
> >> + /*
> >> + * Record bpf events:
> >> + * enum perf_bpf_event_type {
> >> + * PERF_BPF_EVENT_UNKNOWN = 0,
> >> + * PERF_BPF_EVENT_PROG_LOAD = 1,
> >> + * PERF_BPF_EVENT_PROG_UNLOAD = 2,
> >> + * };
> >> + *
> >> + * struct {
> >> + * struct perf_event_header header;
> >> + * u16 type;
> >> + * u16 flags;
> >> + * u32 id;
> >> + * u8 tag[BPF_TAG_SIZE];
> >> + * struct sample_id sample_id;
> >> + * };
> >> + */
> >> + PERF_RECORD_BPF_EVENT = 18,
> >> +
> >
> > Elsewhere today, I raised the point that by the time (however short the
> > interval) userspace gets around to reading this event, the actual
> > program could be gone again.
> >
> > In this case the program has been with us for a very short period
> > indeed; but it could still have generated some samples or otherwise
> > generated trace data.
>
> Since we already have the separate KSYMBOL events, BPF_EVENT is only
> required for advanced use cases, like annotation. So I guess missing
> it for very short-lived programs should not be a huge problem?
>
> > It was suggested to allow pinning modules/programs to avoid this
> > situation, but that of course has other undesirable effects, such as a
> > trivial DoS.
> >
> > A truly horrible hack would be to include an open filedesc in the event
> > that needs closing to release the resource, but I'm sorry for even
> > suggesting that **shudder**.
> >
> > Do we have any sane ideas?
>
> How about we gate the open filedesc solution behind an option, and limit
> that option to root only? If this still sounds hacky, maybe we should
> just accept that short-lived programs are missed?
I'm afraid we might also 'need' this for the kallsym thing.
The problem is that things like Intel PT (ARM Coresight too IIRC) encode
a bitstream of branch-taken decisions. The only way to decode that and
reconstruct the actual code-flow is with an exact matching text image.
In order to have this matching text we need to be able to copy out every
piece of dynamic text (from kcore) that has ever executed before it
disappears.
Elsewhere (*), Andi suggests having a kind of text-free fence
interface, where userspace can signal completion. And I suppose as long
as we know there is a consumer, we also know we'll not be blocked
indefinitely. So it would have to be slightly more complicated than
suggested, but I think that is something we could work with.
It would also not complicate these events.
[*] https://lkml.kernel.org/r/[email protected]
On Tue, Jan 08, 2019 at 06:56:19PM +0000, Song Liu wrote:
> I don't think we need a u64 length. 32 bits should be more than enough.
>
> How about we revise it as:
Yep, works. Although I would not introduce types we're not yet
generating. We can add those easily enough when needed.
On Tue, Jan 08, 2019 at 07:10:20PM +0000, Song Liu wrote:
> > On Jan 8, 2019, at 10:41 AM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
> >> @@ -986,9 +987,35 @@ enum perf_event_type {
> >> */
> >> PERF_RECORD_KSYMBOL = 17,
> >>
> >> + /*
> >> + * Record bpf events:
> >> + * enum perf_bpf_event_type {
> >> + * PERF_BPF_EVENT_UNKNOWN = 0,
> >> + * PERF_BPF_EVENT_PROG_LOAD = 1,
> >> + * PERF_BPF_EVENT_PROG_UNLOAD = 2,
> >> + * };
> >> + *
> >> + * struct {
> >> + * struct perf_event_header header;
> >> + * u16 type;
> >> + * u16 flags;
> >> + * u32 id;
> >> + * u8 tag[BPF_TAG_SIZE];
> >> + * struct sample_id sample_id;
> >> + * };
> >> + */
> >> + PERF_RECORD_BPF_EVENT = 18,
> > It was suggested to allow pinning modules/programs to avoid this
> > situation, but that of course has other undesirable effects, such as a
> > trivial DoS.
> >
> > A truly horrible hack would be to include an open filedesc in the event
> > that needs closing to release the resource, but I'm sorry for even
> > suggesting that **shudder**.
> >
> > Do we have any sane ideas?
>
> How about we gate the open filedesc solution with an option, and limit
> that option for root only? If this still sounds hacky, maybe we should
> just ignore when short-living programs are missed?
Short-lived short programs could go in the event? For short-lived long
ones... one could ask for a max number of bytes of binary?
The smallest kernel modules are 16KB, a multiple of PAGE_SIZE:
[acme@quaco perf]$ cat /proc/modules | sort -k2 -nr | tail
ebtable_nat 16384 1 - Live 0x0000000000000000
ebtable_filter 16384 1 - Live 0x0000000000000000
crct10dif_pclmul 16384 0 - Live 0x0000000000000000
crc32_pclmul 16384 0 - Live 0x0000000000000000
coretemp 16384 0 - Live 0x0000000000000000
btrtl 16384 1 btusb, Live 0x0000000000000000
btbcm 16384 1 btusb, Live 0x0000000000000000
arc4 16384 2 - Live 0x0000000000000000
acpi_thermal_rel 16384 1 int3400_thermal, Live 0x0000000000000000
ac97_bus 16384 1 snd_soc_core, Live 0x0000000000000000
[acme@quaco perf]$
On a Fedora 29 I have these here, all rather small:
# bpftool prog
13: cgroup_skb tag 7be49e3934a125ba gpl
loaded_at 2019-01-04T14:40:32-0300 uid 0
xlated 296B jited 229B memlock 4096B map_ids 13,14
14: cgroup_skb tag 2a142ef67aaad174 gpl
loaded_at 2019-01-04T14:40:32-0300 uid 0
xlated 296B jited 229B memlock 4096B map_ids 13,14
15: cgroup_skb tag 7be49e3934a125ba gpl
loaded_at 2019-01-04T14:40:32-0300 uid 0
xlated 296B jited 229B memlock 4096B map_ids 15,16
16: cgroup_skb tag 2a142ef67aaad174 gpl
loaded_at 2019-01-04T14:40:32-0300 uid 0
xlated 296B jited 229B memlock 4096B map_ids 15,16
17: cgroup_skb tag 7be49e3934a125ba gpl
loaded_at 2019-01-04T14:40:43-0300 uid 0
xlated 296B jited 229B memlock 4096B map_ids 17,18
18: cgroup_skb tag 2a142ef67aaad174 gpl
loaded_at 2019-01-04T14:40:43-0300 uid 0
xlated 296B jited 229B memlock 4096B map_ids 17,18
21: cgroup_skb tag 7be49e3934a125ba gpl
loaded_at 2019-01-04T14:40:43-0300 uid 0
xlated 296B jited 229B memlock 4096B map_ids 21,22
22: cgroup_skb tag 2a142ef67aaad174 gpl
loaded_at 2019-01-04T14:40:43-0300 uid 0
xlated 296B jited 229B memlock 4096B map_ids 21,22
[root@quaco IRPF2018]#
Running 'perf trace' with its BPF augmenter gets these two more:
158: tracepoint name sys_enter tag 12504ba9402f952f gpl
loaded_at 2019-01-08T17:12:39-0300 uid 0
xlated 512B jited 374B memlock 4096B map_ids 118,117,116
159: tracepoint name sys_exit tag c1bd85c092d6e4aa gpl
loaded_at 2019-01-08T17:12:39-0300 uid 0
xlated 256B jited 191B memlock 4096B map_ids 118,117
[root@quaco ~]#
A PERF_RECORD_MMAP gets as its payload up to PATH_MAX - sizeof(u64).
So for a class of programs, shoving it together with a
PERF_RECORD_MMAP-like event may be enough?
You started the shuddering suggestions... ;-)
- Arnaldo
On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
> The following example shows kernel symbols for a BPF program with 7
> sub programs:
>
> ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F
> ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F
> ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F
> ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F
> ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F
> ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F
> ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F
> ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi
We should probably specify somewhere that the name can include a
'[module]' part just like normal kallsyms. Even though you don't
currently use that.
On 1/8/19 12:29 PM, Peter Zijlstra wrote:
> On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
>> The following example shows kernel symbols for a BPF program with 7
>> sub programs:
>>
>> ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F
>> ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F
>> ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F
>> ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F
>> ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F
>> ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F
>> ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F
>> ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi
>
> We should probably specify somewhere that the name can include a
> '[module]' part just like normal kallsyms. Even though you don't
> currently use that.
there is no [module] equivalent in bpf land.
The progs loaded by different users can be shared.
There is no strict tree hierarchy (like in modules)
where there is one root and a bunch of functions underneath.
In bpf all these functions form a graph and can call each other.
Same with maps that are shared by different progs of different types.
For example, a networking prog and a tracing prog can share a common map.
On 1/8/19 10:41 AM, Peter Zijlstra wrote:
> On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
>> @@ -986,9 +987,35 @@ enum perf_event_type {
>> */
>> PERF_RECORD_KSYMBOL = 17,
>>
>> + /*
>> + * Record bpf events:
>> + * enum perf_bpf_event_type {
>> + * PERF_BPF_EVENT_UNKNOWN = 0,
>> + * PERF_BPF_EVENT_PROG_LOAD = 1,
>> + * PERF_BPF_EVENT_PROG_UNLOAD = 2,
>> + * };
>> + *
>> + * struct {
>> + * struct perf_event_header header;
>> + * u16 type;
>> + * u16 flags;
>> + * u32 id;
>> + * u8 tag[BPF_TAG_SIZE];
>> + * struct sample_id sample_id;
>> + * };
>> + */
>> + PERF_RECORD_BPF_EVENT = 18,
>> +
>
> Elsewhere today, I raised the point that by the time (however short the
> interval) userspace gets around to reading this event, the actual
> program could be gone again.
>
> In this case the program has been with us for a very short period
> indeed; but it could still have generated some samples or otherwise
> generated trace data.
>
> It was suggested to allow pinning modules/programs to avoid this
> situation, but that of course has other undesirable effects, such as a
> trivial DoS.
>
> A truly horrible hack would be to include an open filedesc in the event
> that needs closing to release the resource, but I'm sorry for even
> suggesting that **shudder**.
>
> Do we have any sane ideas?
I think it is fine to miss such ultra-short-lived progs.
If perf record is slower than some user doing for(;;) {load prog;
unload;} then it's a good thing.
I frankly don't think it's practically possible to miss a prog.
perf ring buffer is way faster than prog load/unload.
Even when all scheduling artifacts of perf user space are factored in.
Doing barriers is DoS-able.
Sending FD via ring buffer is not possible.
If there were a non-shudder solution to this we could do it,
but all these complications point to this not being a problem
worth solving.
At least since right now none of us sees a clean fix, I propose
to move ahead with what we have and address it later if better ideas
come up. We've been missing short-lived progs and prog notifications
all this time and users complain. That is the problem to address.
> On Jan 8, 2019, at 12:16 PM, Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> On Tue, Jan 08, 2019 at 07:10:20PM +0000, Song Liu wrote:
>>> On Jan 8, 2019, at 10:41 AM, Peter Zijlstra <[email protected]> wrote:
>>> On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
>>>> @@ -986,9 +987,35 @@ enum perf_event_type {
>>>> */
>>>> PERF_RECORD_KSYMBOL = 17,
>>>>
>>>> + /*
>>>> + * Record bpf events:
>>>> + * enum perf_bpf_event_type {
>>>> + * PERF_BPF_EVENT_UNKNOWN = 0,
>>>> + * PERF_BPF_EVENT_PROG_LOAD = 1,
>>>> + * PERF_BPF_EVENT_PROG_UNLOAD = 2,
>>>> + * };
>>>> + *
>>>> + * struct {
>>>> + * struct perf_event_header header;
>>>> + * u16 type;
>>>> + * u16 flags;
>>>> + * u32 id;
>>>> + * u8 tag[BPF_TAG_SIZE];
>>>> + * struct sample_id sample_id;
>>>> + * };
>>>> + */
>>>> + PERF_RECORD_BPF_EVENT = 18,
>
>>> It was suggested to allow pinning modules/programs to avoid this
>>> situation, but that of course has other undesirable effects, such as a
>>> trivial DoS.
>>>
>>> A truly horrible hack would be to include an open filedesc in the event
>>> that needs closing to release the resource, but I'm sorry for even
>>> suggesting that **shudder**.
>>>
>>> Do we have any sane ideas?
>>
>> How about we gate the open filedesc solution behind an option, and limit
>> that option to root only? If this still sounds hacky, maybe we should
>> just accept that short-lived programs are missed?
>
> Short-lived short programs could go in the event? For short-lived long
> ones... one could ask for a max number of bytes of binary?
>
> The smallest kernel modules are 16KB, a multiple of PAGE_SIZE:
>
> [acme@quaco perf]$ cat /proc/modules | sort -k2 -nr | tail
> ebtable_nat 16384 1 - Live 0x0000000000000000
> ebtable_filter 16384 1 - Live 0x0000000000000000
> crct10dif_pclmul 16384 0 - Live 0x0000000000000000
> crc32_pclmul 16384 0 - Live 0x0000000000000000
> coretemp 16384 0 - Live 0x0000000000000000
> btrtl 16384 1 btusb, Live 0x0000000000000000
> btbcm 16384 1 btusb, Live 0x0000000000000000
> arc4 16384 2 - Live 0x0000000000000000
> acpi_thermal_rel 16384 1 int3400_thermal, Live 0x0000000000000000
> ac97_bus 16384 1 snd_soc_core, Live 0x0000000000000000
> [acme@quaco perf]$
>
> On a Fedora 29 I have these here, all rather small:
>
> # bpftool prog
> 13: cgroup_skb tag 7be49e3934a125ba gpl
> loaded_at 2019-01-04T14:40:32-0300 uid 0
> xlated 296B jited 229B memlock 4096B map_ids 13,14
> 14: cgroup_skb tag 2a142ef67aaad174 gpl
> loaded_at 2019-01-04T14:40:32-0300 uid 0
> xlated 296B jited 229B memlock 4096B map_ids 13,14
> 15: cgroup_skb tag 7be49e3934a125ba gpl
> loaded_at 2019-01-04T14:40:32-0300 uid 0
> xlated 296B jited 229B memlock 4096B map_ids 15,16
> 16: cgroup_skb tag 2a142ef67aaad174 gpl
> loaded_at 2019-01-04T14:40:32-0300 uid 0
> xlated 296B jited 229B memlock 4096B map_ids 15,16
> 17: cgroup_skb tag 7be49e3934a125ba gpl
> loaded_at 2019-01-04T14:40:43-0300 uid 0
> xlated 296B jited 229B memlock 4096B map_ids 17,18
> 18: cgroup_skb tag 2a142ef67aaad174 gpl
> loaded_at 2019-01-04T14:40:43-0300 uid 0
> xlated 296B jited 229B memlock 4096B map_ids 17,18
> 21: cgroup_skb tag 7be49e3934a125ba gpl
> loaded_at 2019-01-04T14:40:43-0300 uid 0
> xlated 296B jited 229B memlock 4096B map_ids 21,22
> 22: cgroup_skb tag 2a142ef67aaad174 gpl
> loaded_at 2019-01-04T14:40:43-0300 uid 0
> xlated 296B jited 229B memlock 4096B map_ids 21,22
> [root@quaco IRPF2018]#
>
>
> Running 'perf trace' with its BPF augmenter gets these two more:
>
> 158: tracepoint name sys_enter tag 12504ba9402f952f gpl
> loaded_at 2019-01-08T17:12:39-0300 uid 0
> xlated 512B jited 374B memlock 4096B map_ids 118,117,116
> 159: tracepoint name sys_exit tag c1bd85c092d6e4aa gpl
> loaded_at 2019-01-08T17:12:39-0300 uid 0
> xlated 256B jited 191B memlock 4096B map_ids 118,117
> [root@quaco ~]#
>
> A PERF_RECORD_MMAP gets as its payload up to PATH_MAX - sizeof(u64).
>
> So for a class of programs, shoving it together with a
> PERF_RECORD_MMAP-like event may be enough?
>
> You started the shuddering suggestions... ;-)
>
> - Arnaldo
Besides the cited binary, we are adding more information for each
BPF program, including source code. So even a short program could
easily exceed PATH_MAX...
Song
On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
> +static void perf_event_bpf_emit_ksymbols(struct bpf_prog *prog,
> + enum perf_bpf_event_type type)
> +{
> + bool unregister = type == PERF_BPF_EVENT_PROG_UNLOAD;
> + int i;
> +
> + if (prog->aux->func_cnt == 0) {
> + perf_event_ksymbol(PERF_RECORD_MISC_KSYMBOL_TYPE_BPF,
> + (u64)(unsigned long)prog->bpf_func,
> + prog->jited_len, unregister,
> + perf_event_bpf_get_name, prog);
> + } else {
> + for (i = 0; i < prog->aux->func_cnt; i++) {
> + struct bpf_prog *subprog = prog->aux->func[i];
> +
> + perf_event_ksymbol(
> + PERF_RECORD_MISC_KSYMBOL_TYPE_BPF,
> + (u64)(unsigned long)subprog->bpf_func,
> + subprog->jited_len, unregister,
> + perf_event_bpf_get_name, subprog);
> + }
> + }
> +}
That's a bit unexpected, but yes sure, that works for now.
I was expecting it to be hooked up in your kallsym rbtree thing, but
whatever, we can fix that when needed.
> On Jan 8, 2019, at 11:43 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jan 08, 2019 at 07:10:20PM +0000, Song Liu wrote:
>>> On Jan 8, 2019, at 10:41 AM, Peter Zijlstra <[email protected]> wrote:
>>> On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
>>>> @@ -986,9 +987,35 @@ enum perf_event_type {
>>>> */
>>>> PERF_RECORD_KSYMBOL = 17,
>>>>
>>>> + /*
>>>> + * Record bpf events:
>>>> + * enum perf_bpf_event_type {
>>>> + * PERF_BPF_EVENT_UNKNOWN = 0,
>>>> + * PERF_BPF_EVENT_PROG_LOAD = 1,
>>>> + * PERF_BPF_EVENT_PROG_UNLOAD = 2,
>>>> + * };
>>>> + *
>>>> + * struct {
>>>> + * struct perf_event_header header;
>>>> + * u16 type;
>>>> + * u16 flags;
>>>> + * u32 id;
>>>> + * u8 tag[BPF_TAG_SIZE];
>>>> + * struct sample_id sample_id;
>>>> + * };
>>>> + */
>>>> + PERF_RECORD_BPF_EVENT = 18,
>>>> +
>>>
>>> Elsewhere today, I raised the point that by the time (however short the
>>> interval) userspace gets around to reading this event, the actual
>>> program could be gone again.
>>>
>>> In this case the program has been with us for a very short period
>>> indeed; but it could still have generated some samples or otherwise
>>> generated trace data.
>>
>> Since we already have the separate KSYMBOL events, BPF_EVENT is only
>> required for advanced use cases, like annotation. So I guess missing
>> it for very short-lived programs should not be a huge problem?
>>
>>> It was suggested to allow pinning modules/programs to avoid this
>>> situation, but that of course has other undesirable effects, such as a
>>> trivial DoS.
>>>
>>> A truly horrible hack would be to include an open filedesc in the event
>>> that needs closing to release the resource, but I'm sorry for even
>>> suggesting that **shudder**.
>>>
>>> Do we have any sane ideas?
>>
>> How about we gate the open filedesc solution behind an option, and limit
>> that option to root only? If this still sounds hacky, maybe we should
>> just accept that short-lived programs are missed?
>
> I'm afraid we might also 'need' this for the kallsym thing.
>
> The problem is that things like Intel PT (ARM Coresight too IIRC) encode
> a bitstream of branch-taken decisions. The only way to decode that and
> reconstruct the actual code-flow is with an exact matching text image.
>
> In order to have this matching text we need to be able to copy out every
> piece of dynamic text (from kcore) that has ever executed before it
> disappears.
>
> Elsewhere (*), Andi suggests having a kind of text-free fence
> interface, where userspace can signal completion. And I suppose as long
> as we know there is a consumer, we also know we'll not be blocked
> indefinitely. So it would have to be slightly more complicated than
> suggested, but I think that is something we could work with.
>
> It would also not complicate these events.
>
>
>
> [*] https://lkml.kernel.org/r/[email protected]
I think the Intel PT case is at instruction granularity (instead of
ksymbol granularity)? If this is true, modules, BPF, and PT could still
share the ksymbol record for basic profiling. And advanced use cases
like annotation will depend on user space recording BPF_EVENT (and the
equivalent for other cases) in a timely manner. But at least the ksymbol
is already there.
Does this make sense?
Thanks,
Song
On Tue, Jan 08, 2019 at 11:54:04PM +0000, Song Liu wrote:
> I think the Intel PT case is at instruction granularity (instead of
> ksymbol granularity)?
Yes.
> If this is true, modules, BPF, and PT could still share
> the ksymbol record for basic profiling. And advanced use cases like
> annotation will depend on user space recording BPF_EVENT (and the
> equivalent for other cases) in a timely manner. But at least the
> ksymbol is already there.
>
> Does this make sense?
I'm not sure I follow; the idea was that on ksym events we copy out the
instructions using kcore. The ksym event already has addr+len.
All we need is some means of ensuring the symbol is still there by the
time we see the event and do the copy.
I think we can do this with a new ioctl() on /proc/kcore itself:
- when we have kcore open, we queue all text-free operations on list-1.
- when we close kcore, we drain all (text-free) list-* and perform the
pending frees immediately.
- on ioctl(KCORE_QC) we perform the pending free of list-3 and advance
list-2 to list-3 and list-1 to list-2.
Perf would then open kcore at the start of the record, make a complete
copy and keep the FD open. At the end of every buffer process, we issue
KCORE_QC IFF we observed a ksym unreg in that buffer.
We use 3 lists instead of 2 to guard against races: if there was a
reg+unreg en route but not yet visible in the buffer, then we don't want
that free to be processed. The next buffer (read) will have the event(s)
and all should be well.
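In (entirely hypothetical) kernel pseudo-code, the QC rotation would be
something like the below; struct text_free, do_text_free() and the ioctl
name are all made up, this is just to illustrate the three-list scheme:

#include <linux/list.h>

/* Made-up bookkeeping for one pending text free. */
struct text_free {
	struct list_head node;
	/* ... whatever identifies the text range to free ... */
};

/* Pending frees, staged while kcore is held open; the heads are
 * assumed to be initialised elsewhere with INIT_LIST_HEAD(). */
static struct list_head free_lists[3];	/* list-1, list-2, list-3 */

static long kcore_ioctl_qc(void)
{
	struct text_free *tf, *tmp;

	/* Perform the pending frees of list-3 ... */
	list_for_each_entry_safe(tf, tmp, &free_lists[2], node) {
		list_del(&tf->node);
		do_text_free(tf);	/* made-up worker */
	}
	/* ... and advance list-2 -> list-3, list-1 -> list-2. */
	list_replace_init(&free_lists[1], &free_lists[2]);
	list_replace_init(&free_lists[0], &free_lists[1]);
	return 0;
}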
> On Jan 9, 2019, at 2:18 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jan 08, 2019 at 11:54:04PM +0000, Song Liu wrote:
>
>> I think the Intel PT case is at instruction granularity (instead of
>> ksymbol granularity)?
>
> Yes.
>
>> If this is true, modules, BPF, and PT could still share
>> the ksymbol record for basic profiling. And advanced use cases like
>> annotation will depend on user space recording BPF_EVENT (and the
>> equivalent for other cases) in a timely manner. But at least the
>> ksymbol is already there.
>>
>> Does this make sense?
>
> I'm not sure I follow; the idea was that on ksym events we copy out the
> instructions using kcore. The ksym event already has addr+len.
I was thinking about the modify-text-in-place scenario. In this case,
we can use something like
struct perf_record_text_modify {
u64 addr;
u_big_enough old_instr;
u_big_enough new_instr;
timestamp ;
};
It is a fixed-size record, and we don't need to process it immediately
in user space. At the end of a perf run, a series of these events will
help us reconstruct the exact text at any time.
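As a sketch of that replay, with a concrete stand-in for the
pseudo-struct above (field names and sizes are guesses; the stand-in
keeps only what the replay needs):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct text_modify_ev {		/* stand-in; sizes are guesses */
	uint64_t addr;
	uint8_t  new_instr[16];
	uint64_t timestamp;
};

/*
 * Replay modifications (sorted by timestamp) on top of a text snapshot
 * taken at record start, up to time 'until'.
 */
static void reconstruct_text(uint8_t *text, uint64_t base,
			     const struct text_modify_ev *ev, size_t n,
			     uint64_t until)
{
	size_t i;

	for (i = 0; i < n && ev[i].timestamp <= until; i++)
		memcpy(text + (ev[i].addr - base), ev[i].new_instr,
		       sizeof(ev[i].new_instr));
}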
>
> All we need is some means of ensuring the symbol is still there by the
> time we see the event and do the copy.
>
> I think we can do this with a new ioctl() on /proc/kcore itself:
>
> - when we have kcore open, we queue all text-free operations on list-1.
>
> - when we close kcore, we drain all (text-free) list-* and perform the
> pending frees immediately.
>
> - on ioctl(KCORE_QC) we perform the pending free of list-3 and advance
> list-2 to list-3 and list-1 to list-2.
>
> Perf would then open kcore at the start of the record, make a complete
> copy and keep the FD open. At the end of every buffer process, we issue
> KCORE_QC IFF we observed a ksym unreg in that buffer.
Does this mean we need to scan every buffer before writing it to perf.data
during perf-record?
Also, if we need ksym unreg here, I guess it is NOT really modifying text
in-place, but creating a new version and swapping? Then can we include
something like this in perf.data:
struct perf_record_text_modify {
u64 old_addr;
u64 new_addr;
u32 old_len; /* up to MAX_SIZE */
u32 new_len; /* up to MAX_SIZE */
u8 old_text[MAX_SIZE];
u8 new_text[MAX_SIZE];
timestamp ;
};
In this way, this record is embedded in perf.data, and doesn't require
extra processing during perf-record (only at the end of perf-record).
This would work for the text-modifying case, as modifying text is simply
old text to new text.
A similar solution would not work for the BPF case, as bpf_prog_info is
getting a lot more members in the near future.
Does this make sense...?
Thanks,
Song
On Tue, Jan 08, 2019 at 08:45:19PM +0000, Alexei Starovoitov wrote:
> On 1/8/19 12:29 PM, Peter Zijlstra wrote:
> > On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
> >> The following example shows kernel symbols for a BPF program with 7
> >> sub programs:
> >>
> >> ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F
> >> ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F
> >> ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F
> >> ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F
> >> ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F
> >> ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F
> >> ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F
> >> ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi
> >
> > We should probably specify somewhere that the name can include a
> > '[module]' part just like normal kallsyms. Even though you don't
> > currently use that.
>
> there is no [module] equivalent in bpf land.
I know; although you could consider each program its own separate
module. But what I meant was, we should probably document the name[]
format somewhere, maybe in the PERF_RECORD_KSYMBOL comment.
The "symbol [module]" syntax can be used to create a DSO sort key, so
you could simply put in "[bpf]" for all BPF-generated symbols and have
everything BPF grouped in perf-report when sorted on DSO.
It doesn't have any other implications.
Similarly, I would suggest "[ftrace]" for all the ftrace trampolines
(which are currently not exposed but really should be).
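For example (illustrative only), a BPF symbol carrying such a marker in
its name[] would read:

	bpf_prog_b07ccb89267cf242_F [bpf]

and perf-report sorted on DSO would then group all BPF symbols together.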
On Wed, Jan 09, 2019 at 11:32:50AM +0000, Song Liu wrote:
> I was thinking about the modify-text-in-place scenario. In this case,
> we can use something like
>
> struct perf_record_text_modify {
> u64 addr;
> u_big_enough old_instr;
> u_big_enough new_instr;
char[15] for x86 ;-)
Also, I don't think we need old, we should already have the old text,
either from a previous event or from the initial kcore snapshot.
> timestamp ;
that lives in struct sample_id.
> };
>
> It is a fixed-size record, and we don't need to process it immediately
> in user space. At the end of a perf run, a series of these events will
> help us reconstruct the exact text at any time.
That works for text_poke users, see also:
https://lkml.kernel.org/r/[email protected]
But it is useless for module / bpf / ftrace dynamic text.
> > All we need is some means of ensuring the symbol is still there by the
> > time we see the event and do the copy.
> >
> > I think we can do this with a new ioctl() on /proc/kcore itself:
> >
> > - when we have kcore open, we queue all text-free operations on list-1.
> >
> > - when we close kcore, we drain all (text-free) list-* and perform the
> > pending frees immediately.
> >
> > - on ioctl(KCORE_QC) we perform the pending free of list-3 and advance
> > list-2 to list-3 and list-1 to list-2.
> >
> > Perf would then open kcore at the start of the record, make a complete
> > copy and keep the FD open. At the end of every buffer process, we issue
> > KCORE_QC IFF we observed a ksym unreg in that buffer.
>
> Does this mean we need to scan every buffer before writing it to perf.data
> during perf-record?
Just like the BPF events, yes. Now for PT most of the actual data is not
in the regular buffer, so it shouldn't be too horrible, but just like
the BPF event, it can get its own buffer if it does become a problem.
> Also, if we need ksym unreg here, I guess it is NOT really modifying text
> in-place, but creating a new version and swapping? Then can we include
> something like this in perf.data:
>
> struct perf_record_text_modify {
> u64 old_addr;
> u64 new_addr;
> u32 old_len; /* up to MAX_SIZE */
> u32 new_len; /* up to MAX_SIZE */
> u8 old_text[MAX_SIZE];
> u8 new_text[MAX_SIZE];
> timestamp ;
> };
>
> In this way, this record is embedded in perf.data, and doesn't require
> extra processing during perf-record (only at the end of perf-record).
> This would work for the text-modifying case, as modifying text is simply
> old text to new text.
>
> A similar solution would not work for the BPF case, as bpf_prog_info is
> getting a lot more members in the near future.
>
> Does this make sense...?
I don't think we actually need old_text here either. We're creating a
new text mapping, there was nothing there before.
But still, perf events are limited to 64k, so that means we cannot
support symbols larger than that (although I suppose that would be
fairly rare).
Something like that could work, but I'm not sure it is actually better.
Some PT person would have to play with things I suppose.
> On Jan 9, 2019, at 4:41 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jan 08, 2019 at 08:45:19PM +0000, Alexei Starovoitov wrote:
>> On 1/8/19 12:29 PM, Peter Zijlstra wrote:
>>> On Thu, Dec 20, 2018 at 10:29:00AM -0800, Song Liu wrote:
>>>> The following example shows kernel symbols for a BPF program with 7
>>>> sub programs:
>>>>
>>>> ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F
>>>> ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F
>>>> ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F
>>>> ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F
>>>> ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F
>>>> ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F
>>>> ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F
>>>> ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi
>>>
>>> We should probably specify somewhere that the name can include a
>>> '[module]' part just like normal kallsyms. Even though you don't
>>> currently use that.
>>
>> there is no [module] equivalent in bpf land.
>
> I know; although you could consider each program its own separate
> module. But what I meant was, we should probably document the name[]
> format somewhere, maybe in the PERF_RECORD_KSYMBOL comment.
>
> The "symbol [module]" syntax can be used to create a DSO sort key, so
> you could simply put in "[bpf]" for all BPF generated symbols and have
> everything BPF grouped in perf-report when sorted on DSO.
In the current version, I use [bpf_prog] as the DSO for bpf programs. We
can probably add something to /proc/kallsyms as well. On the other hand,
"bpf_prog_<tag>_XXX" also indicates this is a BPF program.
Thanks,
Song
> On Jan 9, 2019, at 4:59 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Jan 09, 2019 at 11:32:50AM +0000, Song Liu wrote:
>> I was thinking about the modify-text-in-place scenario. In this case,
>> we can use something like
>>
>> struct perf_record_text_modify {
>> u64 addr;
>> u_big_enough old_instr;
>> u_big_enough new_instr;
>
> char[15] for x86 ;-)
>
> Also, I don't think we need old, we should already have the old text,
> either from a previous event or from the initial kcore snapshot.
>
>> timestamp ;
>
> that lives in struct sample_id.
>
>> };
>>
>> It is a fixed-size record, and we don't need to process it immediately
>> in user space. At the end of a perf run, a series of these events will
>> help us reconstruct the exact text at any time.
>
> That works for text_poke users, see also:
>
> https://lkml.kernel.org/r/[email protected]
>
> But it is useless for module / bpf / ftrace dynamic text.
I think we will end up with RECORD_KSYMBOL + something else for all cases.
For bpf, it is RECORD_KSYMBOL + (optional) RECORD_BPF_EVENT. For text_poke,
it will be RECORD_KSYMBOL + RECORD_TEXT_POKE. In all cases, RECORD_KSYMBOL
goes to the regular buffer and gets saved directly to perf.data. The other
record goes to a separate buffer, and requires extra processing.
>
>>> All we need is some means of ensuring the symbol is still there by the
>>> time we see the event and do the copy.
>>>
>>> I think we can do this with a new ioctl() on /proc/kcore itself:
>>>
>>> - when we have kcore open, we queue all text-free operations on list-1.
>>>
>>> - when we close kcore, we drain all (text-free) list-* and perform the
>>> pending frees immediately.
>>>
>>> - on ioctl(KCORE_QC) we perform the pending free of list-3 and advance
>>> list-2 to list-3 and list-1 to list-2.
>>>
>>> Perf would then open kcore at the start of the record, make a complete
>>> copy and keep the FD open. At the end of every buffer process, we issue
>>> KCORE_QC IFF we observed a ksym unreg in that buffer.
>>
>> Does this mean we need to scan every buffer before writing it to perf.data
>> during perf-record?
>
> Just like the BPF events, yes. Now for PT most of the actual data is not
> in the regular buffer, so it shouldn't be too horrible, but just like
> the BPF event, it can get its own buffer if it does become a problem.
I see. A separate buffer does make it better.
>
>> Also, if we need ksym unreg here, I guess it is NOT really modifying text
>> in-place, but creating a new version and swapping? Then can we include
>> something like this in perf.data:
>>
>> struct perf_record_text_modify {
>> u64 old_addr;
>> u64 new_addr;
>> u32 old_len; /* up to MAX_SIZE */
>> u32 new_len; /* up to MAX_SIZE */
>> u8 old_text[MAX_SIZE];
>> u8 new_text[MAX_SIZE];
>> timestamp ;
>> };
>>
>> In this way, this record is embedded in perf.data, and doesn't require
>> extra processing during perf-record (only at the end of perf-record).
>> This would work for the text-modifying case, as modifying text is simply
>> old text to new text.
>>
>> A similar solution would not work for the BPF case, as bpf_prog_info is
>> getting a lot more members in the near future.
>>
>> Does this make sense...?
>
> I don't think we actually need old_text here either. We're creating a
> new text mapping, there was nothing there before.
>
> But still, perf events are limited to 64k, so that means we cannot
> support symbols larger than that (although I suppose that would be
> fairly rare).
For larger symbols, I guess we can do one RECORD_KSYMBOL and multiple
RECORD_TEXT_MODIFY.
Thanks,
Song