2016-04-05 04:54:16

by Alexei Starovoitov

Subject: [PATCH net-next 0/8] allow bpf attach to tracepoints

Hi Steven, Peter,

last time we discussed bpf+tracepoints it was a year ago [1] and the reason
we didn't proceed with that approach was that it would expose the arguments
arg1, arg2 of the trace_xx(arg1, arg2) call to the bpf program, which was
considered an unnecessary extension of the ABI. Back then I wanted
to avoid the cost of the buffer allocation and field assignment in all
of the tracepoints, but it looks like, once optimized, that cost is acceptable.
So this new approach doesn't expose any new ABI to the bpf program.
The program looks at tracepoint fields after they have been copied
by perf_trace_xx(), as described in /sys/kernel/debug/tracing/events/xxx/format.
We made a tool [2] that takes the field list from /sys/.../format and works like this:
$ tplist.py -v random:urandom_read
int got_bits;
int pool_left;
int input_left;
Then these fields can be copy-pasted into the bpf program like this:
struct urandom_read {
__u64 hidden_pad;
int got_bits;
int pool_left;
int input_left;
};
and the program can use it:
SEC("tracepoint/random/urandom_read")
int bpf_prog(struct urandom_read *ctx)
{
return ctx->pool_left > 0 ? 1 : 0;
}
This way the program can access tracepoint fields faster than an
equivalent bpf+kprobe program, which is the main goal of these patches.

Patches 1 and 2 are simple changes on the perf core side, please review.
I'd like to take the whole set via net-next tree, since the rest of
the patches might conflict with other bpf work going on in net-next
and we want to avoid cross-tree merge conflicts.
Patch 7 is an example of access to tracepoint fields from bpf prog.
Patch 8 is a micro benchmark for bpf+kprobe vs bpf+tracepoint.

Note that for actual tracing tools the user doesn't need to
run tplist.py and copy-paste fields manually. The tools do it
automatically. For example, the argdist tool [3] can be used as:
$ argdist -H 't:block:block_rq_complete():u32:nr_sector'
where 'nr_sector' is the name of a tracepoint field taken from
/sys/kernel/debug/tracing/events/block/block_rq_complete/format
and an appropriate bpf program is generated on the fly.
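
For reference, what the tools (and the sample loader in patch 6) do under the
hood boils down to roughly the sketch below. It assumes a program fd already
obtained via bpf(BPF_PROG_LOAD, ...); the function name, the single-cpu open and
the error handling are simplified for illustration:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int attach_tp_prog(const char *id_path, int prog_fd)
{
	struct perf_event_attr attr = {};
	char buf[64] = {};
	int efd, event_fd;

	/* read the tracepoint id, e.g. .../events/random/urandom_read/id */
	efd = open(id_path, O_RDONLY);
	if (efd < 0 || read(efd, buf, sizeof(buf) - 1) <= 0)
		return -1;
	close(efd);

	attr.type = PERF_TYPE_TRACEPOINT;
	attr.config = atoi(buf);	/* tracepoint id from the /id file */
	attr.sample_type = PERF_SAMPLE_RAW;
	attr.sample_period = 1;

	/* one perf event on cpu 0, all tasks; real tools open one per cpu */
	event_fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (event_fd < 0)
		return -1;

	/* attach the bpf program and enable the event */
	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
	ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0);
	return event_fd;
}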

[1] http://thread.gmane.org/gmane.linux.kernel.api/8127/focus=8165
[2] https://github.com/iovisor/bcc/blob/master/tools/tplist.py
[3] https://github.com/iovisor/bcc/blob/master/tools/argdist.py

Alexei Starovoitov (8):
perf: optimize perf_fetch_caller_regs
perf, bpf: allow bpf programs attach to tracepoints
bpf: register BPF_PROG_TYPE_TRACEPOINT program type
bpf: support bpf_get_stackid() and bpf_perf_event_output() in
tracepoint programs
bpf: sanitize bpf tracepoint access
samples/bpf: add tracepoint support to bpf loader
samples/bpf: tracepoint example
samples/bpf: add tracepoint vs kprobe performance tests

include/linux/bpf.h | 2 +
include/linux/perf_event.h | 2 -
include/linux/trace_events.h | 1 +
include/trace/perf.h | 18 +++-
include/uapi/linux/bpf.h | 1 +
kernel/bpf/stackmap.c | 2 +-
kernel/bpf/verifier.c | 6 +-
kernel/events/core.c | 21 ++++-
kernel/trace/bpf_trace.c | 85 ++++++++++++++++-
kernel/trace/trace_event_perf.c | 4 +
kernel/trace/trace_events.c | 18 ++++
samples/bpf/Makefile | 5 +
samples/bpf/bpf_load.c | 26 +++++-
samples/bpf/offwaketime_kern.c | 26 +++++-
samples/bpf/test_overhead_kprobe_kern.c | 41 ++++++++
samples/bpf/test_overhead_tp_kern.c | 36 +++++++
samples/bpf/test_overhead_user.c | 161 ++++++++++++++++++++++++++++++++
17 files changed, 432 insertions(+), 23 deletions(-)
create mode 100644 samples/bpf/test_overhead_kprobe_kern.c
create mode 100644 samples/bpf/test_overhead_tp_kern.c
create mode 100644 samples/bpf/test_overhead_user.c

--
2.8.0


2016-04-05 04:53:00

by Alexei Starovoitov

Subject: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
attached to tracepoints.
The tracepoint will copy the arguments into the per-cpu buffer and pass
it to the bpf program as its first argument.
The layout of the fields can be discovered by doing
'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
prior to the compilation of the program, with the exception that the first 8 bytes
are reserved and not accessible to the program. This area is used to store
the pointer to 'struct pt_regs' which some of the bpf helpers will use:
+---------+
| 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
+---------+
| N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
+---------+
| dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
+---------+

Note that all of the fields are already dumped to user space via the perf ring buffer
and some applications access them directly without consulting tracepoint/format.
The same rule applies here: static tracepoint fields should only be accessed
in the format defined in tracepoint/format. The order of fields and
the field sizes are not an ABI.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/trace/perf.h | 18 ++++++++++++++----
include/uapi/linux/bpf.h | 1 +
kernel/events/core.c | 13 +++++++++----
kernel/trace/trace_event_perf.c | 3 +++
4 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index 26486fcd74ce..55feb69c873f 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -37,18 +37,19 @@ perf_trace_##call(void *__data, proto) \
struct trace_event_call *event_call = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
struct trace_event_raw_##call *entry; \
+ struct bpf_prog *prog = event_call->prog; \
struct pt_regs *__regs; \
u64 __addr = 0, __count = 1; \
struct task_struct *__task = NULL; \
struct hlist_head *head; \
int __entry_size; \
int __data_size; \
- int rctx; \
+ int rctx, event_type; \
\
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
head = this_cpu_ptr(event_call->perf_events); \
- if (__builtin_constant_p(!__task) && !__task && \
+ if (!prog && __builtin_constant_p(!__task) && !__task && \
hlist_empty(head)) \
return; \
\
@@ -56,8 +57,9 @@ perf_trace_##call(void *__data, proto) \
sizeof(u64)); \
__entry_size -= sizeof(u32); \
\
- entry = perf_trace_buf_prepare(__entry_size, \
- event_call->event.type, &__regs, &rctx); \
+ event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \
+ entry = perf_trace_buf_prepare(__entry_size, event_type, \
+ &__regs, &rctx); \
if (!entry) \
return; \
\
@@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto) \
\
{ assign; } \
\
+ if (prog) { \
+ *(struct pt_regs **)entry = __regs; \
+ if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
+ perf_swevent_put_recursion_context(rctx); \
+ return; \
+ } \
+ memset(&entry->ent, 0, sizeof(entry->ent)); \
+ } \
perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \
__count, __regs, head, __task); \
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 23917bb47bf3..70eda5aeb304 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -92,6 +92,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_KPROBE,
BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT,
+ BPF_PROG_TYPE_TRACEPOINT,
};

#define BPF_PSEUDO_MAP_FD 1
diff --git a/kernel/events/core.c b/kernel/events/core.c
index de24fbce5277..58fc9a7d1562 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6725,12 +6725,13 @@ int perf_swevent_get_recursion_context(void)
}
EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);

-inline void perf_swevent_put_recursion_context(int rctx)
+void perf_swevent_put_recursion_context(int rctx)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);

put_recursion_context(swhash->recursion, rctx);
}
+EXPORT_SYMBOL_GPL(perf_swevent_put_recursion_context);

void ___perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
{
@@ -7104,6 +7105,7 @@ static void perf_event_free_filter(struct perf_event *event)

static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
{
+ bool is_kprobe, is_tracepoint;
struct bpf_prog *prog;

if (event->attr.type != PERF_TYPE_TRACEPOINT)
@@ -7112,15 +7114,18 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
if (event->tp_event->prog)
return -EEXIST;

- if (!(event->tp_event->flags & TRACE_EVENT_FL_UKPROBE))
- /* bpf programs can only be attached to u/kprobes */
+ is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
+ is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
+ if (!is_kprobe && !is_tracepoint)
+ /* bpf programs can only be attached to u/kprobe or tracepoint */
return -EINVAL;

prog = bpf_prog_get(prog_fd);
if (IS_ERR(prog))
return PTR_ERR(prog);

- if (prog->type != BPF_PROG_TYPE_KPROBE) {
+ if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
+ (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
/* valid fd, but invalid bpf program type */
bpf_prog_put(prog);
return -EINVAL;
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 7a68afca8249..7ada829029d3 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -284,6 +284,9 @@ void *perf_trace_buf_prepare(int size, unsigned short type,
*regs = this_cpu_ptr(&__perf_regs[*rctxp]);
raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);

+ if (type == TRACE_EVENT_TYPE_MAX)
+ return raw_data;
+
/* zero the dead bytes from align to not leak stack to user */
memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64));

--
2.8.0

2016-04-05 04:53:05

by Alexei Starovoitov

Subject: [PATCH net-next 8/8] samples/bpf: add tracepoint vs kprobe performance tests

the first microbenchmark does
fd=open("/proc/self/comm");
for() {
write(fd, "test");
}
and on 4 cpus in parallel:
writes per sec
base (no tracepoints, no kprobes) 930k
with kprobe at __set_task_comm() 420k
with tracepoint at task:task_rename 730k

For the kprobe case the bpf program manually fetches oldcomm and newcomm via bpf_probe_read.
For the tracepoint case the bpf program does nothing, since the arguments are already copied by the tracepoint.

2nd microbenchmark does:
fd=open("/dev/urandom");
for() {
read(fd, buf);
}
and on 4 cpus in parallel:
reads per sec
base (no tracepoints, no kprobes) 300k
with kprobe at urandom_read() 279k
with tracepoint at random:urandom_read 290k

The bpf progs attached to the kprobe and the tracepoint are no-ops.
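
(For reference, judging from the argument parsing in test_overhead_user.c below,
the optional first argument is a decimal test_flags bitmask and the second is
the number of forked tasks, e.g.
$ ./test_overhead 63 4
runs the base, kprobe and tracepoint variants of both tests on 4 cpus;
with no arguments everything runs on 8 cpus.)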

Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 5 +
samples/bpf/test_overhead_kprobe_kern.c | 41 ++++++++
samples/bpf/test_overhead_tp_kern.c | 36 +++++++
samples/bpf/test_overhead_user.c | 161 ++++++++++++++++++++++++++++++++
4 files changed, 243 insertions(+)
create mode 100644 samples/bpf/test_overhead_kprobe_kern.c
create mode 100644 samples/bpf/test_overhead_tp_kern.c
create mode 100644 samples/bpf/test_overhead_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 502c9fc8db85..9959771bf808 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -19,6 +19,7 @@ hostprogs-y += lathist
hostprogs-y += offwaketime
hostprogs-y += spintest
hostprogs-y += map_perf_test
+hostprogs-y += test_overhead

test_verifier-objs := test_verifier.o libbpf.o
test_maps-objs := test_maps.o libbpf.o
@@ -38,6 +39,7 @@ lathist-objs := bpf_load.o libbpf.o lathist_user.o
offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
spintest-objs := bpf_load.o libbpf.o spintest_user.o
map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
+test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o

# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -56,6 +58,8 @@ always += lathist_kern.o
always += offwaketime_kern.o
always += spintest_kern.o
always += map_perf_test_kern.o
+always += test_overhead_tp_kern.o
+always += test_overhead_kprobe_kern.o

HOSTCFLAGS += -I$(objtree)/usr/include

@@ -75,6 +79,7 @@ HOSTLOADLIBES_lathist += -lelf
HOSTLOADLIBES_offwaketime += -lelf
HOSTLOADLIBES_spintest += -lelf
HOSTLOADLIBES_map_perf_test += -lelf -lrt
+HOSTLOADLIBES_test_overhead += -lelf -lrt

# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/test_overhead_kprobe_kern.c b/samples/bpf/test_overhead_kprobe_kern.c
new file mode 100644
index 000000000000..468a66a92ef9
--- /dev/null
+++ b/samples/bpf/test_overhead_kprobe_kern.c
@@ -0,0 +1,41 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/version.h>
+#include <linux/ptrace.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#define _(P) ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
+
+SEC("kprobe/__set_task_comm")
+int prog(struct pt_regs *ctx)
+{
+ struct signal_struct *signal;
+ struct task_struct *tsk;
+ char oldcomm[16] = {};
+ char newcomm[16] = {};
+ u16 oom_score_adj;
+ u32 pid;
+
+ tsk = (void *)PT_REGS_PARM1(ctx);
+
+ pid = _(tsk->pid);
+ bpf_probe_read(oldcomm, sizeof(oldcomm), &tsk->comm);
+ bpf_probe_read(newcomm, sizeof(newcomm), (void *)PT_REGS_PARM2(ctx));
+ signal = _(tsk->signal);
+ oom_score_adj = _(signal->oom_score_adj);
+ return 0;
+}
+
+SEC("kprobe/urandom_read")
+int prog2(struct pt_regs *ctx)
+{
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/test_overhead_tp_kern.c b/samples/bpf/test_overhead_tp_kern.c
new file mode 100644
index 000000000000..38f5c0b9da9f
--- /dev/null
+++ b/samples/bpf/test_overhead_tp_kern.c
@@ -0,0 +1,36 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* from /sys/kernel/debug/tracing/events/task/task_rename/format */
+struct task_rename {
+ __u64 pad;
+ __u32 pid;
+ char oldcomm[16];
+ char newcomm[16];
+ __u16 oom_score_adj;
+};
+SEC("tracepoint/task/task_rename")
+int prog(struct task_rename *ctx)
+{
+ return 0;
+}
+
+/* from /sys/kernel/debug/tracing/events/random/urandom_read/format */
+struct urandom_read {
+ __u64 pad;
+ int got_bits;
+ int pool_left;
+ int input_left;
+};
+SEC("tracepoint/random/urandom_read")
+int prog2(struct urandom_read *ctx)
+{
+ return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_overhead_user.c b/samples/bpf/test_overhead_user.c
new file mode 100644
index 000000000000..6872d67e151d
--- /dev/null
+++ b/samples/bpf/test_overhead_user.c
@@ -0,0 +1,161 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <asm/unistd.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <assert.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include <string.h>
+#include <time.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_CNT 1000000
+
+static __u64 time_get_ns(void)
+{
+ struct timespec ts;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts);
+ return ts.tv_sec * 1000000000ull + ts.tv_nsec;
+}
+
+static void test_task_rename(int cpu)
+{
+ __u64 start_time;
+ char buf[] = "test\n";
+ int i, fd;
+
+ fd = open("/proc/self/comm", O_WRONLY|O_TRUNC);
+ if (fd < 0) {
+ printf("couldn't open /proc\n");
+ exit(1);
+ }
+ start_time = time_get_ns();
+ for (i = 0; i < MAX_CNT; i++)
+ write(fd, buf, sizeof(buf));
+ printf("task_rename:%d: %lld events per sec\n",
+ cpu, MAX_CNT * 1000000000ll / (time_get_ns() - start_time));
+ close(fd);
+}
+
+static void test_urandom_read(int cpu)
+{
+ __u64 start_time;
+ char buf[4];
+ int i, fd;
+
+ fd = open("/dev/urandom", O_RDONLY);
+ if (fd < 0) {
+ printf("couldn't open /dev/urandom\n");
+ exit(1);
+ }
+ start_time = time_get_ns();
+ for (i = 0; i < MAX_CNT; i++)
+ read(fd, buf, sizeof(buf));
+ printf("urandom_read:%d: %lld events per sec\n",
+ cpu, MAX_CNT * 1000000000ll / (time_get_ns() - start_time));
+ close(fd);
+}
+
+static void loop(int cpu, int flags)
+{
+ cpu_set_t cpuset;
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpu, &cpuset);
+ sched_setaffinity(0, sizeof(cpuset), &cpuset);
+
+ if (flags & 1)
+ test_task_rename(cpu);
+ if (flags & 2)
+ test_urandom_read(cpu);
+}
+
+static void run_perf_test(int tasks, int flags)
+{
+ pid_t pid[tasks];
+ int i;
+
+ for (i = 0; i < tasks; i++) {
+ pid[i] = fork();
+ if (pid[i] == 0) {
+ loop(i, flags);
+ exit(0);
+ } else if (pid[i] == -1) {
+ printf("couldn't spawn #%d process\n", i);
+ exit(1);
+ }
+ }
+ for (i = 0; i < tasks; i++) {
+ int status;
+
+ assert(waitpid(pid[i], &status, 0) == pid[i]);
+ assert(status == 0);
+ }
+}
+
+static void unload_progs(void)
+{
+ close(prog_fd[0]);
+ close(prog_fd[1]);
+ close(event_fd[0]);
+ close(event_fd[1]);
+}
+
+int main(int argc, char **argv)
+{
+ struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+ char filename[256];
+ int num_cpu = 8;
+ int test_flags = ~0;
+
+ setrlimit(RLIMIT_MEMLOCK, &r);
+
+ if (argc > 1)
+ test_flags = atoi(argv[1]) ? : test_flags;
+ if (argc > 2)
+ num_cpu = atoi(argv[2]) ? : num_cpu;
+
+ if (test_flags & 0x3) {
+ printf("BASE\n");
+ run_perf_test(num_cpu, test_flags);
+ }
+
+ if (test_flags & 0xC) {
+ snprintf(filename, sizeof(filename),
+ "%s_kprobe_kern.o", argv[0]);
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+ printf("w/KPROBE\n");
+ run_perf_test(num_cpu, test_flags >> 2);
+ unload_progs();
+ }
+
+ if (test_flags & 0x30) {
+ snprintf(filename, sizeof(filename),
+ "%s_tp_kern.o", argv[0]);
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+ printf("w/TRACEPOINT\n");
+ run_perf_test(num_cpu, test_flags >> 4);
+ unload_progs();
+ }
+
+ return 0;
+}
--
2.8.0

2016-04-05 04:53:50

by Alexei Starovoitov

Subject: [PATCH net-next 5/8] bpf: sanitize bpf tracepoint access

during bpf program loading remember the offset of the last byte of ctx accessed
and at the time of attaching the program to a tracepoint check that
the program doesn't access bytes beyond those defined by the tracepoint fields.

This also disallows access to __dynamic_array fields, but that can be
relaxed in the future.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/bpf.h | 1 +
include/linux/trace_events.h | 1 +
kernel/bpf/verifier.c | 6 +++++-
kernel/events/core.c | 8 ++++++++
kernel/trace/trace_events.c | 18 ++++++++++++++++++
5 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 198f6ace70ec..b2365a6eba3d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -131,6 +131,7 @@ struct bpf_prog_type_list {
struct bpf_prog_aux {
atomic_t refcnt;
u32 used_map_cnt;
+ u32 max_ctx_offset;
const struct bpf_verifier_ops *ops;
struct bpf_map **used_maps;
struct bpf_prog *prog;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 0810f81b6db2..97bd7da98567 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -569,6 +569,7 @@ extern int trace_define_field(struct trace_event_call *call, const char *type,
int is_signed, int filter_type);
extern int trace_add_event_call(struct trace_event_call *call);
extern int trace_remove_event_call(struct trace_event_call *call);
+extern int trace_event_get_offsets(struct trace_event_call *call);

#define is_signed_type(type) (((type)(-1)) < (type)1)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e9b771..58792fed5678 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -652,8 +652,12 @@ static int check_ctx_access(struct verifier_env *env, int off, int size,
enum bpf_access_type t)
{
if (env->prog->aux->ops->is_valid_access &&
- env->prog->aux->ops->is_valid_access(off, size, t))
+ env->prog->aux->ops->is_valid_access(off, size, t)) {
+ /* remember the offset of last byte accessed in ctx */
+ if (env->prog->aux->max_ctx_offset < off + size)
+ env->prog->aux->max_ctx_offset = off + size;
return 0;
+ }

verbose("invalid bpf_context access off=%d size=%d\n", off, size);
return -EACCES;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 58fc9a7d1562..e7e3c2057582 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7131,6 +7131,14 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
return -EINVAL;
}

+ if (is_tracepoint) {
+ int off = trace_event_get_offsets(event->tp_event);
+
+ if (prog->aux->max_ctx_offset > off) {
+ bpf_prog_put(prog);
+ return -EACCES;
+ }
+ }
event->tp_event->prog = prog;

return 0;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 05ddc0820771..ced963049e0a 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -204,6 +204,24 @@ static void trace_destroy_fields(struct trace_event_call *call)
}
}

+/*
+ * run-time version of trace_event_get_offsets_<call>() that returns the last
+ * accessible offset of trace fields excluding __dynamic_array bytes
+ */
+int trace_event_get_offsets(struct trace_event_call *call)
+{
+ struct ftrace_event_field *tail;
+ struct list_head *head;
+
+ head = trace_get_fields(call);
+ /*
+ * head->next points to the last field with the largest offset,
+ * since it was added last by trace_define_field()
+ */
+ tail = list_first_entry(head, struct ftrace_event_field, link);
+ return tail->offset + tail->size;
+}
+
int trace_event_raw_init(struct trace_event_call *call)
{
int id;
--
2.8.0

2016-04-05 04:54:19

by Alexei Starovoitov

Subject: [PATCH net-next 1/8] perf: optimize perf_fetch_caller_regs

avoid the memset in perf_fetch_caller_regs, since it's on the critical path of all tracepoints.
It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
with this_cpu_ptr(&__perf_regs[..]), which is zero initialized by percpu_alloc, and the
subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
so we can safely drop the memset from all of the above cases and move it into
perf_ftrace_function_call, which calls it with a stack allocated pt_regs.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/perf_event.h | 2 --
kernel/trace/trace_event_perf.c | 1 +
2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f291275ffd71..e89f7199c223 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -882,8 +882,6 @@ static inline void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned lo
*/
static inline void perf_fetch_caller_regs(struct pt_regs *regs)
{
- memset(regs, 0, sizeof(*regs));
-
perf_arch_fetch_caller_regs(regs, CALLER_ADDR0);
}

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 00df25fd86ef..7a68afca8249 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -316,6 +316,7 @@ perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,

BUILD_BUG_ON(ENTRY_SIZE > PERF_MAX_TRACE_SIZE);

+ memset(&regs, 0, sizeof(regs));
perf_fetch_caller_regs(&regs);

entry = perf_trace_buf_prepare(ENTRY_SIZE, TRACE_FN, NULL, &rctx);
--
2.8.0

2016-04-05 04:54:14

by Alexei Starovoitov

Subject: [PATCH net-next 6/8] samples/bpf: add tracepoint support to bpf loader

Recognize "tracepoint/" section name prefix and attach the program
to that tracepoint.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/bpf_load.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 58f86bd11b3d..022af71c2bb5 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -49,6 +49,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_socket = strncmp(event, "socket", 6) == 0;
bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+ bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
@@ -63,6 +64,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
} else if (is_kprobe || is_kretprobe) {
prog_type = BPF_PROG_TYPE_KPROBE;
+ } else if (is_tracepoint) {
+ prog_type = BPF_PROG_TYPE_TRACEPOINT;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -111,12 +114,23 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
event, strerror(errno));
return -1;
}
- }

- strcpy(buf, DEBUGFS);
- strcat(buf, "events/kprobes/");
- strcat(buf, event);
- strcat(buf, "/id");
+ strcpy(buf, DEBUGFS);
+ strcat(buf, "events/kprobes/");
+ strcat(buf, event);
+ strcat(buf, "/id");
+ } else if (is_tracepoint) {
+ event += 11;
+
+ if (*event == 0) {
+ printf("event name cannot be empty\n");
+ return -1;
+ }
+ strcpy(buf, DEBUGFS);
+ strcat(buf, "events/");
+ strcat(buf, event);
+ strcat(buf, "/id");
+ }

efd = open(buf, O_RDONLY, 0);
if (efd < 0) {
@@ -304,6 +318,7 @@ int load_bpf_file(char *path)

if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
memcmp(shname_prog, "kretprobe/", 10) == 0 ||
+ memcmp(shname_prog, "tracepoint/", 11) == 0 ||
memcmp(shname_prog, "socket", 6) == 0)
load_and_attach(shname_prog, insns, data_prog->d_size);
}
@@ -320,6 +335,7 @@ int load_bpf_file(char *path)

if (memcmp(shname, "kprobe/", 7) == 0 ||
memcmp(shname, "kretprobe/", 10) == 0 ||
+ memcmp(shname, "tracepoint/", 11) == 0 ||
memcmp(shname, "socket", 6) == 0)
load_and_attach(shname, data->d_buf, data->d_size);
}
--
2.8.0

2016-04-05 04:55:45

by Alexei Starovoitov

Subject: [PATCH net-next 3/8] bpf: register BPF_PROG_TYPE_TRACEPOINT program type

register the tracepoint bpf program type and let it call the same set
of helper functions as BPF_PROG_TYPE_KPROBE.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
kernel/trace/bpf_trace.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3e4ffb3ace5f..3e5ebe3254d2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -268,7 +268,7 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = {
.arg5_type = ARG_CONST_STACK_SIZE,
};

-static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
case BPF_FUNC_map_lookup_elem:
@@ -295,12 +295,20 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_get_smp_processor_id_proto;
case BPF_FUNC_perf_event_read:
return &bpf_perf_event_read_proto;
+ default:
+ return NULL;
+ }
+}
+
+static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
case BPF_FUNC_perf_event_output:
return &bpf_perf_event_output_proto;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto;
default:
- return NULL;
+ return tracing_func_proto(func_id);
}
}

@@ -332,9 +340,42 @@ static struct bpf_prog_type_list kprobe_tl = {
.type = BPF_PROG_TYPE_KPROBE,
};

+static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_perf_event_output:
+ case BPF_FUNC_get_stackid:
+ return NULL;
+ default:
+ return tracing_func_proto(func_id);
+ }
+}
+
+static bool tp_prog_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+ if (off < sizeof(void *) || off >= PERF_MAX_TRACE_SIZE)
+ return false;
+ if (type != BPF_READ)
+ return false;
+ if (off % size != 0)
+ return false;
+ return true;
+}
+
+static const struct bpf_verifier_ops tracepoint_prog_ops = {
+ .get_func_proto = tp_prog_func_proto,
+ .is_valid_access = tp_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list tracepoint_tl = {
+ .ops = &tracepoint_prog_ops,
+ .type = BPF_PROG_TYPE_TRACEPOINT,
+};
+
static int __init register_kprobe_prog_ops(void)
{
bpf_register_prog_type(&kprobe_tl);
+ bpf_register_prog_type(&tracepoint_tl);
return 0;
}
late_initcall(register_kprobe_prog_ops);
--
2.8.0

2016-04-05 04:55:43

by Alexei Starovoitov

Subject: [PATCH net-next 4/8] bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs

two wrapper functions are needed to fetch 'struct pt_regs *' and convert
the tracepoint bpf context into the kprobe bpf context, so that the
existing helper functions can be reused

Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/bpf.h | 1 +
kernel/bpf/stackmap.c | 2 +-
kernel/trace/bpf_trace.c | 42 +++++++++++++++++++++++++++++++++++++++++-
3 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 21ee41b92e8a..198f6ace70ec 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -160,6 +160,7 @@ struct bpf_array {
#define MAX_TAIL_CALL_CNT 32

u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
+u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
void bpf_fd_array_map_clear(struct bpf_map *map);
bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 499d9e933f8e..35114725cf30 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -116,7 +116,7 @@ free_smap:
return ERR_PTR(err);
}

-static u64 bpf_get_stackid(u64 r1, u64 r2, u64 flags, u64 r4, u64 r5)
+u64 bpf_get_stackid(u64 r1, u64 r2, u64 flags, u64 r4, u64 r5)
{
struct pt_regs *regs = (struct pt_regs *) (long) r1;
struct bpf_map *map = (struct bpf_map *) (long) r2;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3e5ebe3254d2..413ec5614180 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -340,12 +340,52 @@ static struct bpf_prog_type_list kprobe_tl = {
.type = BPF_PROG_TYPE_KPROBE,
};

+static u64 bpf_perf_event_output_tp(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
+{
+ /*
+ * r1 points to perf tracepoint buffer where first 8 bytes are hidden
+ * from bpf program and contain a pointer to 'struct pt_regs'. Fetch it
+ * from there and call the same bpf_perf_event_output() helper
+ */
+ u64 ctx = *(long *)r1;
+
+ return bpf_perf_event_output(ctx, r2, index, r4, size);
+}
+
+static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
+ .func = bpf_perf_event_output_tp,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_CONST_MAP_PTR,
+ .arg3_type = ARG_ANYTHING,
+ .arg4_type = ARG_PTR_TO_STACK,
+ .arg5_type = ARG_CONST_STACK_SIZE,
+};
+
+static u64 bpf_get_stackid_tp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ u64 ctx = *(long *)r1;
+
+ return bpf_get_stackid(ctx, r2, r3, r4, r5);
+}
+
+static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
+ .func = bpf_get_stackid_tp,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_CONST_MAP_PTR,
+ .arg3_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
case BPF_FUNC_perf_event_output:
+ return &bpf_perf_event_output_proto_tp;
case BPF_FUNC_get_stackid:
- return NULL;
+ return &bpf_get_stackid_proto_tp;
default:
return tracing_func_proto(func_id);
}
--
2.8.0

2016-04-05 04:55:41

by Alexei Starovoitov

Subject: [PATCH net-next 7/8] samples/bpf: tracepoint example

modify offwaketime to work with the sched/sched_switch tracepoint
instead of a kprobe on finish_task_switch

Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/offwaketime_kern.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/samples/bpf/offwaketime_kern.c b/samples/bpf/offwaketime_kern.c
index c0aa5a9b9c48..983629a31c79 100644
--- a/samples/bpf/offwaketime_kern.c
+++ b/samples/bpf/offwaketime_kern.c
@@ -73,7 +73,7 @@ int waker(struct pt_regs *ctx)
return 0;
}

-static inline int update_counts(struct pt_regs *ctx, u32 pid, u64 delta)
+static inline int update_counts(void *ctx, u32 pid, u64 delta)
{
struct key_t key = {};
struct wokeby_t *woke;
@@ -100,15 +100,33 @@ static inline int update_counts(struct pt_regs *ctx, u32 pid, u64 delta)
return 0;
}

+#if 1
+/* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
+struct sched_switch_args {
+ unsigned long long pad;
+ char prev_comm[16];
+ int prev_pid;
+ int prev_prio;
+ long long prev_state;
+ char next_comm[16];
+ int next_pid;
+ int next_prio;
+};
+SEC("tracepoint/sched/sched_switch")
+int oncpu(struct sched_switch_args *ctx)
+{
+ /* record previous thread sleep time */
+ u32 pid = ctx->prev_pid;
+#else
SEC("kprobe/finish_task_switch")
int oncpu(struct pt_regs *ctx)
{
struct task_struct *p = (void *) PT_REGS_PARM1(ctx);
+ /* record previous thread sleep time */
+ u32 pid = _(p->pid);
+#endif
u64 delta, ts, *tsp;
- u32 pid;

- /* record previous thread sleep time */
- pid = _(p->pid);
ts = bpf_ktime_get_ns();
bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);

--
2.8.0

2016-04-05 12:07:10

by Peter Zijlstra

Subject: Re: [PATCH net-next 1/8] perf: optimize perf_fetch_caller_regs

On Mon, Apr 04, 2016 at 09:52:47PM -0700, Alexei Starovoitov wrote:
> avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
> It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
> with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by percpu_alloc

It's not actually allocated; but because it's a static uninitialized
variable we get .bss-like behaviour and the initial value is copied to
all CPUs when the per-cpu allocator thingy bootstraps SMP IIRC.

> and
> subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
> so we can safely drop memset from all of the above cases and

Indeed.

> move it into
> perf_ftrace_function_call that calls it with stack allocated pt_regs.

Hmm, is there a reason that's still on-stack instead of using the
per-cpu thing, Steve?

> Signed-off-by: Alexei Starovoitov <[email protected]>

In any case,

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2016-04-05 14:19:14

by Peter Zijlstra

Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On Mon, Apr 04, 2016 at 09:52:48PM -0700, Alexei Starovoitov wrote:
> introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
> attached to tracepoints.

More specifically the perf tracepoint handler, not tracepoints directly.

> The tracepoint will copy the arguments in the per-cpu buffer and pass
> it to the bpf program as its first argument.

> The layout of the fields can be discovered by doing
> 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
> prior to the compilation of the program with exception that first 8 bytes
> are reserved and not accessible to the program. This area is used to store
> the pointer to 'struct pt_regs' which some of the bpf helpers will use:
> +---------+
> | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
> +---------+
> | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
> +---------+
> | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
> +---------+
>
> Note that all of the fields are already dumped to user space via perf ring buffer
> and some application access it directly without consulting tracepoint/format.

We call those apps broken..

> Same rule applies here: static tracepoint fields should only be accessed
> in a format defined in tracepoint/format. The order of fields and
> field sizes are not an ABI.


> @@ -56,8 +57,9 @@ perf_trace_##call(void *__data, proto) \
> sizeof(u64)); \
> __entry_size -= sizeof(u32); \
> \
> - entry = perf_trace_buf_prepare(__entry_size, \
> - event_call->event.type, &__regs, &rctx); \
> + event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \
> + entry = perf_trace_buf_prepare(__entry_size, event_type, \
> + &__regs, &rctx); \
> if (!entry) \
> return; \
> \
> @@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto) \
> \
> { assign; } \
> \
> + if (prog) { \
> + *(struct pt_regs **)entry = __regs; \
> + if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
> + perf_swevent_put_recursion_context(rctx); \
> + return; \

So if the prog 'fails' you consume the entry,

> + } \
> + memset(&entry->ent, 0, sizeof(entry->ent)); \

But if not, you destroy it and then feed it to perf?

> + } \
> perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \
> __count, __regs, head, __task); \
> }


> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index 7a68afca8249..7ada829029d3 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -284,6 +284,9 @@ void *perf_trace_buf_prepare(int size, unsigned short type,
> *regs = this_cpu_ptr(&__perf_regs[*rctxp]);
> raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
>
> + if (type == TRACE_EVENT_TYPE_MAX)
> + return raw_data;
> +
> /* zero the dead bytes from align to not leak stack to user */
> memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64));

What's this hunk do? Why can you skip this stuff for BPF attached
events?

2016-04-05 17:42:20

by Alexei Starovoitov

Subject: Re: [PATCH net-next 1/8] perf: optimize perf_fetch_caller_regs

On 4/5/16 5:06 AM, Peter Zijlstra wrote:
> On Mon, Apr 04, 2016 at 09:52:47PM -0700, Alexei Starovoitov wrote:
>> avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
>> It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
>> with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by percpu_alloc
>
> Its not actually allocated; but because its a static uninitialized
> variable we get .bss like behaviour and the initial value is copied to
> all CPUs when the per-cpu allocator thingy bootstraps SMP IIRC.

yes, it's .bss-like in a special section. I think static percpu still
goes through some fancy boot time init similar to dynamic.
What I tried to emphasize is that both static and dynamic percpu areas
are guaranteed to be zero initialized.

>> and
>> subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
>> so we can safely drop memset from all of the above cases and
>
> Indeed.
>
>> move it into
>> perf_ftrace_function_call that calls it with stack allocated pt_regs.
>
> Hmm, is there a reason that's still on-stack instead of using the
> per-cpu thing, Steve?
>
>> Signed-off-by: Alexei Starovoitov <[email protected]>
>
> In any case,
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

Thanks for the quick review.

2016-04-05 18:11:12

by Alexei Starovoitov

Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On 4/5/16 7:18 AM, Peter Zijlstra wrote:
> On Mon, Apr 04, 2016 at 09:52:48PM -0700, Alexei Starovoitov wrote:
>> introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
>> attached to tracepoints.
>
> More specifically the perf tracepoint handler, not tracepoints directly.

yes. perf tracepoint handler only. There is no attempt here to attach
to ftrace tracepoint handler.

>> The tracepoint will copy the arguments in the per-cpu buffer and pass
>> it to the bpf program as its first argument.
>
>> The layout of the fields can be discovered by doing
>> 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
>> prior to the compilation of the program with exception that first 8 bytes
>> are reserved and not accessible to the program. This area is used to store
>> the pointer to 'struct pt_regs' which some of the bpf helpers will use:
>> +---------+
>> | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
>> +---------+
>> | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
>> +---------+
>> | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
>> +---------+
>>
>> Note that all of the fields are already dumped to user space via perf ring buffer
>> and some application access it directly without consulting tracepoint/format.
>
> We call those apps broken..

yes.
uapi/linux/perf_event.h lines 742-749 are pretty clear about it:
"In other words, PERF_SAMPLE_RAW contents are not an ABI"

>> Same rule applies here: static tracepoint fields should only be accessed
>> in a format defined in tracepoint/format. The order of fields and
>> field sizes are not an ABI.
>
>
>> @@ -56,8 +57,9 @@ perf_trace_##call(void *__data, proto) \
>> sizeof(u64)); \
>> __entry_size -= sizeof(u32); \
>> \
>> - entry = perf_trace_buf_prepare(__entry_size, \
>> - event_call->event.type, &__regs, &rctx); \
>> + event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \
>> + entry = perf_trace_buf_prepare(__entry_size, event_type, \
>> + &__regs, &rctx); \
>> if (!entry) \
>> return; \
>> \
>> @@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto) \
>> \
>> { assign; } \
>> \
>> + if (prog) { \
>> + *(struct pt_regs **)entry = __regs; \
>> + if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
>> + perf_swevent_put_recursion_context(rctx); \
>> + return; \
>
> So if the prog 'fails' you consume the entry,

I wouldn't call it 'fails' ;)
The interpretation of return code from bpf program is defined
in kernel/trace/bpf_trace.c as:
* 0 - return from kprobe (event is filtered out)
* 1 - store kprobe event into ring buffer
* Other values are reserved and currently alias to 1

so the above !trace_call_bpf() check matches existing bpf+kprobe
behavior.

>
>> + } \
>> + memset(&entry->ent, 0, sizeof(entry->ent)); \
>
> But if not, you destroy it and then feed it to perf?

yes. If bpf prog returns 1 the buffer goes into normal ring-buffer
with all perf_event attributes and so on.
So far there wasn't a single real use case where we went this path.
Programs always do aggregation inside and pass stuff to user space
either via bpf maps or via bpf_perf_event_output() helper.
I wanted to keep perf_trace_xx() calls to be minimal in .text size
so memset above is one x86 instruction, but I don't mind
replacing this memset with a call to a helper function that will do:
local_save_flags(flags);
tracing_generic_entry_update(entry, flags, preempt_count());
entry->type = type;
Then whether bpf attached or not the ring buffer will see the same
raw tracepoint entry. You think it's cleaner?

>
>> + } \
>> perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \
>> __count, __regs, head, __task); \
>> }
>
>
>> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
>> index 7a68afca8249..7ada829029d3 100644
>> --- a/kernel/trace/trace_event_perf.c
>> +++ b/kernel/trace/trace_event_perf.c
>> @@ -284,6 +284,9 @@ void *perf_trace_buf_prepare(int size, unsigned short type,
>> *regs = this_cpu_ptr(&__perf_regs[*rctxp]);
>> raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
>>
>> + if (type == TRACE_EVENT_TYPE_MAX)
>> + return raw_data;
>> +
>> /* zero the dead bytes from align to not leak stack to user */
>> memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64));
>
> What's this hunk do? Why can you skip this stuff for BPF attached
> events?

that hunk is skipping the init of the first 8 bytes, which are normally
occupied by 'struct trace_entry' and are inited as:
local_save_flags(flags);
tracing_generic_entry_update(entry, flags, preempt_count());
entry->type = type;
That init adds extra overhead that bpf progs don't need.
If bpf needs the current pid, it calls the bpf_get_current_pid_tgid() helper.
local_save_flags() is also a quite slow x86 insn that is not needed.
If programs ever need to read the flags, we can introduce a new helper
to read them.
These 8 bytes are instead used to store hidden 'struct pt_regs'
pointer which is invisible to bpf programs directly and used
by two bpf helpers: bpf_get_stackid() and bpf_perf_event_output()
which need pt_regs. See patch 4/8

2016-04-05 18:17:01

by Peter Zijlstra

Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On Tue, Apr 05, 2016 at 11:09:30AM -0700, Alexei Starovoitov wrote:
> >>@@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto) \
> >> \
> >> { assign; } \
> >> \
> >>+ if (prog) { \
> >>+ *(struct pt_regs **)entry = __regs; \
> >>+ if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
> >>+ perf_swevent_put_recursion_context(rctx); \
> >>+ return; \
> >>+ } \
> >>+ memset(&entry->ent, 0, sizeof(entry->ent)); \
> >
> >But if not, you destroy it and then feed it to perf?
>
> yes. If bpf prog returns 1 the buffer goes into normal ring-buffer
> with all perf_event attributes and so on.
> So far there wasn't a single real use case where we went this path.
> Programs always do aggregation inside and pass stuff to user space
> either via bpf maps or via bpf_perf_event_output() helper.
> I wanted to keep perf_trace_xx() calls to be minimal in .text size
> so memset above is one x86 instruction, but I don't mind
> replacing this memset with a call to a helper function that will do:
> local_save_flags(flags);
> tracing_generic_entry_update(entry, flags, preempt_count());
> entry->type = type;
> Then whether bpf attached or not the ring buffer will see the same
> raw tracepoint entry. You think it's cleaner?

Yeah, otherwise you get very weird and surprising behaviour.

Also, one possible use-case is dynamic filters where the BPF program is
basically used to filter events, although I suppose we already have a
hook for that elsewhere.

2016-04-05 18:24:22

by Alexei Starovoitov

Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On 4/5/16 11:16 AM, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 11:09:30AM -0700, Alexei Starovoitov wrote:
>>>> @@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto) \
>>>> \
>>>> { assign; } \
>>>> \
>>>> + if (prog) { \
>>>> + *(struct pt_regs **)entry = __regs; \
>>>> + if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
>>>> + perf_swevent_put_recursion_context(rctx); \
>>>> + return; \
>>>> + } \
>>>> + memset(&entry->ent, 0, sizeof(entry->ent)); \
>>>
>>> But if not, you destroy it and then feed it to perf?
>>
>> yes. If bpf prog returns 1 the buffer goes into normal ring-buffer
>> with all perf_event attributes and so on.
>> So far there wasn't a single real use case where we went this path.
>> Programs always do aggregation inside and pass stuff to user space
>> either via bpf maps or via bpf_perf_event_output() helper.
>> I wanted to keep perf_trace_xx() calls to be minimal in .text size
>> so memset above is one x86 instruction, but I don't mind
>> replacing this memset with a call to a helper function that will do:
>> local_save_flags(flags);
>> tracing_generic_entry_update(entry, flags, preempt_count());
>> entry->type = type;
>> Then whether bpf attached or not the ring buffer will see the same
>> raw tracepoint entry. You think it's cleaner?
>
> Yeah, otherwise you get very weird and surprising behaviour.

ok. will respin.

> Also, one possible use-case is dynamic filters where the BPF program is
> basically used to filter events, although I suppose we already have a
> hook for that elsewhere.

There is no other bpf hook for tracepoints. This patch is it.
And yes, after respin such use case will be possible.
Thanks!

2016-04-08 22:12:24

by Steven Rostedt

Subject: Re: [PATCH net-next 1/8] perf: optimize perf_fetch_caller_regs

On Tue, 5 Apr 2016 14:06:26 +0200
Peter Zijlstra <[email protected]> wrote:

> On Mon, Apr 04, 2016 at 09:52:47PM -0700, Alexei Starovoitov wrote:
> > avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
> > It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
> > with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by percpu_alloc
>
> Its not actually allocated; but because its a static uninitialized
> variable we get .bss like behaviour and the initial value is copied to
> all CPUs when the per-cpu allocator thingy bootstraps SMP IIRC.
>
> > and
> > subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
> > so we can safely drop memset from all of the above cases and
>
> Indeed.
>
> > move it into
> > perf_ftrace_function_call that calls it with stack allocated pt_regs.
>
> Hmm, is there a reason that's still on-stack instead of using the
> per-cpu thing, Steve?

Well, what do you do when you are tracing with regs in an interrupt
that already set the per cpu regs field? We could create our own
per-cpu one as well, but then that would require checking which level
we are in, as we can have one for normal context, one for softirq
context, one for irq context and one for nmi context.

-- Steve



>
> > Signed-off-by: Alexei Starovoitov <[email protected]>
>
> In any case,
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

2016-04-18 16:13:18

by Steven Rostedt

Subject: Re: [PATCH net-next 0/8] allow bpf attach to tracepoints

On Mon, 4 Apr 2016 21:52:46 -0700
Alexei Starovoitov <[email protected]> wrote:

> Hi Steven, Peter,
>
> last time we discussed bpf+tracepoints it was a year ago [1] and the reason
> we didn't proceed with that approach was that bpf would make arguments
> arg1, arg2 to trace_xx(arg1, arg2) call to be exposed to bpf program
> and that was considered unnecessary extension of abi. Back then I wanted
> to avoid the cost of buffer alloc and field assign part in all
> of the tracepoints, but looks like when optimized the cost is acceptable.
>> So this new approach doesn't expose any new abi to bpf program.
> The program is looking at tracepoint fields after they were copied
> by perf_trace_xx() and described in /sys/kernel/debug/tracing/events/xxx/format

Does this mean that ftrace could use this ability as well? As all the
current filtering of ftrace was done after it was copied to the buffer,
and that was what you wanted to avoid.

-- Steve

2016-04-18 19:52:54

by Alexei Starovoitov

Subject: Re: [PATCH net-next 0/8] allow bpf attach to tracepoints

On 4/18/16 9:13 AM, Steven Rostedt wrote:
> On Mon, 4 Apr 2016 21:52:46 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
>> Hi Steven, Peter,
>>
>> last time we discussed bpf+tracepoints it was a year ago [1] and the reason
>> we didn't proceed with that approach was that bpf would make arguments
>> arg1, arg2 to trace_xx(arg1, arg2) call to be exposed to bpf program
>> and that was considered unnecessary extension of abi. Back then I wanted
>> to avoid the cost of buffer alloc and field assign part in all
>> of the tracepoints, but looks like when optimized the cost is acceptable.
>> So this new approach doesn't expose any new abi to bpf program.
>> The program is looking at tracepoint fields after they were copied
>> by perf_trace_xx() and described in /sys/kernel/debug/tracing/events/xxx/format
>
> Does this mean that ftrace could use this ability as well? As all the
> current filtering of ftrace was done after it was copied to the buffer,
> and that was what you wanted to avoid.

yeah, it could be added to ftrace as well, but it won't be as effective
as perf_trace, since the cost of trace_event_buffer_reserve() in
trace_event_raw_event_() handler is significantly higher than
perf_trace_buf_alloc() in perf_trace_().
From the program's point of view it wouldn't care how that memory
is allocated, so the user tools will just use the perf_trace_() style.
The only use case I see for bpf in ftrace's tracepoint handler
is to work as an actual filter, but we already have filters there...
so the value of adding bpf to ftrace's tracepoint handler is not
clear to me.
At the same time there is the whole of ftrace as a function tracer. That is
a very lucrative field for bpf to plug into ;)

As for the 2nd part of your question about copying: yeah, it adds to
the cost, so a kprobe is sometimes faster than a perf_trace tracepoint
that copies a lot of args which are not going to be used.

2016-04-18 20:29:11

by Steven Rostedt

Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On Mon, 4 Apr 2016 21:52:48 -0700
Alexei Starovoitov <[email protected]> wrote:

> introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
> attached to tracepoints.
> The tracepoint will copy the arguments in the per-cpu buffer and pass
> it to the bpf program as its first argument.
> The layout of the fields can be discovered by doing
> 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
> prior to the compilation of the program with exception that first 8 bytes
> are reserved and not accessible to the program. This area is used to store
> the pointer to 'struct pt_regs' which some of the bpf helpers will use:
> +---------+
> | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
> +---------+
> | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
> +---------+
> | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
> +---------+
>
> Note that all of the fields are already dumped to user space via perf ring buffer
> and some application access it directly without consulting tracepoint/format.
> Same rule applies here: static tracepoint fields should only be accessed
> in a format defined in tracepoint/format. The order of fields and
> field sizes are not an ABI.
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> include/trace/perf.h | 18 ++++++++++++++----
> include/uapi/linux/bpf.h | 1 +
> kernel/events/core.c | 13 +++++++++----
> kernel/trace/trace_event_perf.c | 3 +++
> 4 files changed, 27 insertions(+), 8 deletions(-)
>
> diff --git a/include/trace/perf.h b/include/trace/perf.h
> index 26486fcd74ce..55feb69c873f 100644
> --- a/include/trace/perf.h
> +++ b/include/trace/perf.h
> @@ -37,18 +37,19 @@ perf_trace_##call(void *__data, proto) \
> struct trace_event_call *event_call = __data; \
> struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
> struct trace_event_raw_##call *entry; \
> + struct bpf_prog *prog = event_call->prog; \
> struct pt_regs *__regs; \
> u64 __addr = 0, __count = 1; \
> struct task_struct *__task = NULL; \
> struct hlist_head *head; \
> int __entry_size; \
> int __data_size; \
> - int rctx; \
> + int rctx, event_type; \
> \
> __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
> \
> head = this_cpu_ptr(event_call->perf_events); \
> - if (__builtin_constant_p(!__task) && !__task && \
> + if (!prog && __builtin_constant_p(!__task) && !__task && \
> hlist_empty(head)) \
> return; \
> \
> @@ -56,8 +57,9 @@ perf_trace_##call(void *__data, proto) \
> sizeof(u64)); \
> __entry_size -= sizeof(u32); \
> \
> - entry = perf_trace_buf_prepare(__entry_size, \
> - event_call->event.type, &__regs, &rctx); \
> + event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \

Can you move this into perf_trace_entry_prepare?

> + entry = perf_trace_buf_prepare(__entry_size, event_type, \
> + &__regs, &rctx); \
> if (!entry) \
> return; \
> \
> @@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto) \
> \
> { assign; } \
> \
> + if (prog) { \
> + *(struct pt_regs **)entry = __regs; \
> + if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
> + perf_swevent_put_recursion_context(rctx); \
> + return; \
> + } \
> + memset(&entry->ent, 0, sizeof(entry->ent)); \
> + } \

And perhaps this into perf_trace_buf_submit()?

Tracepoints are a major cause of bloat, and the reason for these
prepare and submit functions is to move code out of the macros. Every
tracepoint in the kernel (1000 and counting) will include this code.
I've already had complaints that each tracepoint can add up to 5k to
the core.

-- Steve


> perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \
> __count, __regs, head, __task); \
> }
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 23917bb47bf3..70eda5aeb304 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -92,6 +92,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_KPROBE,
> BPF_PROG_TYPE_SCHED_CLS,
> BPF_PROG_TYPE_SCHED_ACT,
> + BPF_PROG_TYPE_TRACEPOINT,
> };
>
> #define BPF_PSEUDO_MAP_FD 1
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index de24fbce5277..58fc9a7d1562 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6725,12 +6725,13 @@ int perf_swevent_get_recursion_context(void)
> }
> EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);
>
> -inline void perf_swevent_put_recursion_context(int rctx)
> +void perf_swevent_put_recursion_context(int rctx)
> {
> struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
>
> put_recursion_context(swhash->recursion, rctx);
> }
> +EXPORT_SYMBOL_GPL(perf_swevent_put_recursion_context);
>
> void ___perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
> {
> @@ -7104,6 +7105,7 @@ static void perf_event_free_filter(struct perf_event *event)
>
> static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
> {
> + bool is_kprobe, is_tracepoint;
> struct bpf_prog *prog;
>
> if (event->attr.type != PERF_TYPE_TRACEPOINT)
> @@ -7112,15 +7114,18 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
> if (event->tp_event->prog)
> return -EEXIST;
>
> - if (!(event->tp_event->flags & TRACE_EVENT_FL_UKPROBE))
> - /* bpf programs can only be attached to u/kprobes */
> + is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
> + is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
> + if (!is_kprobe && !is_tracepoint)
> + /* bpf programs can only be attached to u/kprobe or tracepoint */
> return -EINVAL;
>
> prog = bpf_prog_get(prog_fd);
> if (IS_ERR(prog))
> return PTR_ERR(prog);
>
> - if (prog->type != BPF_PROG_TYPE_KPROBE) {
> + if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
> + (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
> /* valid fd, but invalid bpf program type */
> bpf_prog_put(prog);
> return -EINVAL;
> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index 7a68afca8249..7ada829029d3 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -284,6 +284,9 @@ void *perf_trace_buf_prepare(int size, unsigned short type,
> *regs = this_cpu_ptr(&__perf_regs[*rctxp]);
> raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
>
> + if (type == TRACE_EVENT_TYPE_MAX)
> + return raw_data;
> +
> /* zero the dead bytes from align to not leak stack to user */
> memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64));
>
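
For context, the perf_event_set_bpf_prog() hunk above is reached from user
space through the existing PERF_EVENT_IOC_SET_BPF ioctl. A minimal sketch of
that attach path is below; it is illustrative and not part of the patch set,
assumes a BPF_PROG_TYPE_TRACEPOINT program has already been loaded with the
bpf() syscall (bpf_prog_fd), and uses sched:sched_switch as the example event:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* attach an already-loaded BPF_PROG_TYPE_TRACEPOINT program to
 * sched:sched_switch on cpu 0; returns the perf event fd or -1 */
static int attach_tp_prog(int bpf_prog_fd)
{
	struct perf_event_attr attr;
	char buf[32];
	FILE *f;
	int fd;

	/* numeric tracepoint id, exported next to the 'format' file */
	f = fopen("/sys/kernel/debug/tracing/events/sched/sched_switch/id", "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.size = sizeof(attr);
	attr.config = atoi(buf);
	attr.sample_period = 1;

	fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
		     -1 /* group_fd */, 0 /* flags */);
	if (fd < 0)
		return -1;

	/* perf_event_set_bpf_prog() rejects the prog fd unless its type
	 * matches the event: tracepoint vs u/kprobe */
	if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd) ||
	    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)) {
		close(fd);
		return -1;
	}
	return fd;
}

The program load step itself is the same bpf(BPF_PROG_LOAD, ...) call used for
kprobe programs today; only the program type differs.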

2016-04-18 20:47:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH net-next 0/8] allow bpf attach to tracepoints

On Mon, 18 Apr 2016 12:51:43 -0700
Alexei Starovoitov <[email protected]> wrote:


> yeah, it could be added to ftrace as well, but it won't be as effective
> as perf_trace, since the cost of trace_event_buffer_reserve() in
> trace_event_raw_event_() handler is significantly higher than
> perf_trace_buf_alloc() in perf_trace_().

Right, because ftrace optimizes the case where all traces make it into
the ring buffer (zero copy). Of course, if a filter is attached, it
would be trivial to add a case to copy first into a per-cpu context
buffer, and only copy into the ring buffer if the filter succeeds. Hmm,
that may actually be something I want to do with the current
filter system regardless.

> Then, from the program's point of view, it doesn't matter how that memory
> is allocated, so the user tools will just use the perf_trace_() style.
> The only use case I see for bpf with ftrace's tracepoint handler
> is to work as an actual filter, but we already have filters there...

But wasn't it shown that eBPF could speed up the current filters? If we
hook into eBPF then we could replace what we have with a faster filter.

> so the actual value of adding bpf to ftrace's tracepoint handler is not
> clear to me.

Well, you wouldn't because you don't use ftrace ;-) But I'm sure others
might.

> At the same time there is the whole of ftrace as a function tracer. That is
> a very lucrative field for bpf to plug into ;)

Which could get this as a side-effect of this change.

-- Steve
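
Purely to illustrate the filter-first idea above (none of this exists today):
the filter-attached case could stage the record in a per-cpu scratch buffer,
run the predicates on that copy, and only then reserve space in the ring
buffer. Every identifier here other than this_cpu_ptr() and
filter_match_preds() is a made-up placeholder:

/* hypothetical filter-first path inside a trace_event_raw_event_*() handler */
entry = this_cpu_ptr(staging_buf);	/* per-cpu scratch copy, nothing reserved yet */
fill_fields(entry, args);		/* the work the { assign; } block does today */
if (!filter_match_preds(trace_file->filter, entry))
	return;				/* rejected: the ring buffer is never touched */
commit_copy(trace_file, entry, entry_size);	/* reserve + memcpy + commit */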

2016-04-18 21:26:16

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next 0/8] allow bpf attach to tracepoints

On 4/18/16 1:47 PM, Steven Rostedt wrote:
> On Mon, 18 Apr 2016 12:51:43 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
>
>> yeah, it could be added to ftrace as well, but it won't be as effective
>> as perf_trace, since the cost of trace_event_buffer_reserve() in
>> trace_event_raw_event_() handler is significantly higher than
>> perf_trace_buf_alloc() in perf_trace_().
>
> Right, because ftrace optimizes the case where all traces make it into
> the ring buffer (zero copy). Of course, if a filter is attached, it
> would be trivial to add a case to copy first into a per-cpu context
> buffer, and only copy into the ring buffer if the filter succeeds. Hmm,
> that may actually be something I want to do with the current
> filter system regardless.
>
>> Then, from the program's point of view, it doesn't matter how that memory
>> is allocated, so the user tools will just use the perf_trace_() style.
>> The only use case I see for bpf with ftrace's tracepoint handler
>> is to work as an actual filter, but we already have filters there...
>
> But wasn't it shown that eBPF could speed up the current filters? If we
> hook into eBPF then we could replace what we have with a faster filter.

Yes. Long ago I had a patch to accelerate filter_match_preds(),
but then abandoned it, since, as you said, ftrace primarily targets
streaming these events to user space and faster filtering probably
won't be very noticeable to the end user.
So far all the tools around bpf have been capitalizing on
the ability to aggregate data in the kernel.

>> At the same time there is the whole of ftrace as a function tracer. That is
>> a very lucrative field for bpf to plug into ;)
>
> Which could get this as a side-effect of this change.

Not really as a side-effect. That would be a new thing to design.
I only have a few rough ideas on how to approach that.

2016-04-18 21:44:20

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On 4/18/16 1:29 PM, Steven Rostedt wrote:
> On Mon, 4 Apr 2016 21:52:48 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
>> introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
>> attached to tracepoints.
>> The tracepoint will copy the arguments in the per-cpu buffer and pass
>> it to the bpf program as its first argument.
>> The layout of the fields can be discovered by doing
>> 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
>> prior to the compilation of the program with exception that first 8 bytes
>> are reserved and not accessible to the program. This area is used to store
>> the pointer to 'struct pt_regs' which some of the bpf helpers will use:
>> +---------+
>> | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
>> +---------+
>> | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
>> +---------+
>> | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
>> +---------+
>>
>> Note that all of the fields are already dumped to user space via the perf ring buffer
>> and some applications access them directly without consulting tracepoint/format.
>> The same rule applies here: static tracepoint fields should only be accessed
>> in the format defined in tracepoint/format. The order of fields and
>> the field sizes are not an ABI.
>>
>> Signed-off-by: Alexei Starovoitov <[email protected]>
>> ---
...
>> - entry = perf_trace_buf_prepare(__entry_size, \
>> - event_call->event.type, &__regs, &rctx); \
>> + event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \
>
> Can you move this into perf_trace_entry_prepare?

That's the old version.
The latest one is in commits 1e1dcd93b46 and 98b5c2c65c295 in net-next.

>> + if (prog) { \
>> + *(struct pt_regs **)entry = __regs; \
>> + if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
>> + perf_swevent_put_recursion_context(rctx); \
>> + return; \
>> + } \
>> + memset(&entry->ent, 0, sizeof(entry->ent)); \
>> + } \
>
> And perhaps this into perf_trace_buf_submit()?
>
> Tracepoints are a major cause of bloat, and the reason for these
> prepare and submit functions is to move code out of the macros. Every
> tracepoint in the kernel (1000 and counting) will include this code.
> I've already had complaints that each tracepoint can add up to 5k to
> the core.

I was worried about this too, but a single 'if' and two calls
(as in commit 98b5c2c65c295) is a better way, since it's faster, cleaner
and doesn't require refactoring the whole perf_trace_buf_submit() to pass
an extra event_call argument to it.
perf_trace_buf_submit() is already ugly with 8 arguments!
Passing more args or creating a struct to pass args is only going to
hurt performance without much reduction in .text size.
Tinyfication folks will disable tracepoints anyway.
Note that the most common case is bpf returning 0 and not even
calling perf_trace_buf_submit(), which is already slow due
to so many args being passed on the stack.
This stuff is called millions of times a second, so every instruction
counts.

2016-04-18 22:16:36

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On Mon, 18 Apr 2016 14:43:07 -0700
Alexei Starovoitov <[email protected]> wrote:


> I was worried about this too, but a single 'if' and two calls
> (as in commit 98b5c2c65c295) is a better way, since it's faster, cleaner
> and doesn't require refactoring the whole perf_trace_buf_submit() to pass
> an extra event_call argument to it.
> perf_trace_buf_submit() is already ugly with 8 arguments!

Right, but I solved that in ftrace by creating an on-stack descriptor
that can be passed by a single parameter. See the "fbuffer" in the
trace_event_raw_event* code.


> Passing more args or creating a struct to pass args is only going to
> hurt performance without much reduction in .text size.
> Tinyfication folks will disable tracepoints anyway.
> Note that the most common case is bpf returning 0 and not even
> calling perf_trace_buf_submit(), which is already slow due
> to so many args being passed on the stack.
> This stuff is called millions of times a second, so every instruction
> counts.

Note, that doesn't matter if you are bloating the kernel for the 99.9%
of those that don't use bpf.

Please remember this! Us tracing folks are second class citizens! If
there's a way to speed up tracing by 10%, but in doing so we cause
mainline to be hurt by over 1%, we shouldn't be doing it. Tracing and
hooks on tracepoints are really not used by many people. Don't fall
into Linus's category of "my code is the most important". That's
especially true for tracing.

These macros cause bloat. There have been many complaints about it.
There's a way around it that isn't that bad (with the descriptor), and we
should be using it.

-- Steve
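
For those who haven't read that code, a stripped-down sketch of the on-stack
descriptor pattern Steve is referring to is below. 'sample' is a made-up
event name and the signatures are recalled from the trace_events code of that
era, so treat the details as illustrative rather than authoritative:

/* made-up one-field event, standing in for a real TRACE_EVENT() */
struct trace_event_raw_sample {
	struct trace_entry ent;
	int arg;
};

static notrace void trace_event_raw_event_sample(void *__data, int arg)
{
	struct trace_event_file *trace_file = __data;
	struct trace_event_buffer fbuffer;	/* the single on-stack descriptor */
	struct trace_event_raw_sample *entry;

	if (trace_trigger_soft_disabled(trace_file))
		return;

	entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry));
	if (!entry)
		return;

	entry->arg = arg;			/* the { assign; } work */

	/* everything the commit path needs rides in &fbuffer,
	 * not in a long argument list */
	trace_event_buffer_commit(&fbuffer);
}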

2016-04-19 01:16:19

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On 4/18/16 3:16 PM, Steven Rostedt wrote:
> On Mon, 18 Apr 2016 14:43:07 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
>
>> I was worried about this too, but a single 'if' and two calls
>> (as in commit 98b5c2c65c295) is a better way, since it's faster, cleaner
>> and doesn't require refactoring the whole perf_trace_buf_submit() to pass
>> an extra event_call argument to it.
>> perf_trace_buf_submit() is already ugly with 8 arguments!
>
> Right, but I solved that in ftrace by creating an on-stack descriptor
> that can be passed by a single parameter. See the "fbuffer" in the
> trace_event_raw_event* code.

Yes. That's what I referred to below as 'a struct to pass args'...
But, fine, I will try to optimize the size further.
Frankly, much bigger .text savings will come from combining
trace_event_raw_event_*() with perf_trace_*(),
especially if you're ok with copying tp args into perf's percpu
buffer first and then copying them into ftrace's ring buffer.
Then we can halve the number of such auto-generated functions.

>> Passing more args or creating a struct to pass args is only going to
>> hurt performance without much reduction in .text size.
>> Tinyfication folks will disable tracepoints anyway.
>> Note that the most common case is bpf returning 0 and not even
>> calling perf_trace_buf_submit(), which is already slow due
>> to so many args being passed on the stack.
>> This stuff is called millions of times a second, so every instruction
>> counts.
>
> Note, that doesn't matter if you are bloating the kernel for the 99.9%
> of those that don't use bpf.
>
> Please remember this! Us tracing folks are second class citizens! If
> there's a way to speed up tracing by 10%, but in doing so we cause
> mainline to be hurt by over 1%, we shouldn't be doing it. Tracing and
> hooks on tracepoints are really not used by many people. Don't fall
> into Linus's category of "my code is the most important". That's
> especially true for tracing.

Tracing was indeed not used that often in the past, but
bpf+tracing has completely changed the picture. It's no longer just
debugging. It is a first-class citizen that runs 24/7 in production,
and its performance and low overhead are crucial.
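
To make the size optimization Alexei says he will try above more concrete:
pulling the prog check out of the macro into one shared helper could look
roughly like the sketch below. The helper name perf_trace_run_bpf_submit()
and the exact perf_trace_buf_submit() signature are assumptions for
illustration, not something quoted in this thread:

/* one out-of-line copy of the bpf hook instead of one copy per tracepoint */
void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
			       struct trace_event_call *call, u64 count,
			       struct pt_regs *regs, struct hlist_head *head,
			       struct task_struct *task)
{
	struct bpf_prog *prog = call->prog;

	if (prog) {
		/* first 8 bytes of the buffer carry the hidden pt_regs pointer */
		*(struct pt_regs **)raw_data = regs;
		if (!trace_call_bpf(prog, raw_data) || hlist_empty(head)) {
			perf_swevent_put_recursion_context(rctx);
			return;
		}
	}
	perf_trace_buf_submit(raw_data, size, rctx, call->event.type,
			      count, regs, head, task);
}

The generated perf_trace_##call() body would then shrink to a single call:

	perf_trace_run_bpf_submit(entry, __entry_size, rctx, event_call,
				  __count, __regs, head, __task);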

2016-04-19 02:58:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints

On Mon, 18 Apr 2016 18:15:04 -0700
Alexei Starovoitov <[email protected]> wrote:

> On 4/18/16 3:16 PM, Steven Rostedt wrote:

> Yes. That's what I referred to below as 'a struct to pass args'...
> But, fine, I will try to optimize the size further.
> Frankly, much bigger .text savings will come from combining
> trace_event_raw_event_*() with perf_trace_*(),
> especially if you're ok with copying tp args into perf's percpu
> buffer first and then copying them into ftrace's ring buffer.
> Then we can halve the number of such auto-generated functions.

I'm only fine with that when we filter. Otherwise we just lose all the
benefits of zero copy in the first place.

>
> >> Passing more args or creating a struct to pass args is only going to
> >> hurt performance without much reduction in .text size.
> >> Tinyfication folks will disable tracepoints anyway.
> >> Note that the most common case is bpf returning 0 and not even
> >> calling perf_trace_buf_submit(), which is already slow due
> >> to so many args being passed on the stack.
> >> This stuff is called millions of times a second, so every instruction
> >> counts.
> >
> > Note, that doesn't matter if you are bloating the kernel for the 99.9%
> > of those that don't use bpf.
> >
> > Please remember this! Us tracing folks are second class citizens! If
> > there's a way to speed up tracing by 10%, but in doing so we cause
> > mainline to be hurt by over 1%, we shouldn't be doing it. Tracing and
> > hooks on tracepoints are really not used by many people. Don't fall
> > into Linus's category of "my code is the most important". That's
> > especially true for tracing.
>
> Tracing was indeed not used that often in the past, but
> bpf+tracing has completely changed the picture. It's no longer just
> debugging. It is a first-class citizen that runs 24/7 in production,
> and its performance and low overhead are crucial.

Still, 99.9% of users don't use it.

-- Steve