Hi Steven,
This patch set is for linux-trace/for-next
It adds the ability to attach eBPF programs to tracepoints, syscalls and kprobes.
Obviously too late for 3.20, but please review. I'll rebase and repost when
the merge window closes.
The main difference in V3 is the attach mechanism:
- load program via bpf() syscall and receive prog_fd
- event_fd = perf_event_open()
- ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) to attach program to event
- close(event_fd) will destroy event and detach the program
The kernel diff became smaller and this approach is cleaner in general
(thanks to Masami and Namhyung for suggesting it).
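Roughly, the sequence looks like this in C (a sketch based on the dropmon/tracex
samples in this series, using their bpf_prog_load()/perf_event_open() wrappers;
error handling omitted):

struct perf_event_attr attr = {
	.type = PERF_TYPE_TRACEPOINT,
	.config = event_id,	/* tracepoint id read from tracefs */
};
int prog_fd, event_fd;

/* load the eBPF program via the bpf() syscall */
prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACEPOINT, prog, sizeof(prog), "GPL");

/* open the event and attach the program to it */
event_fd = perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
...
close(event_fd);	/* destroys the event and detaches the program */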
The programs are run before the ring buffer is allocated to have minimal
impact on the system, which can be demonstrated by the
'dd if=/dev/zero of=/dev/null count=20000000' test:
4.80074 s, 2.1 GB/s - no tracing (raw base line)
5.62705 s, 1.8 GB/s - attached bpf program does 'map[log2(count)]++' without JIT
5.05963 s, 2.0 GB/s - attached bpf program does 'map[log2(count)]++' with JIT
4.91715 s, 2.1 GB/s - attached bpf program does 'return 0'
perf record -e skb:sys_write dd if=/dev/zero of=/dev/null count=20000000
8.75686 s, 1.2 GB/s
Warning: Processed 20355236 events and lost 44 chunks!
perf record -e skb:sys_write --filter cnt==1234 dd if=/dev/zero of=/dev/null count=20000000
5.69732 s, 1.8 GB/s
6.13730 s, 1.7 GB/s - echo 1 > /sys/../events/skb/sys_write/enable
6.50091 s, 1.6 GB/s - echo 'cnt == 1234' > /sys/../events/skb/sys_write/filter
(skb:sys_write is a temporary tracepoint in the write() syscall)
So the overhead of a realistic bpf program is 5.05963/4.80074 = ~5%,
which is lower than perf_event filtering: 5.69732/4.80074 = ~18%
or ftrace filtering: 6.50091/4.80074 = ~35%
V2->V3:
- changed program attach interface from tracefs into perf_event ioctl
- rewrote user space helpers to use perf_events
- rewrote tracex1 example to use mmap-ed ring_buffer instead of trace_pipe
- renamed bpf_memcmp to bpf_probe_memcmp, as suggested by Arnaldo, to better
indicate the function's logic
- added ifdefs to make bpf check a nop when CONFIG_BPF_SYSCALL is not set
V1->V2:
- dropped bpf_dump_stack() and bpf_printk() helpers
- disabled running programs when in_nmi()
- other minor cleanups
Program attach point and input arguments:
- programs attached to kprobes receive 'struct pt_regs *' as input.
See tracex4_kern.c, which demonstrates how users can write a C program like:
SEC("events/kprobes/sys_write")
int bpf_prog4(struct pt_regs *regs)
{
long write_size = regs->dx;
// here user need to know the proto of sys_write() from kernel
// sources and x64 calling convention to know that register $rdx
// contains 3rd argument to sys_write() which is 'size_t count'
It's obviously architecture dependent, but it allows building sophisticated
user tools on top that can see from the debug info of vmlinux which variables
are in which registers or stack locations and fetch them from there.
'perf probe' can potentially use this hook to generate programs in user space
and insert them instead of letting the kernel parse a string during kprobe creation.
- programs attached to tracepoints and syscalls receive 'struct bpf_context *':
u64 arg1, arg2, ..., arg6;
For syscalls they match the syscall arguments.
For tracepoints these args match the arguments passed to the tracepoint.
For example:
trace_sched_migrate_task(p, new_cpu); from sched/core.c
arg1 <- p which is 'struct task_struct *'
arg2 <- new_cpu which is 'unsigned int'
arg3..arg6 = 0
The program can use the bpf_fetch_u8/16/32/64/ptr() helpers to walk 'task_struct'
or any other kernel data structure.
These helpers use probe_kernel_read(), similar to 'perf probe', which is
not 100% safe in either case, but good enough.
To access task_struct's pid inside the 'sched_migrate_task' tracepoint
the program can do:
struct task_struct *task = (struct task_struct *)ctx->arg1;
u32 pid = bpf_fetch_u32(&task->pid);
Since the struct layout is kernel-configuration specific, such programs are not
portable and require access to kernel headers to be compiled,
but in this case we don't need debug info.
llvm with the bpf backend will statically compute the task->pid offset as a
constant based on kernel headers alone.
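Putting these pieces together, a complete tracepoint program for the
sched_migrate_task example could look roughly like this (a sketch only; the
section name and the pid value 1234 are illustrative; returning 1 passes the
event to user space, 0 discards it, see 'Program return value' below):

SEC("events/sched/sched_migrate_task")
int bpf_prog(struct bpf_context *ctx)
{
	/* arg1 is 'struct task_struct *p', arg2 is 'unsigned int new_cpu' */
	struct task_struct *task = (struct task_struct *) ctx->arg1;
	u32 pid = bpf_fetch_u32(&task->pid);

	/* pass the event to user space only for the task we care about */
	if (pid == 1234)
		return 1;
	return 0;
}
char _license[] SEC("license") = "GPL";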
The example of this arbitrary pointer walking is tracex1_kern.c
which does skb->dev->name == "lo" filtering.
In all cases the programs are called before the ring buffer is allocated to
minimize the overhead: we want to filter a huge number of events, and
perf_trace_buf_prepare/submit plus the argument copy for every event is too costly.
Note, tracepoint/syscall and kprobe programs are two different types:
BPF_PROG_TYPE_TRACEPOINT and BPF_PROG_TYPE_KPROBE,
since they expect different input.
Both use the same set of helper functions:
- map access (lookup/update/delete)
- fetch (probe_kernel_read wrappers)
- probe_memcmp (probe_kernel_read + memcmp)
Portability:
- kprobe programs are architecture dependent and need a user-space scripting
language like ktap/stap/dtrace/perf that will dynamically generate
them based on the debug info in vmlinux
- tracepoint programs are architecture independent, but if arbitrary pointer
walking (with the fetch() helpers) is used, they need the data struct layout to match.
Debug info is not necessary
- for the networking use case we need to access 'struct sk_buff' fields in a portable
way (user space needs to fetch the packet length without knowing the layout of sk_buff),
so for some frequently used data structures there will be a way to access them
efficiently without the bpf_fetch* helpers. Once it's ready, tracepoint programs
that access common data structs will be kernel independent.
Program return value:
- programs return 0 to discard an event
- and return non-zero to proceed with the event (get the ring buffer, copy
arguments there and pass them to user space via the mmap-ed area)
Examples:
- dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar
to the dropmon tool
- tracex1_kern.c - does net/netif_receive_skb event filtering
for the skb->dev->name == "lo" condition
tracex1_user.c - receives PERF_SAMPLE_RAW events into an mmap-ed buffer and
prints them
- tracex2_kern.c - the same kfree_skb() accounting as dropmon, but now in C;
it also computes a histogram of all write sizes from the sys_write syscall
and prints the histogram in user space
- tracex3_kern.c - the most sophisticated example: it computes IO latency
between block/block_rq_issue and block/block_rq_complete events
and prints a 'heatmap' using gray shades of the text terminal.
Useful for analyzing disk performance.
- tracex4_kern.c - computes a histogram of write sizes from the sys_write syscall
using the kprobe mechanism instead of the syscall tracepoint. Since the kprobe is
optimized into an ftrace-based probe, the instrumentation overhead is smaller than
in example 2.
User space tools like ktap/dtrace/systemtap/perf that have access
to debug info would probably want to use the kprobe attachment point, since a kprobe
can be inserted anywhere and all registers are available to the program.
Tracepoint attachments are useful without debug info, so standalone tools
like iosnoop will use them.
The main difference vs the existing perf probe/ftrace infrastructure is in-kernel
aggregation and conditional walking of arbitrary data structures.
Thanks!
Alexei Starovoitov (8):
tracing: attach eBPF programs to tracepoints and syscalls
tracing: allow eBPF programs to call ktime_get_ns()
samples: bpf: simple tracing example in eBPF assembler
samples: bpf: simple tracing example in C
samples: bpf: counting example for kfree_skb tracepoint and write
syscall
samples: bpf: IO latency analysis (iosnoop/heatmap)
tracing: attach eBPF programs to kprobe/kretprobe
samples: bpf: simple kprobe example
include/linux/bpf.h | 6 +-
include/linux/ftrace_event.h | 14 +++
include/trace/bpf_trace.h | 25 +++++
include/trace/ftrace.h | 31 +++++++
include/uapi/linux/bpf.h | 9 ++
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 58 ++++++++++++
kernel/trace/Makefile | 1 +
kernel/trace/bpf_trace.c | 194 +++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_kprobe.c | 10 +-
kernel/trace/trace_syscalls.c | 35 +++++++
samples/bpf/Makefile | 18 ++++
samples/bpf/bpf_helpers.h | 14 +++
samples/bpf/bpf_load.c | 136 +++++++++++++++++++++++++--
samples/bpf/bpf_load.h | 12 +++
samples/bpf/dropmon.c | 143 +++++++++++++++++++++++++++++
samples/bpf/libbpf.c | 7 ++
samples/bpf/libbpf.h | 4 +
samples/bpf/tracex1_kern.c | 28 ++++++
samples/bpf/tracex1_user.c | 50 ++++++++++
samples/bpf/tracex2_kern.c | 71 ++++++++++++++
samples/bpf/tracex2_user.c | 95 +++++++++++++++++++
samples/bpf/tracex3_kern.c | 98 ++++++++++++++++++++
samples/bpf/tracex3_user.c | 152 ++++++++++++++++++++++++++++++
samples/bpf/tracex4_kern.c | 36 ++++++++
samples/bpf/tracex4_user.c | 83 +++++++++++++++++
26 files changed, 1321 insertions(+), 10 deletions(-)
create mode 100644 include/trace/bpf_trace.h
create mode 100644 kernel/trace/bpf_trace.c
create mode 100644 samples/bpf/dropmon.c
create mode 100644 samples/bpf/tracex1_kern.c
create mode 100644 samples/bpf/tracex1_user.c
create mode 100644 samples/bpf/tracex2_kern.c
create mode 100644 samples/bpf/tracex2_user.c
create mode 100644 samples/bpf/tracex3_kern.c
create mode 100644 samples/bpf/tracex3_user.c
create mode 100644 samples/bpf/tracex4_kern.c
create mode 100644 samples/bpf/tracex4_user.c
--
1.7.9.5
User interface:
struct perf_event_attr attr = {.type = PERF_TYPE_TRACEPOINT, .config = event_id, ...};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
prog_fd is a file descriptor associated with a previously loaded eBPF program.
event_id is the ID of a static tracepoint event or syscall.
(kprobe support is in the next patch)
close(event_fd) automatically detaches the eBPF program from the event
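For a static tracepoint the event_id can be read from tracefs, for example
(a sketch following the dropmon sample added later in this series; error
handling omitted):

char buf[32];
int fd = open("/sys/kernel/debug/tracing/events/skb/kfree_skb/id", O_RDONLY);
int n = read(fd, buf, sizeof(buf) - 1);

close(fd);
buf[n] = 0;
attr.config = atoi(buf);	/* event_id for perf_event_open() */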
eBPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- fetch_ptr/u64/u32/u16/u8 values from an unsafe address via probe_kernel_read(),
so that the eBPF program can walk any kernel data structure
- probe_memcmp - combination of probe_kernel_read() and memcmp()
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/bpf.h | 6 +-
include/linux/ftrace_event.h | 11 +++
include/trace/bpf_trace.h | 25 +++++++
include/trace/ftrace.h | 31 +++++++++
include/uapi/linux/bpf.h | 7 ++
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 55 +++++++++++++++
kernel/trace/Makefile | 1 +
kernel/trace/bpf_trace.c | 145 +++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_syscalls.c | 35 ++++++++++
10 files changed, 316 insertions(+), 1 deletion(-)
create mode 100644 include/trace/bpf_trace.h
create mode 100644 kernel/trace/bpf_trace.c
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bbfceb756452..a0f6f636ced0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -130,10 +130,14 @@ struct bpf_prog_aux {
#ifdef CONFIG_BPF_SYSCALL
void bpf_prog_put(struct bpf_prog *prog);
+struct bpf_prog *bpf_prog_get(u32 ufd);
#else
static inline void bpf_prog_put(struct bpf_prog *prog) {}
+static inline struct bpf_prog *bpf_prog_get(u32 ufd)
+{
+ return ERR_PTR(-ENOENT);
+}
#endif
-struct bpf_prog *bpf_prog_get(u32 ufd);
/* verify correctness of eBPF program */
int bpf_check(struct bpf_prog *fp, union bpf_attr *attr);
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 0bebb5c348b8..479d0a4a42b3 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -13,6 +13,7 @@ struct trace_array;
struct trace_buffer;
struct tracer;
struct dentry;
+struct bpf_prog;
struct trace_print_flags {
unsigned long mask;
@@ -299,6 +300,7 @@ struct ftrace_event_call {
#ifdef CONFIG_PERF_EVENTS
int perf_refcount;
struct hlist_head __percpu *perf_events;
+ struct bpf_prog *prog;
int (*perf_perm)(struct ftrace_event_call *,
struct perf_event *);
@@ -544,6 +546,15 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
event_triggers_post_call(file, tt);
}
+#ifdef CONFIG_BPF_SYSCALL
+unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx);
+#else
+static inline unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
+{
+ return 1;
+}
+#endif
+
enum {
FILTER_OTHER = 0,
FILTER_STATIC_STRING,
diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
new file mode 100644
index 000000000000..4e64f61f484d
--- /dev/null
+++ b/include/trace/bpf_trace.h
@@ -0,0 +1,25 @@
+/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_KERNEL_BPF_TRACE_H
+#define _LINUX_KERNEL_BPF_TRACE_H
+
+/* For tracepoint filters argN fields match one to one to arguments
+ * passed to tracepoint events
+ *
+ * For syscall entry filters argN fields match syscall arguments
+ * For syscall exit filters arg1 is a return value
+ */
+struct bpf_context {
+ u64 arg1;
+ u64 arg2;
+ u64 arg3;
+ u64 arg4;
+ u64 arg5;
+ u64 arg6;
+};
+
+#endif /* _LINUX_KERNEL_BPF_TRACE_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 139b5067345b..4c275ce2dcf0 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -17,6 +17,7 @@
*/
#include <linux/ftrace_event.h>
+#include <trace/bpf_trace.h>
/*
* DECLARE_EVENT_CLASS can be used to add a generic function
@@ -755,12 +756,32 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
#undef __perf_task
#define __perf_task(t) (__task = (t))
+/* zero extend integer, pointer or aggregate type to u64 without warnings */
+#define __CAST_TO_U64(EXPR) ({ \
+ u64 ret = 0; \
+ typeof(EXPR) expr = EXPR; \
+ switch (sizeof(expr)) { \
+ case 8: ret = *(u64 *) &expr; break; \
+ case 4: ret = *(u32 *) &expr; break; \
+ case 2: ret = *(u16 *) &expr; break; \
+ case 1: ret = *(u8 *) &expr; break; \
+ } \
+ ret; })
+
+#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
+#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
+#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__)
+#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__)
+#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__)
+#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__)
+
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static notrace void \
perf_trace_##call(void *__data, proto) \
{ \
struct ftrace_event_call *event_call = __data; \
+ struct bpf_prog *prog = event_call->prog; \
struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
struct ftrace_raw_##call *entry; \
struct pt_regs __regs; \
@@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto) \
int __data_size; \
int rctx; \
\
+ if (prog) { \
+ __maybe_unused const u64 z = 0; \
+ struct bpf_context __ctx = ((struct bpf_context) { \
+ __BPF_CAST6(args, z, z, z, z, z) \
+ }); \
+ \
+ if (!trace_call_bpf(prog, &__ctx)) \
+ return; \
+ } \
+ \
__data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
\
head = this_cpu_ptr(event_call->perf_events); \
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 45da7ec7d274..d73d7d0abe6e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -118,6 +118,7 @@ enum bpf_map_type {
enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC,
BPF_PROG_TYPE_SOCKET_FILTER,
+ BPF_PROG_TYPE_TRACEPOINT,
};
/* flags for BPF_MAP_UPDATE_ELEM command */
@@ -162,6 +163,12 @@ enum bpf_func_id {
BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
+ BPF_FUNC_fetch_ptr, /* void *bpf_fetch_ptr(void *unsafe_ptr) */
+ BPF_FUNC_fetch_u64, /* u64 bpf_fetch_u64(void *unsafe_ptr) */
+ BPF_FUNC_fetch_u32, /* u32 bpf_fetch_u32(void *unsafe_ptr) */
+ BPF_FUNC_fetch_u16, /* u16 bpf_fetch_u16(void *unsafe_ptr) */
+ BPF_FUNC_fetch_u8, /* u8 bpf_fetch_u8(void *unsafe_ptr) */
+ BPF_FUNC_probe_memcmp, /* int bpf_probe_memcmp(unsafe_ptr, safe_ptr, size) */
__BPF_FUNC_MAX_ID,
};
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9b79abbd1ab8..d7ba67234761 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -360,6 +360,7 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *)
+#define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 882f835a0d85..674a8ca17190 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,8 @@
#include <linux/module.h>
#include <linux/mman.h>
#include <linux/compat.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
#include "internal.h"
@@ -3283,6 +3285,7 @@ errout:
}
static void perf_event_free_filter(struct perf_event *event);
+static void perf_event_free_bpf_prog(struct perf_event *event);
static void free_event_rcu(struct rcu_head *head)
{
@@ -3292,6 +3295,7 @@ static void free_event_rcu(struct rcu_head *head)
if (event->ns)
put_pid_ns(event->ns);
perf_event_free_filter(event);
+ perf_event_free_bpf_prog(event);
kfree(event);
}
@@ -3795,6 +3799,7 @@ static inline int perf_fget_light(int fd, struct fd *p)
static int perf_event_set_output(struct perf_event *event,
struct perf_event *output_event);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
+static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd);
static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
@@ -3849,6 +3854,9 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);
+ case PERF_EVENT_IOC_SET_BPF:
+ return perf_event_set_bpf_prog(event, arg);
+
default:
return -ENOTTY;
}
@@ -6266,6 +6274,45 @@ static void perf_event_free_filter(struct perf_event *event)
ftrace_profile_free_filter(event);
}
+static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
+{
+ struct bpf_prog *prog;
+
+ if (event->attr.type != PERF_TYPE_TRACEPOINT)
+ return -EINVAL;
+
+ if (event->tp_event->prog)
+ return -EEXIST;
+
+ prog = bpf_prog_get(prog_fd);
+ if (IS_ERR(prog))
+ return PTR_ERR(prog);
+
+ if (prog->aux->prog_type != BPF_PROG_TYPE_TRACEPOINT) {
+ /* valid fd, but invalid bpf program type */
+ bpf_prog_put(prog);
+ return -EINVAL;
+ }
+
+ event->tp_event->prog = prog;
+
+ return 0;
+}
+
+static void perf_event_free_bpf_prog(struct perf_event *event)
+{
+ struct bpf_prog *prog;
+
+ if (!event->tp_event)
+ return;
+
+ prog = event->tp_event->prog;
+ if (prog) {
+ event->tp_event->prog = NULL;
+ bpf_prog_put(prog);
+ }
+}
+
#else
static inline void perf_tp_register(void)
@@ -6281,6 +6328,14 @@ static void perf_event_free_filter(struct perf_event *event)
{
}
+static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
+{
+ return -ENOENT;
+}
+
+static void perf_event_free_bpf_prog(struct perf_event *event)
+{
+}
#endif /* CONFIG_EVENT_TRACING */
#ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 979ccde26720..54ae225e5fc6 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
endif
obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_trace.o
obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
obj-$(CONFIG_TRACEPOINTS) += power-traces.o
ifeq ($(CONFIG_PM),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..ec065e0a364e
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,145 @@
+/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include <trace/bpf_trace.h>
+#include "trace.h"
+
+unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
+{
+ unsigned int ret;
+
+ if (in_nmi()) /* not supported yet */
+ return 1;
+
+ rcu_read_lock();
+ ret = BPF_PROG_RUN(prog, ctx);
+ rcu_read_unlock();
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(trace_call_bpf);
+
+static u64 bpf_fetch_ptr(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *unsafe_ptr = (void *) (long) r1;
+ void *ptr = NULL;
+
+ probe_kernel_read(&ptr, unsafe_ptr, sizeof(ptr));
+ return (u64) (unsigned long) ptr;
+}
+
+#define FETCH(SIZE) \
+static u64 bpf_fetch_##SIZE(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) \
+{ \
+ void *unsafe_ptr = (void *) (long) r1; \
+ SIZE val = 0; \
+ \
+ probe_kernel_read(&val, unsafe_ptr, sizeof(val)); \
+ return (u64) (SIZE) val; \
+}
+FETCH(u64)
+FETCH(u32)
+FETCH(u16)
+FETCH(u8)
+#undef FETCH
+
+static u64 bpf_probe_memcmp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *unsafe_ptr = (void *) (long) r1;
+ void *safe_ptr = (void *) (long) r2;
+ u32 size = (u32) r3;
+ char buf[64];
+ int err;
+
+ if (size < 64) {
+ err = probe_kernel_read(buf, unsafe_ptr, size);
+ if (err)
+ return err;
+ return memcmp(buf, safe_ptr, size);
+ }
+ return -1;
+}
+
+static struct bpf_func_proto tp_prog_funcs[] = {
+#define FETCH(SIZE) \
+ [BPF_FUNC_fetch_##SIZE] = { \
+ .func = bpf_fetch_##SIZE, \
+ .gpl_only = true, \
+ .ret_type = RET_INTEGER, \
+ },
+ FETCH(ptr)
+ FETCH(u64)
+ FETCH(u32)
+ FETCH(u16)
+ FETCH(u8)
+#undef FETCH
+ [BPF_FUNC_probe_memcmp] = {
+ .func = bpf_probe_memcmp,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_ANYTHING,
+ .arg2_type = ARG_PTR_TO_STACK,
+ .arg3_type = ARG_CONST_STACK_SIZE,
+ },
+};
+
+static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_map_lookup_elem:
+ return &bpf_map_lookup_elem_proto;
+ case BPF_FUNC_map_update_elem:
+ return &bpf_map_update_elem_proto;
+ case BPF_FUNC_map_delete_elem:
+ return &bpf_map_delete_elem_proto;
+ default:
+ if (func_id < 0 || func_id >= ARRAY_SIZE(tp_prog_funcs))
+ return NULL;
+ return &tp_prog_funcs[func_id];
+ }
+}
+
+/* check access to argN fields of 'struct bpf_context' from program */
+static bool tp_prog_is_valid_access(int off, int size,
+ enum bpf_access_type type)
+{
+ /* check bounds */
+ if (off < 0 || off >= sizeof(struct bpf_context))
+ return false;
+
+ /* only read is allowed */
+ if (type != BPF_READ)
+ return false;
+
+ /* disallow misaligned access */
+ if (off % size != 0)
+ return false;
+
+ return true;
+}
+
+static struct bpf_verifier_ops tp_prog_ops = {
+ .get_func_proto = tp_prog_func_proto,
+ .is_valid_access = tp_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+ .ops = &tp_prog_ops,
+ .type = BPF_PROG_TYPE_TRACEPOINT,
+};
+
+static int __init register_tp_prog_ops(void)
+{
+ bpf_register_prog_type(&tl);
+ return 0;
+}
+late_initcall(register_tp_prog_ops);
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index c6ee36fcbf90..3487c41f4c0e 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -7,6 +7,7 @@
#include <linux/ftrace.h>
#include <linux/perf_event.h>
#include <asm/syscall.h>
+#include <trace/bpf_trace.h>
#include "trace_output.h"
#include "trace.h"
@@ -545,11 +546,26 @@ static DECLARE_BITMAP(enabled_perf_exit_syscalls, NR_syscalls);
static int sys_perf_refcount_enter;
static int sys_perf_refcount_exit;
+static void populate_bpf_ctx(struct bpf_context *ctx, struct pt_regs *regs)
+{
+ struct task_struct *task = current;
+ unsigned long args[6];
+
+ syscall_get_arguments(task, regs, 0, 6, args);
+ ctx->arg1 = args[0];
+ ctx->arg2 = args[1];
+ ctx->arg3 = args[2];
+ ctx->arg4 = args[3];
+ ctx->arg5 = args[4];
+ ctx->arg6 = args[5];
+}
+
static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
{
struct syscall_metadata *sys_data;
struct syscall_trace_enter *rec;
struct hlist_head *head;
+ struct bpf_prog *prog;
int syscall_nr;
int rctx;
int size;
@@ -564,6 +580,15 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
if (!sys_data)
return;
+ prog = sys_data->enter_event->prog;
+ if (prog) {
+ struct bpf_context ctx;
+
+ populate_bpf_ctx(&ctx, regs);
+ if (!trace_call_bpf(prog, &ctx))
+ return;
+ }
+
head = this_cpu_ptr(sys_data->enter_event->perf_events);
if (hlist_empty(head))
return;
@@ -624,6 +649,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
struct syscall_metadata *sys_data;
struct syscall_trace_exit *rec;
struct hlist_head *head;
+ struct bpf_prog *prog;
int syscall_nr;
int rctx;
int size;
@@ -638,6 +664,15 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
if (!sys_data)
return;
+ prog = sys_data->exit_event->prog;
+ if (prog) {
+ struct bpf_context ctx = {};
+
+ ctx.arg1 = syscall_get_return_value(current, regs);
+ if (!trace_call_bpf(prog, &ctx))
+ return;
+ }
+
head = this_cpu_ptr(sys_data->exit_event->perf_events);
if (hlist_empty(head))
return;
--
1.7.9.5
bpf_ktime_get_ns() is used by programs to compute the time delta between events
or as a timestamp
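For example, the latency computation in the tracex3 sample later in this series
follows this pattern; a simplified sketch, where 'my_map' is assumed to be a hash
map keyed by the request pointer 'rq':

/* on block_rq_issue: remember when the request was issued */
u64 ts = bpf_ktime_get_ns();
bpf_map_update_elem(&my_map, &rq, &ts, BPF_ANY);

/* on block_rq_complete: compute the latency in usec */
u64 *issue_ts = bpf_map_lookup_elem(&my_map, &rq);
if (issue_ts) {
	u64 delta = (bpf_ktime_get_ns() - *issue_ts) / 1000;
	/* ... update a latency histogram with 'delta' ... */
}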
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/uapi/linux/bpf.h | 1 +
kernel/trace/bpf_trace.c | 10 ++++++++++
2 files changed, 11 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d73d7d0abe6e..ecae21e58ba3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -169,6 +169,7 @@ enum bpf_func_id {
BPF_FUNC_fetch_u16, /* u16 bpf_fetch_u16(void *unsafe_ptr) */
BPF_FUNC_fetch_u8, /* u8 bpf_fetch_u8(void *unsafe_ptr) */
BPF_FUNC_probe_memcmp, /* int bpf_probe_memcmp(unsafe_ptr, safe_ptr, size) */
+ BPF_FUNC_ktime_get_ns, /* u64 bpf_ktime_get_ns(void) */
__BPF_FUNC_MAX_ID,
};
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ec065e0a364e..e3196266b72f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -69,6 +69,11 @@ static u64 bpf_probe_memcmp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
return -1;
}
+static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ return ktime_get_ns();
+}
+
static struct bpf_func_proto tp_prog_funcs[] = {
#define FETCH(SIZE) \
[BPF_FUNC_fetch_##SIZE] = { \
@@ -90,6 +95,11 @@ static struct bpf_func_proto tp_prog_funcs[] = {
.arg2_type = ARG_PTR_TO_STACK,
.arg3_type = ARG_CONST_STACK_SIZE,
},
+ [BPF_FUNC_ktime_get_ns] = {
+ .func = bpf_ktime_get_ns,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ },
};
static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
--
1.7.9.5
simple packet drop monitor:
- an in-kernel eBPF program attaches to the skb:kfree_skb event and records the
number of packet drops at each location
- user space iterates over the map every second and prints the stats
Usage:
$ sudo dropmon
location 0xffffffff81695995 count 1
location 0xffffffff816d0da9 count 2
location 0xffffffff81695995 count 2
location 0xffffffff816d0da9 count 2
location 0xffffffff81695995 count 3
location 0xffffffff816d0da9 count 2
$ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231
Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 2 +
samples/bpf/dropmon.c | 143 +++++++++++++++++++++++++++++++++++++++++++++++++
samples/bpf/libbpf.c | 7 +++
samples/bpf/libbpf.h | 4 ++
4 files changed, 156 insertions(+)
create mode 100644 samples/bpf/dropmon.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b5b3600dcdf5..789691374562 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -6,7 +6,9 @@ hostprogs-y := test_verifier test_maps
hostprogs-y += sock_example
hostprogs-y += sockex1
hostprogs-y += sockex2
+hostprogs-y += dropmon
+dropmon-objs := dropmon.o libbpf.o
test_verifier-objs := test_verifier.o libbpf.o
test_maps-objs := test_maps.o libbpf.o
sock_example-objs := sock_example.o libbpf.o
diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
new file mode 100644
index 000000000000..515504f68506
--- /dev/null
+++ b/samples/bpf/dropmon.c
@@ -0,0 +1,143 @@
+/* simple packet drop monitor:
+ * - in-kernel eBPF program attaches to kfree_skb() event and records number
+ * of packet drops at given location
+ * - userspace iterates over the map every second and prints stats
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include "libbpf.h"
+
+#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/id"
+
+static int dropmon(void)
+{
+ long long key, next_key, value = 0;
+ int prog_fd, map_fd, i, event_fd, efd, err;
+ char buf[32];
+
+ map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1024);
+ if (map_fd < 0) {
+ printf("failed to create map '%s'\n", strerror(errno));
+ goto cleanup;
+ }
+
+ /* the following eBPF program is equivalent to C:
+ * int filter(struct bpf_context *ctx)
+ * {
+ * long loc = ctx->arg2;
+ * long init_val = 1;
+ * long *value;
+ *
+ * value = bpf_map_lookup_elem(MAP_ID, &loc);
+ * if (value) {
+ * __sync_fetch_and_add(value, 1);
+ * } else {
+ * bpf_map_update_elem(MAP_ID, &loc, &init_val, BPF_ANY);
+ * }
+ * return 0;
+ * }
+ */
+ struct bpf_insn prog[] = {
+ BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
+ BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+ BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+ BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
+ BPF_MOV64_IMM(BPF_REG_4, BPF_ANY),
+ BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
+ BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
+ BPF_EXIT_INSN(),
+ };
+
+ prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACEPOINT, prog,
+ sizeof(prog), "GPL");
+ if (prog_fd < 0) {
+ printf("failed to load prog '%s'\n%s",
+ strerror(errno), bpf_log_buf);
+ return -1;
+ }
+
+
+ event_fd = open(TRACEPOINT, O_RDONLY, 0);
+ if (event_fd < 0) {
+ printf("failed to open event %s\n", TRACEPOINT);
+ return -1;
+ }
+
+ err = read(event_fd, buf, sizeof(buf));
+ if (err < 0 || err >= sizeof(buf)) {
+ printf("read from '%s' failed '%s'\n",
+ TRACEPOINT, strerror(errno));
+ return -1;
+ }
+
+ close(event_fd);
+
+ buf[err] = 0;
+
+ struct perf_event_attr attr = {.type = PERF_TYPE_TRACEPOINT};
+ attr.config = atoi(buf);
+
+ efd = perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+ if (efd < 0) {
+ printf("event %lld fd %d err %s\n",
+ attr.config, efd, strerror(errno));
+ return -1;
+ }
+ ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
+ ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+
+ for (i = 0; i < 10; i++) {
+ key = 0;
+ while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
+ bpf_lookup_elem(map_fd, &next_key, &value);
+ printf("location 0x%llx count %lld\n", next_key, value);
+ key = next_key;
+ }
+ if (key)
+ printf("\n");
+ sleep(1);
+ }
+
+cleanup:
+ /* maps, programs, tracepoint filters will auto cleanup on process exit */
+
+ return 0;
+}
+
+int main(void)
+{
+ FILE *f;
+
+ /* start ping in the background to get some kfree_skb events */
+ f = popen("ping -c5 localhost", "r");
+ (void) f;
+
+ dropmon();
+ return 0;
+}
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 46d50b7ddf79..f4f428149a7d 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -121,3 +121,10 @@ int open_raw_sock(const char *name)
return sock;
}
+
+int perf_event_open(struct perf_event_attr *attr, int pid, int cpu,
+ int group_fd, unsigned long flags)
+{
+ return syscall(__NR_perf_event_open, attr, pid, cpu,
+ group_fd, flags);
+}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 58c5fe1bdba1..92ff824eaed5 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -182,4 +182,8 @@ extern char bpf_log_buf[LOG_BUF_SIZE];
/* create RAW socket and bind to interface 'name' */
int open_raw_sock(const char *name);
+struct perf_event_attr;
+int perf_event_open(struct perf_event_attr *attr, int pid, int cpu,
+ int group_fd, unsigned long flags);
+
#endif
--
1.7.9.5
tracex1_kern.c - a C program which will be compiled into eBPF
to filter netif_receive_skb events on the skb->dev->name == "lo" condition.
The program returns 1 to store an event into the ring buffer
and 0 to discard it.
tracex1_user.c - corresponding user space component that:
- loads bpf program via bpf() syscall
- opens net:netif_receive_skb event via perf_event_open() syscall
- attaches the program to the event via ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
- mmaps event_fd
- polls event_fd, walks ring_buffer and prints events
Usage:
$ sudo tracex1
pid 29241 skb 0xffff88045e58c500 len 84 dev lo
pid 29241 skb 0xffff88045e58cd00 len 84 dev lo
pid 29241 skb 0xffff880074c35000 len 84 dev lo
pid 29241 skb 0xffff880074c35200 len 84 dev lo
Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 4 ++
samples/bpf/bpf_helpers.h | 14 +++++
samples/bpf/bpf_load.c | 129 +++++++++++++++++++++++++++++++++++++++++---
samples/bpf/bpf_load.h | 12 +++++
samples/bpf/tracex1_kern.c | 28 ++++++++++
samples/bpf/tracex1_user.c | 50 +++++++++++++++++
6 files changed, 231 insertions(+), 6 deletions(-)
create mode 100644 samples/bpf/tracex1_kern.c
create mode 100644 samples/bpf/tracex1_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 789691374562..da28e1b6d3a6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -7,6 +7,7 @@ hostprogs-y += sock_example
hostprogs-y += sockex1
hostprogs-y += sockex2
hostprogs-y += dropmon
+hostprogs-y += tracex1
dropmon-objs := dropmon.o libbpf.o
test_verifier-objs := test_verifier.o libbpf.o
@@ -14,17 +15,20 @@ test_maps-objs := test_maps.o libbpf.o
sock_example-objs := sock_example.o libbpf.o
sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
+tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
always += sockex1_kern.o
always += sockex2_kern.o
+always += tracex1_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
HOSTLOADLIBES_sockex1 += -lelf
HOSTLOADLIBES_sockex2 += -lelf
+HOSTLOADLIBES_tracex1 += -lelf
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index ca0333146006..406e9705d99e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -15,6 +15,20 @@ static int (*bpf_map_update_elem)(void *map, void *key, void *value,
(void *) BPF_FUNC_map_update_elem;
static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
+static void *(*bpf_fetch_ptr)(void *unsafe_ptr) =
+ (void *) BPF_FUNC_fetch_ptr;
+static unsigned long long (*bpf_fetch_u64)(void *unsafe_ptr) =
+ (void *) BPF_FUNC_fetch_u64;
+static unsigned int (*bpf_fetch_u32)(void *unsafe_ptr) =
+ (void *) BPF_FUNC_fetch_u32;
+static unsigned short (*bpf_fetch_u16)(void *unsafe_ptr) =
+ (void *) BPF_FUNC_fetch_u16;
+static unsigned char (*bpf_fetch_u8)(void *unsafe_ptr) =
+ (void *) BPF_FUNC_fetch_u8;
+static int (*bpf_probe_memcmp)(void *unsafe_ptr, void *safe_ptr, int size) =
+ (void *) BPF_FUNC_probe_memcmp;
+static unsigned long long (*bpf_ktime_get_ns)(void) =
+ (void *) BPF_FUNC_ktime_get_ns;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 1831d236382b..2aece65963e4 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -8,29 +8,47 @@
#include <unistd.h>
#include <string.h>
#include <stdbool.h>
+#include <stdlib.h>
#include <linux/bpf.h>
#include <linux/filter.h>
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <poll.h>
#include "libbpf.h"
#include "bpf_helpers.h"
#include "bpf_load.h"
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
static char license[128];
static bool processed_sec[128];
int map_fd[MAX_MAPS];
int prog_fd[MAX_PROGS];
+int event_fd[MAX_PROGS];
int prog_cnt;
static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
{
- int fd;
bool is_socket = strncmp(event, "socket", 6) == 0;
+ enum bpf_prog_type prog_type;
+ char path[256] = DEBUGFS;
+ char buf[32];
+ int fd, efd, err, id;
+ struct perf_event_attr attr;
- if (!is_socket)
- /* tracing events tbd */
- return -1;
+ attr.type = PERF_TYPE_TRACEPOINT;
+ attr.sample_type = PERF_SAMPLE_RAW;
+ attr.sample_period = 1;
+ attr.wakeup_events = 1;
+
+ if (is_socket)
+ prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+ else
+ prog_type = BPF_PROG_TYPE_TRACEPOINT;
- fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER,
- prog, size, license);
+ fd = bpf_prog_load(prog_type, prog, size, license);
if (fd < 0) {
printf("bpf_prog_load() err=%d\n%s", errno, bpf_log_buf);
@@ -39,6 +57,39 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_fd[prog_cnt++] = fd;
+ if (is_socket)
+ return 0;
+
+ strcat(path, event);
+ strcat(path, "/id");
+
+ efd = open(path, O_RDONLY, 0);
+ if (efd < 0) {
+ printf("failed to open event %s\n", event);
+ return -1;
+ }
+
+ err = read(efd, buf, sizeof(buf));
+ if (err < 0 || err >= sizeof(buf)) {
+ printf("read from '%s' failed '%s'\n", event, strerror(errno));
+ return -1;
+ }
+
+ close(efd);
+
+ buf[err] = 0;
+ id = atoi(buf);
+ attr.config = id;
+
+ efd = perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+ if (efd < 0) {
+ printf("event %d fd %d err %s\n", id, efd, strerror(errno));
+ return -1;
+ }
+ event_fd[prog_cnt - 1] = efd;
+ ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
+ ioctl(efd, PERF_EVENT_IOC_SET_BPF, fd);
+
return 0;
}
@@ -201,3 +252,69 @@ int load_bpf_file(char *path)
close(fd);
return 0;
}
+
+int page_size;
+int page_cnt = 8;
+volatile struct perf_event_mmap_page *header;
+
+int perf_event_mmap(int fd)
+{
+ void *base;
+ int mmap_size;
+
+ page_size = getpagesize();
+ mmap_size = page_size * (page_cnt + 1);
+
+ base = mmap(NULL, mmap_size, PROT_READ, MAP_SHARED, fd, 0);
+ if (base == MAP_FAILED) {
+ printf("mmap err\n");
+ return -1;
+ }
+
+ header = base;
+ return 0;
+}
+
+int perf_event_poll(int fd)
+{
+ struct pollfd pfd = {.fd = fd, .events = POLLIN};
+ return poll(&pfd, 1, -1);
+}
+
+struct perf_event_sample {
+ struct perf_event_header header;
+ __u32 size;
+ char data[];
+};
+
+void perf_event_read(print_fn fn)
+{
+ static __u64 old_data_head = 0;
+ __u64 data_head = header->data_head;
+ __u64 buffer_size = page_cnt * page_size;
+ void *base, *begin, *end;
+
+ if (data_head == old_data_head)
+ return;
+
+ base = ((char *)header) + page_size;
+
+ begin = base + old_data_head % buffer_size;
+ end = base + data_head % buffer_size;
+
+ while (begin < end) {
+ struct perf_event_sample *e;
+
+ e = begin;
+
+ if (e->header.type != PERF_RECORD_SAMPLE) {
+ printf("event is not a sample type %d\n", e->header.type);
+ }
+
+ begin += sizeof(*e) + e->size;
+ fn(e->data, e->size);
+ }
+ /* else when end > begin - the events had wrapped. ignored for now */
+
+ old_data_head = data_head;
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 27789a34f5e6..cf55663405da 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -6,6 +6,7 @@
extern int map_fd[MAX_MAPS];
extern int prog_fd[MAX_PROGS];
+extern int event_fd[MAX_PROGS];
/* parses elf file compiled by llvm .c->.o
* . parses 'maps' section and creates maps via BPF syscall
@@ -21,4 +22,15 @@ extern int prog_fd[MAX_PROGS];
*/
int load_bpf_file(char *path);
+int perf_event_mmap(int fd);
+int perf_event_poll(int fd);
+typedef void (*print_fn)(void *data, int size);
+void perf_event_read(print_fn fn);
+struct trace_entry {
+ unsigned short type;
+ unsigned char flags;
+ unsigned char preempt_count;
+ int pid;
+};
+
#endif
diff --git a/samples/bpf/tracex1_kern.c b/samples/bpf/tracex1_kern.c
new file mode 100644
index 000000000000..449db961c642
--- /dev/null
+++ b/samples/bpf/tracex1_kern.c
@@ -0,0 +1,28 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+SEC("events/net/netif_receive_skb")
+int bpf_prog1(struct bpf_context *ctx)
+{
+ /*
+ * attaches to net:netif_receive_skb
+	 * and filters events for the loopback device only
+ */
+ char devname[] = "lo";
+ struct net_device *dev;
+ struct sk_buff *skb = 0;
+
+ skb = (struct sk_buff *) ctx->arg1;
+ dev = bpf_fetch_ptr(&skb->dev);
+ if (bpf_probe_memcmp(dev->name, devname, 2) == 0)
+ /* pass event to userspace via perf ring_buffer */
+ return 1;
+
+ /* drop event */
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex1_user.c b/samples/bpf/tracex1_user.c
new file mode 100644
index 000000000000..a49de23eb30b
--- /dev/null
+++ b/samples/bpf/tracex1_user.c
@@ -0,0 +1,50 @@
+#include <stdio.h>
+#include <linux/bpf.h>
+#include <unistd.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+static char *get_str(void *entry, __u32 arrloc)
+{
+ int off = arrloc & 0xffff;
+ return entry + off;
+}
+
+static void print_netif_receive_skb(void *data, int size)
+{
+ struct ftrace_raw_netif_receive_skb {
+ struct trace_entry t;
+ void *skb;
+ __u32 len;
+ __u32 name;
+ } *e = data;
+
+ printf("pid %d skb %p len %d dev %s\n",
+ e->t.pid, e->skb, e->len, get_str(e, e->name));
+}
+
+int main(int ac, char **argv)
+{
+ FILE *f;
+ char filename[256];
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ if (perf_event_mmap(event_fd[0]) < 0)
+ return 1;
+
+ f = popen("taskset 1 ping -c5 localhost", "r");
+ (void) f;
+
+ for (;;) {
+ perf_event_poll(event_fd[0]);
+ perf_event_read(print_netif_receive_skb);
+ }
+
+ return 0;
+}
--
1.7.9.5
this example has two probes in one C file that attach to different tracepoints
and use two different maps.
The 1st probe is similar to dropmon.c: it attaches to the kfree_skb tracepoint and
counts the number of packet drops at different locations.
The 2nd probe attaches to syscalls/sys_enter_write and computes a histogram of
write sizes.
Usage:
$ sudo tracex2
location 0xffffffff816959a5 count 1
location 0xffffffff816959a5 count 2
557145+0 records in
557145+0 records out
285258240 bytes (285 MB) copied, 1.02379 s, 279 MB/s
syscall write() stats
byte_size : count distribution
1 -> 1 : 3 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 2 | |
32 -> 63 : 3 | |
64 -> 127 : 1 | |
128 -> 255 : 1 | |
256 -> 511 : 0 | |
512 -> 1023 : 1118968 |************************************* |
Ctrl-C at any time. The kernel will auto-cleanup maps and programs
Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 4 ++
samples/bpf/tracex2_kern.c | 71 +++++++++++++++++++++++++++++++++
samples/bpf/tracex2_user.c | 95 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 170 insertions(+)
create mode 100644 samples/bpf/tracex2_kern.c
create mode 100644 samples/bpf/tracex2_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index da28e1b6d3a6..416af24b01fd 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -8,6 +8,7 @@ hostprogs-y += sockex1
hostprogs-y += sockex2
hostprogs-y += dropmon
hostprogs-y += tracex1
+hostprogs-y += tracex2
dropmon-objs := dropmon.o libbpf.o
test_verifier-objs := test_verifier.o libbpf.o
@@ -16,12 +17,14 @@ sock_example-objs := sock_example.o libbpf.o
sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
+tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
always += sockex1_kern.o
always += sockex2_kern.o
always += tracex1_kern.o
+always += tracex2_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
@@ -29,6 +32,7 @@ HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
HOSTLOADLIBES_sockex1 += -lelf
HOSTLOADLIBES_sockex2 += -lelf
HOSTLOADLIBES_tracex1 += -lelf
+HOSTLOADLIBES_tracex2 += -lelf
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/tracex2_kern.c b/samples/bpf/tracex2_kern.c
new file mode 100644
index 000000000000..a789c456c1b4
--- /dev/null
+++ b/samples/bpf/tracex2_kern.c
@@ -0,0 +1,71 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(long),
+ .value_size = sizeof(long),
+ .max_entries = 1024,
+};
+
+SEC("events/skb/kfree_skb")
+int bpf_prog2(struct bpf_context *ctx)
+{
+ long loc = ctx->arg2;
+ long init_val = 1;
+ long *value;
+
+ value = bpf_map_lookup_elem(&my_map, &loc);
+ if (value)
+ *value += 1;
+ else
+ bpf_map_update_elem(&my_map, &loc, &init_val, BPF_ANY);
+ return 0;
+}
+
+static unsigned int log2(unsigned int v)
+{
+ unsigned int r;
+ unsigned int shift;
+
+ r = (v > 0xFFFF) << 4; v >>= r;
+ shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
+ shift = (v > 0xF) << 2; v >>= shift; r |= shift;
+ shift = (v > 0x3) << 1; v >>= shift; r |= shift;
+ r |= (v >> 1);
+ return r;
+}
+
+static unsigned int log2l(unsigned long v)
+{
+ unsigned int hi = v >> 32;
+ if (hi)
+ return log2(hi) + 32;
+ else
+ return log2(v);
+}
+
+struct bpf_map_def SEC("maps") my_hist_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(long),
+ .max_entries = 64,
+};
+
+SEC("events/syscalls/sys_enter_write")
+int bpf_prog3(struct bpf_context *ctx)
+{
+ long write_size = ctx->arg3;
+ long init_val = 1;
+ long *value;
+ u32 index = log2l(write_size);
+
+ value = bpf_map_lookup_elem(&my_hist_map, &index);
+ if (value)
+ __sync_fetch_and_add(value, 1);
+ return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex2_user.c b/samples/bpf/tracex2_user.c
new file mode 100644
index 000000000000..91b8d0896fbb
--- /dev/null
+++ b/samples/bpf/tracex2_user.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_INDEX 64
+#define MAX_STARS 38
+
+static void stars(char *str, long val, long max, int width)
+{
+ int i;
+
+ for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+ str[i] = '*';
+ if (val > max)
+ str[i - 1] = '+';
+ str[i] = '\0';
+}
+
+static void print_hist(int fd)
+{
+ int key;
+ long value;
+ long data[MAX_INDEX] = {};
+ char starstr[MAX_STARS];
+ int i;
+ int max_ind = -1;
+ long max_value = 0;
+
+ for (key = 0; key < MAX_INDEX; key++) {
+ bpf_lookup_elem(fd, &key, &value);
+ data[key] = value;
+ if (value && key > max_ind)
+ max_ind = key;
+ if (value > max_value)
+ max_value = value;
+ }
+
+ printf(" syscall write() stats\n");
+ printf(" byte_size : count distribution\n");
+ for (i = 1; i <= max_ind + 1; i++) {
+ stars(starstr, data[i - 1], max_value, MAX_STARS);
+ printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
+ (1l << i) >> 1, (1l << i) - 1, data[i - 1],
+ MAX_STARS, starstr);
+ }
+}
+static void int_exit(int sig)
+{
+ print_hist(map_fd[1]);
+ exit(0);
+}
+
+int main(int ac, char **argv)
+{
+ char filename[256];
+ long key, next_key, value;
+ FILE *f;
+ int i;
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ signal(SIGINT, int_exit);
+
+ /* start 'ping' in the background to have some kfree_skb events */
+ f = popen("ping -c5 localhost", "r");
+ (void) f;
+
+ /* start 'dd' in the background to have plenty of 'write' syscalls */
+ f = popen("dd if=/dev/zero of=/dev/null count=5000000", "r");
+ (void) f;
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ for (i = 0; i < 5; i++) {
+ key = 0;
+ while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+ bpf_lookup_elem(map_fd[0], &next_key, &value);
+ printf("location 0x%lx count %ld\n", next_key, value);
+ key = next_key;
+ }
+ if (key)
+ printf("\n");
+ sleep(1);
+ }
+ print_hist(map_fd[1]);
+
+ return 0;
+}
--
1.7.9.5
eBPF C program attaches to block_rq_issue/block_rq_complete events to calculate
IO latency. It waits for the first 100 events to compute the average latency
and then uses the range [0 .. ave_lat * 2] to record a histogram of events in this
latency range.
User space reads this histogram map every 2 seconds and prints it as a 'heatmap'
using gray shades of the text terminal. Black spaces have many events and white
spaces have very few. The leftmost space is the smallest latency, the rightmost
space is the largest latency in the range.
If the kernel sees too many events that fall outside the histogram range, user space
adjusts the range up, so the heatmap for the next 2 seconds will be more accurate.
Usage:
$ sudo ./tracex3
and do 'sudo dd if=/dev/sda of=/dev/null' in another terminal.
Observe the IO latencies and how other activity (like a kernel build) affects them.
Similar experiments can be done for network transmit latencies, syscalls, etc.
Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 4 ++
samples/bpf/tracex3_kern.c | 98 ++++++++++++++++++++++++++++
samples/bpf/tracex3_user.c | 152 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 254 insertions(+)
create mode 100644 samples/bpf/tracex3_kern.c
create mode 100644 samples/bpf/tracex3_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 416af24b01fd..da0efd8032ab 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -9,6 +9,7 @@ hostprogs-y += sockex2
hostprogs-y += dropmon
hostprogs-y += tracex1
hostprogs-y += tracex2
+hostprogs-y += tracex3
dropmon-objs := dropmon.o libbpf.o
test_verifier-objs := test_verifier.o libbpf.o
@@ -18,6 +19,7 @@ sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
+tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -25,6 +27,7 @@ always += sockex1_kern.o
always += sockex2_kern.o
always += tracex1_kern.o
always += tracex2_kern.o
+always += tracex3_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
@@ -33,6 +36,7 @@ HOSTLOADLIBES_sockex1 += -lelf
HOSTLOADLIBES_sockex2 += -lelf
HOSTLOADLIBES_tracex1 += -lelf
HOSTLOADLIBES_tracex2 += -lelf
+HOSTLOADLIBES_tracex3 += -lelf
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/tracex3_kern.c b/samples/bpf/tracex3_kern.c
new file mode 100644
index 000000000000..961f3f373270
--- /dev/null
+++ b/samples/bpf/tracex3_kern.c
@@ -0,0 +1,98 @@
+/* Copyright (c) 2013-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(long),
+ .value_size = sizeof(u64),
+ .max_entries = 4096,
+};
+
+SEC("events/block/block_rq_issue")
+int bpf_prog1(struct bpf_context *ctx)
+{
+ long rq = ctx->arg2;
+ u64 val = bpf_ktime_get_ns();
+
+ bpf_map_update_elem(&my_map, &rq, &val, BPF_ANY);
+ return 0;
+}
+
+struct globals {
+ u64 lat_ave;
+ u64 lat_sum;
+ u64 missed;
+ u64 max_lat;
+ int num_samples;
+};
+
+struct bpf_map_def SEC("maps") global_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(struct globals),
+ .max_entries = 1,
+};
+
+#define MAX_SLOT 32
+
+struct bpf_map_def SEC("maps") lat_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(u64),
+ .max_entries = MAX_SLOT,
+};
+
+SEC("events/block/block_rq_complete")
+int bpf_prog2(struct bpf_context *ctx)
+{
+ long rq = ctx->arg2;
+ void *value;
+
+ value = bpf_map_lookup_elem(&my_map, &rq);
+ if (!value)
+ return 0;
+
+ u64 cur_time = bpf_ktime_get_ns();
+ u64 delta = (cur_time - *(u64 *)value) / 1000;
+
+ bpf_map_delete_elem(&my_map, &rq);
+
+ int ind = 0;
+ struct globals *g = bpf_map_lookup_elem(&global_map, &ind);
+
+ if (!g)
+ return 0;
+ if (g->lat_ave == 0) {
+ g->num_samples++;
+ g->lat_sum += delta;
+ if (g->num_samples >= 100)
+ g->lat_ave = g->lat_sum / g->num_samples;
+ } else {
+ u64 max_lat = g->lat_ave * 2;
+
+ if (delta > max_lat) {
+ g->missed++;
+ if (delta > g->max_lat)
+ g->max_lat = delta;
+ return 0;
+ }
+
+ ind = delta * MAX_SLOT / max_lat;
+ value = bpf_map_lookup_elem(&lat_map, &ind);
+ if (!value)
+ return 0;
+ (*(u64 *)value)++;
+ }
+
+ return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex3_user.c b/samples/bpf/tracex3_user.c
new file mode 100644
index 000000000000..c49f41f28cba
--- /dev/null
+++ b/samples/bpf/tracex3_user.c
@@ -0,0 +1,152 @@
+/* Copyright (c) 2013-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+struct globals {
+ __u64 lat_ave;
+ __u64 lat_sum;
+ __u64 missed;
+ __u64 max_lat;
+ int num_samples;
+};
+
+static void clear_stats(int fd)
+{
+ int key;
+ __u64 value = 0;
+
+ for (key = 0; key < 32; key++)
+ bpf_update_elem(fd, &key, &value, BPF_ANY);
+}
+
+const char *color[] = {
+ "\033[48;5;255m",
+ "\033[48;5;252m",
+ "\033[48;5;250m",
+ "\033[48;5;248m",
+ "\033[48;5;246m",
+ "\033[48;5;244m",
+ "\033[48;5;242m",
+ "\033[48;5;240m",
+ "\033[48;5;238m",
+ "\033[48;5;236m",
+ "\033[48;5;234m",
+ "\033[48;5;232m",
+};
+const int num_colors = ARRAY_SIZE(color);
+
+const char nocolor[] = "\033[00m";
+
+static void print_banner(__u64 max_lat)
+{
+ printf("0 usec ... %lld usec\n", max_lat);
+}
+
+static void print_hist(int fd)
+{
+ int key;
+ __u64 value;
+ __u64 cnt[32];
+ __u64 max_cnt = 0;
+ __u64 total_events = 0;
+ int max_bucket = 0;
+
+ for (key = 0; key < 32; key++) {
+ value = 0;
+ bpf_lookup_elem(fd, &key, &value);
+ if (value > 0)
+ max_bucket = key;
+ cnt[key] = value;
+ total_events += value;
+ if (value > max_cnt)
+ max_cnt = value;
+ }
+ clear_stats(fd);
+ for (key = 0; key < 32; key++) {
+ int c = num_colors * cnt[key] / (max_cnt + 1);
+
+ printf("%s %s", color[c], nocolor);
+ }
+ printf(" captured=%lld", total_events);
+
+ key = 0;
+ struct globals g = {};
+
+ bpf_lookup_elem(map_fd[1], &key, &g);
+
+ printf(" missed=%lld max_lat=%lld usec\n",
+ g.missed, g.max_lat);
+
+ if (g.missed > 10 && g.missed > total_events / 10) {
+ printf("adjusting range UP...\n");
+ g.lat_ave = g.max_lat / 2;
+ print_banner(g.lat_ave * 2);
+ } else if (max_bucket < 4 && total_events > 100) {
+ printf("adjusting range DOWN...\n");
+ g.lat_ave = g.lat_ave / 4;
+ print_banner(g.lat_ave * 2);
+ }
+ /* clear some globals */
+ g.missed = 0;
+ g.max_lat = 0;
+ bpf_update_elem(map_fd[1], &key, &g, BPF_ANY);
+}
+
+static void int_exit(int sig)
+{
+ print_hist(map_fd[2]);
+ exit(0);
+}
+
+int main(int ac, char **argv)
+{
+ char filename[256];
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ clear_stats(map_fd[2]);
+
+ signal(SIGINT, int_exit);
+
+ struct globals g;
+
+ printf("waiting for events to determine average latency...\n");
+ for (;;) {
+ int key = 0;
+
+ bpf_lookup_elem(map_fd[1], &key, &g);
+ if (g.lat_ave)
+ break;
+ sleep(1);
+ }
+
+ printf(" IO latency in usec\n"
+ " %s %s - many events with this latency\n"
+ " %s %s - few events\n",
+ color[num_colors - 1], nocolor,
+ color[0], nocolor);
+ print_banner(g.lat_ave * 2);
+ for (;;) {
+ print_hist(map_fd[2]);
+ sleep(2);
+ }
+
+ return 0;
+}
--
1.7.9.5
introduce a new type of eBPF program: BPF_PROG_TYPE_KPROBE.
Such programs are allowed to call the same helper functions
as tracing filters, but bpf_context is different:
for tracing filters bpf_context is the 6 arguments of the tracepoint or syscall;
for kprobe filters bpf_context == pt_regs
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/ftrace_event.h | 3 +++
include/uapi/linux/bpf.h | 1 +
kernel/events/core.c | 5 ++++-
kernel/trace/bpf_trace.c | 39 +++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_kprobe.c | 10 +++++++++-
5 files changed, 56 insertions(+), 2 deletions(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 479d0a4a42b3..cd6efd23bfae 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -249,6 +249,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED_BIT,
TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
TRACE_EVENT_FL_TRACEPOINT_BIT,
+ TRACE_EVENT_FL_KPROBE_BIT,
};
/*
@@ -262,6 +263,7 @@ enum {
* it is best to clear the buffers that used it).
* USE_CALL_FILTER - For ftrace internal events, don't use file filter
* TRACEPOINT - Event is a tracepoint
+ * KPROBE - Event is a kprobe
*/
enum {
TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT),
@@ -271,6 +273,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED = (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
TRACE_EVENT_FL_USE_CALL_FILTER = (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
TRACE_EVENT_FL_TRACEPOINT = (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
+ TRACE_EVENT_FL_KPROBE = (1 << TRACE_EVENT_FL_KPROBE_BIT),
};
struct ftrace_event_call {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ecae21e58ba3..cf443900318d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -119,6 +119,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC,
BPF_PROG_TYPE_SOCKET_FILTER,
BPF_PROG_TYPE_TRACEPOINT,
+ BPF_PROG_TYPE_KPROBE,
};
/* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 674a8ca17190..94de727a8c0c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6288,7 +6288,10 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
if (IS_ERR(prog))
return PTR_ERR(prog);
- if (prog->aux->prog_type != BPF_PROG_TYPE_TRACEPOINT) {
+ if (((event->tp_event->flags & TRACE_EVENT_FL_KPROBE) &&
+ prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) ||
+ (!(event->tp_event->flags & TRACE_EVENT_FL_KPROBE) &&
+ prog->aux->prog_type != BPF_PROG_TYPE_TRACEPOINT)) {
/* valid fd, but invalid bpf program type */
bpf_prog_put(prog);
return -EINVAL;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e3196266b72f..52dbd1e7dc28 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -153,3 +153,42 @@ static int __init register_tp_prog_ops(void)
return 0;
}
late_initcall(register_tp_prog_ops);
+
+/* check access to fields of 'struct pt_regs' from BPF program */
+static bool kprobe_prog_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+ /* check bounds */
+ if (off < 0 || off >= sizeof(struct pt_regs))
+ return false;
+
+ /* only read is allowed */
+ if (type != BPF_READ)
+ return false;
+
+ /* disallow misaligned access */
+ if (off % size != 0)
+ return false;
+
+ return true;
+}
+/* kprobe filter programs are allowed to call the same helper functions
+ * as tracing filters, but bpf_context is different:
+ * For tracing filters bpf_context is 6 arguments of tracepoints or syscalls
+ * For kprobe filters bpf_context == pt_regs
+ */
+static struct bpf_verifier_ops kprobe_prog_ops = {
+ .get_func_proto = tp_prog_func_proto,
+ .is_valid_access = kprobe_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list kprobe_tl = {
+ .ops = &kprobe_prog_ops,
+ .type = BPF_PROG_TYPE_KPROBE,
+};
+
+static int __init register_kprobe_prog_ops(void)
+{
+ bpf_register_prog_type(&kprobe_tl);
+ return 0;
+}
+late_initcall(register_kprobe_prog_ops);
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 5edb518be345..d503ac43b85b 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1134,11 +1134,15 @@ static void
kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
{
struct ftrace_event_call *call = &tk->tp.call;
+ struct bpf_prog *prog = call->prog;
struct kprobe_trace_entry_head *entry;
struct hlist_head *head;
int size, __size, dsize;
int rctx;
+ if (prog && !trace_call_bpf(prog, regs))
+ return;
+
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
return;
@@ -1165,11 +1169,15 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
struct pt_regs *regs)
{
struct ftrace_event_call *call = &tk->tp.call;
+ struct bpf_prog *prog = call->prog;
struct kretprobe_trace_entry_head *entry;
struct hlist_head *head;
int size, __size, dsize;
int rctx;
+ if (prog && !trace_call_bpf(prog, regs))
+ return;
+
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
return;
@@ -1286,7 +1294,7 @@ static int register_kprobe_event(struct trace_kprobe *tk)
kfree(call->print_fmt);
return -ENODEV;
}
- call->flags = 0;
+ call->flags = TRACE_EVENT_FL_KPROBE;
call->class->reg = kprobe_register;
call->data = tk;
ret = trace_add_event_call(call);
--
1.7.9.5
the logic of the example is similar to tracex2, but the 'write' syscall statistics
are captured from a kprobe placed at the sys_write function instead of through
syscall instrumentation.
Also, tracex4_kern.c shows a different way of doing log2 in C.
Note, unlike tracepoint and syscall programs, kprobe programs receive
'struct pt_regs' as an input. It's the responsibility of the program author
or a higher-level dynamic tracing tool to match registers to function arguments.
Since pt_regs is architecture dependent, such programs are also arch dependent,
unlike tracepoint/syscall programs, which are universal.
Usage:
$ sudo tracex4
2216443+0 records in
2216442+0 records out
1134818304 bytes (1.1 GB) copied, 2.00746 s, 565 MB/s
kprobe sys_write() stats
byte_size : count distribution
1 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 2214734 |************************************* |
Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 4 +++
samples/bpf/bpf_load.c | 3 ++
samples/bpf/tracex4_kern.c | 36 +++++++++++++++++++
samples/bpf/tracex4_user.c | 83 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 126 insertions(+)
create mode 100644 samples/bpf/tracex4_kern.c
create mode 100644 samples/bpf/tracex4_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index da0efd8032ab..22c7a38f3f95 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -10,6 +10,7 @@ hostprogs-y += dropmon
hostprogs-y += tracex1
hostprogs-y += tracex2
hostprogs-y += tracex3
+hostprogs-y += tracex4
dropmon-objs := dropmon.o libbpf.o
test_verifier-objs := test_verifier.o libbpf.o
@@ -20,6 +21,7 @@ sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
+tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -28,6 +30,7 @@ always += sockex2_kern.o
always += tracex1_kern.o
always += tracex2_kern.o
always += tracex3_kern.o
+always += tracex4_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
@@ -37,6 +40,7 @@ HOSTLOADLIBES_sockex2 += -lelf
HOSTLOADLIBES_tracex1 += -lelf
HOSTLOADLIBES_tracex2 += -lelf
HOSTLOADLIBES_tracex3 += -lelf
+HOSTLOADLIBES_tracex4 += -lelf
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 2aece65963e4..2206b49df625 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -32,6 +32,7 @@ int prog_cnt;
static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
{
bool is_socket = strncmp(event, "socket", 6) == 0;
+ bool is_kprobe = strncmp(event, "events/kprobes/", 15) == 0;
enum bpf_prog_type prog_type;
char path[256] = DEBUGFS;
char buf[32];
@@ -45,6 +46,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
if (is_socket)
prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+ else if (is_kprobe)
+ prog_type = BPF_PROG_TYPE_KPROBE;
else
prog_type = BPF_PROG_TYPE_TRACEPOINT;
diff --git a/samples/bpf/tracex4_kern.c b/samples/bpf/tracex4_kern.c
new file mode 100644
index 000000000000..9646f9e43417
--- /dev/null
+++ b/samples/bpf/tracex4_kern.c
@@ -0,0 +1,36 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+static unsigned int log2l(unsigned long long n)
+{
+#define S(k) if (n >= (1ull << k)) { i += k; n >>= k; }
+ int i = -(n == 0);
+ S(32); S(16); S(8); S(4); S(2); S(1);
+ return i;
+#undef S
+}
+
+struct bpf_map_def SEC("maps") my_hist_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(long),
+ .max_entries = 64,
+};
+
+SEC("events/kprobes/sys_write")
+int bpf_prog4(struct pt_regs *regs)
+{
+ long write_size = regs->dx; /* $rdx contains 3rd argument to a function */
+ long init_val = 1;
+ void *value;
+ u32 index = log2l(write_size);
+
+ value = bpf_map_lookup_elem(&my_hist_map, &index);
+ if (value)
+ __sync_fetch_and_add((long *)value, 1);
+ return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex4_user.c b/samples/bpf/tracex4_user.c
new file mode 100644
index 000000000000..741206127768
--- /dev/null
+++ b/samples/bpf/tracex4_user.c
@@ -0,0 +1,83 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_INDEX 64
+#define MAX_STARS 38
+
+static void stars(char *str, long val, long max, int width)
+{
+ int i;
+
+ for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+ str[i] = '*';
+ if (val > max)
+ str[i - 1] = '+';
+ str[i] = '\0';
+}
+
+static void print_hist(int fd)
+{
+ int key;
+ long value;
+ long data[MAX_INDEX] = {};
+ char starstr[MAX_STARS];
+ int i;
+ int max_ind = -1;
+ long max_value = 0;
+
+ for (key = 0; key < MAX_INDEX; key++) {
+ bpf_lookup_elem(fd, &key, &value);
+ data[key] = value;
+ if (value && key > max_ind)
+ max_ind = key;
+ if (value > max_value)
+ max_value = value;
+ }
+
+ printf("\n kprobe sys_write() stats\n");
+ printf(" byte_size : count distribution\n");
+ for (i = 1; i <= max_ind + 1; i++) {
+ stars(starstr, data[i - 1], max_value, MAX_STARS);
+ printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
+ (1l << i) >> 1, (1l << i) - 1, data[i - 1],
+ MAX_STARS, starstr);
+ }
+}
+static void int_exit(int sig)
+{
+ print_hist(map_fd[0]);
+ exit(0);
+}
+
+int main(int ac, char **argv)
+{
+ char filename[256];
+ FILE *f;
+ int i;
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ signal(SIGINT, int_exit);
+
+ i = system("echo 'p:sys_write sys_write' > /sys/kernel/debug/tracing/kprobe_events");
+ (void) i;
+
+ /* start 'dd' in the background to have plenty of 'write' syscalls */
+ f = popen("dd if=/dev/zero of=/dev/null count=5000000", "r");
+ (void) f;
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ sleep(2);
+ kill(0, SIGINT); /* send Ctrl-C to self and to 'dd' */
+
+ return 0;
+}
--
1.7.9.5
On Mon, 9 Feb 2015 19:45:57 -0800
Alexei Starovoitov <[email protected]> wrote:
> +int perf_event_mmap(int fd);
> +int perf_event_poll(int fd);
> +typedef void (*print_fn)(void *data, int size);
> +void perf_event_read(print_fn fn);
> +struct trace_entry {
> + unsigned short type;
> + unsigned char flags;
> + unsigned char preempt_count;
> + int pid;
> +};
> +
Please do not hard code any structures. This is not a stable ABI, and
it may not even match if you are running 32 bit userspace on top of a
64 bit kernel.
Please parse the format files. libtraceevent does this for you. If need
be, link to that. But if you look at the event format files you'll see
the offsets and sizes in the binary code:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
I don't want to get stuck with pinned kernel data structures again. We
had 4 blank bytes of data for every event, because latency top hard
coded the field. Luckily, the 64 bit / 32 bit interface caused latency
top to have to use the event_parse code to work, and we were able to
remove that field after it was converted.
-- Steve
On Mon, 9 Feb 2015 19:45:57 -0800
Alexei Starovoitov <[email protected]> wrote:
> +static void print_netif_receive_skb(void *data, int size)
> +{
> + struct ftrace_raw_netif_receive_skb {
> + struct trace_entry t;
> + void *skb;
> + __u32 len;
> + __u32 name;
> + } *e = data;
Same here, there's no guarantee that this structure will always match.
cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/format
name: netif_receive_skb
ID: 975
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:void * skbaddr; offset:8; size:8; signed:0;
field:unsigned int len; offset:16; size:4; signed:0;
field:__data_loc char[] name; offset:20; size:4; signed:1;
print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len
-- Steve
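(A short aside, not part of the patch set: a minimal sketch of doing it the way Steve describes. It reads the offset and size of a field from the event's format file at runtime and uses them to decode the raw payload, instead of hard-coding a struct. The field names match the netif_receive_skb format above; error handling is minimal and it assumes a little-endian host.)
#include <stdio.h>
#include <string.h>
struct field_desc {
	int offset;
	int size;
};
/* scan the format file for a line like
 * "field:unsigned int len; offset:16; size:4; signed:0;"
 * and return offset/size of the named field (naive name match, fine for a sketch)
 */
static int find_field(const char *format_path, const char *name, struct field_desc *fd)
{
	char line[256];
	FILE *f = fopen(format_path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, "offset:");
		if (!strstr(line, "field:") || !strstr(line, name) || !p)
			continue;
		if (sscanf(p, "offset:%d; size:%d;", &fd->offset, &fd->size) == 2) {
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	return -1;
}
/* decode one field from the raw event payload using the parsed layout */
static unsigned long long read_field(const void *data, const struct field_desc *fd)
{
	unsigned long long val = 0;
	memcpy(&val, (const char *)data + fd->offset, fd->size > 8 ? 8 : fd->size);
	return val;
}
With something like this, print_netif_receive_skb() could look up 'skbaddr' and 'len' once at startup from events/net/netif_receive_skb/format and keep working even if the fields move around.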
On Mon, 9 Feb 2015 19:45:54 -0800
Alexei Starovoitov <[email protected]> wrote:
> +#endif /* _LINUX_KERNEL_BPF_TRACE_H */
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 139b5067345b..4c275ce2dcf0 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -17,6 +17,7 @@
> */
>
> #include <linux/ftrace_event.h>
> +#include <trace/bpf_trace.h>
>
> /*
> * DECLARE_EVENT_CLASS can be used to add a generic function
> @@ -755,12 +756,32 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
> #undef __perf_task
> #define __perf_task(t) (__task = (t))
>
> +/* zero extend integer, pointer or aggregate type to u64 without warnings */
> +#define __CAST_TO_U64(EXPR) ({ \
> + u64 ret = 0; \
> + typeof(EXPR) expr = EXPR; \
> + switch (sizeof(expr)) { \
> + case 8: ret = *(u64 *) &expr; break; \
> + case 4: ret = *(u32 *) &expr; break; \
> + case 2: ret = *(u16 *) &expr; break; \
> + case 1: ret = *(u8 *) &expr; break; \
> + } \
> + ret; })
> +
> +#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
> +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
> +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__)
> +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__)
> +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__)
> +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__)
> +
> #undef DECLARE_EVENT_CLASS
> #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> static notrace void \
> perf_trace_##call(void *__data, proto) \
> { \
> struct ftrace_event_call *event_call = __data; \
> + struct bpf_prog *prog = event_call->prog; \
Looks like this is entirely perf based and does not interact with
ftrace at all. In other words, it's perf not tracing.
It makes more sense to go through tip than the tracing tree.
But I still do not want any hard coded event structures. All access to
data from the binary code must be parsed by looking at the event/format
files. Otherwise you will lock internals of the kernel as userspace
ABI, because eBPF programs will break if those internals change, and
that could severely limit progress in the future.
-- Steve
> struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
> struct ftrace_raw_##call *entry; \
> struct pt_regs __regs; \
> @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto) \
> int __data_size; \
> int rctx; \
> \
> + if (prog) { \
> + __maybe_unused const u64 z = 0; \
> + struct bpf_context __ctx = ((struct bpf_context) { \
> + __BPF_CAST6(args, z, z, z, z, z) \
> + }); \
> + \
> + if (!trace_call_bpf(prog, &__ctx)) \
> + return; \
> + } \
> + \
> __data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
> \
> head = this_cpu_ptr(event_call->perf_events); \
On Mon, 9 Feb 2015 19:45:54 -0800
Alexei Starovoitov <[email protected]> wrote:
> +/* For tracepoint filters argN fields match one to one to arguments
> + * passed to tracepoint events
> + *
> + * For syscall entry filters argN fields match syscall arguments
> + * For syscall exit filters arg1 is a return value
> + */
> +struct bpf_context {
> + u64 arg1;
> + u64 arg2;
> + u64 arg3;
> + u64 arg4;
> + u64 arg5;
> + u64 arg6;
> +};
> +
> +#endif /* _LINUX_KERNEL_BPF_TRACE_H */
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 139b5067345b..4c275ce2dcf0 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -17,6 +17,7 @@
> */
>
> #include <linux/ftrace_event.h>
> +#include <trace/bpf_trace.h>
>
> /*
> * DECLARE_EVENT_CLASS can be used to add a generic function
> @@ -755,12 +756,32 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
> #undef __perf_task
> #define __perf_task(t) (__task = (t))
>
> +/* zero extend integer, pointer or aggregate type to u64 without warnings */
> +#define __CAST_TO_U64(EXPR) ({ \
> + u64 ret = 0; \
> + typeof(EXPR) expr = EXPR; \
> + switch (sizeof(expr)) { \
> + case 8: ret = *(u64 *) &expr; break; \
> + case 4: ret = *(u32 *) &expr; break; \
> + case 2: ret = *(u16 *) &expr; break; \
> + case 1: ret = *(u8 *) &expr; break; \
> + } \
> + ret; })
> +
> +#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
> +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
> +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__)
> +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__)
> +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__)
> +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__)
> +
> #undef DECLARE_EVENT_CLASS
> #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> static notrace void \
> perf_trace_##call(void *__data, proto) \
> { \
> struct ftrace_event_call *event_call = __data; \
> + struct bpf_prog *prog = event_call->prog; \
> struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
> struct ftrace_raw_##call *entry; \
> struct pt_regs __regs; \
> @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto) \
> int __data_size; \
> int rctx; \
> \
> + if (prog) { \
> + __maybe_unused const u64 z = 0; \
> + struct bpf_context __ctx = ((struct bpf_context) { \
> + __BPF_CAST6(args, z, z, z, z, z) \
Note, there is no guarantee that args is at most 6. For example, in
drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the
trace_event brcms_txstatus has 8 args.
But I guess that's OK if you do not need those last args, right?
Also, there's no interface that lets us know what the args are. I may be
able to come up with something. That's the reason I never filtered
before tracing: we had no way of knowing what to filter on,
because the args were never visible.
I'm nervous about showing args of tracepoints too, because we don't want
that to become a strict ABI either.
-- Steve
> + }); \
> + \
> + if (!trace_call_bpf(prog, &__ctx)) \
> + return; \
> + } \
> + \
> __data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
On Mon, 9 Feb 2015 23:08:36 -0500
Steven Rostedt <[email protected]> wrote:
> I don't want to get stuck with pinned kernel data structures again. We
> had 4 blank bytes of data for every event, because latency top hard
> coded the field. Luckily, the 64 bit / 32 bit interface caused latency
> top to have to use the event_parse code to work, and we were able to
> remove that field after it was converted.
I'm wondering if we should label eBPF programs as "modules". That is,
they have no guarantee of working from one kernel to the next. They
execute in the kernel, thus they are very similar to modules.
If we can get Linus to say that eBPF programs are not user space, and
that they are treated the same as modules (no internal ABI), then I
think we can be a bit more free at what we allow.
-- Steve
On Mon, Feb 9, 2015 at 9:16 PM, Steven Rostedt <[email protected]> wrote:
> On Mon, 9 Feb 2015 23:08:36 -0500
> Steven Rostedt <[email protected]> wrote:
>
>> I don't want to get stuck with pinned kernel data structures again. We
>> had 4 blank bytes of data for every event, because latency top hard
>> coded the field. Luckily, the 64 bit / 32 bit interface caused latency
>> top to have to use the event_parse code to work, and we were able to
>> remove that field after it was converted.
I think your main point boils down to:
> But I still do not want any hard coded event structures. All access to
> data from the binary code must be parsed by looking at the event/format
> files. Otherwise you will lock internals of the kernel as userspace
> ABI, because eBPF programs will break if those internals change, and
> that could severely limit progress in the future.
and I completely agree.
patch 4 is an example. It doesn't mean in any way
that the structs defined there are an ABI.
To be compatible across kernels, user space must read
the format file, as you mentioned in your other reply.
> I'm wondering if we should label eBPF programs as "modules". That is,
> they have no guarantee of working from one kernel to the next. They
> execute in the kernel, thus they are very similar to modules.
>
> If we can get Linus to say that eBPF programs are not user space, and
> that they are treated the same as modules (no internal ABI), then I
> think we can be a bit more free at what we allow.
I thought we already stated that.
Here is the quote from perf_event.h:
* # The RAW record below is opaque data wrt the ABI
* #
* # That is, the ABI doesn't make any promises wrt to
* # the stability of its content, it may vary depending
* # on event, hardware, kernel version and phase of
* # the moon.
* #
* # In other words, PERF_SAMPLE_RAW contents are not an ABI.
and this example is reading PERF_SAMPLE_RAW events and
uses locally defined structs to print them for simplicity.
On Mon, Feb 9, 2015 at 9:45 PM, Alexei Starovoitov <[email protected]> wrote:
> I thought we already stated that.
> Here is the quote from perf_event.h:
> * # The RAW record below is opaque data wrt the ABI
> * #
> * # That is, the ABI doesn't make any promises wrt to
> * # the stability of its content, it may vary depending
> * # on event, hardware, kernel version and phase of
> * # the moon.
> * #
> * # In other words, PERF_SAMPLE_RAW contents are not an ABI.
>
> and this example is reading PERF_SAMPLE_RAW events and
> uses locally defined structs to print them for simplicity.
to underline my point once more:
addition of bpf doesn't change at all what PERF_SAMPLE_RAW already
delivers to user space.
so no new ABIs anywhere.
On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt <[email protected]> wrote:
>
> Looks like this is entirely perf based and does not interact with
> ftrace at all. In other words, it's perf not tracing.
>
> It makes more sense to go through tip than the tracing tree.
well, all of the earlier series were based on ftrace only,
but I was given convincing enough arguments that
perf_event_open+ioctl is a better interface :)
Ok, will rebase on tip in the next version.
On Mon, Feb 9, 2015 at 9:13 PM, Steven Rostedt <[email protected]> wrote:
>> \
>> + if (prog) { \
>> + __maybe_unused const u64 z = 0; \
>> + struct bpf_context __ctx = ((struct bpf_context) { \
>> + __BPF_CAST6(args, z, z, z, z, z) \
>
> Note, there is no guarantee that args is at most 6. For example, in
> drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the
> trace_event brcms_txstatus has 8 args.
>
> But I guess that's OK if you do not need those last args, right?
yeah, some tracepoints pass a lot of things.
That's rare and in most of the cases they can be fetched
from parent structure.
> I'm nervous about showing args of tracepoints too, because we don't want
> that to become a strict ABI either.
One can argue that current TP_printk format is already an ABI,
because somebody might be parsing the text output.
so in some cases we cannot change tracepoints without
somebody complaining that his tool broke.
In other cases tracepoints are used for debugging only
and no one will notice when they change...
It was and still is a grey area.
bpf doesn't change any of that.
It actually makes addition of new tracepoints easier.
In the future we might add a tracepoint and pass a single
pointer to interesting data struct to it. bpf programs will walk
data structures 'as safe modules' via bpf_fetch*() methods
without exposing it as ABI.
whereas today we pass a lot of fields to tracepoints and
make all of these fields immutable.
To me tracepoints are like gdb breakpoints,
and bpf programs are like a live debugger that examines things.
the next step is to be able to write bpf scripts on the fly
without leaving the debugger. Something like perf probe +
editor + live execution. Truly like gdb for the kernel,
while the kernel is running.
Added Linus because he's the one that would revert changes on breakage.
On Mon, 9 Feb 2015 21:45:21 -0800
Alexei Starovoitov <[email protected]> wrote:
> On Mon, Feb 9, 2015 at 9:16 PM, Steven Rostedt <[email protected]> wrote:
> > On Mon, 9 Feb 2015 23:08:36 -0500
> > Steven Rostedt <[email protected]> wrote:
> >
> >> I don't want to get stuck with pinned kernel data structures again. We
> >> had 4 blank bytes of data for every event, because latency top hard
> >> coded the field. Luckily, the 64 bit / 32 bit interface caused latency
> >> top to have to use the event_parse code to work, and we were able to
> >> remove that field after it was converted.
>
> I think your main point boils down to:
>
> > But I still do not want any hard coded event structures. All access to
> > data from the binary code must be parsed by looking at the event/format
> > files. Otherwise you will lock internals of the kernel as userspace
> > ABI, because eBPF programs will break if those internals change, and
> > that could severely limit progress in the future.
>
> and I completely agree.
>
> the patch 4 is an example. It doesn't mean in any way
> that structs defined here is an ABI.
> To be compatible across kernels the user space must read
> format file as you mentioned in your other reply.
The thing is, this is a sample. Which means it will be cut and pasted
into other programs. If the sample does not follow the way we want
users to use this, then how can we complain if they hard code it as
well?
>
> > I'm wondering if we should label eBPF programs as "modules". That is,
> > they have no guarantee of working from one kernel to the next. They
> > execute in the kernel, thus they are very similar to modules.
> >
> > If we can get Linus to say that eBPF programs are not user space, and
> > that they are treated the same as modules (no internal ABI), then I
> > think we can be a bit more free at what we allow.
>
> I thought we already stated that.
> Here is the quote from perf_event.h:
> * # The RAW record below is opaque data wrt the ABI
> * #
> * # That is, the ABI doesn't make any promises wrt to
> * # the stability of its content, it may vary depending
> * # on event, hardware, kernel version and phase of
> * # the moon.
> * #
> * # In other words, PERF_SAMPLE_RAW contents are not an ABI.
>
> and this example is reading PERF_SAMPLE_RAW events and
> uses locally defined structs to print them for simplicity.
As we found out the hard way with latencytop, comments like this do
not matter. If an application does something like this, it's our fault
if it breaks later. We can't say "hey, you were supposed to do it this
way". That argument breaks down even more if our own examples do not
follow the way we want others to do things.
-- Steve
On Mon, 9 Feb 2015 21:47:42 -0800
Alexei Starovoitov <[email protected]> wrote:
> On Mon, Feb 9, 2015 at 9:45 PM, Alexei Starovoitov <[email protected]> wrote:
> > I thought we already stated that.
> > Here is the quote from perf_event.h:
> > * # The RAW record below is opaque data wrt the ABI
> > * #
> > * # That is, the ABI doesn't make any promises wrt to
> > * # the stability of its content, it may vary depending
> > * # on event, hardware, kernel version and phase of
> > * # the moon.
> > * #
> > * # In other words, PERF_SAMPLE_RAW contents are not an ABI.
> >
> > and this example is reading PERF_SAMPLE_RAW events and
> > uses locally defined structs to print them for simplicity.
>
> to underline my point once more:
> addition of bpf doesn't change at all what PERF_SAMPLE_RAW already
> delivers to user space.
> so no new ABIs anywhere.
Again, if we give an example of how to hard code the data, definitely
expect this to show up in user space. Users are going to look at this
code to learn how to use eBPF. I really want it done the correct
way instead of the 'easy' way. Because whatever way we have it here,
will be the way we will see it out in the wild.
-- Steve
On Mon, 9 Feb 2015 21:51:05 -0800
Alexei Starovoitov <[email protected]> wrote:
> On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt <[email protected]> wrote:
> >
> > Looks like this is entirely perf based and does not interact with
> > ftrace at all. In other words, it's perf not tracing.
> >
> > It makes more sense to go through tip than the tracing tree.
>
> well, all of earlier series were based on ftrace only,
> but I was given convincing enough arguments that
> perf_event_open+ioctl is a better interface :)
> Ok. will rebase on tip in the next version.
That's fine. If this proves useful, I'll probably add the interface
back to ftrace as well.
-- Steve
On Mon, 9 Feb 2015 22:10:45 -0800
Alexei Starovoitov <[email protected]> wrote:
> One can argue that current TP_printk format is already an ABI,
> because somebody might be parsing the text output.
If somebody does, then it is an ABI. Luckily, it's not that useful to
parse, thus it hasn't been an issue. As Linus has stated in the past,
it's not that we can't change ABI interfaces, it's just that we can not
change them if there's a user space application that depends on it.
The harder we make it for interface changes to break user space, the
better. The field layouts are a user interface. In fact, some of those
fields must always be there. This is because tools do parse the layout
and expect some events to have specific fields. Now we can add new
fields, or even remove fields that no user space tool is using. This is
because today, tools use libtraceevent to parse the event data.
This is why I'm nervous about exporting the parameters of the trace
event call. Right now, those parameters can always change, because the
only way to know they exist is by looking at the code. And currently,
there's no way to interact with those parameters. Once we have eBPF in
mainline, we now have a way to interact with the parameters and if
those parameters change, then the eBPF program will break, and if eBPF
can be part of a user space tool, that will break that tool and
whatever change in the trace point that caused this breakage would have
to be reverted. IOW, this can limit development in the kernel.
Al Viro currently does not let any tracepoint in VFS because he doesn't
want the internals of that code locked to an ABI. He's right to be
worried.
> so in some cases we cannot change tracepoints without
> somebody complaining that his tool broke.
> In other cases tracepoints are used for debugging only
> and no one will notice when they change...
> It was and still a grey area.
Not really. If a tool uses the tracepoint, it can lock that tracepoint
down. This is exactly what latencytop did. It happened, it's not a
hypothetical situation.
> bpf doesn't change any of that.
> It actually makes addition of new tracepoints easier.
I totally disagree. It adds more ways to see inside the kernel, and if
user space depends on this, it adds more ways the kernel can not change.
It comes down to how robust eBPF is with the internals of the kernel
changing. If we limit eBPF to system call tracepoints only, that's
fine because those have the same ABI as the system call itself. I'm
worried about the internal tracepoints for scheduling, irqs, file
systems, etc.
> In the future we might add a tracepoint and pass a single
> pointer to interesting data struct to it. bpf programs will walk
> data structures 'as safe modules' via bpf_fetch*() methods
> without exposing it as ABI.
Will this work if that structure changes? When the field we are looking
for no longer exists?
> whereas today we pass a lot of fields to tracepoints and
> make all of these fields immutable.
The parameters passed to the tracepoint are not shown to userspace and
can change at will. Now, we present the final parsing of the parameters
that convert to fields. As all currently known tools use
libtraceevent.a, and parse the format files, those fields can move
around and even change in size. The structures are not immutable. The
fields are locked down if user space relies on them. But they can move
about within the tracepoint, because the parsing allows for it.
Remember, these are processed fields. The result of TP_fast_assign()
and what gets put into the ring buffer. Now what is passed to the
actual tracepoint is not visible to userspace, and in lots of cases, it
is just a pointer to some structure. What eBPF brings to the table is a
way to access this structure from user space. What keeps a structure
passed to a tracepoint from becoming immutable if there's an eBPF
program that expects it to have a specific field?
>
> To me tracepoints are like gdb breakpoints.
Unfortunately, it doesn't matter what you think they are. It doesn't
matter what I think they are. What matters is what Linus thinks they
are, and if user space depends on it and Linus decides to revert
whatever change broke that user space program, then no matter what we think, we
just screwed ourselves.
I'm being stubborn on this because this is exactly what happened in the
past, which caused every trace point to hold 4 bytes of padding. 4
bytes may not sound like a lot, but when you have 1 million
tracepoints, that's 4 megs of wasted space.
> and bpf programs like live debugger that examine things.
If bpf programs only dealt with kprobes, I may agree. But tracepoints
have already been proven to be a type of ABI. If we open another window
into the kernel, this can screw us later. It's better to solve this now
than when we are fighting with Linus over user space breakage.
>
> the next step is to be able to write bpf scripts on the fly
> without leaving debugger. Something like perf probe +
> editor + live execution. Truly like gdb for kernel.
> while kernel is running.
What we need is to know if eBPF programs are modules or a user space
interface. If they are a user interface then we need to be extremely
careful here. If they are treated the same as modules, then it would
not add any API. But that hasn't been settled yet, even if we have a
comment in the kernel.
Maybe what we should do is to make eBPF pass the kernel version it was
made for (with all the mod version checks). If it doesn't match, fail
to load it. Perhaps the more eBPF is limited like modules are, the
better chance we have that no eBPF program creates a new ABI.
-- Steve
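(As a rough sketch of that suggestion, not existing code: the loader would record the kernel version the program was built against and the kernel would refuse a mismatch at load time, much like modversions does for modules. The prog_kern_version field below is hypothetical.)
#include <linux/types.h>
#include <linux/version.h>
#include <linux/errno.h>
/* hypothetical check at program load time */
static int bpf_check_kern_version(u32 prog_kern_version)
{
	/* reject programs built against a different kernel */
	if (prog_kern_version != LINUX_VERSION_CODE)
		return -EINVAL;
	return 0;
}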
On Mon, 9 Feb 2015 19:45:53 -0800
Alexei Starovoitov <[email protected]> wrote:
> So the overhead of realistic bpf program is 5.05963/4.80074 = ~5%
> which is faster than perf_event filtering: 5.69732/4.80074 = ~18%
> or ftrace filtering: 6.50091/4.80074 = ~35%
Come to think of it, this is comparing apples to oranges, as you move
the filtering before the recording. It would be interesting to see the
ftrace speed up, if it were to use eBPF instead of its own filtering.
Maybe that 35% is the filter part, and not the discard part.
I just tried the dd test with count==1234 and count!=1234 and the one
that drops events is only slightly slower. In this case it does seem
that the most overhead is in the filter logic.
But by moving it before the recording, we can not use the fields
defined in the format files, as the parameters and the fields do not
match in most trace points. And to use the parameters, as I have
stated, there's no interface to know what those parameters are, so
filtering on them is a one-shot deal. Might as well write a module and
hook directly to the tracepoint and do the filtering natively. That
would be faster than BPF too.
My point is, what's the use case? If you filter before recording, you
can not use the fields of the tracepoint. That limits you to filtering
only syscalls, and perhaps kprobes.
-- Steve
On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt <[email protected]> wrote:
> On Mon, 9 Feb 2015 22:10:45 -0800
> Alexei Starovoitov <[email protected]> wrote:
>
>> One can argue that current TP_printk format is already an ABI,
>> because somebody might be parsing the text output.
>
> If somebody does, then it is an ABI. Luckily, it's not that useful to
> parse, thus it hasn't been an issue. As Linus has stated in the past,
> it's not that we can't change ABI interfaces, its just that we can not
> change them if there's a user space application that depends on it.
there are already tools that parse trace_pipe:
https://github.com/brendangregg/perf-tools
> and expect some events to have specific fields. Now we can add new
> fields, or even remove fields that no user space tool is using. This is
> because today, tools use libtraceevent to parse the event data.
not all tools use libtraceevent.
gdb calls perf_event_open directly:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
and parses PERF_RECORD_SAMPLE as a binary.
In this case it's branch records, but I think we never said anywhere
that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
in this particular order.
> This is why I'm nervous about exporting the parameters of the trace
> event call. Right now, those parameters can always change, because the
> only way to know they exist is by looking at the code. And currently,
> there's no way to interact with those parameters. Once we have eBPF in
> mainline, we now have a way to interact with the parameters and if
> those parameters change, then the eBPF program will break, and if eBPF
> can be part of a user space tool, that will break that tool and
> whatever change in the trace point that caused this breakage would have
> to be reverted. IOW, this can limit development in the kernel.
it can limit development unless we say that bpf programs
that attach to tracepoints are not part of ABI.
Easy enough to add large comment similar to perf_event.h
> Al Viro currently does not let any tracepoint in VFS because he doesn't
> want the internals of that code locked to an ABI. He's right to be
> worried.
Same with networking bits. We don't want tracepoints to limit
kernel development, but we want debuggability and kernel
analytics.
All existing tracepoints defined via DEFINE_EVENT should
not be an ABI.
But some maintainers think of them as ABI, whereas others
are using them freely. imo it's time to remove ambiguity.
The idea for new style of tracepoints is the following:
introduce new macro: DEFINE_EVENT_USER
and let programs attach to them.
These tracepoints will receive one or two pointers to important
structs only. They will not have TP_printk, assign and fields.
The placement and arguments to these new tracepoints
will be an ABI.
All existing tracepoints are not.
The main reason to attach to tracepoint is that they are
accessible without debug info (unlike kprobe)
Another reason is speed. tracepoints are much faster than
optimized kprobes and for real-time analytics the speed
is critical.
The position of new tracepoints and their arguments
will be an ABI and the programs can be both.
If a program is using bpf_fetch*() helpers it obviously
wants to access internal data structures, so
it's really nothing more than a 'safe kernel module'
and is kernel layout specific.
Both old and new tracepoints + programs will be used
for live kernel debugging.
If program is accessing user-ized data structures then
it is portable and will run on any kernel.
In uapi header we can define:
struct task_struct_user {
int pid;
int prio;
};
and let bpf programs access it via real 'struct task_struct *'
pointer passed into tracepoint.
bpf loader will translate offsets and sizes used inside
the program into real task_struct's offsets and loads.
(all structs are read-only of course)
programs will be fast and kernel independent.
They will be used for analytics (latency, etc)
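(Purely illustrative, nothing below exists yet: the program would be written against the uapi struct and a new-style tracepoint, and the loader would rewrite the field reads into safe loads at the real task_struct offsets of the running kernel. The tracepoint name is made up.)
#include <uapi/linux/bpf.h>
#include <trace/bpf_trace.h>
#include "bpf_helpers.h"
/* the uapi struct proposed above */
struct task_struct_user {
	int pid;
	int prio;
};
SEC("events/sample/new_user_tracepoint")	/* hypothetical new-style tracepoint */
int bpf_prog_user(struct bpf_context *ctx)
{
	/* arg1 is assumed to carry the 'struct task_struct *' given to the tracepoint */
	struct task_struct_user *p = (struct task_struct_user *) ctx->arg1;
	/* the loader would rewrite this read into a safe load at the real
	 * task_struct->pid offset of the running kernel
	 */
	return p->pid > 1;	/* arbitrary filter, just to show a field access */
}
char _license[] SEC("license") = "GPL";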
>> so in some cases we cannot change tracepoints without
>> somebody complaining that his tool broke.
>> In other cases tracepoints are used for debugging only
>> and no one will notice when they change...
>> It was and still a grey area.
>
> Not really. If a tool uses the tracepoint, it can lock that tracepoint
> down. This is exactly what latencytop did. It happened, it's not a
> hypothetical situation.
correct.
>> bpf doesn't change any of that.
>> It actually makes addition of new tracepoints easier.
>
> I totally disagree. It adds more ways to see inside the kernel, and if
> user space depends on this, it adds more ways the kernel can not change.
>
> It comes down to how robust eBPF is with the internals of the kernel
> changing. If we limit eBPF to system call tracepoints only, that's
> fine because those have the same ABI as the system call itself. I'm
> worried about the internal tracepoints for scheduling, irqs, file
> systems, etc.
agree. we need to make it clear that existing tracepoints
+ programs is not ABI.
>> In the future we might add a tracepoint and pass a single
>> pointer to interesting data struct to it. bpf programs will walk
>> data structures 'as safe modules' via bpf_fetch*() methods
>> without exposing it as ABI.
>
> Will this work if that structure changes? When the field we are looking
> for no longer exists?
bpf_fetch*() is the same mechanism as perf probe.
If there is a mistake by user space tools, the program
will be reading some junk, but it won't be crashing.
To be able to debug live kernel we need to see everywhere.
Same way as systemtap loads kernel modules to walk
things inside kernel, bpf programs walk pointers with
bpf_fetch*().
I'm saying that if program is using bpf_fetch*()
it wants to see kernel internals and obviously depends
on particular kernel layout.
>> whereas today we pass a lot of fields to tracepoints and
>> make all of these fields immutable.
>
> The parameters passed to the tracepoint are not shown to userspace and
> can change at will. Now, we present the final parsing of the parameters
> that convert to fields. As all currently known tools uses
> libtraceevent.a, and parse the format files, those fields can move
> around and even change in size. The structures are not immutable. The
> fields are locked down if user space relies on them. But they can move
> about within the tracepoint, because the parsing allows for it.
>
> Remember, these are processed fields. The result of TP_fast_assign()
> and what gets put into the ring buffer. Now what is passed to the
> actual tracepoint is not visible by userspace, and in lots of cases, it
> is just a pointer to some structure. What eBPF brings to the table is a
> way to access this structure from user space. What keeps a structured
> passed to a tracepoint from becoming immutable if there's a eBPF
> program that expects it to have a specific field?
agree. that's fair.
I'm proposing to treat bpf programs that attach to existing
tracepoints as kernel modules that carry no ABI claims.
>> and bpf programs like live debugger that examine things.
>
> If bpf programs only dealt with kprobes, I may agree. But tracepoints
> have already been proven to be a type of ABI. If we open another window
> into the kernel, this can screw us later. It's better to solve this now
> than when we are fighting with Linus over user space breakage.
I'm not sure what's more needed other than adding
large comments into documentation, man pages and sample
code that bpf+existing tracepoint is not an ABI.
> What we need is to know if eBPF programs are modules or a user space
> interface. If they are a user interface then we need to be extremely
> careful here. If they are treated the same as modules, then it would
> not add any API. But that hasn't been settled yet, even if we have a
> comment in the kernel.
>
> Maybe what we should do is to make eBPF pass the kernel version it was
> made for (with all the mod version checks). If it doesn't match, fail
> to load it. Perhaps the more eBPF is limited like modules are, the
> better chance we have that no eBPF program creates a new ABI.
it's easy to add a kernel version check and it will be equally
easy for user space to hack it.
imo comments in documentation and samples are good enough.
also not all bpf programs are equal.
bpf+existing tracepoint is not ABI
bpf+new tracepoint is ABI if programs are not using bpf_fetch
bpf+syscall is ABI if programs are not using bpf_fetch
bpf+kprobe is not ABI
bpf+sockets is ABI
At the end we want most of the programs to be written
without assuming anything about kernel internals.
But for live kernel debugging we will write programs
very specific to given kernel layout.
We can categorize the above non-ambiguously via
bpf program type.
Programs with:
BPF_PROG_TYPE_TRACEPOINT - not ABI
BPF_PROG_TYPE_KPROBE - not ABI
BPF_PROG_TYPE_SOCKET_FILTER - ABI
for my proposed 'new tracepoints' we can add type:
BPF_PROG_TYPE_TRACEPOINT_USER - ABI
and disallow calls to bpf_fetch*() for them.
To make it more strict we can do kernel version check
for all prog types that are 'not ABI', but is it really necessary?
To summarize and consolidate other threads:
- I will remove reading of PERF_SAMPLE_RAW in the tracex1 example.
it's really orthogonal to this whole discussion.
- will add more comments throughout that just because
programs can read tracepoint arguments, they shouldn't
make any assumptions that the args stay as-is from version to version
- will work on a patch to demonstrate how a few in-kernel
structures can be user-ized and how programs can access
them in a version-independent way
btw the concept of user-ized data structures already exists
with classic bpf, since 'A = load -0x1000' is translated into
'A = skb->protocol'. I'm thinking of something similar
but more generic and less obscure.
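(For reference, that classic bpf mechanism looks like this with the uapi macros; the filter itself is just an example. A load at the magic negative offset is rewritten by the kernel into an access to skb metadata, so the program stays kernel independent.)
#include <linux/filter.h>
/* A = skb->protocol via the ancillary offset; accept only if non-zero */
static struct sock_filter proto_filter[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_PROTOCOL),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 0),		/* protocol == 0: drop */
	BPF_STMT(BPF_RET | BPF_K, 0xffff),	/* otherwise: accept */
};
Such a filter is attached with setsockopt(SO_ATTACH_FILTER) and works on any kernel because the translation to the real skb layout happens inside the kernel.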
On Tue, 10 Feb 2015 11:53:22 -0800
Alexei Starovoitov <[email protected]> wrote:
> On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt <[email protected]> wrote:
> > On Mon, 9 Feb 2015 22:10:45 -0800
> > Alexei Starovoitov <[email protected]> wrote:
> >
> >> One can argue that current TP_printk format is already an ABI,
> >> because somebody might be parsing the text output.
> >
> > If somebody does, then it is an ABI. Luckily, it's not that useful to
> > parse, thus it hasn't been an issue. As Linus has stated in the past,
> > it's not that we can't change ABI interfaces, its just that we can not
> > change them if there's a user space application that depends on it.
>
> there are already tools that parse trace_pipe:
> https://github.com/brendangregg/perf-tools
Yep, and if this becomes a standard, then any change that makes
trace_pipe different will be reverted.
>
> > and expect some events to have specific fields. Now we can add new
> > fields, or even remove fields that no user space tool is using. This is
> > because today, tools use libtraceevent to parse the event data.
>
> not all tools use libtraceevent.
> gdb calls perf_event_open directly:
> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
> and parses PERF_RECORD_SAMPLE as a binary.
> In this case it's branch records, but I think we never said anywhere
> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
> in this particular order.
What particular order? Note, that's a hardware event, not a software
one.
>
> > This is why I'm nervous about exporting the parameters of the trace
> > event call. Right now, those parameters can always change, because the
> > only way to know they exist is by looking at the code. And currently,
> > there's no way to interact with those parameters. Once we have eBPF in
> > mainline, we now have a way to interact with the parameters and if
> > those parameters change, then the eBPF program will break, and if eBPF
> > can be part of a user space tool, that will break that tool and
> > whatever change in the trace point that caused this breakage would have
> > to be reverted. IOW, this can limit development in the kernel.
>
> it can limit development unless we say that bpf programs
> that attach to tracepoints are not part of ABI.
> Easy enough to add large comment similar to perf_event.h
Again, it doesn't matter what we say. Linus made that very clear. He
stated if you provide an interface, and someone uses that interface for
a user space application, and they complain if it breaks, that is just
reason to revert whatever broke it.
>
> > Al Viro currently does not let any tracepoint in VFS because he doesn't
> > want the internals of that code locked to an ABI. He's right to be
> > worried.
>
> Same with networking bits. We don't want tracepoints to limit
> kernel development, but we want debuggability and kernel
> analytics.
> All existing tracepoints defined via DEFINE_EVENT should
> not be an ABI.
I agree, but that doesn't make it so :-/
> But some maintainers think of them as ABI, whereas others
> are using them freely. imo it's time to remove ambiguity.
I would love to, and have brought this up at Kernel Summit more than
once with no solution out of it.
>
> The idea for new style of tracepoints is the following:
> introduce new macro: DEFINE_EVENT_USER
> and let programs attach to them.
We actually have that today. But it's TRACE_EVENT_FLAGS(), although
that should be cleaned up a bit. Frederic added it to label events that
are safe for perf non root. It seems to be used currently only for
syscalls.
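(For reference, the existing usage looks roughly like this in include/trace/events/syscalls.h, where the raw syscall events are tagged as safe for non-root perf:)
TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
TRACE_EVENT_FLAGS(sys_exit, TRACE_EVENT_FL_CAP_ANY)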
> These tracepoint will receive one or two pointers to important
> structs only. They will not have TP_printk, assign and fields.
> The placement and arguments to these new tracepoints
> will be an ABI.
> All existing tracepoints are not.
TP_printk() is not really an issue.
>
> The main reason to attach to tracepoint is that they are
> accessible without debug info (unlike kprobe)
That is, if you have a special bpf call to access variables, right? How
else do you access part of a data structure?
> Another reason is speed. tracepoints are much faster than
> optimized kprobes and for real-time analytics the speed
> is critical.
>
> The position of new tracepoints and their arguments
> will be an ABI and the programs can be both.
You mean "special tracepoints", ones that do export the arguments?
The question is, how many maintainers will add these, knowing that they
will have to be maintained forever as-is.
> If program is using bpf_fetch*() helpers it obviously
> wants to access internal data structures, so
> it's really nothing more, but 'safe kernel module'
> and kernel layout specific.
> Both old and new tracepoints + programs will be used
> for live kernel debugging.
>
> If program is accessing user-ized data structures then
Technically, the TP_STRUCT__entry is a user-ized structure.
> it is portable and will run on any kernel.
> In uapi header we can define:
> struct task_struct_user {
> int pid;
> int prio;
Here's a perfect example of something that looks stable to show to
user space, but is really a pimple that is hiding cancer.
Let's start with pid. We have namespaces. What pid will be put there?
We have to show the pid of the namespace it is under.
Then we have prio. What is prio in the DEADLINE scheduler? It is rather
meaningless. Also, it is meaningless in SCHED_OTHER.
Also note that even for SCHED_FIFO, the prio is used differently in the
kernel than it is in userspace. For the kernel, lower is higher.
> };
> and let bpf programs access it via real 'struct task_struct *'
> pointer passed into tracepoint.
> bpf loader will translate offsets and sizes used inside
> the program into real task_struct's offsets and loads.
It would need to do more than that. It may have to calculate the value
that it returns, as the internal value may be different with different
kernels.
> (all structs are read-only of course)
> programs will be fast and kernel independent.
> They will be used for analytics (latency, etc)
>
> >> so in some cases we cannot change tracepoints without
> >> somebody complaining that his tool broke.
> >> In other cases tracepoints are used for debugging only
> >> and no one will notice when they change...
> >> It was and still a grey area.
> >
> > Not really. If a tool uses the tracepoint, it can lock that tracepoint
> > down. This is exactly what latencytop did. It happened, it's not a
> > hypothetical situation.
>
> correct.
>
> >> bpf doesn't change any of that.
> >> It actually makes addition of new tracepoints easier.
> >
> > I totally disagree. It adds more ways to see inside the kernel, and if
> > user space depends on this, it adds more ways the kernel can not change.
> >
> > It comes down to how robust eBPF is with the internals of the kernel
> > changing. If we limit eBPF to system call tracepoints only, that's
> > fine because those have the same ABI as the system call itself. I'm
> > worried about the internal tracepoints for scheduling, irqs, file
> > systems, etc.
>
> agree. we need to make it clear that existing tracepoints
> + programs is not ABI.
The question is, how do we do that? Linus pointed out that comments and
documentation are not enough. We need to have an interface that users
would use before they use one that we do not like them to use.
>
> >> In the future we might add a tracepoint and pass a single
> >> pointer to interesting data struct to it. bpf programs will walk
> >> data structures 'as safe modules' via bpf_fetch*() methods
> >> without exposing it as ABI.
> >
> > Will this work if that structure changes? When the field we are looking
> > for no longer exists?
>
> bpf_fetch*() is the same mechanism as perf probe.
> If there is a mistake by user space tools, the program
> will be reading some junk, but it won't be crashing.
But what if the userspace tool depends on that value returning
something meaningful? If it was meaningful in the past, it will have to
be meaningful in the future, even if the internals of the kernel make
it otherwise.
> To be able to debug live kernel we need to see everywhere.
> Same way as systemtap loads kernel modules to walk
> things inside kernel, bpf programs walk pointers with
> bpf_fetch*().
I would agree here too. But again, we really need to be careful about
the interface we expose.
> I'm saying that if program is using bpf_fetch*()
> it wants to see kernel internals and obviously depends
> on particular kernel layout.
Right, but if there's something specific in that kernel layout that it
can decide to do something with, and it breaks otherwise, then we need to
worry about it.
eBPF is very flexible, which means it is bound to have someone use it
in a way you never dreamed of, and that will be what bites you in the
end (pun intended).
>
> >> whereas today we pass a lot of fields to tracepoints and
> >> make all of these fields immutable.
> >
> > The parameters passed to the tracepoint are not shown to userspace and
> > can change at will. Now, we present the final parsing of the parameters
> > that convert to fields. As all currently known tools uses
> > libtraceevent.a, and parse the format files, those fields can move
> > around and even change in size. The structures are not immutable. The
> > fields are locked down if user space relies on them. But they can move
> > about within the tracepoint, because the parsing allows for it.
> >
> > Remember, these are processed fields. The result of TP_fast_assign()
> > and what gets put into the ring buffer. Now what is passed to the
> > actual tracepoint is not visible by userspace, and in lots of cases, it
> > is just a pointer to some structure. What eBPF brings to the table is a
> > way to access this structure from user space. What keeps a structure
> > passed to a tracepoint from becoming immutable if there's an eBPF
> > program that expects it to have a specific field?
>
> agree. that's fair.
> I'm proposing to treat bpf programs that attach to existing
> tracepoints as kernel modules that carry no ABI claims.
Yeah, we may need to find a way to guarantee that.
>
> >> and bpf programs like live debugger that examine things.
> >
> > If bpf programs only dealt with kprobes, I may agree. But tracepoints
> > have already been proven to be a type of ABI. If we open another window
> > into the kernel, this can screw us later. It's better to solve this now
> > than when we are fighting with Linus over user space breakage.
>
> I'm not sure what more is needed other than adding
> large comments to the documentation, man pages and sample
> code saying that bpf+existing tracepoint is not an ABI.
Making it break as soon as a config or kernel version changes ;-)
That may be the only way to guarantee that users do not rely on it.
>
> > What we need is to know if eBPF programs are modules or a user space
> > interface. If they are a user interface then we need to be extremely
> > careful here. If they are treated the same as modules, then it would
> > not add any API. But that hasn't been settled yet, even if we have a
> > comment in the kernel.
> >
> > Maybe what we should do is to make eBPF pass the kernel version it was
> > made for (with all the mod version checks). If it doesn't match, fail
> > to load it. Perhaps the more eBPF is limited like modules are, the
> > better chance we have that no eBPF program creates a new ABI.
>
> it's easy to add kernel version check and it will be equally
> easy for user space to hack it.
Well, if the bpf program says it is built for kernel 3.19, but the
user tool fudges it to say it was built for kernel 3.20, and it then
breaks, then how can they complain about a regression? The tool itself
says it was built for 3.20 but doesn't work. They would need to show
that the same program worked on 3.19, but it won't, because a program
marked for 3.20 won't load there.
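Roughly, as a sketch (the struct and field names below are made up for
illustration, they are not part of this patch set):

#include <linux/version.h>

/* hypothetical load request: the loader records the kernel version the
 * program was built against */
struct prog_load_req {
	unsigned int prog_type;
	unsigned int kern_version;	/* LINUX_VERSION_CODE at build time */
};

static int check_prog_version(const struct prog_load_req *req)
{
	/* refuse to load a program built against a different kernel */
	return req->kern_version == LINUX_VERSION_CODE ? 0 : -1;
}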
> imo comments in documentation and samples is good enough.
Again, your and my opinion do not matter. It comes down to what Linus
thinks. And if we expose an interface that applications decide to use,
and it breaks when a design in the kernel changes, Linus may revert
that change. The author of that change will not be too happy with us.
>
> also not all bpf programs are equal.
> bpf+existing tracepoint is not ABI
Why not?
> bpf+new tracepoint is ABI if programs are not using bpf_fetch
How is this different?
> bpf+syscall is ABI if programs are not using bpf_fetch
Well, this is easy. As syscalls are ABI, and the tracepoints for them
match the ABI, it by default becomes an ABI.
> bpf+kprobe is not ABI
Right.
> bpf+sockets is ABI
Right, because sockets themselves are ABI.
> At the end we want most of the programs to be written
> without assuming anything about kernel internals.
> But for live kernel debugging we will write programs
> very specific to given kernel layout.
And here lies the trick. How do we differentiate applications for
everyday use from debugging tools? There have been times when debugging
tools proved so useful that they became everyday tools.
>
> We can categorize the above unambiguously via
> bpf program type.
> Programs with:
> BPF_PROG_TYPE_TRACEPOINT - not ABI
> BPF_PROG_TYPE_KPROBE - not ABI
> BPF_PROG_TYPE_SOCKET_FILTER - ABI
Again, what enforces this? (hint, it's Linus)
>
> for my proposed 'new tracepoints' we can add type:
> BPF_PROG_TYPE_TRACEPOINT_USER - ABI
> and disallow calls to bpf_fetch*() for them.
> To make it more strict we can do kernel version check
> for all prog types that are 'not ABI', but is it really necessary?
If we have something that makes it difficult for a tool to work from
one kernel to the next, or even with different configs, where that tool
will never become a standard, then that should be good enough to keep
it from dictating user ABI.
To give you an example, we thought about scrambling the trace event
field locations from boot to boot to keep tools from hard coding the
event layout. This may sound crazy, but developers out there are crazy.
And if you want to keep them from abusing interfaces, you just need to
be a bit more crazy than they are.
>
> To summarize and consolidate other threads:
> - I will remove reading of PERF_SAMPLE_RAW in tracex1 example.
> it's really orthogonal to this whole discussion.
Or use libtraceevent ;-) We really need to finish that and package it
up for distros.
> - will add more comments throughout that just because
> programs can read tracepoint arguments, they shouldn't
> make any assumptions that args stay as-is from version to version
We may need to find a way to actually keep it from being as is from
version to version even if the users do not change.
> - will work on a patch to demonstrate how a few in-kernel
> structures can be user-ized and how programs can access
> them in a version-independent way
It will be interesting to see what kernel structures can be user-ized
that are not already used by system calls.
>
> btw the concept of user-ized data structures already exists
> with classic bpf, since 'A = load -0x1000' is translated into
> 'A = skb->protocol'. I'm thinking of something similar
> but more generic and less obscure.
I have to try to wrap my head around understanding the classic bpf, and
how "load -0x1000" translates to "skb->protocol". Is that documented
somewhere?
-- Steve
On Tue, Feb 10, 2015 at 1:53 PM, Steven Rostedt <[email protected]> wrote:
> On Tue, 10 Feb 2015 11:53:22 -0800
> Alexei Starovoitov <[email protected]> wrote:
>
>> On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt <[email protected]> wrote:
>> > On Mon, 9 Feb 2015 22:10:45 -0800
>> > Alexei Starovoitov <[email protected]> wrote:
>> >
>> >> One can argue that current TP_printk format is already an ABI,
>> >> because somebody might be parsing the text output.
>> >
>> > If somebody does, then it is an ABI. Luckily, it's not that useful to
>> > parse, thus it hasn't been an issue. As Linus has stated in the past,
>> > it's not that we can't change ABI interfaces, it's just that we cannot
>> > change them if there's a user space application that depends on it.
>>
>> there are already tools that parse trace_pipe:
>> https://github.com/brendangregg/perf-tools
>
> Yep, and if this becomes a standard, then any change that makes
> trace_pipe different will be reverted.
I think reading of trace_pipe is widespread.
>> > and expect some events to have specific fields. Now we can add new
>> > fields, or even remove fields that no user space tool is using. This is
>> > because today, tools use libtraceevent to parse the event data.
>>
>> not all tools use libtraceevent.
>> gdb calls perf_event_open directly:
>> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
>> and parses PERF_RECORD_SAMPLE as a binary.
>> In this case it's branch records, but I think we never said anywhere
>> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
>> in this particular order.
>
> What particular order? Note, that's a hardware event, not a software
> one.
yes, but gdb assumes that 'u64 ip' precedes 'u64 addr'
when attr.sample_type = IP | ADDR, whereas this is an
internal order of 'if' statements inside perf_output_sample()...
>> But some maintainers think of them as ABI, whereas others
>> are using them freely. imo it's time to remove ambiguity.
>
> I would love to, and have brought this up at Kernel Summit more than
> once with no solution out of it.
let's try it again at plumbers in august?
For now I'm going to drop bpf+tracepoints, since it's so controversial
and go with bpf+syscall and bpf+kprobe only.
Hopefully by august it will be clear what bpf+kprobes can do
and why I'm excited about bpf+tracepoints in the future.
>> The idea for new style of tracepoints is the following:
>> introduce new macro: DEFINE_EVENT_USER
>> and let programs attach to them.
>
> We actually have that today. But it's TRACE_EVENT_FLAGS(), although
> that should be cleaned up a bit. Frederic added it to label events that
> are safe for perf non root. It seems to be used currently only for
> syscalls.
I didn't mean to let unprivileged apps use it.
Better name probably would be: DEFINE_EVENT_BPF
and only let bpf programs use it.
>> These tracepoints will receive one or two pointers to important
>> structs only. They will not have TP_printk, assign and fields.
>> The placement and arguments to these new tracepoints
>> will be an ABI.
>> All existing tracepoints are not.
>
> TP_printk() is not really an issue.
I think it is. The way things are printed is the most
visible part of tracepoints and I suspect maintainers don't
want to add new ones, because internal fields are printed
and users do parse trace_pipe.
Recent discussion about tcp instrumentation was
about adding new tracepoints and a module to print them.
As soon as something like this is in, the next question is
'what are we going to do when more arguments need
to be printed'...
imo the solution is DEFINE_EVENT_BPF that doesn't
print anything and a bpf program to process it.
Then programs decide what they like to pass to
user space and in what format.
The kernel/user ABI is the position of this new tracepoint
and arguments that are passed in.
bpf program takes care of walking pointers, extracting
interesting fields, aggregating and passing them to
user in the format specific to this particular program.
Then when user wants to collect more fields it just
changes the program and corresponding userland side.
>> The position of new tracepoints and their arguments
>> will be an ABI and the programs can be both.
>
> You mean "special tracepoints", ones that do export the arguments?
>
> Question is, how many maintainers will add these, knowing that they
> will have to be forever maintained as is.
Obviously we would need to prove usefulness of
tracepoint_bpf before they're accepted.
Hopefully bpf+kprobe will make an impression on its own
and users would want similar but without debug info.
>> If a program is using bpf_fetch*() helpers it obviously
>> wants to access internal data structures, so
>> it's really nothing more than a 'safe kernel module'
>> and is kernel-layout specific.
>> Both old and new tracepoints + programs will be used
>> for live kernel debugging.
>>
>> If program is accessing user-ized data structures then
>
> Technically, the TP_STRUCT__entry is a user-ized structure.
>
>> it is portable and will run on any kernel.
>> In uapi header we can define:
>> struct task_struct_user {
>> int pid;
>> int prio;
>
> Here's a perfect example of something that looks stable to show to
> user space, but is really a pimple that is hiding cancer.
>
> Let's start with pid. We have name spaces. What pid will be put there?
> We have to show the pid of the name space it is under.
>
> Then we have prio. What is prio in the DEADLINE scheduler? It is rather
> meaningless. Also, it is meaningless in SCHED_OTHER.
>
> Also note that even for SCHED_FIFO, the prio is used differently in the
> kernel than it is in userspace. For the kernel, lower is higher.
well, ->prio and ->pid are already printed by sched tracepoints
and their meaning depends on the scheduler. So users take that
into account.
I'm not suggesting to preserve the meaning of 'pid' semantically
in all cases. That's not what users would want anyway.
I want to allow programs to access important fields and print
them in more generic way than current TP_printk does.
Then exposed ABI of such tracepoint_bpf is smaller than
with current tracepoints.
>> };
>> and let bpf programs access it via real 'struct task_struct *'
>> pointer passed into tracepoint.
>> bpf loader will translate offsets and sizes used inside
>> the program into real task_struct's offsets and loads.
>
> It would need to do more than that. It may have to calculate the value
> that it returns, as the internal value may be different with different
> kernels.
back to 'prio'... the 'prio' accessible from the program
should be the same 'prio' that we're storing inside task_struct.
No extra conversions.
The bpf-script for sched analytics would want to see
the actual value to make sense of it.
> But what if the userspace tool depends on that value returning
> something meaningful. If it was meaningful in the past, it will have to
> be meaningful in the future, even if the internals of the kernel make
> it otherwise.
in some cases... yes. if we make a poor choice of
selecting fields for this 'task_struct_user' then some fields
may disappear from real task_struct, and placeholders
will be left in _user version.
so exposed fields need to be thought through.
In case of networking some of these choices were
already made by classic bpf. Fields:
protocol, pkttype, ifindex, hatype, mark, hash, queue_mapping
have to be reference-able from 'skb' pointer.
They can move from skb to some other struct, change
their sizes or location within sk_buff, etc.
They can even disappear, but the user-visible
struct sk_buff_user will still contain the above.
> eBPF is very flexible, which means it is bound to have someone use it
> in a way you never dreamed of, and that will be what bites you in the
> end (pun intended).
understood :)
let's start slow then with bpf+syscall and bpf+kprobe only.
>> also not all bpf programs are equal.
>> bpf+existing tracepoint is not ABI
>
> Why not?
well, because we want to see more tracepoints in the kernel.
We're already struggling to add more.
>> bpf+new tracepoint is ABI if programs are not using bpf_fetch
>
> How is this different?
the new ones will be explicit by definition.
>> bpf+syscall is ABI if programs are not using bpf_fetch
>
> Well, this is easy. As syscalls are ABI, and the tracepoints for them
> match the ABI, it by default becomes an ABI.
>
>> bpf+kprobe is not ABI
>
> Right.
>
>> bpf+sockets is ABI
>
> Right, because sockets themselves are ABI.
>
>> At the end we want most of the programs to be written
>> without assuming anything about kernel internals.
>> But for live kernel debugging we will write programs
>> very specific to given kernel layout.
>
> And here lies the trick. How do we differentiate applications for
> everyday use from debugging tools? There's been times when debugging
> tools have shown themselves as being so useful they become everyday
> use tools.
>
>>
>> We can categorize the above unambiguously via
>> bpf program type.
>> Programs with:
>> BPF_PROG_TYPE_TRACEPOINT - not ABI
>> BPF_PROG_TYPE_KPROBE - not ABI
>> BPF_PROG_TYPE_SOCKET_FILTER - ABI
>
> Again, what enforces this? (hint, it's Linus)
>
>
>>
>> for my proposed 'new tracepoints' we can add type:
>> BPF_PROG_TYPE_TRACEPOINT_USER - ABI
>> and disallow calls to bpf_fetch*() for them.
>> To make it more strict we can do kernel version check
>> for all prog types that are 'not ABI', but is it really necessary?
>
> If we have something that makes it difficult for a tool to work from
> one kernel to the next, or even with different configs, where that tool
> will never become a standard, then that should be good enough to keep
> it from dictating user ABI.
>
> To give you an example, we thought about scrambling the trace event
> field locations from boot to boot to keep tools from hard coding the
> event layout. This may sound crazy, but developers out there are crazy.
> And if you want to keep them from abusing interfaces, you just need to
> be a bit more crazy than they are.
that is indeed crazy. the point is understood.
right now I cannot think of a solid way to prevent abuse
of bpf+tracepoint, so I'm just going to drop it for now.
Cool things can be done with bpf+kprobe/syscall already.
>> To summarize and consolidate other threads:
>> - I will remove reading of PERF_SAMPLE_RAW in tracex1 example.
>> it's really orthogonal to this whole discussion.
>
> Or use libtraceevent ;-) We really need to finish that and package it
> up for distros.
sure. some sophisticated example can use it too.
>> - will add more comments throughout that just because
>> programs can read tracepoint arguments, they shouldn't
>> make any assumptions that args stay as-is from version to version
>
> We may need to find a way to actually keep it from being as is from
> version to version even if the users do not change.
>
>> - will work on a patch to demonstrate how a few in-kernel
>> structures can be user-ized and how programs can access
>> them in a version-independent way
>
> It will be interesting to see what kernel structures can be user-ized
> that are not already used by system calls.
well, for networking, the few fields I mentioned above would
be enough for most of the programs.
>> btw the concept of user-ized data structures already exists
>> with classic bpf, since 'A = load -0x1000' is translated into
>> 'A = skb->protocol'. I'm thinking of something similar
>> but more generic and less obscure.
>
> I have to try to wrap my head around understanding the classic bpf, and
> how "load -0x1000" translates to "skb->protocol". Is that documented
> somewhere?
well, the magic constants are in uapi/linux/filter.h
load -0x1000 = skb->protocol
load -0x1000 + 4 = skb->pkt_type
load -0x1000 + 8 = skb->dev->ifindex
for eBPF I want to clean it up in a way that user program
will see:
struct sk_buff_user {
int protocol;
int pkt_type;
int ifindex;
...
};
and C code of bpf program will look normal: skb->ifindex.
bpf loader will do verification and translation of
'load skb_ptr+12' into sequence of loads with correct
internal offsets.
So struct sk_buff_user is only an interface. It won't exist
in such layout in memory.
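To make that concrete, here is a minimal classic BPF sketch using those
ancillary loads (illustrative only, not part of this patch set; the
filter just keeps IPv4 packets):

#include <linux/filter.h>
#include <linux/if_ether.h>

struct sock_filter ipv4_only[] = {
	/* A = skb->protocol, i.e. a load at SKF_AD_OFF + SKF_AD_PROTOCOL (-0x1000 + 0) */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_PROTOCOL),
	/* accept the packet if protocol == ETH_P_IP, otherwise drop it */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 0xffff),
	BPF_STMT(BPF_RET | BPF_K, 0),
};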
On Tue, 10 Feb 2015 16:22:50 -0800
Alexei Starovoitov <[email protected]> wrote:
> > Yep, and if this becomes a standard, then any change that makes
> > trace_pipe different will be reverted.
>
> I think reading of trace_pipe is widespread.
I've heard of a few, but nothing that has broken when something changed.
Is it scripts or actual C code?
Again, it matters if people complain about the change.
> >> But some maintainers think of them as ABI, whereas others
> >> are using them freely. imo it's time to remove ambiguity.
> >
> > I would love to, and have brought this up at Kernel Summit more than
> > once with no solution out of it.
>
> let's try it again at plumbers in august?
Well, we need a statement from Linus. And it would be nice if we could
also get Ingo involved in the discussion, but he seldom comes to
anything but Kernel Summit.
>
> For now I'm going to drop bpf+tracepoints, since it's so controversial
> and go with bpf+syscall and bpf+kprobe only.
Probably the safest bet.
>
> Hopefully by august it will be clear what bpf+kprobes can do
> and why I'm excited about bpf+tracepoints in the future.
BTW, I wonder if I could make a simple compiler in the kernel that
would translate the current ftrace filters into a BPF program, where it
could use the program and not use the current filter logic.
> >> These tracepoint will receive one or two pointers to important
> >> structs only. They will not have TP_printk, assign and fields.
> >> The placement and arguments to these new tracepoints
> >> will be an ABI.
> >> All existing tracepoints are not.
> >
> > TP_printk() is not really an issue.
>
> I think it is. The way things are printed is the most
> visible part of tracepoints and I suspect maintainers don't
> want to add new ones, because internal fields are printed
> and users do parse trace_pipe.
> Recent discussion about tcp instrumentation was
> about adding new tracepoints and a module to print them.
> As soon as something like this is in, the next question is
> 'what are we going to do when more arguments need
> to be printed'...
I should rephrase that. It's not that it's not an issue, it's just that
it hasn't been an issue. The trace_pipe code is slow. The
raw_trace_pipe is much faster. Any tool would benefit from using it.
I really need to get a library out to help users do such a thing.
>
> imo the solution is DEFINE_EVENT_BPF that doesn't
> print anything and a bpf program to process it.
You mean to be completely invisible to ftrace? And the debugfs/tracefs
directory?
> >
> >> it is portable and will run on any kernel.
> >> In uapi header we can define:
> >> struct task_struct_user {
> >> int pid;
> >> int prio;
> >
> > Here's a perfect example of something that looks stable to show to
> > user space, but is really a pimple that is hiding cancer.
> >
> > Lets start with pid. We have name spaces. What pid will be put there?
> > We have to show the pid of the name space it is under.
> >
> > Then we have prio. What is prio in the DEADLINE scheduler. It is rather
> > meaningless. Also, it is meaningless in SCHED_OTHER.
> >
> > Also note that even for SCHED_FIFO, the prio is used differently in the
> > kernel than it is in userspace. For the kernel, lower is higher.
>
> well, ->prio and ->pid are already printed by sched tracepoints
> and their meaning depends on the scheduler. So users take that
> into account.
I know, and Peter hates this.
> I'm not suggesting to preserve the meaning of 'pid' semantically
> in all cases. That's not what users would want anyway.
> I want to allow programs to access important fields and print
> them in more generic way than current TP_printk does.
> Then exposed ABI of such tracepoint_bpf is smaller than
> with current tracepoints.
Again, this would mean they become invisible to ftrace, and even
ftrace_dump_on_oops.
I'm not fully understanding what is to be exported by this new ABI. If
the fields that are available will always be available, then why can't
they appear in a TP_printk()?
> > eBPF is very flexible, which means it is bound to have someone use it
> > in a way you never dreamed of, and that will be what bites you in the
> > end (pun intended).
>
> understood :)
> let's start slow then with bpf+syscall and bpf+kprobe only.
I'm fine with that.
>
> >> also not all bpf programs are equal.
> >> bpf+existing tracepoint is not ABI
> >
> > Why not?
>
> well, because we want to see more tracepoints in the kernel.
> We're already struggling to add more.
Still, the question is, even with a new "tracepoint" that limits what
it shows, there still needs to be something that is guaranteed to
always be there. I still don't see how this is different than the
current tracepoints.
>
> >> bpf+new tracepoint is ABI if programs are not using bpf_fetch
> >
> > How is this different?
>
> the new ones will be explicit by definition.
Who monitors this?
> > To give you an example, we thought about scrambling the trace event
> > field locations from boot to boot to keep tools from hard coding the
> > event layout. This may sound crazy, but developers out there are crazy.
> > And if you want to keep them from abusing interfaces, you just need to
> > be a bit more crazy than they are.
>
> that is indeed crazy. the point is understood.
>
> right now I cannot think of a solid way to prevent abuse
> of bpf+tracepoint, so just going to drop it for now.
Welcome to our world ;-)
> Cool things can be done with bpf+kprobe/syscall already.
True.
-- Steve
On Tue, Feb 10, 2015 at 4:50 PM, Steven Rostedt <[email protected]> wrote:
>
>> >> But some maintainers think of them as ABI, whereas others
>> >> are using them freely. imo it's time to remove ambiguity.
>> >
>> > I would love to, and have brought this up at Kernel Summit more than
>> > once with no solution out of it.
>>
>> let's try it again at plumbers in august?
>
> Well, we need a statement from Linus. And it would be nice if we could
> also get Ingo involved in the discussion, but he seldom comes to
> anything but Kernel Summit.
+1
> BTW, I wonder if I could make a simple compiler in the kernel that
> would translate the current ftrace filters into a BPF program, where it
> could use the program and not use the current filter logic.
yep. I sent that patch last year.
It converted the pred_tree into a bpf program.
I can try to dig it up. It doesn't provide extra programmability
though, just makes the filtering logic much faster.
>> imo the solution is DEFINE_EVENT_BPF that doesn't
>> print anything and a bpf program to process it.
>
> You mean to be completely invisible to ftrace? And the debugfs/tracefs
> directory?
I mean it will be seen in tracefs to get 'id', but without enable/format/filter
>> I'm not suggesting to preserve the meaning of 'pid' semantically
>> in all cases. That's not what users would want anyway.
>> I want to allow programs to access important fields and print
>> them in more generic way than current TP_printk does.
>> Then exposed ABI of such tracepoint_bpf is smaller than
>> with current tracepoints.
>
> Again, this would mean they become invisible to ftrace, and even
> ftrace_dump_on_oops.
yes, since these new tracepoints have no meat inside them.
They're placeholders sitting idle and waiting for bpf to do
something useful with them.
> I'm not fully understanding what is to be exported by this new ABI. If
> the fields that are available will always be available, then why can't
> they appear in a TP_printk()?
say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
It will have only one argument 'skb'.
bpf program will read and print skb fields the way it likes
for particular tracing scenario.
So instead of making
TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
gso_type=%#x",...
the ABI exposed via trace_pipe (as it is today),
the new tracepoint_bpf ABI is just the presence of the 'skb' pointer
as the one and only argument to the bpf program.
Future refactoring of netif_rx would need to guarantee
that trace_netif_rx_entry(skb) is called. that's it.
imo such tracepoints are much easier to deal with during
code changes.
Maybe some of the existing tracepoints like this one that
take one argument can be marked 'bpf-ready', so that
programs can attach only to them.
>> let's start slow then with bpf+syscall and bpf+kprobe only.
>
> I'm fine with that.
thanks. will wait for merge window to close and will repost.
On Tue, 10 Feb 2015 19:04:55 -0800
Alexei Starovoitov <[email protected]> wrote:
> > You mean to be completely invisible to ftrace? And the debugfs/tracefs
> > directory?
>
> I mean it will be seen in tracefs to get 'id', but without enable/format/filter
In other words, invisible to ftrace. I'm not sure I'll be on board to
work with that.
> > Again, this would mean they become invisible to ftrace, and even
> > ftrace_dump_on_oops.
>
> yes, since these new tracepoints have no meat inside them.
> They're placeholders sitting idle and waiting for bpf to do
> something useful with them.
Hmm, I have a patch somewhere (never posted it) that adds
TRACE_MARKER(), which basically would just print that it was hit. But
no data was passed to it. Something like that I would be more inclined
to take. Then the question is, what can bpf access there? Could just be
a place holder to add a "fast kprobe". This way, it can show up in
trace events with enable and all that, it just won't have any data
to print out besides the normal pid, flags, etc.
But because it will inject a nop, that could be converted to a jump, it
will give you the power of kprobes but with the speed of a tracepoint.
>
> > I'm not fully understanding what is to be exported by this new ABI. If
> > the fields that are available will always be available, then why can't
> > they appear in a TP_printk()?
>
> say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
> It will have only one argument 'skb'.
> bpf program will read and print skb fields the way it likes
> for particular tracing scenario.
> So instead of making
> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
> gso_type=%#x",...
> the abi exposed via trace_pipe (as it is today),
> the new tracepoint_bpf abi is presence of 'skb' pointer as one
> and only argument to bpf program.
> Future refactoring of netif_rx would need to guarantee
> that trace_netif_rx_entry(skb) is called. that's it.
> imo such tracepoints are much easier to deal with during
> code changes.
But what can you access from that skb that is guaranteed to be there?
If you say anything, then there's no reason it can't be added to the
printk as well.
>
> May be some of the existing tracepoints like this one that
> takes one argument can be marked 'bpf-ready', so that
> programs can attach to them only.
I really hate the idea of adding tracepoints that ftrace can't use. It
basically kills the entire busy box usage scenario, as boards that have
extremely limited userspace still make full use of ftrace via the
existing tracepoints.
I still don't see the argument that adding data via the bpf functions
is any different than adding those same items to fields in an event.
Once you add a bpf function, then you must maintain those fields.
Look, you can always add more to a TP_printk(), as that is standard
with all text file kernel parsing. Like stat in /proc. Fields can not
be removed, but more can always be appended to the end.
Any tool that parses trace_pipe is broken if it can't handle extended
fields. The api can be extended, and for text files, that is by
appending to them.
-- Steve
On Tue, Feb 10, 2015 at 8:31 PM, Steven Rostedt <[email protected]> wrote:
>> > Again, this would mean they become invisible to ftrace, and even
>> > ftrace_dump_on_oops.
>>
>> yes, since these new tracepoints have no meat inside them.
>> They're placeholders sitting idle and waiting for bpf to do
>> something useful with them.
>
> Hmm, I have a patch somewhere (never posted it) that adds
> TRACE_MARKER(), which basically would just print that it was hit. But
> no data was passed to it. Something like that I would be more inclined
> to take. Then the question is, what can bpf access there? Could just be
> a place holder to add a "fast kprobe". This way, it can show up in
> trace events with enable and all that, it just won't have any data
> to print out besides the normal pid, flags, etc.
>
> But because it will inject a nop, that could be converted to a jump, it
> will give you the power of kprobes but with the speed of a tracepoint.
fair enough.
Something like TRACE_MARKER(arg1, arg2) that prints
it was hit without accessing the args would be enough.
Without any args it is indeed a 'fast kprobe' only.
Debug info would still be needed to access
function arguments.
On x64, the function entry point and the x64 ABI make it easy
to access args, but i386, or a kprobe in the middle of a function,
loses visibility when debug info is not available.
TRACE_MARKER (with a few key args that the function
is operating on) is enough to achieve roughly the same
as a kprobe without debug info.
>> > I'm not fully understanding what is to be exported by this new ABI. If
>> > the fields that are available will always be available, then why can't
>> > they appear in a TP_printk()?
>>
>> say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
>> It will have only one argument 'skb'.
>> bpf program will read and print skb fields the way it likes
>> for particular tracing scenario.
>> So instead of making
>> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
>> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
>> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
>> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
>> gso_type=%#x",...
>> the abi exposed via trace_pipe (as it is today),
>> the new tracepoint_bpf abi is presence of 'skb' pointer as one
>> and only argument to bpf program.
>> Future refactoring of netif_rx would need to guarantee
>> that trace_netif_rx_entry(skb) is called. that's it.
>> imo such tracepoints are much easier to deal with during
>> code changes.
>
> But what can you access from that skb that is guaranteed to be there?
> If you say anything, then there's no reason it can't be added to the
> printk as well.
programs can access any field via bpf_fetch*() helpers, which
makes them kernel-layout dependent, or via a user-ized sk_buff
with a few fields, which is portable.
In both cases the kernel/user ABI is only the 'skb' pointer.
Whether it's a debugging program that needs full access
via fetch* helpers or a portable program that uses the stable API
is up to the program author.
Just like with kprobes, it's clear that if a program is using
fetch* helpers it's doing it without any ABI guarantees.
'perf probe' and bpf with fetch* helpers are the same:
perf probe creates wrappers on top of probe_kernel_read and
bpf_fetch* helpers are wrappers on top of probe_kernel_read.
Complaints that 'my kprobe with flags=%cx mode=+4($stack)
stopped working in the new kernel' are equivalent to complaints
that a program with bpf_fetch* stopped working.
Whereas if a program is using user-ized structs it will work
across kernel versions, though it will be able to see
only a very limited slice of in-kernel data.
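A rough sketch of the two styles (everything here -- the section name,
the bpf_fetch_u32(addr) signature, the 0x68 offset and the sk_buff_user
layout -- is made up for illustration, not exactly what this patch set
defines):

SEC("events/net/netif_rx_entry")
int bpf_prog_skb(struct bpf_context *ctx)
{
	/* kernel-layout dependent: raw offset of skb->len on this particular
	 * build, read via probe_kernel_read() -- no ABI guarantee */
	u64 len = bpf_fetch_u32((void *)ctx->arg1 + 0x68);

	/* portable: the loader would rewrite sk_buff_user offsets into the
	 * real in-kernel ones, as described above */
	struct sk_buff_user *skb = (struct sk_buff_user *)ctx->arg1;

	return len != 0 && skb->protocol == 0x0800;
}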
>> May be some of the existing tracepoints like this one that
>> takes one argument can be marked 'bpf-ready', so that
>> programs can attach to them only.
>
> I really hate the idea of adding tracepoints that ftrace can't use. It
> basically kills the entire busy box usage scenario, as boards that have
> extremely limited userspace still make full use of ftrace via the
> existing tracepoints.
agree. I think your trace_marker with few args is a good
middle ground.
> I still don't see the argument that adding data via the bpf functions
> is any different than adding those same items to fields in an event.
> Once you add a bpf function, then you must maintain those fields.
>
> Look, you can always add more to a TP_printk(), as that is standard
> with all text file kernel parsing. Like stat in /proc. Fields can not
> be removed, but more can always be appended to the end.
>
> Any tool that parses trace_pipe is broken if it can't handle extended
> fields. The api can be extended, and for text files, that is by
> appending to them.
I agree that any text parsing script should be able to cope
with additional args without problems. I think it's the fear of
<1% breakage that is causing maintainers to avoid any changes
to tracepoints, even when they just add a few args to the end
of TP_printk.
When tracepoints stop printing and the only thing they expose
is a single pointer to a well-known struct like sk_buff,
this fear of tracepoints should fade.
Programs are not part of the kernel, so whatever they do
and print is not our headache. We only make sure that the
interface between the kernel and programs is stable.
In other words, kernel ABI is what the kernel exposes to
user space and to bpf programs. Though programs
are run inside the kernel, what they do is outside of the
kernel ABI. So when a program prints fields it is not our
problem, whereas when a tracepoint prints fields it is
kernel ABI.
ps I'll be traveling for the next few weeks, so
I apologize in advance for slow responses.
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> >> not all tools use libtraceevent.
> >> gdb calls perf_event_open directly:
> >> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
> >> and parses PERF_RECORD_SAMPLE as a binary.
> >> In this case it's branch records, but I think we never said anywhere
> >> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
> >> in this particular order.
> >
> > What particular order? Note, that's a hardware event, not a software
> > one.
>
> yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr'
> when attr.sample_type = IP | ADDR whereas this is an
> internal order of 'if' statements inside perf_output_sample()...
This is indeed promised in the data layout description in
include/uapi/linux/perf_event.h.
There is no other way to find where fields are; perf relies on
predetermined order of fields coupled with the requested field bitmask.
So we promise the order: id, ip, pid, tid, time, addr,.. etc.
So if you request IP and ADDR but none of the other fields, then you
know your sample will start with IP and then contain ADDR.
The traceevent thing has a debug/trace-fs format description of fields
that is supposed to be used.
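In struct terms, for the example above (an illustrative sketch, not
copied from the header):

#include <linux/perf_event.h>

/* With attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR and no other
 * sample bits set, the PERF_RECORD_SAMPLE body contains exactly these two
 * fields, in this documented order. */
struct sample_ip_addr {
	struct perf_event_header header;	/* .type == PERF_RECORD_SAMPLE */
	__u64 ip;				/* PERF_SAMPLE_IP */
	__u64 addr;				/* PERF_SAMPLE_ADDR */
};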
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> well, ->prio and ->pid are already printed by sched tracepoints
> and their meaning depends on the scheduler. So users take that
> into account.
Right, so trace_events were/are root only, and root 'should' be in the
root pid namespace, and therefore pid magically works.
And I'm not sure, but I don't think the 'nested' root available from
containers should have access to ftrace, so that should not be an issue.
Perf tries real hard to present PERF_SAMPLE_PID data in the pid
namespace of the task that created the event.
As to prio; yes this is a prime example of suck, I would love to change
that but cannot :-(. The only solace I have is that everybody who is
relying on it is broken.
There is a very good reason I'm against adding more tracepoints to the
scheduler, its a nightmare.
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> > It would need to do more than that. It may have to calculate the value
> > that it returns, as the internal value may be different with different
> > kernels.
>
> back to 'prio'... the 'prio' accessible from the program
> should be the same 'prio' that we're storing inside task_struct.
It's not; task_struct::prio is an entirely different value than the one
used in sched_param::sched_priority / sched_attr::sched_priority.
And the 'problem' is, prio is only relevant to SCHED_RR/SCHED_FIFO
tasks, we have more classes.
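To make the mismatch concrete, this is roughly how the internal value is
derived for RT tasks today (a sketch only; nothing a tracer should treat
as stable):

#define MAX_RT_PRIO	100	/* mirrors the kernel's constant */

/* userspace SCHED_FIFO/SCHED_RR priority 1..99 maps to an internal
 * task_struct::prio of 98..0 -- in the kernel, lower means higher */
static inline int kernel_prio_from_rt_priority(int sched_priority)
{
	return MAX_RT_PRIO - 1 - sched_priority;
}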
> No extra conversions.
We're not going to add runtime/space overhead to the kernel just because
someone might maybe someday trace the kernel.
That leaves the option of either tracing the kernel internal value and
userspace will just have to deal with it, or making the tracepoint more
expensive by having it do the conversion.
Now the big question is, what do we do when we add/extend a scheduling
class and have more parameters? We cannot change the tracepoint because
userspace assumes format. And I simply refuse to add a second -- because
that will end up being a third and fourth etc.. -- tracepoint right next
to it with a different layout.
Note that we just did add a class, we grew SCHED_DEADLINE a few releases
ago, and that has 3 parameters (or 6 depending on how you look at it).
You currently cannot 'debug' that with the existing tracepoints.
In short, I loathe traceevents, they're more trouble than they're worth.
Now I do love the infrastructure, it's very useful for debugging, but that's
all with local hacks that will never see the light of day.
On Tue, Feb 10, 2015 at 04:53:59PM -0500, Steven Rostedt wrote:
> > >> In the future we might add a tracepoint and pass a single
> > >> pointer to interesting data struct to it. bpf programs will walk
> > >> data structures 'as safe modules' via bpf_fetch*() methods
> > >> without exposing it as ABI.
> > >
> > > Will this work if that structure changes? When the field we are looking
> > > for no longer exists?
> >
> > bpf_fetch*() is the same mechanism as perf probe.
> > If there is a mistake by user space tools, the program
> > will be reading some junk, but it won't be crashing.
>
> But what if the userspace tool depends on that value returning
> something meaningful. If it was meaningful in the past, it will have to
> be meaningful in the future, even if the internals of the kernel make
> it otherwise.
We're compiling the BPF stuff against the 'current' kernel headers
right? So would enforcing module versioning not be sufficient?
We already break out-of-tree modules without a second thought, the
module interface is not guaranteed. They just need to cope with it.
Anything using the kernel headers to look into the kernel guts should be
bound to the same rules.
So if we think of BPF stuff as out-of-tree modules, and treat them the
same, I see no problem.
I'm sure some BPF 'scripts' will turn into the same right mess that
out-of-tree modules are, with endless #ifdef version checks, but hey,
not my problem ;-)
On Tue, 10 Feb 2015 22:33:05 -0800
Alexei Starovoitov <[email protected]> wrote:
> fair enough.
> Something like TRACE_MARKER(arg1, arg2) that prints
> it was hit without accessing the args would be enough.
> Without any args it is indeed a 'fast kprobe' only.
> Debug info would still be needed to access
> function arguments.
> On x64 function entry point and x64 abi make it easy
> to access args, but i386 or kprobe in the middle
> lose visibility when debug info is not available.
> TRACE_MARKER (with few key args that function
> is operating on) is enough to achieve roughly the same
> as kprobe without debug info.
Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and
possibly a full trace event layout.
The difference would be that the trace events do not show up unless you
have "trace_debug" on the command line. This should prevent
applications from depending on them.
I could even do the nasty dmesg output like I do with trace_printk()s,
that would definitely keep a production kernel from adding it by
default.
When trace_debug is not there, the trace points could still be accessed
but perhaps only via bpf, or act like a simple trace marker.
Note, if you need ids, I'd rather have them in another directory than
tracefs. Make an eventfs perhaps that holds these. I'd rather keep tracefs
simple.
This is something that probably needs to be discussed a bit.
-- Steve
>> eBPF is very flexible, which means it is bound to have someone use it
>> in a way you never dreamed of, and that will be what bites you in the
>> end (pun intended).
> understood :)
> let's start slow then with bpf+syscall and bpf+kprobe only.
I think BPF + system calls/kprobes can meet our use case
(https://lkml.org/lkml/2015/2/6/44), but there're some issues to be
improved.
I suggest that you improve bpf+kprobes when attached to function
headers (or TRACE_MARKERs): make it convert pt_regs to bpf_ctx->arg1,
arg2, ..., so that top models and architectures can be separated by bpf.
BPF bytecode is cross-platform, but what we can get by using bpf+kprobes
is 'regs->rdx' kind of information; such information is both
architecture and kernel version dependent.
We hope to establish some models for describing kernel procedures such
as IO and network, which requires that they do not rely on the architecture
and do not rely on a specific kernel version as much as possible.
On Wed, Feb 11, 2015 at 5:28 AM, Peter Zijlstra <[email protected]> wrote:
>
> We're compiling the BPF stuff against the 'current' kernel headers
> right?
the tracex1 example is pulling kernel headers to demonstrate
how bpf_fetch*() helpers can be used to walk kernel structures
without debug info.
The other examples don't need any internal headers.
> So would enforcing module versioning not be sufficient?
I'm going to redo the ex1 to use kprobe and some form of
version check. Indeed module-like versioning should
be enough.
On Wed, Feb 11, 2015 at 7:51 AM, Steven Rostedt <[email protected]> wrote:
> On Tue, 10 Feb 2015 22:33:05 -0800
> Alexei Starovoitov <[email protected]> wrote:
>
>
>> fair enough.
>> Something like TRACE_MARKER(arg1, arg2) that prints
>> it was hit without accessing the args would be enough.
>> Without any args it is indeed a 'fast kprobe' only.
>> Debug info would still be needed to access
>> function arguments.
>> On x64 function entry point and x64 abi make it easy
>> to access args, but i386 or kprobe in the middle
>> lose visibility when debug info is not available.
>> TRACE_MARKER (with few key args that function
>> is operating on) is enough to achieve roughly the same
>> as kprobe without debug info.
>
> Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and
> possibly a full trace event layout.
>
> The difference would be that the trace events do not show up unless you
> have "trace_debug" on the command line. This should prevent
> applications from depending on them.
>
> I could even do the nasty dmesg output like I do with trace_printk()s,
> that would definitely keep a production kernel from adding it by
> default.
>
> When trace_debug is not there, the trace points could still be accessed
> but perhaps only via bpf, or act like a simple trace marker.
I think that is a great idea!
Makes it clear that all prints are for debugging and
no abi guarantees.
> Note, if you need ids, I rather have them in another directory than
> tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs
> simple.
indeed. makes sense. no reason to burn fs memory just
to get an id from a name. Maybe the perf_event API can be
extended to look up an id from a name. I think perf will benefit as well.
On Wed, Feb 11, 2015 at 11:58 PM, Hekuang <[email protected]> wrote:
>
>>> eBPF is very flexible, which means it is bound to have someone use it
>>> in a way you never dreamed of, and that will be what bites you in the
>>> end (pun intended).
>>
>> understood :)
>> let's start slow then with bpf+syscall and bpf+kprobe only.
>
>
> I think BPF + system calls/kprobes can meet our use case
> (https://lkml.org/lkml/2015/2/6/44), but there're some issues to be
> improved.
>
> I suggest that you can improve bpf+kprobes when attached to function
> headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx->arg1,
> arg2.., then top models and architectures can be separated by bpf.
>
> BPF bytecode is cross-platform, but what we can get by using bpf+kprobes
> is a 'regs->rdx' kind of information, such information is both
> architecture and kernel version related.
for kprobes in the middle of a function, the kernel cannot
convert pt_regs into argN. Placement was decided by the compiler
and can only be found in debug info.
I think bpf+kprobe will be using it when it is available.
When there is no debug info, kprobes will be limited
to function entry, and the mapping of regs/stack into
argN can be done by user space depending on the architecture.
So user tracing scripts in some higher level language
can be kernel/arch independent when 'perf probe+bpf'
is loading them on the fly on the given machine.
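For example, on x86-64 the entry-time mapping is just the SysV calling
convention; a sketch of the table user space would apply (other
architectures need their own):

/* x86-64 SysV: the first six integer args at function entry, named by
 * their pt_regs fields */
static const char * const x86_64_arg_regs[6] = {
	"di", "si", "dx", "cx", "r8", "r9",	/* arg1 .. arg6 */
};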
> We hope to establish some models for describing kernel procedures such
> as IO and network, which requires that it does not rely on architecture
> and does not rely to a specific kernel version as much as possible.
That's obviously a goal, but it requires a new approach to tracepoints.
I think a lot of great ideas were discussed in this thread, so I'm
hopeful that we'll come up with a solution that will satisfy even
Peter's strictest requirements :)
Hi, Alexei
Another suggestion on the bpf syscall interface. Currently, BPF +
syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernels used on
commercial products, CONFIG_BPF_SYSCALL is probably disabled; in this
case, bpf bytecode cannot be loaded into the kernel.
If we turn the functionality of BPF_SYSCALL into a loadable module, then
we can use it without any dependencies on the kernel. What about changing
the bpf syscall to a /dev node or /sys file which can be exported by a
kernel module?
On Mon, Feb 16, 2015 at 6:26 AM, He Kuang <[email protected]> wrote:
> Hi, Alexei
>
> Another suggestion on bpf syscall interface. Currently, BPF +
> syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on
> commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this
> case, bpf bytecode cannot be loaded to the kernel.
I'm seeing a flurry of use cases for bpf in ovs, tc, tracing, etc.
When it's all ready, we can turn that config on by default.
> If we turn the functionality of BPF_SYSCALL into a loadable module, then
> we can use it without any dependencies on the kernel. What about change
> bpf syscall to a /dev node or /sys file which can be exported by a
> kernel module?
I don't think we will allow extending bpf by modules.
'bpf in modules' is an interface that is too easy to abuse.
So all of bpf core, helper functions and program types will be builtin.
As far as bpf+tracing goes, the plan is to do bpf+kprobe and bpf+syscalls
first, then add the right set of helpers to make sure that use cases
like 'tcp stack instrumentation' are fully addressed.
There were also a few great ideas about accelerating kprobes
with trace markers and debug tracepoints that we can do later.