2015-07-01 02:59:08

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH 0/5] Make eBPF programs output data to perf event

The idea of letting eBPF output data to the perf sample event was first
mentioned in "[RFC PATCH v4 10/29] bpf tools: Collect map definitions
from 'maps' section"; the output data is not limited to PMU counters,
but can be anything users want to record, such as time latencies or
cache misses.

This patch series adds an extra perf trace buffer which other utilities
like bpf can use to attach extra data to perf events. A recursion flag,
first introduced by commit 444a2a3bcd6d ("tracing, perf_events:
Protect the buffer from recursion in perf"), must be taken before
saving data to the buffer, to protect the percpu data buffer from being
overwritten.

A bpf helper function for outputting data to perf events is added. The
data is written to the perf trace extra buffer, then collected and
appended to the end of the original perf event data. The new data does
not change the original format, and the generated event can still be
processed by perf script.

The first patch in this series was posted last week; Masami acked it,
but Alexei said there may be a reason to call bpf_prog at the head of
the function, so it may need more discussion.

Here is a sample which records the time interval between invocations of
a probed function in the perf sample event; the bpf program is like this:

SEC("generic_perform_write=generic_perform_write")
int NODE_generic_perform_write(struct pt_regs *ctx)
{
char fmt[] = "generic_perform_write, cur=0x%llx, del=0x%llx\n";
u64 cur_time, del_time;
int ind =0;
struct time_table output, *last = bpf_map_lookup_elem(&global_time_table, &ind);
if (!last)
return 0;

cur_time = bpf_ktime_get_ns();
if (!last->last_time)
del_time = 0;
else
del_time = cur_time - last->last_time;

/* For debug */
bpf_trace_printk(fmt, sizeof(fmt), cur_time, del_time);

/* Update time table */
output.last_time = cur_time;
bpf_map_update_elem(&global_time_table, &ind, &output, BPF_ANY);

/* This is a casual condition to show the funciton */
if (del_time < 1000)
return 0;

bpf_output_sample(&del_time, sizeof(del_time));
return 1;
}
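
The 'struct time_table' type and the 'global_time_table' map used above
are not shown in the posting; a minimal sketch of what they could look
like, following the samples/bpf conventions (the names, sizes and map
type here are assumptions for illustration only):

struct time_table {
	u64 last_time;
};

struct bpf_map_def SEC("maps") global_time_table = {
	.type		= BPF_MAP_TYPE_ARRAY,
	.key_size	= sizeof(int),
	.value_size	= sizeof(struct time_table),
	.max_entries	= 1,
};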

Debug output in /sys/kernel/debug/tracing/trace:

dd-1018 [000] d... 48512.533331: : generic_perform_write cur=0x2c15cc83436e, del=0x0
dd-1018 [000] d... 48512.534032: : generic_perform_write cur=0x2c15cc8e3c35, del=0xaf8c7
dd-1018 [000] d... 48512.534528: : generic_perform_write cur=0x2c15cc95ec0f, del=0x7afda

The raw data record in perf.data:
$ perf report -D

. ... raw event: size 80 bytes
. 0000: 09 00 00 00 01 00 50 00 61 0b 14 81 ff ff ff ff ......P.a.......
. 0010: fa 03 00 00 fa 03 00 00 81 3e 24 c8 15 2c 00 00 .........>$..,..
. 0020: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
. 0030: 1c 00 00 00 99 05 01 00 fa 03 00 00 60 0b 14 81 ............`...
. 0040: ff ff ff ff c7 f8 0a 00 00 00 00 00 00 00 00 00 ................
    (the bytes c7 f8 0a 00 00 00 00 00 at offset 0x44 are the appended
     del_time = 0x000af8c7)

And the perf script result is untouched:

dd 1018 [000] 48472.063753: perf_bpf_probe:generic_perform_write: (ffffffff81140b60)
dd 1018 [000] 48472.063753: perf_bpf_probe:generic_perform_write: (ffffffff81140b60)

As a next step, we will make the userspace perf tools parse this extra
data generated by eBPF.

Thank you.

He Kuang (5):
bpf: Put perf_events check ahead of bpf prog
perf/trace: Add perf extra percpu trace buffer
tracing/kprobe: Separate inc recursion count out of
perf_trace_buf_prepare
bpf: Introduce function for outputing sample data to perf event
tracing/kprobe: Combine extra trace buf into perf trace buf

include/linux/ftrace_event.h | 7 +++-
include/linux/perf_event.h | 2 ++
include/uapi/linux/bpf.h | 1 +
kernel/events/core.c | 6 ++++
kernel/events/internal.h | 17 ++++++----
kernel/trace/bpf_trace.c | 29 ++++++++++++++++
kernel/trace/trace_event_perf.c | 73 ++++++++++++++++++++++++++++++++++++++---
kernel/trace/trace_kprobe.c | 70 ++++++++++++++++++++++++++++++++-------
samples/bpf/bpf_helpers.h | 2 ++
9 files changed, 182 insertions(+), 25 deletions(-)

--
1.8.5.2


2015-07-01 02:58:58

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH 1/5] bpf: Put perf_events check ahead of bpf prog

When we add a kprobe point and record events with perf, the execution
path of every thread on every cpu will enter this point, but perf may
only be recording events for a particular thread or cpu at this kprobe
point; a check on the call->perf_events list filters out the threads
which perf is not recording.

Currently, bpf_prog is entered at the beginning of kprobe_perf_func(),
before the above check, which makes bpf_prog execute in every thread,
including threads determined not to be recorded. A simple test can
demonstrate this:

'bpf_prog_on_write.o' contains a bpf prog which writes to the trace
buffer whenever it is entered. Run a background process 'another-dd'
and 'dd' simultaneously, but let perf record only the 'dd' process. The
result shows that all threads trigger bpf_prog.

$ another-dd if=/dev/zero of=test1 bs=4k count=1000000
$ perf record -v --event bpf_prog_on_write.o -- dd if=/dev/zero of=test2 bs=4k count=3
$ cat /sys/kernel/debug/tracing/trace
another-dd-1007 [000] d... 120.225835: : generic_perform_write: tgid=1007, pid=1007
another-dd-1007 [000] d... 120.227123: : generic_perform_write: tgid=1007, pid=1007
[repeat many times...]
another-dd-1007 [000] d... 120.412395: : generic_perform_write: tgid=1007, pid=1007
another-dd-1007 [000] d... 120.412524: : generic_perform_write: tgid=1007, pid=1007
dd-1009 [000] d... 120.413080: : generic_perform_write: tgid=1009, pid=1009
dd-1009 [000] d... 120.414846: : generic_perform_write: tgid=1009, pid=1009
dd-1009 [000] d... 120.415013: : generic_perform_write: tgid=1009, pid=1009
another-dd-1007 [000] d... 120.416128: : generic_perform_write: tgid=1007, pid=1007
another-dd-1007 [000] d... 120.416295: : generic_perform_write: tgid=1007, pid=1007

This patch moves the check on the perf_events list ahead and skips
running bpf_prog on threads perf does not care about.

After this patch:

$ another-dd if=/dev/zero of=test1 bs=4k count=1000000
$ perf record -v --event bpf_prog_on_write.o -- dd if=/dev/zero of=test2 bs=4k count=3
$ cat /sys/kernel/debug/tracing/trace
dd-994 [000] d... 46.386754: : generic_perform_write: tgid=994, pid=994
dd-994 [000] d... 46.389167: : generic_perform_write: tgid=994, pid=994
dd-994 [000] d... 46.389551: : generic_perform_write: tgid=994, pid=994

Signed-off-by: He Kuang <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
---
kernel/trace/trace_kprobe.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d0ce590..5600df8 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1141,13 +1141,13 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
int size, __size, dsize;
int rctx;

- if (prog && !trace_call_bpf(prog, regs))
- return;
-
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
return;

+ if (prog && !trace_call_bpf(prog, regs))
+ return;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
@@ -1176,13 +1176,13 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
int size, __size, dsize;
int rctx;

- if (prog && !trace_call_bpf(prog, regs))
- return;
-
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
return;

+ if (prog && !trace_call_bpf(prog, regs))
+ return;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
--
1.8.5.2

2015-07-01 02:58:27

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH 2/5] perf/trace: Add perf extra percpu trace buffer

Introduce another trace buffer besides perf_trace_buf, so that other
utilities like bpf can generate extra perf event data and fill it into
this buffer, and the data can be collected into the original buffer
afterwards. The life cycle of this extra buffer is the same as the
original one.

Signed-off-by: He Kuang <[email protected]>
---
include/linux/ftrace_event.h | 5 ++++-
kernel/trace/trace_event_perf.c | 46 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index f9ecf63..b6cf5a3 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -411,7 +411,8 @@ struct ftrace_event_file {
} \
early_initcall(trace_init_perf_perm_##name);

-#define PERF_MAX_TRACE_SIZE 2048
+#define PERF_MAX_TRACE_SIZE 2048
+#define PERF_MAX_EXTRA_TRACE_SIZE 128

#define MAX_FILTER_STR_VAL 256 /* Should handle KSYM_SYMBOL_LEN */

@@ -614,6 +615,8 @@ extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
extern void ftrace_profile_free_filter(struct perf_event *event);
extern void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs **regs, int *rctxp);
+extern void *perf_extra_trace_buf_prepare(int size, int rctx);
+extern void *get_perf_extra_trace_buf(int rctx);

static inline void
perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 6fa484d..603b3da 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -10,6 +10,7 @@
#include "trace.h"

static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS];
+static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];

/*
* Force it to be aligned to unsigned long to avoid misaligned accesses
@@ -18,6 +19,10 @@ static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS];
typedef typeof(unsigned long [PERF_MAX_TRACE_SIZE / sizeof(unsigned long)])
perf_trace_t;

+typedef
+typeof(unsigned long [PERF_MAX_EXTRA_TRACE_SIZE / sizeof(unsigned long)])
+ perf_extra_trace_t;
+
/* Count the events in use (per event id, not per instance) */
static int total_ref_count;

@@ -113,6 +118,13 @@ static int perf_trace_event_reg(struct ftrace_event_call *tp_event,
goto fail;

perf_trace_buf[i] = buf;
+
+ buf = (char __percpu *)
+ alloc_percpu(perf_extra_trace_t);
+ if (!buf)
+ goto fail;
+
+ perf_extra_trace_buf[i] = buf;
}
}

@@ -130,6 +142,9 @@ fail:
for (i = 0; i < PERF_NR_CONTEXTS; i++) {
free_percpu(perf_trace_buf[i]);
perf_trace_buf[i] = NULL;
+
+ free_percpu(perf_extra_trace_buf[i]);
+ perf_extra_trace_buf[i] = NULL;
}
}

@@ -164,6 +179,9 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
for (i = 0; i < PERF_NR_CONTEXTS; i++) {
free_percpu(perf_trace_buf[i]);
perf_trace_buf[i] = NULL;
+
+ free_percpu(perf_extra_trace_buf[i]);
+ perf_extra_trace_buf[i] = NULL;
}
}
out:
@@ -260,6 +278,34 @@ void perf_trace_del(struct perf_event *p_event, int flags)
tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event);
}

+void *perf_extra_trace_buf_prepare(int size, int rctx)
+{
+ char *raw_data;
+
+ BUILD_BUG_ON(PERF_MAX_EXTRA_TRACE_SIZE % sizeof(unsigned long));
+
+ if (WARN_ONCE(size > PERF_MAX_EXTRA_TRACE_SIZE,
+ "perf extra buffer not large enough"))
+ return NULL;
+
+ raw_data = this_cpu_ptr(perf_extra_trace_buf[rctx]);
+
+ /* The first 4 bytes is raw_data size and it is used as a valid flag */
+ *(u32 *)raw_data = size;
+ raw_data += sizeof(u32);
+
+ return raw_data;
+}
+EXPORT_SYMBOL_GPL(perf_extra_trace_buf_prepare);
+NOKPROBE_SYMBOL(perf_extra_trace_buf_prepare);
+
+void *get_perf_extra_trace_buf(int rctx)
+{
+ return this_cpu_ptr(perf_extra_trace_buf[rctx]);
+}
+EXPORT_SYMBOL_GPL(get_perf_extra_trace_buf);
+NOKPROBE_SYMBOL(get_perf_extra_trace_buf);
+
void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs **regs, int *rctxp)
{
--
1.8.5.2

2015-07-01 02:59:40

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH 3/5] tracing/kprobe: Separate inc recursion count out of perf_trace_buf_prepare

Inside perf_trace_buf_prepare(), a recursion count is increased. The
count was first introduced by commit 444a2a3bcd6d ("tracing,
perf_events: Protect the buffer from recursion in perf") to protect
the percpu data buffer from being overwritten.

To let a future patch enable eBPF to save data into the perf extra
trace buffer while still preventing the data buffer from being filled
recursively, the recursion count is now increased before entering
trace_call_bpf() and decreased in case of error. In this case, the new
function perf_trace_buf_prepare_rctx() is used so that the recursion
count is not increased a second time.

Signed-off-by: He Kuang <[email protected]>
---
include/linux/ftrace_event.h | 2 ++
kernel/trace/trace_event_perf.c | 27 ++++++++++++++++++++++-----
kernel/trace/trace_kprobe.c | 28 ++++++++++++++++++++++------
3 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index b6cf5a3..085d236 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -615,6 +615,8 @@ extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
extern void ftrace_profile_free_filter(struct perf_event *event);
extern void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs **regs, int *rctxp);
+extern void *perf_trace_buf_prepare_rctx(int size, unsigned short type,
+ struct pt_regs **regs, int rctx);
extern void *perf_extra_trace_buf_prepare(int size, int rctx);
extern void *get_perf_extra_trace_buf(int rctx);

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 603b3da..e683cb6 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -306,8 +306,9 @@ void *get_perf_extra_trace_buf(int rctx)
EXPORT_SYMBOL_GPL(get_perf_extra_trace_buf);
NOKPROBE_SYMBOL(get_perf_extra_trace_buf);

-void *perf_trace_buf_prepare(int size, unsigned short type,
- struct pt_regs **regs, int *rctxp)
+static void *__perf_trace_buf_prepare(int size, unsigned short type,
+ struct pt_regs **regs, int *rctxp,
+ bool update_rctx)
{
struct trace_entry *entry;
unsigned long flags;
@@ -322,9 +323,11 @@ void *perf_trace_buf_prepare(int size, unsigned short type,

pc = preempt_count();

- *rctxp = perf_swevent_get_recursion_context();
- if (*rctxp < 0)
- return NULL;
+ if (update_rctx) {
+ *rctxp = perf_swevent_get_recursion_context();
+ if (*rctxp < 0)
+ return NULL;
+ }

if (regs)
*regs = this_cpu_ptr(&__perf_regs[*rctxp]);
@@ -340,9 +343,23 @@ void *perf_trace_buf_prepare(int size, unsigned short type,

return raw_data;
}
+
+void *perf_trace_buf_prepare(int size, unsigned short type,
+ struct pt_regs **regs, int *rctxp)
+{
+ return __perf_trace_buf_prepare(size, type, regs, rctxp, true);
+}
EXPORT_SYMBOL_GPL(perf_trace_buf_prepare);
NOKPROBE_SYMBOL(perf_trace_buf_prepare);

+void *perf_trace_buf_prepare_rctx(int size, unsigned short type,
+ struct pt_regs **regs, int rctx)
+{
+ return __perf_trace_buf_prepare(size, type, regs, &rctx, false);
+}
+EXPORT_SYMBOL_GPL(perf_trace_buf_prepare_rctx);
+NOKPROBE_SYMBOL(perf_trace_buf_prepare_rctx);
+
#ifdef CONFIG_FUNCTION_TRACER
static void
perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 5600df8..16ad88e 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1145,22 +1145,30 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
if (hlist_empty(head))
return;

- if (prog && !trace_call_bpf(prog, regs))
+ rctx = perf_swevent_get_recursion_context();
+ if (rctx < 0)
return;

+ if (prog && !trace_call_bpf(prog, regs))
+ goto out;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);

- entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
+ entry = perf_trace_buf_prepare_rctx(size, call->event.type, NULL, rctx);
if (!entry)
- return;
+ goto out;

entry->ip = (unsigned long)tk->rp.kp.addr;
memset(&entry[1], 0, dsize);
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
+
+ return;
+out:
+ perf_swevent_put_recursion_context(rctx);
}
NOKPROBE_SYMBOL(kprobe_perf_func);

@@ -1180,22 +1188,30 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
if (hlist_empty(head))
return;

- if (prog && !trace_call_bpf(prog, regs))
+ rctx = perf_swevent_get_recursion_context();
+ if (rctx < 0)
return;

+ if (prog && !trace_call_bpf(prog, regs))
+ goto out;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);

- entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
+ entry = perf_trace_buf_prepare_rctx(size, call->event.type, NULL, rctx);
if (!entry)
- return;
+ goto out;

entry->func = (unsigned long)tk->rp.kp.addr;
entry->ret_ip = (unsigned long)ri->ret_addr;
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
+
+ return;
+out:
+ perf_swevent_put_recursion_context(rctx);
}
NOKPROBE_SYMBOL(kretprobe_perf_func);
#endif /* CONFIG_PERF_EVENTS */
--
1.8.5.2

2015-07-01 02:58:41

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH 4/5] bpf: Introduce function for outputing sample data to perf event

Add a function to receive data from eBPF programs and fill it into the
perf extra trace buffer of the current context. The previous patch
makes sure that the recursion flag protecting the perf trace buffer is
taken when bpf_prog is executed, so here we can safely fill the trace
buffer.

In order to get the trace buffer corresponding to the current context,
a new function perf_swevent_current_context_type() is added; it only
returns the current context type and does not increase the recursion
count.

Signed-off-by: He Kuang <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/bpf.h | 1 +
kernel/events/core.c | 6 ++++++
kernel/events/internal.h | 17 ++++++++++-------
kernel/trace/bpf_trace.c | 29 +++++++++++++++++++++++++++++
samples/bpf/bpf_helpers.h | 2 ++
6 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a204d52..984c89c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -939,6 +939,7 @@ extern unsigned int perf_output_copy(struct perf_output_handle *handle,
const void *buf, unsigned int len);
extern unsigned int perf_output_skip(struct perf_output_handle *handle,
unsigned int len);
+extern int perf_swevent_current_context_type(void);
extern int perf_swevent_get_recursion_context(void);
extern void perf_swevent_put_recursion_context(int rctx);
extern u64 perf_swevent_set_period(struct perf_event *event);
@@ -995,6 +996,7 @@ static inline void perf_event_exec(void) { }
static inline void perf_event_comm(struct task_struct *tsk, bool exec) { }
static inline void perf_event_fork(struct task_struct *tsk) { }
static inline void perf_event_init(void) { }
+static inline int perf_swevent_current_context_type(void) { return -1; }
static inline int perf_swevent_get_recursion_context(void) { return -1; }
static inline void perf_swevent_put_recursion_context(int rctx) { }
static inline u64 perf_swevent_set_period(struct perf_event *event) { return 0; }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a9ebdf5..8a7015d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -169,6 +169,7 @@ enum bpf_func_id {
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
BPF_FUNC_probe_read, /* int bpf_probe_read(void *dst, int size, void *src) */
+ BPF_FUNC_output_sample, /* int bpf_output_sample(void *src, int size) */
BPF_FUNC_ktime_get_ns, /* u64 bpf_ktime_get_ns(void) */
BPF_FUNC_trace_printk, /* int bpf_trace_printk(const char *fmt, int fmt_size, ...) */
BPF_FUNC_get_prandom_u32, /* u32 prandom_u32(void) */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9e0773d..0224d5b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6376,6 +6376,12 @@ end:

DEFINE_PER_CPU(struct pt_regs, __perf_regs[4]);

+int perf_swevent_current_context_type(void)
+{
+ return current_context_type();
+}
+EXPORT_SYMBOL_GPL(perf_swevent_current_context_type);
+
int perf_swevent_get_recursion_context(void)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 2deb24c..5cabce5 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -175,18 +175,21 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs);
extern int get_callchain_buffers(void);
extern void put_callchain_buffers(void);

-static inline int get_recursion_context(int *recursion)
+static inline int current_context_type(void)
{
- int rctx;
-
if (in_nmi())
- rctx = 3;
+ return 3;
else if (in_irq())
- rctx = 2;
+ return 2;
else if (in_softirq())
- rctx = 1;
+ return 1;
else
- rctx = 0;
+ return 0;
+}
+
+static inline int get_recursion_context(int *recursion)
+{
+ int rctx = current_context_type();

if (recursion[rctx])
return -1;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 2d56ce5..b2f363d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -79,6 +79,33 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
.arg3_type = ARG_ANYTHING,
};

+static u64 bpf_output_sample(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *src = (void *) (long) r1;
+ int size = (int) r2;
+ void *buf;
+ int rctx = perf_swevent_current_context_type();
+
+ if (rctx < 0)
+ return -EINVAL;
+
+ buf = perf_extra_trace_buf_prepare(size, rctx);
+ if (!buf)
+ return -ENOMEM;
+
+ memcpy(buf, src, size);
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_output_sample_proto = {
+ .func = bpf_output_sample,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_STACK,
+ .arg2_type = ARG_CONST_STACK_SIZE,
+};
+
static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
{
/* NMI safe access to clock monotonic */
@@ -170,6 +197,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_map_delete_elem_proto;
case BPF_FUNC_probe_read:
return &bpf_probe_read_proto;
+ case BPF_FUNC_output_sample:
+ return &bpf_output_sample_proto;
case BPF_FUNC_ktime_get_ns:
return &bpf_ktime_get_ns_proto;

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f960b5f..105510e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -17,6 +17,8 @@ static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
(void *) BPF_FUNC_probe_read;
+static int (*bpf_output_sample)(void *src, int size) =
+ (void *) BPF_FUNC_output_sample;
static unsigned long long (*bpf_ktime_get_ns)(void) =
(void *) BPF_FUNC_ktime_get_ns;
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
--
1.8.5.2

2015-07-01 02:59:50

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH 5/5] tracing/kprobe: Combine extra trace buf into perf trace buf

Collect the extra trace buffer data at the stage of preparing the perf
trace buffer. If the extra perf trace buffer is valid, mark it invalid,
extend the trace buffer size and append the data of the extra buffer to
the perf trace buffer, so the combined data stays compatible with the
original format and can be processed by perf script.

Signed-off-by: He Kuang <[email protected]>
---
kernel/trace/trace_kprobe.c | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 16ad88e..88642ee 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1140,6 +1140,8 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
struct hlist_head *head;
int size, __size, dsize;
int rctx;
+ u32 extra_buf_size;
+ char *extra_buf;

head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
@@ -1152,15 +1154,28 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
if (prog && !trace_call_bpf(prog, regs))
goto out;

+ /* Check extra trace buf */
+ extra_buf = get_perf_extra_trace_buf(rctx);
+ extra_buf_size = *(u32 *)extra_buf;
+
+ /* Clear size to invalid buf */
+ *(u32 *)extra_buf = 0;
+
+ if (extra_buf_size != 0)
+ extra_buf += sizeof(u32);
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
- size = ALIGN(__size + sizeof(u32), sizeof(u64));
+ size = ALIGN(__size + extra_buf_size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);

entry = perf_trace_buf_prepare_rctx(size, call->event.type, NULL, rctx);
if (!entry)
goto out;

+ /* Combine extra trace buf to perf trace buf */
+ memcpy((char *)entry + __size, extra_buf, extra_buf_size);
+
entry->ip = (unsigned long)tk->rp.kp.addr;
memset(&entry[1], 0, dsize);
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
@@ -1183,6 +1198,8 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
struct hlist_head *head;
int size, __size, dsize;
int rctx;
+ u32 extra_buf_size;
+ char *extra_buf;

head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
@@ -1195,15 +1212,28 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
if (prog && !trace_call_bpf(prog, regs))
goto out;

+ /* Check extra trace buf */
+ extra_buf = get_perf_extra_trace_buf(rctx);
+ extra_buf_size = *(u32 *)extra_buf;
+
+ /* Clear size to invalid buf */
+ *(u32 *)extra_buf = 0;
+
+ if (extra_buf_size != 0)
+ extra_buf += sizeof(u32);
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
- size = ALIGN(__size + sizeof(u32), sizeof(u64));
+ size = ALIGN(__size + extra_buf_size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);

entry = perf_trace_buf_prepare_rctx(size, call->event.type, NULL, rctx);
if (!entry)
goto out;

+ /* Combine extra trace buf to perf trace buf */
+ memcpy((char *)entry + __size, extra_buf, extra_buf_size);
+
entry->func = (unsigned long)tk->rp.kp.addr;
entry->ret_ip = (unsigned long)ri->ret_addr;
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
--
1.8.5.2

2015-07-01 05:45:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

On Wed, Jul 01, 2015 at 02:57:30AM +0000, He Kuang wrote:
> This patch adds an extra perf trace buffer for other utilities like
> bpf to fill extra data to perf events.

What!, why?

2015-07-01 06:22:59

by Wang Nan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event



On 2015/7/1 13:44, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 02:57:30AM +0000, He Kuang wrote:
>> This patch adds an extra perf trace buffer for other utilities like
>> bpf to fill extra data to perf events.
> What!, why?

The goal of this patchset is to give BPF programs a means to output
something through perf samples.

BPF programs give us a way to filter and aggregate events, which lets
us do many interesting things. For example, we can count the number of
context switches occurring inside sys_write system calls by attaching
BPF programs to the entry and exit points of the system call and to the
entry of __schedule, then reporting the count when the syscall exits
(a rough sketch follows below). Combined with BPF reading the PMU,
which we are working on, BPF programs can be used to profile kernel
functions in a fine-grained manner.
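
As a rough illustration only (not part of this patch set), such a
counter could be sketched roughly as below, assuming the usual
samples/bpf helpers plus the bpf_get_current_pid_tgid() helper
mentioned later in this thread are available. The map name, the SEC()
probe specs and the three-probe layout are hypothetical:

struct switch_count {
	u64 count;
};

struct bpf_map_def SEC("maps") write_switches = {
	.type		= BPF_MAP_TYPE_HASH,
	.key_size	= sizeof(u32),
	.value_size	= sizeof(struct switch_count),
	.max_entries	= 1024,
};

/* sys_write entry: start counting for this task */
SEC("write_entry=sys_write")
int on_write_entry(struct pt_regs *ctx)
{
	u32 pid = bpf_get_current_pid_tgid();
	struct switch_count zero = {};

	bpf_map_update_elem(&write_switches, &pid, &zero, BPF_ANY);
	return 0;
}

/* __schedule entry: the current task is being switched out;
 * count it if it is inside sys_write */
SEC("sched=__schedule")
int on_schedule(struct pt_regs *ctx)
{
	u32 pid = bpf_get_current_pid_tgid();
	struct switch_count *val = bpf_map_lookup_elem(&write_switches, &pid);

	if (val)
		val->count++;
	return 0;
}

/* sys_write return: report the count and clean up */
SEC("write_exit=sys_write%return")
int on_write_exit(struct pt_regs *ctx)
{
	char fmt[] = "sys_write saw %llu context switches\n";
	u32 pid = bpf_get_current_pid_tgid();
	struct switch_count *val = bpf_map_lookup_elem(&write_switches, &pid);

	if (val) {
		bpf_trace_printk(fmt, sizeof(fmt), val->count);
		bpf_map_delete_elem(&write_switches, &pid);
	}
	return 0;
}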

However, currently the only ways that BPF programs can transfer
something to perf are:

1. By returning 0 or 1, a BPF program can prevent perf from collecting
   a sample;
2. Through the map mechanism, user programs (perf) can read the
   aggregation result computed by a BPF program (not implemented now);
3. Through BPF_FUNC_trace_printk, they can output strings into the
   ftrace ring buffer.

For the task I mentioned above, the best we can do today is to print
results into the ring buffer from the program attached to
sys_write%return, and then merge them with perf.data using timestamps.

We believe this can be improved. These patches are an attempt to let
bpf programs call something like 'BPF_FUNC_output_sample' to output
data, and to collect that data together with the other data output by a
perf sample. With the corresponding perf support (not implemented yet),
perf will be able to extract the data through 'perf script' or
'perf data convert --to-ctf', and further analysis can be done from
there.

The extra perf trace buffer is added for that reason. Currently,
perf_trace_buf is used as a per-cpu buffer for the other parts of a
perf sample's data. Making the bpf program append information into that
buffer is possible, but it requires us to calculate the data size a
perf sample requires (by calling __get_data_size) before we can be sure
the sample will not be filtered out. Alternatively, we could make the
BPF program write from the beginning of that buffer and append the perf
sample data after it, but then the result could not be parsed by
current perf.

2015-07-01 11:58:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

On Wed, Jul 01, 2015 at 02:21:11PM +0800, Wangnan (F) wrote:
>
>
> On 2015/7/1 13:44, Peter Zijlstra wrote:
> >On Wed, Jul 01, 2015 at 02:57:30AM +0000, He Kuang wrote:
> >>This patch adds an extra perf trace buffer for other utilities like
> >>bpf to fill extra data to perf events.
> >What!, why?
>
> The goal of this patchset is to give BPF program a mean to output something
> through
> perf samples.

But why create a separate trace buffer, it should go into the regular
perf buffer.

Your email is unreadably wordwrapped. Please fix your MUA.

2015-07-02 02:48:35

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

On 7/1/15 4:58 AM, Peter Zijlstra wrote:
>
> But why create a separate trace buffer, it should go into the regular
> perf buffer.

+1

I think
+static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
is redundant.
It adds quite a bit of unnecessary complexity to the whole patch set.

Also the call to bpf_output_sample() is not effective unless program
returns 1. It's a confusing user interface.

Also you cannot ever do:
BPF_FUNC_probe_read,
+ BPF_FUNC_output_sample,
BPF_FUNC_ktime_get_ns,
new functions must be added to the end.

Why not just do:
perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper?
No changes to current code.
No need to call __get_data_size() and other overhead.
The helper can be called multiple times from the same program.
imo much cleaner.

Also how about calling this helper:
bpf_trace_buf_submit(void *stack_ptr, int size) ?
bpf_output_sample is, I think, an odd name. It's not a sample.
Maybe another name?

2015-07-02 03:42:23

by He Kuang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event



On 2015/7/2 10:48, Alexei Starovoitov wrote:
> On 7/1/15 4:58 AM, Peter Zijlstra wrote:
>>
>> But why create a separate trace buffer, it should go into the regular
>> perf buffer.
>
> +1
>
> I think
> +static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
> is redundant.
> It adds quite a bit of unnecessary complexity to the whole patch set.
>
> Also the call to bpf_output_sample() is not effective unless program
> returns 1. It's a confusing user interface.
>
> Also you cannot ever do:
> BPF_FUNC_probe_read,
> + BPF_FUNC_output_sample,
> BPF_FUNC_ktime_get_ns,
> new functions must be added to the end.
>
> Why not just do:
> perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper?
> No changes to current code.
> No need to call __get_data_size() and other overhead.
> The helper can be called multiple times from the same program.
> imo much cleaner.
>

Invoking perf_trace_buf_submit() will generate a second perf
event entry (header->type = PERF_RECORD_SAMPLE) which is
different from the event entry output by the original
kprobe. So the final result of the example in the 0/5 cover
letter may look like this:

sample entry 1(from bpf_prog):
comm timestamp1 generic_perform_write pmu_value=0x1234

sample entry 2(from original kprobe):
comm timestamp2 generic_perform_write: (ffffffff81140b60)

Compared with current implementation:

combined sample entry:
comm timestamp generic_perform_write: (ffffffff81140b60) pmu_value=0x1234

The former two entries may be discontinuous, since there are multiple
threads and kprobes being recorded, and there's a chance that one
entry is missed while the other is recorded. What we need is the
pmu value read when 'generic_perform_write' is entered; the two-entry
result is not intuitive enough, and userspace tools would have to do
the work of finding and combining those two sample entries to get the
result.

Thank you.

> Also how about calling this helper:
> bpf_trace_buf_submit(void *stack_ptr, int size) ?
> bpf_output_sample, I think, is odd name. It's not a sample.
> May be other name?
>
>

2015-07-02 03:50:43

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] bpf: Put perf_events check ahead of bpf prog

On 6/30/15 7:57 PM, He Kuang wrote:
> When we add a kprobe point and record events by perf, the execution path
> of all threads on each cpu will enter this point, but perf may only
> record events on a particular thread or cpu at this kprobe point, a
> check on call->perf_events list filters out the threads which perf is
> not recording.

I think there is a better way to do that. You're adding artificial
per_cpu filtering whereas you really need per_pid filtering.
The patch kinda worked, but looks more by accident.
The accurate way to do per_pid filtering is to automatically add
'if ((u32)bpf_get_current_pid_tgid() != expected_pid) return 0;'
as the first statement of the program.
You already have nice infra to add a prologue to the program,
so I think adding the above 'if' on the user-space side of perf is
the preferred way to achieve that.
Also such per_pid filtering should be smart.
The command 'perf record -a -e my_prog.o -- sleep 10' should
collect all things that my_prog was attached to and not only
things that happened as part of sleep.

2015-07-02 03:52:20

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

On 7/1/15 8:38 PM, He Kuang wrote:
>
>
> On 2015/7/2 10:48, Alexei Starovoitov wrote:
>> On 7/1/15 4:58 AM, Peter Zijlstra wrote:
>>>
>>> But why create a separate trace buffer, it should go into the regular
>>> perf buffer.
>>
>> +1
>>
>> I think
>> +static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
>> is redundant.
>> It adds quite a bit of unnecessary complexity to the whole patch set.
>>
>> Also the call to bpf_output_sample() is not effective unless program
>> returns 1. It's a confusing user interface.
>>
>> Also you cannot ever do:
>> BPF_FUNC_probe_read,
>> + BPF_FUNC_output_sample,
>> BPF_FUNC_ktime_get_ns,
>> new functions must be added to the end.
>>
>> Why not just do:
>> perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper?
>> No changes to current code.
>> No need to call __get_data_size() and other overhead.
>> The helper can be called multiple times from the same program.
>> imo much cleaner.
>>
>
> Invoke perf_trace_buf_submit() will generate a second perf
> event (header->type = PERF_RECORD_SAMPLE) entry which is
> different from the event entry outputed by the orignial
> kprobe. So the final result of the example in 00/00 patch may
> like this:
>
> sample entry 1(from bpf_prog):
> comm timestamp1 generic_perform_write pmu_value=0x1234
> sample entry 2(from original kprobe):
> comm timestamp2 generic_perform_write: (ffffffff81140b60)
> Compared with current implementation:
> combined sample entry:
> comm timestamp generic_perform_write: (ffffffff81140b60)
> pmu_value=0x1234
>
> The former two entries may be discontinuous as there are multiple
> threads and kprobes to be recorded, and there's a chance that one
> entry is missed but the other is recorded. What we need is the
> pmu_value read when 'generic_perform_write' enters, the two
> entries result is not intuitive enough and userspace tools have
> to do the work to find and combine those two sample entries to
> get the result.

Just change your example to return 0 and user space will see
one sample.

2015-07-02 05:54:40

by Wang Nan

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] bpf: Put perf_events check ahead of bpf prog



On 2015/7/2 11:50, Alexei Starovoitov wrote:
> On 6/30/15 7:57 PM, He Kuang wrote:
>> When we add a kprobe point and record events by perf, the execution path
>> of all threads on each cpu will enter this point, but perf may only
>> record events on a particular thread or cpu at this kprobe point, a
>> check on call->perf_events list filters out the threads which perf is
>> not recording.
>
> I think there is a better way to do that. You're adding artificial
> per_cpu filtering whereas you really need per_pid filtering.


I think the difference between you and He Kuang is the order of
filtering. In He Kuang's view, perf's original filtering mechanism
(implicit or explicit) should take precedence over the BPF filter,
because what the user wants is to filter events with *an additional*
BPF filter. So the filters should run in the following order:

event -> X -> Y -> Z -> BPF filter -+-> perf.data
                                    |
                                    `-> dropped

(In the above diagram, X represents limitations which prevent an event
from being triggered, for example kprobe reentering. Y represents
implicit filters, like the check on call->perf_events, which is used to
filter out events from other CPUs (per-pid perf events are also handled
by it). Z represents the explicit filter which the user sets with
PERF_EVENT_IOC_SET_FILTER.)

So only those events which would be collected by perf without the BPF
filter should be passed to the BPF program.

The above is our understanding of ideal BPF filters.

Therefore, to create an ideal BPF filter, it would be better to put BPF
filters into perf_tp_filter_match().

In the current implementation, BPF filters take effect in the middle
of kprobe event processing:

event -> X -> BPF filter -> Y -> Z -+-> perf.data
                                    |
                                    `-> dropped

And this patch changes the ordering to:

event -> X -> Y -> BPF filter -> Z -+-> perf.data
                                    |
                                    `-> dropped

Neither is ideal, but He Kuang's patch moves the BPF filter in the
correct direction. It uses a relatively low-cost operation (checking
call->perf_events) to reduce the need to call BPF filters.

I'd like to discuss the correctness of our understanding with you. Do
you have any strong reason to put BPF filters at such an early stage?

Thank you.

2015-07-02 09:26:32

by Wang Nan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event



On 2015/7/2 11:52, Alexei Starovoitov wrote:
> On 7/1/15 8:38 PM, He Kuang wrote:
>>
>>
>> On 2015/7/2 10:48, Alexei Starovoitov wrote:
>>> On 7/1/15 4:58 AM, Peter Zijlstra wrote:
>>>>
>>>> But why create a separate trace buffer, it should go into the regular
>>>> perf buffer.
>>>
>>> +1
>>>
>>> I think
>>> +static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
>>> is redundant.
>>> It adds quite a bit of unnecessary complexity to the whole patch set.
>>>
>>> Also the call to bpf_output_sample() is not effective unless program
>>> returns 1. It's a confusing user interface.
>>>
>>> Also you cannot ever do:
>>> BPF_FUNC_probe_read,
>>> + BPF_FUNC_output_sample,
>>> BPF_FUNC_ktime_get_ns,
>>> new functions must be added to the end.
>>>
>>> Why not just do:
>>> perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper?
>>> No changes to current code.
>>> No need to call __get_data_size() and other overhead.
>>> The helper can be called multiple times from the same program.
>>> imo much cleaner.
>>>
>>
>> Invoke perf_trace_buf_submit() will generate a second perf
>> event (header->type = PERF_RECORD_SAMPLE) entry which is
>> different from the event entry outputed by the orignial
>> kprobe. So the final result of the example in 00/00 patch may
>> like this:
>>
>> sample entry 1(from bpf_prog):
>> comm timestamp1 generic_perform_write pmu_value=0x1234
>> sample entry 2(from original kprobe):
>> comm timestamp2 generic_perform_write: (ffffffff81140b60)
>> Compared with current implementation:
>> combined sample entry:
>> comm timestamp generic_perform_write: (ffffffff81140b60)
>> pmu_value=0x1234
>>
>> The former two entries may be discontinuous as there are multiple
>> threads and kprobes to be recorded, and there's a chance that one
>> entry is missed but the other is recorded. What we need is the
>> pmu_value read when 'generic_perform_write' enters, the two
>> entries result is not intuitive enough and userspace tools have
>> to do the work to find and combine those two sample entries to
>> get the result.
>
> Just change your example to return 0 and user space will see
> one sample.
>

Yes, by using perf_trace_buf_prepare() + perf_trace_buf_submit() in the
helper function and letting the bpf program always return 0, we can
make data collected by BPF programs go out as samples, if the following
problems are solved:

1. In a bpf program there's no way to get 'struct perf_event' or
'struct ftrace_event_call'. We have to deduce them through pt_regs:

   pt_regs -> ip -> kprobe -> struct trace_kprobe ->
   struct ftrace_event_call -> hlist_entry -> struct perf_event

which seems dirty, but without that we can't call
perf_trace_buf_submit().

2. Even if we finally get 'struct perf_event', I'm not sure whether
users really care about it. If we really care about all the information
output through perf_trace_buf_submit(), like the callstack and
registers, why not make the bpf program return non-zero instead? But
then we have to consider how to connect the two samples together.

So maybe writing a new function that replaces perf_trace_buf_submit()
and outputs some lightweight information instead of full event data is
worth considering. Otherwise, maybe a dummy 'struct perf_event' for BPF
output is also acceptable?

What we are trying to do in the previous patches is to merge the data
output by BPF programs and the original data output by
perf_trace_buf_submit() together. For example (expressed in CTF
metadata format):

event.header := struct {     // both output by perf_trace_buf_submit()
	integer { ... } id;
	integer { ... } timestamp;
}
event {
	name = "perf_bpf_probe:lock_page";
	...
	fields := struct {
		integer { ... } perf_ip;            // perf_trace_buf_submit()
		integer { ... } perf_tid;           // perf_trace_buf_submit()
		...
		integer { ... } page;               <-- Fetched using prologue
		integer { ... } cycle_cmu_counter;  <-- Output by BPF program
	}
}

We believe the implementation should be simpler. Whether to use an
extra perf_trace_buf or not can be discussed; we have other choices.
For example, we can make the BPF program write its data starting from
the end of perf_trace_buf, and join the two parts of the output data
before calling perf_trace_buf_submit().

Thank you.

2015-07-02 09:31:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

On Wed, Jul 01, 2015 at 07:48:25PM -0700, Alexei Starovoitov wrote:
> On 7/1/15 4:58 AM, Peter Zijlstra wrote:
> >
> >But why create a separate trace buffer, it should go into the regular
> >perf buffer.
>
> +1
>
> I think
> +static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
> is redundant.
> It adds quite a bit of unnecessary complexity to the whole patch set.

Ah, that is what he does. I thought he was creating a trace buffer, as
in the stuff ftrace uses, to output stuff concurrently to the regular
perf buffer, which would be insane ;-)

So you need these temporary buffers if you cannot tell a priori how much
data is going to get written. The perf buffer can only deal with fixed
size events: you request a buffer of @size, you write into it, you
commit.

So if there's variable sized data, you first need to generate it in
order to tell how long it is, so that you can reserve your record and
copy it in. Sucks, but that's the way it is.
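
For reference, the fixed-size reserve/write/commit flow described above
looks roughly like this inside the kernel (a sketch only; the event,
data and len variables are placeholders, and the sample header setup
and error handling are simplified):

	struct perf_output_handle handle;
	struct perf_event_header header;

	/* the record size must be known before any space is reserved */
	header.type = PERF_RECORD_SAMPLE;
	header.misc = 0;
	header.size = sizeof(header) + len;

	if (perf_output_begin(&handle, event, header.size))	/* reserve */
		return;
	perf_output_put(&handle, header);			/* write the header */
	perf_output_copy(&handle, data, len);			/* copy the payload */
	perf_output_end(&handle);				/* commit the record */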

2015-07-02 13:51:45

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH v2 0/4] Make eBPF programs output data to perf event

Hi,

According to the discussion on patch set v1, the extra perf_trace_buf
seems redundant, so this version removes the additional buffer and
stores the bpf output in a temporary region at the end of the original
perf_trace_buf. The temporary region is moved back to the proper offset
of perf_trace_buf at the perf_trace_buf_prepare stage.
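
Roughly, the buffer usage this describes is as follows (a sketch based
only on the description above and on patch 3/4; the exact offsets
inside the buffer are an assumption):

	/*
	 * perf_trace_buf[rctx], PERF_MAX_TRACE_SIZE bytes
	 *
	 * while the BPF program runs (bpf_output_data):
	 *   +--------------------------------+-----------------+------+
	 *   |            (unused)            | bpf output data | size |
	 *   +--------------------------------+-----------------+------+
	 *   (the last 4 bytes hold the size and act as the valid flag)
	 *
	 * after perf_trace_buf_prepare_rctx() in kprobe_perf_func():
	 *   +----------------------+-----------------+----------------+
	 *   | trace entry + args   | bpf output data |    (unused)    |
	 *   +----------------------+-----------------+----------------+
	 */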

v1-v2:

- Remove additional perf_trace_buf. Use the end of perf_trace_buf as
a temporary region to store bpf data.

- Rename bpf_output_sample to bpf_output_data.

- New bpf API added to the end of the function list.

Thank you.

He Kuang (4):
bpf: Put perf_events check ahead of bpf prog
tracing/kprobe: Separate inc recursion count out of
perf_trace_buf_prepare
bpf: Introduce function for outputing data to perf event
tracing/kprobe: Combine bpf output and perf event output

include/linux/ftrace_event.h | 4 +++
include/linux/perf_event.h | 2 ++
include/uapi/linux/bpf.h | 3 ++
kernel/events/core.c | 6 ++++
kernel/events/internal.h | 17 ++++++----
kernel/trace/bpf_trace.c | 29 +++++++++++++++++
kernel/trace/trace_event_perf.c | 56 +++++++++++++++++++++++++++++---
kernel/trace/trace_kprobe.c | 72 ++++++++++++++++++++++++++++++++++-------
samples/bpf/bpf_helpers.h | 2 ++
9 files changed, 167 insertions(+), 24 deletions(-)

--
1.8.5.2

2015-07-02 13:51:36

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH v2 1/4] bpf: Put perf_events check ahead of bpf prog

When we add a kprobe point and record events with perf, the execution
path of every thread on every cpu will enter this point, but perf may
only be recording events for a particular thread or cpu at this kprobe
point; a check on the call->perf_events list filters out the threads
which perf is not recording.

Currently, bpf_prog is entered at the beginning of kprobe_perf_func(),
before the above check, which makes bpf_prog execute in every thread,
including threads determined not to be recorded. A simple test can
demonstrate this:

'bpf_prog_on_write.o' contains a bpf prog which writes to the trace
buffer whenever it is entered. Run a background process 'another-dd'
and 'dd' simultaneously, but let perf record only the 'dd' process. The
result shows that all threads trigger bpf_prog.

$ another-dd if=/dev/zero of=test1 bs=4k count=1000000
$ perf record -v --event bpf_prog_on_write.o -- dd if=/dev/zero of=test2 bs=4k count=3
$ cat /sys/kernel/debug/tracing/trace
another-dd-1007 [000] d... 120.225835: : generic_perform_write: tgid=1007, pid=1007
another-dd-1007 [000] d... 120.227123: : generic_perform_write: tgid=1007, pid=1007
[repeat many times...]
another-dd-1007 [000] d... 120.412395: : generic_perform_write: tgid=1007, pid=1007
another-dd-1007 [000] d... 120.412524: : generic_perform_write: tgid=1007, pid=1007
dd-1009 [000] d... 120.413080: : generic_perform_write: tgid=1009, pid=1009
dd-1009 [000] d... 120.414846: : generic_perform_write: tgid=1009, pid=1009
dd-1009 [000] d... 120.415013: : generic_perform_write: tgid=1009, pid=1009
another-dd-1007 [000] d... 120.416128: : generic_perform_write: tgid=1007, pid=1007
another-dd-1007 [000] d... 120.416295: : generic_perform_write: tgid=1007, pid=1007

This patch moves the check on the perf_events list ahead and skips
running bpf_prog on threads perf does not care about.

After this patch:

$ another-dd if=/dev/zero of=test1 bs=4k count=1000000
$ perf record -v --event bpf_prog_on_write.o -- dd if=/dev/zero of=test2 bs=4k count=3
$ cat /sys/kernel/debug/tracing/trace
dd-994 [000] d... 46.386754: : generic_perform_write: tgid=994, pid=994
dd-994 [000] d... 46.389167: : generic_perform_write: tgid=994, pid=994
dd-994 [000] d... 46.389551: : generic_perform_write: tgid=994, pid=994

Signed-off-by: He Kuang <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
---
kernel/trace/trace_kprobe.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d0ce590..5600df8 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1141,13 +1141,13 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
int size, __size, dsize;
int rctx;

- if (prog && !trace_call_bpf(prog, regs))
- return;
-
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
return;

+ if (prog && !trace_call_bpf(prog, regs))
+ return;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
@@ -1176,13 +1176,13 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
int size, __size, dsize;
int rctx;

- if (prog && !trace_call_bpf(prog, regs))
- return;
-
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
return;

+ if (prog && !trace_call_bpf(prog, regs))
+ return;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
--
1.8.5.2

2015-07-02 13:51:53

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH v2 2/4] tracing/kprobe: Separate inc recursion count out of perf_trace_buf_prepare

Inside perf_trace_buf_prepare(), a recursion count is increased. The
count was first introduced by commit 444a2a3bcd6d ("tracing,
perf_events: Protect the buffer from recursion in perf") to protect
the percpu data buffer from being overwritten.

To let a later patch enable eBPF to save data into the perf trace
buffer while still preventing the data buffer from being filled
recursively, the recursion count is now increased before entering
trace_call_bpf() and decreased in case of error. In this case, the new
function perf_trace_buf_prepare_rctx() is used so that the recursion
count is not increased a second time.

Signed-off-by: He Kuang <[email protected]>
---
include/linux/ftrace_event.h | 2 ++
kernel/trace/trace_event_perf.c | 27 ++++++++++++++++++++++-----
kernel/trace/trace_kprobe.c | 28 ++++++++++++++++++++++------
3 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index f9ecf63..d54f11d 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -614,6 +614,8 @@ extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
extern void ftrace_profile_free_filter(struct perf_event *event);
extern void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs **regs, int *rctxp);
+extern void *perf_trace_buf_prepare_rctx(int size, unsigned short type,
+ struct pt_regs **regs, int rctx);

static inline void
perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 6fa484d..344b601 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -260,8 +260,9 @@ void perf_trace_del(struct perf_event *p_event, int flags)
tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event);
}

-void *perf_trace_buf_prepare(int size, unsigned short type,
- struct pt_regs **regs, int *rctxp)
+static void *__perf_trace_buf_prepare(int size, unsigned short type,
+ struct pt_regs **regs, int *rctxp,
+ bool update_rctx)
{
struct trace_entry *entry;
unsigned long flags;
@@ -276,9 +277,11 @@ void *perf_trace_buf_prepare(int size, unsigned short type,

pc = preempt_count();

- *rctxp = perf_swevent_get_recursion_context();
- if (*rctxp < 0)
- return NULL;
+ if (update_rctx) {
+ *rctxp = perf_swevent_get_recursion_context();
+ if (*rctxp < 0)
+ return NULL;
+ }

if (regs)
*regs = this_cpu_ptr(&__perf_regs[*rctxp]);
@@ -294,9 +297,23 @@ void *perf_trace_buf_prepare(int size, unsigned short type,

return raw_data;
}
+
+void *perf_trace_buf_prepare(int size, unsigned short type,
+ struct pt_regs **regs, int *rctxp)
+{
+ return __perf_trace_buf_prepare(size, type, regs, rctxp, true);
+}
EXPORT_SYMBOL_GPL(perf_trace_buf_prepare);
NOKPROBE_SYMBOL(perf_trace_buf_prepare);

+void *perf_trace_buf_prepare_rctx(int size, unsigned short type,
+ struct pt_regs **regs, int rctx)
+{
+ return __perf_trace_buf_prepare(size, type, regs, &rctx, false);
+}
+EXPORT_SYMBOL_GPL(perf_trace_buf_prepare_rctx);
+NOKPROBE_SYMBOL(perf_trace_buf_prepare_rctx);
+
#ifdef CONFIG_FUNCTION_TRACER
static void
perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 5600df8..16ad88e 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1145,22 +1145,30 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
if (hlist_empty(head))
return;

- if (prog && !trace_call_bpf(prog, regs))
+ rctx = perf_swevent_get_recursion_context();
+ if (rctx < 0)
return;

+ if (prog && !trace_call_bpf(prog, regs))
+ goto out;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);

- entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
+ entry = perf_trace_buf_prepare_rctx(size, call->event.type, NULL, rctx);
if (!entry)
- return;
+ goto out;

entry->ip = (unsigned long)tk->rp.kp.addr;
memset(&entry[1], 0, dsize);
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
+
+ return;
+out:
+ perf_swevent_put_recursion_context(rctx);
}
NOKPROBE_SYMBOL(kprobe_perf_func);

@@ -1180,22 +1188,30 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
if (hlist_empty(head))
return;

- if (prog && !trace_call_bpf(prog, regs))
+ rctx = perf_swevent_get_recursion_context();
+ if (rctx < 0)
return;

+ if (prog && !trace_call_bpf(prog, regs))
+ goto out;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);

- entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
+ entry = perf_trace_buf_prepare_rctx(size, call->event.type, NULL, rctx);
if (!entry)
- return;
+ goto out;

entry->func = (unsigned long)tk->rp.kp.addr;
entry->ret_ip = (unsigned long)ri->ret_addr;
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
+
+ return;
+out:
+ perf_swevent_put_recursion_context(rctx);
}
NOKPROBE_SYMBOL(kretprobe_perf_func);
#endif /* CONFIG_PERF_EVENTS */
--
1.8.5.2

2015-07-02 13:52:05

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH v2 3/4] bpf: Introduce function for outputing data to perf event

Add a function to receive data from eBPF programs and fill it into the
perf trace buffer of the current context. The previous patch makes sure
that the recursion counter protecting the perf trace buffer is checked
when bpf_prog is executed, so here we can safely fill the trace buffer.
The data is temporarily stored at the end of perf_trace_buf; the last
4 bytes of the buffer are used as a valid flag and contain the
temporary buffer length.

In order to get the trace buffer corresponding to the current context,
a new function perf_swevent_current_context_type() is added; it only
returns the current context type and does not increase the recursion
count.

Signed-off-by: He Kuang <[email protected]>
---
include/linux/ftrace_event.h | 2 ++
include/linux/perf_event.h | 2 ++
include/uapi/linux/bpf.h | 3 +++
kernel/events/core.c | 6 ++++++
kernel/events/internal.h | 17 ++++++++++-------
kernel/trace/bpf_trace.c | 29 +++++++++++++++++++++++++++++
kernel/trace/trace_event_perf.c | 29 +++++++++++++++++++++++++++++
samples/bpf/bpf_helpers.h | 2 ++
8 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index d54f11d..1c1f3ad 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -616,6 +616,8 @@ extern void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs **regs, int *rctxp);
extern void *perf_trace_buf_prepare_rctx(int size, unsigned short type,
struct pt_regs **regs, int rctx);
+extern void *perf_trace_buf_prepare_rctx_tail(int size, int rctx);
+extern void *get_perf_trace_buf(int rctx);

static inline void
perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a204d52..984c89c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -939,6 +939,7 @@ extern unsigned int perf_output_copy(struct perf_output_handle *handle,
const void *buf, unsigned int len);
extern unsigned int perf_output_skip(struct perf_output_handle *handle,
unsigned int len);
+extern int perf_swevent_current_context_type(void);
extern int perf_swevent_get_recursion_context(void);
extern void perf_swevent_put_recursion_context(int rctx);
extern u64 perf_swevent_set_period(struct perf_event *event);
@@ -995,6 +996,7 @@ static inline void perf_event_exec(void) { }
static inline void perf_event_comm(struct task_struct *tsk, bool exec) { }
static inline void perf_event_fork(struct task_struct *tsk) { }
static inline void perf_event_init(void) { }
+static inline int perf_swevent_current_context_type(void) { return -1; }
static inline int perf_swevent_get_recursion_context(void) { return -1; }
static inline void perf_swevent_put_recursion_context(int rctx) { }
static inline u64 perf_swevent_set_period(struct perf_event *event) { return 0; }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a9ebdf5..13d3e46 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -210,6 +210,9 @@ enum bpf_func_id {
* Return: 0 on success
*/
BPF_FUNC_l4_csum_replace,
+
+ /* int bpf_output_data(void *src, int size) */
+ BPF_FUNC_output_data,
__BPF_FUNC_MAX_ID,
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9e0773d..0224d5b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6376,6 +6376,12 @@ end:

DEFINE_PER_CPU(struct pt_regs, __perf_regs[4]);

+int perf_swevent_current_context_type(void)
+{
+ return current_context_type();
+}
+EXPORT_SYMBOL_GPL(perf_swevent_current_context_type);
+
int perf_swevent_get_recursion_context(void)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 2deb24c..5cabce5 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -175,18 +175,21 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs);
extern int get_callchain_buffers(void);
extern void put_callchain_buffers(void);

-static inline int get_recursion_context(int *recursion)
+static inline int current_context_type(void)
{
- int rctx;
-
if (in_nmi())
- rctx = 3;
+ return 3;
else if (in_irq())
- rctx = 2;
+ return 2;
else if (in_softirq())
- rctx = 1;
+ return 1;
else
- rctx = 0;
+ return 0;
+}
+
+static inline int get_recursion_context(int *recursion)
+{
+ int rctx = current_context_type();

if (recursion[rctx])
return -1;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 2d56ce5..9159b5e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -79,6 +79,33 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
.arg3_type = ARG_ANYTHING,
};

+static u64 bpf_output_data(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *src = (void *) (long) r1;
+ int size = (int) r2;
+ void *buf;
+ int rctx = perf_swevent_current_context_type();
+
+ if (rctx < 0)
+ return -EINVAL;
+
+ buf = perf_trace_buf_prepare_rctx_tail(size, rctx);
+ if (!buf)
+ return -ENOMEM;
+
+ memcpy(buf, src, size);
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_output_data_proto = {
+ .func = bpf_output_data,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_STACK,
+ .arg2_type = ARG_CONST_STACK_SIZE,
+};
+
static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
{
/* NMI safe access to clock monotonic */
@@ -170,6 +197,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_map_delete_elem_proto;
case BPF_FUNC_probe_read:
return &bpf_probe_read_proto;
+ case BPF_FUNC_output_data:
+ return &bpf_output_data_proto;
case BPF_FUNC_ktime_get_ns:
return &bpf_ktime_get_ns_proto;

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 344b601..2eeb59b 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -260,6 +260,35 @@ void perf_trace_del(struct perf_event *p_event, int flags)
tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event);
}

+void *perf_trace_buf_prepare_rctx_tail(int size, int rctx)
+{
+ char *raw_data;
+
+ BUILD_BUG_ON(PERF_MAX_TRACE_SIZE % sizeof(unsigned long));
+
+ if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
+ "perf buffer not large enough"))
+ return NULL;
+
+ raw_data = this_cpu_ptr(perf_trace_buf[rctx]);
+ raw_data += (PERF_MAX_TRACE_SIZE - sizeof(u32));
+
+ /* The last 4 bytes hold the raw_data size and are used as a valid flag */
+ *(u32 *)raw_data = size;
+ raw_data -= size;
+
+ return raw_data;
+}
+EXPORT_SYMBOL_GPL(perf_trace_buf_prepare_rctx_tail);
+NOKPROBE_SYMBOL(perf_trace_buf_prepare_rctx_tail);
+
+void *get_perf_trace_buf(int rctx)
+{
+ return this_cpu_ptr(perf_trace_buf[rctx]);
+}
+EXPORT_SYMBOL_GPL(get_perf_trace_buf);
+NOKPROBE_SYMBOL(get_perf_trace_buf);
+
static void *__perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs **regs, int *rctxp,
bool update_rctx)
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f960b5f..44bfbeb 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -17,6 +17,8 @@ static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
(void *) BPF_FUNC_probe_read;
+static int (*bpf_output_data)(void *src, int size) =
+ (void *) BPF_FUNC_output_data;
static unsigned long long (*bpf_ktime_get_ns)(void) =
(void *) BPF_FUNC_ktime_get_ns;
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
--
1.8.5.2

2015-07-02 13:51:26

by He Kuang

[permalink] [raw]
Subject: [RFC PATCH v2 4/4] tracing/kprobe: Combine bpf output and perf event output

Check for and collect the temporary trace buffer at the stage of
preparing the perf trace buffer. If there is data to be collected,
clear its valid flag, extend the trace buffer size and move the data to
the proper offset, so the combined data stays compatible with the
original format and can still be processed by perf script.
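
For reference, the per-cpu buffer layout this patch works with can be
pictured roughly as follows (an illustration of the scheme, not literal
kernel code):

  perf_trace_buf[rctx]  (PERF_MAX_TRACE_SIZE bytes):

  +---------------------+--- ... ----+----------------+----------+
  | entry + trace args  |  (unused)  | bpf temp data  | u32 size |
  +---------------------+--- ... ----+----------------+----------+
  0                  __size                            last 4 bytes

  kprobe_perf_func() reads the size word, clears it to invalidate the
  temporary buffer, enlarges the sample by that amount and memmove()s
  the temporary data to offset __size, so it ends up appended to the
  original record without changing the record format.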

Signed-off-by: He Kuang <[email protected]>
---
kernel/trace/trace_kprobe.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 16ad88e..274735b 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1140,6 +1140,8 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
struct hlist_head *head;
int size, __size, dsize;
int rctx;
+ u32 buf_tail_size;
+ char *buf_tail;

head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
@@ -1152,15 +1154,29 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
if (prog && !trace_call_bpf(prog, regs))
goto out;

+ /* Check trace buf */
+ buf_tail = get_perf_trace_buf(rctx) +
+ PERF_MAX_TRACE_SIZE - sizeof(u32);
+ buf_tail_size = *(u32 *)buf_tail;
+
+ /* Clear size to invalidate the buf */
+ *(u32 *)buf_tail = 0;
+
+ if (buf_tail_size != 0)
+ buf_tail -= buf_tail_size;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
- size = ALIGN(__size + sizeof(u32), sizeof(u64));
+ size = ALIGN(__size + buf_tail_size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);

entry = perf_trace_buf_prepare_rctx(size, call->event.type, NULL, rctx);
if (!entry)
goto out;

+ /* Move temporary buf to proper offset */
+ memmove((char *)entry + __size, buf_tail, buf_tail_size);
+
entry->ip = (unsigned long)tk->rp.kp.addr;
memset(&entry[1], 0, dsize);
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
@@ -1183,6 +1199,8 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
struct hlist_head *head;
int size, __size, dsize;
int rctx;
+ u32 buf_tail_size;
+ char *buf_tail;

head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
@@ -1195,15 +1213,29 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
if (prog && !trace_call_bpf(prog, regs))
goto out;

+ /* Check trace buf */
+ buf_tail = get_perf_trace_buf(rctx) +
+ PERF_MAX_TRACE_SIZE - sizeof(u32);
+ buf_tail_size = *(u32 *)buf_tail;
+
+ /* Clear size to invalidate the buf */
+ *(u32 *)buf_tail = 0;
+
+ if (buf_tail_size != 0)
+ buf_tail -= buf_tail_size;
+
dsize = __get_data_size(&tk->tp, regs);
__size = sizeof(*entry) + tk->tp.size + dsize;
- size = ALIGN(__size + sizeof(u32), sizeof(u64));
+ size = ALIGN(__size + buf_tail_size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);

entry = perf_trace_buf_prepare_rctx(size, call->event.type, NULL, rctx);
if (!entry)
goto out;

+ /* Move temporary buf to proper offset */
+ memmove((char *)entry + __size, buf_tail, buf_tail_size);
+
entry->func = (unsigned long)tk->rp.kp.addr;
entry->ret_ip = (unsigned long)ri->ret_addr;
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
--
1.8.5.2

2015-07-02 18:02:43

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] bpf: Put perf_events check ahead of bpf prog

On 7/1/15 10:52 PM, Wangnan (F) wrote:
> I'd like to discuss with you about the correctness of our
> understanding. Do you have any strong reason to put BPF filters at such
> an early stage?

the obvious reason is performance.
It is so much faster to run generated
'if (bpf_get_current_pid() != expected_pid) return'
instead of going through __get_data_size,
perf_trace_buf_prepare, store_trace_args,
perf_trace_buf_submit->perf_tp_event_match->filter_match_preds.
bpf is the fastest way to filter out things, so it should be first.

I would argue that even for regular samples (cycle counts and so on),
we should be using this tiny bpf prog to filter by pid.
It's around 5 or so instructions that perf can always use instead
of doing the 'common_pid != expected_pid' filter. Disturbance to the
whole kernel will be much lower. Obviously there are no hooks for
bpf programs in regular samples, but I think it's worth doing.
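
For illustration, such a generated pid filter could look roughly like
the sketch below (assuming the bpf_get_current_pid_tgid() helper is
available to the program; the probe point and pid value are
placeholders):

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("kprobe/some_traced_function")
int pid_filter(struct pt_regs *ctx)
{
	/* the low 32 bits of the return value are the thread pid */
	u32 pid = (u32)bpf_get_current_pid_tgid();

	if (pid != 1234)	/* placeholder for the pid perf is tracing */
		return 0;	/* drop: no perf sample is generated */

	return 1;		/* keep: continue into the perf sample path */
}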

2015-07-02 18:37:54

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

On 7/2/15 2:24 AM, Wangnan (F) wrote:
> Yes, by using perf_trace_buf_prepare() + perf_trace_buf_submit() in
> helper function and let bpf program always returns 0 we can make data
> collected by BPF programs output into samples, if following problems
> are solved:
>
> 1. In bpf program there's no way to get 'struct perf_event' or 'struct
> ftrace_event_call'. We have to deduce them through pt_regs:
>
> pt_regs -> ip -> kprobe -> struct trace_kprobe -> struct
> ftrace_event_call -> hlist_entry -> struct perf_event

yeah, going through hash table via get_kprobe() is not pretty.
How about using this_cpu_write(current_perf_event, ...) and using it
from the helper? bpf progs are non-preemptible and non-reentrant.
Also I think this helper would be more flexible if we can
allow passing sample_type into it.
Ideally from the program one could do something like:
bpf_event_output(buf, sizeof(buf), PERF_SAMPLE_CALLCHAIN);
which will prepare a sample with raw buf and callstack.
This way the program can decide when and how to send events to user space.
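
As a rough sketch of the interface being proposed here (the helper does
not exist yet; its name, signature and exact semantics are hypothetical
at this point):

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include <uapi/linux/perf_event.h>
#include "bpf_helpers.h"

SEC("kprobe/some_traced_function")
int emit_sample(struct pt_regs *ctx)
{
	struct {
		u64 cur_time;
		u64 del_time;
	} buf = {};

	buf.cur_time = bpf_ktime_get_ns();

	/* hypothetical helper: emit a perf sample carrying 'buf' as raw
	 * data plus a call stack, chosen by the program at run time
	 * rather than fixed by the event's sample_type at creation */
	bpf_event_output(&buf, sizeof(buf), PERF_SAMPLE_CALLCHAIN);

	return 0;
}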

> 2. Even if we finally get 'struct perf_event', I'm not sure whether
> user really concern on it. If we really concern on all information
> output through perf_trace_buf_submit() like callstack and
> register, why not make bpf program return non-zero instead? But then
> we have to consider how to connect two samples together.

see my suggestion above. when sample_type is hard coded during event
creation it's a useful case on its own, but if we can make the program
provide this type dynamically, it will open a whole new set of possibilities.

2015-07-02 18:41:57

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/4] bpf: Put perf_events check ahead of bpf prog

On 7/2/15 6:50 AM, He Kuang wrote:
> When we add a kprobe point and record events by perf, the execution path
> of all threads on each cpu will enter this point, but perf may only
> record events on a particular thread or cpu at this kprobe point, a
> check on call->perf_events list filters out the threads which perf is
> not recording.

as explained in the other thread, I don't like this patch at all.
please let the discussion come to a conclusion first before resending
the same stuff over and over.

btw, I'll be off-grid till Sunday due to July 4 holidays.