This patch series is based on https://lkml.org/lkml/2015/5/17/84 (perf tools:
introduce 'perf bpf' command to load eBPF programs).
Previous discussion on perf bpf (Probing with local variable):
https://lkml.org/lkml/2015/5/5/260. In that patch, we tried to
generate a bpf bytecode prologue in perf; the prologue fetches
variables and places them as bpf function parameters, making it
easier to access variables in a bpf prog.
Alexei's comments:
- Argument limitation is <=3, which is OK but should be documented.
- Support it without debug info when kprobe is placed at the top
of the function.
- Make the 'config' section more concise.
Masami has mentioned:
- The redundant functionality of both userspace and kernel variable
parsing.
- The possibility of replacing the old fetch_arg functions with this
  bytecode.
I've made a new version of the userspace prologue which fixes the
problems in that RFC series (not sent yet), but while trying to
resolve Alexei's 2nd suggestion, we found it contradicts the argument
number limitation. By rough statistics, about 13.5 percent of kernel
functions have 4 or more arguments. The BPF calling convention limits
the maximum number of arguments to 5 (R1~R5); besides R1 for 'ctx',
only 4 registers are left for argument passing. It is not reasonable
to pass only the first 4 arguments when probing a function which has
more than 4 arguments.
Considering Masami's suggestion to do the work in the kernel, we found
that adding a helper function for fetching variables is an easier way
to reach our goal. We embed a trace_probe pointer into the 'ctx' passed
to the bpf prog, so we can reuse the existing kernel code for fetching
args. This is just like the 2nd suggestion, but here we do not generate
any bytecode; we use the existing call_fetch() results directly.
Example code can be found in [RFC PATCH 5/5].
Moreover, this method removes the argument number limitation imposed by
the bpf calling convention (R2-R5 for placing variables), and leaves
users free to decide whether or not to do the argument/variable
fetching; they can call the helper under whatever conditions they
choose, as the sketch below shows.
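For illustration, a minimal sketch of such conditional fetching,
following the SEC()/helper conventions of the sample in [RFC PATCH
5/5]. The CPU filter is only an example condition, and the probe spec
and offset are hypothetical:

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include <linux/version.h>
#include "bpf_helpers.h"

/* local wrapper; BPF_FUNC_get_smp_processor_id is in uapi/linux/bpf.h */
static int (*bpf_get_smp_processor_id)(void) =
	(void *) BPF_FUNC_get_smp_processor_id;

SEC("generic_perform_write=generic_perform_write+122 bytes pos")
int NODE_cond_fetch(struct pt_regs *ctx)
{
	struct param_s {
		unsigned long bytes;
		unsigned long pos;
	} param = {0};

	/* the prog decides when fetching is worth the cost;
	 * here: only look at events hitting CPU 0 */
	if (bpf_get_smp_processor_id() != 0)
		return 0;

	bpf_fetch_args(ctx, &param);

	{
		char fmt[] = "bytes=%p pos=%p\n";

		bpf_trace_printk(fmt, sizeof(fmt),
				 param.bytes, param.pos);
	}

	return 1;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;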
Also need to note:
- We could add syntactic sugar which converts the 'structure
  param' into function args; this would reduce the users' extra work
  (see the sketch after this list).
- An extra verification needs to be implemented to make sure the user
  provides enough space for argument fetching.
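A rough sketch of what such sugar could look like. The
BPF_FETCH_PARAMS4 name is made up for illustration; nothing like it
exists in this series yet:

/* hypothetical sugar: declares the 'structure param' and fetches
 * into it in one go, so users don't write the struct by hand */
#define BPF_FETCH_PARAMS4(ctx, a, b, c, d)			\
	struct { unsigned long a, b, c, d; } param = {0};	\
	bpf_fetch_args(ctx, &param)

/* usage: BPF_FETCH_PARAMS4(ctx, a_ops, bytes, page, pos);
 * then read param.a_ops, param.bytes, ... */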
This method's pros & cons:
pros:
- Removes the argument number limitation.
- Users are free to choose whether or not to do the fetch and to
  decide where to execute it.
- Removes the redundant kernel/userspace arg-parsing functionality.
cons:
- Users must add the 'structure param' code themselves.
Looking forward to discussion.
He Kuang (5):
perf bpf: Add -k option for testing convenience
bpf: Pass trace_probe to bpf_prog for variable fetching
bpf: Add helper function for fetching variables at probe point
samples/bpf: Add proper prefix to objects in Makefile
samples/bpf: Add sample for testing bpf fetch args
include/uapi/linux/bpf.h | 1 +
kernel/trace/bpf_trace.c | 38 ++++++++++++++++++++++++++++++++
kernel/trace/trace_kprobe.c | 11 ++++++++--
kernel/trace/trace_probe.h | 5 +++++
samples/bpf/Makefile | 3 ++-
samples/bpf/bpf_helpers.h | 2 ++
samples/bpf/sample_bpf_fetch_args.c | 43 +++++++++++++++++++++++++++++++++++++
tools/perf/builtin-bpf.c | 3 +++
8 files changed, 103 insertions(+), 3 deletions(-)
create mode 100644 samples/bpf/sample_bpf_fetch_args.c
--
1.8.5.2
Add a -k option to the 'perf bpf' command to specify the vmlinux
pathname, for testing convenience.
Signed-off-by: He Kuang <[email protected]>
---
tools/perf/builtin-bpf.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/perf/builtin-bpf.c b/tools/perf/builtin-bpf.c
index 4ef294a..9ea34b3 100644
--- a/tools/perf/builtin-bpf.c
+++ b/tools/perf/builtin-bpf.c
@@ -11,6 +11,7 @@
#include "builtin.h"
#include "perf.h"
#include "debug.h"
+#include "util/symbol.h"
#include "parse-options.h"
#include "bpf-loader.h"
@@ -30,6 +31,8 @@ static struct bpf_cmd bpf_cmds[];
struct option bpf_options[] = {
OPT_INCR('v', "verbose", &verbose, "be more verbose "
"(show debug information)"),
+ OPT_STRING('k', "vmlinux", &symbol_conf.vmlinux_name,
+ "file", "vmlinux pathname"),
OPT_END()
};
--
1.8.5.2
Add a new structure, bpf_pt_regs, which contains both the original
'ctx' (pt_regs) and a trace_probe pointer, and pass this new pointer to the bpf
prog for variable fetching.
Signed-off-by: He Kuang <[email protected]>
---
kernel/trace/trace_kprobe.c | 11 +++++++++--
kernel/trace/trace_probe.h | 5 +++++
2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d0ce590..cee0b28 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1141,8 +1141,15 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
int size, __size, dsize;
int rctx;
- if (prog && !trace_call_bpf(prog, regs))
- return;
+ if (prog) {
+ struct bpf_pt_regs bpf_pt_regs;
+
+ bpf_pt_regs.pt_regs = *regs;
+ bpf_pt_regs.tp = &tk->tp;
+
+ if (!trace_call_bpf(prog, &bpf_pt_regs))
+ return;
+ }
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index ab283e1..5b1f12c 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -391,4 +391,9 @@ store_trace_args(int ent_size, struct trace_probe *tp, struct pt_regs *regs,
}
}
+struct bpf_pt_regs {
+ struct pt_regs pt_regs;
+ struct trace_probe *tp;
+};
+
extern int set_print_fmt(struct trace_probe *tp, bool is_return);
--
1.8.5.2
This helper function uses the kernel structure trace_probe and the related
fetch functions to fetch the variables described in 'SEC' onto the bpf stack.
Signed-off-by: He Kuang <[email protected]>
---
include/uapi/linux/bpf.h | 1 +
kernel/trace/bpf_trace.c | 38 ++++++++++++++++++++++++++++++++++++++
samples/bpf/bpf_helpers.h | 2 ++
3 files changed, 41 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a9ebdf5..b1a7685 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -210,6 +210,7 @@ enum bpf_func_id {
* Return: 0 on success
*/
BPF_FUNC_l4_csum_replace,
+ BPF_FUNC_fetch_args,
__BPF_FUNC_MAX_ID,
};
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 2d56ce5..ba601da 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -12,6 +12,7 @@
#include <linux/uaccess.h>
#include <linux/ctype.h>
#include "trace.h"
+#include "trace_probe.h"
static DEFINE_PER_CPU(int, bpf_prog_active);
@@ -159,6 +160,39 @@ static const struct bpf_func_proto bpf_trace_printk_proto = {
.arg2_type = ARG_CONST_STACK_SIZE,
};
+/* Store the value of each argument */
+static void
+bpf_store_trace_args(struct pt_regs *regs, struct trace_probe *tp,
+ u8 *data)
+{
+ int i;
+
+ for (i = 0; i < tp->nr_args; i++) {
+ /* Just fetching data normally */
+ call_fetch(&tp->args[i].fetch, regs,
+ data + tp->args[i].offset);
+ }
+}
+
+static u64 bpf_fetch_args(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct pt_regs *regs = (struct pt_regs *)(long)r1;
+ struct trace_probe *tp = ((struct bpf_pt_regs *)regs)->tp;
+ void *data = (void *)(long)r2;
+
+ bpf_store_trace_args(regs, tp, data);
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_fetch_args_proto = {
+ .func = bpf_fetch_args,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_STACK,
+};
+
static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -181,6 +215,10 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
trace_printk_init_buffers();
return &bpf_trace_printk_proto;
+
+ case BPF_FUNC_fetch_args:
+ return &bpf_fetch_args_proto;
+
default:
return NULL;
}
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f960b5f..578a8e3 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -21,6 +21,8 @@ static unsigned long long (*bpf_ktime_get_ns)(void) =
(void *) BPF_FUNC_ktime_get_ns;
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
(void *) BPF_FUNC_trace_printk;
+static int (*bpf_fetch_args)(void *ctx, void *data) =
+ (void *) BPF_FUNC_fetch_args;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
--
1.8.5.2
Always use $(obj) when referring to generated files and use $(src) when
referring to files located in the src tree.
Signed-off-by: He Kuang <[email protected]>
---
samples/bpf/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 76e3458..8fdbd73 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,7 +44,7 @@ HOSTLOADLIBES_tracex4 += -lelf -lrt
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
-%.o: %.c
+$(obj)/%.o: $(src)/%.c
clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
--
1.8.5.2
Sample code for testing bpf fetch args.
It works as follows:
$ perf bpf record --object sample_bpf_fetch_args.o -- dd if=/dev/zero of=/mnt/data/test bs=4k count=3
Show the result in the ring buffer:
$ perf script
dd 1088 [000] 5740.260451: perf_bpf_probe:generic_perform_write: (ffffffff811308ea) a_ops=0xffffffff81a20160 bytes=0x1000 page=0xffff88007c621540 pos=0
dd 1088 [000] 5740.260451: perf_bpf_probe:generic_perform_write: (ffffffff811308ea) a_ops=0xffffffff81a20160 bytes=0x1000 page=0xffffea0001c49f40 pos=4096
dd 1088 [000] 5740.260451: perf_bpf_probe:generic_perform_write: (ffffffff811308ea) a_ops=0xffffffff81a20160 bytes=0x1000 page=0xffffea0001c49f80 pos=8192
Show the result from the bpf prog:
$ cat /sys/kernel/debug/tracing/trace |grep dd
dd-1098 [000] d... 6892.829003: : NODE_write1 a_ops=ffffffff81a20160, bytes=0000000000001000
dd-1098 [000] d... 6892.829049: : NODE_write2 page =ffff88007c621540, pos = (null)
dd-1098 [000] d... 6892.829650: : NODE_write1 a_ops=ffffffff81a20160, bytes=0000000000001000
dd-1098 [000] d... 6892.829662: : NODE_write2 page =ffffea0001c49f40, pos =0000000000001000
dd-1098 [000] d... 6892.829831: : NODE_write1 a_ops=ffffffff81a20160, bytes=0000000000001000
dd-1098 [000] d... 6892.829842: : NODE_write2 page =ffffea0001c49f80, pos =0000000000002000
Signed-off-by: He Kuang <[email protected]>
---
samples/bpf/Makefile | 1 +
samples/bpf/sample_bpf_fetch_args.c | 43 +++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)
create mode 100644 samples/bpf/sample_bpf_fetch_args.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 8fdbd73..dc0b0e8 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -30,6 +30,7 @@ always += tracex2_kern.o
always += tracex3_kern.o
always += tracex4_kern.o
always += tcbpf1_kern.o
+always += sample_bpf_fetch_args.o
HOSTCFLAGS += -I$(objtree)/usr/include
diff --git a/samples/bpf/sample_bpf_fetch_args.c b/samples/bpf/sample_bpf_fetch_args.c
new file mode 100644
index 0000000..9b587df
--- /dev/null
+++ b/samples/bpf/sample_bpf_fetch_args.c
@@ -0,0 +1,43 @@
+/*
+ Sample code for bpf_fetch_args().
+*/
+
+#include <linux/writeback.h>
+#include <linux/blkdev.h>
+
+#include <uapi/linux/bpf.h>
+#include <linux/version.h>
+#include "bpf_helpers.h"
+
+SEC("generic_perform_write=generic_perform_write+122 file->f_mapping->a_ops bytes page pos")
+int NODE_generic_perform_write(struct pt_regs *ctx)
+{
+ struct param_s {
+ unsigned long a_ops;
+ unsigned long bytes;
+ unsigned long page;
+ unsigned long pos;
+ } param = {0};
+
+	bpf_fetch_args(ctx, &param);
+
+ /* actions */
+ {
+ /* 5 args max for bpf_trace_printk, print in 2 lines */
+ char fmt1[] = "NODE_write1 a_ops=%p, bytes=%p\n";
+ char fmt2[] = "NODE_write2 page =%p, pos =%p\n";
+
+ bpf_trace_printk(fmt1, sizeof(fmt1),
+ param.a_ops,
+ param.bytes);
+
+ bpf_trace_printk(fmt2, sizeof(fmt2),
+ param.page,
+ param.pos);
+ }
+
+ return 1;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
--
1.8.5.2
On 5/17/15 10:30 PM, He Kuang wrote:
> Add a new structure, bpf_pt_regs, which contains both the original
> 'ctx' (pt_regs) and a trace_probe pointer, and pass this new pointer to the bpf
> prog for variable fetching.
>
> Signed-off-by: He Kuang <[email protected]>
> ---
> kernel/trace/trace_kprobe.c | 11 +++++++++--
> kernel/trace/trace_probe.h | 5 +++++
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index d0ce590..cee0b28 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -1141,8 +1141,15 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
> int size, __size, dsize;
> int rctx;
>
> - if (prog && !trace_call_bpf(prog, regs))
> - return;
> + if (prog) {
> + struct bpf_pt_regs bpf_pt_regs;
> +
> + bpf_pt_regs.pt_regs = *regs;
> + bpf_pt_regs.tp = &tk->tp;
...
> +struct bpf_pt_regs {
> + struct pt_regs pt_regs;
> + struct trace_probe *tp;
> +};
That is a massive overhead:
on x64 it means copying 168 bytes for every call.
IMO that's the wrong trade-off.
On 5/17/15 10:30 PM, He Kuang wrote:
> This helper function uses the kernel structure trace_probe and the related
> fetch functions to fetch the variables described in 'SEC' onto the bpf stack.
>
> Signed-off-by: He Kuang <[email protected]>
...
> +/* Store the value of each argument */
> +static void
> +bpf_store_trace_args(struct pt_regs *regs, struct trace_probe *tp,
> + u8 *data)
> +{
> + int i;
> +
> + for (i = 0; i < tp->nr_args; i++) {
> + /* Just fetching data normally */
> + call_fetch(&tp->args[i].fetch, regs,
> + data + tp->args[i].offset);
That is slower than generating bpf in user space, but more importantly,
it's invalid: there is no size check.
r2 in fetch_args points to the stack, but nothing checks the stack limits.
You need to add code here to check it dynamically,
which will add runtime overhead as well.
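A minimal sketch of the kind of check being asked for, assuming the
helper grows a third 'size' argument (hypothetical; the series as
posted has no such argument):

static u64 bpf_fetch_args(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
{
	struct pt_regs *regs = (struct pt_regs *)(long)r1;
	struct trace_probe *tp = ((struct bpf_pt_regs *)regs)->tp;
	void *data = (void *)(long)r2;
	u32 size = (u32)r3;	/* buffer size supplied by the prog */

	/* refuse to fetch when the stack buffer is too small */
	if (tp->size > size)
		return -EINVAL;

	bpf_store_trace_args(regs, tp, data);
	return 0;
}

The proto would then pair ARG_PTR_TO_STACK for r2 with
ARG_CONST_STACK_SIZE for r3, so the verifier can prove the stack
buffer really is 'size' bytes, the same pattern bpf_probe_read uses.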
Your first approach of generating argument accessors in user space
was better.
I think the limit of 3 or 4 arguments was fine.
We need to generate the code for the non-debug case anyway,
like my earlier suggestion:
SEC("kprobe/generic_perform_write(void*, void*, long long)")
without debug info it will copy ctx->di into r2, ctx->si into r3
and ctx->dx into r4.
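For reference, a sketch of what such a generated prologue could look
like on x86-64, using the kernel's BPF instruction macros (illustrative
only, not code from this series):

#include <linux/filter.h>
#include <asm/ptrace.h>

/* copy the first three function arguments out of pt_regs (ctx in R1)
 * into R2..R4 before falling through into the user-written prog */
struct bpf_insn prologue[] = {
	BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
		    offsetof(struct pt_regs, di)),
	BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
		    offsetof(struct pt_regs, si)),
	BPF_LDX_MEM(BPF_DW, BPF_REG_4, BPF_REG_1,
		    offsetof(struct pt_regs, dx)),
};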
On 5/17/15 10:30 PM, He Kuang wrote:
> Always use $(obj) when referring to generated files and use $(src) when
> referring to files located in the src tree.
>
> Signed-off-by: He Kuang <[email protected]>
> ---
> samples/bpf/Makefile | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 76e3458..8fdbd73 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -44,7 +44,7 @@ HOSTLOADLIBES_tracex4 += -lelf -lrt
> # point this to your LLVM backend with bpf support
> LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
>
> -%.o: %.c
> +$(obj)/%.o: $(src)/%.c
> clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
> -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
> -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
the same fix is already in net-next:
https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=b88c06e36dcb9b4ae285f7821f62d68dc34b25d3
On 2015/05/18 14:30, He Kuang wrote:
> This patch series is based on https://lkml.org/lkml/2015/5/17/84 (perf tools:
> introduce 'perf bpf' command to load eBPF programs).
>
> Previous discussion on perf bpf (Probing with local variable):
> https://lkml.org/lkml/2015/5/5/260. In that patch, we tried to
> generate a bpf bytecode prologue in perf; the prologue fetches
> variables and places them as bpf function parameters, making it
> easier to access variables in a bpf prog.
>
> Alexei's comments:
>
> - Argument limitation is <=3, which is OK but should be documented.
> - Support it without debug info when kprobe is placed at the top
> of the function.
> - Make the 'config' section more concise.
>
> Masami has mentioned:
>
> - The redundant functionality of both userspace and kernel variable
> parsing.
> - The possibility of replacing the old fetch_arg functions with this
> bytecode.
>
> I've made a new version of the userspace prologue which fixes the
> problems in that RFC series (not sent yet), but while trying to
> resolve Alexei's 2nd suggestion, we found it contradicts the argument
> number limitation. By rough statistics, about 13.5 percent of kernel
> functions have 4 or more arguments. The BPF calling convention limits
> the maximum number of arguments to 5 (R1~R5); besides R1 for 'ctx',
> only 4 registers are left for argument passing. It is not reasonable
> to pass only the first 4 arguments when probing a function which has
> more than 4 arguments.
>
> Considering Masami's suggestion to do the work in the kernel, we found
> that adding a helper function for fetching variables is an easier way
> to reach our goal. We embed a trace_probe pointer into the 'ctx' passed
> to the bpf prog, so we can reuse the existing kernel code for fetching
> args. This is just like the 2nd suggestion, but here we do not generate
> any bytecode; we use the existing call_fetch() results directly.
> Example code can be found in [RFC PATCH 5/5].
Hmm, what I suggested was optimizing the call_fetch methods with BPF;
yours seems to be the opposite. Since BPF can be JIT-compiled into x86
native instructions, it is much faster than the current call-chain
fetch method.
I'm still not sure all the fetch methods can be covered with BPF, e.g.
fetching a bitfield requires bitmask and bitshift ops (see the sketch
below).
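For illustration, the bitfield fetch boils down to a mask-and-shift
like the C below; a BPF translation would need the equivalent AND/RSH
instructions (names and parameters are hypothetical):

/* extract an 'nbits'-wide field starting at bit 'shift' */
static unsigned long fetch_bitfield(unsigned long word,
				    unsigned int shift,
				    unsigned int nbits)
{
	return (word >> shift) & ((1UL << nbits) - 1);
}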
Thank you,
--
Masami HIRAMATSU
Linux Technology Research Center, System Productivity Research Dept.
Center for Technology Innovation - Systems Engineering
Hitachi, Ltd., Research & Development Group
E-mail: [email protected]