Date: Wed, 20 Dec 2017 18:03:26 -0800
From: Alexei Starovoitov
To: Peter Zijlstra
Cc: Teng Qin, mingo@redhat.com, bgregg@netflix.com, daniel@iogearbox.net,
    linux-kernel@vger.kernel.org, kernel-team@fb.com, netdev@vger.kernel.org,
    "David S. Miller", brouer@redhat.com
Subject: tracepoint_probe_register and bpf. Was: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
Message-ID: <20171221020323.6b6655547vjyf53o@ast-mbp>
In-Reply-To: <20171218091157.rgogahbflgwvwcdw@hirez.programming.kicks-ass.net>

On Mon, Dec 18, 2017 at 10:11:57AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 09:09:51AM -0800, Alexei Starovoitov wrote:
>
> > yeah. Currently bpf progs are called at the end of
> > perf_trace_##call()
> > {
> >   .. regular tracepoint copy craft
> >   perf_trace_run_bpf_submit( &copied args )
> > }
> >
> > from bpf pov we'd rather get access to raw args passed into
> > perf_trace_##call.
> > Sounds like you're suggesting to let bpf side register its
> > progs directly via tracepoint_probe_register() ?
> > That would solve the whole thing really nicely indeed.
>
> Not sure I thought that far; but if you want the probe arguments either
> grab them from the perf probe you're running in or indeed register your
> own, don't play silly buggers and copy them in the trace data.

So I've tried both approaches:

1. Grab the arguments from:
   perf_trace_##call(void *__data, proto)

   It works OK, but it adds 50-70 bytes of .text to every perf_trace_*
   function, which, multiplied by the number of tracepoints (~500), adds
   ~30k of .text to vmlinux in total. Not too bad, but it also adds an
   extra branch to the critical execution path, so it's not ideal.

2. Register our own probe.

   This approach, I think, is much better. The trick is the following:

#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
/* no 'static' here. The bpf probe functions are global */		\
notrace void								\
__bpf_trace_##call(void *__data, proto)					\
{									\
	struct trace_event_call *event_call = __data;			\
									\
	__FN_COUNT(bpf_trace_run, args)(event_call, CAST_TO_U64(args));\
}

where CAST_TO_U64 is a macro that casts all args to u64, one by one
(a toy illustration of the macro machinery follows).
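To make the trick concrete, here is a standalone userspace sketch of how
such macros can work. This is my own approximation, not the actual patch:
the helper names (COUNT_ARGS, PASTE, the __CASTn family) and the
sign-extension handling in the real code may well differ.

/* toy demo of the arg-count + cast-to-u64 trick; compiles with gcc */
#include <stdio.h>

typedef unsigned long long u64;

/* pick the number of args (capped at 3 here; the real list is longer) */
#define COUNT_ARGS(...) _COUNT_ARGS(__VA_ARGS__, 3, 2, 1)
#define _COUNT_ARGS(_1, _2, _3, N, ...) N

/* two-step paste so COUNT_ARGS() expands before ## */
#define PASTE_(a, b) a##b
#define PASTE(a, b) PASTE_(a, b)

/* __FN_COUNT(bpf_trace_run, a, b) -> bpf_trace_run2 */
#define __FN_COUNT(fn, ...) PASTE(fn, COUNT_ARGS(__VA_ARGS__))

/* cast every arg to u64, one by one */
#define __CAST1(a)      (u64)(a)
#define __CAST2(a, ...) (u64)(a), __CAST1(__VA_ARGS__)
#define __CAST3(a, ...) (u64)(a), __CAST2(__VA_ARGS__)
#define CAST_TO_U64(...) PASTE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)

/* stand-in for the kernel's bpf_trace_run2() */
static void bpf_trace_run2(void *call, u64 arg1, u64 arg2)
{
	printf("bpf_trace_run2(%p, %llu, %llu)\n", call, arg1, arg2);
}

int main(void)
{
	int ifindex = 7;
	unsigned int act = 2;

	/* expands to: bpf_trace_run2((void *)0, (u64)(ifindex), (u64)(act)); */
	__FN_COUNT(bpf_trace_run, ifindex, act)((void *)0,
						CAST_TO_U64(ifindex, act));
	return 0;
}

Everything is resolved by the preprocessor, so each __bpf_trace_##call
body compiles down to a bare call with all arguments widened to u64.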
The bpf_trace_run*() functions are defined as:

void bpf_trace_run1(struct trace_event_call *call, u64 arg1);
void bpf_trace_run2(struct trace_event_call *call, u64 arg1, u64 arg2);
void bpf_trace_run3(struct trace_event_call *call, u64 arg1, u64 arg2,
		    u64 arg3);

so in assembler a probe looks like:

(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0xffffffff81132080 <+0>:	mov    %ecx,%ecx
   0xffffffff81132082 <+2>:	jmpq   0xffffffff811231f0 <bpf_trace_run3>

where the tracepoint is declared as:

TRACE_EVENT(xdp_exception,
	TP_PROTO(const struct net_device *dev,
		 const struct bpf_prog *xdp, u32 act),

The assembler snippet above casts the 32-bit 'act' field to u64 (the
'mov %ecx,%ecx' zero-extends it) before jumping into bpf_trace_run3();
all other arguments are passed through in registers as-is.

So all ~500 __bpf_trace_*() functions are only 5-10 _bytes_ long, and in
total this approach adds 7k bytes to .text and 8k bytes to .rodata. The
.rodata cost comes from making these functions appear in kallsyms, which
saves me from adding more 8-byte fields to 'struct trace_event_class'
and 'struct trace_event_call'.

This approach, I think, has the lowest overhead we can possibly get
while calling trace_xdp_exception() from kernel C code and transitioning
into bpf land. Since many folks are starting to use tracepoint+bpf at
speeds of 1M+ events per second, this would be a very valuable
optimization.

Diff stat so far:

 drivers/gpu/drm/i915/i915_trace.h | 13 ++++++++++---
 fs/btrfs/super.c                  |  4 ++++
 include/linux/trace_events.h      | 27 +++++++++++++++++++++++++++
 include/trace/bpf_probe.h         | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/trace/define_trace.h      |  1 +
 include/uapi/linux/bpf.h          |  4 ++++
 kernel/trace/bpf_trace.c          | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 191 insertions(+), 3 deletions(-)

Since I'm not touching anything on the ftrace or perf side, I'm thinking
of extending the BPF_PROG_ATTACH command to specify which tracepoint to
attach to. The user space will look like:

prog_fd = bpf_prog_load(...);
bpf_prog_attach(prog_fd, "xdp_exception");

On the kernel side I will walk kallsyms to find the
__tracepoint_xdp_exception record, check that tp->name ==
"xdp_exception", then find the __bpf_trace_xdp_exception() address (also
in kallsyms), and finally use tracepoint_probe_register() to connect the
two. A rough sketch of that attach path is below.

Thoughts?
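For concreteness, the kernel-side attach could look roughly like this.
A sketch only: bpf_probe_attach() is a made-up name, and locking, module
references, and the BPF_PROG_ATTACH plumbing are all left out.

#include <linux/kernel.h>
#include <linux/kallsyms.h>
#include <linux/tracepoint.h>
#include <linux/trace_events.h>
#include <linux/string.h>
#include <linux/errno.h>

static int bpf_probe_attach(const char *tp_name,
			    struct trace_event_call *event_call)
{
	char sym[KSYM_NAME_LEN];
	struct tracepoint *tp;
	void *bpf_func;

	/* find the tracepoint record, e.g. __tracepoint_xdp_exception */
	snprintf(sym, sizeof(sym), "__tracepoint_%s", tp_name);
	tp = (struct tracepoint *)kallsyms_lookup_name(sym);
	if (!tp || strcmp(tp->name, tp_name))
		return -ENOENT;

	/* find the matching probe, e.g. __bpf_trace_xdp_exception */
	snprintf(sym, sizeof(sym), "__bpf_trace_%s", tp_name);
	bpf_func = (void *)kallsyms_lookup_name(sym);
	if (!bpf_func)
		return -ENOENT;

	/* connect the two; event_call arrives as __data in __bpf_trace_##call */
	return tracepoint_probe_register(tp, bpf_func, event_call);
}

Going through kallsyms trades two slow-path symbol lookups per attach
for not growing 'struct trace_event_call', which seems like the right
trade since attach is rare and the fast path stays untouched.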