Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751583AbbBKGd2 (ORCPT ); Wed, 11 Feb 2015 01:33:28 -0500 Received: from mail-qc0-f172.google.com ([209.85.216.172]:38359 "EHLO mail-qc0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751223AbbBKGd0 (ORCPT ); Wed, 11 Feb 2015 01:33:26 -0500 MIME-Version: 1.0 From: Alexei Starovoitov Date: Tue, 10 Feb 2015 22:33:05 -0800 Message-ID: Subject: Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls To: Steven Rostedt Cc: Ingo Molnar , Namhyung Kim , Arnaldo Carvalho de Melo , Jiri Olsa , Masami Hiramatsu , Linux API , Network Development , LKML , Linus Torvalds , Peter Zijlstra , "Eric W. Biederman" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5852 Lines: 127 On Tue, Feb 10, 2015 at 8:31 PM, Steven Rostedt wrote: >> > Again, this would mean they become invisible to ftrace, and even >> > ftrace_dump_on_oops. >> >> yes, since these new tracepoints have no meat inside them. >> They're placeholders sitting idle and waiting for bpf to do >> something useful with them. > > Hmm, I have a patch somewhere (never posted it), that add > TRACE_MARKER(), which basically would just print that it was hit. But > no data was passed to it. Something like that I would be more inclined > to take. Then the question is, what can bpf access there? Could just be > a place holder to add a "fast kprobe". This way, it can show up in > trace events with enable and and all that, it just wont have any data > to print out besides the normal pid, flags, etc. > > But because it will inject a nop, that could be converted to a jump, it > will give you the power of kprobes but with the speed of a tracepoint. fair enough. Something like TRACE_MARKER(arg1, arg2) that prints it was hit without accessing the args would be enough. Without any args it is indeed a 'fast kprobe' only. Debug info would still be needed to access function arguments. On x64 function entry point and x64 abi make it easy to access args, but i386 or kprobe in the middle lose visibility when debug info is not available. TRACE_MARKER (with few key args that function is operating on) is enough to achieve roughly the same as kprobe without debug info. >> > I'm not fully understanding what is to be exported by this new ABI. If >> > the fields available, will always be available, then why can't the >> > appear in a TP_printk()? >> >> say, we define trace_netif_rx_entry() as this new tracepoint_bpf. >> It will have only one argument 'skb'. >> bpf program will read and print skb fields the way it likes >> for particular tracing scenario. >> So instead of making >> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p >> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x >> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u >> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d >> gso_type=%#x",... >> the abi exposed via trace_pipe (as it is today), >> the new tracepoint_bpf abi is presence of 'skb' pointer as one >> and only argument to bpf program. >> Future refactoring of netif_rx would need to guarantee >> that trace_netif_rx_entry(skb) is called. that's it. >> imo such tracepoints are much easier to deal with during >> code changes. > > But what can you access from that skb that is guaranteed to be there? > If you say anything, then there's no reason it can't be added to the > printk as well. programs can access any field via bpf_fetch*() helpers which make them kernel layout dependent or via user-ized sk_buff with few fields which is portable. In both cases kernel/user abi is only 'skb' pointer. whether it's debugging program that needs full access via fetch* helpers or portable program that uses stable api it's up to program author. Just like kprobes, it's clear, that if program is using fetch* helpers it's doing it without any abi guarantees. 'perf probe' and 'bpf with fetch* helpers are the same. perf probe creates wrappers on top of probe_kernel_read and bpf_fetch* helpers are wrappers on top of probe_kernel_read. Complains that 'my kprobe with flags=%cx mode=+4($stack) stopped working in new kernel' are equivalent to complains that program with bpf_fetch* stopped working. Whereas if program is using user-ized structs it will work across kernel versions, though it will be able to see only very limited slice of in-kernel data. >> May be some of the existing tracepoints like this one that >> takes one argument can be marked 'bpf-ready', so that >> programs can attach to them only. > > I really hate the idea of adding tracepoints that ftrace can't use. It > basically kills the entire busy box usage scenario, as boards that have > extremely limited userspace still make full use of ftrace via the > existing tracepoints. agree. I think your trace_marker with few args is a good middle ground. > I still don't see the argument that adding data via the bpf functions > is any different than adding those same items to fields in an event. > Once you add a bpf function, then you must maintain those fields. > > Look, you can always add more to a TP_printk(), as that is standard > with all text file kernel parsing. Like stat in /proc. Fields can not > be removed, but more can always be appended to the end. > > Any tool that parses trace_pipe is broken if it can't handle extended > fields. The api can be extended, and for text files, that is by > appending to them. I agree that any text parsing script should be able to cope with additional args without problems. I think it's a fear of <1% breakage is causing maintainers to avoid any changes to tracepoints even when they just add few args to the end of TP_printk. When tracepoints stop printing and the only thing they see is single pointer to a well known struct like sk_buff, this fear of tracepoints should fade. programs are not part of the kernel, so whatever they do and print is not our headache. We only make sure that interface between kernel and programs is stable. In other words kernel ABI is what kernel exposes to user space and to bpf programs. Though programs are run inside the kernel what they do it outside of kernel abi. So when program prints fields is not our problem, whereas when tracepoint prints fields it's kernel abi. ps I'll be traveling for the next few weeks, so apologize in advance for slow response. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/