From: Alexei Starovoitov
Date: Fri, 6 Feb 2015 12:23:14 -0800
Subject: Re: [RFC 0/2] Experience on full stack profiling using eBPF.
To: He Kuang
Cc: wangnan0@huawei.com, Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, LKML

On Thu, Feb 5, 2015 at 10:56 PM, He Kuang wrote:
>
> Our goal is to profile production systems by analyzing end-to-end
> traces of particular procedures: a full IO request from the
> sys_write() system call down to the underlying disk driver, a full
> network operation from sys_sendto() down to the NIC driver, and even a
> full database query from client initiation through server processing
> to client finalization. To avoid heavy tracing overhead and
> overwhelming data, we randomly sample a few procedures out of a huge
> number, trace only the events caused by them at the different layers,
> and let the other procedures run normally. As a result, we record as
> few traces as possible for our profiling, with minimal effect on
> system performance.

That all makes sense. The ability to do what you're describing above is
exactly the goal of the bpf+tracing infrastructure.

> The resulting trace log is less than 20 lines; it would be more than
> 10,000 lines without sampling and conditions. We can easily derive a
> graph like this from the trace above:
>
> S1/2/3 are samples.
>
>                  S1 S2      S3
>                  |  |       |
> (vfs)pagecache  ----------------------------------------> timeline
>                  S1 S2      S3
>                  |  |       |
> (block)request  ----------------------------------------> timeline
>                  S1 S2      S3
>                  |  |       |
> bio_end         ----------------------------------------> timeline

Nice. It would be great if your tool were developed in a public repo
that everyone can see and contribute to.

> During our work we found places where eBPF could be improved to make
> our life easier:
>
> 1. A BPF prog cannot use the div/mod operators, which are useful for
> sampling.

Hmm, there are several tests in test_bpf.ko that check div/mod ops.
I just tried it with a simple C example and it worked fine as well.
Please share the example that caused issues.

> 2. BPF is a filter; the output trace goes to the trace pipe, but
> sometimes we need structured trace output like perf.data, so we can
> record more in the filter function. A BPF array table is a good place
> to record data, but the index is restricted to 4 bytes.

I think 4B elements in an array is more than is practically usable.
The array map is used in cases where one needs very fast access to a
few elements; the hash map is for large keys/values (though the number
of elements is also limited to 4B).
Neither keys nor values have to be scalar types. For a hash map one can
use a struct of multiple integers as the key and as the value. For an
array map the key has to be u32, but the value can be a struct of any
size.

Btw, in the next version of my tracing patches I've changed the
attachment point to use perf_event instead of the tracefs interface,
so all non-dropped events will go into perf.data if user space cares
to accept them.
Independently of that, I'm planning to add a 'streaming' interface in
addition to maps: a generalized trace/perf buffer, so that programs can
stream interesting events to user space in an arbitrary format.

> 3. Though most events are filtered out, the eBPF program attached to
> the tracepoints still runs for every execution.
> We should find a way to minimize the performance overhead for the
> events that are filtered out.

Well, yes, the program does run for every event, because the program
itself does the filtering :) When a bpf program returns 0, the
tracepoint handler returns immediately. The cost of running a
'return 0' bpf program is 2% according to my measurements with 'dd'
(I will post all the data and a new version of the patches soon).

> 4. Provide a more user-friendly interface for users.

Sure, please suggest :)

> Thanks for your work.

Thanks a lot for trying it out and providing feedback!

Some tracepoints from patch 1 seem redundant. For example, there is
already a tracepoint for trace_block_bio_backmerge(); why add
trace_bio_attempt_back_merge()? That's for the block layer maintainers
to decide anyway. IMO adding tracepoints in 'strategic' places makes a
lot of sense; we just need to think through their placement.