Date: Wed, 20 Apr 2016 15:06:11 -0700
From: Alexei Starovoitov
To: Wang Nan
Cc: acme@kernel.org, jolsa@redhat.com, brendan.d.gregg@gmail.com,
	linux-kernel@vger.kernel.org, pi3orama@163.com,
	Arnaldo Carvalho de Melo, Alexei Starovoitov, Jiri Olsa, Li Zefan
Subject: Re: [RFC PATCH 00/13] perf tools: Support uBPF script
Message-ID: <20160420220609.GA38485@ast-mbp.thefacebook.com>
References: <1461175313-38310-1-git-send-email-wangnan0@huawei.com>
In-Reply-To: <1461175313-38310-1-git-send-email-wangnan0@huawei.com>

On Wed, Apr 20, 2016 at 06:01:40PM +0000, Wang Nan wrote:
> This patch set allows perf to invoke user space BPF scripts at certain
> points. uBPF scripts and kernel BPF scripts reside in one BPF object
> and communicate with each other through BPF maps. uBPF scripts can
> invoke helper functions provided by perf.
>
> At least the following new features can be built on top of uBPF
> support:
>
> 1) Report statistical results:
>    Like DTrace, perf prints a statistical report before it quits.
>    There is no need to extract data using 'perf report'. The
>    statistical method is controlled by the user.
>
> 2) Control perf's behavior:
>    Dynamically adjust the period of different events. The policy is
>    defined by the user.
>
> The uBPF library is required at build time. It can be found on github:
>
>   https://github.com/iovisor/ubpf.git
>
> The following is an example:
>
> Using the BPF script attached at the bottom of this commit message,
> one can print a histogram of write sizes before perf exits, like this:
>
> # ~/perf record -a -e ./test_ubpf.c &
> [1] 16800
> # dd if=/dev/zero of=/dev/null bs=512 count=5000
> 5000+0 records in
> 5000+0 records out
> 2560000 bytes (2.6 MB) copied, 0.00552838 s, 463 MB/s
> # dd if=/dev/zero of=/dev/null bs=2048 count=5000
> 5000+0 records in
> 5000+0 records out
> 10240000 bytes (10 MB) copied, 0.0188971 s, 542 MB/s
> # fg
> ^C                <--- *Press Ctrl-c*
> 2^^0: 47
> 2^^1: 13
> 2^^2: 4
> 2^^3: 130
> 2^^4: 11
> 2^^5: 1051
> 2^^6: 486
> 2^^7: 4863
> 2^^8: 0
> 2^^9: 5003
> 2^^10: 4
> 2^^11: 5003
> 2^^12: 1
> 2^^13: 0
> 2^^14: 0
> 2^^15: 0
> 2^^16: 0
> 2^^17: 0
> 2^^18: 0
> 2^^19: 0
> 2^^20: 0
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.788 MB perf.data ]
>
> Here is test_ubpf.c.
>
> /************ BEGIN ***************/
> #include <uapi/linux/bpf.h>
> #define SEC(NAME) __attribute__((section(NAME), used))
> struct bpf_map_def {
> 	unsigned int type;
> 	unsigned int key_size;
> 	unsigned int value_size;
> 	unsigned int max_entries;
> };
>
> #define BPF_ANY 0
>
> static void *(*map_lookup_elem)(struct bpf_map_def *, void *) =
> 	(void *)BPF_FUNC_map_lookup_elem;
>
> static inline unsigned int log2(unsigned int v)
> {
> 	unsigned int r;
> 	unsigned int shift;
>
> 	r = (v > 0xFFFF) << 4; v >>= r;
> 	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
> 	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
> 	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
> 	r |= (v >> 1);
> 	return r;
> }
>
> static inline unsigned int log2l(unsigned long v)
> {
> 	unsigned int hi = v >> 32;
>
> 	if (hi)
> 		return log2(hi) + 32;
> 	else
> 		return log2(v);
> }
>
> struct bpf_map_def SEC("maps") my_hist_map = {
> 	.type = BPF_MAP_TYPE_ARRAY,
> 	.key_size = sizeof(int),
> 	.value_size = sizeof(long),
> 	.max_entries = 21,
> };
>
> SEC("sys_write=sys_write count")
> int sys_write(void *ctx, int err, long write_size)
> {
> 	long *value;
> 	int key = 0;
>
> 	if (err)
> 		return 0;
>
> 	key = log2l(write_size);
> 	if (key > 20)
> 		key = 20;
> 	value = map_lookup_elem(&my_hist_map, &key);
> 	if (!value)
> 		return 0;
> 	__sync_fetch_and_add(value, 1);
> 	return 0;
> }
> char _license[] SEC("license") = "GPL";
> u32 _version SEC("version") = LINUX_VERSION_CODE;
>
> /*
>  * The following ugly magic numbers can be found in
>  * tools/perf/util/ubpf-helpers-list.h.
>  */
> static int (*ubpf_memcmp)(void *s1, void *s2, unsigned int n) = (void *)1;
> static void (*ubpf_memcpy)(void *d, void *s, unsigned int size) = (void *)2;
> static int (*ubpf_strcmp)(void *s1, void *s2) = (void *)3;
> static int (*ubpf_printf)(char *fmt, ...) = (void *)4;
> static int (*ubpf_map_lookup_elem)(void *map_desc, void *key, void *value) = (void *)5;
> static int (*ubpf_map_update_elem)(void *map_desc, void *key, void *value, unsigned long long flags) = (void *)6;
> static int (*ubpf_map_get_next_key)(void *map_desc, void *key, void *value) = (void *)7;
>
> SEC("UBPF;perf_record_end")
> int perf_record_end(int samples)
> {
> 	int i, key;
> 	long value;
> 	char fmt[] = "2^^%d: %d\n";
>
> 	for (i = 0; i < 21; i++) {
> 		ubpf_map_lookup_elem(&my_hist_map, &i, &value);
> 		ubpf_printf(fmt, i, value);
> 	}
> 	return 0;
> }

Interesting!

If bpf is used for both kernel and user side programs, we can allow
almost arbitrary C code for the user side. There is no need to be
limited to a fixed set of helpers, and there is no verifier in user
space either: just call 'printf("string")' directly. We wouldn't even
need to change the interpreter.

Also, ubpf was written from scratch under apache2, while perf is gpl,
so you can just link kernel/bpf/core.o directly instead of using
external libraries. I really meant linking the .o file compiled for the
kernel. Advertise dummy kfree/kmalloc and it will link fine, since perf
will only be calling __bpf_prog_run(), which is 99% independent of the
kernel. I used to do exactly that long ago while performance tuning the
interpreter.

Another option is to fork the interpreter for perf, but I don't like
that at all. Compiling the same bpf/core.c once for the kernel and once
for perf is another option, but imo linking core.o is easier.

In general, this set and bpf in user space overall make sense only if
we allow much more flexible C code for user space. If it's limited to
ubpf_* helpers, it will quickly become suboptimal.

Another alternative is to use luajit for user space scripting, like we
do in bcc.
That gives full flexibility with good performance. If we can do
'restricted C into bpf' for the kernel and 'full C into bpf' for user
space, that would be a great model. Note that llvm doesn't care what
the C looks like: you can call any function in C and use loops.
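
Below is a minimal sketch of what the "arbitrary helpers" point could
look like on the user side: with no verifier involved, a loader can map
BPF_CALL ids straight onto ordinary C functions, printf included. The
table, the helper signatures and resolve_helper() are hypothetical glue
for illustration only, not an existing perf or ubpf interface.

#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical user-side helper table. Helpers take up to five u64
 * arguments and return a u64, mirroring the kernel helper convention.
 */
typedef uint64_t (*user_helper_fn)(uint64_t, uint64_t, uint64_t,
				   uint64_t, uint64_t);

static uint64_t helper_printf(uint64_t fmt, uint64_t a1, uint64_t a2,
			      uint64_t a3, uint64_t a4)
{
	/* r1 carries a pointer to the program's format string. */
	return printf((const char *)(uintptr_t)fmt,
		      (unsigned long long)a1, (unsigned long long)a2,
		      (unsigned long long)a3, (unsigned long long)a4);
}

static const user_helper_fn user_helpers[] = {
	helper_printf,		/* id 0 */
};

/* Resolve a BPF_CALL immediate to a user-side helper, if one exists. */
user_helper_fn resolve_helper(unsigned int id)
{
	if (id < sizeof(user_helpers) / sizeof(user_helpers[0]))
		return user_helpers[id];
	return NULL;
}

Any C function with a compatible calling convention could be added to
the table, which is what removes the need for a fixed ubpf_* helper
list.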
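
And a minimal sketch of the core.o-linking direction, assuming
kernel/bpf/core.o is compiled into the tool and its interpreter entry
point is callable as __bpf_prog_run(ctx, insns). Both the stub list and
that prototype are assumptions here and would need to be checked
against the kernel tree actually being linked.

#include <stdlib.h>

struct bpf_insn;	/* defined by uapi/linux/bpf.h; opaque here */

/*
 * Dummy allocator stubs so core.o links in user space, as suggested
 * above. gfp flags are meaningless outside the kernel and are ignored.
 */
void *__kmalloc(size_t size, unsigned int gfp_flags)
{
	(void)gfp_flags;
	return malloc(size);
}

void kfree(const void *ptr)
{
	free((void *)ptr);
}

/* Assumed prototype of the interpreter entry point in kernel/bpf/core.c. */
extern unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn);

/*
 * Run a user-side program image through the kernel interpreter. There
 * is no verifier here, so the program may freely call whatever helpers
 * the loader wired up.
 */
unsigned int run_user_bpf(const struct bpf_insn *insns, void *ctx)
{
	return __bpf_prog_run(ctx, insns);
}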