Message-ID: <55499CB0.1090400@huawei.com>
Date: Wed, 6 May 2015 12:46:40 +0800
From: Wang Nan
To: Alexei Starovoitov
CC: Li Zefan
Subject: Re: [RFC PATCH 00/22] perf tools: introduce 'perf bpf' command to load eBPF programs.
References: <1430391165-30267-1-git-send-email-wangnan0@huawei.com> <554302F0.3070101@plumgrid.com> <55447A7D.4000205@huawei.com> <554832AA.5050503@plumgrid.com> <55484A11.7070603@huawei.com> <554859CD.4090206@plumgrid.com> <55485FBF.20306@huawei.com>
In-Reply-To: <55485FBF.20306@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Alexei Starovoitov,

Have you had a chance to read this mail? I'm very interested in triggering a perf
sample from BPF code. You said it is not a problem. Could you please give me some
further information?

Thank you.

On 2015/5/5 14:14, Wang Nan wrote:
> On 2015/5/5 13:49, Alexei Starovoitov wrote:
>> On 5/4/15 9:41 PM, Wang Nan wrote:
>>>
>>> That's great. Could you please append the description of 'llvm -s' into your README
>>> or comments? It has cost me a lot of time for dumping eBPF instructions so I decide to
>>> add it into perf...
>>
>> sure. it's just -filetype=asm flag to llc instead of -filetype=obj.
>> Eventually it will work as normal 'clang -S file.c' when few more
>> llvm commits are accepted upstream.
>>
>>>>> My colleague He Kuang is working on variable accessing. Probing inside function body
>>>>> and accessing its local variable will be supported like this:
>>>>>
>>>>> SEC("config") char _prog_config[] = "prog: func_name:1234 vara=localvara";
>>>>> int prog(struct pt_regs *ctx, unsigned long vara) {
>>>>>     // vara is the value of localvara of function func_name
>>>>> }
>>>>
>>>> that would be great. I'm not sure though how you can achieve that
>>>> without changing C front-end ?
>>>
>>> It's not very difficult. He is trying to generate the loader of vara
>>> as a prologue, then paste the prologue and the main eBPF program together.
>>> From the viewpoint of the kernel bpf verifier, there is only one param (ctx); the
>>> prologue program fetches the value of vara and puts it into the proper register,
>>> then the main program runs.
>>
>> got it. I think that's much cleaner than what I was proposing.
>> The only question is then:
>> char _prog_config[] = "prog: func_name:1234 vara=localvara"
>> should actually be something like "... r2=localvara", right?
>> since prologue would need to assign into r2.
>> Otherwise I don't see where you find out about 'vara' inside
>> compiled bpf code.
>>
>
> I think the calling convention could teach us which var should go to which
> register. In the case of
>
> SEC("config") char _prog_config[] = "prog: func_name:1234 vara=localvara varb=globalvarb";
> int prog(struct pt_regs *ctx, unsigned long vara, unsigned long varb) { ... }
>
> llvm should compile 'prog' according to the calling convention. The body of that
> program should assume vara in r2 and varb in r3. The prologue also puts the vars into
> r2 and r3 according to the calling convention. Therefore, after pasting them together, the final
> program should run properly. There is no need to describe register numbers explicitly.
> What do you think?
>
>
>> Would be nice if this can be done without debug info.
>> Like in tracex2_kern.c I have:
>> SEC("kprobe/sys_write")
>> int bpf_prog(struct pt_regs *ctx)
>> {
>>     long wr_size = ctx->dx; /* arg3 */
>>
>> with your prolog generator the above can be rewritten as:
>> SEC("kprobe/sys_write")
>> int bpf_prog(struct pt_regs *unused, int fd, char *buf, size_t wr_size)
>> {
>>     /* use wr_size */
>>
>> that will improve ease of use a lot.
>>
>
> It is possible when probing on the entry of a function. However, when probing in the
> function body, we still need a way to pass the variable list required by the
> program to perf so that it can generate the correct prologue. We'd like to implement
> the generic one (list vars in config string) first, then add function
> parameter access as syntactic sugar.
>
>>> Another possible solution is to change the protocol between kprobe and eBPF
>>> program, so that kprobes calls fetchers and passes them to the eBPF program as
>>> a second param (group all varx together).
>>> A prologue may still be needed in this case to load each param into the correct
>>> register.
>>
>> you mean grouping varx together in some other struct and embedding it
>> together with pt_regs into new container struct?
>> doable, but your first approach is quite clean already. why bother.
>>
>
> The second approach lets us reuse the fetcher code which is already in the
> kernel. Furthermore, if new types of fetchers appear (for example, a fetcher
> of PMU counters), we support them automatically.
>
>>> Could you please consider the following problem?
>>>
>>> We find there are several __lock_page() calls that last a very long time. We are going
>>> to find the corresponding __unlock_page() so we can know what blocks them. We want to
>>> insert eBPF programs before io_schedule() in __lock_page(), and also add an eBPF program
>>> on the entry of __unlock_page(), so we can compute the interval between page locking and
>>> unlocking. If the time is longer than a threshold, let __unlock_page() trigger a perf sampling
>>> so we get its call stack.
>>> In this case, the eBPF program acts as a trace filter.
>>
>> all makes sense and your use case fits quite well into existing
>> bpf+kprobe model. I'm not sure why you're calling it a 'problem'.
>> A problem of how to display that call stack from perf?
>> I would say it fits better as a sample than a trace.
>> If you dump it as a trace, it won't be easy to decipher, whereas if you
>> treat it as a sampling event, perf record/report facility will pick it up and display nicely.
>> Meaning that one sample == lock_page/unlock_page
>> latency > N. Then existing sample_callchain flag should work.
>>
>
> Quite well. Do we have an eBPF function like
>
> static int (*bpf_perf_sample)(const char *fmt, int fmt_size, ...) = BPF_FUNC_perf_sample
>
> so we can use it in the program probed in the body of __unlock_page() like that:
>
> ...
> if (latency > 0.5s)
>     bpf_perf_sample("page=%p, latency=%d", sizeof(...), page, latency);
> ...
>
> Thank you.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>