Message-ID: <55485FBF.20306@huawei.com>
Date: Tue, 5 May 2015 14:14:23 +0800
From: Wang Nan <wangnan0@huawei.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
To: Alexei Starovoitov <ast@plumgrid.com>, <davem@davemloft.net>,
        <acme@kernel.org>, <mingo@redhat.com>, <a.p.zijlstra@chello.nl>,
        <masami.hiramatsu.pt@hitachi.com>, <jolsa@kernel.org>
CC: <linux-kernel@vger.kernel.org>, <pi3orama@163.com>, <hekuang@huawei.com>,
        <bgregg@netflix.com>
Subject: Re: [RFC PATCH 00/22] perf tools: introduce 'perf bpf' command to
 load eBPF programs.
References: <1430391165-30267-1-git-send-email-wangnan0@huawei.com> <554302F0.3070101@plumgrid.com> <55447A7D.4000205@huawei.com> <554832AA.5050503@plumgrid.com> <55484A11.7070603@huawei.com> <554859CD.4090206@plumgrid.com>
In-Reply-To: <554859CD.4090206@plumgrid.com>
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5441
Lines: 123

On 2015/5/5 13:49, Alexei Starovoitov wrote:
> On 5/4/15 9:41 PM, Wang Nan wrote:
>>
>> That's great. Could you please append the description of 'llvm -s' into your README
>> or comments? It cost me a lot of time to dump eBPF instructions, so I decided to
>> add it to perf...
> 
> sure. it's just -filetype=asm flag to llc instead of -filetype=obj.
> Eventually it will work as normal 'clang -S file.c' when few more
> llvm commits are accepted upstream.
> 
>>>> My colleague He Kuang is working on variable access. Probing inside a function body
>>>> and accessing its local variables will be supported like this:
>>>>
>>>>    SEC("config") char _prog_config[] = "prog: func_name:1234 vara=localvara"
>>>>    int prog(struct pt_regs *ctx, unsigned long vara) {
>>>>       // vara is the value of localvara of function func_name
>>>>    }
>>>
>>> that would be great. I'm not sure though how you can achieve that
>>> without changing C front-end ?
>>
>> It's not very difficult. He is trying to generate the loader of vara
>> as a prologue, then paste the prologue and the main eBPF program together.
>> From the viewpoint of the kernel bpf verifier, there is only one param (ctx); the
>> prologue program fetches the value of vara and puts it into the proper register,
>> then the main program runs.
> 
> got it. I think that's much cleaner than what I was proposing.
> The only question is then:
> char _prog_config[] = "prog: func_name:1234 vara=localvara"
> should actually be something like "... r2=localvara", right?
> since prologue would need to assign into r2.
> Otherwise I don't see where you find out about 'vara' inside
> compiled bpf code.
>

I think the calling convention already tells us which variable should go to which
register. In the case of

 SEC("config") char _prog_config[] = "prog: func_name:1234 vara=localvara varb=globalvarb";
 int prog(struct pt_regs *ctx, unsigned long vara, unsigned long varb) { ... }

llvm should compile 'prog' according to the calling convention, so the body of that
program assumes vara is in r2 and varb is in r3. The prologue puts the variables into
r2 and r3 following the same convention. Therefore, after pasting them together, the
final program should run properly. There is no need to describe register numbers
explicitly. What do you think?
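
As a rough illustration of the mapping described here, the following user-space
sketch (not perf code) walks the config string and assigns each "name=source"
pair to the next argument register. The names struct var_binding and
parse_config_vars() are made up for this example. It assumes that the eBPF
calling convention passes arguments in r1..r5 with ctx in r1, and that the
pairs in the config string appear in the same order as the parameters of 'prog'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EXTRA_ARGS 4	/* r2..r5; r1 is reserved for ctx */

/* Hypothetical record: which probed variable lands in which register. */
struct var_binding {
	char param[32];		/* parameter name in the eBPF program, e.g. "vara" */
	char source[32];	/* variable at the probe point, e.g. "localvara" */
	int reg;		/* eBPF register the prologue would have to fill */
};

/*
 * Collect the "name=source" pairs of a config string such as
 * "prog: func_name:1234 vara=localvara varb=globalvarb".
 * Returns the number of bindings found, or -1 on error.
 */
static int parse_config_vars(const char *config, struct var_binding *out)
{
	const char *p = config;
	int n = 0;

	assert(config && out);

	while (*p) {
		const char *eq;
		size_t len;

		p += strspn(p, " ");	/* skip separators */
		len = strcspn(p, " ");	/* length of the current token */
		if (len == 0)
			break;

		/* Tokens without '=' ("prog:", "func_name:1234") are skipped. */
		eq = memchr(p, '=', len);
		if (eq) {
			size_t plen = (size_t)(eq - p);
			size_t slen = len - plen - 1;

			if (n == MAX_EXTRA_ARGS ||
			    plen >= sizeof(out[n].param) ||
			    slen >= sizeof(out[n].source))
				return -1;

			memcpy(out[n].param, p, plen);
			out[n].param[plen] = '\0';
			memcpy(out[n].source, eq + 1, slen);
			out[n].source[slen] = '\0';
			out[n].reg = 2 + n;	/* first extra argument goes to r2 */
			n++;
		}
		p += len;
	}
	return n;
}
```

With the config string from the example, this yields vara -> r2 and
varb -> r3, which is what both the compiled body and the generated prologue
would have to agree on.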


> Would be nice if this can be done without debug info.
> Like in tracex2_kern.c I have:
> SEC("kprobe/sys_write")
> int bpf_prog(struct pt_regs *ctx)
> {
>         long wr_size = ctx->dx; /* arg3 */
> 
> with your prolog generator the above can be rewritten as:
> SEC("kprobe/sys_write")
> int bpf_prog(struct pt_regs *unused, int fd, char *buf, size_t wr_size)
> {
>         /* use wr_size */
> 
> that will improve ease of use a lot.
>

That is possible when probing the entry of a function. However, when probing inside a
function body, we still need a way to pass the list of variables required by the
program to perf so that it can generate the correct prologue. We'd like to implement
the generic form (listing vars in the config string) first, then add function
parameter access as syntactic sugar.

>> Another possible solution is to change the protocol between kprobe and eBPF
>> program: make kprobes call the fetchers and pass the results to the eBPF program as
>> a second param (grouping all varx together).
>> A prologue may still be needed in this case to load each param into the correct
>> register.
> 
> you mean grouping varx together in some other struct and embedding it
> together with pt_regs into new container struct?
> doable, but your first approach is quite clean already. why bother.
> 

The second approach lets us reuse the fetcher code that is already in the
kernel. Furthermore, if new types of fetchers appear (for example, a fetcher
for PMU counters), we would support them automatically.
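
To illustrate the container idea, here is a minimal user-space sketch. It is
only a model of the data layout: struct fake_pt_regs stands in for the real
struct pt_regs, and struct probe_ctx, fetch_fn, fetch_dx() and run_fetchers()
are hypothetical names. In particular, fetch_fn is a simplified signature and
does not match the fetchers in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the real struct pt_regs; only the fields used here. */
struct fake_pt_regs {
	unsigned long di, si, dx;
};

#define MAX_FETCHED 4

/* Hypothetical container handed to the eBPF program. */
struct probe_ctx {
	struct fake_pt_regs regs;		/* kept first, see the check below */
	unsigned long vars[MAX_FETCHED];	/* fetcher results, in order */
};

/* A pointer to the container can still be read as a pointer to the regs. */
_Static_assert(offsetof(struct probe_ctx, regs) == 0, "regs must come first");

/* Simplified fetcher type (assumption, not the kernel's signature). */
typedef unsigned long (*fetch_fn)(const struct fake_pt_regs *regs);

/* Example fetcher: third argument register on x86_64. */
static unsigned long fetch_dx(const struct fake_pt_regs *regs)
{
	return regs->dx;
}

/* Run each fetcher and store its result in the matching slot. */
static int run_fetchers(struct probe_ctx *ctx, const fetch_fn *fetchers, int n)
{
	if (n < 0 || n > MAX_FETCHED)
		return -1;
	for (int i = 0; i < n; i++)
		ctx->vars[i] = fetchers[i](&ctx->regs);
	return 0;
}
```

A new kind of fetcher only needs another function of the same type; the
container and the loop stay unchanged.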

>> Could you please consider the following problem?
>>
>> We find that several __lock_page() calls last a very long time. We are going
>> to find corresponding __unlock_page() so we can know what blocks them. We want to
>> insert eBPF programs before io_schedule() in __lock_page(), and also add eBPF program
>> on the entry of __unlock_page(), so we can compute the interval between page locking and
>> unlocking. If time is longer than a threshold, let __unlock_page() trigger a perf sampling
>> so we get its call stack. In this case, eBPF program acts as a trace filter.
> 
> all makes sense and your use case fits quite well into existing
> bpf+kprobe model. I'm not sure why you're calling a 'problem'.
> A problem of how to display that call stack from perf?
> I would say it fits better as a sample than a trace.
> If you dump it as a trace, it won't be easy to decipher, whereas if you
> treat it as a sampling event, the perf record/report facility will pick it
> up and display it nicely. Meaning that one sample == lock_page/unlock_page
> latency > N. Then the existing sample_callchain flag should work.
> 

Sounds good. Do we have an eBPF function like

static int (*bpf_perf_sample)(const char *fmt, int fmt_size, ...) = BPF_FUNC_perf_sample;

so that we can use it in the program attached to the body of __unlock_page() like this:

 ...
 if (latency > 0.5s)
    bpf_perf_sample("page=%p, latency=%d", sizeof(...), page, latency);
 ...
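
The filtering logic around that call could look roughly like the user-space
model below. It is only a sketch: the fixed-size table stands in for a BPF
map keyed by page, timestamps are passed in by the caller instead of being read
from a kernel clock, and on_lock_wait(), on_unlock() and THRESHOLD_NS are
made-up names. Latency is assumed to be measured in nanoseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define THRESHOLD_NS 500000000ULL	/* 0.5s, as in the example above */
#define SLOTS 64

/* One in-flight wait; page == NULL marks a free slot. */
struct slot {
	const void *page;
	uint64_t start_ns;
};

static struct slot table[SLOTS];

static struct slot *find_slot(const void *page)
{
	for (size_t i = 0; i < SLOTS; i++)
		if (table[i].page == page)
			return &table[i];
	return NULL;
}

/* Model of the program placed before io_schedule() in __lock_page(). */
static int on_lock_wait(const void *page, uint64_t now_ns)
{
	struct slot *s;

	if (!page)
		return -1;
	s = find_slot(page);
	if (!s)
		s = find_slot(NULL);	/* take a free slot */
	if (!s)
		return -1;		/* table full */
	s->page = page;
	s->start_ns = now_ns;
	return 0;
}

/*
 * Model of the program attached to __unlock_page().
 * Returns 1 when the wait exceeded the threshold, i.e. when a sample
 * should be emitted, and 0 otherwise.
 */
static int on_unlock(const void *page, uint64_t now_ns, uint64_t *latency_ns)
{
	struct slot *s;

	assert(latency_ns);
	if (!page)
		return 0;
	s = find_slot(page);
	if (!s)
		return 0;		/* nobody waited on this page */
	*latency_ns = now_ns - s->start_ns;
	s->page = NULL;			/* release the slot */
	return *latency_ns > THRESHOLD_NS;
}
```

Only the unlock side decides whether to sample, so the program acts purely as
a filter and the common (fast) case produces no output.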

Thank you.
