Message-ID: <554859CD.4090206@plumgrid.com>
Date: Mon, 04 May 2015 22:49:01 -0700
From: Alexei Starovoitov <ast@plumgrid.com>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: Wang Nan <wangnan0@huawei.com>, davem@davemloft.net, acme@kernel.org,
        mingo@redhat.com, a.p.zijlstra@chello.nl,
        masami.hiramatsu.pt@hitachi.com, jolsa@kernel.org
CC: linux-kernel@vger.kernel.org, pi3orama@163.com, hekuang@huawei.com,
        bgregg@netflix.com
Subject: Re: [RFC PATCH 00/22] perf tools: introduce 'perf bpf' command to
 load eBPF programs.
References: <1430391165-30267-1-git-send-email-wangnan0@huawei.com> <554302F0.3070101@plumgrid.com> <55447A7D.4000205@huawei.com> <554832AA.5050503@plumgrid.com> <55484A11.7070603@huawei.com>
In-Reply-To: <55484A11.7070603@huawei.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3765
Lines: 83

On 5/4/15 9:41 PM, Wang Nan wrote:
>
> That's great. Could you please append the description of 'llvm -s' into your README
> or comments? It has cost me a lot of time for dumping eBPF instructions so I decide to
> add it into perf...

sure. it's just the -filetype=asm flag to llc instead of -filetype=obj.
Eventually it will work as a normal 'clang -S file.c' once a few more
llvm commits are accepted upstream.

>>> My colleague He Kuang is working on variable accessing. Probing inside function body
>>> and accessing its local variable will be supported like this:
>>>
>>>    SEC("config") char _prog_config[] = "prog: func_name:1234 vara=localvara"
>>>    int prog(struct pt_regs *ctx, unsigned long vara) {
>>>       // vara is the value of localvara of function func_name
>>>    }
>>
>> that would be great. I'm not sure though how you can achieve that
>> without changing C front-end ?
>
> It's not very difficult. He is trying to generate the loader of vara
> as prologue, then paste the prologue and the main eBPF program together.
>  From the viewpoint of kernel bpf verifier, there is only one param (ctx); the
> prologue program fetches the value of vara then puts it into a proper register,
> then the main program runs.

got it. I think that's much cleaner than what I was proposing.
The only question is then:
char _prog_config[] = "prog: func_name:1234 vara=localvara"
should actually be something like "... r2=localvara", right?
since prologue would need to assign into r2.
Otherwise I don't see where you find out about 'vara' inside
compiled bpf code.

Would be nice if this can be done without debug info.
Like in tracex2_kern.c I have:
SEC("kprobe/sys_write")
int bpf_prog(struct pt_regs *ctx)
{
         long wr_size = ctx->dx; /* arg3 */

with your prologue generator the above can be rewritten as:
SEC("kprobe/sys_write")
int bpf_prog(struct pt_regs *unused, int fd, char *buf, size_t wr_size)
{
         /* use wr_size */

that will improve ease of use a lot.
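
To make sure I understand the prologue idea, here is a rough userspace
model of it, not kernel or perf code. struct regs_model is a stand-in
for pt_regs with only the three x86_64 argument registers, and
prologue() and main_prog() are hypothetical names for the generated
stub and the user-written program:

```c
#include <stddef.h>

/* Userspace stand-in for the x86_64 registers that carry the first
 * three arguments; the real struct pt_regs has many more fields. */
struct regs_model {
	unsigned long di; /* arg1 */
	unsigned long si; /* arg2 */
	unsigned long dx; /* arg3 */
};

/* Main program as the user would write it: ctx plus the probed
 * function's arguments. */
static long main_prog(struct regs_model *unused, int fd, char *buf,
		      size_t wr_size)
{
	(void)unused;
	(void)fd;
	(void)buf;
	return (long)wr_size; /* use wr_size */
}

/* Hypothetical generated prologue: the only entry point, taking ctx
 * alone. It fetches each argument from ctx and hands it to the main
 * program in the matching parameter slot. */
static long prologue(struct regs_model *ctx)
{
	return main_prog(ctx, (int)ctx->di, (char *)ctx->si,
			 (size_t)ctx->dx);
}
```

In this model the entry point still takes only ctx, which matches the
single-parameter view described above, and the extra parameters are
purely a convenience for the program author.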

> Another possible solution is to change the protocol between kprobe and eBPF
> program, makes kprobes calls fetchers and passes them to eBPF program as
> a second param (group all varx together).
> A prologue may still be needed in this case to load each param into correct
> register.

you mean grouping varx together in some other struct and embedding it
together with pt_regs into new container struct?
doable, but your first approach is quite clean already. why bother.
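
For reference, the container struct I have in mind would look roughly
like the sketch below. All names here (regs_model, fetched_vars,
probe_ctx, fetched_var, MAX_FETCHED) are made up for illustration and
nothing like this exists in the kernel today:

```c
#include <stddef.h>

/* Simplified stand-in for struct pt_regs. */
struct regs_model {
	unsigned long di, si, dx;
};

#define MAX_FETCHED 4 /* hypothetical limit on fetched variables */

/* Values produced by the kprobe fetchers, grouped together. */
struct fetched_vars {
	unsigned long var[MAX_FETCHED];
};

/* Container handed to the program. regs comes first, so the same
 * pointer is still valid as a pointer to the registers. */
struct probe_ctx {
	struct regs_model regs;
	struct fetched_vars vars;
};

/* Read fetched variable idx through the ctx pointer; out-of-range
 * indexes read as 0 in this model. */
static unsigned long fetched_var(struct regs_model *ctx, unsigned int idx)
{
	struct probe_ctx *pc = (struct probe_ctx *)ctx;

	return idx < MAX_FETCHED ? pc->vars.var[idx] : 0;
}
```

Keeping regs at offset 0 means programs that only look at the
registers would not need to change.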

> Could you please consider the following problem?
>
> We find there are several __lock_page() calls that last a very long time. We are going
> to find corresponding __unlock_page() so we can know what blocks them. We want to
> insert eBPF programs before io_schedule() in __lock_page(), and also add eBPF program
> on the entry of __unlock_page(), so we can compute the interval between page locking and
> unlocking. If time is longer than a threshold, let __unlock_page() trigger a perf sampling
> so we get its call stack. In this case, eBPF program acts as a trace filter.

all makes sense and your use case fits quite well into existing
bpf+kprobe model. I'm not sure why you're calling it a 'problem'.
Is the problem how to display that call stack from perf?
I would say it fits better as a sample than a trace.
If you dump it as a trace, it won't be easy to decipher, whereas if you
treat it as a sampling event, perf record/report facility will pick it up
and display it nicely. Meaning that one sample == lock_page/unlock_page
latency > N. Then existing sample_callchain flag should work.
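
The filter logic itself could be sketched as below. This is a plain
userspace model under my own assumptions, not an eBPF program: the
fixed-size table stands in for a bpf hash map keyed by page address,
timestamps are passed in by the caller, and on_lock_wait(), on_unlock()
and SLOTS are hypothetical names:

```c
#include <stdint.h>

#define SLOTS 64 /* toy table size; a real program would use a bpf map */

struct slot {
	uintptr_t page;    /* page address used as the key, 0 == empty */
	uint64_t start_ns; /* time the lock wait started */
};

static struct slot table[SLOTS];

/* Colliding pages overwrite each other in this toy table. */
static struct slot *slot_for(uintptr_t page)
{
	return &table[(page >> 6) % SLOTS];
}

/* Probe placed before io_schedule() in __lock_page(). */
static void on_lock_wait(uintptr_t page, uint64_t now_ns)
{
	struct slot *s = slot_for(page);

	s->page = page;
	s->start_ns = now_ns;
}

/* Probe at the unlock entry: returns 1 to let the sample through,
 * 0 to filter it out. */
static int on_unlock(uintptr_t page, uint64_t now_ns, uint64_t threshold_ns)
{
	struct slot *s = slot_for(page);
	uint64_t delta;

	if (s->page != page)
		return 0; /* no lock wait recorded for this page */
	delta = now_ns - s->start_ns;
	s->page = 0; /* consume the entry */
	return delta > threshold_ns;
}
```

A nonzero return from on_unlock() corresponds to "latency > N", i.e.
the one case where a sample with a callchain would be recorded.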
