LinuxLists.cc - Re: [RFC PATCH v4 10/29] bpf tools: Collect map definitions from 'maps' section

2015-06-01 05:20:14

Subject: Re: [RFC PATCH v4 10/29] bpf tools: Collect map definitions from 'maps' section

On 2015/6/1 10:12, Namhyung Kim wrote:
> Hi Alexei and Wang,
>
> On Thu, May 28, 2015 at 08:35:19PM -0700, Alexei Starovoitov wrote:
>> On Thu, May 28, 2015 at 03:14:44PM +0800, Wangnan (F) wrote:
>>> On 2015/5/28 14:09, Alexei Starovoitov wrote:
>>>> On Thu, May 28, 2015 at 11:09:50AM +0800, Wangnan (F) wrote:
>>> For me, enabling eBPF programs to read PMU counters is the first thing
>>> that needs to be done.
>>> The other thing is enabling eBPF programs to bring some information to
>>> perf samples.
>>>
>>> Here is an example to show my idea.
>>>
>>> I have a program which:
>>>
>>> int main()
>>> {
>>>     while (1) {
>>>         read(...);
>>>         /* do A */
>>>         write(...);
>>>         /* do B */
>>>     }
>>> }
>>>
>>> Then by using following script:
>>>
>>> SEC("enter=sys_write $outdata:u64")
>>> int enter_sys_write(...) {
>>>     u64 cycles_cnt = bpf_read_pmu(&cycles_pmu);
>>>     bpf_store_value(cycles_cnt);
>>>     return 1;
>>> }
>>>
>>> SEC("enter=sys_read $outdata:u64")
>>> int enter_sys_read(...) {
>>>     u64 cycles_cnt = bpf_read_pmu(&cycles_pmu);
>>>     bpf_store_value(cycles_cnt);
>>>     return 1;
>>> }
>>>
>>> With 'perf script' we can check the cycle counter at each point, and then
>>> compute the number of cycles between any two sampling points. This way we
>>> can compute how many cycles are taken by A and B. If the instruction
>>> counter is also recorded, we will know the IPC of A and B.
>> Agree. That's useful. That's exactly what I meant by
>> "compute a number of cache misses between two kprobe events".
>> The overhead is less when bpf program computes the cycle and instruction
>> delta, computes IPC and passes only final IPC numbers to the user space.
>> It can even average IPC over time.
>> For some very frequent events it can read cycle_cnt on sys_entry_read,
>> then read it on sys_exit_read, compute delta and average it into the map.
>> User space can read the map every second or every 10 seconds and print
>> nice graph.
> Looks very interesting and useful indeed!
>
>> As far as 'bpf_store_value' goes... I was thinking to expose perf ring_buffer
>> to bpf programs, so that program can stream any data to perf that receives
>> it via mmap. Then you don't need this '$outdata' hack.
> Then we need to define and pass the format of such data so that perf
> tools can read and process the data. IIRC Masami suggested to have an
> additional user event type for inserting/injecting non-perf events -
> like PERF_RECORD_USER_DEFINED_TYPE? And its contents is something
> similar to tracepoint event format file so that we can reuse existing
> code to parse the event definition.

Is it possible to expose such a format through
/sys/kernel/debug/tracing/events/*/*/format
so we can avoid extra work on the perf side and make it accessible to both
perf and ftrace?

Currently we do this by opening an internal PMU and adding a common
field in trace_define_common_fields(). By reading that PMU in
tracing_generic_entry_update() we are able to collect its value with both
perf and ftrace, for both kprobe events and tracepoints. (The
implementation is ugly: we have to hardwire the PMU because altering
common fields dynamically is hard, and tracing multiple PMUs requires
recompiling.) In several use cases we found that using ftrace would be
better because the cost of perf is higher.

Although currently BPF programs can only be executed when they are traced
by perf, I think we can extend this to ftrace (but I'm not sure how to do
it yet...).

Currently I'm still working on the perf BPF support, and I think it is
almost done. The next step should be solving the argument passing problem.
After that we should enable eBPF programs to read hardware PMUs.
Outputting should be the final step. I'm glad to see that many people are
thinking about it. Please keep me in the loop if you have any new ideas in
this area.

Thank you.

> Thanks,
> Namhyung
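
[Editor's sketch] The entry/exit delta averaging that Alexei describes in the quoted text can be illustrated in plain user-space C. This is only a sketch under stated assumptions: struct ipc_state stands in for a BPF map value, and on_entry(), on_exit(), avg_cycles() and ipc_x1000() are hypothetical names, not kernel or perf APIs. The counter values are passed in as arguments because the PMU-reading helper (bpf_read_pmu in the example above) was only a proposal at this point in the thread. IPC is kept as a scaled integer on the assumption that an in-kernel program would avoid floating point.

```c
#include <stdint.h>

/* Hypothetical per-event state; in a real program this would live in a map. */
struct ipc_state {
    uint64_t start_cycles;  /* counter values captured at the entry probe */
    uint64_t start_insns;
    uint64_t total_cycles;  /* accumulated deltas over all samples */
    uint64_t total_insns;
    uint64_t samples;       /* number of completed entry/exit pairs */
};

/* Entry probe: remember the counters at the start of the region. */
static void on_entry(struct ipc_state *s, uint64_t cycles, uint64_t insns)
{
    s->start_cycles = cycles;
    s->start_insns = insns;
}

/* Exit probe: accumulate the deltas; unsigned subtraction tolerates wrap. */
static void on_exit(struct ipc_state *s, uint64_t cycles, uint64_t insns)
{
    s->total_cycles += cycles - s->start_cycles;
    s->total_insns += insns - s->start_insns;
    s->samples++;
}

/* Average cycles per completed region, or 0 if nothing was sampled. */
static uint64_t avg_cycles(const struct ipc_state *s)
{
    return s->samples ? s->total_cycles / s->samples : 0;
}

/* Instructions per cycle, scaled by 1000 to stay in integer arithmetic. */
static uint64_t ipc_x1000(const struct ipc_state *s)
{
    return s->total_cycles ? s->total_insns * 1000 / s->total_cycles : 0;
}
```

User space would then only need to read the accumulated state periodically (every second or every 10 seconds, as suggested above) instead of receiving one sample per event.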

2015-06-01 06:04:49

by Namhyung Kim

Subject: Re: [RFC PATCH v4 10/29] bpf tools: Collect map definitions from 'maps' section

On Mon, Jun 01, 2015 at 01:19:16PM +0800, Wangnan (F) wrote:
>
>
> On 2015/6/1 10:12, Namhyung Kim wrote:
> >Hi Alexei and Wang,
> >
> >On Thu, May 28, 2015 at 08:35:19PM -0700, Alexei Starovoitov wrote:
> >>On Thu, May 28, 2015 at 03:14:44PM +0800, Wangnan (F) wrote:
> >>>On 2015/5/28 14:09, Alexei Starovoitov wrote:
> >>As far as 'bpf_store_value' goes... I was thinking to expose perf ring_buffer
> >>to bpf programs, so that program can stream any data to perf that receives
> >>it via mmap. Then you don't need this '$outdata' hack.
> >Then we need to define and pass the format of such data so that perf
> >tools can read and process the data. IIRC Masami suggested to have an
> >additional user event type for inserting/injecting non-perf events -
> >like PERF_RECORD_USER_DEFINED_TYPE? And its contents is something
> >similar to tracepoint event format file so that we can reuse existing
> >code to parse the event definition.
>
> Is it possible to expose such a format through
> /sys/kernel/debug/tracing/events/*/*/format
> so we can avoid extra work on the perf side and make it accessible to both
> perf and ftrace?

No, I mean exporting such a format through an event in the perf.data file.
It still needs extra work on the perf-tools side. But by using user-defined
event types, there should be no kernel-side work.

The above is just a suggestion for how to deal with external data/events in
perf. But I'm seeing many people who want such a feature, so we need a way
to handle it anyway. ;-)

>
> Currently we do this by opening an internal PMU and adding a common
> field in trace_define_common_fields(). By reading that PMU in
> tracing_generic_entry_update() we are able to collect its value with both
> perf and ftrace, for both kprobe events and tracepoints. (The
> implementation is ugly: we have to hardwire the PMU because altering
> common fields dynamically is hard, and tracing multiple PMUs requires
> recompiling.) In several use cases we found that using ftrace would be
> better because the cost of perf is higher.
>
> Although currently BPF programs can only be executed when they are traced
> by perf, I think we can extend this to ftrace (but I'm not sure how to do
> it yet...).
>
> Currently I'm still working on the perf BPF support, and I think it is
> almost done. The next step should be solving the argument passing problem.
> After that we should enable eBPF programs to read hardware PMUs.
> Outputting should be the final step. I'm glad to see that many people are
> thinking about it. Please keep me in the loop if you have any new ideas in
> this area.

Sure thing. Thank you very much for doing this work!
Namhyung
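
[Editor's sketch] Reusing the tracepoint format parser, as suggested in this message, relies on the field lines found in files such as /sys/kernel/debug/tracing/events/*/*/format, which look like "field:unsigned short common_type; offset:0; size:2; signed:0;". The following is a minimal sketch of parsing one such line, assuming that layout. struct field_desc and parse_format_field() are hypothetical names for illustration; they are not the parser that perf actually uses.

```c
#include <stdio.h>

/* Hypothetical description of one field taken from a format file. */
struct field_desc {
    char decl[128];     /* C declaration, e.g. "unsigned short common_type" */
    unsigned offset;    /* byte offset of the field within the record */
    unsigned size;      /* size of the field in bytes */
    int is_signed;      /* 1 if the field is a signed type, else 0 */
};

/*
 * Parse a single "field:...; offset:N; size:N; signed:N;" line.
 * Returns 0 on success, -1 if the line is not a complete field line.
 */
static int parse_format_field(const char *line, struct field_desc *out)
{
    /* A space in the scanf format skips any whitespace, including tabs. */
    int n = sscanf(line, " field:%127[^;]; offset:%u; size:%u; signed:%d;",
                   out->decl, &out->offset, &out->size, &out->is_signed);

    return n == 4 ? 0 : -1;
}
```

With descriptions in this shape, a user-defined event would only need to carry the raw bytes; the offsets and sizes tell the tool how to decode them.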

2015-06-01 13:01:45

by Arnaldo Carvalho de Melo

Subject: Re: [RFC PATCH v4 10/29] bpf tools: Collect map definitions from 'maps' section

Em Mon, Jun 01, 2015 at 03:03:11PM +0900, Namhyung Kim escreveu:
> On Mon, Jun 01, 2015 at 01:19:16PM +0800, Wangnan (F) wrote:
> > On 2015/6/1 10:12, Namhyung Kim wrote:
> > >On Thu, May 28, 2015 at 08:35:19PM -0700, Alexei Starovoitov wrote:
> > >>On Thu, May 28, 2015 at 03:14:44PM +0800, Wangnan (F) wrote:
> > >>>On 2015/5/28 14:09, Alexei Starovoitov wrote:
> > >>As far as 'bpf_store_value' goes... I was thinking to expose perf ring_buffer
> > >>to bpf programs, so that program can stream any data to perf that receives
> > >>it via mmap. Then you don't need this '$outdata' hack.
> > >Then we need to define and pass the format of such data so that perf
> > >tools can read and process the data. IIRC Masami suggested to have an
> > >additional user event type for inserting/injecting non-perf events -
> > >like PERF_RECORD_USER_DEFINED_TYPE? And its contents is something

That would behave mostly like PERF_TYPE_TRACEPOINT but would look for
event /format definitions in another place? Perhaps one in a per-buildid
location, i.e. each library has its own place to store such field
definitions, and by tying them to a particular version, it could change
them as it sees fit?

> > >similar to tracepoint event format file so that we can reuse existing
> > >code to parse the event definition.

> > Is it possible to expose such a format through
> > /sys/kernel/debug/tracing/events/*/*/format
> > so we can avoid extra work on the perf side and make it accessible to both
> > perf and ftrace?

> No, I mean exporting such a format through an event in the perf.data file.
> It still needs extra work on the perf-tools side. But by using user-defined
> event types, there should be no kernel-side work.
>
> The above is just a suggestion for how to deal with external data/events in
> perf. But I'm seeing many people who want such a feature, so we need a way
> to handle it anyway. ;-)

Right, having a way to map from a { attr.type = PERF_TYPE_USER_DEFINED,
attr.config = N }, to a description of the fields, like what we have now
for tracepoints, kprobes and uprobes, seems to be the way to reuse most of
the existing infrastructure.

We then collect just the /format files referenced in the perf.data file
and all seems to be in place, no?

- Arnaldo
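
[Editor's sketch] The mapping from { attr.type, attr.config = N } to a field description could be as simple as a lookup table keyed on both values. The sketch below is purely illustrative: PERF_TYPE_USER_DEFINED is only a proposal in this thread, so the constant EXAMPLE_TYPE_USER_DEFINED, its value, struct format_entry, lookup_format() and the paths used with it are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed PERF_TYPE_USER_DEFINED. */
enum { EXAMPLE_TYPE_USER_DEFINED = 1000 };

/* One entry tying an event identity to where its format description lives. */
struct format_entry {
    uint32_t type;            /* attr.type of the event */
    uint64_t config;          /* attr.config of the event (the "N" above) */
    const char *format_path;  /* hypothetical location of the format text */
};

/*
 * Find the format description for a (type, config) pair.
 * Returns the stored path, or NULL if the pair is unknown.
 */
static const char *lookup_format(const struct format_entry *tbl, size_t n,
                                 uint32_t type, uint64_t config)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].type == type && tbl[i].config == config)
            return tbl[i].format_path;
    }
    return NULL;
}
```

A tool recording a session could walk such a table and copy only the format files whose (type, config) pairs actually appear in the perf.data file.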