Message-ID: <559386D7.1020208@huawei.com>
Date: Wed, 1 Jul 2015 14:21:11 +0800
From: "Wangnan (F)" <wangnan0@huawei.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: Peter Zijlstra <peterz@infradead.org>, He Kuang <hekuang@huawei.com>
CC: <ast@plumgrid.com>, <rostedt@goodmis.org>,
        <masami.hiramatsu.pt@hitachi.com>, <mingo@redhat.com>,
        <acme@redhat.com>, <jolsa@kernel.org>, <namhyung@kernel.org>,
        <linux-kernel@vger.kernel.org>, pi3orama <pi3orama@163.com>
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event
References: <1435719455-91155-1-git-send-email-hekuang@huawei.com> <20150701054458.GN19282@twins.programming.kicks-ass.net>
In-Reply-To: <20150701054458.GN19282@twins.programming.kicks-ass.net>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2569
Lines: 69


On 2015/7/1 13:44, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 02:57:30AM +0000, He Kuang wrote:
>> This patch adds an extra perf trace buffer for other utilities like
>> bpf to fill extra data to perf events.
> What!, why?

The goal of this patchset is to give BPF programs a means to output
data through perf samples.

BPF programs give us a way to filter and aggregate events, which lets
us do many interesting things. For example, we can count the number of
context switches in sys_write system calls by attaching BPF programs
to the entry and exit points of the system call and to the entry of
__schedule, then reporting the count on exit. Combined with the ability
for BPF programs to read the PMU, which we are working on, BPF programs
can be used to profile kernel functions in a fine-grained manner.
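
The counting logic in that example can be modelled in plain user-space
C. This is only a sketch of the idea, not BPF code from the patchset:
the fixed-size array stands in for a BPF map keyed by pid, and
MAX_TASKS and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a BPF map keyed by pid. */
#define MAX_TASKS 64

struct task_state {
	int in_write;        /* non-zero between sys_write entry and return */
	uint64_t nr_switch;  /* __schedule entries seen while in sys_write */
};

static struct task_state tasks[MAX_TASKS];

/* Models the program attached to the sys_write entry point. */
static void on_sys_write_entry(unsigned int pid)
{
	assert(pid < MAX_TASKS);
	tasks[pid].in_write = 1;
	tasks[pid].nr_switch = 0;
}

/* Models the program attached to the entry of __schedule. */
static void on_schedule_entry(unsigned int pid)
{
	assert(pid < MAX_TASKS);
	if (tasks[pid].in_write)
		tasks[pid].nr_switch++;
}

/* Models the program attached to sys_write%return: report the count. */
static uint64_t on_sys_write_return(unsigned int pid)
{
	assert(pid < MAX_TASKS);
	tasks[pid].in_write = 0;
	return tasks[pid].nr_switch;
}
```

The value returned by on_sys_write_return() is the kind of per-call
result that a BPF program currently has no good way to hand to perf.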

However, currently the only ways a BPF program can pass something to
perf are:

  1. By returning 0 or 1, a BPF program can control whether perf
     collects a sample;
  2. Through the map mechanism, user programs (perf) can read the
     aggregation result computed by the BPF program (not implemented
     now);
  3. Through BPF_FUNC_trace_printk, a BPF program can output a string
     into the ftrace ring buffer.

For the task I mentioned above, the best way to do it today is to print
the results into the ftrace ring buffer from the program attached to
sys_write%return, and then merge them with perf.data using timestamps.
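
To show what that post-processing step involves, here is a minimal
sketch of merging two timestamp-sorted record streams. It assumes both
streams are already sorted and use the same clock; struct rec and
merge_by_ts() are hypothetical names, not part of perf or ftrace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum rec_src { SRC_PERF, SRC_FTRACE };

/* Hypothetical record: only the fields needed for merging. */
struct rec {
	uint64_t ts;       /* timestamp, same clock in both streams */
	enum rec_src src;  /* which stream the record came from */
};

/*
 * Merge two arrays sorted by ts into 'out', which must have room for
 * na + nb records. On equal timestamps the record from 'a' goes first.
 * Returns the number of records written.
 */
static size_t merge_by_ts(const struct rec *a, size_t na,
			  const struct rec *b, size_t nb,
			  struct rec *out)
{
	size_t i = 0, j = 0, k = 0;

	assert(out != NULL);
	while (i < na && j < nb)
		out[k++] = (a[i].ts <= b[j].ts) ? a[i++] : b[j++];
	while (i < na)
		out[k++] = a[i++];
	while (j < nb)
		out[k++] = b[j++];
	return k;
}
```

Even in this simplified form, the merge only works if the two sources
share a clock and can be correlated after the fact, which is part of
why a single output path is attractive.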

We believe this can be improved. These patches are an attempt to let
BPF programs call something like 'BPF_FUNC_output_sample' to output
data, and to collect that data together with the other data output by
a perf sample. With support in perf (not implemented yet), perf will be
able to extract the data through 'perf script' or
'perf data convert --to-ctf'. Further analysis can then be done on it.

The extra perf trace buffer is added for that reason. Currently,
perf_trace_buf is used as a per-CPU buffer for the other parts of a
perf sample's data. Making BPF programs append information to that
buffer is possible, but it requires us to calculate the data size a
perf sample requires (by calling __get_data_size) before we can be sure
that the sample will not be filtered out. Alternatively, we could make
the BPF program write from the beginning of that buffer and append the
perf sample data after it. However, the resulting samples could not be
parsed by the current perf.
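
The reasoning above can be sketched with a small user-space model. This
is not the code in the patches: struct extra_buf, extra_buf_write(),
emit_sample() and EXTRA_BUF_SIZE are hypothetical, and the layout
(perf data first, BPF data after it) is an assumption chosen so that
the head of the record keeps the form the current parser expects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EXTRA_BUF_SIZE 64  /* hypothetical size of the extra buffer */

/* Hypothetical model of a separate buffer filled by the BPF program. */
struct extra_buf {
	unsigned char data[EXTRA_BUF_SIZE];
	size_t len;
};

/* Append n bytes; returns 0 on success, -1 if there is no room. */
static int extra_buf_write(struct extra_buf *eb, const void *src, size_t n)
{
	assert(eb->len <= EXTRA_BUF_SIZE);
	if (n > EXTRA_BUF_SIZE - eb->len)
		return -1;
	memcpy(eb->data + eb->len, src, n);
	eb->len += n;
	return 0;
}

/*
 * Called once the filter decision is known. If the sample is kept and
 * fits, copy the perf data followed by the BPF data into 'out'.
 * Returns the number of bytes written, or 0 if nothing was emitted.
 * The extra buffer is reset for the next event in either case.
 */
static size_t emit_sample(int keep, const void *perf_data, size_t perf_len,
			  struct extra_buf *eb,
			  unsigned char *out, size_t out_cap)
{
	size_t total = 0;

	if (keep && perf_len <= out_cap && eb->len <= out_cap - perf_len) {
		memcpy(out, perf_data, perf_len);
		memcpy(out + perf_len, eb->data, eb->len);
		total = perf_len + eb->len;
	}
	eb->len = 0;
	return total;
}
```

In this model the BPF program can write before anyone knows whether the
sample survives filtering, and no size has to be computed up front; the
cost is one extra copy when the sample is kept.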

