Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753074AbbGBJ0c (ORCPT ); Thu, 2 Jul 2015 05:26:32 -0400 Received: from szxga01-in.huawei.com ([58.251.152.64]:35900 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751960AbbGBJ0Z (ORCPT ); Thu, 2 Jul 2015 05:26:25 -0400 Message-ID: <55950356.3050507@huawei.com> Date: Thu, 2 Jul 2015 17:24:38 +0800 From: "Wangnan (F)" User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Alexei Starovoitov , He Kuang , Peter Zijlstra CC: , , , , , , , pi3orama Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event References: <1435719455-91155-1-git-send-email-hekuang@huawei.com> <20150701054458.GN19282@twins.programming.kicks-ass.net> <559386D7.1020208@huawei.com> <20150701115825.GV19282@twins.programming.kicks-ass.net> <5594A679.7010108@plumgrid.com> <5594B22F.8090500@huawei.com> <5594B569.60103@plumgrid.com> In-Reply-To: <5594B569.60103@plumgrid.com> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [10.111.66.109] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4553 Lines: 122 On 2015/7/2 11:52, Alexei Starovoitov wrote: > On 7/1/15 8:38 PM, He Kuang wrote: >> >> >> On 2015/7/2 10:48, Alexei Starovoitov wrote: >>> On 7/1/15 4:58 AM, Peter Zijlstra wrote: >>>> >>>> But why create a separate trace buffer, it should go into the regular >>>> perf buffer. >>> >>> +1 >>> >>> I think >>> +static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS]; >>> is redundant. >>> It adds quite a bit of unnecessary complexity to the whole patch set. >>> >>> Also the call to bpf_output_sample() is not effective unless program >>> returns 1. It's a confusing user interface. >>> >>> Also you cannot ever do: >>> BPF_FUNC_probe_read, >>> + BPF_FUNC_output_sample, >>> BPF_FUNC_ktime_get_ns, >>> new functions must be added to the end. >>> >>> Why not just do: >>> perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper? >>> No changes to current code. >>> No need to call __get_data_size() and other overhead. >>> The helper can be called multiple times from the same program. >>> imo much cleaner. >>> >> >> Invoke perf_trace_buf_submit() will generate a second perf >> event (header->type = PERF_RECORD_SAMPLE) entry which is >> different from the event entry outputed by the orignial >> kprobe. So the final result of the example in 00/00 patch may >> like this: >> >> sample entry 1(from bpf_prog): >> comm timestamp1 generic_perform_write pmu_value=0x1234 >> sample entry 2(from original kprobe): >> comm timestamp2 generic_perform_write: (ffffffff81140b60) >> Compared with current implementation: >> combined sample entry: >> comm timestamp generic_perform_write: (ffffffff81140b60) >> pmu_value=0x1234 >> >> The former two entries may be discontinuous as there are multiple >> threads and kprobes to be recorded, and there's a chance that one >> entry is missed but the other is recorded. What we need is the >> pmu_value read when 'generic_perform_write' enters, the two >> entries result is not intuitive enough and userspace tools have >> to do the work to find and combine those two sample entries to >> get the result. > > Just change your example to return 0 and user space will see > one sample. > Yes, by using perf_trace_buf_prepare() + perf_trace_buf_submit() in helper function and let bpf program always returns 0 we can make data collected by BPF programs output into samples, if following problems are solved: 1. In bpf program there's no way to get 'struct perf_event' or 'struct ftrace_event_call'. We have to deduce them through pt_regs: pt_regs -> ip -> kprobe -> struct trace_kprobe -> struct ftrace_event_call -> hlist_entry -> struct perf_event Which seems dirty, but without that we can't call perf_trace_buf_submit(). 2. Even if we finally get 'struct perf_event', I'm not sure whether user really concern on it. If we really concern on all information output through perf_trace_buf_submit() like callstack and register, why not make bpf program return non-zero instead? But then we have to consider how to connect two samples together. So maybe writing a new function to replace perf_trace_buf_submit() and output some light-weight information instead of full event data is worth considering. Otherwise, maybe a dummy 'struct perf_event' for BPF outputing is also acceptable? What we are trying to do in previous patches is to merge data output by BPF programs and original data output by perf_trace_buf_submit() together. For example (expressed in CTF metadata format): event.header := struct { // both output by perf_trace_buf_submit() integer { ... } id; integer { ... } timestamp; } event { name = "perf_bpf_probe:lock_page"; ... fields := struct { integer { ... } perf_ip; // perf_trace_buf_submit() integer { ... } perf_tid; // perf_trace_buf_submit() ... integer { ... } page; <-- Fetched using prologue integer { ... } cycle_cmu_counter; <-- Output by BPF program } } We believe that implemented should be simpler. Whether to use an extra perf_trace_buf or not can be discussed. We have other choices. For example, we can make BPF program write its data from the end of bpf_trace_buf, and connect two parts of output data before calling perf_trace_buf_submit(). Thank you. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/