Subject: Re: [RFC PATCH 0/6] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
From: pi3orama
X-Mailer: iPhone Mail (12H143)
In-Reply-To: <55A98808.9010307@plumgrid.com>
Date: Sat, 18 Jul 2015 07:27:43 +0800
Cc: kaixu xia, davem@davemloft.net, acme@kernel.org, mingo@redhat.com,
 a.p.zijlstra@chello.nl, masami.hiramatsu.pt@hitachi.com, jolsa@kernel.org,
 wangnan0@huawei.com, linux-kernel@vger.kernel.org, hekuang@huawei.com
References: <1437129816-13176-1-git-send-email-xiakaixu@huawei.com> <55A98808.9010307@plumgrid.com>
To: Alexei Starovoitov

Sent from my iPhone

> On Jul 18, 2015, at 6:56 AM, Alexei Starovoitov wrote:
>
>> On 7/17/15 3:43 AM, kaixu xia wrote:
>> There are many useful PMUs provided by X86 and other architectures. By
>> combining PMU, kprobe and eBPF program together, many interesting things
>> can be done. For example, by probing at sched:sched_switch we can
>> measure IPC changing between different processes by watching 'cycle' PMU
>> counter; by probing at entry and exit points of a kernel function we are
>> able to compute cache miss rate for a function by collecting
>> 'cache-misses' counter and see the differences. In summary, we can
>> define the begin and end points of a procedure, insert kprobes on them,
>> attach two BPF programs and let them collect specific PMU counter.
>
> that would be definitely a useful feature.
> As far as overall design I think it should be done slightly differently.
> The addition of 'flags' to all maps is a bit hacky and it seems has few
> holes. It's better to reuse 'store fds into maps' code that prog_array
> is doing. You can add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY
> and reuse most of the arraymap.c code.

Then we would also need another BPF_MAP_TYPE_PERF_EVENT_HASHMAP for
events in task context.

> The program also wouldn't need to do lookup+read_pmu, so instead of:
> r0 = 0 (the chosen key: CPU-0)
> *(u32 *)(fp - 4) = r0
> value = bpf_map_lookup_elem(map_fd, fp - 4);
> count = bpf_read_pmu(value);
> you will be able to do:
> count = bpf_perf_event_read(perf_event_array_map_fd, index)
> which will be faster.
> note, I'd prefer 'bpf_perf_event_read' name for the helper.

I thought about that. It makes things much simpler, since we won't need
to care about the verifier too much. I chose the current implementation
because I think we may need perf events that are not wrapped in a map in
the future (for example, tracepoints). With the design you suggested, in
that case we would have to create a map with only one element in it.
Although that is acceptable, we could make it look better. However,
let's think about that in the future.
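To make the comparison concrete, below is roughly how such a program
could look in C once the proposed map type and helper are in place. This
is only a sketch in the samples/bpf style; the BPF_FUNC_perf_event_read
number, the helper declaration and the probe point are placeholders based
on your suggestion, not code from this series:

/*
 * Sketch only: assumes the proposed BPF_MAP_TYPE_PERF_EVENT_ARRAY map
 * type and bpf_perf_event_read(map, index) helper land as discussed.
 */
#include <uapi/linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") cycles_map = {
	.type		= BPF_MAP_TYPE_PERF_EVENT_ARRAY, /* proposed map type */
	.key_size	= sizeof(int),
	.value_size	= sizeof(__u32),
	.max_entries	= 64,	/* one slot per possible CPU */
};

/* proposed helper; not in bpf_helpers.h yet, declared here for the sketch */
static __u64 (*bpf_perf_event_read)(void *map, __u64 index) =
	(void *) BPF_FUNC_perf_event_read;

SEC("kprobe/example_begin_func")	/* placeholder probe point */
int count_cycles(struct pt_regs *ctx)
{
	__u32 cpu = bpf_get_smp_processor_id();
	__u64 count;

	/* one helper call instead of bpf_map_lookup_elem() + bpf_read_pmu() */
	count = bpf_perf_event_read(&cycles_map, cpu);

	/*
	 * a real program would stash 'count' somewhere (for example in a
	 * hash map keyed by pid) so the matching end probe can compute
	 * the delta
	 */
	return 0;
}

char _license[] SEC("license") = "GPL";

A begin/end pair as in the cover letter would then simply be two such
programs sharing that hash map.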
> Then inside helper we really cannot do mutex, sleep or smp_call,
> but since programs are always executed in preempt disabled
> and never from NMI, I think something like the following should work:
> u64 bpf_perf_event_read(u64 r1, u64 index,...)
> {
>         struct bpf_perf_event_array *array = (void *) (long) r1;
>         struct perf_event *event;
>
>         if (unlikely(index >= array->map.max_entries))
>                 return -EINVAL;
>         event = array->events[index];
>         if (event->state != PERF_EVENT_STATE_ACTIVE)
>                 return -EINVAL;
>         if (event->oncpu != raw_smp_processor_id())
>                 return -EINVAL;
>         __perf_event_read(event);
>         return perf_event_count(event);
> }
> not sure whether we need to disable irq around __perf_event_read,
> I think it should be ok without.
> Also during store of FD into perf_event_array you'd need
> to filter out all crazy events. I would limit it to few
> basic types first.
>
> btw, make sure you do your tests with lockdep and other debugs on.
> and for the sample code please use C for the bpf program. Not many
> people can read bpf asm ;)

We still need some perf side code to make a C program work: perf should
create the perf events and put them into the map. We posted this kernel
side code first to get your opinion on the basic design principle we
chose. Although the implementation has some bugs, from your and Peter's
responses I think the user interface we chose is not too much of a
problem, so we can start the perf side coding now. Right? Once the perf
side code is done, you won't see asm code in the example.

Thank you.
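For reference, the perf side flow we have in mind is roughly the
following: open one hardware 'cycles' event per CPU with
perf_event_open() and store each fd into the BPF_MAP_TYPE_PERF_EVENT_ARRAY
map through bpf(BPF_MAP_UPDATE_ELEM), the same way prog_array stores
program fds. Again this is only a sketch under the proposed design;
populate_cycles_map() and the raw syscall wrappers are illustrative
placeholders, not code from this series:

/*
 * Hypothetical user-space sketch: one hardware 'cycles' counter per CPU,
 * each fd stored into the proposed BPF_MAP_TYPE_PERF_EVENT_ARRAY map.
 * 'map_fd' is assumed to come from a prior bpf(BPF_MAP_CREATE, ...) call.
 */
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>
#include <linux/perf_event.h>

static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
			       int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static int bpf_map_update(int map_fd, void *key, void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (unsigned long) key;
	attr.value = (unsigned long) value;
	/* flags left as 0 (BPF_ANY) */

	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

int populate_cycles_map(int map_fd, int nr_cpus)
{
	struct perf_event_attr attr;
	int cpu;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;		/* one of the "basic" types */
	attr.config = PERF_COUNT_HW_CPU_CYCLES;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		/* system-wide counter pinned to this CPU */
		int ev_fd = sys_perf_event_open(&attr, -1, cpu, -1, 0);

		if (ev_fd < 0)
			return -1;
		/* store the event fd at index 'cpu' in the array map */
		if (bpf_map_update(map_fd, &cpu, &ev_fd))
			return -1;
	}

	return 0;
}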