Subject: Re: [RFC PATCH 0/6] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
From: Alexei Starovoitov
To: kaixu xia, davem@davemloft.net, acme@kernel.org, mingo@redhat.com, a.p.zijlstra@chello.nl, masami.hiramatsu.pt@hitachi.com, jolsa@kernel.org
Cc: wangnan0@huawei.com, linux-kernel@vger.kernel.org, pi3orama@163.com, hekuang@huawei.com
Date: Fri, 17 Jul 2015 15:56:08 -0700
Message-ID: <55A98808.9010307@plumgrid.com>

On 7/17/15 3:43 AM, kaixu xia wrote:
> There are many useful PMUs provided by X86 and other architectures. By
> combining PMU, kprobe and eBPF program together, many interesting things
> can be done. For example, by probing at sched:sched_switch we can
> measure IPC changing between different processes by watching 'cycle' PMU
> counter; by probing at entry and exit points of a kernel function we are
> able to compute cache miss rate for a function by collecting
> 'cache-misses' counter and see the differences. In summary, we can
> define the begin and end points of a procedure, insert kprobes on them,
> attach two BPF programs and let them collect specific PMU counter.

That would definitely be a useful feature.

As far as the overall design goes, I think it should be done slightly
differently. The addition of 'flags' to all maps is a bit hacky and it
seems to have a few holes. It's better to reuse the 'store fds into maps'
code that prog_array is doing: add a new map type
BPF_MAP_TYPE_PERF_EVENT_ARRAY and reuse most of the arraymap.c code.

The program also wouldn't need to do lookup + read_pmu, so instead of:

  r0 = 0 (the chosen key: CPU-0)
  *(u32 *)(fp - 4) = r0
  value = bpf_map_lookup_elem(map_fd, fp - 4);
  count = bpf_read_pmu(value);

you will be able to do:

  count = bpf_perf_event_read(perf_event_array_map_fd, index)

which will be faster. Note, I'd prefer the name 'bpf_perf_event_read' for
the helper.

Then inside the helper we really cannot take a mutex, sleep or do an
smp_call, but since programs are always executed with preemption disabled
and never from NMI, I think something like the following should work:

  u64 bpf_perf_event_read(u64 r1, u64 index, ...)
  {
          struct bpf_perf_event_array *array = (void *) (long) r1;
          struct perf_event *event;

          /* bounds check against the map size */
          if (unlikely(index >= array->map.max_entries))
                  return -EINVAL;

          event = array->events[index];

          /* only read counters that are active and owned by this CPU */
          if (event->state != PERF_EVENT_STATE_ACTIVE)
                  return -EINVAL;
          if (event->oncpu != raw_smp_processor_id())
                  return -EINVAL;

          /* refresh the count from hardware, then return it */
          __perf_event_read(event);
          return perf_event_count(event);
  }

Not sure whether we need to disable IRQs around __perf_event_read(); I
think it should be ok without. Also, during the store of an FD into the
perf_event_array you'd need to filter out all crazy events. I would limit
it to a few basic types first.
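To make the intended usage more concrete, here is a rough sketch of how
such a program could look when written in C. This is only a sketch: the
map type, the helper and its wrapper are the proposed additions from
above, not an existing API, and the bpf_map_def / SEC() boilerplate is the
usual samples/bpf style:

  /* sketch: BPF_MAP_TYPE_PERF_EVENT_ARRAY and bpf_perf_event_read()
   * do not exist yet; they are the additions discussed above.
   */
  #include <linux/ptrace.h>
  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"      /* SEC() and struct bpf_map_def */

  /* samples/bpf-style wrapper for the proposed helper; the helper id
   * BPF_FUNC_perf_event_read would need to be added to uapi/linux/bpf.h
   */
  static long long (*bpf_perf_event_read)(void *map, int index) =
          (void *) BPF_FUNC_perf_event_read;

  /* user space creates the perf events (e.g. a 'cycles' counter per CPU)
   * and stores their FDs into the map slots before attaching the program
   */
  struct bpf_map_def SEC("maps") pmu_counters = {
          .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY, /* proposed map type */
          .key_size = sizeof(int),
          .value_size = sizeof(unsigned int),     /* the event fd */
          .max_entries = 64,
  };

  SEC("kprobe/sys_write")
  int count_cycles(struct pt_regs *ctx)
  {
          long long count;

          /* same as the asm above: read the counter in slot 0 (CPU-0);
           * the C program references the map object directly and the
           * loader resolves it to the map fd, as it does for prog_array
           */
          count = bpf_perf_event_read(&pmu_counters, 0);
          if (count < 0)
                  return 0;     /* event not active on this cpu */

          /* ... store 'count' in another map, compute deltas between
           * entry and exit probes, etc. ...
           */
          return 0;
  }

  char _license[] SEC("license") = "GPL";

User space would open the events with perf_event_open() and store the FDs
into the map slots, similar to how prog_array FDs are stored today.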
Btw, make sure you do your tests with lockdep and the other debug options
enabled. And for the sample code please use C for the bpf program; not
many people can read bpf asm ;)