Message-ID: <55B179DB.4080308@iogearbox.net>
Date: Fri, 24 Jul 2015 01:33:47 +0200
From: Daniel Borkmann
To: Kaixu Xia, ast@plumgrid.com, davem@davemloft.net, acme@kernel.org, mingo@redhat.com, a.p.zijlstra@chello.nl, masami.hiramatsu.pt@hitachi.com, jolsa@kernel.org
CC: wangnan0@huawei.com, linux-kernel@vger.kernel.org, pi3orama@163.com, hekuang@huawei.com
Subject: Re: [PATCH v2 0/5] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
References: <1437552572-84748-1-git-send-email-xiakaixu@huawei.com>
In-Reply-To: <1437552572-84748-1-git-send-email-xiakaixu@huawei.com>

On 07/22/2015 10:09 AM, Kaixu Xia wrote:
> Previous patch v1 url:
> https://lkml.org/lkml/2015/7/17/287

[ Sorry to chime in late, just noticed this series now as I wasn't in Cc for
  the core BPF changes. More below ... ]

> This patchset allows user read PMU events in the following way:
> 1. Open the PMU using perf_event_open() (for each CPUs or for
>    each processes he/she'd like to watch);
> 2. Create a BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map;
> 3. Insert FDs into the map with some key-value mapping scheme
>    (i.e. cpuid -> event on that CPU);
> 4. Load and attach eBPF programs as usual;
> 5. In eBPF program, get the perf_event_map_fd and key (i.e.
>    cpuid get from bpf_get_smp_processor_id()) then use
>    bpf_perf_event_read() to read from it.
> 6. Do anything he/her want.
>
> changes in V2:
> - put atomic_long_inc_not_zero() between fdget() and fdput();
> - limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
> - Only read the event counter on current CPU or on current
>   process;
> - add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
>   pointer to the struct perf_event;
> - according to the perf_event_map_fd and key, the function
>   bpf_perf_event_read() can get the Hardware PMU counter value;
>
> Patch 5/5 is a simple example and shows how to use this new eBPF
> programs ability. The PMU counter data can be found in
> /sys/kernel/debug/tracing/trace(trace_pipe).(the cycles PMU
> value when 'kprobe/sys_write' sampling)
>
> $ cat /sys/kernel/debug/tracing/trace_pipe
> $ ./tracex6
> ...
> cat-677 [002] d..1 210.299270: : bpf count: CPU-2 5316659
> cat-677 [002] d..1 210.299316: : bpf count: CPU-2 5378639
> cat-677 [002] d..1 210.299362: : bpf count: CPU-2 5440654
> cat-677 [002] d..1 210.299408: : bpf count: CPU-2 5503211
> cat-677 [002] d..1 210.299454: : bpf count: CPU-2 5565438
> cat-677 [002] d..1 210.299500: : bpf count: CPU-2 5627433
> cat-677 [002] d..1 210.299547: : bpf count: CPU-2 5690033
> cat-677 [002] d..1 210.299593: : bpf count: CPU-2 5752184
> cat-677 [002] d..1 210.299639: : bpf count: CPU-2 5814543
> <...>-548 [009] d..1 210.299667: : bpf count: CPU-9 605418074
> <...>-548 [009] d..1 210.299692: : bpf count: CPU-9 605452692
> cat-677 [002] d..1 210.299700: : bpf count: CPU-2 5896319
> <...>-548 [009] d..1 210.299710: : bpf count: CPU-9 605477824
> <...>-548 [009] d..1 210.299728: : bpf count: CPU-9 605501726
> <...>-548 [009] d..1 210.299745: : bpf count: CPU-9 605525279
> <...>-548 [009] d..1 210.299762: : bpf count: CPU-9 605547817
> <...>-548 [009] d..1 210.299778: : bpf count: CPU-9 605570433
> <...>-548 [009] d..1 210.299795: : bpf count: CPU-9 605592743
> ...
>
> The detail of patches is as follow:
>
> Patch 1/5 introduces a new bpf map type. This map only stores the
> pointer to struct perf_event;
>
> Patch 2/5 introduces a map_traverse_elem() function for further use;
>
> Patch 3/5 convets event file descriptors into perf_event structure when
> add new element to the map;

So far all the map backends are of generic nature, knowing absolutely
nothing about a particular consumer/subsystem of eBPF (tc, socket filters,
etc). The tail call is a bit special, but nevertheless generic for each
user and [very] useful, so it makes sense to inherit from the array map
and move the code there.

I don't really like that we start adding new _special_-cased maps here
to the eBPF core code, it seems quite hacky. :( From your rather terse
commit description where you introduce the maps, I failed to see a
detailed elaboration on this, i.e. why can it not be abstracted any
differently?

> Patch 4/5 implement function bpf_perf_event_read() that get the selected
> hardware PMU conuter;
>
> Patch 5/5 give a simple example.
>
> Kaixu Xia (5):
>   bpf: Add new bpf map type to store the pointer to struct perf_event
>   bpf: Add function map->ops->map_traverse_elem() to traverse map elems
>   bpf: Save the pointer to struct perf_event to map
>   bpf: Implement function bpf_perf_event_read() that get the selected
>     hardware PMU conuter
>   samples/bpf: example of get selected PMU counter value
>
>  include/linux/bpf.h        |   6 +++
>  include/linux/perf_event.h |   5 ++-
>  include/uapi/linux/bpf.h   |   3 ++
>  kernel/bpf/arraymap.c      | 110 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/helpers.c       |  42 +++++++++++++++++
>  kernel/bpf/syscall.c       |  26 +++++++++++
>  kernel/events/core.c       |  30 ++++++++++++-
>  kernel/trace/bpf_trace.c   |   2 +
>  samples/bpf/Makefile       |   4 ++
>  samples/bpf/bpf_helpers.h  |   2 +
>  samples/bpf/tracex6_kern.c |  27 +++++++++++
>  samples/bpf/tracex6_user.c |  67 +++++++++++++++++++++++++++
>  12 files changed, 321 insertions(+), 3 deletions(-)
>  create mode 100644 samples/bpf/tracex6_kern.c
>  create mode 100644 samples/bpf/tracex6_user.c
>
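
To make the concern about special-cased maps a bit more concrete, here is a
toy user-space model of steps 2, 3 and 5 from the quoted cover letter. It is
only a sketch and not code from the series: toy_event, toy_fd_table,
toy_event_array, toy_array_update() and toy_event_read() are hypothetical
names standing in for the perf event, the process fd table, the
BPF_MAP_TYPE_PERF_EVENT_ARRAY map, the fd insertion and
bpf_perf_event_read(). The point it tries to show is that the update path
has to translate a file descriptor into an object pointer, which is
knowledge about one particular consumer that a generic array map does not
otherwise need.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 4   /* assumed number of map slots (one per CPU) */
#define TOY_MAX_FDS 8   /* assumed size of the toy fd table */

/* Hypothetical stand-in for a perf event: just a counter value. */
struct toy_event {
	uint64_t count;
};

/* Hypothetical fd table: maps a small integer "fd" to an event. */
static struct toy_event *toy_fd_table[TOY_MAX_FDS];

/* Hypothetical map: one slot per key (e.g. cpuid), holding a pointer. */
struct toy_event_array {
	struct toy_event *slot[TOY_NR_CPUS];
};

/* Resolve an fd to an event; returns NULL for an invalid or unused fd. */
static struct toy_event *toy_fd_to_event(int fd)
{
	if (fd < 0 || fd >= TOY_MAX_FDS)
		return NULL;
	return toy_fd_table[fd];
}

/*
 * Update path: the caller hands in an fd, but the map stores the resolved
 * pointer. This fd-to-pointer step is the consumer-specific part.
 */
static int toy_array_update(struct toy_event_array *map, unsigned int key,
			    int fd)
{
	struct toy_event *ev;

	if (key >= TOY_NR_CPUS)
		return -1;
	ev = toy_fd_to_event(fd);
	if (!ev)
		return -1;
	map->slot[key] = ev;
	return 0;
}

/* Read path: look the event up by key and copy its counter to *out. */
static int toy_event_read(const struct toy_event_array *map,
			  unsigned int key, uint64_t *out)
{
	if (key >= TOY_NR_CPUS || !map->slot[key])
		return -1;
	*out = map->slot[key]->count;
	return 0;
}
```

In this model a caller would register an event under some fd in
toy_fd_table, call toy_array_update(&map, cpu, fd) once per CPU, and later
call toy_event_read(&map, cpu, &value) with the current CPU as key. Whether
the fd translation belongs in the map backend or in a more generic
abstraction is exactly the open question raised above.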