Message-ID: <55B179DB.4080308@iogearbox.net>
Date: Fri, 24 Jul 2015 01:33:47 +0200
From: Daniel Borkmann <daniel@iogearbox.net>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Kaixu Xia <xiakaixu@huawei.com>, ast@plumgrid.com, davem@davemloft.net,
        acme@kernel.org, mingo@redhat.com, a.p.zijlstra@chello.nl,
        masami.hiramatsu.pt@hitachi.com, jolsa@kernel.org
CC: wangnan0@huawei.com, linux-kernel@vger.kernel.org, pi3orama@163.com,
        hekuang@huawei.com
Subject: Re: [PATCH v2 0/5] bpf: Introduce the new ability of eBPF programs
 to access hardware PMU counter
References: <1437552572-84748-1-git-send-email-xiakaixu@huawei.com>
In-Reply-To: <1437552572-84748-1-git-send-email-xiakaixu@huawei.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5486
Lines: 112

On 07/22/2015 10:09 AM, Kaixu Xia wrote:
> Previous patch v1 url:
> https://lkml.org/lkml/2015/7/17/287

[ Sorry to chime in late, just noticed this series now as I wasn't in Cc for
   the core BPF changes. More below ... ]

> This patchset allows users to read PMU events in the following way:
>   1. Open the PMU using perf_event_open() (for each CPU or for
>      each process they'd like to watch);
>   2. Create a BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map;
>   3. Insert FDs into the map with some key-value mapping scheme
>      (e.g. cpuid -> event on that CPU);
>   4. Load and attach eBPF programs as usual;
>   5. In the eBPF program, get the perf_event_map_fd and key (e.g.
>      the cpuid from bpf_get_smp_processor_id()), then use
>      bpf_perf_event_read() to read from it;
>   6. Do anything they want.
>
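
[ A minimal userspace sketch of step 1 above, for readers following along.
   It assumes a cycles counter (as in the sample output further down) opened
   per CPU for all tasks. The helper names init_cycles_attr() and
   open_cycles_on_cpu() are made up for illustration and are not taken from
   the patches. Steps 2 and 3 (map creation and FD insertion) are not shown. ]

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: describe a hardware CPU-cycles counter. */
void init_cycles_attr(struct perf_event_attr *attr)
{
	assert(attr != NULL);
	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_HARDWARE;
	attr->size = sizeof(*attr);
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
}

/*
 * Hypothetical helper: open the counter on one CPU for all tasks.
 * glibc provides no perf_event_open() wrapper, so use syscall(2).
 * Returns an event FD, or -1 with errno set (CPU-wide events may be
 * refused depending on privileges and perf_event_paranoid).
 */
int open_cycles_on_cpu(int cpu)
{
	struct perf_event_attr attr;

	init_cycles_attr(&attr);
	/* pid == -1 and cpu >= 0: count every task running on that CPU */
	return (int)syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}
```

The FD returned for each CPU would then be the value stored under that
CPU's key in step 3.
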
> changes in V2:
>   - put atomic_long_inc_not_zero() between fdget() and fdput();
>   - limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
>   - Only read the event counter on current CPU or on current
>     process;
>   - add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
>     pointer to the struct perf_event;
>   - according to the perf_event_map_fd and key, the function
>     bpf_perf_event_read() can get the Hardware PMU counter value;
>
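
[ The event type restriction in the changelog above boils down to a check
   along these lines. This is only a sketch: pmu_event_type_allowed() is a
   hypothetical name, and it says nothing about where or how the patches
   actually perform the check. ]

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdbool.h>

/* Hypothetical helper: accept only the event types the changelog lists. */
bool pmu_event_type_allowed(unsigned int type)
{
	switch (type) {
	case PERF_TYPE_RAW:
	case PERF_TYPE_HARDWARE:
		return true;
	default:
		/* software, tracepoint, etc. are rejected */
		return false;
	}
}
```
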
> Patch 5/5 is a simple example that shows how to use this new eBPF
> program ability. The PMU counter data can be found in
> /sys/kernel/debug/tracing/trace (or trace_pipe). The output below
> shows the cycles PMU value sampled at 'kprobe/sys_write'.
>
>    $ cat /sys/kernel/debug/tracing/trace_pipe
>    $ ./tracex6
>         ...
>               cat-677   [002] d..1   210.299270: : bpf count: CPU-2  5316659
>               cat-677   [002] d..1   210.299316: : bpf count: CPU-2  5378639
>               cat-677   [002] d..1   210.299362: : bpf count: CPU-2  5440654
>               cat-677   [002] d..1   210.299408: : bpf count: CPU-2  5503211
>               cat-677   [002] d..1   210.299454: : bpf count: CPU-2  5565438
>               cat-677   [002] d..1   210.299500: : bpf count: CPU-2  5627433
>               cat-677   [002] d..1   210.299547: : bpf count: CPU-2  5690033
>               cat-677   [002] d..1   210.299593: : bpf count: CPU-2  5752184
>               cat-677   [002] d..1   210.299639: : bpf count: CPU-2  5814543
>             <...>-548   [009] d..1   210.299667: : bpf count: CPU-9  605418074
>             <...>-548   [009] d..1   210.299692: : bpf count: CPU-9  605452692
>               cat-677   [002] d..1   210.299700: : bpf count: CPU-2  5896319
>             <...>-548   [009] d..1   210.299710: : bpf count: CPU-9  605477824
>             <...>-548   [009] d..1   210.299728: : bpf count: CPU-9  605501726
>             <...>-548   [009] d..1   210.299745: : bpf count: CPU-9  605525279
>             <...>-548   [009] d..1   210.299762: : bpf count: CPU-9  605547817
>             <...>-548   [009] d..1   210.299778: : bpf count: CPU-9  605570433
>             <...>-548   [009] d..1   210.299795: : bpf count: CPU-9  605592743
>         ...
>
> The details of the patches are as follows:
>
> Patch 1/5 introduces a new bpf map type. This map only stores the
> pointer to struct perf_event;
>
> Patch 2/5 introduces a map_traverse_elem() function for further use;
>
> Patch 3/5 converts event file descriptors into perf_event structures
> when a new element is added to the map;

So far all the map backends are of generic nature, knowing absolutely nothing
about a particular consumer/subsystem of eBPF (tc, socket filters, etc). The
tail call is a bit special, but nevertheless generic for each user and [very]
useful, so it makes sense to inherit from the array map and move the code there.

I don't really like that we start adding new _special_-cased maps here into the
eBPF core code; it seems quite hacky. :( From your rather terse commit description
where you introduce the maps, I failed to see a detailed elaboration on this, i.e.
why can it not be abstracted any differently?

> Patch 4/5 implements the function bpf_perf_event_read(), which gets the
> selected hardware PMU counter;
>
> Patch 5/5 gives a simple example.
>
> Kaixu Xia (5):
>    bpf: Add new bpf map type to store the pointer to struct perf_event
>    bpf: Add function map->ops->map_traverse_elem() to traverse map elems
>    bpf: Save the pointer to struct perf_event to map
>    bpf: Implement function bpf_perf_event_read() that get the selected
>      hardware PMU conuter
>    samples/bpf: example of get selected PMU counter value
>
>   include/linux/bpf.h        |   6 +++
>   include/linux/perf_event.h |   5 ++-
>   include/uapi/linux/bpf.h   |   3 ++
>   kernel/bpf/arraymap.c      | 110 +++++++++++++++++++++++++++++++++++++++++++++
>   kernel/bpf/helpers.c       |  42 +++++++++++++++++
>   kernel/bpf/syscall.c       |  26 +++++++++++
>   kernel/events/core.c       |  30 ++++++++++++-
>   kernel/trace/bpf_trace.c   |   2 +
>   samples/bpf/Makefile       |   4 ++
>   samples/bpf/bpf_helpers.h  |   2 +
>   samples/bpf/tracex6_kern.c |  27 +++++++++++
>   samples/bpf/tracex6_user.c |  67 +++++++++++++++++++++++++++
>   12 files changed, 321 insertions(+), 3 deletions(-)
>   create mode 100644 samples/bpf/tracex6_kern.c
>   create mode 100644 samples/bpf/tracex6_user.c
>
