From: kaixu xia <xiakaixu@huawei.com>
To: <ast@plumgrid.com>, <davem@davemloft.net>, <acme@kernel.org>,
        <mingo@redhat.com>, <a.p.zijlstra@chello.nl>,
        <masami.hiramatsu.pt@hitachi.com>, <jolsa@kernel.org>
CC: <xiakaixu@huawei.com>, <wangnan0@huawei.com>,
        <linux-kernel@vger.kernel.org>, <pi3orama@163.com>,
        <hekuang@huawei.com>
Subject: [RFC PATCH 0/6] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
Date: Fri, 17 Jul 2015 18:43:30 +0800
Message-ID: <1437129816-13176-1-git-send-email-xiakaixu@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5543
Lines: 116

This patch series introduces the ability for eBPF programs to access
hardware PMU counters. Previous discussion on this subject:
https://lkml.org/lkml/2015/5/27/1027.

X86 and other architectures provide many useful PMUs. By combining
PMUs, kprobes and eBPF programs, many interesting things can be done.
For example, by probing at sched:sched_switch we can measure how IPC
changes between different processes by watching the 'cycles' PMU
counter; by probing at the entry and exit points of a kernel function
we can compute the cache miss rate of that function by collecting the
'cache-misses' counter at both points and taking the difference. In
summary, we can define the begin and end points of a procedure, insert
kprobes on them, attach two BPF programs and let them collect a
specific PMU counter. Further, by reading those PMU counters, a BPF
program can provide hints to resource schedulers.
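
As a rough illustration of the begin/end measurement described above,
the arithmetic on two counter snapshots could look like the sketch
below. The struct and function names are hypothetical and not part of
this series, and a 'cache-references' reading is assumed alongside
'cache-misses' so that a rate can be formed. Integer math is used
because eBPF has no floating point.

```c
#include <stdint.h>

/* Hypothetical snapshot of two counters taken at one probe point. */
struct pmu_snapshot {
	uint64_t cache_refs;
	uint64_t cache_misses;
};

/*
 * Events counted between the begin and end probes. Unsigned
 * subtraction wraps modulo 2^64, so a single counter wrap still
 * yields the correct delta.
 */
static uint64_t pmu_delta(uint64_t begin, uint64_t end)
{
	return end - begin;
}

/*
 * Cache miss rate in parts per thousand over the probed interval.
 * Returns 0 when no references were counted. The multiplication can
 * overflow for extremely large miss deltas; this is only a sketch.
 */
static uint64_t miss_rate_permille(const struct pmu_snapshot *begin,
				   const struct pmu_snapshot *end)
{
	uint64_t refs = pmu_delta(begin->cache_refs, end->cache_refs);
	uint64_t misses = pmu_delta(begin->cache_misses, end->cache_misses);

	if (refs == 0)
		return 0;
	return misses * 1000 / refs;
}
```

In the scheme proposed here, the two snapshots would come from the BPF
programs attached at the begin and end kprobes.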

This patchset allows a user to read PMU events in the following way:
 1. Open the PMU using perf_event_open() (for each CPU or for
    each process he/she would like to watch);
 2. Create a BPF map with BPF_MAP_FLAG_PERF_EVENT set in its
    type field;
 3. Insert the FDs into the map with some key-value mapping scheme
    (e.g. cpuid -> event on that CPU);
 4. Load and attach eBPF programs as usual;
 5. In the eBPF program, fetch the perf_event from the map by key
    (e.g. the cpuid returned by bpf_get_smp_processor_id()), then
    use bpf_read_pmu() to read from it;
 6. Do anything he/she wants.
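
Step 1 above can be sketched in user space with the existing
perf_event_open(2) syscall. This is a minimal sketch and not the code
from patch 6/6: the helper names are hypothetical, only the 'cycles'
counter is opened, and steps 2-6 are left out because they depend on
the interfaces proposed in this series.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Describe the generic hardware 'cycles' counter. */
static void init_cycles_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_HARDWARE;
	attr->size = sizeof(*attr);
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
}

/*
 * Open the counter on one CPU. glibc has no wrapper for
 * perf_event_open(2), so the raw syscall is used. pid == -1 with
 * cpu >= 0 counts every task on that CPU, which usually needs
 * elevated privileges or a permissive perf_event_paranoid setting.
 */
static int open_cycles_on_cpu(int cpu)
{
	struct perf_event_attr attr;

	init_cycles_attr(&attr);
	return (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
}

/*
 * Open one event per online CPU. fds[cpu] mirrors the
 * "cpuid -> event on that CPU" layout from step 3. Returns the number
 * of events opened and stops at the first failure.
 */
static int open_cycles_all_cpus(int *fds, int max_cpus)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	int cpu;

	for (cpu = 0; cpu < ncpus && cpu < max_cpus; cpu++) {
		fds[cpu] = open_cycles_on_cpu(cpu);
		if (fds[cpu] < 0)
			break;
	}
	return cpu;
}
```

The resulting FDs are what would be inserted into the map in step 3.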

This patchset consists of the necessary kernel-space changes.
Perf will be the normal user-space tool, based on
https://lkml.org/lkml/2015/7/8/823 (perf tools: filtering events
using eBPF programs) and https://lkml.org/lkml/2015/7/13/831
(Make eBPF programs output data to perf); the corresponding
patches are on the way.

Patch 6/6 is a simple example that shows how to use this new eBPF
program ability. The PMU counter data can be found in
/sys/kernel/debug/tracing/trace (the value of the cycles counter
sampled at 'kprobe/sys_write'):

  $ ./bpf_pmu_test
  $ cat /sys/kernel/debug/tracing/trace
       ...
       syslog-ng-555   [001] dn.1 10189.004626: : bpf count: CPU-0  9935764297
       syslog-ng-555   [001] d..1 10189.053776: : bpf count: CPU-0  10000706398
       syslog-ng-555   [001] dn.1 10189.102972: : bpf count: CPU-0  10067117321
       syslog-ng-555   [001] d..1 10189.152925: : bpf count: CPU-0  10134551505
       syslog-ng-555   [001] dn.1 10189.202043: : bpf count: CPU-0  10200869299
       syslog-ng-555   [001] d..1 10189.251167: : bpf count: CPU-0  10267179481
       syslog-ng-555   [001] dn.1 10189.300285: : bpf count: CPU-0  10333493522
       syslog-ng-555   [001] d..1 10189.349410: : bpf count: CPU-0  10399808073
       syslog-ng-555   [001] dn.1 10189.398528: : bpf count: CPU-0  10466121583
       syslog-ng-555   [001] d..1 10189.447645: : bpf count: CPU-0  10532433368
       syslog-ng-555   [001] d..1 10189.496841: : bpf count: CPU-0  10598841104
       syslog-ng-555   [001] d..1 10189.546891: : bpf count: CPU-0  10666410564
       syslog-ng-555   [001] dn.1 10189.596016: : bpf count: CPU-0  10732729739
       syslog-ng-555   [001] d..1 10189.645146: : bpf count: CPU-0  12884941186
       syslog-ng-555   [001] d..1 10189.694263: : bpf count: CPU-0  12951249903
       syslog-ng-555   [001] dn.1 10189.743382: : bpf count: CPU-0  13017561470
       syslog-ng-555   [001] d..1 10189.792506: : bpf count: CPU-0  13083873521
       syslog-ng-555   [001] d..1 10189.841631: : bpf count: CPU-0  13150190416
       syslog-ng-555   [001] d..1 10189.890749: : bpf count: CPU-0  13216505962
       syslog-ng-555   [001] d..1 10189.939945: : bpf count: CPU-0  13282913062
       ...
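
To turn such trace lines into per-interval cycle counts, the values
can be parsed and subtracted pairwise. The sketch below is not part of
the series; the function names are hypothetical and the line format is
assumed to be exactly the "bpf count: CPU-<n>  <value>" form shown
above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the counter value from one trace line.
 * Returns 0 on success, -1 if the line carries no "bpf count" record.
 */
static int parse_bpf_count(const char *line, uint64_t *value)
{
	const char *p = strstr(line, "bpf count: CPU-");
	int cpu;

	if (!p)
		return -1;
	if (sscanf(p, "bpf count: CPU-%d %" SCNu64, &cpu, value) != 2)
		return -1;
	return 0;
}

/*
 * Cycles elapsed between two consecutive trace lines.
 * Returns 0 on success, -1 if either line cannot be parsed.
 */
static int count_delta(const char *prev, const char *next, uint64_t *delta)
{
	uint64_t a, b;

	if (parse_bpf_count(prev, &a) || parse_bpf_count(next, &b))
		return -1;
	*delta = b - a;
	return 0;
}
```

For the first two lines of the output above this gives
10000706398 - 9935764297 = 64942101 cycles.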

The patches in detail:

Patch 1/6 introduces a map flag. The flag bit is encoded into the type
field passed through attr;

Patch 2/6 introduces a map_traverse_elem() function for further use;

Patch 3/6 converts event file descriptors into perf_event structures
when a new element is added to a map with the flag set;

Patch 4/6 introduces a bpf program function argument constraint for
PMU maps;

Patch 5/6 implements the function bpf_read_pmu(), which reads the
selected hardware PMU counter;

Patch 6/6 gives a simple example.
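
To illustrate the idea behind patch 1/6, the sketch below shows one
way a flag bit can share a 32-bit type field with the base map type.
The bit position, the macro names and the helper names are assumptions
made for this illustration only; the actual encoding is defined by the
patch itself.

```c
#include <stdint.h>

/* Assumed for illustration: the top bit of the type field is the flag. */
#define SKETCH_MAP_FLAG_PERF_EVENT	(1u << 31)
#define SKETCH_MAP_FLAG_MASK		SKETCH_MAP_FLAG_PERF_EVENT

/* Combine a base map type with flag bits into a single type field. */
static uint32_t sketch_encode_type(uint32_t base_type, uint32_t flags)
{
	return base_type | (flags & SKETCH_MAP_FLAG_MASK);
}

/* Recover the base map type by masking off the flag bits. */
static uint32_t sketch_base_type(uint32_t type_field)
{
	return type_field & ~SKETCH_MAP_FLAG_MASK;
}

/* Non-zero when the map was created to hold perf events. */
static int sketch_is_perf_event_map(uint32_t type_field)
{
	return (type_field & SKETCH_MAP_FLAG_PERF_EVENT) != 0;
}
```

With a layout like this, user space would OR the flag into the map
type at creation time, and the kernel side would mask it off before
selecting the map implementation.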

kaixu xia (6):
  bpf: Add new flags that specify the value type stored in map
  bpf: Add function map->ops->map_traverse_elem() to traverse map elems
  bpf: Save the pointer to struct perf_event to map
  bpf: Add a bpf program function argument constraint for PMU map
  bpf: Implement function bpf_read_pmu() that get the selected hardware
    PMU conuter
  samples/bpf: example of get selected PMU counter value

 include/linux/bpf.h        |    7 +++
 include/linux/perf_event.h |    2 +
 include/uapi/linux/bpf.h   |   16 +++++
 kernel/bpf/arraymap.c      |   17 ++++++
 kernel/bpf/hashtab.c       |   27 +++++++++
 kernel/bpf/helpers.c       |   27 +++++++++
 kernel/bpf/syscall.c       |   81 ++++++++++++++++++++++++-
 kernel/bpf/verifier.c      |    9 +++
 kernel/events/core.c       |   22 +++++++
 kernel/trace/bpf_trace.c   |    2 +
 samples/bpf/bpf_helpers.h  |    2 +
 samples/bpf/bpf_pmu_test.c |  143 ++++++++++++++++++++++++++++++++++++++++++++
 12 files changed, 353 insertions(+), 2 deletions(-)
 create mode 100644 samples/bpf/bpf_pmu_test.c

-- 
1.7.10.4
