Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1945977AbbGQKp0 (ORCPT ); Fri, 17 Jul 2015 06:45:26 -0400
Received: from szxga02-in.huawei.com ([119.145.14.65]:32599 "EHLO
	szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1757667AbbGQKpU (ORCPT ); Fri, 17 Jul 2015 06:45:20 -0400
From: kaixu xia
Subject: [RFC PATCH 0/6] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
Date: Fri, 17 Jul 2015 18:43:30 +0800
Message-ID: <1437129816-13176-1-git-send-email-xiakaixu@huawei.com>
X-Mailer: git-send-email 1.7.10.4
MIME-Version: 1.0
Content-Type: text/plain
X-Originating-IP: [10.110.52.33]
X-CFilter-Loop: Reflected
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 5543
Lines: 116

This patch series introduces the ability for eBPF programs to read
hardware PMU counters.

Previous discussion on this subject:
https://lkml.org/lkml/2015/5/27/1027

X86 and other architectures provide many useful PMUs. Combining PMUs,
kprobes and eBPF programs makes many interesting things possible. For
example, by probing at sched:sched_switch we can measure how IPC changes
between processes by watching the 'cycle' PMU counter; by probing at the
entry and exit points of a kernel function we can compute the cache miss
rate of that function by collecting the 'cache-misses' counter and
looking at the difference.

In summary, we can define the begin and end points of a procedure,
insert kprobes on them, attach two BPF programs and let them collect a
specific PMU counter. Further, by reading those PMU counters a BPF
program can give hints to resource schedulers.

This patchset allows users to read PMU events in the following way:

 1. Open the PMU using perf_event_open() (for each CPU or for each
    process the user would like to watch);
 2. Create a BPF map with BPF_MAP_FLAG_PERF_EVENT set in its type field;
 3. Insert FDs into the map with some key-value mapping scheme
    (e.g. cpuid -> event on that CPU);
 4. Load and attach eBPF programs as usual;
 5. In the eBPF program, fetch the perf_event from the map with a key
    (e.g. the cpuid obtained from bpf_get_smp_processor_id()), then use
    bpf_read_pmu() to read from it;
 6. Do whatever the user wants with the value.

This patchset consists of the necessary changes to kernel space. Perf
will be the normal user space tool, based on
https://lkml.org/lkml/2015/7/8/823 (perf tools: filtering events using
eBPF programs) and https://lkml.org/lkml/2015/7/13/831 (Make eBPF
programs output data to perf); the corresponding patches are on the way.

Patch 6/6 is a simple example that shows how to use this new eBPF
program ability. The PMU counter data can be found in
/sys/kernel/debug/tracing/trace (the cycles counter value sampled at
'kprobe/sys_write'):

 $ ./bpf_pmu_test
 $ cat /sys/kernel/debug/tracing/trace
 ...
 syslog-ng-555 [001] dn.1 10189.004626: : bpf count: CPU-0 9935764297
 syslog-ng-555 [001] d..1 10189.053776: : bpf count: CPU-0 10000706398
 syslog-ng-555 [001] dn.1 10189.102972: : bpf count: CPU-0 10067117321
 syslog-ng-555 [001] d..1 10189.152925: : bpf count: CPU-0 10134551505
 syslog-ng-555 [001] dn.1 10189.202043: : bpf count: CPU-0 10200869299
 syslog-ng-555 [001] d..1 10189.251167: : bpf count: CPU-0 10267179481
 syslog-ng-555 [001] dn.1 10189.300285: : bpf count: CPU-0 10333493522
 syslog-ng-555 [001] d..1 10189.349410: : bpf count: CPU-0 10399808073
 syslog-ng-555 [001] dn.1 10189.398528: : bpf count: CPU-0 10466121583
 syslog-ng-555 [001] d..1 10189.447645: : bpf count: CPU-0 10532433368
 syslog-ng-555 [001] d..1 10189.496841: : bpf count: CPU-0 10598841104
 syslog-ng-555 [001] d..1 10189.546891: : bpf count: CPU-0 10666410564
 syslog-ng-555 [001] dn.1 10189.596016: : bpf count: CPU-0 10732729739
 syslog-ng-555 [001] d..1 10189.645146: : bpf count: CPU-0 12884941186
 syslog-ng-555 [001] d..1 10189.694263: : bpf count: CPU-0 12951249903
 syslog-ng-555 [001] dn.1 10189.743382: : bpf count: CPU-0 13017561470
 syslog-ng-555 [001] d..1 10189.792506: : bpf count: CPU-0 13083873521
 syslog-ng-555 [001] d..1 10189.841631: : bpf count: CPU-0 13150190416
 syslog-ng-555 [001] d..1 10189.890749: : bpf count: CPU-0 13216505962
 syslog-ng-555 [001] d..1 10189.939945: : bpf count: CPU-0 13282913062
 ...

The details of the patches are as follows:

Patch 1/6 introduces a map flag. The flag bit is encoded into the type
field passed through attr;

Patch 2/6 introduces a map_traverse_elem() function for further use;

Patch 3/6 converts event file descriptors into perf_event structures
when a new element is added to a map with the flag set;

Patch 4/6 introduces a bpf program function argument constraint for
PMU maps;

Patch 5/6 implements the function bpf_read_pmu(), which gets the
selected hardware PMU counter;

Patch 6/6 gives a simple example.

kaixu xia (6):
  bpf: Add new flags that specify the value type stored in map
  bpf: Add function map->ops->map_traverse_elem() to traverse map elems
  bpf: Save the pointer to struct perf_event to map
  bpf: Add a bpf program function argument constraint for PMU map
  bpf: Implement function bpf_read_pmu() that get the selected hardware
    PMU conuter
  samples/bpf: example of get selected PMU counter value

 include/linux/bpf.h        |   7 +++
 include/linux/perf_event.h |   2 +
 include/uapi/linux/bpf.h   |  16 +++++
 kernel/bpf/arraymap.c      |  17 ++++++
 kernel/bpf/hashtab.c       |  27 +++++++++
 kernel/bpf/helpers.c       |  27 +++++++++
 kernel/bpf/syscall.c       |  81 ++++++++++++++++++++++++-
 kernel/bpf/verifier.c      |   9 +++
 kernel/events/core.c       |  22 +++++++
 kernel/trace/bpf_trace.c   |   2 +
 samples/bpf/bpf_helpers.h  |   2 +
 samples/bpf/bpf_pmu_test.c | 143 ++++++++++++++++++++++++++++++++++++++++++++
 12 files changed, 353 insertions(+), 2 deletions(-)
 create mode 100644 samples/bpf/bpf_pmu_test.c

-- 
1.7.10.4
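
The cover letter suggests collecting a PMU counter at two points and
looking at the difference. As a rough illustration of that idea applied
to the sample trace output, the sketch below parses "bpf count" lines
and prints the cycles elapsed between consecutive samples. It is a
hypothetical post-processing helper, not part of the patchset: the names
LINE_RE, parse_samples() and deltas() are invented here, and the line
format is assumed to match the trace excerpt shown in this message.

```python
import re

# Assumed line format, taken from the trace excerpt in this message, e.g.
#   syslog-ng-555 [001] dn.1 10189.004626: : bpf count: CPU-0 9935764297
LINE_RE = re.compile(
    r"\s(?P<ts>\d+\.\d+): .*bpf count: CPU-(?P<cpu>\d+) (?P<count>\d+)\s*$"
)


def parse_samples(text):
    """Return a list of (timestamp, cpu, count) tuples found in trace text."""
    samples = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            samples.append(
                (float(m.group("ts")), int(m.group("cpu")), int(m.group("count")))
            )
    return samples


def deltas(samples):
    """Return (timestamp, cpu, counter increase since the previous sample).

    Samples are grouped by the CPU label printed in the message, so the
    first sample seen for each CPU produces no delta.
    """
    last = {}
    out = []
    for ts, cpu, count in samples:
        if cpu in last:
            out.append((ts, cpu, count - last[cpu]))
        last[cpu] = count
    return out


# First three samples from the trace excerpt in this message.
TRACE = """\
syslog-ng-555 [001] dn.1 10189.004626: : bpf count: CPU-0 9935764297
syslog-ng-555 [001] d..1 10189.053776: : bpf count: CPU-0 10000706398
syslog-ng-555 [001] dn.1 10189.102972: : bpf count: CPU-0 10067117321
"""

for ts, cpu, d in deltas(parse_samples(TRACE)):
    print(f"{ts:.6f} CPU-{cpu} +{d} cycles")
```

In practice the text would be read from /sys/kernel/debug/tracing/trace
rather than an embedded string; lines that do not match the assumed
format are simply skipped.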