2015-07-31 12:00:09

by xiakaixu

[permalink] [raw]
Subject: [PATCH v5 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

Previous patch v4 url:
https://lkml.org/lkml/2015/7/28/432

changes in V5:
- move struct fd_array_map_ops* fd_ops to bpf_map;
- move array perf event decrement refcnt function to
map_free;
- fix the NULL ptr of perf_event_get();
- move bpf_perf_event_read() to kernel/bpf/bpf_trace.c;
- get rid of the remaining struct bpf_prog;
- move the unnecessay cast on void *;

changes in V4:
- make the bpf_prog_array_map more generic;
- fix the bug of event refcnt leak;
- use more useful errno in bpf_perf_event_read();

changes in V3:
- collapse V2 patches 1-3 into one;
- drop the function map->ops->map_traverse_elem() and release
the struct perf_event in map_free;
- only allow to access bpf_perf_event_read() from programs;
- update the perf_event_array_map elem via xchg();
- pass index directly to bpf_perf_event_read() instead of
MAP_KEY;

changes in V2:
- put atomic_long_inc_not_zero() between fdget() and fdput();
- limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
- Only read the event counter on current CPU or on current
process;
- add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
pointer to the struct perf_event;
- according to the perf_event_map_fd and key, the function
bpf_perf_event_read() can get the Hardware PMU counter value;

Patch 4/4 is a simple example and shows how to use this new eBPF
programs ability. The PMU counter data can be found in
/sys/kernel/debug/tracing/trace(trace_pipe).(the cycles PMU
value when 'kprobe/sys_write' sampling)

$ cat /sys/kernel/debug/tracing/trace_pipe
$ ./tracex6
...
syslog-ng-548 [000] d..1 76.905673: : CPU-0 681765271
syslog-ng-548 [000] d..1 76.905690: : CPU-0 681787855
syslog-ng-548 [000] d..1 76.905707: : CPU-0 681810504
syslog-ng-548 [000] d..1 76.905725: : CPU-0 681834771
syslog-ng-548 [000] d..1 76.905745: : CPU-0 681859519
syslog-ng-548 [000] d..1 76.905766: : CPU-0 681890419
syslog-ng-548 [000] d..1 76.905783: : CPU-0 681914045
syslog-ng-548 [000] d..1 76.905800: : CPU-0 681935950
syslog-ng-548 [000] d..1 76.905816: : CPU-0 681958299
ls-690 [005] d..1 82.241308: : CPU-5 3138451
sh-691 [004] d..1 82.244570: : CPU-4 7324988
<...>-699 [007] d..1 99.961387: : CPU-7 3194027
<...>-695 [003] d..1 99.961474: : CPU-3 288901
<...>-695 [003] d..1 99.961541: : CPU-3 383145
<...>-695 [003] d..1 99.961591: : CPU-3 450365
<...>-695 [003] d..1 99.961639: : CPU-3 515751
<...>-695 [003] d..1 99.961686: : CPU-3 579047
...

The detail of patches is as follow:

Patch 1/4 rewrites part of the bpf_prog_array map code and make it
more generic;

Patch 2/4 introduces a new bpf map type. This map only stores the
pointer to struct perf_event;

Patch 3/4 implements function bpf_perf_event_read() that get the
selected hardware PMU conuter;

Patch 4/4 gives a simple example.

Kaixu Xia (3):
bpf: Add new bpf map type to store the pointer to struct perf_event
bpf: Implement function bpf_perf_event_read() that get the selected
hardware PMU conuter
samples/bpf: example of get selected PMU counter value

Wang Nan (1):
bpf: Make the bpf_prog_array_map more generic

arch/x86/net/bpf_jit_comp.c | 6 +-
include/linux/bpf.h | 10 +++-
include/linux/perf_event.h | 14 ++++-
include/uapi/linux/bpf.h | 2 +
kernel/bpf/arraymap.c | 135 ++++++++++++++++++++++++++++++++++----------
kernel/bpf/core.c | 2 +-
kernel/bpf/syscall.c | 2 +-
kernel/bpf/verifier.c | 56 +++++++++++++-----
kernel/events/core.c | 27 ++++++---
kernel/trace/bpf_trace.c | 37 ++++++++++++
samples/bpf/Makefile | 4 ++
samples/bpf/bpf_helpers.h | 2 +
samples/bpf/tracex6_kern.c | 26 +++++++++
samples/bpf/tracex6_user.c | 68 ++++++++++++++++++++++
14 files changed, 328 insertions(+), 63 deletions(-)
create mode 100644 samples/bpf/tracex6_kern.c
create mode 100644 samples/bpf/tracex6_user.c

--
1.8.3.4


2015-07-31 12:01:47

by xiakaixu

[permalink] [raw]
Subject: [PATCH v5 1/4] bpf: Make the bpf_prog_array_map more generic

From: Wang Nan <[email protected]>

According to the comments from Daniel, rewrite part of
the bpf_prog_array map code and make it more generic.
So the new perf_event_array map type can reuse most of
code with bpf_prog_array map and add fewer lines of
special code.

Tested the samples/bpf/tracex5 after this patch:
$ sudo ./tracex5
...
dd-1051 [000] d... 26.682903: : mmap
dd-1051 [000] d... 26.698348: : syscall=102 (one of get/set uid/pid/gid)
dd-1051 [000] d... 26.703892: : read(fd=0, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.705847: : write(fd=1, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.707914: : read(fd=0, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.710988: : write(fd=1, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.711865: : read(fd=0, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.712704: : write(fd=1, buf=000000000078c010, size=512)
...

Signed-off-by: Wang Nan <[email protected]>
Signed-off-by: Kaixu Xia <[email protected]>
---
arch/x86/net/bpf_jit_comp.c | 6 ++--
include/linux/bpf.h | 8 +++--
kernel/bpf/arraymap.c | 80 +++++++++++++++++++++++++++------------------
kernel/bpf/core.c | 2 +-
kernel/bpf/syscall.c | 2 +-
5 files changed, 60 insertions(+), 38 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 579a8fd..e377f07 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -246,7 +246,7 @@ static void emit_prologue(u8 **pprog)
* goto out;
* if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
* goto out;
- * prog = array->prog[index];
+ * prog = array->ptrs[index];
* if (prog == NULL)
* goto out;
* goto *(prog->bpf_func + prologue_size);
@@ -284,9 +284,9 @@ static void emit_bpf_tail_call(u8 **pprog)
EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */
EMIT2_off32(0x89, 0x85, -STACKSIZE + 36); /* mov dword ptr [rbp - 516], eax */

- /* prog = array->prog[index]; */
+ /* prog = array->ptrs[index]; */
EMIT4(0x48, 0x8D, 0x44, 0xD6); /* lea rax, [rsi + rdx * 8 + 0x50] */
- EMIT1(offsetof(struct bpf_array, prog));
+ EMIT1(offsetof(struct bpf_array, ptrs));
EMIT3(0x48, 0x8B, 0x00); /* mov rax, qword ptr [rax] */

/* if (prog == NULL)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4383476..a8ce262 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -24,6 +24,10 @@ struct bpf_map_ops {
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
int (*map_delete_elem)(struct bpf_map *map, void *key);
+
+ /* funcs called by prog_array and perf_event_array map */
+ void *(*map_fd_get_ptr) (struct bpf_map *map, int fd);
+ void (*map_fd_put_ptr) (void *ptr);
};

struct bpf_map {
@@ -142,13 +146,13 @@ struct bpf_array {
bool owner_jited;
union {
char value[0] __aligned(8);
- struct bpf_prog *prog[0] __aligned(8);
+ void *ptrs[0] __aligned(8);
};
};
#define MAX_TAIL_CALL_CNT 32

u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
-void bpf_prog_array_map_clear(struct bpf_map *map);
+void bpf_fd_array_map_clear(struct bpf_map *map);
bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
const struct bpf_func_proto *bpf_get_trace_printk_proto(void);

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index cb31229..45df657 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -150,15 +150,15 @@ static int __init register_array_map(void)
}
late_initcall(register_array_map);

-static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
+static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr)
{
- /* only bpf_prog file descriptors can be stored in prog_array map */
+ /* only file descriptors can be stored in this type of map */
if (attr->value_size != sizeof(u32))
return ERR_PTR(-EINVAL);
return array_map_alloc(attr);
}

-static void prog_array_map_free(struct bpf_map *map)
+static void fd_array_map_free(struct bpf_map *map)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;
@@ -167,21 +167,21 @@ static void prog_array_map_free(struct bpf_map *map)

/* make sure it's empty */
for (i = 0; i < array->map.max_entries; i++)
- BUG_ON(array->prog[i] != NULL);
+ BUG_ON(array->ptrs[i] != NULL);
kvfree(array);
}

-static void *prog_array_map_lookup_elem(struct bpf_map *map, void *key)
+static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
{
return NULL;
}

/* only called from syscall */
-static int prog_array_map_update_elem(struct bpf_map *map, void *key,
- void *value, u64 map_flags)
+static int fd_array_map_update_elem(struct bpf_map *map, void *key,
+ void *value, u64 map_flags)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
- struct bpf_prog *prog, *old_prog;
+ void *new_ptr, *old_ptr;
u32 index = *(u32 *)key, ufd;

if (map_flags != BPF_ANY)
@@ -191,57 +191,75 @@ static int prog_array_map_update_elem(struct bpf_map *map, void *key,
return -E2BIG;

ufd = *(u32 *)value;
- prog = bpf_prog_get(ufd);
- if (IS_ERR(prog))
- return PTR_ERR(prog);
-
- if (!bpf_prog_array_compatible(array, prog)) {
- bpf_prog_put(prog);
- return -EINVAL;
- }
+ new_ptr = map->ops->map_fd_get_ptr(map, ufd);
+ if (IS_ERR(new_ptr))
+ return PTR_ERR(new_ptr);

- old_prog = xchg(array->prog + index, prog);
- if (old_prog)
- bpf_prog_put_rcu(old_prog);
+ old_ptr = xchg(array->ptrs + index, new_ptr);
+ if (old_ptr)
+ map->ops->map_fd_put_ptr(old_ptr);

return 0;
}

-static int prog_array_map_delete_elem(struct bpf_map *map, void *key)
+static int fd_array_map_delete_elem(struct bpf_map *map, void *key)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
- struct bpf_prog *old_prog;
+ void *old_ptr;
u32 index = *(u32 *)key;

if (index >= array->map.max_entries)
return -E2BIG;

- old_prog = xchg(array->prog + index, NULL);
- if (old_prog) {
- bpf_prog_put_rcu(old_prog);
+ old_ptr = xchg(array->ptrs + index, NULL);
+ if (old_ptr) {
+ map->ops->map_fd_put_ptr(old_ptr);
return 0;
} else {
return -ENOENT;
}
}

+static void *prog_fd_array_get_ptr(struct bpf_map *map, int fd)
+{
+ struct bpf_array *array = container_of(map, struct bpf_array, map);
+ struct bpf_prog *prog = bpf_prog_get(fd);
+ if (IS_ERR(prog))
+ return prog;
+
+ if (!bpf_prog_array_compatible(array, prog)) {
+ bpf_prog_put(prog);
+ return ERR_PTR(-EINVAL);
+ }
+ return prog;
+}
+
+static void prog_fd_array_put_ptr(void *ptr)
+{
+ struct bpf_prog *prog = ptr;
+
+ bpf_prog_put_rcu(prog);
+}
+
/* decrement refcnt of all bpf_progs that are stored in this map */
-void bpf_prog_array_map_clear(struct bpf_map *map)
+void bpf_fd_array_map_clear(struct bpf_map *map)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;

for (i = 0; i < array->map.max_entries; i++)
- prog_array_map_delete_elem(map, &i);
+ fd_array_map_delete_elem(map, &i);
}

static const struct bpf_map_ops prog_array_ops = {
- .map_alloc = prog_array_map_alloc,
- .map_free = prog_array_map_free,
+ .map_alloc = fd_array_map_alloc,
+ .map_free = fd_array_map_free,
.map_get_next_key = array_map_get_next_key,
- .map_lookup_elem = prog_array_map_lookup_elem,
- .map_update_elem = prog_array_map_update_elem,
- .map_delete_elem = prog_array_map_delete_elem,
+ .map_lookup_elem = fd_array_map_lookup_elem,
+ .map_update_elem = fd_array_map_update_elem,
+ .map_delete_elem = fd_array_map_delete_elem,
+ .map_fd_get_ptr = prog_fd_array_get_ptr,
+ .map_fd_put_ptr = prog_fd_array_put_ptr,
};

static struct bpf_map_type_list prog_array_type __read_mostly = {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c5bedc8..be848d0 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -449,7 +449,7 @@ select_insn:

tail_call_cnt++;

- prog = READ_ONCE(array->prog[index]);
+ prog = READ_ONCE(array->ptrs[index]);
if (unlikely(!prog))
goto out;

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a1b14d1..dc9b464 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -72,7 +72,7 @@ static int bpf_map_release(struct inode *inode, struct file *filp)
/* prog_array stores refcnt-ed bpf_prog pointers
* release them all when user space closes prog_array_fd
*/
- bpf_prog_array_map_clear(map);
+ bpf_fd_array_map_clear(map);

bpf_map_put(map);
return 0;
--
1.8.3.4

2015-07-31 12:00:01

by xiakaixu

[permalink] [raw]
Subject: [PATCH v5 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event

Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'.
This map only stores the pointer to struct perf_event. The
user space event FDs from perf_event_open() syscall are converted
to the pointer to struct perf_event and stored in map.

Signed-off-by: Kaixu Xia <[email protected]>
---
include/linux/bpf.h | 1 +
include/linux/perf_event.h | 2 ++
include/uapi/linux/bpf.h | 1 +
kernel/bpf/arraymap.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/events/core.c | 17 ++++++++++++++
5 files changed, 76 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a8ce262..d0b394a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -10,6 +10,7 @@
#include <uapi/linux/bpf.h>
#include <linux/workqueue.h>
#include <linux/file.h>
+#include <linux/perf_event.h>

struct bpf_map;

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..27e05c1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -641,6 +641,7 @@ extern int perf_event_init_task(struct task_struct *child);
extern void perf_event_exit_task(struct task_struct *child);
extern void perf_event_free_task(struct task_struct *task);
extern void perf_event_delayed_put(struct task_struct *task);
+extern struct perf_event *perf_event_get(unsigned int fd);
extern void perf_event_print_debug(void);
extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
@@ -979,6 +980,7 @@ static inline int perf_event_init_task(struct task_struct *child) { return 0; }
static inline void perf_event_exit_task(struct task_struct *child) { }
static inline void perf_event_free_task(struct task_struct *task) { }
static inline void perf_event_delayed_put(struct task_struct *task) { }
+static struct perf_event *perf_event_get(unsigned int fd) { return ERR_PTR(-EINVAL); }
static inline void perf_event_print_debug(void) { }
static inline int perf_event_task_disable(void) { return -EINVAL; }
static inline int perf_event_task_enable(void) { return -EINVAL; }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 29ef6f9..69a1f6b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -114,6 +114,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_HASH,
BPF_MAP_TYPE_ARRAY,
BPF_MAP_TYPE_PROG_ARRAY,
+ BPF_MAP_TYPE_PERF_EVENT_ARRAY,
};

enum bpf_prog_type {
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 45df657..b7e0b5d 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -273,3 +273,58 @@ static int __init register_prog_array_map(void)
return 0;
}
late_initcall(register_prog_array_map);
+
+static void perf_event_array_map_free(struct bpf_map *map)
+{
+ bpf_fd_array_map_clear(map);
+ fd_array_map_free(map);
+}
+
+static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
+{
+ struct perf_event *event;
+
+ event = perf_event_get(fd);
+ if (IS_ERR(event))
+ return event;
+
+ /*
+ * prevent some crazy events so we can make our life easier
+ */
+ if (event->attr.type != PERF_TYPE_RAW &&
+ event->attr.type != PERF_TYPE_HARDWARE) {
+ perf_event_release_kernel(event);
+ return ERR_PTR(-EINVAL);
+ }
+ return event;
+}
+
+static void perf_event_fd_array_put_ptr(void *ptr)
+{
+ struct perf_event *event = ptr;
+
+ perf_event_release_kernel(event);
+}
+
+static const struct bpf_map_ops perf_event_array_ops = {
+ .map_alloc = fd_array_map_alloc,
+ .map_free = perf_event_array_map_free,
+ .map_get_next_key = array_map_get_next_key,
+ .map_lookup_elem = fd_array_map_lookup_elem,
+ .map_update_elem = fd_array_map_update_elem,
+ .map_delete_elem = fd_array_map_delete_elem,
+ .map_fd_get_ptr = perf_event_fd_array_get_ptr,
+ .map_fd_put_ptr = perf_event_fd_array_put_ptr,
+};
+
+static struct bpf_map_type_list perf_event_array_type __read_mostly = {
+ .ops = &perf_event_array_ops,
+ .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+};
+
+static int __init register_perf_event_array_map(void)
+{
+ bpf_register_map_type(&perf_event_array_type);
+ return 0;
+}
+late_initcall(register_perf_event_array_map);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae34..58f0d47 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8574,6 +8574,23 @@ void perf_event_delayed_put(struct task_struct *task)
WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
}

+struct perf_event *perf_event_get(unsigned int fd)
+{
+ int err;
+ struct fd f;
+ struct perf_event *event;
+
+ err = perf_fget_light(fd, &f);
+ if (err)
+ return ERR_PTR(err);
+
+ event = f.file->private_data;
+ atomic_long_inc(&event->refcount);
+ fdput(f);
+
+ return event;
+}
+
/*
* inherit a event from parent task to child task:
*/
--
1.8.3.4

2015-07-31 11:59:59

by xiakaixu

[permalink] [raw]
Subject: [PATCH v5 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter

According to the perf_event_map_fd and index, the function
bpf_perf_event_read() can convert the corresponding map
value to the pointer to struct perf_event and return the
Hardware PMU counter value.

Signed-off-by: Kaixu Xia <[email protected]>
---
include/linux/bpf.h | 1 +
include/linux/perf_event.h | 12 +++++++++-
include/uapi/linux/bpf.h | 1 +
kernel/bpf/verifier.c | 56 +++++++++++++++++++++++++++++++++-------------
kernel/events/core.c | 10 +--------
kernel/trace/bpf_trace.c | 37 ++++++++++++++++++++++++++++++
6 files changed, 92 insertions(+), 25 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d0b394a..db9f781 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -190,6 +190,7 @@ extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
extern const struct bpf_func_proto bpf_map_update_elem_proto;
extern const struct bpf_func_proto bpf_map_delete_elem_proto;

+extern const struct bpf_func_proto bpf_perf_event_read_proto;
extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
extern const struct bpf_func_proto bpf_tail_call_proto;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 27e05c1..c1a3f39 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -662,7 +662,7 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
extern u64 perf_event_read_value(struct perf_event *event,
u64 *enabled, u64 *running);
-
+extern void __perf_event_read(void *info);

struct perf_sample_data {
/*
@@ -863,6 +863,14 @@ static inline u64 __perf_event_count(struct perf_event *event)
return local64_read(&event->count) + atomic64_read(&event->child_count);
}

+static inline u64 perf_event_count(struct perf_event *event)
+{
+ if (event->pmu->count)
+ return event->pmu->count(event);
+
+ return __perf_event_count(event);
+}
+
extern void perf_event_mmap(struct vm_area_struct *vma);
extern struct perf_guest_info_callbacks *perf_guest_cbs;
extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
@@ -984,6 +992,8 @@ static struct perf_event *perf_event_get(unsigned int fd) { return ERR_PTR(-EIN
static inline void perf_event_print_debug(void) { }
static inline int perf_event_task_disable(void) { return -EINVAL; }
static inline int perf_event_task_enable(void) { return -EINVAL; }
+static inline void __perf_event_read(void *info) { }
+static inline u64 perf_event_count(struct perf_event *event) { return 0; }
static inline int perf_event_refresh(struct perf_event *event, int refresh)
{
return -EINVAL;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 69a1f6b..b9b13ce 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -250,6 +250,7 @@ enum bpf_func_id {
* Return: 0 on success
*/
BPF_FUNC_get_current_comm,
+ BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(&map, index) */
__BPF_FUNC_MAX_ID,
};

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 039d866..93b6624 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -833,6 +833,44 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
return err;
}

+static int check_func_limit(struct bpf_map **mapp, int func_id)
+{
+ struct bpf_map *map = *mapp;
+
+ if (map && map->map_type == BPF_MAP_TYPE_PROG_ARRAY &&
+ func_id != BPF_FUNC_tail_call)
+ /* prog_array map type needs extra care:
+ * only allow to pass it into bpf_tail_call() for now.
+ * bpf_map_delete_elem() can be allowed in the future,
+ * while bpf_map_update_elem() must only be done via syscall
+ */
+ return -EINVAL;
+
+ if (func_id == BPF_FUNC_tail_call &&
+ map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
+ /* don't allow any other map type to be passed into
+ * bpf_tail_call()
+ */
+ return -EINVAL;
+
+ if (map && map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
+ func_id != BPF_FUNC_perf_event_read)
+ /* perf_event_array map type needs extra care:
+ * only allow to pass it into bpf_perf_event_read() for now.
+ * bpf_map_update/delete_elem() must only be done via syscall
+ */
+ return -EINVAL;
+
+ if (func_id == BPF_FUNC_perf_event_read &&
+ map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
+ /* don't allow any other map type to be passed into
+ * bpf_perf_event_read()
+ */
+ return -EINVAL;
+
+ return 0;
+}
+
static int check_call(struct verifier_env *env, int func_id)
{
struct verifier_state *state = &env->cur_state;
@@ -908,21 +946,9 @@ static int check_call(struct verifier_env *env, int func_id)
return -EINVAL;
}

- if (map && map->map_type == BPF_MAP_TYPE_PROG_ARRAY &&
- func_id != BPF_FUNC_tail_call)
- /* prog_array map type needs extra care:
- * only allow to pass it into bpf_tail_call() for now.
- * bpf_map_delete_elem() can be allowed in the future,
- * while bpf_map_update_elem() must only be done via syscall
- */
- return -EINVAL;
-
- if (func_id == BPF_FUNC_tail_call &&
- map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
- /* don't allow any other map type to be passed into
- * bpf_tail_call()
- */
- return -EINVAL;
+ err = check_func_limit(&map, func_id);
+ if (err)
+ return err;

return 0;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 58f0d47..1e4d65a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3177,7 +3177,7 @@ void perf_event_exec(void)
/*
* Cross CPU call to read the hardware event
*/
-static void __perf_event_read(void *info)
+void __perf_event_read(void *info)
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;
@@ -3204,14 +3204,6 @@ static void __perf_event_read(void *info)
raw_spin_unlock(&ctx->lock);
}

-static inline u64 perf_event_count(struct perf_event *event)
-{
- if (event->pmu->count)
- return event->pmu->count(event);
-
- return __perf_event_count(event);
-}
-
static u64 perf_event_read(struct perf_event *event)
{
/*
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 88a041a..5b81da1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -158,6 +158,41 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
return &bpf_trace_printk_proto;
}

+static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
+{
+ struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+ struct bpf_array *array = container_of(map, struct bpf_array, map);
+ struct perf_event *event;
+
+ if (unlikely(index >= array->map.max_entries))
+ return -E2BIG;
+
+ event = (struct perf_event *)array->ptrs[index];
+ if (!event)
+ return -ENOENT;
+
+ if (unlikely(event->state != PERF_EVENT_STATE_ACTIVE))
+ return -EINVAL;
+
+ if (event->oncpu != raw_smp_processor_id() &&
+ event->ctx->task != current)
+ return -EINVAL;
+
+ if (unlikely(event->attr.inherit))
+ return -EINVAL;
+
+ __perf_event_read(event);
+ return perf_event_count(event);
+}
+
+const struct bpf_func_proto bpf_perf_event_read_proto = {
+ .func = bpf_perf_event_read,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -183,6 +218,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return bpf_get_trace_printk_proto();
case BPF_FUNC_get_smp_processor_id:
return &bpf_get_smp_processor_id_proto;
+ case BPF_FUNC_perf_event_read:
+ return &bpf_perf_event_read_proto;
default:
return NULL;
}
--
1.8.3.4

2015-07-31 12:00:43

by xiakaixu

[permalink] [raw]
Subject: [PATCH v5 4/4] samples/bpf: example of get selected PMU counter value

This is a simple example and shows how to use the new ability
to get the selected Hardware PMU counter value.

Signed-off-by: Kaixu Xia <[email protected]>
---
samples/bpf/Makefile | 4 +++
samples/bpf/bpf_helpers.h | 2 ++
samples/bpf/tracex6_kern.c | 26 ++++++++++++++++++
samples/bpf/tracex6_user.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 100 insertions(+)
create mode 100644 samples/bpf/tracex6_kern.c
create mode 100644 samples/bpf/tracex6_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4450fed..63e7d50 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -12,6 +12,7 @@ hostprogs-y += tracex2
hostprogs-y += tracex3
hostprogs-y += tracex4
hostprogs-y += tracex5
+hostprogs-y += tracex6
hostprogs-y += lathist

test_verifier-objs := test_verifier.o libbpf.o
@@ -25,6 +26,7 @@ tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
tracex5-objs := bpf_load.o libbpf.o tracex5_user.o
+tracex6-objs := bpf_load.o libbpf.o tracex6_user.o
lathist-objs := bpf_load.o libbpf.o lathist_user.o

# Tell kbuild to always build the programs
@@ -37,6 +39,7 @@ always += tracex2_kern.o
always += tracex3_kern.o
always += tracex4_kern.o
always += tracex5_kern.o
+always += tracex6_kern.o
always += tcbpf1_kern.o
always += lathist_kern.o

@@ -51,6 +54,7 @@ HOSTLOADLIBES_tracex2 += -lelf
HOSTLOADLIBES_tracex3 += -lelf
HOSTLOADLIBES_tracex4 += -lelf -lrt
HOSTLOADLIBES_tracex5 += -lelf
+HOSTLOADLIBES_tracex6 += -lelf
HOSTLOADLIBES_lathist += -lelf

# point this to your LLVM backend with bpf support
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index bdf1c16..c8a3594 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -31,6 +31,8 @@ static unsigned long long (*bpf_get_current_uid_gid)(void) =
(void *) BPF_FUNC_get_current_uid_gid;
static int (*bpf_get_current_comm)(void *buf, int buf_size) =
(void *) BPF_FUNC_get_current_comm;
+static int (*bpf_perf_event_read)(void *map, int index) =
+ (void *) BPF_FUNC_perf_event_read;

/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/tracex6_kern.c b/samples/bpf/tracex6_kern.c
new file mode 100644
index 0000000..23d1cff
--- /dev/null
+++ b/samples/bpf/tracex6_kern.c
@@ -0,0 +1,26 @@
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(u32),
+ .max_entries = 32,
+};
+
+SEC("kprobe/sys_write")
+int bpf_prog1(struct pt_regs *ctx)
+{
+ u64 count;
+ u32 key = bpf_get_smp_processor_id();
+ char fmt[] = "CPU-%d %llu\n";
+
+ count = bpf_perf_event_read(&my_map, key);
+ bpf_trace_printk(fmt, sizeof(fmt), key, count);
+
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
new file mode 100644
index 0000000..928f05e
--- /dev/null
+++ b/samples/bpf/tracex6_user.c
@@ -0,0 +1,68 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <sys/ioctl.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define SAMPLE_PERIOD 0x7fffffffffffffffULL
+
+static void test_bpf_perf_event(void)
+{
+ int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+ int *pmu_fd = malloc(nr_cpus * sizeof(int));
+ unsigned long value;
+ int i;
+
+ struct perf_event_attr attr_insn_pmu = {
+ .freq = 0,
+ .sample_period = SAMPLE_PERIOD,
+ .inherit = 0,
+ .type = PERF_TYPE_HARDWARE,
+ .read_format = 0,
+ .sample_type = 0,
+ .config = 0,/* PMU: cycles */
+ };
+
+ for (i = 0; i < nr_cpus; i++) {
+ pmu_fd[i] = perf_event_open(&attr_insn_pmu, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
+ if (pmu_fd[i] < 0)
+ printf("event syscall failed\n");
+
+ bpf_update_elem(map_fd[0], &i, &pmu_fd[i], BPF_ANY);
+ ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0);
+ }
+
+ system("ls");
+ system("pwd");
+ system("sleep 2");
+
+ for (i = 0; i < nr_cpus; i++)
+ close(pmu_fd[i]);
+
+ close(map_fd);
+
+ free(pmu_fd);
+}
+
+int main(int argc, char **argv)
+{
+ char filename[256];
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ test_bpf_perf_event();
+
+ return 0;
+}
--
1.8.3.4

2015-07-31 16:33:33

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v5 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

On 7/31/15 4:59 AM, Kaixu Xia wrote:
> changes in V5:
> - move struct fd_array_map_ops* fd_ops to bpf_map;
> - move array perf event decrement refcnt function to
> map_free;
> - fix the NULL ptr of perf_event_get();
> - move bpf_perf_event_read() to kernel/bpf/bpf_trace.c;
> - get rid of the remaining struct bpf_prog;
> - move the unnecessay cast on void *;

overall looks good. few nits here and there...

With amount of changes to bpf core in patch 1 it definitely
needs to go via net-next, otherwise merge window conflicts
will be nasty.

Ingo, Peter, could you please review perf bits in patches 2 and 3,
so we can take the whole set via net-next.

Thanks!

2015-07-31 16:36:41

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] bpf: Make the bpf_prog_array_map more generic

On 7/31/15 4:59 AM, Kaixu Xia wrote:
> According to the comments from Daniel, rewrite part of
> the bpf_prog_array map code and make it more generic.
> So the new perf_event_array map type can reuse most of
> code with bpf_prog_array map and add fewer lines of
> special code.
>
> Tested the samples/bpf/tracex5 after this patch:
> $ sudo ./tracex5
> ...
> dd-1051 [000] d... 26.682903: : mmap
> dd-1051 [000] d... 26.698348: : syscall=102 (one of get/set uid/pid/gid)
> dd-1051 [000] d... 26.703892: : read(fd=0, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.705847: : write(fd=1, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.707914: : read(fd=0, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.710988: : write(fd=1, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.711865: : read(fd=0, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.712704: : write(fd=1, buf=000000000078c010, size=512)
> ...

the commit log doesn't really need these test results.
Also first paragraph is meaningless without prior history.
Please reword it, so it clearly says what this patch is doing and why.
Anyone reading it years later should be able to understand.

2015-07-31 16:39:39

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event

On 7/31/15 4:59 AM, Kaixu Xia wrote:
> + /*
> + * prevent some crazy events so we can make our life easier
> + */
> + if (event->attr.type != PERF_TYPE_RAW &&
> + event->attr.type != PERF_TYPE_HARDWARE) {
> + perf_event_release_kernel(event);
> + return ERR_PTR(-EINVAL);
> + }

either remove the comment and make it clean.

2015-07-31 16:43:49

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v5 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter

On 7/31/15 4:59 AM, Kaixu Xia wrote:
> +static int check_func_limit(struct bpf_map **mapp, int func_id)
> +{
> + struct bpf_map *map = *mapp;
> +
> + if (map && map->map_type == BPF_MAP_TYPE_PROG_ARRAY &&
> + func_id != BPF_FUNC_tail_call)
> + /* prog_array map type needs extra care:
> + * only allow to pass it into bpf_tail_call() for now.
> + * bpf_map_delete_elem() can be allowed in the future,
> + * while bpf_map_update_elem() must only be done via syscall
> + */
> + return -EINVAL;
> +
> + if (func_id == BPF_FUNC_tail_call &&
> + map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
> + /* don't allow any other map type to be passed into
> + * bpf_tail_call()
> + */
> + return -EINVAL;
> +
> + if (map && map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
> + func_id != BPF_FUNC_perf_event_read)
> + /* perf_event_array map type needs extra care:
> + * only allow to pass it into bpf_perf_event_read() for now.
> + * bpf_map_update/delete_elem() must only be done via syscall
> + */
> + return -EINVAL;
> +
> + if (func_id == BPF_FUNC_perf_event_read &&
> + map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
> + /* don't allow any other map type to be passed into
> + * bpf_perf_event_read()
> + */
> + return -EINVAL;

copy paste of comments is too much.
The whole thing probably can be better generalized with an array
{{BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
}
and a loop?