2015-07-28 11:18:01

by xiakaixu

[permalink] [raw]
Subject: [PATCH v4 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

Previous patch v3 url:
https://lkml.org/lkml/2015/7/23/203

changes in V4:
- make the bpf_prog_array_map more generic;
- fix the bug of event refcnt leak;
- use more useful errno in bpf_perf_event_read();

changes in V3:
- collapse V2 patches 1-3 into one;
- drop the function map->ops->map_traverse_elem() and release
the struct perf_event in map_free;
- only allow to access bpf_perf_event_read() from programs;
- update the perf_event_array_map elem via xchg();
- pass index directly to bpf_perf_event_read() instead of
MAP_KEY;

changes in V2:
- put atomic_long_inc_not_zero() between fdget() and fdput();
- limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
- Only read the event counter on current CPU or on current
process;
- add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
pointer to the struct perf_event;
- according to the perf_event_map_fd and key, the function
bpf_perf_event_read() can get the Hardware PMU counter value;

Patch 4/4 is a simple example and shows how to use this new eBPF
programs ability. The PMU counter data can be found in
/sys/kernel/debug/tracing/trace(trace_pipe).(the cycles PMU
value when 'kprobe/sys_write' sampling)

$ cat /sys/kernel/debug/tracing/trace_pipe
$ ./tracex6
...
syslog-ng-548 [000] d..1 76.905673: : CPU-0 681765271
syslog-ng-548 [000] d..1 76.905690: : CPU-0 681787855
syslog-ng-548 [000] d..1 76.905707: : CPU-0 681810504
syslog-ng-548 [000] d..1 76.905725: : CPU-0 681834771
syslog-ng-548 [000] d..1 76.905745: : CPU-0 681859519
syslog-ng-548 [000] d..1 76.905766: : CPU-0 681890419
syslog-ng-548 [000] d..1 76.905783: : CPU-0 681914045
syslog-ng-548 [000] d..1 76.905800: : CPU-0 681935950
syslog-ng-548 [000] d..1 76.905816: : CPU-0 681958299
ls-690 [005] d..1 82.241308: : CPU-5 3138451
sh-691 [004] d..1 82.244570: : CPU-4 7324988
<...>-699 [007] d..1 99.961387: : CPU-7 3194027
<...>-695 [003] d..1 99.961474: : CPU-3 288901
<...>-695 [003] d..1 99.961541: : CPU-3 383145
<...>-695 [003] d..1 99.961591: : CPU-3 450365
<...>-695 [003] d..1 99.961639: : CPU-3 515751
<...>-695 [003] d..1 99.961686: : CPU-3 579047
...

The detail of patches is as follow:

Patch 1/4 rewrites part of the bpf_prog_array map code and make it
more generic;

Patch 2/4 introduces a new bpf map type. This map only stores the
pointer to struct perf_event;

Patch 3/4 implements function bpf_perf_event_read() that get the
selected hardware PMU conuter;

Patch 4/4 gives a simple example.

Kaixu Xia (3):
bpf: Add new bpf map type to store the pointer to struct perf_event
bpf: Implement function bpf_perf_event_read() that get the selected
hardware PMU conuter
samples/bpf: example of get selected PMU counter value

Wang Nan (1):
bpf: Make the bpf_prog_array_map more generic

include/linux/bpf.h | 8 ++-
include/linux/perf_event.h | 5 +-
include/uapi/linux/bpf.h | 2 +
kernel/bpf/arraymap.c | 161 ++++++++++++++++++++++++++++++++++++---------
kernel/bpf/helpers.c | 36 ++++++++++
kernel/bpf/syscall.c | 4 +-
kernel/bpf/verifier.c | 15 +++++
kernel/events/core.c | 21 +++++-
kernel/trace/bpf_trace.c | 2 +
samples/bpf/Makefile | 4 ++
samples/bpf/bpf_helpers.h | 2 +
samples/bpf/tracex6_kern.c | 26 ++++++++
samples/bpf/tracex6_user.c | 67 +++++++++++++++++++
13 files changed, 316 insertions(+), 37 deletions(-)
create mode 100644 samples/bpf/tracex6_kern.c
create mode 100644 samples/bpf/tracex6_user.c

--
1.8.3.4


2015-07-28 11:18:03

by xiakaixu

[permalink] [raw]
Subject: [PATCH v4 1/4] bpf: Make the bpf_prog_array_map more generic

From: Wang Nan <[email protected]>

According to the comments from Daniel, rewrite part of
the bpf_prog_array map code and make it more generic.
So the new perf_event_array map type can reuse most of
code with bpf_prog_array map and add fewer lines of
special code.

Tested the samples/bpf/tracex5 after this patch:
$ sudo ./tracex5
...
dd-1051 [000] d... 26.682903: : mmap
dd-1051 [000] d... 26.698348: : syscall=102 (one of get/set uid/pid/gid)
dd-1051 [000] d... 26.703892: : read(fd=0, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.705847: : write(fd=1, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.707914: : read(fd=0, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.710988: : write(fd=1, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.711865: : read(fd=0, buf=000000000078c010, size=512)
dd-1051 [000] d... 26.712704: : write(fd=1, buf=000000000078c010, size=512)
...

Signed-off-by: Wang Nan <[email protected]>
Signed-off-by: Kaixu Xia <[email protected]>
---
include/linux/bpf.h | 6 ++-
kernel/bpf/arraymap.c | 104 +++++++++++++++++++++++++++++++++++---------------
kernel/bpf/syscall.c | 4 +-
3 files changed, 80 insertions(+), 34 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4383476..610b730 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -130,6 +130,8 @@ struct bpf_prog_aux {
};
};

+struct fd_array_map_ops;
+
struct bpf_array {
struct bpf_map map;
u32 elem_size;
@@ -140,15 +142,17 @@ struct bpf_array {
*/
enum bpf_prog_type owner_prog_type;
bool owner_jited;
+ const struct fd_array_map_ops* fd_ops;
union {
char value[0] __aligned(8);
+ void *ptrs[0] __aligned(8);
struct bpf_prog *prog[0] __aligned(8);
};
};
#define MAX_TAIL_CALL_CNT 32

u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
-void bpf_prog_array_map_clear(struct bpf_map *map);
+void bpf_fd_array_map_clear(struct bpf_map *map);
bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
const struct bpf_func_proto *bpf_get_trace_printk_proto(void);

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index cb31229..4784cdc 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -150,15 +150,62 @@ static int __init register_array_map(void)
}
late_initcall(register_array_map);

-static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
+struct fd_array_map_ops {
+ void *(*get_ptr)(struct bpf_array *array, int fd);
+ void (*put_ptr)(struct bpf_array *array, void *ptr);
+};
+
+static void *prog_fd_array_get_ptr(struct bpf_array *array, int fd)
+{
+ struct bpf_prog *prog = bpf_prog_get(fd);
+ if (IS_ERR(prog))
+ return prog;
+
+ if (!bpf_prog_array_compatible(array, prog)) {
+ bpf_prog_put(prog);
+ return ERR_PTR(-EINVAL);
+ }
+ return prog;
+}
+
+static void prog_fd_array_put_ptr(struct bpf_array *array __maybe_unused,
+ void *ptr)
+{
+ struct bpf_prog *prog = (struct bpf_prog *)ptr;
+
+ bpf_prog_put_rcu(prog);
+}
+
+static const struct fd_array_map_ops prog_fd_array_map_ops = {
+ .get_ptr = prog_fd_array_get_ptr,
+ .put_ptr = prog_fd_array_put_ptr,
+};
+
+static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr,
+ const struct fd_array_map_ops *ops)
{
- /* only bpf_prog file descriptors can be stored in prog_array map */
+ struct bpf_map *map;
+ struct bpf_array *array;
+
+ /* only file descriptors can be stored in this type of map */
if (attr->value_size != sizeof(u32))
return ERR_PTR(-EINVAL);
- return array_map_alloc(attr);
+
+ map = array_map_alloc(attr);
+ if (IS_ERR(map))
+ return map;
+
+ array = container_of(map, struct bpf_array, map);
+ array->fd_ops = ops;
+ return map;
+}
+
+static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
+{
+ return fd_array_map_alloc(attr, &prog_fd_array_map_ops);
}

-static void prog_array_map_free(struct bpf_map *map)
+static void fd_array_map_free(struct bpf_map *map)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;
@@ -167,21 +214,21 @@ static void prog_array_map_free(struct bpf_map *map)

/* make sure it's empty */
for (i = 0; i < array->map.max_entries; i++)
- BUG_ON(array->prog[i] != NULL);
+ BUG_ON(array->ptrs[i] != NULL);
kvfree(array);
}

-static void *prog_array_map_lookup_elem(struct bpf_map *map, void *key)
+static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
{
return NULL;
}

/* only called from syscall */
-static int prog_array_map_update_elem(struct bpf_map *map, void *key,
- void *value, u64 map_flags)
+static int fd_array_map_update_elem(struct bpf_map *map, void *key,
+ void *value, u64 map_flags)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
- struct bpf_prog *prog, *old_prog;
+ void *new_ptr, *old_ptr;
u32 index = *(u32 *)key, ufd;

if (map_flags != BPF_ANY)
@@ -191,34 +238,29 @@ static int prog_array_map_update_elem(struct bpf_map *map, void *key,
return -E2BIG;

ufd = *(u32 *)value;
- prog = bpf_prog_get(ufd);
- if (IS_ERR(prog))
- return PTR_ERR(prog);
-
- if (!bpf_prog_array_compatible(array, prog)) {
- bpf_prog_put(prog);
- return -EINVAL;
- }
+ new_ptr = array->fd_ops->get_ptr(array, ufd);
+ if (IS_ERR(new_ptr))
+ return PTR_ERR(new_ptr);

- old_prog = xchg(array->prog + index, prog);
- if (old_prog)
- bpf_prog_put_rcu(old_prog);
+ old_ptr = xchg(array->ptrs + index, new_ptr);
+ if (old_ptr)
+ array->fd_ops->put_ptr(array, old_ptr);

return 0;
}

-static int prog_array_map_delete_elem(struct bpf_map *map, void *key)
+static int fd_array_map_delete_elem(struct bpf_map *map, void *key)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
- struct bpf_prog *old_prog;
+ void *old_ptr;
u32 index = *(u32 *)key;

if (index >= array->map.max_entries)
return -E2BIG;

- old_prog = xchg(array->prog + index, NULL);
- if (old_prog) {
- bpf_prog_put_rcu(old_prog);
+ old_ptr = xchg(array->ptrs + index, NULL);
+ if (old_ptr) {
+ array->fd_ops->put_ptr(array, old_ptr);
return 0;
} else {
return -ENOENT;
@@ -226,22 +268,22 @@ static int prog_array_map_delete_elem(struct bpf_map *map, void *key)
}

/* decrement refcnt of all bpf_progs that are stored in this map */
-void bpf_prog_array_map_clear(struct bpf_map *map)
+void bpf_fd_array_map_clear(struct bpf_map *map)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;

for (i = 0; i < array->map.max_entries; i++)
- prog_array_map_delete_elem(map, &i);
+ fd_array_map_delete_elem(map, &i);
}

static const struct bpf_map_ops prog_array_ops = {
.map_alloc = prog_array_map_alloc,
- .map_free = prog_array_map_free,
+ .map_free = fd_array_map_free,
.map_get_next_key = array_map_get_next_key,
- .map_lookup_elem = prog_array_map_lookup_elem,
- .map_update_elem = prog_array_map_update_elem,
- .map_delete_elem = prog_array_map_delete_elem,
+ .map_lookup_elem = fd_array_map_lookup_elem,
+ .map_update_elem = fd_array_map_update_elem,
+ .map_delete_elem = fd_array_map_delete_elem,
};

static struct bpf_map_type_list prog_array_type __read_mostly = {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a1b14d1..de2dcc2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -68,11 +68,11 @@ static int bpf_map_release(struct inode *inode, struct file *filp)
{
struct bpf_map *map = filp->private_data;

- if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
+ if (map->map_type >= BPF_MAP_TYPE_PROG_ARRAY)
/* prog_array stores refcnt-ed bpf_prog pointers
* release them all when user space closes prog_array_fd
*/
- bpf_prog_array_map_clear(map);
+ bpf_fd_array_map_clear(map);

bpf_map_put(map);
return 0;
--
1.8.3.4

2015-07-28 11:22:29

by xiakaixu

[permalink] [raw]
Subject: [PATCH v4 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event

Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'.
This map only stores the pointer to struct perf_event. The
user space event FDs from perf_event_open() syscall are converted
to the pointer to struct perf_event and stored in map.

Signed-off-by: Kaixu Xia <[email protected]>
---
include/linux/bpf.h | 1 +
include/linux/perf_event.h | 2 ++
include/uapi/linux/bpf.h | 1 +
kernel/bpf/arraymap.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/bpf/verifier.c | 15 ++++++++++++
kernel/events/core.c | 17 ++++++++++++++
6 files changed, 93 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 610b730..3c9c0eb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -10,6 +10,7 @@
#include <uapi/linux/bpf.h>
#include <linux/workqueue.h>
#include <linux/file.h>
+#include <linux/perf_event.h>

struct bpf_map;

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..2ea4067 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -641,6 +641,7 @@ extern int perf_event_init_task(struct task_struct *child);
extern void perf_event_exit_task(struct task_struct *child);
extern void perf_event_free_task(struct task_struct *task);
extern void perf_event_delayed_put(struct task_struct *task);
+extern struct perf_event *perf_event_get(unsigned int fd);
extern void perf_event_print_debug(void);
extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
@@ -979,6 +980,7 @@ static inline int perf_event_init_task(struct task_struct *child) { return 0; }
static inline void perf_event_exit_task(struct task_struct *child) { }
static inline void perf_event_free_task(struct task_struct *task) { }
static inline void perf_event_delayed_put(struct task_struct *task) { }
+static struct perf_event *perf_event_get(unsigned int fd) { return NULL; }
static inline void perf_event_print_debug(void) { }
static inline int perf_event_task_disable(void) { return -EINVAL; }
static inline int perf_event_task_enable(void) { return -EINVAL; }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 29ef6f9..69a1f6b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -114,6 +114,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_HASH,
BPF_MAP_TYPE_ARRAY,
BPF_MAP_TYPE_PROG_ARRAY,
+ BPF_MAP_TYPE_PERF_EVENT_ARRAY,
};

enum bpf_prog_type {
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 4784cdc..a9e0235 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -297,3 +297,60 @@ static int __init register_prog_array_map(void)
return 0;
}
late_initcall(register_prog_array_map);
+
+static void *perf_event_fd_array_get_ptr(struct bpf_array *array, int fd)
+{
+ struct perf_event *event;
+
+ event = perf_event_get(fd);
+ if (IS_ERR(event))
+ return event;
+
+ /*
+ * prevent some crazy events so we can make our life easier
+ */
+ if (event->attr.type != PERF_TYPE_RAW &&
+ event->attr.type != PERF_TYPE_HARDWARE) {
+ perf_event_release_kernel(event);
+ return ERR_PTR(-EINVAL);
+ }
+ return event;
+}
+
+static void perf_event_fd_array_put_ptr(struct bpf_array *array, void *ptr)
+{
+ struct perf_event *event = (struct perf_event *)ptr;
+
+ perf_event_release_kernel(event);
+}
+
+static const struct fd_array_map_ops perf_event_fd_array_map_ops = {
+ .get_ptr = perf_event_fd_array_get_ptr,
+ .put_ptr = perf_event_fd_array_put_ptr,
+};
+
+static struct bpf_map *perf_event_array_map_alloc(union bpf_attr *attr)
+{
+ return fd_array_map_alloc(attr, &perf_event_fd_array_map_ops);
+}
+
+static const struct bpf_map_ops perf_event_array_ops = {
+ .map_alloc = perf_event_array_map_alloc,
+ .map_free = fd_array_map_free,
+ .map_get_next_key = array_map_get_next_key,
+ .map_lookup_elem = fd_array_map_lookup_elem,
+ .map_update_elem = fd_array_map_update_elem,
+ .map_delete_elem = fd_array_map_delete_elem,
+};
+
+static struct bpf_map_type_list perf_event_array_type __read_mostly = {
+ .ops = &perf_event_array_ops,
+ .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+};
+
+static int __init register_perf_event_array_map(void)
+{
+ bpf_register_map_type(&perf_event_array_type);
+ return 0;
+}
+late_initcall(register_perf_event_array_map);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 039d866..c70f7e7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -924,6 +924,21 @@ static int check_call(struct verifier_env *env, int func_id)
*/
return -EINVAL;

+ if (map && map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
+ func_id != BPF_FUNC_perf_event_read)
+ /* perf_event_array map type needs extra care:
+ * only allow to pass it into bpf_perf_event_read() for now.
+ * bpf_map_update/delete_elem() must only be done via syscall
+ */
+ return -EINVAL;
+
+ if (func_id == BPF_FUNC_perf_event_read &&
+ map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
+ /* don't allow any other map type to be passed into
+ * bpf_perf_event_read()
+ */
+ return -EINVAL;
+
return 0;
}

diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae34..58f0d47 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8574,6 +8574,23 @@ void perf_event_delayed_put(struct task_struct *task)
WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
}

+struct perf_event *perf_event_get(unsigned int fd)
+{
+ int err;
+ struct fd f;
+ struct perf_event *event;
+
+ err = perf_fget_light(fd, &f);
+ if (err)
+ return ERR_PTR(err);
+
+ event = f.file->private_data;
+ atomic_long_inc(&event->refcount);
+ fdput(f);
+
+ return event;
+}
+
/*
* inherit a event from parent task to child task:
*/
--
1.8.3.4

2015-07-28 11:22:30

by xiakaixu

[permalink] [raw]
Subject: [PATCH v4 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter

According to the perf_event_map_fd and index, the function
bpf_perf_event_read() can convert the corresponding map
value to the pointer to struct perf_event and return the
Hardware PMU counter value.

Signed-off-by: Kaixu Xia <[email protected]>
---
include/linux/bpf.h | 1 +
include/linux/perf_event.h | 3 ++-
include/uapi/linux/bpf.h | 1 +
kernel/bpf/helpers.c | 36 ++++++++++++++++++++++++++++++++++++
kernel/events/core.c | 4 ++--
kernel/trace/bpf_trace.c | 2 ++
6 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3c9c0eb..516992c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -190,6 +190,7 @@ extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
extern const struct bpf_func_proto bpf_map_update_elem_proto;
extern const struct bpf_func_proto bpf_map_delete_elem_proto;

+extern const struct bpf_func_proto bpf_perf_event_read_proto;
extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
extern const struct bpf_func_proto bpf_tail_call_proto;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2ea4067..899abcb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -662,7 +662,8 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
extern u64 perf_event_read_value(struct perf_event *event,
u64 *enabled, u64 *running);
-
+extern void __perf_event_read(void *info);
+extern u64 perf_event_count(struct perf_event *event);

struct perf_sample_data {
/*
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 69a1f6b..b9b13ce 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -250,6 +250,7 @@ enum bpf_func_id {
* Return: 0 on success
*/
BPF_FUNC_get_current_comm,
+ BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(&map, index) */
__BPF_FUNC_MAX_ID,
};

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1447ec0..c40c5ea 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -182,3 +182,39 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
.arg1_type = ARG_PTR_TO_STACK,
.arg2_type = ARG_CONST_STACK_SIZE,
};
+
+static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
+{
+ struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+ struct bpf_array *array = container_of(map, struct bpf_array, map);
+ struct perf_event *event;
+
+ if (index >= array->map.max_entries)
+ return -E2BIG;
+
+ event = (struct perf_event *)array->ptrs[index];
+ if (!event)
+ return -ENOENT;
+
+ if (event->state != PERF_EVENT_STATE_ACTIVE)
+ return -EINVAL;
+
+ if (event->oncpu != raw_smp_processor_id() &&
+ event->ctx->task != current)
+ return -EINVAL;
+
+ if (event->attr.inherit)
+ return -EINVAL;
+
+ __perf_event_read(event);
+
+ return perf_event_count(event);
+}
+
+const struct bpf_func_proto bpf_perf_event_read_proto = {
+ .func = bpf_perf_event_read,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_ANYTHING,
+};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 58f0d47..c926c6d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3177,7 +3177,7 @@ void perf_event_exec(void)
/*
* Cross CPU call to read the hardware event
*/
-static void __perf_event_read(void *info)
+void __perf_event_read(void *info)
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;
@@ -3204,7 +3204,7 @@ static void __perf_event_read(void *info)
raw_spin_unlock(&ctx->lock);
}

-static inline u64 perf_event_count(struct perf_event *event)
+u64 perf_event_count(struct perf_event *event)
{
if (event->pmu->count)
return event->pmu->count(event);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 88a041a..9cf094f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -183,6 +183,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return bpf_get_trace_printk_proto();
case BPF_FUNC_get_smp_processor_id:
return &bpf_get_smp_processor_id_proto;
+ case BPF_FUNC_perf_event_read:
+ return &bpf_perf_event_read_proto;
default:
return NULL;
}
--
1.8.3.4

2015-07-28 11:22:57

by xiakaixu

[permalink] [raw]
Subject: [PATCH v4 4/4] samples/bpf: example of get selected PMU counter value

This is a simple example and shows how to use the new ability
to get the selected Hardware PMU counter value.

Signed-off-by: Kaixu Xia <[email protected]>
---
samples/bpf/Makefile | 4 +++
samples/bpf/bpf_helpers.h | 2 ++
samples/bpf/tracex6_kern.c | 26 ++++++++++++++++++
samples/bpf/tracex6_user.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 99 insertions(+)
create mode 100644 samples/bpf/tracex6_kern.c
create mode 100644 samples/bpf/tracex6_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4450fed..63e7d50 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -12,6 +12,7 @@ hostprogs-y += tracex2
hostprogs-y += tracex3
hostprogs-y += tracex4
hostprogs-y += tracex5
+hostprogs-y += tracex6
hostprogs-y += lathist

test_verifier-objs := test_verifier.o libbpf.o
@@ -25,6 +26,7 @@ tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
tracex5-objs := bpf_load.o libbpf.o tracex5_user.o
+tracex6-objs := bpf_load.o libbpf.o tracex6_user.o
lathist-objs := bpf_load.o libbpf.o lathist_user.o

# Tell kbuild to always build the programs
@@ -37,6 +39,7 @@ always += tracex2_kern.o
always += tracex3_kern.o
always += tracex4_kern.o
always += tracex5_kern.o
+always += tracex6_kern.o
always += tcbpf1_kern.o
always += lathist_kern.o

@@ -51,6 +54,7 @@ HOSTLOADLIBES_tracex2 += -lelf
HOSTLOADLIBES_tracex3 += -lelf
HOSTLOADLIBES_tracex4 += -lelf -lrt
HOSTLOADLIBES_tracex5 += -lelf
+HOSTLOADLIBES_tracex6 += -lelf
HOSTLOADLIBES_lathist += -lelf

# point this to your LLVM backend with bpf support
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index bdf1c16..c8a3594 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -31,6 +31,8 @@ static unsigned long long (*bpf_get_current_uid_gid)(void) =
(void *) BPF_FUNC_get_current_uid_gid;
static int (*bpf_get_current_comm)(void *buf, int buf_size) =
(void *) BPF_FUNC_get_current_comm;
+static int (*bpf_perf_event_read)(void *map, int index) =
+ (void *) BPF_FUNC_perf_event_read;

/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/tracex6_kern.c b/samples/bpf/tracex6_kern.c
new file mode 100644
index 0000000..23d1cff
--- /dev/null
+++ b/samples/bpf/tracex6_kern.c
@@ -0,0 +1,26 @@
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(u32),
+ .max_entries = 32,
+};
+
+SEC("kprobe/sys_write")
+int bpf_prog1(struct pt_regs *ctx)
+{
+ u64 count;
+ u32 key = bpf_get_smp_processor_id();
+ char fmt[] = "CPU-%d %llu\n";
+
+ count = bpf_perf_event_read(&my_map, key);
+ bpf_trace_printk(fmt, sizeof(fmt), key, count);
+
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
new file mode 100644
index 0000000..e607eac
--- /dev/null
+++ b/samples/bpf/tracex6_user.c
@@ -0,0 +1,67 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <sys/ioctl.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+static void test_bpf_perf_event(void)
+{
+ int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+ int *pmu_fd = malloc(nr_cpus * sizeof(int));
+ unsigned long value;
+ int i;
+
+ struct perf_event_attr attr_insn_pmu = {
+ .freq = 0,
+ .sample_period = 0x7fffffffffffffffULL,
+ .inherit = 0,
+ .type = PERF_TYPE_HARDWARE,
+ .read_format = 0,
+ .sample_type = 0,
+ .config = 0,/* PMU: cycles */
+ };
+
+ for (i = 0; i < nr_cpus; i++) {
+ pmu_fd[i] = perf_event_open(&attr_insn_pmu, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
+ if (pmu_fd[i] < 0)
+ printf("event syscall failed\n");
+
+ bpf_update_elem(map_fd[0], &i, (pmu_fd + i), BPF_ANY);
+
+ ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0);
+ }
+
+ system("ls");
+ system("pwd");
+ system("sleep 2");
+
+ for (i = 0; i < nr_cpus; i++)
+ close(pmu_fd[i]);
+
+ close(map_fd);
+
+ free(pmu_fd);
+}
+
+int main(int argc, char **argv)
+{
+ char filename[256];
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ test_bpf_perf_event();
+
+ return 0;
+}
--
1.8.3.4

2015-07-29 23:17:50

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] bpf: Make the bpf_prog_array_map more generic

On 07/28/2015 01:17 PM, Kaixu Xia wrote:
> From: Wang Nan <[email protected]>
>
> According to the comments from Daniel, rewrite part of
> the bpf_prog_array map code and make it more generic.
> So the new perf_event_array map type can reuse most of
> code with bpf_prog_array map and add fewer lines of
> special code.
>
> Tested the samples/bpf/tracex5 after this patch:
> $ sudo ./tracex5
> ...
> dd-1051 [000] d... 26.682903: : mmap
> dd-1051 [000] d... 26.698348: : syscall=102 (one of get/set uid/pid/gid)
> dd-1051 [000] d... 26.703892: : read(fd=0, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.705847: : write(fd=1, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.707914: : read(fd=0, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.710988: : write(fd=1, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.711865: : read(fd=0, buf=000000000078c010, size=512)
> dd-1051 [000] d... 26.712704: : write(fd=1, buf=000000000078c010, size=512)
> ...
>
> Signed-off-by: Wang Nan <[email protected]>
> Signed-off-by: Kaixu Xia <[email protected]>
> ---
> include/linux/bpf.h | 6 ++-
> kernel/bpf/arraymap.c | 104 +++++++++++++++++++++++++++++++++++---------------
> kernel/bpf/syscall.c | 4 +-
> 3 files changed, 80 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4383476..610b730 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -130,6 +130,8 @@ struct bpf_prog_aux {
> };
> };
>
> +struct fd_array_map_ops;
> +
> struct bpf_array {
> struct bpf_map map;
> u32 elem_size;
> @@ -140,15 +142,17 @@ struct bpf_array {
> */
> enum bpf_prog_type owner_prog_type;
> bool owner_jited;
> + const struct fd_array_map_ops* fd_ops;
> union {
> char value[0] __aligned(8);
> + void *ptrs[0] __aligned(8);
> struct bpf_prog *prog[0] __aligned(8);

After your conversion, prog member from the union is not used here anymore
(only as offsetof(...) in JITs). We should probably get rid of it then.

> };
> };
> #define MAX_TAIL_CALL_CNT 32
>
> u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
> -void bpf_prog_array_map_clear(struct bpf_map *map);
> +void bpf_fd_array_map_clear(struct bpf_map *map);
> bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
> const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
>
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index cb31229..4784cdc 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -150,15 +150,62 @@ static int __init register_array_map(void)
> }
> late_initcall(register_array_map);
>
> -static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
> +struct fd_array_map_ops {
> + void *(*get_ptr)(struct bpf_array *array, int fd);
> + void (*put_ptr)(struct bpf_array *array, void *ptr);
> +};
> +
> +static void *prog_fd_array_get_ptr(struct bpf_array *array, int fd)
> +{
> + struct bpf_prog *prog = bpf_prog_get(fd);
> + if (IS_ERR(prog))
> + return prog;
> +
> + if (!bpf_prog_array_compatible(array, prog)) {
> + bpf_prog_put(prog);
> + return ERR_PTR(-EINVAL);
> + }
> + return prog;
> +}
> +
> +static void prog_fd_array_put_ptr(struct bpf_array *array __maybe_unused,
> + void *ptr)

array member seems not to be used in both implementations. It should then
probably not be part of the API?

> +{
> + struct bpf_prog *prog = (struct bpf_prog *)ptr;

No cast on void * needed.

> +
> + bpf_prog_put_rcu(prog);
> +}
> +
> +static const struct fd_array_map_ops prog_fd_array_map_ops = {
> + .get_ptr = prog_fd_array_get_ptr,
> + .put_ptr = prog_fd_array_put_ptr,
> +};
> +
> +static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr,
> + const struct fd_array_map_ops *ops)
> {
> - /* only bpf_prog file descriptors can be stored in prog_array map */
> + struct bpf_map *map;
> + struct bpf_array *array;
> +
> + /* only file descriptors can be stored in this type of map */
> if (attr->value_size != sizeof(u32))
> return ERR_PTR(-EINVAL);
> - return array_map_alloc(attr);
> +
> + map = array_map_alloc(attr);
> + if (IS_ERR(map))
> + return map;
> +
> + array = container_of(map, struct bpf_array, map);
> + array->fd_ops = ops;
> + return map;
> +}
> +
> +static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
> +{
> + return fd_array_map_alloc(attr, &prog_fd_array_map_ops);
> }
>
> -static void prog_array_map_free(struct bpf_map *map)
> +static void fd_array_map_free(struct bpf_map *map)
> {
> struct bpf_array *array = container_of(map, struct bpf_array, map);
> int i;
> @@ -167,21 +214,21 @@ static void prog_array_map_free(struct bpf_map *map)
>
> /* make sure it's empty */
> for (i = 0; i < array->map.max_entries; i++)
> - BUG_ON(array->prog[i] != NULL);
> + BUG_ON(array->ptrs[i] != NULL);
> kvfree(array);
> }
>
> -static void *prog_array_map_lookup_elem(struct bpf_map *map, void *key)
> +static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
> {
> return NULL;
> }
>
> /* only called from syscall */
> -static int prog_array_map_update_elem(struct bpf_map *map, void *key,
> - void *value, u64 map_flags)
> +static int fd_array_map_update_elem(struct bpf_map *map, void *key,
> + void *value, u64 map_flags)
> {
> struct bpf_array *array = container_of(map, struct bpf_array, map);
> - struct bpf_prog *prog, *old_prog;
> + void *new_ptr, *old_ptr;
> u32 index = *(u32 *)key, ufd;
>
> if (map_flags != BPF_ANY)
> @@ -191,34 +238,29 @@ static int prog_array_map_update_elem(struct bpf_map *map, void *key,
> return -E2BIG;
>
> ufd = *(u32 *)value;
> - prog = bpf_prog_get(ufd);
> - if (IS_ERR(prog))
> - return PTR_ERR(prog);
> -
> - if (!bpf_prog_array_compatible(array, prog)) {
> - bpf_prog_put(prog);
> - return -EINVAL;
> - }
> + new_ptr = array->fd_ops->get_ptr(array, ufd);
> + if (IS_ERR(new_ptr))
> + return PTR_ERR(new_ptr);
>
> - old_prog = xchg(array->prog + index, prog);
> - if (old_prog)
> - bpf_prog_put_rcu(old_prog);
> + old_ptr = xchg(array->ptrs + index, new_ptr);
> + if (old_ptr)
> + array->fd_ops->put_ptr(array, old_ptr);
>
> return 0;
> }
>
> -static int prog_array_map_delete_elem(struct bpf_map *map, void *key)
> +static int fd_array_map_delete_elem(struct bpf_map *map, void *key)
> {
> struct bpf_array *array = container_of(map, struct bpf_array, map);
> - struct bpf_prog *old_prog;
> + void *old_ptr;
> u32 index = *(u32 *)key;
>
> if (index >= array->map.max_entries)
> return -E2BIG;
>
> - old_prog = xchg(array->prog + index, NULL);
> - if (old_prog) {
> - bpf_prog_put_rcu(old_prog);
> + old_ptr = xchg(array->ptrs + index, NULL);
> + if (old_ptr) {
> + array->fd_ops->put_ptr(array, old_ptr);
> return 0;
> } else {
> return -ENOENT;
> @@ -226,22 +268,22 @@ static int prog_array_map_delete_elem(struct bpf_map *map, void *key)
> }
>
> /* decrement refcnt of all bpf_progs that are stored in this map */
> -void bpf_prog_array_map_clear(struct bpf_map *map)
> +void bpf_fd_array_map_clear(struct bpf_map *map)
> {
> struct bpf_array *array = container_of(map, struct bpf_array, map);
> int i;
>
> for (i = 0; i < array->map.max_entries; i++)
> - prog_array_map_delete_elem(map, &i);
> + fd_array_map_delete_elem(map, &i);
> }
>
> static const struct bpf_map_ops prog_array_ops = {
> .map_alloc = prog_array_map_alloc,
> - .map_free = prog_array_map_free,
> + .map_free = fd_array_map_free,
> .map_get_next_key = array_map_get_next_key,
> - .map_lookup_elem = prog_array_map_lookup_elem,
> - .map_update_elem = prog_array_map_update_elem,
> - .map_delete_elem = prog_array_map_delete_elem,
> + .map_lookup_elem = fd_array_map_lookup_elem,
> + .map_update_elem = fd_array_map_update_elem,
> + .map_delete_elem = fd_array_map_delete_elem,

I'm wondering if we should move fd_ops actually into bpf_map? Seems like
then both only differ in get_ptr/put_ptr and could also reuse the same
bpf_map_ops structure?

> };
>
> static struct bpf_map_type_list prog_array_type __read_mostly = {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index a1b14d1..de2dcc2 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -68,11 +68,11 @@ static int bpf_map_release(struct inode *inode, struct file *filp)
> {
> struct bpf_map *map = filp->private_data;
>
> - if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
> + if (map->map_type >= BPF_MAP_TYPE_PROG_ARRAY)
> /* prog_array stores refcnt-ed bpf_prog pointers
> * release them all when user space closes prog_array_fd
> */
> - bpf_prog_array_map_clear(map);
> + bpf_fd_array_map_clear(map);

When we are going to add a new map type to the eBPF framework that is not
an fd_array_map thing, this assumption of map->map_type >= BPF_MAP_TYPE_PROG_ARRAY
might not hold then ...

> bpf_map_put(map);
> return 0;
>

2015-07-29 23:30:34

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event

On 07/28/2015 01:17 PM, Kaixu Xia wrote:
> Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'.
> This map only stores the pointer to struct perf_event. The
> user space event FDs from perf_event_open() syscall are converted
> to the pointer to struct perf_event and stored in map.
>
> Signed-off-by: Kaixu Xia <[email protected]>
> ---
> include/linux/bpf.h | 1 +
> include/linux/perf_event.h | 2 ++
> include/uapi/linux/bpf.h | 1 +
> kernel/bpf/arraymap.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
> kernel/bpf/verifier.c | 15 ++++++++++++
> kernel/events/core.c | 17 ++++++++++++++
> 6 files changed, 93 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 610b730..3c9c0eb 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -10,6 +10,7 @@
> #include <uapi/linux/bpf.h>
> #include <linux/workqueue.h>
> #include <linux/file.h>
> +#include <linux/perf_event.h>
>
> struct bpf_map;
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2027809..2ea4067 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -641,6 +641,7 @@ extern int perf_event_init_task(struct task_struct *child);
> extern void perf_event_exit_task(struct task_struct *child);
> extern void perf_event_free_task(struct task_struct *task);
> extern void perf_event_delayed_put(struct task_struct *task);
> +extern struct perf_event *perf_event_get(unsigned int fd);
> extern void perf_event_print_debug(void);
> extern void perf_pmu_disable(struct pmu *pmu);
> extern void perf_pmu_enable(struct pmu *pmu);
> @@ -979,6 +980,7 @@ static inline int perf_event_init_task(struct task_struct *child) { return 0; }
> static inline void perf_event_exit_task(struct task_struct *child) { }
> static inline void perf_event_free_task(struct task_struct *task) { }
> static inline void perf_event_delayed_put(struct task_struct *task) { }
> +static struct perf_event *perf_event_get(unsigned int fd) { return NULL; }
> static inline void perf_event_print_debug(void) { }
> static inline int perf_event_task_disable(void) { return -EINVAL; }
> static inline int perf_event_task_enable(void) { return -EINVAL; }
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 29ef6f9..69a1f6b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -114,6 +114,7 @@ enum bpf_map_type {
> BPF_MAP_TYPE_HASH,
> BPF_MAP_TYPE_ARRAY,
> BPF_MAP_TYPE_PROG_ARRAY,
> + BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> };
>
> enum bpf_prog_type {
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 4784cdc..a9e0235 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -297,3 +297,60 @@ static int __init register_prog_array_map(void)
> return 0;
> }
> late_initcall(register_prog_array_map);
> +
> +static void *perf_event_fd_array_get_ptr(struct bpf_array *array, int fd)
> +{
> + struct perf_event *event;
> +
> + event = perf_event_get(fd);
> + if (IS_ERR(event))
> + return event;

So in case of !CONFIG_PERF_EVENTS, you define perf_event_get() to return NULL.

Seems like this will give a nice NULL ptr deref below. ;)

> + /*
> + * prevent some crazy events so we can make our life easier
> + */
> + if (event->attr.type != PERF_TYPE_RAW &&
> + event->attr.type != PERF_TYPE_HARDWARE) {
> + perf_event_release_kernel(event);
> + return ERR_PTR(-EINVAL);
> + }
> + return event;
> +}
> +
> +static void perf_event_fd_array_put_ptr(struct bpf_array *array, void *ptr)
> +{
> + struct perf_event *event = (struct perf_event *)ptr;

( Same comment as in previous patch. )

> + perf_event_release_kernel(event);
> +}
> +
> +static const struct fd_array_map_ops perf_event_fd_array_map_ops = {
> + .get_ptr = perf_event_fd_array_get_ptr,
> + .put_ptr = perf_event_fd_array_put_ptr,
> +};
...
> +late_initcall(register_perf_event_array_map);
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 039d866..c70f7e7 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -924,6 +924,21 @@ static int check_call(struct verifier_env *env, int func_id)
> */
> return -EINVAL;
>
> + if (map && map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
> + func_id != BPF_FUNC_perf_event_read)
> + /* perf_event_array map type needs extra care:
> + * only allow to pass it into bpf_perf_event_read() for now.
> + * bpf_map_update/delete_elem() must only be done via syscall
> + */
> + return -EINVAL;
> +
> + if (func_id == BPF_FUNC_perf_event_read &&
> + map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
> + /* don't allow any other map type to be passed into
> + * bpf_perf_event_read()
> + */
> + return -EINVAL;

Maybe a function as we do similar checks with BPF_MAP_TYPE_PROG_ARRAY?

> return 0;
> }
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index d3dae34..58f0d47 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8574,6 +8574,23 @@ void perf_event_delayed_put(struct task_struct *task)
> WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
> }
>
> +struct perf_event *perf_event_get(unsigned int fd)
> +{
> + int err;
> + struct fd f;
> + struct perf_event *event;
> +
> + err = perf_fget_light(fd, &f);
> + if (err)
> + return ERR_PTR(err);
> +
> + event = f.file->private_data;
> + atomic_long_inc(&event->refcount);
> + fdput(f);
> +
> + return event;
> +}
> +
> /*
> * inherit a event from parent task to child task:
> */
>

2015-07-29 23:51:57

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter

On 07/28/2015 01:17 PM, Kaixu Xia wrote:
> According to the perf_event_map_fd and index, the function
> bpf_perf_event_read() can convert the corresponding map
> value to the pointer to struct perf_event and return the
> Hardware PMU counter value.
>
> Signed-off-by: Kaixu Xia <[email protected]>
> ---
> include/linux/bpf.h | 1 +
> include/linux/perf_event.h | 3 ++-
> include/uapi/linux/bpf.h | 1 +
> kernel/bpf/helpers.c | 36 ++++++++++++++++++++++++++++++++++++
> kernel/events/core.c | 4 ++--
> kernel/trace/bpf_trace.c | 2 ++
> 6 files changed, 44 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3c9c0eb..516992c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -190,6 +190,7 @@ extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
> extern const struct bpf_func_proto bpf_map_update_elem_proto;
> extern const struct bpf_func_proto bpf_map_delete_elem_proto;
>
> +extern const struct bpf_func_proto bpf_perf_event_read_proto;
> extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
> extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
> extern const struct bpf_func_proto bpf_tail_call_proto;
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2ea4067..899abcb 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -662,7 +662,8 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
> int src_cpu, int dst_cpu);
> extern u64 perf_event_read_value(struct perf_event *event,
> u64 *enabled, u64 *running);
> -
> +extern void __perf_event_read(void *info);
> +extern u64 perf_event_count(struct perf_event *event);
>
> struct perf_sample_data {
> /*
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 69a1f6b..b9b13ce 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -250,6 +250,7 @@ enum bpf_func_id {
> * Return: 0 on success
> */
> BPF_FUNC_get_current_comm,
> + BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(&map, index) */
> __BPF_FUNC_MAX_ID,
> };
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 1447ec0..c40c5ea 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -182,3 +182,39 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
> .arg1_type = ARG_PTR_TO_STACK,
> .arg2_type = ARG_CONST_STACK_SIZE,
> };
> +
> +static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
> +{
> + struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
> + struct bpf_array *array = container_of(map, struct bpf_array, map);
> + struct perf_event *event;
> +
> + if (index >= array->map.max_entries)
> + return -E2BIG;

Maybe unlikely(...), likewise below.

> + event = (struct perf_event *)array->ptrs[index];
> + if (!event)
> + return -ENOENT;
> +
> + if (event->state != PERF_EVENT_STATE_ACTIVE)
> + return -EINVAL;
> +
> + if (event->oncpu != raw_smp_processor_id() &&
> + event->ctx->task != current)
> + return -EINVAL;
> +
> + if (event->attr.inherit)
> + return -EINVAL;
> +
> + __perf_event_read(event);
> +
> + return perf_event_count(event);

I believe this helper should rather go somewhere such as bpf_trace.c
(or under kernel/events/ ?), wouldn't we otherwise get a build error
when perf events are compiled out?

Anyway, I let perf folks comment on that (and the helper in general).

> +}
> +
> +const struct bpf_func_proto bpf_perf_event_read_proto = {
> + .func = bpf_perf_event_read,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_CONST_MAP_PTR,
> + .arg2_type = ARG_ANYTHING,
> +};
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 58f0d47..c926c6d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3177,7 +3177,7 @@ void perf_event_exec(void)
> /*
> * Cross CPU call to read the hardware event
> */
> -static void __perf_event_read(void *info)
> +void __perf_event_read(void *info)

Does this need to be declared in a header file, no?

> {
> struct perf_event *event = info;
> struct perf_event_context *ctx = event->ctx;
> @@ -3204,7 +3204,7 @@ static void __perf_event_read(void *info)
> raw_spin_unlock(&ctx->lock);
> }
>
> -static inline u64 perf_event_count(struct perf_event *event)
> +u64 perf_event_count(struct perf_event *event)

Likewise? Should the inlining be preserved?

> {
> if (event->pmu->count)
> return event->pmu->count(event);
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 88a041a..9cf094f 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -183,6 +183,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func

Btw, I'm wondering if the perf event map portions should actually better go
here into bpf_trace.c ...

> return bpf_get_trace_printk_proto();
> case BPF_FUNC_get_smp_processor_id:
> return &bpf_get_smp_processor_id_proto;
> + case BPF_FUNC_perf_event_read:
> + return &bpf_perf_event_read_proto;
> default:
> return NULL;
> }
>

2015-07-29 23:56:51

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v4 4/4] samples/bpf: example of get selected PMU counter value

On 07/28/2015 01:17 PM, Kaixu Xia wrote:
> This is a simple example and shows how to use the new ability
> to get the selected Hardware PMU counter value.
>
> Signed-off-by: Kaixu Xia <[email protected]>
...
> diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
> new file mode 100644
> index 0000000..e607eac
> --- /dev/null
> +++ b/samples/bpf/tracex6_user.c
> @@ -0,0 +1,67 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <stdlib.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <fcntl.h>
> +#include <poll.h>
> +#include <sys/ioctl.h>
> +#include <linux/perf_event.h>
> +#include <linux/bpf.h>
> +#include "libbpf.h"
> +#include "bpf_load.h"
> +
> +static void test_bpf_perf_event(void)
> +{
> + int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
> + int *pmu_fd = malloc(nr_cpus * sizeof(int));
> + unsigned long value;
> + int i;
> +
> + struct perf_event_attr attr_insn_pmu = {
> + .freq = 0,
> + .sample_period = 0x7fffffffffffffffULL,

Nit: #define ...

> + .inherit = 0,
> + .type = PERF_TYPE_HARDWARE,
> + .read_format = 0,
> + .sample_type = 0,
> + .config = 0,/* PMU: cycles */
> + };
> +
> + for (i = 0; i < nr_cpus; i++) {
> + pmu_fd[i] = perf_event_open(&attr_insn_pmu, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
> + if (pmu_fd[i] < 0)
> + printf("event syscall failed\n");
> +
> + bpf_update_elem(map_fd[0], &i, (pmu_fd + i), BPF_ANY);

Nit: you use pmu_fd[i], but here (pmu_fd + i)? Maybe better &pmu_fd[i]?

> + ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0);
> + }

2015-07-30 00:08:53

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

On 07/28/2015 01:17 PM, Kaixu Xia wrote:
> Previous patch v3 url:
> https://lkml.org/lkml/2015/7/23/203
...
> Kaixu Xia (3):
> bpf: Add new bpf map type to store the pointer to struct perf_event
> bpf: Implement function bpf_perf_event_read() that get the selected
> hardware PMU conuter
> samples/bpf: example of get selected PMU counter value
>
> Wang Nan (1):
> bpf: Make the bpf_prog_array_map more generic

So kernel/bpf/ patches are usually going via netdev tree as main development
happens there and to avoid cross tree conflicts, etc. Thus, please Cc also
netdev in the next iterations.

Maybe when these patches are in a perfect shape at some point, perf folks
provide their Acked-by(s) to the series to give their consent, and then this
should go via net-next? Or will this be organized differently?

> include/linux/bpf.h | 8 ++-
> include/linux/perf_event.h | 5 +-
> include/uapi/linux/bpf.h | 2 +
> kernel/bpf/arraymap.c | 161 ++++++++++++++++++++++++++++++++++++---------
> kernel/bpf/helpers.c | 36 ++++++++++
> kernel/bpf/syscall.c | 4 +-
> kernel/bpf/verifier.c | 15 +++++
> kernel/events/core.c | 21 +++++-
> kernel/trace/bpf_trace.c | 2 +
> samples/bpf/Makefile | 4 ++
> samples/bpf/bpf_helpers.h | 2 +
> samples/bpf/tracex6_kern.c | 26 ++++++++
> samples/bpf/tracex6_user.c | 67 +++++++++++++++++++
> 13 files changed, 316 insertions(+), 37 deletions(-)
> create mode 100644 samples/bpf/tracex6_kern.c
> create mode 100644 samples/bpf/tracex6_user.c
>

2015-07-30 01:44:15

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] bpf: Make the bpf_prog_array_map more generic

On 7/29/15 4:17 PM, Daniel Borkmann wrote:
>> - if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
>> + if (map->map_type >= BPF_MAP_TYPE_PROG_ARRAY)
>> /* prog_array stores refcnt-ed bpf_prog pointers
>> * release them all when user space closes prog_array_fd
>> */
>> - bpf_prog_array_map_clear(map);
>> + bpf_fd_array_map_clear(map);
>
> When we are going to add a new map type to the eBPF framework that is not
> an fd_array_map thing, this assumption of map->map_type >=
> BPF_MAP_TYPE_PROG_ARRAY
> might not hold then ...

Also I think here changing == to >= is probably unnecessary.
prog_array needs to do it because of circular dependency
whereas perf_event_array cannot have it.
Even when we attach bpf prog to perf_event and then add it to
perf_event_array used by the same prog, right?
Please test such scenario just in case.

2015-07-30 01:45:39

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event

On 7/29/15 4:30 PM, Daniel Borkmann wrote:
> + if (map && map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
> + func_id != BPF_FUNC_perf_event_read)

this part belongs in patch 3, since patch 2 won't compile as-is.
Please keep bi-sectability intact.

2015-07-30 01:50:15

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

On 7/29/15 5:08 PM, Daniel Borkmann wrote:
> On 07/28/2015 01:17 PM, Kaixu Xia wrote:
>> Previous patch v3 url:
>> https://lkml.org/lkml/2015/7/23/203
> ...
>> Kaixu Xia (3):
>> bpf: Add new bpf map type to store the pointer to struct perf_event
>> bpf: Implement function bpf_perf_event_read() that get the selected
>> hardware PMU conuter
>> samples/bpf: example of get selected PMU counter value
>>
>> Wang Nan (1):
>> bpf: Make the bpf_prog_array_map more generic
>
> So kernel/bpf/ patches are usually going via netdev tree as main
> development
> happens there and to avoid cross tree conflicts, etc. Thus, please Cc also
> netdev in the next iterations.
>
> Maybe when these patches are in a perfect shape at some point, perf folks
> provide their Acked-by(s) to the series to give their consent, and then
> this
> should go via net-next? Or will this be organized differently?

In this case it looks that amount of kernel/bpf/ changes is higher than
perf related. Also it looks like addition of bpf_perf_event_read()
helper also won't affect anything on perf side, so in this case I think
net-next tree is indeed a better fit.

2015-07-31 08:51:53

by xiakaixu

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] bpf: Make the bpf_prog_array_map more generic

于 2015/7/30 9:44, Alexei Starovoitov 写道:
> On 7/29/15 4:17 PM, Daniel Borkmann wrote:
>>> - if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
>>> + if (map->map_type >= BPF_MAP_TYPE_PROG_ARRAY)
>>> /* prog_array stores refcnt-ed bpf_prog pointers
>>> * release them all when user space closes prog_array_fd
>>> */
>>> - bpf_prog_array_map_clear(map);
>>> + bpf_fd_array_map_clear(map);
>>
>> When we are going to add a new map type to the eBPF framework that is not
>> an fd_array_map thing, this assumption of map->map_type >=
>> BPF_MAP_TYPE_PROG_ARRAY
>> might not hold then ...
>
> Also I think here changing == to >= is probably unnecessary.
> prog_array needs to do it because of circular dependency
> whereas perf_event_array cannot have it.
> Even when we attach bpf prog to perf_event and then add it to
> perf_event_array used by the same prog, right?
> Please test such scenario just in case.

Not sure completely understand what you mean. You know, we can
attach bpf_prog to kprobe events. For now, we limit few event
types, only PERF_EVENT_RAW & PERF_EVENT_HARDWARE event can
be accessed in bpf_perf_event_read(). Seems like the dependency
scenario won't happen.
I will add the event decrement refcnt function to map_free in V5.
right?
>
>
> .
>

2015-07-31 15:46:12

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] bpf: Make the bpf_prog_array_map more generic

On 7/31/15 1:50 AM, xiakaixu wrote:
> 于 2015/7/30 9:44, Alexei Starovoitov 写道:
>> On 7/29/15 4:17 PM, Daniel Borkmann wrote:
>>>> - if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
>>>> + if (map->map_type >= BPF_MAP_TYPE_PROG_ARRAY)
>>>> /* prog_array stores refcnt-ed bpf_prog pointers
>>>> * release them all when user space closes prog_array_fd
>>>> */
>>>> - bpf_prog_array_map_clear(map);
>>>> + bpf_fd_array_map_clear(map);
>>>
>>> When we are going to add a new map type to the eBPF framework that is not
>>> an fd_array_map thing, this assumption of map->map_type >=
>>> BPF_MAP_TYPE_PROG_ARRAY
>>> might not hold then ...
>>
>> Also I think here changing == to >= is probably unnecessary.
>> prog_array needs to do it because of circular dependency
>> whereas perf_event_array cannot have it.
>> Even when we attach bpf prog to perf_event and then add it to
>> perf_event_array used by the same prog, right?
>> Please test such scenario just in case.
>
> Not sure completely understand what you mean. You know, we can
> attach bpf_prog to kprobe events. For now, we limit few event
> types, only PERF_EVENT_RAW & PERF_EVENT_HARDWARE event can
> be accessed in bpf_perf_event_read(). Seems like the dependency
> scenario won't happen.

ahh, yes, you're correct. There is no circular dependency.