2015-06-03 22:41:19

by Arnaldo Carvalho de Melo

Subject: [GIT PULL 0/6] perf/core improvements and fixes

Hi Ingo,

Please consider applying.

One of the next pull requests will probably include the eBPF work by
Wang Nan, but I am still going through it and want to test it thoroughly.

BTW: Have you looked at it lately? It is at:

http://lkml.kernel.org/r/[email protected]

Super summary from the above cover letter:

---------------------
It enables 'perf record' to filter events using eBPF programs like:

# perf record --event bpf-file.o sleep 1

Events are selected and filtered according to definitions in bpf-file.o.
---------------------

The first two patches from that series are in this pull request, as
they just move stuff into tools/include/linux/ from tools/perf/util/include.

Regards,

- Arnaldo

The following changes since commit 5c9b9bc67c684e40b3a5e7e9facde0fb7200cd8c:

Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2015-05-29 20:19:02 +0200)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo

for you to fetch changes up to 1f121b03d058dd07199d8924373d3c52a207f63b:

perf tools: Deal with kernel module names in '[]' correctly (2015-06-03 10:02:38 -0300)

----------------------------------------------------------------
perf/core improvements and fixes:

User visible:

- Fix 'perf probe' segfault when glob matching function without debuginfo (Wang Nan)

- Remove newline char when reading event scale and unit (Madhavan Srinivasan)

- Deal with kernel module names in '[]' correctly (Wang Nan)

Infrastructure:

- Fix the search for the kernel DSO on the unified list (Arnaldo Carvalho de Melo)

- Move tools/perf/util/include/linux/{kernel.h,list.h,poison.h} to tools/include,
to be used in tools/lib/bpf/ (Wang Nan)

Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

----------------------------------------------------------------
Arnaldo Carvalho de Melo (1):
perf machine: Fix the search for the kernel DSO on the unified list

Madhavan Srinivasan (1):
perf tools: Remove newline char when reading event scale and unit

Wang Nan (4):
perf probe: Fix segfault when glob matching function without debuginfo
perf tools: Move linux/kernel.h to tools/include
tools: Move tools/perf/util/include/linux/{list.h,poison.h} to tools/include
perf tools: Deal with kernel module names in '[]' correctly

tools/{perf/util => }/include/linux/kernel.h | 4 +-
tools/{perf/util => }/include/linux/list.h | 6 +--
tools/include/linux/poison.h | 1 +
tools/perf/MANIFEST | 3 ++
tools/perf/tests/kmod-path.c | 72 ++++++++++++++++++++++++++++
tools/perf/util/dso.c | 47 ++++++++++++++++--
tools/perf/util/dso.h | 2 +-
tools/perf/util/header.c | 8 ++--
tools/perf/util/include/linux/poison.h | 1 -
tools/perf/util/machine.c | 22 ++++++++-
tools/perf/util/pmu.c | 11 ++++-
tools/perf/util/probe-event.c | 26 ++++++++--
12 files changed, 179 insertions(+), 24 deletions(-)
rename tools/{perf/util => }/include/linux/kernel.h (97%)
rename tools/{perf/util => }/include/linux/list.h (90%)
create mode 100644 tools/include/linux/poison.h
delete mode 100644 tools/perf/util/include/linux/poison.h


2015-06-03 22:40:52

by Arnaldo Carvalho de Melo

Subject: [PATCH 1/6] perf probe: Fix segfault when glob matching function without debuginfo

From: Wang Nan <[email protected]>

Commit 4c859351226c920b227fec040a3b447f0d482af3 ("perf probe: Support
glob wildcards for function name") introduces segfault problems when
debuginfo is not available:

# perf probe 'sys_w*'
Added new events:
Segmentation fault

The first problem resides in find_probe_trace_events_from_map(). In
that function, find_probe_functions() is called to match each symbol
against the glob and count the matching functions, but
map__for_each_symbol_by_name() is still used to find the 'struct symbol'
for each matching function. Unfortunately, map__for_each_symbol_by_name()
does exact matching by searching in an rbtree.

It has no notion of glob matching, and adding it would not be easy: the
lookup is an rbtree-based binary search, and there is no guarantee that
all names matched by a glob (any glob passed by the user) reside in one
subtree.

This patch drops map__for_each_symbol_by_name(). Without the rbtree
lookup, re-matching all symbols would be expensive, so this patch avoids
it by saving all matching results into an array (syms).
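
As an illustration only, the "match once, remember the results" idea
can be sketched in standalone C. This is not the perf code: struct sym
and collect_matches() are hypothetical names, and POSIX fnmatch() stands
in for perf's strglobmatch().

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical stand-in for perf's 'struct symbol'; only the name matters. */
struct sym {
	const char *name;
};

/*
 * Walk every symbol once, count all glob matches, and remember the first
 * 'max' of them in 'out'. Returns the total number of matches, which may
 * exceed 'max'; the caller must not read more than 'max' entries of 'out'.
 */
static size_t collect_matches(const struct sym *syms, size_t nr,
			      const char *glob,
			      const struct sym **out, size_t max)
{
	size_t found = 0;
	size_t i;

	assert(syms != NULL && glob != NULL);

	for (i = 0; i < nr; i++) {
		/* fnmatch() returns 0 when the name matches the pattern. */
		if (fnmatch(glob, syms[i].name, 0) != 0)
			continue;
		if (out != NULL && found < max)
			out[found] = &syms[i];
		found++;
	}

	return found;
}
```

A caller would then iterate over the saved pointers instead of matching
every symbol against the glob a second time.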

The second problem is the loss of tp->realname. In
__add_probe_trace_events(), if pev->point.function is a glob, the event
name should be set to tev->point.realname. This patch ensures its
existence by strdup()ing sym->name instead of leaving a NULL pointer there.

After this patch:

# perf probe 'sys_w*'
Added new events:
probe:sys_waitid (on sys_w*)
probe:sys_wait4 (on sys_w*)
probe:sys_waitpid (on sys_w*)
probe:sys_write (on sys_w*)
probe:sys_writev (on sys_w*)

You can now use it in all perf tools, such as:

perf record -e probe:sys_writev -aR sleep 1

Signed-off-by: Wang Nan <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Zefan Li <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/util/probe-event.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index d27edef5eb5b..e6f215b7a052 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -2494,7 +2494,8 @@ close_out:
return ret;
}

-static int find_probe_functions(struct map *map, char *name)
+static int find_probe_functions(struct map *map, char *name,
+ struct symbol **syms)
{
int found = 0;
struct symbol *sym;
@@ -2504,8 +2505,11 @@ static int find_probe_functions(struct map *map, char *name)
return 0;

map__for_each_symbol(map, sym, tmp) {
- if (strglobmatch(sym->name, name))
+ if (strglobmatch(sym->name, name)) {
found++;
+ if (syms && found < probe_conf.max_probes)
+ syms[found - 1] = sym;
+ }
}

return found;
@@ -2528,11 +2532,12 @@ static int find_probe_trace_events_from_map(struct perf_probe_event *pev,
struct map *map = NULL;
struct ref_reloc_sym *reloc_sym = NULL;
struct symbol *sym;
+ struct symbol **syms = NULL;
struct probe_trace_event *tev;
struct perf_probe_point *pp = &pev->point;
struct probe_trace_point *tp;
int num_matched_functions;
- int ret, i;
+ int ret, i, j;

map = get_target_map(pev->target, pev->uprobes);
if (!map) {
@@ -2540,11 +2545,17 @@ static int find_probe_trace_events_from_map(struct perf_probe_event *pev,
goto out;
}

+ syms = malloc(sizeof(struct symbol *) * probe_conf.max_probes);
+ if (!syms) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
/*
* Load matched symbols: Since the different local symbols may have
* same name but different addresses, this lists all the symbols.
*/
- num_matched_functions = find_probe_functions(map, pp->function);
+ num_matched_functions = find_probe_functions(map, pp->function, syms);
if (num_matched_functions == 0) {
pr_err("Failed to find symbol %s in %s\n", pp->function,
pev->target ? : "kernel");
@@ -2575,7 +2586,9 @@ static int find_probe_trace_events_from_map(struct perf_probe_event *pev,

ret = 0;

- map__for_each_symbol_by_name(map, pp->function, sym) {
+ for (j = 0; j < num_matched_functions; j++) {
+ sym = syms[j];
+
tev = (*tevs) + ret;
tp = &tev->point;
if (ret == num_matched_functions) {
@@ -2599,6 +2612,8 @@ static int find_probe_trace_events_from_map(struct perf_probe_event *pev,
tp->symbol = strdup_or_goto(sym->name, nomem_out);
tp->offset = pp->offset;
}
+ tp->realname = strdup_or_goto(sym->name, nomem_out);
+
tp->retprobe = pp->retprobe;
if (pev->target)
tev->point.module = strdup_or_goto(pev->target,
@@ -2629,6 +2644,7 @@ static int find_probe_trace_events_from_map(struct perf_probe_event *pev,

out:
put_target_map(map, pev->uprobes);
+ free(syms);
return ret;

nomem_out:
--
2.1.0

2015-06-03 22:41:56

by Arnaldo Carvalho de Melo

Subject: [PATCH 2/6] perf tools: Remove newline char when reading event scale and unit

From: Madhavan Srinivasan <[email protected]>

Commit fd979c013207 introduced the perf_event_sysfs_show function
to display the event_str value of an attr in kernel/events/core.c. But
the function returns the value with a newline char.

So, if an event also carries an event.unit file, the perf tool's
formatting goes for a spin when printing the counter data.

That is, because perf_event_sysfs_show returns the unit with a trailing
newline char, the event name is printed on a new line.

Fixing this in perf core would break the API, hence this proposes a fix
in the perf tool.
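
For illustration, stripping one trailing newline from a buffer filled by
read() can be sketched as below. The helper name terminate_sysfs_value()
is hypothetical and not part of perf; unlike the hunks in this patch,
the sketch also guards the zero-length case before looking at
buf[len - 1].

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * NUL-terminate a buffer filled by read(), dropping one trailing newline.
 * 'len' is the byte count read() returned; 'buf' must have room for
 * len + 1 bytes. Returns the length of the resulting string.
 */
static size_t terminate_sysfs_value(char *buf, ssize_t len)
{
	assert(buf != NULL && len >= 0);

	/* Only inspect the last byte when at least one byte was read. */
	if (len > 0 && buf[len - 1] == '\n')
		len--;
	buf[len] = '\0';

	return (size_t)len;
}
```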

Signed-off-by: Madhavan Srinivasan <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Add spaces around operators ]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/util/pmu.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 5d3ab7c8ceaf..0fcc624eb767 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -112,7 +112,11 @@ static int perf_pmu__parse_scale(struct perf_pmu_alias *alias, char *dir, char *
if (sret < 0)
goto error;

- scale[sret] = '\0';
+ if (scale[sret - 1] == '\n')
+ scale[sret - 1] = '\0';
+ else
+ scale[sret] = '\0';
+
/*
* save current locale
*/
@@ -154,7 +158,10 @@ static int perf_pmu__parse_unit(struct perf_pmu_alias *alias, char *dir, char *n

close(fd);

- alias->unit[sret] = '\0';
+ if (alias->unit[sret - 1] == '\n')
+ alias->unit[sret - 1] = '\0';
+ else
+ alias->unit[sret] = '\0';

return 0;
error:
--
2.1.0

2015-06-03 22:41:07

by Arnaldo Carvalho de Melo

Subject: [PATCH 3/6] perf machine: Fix the search for the kernel DSO on the unified list

From: Arnaldo Carvalho de Melo <[email protected]>

When the user_dsos and kernel_dsos lists were unified, a bug was
introduced by inverting the check for dso->kernel. Fix it.
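
A simplified sketch of the intended filter on a unified list follows.
struct dso_entry and find_kernel_dso() are hypothetical stand-ins, not
perf types, and returning the first hit is an assumption made to keep
the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for perf's 'struct dso'. */
struct dso_entry {
	bool kernel;	/* set for kernel-space objects (image or module) */
	bool is_module;	/* stands in for is_kernel_module(dso->long_name) */
};

/* Return the first entry that is kernel-space but not a module, or NULL. */
static const struct dso_entry *find_kernel_dso(const struct dso_entry *list,
					       size_t nr)
{
	size_t i;

	assert(list != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		/* Skip user-space objects and kernel modules. */
		if (!list[i].kernel || list[i].is_module)
			continue;
		return &list[i];
	}

	return NULL;
}
```

With the inverted condition (kernel && is_module), a user-space entry
placed ahead of the kernel on the unified list would not be skipped and
would be picked up as the kernel DSO.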

Fixes: 3d39ac538629 ("perf machine: No need to have two DSOs lists")
Cc: Adrian Hunter <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/util/machine.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 2ed61f59d415..4e29e80932e5 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1149,7 +1149,7 @@ static int machine__process_kernel_mmap_event(struct machine *machine,
struct dso *dso;

list_for_each_entry(dso, &machine->dsos.head, node) {
- if (dso->kernel && is_kernel_module(dso->long_name))
+ if (!dso->kernel || is_kernel_module(dso->long_name))
continue;

kernel = dso;
--
2.1.0

2015-06-03 22:41:45

by Arnaldo Carvalho de Melo

Subject: [PATCH 4/6] perf tools: Move linux/kernel.h to tools/include

From: Wang Nan <[email protected]>

This patch moves kernel.h from tools/perf/util/include/linux/kernel.h
to tools/include/linux/kernel.h so that other libraries, such as libbpf
(which will be introduced by later patches), can use the macros in it.

MANIFEST is also updated for 'make perf-*-src-pkg'.

Signed-off-by: Wang Nan <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Cc: Brendan Gregg <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Ahern <[email protected]>
Cc: He Kuang <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kaixu Xia <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Fixed up the ifdef guard to match other entries in tools/include/linux ]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/include/linux/kernel.h | 107 +++++++++++++++++++++++++++++++++
tools/perf/MANIFEST | 1 +
tools/perf/util/include/linux/kernel.h | 107 ---------------------------------
3 files changed, 108 insertions(+), 107 deletions(-)
create mode 100644 tools/include/linux/kernel.h
delete mode 100644 tools/perf/util/include/linux/kernel.h

diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
new file mode 100644
index 000000000000..76df53539c2a
--- /dev/null
+++ b/tools/include/linux/kernel.h
@@ -0,0 +1,107 @@
+#ifndef __TOOLS_LINUX_KERNEL_H
+#define __TOOLS_LINUX_KERNEL_H
+
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+
+#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
+
+#define PERF_ALIGN(x, a) __PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
+#define __PERF_ALIGN_MASK(x, mask) (((x)+(mask))&~(mask))
+
+#ifndef offsetof
+#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
+#endif
+
+#ifndef container_of
+/**
+ * container_of - cast a member of a structure out to the containing structure
+ * @ptr: the pointer to the member.
+ * @type: the type of the container struct this is embedded in.
+ * @member: the name of the member within the struct.
+ *
+ */
+#define container_of(ptr, type, member) ({ \
+ const typeof(((type *)0)->member) * __mptr = (ptr); \
+ (type *)((char *)__mptr - offsetof(type, member)); })
+#endif
+
+#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
+
+#ifndef max
+#define max(x, y) ({ \
+ typeof(x) _max1 = (x); \
+ typeof(y) _max2 = (y); \
+ (void) (&_max1 == &_max2); \
+ _max1 > _max2 ? _max1 : _max2; })
+#endif
+
+#ifndef min
+#define min(x, y) ({ \
+ typeof(x) _min1 = (x); \
+ typeof(y) _min2 = (y); \
+ (void) (&_min1 == &_min2); \
+ _min1 < _min2 ? _min1 : _min2; })
+#endif
+
+#ifndef roundup
+#define roundup(x, y) ( \
+{ \
+ const typeof(y) __y = y; \
+ (((x) + (__y - 1)) / __y) * __y; \
+} \
+)
+#endif
+
+#ifndef BUG_ON
+#ifdef NDEBUG
+#define BUG_ON(cond) do { if (cond) {} } while (0)
+#else
+#define BUG_ON(cond) assert(!(cond))
+#endif
+#endif
+
+/*
+ * Both need more care to handle endianness
+ * (Don't use bitmap_copy_le() for now)
+ */
+#define cpu_to_le64(x) (x)
+#define cpu_to_le32(x) (x)
+
+static inline int
+vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
+{
+ int i;
+ ssize_t ssize = size;
+
+ i = vsnprintf(buf, size, fmt, args);
+
+ return (i >= ssize) ? (ssize - 1) : i;
+}
+
+static inline int scnprintf(char * buf, size_t size, const char * fmt, ...)
+{
+ va_list args;
+ ssize_t ssize = size;
+ int i;
+
+ va_start(args, fmt);
+ i = vsnprintf(buf, size, fmt, args);
+ va_end(args);
+
+ return (i >= ssize) ? (ssize - 1) : i;
+}
+
+/*
+ * This looks more complex than it should be. But we need to
+ * get the type for the ~ right in round_down (it needs to be
+ * as wide as the result!), and we want to evaluate the macro
+ * arguments just once each.
+ */
+#define __round_mask(x, y) ((__typeof__(x))((y)-1))
+#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
+#define round_down(x, y) ((x) & ~__round_mask(x, y))
+
+#endif
diff --git a/tools/perf/MANIFEST b/tools/perf/MANIFEST
index a83cf75164e1..fce4a47347aa 100644
--- a/tools/perf/MANIFEST
+++ b/tools/perf/MANIFEST
@@ -40,6 +40,7 @@ tools/include/linux/bitops.h
tools/include/linux/compiler.h
tools/include/linux/export.h
tools/include/linux/hash.h
+tools/include/linux/kernel.h
tools/include/linux/log2.h
tools/include/linux/types.h
include/asm-generic/bitops/arch_hweight.h
diff --git a/tools/perf/util/include/linux/kernel.h b/tools/perf/util/include/linux/kernel.h
deleted file mode 100644
index 09e8e7aea7c6..000000000000
--- a/tools/perf/util/include/linux/kernel.h
+++ /dev/null
@@ -1,107 +0,0 @@
-#ifndef PERF_LINUX_KERNEL_H_
-#define PERF_LINUX_KERNEL_H_
-
-#include <stdarg.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <assert.h>
-
-#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
-
-#define PERF_ALIGN(x, a) __PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
-#define __PERF_ALIGN_MASK(x, mask) (((x)+(mask))&~(mask))
-
-#ifndef offsetof
-#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
-#endif
-
-#ifndef container_of
-/**
- * container_of - cast a member of a structure out to the containing structure
- * @ptr: the pointer to the member.
- * @type: the type of the container struct this is embedded in.
- * @member: the name of the member within the struct.
- *
- */
-#define container_of(ptr, type, member) ({ \
- const typeof(((type *)0)->member) * __mptr = (ptr); \
- (type *)((char *)__mptr - offsetof(type, member)); })
-#endif
-
-#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
-
-#ifndef max
-#define max(x, y) ({ \
- typeof(x) _max1 = (x); \
- typeof(y) _max2 = (y); \
- (void) (&_max1 == &_max2); \
- _max1 > _max2 ? _max1 : _max2; })
-#endif
-
-#ifndef min
-#define min(x, y) ({ \
- typeof(x) _min1 = (x); \
- typeof(y) _min2 = (y); \
- (void) (&_min1 == &_min2); \
- _min1 < _min2 ? _min1 : _min2; })
-#endif
-
-#ifndef roundup
-#define roundup(x, y) ( \
-{ \
- const typeof(y) __y = y; \
- (((x) + (__y - 1)) / __y) * __y; \
-} \
-)
-#endif
-
-#ifndef BUG_ON
-#ifdef NDEBUG
-#define BUG_ON(cond) do { if (cond) {} } while (0)
-#else
-#define BUG_ON(cond) assert(!(cond))
-#endif
-#endif
-
-/*
- * Both need more care to handle endianness
- * (Don't use bitmap_copy_le() for now)
- */
-#define cpu_to_le64(x) (x)
-#define cpu_to_le32(x) (x)
-
-static inline int
-vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
-{
- int i;
- ssize_t ssize = size;
-
- i = vsnprintf(buf, size, fmt, args);
-
- return (i >= ssize) ? (ssize - 1) : i;
-}
-
-static inline int scnprintf(char * buf, size_t size, const char * fmt, ...)
-{
- va_list args;
- ssize_t ssize = size;
- int i;
-
- va_start(args, fmt);
- i = vsnprintf(buf, size, fmt, args);
- va_end(args);
-
- return (i >= ssize) ? (ssize - 1) : i;
-}
-
-/*
- * This looks more complex than it should be. But we need to
- * get the type for the ~ right in round_down (it needs to be
- * as wide as the result!), and we want to evaluate the macro
- * arguments just once each.
- */
-#define __round_mask(x, y) ((__typeof__(x))((y)-1))
-#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
-#define round_down(x, y) ((x) & ~__round_mask(x, y))
-
-#endif
--
2.1.0

2015-06-03 22:41:38

by Arnaldo Carvalho de Melo

Subject: [PATCH 5/6] tools: Move tools/perf/util/include/linux/{list.h,poison.h} to tools/include

From: Wang Nan <[email protected]>

This patch moves list.h from tools/perf/util/include/linux/list.h to
tools/include/linux/list.h so that other libraries, such as libbpf
(which will be introduced by later patches), can use the macros in it.
Since list.h depends on poison.h, poison.h is also moved.

Both files use relative include paths, so one '..' is removed from each
header to make them suit the new directory.

MANIFEST is also updated for 'make perf-*-src-pkg'.

Signed-off-by: Wang Nan <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Brendan Gregg <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Ahern <[email protected]>
Cc: He Kuang <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kaixu Xia <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/include/linux/list.h | 29 +++++++++++++++++++++++++++++
tools/include/linux/poison.h | 1 +
tools/perf/MANIFEST | 2 ++
tools/perf/util/include/linux/list.h | 29 -----------------------------
tools/perf/util/include/linux/poison.h | 1 -
5 files changed, 32 insertions(+), 30 deletions(-)
create mode 100644 tools/include/linux/list.h
create mode 100644 tools/include/linux/poison.h
delete mode 100644 tools/perf/util/include/linux/list.h
delete mode 100644 tools/perf/util/include/linux/poison.h

diff --git a/tools/include/linux/list.h b/tools/include/linux/list.h
new file mode 100644
index 000000000000..76b014c96893
--- /dev/null
+++ b/tools/include/linux/list.h
@@ -0,0 +1,29 @@
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+#include "../../../include/linux/list.h"
+
+#ifndef TOOLS_LIST_H
+#define TOOLS_LIST_H
+/**
+ * list_del_range - deletes range of entries from list.
+ * @begin: first element in the range to delete from the list.
+ * @end: last element in the range to delete from the list.
+ * Note: list_empty on the range of entries does not return true after this,
+ * the entries is in an undefined state.
+ */
+static inline void list_del_range(struct list_head *begin,
+ struct list_head *end)
+{
+ begin->prev->next = end->next;
+ end->next->prev = begin->prev;
+}
+
+/**
+ * list_for_each_from - iterate over a list from one of its nodes
+ * @pos: the &struct list_head to use as a loop cursor, from where to start
+ * @head: the head for your list.
+ */
+#define list_for_each_from(pos, head) \
+ for (; pos != (head); pos = pos->next)
+#endif
diff --git a/tools/include/linux/poison.h b/tools/include/linux/poison.h
new file mode 100644
index 000000000000..0c27bdf14233
--- /dev/null
+++ b/tools/include/linux/poison.h
@@ -0,0 +1 @@
+#include "../../../include/linux/poison.h"
diff --git a/tools/perf/MANIFEST b/tools/perf/MANIFEST
index fce4a47347aa..a0bdd6124583 100644
--- a/tools/perf/MANIFEST
+++ b/tools/perf/MANIFEST
@@ -41,7 +41,9 @@ tools/include/linux/compiler.h
tools/include/linux/export.h
tools/include/linux/hash.h
tools/include/linux/kernel.h
+tools/include/linux/list.h
tools/include/linux/log2.h
+tools/include/linux/poison.h
tools/include/linux/types.h
include/asm-generic/bitops/arch_hweight.h
include/asm-generic/bitops/const_hweight.h
diff --git a/tools/perf/util/include/linux/list.h b/tools/perf/util/include/linux/list.h
deleted file mode 100644
index 76ddbc726343..000000000000
--- a/tools/perf/util/include/linux/list.h
+++ /dev/null
@@ -1,29 +0,0 @@
-#include <linux/kernel.h>
-#include <linux/types.h>
-
-#include "../../../../include/linux/list.h"
-
-#ifndef PERF_LIST_H
-#define PERF_LIST_H
-/**
- * list_del_range - deletes range of entries from list.
- * @begin: first element in the range to delete from the list.
- * @end: last element in the range to delete from the list.
- * Note: list_empty on the range of entries does not return true after this,
- * the entries is in an undefined state.
- */
-static inline void list_del_range(struct list_head *begin,
- struct list_head *end)
-{
- begin->prev->next = end->next;
- end->next->prev = begin->prev;
-}
-
-/**
- * list_for_each_from - iterate over a list from one of its nodes
- * @pos: the &struct list_head to use as a loop cursor, from where to start
- * @head: the head for your list.
- */
-#define list_for_each_from(pos, head) \
- for (; pos != (head); pos = pos->next)
-#endif
diff --git a/tools/perf/util/include/linux/poison.h b/tools/perf/util/include/linux/poison.h
deleted file mode 100644
index fef6dbc9ce13..000000000000
--- a/tools/perf/util/include/linux/poison.h
+++ /dev/null
@@ -1 +0,0 @@
-#include "../../../../include/linux/poison.h"
--
2.1.0

2015-06-03 22:41:11

by Arnaldo Carvalho de Melo

Subject: [PATCH 6/6] perf tools: Deal with kernel module names in '[]' correctly

From: Wang Nan <[email protected]>

Before patch ba92732e9808 ('perf kmaps: Check kmaps to make code more
robust'), 'perf report' and 'perf annotate' would segfault if trace data
contained kernel module information like this:

# perf report -D -i ./perf.data
...
0 0 0x188 [0x50]: PERF_RECORD_MMAP -1/0: [0xffffffbff1018000(0xf068000) @ 0]: x [test_module]
...

# perf report -i ./perf.data --objdump=/path/to/objdump --kallsyms=/path/to/kallsyms

perf: Segmentation fault
-------- backtrace --------
/path/to/perf[0x503478]
/lib64/libc.so.6(+0x3545f)[0x7fb201f3745f]
/path/to/perf[0x499b56]
/path/to/perf(dso__load_kallsyms+0x13c)[0x49b56c]
/path/to/perf(dso__load+0x72e)[0x49c21e]
/path/to/perf(map__load+0x6e)[0x4ae9ee]
/path/to/perf(thread__find_addr_map+0x24c)[0x47deec]
/path/to/perf(perf_event__preprocess_sample+0x88)[0x47e238]
/path/to/perf[0x43ad02]
/path/to/perf[0x4b55bc]
/path/to/perf(ordered_events__flush+0xca)[0x4b57ea]
/path/to/perf[0x4b1a01]
/path/to/perf(perf_session__process_events+0x3be)[0x4b428e]
/path/to/perf(cmd_report+0xf11)[0x43bfc1]
/path/to/perf[0x474702]
/path/to/perf(main+0x5f5)[0x42de95]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x7fb201f23bd4]
/path/to/perf[0x42dfc4]

This is because __kmod_path__parse treats names with a leading '[' as
the kernel's name instead of as names of kernel modules.

If perf.data contains build information and the buildid of such a module
can be found, its dso->kernel will be set to DSO_TYPE_KERNEL by
__event_process_build_id(), so it is treated as the kernel, not as a
kernel module.

It will then be passed to dso__load() -> dso__load_kernel_sym() ->
dso__load_kcore() if --kallsyms is provided.

The referenced patch adds a NULL pointer check to avoid the segfault.
However, such kernel modules are still processed incorrectly.

This patch fixes __kmod_path__parse, making it treat names like
'[test_module]' as kernel modules.

kmod-path.c is also updated to reflect the above changes.
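
As a rough standalone sketch of the rule described above (not the perf
implementation; bracketed_name_is_module() is a hypothetical helper), a
name in '[]' is treated as a module unless it starts with one of the
names reserved for the kernel itself, vdso, or vsyscall:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Prefixes of bracketed names that do not denote kernel modules. The
 * guest entry has no closing ']' so that any suffix after
 * "[guest.kernel.kallsyms" still matches the prefix.
 */
static const char *const non_module_prefixes[] = {
	"[kernel.kallsyms]",
	"[guest.kernel.kallsyms",
	"[vdso]",
	"[vsyscall]",
};

/*
 * Returns true when 'name' starts with '[' and matches none of the
 * reserved prefixes. Names without a leading '[' are outside the scope
 * of this sketch and simply yield false.
 */
static bool bracketed_name_is_module(const char *name)
{
	size_t i;

	assert(name != NULL);

	if (name[0] != '[')
		return false;

	for (i = 0; i < sizeof(non_module_prefixes) /
			sizeof(non_module_prefixes[0]); i++) {
		const char *prefix = non_module_prefixes[i];

		if (strncmp(name, prefix, strlen(prefix)) == 0)
			return false;
	}

	return true;
}
```

Because the bracket check comes before any extension handling, a name
such as '[test.module]' is still classified as a module even though it
contains a '.'.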

Signed-off-by: Wang Nan <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Zefan Li <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Fixed the merge with 0443f36b0de0 ("perf machine: Fix the search
for the kernel DSO on the unified list") ]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/tests/kmod-path.c | 72 ++++++++++++++++++++++++++++++++++++++++++++
tools/perf/util/dso.c | 47 ++++++++++++++++++++++++++---
tools/perf/util/dso.h | 2 +-
tools/perf/util/header.c | 8 ++---
tools/perf/util/machine.c | 22 +++++++++++++-
5 files changed, 140 insertions(+), 11 deletions(-)

diff --git a/tools/perf/tests/kmod-path.c b/tools/perf/tests/kmod-path.c
index e8d7cbb9320c..08c433b4bf4f 100644
--- a/tools/perf/tests/kmod-path.c
+++ b/tools/perf/tests/kmod-path.c
@@ -34,9 +34,21 @@ static int test(const char *path, bool alloc_name, bool alloc_ext,
return 0;
}

+static int test_is_kernel_module(const char *path, int cpumode, bool expect)
+{
+ TEST_ASSERT_VAL("is_kernel_module",
+ (!!is_kernel_module(path, cpumode)) == (!!expect));
+ pr_debug("%s (cpumode: %d) - is_kernel_module: %s\n",
+ path, cpumode, expect ? "true" : "false");
+ return 0;
+}
+
#define T(path, an, ae, k, c, n, e) \
TEST_ASSERT_VAL("failed", !test(path, an, ae, k, c, n, e))

+#define M(path, c, e) \
+ TEST_ASSERT_VAL("failed", !test_is_kernel_module(path, c, e))
+
int test__kmod_path__parse(void)
{
/* path alloc_name alloc_ext kmod comp name ext */
@@ -44,30 +56,90 @@ int test__kmod_path__parse(void)
T("/xxxx/xxxx/x-x.ko", false , true , true, false, NULL , NULL);
T("/xxxx/xxxx/x-x.ko", true , false , true, false, "[x_x]", NULL);
T("/xxxx/xxxx/x-x.ko", false , false , true, false, NULL , NULL);
+ M("/xxxx/xxxx/x-x.ko", PERF_RECORD_MISC_CPUMODE_UNKNOWN, true);
+ M("/xxxx/xxxx/x-x.ko", PERF_RECORD_MISC_KERNEL, true);
+ M("/xxxx/xxxx/x-x.ko", PERF_RECORD_MISC_USER, false);

/* path alloc_name alloc_ext kmod comp name ext */
T("/xxxx/xxxx/x.ko.gz", true , true , true, true, "[x]", "gz");
T("/xxxx/xxxx/x.ko.gz", false , true , true, true, NULL , "gz");
T("/xxxx/xxxx/x.ko.gz", true , false , true, true, "[x]", NULL);
T("/xxxx/xxxx/x.ko.gz", false , false , true, true, NULL , NULL);
+ M("/xxxx/xxxx/x.ko.gz", PERF_RECORD_MISC_CPUMODE_UNKNOWN, true);
+ M("/xxxx/xxxx/x.ko.gz", PERF_RECORD_MISC_KERNEL, true);
+ M("/xxxx/xxxx/x.ko.gz", PERF_RECORD_MISC_USER, false);

/* path alloc_name alloc_ext kmod comp name ext */
T("/xxxx/xxxx/x.gz", true , true , false, true, "x.gz" ,"gz");
T("/xxxx/xxxx/x.gz", false , true , false, true, NULL ,"gz");
T("/xxxx/xxxx/x.gz", true , false , false, true, "x.gz" , NULL);
T("/xxxx/xxxx/x.gz", false , false , false, true, NULL , NULL);
+ M("/xxxx/xxxx/x.gz", PERF_RECORD_MISC_CPUMODE_UNKNOWN, false);
+ M("/xxxx/xxxx/x.gz", PERF_RECORD_MISC_KERNEL, false);
+ M("/xxxx/xxxx/x.gz", PERF_RECORD_MISC_USER, false);

/* path alloc_name alloc_ext kmod comp name ext */
T("x.gz", true , true , false, true, "x.gz", "gz");
T("x.gz", false , true , false, true, NULL , "gz");
T("x.gz", true , false , false, true, "x.gz", NULL);
T("x.gz", false , false , false, true, NULL , NULL);
+ M("x.gz", PERF_RECORD_MISC_CPUMODE_UNKNOWN, false);
+ M("x.gz", PERF_RECORD_MISC_KERNEL, false);
+ M("x.gz", PERF_RECORD_MISC_USER, false);

/* path alloc_name alloc_ext kmod comp name ext */
T("x.ko.gz", true , true , true, true, "[x]", "gz");
T("x.ko.gz", false , true , true, true, NULL , "gz");
T("x.ko.gz", true , false , true, true, "[x]", NULL);
T("x.ko.gz", false , false , true, true, NULL , NULL);
+ M("x.ko.gz", PERF_RECORD_MISC_CPUMODE_UNKNOWN, true);
+ M("x.ko.gz", PERF_RECORD_MISC_KERNEL, true);
+ M("x.ko.gz", PERF_RECORD_MISC_USER, false);
+
+ /* path alloc_name alloc_ext kmod comp name ext */
+ T("[test_module]", true , true , true, false, "[test_module]", NULL);
+ T("[test_module]", false , true , true, false, NULL , NULL);
+ T("[test_module]", true , false , true, false, "[test_module]", NULL);
+ T("[test_module]", false , false , true, false, NULL , NULL);
+ M("[test_module]", PERF_RECORD_MISC_CPUMODE_UNKNOWN, true);
+ M("[test_module]", PERF_RECORD_MISC_KERNEL, true);
+ M("[test_module]", PERF_RECORD_MISC_USER, false);
+
+ /* path alloc_name alloc_ext kmod comp name ext */
+ T("[test.module]", true , true , true, false, "[test.module]", NULL);
+ T("[test.module]", false , true , true, false, NULL , NULL);
+ T("[test.module]", true , false , true, false, "[test.module]", NULL);
+ T("[test.module]", false , false , true, false, NULL , NULL);
+ M("[test.module]", PERF_RECORD_MISC_CPUMODE_UNKNOWN, true);
+ M("[test.module]", PERF_RECORD_MISC_KERNEL, true);
+ M("[test.module]", PERF_RECORD_MISC_USER, false);
+
+ /* path alloc_name alloc_ext kmod comp name ext */
+ T("[vdso]", true , true , false, false, "[vdso]", NULL);
+ T("[vdso]", false , true , false, false, NULL , NULL);
+ T("[vdso]", true , false , false, false, "[vdso]", NULL);
+ T("[vdso]", false , false , false, false, NULL , NULL);
+ M("[vdso]", PERF_RECORD_MISC_CPUMODE_UNKNOWN, false);
+ M("[vdso]", PERF_RECORD_MISC_KERNEL, false);
+ M("[vdso]", PERF_RECORD_MISC_USER, false);
+
+ /* path alloc_name alloc_ext kmod comp name ext */
+ T("[vsyscall]", true , true , false, false, "[vsyscall]", NULL);
+ T("[vsyscall]", false , true , false, false, NULL , NULL);
+ T("[vsyscall]", true , false , false, false, "[vsyscall]", NULL);
+ T("[vsyscall]", false , false , false, false, NULL , NULL);
+ M("[vsyscall]", PERF_RECORD_MISC_CPUMODE_UNKNOWN, false);
+ M("[vsyscall]", PERF_RECORD_MISC_KERNEL, false);
+ M("[vsyscall]", PERF_RECORD_MISC_USER, false);
+
+ /* path alloc_name alloc_ext kmod comp name ext */
+ T("[kernel.kallsyms]", true , true , false, false, "[kernel.kallsyms]", NULL);
+ T("[kernel.kallsyms]", false , true , false, false, NULL , NULL);
+ T("[kernel.kallsyms]", true , false , false, false, "[kernel.kallsyms]", NULL);
+ T("[kernel.kallsyms]", false , false , false, false, NULL , NULL);
+ M("[kernel.kallsyms]", PERF_RECORD_MISC_CPUMODE_UNKNOWN, false);
+ M("[kernel.kallsyms]", PERF_RECORD_MISC_KERNEL, false);
+ M("[kernel.kallsyms]", PERF_RECORD_MISC_USER, false);

return 0;
}
diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index b335db3532a2..5ec9e892c89b 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -166,12 +166,28 @@ bool is_supported_compression(const char *ext)
return false;
}

-bool is_kernel_module(const char *pathname)
+bool is_kernel_module(const char *pathname, int cpumode)
{
struct kmod_path m;
-
- if (kmod_path__parse(&m, pathname))
- return NULL;
+ int mode = cpumode & PERF_RECORD_MISC_CPUMODE_MASK;
+
+ WARN_ONCE(mode != cpumode,
+ "Internal error: passing unmasked cpumode (%x) to is_kernel_module",
+ cpumode);
+
+ switch (mode) {
+ case PERF_RECORD_MISC_USER:
+ case PERF_RECORD_MISC_HYPERVISOR:
+ case PERF_RECORD_MISC_GUEST_USER:
+ return false;
+ /* Treat PERF_RECORD_MISC_CPUMODE_UNKNOWN as kernel */
+ default:
+ if (kmod_path__parse(&m, pathname)) {
+ pr_err("Failed to check whether %s is a kernel module or not. Assume it is.",
+ pathname);
+ return true;
+ }
+ }

return m.kmod;
}
@@ -215,12 +231,33 @@ int __kmod_path__parse(struct kmod_path *m, const char *path,
{
const char *name = strrchr(path, '/');
const char *ext = strrchr(path, '.');
+ bool is_simple_name = false;

memset(m, 0x0, sizeof(*m));
name = name ? name + 1 : path;

+ /*
+ * '.' is also a valid character for module name. For example:
+ * [aaa.bbb] is a valid module name. '[' should have higher
+ * priority than '.ko' suffix.
+ *
+ * The kernel names are from machine__mmap_name. Such
+ * name should belong to kernel itself, not kernel module.
+ */
+ if (name[0] == '[') {
+ is_simple_name = true;
+ if ((strncmp(name, "[kernel.kallsyms]", 17) == 0) ||
+ (strncmp(name, "[guest.kernel.kallsyms", 22) == 0) ||
+ (strncmp(name, "[vdso]", 6) == 0) ||
+ (strncmp(name, "[vsyscall]", 10) == 0)) {
+ m->kmod = false;
+
+ } else
+ m->kmod = true;
+ }
+
/* No extension, just return name. */
- if (ext == NULL) {
+ if ((ext == NULL) || is_simple_name) {
if (alloc_name) {
m->name = strdup(name);
return m->name ? 0 : -ENOMEM;
diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
index 24a507a54147..ba2d90ed881f 100644
--- a/tools/perf/util/dso.h
+++ b/tools/perf/util/dso.h
@@ -220,7 +220,7 @@ char dso__symtab_origin(const struct dso *dso);
int dso__read_binary_type_filename(const struct dso *dso, enum dso_binary_type type,
char *root_dir, char *filename, size_t size);
bool is_supported_compression(const char *ext);
-bool is_kernel_module(const char *pathname);
+bool is_kernel_module(const char *pathname, int cpumode);
bool decompress_to_file(const char *ext, const char *filename, int output_fd);
bool dso__needs_decompress(struct dso *dso);

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 851143a7988d..ac5aaaeed7ff 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1239,7 +1239,7 @@ static int __event_process_build_id(struct build_id_event *bev,
{
int err = -1;
struct machine *machine;
- u16 misc;
+ u16 cpumode;
struct dso *dso;
enum dso_kernel_type dso_type;

@@ -1247,9 +1247,9 @@ static int __event_process_build_id(struct build_id_event *bev,
if (!machine)
goto out;

- misc = bev->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ cpumode = bev->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;

- switch (misc) {
+ switch (cpumode) {
case PERF_RECORD_MISC_KERNEL:
dso_type = DSO_TYPE_KERNEL;
break;
@@ -1270,7 +1270,7 @@ static int __event_process_build_id(struct build_id_event *bev,

dso__set_build_id(dso, &bev->build_id);

- if (!is_kernel_module(filename))
+ if (!is_kernel_module(filename, cpumode))
dso->kernel = dso_type;

build_id__sprintf(dso->build_id, sizeof(dso->build_id),
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 4e29e80932e5..9e02c86f39f5 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1149,9 +1149,29 @@ static int machine__process_kernel_mmap_event(struct machine *machine,
struct dso *dso;

list_for_each_entry(dso, &machine->dsos.head, node) {
- if (!dso->kernel || is_kernel_module(dso->long_name))
+
+ /*
+ * The cpumode passed to is_kernel_module is not the
+ * cpumode of *this* event. If we insist on passing
+ * correct cpumode to is_kernel_module, we should
+ * record the cpumode when we adding this dso to the
+ * linked list.
+ *
+ * However we don't really need passing correct
+ * cpumode. We know the correct cpumode must be kernel
+ * mode (if not, we should not link it onto kernel_dsos
+ * list).
+ *
+ * Therefore, we pass PERF_RECORD_MISC_CPUMODE_UNKNOWN.
+ * is_kernel_module() treats it as a kernel cpumode.
+ */
+
+ if (!dso->kernel ||
+ is_kernel_module(dso->long_name,
+ PERF_RECORD_MISC_CPUMODE_UNKNOWN))
continue;

+
kernel = dso;
break;
}
--
2.1.0
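
The classification rule that the dso.c hunks above introduce can be sketched in isolation. This is a minimal sketch, not the perf implementation: the SK_* constants are local stand-ins assumed to mirror the PERF_RECORD_MISC_* cpumode values, the sketch_* function names are hypothetical, and only bracketed names are handled (the '.ko' suffix and compression paths of the real parser are left out).

```c
#include <stdbool.h>
#include <string.h>

/* Local stand-ins, assumed to mirror the PERF_RECORD_MISC_* cpumode values. */
enum {
	SK_CPUMODE_UNKNOWN = 0,
	SK_KERNEL          = 1,
	SK_USER            = 2,
	SK_HYPERVISOR      = 3,
	SK_GUEST_KERNEL    = 4,
	SK_GUEST_USER      = 5,
	SK_CPUMODE_MASK    = 7,
};

static bool sketch_starts_with(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Hypothetical helper: a name in '[]' is a module unless it is one of the
 * well-known kernel or vdso map names.
 */
static bool sketch_bracket_name_is_kmod(const char *name)
{
	if (name[0] != '[')
		return false;
	if (sketch_starts_with(name, "[kernel.kallsyms]") ||
	    sketch_starts_with(name, "[guest.kernel.kallsyms") ||
	    sketch_starts_with(name, "[vdso]") ||
	    sketch_starts_with(name, "[vsyscall]"))
		return false;
	return true;
}

/* User-side cpumodes can never name a kernel module; unknown counts as kernel. */
static bool sketch_is_kernel_module(const char *name, int cpumode)
{
	switch (cpumode & SK_CPUMODE_MASK) {
	case SK_USER:
	case SK_HYPERVISOR:
	case SK_GUEST_USER:
		return false;
	default:
		return sketch_bracket_name_is_kmod(name);
	}
}
```

With these definitions, "[test.module]" is reported as a module for the unknown and kernel cpumodes but not for the user cpumode, while "[vdso]" and "[kernel.kallsyms]" are never modules, matching the M() expectations in the test hunk.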

2015-06-04 05:49:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes


* Arnaldo Carvalho de Melo <[email protected]> wrote:

> Hi Ingo,
>
> Please consider applying.
>
> One of the next requests probably will have the eBPF work by Wang Nan,
> but I am still going thru it and want to test it thoroughly.
>
> BTW: Have you looked at it lately? It is at:
>
> http://lkml.kernel.org/r/[email protected]
>
> Super summary from the above cover letter:
>
> ---------------------
> It enables 'perf record' to filter events using eBPF programs like:
>
> # perf record --event bpf-file.o sleep 1
>
> Events are selected and filtered according to definitions in bpf-file.o.

Looks useful, but I think the UI needs one more tweak: could you fix it to be able
to filter based on the eBPF _source_ file, not just the object file?

People want to tweak such filters as they profile, so we should use the eBPF
source code as the primary interface. We can compile it internally to the .o just
fine. The .o file is a totally uninteresting intermediate product in itself.

I.e. we need to first think through such profiling workflows from beginning to end
before allowing them upstream.

> ---------------------
>
> The first two patches from that series are in this pull req, as
> they just move stuff into tools/include/linux/ from tools/perf/include.
>
> Regards,
>
> - Arnaldo
>
> The following changes since commit 5c9b9bc67c684e40b3a5e7e9facde0fb7200cd8c:
>
> Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2015-05-29 20:19:02 +0200)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo
>
> for you to fetch changes up to 1f121b03d058dd07199d8924373d3c52a207f63b:
>
> perf tools: Deal with kernel module names in '[]' correctly (2015-06-03 10:02:38 -0300)
>
> ----------------------------------------------------------------
> perf/core improvements and fixes:
>
> User visible:
>
> - Fix 'perf probe' segfault when glob matching function without debuginfo (Wang Nan)
>
> - Remove newline char when reading event scale and unit (Madhavan Srinivasan)
>
> - Deal with kernel module names in '[]' correctly (Wang Nan)
>
> Infrastructure:
>
> - Fix the search for the kernel DSO on the unified list (Arnaldo Carvalho de Melo)
>
> - Move tools/perf/util/include/linux/{kernel.h,list.h,poison.h} to tools/include,
> to be used in tools/lib/bpf/ (Wang Nan)
>
> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
>
> ----------------------------------------------------------------
> Arnaldo Carvalho de Melo (1):
> perf machine: Fix the search for the kernel DSO on the unified list
>
> Madhavan Srinivasan (1):
> perf tools: Remove newline char when reading event scale and unit
>
> Wang Nan (4):
> perf probe: Fix segfault when glob matching function without debuginfo
> perf tools: Move linux/kernel.h to tools/include
> tools: Move tools/perf/util/include/linux/{list.h,poison.h} to tools/include
> perf tools: Deal with kernel module names in '[]' correctly
>
> tools/{perf/util => }/include/linux/kernel.h | 4 +-
> tools/{perf/util => }/include/linux/list.h | 6 +--
> tools/include/linux/poison.h | 1 +
> tools/perf/MANIFEST | 3 ++
> tools/perf/tests/kmod-path.c | 72 ++++++++++++++++++++++++++++
> tools/perf/util/dso.c | 47 ++++++++++++++++--
> tools/perf/util/dso.h | 2 +-
> tools/perf/util/header.c | 8 ++--
> tools/perf/util/include/linux/poison.h | 1 -
> tools/perf/util/machine.c | 22 ++++++++-
> tools/perf/util/pmu.c | 11 ++++-
> tools/perf/util/probe-event.c | 26 ++++++++--
> 12 files changed, 179 insertions(+), 24 deletions(-)
> rename tools/{perf/util => }/include/linux/kernel.h (97%)
> rename tools/{perf/util => }/include/linux/list.h (90%)
> create mode 100644 tools/include/linux/poison.h
> delete mode 100644 tools/perf/util/include/linux/poison.h

Pulled, thanks a lot Arnaldo!

Ingo

2015-06-04 06:09:48

by Wang Nan

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes



On 2015/6/4 13:48, Ingo Molnar wrote:
> * Arnaldo Carvalho de Melo <[email protected]> wrote:
>
>> Hi Ingo,
>>
>> Please consider applying.
>>
>> One of the next requests probably will have the eBPF work by Wang Nan,
>> but I am still going thru it and want to test it thoroughly.
>>
>> BTW: Have you looked at it lately? It is at:
>>
>> http://lkml.kernel.org/r/[email protected]
>>
>> Super summary from the above cover letter:
>>
>> ---------------------
>> It enables 'perf record' to filter events using eBPF programs like:
>>
>> # perf record --event bpf-file.o sleep 1
>>
>> Events are selected and filtered according to definitions in bpf-file.o.
> Looks useful, but I think the UI needs one more tweak: could you fix it to be able
> to filter based on the eBPF _source_ file, not just the object file?
>
> People want to tweak such filters as they profile, so we should use the eBPF
> source code as the primary interface. We can compile it internally to the .o just
> fine. The .o file is a totally uninteresting intermediate product in itself.
>
> I.e. we need to first think through such profiling workflows from beginning to end
> before allowing them upstream.

In a private mail, Alexei Starovoitov discussed this with me. He said that he is
working on a shared object which can compile a C program into BPF bytecode on the
fly. After he finishes that work, I think perf can support dtrace-like profiling,
in which users feed source code to perf directly on the command line. He said he
can release it in June. I added him to the CC list.

However, I think the '.o' intermediate is still needed. I'd like to share a real
profiling experience using eBPF today, so please keep an eye out for it. In my
experience, since we are using C instead of dtrace, the code can be relatively
complex. Therefore, even if perf is able to compile the C source on the fly, I
think the user still needs to transfer the profiling scripts to the target
machine. For such a user, precompiling and debugging on a high-end server and
then transferring the result to the target machine (like a smartphone) is
tolerable, and it is useful for me.

Thank you.

>> ---------------------
>>
>> The first two patches from that series are in this pull req, as
>> they just move stuff into tools/include/linux/ from tools/perf/include.
>>
>> Regards,
>>
>> - Arnaldo
>>
>> The following changes since commit 5c9b9bc67c684e40b3a5e7e9facde0fb7200cd8c:
>>
>> Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2015-05-29 20:19:02 +0200)
>>
>> are available in the git repository at:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo
>>
>> for you to fetch changes up to 1f121b03d058dd07199d8924373d3c52a207f63b:
>>
>> perf tools: Deal with kernel module names in '[]' correctly (2015-06-03 10:02:38 -0300)
>>
>> ----------------------------------------------------------------
>> perf/core improvements and fixes:
>>
>> User visible:
>>
>> - Fix 'perf probe' segfault when glob matching function without debuginfo (Wang Nan)
>>
>> - Remove newline char when reading event scale and unit (Madhavan Srinivasan)
>>
>> - Deal with kernel module names in '[]' correctly (Wang Nan)
>>
>> Infrastructure:
>>
>> - Fix the search for the kernel DSO on the unified list (Arnaldo Carvalho de Melo)
>>
>> - Move tools/perf/util/include/linux/{kernel.h,list.h,poison.h} to tools/include,
>> to be used in tools/lib/bpf/ (Wang Nan)
>>
>> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
>>
>> ----------------------------------------------------------------
>> Arnaldo Carvalho de Melo (1):
>> perf machine: Fix the search for the kernel DSO on the unified list
>>
>> Madhavan Srinivasan (1):
>> perf tools: Remove newline char when reading event scale and unit
>>
>> Wang Nan (4):
>> perf probe: Fix segfault when glob matching function without debuginfo
>> perf tools: Move linux/kernel.h to tools/include
>> tools: Move tools/perf/util/include/linux/{list.h,poison.h} to tools/include
>> perf tools: Deal with kernel module names in '[]' correctly
>>
>> tools/{perf/util => }/include/linux/kernel.h | 4 +-
>> tools/{perf/util => }/include/linux/list.h | 6 +--
>> tools/include/linux/poison.h | 1 +
>> tools/perf/MANIFEST | 3 ++
>> tools/perf/tests/kmod-path.c | 72 ++++++++++++++++++++++++++++
>> tools/perf/util/dso.c | 47 ++++++++++++++++--
>> tools/perf/util/dso.h | 2 +-
>> tools/perf/util/header.c | 8 ++--
>> tools/perf/util/include/linux/poison.h | 1 -
>> tools/perf/util/machine.c | 22 ++++++++-
>> tools/perf/util/pmu.c | 11 ++++-
>> tools/perf/util/probe-event.c | 26 ++++++++--
>> 12 files changed, 179 insertions(+), 24 deletions(-)
>> rename tools/{perf/util => }/include/linux/kernel.h (97%)
>> rename tools/{perf/util => }/include/linux/list.h (90%)
>> create mode 100644 tools/include/linux/poison.h
>> delete mode 100644 tools/perf/util/include/linux/poison.h
> Pulled, thanks a lot Arnaldo!
>
> Ingo

2015-06-04 07:22:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes


* Wangnan (F) <[email protected]> wrote:

> On 2015/6/4 13:48, Ingo Molnar wrote:
> >* Arnaldo Carvalho de Melo <[email protected]> wrote:
> >
> >>Hi Ingo,
> >>
> >> Please consider applying.
> >>
> >> One of the next requests probably will have the eBPF work by Wang Nan,
> >>but I am still going thru it and want to test it thoroughly.
> >>
> >> BTW: Have you looked at it lately? It is at:
> >>
> >>http://lkml.kernel.org/r/[email protected]
> >>
> >>Super summary from the above cover letter:
> >>
> >>---------------------
> >>It enables 'perf record' to filter events using eBPF programs like:
> >>
> >> # perf record --event bpf-file.o sleep 1
> >>
> >>Events are selected and filtered according to definitions in bpf-file.o.
> >Looks useful, but I think the UI needs one more tweak: could you fix it to be able
> >to filter based on the eBPF _source_ file, not just the object file?
> >
> >People want to tweak such filters as they profile, so we should use the eBPF
> >source code as the primary interface. We can compile it internally to the .o just
> >fine. The .o file is a totally uninteresting intermediate product in itself.
> >
> >I.e. we need to first think through such profiling workflows from beginning to end
> >before allowing them upstream.
>
> In a private mail, Alexei Starovoitov discussed this with me. He said that
> he is working on a shared object which can compile a C program into BPF bytecode
> on the fly. After he finishes that work, I think perf can support dtrace-like
> profiling, in which users feed source code to perf directly on the command
> line. He said he can release it in June. I added him to the CC list.
>
> However I think the '.o' intermediate is still needed. [...]

So how do you generate the .o? Why cannot the tool, if it sees that the filter
parameter is eBPF source code, do that automatically?

I.e. you are making the user jump through hoops for no good reason - that's a bad
UI and a bad workflow. Please don't do that!

Thanks,

Ingo

2015-06-04 10:06:50

by Wang Nan

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes



On 2015/6/4 15:21, Ingo Molnar wrote:
> * Wangnan (F) <[email protected]> wrote:
>
>> On 2015/6/4 13:48, Ingo Molnar wrote:
>>> * Arnaldo Carvalho de Melo <[email protected]> wrote:
>>>
>>>> Hi Ingo,
>>>>
>>>> Please consider applying.
>>>>
>>>> One of the next requests probably will have the eBPF work by Wang Nan,
>>>> but I am still going thru it and want to test it thoroughly.
>>>>
>>>> BTW: Have you looked at it lately? It is at:
>>>>
>>>> http://lkml.kernel.org/r/[email protected]
>>>>
>>>> Super summary from the above cover letter:
>>>>
>>>> ---------------------
>>>> It enables 'perf record' to filter events using eBPF programs like:
>>>>
>>>> # perf record --event bpf-file.o sleep 1
>>>>
>>>> Events are selected and filtered according to definitions in bpf-file.o.
>>> Looks useful, but I think the UI needs one more tweak: could you fix it to be able
>>> to filter based on the eBPF _source_ file, not just the object file?
>>>
>>> People want to tweak such filters as they profile, so we should use the eBPF
>>> source code as the primary interface. We can compile it internally to the .o just
>>> fine. The .o file is a totally uninteresting intermediate product in itself.
>>>
>>> I.e. we need to first think through such profiling workflows from beginning to end
>>> before allowing them upstream.
>> In a private mail, Alexei Starovoitov discussed this with me. He said that
>> he is working on a shared object which can compile a C program into BPF bytecode
>> on the fly. After he finishes that work, I think perf can support dtrace-like
>> profiling, in which users feed source code to perf directly on the command
>> line. He said he can release it in June. I added him to the CC list.
>>
>> However I think the '.o' intermediate is still needed. [...]
> So how do you generate the .o? Why cannot the tool, if it sees that the filter
> parameter is eBPF source code, do that automatically?

I think compiling on the fly is our goal, and Alexei is working on it. However,
it looks like some limitations still exist. For example, the BPF program may
require the definitions of kernel structures in order to dereference struct
pointers. Therefore, the machine on which compiling occurs must have the kernel
headers and configuration installed. However, most production environments don't
have them, for security and other reasons.

I think BPF should be used to profile not only development environments but also
production systems. If we only cared about development systems, why not
recompile the kernel with stub code or use kprobe modules?

What about this:

Give perf the ability to generate the '.o' file (in memory, or dumped to a real
file) in the development environment. If we want to profile that system, we can
make perf load the generated object into the kernel and start tracing with
(perf record --event bpf_file.c ...). If we want to profile systems on which the
compiling environment is missing, we can make perf create a '.o' file
(perf bpf compile bpf_file.c -o bpf_file.o), copy it onto the target
environment, and use perf record --event bpf_file.o ... to filter events.

Thank you.

> I.e. you are making the user jump through hoops for no good reason - that's a bad
> UI and a bad workflow. Please don't do that!
>
> Thanks,
>
> Ingo

2015-06-04 10:21:14

by Wang Nan

[permalink] [raw]
Subject: [EXPERIENCE] My experience on using perf record BPF filter on a real usecase

Hi all,

I'd like to share my experience of using the 'perf record' BPF filter in a
real use case, to show the strengths and shortcomings of my patch series:

https://lkml.kernel.org/r/[email protected]

and other works on eBPF.

My use case shows that such a filter is useful. Also, I hope it can help us
find ways to improve it further.

My task is to find out why the iozone test result is bad in some
specific cases. The development environment is an x86_64 server; the target
machine is a smartphone running Android. From previous analysis I have
already got some useful information:

1. iozone computes bandwidth by averaging the time of each sys_write.

2. In our case, 1% of the sys_write calls take 75% of the total time, so
what I need to do now is find out why those sys_write calls take so
long.

3. By sampling the call stack on sched:sched_switch, I found that those
sys_write calls call lock_page() and block on it.
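
To put the numbers in item 2 in perspective: if 1% of the calls account for 75% of the total time, a slow call takes on average about 297 times as long as one of the other calls. A minimal sketch of that arithmetic follows; the function name is hypothetical and only the 1% and 75% figures come from the measurements above.

```c
/*
 * Ratio of the mean latency of the slow calls to the mean latency of the
 * remaining calls, given the fraction of calls that are slow and the
 * fraction of total time those slow calls consume.
 */
static double sketch_slow_to_fast_ratio(double slow_call_frac,
					double slow_time_frac)
{
	/* Mean latencies in units of (total time / total calls). */
	double mean_slow = slow_time_frac / slow_call_frac;
	double mean_fast = (1.0 - slow_time_frac) / (1.0 - slow_call_frac);

	return mean_slow / mean_fast;
}
```

Calling sketch_slow_to_fast_ratio(0.01, 0.75) evaluates (0.75 / 0.01) / (0.25 / 0.99), which is 297.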

I decided to use a BPF filter to find the other side of this lock
contention. The idea is simple:

1. For all calls of lock_page(), probe at its entry and exit points.
Measure the execution time of the lock_page() call. If it takes
too long (longer than 0.1 second) then there must be lock
contention. Take the sample at the exit point.

2. For all calls of unlock_page(), if another task has been trying to
acquire the page for at least 0.1 second, take a sample at this point.

Currently, making the above idea work is possible but not very
straightforward. One problem I can identify is:

Unlike ftrace, there is no way for an eBPF program to access call
stack information. Without extra information, eBPF programs are
unable to match lock_page events with the corresponding lock_page%return
events. Currently the only way to pass information between
programs is maps. To simulate call stack matching, I created a
BPF_FUNC_get_tid() which returns current->pid, and a
proc_locking_page_map map which records the page being acquired and the
time of calling lock_page.
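
The entry/return matching described above can be illustrated outside the kernel. The following is a user-space sketch only: a small fixed array stands in for the BPF hash map, and all names (sketch_pending, sketch_on_lock_entry, sketch_on_lock_return) are hypothetical. The entry handler stores the page and timestamp under the tid; the return handler looks the tid up, deletes the record, and yields the elapsed time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SLOTS 64

/* One in-flight lock_page() call, keyed by tid. */
struct sketch_pending {
	bool used;
	uint64_t tid;
	uint64_t page;
	uint64_t start_ns;
};

static struct sketch_pending sketch_map[SKETCH_SLOTS];

/* Entry probe: record which page this tid is waiting for, and since when. */
static bool sketch_on_lock_entry(uint64_t tid, uint64_t page, uint64_t now_ns)
{
	struct sketch_pending *slot = NULL;

	for (int i = 0; i < SKETCH_SLOTS; i++) {
		if (sketch_map[i].used && sketch_map[i].tid == tid) {
			slot = &sketch_map[i];	/* overwrite an existing record */
			break;
		}
		if (!sketch_map[i].used && !slot)
			slot = &sketch_map[i];	/* remember the first free slot */
	}
	if (!slot)
		return false;			/* map is full */

	slot->used = true;
	slot->tid = tid;
	slot->page = page;
	slot->start_ns = now_ns;
	return true;
}

/*
 * Return probe: find the record left by the entry probe for this tid,
 * delete it, and return the time spent in ns, or -1 if none was found.
 */
static int64_t sketch_on_lock_return(uint64_t tid, uint64_t now_ns)
{
	for (int i = 0; i < SKETCH_SLOTS; i++) {
		if (sketch_map[i].used && sketch_map[i].tid == tid) {
			sketch_map[i].used = false;
			return (int64_t)(now_ns - sketch_map[i].start_ns);
		}
	}
	return -1;
}
```

Keying by tid works here because a given thread can be inside at most one lock_page() call at a time, so the tid plays the role that a call-stack frame would otherwise play.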

Another problem is: at the entry of lock_page() and
unlock_page(), to fetch the page pointer I have to directly
use 'ctx->regs[0]' (I am on aarch64), which is not portable.

The final program I used is attached at the bottom of this email. It
takes more than 100 lines of code. I had to do some debugging on a
virtual machine to ensure it works correctly.

It is compiled using:

# $CLANG ${INCLUDE} -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign -O2 \
-emit-llvm -c lock_page.c -o - | $LLC -march=bpf -filetype=obj -o \
lock_page.o

Then lock_page.o is transferred onto the target system.

I loaded it using the following command:

# perf record -e syscalls:sys_enter_write -e syscalls:sys_exit_write \
-e lock_page.o -a iozone ...

Here is another inconvenience. Currently I only care about write
syscalls issued by iozone. However, without '-a' I'm unable to collect
information about the locker. If I want to use eBPF to select only the
sys_{enter,exit}_write events that belong to iozone, I need to implement
another function like BPF_FUNC_get_comm. Another method is to use perf
'--filter' after the two events. However, it looks strange to use two
filter mechanisms together. This time I chose to do the filtering offline
using perf script.

The result is reasonable. Finally I found the two sides of the lock
contention, which shows the way to improve. I'm sorry I can't share the
call stack on this list.

One inconvenience at this stage is: the information is printed into the
ftrace ring buffer while the samples are stored in perf.data. By
analysing perf.data without the ftrace ring buffer I don't know how long
the lock_page() call took, because I don't sample at the entry of
lock_page().

The final part is the BPF program I used. I think there should be a
better way to do it. If anyone knows how to make it shorter, please let me
know.

Thank you.

/* ------------- START OF BPF PROGRAM ------------- */
/* Passed from __lock_page entry to __lock_page%return; the key is the tid */
struct proc_locking_page {
unsigned long page;
unsigned long time;
};

struct bpf_map_def SEC("maps") proc_locking_page_map = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(unsigned long),
.value_size = sizeof(struct proc_locking_page),
.max_entries = 1000000,
};

/* Maps a page to the tid trying to lock it and the time it started */
struct page_being_locked_by_proc {
unsigned long tid;
unsigned long time;
};

struct bpf_map_def SEC("maps") page_being_locked_by_proc_map = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(unsigned long),
.value_size = sizeof(struct page_being_locked_by_proc),
.max_entries = 1000000,
};

SEC("lock_page=__lock_page")
int lock_page_recorder(struct pt_regs *ctx)
{
unsigned long tid = bpf_get_tid();
unsigned long page = ctx->regs[0];
unsigned long curr_ns = bpf_ktime_get_ns();

struct proc_locking_page locking_page;
struct page_being_locked_by_proc being_locked;

locking_page.page = page;
locking_page.time = curr_ns;
being_locked.tid = tid;
being_locked.time = curr_ns;

bpf_map_update_elem(&proc_locking_page_map, &tid,
&locking_page, BPF_ANY);
bpf_map_update_elem(&page_being_locked_by_proc_map, &page,
&being_locked, BPF_ANY);
return 0;
}

SEC("lock_page_ret=__lock_page%return")
int lock_page_return_recorder(struct pt_regs *ctx)
{
unsigned long tid = bpf_get_tid();
unsigned long curr_ns = bpf_ktime_get_ns();
unsigned long page;
unsigned long diff_time;
struct proc_locking_page *p_locking_page;


p_locking_page = bpf_map_lookup_elem(&proc_locking_page_map, &tid);


/* BAD!! No entry record was found for this tid */
if (!p_locking_page)
return 0;

page = p_locking_page->page;
diff_time = curr_ns - p_locking_page->time;
bpf_map_delete_elem(&proc_locking_page_map, &tid);
bpf_map_delete_elem(&page_being_locked_by_proc_map, &page);

if (diff_time > 10000000) {
char fmt[] = "tid %d get page %lx using %d ns\n";
bpf_trace_printk(fmt, sizeof(fmt), tid, page, diff_time);
return 1;
}

return 0;
}

SEC("unlock_page=unlock_page")
int unlock_page_recorder(struct pt_regs *ctx)
{
unsigned long tid = bpf_get_tid();
unsigned long page = ctx->regs[0];
unsigned long time = bpf_ktime_get_ns();
unsigned long diff_time;
struct page_being_locked_by_proc *p_being_locked;
char fmt[] = "%d vs %d, %d ns\n";

p_being_locked =
bpf_map_lookup_elem(&page_being_locked_by_proc_map, &page);
if (!p_being_locked)
return 0;
diff_time = time - p_being_locked->time;
if (diff_time > 10000000) {
bpf_trace_printk(fmt, sizeof(fmt), tid,
p_being_locked->tid, diff_time);
return 1;
}
return 0;
}
/* ------------- END OF BPF PROGRAM ------------- */

2015-06-04 12:40:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes


* Wangnan (F) <[email protected]> wrote:

> > So how do you generate the .o? Why cannot the tool, if it sees that the filter
> > parameter is eBPF source code, do that automatically?
>
> I think compiling on the fly is our goal, and Alexei is working on it.

So what exact command line are you using to create the .o?

What exactly should users type to create a simple eBPF filter profile?

Thanks,

Ingo

2015-06-04 13:01:48

by Wang Nan

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes



Sent from my iPhone

> On June 4, 2015, at 8:40 PM, Ingo Molnar <[email protected]> wrote:
>
>
> * Wangnan (F) <[email protected]> wrote:
>
>>> So how do you generate the .o? Why cannot the tool, if it sees that the filter
>>> parameter is eBPF source code, do that automatically?
>>
>> I think compiling on the fly is our goal, and Alexei is working on it.
>
> So what exact command line are you using to create the .o?
>
> What exactly should users type to create a simple eBPF filter profile?
>

I have mentioned in previous mail:

Use

# perf record -e bpf_source.c cmdline

to create an eBPF filter from source,

Use

# perf record -e bpf_object.o cmdline

to create an eBPF filter from an intermediate object file.

Use

# perf bpf compile bpf_source.c --kbuild=kernel-build-dir -o bpf_object.o

to create the .o

I think this should be enough. Currently only the second case has been implemented.
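
For illustration, the distinction between the first two cases could be made from the file name suffix of the event argument. This is a minimal sketch under that assumption, not how perf parses events; the enum and function names are hypothetical.

```c
#include <string.h>

/* Hypothetical classification of a '-e' argument. */
enum sketch_bpf_input {
	SKETCH_NOT_BPF,		/* an ordinary event name */
	SKETCH_BPF_SOURCE,	/* '*.c': would need compiling first */
	SKETCH_BPF_OBJECT,	/* '*.o': could be loaded directly */
};

static enum sketch_bpf_input sketch_classify_event_arg(const char *arg)
{
	size_t len = strlen(arg);

	/* Require at least one character before the two-character suffix. */
	if (len < 3)
		return SKETCH_NOT_BPF;
	if (strcmp(arg + len - 2, ".c") == 0)
		return SKETCH_BPF_SOURCE;
	if (strcmp(arg + len - 2, ".o") == 0)
		return SKETCH_BPF_OBJECT;
	return SKETCH_NOT_BPF;
}
```

Under this sketch "bpf_source.c" is treated as source, "bpf_object.o" as an object, and a name such as "syscalls:sys_enter_write" as a regular event.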

Thanks.

> Thanks,
>
> Ingo

2015-06-04 14:04:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes


* pi3orama <[email protected]> wrote:

>
>
> Sent from my iPhone
>
> > On June 4, 2015, at 8:40 PM, Ingo Molnar <[email protected]> wrote:
> >
> >
> > * Wangnan (F) <[email protected]> wrote:
> >
> >>> So how do you generate the .o? Why cannot the tool, if it sees that the filter
> >>> parameter is eBPF source code, do that automatically?
> >>
> >> I think compiling on the fly is our goal, and Alexei is working on it.
> >
> > So what exact command line are you using to create the .o?
> >
> > What exactly should users type to create a simple eBPF filter profile?
>
> I have mentioned in previous mail:
>
> Use
>
> # perf record -e bpf_source.c cmdline
>
> to create a eBPF filter from source,
>
> Use
>
> # perf record -e bpf_object.o cmdline
>
> to create a eBPF filter from object intermedia.
>
> Use
>
> # perf bpf compile bpf_source.c --kbuild=kernel-build-dir -o bpf_object.o
>
> to create the .o
>
> I think this should be enough. Currently only the second case has been implemented.

So if users cannot actually generate .o files then it's premature to merge this in
such an incomplete form!

It should be possible to use a feature that we are merging.

Thanks,

Ingo

2015-06-04 16:22:30

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes

On 6/4/15 7:04 AM, Ingo Molnar wrote:
>> > # perf record -e bpf_source.c cmdline
>> >
>> > to create a eBPF filter from source,
>> >
>> >Use
>> >
>> ># perf record -e bpf_object.o cmdline
>> >
>> >to create a eBPF filter from object intermedia.
>> >
>> >Use
>> >
>> ># perf bpf compile bpf_source.c --kbuild=kernel-build-dir -o bpf_object.o
>> >
>> >to create the .o
>> >
>> >I think this should be enough. Currently only the second case has been implemented.
> So if users cannot actually generate .o files then it's premature to merge this in
> such an incomplete form!
>
> It should be possible to use a feature that we are merging.

of course it's usable :) There is some confusion here.
To compile .c into .o one can easily use
clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o
any version of clang is ok,
llc needs to be fresh with bpf backend.

For a lot of cases kernel headers are not needed, so above
will work fine.
For our TC examples we recommend to use 'bcc' alias:
bcc() {
clang -O2 -emit-llvm -c $1 -o - | llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
}
then compiling is as easy as 'bcc file.c'

What Wang mentioned that we're working on is fully integrated 'bcc'.
It will use clang/llvm as libraries, so no intermediate steps will
be needed, but some folks will always have concerns about
ultra-embedded environments where even 20Mb of libllvm.so is too much.

So I think we need to support both 'perf record -e file.[co]'

Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes

On 2015/06/05 1:22, Alexei Starovoitov wrote:
> On 6/4/15 7:04 AM, Ingo Molnar wrote:
>>>> # perf record -e bpf_source.c cmdline
>>>>
>>>> to create a eBPF filter from source,
>>>>
>>>> Use
>>>>
>>>> # perf record -e bpf_object.o cmdline
>>>>
>>>> to create a eBPF filter from object intermedia.
>>>>
>>>> Use
>>>>
>>>> # perf bpf compile bpf_source.c --kbuild=kernel-build-dir -o bpf_object.o
>>>>
>>>> to create the .o
>>>>
>>>> I think this should be enough. Currently only the second case has been implemented.
>> So if users cannot actually generate .o files then it's premature to merge this in
>> such an incomplete form!
>>
>> It should be possible to use a feature that we are merging.
>
> of course it's usable :) There is some confusion here.
> To compile .c into .o one can easily use
> clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o
> any version of clang is ok,
> llc needs to be fresh with bpf backend.
>
> For a lot of cases kernel headers are not needed, so above
> will work fine.
> For our TC examples we recommend to use 'bcc' alias:
> bcc() {
> clang -O2 -emit-llvm -c $1 -o - | llc -march=bpf -filetype=obj -o
> "`basename $1 .c`.o"
> }
> then compiling as easy as 'bcc file.c'
>
> What Wang mentioned that we're working on is fully integrated 'bcc'.
> It will use clang/llvm as libraries, so no intermediate steps will
> be needed, but some folks will always have concerns about
> ultra-embedded environments where even 20Mb of libllvm.so is too much.
>
> So I think we need to support both 'perf record -e file.[co]'

I think we'd better make 'perf record -e file.c' default and '-e file.o'
should be an option.

Thank you,

>
>


--
Masami HIRAMATSU
Linux Technology Research Center, System Productivity Research Dept.
Center for Technology Innovation - Systems Engineering
Hitachi, Ltd., Research & Development Group
E-mail: [email protected]

2015-06-04 22:08:11

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes

On 6/4/15 2:48 PM, Masami Hiramatsu wrote:
> On 2015/06/05 1:22, Alexei Starovoitov wrote:
>> On 6/4/15 7:04 AM, Ingo Molnar wrote:
>>>>> # perf record -e bpf_source.c cmdline
>>>>>
>>>>> to create a eBPF filter from source,
>>>>>
>>>>> Use
>>>>>
>>>>> # perf record -e bpf_object.o cmdline
>>>>>
>>>>> to create a eBPF filter from object intermedia.
>>>>>
>>>>> Use
>>>>>
>>>>> # perf bpf compile bpf_source.c --kbuild=kernel-build-dir -o bpf_object.o
>>>>>
>>>>> to create the .o
>>>>>
>>>>> I think this should be enough. Currently only the second case has been implemented.
>>> So if users cannot actually generate .o files then it's premature to merge this in
>>> such an incomplete form!
>>>
>>> It should be possible to use a feature that we are merging.
>>
>> of course it's usable :) There is some confusion here.
>> To compile .c into .o one can easily use
>> clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o
>> any version of clang is ok,
>> llc needs to be fresh with bpf backend.
>>
>> For a lot of cases kernel headers are not needed, so above
>> will work fine.
>> For our TC examples we recommend to use 'bcc' alias:
>> bcc() {
>> clang -O2 -emit-llvm -c $1 -o - | llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
>> }
>> then compiling as easy as 'bcc file.c'
>>
>> What Wang mentioned that we're working on is fully integrated 'bcc'.
>> It will use clang/llvm as libraries, so no intermediate steps will
>> be needed, but some folks will always have concerns about
>> ultra-embedded environments where even 20Mb of libllvm.so is too much.
>>
>> So I think we need to support both 'perf record -e file.[co]'
>
> I think we'd better make 'perf record -e file.c' default and '-e file.o'
> should be an option.

What do you mean by 'default'? It's a command line :)
.c is easier to use of course, no question.

2015-06-05 06:41:25

by Ingo Molnar

Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes


* Alexei Starovoitov <[email protected]> wrote:

> On 6/4/15 7:04 AM, Ingo Molnar wrote:
> >>> # perf record -e bpf_source.c cmdline
> >>>
> >>> to create a eBPF filter from source,
> >>>
> >>>Use
> >>>
> >>># perf record -e bpf_object.o cmdline
> >>>
> >>>to create a eBPF filter from object intermedia.
> >>>
> >>>Use
> >>>
> >>># perf bpf compile bpf_source.c --kbuild=kernel-build-dir -o bpf_object.o
> >>>
> >>>to create the .o
> >>>
> >>>I think this should be enough. Currently only the second case has been implemented.
> >
> > So if users cannot actually generate .o files then it's premature to merge
> > this in such an incomplete form!
> >
> > It should be possible to use a feature that we are merging.
>
> of course it's usable :) There is some confusion here.
> To compile .c into .o one can easily use
> clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o

There's no confusion here: you guys are trying to sell me what at this stage is
incomplete and hard to use, and I'm resisting it as I should! :-)

We also have different definitions of 'easily'. It might be 'easy' to type:

clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o

... for some tooling developer intimate with eBPF, but to the first time user who
found an interesting looking eBPF scriptlet on the net or in the documentation and
wants to try his luck? It's absolutely non-obvious!

The current usage to get a _minimal_ eBPF script running is non-obvious and
obscure to the level of being a show stopper.

I don't understand why you guys are even wasting time arguing about it: it's not
that hard to auto-build from source code. It's one of the basic features of
tooling. If you ever built perf you'll know that typing 'make install' will type
in all those quirky build lines automatically for you, without requiring you to
perform any other step, no matter how trivial.

Doubly annoying, you seem to have the UI principles wrong, you seem to think that
a .o is a proper user interface. It absolutely is _not_ okay.

The Linux kernel project and as an extension the perf project deals with source
code, and I'm 100% suspicious of approaches that somehow think that .o objects are
the right UI for _anything_ except temporary files that sometimes show up in
object directories...

Fix the 'newbie user' UI flow as a _first_ priority, not as a second thought!

Every single quirky line or nonsensical option you require a first time user to
type halves the number of new users we'll get. You need to understand why dtrace
is so popular:

- it's bloody easy to use

- it's a safe environment you can deploy in critical environments

- it's flexible

- instrumentation hacks are very easy to share

eBPF based scripting got 3 out of those 4 right, but please don't forget item 1
either, because without that we have nothing but a bunch of unusable functionality
in the kernel and in tooling that benefits only very few people. Okay?

> So I think we need to support both 'perf record -e file.[co]'

Why do you even need to ask? Of course!

Think through how users will meet eBPF scripts and how they will interact with
them:

- they'll see or download an eBPF scriptlet somewhere and will have a .c file.

- ideally there will be built-in eBPF scriptlets just like we have tracing
plugins, and there's a good UI to query them and see their description and
source code.

- then they will want to use it all with the minimum amount of fuss

- they don't care how the eBPF scriptlet gets to the kernel: whether the kernel
can read and build the .c files, or whether there's some user tooling that
turns it into bytecode. Most humans don't read bytecode!

- they will absolutely not download random .o's and we should not encourage that
in any case - these things should be source code based.

These things compile in an eye blink, there's very little reason to ever deal with
a .o, except some weird and rare usecases...

In fact I'm NAK-ing the whole .o based interface until the .c interface is made
the _primary_ one and works well and until I see that you have thought through
basic usability questions...

Thanks,

Ingo

2015-06-05 08:54:35

by Wang Nan

Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes



On 2015/6/5 14:41, Ingo Molnar wrote:
> * Alexei Starovoitov <[email protected]> wrote:
>
>> On 6/4/15 7:04 AM, Ingo Molnar wrote:
>>>>> # perf record -e bpf_source.c cmdline
>>>>>
>>>>> to create a eBPF filter from source,
>>>>>
>>>>> Use
>>>>>
>>>>> # perf record -e bpf_object.o cmdline
>>>>>
>>>>> to create a eBPF filter from object intermedia.
>>>>>
>>>>> Use
>>>>>
>>>>> # perf bpf compile bpf_source.c --kbuild=kernel-build-dir -o bpf_object.o
>>>>>
>>>>> to create the .o
>>>>>
>>>>> I think this should be enough. Currently only the second case has been implemented.
>>> So if users cannot actually generate .o files then it's premature to merge
>>> this in such an incomplete form!
>>>
>>> It should be possible to use a feature that we are merging.
>> of course it's usable :) There is some confusion here.
>> To compile .c into .o one can easily use
>> clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o
> There's no confusion here: you guys are trying to sell me what at this stage is
> incomplete and hard to use, and I'm resisting it as I should! :-)
>
> We also have different definitions of 'easily'. It might be 'easy' to type:
>
> clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o
>
> ... for some tooling developer intimate with eBPF, but to the first time user who
> found an interesting looking eBPF scriptlet on the net or in the documentation and
> wants to try his luck? It's absolutely non-obvious!
>
> The current usage to get a _minimal_ eBPF script running is non-obvious and
> obscure to the level of being a show stopper.
>
> I don't understand why you guys are even wasting time arguing about it: it's not
> that hard to auto-build from source code. It's one of the basic features of
> tooling. If you ever built perf you'll know that typing 'make install' will type
> in all those quirky build lines automatically for you, without requiring you to
> perform any other step, no matter how trivial.
>
> Doubly annoying, you seem to have the UI principles wrong, you seem to think that
> a .o is a proper user interface. It absolutely is _not_ okay.
>
> The Linux kernel project and as an extension the perf project deals with source
> code, and I'm 100% suspicious of approaches that somehow think that .o objects are
> the right UI for _anything_ except temporary files that sometimes show up in
> object directories...
>
> Fix the 'newbie user' UI flow as a _first_ priority, not as a second thought!
>
> Every single quirky line or nonsensical option you require a first time user to
> type halves the number of new users we'll get. You need to understand why dtrace
> is so popular:
>
> - it's bloody easy to use
>
> - it's a safe environment you can deploy in critical environments
>
> - it's flexible
>
> - instrumentation hacks are very easy to share
>
> eBPF based scripting got 3 out of those 4 right, but please don't forget item 1
> either, because without that we have nothing but a bunch of unusable functionality
> in the kernel and in tooling that benefits only very few people. Okay?
>
>> So I think we need to support both 'perf record -e file.[co]'
> Why do you even need to ask? Of course!
>
> Think through how users will meet eBPF scripts and how they will interact with
> them:
>
> - they'll see or download an eBPF scriptlet somewhere and will have a .c file.
>
> - ideally there will be built-in eBPF scriptlets just like we have tracing
> plugins, and there's a good UI to query them and see their description and
> source code.
>
> - then they will want to use it all with the minimum amount of fuss
>
> - they don't care how the eBPF scriptlet gets to the kernel: whether the kernel
> can read and build the .c files, or whether there's some user tooling that
> turns it into bytecode. Most humans don't read bytecode!
>
> - they will absolutely not download random .o's and we should not encourage that
> in any case - these things should be source code based.
>
> These things compile in an eye blink, there's very little reason to ever deal with
> a .o, except some weird and rare usecases...
>
> In fact I'm NAK-ing the whole .o based interface until the .c interface is made
> the _primary_ one and works well and until I see that you have thought through
> basic usability questions...

OK. Let's start making a nice UI.

At this stage, what about wrapping the current clang and llc workflow into perf,
and letting it call them to compile '.c' scripts? This is the way 'perf annotate'
uses objdump. I can do this job, but first I'd like to know people's opinion on
it, and the value of the wrapper if Alexei Starovoitov's dynamic compiler
shared object is coming.

The following functions will be added:

- perf searches for clang and llc in the current $PATH,

- Users are allowed to pass the location of those programs
if perf fails to find them automatically,

- Users are allowed to pass extra compiling options to clang and llc, like
include directories,

- 'perf record' automatically calls them if a '.c' is passed using
'--event'.

- A 'perf bpf compile' command will be added to compile a '.c' file into
a '.o'.
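
The tool lookup described in the first two items could look roughly like the shell sketch below. It is only an illustration of the idea, not perf code: find_tool and its optional override argument are hypothetical names, and the real implementation would live inside perf itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): locate a tool in $PATH, or use an
# explicit location supplied by the user.
# Usage: find_tool NAME [OVERRIDE_PATH]
find_tool() {
    name=$1
    override=$2

    # An explicit location wins, but it must point to an executable.
    if [ -n "$override" ]; then
        if [ -x "$override" ]; then
            printf '%s\n' "$override"
            return 0
        fi
        echo "Error: '$override' is not an executable $name." >&2
        return 1
    fi

    # Otherwise fall back to a $PATH search.
    if found=$(command -v "$name" 2>/dev/null); then
        printf '%s\n' "$found"
        return 0
    fi

    echo "Error: $name not found in \$PATH." >&2
    echo "Hint: install $name or pass its location explicitly." >&2
    return 1
}
```

With this, 'find_tool clang' prints the clang found in $PATH, while 'find_tool clang /opt/llvm/bin/clang' (an example path) uses the given location instead.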

Further, basic header files should be shipped with kernel headers.

User interface update:

- --llc, --clang, --llc-opt and --clang-opt options will be added to 'perf record'
to indicate their location and extra options for them. Ideally they can be left
blank.

- 'perf bpf compile' will be added.

One problem I can see is that the wrapper will make perf depend on llvm. I don't
think the compiler will be deployed in production environments... And also, the
embedded case...

Any suggestions? Do you think the above idea is on the right track?

Thanks.

> Thanks,
>
> Ingo

2015-06-05 12:06:14

by Ingo Molnar

Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes


* Wangnan (F) <[email protected]> wrote:

> OK. Let's start making a nice UI.

Thanks!

> [...]
>
> One problem I can find is that, the wrapper will make perf depend on llvm. I
> don't think the compiler will be deployed in production environments... And
> also, the embedded case...

What dependencies are there?

On the usage side there should be very few outright dependencies: if the llvm
binary is not available, or doesn't support what you need, or there's no runtime
environment you can use to build the bytecode, you should display an informative
error message so that the user knows what is missing and how to install it.

Thanks,

Ingo

2015-06-05 14:27:56

by Arnaldo Carvalho de Melo

Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes

Em Fri, Jun 05, 2015 at 04:53:03PM +0800, Wangnan (F) escreveu:
> On 2015/6/5 14:41, Ingo Molnar wrote:
> >* Alexei Starovoitov <[email protected]> wrote:
> >>On 6/4/15 7:04 AM, Ingo Molnar wrote:
> >In fact I'm NAK-ing the whole .o based interface until the .c interface is made
> >the _primary_ one and works well and until I see that you have thought through
> >basic usability questions...

> OK. Let's start making a nice UI.

> At this stage, what about wrapping current clang and llc workflow into perf,
> let it call them to compile '.c' scripts? This is the way 'perf annotate' using
> objdump. I can do this job, but firstly I'd like to know people's opinion on

Right, no need, as a first step, to save this into a cache or use
libraries, etc.; just automate compiling bpf.c into bpf.o, load it and use it
as an event.

> it, and the value of the wrapper if Alexei Starovoitov's dynamic compiler
> shared object is coming.

No need to wait for that; when it comes we can use it, but what Ingo
established as the bar for this to be accepted is that we could use:

perf record -e foo.c usleep

Right?

> Following functions will be added:
>
> - perf searches clang and llc under current $PATH,

Good for a first step

> - Users are allowed to pass the position of those programs,
> if perf failed to find the automatically,

I would leave all this configurability for later; stating that it has to
be in the PATH should be ok for a first step.

> - Users are allowed to pass extra compiling options to clang and llc, like
> include directories,

For later too?

> - 'perf record' automatically calls them if a '.c' is passed using
> '--event'.

Right.

> - 'perf bpf compile' command will be added to compile a '.c' find into
> '.o',

for later? I.e. 'perf record -e foo.c usleep' could start by generating
the foo.o and not deleting it, so:

perf record -e foo.c usleep

Followed by:

perf record -e foo.o usleep

Would work; the latter would be a quick hack so that we could have
access to a pre-compiled foo.o quickly, at this introductory stage.
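
As a rough illustration of that "build foo.o and keep it" flow, here is a minimal shell sketch. It assumes the clang | llc pipeline quoted earlier in this thread; the function names (bpf_obj_for, bpf_needs_build, bpf_build, bpf_prepare) are made up for the example, and the '-nt' test is a widely supported extension rather than strict POSIX.

```shell
#!/bin/sh
# Hypothetical helpers (not perf code) sketching "compile once, keep the .o".

# Map a BPF source path to the object kept next to it: dir/foo.c -> dir/foo.o
bpf_obj_for() {
    printf '%s\n' "${1%.c}.o"
}

# Succeeds when the object is missing or older than its source.
bpf_needs_build() {
    src=$1
    obj=$(bpf_obj_for "$src")
    [ ! -e "$obj" ] || [ "$src" -nt "$obj" ]
}

# Build the object with the clang | llc pipeline quoted in this thread.
bpf_build() {
    src=$1
    obj=$(bpf_obj_for "$src")
    clang -O2 -emit-llvm -c "$src" -o - | llc -march=bpf -filetype=obj -o "$obj"
}

# Print the object name for a source, rebuilding it only when it is stale.
bpf_prepare() {
    if bpf_needs_build "$1"; then
        bpf_build "$1" || return 1
    fi
    bpf_obj_for "$1"
}
```

Something like 'perf record -e "$(bpf_prepare foo.c)" usleep' would then reuse an up-to-date foo.o instead of recompiling it.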

> Further, basic header files should be shipped with kernel headers.
>
> User interface update:
>
> - --llc, --clang, --llc-opt and --clang-opt option will be added to 'perf record'
> to indicate the position and extra options to them. Ideally they can be leave
> blank.

Couldn't this be left to a section in a .perfconfig file to avoid having
so many command line options? We could have one of those files per
"project", after we add a --config option to 'perf record', to override
whatever it finds in the config file search it already does, i.e.:

[clang]

path = /a/b/clang
opt = -a -b -c -d

[llc]

path = /d/e/llc
opt = -r -t -y -u

But even this can be left for a second step.
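
To make the proposal concrete, a minimal shell sketch of reading such a section follows. The [clang] and [llc] sections are only the suggestion above, not something perf supports today, and config_get is a hypothetical helper that handles just the simple "key = value" form shown here.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY in [SECTION] of an ini-style FILE.
# Usage: config_get FILE SECTION KEY
# Exits non-zero when the key is not present in that section.
config_get() {
    awk -v section="$2" -v key="$3" '
        {
            # Work on a copy of the line with surrounding blanks removed.
            line = $0
            gsub(/^[ \t]+|[ \t]+$/, "", line)
        }
        # A "[name]" line switches the current section.
        line ~ /^\[.*\]$/ {
            current = substr(line, 2, length(line) - 2)
            next
        }
        current == section {
            eq = index(line, "=")
            if (eq == 0)
                next
            k = substr(line, 1, eq - 1)
            v = substr(line, eq + 1)
            gsub(/[ \t]+$/, "", k)
            gsub(/^[ \t]+/, "", v)
            if (k == key) {
                print v
                found = 1
                exit
            }
        }
        END { exit !found }
    ' "$1"
}
```

For the example file above, 'config_get .perfconfig clang path' would print /a/b/clang, and the wrapper could fall back to the $PATH search when the key is absent.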

> - 'perf bpf compile' will be added.

> One problem I can find is that, the wrapper will make perf depend on
> llvm. I don't think the compiler will be deployed in production
> environments... And also, the embedded case...

Well, we can always build a subset of perf, using the command line
options to disable certain features.

- Arnaldo

2015-06-05 14:07:41

by Arnaldo Carvalho de Melo

Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes

Em Fri, Jun 05, 2015 at 02:05:50PM +0200, Ingo Molnar escreveu:
> * Wangnan (F) <[email protected]> wrote:
<SNIP>
> > One problem I can find is that, the wrapper will make perf depend on llvm. I
> > don't think the compiler will be deployed in production environments... And
> > also, the embedded case...

> What dependencies are there?

> On the usage side there should be very few outright dependencies: if the llvm
> binary is not available, or doesn't support what you need, or there's no runtime
> environment you can use to build the bytecode, you should display an informative
> error message so that the user knows what is missing and how to install it.

Right, something like:

[acme@zoo ~]$ perf trace -e nanosleep usleep 1
Error: No permissions to read /sys/kernel/debug/tracing/events/raw_syscalls/sys_(enter|exit)
Hint: Try 'sudo mount -o remount,mode=755 /sys/kernel/debug'
[acme@zoo ~]$ sudo mount -o remount,mode=755 /sys/kernel/debug
[sudo] password for acme:
[acme@zoo ~]$ perf trace -e nanosleep usleep 1
0.565 ( 0.060 ms): usleep/17648 nanosleep(rqtp: 0x7fff22baebf0) = 0
[acme@zoo ~]$ perf trace --all-cpus
Error: Operation not permitted.
Hint: Check /proc/sys/kernel/perf_event_paranoid setting.
Hint: For system wide tracing it needs to be set to -1.
Hint: Try: 'sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"'
Hint: The current value is 1.
[acme@zoo ~]$
[acme@zoo ~]$ trace -a -e poll usleep 1
[acme@zoo ~]$ trace -a -e poll usleep 1
0.041 ( 0.000 ms): firefox/1458 ... [continued]: poll()) = 1
0.267 ( 0.003 ms): firefox/1458 poll(ufds: 0x7f43d6ea1340, nfds: 5) = 0 Timeout
0.275 ( 0.001 ms): firefox/1458 poll(ufds: 0x7f43d6ea1340, nfds: 5) = 0 Timeout
0.283 ( 0.001 ms): firefox/1458 poll(ufds: 0x7f43d6ea1340, nfds: 5) = 0 Timeout
0.979 ( 0.000 ms): gnome-terminal/2572 ... [continued]: poll()) = 1
1.056 ( 0.768 ms): firefox/1458 poll(ufds: 0x7f43d6ea1340, nfds: 5, timeout_msecs: 4294967295) ...
1.065 ( 0.009 ms): gnome-terminal/2572 poll(ufds: 0x1934250, nfds: 23, timeout_msecs: 10) = 1
1.087 ( 0.007 ms): gnome-terminal/2572 poll(ufds: 0x1934250, nfds: 23, timeout_msecs: 10) = 2
1.132 ( 0.007 ms): gnome-terminal/2572 poll(ufds: 0x1934250, nfds: 23, timeout_msecs: 10) = 1
1.161 ( 0.013 ms): gnome-terminal/2572 poll(ufds: 0x1934250, nfds: 23, timeout_msecs: 10) = 1
[acme@zoo ~]$

I.e. Explain the mistake and provide a hint to solve it, as close to the actual
commands needed to perform such corrective/enabling action as possible.
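
For the llvm case, the same style of check could be sketched in shell as below. This is only an assumption about how such a check might look: check_llc and has_bpf_target are hypothetical names, the messages are made up, and the match relies on 'llc --version' listing its registered targets one per line.

```shell
#!/bin/sh
# Hypothetical helpers (not perf code) in the Error:/Hint: style shown above.

# Reads 'llc --version' output on stdin; succeeds if a bpf target is listed,
# e.g. a line such as "    bpf    - BPF".
has_bpf_target() {
    grep -Eq '^[ 	]*bpf[a-z]*[ 	]+-'
}

# Check that llc exists and was built with the BPF backend.
check_llc() {
    if ! command -v llc >/dev/null 2>&1; then
        echo "Error: llc not found in \$PATH." >&2
        echo "Hint: install LLVM built with the BPF backend." >&2
        return 1
    fi
    if ! llc --version | has_bpf_target; then
        echo "Error: llc does not list a bpf target." >&2
        echo "Hint: check 'llc --version'; an llc with the BPF backend is needed." >&2
        return 1
    fi
}
```

Running check_llc before the compile step would let 'perf record -e foo.c' stop early with a message that names what is missing.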

- Arnaldo

2015-06-07 13:11:17

by Ingo Molnar

Subject: Re: [GIT PULL 0/6] perf/core improvements and fixes


* Arnaldo Carvalho de Melo <[email protected]> wrote:

> Em Fri, Jun 05, 2015 at 02:05:50PM +0200, Ingo Molnar escreveu:
> > * Wangnan (F) <[email protected]> wrote:
> <SNIP>
> > > One problem I can find is that, the wrapper will make perf depend on llvm. I
> > > don't think the compiler will be deployed in production environments... And
> > > also, the embedded case...
>
> > What dependencies are there?
>
> > On the usage side there should be very few outright dependencies: if the llvm
> > binary is not available, or doesn't support what you need, or there's no runtime
> > environment you can use to build the bytecode, you should display an informative
> > error message so that the user knows what is missing and how to install it.
>
> Right, something like:
>
> [acme@zoo ~]$ perf trace -e nanosleep usleep 1
> Error: No permissions to read /sys/kernel/debug/tracing/events/raw_syscalls/sys_(enter|exit)
> Hint: Try 'sudo mount -o remount,mode=755 /sys/kernel/debug'

> [acme@zoo ~]$ sudo mount -o remount,mode=755 /sys/kernel/debug
> [sudo] password for acme:
> [acme@zoo ~]$ perf trace -e nanosleep usleep 1
> 0.565 ( 0.060 ms): usleep/17648 nanosleep(rqtp: 0x7fff22baebf0) = 0
> [acme@zoo ~]$ perf trace --all-cpus
> Error: Operation not permitted.
> Hint: Check /proc/sys/kernel/perf_event_paranoid setting.
> Hint: For system wide tracing it needs to be set to -1.
> Hint: Try: 'sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"'
> Hint: The current value is 1.
> [acme@zoo ~]$
> [acme@zoo ~]$ trace -a -e poll usleep 1
> [acme@zoo ~]$ trace -a -e poll usleep 1
> 0.041 ( 0.000 ms): firefox/1458 ... [continued]: poll()) = 1
> 0.267 ( 0.003 ms): firefox/1458 poll(ufds: 0x7f43d6ea1340, nfds: 5) = 0 Timeout
> 0.275 ( 0.001 ms): firefox/1458 poll(ufds: 0x7f43d6ea1340, nfds: 5) = 0 Timeout
> 0.283 ( 0.001 ms): firefox/1458 poll(ufds: 0x7f43d6ea1340, nfds: 5) = 0 Timeout
> 0.979 ( 0.000 ms): gnome-terminal/2572 ... [continued]: poll()) = 1
> 1.056 ( 0.768 ms): firefox/1458 poll(ufds: 0x7f43d6ea1340, nfds: 5, timeout_msecs: 4294967295) ...
> 1.065 ( 0.009 ms): gnome-terminal/2572 poll(ufds: 0x1934250, nfds: 23, timeout_msecs: 10) = 1
> 1.087 ( 0.007 ms): gnome-terminal/2572 poll(ufds: 0x1934250, nfds: 23, timeout_msecs: 10) = 2
> 1.132 ( 0.007 ms): gnome-terminal/2572 poll(ufds: 0x1934250, nfds: 23, timeout_msecs: 10) = 1
> 1.161 ( 0.013 ms): gnome-terminal/2572 poll(ufds: 0x1934250, nfds: 23, timeout_msecs: 10) = 1
> [acme@zoo ~]$
>
> I.e. Explain the mistake and provide a hint to solve it, as close to the actual
> commands needed to perform such corrective/enabling action as possible.

Yeah, I absolutely love such tooling hints.

Thanks,

Ingo

2015-06-10 06:42:19

by Alexei Starovoitov

Subject: Re: [EXPERIENCE] My experience on using perf record BPF filter on a real usecase

On 6/4/15 3:17 AM, Wangnan (F) wrote:
> Hi all,
>
> I'd like to share my experience on using 'perf record' BPF filter in a
> real usecase to show the power and shortcomings in my patch series:

thanks for sharing!

> Here is another inconvenience. Currently I only concern on write
> syscall issued by iozone. However, without '-a' I'm unable to collect
> information of the locker. If I want to filter sys_{enter,exit}_write
> belong to iozone out using eBPF, I need to implement another function
> like BPF_FUNC_git_comm. Another method is to use perf '--filter' after
> the two events. However it looks strange to use two filter mechanisms
> together. This time I choose to do filtering offline using perf script.

that doesn't sound clean.
btw, I've been playing for a while with
bpf_get_current_task_info() helper:
https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/commit/?id=c5453ffa107ddf95a91920cc947bb8bf9eab16d6
I think it's a better mechanism.
The user can get pid only via:
u32 pid = 0;
bpf_get_current_task_info(&pid, sizeof(pid));
or full pid + comm + future fields via full 'struct bpf_task_info'
Thoughts?

2015-06-10 06:49:03

by Wang Nan

Subject: Re: [EXPERIENCE] My experience on using perf record BPF filter on a real usecase



On 2015/6/10 14:42, Alexei Starovoitov wrote:
> On 6/4/15 3:17 AM, Wangnan (F) wrote:
>> Hi all,
>>
>> I'd like to share my experience on using 'perf record' BPF filter in a
>> real usecase to show the power and shortcomings in my patch series:
>
> thanks for sharing!
>
>> Here is another inconvenience. Currently I only concern on write
>> syscall issued by iozone. However, without '-a' I'm unable to collect
>> information of the locker. If I want to filter sys_{enter,exit}_write
>> belong to iozone out using eBPF, I need to implement another function
>> like BPF_FUNC_git_comm. Another method is to use perf '--filter' after
>> the two events. However it looks strange to use two filter mechanisms
>> together. This time I choose to do filtering offline using perf script.
>
> that doesn't sound clean.
> btw, I've been playing for a while with
> bpf_get_current_task_info() helper:
> https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/commit/?id=c5453ffa107ddf95a91920cc947bb8bf9eab16d6
>
> I think it's a better mechanism.
> The user can get pid only via:
> u32 pid = 0;
> bpf_get_current_task_info(&pid, sizeof(pid));
> or full pid + comm + future fields via full 'struct bpf_task_info'
> Thoughts?
>

Looks good. Thank you for your information!