LinuxLists.cc - [PATCH v2 0/7] perf trace pagefaults

2014-06-20 10:51:21

by Stanislav Fomichev

[permalink] [raw]

Subject: [PATCH v2 0/7] perf trace pagefaults

This patch series adds support for pagefaults tracing to 'perf trace' command.
It seems this feature was planned by Namhyung Kim (http://events.linuxfoundation.org/images/stories/pdf/klf2012_n_kim.pdf page 17/28)
but I couldn't find any prior patches/discussion and started from scratch.

First three patches add the feature and options to enable faults and disable
syscalls.
Two last patches add events caching (like it's done in the perf kvm), so that
we don't get fault events prior to mmap/comm events (makes sense only
for live mode).

This is just a proof-of-concept, and I'd like to get some comments about
where and what I got wrong and what additional useful information I can
expose in the trace.

v2:
- added more info to the changelogs
- reworked options (-f -> -F, --pgfaults -> --pf=[all|min|maj])
- separated tracepoint_handler changes into additional patch
- separated record/replay into additional patch
- other fixes pointed out by Arnaldo Carvalho de Melo

Stanislav Fomichev (7):
perf trace: add perf_event parameter to tracepoint_handler
perf trace: add support for pagefault tracing
perf trace: add pagefaults record and replay support
perf trace: add pagefault statistics
perf trace: add possibility to switch off syscall events
perf kvm: move perf_kvm__mmap_read into session utils
perf trace: add events cache

tools/perf/Documentation/perf-trace.txt | 19 ++
tools/perf/builtin-kvm.c | 88 +-------
tools/perf/builtin-trace.c | 350 ++++++++++++++++++++++----------
tools/perf/util/session.c | 85 ++++++++
tools/perf/util/session.h | 5 +
5 files changed, 357 insertions(+), 190 deletions(-)

--
1.8.3.2

2014-06-20 10:50:24

by Stanislav Fomichev

[permalink] [raw]

Subject: [PATCH 6/7] perf kvm: move perf_kvm__mmap_read into session utils

It will be reused by perf trace in the following commit.

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/builtin-kvm.c | 88 +++--------------------------------------------
tools/perf/util/session.c | 85 +++++++++++++++++++++++++++++++++++++++++++++
tools/perf/util/session.h | 5 +++
3 files changed, 94 insertions(+), 84 deletions(-)

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index 0f1e5a2f6ad7..a69ffe7512e5 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -885,89 +885,6 @@ static bool verify_vcpu(int vcpu)
*/
#define PERF_KVM__MAX_EVENTS_PER_MMAP 25

-static s64 perf_kvm__mmap_read_idx(struct perf_kvm_stat *kvm, int idx,
- u64 *mmap_time)
-{
- union perf_event *event;
- struct perf_sample sample;
- s64 n = 0;
- int err;
-
- *mmap_time = ULLONG_MAX;
- while ((event = perf_evlist__mmap_read(kvm->evlist, idx)) != NULL) {
- err = perf_evlist__parse_sample(kvm->evlist, event, &sample);
- if (err) {
- perf_evlist__mmap_consume(kvm->evlist, idx);
- pr_err("Failed to parse sample\n");
- return -1;
- }
-
- err = perf_session_queue_event(kvm->session, event, &sample, 0);
- /*
- * FIXME: Here we can't consume the event, as perf_session_queue_event will
- * point to it, and it'll get possibly overwritten by the kernel.
- */
- perf_evlist__mmap_consume(kvm->evlist, idx);
-
- if (err) {
- pr_err("Failed to enqueue sample: %d\n", err);
- return -1;
- }
-
- /* save time stamp of our first sample for this mmap */
- if (n == 0)
- *mmap_time = sample.time;
-
- /* limit events per mmap handled all at once */
- n++;
- if (n == PERF_KVM__MAX_EVENTS_PER_MMAP)
- break;
- }
-
- return n;
-}
-
-static int perf_kvm__mmap_read(struct perf_kvm_stat *kvm)
-{
- int i, err, throttled = 0;
- s64 n, ntotal = 0;
- u64 flush_time = ULLONG_MAX, mmap_time;
-
- for (i = 0; i < kvm->evlist->nr_mmaps; i++) {
- n = perf_kvm__mmap_read_idx(kvm, i, &mmap_time);
- if (n < 0)
- return -1;
-
- /* flush time is going to be the minimum of all the individual
- * mmap times. Essentially, we flush all the samples queued up
- * from the last pass under our minimal start time -- that leaves
- * a very small race for samples to come in with a lower timestamp.
- * The ioctl to return the perf_clock timestamp should close the
- * race entirely.
- */
- if (mmap_time < flush_time)
- flush_time = mmap_time;
-
- ntotal += n;
- if (n == PERF_KVM__MAX_EVENTS_PER_MMAP)
- throttled = 1;
- }
-
- /* flush queue after each round in which we processed events */
- if (ntotal) {
- kvm->session->ordered_samples.next_flush = flush_time;
- err = kvm->tool.finished_round(&kvm->tool, NULL, kvm->session);
- if (err) {
- if (kvm->lost_events)
- pr_info("\nLost events: %" PRIu64 "\n\n",
- kvm->lost_events);
- return err;
- }
- }
-
- return throttled;
-}
-
static volatile int done;

static void sig_handler(int sig __maybe_unused)
@@ -1133,7 +1050,10 @@ static int kvm_events_live_report(struct perf_kvm_stat *kvm)
while (!done) {
int rc;

- rc = perf_kvm__mmap_read(kvm);
+ rc = perf_session__mmap_read(&kvm->tool,
+ kvm->session,
+ kvm->evlist,
+ PERF_KVM__MAX_EVENTS_PER_MMAP);
if (rc < 0)
break;

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 4526d966b10a..994846060c5e 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1671,3 +1671,88 @@ int __perf_session__set_tracepoints_handlers(struct perf_session *session,
out:
return err;
}
+
+static s64 perf_session__mmap_read_idx(struct perf_session *session,
+ int idx,
+ u64 *mmap_time,
+ int nr_per_mmap)
+{
+ union perf_event *event;
+ struct perf_sample sample;
+ s64 n = 0;
+ int err;
+
+ *mmap_time = ULLONG_MAX;
+ while ((event = perf_evlist__mmap_read(session->evlist, idx)) != NULL) {
+ err = perf_evlist__parse_sample(session->evlist, event, &sample);
+ if (err) {
+ perf_evlist__mmap_consume(session->evlist, idx);
+ pr_err("Failed to parse sample\n");
+ return -1;
+ }
+
+ err = perf_session_queue_event(session, event, &sample, 0);
+ /*
+ * FIXME: Here we can't consume the event, as perf_session_queue_event will
+ * point to it, and it'll get possibly overwritten by the kernel.
+ */
+ perf_evlist__mmap_consume(session->evlist, idx);
+
+ if (err) {
+ pr_err("Failed to enqueue sample: %d\n", err);
+ return -1;
+ }
+
+ /* save time stamp of our first sample for this mmap */
+ if (n == 0)
+ *mmap_time = sample.time;
+
+ /* limit events per mmap handled all at once */
+ n++;
+ if (n == nr_per_mmap)
+ break;
+ }
+
+ return n;
+}
+
+int perf_session__mmap_read(struct perf_tool *tool,
+ struct perf_session *session,
+ struct perf_evlist *evlist,
+ int nr_per_mmap)
+{
+ int i, err, throttled = 0;
+ s64 n, ntotal = 0;
+ u64 flush_time = ULLONG_MAX, mmap_time;
+
+ for (i = 0; i < evlist->nr_mmaps; i++) {
+ n = perf_session__mmap_read_idx(session, i, &mmap_time,
+ nr_per_mmap);
+ if (n < 0)
+ return -1;
+
+ /* flush time is going to be the minimum of all the individual
+ * mmap times. Essentially, we flush all the samples queued up
+ * from the last pass under our minimal start time -- that leaves
+ * a very small race for samples to come in with a lower timestamp.
+ * The ioctl to return the perf_clock timestamp should close the
+ * race entirely.
+ */
+ if (mmap_time < flush_time)
+ flush_time = mmap_time;
+
+ ntotal += n;
+ if (n == nr_per_mmap)
+ throttled = 1;
+ }
+
+ /* flush queue after each round in which we processed events */
+ if (ntotal) {
+ session->ordered_samples.next_flush = flush_time;
+ err = tool->finished_round(tool, NULL, session);
+ if (err)
+ return err;
+ }
+
+ return throttled;
+}
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 9494fb68828a..e79da3c1071e 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -133,4 +133,9 @@ int __perf_session__set_tracepoints_handlers(struct perf_session *session,
extern volatile int session_done;

#define session_done() (*(volatile int *)(&session_done))
+
+int perf_session__mmap_read(struct perf_tool *tool,
+ struct perf_session *session,
+ struct perf_evlist *evlist,
+ int nr_per_mmap);
#endif /* __PERF_SESSION_H */
--
1.8.3.2

2014-06-20 10:50:23

by Stanislav Fomichev

[permalink] [raw]

Subject: [PATCH 4/7] perf trace: add pagefault statistics

'perf trace' can show summary of events using -S option. This commit
also reports number of major/minor pagefault events in this summary.

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/builtin-trace.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 9d3f3f7f3ea1..9dd21dd26ef2 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1132,6 +1132,7 @@ struct thread_trace {
u64 exit_time;
bool entry_pending;
unsigned long nr_events;
+ unsigned long pfmaj, pfmin;
char *entry_str;
double runtime_ms;
struct {
@@ -1786,8 +1787,12 @@ static int trace__pgfault(struct trace *trace,
u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
struct addr_location al;
char map_type = 'd';
+ struct thread_trace *ttrace;

thread = machine__findnew_thread(trace->host, sample->pid, sample->tid);
+ ttrace = thread__trace(thread, trace->output);
+ if (ttrace == NULL)
+ return -1;

thread__find_addr_location(thread, trace->host, cpumode, MAP__FUNCTION,
sample->ip, &al);
@@ -1798,6 +1803,11 @@ static int trace__pgfault(struct trace *trace,
evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ?
"maj" : "min");

+ if (evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ)
+ ttrace->pfmaj++;
+ else
+ ttrace->pfmin++;
+
if (al.sym)
fprintf(trace->output, "[%s+0x%lx]",
al.sym->name, al.addr - al.sym->start);
@@ -2338,6 +2348,7 @@ static int trace__fprintf_one_thread(struct thread *thread, void *priv)
printed += fprintf(fp, " %s (%d), ", thread__comm_str(thread), thread->tid);
printed += fprintf(fp, "%lu events, ", ttrace->nr_events);
printed += fprintf(fp, "%.1f%%", ratio);
+ printed += fprintf(fp, ", %lu:%lu faults", ttrace->pfmaj, ttrace->pfmin);
printed += fprintf(fp, ", %.3f msec\n", ttrace->runtime_ms);
printed += thread__dump_stats(ttrace, trace, fp);

--
1.8.3.2

2014-06-20 10:50:22

by Stanislav Fomichev

[permalink] [raw]

Subject: [PATCH 3/7] perf trace: add pagefaults record and replay support

Previous commit added live pagefault trace support, this one adds
record and replay support.

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/builtin-trace.c | 63 +++++++++++++++++++++++++++++++++-------------
1 file changed, 45 insertions(+), 18 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index a80aae2bba40..9d3f3f7f3ea1 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1888,7 +1888,7 @@ static int parse_target_str(struct trace *trace)
return 0;
}

-static int trace__record(int argc, const char **argv)
+static int trace__record(struct trace *trace, int argc, const char **argv)
{
unsigned int rec_argc, i, j;
const char **rec_argv;
@@ -1897,34 +1897,52 @@ static int trace__record(int argc, const char **argv)
"-R",
"-m", "1024",
"-c", "1",
- "-e",
};

+ const char * const sc_args[] = { "-e", };
+ unsigned int sc_args_nr = ARRAY_SIZE(sc_args);
+ const char * const majpf_args[] = { "-e", "major-faults" };
+ unsigned int majpf_args_nr = ARRAY_SIZE(majpf_args);
+ const char * const minpf_args[] = { "-e", "minor-faults" };
+ unsigned int minpf_args_nr = ARRAY_SIZE(minpf_args);
+
/* +1 is for the event string below */
- rec_argc = ARRAY_SIZE(record_args) + 1 + argc;
+ rec_argc = ARRAY_SIZE(record_args) + sc_args_nr + 1 +
+ majpf_args_nr + minpf_args_nr + argc;
rec_argv = calloc(rec_argc + 1, sizeof(char *));

if (rec_argv == NULL)
return -ENOMEM;

+ j = 0;
for (i = 0; i < ARRAY_SIZE(record_args); i++)
- rec_argv[i] = record_args[i];
+ rec_argv[j++] = record_args[i];
+
+ for (i = 0; i < sc_args_nr; i++)
+ rec_argv[j++] = sc_args[i];

/* event string may be different for older kernels - e.g., RHEL6 */
if (is_valid_tracepoint("raw_syscalls:sys_enter"))
- rec_argv[i] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
+ rec_argv[j++] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
else if (is_valid_tracepoint("syscalls:sys_enter"))
- rec_argv[i] = "syscalls:sys_enter,syscalls:sys_exit";
+ rec_argv[j++] = "syscalls:sys_enter,syscalls:sys_exit";
else {
pr_err("Neither raw_syscalls nor syscalls events exist.\n");
return -1;
}
- i++;

- for (j = 0; j < (unsigned int)argc; j++, i++)
- rec_argv[i] = argv[j];
+ if (trace->trace_pgfaults & TRACE_PFMAJ)
+ for (i = 0; i < majpf_args_nr; i++)
+ rec_argv[j++] = majpf_args[i];
+
+ if (trace->trace_pgfaults & TRACE_PFMIN)
+ for (i = 0; i < minpf_args_nr; i++)
+ rec_argv[j++] = minpf_args[i];
+
+ for (i = 0; i < (unsigned int)argc; i++)
+ rec_argv[j++] = argv[i];

- return cmd_record(i, rec_argv, NULL);
+ return cmd_record(j, rec_argv, NULL);
}

static size_t trace__fprintf_thread_summary(struct trace *trace, FILE *fp);
@@ -2215,6 +2233,14 @@ static int trace__replay(struct trace *trace)
goto out;
}

+ evlist__for_each(session->evlist, evsel) {
+ if (evsel->attr.type == PERF_TYPE_SOFTWARE &&
+ (evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ||
+ evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MIN ||
+ evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS))
+ evsel->handler = trace__pgfault;
+ }
+
err = parse_target_str(trace);
if (err != 0)
goto out;
@@ -2449,20 +2475,21 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
int err;
char bf[BUFSIZ];

- if ((argc > 1) && (strcmp(argv[1], "record") == 0))
- return trace__record(argc-2, &argv[2]);
-
- argc = parse_options(argc, argv, trace_options, trace_usage, 0);
-
- /* summary_only implies summary option, but don't overwrite summary if set */
- if (trace.summary_only)
- trace.summary = trace.summary_only;
+ argc = parse_options(argc, argv, trace_options, trace_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);

if (trace.trace_pgfaults) {
trace.opts.sample_address = true;
trace.opts.sample_time = true;
}

+ if ((argc >= 1) && (strcmp(argv[0], "record") == 0))
+ return trace__record(&trace, argc-1, &argv[1]);
+
+ /* summary_only implies summary option, but don't overwrite summary if set */
+ if (trace.summary_only)
+ trace.summary = trace.summary_only;
+
if (output_name != NULL) {
err = trace__open_output(&trace, output_name);
if (err < 0) {
--
1.8.3.2

2014-06-20 10:51:26

by Stanislav Fomichev

[permalink] [raw]

Subject: [PATCH 2/7] perf trace: add support for pagefault tracing

This patch adds optional pagefault tracing support to 'perf trace'.
Using -F/--pf option user can specify whether he wants minor, major or
all pagefault events to be traced. This patch adds only live mode,
record and replace will come in a separate patch.

Example output:
1756272.905 ( 0.000 ms): curl/5937 majfault [0x7fa7261978b6] => /usr/lib/x86_64-linux-gnu/libkrb5.so.26.0.0+0x85288 (d.)
1862866.036 ( 0.000 ms): wget/8460 majfault [__clear_user+0x3f] => 0x659cb4 (?k)

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/Documentation/perf-trace.txt | 12 ++++
tools/perf/builtin-trace.c | 116 +++++++++++++++++++++++++++++++-
2 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt
index fae38d9a44a4..8e5f710aa45d 100644
--- a/tools/perf/Documentation/perf-trace.txt
+++ b/tools/perf/Documentation/perf-trace.txt
@@ -107,6 +107,18 @@ the thread executes on the designated CPUs. Default is to monitor all CPUs.
Show tool stats such as number of times fd->pathname was discovered thru
hooking the open syscall return + vfs_getname or via reading /proc/pid/fd, etc.

+-F=[all|min|maj]::
+--pf=[all|min|maj]::
+ Trace pagefaults. Optionally, you can specify whether you want minor,
+ major or all pagefaults. Default value is maj.
+
+EXAMPLES
+--------
+
+Trace syscalls, major and minor pagefaults:
+
+ $ perf trace -F all
+
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-script[1]
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index a9b542918da0..a80aae2bba40 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1177,6 +1177,9 @@ fail:
return NULL;
}

+#define TRACE_PFMAJ (1 << 0)
+#define TRACE_PFMIN (1 << 1)
+
struct trace {
struct perf_tool tool;
struct {
@@ -1211,6 +1214,7 @@ struct trace {
bool summary_only;
bool show_comm;
bool show_tool_stats;
+ int trace_pgfaults;
};

static int trace__set_fd_pathname(struct thread *thread, int fd, const char *pathname)
@@ -1773,6 +1777,59 @@ out_dump:
return 0;
}

+static int trace__pgfault(struct trace *trace,
+ struct perf_evsel *evsel,
+ union perf_event *event,
+ struct perf_sample *sample)
+{
+ struct thread *thread;
+ u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ struct addr_location al;
+ char map_type = 'd';
+
+ thread = machine__findnew_thread(trace->host, sample->pid, sample->tid);
+
+ thread__find_addr_location(thread, trace->host, cpumode, MAP__FUNCTION,
+ sample->ip, &al);
+
+ trace__fprintf_entry_head(trace, thread, 0, sample->time, trace->output);
+
+ fprintf(trace->output, "%sfault ",
+ evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ?
+ "maj" : "min");
+
+ if (al.sym)
+ fprintf(trace->output, "[%s+0x%lx]",
+ al.sym->name, al.addr - al.sym->start);
+ else
+ fprintf(trace->output, "[0x%lx]", sample->ip);
+
+ fprintf(trace->output, " => ");
+
+ thread__find_addr_map(thread, trace->host, cpumode, MAP__VARIABLE,
+ sample->addr, &al);
+
+ if (!al.map) {
+ thread__find_addr_map(thread, trace->host, cpumode,
+ MAP__FUNCTION, sample->addr, &al);
+
+ if (al.map)
+ map_type = 'x';
+ }
+
+ if (al.map) {
+ fprintf(trace->output, "%s+0x%lx",
+ al.map->dso->long_name, al.addr);
+ } else {
+ map_type = '?';
+ fprintf(trace->output, "0x%lx", sample->addr);
+ }
+
+ fprintf(trace->output, " (%c%c)\n", map_type, al.level);
+
+ return 0;
+}
+
static bool skip_sample(struct trace *trace, struct perf_sample *sample)
{
if ((trace->pid_list && intlist__find(trace->pid_list, sample->pid)) ||
@@ -1887,6 +1944,30 @@ static void perf_evlist__add_vfs_getname(struct perf_evlist *evlist)
perf_evlist__add(evlist, evsel);
}

+static int perf_evlist__add_pgfault(struct perf_evlist *evlist,
+ u64 config)
+{
+ struct perf_evsel *evsel;
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_SOFTWARE,
+ .mmap_data = 1,
+ .sample_period = 1,
+ };
+
+ attr.config = config;
+
+ event_attr_init(&attr);
+
+ evsel = perf_evsel__new(&attr);
+ if (!evsel)
+ return -ENOMEM;
+
+ evsel->handler = trace__pgfault;
+ perf_evlist__add(evlist, evsel);
+
+ return 0;
+}
+
static int trace__run(struct trace *trace, int argc, const char **argv)
{
struct perf_evlist *evlist = perf_evlist__new();
@@ -1907,6 +1988,14 @@ static int trace__run(struct trace *trace, int argc, const char **argv)

perf_evlist__add_vfs_getname(evlist);

+ if ((trace->trace_pgfaults & TRACE_PFMAJ) &&
+ perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MAJ))
+ goto out_error_tp;
+
+ if ((trace->trace_pgfaults & TRACE_PFMIN) &&
+ perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MIN))
+ goto out_error_tp;
+
if (trace->sched &&
perf_evlist__add_newtp(evlist, "sched", "sched_stat_runtime",
trace__sched_stat_runtime))
@@ -1987,7 +2076,8 @@ again:
goto next_event;
}

- if (sample.raw_data == NULL) {
+ if (evsel->attr.type == PERF_TYPE_TRACEPOINT &&
+ sample.raw_data == NULL) {
fprintf(trace->output, "%s sample with no payload for tid: %d, cpu %d, raw_size=%d, skipping...\n",
perf_evsel__name(evsel), sample.tid,
sample.cpu, sample.raw_size);
@@ -2269,6 +2359,23 @@ static int trace__open_output(struct trace *trace, const char *filename)
return trace->output == NULL ? -errno : 0;
}

+static int parse_pagefaults(const struct option *opt, const char *str,
+ int unset __maybe_unused)
+{
+ int *trace_pgfaults = opt->value;
+
+ if (strcmp(str, "all") == 0)
+ *trace_pgfaults |= TRACE_PFMAJ | TRACE_PFMIN;
+ else if (strcmp(str, "maj") == 0)
+ *trace_pgfaults |= TRACE_PFMAJ;
+ else if (strcmp(str, "min") == 0)
+ *trace_pgfaults |= TRACE_PFMIN;
+ else
+ return -1;
+
+ return 0;
+}
+
int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
{
const char * const trace_usage[] = {
@@ -2335,6 +2442,8 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
"Show only syscall summary with statistics"),
OPT_BOOLEAN('S', "with-summary", &trace.summary,
"Show all syscalls and summary with statistics"),
+ OPT_CALLBACK_DEFAULT('F', "pf", &trace.trace_pgfaults, "all|maj|min",
+ "Trace pagefaults", parse_pagefaults, "maj"),
OPT_END()
};
int err;
@@ -2349,6 +2458,11 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
if (trace.summary_only)
trace.summary = trace.summary_only;

+ if (trace.trace_pgfaults) {
+ trace.opts.sample_address = true;
+ trace.opts.sample_time = true;
+ }
+
if (output_name != NULL) {
err = trace__open_output(&trace, output_name);
if (err < 0) {
--
1.8.3.2

2014-06-20 10:51:24

by Stanislav Fomichev

[permalink] [raw]

Subject: [PATCH 1/7] perf trace: add perf_event parameter to tracepoint_handler

It will be used by next pagefault tracing patches in the series.

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/builtin-trace.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index f954c26de231..a9b542918da0 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1535,6 +1535,7 @@ static size_t syscall__scnprintf_args(struct syscall *sc, char *bf, size_t size,
}

typedef int (*tracepoint_handler)(struct trace *trace, struct perf_evsel *evsel,
+ union perf_event *event,
struct perf_sample *sample);

static struct syscall *trace__syscall_info(struct trace *trace,
@@ -1607,6 +1608,7 @@ static void thread__update_stats(struct thread_trace *ttrace,
}

static int trace__sys_enter(struct trace *trace, struct perf_evsel *evsel,
+ union perf_event *event __maybe_unused,
struct perf_sample *sample)
{
char *msg;
@@ -1656,6 +1658,7 @@ static int trace__sys_enter(struct trace *trace, struct perf_evsel *evsel,
}

static int trace__sys_exit(struct trace *trace, struct perf_evsel *evsel,
+ union perf_event *event __maybe_unused,
struct perf_sample *sample)
{
int ret;
@@ -1735,6 +1738,7 @@ out:
}

static int trace__vfs_getname(struct trace *trace, struct perf_evsel *evsel,
+ union perf_event *event __maybe_unused,
struct perf_sample *sample)
{
trace->last_vfs_getname = perf_evsel__rawptr(evsel, sample, "pathname");
@@ -1742,6 +1746,7 @@ static int trace__vfs_getname(struct trace *trace, struct perf_evsel *evsel,
}

static int trace__sched_stat_runtime(struct trace *trace, struct perf_evsel *evsel,
+ union perf_event *event __maybe_unused,
struct perf_sample *sample)
{
u64 runtime = perf_evsel__intval(evsel, sample, "runtime");
@@ -1781,7 +1786,7 @@ static bool skip_sample(struct trace *trace, struct perf_sample *sample)
}

static int trace__process_sample(struct perf_tool *tool,
- union perf_event *event __maybe_unused,
+ union perf_event *event,
struct perf_sample *sample,
struct perf_evsel *evsel,
struct machine *machine __maybe_unused)
@@ -1799,7 +1804,7 @@ static int trace__process_sample(struct perf_tool *tool,

if (handler) {
++trace->nr_events;
- handler(trace, evsel, sample);
+ handler(trace, evsel, event, sample);
}

return err;
@@ -1990,7 +1995,7 @@ again:
}

handler = evsel->handler;
- handler(trace, evsel, &sample);
+ handler(trace, evsel, event, &sample);
next_event:
perf_evlist__mmap_consume(evlist, i);

--
1.8.3.2

2014-06-20 10:51:23

by Stanislav Fomichev

[permalink] [raw]

Subject: [PATCH 5/7] perf trace: add possibility to switch off syscall events

Currently, we may either trace syscalls or syscalls+pagefaults. We'd
like to be able to trace *only* pagefaults and this commit implements
this feature.

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/Documentation/perf-trace.txt | 7 ++++
tools/perf/builtin-trace.c | 58 ++++++++++++++++++---------------
2 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt
index 8e5f710aa45d..bfe204cac2aa 100644
--- a/tools/perf/Documentation/perf-trace.txt
+++ b/tools/perf/Documentation/perf-trace.txt
@@ -112,9 +112,16 @@ the thread executes on the designated CPUs. Default is to monitor all CPUs.
Trace pagefaults. Optionally, you can specify whether you want minor,
major or all pagefaults. Default value is maj.

+--syscalls::
+ Trace system calls. This options is enabled by default.
+
EXAMPLES
--------

+Trace only major pagefaults:
+
+ $ perf trace --no-syscalls -F
+
Trace syscalls, major and minor pagefaults:

$ perf trace -F all
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 9dd21dd26ef2..3cce721f8b06 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1215,6 +1215,7 @@ struct trace {
bool summary_only;
bool show_comm;
bool show_tool_stats;
+ bool trace_syscalls;
int trace_pgfaults;
};

@@ -1928,17 +1929,19 @@ static int trace__record(struct trace *trace, int argc, const char **argv)
for (i = 0; i < ARRAY_SIZE(record_args); i++)
rec_argv[j++] = record_args[i];

- for (i = 0; i < sc_args_nr; i++)
- rec_argv[j++] = sc_args[i];
-
- /* event string may be different for older kernels - e.g., RHEL6 */
- if (is_valid_tracepoint("raw_syscalls:sys_enter"))
- rec_argv[j++] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
- else if (is_valid_tracepoint("syscalls:sys_enter"))
- rec_argv[j++] = "syscalls:sys_enter,syscalls:sys_exit";
- else {
- pr_err("Neither raw_syscalls nor syscalls events exist.\n");
- return -1;
+ if (trace->trace_syscalls) {
+ for (i = 0; i < sc_args_nr; i++)
+ rec_argv[j++] = sc_args[i];
+
+ /* event string may be different for older kernels - e.g., RHEL6 */
+ if (is_valid_tracepoint("raw_syscalls:sys_enter"))
+ rec_argv[j++] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
+ else if (is_valid_tracepoint("syscalls:sys_enter"))
+ rec_argv[j++] = "syscalls:sys_enter,syscalls:sys_exit";
+ else {
+ pr_err("Neither raw_syscalls nor syscalls events exist.\n");
+ return -1;
+ }
}

if (trace->trace_pgfaults & TRACE_PFMAJ)
@@ -2011,10 +2014,13 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
goto out;
}

- if (perf_evlist__add_syscall_newtp(evlist, trace__sys_enter, trace__sys_exit))
+ if (trace->trace_syscalls &&
+ perf_evlist__add_syscall_newtp(evlist, trace__sys_enter,
+ trace__sys_exit))
goto out_error_tp;

- perf_evlist__add_vfs_getname(evlist);
+ if (trace->trace_syscalls)
+ perf_evlist__add_vfs_getname(evlist);

if ((trace->trace_pgfaults & TRACE_PFMAJ) &&
perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MAJ))
@@ -2216,13 +2222,10 @@ static int trace__replay(struct trace *trace)
if (evsel == NULL)
evsel = perf_evlist__find_tracepoint_by_name(session->evlist,
"syscalls:sys_enter");
- if (evsel == NULL) {
- pr_err("Data file does not have raw_syscalls:sys_enter event\n");
- goto out;
- }

- if (perf_evsel__init_syscall_tp(evsel, trace__sys_enter) < 0 ||
- perf_evsel__init_sc_tp_ptr_field(evsel, args)) {
+ if (evsel &&
+ (perf_evsel__init_syscall_tp(evsel, trace__sys_enter) < 0 ||
+ perf_evsel__init_sc_tp_ptr_field(evsel, args))) {
pr_err("Error during initialize raw_syscalls:sys_enter event\n");
goto out;
}
@@ -2232,13 +2235,9 @@ static int trace__replay(struct trace *trace)
if (evsel == NULL)
evsel = perf_evlist__find_tracepoint_by_name(session->evlist,
"syscalls:sys_exit");
- if (evsel == NULL) {
- pr_err("Data file does not have raw_syscalls:sys_exit event\n");
- goto out;
- }
-
- if (perf_evsel__init_syscall_tp(evsel, trace__sys_exit) < 0 ||
- perf_evsel__init_sc_tp_uint_field(evsel, ret)) {
+ if (evsel &&
+ (perf_evsel__init_syscall_tp(evsel, trace__sys_exit) < 0 ||
+ perf_evsel__init_sc_tp_uint_field(evsel, ret))) {
pr_err("Error during initialize raw_syscalls:sys_exit event\n");
goto out;
}
@@ -2442,6 +2441,7 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
},
.output = stdout,
.show_comm = true,
+ .trace_syscalls = true,
};
const char *output_name = NULL;
const char *ev_qualifier_str = NULL;
@@ -2481,6 +2481,7 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
"Show all syscalls and summary with statistics"),
OPT_CALLBACK_DEFAULT('F', "pf", &trace.trace_pgfaults, "all|maj|min",
"Trace pagefaults", parse_pagefaults, "maj"),
+ OPT_BOOLEAN(0, "syscalls", &trace.trace_syscalls, "Trace syscalls"),
OPT_END()
};
int err;
@@ -2501,6 +2502,11 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
if (trace.summary_only)
trace.summary = trace.summary_only;

+ if (!trace.trace_syscalls && !trace.trace_pgfaults) {
+ pr_err("Please specify something to trace.\n");
+ return -1;
+ }
+
if (output_name != NULL) {
err = trace__open_output(&trace, output_name);
if (err < 0) {
--
1.8.3.2

2014-06-20 10:51:20

by Stanislav Fomichev

[permalink] [raw]

Subject: [PATCH 7/7] perf trace: add events cache

Use session to cache and order events from all CPUs so we get
correct ordering of comms/mmaps and faults.

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/builtin-trace.c | 117 ++++++++++++++++++---------------------------
1 file changed, 46 insertions(+), 71 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 3cce721f8b06..e163998547ef 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -20,6 +20,11 @@
#include <sys/mman.h>
#include <linux/futex.h>

+/* keeping the max events to a modest level to keep
+ * the processing of samples per mmap smooth.
+ */
+#define PERF_TRACE__MAX_EVENTS_PER_MMAP 25
+
/* For older distros: */
#ifndef MAP_STACK
# define MAP_STACK 0x20000
@@ -2001,10 +2006,12 @@ static int perf_evlist__add_pgfault(struct perf_evlist *evlist,

static int trace__run(struct trace *trace, int argc, const char **argv)
{
+ struct perf_data_file file = {
+ .mode = PERF_DATA_MODE_WRITE,
+ };
+ struct perf_session *session;
struct perf_evlist *evlist = perf_evlist__new();
- struct perf_evsel *evsel;
- int err = -1, i;
- unsigned long before;
+ int err = -1;
const bool forks = argc > 0;

trace->live = true;
@@ -2049,6 +2056,18 @@ static int trace__run(struct trace *trace, int argc, const char **argv)

perf_evlist__config(evlist, &trace->opts);

+ session = perf_session__new(&file, false, &trace->tool);
+ if (session == NULL) {
+ err = -ENOMEM;
+ goto out_delete_evlist;
+ }
+ session->evlist = evlist;
+ perf_session__set_id_hdr_size(session);
+ machine__synthesize_threads(&session->machines.host,
+ &trace->opts.target,
+ evlist->threads, false);
+ trace->host = &session->machines.host;
+
signal(SIGCHLD, sig_handler);
signal(SIGINT, sig_handler);

@@ -2068,7 +2087,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
err = perf_evlist__mmap(evlist, trace->opts.mmap_pages, false);
if (err < 0) {
fprintf(trace->output, "Couldn't mmap the events: %s\n", strerror(errno));
- goto out_delete_evlist;
+ goto out_delete_session;
}

perf_evlist__enable(evlist);
@@ -2077,67 +2096,19 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
perf_evlist__start_workload(evlist);

trace->multiple_threads = evlist->threads->map[0] == -1 || evlist->threads->nr > 1;
-again:
- before = trace->nr_events;
-
- for (i = 0; i < evlist->nr_mmaps; i++) {
- union perf_event *event;
-
- while ((event = perf_evlist__mmap_read(evlist, i)) != NULL) {
- const u32 type = event->header.type;
- tracepoint_handler handler;
- struct perf_sample sample;
-
- ++trace->nr_events;

- err = perf_evlist__parse_sample(evlist, event, &sample);
- if (err) {
- fprintf(trace->output, "Can't parse sample, err = %d, skipping...\n", err);
- goto next_event;
- }
-
- if (!trace->full_time && trace->base_time == 0)
- trace->base_time = sample.time;
-
- if (type != PERF_RECORD_SAMPLE) {
- trace__process_event(trace, trace->host, event, &sample);
- continue;
- }
-
- evsel = perf_evlist__id2evsel(evlist, sample.id);
- if (evsel == NULL) {
- fprintf(trace->output, "Unknown tp ID %" PRIu64 ", skipping...\n", sample.id);
- goto next_event;
- }
-
- if (evsel->attr.type == PERF_TYPE_TRACEPOINT &&
- sample.raw_data == NULL) {
- fprintf(trace->output, "%s sample with no payload for tid: %d, cpu %d, raw_size=%d, skipping...\n",
- perf_evsel__name(evsel), sample.tid,
- sample.cpu, sample.raw_size);
- goto next_event;
- }
-
- handler = evsel->handler;
- handler(trace, evsel, event, &sample);
-next_event:
- perf_evlist__mmap_consume(evlist, i);
-
- if (interrupted)
- goto out_disable;
- }
- }
+ while (!done) {
+ int rc;

- if (trace->nr_events == before) {
- int timeout = done ? 100 : -1;
+ rc = perf_session__mmap_read(&trace->tool, session, evlist,
+ PERF_TRACE__MAX_EVENTS_PER_MMAP);
+ if (rc < 0)
+ break;

- if (poll(evlist->pollfd, evlist->nr_fds, timeout) > 0)
- goto again;
- } else {
- goto again;
+ if (!rc && !done)
+ err = poll(evlist->pollfd, evlist->nr_fds, 100);
}

-out_disable:
perf_evlist__disable(evlist);

if (!err) {
@@ -2153,6 +2124,8 @@ out_disable:
}
}

+out_delete_session:
+ perf_session__delete(session);
out_delete_evlist:
perf_evlist__delete(evlist);
out:
@@ -2187,17 +2160,6 @@ static int trace__replay(struct trace *trace)
struct perf_evsel *evsel;
int err = -1;

- trace->tool.sample = trace__process_sample;
- trace->tool.mmap = perf_event__process_mmap;
- trace->tool.mmap2 = perf_event__process_mmap2;
- trace->tool.comm = perf_event__process_comm;
- trace->tool.exit = perf_event__process_exit;
- trace->tool.fork = perf_event__process_fork;
- trace->tool.attr = perf_event__process_attr;
- trace->tool.tracing_data = perf_event__process_tracing_data;
- trace->tool.build_id = perf_event__process_build_id;
-
- trace->tool.ordered_samples = true;
trace->tool.ordering_requires_timestamps = true;

/* add tid to output */
@@ -2547,6 +2509,19 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
if (!argc && target__none(&trace.opts.target))
trace.opts.target.system_wide = true;

+ trace.tool.sample = trace__process_sample;
+ trace.tool.mmap = perf_event__process_mmap;
+ trace.tool.mmap2 = perf_event__process_mmap2;
+ trace.tool.comm = perf_event__process_comm;
+ trace.tool.exit = perf_event__process_exit;
+ trace.tool.fork = perf_event__process_fork;
+ trace.tool.attr = perf_event__process_attr;
+ trace.tool.tracing_data = perf_event__process_tracing_data;
+ trace.tool.build_id = perf_event__process_build_id;
+
+ trace.tool.ordered_samples = true;
+ perf_tool__fill_defaults(&trace.tool);
+
if (input_name)
err = trace__replay(&trace);
else
--
1.8.3.2

2014-06-20 13:21:13

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

Em Fri, Jun 20, 2014 at 02:49:42PM +0400, Stanislav Fomichev escreveu:
> This patch series adds support for pagefaults tracing to 'perf trace' command.
> It seems this feature was planned by Namhyung Kim (http://events.linuxfoundation.org/images/stories/pdf/klf2012_n_kim.pdf page 17/28)
> but I couldn't find any prior patches/discussion and started from scratch.

Just to clarify here, those slides came from slides I made and in turn
the whole idea about pagefaults tracing I got from the trace prototype
that Thomas Gleixner implemented in his 'trace' utility, described
here:

Announcing a new utility: 'trace'
http://lwn.net/Articles/415728/

The comments section has lots of interesting ideas, some you may find
interesting to implement :-)

There is a branch in my tree with the branch tglx did his work on:

https://git.kernel.org/cgit/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/trace2

There you can take a look and compare what you're doing to what he did.

Now I'll go thru your current patches and will cherry pick whatever I
think its OK already, and will try and provide comments for whatever I
think needs more work.

- Arnaldo

> First three patches add the feature and options to enable faults and disable
> syscalls.
> Two last patches add events caching (like it's done in the perf kvm), so that
> we don't get fault events prior to mmap/comm events (makes sense only
> for live mode).
>
> This is just a proof-of-concept, and I'd like to get some comments about
> where and what I got wrong and what additional useful information I can
> expose in the trace.
>
> v2:
> - added more info to the changelogs
> - reworked options (-f -> -F, --pgfaults -> --pf=[all|min|maj])
> - separated tracepoint_handler changes into additional patch
> - separated record/replay into additional patch
> - other fixes pointed out by Arnaldo Carvalho de Melo
>
> Stanislav Fomichev (7):
> perf trace: add perf_event parameter to tracepoint_handler
> perf trace: add support for pagefault tracing
> perf trace: add pagefaults record and replay support
> perf trace: add pagefault statistics
> perf trace: add possibility to switch off syscall events
> perf kvm: move perf_kvm__mmap_read into session utils
> perf trace: add events cache
>
> tools/perf/Documentation/perf-trace.txt | 19 ++
> tools/perf/builtin-kvm.c | 88 +-------
> tools/perf/builtin-trace.c | 350 ++++++++++++++++++++++----------
> tools/perf/util/session.c | 85 ++++++++
> tools/perf/util/session.h | 5 +
> 5 files changed, 357 insertions(+), 190 deletions(-)
>
> --
> 1.8.3.2

2014-06-20 14:45:07

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH 6/7] perf kvm: move perf_kvm__mmap_read into session utils

Em Fri, Jun 20, 2014 at 02:49:48PM +0400, Stanislav Fomichev escreveu:
> It will be reused by perf trace in the following commit.

I know this is needed, but one of the goals when I started builtin-trace
was not to use perf_session at all, as it initially was just about the
live mode.

record/replay came later and ended up needing it, but I'll take a look
at what Jiri is doing on the ordered samples front before getting to 6/7
and 7/7.

Up to 5/7 looks ok and I'm going now to apply and try it.

I.e. what we need from perf_session is just the ordered_samples bits,
perhaps in its current form, perhaps rewritten, see (renewed) discussion
involving David Ahern and Jiri Olsa.

- Arnaldo

> Signed-off-by: Stanislav Fomichev <[email protected]>
> ---
> tools/perf/builtin-kvm.c | 88 +++--------------------------------------------
> tools/perf/util/session.c | 85 +++++++++++++++++++++++++++++++++++++++++++++
> tools/perf/util/session.h | 5 +++
> 3 files changed, 94 insertions(+), 84 deletions(-)
>
> diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
> index 0f1e5a2f6ad7..a69ffe7512e5 100644
> --- a/tools/perf/builtin-kvm.c
> +++ b/tools/perf/builtin-kvm.c
> @@ -885,89 +885,6 @@ static bool verify_vcpu(int vcpu)
> */
> #define PERF_KVM__MAX_EVENTS_PER_MMAP 25
>
> -static s64 perf_kvm__mmap_read_idx(struct perf_kvm_stat *kvm, int idx,
> - u64 *mmap_time)
> -{
> - union perf_event *event;
> - struct perf_sample sample;
> - s64 n = 0;
> - int err;
> -
> - *mmap_time = ULLONG_MAX;
> - while ((event = perf_evlist__mmap_read(kvm->evlist, idx)) != NULL) {
> - err = perf_evlist__parse_sample(kvm->evlist, event, &sample);
> - if (err) {
> - perf_evlist__mmap_consume(kvm->evlist, idx);
> - pr_err("Failed to parse sample\n");
> - return -1;
> - }
> -
> - err = perf_session_queue_event(kvm->session, event, &sample, 0);
> - /*
> - * FIXME: Here we can't consume the event, as perf_session_queue_event will
> - * point to it, and it'll get possibly overwritten by the kernel.
> - */
> - perf_evlist__mmap_consume(kvm->evlist, idx);
> -
> - if (err) {
> - pr_err("Failed to enqueue sample: %d\n", err);
> - return -1;
> - }
> -
> - /* save time stamp of our first sample for this mmap */
> - if (n == 0)
> - *mmap_time = sample.time;
> -
> - /* limit events per mmap handled all at once */
> - n++;
> - if (n == PERF_KVM__MAX_EVENTS_PER_MMAP)
> - break;
> - }
> -
> - return n;
> -}
> -
> -static int perf_kvm__mmap_read(struct perf_kvm_stat *kvm)
> -{
> - int i, err, throttled = 0;
> - s64 n, ntotal = 0;
> - u64 flush_time = ULLONG_MAX, mmap_time;
> -
> - for (i = 0; i < kvm->evlist->nr_mmaps; i++) {
> - n = perf_kvm__mmap_read_idx(kvm, i, &mmap_time);
> - if (n < 0)
> - return -1;
> -
> - /* flush time is going to be the minimum of all the individual
> - * mmap times. Essentially, we flush all the samples queued up
> - * from the last pass under our minimal start time -- that leaves
> - * a very small race for samples to come in with a lower timestamp.
> - * The ioctl to return the perf_clock timestamp should close the
> - * race entirely.
> - */
> - if (mmap_time < flush_time)
> - flush_time = mmap_time;
> -
> - ntotal += n;
> - if (n == PERF_KVM__MAX_EVENTS_PER_MMAP)
> - throttled = 1;
> - }
> -
> - /* flush queue after each round in which we processed events */
> - if (ntotal) {
> - kvm->session->ordered_samples.next_flush = flush_time;
> - err = kvm->tool.finished_round(&kvm->tool, NULL, kvm->session);
> - if (err) {
> - if (kvm->lost_events)
> - pr_info("\nLost events: %" PRIu64 "\n\n",
> - kvm->lost_events);
> - return err;
> - }
> - }
> -
> - return throttled;
> -}
> -
> static volatile int done;
>
> static void sig_handler(int sig __maybe_unused)
> @@ -1133,7 +1050,10 @@ static int kvm_events_live_report(struct perf_kvm_stat *kvm)
> while (!done) {
> int rc;
>
> - rc = perf_kvm__mmap_read(kvm);
> + rc = perf_session__mmap_read(&kvm->tool,
> + kvm->session,
> + kvm->evlist,
> + PERF_KVM__MAX_EVENTS_PER_MMAP);
> if (rc < 0)
> break;
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 4526d966b10a..994846060c5e 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -1671,3 +1671,88 @@ int __perf_session__set_tracepoints_handlers(struct perf_session *session,
> out:
> return err;
> }
> +
> +static s64 perf_session__mmap_read_idx(struct perf_session *session,
> + int idx,
> + u64 *mmap_time,
> + int nr_per_mmap)
> +{
> + union perf_event *event;
> + struct perf_sample sample;
> + s64 n = 0;
> + int err;
> +
> + *mmap_time = ULLONG_MAX;
> + while ((event = perf_evlist__mmap_read(session->evlist, idx)) != NULL) {
> + err = perf_evlist__parse_sample(session->evlist, event, &sample);
> + if (err) {
> + perf_evlist__mmap_consume(session->evlist, idx);
> + pr_err("Failed to parse sample\n");
> + return -1;
> + }
> +
> + err = perf_session_queue_event(session, event, &sample, 0);
> + /*
> + * FIXME: Here we can't consume the event, as perf_session_queue_event will
> + * point to it, and it'll get possibly overwritten by the kernel.
> + */
> + perf_evlist__mmap_consume(session->evlist, idx);
> +
> + if (err) {
> + pr_err("Failed to enqueue sample: %d\n", err);
> + return -1;
> + }
> +
> + /* save time stamp of our first sample for this mmap */
> + if (n == 0)
> + *mmap_time = sample.time;
> +
> + /* limit events per mmap handled all at once */
> + n++;
> + if (n == nr_per_mmap)
> + break;
> + }
> +
> + return n;
> +}
> +
> +int perf_session__mmap_read(struct perf_tool *tool,
> + struct perf_session *session,
> + struct perf_evlist *evlist,
> + int nr_per_mmap)
> +{
> + int i, err, throttled = 0;
> + s64 n, ntotal = 0;
> + u64 flush_time = ULLONG_MAX, mmap_time;
> +
> + for (i = 0; i < evlist->nr_mmaps; i++) {
> + n = perf_session__mmap_read_idx(session, i, &mmap_time,
> + nr_per_mmap);
> + if (n < 0)
> + return -1;
> +
> + /* flush time is going to be the minimum of all the individual
> + * mmap times. Essentially, we flush all the samples queued up
> + * from the last pass under our minimal start time -- that leaves
> + * a very small race for samples to come in with a lower timestamp.
> + * The ioctl to return the perf_clock timestamp should close the
> + * race entirely.
> + */
> + if (mmap_time < flush_time)
> + flush_time = mmap_time;
> +
> + ntotal += n;
> + if (n == nr_per_mmap)
> + throttled = 1;
> + }
> +
> + /* flush queue after each round in which we processed events */
> + if (ntotal) {
> + session->ordered_samples.next_flush = flush_time;
> + err = tool->finished_round(tool, NULL, session);
> + if (err)
> + return err;
> + }
> +
> + return throttled;
> +}
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index 9494fb68828a..e79da3c1071e 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -133,4 +133,9 @@ int __perf_session__set_tracepoints_handlers(struct perf_session *session,
> extern volatile int session_done;
>
> #define session_done() (*(volatile int *)(&session_done))
> +
> +int perf_session__mmap_read(struct perf_tool *tool,
> + struct perf_session *session,
> + struct perf_evlist *evlist,
> + int nr_per_mmap);
> #endif /* __PERF_SESSION_H */
> --
> 1.8.3.2

2014-06-20 15:00:03

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH 2/7] perf trace: add support for pagefault tracing

Em Fri, Jun 20, 2014 at 02:49:44PM +0400, Stanislav Fomichev escreveu:
> This patch adds optional pagefault tracing support to 'perf trace'.
> Using -F/--pf option user can specify whether he wants minor, major or
> all pagefault events to be traced. This patch adds only live mode,
> record and replace will come in a separate patch.
>
> Example output:
> 1756272.905 ( 0.000 ms): curl/5937 majfault [0x7fa7261978b6] => /usr/lib/x86_64-linux-gnu/libkrb5.so.26.0.0+0x85288 (d.)

Things like the above may be better expressed using dso@symbol+offset,
even more when the symbol is not resolved within a DSO, due to
insufficient expressivity in the available symtab, i.e. perhaps the user
needs to install a debuginfo package.

> 1862866.036 ( 0.000 ms): wget/8460 majfault [__clear_user+0x3f] => 0x659cb4 (?k)

And here, where is this __clear_user symbol from? The kernel? Probaly
that 'k' there means that? If so, probably that deserves an entry in the
documentation.

WRT documentation, please consider sending followup patches providing
examples in the documentation, talking about the possible need of
installing debuginfo packages as well.

But its shaping up nicely, good work!

- Arnaldo

> Signed-off-by: Stanislav Fomichev <[email protected]>
> ---
> tools/perf/Documentation/perf-trace.txt | 12 ++++
> tools/perf/builtin-trace.c | 116 +++++++++++++++++++++++++++++++-
> 2 files changed, 127 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt
> index fae38d9a44a4..8e5f710aa45d 100644
> --- a/tools/perf/Documentation/perf-trace.txt
> +++ b/tools/perf/Documentation/perf-trace.txt
> @@ -107,6 +107,18 @@ the thread executes on the designated CPUs. Default is to monitor all CPUs.
> Show tool stats such as number of times fd->pathname was discovered thru
> hooking the open syscall return + vfs_getname or via reading /proc/pid/fd, etc.
>
> +-F=[all|min|maj]::
> +--pf=[all|min|maj]::
> + Trace pagefaults. Optionally, you can specify whether you want minor,
> + major or all pagefaults. Default value is maj.
> +
> +EXAMPLES
> +--------
> +
> +Trace syscalls, major and minor pagefaults:
> +
> + $ perf trace -F all
> +
> SEE ALSO
> --------
> linkperf:perf-record[1], linkperf:perf-script[1]
> diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> index a9b542918da0..a80aae2bba40 100644
> --- a/tools/perf/builtin-trace.c
> +++ b/tools/perf/builtin-trace.c
> @@ -1177,6 +1177,9 @@ fail:
> return NULL;
> }
>
> +#define TRACE_PFMAJ (1 << 0)
> +#define TRACE_PFMIN (1 << 1)
> +
> struct trace {
> struct perf_tool tool;
> struct {
> @@ -1211,6 +1214,7 @@ struct trace {
> bool summary_only;
> bool show_comm;
> bool show_tool_stats;
> + int trace_pgfaults;
> };
>
> static int trace__set_fd_pathname(struct thread *thread, int fd, const char *pathname)
> @@ -1773,6 +1777,59 @@ out_dump:
> return 0;
> }
>
> +static int trace__pgfault(struct trace *trace,
> + struct perf_evsel *evsel,
> + union perf_event *event,
> + struct perf_sample *sample)
> +{
> + struct thread *thread;
> + u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> + struct addr_location al;
> + char map_type = 'd';
> +
> + thread = machine__findnew_thread(trace->host, sample->pid, sample->tid);
> +
> + thread__find_addr_location(thread, trace->host, cpumode, MAP__FUNCTION,
> + sample->ip, &al);
> +
> + trace__fprintf_entry_head(trace, thread, 0, sample->time, trace->output);
> +
> + fprintf(trace->output, "%sfault ",
> + evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ?
> + "maj" : "min");
> +
> + if (al.sym)
> + fprintf(trace->output, "[%s+0x%lx]",
> + al.sym->name, al.addr - al.sym->start);
> + else
> + fprintf(trace->output, "[0x%lx]", sample->ip);
> +
> + fprintf(trace->output, " => ");
> +
> + thread__find_addr_map(thread, trace->host, cpumode, MAP__VARIABLE,
> + sample->addr, &al);
> +
> + if (!al.map) {
> + thread__find_addr_map(thread, trace->host, cpumode,
> + MAP__FUNCTION, sample->addr, &al);
> +
> + if (al.map)
> + map_type = 'x';
> + }
> +
> + if (al.map) {
> + fprintf(trace->output, "%s+0x%lx",
> + al.map->dso->long_name, al.addr);
> + } else {
> + map_type = '?';
> + fprintf(trace->output, "0x%lx", sample->addr);
> + }
> +
> + fprintf(trace->output, " (%c%c)\n", map_type, al.level);
> +
> + return 0;
> +}
> +
> static bool skip_sample(struct trace *trace, struct perf_sample *sample)
> {
> if ((trace->pid_list && intlist__find(trace->pid_list, sample->pid)) ||
> @@ -1887,6 +1944,30 @@ static void perf_evlist__add_vfs_getname(struct perf_evlist *evlist)
> perf_evlist__add(evlist, evsel);
> }
>
> +static int perf_evlist__add_pgfault(struct perf_evlist *evlist,
> + u64 config)
> +{
> + struct perf_evsel *evsel;
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_SOFTWARE,
> + .mmap_data = 1,
> + .sample_period = 1,
> + };
> +
> + attr.config = config;
> +
> + event_attr_init(&attr);
> +
> + evsel = perf_evsel__new(&attr);
> + if (!evsel)
> + return -ENOMEM;
> +
> + evsel->handler = trace__pgfault;
> + perf_evlist__add(evlist, evsel);
> +
> + return 0;
> +}
> +
> static int trace__run(struct trace *trace, int argc, const char **argv)
> {
> struct perf_evlist *evlist = perf_evlist__new();
> @@ -1907,6 +1988,14 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
>
> perf_evlist__add_vfs_getname(evlist);
>
> + if ((trace->trace_pgfaults & TRACE_PFMAJ) &&
> + perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MAJ))
> + goto out_error_tp;
> +
> + if ((trace->trace_pgfaults & TRACE_PFMIN) &&
> + perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MIN))
> + goto out_error_tp;
> +
> if (trace->sched &&
> perf_evlist__add_newtp(evlist, "sched", "sched_stat_runtime",
> trace__sched_stat_runtime))
> @@ -1987,7 +2076,8 @@ again:
> goto next_event;
> }
>
> - if (sample.raw_data == NULL) {
> + if (evsel->attr.type == PERF_TYPE_TRACEPOINT &&
> + sample.raw_data == NULL) {
> fprintf(trace->output, "%s sample with no payload for tid: %d, cpu %d, raw_size=%d, skipping...\n",
> perf_evsel__name(evsel), sample.tid,
> sample.cpu, sample.raw_size);
> @@ -2269,6 +2359,23 @@ static int trace__open_output(struct trace *trace, const char *filename)
> return trace->output == NULL ? -errno : 0;
> }
>
> +static int parse_pagefaults(const struct option *opt, const char *str,
> + int unset __maybe_unused)
> +{
> + int *trace_pgfaults = opt->value;
> +
> + if (strcmp(str, "all") == 0)
> + *trace_pgfaults |= TRACE_PFMAJ | TRACE_PFMIN;
> + else if (strcmp(str, "maj") == 0)
> + *trace_pgfaults |= TRACE_PFMAJ;
> + else if (strcmp(str, "min") == 0)
> + *trace_pgfaults |= TRACE_PFMIN;
> + else
> + return -1;
> +
> + return 0;
> +}
> +
> int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
> {
> const char * const trace_usage[] = {
> @@ -2335,6 +2442,8 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
> "Show only syscall summary with statistics"),
> OPT_BOOLEAN('S', "with-summary", &trace.summary,
> "Show all syscalls and summary with statistics"),
> + OPT_CALLBACK_DEFAULT('F', "pf", &trace.trace_pgfaults, "all|maj|min",
> + "Trace pagefaults", parse_pagefaults, "maj"),
> OPT_END()
> };
> int err;
> @@ -2349,6 +2458,11 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
> if (trace.summary_only)
> trace.summary = trace.summary_only;
>
> + if (trace.trace_pgfaults) {
> + trace.opts.sample_address = true;
> + trace.opts.sample_time = true;
> + }
> +
> if (output_name != NULL) {
> err = trace__open_output(&trace, output_name);
> if (err < 0) {
> --
> 1.8.3.2

2014-06-20 15:07:10

by Stanislav Fomichev

[permalink] [raw]

Subject: Re: [PATCH 6/7] perf kvm: move perf_kvm__mmap_read into session utils

> I.e. what we need from perf_session is just the ordered_samples bits,
> perhaps in its current form, perhaps rewritten, see (renewed) discussion
> involving David Ahern and Jiri Olsa.
Do you mean you want to pull (at some point) Jiri's patches and want to use
just event ordering part in perf trace, not a full blown perf_session?
I'll try to have a look how I can use these patches for ordered trace.

2014-06-20 15:08:56

by Stanislav Fomichev

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

> Just to clarify here, those slides came from slides I made and in turn
> the whole idea about pagefaults tracing I got from the trace prototype
> that Thomas Gleixner implemented in his 'trace' utility, described
> here:
>
> Announcing a new utility: 'trace'
> http://lwn.net/Articles/415728/
>
> The comments section has lots of interesting ideas, some you may find
> interesting to implement :-)
>
> There is a branch in my tree with the branch tglx did his work on:
>
> https://git.kernel.org/cgit/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/trace2
Wow, thanks, I tried to search lkml for any presence of
patches/discussion about these slides, but couldn't find anything, thanks for
pointing it out.

I really like 'blocking/preempted' indication and of course I miss
pointers decoding.

Did anyone really think about decoding pointers and how we can
implement it (like dumping them upon entering a syscall and then
using inside the perf trace?)?

2014-06-20 15:25:45

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH 6/7] perf kvm: move perf_kvm__mmap_read into session utils

Em Fri, Jun 20, 2014 at 07:07:02PM +0400, Stanislav Fomichev escreveu:
> > I.e. what we need from perf_session is just the ordered_samples bits,
> > perhaps in its current form, perhaps rewritten, see (renewed) discussion
> > involving David Ahern and Jiri Olsa.
> Do you mean you want to pull (at some point) Jiri's patches and want to use
> just event ordering part in perf trace, not a full blown perf_session?

Right.

> I'll try to have a look how I can use these patches for ordered trace.

Thanks, reviewing his work is the logical next step for you since you're
working on this, great move!

- Arnaldo

2014-06-20 15:39:05

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

Em Fri, Jun 20, 2014 at 07:03:18PM +0400, Stanislav Fomichev escreveu:
> > Just to clarify here, those slides came from slides I made and in turn
> > the whole idea about pagefaults tracing I got from the trace prototype
> > that Thomas Gleixner implemented in his 'trace' utility, described
> > here:

> > Announcing a new utility: 'trace'
> > http://lwn.net/Articles/415728/

> > The comments section has lots of interesting ideas, some you may find
> > interesting to implement :-)

> > There is a branch in my tree with the branch tglx did his work on:

> > https://git.kernel.org/cgit/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/trace2

> Wow, thanks, I tried to search lkml for any presence of
> patches/discussion about these slides, but couldn't find anything, thanks for
> pointing it out.

> I really like 'blocking/preempted' indication and of course I miss
> pointers decoding.

> Did anyone really think about decoding pointers and how we can
> implement it (like dumping them upon entering a syscall and then
> using inside the perf trace?)?

Hey, haven't you seen the vfs_getname probe? Idea is to hook on where
the relevant copy_from_user is done and insert that into the ring
buffer, as we already do for mapping fd -> pathname.

Right now it is too simple, but I was starting to work (when you jumped
right in with your work making me stop and go on testing/reviewing :) )
on making it more generic so that we could defer pretty printing the
arguments from sys_enter to sys_exit, when, by then, we would already
have an association of a user level pointer in some specific thread to
its contents.

This will allow us to to resolve the pathname pointer in things like
open() (i.e. not just after that, in the fd syscalls (write, etc)) and
as well any other pointer of interest.

By librarizing 'builtin-probe.c', that now uses lots of global
variables, etc, we would be able to insert probes where we want them to
capture the contents of pointers, check if the probes are already in
place, use just the ones that we managed to insert (i.e. that were not
invalid because the places where we wanted them to be were changed
across kernel releases, etc).

I.e. no need for actual tracepoints from day one, just wannabe
tracepoints using whatever probe inserting gizmo the kprobes_tracer used
by 'perf probe' now thinks its best to use.

Combine that with using DWARF descriptions (that could be pre cached
into something like CTF (the DTrace kind of CTF) or similar) like pahole
does and we would mostly automatically do all this work of prettyfing
syscall parameters.

</handwave>

:-)

For now try:

perf probe 'vfs_getname=getname_flags:65 pathname=result->name:string'
trace

And look at how it manages to decode fds.

- Arnaldo

2014-06-20 15:49:12

by Stanislav Fomichev

[permalink] [raw]

Subject: Re: [PATCH 2/7] perf trace: add support for pagefault tracing

> > 1756272.905 ( 0.000 ms): curl/5937 majfault [0x7fa7261978b6] => /usr/lib/x86_64-linux-gnu/libkrb5.so.26.0.0+0x85288 (d.)
>
> Things like the above may be better expressed using dso@symbol+offset,
> even more when the symbol is not resolved within a DSO, due to
> insufficient expressivity in the available symtab, i.e. perhaps the user
> needs to install a debuginfo package.
Currently, the format is:
<header> <min|maj>fault [ip.symbol+ip.offset] => addr.dso+addr.offset (<mmap_type><addr.level>)

if no symbol found for IP, we have:
<header> <min|maj>fault [ip.addr] => addr.dso+addr.offset (<mmap_type><addr.level>)

mmap_type is d - for data, x - for executable vmas.

I think we may try to print [ip.dso+ip.offset] if we can't resolve ip symbol,
but I don't want dso@symbol+offset because if we have symbol, dso is
probably (?) redundant (ok, at least for me).
It also seems we don't need to resolve symbol of pagefault address
because it's either some code or some address on the heap. Dso is enough.

> > 1862866.036 ( 0.000 ms): wget/8460 majfault [__clear_user+0x3f] => 0x659cb4 (?k)
>
> And here, where is this __clear_user symbol from? The kernel? Probaly
> that 'k' there means that? If so, probably that deserves an entry in the
> documentation.
Yes, k or . describes symbol 'level'.

> WRT documentation, please consider sending followup patches providing
> examples in the documentation, talking about the possible need of
> installing debuginfo packages as well.
I didn't put much effort into documentation because I just wanted to get
initial feedback. Will follow.

> But its shaping up nicely, good work!
Thanks.

2014-06-20 16:11:42

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH 2/7] perf trace: add support for pagefault tracing

Em Fri, Jun 20, 2014 at 07:49:01PM +0400, Stanislav Fomichev escreveu:
> > > 1756272.905 ( 0.000 ms): curl/5937 majfault [0x7fa7261978b6] => /usr/lib/x86_64-linux-gnu/libkrb5.so.26.0.0+0x85288 (d.)
> >
> > Things like the above may be better expressed using dso@symbol+offset,
> > even more when the symbol is not resolved within a DSO, due to
> > insufficient expressivity in the available symtab, i.e. perhaps the user
> > needs to install a debuginfo package.
> Currently, the format is:
> <header> <min|maj>fault [ip.symbol+ip.offset] => addr.dso+addr.offset (<mmap_type><addr.level>)
>
> if no symbol found for IP, we have:
> <header> <min|maj>fault [ip.addr] => addr.dso+addr.offset (<mmap_type><addr.level>)
>
> mmap_type is d - for data, x - for executable vmas.
>
> I think we may try to print [ip.dso+ip.offset] if we can't resolve ip symbol,
> but I don't want dso@symbol+offset because if we have symbol, dso is
> probably (?) redundant (ok, at least for me).

Well, I don't think it is :-)

dso->short_name should disambiguate 99.9% of the cases tho. Perhaps we
can use --verbose, as in other places, to switch using dso->long_name,
for cases where multiple versions of a library are used, some in non
standard paths, which sometimes puzzles people looking at the tools
output.

> It also seems we don't need to resolve symbol of pagefault address

Well, in some cases if we can resolve to a variable, why not?

> because it's either some code or some address on the heap. Dso is enough.

> > > 1862866.036 ( 0.000 ms): wget/8460 majfault [__clear_user+0x3f] => 0x659cb4 (?k)

> > And here, where is this __clear_user symbol from? The kernel? Probaly
> > that 'k' there means that? If so, probably that deserves an entry in the
> > documentation.

> Yes, k or . describes symbol 'level'.

> > WRT documentation, please consider sending followup patches providing
> > examples in the documentation, talking about the possible need of
> > installing debuginfo packages as well.
> I didn't put much effort into documentation because I just wanted to get
> initial feedback. Will follow.

Ok, but I think its good practice, as much as it is seldom done because
it is boring, to do it exactly when doing the matching feature. Doing it
judiciously may even help the person implementing to think about what is
being done more thoroughly which sometimes leads to problems being
solved at an early stage 8-)

> > But its shaping up nicely, good work!
> Thanks.

- Arnaldo

2014-06-20 16:19:12

by Stanislav Fomichev

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

> Hey, haven't you seen the vfs_getname probe? Idea is to hook on where
> the relevant copy_from_user is done and insert that into the ring
> buffer, as we already do for mapping fd -> pathname.
I saw it but didn't actually try because it needs all the debugging
stuff enabled and in place.

> I.e. no need for actual tracepoints from day one, just wannabe
> tracepoints using whatever probe inserting gizmo the kprobes_tracer used
> by 'perf probe' now thinks its best to use.
But we then need to predefine many probes for decoding to work in the form of
func:offset, and then play catch-up with all the kernel changes.
Or I miss something important here?

> For now try:
>
> perf probe 'vfs_getname=getname_flags:65 pathname=result->name:string'
> trace
>
> And look at how it manages to decode fds.
I will try, but does 65 still work after c4ad8f98bef77c7356aa6a9ad9188a6acc6b849d? :-)

2014-06-20 18:30:35

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

Em Fri, Jun 20, 2014 at 08:18:59PM +0400, Stanislav Fomichev escreveu:
> > Hey, haven't you seen the vfs_getname probe? Idea is to hook on where
> > the relevant copy_from_user is done and insert that into the ring
> > buffer, as we already do for mapping fd -> pathname.

> I saw it but didn't actually try because it needs all the debugging
> stuff enabled and in place.

Touch?, more on that below...

> > I.e. no need for actual tracepoints from day one, just wannabe
> > tracepoints using whatever probe inserting gizmo the kprobes_tracer used
> > by 'perf probe' now thinks its best to use.

> But we then need to predefine many probes for decoding to work in the form of
> func:offset, and then play catch-up with all the kernel changes.
> Or I miss something important here?

No you don't.

If we want to disturb the system in the least way possible, we need to
tag along the copying from userspace of those pointers, so that we get
them fresh and just stash it in our ring buffer and get out of the way
quickly.

> > For now try:
> >
> > perf probe 'vfs_getname=getname_flags:65 pathname=result->name:string'
> > trace
> >
> > And look at how it manages to decode fds.

> I will try, but does 65 still work after c4ad8f98bef77c7356aa6a9ad9188a6acc6b849d? :-)

Well, when I prototyped this[1] the idea is that in some areas, there is
not that much code flux that before commiting to any kind of new
interface, be it tracepoints or something else, we may well just use
'perf probe' to get what we need, and this was done in...

commit 75b757ca90469e990e6901f4a9497fe4161f7f5a
Author: Arnaldo Carvalho de Melo <[email protected]>
Date: Tue Sep 24 11:04:32 2013 -0300

Almost a year ago, and it still works, now lets see the cset you mention...

[acme@zoo linux]$ git describe c4ad8f98bef77c7356aa6a9ad9188a6acc6b849d
v3.14-rc1-14-gc4ad8f98bef7
[acme@zoo linux]$
[root@zoo ~]# uname -r
3.15.0-rc8+

Humm, what is the problem?

[root@zoo ~]# perf probe -V getname_flags:65
Available variables at getname_flags:65
@<getname_flags+227>
char* filename
int len
int* empty
long int max
struct filename* result
[root@zoo ~]#
[root@zoo ~]# perf probe 'vfs_getname=getname_flags:65 pathname=result->name:string'
Added new event:
probe:vfs_getname (on getname_flags:65 with pathname=result->name:string)

You can now use it in all perf tools, such as:

perf record -e probe:vfs_getname -aR sleep 1

[root@zoo ~]# perf record -e probe:vfs_getname -aR sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.133 MB perf.data (~49505 samples) ]
[root@zoo ~]# perf evlist
probe:vfs_getname
[root@zoo ~]# perf evlist -v
probe:vfs_getname: sample_freq=1, type: 2, config: 1317, size: 96, sample_type: IP|TID|TIME|CPU|PERIOD|RAW, disabled: 1, inherit: 1, mmap: 1, comm: 1, sample_id_all: 1, exclude_guest: 1
[root@zoo ~]# perf script
perf 11255 [003] 156054.623210: probe:vfs_getname: (ffffffff811c2e43) pathname="/home/acme/libexec/perf-core/sleep"
perf 11255 [003] 156054.624759: probe:vfs_getname: (ffffffff811c2e43) pathname="/usr/lib64/qt-3.3/bin/sleep"
perf 11255 [003] 156054.624782: probe:vfs_getname: (ffffffff811c2e43) pathname="/usr/lib64/ccache/sleep"
perf 11255 [003] 156054.624794: probe:vfs_getname: (ffffffff811c2e43) pathname="/usr/local/sbin/sleep"
perf 11255 [003] 156054.624809: probe:vfs_getname: (ffffffff811c2e43) pathname="/usr/local/bin/sleep"
perf 11255 [003] 156054.624818: probe:vfs_getname: (ffffffff811c2e43) pathname="/sbin/sleep"
perf 11255 [003] 156054.625017: probe:vfs_getname: (ffffffff811c2e43) pathname="/bin/sleep"
sleep 11255 [002] 156054.626093: probe:vfs_getname: (ffffffff811c2e43) pathname="/etc/ld.so.preload"
sleep 11255 [002] 156054.626114: probe:vfs_getname: (ffffffff811c2e43) pathname="/etc/ld.so.cache"
sleep 11255 [002] 156054.626159: probe:vfs_getname: (ffffffff811c2e43) pathname="/lib64/libc.so.6"
sleep 11255 [002] 156054.626751: probe:vfs_getname: (ffffffff811c2e43) pathname="/usr/lib/locale/locale-archive"
goa-daemon 2082 [003] 156054.955138: probe:vfs_getname: (ffffffff811c2e43) pathname="/etc/localtime"
goa-daemon 2082 [003] 156054.955573: probe:vfs_getname: (ffffffff811c2e43) pathname="/etc/localtime"
[root@zoo ~]#

Best possible way to do this? Guess not, but I'm looking from a tooling
perspective, i.e. about using what is available, not about adding requirements
to the kernel or toolchain, that we can do after we prototype in the best way
possible with existing facilities.

- Arnaldo

[1] And I feel like all of tools/perf/ is just that, reference implementations, but hopefully
done in a such a way that may well be useful as-is :-)

2014-06-23 11:41:50

by Stanislav Fomichev

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

> > But we then need to predefine many probes for decoding to work in the form of
> > func:offset, and then play catch-up with all the kernel changes.
> > Or I miss something important here?
>
> No you don't.
>
> If we want to disturb the system in the least way possible, we need to
> tag along the copying from userspace of those pointers, so that we get
> them fresh and just stash it in our ring buffer and get out of the way
> quickly.
I just thought maybe you have some grand plan in mind about automagically
adding probes so argument tracing works transparently. I like the
approach though.

> Almost a year ago, and it still works, now lets see the cset you mention...
>
> [acme@zoo linux]$ git describe c4ad8f98bef77c7356aa6a9ad9188a6acc6b849d
> v3.14-rc1-14-gc4ad8f98bef7
> [acme@zoo linux]$
> [root@zoo ~]# uname -r
> 3.15.0-rc8+
>
> Humm, what is the problem?
I thought that result->name was actually set on 65th line of
getname_flags, so the above commit would move it to 66th. But it's not
the case, sorry for confusion.

> [1] And I feel like all of tools/perf/ is just that, reference implementations, but hopefully
> done in a such a way that may well be useful as-is :-)
I'd like perf to be a goto tool for all kind of performance analysis,
not just a reference implementation. I believe nobody looks at this
reference, and we end up with tools like https://github.com/draios/sysdig
which do their own events, ring buffer, etc.

2014-06-23 14:00:22

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

On 6/20/14, 9:24 AM, Arnaldo Carvalho de Melo wrote:
> Right now it is too simple, but I was starting to work (when you jumped
> right in with your work making me stop and go on testing/reviewing :) )
> on making it more generic so that we could defer pretty printing the
> arguments from sys_enter to sys_exit, when, by then, we would already
> have an association of a user level pointer in some specific thread to
> its contents.
>
> This will allow us to to resolve the pathname pointer in things like
> open() (i.e. not just after that, in the fd syscalls (write, etc)) and
> as well any other pointer of interest.
>
> By librarizing 'builtin-probe.c', that now uses lots of global
> variables, etc, we would be able to insert probes where we want them to
> capture the contents of pointers, check if the probes are already in
> place, use just the ones that we managed to insert (i.e. that were not
> invalid because the places where we wanted them to be were changed
> across kernel releases, etc).
>
> I.e. no need for actual tracepoints from day one, just wannabe
> tracepoints using whatever probe inserting gizmo the kprobes_tracer used
> by 'perf probe' now thinks its best to use.
>
> Combine that with using DWARF descriptions (that could be pre cached
> into something like CTF (the DTrace kind of CTF) or similar) like pahole
> does and we would mostly automatically do all this work of prettyfing
> syscall parameters.
>
> </handwave>

That was so much handwaving you could keep cool at a World Cup game. :-)

David

2014-06-23 14:06:24

[permalink] [raw]

Subject: Re: [PATCH 6/7] perf kvm: move perf_kvm__mmap_read into session utils

On 6/20/14, 8:44 AM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Jun 20, 2014 at 02:49:48PM +0400, Stanislav Fomichev escreveu:
>> It will be reused by perf trace in the following commit.
>
> I know this is needed, but one of the goals when I started builtin-trace
> was not to use perf_session at all, as it initially was just about the
> live mode.
>
> record/replay came later and ended up needing it,

perf-kvm does not use the session code because I kept tripping over the
out of order timestamp problem. By limiting the number of events read in
a single pass I was able to smooth out the load and avoided hitting it
as often. As I mentioned before the final piece was using the perf_clock
timestamp at the start of the round to ensure I never hit it.

Stanislav: are you seeing out of order timestamp errors or just trying
to reuse the live code from perf-kvm?

David

2014-06-23 14:15:06

by Stanislav Fomichev

[permalink] [raw]

Subject: Re: [PATCH 6/7] perf kvm: move perf_kvm__mmap_read into session utils

> Stanislav: are you seeing out of order timestamp errors or just trying to
> reuse the live code from perf-kvm?
Yes, I need events ordering, otherwise I see:
1. pagefault event
2. comm and mmap events for the process above

Hence I have to print cryptic messages without any details.

As Arnaldo pointed out, I have to look at Jiri Olsa's 'ordered samples
queue' patches; I probably can use only the queue, without the perf session.

2014-06-24 07:17:33

by Namhyung Kim

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

Hi Arnaldo and Stanislav,

On Fri, 20 Jun 2014 10:21:05 -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Jun 20, 2014 at 02:49:42PM +0400, Stanislav Fomichev escreveu:
>> This patch series adds support for pagefaults tracing to 'perf trace' command.
>> It seems this feature was planned by Namhyung Kim (http://events.linuxfoundation.org/images/stories/pdf/klf2012_n_kim.pdf page 17/28)
>> but I couldn't find any prior patches/discussion and started from scratch.
>
> Just to clarify here, those slides came from slides I made and in turn
> the whole idea about pagefaults tracing I got from the trace prototype
> that Thomas Gleixner implemented in his 'trace' utility, described
> here:
>
> Announcing a new utility: 'trace'
> http://lwn.net/Articles/415728/

Right, I asked to Arnaldo to suggest some cool topics to introduce in
KLF 2012 and that was it. I had nothing with the features. :)

Keep going nice works, guys!

Thanks,
Namhyung

2014-06-24 12:46:33

by Stanislav Fomichev

[permalink] [raw]

Subject: Re: [PATCH 2/7] perf trace: add support for pagefault tracing

> > I think we may try to print [ip.dso+ip.offset] if we can't resolve ip symbol,
> > but I don't want dso@symbol+offset because if we have symbol, dso is
> > probably (?) redundant (ok, at least for me).
>
> Well, I don't think it is :-)
>
> dso->short_name should disambiguate 99.9% of the cases tho. Perhaps we
> can use --verbose, as in other places, to switch using dso->long_name,
> for cases where multiple versions of a library are used, some in non
> standard paths, which sometimes puzzles people looking at the tools
> output.
If we use --verbose it will result in noise besides short/long names
(like looking up debug symbols, etc).
But I'm still not convinced we need to always print dso of ip, we already have
target process and symbol (if we have dbg symbols), this should be
enough to understand what code path triggered fault.

> > It also seems we don't need to resolve symbol of pagefault address
>
> Well, in some cases if we can resolve to a variable, why not?
If we fault into data map, we resolve to nothing, right? But if we
fault into some executable map, we can resolve some symbol. But is it
really useful to know (because for me it's much more interesting to know
which dso and offset we fault into, not the actual symbol)?

What I think makes sense it to have current (default) format:
[ip.sym+ip.offs] => [addr.dso.long+addr.offs]

And --verbose format, which will use dso.long@symbol+offset for addr and ip.
So by default we have somewhat readable and concise information, and if we
need to gather all possible details we may use --verbose. What do you
think?

2014-06-24 14:15:10

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] perf trace pagefaults

Em Mon, Jun 23, 2014 at 03:41:39PM +0400, Stanislav Fomichev escreveu:
> > > But we then need to predefine many probes for decoding to work in the form of
> > > func:offset, and then play catch-up with all the kernel changes.
> > > Or I miss something important here?

> > No you don't.

> > If we want to disturb the system in the least way possible, we need to
> > tag along the copying from userspace of those pointers, so that we get
> > them fresh and just stash it in our ring buffer and get out of the way
> > quickly.

> I just thought maybe you have some grand plan in mind about automagically
> adding probes so argument tracing works transparently. I like the
> approach though.

First we use what we have in place, then we optimize it.

> > Almost a year ago, and it still works, now lets see the cset you mention...
> >
> > [acme@zoo linux]$ git describe c4ad8f98bef77c7356aa6a9ad9188a6acc6b849d
> > v3.14-rc1-14-gc4ad8f98bef7
> > [acme@zoo linux]$
> > [root@zoo ~]# uname -r
> > 3.15.0-rc8+
> >
> > Humm, what is the problem?

> I thought that result->name was actually set on 65th line of
> getname_flags, so the above commit would move it to 66th. But it's not
> the case, sorry for confusion.

> > [1] And I feel like all of tools/perf/ is just that, reference implementations, but hopefully
> > done in a such a way that may well be useful as-is :-)

> I'd like perf to be a goto tool for all kind of performance analysis,

yay, and you're working for that, thanks!

> not just a reference implementation. I believe nobody looks at this
> reference, and we end up with tools like https://github.com/draios/sysdig

Never heard about it, will take a look, thanks for the pointer.

> which do their own events, ring buffer, etc.

There are several out there :)

- Arnaldo

2014-06-24 15:21:15

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH 2/7] perf trace: add support for pagefault tracing

Em Tue, Jun 24, 2014 at 04:46:20PM +0400, Stanislav Fomichev escreveu:
> > > I think we may try to print [ip.dso+ip.offset] if we can't resolve ip symbol,
> > > but I don't want dso@symbol+offset because if we have symbol, dso is
> > > probably (?) redundant (ok, at least for me).

> > Well, I don't think it is :-)

> > dso->short_name should disambiguate 99.9% of the cases tho. Perhaps we
> > can use --verbose, as in other places, to switch using dso->long_name,
> > for cases where multiple versions of a library are used, some in non
> > standard paths, which sometimes puzzles people looking at the tools
> > output.

> If we use --verbose it will result in noise besides short/long names
> (like looking up debug symbols, etc).
> But I'm still not convinced we need to always print dso of ip, we already have
> target process and symbol (if we have dbg symbols), this should be
> enough to understand what code path triggered fault.

> > > It also seems we don't need to resolve symbol of pagefault address

> > Well, in some cases if we can resolve to a variable, why not?
> If we fault into data map, we resolve to nothing, right? But if we
> fault into some executable map, we can resolve some symbol. But is it
> really useful to know (because for me it's much more interesting to know
> which dso and offset we fault into, not the actual symbol)?

> What I think makes sense it to have current (default) format:
> [ip.sym+ip.offs] => [addr.dso.long+addr.offs]

> And --verbose format, which will use dso.long@symbol+offset for addr and ip.
> So by default we have somewhat readable and concise information, and if we
> need to gather all possible details we may use --verbose. What do you
> think?

We can go with that, yes.

- Arnaldo