2014-06-18 15:02:55

by Stanislav Fomichev

[permalink] [raw]
Subject: [RFC PATCH 0/5] perf trace pagefaults

This patch series adds support for pagefaults tracing to 'perf trace' command.
It seems this feature was planned by Namhyung Kim (http://events.linuxfoundation.org/images/stories/pdf/klf2012_n_kim.pdf page 17/28)
but I couldn't find any prior patches/discussion and started from scratch.

First three patches add the feature and options to enable faults and disable
syscalls.
Two last patches add events caching (like it's done in the perf kvm), so that
we don't get fault events prior to mmap/comm events (makes sense only
for live mode).

This is just a proof-of-concept, and I'd like to get some comments about
where and what I got wrong and what additional useful information I can
expose in the trace.

Stanislav Fomichev (5):
perf trace: add support for pagefault tracing
perf trace: add pagefault statistics
perf trace: add possibility to switch off syscall events
perf kvm: move perf_kvm__mmap_read into session utils
perf trace: add events cache

tools/perf/Documentation/perf-trace.txt | 19 ++
tools/perf/builtin-kvm.c | 88 +-------
tools/perf/builtin-trace.c | 351 ++++++++++++++++++++++----------
tools/perf/util/session.c | 85 ++++++++
tools/perf/util/session.h | 5 +
5 files changed, 353 insertions(+), 195 deletions(-)

--
1.8.3.2


2014-06-18 15:00:56

by Stanislav Fomichev

[permalink] [raw]
Subject: [PATCH 1/5] perf trace: add support for pagefault tracing

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/Documentation/perf-trace.txt | 12 ++
tools/perf/builtin-trace.c | 198 ++++++++++++++++++++++++++------
2 files changed, 176 insertions(+), 34 deletions(-)

diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt
index fae38d9a44a4..7da5f75a45f1 100644
--- a/tools/perf/Documentation/perf-trace.txt
+++ b/tools/perf/Documentation/perf-trace.txt
@@ -107,6 +107,18 @@ the thread executes on the designated CPUs. Default is to monitor all CPUs.
Show tool stats such as number of times fd->pathname was discovered thru
hooking the open syscall return + vfs_getname or via reading /proc/pid/fd, etc.

+-f::
+--pgfaults::
+ Trace major pagefaults. To also trace minor pagefaults, specify this
+ option twice.
+
+EXAMPLES
+--------
+
+Trace syscalls, major and minor pagefaults:
+
+ $ perf trace -f -f
+
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-script[1]
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index f954c26de231..6c7ae048db59 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1211,6 +1211,7 @@ struct trace {
bool summary_only;
bool show_comm;
bool show_tool_stats;
+ int trace_pgfaults;
};

static int trace__set_fd_pathname(struct thread *thread, int fd, const char *pathname)
@@ -1534,7 +1535,9 @@ static size_t syscall__scnprintf_args(struct syscall *sc, char *bf, size_t size,
return printed;
}

-typedef int (*tracepoint_handler)(struct trace *trace, struct perf_evsel *evsel,
+typedef int (*tracepoint_handler)(struct trace *trace,
+ union perf_event *event,
+ struct perf_evsel *evsel,
struct perf_sample *sample);

static struct syscall *trace__syscall_info(struct trace *trace,
@@ -1606,7 +1609,9 @@ static void thread__update_stats(struct thread_trace *ttrace,
update_stats(stats, duration);
}

-static int trace__sys_enter(struct trace *trace, struct perf_evsel *evsel,
+static int trace__sys_enter(struct trace *trace,
+ union perf_event *event __maybe_unused,
+ struct perf_evsel *evsel,
struct perf_sample *sample)
{
char *msg;
@@ -1655,7 +1660,9 @@ static int trace__sys_enter(struct trace *trace, struct perf_evsel *evsel,
return 0;
}

-static int trace__sys_exit(struct trace *trace, struct perf_evsel *evsel,
+static int trace__sys_exit(struct trace *trace,
+ union perf_event *event __maybe_unused,
+ struct perf_evsel *evsel,
struct perf_sample *sample)
{
int ret;
@@ -1734,14 +1741,18 @@ out:
return 0;
}

-static int trace__vfs_getname(struct trace *trace, struct perf_evsel *evsel,
+static int trace__vfs_getname(struct trace *trace,
+ union perf_event *event __maybe_unused,
+ struct perf_evsel *evsel,
struct perf_sample *sample)
{
trace->last_vfs_getname = perf_evsel__rawptr(evsel, sample, "pathname");
return 0;
}

-static int trace__sched_stat_runtime(struct trace *trace, struct perf_evsel *evsel,
+static int trace__sched_stat_runtime(struct trace *trace,
+ union perf_event *event __maybe_unused,
+ struct perf_evsel *evsel,
struct perf_sample *sample)
{
u64 runtime = perf_evsel__intval(evsel, sample, "runtime");
@@ -1768,6 +1779,69 @@ out_dump:
return 0;
}

+static bool valid_dso(struct addr_location *al, struct perf_sample *sample)
+{
+ if (al->map && al->map->dso &&
+ al->map->start <= sample->addr &&
+ al->map->end > sample->addr)
+ return true;
+ else
+ return false;
+}
+
+static int trace__pgfault(struct trace *trace,
+ union perf_event *event,
+ struct perf_evsel *evsel,
+ struct perf_sample *sample)
+{
+ struct thread *thread;
+ u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ struct addr_location al;
+ char map_type = 'd';
+
+ thread = machine__findnew_thread(trace->host, sample->pid, sample->tid);
+
+ thread__find_addr_location(thread, trace->host, cpumode, MAP__FUNCTION,
+ sample->ip, &al);
+
+ trace__fprintf_entry_head(trace, thread, 0, sample->time, trace->output);
+
+ fprintf(trace->output, "%sfault ",
+ evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ?
+ "maj" : "min");
+
+ if (al.sym)
+ fprintf(trace->output, "[%s+0x%lx]",
+ al.sym->name, al.addr - al.sym->start);
+ else
+ fprintf(trace->output, "[0x%lx]", sample->ip);
+
+ fprintf(trace->output, " => ");
+
+ thread__find_addr_map(thread, trace->host, cpumode, MAP__VARIABLE,
+ sample->addr, &al);
+
+ if (!valid_dso(&al, sample)) {
+ thread__find_addr_map(thread, trace->host, cpumode,
+ MAP__FUNCTION, sample->addr, &al);
+
+ if (valid_dso(&al, sample))
+ map_type = 'x';
+ }
+
+ if (valid_dso(&al, sample)) {
+ fprintf(trace->output, "%s+0x%lx",
+ al.map->dso->long_name, sample->addr - al.map->start);
+ } else {
+ map_type = '?';
+ fprintf(trace->output, "0x%lx", sample->addr);
+ }
+
+ fprintf(trace->output, " (%c%c)\n", map_type, al.level);
+
+ return 0;
+}
+
static bool skip_sample(struct trace *trace, struct perf_sample *sample)
{
if ((trace->pid_list && intlist__find(trace->pid_list, sample->pid)) ||
@@ -1781,7 +1855,7 @@ static bool skip_sample(struct trace *trace, struct perf_sample *sample)
}

static int trace__process_sample(struct perf_tool *tool,
- union perf_event *event __maybe_unused,
+ union perf_event *event,
struct perf_sample *sample,
struct perf_evsel *evsel,
struct machine *machine __maybe_unused)
@@ -1799,7 +1873,7 @@ static int trace__process_sample(struct perf_tool *tool,

if (handler) {
++trace->nr_events;
- handler(trace, evsel, sample);
+ handler(trace, event, evsel, sample);
}

return err;
@@ -1826,7 +1900,7 @@ static int parse_target_str(struct trace *trace)
return 0;
}

-static int trace__record(int argc, const char **argv)
+static int trace__record(struct trace *trace, int argc, const char **argv)
{
unsigned int rec_argc, i, j;
const char **rec_argv;
@@ -1835,34 +1909,52 @@ static int trace__record(int argc, const char **argv)
"-R",
"-m", "1024",
"-c", "1",
- "-e",
};

+ const char * const sc_args[] = { "-e", };
+ unsigned int sc_args_nr = ARRAY_SIZE(sc_args);
+ const char * const majpf_args[] = { "-e", "major-faults" };
+ unsigned int majpf_args_nr = ARRAY_SIZE(majpf_args);
+ const char * const minpf_args[] = { "-e", "minor-faults" };
+ unsigned int minpf_args_nr = ARRAY_SIZE(minpf_args);
+
/* +1 is for the event string below */
- rec_argc = ARRAY_SIZE(record_args) + 1 + argc;
+ rec_argc = ARRAY_SIZE(record_args) + sc_args_nr + 1 +
+ majpf_args_nr + minpf_args_nr + argc;
rec_argv = calloc(rec_argc + 1, sizeof(char *));

if (rec_argv == NULL)
return -ENOMEM;

+ j = 0;
for (i = 0; i < ARRAY_SIZE(record_args); i++)
- rec_argv[i] = record_args[i];
+ rec_argv[j++] = record_args[i];
+
+ for (i = 0; i < sc_args_nr; i++)
+ rec_argv[j++] = sc_args[i];

/* event string may be different for older kernels - e.g., RHEL6 */
if (is_valid_tracepoint("raw_syscalls:sys_enter"))
- rec_argv[i] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
+ rec_argv[j++] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
else if (is_valid_tracepoint("syscalls:sys_enter"))
- rec_argv[i] = "syscalls:sys_enter,syscalls:sys_exit";
+ rec_argv[j++] = "syscalls:sys_enter,syscalls:sys_exit";
else {
pr_err("Neither raw_syscalls nor syscalls events exist.\n");
return -1;
}
- i++;

- for (j = 0; j < (unsigned int)argc; j++, i++)
- rec_argv[i] = argv[j];
+ if (trace->trace_pgfaults)
+ for (i = 0; i < majpf_args_nr; i++)
+ rec_argv[j++] = majpf_args[i];
+
+ if (trace->trace_pgfaults > 1)
+ for (i = 0; i < minpf_args_nr; i++)
+ rec_argv[j++] = minpf_args[i];

- return cmd_record(i, rec_argv, NULL);
+ for (i = 0; i < (unsigned int)argc; i++)
+ rec_argv[j++] = argv[i];
+
+ return cmd_record(j, rec_argv, NULL);
}

static size_t trace__fprintf_thread_summary(struct trace *trace, FILE *fp);
@@ -1882,6 +1974,30 @@ static void perf_evlist__add_vfs_getname(struct perf_evlist *evlist)
perf_evlist__add(evlist, evsel);
}

+static int perf_evlist__add_pgfault(struct perf_evlist *evlist,
+ u64 config)
+{
+ struct perf_evsel *evsel;
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_SOFTWARE,
+ .mmap_data = 1,
+ .sample_period = 1,
+ };
+
+ attr.config = config;
+
+ event_attr_init(&attr);
+
+ evsel = perf_evsel__new(&attr);
+ if (!evsel)
+ return -ENOMEM;
+
+ evsel->handler = trace__pgfault;
+ perf_evlist__add(evlist, evsel);
+
+ return 0;
+}
+
static int trace__run(struct trace *trace, int argc, const char **argv)
{
struct perf_evlist *evlist = perf_evlist__new();
@@ -1902,6 +2018,14 @@ static int trace__run(struct trace *trace, int argc, const char **argv)

perf_evlist__add_vfs_getname(evlist);

+ if (trace->trace_pgfaults &&
+ perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MAJ))
+ goto out_error_tp;
+
+ if (trace->trace_pgfaults > 1 &&
+ perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MIN))
+ goto out_error_tp;
+
if (trace->sched &&
perf_evlist__add_newtp(evlist, "sched", "sched_stat_runtime",
trace__sched_stat_runtime))
@@ -1982,7 +2106,8 @@ again:
goto next_event;
}

- if (sample.raw_data == NULL) {
+ if (evsel->attr.type != PERF_TYPE_SOFTWARE &&
+ sample.raw_data == NULL) {
fprintf(trace->output, "%s sample with no payload for tid: %d, cpu %d, raw_size=%d, skipping...\n",
perf_evsel__name(evsel), sample.tid,
sample.cpu, sample.raw_size);
@@ -1990,7 +2115,7 @@ again:
}

handler = evsel->handler;
- handler(trace, evsel, &sample);
+ handler(trace, event, evsel, &sample);
next_event:
perf_evlist__mmap_consume(evlist, i);

@@ -2093,13 +2218,10 @@ static int trace__replay(struct trace *trace)
if (evsel == NULL)
evsel = perf_evlist__find_tracepoint_by_name(session->evlist,
"syscalls:sys_enter");
- if (evsel == NULL) {
- pr_err("Data file does not have raw_syscalls:sys_enter event\n");
- goto out;
- }

- if (perf_evsel__init_syscall_tp(evsel, trace__sys_enter) < 0 ||
- perf_evsel__init_sc_tp_ptr_field(evsel, args)) {
+ if (evsel &&
+ (perf_evsel__init_syscall_tp(evsel, trace__sys_enter) < 0 ||
+ perf_evsel__init_sc_tp_ptr_field(evsel, args))) {
pr_err("Error during initialize raw_syscalls:sys_enter event\n");
goto out;
}
@@ -2109,15 +2231,19 @@ static int trace__replay(struct trace *trace)
if (evsel == NULL)
evsel = perf_evlist__find_tracepoint_by_name(session->evlist,
"syscalls:sys_exit");
- if (evsel == NULL) {
- pr_err("Data file does not have raw_syscalls:sys_exit event\n");
+ if (evsel &&
+ (perf_evsel__init_syscall_tp(evsel, trace__sys_exit) < 0 ||
+ perf_evsel__init_sc_tp_uint_field(evsel, ret))) {
+ pr_err("Error during initialize raw_syscalls:sys_exit event\n");
goto out;
}

- if (perf_evsel__init_syscall_tp(evsel, trace__sys_exit) < 0 ||
- perf_evsel__init_sc_tp_uint_field(evsel, ret)) {
- pr_err("Error during initialize raw_syscalls:sys_exit event\n");
- goto out;
+ evlist__for_each(session->evlist, evsel) {
+ if (evsel->attr.type == PERF_TYPE_SOFTWARE &&
+ (evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ||
+ evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MIN ||
+ evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS))
+ evsel->handler = trace__pgfault;
}

err = parse_target_str(trace);
@@ -2290,6 +2416,8 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
.user_interval = ULLONG_MAX,
.no_buffering = true,
.mmap_pages = 1024,
+ .sample_address = true,
+ .sample_time = true,
},
.output = stdout,
.show_comm = true,
@@ -2330,15 +2458,17 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
"Show only syscall summary with statistics"),
OPT_BOOLEAN('S', "with-summary", &trace.summary,
"Show all syscalls and summary with statistics"),
+ OPT_INCR('f', "pgfaults", &trace.trace_pgfaults, "Trace pagefaults"),
OPT_END()
};
int err;
char bf[BUFSIZ];

- if ((argc > 1) && (strcmp(argv[1], "record") == 0))
- return trace__record(argc-2, &argv[2]);
+ argc = parse_options(argc, argv, trace_options, trace_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);

- argc = parse_options(argc, argv, trace_options, trace_usage, 0);
+ if ((argc >= 1) && (strcmp(argv[0], "record") == 0))
+ return trace__record(&trace, argc-1, &argv[1]);

/* summary_only implies summary option, but don't overwrite summary if set */
if (trace.summary_only)
--
1.8.3.2

2014-06-18 15:01:07

by Stanislav Fomichev

[permalink] [raw]
Subject: [PATCH 3/5] perf trace: add possibility to switch off syscall events

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/Documentation/perf-trace.txt | 7 ++++++
tools/perf/builtin-trace.c | 39 ++++++++++++++++++++++-----------
2 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt
index 7da5f75a45f1..301fc9384d93 100644
--- a/tools/perf/Documentation/perf-trace.txt
+++ b/tools/perf/Documentation/perf-trace.txt
@@ -112,9 +112,16 @@ the thread executes on the designated CPUs. Default is to monitor all CPUs.
Trace major pagefaults. To also trace minor pagefaults, specify this
option twice.

+--syscalls::
+ Trace system calls. This options is enabled by default.
+
EXAMPLES
--------

+Trace only major pagefaults:
+
+ $ perf trace --no-syscalls -f
+
Trace syscalls, major and minor pagefaults:

$ perf trace -f -f
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 535f0b504950..90ae4036e5ee 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1212,6 +1212,7 @@ struct trace {
bool summary_only;
bool show_comm;
bool show_tool_stats;
+ bool trace_syscalls;
int trace_pgfaults;
};

@@ -1940,17 +1941,19 @@ static int trace__record(struct trace *trace, int argc, const char **argv)
for (i = 0; i < ARRAY_SIZE(record_args); i++)
rec_argv[j++] = record_args[i];

- for (i = 0; i < sc_args_nr; i++)
- rec_argv[j++] = sc_args[i];
-
- /* event string may be different for older kernels - e.g., RHEL6 */
- if (is_valid_tracepoint("raw_syscalls:sys_enter"))
- rec_argv[j++] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
- else if (is_valid_tracepoint("syscalls:sys_enter"))
- rec_argv[j++] = "syscalls:sys_enter,syscalls:sys_exit";
- else {
- pr_err("Neither raw_syscalls nor syscalls events exist.\n");
- return -1;
+ if (trace->trace_syscalls) {
+ for (i = 0; i < sc_args_nr; i++)
+ rec_argv[j++] = sc_args[i];
+
+ /* event string may be different for older kernels - e.g., RHEL6 */
+ if (is_valid_tracepoint("raw_syscalls:sys_enter"))
+ rec_argv[j++] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
+ else if (is_valid_tracepoint("syscalls:sys_enter"))
+ rec_argv[j++] = "syscalls:sys_enter,syscalls:sys_exit";
+ else {
+ pr_err("Neither raw_syscalls nor syscalls events exist.\n");
+ return -1;
+ }
}

if (trace->trace_pgfaults)
@@ -2023,10 +2026,13 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
goto out;
}

- if (perf_evlist__add_syscall_newtp(evlist, trace__sys_enter, trace__sys_exit))
+ if (trace->trace_syscalls &&
+ perf_evlist__add_syscall_newtp(evlist, trace__sys_enter,
+ trace__sys_exit))
goto out_error_tp;

- perf_evlist__add_vfs_getname(evlist);
+ if (trace->trace_syscalls)
+ perf_evlist__add_vfs_getname(evlist);

if (trace->trace_pgfaults &&
perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MAJ))
@@ -2432,6 +2438,7 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
},
.output = stdout,
.show_comm = true,
+ .trace_syscalls = true,
};
const char *output_name = NULL;
const char *ev_qualifier_str = NULL;
@@ -2469,6 +2476,7 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
"Show only syscall summary with statistics"),
OPT_BOOLEAN('S', "with-summary", &trace.summary,
"Show all syscalls and summary with statistics"),
+ OPT_BOOLEAN(0, "syscalls", &trace.trace_syscalls, "Trace syscalls"),
OPT_INCR('f', "pgfaults", &trace.trace_pgfaults, "Trace pagefaults"),
OPT_END()
};
@@ -2485,6 +2493,11 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
if (trace.summary_only)
trace.summary = trace.summary_only;

+ if (!trace.trace_syscalls && !trace.trace_pgfaults) {
+ pr_err("Please specify something to trace.\n");
+ return -1;
+ }
+
if (output_name != NULL) {
err = trace__open_output(&trace, output_name);
if (err < 0) {
--
1.8.3.2

2014-06-18 15:01:09

by Stanislav Fomichev

[permalink] [raw]
Subject: [PATCH 4/5] perf kvm: move perf_kvm__mmap_read into session utils

It will be reused by perf trace in the following commit.

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/builtin-kvm.c | 88 +++--------------------------------------------
tools/perf/util/session.c | 85 +++++++++++++++++++++++++++++++++++++++++++++
tools/perf/util/session.h | 5 +++
3 files changed, 94 insertions(+), 84 deletions(-)

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index 0f1e5a2f6ad7..a69ffe7512e5 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -885,89 +885,6 @@ static bool verify_vcpu(int vcpu)
*/
#define PERF_KVM__MAX_EVENTS_PER_MMAP 25

-static s64 perf_kvm__mmap_read_idx(struct perf_kvm_stat *kvm, int idx,
- u64 *mmap_time)
-{
- union perf_event *event;
- struct perf_sample sample;
- s64 n = 0;
- int err;
-
- *mmap_time = ULLONG_MAX;
- while ((event = perf_evlist__mmap_read(kvm->evlist, idx)) != NULL) {
- err = perf_evlist__parse_sample(kvm->evlist, event, &sample);
- if (err) {
- perf_evlist__mmap_consume(kvm->evlist, idx);
- pr_err("Failed to parse sample\n");
- return -1;
- }
-
- err = perf_session_queue_event(kvm->session, event, &sample, 0);
- /*
- * FIXME: Here we can't consume the event, as perf_session_queue_event will
- * point to it, and it'll get possibly overwritten by the kernel.
- */
- perf_evlist__mmap_consume(kvm->evlist, idx);
-
- if (err) {
- pr_err("Failed to enqueue sample: %d\n", err);
- return -1;
- }
-
- /* save time stamp of our first sample for this mmap */
- if (n == 0)
- *mmap_time = sample.time;
-
- /* limit events per mmap handled all at once */
- n++;
- if (n == PERF_KVM__MAX_EVENTS_PER_MMAP)
- break;
- }
-
- return n;
-}
-
-static int perf_kvm__mmap_read(struct perf_kvm_stat *kvm)
-{
- int i, err, throttled = 0;
- s64 n, ntotal = 0;
- u64 flush_time = ULLONG_MAX, mmap_time;
-
- for (i = 0; i < kvm->evlist->nr_mmaps; i++) {
- n = perf_kvm__mmap_read_idx(kvm, i, &mmap_time);
- if (n < 0)
- return -1;
-
- /* flush time is going to be the minimum of all the individual
- * mmap times. Essentially, we flush all the samples queued up
- * from the last pass under our minimal start time -- that leaves
- * a very small race for samples to come in with a lower timestamp.
- * The ioctl to return the perf_clock timestamp should close the
- * race entirely.
- */
- if (mmap_time < flush_time)
- flush_time = mmap_time;
-
- ntotal += n;
- if (n == PERF_KVM__MAX_EVENTS_PER_MMAP)
- throttled = 1;
- }
-
- /* flush queue after each round in which we processed events */
- if (ntotal) {
- kvm->session->ordered_samples.next_flush = flush_time;
- err = kvm->tool.finished_round(&kvm->tool, NULL, kvm->session);
- if (err) {
- if (kvm->lost_events)
- pr_info("\nLost events: %" PRIu64 "\n\n",
- kvm->lost_events);
- return err;
- }
- }
-
- return throttled;
-}
-
static volatile int done;

static void sig_handler(int sig __maybe_unused)
@@ -1133,7 +1050,10 @@ static int kvm_events_live_report(struct perf_kvm_stat *kvm)
while (!done) {
int rc;

- rc = perf_kvm__mmap_read(kvm);
+ rc = perf_session__mmap_read(&kvm->tool,
+ kvm->session,
+ kvm->evlist,
+ PERF_KVM__MAX_EVENTS_PER_MMAP);
if (rc < 0)
break;

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 4526d966b10a..994846060c5e 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1671,3 +1671,88 @@ int __perf_session__set_tracepoints_handlers(struct perf_session *session,
out:
return err;
}
+
+static s64 perf_session__mmap_read_idx(struct perf_session *session,
+ int idx,
+ u64 *mmap_time,
+ int nr_per_mmap)
+{
+ union perf_event *event;
+ struct perf_sample sample;
+ s64 n = 0;
+ int err;
+
+ *mmap_time = ULLONG_MAX;
+ while ((event = perf_evlist__mmap_read(session->evlist, idx)) != NULL) {
+ err = perf_evlist__parse_sample(session->evlist, event, &sample);
+ if (err) {
+ perf_evlist__mmap_consume(session->evlist, idx);
+ pr_err("Failed to parse sample\n");
+ return -1;
+ }
+
+ err = perf_session_queue_event(session, event, &sample, 0);
+ /*
+ * FIXME: Here we can't consume the event, as perf_session_queue_event will
+ * point to it, and it'll get possibly overwritten by the kernel.
+ */
+ perf_evlist__mmap_consume(session->evlist, idx);
+
+ if (err) {
+ pr_err("Failed to enqueue sample: %d\n", err);
+ return -1;
+ }
+
+ /* save time stamp of our first sample for this mmap */
+ if (n == 0)
+ *mmap_time = sample.time;
+
+ /* limit events per mmap handled all at once */
+ n++;
+ if (n == nr_per_mmap)
+ break;
+ }
+
+ return n;
+}
+
+int perf_session__mmap_read(struct perf_tool *tool,
+ struct perf_session *session,
+ struct perf_evlist *evlist,
+ int nr_per_mmap)
+{
+ int i, err, throttled = 0;
+ s64 n, ntotal = 0;
+ u64 flush_time = ULLONG_MAX, mmap_time;
+
+ for (i = 0; i < evlist->nr_mmaps; i++) {
+ n = perf_session__mmap_read_idx(session, i, &mmap_time,
+ nr_per_mmap);
+ if (n < 0)
+ return -1;
+
+ /* flush time is going to be the minimum of all the individual
+ * mmap times. Essentially, we flush all the samples queued up
+ * from the last pass under our minimal start time -- that leaves
+ * a very small race for samples to come in with a lower timestamp.
+ * The ioctl to return the perf_clock timestamp should close the
+ * race entirely.
+ */
+ if (mmap_time < flush_time)
+ flush_time = mmap_time;
+
+ ntotal += n;
+ if (n == nr_per_mmap)
+ throttled = 1;
+ }
+
+ /* flush queue after each round in which we processed events */
+ if (ntotal) {
+ session->ordered_samples.next_flush = flush_time;
+ err = tool->finished_round(tool, NULL, session);
+ if (err)
+ return err;
+ }
+
+ return throttled;
+}
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 9494fb68828a..e79da3c1071e 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -133,4 +133,9 @@ int __perf_session__set_tracepoints_handlers(struct perf_session *session,
extern volatile int session_done;

#define session_done() (*(volatile int *)(&session_done))
+
+int perf_session__mmap_read(struct perf_tool *tool,
+ struct perf_session *session,
+ struct perf_evlist *evlist,
+ int nr_per_mmap);
#endif /* __PERF_SESSION_H */
--
1.8.3.2

2014-06-18 15:02:35

by Stanislav Fomichev

[permalink] [raw]
Subject: [PATCH 5/5] perf trace: add events cache

Use session to cache and order events from all CPUs so we get
correct ordering of comms/mmaps and faults.

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/builtin-trace.c | 117 ++++++++++++++++++---------------------------
1 file changed, 46 insertions(+), 71 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 90ae4036e5ee..7135459b5720 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -20,6 +20,11 @@
#include <sys/mman.h>
#include <linux/futex.h>

+/* keeping the max events to a modest level to keep
+ * the processing of samples per mmap smooth.
+ */
+#define PERF_TRACE__MAX_EVENTS_PER_MMAP 25
+
/* For older distros: */
#ifndef MAP_STACK
# define MAP_STACK 0x20000
@@ -2013,10 +2018,12 @@ static int perf_evlist__add_pgfault(struct perf_evlist *evlist,

static int trace__run(struct trace *trace, int argc, const char **argv)
{
+ struct perf_data_file file = {
+ .mode = PERF_DATA_MODE_WRITE,
+ };
+ struct perf_session *session;
struct perf_evlist *evlist = perf_evlist__new();
- struct perf_evsel *evsel;
- int err = -1, i;
- unsigned long before;
+ int err = -1;
const bool forks = argc > 0;

trace->live = true;
@@ -2061,6 +2068,18 @@ static int trace__run(struct trace *trace, int argc, const char **argv)

perf_evlist__config(evlist, &trace->opts);

+ session = perf_session__new(&file, false, &trace->tool);
+ if (session == NULL) {
+ err = -ENOMEM;
+ goto out_delete_evlist;
+ }
+ session->evlist = evlist;
+ perf_session__set_id_hdr_size(session);
+ machine__synthesize_threads(&session->machines.host,
+ &trace->opts.target,
+ evlist->threads, false);
+ trace->host = &session->machines.host;
+
signal(SIGCHLD, sig_handler);
signal(SIGINT, sig_handler);

@@ -2080,7 +2099,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
err = perf_evlist__mmap(evlist, trace->opts.mmap_pages, false);
if (err < 0) {
fprintf(trace->output, "Couldn't mmap the events: %s\n", strerror(errno));
- goto out_delete_evlist;
+ goto out_delete_session;
}

perf_evlist__enable(evlist);
@@ -2089,67 +2108,19 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
perf_evlist__start_workload(evlist);

trace->multiple_threads = evlist->threads->map[0] == -1 || evlist->threads->nr > 1;
-again:
- before = trace->nr_events;
-
- for (i = 0; i < evlist->nr_mmaps; i++) {
- union perf_event *event;
-
- while ((event = perf_evlist__mmap_read(evlist, i)) != NULL) {
- const u32 type = event->header.type;
- tracepoint_handler handler;
- struct perf_sample sample;
-
- ++trace->nr_events;

- err = perf_evlist__parse_sample(evlist, event, &sample);
- if (err) {
- fprintf(trace->output, "Can't parse sample, err = %d, skipping...\n", err);
- goto next_event;
- }
-
- if (!trace->full_time && trace->base_time == 0)
- trace->base_time = sample.time;
-
- if (type != PERF_RECORD_SAMPLE) {
- trace__process_event(trace, trace->host, event, &sample);
- continue;
- }
-
- evsel = perf_evlist__id2evsel(evlist, sample.id);
- if (evsel == NULL) {
- fprintf(trace->output, "Unknown tp ID %" PRIu64 ", skipping...\n", sample.id);
- goto next_event;
- }
-
- if (evsel->attr.type != PERF_TYPE_SOFTWARE &&
- sample.raw_data == NULL) {
- fprintf(trace->output, "%s sample with no payload for tid: %d, cpu %d, raw_size=%d, skipping...\n",
- perf_evsel__name(evsel), sample.tid,
- sample.cpu, sample.raw_size);
- goto next_event;
- }
-
- handler = evsel->handler;
- handler(trace, event, evsel, &sample);
-next_event:
- perf_evlist__mmap_consume(evlist, i);
-
- if (interrupted)
- goto out_disable;
- }
- }
+ while (!done) {
+ int rc;

- if (trace->nr_events == before) {
- int timeout = done ? 100 : -1;
+ rc = perf_session__mmap_read(&trace->tool, session, evlist,
+ PERF_TRACE__MAX_EVENTS_PER_MMAP);
+ if (rc < 0)
+ break;

- if (poll(evlist->pollfd, evlist->nr_fds, timeout) > 0)
- goto again;
- } else {
- goto again;
+ if (!rc && !done)
+ err = poll(evlist->pollfd, evlist->nr_fds, 100);
}

-out_disable:
perf_evlist__disable(evlist);

if (!err) {
@@ -2165,6 +2136,8 @@ out_disable:
}
}

+out_delete_session:
+ perf_session__delete(session);
out_delete_evlist:
perf_evlist__delete(evlist);
out:
@@ -2199,17 +2172,6 @@ static int trace__replay(struct trace *trace)
struct perf_evsel *evsel;
int err = -1;

- trace->tool.sample = trace__process_sample;
- trace->tool.mmap = perf_event__process_mmap;
- trace->tool.mmap2 = perf_event__process_mmap2;
- trace->tool.comm = perf_event__process_comm;
- trace->tool.exit = perf_event__process_exit;
- trace->tool.fork = perf_event__process_fork;
- trace->tool.attr = perf_event__process_attr;
- trace->tool.tracing_data = perf_event__process_tracing_data;
- trace->tool.build_id = perf_event__process_build_id;
-
- trace->tool.ordered_samples = true;
trace->tool.ordering_requires_timestamps = true;

/* add tid to output */
@@ -2538,6 +2500,19 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
if (!argc && target__none(&trace.opts.target))
trace.opts.target.system_wide = true;

+ trace.tool.sample = trace__process_sample;
+ trace.tool.mmap = perf_event__process_mmap;
+ trace.tool.mmap2 = perf_event__process_mmap2;
+ trace.tool.comm = perf_event__process_comm;
+ trace.tool.exit = perf_event__process_exit;
+ trace.tool.fork = perf_event__process_fork;
+ trace.tool.attr = perf_event__process_attr;
+ trace.tool.tracing_data = perf_event__process_tracing_data;
+ trace.tool.build_id = perf_event__process_build_id;
+
+ trace.tool.ordered_samples = true;
+ perf_tool__fill_defaults(&trace.tool);
+
if (input_name)
err = trace__replay(&trace);
else
--
1.8.3.2

2014-06-18 15:02:54

by Stanislav Fomichev

[permalink] [raw]
Subject: [PATCH 2/5] perf trace: add pagefault statistics

Signed-off-by: Stanislav Fomichev <[email protected]>
---
tools/perf/builtin-trace.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 6c7ae048db59..535f0b504950 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1132,6 +1132,7 @@ struct thread_trace {
u64 exit_time;
bool entry_pending;
unsigned long nr_events;
+ unsigned long pfmaj, pfmin;
char *entry_str;
double runtime_ms;
struct {
@@ -1798,8 +1799,12 @@ static int trace__pgfault(struct trace *trace,
u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
struct addr_location al;
char map_type = 'd';
+ struct thread_trace *ttrace;

thread = machine__findnew_thread(trace->host, sample->pid, sample->tid);
+ ttrace = thread__trace(thread, trace->output);
+ if (ttrace == NULL)
+ return -1;

thread__find_addr_location(thread, trace->host, cpumode, MAP__FUNCTION,
sample->ip, &al);
@@ -1810,6 +1815,11 @@ static int trace__pgfault(struct trace *trace,
evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ?
"maj" : "min");

+ if (evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ)
+ ttrace->pfmaj++;
+ else
+ ttrace->pfmin++;
+
if (al.sym)
fprintf(trace->output, "[%s+0x%lx]",
al.sym->name, al.addr - al.sym->start);
@@ -2343,6 +2353,7 @@ static int trace__fprintf_one_thread(struct thread *thread, void *priv)
printed += fprintf(fp, " %s (%d), ", thread__comm_str(thread), thread->tid);
printed += fprintf(fp, "%lu events, ", ttrace->nr_events);
printed += fprintf(fp, "%.1f%%", ratio);
+ printed += fprintf(fp, ", %lu:%lu faults", ttrace->pfmaj, ttrace->pfmin);
printed += fprintf(fp, ", %.3f msec\n", ttrace->runtime_ms);
printed += thread__dump_stats(ttrace, trace, fp);

--
1.8.3.2

2014-06-18 15:47:58

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 1/5] perf trace: add support for pagefault tracing

First: thanks for working on this!

Em Wed, Jun 18, 2014 at 06:59:21PM +0400, Stanislav Fomichev escreveu:
> Signed-off-by: Stanislav Fomichev <[email protected]>

No single line on the changelog? Please add a description here, with
examples of it in use, etc.

Also you're doing various things on the same patch, please read on to
see how it should be broken into multiple changesets.

> ---
> tools/perf/Documentation/perf-trace.txt | 12 ++
> tools/perf/builtin-trace.c | 198 ++++++++++++++++++++++++++------
> 2 files changed, 176 insertions(+), 34 deletions(-)
>
> diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt
> index fae38d9a44a4..7da5f75a45f1 100644
> --- a/tools/perf/Documentation/perf-trace.txt
> +++ b/tools/perf/Documentation/perf-trace.txt
> @@ -107,6 +107,18 @@ the thread executes on the designated CPUs. Default is to monitor all CPUs.
> Show tool stats such as number of times fd->pathname was discovered thru
> hooking the open syscall return + vfs_getname or via reading /proc/pid/fd, etc.
>
> +-f::
> +--pgfaults::
> + Trace major pagefaults. To also trace minor pagefaults, specify this
> + option twice.

I try, when possible, to not use short options that are used in
'strace', so what if we use 'F' here?

And:

trace --pgfaults --pgfaults

for major and min page faults looks ugly, what if we instead use --pf
for both, and allow passing min or maj as args?

I.e.:

For both major and minor:

trace --pf

for just major page faults:

trace --pf maj

> +
> +EXAMPLES
> +--------
> +
> +Trace syscalls, major and minor pagefaults:
> +
> + $ perf trace -f -f
> +
> SEE ALSO
> --------
> linkperf:perf-record[1], linkperf:perf-script[1]
> diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> index f954c26de231..6c7ae048db59 100644
> --- a/tools/perf/builtin-trace.c
> +++ b/tools/perf/builtin-trace.c
> @@ -1211,6 +1211,7 @@ struct trace {
> bool summary_only;
> bool show_comm;
> bool show_tool_stats;
> + int trace_pgfaults;

uint8_t should be enough

> };
>
> static int trace__set_fd_pathname(struct thread *thread, int fd, const char *pathname)
> @@ -1534,7 +1535,9 @@ static size_t syscall__scnprintf_args(struct syscall *sc, char *bf, size_t size,
> return printed;
> }
>
> -typedef int (*tracepoint_handler)(struct trace *trace, struct perf_evsel *evsel,
> +typedef int (*tracepoint_handler)(struct trace *trace,
> + union perf_event *event,
> + struct perf_evsel *evsel,
> struct perf_sample *sample);

You'll reduce patch size if you leave the first line alone and just add
the new parameter (event) after evsel.

It'll become then:

typedef int (*tracepoint_handler)(struct trace *trace, struct perf_evsel *evsel,
+ union perf_event *event,
struct perf_sample *sample);

Then please make one separate patch adding this new parameter, stating
that it will be used by pagefault tracing, that will come in a followup
patch in this series.

>
> static struct syscall *trace__syscall_info(struct trace *trace,
> @@ -1606,7 +1609,9 @@ static void thread__update_stats(struct thread_trace *ttrace,
> update_stats(stats, duration);
> }
>
> -static int trace__sys_enter(struct trace *trace, struct perf_evsel *evsel,
> +static int trace__sys_enter(struct trace *trace,
> + union perf_event *event __maybe_unused,
> + struct perf_evsel *evsel,
> struct perf_sample *sample)
> {
> char *msg;
> @@ -1655,7 +1660,9 @@ static int trace__sys_enter(struct trace *trace, struct perf_evsel *evsel,
> return 0;
> }
>
> -static int trace__sys_exit(struct trace *trace, struct perf_evsel *evsel,
> +static int trace__sys_exit(struct trace *trace,
> + union perf_event *event __maybe_unused,
> + struct perf_evsel *evsel,
> struct perf_sample *sample)
> {
> int ret;
> @@ -1734,14 +1741,18 @@ out:
> return 0;
> }
>
> -static int trace__vfs_getname(struct trace *trace, struct perf_evsel *evsel,
> +static int trace__vfs_getname(struct trace *trace,
> + union perf_event *event __maybe_unused,
> + struct perf_evsel *evsel,
> struct perf_sample *sample)
> {
> trace->last_vfs_getname = perf_evsel__rawptr(evsel, sample, "pathname");
> return 0;
> }
>
> -static int trace__sched_stat_runtime(struct trace *trace, struct perf_evsel *evsel,
> +static int trace__sched_stat_runtime(struct trace *trace,
> + union perf_event *event __maybe_unused,
> + struct perf_evsel *evsel,
> struct perf_sample *sample)
> {
> u64 runtime = perf_evsel__intval(evsel, sample, "runtime");
> @@ -1768,6 +1779,69 @@ out_dump:
> return 0;
> }
>
> +static bool valid_dso(struct addr_location *al, struct perf_sample *sample)

Humm, can't this be reduced to just:

return al->map && al->map->dso;

? I.e. if the helper returned a dso, it is because the address that was
looked up is in that dso, no?

I even guess that if there is al->map, that should be enough to check,
byt haven't thought this thru nor looked at the relevant sources, in a
hurry now.

> +{
> + if (al->map && al->map->dso &&
> + al->map->start <= sample->addr &&
> + al->map->end > sample->addr)
> + return true;
> + else
> + return false;
> +}
> +
> +static int trace__pgfault(struct trace *trace,
> + union perf_event *event,
> + struct perf_evsel *evsel,
> + struct perf_sample *sample)
> +{
> + struct thread *thread;
> + u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> + struct addr_location al;
> + char map_type = 'd';
> +
> + thread = machine__findnew_thread(trace->host, sample->pid, sample->tid);
> +
> + thread__find_addr_location(thread, trace->host, cpumode, MAP__FUNCTION,
> + sample->ip, &al);
> +
> + trace__fprintf_entry_head(trace, thread, 0, sample->time, trace->output);
> +
> + fprintf(trace->output, "%sfault ",
> + evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ?
> + "maj" : "min");
> +
> + if (al.sym)
> + fprintf(trace->output, "[%s+0x%lx]",
> + al.sym->name, al.addr - al.sym->start);
> + else
> + fprintf(trace->output, "[0x%lx]", sample->ip);
> +
> + fprintf(trace->output, " => ");
> +
> + thread__find_addr_map(thread, trace->host, cpumode, MAP__VARIABLE,
> + sample->addr, &al);
> +
> + if (!valid_dso(&al, sample)) {
> + thread__find_addr_map(thread, trace->host, cpumode,
> + MAP__FUNCTION, sample->addr, &al);
> +
> + if (valid_dso(&al, sample))
> + map_type = 'x';
> + }
> +
> + if (valid_dso(&al, sample)) {
> + fprintf(trace->output, "%s+0x%lx",
> + al.map->dso->long_name, sample->addr - al.map->start);
> + } else {
> + map_type = '?';
> + fprintf(trace->output, "0x%lx", sample->addr);
> + }
> +
> + fprintf(trace->output, " (%c%c)\n", map_type, al.level);
> +
> + return 0;
> +}
> +
> static bool skip_sample(struct trace *trace, struct perf_sample *sample)
> {
> if ((trace->pid_list && intlist__find(trace->pid_list, sample->pid)) ||
> @@ -1781,7 +1855,7 @@ static bool skip_sample(struct trace *trace, struct perf_sample *sample)
> }
>
> static int trace__process_sample(struct perf_tool *tool,
> - union perf_event *event __maybe_unused,
> + union perf_event *event,
> struct perf_sample *sample,
> struct perf_evsel *evsel,
> struct machine *machine __maybe_unused)
> @@ -1799,7 +1873,7 @@ static int trace__process_sample(struct perf_tool *tool,
>
> if (handler) {
> ++trace->nr_events;
> - handler(trace, evsel, sample);
> + handler(trace, event, evsel, sample);
> }
>
> return err;
> @@ -1826,7 +1900,7 @@ static int parse_target_str(struct trace *trace)
> return 0;
> }
>
> -static int trace__record(int argc, const char **argv)
> +static int trace__record(struct trace *trace, int argc, const char **argv)
> {
> unsigned int rec_argc, i, j;
> const char **rec_argv;
> @@ -1835,34 +1909,52 @@ static int trace__record(int argc, const char **argv)
> "-R",
> "-m", "1024",
> "-c", "1",
> - "-e",
> };
>
> + const char * const sc_args[] = { "-e", };
> + unsigned int sc_args_nr = ARRAY_SIZE(sc_args);
> + const char * const majpf_args[] = { "-e", "major-faults" };
> + unsigned int majpf_args_nr = ARRAY_SIZE(majpf_args);
> + const char * const minpf_args[] = { "-e", "minor-faults" };
> + unsigned int minpf_args_nr = ARRAY_SIZE(minpf_args);
> +

Please separate adding page fault tracing recording on a file in a
separate patch from adding it to live mode, then the changelog message
can concentrate on examples for each feature.

> /* +1 is for the event string below */
> - rec_argc = ARRAY_SIZE(record_args) + 1 + argc;
> + rec_argc = ARRAY_SIZE(record_args) + sc_args_nr + 1 +
> + majpf_args_nr + minpf_args_nr + argc;
> rec_argv = calloc(rec_argc + 1, sizeof(char *));
>
> if (rec_argv == NULL)
> return -ENOMEM;
>
> + j = 0;
> for (i = 0; i < ARRAY_SIZE(record_args); i++)
> - rec_argv[i] = record_args[i];
> + rec_argv[j++] = record_args[i];
> +
> + for (i = 0; i < sc_args_nr; i++)
> + rec_argv[j++] = sc_args[i];
>
> /* event string may be different for older kernels - e.g., RHEL6 */
> if (is_valid_tracepoint("raw_syscalls:sys_enter"))
> - rec_argv[i] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
> + rec_argv[j++] = "raw_syscalls:sys_enter,raw_syscalls:sys_exit";
> else if (is_valid_tracepoint("syscalls:sys_enter"))
> - rec_argv[i] = "syscalls:sys_enter,syscalls:sys_exit";
> + rec_argv[j++] = "syscalls:sys_enter,syscalls:sys_exit";
> else {
> pr_err("Neither raw_syscalls nor syscalls events exist.\n");
> return -1;
> }
> - i++;
>
> - for (j = 0; j < (unsigned int)argc; j++, i++)
> - rec_argv[i] = argv[j];
> + if (trace->trace_pgfaults)
> + for (i = 0; i < majpf_args_nr; i++)
> + rec_argv[j++] = majpf_args[i];
> +
> + if (trace->trace_pgfaults > 1)
> + for (i = 0; i < minpf_args_nr; i++)
> + rec_argv[j++] = minpf_args[i];
>
> - return cmd_record(i, rec_argv, NULL);
> + for (i = 0; i < (unsigned int)argc; i++)
> + rec_argv[j++] = argv[i];
> +
> + return cmd_record(j, rec_argv, NULL);
> }
>
> static size_t trace__fprintf_thread_summary(struct trace *trace, FILE *fp);
> @@ -1882,6 +1974,30 @@ static void perf_evlist__add_vfs_getname(struct perf_evlist *evlist)
> perf_evlist__add(evlist, evsel);
> }
>
> +static int perf_evlist__add_pgfault(struct perf_evlist *evlist,
> + u64 config)
> +{
> + struct perf_evsel *evsel;
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_SOFTWARE,
> + .mmap_data = 1,
> + .sample_period = 1,
> + };
> +
> + attr.config = config;
> +
> + event_attr_init(&attr);
> +
> + evsel = perf_evsel__new(&attr);
> + if (!evsel)
> + return -ENOMEM;
> +
> + evsel->handler = trace__pgfault;
> + perf_evlist__add(evlist, evsel);
> +
> + return 0;
> +}
> +
> static int trace__run(struct trace *trace, int argc, const char **argv)
> {
> struct perf_evlist *evlist = perf_evlist__new();
> @@ -1902,6 +2018,14 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
>
> perf_evlist__add_vfs_getname(evlist);
>
> + if (trace->trace_pgfaults &&
> + perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MAJ))
> + goto out_error_tp;
> +
> + if (trace->trace_pgfaults > 1 &&
> + perf_evlist__add_pgfault(evlist, PERF_COUNT_SW_PAGE_FAULTS_MIN))
> + goto out_error_tp;
> +
> if (trace->sched &&
> perf_evlist__add_newtp(evlist, "sched", "sched_stat_runtime",
> trace__sched_stat_runtime))
> @@ -1982,7 +2106,8 @@ again:
> goto next_event;
> }
>
> - if (sample.raw_data == NULL) {
> + if (evsel->attr.type != PERF_TYPE_SOFTWARE &&
> + sample.raw_data == NULL) {

This looks like a separate patch with relevant associated changelog
message explaining why this is needed.

> fprintf(trace->output, "%s sample with no payload for tid: %d, cpu %d, raw_size=%d, skipping...\n",
> perf_evsel__name(evsel), sample.tid,
> sample.cpu, sample.raw_size);
> @@ -1990,7 +2115,7 @@ again:
> }
>
> handler = evsel->handler;
> - handler(trace, evsel, &sample);
> + handler(trace, event, evsel, &sample);
> next_event:
> perf_evlist__mmap_consume(evlist, i);
>
> @@ -2093,13 +2218,10 @@ static int trace__replay(struct trace *trace)
> if (evsel == NULL)
> evsel = perf_evlist__find_tracepoint_by_name(session->evlist,
> "syscalls:sys_enter");
> - if (evsel == NULL) {
> - pr_err("Data file does not have raw_syscalls:sys_enter event\n");
> - goto out;
> - }
>
> - if (perf_evsel__init_syscall_tp(evsel, trace__sys_enter) < 0 ||
> - perf_evsel__init_sc_tp_ptr_field(evsel, args)) {
> + if (evsel &&
> + (perf_evsel__init_syscall_tp(evsel, trace__sys_enter) < 0 ||
> + perf_evsel__init_sc_tp_ptr_field(evsel, args))) {
> pr_err("Error during initialize raw_syscalls:sys_enter event\n");
> goto out;

the above can be ditched, not needed. Care to explain if you think
otherwise?

> }
> @@ -2109,15 +2231,19 @@ static int trace__replay(struct trace *trace)
> if (evsel == NULL)
> evsel = perf_evlist__find_tracepoint_by_name(session->evlist,
> "syscalls:sys_exit");
> - if (evsel == NULL) {
> - pr_err("Data file does not have raw_syscalls:sys_exit event\n");
> + if (evsel &&
> + (perf_evsel__init_syscall_tp(evsel, trace__sys_exit) < 0 ||
> + perf_evsel__init_sc_tp_uint_field(evsel, ret))) {
> + pr_err("Error during initialize raw_syscalls:sys_exit event\n");
> goto out;

Dittoi

> }
>
> - if (perf_evsel__init_syscall_tp(evsel, trace__sys_exit) < 0 ||
> - perf_evsel__init_sc_tp_uint_field(evsel, ret)) {
> - pr_err("Error during initialize raw_syscalls:sys_exit event\n");
> - goto out;
> + evlist__for_each(session->evlist, evsel) {
> + if (evsel->attr.type == PERF_TYPE_SOFTWARE &&
> + (evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ||
> + evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MIN ||
> + evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS))
> + evsel->handler = trace__pgfault;

the above looks ugly, can't we set the handler when adding the events?
Or is this just for the perf.data handling case? Have to dig deeper,
just a first feeling.

> }
>
> err = parse_target_str(trace);
> @@ -2290,6 +2416,8 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
> .user_interval = ULLONG_MAX,
> .no_buffering = true,
> .mmap_pages = 1024,
> + .sample_address = true,
> + .sample_time = true,

The above should be made conditional, i.e. if --pf is used?

> },
> .output = stdout,
> .show_comm = true,
> @@ -2330,15 +2458,17 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
> "Show only syscall summary with statistics"),
> OPT_BOOLEAN('S', "with-summary", &trace.summary,
> "Show all syscalls and summary with statistics"),
> + OPT_INCR('f', "pgfaults", &trace.trace_pgfaults, "Trace pagefaults"),
> OPT_END()
> };
> int err;
> char bf[BUFSIZ];
>
> - if ((argc > 1) && (strcmp(argv[1], "record") == 0))
> - return trace__record(argc-2, &argv[2]);
> + argc = parse_options(argc, argv, trace_options, trace_usage,
> + PARSE_OPT_STOP_AT_NON_OPTION);
>
> - argc = parse_options(argc, argv, trace_options, trace_usage, 0);
> + if ((argc >= 1) && (strcmp(argv[0], "record") == 0))
> + return trace__record(&trace, argc-1, &argv[1]);
>
> /* summary_only implies summary option, but don't overwrite summary if set */
> if (trace.summary_only)
> --
> 1.8.3.2

2014-06-18 17:43:06

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [PATCH 1/5] perf trace: add support for pagefault tracing

> I try, when possible, to not use short options that are used in
> 'strace', so what if we use 'F' here?
Agreed, will change.

> And:
>
> trace --pgfaults --pgfaults
>
> for major and min page faults looks ugly, what if we instead use --pf
> for both, and allow passing min or maj as args?
>
> I.e.:
>
> For both major and minor:
>
> trace --pf
>
> for just major page faults:
>
> trace --pf maj
Not sure I like it. Currently, when we need to trace page faults its
major faults in 99.9% of cases, we are not interested in minor ones (and
there are thousands of them in a second). So we just do 'perf trace -F'.
If we need minor faults, we are probably interested in all fault events,
so we do -F twice.

With 'trace --pf' we by default hit our 0.01% use case, so we always
need to type 'trace --pf maj'. --pf may be clear from the documentation
standpoint, but I don't like the fact that --pf defaults to all faults.

> > + int trace_pgfaults;
>
> uint8_t should be enough
By using int I state: 'I don't care about underlying type, give me
something to count'. If we use uint8_t it would imply (to me) that
for some reason we care about struct layout, size, endianness, etc.
IOW, I don't think we need to care about +-3 bytes here.

> > -typedef int (*tracepoint_handler)(struct trace *trace, struct perf_evsel *evsel,
> > +typedef int (*tracepoint_handler)(struct trace *trace,
> > + union perf_event *event,
> > + struct perf_evsel *evsel,
> > struct perf_sample *sample);
>
> You'll reduce patch size if you leave the first line alone and just add
> the new parameter (event) after evsel.
>
> It'll become then:
>
> typedef int (*tracepoint_handler)(struct trace *trace, struct perf_evsel *evsel,
> + union perf_event *event,
> struct perf_sample *sample);
>
> Then please make one separate patch adding this new parameter, stating
> that it will be used by pagefault tracing, that will come in a followup
> patch in this series.
Agreed, thanks.

> > +static bool valid_dso(struct addr_location *al, struct perf_sample *sample)
>
> Humm, can't this be reduced to just:
>
> return al->map && al->map->dso;
>
> ? I.e. if the helper returned a dso, it is because the address that was
> looked up is in that dso, no?
>
> I even guess that if there is al->map, that should be enough to check,
> byt haven't thought this thru nor looked at the relevant sources, in a
> hurry now.
Yes, we don't need to check for ->dso, it's always non-null, I'll remove
the check.
But we do need to check for the range. Because
thread__find_addr_map searches for the closest map using only ->start and
doesn't check if address is within map range (->end).
Maybe we need to fix it in thread__find_addr_map so it always returns valid map?

> Please separate adding page fault tracing recording on a file in a
> separate patch from adding it to live mode, then the changelog message
> can concentrate on examples for each feature.
Ok.

> > - if (sample.raw_data == NULL) {
> > + if (evsel->attr.type != PERF_TYPE_SOFTWARE &&
> > + sample.raw_data == NULL) {
>
> This looks like a separate patch with relevant associated changelog
> message explaining why this is needed.
No, this belongs here. When we enable page faults, they don't have any
raw data (while syscalls have). So we still want to keep this check for
syscalls but don't want it for fault events.

> > + if (evsel &&
> > + (perf_evsel__init_syscall_tp(evsel, trace__sys_enter) < 0 ||
> > + perf_evsel__init_sc_tp_ptr_field(evsel, args))) {
> > pr_err("Error during initialize raw_syscalls:sys_enter event\n");
> > goto out;
>
> the above can be ditched, not needed. Care to explain if you think
> otherwise?
This doesn't belong to this patch, but it's required because we can do
'trace --no-syscalls -F' and get only fault events without syscall events;
we don't want to fail here.
Will move to appropriate patch.

> > + evlist__for_each(session->evlist, evsel) {
> > + if (evsel->attr.type == PERF_TYPE_SOFTWARE &&
> > + (evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ||
> > + evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MIN ||
> > + evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS))
> > + evsel->handler = trace__pgfault;
>
> the above looks ugly, can't we set the handler when adding the events?
> Or is this just for the perf.data handling case? Have to dig deeper,
> just a first feeling.
It's for perf.data parsing. If you know a better way to do it without
iterating over all events, pls let me know.

> > }
> >
> > err = parse_target_str(trace);
> > @@ -2290,6 +2416,8 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
> > .user_interval = ULLONG_MAX,
> > .no_buffering = true,
> > .mmap_pages = 1024,
> > + .sample_address = true,
> > + .sample_time = true,
>
> The above should be made conditional, i.e. if --pf is used?
Yes, fixed, thanks.

2014-06-18 20:04:40

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 1/5] perf trace: add support for pagefault tracing

Em Wed, Jun 18, 2014 at 09:42:54PM +0400, Stanislav Fomichev escreveu:
> > I try, when possible, to not use short options that are used in
> > 'strace', so what if we use 'F' here?
> Agreed, will change.
>
> > And:
> >
> > trace --pgfaults --pgfaults
> >
> > for major and min page faults looks ugly, what if we instead use --pf
> > for both, and allow passing min or maj as args?
> >
> > I.e.:
> >
> > For both major and minor:
> >
> > trace --pf
> >
> > for just major page faults:
> >
> > trace --pf maj
> Not sure I like it. Currently, when we need to trace page faults its
> major faults in 99.9% of cases, we are not interested in minor ones (and

If 99.9% of the cases is for for major faults, then make --pf default to
that :-) While allowing to change this behaviour via:

For just major faults:

trace --pf

For all kinds:

trace --pf all

For just minor faults:

trace --pf min

Sure you can have the shorthand that -FF means "--pf all"

What I'm trying to avoid is to have to use with long options:

trace --pf --pf

Also, with your scheme, how would I ask for just minor faults, if
somebody happens to want that?

> there are thousands of them in a second). So we just do 'perf trace -F'.
> If we need minor faults, we are probably interested in all fault events,
> so we do -F twice.
>
> With 'trace --pf' we by default hit our 0.01% use case, so we always
> need to type 'trace --pf maj'. --pf may be clear from the documentation
> standpoint, but I don't like the fact that --pf defaults to all faults.

So make it default to the most common case :-)

> > > + int trace_pgfaults;

> > uint8_t should be enough
> By using int I state: 'I don't care about underlying type, give me
> something to count'. If we use uint8_t it would imply (to me) that
> for some reason we care about struct layout, size, endianness, etc.
> IOW, I don't think we need to care about +-3 bytes here.

Guess I've been using pahole too much... ;-)

Also, uint8_t is something that can count, as is u8, that is more
commonly used in tools/perf/ to make it look like kernel code :-)

But nah, not a biggie, just trying to be judicious in using types.

> > > -typedef int (*tracepoint_handler)(struct trace *trace, struct perf_evsel *evsel,
> > > +typedef int (*tracepoint_handler)(struct trace *trace,
> > > + union perf_event *event,
> > > + struct perf_evsel *evsel,
> > > struct perf_sample *sample);
> >
> > You'll reduce patch size if you leave the first line alone and just add
> > the new parameter (event) after evsel.
> >
> > It'll become then:
> >
> > typedef int (*tracepoint_handler)(struct trace *trace, struct perf_evsel *evsel,
> > + union perf_event *event,
> > struct perf_sample *sample);
> >
> > Then please make one separate patch adding this new parameter, stating
> > that it will be used by pagefault tracing, that will come in a followup
> > patch in this series.
> Agreed, thanks.
>
> > > +static bool valid_dso(struct addr_location *al, struct perf_sample *sample)
> >
> > Humm, can't this be reduced to just:
> >
> > return al->map && al->map->dso;
> >
> > ? I.e. if the helper returned a dso, it is because the address that was
> > looked up is in that dso, no?
> >
> > I even guess that if there is al->map, that should be enough to check,
> > byt haven't thought this thru nor looked at the relevant sources, in a
> > hurry now.
> Yes, we don't need to check for ->dso, it's always non-null, I'll remove
> the check.
> But we do need to check for the range. Because
> thread__find_addr_map searches for the closest map using only ->start and
> doesn't check if address is within map range (->end).
> Maybe we need to fix it in thread__find_addr_map so it always returns valid map?

That is my expectation, yes, if I ask for the map where address N is, it
should return just that, where have you found this bug?

The thread__find_addr_map will end up calling maps__find() and it does
this:

if (ip < m->start)
p = &(*p)->rb_left;
else if (ip > m->end)
p = &(*p)->rb_right;
else
return m;

Where is the problem?

> > Please separate adding page fault tracing recording on a file in a
> > separate patch from adding it to live mode, then the changelog message
> > can concentrate on examples for each feature.
> Ok.
>
> > > - if (sample.raw_data == NULL) {
> > > + if (evsel->attr.type != PERF_TYPE_SOFTWARE &&
> > > + sample.raw_data == NULL) {
> >
> > This looks like a separate patch with relevant associated changelog
> > message explaining why this is needed.
> No, this belongs here. When we enable page faults, they don't have any
> raw data (while syscalls have). So we still want to keep this check for
> syscalls but don't want it for fault events.

Understood the intent, but the test here is really:

if (evsel->attr.type == PERF_TYPE_TRACEPOINT &&
sample.raw_data == NULL)

Ok?

> > > + if (evsel &&
> > > + (perf_evsel__init_syscall_tp(evsel, trace__sys_enter) < 0 ||
> > > + perf_evsel__init_sc_tp_ptr_field(evsel, args))) {
> > > pr_err("Error during initialize raw_syscalls:sys_enter event\n");
> > > goto out;
> >
> > the above can be ditched, not needed. Care to explain if you think
> > otherwise?

> This doesn't belong to this patch, but it's required because we can do
> 'trace --no-syscalls -F' and get only fault events without syscall events;
> we don't want to fail here.
> Will move to appropriate patch.

Thanks

> > > + evlist__for_each(session->evlist, evsel) {
> > > + if (evsel->attr.type == PERF_TYPE_SOFTWARE &&
> > > + (evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MAJ ||
> > > + evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS_MIN ||
> > > + evsel->attr.config == PERF_COUNT_SW_PAGE_FAULTS))
> > > + evsel->handler = trace__pgfault;
> >
> > the above looks ugly, can't we set the handler when adding the events?
> > Or is this just for the perf.data handling case? Have to dig deeper,
> > just a first feeling.
> It's for perf.data parsing. If you know a better way to do it without
> iterating over all events, pls let me know.

We had something related in a perf_evlist__set_tracepoint_handlers(),
but this case is for a software event, so we would need a
perf_evlist__set_attr_handlers(), I can do that later, for now this is
the only user, as this evsel->handler thing was introduced with 'trace',
so keep it like this, I can revisit this later.

> > > }
> > >
> > > err = parse_target_str(trace);
> > > @@ -2290,6 +2416,8 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)
> > > .user_interval = ULLONG_MAX,
> > > .no_buffering = true,
> > > .mmap_pages = 1024,
> > > + .sample_address = true,
> > > + .sample_time = true,
> >
> > The above should be made conditional, i.e. if --pf is used?
> Yes, fixed, thanks.

Also, have you considered using:

[root@zoo ~]# perf list exceptions:page_fault*
exceptions:page_fault_user [Tracepoint event]
exceptions:page_fault_kernel [Tracepoint event]
[root@zoo ~]#

Instead? I need to check if they're completely equivalent to what we
need here...

- Arnaldo

2014-06-19 12:06:59

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [PATCH 1/5] perf trace: add support for pagefault tracing

> So make it default to the most common case :-)
Ok, agreed, will make --pf=all|maj|min and will default it to maj.

> That is my expectation, yes, if I ask for the map where address N is, it
> should return just that, where have you found this bug?
>
> The thread__find_addr_map will end up calling maps__find() and it does
> this:
>
> if (ip < m->start)
> p = &(*p)->rb_left;
> else if (ip > m->end)
> p = &(*p)->rb_right;
> else
> return m;
>
> Where is the problem?
You're right, we don't need to check the range, if map is non-null it's a
valid one. I'll remove valid_dso and will just check al.map for null.

> Understood the intent, but the test here is really:
>
> if (evsel->attr.type == PERF_TYPE_TRACEPOINT &&
> sample.raw_data == NULL)
>
> Ok?
Yes, makes more sense.

> Also, have you considered using:
>
> [root@zoo ~]# perf list exceptions:page_fault*
> exceptions:page_fault_user [Tracepoint event]
> exceptions:page_fault_kernel [Tracepoint event]
> [root@zoo ~]#
>
> Instead? I need to check if they're completely equivalent to what we
> need here...
Hm, didn't know there are two kinds of fault events :-/
But it seems these are x86 only.

2014-06-25 11:33:17

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [PATCH 1/5] perf trace: add support for pagefault tracing

> > Also, have you considered using:
> >
> > [root@zoo ~]# perf list exceptions:page_fault*
> > exceptions:page_fault_user [Tracepoint event]
> > exceptions:page_fault_kernel [Tracepoint event]
> > [root@zoo ~]#
> >
> > Instead? I need to check if they're completely equivalent to what we
> > need here...
> Hm, didn't know there are two kinds of fault events :-/
> But it seems these are x86 only.
Maybe we can use these x86 only events to trace userspace fault duration?
It seems we can pair exceptions:page_fault_user and
context_tracking:user_enter. But it's just a theory, need to verify...