2011-02-18 05:54:00

by David Ahern

Subject: [PATCH 0/3] perf events: Add realtime clock event and timehist option

David Ahern (3):
perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events
perf events: Introduce realtime clock event
perf events: add timehist option to record and report

include/linux/perf_event.h | 1 +
kernel/perf_event.c | 74 +++++++++++-
tools/perf/Documentation/perf-record.txt | 4 +
tools/perf/Documentation/perf-report.txt | 6 +
tools/perf/builtin-record.c | 78 +++++++++++-
tools/perf/builtin-report.c | 208 +++++++++++++++++++++++++++++-
tools/perf/util/event.c | 1 +
tools/perf/util/event.h | 8 +
tools/perf/util/parse-events.c | 2 +
tools/perf/util/session.c | 4 +
tools/perf/util/session.h | 3 +-
11 files changed, 384 insertions(+), 5 deletions(-)

--
1.7.3.4


2011-02-18 05:54:02

by David Ahern

Subject: [PATCH 2/3] perf events: Introduce realtime clock event

The motivation for this event is to convert perf_clock() time stamps
to wall-clock (gettimeofday()) equivalents, including adjustments made
by NTP (e.g., for comparing perf events to other log files).

This patch is based on the monotonic patch by Arnaldo Carvalho de Melo
<[email protected]>.

(NOTE: Comments from the last review of the timehist patch series
suggested calling this a monotonic clock. I am not trying to be
dense here; since gettimeofday maps to realtime clock I think that
is the better name for it.)

Signed-off-by: David Ahern <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/perf_event.c | 72 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8ceb5a6..51a2f34 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -105,6 +105,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_PAGE_FAULTS_MAJ = 6,
PERF_COUNT_SW_ALIGNMENT_FAULTS = 7,
PERF_COUNT_SW_EMULATION_FAULTS = 8,
+ PERF_COUNT_SW_REALTIME_CLOCK = 9,

PERF_COUNT_SW_MAX, /* non-ABI */
};
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 3280671..4b6d29c 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5413,6 +5413,7 @@ static int perf_swevent_init(struct perf_event *event)
switch (event_id) {
case PERF_COUNT_SW_CPU_CLOCK:
case PERF_COUNT_SW_TASK_CLOCK:
+ case PERF_COUNT_SW_REALTIME_CLOCK:
return -ENOENT;

default:
@@ -5599,6 +5600,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
struct perf_sample_data data;
struct pt_regs *regs;
struct perf_event *event;
+ struct perf_raw_record raw;
u64 period;

event = container_of(hrtimer, struct perf_event, hw.hrtimer);
@@ -5610,6 +5612,13 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)

perf_sample_data_init(&data, 0);
data.period = event->hw.last_period;
+ if (event->attr.sample_type & PERF_SAMPLE_RAW)
+ {
+ raw.size = sizeof(u64);
+ raw.data = &event->count;
+ data.raw = &raw;
+ }
+
regs = get_irq_regs();

if (regs && !perf_exclude_event(event, regs)) {
@@ -5751,6 +5760,68 @@ static struct pmu perf_cpu_clock = {
};

/*
+ * Software event: realtime wall time clock
+ */
+
+static void realtime_clock_event_update(struct perf_event *event)
+{
+ u64 now = ktime_to_ns(ktime_get_real());
+ local64_set(&event->count, now);
+}
+
+static void realtime_clock_event_start(struct perf_event *event, int flags)
+{
+ realtime_clock_event_update(event);
+ perf_swevent_start_hrtimer(event);
+}
+
+static void realtime_clock_event_stop(struct perf_event *event, int flags)
+{
+ perf_swevent_cancel_hrtimer(event);
+ realtime_clock_event_update(event);
+}
+
+static int realtime_clock_event_add(struct perf_event *event, int flags)
+{
+ if (flags & PERF_EF_START)
+ realtime_clock_event_start(event, flags);
+
+ return 0;
+}
+
+static void realtime_clock_event_del(struct perf_event *event, int flags)
+{
+ realtime_clock_event_stop(event, flags);
+}
+
+static void realtime_clock_event_read(struct perf_event *event)
+{
+ realtime_clock_event_update(event);
+}
+
+static int realtime_clock_event_init(struct perf_event *event)
+{
+ if (event->attr.type != PERF_TYPE_SOFTWARE)
+ return -ENOENT;
+
+ if (event->attr.config != PERF_COUNT_SW_REALTIME_CLOCK)
+ return -ENOENT;
+
+ return 0;
+}
+
+static struct pmu perf_realtime_clock = {
+ .task_ctx_nr = perf_sw_context,
+
+ .event_init = realtime_clock_event_init,
+ .add = realtime_clock_event_add,
+ .del = realtime_clock_event_del,
+ .start = realtime_clock_event_start,
+ .stop = realtime_clock_event_stop,
+ .read = realtime_clock_event_read,
+};
+
+/*
* Software event: task time clock
*/

@@ -7285,6 +7356,7 @@ void __init perf_event_init(void)
init_srcu_struct(&pmus_srcu);
perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
perf_pmu_register(&perf_cpu_clock, NULL, -1);
+ perf_pmu_register(&perf_realtime_clock, NULL, -1);
perf_pmu_register(&perf_task_clock, NULL, -1);
perf_tp_register();
perf_cpu_notifier(perf_cpu_notify);
--
1.7.3.4

2011-02-18 05:54:08

by David Ahern

Subject: [PATCH 3/3] perf events: add timehist option to record and report

The intent of the timehist option is to 'pretty-print' the samples
recorded rather than generating a histogram. This is done by
sampling the realtime clock event and correlating perf_clock time
stamps to wall-clock.

If the realtime-clock event is not available (e.g., older kernels),
fall back to a synthesized event. (I realize there is resistance
to new synthesized events, but it is a simple way to gain this
feature on older kernels without the need to modify the kernel
code.)

NOTE: Commit ba3dd36 needs to be reverted, otherwise this causes the
system to spin. Ironically, ba3dd36 was created to resolve a
lockup.

Signed-off-by: David Ahern <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 4 +
tools/perf/Documentation/perf-report.txt | 6 +
tools/perf/builtin-record.c | 78 +++++++++++-
tools/perf/builtin-report.c | 208 +++++++++++++++++++++++++++++-
tools/perf/util/event.c | 1 +
tools/perf/util/event.h | 8 +
tools/perf/util/parse-events.c | 2 +
tools/perf/util/session.c | 4 +
tools/perf/util/session.h | 3 +-
9 files changed, 310 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 5a520f8..8ae106b 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -148,6 +148,10 @@ an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must ha
corresponding events, i.e., they always refer to events defined earlier on the command
line.

+--timehist::
+Collect data for time history report. This generates reference time samples
+which are used to convert perf timestamps to time-of-day.
+
SEE ALSO
--------
linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 8ba03d6..aaaf035 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -119,6 +119,12 @@ OPTIONS
--symfs=<directory>::
Look for files with symbols relative to this directory.

+--timehist::
+ Generate time history output. This shows each sample with a wall-clock
+ timestamp and address to symbol conversions. The samples are output in
+ the same order they exist in the perf.data file. The --timehist option
+ must be used with the record to get wall-clock timestamps.
+
SEE ALSO
--------
linkperf:perf-stat[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a4aaadc..91d49a3 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -55,6 +55,8 @@ static bool nodelay = false;
static bool raw_samples = false;
static bool sample_id_all_avail = true;
static bool system_wide = false;
+static bool time_history = false;
+static struct perf_evsel *th_evsel = NULL;
static pid_t target_pid = -1;
static pid_t target_tid = -1;
static pid_t child_pid = -1;
@@ -276,6 +278,11 @@ static void config_attr(struct perf_evsel *evsel, struct perf_evlist *evlist)
attr->sample_type |= PERF_SAMPLE_CPU;
}

+ if (time_history) {
+ attr->sample_type |= PERF_SAMPLE_TIME;
+ attr->sample_type |= PERF_SAMPLE_CPU;
+ }
+
if (nodelay) {
attr->watermark = 0;
attr->wakeup_events = 1;
@@ -290,6 +297,56 @@ static void config_attr(struct perf_evsel *evsel, struct perf_evlist *evlist)
}
}

+static int perf_event__synthesize_reftime(perf_event__handler_t process,
+ struct perf_session *psession)
+{
+ union perf_event ev;
+ struct timespec tp;
+
+ memset(&ev, 0, sizeof(ev));
+
+ /* race here between successive calls, but should be close enough */
+ if (gettimeofday(&ev.reftime.tv, NULL) != 0) {
+ error("gettimeofday failed. Cannot generate reference time.\n");
+ return -1;
+ }
+ if (clock_gettime(CLOCK_MONOTONIC, &tp) != 0) {
+ error("clock_gettime failed. Cannot generate reference time.\n");
+ return -1;
+ }
+ ev.reftime.nsec = (u64) tp.tv_sec * NSEC_PER_SEC + (u64) tp.tv_nsec;
+
+ ev.header.type = PERF_RECORD_REFTIME;
+ ev.header.size = sizeof(ev.reftime);
+
+ return process(&ev, NULL, psession);
+}
+
+static struct perf_evsel *create_timehist_counter(void)
+{
+ struct perf_event_attr attr;
+ struct perf_evsel *evsel;
+
+ memset(&attr, 0, sizeof(attr));
+ attr.type = PERF_TYPE_SOFTWARE;
+ attr.config = PERF_COUNT_SW_REALTIME_CLOCK;
+
+ /* need raw to get realtime clock samples and want sample
+ * time for correlating the two
+ */
+ attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_ID | PERF_SAMPLE_TIME;
+ attr.read_format = PERF_FORMAT_ID;
+
+ /* TO-DO: make this time configurable */
+ attr.sample_period = 1 * NSEC_PER_SEC;
+
+ evsel = perf_evsel__new(&attr, evsel_list->nr_entries);
+ if (!evsel)
+ die("Error: Failed to allocate memory for time counter\n");
+
+ return evsel;
+}
+
static void open_counters(struct perf_evlist *evlist)
{
struct perf_evsel *pos;
@@ -331,7 +388,8 @@ try_again:
* Old kernel, no attr->sample_id_type_all field
*/
sample_id_all_avail = false;
- if (!sample_time && !raw_samples && !time_needed)
+ if (!sample_time && !raw_samples
+ && !time_needed && !time_history)
attr->sample_type &= ~PERF_SAMPLE_TIME;

goto retry_sample_id;
@@ -612,6 +670,16 @@ static int __cmd_record(int argc, const char **argv)
return err;
}

+ if (time_history && !th_evsel) {
+ if (verbose)
+ warning ("REALTIME_CLOCK event failed. "
+ "falling back to synthesized clock events\n");
+
+ if (perf_event__synthesize_reftime(process_synthesized_event,
+ session) != 0)
+ error("Failed to synthesize reftime event. SOL.\n");
+ }
+
post_processing_offset = lseek(output, 0, SEEK_CUR);

perf_session__set_sample_id_all(session, sample_id_all_avail);
@@ -810,6 +878,8 @@ const struct option record_options[] = {
OPT_CALLBACK('G', "cgroup", &evsel_list, "name",
"monitor event in cgroup name only",
parse_cgroups),
+ OPT_BOOLEAN(0, "timehist", &time_history,
+ "collect data for time history report"),
OPT_END()
};

@@ -855,6 +925,12 @@ int cmd_record(int argc, const char **argv, const char *prefix __used)
goto out_symbol_exit;
}

+ if (time_history) {
+ th_evsel = create_timehist_counter();
+ if (th_evsel)
+ perf_evlist__add(evsel_list, th_evsel);
+ }
+
if (target_pid != -1)
target_tid = target_pid;

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index f9a99a1..59002d6 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -36,6 +36,7 @@ static char const *input_name = "perf.data";
static bool force, use_tui, use_stdio;
static bool hide_unresolved;
static bool dont_use_callchains;
+static bool time_history;

static bool show_threads;
static struct perf_read_values show_threads_values;
@@ -46,6 +47,189 @@ static const char *pretty_printing_style = default_pretty_printing_style;
static char callchain_default_opt[] = "fractal,0.5";
static symbol_filter_t annotate_init;

+
+struct timeval tv_ref;
+u64 timestamp_ref;
+
+static const char *timestr(u64 timestamp)
+{
+ struct tm ltime;
+ u64 dt;
+ struct timeval tv_dt, tv_res;
+ static char tstr[64];
+
+ if (timestamp_ref == 0) {
+ static bool show_msg = true;
+ if (show_msg) {
+ fprintf(stderr, "Reference clock event not found in file; "
+ "Cannot generate wall-clock time strings\n");
+ show_msg = false;
+ }
+ snprintf(tstr, sizeof(tstr), "%16" PRId64, timestamp);
+ return tstr;
+ } else if (timestamp == 0) {
+ snprintf(tstr, sizeof(tstr), "%16" PRId64, timestamp);
+ return tstr;
+ }
+ if (timestamp > timestamp_ref) {
+ dt = timestamp - timestamp_ref;
+ tv_dt.tv_sec = (time_t) (dt / NSEC_PER_SEC);
+ tv_dt.tv_usec = (dt - tv_dt.tv_sec * NSEC_PER_SEC) / 1000;
+ timeradd(&tv_ref, &tv_dt, &tv_res);
+ } else {
+ dt = timestamp_ref - timestamp;
+ tv_dt.tv_sec = (time_t) (dt / NSEC_PER_SEC);
+ tv_dt.tv_usec = (dt - tv_dt.tv_sec * NSEC_PER_SEC) / 1000;
+ timersub(&tv_ref, &tv_dt, &tv_res);
+ }
+
+ if (localtime_r(&tv_res.tv_sec, &ltime) == NULL) {
+ snprintf(tstr, sizeof(tstr), "00:00:00.000000 %16" PRId64, timestamp);
+ } else {
+ char date[64];
+ strftime(date, sizeof(date), "%H:%M:%S", &ltime);
+ snprintf(tstr, sizeof(tstr), "%s.%06ld %16" PRId64,
+ date, tv_res.tv_usec, timestamp);
+ }
+
+ return tstr;
+}
+
+#define TIMEHIST_FMT "%32s %2d %s %s %d %16" PRIx64 " %s (%s)\n"
+
+static void timehist_header(void)
+{
+ printf("%32s %s %7s %3s %5s %16s %s (%s)\n\n",
+ " Time-of-Day Kernel Timestamp",
+ "cpu", "Event", "Command", "PID",
+ " IP ", "Symbol Name", "DSO Name");
+}
+
+static int perf_event__process_reftime(union perf_event *event,
+ struct perf_session *session __used)
+{
+ tv_ref = event->reftime.tv;
+ timestamp_ref = event->reftime.nsec;
+
+ return 0;
+}
+
+static void perf_event__process_rclk(struct perf_sample *sample,
+ struct perf_session *session)
+{
+ u64 s;
+
+ if (!sample->raw_size) {
+ printf("rclk sample does not have raw data.\n");
+ return;
+ }
+
+ if (sample->raw_size != sizeof(u64)) {
+ printf("rclk sample has raw size %d; expected %ld.\n",
+ sample->raw_size, sizeof(u64));
+ return;
+ }
+ if (!(session->sample_type & PERF_SAMPLE_TIME)) {
+ printf("rclk sample does not have kernel sample time\n");
+ return;
+ }
+
+ /* convert raw sample to wall-clock reference time */
+ s = *(u64*)sample->raw_data;
+ tv_ref.tv_sec = s / NSEC_PER_SEC;
+ tv_ref.tv_usec = (s - tv_ref.tv_sec * NSEC_PER_SEC) / 1000;
+
+ /* match raw sample with kernel timestamp */
+ timestamp_ref = sample->time;
+}
+
+static void perf_session__print_sample(struct perf_session *session,
+ struct addr_location *al,
+ struct perf_sample *sample)
+{
+ static int show_timehist_error = 1;
+ struct callchain_cursor_node *node, *prev;
+ u64 timestamp = 0;
+ const char *tstr;
+ bool need_spacer = false;
+ struct perf_event_attr *attr;
+ const char *evname = NULL;
+ const char *symname = "", *dsoname = "";
+
+ if (show_timehist_error && ((sample->cpu == (u32) -1) ||
+ !(session->sample_type & PERF_SAMPLE_TIME))) {
+ pr_err("Data for time history missing in perf event samples.\n"
+ "Did you record with --timehist option?\n");
+ show_timehist_error = 0;
+ }
+
+ attr = perf_header__find_attr(sample->id, &session->header);
+ if (attr)
+ evname = __event_name(attr->type, attr->config);
+
+ if (session->sample_type & PERF_SAMPLE_TIME)
+ timestamp = sample->time;
+
+ tstr = timestr(timestamp);
+
+ if ((symbol_conf.use_callchain) && sample->callchain) {
+
+ if (perf_session__resolve_callchain(session, al->thread,
+ sample->callchain, NULL) != 0) {
+ fprintf(stderr, "failed to resolve callchain. skipping\n");
+ return;
+ }
+
+ node = session->callchain_cursor.first;
+ while (node) {
+
+ if (node->sym && node->sym->name)
+ symname = node->sym->name;
+ else if (hide_unresolved)
+ continue;
+
+ if (node->map && node->map->dso && node->map->dso->name)
+ dsoname = node->map->dso->name;
+ else if (hide_unresolved)
+ continue;
+
+ printf(TIMEHIST_FMT,
+ tstr, sample->cpu, evname ? evname : "-",
+ al->thread->comm_set ? al->thread->comm : "-",
+ al->thread->pid,
+ node->ip,
+ symname, dsoname);
+
+ need_spacer = true;
+
+ prev = node;
+ node = node->next;
+ }
+
+ } else {
+ if (al->sym && al->sym->name)
+ symname = al->sym->name;
+
+ if (al->map && al->map->dso && al->map->dso->name)
+ dsoname = al->map->dso->name;
+
+ printf(TIMEHIST_FMT,
+ tstr, sample->cpu, evname ? evname : "-",
+ al->thread->comm_set ? al->thread->comm : "-",
+ al->thread->pid, al->addr,
+ symname, dsoname);
+
+ need_spacer = true;
+ }
+
+ /* put a gap between records */
+ if (need_spacer)
+ printf("\n");
+
+ return;
+}
+
+
static struct hists *perf_session__hists_findnew(struct perf_session *self,
u64 event_stream, u32 type,
u64 config)
@@ -167,6 +351,15 @@ static int process_sample_event(union perf_event *event,
struct addr_location al;
struct perf_event_attr *attr;

+ attr = perf_header__find_attr(sample->id, &session->header);
+ if (attr &&
+ (attr->type == PERF_TYPE_SOFTWARE) &&
+ (attr->config == PERF_COUNT_SW_REALTIME_CLOCK)) {
+ perf_event__process_rclk(sample, session);
+ if (time_history)
+ return 0;
+ }
+
if (perf_event__preprocess_sample(event, session, &al, sample,
annotate_init) < 0) {
fprintf(stderr, "problem processing %d event, skipping it.\n",
@@ -177,7 +370,9 @@ static int process_sample_event(union perf_event *event,
if (al.filtered || (hide_unresolved && al.sym == NULL))
return 0;

- if (perf_session__add_hist_entry(session, &al, sample)) {
+ if (time_history) {
+ perf_session__print_sample(session, &al, sample);
+ } else if (perf_session__add_hist_entry(session, &al, sample)) {
pr_debug("problem incrementing symbol period, skipping event\n");
return -1;
}
@@ -257,6 +452,7 @@ static struct perf_event_ops event_ops = {
.event_type = perf_event__process_event_type,
.tracing_data = perf_event__process_tracing_data,
.build_id = perf_event__process_build_id,
+ .reftime = perf_event__process_reftime,
.ordered_samples = true,
.ordering_requires_timestamps = true,
};
@@ -334,6 +530,9 @@ static int __cmd_report(void)
if (ret)
goto out_delete;

+ if (time_history)
+ timehist_header();
+
ret = perf_session__process_events(session, &event_ops);
if (ret)
goto out_delete;
@@ -349,6 +548,9 @@ static int __cmd_report(void)
if (verbose > 2)
perf_session__fprintf_dsos(session, stdout);

+ if (time_history)
+ goto out_delete;
+
next = rb_first(&session->hists_tree);
while (next) {
struct hists *hists;
@@ -498,6 +700,8 @@ static const struct option options[] = {
"Only display entries resolved to a symbol"),
OPT_STRING(0, "symfs", &symbol_conf.symfs, "directory",
"Look for files with symbols relative to this directory"),
+ OPT_BOOLEAN(0, "timehist", &time_history,
+ "Dump time history of event samples"),
OPT_END()
};

@@ -505,7 +709,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __used)
{
argc = parse_options(argc, argv, options, report_usage, 0);

- if (use_stdio)
+ if (use_stdio || time_history)
use_browser = 0;
else if (use_tui)
use_browser = 1;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index fbf5754..6bbd551 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -24,6 +24,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
[PERF_RECORD_HEADER_BUILD_ID] = "BUILD_ID",
[PERF_RECORD_FINISHED_ROUND] = "FINISHED_ROUND",
+ [PERF_RECORD_REFTIME] = "REF_TIME",
};

const char *perf_event__name(unsigned int id)
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 9c35170..4ba112d 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -91,6 +91,7 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_HEADER_TRACING_DATA = 66,
PERF_RECORD_HEADER_BUILD_ID = 67,
PERF_RECORD_FINISHED_ROUND = 68,
+ PERF_RECORD_REFTIME = 69,
PERF_RECORD_HEADER_MAX
};

@@ -117,6 +118,12 @@ struct tracing_data_event {
u32 size;
};

+struct reftime_event {
+ struct perf_event_header header;
+ struct timeval tv;
+ u64 nsec;
+};
+
union perf_event {
struct perf_event_header header;
struct ip_event ip;
@@ -130,6 +137,7 @@ union perf_event {
struct event_type_event event_type;
struct tracing_data_event tracing_data;
struct build_id_event build_id;
+ struct reftime_event reftime;
};

void perf_event__print_totals(void);
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index cf082da..eb143a1 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -40,6 +40,7 @@ static struct event_symbol event_symbols[] = {
{ CHW(BUS_CYCLES), "bus-cycles", "" },

{ CSW(CPU_CLOCK), "cpu-clock", "" },
+ { CSW(REALTIME_CLOCK), "realtime-clock", "rclk" },
{ CSW(TASK_CLOCK), "task-clock", "" },
{ CSW(PAGE_FAULTS), "page-faults", "faults" },
{ CSW(PAGE_FAULTS_MIN), "minor-faults", "" },
@@ -78,6 +79,7 @@ static const char *sw_event_names[] = {
"major-faults",
"alignment-faults",
"emulation-faults",
+ "realtime-clock-msecs",
};

#define MAX_ALIASES 8
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a3a871f..a07d96f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -779,6 +779,10 @@ static int perf_session__process_user_event(struct perf_session *session, union
return ops->build_id(event, session);
case PERF_RECORD_FINISHED_ROUND:
return ops->finished_round(event, session, ops);
+ case PERF_RECORD_REFTIME:
+ if (ops->reftime)
+ return ops->reftime(event, session);
+ return -EINVAL;
default:
return -EINVAL;
}
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 977b3a1..687d2b7 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -77,7 +77,8 @@ struct perf_event_ops {
event_synth_op attr,
event_type,
tracing_data,
- build_id;
+ build_id,
+ reftime;
event_op2 finished_round;
bool ordered_samples;
bool ordering_requires_timestamps;
--
1.7.3.4

2011-02-18 05:54:35

by David Ahern

Subject: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events

This check does not work out for a 64-bit counter:

sizeof(u32) + sizeof(u64) = 4 + 8 = 12

which does not pass the WARN_ON_ONCE test. I'm guessing (hoping
really) that the intent is that the size of the raw data is a
multiple of 4 bytes, not 8.

Signed-off-by: David Ahern <[email protected]>
---
kernel/perf_event.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a0a6987..3280671 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -4351,7 +4351,7 @@ void perf_prepare_sample(struct perf_event_header *header,
else
size += sizeof(u32);

- WARN_ON_ONCE(size & (sizeof(u64)-1));
+ WARN_ON_ONCE(size & (sizeof(u32)-1));
header->size += size;
}
}
--
1.7.3.4

2011-02-18 07:07:23

by Ingo Molnar

Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report


* David Ahern <[email protected]> wrote:

> +--timehist::
> + Generate time history output. This shows each sample with a wall-clock
> + timestamp and address to symbol conversions. The samples are output in
> + the same order they exist in the perf.data file. The --timehist option
> + must be used with the record to get wall-clock timestamps.

Displaying samples this way is very useful, but how is this different from trace
output - i.e. 'perf script' output?

There's a 'live logging mode' feature:

perf record ./myworkload | perf inject -v -b | perf script -i -

( There's also some work going on to have a separate 'perf trace' utility that will
output a linear trace of whatever events are in perf.data. )

So unless i'm missing something it would be more useful to extend 'perf script' (or
the upcoming 'perf trace') to list regular samples in a perf.data (i.e. not rely on
-R raw trace data alone), and of course allow a straightforward utilization of real
timestamps, when available.

Thanks,

Ingo

2011-02-18 10:59:15

by Peter Zijlstra

Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events

On Thu, 2011-02-17 at 22:53 -0700, David Ahern wrote:
> This check does not work out for 64-bit counter:
>
> sizeof(u32) + sizeof(u64) = 4 + 8 = 12
>
> which does not pass the WARN_ON_ONCE test. I'm guessing (hoping
> really) that the intent is that the size of the raw data is a
> multiple of 4 bytes, not 8.

No, the data must come in 8-byte strides.

2011-02-18 11:12:51

by Peter Zijlstra

Subject: Re: [PATCH 2/3] perf events: Introduce realtime clock event

On Thu, 2011-02-17 at 22:53 -0700, David Ahern wrote:
> The motivation for this event is to convert perf_clock() time stamps
> to wall-clock (gettimeofday()) equivalents, including adjustments made
> by NTP (e.g., for comparing perf events to other log files).

> This patch is based on the monotonic patch by Arnaldo Carvalho de Melo
> <[email protected]>.
>
> (NOTE: Comments from the last review of the timehist patch series
> suggested calling this a monotonic clock. I am not trying to be
> dense here; since gettimeofday maps to realtime clock I think that
> is the better name for it.)

Well, the idea was to use CLOCK_MONOTONIC, not to call CLOCK_REALTIME
monotonic.

I'm really not sure why you want CLOCK_REALTIME and I think
CLOCK_MONOTONIC is more useful (I'd argue you want your system logs to
contain both, every admin who's ever had to untangle what happened
during DST switches will agree)

> @@ -5610,6 +5612,13 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
>
> perf_sample_data_init(&data, 0);
> data.period = event->hw.last_period;
> + if (event->attr.sample_type & PERF_SAMPLE_RAW)
> + {
> + raw.size = sizeof(u64);
> + raw.data = &event->count;
> + data.raw = &raw;
> + }
> +
> regs = get_irq_regs();
>
> if (regs && !perf_exclude_event(event, regs)) {


Why!? you already keep ->count = ktime_get_real(), so simply reading the
count value will get you the timestamp.. this is superfluous at best.

2011-02-18 14:28:24

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report



On 02/18/11 00:06, Ingo Molnar wrote:
>
> * David Ahern <[email protected]> wrote:
>
>> +--timehist::
>> + Generate time history output. This shows each sample with a wall-clock
>> + timestamp and address to symbol conversions. The samples are output in
>> + the same order they exist in the perf.data file. The --timehist option
>> + must be used with the record to get wall-clock timestamps.
>
> Displaying samples this way is very useful, but how is this different from trace
> output - i.e. 'perf script' output?
>
> There's a 'live logging mode' feature:
>
> perf record ./myworkload | perf inject -v -b | perf script -i -
>
> ( There's also some work going on to have a separate 'perf trace' utility that will
> output a linear trace of whatever events in perf.data are. )
>
> So unless i'm missing something it would be more useful to extend 'perf script' (or
> the upcoming 'perf trace') to list regular samples in a perf.data (i.e. not rely on
> -R raw trace data alone), and of course allow a straightforward utilization of real
> timestamps, when available.

To date I have not seen any option to show samples with wall-clock
timestamps. In a sense it pretty-prints the -R format. This allows perf
samples to be compared to other log files / sources of data to resolve
an issue.

For example:

Time-of-Day Kernel Timestamp cpu Event Command PID IP Symbol Name (DSO Name)
22:40:13.831567 858005051634 1 context-switches kworker/1:0 8 ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])

22:40:13.831683 858005168139 1 context-switches sshd 897 ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])

With the stack callchain option it dumps a line for each address in the
stack (columns abbreviated here for email):

07:24:37.527728 1 sshd 819 ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 ffffffff81377231 schedule ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 ffffffff81377f3a schedule_hrtimeout_range_clock ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 ffffffff81378016 schedule_hrtimeout_range ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 ffffffff81103e15 poll_schedule_timeout ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 ffffffff8110463c do_select ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 ffffffff811047f4 core_sys_select ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 ffffffff81104929 sys_select ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 ffffffff81002bc2 system_call ([kernel.kallsyms])
07:24:37.527728 1 sshd 819 7f56d4814e93 __GI_select (/lib64/libc-2.12.90.so)
07:24:37.527728 1 sshd 819 40560b run_builtin (/tmp/build-perf/perf)
07:24:37.527728 1 sshd 819 40580d handle_internal_command (/tmp/build-perf/perf)
07:24:37.527728 1 sshd 819 405959 run_argv (/tmp/build-perf/perf)
07:24:37.527728 1 sshd 819 405b63 main (/tmp/build-perf/perf)
07:24:37.527728 1 sshd 819 7fa49de9fe7d __libc_start_main (/lib64/libc-2.12.90.so)


This format has allowed us to do an in-depth analysis of scheduling
(record -e cs -c 1): we can see when a process was running, how long
it ran, why it was kicked off the processor, how long it was off the
processor, what ran in between schedulings, etc. If a log file shows a
process missing a heartbeat, you can look at the perf data in that time
range and see whether it had a chance to run and, if not, where it was
blocked.

David

>
> Thanks,
>
> Ingo

2011-02-18 14:33:31

by David Ahern

Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events



On 02/18/11 04:00, Peter Zijlstra wrote:
> On Thu, 2011-02-17 at 22:53 -0700, David Ahern wrote:
>> This check does not work out for 64-bit counter:
>>
>> sizeof(u32) + sizeof(u64) = 4 + 8 = 12
>>
>> which does not pass the WARN_ON_ONCE test. I'm guessing (hoping
>> really) that the intent is that the size of the raw data is a
>> multiple of 4 bytes, not 8.
>
> No the data must come in 8bytes strides.
>

Then the size field needs to be changed to u64; right now it is u32:

struct perf_raw_record {
u32 size;
void *data;
};

So size should be changed to u64 rather than modifying the WARN_ON_ONCE?


Another gotcha is in perf_output_sample there is:
struct {
u32 size;
u32 data;
} raw = {
.size = sizeof(u32),
.data = 0,
};

While that meets the 8-byte stride, I was concerned that changing the
size field to u64 would break the ABI.

David

2011-02-18 14:39:43

by David Ahern

Subject: Re: [PATCH 2/3] perf events: Introduce realtime clock event



On 02/18/11 04:14, Peter Zijlstra wrote:
> On Thu, 2011-02-17 at 22:53 -0700, David Ahern wrote:
>> The motivation for this event is to convert perf_clock() time stamps
>> to wall-clock (gettimeofday()) equivalents, including adjustments made
>> by NTP (e.g., for comparing perf events to other log files).
>
>> This patch is based on the monotonic patch by Arnaldo Carvalho de Melo
>> <[email protected]>.
>>
>> (NOTE: Comments from the last review of the timehist patch series
>> suggested calling this a monotonic clock. I am not trying to be
>> dense here; since gettimeofday maps to realtime clock I think that
>> is the better name for it.)
>
> Well, the idea was to use CLOCK_MONOTONIC, not to call CLOCK_REALTIME
> monotonic.
>
> I'm really not sure why you want CLOCK_REALTIME and I think
> CLOCK_MONOTONIC is more useful (I'd argue you want your system logs to
> contain both, every admin who's ever had to untangle what happened
> during DST switches will agree)

I believe CLOCK_MONOTONIC is what perf_clock is tied to -- the
timestamps for PERF_SAMPLE_TIME -- so we already have that.

Programs that generate time-of-day output use gettimeofday(), which is
tied to CLOCK_REALTIME. We want to be able to correlate a perf sample
to an entry in an application's log file.

David


>
>> @@ -5610,6 +5612,13 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
>>
>> perf_sample_data_init(&data, 0);
>> data.period = event->hw.last_period;
>> + if (event->attr.sample_type & PERF_SAMPLE_RAW)
>> + {
>> + raw.size = sizeof(u64);
>> + raw.data = &event->count;
>> + data.raw = &raw;
>> + }
>> +
>> regs = get_irq_regs();
>>
>> if (regs && !perf_exclude_event(event, regs)) {
>
>
> Why!? you already keep ->count = ktime_get_real(), so simply reading the
> count value will get you the timestamp.. this is superfluous at best.

And that is a conundrum I was stuck on for a while. perf record does not
sample counters; it only creates sample events. I looked at having perf
record sample the clock event, but then I would have to synthesize an
event for the output file. Similarly perf record for hardware counters
does not show the value of the counter.

David


>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-perf-users" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-02-18 14:53:59

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events

Em Fri, Feb 18, 2011 at 07:33:28AM -0700, David Ahern escreveu:
> On 02/18/11 04:00, Peter Zijlstra wrote:
> > On Thu, 2011-02-17 at 22:53 -0700, David Ahern wrote:
> >> This check does not work out for 64-bit counter:
> >>
> >> sizeof(u32) + sizeof(u64) = 4 + 8 = 12
> >>
> >> which does not pass the WARN_ON_ONCE test. I'm guessing (hoping
> >> really) that the intent is that the size of the raw data is a
> >> multiple of 4 bytes, not 8.
> >
> > No the data must come in 8bytes strides.
> >
>
> Then the sizeof the size field needs to be moved to u64; right now it is
> u32:
>
> struct perf_raw_record {
> u32 size;
> void *data;
> };
>
> So size should be changed to u64 rather than modifying the WARN_ON_ONCE?

Yeah, I guess so, 32-bit has to die it seems, gets in the way of testing
:-)

- Arnaldo

2011-02-18 14:54:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events

On Fri, 2011-02-18 at 07:33 -0700, David Ahern wrote:
>
> On 02/18/11 04:00, Peter Zijlstra wrote:
> > On Thu, 2011-02-17 at 22:53 -0700, David Ahern wrote:
> >> This check does not work out for 64-bit counter:
> >>
> >> sizeof(u32) + sizeof(u64) = 4 + 8 = 12
> >>
> >> which does not pass the WARN_ON_ONCE test. I'm guessing (hoping
> >> really) that the intent is that the size of the raw data is a
> >> multiple of 4 bytes, not 8.
> >
> > No the data must come in 8bytes strides.
> >
>
> Then the sizeof the size field needs to be moved to u64; right now it is
> u32:

That's a false conclusion.

> struct perf_raw_record {
> u32 size;
> void *data;
> };

> So size should be changed to u64 rather than modifying the WARN_ON_ONCE?

No.

> Another gotcha is in perf_output_sample there is:
> struct {
> u32 size;
> u32 data;
> } raw = {
> .size = sizeof(u32),
> .data = 0,
> };
>
> While that meets the 8-byte stride I was concerned that changing the
> size field to u64 breaks ABI.

It would.

No, what you need to do is provide data that is sized such that it
matches the 8b stride, adding padding where needed.

Anyway, I don't think you need RAW at all.

2011-02-18 14:57:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] perf events: Introduce realtime clock event

On Fri, 2011-02-18 at 07:39 -0700, David Ahern wrote:
>
> On 02/18/11 04:14, Peter Zijlstra wrote:
> > On Thu, 2011-02-17 at 22:53 -0700, David Ahern wrote:
> >> The motivation for this event is to convert perf_clock() time stamps
> >> to wall-clock (gettimeofday()) equivalents, including adjustments made
> >> by NTP (e.g., for comparing perf events to other log files).
> >
> >> This patch is based on the monotonic patch by Arnaldo Carvalho de Melo
> >> <[email protected]>.
> >>
> >> (NOTE: Comments from the last review of the timehist patch series
> >> suggested calling this a monotonic clock. I am not trying to be
> >> dense here; since gettimeofday maps to realtime clock I think that
> >> is the better name for it.)
> >
> > Well, the idea was to use CLOCK_MONOTONIC, not to call CLOCK_REALTIME
> > monotonic.
> >
> > I'm really not sure why you want CLOCK_REALTIME and I think
> > CLOCK_MONOTONIC is more useful (I'd argue you want your system logs to
> > contain both, every admin who's ever had to untangle what happened
> > during DST switches will agree)
>
> I believe CLOCK_MONOTONIC is what perf_clock is tied to -- the
> timestamps for PERF_SAMPLE_TIME -- so we already have that.

It's not (it merely _can_ be); it could be tied to the TSC, which can
significantly drift wrt CLOCK_MONOTONIC.

> Programs that generate time-of-day output are using gettimeofday which
> is tied to CLOCK_REALTIME. We want to be able to correlate a perf sample
> to an entry in an applications log file.

Well, you can argue those programs are broken :-), Imagine the joys of
trying to figure out wth happens when DST jumps the clock back an hour
and you have an hour of duplicate data.

> >> @@ -5610,6 +5612,13 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
> >>
> >> perf_sample_data_init(&data, 0);
> >> data.period = event->hw.last_period;
> >> + if (event->attr.sample_type & PERF_SAMPLE_RAW)
> >> + {
> >> + raw.size = sizeof(u64);
> >> + raw.data = &event->count;
> >> + data.raw = &raw;
> >> + }
> >> +
> >> regs = get_irq_regs();
> >>
> >> if (regs && !perf_exclude_event(event, regs)) {
> >
> >
> > Why!? you already keep ->count = ktime_get_real(), so simply reading the
> > count value will get you the timestamp.. this is superfluous at best.
>
> And that is a conundrum I was stuck on for a while. perf record does not
> sample counters; it only creates sample events. I looked at having perf
> record sample the clock event, but then I would have to synthesize an
> event for the output file. Similarly perf record for hardware counters
> does not show the value of the counter.

Well, but you can fix that, simply add PERF_SAMPLE_READ, no need to
abuse the raw crap.


2011-02-18 15:00:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events

On Fri, 2011-02-18 at 12:53 -0200, Arnaldo Carvalho de Melo wrote:
>
> > So size should be changed to u64 rather than modifying the
> WARN_ON_ONCE?
>
> Yeah, I guess so, 32-bit has to die it seems, gets in the way of
> testing
> :-)

No!

1) don't use RAW unless you have to
2) you don't have to, goto 1
3) if you do anyway, ensure its properly padded

2011-02-18 15:28:28

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events



On 02/18/11 07:55, Peter Zijlstra wrote:
>> Another gotcha is in perf_output_sample there is:
>> struct {
>> u32 size;
>> u32 data;
>> } raw = {
>> .size = sizeof(u32),
>> .data = 0,
>> };
>>
>> While that meets the 8-byte stride I was concerned that changing the
>> size field to u64 breaks ABI.
>
> It would.
>
> No, what you need to do is provide data that is sized such that it
> matches the 8b stride adding padding where needed.

I want to push a 64-bit counter to userspace. By definition it meets the
8-byte stride. Now how do I get sizeof(size field) + sizeof(data) to
meet an 8-byte stride? I can't when sizeof(size field) = 4. Ok, I can
pad the data with 4 bytes, but then I need an API to tell the user to
strip the 4-byte pad. Seems like an odd way to go.

Why is an 8-byte stride preferred over a 4-byte one?

David

>
> Anyway, I don't think you need RAW at all.
>

2011-02-18 15:35:51

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 2/3] perf events: Introduce realtime clock event



On 02/18/11 07:58, Peter Zijlstra wrote:
> Well, but you can fix that, simply add PERF_SAMPLE_READ, no need to
> abuse the raw crap.
>

Ok I see how this option works kernel side. It's not plumbed userspace side:
if (type & PERF_SAMPLE_READ) {
fprintf(stderr, "PERF_SAMPLE_READ is unsuported for now\n");
return -1;
}

I'll take a look at that.

David


2011-02-18 15:39:50

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 2/3] perf events: Introduce realtime clock event



On 02/18/11 07:58, Peter Zijlstra wrote:
>>> I'm really not sure why you want CLOCK_REALTIME and I think
>>> CLOCK_MONOTONIC is more useful (I'd argue you want your system logs to
>>> contain both, every admin who's ever had to untangle what happened
>>> during DST switches will agree)
>>
>> I believe CLOCK_MONOTONIC is what perf_clock is tied to -- the
>> timestamps for PERF_SAMPLE_TIME -- so we already have that.
>
> Its not (it mere _can_ be), it could be tied to the TSC which can
> significantly drift wrt CLOCK_MONOTONIC.

Ok, either way I would like correlation between perf_clock (the time
sample data) and gettimeofday.

>
>> Programs that generate time-of-day output are using gettimeofday which
>> is tied to CLOCK_REALTIME. We want to be able to correlate a perf sample
>> to an entry in an applications log file.
>
> Well, you can argue those programs are broken :-), Imagine the joys of
> trying to figure out wth happens when DST jumps the clock back an hour
> and you have an hour of duplicate data.
>

Luckily DST only happens twice a year. Of course reboots happen a little
more often and those reset the monotonic clock.

I can't change the known universe of programs that create pretty
HH:MM:SS MM/DD/YY time strings. What I do want is to know why a program
missed a heartbeat as noted by a log entry. Correlating with a perf
event and seeing the backtrace is quite handy.

David

2011-02-18 15:52:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events

On Fri, 2011-02-18 at 08:28 -0700, David Ahern wrote:

> Why is an 8-byte stride preferred over a 4-bytes?

So as to avoid having to deal with unaligned u64 reads.

2011-02-18 17:04:59

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events

Em Fri, Feb 18, 2011 at 04:01:31PM +0100, Peter Zijlstra escreveu:
> On Fri, 2011-02-18 at 12:53 -0200, Arnaldo Carvalho de Melo wrote:
> > > So size should be changed to u64 rather than modifying the
> > WARN_ON_ONCE?
> >
> > Yeah, I guess so, 32-bit has to die it seems, gets in the way of
> > testing
> > :-)
>
> No!
>
> 1) don't use RAW unless you have to
> 2) you don't have to, goto 1
> 3) if you do anyway, ensure its properly padded

Ok, but using that struct definition on 32-bit land, will lead to non 8 byte
alignment, right?

- Arnaldo

2011-02-18 17:13:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events

On Fri, 2011-02-18 at 15:04 -0200, Arnaldo Carvalho de Melo wrote:
> Em Fri, Feb 18, 2011 at 04:01:31PM +0100, Peter Zijlstra escreveu:
> > On Fri, 2011-02-18 at 12:53 -0200, Arnaldo Carvalho de Melo wrote:
> > > > So size should be changed to u64 rather than modifying the
> > > WARN_ON_ONCE?
> > >
> > > Yeah, I guess so, 32-bit has to die it seems, gets in the way of
> > > testing
> > > :-)
> >
> > No!
> >
> > 1) don't use RAW unless you have to
> > 2) you don't have to, goto 1
> > 3) if you do anyway, ensure its properly padded
>
> Ok, but using that struct definition on 32-bit land, will lead to non 8 byte
> alignment, right?

Nope, its just fine.. it defaults to an empty 4byte payload ;-)

Its just that any custom payload must be of size 4+8*n bytes.

2011-02-18 17:15:18

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events



On 02/18/11 10:13, Peter Zijlstra wrote:
> On Fri, 2011-02-18 at 15:04 -0200, Arnaldo Carvalho de Melo wrote:
>> Em Fri, Feb 18, 2011 at 04:01:31PM +0100, Peter Zijlstra escreveu:
>>> On Fri, 2011-02-18 at 12:53 -0200, Arnaldo Carvalho de Melo wrote:
>>>>> So size should be changed to u64 rather than modifying the
>>>> WARN_ON_ONCE?
>>>>
>>>> Yeah, I guess so, 32-bit has to die it seems, gets in the way of
>>>> testing
>>>> :-)
>>>
>>> No!
>>>
>>> 1) don't use RAW unless you have to
>>> 2) you don't have to, goto 1
>>> 3) if you do anyway, ensure its properly padded
>>
>> Ok, but using that struct definition on 32-bit land, will lead to non 8 byte
>> alignment, right?
>
> Nope, its just fine.. it defaults to an empty 4byte payload ;-)
>
> Its just that any custom payload must be of size 4+8*n bytes.

4 + 8*n != 8-byte aligned for any whole number n.

David

2011-02-18 17:17:30

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 1/3] perf events: fix WARN_ON_ONCE for 64-bit raw data, SW events



On 02/18/11 10:15, David Ahern wrote:

>> Its just that any custom payload must be of size 4+8*n bytes.
>
> 4 + 8*n != 8-byte aligned for any whole number n.

Nevermind. You are referring to the payload only; I was thinking of the
full raw sample.


>
> David

2011-02-18 17:59:42

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

On Fri, Feb 18, 2011 at 07:28:20AM -0700, David Ahern wrote:
>
>
> On 02/18/11 00:06, Ingo Molnar wrote:
> >
> > * David Ahern <[email protected]> wrote:
> >
> >> +--timehist::
> >> + Generate time history output. This shows each sample with a wall-clock
> >> + timestamp and address to symbol conversions. The samples are output in
> >> + the same order they exist in the perf.data file. The --timehist option
> >> + must be used with the record to get wall-clock timestamps.
> >
> > Displaying samples this way is very useful, but how is this different from trace
> > output - i.e. 'perf script' output?
> >
> > There's a 'live logging mode' feature:
> >
> > perf record ./myworkload | perf inject -v -b | perf script -i -
> >
> > ( There's also some work going on to have a separate 'perf trace' utility that will
> > output a linear trace of whatever events in perf.data are. )
> >
> > So unless i'm missing something it would be more useful to extend 'perf script' (or
> > the upcoming 'perf trace') to list regular samples in a perf.data (i.e. not rely on
> > -R raw trace data alone), and of course allow a straightforward utilization of real
> > timestamps, when available.
>
> To date I have not seen any option to show samples with wall-clock
> timestamps. In a sense it pretty-prints the -R format. This allows perf
> samples to be compared to other log files / sources of data to resolve
> an issue.
>
> For example:
>
> Time-of-Day Kernel Timestamp cpu Event Command PID
> IP Symbol Name (DSO Name)
> 22:40:13.831567 858005051634 1 context-switches kworker/1:0 8
> ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
>
> 22:40:13.831683 858005168139 1 context-switches sshd 897
> ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
>
> With the stack callchain option it dumps a line for each address in the
> stack (columns abbreviated here for email):
>
> 07:24:37.527728 1 sshd 819 ffffffff8103251e perf_event_task_sched_out
> ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 ffffffff81377231 schedule ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 ffffffff81377f3a
> schedule_hrtimeout_range_clock ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 ffffffff81378016 schedule_hrtimeout_range
> ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 ffffffff81103e15 poll_schedule_timeout
> ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 ffffffff8110463c do_select ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 ffffffff811047f4 core_sys_select
> ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 ffffffff81104929 sys_select
> ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 ffffffff81002bc2 system_call
> ([kernel.kallsyms])
> 07:24:37.527728 1 sshd 819 7f56d4814e93 __GI_select
> (/lib64/libc-2.12.90.so)
> 07:24:37.527728 1 sshd 819 40560b run_builtin
> (/tmp/build-perf/perf)
> 07:24:37.527728 1 sshd 819 40580d handle_internal_command
> (/tmp/build-perf/perf)
> 07:24:37.527728 1 sshd 819 405959 run_argv
> (/tmp/build-perf/perf)
> 07:24:37.527728 1 sshd 819 405b63 main (/tmp/build-perf/perf)
> 07:24:37.527728 1 sshd 819 7fa49de9fe7d __libc_start_main
> (/lib64/libc-2.12.90.so)
>
>
> This format has allowed us to do an in-depth analysis of scheduling
> (record -e cs -c 1): we can see when was process running, for how long
> it ran, why it was kicked off the processor, how long was it off the
> processor, what ran in between schedulings, etc. If a log file shows a
> process missing a heartbeat you can look at the perf data in the time
> range and see if it had a chance to run. If not where was it blocked.

Oh, in this case I strongly suggest you to use the tracepoint events.

Look at the output of "perf list" and seek the events which name
start with "sched:"

sched:sched_kthread_stop [Tracepoint event]
sched:sched_kthread_stop_ret [Tracepoint event]
sched:sched_wait_task [Tracepoint event]
sched:sched_wakeup [Tracepoint event]
sched:sched_wakeup_new [Tracepoint event]
sched:sched_switch [Tracepoint event]
sched:sched_migrate_task [Tracepoint event]
sched:sched_process_free [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_stat_wait [Tracepoint event]
sched:sched_stat_sleep [Tracepoint event]
sched:sched_stat_iowait [Tracepoint event]
sched:sched_stat_runtime [Tracepoint event]


You have the sched:sched_switch event and many others.

Just try:

perf record -a -e sched:*
perf script

perf-4128 [000] 19242.870025: sched_stat_runtime: comm=perf pid=4128 runtime=7430405 [ns] vruntime=3530192223488
perf-4128 [000] 19242.870042: sched_stat_runtime: comm=perf pid=4128 runtime=23142 [ns] vruntime=3530192246630 [n
perf-4128 [000] 19242.870045: sched_stat_sleep: comm=kondemand/0 pid=59 delay=9979163 [ns]
perf-4128 [000] 19242.870048: sched_wakeup: comm=kondemand/0 pid=59 prio=120 success=1 target_cpu=000
perf-4128 [000] 19242.870063: sched_stat_runtime: comm=perf pid=4128 runtime=21581 [ns] vruntime=3530192268211 [n
perf-4128 [000] 19242.870066: sched_stat_wait: comm=kondemand/0 pid=59 delay=21581 [ns]
perf-4128 [000] 19242.870069: sched_switch: prev_comm=perf prev_pid=4128 prev_prio=120 prev_state=R ==> next_comm
kondemand/0-59 [000] 19242.870091: sched_stat_runtime: comm=kondemand/0 pid=59 runtime=27362 [ns] vruntime=35301862739
kondemand/0-59 [000] 19242.870094: sched_stat_wait: comm=perf pid=4128 delay=27362 [ns]
kondemand/0-59 [000] 19242.870095: sched_switch: prev_comm=kondemand/0 prev_pid=59 prev_prio=120 prev_state=S ==> next

And you can run your own script on these events:

$ sudo ./perf script -g python
generated Python script: perf-script.py

Edit perf-script.py and then run it:

$ perf script -s ./perf-script.py

That also works for perl.

The timestamps will be the cpu time and not the walltime, but at least that seems
to be partly what you seek?

2011-02-18 18:07:39

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report



On 02/18/11 10:59, Frederic Weisbecker wrote:
> On Fri, Feb 18, 2011 at 07:28:20AM -0700, David Ahern wrote:
>>
>>
>> On 02/18/11 00:06, Ingo Molnar wrote:
>>>
>>> * David Ahern <[email protected]> wrote:
>>>
>>>> +--timehist::
>>>> + Generate time history output. This shows each sample with a wall-clock
>>>> + timestamp and address to symbol conversions. The samples are output in
>>>> + the same order they exist in the perf.data file. The --timehist option
>>>> + must be used with the record to get wall-clock timestamps.
>>>
>>> Displaying samples this way is very useful, but how is this different from trace
>>> output - i.e. 'perf script' output?
>>>
>>> There's a 'live logging mode' feature:
>>>
>>> perf record ./myworkload | perf inject -v -b | perf script -i -
>>>
>>> ( There's also some work going on to have a separate 'perf trace' utility that will
>>> output a linear trace of whatever events in perf.data are. )
>>>
>>> So unless i'm missing something it would be more useful to extend 'perf script' (or
>>> the upcoming 'perf trace') to list regular samples in a perf.data (i.e. not rely on
>>> -R raw trace data alone), and of course allow a straightforward utilization of real
>>> timestamps, when available.
>>
>> To date I have not seen any option to show samples with wall-clock
>> timestamps. In a sense it pretty-prints the -R format. This allows perf
>> samples to be compared to other log files / sources of data to resolve
>> an issue.
>>
>> For example:
>>
>> Time-of-Day Kernel Timestamp cpu Event Command PID
>> IP Symbol Name (DSO Name)
>> 22:40:13.831567 858005051634 1 context-switches kworker/1:0 8
>> ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
>>
>> 22:40:13.831683 858005168139 1 context-switches sshd 897
>> ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
>>
>> With the stack callchain option it dumps a line for each address in the
>> stack (columns abbreviated here for email):
>>
>> 07:24:37.527728 1 sshd 819 ffffffff8103251e perf_event_task_sched_out
>> ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 ffffffff81377231 schedule ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 ffffffff81377f3a
>> schedule_hrtimeout_range_clock ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 ffffffff81378016 schedule_hrtimeout_range
>> ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 ffffffff81103e15 poll_schedule_timeout
>> ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 ffffffff8110463c do_select ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 ffffffff811047f4 core_sys_select
>> ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 ffffffff81104929 sys_select
>> ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 ffffffff81002bc2 system_call
>> ([kernel.kallsyms])
>> 07:24:37.527728 1 sshd 819 7f56d4814e93 __GI_select
>> (/lib64/libc-2.12.90.so)
>> 07:24:37.527728 1 sshd 819 40560b run_builtin
>> (/tmp/build-perf/perf)
>> 07:24:37.527728 1 sshd 819 40580d handle_internal_command
>> (/tmp/build-perf/perf)
>> 07:24:37.527728 1 sshd 819 405959 run_argv
>> (/tmp/build-perf/perf)
>> 07:24:37.527728 1 sshd 819 405b63 main (/tmp/build-perf/perf)
>> 07:24:37.527728 1 sshd 819 7fa49de9fe7d __libc_start_main
>> (/lib64/libc-2.12.90.so)
>>
>>
>> This format has allowed us to do an in-depth analysis of scheduling
>> (record -e cs -c 1): we can see when was process running, for how long
>> it ran, why it was kicked off the processor, how long was it off the
>> processor, what ran in between schedulings, etc. If a log file shows a
>> process missing a heartbeat you can look at the perf data in the time
>> range and see if it had a chance to run. If not where was it blocked.
>
> Oh, in this case I strongly suggest you to use the tracepoint events.
>
> Look at the output of "perf list" and seek the events which name
> start with "sched:"
>
> sched:sched_kthread_stop [Tracepoint event]
> sched:sched_kthread_stop_ret [Tracepoint event]
> sched:sched_wait_task [Tracepoint event]
> sched:sched_wakeup [Tracepoint event]
> sched:sched_wakeup_new [Tracepoint event]
> sched:sched_switch [Tracepoint event]
> sched:sched_migrate_task [Tracepoint event]
> sched:sched_process_free [Tracepoint event]
> sched:sched_process_exit [Tracepoint event]
> sched:sched_process_wait [Tracepoint event]
> sched:sched_process_fork [Tracepoint event]
> sched:sched_stat_wait [Tracepoint event]
> sched:sched_stat_sleep [Tracepoint event]
> sched:sched_stat_iowait [Tracepoint event]
> sched:sched_stat_runtime [Tracepoint event]
>
>
> You have the sched:sched_switch event and many others.
>
> Just try:
>
> perf record -a -e sched:*
> perf script
>
> perf-4128 [000] 19242.870025: sched_stat_runtime: comm=perf pid=4128 runtime=7430405 [ns] vruntime=3530192223488
> perf-4128 [000] 19242.870042: sched_stat_runtime: comm=perf pid=4128 runtime=23142 [ns] vruntime=3530192246630 [n
> perf-4128 [000] 19242.870045: sched_stat_sleep: comm=kondemand/0 pid=59 delay=9979163 [ns]
> perf-4128 [000] 19242.870048: sched_wakeup: comm=kondemand/0 pid=59 prio=120 success=1 target_cpu=000
> perf-4128 [000] 19242.870063: sched_stat_runtime: comm=perf pid=4128 runtime=21581 [ns] vruntime=3530192268211 [n
> perf-4128 [000] 19242.870066: sched_stat_wait: comm=kondemand/0 pid=59 delay=21581 [ns]
> perf-4128 [000] 19242.870069: sched_switch: prev_comm=perf prev_pid=4128 prev_prio=120 prev_state=R ==> next_comm
> kondemand/0-59 [000] 19242.870091: sched_stat_runtime: comm=kondemand/0 pid=59 runtime=27362 [ns] vruntime=35301862739
> kondemand/0-59 [000] 19242.870094: sched_stat_wait: comm=perf pid=4128 delay=27362 [ns]
> kondemand/0-59 [000] 19242.870095: sched_switch: prev_comm=kondemand/0 prev_pid=59 prev_prio=120 prev_state=S ==> next
>
> And you can run your own script on these events:
>
> $ sudo ./perf script -g python
> generated Python script: perf-script.py
>
> Edit perf-script.py and then run it:
>
> $ perf script -s ./perf-script.py
>
> That also works for perl.
>
> The timestamps will be the cpu time and not the walltime, but at least that seems
> to be partly what you seek?

Yes, I have looked at that output before -- essentially ftrace output.

We want not only context-switch events, but the stack trace at the
switch. For example, with the stack trace you can see preemption -- when
a process got booted by another and where the booted process is at the
time. You can see not only which system call caused an app to block
(e.g., an ioctl) but the full code path leading to the block.

That data along with the gettimeofday timestamp has allowed us to
resolve performance issues such as a system call taking longer than
expected during a specific sequence of events or a process getting
preempted and not scheduled for N seconds, etc.

David

2011-02-18 18:38:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

On Fri, 2011-02-18 at 11:07 -0700, David Ahern wrote:
> We want not only context-switch events, but the stack trace at the
> switch.

Both ftrace and perf can do that:

perf record -ge sched:sched_switch -c 1

or

echo 1 > /debug/tracing/options/stacktrace

> That data along with the gettimeofday timestamp has allowed us to
> resolve performance issues such as a system call taking longer than
> expected during a specific sequence of events or a process getting
> preempted and not scheduled for N seconds. etc., etc.

None of that seems to require GTOD stamps.

2011-02-18 18:41:28

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

Em Fri, Feb 18, 2011 at 06:59:30PM +0100, Frederic Weisbecker escreveu:
> sched:sched_wait_task [Tracepoint event]
> sched:sched_wakeup [Tracepoint event]
> sched:sched_wakeup_new [Tracepoint event]
> sched:sched_switch [Tracepoint event]
> sched:sched_migrate_task [Tracepoint event]
> sched:sched_process_free [Tracepoint event]
> sched:sched_process_exit [Tracepoint event]
>
>
> You have the sched:sched_switch event and many others.
>
> Just try:
>
> perf record -a -e sched:*
> perf script
>
> perf-4128 [000] 19242.870025: sched_stat_runtime: comm=perf pid=4128 runtime=7430405 [ns] vruntime=3530192223488
> perf-4128 [000] 19242.870042: sched_stat_runtime: comm=perf pid=4128 runtime=23142 [ns] vruntime=3530192246630 [n
> perf-4128 [000] 19242.870045: sched_stat_sleep: comm=kondemand/0 pid=59 delay=9979163 [ns]
> perf-4128 [000] 19242.870048: sched_wakeup: comm=kondemand/0 pid=59 prio=120 success=1 target_cpu=000
> perf-4128 [000] 19242.870063: sched_stat_runtime: comm=perf pid=4128 runtime=21581 [ns] vruntime=3530192268211 [n
> perf-4128 [000] 19242.870066: sched_stat_wait: comm=kondemand/0 pid=59 delay=21581 [ns]
> perf-4128 [000] 19242.870069: sched_switch: prev_comm=perf prev_pid=4128 prev_prio=120 prev_state=R ==> next_comm
> kondemand/0-59 [000] 19242.870091: sched_stat_runtime: comm=kondemand/0 pid=59 runtime=27362 [ns] vruntime=35301862739
> kondemand/0-59 [000] 19242.870094: sched_stat_wait: comm=perf pid=4128 delay=27362 [ns]
> kondemand/0-59 [000] 19242.870095: sched_switch: prev_comm=kondemand/0 prev_pid=59 prev_prio=120 prev_state=S ==> next
>
> And you can run your own script on these events:
>
> $ sudo ./perf script -g python
> generated Python script: perf-script.py
>
> Edit perf-script.py and then run it:
>
> $ perf script -s ./perf-script.py
>
> That also works for perl.
>
> The timestamps will be the cpu time and not the walltime, but at least that seems
> to be partly what you seek?

The whole issue for him, AFAIK, is to correlate perf events with app
events.

Think about tcpdump + networking tracepoints or 'perf probe' dynamic
events in the network stack, he wants to merge those logs and correlate
the tcpdump packet exchange with the tracepoints events in the network
stack, etc.

I.e. it doesn't matter if it is ftrace or not, having a common clock
shared between apps and kernel tracing/whatever infrastructure is what
David is after, right?

He can change userspace to use the clock the kernel is using in the
perf/ftrace/whatever infrastructure or make the kernel use the clock
userspace uses.

The issue here is who will bend, u or k ;-)

- Arnaldo

2011-02-18 18:45:04

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report



On 02/18/11 11:39, Peter Zijlstra wrote:
> On Fri, 2011-02-18 at 11:07 -0700, David Ahern wrote:
>> We want not only context-switch events, but the stack trace at the
>> switch.
>
> Both ftrace and perf can do that:
>
> perf record -ge sched:sched_switch -c 1

Yes, exactly. And that is what I am using -- well, -e cs -c 1.

>
> or
>
> echo 1 > /debug/tracing/options/stacktrace

>
>> That data along with the gettimeofday timestamp has allowed us to
>> resolve performance issues such as a system call taking longer than
>> expected during a specific sequence of events or a process getting
>> preempted and not scheduled for N seconds. etc., etc.
>
> Non of that seems to require GTOD stamps.

On 01/23/11 at 03:45:67 a syslog event noted an application restart.
Checking the application's logs, it went silent. Why? What does perf say?

Oh, 2 days have gone by before said event is reported to engineering,
and the server has been rebooted to 'clear' the problem - hence
resetting the monotonic clock.

Multiply that by 100's of processes.

David

>

2011-02-18 18:48:00

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report



On 02/18/11 11:41, Arnaldo Carvalho de Melo wrote:
> Em Fri, Feb 18, 2011 at 06:59:30PM +0100, Frederic Weisbecker escreveu:
>> sched:sched_wait_task [Tracepoint event]
>> sched:sched_wakeup [Tracepoint event]
>> sched:sched_wakeup_new [Tracepoint event]
>> sched:sched_switch [Tracepoint event]
>> sched:sched_migrate_task [Tracepoint event]
>> sched:sched_process_free [Tracepoint event]
>> sched:sched_process_exit [Tracepoint event]
>>
>>
>> You have the sched:sched_switch event and many others.
>>
>> Just try:
>>
>> perf record -a -e sched:*
>> perf script
>>
>> perf-4128 [000] 19242.870025: sched_stat_runtime: comm=perf pid=4128 runtime=7430405 [ns] vruntime=3530192223488
>> perf-4128 [000] 19242.870042: sched_stat_runtime: comm=perf pid=4128 runtime=23142 [ns] vruntime=3530192246630 [n
>> perf-4128 [000] 19242.870045: sched_stat_sleep: comm=kondemand/0 pid=59 delay=9979163 [ns]
>> perf-4128 [000] 19242.870048: sched_wakeup: comm=kondemand/0 pid=59 prio=120 success=1 target_cpu=000
>> perf-4128 [000] 19242.870063: sched_stat_runtime: comm=perf pid=4128 runtime=21581 [ns] vruntime=3530192268211 [n
>> perf-4128 [000] 19242.870066: sched_stat_wait: comm=kondemand/0 pid=59 delay=21581 [ns]
>> perf-4128 [000] 19242.870069: sched_switch: prev_comm=perf prev_pid=4128 prev_prio=120 prev_state=R ==> next_comm
>> kondemand/0-59 [000] 19242.870091: sched_stat_runtime: comm=kondemand/0 pid=59 runtime=27362 [ns] vruntime=35301862739
>> kondemand/0-59 [000] 19242.870094: sched_stat_wait: comm=perf pid=4128 delay=27362 [ns]
>> kondemand/0-59 [000] 19242.870095: sched_switch: prev_comm=kondemand/0 prev_pid=59 prev_prio=120 prev_state=S ==> next
>>
>> And you can run your own script on these events:
>>
>> $ sudo ./perf script -g python
>> generated Python script: perf-script.py
>>
>> Edit perf-script.py and then run it:
>>
>> $ perf script -s ./perf-script.py
>>
>> That also works for perl.
>>
>> The timestamps will be the cpu time and not the walltime, but at least that seems
>> to be partly what you seek?
>
> The whole issue for him, AFAIK, is to correlate perf events with app
> events.
>
> Think about tcpdump + networking tracepoints or 'perf probe' dynamic
> events in the network stack, he wants to merge those logs and correlate
> the tcpdump packet exchange with the tracepoints events in the network
> stack, etc.
>
> I.e. it doesn't matter if it is ftrace or not, having a common clock
> shared between apps and kernel tracing/whatever infrastructure is what
> David is after, right?

A means of correlating the two, yes.

>
> He can change userspace to use the clock the kernel is using in the
> perf/ftrace/whatever infrastructure or make the kernel use the clock
> userspace uses.
>
> The issue here is who will bend, u or k ;-)

It makes sense for ftrace, perf, etc. to use a monotonic clock. But that
timestamp is nonsense to users (just a string of numbers), and
application writers like those pretty little strings that express time
as humans understand it.

David

>
> - Arnaldo

2011-02-18 19:02:58

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report



On 02/18/11 11:41, Arnaldo Carvalho de Melo wrote:
>
> Think about tcpdump + networking tracepoints or 'perf probe' dynamic
> events in the network stack, he wants to merge those logs and correlate
> the tcpdump packet exchange with the tracepoints events in the network
> stack, etc.

I'm zoned in on ap logs, but tcpdump and network analysis is another
very good example. Packets are tagged with the realtime clock:

static inline void __net_timestamp(struct sk_buff *skb)
{
skb->tstamp = ktime_get_real();
}


David

2011-02-18 19:06:14

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

Em Fri, Feb 18, 2011 at 11:53:16AM -0700, David Ahern escreveu:
> On 02/18/11 11:41, Arnaldo Carvalho de Melo wrote:
> > Think about tcpdump + networking tracepoints or 'perf probe' dynamic
> > events in the network stack, he wants to merge those logs and correlate
> > the tcpdump packet exchange with the tracepoints events in the network
> > stack, etc.

> I'm zoned in on ap logs, but tcpdump and network analysis is another
> very good example. packets are tagged with realtime:

> static inline void __net_timestamp(struct sk_buff *skb)
> {
> skb->tstamp = ktime_get_real();
> }

I'd bet the original intent was to merge that with app specific logs :-)

Anyway, I think the way out of this is to have a clock_gettime like
interface, let the users decide what they need (or think they do),
sample it together with the other events and be done with it.

- Arnaldo
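A userspace sketch of the correlation problem Arnaldo describes (a hypothetical helper, not part of this patch series): sample CLOCK_MONOTONIC and CLOCK_REALTIME back to back, then use the offset to map a monotonic perf-style timestamp to a wall-clock string.

```python
# Hypothetical helper, not part of the patch series: approximate the
# CLOCK_REALTIME - CLOCK_MONOTONIC offset by sampling both clocks back
# to back, then use it to render monotonic timestamps as wall-clock time.
import time

def clock_offset_ns():
    """CLOCK_REALTIME minus CLOCK_MONOTONIC, in nanoseconds (approximate)."""
    mono = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    real = time.clock_gettime_ns(time.CLOCK_REALTIME)
    return real - mono

def mono_to_walltime(mono_ns, offset_ns):
    """Render a monotonic timestamp as an approximate HH:MM:SS.usec string."""
    real_ns = mono_ns + offset_ns
    secs, nsecs = divmod(real_ns, 1_000_000_000)
    return time.strftime("%H:%M:%S", time.localtime(secs)) + ".%06d" % (nsecs // 1000)
```

Note the offset is only valid around the moment it was sampled: NTP adjustments and settimeofday() move CLOCK_REALTIME independently of CLOCK_MONOTONIC, which is exactly why a clock event sampled inline in the event stream beats a one-shot conversion.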

2011-02-18 19:25:28

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

On Fri, Feb 18, 2011 at 11:07:32AM -0700, David Ahern wrote:
>
>
> On 02/18/11 10:59, Frederic Weisbecker wrote:
> > On Fri, Feb 18, 2011 at 07:28:20AM -0700, David Ahern wrote:
> >>
> >>
> >> On 02/18/11 00:06, Ingo Molnar wrote:
> >>>
> >>> * David Ahern <[email protected]> wrote:
> >>>
> >>>> +--timehist::
> >>>> + Generate time history output. This shows each sample with a wall-clock
> >>>> + timestamp and address to symbol conversions. The samples are output in
> >>>> + the same order they exist in the perf.data file. The --timehist option
> >>>> + must be used with the record to get wall-clock timestamps.
> >>>
> >>> Displaying samples this way is very useful, but how is this different from trace
> >>> output - i.e. 'perf script' output?
> >>>
> >>> There's a 'live logging mode' feature:
> >>>
> >>> perf record ./myworkload | perf inject -v -b | perf script -i -
> >>>
> >>> ( There's also some work going on to have a separate 'perf trace' utility that will
> >>> output a linear trace of whatever events in perf.data are. )
> >>>
> >>> So unless i'm missing something it would be more useful to extend 'perf script' (or
> >>> the upcoming 'perf trace') to list regular samples in a perf.data (i.e. not rely on
> >>> -R raw trace data alone), and of course allow a straightforward utilization of real
> >>> timestamps, when available.
> >>
> >> To date I have not seen any option to show samples with wall-clock
> >> timestamps. In a sense it pretty-prints the -R format. This allows perf
> >> samples to be compared to other log files / sources of data to resolve
> >> an issue.
> >>
> >> For example:
> >>
> >> Time-of-Day Kernel Timestamp cpu Event Command PID
> >> IP Symbol Name (DSO Name)
> >> 22:40:13.831567 858005051634 1 context-switches kworker/1:0 8
> >> ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
> >>
> >> 22:40:13.831683 858005168139 1 context-switches sshd 897
> >> ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
> >>
> >> With the stack callchain option it dumps a line for each address in the
> >> stack (columns abbreviated here for email):
> >>
> >> 07:24:37.527728 1 sshd 819 ffffffff8103251e perf_event_task_sched_out
> >> ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 ffffffff81377231 schedule ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 ffffffff81377f3a
> >> schedule_hrtimeout_range_clock ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 ffffffff81378016 schedule_hrtimeout_range
> >> ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 ffffffff81103e15 poll_schedule_timeout
> >> ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 ffffffff8110463c do_select ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 ffffffff811047f4 core_sys_select
> >> ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 ffffffff81104929 sys_select
> >> ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 ffffffff81002bc2 system_call
> >> ([kernel.kallsyms])
> >> 07:24:37.527728 1 sshd 819 7f56d4814e93 __GI_select
> >> (/lib64/libc-2.12.90.so)
> >> 07:24:37.527728 1 sshd 819 40560b run_builtin
> >> (/tmp/build-perf/perf)
> >> 07:24:37.527728 1 sshd 819 40580d handle_internal_command
> >> (/tmp/build-perf/perf)
> >> 07:24:37.527728 1 sshd 819 405959 run_argv
> >> (/tmp/build-perf/perf)
> >> 07:24:37.527728 1 sshd 819 405b63 main (/tmp/build-perf/perf)
> >> 07:24:37.527728 1 sshd 819 7fa49de9fe7d __libc_start_main
> >> (/lib64/libc-2.12.90.so)
> >>
> >>
> >> This format has allowed us to do an in-depth analysis of scheduling
> >> (record -e cs -c 1): we can see when was process running, for how long
> >> it ran, why it was kicked off the processor, how long was it off the
> >> processor, what ran in between schedulings, etc. If a log file shows a
> >> process missing a heartbeat you can look at the perf data in the time
> >> range and see if it had a chance to run. If not where was it blocked.
> >
> > Oh, in this case I strongly suggest you to use the tracepoint events.
> >
> > Look at the output of "perf list" and seek the events which name
> > start with "sched:"
> >
> > sched:sched_kthread_stop [Tracepoint event]
> > sched:sched_kthread_stop_ret [Tracepoint event]
> > sched:sched_wait_task [Tracepoint event]
> > sched:sched_wakeup [Tracepoint event]
> > sched:sched_wakeup_new [Tracepoint event]
> > sched:sched_switch [Tracepoint event]
> > sched:sched_migrate_task [Tracepoint event]
> > sched:sched_process_free [Tracepoint event]
> > sched:sched_process_exit [Tracepoint event]
> > sched:sched_process_wait [Tracepoint event]
> > sched:sched_process_fork [Tracepoint event]
> > sched:sched_stat_wait [Tracepoint event]
> > sched:sched_stat_sleep [Tracepoint event]
> > sched:sched_stat_iowait [Tracepoint event]
> > sched:sched_stat_runtime [Tracepoint event]
> >
> >
> > You have the sched:sched_switch event and many others.
> >
> > Just try:
> >
> > perf record -a -e sched:*
> > perf script
> >
> > perf-4128 [000] 19242.870025: sched_stat_runtime: comm=perf pid=4128 runtime=7430405 [ns] vruntime=3530192223488
> > perf-4128 [000] 19242.870042: sched_stat_runtime: comm=perf pid=4128 runtime=23142 [ns] vruntime=3530192246630 [n
> > perf-4128 [000] 19242.870045: sched_stat_sleep: comm=kondemand/0 pid=59 delay=9979163 [ns]
> > perf-4128 [000] 19242.870048: sched_wakeup: comm=kondemand/0 pid=59 prio=120 success=1 target_cpu=000
> > perf-4128 [000] 19242.870063: sched_stat_runtime: comm=perf pid=4128 runtime=21581 [ns] vruntime=3530192268211 [n
> > perf-4128 [000] 19242.870066: sched_stat_wait: comm=kondemand/0 pid=59 delay=21581 [ns]
> > perf-4128 [000] 19242.870069: sched_switch: prev_comm=perf prev_pid=4128 prev_prio=120 prev_state=R ==> next_comm
> > kondemand/0-59 [000] 19242.870091: sched_stat_runtime: comm=kondemand/0 pid=59 runtime=27362 [ns] vruntime=35301862739
> > kondemand/0-59 [000] 19242.870094: sched_stat_wait: comm=perf pid=4128 delay=27362 [ns]
> > kondemand/0-59 [000] 19242.870095: sched_switch: prev_comm=kondemand/0 prev_pid=59 prev_prio=120 prev_state=S ==> next
> >
> > And you can run your own script on these events:
> >
> > $ sudo ./perf script -g python
> > generated Python script: perf-script.py
> >
> > Edit perf-script.py and then run it:
> >
> > $ perf script -s ./perf-script.py
> >
> > That also works for perl.
> >
> > The timestamps will be the cpu time and not the walltime, but at least that seems
> > to be partly what you seek?
>
> Yes, I have looked at that output before -- essentially ftrace output.
>
> We want not only context-switch events, but the stack trace at the
> switch. For example, with the stack trace you can see preemption -- when
> a process got booted by another and where the booted process is at the
> time. You can see not only which system call caused an ap to block
> (e.g., an ioctl) but the full code path leading to the block.

You can recognize preemption at the context switch tracepoint: if the state
of the scheduled-out task is R (TASK_RUNNING), it didn't go to sleep but got
preempted, at an explicit preemption point like cond_resched()
or a more implicit one: spin_unlock(), preempt_enable(), an interrupt, ...
Or it was woken up while it was about to sleep, but that doesn't make much
difference.

If you want to know when a process is booted by another you can use the
fork tracepoint, or sched:wake_up_new, etc...

And you can use syscall tracepoints to get the syscalls you want.

I don't see much of a point in your using stacktraces. But if you
do, then rather add this support to perf script, in our scripting
framework.

Because what you've done is basically to add tracing support to
perf report. But we have perf script for that already. It only focuses
on tracepoint events, but those are the events people are interested in
because they show logical events in the kernel. I guess
people are not interested in cpu-cycles overflow events and the like, as
they don't show a state change in the kernel.

Well, yeah, I can understand if one considers the software events,
which make meaningful events from the kernel. But the software events
support has been a mistake in perf. You should rather use the
tracepoint events instead.

> That data along with the gettimeofday timestamp has allowed us to
> resolve performance issues such as a system call taking longer than
> expected during a specific sequence of events or a process getting
> preempted and not scheduled for N seconds. etc., etc.

That's about the same here. If you really need this, you need to add
the support in perf script to handle that on tracepoint events.
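A sketch of the perf script route Frederic is suggesting: a Python handler for sched:sched_switch that flags preemption when the switched-out task is still runnable. The handler name follows the convention of scripts generated by 'perf script -g python'; the exact field list here is an assumption based on the sched_switch tracepoint format, so check a freshly generated script before relying on it.

```python
# Assumed perf-script handler signature (from 'perf script -g python'
# output); field names follow the sched_switch tracepoint format.
def sched__sched_switch(event_name, context, common_cpu, common_secs,
                        common_nsecs, common_comm, common_pid,
                        prev_comm, prev_pid, prev_prio, prev_state,
                        next_comm, next_pid, next_prio):
    # A task switched out while still runnable (state R) was preempted,
    # rather than going to sleep of its own accord.
    kind = "preempted" if prev_state == "R" else "slept"
    print("%u.%09u cpu%03u %s:%d %s ==> %s:%d" %
          (common_secs, common_nsecs, common_cpu,
           prev_comm, prev_pid, kind, next_comm, next_pid))
```

Run it with something like 'perf record -a -e sched:sched_switch' followed by 'perf script -s ./this-script.py'.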

2011-02-18 19:30:30

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

On Fri, Feb 18, 2011 at 04:41:19PM -0200, Arnaldo Carvalho de Melo wrote:
> Em Fri, Feb 18, 2011 at 06:59:30PM +0100, Frederic Weisbecker escreveu:
> > sched:sched_wait_task [Tracepoint event]
> > sched:sched_wakeup [Tracepoint event]
> > sched:sched_wakeup_new [Tracepoint event]
> > sched:sched_switch [Tracepoint event]
> > sched:sched_migrate_task [Tracepoint event]
> > sched:sched_process_free [Tracepoint event]
> > sched:sched_process_exit [Tracepoint event]
> >
> >
> > You have the sched:sched_switch event and many others.
> >
> > Just try:
> >
> > perf record -a -e sched:*
> > perf script
> >
> > perf-4128 [000] 19242.870025: sched_stat_runtime: comm=perf pid=4128 runtime=7430405 [ns] vruntime=3530192223488
> > perf-4128 [000] 19242.870042: sched_stat_runtime: comm=perf pid=4128 runtime=23142 [ns] vruntime=3530192246630 [n
> > perf-4128 [000] 19242.870045: sched_stat_sleep: comm=kondemand/0 pid=59 delay=9979163 [ns]
> > perf-4128 [000] 19242.870048: sched_wakeup: comm=kondemand/0 pid=59 prio=120 success=1 target_cpu=000
> > perf-4128 [000] 19242.870063: sched_stat_runtime: comm=perf pid=4128 runtime=21581 [ns] vruntime=3530192268211 [n
> > perf-4128 [000] 19242.870066: sched_stat_wait: comm=kondemand/0 pid=59 delay=21581 [ns]
> > perf-4128 [000] 19242.870069: sched_switch: prev_comm=perf prev_pid=4128 prev_prio=120 prev_state=R ==> next_comm
> > kondemand/0-59 [000] 19242.870091: sched_stat_runtime: comm=kondemand/0 pid=59 runtime=27362 [ns] vruntime=35301862739
> > kondemand/0-59 [000] 19242.870094: sched_stat_wait: comm=perf pid=4128 delay=27362 [ns]
> > kondemand/0-59 [000] 19242.870095: sched_switch: prev_comm=kondemand/0 prev_pid=59 prev_prio=120 prev_state=S ==> next
> >
> > And you can run your own script on these events:
> >
> > $ sudo ./perf script -g python
> > generated Python script: perf-script.py
> >
> > Edit perf-script.py and then run it:
> >
> > $ perf script -s ./perf-script.py
> >
> > That also works for perl.
> >
> > The timestamps will be the cpu time and not the walltime, but at least that seems
> > to be partly what you seek?
>
> The whole issue for him, AFAIK, is to correlate perf events with app
> events.
>
> Think about tcpdump + networking tracepoints or 'perf probe' dynamic
> events in the network stack, he wants to merge those logs and correlate
> the tcpdump packet exchange with the tracepoints events in the network
> stack, etc.
>
> I.e. it doesn't matter if it is ftrace or not, having a common clock
> shared between apps and kernel tracing/whatever infrastructure is what
> David is after, right?
>
> He can change userspace to use the clock the kernel is using in the
> perf/ftrace/whatever infrastructure or make the kernel use the clock
> userspace uses.
>
> The issue here is who will bend, u or k ;-)

Right. I don't disagree with the need for a walltime. That's already
in debate :)

I was rather arguing about the tracing part. Adding support in perf report
is the wrong way to do this. If something must be extended, it should be done
in perf script, where we do the tracing support.

2011-02-18 19:53:41

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report



On 02/18/11 12:24, Frederic Weisbecker wrote:
>> We want not only context-switch events, but the stack trace at the
>> switch. For example, with the stack trace you can see preemption -- when
>> a process got booted by another and where the booted process is at the
>> time. You can see not only which system call caused an ap to block
>> (e.g., an ioctl) but the full code path leading to the block.
>
> You can recognize preemption a the context switch tracepoint: if the state
> of the scheduled out task is R (TASK_RUNNING), which means it doesn't go
> to sleep but gets preempted, with an explicit preemption point like cond_resched(),
> or a more implicit one: spin_unlock(), preempt_enable(), an interrupt, ...
> Or it has been woken up while it was about to sleep, but it doesn't make much
> difference.
>
> If you want to know when a process is booted by another you can use the
> fork tracepoint, or sched:wake_up_new, etc...
>
> And you can use syscall tracepoints to get the syscalls you want.
>
> I don't see much the point for you to use stacktraces. But if you
> do, then rather add this support to perf script, in our scripting
> framework.

It's more the simplicity of what we are using today. 1 command, 1 event
being monitored:

perf record -ag -e cs -c 1

A wealth of information. That command shows preemption, stack traces
only for context-switches (not all of the syscalls, which is
overwhelming), and opens the door for other analysis. One data set.
Simple. Focused.

>
> Because what you've done is basically to add tracing support to
> perf report. But we have perf script for that already. It only focuses
> on tracepoint events but they are those people are interested in
> because they show logical events in the kernel. I guess
> people are not interested in cpu-cycles overflows events or so as
> they don't show a state change in the kernel.

I have always referred to this as pretty printing each sample recorded
as opposed to summarizing into a histogram. With that approach you have
dictated the analysis of the data - a histogram summary. By printing
each sample with address-symbol conversions we can look at it in
whatever angle we need to make sense of it.

David


>
> Well, yeah I can understand if one considers the software events,
> that makes meaningful events from the kernel. But these software events
> support have been a mistake in perf. You should rather use the
> tracepoint events instead.
>
>> That data along with the gettimeofday timestamp has allowed us to
>> resolve performance issues such as a system call taking longer than
>> expected during a specific sequence of events or a process getting
>> preempted and not scheduled for N seconds. etc., etc.
>
> That's about the same here. If you really need this, you need to add
> the support in perf script to handle that on tracepoint events.

2011-02-18 20:31:07

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

Em Fri, Feb 18, 2011 at 08:29:56PM +0100, Frederic Weisbecker escreveu:
> On Fri, Feb 18, 2011 at 04:41:19PM -0200, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Feb 18, 2011 at 06:59:30PM +0100, Frederic Weisbecker escreveu:
> > > The timestamps will be the cpu time and not the walltime, but at least that seems
> > > to be partly what you seek?
> >
> > The whole issue for him, AFAIK, is to correlate perf events with app
> > events.
> >
> > Think about tcpdump + networking tracepoints or 'perf probe' dynamic
> > events in the network stack, he wants to merge those logs and correlate
> > the tcpdump packet exchange with the tracepoints events in the network
> > stack, etc.
> >
> > I.e. it doesn't matter if it is ftrace or not, having a common clock
> > shared between apps and kernel tracing/whatever infrastructure is what
> > David is after, right?
> >
> > He can change userspace to use the clock the kernel is using in the
> > perf/ftrace/whatever infrastructure or make the kernel use the clock
> > userspace uses.
> >
> > The issue here is who will bend, u or k ;-)
>
> Right. I don't disagree with the need for a walltime. That's already
> in debate :)
>
> I was rather arguing about the tracing part. Adding support in perf report
> is the wrong way to do this. If something must be extended, it should be done
> in perf script, where we do the tracing support.

Squashing "profiling" and "tracing" differences is something I think
worthwhile.

It's all about events, which should be treated as equal in all tools. So I
don't think 'perf script' is about tracing, it's about handling events
found in the event stream, be it hw, sw, tracepoints, dynamic probes put
in place by 'perf probe', or whatever event source we end up having.

A 'clock_gettime'-like event, sampling whatever POSIX clock the kernel
supports, should be enough flexibility, leaving concerns about the sanity
of any particular use to the rope user.

Of course, extensibility via 'perf script' or 'use perf' in perf scripts
should provide great avenues for experimentation, not requiring changes
in the builtins :-)

- Arnaldo

2011-02-19 09:33:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report


* David Ahern <[email protected]> wrote:

> On 01/23/11 at 03:45:67 a syslog event noted an application restart.
> Checking the applications logs it went silent. Why? What does perf say?
>
> Oh, 2 days have gone by before said event is reported to engineering,
> and the server was been rebooted to 'clear' the problem - hence
> resetting monotonic clock.

I think that's a valid usecase and the feature you are suggesting is very useful.

Why not extend the regular trace output to also (optionally) output GTOD timestamps?

Thanks,

Ingo

2011-02-19 14:38:37

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

Em Sat, Feb 19, 2011 at 10:32:16AM +0100, Ingo Molnar escreveu:
> * David Ahern <[email protected]> wrote:
>
> > On 01/23/11 at 03:45:67 a syslog event noted an application restart.
> > Checking the applications logs it went silent. Why? What does perf say?
> >
> > Oh, 2 days have gone by before said event is reported to engineering,
> > and the server was been rebooted to 'clear' the problem - hence
> > resetting monotonic clock.
>
> I think that's a valid usecase and the feature you are suggesting is very useful.
>
> Why not extend the regular trace output to also (optionally) output GTOD timestamps?

That is how this feature request/submission originally started.

David tried with an extra header, then a synthesized event, then after
some discussion with Peter I started working on a monotonic clock event,
but got sidetracked with other stuff, he tried picking up from where I
left and here we are.

I think the summary is: we need a clock_gettime like interface so that
users can pick and choose whichever clock best matches the app whose
logs they're trying to merge.

- Arnaldo

2011-02-20 12:49:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/3] perf events: Introduce realtime clock event


* Peter Zijlstra <[email protected]> wrote:

> > Programs that generate time-of-day output are using gettimeofday which is tied
> > to CLOCK_REALTIME. We want to be able to correlate a perf sample to an entry in
> > an applications log file.
>
> Well, you can argue those programs are broken :-), [...]

Then 99% of the world is broken as daylight saving time jumps are part of our lives.

> [...] Imagine the joys of trying to figure out wth happens when DST jumps the
> clock back an hour and you have an hour of duplicate data.

DST means twice a year the clock jumps back and forth by exactly one hour, on a
given Saturday night. People debugging logs on Saturday nights and not realizing
that there's a (always highly publicized) DST switch going on *deserve* to be
surprised twice a year! :-)

Really, i don't think it's a big issue - and timestamping in *real* time has very
obvious advantages, as its name already suggests.

Thanks,

Ingo
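The duplicate-hour hazard Peter and Ingo are debating can be seen directly: across a DST fall-back, two instants one hour apart render as the same local wall-clock time. A minimal sketch, assuming the 2010 US transition (Nov 7, America/New_York) as the example:

```python
# Two UTC instants one hour apart print identically in local time across
# the 2010 US DST fall-back (02:00 EDT jumps back to 01:00 EST).
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; needs system tzdata

NY = ZoneInfo("America/New_York")

t1 = datetime(2010, 11, 7, 5, 30, tzinfo=timezone.utc)  # 01:30 EDT
t2 = datetime(2010, 11, 7, 6, 30, tzinfo=timezone.utc)  # 01:30 EST

local1 = t1.astimezone(NY).strftime("%H:%M")
local2 = t2.astimezone(NY).strftime("%H:%M")
```

Both stamps read 01:30 locally even though the instants differ by 3600 seconds, which is the "hour of duplicate data" Peter warns about.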

2011-02-21 21:14:03

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

On Fri, Feb 18, 2011 at 12:53:34PM -0700, David Ahern wrote:
>
>
> On 02/18/11 12:24, Frederic Weisbecker wrote:
> >> We want not only context-switch events, but the stack trace at the
> >> switch. For example, with the stack trace you can see preemption -- when
> >> a process got booted by another and where the booted process is at the
> >> time. You can see not only which system call caused an ap to block
> >> (e.g., an ioctl) but the full code path leading to the block.
> >
> > You can recognize preemption a the context switch tracepoint: if the state
> > of the scheduled out task is R (TASK_RUNNING), which means it doesn't go
> > to sleep but gets preempted, with an explicit preemption point like cond_resched(),
> > or a more implicit one: spin_unlock(), preempt_enable(), an interrupt, ...
> > Or it has been woken up while it was about to sleep, but it doesn't make much
> > difference.
> >
> > If you want to know when a process is booted by another you can use the
> > fork tracepoint, or sched:wake_up_new, etc...
> >
> > And you can use syscall tracepoints to get the syscalls you want.
> >
> > I don't see much the point for you to use stacktraces. But if you
> > do, then rather add this support to perf script, in our scripting
> > framework.
>
> It's more the simplicity of what we are using today. 1 command, 1 event
> being monitored:
>
> perf record -ag -e cs -c 1
>
> A wealth of information. That command shows preemption, stack traces
> only for context-switches (not all of the syscalls which is
> overwhelming) and opens the door for other analysis. One data set.
> Simple. Focused.

Right. Then why not add support for stacktraces in perf script?
It seems to be the only thing missing to achieve what you want.

> > Because what you've done is basically to add tracing support to
> > perf report. But we have perf script for that already. It only focuses
> > on tracepoint events but they are those people are interested in
> > because they show logical events in the kernel. I guess
> > people are not interested in cpu-cycles overflows events or so as
> > they don't show a state change in the kernel.
>
> I have always referred to this as pretty printing each sample recorded
> as opposed to summarizing into a histogram. With that approach you have
> dictated the analysis of the data - a histogram summary. By printing
> each sample with address-symbol conversions we can look at it in
> whatever angle we need to make sense of it.

I'm not dictating any analysis. You can just launch
perf script without a script and it will display the events,
like you want.

Just try:

perf record -a -e sched:sched_switch
perf script

You can extend it to print addresses or symbols or even stacktraces.
It just seems to me the right place to do so. Not perf report, which is
about histograms.

2011-02-21 21:35:10

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and report

On Fri, Feb 18, 2011 at 06:30:15PM -0200, Arnaldo Carvalho de Melo wrote:
> Em Fri, Feb 18, 2011 at 08:29:56PM +0100, Frederic Weisbecker escreveu:
> > On Fri, Feb 18, 2011 at 04:41:19PM -0200, Arnaldo Carvalho de Melo wrote:
> > > Em Fri, Feb 18, 2011 at 06:59:30PM +0100, Frederic Weisbecker escreveu:
> > > > The timestamps will be the cpu time and not the walltime, but at least that seems
> > > > to be partly what you seek?
> > >
> > > The whole issue for him, AFAIK, is to correlate perf events with app
> > > events.
> > >
> > > Think about tcpdump + networking tracepoints or 'perf probe' dynamic
> > > events in the network stack, he wants to merge those logs and correlate
> > > the tcpdump packet exchange with the tracepoints events in the network
> > > stack, etc.
> > >
> > > I.e. it doesn't matter if it is ftrace or not, having a common clock
> > > shared between apps and kernel tracing/whatever infrastructure is what
> > > David is after, right?
> > >
> > > He can change userspace to use the clock the kernel is using in the
> > > perf/ftrace/whatever infrastructure or make the kernel use the clock
> > > userspace uses.
> > >
> > > The issue here is who will bend, u or k ;-)
> >
> > Right. I don't disagree with the need for a walltime. That's already
> > in debate :)
> >
> > I was rather arguing about the tracing part. Adding support in perf report
> > is the wrong way to do this. If something must be extended, it should be done
> > in perf script, where we do the tracing support.
>
> Squashing "profiling" and "tracing" differences is something I think
> worthy.
>
> Its all about events, that should be treated as equal in all tools. So I
> don't think 'perf script' is about tracing, its about handling events
> found in the event stream, be it hw, sw, tracepoints, dynamic probes put
> in place by 'perf probe', or whatever event source we end up having.
>
> A 'clock_gettime' like event, sampling whatever POSIX clock the kernel
> supports should be enough flexibility, leaving concerns about sanity of
> any particular use to the rope user.
>
> Of course, extensibility via 'perf script' or 'use perf' in perf scripts
> should provide great avenues for experimentation, not requiring changes
> in the builtins :-)

Right.

So I agree perf script should be agnostic of tracing/profiling/any semantics.
But when it's about dumping and displaying events sequentially, it seems
an extension to perf script is a better place for that than
perf report. perf script's default behaviour is already to display
raw traces. If support is needed to print the ip of the event or the
stacktraces, it's very welcome!