From: Kan Liang <[email protected]>
Profiling brings additional overhead. High overhead may impact the
behavior of the profiled workload, reduce the accuracy of the
profiling results, and even hang the system.
Currently, perf has a dynamic interrupt throttling mechanism to lower
the sample rate and with it the overhead. But it has limitations:
- The mechanism only focuses on the overhead from NMIs. However,
  other parts also bring significant overhead, e.g. multiplexing.
- The hint from the mechanism doesn't work with a fixed period.
- The system changes caused by the mechanism are not recorded in
  perf.data. Users have no idea about the overhead and its impact.
Actually, passive mechanisms like dynamic interrupt throttling are
only palliative. The best way is to export the overhead information,
provide more hints, and help users construct a more suitable perf
command.
According to our tests, there are four parts which can bring
significant overhead: the NMI handler, the multiplexing handler,
iterating side-band events, and writing data to the perf.data file.
Two new perf record types, PERF_RECORD_OVERHEAD and
PERF_RECORD_USER_OVERHEAD, are introduced to record the overhead
information in kernel and user space respectively.
The overhead information is the per-CPU system-wide overhead, not
per-event overhead. The implementation takes advantage of the
existing event logging mechanism.
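
For reference, this is the per-source entry carried by the new
records, as defined in the uapi header additions later in this
series:

    /* one entry per overhead source, accumulated per CPU */
    struct perf_overhead_entry {
            __u32   cpu;    /* CPU on which the overhead occurred */
            __u64   nr;     /* number of times it happened */
            __u64   time;   /* total cost in ns */
    };
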
To reduce the additional overhead of logging the overhead information
itself, the information is only output when the event is going to be
disabled or the task is scheduled out.
In perf report, the overhead is checked automatically. If the
overhead rate is larger than 10%, a warning is displayed (the check
is sketched below). A new option is also introduced to display
detailed per-CPU overhead information.
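
The check is roughly equivalent to the following condensed sketch of
the session.c change in this series (the variable names here are
shorthand for the accumulated per-CPU stats):

    /* per CPU: sum the accumulated time of all logged sources */
    u64 overhead = nmi_time + mux_time + sb_time + user_write_time;
    double rate = (double)overhead / (double)elapsed_time;

    if (rate > 0.1) /* more than 10% of the run spent in perf itself */
            ui__warning("Perf overhead is high! ...");
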
The current implementation only covers these four overhead sources.
There could be more in other parts. A new overhead source can easily
be added as a new type, as sketched below.
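
For example, a hypothetical fifth source could be wired up with the
same three steps (PERF_FOO_OVERHEAD and foo_overhead are made-up
names, shown only to illustrate the pattern):

    /* 1. add a new type to enum perf_record_overhead_type */
    PERF_FOO_OVERHEAD,

    /* 2. accumulate cheaply on the hot path */
    foo_overhead.nr++;
    foo_overhead.time += delta;

    /* 3. flush via the common helper on disable/schedule-out */
    perf_log_overhead(event, PERF_FOO_OVERHEAD, &foo_overhead);
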
Kan Liang (14):
perf/x86: Introduce PERF_RECORD_OVERHEAD
perf/x86: output NMI overhead
perf/x86: output multiplexing overhead
perf/x86: output side-band events overhead
perf tools: handle PERF_RECORD_OVERHEAD record type
perf tools: show NMI overhead
perf tools: show multiplexing overhead
perf tools: show side-band events overhead
perf tools: make get_nsecs visible for builtin files
perf tools: introduce PERF_RECORD_USER_OVERHEAD
perf tools: record write data overhead
perf tools: record elapsed time
perf tools: warn on high overhead
perf script: show overhead events
arch/x86/events/core.c | 19 ++++-
arch/x86/events/perf_event.h | 2 +
include/linux/perf_event.h | 9 ++
include/uapi/linux/perf_event.h | 33 ++++++++
kernel/events/core.c | 139 +++++++++++++++++++++++++------
tools/include/uapi/linux/perf_event.h | 33 ++++++++
tools/perf/Documentation/perf-report.txt | 3 +
tools/perf/builtin-record.c | 71 +++++++++++++++-
tools/perf/builtin-report.c | 25 ++++++
tools/perf/builtin-sched.c | 2 +-
tools/perf/builtin-script.c | 36 ++++++++
tools/perf/builtin.h | 1 +
tools/perf/util/event.c | 47 +++++++++++
tools/perf/util/event.h | 37 ++++++++
tools/perf/util/machine.c | 35 ++++++++
tools/perf/util/machine.h | 2 +
tools/perf/util/session.c | 68 +++++++++++++++
tools/perf/util/symbol.h | 3 +-
tools/perf/util/tool.h | 1 +
19 files changed, 535 insertions(+), 31 deletions(-)
--
2.5.5
From: Kan Liang <[email protected]>
The NMI handler is one of the most important sources of overhead.
There are lots of NMIs during sampling, and it's too expensive to log
each one. So the accumulated time and NMI count are output when the
event is going to be disabled or the task is scheduled out.
The newly introduced flag PERF_EF_LOG indicates that the accumulated
overhead should be logged.
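
The resulting pattern in a PMU's ->del() callback, mirroring the x86
hunk below, is to flush the accumulated entry only when the flag is
set and at least one NMI was counted:

    if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr) {
            cpuc->nmi_overhead.cpu = smp_processor_id();
            perf_log_overhead(event, PERF_NMI_OVERHEAD, &cpuc->nmi_overhead);
    }
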
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 19 ++++++++++++++-
arch/x86/events/perf_event.h | 2 ++
include/linux/perf_event.h | 1 +
include/uapi/linux/perf_event.h | 2 ++
kernel/events/core.c | 54 ++++++++++++++++++++++-------------------
5 files changed, 52 insertions(+), 26 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d31735f..6c3b0ef 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1397,6 +1397,11 @@ static void x86_pmu_del(struct perf_event *event, int flags)
perf_event_update_userpage(event);
+ if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr) {
+ cpuc->nmi_overhead.cpu = smp_processor_id();
+ perf_log_overhead(event, PERF_NMI_OVERHEAD, &cpuc->nmi_overhead);
+ }
+
do_del:
if (x86_pmu.del) {
/*
@@ -1475,11 +1480,21 @@ void perf_events_lapic_init(void)
apic_write(APIC_LVTPC, APIC_DM_NMI);
}
+static void
+perf_caculate_nmi_overhead(u64 time)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ cpuc->nmi_overhead.nr++;
+ cpuc->nmi_overhead.time += time;
+}
+
static int
perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
u64 start_clock;
u64 finish_clock;
+ u64 clock;
int ret;
/*
@@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
start_clock = sched_clock();
ret = x86_pmu.handle_irq(regs);
finish_clock = sched_clock();
+ clock = finish_clock - start_clock;
- perf_sample_event_took(finish_clock - start_clock);
+ perf_caculate_nmi_overhead(clock);
+ perf_sample_event_took(clock);
return ret;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5874d8d..c6b25ac 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -230,6 +230,8 @@ struct cpu_hw_events {
struct intel_excl_cntrs *excl_cntrs;
int excl_thread_id; /* 0 or 1 */
+ struct perf_overhead_entry nmi_overhead;
+
/*
* AMD specific bits
*/
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3a530a7..632647f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -313,6 +313,7 @@ struct pmu {
#define PERF_EF_START 0x01 /* start the counter when adding */
#define PERF_EF_RELOAD 0x02 /* reload the counter when starting */
#define PERF_EF_UPDATE 0x04 /* update the counter when stopping */
+#define PERF_EF_LOG 0x08
/*
* Adds/Removes a counter to/from the PMU, can be done inside a
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ae571be..071323d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -992,6 +992,8 @@ struct perf_branch_entry {
};
enum perf_record_overhead_type {
+ PERF_NMI_OVERHEAD = 0,
+
PERF_OVERHEAD_MAX,
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 590c399..d82e6ca 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1785,7 +1785,8 @@ event_filter_match(struct perf_event *event)
static void
event_sched_out(struct perf_event *event,
struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+ struct perf_event_context *ctx,
+ bool log_overhead)
{
u64 tstamp = perf_event_time(event);
u64 delta;
@@ -1810,9 +1811,8 @@ event_sched_out(struct perf_event *event,
return;
perf_pmu_disable(event->pmu);
-
event->tstamp_stopped = tstamp;
- event->pmu->del(event, 0);
+ event->pmu->del(event, log_overhead ? PERF_EF_LOG : 0);
event->oncpu = -1;
event->state = PERF_EVENT_STATE_INACTIVE;
if (event->pending_disable) {
@@ -1835,20 +1835,21 @@ event_sched_out(struct perf_event *event,
static void
group_sched_out(struct perf_event *group_event,
struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+ struct perf_event_context *ctx,
+ bool log_overhead)
{
struct perf_event *event;
int state = group_event->state;
perf_pmu_disable(ctx->pmu);
- event_sched_out(group_event, cpuctx, ctx);
+ event_sched_out(group_event, cpuctx, ctx, log_overhead);
/*
* Schedule out siblings (if any):
*/
list_for_each_entry(event, &group_event->sibling_list, group_entry)
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, cpuctx, ctx, log_overhead);
perf_pmu_enable(ctx->pmu);
@@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event *event,
{
unsigned long flags = (unsigned long)info;
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, cpuctx, ctx, false);
if (flags & DETACH_GROUP)
perf_group_detach(event);
list_del_event(event, ctx);
@@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct perf_event *event,
update_cgrp_time_from_event(event);
update_group_times(event);
if (event == event->group_leader)
- group_sched_out(event, cpuctx, ctx);
+ group_sched_out(event, cpuctx, ctx, true);
else
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, cpuctx, ctx, true);
event->state = PERF_EVENT_STATE_OFF;
}
@@ -2141,10 +2142,10 @@ group_sched_in(struct perf_event *group_event,
event->tstamp_running += now - event->tstamp_stopped;
event->tstamp_stopped = now;
} else {
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, cpuctx, ctx, false);
}
}
- event_sched_out(group_event, cpuctx, ctx);
+ event_sched_out(group_event, cpuctx, ctx, false);
pmu->cancel_txn(pmu);
@@ -2198,7 +2199,8 @@ static void add_event_to_ctx(struct perf_event *event,
static void ctx_sched_out(struct perf_event_context *ctx,
struct perf_cpu_context *cpuctx,
- enum event_type_t event_type);
+ enum event_type_t event_type,
+ bool log_overhead);
static void
ctx_sched_in(struct perf_event_context *ctx,
struct perf_cpu_context *cpuctx,
@@ -2206,7 +2208,8 @@ ctx_sched_in(struct perf_event_context *ctx,
struct task_struct *task);
static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+ struct perf_event_context *ctx,
+ bool log_overhead)
{
if (!cpuctx->task_ctx)
return;
@@ -2214,7 +2217,7 @@ static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
return;
- ctx_sched_out(ctx, cpuctx, EVENT_ALL);
+ ctx_sched_out(ctx, cpuctx, EVENT_ALL, log_overhead);
}
static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
@@ -2234,7 +2237,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
{
perf_pmu_disable(cpuctx->ctx.pmu);
if (task_ctx)
- task_ctx_sched_out(cpuctx, task_ctx);
+ task_ctx_sched_out(cpuctx, task_ctx, false);
cpu_ctx_sched_out(cpuctx, EVENT_ALL);
perf_event_sched_in(cpuctx, task_ctx, current);
perf_pmu_enable(cpuctx->ctx.pmu);
@@ -2281,7 +2284,7 @@ static int __perf_install_in_context(void *info)
}
if (activate) {
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, cpuctx, EVENT_TIME, false);
add_event_to_ctx(event, ctx);
ctx_resched(cpuctx, task_ctx);
} else {
@@ -2396,7 +2399,7 @@ static void __perf_event_enable(struct perf_event *event,
return;
if (ctx->is_active)
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, cpuctx, EVENT_TIME, false);
__perf_event_mark_enabled(event);
@@ -2611,7 +2614,8 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);
static void ctx_sched_out(struct perf_event_context *ctx,
struct perf_cpu_context *cpuctx,
- enum event_type_t event_type)
+ enum event_type_t event_type,
+ bool log_overhead)
{
int is_active = ctx->is_active;
struct perf_event *event;
@@ -2662,12 +2666,12 @@ static void ctx_sched_out(struct perf_event_context *ctx,
perf_pmu_disable(ctx->pmu);
if (is_active & EVENT_PINNED) {
list_for_each_entry(event, &ctx->pinned_groups, group_entry)
- group_sched_out(event, cpuctx, ctx);
+ group_sched_out(event, cpuctx, ctx, log_overhead);
}
if (is_active & EVENT_FLEXIBLE) {
list_for_each_entry(event, &ctx->flexible_groups, group_entry)
- group_sched_out(event, cpuctx, ctx);
+ group_sched_out(event, cpuctx, ctx, log_overhead);
}
perf_pmu_enable(ctx->pmu);
}
@@ -2850,7 +2854,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
if (do_switch) {
raw_spin_lock(&ctx->lock);
- task_ctx_sched_out(cpuctx, ctx);
+ task_ctx_sched_out(cpuctx, ctx, false);
raw_spin_unlock(&ctx->lock);
}
}
@@ -2958,7 +2962,7 @@ void __perf_event_task_sched_out(struct task_struct *task,
static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
enum event_type_t event_type)
{
- ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+ ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, false);
}
static void
@@ -3350,7 +3354,7 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
if (ctx)
- ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+ ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE, false);
rotate_ctx(&cpuctx->ctx);
if (ctx)
@@ -3415,7 +3419,7 @@ static void perf_event_enable_on_exec(int ctxn)
cpuctx = __get_cpu_context(ctx);
perf_ctx_lock(cpuctx, ctx);
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, cpuctx, EVENT_TIME, false);
list_for_each_entry(event, &ctx->event_list, event_entry)
enabled |= event_enable_on_exec(event, ctx);
@@ -10190,7 +10194,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
* in.
*/
raw_spin_lock_irq(&child_ctx->lock);
- task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx);
+ task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, true);
/*
* Now that the context is inactive, destroy the task <-> ctx relation
--
2.5.5
From: Kan Liang <[email protected]>
Introduce a new option --show-overhead to show overhead events in
perf script output.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-script.c | 36 ++++++++++++++++++++++++++++++++++++
tools/perf/util/event.c | 37 +++++++++++++++++++++++++++++++++++++
tools/perf/util/event.h | 1 +
3 files changed, 74 insertions(+)
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e1daff3..76d9747 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -829,6 +829,7 @@ struct perf_script {
bool show_task_events;
bool show_mmap_events;
bool show_switch_events;
+ bool show_overhead;
bool allocated;
struct cpu_map *cpus;
struct thread_map *threads;
@@ -1264,6 +1265,37 @@ static int process_switch_event(struct perf_tool *tool,
return 0;
}
+static int process_overhead_event(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ struct thread *thread;
+ struct perf_script *script = container_of(tool, struct perf_script, tool);
+ struct perf_session *session = script->session;
+ struct perf_evsel *evsel;
+
+ if (perf_event__process_switch(tool, event, sample, machine) < 0)
+ return -1;
+ if (sample) {
+ evsel = perf_evlist__id2evsel(session->evlist, sample->id);
+ thread = machine__findnew_thread(machine, sample->pid, sample->tid);
+ if (thread == NULL) {
+ pr_debug("problem processing OVERHEAD event, skipping it.\n");
+ return -1;
+ }
+
+ print_sample_start(sample, thread, evsel);
+ perf_event__fprintf(event, stdout);
+ thread__put(thread);
+ } else {
+ /* USER OVERHEAD event */
+ perf_event__fprintf(event, stdout);
+ }
+
+ return 0;
+}
+
static void sig_handler(int sig __maybe_unused)
{
session_done = 1;
@@ -1287,6 +1319,8 @@ static int __cmd_script(struct perf_script *script)
}
if (script->show_switch_events)
script->tool.context_switch = process_switch_event;
+ if (script->show_overhead)
+ script->tool.overhead = process_overhead_event;
ret = perf_session__process_events(script->session);
@@ -2172,6 +2206,8 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
"Show the mmap events"),
OPT_BOOLEAN('\0', "show-switch-events", &script.show_switch_events,
"Show context switch events (if recorded)"),
+ OPT_BOOLEAN('\0', "show-overhead", &script.show_overhead,
+ "Show overhead events"),
OPT_BOOLEAN('f', "force", &file.force, "don't complain, do it"),
OPT_BOOLEAN(0, "ns", &nanosecs,
"Use 9 decimal places when displaying time"),
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 6cd43c9..cd4f3aa 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -1190,6 +1190,39 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp)
event->context_switch.next_prev_tid);
}
+size_t perf_event__fprintf_overhead(union perf_event *event, FILE *fp)
+{
+ size_t ret;
+
+ if (event->overhead.type == PERF_NMI_OVERHEAD) {
+ ret = fprintf(fp, " [NMI] nr: %llu time: %llu cpu %u\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ event->overhead.entry.cpu);
+ } else if (event->overhead.type == PERF_MUX_OVERHEAD) {
+ ret = fprintf(fp, " [MUX] nr: %llu time: %llu cpu %u\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ event->overhead.entry.cpu);
+ } else if (event->overhead.type == PERF_SB_OVERHEAD) {
+ ret = fprintf(fp, " [SB] nr: %llu time: %llu cpu %u\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ event->overhead.entry.cpu);
+ } else if (event->overhead.type == PERF_USER_WRITE_OVERHEAD) {
+ ret = fprintf(fp, " [USER WRITE] nr: %llu time: %llu cpu %u\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ event->overhead.entry.cpu);
+ } else if (event->overhead.type == PERF_USER_ELAPSED_TIME) {
+ ret = fprintf(fp, " [ELAPSED TIME] time: %llu\n",
+ event->overhead.entry.time);
+ } else {
+ ret = fprintf(fp, " unhandled!\n");
+ }
+ return ret;
+}
+
size_t perf_event__fprintf(union perf_event *event, FILE *fp)
{
size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -1219,6 +1252,10 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
case PERF_RECORD_SWITCH_CPU_WIDE:
ret += perf_event__fprintf_switch(event, fp);
break;
+ case PERF_RECORD_OVERHEAD:
+ case PERF_RECORD_USER_OVERHEAD:
+ ret += perf_event__fprintf_overhead(event, fp);
+ break;
default:
ret += fprintf(fp, "\n");
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ceb0968..36e295d 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -690,6 +690,7 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_overhead(union perf_event *event, FILE *fp);
u64 kallsyms__get_function_start(const char *kallsyms_filename,
const char *symbol_name);
--
2.5.5
From: Kan Liang <[email protected]>
A rough overhead rate can be calculated as the sum of all kinds of
overhead divided by the elapsed time. For example, 150 ms of
accumulated overhead over a 1 s run yields a 15% rate. If the
overhead rate is higher than 10% on any CPU, warn the user.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/session.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index e84808f..decfc48 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1559,6 +1559,30 @@ perf_session__warn_order(const struct perf_session *session)
ui__warning("%u out of order events recorded.\n", oe->nr_unordered_events);
}
+static void
+perf_session__warn_overhead(const struct perf_session *session)
+{
+ const struct events_stats *stats = &session->evlist->stats;
+ double overhead_rate;
+ u64 overhead;
+ int i;
+
+ for (i = 0; i < session->header.env.nr_cpus_online; i++) {
+ overhead = stats->total_nmi_overhead[i][1];
+ overhead += stats->total_mux_overhead[i][1];
+ overhead += stats->total_sb_overhead[i][1];
+ overhead += stats->total_user_write_overhead[i][1];
+
+ overhead_rate = (double)overhead / (double)stats->elapsed_time;
+
+ if (overhead_rate > 0.1) {
+ ui__warning("Perf overhead is high! The overhead rate is %3.2f%% on CPU %d\n\n"
+ "Please consider reducing the number of events, or increasing the period, or decrease the frequency.\n\n",
+ overhead_rate * 100.0, i);
+ }
+ }
+}
+
static void perf_session__warn_about_errors(const struct perf_session *session)
{
const struct events_stats *stats = &session->evlist->stats;
@@ -1632,6 +1656,8 @@ static void perf_session__warn_about_errors(const struct perf_session *session)
"Increase it by --proc-map-timeout\n",
stats->nr_proc_map_timeout);
}
+
+ perf_session__warn_overhead(session);
}
static int perf_session__flush_thread_stack(struct thread *thread,
--
2.5.5
From: Kan Liang <[email protected]>
Record the elapsed time of a perf record session, and display it in
perf report.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-record.c | 10 ++++++++++
tools/perf/builtin-report.c | 1 +
tools/perf/util/event.h | 3 +++
tools/perf/util/machine.c | 3 +++
tools/perf/util/session.c | 3 +++
5 files changed, 20 insertions(+)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 492058e..ea94e10 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -69,6 +69,7 @@ struct record {
bool switch_output;
unsigned long long samples;
struct write_overhead overhead[MAX_NR_CPUS];
+ u64 elapsed_time;
};
static u64 get_vnsecs(void)
@@ -866,6 +867,12 @@ static void perf_event__synth_overhead(struct record *rec, perf_event__handler_t
(void)process(&rec->tool, &event, NULL, NULL);
}
+
+ event.overhead.type = PERF_USER_ELAPSED_TIME;
+ event.overhead.entry.cpu = -1;
+ event.overhead.entry.nr = 1;
+ event.overhead.entry.time = rec->elapsed_time;
+ (void)process(&rec->tool, &event, NULL, NULL);
}
static int __cmd_record(struct record *rec, int argc, const char **argv)
@@ -1129,6 +1136,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
goto out_child;
}
+ rec->elapsed_time = get_nsecs() - rec->elapsed_time;
perf_event__synth_overhead(rec, process_synthesized_event);
if (!quiet)
@@ -1601,6 +1609,8 @@ int cmd_record(int argc, const char **argv, const char *prefix __maybe_unused)
# undef REASON
#endif
+ rec->elapsed_time = get_nsecs();
+
rec->evlist = perf_evlist__new();
if (rec->evlist == NULL)
return -ENOMEM;
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 9c0a424..de2a9b6 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -371,6 +371,7 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
fprintf(stdout, "#\n# Total Lost Samples: %" PRIu64 "\n#\n", evlist->stats.total_lost_samples);
if (symbol_conf.show_overhead) {
+ fprintf(stdout, "# Elapsed time: %" PRIu64 " ns\n", evlist->stats.elapsed_time);
fprintf(stdout, "# Overhead:\n");
for (cpu = 0; cpu < session->header.env.nr_cpus_online; cpu++) {
if (!evlist->stats.total_nmi_overhead[cpu][0])
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 9927cf9..ceb0968 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -275,6 +275,7 @@ enum auxtrace_error_type {
* The total_user_write_overhead tells exactly the overhead to write data in
* perf record.
* The total write# is stored in [0], while the accumulated time is in [1].
+ * The elapsed_time tells the elapsed time of perf record
*/
struct events_stats {
u64 total_period;
@@ -287,6 +288,7 @@ struct events_stats {
u64 total_mux_overhead[MAX_NR_CPUS][2];
u64 total_sb_overhead[MAX_NR_CPUS][2];
u64 total_user_write_overhead[MAX_NR_CPUS][2];
+ u64 elapsed_time;
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
u32 nr_lost_warned;
@@ -497,6 +499,7 @@ struct time_conv_event {
u64 time_zero;
};
+#define PERF_USER_ELAPSED_TIME 200 /* above any possible overhead type */
enum perf_user_overhead_event_type { /* above any possible kernel type */
PERF_USER_OVERHEAD_TYPE_START = 100,
PERF_USER_WRITE_OVERHEAD = 100,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index ce7a0ea..150071f 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -578,6 +578,9 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
event->overhead.entry.nr,
event->overhead.entry.time,
event->overhead.entry.cpu);
+ } else if (event->overhead.type == PERF_USER_ELAPSED_TIME) {
+ dump_printf(" Elapsed time: %llu\n",
+ event->overhead.entry.time);
} else {
dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a72992b..e84808f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1231,6 +1231,9 @@ overhead_stats_update(struct perf_tool *tool,
evlist->stats.total_user_write_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
evlist->stats.total_user_write_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
break;
+ case PERF_USER_ELAPSED_TIME:
+ evlist->stats.elapsed_time = event->overhead.entry.time;
+ break;
default:
break;
}
--
2.5.5
From: Kan Liang <[email protected]>
Multiplexing overhead is one of the key overhead sources when the
number of events is larger than the number of available counters.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 16 ++++++++++++++++
3 files changed, 19 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 632647f..f72b97a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -793,6 +793,8 @@ struct perf_cpu_context {
struct list_head sched_cb_entry;
int sched_cb_usage;
+
+ struct perf_overhead_entry mux_overhead;
};
struct perf_output_handle {
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 071323d..9124c7c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -993,6 +993,7 @@ struct perf_branch_entry {
enum perf_record_overhead_type {
PERF_NMI_OVERHEAD = 0,
+ PERF_MUX_OVERHEAD,
PERF_OVERHEAD_MAX,
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d82e6ca..9934059 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1829,6 +1829,11 @@ event_sched_out(struct perf_event *event,
if (event->attr.exclusive || !cpuctx->active_oncpu)
cpuctx->exclusive = 0;
+ if (log_overhead && cpuctx->mux_overhead.nr) {
+ cpuctx->mux_overhead.cpu = smp_processor_id();
+ perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
+ }
+
perf_pmu_enable(event->pmu);
}
@@ -3330,9 +3335,17 @@ static void rotate_ctx(struct perf_event_context *ctx)
list_rotate_left(&ctx->flexible_groups);
}
+static void
+perf_caculate_mux_overhead(struct perf_cpu_context *cpuctx, u64 time)
+{
+ cpuctx->mux_overhead.nr++;
+ cpuctx->mux_overhead.time += time;
+}
+
static int perf_rotate_context(struct perf_cpu_context *cpuctx)
{
struct perf_event_context *ctx = NULL;
+ u64 start_clock, end_clock;
int rotate = 0;
if (cpuctx->ctx.nr_events) {
@@ -3349,6 +3362,7 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
if (!rotate)
goto done;
+ start_clock = perf_clock();
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_pmu_disable(cpuctx->ctx.pmu);
@@ -3364,6 +3378,8 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ end_clock = perf_clock();
+ perf_caculate_mux_overhead(cpuctx, end_clock - start_clock);
done:
return rotate;
--
2.5.5
From: Kan Liang <[email protected]>
perf record writes all kinds of events to the perf.data file, which
brings overhead as well. Introduce PERF_USER_WRITE_OVERHEAD to save
the overhead information.
The overhead is measured as the CPU time taken by the write itself
(CLOCK_PROCESS_CPUTIME_ID), not the elapsed wall-clock time.
Calculate the total user write overhead on each CPU, and display it
in perf report.
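
In condensed form, the measurement added to record__write() below
looks like this; CLOCK_PROCESS_CPUTIME_ID is used so that time spent
scheduled out does not inflate the result:

    start = get_vnsecs();   /* CLOCK_PROCESS_CPUTIME_ID, in ns */
    perf_data_file__write(rec->session->file, bf, size);
    end = get_vnsecs();

    rec->overhead[cpu].nr++;
    rec->overhead[cpu].time += end - start;
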
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-record.c | 61 +++++++++++++++++++++++++++++++++++++++++++--
tools/perf/util/event.h | 9 +++++++
tools/perf/util/machine.c | 5 ++++
tools/perf/util/session.c | 4 +++
4 files changed, 77 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 67d2a90..492058e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -45,6 +45,11 @@
#include <asm/bug.h>
#include <linux/time64.h>
+struct write_overhead{
+ u64 nr;
+ u64 time;
+};
+
struct record {
struct perf_tool tool;
struct record_opts opts;
@@ -63,17 +68,40 @@ struct record {
bool timestamp_filename;
bool switch_output;
unsigned long long samples;
+ struct write_overhead overhead[MAX_NR_CPUS];
};
+static u64 get_vnsecs(void)
+{
+ struct timespec ts;
+
+ clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
+
+ return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+}
+
static int record__write(struct record *rec, void *bf, size_t size)
{
+ int cpu = sched_getcpu();
+ u64 start, end;
+ int ret = 0;
+
+ start = get_vnsecs();
if (perf_data_file__write(rec->session->file, bf, size) < 0) {
pr_err("failed to write perf data, error: %m\n");
- return -1;
+ ret = -1;
+ goto done;
}
rec->bytes_written += size;
- return 0;
+done:
+ end = get_vnsecs();
+ if (cpu >= 0) {
+ rec->overhead[cpu].nr++;
+ rec->overhead[cpu].time += (end - start);
+ }
+
+ return ret;
}
static int process_synthesized_event(struct perf_tool *tool,
@@ -813,6 +841,33 @@ static int record__synthesize(struct record *rec, bool tail)
return err;
}
+static void perf_event__synth_overhead(struct record *rec, perf_event__handler_t process)
+{
+ int cpu;
+
+ union perf_event event = {
+ .overhead = {
+ .header = {
+ .type = PERF_RECORD_USER_OVERHEAD,
+ .size = sizeof(struct perf_overhead),
+ },
+ },
+ };
+
+ event.overhead.type = PERF_USER_WRITE_OVERHEAD;
+
+ for (cpu = 0; cpu < MAX_NR_CPUS; cpu++) {
+ if (!rec->overhead[cpu].nr)
+ continue;
+
+ event.overhead.entry.cpu = cpu;
+ event.overhead.entry.nr = rec->overhead[cpu].nr;
+ event.overhead.entry.time = rec->overhead[cpu].time;
+
+ (void)process(&rec->tool, &event, NULL, NULL);
+ }
+}
+
static int __cmd_record(struct record *rec, int argc, const char **argv)
{
int err;
@@ -1074,6 +1129,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
goto out_child;
}
+ perf_event__synth_overhead(rec, process_synthesized_event);
+
if (!quiet)
fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 1ef1a9d..9927cf9 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -272,6 +272,9 @@ enum auxtrace_error_type {
* events on each CPU.
* The total number of accessing side-band events handler function is stored
* in [0], while the accumulated processing time is in [1].
+ * The total_user_write_overhead tells exactly the overhead to write data in
+ * perf record.
+ * The total write# is stored in [0], while the accumulated time is in [1].
*/
struct events_stats {
u64 total_period;
@@ -283,6 +286,7 @@ struct events_stats {
u64 total_nmi_overhead[MAX_NR_CPUS][2];
u64 total_mux_overhead[MAX_NR_CPUS][2];
u64 total_sb_overhead[MAX_NR_CPUS][2];
+ u64 total_user_write_overhead[MAX_NR_CPUS][2];
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
u32 nr_lost_warned;
@@ -493,6 +497,11 @@ struct time_conv_event {
u64 time_zero;
};
+enum perf_user_overhead_event_type { /* above any possible kernel type */
+ PERF_USER_OVERHEAD_TYPE_START = 100,
+ PERF_USER_WRITE_OVERHEAD = 100,
+};
+
struct perf_overhead {
struct perf_event_header header;
u32 type;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index d8cde21..ce7a0ea 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -573,6 +573,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
event->overhead.entry.nr,
event->overhead.entry.time,
event->overhead.entry.cpu);
+ } else if (event->overhead.type == PERF_USER_WRITE_OVERHEAD) {
+ dump_printf(" User write nr: %llu time: %llu cpu %u\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ event->overhead.entry.cpu);
} else {
dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 27a5c8a..a72992b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1227,6 +1227,10 @@ overhead_stats_update(struct perf_tool *tool,
evlist->stats.total_sb_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
evlist->stats.total_sb_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
break;
+ case PERF_USER_WRITE_OVERHEAD:
+ evlist->stats.total_user_write_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
+ evlist->stats.total_user_write_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
+ break;
default:
break;
}
--
2.5.5
From: Kan Liang <[email protected]>
get_nsecs will be used to get the time in the following patches.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-sched.c | 2 +-
tools/perf/builtin.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index fb34412..bf0d571e8 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -194,7 +194,7 @@ struct perf_sched {
struct perf_sched_map map;
};
-static u64 get_nsecs(void)
+u64 get_nsecs(void)
{
struct timespec ts;
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index 0bcf68e..a564e15 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -12,6 +12,7 @@ const char *help_unknown_cmd(const char *cmd);
void prune_packed_objects(int);
int read_line_with_nul(char *buf, int size, FILE *file);
int check_pager_config(const char *cmd);
+u64 get_nsecs(void);
int cmd_annotate(int argc, const char **argv, const char *prefix);
int cmd_bench(int argc, const char **argv, const char *prefix);
--
2.5.5
From: Kan Liang <[email protected]>
The user space perf tool brings overhead as well. Introduce
PERF_RECORD_USER_OVERHEAD to track the overhead information.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/event.c | 1 +
tools/perf/util/event.h | 1 +
tools/perf/util/session.c | 4 ++++
3 files changed, 6 insertions(+)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index ca98c4c..6cd43c9 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -48,6 +48,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_STAT_ROUND] = "STAT_ROUND",
[PERF_RECORD_EVENT_UPDATE] = "EVENT_UPDATE",
[PERF_RECORD_TIME_CONV] = "TIME_CONV",
+ [PERF_RECORD_USER_OVERHEAD] = "USER_OVERHEAD",
};
const char *perf_event__name(unsigned int id)
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 3357529..1ef1a9d 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -237,6 +237,7 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_STAT_ROUND = 77,
PERF_RECORD_EVENT_UPDATE = 78,
PERF_RECORD_TIME_CONV = 79,
+ PERF_RECORD_USER_OVERHEAD = 80,
PERF_RECORD_HEADER_MAX
};
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index e3aa9d7..27a5c8a 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -804,6 +804,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_STAT_ROUND] = perf_event__stat_round_swap,
[PERF_RECORD_EVENT_UPDATE] = perf_event__event_update_swap,
[PERF_RECORD_TIME_CONV] = perf_event__all64_swap,
+ [PERF_RECORD_USER_OVERHEAD] = perf_event__all64_swap,
[PERF_RECORD_HEADER_MAX] = NULL,
};
@@ -1382,6 +1383,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
case PERF_RECORD_TIME_CONV:
session->time_conv = event->time_conv;
return tool->time_conv(tool, event, session);
+ case PERF_RECORD_USER_OVERHEAD:
+ overhead_stats_update(tool, session->evlist, event);
+ return tool->overhead(tool, event, NULL, NULL);
default:
return -EINVAL;
}
--
2.5.5
From: Kan Liang <[email protected]>
Calculate the total side-band event handling overhead on each CPU,
and display it in perf report.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-report.c | 3 +++
tools/perf/util/event.h | 5 +++++
tools/perf/util/machine.c | 5 +++++
tools/perf/util/session.c | 4 ++++
4 files changed, 17 insertions(+)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 2515d7a..9c0a424 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -384,6 +384,9 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
fprintf(stdout, "#\t\tMultiplexing#: %" PRIu64 " time: %" PRIu64 " ns\n",
evlist->stats.total_mux_overhead[cpu][0],
evlist->stats.total_mux_overhead[cpu][1]);
+ fprintf(stdout, "#\t\tSB#: %" PRIu64 " time: %" PRIu64 " ns\n",
+ evlist->stats.total_sb_overhead[cpu][0],
+ evlist->stats.total_sb_overhead[cpu][1]);
}
fprintf(stdout, "#\n");
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 70e2508..3357529 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -267,6 +267,10 @@ enum auxtrace_error_type {
* The total NMI# is stored in [0], while the accumulated time is in [1].
* The total_mux_overhead tells exactly the Multiplexing overhead on each CPU.
* The total rotate# is stored in [0], while the accumulated time is in [1].
+ * The total_sb_overhead tells exactly the overhead to output side-band
+ * events on each CPU.
+ * The total number of accessing side-band events handler function is stored
+ * in [0], while the accumulated processing time is in [1].
*/
struct events_stats {
u64 total_period;
@@ -277,6 +281,7 @@ struct events_stats {
u64 total_invalid_chains;
u64 total_nmi_overhead[MAX_NR_CPUS][2];
u64 total_mux_overhead[MAX_NR_CPUS][2];
+ u64 total_sb_overhead[MAX_NR_CPUS][2];
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
u32 nr_lost_warned;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index eca1f8b..d8cde21 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -568,6 +568,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
event->overhead.entry.nr,
event->overhead.entry.time,
event->overhead.entry.cpu);
+ } else if (event->overhead.type == PERF_SB_OVERHEAD) {
+ dump_printf(" SB nr: %llu time: %llu cpu %u\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ event->overhead.entry.cpu);
} else {
dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 594fd5e..e3aa9d7 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1222,6 +1222,10 @@ overhead_stats_update(struct perf_tool *tool,
evlist->stats.total_mux_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
evlist->stats.total_mux_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
break;
+ case PERF_SB_OVERHEAD:
+ evlist->stats.total_sb_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
+ evlist->stats.total_sb_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
+ break;
default:
break;
}
--
2.5.5
From: Kan Liang <[email protected]>
Calculate the total multiplexing overhead on each CPU, and display it
in perf report.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-report.c | 8 ++++++--
tools/perf/util/event.h | 3 +++
tools/perf/util/machine.c | 5 +++++
tools/perf/util/session.c | 4 ++++
4 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index b1437586..2515d7a 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -377,9 +377,13 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
continue;
if (rep->cpu_list && !test_bit(cpu, rep->cpu_bitmap))
continue;
- fprintf(stdout, "#\tCPU %d: NMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
- cpu, evlist->stats.total_nmi_overhead[cpu][0],
+ fprintf(stdout, "#\tCPU %d\n", cpu);
+ fprintf(stdout, "#\t\tNMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
+ evlist->stats.total_nmi_overhead[cpu][0],
evlist->stats.total_nmi_overhead[cpu][1]);
+ fprintf(stdout, "#\t\tMultiplexing#: %" PRIu64 " time: %" PRIu64 " ns\n",
+ evlist->stats.total_mux_overhead[cpu][0],
+ evlist->stats.total_mux_overhead[cpu][1]);
}
fprintf(stdout, "#\n");
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 7d40d54..70e2508 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -265,6 +265,8 @@ enum auxtrace_error_type {
*
* The total_nmi_overhead tells exactly the NMI handler overhead on each CPU.
* The total NMI# is stored in [0], while the accumulated time is in [1].
+ * The total_mux_overhead tells exactly the Multiplexing overhead on each CPU.
+ * The total rotate# is stored in [0], while the accumulated time is in [1].
*/
struct events_stats {
u64 total_period;
@@ -274,6 +276,7 @@ struct events_stats {
u64 total_aux_lost;
u64 total_invalid_chains;
u64 total_nmi_overhead[MAX_NR_CPUS][2];
+ u64 total_mux_overhead[MAX_NR_CPUS][2];
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
u32 nr_lost_warned;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 58076f2..eca1f8b 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -563,6 +563,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
event->overhead.entry.nr,
event->overhead.entry.time,
event->overhead.entry.cpu);
+ } else if (event->overhead.type == PERF_MUX_OVERHEAD) {
+ dump_printf(" Multiplexing nr: %llu time: %llu cpu %u\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ event->overhead.entry.cpu);
} else {
dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a79ab99..594fd5e 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1218,6 +1218,10 @@ overhead_stats_update(struct perf_tool *tool,
evlist->stats.total_nmi_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
evlist->stats.total_nmi_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
break;
+ case PERF_MUX_OVERHEAD:
+ evlist->stats.total_mux_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
+ evlist->stats.total_mux_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
+ break;
default:
break;
}
--
2.5.5
From: Kan Liang <[email protected]>
Calculate the total NMI overhead on each CPU, and display it in perf
report.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-report.c | 11 +++++++++++
tools/perf/util/event.h | 4 ++++
tools/perf/util/machine.c | 9 +++++++++
tools/perf/util/session.c | 18 ++++++++++++++++++
4 files changed, 42 insertions(+)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 1416c39..b1437586 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -365,11 +365,22 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
struct report *rep,
const char *help)
{
+ struct perf_session *session = rep->session;
struct perf_evsel *pos;
+ int cpu;
fprintf(stdout, "#\n# Total Lost Samples: %" PRIu64 "\n#\n", evlist->stats.total_lost_samples);
if (symbol_conf.show_overhead) {
fprintf(stdout, "# Overhead:\n");
+ for (cpu = 0; cpu < session->header.env.nr_cpus_online; cpu++) {
+ if (!evlist->stats.total_nmi_overhead[cpu][0])
+ continue;
+ if (rep->cpu_list && !test_bit(cpu, rep->cpu_bitmap))
+ continue;
+ fprintf(stdout, "#\tCPU %d: NMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
+ cpu, evlist->stats.total_nmi_overhead[cpu][0],
+ evlist->stats.total_nmi_overhead[cpu][1]);
+ }
fprintf(stdout, "#\n");
}
evlist__for_each_entry(evlist, pos) {
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index d1b179b..7d40d54 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -262,6 +262,9 @@ enum auxtrace_error_type {
* multipling nr_events[PERF_EVENT_SAMPLE] by a frequency isn't possible to get
* the total number of low level events, it is necessary to to sum all struct
* sample_event.period and stash the result in total_period.
+ *
+ * The total_nmi_overhead tells exactly the NMI handler overhead on each CPU.
+ * The total NMI# is stored in [0], while the accumulated time is in [1].
*/
struct events_stats {
u64 total_period;
@@ -270,6 +273,7 @@ struct events_stats {
u64 total_lost_samples;
u64 total_aux_lost;
u64 total_invalid_chains;
+ u64 total_nmi_overhead[MAX_NR_CPUS][2];
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
u32 nr_lost_warned;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 1101757..58076f2 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -558,6 +558,15 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
int machine__process_overhead_event(struct machine *machine __maybe_unused,
union perf_event *event __maybe_unused)
{
+ if (event->overhead.type == PERF_NMI_OVERHEAD) {
+ dump_printf(" NMI nr: %llu time: %llu cpu %u\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ event->overhead.entry.cpu);
+ } else {
+ dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
+ }
+
return 0;
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index bc0bc21..a79ab99 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1207,6 +1207,23 @@ static int
&sample->read.one, machine);
}
+static void
+overhead_stats_update(struct perf_tool *tool,
+ struct perf_evlist *evlist,
+ union perf_event *event)
+{
+ if (tool->overhead == perf_event__process_overhead) {
+ switch (event->overhead.type) {
+ case PERF_NMI_OVERHEAD:
+ evlist->stats.total_nmi_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
+ evlist->stats.total_nmi_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
+ break;
+ default:
+ break;
+ }
+ }
+}
+
static int machines__deliver_event(struct machines *machines,
struct perf_evlist *evlist,
union perf_event *event,
@@ -1271,6 +1288,7 @@ static int machines__deliver_event(struct machines *machines,
case PERF_RECORD_SWITCH_CPU_WIDE:
return tool->context_switch(tool, event, sample, machine);
case PERF_RECORD_OVERHEAD:
+ overhead_stats_update(tool, evlist, event);
return tool->overhead(tool, event, sample, machine);
default:
++evlist->stats.nr_unknown_events;
--
2.5.5
From: Kan Liang <[email protected]>
Add the infrastructure to handle the PERF_RECORD_OVERHEAD record
type. A new perf report option is also introduced as a knob to show
the overhead information.
Signed-off-by: Kan Liang <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 33 ++++++++++++++++++++++++++++++++
tools/perf/Documentation/perf-report.txt | 3 +++
tools/perf/builtin-report.c | 6 ++++++
tools/perf/util/event.c | 9 +++++++++
tools/perf/util/event.h | 11 +++++++++++
tools/perf/util/machine.c | 8 ++++++++
tools/perf/util/machine.h | 2 ++
tools/perf/util/session.c | 5 +++++
tools/perf/util/symbol.h | 3 ++-
tools/perf/util/tool.h | 1 +
10 files changed, 80 insertions(+), 1 deletion(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index c66a485..5e7c522 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -862,6 +862,17 @@ enum perf_event_type {
*/
PERF_RECORD_SWITCH_CPU_WIDE = 15,
+ /*
+ * Records perf overhead
+ * struct {
+ * struct perf_event_header header;
+ * u32 type;
+ * struct perf_overhead_entry entry;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_OVERHEAD = 16,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -980,4 +991,26 @@ struct perf_branch_entry {
reserved:44;
};
+enum perf_record_overhead_type {
+ PERF_NMI_OVERHEAD = 0,
+ PERF_MUX_OVERHEAD,
+ PERF_SB_OVERHEAD,
+
+ PERF_OVERHEAD_MAX,
+};
+
+/*
+ * single overhead record layout:
+ *
+ * cpu: The CPU on which the overhead occurred
+ * nr: Number of times the overhead happened,
+ * e.g. for NMI, nr == number of times the NMI handler was called.
+ * time: Total overhead cost (ns)
+ */
+struct perf_overhead_entry {
+ __u32 cpu;
+ __u64 nr;
+ __u64 time;
+};
+
#endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 2d17462..fea8bea 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -412,6 +412,9 @@ include::itrace.txt[]
--hierarchy::
Enable hierarchical output.
+--show-overhead::
+ Show the extra overhead which perf brings during monitoring.
+
include::callchain-overhead-calculation.txt[]
SEE ALSO
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 3dfbfff..1416c39 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -368,6 +368,10 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
struct perf_evsel *pos;
fprintf(stdout, "#\n# Total Lost Samples: %" PRIu64 "\n#\n", evlist->stats.total_lost_samples);
+ if (symbol_conf.show_overhead) {
+ fprintf(stdout, "# Overhead:\n");
+ fprintf(stdout, "#\n");
+ }
evlist__for_each_entry(evlist, pos) {
struct hists *hists = evsel__hists(pos);
const char *evname = perf_evsel__name(pos);
@@ -830,6 +834,8 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
OPT_CALLBACK_DEFAULT(0, "stdio-color", NULL, "mode",
"'always' (default), 'never' or 'auto' only applicable to --stdio mode",
stdio__config_color, "always"),
+ OPT_BOOLEAN(0, "show-overhead", &symbol_conf.show_overhead,
+ "Show perf overhead"),
OPT_END()
};
struct perf_data_file file = {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 8ab0d7d..ca98c4c 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -31,6 +31,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_LOST_SAMPLES] = "LOST_SAMPLES",
[PERF_RECORD_SWITCH] = "SWITCH",
[PERF_RECORD_SWITCH_CPU_WIDE] = "SWITCH_CPU_WIDE",
+ [PERF_RECORD_OVERHEAD] = "OVERHEAD",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
@@ -1056,6 +1057,14 @@ int perf_event__process_switch(struct perf_tool *tool __maybe_unused,
return machine__process_switch_event(machine, event);
}
+int perf_event__process_overhead(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused,
+ struct machine *machine)
+{
+ return machine__process_overhead_event(machine, event);
+}
+
size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
{
return fprintf(fp, " %d/%d: [%#" PRIx64 "(%#" PRIx64 ") @ %#" PRIx64 "]: %c %s\n",
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index c735c53..d1b179b 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -480,6 +480,12 @@ struct time_conv_event {
u64 time_zero;
};
+struct perf_overhead {
+ struct perf_event_header header;
+ u32 type;
+ struct perf_overhead_entry entry;
+};
+
union perf_event {
struct perf_event_header header;
struct mmap_event mmap;
@@ -509,6 +515,7 @@ union perf_event {
struct stat_event stat;
struct stat_round_event stat_round;
struct time_conv_event time_conv;
+ struct perf_overhead overhead;
};
void perf_event__print_totals(void);
@@ -587,6 +594,10 @@ int perf_event__process_switch(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
+int perf_event__process_overhead(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine);
int perf_event__process_mmap(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 9b33bef..1101757 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -555,6 +555,12 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
return 0;
}
+int machine__process_overhead_event(struct machine *machine __maybe_unused,
+ union perf_event *event __maybe_unused)
+{
+ return 0;
+}
+
static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
{
const char *dup_filename;
@@ -1536,6 +1542,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
case PERF_RECORD_SWITCH:
case PERF_RECORD_SWITCH_CPU_WIDE:
ret = machine__process_switch_event(machine, event); break;
+ case PERF_RECORD_OVERHEAD:
+ ret = machine__process_overhead_event(machine, event); break;
default:
ret = -1;
break;
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 354de6e..ec2dd4d 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -97,6 +97,8 @@ int machine__process_itrace_start_event(struct machine *machine,
union perf_event *event);
int machine__process_switch_event(struct machine *machine,
union perf_event *event);
+int machine__process_overhead_event(struct machine *machine,
+ union perf_event *event);
int machine__process_mmap_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
int machine__process_mmap2_event(struct machine *machine, union perf_event *event,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f268201..bc0bc21 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -373,6 +373,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->itrace_start = perf_event__process_itrace_start;
if (tool->context_switch == NULL)
tool->context_switch = perf_event__process_switch;
+ if (tool->overhead == NULL)
+ tool->overhead = perf_event__process_overhead;
if (tool->read == NULL)
tool->read = process_event_sample_stub;
if (tool->throttle == NULL)
@@ -786,6 +788,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_LOST_SAMPLES] = perf_event__all64_swap,
[PERF_RECORD_SWITCH] = perf_event__switch_swap,
[PERF_RECORD_SWITCH_CPU_WIDE] = perf_event__switch_swap,
+ [PERF_RECORD_OVERHEAD] = perf_event__all64_swap,
[PERF_RECORD_HEADER_ATTR] = perf_event__hdr_attr_swap,
[PERF_RECORD_HEADER_EVENT_TYPE] = perf_event__event_type_swap,
[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1267,6 +1270,8 @@ static int machines__deliver_event(struct machines *machines,
case PERF_RECORD_SWITCH:
case PERF_RECORD_SWITCH_CPU_WIDE:
return tool->context_switch(tool, event, sample, machine);
+ case PERF_RECORD_OVERHEAD:
+ return tool->overhead(tool, event, sample, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 2d0a905..2d96bdb 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -117,7 +117,8 @@ struct symbol_conf {
show_ref_callgraph,
hide_unresolved,
raw_trace,
- report_hierarchy;
+ report_hierarchy,
+ show_overhead;
const char *vmlinux_name,
*kallsyms_name,
*source_prefix,
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index ac2590a..c5bbb34 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -47,6 +47,7 @@ struct perf_tool {
aux,
itrace_start,
context_switch,
+ overhead,
throttle,
unthrottle;
event_attr_op attr;
--
2.5.5
From: Kan Liang <[email protected]>
Iterating over all events which need to receive side-band events also
brings some overhead.
Save the overhead information in the task context or the CPU context,
whichever is available.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 32 ++++++++++++++++++++++++++++----
3 files changed, 31 insertions(+), 4 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f72b97a..ec3cb7f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -764,6 +764,8 @@ struct perf_event_context {
#endif
void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
+
+ struct perf_overhead_entry sb_overhead;
};
/*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9124c7c..5e7c522 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -994,6 +994,7 @@ struct perf_branch_entry {
enum perf_record_overhead_type {
PERF_NMI_OVERHEAD = 0,
PERF_MUX_OVERHEAD,
+ PERF_SB_OVERHEAD,
PERF_OVERHEAD_MAX,
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9934059..51e9df7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1829,9 +1829,15 @@ event_sched_out(struct perf_event *event,
if (event->attr.exclusive || !cpuctx->active_oncpu)
cpuctx->exclusive = 0;
- if (log_overhead && cpuctx->mux_overhead.nr) {
- cpuctx->mux_overhead.cpu = smp_processor_id();
- perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
+ if (log_overhead) {
+ if (cpuctx->mux_overhead.nr) {
+ cpuctx->mux_overhead.cpu = smp_processor_id();
+ perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
+ }
+ if (ctx->sb_overhead.nr) {
+ ctx->sb_overhead.cpu = smp_processor_id();
+ perf_log_overhead(event, PERF_SB_OVERHEAD, &ctx->sb_overhead);
+ }
}
perf_pmu_enable(event->pmu);
@@ -6133,6 +6139,14 @@ static void perf_iterate_sb_cpu(perf_iterate_f output, void *data)
}
}
+static void
+perf_caculate_sb_overhead(struct perf_event_context *ctx,
+ u64 time)
+{
+ ctx->sb_overhead.nr++;
+ ctx->sb_overhead.time += time;
+}
+
/*
* Iterate all events that need to receive side-band events.
*
@@ -6143,9 +6157,12 @@ static void
perf_iterate_sb(perf_iterate_f output, void *data,
struct perf_event_context *task_ctx)
{
+ struct perf_event_context *overhead_ctx = task_ctx;
struct perf_event_context *ctx;
+ u64 start_clock, end_clock;
int ctxn;
+ start_clock = perf_clock();
rcu_read_lock();
preempt_disable();
@@ -6163,12 +6180,19 @@ perf_iterate_sb(perf_iterate_f output, void *data,
for_each_task_context_nr(ctxn) {
ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (ctx)
+ if (ctx) {
perf_iterate_ctx(ctx, output, data, false);
+ if (!overhead_ctx)
+ overhead_ctx = ctx;
+ }
}
done:
preempt_enable();
rcu_read_unlock();
+
+ end_clock = perf_clock();
+ if (overhead_ctx)
+ perf_caculate_sb_overhead(overhead_ctx, end_clock - start_clock);
}
/*
--
2.5.5
From: Kan Liang <[email protected]>
A new perf record type is introduced to export perf overhead information
to userspace, so the overhead of sampling can be measured directly.
To output the overhead information, it takes advantage of the existing
event log mechanism. But the overhead information is the system
overhead, not per-event overhead.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 4 ++++
include/uapi/linux/perf_event.h | 29 +++++++++++++++++++++++++++
kernel/events/core.c | 43 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 76 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4741ecd..3a530a7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -998,6 +998,10 @@ perf_event__output_id_sample(struct perf_event *event,
extern void
perf_log_lost_samples(struct perf_event *event, u64 lost);
+extern void
+perf_log_overhead(struct perf_event *event, u32 type,
+ struct perf_overhead_entry *entry);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c66a485..ae571be 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -862,6 +862,17 @@ enum perf_event_type {
*/
PERF_RECORD_SWITCH_CPU_WIDE = 15,
+ /*
+ * Records perf overhead
+ * struct {
+ * struct perf_event_header header;
+ * u32 type;
+ * struct perf_overhead_entry entry;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_OVERHEAD = 16,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -980,4 +991,22 @@ struct perf_branch_entry {
reserved:44;
};
+enum perf_record_overhead_type {
+ PERF_OVERHEAD_MAX,
+};
+
+/*
+ * single overhead record layout:
+ *
+ * cpu: The cpu which overhead occues
+ * nr: Times of overhead happens.
+ * E.g. for NMI, nr == times of NMI handler are called.
+ * time: Total overhead cost(ns)
+ */
+struct perf_overhead_entry {
+ __u32 cpu;
+ __u64 nr;
+ __u64 time;
+};
+
#endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ff230bb..590c399 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7037,6 +7037,49 @@ static void perf_log_itrace_start(struct perf_event *event)
perf_output_end(&handle);
}
+
+/*
+ * Record overhead logging
+ *
+ * The overhead logged here is the system overhead, not per-event overhead.
+ * This function only take advantage of the existing event log mechanism
+ * to log the overhead record.
+ *
+ */
+void perf_log_overhead(struct perf_event *event, u32 type,
+ struct perf_overhead_entry *entry)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ struct {
+ struct perf_event_header header;
+ u32 type;
+ struct perf_overhead_entry overhead;
+ } overhead_event = {
+ .header = {
+ .type = PERF_RECORD_OVERHEAD,
+ .misc = 0,
+ .size = sizeof(overhead_event),
+ },
+ };
+ overhead_event.type = type;
+ memcpy(&overhead_event.overhead, entry, sizeof(overhead_event.overhead));
+
+ perf_event_header__init_id(&overhead_event.header, &sample, event);
+ ret = perf_output_begin(&handle, event, overhead_event.header.size);
+
+ if (ret)
+ return;
+
+ perf_output_put(&handle, overhead_event);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+ memset(entry, 0, sizeof(*entry));
+}
+
/*
* Generic event overflow handling, sampling.
*/
--
2.5.5
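As a rough illustration of the record layout documented above, here is a
reader-side sketch; the struct mirror and the helper are hypothetical,
not part of the patches:
#include <stdio.h>
#include <linux/perf_event.h>
struct overhead_record {
	struct perf_event_header	header;	/* .type == PERF_RECORD_OVERHEAD */
	__u32				type;	/* enum perf_record_overhead_type */
	struct perf_overhead_entry	entry;	/* cpu, nr, total time in ns */
	/* struct sample_id follows, as documented in the uapi comment */
};
static void handle_overhead(const struct overhead_record *rec)
{
	printf("overhead type %u: cpu %u nr %llu time %llu ns\n",
	       rec->type, rec->entry.cpu,
	       (unsigned long long)rec->entry.nr,
	       (unsigned long long)rec->entry.time);
}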
On Wed, Nov 23, 2016 at 04:44:40AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> NMI handler is one of the most important part which brings overhead.
>
> There are lots of NMI during sampling. It's very expensive to log each
> NMI. So the accumulated time and NMI# will be output when event is going
> to be disabled or task is scheduling out.
> The newly introduced flag PERF_EF_LOG indicate to output the overhead
> log.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/events/core.c | 19 ++++++++++++++-
> arch/x86/events/perf_event.h | 2 ++
> include/linux/perf_event.h | 1 +
> include/uapi/linux/perf_event.h | 2 ++
> kernel/events/core.c | 54 ++++++++++++++++++++++-------------------
> 5 files changed, 52 insertions(+), 26 deletions(-)
The subject says this patch is x86 specific, but most changes are in
core code.
On Wed, Nov 23, 2016 at 04:44:41AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Multiplexing overhead is one of the key overhead when the number of
> events is more than available counters.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> include/linux/perf_event.h | 2 ++
> include/uapi/linux/perf_event.h | 1 +
> kernel/events/core.c | 16 ++++++++++++++++
> 3 files changed, 19 insertions(+)
The subject says x86 specific, but it's _all_ core code.
On Wed, Nov 23, 2016 at 04:44:42AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Iterating all events which need to receive side-band events also bring
> some overhead.
> Save the overhead information in task context or CPU context, whichever
> context is available.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> include/linux/perf_event.h | 2 ++
> include/uapi/linux/perf_event.h | 1 +
> kernel/events/core.c | 32 ++++++++++++++++++++++++++++----
> 3 files changed, 31 insertions(+), 4 deletions(-)
Guess?
>
> On Wed, Nov 23, 2016 at 04:44:41AM -0500, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > Multiplexing overhead is one of the key overhead when the number of
> > events is more than available counters.
> >
> > Signed-off-by: Kan Liang <[email protected]>
> > ---
> > include/linux/perf_event.h | 2 ++
> > include/uapi/linux/perf_event.h | 1 +
> > kernel/events/core.c | 16 ++++++++++++++++
> > 3 files changed, 19 insertions(+)
>
> The subject says x86 specific, but its _all_ core code.
Oh, sorry. I will change the subject in the next version.
Thanks,
Kan
On Wed, Nov 23, 2016 at 04:44:39AM -0500, [email protected] wrote:
> +struct perf_overhead_entry {
> + __u32 cpu;
> + __u64 nr;
> + __u64 time;
> +};
> +void perf_log_overhead(struct perf_event *event, u32 type,
> + struct perf_overhead_entry *entry)
> +{
> + struct perf_output_handle handle;
> + struct perf_sample_data sample;
> + int ret;
> +
> + struct {
> + struct perf_event_header header;
> + u32 type;
There are two u32-sized holes here.
> + struct perf_overhead_entry overhead;
> + } overhead_event = {
> + .header = {
> + .type = PERF_RECORD_OVERHEAD,
> + .misc = 0,
> + .size = sizeof(overhead_event),
> + },
.type = type,
	.overhead = *entry,
> + };
> + overhead_event.type = type;
> + memcpy(&overhead_event.overhead, entry, sizeof(overhead_event.overhead));
Do the above named initialization instead.
> +
> + perf_event_header__init_id(&overhead_event.header, &sample, event);
> + ret = perf_output_begin(&handle, event, overhead_event.header.size);
> +
> + if (ret)
> + return;
> +
> + perf_output_put(&handle, overhead_event);
> + perf_event__output_id_sample(event, &handle, &sample);
> +
> + perf_output_end(&handle);
> + memset(entry, 0, sizeof(*entry));
> +}
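For illustration, the named initialization Peter asks for could look
roughly like this (a sketch, not the final patch; the two u32-sized
holes are marked):
	struct {
		struct perf_event_header	header;
		u32				type;
		/* 4-byte hole: perf_overhead_entry is u64-aligned */
		struct perf_overhead_entry	overhead;
		/* the second hole is inside the entry, after 'cpu' */
	} overhead_event = {
		.header = {
			.type = PERF_RECORD_OVERHEAD,
			.misc = 0,
			.size = sizeof(overhead_event),
		},
		.type	  = type,
		.overhead = *entry,	/* replaces the assignment + memcpy() */
	};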
On Wed, Nov 23, 2016 at 04:44:39AM -0500, [email protected] wrote:
> +/*
> + * single overhead record layout:
> + *
> + * cpu: The cpu which overhead occues
This is duplicate information, its already present in sample_id when
PERF_SAMPLE_CPU, and without that we don't care.
> + * nr: Times of overhead happens.
> + * E.g. for NMI, nr == times of NMI handler are called.
> + * time: Total overhead cost(ns)
> + */
> +struct perf_overhead_entry {
> + __u32 cpu;
> + __u64 nr;
> + __u64 time;
> +};
[email protected] writes:
> From: Kan Liang <[email protected]>
>
> The rough overhead rate can be calculated by the sum of all kinds of
> overhead / elapsed time.
> If the overhead rate is higher than 10%, warn the user.
Thinking about this more: this is comparing the cost of a single
CPU to the total wall clock time. This isn't very good and can
give confusing results with many cores.
Perhaps we need two separate metrics here:
- cost of perf record on its CPU (or later across multiple CPUs, if it
  gets multi-threaded). Warn if this is >50% or so.
- average perf collection overhead on a CPU. The 10% threshold here
  seems appropriate.
-Andi
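A minimal sketch of those two checks as perf report might do them; the
thresholds come from the mail, the helper and its inputs are hypothetical:
#include <stdio.h>
static void warn_on_overhead(double record_cpu_cost,
			     double avg_cpu_overhead_rate)
{
	/* cost of perf record itself on its CPU(s) */
	if (record_cpu_cost > 0.50)
		fprintf(stderr, "Warning: perf record used more than 50%% "
			"of its CPU\n");
	/* average perf collection overhead on a CPU */
	if (avg_cpu_overhead_rate > 0.10)
		fprintf(stderr, "Warning: average perf overhead per CPU "
			"exceeds 10%%\n");
}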
>
> [email protected] writes:
>
> > From: Kan Liang <[email protected]>
> >
> > The rough overhead rate can be caculated by the sum of all kinds of
> > overhead / elapsed time.
> > If the overhead rate is higher than 10%, warning the user.
>
> Thinking about this more: this is comparing the cost of a single CPU to the
> total wall clock time. This isn't very good and can give confusing results with
> many cores.
>
> Perhaps we need two separate metrics here:
>
> - cost of perf record on its CPU (or later on if it gets multi threaded
> more multiple). Warn if this is >50% or so.
What's the formula for the cost of perf record on its CPU?
Does the cost include only user-space overhead, or all overhead?
What is the divisor?
> - average perf collection overhead on a CPU. The 10% threshold here
> seems appropiate.
For the average, do you mean adding all overheads among CPUs together
and dividing by the number of CPUs?
To calculate the rate, the divisor is wall clock time, right?
Thanks,
Kan
On Wed, Nov 23, 2016 at 04:44:43AM -0500, [email protected] wrote:
SNIP
> +
> static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
> {
> const char *dup_filename;
> @@ -1536,6 +1542,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
> case PERF_RECORD_SWITCH:
> case PERF_RECORD_SWITCH_CPU_WIDE:
> ret = machine__process_switch_event(machine, event); break;
> + case PERF_RECORD_OVERHEAD:
> + ret = machine__process_overhead_event(machine, event); break;
missing breaks
jirka
> default:
> ret = -1;
> break;
SNIP
On Wed, Nov 23, 2016 at 04:44:44AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Calculate the total NMI overhead on each CPU, and display them in perf
> report
so the output looks like this:
---
# Elapsed time: 1720167944 ns
# Overhead:
# CPU 6
# NMI#: 27 time: 111379 ns
# Multiplexing#: 0 time: 0 ns
# SB#: 57 time: 90045 ns
#
# Samples: 26 of event 'cycles:u'
# Event count (approx.): 1677531
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ .......................
#
24.20% ls ls [.] _init
17.18% ls libc-2.24.so [.] __strcoll_l
11.85% ls ld-2.24.so [.] _dl_relocate_object
---
few things:
- I wonder if we want to put this overhead output separately from the
main perf output.. this scales badly with bigger CPU counts
- we might want to call it some other way, because we already
use 'overhead' for the event count %
- how about TUI output? ;-) I don't think it's necessary, however
currently 'perf report --show-overhead' does not show anything
if TUI is the default output, unless you use the --stdio option
thanks,
jirka
On Wed, Nov 23, 2016 at 04:44:44AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Calculate the total NMI overhead on each CPU, and display them in perf
> report
please put example output into the changelog
thanks,
jirka
On Wed, Nov 23, 2016 at 04:44:44AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Calculate the total NMI overhead on each CPU, and display them in perf
> report
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/builtin-report.c | 11 +++++++++++
> tools/perf/util/event.h | 4 ++++
> tools/perf/util/machine.c | 9 +++++++++
> tools/perf/util/session.c | 18 ++++++++++++++++++
> 4 files changed, 42 insertions(+)
>
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index 1416c39..b1437586 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -365,11 +365,22 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
> struct report *rep,
> const char *help)
> {
> + struct perf_session *session = rep->session;
> struct perf_evsel *pos;
> + int cpu;
>
> fprintf(stdout, "#\n# Total Lost Samples: %" PRIu64 "\n#\n", evlist->stats.total_lost_samples);
> if (symbol_conf.show_overhead) {
> fprintf(stdout, "# Overhead:\n");
> + for (cpu = 0; cpu < session->header.env.nr_cpus_online; cpu++) {
> + if (!evlist->stats.total_nmi_overhead[cpu][0])
> + continue;
> + if (rep->cpu_list && !test_bit(cpu, rep->cpu_bitmap))
> + continue;
> + fprintf(stdout, "#\tCPU %d: NMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
> + cpu, evlist->stats.total_nmi_overhead[cpu][0],
> + evlist->stats.total_nmi_overhead[cpu][1]);
> + }
> fprintf(stdout, "#\n");
> }
> evlist__for_each_entry(evlist, pos) {
> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
> index d1b179b..7d40d54 100644
> --- a/tools/perf/util/event.h
> +++ b/tools/perf/util/event.h
> @@ -262,6 +262,9 @@ enum auxtrace_error_type {
> * multipling nr_events[PERF_EVENT_SAMPLE] by a frequency isn't possible to get
> * the total number of low level events, it is necessary to to sum all struct
> * sample_event.period and stash the result in total_period.
> + *
> + * The total_nmi_overhead tells exactly the NMI handler overhead on each CPU.
> + * The total NMI# is stored in [0], while the accumulated time is in [1].
> */
hum, why can't this be stored in the struct instead..?
thanks,
jirka
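A sketch of what Jirka suggests, reusing struct perf_overhead_entry from
the kernel patch instead of the anonymous [2] array (assuming the
existing events_stats and MAX_NR_CPUS):
struct events_stats {
	/* ... existing fields ... */
	struct perf_overhead_entry total_nmi_overhead[MAX_NR_CPUS];
};
/* and the report side then reads .nr/.time instead of [0]/[1]: */
fprintf(stdout, "#\tCPU %d: NMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
	cpu, evlist->stats.total_nmi_overhead[cpu].nr,
	evlist->stats.total_nmi_overhead[cpu].time);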
On Wed, Nov 23, 2016 at 11:35:59PM +0100, Jiri Olsa wrote:
> On Wed, Nov 23, 2016 at 04:44:43AM -0500, [email protected] wrote:
>
> SNIP
>
> > +
> > static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
> > {
> > const char *dup_filename;
> > @@ -1536,6 +1542,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
> > case PERF_RECORD_SWITCH:
> > case PERF_RECORD_SWITCH_CPU_WIDE:
> > ret = machine__process_switch_event(machine, event); break;
> > + case PERF_RECORD_OVERHEAD:
> > + ret = machine__process_overhead_event(machine, event); break;
>
> missing breaks
ugh.. im blind.. sry :-\
jirka
On Wed, Nov 23, 2016 at 04:44:49AM -0500, [email protected] wrote:
SNIP
>
> +static void perf_event__synth_overhead(struct record *rec, perf_event__handler_t process)
> +{
> + int cpu;
> +
> + union perf_event event = {
> + .overhead = {
> + .header = {
> + .type = PERF_RECORD_USER_OVERHEAD,
> + .size = sizeof(struct perf_overhead),
> + },
> + },
> + };
> +
> + event.overhead.type = PERF_USER_WRITE_OVERHEAD;
> +
> + for (cpu = 0; cpu < MAX_NR_CPUS; cpu++) {
> + if (!rec->overhead[cpu].nr)
> + continue;
> +
> + event.overhead.entry.cpu = cpu;
> + event.overhead.entry.nr = rec->overhead[cpu].nr;
> + event.overhead.entry.time = rec->overhead[cpu].time;
> +
> + (void)process(&rec->tool, &event, NULL, NULL);
please make perf_event__synth_overhead return int and check the above 'process' return value
jirka
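A sketch of the quoted function with that change applied (types and
fields as in the patch set):
static int perf_event__synth_overhead(struct record *rec,
				      perf_event__handler_t process)
{
	int cpu, err;
	union perf_event event = {
		.overhead = {
			.header = {
				.type = PERF_RECORD_USER_OVERHEAD,
				.size = sizeof(struct perf_overhead),
			},
		},
	};
	event.overhead.type = PERF_USER_WRITE_OVERHEAD;
	for (cpu = 0; cpu < MAX_NR_CPUS; cpu++) {
		if (!rec->overhead[cpu].nr)
			continue;
		event.overhead.entry.cpu = cpu;
		event.overhead.entry.nr = rec->overhead[cpu].nr;
		event.overhead.entry.time = rec->overhead[cpu].time;
		/* propagate the callback's error instead of ignoring it */
		err = process(&rec->tool, &event, NULL, NULL);
		if (err)
			return err;
	}
	return 0;
}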
On Wed, Nov 23, 2016 at 04:44:49AM -0500, [email protected] wrote:
SNIP
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index d8cde21..ce7a0ea 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -573,6 +573,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
> event->overhead.entry.nr,
> event->overhead.entry.time,
> event->overhead.entry.cpu);
> + } else if (event->overhead.type == PERF_USER_WRITE_OVERHEAD) {
> + dump_printf(" User write nr: %llu time: %llu cpu %u\n",
> + event->overhead.entry.nr,
> + event->overhead.entry.time,
> + event->overhead.entry.cpu);
should this rather be in the switch?
jirka
On Wed, Nov 23, 2016 at 04:44:52AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Introduce a new option --show-overhead to show overhead events in perf
> script
perf 7356 [001] 7292.203517: 482010 cycles:pp: ffffffff818e2150 _raw_spin_unlock_irqrestore+0x40 (/lib/modules/4.9.0-rc1+/build/vmlinux)
PERF_RECORD_USER_OVERHEAD [USER WRITE] nr: 1995 time: 14790661 cpu 1
PERF_RECORD_USER_OVERHEAD [ELAPSED TIME] time: 2721649901
perf 7356 [001] 7292.203766: 482076 cycles:pp: ffffffff81117f17 lock_release+0x37 (/lib/modules/4.9.0-rc1+/build/vmlinux)
I guess those 2 overhead events don't have context..?
should we make sure they are the last events?
thanks,
jirka
On Wed, Nov 23, 2016 at 04:44:52AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Introduce a new option --show-overhead to show overhead events in perf
> script
please add example output into the changelog
thanks,
jirka
On Wed, Nov 23, 2016 at 04:44:52AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Introduce a new option --show-overhead to show overhead events in perf
> script
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/builtin-script.c | 36 ++++++++++++++++++++++++++++++++++++
> tools/perf/util/event.c | 37 +++++++++++++++++++++++++++++++++++++
> tools/perf/util/event.h | 1 +
> 3 files changed, 74 insertions(+)
>
> diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
> index e1daff3..76d9747 100644
> --- a/tools/perf/builtin-script.c
> +++ b/tools/perf/builtin-script.c
> @@ -829,6 +829,7 @@ struct perf_script {
> bool show_task_events;
> bool show_mmap_events;
> bool show_switch_events;
> + bool show_overhead;
> bool allocated;
> struct cpu_map *cpus;
> struct thread_map *threads;
> @@ -1264,6 +1265,37 @@ static int process_switch_event(struct perf_tool *tool,
> return 0;
> }
>
> +static int process_overhead_event(struct perf_tool *tool,
> + union perf_event *event,
> + struct perf_sample *sample,
> + struct machine *machine)
> +{
> + struct thread *thread;
> + struct perf_script *script = container_of(tool, struct perf_script, tool);
> + struct perf_session *session = script->session;
> + struct perf_evsel *evsel;
> +
> + if (perf_event__process_switch(tool, event, sample, machine) < 0)
> + return -1;
process_switch event? copy&paste error?
jirka
> + if (sample) {
> + evsel = perf_evlist__id2evsel(session->evlist, sample->id);
> + thread = machine__findnew_thread(machine, sample->pid, sample->tid);
> + if (thread == NULL) {
> + pr_debug("problem processing OVERHEAD event, skipping it.\n");
> + return -1;
> + }
> +
> + print_sample_start(sample, thread, evsel);
> + perf_event__fprintf(event, stdout);
> + thread__put(thread);
> + } else {
> + /* USER OVERHEAD event */
> + perf_event__fprintf(event, stdout);
> + }
> +
> + return 0;
> +}
> +
SNIP
On Wed, Nov 23, 2016 at 04:44:52AM -0500, [email protected] wrote:
SNIP
> +}
> +
> static void sig_handler(int sig __maybe_unused)
> {
> session_done = 1;
> @@ -1287,6 +1319,8 @@ static int __cmd_script(struct perf_script *script)
> }
> if (script->show_switch_events)
> script->tool.context_switch = process_switch_event;
> + if (script->show_overhead)
> + script->tool.overhead = process_overhead_event;
>
> ret = perf_session__process_events(script->session);
>
> @@ -2172,6 +2206,8 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
> "Show the mmap events"),
> OPT_BOOLEAN('\0', "show-switch-events", &script.show_switch_events,
> "Show context switch events (if recorded)"),
> + OPT_BOOLEAN('\0', "show-overhead", &script.show_overhead,
> + "Show overhead events"),
please add the '-events' suffix, as for the mmap and task options
thanks,
jirka
On Wed, Nov 23, 2016 at 04:44:39AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> A new perf record is introduced to export perf overhead information to
> userspace. So we can measure the overhead of sampling directly.
>
> To output the overhead information, it takes advantage of the existing
> event log mechanism. But the overhead information is the system
> overhead, not per-event overhead.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> include/linux/perf_event.h | 4 ++++
> include/uapi/linux/perf_event.h | 29 +++++++++++++++++++++++++++
> kernel/events/core.c | 43 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 76 insertions(+)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 4741ecd..3a530a7 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -998,6 +998,10 @@ perf_event__output_id_sample(struct perf_event *event,
> extern void
> perf_log_lost_samples(struct perf_event *event, u64 lost);
>
> +extern void
> +perf_log_overhead(struct perf_event *event, u32 type,
> + struct perf_overhead_entry *entry);
> +
> static inline bool is_sampling_event(struct perf_event *event)
> {
> return event->attr.sample_period != 0;
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index c66a485..ae571be 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -862,6 +862,17 @@ enum perf_event_type {
> */
> PERF_RECORD_SWITCH_CPU_WIDE = 15,
>
> + /*
> + * Records perf overhead
> + * struct {
> + * struct perf_event_header header;
> + * u32 type;
> + * struct perf_overhead_entry entry;
> + * struct sample_id sample_id;
> + * };
> + */
> + PERF_RECORD_OVERHEAD = 16,
> +
> PERF_RECORD_MAX, /* non-ABI */
> };
I think we should make this optional/configurable like the
rest of the aux events, like below..
jirka
---
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 5e7c52278ef0..853a919bd9ae 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -344,7 +344,8 @@ struct perf_event_attr {
use_clockid : 1, /* use @clockid for time fields */
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end to beginning */
- __reserved_1 : 36;
+ overhead : 1,
+ __reserved_1 : 35;
union {
__u32 wakeup_events; /* wakeup every n events */
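If the proposed attr.overhead bit were adopted, a tool might opt in like
this (hypothetical usage sketch; the bit does not exist in mainline
headers):
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}
int open_cycles_with_overhead(pid_t pid, int cpu)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_HARDWARE,
		.config		= PERF_COUNT_HW_CPU_CYCLES,
		.size		= sizeof(attr),
		.sample_period	= 100000,
		.overhead	= 1,	/* proposed bit: request overhead records */
	};
	return sys_perf_event_open(&attr, pid, cpu, -1, 0);
}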
* [email protected] <[email protected]> wrote:
> From: Kan Liang <[email protected]>
>
> Profiling brings additional overhead. High overhead may impacts the
> behavior of the profiling object, impacts the accuracy of the
> profiling result, and even hang the system.
> Currently, perf has dynamic interrupt throttle mechanism to lower the
> sample rate and overhead. But it has limitations.
> - The mechanism only focus in the overhead from NMI. However, there
> are other parts which bring big overhead. E.g, multiplexing.
> - The hint from the mechanism doesn't work on fixed period.
> - The system changes which caused by the mechanism are not recorded
> in the perf.data. Users have no idea about the overhead and its
> impact.
> Acctually, any passive ways like dynamic interrupt throttle mechanism
> are only palliative. The best way is to export overheads information,
> provide more hints, and help the users design more proper perf command.
>
> According to our test, there are four parts which can bring big overhead.
> They include NMI handler, multiplexing handler, iterate side-band events,
> and write data in file. Two new perf record type PERF_RECORD_OVERHEAD and
> PERF_RECORD_USER_OVERHEAD are introduced to record the overhead
> information in kernel and user space respectively.
> The overhead information is the system per-CPU overhead, not per-event
> overhead. The implementation takes advantage of the existing event log
> mechanism.
> To reduce the additional overhead from logging overhead information, the
> overhead information only be output when the event is going to be
> disabled or task is scheduling out.
>
> In perf report, the overhead will be checked automatically. If the
> overhead rate is larger than 10%. A warning will be displayed.
> A new option is also introduced to display detial per-CPU overhead
> information.
>
> Current implementation only include four overhead sources. There could be
> more in other parts. The new overhead source can be easily added as a
> new type.
Please include sample output of the new instrumentation!
Not even the tooling patches show any of the output, nor is it clear anywhere what
kind of 'overhead' measurement it is, what the units are, what the metrics are,
how users can _use_ this information, etc.
This is totally inadequate description.
Thanks,
Ingo
>
> On Wed, Nov 23, 2016 at 04:44:44AM -0500, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > Caculate the total NMI overhead on each CPU, and display them in perf
> > report
>
> so the output looks like this:
>
> ---
> # Elapsed time: 1720167944 ns
> # Overhead:
> # CPU 6
> # NMI#: 27 time: 111379 ns
> # Multiplexing#: 0 time: 0 ns
> # SB#: 57 time: 90045 ns
> #
> # Samples: 26 of event 'cycles:u'
> # Event count (approx.): 1677531
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... ................ .......................
> #
> 24.20% ls ls [.] _init
> 17.18% ls libc-2.24.so [.] __strcoll_l
> 11.85% ls ld-2.24.so [.] _dl_relocate_object
> ---
>
>
> few things:
>
> - I wonder we want to put this overhead output separatelly from the
> main perf out.. this scale bad with with bigger cpu counts
>
This output can only be shown when the user explicitly applies
the --show-overhead option. I think the user should expect the big
header.
Or I can add a --show-overhead-only option which only shows the
overhead information. It would be like what we do for --header and
--header-only.
Any suggestions?
> - we might want to call it some other way, becayse we already
> use 'overhead' for the event count %
>
"operating_cost"? "processing_cost"? "perf_cost"? "perf_overhead"?
Suggestions?
> - how about TUI output? ;-) I dont think it's necessary, however
> currently 'perf report --show-overhead' does not show anything
> ifTUI is default output, unless you use --stdio option
I will try to add something in TUI mode.
Thanks,
Kan
>
> On Wed, Nov 23, 2016 at 04:44:39AM -0500, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > A new perf record is introduced to export perf overhead information to
> > userspace. So we can measure the overhead of sampling directly.
> >
> > To output the overhead information, it takes advantage of the existing
> > event log mechanism. But the overhead information is the system
> > overhead, not per-event overhead.
> >
> > Signed-off-by: Kan Liang <[email protected]>
> > ---
> > include/linux/perf_event.h | 4 ++++
> > include/uapi/linux/perf_event.h | 29 +++++++++++++++++++++++++++
> > kernel/events/core.c | 43
> +++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 76 insertions(+)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 4741ecd..3a530a7 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -998,6 +998,10 @@ perf_event__output_id_sample(struct
> perf_event
> > *event, extern void perf_log_lost_samples(struct perf_event *event,
> > u64 lost);
> >
> > +extern void
> > +perf_log_overhead(struct perf_event *event, u32 type,
> > + struct perf_overhead_entry *entry);
> > +
> > static inline bool is_sampling_event(struct perf_event *event) {
> > return event->attr.sample_period != 0; diff --git
> > a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index c66a485..ae571be 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -862,6 +862,17 @@ enum perf_event_type {
> > */
> > PERF_RECORD_SWITCH_CPU_WIDE = 15,
> >
> > + /*
> > + * Records perf overhead
> > + * struct {
> > + * struct perf_event_header header;
> > + * u32 type;
> > + * struct perf_overhead_entry entry;
> > + * struct sample_id sample_id;
> > + * };
> > + */
> > + PERF_RECORD_OVERHEAD = 16,
> > +
> > PERF_RECORD_MAX, /* non-ABI */
> > };
>
> I think we should make this optional/configurable like the rest of the aux
> events, like below..
>
The overhead logging only happens when the event is going to be disabled or
the task is scheduling out. It should not be frequent or expensive.
Peter,
What do you think?
Should we make it configurable?
Thanks,
Kan
> jirka
>
>
> ---
> diff --git a/include/uapi/linux/perf_event.h
> b/include/uapi/linux/perf_event.h index 5e7c52278ef0..853a919bd9ae
> 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -344,7 +344,8 @@ struct perf_event_attr {
> use_clockid : 1, /* use @clockid for time
> fields */
> context_switch : 1, /* context switch data */
> write_backward : 1, /* Write ring buffer
> from end to beginning */
> - __reserved_1 : 36;
> + overhead : 1,
> + __reserved_1 : 35;
>
> union {
> __u32 wakeup_events; /* wakeup every n
> events */
On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > I think we should make this optional/configurable like the rest of the aux
> > events, like below..
> >
>
> The overhead logging only happens when event is going to be disabled or
> the task is scheduling out. It should not be much and expensive.
>
> Peter,
>
> What do you think?
> Should we make it configurable?
Is there a downside to having it optional?
>
> On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > I think we should make this optional/configurable like the rest of
> > > the aux events, like below..
> > >
> >
> > The overhead logging only happens when event is going to be disabled
> > or the task is scheduling out. It should not be much and expensive.
> >
> > Peter,
> >
> > What do you think?
> > Should we make it configurable?
>
> Is there a downside to having it optional?
No. There will be no overhead information dumped into perf.data, just like
the current implementation.
On Thu, Nov 24, 2016 at 01:56:51PM +0000, Liang, Kan wrote:
>
>
> >
> > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > I think we should make this optional/configurable like the rest of
> > > > the aux events, like below..
> > > >
> > >
> > > The overhead logging only happens when event is going to be disabled
> > > or the task is scheduling out. It should not be much and expensive.
> > >
> > > Peter,
> > >
> > > What do you think?
> > > Should we make it configurable?
> >
> > Is there a downside to having it optional?
>
> NO. There will be no overhead information dumped into perf.data. Just like
> current implementation.
old perf tools will get unexpected events if it's not optional
[root@ibm-x3650m4-01 perf]# ./perf c2c record -ag
^C[ perf record: Woken up 1 times to write data ]
Warning:
Found 1 unknown events!
Is this an older tool processing a perf.data file generated by a more recent tool?
If that is not the case, consider reporting to [email protected].
[ perf record: Captured and wrote 2.333 MB perf.data (9370 samples) ]
jirka
> On Thu, Nov 24, 2016 at 01:56:51PM +0000, Liang, Kan wrote:
> >
> >
> > >
> > > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > > I think we should make this optional/configurable like the rest
> > > > > of the aux events, like below..
> > > > >
> > > >
> > > > The overhead logging only happens when event is going to be
> > > > disabled or the task is scheduling out. It should not be much and
> expensive.
> > > >
> > > > Peter,
> > > >
> > > > What do you think?
> > > > Should we make it configurable?
> > >
> > > Is there a downside to having it optional?
> >
> > NO. There will be no overhead information dumped into perf.data. Just
> > like current implementation.
>
> old perf tools will get unexpected events if it's not optional
>
That's expected because a new record type is introduced.
OK, I will make it configurable. Anyway, there is nothing to lose. :)
Thanks,
Kan
> [root@ibm-x3650m4-01 perf]# ./perf c2c record -ag ^C[ perf record: Woken
> up 1 times to write data ]
> Warning:
> Found 1 unknown events!
>
> Is this an older tool processing a perf.data file generated by a more recent
> tool?
>
> If that is not the case, consider reporting to [email protected].
>
> [ perf record: Captured and wrote 2.333 MB perf.data (9370 samples) ]
>
>
> jirka
On Thu, Nov 24, 2016 at 02:39:26PM +0000, Liang, Kan wrote:
>
>
> > On Thu, Nov 24, 2016 at 01:56:51PM +0000, Liang, Kan wrote:
> > >
> > >
> > > >
> > > > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > > > I think we should make this optional/configurable like the rest
> > > > > > of the aux events, like below..
> > > > > >
> > > > >
> > > > > The overhead logging only happens when event is going to be
> > > > > disabled or the task is scheduling out. It should not be much and
> > expensive.
> > > > >
> > > > > Peter,
> > > > >
> > > > > What do you think?
> > > > > Should we make it configurable?
> > > >
> > > > Is there a downside to having it optional?
> > >
> > > NO. There will be no overhead information dumped into perf.data. Just
> > > like current implementation.
> >
> > old perf tools will get unexpected events if it's not optional
> >
>
> That's expected because a new record type is introduced.
but not if it's configurable.. older tools won't get it
jirka
On Thu, Nov 24, 2016 at 01:37:04PM +0000, Liang, Kan wrote:
>
>
> >
> > On Wed, Nov 23, 2016 at 04:44:44AM -0500, [email protected] wrote:
> > > From: Kan Liang <[email protected]>
> > >
> > > Caculate the total NMI overhead on each CPU, and display them in perf
> > > report
> >
> > so the output looks like this:
> >
> > ---
> > # Elapsed time: 1720167944 ns
> > # Overhead:
> > # CPU 6
> > # NMI#: 27 time: 111379 ns
> > # Multiplexing#: 0 time: 0 ns
> > # SB#: 57 time: 90045 ns
> > #
> > # Samples: 26 of event 'cycles:u'
> > # Event count (approx.): 1677531
> > #
> > # Overhead Command Shared Object Symbol
> > # ........ ....... ................ .......................
> > #
> > 24.20% ls ls [.] _init
> > 17.18% ls libc-2.24.so [.] __strcoll_l
> > 11.85% ls ld-2.24.so [.] _dl_relocate_object
> > ---
how about we display the overhead information the same way as the main perf output:
CPU NMI NMI time MTX MTX time SB SB time
... ..... ........ ..... ........ ...... ........
6 27 111379 0 0 57 90045
it would just be a matter of adding new sort objects
jirka
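For illustration, the same table printed with plain stdio (field widths
and the totals struct are guesses, not from the patches):
#include <inttypes.h>
#include <stdio.h>
/* hypothetical per-CPU totals; the patches keep these in events_stats */
struct overhead_totals {
	uint64_t nmi_nr, nmi_time;
	uint64_t mtx_nr, mtx_time;
	uint64_t sb_nr,  sb_time;
};
static void print_overhead_table(const struct overhead_totals *t, int nr_cpus)
{
	int cpu;
	printf("# %3s %6s %9s %6s %9s %6s %9s\n",
	       "CPU", "NMI", "NMI time", "MTX", "MTX time", "SB", "SB time");
	for (cpu = 0; cpu < nr_cpus; cpu++)
		printf("  %3d %6" PRIu64 " %9" PRIu64 " %6" PRIu64
		       " %9" PRIu64 " %6" PRIu64 " %9" PRIu64 "\n",
		       cpu,
		       t[cpu].nmi_nr, t[cpu].nmi_time,
		       t[cpu].mtx_nr, t[cpu].mtx_time,
		       t[cpu].sb_nr,  t[cpu].sb_time);
}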
On Wed, Nov 23, 2016 at 04:44:42AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Iterating all events which need to receive side-band events also bring
> some overhead.
> Save the overhead information in task context or CPU context, whichever
> context is available.
Do we really want to expose this concept to userspace?
What if the implementation changes?
Thanks,
Mark.
> Signed-off-by: Kan Liang <[email protected]>
> ---
> include/linux/perf_event.h | 2 ++
> include/uapi/linux/perf_event.h | 1 +
> kernel/events/core.c | 32 ++++++++++++++++++++++++++++----
> 3 files changed, 31 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f72b97a..ec3cb7f 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -764,6 +764,8 @@ struct perf_event_context {
> #endif
> void *task_ctx_data; /* pmu specific data */
> struct rcu_head rcu_head;
> +
> + struct perf_overhead_entry sb_overhead;
> };
>
> /*
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 9124c7c..5e7c522 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -994,6 +994,7 @@ struct perf_branch_entry {
> enum perf_record_overhead_type {
> PERF_NMI_OVERHEAD = 0,
> PERF_MUX_OVERHEAD,
> + PERF_SB_OVERHEAD,
>
> PERF_OVERHEAD_MAX,
> };
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 9934059..51e9df7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1829,9 +1829,15 @@ event_sched_out(struct perf_event *event,
> if (event->attr.exclusive || !cpuctx->active_oncpu)
> cpuctx->exclusive = 0;
>
> - if (log_overhead && cpuctx->mux_overhead.nr) {
> - cpuctx->mux_overhead.cpu = smp_processor_id();
> - perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
> + if (log_overhead) {
> + if (cpuctx->mux_overhead.nr) {
> + cpuctx->mux_overhead.cpu = smp_processor_id();
> + perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
> + }
> + if (ctx->sb_overhead.nr) {
> + ctx->sb_overhead.cpu = smp_processor_id();
> + perf_log_overhead(event, PERF_SB_OVERHEAD, &ctx->sb_overhead);
> + }
> }
>
> perf_pmu_enable(event->pmu);
> @@ -6133,6 +6139,14 @@ static void perf_iterate_sb_cpu(perf_iterate_f output, void *data)
> }
> }
>
> +static void
> +perf_caculate_sb_overhead(struct perf_event_context *ctx,
> + u64 time)
> +{
> + ctx->sb_overhead.nr++;
> + ctx->sb_overhead.time += time;
> +}
> +
> /*
> * Iterate all events that need to receive side-band events.
> *
> @@ -6143,9 +6157,12 @@ static void
> perf_iterate_sb(perf_iterate_f output, void *data,
> struct perf_event_context *task_ctx)
> {
> + struct perf_event_context *overhead_ctx = task_ctx;
> struct perf_event_context *ctx;
> + u64 start_clock, end_clock;
> int ctxn;
>
> + start_clock = perf_clock();
> rcu_read_lock();
> preempt_disable();
>
> @@ -6163,12 +6180,19 @@ perf_iterate_sb(perf_iterate_f output, void *data,
>
> for_each_task_context_nr(ctxn) {
> ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
> - if (ctx)
> + if (ctx) {
> perf_iterate_ctx(ctx, output, data, false);
> + if (!overhead_ctx)
> + overhead_ctx = ctx;
> + }
> }
> done:
> preempt_enable();
> rcu_read_unlock();
> +
> + end_clock = perf_clock();
> + if (overhead_ctx)
> + perf_caculate_sb_overhead(overhead_ctx, end_clock - start_clock);
> }
>
> /*
> --
> 2.5.5
>
On Wed, Nov 23, 2016 at 04:44:40AM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> NMI handler is one of the most important part which brings overhead.
>
> There are lots of NMI during sampling. It's very expensive to log each
> NMI. So the accumulated time and NMI# will be output when event is going
> to be disabled or task is scheduling out.
> The newly introduced flag PERF_EF_LOG indicate to output the overhead
> log.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/events/core.c | 19 ++++++++++++++-
> arch/x86/events/perf_event.h | 2 ++
> include/linux/perf_event.h | 1 +
> include/uapi/linux/perf_event.h | 2 ++
> kernel/events/core.c | 54 ++++++++++++++++++++++-------------------
> 5 files changed, 52 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index d31735f..6c3b0ef 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1397,6 +1397,11 @@ static void x86_pmu_del(struct perf_event *event, int flags)
>
> perf_event_update_userpage(event);
>
> + if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr) {
> + cpuc->nmi_overhead.cpu = smp_processor_id();
> + perf_log_overhead(event, PERF_NMI_OVERHEAD, &cpuc->nmi_overhead);
> + }
> +
> do_del:
> if (x86_pmu.del) {
> /*
> @@ -1475,11 +1480,21 @@ void perf_events_lapic_init(void)
> apic_write(APIC_LVTPC, APIC_DM_NMI);
> }
>
> +static void
> +perf_caculate_nmi_overhead(u64 time)
s/caculate/calculate/ - this tripped me up when grepping.
> @@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
> start_clock = sched_clock();
> ret = x86_pmu.handle_irq(regs);
> finish_clock = sched_clock();
> + clock = finish_clock - start_clock;
>
> - perf_sample_event_took(finish_clock - start_clock);
> + perf_caculate_nmi_overhead(clock);
> + perf_sample_event_took(clock);
Ah, so it's the *sampling* overhead, not the NMI overhead.
This doesn't take into account the cost of entering/exiting the handler, which
could be larger than the sampling overhead (e.g. if the PMU is connected
through chained interrupt controllers).
> enum perf_record_overhead_type {
> + PERF_NMI_OVERHEAD = 0,
As above, it may be worth calling this PERF_SAMPLE_OVERHEAD; this doesn't count
the entire cost of the NMI, and other architectures may want to implement this,
yet don't have NMI.
[...]
> static void
> event_sched_out(struct perf_event *event,
> struct perf_cpu_context *cpuctx,
> - struct perf_event_context *ctx)
> + struct perf_event_context *ctx,
> + bool log_overhead)
Boolean parameters are always confusing. Why not pass the flags directly? That
way we can pass *which* overhead to log, and make the callsites easier to
understand.
> event->tstamp_stopped = tstamp;
> - event->pmu->del(event, 0);
> + event->pmu->del(event, log_overhead ? PERF_EF_LOG : 0);
... which we could pass on here.
> @@ -1835,20 +1835,21 @@ event_sched_out(struct perf_event *event,
> static void
> group_sched_out(struct perf_event *group_event,
> struct perf_cpu_context *cpuctx,
> - struct perf_event_context *ctx)
> + struct perf_event_context *ctx,
> + bool log_overhead)
Likewise.
> @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event *event,
> {
> unsigned long flags = (unsigned long)info;
>
> - event_sched_out(event, cpuctx, ctx);
> + event_sched_out(event, cpuctx, ctx, false);
> if (flags & DETACH_GROUP)
> perf_group_detach(event);
> list_del_event(event, ctx);
> @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct perf_event *event,
> update_cgrp_time_from_event(event);
> update_group_times(event);
> if (event == event->group_leader)
> - group_sched_out(event, cpuctx, ctx);
> + group_sched_out(event, cpuctx, ctx, true);
> else
> - event_sched_out(event, cpuctx, ctx);
> + event_sched_out(event, cpuctx, ctx, true);
Why does this differ from __perf_remove_from_context()?
What's the policy for when we do or do not measure overhead?
Thanks,
Mark.
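To make that concrete, a sketch of the suggested signature change
(illustration only, not a patch):
static void
event_sched_out(struct perf_event *event,
		struct perf_cpu_context *cpuctx,
		struct perf_event_context *ctx,
		int flags)
{
	/* ... */
	event->tstamp_stopped = tstamp;
	event->pmu->del(event, flags);	/* PERF_EF_LOG passes straight through */
	/* ... */
}
/* callers then state which overhead, if any, to log: */
event_sched_out(event, cpuctx, ctx, 0);			/* __perf_remove_from_context */
event_sched_out(event, cpuctx, ctx, PERF_EF_LOG);	/* __perf_event_disable */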
On Thu, Nov 24, 2016 at 02:50:43PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > I think we should make this optional/configurable like the rest of the aux
> > > events, like below..
> > >
> >
> > The overhead logging only happens when event is going to be disabled or
> > the task is scheduling out. It should not be much and expensive.
> >
> > Peter,
> >
> > What do you think?
> > Should we make it configurable?
>
> Is there a downside to having it optional?
It would be good to always have at least a one-line overhead summary in the
default output, so if someone sends you a perf report output file that has
suspicious overhead, you can investigate.
-Andi
On Thu, Nov 24, 2016 at 10:28:22AM -0800, Andi Kleen wrote:
> On Thu, Nov 24, 2016 at 02:50:43PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > I think we should make this optional/configurable like the rest of the aux
> > > > events, like below..
> > > >
> > >
> > > The overhead logging only happens when event is going to be disabled or
> > > the task is scheduling out. It should not be much and expensive.
> > >
> > > Peter,
> > >
> > > What do you think?
> > > Should we make it configurable?
> >
> > Is there a downside to having it optional?
>
> It would be good to always have at least one line overhead summary in the
> default output. So if someone sends you a perf report output file and it has
> suspicious overhead can investigate.
Sure, but that's a tool thing, totally irrelevant for the kernel.
On Thu, Nov 24, 2016 at 07:58:49PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 24, 2016 at 10:28:22AM -0800, Andi Kleen wrote:
> > On Thu, Nov 24, 2016 at 02:50:43PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > > I think we should make this optional/configurable like the rest of the aux
> > > > > events, like below..
> > > > >
> > > >
> > > > The overhead logging only happens when event is going to be disabled or
> > > > the task is scheduling out. It should not be much and expensive.
> > > >
> > > > Peter,
> > > >
> > > > What do you think?
> > > > Should we make it configurable?
> > >
> > > Is there a downside to having it optional?
> >
> > It would be good to always have at least one line overhead summary in the
> > default output. So if someone sends you a perf report output file and it has
> > suspicious overhead can investigate.
>
> Sure, but that's a tool thing, totally irrelevant for the kernel.
It requires enabling the overhead records by default.
If there is real concern about not supporting old tools, it would need a
disable option instead of an enable option.
-Andi
On Thu, Nov 24, 2016 at 04:19:09PM +0000, Mark Rutland wrote:
> > static void
> > event_sched_out(struct perf_event *event,
> > struct perf_cpu_context *cpuctx,
> > - struct perf_event_context *ctx)
> > + struct perf_event_context *ctx,
> > + bool log_overhead)
>
> Boolean parameter are always confusing. Why not pass the flags directly? That
> way we can pass *which* overhead to log, and make the callsites easier to
> understand.
>
> > event->tstamp_stopped = tstamp;
> > - event->pmu->del(event, 0);
> > + event->pmu->del(event, log_overhead ? PERF_EF_LOG : 0);
>
> ... which we could pass on here.
>
> > @@ -1835,20 +1835,21 @@ event_sched_out(struct perf_event *event,
> > static void
> > group_sched_out(struct perf_event *group_event,
> > struct perf_cpu_context *cpuctx,
> > - struct perf_event_context *ctx)
> > + struct perf_event_context *ctx,
> > + bool log_overhead)
>
> Likewise.
>
> > @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event *event,
> > {
> > unsigned long flags = (unsigned long)info;
> >
> > - event_sched_out(event, cpuctx, ctx);
> > + event_sched_out(event, cpuctx, ctx, false);
> > if (flags & DETACH_GROUP)
> > perf_group_detach(event);
> > list_del_event(event, ctx);
> > @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct perf_event *event,
> > update_cgrp_time_from_event(event);
> > update_group_times(event);
> > if (event == event->group_leader)
> > - group_sched_out(event, cpuctx, ctx);
> > + group_sched_out(event, cpuctx, ctx, true);
> > else
> > - event_sched_out(event, cpuctx, ctx);
> > + event_sched_out(event, cpuctx, ctx, true);
>
> Why does this differ from __perf_remove_from_context()?
So I'm not a great fan of sprinkling all this through the core code at
all.
On Thu, Nov 24, 2016 at 11:02:16AM -0800, Andi Kleen wrote:
> > Sure, but that's a tool thing, totally irrelevant for the kernel.
>
> It requires enabling the overhead records by default.
by the tool, not the kernel.
> > @@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd,
> struct pt_regs *regs)
> > start_clock = sched_clock();
> > ret = x86_pmu.handle_irq(regs);
> > finish_clock = sched_clock();
> > + clock = finish_clock - start_clock;
> >
> > - perf_sample_event_took(finish_clock - start_clock);
> > + perf_caculate_nmi_overhead(clock);
> > + perf_sample_event_took(clock);
>
> Ah, so it's the *sampling* overhead, not the NMI overhead.
>
> This doesn't take into account the cost of entering/exiting the handler,
> which could be larger than the sampling overhead (e.g. if the PMU is
> connected through chained interrupt controllers).
>
> > enum perf_record_overhead_type {
> > + PERF_NMI_OVERHEAD = 0,
>
> As above, it may be worth calling this PERF_SAMPLE_OVERHEAD; this
I think PERF_NMI stands for the NMI overhead in the perf part.
PERF_SAMPLE_OVERHEAD looks too generic, I think.
It sounds like the sum of all overheads in sampling.
After all, we collect the overhead at different stages of sampling:
the NMI handler, multiplexing, side-band events...
> doesn't count the entire cost of the NMI, and other architectures may want
> to implement this, yet don't have NMI.
>
I think I can change it to PERF_X86_NMI_OVERHEAD, if you think that's clearer.
Other architectures can implement their own overhead types and
just ignore the NMI one.
> > @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event
> > *event, {
> > unsigned long flags = (unsigned long)info;
> >
> > - event_sched_out(event, cpuctx, ctx);
> > + event_sched_out(event, cpuctx, ctx, false);
> > if (flags & DETACH_GROUP)
> > perf_group_detach(event);
> > list_del_event(event, ctx);
> > @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct
> perf_event *event,
> > update_cgrp_time_from_event(event);
> > update_group_times(event);
> > if (event == event->group_leader)
> > - group_sched_out(event, cpuctx, ctx);
> > + group_sched_out(event, cpuctx, ctx, true);
> > else
> > - event_sched_out(event, cpuctx, ctx);
> > + event_sched_out(event, cpuctx, ctx, true);
>
> Why does this differ from __perf_remove_from_context()?
>
Both of them are called when removing an event, so I think we only
need to log the overhead in one place.
I just did some tests. It looks like __perf_remove_from_context is called
after __perf_event_disable.
I think I will log the overhead in __perf_remove_from_context for the next
version.
> What's the policy for when we do or do not measure overhead?
Currently, it's enabled all the time.
Jirka suggested making it configurable. I will do that in the next version.
For the next version, I still prefer to enable it by default, since
it doesn't bring additional overhead based on my tests.
Thanks,
Kan
>
> On Wed, Nov 23, 2016 at 04:44:42AM -0500, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > Iterating all events which need to receive side-band events also bring
> > some overhead.
> > Save the overhead information in task context or CPU context,
> > whichever context is available.
>
> Do we really want to expose this concept to userspace?
>
> What if the implementation changes?
Will the concept of side-band events be removed?
I thought we would just use an rb-tree to replace the list.
No matter how we implement it, we do need to calculate its overhead,
unless the concept is gone or it is merged with another overhead type,
because based on my tests it brings big overhead in some cases.
Thanks,
Kan
>
> Thanks,
> Mark.
>
> > Signed-off-by: Kan Liang <[email protected]>
> > ---
> > include/linux/perf_event.h | 2 ++
> > include/uapi/linux/perf_event.h | 1 +
> > kernel/events/core.c | 32 ++++++++++++++++++++++++++++----
> > 3 files changed, 31 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index f72b97a..ec3cb7f 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -764,6 +764,8 @@ struct perf_event_context { #endif
> > void *task_ctx_data; /* pmu specific data
> */
> > struct rcu_head rcu_head;
> > +
> > + struct perf_overhead_entry sb_overhead;
> > };
> >
> > /*
> > diff --git a/include/uapi/linux/perf_event.h
> > b/include/uapi/linux/perf_event.h index 9124c7c..5e7c522 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -994,6 +994,7 @@ struct perf_branch_entry { enum
> > perf_record_overhead_type {
> > PERF_NMI_OVERHEAD = 0,
> > PERF_MUX_OVERHEAD,
> > + PERF_SB_OVERHEAD,
> >
> > PERF_OVERHEAD_MAX,
> > };
> > diff --git a/kernel/events/core.c b/kernel/events/core.c index
> > 9934059..51e9df7 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1829,9 +1829,15 @@ event_sched_out(struct perf_event *event,
> > if (event->attr.exclusive || !cpuctx->active_oncpu)
> > cpuctx->exclusive = 0;
> >
> > - if (log_overhead && cpuctx->mux_overhead.nr) {
> > - cpuctx->mux_overhead.cpu = smp_processor_id();
> > - perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx-
> >mux_overhead);
> > + if (log_overhead) {
> > + if (cpuctx->mux_overhead.nr) {
> > + cpuctx->mux_overhead.cpu = smp_processor_id();
> > + perf_log_overhead(event, PERF_MUX_OVERHEAD,
> &cpuctx->mux_overhead);
> > + }
> > + if (ctx->sb_overhead.nr) {
> > + ctx->sb_overhead.cpu = smp_processor_id();
> > + perf_log_overhead(event, PERF_SB_OVERHEAD,
> &ctx->sb_overhead);
> > + }
> > }
> >
> > perf_pmu_enable(event->pmu);
> > @@ -6133,6 +6139,14 @@ static void perf_iterate_sb_cpu(perf_iterate_f
> output, void *data)
> > }
> > }
> >
> > +static void
> > +perf_caculate_sb_overhead(struct perf_event_context *ctx,
> > + u64 time)
> > +{
> > + ctx->sb_overhead.nr++;
> > + ctx->sb_overhead.time += time;
> > +}
> > +
> > /*
> > * Iterate all events that need to receive side-band events.
> > *
> > @@ -6143,9 +6157,12 @@ static void
> > perf_iterate_sb(perf_iterate_f output, void *data,
> > struct perf_event_context *task_ctx) {
> > + struct perf_event_context *overhead_ctx = task_ctx;
> > struct perf_event_context *ctx;
> > + u64 start_clock, end_clock;
> > int ctxn;
> >
> > + start_clock = perf_clock();
> > rcu_read_lock();
> > preempt_disable();
> >
> > @@ -6163,12 +6180,19 @@ perf_iterate_sb(perf_iterate_f output, void
> > *data,
> >
> > for_each_task_context_nr(ctxn) {
> > ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
> > - if (ctx)
> > + if (ctx) {
> > perf_iterate_ctx(ctx, output, data, false);
> > + if (!overhead_ctx)
> > + overhead_ctx = ctx;
> > + }
> > }
> > done:
> > preempt_enable();
> > rcu_read_unlock();
> > +
> > + end_clock = perf_clock();
> > + if (overhead_ctx)
> > + perf_caculate_sb_overhead(overhead_ctx, end_clock -
> start_clock);
> > }
> >
> > /*
> > --
> > 2.5.5
> >
Hi,
On Thu, Nov 24, 2016 at 04:27:21PM +0100, Jiri Olsa wrote:
> On Thu, Nov 24, 2016 at 01:37:04PM +0000, Liang, Kan wrote:
> >
> >
> > >
> > > On Wed, Nov 23, 2016 at 04:44:44AM -0500, [email protected] wrote:
> > > > From: Kan Liang <[email protected]>
> > > >
> > > > Caculate the total NMI overhead on each CPU, and display them in perf
> > > > report
> > >
> > > so the output looks like this:
> > >
> > > ---
> > > # Elapsed time: 1720167944 ns
> > > # Overhead:
> > > # CPU 6
> > > # NMI#: 27 time: 111379 ns
> > > # Multiplexing#: 0 time: 0 ns
> > > # SB#: 57 time: 90045 ns
> > > #
> > > # Samples: 26 of event 'cycles:u'
> > > # Event count (approx.): 1677531
> > > #
> > > # Overhead Command Shared Object Symbol
> > > # ........ ....... ................ .......................
> > > #
> > > 24.20% ls ls [.] _init
> > > 17.18% ls libc-2.24.so [.] __strcoll_l
> > > 11.85% ls ld-2.24.so [.] _dl_relocate_object
> > > ---
>
> how about we display the overhead information same way the main perf output:
>
> CPU NMI NMI time MTX MTX time SB SB time
> ... ..... ........ ..... ........ ...... ........
> 6 27 111379 0 0 57 90045
>
>
> would be just matter of adding new sort objects
How would you connect those to hist entries then? It'd be possible if
the sort key had 'cpu' only, no?
Thanks,
Namhyung
On Thu, Nov 24, 2016 at 07:40:21PM +0000, Liang, Kan wrote:
>
>
> > > @@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd,
> > struct pt_regs *regs)
> > > start_clock = sched_clock();
> > > ret = x86_pmu.handle_irq(regs);
> > > finish_clock = sched_clock();
> > > + clock = finish_clock - start_clock;
> > >
> > > - perf_sample_event_took(finish_clock - start_clock);
> > > + perf_caculate_nmi_overhead(clock);
> > > + perf_sample_event_took(clock);
> >
> > Ah, so it's the *sampling* overhead, not the NMI overhead.
> >
> > This doesn't take into account the cost of entering/exiting the handler,
> > which could be larger than the sampling overhead (e.g. if the PMU is
> > connected through chained interrupt controllers).
> >
> > > enum perf_record_overhead_type {
> > > + PERF_NMI_OVERHEAD = 0,
> >
> > As above, it may be worth calling this PERF_SAMPLE_OVERHEAD; this
>
> I think PERF_NMI stands for the NMI overhead in perf part.
>
> PERF_SAMPLE_OVERHEAD looks too generic I think.
> It heard like the sum of all overheads in sampling.
> After all we collect the overhead in different stage of sampling.
> NMI handler, multiplexing, side-band events...
>
>
> > doesn't count the entire cost of the NMI, and other architectures may want
> > to implement this, yet don't have NMI.
> >
>
> I think I can change it to PERF_X86_NMI_OVERHEAD, if you think it's more clear.
> For other architectures, they can implement their own type of overhead,
> just ignore the NMI one.
I think it'd be better to make it arch-agnostic if possible. What
about PERF_PMU_OVERHEAD or PERF_PMU_SAMPLE_OVERHEAD?
Thanks,
Namhyung
>
>
>
> > > @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event
> > > *event, {
> > > unsigned long flags = (unsigned long)info;
> > >
> > > - event_sched_out(event, cpuctx, ctx);
> > > + event_sched_out(event, cpuctx, ctx, false);
> > > if (flags & DETACH_GROUP)
> > > perf_group_detach(event);
> > > list_del_event(event, ctx);
> > > @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct
> > perf_event *event,
> > > update_cgrp_time_from_event(event);
> > > update_group_times(event);
> > > if (event == event->group_leader)
> > > - group_sched_out(event, cpuctx, ctx);
> > > + group_sched_out(event, cpuctx, ctx, true);
> > > else
> > > - event_sched_out(event, cpuctx, ctx);
> > > + event_sched_out(event, cpuctx, ctx, true);
> >
> > Why does this differ from __perf_remove_from_context()?
> >
>
> Both of them are called on removing event. So I think we only
> need to log overhead in one place.
>
> I just did some tests. It looks like __perf_remove_from_context is
> called after __perf_event_disable, so I will log the overhead in
> __perf_remove_from_context in the next version.
>
>
> > What's the policy for when we do or do not measure overhead?
>
> Currently, it's enabled all the time.
> Jirka suggested making it configurable. I will do that in the next version.
> For the next version, I still prefer to have it enabled by default, since
> it brings no additional overhead based on my tests.
>
> Thanks,
> Kan
On Fri, Nov 25, 2016 at 08:20:13AM +0900, Namhyung Kim wrote:
> Hi,
>
> On Thu, Nov 24, 2016 at 04:27:21PM +0100, Jiri Olsa wrote:
> > On Thu, Nov 24, 2016 at 01:37:04PM +0000, Liang, Kan wrote:
> > >
> > >
> > > >
> > > > On Wed, Nov 23, 2016 at 04:44:44AM -0500, [email protected] wrote:
> > > > > From: Kan Liang <[email protected]>
> > > > >
> > > > > Calculate the total NMI overhead on each CPU, and display it in perf
> > > > > report
> > > >
> > > > so the output looks like this:
> > > >
> > > > ---
> > > > # Elapsed time: 1720167944 ns
> > > > # Overhead:
> > > > # CPU 6
> > > > # NMI#: 27 time: 111379 ns
> > > > # Multiplexing#: 0 time: 0 ns
> > > > # SB#: 57 time: 90045 ns
> > > > #
> > > > # Samples: 26 of event 'cycles:u'
> > > > # Event count (approx.): 1677531
> > > > #
> > > > # Overhead Command Shared Object Symbol
> > > > # ........ ....... ................ .......................
> > > > #
> > > > 24.20% ls ls [.] _init
> > > > 17.18% ls libc-2.24.so [.] __strcoll_l
> > > > 11.85% ls ld-2.24.so [.] _dl_relocate_object
> > > > ---
> >
> > how about we display the overhead information the same way as the main perf output:
> >
> > CPU NMI NMI time MTX MTX time SB SB time
> > ... ..... ........ ..... ........ ...... ........
> > 6 27 111379 0 0 57 90045
> >
> >
> > would be just a matter of adding new sort objects
>
> How would you connect those to hist entries then? It'd only be possible
> if the sort key had 'cpu' alone, no?
Right, I should have said fields then.
jirka
> how about we display the overhead information the same way as the main perf output:
>
> CPU NMI NMI time MTX MTX time SB SB time
> ... ..... ........ ..... ........ ...... ........
> 6 27 111379 0 0 57 90045
>
>
> would be just a matter of adding new sort objects
The problem with making overhead a standard sort key is that you have
to choose between an output format that makes sense for overhead
and one that makes sense for normal samples.
But overhead is more "auxiliary" information, so it should be possible
to access it together with normal sampling information in a single
output file.
So I think it's better handled separately.
-Andi
On Wed, Nov 23, 2016 at 10:03:24PM +0000, Liang, Kan wrote:
> > Perhaps we need two separate metrics here:
> >
> > - cost of perf record on its CPU (or later on multiple CPUs, once it
> >   gets multi-threaded). Warn if this is >50% or so.
>
> What's the formula for the cost of perf record on its CPU?
> Does the cost include only user-space overhead, or all overhead?
> What is the divisor?
It would be all the overhead in the process. Accounting for overhead in
kernel threads or interrupts caused by IO is difficult; we could leave
that out for now.
Sum of:
    for each perf thread:
        thread CPU time / monotonic wall time
I guess the sum is better than the average here, because the perf threads
are likely running (or could be) on the same CPU. If perf record were
changed to flush buffers on the local CPUs more aggressively, this would
need to change, but I presume it's good enough for now.
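A minimal user-space sketch of that per-thread metric, assuming only
POSIX clock_gettime() with CLOCK_THREAD_CPUTIME_ID and CLOCK_MONOTONIC;
thread_cost() and the wall_start bookkeeping are hypothetical names,
not part of the posted series:

#include <time.h>

/*
 * Hypothetical sketch: one thread's cost as CPU time / monotonic wall
 * time, assuming wall_start was captured with CLOCK_MONOTONIC when
 * the session started.
 */
static double thread_cost(const struct timespec *wall_start)
{
	struct timespec cpu, now;
	double cpu_s, wall_s;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu);	/* this thread's CPU time */
	clock_gettime(CLOCK_MONOTONIC, &now);		/* monotonic wall clock */

	cpu_s  = cpu.tv_sec + cpu.tv_nsec / 1e9;
	wall_s = (now.tv_sec - wall_start->tv_sec) +
		 (now.tv_nsec - wall_start->tv_nsec) / 1e9;

	return cpu_s / wall_s;
}

perf record would presumably sum this value over all of its threads and
warn once the sum crosses the 0.5 threshold suggested above.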
>
>
> > - average perf collection overhead on a CPU. The 10% threshold here
> > seems appropriate.
> For the average, do you mean adding all the overheads across CPUs
> together and dividing by the number of CPUs?
Right. Possibly also the max over all CPUs, too.
>
> To calculate the rate, the divisor is wall clock time, right?
Monotonic wall clock time, yes.
-Andi
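Putting both replies together, a sketch of the resulting check; every
name here is illustrative, and only the 10% threshold and the monotonic
divisor come from the discussion above:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: average the logged per-CPU overhead times and
 * compare the rate against the monotonic elapsed time of the session.
 */
static bool overhead_too_high(const uint64_t *cpu_overhead_ns, int nr_cpus,
			      uint64_t elapsed_ns)
{
	uint64_t total_ns = 0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		total_ns += cpu_overhead_ns[cpu];

	/* average per-CPU overhead as a fraction of wall time, > 10%? */
	return total_ns / nr_cpus > elapsed_ns / 10;
}

Integer division keeps the check cheap; as Andi notes, reporting the max
per-CPU rate alongside the average may also be worthwhile.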