From: Kan Liang <[email protected]>
Profiling brings additional overhead. High overhead may change the
behavior of the profiled workload, reduce the accuracy of the
profiling result, and even hang the system.
Currently, perf has a dynamic interrupt throttle mechanism to lower the
sample rate and reduce overhead. But it has limitations.
- The mechanism only focuses on sampling overhead. However, other
  parts, e.g. multiplexing, also bring significant overhead.
- The hint from the mechanism doesn't work with a fixed period.
- The system changes caused by the mechanism are not recorded in
  perf.data. Users have no idea about the overhead and its impact.
Actually, passive approaches like the dynamic interrupt throttle
mechanism are only palliative. The best way is to export the overhead
information, provide more hints, and help users construct a more
appropriate perf command.
Both kernel and user tool can bring overhead.
For the kernel, three parts bring obvious overhead.
- sample overhead: For x86, the time spent in the perf NMI handler.
- multiplexing overhead: The time spent rotating contexts.
- side-band events overhead: The time spent iterating over all events
  that need to receive side-band events.
The time cost of all these parts is recorded in a new perf record type,
PERF_RECORD_OVERHEAD.
For the user tool, the perf record CPU time and elapsed time are
recorded in a new perf record type, PERF_RECORD_USER_OVERHEAD.
Users can apply --show-profiling-cost in perf report to dump the
overhead details at the head of the output. The result is sorted by
CPU, if CPU information is available. Otherwise, the accumulated result
is printed.
perf report automatically checks the overhead information. If the
kernel profiling cost is > 10%, or the perf record profiling cost is
> 50%, a warning is printed. The kernel profiling cost is the maximum
per-CPU kernel overhead divided by the perf monotonic wall time. The
perf record profiling cost is the perf CPU time divided by the perf
monotonic wall time.
Logging overhead information is on by default. Users can disable it by
applying --no-profile-cost-info in perf record.
Users can use the overhead information to refine their perf command and
get a more accurate profiling result. For example, on a high overhead
warning, a user may reduce the number of events, increase the period,
or decrease the frequency.
Developers can also use the overhead information to find bugs.
Here is an example.
$ perf report --show-profiling-cost
Warning:
Perf kernel profiling cost is high! The cost rate is 15.30%
Please consider reducing the number of events, increasing the period,
or decreasing the frequency.
# To display the perf.data header info, please use
--header/--header-only options.
#
# ========
# Elapsed time: 120345432522(ns)
# Perf record cpu time: 505470324(ns)
#
# CPU      SAM   SAM cost(ns)     MUX   MUX cost(ns)      SB    SB cost(ns)
#  -1  2247030     1705370307  111354     3111550168 1237358   13591276893
# ========
Changes since V1:
- fix u32 holes and remove duplicate CPU information
- configurable overhead logging
- Introduce the concept of PMU specific overhead and common core
overhead. Rename NMI overhead to PMU sample overhead
- Add log_overhead in perf_event_context to indicate the logging of
overhead
- Refine the output of overhead information
- Use perf CPU time to replace perf write data overhead
- Refine the formula of overhead evaluation
- Refine perf script
Kan Liang (13):
perf/core: Introduce PERF_RECORD_OVERHEAD
perf/core: output overhead when sched out from context
perf/x86: output sampling overhead
perf/core: output multiplexing overhead
perf/core: output side-band events overhead
perf tools: option to disable overhead collection
perf tools: handle PERF_RECORD_OVERHEAD record type
perf tools: show kernel overhead
perf script: show kernel overhead
perf tools: add time related functions
perf tools: introduce PERF_RECORD_USER_OVERHEAD
perf tools: record user space profiling cost
perf report: warn on high overhead
arch/x86/events/core.c | 17 +++-
arch/x86/events/perf_event.h | 2 +
include/linux/perf_event.h | 15 +++
include/uapi/linux/perf_event.h | 42 ++++++++-
kernel/events/core.c | 92 ++++++++++++++++++-
tools/include/uapi/linux/perf_event.h | 42 ++++++++-
tools/perf/Documentation/perf-record.txt | 6 ++
tools/perf/Documentation/perf-report.txt | 10 ++
tools/perf/Documentation/perf-script.txt | 3 +
tools/perf/builtin-record.c | 63 ++++++++++++-
tools/perf/builtin-report.c | 16 +++-
tools/perf/builtin-sched.c | 2 +-
tools/perf/builtin-script.c | 32 +++++++
tools/perf/builtin.h | 2 +
tools/perf/perf.h | 1 +
tools/perf/util/event.c | 39 ++++++++
tools/perf/util/event.h | 31 +++++++
tools/perf/util/evlist.c | 6 ++
tools/perf/util/evlist.h | 1 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/machine.c | 37 ++++++++
tools/perf/util/machine.h | 3 +
tools/perf/util/session.c | 152 +++++++++++++++++++++++++++++++
tools/perf/util/session.h | 3 +
tools/perf/util/symbol.h | 3 +-
tools/perf/util/tool.h | 1 +
26 files changed, 613 insertions(+), 9 deletions(-)
--
2.5.5
From: Kan Liang <[email protected]>
A new perf record type is introduced to export perf overhead
information to userspace, so that users can measure the overhead of
sampling directly. If a user doesn't want this feature, it can be
switched off by configuring the user-space tool.
To output the overhead information, the implementation takes advantage
of the existing event log mechanism. Note that the logged overhead is
the system-wide overhead, not per-event overhead.
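For orientation, a user-space consumer would see records shaped like
the layout this patch adds. The structs below mirror the uapi
additions; the decode helper is purely illustrative and not part of the
patch:

```c
#include <stdint.h>
#include <string.h>

#define PERF_RECORD_OVERHEAD 16		/* as added to enum perf_event_type */

/* Mirrors the uapi layout: header, u64 overhead type, then the entry.
 * A struct sample_id follows in the ring buffer, depending on sample_type. */
struct perf_event_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size;
};

struct perf_overhead_entry {
	uint64_t nr;	/* times the overhead happened   */
	uint64_t time;	/* total overhead cost (ns)      */
};

struct overhead_record {
	struct perf_event_header header;
	uint64_t type;
	struct perf_overhead_entry entry;
};

/* Illustrative helper: copy one record out of a raw event buffer.
 * Returns 0 on success, -1 if the record is not PERF_RECORD_OVERHEAD. */
static int decode_overhead(const void *buf, struct overhead_record *out)
{
	memcpy(out, buf, sizeof(*out));
	return out->header.type == PERF_RECORD_OVERHEAD ? 0 : -1;
}
```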
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 9 ++++++++
include/uapi/linux/perf_event.h | 39 +++++++++++++++++++++++++++++++++-
kernel/events/core.c | 46 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 93 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4741ecd..5bc8156 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -998,6 +998,10 @@ perf_event__output_id_sample(struct perf_event *event,
extern void
perf_log_lost_samples(struct perf_event *event, u64 lost);
+extern void
+perf_log_overhead(struct perf_event *event, u64 type,
+ struct perf_overhead_entry *entry);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
@@ -1221,6 +1225,11 @@ static inline bool has_addr_filter(struct perf_event *event)
return event->pmu->nr_addr_filters;
}
+static inline bool needs_log_overhead(struct perf_event *event)
+{
+ return !!event->attr.overhead;
+}
+
/*
* An inherited event uses parent's filters
*/
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c66a485..bb0ecf0 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -344,7 +344,8 @@ struct perf_event_attr {
use_clockid : 1, /* use @clockid for time fields */
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end to beginning */
- __reserved_1 : 36;
+ overhead : 1, /* Log overhead information */
+ __reserved_1 : 35;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -862,6 +863,17 @@ enum perf_event_type {
*/
PERF_RECORD_SWITCH_CPU_WIDE = 15,
+ /*
+ * Records perf overhead
+ * struct {
+ * struct perf_event_header header;
+ * u64 type;
+ * struct perf_overhead_entry entry;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_OVERHEAD = 16,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -980,4 +992,29 @@ struct perf_branch_entry {
reserved:44;
};
+/*
+ * The overhead type could be different among architectures.
+ * The common overhead type can be defined from PERF_CORE_OVERHEAD
+ * The arch specific type should be defined from PERF_PMU_OVERHEAD
+ */
+enum perf_record_overhead_type {
+ PERF_CORE_OVERHEAD = 0,
+
+ PERF_PMU_OVERHEAD = 20,
+
+ PERF_OVERHEAD_MAX,
+};
+
+/*
+ * single overhead record layout:
+ *
+ * nr: Times of overhead happens.
+ * E.g. for NMI, nr == times of NMI handler are called.
+ * time: Total overhead cost(ns)
+ */
+struct perf_overhead_entry {
+ __u64 nr;
+ __u64 time;
+};
+
#endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6ee1feb..5312744 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7037,6 +7037,52 @@ static void perf_log_itrace_start(struct perf_event *event)
perf_output_end(&handle);
}
+
+/*
+ * Record overhead information
+ *
+ * The overhead logged here is the system overhead, not per-event overhead.
+ * This function only takes advantage of the existing event log mechanism
+ * to log the overhead information.
+ *
+ */
+void perf_log_overhead(struct perf_event *event, u64 type,
+ struct perf_overhead_entry *entry)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ struct {
+ struct perf_event_header header;
+ u64 type;
+ struct perf_overhead_entry overhead;
+ } overhead_event = {
+ .header = {
+ .type = PERF_RECORD_OVERHEAD,
+ .misc = 0,
+ .size = sizeof(overhead_event),
+ },
+ .type = type,
+ .overhead = {
+ .nr = entry->nr,
+ .time = entry->time,
+ },
+ };
+
+ perf_event_header__init_id(&overhead_event.header, &sample, event);
+ ret = perf_output_begin(&handle, event, overhead_event.header.size);
+
+ if (ret)
+ return;
+
+ perf_output_put(&handle, overhead_event);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+ memset(entry, 0, sizeof(*entry));
+}
+
/*
* Generic event overflow handling, sampling.
*/
--
2.5.5
From: Kan Liang <[email protected]>
Show kernel overhead in perf script.
The output is as below:
perf script --show-profiling-cost-events
perf 29001 79989.093958: 1 cycles: ffffffff81064ca6 native_write_msr (/lib/
sleep 29001 79989.094282: 7661 cycles: ffffffff810dc433 update_blocked_averages
sleep 29001 79989.094294: 7442 cycles: ffffffff81810f60 irq_work_interrupt (/li
sleep 29001 79989.094305: 25466 cycles: ffffffff813ca410 radix_tree_next_chunk (
sleep 29001 79989.094340: 94368 cycles: ffffffff8180fa90 page_fault (/lib/module
sleep 29001 79989.094459: 167362 cycles: ffffffff811e3f79 alloc_set_pte (/lib/mod
sleep 29001 79989.094672: 190283 cycles: 7f5d7c91d8e7 _dl_addr (/usr/lib64/li
sleep 29001 79991.094978: 194526 cycles: ffffffff811e0579 __tlb_remove_page_size.
sleep 29001 79991.095061: PERF_RECORD_OVERHEAD [SAMPLE] nr: 8 time: 28110
sleep 29001 79991.095062: PERF_RECORD_OVERHEAD [SB] nr: 24 time: 41397
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-script.txt | 3 +++
tools/perf/builtin-script.c | 32 ++++++++++++++++++++++++++++++++
tools/perf/util/event.c | 29 +++++++++++++++++++++++++++++
tools/perf/util/event.h | 1 +
4 files changed, 65 insertions(+)
diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index c01904f..b371023 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -289,6 +289,9 @@ include::itrace.txt[]
--force::
Don't do ownership validation.
+--show-profiling-cost-events::
+ Display perf profiling time cost related event (PERF_RECORD_OVERHEAD)
+
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-script-perl[1],
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e1daff3..af404a4 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -829,6 +829,7 @@ struct perf_script {
bool show_task_events;
bool show_mmap_events;
bool show_switch_events;
+ bool show_profiling_cost_events;
bool allocated;
struct cpu_map *cpus;
struct thread_map *threads;
@@ -1264,6 +1265,33 @@ static int process_switch_event(struct perf_tool *tool,
return 0;
}
+static int process_overhead_event(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ struct thread *thread;
+ struct perf_script *script = container_of(tool, struct perf_script, tool);
+ struct perf_session *session = script->session;
+ struct perf_evsel *evsel;
+
+ if (perf_event__process_overhead(tool, event, sample, machine) < 0)
+ return -1;
+
+ evsel = perf_evlist__id2evsel(session->evlist, sample->id);
+ thread = machine__findnew_thread(machine, sample->pid, sample->tid);
+ if (thread == NULL) {
+ pr_debug("problem processing OVERHEAD event, skipping it.\n");
+ return -1;
+ }
+
+ print_sample_start(sample, thread, evsel);
+ perf_event__fprintf(event, stdout);
+ thread__put(thread);
+
+ return 0;
+}
+
static void sig_handler(int sig __maybe_unused)
{
session_done = 1;
@@ -1287,6 +1315,8 @@ static int __cmd_script(struct perf_script *script)
}
if (script->show_switch_events)
script->tool.context_switch = process_switch_event;
+ if (script->show_profiling_cost_events)
+ script->tool.overhead = process_overhead_event;
ret = perf_session__process_events(script->session);
@@ -2172,6 +2202,8 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
"Show the mmap events"),
OPT_BOOLEAN('\0', "show-switch-events", &script.show_switch_events,
"Show context switch events (if recorded)"),
+ OPT_BOOLEAN('\0', "show-profiling-cost-events", &script.show_profiling_cost_events,
+ "Show perf profiling time cost events"),
OPT_BOOLEAN('f', "force", &file.force, "don't complain, do it"),
OPT_BOOLEAN(0, "ns", &nanosecs,
"Use 9 decimal places when displaying time"),
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 58d095c..e8736dc 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -1189,6 +1189,32 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp)
event->context_switch.next_prev_tid);
}
+size_t perf_event__fprintf_overhead(union perf_event *event, FILE *fp)
+{
+ size_t ret;
+
+ switch (event->overhead.type) {
+ case PERF_PMU_SAMPLE_OVERHEAD:
+ ret = fprintf(fp, " [SAMPLE] nr: %llu time: %llu\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time);
+ break;
+ case PERF_CORE_MUX_OVERHEAD:
+ ret = fprintf(fp, " [MUX] nr: %llu time: %llu\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time);
+ break;
+ case PERF_CORE_SB_OVERHEAD:
+ ret = fprintf(fp, " [SB] nr: %llu time: %llu\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time);
+ break;
+ default:
+ ret = fprintf(fp, " unhandled!\n");
+ }
+ return ret;
+}
+
size_t perf_event__fprintf(union perf_event *event, FILE *fp)
{
size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -1218,6 +1244,9 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
case PERF_RECORD_SWITCH_CPU_WIDE:
ret += perf_event__fprintf_switch(event, fp);
break;
+ case PERF_RECORD_OVERHEAD:
+ ret += perf_event__fprintf_overhead(event, fp);
+ break;
default:
ret += fprintf(fp, "\n");
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index dd4ec5c..a6c98ce 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -675,6 +675,7 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_overhead(union perf_event *event, FILE *fp);
u64 kallsyms__get_function_start(const char *kallsyms_filename,
const char *symbol_name);
--
2.5.5
From: Kan Liang <[email protected]>
Make get_nsecs globally visible so it can be used to collect monotonic
wall time. Introduce get_vnsecs to collect process CPU time.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-record.c | 9 +++++++++
tools/perf/builtin-sched.c | 2 +-
tools/perf/builtin.h | 2 ++
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 73b033b..e69099d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -65,6 +65,15 @@ struct record {
unsigned long long samples;
};
+u64 get_vnsecs(void)
+{
+ struct timespec ts;
+
+ clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
+
+ return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+}
+
static int record__write(struct record *rec, void *bf, size_t size)
{
if (perf_data_file__write(rec->session->file, bf, size) < 0) {
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index a49a032..35a301b 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -229,7 +229,7 @@ static struct thread **idle_threads;
static int idle_max_cpu;
static char idle_comm[] = "<idle>";
-static u64 get_nsecs(void)
+u64 get_nsecs(void)
{
struct timespec ts;
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index 0bcf68e..2a9a1db 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -12,6 +12,8 @@ const char *help_unknown_cmd(const char *cmd);
void prune_packed_objects(int);
int read_line_with_nul(char *buf, int size, FILE *file);
int check_pager_config(const char *cmd);
+u64 get_nsecs(void);
+u64 get_vnsecs(void);
int cmd_annotate(int argc, const char **argv, const char *prefix);
int cmd_bench(int argc, const char **argv, const char *prefix);
--
2.5.5
From: Kan Liang <[email protected]>
A new user-space record type, PERF_RECORD_USER_OVERHEAD, is introduced
to track the perf tool's own profiling time cost.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/event.c | 1 +
tools/perf/util/event.h | 5 +++++
tools/perf/util/session.c | 4 ++++
3 files changed, 10 insertions(+)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index e8736dc..764893c 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -48,6 +48,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_STAT_ROUND] = "STAT_ROUND",
[PERF_RECORD_EVENT_UPDATE] = "EVENT_UPDATE",
[PERF_RECORD_TIME_CONV] = "TIME_CONV",
+ [PERF_RECORD_USER_OVERHEAD] = "USER_OVERHEAD",
};
const char *perf_event__name(unsigned int id)
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index a6c98ce..6110f32 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -237,6 +237,7 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_STAT_ROUND = 77,
PERF_RECORD_EVENT_UPDATE = 78,
PERF_RECORD_TIME_CONV = 79,
+ PERF_RECORD_USER_OVERHEAD = 80,
PERF_RECORD_HEADER_MAX
};
@@ -490,6 +491,10 @@ struct time_conv_event {
u64 time_zero;
};
+enum perf_user_overhead_event_type { /* above any possible kernel type */
+ PERF_USER_OVERHEAD_TYPE_START = 100,
+};
+
struct perf_overhead {
struct perf_event_header header;
u64 type;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 1dc32e3..7745bec 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -804,6 +804,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_STAT_ROUND] = perf_event__stat_round_swap,
[PERF_RECORD_EVENT_UPDATE] = perf_event__event_update_swap,
[PERF_RECORD_TIME_CONV] = perf_event__all64_swap,
+ [PERF_RECORD_USER_OVERHEAD] = perf_event__all64_swap,
[PERF_RECORD_HEADER_MAX] = NULL,
};
@@ -1387,6 +1388,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
case PERF_RECORD_TIME_CONV:
session->time_conv = event->time_conv;
return tool->time_conv(tool, event, session);
+ case PERF_RECORD_USER_OVERHEAD:
+ overhead_stats_update(tool, session->evlist, event, -1);
+ return tool->overhead(tool, event, NULL, NULL);
default:
return -EINVAL;
}
--
2.5.5
From: Kan Liang <[email protected]>
The kernel overhead includes sample overhead, multiplexing overhead and
side-band events overhead. All the information is printed at the head
of the output file if the show-profiling-cost option is applied.
The result is sorted by CPU. If there is no CPU information available
in the sample, -1 is used instead.
Output when CPU is available
# To display the perf.data header info, please use
--header/--header-only options.
#
# ========
# CPU      SAM   SAM cost(ns)     MUX   MUX cost(ns)   SB   SB cost(ns)
#   0  2261446     1991216491  122174     1128716680    0            0
#   1    17360       67130324  122174     4498665956    0            0
#   2    25279      112575486  122174     4460407877    0            0
#   3    35532      157216449  122174     4411951469    0            0
... ...
#  61    16248       78159121  122172     4495896103    0            0
#  62    18911       90880108  122172     4488922500    0            0
#  63    63292      322048750  122171     4380579279    0            0
# ========
#
#
# Total Lost Samples: 0
Output when CPU is not available
# To display the perf.data header info, please use
--header/--header-only options.
#
# ========
# CPU      SAM   SAM cost(ns)     MUX   MUX cost(ns)      SB   SB cost(ns)
#  -1  2044182     1558283028  102380     2146692288 1237358   7556703763
# ========
#
#
# Total Lost Samples: 0
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-report.txt | 7 +++
tools/perf/util/event.h | 10 ++++
tools/perf/util/machine.c | 28 +++++++++--
tools/perf/util/session.c | 85 +++++++++++++++++++++++++++++---
4 files changed, 120 insertions(+), 10 deletions(-)
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 36d196c..50a89ff 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -414,6 +414,13 @@ include::itrace.txt[]
--show-profiling-cost:
Show extra time cost during perf profiling
+ Sort the extra time cost by CPU
+ If CPU information is not available in perf_sample, using -1 instead.
+ The time cost include:
+ - SAM: sample overhead. For x86, it's the time cost in perf NMI handler.
+ - MUX: multiplexing overhead. The time cost spends on rotate context.
+ - SB: side-band events overhead. The time cost spends on iterating all
+ events that need to receive side-band events.
include::callchain-overhead-calculation.txt[]
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index d96e215..dd4ec5c 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -245,6 +245,12 @@ enum auxtrace_error_type {
PERF_AUXTRACE_ERROR_MAX
};
+struct total_overhead {
+ struct perf_overhead_entry total_sample[MAX_NR_CPUS + 1];
+ struct perf_overhead_entry total_mux[MAX_NR_CPUS + 1];
+ struct perf_overhead_entry total_sb[MAX_NR_CPUS + 1];
+};
+
/*
* The kernel collects the number of events it couldn't send in a stretch and
* when possible sends this number in a PERF_RECORD_LOST event. The number of
@@ -262,6 +268,9 @@ enum auxtrace_error_type {
* multipling nr_events[PERF_EVENT_SAMPLE] by a frequency isn't possible to get
* the total number of low level events, it is necessary to to sum all struct
* sample_event.period and stash the result in total_period.
+ *
+ * The overhead tells the perf profiling time cost on each CPU. The cost can be
+ * divided into sampling cost, multiplexing cost and side-band events cost.
*/
struct events_stats {
u64 total_period;
@@ -270,6 +279,7 @@ struct events_stats {
u64 total_lost_samples;
u64 total_aux_lost;
u64 total_invalid_chains;
+ struct total_overhead overhead;
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
u32 nr_lost_warned;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 02c8f7a..72c4412 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -557,9 +557,31 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
int machine__process_overhead_event(struct machine *machine __maybe_unused,
union perf_event *event,
- struct perf_sample *sample __maybe_unused)
-{
- dump_printf("\tUNSUPPORT TYPE 0x%lx!\n", event->overhead.type);
+ struct perf_sample *sample)
+{
+ switch (event->overhead.type) {
+ case PERF_PMU_SAMPLE_OVERHEAD:
+ dump_printf(" SAMPLE nr: %llu time: %llu cpu %d\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ sample->cpu);
+ break;
+ case PERF_CORE_MUX_OVERHEAD:
+ dump_printf(" MULTIPLEXING nr: %llu time: %llu cpu %d\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ sample->cpu);
+ break;
+ case PERF_CORE_SB_OVERHEAD:
+ dump_printf(" SIDE-BAND nr: %llu time: %llu cpu %d\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time,
+ sample->cpu);
+ break;
+ default:
+ dump_printf("\tUNSUPPORT TYPE 0x%lx!\n", event->overhead.type);
+ return 0;
+ }
return 0;
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 9de4f74..1dc32e3 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1207,6 +1207,36 @@ static int
&sample->read.one, machine);
}
+static void
+overhead_stats_update(struct perf_tool *tool,
+ struct perf_evlist *evlist,
+ union perf_event *event,
+ int cpu)
+{
+ if (tool->overhead != perf_event__process_overhead)
+ return;
+
+ if (cpu < 0)
+ cpu = MAX_NR_CPUS;
+
+ switch (event->overhead.type) {
+ case PERF_PMU_SAMPLE_OVERHEAD:
+ evlist->stats.overhead.total_sample[cpu].nr += event->overhead.entry.nr;
+ evlist->stats.overhead.total_sample[cpu].time += event->overhead.entry.time;
+ break;
+ case PERF_CORE_MUX_OVERHEAD:
+ evlist->stats.overhead.total_mux[cpu].nr += event->overhead.entry.nr;
+ evlist->stats.overhead.total_mux[cpu].time += event->overhead.entry.time;
+ break;
+ case PERF_CORE_SB_OVERHEAD:
+ evlist->stats.overhead.total_sb[cpu].nr += event->overhead.entry.nr;
+ evlist->stats.overhead.total_sb[cpu].time += event->overhead.entry.time;
+ break;
+ default:
+ break;
+ }
+}
+
static int machines__deliver_event(struct machines *machines,
struct perf_evlist *evlist,
union perf_event *event,
@@ -1271,6 +1301,7 @@ static int machines__deliver_event(struct machines *machines,
case PERF_RECORD_SWITCH_CPU_WIDE:
return tool->context_switch(tool, event, sample, machine);
case PERF_RECORD_OVERHEAD:
+ overhead_stats_update(tool, evlist, event, sample->cpu);
return tool->overhead(tool, event, sample, machine);
default:
++evlist->stats.nr_unknown_events;
@@ -2038,20 +2069,60 @@ void perf_session__fprintf_info(struct perf_session *session, FILE *fp,
fprintf(fp, "# ========\n#\n");
}
-void perf_session__fprintf_overhead_info(struct perf_session *session, FILE *fp,
- const char *cpu_list __maybe_unused,
- unsigned long *cpu_bitmap __maybe_unused)
+static void
+perf_session__fprintf_overhead_body(struct perf_evlist *evlist,
+ FILE *fp, int cpu)
{
- if (session == NULL || fp == NULL)
+ if (!evlist->stats.overhead.total_sample[cpu].nr &&
+ !evlist->stats.overhead.total_mux[cpu].nr &&
+ !evlist->stats.overhead.total_sb[cpu].nr)
return;
- if (!perf_evlist__overhead(session->evlist)) {
- fprintf(fp, "#\n# No profiling time cost information available.\n#\n");
+ fprintf(stdout, "# %3d ", (cpu == MAX_NR_CPUS) ? -1 : cpu);
+
+ fprintf(stdout, "%10llu %14llu ",
+ evlist->stats.overhead.total_sample[cpu].nr,
+ evlist->stats.overhead.total_sample[cpu].time);
+ fprintf(stdout, "%10llu %14llu ",
+ evlist->stats.overhead.total_mux[cpu].nr,
+ evlist->stats.overhead.total_mux[cpu].time);
+ fprintf(stdout, "%10llu %14llu ",
+ evlist->stats.overhead.total_sb[cpu].nr,
+ evlist->stats.overhead.total_sb[cpu].time);
+ fprintf(fp, "\n");
+}
+
+void perf_session__fprintf_overhead_info(struct perf_session *session,
+ FILE *fp,
+ const char *cpu_list,
+ unsigned long *cpu_bitmap)
+{
+ struct perf_evlist *evlist;
+ int cpu;
+
+ if (session == NULL || fp == NULL)
return;
+ evlist = session->evlist;
+ if (!perf_evlist__overhead(evlist)) {
+ fprintf(fp, "#\n# No profiling time cost information available.\n#\n");
}
fprintf(fp, "# ========\n");
-
+ fprintf(fp, "# CPU");
+ fprintf(fp, " SAM SAM cost(ns)");
+ fprintf(fp, " MUX MUX cost(ns)");
+ fprintf(fp, " SB SB cost(ns)");
+ fprintf(fp, "\n");
+
+ perf_session__fprintf_overhead_body(evlist, fp, MAX_NR_CPUS);
+
+ if (perf_evlist__sample_id_all(evlist)) {
+ for (cpu = 0; cpu < session->header.env.nr_cpus_online; cpu++) {
+ if (cpu_list && !test_bit(cpu, cpu_bitmap))
+ continue;
+ perf_session__fprintf_overhead_body(evlist, fp, cpu);
+ }
+ }
fprintf(fp, "# ========\n#\n");
}
--
2.5.5
From: Kan Liang <[email protected]>
Perf collects overhead information by default.
A new option, --no-profile-cost-info, is introduced to disable the
collection.
Signed-off-by: Kan Liang <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 42 +++++++++++++++++++++++++++++++-
tools/perf/Documentation/perf-record.txt | 6 +++++
tools/perf/builtin-record.c | 2 ++
tools/perf/perf.h | 1 +
tools/perf/util/evsel.c | 1 +
5 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index c66a485..bdf2eec 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -344,7 +344,8 @@ struct perf_event_attr {
use_clockid : 1, /* use @clockid for time fields */
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end to beginning */
- __reserved_1 : 36;
+ overhead : 1, /* Log overhead information */
+ __reserved_1 : 35;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -862,6 +863,17 @@ enum perf_event_type {
*/
PERF_RECORD_SWITCH_CPU_WIDE = 15,
+ /*
+ * Records perf overhead
+ * struct {
+ * struct perf_event_header header;
+ * u64 type;
+ * struct perf_overhead_entry entry;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_OVERHEAD = 16,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -980,4 +992,32 @@ struct perf_branch_entry {
reserved:44;
};
+/*
+ * The overhead type could be different among architectures.
+ * The common overhead type can be defined from PERF_CORE_OVERHEAD
+ * The arch specific type should be defined from PERF_PMU_OVERHEAD
+ */
+enum perf_record_overhead_type {
+ PERF_CORE_OVERHEAD = 0,
+ PERF_CORE_MUX_OVERHEAD = 0,
+ PERF_CORE_SB_OVERHEAD,
+
+ PERF_PMU_OVERHEAD = 20,
+ PERF_PMU_SAMPLE_OVERHEAD = 20,
+
+ PERF_OVERHEAD_MAX,
+};
+
+/*
+ * single overhead record layout:
+ *
+ * nr: Times of overhead happens.
+ * E.g. for NMI, nr == times of NMI handler are called.
+ * time: Total overhead cost(ns)
+ */
+struct perf_overhead_entry {
+ __u64 nr;
+ __u64 time;
+};
+
#endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 27fc361..29a57f9 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -460,6 +460,12 @@ config terms. For example: 'cycles/overwrite/' and 'instructions/no-overwrite/'.
Implies --tail-synthesize.
+--no-profile-cost-info::
+Perf collects profiling time cost by default.
+The time cost includes sampling overhead, multiplexing overhead, side-band
+events overhead and so on.
+This option disables the profiling time cost information collection.
+
SEE ALSO
--------
linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 67d2a90..73b033b 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1512,6 +1512,8 @@ struct option __record_options[] = {
"Switch output when receive SIGUSR2"),
OPT_BOOLEAN(0, "dry-run", &dry_run,
"Parse options then exit"),
+ OPT_BOOLEAN(0, "no-profile-cost-info", &record.opts.no_profile_cost,
+ "Do not collect perf profiling time cost information"),
OPT_END()
};
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 9a0236a..59b16d3 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -55,6 +55,7 @@ struct record_opts {
bool all_user;
bool tail_synthesize;
bool overwrite;
+ bool no_profile_cost;
unsigned int freq;
unsigned int mmap_pages;
unsigned int auxtrace_mmap_pages;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index b2365a63..146c191 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -817,6 +817,7 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
attr->sample_id_all = perf_missing_features.sample_id_all ? 0 : 1;
attr->inherit = !opts->no_inherit;
attr->write_backward = opts->overwrite ? 1 : 0;
+ attr->overhead = !opts->no_profile_cost;
perf_evsel__set_sample_bit(evsel, IP);
perf_evsel__set_sample_bit(evsel, TID);
--
2.5.5
From: Kan Liang <[email protected]>
Add the infrastructure to handle the PERF_RECORD_OVERHEAD record type.
A new perf report option is introduced as a knob to show the profiling
overhead information. The option also forces tty output.
The overhead information is auxiliary information, so it should be
possible to access it together with normal sampling information in a
single output file. But the overhead is the overall profiling time
cost, not a per-sample/per-event cost, so making overhead a standard
sort key could confuse the normal samples.
Therefore, the information is printed separately at the head of the
output.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-report.txt | 3 +++
tools/perf/builtin-report.c | 16 +++++++++++++++-
tools/perf/util/event.c | 9 +++++++++
tools/perf/util/event.h | 11 +++++++++++
tools/perf/util/evlist.c | 6 ++++++
tools/perf/util/evlist.h | 1 +
tools/perf/util/machine.c | 10 ++++++++++
tools/perf/util/machine.h | 3 +++
tools/perf/util/session.c | 21 +++++++++++++++++++++
tools/perf/util/session.h | 3 +++
tools/perf/util/symbol.h | 3 ++-
tools/perf/util/tool.h | 1 +
12 files changed, 85 insertions(+), 2 deletions(-)
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 2d17462..36d196c 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -412,6 +412,9 @@ include::itrace.txt[]
--hierarchy::
Enable hierarchical output.
+--show-profiling-cost::
+	Show extra time cost during perf profiling.
+
include::callchain-overhead-calculation.txt[]
SEE ALSO
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 3dfbfff..d2f5e3c 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -565,6 +565,10 @@ static int __cmd_report(struct report *rep)
evlist__for_each_entry(session->evlist, pos)
rep->nr_entries += evsel__hists(pos)->nr_entries;
+ if (symbol_conf.show_profiling_cost) {
+ perf_session__fprintf_overhead_info(session, stdout, rep->cpu_list, rep->cpu_bitmap);
+ }
+
if (use_browser == 0) {
if (verbose > 3)
perf_session__fprintf(session, stdout);
@@ -830,6 +834,8 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
OPT_CALLBACK_DEFAULT(0, "stdio-color", NULL, "mode",
"'always' (default), 'never' or 'auto' only applicable to --stdio mode",
stdio__config_color, "always"),
+ OPT_BOOLEAN(0, "show-profiling-cost", &symbol_conf.show_profiling_cost,
+ "Show extra time cost during perf profiling"),
OPT_END()
};
struct perf_data_file file = {
@@ -957,7 +963,8 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
}
/* Force tty output for header output and per-thread stat. */
- if (report.header || report.header_only || report.show_threads)
+ if (report.header || report.header_only ||
+ report.show_threads || symbol_conf.show_profiling_cost)
use_browser = 0;
if (strcmp(input_name, "-") != 0)
@@ -986,6 +993,13 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
stdout);
}
+ if (!symbol_conf.show_profiling_cost &&
+ perf_evlist__overhead(session->evlist) &&
+ (use_browser == 0)) {
+ fputs("# To display the profiling time cost info, please use --show-profiling-cost options.\n#\n",
+ stdout);
+ }
+
/*
* Only in the TUI browser we are doing integrated annotation,
* so don't allocate extra space that won't be used in the stdio
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 8ab0d7d..58d095c 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -31,6 +31,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_LOST_SAMPLES] = "LOST_SAMPLES",
[PERF_RECORD_SWITCH] = "SWITCH",
[PERF_RECORD_SWITCH_CPU_WIDE] = "SWITCH_CPU_WIDE",
+ [PERF_RECORD_OVERHEAD] = "OVERHEAD",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
@@ -1056,6 +1057,14 @@ int perf_event__process_switch(struct perf_tool *tool __maybe_unused,
return machine__process_switch_event(machine, event);
}
+int perf_event__process_overhead(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused,
+ struct machine *machine)
+{
+ return machine__process_overhead_event(machine, event, sample);
+}
+
size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
{
return fprintf(fp, " %d/%d: [%#" PRIx64 "(%#" PRIx64 ") @ %#" PRIx64 "]: %c %s\n",
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index c735c53..d96e215 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -480,6 +480,12 @@ struct time_conv_event {
u64 time_zero;
};
+struct perf_overhead {
+ struct perf_event_header header;
+ u64 type;
+ struct perf_overhead_entry entry;
+};
+
union perf_event {
struct perf_event_header header;
struct mmap_event mmap;
@@ -509,6 +515,7 @@ union perf_event {
struct stat_event stat;
struct stat_round_event stat_round;
struct time_conv_event time_conv;
+ struct perf_overhead overhead;
};
void perf_event__print_totals(void);
@@ -587,6 +594,10 @@ int perf_event__process_switch(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
+int perf_event__process_overhead(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine);
int perf_event__process_mmap(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index d92e020..edcf421 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1594,6 +1594,12 @@ bool perf_evlist__sample_id_all(struct perf_evlist *evlist)
return first->attr.sample_id_all;
}
+bool perf_evlist__overhead(struct perf_evlist *evlist)
+{
+ struct perf_evsel *first = perf_evlist__first(evlist);
+ return first->attr.overhead;
+}
+
void perf_evlist__set_selected(struct perf_evlist *evlist,
struct perf_evsel *evsel)
{
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 4fd034f..6d8efa6 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -241,6 +241,7 @@ u64 __perf_evlist__combined_sample_type(struct perf_evlist *evlist);
u64 perf_evlist__combined_sample_type(struct perf_evlist *evlist);
u64 perf_evlist__combined_branch_type(struct perf_evlist *evlist);
bool perf_evlist__sample_id_all(struct perf_evlist *evlist);
+bool perf_evlist__overhead(struct perf_evlist *evlist);
u16 perf_evlist__id_hdr_size(struct perf_evlist *evlist);
int perf_evlist__parse_sample(struct perf_evlist *evlist, union perf_event *event,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 9b33bef..02c8f7a 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -555,6 +555,14 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
return 0;
}
+int machine__process_overhead_event(struct machine *machine __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused)
+{
+ dump_printf("\tUNSUPPORT TYPE 0x%lx!\n", event->overhead.type);
+ return 0;
+}
+
static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
{
const char *dup_filename;
@@ -1536,6 +1544,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
case PERF_RECORD_SWITCH:
case PERF_RECORD_SWITCH_CPU_WIDE:
ret = machine__process_switch_event(machine, event); break;
+ case PERF_RECORD_OVERHEAD:
+ ret = machine__process_overhead_event(machine, event, sample); break;
default:
ret = -1;
break;
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 354de6e..7e29e28 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -97,6 +97,9 @@ int machine__process_itrace_start_event(struct machine *machine,
union perf_event *event);
int machine__process_switch_event(struct machine *machine,
union perf_event *event);
+int machine__process_overhead_event(struct machine *machine,
+ union perf_event *event,
+ struct perf_sample *sample);
int machine__process_mmap_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
int machine__process_mmap2_event(struct machine *machine, union perf_event *event,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f268201..9de4f74 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -373,6 +373,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->itrace_start = perf_event__process_itrace_start;
if (tool->context_switch == NULL)
tool->context_switch = perf_event__process_switch;
+ if (tool->overhead == NULL)
+ tool->overhead = perf_event__process_overhead;
if (tool->read == NULL)
tool->read = process_event_sample_stub;
if (tool->throttle == NULL)
@@ -786,6 +788,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_LOST_SAMPLES] = perf_event__all64_swap,
[PERF_RECORD_SWITCH] = perf_event__switch_swap,
[PERF_RECORD_SWITCH_CPU_WIDE] = perf_event__switch_swap,
+ [PERF_RECORD_OVERHEAD] = perf_event__all64_swap,
[PERF_RECORD_HEADER_ATTR] = perf_event__hdr_attr_swap,
[PERF_RECORD_HEADER_EVENT_TYPE] = perf_event__event_type_swap,
[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1267,6 +1270,8 @@ static int machines__deliver_event(struct machines *machines,
case PERF_RECORD_SWITCH:
case PERF_RECORD_SWITCH_CPU_WIDE:
return tool->context_switch(tool, event, sample, machine);
+ case PERF_RECORD_OVERHEAD:
+ return tool->overhead(tool, event, sample, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
@@ -2033,6 +2038,22 @@ void perf_session__fprintf_info(struct perf_session *session, FILE *fp,
fprintf(fp, "# ========\n#\n");
}
+void perf_session__fprintf_overhead_info(struct perf_session *session, FILE *fp,
+ const char *cpu_list __maybe_unused,
+ unsigned long *cpu_bitmap __maybe_unused)
+{
+ if (session == NULL || fp == NULL)
+ return;
+
+ if (!perf_evlist__overhead(session->evlist)) {
+ fprintf(fp, "#\n# No profiling time cost information available.\n#\n");
+ return;
+ }
+
+ fprintf(fp, "# ========\n");
+
+ fprintf(fp, "# ========\n#\n");
+}
int __perf_session__set_tracepoints_handlers(struct perf_session *session,
const struct perf_evsel_str_handler *assocs,
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 4bd7585..7d08867 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -102,6 +102,9 @@ int perf_session__cpu_bitmap(struct perf_session *session,
void perf_session__fprintf_info(struct perf_session *s, FILE *fp, bool full);
+void perf_session__fprintf_overhead_info(struct perf_session *s, FILE *fp,
+ const char *cpu_list, unsigned long *cpu_bitmap);
+
struct perf_evsel_str_handler;
int __perf_session__set_tracepoints_handlers(struct perf_session *session,
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 1bcbefc..6902171 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -118,7 +118,8 @@ struct symbol_conf {
show_ref_callgraph,
hide_unresolved,
raw_trace,
- report_hierarchy;
+ report_hierarchy,
+ show_profiling_cost;
const char *vmlinux_name,
*kallsyms_name,
*source_prefix,
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index ac2590a..c5bbb34 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -47,6 +47,7 @@ struct perf_tool {
aux,
itrace_start,
context_switch,
+ overhead,
throttle,
unthrottle;
event_attr_op attr;
--
2.5.5
From: Kan Liang <[email protected]>
Iterating all events which need to receive side-band events also brings
some overhead.
The side-band events overhead PERF_CORE_SB_OVERHEAD is a common overhead
type.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 27 ++++++++++++++++++++++++---
3 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 351d321..fe4ca0b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -765,6 +765,8 @@ struct perf_event_context {
#endif
void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
+
+ struct perf_overhead_entry sb_overhead;
};
/*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 355086f..bdf2eec 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1000,6 +1000,7 @@ struct perf_branch_entry {
enum perf_record_overhead_type {
PERF_CORE_OVERHEAD = 0,
PERF_CORE_MUX_OVERHEAD = 0,
+ PERF_CORE_SB_OVERHEAD,
PERF_PMU_OVERHEAD = 20,
PERF_PMU_SAMPLE_OVERHEAD = 20,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 025a19d..85706fb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1830,8 +1830,12 @@ event_sched_out(struct perf_event *event,
if (event->attr.exclusive || !cpuctx->active_oncpu)
cpuctx->exclusive = 0;
- if (log_overhead && cpuctx->mux_overhead.nr)
- perf_log_overhead(event, PERF_CORE_MUX_OVERHEAD, &cpuctx->mux_overhead);
+ if (log_overhead) {
+ if (cpuctx->mux_overhead.nr)
+ perf_log_overhead(event, PERF_CORE_MUX_OVERHEAD, &cpuctx->mux_overhead);
+ if (ctx->sb_overhead.nr)
+ perf_log_overhead(event, PERF_CORE_SB_OVERHEAD, &ctx->sb_overhead);
+ }
perf_pmu_enable(event->pmu);
}
@@ -6131,6 +6135,13 @@ static void perf_iterate_sb_cpu(perf_iterate_f output, void *data)
}
}
+static void
+perf_calculate_sb_overhead(struct perf_event_context *ctx, u64 time)
+{
+ ctx->sb_overhead.nr++;
+ ctx->sb_overhead.time += time;
+}
+
/*
* Iterate all events that need to receive side-band events.
*
@@ -6141,9 +6152,12 @@ static void
perf_iterate_sb(perf_iterate_f output, void *data,
struct perf_event_context *task_ctx)
{
+ struct perf_event_context *overhead_ctx = task_ctx;
struct perf_event_context *ctx;
+ u64 start_clock, end_clock;
int ctxn;
+ start_clock = perf_clock();
rcu_read_lock();
preempt_disable();
@@ -6161,12 +6175,19 @@ perf_iterate_sb(perf_iterate_f output, void *data,
for_each_task_context_nr(ctxn) {
ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (ctx)
+ if (ctx) {
perf_iterate_ctx(ctx, output, data, false);
+ if (!overhead_ctx)
+ overhead_ctx = ctx;
+ }
}
done:
preempt_enable();
rcu_read_unlock();
+
+ end_clock = perf_clock();
+ if (overhead_ctx)
+ perf_calculate_sb_overhead(overhead_ctx, end_clock - start_clock);
}
/*
--
2.5.5
From: Kan Liang <[email protected]>
Warn the user if the overhead is too high.
The overhead evaluation metrics are as below.
- Kernel profiling cost evaluation
MAX(cpu0_overhead, cpu1_overhead,...) / perf monotonic wall time
If the rate is > 10%, a warning is printed.
- perf record profiling cost evaluation
CPU time of perf record / perf monotonic wall time
If the rate is > 50%, a warning is printed.
Here is an example of output.
Warning:
Perf kernel profiling cost is high! The cost rate is 15.30%
Please consider reducing the number of events, or increasing the period,
or decreasing the frequency.
# To display the perf.data header info, please use --header/--header-only options.
#
# ========
# Elapsed time: 120345432522(ns)
# Perf record cpu time: 505470324(ns)
#
#  CPU      SAM  SAM cost(ns)     MUX  MUX cost(ns)       SB  SB cost(ns)
#   -1  2247030    1705370307  111354    3111550168  1237358  13591276893
# ========
#
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/session.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 7cf9e1d..1188a1a 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1563,6 +1563,52 @@ perf_session__warn_order(const struct perf_session *session)
ui__warning("%u out of order events recorded.\n", oe->nr_unordered_events);
}
+static void
+perf_session__warn_overhead(const struct perf_session *session)
+{
+ const struct events_stats *stats = &session->evlist->stats;
+ double overhead_rate;
+ u64 overhead, max;
+ int cpu;
+ int i;
+
+	if (!stats->elapsed_time)
+		return;
+ overhead = stats->overhead.total_sample[MAX_NR_CPUS].time;
+ overhead += stats->overhead.total_mux[MAX_NR_CPUS].time;
+ overhead += stats->overhead.total_sb[MAX_NR_CPUS].time;
+ max = overhead;
+ cpu = -1;
+
+ for (i = 0; i < session->header.env.nr_cpus_online; i++) {
+ overhead = stats->overhead.total_sample[i].time;
+ overhead += stats->overhead.total_mux[i].time;
+ overhead += stats->overhead.total_sb[i].time;
+ if (overhead > max) {
+ max = overhead;
+ cpu = i;
+ }
+ }
+ overhead_rate = (double)max / (double)stats->elapsed_time;
+
+	if (overhead_rate > 0.1) {
+		if (cpu >= 0) {
+			ui__warning("Perf kernel profiling cost is high! The cost rate is %3.2f%% on CPU %d\n\n"
+				    "Please consider reducing the number of events, or increasing the period, or decreasing the frequency.\n\n",
+				    overhead_rate * 100.0, cpu);
+		} else {
+			ui__warning("Perf kernel profiling cost is high! The cost rate is %3.2f%%\n\n"
+				    "Please consider reducing the number of events, or increasing the period, or decreasing the frequency.\n\n",
+				    overhead_rate * 100.0);
+		}
+	}
+
+ overhead_rate = (double)stats->cpu_time / (double)stats->elapsed_time;
+ if (overhead_rate > 0.5) {
+ ui__warning("Perf record profiling cost is high! The cost rate is %3.2f%%\n\n"
+			    "Please consider reducing the number of events, or increasing the period, or decreasing the frequency.\n\n",
+ overhead_rate * 100.0);
+ }
+}
+
static void perf_session__warn_about_errors(const struct perf_session *session)
{
const struct events_stats *stats = &session->evlist->stats;
@@ -1636,6 +1682,8 @@ static void perf_session__warn_about_errors(const struct perf_session *session)
"Increase it by --proc-map-timeout\n",
stats->nr_proc_map_timeout);
}
+
+ perf_session__warn_overhead(session);
}
static int perf_session__flush_thread_stack(struct thread *thread,
--
2.5.5
From: Kan Liang <[email protected]>
Multiplexing overhead is one of the key overheads when the number of
events exceeds the number of available counters.
The multiplexing overhead PERF_CORE_MUX_OVERHEAD is a common overhead
type.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 14 ++++++++++++++
3 files changed, 17 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ebd356e..351d321 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -794,6 +794,8 @@ struct perf_cpu_context {
struct list_head sched_cb_entry;
int sched_cb_usage;
+
+ struct perf_overhead_entry mux_overhead;
};
struct perf_output_handle {
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index fe7b1fb..355086f 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -999,6 +999,7 @@ struct perf_branch_entry {
*/
enum perf_record_overhead_type {
PERF_CORE_OVERHEAD = 0,
+ PERF_CORE_MUX_OVERHEAD = 0,
PERF_PMU_OVERHEAD = 20,
PERF_PMU_SAMPLE_OVERHEAD = 20,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 306bc92..025a19d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1830,6 +1830,9 @@ event_sched_out(struct perf_event *event,
if (event->attr.exclusive || !cpuctx->active_oncpu)
cpuctx->exclusive = 0;
+ if (log_overhead && cpuctx->mux_overhead.nr)
+ perf_log_overhead(event, PERF_CORE_MUX_OVERHEAD, &cpuctx->mux_overhead);
+
perf_pmu_enable(event->pmu);
}
@@ -3330,9 +3333,17 @@ static void rotate_ctx(struct perf_event_context *ctx)
list_rotate_left(&ctx->flexible_groups);
}
+static void
+perf_calculate_mux_overhead(struct perf_cpu_context *cpuctx, u64 time)
+{
+ cpuctx->mux_overhead.nr++;
+ cpuctx->mux_overhead.time += time;
+}
+
static int perf_rotate_context(struct perf_cpu_context *cpuctx)
{
struct perf_event_context *ctx = NULL;
+ u64 start_clock, end_clock;
int rotate = 0;
if (cpuctx->ctx.nr_events) {
@@ -3349,6 +3360,7 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
if (!rotate)
goto done;
+ start_clock = perf_clock();
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_pmu_disable(cpuctx->ctx.pmu);
@@ -3364,6 +3376,8 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ end_clock = perf_clock();
+ perf_calculate_mux_overhead(cpuctx, end_clock - start_clock);
done:
return rotate;
--
2.5.5
From: Kan Liang <[email protected]>
Outputting every overhead record when it happens is very costly. The
accumulated time is more meaningful, so the overhead information should
be output at the very end.
The overhead information is output when the task is scheduled out or the
event is going to be disabled.
The arch-specific overhead is output in the event's pmu delete callback,
when the PERF_EF_LOG flag is set.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 2 ++
kernel/events/core.c | 9 ++++++++-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5bc8156..ebd356e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -313,6 +313,7 @@ struct pmu {
#define PERF_EF_START 0x01 /* start the counter when adding */
#define PERF_EF_RELOAD 0x02 /* reload the counter when starting */
#define PERF_EF_UPDATE 0x04 /* update the counter when stopping */
+#define PERF_EF_LOG 0x08 /* log overhead information */
/*
* Adds/Removes a counter to/from the PMU, can be done inside a
@@ -741,6 +742,7 @@ struct perf_event_context {
int nr_stat;
int nr_freq;
int rotate_disable;
+ int log_overhead;
atomic_t refcount;
struct task_struct *task;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5312744..306bc92 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1787,6 +1787,7 @@ event_sched_out(struct perf_event *event,
struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx)
{
+	bool log_overhead = needs_log_overhead(event) && ctx->log_overhead;
u64 tstamp = perf_event_time(event);
u64 delta;
@@ -1812,7 +1813,7 @@ event_sched_out(struct perf_event *event,
perf_pmu_disable(event->pmu);
event->tstamp_stopped = tstamp;
- event->pmu->del(event, 0);
+ event->pmu->del(event, log_overhead ? PERF_EF_LOG : 0);
event->oncpu = -1;
event->state = PERF_EVENT_STATE_INACTIVE;
if (event->pending_disable) {
@@ -1914,6 +1915,9 @@ static void __perf_event_disable(struct perf_event *event,
if (event->state < PERF_EVENT_STATE_INACTIVE)
return;
+ /* log overhead when disable event */
+ ctx->log_overhead = true;
+
update_context_time(ctx);
update_cgrp_time_from_event(event);
update_group_times(event);
@@ -10177,6 +10181,9 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
if (!child_ctx)
return;
+ /* log overhead when exit task context */
+ child_ctx->log_overhead = true;
+
/*
* In order to reduce the amount of tricky in ctx tear-down, we hold
* ctx::mutex over the entire thing. This serializes against almost
--
2.5.5
From: Kan Liang <[email protected]>
On x86, the NMI handler is the most important part which brings overhead
for sampling. Add a pmu-specific overhead type,
PERF_PMU_SAMPLE_OVERHEAD, for it.
Other architectures which don't have an NMI can reuse this overhead
type.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 17 ++++++++++++++++-
arch/x86/events/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 1 +
3 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 9d4bf3a..de40f96 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1397,6 +1397,9 @@ static void x86_pmu_del(struct perf_event *event, int flags)
perf_event_update_userpage(event);
+ if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr)
+ perf_log_overhead(event, PERF_PMU_SAMPLE_OVERHEAD, &cpuc->nmi_overhead);
+
do_del:
if (x86_pmu.del) {
/*
@@ -1475,11 +1478,21 @@ void perf_events_lapic_init(void)
apic_write(APIC_LVTPC, APIC_DM_NMI);
}
+static void
+perf_calculate_nmi_overhead(u64 time)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ cpuc->nmi_overhead.nr++;
+ cpuc->nmi_overhead.time += time;
+}
+
static int
perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
u64 start_clock;
u64 finish_clock;
+ u64 clock;
int ret;
/*
@@ -1492,8 +1505,10 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
start_clock = sched_clock();
ret = x86_pmu.handle_irq(regs);
finish_clock = sched_clock();
+ clock = finish_clock - start_clock;
- perf_sample_event_took(finish_clock - start_clock);
+ perf_calculate_nmi_overhead(clock);
+ perf_sample_event_took(clock);
return ret;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a77ee02..7a03384 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -230,6 +230,8 @@ struct cpu_hw_events {
struct intel_excl_cntrs *excl_cntrs;
int excl_thread_id; /* 0 or 1 */
+ struct perf_overhead_entry nmi_overhead;
+
/*
* AMD specific bits
*/
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb0ecf0..fe7b1fb 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1001,6 +1001,7 @@ enum perf_record_overhead_type {
PERF_CORE_OVERHEAD = 0,
PERF_PMU_OVERHEAD = 20,
+ PERF_PMU_SAMPLE_OVERHEAD = 20,
PERF_OVERHEAD_MAX,
};
--
2.5.5
From: Kan Liang <[email protected]>
Record the CPU time and elapsed time of perf record.
Show them in perf report.
# To display the perf.data header info, please use --header/--header-only options.
#
# ========
# Elapsed time: 1199332708(ns)
# Perf record cpu time: 197780038(ns)
#
#  CPU  SAM  SAM cost(ns)    MUX  MUX cost(ns)  SB  SB cost(ns)
#    0    2         13659  12101       6111457   0            0
#    1    1          8160  12101      46062331   0            0
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-record.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-
tools/perf/util/event.h | 4 ++++
tools/perf/util/machine.c | 5 +++++
tools/perf/util/session.c | 8 +++++++
4 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index e69099d..405eef4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -49,6 +49,8 @@ struct record {
struct perf_tool tool;
struct record_opts opts;
u64 bytes_written;
+ u64 record_cpu_time;
+ u64 elapsed_time;
struct perf_data_file file;
struct auxtrace_record *itr;
struct perf_evlist *evlist;
@@ -746,6 +748,47 @@ static const struct perf_event_mmap_page *record__pick_pc(struct record *rec)
return NULL;
}
+static int perf_event__synth_overhead(struct record *rec, u64 type, u64 time,
+ perf_event__handler_t process)
+{
+ union perf_event event = {
+ .overhead = {
+ .header = {
+ .type = PERF_RECORD_USER_OVERHEAD,
+ .size = sizeof(struct perf_overhead),
+ },
+ .type = type,
+ .entry = {
+ .nr = 1,
+ .time = time,
+ },
+ },
+ };
+
+ return process(&rec->tool, &event, NULL, NULL);
+}
+
+static int perf_event__synth_overheads(struct record *rec)
+{
+ int err;
+
+ err = perf_event__synth_overhead(rec, PERF_USER_CPU_TIME,
+ (get_vnsecs() - rec->record_cpu_time),
+ process_synthesized_event);
+ if (err < 0)
+ return err;
+ rec->record_cpu_time = get_vnsecs();
+
+ err = perf_event__synth_overhead(rec, PERF_USER_ELAPSED_TIME,
+ (get_nsecs() - rec->elapsed_time),
+ process_synthesized_event);
+ if (err < 0)
+ return err;
+ rec->elapsed_time = get_nsecs();
+
+ return 0;
+}
+
static int record__synthesize(struct record *rec, bool tail)
{
struct perf_session *session = rec->session;
@@ -757,7 +800,7 @@ static int record__synthesize(struct record *rec, bool tail)
int err = 0;
if (rec->opts.tail_synthesize != tail)
- return 0;
+ goto out;
if (file->is_pipe) {
err = perf_event__synthesize_attrs(tool, session,
@@ -819,6 +862,10 @@ static int record__synthesize(struct record *rec, bool tail)
process_synthesized_event, opts->sample_address,
opts->proc_map_timeout);
out:
+ if (tail && perf_evlist__overhead(session->evlist) &&
+ (perf_event__synth_overheads(rec) < 0))
+ pr_err("Couldn't synthesize user overhead information.\n");
+
return err;
}
@@ -1555,6 +1602,9 @@ int cmd_record(int argc, const char **argv, const char *prefix __maybe_unused)
# undef REASON
#endif
+ rec->record_cpu_time = get_vnsecs();
+ rec->elapsed_time = get_nsecs();
+
rec->evlist = perf_evlist__new();
if (rec->evlist == NULL)
return -ENOMEM;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 6110f32..9c69c34 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -280,6 +280,8 @@ struct events_stats {
u64 total_lost_samples;
u64 total_aux_lost;
u64 total_invalid_chains;
+ u64 cpu_time;
+ u64 elapsed_time;
struct total_overhead overhead;
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
@@ -493,6 +495,8 @@ struct time_conv_event {
enum perf_user_overhead_event_type { /* above any possible kernel type */
PERF_USER_OVERHEAD_TYPE_START = 100,
+ PERF_USER_CPU_TIME = 100,
+ PERF_USER_ELAPSED_TIME,
};
struct perf_overhead {
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 72c4412..b95104a 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -578,6 +578,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
event->overhead.entry.time,
sample->cpu);
break;
+ case PERF_USER_CPU_TIME:
+ dump_printf(" CPU time: %llu\n", event->overhead.entry.time);
+ break;
+ case PERF_USER_ELAPSED_TIME:
+		dump_printf(" elapsed time: %llu\n", event->overhead.entry.time);
+		break;
default:
dump_printf("\tUNSUPPORT TYPE 0x%lx!\n", event->overhead.type);
return 0;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 7745bec..7cf9e1d 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1233,6 +1233,12 @@ overhead_stats_update(struct perf_tool *tool,
evlist->stats.overhead.total_sb[cpu].nr += event->overhead.entry.nr;
evlist->stats.overhead.total_sb[cpu].time += event->overhead.entry.time;
break;
+ case PERF_USER_CPU_TIME:
+ evlist->stats.cpu_time += event->overhead.entry.time;
+ break;
+ case PERF_USER_ELAPSED_TIME:
+ evlist->stats.elapsed_time += event->overhead.entry.time;
+ break;
default:
break;
}
@@ -2112,6 +2118,8 @@ void perf_session__fprintf_overhead_info(struct perf_session *session,
}
fprintf(fp, "# ========\n");
+ fprintf(fp, "# Elapsed time: %lu(ns)\n", evlist->stats.elapsed_time);
+ fprintf(fp, "# Perf record cpu time: %lu(ns)\n#\n", evlist->stats.cpu_time);
fprintf(fp, "# CPU");
fprintf(fp, " SAM SAM cost(ns)");
fprintf(fp, " MUX MUX cost(ns)");
--
2.5.5
On Fri, Dec 02, 2016 at 04:19:17PM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Shows kernel overhead in perf script.
>
> The output is as below:
>
> perf script --show-profiling-cost-events
> perf 29001 79989.093958: 1 cycles:
> ffffffff81064ca6 native_write_msr (/lib/
> sleep 29001 79989.094282: 7661 cycles:
> ffffffff810dc433 update_blocked_averages
> sleep 29001 79989.094294: 7442 cycles:
> ffffffff81810f60 irq_work_interrupt (/li
> sleep 29001 79989.094305: 25466 cycles:
> ffffffff813ca410 radix_tree_next_chunk (
> sleep 29001 79989.094340: 94368 cycles:
> ffffffff8180fa90 page_fault (/lib/module
> sleep 29001 79989.094459: 167362 cycles:
> ffffffff811e3f79 alloc_set_pte (/lib/mod
> sleep 29001 79989.094672: 190283 cycles:
> 7f5d7c91d8e7 _dl_addr (/usr/lib64/li
> sleep 29001 79991.094978: 194526 cycles:
> ffffffff811e0579 __tlb_remove_page_size.
> sleep 29001 79991.095061: PERF_RECORD_OVERHEAD [SAMPLE] nr: 8
> time: 28110
> sleep 29001 79991.095062: PERF_RECORD_OVERHEAD [SB] nr: 24
> time: 41397
hi,
got segfault by running:
[jolsa@ibm-x3650m4-01 perf]$ catchsegv ./perf --no-pager script --show-profiling-cost-events
Backtrace:
/home/jolsa/linux-perf/tools/perf/builtin-script.c:1281(process_overhead_event)[0x4351c4]
/home/jolsa/linux-perf/tools/perf/util/session.c:1399(perf_session__process_user_event)[0x4e997b]
/home/jolsa/linux-perf/tools/perf/util/session.c:1501(perf_session__process_event)[0x4e9de3]
/home/jolsa/linux-perf/tools/perf/util/session.c:1918 (discriminator 1)(__perf_session__process_events)[0x4eae1b]
/home/jolsa/linux-perf/tools/perf/util/session.c:1972(perf_session__process_events)[0x4eb019]
/home/jolsa/linux-perf/tools/perf/builtin-script.c:1321(__cmd_script)[0x43538c]
/home/jolsa/linux-perf/tools/perf/builtin-script.c:2498(cmd_script)[0x438f8e]
/home/jolsa/linux-perf/tools/perf/perf.c:358(run_builtin)[0x495452]
/home/jolsa/linux-perf/tools/perf/perf.c:420(handle_internal_command)[0x4956bf]
/home/jolsa/linux-perf/tools/perf/perf.c:469(run_argv)[0x495804]
/home/jolsa/linux-perf/tools/perf/perf.c:610(main)[0x495bed]
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7f5932390401]
??:?(_start)[0x40a80a]
??:0(??)[0x0]
jirka
> On Fri, Dec 02, 2016 at 04:19:17PM -0500, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > Shows kernel overhead in perf script.
> >
> > The output is as below:
> >
> > perf script --show-profiling-cost-events
> > perf 29001 79989.093958: 1 cycles:
> > ffffffff81064ca6 native_write_msr (/lib/
> > sleep 29001 79989.094282: 7661 cycles:
> > ffffffff810dc433 update_blocked_averages
> > sleep 29001 79989.094294: 7442 cycles:
> > ffffffff81810f60 irq_work_interrupt (/li
> > sleep 29001 79989.094305: 25466 cycles:
> > ffffffff813ca410 radix_tree_next_chunk (
> > sleep 29001 79989.094340: 94368 cycles:
> > ffffffff8180fa90 page_fault (/lib/module
> > sleep 29001 79989.094459: 167362 cycles:
> > ffffffff811e3f79 alloc_set_pte (/lib/mod
> > sleep 29001 79989.094672: 190283 cycles:
> > 7f5d7c91d8e7 _dl_addr (/usr/lib64/li
> > sleep 29001 79991.094978: 194526 cycles:
> > ffffffff811e0579 __tlb_remove_page_size.
> > sleep 29001 79991.095061: PERF_RECORD_OVERHEAD [SAMPLE] nr:
> 8
> > time: 28110
> > sleep 29001 79991.095062: PERF_RECORD_OVERHEAD [SB] nr: 24
> > time: 41397
>
> hi,
> got segfault by running:
>
> [jolsa@ibm-x3650m4-01 perf]$ catchsegv ./perf --no-pager script --show-
> profiling-cost-events
>
Thanks for the testing.
I only tested it with the kernel overhead event, but didn't redo the test
after adding the user overhead event. :(
The user overhead event doesn't carry a sample, so it's not supported for now.
The patch below will fix it. I will include the fix in V3.
BTW: the perf script patch is a standalone patch. It does not impact the other patches.
I think you can still play with the rest of the V2 patches.
Thanks,
Kan
------
From 03360f4083c30a19f8985ac07893e65e8de7a355 Mon Sep 17 00:00:00 2001
From: Kan Liang <[email protected]>
Date: Mon, 5 Dec 2016 08:56:28 -0500
Subject: [PATCH 09/13] perf script: show kernel overhead
Shows kernel overhead in perf script.
The output is as below:
perf script --show-profiling-cost-events
perf 29001 79989.093958: 1 cycles:
ffffffff81064ca6 native_write_msr (/lib/
sleep 29001 79989.094282: 7661 cycles:
ffffffff810dc433 update_blocked_averages
sleep 29001 79989.094294: 7442 cycles:
ffffffff81810f60 irq_work_interrupt (/li
sleep 29001 79989.094305: 25466 cycles:
ffffffff813ca410 radix_tree_next_chunk (
sleep 29001 79989.094340: 94368 cycles:
ffffffff8180fa90 page_fault (/lib/module
sleep 29001 79989.094459: 167362 cycles:
ffffffff811e3f79 alloc_set_pte (/lib/mod
sleep 29001 79989.094672: 190283 cycles:
7f5d7c91d8e7 _dl_addr (/usr/lib64/li
sleep 29001 79991.094978: 194526 cycles:
ffffffff811e0579 __tlb_remove_page_size.
sleep 29001 79991.095061: PERF_RECORD_OVERHEAD [SAMPLE] nr: 8
time: 28110
sleep 29001 79991.095062: PERF_RECORD_OVERHEAD [SB] nr: 24
time: 41397
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-script.txt | 3 +++
tools/perf/builtin-script.c | 36 ++++++++++++++++++++++++++++++++
tools/perf/util/event.c | 29 +++++++++++++++++++++++++
tools/perf/util/event.h | 1 +
4 files changed, 69 insertions(+)
diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index c01904f..b371023 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -289,6 +289,9 @@ include::itrace.txt[]
--force::
Don't do ownership validation.
+--show-profiling-cost-events::
+ Display perf profiling time cost related event (PERF_RECORD_OVERHEAD)
+
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-script-perl[1],
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e1daff3..01019de 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -829,6 +829,7 @@ struct perf_script {
bool show_task_events;
bool show_mmap_events;
bool show_switch_events;
+ bool show_profiling_cost_events;
bool allocated;
struct cpu_map *cpus;
struct thread_map *threads;
@@ -1264,6 +1265,37 @@ static int process_switch_event(struct perf_tool *tool,
return 0;
}
+static int process_overhead_event(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ struct thread *thread;
+ struct perf_script *script = container_of(tool, struct perf_script, tool);
+ struct perf_session *session = script->session;
+ struct perf_evsel *evsel;
+
+ if (perf_event__process_overhead(tool, event, sample, machine) < 0)
+ return -1;
+
+	/* No sample for user overhead event. Not supported for now. */
+ if (!sample)
+ return 0;
+
+ evsel = perf_evlist__id2evsel(session->evlist, sample->id);
+ thread = machine__findnew_thread(machine, sample->pid, sample->tid);
+ if (thread == NULL) {
+ pr_debug("problem processing OVERHEAD event, skipping it.\n");
+ return -1;
+ }
+
+ print_sample_start(sample, thread, evsel);
+ perf_event__fprintf(event, stdout);
+ thread__put(thread);
+
+ return 0;
+}
+
static void sig_handler(int sig __maybe_unused)
{
session_done = 1;
@@ -1287,6 +1319,8 @@ static int __cmd_script(struct perf_script *script)
}
if (script->show_switch_events)
script->tool.context_switch = process_switch_event;
+ if (script->show_profiling_cost_events)
+ script->tool.overhead = process_overhead_event;
ret = perf_session__process_events(script->session);
@@ -2172,6 +2206,8 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
"Show the mmap events"),
OPT_BOOLEAN('\0', "show-switch-events", &script.show_switch_events,
"Show context switch events (if recorded)"),
+ OPT_BOOLEAN('\0', "show-profiling-cost-events", &script.show_profiling_cost_events,
+ "Show perf profiling time cost events"),
OPT_BOOLEAN('f', "force", &file.force, "don't complain, do it"),
OPT_BOOLEAN(0, "ns", &nanosecs,
"Use 9 decimal places when displaying time"),
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 58d095c..e8736dc 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -1189,6 +1189,32 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp)
event->context_switch.next_prev_tid);
}
+size_t perf_event__fprintf_overhead(union perf_event *event, FILE *fp)
+{
+ size_t ret;
+
+ switch (event->overhead.type) {
+ case PERF_PMU_SAMPLE_OVERHEAD:
+ ret = fprintf(fp, " [SAMPLE] nr: %llu time: %llu\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time);
+ break;
+ case PERF_CORE_MUX_OVERHEAD:
+ ret = fprintf(fp, " [MUX] nr: %llu time: %llu\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time);
+ break;
+ case PERF_CORE_SB_OVERHEAD:
+ ret = fprintf(fp, " [SB] nr: %llu time: %llu\n",
+ event->overhead.entry.nr,
+ event->overhead.entry.time);
+ break;
+ default:
+ ret = fprintf(fp, " unhandled!\n");
+ }
+ return ret;
+}
+
size_t perf_event__fprintf(union perf_event *event, FILE *fp)
{
size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -1218,6 +1244,9 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
case PERF_RECORD_SWITCH_CPU_WIDE:
ret += perf_event__fprintf_switch(event, fp);
break;
+ case PERF_RECORD_OVERHEAD:
+ ret += perf_event__fprintf_overhead(event, fp);
+ break;
default:
ret += fprintf(fp, "\n");
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index dd4ec5c..a6c98ce 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -675,6 +675,7 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
size_t perf_event__fprintf(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_overhead(union perf_event *event, FILE *fp);
u64 kallsyms__get_function_start(const char *kallsyms_filename,
const char *symbol_name);
--
2.5.5
On Fri, Dec 02, 2016 at 04:19:15PM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> The infrastructure to handle PERF_RECORD_OVERHEAD record type.
> A new perf report option is introduced as a knob to show the profiling
> overhead information. The option also forces tty output.
>
> The overhead information is auxiliary information, so it should be
> possible to access it together with normal sampling information in a
> single output file. But the overhead is overall profiling time cost, not
> per-sample/event profiling time cost. Making overhead a standard sort
> key could confuse the normal samples.
> So the information is printed separately in the head of the output.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/Documentation/perf-report.txt | 3 +++
> tools/perf/builtin-report.c | 16 +++++++++++++++-
> tools/perf/util/event.c | 9 +++++++++
> tools/perf/util/event.h | 11 +++++++++++
> tools/perf/util/evlist.c | 6 ++++++
> tools/perf/util/evlist.h | 1 +
> tools/perf/util/machine.c | 10 ++++++++++
> tools/perf/util/machine.h | 3 +++
> tools/perf/util/session.c | 21 +++++++++++++++++++++
> tools/perf/util/session.h | 3 +++
> tools/perf/util/symbol.h | 3 ++-
> tools/perf/util/tool.h | 1 +
> 12 files changed, 85 insertions(+), 2 deletions(-)
I think perf inject also needs to repipe this overhead event
jirka
On Fri, Dec 02, 2016 at 04:19:16PM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> The kernel overhead include sample overhead, multiplexing overhead and
> side-band events overhead. All the information is printed in the head of
> output file if the show profiling cost option is applied.
> The result is sorted by CPU. If there is no CPU information available in
> the sample, -1 is used instead.
>
> Output when CPU is available
> # To display the perf.data header info, please use
> --header/--header-only options.
> #
> # ========
> # CPU SAM SAM cost(ns) MUX MUX cost(ns) SB
> SB cost(ns)
> # 0 2261446 1991216491 122174 1128716680 0
- please make this output properly aligned
- it'd be nice to see some 1000 delimiters
- do we need to display the rest of the perf output?
  it seems too much to me already
thanks,
jirka
On Fri, Dec 02, 2016 at 04:19:16PM -0500, [email protected] wrote:
SNIP
> --show-profiling-cost:
> Show extra time cost during perf profiling
> + Sort the extra time cost by CPU
> + If CPU information is not available in perf_sample, using -1 instead.
> + The time cost include:
> + - SAM: sample overhead. For x86, it's the time cost in perf NMI handler.
> + - MUX: multiplexing overhead. The time cost spends on rotate context.
> + - SB: side-band events overhead. The time cost spends on iterating all
> + events that need to receive side-band events.
>
> include::callchain-overhead-calculation.txt[]
>
> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
> index d96e215..dd4ec5c 100644
> --- a/tools/perf/util/event.h
> +++ b/tools/perf/util/event.h
> @@ -245,6 +245,12 @@ enum auxtrace_error_type {
> PERF_AUXTRACE_ERROR_MAX
> };
>
> +struct total_overhead {
> + struct perf_overhead_entry total_sample[MAX_NR_CPUS + 1];
> + struct perf_overhead_entry total_mux[MAX_NR_CPUS + 1];
> + struct perf_overhead_entry total_sb[MAX_NR_CPUS + 1];
> +};
I think this should be either:
- dynamically allocated (there's cpu count available in the session)
- or made static within perf report (as in shadow stats) and counted
in report's overhead tool callback
I also don't like that you do the process-related counts
in the xxx[MAX_NR_CPUS] member, we should have separated
struct perf_overhead_entry for that
jirka
On Fri, Dec 02, 2016 at 04:19:16PM -0500, [email protected] wrote:
SNIP
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index 02c8f7a..72c4412 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -557,9 +557,31 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
>
> int machine__process_overhead_event(struct machine *machine __maybe_unused,
> union perf_event *event,
> - struct perf_sample *sample __maybe_unused)
> -{
> - dump_printf("\tUNSUPPORT TYPE 0x%lx!\n", event->overhead.type);
> + struct perf_sample *sample)
> +{
> + switch (event->overhead.type) {
> + case PERF_PMU_SAMPLE_OVERHEAD:
> + dump_printf(" SAMPLE nr: %llu time: %llu cpu %d\n",
> + event->overhead.entry.nr,
> + event->overhead.entry.time,
> + sample->cpu);
> + break;
> + case PERF_CORE_MUX_OVERHEAD:
> + dump_printf(" MULTIPLEXING nr: %llu time: %llu cpu %d\n",
> + event->overhead.entry.nr,
> + event->overhead.entry.time,
> + sample->cpu);
> + break;
> + case PERF_CORE_SB_OVERHEAD:
> + dump_printf(" SIDE-BAND nr: %llu time: %llu cpu %d\n",
> + event->overhead.entry.nr,
> + event->overhead.entry.time,
> + sample->cpu);
> + break;
> + default:
> + dump_printf("\tUNSUPPORT TYPE 0x%lx!\n", event->overhead.type);
> + return 0;
> + }
> return 0;
normally we call the perf_event__fprintf handlers here, which you're
adding in the following patch.. the above code seems redundant
thanks,
jirka
On Fri, Dec 02, 2016 at 04:19:18PM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Make get_nsecs visible to collect monotonic wall time.
> Introduce get_vnsecs to collect CPU time.
David just recently added util/time-utils.c, could you
please place both those functions in there?
thanks,
jirka
On Fri, Dec 02, 2016 at 04:19:11PM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> On x86, NMI handler is the most important part which brings overhead
> for sampling. Adding a pmu specific overhead type
> PERF_PMU_SAMPLE_OVERHEAD for it.
>
> For other architectures which may don't have NMI, the overhead type can
> be reused.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/events/core.c | 17 ++++++++++++++++-
> arch/x86/events/perf_event.h | 2 ++
> include/uapi/linux/perf_event.h | 1 +
> 3 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 9d4bf3a..de40f96 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1397,6 +1397,9 @@ static void x86_pmu_del(struct perf_event *event, int flags)
>
> perf_event_update_userpage(event);
>
> + if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr)
> + perf_log_overhead(event, PERF_PMU_SAMPLE_OVERHEAD, &cpuc->nmi_overhead);
> +
> do_del:
> if (x86_pmu.del) {
> /*
That's not at all mentioned in the changelog, and it clearly isn't
nmi_overhead.
On Fri, Dec 02, 2016 at 04:19:10PM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Outputting every overhead when it happens is very costly. The accumulated
> time is more meaningful. So the overhead information should be outputted
> at the very end.
>
> The overhead information is outputted when task is scheduling out or the
> event is going to be disabled.
Why? The context switch stuff can happen very frequently. Doesn't seem
like a good place to add overhead.
> The arch specific overhead is outputted in event pmu delete, when Flag
> PERF_EF_LOG is set.
On Fri, Dec 02, 2016 at 04:19:12PM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Multiplexing overhead is one of the key overhead when the number of
> events is more than available counters.
>
> The multiplexing overhead PERF_CORE_MUX_OVERHEAD is a common overhead
> type.
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index fe7b1fb..355086f 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -999,6 +999,7 @@ struct perf_branch_entry {
> */
> enum perf_record_overhead_type {
> PERF_CORE_OVERHEAD = 0,
> + PERF_CORE_MUX_OVERHEAD = 0,
'0' already had a name ?!
>
> PERF_PMU_OVERHEAD = 20,
> PERF_PMU_SAMPLE_OVERHEAD = 20,
Same here I suppose, why are there two with the same name?
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 306bc92..025a19d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1830,6 +1830,9 @@ event_sched_out(struct perf_event *event,
> if (event->attr.exclusive || !cpuctx->active_oncpu)
> cpuctx->exclusive = 0;
>
> + if (log_overhead && cpuctx->mux_overhead.nr)
> + perf_log_overhead(event, PERF_CORE_MUX_OVERHEAD, &cpuctx->mux_overhead);
> +
This isn't mentioned in the Changelog. Why is it here?
> perf_pmu_enable(event->pmu);
> }
>
On Fri, Dec 02, 2016 at 04:19:13PM -0500, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Iterating all events which need to receive side-band events also bring
> some overhead.
>
> The side-band events overhead PERF_CORE_SB_OVERHEAD is a common overhead
> type.
>
> Signed-off-by: Kan Liang <[email protected]>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 025a19d..85706fb 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1830,8 +1830,12 @@ event_sched_out(struct perf_event *event,
> if (event->attr.exclusive || !cpuctx->active_oncpu)
> cpuctx->exclusive = 0;
>
> - if (log_overhead && cpuctx->mux_overhead.nr)
> - perf_log_overhead(event, PERF_CORE_MUX_OVERHEAD, &cpuctx->mux_overhead);
> + if (log_overhead) {
> + if (cpuctx->mux_overhead.nr)
> + perf_log_overhead(event, PERF_CORE_MUX_OVERHEAD, &cpuctx->mux_overhead);
> + if (ctx->sb_overhead.nr)
> + perf_log_overhead(event, PERF_CORE_SB_OVERHEAD, &ctx->sb_overhead);
> + }
Urgh, that's getting uglier by the minute, and again, its not mentioned.
>
> perf_pmu_enable(event->pmu);
> }
On Fri, Dec 02, 2016 at 04:19:09PM -0500, [email protected] wrote:
> @@ -1221,6 +1225,11 @@ static inline bool has_addr_filter(struct perf_event *event)
> return event->pmu->nr_addr_filters;
> }
>
> +static inline bool needs_log_overhead(struct perf_event *event)
> +{
> + return !!event->attr.overhead;
> +}
There seems to be exactly one user of that, seems a tad pointless to
provide this function.
On Fri, Dec 02, 2016 at 04:19:11PM -0500, [email protected] wrote:
> +static void
> +perf_calculate_nmi_overhead(u64 time)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> + cpuc->nmi_overhead.nr++;
> + cpuc->nmi_overhead.time += time;
> +}
This function doesn't calculate anything much.
> On Fri, Dec 02, 2016 at 04:19:11PM -0500, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > On x86, NMI handler is the most important part which brings overhead
> > for sampling. Adding a pmu specific overhead type
> > PERF_PMU_SAMPLE_OVERHEAD for it.
> >
> > For other architectures which may don't have NMI, the overhead type
> > can be reused.
> >
> > Signed-off-by: Kan Liang <[email protected]>
> > ---
> > arch/x86/events/core.c | 17 ++++++++++++++++-
> > arch/x86/events/perf_event.h | 2 ++
> > include/uapi/linux/perf_event.h | 1 +
> > 3 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index
> > 9d4bf3a..de40f96 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -1397,6 +1397,9 @@ static void x86_pmu_del(struct perf_event
> > *event, int flags)
> >
> > perf_event_update_userpage(event);
> >
> > + if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr)
> > + perf_log_overhead(event, PERF_PMU_SAMPLE_OVERHEAD,
> > +&cpuc->nmi_overhead);
> > +
> > do_del:
> > if (x86_pmu.del) {
> > /*
>
> That's not at all mentioned in the changelog, and it clearly isn't
> nmi_overhead.
Here it only records the overhead, it doesn't calculate it.
The calculation is in the NMI handler, as below. I will make it clear in the changelog.
@@ -1492,8 +1505,10 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
start_clock = sched_clock();
ret = x86_pmu.handle_irq(regs);
finish_clock = sched_clock();
+ clock = finish_clock - start_clock;
- perf_sample_event_took(finish_clock - start_clock);
+ perf_calculate_nmi_overhead(clock);
+ perf_sample_event_took(clock);
return ret;
}
Thanks,
Kan
> On Fri, Dec 02, 2016 at 04:19:12PM -0500, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > Multiplexing overhead is one of the key overhead when the number of
> > events is more than available counters.
> >
> > The multiplexing overhead PERF_CORE_MUX_OVERHEAD is a common
> overhead
> > type.
> >
>
> > diff --git a/include/uapi/linux/perf_event.h
> > b/include/uapi/linux/perf_event.h index fe7b1fb..355086f 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -999,6 +999,7 @@ struct perf_branch_entry {
> > */
> > enum perf_record_overhead_type {
> > PERF_CORE_OVERHEAD = 0,
> > + PERF_CORE_MUX_OVERHEAD = 0,
>
> '0' already had a name ?!
PERF_CORE_OVERHEAD is only used to indicate the start of core
overhead type.
I will use comment line to replace PERF_CORE_OVERHEAD.
Thanks,
Kan
On Tue, Dec 06, 2016 at 03:02:20PM +0000, Liang, Kan wrote:
>
>
> > On Fri, Dec 02, 2016 at 04:19:11PM -0500, [email protected] wrote:
> > > From: Kan Liang <[email protected]>
> > >
> > > On x86, NMI handler is the most important part which brings overhead
> > > for sampling. Adding a pmu specific overhead type
> > > PERF_PMU_SAMPLE_OVERHEAD for it.
> > >
> > > For other architectures which may don't have NMI, the overhead type
> > > can be reused.
> > >
> > > Signed-off-by: Kan Liang <[email protected]>
> > > ---
> > > arch/x86/events/core.c | 17 ++++++++++++++++-
> > > arch/x86/events/perf_event.h | 2 ++
> > > include/uapi/linux/perf_event.h | 1 +
> > > 3 files changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index
> > > 9d4bf3a..de40f96 100644
> > > --- a/arch/x86/events/core.c
> > > +++ b/arch/x86/events/core.c
> > > @@ -1397,6 +1397,9 @@ static void x86_pmu_del(struct perf_event
> > > *event, int flags)
> > >
> > > perf_event_update_userpage(event);
> > >
> > > + if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr)
> > > + perf_log_overhead(event, PERF_PMU_SAMPLE_OVERHEAD,
> > > +&cpuc->nmi_overhead);
> > > +
> > > do_del:
> > > if (x86_pmu.del) {
> > > /*
> >
> > That's not at all mentioned in the changelog, and it clearly isn't
> > nmi_overhead.
>
> Here it only records the overhead, not calculate.
It doesn't record anything, it generates the output. And it doesn't
explain why that needs to be in pmu::del(), in general that's a horrible
thing to do.
> On Tue, Dec 06, 2016 at 03:02:20PM +0000, Liang, Kan wrote:
> >
> >
> > > On Fri, Dec 02, 2016 at 04:19:11PM -0500, [email protected] wrote:
> > > > From: Kan Liang <[email protected]>
> > > >
> > > > On x86, NMI handler is the most important part which brings
> > > > overhead for sampling. Adding a pmu specific overhead type
> > > > PERF_PMU_SAMPLE_OVERHEAD for it.
> > > >
> > > > For other architectures which may don't have NMI, the overhead
> > > > type can be reused.
> > > >
> > > > Signed-off-by: Kan Liang <[email protected]>
> > > > ---
> > > > arch/x86/events/core.c | 17 ++++++++++++++++-
> > > > arch/x86/events/perf_event.h | 2 ++
> > > > include/uapi/linux/perf_event.h | 1 +
> > > > 3 files changed, 19 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index
> > > > 9d4bf3a..de40f96 100644
> > > > --- a/arch/x86/events/core.c
> > > > +++ b/arch/x86/events/core.c
> > > > @@ -1397,6 +1397,9 @@ static void x86_pmu_del(struct perf_event
> > > > *event, int flags)
> > > >
> > > > perf_event_update_userpage(event);
> > > >
> > > > + if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr)
> > > > + perf_log_overhead(event, PERF_PMU_SAMPLE_OVERHEAD,
> > > > +&cpuc->nmi_overhead);
> > > > +
> > > > do_del:
> > > > if (x86_pmu.del) {
> > > > /*
> > >
> > > That's not at all mentioned in the changelog, and it clearly isn't
> > > nmi_overhead.
> >
> > Here it only records the overhead, not calculate.
>
> It doesn't record anything, it generates the output. And it doesn't explain
> why that needs to be in pmu::del(), in general that's a horrible thing to do.
Yes, it only generates/logs the output. Sorry for the confusing wording.
The NMI overhead is a pmu specific overhead. So the NMI overhead output
should be generated in the pmu code.
I assume that pmu::del is the last pmu function called when perf finishes.
Is it a good place for logging?
Thanks,
Kan
On Tue, Dec 06, 2016 at 03:47:40PM +0000, Liang, Kan wrote:
> > It doesn't record anything, it generates the output. And it doesn't explain
> > why that needs to be in pmu::del(), in general that's a horrible thing to do.
>
> Yes, it only generate/log the output. Sorry for the confused wording.
>
> The NMI overhead is pmu specific overhead. So the NMI overhead output
> should be generated in pmu code.
True, but you're also accounting in a per-cpu bucket, which means it
includes all events. At which point the per-event overhead thing doesn't
really make sense.
It also means that previous sessions influence the numbers of our
current session; there's no explicit reset of the numbers.
> I assume that the pmu:del is the last called pmu function when perf finish.
> Is it a good place for logging?
No, it's horrible. Sure, we'll call pmu::del on events, but yuck.
You really only want _one_ invocation when you stop using the event, and
we don't really have a good place for that. But instead of creating one,
you do horrible things.
Now, I realize there's a bit of a catch-22 in that the moment we know
the event is going away, its already gone from userspace. So we cannot
dump data from there in general..
However, if we have output redirection we can, but that would make things
depend on that and it cannot be used for the last event who's buffer
we're using.
Another option would be to introduce PERF_EVENT_IOC_STAT or something
like that, and have the tool call that when its 'done'.
> On Tue, Dec 06, 2016 at 03:47:40PM +0000, Liang, Kan wrote:
>
> > > It doesn't record anything, it generates the output. And it doesn't
> > > explain why that needs to be in pmu::del(), in general that's a horrible
> thing to do.
> >
> > Yes, it only generate/log the output. Sorry for the confused wording.
> >
> > The NMI overhead is pmu specific overhead. So the NMI overhead output
> > should be generated in pmu code.
>
> True, but you're also accounting in a per-cpu bucket, which means it
> includes all events. At which point the per-event overhead thing doesn't
> really make sense.
>
> It also means that previous sessions influence the numbers of our current
> session; there's no explicit reset of the numbers.
>
> > I assume that the pmu:del is the last called pmu function when perf finish.
> > Is it a good place for logging?
>
> No, it's horrible. Sure, we'll call pmu::del on events, but yuck.
>
> You really only want _one_ invocation when you stop using the event, and
> we don't really have a good place for that. But instead of creating one, you
> do horrible things.
>
> Now, I realize there's a bit of a catch-22 in that the moment we know the
> event is going away, its already gone from userspace. So we cannot dump
> data from there in general..
>
> However, if we have output redirection we can, but that would make things
> depend on that and it cannot be used for the last event who's buffer we're
> using.
>
> Another option would be to introduce PERF_EVENT_IOC_STAT or something
> like that, and have the tool call that when its 'done'.
>
OK. I think I will implement a new ioctl PERF_EVENT_IOC_STAT.
The IOC_STAT will be called by the tool when it 'starts' and when it is 'done'.
I will also introduce two new ioc flags.
(PERF_IOC_FLAG_STAT_START and PERF_IOC_FLAG_STAT_DONE)
In 'start', the kernel will reset the numbers.
In 'done', the kernel will generate all the outputs. The overhead numbers are
from different CPUs. To distinguish them, we have to add cpu to
overhead_entry. We cannot trust sample_id.
struct perf_overhead_entry {
	__u32	cpu;
	__u32	nr;
	__u64	time;
};
I will also add a void (*overhead_stat) in struct pmu to do the pmu specific
reset and output generation.
In V2, the three overheads are stored in different per-event/per-cpu contexts.
For V3, I will store all the overheads in the pmu's cpuctx.
So the numbers will be the overhead for the pmu, not the whole system.
It should be clearer and more useful.
How does it sound?
Thanks,
Kan