From: Kan Liang <[email protected]>
This series only includes the kernel-related patches.
Profiling brings additional overhead. High overhead may change the
behavior of the profiled workload, reduce the accuracy of the profiling
result, and even hang the system.
Currently, perf has a dynamic interrupt throttling mechanism to lower the
sample rate and the overhead, but it has limitations.
- The mechanism only addresses the sampling overhead. Other parts,
  e.g. multiplexing, also bring significant overhead.
- The hint from the mechanism doesn't work for fixed-period events.
- The system changes caused by the mechanism are not recorded in
  perf.data, so users have no idea about the overhead and its impact.
Any passive approach like the dynamic interrupt throttling mechanism is
only a palliative. A better way is to export the overhead information,
provide more hints, and help users construct a more suitable perf command.
In the kernel, three parts bring noticeable overhead.
- sampling overhead: on x86, the time spent in the perf NMI handler.
- multiplexing overhead: the time spent rotating contexts.
- side-band events overhead: the time spent iterating over all events
  that need to receive side-band events.
The time cost of these parts is stored in each pmu's per-cpu cpuctx.
The tool can call PERF_EVENT_IOC_STAT when it is 'done', and the kernel
then generates the overhead record PERF_RECORD_OVERHEAD.
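A minimal tool-side sketch of that flow (hypothetical, error handling
omitted; it assumes the uapi header from this series, which provides
attr.overhead, PERF_EVENT_IOC_STAT and the PERF_IOC_FLAG_STAT_* flags):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;
        attr.overhead = 1;      /* opt in to overhead logging */

        fd = perf_event_open(&attr, 0, -1, -1, 0);

        /* reset the kernel's overhead counters */
        ioctl(fd, PERF_EVENT_IOC_STAT, PERF_IOC_FLAG_STAT_START);

        /* ... mmap the ring buffer, run and sample the workload ... */

        /* emit PERF_RECORD_OVERHEAD records into the ring buffer */
        ioctl(fd, PERF_EVENT_IOC_STAT, PERF_IOC_FLAG_STAT_DONE);

        close(fd);
        return 0;
    }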
Users can use the overhead information to refine their perf command and
get more accurate profiling results. For example, on a high overhead
warning, they may reduce the number of events, increase the period, or
decrease the frequency.
Developers can also use the overhead information to find bugs.
Changes since V2:
- Separate kernel patches from the previous version
- Add PERF_EVENT_IOC_STAT to control overhead statistics
- Collect per pmu overhead information
- Store the overhead information in pmu's cpuctx
- Add CPU information in overhead record
Changes since V1:
- fix u32 holes and remove duplicate CPU information
- configurable overhead logging
- Introduce the concept of PMU specific overhead and common core
overhead. Rename NMI overhead to PMU sample overhead
- Add log_overhead in perf_event_context to indicate the logging of
overhead
- Refine the output of overhead information
- Use perf CPU time to replace perf write data overhead
- Refine the formula of overhead evaluation
- Refine perf script
Kan Liang (6):
perf/core: Introduce PERF_RECORD_OVERHEAD
perf/core: Add PERF_EVENT_IOC_STAT to control overhead statistics
perf/x86: implement overhead stat for x86 pmu
perf/core: calculate multiplexing overhead
perf/core: calculate side-band events overhead
perf/x86: calculate sampling overhead
arch/x86/events/core.c | 45 +++++++++++++++++++++++-
include/linux/perf_event.h | 12 +++++++
include/uapi/linux/perf_event.h | 44 ++++++++++++++++++++++-
kernel/events/core.c | 77 ++++++++++++++++++++++++++++++++++++++++-
4 files changed, 175 insertions(+), 3 deletions(-)
--
2.4.3
From: Kan Liang <[email protected]>
On x86, the NMI handler is the main source of sampling overhead. Add a
pmu-specific overhead type, PERF_PMU_SAMPLE_OVERHEAD, for it.
Other architectures, which may not use an NMI, can reuse this overhead
type.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 8 +++++++-
include/uapi/linux/perf_event.h | 1 +
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 09ab36a..1e57ccf 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1478,8 +1478,10 @@ void perf_events_lapic_init(void)
static int
perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu.pmu_cpu_context);
u64 start_clock;
u64 finish_clock;
+ u64 clock;
int ret;
/*
@@ -1492,8 +1494,12 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
start_clock = sched_clock();
ret = x86_pmu.handle_irq(regs);
finish_clock = sched_clock();
+ clock = finish_clock - start_clock;
+ perf_sample_event_took(clock);
- perf_sample_event_took(finish_clock - start_clock);
+ /* calculate NMI overhead */
+ cpuctx->overhead[PERF_PMU_SAMPLE_OVERHEAD].nr++;
+ cpuctx->overhead[PERF_PMU_SAMPLE_OVERHEAD].time += clock;
return ret;
}
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7ba6d30..954b116 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1004,6 +1004,7 @@ enum perf_record_overhead_type {
PERF_CORE_MUX_OVERHEAD = 0,
PERF_CORE_SB_OVERHEAD,
/* PMU specific */
+ PERF_PMU_SAMPLE_OVERHEAD,
PERF_OVERHEAD_MAX,
};
--
2.4.3
From: Kan Liang <[email protected]>
Iterating over all events that need to receive side-band events also
brings some overhead.
The side-band events overhead, PERF_CORE_SB_OVERHEAD, is a common
overhead type.
Signed-off-by: Kan Liang <[email protected]>
---
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 17 ++++++++++++++++-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c488336..7ba6d30 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1002,6 +1002,7 @@ struct perf_branch_entry {
enum perf_record_overhead_type {
/* common overhead */
PERF_CORE_MUX_OVERHEAD = 0,
+ PERF_CORE_SB_OVERHEAD,
/* PMU specific */
PERF_OVERHEAD_MAX,
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28468ae..335b1e2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6134,9 +6134,13 @@ static void
perf_iterate_sb(perf_iterate_f output, void *data,
struct perf_event_context *task_ctx)
{
+ struct perf_event_context *overhead_ctx = task_ctx;
+ struct perf_cpu_context *cpuctx;
struct perf_event_context *ctx;
+ u64 start_clock, end_clock;
int ctxn;
+ start_clock = perf_clock();
rcu_read_lock();
preempt_disable();
@@ -6154,12 +6158,23 @@ perf_iterate_sb(perf_iterate_f output, void *data,
for_each_task_context_nr(ctxn) {
ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (ctx)
+ if (ctx) {
perf_iterate_ctx(ctx, output, data, false);
+ if (!overhead_ctx)
+ overhead_ctx = ctx;
+ }
}
done:
preempt_enable();
rcu_read_unlock();
+
+ /* calculate side-band event overhead */
+ end_clock = perf_clock();
+ if (overhead_ctx && overhead_ctx->pmu && overhead_ctx->pmu->stat) {
+ cpuctx = this_cpu_ptr(overhead_ctx->pmu->pmu_cpu_context);
+ cpuctx->overhead[PERF_CORE_SB_OVERHEAD].nr++;
+ cpuctx->overhead[PERF_CORE_SB_OVERHEAD].time += end_clock - start_clock;
+ }
}
/*
--
2.4.3
From: Kan Liang <[email protected]>
Multiplexing is a key source of overhead when there are more events
than available counters.
The multiplexing overhead, PERF_CORE_MUX_OVERHEAD, is a common overhead
type.
Signed-off-by: Kan Liang <[email protected]>
---
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 9 +++++++++
2 files changed, 10 insertions(+)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 23b7963..c488336 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1001,6 +1001,7 @@ struct perf_branch_entry {
*/
enum perf_record_overhead_type {
/* common overhead */
+ PERF_CORE_MUX_OVERHEAD = 0,
/* PMU specific */
PERF_OVERHEAD_MAX,
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dbde193..28468ae 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3326,6 +3326,7 @@ static void rotate_ctx(struct perf_event_context *ctx)
static int perf_rotate_context(struct perf_cpu_context *cpuctx)
{
struct perf_event_context *ctx = NULL;
+ u64 start_clock, end_clock;
int rotate = 0;
if (cpuctx->ctx.nr_events) {
@@ -3342,6 +3343,7 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
if (!rotate)
goto done;
+ start_clock = perf_clock();
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_pmu_disable(cpuctx->ctx.pmu);
@@ -3357,6 +3359,13 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+
+ /* calculate multiplexing overhead */
+ if (cpuctx->ctx.pmu->stat) {
+ end_clock = perf_clock();
+ cpuctx->overhead[PERF_CORE_MUX_OVERHEAD].nr++;
+ cpuctx->overhead[PERF_CORE_MUX_OVERHEAD].time += end_clock - start_clock;
+ }
done:
return rotate;
--
2.4.3
From: Kan Liang <[email protected]>
On STAT_START, reset the overhead counters in each possible cpuctx of
the event's pmu.
On STAT_DONE, generate the overhead information for each possible
cpuctx and reset the overhead counters.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6e395c9..09ab36a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2198,6 +2198,40 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
x86_pmu.sched_task(ctx, sched_in);
}
+static int x86_pmu_stat(struct perf_event *event, u32 flag)
+{
+ struct perf_cpu_context *cpuctx;
+ struct pmu *pmu = event->pmu;
+ int cpu, i;
+
+ if (!(flag & (PERF_IOC_FLAG_STAT_START | PERF_IOC_FLAG_STAT_DONE)))
+ return -EINVAL;
+
+ if (!event->attr.overhead)
+ return -EINVAL;
+
+ if (flag & PERF_IOC_FLAG_STAT_DONE) {
+ for_each_possible_cpu(cpu) {
+ cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
+
+ for (i = 0; i < PERF_OVERHEAD_MAX; i++) {
+ if (!cpuctx->overhead[i].nr)
+ continue;
+ perf_log_overhead(event, i, cpu,
+ cpuctx->overhead[i].nr,
+ cpuctx->overhead[i].time);
+ }
+ }
+ }
+
+ for_each_possible_cpu(cpu) {
+ cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
+ memset(cpuctx->overhead, 0, PERF_OVERHEAD_MAX * sizeof(struct perf_overhead_entry));
+ }
+
+ return 0;
+}
+
void perf_check_microcode(void)
{
if (x86_pmu.check_microcode)
@@ -2228,6 +2262,9 @@ static struct pmu pmu = {
.event_idx = x86_pmu_event_idx,
.sched_task = x86_pmu_sched_task,
+
+ .stat = x86_pmu_stat,
+
.task_ctx_size = sizeof(struct x86_perf_task_context),
};
--
2.4.3
From: Kan Liang <[email protected]>
The overhead statistics only need to be generated once, when perf is
done, but there is no good place in the kernel to trigger that.
A new ioctl, PERF_EVENT_IOC_STAT, is introduced to notify the kernel
when the tool 'starts' and when it is 'done'.
On 'start', the kernel resets the overhead numbers.
On 'done', the kernel generates the pmu's overhead records.
A pmu-specific callback, int (*stat), does the real work.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 6 ++++++
include/uapi/linux/perf_event.h | 3 +++
kernel/events/core.c | 5 +++++
3 files changed, 14 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 946e8d8..a34f9a2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -403,6 +403,12 @@ struct pmu {
*/
void (*sched_task) (struct perf_event_context *ctx,
bool sched_in);
+
+ /*
+ * overhead statistics
+ */
+ int (*stat) (struct perf_event *event, u32 flag);
+
/*
* PMU specific data size
*/
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 101f8b3..23b7963 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -408,9 +408,12 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *)
#define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)
#define PERF_EVENT_IOC_PAUSE_OUTPUT _IOW('$', 9, __u32)
+#define PERF_EVENT_IOC_STAT _IOW('$', 10, __u32)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
+ PERF_IOC_FLAG_STAT_START = 1U << 1,
+ PERF_IOC_FLAG_STAT_DONE = 1U << 2,
};
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1420139..dbde193 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4637,6 +4637,11 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
rcu_read_unlock();
return 0;
}
+ case PERF_EVENT_IOC_STAT: {
+ if (event->pmu->stat)
+ return event->pmu->stat(event, flags);
+ return 0;
+ }
default:
return -ENOTTY;
}
--
2.4.3
From: Kan Liang <[email protected]>
A new perf record type is introduced to export perf overhead
information to userspace, so users can measure the profiling overhead
directly.
Users who don't want this feature can switch it off from the user-space
tool.
The total perf overhead is the sum of the overhead of all active pmus.
A pmu's events may run on different CPUs, so calculating the perf
overhead requires collecting per-pmu, per-cpu overhead information.
Each pmu has its own per-cpu cpuctx, which is a good place to store
this information.
The overhead information is output through the existing event logging
mechanism. Note that it is per-pmu overhead, not per-event overhead.
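For illustration, a hypothetical tool-side decoder (a sketch, not part
of this series; it assumes the uapi structures added below and ignores
the trailing sample_id fields) could read the record as:

    #include <stdio.h>
    #include <linux/perf_event.h>

    /* fixed part of PERF_RECORD_OVERHEAD; sample_id data follows it */
    struct overhead_event {
        struct perf_event_header header;
        __u64 type;                       /* enum perf_record_overhead_type */
        struct perf_overhead_entry entry;
    };

    static void handle_overhead(const struct perf_event_header *hdr)
    {
        const struct overhead_event *ov = (const void *)hdr;

        if (hdr->type != PERF_RECORD_OVERHEAD)
            return;

        printf("overhead type %llu: cpu %u nr %u time %llu ns\n",
               (unsigned long long)ov->type, ov->entry.cpu,
               ov->entry.nr, (unsigned long long)ov->entry.time);
    }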
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 6 ++++++
include/uapi/linux/perf_event.h | 38 +++++++++++++++++++++++++++++++++-
kernel/events/core.c | 46 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 89 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4741ecd..946e8d8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -792,6 +792,8 @@ struct perf_cpu_context {
struct list_head sched_cb_entry;
int sched_cb_usage;
+
+ struct perf_overhead_entry overhead[PERF_OVERHEAD_MAX];
};
struct perf_output_handle {
@@ -998,6 +1000,10 @@ perf_event__output_id_sample(struct perf_event *event,
extern void
perf_log_lost_samples(struct perf_event *event, u64 lost);
+extern void
+perf_log_overhead(struct perf_event *event, u64 type,
+ u32 cpu, u32 nr, u64 time);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c66a485..101f8b3 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -344,7 +344,8 @@ struct perf_event_attr {
use_clockid : 1, /* use @clockid for time fields */
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end to beginning */
- __reserved_1 : 36;
+ overhead : 1, /* Log overhead information */
+ __reserved_1 : 35;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -862,6 +863,17 @@ enum perf_event_type {
*/
PERF_RECORD_SWITCH_CPU_WIDE = 15,
+ /*
+ * Records perf overhead
+ * struct {
+ * struct perf_event_header header;
+ * u64 type;
+ * struct perf_overhead_entry entry;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_OVERHEAD = 16,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -980,4 +992,28 @@ struct perf_branch_entry {
reserved:44;
};
+/*
+ * The overhead could be common overhead (in core code) or
+ * PMU specific overhead (in pmu-specific code).
+ */
+enum perf_record_overhead_type {
+ /* common overhead */
+ /* PMU specific */
+ PERF_OVERHEAD_MAX,
+};
+
+/*
+ * single overhead record layout:
+ *
+ * cpu: CPU id
+ * nr: number of times the overhead happened.
+ * E.g. for NMI, nr == number of times the NMI handler was called.
+ * time: total overhead cost (ns)
+ */
+struct perf_overhead_entry {
+ __u32 cpu;
+ __u32 nr;
+ __u64 time;
+};
+
#endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 02c8421..1420139 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7034,6 +7034,52 @@ static void perf_log_itrace_start(struct perf_event *event)
perf_output_end(&handle);
}
+
+/*
+ * Record overhead information
+ *
+ * The overhead logged here is the overhead of the event's pmu, not per-event
+ * overhead. This function only takes advantage of the existing event logging
+ * mechanism to log the overhead information.
+ *
+ */
+void perf_log_overhead(struct perf_event *event, u64 type,
+ u32 cpu, u32 nr, u64 time)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ struct {
+ struct perf_event_header header;
+ u64 type;
+ struct perf_overhead_entry overhead;
+ } overhead_event = {
+ .header = {
+ .type = PERF_RECORD_OVERHEAD,
+ .misc = 0,
+ .size = sizeof(overhead_event),
+ },
+ .type = type,
+ .overhead = {
+ .cpu = cpu,
+ .nr = nr,
+ .time = time,
+ },
+ };
+
+ perf_event_header__init_id(&overhead_event.header, &sample, event);
+ ret = perf_output_begin(&handle, event, overhead_event.header.size);
+
+ if (ret)
+ return;
+
+ perf_output_put(&handle, overhead_event);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
/*
* Generic event overflow handling, sampling.
*/
--
2.4.3
Ping.
Any comments on the series?
Thanks,
Kan
> Subject: [PATCH V3 0/6] export perf overheads information (kernel)