(The kernel part of the Haswell LBR call stack patch set originates from
Yan, Zheng; I only made some modifications based on his work. The perf
tool part is from me. The patches have been tested on a Haswell platform.)
For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default on
64bit code (and on modern 32bit gccs), so there are many binaries around
that do not use frame pointers. Profiling unchanged production code is
very useful in practice. On some CPUs the frame pointer also has a high
cost. DWARF2 unwinding also does not always work and is extremely slow
(up to 20% overhead).
Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are
executed, the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative way to get
the call graph. It has some limitations too, but it should work in most
cases and is significantly faster than DWARF. Frame pointer unwinding is still
the best default, but LBR call stack is a good alternative when nothing
else works.
In this implementation, both frame pointer and LBR call stack data are
collected by the kernel and exposed to user space. The frame pointer data
is still output in the PERF_SAMPLE_CALLCHAIN data format. The LBR call
stack data is output in the PERF_SAMPLE_BRANCH_STACK data format. A callchain
source extension of the perf report call-graph option is introduced, so the
user can choose the call chain from either FP or the LBR call stack.
When profiling bc(1) on Fedora 19:
echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
If this feature is enabled, the perf report output with the lbr option looks like:
50.36% bc bc [.] bc_divide
|
--- bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
33.66% bc bc [.] _one_mult
|
--- _one_mult
bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
7.62% bc bc [.] _bc_do_add
|
--- _bc_do_add
|
|--99.89%-- 0x2000186a8
--0.11%-- [...]
6.83% bc bc [.] _bc_do_sub
|
--- _bc_do_sub
|
|--99.94%-- bc_add
| execute
| run_code
| yyparse
| main
| __libc_start_main
| _start
--0.06%-- [...]
0.46% bc libc-2.17.so [.] __memset_sse2
|
--- __memset_sse2
|
|--54.13%-- bc_new_num
| |
| |--51.00%-- bc_divide
| | execute
| | run_code
| | yyparse
| | main
| | __libc_start_main
| | _start
| |
| |--30.46%-- _bc_do_sub
| | bc_add
| | execute
| | run_code
| | yyparse
| | main
| | __libc_start_main
| | _start
| |
| --18.55%-- _bc_do_add
| bc_add
| execute
| run_code
| yyparse
| main
| __libc_start_main
| _start
|
--45.87%-- bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
If this feature is disabled, perf report output looks like:
50.49% bc bc [.] bc_divide
|
--- bc_divide
33.57% bc bc [.] _one_mult
|
--- _one_mult
7.61% bc bc [.] _bc_do_add
|
--- _bc_do_add
0x2000186a8
6.88% bc bc [.] _bc_do_sub
|
--- _bc_do_sub
0.42% bc libc-2.17.so [.] __memcpy_ssse3_back
|
--- __memcpy_ssse3_back
Another example demonstrates the extension of perf report.
If both fp and lbr data are available, we can dump either of them with the fp
or lbr option as below.
$ perf record --call-graph fp ./a.out
[ perf record: Woken up 18 times to write data ]
[ perf record: Captured and wrote 4.322 MB perf.data (~188824 samples) ]
$ perf report --call-graph fractal,0.5,callee,function,fp -D | wc -l
605688
$ perf report --call-graph fractal,0.5,callee,function,lbr -D | wc -l
605730
The LBR call stack has the following known limitations:
- Zero length calls are not filtered out by hardware
- Exception handling such as setjmp/longjmp will result in unmatched
calls/returns
- Pushing a different return address onto the stack will result in
unmatched calls/returns
- If the call stack is deeper than the LBR, only the most recent entries are captured
Changes since v1
- split change into more patches
- introduce context switch callback and use it to flush LBR
- use the context switch callback to save/restore LBR
- dynamically allocate the memory area for storing the LBR stack, always
switch the memory area during context switch
- disable this feature by default
- more description in change logs
Changes since v2
- don't use xchg to switch PMU specific data
- remove nr_branch_stack from struct perf_event_context
- simplify the save/restore LBR stack logic
- remove unnecessary 'has_branch_stack -> needs_branch_stack'
conversion
- more description in change logs
Changes since v3
- remove sysfs attribute file that disable this feature
Changes since v4
- re-organize the code that saves/restores the LBR stack
- allocate pmu specific data when it's needed
- update code comments
Changes since v5
- Expose LBR call stack data to user perf tool
- Add option for perf report to support LBR call stack
- Some minor changes according to comments
Yan, Zheng (15):
perf, x86: Reduce lbr_sel_map size
perf, core: introduce pmu context switch callback
perf, x86: use context switch callback to flush LBR stack
perf, x86: Basic Haswell LBR call stack support
perf, core: pmu specific data for perf task context
perf, core: always switch pmu specific data during context switch
perf, x86: allocate space for storing LBR stack
perf, x86: track number of events that use LBR callstack
perf, x86: Save/restore LBR stack during context switch
perf, core: simplify need branch stack check
perf, core: expose LBR call stack to user perf tool
perf, x86: re-organize code that implicitly enables LBR/PEBS
perf, x86: enable LBR callstack when recording callchain
perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack
mode
perf, x86: Discard zero length call entries in LBR call stack
Kan Liang (2):
perf tools: handle LBR call stack data
perf tools: choose to dump callchain from LBR and FP
arch/x86/kernel/cpu/perf_event.c | 90 ++++++---
arch/x86/kernel/cpu/perf_event.h | 28 ++-
arch/x86/kernel/cpu/perf_event_intel.c | 38 +---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 310 ++++++++++++++++++++++-------
include/linux/perf_event.h | 34 +++-
include/uapi/linux/perf_event.h | 49 +++--
kernel/events/callchain.c | 1 +
kernel/events/core.c | 200 +++++++++++--------
tools/perf/builtin-report.c | 8 +-
tools/perf/util/callchain.c | 18 +-
tools/perf/util/callchain.h | 6 +
tools/perf/util/event.h | 8 +
tools/perf/util/evsel.c | 21 +-
tools/perf/util/machine.c | 198 ++++++++++++------
tools/perf/util/session.c | 34 +++-
16 files changed, 728 insertions(+), 317 deletions(-)
--
1.8.3.2
The index of lbr_sel_map is the bit value of perf branch_sample_type.
PERF_SAMPLE_BRANCH_MAX is currently 1 << 11, so each lbr_sel_map uses
8192 bytes. By using the bit shift as the index, we can reduce the
lbr_sel_map size to 44 bytes. This patch defines a 'bit shift' for each
branch type and uses the 'bit shift' to define the lbr_sel_maps.
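
As a rough user-space illustration of the size argument (a standalone
sketch, not kernel code; it only mirrors the enum sizes visible in the
diff below and assumes 4-byte ints):

#include <stdio.h>

/* Indexing by bit *value* needs one slot per possible flag value up to
 * PERF_SAMPLE_BRANCH_MAX (1 << 11); indexing by bit *shift* needs only
 * one slot per defined branch type (PERF_SAMPLE_BRANCH_MAX_SHIFT). */
#define BRANCH_MAX_SHIFT 11

static const int value_indexed[1 << BRANCH_MAX_SHIFT];	/* 2048 entries */
static const int shift_indexed[BRANCH_MAX_SHIFT];	/* 11 entries   */

int main(void)
{
	printf("value-indexed map: %zu bytes\n", sizeof(value_indexed));
	printf("shift-indexed map: %zu bytes\n", sizeof(shift_indexed));
	return 0;
}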
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Reviewed-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 4 +++
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 54 ++++++++++++++----------------
include/uapi/linux/perf_event.h | 49 +++++++++++++++++++--------
3 files changed, 64 insertions(+), 43 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d98a34d..2d017fa 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -510,6 +510,10 @@ struct x86_pmu {
struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
};
+enum {
+ PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+};
+
#define x86_add_quirk(func_) \
do { \
static struct x86_pmu_quirk __quirk __initdata = { \
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 4af1061..7344f05 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -69,10 +69,6 @@ static enum {
#define LBR_FROM_FLAG_IN_TX (1ULL << 62)
#define LBR_FROM_FLAG_ABORT (1ULL << 61)
-#define for_each_branch_sample_type(x) \
- for ((x) = PERF_SAMPLE_BRANCH_USER; \
- (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
-
/*
* x86control flow change classification
* x86control flow changes include branches, interrupts, traps, faults
@@ -403,14 +399,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
{
struct hw_perf_event_extra *reg;
u64 br_type = event->attr.branch_sample_type;
- u64 mask = 0, m;
- u64 v;
+ u64 mask = 0, v;
+ int i;
- for_each_branch_sample_type(m) {
- if (!(br_type & m))
+ for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+ if (!(br_type & (1ULL << i)))
continue;
- v = x86_pmu.lbr_sel_map[m];
+ v = x86_pmu.lbr_sel_map[i];
if (v == LBR_NOT_SUPP)
return -EOPNOTSUPP;
@@ -665,35 +661,35 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
/*
* Map interface branch filters onto LBR filters
*/
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
- [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY,
- [PERF_SAMPLE_BRANCH_USER] = LBR_USER,
- [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL,
- [PERF_SAMPLE_BRANCH_HV] = LBR_IGN,
- [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP
- | LBR_IND_JMP | LBR_FAR,
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_REL_JMP
+ | LBR_IND_JMP | LBR_FAR,
/*
* NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
*/
- [PERF_SAMPLE_BRANCH_ANY_CALL] =
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
/*
* NHM/WSM erratum: must include IND_JMP to capture IND_CALL
*/
- [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
- [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};
-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
- [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY,
- [PERF_SAMPLE_BRANCH_USER] = LBR_USER,
- [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL,
- [PERF_SAMPLE_BRANCH_HV] = LBR_IGN,
- [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
- [PERF_SAMPLE_BRANCH_ANY_CALL] = LBR_REL_CALL | LBR_IND_CALL
- | LBR_FAR,
- [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL,
- [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};
/* core */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9269de2..7d41d59 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -151,21 +151,42 @@ enum perf_event_sample_format {
* The branch types can be combined, however BRANCH_ANY covers all types
* of branches and therefore it supersedes all the other types.
*/
+enum perf_branch_sample_type_shift {
+ PERF_SAMPLE_BRANCH_USER_SHIFT = 0, /* user branches */
+ PERF_SAMPLE_BRANCH_KERNEL_SHIFT = 1, /* kernel branches */
+ PERF_SAMPLE_BRANCH_HV_SHIFT = 2, /* hypervisor branches */
+
+ PERF_SAMPLE_BRANCH_ANY_SHIFT = 3, /* any branch types */
+ PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT = 4, /* any call branch */
+ PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT = 5, /* any return branch */
+ PERF_SAMPLE_BRANCH_IND_CALL_SHIFT = 6, /* indirect calls */
+ PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT = 7, /* transaction aborts */
+ PERF_SAMPLE_BRANCH_IN_TX_SHIFT = 8, /* in transaction */
+ PERF_SAMPLE_BRANCH_NO_TX_SHIFT = 9, /* not in transaction */
+ PERF_SAMPLE_BRANCH_COND_SHIFT = 10, /* conditional branches */
+
+ PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
+};
+
enum perf_branch_sample_type {
- PERF_SAMPLE_BRANCH_USER = 1U << 0, /* user branches */
- PERF_SAMPLE_BRANCH_KERNEL = 1U << 1, /* kernel branches */
- PERF_SAMPLE_BRANCH_HV = 1U << 2, /* hypervisor branches */
-
- PERF_SAMPLE_BRANCH_ANY = 1U << 3, /* any branch types */
- PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */
- PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << 5, /* any return branch */
- PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */
- PERF_SAMPLE_BRANCH_ABORT_TX = 1U << 7, /* transaction aborts */
- PERF_SAMPLE_BRANCH_IN_TX = 1U << 8, /* in transaction */
- PERF_SAMPLE_BRANCH_NO_TX = 1U << 9, /* not in transaction */
- PERF_SAMPLE_BRANCH_COND = 1U << 10, /* conditional branches */
-
- PERF_SAMPLE_BRANCH_MAX = 1U << 11, /* non-ABI */
+ PERF_SAMPLE_BRANCH_USER = 1U << PERF_SAMPLE_BRANCH_USER_SHIFT,
+ PERF_SAMPLE_BRANCH_KERNEL = 1U << PERF_SAMPLE_BRANCH_KERNEL_SHIFT,
+ PERF_SAMPLE_BRANCH_HV = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
+
+ PERF_SAMPLE_BRANCH_ANY = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_CALL =
+ 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_RETURN =
+ 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+ PERF_SAMPLE_BRANCH_IND_CALL =
+ 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ABORT_TX =
+ 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_IN_TX = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_NO_TX = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_COND = 1U << PERF_SAMPLE_BRANCH_COND_SHIFT,
+
+ PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};
#define PERF_SAMPLE_BRANCH_PLM_ALL \
--
1.8.3.2
The callback is invoked when a process is scheduled in or out.
It provides a mechanism for later patches to save/restore the LBR
stack. For the schedule-in case, the callback is invoked at
the same place where the flush branch stack callback is invoked,
so it can also replace the flush branch stack callback. To
avoid unnecessary overhead, the callback is enabled only when
there are events that use the LBR stack.
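
As a rough sketch of how a PMU driver is expected to use the new hooks
(the my_pmu_* names are made up for illustration; the real x86/LBR wiring
appears later in this series):

#include <linux/perf_event.h>

/* Invoked on every context switch once somebody has bumped the per-cpu
 * perf_sched_cb_usages count via perf_sched_cb_inc(). */
static void my_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
{
	if (sched_in) {
		/* e.g. flush/restore PMU hardware state here */
	} else {
		/* e.g. save PMU hardware state here */
	}
}

static void my_pmu_enable_event(struct perf_event *event)
{
	/* arm the context switch callback while the event needs it */
	perf_sched_cb_inc(event->ctx->pmu);
}

static void my_pmu_disable_event(struct perf_event *event)
{
	perf_sched_cb_dec(event->ctx->pmu);
}

static struct pmu my_pmu = {
	.sched_task	= my_pmu_sched_task,
	/* other pmu callbacks omitted from this sketch */
};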
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 +++++
arch/x86/kernel/cpu/perf_event.h | 2 ++
include/linux/perf_event.h | 9 +++++++
kernel/events/core.c | 57 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 75 insertions(+)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 16c7302..b42a204 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1887,6 +1887,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
NULL,
};
+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (x86_pmu.sched_task)
+ x86_pmu.sched_task(ctx, sched_in);
+}
+
static void x86_pmu_flush_branch_stack(void)
{
if (x86_pmu.flush_branch_stack)
@@ -1920,6 +1926,7 @@ static struct pmu pmu = {
.event_idx = x86_pmu_event_idx,
.flush_branch_stack = x86_pmu_flush_branch_stack,
+ .sched_task = x86_pmu_sched_task,
};
void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 2d017fa..4dafbe3 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -468,6 +468,8 @@ struct x86_pmu {
void (*check_microcode)(void);
void (*flush_branch_stack)(void);
+ void (*sched_task)(struct perf_event_context *ctx,
+ bool sched_in);
/*
* Intel Arch Perfmon v2+
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..40ecad1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -263,6 +263,13 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * context-switches callback
+ */
+ void (*sched_task) (struct perf_event_context *ctx,
+ bool sched_in);
+
};
/**
@@ -562,6 +569,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
extern void perf_event_print_debug(void);
extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_dec(struct pmu *pmu);
+extern void perf_sched_cb_inc(struct pmu *pmu);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 094df8c..362cb3a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -154,6 +154,7 @@ enum event_type_t {
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
@@ -2435,6 +2436,56 @@ unlock:
}
}
+void perf_sched_cb_dec(struct pmu *pmu)
+{
+ this_cpu_dec(perf_sched_cb_usages);
+}
+
+void perf_sched_cb_inc(struct pmu *pmu)
+{
+ this_cpu_inc(perf_sched_cb_usages);
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+ struct task_struct *next,
+ bool sched_in)
+{
+ struct perf_cpu_context *cpuctx;
+ struct pmu *pmu;
+ unsigned long flags;
+
+ if (prev == next)
+ return;
+
+ local_irq_save(flags);
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(pmu, &pmus, entry) {
+ if (pmu->sched_task) {
+ cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ perf_pmu_disable(pmu);
+
+ pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+ perf_pmu_enable(pmu);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ }
+ }
+
+ rcu_read_unlock();
+
+ local_irq_restore(flags);
+}
+
#define for_each_task_context_nr(ctxn) \
for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
@@ -2454,6 +2505,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
{
int ctxn;
+ if (__this_cpu_read(perf_sched_cb_usages))
+ perf_pmu_sched_task(task, next, false);
+
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);
@@ -2711,6 +2765,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
/* check for system-wide branch_stack events */
if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
perf_branch_stack_sched_in(prev, task);
+
+ if (__this_cpu_read(perf_sched_cb_usages))
+ perf_pmu_sched_task(prev, task, true);
}
static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
--
1.8.3.2
The previous commit introduced a context switch callback whose function
overlaps with the flush branch stack callback, so we can use the
context switch callback to flush the LBR stack.
This patch adds code that uses the context switch callback to
flush the LBR stack when a task is being scheduled in. The callback
is enabled only when there are events that use the LBR hardware. This
patch also removes all the old flush branch stack code.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 ---
arch/x86/kernel/cpu/perf_event.h | 3 +-
arch/x86/kernel/cpu/perf_event_intel.c | 14 +-----
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 27 ++++++++++-
include/linux/perf_event.h | 6 ---
kernel/events/core.c | 77 ------------------------------
6 files changed, 29 insertions(+), 105 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index b42a204..fa7a168 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1893,12 +1893,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
x86_pmu.sched_task(ctx, sched_in);
}
-static void x86_pmu_flush_branch_stack(void)
-{
- if (x86_pmu.flush_branch_stack)
- x86_pmu.flush_branch_stack();
-}
-
void perf_check_microcode(void)
{
if (x86_pmu.check_microcode)
@@ -1925,7 +1919,6 @@ static struct pmu pmu = {
.commit_txn = x86_pmu_commit_txn,
.event_idx = x86_pmu_event_idx,
- .flush_branch_stack = x86_pmu_flush_branch_stack,
.sched_task = x86_pmu_sched_task,
};
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 4dafbe3..c84378b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -467,7 +467,6 @@ struct x86_pmu {
void (*cpu_dead)(int cpu);
void (*check_microcode)(void);
- void (*flush_branch_stack)(void);
void (*sched_task)(struct perf_event_context *ctx,
bool sched_in);
@@ -728,6 +727,8 @@ void intel_pmu_pebs_disable_all(void);
void intel_ds_init(void);
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_pmu_lbr_reset(void);
void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 3851def..38042e3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2191,18 +2191,6 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}
-static void intel_pmu_flush_branch_stack(void)
-{
- /*
- * Intel LBR does not tag entries with the
- * PID of the current task, then we need to
- * flush it on ctxsw
- * For now, we simply reset it
- */
- if (x86_pmu.lbr_nr)
- intel_pmu_lbr_reset();
-}
-
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2254,7 +2242,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .flush_branch_stack = intel_pmu_flush_branch_stack,
+ .sched_task = intel_pmu_lbr_sched_task,
};
static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 7344f05..18d13ca 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -177,13 +177,36 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}
-void intel_pmu_lbr_enable(struct perf_event *event)
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
if (!x86_pmu.lbr_nr)
return;
+ /*
+ * When sampling the branck stack in system-wide, it may be
+ * necessary to flush the stack on context switch. This happens
+ * when the branch stack does not tag its entries with the pid
+ * of the current task. Otherwise it becomes impossible to
+ * associate a branch entry with a task. This ambiguity is more
+ * likely to appear when the branch stack supports priv level
+ * filtering and the user sets it to monitor only at the user
+ * level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch
+ * stack with branch from multiple tasks.
+ */
+ if (sched_in) {
+ intel_pmu_lbr_reset();
+ cpuc->lbr_context = ctx;
+ }
+}
+void intel_pmu_lbr_enable(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ if (!x86_pmu.lbr_nr)
+ return;
/*
* Reset the LBR stack if we changed task context to
* avoid data leaks.
@@ -195,6 +218,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
cpuc->br_sel = event->hw.branch_reg.reg;
cpuc->lbr_users++;
+ perf_sched_cb_inc(event->ctx->pmu);
}
void intel_pmu_lbr_disable(struct perf_event *event)
@@ -206,6 +230,7 @@ void intel_pmu_lbr_disable(struct perf_event *event)
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
+ perf_sched_cb_dec(event->ctx->pmu);
if (cpuc->enabled && !cpuc->lbr_users) {
__intel_pmu_lbr_disable();
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 40ecad1..c392fd5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -260,11 +260,6 @@ struct pmu {
int (*event_idx) (struct perf_event *event); /*optional */
/*
- * flush branch stack on context-switches (needed in cpu-wide mode)
- */
- void (*flush_branch_stack) (void);
-
- /*
* context-switches callback
*/
void (*sched_task) (struct perf_event_context *ctx,
@@ -514,7 +509,6 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
- int nr_branch_stack; /* branch_stack evt */
struct rcu_head rcu_head;
struct delayed_work orphans_remove;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 362cb3a..5971e7d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -153,7 +153,6 @@ enum event_type_t {
*/
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
static DEFINE_PER_CPU(int, perf_sched_cb_usages);
static atomic_t nr_mmap_events __read_mostly;
@@ -1152,9 +1151,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
if (is_cgroup_event(event))
ctx->nr_cgroups++;
- if (has_branch_stack(event))
- ctx->nr_branch_stack++;
-
list_add_rcu(&event->event_entry, &ctx->event_list);
if (!ctx->nr_events)
perf_pmu_rotate_start(ctx->pmu);
@@ -1317,9 +1313,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
cpuctx->cgrp = NULL;
}
- if (has_branch_stack(event))
- ctx->nr_branch_stack--;
-
ctx->nr_events--;
if (event->attr.inherit_stat)
ctx->nr_stat--;
@@ -2673,64 +2666,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
}
/*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
- struct perf_cpu_context *cpuctx;
- struct pmu *pmu;
- unsigned long flags;
-
- /* no need to flush branch stack if not changing task */
- if (prev == task)
- return;
-
- local_irq_save(flags);
-
- rcu_read_lock();
-
- list_for_each_entry_rcu(pmu, &pmus, entry) {
- cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
- /*
- * check if the context has at least one
- * event using PERF_SAMPLE_BRANCH_STACK
- */
- if (cpuctx->ctx.nr_branch_stack > 0
- && pmu->flush_branch_stack) {
-
- perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
- perf_pmu_disable(pmu);
-
- pmu->flush_branch_stack();
-
- perf_pmu_enable(pmu);
-
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- }
- }
-
- rcu_read_unlock();
-
- local_irq_restore(flags);
-}
-
-/*
* Called from scheduler to add the events of the current task
* with interrupts disabled.
*
@@ -2762,10 +2697,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
perf_cgroup_sched_in(prev, task);
- /* check for system-wide branch_stack events */
- if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
- perf_branch_stack_sched_in(prev, task);
-
if (__this_cpu_read(perf_sched_cb_usages))
perf_pmu_sched_task(prev, task, true);
}
@@ -3359,10 +3290,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;
- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
}
@@ -6935,10 +6862,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;
- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_inc(&per_cpu(perf_cgroup_events, cpu));
}
--
1.8.3.2
Make a later patch more readable; no logic change.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 59 ++++++++++++++++++++--------------------
1 file changed, 29 insertions(+), 30 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8043526..9656b9e 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -394,36 +394,35 @@ int x86_pmu_hw_config(struct perf_event *event)
if (event->attr.precise_ip > precise)
return -EOPNOTSUPP;
- /*
- * check that PEBS LBR correction does not conflict with
- * whatever the user is asking with attr->branch_sample_type
- */
- if (event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2) {
- u64 *br_type = &event->attr.branch_sample_type;
-
- if (has_branch_stack(event)) {
- if (!precise_br_compat(event))
- return -EOPNOTSUPP;
-
- /* branch_sample_type is compatible */
-
- } else {
- /*
- * user did not specify branch_sample_type
- *
- * For PEBS fixups, we capture all
- * the branches at the priv level of the
- * event.
- */
- *br_type = PERF_SAMPLE_BRANCH_ANY;
-
- if (!event->attr.exclude_user)
- *br_type |= PERF_SAMPLE_BRANCH_USER;
-
- if (!event->attr.exclude_kernel)
- *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
- }
+ }
+ /*
+ * check that PEBS LBR correction does not conflict with
+ * whatever the user is asking with attr->branch_sample_type
+ */
+ if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
+ u64 *br_type = &event->attr.branch_sample_type;
+
+ if (has_branch_stack(event)) {
+ if (!precise_br_compat(event))
+ return -EOPNOTSUPP;
+
+ /* branch_sample_type is compatible */
+
+ } else {
+ /*
+ * user did not specify branch_sample_type
+ *
+ * For PEBS fixups, we capture all
+ * the branches at the priv level of the
+ * event.
+ */
+ *br_type = PERF_SAMPLE_BRANCH_ANY;
+
+ if (!event->attr.exclude_user)
+ *br_type |= PERF_SAMPLE_BRANCH_USER;
+
+ if (!event->attr.exclude_kernel)
+ *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
}
}
--
1.8.3.2
Introduce a new flag, PERF_ATTACH_TASK_DATA, for a perf event's attach
state. The flag is set by the PMU's event_init() callback; it indicates
that the perf event needs PMU-specific data.
The PMU-specific data is initialized to zero. Later patches will
use the PMU-specific data to save the LBR stack.
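
A minimal sketch of how a PMU is expected to request the per-task data
(my_task_ctx and my_pmu are hypothetical names; the x86 LBR patches later
in this series do this for real):

#include <linux/perf_event.h>
#include <linux/types.h>

/* made-up per-task PMU state */
struct my_task_ctx {
	u64	saved_msrs[16];
};

static int my_pmu_event_init(struct perf_event *event)
{
	/* ask the core to allocate zeroed PMU specific data for the task */
	event->attach_state |= PERF_ATTACH_TASK_DATA;
	return 0;
}

static struct pmu my_pmu = {
	.event_init	= my_pmu_event_init,
	.task_ctx_size	= sizeof(struct my_task_ctx),
};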
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 6 ++++++
kernel/events/core.c | 40 ++++++++++++++++++++++++++++++++++++----
2 files changed, 42 insertions(+), 4 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c392fd5..159e092 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -264,6 +264,10 @@ struct pmu {
*/
void (*sched_task) (struct perf_event_context *ctx,
bool sched_in);
+ /*
+ * PMU specific data size
+ */
+ size_t task_ctx_size;
};
@@ -300,6 +304,7 @@ struct swevent_hlist {
#define PERF_ATTACH_CONTEXT 0x01
#define PERF_ATTACH_GROUP 0x02
#define PERF_ATTACH_TASK 0x04
+#define PERF_ATTACH_TASK_DATA 0x08
struct perf_cgroup;
struct ring_buffer;
@@ -509,6 +514,7 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
+ void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
struct delayed_work orphans_remove;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5971e7d..a7c0d5c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -895,6 +895,15 @@ static void get_ctx(struct perf_event_context *ctx)
WARN_ON(!atomic_inc_not_zero(&ctx->refcount));
}
+static void free_ctx(struct rcu_head *head)
+{
+ struct perf_event_context *ctx;
+
+ ctx = container_of(head, struct perf_event_context, rcu_head);
+ kfree(ctx->task_ctx_data);
+ kfree(ctx);
+}
+
static void put_ctx(struct perf_event_context *ctx)
{
if (atomic_dec_and_test(&ctx->refcount)) {
@@ -902,7 +911,7 @@ static void put_ctx(struct perf_event_context *ctx)
put_ctx(ctx->parent_ctx);
if (ctx->task)
put_task_struct(ctx->task);
- kfree_rcu(ctx, rcu_head);
+ call_rcu(&ctx->rcu_head, free_ctx);
}
}
@@ -3188,12 +3197,15 @@ errout:
* Returns a matching context with refcount and pincount.
*/
static struct perf_event_context *
-find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
+find_get_context(struct pmu *pmu, struct task_struct *task,
+ struct perf_event *event)
{
struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_cpu_context *cpuctx;
+ void *task_ctx_data = NULL;
unsigned long flags;
int ctxn, err;
+ int cpu = event->cpu;
if (!task) {
/* Must be root to operate on a CPU event: */
@@ -3221,11 +3233,24 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
if (ctxn < 0)
goto errout;
+ if (event->attach_state & PERF_ATTACH_TASK_DATA) {
+ task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+ if (!task_ctx_data) {
+ err = -ENOMEM;
+ goto errout;
+ }
+ }
+
retry:
ctx = perf_lock_task_context(task, ctxn, &flags);
if (ctx) {
clone_ctx = unclone_ctx(ctx);
++ctx->pin_count;
+
+ if (task_ctx_data && !ctx->task_ctx_data) {
+ ctx->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ }
raw_spin_unlock_irqrestore(&ctx->lock, flags);
if (clone_ctx)
@@ -3236,6 +3261,11 @@ retry:
if (!ctx)
goto errout;
+ if (task_ctx_data) {
+ ctx->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ }
+
err = 0;
mutex_lock(&task->perf_event_mutex);
/*
@@ -3262,9 +3292,11 @@ retry:
}
}
+ kfree(task_ctx_data);
return ctx;
errout:
+ kfree(task_ctx_data);
return ERR_PTR(err);
}
@@ -7344,7 +7376,7 @@ SYSCALL_DEFINE5(perf_event_open,
/*
* Get the target context (task or percpu):
*/
- ctx = find_get_context(pmu, task, event->cpu);
+ ctx = find_get_context(pmu, task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
goto err_alloc;
@@ -7513,7 +7545,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
account_event(event);
- ctx = find_get_context(event->pmu, task, cpu);
+ ctx = find_get_context(event->pmu, task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
goto err_free;
--
1.8.3.2
The LBR call stack data is output in the PERF_SAMPLE_BRANCH_STACK data
format. ip_callchain is changed to indicate the available callchain sources.
The LBR call stack also needs to be shown in user space.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/event.h | 8 ++++++++
tools/perf/util/evsel.c | 21 +++++++++++++--------
2 files changed, 21 insertions(+), 8 deletions(-)
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 5699e7e..3e291c5 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -119,7 +119,15 @@ struct sample_read {
};
};
+/*
+ * From Haswell, the existing Last Branch Record facility can
+ * also be used to record call chains.
+ * source: indicates the available call chains source.
+ */
+#define PERF_FP_CALLCHAIN 0x01
+#define PERF_LBR_CALLCHAIN 0x02
struct ip_callchain {
+ u64 source;
u64 nr;
u64 ips[0];
};
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d1ecde0..431f0eb 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1027,7 +1027,6 @@ static size_t perf_event_attr__fprintf(struct perf_event_attr *attr, FILE *fp)
ret += PRINT_ATTR2(exclude_host, exclude_guest);
ret += PRINT_ATTR2N("excl.callchain_kern", exclude_callchain_kernel,
"excl.callchain_user", exclude_callchain_user);
-
ret += PRINT_ATTR_U32(wakeup_events);
ret += PRINT_ATTR_U32(wakeup_watermark);
ret += PRINT_ATTR_X32(bp_type);
@@ -1439,10 +1438,10 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
const u64 max_callchain_nr = UINT64_MAX / sizeof(u64);
OVERFLOW_CHECK_u64(array);
- data->callchain = (struct ip_callchain *)array++;
+ data->callchain = (struct ip_callchain *)array;
if (data->callchain->nr > max_callchain_nr)
return -EFAULT;
- sz = data->callchain->nr * sizeof(u64);
+ sz = (data->callchain->nr + 2) * sizeof(u64);
OVERFLOW_CHECK(array, sz, max_size);
array = (void *)array + sz;
}
@@ -1465,7 +1464,9 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
array = (void *)array + data->raw_size;
}
- if (type & PERF_SAMPLE_BRANCH_STACK) {
+ if ((type & PERF_SAMPLE_BRANCH_STACK) ||
+ (data->callchain &&
+ (data->callchain->source & PERF_LBR_CALLCHAIN))) {
const u64 max_branch_nr = UINT64_MAX /
sizeof(struct branch_entry);
@@ -1589,7 +1590,7 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
}
if (type & PERF_SAMPLE_CALLCHAIN) {
- sz = (sample->callchain->nr + 1) * sizeof(u64);
+ sz = (sample->callchain->nr + 2) * sizeof(u64);
result += sz;
}
@@ -1598,7 +1599,9 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
result += sample->raw_size;
}
- if (type & PERF_SAMPLE_BRANCH_STACK) {
+ if ((type & PERF_SAMPLE_BRANCH_STACK) ||
+ (sample->callchain &&
+ (sample->callchain->source & PERF_LBR_CALLCHAIN))) {
sz = sample->branch_stack->nr * sizeof(struct branch_entry);
sz += sizeof(u64);
result += sz;
@@ -1744,7 +1747,7 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type,
}
if (type & PERF_SAMPLE_CALLCHAIN) {
- sz = (sample->callchain->nr + 1) * sizeof(u64);
+ sz = (sample->callchain->nr + 2) * sizeof(u64);
memcpy(array, sample->callchain, sz);
array = (void *)array + sz;
}
@@ -1767,7 +1770,9 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type,
array = (void *)array + sample->raw_size;
}
- if (type & PERF_SAMPLE_BRANCH_STACK) {
+ if ((type & PERF_SAMPLE_BRANCH_STACK) ||
+ (sample->callchain &&
+ (sample->callchain->source & PERF_LBR_CALLCHAIN))) {
sz = sample->branch_stack->nr * sizeof(struct branch_entry);
sz += sizeof(u64);
memcpy(array, sample->branch_stack, sz);
--
1.8.3.2
Extend the call-graph option in perf report to support a callchain source (fp
or lbr).
The default value is fp: the frame pointer is the preferred call
chain source, and if it isn't available, LBR data is used instead.
If the value is set to lbr, LBR data is the preferred call chain
source, and if it isn't available, fp data is tried instead.
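
For example (commands as in the cover letter; both assume perf.data
contains fp and LBR data):

# prefer the frame pointer chain, fall back to LBR (the default)
$ perf report --call-graph fractal,0.5,callee,function,fp

# prefer the LBR call stack, fall back to the frame pointer chain
$ perf report --call-graph fractal,0.5,callee,function,lbr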
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-report.c | 8 +-
tools/perf/util/callchain.c | 18 +++-
tools/perf/util/callchain.h | 6 ++
tools/perf/util/machine.c | 198 ++++++++++++++++++++++++++++++--------------
tools/perf/util/session.c | 34 +++++++-
5 files changed, 194 insertions(+), 70 deletions(-)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 140a6cd..23fad5a 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -575,7 +575,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
struct stat st;
bool has_br_stack = false;
int branch_mode = -1;
- char callchain_default_opt[] = "fractal,0.5,callee";
+ char callchain_default_opt[] = "fractal,0.5,callee,function,fp";
const char * const report_usage[] = {
"perf report [<options>]",
NULL
@@ -637,9 +637,9 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
"regex filter to identify parent, see: '--sort parent'"),
OPT_BOOLEAN('x', "exclude-other", &symbol_conf.exclude_other,
"Only display entries with parent-match"),
- OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order",
- "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
- "Default: fractal,0.5,callee,function", &report_parse_callchain_opt, callchain_default_opt),
+ OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order,source",
+ "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address), callchain source(fp or lbr). "
+ "Default: fractal,0.5,callee,function,fp", &report_parse_callchain_opt, callchain_default_opt),
OPT_BOOLEAN(0, "children", &symbol_conf.cumulate_callchain,
"Accumulate callchains of children and show total overhead as well"),
OPT_INTEGER(0, "max-stack", &report.max_stack,
diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index c84d3f8..281ba14 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -152,6 +152,19 @@ static int parse_callchain_sort_key(const char *value)
return -1;
}
+static int parse_callchain_source(const char *value)
+{
+ if (!strncmp(value, "fp", strlen(value))) {
+ callchain_param.source = SOURCE_FP;
+ return 0;
+ }
+ if (!strncmp(value, "lbr", strlen(value))) {
+ callchain_param.source = SOURCE_LBR;
+ return 0;
+ }
+ return -1;
+}
+
int
parse_callchain_report_opt(const char *arg)
{
@@ -173,7 +186,8 @@ parse_callchain_report_opt(const char *arg)
if (!parse_callchain_mode(tok) ||
!parse_callchain_order(tok) ||
- !parse_callchain_sort_key(tok)) {
+ !parse_callchain_sort_key(tok) ||
+ !parse_callchain_source(tok)) {
/* parsing ok - move on to the next */
} else if (!minpcnt_set) {
/* try to get the min percent */
@@ -225,6 +239,8 @@ int perf_callchain_config(const char *var, const char *value)
return parse_callchain_order(value);
if (!strcmp(var, "sort-key"))
return parse_callchain_sort_key(value);
+ if (!strcmp(var, "source"))
+ return parse_callchain_source(value);
if (!strcmp(var, "threshold")) {
callchain_param.min_percent = strtod(value, &endptr);
if (value == endptr)
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 94cfefd..6b3ba57 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -53,6 +53,11 @@ enum chain_key {
CCKEY_ADDRESS
};
+enum chain_source {
+ SOURCE_FP,
+ SOURCE_LBR
+};
+
struct callchain_param {
bool enabled;
enum perf_call_graph_mode record_mode;
@@ -63,6 +68,7 @@ struct callchain_param {
sort_chain_func_t sort;
enum chain_order order;
enum chain_key key;
+ enum chain_source source;
};
extern struct callchain_param callchain_param;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 34fc7c8..9fc5fd9 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1371,19 +1371,81 @@ struct branch_info *sample__resolve_bstack(struct perf_sample *sample,
return bi;
}
+static inline int __machine__resolve_callchain_sample(struct machine *machine,
+ struct thread *thread,
+ u64 ip,
+ u8 *cpumode,
+ struct symbol **parent,
+ struct addr_location *root_al,
+ struct addr_location *al)
+{
+ int err;
+
+ if (ip >= PERF_CONTEXT_MAX) {
+ switch (ip) {
+ case PERF_CONTEXT_HV:
+ *cpumode = PERF_RECORD_MISC_HYPERVISOR;
+ break;
+ case PERF_CONTEXT_KERNEL:
+ *cpumode = PERF_RECORD_MISC_KERNEL;
+ break;
+ case PERF_CONTEXT_USER:
+ *cpumode = PERF_RECORD_MISC_USER;
+ break;
+ default:
+ pr_debug("invalid callchain context: "
+ "%"PRId64"\n", (s64) ip);
+ /*
+ * It seems the callchain is corrupted.
+ * Discard all.
+ */
+ callchain_cursor_reset(&callchain_cursor);
+ return 1;
+ }
+ return 0;
+ }
+
+ al->filtered = 0;
+ thread__find_addr_location(thread, machine, *cpumode,
+ MAP__FUNCTION, ip, al);
+ if (al->sym != NULL) {
+ if (sort__has_parent && !*parent &&
+ symbol__match_regex(al->sym, &parent_regex))
+ *parent = al->sym;
+ else if (have_ignore_callees && root_al &&
+ symbol__match_regex(al->sym, &ignore_callees_regex)) {
+ /* Treat this symbol as the root,
+ forgetting its callees. */
+ *root_al = *al;
+ callchain_cursor_reset(&callchain_cursor);
+ }
+ }
+
+ err = callchain_cursor_append(&callchain_cursor,
+ ip, al->map, al->sym);
+ if (err)
+ return err;
+ return 0;
+}
+
static int machine__resolve_callchain_sample(struct machine *machine,
struct thread *thread,
- struct ip_callchain *chain,
+ struct perf_sample *sample,
struct symbol **parent,
struct addr_location *root_al,
int max_stack)
{
+ struct ip_callchain *chain = sample->callchain;
u8 cpumode = PERF_RECORD_MISC_USER;
int chain_nr = min(max_stack, (int)chain->nr);
- int i;
- int j;
- int err;
+ int i, j, err;
int skip_idx __maybe_unused;
+ int use_fp = (callchain_param.source == SOURCE_FP) ? 1 : 0;
+ u64 ip;
+
+ /* If there isn't user fp callchain available, try LBR */
+ if (!(chain->source & PERF_FP_CALLCHAIN))
+ use_fp = 0;
callchain_cursor_reset(&callchain_cursor);
@@ -1392,73 +1454,83 @@ static int machine__resolve_callchain_sample(struct machine *machine,
return 0;
}
- /*
- * Based on DWARF debug information, some architectures skip
- * a callchain entry saved by the kernel.
- */
- skip_idx = arch_skip_callchain_idx(machine, thread, chain);
-
- for (i = 0; i < chain_nr; i++) {
- u64 ip;
- struct addr_location al;
+again:
+ /* try LBR */
+ if (!use_fp && (chain->source & PERF_LBR_CALLCHAIN)) {
+ struct branch_stack *lbr_stack = sample->branch_stack;
+ int lbr_nr = lbr_stack->nr;
+ int mix_chain_nr;
- if (callchain_param.order == ORDER_CALLEE)
- j = i;
- else
- j = chain->nr - i - 1;
+ for (i = 0; i < chain_nr; i++) {
+ if (chain->ips[i] == PERF_CONTEXT_USER)
+ break;
+ }
-#ifdef HAVE_SKIP_CALLCHAIN_IDX
- if (j == skip_idx)
- continue;
-#endif
- ip = chain->ips[j];
+ /* LBR only affects the user callchain */
+ if (i == chain_nr) {
+ use_fp = 1;
+ goto again;
+ }
- if (ip >= PERF_CONTEXT_MAX) {
- switch (ip) {
- case PERF_CONTEXT_HV:
- cpumode = PERF_RECORD_MISC_HYPERVISOR;
- break;
- case PERF_CONTEXT_KERNEL:
- cpumode = PERF_RECORD_MISC_KERNEL;
- break;
- case PERF_CONTEXT_USER:
- cpumode = PERF_RECORD_MISC_USER;
- break;
- default:
- pr_debug("invalid callchain context: "
- "%"PRId64"\n", (s64) ip);
- /*
- * It seems the callchain is corrupted.
- * Discard all.
- */
- callchain_cursor_reset(&callchain_cursor);
- return 0;
- }
- continue;
+ mix_chain_nr = i + 2 + lbr_nr;
+ if (mix_chain_nr > PERF_MAX_STACK_DEPTH) {
+ pr_warning("corrupted callchain. skipping...\n");
+ return 0;
}
- al.filtered = 0;
- thread__find_addr_location(thread, machine, cpumode,
- MAP__FUNCTION, ip, &al);
- if (al.sym != NULL) {
- if (sort__has_parent && !*parent &&
- symbol__match_regex(al.sym, &parent_regex))
- *parent = al.sym;
- else if (have_ignore_callees && root_al &&
- symbol__match_regex(al.sym, &ignore_callees_regex)) {
- /* Treat this symbol as the root,
- forgetting its callees. */
- *root_al = al;
- callchain_cursor_reset(&callchain_cursor);
+ for (j = 0; j < mix_chain_nr; j++) {
+ struct addr_location al;
+
+ if (callchain_param.order == ORDER_CALLEE) {
+ if (j < i + 2)
+ ip = chain->ips[j];
+ else
+ ip = lbr_stack->entries[j - i - 2].from;
+ } else {
+ if (j < lbr_nr)
+ ip = lbr_stack->entries[lbr_nr - j - 1].from;
+ else
+ ip = chain->ips[i + 1 - (j - lbr_nr)];
}
+ err = __machine__resolve_callchain_sample(machine,
+ thread, ip, &cpumode, parent, root_al, &al);
+ /* Discard all when the callchain is corrupted */
+ if (err > 0)
+ return 0;
+ else if (err)
+ return err;
}
+ } else {
- err = callchain_cursor_append(&callchain_cursor,
- ip, al.map, al.sym);
- if (err)
- return err;
- }
+ /*
+ * Based on DWARF debug information, some architectures skip
+ * a callchain entry saved by the kernel.
+ */
+ skip_idx = arch_skip_callchain_idx(machine, thread, chain);
+
+ for (i = 0; i < chain_nr; i++) {
+ struct addr_location al;
+
+ if (callchain_param.order == ORDER_CALLEE)
+ j = i;
+ else
+ j = chain->nr - i - 1;
+
+#ifdef HAVE_SKIP_CALLCHAIN_IDX
+ if (j == skip_idx)
+ continue;
+#endif
+ ip = chain->ips[j];
+ err = __machine__resolve_callchain_sample(machine,
+ thread, ip, &cpumode, parent, root_al, &al);
+ /* Discard all when the callchain is corrupted */
+ if (err > 0)
+ return 0;
+ else if (err)
+ return err;
+ }
+ }
return 0;
}
@@ -1480,7 +1552,7 @@ int machine__resolve_callchain(struct machine *machine,
int ret;
ret = machine__resolve_callchain_sample(machine, thread,
- sample->callchain, parent,
+ sample, parent,
root_al, max_stack);
if (ret)
return ret;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 6702ac2..75fa183 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -548,12 +548,42 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
static void callchain__printf(struct perf_sample *sample)
{
unsigned int i;
+ u64 total_nr, callchain_nr;
+ int use_fp = (callchain_param.source == SOURCE_FP) ? 1 : 0;
- printf("... chain: nr:%" PRIu64 "\n", sample->callchain->nr);
+ total_nr = callchain_nr = sample->callchain->nr;
- for (i = 0; i < sample->callchain->nr; i++)
+ /* If there isn't user fp callchain available, try LBR */
+ if (!(sample->callchain->source & PERF_FP_CALLCHAIN))
+ use_fp = 0;
+
+ if (!use_fp && (sample->callchain->source & PERF_LBR_CALLCHAIN)) {
+ struct branch_stack *lbr_stack = sample->branch_stack;
+
+ for (i = 0; i < callchain_nr; i++) {
+ if (sample->callchain->ips[i] == PERF_CONTEXT_USER)
+ break;
+ }
+
+ if (i != callchain_nr) {
+ total_nr = i + 1 + lbr_stack->nr;
+ callchain_nr = i + 1;
+ }
+ }
+
+ printf("... chain: nr:%" PRIu64 "\n", total_nr);
+
+ for (i = 0; i < callchain_nr + 1; i++)
printf("..... %2d: %016" PRIx64 "\n",
i, sample->callchain->ips[i]);
+
+ if (total_nr > callchain_nr) {
+ struct branch_stack *lbr_stack = sample->branch_stack;
+
+ for (i = 0; i < lbr_stack->nr; i++)
+ printf("..... %2d: %016" PRIx64 "\n",
+ (int)(i + callchain_nr + 1), lbr_stack->entries[i].from);
+ }
}
static void branch_stack__printf(struct perf_sample *sample)
--
1.8.3.2
Haswell has a new feature that utilizes the existing LBR facility to
record call chains. To enable this feature, bits (JCC, NEAR_IND_JMP,
NEAR_REL_JMP, FAR_BRANCH, EN_CALLSTACK) in LBR_SELECT must be set to 1,
and bits (NEAR_REL_CALL, NEAR_IND_CALL, NEAR_RET) must be cleared. Due to
a Haswell hardware bug, this feature doesn't work well with
FREEZE_LBRS_ON_PMI.
When the call stack feature is enabled, the LBR stack will capture
unfiltered call data normally, but as return instructions are executed,
the last captured branch record is flushed from the on-chip registers
in a last-in first-out (LIFO) manner. Thus, branch information relative
to leaf functions will not be captured, while preserving the call stack
information of the main line execution path.
This patch defines a separate lbr_sel map for Haswell. The map contains
a new entry for the call stack feature.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 14 ++++-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 91 ++++++++++++++++++++++--------
3 files changed, 83 insertions(+), 24 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index c84378b..21e03ed 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -512,7 +512,11 @@ struct x86_pmu {
};
enum {
- PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+ PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+ PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
+
+ PERF_SAMPLE_BRANCH_CALL_STACK =
+ 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
};
#define x86_add_quirk(func_) \
@@ -546,6 +550,12 @@ static struct perf_pmu_events_attr event_attr_##v = { \
extern struct x86_pmu x86_pmu __read_mostly;
+static inline bool x86_pmu_has_lbr_callstack(void)
+{
+ return x86_pmu.lbr_sel_map &&
+ x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+}
+
DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
int x86_perf_event_set_period(struct perf_event *event);
@@ -749,6 +759,8 @@ void intel_pmu_lbr_init_atom(void);
void intel_pmu_lbr_init_snb(void);
+void intel_pmu_lbr_init_hsw(void);
+
int intel_pmu_setup_lbr_filter(struct perf_event *event);
int p4_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 38042e3..e8c690e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2683,7 +2683,7 @@ __init int intel_pmu_init(void)
memcpy(hw_cache_event_ids, hsw_hw_cache_event_ids, sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, hsw_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
- intel_pmu_lbr_init_snb();
+ intel_pmu_lbr_init_hsw();
x86_pmu.event_constraints = intel_hsw_event_constraints;
x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 18d13ca..e0ba9c0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -39,6 +39,7 @@ static enum {
#define LBR_IND_JMP_BIT 6 /* do not capture indirect jumps */
#define LBR_REL_JMP_BIT 7 /* do not capture relative jumps */
#define LBR_FAR_BIT 8 /* do not capture far branches */
+#define LBR_CALL_STACK_BIT 9 /* enable call stack */
#define LBR_KERNEL (1 << LBR_KERNEL_BIT)
#define LBR_USER (1 << LBR_USER_BIT)
@@ -49,6 +50,7 @@ static enum {
#define LBR_REL_JMP (1 << LBR_REL_JMP_BIT)
#define LBR_IND_JMP (1 << LBR_IND_JMP_BIT)
#define LBR_FAR (1 << LBR_FAR_BIT)
+#define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT)
#define LBR_PLM (LBR_KERNEL | LBR_USER)
@@ -74,24 +76,25 @@ static enum {
* x86control flow changes include branches, interrupts, traps, faults
*/
enum {
- X86_BR_NONE = 0, /* unknown */
-
- X86_BR_USER = 1 << 0, /* branch target is user */
- X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
-
- X86_BR_CALL = 1 << 2, /* call */
- X86_BR_RET = 1 << 3, /* return */
- X86_BR_SYSCALL = 1 << 4, /* syscall */
- X86_BR_SYSRET = 1 << 5, /* syscall return */
- X86_BR_INT = 1 << 6, /* sw interrupt */
- X86_BR_IRET = 1 << 7, /* return from interrupt */
- X86_BR_JCC = 1 << 8, /* conditional */
- X86_BR_JMP = 1 << 9, /* jump */
- X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
- X86_BR_IND_CALL = 1 << 11,/* indirect calls */
- X86_BR_ABORT = 1 << 12,/* transaction abort */
- X86_BR_IN_TX = 1 << 13,/* in transaction */
- X86_BR_NO_TX = 1 << 14,/* not in transaction */
+ X86_BR_NONE = 0, /* unknown */
+
+ X86_BR_USER = 1 << 0, /* branch target is user */
+ X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
+
+ X86_BR_CALL = 1 << 2, /* call */
+ X86_BR_RET = 1 << 3, /* return */
+ X86_BR_SYSCALL = 1 << 4, /* syscall */
+ X86_BR_SYSRET = 1 << 5, /* syscall return */
+ X86_BR_INT = 1 << 6, /* sw interrupt */
+ X86_BR_IRET = 1 << 7, /* return from interrupt */
+ X86_BR_JCC = 1 << 8, /* conditional */
+ X86_BR_JMP = 1 << 9, /* jump */
+ X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
+ X86_BR_IND_CALL = 1 << 11,/* indirect calls */
+ X86_BR_ABORT = 1 << 12,/* transaction abort */
+ X86_BR_IN_TX = 1 << 13,/* in transaction */
+ X86_BR_NO_TX = 1 << 14,/* not in transaction */
+ X86_BR_CALL_STACK = 1 << 15,/* call stack */
};
#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -371,7 +374,7 @@ void intel_pmu_lbr_read(void)
* - in case there is no HW filter
* - in case the HW filter has errata or limitations
*/
-static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
{
u64 br_type = event->attr.branch_sample_type;
int mask = 0;
@@ -408,11 +411,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
if (br_type & PERF_SAMPLE_BRANCH_COND)
mask |= X86_BR_JCC;
+ if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
+ if (!x86_pmu_has_lbr_callstack())
+ return -EOPNOTSUPP;
+ if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
+ return -EINVAL;
+ mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
+ X86_BR_CALL_STACK;
+ }
+
/*
* stash actual user request into reg, it may
* be used by fixup code for some CPU
*/
event->hw.branch_reg.reg = mask;
+ return 0;
}
/*
@@ -441,8 +454,12 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
reg = &event->hw.branch_reg;
reg->idx = EXTRA_REG_LBR;
- /* LBR_SELECT operates in suppress mode so invert mask */
- reg->config = ~mask & x86_pmu.lbr_sel_mask;
+ /*
+ * The first 9 bits (LBR_SEL_MASK) in LBR_SELECT operate
+ * in suppress mode. So LBR_SELECT should be set to
+ * (~mask & LBR_SEL_MASK) | (mask & ~LBR_SEL_MASK)
+ */
+ reg->config = mask ^ x86_pmu.lbr_sel_mask;
return 0;
}
@@ -460,7 +477,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
/*
* setup SW LBR filter
*/
- intel_pmu_setup_sw_lbr_filter(event);
+ ret = intel_pmu_setup_sw_lbr_filter(event);
+ if (ret)
+ return ret;
/*
* setup HW LBR filter, if any
@@ -717,6 +736,20 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
[PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};
+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
+ [PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_RETURN | LBR_CALL_STACK,
+};
+
/* core */
void __init intel_pmu_lbr_init_core(void)
{
@@ -773,6 +806,20 @@ void __init intel_pmu_lbr_init_snb(void)
pr_cont("16-deep LBR, ");
}
+/* haswell */
+void intel_pmu_lbr_init_hsw(void)
+{
+ x86_pmu.lbr_nr = 16;
+ x86_pmu.lbr_tos = MSR_LBR_TOS;
+ x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+ x86_pmu.lbr_to = MSR_LBR_NHM_TO;
+
+ x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+ x86_pmu.lbr_sel_map = hsw_lbr_sel_map;
+
+ pr_cont("16-deep LBR, ");
+}
+
/* atom */
void __init intel_pmu_lbr_init_atom(void)
{
--
1.8.3.2
"Zero length call" uses the attribute of the call instruction to push
the immediate instruction pointer onto the stack and then pops that
address off into a register. This is done without any matching return
instruction, which confuses the hardware and makes the recorded call
stack incorrect.
We can partially resolve this issue by decoding call instructions and
discarding any zero-length call entry in the LBR stack.
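For illustration only (not part of the patch), this is the classic
position-independent-code idiom that produces such a call: the
E8 00 00 00 00 call targets the very next instruction, and its pushed
return address is popped straight into a register with no matching ret.
static unsigned long read_pc(void)
{
	unsigned long pc;

	/* "call 1f" pushes the address of label 1, which is then popped
	 * into a register; no ret ever consumes the pushed address. */
	asm volatile("call 1f\n"
		     "1: pop %0"
		     : "=r" (pc));
	return pc;
}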
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index cd4bd6f..fe25c42 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -94,7 +94,8 @@ enum {
X86_BR_ABORT = 1 << 12,/* transaction abort */
X86_BR_IN_TX = 1 << 13,/* in transaction */
X86_BR_NO_TX = 1 << 14,/* not in transaction */
- X86_BR_CALL_STACK = 1 << 15,/* call stack */
+ X86_BR_ZERO_CALL = 1 << 15,/* zero length call */
+ X86_BR_CALL_STACK = 1 << 16,/* call stack */
};
#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -111,13 +112,15 @@ enum {
X86_BR_JMP |\
X86_BR_IRQ |\
X86_BR_ABORT |\
- X86_BR_IND_CALL)
+ X86_BR_IND_CALL |\
+ X86_BR_ZERO_CALL)
#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
#define X86_BR_ANY_CALL \
(X86_BR_CALL |\
X86_BR_IND_CALL |\
+ X86_BR_ZERO_CALL |\
X86_BR_SYSCALL |\
X86_BR_IRQ |\
X86_BR_INT)
@@ -688,6 +691,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
ret = X86_BR_INT;
break;
case 0xe8: /* call near rel */
+ insn_get_immediate(&insn);
+ if (insn.immediate1.value == 0) {
+ /* zero length call */
+ ret = X86_BR_ZERO_CALL;
+ break;
+ }
case 0x9a: /* call far absolute */
ret = X86_BR_CALL;
break;
--
1.8.3.2
With the LBR call stack feature enabled, there are two call chain data
sources: the traditional frame pointer and the LBR call stack.
This patch extends the perf_callchain_entry struct to mark the available
call chain source.
The frame pointer data is still output in the PERF_SAMPLE_CALLCHAIN
format. The LBR call stack data is output in the PERF_SAMPLE_BRANCH_STACK
format.
Note: the LBR call stack is only available for the user callchain; the
kernel callchain is always obtained from frame pointers.
The user space perf tool also needs to be changed to handle these two
sources.
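As a rough user-space sketch (struct and helper names here are
hypothetical; the layout is assumed from this patch), the
PERF_SAMPLE_CALLCHAIN payload now starts with the source word:
#include <linux/types.h>

#define PERF_FP_CALLCHAIN	0x01
#define PERF_LBR_CALLCHAIN	0x02

/* assumed sample layout with this patch: { source, nr, ip[nr] } */
struct sample_callchain {
	__u64 source;	/* PERF_FP_CALLCHAIN and/or PERF_LBR_CALLCHAIN */
	__u64 nr;
	__u64 ips[];
};

static inline int callchain_has_lbr(const struct sample_callchain *cc)
{
	return (cc->source & PERF_LBR_CALLCHAIN) != 0;
}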
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 +++++++
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 ++
include/linux/perf_event.h | 14 +++++++++++++-
kernel/events/callchain.c | 1 +
kernel/events/core.c | 22 +++++++++++++++++-----
7 files changed, 42 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f94a618e..8043526 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2049,6 +2049,10 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
perf_callchain_store(entry, cs_base + frame.return_address);
fp = compat_ptr(ss_base + frame.next_frame);
}
+
+ if (fp != compat_ptr(regs->bp))
+ entry->source |= PERF_FP_CALLCHAIN;
+
return 1;
}
#else
@@ -2101,6 +2105,9 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
perf_callchain_store(entry, frame.return_address);
fp = frame.next_frame;
}
+
+ if (fp != (void __user *)regs->bp)
+ entry->source |= PERF_FP_CALLCHAIN;
}
/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 50bb51d..2808267 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1533,7 +1533,7 @@ again:
perf_sample_data_init(&data, 0, event->hw.last_period);
- if (has_branch_stack(event))
+ if (needs_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;
if (perf_event_overflow(event, &data, regs))
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index b1553d0..5b2d2b3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -907,7 +907,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
data.txn = intel_hsw_transaction(pebs);
}
- if (has_branch_stack(event))
+ if (needs_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;
if (perf_event_overflow(event, &data, &regs))
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 1908875..6e473c9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -745,6 +745,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
int i, j, type;
bool compress = false;
+ cpuc->lbr_stack.user_callstack = branch_user_callstack(br_sel);
+
/* if sampling all branches, then nothing to filter */
if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
return;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 285776a..fd1936f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -55,7 +55,16 @@ struct perf_guest_info_callbacks {
#include <linux/workqueue.h>
#include <asm/local.h>
+/*
+ * From Haswell, the existing Last Branch Record facility can
+ * also be used to record call chains.
+ * source: indicates the available call chains source.
+ */
+#define PERF_FP_CALLCHAIN 0x01
+#define PERF_LBR_CALLCHAIN 0x02
+
struct perf_callchain_entry {
+ __u64 source;
__u64 nr;
__u64 ip[PERF_MAX_STACK_DEPTH];
};
@@ -67,7 +76,9 @@ struct perf_raw_record {
/*
* branch stack layout:
- * nr: number of taken branches stored in entries[]
+ * user_callstack: LBR is enhanced to support call stack profiling.
+ * user_callstack indicates if it's call stack info.
+ * nr: number of taken branches stored in entries[]
*
* Note that nr can vary from sample to sample
* branches (to, from) are stored from most recent
@@ -75,6 +86,7 @@ struct perf_raw_record {
* recent branch.
*/
struct perf_branch_stack {
+ bool user_callstack;
__u64 nr;
struct perf_branch_entry entries[0];
};
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index f2a88de..69fab7c 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -175,6 +175,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!entry)
goto exit_put;
+ entry->source = 0;
entry->nr = 0;
if (kernel && !user_mode(regs)) {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c8e367c..84f9885 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4793,7 +4793,7 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_CALLCHAIN) {
if (data->callchain) {
- int size = 1;
+ int size = 2;
if (data->callchain)
size += data->callchain->nr;
@@ -4824,7 +4824,9 @@ void perf_output_sample(struct perf_output_handle *handle,
}
}
- if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+ /* LBR can be used for call stack, so it may be enabled implicitly. */
+ if ((sample_type & PERF_SAMPLE_BRANCH_STACK) ||
+ (data->br_stack && data->br_stack->user_callstack)) {
if (data->br_stack) {
size_t size;
@@ -4908,13 +4910,21 @@ void perf_prepare_sample(struct perf_event_header *header,
data->ip = perf_instruction_pointer(regs);
if (sample_type & PERF_SAMPLE_CALLCHAIN) {
- int size = 1;
+ int size = 2;
data->callchain = perf_callchain(event, regs);
- if (data->callchain)
+ if (data->callchain) {
size += data->callchain->nr;
+ if (data->br_stack &&
+ data->br_stack->user_callstack &&
+ !(sample_type & PERF_SAMPLE_BRANCH_STACK) &&
+ !(sample_type & PERF_SAMPLE_STACK_USER))
+ data->callchain->source |=
+ PERF_LBR_CALLCHAIN;
+ }
+
header->size += size * sizeof(u64);
}
@@ -4930,7 +4940,9 @@ void perf_prepare_sample(struct perf_event_header *header,
header->size += size;
}
- if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+ /* LBR can be used for call stack, so it may be enabled implicitly. */
+ if ((sample_type & PERF_SAMPLE_BRANCH_STACK) ||
+ (data->br_stack && data->br_stack->user_callstack)) {
int size = sizeof(u64); /* nr */
if (data->br_stack) {
size += data->br_stack->nr
--
1.8.3.2
Use event->attr.branch_sample_type to replace intel_pmu_needs_lbr_smpl(),
which avoids duplicating the code that implicitly enables the LBR.
Currently, the branch stack can be enabled either by explicitly requested
branch sampling or by implicit branch sampling used to correct PEBS skid.
In the explicit case, branch_sample_type is set by the user. In the PEBS
case, branch_sample_type is implicitly set to PERF_SAMPLE_BRANCH_ANY in
x86_pmu_hw_config().
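A minimal side-by-side sketch of the two predicates (simplified to
operate on the attr rather than the kernel's struct perf_event; not the
exact kernel code):
#include <stdbool.h>
#include <linux/perf_event.h>

static bool has_branch_stack(const struct perf_event_attr *attr)
{
	/* user explicitly asked for branch records in the sample */
	return attr->sample_type & PERF_SAMPLE_BRANCH_STACK;
}

static bool needs_branch_stack(const struct perf_event_attr *attr)
{
	/* LBR must be enabled for any reason, explicit or implicit
	 * (PEBS skid correction, and later the LBR call stack) */
	return attr->branch_sample_type != 0;
}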
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 20 +++-----------------
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 3 +++
3 files changed, 11 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index e8c690e..50bb51d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1158,20 +1158,6 @@ static __initconst const u64 slm_hw_cache_event_ids
},
};
-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
- /* user explicitly requested branch sampling */
- if (has_branch_stack(event))
- return true;
-
- /* implicit branch sampling to correct PEBS skid */
- if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2)
- return true;
-
- return false;
-}
-
static void intel_pmu_disable_all(void)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1336,7 +1322,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
* must disable before any actual event
* because any event may be combined with LBR
*/
- if (intel_pmu_needs_lbr_smpl(event))
+ if (needs_branch_stack(event))
intel_pmu_lbr_disable(event);
if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
@@ -1397,7 +1383,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
* must enabled before any actual event
* because any event may be combined with LBR
*/
- if (intel_pmu_needs_lbr_smpl(event))
+ if (needs_branch_stack(event))
intel_pmu_lbr_enable(event);
if (event->attr.exclude_host)
@@ -1876,7 +1862,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.precise_ip && x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
- if (intel_pmu_needs_lbr_smpl(event)) {
+ if (needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
if (ret)
return ret;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 159e092..285776a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -791,6 +791,11 @@ static inline bool has_branch_stack(struct perf_event *event)
return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
}
+static inline bool needs_branch_stack(struct perf_event *event)
+{
+ return event->attr.branch_sample_type != 0;
+}
+
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5b5f143..c8e367c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7028,6 +7028,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
goto err_ns;
+ if (!has_branch_stack(event))
+ event->attr.branch_sample_type = 0;
+
pmu = perf_init_event(event);
if (!pmu)
goto err_ns;
--
1.8.3.2
When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is to save/restore the
LBR stack to/from the task's perf event context.
The LBR stack is saved/restored only when there are events that use
the LBR call stack. If no event uses the LBR call stack, the LBR stack
is simply reset when a task is scheduled in.
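The save path walks the LBR ring from the top of stack downwards; a
standalone sketch of that indexing (MSR accesses replaced by plain
arrays, names hypothetical):
#include <linux/types.h>

static void lbr_ring_save(unsigned int lbr_nr, __u64 tos,
			  const __u64 *hw_from, const __u64 *hw_to,
			  __u64 *save_from, __u64 *save_to)
{
	unsigned int i, idx, mask = lbr_nr - 1;	/* lbr_nr is a power of two */

	for (i = 0; i < lbr_nr; i++) {
		/* entry i sits at (tos - i) modulo the ring size */
		idx = (tos - i) & mask;
		save_from[i] = hw_from[idx];
		save_to[i] = hw_to[idx];
	}
}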
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 89 ++++++++++++++++++++++++++----
1 file changed, 77 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index f9b9f2b..1908875 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -180,13 +180,90 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}
+/*
+ * TOS = most recently recorded branch
+ */
+static inline u64 intel_pmu_lbr_tos(void)
+{
+ u64 tos;
+
+ rdmsrl(x86_pmu.lbr_tos, tos);
+ return tos;
+}
+
+enum {
+ LBR_NONE,
+ LBR_VALID,
+};
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+ int i;
+ unsigned lbr_idx, mask;
+ u64 tos;
+
+ if (task_ctx->lbr_callstack_users == 0 ||
+ task_ctx->lbr_stack_state == LBR_NONE) {
+ intel_pmu_lbr_reset();
+ return;
+ }
+
+ mask = x86_pmu.lbr_nr - 1;
+ tos = intel_pmu_lbr_tos();
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ lbr_idx = (tos - i) & mask;
+ wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+ wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+ }
+ task_ctx->lbr_stack_state = LBR_NONE;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+ int i;
+ unsigned lbr_idx, mask;
+ u64 tos;
+
+ if (task_ctx->lbr_callstack_users == 0) {
+ task_ctx->lbr_stack_state = LBR_NONE;
+ return;
+ }
+
+ mask = x86_pmu.lbr_nr - 1;
+ tos = intel_pmu_lbr_tos();
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ lbr_idx = (tos - i) & mask;
+ rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+ rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+ }
+ task_ctx->lbr_stack_state = LBR_VALID;
+}
+
+
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;
if (!x86_pmu.lbr_nr)
return;
/*
+ * If LBR callstack feature is enabled and the stack was saved when
+ * the task was scheduled out, restore the stack. Otherwise flush
+ * the LBR stack.
+ */
+ task_ctx = ctx ? ctx->task_ctx_data : NULL;
+ if (task_ctx) {
+ if (sched_in) {
+ __intel_pmu_lbr_restore(task_ctx);
+ cpuc->lbr_context = ctx;
+ } else {
+ __intel_pmu_lbr_save(task_ctx);
+ }
+ return;
+ }
+
+ /*
* When sampling the branck stack in system-wide, it may be
* necessary to flush the stack on context switch. This happens
* when the branch stack does not tag its entries with the pid
@@ -277,18 +354,6 @@ void intel_pmu_lbr_disable_all(void)
__intel_pmu_lbr_disable();
}
-/*
- * TOS = most recently recorded branch
- */
-static inline u64 intel_pmu_lbr_tos(void)
-{
- u64 tos;
-
- rdmsrl(x86_pmu.lbr_tos, tos);
-
- return tos;
-}
-
static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
{
unsigned long mask = x86_pmu.lbr_nr - 1;
--
1.8.3.2
Only enable the LBR call stack when the user requests an FP callgraph.
The feature is not available when PERF_SAMPLE_BRANCH_STACK or
PERF_SAMPLE_STACK_USER is requested.
Also, this feature only affects how the user callchain is obtained; the
kernel callchain is always obtained from frame pointers.
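A hedged user-space sketch of an attr that meets these conditions on a
Haswell machine (helper name hypothetical); with it, the kernel fills
in branch_sample_type on its own:
#include <string.h>
#include <linux/perf_event.h>

static void cycles_with_fp_callgraph(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 100000;
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
	/* exclude_user stays 0; PERF_SAMPLE_BRANCH_STACK and
	 * PERF_SAMPLE_STACK_USER are deliberately not set, and the
	 * event must be opened on a task (pid >= 0) for the LBR
	 * call stack fallback to kick in */
}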
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9656b9e..b3256a3 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -424,10 +424,24 @@ int x86_pmu_hw_config(struct perf_event *event)
if (!event->attr.exclude_kernel)
*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
}
- }
+ } else if (x86_pmu_has_lbr_callstack() &&
+ (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+ !(event->attr.sample_type & PERF_SAMPLE_STACK_USER) &&
+ !has_branch_stack(event) &&
+ !event->attr.exclude_user &&
+ (event->attach_state & PERF_ATTACH_TASK)) {
+ /*
+ * user did not specify branch_sample_type,
+ * try using the LBR call stack facility to
+ * record call chains of user program.
+ */
+ event->attr.branch_sample_type =
+ PERF_SAMPLE_BRANCH_USER |
+ PERF_SAMPLE_BRANCH_CALL_STACK;
- if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+ /* needs PMU specific data to save LBR stack */
event->attach_state |= PERF_ATTACH_TASK_DATA;
+ }
/*
* Generate PMC IRQs:
--
1.8.3.2
The LBR call stack is designed for use with PEBS; it does not work well
with FREEZE_LBRS_ON_PMI for non-PEBS events. If FREEZE_LBRS_ON_PMI is set
for a non-PEBS event, PMIs near call/return instructions may cause a
superfluous increase/decrease of LBR_TOS.
This patch modifies __intel_pmu_lbr_enable() to not set
FREEZE_LBRS_ON_PMI when the LBR operates in call stack mode. We currently
don't use the LBR call stack to capture kernel space callchains, so
disabling FREEZE_LBRS_ON_PMI should not be a problem.
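The DEBUGCTL decision can be summarized as follows (bit positions follow
the kernel headers and this patch set; a sketch, not the kernel code):
#include <linux/types.h>

#define DEBUGCTLMSR_LBR			(1ULL << 0)
#define DEBUGCTLMSR_FREEZE_LBRS_ON_PMI	(1ULL << 11)
#define LBR_CALL_STACK			(1ULL << 9)

static __u64 lbr_debugctl_bits(__u64 lbr_select)
{
	__u64 bits = DEBUGCTLMSR_LBR;

	/* freezing LBRs on PMI breaks call-stack mode, so skip it there */
	if (!(lbr_select & LBR_CALL_STACK))
		bits |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
	return bits;
}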
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 6e473c9..cd4bd6f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -131,14 +131,23 @@ static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
static void __intel_pmu_lbr_enable(void)
{
- u64 debugctl;
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ u64 debugctl, lbr_select = 0;
- if (cpuc->lbr_sel)
- wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
+ if (cpuc->lbr_sel) {
+ lbr_select = cpuc->lbr_sel->config;
+ wrmsrl(MSR_LBR_SELECT, lbr_select);
+ }
rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
- debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
+ debugctl |= DEBUGCTLMSR_LBR;
+ /*
+ * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
+ * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
+ * may cause superfluous increase/decrease of LBR_TOS.
+ */
+ if (!(lbr_select & LBR_CALL_STACK))
+ debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
}
--
1.8.3.2
When enabling/disabling an event, check whether the event uses the LBR
call stack feature and adjust the LBR call stack usage count accordingly.
A later patch will use the usage count to decide whether the LBR stack
should be saved/restored.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index e0ba9c0..f9b9f2b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -204,9 +204,15 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
}
}
+static inline bool branch_user_callstack(unsigned br_sel)
+{
+ return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
+}
+
void intel_pmu_lbr_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;
if (!x86_pmu.lbr_nr)
return;
@@ -220,6 +226,12 @@ void intel_pmu_lbr_enable(struct perf_event *event)
}
cpuc->br_sel = event->hw.branch_reg.reg;
+ if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
+ event->ctx->task_ctx_data) {
+ task_ctx = event->ctx->task_ctx_data;
+ task_ctx->lbr_callstack_users++;
+ }
+
cpuc->lbr_users++;
perf_sched_cb_inc(event->ctx->pmu);
}
@@ -227,10 +239,17 @@ void intel_pmu_lbr_enable(struct perf_event *event)
void intel_pmu_lbr_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;
if (!x86_pmu.lbr_nr)
return;
+ if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
+ event->ctx->task_ctx_data) {
+ task_ctx = event->ctx->task_ctx_data;
+ task_ctx->lbr_callstack_users--;
+ }
+
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
perf_sched_cb_dec(event->ctx->pmu);
--
1.8.3.2
If two tasks were both forked from the same parent task, the events in
their perf task contexts can be the same, and the perf core may skip
switching the perf event contexts.
The previous patch introduces PMU specific data for saving the LBR
stack, and that data is task specific. So we need to swap the data even
when the context switch itself is optimized out.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
kernel/events/core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a7c0d5c..5b5f143 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2420,6 +2420,9 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
next->perf_event_ctxp[ctxn] = ctx;
ctx->task = next;
next_ctx->task = task;
+
+ swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
+
do_switch = 0;
perf_event_sync_stat(ctx, next_ctx);
--
1.8.3.2
When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. We can use PMU specific data to store
the LBR stack when a task is scheduled out. This patch adds the code
that allocates that PMU specific data.
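On the core side, a simplified sketch (assuming, as this series implies,
that struct pmu carries a task_ctx_size field) shows that the allocation
only needs the size, not the layout:
#include <linux/perf_event.h>
#include <linux/slab.h>

static void *alloc_task_ctx_data(struct pmu *pmu)
{
	if (!pmu->task_ctx_size)
		return NULL;
	/* opaque per-task storage; only the PMU driver knows the layout
	 * (struct x86_perf_task_context for x86) */
	return kzalloc(pmu->task_ctx_size, GFP_KERNEL);
}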
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Reviewed-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 4 ++++
arch/x86/kernel/cpu/perf_event.h | 7 +++++++
2 files changed, 11 insertions(+)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index fa7a168..f94a618e 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -427,6 +427,9 @@ int x86_pmu_hw_config(struct perf_event *event)
}
}
+ if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+ event->attach_state |= PERF_ATTACH_TASK_DATA;
+
/*
* Generate PMC IRQs:
* (keep 'enabled' bit clear for now)
@@ -1920,6 +1923,7 @@ static struct pmu pmu = {
.event_idx = x86_pmu_event_idx,
.sched_task = x86_pmu_sched_task,
+ .task_ctx_size = sizeof(struct x86_perf_task_context),
};
void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 21e03ed..c210bff 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -511,6 +511,13 @@ struct x86_pmu {
struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
};
+struct x86_perf_task_context {
+ u64 lbr_from[MAX_LBR_ENTRIES];
+ u64 lbr_to[MAX_LBR_ENTRIES];
+ int lbr_callstack_users;
+ int lbr_stack_state;
+};
+
enum {
PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
--
1.8.3.2
On Sun, Oct 19, 2014 at 05:54:56PM -0400, Kan Liang wrote:
This should still very much have:
From: Yan, Zheng <[email protected]>
Seeing how you did not write this patch, probably true for all the
others too, although I've not checked yet.
> The index of lbr_sel_map is bit value of perf branch_sample_type.
> PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map uses
> 4096 bytes. By using bit shift as index, we can reduce lbr_sel_map
> size to 40 bytes. This patch defines 'bit shift' for branch types,
> and use 'bit shift' to define lbr_sel_maps.
>
> Signed-off-by: Yan, Zheng <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> Reviewed-by: Stephane Eranian <[email protected]>
> ---
On Sun, Oct 19, 2014 at 05:55:12PM -0400, Kan Liang wrote:
SNIP
> - return 0;
> - }
> - continue;
> + mix_chain_nr = i + 2 + lbr_nr;
> + if (mix_chain_nr > PERF_MAX_STACK_DEPTH) {
> + pr_warning("corrupted callchain. skipping...\n");
> + return 0;
> }
>
> - al.filtered = 0;
> - thread__find_addr_location(thread, machine, cpumode,
> - MAP__FUNCTION, ip, &al);
> - if (al.sym != NULL) {
> - if (sort__has_parent && !*parent &&
> - symbol__match_regex(al.sym, &parent_regex))
> - *parent = al.sym;
> - else if (have_ignore_callees && root_al &&
> - symbol__match_regex(al.sym, &ignore_callees_regex)) {
> - /* Treat this symbol as the root,
> - forgetting its callees. */
> - *root_al = al;
> - callchain_cursor_reset(&callchain_cursor);
> + for (j = 0; j < mix_chain_nr; j++) {
> + struct addr_location al;
> +
> + if (callchain_param.order == ORDER_CALLEE) {
> + if (j < i + 2)
> + ip = chain->ips[j];
> + else
> + ip = lbr_stack->entries[j - i - 2].from;
> + } else {
> + if (j < lbr_nr)
> + ip = lbr_stack->entries[lbr_nr - j - 1].from;
> + else
> + ip = chain->ips[i + 1 - (j - lbr_nr)];
> }
> + err = __machine__resolve_callchain_sample(machine,
> + thread, ip, &cpumode, parent, root_al, &al);
> + /* Discard all when the callchain is corrupted */
> + if (err > 0)
> + return 0;
> + else if (err)
> + return err;
so you print FP callchains followed by LBR stack data, right?
but AFAICS from kernel changes the FP callchains and LBR callchains
data are unrelated.. 2 datasources of the same information
do we rather want to print them separately? or using an option
as Andi did in his lbr-as-callgraph patchset:
http://marc.info/?l=linux-kernel&m=141177467802602&w=2
thanks,
jirka
On Sun, Oct 19, 2014 at 05:55:08PM -0400, Kan Liang wrote:
> Only enable LBR callstack when user requires fp callgraph. The feature
> is not available when PERF_SAMPLE_BRANCH_STACK or PERF_SAMPLE_STACK_USER
> is required.
> Also, this feature only affects how to get user callchain. The kernel
> callchain is always got by frame pointers.
>
> Signed-off-by: Yan, Zheng <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 9656b9e..b3256a3 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -424,10 +424,24 @@ int x86_pmu_hw_config(struct perf_event *event)
> if (!event->attr.exclude_kernel)
> *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
> }
> - }
> + } else if (x86_pmu_has_lbr_callstack() &&
> + (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
> + !(event->attr.sample_type & PERF_SAMPLE_STACK_USER) &&
> + !has_branch_stack(event) &&
> + !event->attr.exclude_user &&
> + (event->attach_state & PERF_ATTACH_TASK)) {
> + /*
> + * user did not specify branch_sample_type,
> + * try using the LBR call stack facility to
> + * record call chains of user program.
> + */
> + event->attr.branch_sample_type =
> + PERF_SAMPLE_BRANCH_USER |
> + PERF_SAMPLE_BRANCH_CALL_STACK;
>
I dont see PERF_SAMPLE_BRANCH_CALL_STACK being defind in uapi.. any reason
why I cant enable this feature explicitly?
thanks,
jirka
On Fri, Oct 24, 2014 at 03:36:00PM +0200, Jiri Olsa wrote:
> On Sun, Oct 19, 2014 at 05:55:12PM -0400, Kan Liang wrote:
>
> SNIP
>
> > - return 0;
> > - }
> > - continue;
> > + mix_chain_nr = i + 2 + lbr_nr;
> > + if (mix_chain_nr > PERF_MAX_STACK_DEPTH) {
> > + pr_warning("corrupted callchain. skipping...\n");
> > + return 0;
> > }
> >
> > - al.filtered = 0;
> > - thread__find_addr_location(thread, machine, cpumode,
> > - MAP__FUNCTION, ip, &al);
> > - if (al.sym != NULL) {
> > - if (sort__has_parent && !*parent &&
> > - symbol__match_regex(al.sym, &parent_regex))
> > - *parent = al.sym;
> > - else if (have_ignore_callees && root_al &&
> > - symbol__match_regex(al.sym, &ignore_callees_regex)) {
> > - /* Treat this symbol as the root,
> > - forgetting its callees. */
> > - *root_al = al;
> > - callchain_cursor_reset(&callchain_cursor);
> > + for (j = 0; j < mix_chain_nr; j++) {
> > + struct addr_location al;
> > +
> > + if (callchain_param.order == ORDER_CALLEE) {
> > + if (j < i + 2)
> > + ip = chain->ips[j];
> > + else
> > + ip = lbr_stack->entries[j - i - 2].from;
> > + } else {
> > + if (j < lbr_nr)
> > + ip = lbr_stack->entries[lbr_nr - j - 1].from;
> > + else
> > + ip = chain->ips[i + 1 - (j - lbr_nr)];
> > }
> > + err = __machine__resolve_callchain_sample(machine,
> > + thread, ip, &cpumode, parent, root_al, &al);
> > + /* Discard all when the callchain is corrupted */
> > + if (err > 0)
> > + return 0;
> > + else if (err)
> > + return err;
>
> so you print FP callchains followed by LBR stack data, right?
>
> but AFAICS from kernel changes the FP callchains and LBR callchains
> data are unrelated.. 2 datasources of the same information
>
> do we rather want to print them separately? or using an option
> as Andi did in his lbr-as-callgraph patchset:
> http://marc.info/?l=linux-kernel&m=141177467802602&w=2
hum, sorry I've got confused by perf report -D output:
5719019682019 0x2128 [0x80]: PERF_RECORD_SAMPLE(IP, 0x2): 2499/2499: 0x401791 period: 327077 addr: 0
... chain: nr:3
..... 0: fffffffffffffe00 FP
..... 1: 0000000000401791 FP
..... 2: 00000032dca21d63 LBR
..... 3: 000000000040184c LBR
which sometimes displays user space FP data with LBR.. but I guess
the intention was to display either userspace FP or LBR, right?
I think we should have an option to be able to choose/switch.
Once general option that will tell report to use:
FP, DWARF, LBR (branches), LBR (callchain)
setting by default whatever the best option is based on the data we have.
thanks,
jirka
>
> On Sun, Oct 19, 2014 at 05:55:08PM -0400, Kan Liang wrote:
> > Only enable LBR callstack when user requires fp callgraph. The feature
> > is not available when PERF_SAMPLE_BRANCH_STACK or
> > PERF_SAMPLE_STACK_USER is required.
> > Also, this feature only affects how to get user callchain. The kernel
> > callchain is always got by frame pointers.
> >
> > Signed-off-by: Yan, Zheng <[email protected]>
> > Signed-off-by: Kan Liang <[email protected]>
> > ---
> > arch/x86/kernel/cpu/perf_event.c | 18 ++++++++++++++++--
> > 1 file changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/perf_event.c
> > b/arch/x86/kernel/cpu/perf_event.c
> > index 9656b9e..b3256a3 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -424,10 +424,24 @@ int x86_pmu_hw_config(struct perf_event
> *event)
> > if (!event->attr.exclude_kernel)
> > *br_type |=
> PERF_SAMPLE_BRANCH_KERNEL;
> > }
> > - }
> > + } else if (x86_pmu_has_lbr_callstack() &&
> > + (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> &&
> > + !(event->attr.sample_type & PERF_SAMPLE_STACK_USER)
> &&
> > + !has_branch_stack(event) &&
> > + !event->attr.exclude_user &&
> > + (event->attach_state & PERF_ATTACH_TASK)) {
> > + /*
> > + * user did not specify branch_sample_type,
> > + * try using the LBR call stack facility to
> > + * record call chains of user program.
> > + */
> > + event->attr.branch_sample_type =
> > + PERF_SAMPLE_BRANCH_USER |
> > + PERF_SAMPLE_BRANCH_CALL_STACK;
> >
>
> I dont see PERF_SAMPLE_BRANCH_CALL_STACK being defind in uapi.. any
> reason why I cant enable this feature explicitly?
The LBR call stack has some limitations. E.g. the feature is only
available on HSW+, and it only covers the user callchain. We also cannot
collect branch information and the call chain from the LBR at the same
time. So the feature is designed as an alternative way to get a
callgraph, and it isn't exposed for explicit enabling. Otherwise it may
confuse the user, who enables BRANCH_CALL_STACK but gets data that is
only partly, or not at all, from that hardware facility.
We have an option for perf report so the user can choose the preferred
data source. If the user sets it to LBR, the LBR data is tried first; if
it is not available, FP is used instead. So the user never loses
callchain data.
Thanks,
Kan
>
> thanks,
> jirka
>
> On Fri, Oct 24, 2014 at 03:36:00PM +0200, Jiri Olsa wrote:
> > On Sun, Oct 19, 2014 at 05:55:12PM -0400, Kan Liang wrote:
> >
> > SNIP
> >
> > > - return 0;
> > > - }
> > > - continue;
> > > + mix_chain_nr = i + 2 + lbr_nr;
> > > + if (mix_chain_nr > PERF_MAX_STACK_DEPTH) {
> > > + pr_warning("corrupted callchain. skipping...\n");
> > > + return 0;
> > > }
> > >
> > > - al.filtered = 0;
> > > - thread__find_addr_location(thread, machine, cpumode,
> > > - MAP__FUNCTION, ip, &al);
> > > - if (al.sym != NULL) {
> > > - if (sort__has_parent && !*parent &&
> > > - symbol__match_regex(al.sym, &parent_regex))
> > > - *parent = al.sym;
> > > - else if (have_ignore_callees && root_al &&
> > > - symbol__match_regex(al.sym,
> &ignore_callees_regex)) {
> > > - /* Treat this symbol as the root,
> > > - forgetting its callees. */
> > > - *root_al = al;
> > > - callchain_cursor_reset(&callchain_cursor);
> > > + for (j = 0; j < mix_chain_nr; j++) {
> > > + struct addr_location al;
> > > +
> > > + if (callchain_param.order == ORDER_CALLEE) {
> > > + if (j < i + 2)
> > > + ip = chain->ips[j];
> > > + else
> > > + ip = lbr_stack->entries[j - i - 2].from;
> > > + } else {
> > > + if (j < lbr_nr)
> > > + ip = lbr_stack->entries[lbr_nr - j -
> 1].from;
> > > + else
> > > + ip = chain->ips[i + 1 - (j - lbr_nr)];
> > > }
> > > + err =
> __machine__resolve_callchain_sample(machine,
> > > + thread, ip, &cpumode, parent, root_al, &al);
> > > + /* Discard all when the callchain is corrupted */
> > > + if (err > 0)
> > > + return 0;
> > > + else if (err)
> > > + return err;
> >
> > so you print FP callchains followed by LBR stack data, right?
> >
> > but AFAICS from kernel changes the FP callchains and LBR callchains
> > data are unrelated.. 2 datasources of the same information
> >
> > do we rather want to print them separately? or using an option as Andi
> > did in his lbr-as-callgraph patchset:
> > http://marc.info/?l=linux-kernel&m=141177467802602&w=2
>
> hum, sorry I've got confused by perf report -D output:
>
> 5719019682019 0x2128 [0x80]: PERF_RECORD_SAMPLE(IP, 0x2): 2499/2499: 0x401791 period: 327077 addr: 0
> ... chain: nr:3
> ..... 0: fffffffffffffe00 FP
> ..... 1: 0000000000401791 FP
> ..... 2: 00000032dca21d63 LBR
> ..... 3: 000000000040184c LBR
>
> which sometimes displays user space FP data with LBR.. but I guess the
> intention was to display either userspace FP or LBR, right?
>
It depends on how the user sets the perf report --call-graph option, and
on whether both kinds of data are available.
Let's say both FP and LBR data are available.
If the user prefers the FP data, here are the command and the call chain
from the last sample.
perf report --call-graph fractal,0.5,callee,function,fp -D
3741913434263 0x3a2960 [0xe8]: PERF_RECORD_SAMPLE(IP, 0x2): 20920/20920: 0x400568 period: 611393 addr: 0
... chain: nr:6
..... 0: fffffffffffffe00
..... 1: 0000000000400568
..... 2: 000000000040058c <---- FP data start from here
..... 3: 00000000004005b8
..... 4: 00000000004005fe
..... 5: 0000003d1cc21b45
..... 6: 0000000000000005
... thread: a.out:20920
...... dso: /home/lk/a.out
If the user prefers the LBR data source, here are the command and the
call chain from the last sample.
perf report --call-graph fractal,0.5,callee,function,lbr -D
3741913434263 0x3a2960 [0xe8]: PERF_RECORD_SAMPLE(IP, 0x2): 20920/20920: 0x400568 period: 611393 addr: 0
... chain: nr:6
..... 0: fffffffffffffe00
..... 1: 0000000000400568
..... 2: 0000000000400587 <---- LBR data start from here
..... 3: 00000000004005b3
..... 4: 00000000004005f9
..... 5: 0000003d1cc21b43
..... 6: 0000000000400474
... thread: a.out:20920
...... dso: /home/lk/a.out
If only one data source is available (either FP or LBR), that data is
printed regardless of which option is set.
> I think we should have an option to be able to choose/switch.
> Once general option that will tell report to use:
> FP, DWARF, LBR (branches), LBR (callchain)
We cannot collect LBR branch records and the LBR call chain at the same
time. If both a call chain and branches are requested, only branches are
collected. The choice/switch can therefore only happen when the user
collects the call chain with FP. So I only extend perf report's
--call-graph option with a callchain source (fp or lbr); the default is fp.
Thanks,
Kan
>
> setting by default whatever the best option is based on the data we have.
>
> thanks,
> jirka
Hi Peter,
Did you get a chance to review the rest of the patch set?
Thanks,
Kan
>
> On Sun, Oct 19, 2014 at 05:54:56PM -0400, Kan Liang wrote:
>
> This should still very much have:
>
> From: Yan, Zheng <[email protected]>
>
> Seeing how you did not write this patch, probably true for all the others
> too, although I've not checked yet.
>
> > The index of lbr_sel_map is bit value of perf branch_sample_type.
> > PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map
> uses
> > 4096 bytes. By using bit shift as index, we can reduce lbr_sel_map
> > size to 40 bytes. This patch defines 'bit shift' for branch types, and
> > use 'bit shift' to define lbr_sel_maps.
> >
> > Signed-off-by: Yan, Zheng <[email protected]>
> > Signed-off-by: Kan Liang <[email protected]>
> > Reviewed-by: Stephane Eranian <[email protected]>
> > ---
On Tue, Nov 04, 2014 at 01:07:39AM +0000, Liang, Kan wrote:
> Hi Peter,
>
> Did you get a chance to review the rest of the patch set?
No point in reviewing stuff I could not apply if I wanted to is there?
On Tue, Nov 04, 2014 at 08:14:21AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 04, 2014 at 01:07:39AM +0000, Liang, Kan wrote:
> > Hi Peter,
> >
> > Did you get a chance to review the rest of the patch set?
>
> No point in reviewing stuff I could not apply if I wanted to is there?
So please (re)read Documentation/SubmittingPatches and try again. Also,
you might want to not Cc Zheng on the next posting (but very much leave
him attribution, just not actually send the mail to a dead email
address).