2014-11-05 03:10:19

by Liang, Kan

Subject: [PATCH V7 00/17] perf, x86: Haswell LBR call stack support

For many profiling tasks we need the callgraph. For example, we often
need to see the caller of a lock, a memcpy, or some other library
function in order to tune the program. Frame pointer unwinding is
efficient and works well, but frame pointers are off by default in
64-bit code (and with modern 32-bit gcc), so there are many binaries
around that do not use them. Profiling unchanged production code is
very useful in practice. On some CPUs frame pointers also have a high
cost. DWARF2 unwinding does not always work either, and it is extremely
slow (up to 20% overhead).

Haswell has a new feature that uses the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative way to
get the callgraph. It has some limitations too, but it should work in
most cases and is significantly faster than DWARF. Frame pointer
unwinding is still the best default, but the LBR call stack is a good
alternative when nothing else works.
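
As a rough illustration of the push-on-call / pop-on-return behaviour
described above (not part of this series), the following user-space
sketch models the LBR as a small bounded stack. The depth of 16 and all
lbr_sim_* names are assumptions made for the example; real hardware
uses a ring of MSRs indexed by a top-of-stack pointer rather than
shifting entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LBR_SIM_DEPTH 16 /* assumed number of LBR entries, illustrative only */

struct lbr_sim {
	uint64_t from[LBR_SIM_DEPTH]; /* call-site addresses, oldest first */
	size_t nr;                    /* number of valid entries */
};

/* A call pushes a record; when the stack is full the oldest record is lost. */
static void lbr_sim_call(struct lbr_sim *lbr, uint64_t call_site)
{
	assert(lbr->nr <= LBR_SIM_DEPTH);

	if (lbr->nr == LBR_SIM_DEPTH) {
		for (size_t i = 1; i < LBR_SIM_DEPTH; i++)
			lbr->from[i - 1] = lbr->from[i];
		lbr->nr--;
	}
	lbr->from[lbr->nr++] = call_site;
}

/* A return pops the most recently captured record, if there is one. */
static void lbr_sim_ret(struct lbr_sim *lbr)
{
	if (lbr->nr > 0)
		lbr->nr--;
}
```

In this model, a call chain deeper than LBR_SIM_DEPTH keeps only the
most recent entries, and a call without a matching return leaves a
stale entry behind.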

In the implementation, both frame pointer and LBR call stack data are
collected by the kernel and exposed to user space. The frame pointer
call chain is still output in the PERF_SAMPLE_CALLCHAIN data format. The
LBR call stack data is output in the PERF_SAMPLE_BRANCH_STACK data
format. A callchain source extension is added to the perf report
call-graph option, so the user can choose the call chain from either FP
or the LBR call stack.

When profiling bc(1) on Fedora 19:
echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
If this feature is enabled, perf report with lbr output looks like:
50.36% bc bc [.] bc_divide
|
--- bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
33.66% bc bc [.] _one_mult
|
--- _one_mult
bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
7.62% bc bc [.] _bc_do_add
|
--- _bc_do_add
|
|--99.89%-- 0x2000186a8
--0.11%-- [...]
6.83% bc bc [.] _bc_do_sub
|
--- _bc_do_sub
|
|--99.94%-- bc_add
| execute
| run_code
| yyparse
| main
| __libc_start_main
| _start
--0.06%-- [...]
0.46% bc libc-2.17.so [.] __memset_sse2
|
--- __memset_sse2
|
|--54.13%-- bc_new_num
| |
| |--51.00%-- bc_divide
| | execute
| | run_code
| | yyparse
| | main
| | __libc_start_main
| | _start
| |
| |--30.46%-- _bc_do_sub
| | bc_add
| | execute
| | run_code
| | yyparse
| | main
| | __libc_start_main
| | _start
| |
| --18.55%-- _bc_do_add
| bc_add
| execute
| run_code
| yyparse
| main
| __libc_start_main
| _start
|
--45.87%-- bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
If this feature is disabled, perf report output looks like:
50.49% bc bc [.] bc_divide
|
--- bc_divide
33.57% bc bc [.] _one_mult
|
--- _one_mult
7.61% bc bc [.] _bc_do_add
|
--- _bc_do_add
0x2000186a8
6.88% bc bc [.] _bc_do_sub
|
--- _bc_do_sub
0.42% bc libc-2.17.so [.] __memcpy_ssse3_back
|
--- __memcpy_ssse3_back

Another example demonstrates the perf report extension.
If both fp and lbr data are available, we can dump either of them with
the fp or lbr option, as below.

$ perf record --call-graph fp ./a.out
[ perf record: Woken up 18 times to write data ]
[ perf record: Captured and wrote 4.322 MB perf.data (~188824 samples) ]

$ perf report --call-graph fractal,0.5,callee,function,fp -D | wc -l
605688
$ perf report --call-graph fractal,0.5,callee,function,lbr -D | wc -l
605730


The LBR call stack has the following known limitations:
- Zero length calls are not filtered out by hardware
- Exception handling such as setjmp/longjmp leaves calls and returns
unmatched
- Pushing a different return address onto the stack leaves calls and
returns unmatched
- If the call stack is deeper than the LBR, only the last entries are captured

Changes since v1
- split change into more patches
- introduce context switch callback and use it to flush LBR
- use the context switch callback to save/restore LBR
- dynamically allocate the memory area for storing the LBR stack, always
switch the memory area during context switch
- disable this feature by default
- more description in change logs

Changes since v2
- don't use xchg to switch PMU specific data
- remove nr_branch_stack from struct perf_event_context
- simplify the save/restore LBR stack logic
- remove unnecessary 'has_branch_stack -> needs_branch_stack'
conversion
- more description in change logs

Changes since v3
- remove sysfs attribute file that disables this feature

Changes since v4
- re-organize code that saves/restores LBR stack
- allocate pmu specific data when it's needed
- update code comments

Changes since v5
- Expose LBR call stack data to user perf tool
- Add option for perf report to support LBR call stack
- Some minor changes according to comments

Changes since v6
- rebase on tip.git 05066a2a04
- Modify perf test accordingly

Yan, Zheng (15):
perf, x86: Reduce lbr_sel_map size
perf, core: introduce pmu context switch callback
perf, x86: use context switch callback to flush LBR stack
perf, x86: Basic Haswell LBR call stack support
perf, core: pmu specific data for perf task context
perf, core: always switch pmu specific data during context switch
perf, x86: allocate space for storing LBR stack
perf, x86: track number of events that use LBR callstack
perf, x86: Save/resotre LBR stack during context switch
perf, core: simplify need branch stack check
perf, core: expose LBR call stack to user perf tool
perf, x86: re-organize code that implicitly enables LBR/PEBS
perf, x86: enable LBR callstack when recording callchain
perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack
mode
perf, x86: Discard zero length call entries in LBR call stack

Kan Liang (2):
perf tools: handle LBR call stack data
perf tools: choose to dump callchain from LBR and FP

arch/x86/kernel/cpu/perf_event.c | 90 ++++++---
arch/x86/kernel/cpu/perf_event.h | 28 ++-
arch/x86/kernel/cpu/perf_event_intel.c | 38 +---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 309 ++++++++++++++++++++++-------
include/linux/perf_event.h | 35 +++-
include/uapi/linux/perf_event.h | 49 +++--
kernel/events/callchain.c | 1 +
kernel/events/core.c | 200 +++++++++++--------
tools/perf/builtin-report.c | 8 +-
tools/perf/tests/hists_cumulate.c | 20 +-
tools/perf/tests/sample-parsing.c | 2 +-
tools/perf/util/callchain.c | 18 +-
tools/perf/util/callchain.h | 6 +
tools/perf/util/event.h | 8 +
tools/perf/util/evsel.c | 21 +-
tools/perf/util/machine.c | 198 ++++++++++++------
tools/perf/util/session.c | 34 +++-
18 files changed, 742 insertions(+), 325 deletions(-)

--
1.8.3.2


2014-11-05 03:10:23

by Liang, Kan

Subject: [PATCH V7 02/17] perf, core: introduce pmu context switch callback

From: Yan, Zheng <[email protected]>

The callback is invoked when a process is scheduled in or out.
It provides a mechanism for later patches to save/restore the LBR
stack. For the schedule-in case, the callback is invoked at
the same place where the flush branch stack callback is invoked,
so it can also replace the flush branch stack callback. To
avoid unnecessary overhead, the callback is enabled only when
there are events that use the LBR stack.
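
The gating idea can be sketched in plain user-space C as follows. This
is only a simplified model: the sketch_* names are hypothetical, a
single global counter stands in for the per-CPU counter, and locking,
RCU and the PMU list walk are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-CPU usage counter and the PMU hook. */
static int sketch_cb_usages; /* number of events that need the callback */
static int sketch_cb_calls;  /* how often the callback actually ran */

static void sketch_cb_inc(void) { sketch_cb_usages++; }
static void sketch_cb_dec(void) { sketch_cb_usages--; }

/* Stands in for pmu->sched_task(); here it only counts invocations. */
static void sketch_pmu_sched_task(bool sched_in)
{
	(void)sched_in;
	sketch_cb_calls++;
}

/* Called on every context switch; cheap when nobody needs the callback. */
static void sketch_task_switch(int prev, int next, bool sched_in)
{
	if (!sketch_cb_usages)
		return; /* no event uses the LBR stack: skip all work */
	if (prev == next)
		return; /* same task: nothing to save or restore */
	sketch_pmu_sched_task(sched_in);
}
```

With the counter at zero, the context switch path pays only for one
integer test, which is the overhead argument made above.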

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 +++++
arch/x86/kernel/cpu/perf_event.h | 2 ++
include/linux/perf_event.h | 9 +++++++
kernel/events/core.c | 57 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 75 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 143e5f5..d5de9e1 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1879,6 +1879,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
NULL,
};

+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (x86_pmu.sched_task)
+ x86_pmu.sched_task(ctx, sched_in);
+}
+
static void x86_pmu_flush_branch_stack(void)
{
if (x86_pmu.flush_branch_stack)
@@ -1912,6 +1918,7 @@ static struct pmu pmu = {

.event_idx = x86_pmu_event_idx,
.flush_branch_stack = x86_pmu_flush_branch_stack,
+ .sched_task = x86_pmu_sched_task,
};

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 86c675c..0617abb 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -467,6 +467,8 @@ struct x86_pmu {

void (*check_microcode)(void);
void (*flush_branch_stack)(void);
+ void (*sched_task)(struct perf_event_context *ctx,
+ bool sched_in);

/*
* Intel Arch Perfmon v2+
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..40ecad1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -263,6 +263,13 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * context-switches callback
+ */
+ void (*sched_task) (struct perf_event_context *ctx,
+ bool sched_in);
+
};

/**
@@ -562,6 +569,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
extern void perf_event_print_debug(void);
extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_dec(struct pmu *pmu);
+extern void perf_sched_cb_inc(struct pmu *pmu);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2b02c9f..28c2764 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -154,6 +154,7 @@ enum event_type_t {
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);

static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
@@ -2435,6 +2436,56 @@ unlock:
}
}

+void perf_sched_cb_dec(struct pmu *pmu)
+{
+ this_cpu_dec(perf_sched_cb_usages);
+}
+
+void perf_sched_cb_inc(struct pmu *pmu)
+{
+ this_cpu_inc(perf_sched_cb_usages);
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+ struct task_struct *next,
+ bool sched_in)
+{
+ struct perf_cpu_context *cpuctx;
+ struct pmu *pmu;
+ unsigned long flags;
+
+ if (prev == next)
+ return;
+
+ local_irq_save(flags);
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(pmu, &pmus, entry) {
+ if (pmu->sched_task) {
+ cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ perf_pmu_disable(pmu);
+
+ pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+ perf_pmu_enable(pmu);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ }
+ }
+
+ rcu_read_unlock();
+
+ local_irq_restore(flags);
+}
+
#define for_each_task_context_nr(ctxn) \
for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)

@@ -2454,6 +2505,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
{
int ctxn;

+ if (__this_cpu_read(perf_sched_cb_usages))
+ perf_pmu_sched_task(task, next, false);
+
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);

@@ -2711,6 +2765,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
/* check for system-wide branch_stack events */
if (atomic_read(this_cpu_ptr(&perf_branch_stack_events)))
perf_branch_stack_sched_in(prev, task);
+
+ if (__this_cpu_read(perf_sched_cb_usages))
+ perf_pmu_sched_task(prev, task, true);
}

static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
--
1.8.3.2

2014-11-05 03:10:33

by Liang, Kan

Subject: [PATCH V7 10/17] perf, core: simplify need branch stack check

From: Yan, Zheng <[email protected]>

Use event->attr.branch_sample_type instead of
intel_pmu_needs_lbr_smpl() to avoid duplicating the code
that implicitly enables the LBR.
Currently, the branch stack can be enabled either by branch sampling
that the user explicitly requested or by implicit branch sampling used
to correct PEBS skid. For explicitly requested branch sampling,
branch_sample_type is set by the user. For the PEBS case,
branch_sample_type is implicitly set to PERF_SAMPLE_BRANCH_ANY in
x86_pmu_hw_config.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 20 +++-----------------
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 3 +++
3 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a0c0739..5f449fb 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1029,20 +1029,6 @@ static __initconst const u64 slm_hw_cache_event_ids
},
};

-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
- /* user explicitly requested branch sampling */
- if (has_branch_stack(event))
- return true;
-
- /* implicit branch sampling to correct PEBS skid */
- if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2)
- return true;
-
- return false;
-}
-
static void intel_pmu_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1207,7 +1193,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
* must disable before any actual event
* because any event may be combined with LBR
*/
- if (intel_pmu_needs_lbr_smpl(event))
+ if (needs_branch_stack(event))
intel_pmu_lbr_disable(event);

if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
@@ -1268,7 +1254,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
* must enabled before any actual event
* because any event may be combined with LBR
*/
- if (intel_pmu_needs_lbr_smpl(event))
+ if (needs_branch_stack(event))
intel_pmu_lbr_enable(event);

if (event->attr.exclude_host)
@@ -1747,7 +1733,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.precise_ip && x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);

- if (intel_pmu_needs_lbr_smpl(event)) {
+ if (needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
if (ret)
return ret;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 84ec3e6..0d67460 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -796,6 +796,11 @@ static inline bool has_branch_stack(struct perf_event *event)
return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
}

+static inline bool needs_branch_stack(struct perf_event *event)
+{
+ return event->attr.branch_sample_type != 0;
+}
+
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4360c95..3f3e43d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7015,6 +7015,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
goto err_ns;

+ if (!has_branch_stack(event))
+ event->attr.branch_sample_type = 0;
+
pmu = perf_init_event(event);
if (!pmu)
goto err_ns;
--
1.8.3.2

2014-11-05 03:10:48

by Liang, Kan

Subject: [PATCH V7 14/17] perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack mode

From: Yan, Zheng <[email protected]>

LBR callstack is designed for PEBS. It does not work well with
FREEZE_LBRS_ON_PMI for non-PEBS events. If FREEZE_LBRS_ON_PMI is set for
a non-PEBS event, PMIs near call/return instructions may cause
superfluous increments/decrements of LBR_TOS.

This patch modifies __intel_pmu_lbr_enable() to not enable
FREEZE_LBRS_ON_PMI when LBR operates in callstack mode. We currently
don't use LBR callstack to capture kernel space callchain, so disabling
FREEZE_LBRS_ON_PMI should not be a problem.
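
The resulting bit logic can be sketched in isolation as below. The
SKETCH_* constants are local to this example; their bit positions
(LBR enable at bit 0 and FREEZE_LBRS_ON_PMI at bit 11 of DEBUGCTL,
call-stack enable at bit 9 of LBR_SELECT) are stated here as
assumptions and should be checked against the kernel headers and the
Intel SDM.

```c
#include <stdint.h>

/* Assumed bit positions, named locally to avoid claiming kernel macros. */
#define SKETCH_DEBUGCTL_LBR                (1ULL << 0)
#define SKETCH_DEBUGCTL_FREEZE_LBRS_ON_PMI (1ULL << 11)
#define SKETCH_LBR_SELECT_CALL_STACK       (1ULL << 9)

/*
 * Compute the DEBUGCTL value to write when enabling the LBR.
 * FREEZE_LBRS_ON_PMI is added only when call-stack mode is off.
 * An already-set freeze bit is not cleared in this sketch.
 */
static uint64_t sketch_lbr_enable_debugctl(uint64_t debugctl,
					   uint64_t lbr_select)
{
	debugctl |= SKETCH_DEBUGCTL_LBR;
	if (!(lbr_select & SKETCH_LBR_SELECT_CALL_STACK))
		debugctl |= SKETCH_DEBUGCTL_FREEZE_LBRS_ON_PMI;
	return debugctl;
}
```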

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index a9e3a0d..12a87b0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -131,14 +131,23 @@ static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);

static void __intel_pmu_lbr_enable(void)
{
- u64 debugctl;
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 debugctl, lbr_select = 0;

- if (cpuc->lbr_sel)
- wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
+ if (cpuc->lbr_sel) {
+ lbr_select = cpuc->lbr_sel->config;
+ wrmsrl(MSR_LBR_SELECT, lbr_select);
+ }

rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
- debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
+ debugctl |= DEBUGCTLMSR_LBR;
+ /*
+ * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
+ * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
+ * may cause superfluous increase/decrease of LBR_TOS.
+ */
+ if (!(lbr_select & LBR_CALL_STACK))
+ debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
}

--
1.8.3.2

2014-11-05 03:10:56

by Liang, Kan

Subject: [PATCH V7 15/17] perf, x86: Discard zero length call entries in LBR call stack

From: Yan, Zheng <[email protected]>

"Zero length call" uses the attribute of the call instruction to push
the address of the immediately following instruction onto the stack and
then pops that address into a register. This is accomplished without any
matching return instruction. It confuses the hardware and makes the
recorded call stack incorrect.

We can partially resolve this issue by decoding call instructions and
discarding any zero length call entry in the LBR stack.
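
A minimal user-space sketch of the check, assuming the bytes at the
branch source are already available in a buffer: a near relative call
is opcode 0xe8 followed by a 32-bit displacement, and a displacement of
zero makes the call target the next instruction. The function name is
hypothetical, and instruction prefixes are ignored here; the patch
itself relies on the kernel instruction decoder instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return true if buf starts with "call rel32" (opcode 0xe8) whose
 * displacement is zero, i.e. a call to the next instruction.
 */
static bool sketch_is_zero_length_call(const uint8_t *buf, size_t len)
{
	int32_t rel32;

	if (len < 5 || buf[0] != 0xe8)
		return false;

	/* memcpy avoids an unaligned load; byte order is irrelevant for 0. */
	memcpy(&rel32, buf + 1, sizeof(rel32));
	return rel32 == 0;
}
```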

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 12a87b0..b75adec 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -94,7 +94,8 @@ enum {
X86_BR_ABORT = 1 << 12,/* transaction abort */
X86_BR_IN_TX = 1 << 13,/* in transaction */
X86_BR_NO_TX = 1 << 14,/* not in transaction */
- X86_BR_CALL_STACK = 1 << 15,/* call stack */
+ X86_BR_ZERO_CALL = 1 << 15,/* zero length call */
+ X86_BR_CALL_STACK = 1 << 16,/* call stack */
};

#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -111,13 +112,15 @@ enum {
X86_BR_JMP |\
X86_BR_IRQ |\
X86_BR_ABORT |\
- X86_BR_IND_CALL)
+ X86_BR_IND_CALL |\
+ X86_BR_ZERO_CALL)

#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)

#define X86_BR_ANY_CALL \
(X86_BR_CALL |\
X86_BR_IND_CALL |\
+ X86_BR_ZERO_CALL |\
X86_BR_SYSCALL |\
X86_BR_IRQ |\
X86_BR_INT)
@@ -689,6 +692,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
ret = X86_BR_INT;
break;
case 0xe8: /* call near rel */
+ insn_get_immediate(&insn);
+ if (insn.immediate1.value == 0) {
+ /* zero length call */
+ ret = X86_BR_ZERO_CALL;
+ break;
+ }
case 0x9a: /* call far absolute */
ret = X86_BR_CALL;
break;
--
1.8.3.2

2014-11-05 03:10:54

by Liang, Kan

Subject: [PATCH V7 16/17] perf tools: handle LBR call stack data

From: Kan Liang <[email protected]>

The LBR call stack data is output in the PERF_SAMPLE_BRANCH_STACK data
format. ip_callchain is changed to indicate the available callchain
sources. The LBR call stack also needs to be shown in user space.
The perf tests are changed accordingly.
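
To make the layout change concrete, here is a small stand-alone sketch
of how a callchain record with the extra leading source word can be
read from an array of u64 words. The sketch_* names are hypothetical;
only the field order (source, nr, ips) and the two flag values are
taken from this patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FP_CALLCHAIN  0x01 /* mirrors PERF_FP_CALLCHAIN */
#define SKETCH_LBR_CALLCHAIN 0x02 /* mirrors PERF_LBR_CALLCHAIN */

/* A read-only view of one callchain record inside a sample. */
struct sketch_callchain_view {
	uint64_t source;     /* bitmask of available callchain sources */
	uint64_t nr;         /* number of entries in ips */
	const uint64_t *ips; /* points at the first ip */
};

/* Interpret words[0] as source, words[1] as nr, the rest as ips. */
static struct sketch_callchain_view
sketch_callchain_parse(const uint64_t *words)
{
	struct sketch_callchain_view v = {
		.source = words[0],
		.nr = words[1],
		.ips = words + 2,
	};
	return v;
}

/* Record size in bytes: two header words (source, nr) plus nr ips. */
static size_t sketch_callchain_bytes(uint64_t nr)
{
	return (size_t)(nr + 2) * sizeof(uint64_t);
}
```

This is why the test data gains a leading 1 and the size computations
change from (nr + 1) to (nr + 2) words.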

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/tests/hists_cumulate.c | 20 ++++++++++----------
tools/perf/tests/sample-parsing.c | 2 +-
tools/perf/util/event.h | 8 ++++++++
tools/perf/util/evsel.c | 21 +++++++++++++--------
4 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/tools/perf/tests/hists_cumulate.c b/tools/perf/tests/hists_cumulate.c
index 614d5c4..8967ae6 100644
--- a/tools/perf/tests/hists_cumulate.c
+++ b/tools/perf/tests/hists_cumulate.c
@@ -48,29 +48,29 @@ static struct sample fake_samples[] = {
*/
static u64 fake_callchains[][10] = {
/* schedule => run_command => main */
- { 3, FAKE_IP_KERNEL_SCHEDULE, FAKE_IP_PERF_RUN_COMMAND, FAKE_IP_PERF_MAIN, },
+ { 1, 3, FAKE_IP_KERNEL_SCHEDULE, FAKE_IP_PERF_RUN_COMMAND, FAKE_IP_PERF_MAIN, },
/* main */
- { 1, FAKE_IP_PERF_MAIN, },
+ { 1, 1, FAKE_IP_PERF_MAIN, },
/* cmd_record => run_command => main */
- { 3, FAKE_IP_PERF_CMD_RECORD, FAKE_IP_PERF_RUN_COMMAND, FAKE_IP_PERF_MAIN, },
+ { 1, 3, FAKE_IP_PERF_CMD_RECORD, FAKE_IP_PERF_RUN_COMMAND, FAKE_IP_PERF_MAIN, },
/* malloc => cmd_record => run_command => main */
- { 4, FAKE_IP_LIBC_MALLOC, FAKE_IP_PERF_CMD_RECORD, FAKE_IP_PERF_RUN_COMMAND,
+ { 1, 4, FAKE_IP_LIBC_MALLOC, FAKE_IP_PERF_CMD_RECORD, FAKE_IP_PERF_RUN_COMMAND,
FAKE_IP_PERF_MAIN, },
/* free => cmd_record => run_command => main */
- { 4, FAKE_IP_LIBC_FREE, FAKE_IP_PERF_CMD_RECORD, FAKE_IP_PERF_RUN_COMMAND,
+ { 1, 4, FAKE_IP_LIBC_FREE, FAKE_IP_PERF_CMD_RECORD, FAKE_IP_PERF_RUN_COMMAND,
FAKE_IP_PERF_MAIN, },
/* main */
- { 1, FAKE_IP_PERF_MAIN, },
+ { 1, 1, FAKE_IP_PERF_MAIN, },
/* page_fault => sys_perf_event_open => run_command => main */
- { 4, FAKE_IP_KERNEL_PAGE_FAULT, FAKE_IP_KERNEL_SYS_PERF_EVENT_OPEN,
+ { 1, 4, FAKE_IP_KERNEL_PAGE_FAULT, FAKE_IP_KERNEL_SYS_PERF_EVENT_OPEN,
FAKE_IP_PERF_RUN_COMMAND, FAKE_IP_PERF_MAIN, },
/* main */
- { 1, FAKE_IP_BASH_MAIN, },
+ { 1, 1, FAKE_IP_BASH_MAIN, },
/* xmalloc => malloc => xmalloc => malloc => xmalloc => main */
- { 6, FAKE_IP_BASH_XMALLOC, FAKE_IP_LIBC_MALLOC, FAKE_IP_BASH_XMALLOC,
+ { 1, 6, FAKE_IP_BASH_XMALLOC, FAKE_IP_LIBC_MALLOC, FAKE_IP_BASH_XMALLOC,
FAKE_IP_LIBC_MALLOC, FAKE_IP_BASH_XMALLOC, FAKE_IP_BASH_MAIN, },
/* page_fault => malloc => main */
- { 3, FAKE_IP_KERNEL_PAGE_FAULT, FAKE_IP_LIBC_MALLOC, FAKE_IP_BASH_MAIN, },
+ { 1, 3, FAKE_IP_KERNEL_PAGE_FAULT, FAKE_IP_LIBC_MALLOC, FAKE_IP_BASH_MAIN, },
};

static int add_hist_entries(struct hists *hists, struct machine *machine)
diff --git a/tools/perf/tests/sample-parsing.c b/tools/perf/tests/sample-parsing.c
index ca292f9..dfa9dfa 100644
--- a/tools/perf/tests/sample-parsing.c
+++ b/tools/perf/tests/sample-parsing.c
@@ -145,7 +145,7 @@ static int do_test(u64 sample_type, u64 sample_regs_user, u64 read_format)
u64 data[64];
} callchain = {
/* 3 ips */
- .data = {3, 201, 202, 203},
+ .data = {1, 3, 201, 202, 203},
};
union {
struct branch_stack branch_stack;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 8c7fe9d..efeefe7 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -119,7 +119,15 @@ struct sample_read {
};
};

+/*
+ * From Haswell, the existing Last Branch Record facility can
+ * also be used to record call chains.
+ * source: indicates the available call chains source.
+ */
+#define PERF_FP_CALLCHAIN 0x01
+#define PERF_LBR_CALLCHAIN 0x02
struct ip_callchain {
+ u64 source;
u64 nr;
u64 ips[0];
};
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 2f9e680..ff472d6 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1028,7 +1028,6 @@ static size_t perf_event_attr__fprintf(struct perf_event_attr *attr, FILE *fp)
ret += PRINT_ATTR2(exclude_host, exclude_guest);
ret += PRINT_ATTR2N("excl.callchain_kern", exclude_callchain_kernel,
"excl.callchain_user", exclude_callchain_user);
-
ret += PRINT_ATTR_U32(wakeup_events);
ret += PRINT_ATTR_U32(wakeup_watermark);
ret += PRINT_ATTR_X32(bp_type);
@@ -1440,10 +1439,10 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
const u64 max_callchain_nr = UINT64_MAX / sizeof(u64);

OVERFLOW_CHECK_u64(array);
- data->callchain = (struct ip_callchain *)array++;
+ data->callchain = (struct ip_callchain *)array;
if (data->callchain->nr > max_callchain_nr)
return -EFAULT;
- sz = data->callchain->nr * sizeof(u64);
+ sz = (data->callchain->nr + 2) * sizeof(u64);
OVERFLOW_CHECK(array, sz, max_size);
array = (void *)array + sz;
}
@@ -1466,7 +1465,9 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
array = (void *)array + data->raw_size;
}

- if (type & PERF_SAMPLE_BRANCH_STACK) {
+ if ((type & PERF_SAMPLE_BRANCH_STACK) ||
+ (data->callchain &&
+ (data->callchain->source & PERF_LBR_CALLCHAIN))) {
const u64 max_branch_nr = UINT64_MAX /
sizeof(struct branch_entry);

@@ -1590,7 +1591,7 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
}

if (type & PERF_SAMPLE_CALLCHAIN) {
- sz = (sample->callchain->nr + 1) * sizeof(u64);
+ sz = (sample->callchain->nr + 2) * sizeof(u64);
result += sz;
}

@@ -1599,7 +1600,9 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
result += sample->raw_size;
}

- if (type & PERF_SAMPLE_BRANCH_STACK) {
+ if ((type & PERF_SAMPLE_BRANCH_STACK) ||
+ (sample->callchain &&
+ (sample->callchain->source & PERF_LBR_CALLCHAIN))) {
sz = sample->branch_stack->nr * sizeof(struct branch_entry);
sz += sizeof(u64);
result += sz;
@@ -1745,7 +1748,7 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type,
}

if (type & PERF_SAMPLE_CALLCHAIN) {
- sz = (sample->callchain->nr + 1) * sizeof(u64);
+ sz = (sample->callchain->nr + 2) * sizeof(u64);
memcpy(array, sample->callchain, sz);
array = (void *)array + sz;
}
@@ -1768,7 +1771,9 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type,
array = (void *)array + sample->raw_size;
}

- if (type & PERF_SAMPLE_BRANCH_STACK) {
+ if ((type & PERF_SAMPLE_BRANCH_STACK) ||
+ (sample->callchain &&
+ (sample->callchain->source & PERF_LBR_CALLCHAIN))) {
sz = sample->branch_stack->nr * sizeof(struct branch_entry);
sz += sizeof(u64);
memcpy(array, sample->branch_stack, sz);
--
1.8.3.2

2014-11-05 03:10:53

by Liang, Kan

Subject: [PATCH V7 17/17] perf tools: choose to dump callchain from LBR and FP

From: Kan Liang <[email protected]>

Extend the call-graph option in perf report to support a callchain
source (fp or lbr).
The default value is fp, which means that frame pointers are the
preferred call chain source. If fp data isn't available, lbr data is
used instead. If the value is set to lbr, lbr data is the preferred
call chain source. If lbr data isn't available, fp data is tried instead.
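
The preference-with-fallback rule can be sketched on its own as
follows. All names are hypothetical, and SKETCH_SOURCE_NONE is added
only so the example can report that neither source is present. The
additional fallback to fp when the chain has no user context marker is
not modelled here.

```c
#define SKETCH_FP_AVAILABLE  0x01u
#define SKETCH_LBR_AVAILABLE 0x02u

enum sketch_source {
	SKETCH_SOURCE_NONE, /* hypothetical: no callchain data at all */
	SKETCH_SOURCE_FP,
	SKETCH_SOURCE_LBR
};

/*
 * Pick the callchain source for one sample: use the preferred source
 * when it is available, otherwise fall back to the other one.
 */
static enum sketch_source sketch_pick_source(enum sketch_source preferred,
					     unsigned int available)
{
	int has_fp = (available & SKETCH_FP_AVAILABLE) != 0;
	int has_lbr = (available & SKETCH_LBR_AVAILABLE) != 0;

	if (preferred == SKETCH_SOURCE_LBR && has_lbr)
		return SKETCH_SOURCE_LBR;
	if (has_fp)
		return SKETCH_SOURCE_FP;
	if (has_lbr)
		return SKETCH_SOURCE_LBR;
	return SKETCH_SOURCE_NONE;
}
```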

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-report.c | 8 +-
tools/perf/util/callchain.c | 18 +++-
tools/perf/util/callchain.h | 6 ++
tools/perf/util/machine.c | 198 ++++++++++++++++++++++++++++++--------------
tools/perf/util/session.c | 34 +++++++-
5 files changed, 194 insertions(+), 70 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 140a6cd..23fad5a 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -575,7 +575,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
struct stat st;
bool has_br_stack = false;
int branch_mode = -1;
- char callchain_default_opt[] = "fractal,0.5,callee";
+ char callchain_default_opt[] = "fractal,0.5,callee,function,fp";
const char * const report_usage[] = {
"perf report [<options>]",
NULL
@@ -637,9 +637,9 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
"regex filter to identify parent, see: '--sort parent'"),
OPT_BOOLEAN('x', "exclude-other", &symbol_conf.exclude_other,
"Only display entries with parent-match"),
- OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order",
- "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
- "Default: fractal,0.5,callee,function", &report_parse_callchain_opt, callchain_default_opt),
+ OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order,source",
+ "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address), callchain source(fp or lbr). "
+ "Default: fractal,0.5,callee,function,fp", &report_parse_callchain_opt, callchain_default_opt),
OPT_BOOLEAN(0, "children", &symbol_conf.cumulate_callchain,
"Accumulate callchains of children and show total overhead as well"),
OPT_INTEGER(0, "max-stack", &report.max_stack,
diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 0022980..2f7d2c9 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -152,6 +152,19 @@ static int parse_callchain_sort_key(const char *value)
return -1;
}

+static int parse_callchain_source(const char *value)
+{
+ if (!strncmp(value, "fp", strlen(value))) {
+ callchain_param.source = SOURCE_FP;
+ return 0;
+ }
+ if (!strncmp(value, "lbr", strlen(value))) {
+ callchain_param.source = SOURCE_LBR;
+ return 0;
+ }
+ return -1;
+}
+
int
parse_callchain_report_opt(const char *arg)
{
@@ -173,7 +186,8 @@ parse_callchain_report_opt(const char *arg)

if (!parse_callchain_mode(tok) ||
!parse_callchain_order(tok) ||
- !parse_callchain_sort_key(tok)) {
+ !parse_callchain_sort_key(tok) ||
+ !parse_callchain_source(tok)) {
/* parsing ok - move on to the next */
} else if (!minpcnt_set) {
/* try to get the min percent */
@@ -225,6 +239,8 @@ int perf_callchain_config(const char *var, const char *value)
return parse_callchain_order(value);
if (!strcmp(var, "sort-key"))
return parse_callchain_sort_key(value);
+ if (!strcmp(var, "source"))
+ return parse_callchain_source(value);
if (!strcmp(var, "threshold")) {
callchain_param.min_percent = strtod(value, &endptr);
if (value == endptr)
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 3caccc2..267a976 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -53,6 +53,11 @@ enum chain_key {
CCKEY_ADDRESS
};

+enum chain_source {
+ SOURCE_FP,
+ SOURCE_LBR
+};
+
struct callchain_param {
bool enabled;
enum perf_call_graph_mode record_mode;
@@ -63,6 +68,7 @@ struct callchain_param {
sort_chain_func_t sort;
enum chain_order order;
enum chain_key key;
+ enum chain_source source;
};

extern struct callchain_param callchain_param;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 51a6303..22a7f00 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1367,18 +1367,80 @@ struct branch_info *sample__resolve_bstack(struct perf_sample *sample,
return bi;
}

+static inline int __thread__resolve_callchain_sample(
+ struct thread *thread,
+ u64 ip,
+ u8 *cpumode,
+ struct symbol **parent,
+ struct addr_location *root_al,
+ struct addr_location *al)
+{
+ int err;
+
+ if (ip >= PERF_CONTEXT_MAX) {
+ switch (ip) {
+ case PERF_CONTEXT_HV:
+ *cpumode = PERF_RECORD_MISC_HYPERVISOR;
+ break;
+ case PERF_CONTEXT_KERNEL:
+ *cpumode = PERF_RECORD_MISC_KERNEL;
+ break;
+ case PERF_CONTEXT_USER:
+ *cpumode = PERF_RECORD_MISC_USER;
+ break;
+ default:
+ pr_debug("invalid callchain context: "
+ "%"PRId64"\n", (s64) ip);
+ /*
+ * It seems the callchain is corrupted.
+ * Discard all.
+ */
+ callchain_cursor_reset(&callchain_cursor);
+ return 1;
+ }
+ return 0;
+ }
+
+ al->filtered = 0;
+ thread__find_addr_location(thread, *cpumode,
+ MAP__FUNCTION, ip, al);
+ if (al->sym != NULL) {
+ if (sort__has_parent && !*parent &&
+ symbol__match_regex(al->sym, &parent_regex))
+ *parent = al->sym;
+ else if (have_ignore_callees && root_al &&
+ symbol__match_regex(al->sym, &ignore_callees_regex)) {
+ /* Treat this symbol as the root,
+ forgetting its callees. */
+ *root_al = *al;
+ callchain_cursor_reset(&callchain_cursor);
+ }
+ }
+
+ err = callchain_cursor_append(&callchain_cursor,
+ ip, al->map, al->sym);
+ if (err)
+ return err;
+ return 0;
+}
+
static int thread__resolve_callchain_sample(struct thread *thread,
- struct ip_callchain *chain,
+ struct perf_sample *sample,
struct symbol **parent,
struct addr_location *root_al,
int max_stack)
{
+ struct ip_callchain *chain = sample->callchain;
u8 cpumode = PERF_RECORD_MISC_USER;
int chain_nr = min(max_stack, (int)chain->nr);
- int i;
- int j;
- int err;
+ int i, j, err;
int skip_idx __maybe_unused;
+ int use_fp = (callchain_param.source == SOURCE_FP) ? 1 : 0;
+ u64 ip;
+
+ /* If there isn't user fp callchain available, try LBR */
+ if (!(chain->source & PERF_FP_CALLCHAIN))
+ use_fp = 0;

callchain_cursor_reset(&callchain_cursor);

@@ -1387,73 +1449,83 @@ static int thread__resolve_callchain_sample(struct thread *thread,
return 0;
}

- /*
- * Based on DWARF debug information, some architectures skip
- * a callchain entry saved by the kernel.
- */
- skip_idx = arch_skip_callchain_idx(thread, chain);
-
- for (i = 0; i < chain_nr; i++) {
- u64 ip;
- struct addr_location al;
+again:
+ /* try LBR */
+ if (!use_fp && (chain->source & PERF_LBR_CALLCHAIN)) {
+ struct branch_stack *lbr_stack = sample->branch_stack;
+ int lbr_nr = lbr_stack->nr;
+ int mix_chain_nr;

- if (callchain_param.order == ORDER_CALLEE)
- j = i;
- else
- j = chain->nr - i - 1;
+ for (i = 0; i < chain_nr; i++) {
+ if (chain->ips[i] == PERF_CONTEXT_USER)
+ break;
+ }

-#ifdef HAVE_SKIP_CALLCHAIN_IDX
- if (j == skip_idx)
- continue;
-#endif
- ip = chain->ips[j];
+ /* LBR only affects the user callchain */
+ if (i == chain_nr) {
+ use_fp = 1;
+ goto again;
+ }

- if (ip >= PERF_CONTEXT_MAX) {
- switch (ip) {
- case PERF_CONTEXT_HV:
- cpumode = PERF_RECORD_MISC_HYPERVISOR;
- break;
- case PERF_CONTEXT_KERNEL:
- cpumode = PERF_RECORD_MISC_KERNEL;
- break;
- case PERF_CONTEXT_USER:
- cpumode = PERF_RECORD_MISC_USER;
- break;
- default:
- pr_debug("invalid callchain context: "
- "%"PRId64"\n", (s64) ip);
- /*
- * It seems the callchain is corrupted.
- * Discard all.
- */
- callchain_cursor_reset(&callchain_cursor);
- return 0;
- }
- continue;
+ mix_chain_nr = i + 2 + lbr_nr;
+ if (mix_chain_nr > PERF_MAX_STACK_DEPTH) {
+ pr_warning("corrupted callchain. skipping...\n");
+ return 0;
}

- al.filtered = 0;
- thread__find_addr_location(thread, cpumode,
- MAP__FUNCTION, ip, &al);
- if (al.sym != NULL) {
- if (sort__has_parent && !*parent &&
- symbol__match_regex(al.sym, &parent_regex))
- *parent = al.sym;
- else if (have_ignore_callees && root_al &&
- symbol__match_regex(al.sym, &ignore_callees_regex)) {
- /* Treat this symbol as the root,
- forgetting its callees. */
- *root_al = al;
- callchain_cursor_reset(&callchain_cursor);
+ for (j = 0; j < mix_chain_nr; j++) {
+ struct addr_location al;
+
+ if (callchain_param.order == ORDER_CALLEE) {
+ if (j < i + 2)
+ ip = chain->ips[j];
+ else
+ ip = lbr_stack->entries[j - i - 2].from;
+ } else {
+ if (j < lbr_nr)
+ ip = lbr_stack->entries[lbr_nr - j - 1].from;
+ else
+ ip = chain->ips[i + 1 - (j - lbr_nr)];
}
+ err = __thread__resolve_callchain_sample(thread,
+ ip, &cpumode, parent, root_al, &al);
+ /* Discard all when the callchain is corrupted */
+ if (err > 0)
+ return 0;
+ else if (err)
+ return err;
}
+ } else {

- err = callchain_cursor_append(&callchain_cursor,
- ip, al.map, al.sym);
- if (err)
- return err;
- }
+ /*
+ * Based on DWARF debug information, some architectures skip
+ * a callchain entry saved by the kernel.
+ */
+ skip_idx = arch_skip_callchain_idx(thread, chain);
+
+ for (i = 0; i < chain_nr; i++) {
+ struct addr_location al;
+
+ if (callchain_param.order == ORDER_CALLEE)
+ j = i;
+ else
+ j = chain->nr - i - 1;
+
+#ifdef HAVE_SKIP_CALLCHAIN_IDX
+ if (j == skip_idx)
+ continue;
+#endif
+ ip = chain->ips[j];
+ err = __thread__resolve_callchain_sample(thread,
+ ip, &cpumode, parent, root_al, &al);

+ /* Discard all when the callchain is corrupted */
+ if (err > 0)
+ return 0;
+ else if (err)
+ return err;
+ }
+ }
return 0;
}

@@ -1471,7 +1543,7 @@ int thread__resolve_callchain(struct thread *thread,
struct addr_location *root_al,
int max_stack)
{
- int ret = thread__resolve_callchain_sample(thread, sample->callchain,
+ int ret = thread__resolve_callchain_sample(thread, sample,
parent, root_al, max_stack);
if (ret)
return ret;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f4478ce..8866014 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -560,12 +560,42 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
static void callchain__printf(struct perf_sample *sample)
{
unsigned int i;
+ u64 total_nr, callchain_nr;
+ int use_fp = (callchain_param.source == SOURCE_FP) ? 1 : 0;

- printf("... chain: nr:%" PRIu64 "\n", sample->callchain->nr);
+ total_nr = callchain_nr = sample->callchain->nr;

- for (i = 0; i < sample->callchain->nr; i++)
+ /* If there isn't user fp callchain available, try LBR */
+ if (!(sample->callchain->source & PERF_FP_CALLCHAIN))
+ use_fp = 0;
+
+ if (!use_fp && (sample->callchain->source & PERF_LBR_CALLCHAIN)) {
+ struct branch_stack *lbr_stack = sample->branch_stack;
+
+ for (i = 0; i < callchain_nr; i++) {
+ if (sample->callchain->ips[i] == PERF_CONTEXT_USER)
+ break;
+ }
+
+ if (i != callchain_nr) {
+ total_nr = i + 1 + lbr_stack->nr;
+ callchain_nr = i + 1;
+ }
+ }
+
+ printf("... chain: nr:%" PRIu64 "\n", total_nr);
+
+ for (i = 0; i < callchain_nr + 1; i++)
printf("..... %2d: %016" PRIx64 "\n",
i, sample->callchain->ips[i]);
+
+ if (total_nr > callchain_nr) {
+ struct branch_stack *lbr_stack = sample->branch_stack;
+
+ for (i = 0; i < lbr_stack->nr; i++)
+ printf("..... %2d: %016" PRIx64 "\n",
+ (int)(i + callchain_nr + 1), lbr_stack->entries[i].from);
+ }
}

static void branch_stack__printf(struct perf_sample *sample)
--
1.8.3.2

2014-11-05 03:11:53

by Liang, Kan

Subject: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

From: Yan, Zheng <[email protected]>

Only enable the LBR call stack when the user requests an fp callgraph.
The feature is not available when PERF_SAMPLE_BRANCH_STACK or
PERF_SAMPLE_STACK_USER is requested.
This feature only affects how the user callchain is obtained. The kernel
callchain is always obtained from frame pointers.
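
The condition described above can be read as a single predicate. The
following is a minimal user-space sketch of it, not the kernel code: the
struct, the SK_* names and the bit values are hypothetical stand-ins, and
the PEBS fixup branch that takes precedence in x86_pmu_hw_config() is left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for PERF_SAMPLE_* bits; the values are illustrative. */
enum {
	SK_SAMPLE_CALLCHAIN    = 1u << 0,
	SK_SAMPLE_BRANCH_STACK = 1u << 1,
	SK_SAMPLE_STACK_USER   = 1u << 2,
};

/* Hypothetical, flattened view of the fields the check looks at. */
struct sk_event {
	uint32_t sample_type;
	bool exclude_user;	/* event does not sample user space */
	bool per_task;		/* event is attached to a task */
	bool hw_lbr_callstack;	/* CPU offers the LBR call stack mode */
};

/* True when the LBR call stack may be enabled implicitly for this event. */
static bool sk_use_lbr_callstack(const struct sk_event *ev)
{
	assert(ev);
	return ev->hw_lbr_callstack &&
	       (ev->sample_type & SK_SAMPLE_CALLCHAIN) &&
	       !(ev->sample_type & SK_SAMPLE_STACK_USER) &&
	       !(ev->sample_type & SK_SAMPLE_BRANCH_STACK) &&
	       !ev->exclude_user &&
	       ev->per_task;
}
```

Every term is a veto: a single unmet condition leaves the event on the
plain frame pointer path.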

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2f79b9d..f454620 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -425,10 +425,24 @@ int x86_pmu_hw_config(struct perf_event *event)
if (!event->attr.exclude_kernel)
*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
}
- }
+ } else if (x86_pmu_has_lbr_callstack() &&
+ (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+ !(event->attr.sample_type & PERF_SAMPLE_STACK_USER) &&
+ !has_branch_stack(event) &&
+ !event->attr.exclude_user &&
+ (event->attach_state & PERF_ATTACH_TASK)) {
+ /*
+ * user did not specify branch_sample_type,
+ * try using the LBR call stack facility to
+ * record call chains of user program.
+ */
+ event->attr.branch_sample_type =
+ PERF_SAMPLE_BRANCH_USER |
+ PERF_SAMPLE_BRANCH_CALL_STACK;

- if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+ /* needs PMU specific data to save LBR stack */
event->attach_state |= PERF_ATTACH_TASK_DATA;
+ }

/*
* Generate PMC IRQs:
--
1.8.3.2

2014-11-05 03:11:54

by Liang, Kan

Subject: [PATCH V7 12/17] perf, x86: re-organize code that implicitly enables LBR/PEBS

From: Yan, Zheng <[email protected]>

Make the later patch more readable; no logic change.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 59 ++++++++++++++++++++--------------------
1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 646e705..2f79b9d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -395,36 +395,35 @@ int x86_pmu_hw_config(struct perf_event *event)

if (event->attr.precise_ip > precise)
return -EOPNOTSUPP;
- /*
- * check that PEBS LBR correction does not conflict with
- * whatever the user is asking with attr->branch_sample_type
- */
- if (event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2) {
- u64 *br_type = &event->attr.branch_sample_type;
-
- if (has_branch_stack(event)) {
- if (!precise_br_compat(event))
- return -EOPNOTSUPP;
-
- /* branch_sample_type is compatible */
-
- } else {
- /*
- * user did not specify branch_sample_type
- *
- * For PEBS fixups, we capture all
- * the branches at the priv level of the
- * event.
- */
- *br_type = PERF_SAMPLE_BRANCH_ANY;
-
- if (!event->attr.exclude_user)
- *br_type |= PERF_SAMPLE_BRANCH_USER;
-
- if (!event->attr.exclude_kernel)
- *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
- }
+ }
+ /*
+ * check that PEBS LBR correction does not conflict with
+ * whatever the user is asking with attr->branch_sample_type
+ */
+ if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
+ u64 *br_type = &event->attr.branch_sample_type;
+
+ if (has_branch_stack(event)) {
+ if (!precise_br_compat(event))
+ return -EOPNOTSUPP;
+
+ /* branch_sample_type is compatible */
+
+ } else {
+ /*
+ * user did not specify branch_sample_type
+ *
+ * For PEBS fixups, we capture all
+ * the branches at the priv level of the
+ * event.
+ */
+ *br_type = PERF_SAMPLE_BRANCH_ANY;
+
+ if (!event->attr.exclude_user)
+ *br_type |= PERF_SAMPLE_BRANCH_USER;
+
+ if (!event->attr.exclude_kernel)
+ *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
}
}

--
1.8.3.2

2014-11-05 03:12:27

by Liang, Kan

Subject: [PATCH V7 11/17] perf, core: expose LBR call stack to user perf tool

From: Yan, Zheng <[email protected]>

With the LBR call stack feature enabled, there are two call chain data
sources: the traditional frame pointer and the LBR call stack.
This patch extends the perf_callchain_entry struct to mark the available
call chain sources.
The frame pointer chain is still output in the PERF_SAMPLE_CALLCHAIN data
format. The LBR call stack data is output in the PERF_SAMPLE_BRANCH_STACK
data format.

Note: The LBR call stack is only available for the user callchain. The
kernel callchain is always obtained from frame pointers.
The user space perf tool also needs to be changed to handle these two
sources.
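
For illustration, here is a rough sketch of how a consumer might combine
the two sources, modelled on the callee-order path of the tools/perf
change in this series. The function name, its signature and the SK_*
names are assumptions made for the example; the guard on the entry after
the user marker is an addition of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Context markers, with the values used by the perf uapi header. */
#define SK_CONTEXT_KERNEL ((uint64_t)-128)
#define SK_CONTEXT_USER   ((uint64_t)-512)

/* Source flags, with the values introduced by this patch. */
#define SK_FP_CALLCHAIN  0x01
#define SK_LBR_CALLCHAIN 0x02

/*
 * Build a callee-first chain in out[] and return its length.
 * ips[]/nr is the frame pointer chain; lbr_from[]/lbr_nr holds the
 * "from" addresses of the LBR call stack, most recent call first.
 */
static size_t sk_mix_chain(uint64_t source, int prefer_fp,
			   const uint64_t *ips, size_t nr,
			   const uint64_t *lbr_from, size_t lbr_nr,
			   uint64_t *out, size_t out_cap)
{
	size_t i, n = 0, head = nr;
	int use_lbr = 0;

	assert(out_cap >= nr + lbr_nr);

	/* LBR is used on request, or when no user FP chain was captured. */
	if ((source & SK_LBR_CALLCHAIN) &&
	    (!prefer_fp || !(source & SK_FP_CALLCHAIN))) {
		for (i = 0; i < nr; i++)
			if (ips[i] == SK_CONTEXT_USER)
				break;
		/* Need the marker plus the sampled user ip that follows it. */
		if (i + 1 < nr) {
			head = i + 2;
			use_lbr = 1;
		}
	}

	/* Kernel part (and user marker + sampled ip) come from the FP chain. */
	for (i = 0; i < head; i++)
		out[n++] = ips[i];
	/* The rest of the user chain comes from the LBR call stack. */
	if (use_lbr)
		for (i = 0; i < lbr_nr; i++)
			out[n++] = lbr_from[i];
	return n;
}
```

If the chain carries no user marker, the sketch falls back to the frame
pointer chain unchanged, which mirrors the fallback in the perf tool
patch.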

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 +++++++
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 ++
include/linux/perf_event.h | 14 +++++++++++++-
kernel/events/callchain.c | 1 +
kernel/events/core.c | 22 +++++++++++++++++-----
7 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1fd9492..646e705 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2041,6 +2041,10 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
perf_callchain_store(entry, cs_base + frame.return_address);
fp = compat_ptr(ss_base + frame.next_frame);
}
+
+ if (fp != compat_ptr(regs->bp))
+ entry->source |= PERF_FP_CALLCHAIN;
+
return 1;
}
#else
@@ -2093,6 +2097,9 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
perf_callchain_store(entry, frame.return_address);
fp = frame.next_frame;
}
+
+ if (fp != (void __user *)regs->bp)
+ entry->source |= PERF_FP_CALLCHAIN;
}

/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 5f449fb..b35e23d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1404,7 +1404,7 @@ again:

perf_sample_data_init(&data, 0, event->hw.last_period);

- if (has_branch_stack(event))
+ if (needs_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;

if (perf_event_overflow(event, &data, regs))
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 46211bc..517fd26 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -907,7 +907,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
data.txn = intel_hsw_transaction(pebs);
}

- if (has_branch_stack(event))
+ if (needs_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;

if (perf_event_overflow(event, &data, &regs))
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 8c72efa..a9e3a0d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -746,6 +746,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
int i, j, type;
bool compress = false;

+ cpuc->lbr_stack.user_callstack = branch_user_callstack(br_sel);
+
/* if sampling all branches, then nothing to filter */
if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
return;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0d67460..c42f4ec 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -55,7 +55,16 @@ struct perf_guest_info_callbacks {
#include <linux/workqueue.h>
#include <asm/local.h>

+/*
+ * From Haswell, the existing Last Branch Record facility can
+ * also be used to record call chains.
+ * source: indicates the available call chains source.
+ */
+#define PERF_FP_CALLCHAIN 0x01
+#define PERF_LBR_CALLCHAIN 0x02
+
struct perf_callchain_entry {
+ __u64 source;
__u64 nr;
__u64 ip[PERF_MAX_STACK_DEPTH];
};
@@ -67,7 +76,9 @@ struct perf_raw_record {

/*
* branch stack layout:
- * nr: number of taken branches stored in entries[]
+ * user_callstack: LBR is enhanced to support call stack profiling.
+ * user_callstack indicates if it's call stack info.
+ * nr: number of taken branches stored in entries[]
*
* Note that nr can vary from sample to sample
* branches (to, from) are stored from most recent
@@ -75,6 +86,7 @@ struct perf_raw_record {
* recent branch.
*/
struct perf_branch_stack {
+ bool user_callstack;
__u64 nr;
struct perf_branch_entry entries[0];
};
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index d659487..0fc5924 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -175,6 +175,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!entry)
goto exit_put;

+ entry->source = 0;
entry->nr = 0;

if (kernel && !user_mode(regs)) {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3f3e43d..27f9596 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4793,7 +4793,7 @@ void perf_output_sample(struct perf_output_handle *handle,

if (sample_type & PERF_SAMPLE_CALLCHAIN) {
if (data->callchain) {
- int size = 1;
+ int size = 2;

if (data->callchain)
size += data->callchain->nr;
@@ -4824,7 +4824,9 @@ void perf_output_sample(struct perf_output_handle *handle,
}
}

- if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+ /* LBR can be used for call stack, so it may be enabled implicitly. */
+ if ((sample_type & PERF_SAMPLE_BRANCH_STACK) ||
+ (data->br_stack && data->br_stack->user_callstack)) {
if (data->br_stack) {
size_t size;

@@ -4908,13 +4910,21 @@ void perf_prepare_sample(struct perf_event_header *header,
data->ip = perf_instruction_pointer(regs);

if (sample_type & PERF_SAMPLE_CALLCHAIN) {
- int size = 1;
+ int size = 2;

data->callchain = perf_callchain(event, regs);

- if (data->callchain)
+ if (data->callchain) {
size += data->callchain->nr;

+ if (data->br_stack &&
+ data->br_stack->user_callstack &&
+ !(sample_type & PERF_SAMPLE_BRANCH_STACK) &&
+ !(sample_type & PERF_SAMPLE_STACK_USER))
+ data->callchain->source |=
+ PERF_LBR_CALLCHAIN;
+ }
+
header->size += size * sizeof(u64);
}

@@ -4930,7 +4940,9 @@ void perf_prepare_sample(struct perf_event_header *header,
header->size += size;
}

- if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+ /* LBR can be used for call stack, so it may be enabled implicitly. */
+ if ((sample_type & PERF_SAMPLE_BRANCH_STACK) ||
+ (data->br_stack && data->br_stack->user_callstack)) {
int size = sizeof(u64); /* nr */
if (data->br_stack) {
size += data->br_stack->nr
--
1.8.3.2

2014-11-05 03:12:45

by Liang, Kan

Subject: [PATCH V7 07/17] perf, x86: allocate space for storing LBR stack

From: Yan, Zheng <[email protected]>

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. We can use PMU-specific data to
store the LBR stack when the task is scheduled out. This patch adds
code that allocates the PMU-specific data.
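
As a rough user-space illustration of the idea (not the kernel
allocation path): a PMU advertises how much per-task storage it needs,
and the context owner allocates a zeroed blob of that size. The sk_*
names, the calloc()-based helper and the LBR depth of 16 are assumptions
made for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_MAX_LBR_ENTRIES 16	/* assumed LBR depth */

/* Mirrors the layout of struct x86_perf_task_context from this patch. */
struct sk_task_context {
	uint64_t lbr_from[SK_MAX_LBR_ENTRIES];
	uint64_t lbr_to[SK_MAX_LBR_ENTRIES];
	int lbr_callstack_users;
	int lbr_stack_state;
};

/* Hypothetical PMU descriptor: only the size field matters here. */
struct sk_pmu {
	size_t task_ctx_size;
};

/* Allocate zeroed per-task PMU data; the caller releases it with free(). */
static void *sk_alloc_task_ctx(const struct sk_pmu *pmu)
{
	assert(pmu && pmu->task_ctx_size > 0);
	return calloc(1, pmu->task_ctx_size);
}
```

Zero-filling matters in this sketch: a fresh context then starts with no
call stack users and a state value of 0.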

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Reviewed-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 4 ++++
arch/x86/kernel/cpu/perf_event.h | 7 +++++++
2 files changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index e37adf0..1fd9492 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -428,6 +428,9 @@ int x86_pmu_hw_config(struct perf_event *event)
}
}

+ if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+ event->attach_state |= PERF_ATTACH_TASK_DATA;
+
/*
* Generate PMC IRQs:
* (keep 'enabled' bit clear for now)
@@ -1912,6 +1915,7 @@ static struct pmu pmu = {

.event_idx = x86_pmu_event_idx,
.sched_task = x86_pmu_sched_task,
+ .task_ctx_size = sizeof(struct x86_perf_task_context),
};

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 13464e4..b4568e5 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -510,6 +510,13 @@ struct x86_pmu {
struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
};

+struct x86_perf_task_context {
+ u64 lbr_from[MAX_LBR_ENTRIES];
+ u64 lbr_to[MAX_LBR_ENTRIES];
+ int lbr_callstack_users;
+ int lbr_stack_state;
+};
+
enum {
PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
--
1.8.3.2

2014-11-05 03:10:21

by Liang, Kan

Subject: [PATCH V7 01/17] perf, x86: Reduce lbr_sel_map size

From: Yan, Zheng <[email protected]>

The index of lbr_sel_map is the bit value of the perf branch_sample_type.
PERF_SAMPLE_BRANCH_MAX is 2048 (1U << 11) in this tree, so each
lbr_sel_map of int entries uses 8192 bytes. By using the bit shift as the
index, we can reduce the lbr_sel_map size to 44 bytes (11 entries). This
patch defines a 'bit shift' for each branch type and uses the 'bit shift'
to define the lbr_sel_maps.
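
To make the indexing change concrete, here is a small stand-alone sketch
with a reduced, hypothetical set of branch types. The sk_* names and the
filter values are invented for the example; only the technique (index by
shift, test bits with 1ULL << i) follows the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, shortened list of branch-type shifts. */
enum sk_branch_shift {
	SK_BRANCH_USER_SHIFT     = 0,
	SK_BRANCH_KERNEL_SHIFT   = 1,
	SK_BRANCH_ANY_SHIFT      = 2,
	SK_BRANCH_ANY_CALL_SHIFT = 3,
	SK_BRANCH_MAX_SHIFT
};

#define SK_LBR_NOT_SUPP (-1)	/* marker: filter not available */

/* One slot per shift, so the table has SK_BRANCH_MAX_SHIFT entries. */
static const int sk_lbr_sel_map[SK_BRANCH_MAX_SHIFT] = {
	[SK_BRANCH_USER_SHIFT]     = 0x02,
	[SK_BRANCH_KERNEL_SHIFT]   = 0x01,
	[SK_BRANCH_ANY_SHIFT]      = 0x3c,
	[SK_BRANCH_ANY_CALL_SHIFT] = SK_LBR_NOT_SUPP,
};

/* OR together the hardware filters for every bit set in br_type. */
static int sk_build_filter(uint64_t br_type, uint64_t *mask_out)
{
	uint64_t mask = 0;
	int i;

	assert(mask_out);
	for (i = 0; i < SK_BRANCH_MAX_SHIFT; i++) {
		int v;

		if (!(br_type & (1ULL << i)))
			continue;
		v = sk_lbr_sel_map[i];
		if (v == SK_LBR_NOT_SUPP)
			return -1;	/* leave *mask_out untouched */
		mask |= (uint64_t)v;
	}
	*mask_out = mask;
	return 0;
}
```

The table size is the number of shifts times sizeof(int), whereas a table
indexed by the bit value itself grows with the largest bit value.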

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Reviewed-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 4 +++
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 54 ++++++++++++++----------------
include/uapi/linux/perf_event.h | 49 +++++++++++++++++++--------
3 files changed, 64 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index fc5eb39..86c675c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -509,6 +509,10 @@ struct x86_pmu {
struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
};

+enum {
+ PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+};
+
#define x86_add_quirk(func_) \
do { \
static struct x86_pmu_quirk __quirk __initdata = { \
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 45fa730..66cb268 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -69,10 +69,6 @@ static enum {
#define LBR_FROM_FLAG_IN_TX (1ULL << 62)
#define LBR_FROM_FLAG_ABORT (1ULL << 61)

-#define for_each_branch_sample_type(x) \
- for ((x) = PERF_SAMPLE_BRANCH_USER; \
- (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
-
/*
* x86control flow change classification
* x86control flow changes include branches, interrupts, traps, faults
@@ -403,14 +399,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
{
struct hw_perf_event_extra *reg;
u64 br_type = event->attr.branch_sample_type;
- u64 mask = 0, m;
- u64 v;
+ u64 mask = 0, v;
+ int i;

- for_each_branch_sample_type(m) {
- if (!(br_type & m))
+ for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+ if (!(br_type & (1ULL << i)))
continue;

- v = x86_pmu.lbr_sel_map[m];
+ v = x86_pmu.lbr_sel_map[i];
if (v == LBR_NOT_SUPP)
return -EOPNOTSUPP;

@@ -665,35 +661,35 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
/*
* Map interface branch filters onto LBR filters
*/
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
- [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY,
- [PERF_SAMPLE_BRANCH_USER] = LBR_USER,
- [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL,
- [PERF_SAMPLE_BRANCH_HV] = LBR_IGN,
- [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP
- | LBR_IND_JMP | LBR_FAR,
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_REL_JMP
+ | LBR_IND_JMP | LBR_FAR,
/*
* NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
*/
- [PERF_SAMPLE_BRANCH_ANY_CALL] =
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
/*
* NHM/WSM erratum: must include IND_JMP to capture IND_CALL
*/
- [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
- [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
- [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY,
- [PERF_SAMPLE_BRANCH_USER] = LBR_USER,
- [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL,
- [PERF_SAMPLE_BRANCH_HV] = LBR_IGN,
- [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
- [PERF_SAMPLE_BRANCH_ANY_CALL] = LBR_REL_CALL | LBR_IND_CALL
- | LBR_FAR,
- [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL,
- [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

/* core */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9d84540..c610960 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -151,21 +151,42 @@ enum perf_event_sample_format {
* The branch types can be combined, however BRANCH_ANY covers all types
* of branches and therefore it supersedes all the other types.
*/
+enum perf_branch_sample_type_shift {
+ PERF_SAMPLE_BRANCH_USER_SHIFT = 0, /* user branches */
+ PERF_SAMPLE_BRANCH_KERNEL_SHIFT = 1, /* kernel branches */
+ PERF_SAMPLE_BRANCH_HV_SHIFT = 2, /* hypervisor branches */
+
+ PERF_SAMPLE_BRANCH_ANY_SHIFT = 3, /* any branch types */
+ PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT = 4, /* any call branch */
+ PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT = 5, /* any return branch */
+ PERF_SAMPLE_BRANCH_IND_CALL_SHIFT = 6, /* indirect calls */
+ PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT = 7, /* transaction aborts */
+ PERF_SAMPLE_BRANCH_IN_TX_SHIFT = 8, /* in transaction */
+ PERF_SAMPLE_BRANCH_NO_TX_SHIFT = 9, /* not in transaction */
+ PERF_SAMPLE_BRANCH_COND_SHIFT = 10, /* conditional branches */
+
+ PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
+};
+
enum perf_branch_sample_type {
- PERF_SAMPLE_BRANCH_USER = 1U << 0, /* user branches */
- PERF_SAMPLE_BRANCH_KERNEL = 1U << 1, /* kernel branches */
- PERF_SAMPLE_BRANCH_HV = 1U << 2, /* hypervisor branches */
-
- PERF_SAMPLE_BRANCH_ANY = 1U << 3, /* any branch types */
- PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */
- PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << 5, /* any return branch */
- PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */
- PERF_SAMPLE_BRANCH_ABORT_TX = 1U << 7, /* transaction aborts */
- PERF_SAMPLE_BRANCH_IN_TX = 1U << 8, /* in transaction */
- PERF_SAMPLE_BRANCH_NO_TX = 1U << 9, /* not in transaction */
- PERF_SAMPLE_BRANCH_COND = 1U << 10, /* conditional branches */
-
- PERF_SAMPLE_BRANCH_MAX = 1U << 11, /* non-ABI */
+ PERF_SAMPLE_BRANCH_USER = 1U << PERF_SAMPLE_BRANCH_USER_SHIFT,
+ PERF_SAMPLE_BRANCH_KERNEL = 1U << PERF_SAMPLE_BRANCH_KERNEL_SHIFT,
+ PERF_SAMPLE_BRANCH_HV = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
+
+ PERF_SAMPLE_BRANCH_ANY = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_CALL =
+ 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_RETURN =
+ 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+ PERF_SAMPLE_BRANCH_IND_CALL =
+ 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ABORT_TX =
+ 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_IN_TX = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_NO_TX = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_COND = 1U << PERF_SAMPLE_BRANCH_COND_SHIFT,
+
+ PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};

#define PERF_SAMPLE_BRANCH_PLM_ALL \
--
1.8.3.2

2014-11-05 03:13:03

by Liang, Kan

Subject: [PATCH V7 09/17] perf, x86: Save/restore LBR stack during context switch

From: Yan, Zheng <[email protected]>

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is saving/restoring
the LBR stack to/from task's perf event context.

The LBR stack is saved/restored only when there are events that use
the LBR call stack. If no event uses the LBR call stack, the LBR stack
is reset when the task is scheduled in.
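
The save/restore scheme can be tried out in user space with the MSRs
replaced by plain arrays. This is a sketch under that assumption: struct
sk_lbr_hw, the sk_* names and the depth of 16 are hypothetical, and the
flush is modelled as zeroing the arrays.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_LBR_NR 16	/* assumed LBR depth; must be a power of two */

enum { SK_LBR_NONE, SK_LBR_VALID };

/* Simulated hardware: the LBR registers form a ring indexed by TOS. */
struct sk_lbr_hw {
	uint64_t from[SK_LBR_NR];
	uint64_t to[SK_LBR_NR];
	uint64_t tos;
};

struct sk_task_ctx {
	uint64_t lbr_from[SK_LBR_NR];
	uint64_t lbr_to[SK_LBR_NR];
	int lbr_callstack_users;
	int lbr_stack_state;
};

/* Sched out: copy the ring, most recent entry first, into the context. */
static void sk_lbr_save(const struct sk_lbr_hw *hw, struct sk_task_ctx *ctx)
{
	unsigned int i, mask = SK_LBR_NR - 1;

	assert((SK_LBR_NR & mask) == 0);
	if (ctx->lbr_callstack_users == 0) {
		ctx->lbr_stack_state = SK_LBR_NONE;
		return;
	}
	for (i = 0; i < SK_LBR_NR; i++) {
		unsigned int idx = (unsigned int)((hw->tos - i) & mask);

		ctx->lbr_from[i] = hw->from[idx];
		ctx->lbr_to[i] = hw->to[idx];
	}
	ctx->lbr_stack_state = SK_LBR_VALID;
}

/* Sched in: write the saved entries back relative to the current TOS. */
static void sk_lbr_restore(struct sk_lbr_hw *hw, struct sk_task_ctx *ctx)
{
	unsigned int i, mask = SK_LBR_NR - 1;

	if (ctx->lbr_callstack_users == 0 ||
	    ctx->lbr_stack_state == SK_LBR_NONE) {
		/* Nothing usable was saved: flush instead. */
		memset(hw->from, 0, sizeof(hw->from));
		memset(hw->to, 0, sizeof(hw->to));
		return;
	}
	for (i = 0; i < SK_LBR_NR; i++) {
		unsigned int idx = (unsigned int)((hw->tos - i) & mask);

		hw->from[idx] = ctx->lbr_from[i];
		hw->to[idx] = ctx->lbr_to[i];
	}
	ctx->lbr_stack_state = SK_LBR_NONE;
}
```

Because entries are stored relative to TOS, the saved stack can be
restored on a CPU whose TOS points at a different slot.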

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 88 ++++++++++++++++++++++++++----
1 file changed, 76 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index c0ec384..8c72efa 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -180,14 +180,90 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}

+/*
+ * TOS = most recently recorded branch
+ */
+static inline u64 intel_pmu_lbr_tos(void)
+{
+ u64 tos;
+
+ rdmsrl(x86_pmu.lbr_tos, tos);
+ return tos;
+}
+
+enum {
+ LBR_NONE,
+ LBR_VALID,
+};
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+ int i;
+ unsigned lbr_idx, mask;
+ u64 tos;
+
+ if (task_ctx->lbr_callstack_users == 0 ||
+ task_ctx->lbr_stack_state == LBR_NONE) {
+ intel_pmu_lbr_reset();
+ return;
+ }
+
+ mask = x86_pmu.lbr_nr - 1;
+ tos = intel_pmu_lbr_tos();
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ lbr_idx = (tos - i) & mask;
+ wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+ wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+ }
+ task_ctx->lbr_stack_state = LBR_NONE;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+ int i;
+ unsigned lbr_idx, mask;
+ u64 tos;
+
+ if (task_ctx->lbr_callstack_users == 0) {
+ task_ctx->lbr_stack_state = LBR_NONE;
+ return;
+ }
+
+ mask = x86_pmu.lbr_nr - 1;
+ tos = intel_pmu_lbr_tos();
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ lbr_idx = (tos - i) & mask;
+ rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+ rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+ }
+ task_ctx->lbr_stack_state = LBR_VALID;
+}
+
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;

if (!x86_pmu.lbr_nr)
return;

/*
+ * If LBR callstack feature is enabled and the stack was saved when
+ * the task was scheduled out, restore the stack. Otherwise flush
+ * the LBR stack.
+ */
+ task_ctx = ctx ? ctx->task_ctx_data : NULL;
+ if (task_ctx) {
+ if (sched_in) {
+ __intel_pmu_lbr_restore(task_ctx);
+ cpuc->lbr_context = ctx;
+ } else {
+ __intel_pmu_lbr_save(task_ctx);
+ }
+ return;
+ }
+
+ /*
* When sampling the branck stack in system-wide, it may be
* necessary to flush the stack on context switch. This happens
* when the branch stack does not tag its entries with the pid
@@ -279,18 +355,6 @@ void intel_pmu_lbr_disable_all(void)
__intel_pmu_lbr_disable();
}

-/*
- * TOS = most recently recorded branch
- */
-static inline u64 intel_pmu_lbr_tos(void)
-{
- u64 tos;
-
- rdmsrl(x86_pmu.lbr_tos, tos);
-
- return tos;
-}
-
static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
{
unsigned long mask = x86_pmu.lbr_nr - 1;
--
1.8.3.2

2014-11-05 03:13:00

by Liang, Kan

Subject: [PATCH V7 08/17] perf, x86: track number of events that use LBR callstack

From: Yan, Zheng <[email protected]>

When enabling/disabling an event, check if the event uses the LBR
callstack feature and adjust the LBR callstack usage count accordingly.
A later patch will use the usage count to decide whether the LBR stack
should be saved/restored.
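
The bookkeeping amounts to a guarded counter. A minimal sketch follows,
with hypothetical bit values and a stripped-down context struct; the
real code reads the filter from the per-CPU state and the context from
the event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values for the software branch filter. */
#define SK_BR_USER       (1u << 0)
#define SK_BR_CALL_STACK (1u << 1)

struct sk_task_ctx {
	int lbr_callstack_users;
};

/* Both bits must be set: call stack mode, restricted to user space. */
static bool sk_branch_user_callstack(unsigned int br_sel)
{
	return (br_sel & SK_BR_USER) && (br_sel & SK_BR_CALL_STACK);
}

/* Called when an event that uses the LBR is enabled. */
static void sk_lbr_enable(unsigned int br_sel, struct sk_task_ctx *task_ctx)
{
	if (sk_branch_user_callstack(br_sel) && task_ctx)
		task_ctx->lbr_callstack_users++;
}

/* Called when such an event is disabled again. */
static void sk_lbr_disable(unsigned int br_sel, struct sk_task_ctx *task_ctx)
{
	if (sk_branch_user_callstack(br_sel) && task_ctx) {
		assert(task_ctx->lbr_callstack_users > 0);
		task_ctx->lbr_callstack_users--;
	}
}
```

Events without per-task data (a NULL context here) never touch the
counter, so only events with per-task data contribute to the count.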

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 4a29bf5..c0ec384 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -205,9 +205,15 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
}
}

+static inline bool branch_user_callstack(unsigned br_sel)
+{
+ return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
+}
+
void intel_pmu_lbr_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;

if (!x86_pmu.lbr_nr)
return;
@@ -222,6 +228,12 @@ void intel_pmu_lbr_enable(struct perf_event *event)
}
cpuc->br_sel = event->hw.branch_reg.reg;

+ if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
+ event->ctx->task_ctx_data) {
+ task_ctx = event->ctx->task_ctx_data;
+ task_ctx->lbr_callstack_users++;
+ }
+
cpuc->lbr_users++;
perf_sched_cb_inc(event->ctx->pmu);
}
@@ -229,10 +241,17 @@ void intel_pmu_lbr_enable(struct perf_event *event)
void intel_pmu_lbr_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;

if (!x86_pmu.lbr_nr)
return;

+ if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
+ event->ctx->task_ctx_data) {
+ task_ctx = event->ctx->task_ctx_data;
+ task_ctx->lbr_callstack_users--;
+ }
+
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
perf_sched_cb_dec(event->ctx->pmu);
--
1.8.3.2

2014-11-05 03:12:59

by Liang, Kan

Subject: [PATCH V7 03/17] perf, x86: use context switch callback to flush LBR stack

From: Yan, Zheng <[email protected]>

The previous commit introduced a context switch callback whose function
overlaps with the flush branch stack callback, so we can use the
context switch callback to flush the LBR stack.

This patch adds code that uses the context switch callback to
flush the LBR stack when a task is being scheduled in. The callback
is enabled only when there are events that use the LBR hardware. This
patch also removes all of the old flush branch stack code.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 ---
arch/x86/kernel/cpu/perf_event.h | 3 +-
arch/x86/kernel/cpu/perf_event_intel.c | 14 +-----
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 27 +++++++++++
include/linux/perf_event.h | 1 -
kernel/events/core.c | 77 ------------------------------
6 files changed, 30 insertions(+), 99 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index d5de9e1..e37adf0 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1885,12 +1885,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
x86_pmu.sched_task(ctx, sched_in);
}

-static void x86_pmu_flush_branch_stack(void)
-{
- if (x86_pmu.flush_branch_stack)
- x86_pmu.flush_branch_stack();
-}
-
void perf_check_microcode(void)
{
if (x86_pmu.check_microcode)
@@ -1917,7 +1911,6 @@ static struct pmu pmu = {
.commit_txn = x86_pmu_commit_txn,

.event_idx = x86_pmu_event_idx,
- .flush_branch_stack = x86_pmu_flush_branch_stack,
.sched_task = x86_pmu_sched_task,
};

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 0617abb..3d6d533 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -466,7 +466,6 @@ struct x86_pmu {
void (*cpu_dead)(int cpu);

void (*check_microcode)(void);
- void (*flush_branch_stack)(void);
void (*sched_task)(struct perf_event_context *ctx,
bool sched_in);

@@ -727,6 +726,8 @@ void intel_pmu_pebs_disable_all(void);

void intel_ds_init(void);

+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_pmu_lbr_reset(void);

void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 944bf01..9a6e247 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2044,18 +2044,6 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

-static void intel_pmu_flush_branch_stack(void)
-{
- /*
- * Intel LBR does not tag entries with the
- * PID of the current task, then we need to
- * flush it on ctxsw
- * For now, we simply reset it
- */
- if (x86_pmu.lbr_nr)
- intel_pmu_lbr_reset();
-}
-
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2107,7 +2095,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .flush_branch_stack = intel_pmu_flush_branch_stack,
+ .sched_task = intel_pmu_lbr_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 66cb268..c29036b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -177,6 +177,31 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}

+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ if (!x86_pmu.lbr_nr)
+ return;
+
+ /*
+ * When sampling the branch stack in system-wide, it may be
+ * necessary to flush the stack on context switch. This happens
+ * when the branch stack does not tag its entries with the pid
+ * of the current task. Otherwise it becomes impossible to
+ * associate a branch entry with a task. This ambiguity is more
+ * likely to appear when the branch stack supports priv level
+ * filtering and the user sets it to monitor only at the user
+ * level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch
+ * stack with branch from multiple tasks.
+ */
+ if (sched_in) {
+ intel_pmu_lbr_reset();
+ cpuc->lbr_context = ctx;
+ }
+}
+
void intel_pmu_lbr_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -195,6 +220,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
cpuc->br_sel = event->hw.branch_reg.reg;

cpuc->lbr_users++;
+ perf_sched_cb_inc(event->ctx->pmu);
}

void intel_pmu_lbr_disable(struct perf_event *event)
@@ -206,6 +232,7 @@ void intel_pmu_lbr_disable(struct perf_event *event)

cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
+ perf_sched_cb_dec(event->ctx->pmu);

if (cpuc->enabled && !cpuc->lbr_users) {
__intel_pmu_lbr_disable();
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 40ecad1..ed51836 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -514,7 +514,6 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
- int nr_branch_stack; /* branch_stack evt */
struct rcu_head rcu_head;

struct delayed_work orphans_remove;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28c2764..9212a2b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -153,7 +153,6 @@ enum event_type_t {
*/
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
static DEFINE_PER_CPU(int, perf_sched_cb_usages);

static atomic_t nr_mmap_events __read_mostly;
@@ -1152,9 +1151,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
if (is_cgroup_event(event))
ctx->nr_cgroups++;

- if (has_branch_stack(event))
- ctx->nr_branch_stack++;
-
list_add_rcu(&event->event_entry, &ctx->event_list);
if (!ctx->nr_events)
perf_pmu_rotate_start(ctx->pmu);
@@ -1317,9 +1313,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
cpuctx->cgrp = NULL;
}

- if (has_branch_stack(event))
- ctx->nr_branch_stack--;
-
ctx->nr_events--;
if (event->attr.inherit_stat)
ctx->nr_stat--;
@@ -2673,64 +2666,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
}

/*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
- struct perf_cpu_context *cpuctx;
- struct pmu *pmu;
- unsigned long flags;
-
- /* no need to flush branch stack if not changing task */
- if (prev == task)
- return;
-
- local_irq_save(flags);
-
- rcu_read_lock();
-
- list_for_each_entry_rcu(pmu, &pmus, entry) {
- cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
- /*
- * check if the context has at least one
- * event using PERF_SAMPLE_BRANCH_STACK
- */
- if (cpuctx->ctx.nr_branch_stack > 0
- && pmu->flush_branch_stack) {
-
- perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
- perf_pmu_disable(pmu);
-
- pmu->flush_branch_stack();
-
- perf_pmu_enable(pmu);
-
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- }
- }
-
- rcu_read_unlock();
-
- local_irq_restore(flags);
-}
-
-/*
* Called from scheduler to add the events of the current task
* with interrupts disabled.
*
@@ -2762,10 +2697,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
perf_cgroup_sched_in(prev, task);

- /* check for system-wide branch_stack events */
- if (atomic_read(this_cpu_ptr(&perf_branch_stack_events)))
- perf_branch_stack_sched_in(prev, task);
-
if (__this_cpu_read(perf_sched_cb_usages))
perf_pmu_sched_task(prev, task, true);
}
@@ -3359,10 +3290,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
}
@@ -6922,10 +6849,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_inc(&per_cpu(perf_cgroup_events, cpu));
}
--
1.8.3.2
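
The flow this patch describes (the callback is armed only while LBR events exist, and the stack is flushed on sched-in) can be sketched as a small user-space model. This is not kernel code: every `model_` name is hypothetical, and the usage-count check that perf core performs before calling into the PMU is folded into one function for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_LBR_NR 16	/* assumed depth, matching the 16-deep LBR */

/* Hypothetical per-CPU state: LBR entries plus usage counts. */
struct model_cpu {
	unsigned long lbr_from[MODEL_LBR_NR];
	unsigned long lbr_to[MODEL_LBR_NR];
	int sched_cb_usages;	/* stands in for perf_sched_cb_usages */
	int lbr_users;		/* stands in for cpuc->lbr_users */
	const void *lbr_context;
};

/* Stand-in for intel_pmu_lbr_reset(): drop every recorded branch. */
static void model_lbr_reset(struct model_cpu *cpu)
{
	memset(cpu->lbr_from, 0, sizeof(cpu->lbr_from));
	memset(cpu->lbr_to, 0, sizeof(cpu->lbr_to));
}

/* An LBR event is enabled: count it and arm the context switch callback. */
static void model_lbr_enable(struct model_cpu *cpu)
{
	cpu->lbr_users++;
	cpu->sched_cb_usages++;
}

/* An LBR event is disabled: undo both counts. */
static void model_lbr_disable(struct model_cpu *cpu)
{
	assert(cpu->lbr_users > 0);
	cpu->lbr_users--;
	cpu->sched_cb_usages--;
}

/*
 * Context switch hook. Returns true if the LBR stack was flushed.
 * The flush happens only on sched-in, and only while at least one
 * event has armed the callback.
 */
static bool model_sched_task(struct model_cpu *cpu, const void *ctx,
			     bool sched_in)
{
	if (!cpu->sched_cb_usages)
		return false;
	if (!sched_in)
		return false;

	model_lbr_reset(cpu);
	cpu->lbr_context = ctx;
	return true;
}
```

In this model, a CPU with no LBR events never pays for the flush, which is the point of replacing the unconditional flush_branch_stack path with a counted callback.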

2014-11-05 03:14:10

by Liang, Kan

Subject: [PATCH V7 06/17] perf, core: always switch pmu specific data during context switch

From: Yan, Zheng <[email protected]>

If two tasks were both forked from the same parent task, the events in
their perf task contexts can be the same. In that case perf core may
skip switching the perf event contexts.

The previous patch introduces PMU specific data. The data is used to
save the LBR stack and is task specific, so we need to switch the data
even when the context switch is optimized out.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
kernel/events/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 08d6671..4360c95 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2420,6 +2420,9 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
next->perf_event_ctxp[ctxn] = ctx;
ctx->task = next;
next_ctx->task = task;
+
+ swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
+
do_switch = 0;

perf_event_sync_stat(ctx, next_ctx);
--
1.8.3.2
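
Why the extra swap() is needed can be shown with a small user-space model. This is only a sketch with hypothetical `model_` names: when perf core takes the optimized path it exchanges the two context objects between the tasks, so any task-specific payload hanging off a context has to be exchanged back to keep following its task.

```c
#include <assert.h>
#include <stddef.h>

struct model_task;

/* Hypothetical stand-in for struct perf_event_context. */
struct model_ctx {
	struct model_task *task;
	void *task_ctx_data;	/* task-specific payload, e.g. saved LBR stack */
};

/* Hypothetical stand-in for the per-task context pointer. */
struct model_task {
	struct model_ctx *ctx;
};

static void model_swap_ptr(void **a, void **b)
{
	void *tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * Optimized switch between two tasks whose contexts hold equivalent
 * events: the context objects trade owners instead of being
 * scheduled out and in.
 */
static void model_optimized_switch(struct model_task *prev,
				   struct model_task *next)
{
	struct model_ctx *ctx = prev->ctx;
	struct model_ctx *next_ctx = next->ctx;

	assert(ctx && next_ctx);

	prev->ctx = next_ctx;
	next->ctx = ctx;
	ctx->task = next;
	next_ctx->task = prev;

	/* Without this, each task would inherit the other's payload. */
	model_swap_ptr(&ctx->task_ctx_data, &next_ctx->task_ctx_data);
}
```

After the call each task owns the other's context object but still finds its own payload behind it.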

2014-11-05 03:14:09

by Liang, Kan

Subject: [PATCH V7 04/17] perf, x86: Basic Haswell LBR call stack support

From: Yan, Zheng <[email protected]>

Haswell has a new feature that utilizes the existing LBR facility to
record call chains. To enable this feature, bits (JCC, NEAR_IND_JMP,
NEAR_REL_JMP, FAR_BRANCH, EN_CALLSTACK) in LBR_SELECT must be set to 1,
and bits (NEAR_REL_CALL, NEAR_IND_CALL, NEAR_RET) must be cleared. Due
to a hardware bug in Haswell, this feature doesn't work well with
FREEZE_LBRS_ON_PMI.

When the call stack feature is enabled, the LBR stack will capture
unfiltered call data normally, but as return instructions are executed,
the last captured branch record is flushed from the on-chip registers
in a last-in first-out (LIFO) manner. Thus, branch information relative
to leaf functions will not be captured, while preserving the call stack
information of the main line execution path.

This patch defines a separate lbr_sel map for Haswell. The map contains
a new entry for the call stack feature.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 14 ++++-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 91 ++++++++++++++++++++++--------
3 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3d6d533..13464e4 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -511,7 +511,11 @@ struct x86_pmu {
};

enum {
- PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+ PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+ PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
+
+ PERF_SAMPLE_BRANCH_CALL_STACK =
+ 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
};

#define x86_add_quirk(func_) \
@@ -545,6 +549,12 @@ static struct perf_pmu_events_attr event_attr_##v = { \

extern struct x86_pmu x86_pmu __read_mostly;

+static inline bool x86_pmu_has_lbr_callstack(void)
+{
+ return x86_pmu.lbr_sel_map &&
+ x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+}
+
DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);

int x86_perf_event_set_period(struct perf_event *event);
@@ -748,6 +758,8 @@ void intel_pmu_lbr_init_atom(void);

void intel_pmu_lbr_init_snb(void);

+void intel_pmu_lbr_init_hsw(void);
+
int intel_pmu_setup_lbr_filter(struct perf_event *event);

int p4_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 9a6e247..a0c0739 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2536,7 +2536,7 @@ __init int intel_pmu_init(void)
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));

- intel_pmu_lbr_init_snb();
+ intel_pmu_lbr_init_hsw();

x86_pmu.event_constraints = intel_hsw_event_constraints;
x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index c29036b..4a29bf5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -39,6 +39,7 @@ static enum {
#define LBR_IND_JMP_BIT 6 /* do not capture indirect jumps */
#define LBR_REL_JMP_BIT 7 /* do not capture relative jumps */
#define LBR_FAR_BIT 8 /* do not capture far branches */
+#define LBR_CALL_STACK_BIT 9 /* enable call stack */

#define LBR_KERNEL (1 << LBR_KERNEL_BIT)
#define LBR_USER (1 << LBR_USER_BIT)
@@ -49,6 +50,7 @@ static enum {
#define LBR_REL_JMP (1 << LBR_REL_JMP_BIT)
#define LBR_IND_JMP (1 << LBR_IND_JMP_BIT)
#define LBR_FAR (1 << LBR_FAR_BIT)
+#define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT)

#define LBR_PLM (LBR_KERNEL | LBR_USER)

@@ -74,24 +76,25 @@ static enum {
* x86control flow changes include branches, interrupts, traps, faults
*/
enum {
- X86_BR_NONE = 0, /* unknown */
-
- X86_BR_USER = 1 << 0, /* branch target is user */
- X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
-
- X86_BR_CALL = 1 << 2, /* call */
- X86_BR_RET = 1 << 3, /* return */
- X86_BR_SYSCALL = 1 << 4, /* syscall */
- X86_BR_SYSRET = 1 << 5, /* syscall return */
- X86_BR_INT = 1 << 6, /* sw interrupt */
- X86_BR_IRET = 1 << 7, /* return from interrupt */
- X86_BR_JCC = 1 << 8, /* conditional */
- X86_BR_JMP = 1 << 9, /* jump */
- X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
- X86_BR_IND_CALL = 1 << 11,/* indirect calls */
- X86_BR_ABORT = 1 << 12,/* transaction abort */
- X86_BR_IN_TX = 1 << 13,/* in transaction */
- X86_BR_NO_TX = 1 << 14,/* not in transaction */
+ X86_BR_NONE = 0, /* unknown */
+
+ X86_BR_USER = 1 << 0, /* branch target is user */
+ X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
+
+ X86_BR_CALL = 1 << 2, /* call */
+ X86_BR_RET = 1 << 3, /* return */
+ X86_BR_SYSCALL = 1 << 4, /* syscall */
+ X86_BR_SYSRET = 1 << 5, /* syscall return */
+ X86_BR_INT = 1 << 6, /* sw interrupt */
+ X86_BR_IRET = 1 << 7, /* return from interrupt */
+ X86_BR_JCC = 1 << 8, /* conditional */
+ X86_BR_JMP = 1 << 9, /* jump */
+ X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
+ X86_BR_IND_CALL = 1 << 11,/* indirect calls */
+ X86_BR_ABORT = 1 << 12,/* transaction abort */
+ X86_BR_IN_TX = 1 << 13,/* in transaction */
+ X86_BR_NO_TX = 1 << 14,/* not in transaction */
+ X86_BR_CALL_STACK = 1 << 15,/* call stack */
};

#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -373,7 +376,7 @@ void intel_pmu_lbr_read(void)
* - in case there is no HW filter
* - in case the HW filter has errata or limitations
*/
-static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
{
u64 br_type = event->attr.branch_sample_type;
int mask = 0;
@@ -410,11 +413,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
if (br_type & PERF_SAMPLE_BRANCH_COND)
mask |= X86_BR_JCC;

+ if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
+ if (!x86_pmu_has_lbr_callstack())
+ return -EOPNOTSUPP;
+ if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
+ return -EINVAL;
+ mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
+ X86_BR_CALL_STACK;
+ }
+
/*
* stash actual user request into reg, it may
* be used by fixup code for some CPU
*/
event->hw.branch_reg.reg = mask;
+ return 0;
}

/*
@@ -443,8 +456,12 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
reg = &event->hw.branch_reg;
reg->idx = EXTRA_REG_LBR;

- /* LBR_SELECT operates in suppress mode so invert mask */
- reg->config = ~mask & x86_pmu.lbr_sel_mask;
+ /*
+ * The first 9 bits (LBR_SEL_MASK) in LBR_SELECT operate
+ * in suppress mode. So LBR_SELECT should be set to
+ * (~mask & LBR_SEL_MASK) | (mask & ~LBR_SEL_MASK)
+ */
+ reg->config = mask ^ x86_pmu.lbr_sel_mask;

return 0;
}
@@ -462,7 +479,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
/*
* setup SW LBR filter
*/
- intel_pmu_setup_sw_lbr_filter(event);
+ ret = intel_pmu_setup_sw_lbr_filter(event);
+ if (ret)
+ return ret;

/*
* setup HW LBR filter, if any
@@ -719,6 +738,20 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
[PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
+ [PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_RETURN | LBR_CALL_STACK,
+};
+
/* core */
void __init intel_pmu_lbr_init_core(void)
{
@@ -775,6 +808,20 @@ void __init intel_pmu_lbr_init_snb(void)
pr_cont("16-deep LBR, ");
}

+/* haswell */
+void intel_pmu_lbr_init_hsw(void)
+{
+ x86_pmu.lbr_nr = 16;
+ x86_pmu.lbr_tos = MSR_LBR_TOS;
+ x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+ x86_pmu.lbr_to = MSR_LBR_NHM_TO;
+
+ x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+ x86_pmu.lbr_sel_map = hsw_lbr_sel_map;
+
+ pr_cont("16-deep LBR, ");
+}
+
/* atom */
void __init intel_pmu_lbr_init_atom(void)
{
--
1.8.3.2
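
The LBR_SELECT computation in this patch can be checked with a short stand-alone sketch. Bits 6 to 9 are taken from the diff; the positions of bits 0 to 5 and the value 0x1ff for LBR_SEL_MASK are assumptions based on the existing file and the "first 9 bits" comment. The two helper functions are hypothetical and exist only to compare the XOR form with the longhand form given in the comment.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 0-5: assumed from the existing file; bits 6-9: from the diff. */
#define LBR_KERNEL	(1u << 0)
#define LBR_USER	(1u << 1)
#define LBR_JCC		(1u << 2)
#define LBR_REL_CALL	(1u << 3)
#define LBR_IND_CALL	(1u << 4)
#define LBR_RETURN	(1u << 5)
#define LBR_IND_JMP	(1u << 6)
#define LBR_REL_JMP	(1u << 7)
#define LBR_FAR		(1u << 8)
#define LBR_CALL_STACK	(1u << 9)

#define LBR_SEL_MASK	0x1ffu	/* assumed: the nine suppress-mode bits */

/* The form used in the patch: invert only the suppress-mode bits. */
static uint64_t lbr_select_config(uint64_t mask)
{
	return mask ^ LBR_SEL_MASK;
}

/* The longhand form from the comment, for comparison. */
static uint64_t lbr_select_config_longhand(uint64_t mask)
{
	return (~mask & LBR_SEL_MASK) | (mask & ~(uint64_t)LBR_SEL_MASK);
}
```

For a user-level call stack request the mask is LBR_USER plus the hsw_lbr_sel_map call stack entry. The resulting register value has JCC, IND_JMP, REL_JMP, FAR and CALL_STACK set and the call/return bits cleared, as the commit message requires; the KERNEL bit is also set, which in suppress mode filters out ring 0 branches.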

2014-11-05 03:14:43

by Liang, Kan

Subject: [PATCH V7 05/17] perf, core: pmu specific data for perf task context

From: Yan, Zheng <[email protected]>

Introduce a new flag PERF_ATTACH_TASK_DATA for perf event's attach
state. The flag is set by the PMU's event_init() callback; it indicates
that the perf event needs PMU specific data.

The PMU specific data are initialized to zeros. Later patches will
use PMU specific data to save LBR stack.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 6 ++++++
kernel/events/core.c | 40 ++++++++++++++++++++++++++++++++++++----
2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ed51836..84ec3e6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -269,6 +269,10 @@ struct pmu {
*/
void (*sched_task) (struct perf_event_context *ctx,
bool sched_in);
+ /*
+ * PMU specific data size
+ */
+ size_t task_ctx_size;

};

@@ -305,6 +309,7 @@ struct swevent_hlist {
#define PERF_ATTACH_CONTEXT 0x01
#define PERF_ATTACH_GROUP 0x02
#define PERF_ATTACH_TASK 0x04
+#define PERF_ATTACH_TASK_DATA 0x08

struct perf_cgroup;
struct ring_buffer;
@@ -514,6 +519,7 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
+ void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;

struct delayed_work orphans_remove;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9212a2b..08d6671 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -895,6 +895,15 @@ static void get_ctx(struct perf_event_context *ctx)
WARN_ON(!atomic_inc_not_zero(&ctx->refcount));
}

+static void free_ctx(struct rcu_head *head)
+{
+ struct perf_event_context *ctx;
+
+ ctx = container_of(head, struct perf_event_context, rcu_head);
+ kfree(ctx->task_ctx_data);
+ kfree(ctx);
+}
+
static void put_ctx(struct perf_event_context *ctx)
{
if (atomic_dec_and_test(&ctx->refcount)) {
@@ -902,7 +911,7 @@ static void put_ctx(struct perf_event_context *ctx)
put_ctx(ctx->parent_ctx);
if (ctx->task)
put_task_struct(ctx->task);
- kfree_rcu(ctx, rcu_head);
+ call_rcu(&ctx->rcu_head, free_ctx);
}
}

@@ -3188,12 +3197,15 @@ errout:
* Returns a matching context with refcount and pincount.
*/
static struct perf_event_context *
-find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
+find_get_context(struct pmu *pmu, struct task_struct *task,
+ struct perf_event *event)
{
struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_cpu_context *cpuctx;
+ void *task_ctx_data = NULL;
unsigned long flags;
int ctxn, err;
+ int cpu = event->cpu;

if (!task) {
/* Must be root to operate on a CPU event: */
@@ -3221,11 +3233,24 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
if (ctxn < 0)
goto errout;

+ if (event->attach_state & PERF_ATTACH_TASK_DATA) {
+ task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+ if (!task_ctx_data) {
+ err = -ENOMEM;
+ goto errout;
+ }
+ }
+
retry:
ctx = perf_lock_task_context(task, ctxn, &flags);
if (ctx) {
clone_ctx = unclone_ctx(ctx);
++ctx->pin_count;
+
+ if (task_ctx_data && !ctx->task_ctx_data) {
+ ctx->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ }
raw_spin_unlock_irqrestore(&ctx->lock, flags);

if (clone_ctx)
@@ -3236,6 +3261,11 @@ retry:
if (!ctx)
goto errout;

+ if (task_ctx_data) {
+ ctx->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ }
+
err = 0;
mutex_lock(&task->perf_event_mutex);
/*
@@ -3262,9 +3292,11 @@ retry:
}
}

+ kfree(task_ctx_data);
return ctx;

errout:
+ kfree(task_ctx_data);
return ERR_PTR(err);
}

@@ -7331,7 +7363,7 @@ SYSCALL_DEFINE5(perf_event_open,
/*
* Get the target context (task or percpu):
*/
- ctx = find_get_context(pmu, task, event->cpu);
+ ctx = find_get_context(pmu, task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
goto err_alloc;
@@ -7500,7 +7532,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,

account_event(event);

- ctx = find_get_context(event->pmu, task, cpu);
+ ctx = find_get_context(event->pmu, task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
goto err_free;
--
1.8.3.2
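
The allocation pattern in find_get_context() above (allocate before taking the lock, install only if the context has no data yet, free any spare) can be sketched in user space as follows. The `model_` names are hypothetical, calloc()/free() stand in for kzalloc()/kfree(), and the locking itself is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the relevant part of perf_event_context. */
struct model_ctx {
	void *task_ctx_data;
};

/*
 * Attach zeroed PMU specific data of 'size' bytes to 'ctx' if the
 * event asked for it. Returns 0 on success, -1 if allocation failed.
 */
static int model_attach_task_data(struct model_ctx *ctx, size_t size,
				  int needs_data)
{
	void *data = NULL;

	if (needs_data) {
		/* May sleep in the kernel, so it is done before locking. */
		data = calloc(1, size);
		if (!data)
			return -1;
	}

	/* In the kernel this step runs under ctx->lock. */
	if (data && !ctx->task_ctx_data) {
		ctx->task_ctx_data = data;
		data = NULL;	/* ownership moved to the context */
	}

	/* Frees the spare buffer if the context already had data. */
	free(data);
	return 0;
}

/* Mirrors free_ctx(): the data is released together with the context. */
static void model_free_ctx(struct model_ctx *ctx)
{
	free(ctx->task_ctx_data);
	ctx->task_ctx_data = NULL;
}
```

A second event that attaches to the same context allocates a buffer it ends up not needing; the sketch frees it immediately, just as the patch calls kfree(task_ctx_data) on both the success and the error path.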

2014-11-05 09:21:07

by Peter Zijlstra

Subject: Re: [PATCH V7 11/17] perf, core: expose LBR call stack to user perf tool

On Tue, Nov 04, 2014 at 09:56:07PM -0500, Kan Liang wrote:
> From: Yan, Zheng <[email protected]>

Now for this patch I'm fairly sure Zheng didn't actually write it; his
last posting did something different IIRC.

> With LBR call stack feature enable, there are two call chain data

_three_, LBR is the 3rd.

> sources, traditional frame pointer and LBR call stack.
> This patch extends the perf_callchain_entry struct to mark the available
> call chain source.
> The frame pointer is still output as PERF_SAMPLE_CALLCHAIN data format.
> The LBR call stack data will be output as PERF_SAMPLE_BRANCH_STACK data
> format.

I'm not sure what this patch does?! Why do the FP based callchains need
_any_ changes?

2014-11-05 09:21:51

by Peter Zijlstra

Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Tue, Nov 04, 2014 at 09:56:09PM -0500, Kan Liang wrote:
> From: Yan, Zheng <[email protected]>
>
> Only enable LBR callstack when user requires fp callgraph. The feature
> is not available when PERF_SAMPLE_BRANCH_STACK or PERF_SAMPLE_STACK_USER
> is required.
> Also, this feature only affects how to get user callchain. The kernel
> callchain is always got by frame pointers.

Since FP callchains should not change, this doesn't appear to make any
sense either.

2014-11-05 09:38:18

by Peter Zijlstra

Subject: Re: [PATCH V7 00/17] perf, x86: Haswell LBR call stack support



So if I take all except 11,13,16,17 but instead do something like the
below, everything will work just fine, right?

Or am I missing something?

---
arch/x86/kernel/cpu/perf_event.h | 8 --------
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 8 ++++----
include/uapi/linux/perf_event.h | 16 ++++++++--------
3 files changed, 12 insertions(+), 20 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -521,14 +521,6 @@ struct x86_perf_task_context {
int lbr_stack_state;
};

-enum {
- PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
- PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
-
- PERF_SAMPLE_BRANCH_CALL_STACK =
- 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
-};
-
#define x86_add_quirk(func_) \
do { \
static struct x86_pmu_quirk __quirk __initdata = { \
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -537,7 +537,7 @@ static int intel_pmu_setup_hw_lbr_filter
u64 mask = 0, v;
int i;

- for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+ for (i = 0; i < PERF_SAMPLE_BRANCH_MAX_SHIFT; i++) {
if (!(br_type & (1ULL << i)))
continue;

@@ -808,7 +808,7 @@ intel_pmu_lbr_filter(struct cpu_hw_event
/*
* Map interface branch filters onto LBR filters
*/
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
[PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
[PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
[PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
@@ -827,7 +827,7 @@ static const int nhm_lbr_sel_map[PERF_SA
[PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
[PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
[PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
[PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
@@ -839,7 +839,7 @@ static const int snb_lbr_sel_map[PERF_SA
[PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

-static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
[PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
[PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
[PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -166,6 +166,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_TX_SHIFT = 9, /* not in transaction */
PERF_SAMPLE_BRANCH_COND_SHIFT = 10, /* conditional branches */

+ PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = 11, /* call/ret stack */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
};

@@ -175,18 +177,16 @@ enum perf_branch_sample_type {
PERF_SAMPLE_BRANCH_HV = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,

PERF_SAMPLE_BRANCH_ANY = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
- PERF_SAMPLE_BRANCH_ANY_CALL =
- 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
- PERF_SAMPLE_BRANCH_ANY_RETURN =
- 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
- PERF_SAMPLE_BRANCH_IND_CALL =
- 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
- PERF_SAMPLE_BRANCH_ABORT_TX =
- 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_CALL = 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+ PERF_SAMPLE_BRANCH_IND_CALL = 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ABORT_TX = 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
PERF_SAMPLE_BRANCH_IN_TX = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
PERF_SAMPLE_BRANCH_NO_TX = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
PERF_SAMPLE_BRANCH_COND = 1U << PERF_SAMPLE_BRANCH_COND_SHIFT,

+ PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};
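
The effect of moving the call stack shift into the uapi enum can be illustrated with a stand-alone sketch. The enum below only mirrors the layout: the `MODEL_` names are hypothetical, shifts 9 to 11 come from the diff above, and the lower values are assumptions. The selector value 0x238 is LBR_REL_CALL | LBR_IND_CALL | LBR_RETURN | LBR_CALL_STACK, assuming the usual LBR_SELECT bit positions.

```c
#include <assert.h>

/* Mirror of the shift enum with the call stack entry added at 11. */
enum model_branch_shift {
	MODEL_BRANCH_USER_SHIFT		= 0,	/* assumed */
	MODEL_BRANCH_KERNEL_SHIFT	= 1,	/* assumed */
	MODEL_BRANCH_HV_SHIFT		= 2,	/* assumed */
	MODEL_BRANCH_ANY_SHIFT		= 3,	/* assumed */
	MODEL_BRANCH_ANY_CALL_SHIFT	= 4,	/* assumed */
	MODEL_BRANCH_ANY_RETURN_SHIFT	= 5,	/* assumed */
	MODEL_BRANCH_IND_CALL_SHIFT	= 6,	/* assumed */
	MODEL_BRANCH_ABORT_TX_SHIFT	= 7,	/* assumed */
	MODEL_BRANCH_IN_TX_SHIFT	= 8,	/* assumed */
	MODEL_BRANCH_NO_TX_SHIFT	= 9,
	MODEL_BRANCH_COND_SHIFT		= 10,
	MODEL_BRANCH_CALL_STACK_SHIFT	= 11,

	MODEL_BRANCH_MAX_SHIFT		/* becomes 12 automatically */
};

/* A map sized by MAX_SHIFT now has a slot for the call stack entry. */
static const int model_sel_map[MODEL_BRANCH_MAX_SHIFT] = {
	[MODEL_BRANCH_CALL_STACK_SHIFT] = 0x238,
};

/* Same loop shape as the HW filter setup: OR the entry of each set bit. */
static unsigned long long model_collect_mask(unsigned long long br_type)
{
	unsigned long long mask = 0;
	int i;

	for (i = 0; i < MODEL_BRANCH_MAX_SHIFT; i++) {
		if (!(br_type & (1ULL << i)))
			continue;
		mask |= (unsigned long long)model_sel_map[i];
	}
	return mask;
}
```

Because the trailing enumerator grows with the enum, the x86-private PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE is no longer needed and the maps and the loop can use PERF_SAMPLE_BRANCH_MAX_SHIFT directly.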

2014-11-05 09:58:45

by Stephane Eranian

Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Wed, Nov 5, 2014 at 10:21 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Nov 04, 2014 at 09:56:09PM -0500, Kan Liang wrote:
>> From: Yan, Zheng <[email protected]>
>>
>> Only enable LBR callstack when user requires fp callgraph. The feature
>> is not available when PERF_SAMPLE_BRANCH_STACK or PERF_SAMPLE_STACK_USER
>> is required.
>> Also, this feature only affects how to get user callchain. The kernel
>> callchain is always got by frame pointers.
>
> Since FP callchains should not change, this doesn't appear to make any
> sense either.

If I recall the earlier discussion, the FP callchains are not changed. On
HSW, when requesting fp at the user level only, the kernel automatically
tries to use the LBR callstack mode. The advantage is that the user app
does not require frame pointers or dwarf debug info to get correct
callchains with perf record. The downside is that LBR callstack does not
work in certain callchain corner cases.

The reason why using LBR call stack mode is restricted to user level only
is a bug in the LBR call stack hardware which forces the kernel to drop
LBR_FREEZE_PMI. In other words, the LBR does not stop on counter overflow,
thus it will be wiped out by the execution of kernel code leading to the
PMU interrupt handler.

2014-11-05 10:44:09

by Peter Zijlstra

Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Wed, Nov 05, 2014 at 10:58:28AM +0100, Stephane Eranian wrote:
> On Wed, Nov 5, 2014 at 10:21 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Nov 04, 2014 at 09:56:09PM -0500, Kan Liang wrote:
> >> From: Yan, Zheng <[email protected]>
> >>
> >> Only enable LBR callstack when user requires fp callgraph. The feature
> >> is not available when PERF_SAMPLE_BRANCH_STACK or PERF_SAMPLE_STACK_USER
> >> is required.
> >> Also, this feature only affects how to get user callchain. The kernel
> >> callchain is always got by frame pointers.
> >
> > Since FP callchains should not change, this doesn't appear to make any
> > sense either.
>
> If I recall earlier discussion, the FP callchain are not changed. On
> HSW, when requesting fp at the user level only, then the kernel
> automatically tries to use the LBR callstack mode. Advantage is that
> the user app does not require frame-pointer or dwarf debug info to get
> correct callchains with perf record. The downside is that LBR
> callstack does not work in certain callchain corner cases.

But this patch changes the FP callchain interface. I see no need for
that. We already have multiple independent callchain options (FP and
Dwarf); adding a third option (LBR) should also be independent.

Allowing all 3 at the same time allows for identifying those corner
cases.

That is, I simply don't see a good reason to intertwine these things at
the interface level. All it does is reduce options. Would it not be
'nice' to allow both FP and LBR at the same time?

2014-11-05 10:57:12

by Stephane Eranian

Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Wed, Nov 5, 2014 at 11:43 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Nov 05, 2014 at 10:58:28AM +0100, Stephane Eranian wrote:
>> On Wed, Nov 5, 2014 at 10:21 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Nov 04, 2014 at 09:56:09PM -0500, Kan Liang wrote:
>> >> From: Yan, Zheng <[email protected]>
>> >>
>> >> Only enable LBR callstack when user requires fp callgraph. The feature
>> >> is not available when PERF_SAMPLE_BRANCH_STACK or PERF_SAMPLE_STACK_USER
>> >> is required.
>> >> Also, this feature only affects how to get user callchain. The kernel
>> >> callchain is always got by frame pointers.
>> >
>> > Since FP callchains should not change, this doesn't appear to make any
>> > sense either.
>>
>> If I recall earlier discussion, the FP callchain are not changed. On
>> HSW, when requesting fp at the user level only, then the kernel
>> automatically tries to use the LBR callstack mode. Advantage is that
>> the user app does not require frame-pointer or dwarf debug info to get
>> correct callchains with perf record. The downside is that LBR
>> callstack does not work in certain callchain corner cases.
>
> But this patch changes the FP callchain interface. I see no need of
> that. We already have multiple independent callchain options (FP and
> Dwarf) adding a third option should also be independent (LBR).
>
> Allowing all 3 at the same time allows for identifying those corner
> cases.
>
> That is I simply don't see a good reason intertwine these things at the
> interface level. All it does is reduce options. Would it not be 'nice'
> to allow both FP and LBR at the same time?

Yes, but I wonder how would the tool sort this out if you have FP and LBR
for each sample.

My understanding of the patch is that it does not change the user interface,
it changes the way callchains are gathered by the kernel on HSW.

Is there explicit mention in the API that CALLCHAIN is relying on FP?

I think in general it would be better for tools to know which low-level
mechanism is used to better interpret the results and especially be aware
of the limitations of each mechanism.

I think the patch is trying some auto-promotion of CALLCHAIN to FP based
on the belief it is better in most cases. It reminds me of the discussion about
precise mode. Why not default to precise for all events that support it?

I would be okay if the patch was introducing the 3rd mode for callchains.

2014-11-05 12:49:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Wed, Nov 05, 2014 at 11:57:10AM +0100, Stephane Eranian wrote:
> Yes, but I wonder how would the tool sort this out if you have FP and LBR
> for each sample.

That's the tool's 'problem'. It currently can already have FP and Dwarf
bits. And it does not need to request all of them.

> My understanding of the patch is that it does not change the user interface,
> it changes the way callchains are gathered by the kernel on HSW.

I was under the impression it did change, but that shows how well the
Changelog explained things I suppose :/

> Is there explicit mention in the API that CALLCHAIN is relying on FP?

Don't think so. Although I would much prefer if it uses a single method
per arch across both kernel and user space. For x86 that is FP (since
that's the only method available to the kernel).

> I think in general it would be better for tools to know which
> low-level mechanism is used to better interpret the results and
> especially be aware of the limitations of each mechanism.

Agreed.

> I think the patch is trying some auto-promotion of CALLCHAIN to FP
> based on the belief it is better in most cases.

We're all more familiar with FP, and it doesn't have the obvious problem
of only 16 entries. I've worked on quite a bit of software that had much
deeper callchains -- yay for recursive algorithms and/or C++.

With a bit of care FP can be 'perfect', although Andi likes to point out
that glibc isn't and often wrecks FP :-(

> It reminds me of the discussion about precise mode. Why not default to
> precise for all events that support it?

I've no idea where that discussion stranded.

> I would be okay if the patch was introducing the 3rd mode for callchains.

Right, I would prefer that (as should be clear by now), this would allow
running with two (or even all three) and compare results.

2014-11-05 13:22:11

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Wed, Nov 5, 2014 at 1:49 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Nov 05, 2014 at 11:57:10AM +0100, Stephane Eranian wrote:
>> Yes, but I wonder how would the tool sort this out if you have FP and LBR
>> for each sample.
>
> That's the tools 'problem'. It currently can already have FP and Dwarf
> bits. And it does not need to request all of them.
>
I was thinking about the case where the tool would request both FP and
LBR at the same time to try and construct a complete callstack. Not sure
how the tool could do that.

>> My understanding of the patch is that it does not change the user interface,
>> it changes the way callchains are gathered by the kernel on HSW.
>
> I was under the impression it did change, but that shows how well the
> Changelog explained things I suppose :/
>
With the current patches (or the latest version I looked at), there was no
way to request explicitly LBR mode. It was automatic if CALLCHAIN +
user mode only sampling.

>> Is there explicit mention in the API that CALLCHAIN is relying on FP?
>
> Don't think so. Although I would much prefer if it uses a single method
> per arch across both kernel and user space. For x86 that is FP (since
> that's the only method available to the kernel).
>
I tend to agree here. The problem with FP is that it is not easy to figure
out how a binary has been compiled. Getting valid FP callchains for
large binaries using lots of shared libraries is very challenging. All
libraries must be compiled with FP. It is not easy to test if FP was
compiled in. There is no ELF header flag for this. Need to inspect
the x86 asm and look at function prologues.
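
To illustrate the prologue inspection mentioned above, here is a rough sketch in C. It is not part of the patch set, and the function name is hypothetical. It only checks whether a function's first bytes match the classic x86-64 frame-pointer prologue (push %rbp; mov %rsp,%rbp), optionally preceded by an endbr64 landing pad. A miss does not prove the function lacks a frame pointer, since a compiler may schedule other instructions first.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if the bytes at `code` start with
 * "push %rbp; mov %rsp,%rbp". Heuristic only; both common encodings of
 * the mov are accepted.
 */
bool looks_like_fp_prologue(const unsigned char *code, size_t len)
{
    static const unsigned char endbr64[] = { 0xf3, 0x0f, 0x1e, 0xfa };
    static const unsigned char mov_a[] = { 0x48, 0x89, 0xe5 }; /* mov %rsp,%rbp */
    static const unsigned char mov_b[] = { 0x48, 0x8b, 0xec }; /* same, other encoding */

    /* Skip an optional CET landing pad at function entry. */
    if (len >= sizeof(endbr64) &&
        memcmp(code, endbr64, sizeof(endbr64)) == 0) {
        code += sizeof(endbr64);
        len -= sizeof(endbr64);
    }

    /* Need one byte for push %rbp (0x55) plus three for the mov. */
    if (len < 4 || code[0] != 0x55)
        return false;

    return memcmp(code + 1, mov_a, 3) == 0 ||
           memcmp(code + 1, mov_b, 3) == 0;
}
```

Running such a check over every function of every shared library is exactly the tedium described here, which is why a hardware-assisted approach is attractive.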

This is where LBR has an advantage: it works regardless of how
binaries and shared libs have been compiled. That is why this is
a good (or some would say better) approach which is using hardware
assist.

>> I think in general it would be better for tools to know which
>> low-level mechanism is used to better interpret the results and
>> especially be aware of the limitations of each mechanism.
>
> Agreed.
>
>> I think the patch is trying some auto-promotion of CALLCHAIN to FP
>> based on the belief it is better in most cases.
>
> We're all more familiar with FP, and it doesn't have the obvious problem
> if only 16 entries. I've worked on quite a bit of software that had much
> deeper callchains -- yay for recursive algorithms and/or C++.
>
Yes, this is true too. But it is not so clear to me if people really care about
top of callchains that much. I think usually 2-6 would probably yield enough
useful info.

LBR callstack fails for leaf function optimization. Where the callee does
not return to its caller but instead to the caller's caller. That is the one
case I know about. There are others I believe.

> With a bit of care FP can be 'perfect', although Andi likes to point out
> that glibc isn't and often wrecks FP :-(
>
Especially any hand-crafted assembly...

>> It reminds me of the discussion about precise mode. Why not default to
>> precise for all events that support it?
>
> I've no idea where that discussion stranded.
>
>> I would be okay if the patch was introducing the 3rd mode for callchains.
>
> Right, I would prefer that (as should be clear by now), this would allow
> running with two (or even all three) and compare results.

I don't think it would be very hard to modify the patch set to make that 3rd
mode visible. Just need to make that new PERF_RECORD_* type visible
to user and modify the compatibility checks.

2014-11-05 15:45:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Wed, Nov 05, 2014 at 02:22:07PM +0100, Stephane Eranian wrote:
> I tend to agree here. The problem with FP is that it is not easy to figure
> out how a binary has been compiled. Getting valid FP callchains for
> large binaries using lots of shared libraries is very challenging. All
> libraries must be compiled with FP. It is not easy to test if FP was
> compiled in. There is no ELF header flag for this. Need to inspect
> the x86 asm and look at function prologues.

build world ftw :-), I realize that on many distros this is hard, but in
some environments it's really rather easy.

But yes, it's tedious without the capability to build world.

> This is where LBR has an advantage, it works regardless of how
> a binaries and shared libs have been compiled. That is why this is
> a good (or some would say better) approach which is using hardware
> assist.

Right, but only because we made a mess of the thing in the first place
:-/

> > We're all more familiar with FP, and it doesn't have the obvious problem
> > if only 16 entries. I've worked on quite a bit of software that had much
> > deeper callchains -- yay for recursive algorithms and/or C++.
> >
> Yes, this is true too. But it is not so clear to me if people really care about
> top of callchains that much. I think usually 2-6 would probably yield enough
> useful info.

Right, with C++ if you have a particularly gruesome object hierarchy a
simple constructor can blow your entire 16 calls out the window, so when
you then get around to doing actual work there's nothing left.

But yes, that should not be too common I think.

> LBR callstack fails for leaf function optimization. Where the callee does
> not return to its caller but instead to the caller's caller. That is the one
> case I know about. There are others I believe.

Yeah, tail call and long jump might also confuse the thing, I can't
remember.

> > With a bit of care FP can be 'perfect', although Andi likes to point out
> > that glibc isn't and often wrecks FP :-(
> >
> Especially any hand-crafted assembly...

Well, it doesn't need to. But yes, it's easy to do wrong in that case.

> I don't think it would be very hard to modify the patch set to make that 3rd
> mode visible. Just need to make that new PERF_RECORD_* type visible
> to user and modify the compatibility checks.

There's no new RECORD type afaict; would not the relatively simple patch
I proposed be enough? It exposes PERF_SAMPLE_BRANCH_CALL_STACK and you'd
get the data through the normal PERF_SAMPLE_BRANCH_STACK output.
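
As a sketch of what the request described above could look like from user space, the following C fragment fills in a perf_event_attr. It assumes uapi headers that already define PERF_SAMPLE_BRANCH_CALL_STACK as proposed here. The helper name, the choice of a cycles event and the sample period are illustrative assumptions, not taken from the thread.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: prepare an event whose samples carry the
 * user-level LBR call stack in the regular branch-stack area.
 */
void init_lbr_callstack_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;   /* illustrative event */
    attr->sample_period = 100000;              /* illustrative period */

    /* The call stack arrives as PERF_SAMPLE_BRANCH_STACK data ... */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;

    /* ... selected by the new branch type, for user space only. */
    attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                               PERF_SAMPLE_BRANCH_CALL_STACK;
    attr->exclude_kernel = 1;
}
```

The filled-in structure would then be passed to perf_event_open(2). That call is left out because it needs suitable permissions and hardware with LBR call stack support.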

2014-11-05 15:55:03

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain


Thanks for your comments. There has been a lot of discussion about the patch.
It's hard to reply to each point one by one, so I'll try to address all the
concerns here.

The patchset doesn't try to introduce a 3rd independent callchain option.
That's because LBR callstack has some limitations (only available for user
callchain, only 16 entries, cannot collect branch info at the same time, etc).
So it's designed as a supplement/extension of the FP callchain option. It relies
on FP, but can provide the callstack info when FP isn't available, in the
cases which Stephane and Andi mentioned.

Since it's not an independent callchain option, I didn't provide an explicit
option for the user to enable it.
However, I provide an option in perf report to show the LBR userspace
callchain and the FP callchain. That's the main difference between Zheng's
previous patch and the latest patch.

Here is how it works.
When the user enables FP callchain on HSW, the kernel implicitly enables
both LBR callstack and FP.
Zheng's previous patch does everything in the kernel. If FP is not available,
then LBR callstack data will be used implicitly. If FP is available, then LBR
callstack data will be discarded.
The latest patch instead exposes both LBR callstack and FP data to the user tool.
A new option for perf report is introduced. The user can dump the callchain
from either lbr or fp if they are both available.
E.g.
perf report --call-graph fp (both userspace and kernel callchain from FP)
perf report --call-graph lbr (userspace callchain from LBR, kernel from FP)

>
> On Wed, Nov 5, 2014 at 1:49 PM, Peter Zijlstra <[email protected]>
> wrote:
> > On Wed, Nov 05, 2014 at 11:57:10AM +0100, Stephane Eranian wrote:
> >> Yes, but I wonder how would the tool sort this out if you have FP and
> >> LBR for each sample.
> >
> > That's the tools 'problem'. It currently can already have FP and Dwarf
> > bits. And it does not need to request all of them.
> >
> I was thinking about the case where the tool would request both FP and
> LBR at the same to try and construct a complete callstack. Not sure how the
> tool could do that.

Both LBR and FP data are pushed to the tool. The user can use the newly
introduced perf report call-graph option to choose how to construct the callstack.

>
> >> My understanding of the patch is that it does not change the user
> >> interface, it changes the way callchains are gathered by the kernel on
> HSW.
> >
> > I was under the impression it did change, but that shows how well the
> > Changelog explained things I suppose :/
> >
> With the current patches (or the latest version I looked at), there was no
> way to request explicitly LBR mode. It was automatic if CALLCHAIN + user
> mode only sampling.
>

Yes, currently there is no way to explicitly request LBR mode.

> >> Is there explicit mention in the API that CALLCHAIN is relying on FP?
> >
> > Don't think so. Although I would much prefer if it uses a single
> > method per arch across both kernel and user space. For x86 that is FP
> > (since that's the only method available to the kernel).
> >
> I tend to agree here. The problem with FP is that it is not easy to figure out
> how a binary has been compiled. Getting valid FP callchains for large
> binaries using lots of shared libraries is very challenging. All libraries must
> be compiled with FP. It is not easy to test if FP was compiled in. There is no
> ELF header flag for this. Need to inspect the x86 asm and look at function
> prologues.
>
> This is where LBR has an advantage, it works regardless of how a binaries
> and shared libs have been compiled. That is why this is a good (or some
> would say better) approach which is using hardware assist.
>

Agreed. LBR is a very good supplement.

> >> I think in general it would be better for tools to know which
> >> low-level mechanism is used to better interpret the results and
> >> especially be aware of the limitations of each mechanism.
> >
> > Agreed.
> >
> >> I think the patch is trying some auto-promotion of CALLCHAIN to FP
> >> based on the belief it is better in most cases.
> >
> > We're all more familiar with FP, and it doesn't have the obvious
> > problem if only 16 entries. I've worked on quite a bit of software
> > that had much deeper callchains -- yay for recursive algorithms and/or
> C++.
> >
> Yes, this is true too. But it is not so clear to me if people really care about
> top of callchains that much. I think usually 2-6 would probably yield enough
> useful info.
>
> LBR callstack fails for leaf function optimization. Where the callee does not
> return to its caller but instead to the caller's caller. That is the one case I
> know about. There are others I believe.
>
> > With a bit of care FP can be 'perfect', although Andi likes to point
> > out that glibc isn't and often wrecks FP :-(
> >
> Especially any hand-crafted assembly...
>
> >> It reminds me of the discussion about precise mode. Why not default
> >> to precise for all events that support it?
> >
> > I've no idea where that discussion stranded.
> >
> >> I would be okay if the patch was introducing the 3rd mode for callchains.
> >
> > Right, I would prefer that (as should be clear by now), this would
> > allow running with two (or even all three) and compare results.
>
> I don't think it would be very hard to modify the patch set to make that 3rd
> mode visible. Just need to make that new PERF_RECORD_* type visible to
> user and modify the compatibility checks.

It's not hard. But LBR is not an independent callchain option. It's better as
a supplement to FP. Otherwise, it may confuse the user: they enable
BRANCH_CALL_STACK, but the data is only partly, or even not at all, from hardware.


Thanks,
Kan

2014-11-05 16:23:06

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH V7 00/17] perf, x86: Haswell LBR call stack support


>>
>
> So if I take all except 11,13,16,17 but instead do something like the below,
> everything will work just fine, right?
>
> Or am I missing something?
>

Yes, it should work. Then LBR callstack will rely on the user to enable it.
But the user never gets the LBR callstack data even if it's available.
I'm not sure why you would do that?


> ---
> arch/x86/kernel/cpu/perf_event.h | 8 --------
> arch/x86/kernel/cpu/perf_event_intel_lbr.c | 8 ++++----
> include/uapi/linux/perf_event.h | 16 ++++++++--------
> 3 files changed, 12 insertions(+), 20 deletions(-)
>
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -521,14 +521,6 @@ struct x86_perf_task_context {
> int lbr_stack_state;
> };
>
> -enum {
> - PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
> - PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
> -
> - PERF_SAMPLE_BRANCH_CALL_STACK =
> - 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
> -};
> -
> #define x86_add_quirk(func_) \
> do { \
> static struct x86_pmu_quirk __quirk __initdata = { \
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -537,7 +537,7 @@ static int intel_pmu_setup_hw_lbr_filter
> u64 mask = 0, v;
> int i;
>
> - for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
> + for (i = 0; i < PERF_SAMPLE_BRANCH_MAX_SHIFT; i++) {
> if (!(br_type & (1ULL << i)))
> continue;
>
> @@ -808,7 +808,7 @@ intel_pmu_lbr_filter(struct cpu_hw_event
> /*
> * Map interface branch filters onto LBR filters
> */
> -static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
> +static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
> [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
> [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
> [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
> @@ -827,7 +827,7 @@ static const int nhm_lbr_sel_map[PERF_SA
> [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
> };
>
> -static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
> +static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
> [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
> [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
> [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
> @@ -839,7 +839,7 @@ static const int snb_lbr_sel_map[PERF_SA
> [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
> };
>
> -static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
> +static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
> [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
> [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
> [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -166,6 +166,8 @@ enum perf_branch_sample_type_shift {
> PERF_SAMPLE_BRANCH_NO_TX_SHIFT = 9, /* not in transaction */
> PERF_SAMPLE_BRANCH_COND_SHIFT = 10, /* conditional branches */
>
> + PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = 11, /* call/ret stack */
> +
> PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
> };
>
> @@ -175,18 +177,16 @@ enum perf_branch_sample_type {
> PERF_SAMPLE_BRANCH_HV = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
>
> PERF_SAMPLE_BRANCH_ANY = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
> - PERF_SAMPLE_BRANCH_ANY_CALL =
> - 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
> - PERF_SAMPLE_BRANCH_ANY_RETURN =
> - 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
> - PERF_SAMPLE_BRANCH_IND_CALL =
> - 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
> - PERF_SAMPLE_BRANCH_ABORT_TX =
> - 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
> + PERF_SAMPLE_BRANCH_ANY_CALL = 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
> + PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
> + PERF_SAMPLE_BRANCH_IND_CALL = 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
> + PERF_SAMPLE_BRANCH_ABORT_TX = 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
> PERF_SAMPLE_BRANCH_IN_TX = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
> PERF_SAMPLE_BRANCH_NO_TX = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
> PERF_SAMPLE_BRANCH_COND = 1U << PERF_SAMPLE_BRANCH_COND_SHIFT,
>
> + PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
> +
> PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
> };
>

2014-11-05 16:27:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V7 00/17] perf, x86: Haswell LBR call stack support

On Wed, Nov 05, 2014 at 04:22:09PM +0000, Liang, Kan wrote:
>
> >>
> >
> > So if I take all except 11,13,16,17 but instead do something like the below,
> > everything will work just fine, right?
> >
> > Or am I missing something?
> >
>
> Yes, it should work. Then LBR callstack will rely on user to enable it.
> But user never get the LBR callstack data if it's available.
> I'm not sure why you do that?

Uhm what? If you request PERF_SAMPLE_BRANCH_CALL_STACK the user will get
the data through the regular PERF_SAMPLE_BRANCH_STACK output of
PERF_RECORD_SAMPLE, right?

2014-11-05 16:29:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Wed, Nov 05, 2014 at 03:53:34PM +0000, Liang, Kan wrote:
> > I don't think it would be very hard to modify the patch set to make that 3rd
> > mode visible. Just need to make that new PERF_RECORD_* type visible to
> > user and modify the compatibility checks.
>
> It's not hard. But LBR is not an independent callchain options. It's better to be
> a supplement of FP. Otherwise, it may confuse the user. He enables the
> BRANCH_CALL_STACK, but the data is partly or even not at all from hardware.

What the user sees is up to userspace. It should not be forced by the
kernel/user interface.

2014-11-05 17:04:35

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH V7 00/17] perf, x86: Haswell LBR call stack support



> On Wed, Nov 05, 2014 at 04:22:09PM +0000, Liang, Kan wrote:
> >
> > >>
> > >
> > > So if I take all except 11,13,16,17 but instead do something like
> > > the below, everything will work just fine, right?
> > >
> > > Or am I missing something?
> > >
> >
> > Yes, it should work. Then LBR callstack will rely on user to enable it.
> > But user never get the LBR callstack data if it's available.
> > I'm not sure why you do that?
>
> Uhm what? If you request PERF_SAMPLE_BRANCH_CALL_STACK the user
> will get the data through the regular PERF_SAMPLE_BRANCH_STACK
> output of PERF_RECORD_SAMPLE, right?

OK, I think I understand what you mean.
From the kernel side, we provide a 3rd callchain option to the user.
The kernel only tries to enable LBR and prepare the data if possible.
The kernel doesn't guarantee that the data comes from hardware LBR,
even if the user chooses the LBR callchain option.

It's the user tool's responsibility to filter the request, choose the data
source and reconstruct the data. That is outside the scope of this kernel
patch and should be handled by a separate user tool patchset.

What do you think?

Thanks,
Kan

2014-11-05 17:40:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

> The reason why using LBR call stack mode is restricted to user level
> only is because of a bug in the LBR call stack hardware which forces
> the kernel to drop LBR_FREEZE_PMI.

It works with PEBS events, just not with non PEBS (the patch currently
does not implement this distinction and I'm not sure it's worth it)

> In other words, the LBR does not stop on counter overflow, thus it
> will be wiped out by execution of kernel code leading to the PMU
> interrupt handler.

It stops, but it takes some time to stop so the call-return stack
can get out of sync.

-Andi

--
[email protected] -- Speaking for myself only

2014-11-05 17:55:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

On Wed, Nov 05, 2014 at 05:29:32PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 05, 2014 at 03:53:34PM +0000, Liang, Kan wrote:
> > > I don't think it would be very hard to modify the patch set to make that 3rd
> > > mode visible. Just need to make that new PERF_RECORD_* type visible to
> > > user and modify the compatibility checks.
> >
> > It's not hard. But LBR is not an independent callchain options. It's better to be
> > a supplement of FP. Otherwise, it may confuse the user. He enables the
> > BRANCH_CALL_STACK, but the data is partly or even not at all from hardware.
>
> What the user sees is up to userspace. It should not be forced by the
> kernel/user interface.

The original idea was to abstract it inside the kernel. Unlike dwarf, the
LBR callstack is simple enough that it can be easily abstracted. If you don't
want to do that, then yes, handling it in the user tools is the right way.

-Andi
--
[email protected] -- Speaking for myself only

2014-11-05 17:57:16

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH V7 13/17] perf, x86: enable LBR callstack when recording callchain

> LBR callstack fails for leaf function optimization. Where the callee does
> not return to its caller but instead to the caller's caller. That is the one
> case I know about. There are others I believe.

No it should work fine for this case. You just don't see the tail call,
but the call stack does not get out of sync, so future calls work fine.
It's the same as with frame pointers or even dwarf.

Typical cases that throw it off are throwing exceptions or user space
context switching.

One somewhat common case that used to throw it off was the
call 1f ; 1: pop %ebx
sequence that older 32-bit binaries used to get the PIC offset. That has
been fixed in newer compilers by using an out-of-line function
(and Broadwell also has a workaround for this).
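
To make the failure mode concrete, here is a toy model in C of a call-stack-mode LBR. Calls push a record and returns pop one. All names are hypothetical. The depth of 16 matches the figure discussed in this thread, and dropping the oldest record on overflow is a modelling assumption rather than a description of the hardware.

```c
#include <stddef.h>

#define TOY_LBR_DEPTH 16   /* number of LBR entries discussed in the thread */

struct toy_lbr {
    unsigned long from[TOY_LBR_DEPTH]; /* call-site addresses, oldest first */
    size_t depth;                      /* number of valid records */
};

/* A call pushes a record; when full, the oldest record is dropped. */
void toy_lbr_call(struct toy_lbr *l, unsigned long from)
{
    if (l->depth == TOY_LBR_DEPTH) {
        for (size_t i = 1; i < TOY_LBR_DEPTH; i++)
            l->from[i - 1] = l->from[i];
        l->depth--;
    }
    l->from[l->depth++] = from;
}

/* A return pops the most recent record, if there is one. */
void toy_lbr_ret(struct toy_lbr *l)
{
    if (l->depth > 0)
        l->depth--;
}

/*
 * Replays: main calls f; f executes "call 1f; 1: pop %ebx"; f calls g.
 * Returns the number of records held while g is running.
 */
size_t toy_pic_scenario(void)
{
    struct toy_lbr l = { .depth = 0 };

    toy_lbr_call(&l, 0x1000); /* main -> f */
    toy_lbr_call(&l, 0x2000); /* call 1f: never matched by a ret */
    toy_lbr_call(&l, 0x2010); /* f -> g */
    return l.depth;           /* three records for a real depth of two */
}
```

In this model the unmatched call leaves a stale record behind, so every later return pops the wrong entry. That is the out-of-sync condition described above.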

-Andi

--
[email protected] -- Speaking for myself only

Subject: [tip:perf/core] perf/x86/intel: Reduce lbr_sel_map[] size

Commit-ID: 27ac905b8f88d28779b0661809286b5ba2817d37
Gitweb: http://git.kernel.org/tip/27ac905b8f88d28779b0661809286b5ba2817d37
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:55:57 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:01 +0100

perf/x86/intel: Reduce lbr_sel_map[] size

The index of lbr_sel_map is bit value of perf branch_sample_type.
PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map uses
4096 bytes. By using bit shift as index, we can reduce lbr_sel_map
size to 40 bytes. This patch defines 'bit shift' for branch types,
and use 'bit shift' to define lbr_sel_maps.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Stephane Eranian <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
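
The size argument in the changelog can be checked with a small stand-alone C sketch. The DEMO_ names and filter values below are placeholders, not the kernel's LBR_* definitions. Only the shift positions mirror the enum in the diff, where PERF_SAMPLE_BRANCH_MAX_SHIFT works out to 11. The changelog's 4096 and 40 bytes appear to correspond to ten branch types. With the eleven shown in the diff and a 4-byte int, the same arithmetic gives 8192 and 44 bytes.

```c
#include <stddef.h>
#include <stdint.h>

enum {                         /* shift positions as in the diff */
    DEMO_USER_SHIFT     = 0,
    DEMO_KERNEL_SHIFT   = 1,
    DEMO_HV_SHIFT       = 2,
    DEMO_ANY_SHIFT      = 3,
    DEMO_ANY_CALL_SHIFT = 4,
    DEMO_MAX_SHIFT      = 11   /* one past COND_SHIFT = 10 */
};
#define DEMO_MAX (1U << DEMO_MAX_SHIFT)

#define DEMO_NOT_SUPP (-1)     /* placeholder: filter not supported */
#define DEMO_IGN      0        /* placeholder: filter ignored */

/* Table indexed by bit position; unnamed slots default to DEMO_IGN. */
static const int demo_sel_map[DEMO_MAX_SHIFT] = {
    [DEMO_USER_SHIFT]     = 0x10,
    [DEMO_KERNEL_SHIFT]   = 0x20,
    [DEMO_HV_SHIFT]       = DEMO_IGN,
    [DEMO_ANY_SHIFT]      = 0x40,
    [DEMO_ANY_CALL_SHIFT] = DEMO_NOT_SUPP,
};

/* Size of a table indexed by the bit value itself (the old layout). */
size_t demo_bytes_indexed_by_value(void) { return DEMO_MAX * sizeof(int); }

/* Size of the table indexed by bit position (the new layout). */
size_t demo_bytes_indexed_by_shift(void) { return sizeof(demo_sel_map); }

/* Walk the set bits of br_type, as the reworked kernel loop does. */
int demo_build_mask(uint64_t br_type, uint64_t *mask)
{
    uint64_t m = 0;

    for (int i = 0; i < DEMO_MAX_SHIFT; i++) {
        if (!(br_type & (1ULL << i)))
            continue;
        int v = demo_sel_map[i];
        if (v == DEMO_NOT_SUPP)
            return -1;         /* stands in for -EOPNOTSUPP */
        if (v != DEMO_IGN)
            m |= (uint64_t)v;
    }
    *mask = m;
    return 0;
}
```
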
arch/x86/kernel/cpu/perf_event.h | 4 +++
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 54 ++++++++++++++----------------
include/uapi/linux/perf_event.h | 49 +++++++++++++++++++--------
3 files changed, 64 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index df525d2..0c45b22 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -515,6 +515,10 @@ struct x86_pmu {
struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
};

+enum {
+ PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+};
+
#define x86_add_quirk(func_) \
do { \
static struct x86_pmu_quirk __quirk __initdata = { \
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 58f1a94..8bc078f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -69,10 +69,6 @@ static enum {
#define LBR_FROM_FLAG_IN_TX (1ULL << 62)
#define LBR_FROM_FLAG_ABORT (1ULL << 61)

-#define for_each_branch_sample_type(x) \
- for ((x) = PERF_SAMPLE_BRANCH_USER; \
- (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
-
/*
* x86control flow change classification
* x86control flow changes include branches, interrupts, traps, faults
@@ -403,14 +399,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
{
struct hw_perf_event_extra *reg;
u64 br_type = event->attr.branch_sample_type;
- u64 mask = 0, m;
- u64 v;
+ u64 mask = 0, v;
+ int i;

- for_each_branch_sample_type(m) {
- if (!(br_type & m))
+ for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+ if (!(br_type & (1ULL << i)))
continue;

- v = x86_pmu.lbr_sel_map[m];
+ v = x86_pmu.lbr_sel_map[i];
if (v == LBR_NOT_SUPP)
return -EOPNOTSUPP;

@@ -678,35 +674,35 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
/*
* Map interface branch filters onto LBR filters
*/
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
- [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY,
- [PERF_SAMPLE_BRANCH_USER] = LBR_USER,
- [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL,
- [PERF_SAMPLE_BRANCH_HV] = LBR_IGN,
- [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP
- | LBR_IND_JMP | LBR_FAR,
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_REL_JMP
+ | LBR_IND_JMP | LBR_FAR,
/*
* NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
*/
- [PERF_SAMPLE_BRANCH_ANY_CALL] =
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
/*
* NHM/WSM erratum: must include IND_JMP to capture IND_CALL
*/
- [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
- [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
- [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY,
- [PERF_SAMPLE_BRANCH_USER] = LBR_USER,
- [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL,
- [PERF_SAMPLE_BRANCH_HV] = LBR_IGN,
- [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
- [PERF_SAMPLE_BRANCH_ANY_CALL] = LBR_REL_CALL | LBR_IND_CALL
- | LBR_FAR,
- [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL,
- [PERF_SAMPLE_BRANCH_COND] = LBR_JCC,
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

/* core */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9b79abb..e46b932 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -152,21 +152,42 @@ enum perf_event_sample_format {
* The branch types can be combined, however BRANCH_ANY covers all types
* of branches and therefore it supersedes all the other types.
*/
+enum perf_branch_sample_type_shift {
+ PERF_SAMPLE_BRANCH_USER_SHIFT = 0, /* user branches */
+ PERF_SAMPLE_BRANCH_KERNEL_SHIFT = 1, /* kernel branches */
+ PERF_SAMPLE_BRANCH_HV_SHIFT = 2, /* hypervisor branches */
+
+ PERF_SAMPLE_BRANCH_ANY_SHIFT = 3, /* any branch types */
+ PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT = 4, /* any call branch */
+ PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT = 5, /* any return branch */
+ PERF_SAMPLE_BRANCH_IND_CALL_SHIFT = 6, /* indirect calls */
+ PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT = 7, /* transaction aborts */
+ PERF_SAMPLE_BRANCH_IN_TX_SHIFT = 8, /* in transaction */
+ PERF_SAMPLE_BRANCH_NO_TX_SHIFT = 9, /* not in transaction */
+ PERF_SAMPLE_BRANCH_COND_SHIFT = 10, /* conditional branches */
+
+ PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
+};
+
enum perf_branch_sample_type {
- PERF_SAMPLE_BRANCH_USER = 1U << 0, /* user branches */
- PERF_SAMPLE_BRANCH_KERNEL = 1U << 1, /* kernel branches */
- PERF_SAMPLE_BRANCH_HV = 1U << 2, /* hypervisor branches */
-
- PERF_SAMPLE_BRANCH_ANY = 1U << 3, /* any branch types */
- PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */
- PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << 5, /* any return branch */
- PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */
- PERF_SAMPLE_BRANCH_ABORT_TX = 1U << 7, /* transaction aborts */
- PERF_SAMPLE_BRANCH_IN_TX = 1U << 8, /* in transaction */
- PERF_SAMPLE_BRANCH_NO_TX = 1U << 9, /* not in transaction */
- PERF_SAMPLE_BRANCH_COND = 1U << 10, /* conditional branches */
-
- PERF_SAMPLE_BRANCH_MAX = 1U << 11, /* non-ABI */
+ PERF_SAMPLE_BRANCH_USER = 1U << PERF_SAMPLE_BRANCH_USER_SHIFT,
+ PERF_SAMPLE_BRANCH_KERNEL = 1U << PERF_SAMPLE_BRANCH_KERNEL_SHIFT,
+ PERF_SAMPLE_BRANCH_HV = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
+
+ PERF_SAMPLE_BRANCH_ANY = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_CALL =
+ 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_RETURN =
+ 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+ PERF_SAMPLE_BRANCH_IND_CALL =
+ 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ABORT_TX =
+ 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_IN_TX = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_NO_TX = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_COND = 1U << PERF_SAMPLE_BRANCH_COND_SHIFT,
+
+ PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};

#define PERF_SAMPLE_BRANCH_PLM_ALL \
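
The split into a shift enum and a mask enum lets the LBR select maps be indexed by bit position, so a table needs one slot per branch type instead of one slot per possible mask value. A minimal user-space sketch of that lookup follows; the `demo_*` names and the `DEMO_HW_*` bit values are made up for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* One shift per branch type, mirroring perf_branch_sample_type_shift. */
enum demo_branch_shift {
	DEMO_BRANCH_USER_SHIFT   = 0,
	DEMO_BRANCH_KERNEL_SHIFT = 1,
	DEMO_BRANCH_COND_SHIFT   = 10,
	DEMO_BRANCH_MAX_SHIFT            /* 11: number of table slots */
};

/* Masks are derived from the shifts, as in the patched enum. */
#define DEMO_BRANCH_USER   (1U << DEMO_BRANCH_USER_SHIFT)
#define DEMO_BRANCH_KERNEL (1U << DEMO_BRANCH_KERNEL_SHIFT)
#define DEMO_BRANCH_COND   (1U << DEMO_BRANCH_COND_SHIFT)

/* Hypothetical hardware filter bits, chosen only for this sketch. */
#define DEMO_HW_KERNEL 0x1U
#define DEMO_HW_USER   0x2U
#define DEMO_HW_JCC    0x4U

/* Table indexed by shift: 11 slots instead of 1 << 11. */
static const uint32_t demo_sel_map[DEMO_BRANCH_MAX_SHIFT] = {
	[DEMO_BRANCH_USER_SHIFT]   = DEMO_HW_USER,
	[DEMO_BRANCH_KERNEL_SHIFT] = DEMO_HW_KERNEL,
	[DEMO_BRANCH_COND_SHIFT]   = DEMO_HW_JCC,
};

/* OR together the hardware bits for every branch type set in sample_type. */
static uint32_t demo_translate(uint32_t sample_type)
{
	uint32_t hw = 0;
	int shift;

	for (shift = 0; shift < DEMO_BRANCH_MAX_SHIFT; shift++) {
		if (sample_type & (1U << shift))
			hw |= demo_sel_map[shift];
	}
	return hw;
}
```

Calling `demo_translate(DEMO_BRANCH_USER | DEMO_BRANCH_COND)` walks the set bits and combines the two matching table entries.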

Subject: [tip:perf/core] perf: Introduce pmu context switch callback

Commit-ID: ba532500c5651a4be4108acc64ed99a95cb005b3
Gitweb: http://git.kernel.org/tip/ba532500c5651a4be4108acc64ed99a95cb005b3
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:55:58 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:02 +0100

perf: Introduce pmu context switch callback

The callback is invoked when a process is scheduled in or out.
It provides a mechanism for later patches to save/restore the LBR
stack. In the schedule-in case, the callback is invoked at the
same place as the flush branch stack callback, so it can also
replace that callback. To avoid unnecessary overhead, the callback
is enabled only when there are events that use the LBR stack.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 +++++
arch/x86/kernel/cpu/perf_event.h | 2 ++
include/linux/perf_event.h | 9 +++++++
kernel/events/core.c | 57 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 75 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index b71a7f8..0efbd6c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1914,6 +1914,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
NULL,
};

+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (x86_pmu.sched_task)
+ x86_pmu.sched_task(ctx, sched_in);
+}
+
static void x86_pmu_flush_branch_stack(void)
{
if (x86_pmu.flush_branch_stack)
@@ -1950,6 +1956,7 @@ static struct pmu pmu = {

.event_idx = x86_pmu_event_idx,
.flush_branch_stack = x86_pmu_flush_branch_stack,
+ .sched_task = x86_pmu_sched_task,
};

void arch_perf_update_userpage(struct perf_event *event,
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 0c45b22..211b54c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -473,6 +473,8 @@ struct x86_pmu {

void (*check_microcode)(void);
void (*flush_branch_stack)(void);
+ void (*sched_task)(struct perf_event_context *ctx,
+ bool sched_in);

/*
* Intel Arch Perfmon v2+
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3326200..fbab623 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -265,6 +265,13 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * context-switches callback
+ */
+ void (*sched_task) (struct perf_event_context *ctx,
+ bool sched_in);
+
};

/**
@@ -558,6 +565,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
extern void perf_event_print_debug(void);
extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_dec(struct pmu *pmu);
+extern void perf_sched_cb_inc(struct pmu *pmu);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fef45b4..6c8b31b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -154,6 +154,7 @@ enum event_type_t {
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);

static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
@@ -2577,6 +2578,56 @@ unlock:
}
}

+void perf_sched_cb_dec(struct pmu *pmu)
+{
+ this_cpu_dec(perf_sched_cb_usages);
+}
+
+void perf_sched_cb_inc(struct pmu *pmu)
+{
+ this_cpu_inc(perf_sched_cb_usages);
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+ struct task_struct *next,
+ bool sched_in)
+{
+ struct perf_cpu_context *cpuctx;
+ struct pmu *pmu;
+ unsigned long flags;
+
+ if (prev == next)
+ return;
+
+ local_irq_save(flags);
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(pmu, &pmus, entry) {
+ if (pmu->sched_task) {
+ cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ perf_pmu_disable(pmu);
+
+ pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+ perf_pmu_enable(pmu);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ }
+ }
+
+ rcu_read_unlock();
+
+ local_irq_restore(flags);
+}
+
#define for_each_task_context_nr(ctxn) \
for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)

@@ -2596,6 +2647,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
{
int ctxn;

+ if (__this_cpu_read(perf_sched_cb_usages))
+ perf_pmu_sched_task(task, next, false);
+
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);

@@ -2847,6 +2901,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
/* check for system-wide branch_stack events */
if (atomic_read(this_cpu_ptr(&perf_branch_stack_events)))
perf_branch_stack_sched_in(prev, task);
+
+ if (__this_cpu_read(perf_sched_cb_usages))
+ perf_pmu_sched_task(prev, task, true);
}

static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
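
The gating scheme in this patch (a usage counter that decides whether the context switch path calls into the PMU at all) can be sketched in plain C. This is a single-threaded illustration only: `demo_sched_cb_usages` stands in for the per-CPU counter, and the locking, RCU and PMU disable/enable steps of the real code are left out. All `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static int demo_sched_cb_usages;  /* stands in for the per-CPU counter */
static int demo_callback_calls;   /* how often the PMU callback ran */

static void demo_sched_cb_inc(void) { demo_sched_cb_usages++; }
static void demo_sched_cb_dec(void) { demo_sched_cb_usages--; }

/* Stand-in for pmu->sched_task(); it only counts invocations. */
static void demo_sched_task(bool sched_in)
{
	(void)sched_in;
	demo_callback_calls++;
}

/* Called on every (simulated) context switch, for both directions. */
static void demo_task_switch(int prev, int next, bool sched_in)
{
	if (!demo_sched_cb_usages)  /* no LBR users: skip the slow path */
		return;
	if (prev == next)           /* same task: nothing to save or flush */
		return;
	demo_sched_task(sched_in);
}
```

With the counter at zero a switch costs one integer test; after `demo_sched_cb_inc()` each switch between distinct tasks reaches the callback once per direction.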

Subject: [tip:perf/core] perf/x86/intel: Use context switch callback to flush LBR stack

Commit-ID: 2a0ad3b326a9024ba86dca4028499d31fa0c6c4d
Gitweb: http://git.kernel.org/tip/2a0ad3b326a9024ba86dca4028499d31fa0c6c4d
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:55:59 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:03 +0100

perf/x86/intel: Use context switch callback to flush LBR stack

The previous commit introduced the context switch callback, whose
function overlaps with the flush branch stack callback, so the
context switch callback can be used to flush the LBR stack.

This patch adds code that uses the context switch callback to
flush the LBR stack when a task is being scheduled in. The callback
is enabled only when there are events that use the LBR hardware.
This patch also removes all the old flush branch stack code.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 ---
arch/x86/kernel/cpu/perf_event.h | 3 +-
arch/x86/kernel/cpu/perf_event_intel.c | 14 +-----
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 27 +++++++++++
include/linux/perf_event.h | 1 -
kernel/events/core.c | 77 ------------------------------
6 files changed, 30 insertions(+), 99 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 0efbd6c..6b1fd26 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1920,12 +1920,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
x86_pmu.sched_task(ctx, sched_in);
}

-static void x86_pmu_flush_branch_stack(void)
-{
- if (x86_pmu.flush_branch_stack)
- x86_pmu.flush_branch_stack();
-}
-
void perf_check_microcode(void)
{
if (x86_pmu.check_microcode)
@@ -1955,7 +1949,6 @@ static struct pmu pmu = {
.commit_txn = x86_pmu_commit_txn,

.event_idx = x86_pmu_event_idx,
- .flush_branch_stack = x86_pmu_flush_branch_stack,
.sched_task = x86_pmu_sched_task,
};

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 211b54c..949d008 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -472,7 +472,6 @@ struct x86_pmu {
void (*cpu_dead)(int cpu);

void (*check_microcode)(void);
- void (*flush_branch_stack)(void);
void (*sched_task)(struct perf_event_context *ctx,
bool sched_in);

@@ -733,6 +732,8 @@ void intel_pmu_pebs_disable_all(void);

void intel_ds_init(void);

+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_pmu_lbr_reset(void);

void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 498b6d9..424fbf7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2044,18 +2044,6 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

-static void intel_pmu_flush_branch_stack(void)
-{
- /*
- * Intel LBR does not tag entries with the
- * PID of the current task, then we need to
- * flush it on ctxsw
- * For now, we simply reset it
- */
- if (x86_pmu.lbr_nr)
- intel_pmu_lbr_reset();
-}
-
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2107,7 +2095,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .flush_branch_stack = intel_pmu_flush_branch_stack,
+ .sched_task = intel_pmu_lbr_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 8bc078f..c0e23c5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -177,6 +177,31 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}

+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ if (!x86_pmu.lbr_nr)
+ return;
+
+ /*
+ * When sampling the branck stack in system-wide, it may be
+ * necessary to flush the stack on context switch. This happens
+ * when the branch stack does not tag its entries with the pid
+ * of the current task. Otherwise it becomes impossible to
+ * associate a branch entry with a task. This ambiguity is more
+ * likely to appear when the branch stack supports priv level
+ * filtering and the user sets it to monitor only at the user
+ * level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch
+ * stack with branch from multiple tasks.
+ */
+ if (sched_in) {
+ intel_pmu_lbr_reset();
+ cpuc->lbr_context = ctx;
+ }
+}
+
void intel_pmu_lbr_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -195,6 +220,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
cpuc->br_sel = event->hw.branch_reg.reg;

cpuc->lbr_users++;
+ perf_sched_cb_inc(event->ctx->pmu);
}

void intel_pmu_lbr_disable(struct perf_event *event)
@@ -206,6 +232,7 @@ void intel_pmu_lbr_disable(struct perf_event *event)

cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
+ perf_sched_cb_dec(event->ctx->pmu);

if (cpuc->enabled && !cpuc->lbr_users) {
__intel_pmu_lbr_disable();
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fbab623..c7007a5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -511,7 +511,6 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
- int nr_branch_stack; /* branch_stack evt */
struct rcu_head rcu_head;

struct delayed_work orphans_remove;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6c8b31b..f563ce7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -153,7 +153,6 @@ enum event_type_t {
*/
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
static DEFINE_PER_CPU(int, perf_sched_cb_usages);

static atomic_t nr_mmap_events __read_mostly;
@@ -1240,9 +1239,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
if (is_cgroup_event(event))
ctx->nr_cgroups++;

- if (has_branch_stack(event))
- ctx->nr_branch_stack++;
-
list_add_rcu(&event->event_entry, &ctx->event_list);
ctx->nr_events++;
if (event->attr.inherit_stat)
@@ -1409,9 +1405,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
cpuctx->cgrp = NULL;
}

- if (has_branch_stack(event))
- ctx->nr_branch_stack--;
-
ctx->nr_events--;
if (event->attr.inherit_stat)
ctx->nr_stat--;
@@ -2809,64 +2802,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
}

/*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
- struct perf_cpu_context *cpuctx;
- struct pmu *pmu;
- unsigned long flags;
-
- /* no need to flush branch stack if not changing task */
- if (prev == task)
- return;
-
- local_irq_save(flags);
-
- rcu_read_lock();
-
- list_for_each_entry_rcu(pmu, &pmus, entry) {
- cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
- /*
- * check if the context has at least one
- * event using PERF_SAMPLE_BRANCH_STACK
- */
- if (cpuctx->ctx.nr_branch_stack > 0
- && pmu->flush_branch_stack) {
-
- perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
- perf_pmu_disable(pmu);
-
- pmu->flush_branch_stack();
-
- perf_pmu_enable(pmu);
-
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- }
- }
-
- rcu_read_unlock();
-
- local_irq_restore(flags);
-}
-
-/*
* Called from scheduler to add the events of the current task
* with interrupts disabled.
*
@@ -2898,10 +2833,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
perf_cgroup_sched_in(prev, task);

- /* check for system-wide branch_stack events */
- if (atomic_read(this_cpu_ptr(&perf_branch_stack_events)))
- perf_branch_stack_sched_in(prev, task);
-
if (__this_cpu_read(perf_sched_cb_usages))
perf_pmu_sched_task(prev, task, true);
}
@@ -3480,10 +3411,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
}
@@ -7139,10 +7066,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_inc(&per_cpu(perf_cgroup_events, cpu));
}
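
Why the flush happens on schedule-in can be seen in a small simulation: the branch records carry no task identifier, so anything left over from the previous task would be attributed to the next one. The sketch below assumes a made-up 4-entry buffer and hypothetical `demo_*` names; it is not the kernel's `intel_pmu_lbr_sched_task()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_LBR_NR 4   /* hypothetical depth for the sketch */

struct demo_lbr {
	int nr;                          /* number of valid entries */
	unsigned long from[DEMO_LBR_NR]; /* branch source addresses */
	const void *context;             /* context the entries belong to */
};

/* Record a branch; note that the entry carries no task id. */
static void demo_lbr_record(struct demo_lbr *lbr, unsigned long from)
{
	if (lbr->nr < DEMO_LBR_NR)
		lbr->from[lbr->nr++] = from;
}

/* On schedule-in, drop stale entries and remember the new owner. */
static void demo_lbr_sched_task(struct demo_lbr *lbr, const void *ctx,
				bool sched_in)
{
	if (!sched_in)
		return;
	lbr->nr = 0;
	lbr->context = ctx;
}
```

Scheduling out leaves the buffer untouched; only the next schedule-in clears it and rebinds it to the incoming context.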

Subject: [tip:perf/core] perf/x86/intel: Add basic Haswell LBR call stack support

Commit-ID: e9d7f7cd97c45e2c612d3b38be05b4cfb27939ee
Gitweb: http://git.kernel.org/tip/e9d7f7cd97c45e2c612d3b38be05b4cfb27939ee
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:00 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:04 +0100

perf/x86/intel: Add basic Haswell LBR call stack support

Haswell has a new feature that utilizes the existing LBR facility to
record call chains. To enable this feature, bits (JCC, NEAR_IND_JMP,
NEAR_REL_JMP, FAR_BRANCH, EN_CALLSTACK) in LBR_SELECT must be set to 1,
and bits (NEAR_REL_CALL, NEAR_IND_CALL, NEAR_RET) must be cleared. Due
to a hardware bug in Haswell, this feature doesn't work well with
FREEZE_LBRS_ON_PMI.

When the call stack feature is enabled, the LBR stack will capture
unfiltered call data normally, but as return instructions are executed,
the last captured branch record is flushed from the on-chip registers
in a last-in first-out (LIFO) manner. Thus, branch information relating
to leaf functions will not be captured, while the call stack
information of the main line execution path is preserved.

This patch defines a separate lbr_sel map for Haswell. The map contains
a new entry for the call stack feature.
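
The LIFO behaviour described above can be modelled as a bounded stack: a call pushes a record, a return pops the most recent one. The following is a rough sketch under stated assumptions, not a description of the hardware: the `demo_*` names are hypothetical, and dropping the oldest record when all 16 slots are full is an assumption made for the model.

```c
#include <assert.h>
#include <string.h>

#define DEMO_LBR_DEPTH 16

struct demo_callstack {
	int depth;                          /* live records */
	unsigned long from[DEMO_LBR_DEPTH]; /* call-site addresses */
};

/* A call pushes one record. */
static void demo_on_call(struct demo_callstack *cs, unsigned long from)
{
	if (cs->depth == DEMO_LBR_DEPTH) {
		/* Assumption for this model: the oldest record is lost. */
		memmove(&cs->from[0], &cs->from[1],
			(DEMO_LBR_DEPTH - 1) * sizeof(cs->from[0]));
		cs->depth--;
	}
	cs->from[cs->depth++] = from;
}

/* A return pops the most recently captured record. */
static void demo_on_return(struct demo_callstack *cs)
{
	if (cs->depth > 0)
		cs->depth--;
}
```

If `main` calls `a`, `a` calls a leaf `b` that returns, and `a` then calls `c`, the model holds only the `main -> a` and `a -> c` records; the completed leaf call has been popped.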

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 14 ++++-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 91 ++++++++++++++++++++++--------
3 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 949d008..c9a62c5 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -517,7 +517,11 @@ struct x86_pmu {
};

enum {
- PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+ PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+ PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
+
+ PERF_SAMPLE_BRANCH_CALL_STACK =
+ 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
};

#define x86_add_quirk(func_) \
@@ -551,6 +555,12 @@ static struct perf_pmu_events_attr event_attr_##v = { \

extern struct x86_pmu x86_pmu __read_mostly;

+static inline bool x86_pmu_has_lbr_callstack(void)
+{
+ return x86_pmu.lbr_sel_map &&
+ x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+}
+
DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);

int x86_perf_event_set_period(struct perf_event *event);
@@ -754,6 +764,8 @@ void intel_pmu_lbr_init_atom(void);

void intel_pmu_lbr_init_snb(void);

+void intel_pmu_lbr_init_hsw(void);
+
int intel_pmu_setup_lbr_filter(struct perf_event *event);

int p4_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 424fbf7..a485ba1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2537,7 +2537,7 @@ __init int intel_pmu_init(void)
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));

- intel_pmu_lbr_init_snb();
+ intel_pmu_lbr_init_hsw();

x86_pmu.event_constraints = intel_hsw_event_constraints;
x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index c0e23c5..ac8279e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -39,6 +39,7 @@ static enum {
#define LBR_IND_JMP_BIT 6 /* do not capture indirect jumps */
#define LBR_REL_JMP_BIT 7 /* do not capture relative jumps */
#define LBR_FAR_BIT 8 /* do not capture far branches */
+#define LBR_CALL_STACK_BIT 9 /* enable call stack */

#define LBR_KERNEL (1 << LBR_KERNEL_BIT)
#define LBR_USER (1 << LBR_USER_BIT)
@@ -49,6 +50,7 @@ static enum {
#define LBR_REL_JMP (1 << LBR_REL_JMP_BIT)
#define LBR_IND_JMP (1 << LBR_IND_JMP_BIT)
#define LBR_FAR (1 << LBR_FAR_BIT)
+#define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT)

#define LBR_PLM (LBR_KERNEL | LBR_USER)

@@ -74,24 +76,25 @@ static enum {
* x86control flow changes include branches, interrupts, traps, faults
*/
enum {
- X86_BR_NONE = 0, /* unknown */
-
- X86_BR_USER = 1 << 0, /* branch target is user */
- X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
-
- X86_BR_CALL = 1 << 2, /* call */
- X86_BR_RET = 1 << 3, /* return */
- X86_BR_SYSCALL = 1 << 4, /* syscall */
- X86_BR_SYSRET = 1 << 5, /* syscall return */
- X86_BR_INT = 1 << 6, /* sw interrupt */
- X86_BR_IRET = 1 << 7, /* return from interrupt */
- X86_BR_JCC = 1 << 8, /* conditional */
- X86_BR_JMP = 1 << 9, /* jump */
- X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
- X86_BR_IND_CALL = 1 << 11,/* indirect calls */
- X86_BR_ABORT = 1 << 12,/* transaction abort */
- X86_BR_IN_TX = 1 << 13,/* in transaction */
- X86_BR_NO_TX = 1 << 14,/* not in transaction */
+ X86_BR_NONE = 0, /* unknown */
+
+ X86_BR_USER = 1 << 0, /* branch target is user */
+ X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
+
+ X86_BR_CALL = 1 << 2, /* call */
+ X86_BR_RET = 1 << 3, /* return */
+ X86_BR_SYSCALL = 1 << 4, /* syscall */
+ X86_BR_SYSRET = 1 << 5, /* syscall return */
+ X86_BR_INT = 1 << 6, /* sw interrupt */
+ X86_BR_IRET = 1 << 7, /* return from interrupt */
+ X86_BR_JCC = 1 << 8, /* conditional */
+ X86_BR_JMP = 1 << 9, /* jump */
+ X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
+ X86_BR_IND_CALL = 1 << 11,/* indirect calls */
+ X86_BR_ABORT = 1 << 12,/* transaction abort */
+ X86_BR_IN_TX = 1 << 13,/* in transaction */
+ X86_BR_NO_TX = 1 << 14,/* not in transaction */
+ X86_BR_CALL_STACK = 1 << 15,/* call stack */
};

#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -373,7 +376,7 @@ void intel_pmu_lbr_read(void)
* - in case there is no HW filter
* - in case the HW filter has errata or limitations
*/
-static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
{
u64 br_type = event->attr.branch_sample_type;
int mask = 0;
@@ -410,11 +413,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
if (br_type & PERF_SAMPLE_BRANCH_COND)
mask |= X86_BR_JCC;

+ if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
+ if (!x86_pmu_has_lbr_callstack())
+ return -EOPNOTSUPP;
+ if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
+ return -EINVAL;
+ mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
+ X86_BR_CALL_STACK;
+ }
+
/*
* stash actual user request into reg, it may
* be used by fixup code for some CPU
*/
event->hw.branch_reg.reg = mask;
+ return 0;
}

/*
@@ -443,8 +456,12 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
reg = &event->hw.branch_reg;
reg->idx = EXTRA_REG_LBR;

- /* LBR_SELECT operates in suppress mode so invert mask */
- reg->config = ~mask & x86_pmu.lbr_sel_mask;
+ /*
+ * The first 9 bits (LBR_SEL_MASK) in LBR_SELECT operate
+ * in suppress mode. So LBR_SELECT should be set to
+ * (~mask & LBR_SEL_MASK) | (mask & ~LBR_SEL_MASK)
+ */
+ reg->config = mask ^ x86_pmu.lbr_sel_mask;

return 0;
}
@@ -462,7 +479,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
/*
* setup SW LBR filter
*/
- intel_pmu_setup_sw_lbr_filter(event);
+ ret = intel_pmu_setup_sw_lbr_filter(event);
+ if (ret)
+ return ret;

/*
* setup HW LBR filter, if any
@@ -732,6 +751,20 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
[PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
+ [PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
+ [PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_RETURN | LBR_CALL_STACK,
+};
+
/* core */
void __init intel_pmu_lbr_init_core(void)
{
@@ -788,6 +821,20 @@ void __init intel_pmu_lbr_init_snb(void)
pr_cont("16-deep LBR, ");
}

+/* haswell */
+void intel_pmu_lbr_init_hsw(void)
+{
+ x86_pmu.lbr_nr = 16;
+ x86_pmu.lbr_tos = MSR_LBR_TOS;
+ x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+ x86_pmu.lbr_to = MSR_LBR_NHM_TO;
+
+ x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+ x86_pmu.lbr_sel_map = hsw_lbr_sel_map;
+
+ pr_cont("16-deep LBR, ");
+}
+
/* atom */
void __init intel_pmu_lbr_init_atom(void)
{
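
The change from `~mask & lbr_sel_mask` to `mask ^ lbr_sel_mask` in the hardware filter setup relies on a bit identity: XOR with the select mask inverts the low nine suppress-mode bits and passes the higher bits (such as the call stack enable bit) through unchanged. A small check of that identity follows. The bit positions for USER, REL_CALL, IND_CALL and RETURN are assumed here because the excerpt does not show them, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 9 is from the patch; the other positions are assumptions. */
#define DEMO_LBR_USER       (1U << 1)
#define DEMO_LBR_REL_CALL   (1U << 3)
#define DEMO_LBR_IND_CALL   (1U << 4)
#define DEMO_LBR_RETURN     (1U << 5)
#define DEMO_LBR_CALL_STACK (1U << 9)
#define DEMO_LBR_SEL_MASK   0x1FFU   /* the first 9 bits */

/* The form written out in the patch comment. */
static uint64_t demo_config_spelled_out(uint64_t mask)
{
	return (~mask & DEMO_LBR_SEL_MASK) |
	       (mask & ~(uint64_t)DEMO_LBR_SEL_MASK);
}

/* The form the patch actually uses. */
static uint64_t demo_config_xor(uint64_t mask)
{
	return mask ^ DEMO_LBR_SEL_MASK;
}
```

For a user-level call stack request the mask is `0x23A`, and both functions yield `0x3C5`: the call and return bits end up cleared, the other suppress bits set, and the call stack bit set.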

Subject: [tip:perf/core] perf: Add pmu specific data for perf task context

Commit-ID: 4af57ef28c2c1047fda9e1a5be02aa7a6a69cf9e
Gitweb: http://git.kernel.org/tip/4af57ef28c2c1047fda9e1a5be02aa7a6a69cf9e
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:01 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:05 +0100

perf: Add pmu specific data for perf task context

Introduce a new flag PERF_ATTACH_TASK_DATA for the perf event's attach
state. The flag is set by the PMU's event_init() callback and indicates
that the perf event needs PMU specific data.

The PMU specific data is initialized to zeros. Later patches will
use the PMU specific data to save the LBR stack.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/perf_event.h | 6 ++++++
kernel/events/core.c | 40 ++++++++++++++++++++++++++++++++++++----
2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c7007a5..270cd01 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -271,6 +271,10 @@ struct pmu {
*/
void (*sched_task) (struct perf_event_context *ctx,
bool sched_in);
+ /*
+ * PMU specific data size
+ */
+ size_t task_ctx_size;

};

@@ -307,6 +311,7 @@ struct swevent_hlist {
#define PERF_ATTACH_CONTEXT 0x01
#define PERF_ATTACH_GROUP 0x02
#define PERF_ATTACH_TASK 0x04
+#define PERF_ATTACH_TASK_DATA 0x08

struct perf_cgroup;
struct ring_buffer;
@@ -511,6 +516,7 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
+ void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;

struct delayed_work orphans_remove;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f563ce7..688086b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -905,6 +905,15 @@ static void get_ctx(struct perf_event_context *ctx)
WARN_ON(!atomic_inc_not_zero(&ctx->refcount));
}

+static void free_ctx(struct rcu_head *head)
+{
+ struct perf_event_context *ctx;
+
+ ctx = container_of(head, struct perf_event_context, rcu_head);
+ kfree(ctx->task_ctx_data);
+ kfree(ctx);
+}
+
static void put_ctx(struct perf_event_context *ctx)
{
if (atomic_dec_and_test(&ctx->refcount)) {
@@ -912,7 +921,7 @@ static void put_ctx(struct perf_event_context *ctx)
put_ctx(ctx->parent_ctx);
if (ctx->task)
put_task_struct(ctx->task);
- kfree_rcu(ctx, rcu_head);
+ call_rcu(&ctx->rcu_head, free_ctx);
}
}

@@ -3309,12 +3318,15 @@ errout:
* Returns a matching context with refcount and pincount.
*/
static struct perf_event_context *
-find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
+find_get_context(struct pmu *pmu, struct task_struct *task,
+ struct perf_event *event)
{
struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_cpu_context *cpuctx;
+ void *task_ctx_data = NULL;
unsigned long flags;
int ctxn, err;
+ int cpu = event->cpu;

if (!task) {
/* Must be root to operate on a CPU event: */
@@ -3342,11 +3354,24 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
if (ctxn < 0)
goto errout;

+ if (event->attach_state & PERF_ATTACH_TASK_DATA) {
+ task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+ if (!task_ctx_data) {
+ err = -ENOMEM;
+ goto errout;
+ }
+ }
+
retry:
ctx = perf_lock_task_context(task, ctxn, &flags);
if (ctx) {
clone_ctx = unclone_ctx(ctx);
++ctx->pin_count;
+
+ if (task_ctx_data && !ctx->task_ctx_data) {
+ ctx->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ }
raw_spin_unlock_irqrestore(&ctx->lock, flags);

if (clone_ctx)
@@ -3357,6 +3382,11 @@ retry:
if (!ctx)
goto errout;

+ if (task_ctx_data) {
+ ctx->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ }
+
err = 0;
mutex_lock(&task->perf_event_mutex);
/*
@@ -3383,9 +3413,11 @@ retry:
}
}

+ kfree(task_ctx_data);
return ctx;

errout:
+ kfree(task_ctx_data);
return ERR_PTR(err);
}

@@ -7559,7 +7591,7 @@ SYSCALL_DEFINE5(perf_event_open,
/*
* Get the target context (task or percpu):
*/
- ctx = find_get_context(pmu, task, event->cpu);
+ ctx = find_get_context(pmu, task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
goto err_alloc;
@@ -7765,7 +7797,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,

account_event(event);

- ctx = find_get_context(event->pmu, task, cpu);
+ ctx = find_get_context(event->pmu, task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
goto err_free;
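
The allocation pattern in find_get_context() (allocate before taking the lock, install the buffer only if the context has none yet, free whatever is left over) can be sketched in user space as follows. `calloc()` and `free()` stand in for `kzalloc()` and `kfree()`, the lock is only indicated by a comment, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_ctx {
	void *task_ctx_data;  /* PMU specific data, may be NULL */
};

/* Returns 0 on success, -1 if the allocation failed. */
static int demo_attach_task_data(struct demo_ctx *ctx, size_t size)
{
	/* Allocate up front: zeroed, and outside any lock. */
	void *data = calloc(1, size);

	if (!data)
		return -1;

	/* The kernel holds ctx->lock around this check-and-install. */
	if (!ctx->task_ctx_data) {
		ctx->task_ctx_data = data;
		data = NULL;  /* ownership moved to the context */
	}

	free(data);  /* no-op if installed, else drops the spare buffer */
	return 0;
}
```

A second call on the same context keeps the first buffer and frees the newly allocated spare, which is why the kernel code can call `kfree(task_ctx_data)` unconditionally on both exit paths.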

Subject: [tip:perf/core] perf: Always switch pmu specific data during context switch

Commit-ID: 5a158c3ccd2183a7b0866be6685d001fe653430f
Gitweb: http://git.kernel.org/tip/5a158c3ccd2183a7b0866be6685d001fe653430f
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:02 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:07 +0100

perf: Always switch pmu specific data during context switch

If two tasks were both forked from the same parent task, the events in
their perf task contexts can be the same. Perf core may then skip
switching the perf event contexts.

The previous patch introduces pmu specific data. The data is used for
saving the LBR stack and is task specific, so we need to switch the data
even when the context switch is optimized out.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 688086b..84451c0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2562,6 +2562,9 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
next->perf_event_ctxp[ctxn] = ctx;
ctx->task = next;
next_ctx->task = task;
+
+ swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
+
do_switch = 0;

perf_event_sync_stat(ctx, next_ctx);
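
Why the extra swap() is needed can be seen in a reduced model. The sketch below is an assumption-laden illustration: `struct task`, `struct ctx` and `switch_equivalent_ctx()` are hypothetical names, and only the pointer exchange from the optimized path is modelled.

```c
/* Hypothetical, reduced versions of the kernel structures. */
struct task;

struct ctx {
	struct task *task;
	void *task_ctx_data;
};

struct task {
	struct ctx *ctxp;
};

/*
 * When prev and next carry equivalent (cloned) contexts, exchange
 * the context pointers instead of rescheduling every event. The
 * per-task PMU data must be swapped back so that it stays with
 * its original task.
 */
static void switch_equivalent_ctx(struct task *prev, struct task *next)
{
	struct ctx *ctx = prev->ctxp;
	struct ctx *next_ctx = next->ctxp;
	void *tmp;

	prev->ctxp = next_ctx;
	next->ctxp = ctx;
	ctx->task = next;
	next_ctx->task = prev;

	/* Equivalent of swap(ctx->task_ctx_data, next_ctx->task_ctx_data). */
	tmp = ctx->task_ctx_data;
	ctx->task_ctx_data = next_ctx->task_ctx_data;
	next_ctx->task_ctx_data = tmp;
}
```

After the exchange each task points at the other task's context object, but still reaches its own data buffer through it. Without the data swap the saved LBR stack would follow the context object to the wrong task.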

Subject: [tip:perf/core] perf/x86/intel: Allocate space for storing LBR stack

Commit-ID: e18bf526422769611e7248135e36a4cea0e4e38d
Gitweb: http://git.kernel.org/tip/e18bf526422769611e7248135e36a4cea0e4e38d
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:03 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:08 +0100

perf/x86/intel: Allocate space for storing LBR stack

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. We can use pmu specific data to
store the LBR stack when a task is scheduled out. This patch adds code
that allocates the pmu specific data.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Stephane Eranian <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 4 ++++
arch/x86/kernel/cpu/perf_event.h | 7 +++++++
2 files changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6b1fd26..8ffd71e 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -432,6 +432,9 @@ int x86_pmu_hw_config(struct perf_event *event)
}
}

+ if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+ event->attach_state |= PERF_ATTACH_TASK_DATA;
+
/*
* Generate PMC IRQs:
* (keep 'enabled' bit clear for now)
@@ -1950,6 +1953,7 @@ static struct pmu pmu = {

.event_idx = x86_pmu_event_idx,
.sched_task = x86_pmu_sched_task,
+ .task_ctx_size = sizeof(struct x86_perf_task_context),
};

void arch_perf_update_userpage(struct perf_event *event,
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index c9a62c5..69c26b3 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -516,6 +516,13 @@ struct x86_pmu {
struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
};

+struct x86_perf_task_context {
+ u64 lbr_from[MAX_LBR_ENTRIES];
+ u64 lbr_to[MAX_LBR_ENTRIES];
+ int lbr_callstack_users;
+ int lbr_stack_state;
+};
+
enum {
PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,

Subject: [tip:perf/core] perf/x86/intel: Track number of events that use the LBR callstack

Commit-ID: 63f0c1d84196b712fe9de99a8514486ab416d517
Gitweb: http://git.kernel.org/tip/63f0c1d84196b712fe9de99a8514486ab416d517
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:04 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:09 +0100

perf/x86/intel: Track number of events that use the LBR callstack

When enabling/disabling an event, check whether the event uses the LBR
callstack feature and adjust the LBR callstack usage count accordingly.
A later patch will use the usage count to decide whether the LBR stack
should be saved/restored.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index ac8279e..ac8e542 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -205,9 +205,15 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
}
}

+static inline bool branch_user_callstack(unsigned br_sel)
+{
+ return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
+}
+
void intel_pmu_lbr_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;

if (!x86_pmu.lbr_nr)
return;
@@ -222,6 +228,12 @@ void intel_pmu_lbr_enable(struct perf_event *event)
}
cpuc->br_sel = event->hw.branch_reg.reg;

+ if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
+ event->ctx->task_ctx_data) {
+ task_ctx = event->ctx->task_ctx_data;
+ task_ctx->lbr_callstack_users++;
+ }
+
cpuc->lbr_users++;
perf_sched_cb_inc(event->ctx->pmu);
}
@@ -229,10 +241,17 @@ void intel_pmu_lbr_enable(struct perf_event *event)
void intel_pmu_lbr_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;

if (!x86_pmu.lbr_nr)
return;

+ if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
+ event->ctx->task_ctx_data) {
+ task_ctx = event->ctx->task_ctx_data;
+ task_ctx->lbr_callstack_users--;
+ }
+
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
perf_sched_cb_dec(event->ctx->pmu);
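
The usage counting above can be modelled in isolation. In this sketch the `X86_BR_USER` bit value is an assumption, `X86_BR_CALL_STACK` uses the value the enum has at this point in the series, and `struct task_ctx`, `lbr_enable()` and `lbr_disable()` are simplified, hypothetical stand-ins for the kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; only their distinctness matters here. */
#define X86_BR_USER		(1u << 0)
#define X86_BR_CALL_STACK	(1u << 15)

/* Hypothetical reduction of struct x86_perf_task_context. */
struct task_ctx {
	int lbr_callstack_users;
};

static bool branch_user_callstack(unsigned br_sel)
{
	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
}

/* Count the event only if it is a user-space call-stack event. */
static void lbr_enable(struct task_ctx *t, unsigned br_sel)
{
	if (t != NULL && branch_user_callstack(br_sel))
		t->lbr_callstack_users++;
}

static void lbr_disable(struct task_ctx *t, unsigned br_sel)
{
	if (t != NULL && branch_user_callstack(br_sel))
		t->lbr_callstack_users--;
}
```

Events without per-task data (a NULL `t`, as for CPU-wide events) and events that do not request the user call stack leave the counter untouched.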

Subject: [tip:perf/core] perf/x86/intel: Save/restore LBR stack during context switch

Commit-ID: 76cb2c617f12a4dd53c0e899972813b805ad6cc2
Gitweb: http://git.kernel.org/tip/76cb2c617f12a4dd53c0e899972813b805ad6cc2
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:05 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:10 +0100

perf/x86/intel: Save/restore LBR stack during context switch

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is saving/restoring
the LBR stack to/from the task's perf event context.

The LBR stack is saved/restored only when there are events that use
the LBR call stack. If no event uses the LBR call stack, the LBR stack
is reset when the task is scheduled in.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 88 ++++++++++++++++++++++++++----
1 file changed, 76 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index ac8e542..a8b3f23 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -180,14 +180,90 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}

+/*
+ * TOS = most recently recorded branch
+ */
+static inline u64 intel_pmu_lbr_tos(void)
+{
+ u64 tos;
+
+ rdmsrl(x86_pmu.lbr_tos, tos);
+ return tos;
+}
+
+enum {
+ LBR_NONE,
+ LBR_VALID,
+};
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+ int i;
+ unsigned lbr_idx, mask;
+ u64 tos;
+
+ if (task_ctx->lbr_callstack_users == 0 ||
+ task_ctx->lbr_stack_state == LBR_NONE) {
+ intel_pmu_lbr_reset();
+ return;
+ }
+
+ mask = x86_pmu.lbr_nr - 1;
+ tos = intel_pmu_lbr_tos();
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ lbr_idx = (tos - i) & mask;
+ wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+ wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+ }
+ task_ctx->lbr_stack_state = LBR_NONE;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+ int i;
+ unsigned lbr_idx, mask;
+ u64 tos;
+
+ if (task_ctx->lbr_callstack_users == 0) {
+ task_ctx->lbr_stack_state = LBR_NONE;
+ return;
+ }
+
+ mask = x86_pmu.lbr_nr - 1;
+ tos = intel_pmu_lbr_tos();
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ lbr_idx = (tos - i) & mask;
+ rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+ rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+ }
+ task_ctx->lbr_stack_state = LBR_VALID;
+}
+
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;

if (!x86_pmu.lbr_nr)
return;

/*
+ * If LBR callstack feature is enabled and the stack was saved when
+ * the task was scheduled out, restore the stack. Otherwise flush
+ * the LBR stack.
+ */
+ task_ctx = ctx ? ctx->task_ctx_data : NULL;
+ if (task_ctx) {
+ if (sched_in) {
+ __intel_pmu_lbr_restore(task_ctx);
+ cpuc->lbr_context = ctx;
+ } else {
+ __intel_pmu_lbr_save(task_ctx);
+ }
+ return;
+ }
+
+ /*
* When sampling the branck stack in system-wide, it may be
* necessary to flush the stack on context switch. This happens
* when the branch stack does not tag its entries with the pid
@@ -279,18 +355,6 @@ void intel_pmu_lbr_disable_all(void)
__intel_pmu_lbr_disable();
}

-/*
- * TOS = most recently recorded branch
- */
-static inline u64 intel_pmu_lbr_tos(void)
-{
- u64 tos;
-
- rdmsrl(x86_pmu.lbr_tos, tos);
-
- return tos;
-}
-
static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
{
unsigned long mask = x86_pmu.lbr_nr - 1;
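
The save/restore loops index the LBR registers as a ring relative to the top-of-stack (TOS) pointer, so the saved copy is always ordered from the most recent branch to the oldest. The sketch below models this with plain arrays instead of MSRs; `struct lbr_hw`, `struct lbr_saved` and `LBR_NR` are hypothetical names chosen for the example.

```c
#include <stdint.h>

#define LBR_NR 16	/* must be a power of two for the mask to work */

/* Simulated hardware: a ring of from/to registers plus a TOS index. */
struct lbr_hw {
	uint64_t from[LBR_NR];
	uint64_t to[LBR_NR];
	uint64_t tos;
};

/* Saved copy, entry 0 being the most recently recorded branch. */
struct lbr_saved {
	uint64_t from[LBR_NR];
	uint64_t to[LBR_NR];
};

static void lbr_save(const struct lbr_hw *hw, struct lbr_saved *s)
{
	unsigned mask = LBR_NR - 1;

	for (int i = 0; i < LBR_NR; i++) {
		unsigned idx = (hw->tos - i) & mask;

		s->from[i] = hw->from[idx];
		s->to[i] = hw->to[idx];
	}
}

static void lbr_restore(struct lbr_hw *hw, const struct lbr_saved *s)
{
	unsigned mask = LBR_NR - 1;

	for (int i = 0; i < LBR_NR; i++) {
		unsigned idx = (hw->tos - i) & mask;

		hw->from[idx] = s->from[i];
		hw->to[idx] = s->to[i];
	}
}
```

Because both loops walk backwards from the current TOS, the restore works even if TOS has a different value at schedule-in than it had at schedule-out: the entries are rotated to line up with the new TOS.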

Subject: [tip:perf/core] perf: Simplify the branch stack check

Commit-ID: a46a23000198d929391aa9dac8de68734efa2703
Gitweb: http://git.kernel.org/tip/a46a23000198d929391aa9dac8de68734efa2703
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:06 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:11 +0100

perf: Simplify the branch stack check

Use event->attr.branch_sample_type to replace
intel_pmu_needs_lbr_smpl(), avoiding duplicated code that
implicitly enables the LBR.

Currently, the branch stack can be enabled either by the user explicitly
requesting branch sampling or by implicit branch sampling to correct
PEBS skid.

For explicitly requested branch sampling, the branch_sample_type
is set by the user. For the PEBS case, the branch_sample_type is
implicitly set to PERF_SAMPLE_BRANCH_ANY in x86_pmu_hw_config().

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 20 +++-----------------
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 3 +++
3 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a485ba1..9f1dd18 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1029,20 +1029,6 @@ static __initconst const u64 slm_hw_cache_event_ids
},
};

-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
- /* user explicitly requested branch sampling */
- if (has_branch_stack(event))
- return true;
-
- /* implicit branch sampling to correct PEBS skid */
- if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2)
- return true;
-
- return false;
-}
-
static void intel_pmu_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1207,7 +1193,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
* must disable before any actual event
* because any event may be combined with LBR
*/
- if (intel_pmu_needs_lbr_smpl(event))
+ if (needs_branch_stack(event))
intel_pmu_lbr_disable(event);

if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
@@ -1268,7 +1254,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
* must enabled before any actual event
* because any event may be combined with LBR
*/
- if (intel_pmu_needs_lbr_smpl(event))
+ if (needs_branch_stack(event))
intel_pmu_lbr_enable(event);

if (event->attr.exclude_host)
@@ -1747,7 +1733,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.precise_ip && x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);

- if (intel_pmu_needs_lbr_smpl(event)) {
+ if (needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
if (ret)
return ret;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 270cd01..43cc158 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -814,6 +814,11 @@ static inline bool has_branch_stack(struct perf_event *event)
return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
}

+static inline bool needs_branch_stack(struct perf_event *event)
+{
+ return event->attr.branch_sample_type != 0;
+}
+
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84451c0..257eccf 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7232,6 +7232,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
goto err_ns;

+ if (!has_branch_stack(event))
+ event->attr.branch_sample_type = 0;
+
pmu = perf_init_event(event);
if (!pmu)
goto err_ns;
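
The difference between has_branch_stack() and needs_branch_stack() can be shown with a reduced attribute structure. In this sketch `struct attr`, `sanitize()` and `pebs_fixup()` are hypothetical simplifications of perf_event_attr, the perf_event_alloc() hunk and the implicit PEBS path in x86_pmu_hw_config(). The numeric flag values are given as assumptions, and the precise_br_compat() check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values assumed for the sketch. */
#define SAMPLE_BRANCH_STACK	(1u << 11)	/* bit in sample_type */
#define BR_USER			(1u << 0)	/* bits in branch_sample_type */
#define BR_KERNEL		(1u << 1)
#define BR_ANY			(1u << 3)

struct attr {
	uint64_t sample_type;
	uint64_t branch_sample_type;
	unsigned precise_ip;
	bool exclude_user;
	bool exclude_kernel;
};

/* The user explicitly asked for branch records in the sample. */
static bool has_branch_stack(const struct attr *a)
{
	return (a->sample_type & SAMPLE_BRANCH_STACK) != 0;
}

/* The LBR must run, whether requested explicitly or implicitly. */
static bool needs_branch_stack(const struct attr *a)
{
	return a->branch_sample_type != 0;
}

/* Drop a stale filter if branch sampling was not requested. */
static void sanitize(struct attr *a)
{
	if (!has_branch_stack(a))
		a->branch_sample_type = 0;
}

/* Implicit LBR use to correct PEBS skid on older PEBS formats. */
static void pebs_fixup(struct attr *a, int pebs_format)
{
	if (a->precise_ip <= 1 || pebs_format >= 2 || has_branch_stack(a))
		return;

	a->branch_sample_type = BR_ANY;
	if (!a->exclude_user)
		a->branch_sample_type |= BR_USER;
	if (!a->exclude_kernel)
		a->branch_sample_type |= BR_KERNEL;
}
```

Clearing branch_sample_type early is what makes the simple non-zero test safe: after that point the field is non-zero only if the user requested branch sampling or the PEBS fixup set it.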

Subject: [tip:perf/core] perf/x86/intel: Re-organize code that implicitly enables LBR/PEBS

Commit-ID: 4b854900995194601d767fcd112307b21ed278b2
Gitweb: http://git.kernel.org/tip/4b854900995194601d767fcd112307b21ed278b2
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:08 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:12 +0100

perf/x86/intel: Re-organize code that implicitly enables LBR/PEBS

Make a later patch more readable; no logic change.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 59 ++++++++++++++++++++--------------------
1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8ffd71e..e0dab5c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -399,36 +399,35 @@ int x86_pmu_hw_config(struct perf_event *event)

if (event->attr.precise_ip > precise)
return -EOPNOTSUPP;
- /*
- * check that PEBS LBR correction does not conflict with
- * whatever the user is asking with attr->branch_sample_type
- */
- if (event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2) {
- u64 *br_type = &event->attr.branch_sample_type;
-
- if (has_branch_stack(event)) {
- if (!precise_br_compat(event))
- return -EOPNOTSUPP;
-
- /* branch_sample_type is compatible */
-
- } else {
- /*
- * user did not specify branch_sample_type
- *
- * For PEBS fixups, we capture all
- * the branches at the priv level of the
- * event.
- */
- *br_type = PERF_SAMPLE_BRANCH_ANY;
-
- if (!event->attr.exclude_user)
- *br_type |= PERF_SAMPLE_BRANCH_USER;
-
- if (!event->attr.exclude_kernel)
- *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
- }
+ }
+ /*
+ * check that PEBS LBR correction does not conflict with
+ * whatever the user is asking with attr->branch_sample_type
+ */
+ if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
+ u64 *br_type = &event->attr.branch_sample_type;
+
+ if (has_branch_stack(event)) {
+ if (!precise_br_compat(event))
+ return -EOPNOTSUPP;
+
+ /* branch_sample_type is compatible */
+
+ } else {
+ /*
+ * user did not specify branch_sample_type
+ *
+ * For PEBS fixups, we capture all
+ * the branches at the priv level of the
+ * event.
+ */
+ *br_type = PERF_SAMPLE_BRANCH_ANY;
+
+ if (!event->attr.exclude_user)
+ *br_type |= PERF_SAMPLE_BRANCH_USER;
+
+ if (!event->attr.exclude_kernel)
+ *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
}
}

Subject: [tip:perf/core] perf/x86/intel: Disable FREEZE_LBRS_ON_PMI when LBR operates in callstack mode

Commit-ID: 2c70d0086e4e9e2440f0f78098090f32bde14277
Gitweb: http://git.kernel.org/tip/2c70d0086e4e9e2440f0f78098090f32bde14277
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:10 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:13 +0100

perf/x86/intel: Disable FREEZE_LBRS_ON_PMI when LBR operates in callstack mode

LBR callstack is designed for PEBS. It does not work well with
FREEZE_LBRS_ON_PMI for non-PEBS events. If FREEZE_LBRS_ON_PMI is set for
a non-PEBS event, PMIs near call/return instructions may cause superfluous
increase/decrease of LBR_TOS.

This patch modifies __intel_pmu_lbr_enable() to not enable
FREEZE_LBRS_ON_PMI when LBR operates in callstack mode. We currently
don't use LBR callstack to capture kernel space callchains, so disabling
FREEZE_LBRS_ON_PMI should not be a problem.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index a8b3f23..92a44fd 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -131,14 +131,23 @@ static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);

static void __intel_pmu_lbr_enable(void)
{
- u64 debugctl;
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 debugctl, lbr_select = 0;

- if (cpuc->lbr_sel)
- wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
+ if (cpuc->lbr_sel) {
+ lbr_select = cpuc->lbr_sel->config;
+ wrmsrl(MSR_LBR_SELECT, lbr_select);
+ }

rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
- debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
+ debugctl |= DEBUGCTLMSR_LBR;
+ /*
+ * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
+ * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
+ * may cause superfluous increase/decrease of LBR_TOS.
+ */
+ if (!(lbr_select & LBR_CALL_STACK))
+ debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
}
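
The resulting choice of DEBUGCTL bits can be written as a pure function. This is a sketch under stated assumptions: the bit positions below are supplied for illustration and should be checked against the kernel headers, and `lbr_debugctl()` is a hypothetical helper that replaces the rdmsrl()/wrmsrl() pair with a value in and a value out.

```c
#include <stdint.h>

/* Bit positions assumed for the sketch. */
#define DEBUGCTLMSR_LBR			(1ULL << 0)
#define DEBUGCTLMSR_FREEZE_LBRS_ON_PMI	(1ULL << 11)
#define LBR_CALL_STACK			(1ULL << 9)

/*
 * Compute the new DEBUGCTL value: always enable the LBR, but request
 * freeze-on-PMI only when the LBR is not in call-stack mode.
 */
static uint64_t lbr_debugctl(uint64_t debugctl, uint64_t lbr_select)
{
	debugctl |= DEBUGCTLMSR_LBR;
	if (!(lbr_select & LBR_CALL_STACK))
		debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
	return debugctl;
}
```

As in the patch, the helper only ever sets bits. It relies on the disable path having cleared the freeze bit beforehand.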

Subject: [tip:perf/core] perf/x86/intel: Discard zero length call entries in LBR call stack

Commit-ID: aa54ae9b87b83af7edabcc34a299e7e014609af4
Gitweb: http://git.kernel.org/tip/aa54ae9b87b83af7edabcc34a299e7e014609af4
Author: Yan, Zheng <[email protected]>
AuthorDate: Tue, 4 Nov 2014 21:56:11 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:14 +0100

perf/x86/intel: Discard zero length call entries in LBR call stack

A "zero length call" uses the call instruction to push the address of
the next instruction onto the stack and then pops that address into a
register. This is accomplished without any matching return instruction.
It confuses the hardware and makes the recorded call stack incorrect.

We can partially resolve this issue by decoding call instructions and
discarding any zero length call entry in the LBR stack.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 92a44fd..084f2eb 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -94,7 +94,8 @@ enum {
X86_BR_ABORT = 1 << 12,/* transaction abort */
X86_BR_IN_TX = 1 << 13,/* in transaction */
X86_BR_NO_TX = 1 << 14,/* not in transaction */
- X86_BR_CALL_STACK = 1 << 15,/* call stack */
+ X86_BR_ZERO_CALL = 1 << 15,/* zero length call */
+ X86_BR_CALL_STACK = 1 << 16,/* call stack */
};

#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -111,13 +112,15 @@ enum {
X86_BR_JMP |\
X86_BR_IRQ |\
X86_BR_ABORT |\
- X86_BR_IND_CALL)
+ X86_BR_IND_CALL |\
+ X86_BR_ZERO_CALL)

#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)

#define X86_BR_ANY_CALL \
(X86_BR_CALL |\
X86_BR_IND_CALL |\
+ X86_BR_ZERO_CALL |\
X86_BR_SYSCALL |\
X86_BR_IRQ |\
X86_BR_INT)
@@ -702,6 +705,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
ret = X86_BR_INT;
break;
case 0xe8: /* call near rel */
+ insn_get_immediate(&insn);
+ if (insn.immediate1.value == 0) {
+ /* zero length call */
+ ret = X86_BR_ZERO_CALL;
+ break;
+ }
case 0x9a: /* call far absolute */
ret = X86_BR_CALL;
break;
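
What the new 0xe8 case detects can be illustrated on raw instruction bytes. The sketch below is not the kernel decoder: `classify_call()` and `enum call_kind` are hypothetical, only the plain 5-byte `call rel32` encoding is handled, and instruction prefixes are ignored.

```c
#include <stddef.h>
#include <stdint.h>

enum call_kind { NOT_CALL, NEAR_CALL, ZERO_LENGTH_CALL };

/*
 * Classify a near relative call: opcode 0xe8 followed by a 32-bit
 * little-endian displacement. A displacement of zero targets the
 * very next instruction, which is the "zero length call" idiom
 * typically used to read the instruction pointer.
 */
static enum call_kind classify_call(const uint8_t *insn, size_t len)
{
	uint32_t rel;

	if (len < 5 || insn[0] != 0xe8)
		return NOT_CALL;

	rel = (uint32_t)insn[1] |
	      (uint32_t)insn[2] << 8 |
	      (uint32_t)insn[3] << 16 |
	      (uint32_t)insn[4] << 24;

	return rel == 0 ? ZERO_LENGTH_CALL : NEAR_CALL;
}
```

The byte sequence `e8 00 00 00 00` followed by a `pop` is the pattern in question: a call is recorded, but no return will ever pop it from the LBR call stack.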

Subject: [tip:perf/core] perf/x86/intel: Expose LBR callstack to user space tooling

Commit-ID: 2c44b1936bb3b135a3fac8b3493394d42e51cf70
Gitweb: http://git.kernel.org/tip/2c44b1936bb3b135a3fac8b3493394d42e51cf70
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed, 5 Nov 2014 10:36:45 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 17:16:15 +0100

perf/x86/intel: Expose LBR callstack to user space tooling

With the LBR call stack feature enabled, there are three callchain options.
Expose the third callchain option (LBR callstack) to user space tooling.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 8 --------
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 8 ++++----
include/uapi/linux/perf_event.h | 16 ++++++++--------
3 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 69c26b3..a371d27 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -523,14 +523,6 @@ struct x86_perf_task_context {
int lbr_stack_state;
};

-enum {
- PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
- PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
-
- PERF_SAMPLE_BRANCH_CALL_STACK =
- 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
-};
-
#define x86_add_quirk(func_) \
do { \
static struct x86_pmu_quirk __quirk __initdata = { \
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 084f2eb..0473874 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -537,7 +537,7 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
u64 mask = 0, v;
int i;

- for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+ for (i = 0; i < PERF_SAMPLE_BRANCH_MAX_SHIFT; i++) {
if (!(br_type & (1ULL << i)))
continue;

@@ -821,7 +821,7 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
/*
* Map interface branch filters onto LBR filters
*/
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
[PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
[PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
[PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
@@ -840,7 +840,7 @@ static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
[PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
[PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
[PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
[PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
@@ -852,7 +852,7 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
[PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_JCC,
};

-static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
[PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
[PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
[PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e46b932..1e3cd07 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -166,6 +166,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_TX_SHIFT = 9, /* not in transaction */
PERF_SAMPLE_BRANCH_COND_SHIFT = 10, /* conditional branches */

+ PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = 11, /* call/ret stack */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
};

@@ -175,18 +177,16 @@ enum perf_branch_sample_type {
PERF_SAMPLE_BRANCH_HV = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,

PERF_SAMPLE_BRANCH_ANY = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
- PERF_SAMPLE_BRANCH_ANY_CALL =
- 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
- PERF_SAMPLE_BRANCH_ANY_RETURN =
- 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
- PERF_SAMPLE_BRANCH_IND_CALL =
- 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
- PERF_SAMPLE_BRANCH_ABORT_TX =
- 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_CALL = 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+ PERF_SAMPLE_BRANCH_IND_CALL = 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ABORT_TX = 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
PERF_SAMPLE_BRANCH_IN_TX = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
PERF_SAMPLE_BRANCH_NO_TX = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
PERF_SAMPLE_BRANCH_COND = 1U << PERF_SAMPLE_BRANCH_COND_SHIFT,

+ PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};
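
With the call-stack bit now part of the ABI enum, the per-model maps can be sized by the MAX_SHIFT value and walked bit by bit. The sketch below shows that mapping loop in isolation. The enum names are shortened, the hardware filter values in `sel_map` are made up for the example, and `build_filter()` is a hypothetical reduction of intel_pmu_setup_hw_lbr_filter().

```c
#include <stdint.h>

enum {
	BRANCH_USER_SHIFT	= 0,
	BRANCH_KERNEL_SHIFT	= 1,
	BRANCH_HV_SHIFT		= 2,
	BRANCH_ANY_SHIFT	= 3,
	BRANCH_CALL_STACK_SHIFT	= 11,
	BRANCH_MAX_SHIFT	/* one past the last defined bit */
};

#define HW_NOT_SUPP (-1)	/* filter not available on this model */

/* Made-up hardware filter bits, indexed by the ABI bit position. */
static const int sel_map[BRANCH_MAX_SHIFT] = {
	[BRANCH_USER_SHIFT]		= 0x001,
	[BRANCH_KERNEL_SHIFT]		= 0x002,
	[BRANCH_HV_SHIFT]		= HW_NOT_SUPP,
	[BRANCH_ANY_SHIFT]		= 0x004,
	[BRANCH_CALL_STACK_SHIFT]	= 0x200,
};

/* Translate requested ABI bits into a hardware filter mask. */
static int build_filter(uint64_t br_type, uint64_t *mask)
{
	*mask = 0;

	for (int i = 0; i < BRANCH_MAX_SHIFT; i++) {
		if (!(br_type & (1ULL << i)))
			continue;
		if (sel_map[i] == HW_NOT_SUPP)
			return -1;
		*mask |= (uint64_t)sel_map[i];
	}
	return 0;
}
```

Because the array bound and the loop bound are the same constant, adding a further branch type only requires a new enum entry before the MAX_SHIFT value and a new map entry.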

2015-12-09 08:34:12

by Peter Zijlstra

Subject: Re: [tip:perf/core] perf: Add pmu specific data for perf task context

On Wed, Feb 18, 2015 at 09:15:06AM -0800, tip-bot for Yan, Zheng wrote:
> +find_get_context(struct pmu *pmu, struct task_struct *task,
> + struct perf_event *event)
> {
> struct perf_event_context *ctx, *clone_ctx = NULL;
> struct perf_cpu_context *cpuctx;
> + void *task_ctx_data = NULL;
> unsigned long flags;
> int ctxn, err;
> + int cpu = event->cpu;
>
> if (!task) {
> /* Must be root to operate on a CPU event: */
> @@ -3342,11 +3354,24 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
> if (ctxn < 0)
> goto errout;
>
> + if (event->attach_state & PERF_ATTACH_TASK_DATA) {
> + task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
> + if (!task_ctx_data) {
> + err = -ENOMEM;
> + goto errout;
> + }
> + }
> +
> retry:
> ctx = perf_lock_task_context(task, ctxn, &flags);
> if (ctx) {
> clone_ctx = unclone_ctx(ctx);
> ++ctx->pin_count;
> +
> + if (task_ctx_data && !ctx->task_ctx_data) {
> + ctx->task_ctx_data = task_ctx_data;
> + task_ctx_data = NULL;
> + }
> raw_spin_unlock_irqrestore(&ctx->lock, flags);
>
> if (clone_ctx)
> @@ -3357,6 +3382,11 @@ retry:
> if (!ctx)
> goto errout;
>
> + if (task_ctx_data) {
> + ctx->task_ctx_data = task_ctx_data;
> + task_ctx_data = NULL;
> + }
> +
> err = 0;
> mutex_lock(&task->perf_event_mutex);
> /*
> @@ -3383,9 +3413,11 @@ retry:
> }
> }
>
> + kfree(task_ctx_data);
> return ctx;
>
> errout:
> + kfree(task_ctx_data);
> return ERR_PTR(err);
> }


diff --git a/kernel/events/core.c b/kernel/events/core.c
index 36babfd..97aa610 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3508,11 +3515,6 @@ retry:
if (!ctx)
goto errout;

- if (task_ctx_data) {
- ctx->task_ctx_data = task_ctx_data;
- task_ctx_data = NULL;
- }
-
err = 0;
mutex_lock(&task->perf_event_mutex);
/*
@@ -3526,6 +3528,10 @@ retry:
else {
get_ctx(ctx);
++ctx->pin_count;
+ if (task_ctx_data) {
+ ctx->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ }
rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
}
mutex_unlock(&task->perf_event_mutex);


Does that make sense? No point in setting task_ctx_data if we're going
to free the ctx and try again.

2015-12-09 14:59:27

by Liang, Kan

Subject: RE: [tip:perf/core] perf: Add pmu specific data for perf task context


>
> On Wed, Feb 18, 2015 at 09:15:06AM -0800, tip-bot for Yan, Zheng wrote:
> > +find_get_context(struct pmu *pmu, struct task_struct *task,
> > + struct perf_event *event)
> > {
> > struct perf_event_context *ctx, *clone_ctx = NULL;
> > struct perf_cpu_context *cpuctx;
> > + void *task_ctx_data = NULL;
> > unsigned long flags;
> > int ctxn, err;
> > + int cpu = event->cpu;
> >
> > if (!task) {
> > /* Must be root to operate on a CPU event: */
> > @@ -3342,11 +3354,24 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
> > if (ctxn < 0)
> > goto errout;
> >
> > + if (event->attach_state & PERF_ATTACH_TASK_DATA) {
> > + task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
> > + if (!task_ctx_data) {
> > + err = -ENOMEM;
> > + goto errout;
> > + }
> > + }
> > +
> > retry:
> > ctx = perf_lock_task_context(task, ctxn, &flags);
> > if (ctx) {
> > clone_ctx = unclone_ctx(ctx);
> > ++ctx->pin_count;
> > +
> > + if (task_ctx_data && !ctx->task_ctx_data) {
> > + ctx->task_ctx_data = task_ctx_data;
> > + task_ctx_data = NULL;
> > + }
> > raw_spin_unlock_irqrestore(&ctx->lock, flags);
> >
> > if (clone_ctx)
> > @@ -3357,6 +3382,11 @@ retry:
> > if (!ctx)
> > goto errout;
> >
> > + if (task_ctx_data) {
> > + ctx->task_ctx_data = task_ctx_data;
> > + task_ctx_data = NULL;
> > + }
> > +
> > err = 0;
> > mutex_lock(&task->perf_event_mutex);
> > /*
> > @@ -3383,9 +3413,11 @@ retry:
> > }
> > }
> >
> > + kfree(task_ctx_data);
> > return ctx;
> >
> > errout:
> > + kfree(task_ctx_data);
> > return ERR_PTR(err);
> > }
>
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 36babfd..97aa610 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3508,11 +3515,6 @@ retry:
> if (!ctx)
> goto errout;
>
> - if (task_ctx_data) {
> - ctx->task_ctx_data = task_ctx_data;
> - task_ctx_data = NULL;
> - }
> -
> err = 0;
> mutex_lock(&task->perf_event_mutex);
> /*
> @@ -3526,6 +3528,10 @@ retry:
> else {
> get_ctx(ctx);
> ++ctx->pin_count;
> + if (task_ctx_data) {
> + ctx->task_ctx_data = task_ctx_data;
> + task_ctx_data = NULL;
> + }
> rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
> }
> mutex_unlock(&task->perf_event_mutex);
>
>
> Does that make sense? No point in setting task_ctx_data if we're going to
> free the ctx and try again.

The task_ctx_data will be checked before use. So it wouldn't crash the
system if it's NULL.
The problem is that LBR stack info will no longer be saved/restored on
context switch, so the user will probably get wrong call stack information.
May I know why you want to do that?

Thanks,
Kan

2015-12-09 15:14:36

by Peter Zijlstra

Subject: Re: [tip:perf/core] perf: Add pmu specific data for perf task context

On Wed, Dec 09, 2015 at 02:59:21PM +0000, Liang, Kan wrote:
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 36babfd..97aa610 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -3508,11 +3515,6 @@ retry:
> > if (!ctx)
> > goto errout;
> >
> > - if (task_ctx_data) {
> > - ctx->task_ctx_data = task_ctx_data;
> > - task_ctx_data = NULL;
> > - }
> > -
> > err = 0;
> > mutex_lock(&task->perf_event_mutex);
> > /*
> > @@ -3526,6 +3528,10 @@ retry:
> > else {
> > get_ctx(ctx);
> > ++ctx->pin_count;
> > + if (task_ctx_data) {
> > + ctx->task_ctx_data = task_ctx_data;
> > + task_ctx_data = NULL;
> > + }
> > rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
> > }
> > mutex_unlock(&task->perf_event_mutex);
> >
> >
> > Does that make sense? No point in setting task_ctx_data if we're going to
> > free the ctx and try again.
>
> The task_ctx_data will be checked before use. So it wouldn't crash the
> system if it's NULL.

Yeah, I know, I checked :-)

> The problem is that LBR stack info will no longer be saved/restored on
> context switch, so the user will probably get wrong call stack information.

Yep

> May I know why you want to do that?

Because this seemed like a less fragile construct. When there are multiple
event creations racing, it seems possible (albeit entirely unlikely) to
assign the allocated task_ctx_data to a ctx that we'll delete, and on
the second go around re-allocate a ctx, but be left without
task_ctx_data to assign to it.

So by only assigning the task_ctx_data when we _know_ we've succeeded,
we'll avoid this scenario.

2015-12-09 15:25:31

by Liang, Kan

Subject: RE: [tip:perf/core] perf: Add pmu specific data for perf task context


>
> On Wed, Dec 09, 2015 at 02:59:21PM +0000, Liang, Kan wrote:
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 36babfd..97aa610 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -3508,11 +3515,6 @@ retry:
> > > if (!ctx)
> > > goto errout;
> > >
> > > - if (task_ctx_data) {
> > > - ctx->task_ctx_data = task_ctx_data;
> > > - task_ctx_data = NULL;
> > > - }
> > > -
> > > err = 0;
> > > mutex_lock(&task->perf_event_mutex);
> > > /*
> > > @@ -3526,6 +3528,10 @@ retry:
> > > else {
> > > get_ctx(ctx);
> > > ++ctx->pin_count;
> > > + if (task_ctx_data) {
> > > + ctx->task_ctx_data = task_ctx_data;
> > > + task_ctx_data = NULL;
> > > + }
> > > rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
> > > }
> > > mutex_unlock(&task->perf_event_mutex);
> > >
> > >
> > > Does that make sense? No point in setting task_ctx_data if we're
> > > going to free the ctx and try again.
> >
> > The task_ctx_data will be checked before use. So it wouldn't crash the
> > system if it's NULL.
>
> Yeah, I know, I checked :-)
>
> > The problem is that LBR stack info will no longer be saved/restored on
> > context switch, so the user will probably get wrong call stack information.
>
> Yep
>
> > May I know why you want to do that?
>
> Because this seemed like a less fragile construct. When there are multiple
> event creations racing, it seems possible (albeit entirely unlikely) to assign
> the allocated task_ctx_data to a ctx that we'll delete, and on the second go
> around re-allocate a ctx, but be left without task_ctx_data to assign to it.
>
> So by only assigning the task_ctx_data when we _know_ we've succeeded,
> we'll avoid this scenario.

Yes, I think it makes sense to do that.

Thanks,
Kan