2014-02-18 06:07:47

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default in
64-bit code (and with modern 32-bit gccs), so there are many binaries
around that do not use frame pointers. Profiling unchanged production
code is very useful in practice. On some CPUs the frame pointer also has
a high cost. Dwarf2 unwinding also does not always work and is extremely
slow (up to 20% overhead).
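
As a point of reference, frame pointer unwinding is cheap because the
saved frame pointers form a linked list on the stack: one pointer chase
per frame. Below is a minimal user-space sketch, assuming the standard
x86-64 frame layout and code built with -fno-omit-frame-pointer; the
struct layout mirrors what perf_callchain_user() walks, the function
names are made up for illustration.

#include <stdio.h>

struct stack_frame {
        struct stack_frame *next_frame;   /* saved caller frame pointer */
        unsigned long return_address;     /* pushed by the call instruction */
};

static void __attribute__((noinline)) dump_callers(int max_depth)
{
        struct stack_frame *fp = __builtin_frame_address(0);

        /* one pointer chase per frame */
        while (fp && max_depth--) {
                printf("caller: 0x%lx\n", fp->return_address);
                fp = fp->next_frame;
        }
}

static void __attribute__((noinline)) leaf(void)
{
        dump_callers(2);        /* return addresses inside leaf() and main() */
}

int main(void)
{
        leaf();
        return 0;
}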

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls will be collected as normal, but as return instructions are
executed the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative way to
get the callgraph. It has some limitations too, but should work in most
cases
and is significantly faster than dwarf. Frame pointer unwinding is still
the best default, but LBR call stack is a good alternative when nothing
else works.

This patch series adds LBR call stack support. Users can enable/disable
this feature through a sysfs attribute file in the CPU PMU directory:
echo 1 > /sys/bus/event_source/devices/cpu/lbr_callstack

When profiling bc(1) on Fedora 19:
echo 'scale=2000; 4*a(1)' > cmd; perf record -g fp bc -l < cmd

If this feature is enabled, perf report output looks like:
50.36% bc bc [.] bc_divide
|
--- bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start

33.66% bc bc [.] _one_mult
|
--- _one_mult
bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start

7.62% bc bc [.] _bc_do_add
|
--- _bc_do_add
|
|--99.89%-- 0x2000186a8
--0.11%-- [...]

6.83% bc bc [.] _bc_do_sub
|
--- _bc_do_sub
|
|--99.94%-- bc_add
| execute
| run_code
| yyparse
| main
| __libc_start_main
| _start
--0.06%-- [...]

0.46% bc libc-2.17.so [.] __memset_sse2
|
--- __memset_sse2
|
|--54.13%-- bc_new_num
| |
| |--51.00%-- bc_divide
| | execute
| | run_code
| | yyparse
| | main
| | __libc_start_main
| | _start
| |
| |--30.46%-- _bc_do_sub
| | bc_add
| | execute
| | run_code
| | yyparse
| | main
| | __libc_start_main
| | _start
| |
| --18.55%-- _bc_do_add
| bc_add
| execute
| run_code
| yyparse
| main
| __libc_start_main
| _start
|
--45.87%-- bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start

If this feature is disabled, perf report output looks like:
50.49% bc bc [.] bc_divide
|
--- bc_divide

33.57% bc bc [.] _one_mult
|
--- _one_mult

7.61% bc bc [.] _bc_do_add
|
--- _bc_do_add
0x2000186a8

6.88% bc bc [.] _bc_do_sub
|
--- _bc_do_sub

0.42% bc libc-2.17.so [.] __memcpy_ssse3_back
|
--- __memcpy_ssse3_back

The LBR call stack has the following known limitations:
- Zero length calls are not filtered out by hardware
- Exception handling such as setjmp/longjmp will have calls/returns that
  do not match (see the sketch after this list)
- Pushing a different return address onto the stack will have
  calls/returns that do not match
- If the call stack is deeper than the LBR, only the most recent entries
  are captured
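
As an illustration of the setjmp/longjmp limitation, here is a minimal
sketch in plain C (nothing in it is specific to this series): the calls
to f() and g() push entries onto the LBR call stack, but longjmp()
transfers control back to main() without executing the matching return
instructions, so those entries are never popped and later captured call
stacks are skewed.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void g(void)
{
        longjmp(env, 1);        /* unwinds f() and g() without any ret */
}

static void f(void)
{
        g();
}

int main(void)
{
        if (setjmp(env) == 0)
                f();
        else
                printf("back in main() via longjmp\n");
        return 0;
}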

Changes since v1
- split change into more patches
- introduce context switch callback and use it to flush LBR
- use the context switch callback to save/restore LBR
- dynamically allocate the memory area for storing the LBR stack, always
  switch the memory area during context switch
- disable this feature by default
- more description in change logs

Changes since v2
- don't use xchg to switch PMU specific data
- remove nr_branch_stack from struct perf_event_context
- simplify the save/restore LBR stack logic
- remove unnecessary 'has_branch_stack -> needs_branch_stack'
conversion
- more description in change logs


2014-02-18 06:08:03

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 10/14] perf, core: simplify need branch stack check

event->attr.branch_sample_type is non-zero whether the branch stack
is enabled explicitly or implicitly. So we can use it to replace
intel_pmu_needs_lbr_smpl(). This avoids duplicating the code that
implicitly enables the LBR.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 20 +++-----------------
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 3 +++
3 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 84a1c09..722171c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1030,20 +1030,6 @@ static __initconst const u64 slm_hw_cache_event_ids
},
};

-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
- /* user explicitly requested branch sampling */
- if (has_branch_stack(event))
- return true;
-
- /* implicit branch sampling to correct PEBS skid */
- if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2)
- return true;
-
- return false;
-}
-
static void intel_pmu_disable_all(void)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1208,7 +1194,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
* must disable before any actual event
* because any event may be combined with LBR
*/
- if (intel_pmu_needs_lbr_smpl(event))
+ if (needs_branch_stack(event))
intel_pmu_lbr_disable(event);

if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
@@ -1269,7 +1255,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
* must enabled before any actual event
* because any event may be combined with LBR
*/
- if (intel_pmu_needs_lbr_smpl(event))
+ if (needs_branch_stack(event))
intel_pmu_lbr_enable(event);

if (event->attr.exclude_host)
@@ -1741,7 +1727,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.precise_ip && x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);

- if (intel_pmu_needs_lbr_smpl(event)) {
+ if (needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
if (ret)
return ret;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3da433d..6afc675 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -763,6 +763,11 @@ static inline bool has_branch_stack(struct perf_event *event)
return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
}

+static inline bool needs_branch_stack(struct perf_event *event)
+{
+ return event->attr.branch_sample_type != 0;
+}
+
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d5417e5..69fbf1b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6765,6 +6765,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
goto err_ns;

+ if (!has_branch_stack(event))
+ event->attr.branch_sample_type = 0;
+
pmu = perf_init_event(event);
if (!pmu)
goto err_ns;
--
1.8.5.3

2014-02-18 06:08:05

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 13/14] perf, x86: enable LBR callstack when recording callchain

Try to enable the LBR call stack facility if the user requests recording
the user callchain. Also add a cpu PMU attribute to enable/disable this
feature. This feature is disabled by default because it may contend for
the LBR with other events that explicitly require the branch stack.

Note: this feature only affects how the user callchain is obtained. The
kernel callchain is still obtained through frame pointers.
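
For reference, an event that qualifies for this implicit path is a
per-task event that samples PERF_SAMPLE_CALLCHAIN, does not request
PERF_SAMPLE_BRANCH_STACK and does not exclude user space. A minimal
user-space sketch, assuming the lbr_callstack sysfs attribute has been
set to 1 (error handling trimmed, the wrapper name is made up):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

static int open_callchain_event(pid_t pid)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
        attr.exclude_kernel = 1;
        /*
         * No PERF_SAMPLE_BRANCH_STACK and exclude_user is 0, so the
         * kernel may fill in branch_sample_type with
         * PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_CALL_STACK.
         */

        /* per-task event: pid >= 0, cpu == -1 */
        return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}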

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 99 ++++++++++++++++++++++++++++------------
arch/x86/kernel/cpu/perf_event.h | 1 +
2 files changed, 71 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 0d0fe2f3..837a0b1 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -399,37 +399,49 @@ int x86_pmu_hw_config(struct perf_event *event)

if (event->attr.precise_ip > precise)
return -EOPNOTSUPP;
+ }
+ /*
+ * check that PEBS LBR correction does not conflict with
+ * whatever the user is asking with attr->branch_sample_type
+ */
+ if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
+ u64 *br_type = &event->attr.branch_sample_type;
+
+ if (has_branch_stack(event)) {
+ if (!precise_br_compat(event))
+ return -EOPNOTSUPP;
+
+ /* branch_sample_type is compatible */
+
+ } else {
+ /*
+ * user did not specify branch_sample_type
+ *
+ * For PEBS fixups, we capture all
+ * the branches at the priv level of the
+ * event.
+ */
+ *br_type = PERF_SAMPLE_BRANCH_ANY;
+
+ if (!event->attr.exclude_user)
+ *br_type |= PERF_SAMPLE_BRANCH_USER;
+
+ if (!event->attr.exclude_kernel)
+ *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
+ }
+ } else if ((event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+ !has_branch_stack(event) &&
+ x86_pmu.attr_lbr_callstack &&
+ !event->attr.exclude_user &&
+ (event->attach_state & PERF_ATTACH_TASK)) {
/*
- * check that PEBS LBR correction does not conflict with
- * whatever the user is asking with attr->branch_sample_type
+ * user did not specify branch_sample_type,
+ * try using the LBR call stack facility to
+ * record call chains of user program.
*/
- if (event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2) {
- u64 *br_type = &event->attr.branch_sample_type;
-
- if (has_branch_stack(event)) {
- if (!precise_br_compat(event))
- return -EOPNOTSUPP;
-
- /* branch_sample_type is compatible */
-
- } else {
- /*
- * user did not specify branch_sample_type
- *
- * For PEBS fixups, we capture all
- * the branches at the priv level of the
- * event.
- */
- *br_type = PERF_SAMPLE_BRANCH_ANY;
-
- if (!event->attr.exclude_user)
- *br_type |= PERF_SAMPLE_BRANCH_USER;
-
- if (!event->attr.exclude_kernel)
- *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
- }
- }
+ event->attr.branch_sample_type =
+ PERF_SAMPLE_BRANCH_USER |
+ PERF_SAMPLE_BRANCH_CALL_STACK;
}

/*
@@ -1832,10 +1844,39 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
return count;
}

+static ssize_t get_attr_lbr_callstack(struct device *cdev,
+ struct device_attribute *attr, char *buf)
+{
+ return snprintf(buf, 40, "%d\n", x86_pmu.attr_lbr_callstack);
+}
+
+static ssize_t set_attr_lbr_callstack(struct device *cdev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long val;
+ ssize_t ret;
+
+ ret = kstrtoul(buf, 0, &val);
+ if (ret)
+ return ret;
+
+ if (!!val != !!x86_pmu.attr_lbr_callstack) {
+ if (val && !x86_pmu_has_lbr_callstack())
+ return -EOPNOTSUPP;
+ x86_pmu.attr_lbr_callstack = !!val;
+ }
+ return count;
+}
+
static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
+static DEVICE_ATTR(lbr_callstack, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH,
+ get_attr_lbr_callstack, set_attr_lbr_callstack);
+

static struct attribute *x86_pmu_attrs[] = {
&dev_attr_rdpmc.attr,
+ &dev_attr_lbr_callstack.attr,
NULL,
};

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 61b1e9c..3798e0d 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -410,6 +410,7 @@ struct x86_pmu {
*/
int attr_rdpmc_broken;
int attr_rdpmc;
+ int attr_lbr_callstack;
struct attribute **format_attrs;
struct attribute **event_attrs;

--
1.8.5.3

2014-02-18 06:08:09

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 14/14] perf, x86: Discard zero length call entries in LBR call stack

"Zero length call" uses the attribute of the call instruction to push
the immediate instruction pointer on to the stack and then pops off
that address into a register. This is accomplished without any matching
return instruction. It confuses the hardware and make the recorded call
stack incorrect.

We can partially resolve this issue by: decode call instructions and
discard any zero length call entry in the LBR stack.
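
For illustration only, the classic zero length call is the get-PC idiom
used by 32-bit position independent code. A hedged sketch with GCC
inline assembly (the function name is made up):

static unsigned long get_pc(void)
{
        unsigned long pc;

        /*
         * The call has a zero displacement: it pushes the address of
         * the next instruction and falls through to it, then pop moves
         * that address into a register. No ret ever matches this call,
         * which is why branch_type() below decodes the instruction and
         * discards such entries.
         */
        asm volatile("call 1f\n\t"
                     "1: pop %0"
                     : "=r" (pc));
        return pc;
}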

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 7e26367..2e96fe4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -94,7 +94,8 @@ enum {
X86_BR_ABORT = 1 << 12,/* transaction abort */
X86_BR_IN_TX = 1 << 13,/* in transaction */
X86_BR_NO_TX = 1 << 14,/* not in transaction */
- X86_BR_CALL_STACK = 1 << 15,/* call stack */
+ X86_BR_ZERO_CALL = 1 << 15,/* zero length call */
+ X86_BR_CALL_STACK = 1 << 16,/* call stack */
};

#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -111,13 +112,15 @@ enum {
X86_BR_JMP |\
X86_BR_IRQ |\
X86_BR_ABORT |\
- X86_BR_IND_CALL)
+ X86_BR_IND_CALL |\
+ X86_BR_ZERO_CALL)

#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)

#define X86_BR_ANY_CALL \
(X86_BR_CALL |\
X86_BR_IND_CALL |\
+ X86_BR_ZERO_CALL |\
X86_BR_SYSCALL |\
X86_BR_IRQ |\
X86_BR_INT)
@@ -651,6 +654,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
ret = X86_BR_INT;
break;
case 0xe8: /* call near rel */
+ insn_get_immediate(&insn);
+ if (insn.immediate1.value == 0) {
+ /* zero length call */
+ ret = X86_BR_ZERO_CALL;
+ break;
+ }
case 0x9a: /* call far absolute */
ret = X86_BR_CALL;
break;
--
1.8.5.3

2014-02-18 06:08:47

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 12/14] perf, x86: use LBR call stack to get user callchain

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR registers.
The LBR call stack facility can help perf get the call chains of programs
built without frame pointers.

This patch makes x86's perf_callchain_user() fall back to the LBR call
stack when there is no frame pointer in the user program.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 33 ++++++++++++++++++++++++++----
arch/x86/kernel/cpu/perf_event_intel.c | 10 ++++++++-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 ++
include/linux/perf_event.h | 1 +
4 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6094560..0d0fe2f3 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1975,12 +1975,28 @@ static unsigned long get_segment_base(unsigned int segment)
return get_desc_base(desc + idx);
}

+static inline void
+perf_callchain_lbr_callstack(struct perf_callchain_entry *entry,
+ struct perf_sample_data *data)
+{
+ struct perf_branch_stack *br_stack = data->br_stack;
+
+ if (br_stack && br_stack->user_callstack) {
+ int i = 0;
+ while (i < br_stack->nr && entry->nr < PERF_MAX_STACK_DEPTH) {
+ perf_callchain_store(entry, br_stack->entries[i].from);
+ i++;
+ }
+ }
+}
+
#ifdef CONFIG_COMPAT

#include <asm/compat.h>

static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
+perf_callchain_user32(struct perf_callchain_entry *entry,
+ struct pt_regs *regs, struct perf_sample_data *data)
{
/* 32-bit process in 64-bit kernel. */
unsigned long ss_base, cs_base;
@@ -2009,11 +2025,16 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
perf_callchain_store(entry, cs_base + frame.return_address);
fp = compat_ptr(ss_base + frame.next_frame);
}
+
+ if (fp == compat_ptr(regs->bp))
+ perf_callchain_lbr_callstack(entry, data);
+
return 1;
}
#else
static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
+perf_callchain_user32(struct perf_callchain_entry *entry,
+ struct pt_regs *regs, struct perf_sample_data *data)
{
return 0;
}
@@ -2043,12 +2064,12 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
if (!current->mm)
return;

- if (perf_callchain_user32(regs, entry))
+ if (perf_callchain_user32(entry, regs, data))
return;

while (entry->nr < PERF_MAX_STACK_DEPTH) {
unsigned long bytes;
- frame.next_frame = NULL;
+ frame.next_frame = NULL;
frame.return_address = 0;

bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
@@ -2061,6 +2082,10 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
perf_callchain_store(entry, frame.return_address);
fp = frame.next_frame;
}
+
+ /* try LBR callstack if there is no frame pointer */
+ if (fp == (void __user *)regs->bp)
+ perf_callchain_lbr_callstack(entry, data);
}

/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 722171c..9057a20 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1030,6 +1030,14 @@ static __initconst const u64 slm_hw_cache_event_ids
},
};

+static inline bool intel_pmu_needs_lbr_callstack(struct perf_event *event)
+{
+ if ((event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+ (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK))
+ return true;
+ return false;
+}
+
static void intel_pmu_disable_all(void)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1398,7 +1406,7 @@ again:

perf_sample_data_init(&data, 0, event->hw.last_period);

- if (has_branch_stack(event))
+ if (needs_branch_stack(event))
data.br_stack = &cpuc->lbr_stack;

if (perf_event_overflow(event, &data, regs))
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index aa726af..7e26367 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -717,6 +717,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
int i, j, type;
bool compress = false;

+ cpuc->lbr_stack.user_callstack = branch_user_callstack(br_sel);
+
/* if sampling all branches, then nothing to filter */
if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
return;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b87974a..517c34a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -74,6 +74,7 @@ struct perf_raw_record {
* recent branch.
*/
struct perf_branch_stack {
+ bool user_callstack;
__u64 nr;
struct perf_branch_entry entries[0];
};
--
1.8.5.3

2014-02-18 06:07:59

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 07/14] perf, x86: track number of events that use LBR callstack

When enabling or disabling an event, check whether the event uses the
LBR call stack feature and adjust the LBR call stack usage count
accordingly. A later patch will use the usage count to decide whether
the LBR stack should be saved/restored.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index cfb1fe0..2e80727 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -201,15 +201,27 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
intel_pmu_lbr_reset();
}

+static inline bool branch_user_callstack(unsigned br_sel)
+{
+ return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
+}
+
void intel_pmu_lbr_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct x86_perf_task_context *task_ctx;

if (!x86_pmu.lbr_nr)
return;

+ cpuc = &__get_cpu_var(cpu_hw_events);
+ task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
+
cpuc->br_sel = event->hw.branch_reg.reg;

+ if (branch_user_callstack(cpuc->br_sel))
+ task_ctx->lbr_callstack_users++;
+
cpuc->lbr_users++;
if (cpuc->lbr_users == 1)
perf_sched_cb_enable(event->ctx->pmu);
@@ -217,11 +229,18 @@ void intel_pmu_lbr_enable(struct perf_event *event)

void intel_pmu_lbr_disable(struct perf_event *event)
{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct cpu_hw_events *cpuc;
+ struct x86_perf_task_context *task_ctx;

if (!x86_pmu.lbr_nr)
return;

+ cpuc = &__get_cpu_var(cpu_hw_events);
+ task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
+
+ if (branch_user_callstack(cpuc->br_sel))
+ task_ctx->lbr_callstack_users--;
+
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);

--
1.8.5.3

2014-02-18 06:09:12

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 11/14] perf, core: Pass perf_sample_data to perf_callchain()

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR registers.
The LBR call stack facility can help perf get the call chains of programs
built without frame pointers.

This patch modifies the various architectures' perf_callchain() to accept
perf sample data. A later patch will add code that uses the sample data
to get call chains.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/arm/kernel/perf_event.c | 4 ++--
arch/powerpc/perf/callchain.c | 4 ++--
arch/sparc/kernel/perf_event.c | 4 ++--
arch/x86/kernel/cpu/perf_event.c | 4 ++--
include/linux/perf_event.h | 3 ++-
kernel/events/callchain.c | 8 +++++---
kernel/events/core.c | 2 +-
kernel/events/internal.h | 3 ++-
8 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 789d846..276b13b 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -562,8 +562,8 @@ user_backtrace(struct frame_tail __user *tail,
return buftail.fp - 1;
}

-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+ struct pt_regs *regs, struct perf_sample_data *data)
{
struct frame_tail __user *tail;

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index 74d1e78..b379ebc 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -482,8 +482,8 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
}
}

-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+ struct pt_regs *regs, struct perf_sample_data *data)
{
if (current_is_64bit())
perf_callchain_user_64(entry, regs);
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index b5c38fa..cba0306 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1785,8 +1785,8 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
} while (entry->nr < PERF_MAX_STACK_DEPTH);
}

-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+ struct pt_regs *regs, struct perf_sample_data *data)
{
perf_callchain_store(entry, regs->tpc);

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6a7792f..6094560 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2019,8 +2019,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
}
#endif

-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+ struct pt_regs *regs, struct perf_sample_data *data)
{
struct stack_frame frame;
const void __user *fp;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6afc675..b87974a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -706,7 +706,8 @@ extern void perf_event_fork(struct task_struct *tsk);
/* Callchains */
DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);

-extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs);
+extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs,
+ struct perf_sample_data *data);
extern void perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs);

static inline void perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 97b67df..19d497c 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -30,7 +30,8 @@ __weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
}

__weak void perf_callchain_user(struct perf_callchain_entry *entry,
- struct pt_regs *regs)
+ struct pt_regs *regs,
+ struct perf_sample_data *data)
{
}

@@ -157,7 +158,8 @@ put_callchain_entry(int rctx)
}

struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs)
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+ struct perf_sample_data *data)
{
int rctx;
struct perf_callchain_entry *entry;
@@ -198,7 +200,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
goto exit_put;

perf_callchain_store(entry, PERF_CONTEXT_USER);
- perf_callchain_user(entry, regs);
+ perf_callchain_user(entry, regs, data);
}
}

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 69fbf1b..fbe0a0b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4693,7 +4693,7 @@ void perf_prepare_sample(struct perf_event_header *header,
if (sample_type & PERF_SAMPLE_CALLCHAIN) {
int size = 1;

- data->callchain = perf_callchain(event, regs);
+ data->callchain = perf_callchain(event, regs, data);

if (data->callchain)
size += data->callchain->nr;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..cd18b64 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -147,7 +147,8 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)

/* Callchain handling */
extern struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs);
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+ struct perf_sample_data *data);
extern int get_callchain_buffers(void);
extern void put_callchain_buffers(void);

--
1.8.5.3

2014-02-18 06:09:43

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 09/14] perf, x86: Save/restore LBR stack during context switch

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is to save/restore the
LBR stack to/from the task's perf event context.

The LBR stack is saved/restored only when there are events that use
the LBR call stack. If no event uses the LBR call stack, the LBR stack
is reset when a task is scheduled in.
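
The LBR registers form a ring buffer indexed by the top-of-stack (TOS)
MSR, so the save and restore loops below walk the entries most recent
first and wrap with a power-of-two mask. A quick user-space sketch of
just the index arithmetic, assuming 16 LBR entries as on Haswell:

#include <stdio.h>

int main(void)
{
        const unsigned int nr = 16, mask = nr - 1;
        unsigned int tos = 3;   /* example top-of-stack value */
        unsigned int i;

        for (i = 0; i < nr; i++)
                printf("%u ", (tos - i) & mask);
        printf("\n");           /* 3 2 1 0 15 14 ... 5 4 */
        return 0;
}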

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 79 ++++++++++++++++++++++++------
1 file changed, 65 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 2e80727..aa726af 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -187,18 +187,81 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}

+/*
+ * TOS = most recently recorded branch
+ */
+static inline u64 intel_pmu_lbr_tos(void)
+{
+ u64 tos;
+ rdmsrl(x86_pmu.lbr_tos, tos);
+ return tos;
+}
+
+enum {
+ LBR_NONE,
+ LBR_VALID,
+};
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+ int i;
+ unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
+ u64 tos = intel_pmu_lbr_tos();
+
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ lbr_idx = (tos - i) & mask;
+ wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+ wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+ }
+ task_ctx->lbr_stack_state = LBR_NONE;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+ int i;
+ unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
+ u64 tos = intel_pmu_lbr_tos();
+
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ lbr_idx = (tos - i) & mask;
+ rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+ rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+ }
+ task_ctx->lbr_stack_state = LBR_VALID;
+}
+
+
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
+ struct x86_perf_task_context *task_ctx;
+
if (!x86_pmu.lbr_nr)
return;

+ task_ctx = ctx ? ctx->task_ctx_data : NULL;
+
/*
* It is necessary to flush the stack on context switch. This happens
* when the branch stack does not tag its entries with the pid of the
* current task.
*/
- if (sched_in)
- intel_pmu_lbr_reset();
+ if (sched_in) {
+ if (!task_ctx ||
+ !task_ctx->lbr_callstack_users ||
+ task_ctx->lbr_stack_state != LBR_VALID)
+ intel_pmu_lbr_reset();
+ else
+ __intel_pmu_lbr_restore(task_ctx);
+ return;
+ }
+
+ /* schedule out */
+ if (task_ctx) {
+ if (task_ctx->lbr_callstack_users)
+ __intel_pmu_lbr_save(task_ctx);
+ else
+ task_ctx->lbr_stack_state = LBR_NONE;
+ }
}

static inline bool branch_user_callstack(unsigned br_sel)
@@ -267,18 +330,6 @@ void intel_pmu_lbr_disable_all(void)
__intel_pmu_lbr_disable();
}

-/*
- * TOS = most recently recorded branch
- */
-static inline u64 intel_pmu_lbr_tos(void)
-{
- u64 tos;
-
- rdmsrl(x86_pmu.lbr_tos, tos);
-
- return tos;
-}
-
static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
{
unsigned long mask = x86_pmu.lbr_nr - 1;
--
1.8.5.3

2014-02-18 06:07:58

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 06/14] perf, core: always switch pmu specific data during context switch

If two tasks were both forked from the same parent task, the events in
their perf task contexts can be the same. Perf core optimizes the context
switch out in this case.

The previous patch introduces PMU specific data. The data is task
specific, so we should switch the data even when the context switch is
optimized out.

Signed-off-by: Yan, Zheng <[email protected]>
---
kernel/events/core.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index da551c5..d5417e5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2331,6 +2331,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
raw_spin_lock(&ctx->lock);
raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
if (context_equiv(ctx, next_ctx)) {
+ void *ctx_data;
/*
* XXX do we need a memory barrier of sorts
* wrt to rcu_dereference() of perf_event_ctxp
@@ -2339,6 +2340,11 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
next->perf_event_ctxp[ctxn] = ctx;
ctx->task = next;
next_ctx->task = task;
+
+ ctx_data = next_ctx->task_ctx_data;
+ next_ctx->task_ctx_data = ctx->task_ctx_data;
+ ctx->task_ctx_data = ctx_data;
+
do_switch = 0;

perf_event_sync_stat(ctx, next_ctx);
--
1.8.5.3

2014-02-18 06:10:07

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 08/14] perf, x86: allocate space for storing LBR stack

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. We can use the PMU specific data to
store the LBR stack when a task is scheduled out. This patch adds the
code that allocates the PMU specific data.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 1 +
arch/x86/kernel/cpu/perf_event.h | 7 +++++++
2 files changed, 8 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 534e859..6a7792f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1883,6 +1883,7 @@ static struct pmu pmu = {

.event_idx = x86_pmu_event_idx,
.sched_task = x86_pmu_sched_task,
+ .task_ctx_size = sizeof(struct x86_perf_task_context),
};

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 9bc3996..61b1e9c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -469,6 +469,13 @@ struct x86_pmu {
struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
};

+struct x86_perf_task_context {
+ u64 lbr_from[MAX_LBR_ENTRIES];
+ u64 lbr_to[MAX_LBR_ENTRIES];
+ int lbr_callstack_users;
+ int lbr_stack_state;
+};
+
enum {
PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
--
1.8.5.3

2014-02-18 06:07:53

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 04/14] perf, x86: Basic Haswell LBR call stack support

Haswell has a new feature that utilizes the existing LBR facility to
record call chains. To enable this feature, the bits (JCC, NEAR_IND_JMP,
NEAR_REL_JMP, FAR_BRANCH, EN_CALLSTACK) in LBR_SELECT must be set to 1,
and the bits (NEAR_REL_CALL, NEAR_IND_CALL, NEAR_RET) must be cleared.
Due to a hardware bug in Haswell, this feature doesn't work well with
FREEZE_LBRS_ON_PMI.

When the call stack feature is enabled, the LBR stack will capture
unfiltered call data normally, but as return instructions are executed,
the last captured branch record is flushed from the on-chip registers
in a last-in first-out (LIFO) manner. Thus, branch information relative
to leaf functions will not be captured, while preserving the call stack
information of the main line execution path.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 13 +++-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 98 +++++++++++++++++++++++-------
3 files changed, 88 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index ccbe3fd..9bc3996 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -470,7 +470,10 @@ struct x86_pmu {
};

enum {
- PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+ PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+ PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
+
+ PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
};

#define x86_add_quirk(func_) \
@@ -504,6 +507,12 @@ static struct perf_pmu_events_attr event_attr_##v = { \

extern struct x86_pmu x86_pmu __read_mostly;

+static inline bool x86_pmu_has_lbr_callstack(void)
+{
+ return x86_pmu.lbr_sel_map &&
+ x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+}
+
DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);

int x86_perf_event_set_period(struct perf_event *event);
@@ -707,6 +716,8 @@ void intel_pmu_lbr_init_atom(void);

void intel_pmu_lbr_init_snb(void);

+void intel_pmu_lbr_init_hsw(void);
+
int intel_pmu_setup_lbr_filter(struct perf_event *event);

int p4_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 4325bae..84a1c09 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2494,7 +2494,7 @@ __init int intel_pmu_init(void)
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));

- intel_pmu_lbr_init_snb();
+ intel_pmu_lbr_init_hsw();

x86_pmu.event_constraints = intel_hsw_event_constraints;
x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 7ff2a99..cfb1fe0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -39,6 +39,7 @@ static enum {
#define LBR_IND_JMP_BIT 6 /* do not capture indirect jumps */
#define LBR_REL_JMP_BIT 7 /* do not capture relative jumps */
#define LBR_FAR_BIT 8 /* do not capture far branches */
+#define LBR_CALL_STACK_BIT 9 /* enable call stack */

#define LBR_KERNEL (1 << LBR_KERNEL_BIT)
#define LBR_USER (1 << LBR_USER_BIT)
@@ -49,6 +50,7 @@ static enum {
#define LBR_REL_JMP (1 << LBR_REL_JMP_BIT)
#define LBR_IND_JMP (1 << LBR_IND_JMP_BIT)
#define LBR_FAR (1 << LBR_FAR_BIT)
+#define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT)

#define LBR_PLM (LBR_KERNEL | LBR_USER)

@@ -74,24 +76,25 @@ static enum {
* x86control flow changes include branches, interrupts, traps, faults
*/
enum {
- X86_BR_NONE = 0, /* unknown */
-
- X86_BR_USER = 1 << 0, /* branch target is user */
- X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
-
- X86_BR_CALL = 1 << 2, /* call */
- X86_BR_RET = 1 << 3, /* return */
- X86_BR_SYSCALL = 1 << 4, /* syscall */
- X86_BR_SYSRET = 1 << 5, /* syscall return */
- X86_BR_INT = 1 << 6, /* sw interrupt */
- X86_BR_IRET = 1 << 7, /* return from interrupt */
- X86_BR_JCC = 1 << 8, /* conditional */
- X86_BR_JMP = 1 << 9, /* jump */
- X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
- X86_BR_IND_CALL = 1 << 11,/* indirect calls */
- X86_BR_ABORT = 1 << 12,/* transaction abort */
- X86_BR_IN_TX = 1 << 13,/* in transaction */
- X86_BR_NO_TX = 1 << 14,/* not in transaction */
+ X86_BR_NONE = 0, /* unknown */
+
+ X86_BR_USER = 1 << 0, /* branch target is user */
+ X86_BR_KERNEL = 1 << 1, /* branch target is kernel */
+
+ X86_BR_CALL = 1 << 2, /* call */
+ X86_BR_RET = 1 << 3, /* return */
+ X86_BR_SYSCALL = 1 << 4, /* syscall */
+ X86_BR_SYSRET = 1 << 5, /* syscall return */
+ X86_BR_INT = 1 << 6, /* sw interrupt */
+ X86_BR_IRET = 1 << 7, /* return from interrupt */
+ X86_BR_JCC = 1 << 8, /* conditional */
+ X86_BR_JMP = 1 << 9, /* jump */
+ X86_BR_IRQ = 1 << 10,/* hw interrupt or trap or fault */
+ X86_BR_IND_CALL = 1 << 11,/* indirect calls */
+ X86_BR_ABORT = 1 << 12,/* transaction abort */
+ X86_BR_IN_TX = 1 << 13,/* in transaction */
+ X86_BR_NO_TX = 1 << 14,/* not in transaction */
+ X86_BR_CALL_STACK = 1 << 15,/* call stack */
};

#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -135,7 +138,14 @@ static void __intel_pmu_lbr_enable(void)
wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);

rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
- debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
+ debugctl |= DEBUGCTLMSR_LBR;
+ /*
+ * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
+ * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
+ * may cause superfluous increase/decrease of LBR_TOS.
+ */
+ if (!cpuc->lbr_sel || !(cpuc->lbr_sel->config & LBR_CALL_STACK))
+ debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
}

@@ -354,7 +364,7 @@ void intel_pmu_lbr_read(void)
* - in case there is no HW filter
* - in case the HW filter has errata or limitations
*/
-static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
{
u64 br_type = event->attr.branch_sample_type;
int mask = 0;
@@ -388,11 +398,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
mask |= X86_BR_NO_TX;

+ if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
+ if (!x86_pmu_has_lbr_callstack())
+ return -EOPNOTSUPP;
+ if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
+ return -EINVAL;
+ mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
+ X86_BR_CALL_STACK;
+ }
+
/*
* stash actual user request into reg, it may
* be used by fixup code for some CPU
*/
event->hw.branch_reg.reg = mask;
+ return 0;
}

/*
@@ -421,8 +441,11 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
reg = &event->hw.branch_reg;
reg->idx = EXTRA_REG_LBR;

- /* LBR_SELECT operates in suppress mode so invert mask */
- reg->config = ~mask & x86_pmu.lbr_sel_mask;
+ /*
+ * the first 8 bits (LBR_SEL_MASK) in LBR_SELECT operates
+ * in suppress mode so invert mask
+ */
+ reg->config = mask ^ x86_pmu.lbr_sel_mask;

return 0;
}
@@ -440,7 +463,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
/*
* setup SW LBR filter
*/
- intel_pmu_setup_sw_lbr_filter(event);
+ ret = intel_pmu_setup_sw_lbr_filter(event);
+ if (ret)
+ return ret;

/*
* setup HW LBR filter, if any
@@ -695,6 +720,19 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
};

+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
+ [PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_RETURN | LBR_CALL_STACK,
+};
+
/* core */
void intel_pmu_lbr_init_core(void)
{
@@ -751,6 +789,20 @@ void intel_pmu_lbr_init_snb(void)
pr_cont("16-deep LBR, ");
}

+/* haswell */
+void intel_pmu_lbr_init_hsw(void)
+{
+ x86_pmu.lbr_nr = 16;
+ x86_pmu.lbr_tos = MSR_LBR_TOS;
+ x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+ x86_pmu.lbr_to = MSR_LBR_NHM_TO;
+
+ x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+ x86_pmu.lbr_sel_map = hsw_lbr_sel_map;
+
+ pr_cont("16-deep LBR, ");
+}
+
/* atom */
void intel_pmu_lbr_init_atom(void)
{
--
1.8.5.3

2014-02-18 06:07:51

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 02/14] perf, core: introduce pmu context switch callback

The callback is invoked when a process is scheduled in or out. It
provides a mechanism for later patches to save/restore the LBR stack.
It can also replace the flush branch stack callback.

To avoid unnecessary overhead, the callback is enabled dynamically.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 +++++
arch/x86/kernel/cpu/perf_event.h | 4 +++
include/linux/perf_event.h | 8 ++++++
kernel/events/core.c | 60 +++++++++++++++++++++++++++++++++++++++-
4 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 895604f..68c0314 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1850,6 +1850,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
NULL,
};

+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (x86_pmu.sched_task)
+ x86_pmu.sched_task(ctx, sched_in);
+}
+
static void x86_pmu_flush_branch_stack(void)
{
if (x86_pmu.flush_branch_stack)
@@ -1883,6 +1889,7 @@ static struct pmu pmu = {

.event_idx = x86_pmu_event_idx,
.flush_branch_stack = x86_pmu_flush_branch_stack,
+ .sched_task = x86_pmu_sched_task,
};

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 518025e..551f09b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -427,6 +427,8 @@ struct x86_pmu {

void (*check_microcode)(void);
void (*flush_branch_stack)(void);
+ void (*sched_task)(struct perf_event_context *ctx,
+ bool sched_in);

/*
* Intel Arch Perfmon v2+
@@ -685,6 +687,8 @@ void intel_pmu_pebs_disable_all(void);

void intel_ds_init(void);

+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_pmu_lbr_reset(void);

void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e56b07f..adc20f2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -251,6 +251,12 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * PMU callback for context-switches. optional
+ */
+ void (*sched_task) (struct perf_event_context *ctx,
+ bool sched_in);
};

/**
@@ -544,6 +550,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
extern void perf_event_print_debug(void);
extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_disable(struct pmu *pmu);
+extern void perf_sched_cb_enable(struct pmu *pmu);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2067cbb..350e566 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -142,6 +142,7 @@ enum event_type_t {
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);

static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
@@ -151,6 +152,7 @@ static atomic_t nr_freq_events __read_mostly;
static LIST_HEAD(pmus);
static DEFINE_MUTEX(pmus_lock);
static struct srcu_struct pmus_srcu;
+static struct idr pmu_idr;

/*
* perf event paranoia level:
@@ -2353,6 +2355,57 @@ unlock:
}
}

+void perf_sched_cb_disable(struct pmu *pmu)
+{
+ __get_cpu_var(perf_sched_cb_usages)--;
+}
+
+void perf_sched_cb_enable(struct pmu *pmu)
+{
+ __get_cpu_var(perf_sched_cb_usages)++;
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+ struct task_struct *next,
+ bool sched_in)
+{
+ struct perf_cpu_context *cpuctx;
+ struct pmu *pmu;
+ unsigned long flags;
+
+ if (prev == next)
+ return;
+
+ local_irq_save(flags);
+
+ rcu_read_lock();
+
+ pmu = idr_find(&pmu_idr, PERF_TYPE_RAW);
+
+ if (pmu && pmu->sched_task) {
+ cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ pmu = cpuctx->ctx.pmu;
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ perf_pmu_disable(pmu);
+
+ pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+ perf_pmu_enable(pmu);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ }
+
+ rcu_read_unlock();
+
+ local_irq_restore(flags);
+}
+
#define for_each_task_context_nr(ctxn) \
for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)

@@ -2372,6 +2425,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
{
int ctxn;

+ if (__get_cpu_var(perf_sched_cb_usages))
+ perf_pmu_sched_task(task, next, false);
+
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);

@@ -2631,6 +2687,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
/* check for system-wide branch_stack events */
if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
perf_branch_stack_sched_in(prev, task);
+
+ if (__get_cpu_var(perf_sched_cb_usages))
+ perf_pmu_sched_task(prev, task, true);
}

static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -6356,7 +6415,6 @@ static void free_pmu_context(struct pmu *pmu)
out:
mutex_unlock(&pmus_lock);
}
-static struct idr pmu_idr;

static ssize_t
type_show(struct device *dev, struct device_attribute *attr, char *page)
--
1.8.5.3

2014-02-18 06:11:01

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 05/14] perf, core: pmu specific data for perf task context

Introduce a new field in 'struct pmu' to specify the size of the PMU
specific data. Allocate memory for the PMU specific data when allocating
the perf task context. The PMU specific data is initialized to zeros.
Later patches will use the PMU specific data to save the LBR stack.

Signed-off-by: Yan, Zheng <[email protected]>
---
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 19 ++++++++++++++++++-
2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 80ddc0c..3da433d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -252,6 +252,10 @@ struct pmu {
*/
void (*sched_task) (struct perf_event_context *ctx,
bool sched_in);
+ /*
+ * PMU specific data size
+ */
+ size_t task_ctx_size;
};

/**
@@ -493,6 +497,7 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
+ void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 202e3fd..da551c5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -898,6 +898,15 @@ static void get_ctx(struct perf_event_context *ctx)
WARN_ON(!atomic_inc_not_zero(&ctx->refcount));
}

+static void free_ctx(struct rcu_head *head)
+{
+ struct perf_event_context *ctx;
+
+ ctx = container_of(head, struct perf_event_context, rcu_head);
+ kfree(ctx->task_ctx_data);
+ kfree(ctx);
+}
+
static void put_ctx(struct perf_event_context *ctx)
{
if (atomic_dec_and_test(&ctx->refcount)) {
@@ -905,7 +914,7 @@ static void put_ctx(struct perf_event_context *ctx)
put_ctx(ctx->parent_ctx);
if (ctx->task)
put_task_struct(ctx->task);
- kfree_rcu(ctx, rcu_head);
+ call_rcu(&ctx->rcu_head, free_ctx);
}
}

@@ -3044,6 +3053,14 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task)
if (!ctx)
return NULL;

+ if (task && pmu->task_ctx_size > 0) {
+ ctx->task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+ if (!ctx->task_ctx_data) {
+ kfree(ctx);
+ return NULL;
+ }
+ }
+
__perf_event_init_context(ctx);
if (task) {
ctx->task = task;
--
1.8.5.3

2014-02-18 06:11:30

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 03/14] perf, x86: use context switch callback to flush LBR stack

Enable the PMU context switch callback when the LBR is used. Use the
callback to flush the LBR stack when a task is scheduled in. This allows
us to move the code that flushes the LBR stack from the perf core to the
x86 perf code.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 ---
arch/x86/kernel/cpu/perf_event.h | 2 -
arch/x86/kernel/cpu/perf_event_intel.c | 14 +-----
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 32 +++++++-----
include/linux/perf_event.h | 6 ---
kernel/events/core.c | 78 ------------------------------
6 files changed, 21 insertions(+), 118 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 68c0314..534e859 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1856,12 +1856,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
x86_pmu.sched_task(ctx, sched_in);
}

-static void x86_pmu_flush_branch_stack(void)
-{
- if (x86_pmu.flush_branch_stack)
- x86_pmu.flush_branch_stack();
-}
-
void perf_check_microcode(void)
{
if (x86_pmu.check_microcode)
@@ -1888,7 +1882,6 @@ static struct pmu pmu = {
.commit_txn = x86_pmu_commit_txn,

.event_idx = x86_pmu_event_idx,
- .flush_branch_stack = x86_pmu_flush_branch_stack,
.sched_task = x86_pmu_sched_task,
};

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 551f09b..ccbe3fd 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -150,7 +150,6 @@ struct cpu_hw_events {
* Intel LBR bits
*/
int lbr_users;
- void *lbr_context;
struct perf_branch_stack lbr_stack;
struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
struct er_account *lbr_sel;
@@ -426,7 +425,6 @@ struct x86_pmu {
void (*cpu_dead)(int cpu);

void (*check_microcode)(void);
- void (*flush_branch_stack)(void);
void (*sched_task)(struct perf_event_context *ctx,
bool sched_in);

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0fa4f24..4325bae 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2038,18 +2038,6 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

-static void intel_pmu_flush_branch_stack(void)
-{
- /*
- * Intel LBR does not tag entries with the
- * PID of the current task, then we need to
- * flush it on ctxsw
- * For now, we simply reset it
- */
- if (x86_pmu.lbr_nr)
- intel_pmu_lbr_reset();
-}
-
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2101,7 +2089,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .flush_branch_stack = intel_pmu_flush_branch_stack,
+ .sched_task = intel_pmu_lbr_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 1ae2ec5..7ff2a99 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -177,24 +177,32 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}

-void intel_pmu_lbr_enable(struct perf_event *event)
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
if (!x86_pmu.lbr_nr)
return;

/*
- * Reset the LBR stack if we changed task context to
- * avoid data leaks.
+ * It is necessary to flush the stack on context switch. This happens
+ * when the branch stack does not tag its entries with the pid of the
+ * current task.
*/
- if (event->ctx->task && cpuc->lbr_context != event->ctx) {
+ if (sched_in)
intel_pmu_lbr_reset();
- cpuc->lbr_context = event->ctx;
- }
+}
+
+void intel_pmu_lbr_enable(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ if (!x86_pmu.lbr_nr)
+ return;
+
cpuc->br_sel = event->hw.branch_reg.reg;

cpuc->lbr_users++;
+ if (cpuc->lbr_users == 1)
+ perf_sched_cb_enable(event->ctx->pmu);
}

void intel_pmu_lbr_disable(struct perf_event *event)
@@ -207,10 +215,10 @@ void intel_pmu_lbr_disable(struct perf_event *event)
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);

- if (cpuc->enabled && !cpuc->lbr_users) {
- __intel_pmu_lbr_disable();
- /* avoid stale pointer */
- cpuc->lbr_context = NULL;
+ if (!cpuc->lbr_users) {
+ perf_sched_cb_disable(event->ctx->pmu);
+ if (cpuc->enabled)
+ __intel_pmu_lbr_disable();
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index adc20f2..80ddc0c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -248,11 +248,6 @@ struct pmu {
int (*event_idx) (struct perf_event *event); /*optional */

/*
- * flush branch stack on context-switches (needed in cpu-wide mode)
- */
- void (*flush_branch_stack) (void);
-
- /*
* PMU callback for context-switches. optional
*/
void (*sched_task) (struct perf_event_context *ctx,
@@ -498,7 +493,6 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
- int nr_branch_stack; /* branch_stack evt */
struct rcu_head rcu_head;
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 350e566..202e3fd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -141,7 +141,6 @@ enum event_type_t {
*/
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
static DEFINE_PER_CPU(int, perf_sched_cb_usages);

static atomic_t nr_mmap_events __read_mostly;
@@ -1145,9 +1144,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
if (is_cgroup_event(event))
ctx->nr_cgroups++;

- if (has_branch_stack(event))
- ctx->nr_branch_stack++;
-
list_add_rcu(&event->event_entry, &ctx->event_list);
if (!ctx->nr_events)
perf_pmu_rotate_start(ctx->pmu);
@@ -1310,9 +1306,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
cpuctx->cgrp = NULL;
}

- if (has_branch_stack(event))
- ctx->nr_branch_stack--;
-
ctx->nr_events--;
if (event->attr.inherit_stat)
ctx->nr_stat--;
@@ -2592,65 +2585,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
perf_pmu_rotate_start(ctx->pmu);
}

-/*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
- struct perf_cpu_context *cpuctx;
- struct pmu *pmu;
- unsigned long flags;
-
- /* no need to flush branch stack if not changing task */
- if (prev == task)
- return;
-
- local_irq_save(flags);
-
- rcu_read_lock();
-
- list_for_each_entry_rcu(pmu, &pmus, entry) {
- cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
- /*
- * check if the context has at least one
- * event using PERF_SAMPLE_BRANCH_STACK
- */
- if (cpuctx->ctx.nr_branch_stack > 0
- && pmu->flush_branch_stack) {
-
- pmu = cpuctx->ctx.pmu;
-
- perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
- perf_pmu_disable(pmu);
-
- pmu->flush_branch_stack();
-
- perf_pmu_enable(pmu);
-
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- }
- }
-
- rcu_read_unlock();
-
- local_irq_restore(flags);
-}

/*
* Called from scheduler to add the events of the current task
@@ -2684,10 +2618,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
perf_cgroup_sched_in(prev, task);

- /* check for system-wide branch_stack events */
- if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
- perf_branch_stack_sched_in(prev, task);
-
if (__get_cpu_var(perf_sched_cb_usages))
perf_pmu_sched_task(prev, task, true);
}
@@ -3256,10 +3186,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
}
@@ -6685,10 +6611,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_inc(&per_cpu(perf_cgroup_events, cpu));
}
--
1.8.5.3

2014-02-18 06:12:11

by Yan, Zheng

[permalink] [raw]
Subject: [PATCH v3 01/14] perf, x86: Reduce lbr_sel_map size

The index of lbr_sel_map is the bit value of perf branch_sample_type.
PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map uses
4096 bytes. By using the bit shift as the index, we can reduce each
lbr_sel_map to 40 bytes.
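
For reference, a minimal standalone sketch of the size argument (not kernel
code; it only mirrors the array declarations changed below and assumes a
4-byte int):

#include <stdio.h>

#define BRANCH_MAX_SHIFT 10                        /* number of branch type bits */
#define BRANCH_MAX       (1u << BRANCH_MAX_SHIFT)  /* old PERF_SAMPLE_BRANCH_MAX */

static const int map_by_bit_value[BRANCH_MAX];        /* old: indexed by 1 << n */
static const int map_by_bit_shift[BRANCH_MAX_SHIFT];  /* new: indexed by n */

int main(void)
{
    /* prints "4096 -> 40 bytes" on a typical LP64 target */
    printf("%zu -> %zu bytes\n",
           sizeof(map_by_bit_value), sizeof(map_by_bit_shift));
    return 0;
}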

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 4 +++
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 50 ++++++++++++++----------------
include/uapi/linux/perf_event.h | 42 +++++++++++++++++--------
3 files changed, 56 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 4972c24..518025e 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -469,6 +469,10 @@ struct x86_pmu {
struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
};

+enum {
+ PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+};
+
#define x86_add_quirk(func_) \
do { \
static struct x86_pmu_quirk __quirk __initdata = { \
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d82d155..1ae2ec5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -69,10 +69,6 @@ static enum {
#define LBR_FROM_FLAG_IN_TX (1ULL << 62)
#define LBR_FROM_FLAG_ABORT (1ULL << 61)

-#define for_each_branch_sample_type(x) \
- for ((x) = PERF_SAMPLE_BRANCH_USER; \
- (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
-
/*
* x86control flow change classification
* x86control flow changes include branches, interrupts, traps, faults
@@ -400,14 +396,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
{
struct hw_perf_event_extra *reg;
u64 br_type = event->attr.branch_sample_type;
- u64 mask = 0, m;
- u64 v;
+ u64 mask = 0, v;
+ int i;

- for_each_branch_sample_type(m) {
- if (!(br_type & m))
+ for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+ if (!(br_type & (1ULL << i)))
continue;

- v = x86_pmu.lbr_sel_map[m];
+ v = x86_pmu.lbr_sel_map[i];
if (v == LBR_NOT_SUPP)
return -EOPNOTSUPP;

@@ -662,33 +658,33 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
/*
* Map interface branch filters onto LBR filters
*/
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
- [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY,
- [PERF_SAMPLE_BRANCH_USER] = LBR_USER,
- [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL,
- [PERF_SAMPLE_BRANCH_HV] = LBR_IGN,
- [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP
- | LBR_IND_JMP | LBR_FAR,
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_REL_JMP
+ | LBR_IND_JMP | LBR_FAR,
/*
* NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
*/
- [PERF_SAMPLE_BRANCH_ANY_CALL] =
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
/*
* NHM/WSM erratum: must include IND_JMP to capture IND_CALL
*/
- [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
};

-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
- [PERF_SAMPLE_BRANCH_ANY] = LBR_ANY,
- [PERF_SAMPLE_BRANCH_USER] = LBR_USER,
- [PERF_SAMPLE_BRANCH_KERNEL] = LBR_KERNEL,
- [PERF_SAMPLE_BRANCH_HV] = LBR_IGN,
- [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
- [PERF_SAMPLE_BRANCH_ANY_CALL] = LBR_REL_CALL | LBR_IND_CALL
- | LBR_FAR,
- [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL,
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+ [PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_ANY,
+ [PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_USER,
+ [PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_KERNEL,
+ [PERF_SAMPLE_BRANCH_HV_SHIFT] = LBR_IGN,
+ [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_RETURN | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_REL_CALL | LBR_IND_CALL
+ | LBR_FAR,
+ [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL,
};

/* core */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 853bc1c..95e7022 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -151,20 +151,36 @@ enum perf_event_sample_format {
* The branch types can be combined, however BRANCH_ANY covers all types
* of branches and therefore it supersedes all the other types.
*/
+enum perf_branch_sample_type_shift {
+ PERF_SAMPLE_BRANCH_USER_SHIFT = 0, /* user branches */
+ PERF_SAMPLE_BRANCH_KERNEL_SHIFT = 1, /* kernel branches */
+ PERF_SAMPLE_BRANCH_HV_SHIFT = 2, /* hypervisor branches */
+
+ PERF_SAMPLE_BRANCH_ANY_SHIFT = 3, /* any branch types */
+ PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT = 4, /* any call branch */
+ PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT = 5, /* any return branch */
+ PERF_SAMPLE_BRANCH_IND_CALL_SHIFT = 6, /* indirect calls */
+ PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT = 7, /* transaction aborts */
+ PERF_SAMPLE_BRANCH_IN_TX_SHIFT = 8, /* in transaction */
+ PERF_SAMPLE_BRANCH_NO_TX_SHIFT = 9, /* not in transaction */
+
+ PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
+};
+
enum perf_branch_sample_type {
- PERF_SAMPLE_BRANCH_USER = 1U << 0, /* user branches */
- PERF_SAMPLE_BRANCH_KERNEL = 1U << 1, /* kernel branches */
- PERF_SAMPLE_BRANCH_HV = 1U << 2, /* hypervisor branches */
-
- PERF_SAMPLE_BRANCH_ANY = 1U << 3, /* any branch types */
- PERF_SAMPLE_BRANCH_ANY_CALL = 1U << 4, /* any call branch */
- PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << 5, /* any return branch */
- PERF_SAMPLE_BRANCH_IND_CALL = 1U << 6, /* indirect calls */
- PERF_SAMPLE_BRANCH_ABORT_TX = 1U << 7, /* transaction aborts */
- PERF_SAMPLE_BRANCH_IN_TX = 1U << 8, /* in transaction */
- PERF_SAMPLE_BRANCH_NO_TX = 1U << 9, /* not in transaction */
-
- PERF_SAMPLE_BRANCH_MAX = 1U << 10, /* non-ABI */
+ PERF_SAMPLE_BRANCH_USER = 1U << PERF_SAMPLE_BRANCH_USER_SHIFT,
+ PERF_SAMPLE_BRANCH_KERNEL = 1U << PERF_SAMPLE_BRANCH_KERNEL_SHIFT,
+ PERF_SAMPLE_BRANCH_HV = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
+
+ PERF_SAMPLE_BRANCH_ANY = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_CALL = 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ANY_RETURN = 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+ PERF_SAMPLE_BRANCH_IND_CALL = 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+ PERF_SAMPLE_BRANCH_ABORT_TX = 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_IN_TX = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
+ PERF_SAMPLE_BRANCH_NO_TX = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
+
+ PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};

#define PERF_SAMPLE_BRANCH_PLM_ALL \
--
1.8.5.3

2014-02-23 19:47:57

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

Could you add the Reviewed-by: on the patches I already
reviewed? So I can focus on the changes you made and continue
testing on my HSW system.

Thanks.

On Tue, Feb 18, 2014 at 7:07 AM, Yan, Zheng <[email protected]> wrote:
> <.. full cover letter snipped ..>

2014-02-24 01:08:27

by Yan, Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On 02/24/2014 03:47 AM, Stephane Eranian wrote:
> Could you add the Reviewed-by: on the patches I already
> reviewed? So I can focus on the changes you made and continue
> testing on my HSW system.
>
Hi,

I got your Reviewed-by for patches 1, 5, 6 and 8. Patch 6 was changed in this
series. So only patches 1, 6 and 8 are left (they are all simple changes). I
will add your Reviewed-by in the next version.

Regards
Yan, Zheng

> Thanks.
>
> On Tue, Feb 18, 2014 at 7:07 AM, Yan, Zheng <[email protected]> wrote:
>> <.. full cover letter snipped ..>

2014-02-24 07:14:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support



Your patches still only have 3-4 lines of Changelog, I'll continue
ignoring them.

2014-02-26 02:39:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On 02/17/2014 10:07 PM, Yan, Zheng wrote:
>
> This patch series adds LBR call stack support. User can enabled/disable
> this through an sysfs attribute file in the CPU PMU directory:
> echo 1 > /sys/bus/event_source/devices/cpu/lbr_callstack

This seems like an unpleasant way to control this. It would be handy to
be able to control this as an option to perf record.

--Andy

2014-02-26 07:04:04

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 3:39 AM, Andy Lutomirski <[email protected]> wrote:
> On 02/17/2014 10:07 PM, Yan, Zheng wrote:
>>
>> This patch series adds LBR call stack support. User can enabled/disable
>> this through an sysfs attribute file in the CPU PMU directory:
>> echo 1 > /sys/bus/event_source/devices/cpu/lbr_callstack
>
> This seems like an unpleasant way to control this. It would be handy to
> be able to control this as an option to perf record.
>
That would mean you'd have to be root to use perf.
Or are you suggesting a perf event option? But then you'd expose an arch-specific
feature at the API level.

> --Andy

2014-02-26 09:00:13

by Yan, Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On 02/26/2014 03:04 PM, Stephane Eranian wrote:
> On Wed, Feb 26, 2014 at 3:39 AM, Andy Lutomirski <[email protected]> wrote:
>> On 02/17/2014 10:07 PM, Yan, Zheng wrote:
>>>
>>> This patch series adds LBR call stack support. User can enabled/disable
>>> this through an sysfs attribute file in the CPU PMU directory:
>>> echo 1 > /sys/bus/event_source/devices/cpu/lbr_callstack
>>
>> This seems like an unpleasant way to control this. It would be handy to
>> be able to control this as an option to perf record.
>>
> That would mean you'd be root for perf.
> Or are you suggesting a perf event option? But then, you'd expose arch-specific
> feature at the API level.
>

Another option is to enable this feature by default if the hardware supports it. But it
causes LBR contention between events that want the callstack and events that want the LBR.

Regards
Yan, Zheng

2014-02-26 16:03:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 12:04 AM, Stephane Eranian <[email protected]> wrote:
> On Wed, Feb 26, 2014 at 3:39 AM, Andy Lutomirski <[email protected]> wrote:
>> On 02/17/2014 10:07 PM, Yan, Zheng wrote:
>>>
>>> This patch series adds LBR call stack support. User can enabled/disable
>>> this through an sysfs attribute file in the CPU PMU directory:
>>> echo 1 > /sys/bus/event_source/devices/cpu/lbr_callstack
>>
>> This seems like an unpleasant way to control this. It would be handy to
>> be able to control this as an option to perf record.
>>
> That would mean you'd be root for perf.
> Or are you suggesting a perf event option? But then, you'd expose arch-specific
> feature at the API level.
>

I'm suggesting a perf event option, just like the way that PEBS works.

--Andy
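
(For concreteness, one shape such a per-event option could take is an extra
branch_sample_type bit set through perf_event_attr, analogous to requesting
PEBS precision via attr.precise_ip. The sketch below is purely illustrative;
the _SKETCH bit is invented here and is not defined by this series.)

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invented for illustration only; not part of this patch series. */
#define PERF_SAMPLE_BRANCH_CALL_STACK_SKETCH  (1U << 11)

static int open_callstack_event(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100003;
    attr.exclude_kernel = 1;
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
    attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                              PERF_SAMPLE_BRANCH_CALL_STACK_SKETCH;

    /* current task, any CPU, no group, no flags */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}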

2014-02-26 18:55:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

> I'm suggesting a perf event option, just like the way that PEBS works.

Right now it's a somewhat experimental feature and just having
the sysctl is fine. If it turns out that is what everyone uses,
such an option could still be added later.

I suspect most people would still use FP if they can, and just use
the LBRs if that doesn't work.

-Andi

2014-02-26 18:59:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 10:55 AM, Andi Kleen <[email protected]> wrote:
>> I'm suggesting a perf event option, just like the way that PEBS works.
>
> Right now it's a somewhat experimental feature and just having
> the sysctl is fine. If it turns out that is what everyone uses
> such an option could be still added later.

I'm a bit worried that the sysfs file will be stuck as ABI forever,
though. Its presence will make adding a different configuration
mechanism awkward.

>
> I suspect most people would still use FP if they can, just use
> the LBRs if that doesn't work.

I wonder if anyone who uses perf for userspace profiling *ever* uses
FP and gets away with it. There's precious little userspace software
compiled with frame pointers these days on most architectures.

I have a concrete reason for this question: it would be nice to
compile the vDSO with frame pointers off. IIRC there would be a
significant performance gain, and I think the only thing that would
break is perf. But it looks like perf will have nice elfutils unwind
support in 3.15, and if FP support is useless anyway...

--Andy

2014-02-26 19:20:01

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On 2/26/14, 11:59 AM, Andy Lutomirski wrote:

> I wonder if anyone who uses perf for userspace profiling *ever* uses
> FP and gets away with it. There's precious little userspace software
> compiled with frame pointers these days on most architectures.

yes and yes. With control over the entire stack we are making sure
frame-pointers are enabled as much as possible.

David

2014-02-26 19:26:05

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 11:19 AM, David Ahern <[email protected]> wrote:
> On 2/26/14, 11:59 AM, Andy Lutomirski wrote:
>
>> I wonder if anyone who uses perf for userspace profiling *ever* uses
>> FP and gets away with it. There's precious little userspace software
>> compiled with frame pointers these days on most architectures.
>
>
> yes and yes. With control over the entire stack we are making sure
> frame-pointers are enabled as much as possible.
>

I'm curious why.

Maybe this should be a config option. Anyone using a standard distro
is running a nearly completely frame-pointer-omitted userspace these
days.

--Andy

>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-02-26 20:15:05

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On 2/26/14, 12:25 PM, Andy Lutomirski wrote:
> On Wed, Feb 26, 2014 at 11:19 AM, David Ahern <[email protected]> wrote:
>> On 2/26/14, 11:59 AM, Andy Lutomirski wrote:
>>
>>> I wonder if anyone who uses perf for userspace profiling *ever* uses
>>> FP and gets away with it. There's precious little userspace software
>>> compiled with frame pointers these days on most architectures.
>>
>>
>> yes and yes. With control over the entire stack we are making sure
>> frame-pointers are enabled as much as possible.
>>
>
> I'm curious why.

Is there some reason not to enable frame pointers?

The fp method has much less overhead than dwarf, and good, clear callchains
are important.

>
> Maybe this should be a config option. Anyone using a standard distro
> is running a nearly completely frame-pointer-omitted userspace these
> days.

Does WRL or Yocto fall into that 'standard distro' comment? Fairly easy
to enable frame-pointers.

David

2014-02-26 20:27:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 12:14 PM, David Ahern <[email protected]> wrote:
> On 2/26/14, 12:25 PM, Andy Lutomirski wrote:
>>
>> On Wed, Feb 26, 2014 at 11:19 AM, David Ahern <[email protected]> wrote:
>>>
>>> On 2/26/14, 11:59 AM, Andy Lutomirski wrote:
>>>
>>>> I wonder if anyone who uses perf for userspace profiling *ever* uses
>>>> FP and gets away with it. There's precious little userspace software
>>>> compiled with frame pointers these days on most architectures.
>>>
>>>
>>>
>>> yes and yes. With control over the entire stack we are making sure
>>> frame-pointers are enabled as much as possible.
>>>
>>
>> I'm curious why.
>
>
> Is there some reason not to enable frame pointers?

Speed. FPO saves one register (a big deal on x86_32; not so important
on x86_64) but also saves a few cycles on function entry and exit,
which is a bigger deal for small functions.
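
(To make the per-function cost concrete, here is what is being saved; a
sketch of typical x86_64 gcc output is in the comment, and the exact code of
course depends on compiler and flags.)

/*
 * What -fno-omit-frame-pointer adds to every non-inlined function,
 * roughly (typical x86_64 gcc output; exact code varies with the
 * compiler and flags):
 *
 *   with frame pointer:       without frame pointer:
 *     push %rbp
 *     mov  %rsp,%rbp
 *     ... function body ...     ... function body ...
 *     pop  %rbp
 *     ret                       ret
 */
static int counter;

void tiny(void)
{
    counter++;
}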

>
> fp method has much less overhead than dwarf, and good, clear callchains are
> important.
>

Agreed about the good, clear callchains. But DWARF seems to work
pretty well, and you only have the overhead when you're actually
debugging or profiling.

>
>>
>> Maybe this should be a config option. Anyone using a standard distro
>> is running a nearly completely frame-pointer-omitted userspace these
>> days.
>
>
> Does WRL or Yocto fall into that 'standard distro' comment? Fairly easy to
> enable frame-pointers.

Fair enough :)

--Andy

2014-02-26 20:32:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 01:14:58PM -0700, David Ahern wrote:
> Does WRL or Yocto fall into that 'standard distro' comment? Fairly easy to
> enable frame-pointers.

Yeah, same for Gentoo, I always build world with frame-pointers enabled.

2014-02-26 20:53:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

> Is there some reason not to enable frame pointers?

It makes code slower.

Especially on Atom CPUs, where it causes pipeline stalls, but
also to some degree on others, because you lose one register and
spend a little bit of time setting it up, so making small
functions more expensive.

Another issue is that you can't enable it on a lot of existing
libraries, sometimes not even with a recompile. For example
glibc assembler functions do not support it at all, which
is a very common case.

They are designed to use dwarf, but in practice dwarf
is very slow (perf has to save the stack for every sample)
and in practice doesn't always work (too small stack saving,
wrong annotations, out of date or broken dwarf library etc.)

LBR callstack mode is not perfect either, and it has
its own tradeoffs, but in many cases it seems to be a good
and more efficient replacement for dwarf, when FP is not available.

-Andi

2014-02-26 21:16:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 09:53:22PM +0100, Andi Kleen wrote:
> > Is there some reason not to enable frame pointers?
>
> It makes code slower.
>
> Especially on Atom CPUs, where it causes pipeline stalls, but

Yeah, but nobody sane cares about the in-order atom crap CPUs.

> also to some degree on others, because you lose one register and
> spend a little bit of time setting it up, so making small
> functions more expensive.

Luckily GCC is rather good at inlining a lot of those. Esp. with LTO
like stuff.

> Another issue is that you can't enable it on a lot of existing
> libraries, sometimes not even with a recompile. For example
> glibc assembler functions do not support it at all, which
> is a very common case.

They're mostly all leaf functions, so it doesn't matter much if
anything.

> They are designed to use dwarf, but in practice dwarf
> is very slow (perf has to save the stack for every sample)
> and in practice doesn't always work (too small stack saving,
> wrong annotations, out of date or broken dwarf library etc.)
>
> LBR callstack mode is not perfect either, and it has
> its own tradeoffs, but in many cases it seems to be a good
> and more efficient replacement for dwarf, when FP is not available.

But except for the lobbying Intel put into disabling FP because of that
piece of shit Atom we'd all still have it enabled.

2014-02-26 21:33:54

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

> > Another issue is that you can't enable it on a lot of existing
> > libraries, sometimes not even with a recompile. For example
> > glibc assembler functions do not support it at all, which
> > is a very common case.
>
> They're mostly all leaf functions, so it doesn't matter much if
> anything.

If you assume they don't destroy FP -- which many of them do.
A lot of str* and some mem* functions are problematic
(note that it depends on which CPU you use).

A common problem I ran into was that it was impossible
to profile through mutex locks (now fixed in latest glibc)

>
> > They are designed to use dwarf, but in practice dwarf
> > is very slow (perf has to save the stack for every sample)
> > and in practice doesn't always work (too small stack saving,
> > wrong annotations, out of date or broken dwarf library etc.)
> >
> > LBR callstack mode is not perfect either, and it has
> > its own tradeoffs, but in many cases it seems to be a good
> > and more efficient replacement for dwarf, when FP is not available.
>
> But except for the lobbying Intel put into disabling FP because of that
> piece of shit Atom we'd all still have it enabled.

The original reason for getting rid of FP on 64bit (and later 32bit) was
the original AMD K8, which had pipeline stalls similar to Atom's. That was
long before Atom existed. Most older CPUs had similar problems,
so it was eventually also done on 32bit.

-Andi

P.S.: Congratulations on getting every single statement
in the email wrong. That's a full jackpot.

--
[email protected] -- Speaking for myself only.

2014-02-26 21:34:49

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On 2/26/14, 1:53 PM, Andi Kleen wrote:
>> Is there some reason not to enable frame pointers?
>
> It makes code slower.

Sure, there is some overhead because of the push, mov, pop instructions
per function. But take, for example, the simple program below. Compile it
with and without frame pointers:

gcc -Wall -fno-omit-frame-pointer fp-test.c -owith-fp
gcc -Wall -fomit-frame-pointer fp-test.c -ono-fp

$ time ./with-fp
real 0m9.187s
user 0m9.174s
sys 0m0.001s

$ time ./no-fp
real 0m11.749s
user 0m11.731s
sys 0m0.001s

>
> Especially on Atom CPUs, where it causes pipeline stalls, but
> also to some degree on others, because you lose one register and
> spend a little bit of time setting it up, so making small
> functions more expensive.
>
> Another issue is that you can't enable it on a lot of existing
> libraries, sometimes not even with a recompile. For example
> glibc assembler functions do not support it at all, which
> is a very common case.
>
> They are designed to use dwarf, but in practice dwarf
> is very slow (perf has to save the stack for every sample)
> and in practice doesn't always work (too small stack saving,
> wrong annotations, out of date or broken dwarf library etc.)

dwarf is often just not usable:

$ perf record --call-graph dwarf -- ./no-fp
[ perf record: Woken up 1521 times to write data ]
[ perf record: Captured and wrote 380.567 MB perf.data (~16627233 samples) ]
0x4003cf0 [0]: failed to process type: 0

Compared to the fp route:
$ perf record -g -- ./with-fp
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 2.948 MB perf.data (~128816 samples) ]

That is a huge difference. Not to mention the fact that the dwarf data
file is useless, which means radically lowering the sample rate and
increasing the mmap size.

The efficiency of fp is worth the small amount of (theoretical) overhead
-- at least for us with Xeon CPUs.
>
> LBR callstack mode is not perfect either, and it has
> its own tradeoffs, but in many cases it seems to be a good
> and more efficient replacement for dwarf, when FP is not available.

A Haswell-only option -- based on the subject line?

David

--

$ cat fp-test.c

#include <stdlib.h>

static int i;

void e(void)
{
    i++;
}
void d(void)
{
    e();
}
void c(void)
{
    d();
}
void b(void)
{
    c();
}
void a(void)
{
    b();
}

int main(int argc, char *argv[])
{
    int iter = 1000000000;

    if (argc > 1)
        iter = atoi(argv[1]);

    while (--iter > 0)
        a();

    return 0;
}

2014-02-26 21:42:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 02:34:31PM -0700, David Ahern wrote:
> On 2/26/14, 1:53 PM, Andi Kleen wrote:
> >>Is there some reason not to enable frame pointers?
> >
> >It makes code slower.
>
> Sure there is some overhead because of the push, mov, pop
> instructions per function. But, take for example the simple program
> below. Compile with and without frame pointers

I'm not criticizing your choice, just saying that
it's often not practical to get FP everywhere
(and I bet you missed some cases too)

<.. micro benchmark snipped...>

The CPU you're using has special hardware to avoid the main
problems with FP. It can still cause slowdowns in other
cases (e.g. one fewer register). But there are other
CPUs where this special hardware is not available.

You may not care about these cases, but other people do.

> >wrong annotations, out of date or broken dwarf library etc.)
>
> dwarf is often just not usable:

I agree (although I haven't seen that error before)


> That is a huge difference. Not to mention the fact the dwarf file is
> useless which means radically lowering sample rate and increasing
> mmap size.

Yep.

It's just fundamentally inefficient for profiling.

-Andi

2014-02-27 09:09:53

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 10:42 PM, Andi Kleen <[email protected]> wrote:
> On Wed, Feb 26, 2014 at 02:34:31PM -0700, David Ahern wrote:
>> On 2/26/14, 1:53 PM, Andi Kleen wrote:
>> >>Is there some reason not to enable frame pointers?
>> >
>> >It makes code slower.
>>

That is what I have been told by compiler people too.

This is especially true of small functions, which C++ object-oriented
code is full of. And that's how large programs are written these
days.

The other problem with FP is that you need to have everything
compiled with it. It is not always obvious how to check this without
going to the assembly. There is no indication in the ELF headers, AFAIK.


>> Sure there is some overhead because of the push, mov, pop
>> instructions per function. But, take for example the simple program
>> below. Compile with and without frame pointers
>
> I'm not criticizing your choice, just saying that
> it's often not practical to get FP everywhere
> (and I bet you missed some cases too)
>
> <.. micro benchmark snipped...>
>
> The CPU you're using has special hardware to avoid the main
> problems with FP. It can still cause slow downs in other
> cases (e.g. one register less). But there are other
> CPUs where this special hardware is not available.
>
> You may not care about these cases, but other people do.
>
>> >wrong annotations, out of date or broken dwarf library etc.)
>>
>> dwarf is often just not usable:
>

The first problem with the dwarf approach is that it incurs some
overhead during sampling. You need to copy a chunk of the user stack
in each sample. Not clear how much you need.

The second problem is security. You are saving random chunks of stack
in the perf.data file. Who knows what it contains. In many environments this
is a showstopper.
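
(For reference, the size of that chunk is bounded by the caller: perf's
dwarf mode asks the kernel for a per-sample user stack dump through
perf_event_attr. A rough sketch of the fields involved; the register mask
is arch specific and the value below is only a placeholder, and 8192 bytes
is a typical default dump size in the perf tool.)

#include <linux/perf_event.h>

/* Sketch of what "perf record --call-graph dwarf" requests per sample. */
static void request_dwarf_callchain(struct perf_event_attr *attr,
                                    unsigned int dump_bytes)
{
    attr->sample_type |= PERF_SAMPLE_CALLCHAIN |
                         PERF_SAMPLE_REGS_USER |
                         PERF_SAMPLE_STACK_USER;
    attr->sample_regs_user  = 0xff;        /* placeholder: arch-specific register mask */
    attr->sample_stack_user = dump_bytes;  /* e.g. 8192 bytes of user stack per sample */
}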

The Haswell LBR call stack is a good compromise, though as Andi
pointed out it has its tradeoffs. It does not work in all cases,
but it has the speed and the security. It is model specific, but I
can live with that. The PMU always comes with incremental
improvements.

> I agree (although I haven't seen that error before)
>
>
>> That is a huge difference. Not to mention the fact the dwarf file is
>> useless which means radically lowering sample rate and increasing
>> mmap size.
>
> Yep.
>
> It's just fundamentally inefficient for profiling.
>
> -Andi

2014-02-27 12:35:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support


* Andy Lutomirski <[email protected]> wrote:

> On Wed, Feb 26, 2014 at 10:55 AM, Andi Kleen <[email protected]> wrote:
>
> >> I'm suggesting a perf event option, just like the way that PEBS
> >> works.
> >
> > Right now it's a somewhat experimental feature and just having the
> > sysctl is fine. [...]

NACKed-by: Ingo Molnar <[email protected]>

> > [...] If it turns out that is what everyone uses such an option
> > could be still added later.
>
> I'm a bit worried that the syscall will be stuck as ABI forever,
> though. Its presence will make adding a different configuration
> mechanism awkward.

The sysctl is a usability non-starter, obviously.

Thanks,

Ingo

2014-02-27 16:08:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Thu, Feb 27, 2014 at 01:35:07PM +0100, Ingo Molnar wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > On Wed, Feb 26, 2014 at 10:55 AM, Andi Kleen <[email protected]> wrote:
> >
> > >> I'm suggesting a perf event option, just like the way that PEBS
> > >> works.
> > >
> > > Right now it's a somewhat experimental feature and just having the
> > > sysctl is fine. [...]
>
> NACKed-by: Ingo Molnar <[email protected]>

You forgot to propose an alternative?

-Andi

2014-04-09 11:49:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Feb 26, 2014 at 12:26:43PM -0800, Andy Lutomirski wrote:
> Speed. FPO saves one register (a big deal on x86_32; not so important
> on x86_64) but also saves a few cycles on function entry and exit,
> which is a bigger deal for small functions.

So I thought that LTO was supposed to get rid of a lot of the small
functions and inline them.

I've also heard that in practice this is very 'hard', and thus we're
still stuck with a gazillion small functions (mostly C++ people suffer
from this).

Can anybody give a concise explanation on why LTO doesn't rid us of
these small functions or point to a web resource that describes the
problem?

2014-04-09 16:48:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support

On Wed, Apr 09, 2014 at 01:48:57PM +0200, Peter Zijlstra wrote:
> On Wed, Feb 26, 2014 at 12:26:43PM -0800, Andy Lutomirski wrote:
> > Speed. FPO saves one register (a big deal on x86_32; not so important
> > on x86_64) but also saves a few cycles on function entry and exit,
> > which is a bigger deal for small functions.
>
> So I thought that LTO was supposed to get rid of a lot of the small
> functions and inline them.

It does it when it can (no indirect calls), when it thinks it's profitable,
and when it won't increase code size too much.

>
> I've also heard that in practice this is very 'hard', and thus we're
> still stuck with a gazillion small functions (mostly C++ people suffer
> from this).

They need devirtualization, which we cannot do currently in the kernel.

>
> Can anybody give a concise explanation on why LTO doesn't rid us of
> these small functions or point to a web resource that describes the
> problem?

It depends on the code, of course.
On one of my LTO builds I have ~10% fewer functions in System.map.

Actual results will of course vary with the config.

-Andi


--
[email protected] -- Speaking for myself only.

2014-04-09 17:40:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf, x86: Haswell LBR call stack support


BTW the whole discussion is rather pointless.

We have to profile the software as it is, not as we wish it to be.

That means: small functions, often no frame pointer, all kinds of crappy
code and missing information.

And then doing it all with as little overhead as possible.

I think on these metrics callstack LBR is attractive for many
(but not all) cases.

-Andi

--
[email protected] -- Speaking for myself only.