2014-11-13 00:33:52

by Liang, Kan

Subject: [PATCH V2 0/3] perf tool: Haswell LBR call stack support (user)

From: Kan Liang <[email protected]>

This is the user space patch for Haswell LBR call stack support.
For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default on
64bit code (and on modern 32bit gccs), so there are many binaries around
that do not use frame pointers. Profiling unchanged production code is
very useful in practice. On some CPUs, maintaining frame pointers also
has a high cost. Dwarf2 unwinding does not always work either, and is
extremely slow (up to 20% overhead).

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are
executed the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative way to
get the callgraph. It has some limitations too, but should work in most cases
and is significantly faster than dwarf. Frame pointer unwinding is still
the best default, but LBR call stack is a good alternative when nothing
else works.
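
The push-on-call, pop-on-return behaviour described above can be modelled
in a few lines of C. This is only a toy software model, not the hardware
implementation: the names lbr_stack, lbr_on_call and lbr_on_return are
hypothetical, and LBR_DEPTH is an assumed depth chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LBR_DEPTH 16	/* assumed number of LBR entries, illustrative only */

struct lbr_entry {
	uint64_t from;	/* address of the call instruction (caller side) */
	uint64_t to;	/* address of the call target (callee side) */
};

struct lbr_stack {
	struct lbr_entry e[LBR_DEPTH];	/* e[0] is the most recent call */
	size_t nr;			/* number of valid entries */
};

/* A call pushes a from/to pair; when the stack is full the oldest is lost. */
void lbr_on_call(struct lbr_stack *s, uint64_t from, uint64_t to)
{
	size_t keep;

	assert(s != NULL);
	keep = s->nr < LBR_DEPTH ? s->nr : LBR_DEPTH - 1;
	memmove(&s->e[1], &s->e[0], keep * sizeof(s->e[0]));
	s->e[0].from = from;
	s->e[0].to = to;
	s->nr = keep + 1;
}

/* A return pops the most recently recorded call. */
void lbr_on_return(struct lbr_stack *s)
{
	assert(s != NULL);
	if (s->nr == 0)
		return;	/* unmatched return: nothing to pop in this model */
	memmove(&s->e[0], &s->e[1], (s->nr - 1) * sizeof(s->e[0]));
	s->nr--;
}
```

After a sequence of calls and matching returns, the entries left in the
model are exactly the active call frames, most recent first, which is
what makes the records usable as a call chain.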

A new call chain recording option "lbr" is introduced into perf tool for
LBR call stack. The user can use --call-graph lbr to get the call stack
information from hardware.

When profiling bc(1) on Fedora 19:
echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph lbr bc -l < cmd
If using LBR, perf report output looks like:
50.36% bc bc [.] bc_divide
|
--- bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
33.66% bc bc [.] _one_mult
|
--- _one_mult
bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
7.62% bc bc [.] _bc_do_add
|
--- _bc_do_add
|
|--99.89%-- 0x2000186a8
--0.11%-- [...]
6.83% bc bc [.] _bc_do_sub
|
--- _bc_do_sub
|
|--99.94%-- bc_add
| execute
| run_code
| yyparse
| main
| __libc_start_main
| _start
--0.06%-- [...]
0.46% bc libc-2.17.so [.] __memset_sse2
|
--- __memset_sse2
|
|--54.13%-- bc_new_num
| |
| |--51.00%-- bc_divide
| | execute
| | run_code
| | yyparse
| | main
| | __libc_start_main
| | _start
| |
| |--30.46%-- _bc_do_sub
| | bc_add
| | execute
| | run_code
| | yyparse
| | main
| | __libc_start_main
| | _start
| |
| --18.55%-- _bc_do_add
| bc_add
| execute
| run_code
| yyparse
| main
| __libc_start_main
| _start
|
--45.87%-- bc_divide
execute
run_code
yyparse
main
__libc_start_main
_start
If using FP, perf report output looks like:
echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
50.49% bc bc [.] bc_divide
|
--- bc_divide
33.57% bc bc [.] _one_mult
|
--- _one_mult
7.61% bc bc [.] _bc_do_add
|
--- _bc_do_add
0x2000186a8
6.88% bc bc [.] _bc_do_sub
|
--- _bc_do_sub
0.42% bc libc-2.17.so [.] __memcpy_ssse3_back
|
--- __memcpy_ssse3_back

If using LBR, perf report -D output looks like:
11739295893248 0x4d0 [0xe0]: PERF_RECORD_SAMPLE(IP, 0x2): 10505/10505:
0x40054d period: 39255 addr: 0
... LBR call chain: nr:7
..... 0: fffffffffffffe00
..... 1: 0000000000400540
..... 2: 0000000000400587
..... 3: 00000000004005b3
..... 4: 00000000004005ef
..... 5: 0000003d1cc21b43
..... 6: 0000000000400474
... FP chain: nr:6
..... 0: fffffffffffffe00
..... 1: 000000000040054d
..... 2: 000000000040058c
..... 3: 00000000004005b8
..... 4: 00000000004005f4
..... 5: 0000003d1cc21b45
... thread: a.out:10505
...... dso: /home/lk/a.out


The LBR call stack has the following known limitations:
- Zero-length calls are not filtered out by hardware
- Exception handling such as setjmp/longjmp leaves calls and returns
  unmatched
- Pushing a different return address onto the stack leaves calls and
  returns unmatched
- If the call stack is deeper than the LBR, only the last entries are
  captured
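
To illustrate the setjmp/longjmp limitation, here is a small sketch in
which two calls are made but neither executes its return instruction.
The counters are just a software stand-in for what call/return tracking
would observe; the function names are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf env;
static volatile int calls;	/* calls that were entered */
static volatile int returns;	/* calls that returned normally */

static void inner(void)
{
	calls++;
	longjmp(env, 1);	/* leaves inner() and outer() without a ret */
}

static void outer(void)
{
	calls++;
	inner();
	returns++;		/* never reached */
}

/* Returns how many calls were left without a matching return. */
int unmatched_calls_after_longjmp(void)
{
	calls = 0;
	returns = 0;
	if (setjmp(env) == 0) {
		outer();
		returns++;	/* never reached */
	}
	assert(calls >= returns);
	return calls - returns;
}
```

Both outer() and inner() are abandoned by the longjmp, so records pushed
for those calls would not be popped by matching returns.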

Changes since v1
- Update help document
- Force exclude_user to 0 with warning in LBR call stack
- Dump both lbr and fp info when report -D
- Restructure thread__resolve_callchain_sample and split the change into two patches
- Use has_branch_callstack function to check whether LBR call stack is available



Kan Liang (3):
perf tools: enable LBR call stack support
perf tool: re-organize thread__resolve_callchain_sample
perf tools: Construct LBR call chain

tools/perf/Documentation/perf-record.txt | 10 +-
tools/perf/builtin-record.c | 6 +-
tools/perf/builtin-report.c | 2 +
tools/perf/util/callchain.c | 10 +-
tools/perf/util/callchain.h | 1 +
tools/perf/util/evsel.c | 21 +++-
tools/perf/util/evsel.h | 4 +
tools/perf/util/machine.c | 176 +++++++++++++++++++++----------
tools/perf/util/session.c | 56 ++++++++--
9 files changed, 216 insertions(+), 70 deletions(-)

--
1.8.3.2


2014-11-13 00:33:55

by Liang, Kan

Subject: [PATCH V2 2/3] perf tool: re-organize thread__resolve_callchain_sample

From: Kan Liang <[email protected]>

Re-organize thread__resolve_callchain_sample. The next patch reuses
parts of thread__resolve_callchain_sample, so factor out the common
functionality.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/machine.c | 105 +++++++++++++++++++++++++---------------------
1 file changed, 57 insertions(+), 48 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 52e9490..9c7d136 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1399,6 +1399,57 @@ struct branch_info *sample__resolve_bstack(struct perf_sample *sample,
return bi;
}

+static inline int __thread__resolve_callchain_sample(struct thread *thread,
+ u64 ip, u8 *cpumode,
+ struct symbol **parent,
+ struct addr_location *root_al)
+{
+ struct addr_location al;
+
+ if (ip >= PERF_CONTEXT_MAX) {
+ switch (ip) {
+ case PERF_CONTEXT_HV:
+ *cpumode = PERF_RECORD_MISC_HYPERVISOR;
+ break;
+ case PERF_CONTEXT_KERNEL:
+ *cpumode = PERF_RECORD_MISC_KERNEL;
+ break;
+ case PERF_CONTEXT_USER:
+ *cpumode = PERF_RECORD_MISC_USER;
+ break;
+ default:
+ pr_debug("invalid callchain context: "
+ "%"PRId64"\n", (s64) ip);
+ /*
+ * It seems the callchain is corrupted.
+ * Discard all.
+ */
+ callchain_cursor_reset(&callchain_cursor);
+ return 1;
+ }
+ return 0;
+ }
+
+ al.filtered = 0;
+ thread__find_addr_location(thread, *cpumode,
+ MAP__FUNCTION, ip, &al);
+ if (al.sym != NULL) {
+ if (sort__has_parent && !*parent &&
+ symbol__match_regex(al.sym, &parent_regex))
+ *parent = al.sym;
+ else if (have_ignore_callees && root_al &&
+ symbol__match_regex(al.sym, &ignore_callees_regex)) {
+ /* Treat this symbol as the root,
+ forgetting its callees. */
+ *root_al = al;
+ callchain_cursor_reset(&callchain_cursor);
+ }
+ }
+
+ return callchain_cursor_append(&callchain_cursor,
+ ip, al.map, al.sym);
+
+}
static int thread__resolve_callchain_sample(struct thread *thread,
struct ip_callchain *chain,
struct symbol **parent,
@@ -1407,9 +1458,7 @@ static int thread__resolve_callchain_sample(struct thread *thread,
{
u8 cpumode = PERF_RECORD_MISC_USER;
int chain_nr = min(max_stack, (int)chain->nr);
- int i;
- int j;
- int err;
+ int i, j, err = 0;
int skip_idx __maybe_unused;

callchain_cursor_reset(&callchain_cursor);
@@ -1427,7 +1476,6 @@ static int thread__resolve_callchain_sample(struct thread *thread,

for (i = 0; i < chain_nr; i++) {
u64 ip;
- struct addr_location al;

if (callchain_param.order == ORDER_CALLEE)
j = i;
@@ -1440,53 +1488,14 @@ static int thread__resolve_callchain_sample(struct thread *thread,
#endif
ip = chain->ips[j];

- if (ip >= PERF_CONTEXT_MAX) {
- switch (ip) {
- case PERF_CONTEXT_HV:
- cpumode = PERF_RECORD_MISC_HYPERVISOR;
- break;
- case PERF_CONTEXT_KERNEL:
- cpumode = PERF_RECORD_MISC_KERNEL;
- break;
- case PERF_CONTEXT_USER:
- cpumode = PERF_RECORD_MISC_USER;
- break;
- default:
- pr_debug("invalid callchain context: "
- "%"PRId64"\n", (s64) ip);
- /*
- * It seems the callchain is corrupted.
- * Discard all.
- */
- callchain_cursor_reset(&callchain_cursor);
- return 0;
- }
- continue;
- }
-
- al.filtered = 0;
- thread__find_addr_location(thread, cpumode,
- MAP__FUNCTION, ip, &al);
- if (al.sym != NULL) {
- if (sort__has_parent && !*parent &&
- symbol__match_regex(al.sym, &parent_regex))
- *parent = al.sym;
- else if (have_ignore_callees && root_al &&
- symbol__match_regex(al.sym, &ignore_callees_regex)) {
- /* Treat this symbol as the root,
- forgetting its callees. */
- *root_al = al;
- callchain_cursor_reset(&callchain_cursor);
- }
- }
-
- err = callchain_cursor_append(&callchain_cursor,
- ip, al.map, al.sym);
+ err = __thread__resolve_callchain_sample(thread,
+ ip, &cpumode, parent, root_al);
if (err)
- return err;
+ goto exit;
}

- return 0;
+exit:
+ return (err < 0) ? err : 0;
}

static int unwind_entry(struct unwind_entry *entry, void *arg)
--
1.8.3.2

2014-11-13 00:34:12

by Liang, Kan

Subject: [PATCH V2 3/3] perf tools: Construct LBR call chain

From: Kan Liang <[email protected]>

LBR call stack only provides the user callchain. It is output in the
PERF_SAMPLE_BRANCH_STACK data format. The kernel callchain still comes
from PERF_SAMPLE_CALLCHAIN.
The perf tool has to handle both data sources to construct a
complete callstack.
With the perf report -D option, both LBR and FP information is
displayed.
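
As a rough sketch of how the two sources combine (callee-first order
only), the merged chain is the PERF_SAMPLE_CALLCHAIN entries up to and
including the user-context marker, then the "to" address of the newest
LBR entry, then the "from" address of every LBR entry. The names
mix_chain, struct branch and CTX_USER are hypothetical; CTX_USER is
assumed to carry the same value as PERF_CONTEXT_USER. Unlike the patch,
this sketch simply returns 0 when there is nothing to merge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CTX_USER ((uint64_t)-512)	/* stand-in for PERF_CONTEXT_USER */

struct branch {
	uint64_t from;	/* caller side of a recorded call */
	uint64_t to;	/* callee side of a recorded call */
};

/*
 * Write the merged chain to out[] and return its length, or 0 if there
 * is no user-context marker, no LBR data, or out[] is too small.
 */
size_t mix_chain(const uint64_t *ips, size_t nr_ips,
		 const struct branch *lbr, size_t nr_lbr,
		 uint64_t *out, size_t out_cap)
{
	size_t i, k, n = 0;

	assert(ips != NULL && lbr != NULL && out != NULL);

	/* i ends up as the index of the user-context marker */
	for (i = 0; i < nr_ips; i++)
		if (ips[i] == CTX_USER)
			break;
	if (i == nr_ips || nr_lbr == 0)
		return 0;
	if (i + 1 + nr_lbr + 1 > out_cap)
		return 0;

	for (k = 0; k <= i; k++)	/* kernel part plus the marker */
		out[n++] = ips[k];
	out[n++] = lbr[0].to;		/* innermost callee */
	for (k = 0; k < nr_lbr; k++)	/* one caller per LBR entry */
		out[n++] = lbr[k].from;
	return n;
}
```

For a call stack "A"->"B"->"C"->"D" recorded as "C"->"D", "B"->"C",
"A"->"B", the user part of the result is D, C, B, A.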

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/evsel.h | 4 +++
tools/perf/util/machine.c | 73 +++++++++++++++++++++++++++++++++++++++++------
tools/perf/util/session.c | 56 ++++++++++++++++++++++++++++++++----
3 files changed, 118 insertions(+), 15 deletions(-)

diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 9797909..1bbaa74 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -368,4 +368,8 @@ for ((_evsel) = list_entry((_leader)->node.next, struct perf_evsel, node); \
(_evsel) && (_evsel)->leader == (_leader); \
(_evsel) = list_entry((_evsel)->node.next, struct perf_evsel, node))

+static inline bool has_branch_callstack(struct perf_evsel *evsel)
+{
+ return evsel->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK;
+}
#endif /* __PERF_EVSEL_H */
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 9c7d136..a63ae26 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1451,15 +1451,19 @@ static inline int __thread__resolve_callchain_sample(struct thread *thread,

}
static int thread__resolve_callchain_sample(struct thread *thread,
- struct ip_callchain *chain,
- struct symbol **parent,
- struct addr_location *root_al,
- int max_stack)
+ struct perf_evsel *evsel,
+ struct perf_sample *sample,
+ struct symbol **parent,
+ struct addr_location *root_al,
+ int max_stack)
{
+ struct ip_callchain *chain = sample->callchain;
u8 cpumode = PERF_RECORD_MISC_USER;
int chain_nr = min(max_stack, (int)chain->nr);
int i, j, err = 0;
int skip_idx __maybe_unused;
+ u64 ip;
+ bool lbr = has_branch_callstack(evsel);

callchain_cursor_reset(&callchain_cursor);

@@ -1468,6 +1472,58 @@ static int thread__resolve_callchain_sample(struct thread *thread,
return 0;
}

+ if (lbr) {
+ for (i = 0; i < chain_nr; i++) {
+ if (chain->ips[i] == PERF_CONTEXT_USER)
+ break;
+ }
+
+ /* LBR only affects the user callchain */
+ if (i != chain_nr) {
+ struct branch_stack *lbr_stack = sample->branch_stack;
+ int lbr_nr = lbr_stack->nr;
+ /*
+ * LBR callstack can only get user call chain information.
+ * The mix_chain_nr should be kernel call chain number plus
+ * LBR user call chain number.
+ * i is kernel call chain number, 1 is PERF_CONTEXT_USER,
+ * lbr_nr + 1 is the user call chain number.
+ * For details, please refer to the comments
+ * in callchain__printf
+ */
+ int mix_chain_nr = i + 1 + lbr_nr + 1;
+
+ if (mix_chain_nr > PERF_MAX_STACK_DEPTH) {
+ pr_warning("corrupted callchain. skipping...\n");
+ return 0;
+ }
+
+ for (j = 0; j < mix_chain_nr; j++) {
+ if (callchain_param.order == ORDER_CALLEE) {
+ if (j < i + 1)
+ ip = chain->ips[j];
+ else if (j > i + 1)
+ ip = lbr_stack->entries[j - i - 2].from;
+ else
+ ip = lbr_stack->entries[0].to;
+ } else {
+ if (j < lbr_nr)
+ ip = lbr_stack->entries[lbr_nr - j - 1].from;
+ else if (j > lbr_nr)
+ ip = chain->ips[i + 1 - (j - lbr_nr)];
+ else
+ ip = lbr_stack->entries[0].to;
+ }
+
+ err = __thread__resolve_callchain_sample(thread,
+ ip, &cpumode, parent, root_al);
+ if (err)
+ goto exit;
+ }
+ return 0;
+ }
+ }
+
/*
* Based on DWARF debug information, some architectures skip
* a callchain entry saved by the kernel.
@@ -1475,8 +1531,6 @@ static int thread__resolve_callchain_sample(struct thread *thread,
skip_idx = arch_skip_callchain_idx(thread, chain);

for (i = 0; i < chain_nr; i++) {
- u64 ip;
-
if (callchain_param.order == ORDER_CALLEE)
j = i;
else
@@ -1489,7 +1543,7 @@ static int thread__resolve_callchain_sample(struct thread *thread,
ip = chain->ips[j];

err = __thread__resolve_callchain_sample(thread,
- ip, &cpumode, parent, root_al);
+ ip, &cpumode, parent, root_al);
if (err)
goto exit;
}
@@ -1512,8 +1566,9 @@ int thread__resolve_callchain(struct thread *thread,
struct addr_location *root_al,
int max_stack)
{
- int ret = thread__resolve_callchain_sample(thread, sample->callchain,
- parent, root_al, max_stack);
+ int ret = thread__resolve_callchain_sample(thread, evsel,
+ sample, parent,
+ root_al, max_stack);
if (ret)
return ret;

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f4478ce..794c46b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -557,15 +557,59 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
return 0;
}

-static void callchain__printf(struct perf_sample *sample)
+static void callchain__printf(struct perf_evsel *evsel,
+ struct perf_sample *sample)
{
unsigned int i;
+ struct ip_callchain *callchain = sample->callchain;
+ bool lbr = has_branch_callstack(evsel);

- printf("... chain: nr:%" PRIu64 "\n", sample->callchain->nr);
+ if (lbr) {
+ struct branch_stack *lbr_stack = sample->branch_stack;
+ u64 kernel_callchain_nr = callchain->nr;

- for (i = 0; i < sample->callchain->nr; i++)
+ for (i = 0; i < kernel_callchain_nr; i++) {
+ if (callchain->ips[i] == PERF_CONTEXT_USER)
+ break;
+ }
+
+ if ((i != kernel_callchain_nr) && lbr_stack->nr) {
+ u64 total_nr;
+ /*
+ * LBR callstack can only get user call chain information,
+ * So i is kernel call chain number, 1 is PERF_CONTEXT_USER.
+ *
+ * The user call chain is stored in LBR registers.
+ * LBR are pair registers. The caller is stored in "from"
+ * register, while the callee is stored in "to" register.
+ * For example, there is a call stack "A"->"B"->"C"->"D".
+ * The LBR registers will record like "C"->"D", "B"->"C",
+ * "A"->"B". So only the first "to" register and all "from"
+ * registers are needed to construct the whole stack.
+ */
+ total_nr = i + 1 + lbr_stack->nr + 1;
+ kernel_callchain_nr = i + 1;
+
+ printf("... LBR call chain: nr:%" PRIu64 "\n", total_nr);
+
+ for (i = 0; i < kernel_callchain_nr; i++)
+ printf("..... %2d: %016" PRIx64 "\n",
+ i, callchain->ips[i]);
+
+ printf("..... %2d: %016" PRIx64 "\n",
+ (int)(kernel_callchain_nr), lbr_stack->entries[0].to);
+ for (i = 0; i < lbr_stack->nr; i++)
+ printf("..... %2d: %016" PRIx64 "\n",
+ (int)(i + kernel_callchain_nr + 1), lbr_stack->entries[i].from);
+ }
+
+ }
+
+ printf("... FP chain: nr:%" PRIu64 "\n", callchain->nr);
+
+ for (i = 0; i < callchain->nr; i++)
printf("..... %2d: %016" PRIx64 "\n",
- i, sample->callchain->ips[i]);
+ i, callchain->ips[i]);
}

static void branch_stack__printf(struct perf_sample *sample)
@@ -691,9 +735,9 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
sample_type = evsel->attr.sample_type;

if (sample_type & PERF_SAMPLE_CALLCHAIN)
- callchain__printf(sample);
+ callchain__printf(evsel, sample);

- if (sample_type & PERF_SAMPLE_BRANCH_STACK)
+ if ((sample_type & PERF_SAMPLE_BRANCH_STACK) && !has_branch_callstack(evsel))
branch_stack__printf(sample);

if (sample_type & PERF_SAMPLE_REGS_USER)
--
1.8.3.2

2014-11-13 00:34:51

by Liang, Kan

Subject: [PATCH V2 1/3] perf tools: enable LBR call stack support

From: Kan Liang <[email protected]>

Currently, there are two call chain recording options, fp and dwarf.
Haswell has a new feature that utilizes the existing LBR facility to
record call chains, which provides a third option for recording call
chains. This patch enables the LBR call stack support.

LBR call stack has some limitations. It reuses the existing LBR
facility, so LBR call stack and branch record cannot be enabled at the
same time. It is only available for the user callchain.
However, LBR call stack works on user applications that were compiled
without frame pointers or DWARF debug info. It is a good alternative
when nothing else works.
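
For readers who open events directly, the record-side setup this patch
performs could look roughly like the following on a raw perf_event_attr.
This is a sketch under assumptions: it needs kernel headers that already
define PERF_SAMPLE_BRANCH_CALL_STACK (the kernel side of this work), the
function name lbr_callgraph_attr_init is hypothetical, and the choice of
a cycles event is only for illustration.

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Fill in an attr that asks for a callchain plus LBR call stack data. */
void lbr_callgraph_attr_init(struct perf_event_attr *attr)
{
	assert(attr != NULL);

	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;		/* illustrative event */
	attr->config = PERF_COUNT_HW_CPU_CYCLES;

	/* kernel part via CALLCHAIN, user part via BRANCH_STACK */
	attr->sample_type = PERF_SAMPLE_IP |
			    PERF_SAMPLE_CALLCHAIN |
			    PERF_SAMPLE_BRANCH_STACK;

	/* put the LBR into call stack mode, user branches only */
	attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
				   PERF_SAMPLE_BRANCH_CALL_STACK;

	/* LBR call stack only covers user space, so keep it included */
	attr->exclude_user = 0;
}
```

The attr would then be passed to perf_event_open(2) as usual; whether the
open succeeds depends on the CPU and kernel supporting LBR call stack.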

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 10 ++++++++--
tools/perf/builtin-record.c | 6 +++---
tools/perf/builtin-report.c | 2 ++
tools/perf/util/callchain.c | 10 +++++++++-
tools/perf/util/callchain.h | 1 +
tools/perf/util/evsel.c | 21 +++++++++++++++++++--
6 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 398f8d5..0808460 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -100,13 +100,19 @@ OPTIONS
implies -g.

Allows specifying "fp" (frame pointer) or "dwarf"
- (DWARF's CFI - Call Frame Information) as the method to collect
+ (DWARF's CFI - Call Frame Information) or "lbr"
+ (Hardware Last Branch Record facility) as the method to collect
the information used to show the call graphs.

In some systems, where binaries are build with gcc
- --fomit-frame-pointer, using the "fp" method will produce bogus
+ --fomit-frame-pointer, using the "fp" method will produce bogus
call graphs, using "dwarf", if available (perf tools linked to
the libunwind library) should be used instead.
+ Using the "lbr" method doesn't require any compiler options. It
+ will produce call graphs from the hardware LBR registers. The
+ main limitation is that it is only available on new Intel
+ platforms, such as Haswell. It can only get user call chain. It
+ doesn't work with branch stack sampling at the same time.

-q::
--quiet::
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 582c4da..e486627 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -639,7 +639,7 @@ error:

static void callchain_debug(void)
{
- static const char *str[CALLCHAIN_MAX] = { "NONE", "FP", "DWARF" };
+ static const char *str[CALLCHAIN_MAX] = { "NONE", "FP", "DWARF", "LBR" };

pr_debug("callchain: type %s\n", str[callchain_param.record_mode]);

@@ -725,9 +725,9 @@ static struct record record = {
#define CALLCHAIN_HELP "setup and enables call-graph (stack chain/backtrace) recording: "

#ifdef HAVE_DWARF_UNWIND_SUPPORT
-const char record_callchain_help[] = CALLCHAIN_HELP "fp dwarf";
+const char record_callchain_help[] = CALLCHAIN_HELP "fp dwarf lbr";
#else
-const char record_callchain_help[] = CALLCHAIN_HELP "fp";
+const char record_callchain_help[] = CALLCHAIN_HELP "fp lbr";
#endif

/*
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 140a6cd..43babdb 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -261,6 +261,8 @@ static int report__setup_sample_type(struct report *rep)
if ((sample_type & PERF_SAMPLE_REGS_USER) &&
(sample_type & PERF_SAMPLE_STACK_USER))
callchain_param.record_mode = CALLCHAIN_DWARF;
+ else if (sample_type & PERF_SAMPLE_BRANCH_STACK)
+ callchain_param.record_mode = CALLCHAIN_LBR;
else
callchain_param.record_mode = CALLCHAIN_FP;
}
diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 0022980..2d98b86 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -77,7 +77,7 @@ int parse_callchain_record_opt(const char *arg)
ret = 0;
} else
pr_err("callchain: No more arguments "
- "needed for -g fp\n");
+ "needed for --call-graph fp\n");
break;

#ifdef HAVE_DWARF_UNWIND_SUPPORT
@@ -97,6 +97,14 @@ int parse_callchain_record_opt(const char *arg)
callchain_param.dump_size = size;
}
#endif /* HAVE_DWARF_UNWIND_SUPPORT */
+ } else if (!strncmp(name, "lbr", sizeof("lbr"))) {
+ if (!strtok_r(NULL, ",", &saveptr)) {
+ callchain_param.record_mode = CALLCHAIN_LBR;
+ ret = 0;
+ } else
+ pr_err("callchain: No more arguments "
+ "needed for --call-graph lbr\n");
+ break;
} else {
pr_err("callchain: Unknown --call-graph option "
"value: %s\n", arg);
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 3caccc2..c0d026f 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -11,6 +11,7 @@ enum perf_call_graph_mode {
CALLCHAIN_NONE,
CALLCHAIN_FP,
CALLCHAIN_DWARF,
+ CALLCHAIN_LBR,
CALLCHAIN_MAX
};

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 12b4396..7cbe2e9 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -537,13 +537,30 @@ int perf_evsel__group_desc(struct perf_evsel *evsel, char *buf, size_t size)
}

static void
-perf_evsel__config_callgraph(struct perf_evsel *evsel)
+perf_evsel__config_callgraph(struct perf_evsel *evsel,
+ struct record_opts *opts)
{
bool function = perf_evsel__is_function_event(evsel);
struct perf_event_attr *attr = &evsel->attr;

perf_evsel__set_sample_bit(evsel, CALLCHAIN);

+ if (callchain_param.record_mode == CALLCHAIN_LBR) {
+ if (!opts->branch_stack) {
+ perf_evsel__set_sample_bit(evsel, BRANCH_STACK);
+ attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
+ PERF_SAMPLE_BRANCH_CALL_STACK;
+ if (attr->exclude_user) {
+ attr->exclude_user = 0;
+
+ pr_warning("LBR callstack option is only available"
+ " to get user callchain information."
+ " Force exclude_user to 0.\n");
+ }
+ } else
+ pr_info("Cannot use LBR callstack with branch stack\n");
+ }
+
if (callchain_param.record_mode == CALLCHAIN_DWARF) {
if (!function) {
perf_evsel__set_sample_bit(evsel, REGS_USER);
@@ -659,7 +676,7 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
}

if (callchain_param.enabled && !evsel->no_aux_samples)
- perf_evsel__config_callgraph(evsel);
+ perf_evsel__config_callgraph(evsel, opts);

if (target__has_cpu(&opts->target))
perf_evsel__set_sample_bit(evsel, CPU);
--
1.8.3.2

2014-11-13 18:21:47

by Jiri Olsa

Subject: Re: [PATCH V2 2/3] perf tool: re-organize thread__resolve_callchain_sample

On Wed, Nov 12, 2014 at 07:18:14PM -0500, [email protected] wrote:

SNIP

> +static inline int __thread__resolve_callchain_sample(struct thread *thread,
> + u64 ip, u8 *cpumode,
> + struct symbol **parent,
> + struct addr_location *root_al)
> +{
> + struct addr_location al;
> +
> + if (ip >= PERF_CONTEXT_MAX) {
> + switch (ip) {
> + case PERF_CONTEXT_HV:
> + *cpumode = PERF_RECORD_MISC_HYPERVISOR;
> + break;
> + case PERF_CONTEXT_KERNEL:
> + *cpumode = PERF_RECORD_MISC_KERNEL;
> + break;
> + case PERF_CONTEXT_USER:
> + *cpumode = PERF_RECORD_MISC_USER;
> + break;
> + default:
> + pr_debug("invalid callchain context: "
> + "%"PRId64"\n", (s64) ip);
> + /*
> + * It seems the callchain is corrupted.
> + * Discard all.
> + */
> + callchain_cursor_reset(&callchain_cursor);
> + return 1;
> + }
> + return 0;
> + }
> +
> + al.filtered = 0;
> + thread__find_addr_location(thread, *cpumode,
> + MAP__FUNCTION, ip, &al);
> + if (al.sym != NULL) {
> + if (sort__has_parent && !*parent &&
> + symbol__match_regex(al.sym, &parent_regex))
> + *parent = al.sym;
> + else if (have_ignore_callees && root_al &&
> + symbol__match_regex(al.sym, &ignore_callees_regex)) {
> + /* Treat this symbol as the root,
> + forgetting its callees. */
> + *root_al = al;
> + callchain_cursor_reset(&callchain_cursor);
> + }
> + }
> +
> + return callchain_cursor_append(&callchain_cursor,
> + ip, al.map, al.sym);

you added slightly more than Andi ;-)
http://marc.info/?l=linux-kernel&m=141584439819848&w=2

Any chance you guys could sync on this? ..you're touching the
same code.. Andi, maybe you wouldn't mind having this patch
instead of your change.. looks like the only extra part is the
cpumode resolve.

thanks,
jirka

2014-11-13 18:48:14

by Andi Kleen

Subject: Re: [PATCH V2 2/3] perf tool: re-organize thread__resolve_callchain_sample

> Any chance you guys could sync on this? ..you're touching the
> same code.. Andi, maybe you wouldn't mind having this patch
> instead of your change.. looks like the only extra part is the
> cpumode resolve.

Please just merge one of them, and the other can rebase.

Note the lbr as callgraph has been posted since nearly a year now
(v1 was in January)

If you merged patches faster all these problems with conflicts
wouldn't happen.

-Andi

2014-11-14 08:23:56

by Jiri Olsa

Subject: Re: [PATCH V2 2/3] perf tool: re-organize thread__resolve_callchain_sample

On Thu, Nov 13, 2014 at 10:48:10AM -0800, Andi Kleen wrote:
> > Any chance you guys could sync on this? ..you're touching the
> > same code.. Andi, maybe you wouldn't mind having this patch
> > instead of your change.. looks like the only extra part is the
> > cpumode resolve.
>
> Please just merge one of them, and the other can rebase.
>
> Note the lbr as callgraph has been posted since nearly a year now
> (v1 was in January)
>

Arnaldo just took it:
http://marc.info/?l=linux-kernel&m=141590608310095&w=2

Kan, please rebase once it's pushed out.

thanks,
jirka