2019-11-19 14:36:39

by Liang, Kan

Subject: [PATCH V4 00/13] Stitch LBR call stack

From: Kan Liang <[email protected]>


Changes since V3
- Add the new branch sample type at the end of enum
perf_branch_sample_type.
- Rebase the user space patch on top of acme's perf/core branch

Changes since V2
- Move tos into struct perf_branch_stack

Changes since V1
- Add a new branch sample type for LBR TOS. Drop the sample type in V1.
- Add check in perf header to detect unknown input bits in event attr
- Save and use the LBR cursor nodes from the previous sample to avoid
duplicate calculation of cursor nodes.
- Add a fast path for the duplicate entries check. It benefits all call
stack parsing, not just the stitched LBR call stack. It can be merged
independently.

Starting from Haswell, Linux perf can utilize the existing Last Branch
Record (LBR) facility to record call stacks. However, the depth of the
reconstructed LBR call stack is limited to the number of LBR registers.
E.g. on Skylake, the depth of the reconstructed LBR call stack is <= 32.
That's because HW overwrites the oldest LBR registers once the LBR stack
is full.

However, the overwritten LBRs may still be retrieved from the previous
sample. At that moment, HW hadn't overwritten the LBR registers yet.
Perf tools can stitch those overwritten LBRs onto the current call stack
to get a more complete call stack.

To determine if LBRs can be stitched, the physical index of the LBR
registers is required. A new branch sample type is introduced in
patches 1-3 to dump the LBR Top-of-Stack (TOS) information for perf
tools.

The TOS information is only dumped into the PERF_SAMPLE_BRANCH_STACK
output when the new branch sample type is set. The perf tool should
check attr.branch_sample_type and apply the corresponding format for
PERF_SAMPLE_BRANCH_STACK samples. The check is introduced in patch 4.
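
For reference, here is a minimal sketch (not part of the patches) of how
a consumer might walk the PERF_SAMPLE_BRANCH_STACK payload when the new
bit is set. The helper name and the plain uint64_t pointer arithmetic
are illustrative only; the point is that the TOS word simply follows the
branch entries.

#include <stdint.h>

struct branch_entry_raw {
        uint64_t from, to, flags;
};

/* 'p' points at the branch-stack part of a raw sample record;
 * 'branch_sample_type' comes from the event's stored attr,
 * 'lbr_tos_bit' is the new PERF_SAMPLE_BRANCH_LBR_TOS bit. */
static const uint64_t *skip_branch_stack(const uint64_t *p,
                                         uint64_t branch_sample_type,
                                         uint64_t lbr_tos_bit)
{
        uint64_t nr = *p++;
        const struct branch_entry_raw *entries = (const void *)p;

        p += nr * 3;                    /* nr * { from, to, flags } */

        if (branch_sample_type & lbr_tos_bit) {
                uint64_t tos = *p++;    /* physical index of newest LBR */
                (void)tos;
        }
        (void)entries;
        return p;                       /* next field of the sample */
}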

Besides, the maximum number of LBRs is required as well. Patches 5 & 6
retrieve the capability information from sysfs and save it in the perf
header.
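
For reference, a minimal user-space sketch of the idea behind patches
5 & 6 (assuming the x86 "cpu" PMU exposes its capabilities under
/sys/devices/cpu/caps/, e.g. a "branches" file holding the LBR depth;
the helper name is illustrative):

#include <stdio.h>

static int read_max_branches(void)
{
        FILE *f = fopen("/sys/devices/cpu/caps/branches", "r");
        int branches = 0;

        if (!f)
                return 0;               /* capability not exported */
        if (fscanf(f, "%d", &branches) != 1)
                branches = 0;
        fclose(f);
        return branches;                /* e.g. 32 on Skylake */
}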

Patches 7 & 8 implement the LBR stitching approach.

Users can use the options introduced in patches 9-12 to enable the LBR
stitching approach for perf report, script, top and c2c.

Patch 13 adds a fast path for the duplicate entries check. It benefits
all call stack parsing, not just the stitched LBR call stack. It can be
merged independently.


The stitching approach is based on the LBR call stack technology. The
known limitations of the LBR call stack technology still apply to the
approach, e.g. exception handling such as setjmp/longjmp will have
calls/returns that do not match.
This approach is not foolproof. There can be cases where it creates
incorrect call stacks from incorrect matches. There is no attempt
to validate any matches in another way. So it is not enabled by default.
However, in many common cases with call stack overflows it can recreate
better call stacks than the default LBR call stack output. So if there
are problems with LBR overflows, this is a possible workaround.

Regression:
Users may collect an LBR call stack on a machine with a new perf tool
and a new kernel (which support LBR TOS). However, they may later parse
the perf.data with an old perf tool (which does not support LBR TOS).
The old tool doesn't check attr.branch_sample_type. Users probably get
incorrect information without any warning.

Performance impact:
The processing time may increase with the LBR stitching approach
enabled. The impact depends on the increased depth of call stacks.

For a simple test case, tchain_edit, with a call stack depth of 43:
perf record --call-graph lbr -- ./tchain_edit
perf report --stitch-lbr

Without --stitch-lbr, perf report only displays 32 levels of the call
stack.
With --stitch-lbr, perf report can display all 43 levels of the call
stack.
The depth of the call stack increases by 34.3%.

Correspondingly, the processing time of perf report increases by 39%:
Without --stitch-lbr: 11.0 sec
With --stitch-lbr: 15.3 sec
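
(As a cross-check of the numbers above: (43 - 32) / 32 is roughly a 34%
increase in call stack depth, and 15.3s / 11.0s is roughly 1.39, i.e.
~39% more processing time.)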

The source code of tchain_edit.c is similar to the code below.
noinline void f43(void)
{
        int i;

        for (i = 0; i < 10000;) {
                if (i % 2)
                        i++;
                else
                        i++;
        }
}

noinline void f42(void)
{
        int i;

        for (i = 0; i < 100; i++) {
                f43();
                f43();
                f43();
        }
}

noinline void f41(void)
{
        int i;

        for (i = 0; i < 100; i++) {
                f42();
                f42();
                f42();
        }
}

noinline void f40(void)
{
        f41();
}

... ...

noinline void f32(void)
{
        f33();
}

noinline void f31(void)
{
        int i;

        for (i = 0; i < 10000; i++) {
                if (i % 2)
                        i++;
                else
                        i++;
        }

        f32();
}

noinline void f30(void)
{
        f31();
}

... ...

noinline void f1(void)
{
        f2();
}

int main()
{
        f1();
}

Kan Liang (13):
perf/core: Add new branch sample type for LBR TOS
perf/x86/intel: Output LBR TOS information
perf tools: Support new branch sample type for LBR TOS
perf header: Add check for event attr
perf pmu: Add support for PMU capabilities
perf header: Support CPU PMU capabilities
perf machine: Refine the function for LBR call stack reconstruction
perf tools: Stitch LBR call stack
perf report: Add option to enable the LBR stitching approach
perf script: Add option to enable the LBR stitching approach
perf top: Add option to enable the LBR stitching approach
perf c2c: Add option to enable the LBR stitching approach
perf hist: Add fast path for duplicate entries check

arch/x86/events/intel/lbr.c | 9 +
include/linux/perf_event.h | 2 +
include/uapi/linux/perf_event.h | 16 +-
kernel/events/core.c | 13 +-
tools/include/uapi/linux/perf_event.h | 16 +-
tools/perf/Documentation/perf-c2c.txt | 11 +
tools/perf/Documentation/perf-report.txt | 11 +
tools/perf/Documentation/perf-script.txt | 11 +
tools/perf/Documentation/perf-top.txt | 9 +
.../Documentation/perf.data-file-format.txt | 16 +
tools/perf/builtin-c2c.c | 6 +
tools/perf/builtin-record.c | 3 +
tools/perf/builtin-report.c | 6 +
tools/perf/builtin-script.c | 6 +
tools/perf/builtin-stat.c | 1 +
tools/perf/builtin-top.c | 11 +
tools/perf/util/branch.h | 5 +-
tools/perf/util/callchain.h | 12 +-
tools/perf/util/env.h | 3 +
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 20 +-
tools/perf/util/evsel.h | 6 +
tools/perf/util/header.c | 148 +++++++
tools/perf/util/header.h | 1 +
tools/perf/util/hist.c | 23 +
tools/perf/util/machine.c | 409 +++++++++++++++---
tools/perf/util/parse-branch-options.c | 3 +-
tools/perf/util/perf_event_attr_fprintf.c | 3 +-
tools/perf/util/pmu.c | 87 ++++
tools/perf/util/pmu.h | 12 +
tools/perf/util/sort.c | 2 +-
tools/perf/util/sort.h | 2 +
tools/perf/util/thread.c | 2 +
tools/perf/util/thread.h | 34 ++
tools/perf/util/top.h | 1 +
35 files changed, 843 insertions(+), 78 deletions(-)

--
2.17.1



2019-11-19 14:36:51

by Liang, Kan

Subject: [PATCH V4 03/13] perf tools: Support new branch sample type for LBR TOS

From: Kan Liang <[email protected]>

Support new branch sample type for LBR TOS.

Enable LBR_TOS by default in LBR call stack mode.
If the kernel doesn't support the sample type, switch it off.

Add a new branch option "tos" for the new branch sample type.
The branch sample type is 64 bits. Change int to u64 for 'mode' in
struct branch_mode and 'bit' in struct bit_names.

Set tos to -1ULL if the LBR TOS information is unavailable.
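
For example, with this patch the new bit is set automatically by

  perf record --call-graph lbr -- ./workload

and is silently dropped again when running on a kernel without LBR TOS
support. It can also be requested explicitly through the new branch
option (illustrative invocation; './workload' is a placeholder):

  perf record -j any,u,tos -- ./workload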

Signed-off-by: Kan Liang <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 16 ++++++++++++++--
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 20 +++++++++++++++++---
tools/perf/util/evsel.h | 6 ++++++
tools/perf/util/parse-branch-options.c | 3 ++-
tools/perf/util/perf_event_attr_fprintf.c | 3 ++-
6 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index bb7b271397a6..c2da61c9ace7 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -180,7 +180,10 @@ enum perf_branch_sample_type_shift {

PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */

- PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
+ PERF_SAMPLE_BRANCH_MAX_SHIFT = 17, /* non-ABI */
+
+ /* PMU specific */
+ PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT = 63, /* save LBR TOS */
};

enum perf_branch_sample_type {
@@ -208,8 +211,13 @@ enum perf_branch_sample_type {
1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,

PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
+
+ PERF_SAMPLE_BRANCH_LBR_TOS = 1ULL << PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT,
};

+#define PERF_SAMPLE_BRANCH_MASK ((PERF_SAMPLE_BRANCH_MAX - 1) |\
+ PERF_SAMPLE_BRANCH_LBR_TOS)
+
/*
* Common flow change classification
*/
@@ -849,7 +857,11 @@ enum perf_event_type {
* char data[size];}&& PERF_SAMPLE_RAW
*
* { u64 nr;
- * { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
+ * { u64 from, to, flags } lbr[nr];
+ *
+ * # only available if PERF_SAMPLE_BRANCH_LBR_TOS is set
+ * u64 tos;
+ * } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
* u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index a0a0c91cde4a..98794758546b 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -130,6 +130,7 @@ struct perf_sample {
u32 raw_size;
u64 data_src;
u64 phys_addr;
+ u64 lbr_tos;
u32 flags;
u16 insn_len;
u8 cpumode;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1bf60f325608..b19669eb4437 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -712,7 +712,8 @@ static void __perf_evsel__config_callchain(struct evsel *evsel,
attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
PERF_SAMPLE_BRANCH_CALL_STACK |
PERF_SAMPLE_BRANCH_NO_CYCLES |
- PERF_SAMPLE_BRANCH_NO_FLAGS;
+ PERF_SAMPLE_BRANCH_NO_FLAGS |
+ PERF_SAMPLE_BRANCH_LBR_TOS;
}
} else
pr_warning("Cannot use LBR callstack with branch stack. "
@@ -763,7 +764,8 @@ perf_evsel__reset_callgraph(struct evsel *evsel,
if (param->record_mode == CALLCHAIN_LBR) {
perf_evsel__reset_sample_bit(evsel, BRANCH_STACK);
attr->branch_sample_type &= ~(PERF_SAMPLE_BRANCH_USER |
- PERF_SAMPLE_BRANCH_CALL_STACK);
+ PERF_SAMPLE_BRANCH_CALL_STACK |
+ PERF_SAMPLE_BRANCH_LBR_TOS);
}
if (param->record_mode == CALLCHAIN_DWARF) {
perf_evsel__reset_sample_bit(evsel, REGS_USER);
@@ -1641,6 +1643,8 @@ int evsel__open(struct evsel *evsel, struct perf_cpu_map *cpus,
evsel->core.attr.ksymbol = 0;
if (perf_missing_features.bpf)
evsel->core.attr.bpf_event = 0;
+ if (perf_missing_features.lbr_tos)
+ evsel->core.attr.branch_sample_type &= ~PERF_SAMPLE_BRANCH_LBR_TOS;
retry_sample_id:
if (perf_missing_features.sample_id_all)
evsel->core.attr.sample_id_all = 0;
@@ -1752,7 +1756,12 @@ int evsel__open(struct evsel *evsel, struct perf_cpu_map *cpus,
* Must probe features in the order they were added to the
* perf_event_attr interface.
*/
- if (!perf_missing_features.aux_output && evsel->core.attr.aux_output) {
+ if (!perf_missing_features.lbr_tos &&
+ (evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_LBR_TOS)) {
+ perf_missing_features.lbr_tos = true;
+ pr_debug2("switching off LBR TOS support\n");
+ goto fallback_missing_features;
+ } else if (!perf_missing_features.aux_output && evsel->core.attr.aux_output) {
perf_missing_features.aux_output = true;
pr_debug2_peo("Kernel has no attr.aux_output support, bailing out\n");
goto out_close;
@@ -2129,6 +2138,11 @@ int perf_evsel__parse_sample(struct evsel *evsel, union perf_event *event,
sz = data->branch_stack->nr * sizeof(struct branch_entry);
OVERFLOW_CHECK(array, sz, max_size);
array = (void *)array + sz;
+
+ if (perf_evsel__has_lbr_tos(evsel))
+ data->lbr_tos = *array++;
+ else
+ data->lbr_tos = -1ULL;
}

if (type & PERF_SAMPLE_REGS_USER) {
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ddc5ee6f6592..43a9fd83f791 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -115,6 +115,7 @@ struct perf_missing_features {
bool ksymbol;
bool bpf;
bool aux_output;
+ bool lbr_tos;
};

extern struct perf_missing_features perf_missing_features;
@@ -377,6 +378,11 @@ for ((_evsel) = _leader; \
(_evsel) && (_evsel)->leader == (_leader); \
(_evsel) = list_entry((_evsel)->core.node.next, struct evsel, core.node))

+static inline bool perf_evsel__has_lbr_tos(const struct evsel *evsel)
+{
+ return evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_LBR_TOS;
+}
+
static inline bool perf_evsel__has_branch_callstack(const struct evsel *evsel)
{
return evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK;
diff --git a/tools/perf/util/parse-branch-options.c b/tools/perf/util/parse-branch-options.c
index bb4aa88c50a8..ce8b9ffc0663 100644
--- a/tools/perf/util/parse-branch-options.c
+++ b/tools/perf/util/parse-branch-options.c
@@ -13,7 +13,7 @@

struct branch_mode {
const char *name;
- int mode;
+ u64 mode;
};

static const struct branch_mode branch_modes[] = {
@@ -32,6 +32,7 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL),
BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE),
BRANCH_OPT("stack", PERF_SAMPLE_BRANCH_CALL_STACK),
+ BRANCH_OPT("tos", PERF_SAMPLE_BRANCH_LBR_TOS),
BRANCH_END
};

diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index d4ad3f04923a..3411b67ea92a 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -8,7 +8,7 @@
#include "util/evsel_fprintf.h"

struct bit_names {
- int bit;
+ u64 bit;
const char *name;
};

@@ -50,6 +50,7 @@ static void __p_branch_sample_type(char *buf, size_t size, u64 value)
bit_name(ABORT_TX), bit_name(IN_TX), bit_name(NO_TX),
bit_name(COND), bit_name(CALL_STACK), bit_name(IND_JUMP),
bit_name(CALL), bit_name(NO_FLAGS), bit_name(NO_CYCLES),
+ bit_name(LBR_TOS),
{ .name = NULL, }
};
#undef bit_name
--
2.17.1


2019-11-19 14:36:53

by Liang, Kan

Subject: [PATCH V4 01/13] perf/core: Add new branch sample type for LBR TOS

From: Kan Liang <[email protected]>

In LBR call stack mode, the depth of the reconstructed LBR call stack
is limited to the number of LBR registers. With LBR Top-of-Stack (TOS)
information, the perf tool may stitch the stacks of two samples. The
reconstructed LBR call stack is then no longer bound by that HW
limitation.

Add a new branch sample type to retrieve the LBR TOS. The new type is
PMU specific. Add it at the end of enum perf_branch_sample_type.
Add a macro to retrieve the defined bits of the branch sample type.
Update perf_copy_attr() to handle the new bit.

The TOS information is only dumped into the PERF_SAMPLE_BRANCH_STACK
output when the new branch sample type is set.
The perf tool should check attr.branch_sample_type and apply the
corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
Otherwise, some use cases may be broken. For example, users may parse a
perf.data which includes the new branch sample type with an old version
of the perf tool (without the check). Users probably get incorrect
information without any warning.
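
For reference, a minimal user-space sketch (not part of the patch) that
requests the new bit directly via perf_event_open(). It assumes a uapi
header that already carries PERF_SAMPLE_BRANCH_LBR_TOS; on a kernel
without this patch the syscall fails with EINVAL because of the
perf_copy_attr() check below.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        struct perf_event_attr attr;
        long fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100003;
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
        attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                                  PERF_SAMPLE_BRANCH_CALL_STACK |
                                  PERF_SAMPLE_BRANCH_LBR_TOS;
        attr.exclude_kernel = 1;

        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

        return fd < 0;  /* EINVAL on kernels without this patch */
}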

Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 2 ++
include/uapi/linux/perf_event.h | 16 ++++++++++++++--
kernel/events/core.c | 13 ++++++++++++-
3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 011dcbdbccc2..761021c7ee8a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -93,6 +93,7 @@ struct perf_raw_record {
/*
* branch stack layout:
* nr: number of taken branches stored in entries[]
+ * tos: Top-of-Stack (TOS) information. PMU specific data.
*
* Note that nr can vary from sample to sample
* branches (to, from) are stored from most recent
@@ -101,6 +102,7 @@ struct perf_raw_record {
*/
struct perf_branch_stack {
__u64 nr;
+ __u64 tos; /* PMU specific data */
struct perf_branch_entry entries[0];
};

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..c2da61c9ace7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -180,7 +180,10 @@ enum perf_branch_sample_type_shift {

PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */

- PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
+ PERF_SAMPLE_BRANCH_MAX_SHIFT = 17, /* non-ABI */
+
+ /* PMU specific */
+ PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT = 63, /* save LBR TOS */
};

enum perf_branch_sample_type {
@@ -208,8 +211,13 @@ enum perf_branch_sample_type {
1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,

PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
+
+ PERF_SAMPLE_BRANCH_LBR_TOS = 1ULL << PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT,
};

+#define PERF_SAMPLE_BRANCH_MASK ((PERF_SAMPLE_BRANCH_MAX - 1) |\
+ PERF_SAMPLE_BRANCH_LBR_TOS)
+
/*
* Common flow change classification
*/
@@ -849,7 +857,11 @@ enum perf_event_type {
* char data[size];}&& PERF_SAMPLE_RAW
*
* { u64 nr;
- * { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
+ * { u64 from, to, flags } lbr[nr];
+ *
+ * # only available if PERF_SAMPLE_BRANCH_LBR_TOS is set
+ * u64 tos;
+ * } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
* u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
diff --git a/kernel/events/core.c b/kernel/events/core.c
index cfd89b4a02d8..8aff3aad43b5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6391,6 +6391,11 @@ static void perf_output_read(struct perf_output_handle *handle,
perf_output_read_one(handle, event, enabled, running);
}

+static inline bool perf_sample_save_lbr_tos(struct perf_event *event)
+{
+ return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_LBR_TOS;
+}
+
void perf_output_sample(struct perf_output_handle *handle,
struct perf_event_header *header,
struct perf_sample_data *data,
@@ -6480,6 +6485,8 @@ void perf_output_sample(struct perf_output_handle *handle,

perf_output_put(handle, data->br_stack->nr);
perf_output_copy(handle, data->br_stack->entries, size);
+ if (perf_sample_save_lbr_tos(event))
+ perf_output_put(handle, data->br_stack->tos);
} else {
/*
* we always store at least the value of nr
@@ -6667,7 +6674,11 @@ void perf_prepare_sample(struct perf_event_header *header,
if (data->br_stack) {
size += data->br_stack->nr
* sizeof(struct perf_branch_entry);
+
+ if (perf_sample_save_lbr_tos(event))
+ size += sizeof(u64);
}
+
header->size += size;
}

@@ -10731,7 +10742,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
u64 mask = attr->branch_sample_type;

/* only using defined bits */
- if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
+ if (mask & ~PERF_SAMPLE_BRANCH_MASK)
return -EINVAL;

/* at least one branch bit must be set */
--
2.17.1


2019-11-19 14:36:54

by Liang, Kan

Subject: [PATCH V4 04/13] perf header: Add check for event attr

From: Kan Liang <[email protected]>

The perf.data may be generated by a newer version of the perf tool,
which supports new input bits in attr, e.g. a new bit for
branch_sample_type.
The perf.data may later be parsed by an older version of the perf tool.
The old perf tool may parse the perf.data incorrectly. There is no
warning message for this case.

Currently the perf header never checks for unknown input bits in attr.

When reading the event description from the header, check the stored
event attr. The reserved bits, sample type, read format and branch
sample type are checked.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/header.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index becc2d109423..7ed481c9bcdf 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1599,6 +1599,41 @@ static void free_event_desc(struct evsel *events)
free(events);
}

+static bool perf_attr_check(struct perf_event_attr *attr)
+{
+ if (attr->__reserved_1) {
+ pr_warning("Unexpected reserved bits (0x%x) are detected. "
+ "Please update perf tool.\n",
+ attr->__reserved_1);
+ return false;
+ }
+
+ if (attr->sample_type & ~(PERF_SAMPLE_MAX-1)) {
+ pr_warning("Unknown sample type (0x%llx) is detected. "
+ "Please update perf tool.\n",
+ attr->sample_type);
+ return false;
+ }
+
+ if (attr->read_format & ~(PERF_FORMAT_MAX-1)) {
+ pr_warning("Unknown read format (0x%llx) is detected. "
+ "Please update perf tool.\n",
+ attr->read_format);
+ return false;
+ }
+
+ if ((attr->sample_type & PERF_SAMPLE_BRANCH_STACK) &&
+ (attr->branch_sample_type & ~PERF_SAMPLE_BRANCH_MASK)) {
+ pr_warning("Unknown branch sample type (0x%llx) is detected. "
+ "Please update perf tool.\n",
+ attr->branch_sample_type);
+
+ return false;
+ }
+
+ return true;
+}
+
static struct evsel *read_event_desc(struct feat_fd *ff)
{
struct evsel *evsel, *events = NULL;
@@ -1643,6 +1678,9 @@ static struct evsel *read_event_desc(struct feat_fd *ff)

memcpy(&evsel->core.attr, buf, msz);

+ if (!perf_attr_check(&evsel->core.attr))
+ goto error;
+
if (do_read_u32(ff, &nr))
goto error;

--
2.17.1


2019-11-19 14:37:02

by Liang, Kan

Subject: [PATCH V4 08/13] perf tools: Stitch LBR call stack

From: Kan Liang <[email protected]>

In LBR call stack mode, the depth of the reconstructed LBR call stack
is limited to the number of LBR registers.

For example, on Skylake, the depth of the reconstructed LBR call stack
is always <= 32.

# To display the perf.data header info, please use
# --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 6K of event 'cycles'
# Event count (approx.): 6487119731
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... ..................
# ................................

99.97% 99.97% tchain_edit tchain_edit [.] f43
|
--99.64%--f11
f12
f13
f14
f15
f16
f17
f18
f19
f20
f21
f22
f23
f24
f25
f26
f27
f28
f29
f30
f31
f32
f33
f34
f35
f36
f37
f38
f39
f40
f41
f42
f43

For a call stack which is deeper than the LBR limit, HW will overwrite
the LBR registers holding the oldest branches. Only a partial call
stack can be reconstructed.

However, the overwritten LBRs may still be retrieved from the previous
sample. At that moment, HW hadn't overwritten the LBR registers yet.
Perf tools can stitch those overwritten LBRs onto the current call
stack to get a more complete call stack.

To determine if LBRs can be stitched, perf tools need to compare the
current sample with the previous sample.
- They should have identical LBR records (same from, to and flags
  values, and the same physical index of the LBR registers).
- The searching starts from the base-of-stack of the current sample
  (see the worked example below).
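
For illustration (this mirrors the computation in has_stitched_lbr()
below), assume 32 LBR registers. If the current sample has nr = 32
entries and TOS = 7, its base-of-stack sits in physical register
32 - 32 + 7 + 1 = 8. If the previous sample's TOS was 15, the distance
between the two is 15 - 8 = 7, so the previous sample's entries[7..0]
are compared against the oldest entries of the current sample. If they
are identical, the previous sample's entries[8..31], which HW has
overwritten by now, are the candidates to be stitched onto the current
call stack.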

In struct lbr_stitch, add 'prev_sample' to save the previous sample.
Add 'prev_lbr_cursor' to save all LBR cursor nodes from the previous
sample.
Once perf decides to stitch the previous LBRs, the corresponding LBR
cursor nodes are copied to 'lists'.
'lists' tracks the LBR cursor nodes which are going to be stitched.
When the stitching is over, the nodes are not freed immediately.
They are moved to 'free_lists', so the next stitching may reuse the
space.
Both 'lists' and 'free_lists' are freed when all samples are processed.

The 'lbr_stitch_enable' flag indicates whether the LBR stitching
approach is enabled. It is disabled by default. The following patch
will introduce a new option to enable the LBR stitching approach.
This is because:
- The stitching approach is based on the LBR call stack technology. The
  known limitations of the LBR call stack technology still apply to the
  approach, e.g. exception handling such as setjmp/longjmp will have
  calls/returns that do not match.
- This approach is not foolproof. There can be cases where it creates
  incorrect call stacks from incorrect matches. There is no attempt
  to validate any matches in another way.

However, in many common cases with call stack overflows it can recreate
better call stacks than the default LBR call stack output. So if there
are problems with LBR overflows, this is a possible workaround.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/branch.h | 5 +-
tools/perf/util/callchain.h | 12 +-
tools/perf/util/machine.c | 233 +++++++++++++++++++++++++++++++++++-
tools/perf/util/thread.c | 2 +
tools/perf/util/thread.h | 34 ++++++
5 files changed, 281 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/branch.h b/tools/perf/util/branch.h
index 88e00d268f6f..749fce3675b6 100644
--- a/tools/perf/util/branch.h
+++ b/tools/perf/util/branch.h
@@ -34,7 +34,10 @@ struct branch_info {
struct branch_entry {
u64 from;
u64 to;
- struct branch_flags flags;
+ union {
+ struct branch_flags flags;
+ u64 flags_value;
+ };
};

struct branch_stack {
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 706bb7bbe1e1..e599a23c0fdb 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -148,7 +148,17 @@ struct callchain_cursor_node {
u64 branch_from;
int nr_loop_iter;
u64 iter_cycles;
- struct callchain_cursor_node *next;
+ union {
+ struct callchain_cursor_node *next;
+
+ /* Indicate valid cursor node for LBR stitch */
+ bool valid;
+ };
+};
+
+struct stitch_list {
+ struct list_head node;
+ struct callchain_cursor_node cursor;
};

struct callchain_cursor {
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 1a2c3e26c01f..28d38eff7444 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2227,6 +2227,31 @@ static int lbr_callchain_add_kernel_ip(struct thread *thread,
return 0;
}

+static void save_lbr_cursor_node(struct thread *thread,
+ struct callchain_cursor *cursor,
+ int idx)
+{
+ struct lbr_stitch *lbr_stitch = thread->lbr_stitch;
+
+ if (!lbr_stitch)
+ return;
+
+ if (cursor->pos == cursor->nr) {
+ lbr_stitch->prev_lbr_cursor[idx].valid = false;
+ return;
+ }
+
+ if (!cursor->curr)
+ cursor->curr = cursor->first;
+ else
+ cursor->curr = cursor->curr->next;
+ memcpy(&lbr_stitch->prev_lbr_cursor[idx], cursor->curr,
+ sizeof(struct callchain_cursor_node));
+
+ lbr_stitch->prev_lbr_cursor[idx].valid = true;
+ cursor->pos++;
+}
+
static int lbr_callchain_add_lbr_ip(struct thread *thread,
struct callchain_cursor *cursor,
struct perf_sample *sample,
@@ -2241,6 +2266,21 @@ static int lbr_callchain_add_lbr_ip(struct thread *thread,
u64 ip, branch_from = 0;
int err, i;

+ /*
+ * The curr and pos are not used in writing session. They are cleared
+ * in callchain_cursor_commit() when the writing session is closed.
+ * Using curr and pos to track the current cursor node.
+ */
+ if (thread->lbr_stitch) {
+ cursor->curr = NULL;
+ cursor->pos = cursor->nr;
+ if (cursor->nr) {
+ cursor->curr = cursor->first;
+ for (i = 0; i < (int)(cursor->nr - 1); i++)
+ cursor->curr = cursor->curr->next;
+ }
+ }
+
if (callee) {
ip = lbr_stack->entries[0].to;
flags = &lbr_stack->entries[0].flags;
@@ -2251,6 +2291,20 @@ static int lbr_callchain_add_lbr_ip(struct thread *thread,
if (err)
return err;

+ /*
+ * The number of cursor node increases.
+ * Move the current cursor node.
+ * But does not need to save current cursor node for entry 0.
+ * It's impossible to stitch the whole LBRs of previous sample.
+ */
+ if (thread->lbr_stitch && (cursor->pos != cursor->nr)) {
+ if (!cursor->curr)
+ cursor->curr = cursor->first;
+ else
+ cursor->curr = cursor->curr->next;
+ cursor->pos++;
+ }
+
for (i = 0; i < lbr_nr; i++) {
ip = lbr_stack->entries[i].from;
flags = &lbr_stack->entries[i].flags;
@@ -2259,6 +2313,7 @@ static int lbr_callchain_add_lbr_ip(struct thread *thread,
true, flags, NULL, branch_from);
if (err)
return err;
+ save_lbr_cursor_node(thread, cursor, i);
}
} else {
for (i = lbr_nr - 1; i >= 0; i--) {
@@ -2269,6 +2324,7 @@ static int lbr_callchain_add_lbr_ip(struct thread *thread,
true, flags, NULL, branch_from);
if (err)
return err;
+ save_lbr_cursor_node(thread, cursor, i);
}

ip = lbr_stack->entries[0].to;
@@ -2284,6 +2340,146 @@ static int lbr_callchain_add_lbr_ip(struct thread *thread,
return 0;
}

+static int lbr_callchain_add_stitched_lbr_ip(struct thread *thread,
+ struct callchain_cursor *cursor)
+{
+ struct lbr_stitch *lbr_stitch = thread->lbr_stitch;
+ struct stitch_list *stitch_node;
+ int err;
+
+ struct callchain_cursor_node *cnode;
+
+ list_for_each_entry(stitch_node, &lbr_stitch->lists, node) {
+ cnode = &stitch_node->cursor;
+
+ err = callchain_cursor_append(cursor, cnode->ip,
+ &cnode->ms,
+ cnode->branch,
+ &cnode->branch_flags,
+ cnode->nr_loop_iter,
+ cnode->iter_cycles,
+ cnode->branch_from,
+ cnode->srcline);
+ if (err)
+ return err;
+
+ }
+ return 0;
+}
+
+static struct stitch_list *get_stitch_node(struct thread *thread)
+{
+ struct lbr_stitch *lbr_stitch = thread->lbr_stitch;
+ struct stitch_list *stitch_node;
+
+ if (!list_empty(&lbr_stitch->free_lists)) {
+ stitch_node = list_first_entry(&lbr_stitch->free_lists,
+ struct stitch_list, node);
+ list_del(&stitch_node->node);
+
+ return stitch_node;
+ }
+
+ return malloc(sizeof(struct stitch_list));
+}
+
+static bool has_stitched_lbr(struct thread *thread,
+ struct perf_sample *cur,
+ struct perf_sample *prev,
+ unsigned int max_lbr,
+ bool callee)
+{
+ struct branch_stack *cur_stack = cur->branch_stack;
+ struct branch_stack *prev_stack = prev->branch_stack;
+ struct lbr_stitch *lbr_stitch = thread->lbr_stitch;
+ int i, j, nr_identical_branches = 0;
+ struct stitch_list *stitch_node;
+ u64 cur_base, distance;
+
+ if (!cur_stack || !prev_stack)
+ return false;
+
+ /* Find the physical index of the base-of-stack for current sample. */
+ cur_base = max_lbr - cur_stack->nr + cur->lbr_tos + 1;
+
+ distance = (prev->lbr_tos > cur_base) ? (prev->lbr_tos - cur_base) :
+ (max_lbr + prev->lbr_tos - cur_base);
+ /* Previous sample has shorter stack. Nothing can be stitched. */
+ if (distance + 1 > prev_stack->nr)
+ return false;
+
+ /*
+ * Check if there are identical LBRs between two samples.
+ * Identical LBRs must have the same from, to and flags values. Also,
+ * they have to be saved in the same LBR registers (same physical
+ * index).
+ *
+ * Starts from the base-of-stack of current sample.
+ */
+ for (i = distance, j = cur_stack->nr - 1; (i >= 0) && (j >= 0); i--, j--) {
+ if ((prev_stack->entries[i].from != cur_stack->entries[j].from) ||
+ (prev_stack->entries[i].to != cur_stack->entries[j].to) ||
+ (prev_stack->entries[i].flags_value != cur_stack->entries[j].flags_value))
+ break;
+
+ nr_identical_branches++;
+ }
+
+ if (!nr_identical_branches)
+ return false;
+
+ /*
+ * Save the LBRs between the base-of-stack of previous sample
+ * and the base-of-stack of current sample into lbr_stitch->lists.
+ * These LBRs will be stitched later.
+ */
+ for (i = prev_stack->nr - 1; i > (int)distance; i--) {
+
+ if (!lbr_stitch->prev_lbr_cursor[i].valid)
+ continue;
+
+ stitch_node = get_stitch_node(thread);
+ if (!stitch_node)
+ return false;
+
+ memcpy(&stitch_node->cursor, &lbr_stitch->prev_lbr_cursor[i],
+ sizeof(struct callchain_cursor_node));
+
+ if (callee)
+ list_add(&stitch_node->node, &lbr_stitch->lists);
+ else
+ list_add_tail(&stitch_node->node, &lbr_stitch->lists);
+ }
+
+ return true;
+}
+
+static bool alloc_lbr_stitch(struct thread *thread, unsigned int max_lbr)
+{
+ if (thread->lbr_stitch)
+ return true;
+
+ thread->lbr_stitch = calloc(1, sizeof(struct lbr_stitch));
+ if (!thread->lbr_stitch)
+ goto err;
+
+ thread->lbr_stitch->prev_lbr_cursor = calloc(max_lbr + 1, sizeof(struct callchain_cursor_node));
+ if (!thread->lbr_stitch->prev_lbr_cursor)
+ goto free_lbr_stitch;
+
+ INIT_LIST_HEAD(&thread->lbr_stitch->lists);
+ INIT_LIST_HEAD(&thread->lbr_stitch->free_lists);
+
+ return true;
+
+free_lbr_stitch:
+ free(thread->lbr_stitch);
+ thread->lbr_stitch = NULL;
+err:
+ pr_warning("Failed to allocate space for stitched LBRs. Disable LBR stitch\n");
+ thread->lbr_stitch_enable = false;
+ return false;
+}
/*
* Recolve LBR callstack chain sample
* Return:
@@ -2296,10 +2492,14 @@ static int resolve_lbr_callchain_sample(struct thread *thread,
struct perf_sample *sample,
struct symbol **parent,
struct addr_location *root_al,
- int max_stack)
+ int max_stack,
+ unsigned int max_lbr)
{
struct ip_callchain *chain = sample->callchain;
int chain_nr = min(max_stack, (int)chain->nr);
+ bool callee = (callchain_param.order == ORDER_CALLEE);
+ struct lbr_stitch *lbr_stitch;
+ bool stitched_lbr = false;
int i, err;

for (i = 0; i < chain_nr; i++) {
@@ -2314,7 +2514,21 @@ static int resolve_lbr_callchain_sample(struct thread *thread,
if (i == chain_nr)
return 0;

- if (callchain_param.order == ORDER_CALLEE) {
+ if (thread->lbr_stitch_enable && sample->lbr_tos != (-1ULL) &&
+ (max_lbr > 0) && alloc_lbr_stitch(thread, max_lbr)) {
+ lbr_stitch = thread->lbr_stitch;
+
+ stitched_lbr = has_stitched_lbr(thread, sample,
+ &lbr_stitch->prev_sample,
+ max_lbr, callee);
+ if (!stitched_lbr) {
+ list_replace_init(&lbr_stitch->lists,
+ &lbr_stitch->free_lists);
+ }
+ memcpy(&lbr_stitch->prev_sample, sample, sizeof(*sample));
+ }
+
+ if (callee) {
err = lbr_callchain_add_kernel_ip(thread, cursor, sample,
parent, root_al, true, i);
if (err)
@@ -2323,7 +2537,17 @@ static int resolve_lbr_callchain_sample(struct thread *thread,
parent, root_al, true);
if (err)
goto error;
+ if (stitched_lbr) {
+ err = lbr_callchain_add_stitched_lbr_ip(thread, cursor);
+ if (err)
+ goto error;
+ }
} else {
+ if (stitched_lbr) {
+ err = lbr_callchain_add_stitched_lbr_ip(thread, cursor);
+ if (err)
+ goto error;
+ }
err = lbr_callchain_add_lbr_ip(thread, cursor, sample,
parent, root_al, false);
if (err)
@@ -2380,8 +2604,11 @@ static int thread__resolve_callchain_sample(struct thread *thread,
chain_nr = chain->nr;

if (perf_evsel__has_branch_callstack(evsel)) {
+ struct perf_env *env = perf_evsel__env(evsel);
+
err = resolve_lbr_callchain_sample(thread, cursor, sample, parent,
- root_al, max_stack);
+ root_al, max_stack,
+ !env ? 0 : env->max_branches);
if (err)
return (err < 0) ? err : 0;
}
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index 0a277a920970..b8503c345c14 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -47,6 +47,7 @@ struct thread *thread__new(pid_t pid, pid_t tid)
thread->tid = tid;
thread->ppid = -1;
thread->cpu = -1;
+ thread->lbr_stitch_enable = false;
INIT_LIST_HEAD(&thread->namespaces_list);
INIT_LIST_HEAD(&thread->comm_list);
init_rwsem(&thread->namespaces_lock);
@@ -110,6 +111,7 @@ void thread__delete(struct thread *thread)

exit_rwsem(&thread->namespaces_lock);
exit_rwsem(&thread->comm_lock);
+ thread__free_stitch_list(thread);
free(thread);
}

diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index 51bdb9a7af7f..49e5e493fa4c 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -13,6 +13,8 @@
#include <strlist.h>
#include <intlist.h>
#include "rwsem.h"
+#include "event.h"
+#include "callchain.h"

struct addr_location;
struct map;
@@ -20,6 +22,13 @@ struct perf_record_namespaces;
struct thread_stack;
struct unwind_libunwind_ops;

+struct lbr_stitch {
+ struct list_head lists;
+ struct list_head free_lists;
+ struct perf_sample prev_sample;
+ struct callchain_cursor_node *prev_lbr_cursor;
+};
+
struct thread {
union {
struct rb_node rb_node;
@@ -46,6 +55,10 @@ struct thread {
struct srccode_state srccode_state;
bool filter;
int filter_entry_depth;
+
+ /* LBR call stack stitch */
+ bool lbr_stitch_enable;
+ struct lbr_stitch *lbr_stitch;
};

struct machine;
@@ -142,4 +155,25 @@ static inline bool thread__is_filtered(struct thread *thread)
return false;
}

+static inline void thread__free_stitch_list(struct thread *thread)
+{
+ struct lbr_stitch *lbr_stitch = thread->lbr_stitch;
+ struct stitch_list *pos, *tmp;
+
+ if (!lbr_stitch)
+ return;
+
+ list_for_each_entry_safe(pos, tmp, &lbr_stitch->lists, node) {
+ list_del_init(&pos->node);
+ free(pos);
+ }
+
+ list_for_each_entry_safe(pos, tmp, &lbr_stitch->free_lists, node) {
+ list_del_init(&pos->node);
+ free(pos);
+ }
+ free(lbr_stitch->prev_lbr_cursor);
+ free(thread->lbr_stitch);
+}
+
#endif /* __PERF_THREAD_H */
--
2.17.1


2019-11-19 14:37:13

by Liang, Kan

Subject: [RFC PATCH V4 13/13] perf hist: Add fast path for duplicate entries check

From: Kan Liang <[email protected]>

Perf checks for duplicate entries in a callchain before adding an
entry. However the check is very slow, especially with deep call
stacks. Almost 50% of the elapsed time of perf report is spent on the
check when the call stack is always 32 deep.

hist_entry__cmp() is used to compare the new entry with the old
entries. It goes through all the available sorts in the sort_list,
and calls the specific cmp of each sort, which is very slow.
In fact, for most cases, there are no duplicate entries in the
callchain. The symbols are usually different. It's much faster to do a
quick check of the symbols first, and only do the full cmp when the
symbols are exactly the same.
The quick check only checks the symbols, not the dso. Export
_sort__sym_cmp().

$perf record --call-graph lbr ./tchain_edit_64

Without the patch
$time perf report --stdio
real 0m21.142s
user 0m21.110s
sys 0m0.033s

With the patch
$time perf report --stdio
real 0m10.977s
user 0m10.948s
sys 0m0.027s

Signed-off-by: Kan Liang <[email protected]>
Cc: Namhyung Kim <[email protected]>
---
tools/perf/util/hist.c | 23 +++++++++++++++++++++++
tools/perf/util/sort.c | 2 +-
tools/perf/util/sort.h | 2 ++
3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 0a8d72ae93ca..6eb35dde3905 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -1057,6 +1057,20 @@ iter_next_cumulative_entry(struct hist_entry_iter *iter,
return fill_callchain_info(al, node, iter->hide_unresolved);
}

+static bool
+hist_entry__fast__sym_diff(struct hist_entry *left,
+ struct hist_entry *right)
+{
+ struct symbol *sym_l = left->ms.sym;
+ struct symbol *sym_r = right->ms.sym;
+
+ if (!sym_l && !sym_r)
+ return left->ip != right->ip;
+
+ return !!_sort__sym_cmp(sym_l, sym_r);
+}
+
+
static int
iter_add_next_cumulative_entry(struct hist_entry_iter *iter,
struct addr_location *al)
@@ -1083,6 +1097,7 @@ iter_add_next_cumulative_entry(struct hist_entry_iter *iter,
};
int i;
struct callchain_cursor cursor;
+ bool fast = hists__has(he_tmp.hists, sym);

callchain_cursor_snapshot(&cursor, &callchain_cursor);

@@ -1093,6 +1108,14 @@ iter_add_next_cumulative_entry(struct hist_entry_iter *iter,
* It's possible that it has cycles or recursive calls.
*/
for (i = 0; i < iter->curr; i++) {
+ /*
+ * For most cases, there are no duplicate entries in callchain.
+ * The symbols are usually different. Do a quick check for
+ * symbols first.
+ */
+ if (fast && hist_entry__fast__sym_diff(he_cache[i], &he_tmp))
+ continue;
+
if (hist_entry__cmp(he_cache[i], &he_tmp) == 0) {
/* to avoid calling callback function */
iter->he = NULL;
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 6b626e6b111e..afa1ac233760 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -234,7 +234,7 @@ static int64_t _sort__addr_cmp(u64 left_ip, u64 right_ip)
return (int64_t)(right_ip - left_ip);
}

-static int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r)
+int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r)
{
if (!sym_l || !sym_r)
return cmp_null(sym_l, sym_r);
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 5aff9542d9b7..d608b8a28a92 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -307,5 +307,7 @@ int64_t
sort__daddr_cmp(struct hist_entry *left, struct hist_entry *right);
int64_t
sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right);
+int64_t
+_sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r);
char *hist_entry__srcline(struct hist_entry *he);
#endif /* __PERF_SORT_H */
--
2.17.1


2019-11-19 14:37:55

by Liang, Kan

Subject: [PATCH V4 09/13] perf report: Add option to enable the LBR stitching approach

From: Kan Liang <[email protected]>

With the LBR stitching approach, the reconstructed LBR call stack
is no longer bound by the HW limitation. However, it may reconstruct
invalid call stacks in some cases, e.g. exception handling such as
setjmp/longjmp. Also, it may impact the processing time, especially
when the number of samples with stitched LBRs is huge.

Add an option to enable the approach.
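
For example, using the cover letter's test case:

  perf record --call-graph lbr -- ./tchain_edit
  perf report --stitch-lbr

which produces the complete call stack shown below: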

# To display the perf.data header info, please use
# --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 6K of event 'cycles'
# Event count (approx.): 6492797701
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... ..................
# .................................
#
99.99% 99.99% tchain_edit tchain_edit [.] f43
|
---main
f1
f2
f3
f4
f5
f6
f7
f8
f9
f10
f11
f12
f13
f14
f15
f16
f17
f18
f19
f20
f21
f22
f23
f24
f25
f26
f27
f28
f29
f30
f31
|
--99.65%--f32
f33
f34
f35
f36
f37
f38
f39
f40
f41
f42
f43

Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-report.txt | 11 +++++++++++
tools/perf/builtin-report.c | 6 ++++++
2 files changed, 17 insertions(+)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 8dbe2119686a..b42bd38e5790 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -476,6 +476,17 @@ include::itrace.txt[]
This option extends the perf report to show reference callgraphs,
which collected by reference event, in no callgraph event.

+--stitch-lbr::
+ Show callgraph with stitched LBRs, which may have a more complete
+ callgraph. The perf.data file must have been obtained using
+ perf record --call-graph lbr.
+ Disabled by default. In common cases with call stack overflows,
+ it can recreate better call stacks than the default lbr call stack
+ output. But this approach is not foolproof. There can be cases
+ where it creates incorrect call stacks from incorrect matches.
+ A known limitation is exception handling such as
+ setjmp/longjmp, which will have calls/returns that do not match.
+
--socket-filter::
Only report the samples on the processor socket that match with this filter

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 585805f51f15..00c1d8a47b18 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -84,6 +84,7 @@ struct report {
bool header_only;
bool nonany_branch_mode;
bool group_set;
+ bool stitch_lbr;
int max_stack;
struct perf_read_values show_threads_values;
struct annotation_options annotation_opts;
@@ -267,6 +268,9 @@ static int process_sample_event(struct perf_tool *tool,
return -1;
}

+ if (rep->stitch_lbr)
+ al.thread->lbr_stitch_enable = true;
+
if (symbol_conf.hide_unresolved && al.sym == NULL)
goto out_put;

@@ -1229,6 +1233,8 @@ int cmd_report(int argc, const char **argv)
"Show full source file name path for source lines"),
OPT_BOOLEAN(0, "show-ref-call-graph", &symbol_conf.show_ref_callgraph,
"Show callgraph from reference event"),
+ OPT_BOOLEAN(0, "stitch-lbr", &report.stitch_lbr,
+ "Enable LBR callgraph stitching approach"),
OPT_INTEGER(0, "socket-filter", &report.socket_filter,
"only show processor socket that match with this filter"),
OPT_BOOLEAN(0, "raw-trace", &symbol_conf.raw_trace,
--
2.17.1


2019-11-19 14:38:01

by Liang, Kan

Subject: [PATCH V4 12/13] perf c2c: Add option to enable the LBR stitching approach

From: Kan Liang <[email protected]>

With the LBR stitching approach, the reconstructed LBR call stack
is no longer bound by the HW limitation. However, it may reconstruct
invalid call stacks in some cases, e.g. exception handling such as
setjmp/longjmp. Also, it may impact the processing time, especially
when the number of samples with stitched LBRs is huge.

Add an option to enable the approach.
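
For example (an illustrative invocation; './workload' is a
placeholder):

  perf c2c record --call-graph lbr -- ./workload
  perf c2c report --stitch-lbr --stdio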

Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-c2c.txt | 11 +++++++++++
tools/perf/builtin-c2c.c | 6 ++++++
2 files changed, 17 insertions(+)

diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
index e6150f21267d..2133eb320cb0 100644
--- a/tools/perf/Documentation/perf-c2c.txt
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -111,6 +111,17 @@ REPORT OPTIONS
--display::
Switch to HITM type (rmt, lcl) to display and sort on. Total HITMs as default.

+--stitch-lbr::
+ Show callgraph with stitched LBRs, which may have a more complete
+ callgraph. The perf.data file must have been obtained using
+ perf c2c record --call-graph lbr.
+ Disabled by default. In common cases with call stack overflows,
+ it can recreate better call stacks than the default lbr call stack
+ output. But this approach is not foolproof. There can be cases
+ where it creates incorrect call stacks from incorrect matches.
+ A known limitation is exception handling such as
+ setjmp/longjmp, which will have calls/returns that do not match.
+
C2C RECORD
----------
The perf c2c record command setup options related to HITM cacheline analysis
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index e69f44941aad..91c6277f958a 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -95,6 +95,7 @@ struct perf_c2c {
bool use_stdio;
bool stats_only;
bool symbol_full;
+ bool stitch_lbr;

/* HITM shared clines stats */
struct c2c_stats hitm_stats;
@@ -273,6 +274,9 @@ static int process_sample_event(struct perf_tool *tool __maybe_unused,
return -1;
}

+ if (c2c.stitch_lbr)
+ al.thread->lbr_stitch_enable = true;
+
ret = sample__resolve_callchain(sample, &callchain_cursor, NULL,
evsel, &al, sysctl_perf_event_max_stack);
if (ret)
@@ -2750,6 +2754,8 @@ static int perf_c2c__report(int argc, const char **argv)
OPT_STRING('c', "coalesce", &coalesce, "coalesce fields",
"coalesce fields: pid,tid,iaddr,dso"),
OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
+ OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
+ "Enable LBR callgraph stitching approach"),
OPT_PARENT(c2c_options),
OPT_END()
};
--
2.17.1


2019-11-19 14:38:17

by Liang, Kan

Subject: [PATCH V4 06/13] perf header: Support CPU PMU capabilities

From: Kan Liang <[email protected]>

To stitch the LBR call stack, the max LBR information is required. So
the CPU PMU capabilities information has to be stored in the perf
header.

Add a new feature HEADER_CPU_PMU_CAPS for the CPU PMU capabilities.
Retrieve all CPU PMU capabilities, not just the max LBR information.

Add the variable max_branches to facilitate future usage.

The CPU PMU capabilities information is only useful for LBR call stack
mode. Clear the feature for perf stat and other perf record modes.

Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
.../Documentation/perf.data-file-format.txt | 16 +++
tools/perf/builtin-record.c | 3 +
tools/perf/builtin-stat.c | 1 +
tools/perf/util/env.h | 3 +
tools/perf/util/header.c | 110 ++++++++++++++++++
tools/perf/util/header.h | 1 +
6 files changed, 134 insertions(+)

diff --git a/tools/perf/Documentation/perf.data-file-format.txt b/tools/perf/Documentation/perf.data-file-format.txt
index b0152e1095c5..b6472e463284 100644
--- a/tools/perf/Documentation/perf.data-file-format.txt
+++ b/tools/perf/Documentation/perf.data-file-format.txt
@@ -373,6 +373,22 @@ struct {
Indicates that trace contains records of PERF_RECORD_COMPRESSED type
that have perf_events records in compressed form.

+ HEADER_CPU_PMU_CAPS = 28,
+
+ A list of cpu PMU capabilities. The format of data is as below.
+
+struct {
+ u32 nr_cpu_pmu_caps;
+ {
+ char name[];
+ char value[];
+ } [nr_cpu_pmu_caps]
+};
+
+
+Example:
+ cpu pmu capabilities: branches=32, max_precise=3, pmu_name=icelake
+
other bits are reserved and should ignored for now
HEADER_FEAT_BITS = 256,

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index b95c000c1ed9..b53e19eb4b6c 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1106,6 +1106,9 @@ static void record__init_features(struct record *rec)
if (!record__comp_enabled(rec))
perf_header__clear_feat(&session->header, HEADER_COMPRESSED);

+ if (!callchain_param.enabled || (callchain_param.record_mode != CALLCHAIN_LBR))
+ perf_header__clear_feat(&session->header, HEADER_CPU_PMU_CAPS);
+
perf_header__clear_feat(&session->header, HEADER_STAT);
}

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 5964e808d73d..ed2d0aa7a861 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1471,6 +1471,7 @@ static void init_features(struct perf_session *session)
perf_header__clear_feat(&session->header, HEADER_TRACING_DATA);
perf_header__clear_feat(&session->header, HEADER_BRANCH_STACK);
perf_header__clear_feat(&session->header, HEADER_AUXTRACE);
+ perf_header__clear_feat(&session->header, HEADER_CPU_PMU_CAPS);
}

static int __cmd_record(int argc, const char **argv)
diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index 11d05ae3606a..d286d478b4d8 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -48,6 +48,7 @@ struct perf_env {
char *cpuid;
unsigned long long total_mem;
unsigned int msr_pmu_type;
+ unsigned int max_branches;

int nr_cmdline;
int nr_sibling_cores;
@@ -57,12 +58,14 @@ struct perf_env {
int nr_memory_nodes;
int nr_pmu_mappings;
int nr_groups;
+ int nr_cpu_pmu_caps;
char *cmdline;
const char **cmdline_argv;
char *sibling_cores;
char *sibling_dies;
char *sibling_threads;
char *pmu_mappings;
+ char *cpu_pmu_caps;
struct cpu_topology_map *cpu;
struct cpu_cache_level *caches;
int caches_cnt;
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 7ed481c9bcdf..6d32e5f1192c 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1404,6 +1404,39 @@ static int write_compressed(struct feat_fd *ff __maybe_unused,
return do_write(ff, &(ff->ph->env.comp_mmap_len), sizeof(ff->ph->env.comp_mmap_len));
}

+static int write_cpu_pmu_caps(struct feat_fd *ff,
+ struct evlist *evlist __maybe_unused)
+{
+ struct perf_pmu_caps *caps = NULL;
+ struct perf_pmu *cpu_pmu;
+ int nr_caps;
+ int ret;
+
+ cpu_pmu = perf_pmu__find("cpu");
+ if (!cpu_pmu)
+ return -ENOENT;
+
+ nr_caps = perf_pmu__caps_parse(cpu_pmu);
+ if (nr_caps < 0)
+ return nr_caps;
+
+ ret = do_write(ff, &nr_caps, sizeof(nr_caps));
+ if (ret < 0)
+ return ret;
+
+ while ((caps = perf_pmu__scan_caps(cpu_pmu, caps))) {
+ ret = do_write_string(ff, caps->name);
+ if (ret < 0)
+ return ret;
+
+ ret = do_write_string(ff, caps->value);
+ if (ret < 0)
+ return ret;
+ }
+
+ return ret;
+}
+
static void print_hostname(struct feat_fd *ff, FILE *fp)
{
fprintf(fp, "# hostname : %s\n", ff->ph->env.hostname);
@@ -1819,6 +1852,28 @@ static void print_compressed(struct feat_fd *ff, FILE *fp)
ff->ph->env.comp_level, ff->ph->env.comp_ratio);
}

+static void print_cpu_pmu_caps(struct feat_fd *ff, FILE *fp)
+{
+ const char *delimiter = "# cpu pmu capabilities: ";
+ char *str;
+ u32 nr_caps;
+
+ nr_caps = ff->ph->env.nr_cpu_pmu_caps;
+ if (!nr_caps) {
+ fprintf(fp, "# cpu pmu capabilities: not available\n");
+ return;
+ }
+
+ str = ff->ph->env.cpu_pmu_caps;
+ while (nr_caps--) {
+ fprintf(fp, "%s%s", delimiter, str);
+ delimiter = ", ";
+ str += strlen(str) + 1;
+ }
+
+ fprintf(fp, "\n");
+}
+
static void print_pmu_mappings(struct feat_fd *ff, FILE *fp)
{
const char *delimiter = "# pmu mappings: ";
@@ -2856,6 +2911,60 @@ static int process_compressed(struct feat_fd *ff,
return 0;
}

+static int process_cpu_pmu_caps(struct feat_fd *ff,
+ void *data __maybe_unused)
+{
+ char *name, *value;
+ struct strbuf sb;
+ u32 nr_caps;
+
+ if (do_read_u32(ff, &nr_caps))
+ return -1;
+
+ if (!nr_caps) {
+ pr_debug("cpu pmu capabilities not available\n");
+ return 0;
+ }
+
+ ff->ph->env.nr_cpu_pmu_caps = nr_caps;
+
+ if (strbuf_init(&sb, 128) < 0)
+ return -1;
+
+ while (nr_caps--) {
+ name = do_read_string(ff);
+ if (!name)
+ goto error;
+
+ value = do_read_string(ff);
+ if (!value)
+ goto free_name;
+
+ if (strbuf_addf(&sb, "%s=%s", name, value) < 0)
+ goto free_value;
+
+ /* include a NULL character at the end */
+ if (strbuf_add(&sb, "", 1) < 0)
+ goto free_value;
+
+ if (!strcmp(name, "branches"))
+ ff->ph->env.max_branches = atoi(value);
+
+ free(value);
+ free(name);
+ }
+ ff->ph->env.cpu_pmu_caps = strbuf_detach(&sb, NULL);
+ return 0;
+
+free_value:
+ free(value);
+free_name:
+ free(name);
+error:
+ strbuf_release(&sb);
+ return -1;
+}
+
#define FEAT_OPR(n, func, __full_only) \
[HEADER_##n] = { \
.name = __stringify(n), \
@@ -2913,6 +3022,7 @@ const struct perf_header_feature_ops feat_ops[HEADER_LAST_FEATURE] = {
FEAT_OPR(BPF_PROG_INFO, bpf_prog_info, false),
FEAT_OPR(BPF_BTF, bpf_btf, false),
FEAT_OPR(COMPRESSED, compressed, false),
+ FEAT_OPR(CPU_PMU_CAPS, cpu_pmu_caps, false),
};

struct header_print_data {
diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h
index 840f95cee349..650bd1c7a99b 100644
--- a/tools/perf/util/header.h
+++ b/tools/perf/util/header.h
@@ -43,6 +43,7 @@ enum {
HEADER_BPF_PROG_INFO,
HEADER_BPF_BTF,
HEADER_COMPRESSED,
+ HEADER_CPU_PMU_CAPS,
HEADER_LAST_FEATURE,
HEADER_FEAT_BITS = 256,
};
--
2.17.1


2019-11-19 14:38:27

by Liang, Kan

Subject: [PATCH V4 10/13] perf script: Add option to enable the LBR stitching approach

From: Kan Liang <[email protected]>

With the LBR stitching approach, the reconstructed LBR call stack
is no longer bound by the HW limitation. However, it may reconstruct
invalid call stacks in some cases, e.g. exception handling such as
setjmp/longjmp. Also, it may impact the processing time, especially
when the number of samples with stitched LBRs is huge.

Add an option to enable the approach.

Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-script.txt | 11 +++++++++++
tools/perf/builtin-script.c | 6 ++++++
2 files changed, 17 insertions(+)

diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index 2599b057e47b..472f20f1e479 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -426,6 +426,17 @@ include::itrace.txt[]
--show-on-off-events::
Show the --switch-on/off events too.

+--stitch-lbr::
+ Show callgraph with stitched LBRs, which may have a more complete
+ callgraph. The perf.data file must have been obtained using
+ perf record --call-graph lbr.
+ Disabled by default. In common cases with call stack overflows,
+ it can recreate better call stacks than the default lbr call stack
+ output. But this approach is not foolproof. There can be cases
+ where it creates incorrect call stacks from incorrect matches.
+ A known limitation is exception handling such as
+ setjmp/longjmp, which will have calls/returns that do not match.
+
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-script-perl[1],
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index f86c5cce5b2c..fa1d475571dd 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1641,6 +1641,7 @@ struct perf_script {
bool show_bpf_events;
bool allocated;
bool per_event_dump;
+ bool stitch_lbr;
struct evswitch evswitch;
struct perf_cpu_map *cpus;
struct perf_thread_map *threads;
@@ -1867,6 +1868,9 @@ static void process_event(struct perf_script *script,
if (PRINT_FIELD(IP)) {
struct callchain_cursor *cursor = NULL;

+ if (script->stitch_lbr)
+ al->thread->lbr_stitch_enable = true;
+
if (symbol_conf.use_callchain && sample->callchain &&
thread__resolve_callchain(al->thread, &callchain_cursor, evsel,
sample, NULL, NULL, scripting_max_stack) == 0)
@@ -3556,6 +3560,8 @@ int cmd_script(int argc, const char **argv)
"file", "file saving guest os /proc/kallsyms"),
OPT_STRING(0, "guestmodules", &symbol_conf.default_guest_modules,
"file", "file saving guest os /proc/modules"),
+ OPT_BOOLEAN('\0', "stitch-lbr", &script.stitch_lbr,
+ "Enable LBR callgraph stitching approach"),
OPTS_EVSWITCH(&script.evswitch),
OPT_END()
};
--
2.17.1


2019-11-19 14:39:00

by Liang, Kan

Subject: [PATCH V4 07/13] perf machine: Refine the function for LBR call stack reconstruction

From: Kan Liang <[email protected]>

LBR only collects the user call stack. To reconstruct a call stack,
both the kernel call stack and the user call stack are required. The
function resolve_lbr_callchain_sample() mixes the kernel call stack and
the user call stack. Now, with the help of TOS, the perf tool can
reconstruct a more complete call stack by adding some user call stack
from the previous sample. However, the current implementation is hard
to extend to support that.

Abstract two new functions to resolve the user call stack and the
kernel call stack respectively.

No functional changes.

Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/machine.c | 186 ++++++++++++++++++++++++--------------
1 file changed, 119 insertions(+), 67 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 6a0f5c25ce3e..1a2c3e26c01f 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2194,6 +2194,96 @@ static int remove_loops(struct branch_entry *l, int nr,
return nr;
}

+
+static int lbr_callchain_add_kernel_ip(struct thread *thread,
+ struct callchain_cursor *cursor,
+ struct perf_sample *sample,
+ struct symbol **parent,
+ struct addr_location *root_al,
+ bool callee, int end)
+{
+ struct ip_callchain *chain = sample->callchain;
+ u8 cpumode = PERF_RECORD_MISC_USER;
+ int err, i;
+
+ if (callee) {
+ for (i = 0; i < end + 1; i++) {
+ err = add_callchain_ip(thread, cursor, parent,
+ root_al, &cpumode, chain->ips[i],
+ false, NULL, NULL, 0);
+ if (err)
+ return err;
+ }
+ } else {
+ for (i = end; i >= 0; i--) {
+ err = add_callchain_ip(thread, cursor, parent,
+ root_al, &cpumode, chain->ips[i],
+ false, NULL, NULL, 0);
+ if (err)
+ return err;
+ }
+ }
+
+ return 0;
+}
+
+static int lbr_callchain_add_lbr_ip(struct thread *thread,
+ struct callchain_cursor *cursor,
+ struct perf_sample *sample,
+ struct symbol **parent,
+ struct addr_location *root_al,
+ bool callee)
+{
+ struct branch_stack *lbr_stack = sample->branch_stack;
+ u8 cpumode = PERF_RECORD_MISC_USER;
+ int lbr_nr = lbr_stack->nr;
+ struct branch_flags *flags;
+ u64 ip, branch_from = 0;
+ int err, i;
+
+ if (callee) {
+ ip = lbr_stack->entries[0].to;
+ flags = &lbr_stack->entries[0].flags;
+ branch_from = lbr_stack->entries[0].from;
+ err = add_callchain_ip(thread, cursor, parent,
+ root_al, &cpumode, ip,
+ true, flags, NULL, branch_from);
+ if (err)
+ return err;
+
+ for (i = 0; i < lbr_nr; i++) {
+ ip = lbr_stack->entries[i].from;
+ flags = &lbr_stack->entries[i].flags;
+ err = add_callchain_ip(thread, cursor, parent,
+ root_al, &cpumode, ip,
+ true, flags, NULL, branch_from);
+ if (err)
+ return err;
+ }
+ } else {
+ for (i = lbr_nr - 1; i >= 0; i--) {
+ ip = lbr_stack->entries[i].from;
+ flags = &lbr_stack->entries[i].flags;
+ err = add_callchain_ip(thread, cursor, parent,
+ root_al, &cpumode, ip,
+ true, flags, NULL, branch_from);
+ if (err)
+ return err;
+ }
+
+ ip = lbr_stack->entries[0].to;
+ flags = &lbr_stack->entries[0].flags;
+ branch_from = lbr_stack->entries[0].from;
+ err = add_callchain_ip(thread, cursor, parent,
+ root_al, &cpumode, ip,
+ true, flags, NULL, branch_from);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
/*
* Recolve LBR callstack chain sample
* Return:
@@ -2209,82 +2299,44 @@ static int resolve_lbr_callchain_sample(struct thread *thread,
int max_stack)
{
struct ip_callchain *chain = sample->callchain;
- int chain_nr = min(max_stack, (int)chain->nr), i;
- u8 cpumode = PERF_RECORD_MISC_USER;
- u64 ip, branch_from = 0;
+ int chain_nr = min(max_stack, (int)chain->nr);
+ int i, err;

for (i = 0; i < chain_nr; i++) {
if (chain->ips[i] == PERF_CONTEXT_USER)
break;
}

- /* LBR only affects the user callchain */
- if (i != chain_nr) {
- struct branch_stack *lbr_stack = sample->branch_stack;
- int lbr_nr = lbr_stack->nr, j, k;
- bool branch;
- struct branch_flags *flags;
- /*
- * LBR callstack can only get user call chain.
- * The mix_chain_nr is kernel call chain
- * number plus LBR user call chain number.
- * i is kernel call chain number,
- * 1 is PERF_CONTEXT_USER,
- * lbr_nr + 1 is the user call chain number.
- * For details, please refer to the comments
- * in callchain__printf
- */
- int mix_chain_nr = i + 1 + lbr_nr + 1;
-
- for (j = 0; j < mix_chain_nr; j++) {
- int err;
- branch = false;
- flags = NULL;
-
- if (callchain_param.order == ORDER_CALLEE) {
- if (j < i + 1)
- ip = chain->ips[j];
- else if (j > i + 1) {
- k = j - i - 2;
- ip = lbr_stack->entries[k].from;
- branch = true;
- flags = &lbr_stack->entries[k].flags;
- } else {
- ip = lbr_stack->entries[0].to;
- branch = true;
- flags = &lbr_stack->entries[0].flags;
- branch_from =
- lbr_stack->entries[0].from;
- }
- } else {
- if (j < lbr_nr) {
- k = lbr_nr - j - 1;
- ip = lbr_stack->entries[k].from;
- branch = true;
- flags = &lbr_stack->entries[k].flags;
- }
- else if (j > lbr_nr)
- ip = chain->ips[i + 1 - (j - lbr_nr)];
- else {
- ip = lbr_stack->entries[0].to;
- branch = true;
- flags = &lbr_stack->entries[0].flags;
- branch_from =
- lbr_stack->entries[0].from;
- }
- }
+ /*
+ * LBR only affects the user callchain.
+ * Fall back if there is no user callchain.
+ */
+ if (i == chain_nr)
+ return 0;

- err = add_callchain_ip(thread, cursor, parent,
- root_al, &cpumode, ip,
- branch, flags, NULL,
- branch_from);
- if (err)
- return (err < 0) ? err : 0;
- }
- return 1;
+ if (callchain_param.order == ORDER_CALLEE) {
+ err = lbr_callchain_add_kernel_ip(thread, cursor, sample,
+ parent, root_al, true, i);
+ if (err)
+ goto error;
+ err = lbr_callchain_add_lbr_ip(thread, cursor, sample,
+ parent, root_al, true);
+ if (err)
+ goto error;
+ } else {
+ err = lbr_callchain_add_lbr_ip(thread, cursor, sample,
+ parent, root_al, false);
+ if (err)
+ goto error;
+ err = lbr_callchain_add_kernel_ip(thread, cursor, sample,
+ parent, root_al, false, i);
+ if (err)
+ goto error;
}

- return 0;
+ return 1;
+error:
+ return (err < 0) ? err : 0;
}

static int find_prev_cpumode(struct ip_callchain *chain, struct thread *thread,
--
2.17.1


2019-11-19 14:39:02

by Liang, Kan

[permalink] [raw]
Subject: [PATCH V4 02/13] perf/x86/intel: Output LBR TOS information

From: Kan Liang <[email protected]>

A new branch sample type was introduced to request the LBR Top-of-Stack
(TOS) information.

For non-adaptive PEBS and non-PEBS, the TOS information can be retrieved
directly from the TOS MSR read in intel_pmu_lbr_read().

For adaptive PEBS, the LBR information stored in the PEBS record doesn't
include the TOS information. For single PEBS, TOS can still be read
directly from the MSR, because the PMI is triggered immediately after the
PEBS record is written, so the TOS MSR is still unchanged.
For large PEBS, the TOS MSR has a stale value. Set it to -1ULL to indicate
that the TOS information is not available.

Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/lbr.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 534c76606049..956802dff5f7 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -585,6 +585,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[i].reserved = 0;
}
cpuc->lbr_stack.nr = i;
+ cpuc->lbr_stack.tos = tos;
}

/*
@@ -680,6 +681,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
out++;
}
cpuc->lbr_stack.nr = out;
+ cpuc->lbr_stack.tos = tos;
}

void intel_pmu_lbr_read(void)
@@ -1120,6 +1122,13 @@ void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr)
int i;

cpuc->lbr_stack.nr = x86_pmu.lbr_nr;
+
+ /* Cannot get TOS for large PEBS */
+ if (cpuc->n_pebs == cpuc->n_large_pebs)
+ cpuc->lbr_stack.tos = -1ULL;
+ else
+ cpuc->lbr_stack.tos = intel_pmu_lbr_tos();
+
for (i = 0; i < x86_pmu.lbr_nr; i++) {
u64 info = lbr->lbr[i].info;
struct perf_branch_entry *e = &cpuc->lbr_entries[i];
--
2.17.1


2019-11-19 14:39:11

by Liang, Kan

[permalink] [raw]
Subject: [PATCH V4 11/13] perf top: Add option to enable the LBR stitching approach

From: Kan Liang <[email protected]>

With the LBR stitching approach, the reconstructed LBR call stack
can go beyond the HW limitation. However, it may reconstruct invalid
call stacks in some cases, e.g. exception handling such as
setjmp/longjmp. Also, it may increase the processing time, especially
when the number of samples with stitched LBRs is huge.

Add an option to enable the approach.
The option must be used with --call-graph lbr.
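
A typical invocation with this option (shown here only as a usage
illustration, it is not part of the patch) would be:

  perf top --call-graph lbr --stitch-lbr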

Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-top.txt | 9 +++++++++
tools/perf/builtin-top.c | 11 +++++++++++
tools/perf/util/top.h | 1 +
3 files changed, 21 insertions(+)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 5596129a71cf..80b57f942a86 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -304,6 +304,15 @@ Default is to monitor all CPUS.
go straight to the histogram browser, just like 'perf top' with no events
explicitely specified does.

+--stitch-lbr::
+ Show callgraph with stitched LBRs, which may produce a more complete
+ callgraph. The option must be used with --call-graph lbr recording.
+ Disabled by default. In common cases with call stack overflows,
+ it can recreate better call stacks than the default lbr call stack
+ output. But this approach is not foolproof. There can be cases
+ where it creates incorrect call stacks from incorrect matches.
+ The known limitations include exception handling such as
+ setjmp/longjmp, which will have calls/returns not matched.

INTERACTIVE PROMPTING KEYS
--------------------------
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index dc80044bc46f..7c820cfe2f23 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -33,6 +33,7 @@
#include "util/map.h"
#include "util/mmap.h"
#include "util/session.h"
+#include "util/thread.h"
#include "util/symbol.h"
#include "util/synthetic-events.h"
#include "util/top.h"
@@ -766,6 +767,9 @@ static void perf_event__process_sample(struct perf_tool *tool,
if (machine__resolve(machine, &al, sample) < 0)
return;

+ if (top->stitch_lbr)
+ al.thread->lbr_stitch_enable = true;
+
if (!machine->kptr_restrict_warned &&
symbol_conf.kptr_restrict &&
al.cpumode == PERF_RECORD_MISC_KERNEL) {
@@ -1539,6 +1543,8 @@ int cmd_top(int argc, const char **argv)
"number of thread to run event synthesize"),
OPT_BOOLEAN(0, "namespaces", &opts->record_namespaces,
"Record namespaces events"),
+ OPT_BOOLEAN(0, "stitch-lbr", &top.stitch_lbr,
+ "Enable LBR callgraph stitching approach"),
OPTS_EVSWITCH(&top.evswitch),
OPT_END()
};
@@ -1601,6 +1607,11 @@ int cmd_top(int argc, const char **argv)
}
}

+ if (top.stitch_lbr && !(callchain_param.record_mode == CALLCHAIN_LBR)) {
+ pr_err("Error: --stitch-lbr must be used with --call-graph lbr\n");
+ goto out_delete_evlist;
+ }
+
if (opts->branch_stack && callchain_param.enabled)
symbol_conf.show_branchflag_count = true;

diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index f117d4f4821e..45dc84ddff37 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -36,6 +36,7 @@ struct perf_top {
bool use_tui, use_stdio;
bool vmlinux_warned;
bool dump_symtab;
+ bool stitch_lbr;
struct hist_entry *sym_filter_entry;
struct evsel *sym_evsel;
struct perf_session *session;
--
2.17.1


2019-11-19 14:40:47

by Liang, Kan

[permalink] [raw]
Subject: [PATCH V4 05/13] perf pmu: Add support for PMU capabilities

From: Kan Liang <[email protected]>

The PMU capabilities information, which is located at
/sys/bus/event_source/devices/<dev>/caps, is required by the perf tool.
For example, the max LBR information is required to stitch the LBR call
stack.

Add perf_pmu__caps_parse() to parse the PMU capabilities information.
The information is stored in a list.

Add perf_pmu__scan_caps() to scan the capabilities one by one.

The following patch will store the capabilities information in the perf
header.
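
As a usage illustration only (this fragment is not part of the patch, and
the printing is purely illustrative), a caller could parse and then walk
the capability list roughly like this:

	struct perf_pmu *pmu = perf_pmu__find("cpu");	/* example PMU name */
	struct perf_pmu_caps *caps = NULL;

	if (pmu && perf_pmu__caps_parse(pmu) > 0) {
		while ((caps = perf_pmu__scan_caps(pmu, caps)) != NULL)
			fprintf(stderr, "%s/caps/%s: %s\n",
				pmu->name, caps->name, caps->value);
	}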

Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/pmu.c | 87 +++++++++++++++++++++++++++++++++++++++++++
tools/perf/util/pmu.h | 12 ++++++
2 files changed, 99 insertions(+)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index db1e57113f4b..150f24bb7e1a 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -853,6 +853,7 @@ static struct perf_pmu *pmu_lookup(const char *name)

INIT_LIST_HEAD(&pmu->format);
INIT_LIST_HEAD(&pmu->aliases);
+ INIT_LIST_HEAD(&pmu->caps);
list_splice(&format, &pmu->format);
list_splice(&aliases, &pmu->aliases);
list_add_tail(&pmu->list, &pmus);
@@ -1567,3 +1568,89 @@ int perf_pmu__scan_file(struct perf_pmu *pmu, const char *name, const char *fmt,
va_end(args);
return ret;
}
+
+static int perf_pmu__new_caps(struct list_head *list, char *name, char *value)
+{
+ struct perf_pmu_caps *caps;
+
+ caps = zalloc(sizeof(*caps));
+ if (!caps)
+ return -ENOMEM;
+
+ caps->name = strdup(name);
+ caps->value = strndup(value, strlen(value) - 1);
+ list_add_tail(&caps->list, list);
+ return 0;
+}
+
+/*
+ * Reading/parsing the given pmu capabilities, which should be located at:
+ * /sys/bus/event_source/devices/<dev>/caps as sysfs group attributes.
+ * Return the number of capabilities
+ */
+int perf_pmu__caps_parse(struct perf_pmu *pmu)
+{
+ struct stat st;
+ char caps_path[PATH_MAX];
+ const char *sysfs = sysfs__mountpoint();
+ DIR *caps_dir;
+ struct dirent *evt_ent;
+ int nr_caps = 0;
+
+ if (!sysfs)
+ return -1;
+
+ snprintf(caps_path, PATH_MAX,
+ "%s" EVENT_SOURCE_DEVICE_PATH "%s/caps", sysfs, pmu->name);
+
+ if (stat(caps_path, &st) < 0)
+ return 0; /* no error if caps does not exist */
+
+ caps_dir = opendir(caps_path);
+ if (!caps_dir)
+ return -EINVAL;
+
+ while ((evt_ent = readdir(caps_dir)) != NULL) {
+ char *name = evt_ent->d_name;
+ char path[PATH_MAX];
+ char value[128];
+ FILE *file;
+
+ if (!strcmp(name, ".") || !strcmp(name, ".."))
+ continue;
+
+ snprintf(path, PATH_MAX, "%s/%s", caps_path, name);
+
+ file = fopen(path, "r");
+ if (!file)
+ break;
+
+ if (!fgets(value, sizeof(value), file) ||
+ (perf_pmu__new_caps(&pmu->caps, name, value) < 0)) {
+ fclose(file);
+ break;
+ }
+
+ nr_caps++;
+ fclose(file);
+ }
+
+ closedir(caps_dir);
+
+ return nr_caps;
+}
+
+struct perf_pmu_caps *perf_pmu__scan_caps(struct perf_pmu *pmu,
+ struct perf_pmu_caps *caps)
+{
+ if (!pmu)
+ return NULL;
+
+ if (!caps)
+ caps = list_prepare_entry(caps, &pmu->caps, list);
+
+ list_for_each_entry_continue(caps, &pmu->caps, list)
+ return caps;
+
+ return NULL;
+}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 3e8cd31a89cc..f7dc94c3ae8a 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -21,6 +21,12 @@ enum {

struct perf_event_attr;

+struct perf_pmu_caps {
+ char *name;
+ char *value;
+ struct list_head list;
+};
+
struct perf_pmu {
char *name;
__u32 type;
@@ -31,6 +37,7 @@ struct perf_pmu {
struct perf_cpu_map *cpus;
struct list_head format; /* HEAD struct perf_pmu_format -> list */
struct list_head aliases; /* HEAD struct perf_pmu_alias -> list */
+ struct list_head caps; /* HEAD struct perf_pmu_caps -> list */
struct list_head list; /* ELEM */
};

@@ -100,4 +107,9 @@ struct pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu);

int perf_pmu__convert_scale(const char *scale, char **end, double *sval);

+int perf_pmu__caps_parse(struct perf_pmu *pmu);
+
+struct perf_pmu_caps *perf_pmu__scan_caps(struct perf_pmu *pmu,
+ struct perf_pmu_caps *caps);
+
#endif /* __PMU_H */
--
2.17.1


2019-11-19 19:02:24

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH V4 03/13] perf tools: Support new branch sample type for LBR TOS

On Tue, Nov 19, 2019 at 6:35 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Support new branch sample type for LBR TOS.
>
> Enable LBR_TOS by default in LBR call stack mode.
> If kernel doesn't support the sample type, switching it off.
>
> Add a new branch options "tos" for the new branch sample type.
> The branch sample type is 64 bits. Change int to u64 for mode in
> struct branch_mode and bit in struct bit_names.
>
> Set tos to -1ULL if the LBR TOS information is unavailable.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/include/uapi/linux/perf_event.h | 16 ++++++++++++++--
> tools/perf/util/event.h | 1 +
> tools/perf/util/evsel.c | 20 +++++++++++++++++---
> tools/perf/util/evsel.h | 6 ++++++
> tools/perf/util/parse-branch-options.c | 3 ++-
> tools/perf/util/perf_event_attr_fprintf.c | 3 ++-
> 6 files changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index bb7b271397a6..c2da61c9ace7 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -180,7 +180,10 @@ enum perf_branch_sample_type_shift {
>
> PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */
>
> - PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
> + PERF_SAMPLE_BRANCH_MAX_SHIFT = 17, /* non-ABI */
> +
> + /* PMU specific */

No! You must abstract this.

> + PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT = 63, /* save LBR TOS */
> };
>
I don't like this because this is too Intel specific.
What is the meaning of this field? You need a clear definition so it can be used
with other PERF_SAMPLE_BRANCH_* implementations.


>
> enum perf_branch_sample_type {
> @@ -208,8 +211,13 @@ enum perf_branch_sample_type {
> 1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,
>
> PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
> +
> + PERF_SAMPLE_BRANCH_LBR_TOS = 1ULL << PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT,
> };
>
> +#define PERF_SAMPLE_BRANCH_MASK ((PERF_SAMPLE_BRANCH_MAX - 1) |\
> + PERF_SAMPLE_BRANCH_LBR_TOS)
> +
> /*
> * Common flow change classification
> */
> @@ -849,7 +857,11 @@ enum perf_event_type {
> * char data[size];}&& PERF_SAMPLE_RAW
> *
> * { u64 nr;
> - * { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
> + * { u64 from, to, flags } lbr[nr];
> + *
> + * # only available if PERF_SAMPLE_BRANCH_LBR_TOS is set
> + * u64 tos;
> + * } && PERF_SAMPLE_BRANCH_STACK
> *
> * { u64 abi; # enum perf_sample_regs_abi
> * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
> index a0a0c91cde4a..98794758546b 100644
> --- a/tools/perf/util/event.h
> +++ b/tools/perf/util/event.h
> @@ -130,6 +130,7 @@ struct perf_sample {
> u32 raw_size;
> u64 data_src;
> u64 phys_addr;
> + u64 lbr_tos;
> u32 flags;
> u16 insn_len;
> u8 cpumode;
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 1bf60f325608..b19669eb4437 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -712,7 +712,8 @@ static void __perf_evsel__config_callchain(struct evsel *evsel,
> attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
> PERF_SAMPLE_BRANCH_CALL_STACK |
> PERF_SAMPLE_BRANCH_NO_CYCLES |
> - PERF_SAMPLE_BRANCH_NO_FLAGS;
> + PERF_SAMPLE_BRANCH_NO_FLAGS |
> + PERF_SAMPLE_BRANCH_LBR_TOS;
> }
> } else
> pr_warning("Cannot use LBR callstack with branch stack. "
> @@ -763,7 +764,8 @@ perf_evsel__reset_callgraph(struct evsel *evsel,
> if (param->record_mode == CALLCHAIN_LBR) {
> perf_evsel__reset_sample_bit(evsel, BRANCH_STACK);
> attr->branch_sample_type &= ~(PERF_SAMPLE_BRANCH_USER |
> - PERF_SAMPLE_BRANCH_CALL_STACK);
> + PERF_SAMPLE_BRANCH_CALL_STACK |
> + PERF_SAMPLE_BRANCH_LBR_TOS);
> }
> if (param->record_mode == CALLCHAIN_DWARF) {
> perf_evsel__reset_sample_bit(evsel, REGS_USER);
> @@ -1641,6 +1643,8 @@ int evsel__open(struct evsel *evsel, struct perf_cpu_map *cpus,
> evsel->core.attr.ksymbol = 0;
> if (perf_missing_features.bpf)
> evsel->core.attr.bpf_event = 0;
> + if (perf_missing_features.lbr_tos)
> + evsel->core.attr.branch_sample_type &= ~PERF_SAMPLE_BRANCH_LBR_TOS;
> retry_sample_id:
> if (perf_missing_features.sample_id_all)
> evsel->core.attr.sample_id_all = 0;
> @@ -1752,7 +1756,12 @@ int evsel__open(struct evsel *evsel, struct perf_cpu_map *cpus,
> * Must probe features in the order they were added to the
> * perf_event_attr interface.
> */
> - if (!perf_missing_features.aux_output && evsel->core.attr.aux_output) {
> + if (!perf_missing_features.lbr_tos &&
> + (evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_LBR_TOS)) {
> + perf_missing_features.lbr_tos = true;
> + pr_debug2("switching off LBR TOS support\n");
> + goto fallback_missing_features;
> + } else if (!perf_missing_features.aux_output && evsel->core.attr.aux_output) {
> perf_missing_features.aux_output = true;
> pr_debug2_peo("Kernel has no attr.aux_output support, bailing out\n");
> goto out_close;
> @@ -2129,6 +2138,11 @@ int perf_evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> sz = data->branch_stack->nr * sizeof(struct branch_entry);
> OVERFLOW_CHECK(array, sz, max_size);
> array = (void *)array + sz;
> +
> + if (perf_evsel__has_lbr_tos(evsel))
> + data->lbr_tos = *array++;
> + else
> + data->lbr_tos = -1ULL;
> }
>
> if (type & PERF_SAMPLE_REGS_USER) {
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index ddc5ee6f6592..43a9fd83f791 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -115,6 +115,7 @@ struct perf_missing_features {
> bool ksymbol;
> bool bpf;
> bool aux_output;
> + bool lbr_tos;
> };
>
> extern struct perf_missing_features perf_missing_features;
> @@ -377,6 +378,11 @@ for ((_evsel) = _leader; \
> (_evsel) && (_evsel)->leader == (_leader); \
> (_evsel) = list_entry((_evsel)->core.node.next, struct evsel, core.node))
>
> +static inline bool perf_evsel__has_lbr_tos(const struct evsel *evsel)
> +{
> + return evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_LBR_TOS;
> +}
> +
> static inline bool perf_evsel__has_branch_callstack(const struct evsel *evsel)
> {
> return evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK;
> diff --git a/tools/perf/util/parse-branch-options.c b/tools/perf/util/parse-branch-options.c
> index bb4aa88c50a8..ce8b9ffc0663 100644
> --- a/tools/perf/util/parse-branch-options.c
> +++ b/tools/perf/util/parse-branch-options.c
> @@ -13,7 +13,7 @@
>
> struct branch_mode {
> const char *name;
> - int mode;
> + u64 mode;
> };
>
> static const struct branch_mode branch_modes[] = {
> @@ -32,6 +32,7 @@ static const struct branch_mode branch_modes[] = {
> BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL),
> BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE),
> BRANCH_OPT("stack", PERF_SAMPLE_BRANCH_CALL_STACK),
> + BRANCH_OPT("tos", PERF_SAMPLE_BRANCH_LBR_TOS),
> BRANCH_END
> };
>
> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> index d4ad3f04923a..3411b67ea92a 100644
> --- a/tools/perf/util/perf_event_attr_fprintf.c
> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> @@ -8,7 +8,7 @@
> #include "util/evsel_fprintf.h"
>
> struct bit_names {
> - int bit;
> + u64 bit;
> const char *name;
> };
>
> @@ -50,6 +50,7 @@ static void __p_branch_sample_type(char *buf, size_t size, u64 value)
> bit_name(ABORT_TX), bit_name(IN_TX), bit_name(NO_TX),
> bit_name(COND), bit_name(CALL_STACK), bit_name(IND_JUMP),
> bit_name(CALL), bit_name(NO_FLAGS), bit_name(NO_CYCLES),
> + bit_name(LBR_TOS),
> { .name = NULL, }
> };
> #undef bit_name
> --
> 2.17.1
>

2019-11-19 19:06:13

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH V4 01/13] perf/core: Add new branch sample type for LBR TOS

On Tue, Nov 19, 2019 at 6:35 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> In LBR call stack mode, the depth of reconstructed LBR call stack limits
> to the number of LBR registers. With LBR Top-of-Stack (TOS) information,
> perf tool may stitch the stacks of two samples. The reconstructed LBR
> call stack can break the HW limitation.
>
> Add a new branch sample type to retrieve LBR TOS. The new type is PMU
> specific. Add it at the end of enum perf_branch_sample_type.
> Add a macro to retrieve defined bits of branch sample type.
> Update perf_copy_attr() to handle the new bit.
>
> Only when the new branch sample type is set, the TOS information is
> dumped into the PERF_SAMPLE_BRANCH_STACK output.
> Perf tool should check the attr.branch_sample_type, and apply the
> corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
> Otherwise, some user case may be broken. For example, users may parse a
> perf.data, which include the new branch sample type, with an old version
> perf tool (without the check). Users probably get incorrect information
> without any warning.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> include/linux/perf_event.h | 2 ++
> include/uapi/linux/perf_event.h | 16 ++++++++++++++--
> kernel/events/core.c | 13 ++++++++++++-
> 3 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 011dcbdbccc2..761021c7ee8a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -93,6 +93,7 @@ struct perf_raw_record {
> /*
> * branch stack layout:
> * nr: number of taken branches stored in entries[]
> + * tos: Top-of-Stack (TOS) information. PMU specific data.
> *
> * Note that nr can vary from sample to sample
> * branches (to, from) are stored from most recent
> @@ -101,6 +102,7 @@ struct perf_raw_record {
> */
> struct perf_branch_stack {
> __u64 nr;
> + __u64 tos; /* PMU specific data */
> struct perf_branch_entry entries[0];
> };
>
Same remark as with the other patch. You need to abstract this.
The TOS and PMU specific data should be limited to x86/event/intel/*.[ch].

> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index bb7b271397a6..c2da61c9ace7 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -180,7 +180,10 @@ enum perf_branch_sample_type_shift {
>
> PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */
>
> - PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
> + PERF_SAMPLE_BRANCH_MAX_SHIFT = 17, /* non-ABI */
> +
> + /* PMU specific */
> + PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT = 63, /* save LBR TOS */
> };
>
> enum perf_branch_sample_type {
> @@ -208,8 +211,13 @@ enum perf_branch_sample_type {
> 1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,
>
> PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
> +
> + PERF_SAMPLE_BRANCH_LBR_TOS = 1ULL << PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT,
> };
>
> +#define PERF_SAMPLE_BRANCH_MASK ((PERF_SAMPLE_BRANCH_MAX - 1) |\
> + PERF_SAMPLE_BRANCH_LBR_TOS)
> +
> /*
> * Common flow change classification
> */
> @@ -849,7 +857,11 @@ enum perf_event_type {
> * char data[size];}&& PERF_SAMPLE_RAW
> *
> * { u64 nr;
> - * { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
> + * { u64 from, to, flags } lbr[nr];
> + *
> + * # only available if PERF_SAMPLE_BRANCH_LBR_TOS is set
> + * u64 tos;
> + * } && PERF_SAMPLE_BRANCH_STACK
> *
> * { u64 abi; # enum perf_sample_regs_abi
> * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index cfd89b4a02d8..8aff3aad43b5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6391,6 +6391,11 @@ static void perf_output_read(struct perf_output_handle *handle,
> perf_output_read_one(handle, event, enabled, running);
> }
>
> +static inline bool perf_sample_save_lbr_tos(struct perf_event *event)
> +{
> + return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_LBR_TOS;
> +}
> +
> void perf_output_sample(struct perf_output_handle *handle,
> struct perf_event_header *header,
> struct perf_sample_data *data,
> @@ -6480,6 +6485,8 @@ void perf_output_sample(struct perf_output_handle *handle,
>
> perf_output_put(handle, data->br_stack->nr);
> perf_output_copy(handle, data->br_stack->entries, size);
> + if (perf_sample_save_lbr_tos(event))
> + perf_output_put(handle, data->br_stack->tos);
> } else {
> /*
> * we always store at least the value of nr
> @@ -6667,7 +6674,11 @@ void perf_prepare_sample(struct perf_event_header *header,
> if (data->br_stack) {
> size += data->br_stack->nr
> * sizeof(struct perf_branch_entry);
> +
> + if (perf_sample_save_lbr_tos(event))
> + size += sizeof(u64);
> }
> +
> header->size += size;
> }
>
> @@ -10731,7 +10742,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
> u64 mask = attr->branch_sample_type;
>
> /* only using defined bits */
> - if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
> + if (mask & ~PERF_SAMPLE_BRANCH_MASK)
> return -EINVAL;
>
> /* at least one branch bit must be set */
> --
> 2.17.1
>

2019-11-19 21:35:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V4 03/13] perf tools: Support new branch sample type for LBR TOS

On Tue, Nov 19, 2019 at 11:00:00AM -0800, Stephane Eranian wrote:
> On Tue, Nov 19, 2019 at 6:35 AM <[email protected]> wrote:

> > diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> > index bb7b271397a6..c2da61c9ace7 100644
> > --- a/tools/include/uapi/linux/perf_event.h
> > +++ b/tools/include/uapi/linux/perf_event.h
> > @@ -180,7 +180,10 @@ enum perf_branch_sample_type_shift {
> >
> > PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */
> >
> > - PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
> > + PERF_SAMPLE_BRANCH_MAX_SHIFT = 17, /* non-ABI */
> > +
> > + /* PMU specific */
>
> No! You must abstract this.
>
> > + PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT = 63, /* save LBR TOS */
> > };
> >
> I don't like this because this is too Intel specific.
> What is the meaning of this field? You need a clear definition so it can be used
> with other PERF_SAMPLE_BRANCH_* implementations.

I also detest the MSB usage. Normal pattern is that any bit >= MAX
will be rejected by the kernel.
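
For reference, the pattern being referred to is the existing check in
perf_copy_attr() (which this patch relaxes):

	u64 mask = attr->branch_sample_type;

	/* only using defined bits */
	if (mask & ~(PERF_SAMPLE_BRANCH_MAX - 1))
		return -EINVAL;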

2019-11-19 22:21:14

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V4 03/13] perf tools: Support new branch sample type for LBR TOS



On 11/19/2019 4:31 PM, Peter Zijlstra wrote:
> On Tue, Nov 19, 2019 at 11:00:00AM -0800, Stephane Eranian wrote:
>> On Tue, Nov 19, 2019 at 6:35 AM <[email protected]> wrote:
>
>>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
>>> index bb7b271397a6..c2da61c9ace7 100644
>>> --- a/tools/include/uapi/linux/perf_event.h
>>> +++ b/tools/include/uapi/linux/perf_event.h
>>> @@ -180,7 +180,10 @@ enum perf_branch_sample_type_shift {
>>>
>>> PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */
>>>
>>> - PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
>>> + PERF_SAMPLE_BRANCH_MAX_SHIFT = 17, /* non-ABI */
>>> +
>>> + /* PMU specific */
>>
>> No! You must abstract this.
>>
>>> + PERF_SAMPLE_BRANCH_LBR_TOS_SHIFT = 63, /* save LBR TOS */
>>> };
>>>
>> I don't like this because this is too Intel specific.
>> What is the meaning of this field? You need a clear definition so it can be used
>> with other PERF_SAMPLE_BRANCH_* implementations.
>
> I also detest the MSB usage. Normal pattern is that any bit >= MAX
> will be rejected by the kernel.
>

OK. I will still use bit 17 for the new branch sample type.

I can change the Intel-specific name and use a generic name. How about
PERF_SAMPLE_BRANCH_PMU_SPECIFIC?

If we make it generic, there will be another question. How much space
should we reserve for this new branch sample type?
For LBR TOS, we only need a u64.
I'm not sure if it's good enough for other platforms.

Or maybe we want a flexible space as below?
@@ -849,7 +854,12 @@ enum perf_event_type {
* char data[size];}&& PERF_SAMPLE_RAW
*
* { u64 nr;
- * { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
+ * { u64 from, to, flags } lbr[nr];
+ *
+ * # only available if PERF_SAMPLE_BRANCH_PMU_SPECIFIC is set
+ * u64 nr;
+ * u64 data[nr];
+ * } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
* u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER


Thanks,
Kan


2019-11-19 22:29:22

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V4 01/13] perf/core: Add new branch sample type for LBR TOS



On 11/19/2019 2:02 PM, Stephane Eranian wrote:
> On Tue, Nov 19, 2019 at 6:35 AM<[email protected]> wrote:
>> From: Kan Liang<[email protected]>
>>
>> In LBR call stack mode, the depth of reconstructed LBR call stack limits
>> to the number of LBR registers. With LBR Top-of-Stack (TOS) information,
>> perf tool may stitch the stacks of two samples. The reconstructed LBR
>> call stack can break the HW limitation.
>>
>> Add a new branch sample type to retrieve LBR TOS. The new type is PMU
>> specific. Add it at the end of enum perf_branch_sample_type.
>> Add a macro to retrieve defined bits of branch sample type.
>> Update perf_copy_attr() to handle the new bit.
>>
>> Only when the new branch sample type is set, the TOS information is
>> dumped into the PERF_SAMPLE_BRANCH_STACK output.
>> Perf tool should check the attr.branch_sample_type, and apply the
>> corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
>> Otherwise, some user case may be broken. For example, users may parse a
>> perf.data, which include the new branch sample type, with an old version
>> perf tool (without the check). Users probably get incorrect information
>> without any warning.
>>
>> Signed-off-by: Kan Liang<[email protected]>
>> ---
>> include/linux/perf_event.h | 2 ++
>> include/uapi/linux/perf_event.h | 16 ++++++++++++++--
>> kernel/events/core.c | 13 ++++++++++++-
>> 3 files changed, 28 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 011dcbdbccc2..761021c7ee8a 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -93,6 +93,7 @@ struct perf_raw_record {
>> /*
>> * branch stack layout:
>> * nr: number of taken branches stored in entries[]
>> + * tos: Top-of-Stack (TOS) information. PMU specific data.
>> *
>> * Note that nr can vary from sample to sample
>> * branches (to, from) are stored from most recent
>> @@ -101,6 +102,7 @@ struct perf_raw_record {
>> */
>> struct perf_branch_stack {
>> __u64 nr;
>> + __u64 tos; /* PMU specific data */
>> struct perf_branch_entry entries[0];
>> };
>>
> Same remark as with the other patch. You need to abstract this.
> The TOS and PMU specific data should be limited to x86/event/intel/*.[ch].
>

If we change tos to a generic name, e.g. pmu_specific_data, can we still
keep it here?

If not, I think the only way is to introduce a new method, e.g.
output_br_pmu_data(), in struct pmu.
When outputting the sample data, the generic code would call
event->pmu->output_br_pmu_data() to retrieve the TOS from the Intel code.
I think it's too complicated.
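
For reference, a rough sketch of that alternative (the names below are
hypothetical placeholders, not proposed code) might look like:

	/* hypothetical hook in struct pmu */
	void (*output_br_pmu_data)(struct perf_event *event,
				   struct perf_output_handle *handle,
				   struct perf_sample_data *data);

	/* generic code in perf_output_sample(), after entries[] are written */
	if (event->pmu->output_br_pmu_data)
		event->pmu->output_br_pmu_data(event, handle, data);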

Thanks,
Kan





2019-11-19 22:55:54

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH V4 01/13] perf/core: Add new branch sample type for LBR TOS

On Tue, Nov 19, 2019 at 2:25 PM Liang, Kan <[email protected]> wrote:
>
>
>
> On 11/19/2019 2:02 PM, Stephane Eranian wrote:
> > On Tue, Nov 19, 2019 at 6:35 AM<[email protected]> wrote:
> >> From: Kan Liang<[email protected]>
> >>
> >> In LBR call stack mode, the depth of reconstructed LBR call stack limits
> >> to the number of LBR registers. With LBR Top-of-Stack (TOS) information,
> >> perf tool may stitch the stacks of two samples. The reconstructed LBR
> >> call stack can break the HW limitation.
> >>
> >> Add a new branch sample type to retrieve LBR TOS. The new type is PMU
> >> specific. Add it at the end of enum perf_branch_sample_type.
> >> Add a macro to retrieve defined bits of branch sample type.
> >> Update perf_copy_attr() to handle the new bit.
> >>
> >> Only when the new branch sample type is set, the TOS information is
> >> dumped into the PERF_SAMPLE_BRANCH_STACK output.
> >> Perf tool should check the attr.branch_sample_type, and apply the
> >> corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
> >> Otherwise, some user case may be broken. For example, users may parse a
> >> perf.data, which include the new branch sample type, with an old version
> >> perf tool (without the check). Users probably get incorrect information
> >> without any warning.
> >>
> >> Signed-off-by: Kan Liang<[email protected]>
> >> ---
> >> include/linux/perf_event.h | 2 ++
> >> include/uapi/linux/perf_event.h | 16 ++++++++++++++--
> >> kernel/events/core.c | 13 ++++++++++++-
> >> 3 files changed, 28 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index 011dcbdbccc2..761021c7ee8a 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -93,6 +93,7 @@ struct perf_raw_record {
> >> /*
> >> * branch stack layout:
> >> * nr: number of taken branches stored in entries[]
> >> + * tos: Top-of-Stack (TOS) information. PMU specific data.
> >> *
> >> * Note that nr can vary from sample to sample
> >> * branches (to, from) are stored from most recent
> >> @@ -101,6 +102,7 @@ struct perf_raw_record {
> >> */
> >> struct perf_branch_stack {
> >> __u64 nr;
> >> + __u64 tos; /* PMU specific data */
> >> struct perf_branch_entry entries[0];
> >> };
> >>
> > Same remark as with the other patch. You need to abstract this.
> > The TOS and PMU specific data should be limited to x86/event/intel/*.[ch].
> >
>
> If we change tos to a generic name, e.g. pmu_specific_data, can we still
> keep it here?
>
It's not just about the name; it is about what it points to.
What value does it return when the hw does not have a TOS?
I added the PERF_SAMPLE_BRANCH_*. I did not just expose the
raw LBR. There is an abstraction layer, so it is easier to map to other
architectures, like IBM Power, for instance. You cannot just add a TOS
and say it is PMU specific. If you do that for all architectures, then it
becomes very messy and hard to understand and use especially for tools.

This is an interface you are trying to define. This needs to be specified
precisely so that tools can make the right assumptions across hw platforms.

Note that the entries[] array is normally already sorted by most
recent to least recent.
So exporting a TOS there is bizarre. The TOS is likely always pointing to the
most recent entry. The TOS you want is exposing a low level index which does not
map to the abstracted branch stack. And that's a problem. You need to reconcile
your definition of TOS with the branch_sample_entry[] abstraction.

> If not, I think the only way is to introduce a new method, e.g.
> output_br_pmu_data(), at struct pmu.
> When outputting the sample data, the generic code will call
> event->pmu->output_br_pmu_data() to retrieve the TOS in Intel code.
> I think it's too complicated.
>
> Thanks,
> Kan
>
>
>
>

2019-11-20 17:18:24

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V4 01/13] perf/core: Add new branch sample type for LBR TOS



On 11/19/2019 5:51 PM, Stephane Eranian wrote:
> On Tue, Nov 19, 2019 at 2:25 PM Liang, Kan <[email protected]> wrote:
>>
>>
>>
>> On 11/19/2019 2:02 PM, Stephane Eranian wrote:
>>> On Tue, Nov 19, 2019 at 6:35 AM<[email protected]> wrote:
>>>> From: Kan Liang<[email protected]>
>>>>
>>>> In LBR call stack mode, the depth of reconstructed LBR call stack limits
>>>> to the number of LBR registers. With LBR Top-of-Stack (TOS) information,
>>>> perf tool may stitch the stacks of two samples. The reconstructed LBR
>>>> call stack can break the HW limitation.
>>>>
>>>> Add a new branch sample type to retrieve LBR TOS. The new type is PMU
>>>> specific. Add it at the end of enum perf_branch_sample_type.
>>>> Add a macro to retrieve defined bits of branch sample type.
>>>> Update perf_copy_attr() to handle the new bit.
>>>>
>>>> Only when the new branch sample type is set, the TOS information is
>>>> dumped into the PERF_SAMPLE_BRANCH_STACK output.
>>>> Perf tool should check the attr.branch_sample_type, and apply the
>>>> corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
>>>> Otherwise, some user case may be broken. For example, users may parse a
>>>> perf.data, which include the new branch sample type, with an old version
>>>> perf tool (without the check). Users probably get incorrect information
>>>> without any warning.
>>>>
>>>> Signed-off-by: Kan Liang<[email protected]>
>>>> ---
>>>> include/linux/perf_event.h | 2 ++
>>>> include/uapi/linux/perf_event.h | 16 ++++++++++++++--
>>>> kernel/events/core.c | 13 ++++++++++++-
>>>> 3 files changed, 28 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>> index 011dcbdbccc2..761021c7ee8a 100644
>>>> --- a/include/linux/perf_event.h
>>>> +++ b/include/linux/perf_event.h
>>>> @@ -93,6 +93,7 @@ struct perf_raw_record {
>>>> /*
>>>> * branch stack layout:
>>>> * nr: number of taken branches stored in entries[]
>>>> + * tos: Top-of-Stack (TOS) information. PMU specific data.
>>>> *
>>>> * Note that nr can vary from sample to sample
>>>> * branches (to, from) are stored from most recent
>>>> @@ -101,6 +102,7 @@ struct perf_raw_record {
>>>> */
>>>> struct perf_branch_stack {
>>>> __u64 nr;
>>>> + __u64 tos; /* PMU specific data */
>>>> struct perf_branch_entry entries[0];
>>>> };
>>>>
>>> Same remark as with the other patch. You need to abstract this.
>>> The TOS and PMU specific data should be limited to x86/event/intel/*.[ch].
>>>
>>
>> If we change tos to a generic name, e.g. pmu_specific_data, can we still
>> keep it here?
>>
> It's not just about the name, it is about what it points to?
> What value does it return when the hw does not have a TOS?
> I added the PERF_SAMPLE_BRANCH_*. I did not just expose the
> raw LBR. There is an abstraction layer, so it is easier to map to other
> architectures, like IBM Power, for instance. You cannot just add a TOS
> and say it is PMU specific. If you do that for all architectures, then it
> becomes very messy and hard to understand and use especially for tools.
>
> This is an interface you are trying to define. This needs to be specified
> precisely so that tools can make the right assumptions across hw platforms.
>
> Note that the entries[] array is normally already sorted by most
> recent to least recent.
> So exporting a TOS there is bizarre. The TOS is likely always pointing to the
> most recent entry. The TOS you want is exposing a low level index which does not
> map to the abstracted branch stack. And that's a problem. You need to reconcile
> your definition of TOS with the branch_sample_entry [] abstraction.
>

I plan to use hw_idx to replace tos. It indicates the low level index
of the raw branch records for the most recent branch, aka entries[0].
For other architectures whose raw branch records are already stored in
age order, the hw_idx should be 0.
If we don't know the order of the raw branch records, the hw_idx should
be -1ULL. I will set hw_idx to -1ULL for IBM Power for now. They can
change it later if needed.
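
As an illustration of that definition only (this is not part of any patch,
and it assumes the LBR registers form a ring of max_nr slots with hw_idx
being the physical slot of entries[0]), a tool could map an abstracted
entry back to its physical slot like this:

	/* illustrative only: physical slot of entries[i] */
	static u64 lbr_hw_slot(u64 hw_idx, u64 i, u64 max_nr)
	{
		return (hw_idx + max_nr - i) % max_nr;
	}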

How about the changes as below?

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 011dcbdbccc2..e2de81372433 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -93,14 +93,26 @@ struct perf_raw_record {
/*
* branch stack layout:
* nr: number of taken branches stored in entries[]
+ * hw_idx: The low level index of raw branch records
+ * for the most recent branch.
+ * -1ULL means invalid.
*
* Note that nr can vary from sample to sample
* branches (to, from) are stored from most recent
* to least recent, i.e., entries[0] contains the most
* recent branch.
+ * The entries[] is an abstraction of raw branch records,
+ * which may not be stored in age order in HW, e.g. Intel LBR.
+ * The hw_idx is to expose the low level index of raw
+ * branch record for the most recent branch aka entries[0].
+ * For the architectures whose raw branch records are
+ * already stored in age order, the hw_idx should be 0.
+ * If we don't know the order of raw branch records,
+ * the hw_idx should be -1ULL.
*/
struct perf_branch_stack {
__u64 nr;
+ __u64 hw_idx;
struct perf_branch_entry entries[0];
};
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..30f335f0d25e 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -180,7 +180,9 @@ enum perf_branch_sample_type_shift {

PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */

- PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */
+ PERF_SAMPLE_BRANCH_HW_INDEX_SHIFT = 17, /* save low level index of raw branch records */
+
+ PERF_SAMPLE_BRANCH_MAX_SHIFT = 18, /* non-ABI */
};

enum perf_branch_sample_type {
@@ -207,6 +209,8 @@ enum perf_branch_sample_type {
PERF_SAMPLE_BRANCH_TYPE_SAVE =
1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,

+ PERF_SAMPLE_BRANCH_HW_INDEX = 1U << PERF_SAMPLE_BRANCH_HW_INDEX_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
};

@@ -849,7 +853,11 @@ enum perf_event_type {
* char data[size];}&& PERF_SAMPLE_RAW
*
* { u64 nr;
- * { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
+ * { u64 from, to, flags } lbr[nr];
+ *
+ * # only available if PERF_SAMPLE_BRANCH_HW_INDEX is set
+ * u64 hw_idx;
+ * } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
* u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER


Thanks,
Kan