2020-03-02 05:24:50

by Ravi Bangoria

Subject: [RFC 00/11] perf: Enhancing perf to export processor hazard information

Most modern microprocessors employ complex instruction execution
pipelines such that many instructions can be 'in flight' at any
given point in time. Various factors affect this pipeline, and
hazards are primary among them. Three types of hazards exist:
data hazards, structural hazards and control hazards. A data
hazard occurs when data dependencies exist between instructions
in different stages of the pipeline. A structural hazard occurs
when the same processor hardware is needed by more than one
in-flight instruction at the same time. A control hazard occurs
when the pipeline cannot be sure which instruction executes next,
typically because of a branch misprediction.

Information about these hazards is critical for analyzing
performance issues and for tuning software to overcome such
issues. Modern processors export such hazard data in Performance
Monitoring Unit (PMU) registers. For example, the 'Sampled
Instruction Event Register' on IBM PowerPC[1][2] and
'Instruction-Based Sampling' on AMD[3] provide such information.

Implementation detail:

A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
If it is set, the kernel converts arch-specific hazard information
into a generic format:

struct perf_pipeline_haz_data {
/* Instruction/Opcode type: Load, Store, Branch .... */
__u8 itype;
/* Instruction Cache source */
__u8 icache;
/* Instruction suffered hazard in pipeline stage */
__u8 hazard_stage;
/* Hazard reason */
__u8 hazard_reason;
/* Instruction suffered stall in pipeline stage */
__u8 stall_stage;
/* Stall reason */
__u8 stall_reason;
__u16 pad;
};

... which can be read by the user from the mmap() ring buffer.
With this approach, a sample perf report in hazard mode looks
like this (on IBM PowerPC):

# ./perf record --hazard ./ebizzy
# ./perf report --hazard
Overhead Symbol Shared Instruction Type Hazard Stage Hazard Reason Stall Stage Stall Reason ICache access
36.58% [.] thread_run ebizzy Load LSU Mispredict LSU Load fin L1 hit
9.46% [.] thread_run ebizzy Load LSU Mispredict LSU Dcache_miss L1 hit
1.76% [.] thread_run ebizzy Fixed point - - - - L1 hit
1.31% [.] thread_run ebizzy Load LSU ERAT Miss LSU Load fin L1 hit
1.27% [.] thread_run ebizzy Load LSU Mispredict - - L1 hit
1.16% [.] thread_run ebizzy Fixed point - - FXU Fixed cycle L1 hit
0.50% [.] thread_run ebizzy Fixed point ISU Source Unavailable FXU Fixed cycle L1 hit
0.30% [.] thread_run ebizzy Load LSU LMQ Full, DERAT Miss LSU Load fin L1 hit
0.24% [.] thread_run ebizzy Load LSU ERAT Miss - - L1 hit
0.08% [.] thread_run ebizzy - - - BRU Fixed cycle L1 hit
0.05% [.] thread_run ebizzy Branch - - BRU Fixed cycle L1 hit
0.04% [.] thread_run ebizzy Fixed point ISU Source Unavailable - - L1 hit

Also perf annotate with hazard data:

│ Disassembly of section .text:

│ 0000000010001cf8 <compare>:
│ compare():
│ return NULL;
│ }

│ static int
│ compare(const void *p1, const void *p2)
│ {
33.23 │ std r31,-8(r1)
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: Load Hit Store, stall_stage: LSU, stall_reason: -, icache: L3 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
0.84 │ stdu r1,-64(r1)
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
0.24 │ mr r31,r1
│ {haz_stage: -, haz_reason: -, stall_stage: -, stall_reason: -, icache: L1 hit}
21.18 │ std r3,32(r31)
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}


Patches:
- Patch #1 is a simple cleanup patch
- Patches #2, #3 and #4 implement the generic and arch-specific
kernel infrastructure
- Patch #5 enables perf record and perf script with hazard mode
- Patches #6, #7 and #8 enable perf report with hazard mode
- Patches #9, #10 and #11 enable perf annotate with hazard mode

Note:
- This series is based on the talk by Madhavan at LPC 2018[4]. It is
just an early RFC to get comments on the approach and is not intended
to be merged yet.
- I've prepared the series on top of v5.6-rc3, but it depends on
generic perf annotate fixes [5][6] which are already merged by
Arnaldo into perf/urgent and perf/core.

[1]: Book III, Section 9.4.10:
https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
[2]: https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf#G9.1106986
[3]: https://www.amd.com/system/files/TechDocs/24593.pdf#G19.1089550
[4]: https://linuxplumbersconf.org/event/2/contributions/76/
[5]: http://lore.kernel.org/r/[email protected]
[6]: http://lore.kernel.org/r/[email protected]


Madhavan Srinivasan (7):
perf/core: Data structure to present hazard data
powerpc/perf: Arch specific definitions for pipeline
powerpc/perf: Arch support to expose Hazard data
perf tools: Enable record and script to record and show hazard data
perf hists: Make a room for hazard info in struct hist_entry
perf hazard: Functions to convert generic hazard data to arch specific
string
perf report: Enable hazard mode

Ravi Bangoria (4):
powerpc/perf: Simplify ISA207_SIER macros
perf annotate: Introduce type for annotation_line
perf annotate: Preparation for hazard
perf annotate: Show hazard data in tui mode

arch/powerpc/include/asm/perf_event_server.h | 2 +
.../include/uapi/asm/perf_pipeline_haz.h | 80 ++++++
arch/powerpc/perf/core-book3s.c | 4 +
arch/powerpc/perf/isa207-common.c | 165 ++++++++++++-
arch/powerpc/perf/isa207-common.h | 23 +-
arch/powerpc/perf/power8-pmu.c | 1 +
arch/powerpc/perf/power9-pmu.c | 1 +
include/linux/perf_event.h | 7 +
include/uapi/linux/perf_event.h | 32 ++-
kernel/events/core.c | 6 +
tools/include/uapi/linux/perf_event.h | 32 ++-
tools/perf/Documentation/perf-record.txt | 3 +
tools/perf/builtin-annotate.c | 7 +-
tools/perf/builtin-c2c.c | 4 +-
tools/perf/builtin-diff.c | 6 +-
tools/perf/builtin-record.c | 1 +
tools/perf/builtin-report.c | 29 +++
tools/perf/tests/hists_link.c | 4 +-
tools/perf/ui/browsers/annotate.c | 128 ++++++++--
tools/perf/ui/gtk/annotate.c | 6 +-
tools/perf/util/Build | 2 +
tools/perf/util/annotate.c | 153 +++++++++++-
tools/perf/util/annotate.h | 38 ++-
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 10 +
tools/perf/util/hazard.c | 51 ++++
tools/perf/util/hazard.h | 14 ++
tools/perf/util/hazard/Build | 1 +
.../util/hazard/powerpc/perf_pipeline_haz.h | 80 ++++++
.../perf/util/hazard/powerpc/powerpc_hazard.c | 142 +++++++++++
.../perf/util/hazard/powerpc/powerpc_hazard.h | 14 ++
tools/perf/util/hist.c | 112 ++++++++-
tools/perf/util/hist.h | 13 +
tools/perf/util/machine.c | 6 +
tools/perf/util/machine.h | 3 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/record.h | 1 +
tools/perf/util/session.c | 16 ++
tools/perf/util/sort.c | 230 ++++++++++++++++++
tools/perf/util/sort.h | 23 ++
40 files changed, 1387 insertions(+), 65 deletions(-)
create mode 100644 arch/powerpc/include/uapi/asm/perf_pipeline_haz.h
create mode 100644 tools/perf/util/hazard.c
create mode 100644 tools/perf/util/hazard.h
create mode 100644 tools/perf/util/hazard/Build
create mode 100644 tools/perf/util/hazard/powerpc/perf_pipeline_haz.h
create mode 100644 tools/perf/util/hazard/powerpc/powerpc_hazard.c
create mode 100644 tools/perf/util/hazard/powerpc/powerpc_hazard.h

--
2.21.1


2020-03-02 05:24:59

by Ravi Bangoria

Subject: [RFC 01/11] powerpc/perf: Simplify ISA207_SIER macros

Instead of having separate macros for MASK and SHIFT, and using
them to derive the bits, let's have a single macro that does the
job. Also, remove the ISA207_ prefix, because some of the SIER
bits extracted with these macros, e.g. the DATA_SRC bits, are not
defined in the ISA.

Signed-off-by: Ravi Bangoria <[email protected]>
---
arch/powerpc/perf/isa207-common.c | 8 ++++----
arch/powerpc/perf/isa207-common.h | 11 +++--------
2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
index 4c86da5eb28a..07026bbd292b 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -215,10 +215,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
}

sier = mfspr(SPRN_SIER);
- val = (sier & ISA207_SIER_TYPE_MASK) >> ISA207_SIER_TYPE_SHIFT;
+ val = SIER_TYPE(sier);
if (val == 1 || val == 2) {
- idx = (sier & ISA207_SIER_LDST_MASK) >> ISA207_SIER_LDST_SHIFT;
- sub_idx = (sier & ISA207_SIER_DATA_SRC_MASK) >> ISA207_SIER_DATA_SRC_SHIFT;
+ idx = SIER_LDST(sier);
+ sub_idx = SIER_DATA_SRC(sier);

dsrc->val = isa207_find_source(idx, sub_idx);
dsrc->val |= (val == 1) ? P(OP, LOAD) : P(OP, STORE);
@@ -231,7 +231,7 @@ void isa207_get_mem_weight(u64 *weight)
u64 exp = MMCRA_THR_CTR_EXP(mmcra);
u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
u64 sier = mfspr(SPRN_SIER);
- u64 val = (sier & ISA207_SIER_TYPE_MASK) >> ISA207_SIER_TYPE_SHIFT;
+ u64 val = SIER_TYPE(sier);

if (val == 0 || val == 7)
*weight = 0;
diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
index 63fd4f3f6013..7027eb9f3e40 100644
--- a/arch/powerpc/perf/isa207-common.h
+++ b/arch/powerpc/perf/isa207-common.h
@@ -202,14 +202,9 @@
#define MAX_ALT 2
#define MAX_PMU_COUNTERS 6

-#define ISA207_SIER_TYPE_SHIFT 15
-#define ISA207_SIER_TYPE_MASK (0x7ull << ISA207_SIER_TYPE_SHIFT)
-
-#define ISA207_SIER_LDST_SHIFT 1
-#define ISA207_SIER_LDST_MASK (0x7ull << ISA207_SIER_LDST_SHIFT)
-
-#define ISA207_SIER_DATA_SRC_SHIFT 53
-#define ISA207_SIER_DATA_SRC_MASK (0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
+#define SIER_DATA_SRC(sier) (((sier) >> (63 - 10)) & 0x7ull)
+#define SIER_TYPE(sier) (((sier) >> (63 - 48)) & 0x7ull)
+#define SIER_LDST(sier) (((sier) >> (63 - 62)) & 0x7ull)

#define P(a, b) PERF_MEM_S(a, b)
#define PH(a, b) (P(LVL, HIT) | P(a, b))
--
2.21.1

2020-03-02 05:25:05

by Ravi Bangoria

Subject: [RFC 02/11] perf/core: Data structure to present hazard data

From: Madhavan Srinivasan <[email protected]>

Introduce a new perf sample_type, PERF_SAMPLE_PIPELINE_HAZ, to request
that the kernel provide cpu pipeline hazard data. Also, introduce the
arch-independent structure 'perf_pipeline_haz_data' to pass hazard data
to userspace. This is a generic structure, and arch-specific data needs
to be converted to this format.

Signed-off-by: Madhavan Srinivasan <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
include/linux/perf_event.h | 7 ++++++
include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
kernel/events/core.c | 6 +++++
tools/include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
4 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 547773f5894e..d5b606e3c57d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1001,6 +1001,7 @@ struct perf_sample_data {
u64 stack_user_size;

u64 phys_addr;
+ struct perf_pipeline_haz_data pipeline_haz;
} ____cacheline_aligned;

/* default value for data source */
@@ -1021,6 +1022,12 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
data->weight = 0;
data->data_src.val = PERF_MEM_NA;
data->txn = 0;
+ data->pipeline_haz.itype = PERF_HAZ__ITYPE_NA;
+ data->pipeline_haz.icache = PERF_HAZ__ICACHE_NA;
+ data->pipeline_haz.hazard_stage = PERF_HAZ__PIPE_STAGE_NA;
+ data->pipeline_haz.hazard_reason = PERF_HAZ__HREASON_NA;
+ data->pipeline_haz.stall_stage = PERF_HAZ__PIPE_STAGE_NA;
+ data->pipeline_haz.stall_reason = PERF_HAZ__SREASON_NA;
}

extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 377d794d3105..ff252618ca93 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -142,8 +142,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_REGS_INTR = 1U << 18,
PERF_SAMPLE_PHYS_ADDR = 1U << 19,
PERF_SAMPLE_AUX = 1U << 20,
+ PERF_SAMPLE_PIPELINE_HAZ = 1U << 21,

- PERF_SAMPLE_MAX = 1U << 21, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */

__PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */
};
@@ -870,6 +871,13 @@ enum perf_event_type {
* { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
* { u64 size;
* char data[size]; } && PERF_SAMPLE_AUX
+ * { u8 itype;
+ * u8 icache;
+ * u8 hazard_stage;
+ * u8 hazard_reason;
+ * u8 stall_stage;
+ * u8 stall_reason;
+ * u16 pad;} && PERF_SAMPLE_PIPELINE_HAZ
* };
*/
PERF_RECORD_SAMPLE = 9,
@@ -1185,4 +1193,26 @@ struct perf_branch_entry {
reserved:40;
};

+struct perf_pipeline_haz_data {
+ /* Instruction/Opcode type: Load, Store, Branch .... */
+ __u8 itype;
+ /* Instruction Cache source */
+ __u8 icache;
+ /* Instruction suffered hazard in pipeline stage */
+ __u8 hazard_stage;
+ /* Hazard reason */
+ __u8 hazard_reason;
+ /* Instruction suffered stall in pipeline stage */
+ __u8 stall_stage;
+ /* Stall reason */
+ __u8 stall_reason;
+ __u16 pad;
+};
+
+#define PERF_HAZ__ITYPE_NA 0x0
+#define PERF_HAZ__ICACHE_NA 0x0
+#define PERF_HAZ__PIPE_STAGE_NA 0x0
+#define PERF_HAZ__HREASON_NA 0x0
+#define PERF_HAZ__SREASON_NA 0x0
+
#endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e453589da97c..d00037c77ccf 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1754,6 +1754,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
if (sample_type & PERF_SAMPLE_PHYS_ADDR)
size += sizeof(data->phys_addr);

+ if (sample_type & PERF_SAMPLE_PIPELINE_HAZ)
+ size += sizeof(data->pipeline_haz);
+
event->header_size = size;
}

@@ -6712,6 +6715,9 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_aux_sample_output(event, handle, data);
}

+ if (sample_type & PERF_SAMPLE_PIPELINE_HAZ)
+ perf_output_put(handle, data->pipeline_haz);
+
if (!event->attr.watermark) {
int wakeup_events = event->attr.wakeup_events;

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 377d794d3105..ff252618ca93 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -142,8 +142,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_REGS_INTR = 1U << 18,
PERF_SAMPLE_PHYS_ADDR = 1U << 19,
PERF_SAMPLE_AUX = 1U << 20,
+ PERF_SAMPLE_PIPELINE_HAZ = 1U << 21,

- PERF_SAMPLE_MAX = 1U << 21, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */

__PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */
};
@@ -870,6 +871,13 @@ enum perf_event_type {
* { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
* { u64 size;
* char data[size]; } && PERF_SAMPLE_AUX
+ * { u8 itype;
+ * u8 icache;
+ * u8 hazard_stage;
+ * u8 hazard_reason;
+ * u8 stall_stage;
+ * u8 stall_reason;
+ * u16 pad;} && PERF_SAMPLE_PIPELINE_HAZ
* };
*/
PERF_RECORD_SAMPLE = 9,
@@ -1185,4 +1193,26 @@ struct perf_branch_entry {
reserved:40;
};

+struct perf_pipeline_haz_data {
+ /* Instruction/Opcode type: Load, Store, Branch .... */
+ __u8 itype;
+ /* Instruction Cache source */
+ __u8 icache;
+ /* Instruction suffered hazard in pipeline stage */
+ __u8 hazard_stage;
+ /* Hazard reason */
+ __u8 hazard_reason;
+ /* Instruction suffered stall in pipeline stage */
+ __u8 stall_stage;
+ /* Stall reason */
+ __u8 stall_reason;
+ __u16 pad;
+};
+
+#define PERF_HAZ__ITYPE_NA 0x0
+#define PERF_HAZ__ICACHE_NA 0x0
+#define PERF_HAZ__PIPE_STAGE_NA 0x0
+#define PERF_HAZ__HREASON_NA 0x0
+#define PERF_HAZ__SREASON_NA 0x0
+
#endif /* _UAPI_LINUX_PERF_EVENT_H */
--
2.21.1

2020-03-02 05:25:12

by Ravi Bangoria

Subject: [RFC 03/11] powerpc/perf: Arch specific definitions for pipeline

From: Madhavan Srinivasan <[email protected]>

Create powerpc specific definitions for pipeline hazards and stalls.
This information is available in the SIER register on powerpc. The
current definitions are based on the IBM PowerPC SIER specification
available in the ISA[1] and the Performance Monitor Unit User’s
Guide[2].

[1]: Book III, Section 9.4.10:
https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
[2]: https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf#G9.1106986

Signed-off-by: Madhavan Srinivasan <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
.../include/uapi/asm/perf_pipeline_haz.h | 80 +++++++++++++++++++
1 file changed, 80 insertions(+)
create mode 100644 arch/powerpc/include/uapi/asm/perf_pipeline_haz.h

diff --git a/arch/powerpc/include/uapi/asm/perf_pipeline_haz.h b/arch/powerpc/include/uapi/asm/perf_pipeline_haz.h
new file mode 100644
index 000000000000..de8857ec31dd
--- /dev/null
+++ b/arch/powerpc/include/uapi/asm/perf_pipeline_haz.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_POWERPC_PERF_PIPELINE_HAZ_H
+#define _UAPI_ASM_POWERPC_PERF_PIPELINE_HAZ_H
+
+enum perf_inst_type {
+ PERF_HAZ__ITYPE_LOAD = 1,
+ PERF_HAZ__ITYPE_STORE,
+ PERF_HAZ__ITYPE_BRANCH,
+ PERF_HAZ__ITYPE_FP,
+ PERF_HAZ__ITYPE_FX,
+ PERF_HAZ__ITYPE_CR_OR_SC,
+};
+
+enum perf_inst_cache {
+ PERF_HAZ__ICACHE_L1_HIT = 1,
+ PERF_HAZ__ICACHE_L2_HIT,
+ PERF_HAZ__ICACHE_L3_HIT,
+ PERF_HAZ__ICACHE_L3_MISS,
+};
+
+enum perf_pipeline_stage {
+ PERF_HAZ__PIPE_STAGE_IFU = 1,
+ PERF_HAZ__PIPE_STAGE_IDU,
+ PERF_HAZ__PIPE_STAGE_ISU,
+ PERF_HAZ__PIPE_STAGE_LSU,
+ PERF_HAZ__PIPE_STAGE_BRU,
+ PERF_HAZ__PIPE_STAGE_FXU,
+ PERF_HAZ__PIPE_STAGE_FPU,
+ PERF_HAZ__PIPE_STAGE_VSU,
+ PERF_HAZ__PIPE_STAGE_OTHER,
+};
+
+enum perf_haz_bru_reason {
+ PERF_HAZ__HAZ_BRU_MPRED_DIR = 1,
+ PERF_HAZ__HAZ_BRU_MPRED_TA,
+};
+
+enum perf_haz_isu_reason {
+ PERF_HAZ__HAZ_ISU_SRC = 1,
+ PERF_HAZ__HAZ_ISU_COL,
+};
+
+enum perf_haz_lsu_reason {
+ PERF_HAZ__HAZ_LSU_ERAT_MISS = 1,
+ PERF_HAZ__HAZ_LSU_LMQ,
+ PERF_HAZ__HAZ_LSU_LHS,
+ PERF_HAZ__HAZ_LSU_MPRED,
+ PERF_HAZ__HAZ_DERAT_MISS,
+ PERF_HAZ__HAZ_LSU_LMQ_DERAT_MISS,
+ PERF_HAZ__HAZ_LSU_LHS_DERAT_MISS,
+ PERF_HAZ__HAZ_LSU_MPRED_DERAT_MISS,
+};
+
+enum perf_stall_lsu_reason {
+ PERF_HAZ__STALL_LSU_DCACHE_MISS = 1,
+ PERF_HAZ__STALL_LSU_LD_FIN,
+ PERF_HAZ__STALL_LSU_ST_FWD,
+ PERF_HAZ__STALL_LSU_ST,
+};
+
+enum perf_stall_fxu_reason {
+ PERF_HAZ__STALL_FXU_MC = 1,
+ PERF_HAZ__STALL_FXU_FC,
+};
+
+enum perf_stall_bru_reason {
+ PERF_HAZ__STALL_BRU_FIN_MPRED = 1,
+ PERF_HAZ__STALL_BRU_FC,
+};
+
+enum perf_stall_vsu_reason {
+ PERF_HAZ__STALL_VSU_MC = 1,
+ PERF_HAZ__STALL_VSU_FC,
+};
+
+enum perf_stall_other_reason {
+ PERF_HAZ__STALL_NTC,
+};
+
+#endif /* _UAPI_ASM_POWERPC_PERF_PIPELINE_HAZ_H */
--
2.21.1

2020-03-02 05:25:18

by Ravi Bangoria

Subject: [RFC 04/11] powerpc/perf: Arch support to expose Hazard data

From: Madhavan Srinivasan <[email protected]>

The SIER register of the PowerPC hardware PMU provides cpu pipeline
hazard information. Add logic to convert this arch-specific data
into the perf_pipeline_haz_data structure.

Signed-off-by: Madhavan Srinivasan <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
arch/powerpc/include/asm/perf_event_server.h | 2 +
arch/powerpc/perf/core-book3s.c | 4 +
arch/powerpc/perf/isa207-common.c | 157 +++++++++++++++++++
arch/powerpc/perf/isa207-common.h | 12 ++
arch/powerpc/perf/power8-pmu.c | 1 +
arch/powerpc/perf/power9-pmu.c | 1 +
6 files changed, 177 insertions(+)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 3e9703f44c7c..9b8f90439ff2 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -37,6 +37,8 @@ struct power_pmu {
void (*get_mem_data_src)(union perf_mem_data_src *dsrc,
u32 flags, struct pt_regs *regs);
void (*get_mem_weight)(u64 *weight);
+ void (*get_phazard_data)(struct perf_pipeline_haz_data *phaz,
+ u32 flags, struct pt_regs *regs);
unsigned long group_constraint_mask;
unsigned long group_constraint_val;
u64 (*bhrb_filter_map)(u64 branch_sample_type);
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 3086055bf681..fcbb4acc3a03 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2096,6 +2096,10 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
ppmu->get_mem_weight)
ppmu->get_mem_weight(&data.weight);

+ if (event->attr.sample_type & PERF_SAMPLE_PIPELINE_HAZ &&
+ ppmu->get_phazard_data)
+ ppmu->get_phazard_data(&data.pipeline_haz, ppmu->flags, regs);
+
if (perf_event_overflow(event, &data, regs))
power_pmu_stop(event, 0);
}
diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
index 07026bbd292b..03dafde7cace 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -239,6 +239,163 @@ void isa207_get_mem_weight(u64 *weight)
*weight = mantissa << (2 * exp);
}

+static __u8 get_inst_type(u64 sier)
+{
+ switch (SIER_TYPE(sier)) {
+ case 1:
+ return PERF_HAZ__ITYPE_LOAD;
+ case 2:
+ return PERF_HAZ__ITYPE_STORE;
+ case 3:
+ return PERF_HAZ__ITYPE_BRANCH;
+ case 4:
+ return PERF_HAZ__ITYPE_FP;
+ case 5:
+ return PERF_HAZ__ITYPE_FX;
+ case 6:
+ return PERF_HAZ__ITYPE_CR_OR_SC;
+ }
+ return PERF_HAZ__ITYPE_NA;
+}
+
+static __u8 get_inst_cache(u64 sier)
+{
+ switch (SIER_ICACHE(sier)) {
+ case 1:
+ return PERF_HAZ__ICACHE_L1_HIT;
+ case 2:
+ return PERF_HAZ__ICACHE_L2_HIT;
+ case 3:
+ return PERF_HAZ__ICACHE_L3_HIT;
+ case 4:
+ return PERF_HAZ__ICACHE_L3_MISS;
+ }
+ return PERF_HAZ__ICACHE_NA;
+}
+
+static void get_hazard_data(u64 sier, struct perf_pipeline_haz_data *haz)
+{
+ if (SIER_MPRED(sier)) {
+ haz->hazard_stage = PERF_HAZ__PIPE_STAGE_BRU;
+
+ switch (SIER_MPRED_TYPE(sier)) {
+ case 1:
+ haz->hazard_reason = PERF_HAZ__HAZ_BRU_MPRED_DIR;
+ return;
+ case 2:
+ haz->hazard_reason = PERF_HAZ__HAZ_BRU_MPRED_TA;
+ return;
+ }
+ }
+
+ if (cpu_has_feature(CPU_FTR_ARCH_300) &&
+ (SIER_TYPE(sier) == 1 || SIER_TYPE(sier) == 2)) {
+ haz->hazard_stage = PERF_HAZ__PIPE_STAGE_LSU;
+ haz->hazard_reason = PERF_HAZ__HAZ_DERAT_MISS;
+ return;
+ }
+
+ if (cpu_has_feature(CPU_FTR_ARCH_207S) &&
+ (SIER_TYPE(sier) == 1 || SIER_TYPE(sier) == 2)) {
+ int derat_miss = SIER_DERAT_MISS(sier);
+
+ haz->hazard_stage = PERF_HAZ__PIPE_STAGE_LSU;
+
+ switch (p8_SIER_REJ_LSU_REASON(sier)) {
+ case 0:
+ haz->hazard_reason = PERF_HAZ__HAZ_LSU_ERAT_MISS;
+ return;
+ case 1:
+ haz->hazard_reason = (derat_miss) ?
+ PERF_HAZ__HAZ_LSU_LMQ_DERAT_MISS :
+ PERF_HAZ__HAZ_LSU_LMQ;
+ return;
+ case 2:
+ haz->hazard_reason = (derat_miss) ?
+ PERF_HAZ__HAZ_LSU_LHS_DERAT_MISS :
+ PERF_HAZ__HAZ_LSU_LHS;
+ return;
+ case 3:
+ haz->hazard_reason = (derat_miss) ?
+ PERF_HAZ__HAZ_LSU_MPRED_DERAT_MISS :
+ PERF_HAZ__HAZ_LSU_MPRED;
+ return;
+ }
+
+ if (derat_miss)
+ haz->hazard_reason = PERF_HAZ__HAZ_DERAT_MISS;
+ }
+
+ if (cpu_has_feature(CPU_FTR_ARCH_207S) && p8_SIER_REJ_ISU(sier)) {
+ haz->hazard_stage = PERF_HAZ__PIPE_STAGE_ISU;
+
+ if (p8_SIER_REJ_ISU_SRC(sier))
+ haz->hazard_reason = PERF_HAZ__HAZ_ISU_SRC;
+ if (p8_SIER_REJ_ISU_COL(sier))
+ haz->hazard_reason = PERF_HAZ__HAZ_ISU_COL;
+ }
+}
+
+static void get_stall_data(u64 sier, struct perf_pipeline_haz_data *haz)
+{
+ switch (SIER_FIN_STALL_REASON(sier)) {
+ case 1:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_OTHER;
+ haz->stall_reason = PERF_HAZ__STALL_NTC;
+ break;
+ case 4:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_LSU;
+ haz->stall_reason = PERF_HAZ__STALL_LSU_DCACHE_MISS;
+ break;
+ case 5:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_LSU;
+ haz->stall_reason = PERF_HAZ__STALL_LSU_LD_FIN;
+ break;
+ case 6:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_LSU;
+ haz->stall_reason = PERF_HAZ__STALL_LSU_ST_FWD;
+ break;
+ case 7:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_LSU;
+ haz->stall_reason = PERF_HAZ__STALL_LSU_ST;
+ break;
+ case 8:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_FXU;
+ haz->stall_reason = PERF_HAZ__STALL_FXU_MC;
+ break;
+ case 9:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_BRU;
+ haz->stall_reason = PERF_HAZ__STALL_BRU_FIN_MPRED;
+ break;
+ case 10:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_VSU;
+ haz->stall_reason = PERF_HAZ__STALL_VSU_MC;
+ break;
+ case 12:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_FXU;
+ haz->stall_reason = PERF_HAZ__STALL_FXU_FC;
+ break;
+ case 13:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_VSU;
+ haz->stall_reason = PERF_HAZ__STALL_VSU_FC;
+ break;
+ case 14:
+ haz->stall_stage = PERF_HAZ__PIPE_STAGE_BRU;
+ haz->stall_reason = PERF_HAZ__STALL_BRU_FC;
+ }
+}
+
+void isa207_get_phazard_data(struct perf_pipeline_haz_data *haz, u32 flags,
+ struct pt_regs *regs)
+{
+ u64 sier = mfspr(SPRN_SIER);
+
+ haz->itype = get_inst_type(sier);
+ haz->icache = get_inst_cache(sier);
+ get_hazard_data(sier, haz);
+ get_stall_data(sier, haz);
+}
+
int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp)
{
unsigned int unit, pmc, cache, ebb;
diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
index 7027eb9f3e40..125e0e44aeea 100644
--- a/arch/powerpc/perf/isa207-common.h
+++ b/arch/powerpc/perf/isa207-common.h
@@ -12,6 +12,7 @@
#include <linux/perf_event.h>
#include <asm/firmware.h>
#include <asm/cputable.h>
+#include <asm/perf_pipeline_haz.h>

#define EVENT_EBB_MASK 1ull
#define EVENT_EBB_SHIFT PERF_EVENT_CONFIG_EBB_SHIFT
@@ -202,8 +203,17 @@
#define MAX_ALT 2
#define MAX_PMU_COUNTERS 6

+#define SIER_FIN_STALL_REASON(sier) (((sier) >> (63 - 6)) & 0xfull)
#define SIER_DATA_SRC(sier) (((sier) >> (63 - 10)) & 0x7ull)
+#define p8_SIER_REJ_ISU_SRC(sier) (((sier) >> (63 - 32)) & 0x1ull)
+#define p8_SIER_REJ_ISU_COL(sier) (((sier) >> (63 - 33)) & 0x1ull)
+#define p8_SIER_REJ_ISU(sier) (((sier) >> (63 - 33)) & 0x3ull)
+#define p8_SIER_REJ_LSU_REASON(sier) (((sier) >> (63 - 36)) & 0x3ull)
#define SIER_TYPE(sier) (((sier) >> (63 - 48)) & 0x7ull)
+#define SIER_ICACHE(sier) (((sier) >> (63 - 51)) & 0x7ull)
+#define SIER_MPRED(sier) (((sier) >> (63 - 53)) & 0x1ull)
+#define SIER_MPRED_TYPE(sier) (((sier) >> (63 - 55)) & 0x3ull)
+#define SIER_DERAT_MISS(sier) (((sier) >> (63 - 56)) & 0x1ull)
#define SIER_LDST(sier) (((sier) >> (63 - 62)) & 0x7ull)

#define P(a, b) PERF_MEM_S(a, b)
@@ -220,5 +230,7 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
struct pt_regs *regs);
void isa207_get_mem_weight(u64 *weight);
+void isa207_get_phazard_data(struct perf_pipeline_haz_data *haz, u32 flags,
+ struct pt_regs *regs);

#endif
diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 3a5fcc20ff31..dc407329ba94 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -370,6 +370,7 @@ static struct power_pmu power8_pmu = {
.get_mem_data_src = isa207_get_mem_data_src,
.get_mem_weight = isa207_get_mem_weight,
.disable_pmc = isa207_disable_pmc,
+ .get_phazard_data = isa207_get_phazard_data,
.flags = PPMU_HAS_SIER | PPMU_ARCH_207S,
.n_generic = ARRAY_SIZE(power8_generic_events),
.generic_events = power8_generic_events,
diff --git a/arch/powerpc/perf/power9-pmu.c b/arch/powerpc/perf/power9-pmu.c
index 08c3ef796198..84f663d8df13 100644
--- a/arch/powerpc/perf/power9-pmu.c
+++ b/arch/powerpc/perf/power9-pmu.c
@@ -428,6 +428,7 @@ static struct power_pmu power9_pmu = {
.get_mem_data_src = isa207_get_mem_data_src,
.get_mem_weight = isa207_get_mem_weight,
.disable_pmc = isa207_disable_pmc,
+ .get_phazard_data = isa207_get_phazard_data,
.flags = PPMU_HAS_SIER | PPMU_ARCH_207S,
.n_generic = ARRAY_SIZE(power9_generic_events),
.generic_events = power9_generic_events,
--
2.21.1

2020-03-02 05:25:29

by Ravi Bangoria

Subject: [RFC 06/11] perf hists: Make a room for hazard info in struct hist_entry

From: Madhavan Srinivasan <[email protected]>

To enable hazard mode with perf report (followup patch) we need
cpu pipeline hazard data to be available in hist_entry. Add hazard
info to struct hist_entry. Also add haz_data as a parameter to
hists__add_entry().

Signed-off-by: Madhavan Srinivasan <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/builtin-annotate.c | 2 +-
tools/perf/builtin-c2c.c | 4 ++--
tools/perf/builtin-diff.c | 6 +++---
tools/perf/tests/hists_link.c | 4 ++--
tools/perf/util/hist.c | 22 +++++++++++++++-------
tools/perf/util/hist.h | 2 ++
tools/perf/util/sort.h | 1 +
7 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
index 6c0a0412502e..78552a9428a6 100644
--- a/tools/perf/builtin-annotate.c
+++ b/tools/perf/builtin-annotate.c
@@ -249,7 +249,7 @@ static int perf_evsel__add_sample(struct evsel *evsel,
if (ann->has_br_stack && has_annotation(ann))
return process_branch_callback(evsel, sample, al, ann, machine);

- he = hists__add_entry(hists, al, NULL, NULL, NULL, sample, true);
+ he = hists__add_entry(hists, al, NULL, NULL, NULL, NULL, sample, true);
if (he == NULL)
return -ENOMEM;

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 246ac0b4d54f..2a1cb5cda6d9 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -292,7 +292,7 @@ static int process_sample_event(struct perf_tool *tool __maybe_unused,
c2c_decode_stats(&stats, mi);

he = hists__add_entry_ops(&c2c_hists->hists, &c2c_entry_ops,
- &al, NULL, NULL, mi,
+ &al, NULL, NULL, mi, NULL,
sample, true);
if (he == NULL)
goto free_mi;
@@ -326,7 +326,7 @@ static int process_sample_event(struct perf_tool *tool __maybe_unused,
goto free_mi;

he = hists__add_entry_ops(&c2c_hists->hists, &c2c_entry_ops,
- &al, NULL, NULL, mi,
+ &al, NULL, NULL, mi, NULL,
sample, true);
if (he == NULL)
goto free_mi;
diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c
index f8b6ae557d8b..e32e91f89a18 100644
--- a/tools/perf/builtin-diff.c
+++ b/tools/perf/builtin-diff.c
@@ -412,15 +412,15 @@ static int diff__process_sample_event(struct perf_tool *tool,
}

if (compute != COMPUTE_CYCLES) {
- if (!hists__add_entry(hists, &al, NULL, NULL, NULL, sample,
- true)) {
+ if (!hists__add_entry(hists, &al, NULL, NULL, NULL, NULL,
+ sample, true)) {
pr_warning("problem incrementing symbol period, "
"skipping event\n");
goto out_put;
}
} else {
if (!hists__add_entry_ops(hists, &block_hist_ops, &al, NULL,
- NULL, NULL, sample, true)) {
+ NULL, NULL, NULL, sample, true)) {
pr_warning("problem incrementing symbol period, "
"skipping event\n");
goto out_put;
diff --git a/tools/perf/tests/hists_link.c b/tools/perf/tests/hists_link.c
index a024d3f3a412..112a90818d2e 100644
--- a/tools/perf/tests/hists_link.c
+++ b/tools/perf/tests/hists_link.c
@@ -86,7 +86,7 @@ static int add_hist_entries(struct evlist *evlist, struct machine *machine)
if (machine__resolve(machine, &al, &sample) < 0)
goto out;

- he = hists__add_entry(hists, &al, NULL,
+ he = hists__add_entry(hists, &al, NULL, NULL,
NULL, NULL, &sample, true);
if (he == NULL) {
addr_location__put(&al);
@@ -105,7 +105,7 @@ static int add_hist_entries(struct evlist *evlist, struct machine *machine)
if (machine__resolve(machine, &al, &sample) < 0)
goto out;

- he = hists__add_entry(hists, &al, NULL,
+ he = hists__add_entry(hists, &al, NULL, NULL,
NULL, NULL, &sample, true);
if (he == NULL) {
addr_location__put(&al);
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index ca5a8f4d007e..6d23efaa52c8 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -604,6 +604,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
* and will not be used anymore.
*/
mem_info__zput(entry->mem_info);
+ zfree(&entry->haz_data);

block_info__zput(entry->block_info);

@@ -678,6 +679,7 @@ __hists__add_entry(struct hists *hists,
struct symbol *sym_parent,
struct branch_info *bi,
struct mem_info *mi,
+ struct perf_pipeline_haz_data *haz_data,
struct block_info *block_info,
struct perf_sample *sample,
bool sample_self,
@@ -712,6 +714,7 @@ __hists__add_entry(struct hists *hists,
.hists = hists,
.branch_info = bi,
.mem_info = mi,
+ .haz_data = haz_data,
.block_info = block_info,
.transaction = sample->transaction,
.raw_data = sample->raw_data,
@@ -732,10 +735,11 @@ struct hist_entry *hists__add_entry(struct hists *hists,
struct symbol *sym_parent,
struct branch_info *bi,
struct mem_info *mi,
+ struct perf_pipeline_haz_data *haz_data,
struct perf_sample *sample,
bool sample_self)
{
- return __hists__add_entry(hists, al, sym_parent, bi, mi, NULL,
+ return __hists__add_entry(hists, al, sym_parent, bi, mi, haz_data, NULL,
sample, sample_self, NULL);
}

@@ -745,10 +749,11 @@ struct hist_entry *hists__add_entry_ops(struct hists *hists,
struct symbol *sym_parent,
struct branch_info *bi,
struct mem_info *mi,
+ struct perf_pipeline_haz_data *haz_data,
struct perf_sample *sample,
bool sample_self)
{
- return __hists__add_entry(hists, al, sym_parent, bi, mi, NULL,
+ return __hists__add_entry(hists, al, sym_parent, bi, mi, haz_data, NULL,
sample, sample_self, ops);
}

@@ -823,7 +828,7 @@ iter_add_single_mem_entry(struct hist_entry_iter *iter, struct addr_location *al
sample->period = cost;

he = hists__add_entry(hists, al, iter->parent, NULL, mi,
- sample, true);
+ NULL, sample, true);
if (!he)
return -ENOMEM;

@@ -926,7 +931,7 @@ iter_add_next_branch_entry(struct hist_entry_iter *iter, struct addr_location *a
sample->weight = bi->flags.cycles ? bi->flags.cycles : 1;

he = hists__add_entry(hists, al, iter->parent, &bi[i], NULL,
- sample, true);
+ NULL, sample, true);
if (he == NULL)
return -ENOMEM;

@@ -963,7 +968,7 @@ iter_add_single_normal_entry(struct hist_entry_iter *iter, struct addr_location
struct hist_entry *he;

he = hists__add_entry(evsel__hists(evsel), al, iter->parent, NULL, NULL,
- sample, true);
+ NULL, sample, true);
if (he == NULL)
return -ENOMEM;

@@ -1024,7 +1029,7 @@ iter_add_single_cumulative_entry(struct hist_entry_iter *iter,
int err = 0;

he = hists__add_entry(hists, al, iter->parent, NULL, NULL,
- sample, true);
+ NULL, sample, true);
if (he == NULL)
return -ENOMEM;

@@ -1101,7 +1106,7 @@ iter_add_next_cumulative_entry(struct hist_entry_iter *iter,
}

he = hists__add_entry(evsel__hists(evsel), al, iter->parent, NULL, NULL,
- sample, false);
+ NULL, sample, false);
if (he == NULL)
return -ENOMEM;

@@ -1268,6 +1273,9 @@ void hist_entry__delete(struct hist_entry *he)
mem_info__zput(he->mem_info);
}

+ if (he->haz_data)
+ zfree(&he->haz_data);
+
if (he->block_info)
block_info__zput(he->block_info);

diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 0aa63aeb58ec..a4d12a503126 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -139,6 +139,7 @@ struct hist_entry *hists__add_entry(struct hists *hists,
struct symbol *parent,
struct branch_info *bi,
struct mem_info *mi,
+ struct perf_pipeline_haz_data *haz_data,
struct perf_sample *sample,
bool sample_self);

@@ -148,6 +149,7 @@ struct hist_entry *hists__add_entry_ops(struct hists *hists,
struct symbol *sym_parent,
struct branch_info *bi,
struct mem_info *mi,
+ struct perf_pipeline_haz_data *haz_data,
struct perf_sample *sample,
bool sample_self);

diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 6c862d62d052..55eb65fd593f 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -139,6 +139,7 @@ struct hist_entry {
long time;
struct hists *hists;
struct mem_info *mem_info;
+ struct perf_pipeline_haz_data *haz_data;
struct block_info *block_info;
void *raw_data;
u32 raw_size;
--
2.21.1

2020-03-02 05:25:38

by Ravi Bangoria

Subject: [RFC 07/11] perf hazard: Functions to convert generic hazard data to arch specific string

From: Madhavan Srinivasan <[email protected]>

The kernel provides pipeline hazard data in struct perf_pipeline_haz_data
format. Add code to convert this data into meaningful strings which can
be shown in perf report (followup patch).

Introduce a tools/perf/util/hazard directory which will contain arch
specific subdirectories. Under each arch specific directory, add the arch
specific logic that will be called by the generic code. This directory
structure is introduced to enable cross-arch reporting.
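The generic-to-arch dispatch can be sketched standalone. Function and table names below follow the patch; extracting them into a single compilable unit is an assumption made purely for illustration:

```c
/* Minimal sketch of the generic-to-arch string dispatch used by
 * perf_haz__itype_str(): generic code keys off the arch name reported
 * by perf_env__arch() and delegates to an arch-specific lookup table.
 * Table contents mirror the powerpc haz_inst_type[] in this patch. */
#include <assert.h>
#include <string.h>

typedef unsigned char u8;

static const char *haz_inst_type[] = {
	"-",
	"Load",
	"Store",
	"Branch",
	"Floating Point",
	"Fixed point",
	"Condition Register/System Call",
};

static const char *powerpc__haz__itype_str(u8 itype)
{
	return haz_inst_type[itype];
}

const char *perf_haz__itype_str(u8 itype, const char *arch)
{
	/* Compare only the prefix so "powerpc64" also matches. */
	if (!strncmp(arch, "powerpc", strlen("powerpc")))
		return powerpc__haz__itype_str(itype);

	return "-";	/* unknown arch: print a placeholder */
}
```

The same prefix-match-and-delegate shape repeats for the icache, hazard-stage/reason and stall-stage/reason helpers.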

Signed-off-by: Madhavan Srinivasan <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/util/Build | 2 +
tools/perf/util/hazard.c | 51 +++++++
tools/perf/util/hazard.h | 14 ++
tools/perf/util/hazard/Build | 1 +
.../util/hazard/powerpc/perf_pipeline_haz.h | 80 ++++++++++
.../perf/util/hazard/powerpc/powerpc_hazard.c | 142 ++++++++++++++++++
.../perf/util/hazard/powerpc/powerpc_hazard.h | 14 ++
7 files changed, 304 insertions(+)
create mode 100644 tools/perf/util/hazard.c
create mode 100644 tools/perf/util/hazard.h
create mode 100644 tools/perf/util/hazard/Build
create mode 100644 tools/perf/util/hazard/powerpc/perf_pipeline_haz.h
create mode 100644 tools/perf/util/hazard/powerpc/powerpc_hazard.c
create mode 100644 tools/perf/util/hazard/powerpc/powerpc_hazard.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 07da6c790b63..f5e1b7d79b6d 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -118,6 +118,7 @@ perf-y += parse-regs-options.o
perf-y += term.o
perf-y += help-unknown-cmd.o
perf-y += mem-events.o
+perf-y += hazard.o
perf-y += vsprintf.o
perf-y += units.o
perf-y += time-utils.o
@@ -153,6 +154,7 @@ perf-$(CONFIG_LIBUNWIND_AARCH64) += libunwind/arm64.o
perf-$(CONFIG_LIBBABELTRACE) += data-convert-bt.o

perf-y += scripting-engines/
+perf-y += hazard/

perf-$(CONFIG_ZLIB) += zlib.o
perf-$(CONFIG_LZMA) += lzma.o
diff --git a/tools/perf/util/hazard.c b/tools/perf/util/hazard.c
new file mode 100644
index 000000000000..db235b26b266
--- /dev/null
+++ b/tools/perf/util/hazard.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <string.h>
+#include "hazard/powerpc/powerpc_hazard.h"
+
+const char *perf_haz__itype_str(u8 itype, const char *arch)
+{
+ if (!strncmp(arch, "powerpc", strlen("powerpc")))
+ return powerpc__haz__itype_str(itype);
+
+ return "-";
+}
+
+const char *perf_haz__icache_str(u8 icache, const char *arch)
+{
+ if (!strncmp(arch, "powerpc", strlen("powerpc")))
+ return powerpc__haz__icache_str(icache);
+
+ return "-";
+}
+
+const char *perf_haz__hstage_str(u8 hstage, const char *arch)
+{
+ if (!strncmp(arch, "powerpc", strlen("powerpc")))
+ return powerpc__haz__hstage_str(hstage);
+
+ return "-";
+}
+
+const char *perf_haz__hreason_str(u8 hstage, u8 hreason, const char *arch)
+{
+ if (!strncmp(arch, "powerpc", strlen("powerpc")))
+ return powerpc__haz__hreason_str(hstage, hreason);
+
+ return "-";
+}
+
+const char *perf_haz__sstage_str(u8 sstage, const char *arch)
+{
+ if (!strncmp(arch, "powerpc", strlen("powerpc")))
+ return powerpc__haz__sstage_str(sstage);
+
+ return "-";
+}
+
+const char *perf_haz__sreason_str(u8 sstage, u8 sreason, const char *arch)
+{
+ if (!strncmp(arch, "powerpc", strlen("powerpc")))
+ return powerpc__haz__sreason_str(sstage, sreason);
+
+ return "-";
+}
diff --git a/tools/perf/util/hazard.h b/tools/perf/util/hazard.h
new file mode 100644
index 000000000000..eab4190e056a
--- /dev/null
+++ b/tools/perf/util/hazard.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __PERF_HAZARD_H
+#define __PERF_HAZARD_H
+
+#include "sort.h"
+
+const char *perf_haz__itype_str(u8 itype, const char *arch);
+const char *perf_haz__icache_str(u8 icache, const char *arch);
+const char *perf_haz__hstage_str(u8 hstage, const char *arch);
+const char *perf_haz__hreason_str(u8 hstage, u8 hreason, const char *arch);
+const char *perf_haz__sstage_str(u8 sstage, const char *arch);
+const char *perf_haz__sreason_str(u8 sstage, u8 sreason, const char *arch);
+
+#endif /* __PERF_HAZARD_H */
diff --git a/tools/perf/util/hazard/Build b/tools/perf/util/hazard/Build
new file mode 100644
index 000000000000..314c5e316383
--- /dev/null
+++ b/tools/perf/util/hazard/Build
@@ -0,0 +1 @@
+perf-y += powerpc/powerpc_hazard.o
diff --git a/tools/perf/util/hazard/powerpc/perf_pipeline_haz.h b/tools/perf/util/hazard/powerpc/perf_pipeline_haz.h
new file mode 100644
index 000000000000..de8857ec31dd
--- /dev/null
+++ b/tools/perf/util/hazard/powerpc/perf_pipeline_haz.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_POWERPC_PERF_PIPELINE_HAZ_H
+#define _UAPI_ASM_POWERPC_PERF_PIPELINE_HAZ_H
+
+enum perf_inst_type {
+ PERF_HAZ__ITYPE_LOAD = 1,
+ PERF_HAZ__ITYPE_STORE,
+ PERF_HAZ__ITYPE_BRANCH,
+ PERF_HAZ__ITYPE_FP,
+ PERF_HAZ__ITYPE_FX,
+ PERF_HAZ__ITYPE_CR_OR_SC,
+};
+
+enum perf_inst_cache {
+ PERF_HAZ__ICACHE_L1_HIT = 1,
+ PERF_HAZ__ICACHE_L2_HIT,
+ PERF_HAZ__ICACHE_L3_HIT,
+ PERF_HAZ__ICACHE_L3_MISS,
+};
+
+enum perf_pipeline_stage {
+ PERF_HAZ__PIPE_STAGE_IFU = 1,
+ PERF_HAZ__PIPE_STAGE_IDU,
+ PERF_HAZ__PIPE_STAGE_ISU,
+ PERF_HAZ__PIPE_STAGE_LSU,
+ PERF_HAZ__PIPE_STAGE_BRU,
+ PERF_HAZ__PIPE_STAGE_FXU,
+ PERF_HAZ__PIPE_STAGE_FPU,
+ PERF_HAZ__PIPE_STAGE_VSU,
+ PERF_HAZ__PIPE_STAGE_OTHER,
+};
+
+enum perf_haz_bru_reason {
+ PERF_HAZ__HAZ_BRU_MPRED_DIR = 1,
+ PERF_HAZ__HAZ_BRU_MPRED_TA,
+};
+
+enum perf_haz_isu_reason {
+ PERF_HAZ__HAZ_ISU_SRC = 1,
+ PERF_HAZ__HAZ_ISU_COL,
+};
+
+enum perf_haz_lsu_reason {
+ PERF_HAZ__HAZ_LSU_ERAT_MISS = 1,
+ PERF_HAZ__HAZ_LSU_LMQ,
+ PERF_HAZ__HAZ_LSU_LHS,
+ PERF_HAZ__HAZ_LSU_MPRED,
+ PERF_HAZ__HAZ_DERAT_MISS,
+ PERF_HAZ__HAZ_LSU_LMQ_DERAT_MISS,
+ PERF_HAZ__HAZ_LSU_LHS_DERAT_MISS,
+ PERF_HAZ__HAZ_LSU_MPRED_DERAT_MISS,
+};
+
+enum perf_stall_lsu_reason {
+ PERF_HAZ__STALL_LSU_DCACHE_MISS = 1,
+ PERF_HAZ__STALL_LSU_LD_FIN,
+ PERF_HAZ__STALL_LSU_ST_FWD,
+ PERF_HAZ__STALL_LSU_ST,
+};
+
+enum perf_stall_fxu_reason {
+ PERF_HAZ__STALL_FXU_MC = 1,
+ PERF_HAZ__STALL_FXU_FC,
+};
+
+enum perf_stall_bru_reason {
+ PERF_HAZ__STALL_BRU_FIN_MPRED = 1,
+ PERF_HAZ__STALL_BRU_FC,
+};
+
+enum perf_stall_vsu_reason {
+ PERF_HAZ__STALL_VSU_MC = 1,
+ PERF_HAZ__STALL_VSU_FC,
+};
+
+enum perf_stall_other_reason {
+ PERF_HAZ__STALL_NTC,
+};
+
+#endif /* _UAPI_ASM_POWERPC_PERF_PIPELINE_HAZ_H */
diff --git a/tools/perf/util/hazard/powerpc/powerpc_hazard.c b/tools/perf/util/hazard/powerpc/powerpc_hazard.c
new file mode 100644
index 000000000000..dcb95b769367
--- /dev/null
+++ b/tools/perf/util/hazard/powerpc/powerpc_hazard.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include "hazard.h"
+#include "powerpc_hazard.h"
+#include "perf_pipeline_haz.h"
+
+static const char *haz_inst_type[] = {
+ "-",
+ "Load",
+ "Store",
+ "Branch",
+ "Floating Point",
+ "Fixed point",
+ "Condition Register/System Call",
+};
+
+const char *powerpc__haz__itype_str(u8 itype)
+{
+ return haz_inst_type[itype];
+}
+
+static const char *haz_inst_cache[] = {
+ "-",
+ "L1 hit",
+ "L2 hit",
+ "L3 hit",
+ "L3 Miss"
+};
+
+const char *powerpc__haz__icache_str(u8 icache)
+{
+ return haz_inst_cache[icache];
+}
+
+static const char *pipeline_stages[] = {
+ "-",
+ "IFU",
+ "IDU",
+ "ISU",
+ "LSU",
+ "BRU",
+ "FXU",
+ "FPU",
+ "VSU",
+ "Other",
+};
+
+const char *powerpc__haz__hstage_str(u8 hstage)
+{
+ return pipeline_stages[hstage];
+}
+
+static const char *haz_bru_reason[] = {
+ "-",
+ "Direction",
+ "Target Address",
+};
+
+static const char *haz_isu_reason[] = {
+ "-",
+ "Source Unavailable",
+ "Resource Collision",
+};
+
+static const char *haz_lsu_reason[] = {
+ "-",
+ "ERAT Miss",
+ "LMQ Full",
+ "Load Hit Store",
+ "Mispredict",
+ "DERAT Miss",
+ "LMQ Full, DERAT Miss",
+ "Load Hit Store, DERAT Miss",
+ "Mispredict, DERAT Miss",
+};
+
+const char *powerpc__haz__hreason_str(u8 hstage, u8 hreason)
+{
+ switch (hstage) {
+ case PERF_HAZ__PIPE_STAGE_BRU:
+ return haz_bru_reason[hreason];
+ case PERF_HAZ__PIPE_STAGE_LSU:
+ return haz_lsu_reason[hreason];
+ case PERF_HAZ__PIPE_STAGE_ISU:
+ return haz_isu_reason[hreason];
+ default:
+ return "-";
+ }
+}
+
+const char *powerpc__haz__sstage_str(u8 sstage)
+{
+ return pipeline_stages[sstage];
+}
+
+static const char *stall_lsu_reason[] = {
+ "-",
+ "Dcache_miss",
+ "Load fin",
+ "Store fwd",
+ "Store",
+};
+
+static const char *stall_fxu_reason[] = {
+ "-",
+ "Multi cycle",
+ "Fixed cycle",
+};
+
+static const char *stall_bru_reason[] = {
+ "-",
+ "Finish Mispredict",
+ "Fixed cycle",
+};
+
+static const char *stall_vsu_reason[] = {
+ "-",
+ "Multi cycle",
+ "Fixed cycle",
+};
+
+static const char *stall_other_reason[] = {
+ "-",
+ "Marked fin before NTC",
+};
+
+const char *powerpc__haz__sreason_str(u8 sstage, u8 sreason)
+{
+ switch (sstage) {
+ case PERF_HAZ__PIPE_STAGE_LSU:
+ return stall_lsu_reason[sreason];
+ case PERF_HAZ__PIPE_STAGE_FXU:
+ return stall_fxu_reason[sreason];
+ case PERF_HAZ__PIPE_STAGE_BRU:
+ return stall_bru_reason[sreason];
+ case PERF_HAZ__PIPE_STAGE_VSU:
+ return stall_vsu_reason[sreason];
+ case PERF_HAZ__PIPE_STAGE_OTHER:
+ return stall_other_reason[sreason];
+ default:
+ return "-";
+ }
+}
diff --git a/tools/perf/util/hazard/powerpc/powerpc_hazard.h b/tools/perf/util/hazard/powerpc/powerpc_hazard.h
new file mode 100644
index 000000000000..f13f8f3cd10d
--- /dev/null
+++ b/tools/perf/util/hazard/powerpc/powerpc_hazard.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __PERF_POWERPC_HAZARD_H
+#define __PERF_POWERPC_HAZARD_H
+
+#include "hazard.h"
+
+const char *powerpc__haz__itype_str(u8 itype);
+const char *powerpc__haz__icache_str(u8 icache);
+const char *powerpc__haz__hstage_str(u8 hstage);
+const char *powerpc__haz__hreason_str(u8 hstage, u8 hreason);
+const char *powerpc__haz__sstage_str(u8 sstage);
+const char *powerpc__haz__sreason_str(u8 sstage, u8 sreason);
+
+#endif /* __PERF_POWERPC_HAZARD_H */
--
2.21.1

2020-03-02 05:25:53

by Ravi Bangoria

Subject: [RFC 05/11] perf tools: Enable record and script to record and show hazard data

From: Madhavan Srinivasan <[email protected]>

Introduce a new perf record option "--hazard" to capture cpu pipeline
hazard data. Also enable 'perf script -D' to dump its raw values.
Sample output:

$ ./perf record -e r4010e --hazard -- ls
$ ./perf script -D
... PERF_RECORD_SAMPLE(IP, 0x2): ...
hazard information:
Inst Type 0x1
Inst Cache 0x1
Hazard Stage 0x4
Hazard Reason 0x3
Stall Stage 0x4
Stall Reason 0x2
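The record the kernel appends to each sample is the fixed-size struct from the cover letter. A standalone sketch of decoding it out of the mmap() ring buffer and dumping it the way pipeline_hazard__printf() does (the raw bytes in the test are made up for illustration):

```c
/* Sketch: decode the 8-byte perf_pipeline_haz_data record appended to
 * a sample when PERF_SAMPLE_PIPELINE_HAZ is set, then dump it in the
 * 'perf script -D' style shown above. Layout is the one from the
 * cover letter (six u8 fields plus u16 padding). */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct perf_pipeline_haz_data {
	uint8_t itype;		/* Instruction/Opcode type: Load, Store, ... */
	uint8_t icache;		/* Instruction cache source */
	uint8_t hazard_stage;	/* Pipeline stage that saw the hazard */
	uint8_t hazard_reason;	/* Hazard reason */
	uint8_t stall_stage;	/* Pipeline stage that saw the stall */
	uint8_t stall_reason;	/* Stall reason */
	uint16_t pad;
};

/* Copy the record out of the ring buffer (it may be unaligned there)
 * and print each field. */
void pipeline_hazard__dump(const void *raw, struct perf_pipeline_haz_data *haz)
{
	memcpy(haz, raw, sizeof(*haz));
	printf("... hazard information:\n");
	printf(".... Inst Type     0x%x\n", haz->itype);
	printf(".... Inst Cache    0x%x\n", haz->icache);
	printf(".... Hazard Stage  0x%x\n", haz->hazard_stage);
	printf(".... Hazard Reason 0x%x\n", haz->hazard_reason);
	printf(".... Stall Stage   0x%x\n", haz->stall_stage);
	printf(".... Stall Reason  0x%x\n", haz->stall_reason);
}
```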

Signed-off-by: Madhavan Srinivasan <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 3 +++
tools/perf/builtin-record.c | 1 +
tools/perf/util/event.h | 1 +
tools/perf/util/evsel.c | 10 ++++++++++
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/record.h | 1 +
tools/perf/util/session.c | 16 ++++++++++++++++
7 files changed, 33 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index b23a4012a606..e7bd1b6938ce 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -283,6 +283,9 @@ OPTIONS
--phys-data::
Record the sample physical addresses.

+--hazard::
+ Record processor pipeline hazard and stall information.
+
-T::
--timestamp::
Record the sample timestamps. Use it with 'perf report -D' to see the
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4c301466101b..6bd32d7bc4e9 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -2301,6 +2301,7 @@ static struct option __record_options[] = {
OPT_BOOLEAN('s', "stat", &record.opts.inherit_stat,
"per thread counts"),
OPT_BOOLEAN('d', "data", &record.opts.sample_address, "Record the sample addresses"),
+ OPT_BOOLEAN(0, "hazard", &record.opts.hazard, "Record processor pipeline hazard and stall information"),
OPT_BOOLEAN(0, "phys-data", &record.opts.sample_phys_addr,
"Record the sample physical addresses"),
OPT_BOOLEAN(0, "sample-cpu", &record.opts.sample_cpu, "Record the sample cpu"),
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 85223159737c..ff0f03253a95 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -148,6 +148,7 @@ struct perf_sample {
struct stack_dump user_stack;
struct sample_read read;
struct aux_sample aux_sample;
+ struct perf_pipeline_haz_data *pipeline_haz;
};

#define PERF_MEM_DATA_SRC_NONE \
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index c8dc4450884c..e37ed7929c2c 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1080,6 +1080,9 @@ void perf_evsel__config(struct evsel *evsel, struct record_opts *opts,
if (opts->sample_phys_addr)
perf_evsel__set_sample_bit(evsel, PHYS_ADDR);

+ if (opts->hazard)
+ perf_evsel__set_sample_bit(evsel, PIPELINE_HAZ);
+
if (opts->no_buffering) {
attr->watermark = 0;
attr->wakeup_events = 1;
@@ -2265,6 +2268,13 @@ int perf_evsel__parse_sample(struct evsel *evsel, union perf_event *event,
array = (void *)array + sz;
}

+ if (type & PERF_SAMPLE_PIPELINE_HAZ) {
+ sz = sizeof(struct perf_pipeline_haz_data);
+ OVERFLOW_CHECK(array, sz, max_size);
+ data->pipeline_haz = (struct perf_pipeline_haz_data *)array;
+ array = (void *)array + sz;
+ }
+
return 0;
}

diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 651203126c71..d97e755c886b 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -35,6 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER),
bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX),
+ bit_name(PIPELINE_HAZ),
{ .name = NULL, }
};
#undef bit_name
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 5421fd2ad383..f1678a0bc8ce 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -67,6 +67,7 @@ struct record_opts {
int affinity;
int mmap_flush;
unsigned int comp_level;
+ bool hazard;
};

extern const char * const *record_usage;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index d0d7d25b23e3..834ca7df2349 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1153,6 +1153,19 @@ static void stack_user__printf(struct stack_dump *dump)
dump->size, dump->offset);
}

+static void pipeline_hazard__printf(struct perf_sample *sample)
+{
+ struct perf_pipeline_haz_data *haz = sample->pipeline_haz;
+
+ printf("... hazard information:\n");
+ printf(".... Inst Type 0x%" PRIx32 "\n", haz->itype);
+ printf(".... Inst Cache 0x%" PRIx32 "\n", haz->icache);
+ printf(".... Hazard Stage 0x%" PRIx32 "\n", haz->hazard_stage);
+ printf(".... Hazard Reason 0x%" PRIx32 "\n", haz->hazard_reason);
+ printf(".... Stall Stage 0x%" PRIx32 "\n", haz->stall_stage);
+ printf(".... Stall Reason 0x%" PRIx32 "\n", haz->stall_reason);
+}
+
static void perf_evlist__print_tstamp(struct evlist *evlist,
union perf_event *event,
struct perf_sample *sample)
@@ -1251,6 +1264,9 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
if (sample_type & PERF_SAMPLE_STACK_USER)
stack_user__printf(&sample->user_stack);

+ if (sample_type & PERF_SAMPLE_PIPELINE_HAZ)
+ pipeline_hazard__printf(sample);
+
if (sample_type & PERF_SAMPLE_WEIGHT)
printf("... weight: %" PRIu64 "\n", sample->weight);

--
2.21.1

2020-03-02 05:25:54

by Ravi Bangoria

Subject: [RFC 10/11] perf annotate: Preparation for hazard

Introduce 'struct hazard_hist', which will contain hazard specific
information for annotate. Add an array of lists of 'struct hazard_hist'
to 'struct annotated_source', where the array length equals the symbol
size and each list member contains hazard info from an associated perf
sample. This information is prepared while parsing samples in perf
report. This is just a preparation step for annotate; a followup patch
makes the actual annotate ui changes.
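A minimal sketch of that array-of-lists layout, with kernel-style list.h replaced by a plain singly linked list. The helper names here are hypothetical stand-ins for symbol__alloc_hist_hazard() and symbol__capture_haz_info():

```c
/* One list per byte of the symbol; each list collects the hazard
 * strings of the samples that hit that offset. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct hazard_hist {
	struct hazard_hist *next;	/* stand-in for struct list_head */
	const char *haz_stage;
	const char *haz_reason;
};

/* Allocate one (empty) list head per byte of the symbol, as
 * symbol__alloc_hist_hazard() does with calloc() + INIT_LIST_HEAD(). */
struct hazard_hist **haz_hist_alloc(size_t symbol_size)
{
	return calloc(symbol_size, sizeof(struct hazard_hist *));
}

/* Record one sample's hazard info at its offset within the symbol,
 * mirroring symbol__capture_haz_info(): range-check the address,
 * allocate an entry, prepend it to the per-offset list. */
int haz_hist_add(struct hazard_hist **hh, size_t symbol_size,
		 size_t offset, const char *stage, const char *reason)
{
	struct hazard_hist *e;

	if (offset >= symbol_size)
		return -1;	/* sample outside the symbol (-ERANGE) */

	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;	/* -ENOMEM */

	e->haz_stage = stage;
	e->haz_reason = reason;
	e->next = hh[offset];	/* list_add(): prepend */
	hh[offset] = e;
	return 0;
}
```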

Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/builtin-report.c | 1 +
tools/perf/util/annotate.c | 75 +++++++++++++++++++++++++++++++++++++
tools/perf/util/annotate.h | 14 +++++++
tools/perf/util/hist.c | 13 +++++++
tools/perf/util/hist.h | 4 ++
tools/perf/util/machine.c | 6 +++
tools/perf/util/machine.h | 3 ++
7 files changed, 116 insertions(+)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index a47542a12da1..ff950ff8dd51 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -301,6 +301,7 @@ static int process_sample_event(struct perf_tool *tool,
hist__account_cycles(sample->branch_stack, &al, sample,
rep->nonany_branch_mode,
&rep->total_cycles);
+ hist__capture_haz_info(&al, sample, evsel);
}

ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 8aef60a6ffea..766934b0f36d 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -36,6 +36,7 @@
#include "string2.h"
#include "util/event.h"
#include "arch/common.h"
+#include "hazard.h"
#include <regex.h>
#include <pthread.h>
#include <linux/bitops.h>
@@ -800,6 +801,21 @@ static int symbol__alloc_hist_cycles(struct symbol *sym)
return 0;
}

+static int symbol__alloc_hist_hazard(struct symbol *sym)
+{
+ struct annotation *notes = symbol__annotation(sym);
+ const size_t size = symbol__size(sym);
+ size_t i;
+
+ notes->src->haz_hist = calloc(size, sizeof(struct hazard_hist));
+ if (notes->src->haz_hist == NULL)
+ return -1;
+
+ for (i = 0; i < size; i++)
+ INIT_LIST_HEAD(&notes->src->haz_hist[i].list);
+ return 0;
+}
+
void symbol__annotate_zero_histograms(struct symbol *sym)
{
struct annotation *notes = symbol__annotation(sym);
@@ -920,6 +936,25 @@ static struct cyc_hist *symbol__cycles_hist(struct symbol *sym)
return notes->src->cycles_hist;
}

+static struct hazard_hist *symbol__hazard_hist(struct symbol *sym)
+{
+ struct annotation *notes = symbol__annotation(sym);
+
+ if (notes->src == NULL) {
+ notes->src = annotated_source__new();
+ if (notes->src == NULL)
+ return NULL;
+ goto alloc_haz_hist;
+ }
+
+ if (!notes->src->haz_hist) {
+alloc_haz_hist:
+ symbol__alloc_hist_hazard(sym);
+ }
+
+ return notes->src->haz_hist;
+}
+
struct annotated_source *symbol__hists(struct symbol *sym, int nr_hists)
{
struct annotation *notes = symbol__annotation(sym);
@@ -1014,6 +1049,46 @@ int addr_map_symbol__account_cycles(struct addr_map_symbol *ams,
return err;
}

+int symbol__capture_haz_info(struct addr_map_symbol *ams,
+ struct perf_sample *sample,
+ struct evsel *evsel)
+{
+ struct hazard_hist *hh, *tmp;
+ u64 offset;
+ const char *arch = perf_env__arch(perf_evsel__env(evsel));
+
+ if (ams->ms.sym == NULL)
+ return 0;
+
+ hh = symbol__hazard_hist(ams->ms.sym);
+ if (!hh)
+ return -ENOMEM;
+
+ if (ams->al_addr < ams->ms.sym->start || ams->al_addr >= ams->ms.sym->end)
+ return -ERANGE;
+
+ offset = ams->al_addr - ams->ms.sym->start;
+
+ tmp = zalloc(sizeof(*tmp));
+ if (!tmp)
+ return -ENOMEM;
+
+ tmp->icache = perf_haz__icache_str(sample->pipeline_haz->icache, arch);
+ tmp->haz_stage = perf_haz__hstage_str(sample->pipeline_haz->hazard_stage,
+ arch);
+ tmp->haz_reason = perf_haz__hreason_str(sample->pipeline_haz->hazard_stage,
+ sample->pipeline_haz->hazard_reason,
+ arch);
+ tmp->stall_stage = perf_haz__sstage_str(sample->pipeline_haz->stall_stage,
+ arch);
+ tmp->stall_reason = perf_haz__sreason_str(sample->pipeline_haz->stall_stage,
+ sample->pipeline_haz->stall_reason,
+ arch);
+
+ list_add(&tmp->list, &hh[offset].list);
+ return 0;
+}
+
static unsigned annotation__count_insn(struct annotation *notes, u64 start, u64 end)
{
unsigned n_insn = 0;
diff --git a/tools/perf/util/annotate.h b/tools/perf/util/annotate.h
index 89839713d242..a3803f89b8fc 100644
--- a/tools/perf/util/annotate.h
+++ b/tools/perf/util/annotate.h
@@ -249,6 +249,15 @@ struct cyc_hist {
u16 reset;
};

+struct hazard_hist {
+ struct list_head list;
+ const char *icache;
+ const char *haz_stage;
+ const char *haz_reason;
+ const char *stall_stage;
+ const char *stall_reason;
+};
+
/** struct annotated_source - symbols with hits have this attached as in sannotation
*
* @histograms: Array of addr hit histograms per event being monitored
@@ -271,6 +280,7 @@ struct annotated_source {
int nr_histograms;
size_t sizeof_sym_hist;
struct cyc_hist *cycles_hist;
+ struct hazard_hist *haz_hist; /* Array of list. Array length = symbol size */
struct sym_hist *histograms;
};

@@ -343,6 +353,10 @@ int addr_map_symbol__account_cycles(struct addr_map_symbol *ams,
struct addr_map_symbol *start,
unsigned cycles);

+int symbol__capture_haz_info(struct addr_map_symbol *ams,
+ struct perf_sample *sample,
+ struct evsel *evsel);
+
int hist_entry__inc_addr_samples(struct hist_entry *he, struct perf_sample *sample,
struct evsel *evsel, u64 addr);

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index d690d08b0f64..3cbfebb80f68 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -2702,6 +2702,19 @@ void hist__account_cycles(struct branch_stack *bs, struct addr_location *al,
}
}

+void hist__capture_haz_info(struct addr_location *al,
+ struct perf_sample *sample,
+ struct evsel *evsel)
+{
+ struct addr_map_symbol ams;
+
+ if (!sample->pipeline_haz)
+ return;
+
+ sample__resolve_haz(al, &ams, sample);
+ symbol__capture_haz_info(&ams, sample, evsel);
+}
+
size_t perf_evlist__fprintf_nr_events(struct evlist *evlist, FILE *fp)
{
struct evsel *pos;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 063d65985a11..086c1dfde21f 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -554,6 +554,10 @@ void hist__account_cycles(struct branch_stack *bs, struct addr_location *al,
struct perf_sample *sample, bool nonany_branch_mode,
u64 *total_cycles);

+void hist__capture_haz_info(struct addr_location *al,
+ struct perf_sample *sample,
+ struct evsel *evsel);
+
struct option;
int parse_filter_percentage(const struct option *opt, const char *arg, int unset);
int perf_hist_config(const char *var, const char *value);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index fb5c2cd44d30..e575488a1390 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2094,6 +2094,12 @@ struct branch_info *sample__resolve_bstack(struct perf_sample *sample,
return bi;
}

+void sample__resolve_haz(struct addr_location *al, struct addr_map_symbol *ams,
+ struct perf_sample *sample)
+{
+ ip__resolve_ams(al->thread, ams, sample->ip);
+}
+
static void save_iterations(struct iterations *iter,
struct branch_entry *be, int nr)
{
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index be0a930eca89..e9a298b5430c 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -173,6 +173,9 @@ struct branch_info *sample__resolve_bstack(struct perf_sample *sample,
struct mem_info *sample__resolve_mem(struct perf_sample *sample,
struct addr_location *al);

+void sample__resolve_haz(struct addr_location *al, struct addr_map_symbol *ams,
+ struct perf_sample *sample);
+
struct callchain_cursor;

int thread__resolve_callchain(struct thread *thread,
--
2.21.1

2020-03-02 05:26:02

by Ravi Bangoria

Subject: [RFC 09/11] perf annotate: Introduce type for annotation_line

struct annotation_line can contain either an assembly instruction or
a source code line. To distinguish between them we currently use the
offset field: if offset is -1 it's a source line, otherwise it's
assembly. This is a bit cryptic on a first read of the code. Introduce
a new field 'type' that denotes the type of data the annotation_line
object contains.
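The change can be sketched as follows. Only AL_TYPE_ASM and AL_TYPE_SRC appear in the hunks below; the concrete enum values and the helper function are assumptions made for illustration:

```c
/* Replace the overloaded 'offset == -1 means source line' convention
 * with an explicit discriminator on each annotation line. */
#include <assert.h>
#include <stdint.h>

enum annotation_line_type {
	AL_TYPE_NONE = 0,
	AL_TYPE_ASM,	/* disassembled instruction, offset >= 0 */
	AL_TYPE_SRC,	/* interleaved source line, offset == -1 */
};

struct annotation_line {
	int64_t offset;
	uint8_t type;
};

/* What symbol__parse_objdump_line() does when filling in a line:
 * derive the tag once, so every later check reads al->type instead
 * of re-interpreting the offset. */
void annotation_line__set_type(struct annotation_line *al, int64_t offset)
{
	al->offset = offset;
	al->type = (offset != -1) ? AL_TYPE_ASM : AL_TYPE_SRC;
}
```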

Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/ui/browsers/annotate.c | 4 ++--
tools/perf/ui/gtk/annotate.c | 6 +++---
tools/perf/util/annotate.c | 27 ++++++++++++++++-----------
tools/perf/util/annotate.h | 8 +++++++-
4 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/tools/perf/ui/browsers/annotate.c b/tools/perf/ui/browsers/annotate.c
index 9023267e5643..2e4db8216b3b 100644
--- a/tools/perf/ui/browsers/annotate.c
+++ b/tools/perf/ui/browsers/annotate.c
@@ -317,7 +317,7 @@ static void annotate_browser__calc_percent(struct annotate_browser *browser,
double max_percent = 0.0;
int i;

- if (pos->al.offset == -1) {
+ if (pos->al.type != AL_TYPE_ASM) {
RB_CLEAR_NODE(&pos->al.rb_node);
continue;
}
@@ -816,7 +816,7 @@ static int annotate_browser__run(struct annotate_browser *browser,

if (browser->selection == NULL)
ui_helpline__puts("Huh? No selection. Report to [email protected]");
- else if (browser->selection->offset == -1)
+ else if (browser->selection->type != AL_TYPE_ASM)
ui_helpline__puts("Actions are only available for assembly lines.");
else if (!dl->ins.ops)
goto show_sup_ins;
diff --git a/tools/perf/ui/gtk/annotate.c b/tools/perf/ui/gtk/annotate.c
index 35f9641bf670..71c792a0b17d 100644
--- a/tools/perf/ui/gtk/annotate.c
+++ b/tools/perf/ui/gtk/annotate.c
@@ -35,7 +35,7 @@ static int perf_gtk__get_percent(char *buf, size_t size, struct symbol *sym,

strcpy(buf, "");

- if (dl->al.offset == (s64) -1)
+ if (dl->al.type != AL_TYPE_ASM)
return 0;

symhist = annotation__histogram(symbol__annotation(sym), evidx);
@@ -61,7 +61,7 @@ static int perf_gtk__get_offset(char *buf, size_t size, struct map_symbol *ms,

strcpy(buf, "");

- if (dl->al.offset == (s64) -1)
+ if (dl->al.type != AL_TYPE_ASM)
return 0;

return scnprintf(buf, size, "%"PRIx64, start + dl->al.offset);
@@ -78,7 +78,7 @@ static int perf_gtk__get_line(char *buf, size_t size, struct disasm_line *dl)
if (!line)
return 0;

- if (dl->al.offset != (s64) -1)
+ if (dl->al.type == AL_TYPE_ASM)
markup = NULL;

if (markup)
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 4e2706274d85..8aef60a6ffea 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1148,6 +1148,7 @@ struct annotate_args {
struct evsel *evsel;
struct annotation_options *options;
s64 offset;
+ u8 type;
char *line;
int line_nr;
};
@@ -1156,6 +1157,7 @@ static void annotation_line__init(struct annotation_line *al,
struct annotate_args *args,
int nr)
{
+ al->type = args->type;
al->offset = args->offset;
al->line = strdup(args->line);
al->line_nr = args->line_nr;
@@ -1202,7 +1204,7 @@ static struct disasm_line *disasm_line__new(struct annotate_args *args)
if (dl->al.line == NULL)
goto out_delete;

- if (args->offset != -1) {
+ if (dl->al.type == AL_TYPE_ASM) {
if (disasm_line__parse(dl->al.line, &dl->ins.name, &dl->ops.raw) < 0)
goto out_free_line;

@@ -1246,7 +1248,7 @@ struct annotation_line *
annotation_line__next(struct annotation_line *pos, struct list_head *head)
{
list_for_each_entry_continue(pos, head, node)
- if (pos->offset >= 0)
+ if (pos->type == AL_TYPE_ASM)
return pos;

return NULL;
@@ -1357,7 +1359,7 @@ annotation_line__print(struct annotation_line *al, struct symbol *sym, u64 start
static const char *prev_line;
static const char *prev_color;

- if (al->offset != -1) {
+ if (al->type == AL_TYPE_ASM) {
double max_percent = 0.0;
int i, nr_percent = 1;
const char *color;
@@ -1500,6 +1502,7 @@ static int symbol__parse_objdump_line(struct symbol *sym,
}

args->offset = offset;
+ args->type = (offset != -1) ? AL_TYPE_ASM : AL_TYPE_SRC;
args->line = parsed_line;
args->line_nr = *line_nr;
args->ms.sym = sym;
@@ -1783,6 +1786,7 @@ static int symbol__disassemble_bpf(struct symbol *sym,
fflush(s);

if (!opts->hide_src_code && srcline) {
+ args->type = AL_TYPE_SRC;
args->offset = -1;
args->line = strdup(srcline);
args->line_nr = 0;
@@ -1794,6 +1798,7 @@ static int symbol__disassemble_bpf(struct symbol *sym,
}
}

+ args->type = AL_TYPE_ASM;
args->offset = pc;
args->line = buf + prev_buf_size;
args->line_nr = 0;
@@ -2099,7 +2104,7 @@ static void annotation__calc_percent(struct annotation *notes,
s64 end;
int i = 0;

- if (al->offset == -1)
+ if (al->type != AL_TYPE_ASM)
continue;

next = annotation_line__next(al, &notes->src->source);
@@ -2309,7 +2314,7 @@ static int annotated_source__addr_fmt_width(struct list_head *lines, u64 start)
struct annotation_line *line;

list_for_each_entry_reverse(line, lines, node) {
- if (line->offset != -1)
+ if (line->type == AL_TYPE_ASM)
return scnprintf(bf, sizeof(bf), "%" PRIx64, start + line->offset);
}

@@ -2549,7 +2554,7 @@ static size_t disasm_line__fprintf(struct disasm_line *dl, FILE *fp)
{
size_t printed;

- if (dl->al.offset == -1)
+ if (dl->al.type == AL_TYPE_SRC)
return fprintf(fp, "%s\n", dl->al.line);

printed = fprintf(fp, "%#" PRIx64 " %s", dl->al.offset, dl->ins.name);
@@ -2628,7 +2633,7 @@ static void annotation__set_offsets(struct annotation *notes, s64 size)
if (notes->max_line_len < line_len)
notes->max_line_len = line_len;
al->idx = notes->nr_entries++;
- if (al->offset != -1) {
+ if (al->type == AL_TYPE_ASM) {
al->idx_asm = notes->nr_asm_entries++;
/*
* FIXME: short term bandaid to cope with assembly
@@ -2659,7 +2664,7 @@ static int annotation__max_ins_name(struct annotation *notes)
struct annotation_line *al;

list_for_each_entry(al, &notes->src->source, node) {
- if (al->offset == -1)
+ if (al->type != AL_TYPE_ASM)
continue;

len = strlen(disasm_line(al)->ins.name);
@@ -2875,7 +2880,7 @@ static void __annotation_line__write(struct annotation_line *al, struct annotati
char bf[256];
int printed;

- if (first_line && (al->offset == -1 || percent_max == 0.0)) {
+ if (first_line && (al->type != AL_TYPE_ASM || percent_max == 0.0)) {
if (notes->have_cycles) {
if (al->ipc == 0.0 && al->cycles == 0)
show_title = true;
@@ -2883,7 +2888,7 @@ static void __annotation_line__write(struct annotation_line *al, struct annotati
show_title = true;
}

- if (al->offset != -1 && percent_max != 0.0) {
+ if (al->type == AL_TYPE_ASM && percent_max != 0.0) {
int i;

for (i = 0; i < notes->nr_events; i++) {
@@ -2964,7 +2969,7 @@ static void __annotation_line__write(struct annotation_line *al, struct annotati

if (!*al->line)
obj__printf(obj, "%-*s", width - pcnt_width - cycles_width, " ");
- else if (al->offset == -1) {
+ else if (al->type == AL_TYPE_SRC) {
if (al->line_nr && notes->options->show_linenr)
printed = scnprintf(bf, sizeof(bf), "%-*d ", notes->widths.addr + 1, al->line_nr);
else
diff --git a/tools/perf/util/annotate.h b/tools/perf/util/annotate.h
index d9b5bb105056..89839713d242 100644
--- a/tools/perf/util/annotate.h
+++ b/tools/perf/util/annotate.h
@@ -128,9 +128,15 @@ struct annotation_data {
struct sym_hist_entry he;
};

+enum annotation_line_type {
+ AL_TYPE_ASM = 0,
+ AL_TYPE_SRC,
+};
+
struct annotation_line {
struct list_head node;
struct rb_node rb_node;
+ u8 type;
s64 offset;
char *line;
int line_nr;
@@ -310,7 +316,7 @@ static inline int annotation__pcnt_width(struct annotation *notes)

static inline bool annotation_line__filter(struct annotation_line *al, struct annotation *notes)
{
- return notes->options->hide_src_code && al->offset == -1;
+ return notes->options->hide_src_code && al->type == AL_TYPE_SRC;
}

void annotation__update_column_widths(struct annotation *notes);
--
2.21.1

2020-03-02 05:26:30

by Ravi Bangoria

Subject: [RFC 08/11] perf report: Enable hazard mode

From: Madhavan Srinivasan <[email protected]>

Introduce a --hazard option for perf report to display hazard data.
Hazard mode columns are Instruction Type, Hazard Stage, Hazard
Reason, Stall Stage, Stall Reason and ICache access. The default
sort order is sym, dso, itype, hstage, hreason, sstage, sreason,
icache.

Sample output on an IBM PowerPC machine:

Overhead Symbol Shared Instruction Type Hazard Stage Hazard Reason Stall Stage Stall Reason ICache access
36.58% [.] thread_run ebizzy Load LSU Mispredict LSU Load fin L1 hit
9.46% [.] thread_run ebizzy Load LSU Mispredict LSU Dcache_miss L1 hit
1.76% [.] thread_run ebizzy Fixed point - - - - L1 hit
1.31% [.] thread_run ebizzy Load LSU ERAT Miss LSU Load fin L1 hit
1.27% [.] thread_run ebizzy Load LSU Mispredict - - L1 hit
1.16% [.] thread_run ebizzy Fixed point - - FXU Fixed cycle L1 hit
0.50% [.] thread_run ebizzy Fixed point ISU Source Unavailable FXU Fixed cycle L1 hit
0.30% [.] thread_run ebizzy Load LSU LMQ Full, DERAT Miss LSU Load fin L1 hit
0.24% [.] thread_run ebizzy Load LSU ERAT Miss - - L1 hit
0.08% [.] thread_run ebizzy - - - BRU Fixed cycle L1 hit
0.05% [.] thread_run ebizzy Branch - - BRU Fixed cycle L1 hit
0.04% [.] thread_run ebizzy Fixed point ISU Source Unavailable - - L1 hit
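The six sort comparators added in sort.c below all share one shape: when a hist entry carries no hazard data, the compared field degrades to an arch-neutral "not available" code, so such entries group together rather than comparing against garbage. A standalone sketch of that pattern (the struct layout and the NA constant are trimmed stand-ins for the patch's types, not the real perf definitions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ITYPE_NA 0xff	/* stand-in for PERF_HAZ__ITYPE_NA */

/* trimmed stand-ins for perf_pipeline_haz_data / hist_entry */
struct haz_data { uint8_t itype; };
struct hist_entry { struct haz_data *haz_data; };

/* mirrors sort__itype_cmp(): missing hazard data falls back to NA */
static int64_t itype_cmp(struct hist_entry *l, struct hist_entry *r)
{
	uint8_t il = l->haz_data ? l->haz_data->itype : ITYPE_NA;
	uint8_t ir = r->haz_data ? r->haz_data->itype : ITYPE_NA;

	return (int64_t)(ir - il);
}
```

Because NA is the highest code, all entries without hazard samples sort after every real instruction type, matching the "-" rows at the bottom of the report above.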

Signed-off-by: Madhavan Srinivasan <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/builtin-report.c | 28 +++++
tools/perf/util/hist.c | 77 ++++++++++++
tools/perf/util/hist.h | 7 ++
tools/perf/util/sort.c | 230 ++++++++++++++++++++++++++++++++++++
tools/perf/util/sort.h | 22 ++++
5 files changed, 364 insertions(+)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 72a12b69f120..a47542a12da1 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -77,6 +77,7 @@ struct report {
bool show_threads;
bool inverted_callchain;
bool mem_mode;
+ bool hazard;
bool stats_mode;
bool tasks_mode;
bool mmaps_mode;
@@ -285,6 +286,8 @@ static int process_sample_event(struct perf_tool *tool,
iter.ops = &hist_iter_branch;
} else if (rep->mem_mode) {
iter.ops = &hist_iter_mem;
+ } else if (rep->hazard) {
+ iter.ops = &hist_iter_haz;
} else if (symbol_conf.cumulate_callchain) {
iter.ops = &hist_iter_cumulative;
} else {
@@ -396,6 +399,14 @@ static int report__setup_sample_type(struct report *rep)
}
}

+ if (sort__mode == SORT_MODE__HAZARD) {
+ if (!is_pipe && !(sample_type & PERF_SAMPLE_PIPELINE_HAZ)) {
+ ui__error("Selected --hazard but no hazard data. "
+ "Did you call perf record without --hazard?\n");
+ return -1;
+ }
+ }
+
if (symbol_conf.use_callchain || symbol_conf.cumulate_callchain) {
if ((sample_type & PERF_SAMPLE_REGS_USER) &&
(sample_type & PERF_SAMPLE_STACK_USER)) {
@@ -484,6 +495,9 @@ static size_t hists__fprintf_nr_sample_events(struct hists *hists, struct report
if (rep->mem_mode) {
ret += fprintf(fp, "\n# Total weight : %" PRIu64, nr_events);
ret += fprintf(fp, "\n# Sort order : %s", sort_order ? : default_mem_sort_order);
+ } else if (rep->hazard) {
+ ret += fprintf(fp, "\n# Event count (approx.): %" PRIu64, nr_events);
+ ret += fprintf(fp, "\n# Sort order : %s", sort_order ? : default_haz_sort_order);
} else
ret += fprintf(fp, "\n# Event count (approx.): %" PRIu64, nr_events);

@@ -1228,6 +1242,7 @@ int cmd_report(int argc, const char **argv)
OPT_BOOLEAN(0, "demangle-kernel", &symbol_conf.demangle_kernel,
"Enable kernel symbol demangling"),
OPT_BOOLEAN(0, "mem-mode", &report.mem_mode, "mem access profile"),
+ OPT_BOOLEAN(0, "hazard", &report.hazard, "Processor pipeline hazard and stalls"),
OPT_INTEGER(0, "samples", &symbol_conf.res_sample,
"Number of samples to save per histogram entry for individual browsing"),
OPT_CALLBACK(0, "percent-limit", &report, "percent",
@@ -1394,6 +1409,19 @@ int cmd_report(int argc, const char **argv)
symbol_conf.cumulate_callchain = false;
}

+ if (report.hazard) {
+ if (sort__mode == SORT_MODE__BRANCH) {
+ pr_err("branch and hazard incompatible\n");
+ goto error;
+ }
+ if (sort__mode == SORT_MODE__MEMORY) {
+ pr_err("mem-mode and hazard incompatible\n");
+ goto error;
+ }
+ sort__mode = SORT_MODE__HAZARD;
+ symbol_conf.cumulate_callchain = false;
+ }
+
if (symbol_conf.report_hierarchy) {
/* disable incompatible options */
symbol_conf.cumulate_callchain = false;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 6d23efaa52c8..d690d08b0f64 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -203,6 +203,12 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
hists__new_col_len(hists, HISTC_MEM_LVL, 21 + 3);
hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
+ hists__new_col_len(hists, HISTC_HAZ_ITYPE, 31);
+ hists__new_col_len(hists, HISTC_HAZ_ICACHE, 14);
+ hists__new_col_len(hists, HISTC_HAZ_HSTAGE, 13);
+ hists__new_col_len(hists, HISTC_HAZ_HREASON, 27);
+ hists__new_col_len(hists, HISTC_HAZ_SSTAGE, 12);
+ hists__new_col_len(hists, HISTC_HAZ_SREASON, 22);
if (symbol_conf.nanosecs)
hists__new_col_len(hists, HISTC_TIME, 16);
else
@@ -864,6 +870,69 @@ iter_finish_mem_entry(struct hist_entry_iter *iter,
return err;
}

+static int iter_prepare_haz_entry(struct hist_entry_iter *iter,
+ struct addr_location *al __maybe_unused)
+{
+ struct perf_sample *sample = iter->sample;
+ struct perf_pipeline_haz_data *haz_data;
+
+ if (!sample->pipeline_haz) {
+ iter->priv = NULL;
+ return 0;
+ }
+
+ haz_data = zalloc(sizeof(*haz_data));
+ if (!haz_data)
+ return -ENOMEM;
+
+ memcpy(haz_data, sample->pipeline_haz, sizeof(*haz_data));
+ iter->priv = haz_data;
+ return 0;
+}
+
+static int iter_add_single_haz_entry(struct hist_entry_iter *iter,
+ struct addr_location *al __maybe_unused)
+{
+ struct perf_pipeline_haz_data *haz_data = iter->priv;
+ struct hists *hists = evsel__hists(iter->evsel);
+ struct perf_sample *sample = iter->sample;
+ struct hist_entry *he;
+
+ he = hists__add_entry(hists, al, iter->parent, NULL, NULL, haz_data,
+ sample, true);
+ if (!he)
+ return -ENOMEM;
+
+ iter->he = he;
+ return 0;
+}
+
+static int iter_finish_haz_entry(struct hist_entry_iter *iter,
+ struct addr_location *al __maybe_unused)
+{
+ struct evsel *evsel = iter->evsel;
+ struct hists *hists = evsel__hists(evsel);
+ struct hist_entry *he = iter->he;
+ int err = -EINVAL;
+
+ if (he == NULL)
+ goto out;
+
+ hists__inc_nr_samples(hists, he->filtered);
+
+ err = hist_entry__append_callchain(he, iter->sample);
+
+out:
+ /*
+ * We don't need to free iter->priv (perf_pipeline_haz_data) here since
+ * the hazard info was either already freed in hists__findnew_entry() or
+ * passed to a new hist entry by hist_entry__new().
+ */
+ iter->priv = NULL;
+ iter->he = NULL;
+ return err;
+}
+
static int
iter_prepare_branch_entry(struct hist_entry_iter *iter, struct addr_location *al)
{
@@ -1128,6 +1197,14 @@ iter_finish_cumulative_entry(struct hist_entry_iter *iter,
return 0;
}

+const struct hist_iter_ops hist_iter_haz = {
+ .prepare_entry = iter_prepare_haz_entry,
+ .add_single_entry = iter_add_single_haz_entry,
+ .next_entry = iter_next_nop_entry,
+ .add_next_entry = iter_add_next_nop_entry,
+ .finish_entry = iter_finish_haz_entry,
+};
+
const struct hist_iter_ops hist_iter_mem = {
.prepare_entry = iter_prepare_mem_entry,
.add_single_entry = iter_add_single_mem_entry,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index a4d12a503126..063d65985a11 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -69,6 +69,12 @@ enum hist_column {
HISTC_SYM_SIZE,
HISTC_DSO_SIZE,
HISTC_SYMBOL_IPC,
+ HISTC_HAZ_ITYPE,
+ HISTC_HAZ_ICACHE,
+ HISTC_HAZ_HSTAGE,
+ HISTC_HAZ_HREASON,
+ HISTC_HAZ_SSTAGE,
+ HISTC_HAZ_SREASON,
HISTC_NR_COLS, /* Last entry */
};

@@ -132,6 +138,7 @@ struct hist_entry_iter {
extern const struct hist_iter_ops hist_iter_normal;
extern const struct hist_iter_ops hist_iter_branch;
extern const struct hist_iter_ops hist_iter_mem;
+extern const struct hist_iter_ops hist_iter_haz;
extern const struct hist_iter_ops hist_iter_cumulative;

struct hist_entry *hists__add_entry(struct hists *hists,
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index ab0cfd790ad0..b6d47e7160af 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -23,6 +23,7 @@
#include "strbuf.h"
#include <traceevent/event-parse.h>
#include "mem-events.h"
+#include "hazard.h"
#include "annotate.h"
#include "time-utils.h"
#include <linux/kernel.h>
@@ -34,6 +35,7 @@ const char *parent_pattern = default_parent_pattern;
const char *default_sort_order = "comm,dso,symbol";
const char default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
const char default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked";
+const char default_haz_sort_order[] = "sym,dso,itype,hstage,hreason,sstage,sreason,icache";
const char default_top_sort_order[] = "dso,symbol";
const char default_diff_sort_order[] = "dso,symbol";
const char default_tracepoint_sort_order[] = "trace";
@@ -1377,6 +1379,190 @@ struct sort_entry sort_mem_dcacheline = {
.se_width_idx = HISTC_MEM_DCACHELINE,
};

+static int64_t sort__itype_cmp(struct hist_entry *l, struct hist_entry *r)
+{
+ u8 itype_l, itype_r;
+
+ itype_l = l->haz_data ? l->haz_data->itype : PERF_HAZ__ITYPE_NA;
+ itype_r = r->haz_data ? r->haz_data->itype : PERF_HAZ__ITYPE_NA;
+
+ return (int64_t)(itype_r - itype_l);
+}
+
+static int hist_entry__itype_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ const char *arch;
+ const char *itype = "-";
+
+ arch = hist_entry__get_arch(he);
+ if (he->haz_data)
+ itype = perf_haz__itype_str(he->haz_data->itype, arch);
+
+ return repsep_snprintf(bf, size, "%-*s", width, itype);
+}
+
+struct sort_entry sort_haz_inst_type = {
+ .se_header = "Instruction Type",
+ .se_cmp = sort__itype_cmp,
+ .se_snprintf = hist_entry__itype_snprintf,
+ .se_width_idx = HISTC_HAZ_ITYPE,
+};
+
+static int64_t sort__icache_cmp(struct hist_entry *l, struct hist_entry *r)
+{
+ u8 icache_l, icache_r;
+
+ icache_l = l->haz_data ? l->haz_data->icache : PERF_HAZ__ICACHE_NA;
+ icache_r = r->haz_data ? r->haz_data->icache : PERF_HAZ__ICACHE_NA;
+
+ return (int64_t)(icache_r - icache_l);
+}
+
+static int hist_entry__icache_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ const char *arch;
+ const char *icache = "-";
+
+ arch = hist_entry__get_arch(he);
+ if (he->haz_data)
+ icache = perf_haz__icache_str(he->haz_data->icache, arch);
+
+ return repsep_snprintf(bf, size, "%-*s", width, icache);
+}
+
+struct sort_entry sort_haz_inst_cache = {
+ .se_header = "ICache access",
+ .se_cmp = sort__icache_cmp,
+ .se_snprintf = hist_entry__icache_snprintf,
+ .se_width_idx = HISTC_HAZ_ICACHE,
+};
+
+static int64_t sort__hstage_cmp(struct hist_entry *l, struct hist_entry *r)
+{
+ u8 hstage_l, hstage_r;
+
+ hstage_l = l->haz_data ? l->haz_data->hazard_stage : PERF_HAZ__PIPE_STAGE_NA;
+ hstage_r = r->haz_data ? r->haz_data->hazard_stage : PERF_HAZ__PIPE_STAGE_NA;
+
+ return (int64_t)(hstage_r - hstage_l);
+}
+
+static int hist_entry__hstage_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ const char *arch;
+ const char *hstage = "-";
+
+ arch = hist_entry__get_arch(he);
+ if (he->haz_data)
+ hstage = perf_haz__hstage_str(he->haz_data->hazard_stage, arch);
+
+ return repsep_snprintf(bf, size, "%-*s", width, hstage);
+}
+
+struct sort_entry sort_haz_hazard_stage = {
+ .se_header = "Hazard Stage",
+ .se_cmp = sort__hstage_cmp,
+ .se_snprintf = hist_entry__hstage_snprintf,
+ .se_width_idx = HISTC_HAZ_HSTAGE,
+};
+
+static int64_t sort__hreason_cmp(struct hist_entry *l, struct hist_entry *r)
+{
+ u8 hreason_l, hreason_r;
+
+ hreason_l = l->haz_data ? l->haz_data->hazard_reason : PERF_HAZ__HREASON_NA;
+ hreason_r = r->haz_data ? r->haz_data->hazard_reason : PERF_HAZ__HREASON_NA;
+
+ return (int64_t)(hreason_r - hreason_l);
+}
+
+static int hist_entry__hreason_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ const char *arch;
+ const char *hreason = "-";
+
+ arch = hist_entry__get_arch(he);
+ if (he->haz_data) {
+ hreason = perf_haz__hreason_str(he->haz_data->hazard_stage,
+ he->haz_data->hazard_reason, arch);
+ }
+
+ return repsep_snprintf(bf, size, "%-*s", width, hreason);
+}
+
+struct sort_entry sort_haz_hazard_reason = {
+ .se_header = "Hazard Reason",
+ .se_cmp = sort__hreason_cmp,
+ .se_snprintf = hist_entry__hreason_snprintf,
+ .se_width_idx = HISTC_HAZ_HREASON,
+};
+
+static int64_t sort__sstage_cmp(struct hist_entry *l, struct hist_entry *r)
+{
+ u8 sstage_l, sstage_r;
+
+ sstage_l = l->haz_data ? l->haz_data->stall_stage : PERF_HAZ__PIPE_STAGE_NA;
+ sstage_r = r->haz_data ? r->haz_data->stall_stage : PERF_HAZ__PIPE_STAGE_NA;
+
+ return (int64_t)(sstage_r - sstage_l);
+}
+
+static int hist_entry__sstage_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ const char *arch;
+ const char *sstage = "-";
+
+ arch = hist_entry__get_arch(he);
+ if (he->haz_data)
+ sstage = perf_haz__sstage_str(he->haz_data->stall_stage, arch);
+
+ return repsep_snprintf(bf, size, "%-*s", width, sstage);
+}
+
+struct sort_entry sort_haz_stall_stage = {
+ .se_header = "Stall Stage",
+ .se_cmp = sort__sstage_cmp,
+ .se_snprintf = hist_entry__sstage_snprintf,
+ .se_width_idx = HISTC_HAZ_SSTAGE,
+};
+
+static int64_t sort__sreason_cmp(struct hist_entry *l, struct hist_entry *r)
+{
+ u8 sreason_l, sreason_r;
+
+ sreason_l = l->haz_data ? l->haz_data->stall_reason : PERF_HAZ__SREASON_NA;
+ sreason_r = r->haz_data ? r->haz_data->stall_reason : PERF_HAZ__SREASON_NA;
+
+ return (int64_t)(sreason_r - sreason_l);
+}
+
+static int hist_entry__sreason_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ const char *arch;
+ const char *sreason = "-";
+
+ arch = hist_entry__get_arch(he);
+ if (he->haz_data) {
+ sreason = perf_haz__sreason_str(he->haz_data->stall_stage,
+ he->haz_data->stall_reason, arch);
+ }
+
+ return repsep_snprintf(bf, size, "%-*s", width, sreason);
+}
+
+struct sort_entry sort_haz_stall_reason = {
+ .se_header = "Stall Reason",
+ .se_cmp = sort__sreason_cmp,
+ .se_snprintf = hist_entry__sreason_snprintf,
+ .se_width_idx = HISTC_HAZ_SREASON,
+};
+
static int64_t
sort__phys_daddr_cmp(struct hist_entry *left, struct hist_entry *right)
{
@@ -1699,6 +1885,19 @@ static struct sort_dimension memory_sort_dimensions[] = {

#undef DIM

+#define DIM(d, n, func) [d - __SORT_HAZARD_MODE] = { .name = n, .entry = &(func) }
+
+static struct sort_dimension hazard_sort_dimensions[] = {
+ DIM(SORT_HAZ_ITYPE, "itype", sort_haz_inst_type),
+ DIM(SORT_HAZ_ICACHE, "icache", sort_haz_inst_cache),
+ DIM(SORT_HAZ_HSTAGE, "hstage", sort_haz_hazard_stage),
+ DIM(SORT_HAZ_HREASON, "hreason", sort_haz_hazard_reason),
+ DIM(SORT_HAZ_SSTAGE, "sstage", sort_haz_stall_stage),
+ DIM(SORT_HAZ_SREASON, "sreason", sort_haz_stall_reason),
+};
+
+#undef DIM
+
struct hpp_dimension {
const char *name;
struct perf_hpp_fmt *fmt;
@@ -2643,6 +2842,19 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
return 0;
}

+ for (i = 0; i < ARRAY_SIZE(hazard_sort_dimensions); i++) {
+ struct sort_dimension *sd = &hazard_sort_dimensions[i];
+
+ if (strncasecmp(tok, sd->name, strlen(tok)))
+ continue;
+
+ if (sort__mode != SORT_MODE__HAZARD)
+ return -EINVAL;
+
+ __sort_dimension__add(sd, list, level);
+ return 0;
+ }
+
if (!add_dynamic_entry(evlist, tok, level))
return 0;

@@ -2705,6 +2917,7 @@ static const char *get_default_sort_order(struct evlist *evlist)
default_top_sort_order,
default_diff_sort_order,
default_tracepoint_sort_order,
+ default_haz_sort_order,
};
bool use_trace = true;
struct evsel *evsel;
@@ -2976,6 +3189,18 @@ int output_field_add(struct perf_hpp_list *list, char *tok)
return __sort_dimension__add_output(list, sd);
}

+ for (i = 0; i < ARRAY_SIZE(hazard_sort_dimensions); i++) {
+ struct sort_dimension *sd = &hazard_sort_dimensions[i];
+
+ if (strncasecmp(tok, sd->name, strlen(tok)))
+ continue;
+
+ if (sort__mode != SORT_MODE__HAZARD)
+ return -EINVAL;
+
+ return __sort_dimension__add_output(list, sd);
+ }
+
return -ESRCH;
}

@@ -3014,6 +3239,9 @@ void reset_dimensions(void)

for (i = 0; i < ARRAY_SIZE(memory_sort_dimensions); i++)
memory_sort_dimensions[i].taken = 0;
+
+ for (i = 0; i < ARRAY_SIZE(hazard_sort_dimensions); i++)
+ hazard_sort_dimensions[i].taken = 0;
}

bool is_strict_order(const char *order)
@@ -3148,6 +3376,8 @@ const char *sort_help(const char *prefix)
ARRAY_SIZE(bstack_sort_dimensions), &len);
add_sort_string(&sb, memory_sort_dimensions,
ARRAY_SIZE(memory_sort_dimensions), &len);
+ add_sort_string(&sb, hazard_sort_dimensions,
+ ARRAY_SIZE(hazard_sort_dimensions), &len);
s = strbuf_detach(&sb, NULL);
strbuf_release(&sb);
return s;
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 55eb65fd593f..3c18ece0aa4f 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -12,6 +12,7 @@
#include "hist.h"
#include "stat.h"
#include "spark.h"
+#include "env.h"

struct option;
struct thread;
@@ -36,6 +37,7 @@ extern struct sort_entry sort_sym_to;
extern struct sort_entry sort_srcline;
extern enum sort_type sort__first_dimension;
extern const char default_mem_sort_order[];
+extern const char default_haz_sort_order[];

struct res_sample {
u64 time;
@@ -199,6 +201,16 @@ static inline float hist_entry__get_percent_limit(struct hist_entry *he)
return period * 100.0 / total_period;
}

+static inline struct perf_env *hist_entry__env(struct hist_entry *he)
+{
+ return perf_evsel__env(hists_to_evsel(he->hists));
+}
+
+static inline const char *hist_entry__get_arch(struct hist_entry *he)
+{
+ return perf_env__arch(hist_entry__env(he));
+}
+
enum sort_mode {
SORT_MODE__NORMAL,
SORT_MODE__BRANCH,
@@ -206,6 +218,7 @@ enum sort_mode {
SORT_MODE__TOP,
SORT_MODE__DIFF,
SORT_MODE__TRACEPOINT,
+ SORT_MODE__HAZARD,
};

enum sort_type {
@@ -254,6 +267,15 @@ enum sort_type {
SORT_MEM_DCACHELINE,
SORT_MEM_IADDR_SYMBOL,
SORT_MEM_PHYS_DADDR,
+
+ /* hazard mode specific sort keys */
+ __SORT_HAZARD_MODE,
+ SORT_HAZ_ITYPE = __SORT_HAZARD_MODE,
+ SORT_HAZ_ICACHE,
+ SORT_HAZ_HSTAGE,
+ SORT_HAZ_HREASON,
+ SORT_HAZ_SSTAGE,
+ SORT_HAZ_SREASON,
};

/*
--
2.21.1

2020-03-02 05:26:58

by Ravi Bangoria

Subject: [RFC 11/11] perf annotate: Show hazard data in tui mode

Enable the perf report -> annotate TUI mode to show hazard
information. By default it is hidden, but the user can unhide it by
pressing the hot key 'S'. Sample output:

│ Disassembly of section .text:

│ 0000000010001cf8 <compare>:
│ compare():
│ return NULL;
│ }

│ static int
│ compare(const void *p1, const void *p2)
│ {
33.23 │ std r31,-8(r1)
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: Load Hit Store, stall_stage: LSU, stall_reason: -, icache: L3 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
0.84 │ stdu r1,-64(r1)
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
0.24 │ mr r31,r1
│ {haz_stage: -, haz_reason: -, stall_stage: -, stall_reason: -, icache: L1 hit}
21.18 │ std r3,32(r31)
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
│ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
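With three interleaved line types (asm, source, hazard), a line's on-screen index depends on which of the two hide flags are set. The patch tracks per-line counters: idx over all lines, idx_asm over asm lines only, and idx_src/idx_haz recording, for each asm line, how many source/hazard lines precede it (stored as count - 1). A compact sketch of the index arithmetic used in annotate_browser__draw_current_jump() (a simplification over plain ints; the real code operates on struct annotation_line fields):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * On-screen index of an asm line for the four view combinations.
 * idx_src/idx_haz are "preceding count - 1", so adding 1 back yields
 * the number of src/haz lines shown above this asm line.
 */
static int visible_idx(int idx, int idx_asm, int idx_src, int idx_haz,
		       bool hide_src, bool hide_haz)
{
	if (hide_src && hide_haz)
		return idx_asm;			/* asm-only view */
	if (hide_haz)
		return idx_asm + idx_src + 1;	/* asm + source view */
	if (hide_src)
		return idx_asm + idx_haz + 1;	/* asm + hazard view */
	return idx;				/* everything visible */
}
```

For example, in a listing src0, asm0, src1, asm1 with no hazard lines, asm1 has idx = 3, idx_asm = 1, idx_src = 1, idx_haz = -1: it sits at row 1 with source hidden and row 3 with source shown, regardless of the hazard toggle.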

Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/builtin-annotate.c | 5 ++
tools/perf/ui/browsers/annotate.c | 124 ++++++++++++++++++++++++++----
tools/perf/util/annotate.c | 51 +++++++++++-
tools/perf/util/annotate.h | 18 ++++-
4 files changed, 178 insertions(+), 20 deletions(-)

diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
index 78552a9428a6..a51313a6b019 100644
--- a/tools/perf/builtin-annotate.c
+++ b/tools/perf/builtin-annotate.c
@@ -472,6 +472,7 @@ static const char * const annotate_usage[] = {

int cmd_annotate(int argc, const char **argv)
{
+ bool annotate_haz = false;
struct perf_annotate annotate = {
.tool = {
.sample = process_sample_event,
@@ -531,6 +532,8 @@ int cmd_annotate(int argc, const char **argv)
symbol__config_symfs),
OPT_BOOLEAN(0, "source", &annotate.opts.annotate_src,
"Interleave source code with assembly code (default)"),
+ OPT_BOOLEAN(0, "hazard", &annotate_haz,
+ "Interleave CPU pipeline hazard/stall data with assembly code"),
OPT_BOOLEAN(0, "asm-raw", &annotate.opts.show_asm_raw,
"Display raw encoding of assembly instructions (default)"),
OPT_STRING('M', "disassembler-style", &annotate.opts.disassembler_style, "disassembler style",
@@ -583,6 +586,8 @@ int cmd_annotate(int argc, const char **argv)
if (annotate_check_args(&annotate.opts) < 0)
return -EINVAL;

+ annotate.opts.hide_haz_data = !annotate_haz;
+
if (symbol_conf.show_nr_samples && annotate.use_gtk) {
pr_err("--show-nr-samples is not available in --gtk mode at this time\n");
return ret;
diff --git a/tools/perf/ui/browsers/annotate.c b/tools/perf/ui/browsers/annotate.c
index 2e4db8216b3b..b04d825cee50 100644
--- a/tools/perf/ui/browsers/annotate.c
+++ b/tools/perf/ui/browsers/annotate.c
@@ -190,9 +190,15 @@ static void annotate_browser__draw_current_jump(struct ui_browser *browser)
return;
}

- if (notes->options->hide_src_code) {
+ if (notes->options->hide_src_code && notes->options->hide_haz_data) {
from = cursor->al.idx_asm;
to = target->idx_asm;
+ } else if (!notes->options->hide_src_code && notes->options->hide_haz_data) {
+ from = cursor->al.idx_asm + cursor->al.idx_src + 1;
+ to = target->idx_asm + target->idx_src + 1;
+ } else if (notes->options->hide_src_code && !notes->options->hide_haz_data) {
+ from = cursor->al.idx_asm + cursor->al.idx_haz + 1;
+ to = target->idx_asm + target->idx_haz + 1;
} else {
from = (u64)cursor->al.idx;
to = (u64)target->idx;
@@ -293,8 +299,13 @@ static void annotate_browser__set_rb_top(struct annotate_browser *browser,
struct annotation_line * pos = rb_entry(nd, struct annotation_line, rb_node);
u32 idx = pos->idx;

- if (notes->options->hide_src_code)
+ if (notes->options->hide_src_code && notes->options->hide_haz_data)
idx = pos->idx_asm;
+ else if (!notes->options->hide_src_code && notes->options->hide_haz_data)
+ idx = pos->idx_asm + pos->idx_src + 1;
+ else if (notes->options->hide_src_code && !notes->options->hide_haz_data)
+ idx = pos->idx_asm + pos->idx_haz + 1;
+
annotate_browser__set_top(browser, pos, idx);
browser->curr_hot = nd;
}
@@ -348,44 +359,117 @@ static bool annotate_browser__toggle_source(struct annotate_browser *browser)
struct annotation *notes = browser__annotation(&browser->b);
struct annotation_line *al;
off_t offset = browser->b.index - browser->b.top_idx;
+ u32 curr_idx;

browser->b.seek(&browser->b, offset, SEEK_CUR);
al = list_entry(browser->b.top, struct annotation_line, node);

if (notes->options->hide_src_code) {
- if (al->idx_asm < offset)
- offset = al->idx;
+ if (notes->options->hide_haz_data) {
+ curr_idx = al->idx_asm + al->idx_src + 1;
+ browser->b.nr_entries = notes->nr_asm_entries +
+ notes->nr_src_entries;
+ } else {
+ curr_idx = al->idx;
+ browser->b.nr_entries = notes->nr_entries;
+ }

- browser->b.nr_entries = notes->nr_entries;
notes->options->hide_src_code = false;
browser->b.seek(&browser->b, -offset, SEEK_CUR);
- browser->b.top_idx = al->idx - offset;
- browser->b.index = al->idx;
+ browser->b.top_idx = curr_idx - offset;
+ browser->b.index = curr_idx;
} else {
- if (al->idx_asm < 0) {
+ if (al->type != AL_TYPE_ASM) {
ui_helpline__puts("Only available for assembly lines.");
browser->b.seek(&browser->b, -offset, SEEK_CUR);
return false;
}

- if (al->idx_asm < offset)
- offset = al->idx_asm;
+ if (notes->options->hide_haz_data) {
+ curr_idx = al->idx_asm;
+ browser->b.nr_entries = notes->nr_asm_entries;
+ } else {
+ curr_idx = al->idx_asm + al->idx_haz + 1;
+ browser->b.nr_entries = notes->nr_asm_entries +
+ notes->nr_haz_entries;
+ }
+
+ if (curr_idx < offset)
+ offset = curr_idx;

- browser->b.nr_entries = notes->nr_asm_entries;
notes->options->hide_src_code = true;
browser->b.seek(&browser->b, -offset, SEEK_CUR);
- browser->b.top_idx = al->idx_asm - offset;
- browser->b.index = al->idx_asm;
+ browser->b.top_idx = curr_idx - offset;
+ browser->b.index = curr_idx;
+ }
+
+ return true;
+}
+
+static bool annotate_browser__toggle_hazard(struct annotate_browser *browser)
+{
+ struct annotation *notes = browser__annotation(&browser->b);
+ struct annotation_line *al;
+ off_t offset = browser->b.index - browser->b.top_idx;
+ u32 curr_idx;
+
+ browser->b.seek(&browser->b, offset, SEEK_CUR);
+ al = list_entry(browser->b.top, struct annotation_line, node);
+
+ if (notes->options->hide_haz_data) {
+ if (notes->options->hide_src_code) {
+ curr_idx = al->idx_asm + al->idx_haz + 1;
+ browser->b.nr_entries = notes->nr_asm_entries +
+ notes->nr_haz_entries;
+ } else {
+ curr_idx = al->idx;
+ browser->b.nr_entries = notes->nr_entries;
+ }
+
+ notes->options->hide_haz_data = false;
+ browser->b.seek(&browser->b, -offset, SEEK_CUR);
+ browser->b.top_idx = curr_idx - offset;
+ browser->b.index = curr_idx;
+ } else {
+ if (al->type != AL_TYPE_ASM) {
+ ui_helpline__puts("Only available for assembly lines.");
+ browser->b.seek(&browser->b, -offset, SEEK_CUR);
+ return false;
+ }
+
+ if (notes->options->hide_src_code) {
+ curr_idx = al->idx_asm;
+ browser->b.nr_entries = notes->nr_asm_entries;
+ } else {
+ curr_idx = al->idx_asm + al->idx_src + 1;
+ browser->b.nr_entries = notes->nr_asm_entries +
+ notes->nr_src_entries;
+ }
+
+ if (curr_idx < offset)
+ offset = curr_idx;
+
+ notes->options->hide_haz_data = true;
+ browser->b.seek(&browser->b, -offset, SEEK_CUR);
+ browser->b.top_idx = curr_idx - offset;
+ browser->b.index = curr_idx;
}

return true;
}

-static void ui_browser__init_asm_mode(struct ui_browser *browser)
+static void ui_browser__hide_src_code(struct ui_browser *browser)
+{
+ struct annotation *notes = browser__annotation(browser);
+ ui_browser__reset_index(browser);
+ browser->nr_entries -= notes->nr_src_entries;
+}
+
+static void ui_browser__hide_haz_data(struct ui_browser *browser)
{
struct annotation *notes = browser__annotation(browser);
ui_browser__reset_index(browser);
- browser->nr_entries = notes->nr_asm_entries;
+ browser->nr_entries -= notes->nr_haz_entries;
}

#define SYM_TITLE_MAX_SIZE (PATH_MAX + 64)
@@ -743,6 +827,7 @@ static int annotate_browser__run(struct annotate_browser *browser,
"o Toggle disassembler output/simplified view\n"
"O Bump offset level (jump targets -> +call -> all -> cycle thru)\n"
"s Toggle source code view\n"
+ "S Toggle pipeline hazard/stall view\n"
"t Circulate percent, total period, samples view\n"
"c Show min/max cycle\n"
"/ Search string\n"
@@ -767,6 +852,10 @@ static int annotate_browser__run(struct annotate_browser *browser,
if (annotate_browser__toggle_source(browser))
ui_helpline__puts(help);
continue;
+ case 'S':
+ if (annotate_browser__toggle_hazard(browser))
+ ui_helpline__puts(help);
+ continue;
case 'o':
notes->options->use_offset = !notes->options->use_offset;
annotation__update_column_widths(notes);
@@ -932,7 +1021,10 @@ int symbol__tui_annotate(struct map_symbol *ms, struct evsel *evsel,
browser.b.width += 18; /* Percentage */

if (notes->options->hide_src_code)
- ui_browser__init_asm_mode(&browser.b);
+ ui_browser__hide_src_code(&browser.b);
+
+ if (notes->options->hide_haz_data)
+ ui_browser__hide_haz_data(&browser.b);

ret = annotate_browser__run(&browser, evsel, hbt);

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 766934b0f36d..d782ee193345 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -66,6 +66,7 @@ struct annotation_options annotation__default_options = {
.annotate_src = true,
.offset_level = ANNOTATION__OFFSET_JUMP_TARGETS,
.percent_type = PERCENT_PERIOD_LOCAL,
+ .hide_haz_data = true,
};

static regex_t file_lineno;
@@ -1268,7 +1269,7 @@ static struct disasm_line *disasm_line__new(struct annotate_args *args)
struct disasm_line *dl = NULL;
int nr = 1;

- if (perf_evsel__is_group_event(args->evsel))
+ if (args->type != AL_TYPE_HAZ && perf_evsel__is_group_event(args->evsel))
nr = args->evsel->core.nr_members;

dl = zalloc(disasm_line_size(nr));
@@ -1509,6 +1510,7 @@ annotation_line__print(struct annotation_line *al, struct symbol *sym, u64 start
} else if (max_lines && printed >= max_lines)
return 1;
else {
+ /* TODO: Hazard specific changes */
int width = symbol_conf.show_total_period ? 12 : 8;

if (queue)
@@ -1526,6 +1528,34 @@ annotation_line__print(struct annotation_line *al, struct symbol *sym, u64 start
return 0;
}

+static int annotation_add_hazard_detail(struct annotation *notes, s64 offset)
+{
+ int ret;
+ struct hazard_hist *hh;
+ struct disasm_line *dl;
+ struct annotate_args args = {
+ .type = AL_TYPE_HAZ,
+ .offset = -1,
+ };
+
+ list_for_each_entry(hh, &notes->src->haz_hist[offset].list, list) {
+ ret = asprintf(&(args.line), " {haz_stage: %s, "
+ "haz_reason: %s, stall_stage: %s, stall_reason: %s, "
+ "icache: %s}", hh->haz_stage, hh->haz_reason,
+ hh->stall_stage, hh->stall_reason, hh->icache);
+ if (ret == -1)
+ return -ENOMEM;
+
+ dl = disasm_line__new(&args);
+ free(args.line);
+ if (!dl)
+ return -ENOMEM;
+
+ annotation_line__add(&dl->al, &notes->src->source);
+ }
+ return 0;
+}
+
/*
* symbol__parse_objdump_line() parses objdump output (with -d --no-show-raw)
* which looks like following
@@ -1608,6 +1638,10 @@ static int symbol__parse_objdump_line(struct symbol *sym,

annotation_line__add(&dl->al, &notes->src->source);

+ if (offset != -1 && notes->src->haz_hist &&
+ !list_empty(&notes->src->haz_hist[offset].list))
+ return annotation_add_hazard_detail(notes, offset);
+
return 0;
}

@@ -2701,6 +2735,8 @@ static void annotation__set_offsets(struct annotation *notes, s64 size)
notes->max_line_len = 0;
notes->nr_entries = 0;
notes->nr_asm_entries = 0;
+ notes->nr_src_entries = 0;
+ notes->nr_haz_entries = 0;

list_for_each_entry(al, &notes->src->source, node) {
size_t line_len = strlen(al->line);
@@ -2710,6 +2746,8 @@ static void annotation__set_offsets(struct annotation *notes, s64 size)
al->idx = notes->nr_entries++;
if (al->type == AL_TYPE_ASM) {
al->idx_asm = notes->nr_asm_entries++;
+ al->idx_src = notes->nr_src_entries - 1;
+ al->idx_haz = notes->nr_haz_entries - 1;
/*
* FIXME: short term bandaid to cope with assembly
* routines that comes with labels in the same column
@@ -2719,8 +2757,15 @@ static void annotation__set_offsets(struct annotation *notes, s64 size)
*/
if (al->offset < size)
notes->offsets[al->offset] = al;
- } else
+ } else if (al->type == AL_TYPE_SRC) {
al->idx_asm = -1;
+ al->idx_src = notes->nr_src_entries++;
+ al->idx_haz = notes->nr_haz_entries - 1;
+ } else if (al->type == AL_TYPE_HAZ) {
+ al->idx_asm = -1;
+ al->idx_haz = notes->nr_haz_entries++;
+ al->idx_src = notes->nr_src_entries - 1;
+ }
}
}

@@ -3051,6 +3096,8 @@ static void __annotation_line__write(struct annotation_line *al, struct annotati
printed = scnprintf(bf, sizeof(bf), "%-*s ", notes->widths.addr, " ");
obj__printf(obj, bf);
obj__printf(obj, "%-*s", width - printed - pcnt_width - cycles_width + 1, al->line);
+ } else if (al->type == AL_TYPE_HAZ) {
+ obj__printf(obj, "%-*s", width - pcnt_width - cycles_width, al->line);
} else {
u64 addr = al->offset;
int color = -1;
diff --git a/tools/perf/util/annotate.h b/tools/perf/util/annotate.h
index a3803f89b8fc..3a082c13bdad 100644
--- a/tools/perf/util/annotate.h
+++ b/tools/perf/util/annotate.h
@@ -85,7 +85,8 @@ struct annotation_options {
show_nr_jumps,
show_minmax_cycle,
show_asm_raw,
- annotate_src;
+ annotate_src,
+ hide_haz_data;
u8 offset_level;
int min_pcnt;
int max_lines;
@@ -131,8 +132,16 @@ struct annotation_data {
enum annotation_line_type {
AL_TYPE_ASM = 0,
AL_TYPE_SRC,
+ AL_TYPE_HAZ,
};

+/*
+ * @idx_asm, @idx_src and @idx_haz starts from -1 and increments in
+ * chronological order. For _all_ non-assembly lines, @idx_asm is -1.
+ * OTOH, for non-source line, @idx_src contains prev value. Similarly
+ * for non-hazard lines @idx_haz also contains prev value.
+ * And @idx = @idx_asm + (@idx_src + 1) + (@idx_haz + 1);
+ */
struct annotation_line {
struct list_head node;
struct rb_node rb_node;
@@ -148,6 +157,8 @@ struct annotation_line {
char *path;
u32 idx;
int idx_asm;
+ int idx_src;
+ int idx_haz;
int data_nr;
struct annotation_data data[0];
};
@@ -298,6 +309,8 @@ struct annotation {
int max_jump_sources;
int nr_entries;
int nr_asm_entries;
+ int nr_src_entries;
+ int nr_haz_entries;
u16 max_line_len;
struct {
u8 addr;
@@ -326,7 +339,8 @@ static inline int annotation__pcnt_width(struct annotation *notes)

static inline bool annotation_line__filter(struct annotation_line *al, struct annotation *notes)
{
- return notes->options->hide_src_code && al->type == AL_TYPE_SRC;
+ return ((notes->options->hide_src_code && al->type == AL_TYPE_SRC) ||
+ (notes->options->hide_haz_data && al->type == AL_TYPE_HAZ));
}

void annotation__update_column_widths(struct annotation *notes);
--
2.21.1

2020-03-02 09:56:03

by Peter Zijlstra

Subject: Re: [RFC 02/11] perf/core: Data structure to present hazard data

On Mon, Mar 02, 2020 at 10:53:46AM +0530, Ravi Bangoria wrote:
> From: Madhavan Srinivasan <[email protected]>
>
> Introduce new perf sample_type PERF_SAMPLE_PIPELINE_HAZ to request kernel
> to provide cpu pipeline hazard data. Also, introduce arch independent
> structure 'perf_pipeline_haz_data' to pass hazard data to userspace. This
> is generic structure and arch specific data needs to be converted to this
> format.
>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> Signed-off-by: Ravi Bangoria <[email protected]>
> ---
> include/linux/perf_event.h | 7 ++++++
> include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
> kernel/events/core.c | 6 +++++
> tools/include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
> 4 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 547773f5894e..d5b606e3c57d 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1001,6 +1001,7 @@ struct perf_sample_data {
> u64 stack_user_size;
>
> u64 phys_addr;
> + struct perf_pipeline_haz_data pipeline_haz;
> } ____cacheline_aligned;
>
> /* default value for data source */
> @@ -1021,6 +1022,12 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
> data->weight = 0;
> data->data_src.val = PERF_MEM_NA;
> data->txn = 0;
> + data->pipeline_haz.itype = PERF_HAZ__ITYPE_NA;
> + data->pipeline_haz.icache = PERF_HAZ__ICACHE_NA;
> + data->pipeline_haz.hazard_stage = PERF_HAZ__PIPE_STAGE_NA;
> + data->pipeline_haz.hazard_reason = PERF_HAZ__HREASON_NA;
> + data->pipeline_haz.stall_stage = PERF_HAZ__PIPE_STAGE_NA;
> + data->pipeline_haz.stall_reason = PERF_HAZ__SREASON_NA;
> }

NAK, Don't touch anything outside of the first cacheline here.

2020-03-02 10:15:17

by Peter Zijlstra

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
> Modern processors export such hazard data in Performance
> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> AMD[3] provides similar information.
>
> Implementation detail:
>
> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> If it's set, kernel converts arch specific hazard information
> into generic format:
>
> struct perf_pipeline_haz_data {
> /* Instruction/Opcode type: Load, Store, Branch .... */
> __u8 itype;
> /* Instruction Cache source */
> __u8 icache;
> /* Instruction suffered hazard in pipeline stage */
> __u8 hazard_stage;
> /* Hazard reason */
> __u8 hazard_reason;
> /* Instruction suffered stall in pipeline stage */
> __u8 stall_stage;
> /* Stall reason */
> __u8 stall_reason;
> __u16 pad;
> };

Kim, does this format indeed work for AMD IBS?

2020-03-02 14:25:18

by Madhavan Srinivasan

Subject: Re: [RFC 02/11] perf/core: Data structure to present hazard data



On 3/2/20 3:25 PM, Peter Zijlstra wrote:
> On Mon, Mar 02, 2020 at 10:53:46AM +0530, Ravi Bangoria wrote:
>> From: Madhavan Srinivasan <[email protected]>
>>
>> Introduce new perf sample_type PERF_SAMPLE_PIPELINE_HAZ to request kernel
>> to provide cpu pipeline hazard data. Also, introduce arch independent
>> structure 'perf_pipeline_haz_data' to pass hazard data to userspace. This
>> is generic structure and arch specific data needs to be converted to this
>> format.
>>
>> Signed-off-by: Madhavan Srinivasan <[email protected]>
>> Signed-off-by: Ravi Bangoria <[email protected]>
>> ---
>> include/linux/perf_event.h | 7 ++++++
>> include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
>> kernel/events/core.c | 6 +++++
>> tools/include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
>> 4 files changed, 75 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 547773f5894e..d5b606e3c57d 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -1001,6 +1001,7 @@ struct perf_sample_data {
>> u64 stack_user_size;
>>
>> u64 phys_addr;
>> + struct perf_pipeline_haz_data pipeline_haz;
>> } ____cacheline_aligned;
>>
>> /* default value for data source */
>> @@ -1021,6 +1022,12 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
>> data->weight = 0;
>> data->data_src.val = PERF_MEM_NA;
>> data->txn = 0;
>> + data->pipeline_haz.itype = PERF_HAZ__ITYPE_NA;
>> + data->pipeline_haz.icache = PERF_HAZ__ICACHE_NA;
>> + data->pipeline_haz.hazard_stage = PERF_HAZ__PIPE_STAGE_NA;
>> + data->pipeline_haz.hazard_reason = PERF_HAZ__HREASON_NA;
>> + data->pipeline_haz.stall_stage = PERF_HAZ__PIPE_STAGE_NA;
>> + data->pipeline_haz.stall_reason = PERF_HAZ__SREASON_NA;
>> }
> NAK, Don't touch anything outside of the first cacheline here.

My bad, should have looked at the comment in "struct perf_sample_data {".
Will move it to perf_prepare_sample().

Thanks for comments.
Maddy

2020-03-02 14:49:10

by Mark Rutland

Subject: Re: [RFC 02/11] perf/core: Data structure to present hazard data

On Mon, Mar 02, 2020 at 10:53:46AM +0530, Ravi Bangoria wrote:
> From: Madhavan Srinivasan <[email protected]>
>
> Introduce new perf sample_type PERF_SAMPLE_PIPELINE_HAZ to request kernel
> to provide cpu pipeline hazard data. Also, introduce arch independent
> structure 'perf_pipeline_haz_data' to pass hazard data to userspace. This
> is generic structure and arch specific data needs to be converted to this
> format.
>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> Signed-off-by: Ravi Bangoria <[email protected]>
> ---
> include/linux/perf_event.h | 7 ++++++
> include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
> kernel/events/core.c | 6 +++++
> tools/include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
> 4 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 547773f5894e..d5b606e3c57d 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1001,6 +1001,7 @@ struct perf_sample_data {
> u64 stack_user_size;
>
> u64 phys_addr;
> + struct perf_pipeline_haz_data pipeline_haz;
> } ____cacheline_aligned;
>
> /* default value for data source */
> @@ -1021,6 +1022,12 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
> data->weight = 0;
> data->data_src.val = PERF_MEM_NA;
> data->txn = 0;
> + data->pipeline_haz.itype = PERF_HAZ__ITYPE_NA;
> + data->pipeline_haz.icache = PERF_HAZ__ICACHE_NA;
> + data->pipeline_haz.hazard_stage = PERF_HAZ__PIPE_STAGE_NA;
> + data->pipeline_haz.hazard_reason = PERF_HAZ__HREASON_NA;
> + data->pipeline_haz.stall_stage = PERF_HAZ__PIPE_STAGE_NA;
> + data->pipeline_haz.stall_reason = PERF_HAZ__SREASON_NA;
> }
>
> extern void perf_output_sample(struct perf_output_handle *handle,
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 377d794d3105..ff252618ca93 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -142,8 +142,9 @@ enum perf_event_sample_format {
> PERF_SAMPLE_REGS_INTR = 1U << 18,
> PERF_SAMPLE_PHYS_ADDR = 1U << 19,
> PERF_SAMPLE_AUX = 1U << 20,
> + PERF_SAMPLE_PIPELINE_HAZ = 1U << 21,

Can we please have perf_event_open() reject this sample flag for PMUs
without the new callback (introduced in the next patch)?

That way it'll be possible to detect whether the PMU exposes this.

Thanks,
Mark.

2020-03-02 14:55:25

by Mark Rutland

Subject: Re: [RFC 02/11] perf/core: Data structure to present hazard data

On Mon, Mar 02, 2020 at 10:53:46AM +0530, Ravi Bangoria wrote:
> From: Madhavan Srinivasan <[email protected]>
>
> Introduce new perf sample_type PERF_SAMPLE_PIPELINE_HAZ to request kernel
> to provide cpu pipeline hazard data. Also, introduce arch independent
> structure 'perf_pipeline_haz_data' to pass hazard data to userspace. This
> is generic structure and arch specific data needs to be converted to this
> format.
>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> Signed-off-by: Ravi Bangoria <[email protected]>
> ---
> include/linux/perf_event.h | 7 ++++++
> include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
> kernel/events/core.c | 6 +++++
> tools/include/uapi/linux/perf_event.h | 32 ++++++++++++++++++++++++++-
> 4 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 547773f5894e..d5b606e3c57d 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1001,6 +1001,7 @@ struct perf_sample_data {
> u64 stack_user_size;
>
> u64 phys_addr;
> + struct perf_pipeline_haz_data pipeline_haz;
> } ____cacheline_aligned;

I don't think you can add this here, see below.

> /* default value for data source */
> @@ -1021,6 +1022,12 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
> data->weight = 0;
> data->data_src.val = PERF_MEM_NA;
> data->txn = 0;
> + data->pipeline_haz.itype = PERF_HAZ__ITYPE_NA;
> + data->pipeline_haz.icache = PERF_HAZ__ICACHE_NA;
> + data->pipeline_haz.hazard_stage = PERF_HAZ__PIPE_STAGE_NA;
> + data->pipeline_haz.hazard_reason = PERF_HAZ__HREASON_NA;
> + data->pipeline_haz.stall_stage = PERF_HAZ__PIPE_STAGE_NA;
> + data->pipeline_haz.stall_reason = PERF_HAZ__SREASON_NA;
> }
>
> extern void perf_output_sample(struct perf_output_handle *handle,
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 377d794d3105..ff252618ca93 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -142,8 +142,9 @@ enum perf_event_sample_format {
> PERF_SAMPLE_REGS_INTR = 1U << 18,
> PERF_SAMPLE_PHYS_ADDR = 1U << 19,
> PERF_SAMPLE_AUX = 1U << 20,
> + PERF_SAMPLE_PIPELINE_HAZ = 1U << 21,
>
> - PERF_SAMPLE_MAX = 1U << 21, /* non-ABI */
> + PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */
>
> __PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */
> };
> @@ -870,6 +871,13 @@ enum perf_event_type {
> * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> * { u64 size;
> * char data[size]; } && PERF_SAMPLE_AUX
> + * { u8 itype;
> + * u8 icache;
> + * u8 hazard_stage;
> + * u8 hazard_reason;
> + * u8 stall_stage;
> + * u8 stall_reason;
> + * u16 pad;} && PERF_SAMPLE_PIPELINE_HAZ
> * };

The existing comment shows the aux data *immediately* after the
phys_addr field, where you've placed struct perf_pipeline_haz_data.

If adding to struct perf_sample_data is fine, this needs to come before
the aux data in this comment. If adding to struct perf_sample_data is
not fine, struct perf_pipeline_haz_data cannot live there.

I suspect the latter is true, but you're getting away with it because
you're not using both PERF_SAMPLE_AUX and PERF_SAMPLE_PIPELINE_HAZ
simultaneously.

Thanks,
Mark.

> */
> PERF_RECORD_SAMPLE = 9,
> @@ -1185,4 +1193,26 @@ struct perf_branch_entry {
> reserved:40;
> };
>
> +struct perf_pipeline_haz_data {
> + /* Instruction/Opcode type: Load, Store, Branch .... */
> + __u8 itype;
> + /* Instruction Cache source */
> + __u8 icache;
> + /* Instruction suffered hazard in pipeline stage */
> + __u8 hazard_stage;
> + /* Hazard reason */
> + __u8 hazard_reason;
> + /* Instruction suffered stall in pipeline stage */
> + __u8 stall_stage;
> + /* Stall reason */
> + __u8 stall_reason;
> + __u16 pad;
> +};
> +
> +#define PERF_HAZ__ITYPE_NA 0x0
> +#define PERF_HAZ__ICACHE_NA 0x0
> +#define PERF_HAZ__PIPE_STAGE_NA 0x0
> +#define PERF_HAZ__HREASON_NA 0x0
> +#define PERF_HAZ__SREASON_NA 0x0
> +
> #endif /* _UAPI_LINUX_PERF_EVENT_H */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e453589da97c..d00037c77ccf 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1754,6 +1754,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
> if (sample_type & PERF_SAMPLE_PHYS_ADDR)
> size += sizeof(data->phys_addr);
>
> + if (sample_type & PERF_SAMPLE_PIPELINE_HAZ)
> + size += sizeof(data->pipeline_haz);
> +
> event->header_size = size;
> }
>
> @@ -6712,6 +6715,9 @@ void perf_output_sample(struct perf_output_handle *handle,
> perf_aux_sample_output(event, handle, data);
> }
>
> + if (sample_type & PERF_SAMPLE_PIPELINE_HAZ)
> + perf_output_put(handle, data->pipeline_haz);
> +
> if (!event->attr.watermark) {
> int wakeup_events = event->attr.wakeup_events;
>
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index 377d794d3105..ff252618ca93 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -142,8 +142,9 @@ enum perf_event_sample_format {
> PERF_SAMPLE_REGS_INTR = 1U << 18,
> PERF_SAMPLE_PHYS_ADDR = 1U << 19,
> PERF_SAMPLE_AUX = 1U << 20,
> + PERF_SAMPLE_PIPELINE_HAZ = 1U << 21,
>
> - PERF_SAMPLE_MAX = 1U << 21, /* non-ABI */
> + PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */
>
> __PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */
> };
> @@ -870,6 +871,13 @@ enum perf_event_type {
> * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> * { u64 size;
> * char data[size]; } && PERF_SAMPLE_AUX
> + * { u8 itype;
> + * u8 icache;
> + * u8 hazard_stage;
> + * u8 hazard_reason;
> + * u8 stall_stage;
> + * u8 stall_reason;
> + * u16 pad;} && PERF_SAMPLE_PIPELINE_HAZ
> * };
> */
> PERF_RECORD_SAMPLE = 9,
> @@ -1185,4 +1193,26 @@ struct perf_branch_entry {
> reserved:40;
> };
>
> +struct perf_pipeline_haz_data {
> + /* Instruction/Opcode type: Load, Store, Branch .... */
> + __u8 itype;
> + /* Instruction Cache source */
> + __u8 icache;
> + /* Instruction suffered hazard in pipeline stage */
> + __u8 hazard_stage;
> + /* Hazard reason */
> + __u8 hazard_reason;
> + /* Instruction suffered stall in pipeline stage */
> + __u8 stall_stage;
> + /* Stall reason */
> + __u8 stall_reason;
> + __u16 pad;
> +};
> +
> +#define PERF_HAZ__ITYPE_NA 0x0
> +#define PERF_HAZ__ICACHE_NA 0x0
> +#define PERF_HAZ__PIPE_STAGE_NA 0x0
> +#define PERF_HAZ__HREASON_NA 0x0
> +#define PERF_HAZ__SREASON_NA 0x0
> +
> #endif /* _UAPI_LINUX_PERF_EVENT_H */
> --
> 2.21.1
>

2020-03-02 20:21:44

by Stephane Eranian

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
> > Modern processors export such hazard data in Performance
> > Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> > Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> > AMD[3] provides similar information.
> >
> > Implementation detail:
> >
> > A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> > If it's set, kernel converts arch specific hazard information
> > into generic format:
> >
> > struct perf_pipeline_haz_data {
> > /* Instruction/Opcode type: Load, Store, Branch .... */
> > __u8 itype;
> > /* Instruction Cache source */
> > __u8 icache;
> > /* Instruction suffered hazard in pipeline stage */
> > __u8 hazard_stage;
> > /* Hazard reason */
> > __u8 hazard_reason;
> > /* Instruction suffered stall in pipeline stage */
> > __u8 stall_stage;
> > /* Stall reason */
> > __u8 stall_reason;
> > __u16 pad;
> > };
>
> Kim, does this format indeed work for AMD IBS?


Personally, I don't like the term hazard. This is too IBM Power
specific. We need to find a better term, maybe stall or penalty.
Also worth considering is the support of ARM SPE (Statistical
Profiling Extension) which is their version of IBS.
Whatever gets added needs to cover all three with no limitations.

2020-03-02 21:13:18

by Paul A. Clarke

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

On 3/1/20 11:23 PM, Ravi Bangoria wrote:
> Most modern microprocessors employ complex instruction execution
> pipelines such that many instructions can be 'in flight' at any
> given point in time. Various factors affect this pipeline and
> hazards are the primary among them. Different types of hazards
> exist - Data hazards, Structural hazards and Control hazards.
> Data hazard is the case where data dependencies exist between
> instructions in different stages in the pipeline. Structural
> hazard is when the same processor hardware is needed by more
> than one instruction in flight at the same time. Control hazards
> are more the branch misprediction kinds.
>
> Information about these hazards are critical towards analyzing
> performance issues and also to tune software to overcome such
> issues. Modern processors export such hazard data in Performance
> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> AMD[3] provides similar information.
>
> Implementation detail:
>
> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> If it's set, kernel converts arch specific hazard information
> into generic format:
>
> struct perf_pipeline_haz_data {
> /* Instruction/Opcode type: Load, Store, Branch .... */
> __u8 itype;

At the risk of bike-shedding (in an RFC, no less), "itype" doesn't convey enough meaning to me. "inst_type"? I see that in 03/11 you use "perf_inst_type".

> /* Instruction Cache source */
> __u8 icache;

Possibly same here, and you use "perf_inst_cache" in 03/11.

> /* Instruction suffered hazard in pipeline stage */
> __u8 hazard_stage;
> /* Hazard reason */
> __u8 hazard_reason;
> /* Instruction suffered stall in pipeline stage */
> __u8 stall_stage;
> /* Stall reason */
> __u8 stall_reason;
> __u16 pad;
> };
>
> ... which can be read by user from mmap() ring buffer. With this
> approach, sample perf report in hazard mode looks like (On IBM
> PowerPC):
>
> # ./perf record --hazard ./ebizzy
> # ./perf report --hazard
> Overhead Symbol Shared Instruction Type Hazard Stage Hazard Reason Stall Stage Stall Reason ICache access
> 36.58% [.] thread_run ebizzy Load LSU Mispredict LSU Load fin L1 hit
> 9.46% [.] thread_run ebizzy Load LSU Mispredict LSU Dcache_miss L1 hit
> 1.76% [.] thread_run ebizzy Fixed point - - - - L1 hit
> 1.31% [.] thread_run ebizzy Load LSU ERAT Miss LSU Load fin L1 hit
> 1.27% [.] thread_run ebizzy Load LSU Mispredict - - L1 hit
> 1.16% [.] thread_run ebizzy Fixed point - - FXU Fixed cycle L1 hit
> 0.50% [.] thread_run ebizzy Fixed point ISU Source Unavailable FXU Fixed cycle L1 hit
> 0.30% [.] thread_run ebizzy Load LSU LMQ Full, DERAT Miss LSU Load fin L1 hit
> 0.24% [.] thread_run ebizzy Load LSU ERAT Miss - - L1 hit
> 0.08% [.] thread_run ebizzy - - - BRU Fixed cycle L1 hit
> 0.05% [.] thread_run ebizzy Branch - - BRU Fixed cycle L1 hit
> 0.04% [.] thread_run ebizzy Fixed point ISU Source Unavailable - - L1 hit

How are these to be interpreted? This is great information, but is it possible to make it more readable for non-experts? If each of these maps 1:1 with hardware events, should you emit the name of the event here, so that it can be used to look up further information? For example, does the first line map to PM_CMPLU_STALL_LSU_FIN?
What was "Mispredict[ed]"? (Is it different from a branch misprediction?) And how does this relate to "L1 hit"?
Can we emit "Load finish" instead of "Load fin" for easier reading? 03/11 also has "Marked fin before NTC".
Nit: why does "Dcache_miss" have an underscore and none of the others?

> Also perf annotate with hazard data:

> │ static int
> │ compare(const void *p1, const void *p2)
> │ {
> 33.23 │ std r31,-8(r1)
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
> │ {haz_stage: LSU, haz_reason: Load Hit Store, stall_stage: LSU, stall_reason: -, icache: L3 hit}
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
> 0.84 │ stdu r1,-64(r1)
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
> 0.24 │ mr r31,r1
> │ {haz_stage: -, haz_reason: -, stall_stage: -, stall_reason: -, icache: L1 hit}
> 21.18 │ std r3,32(r31)
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>
>
> Patches:
> - Patch #1 is a simple cleanup patch
> - Patch #2, #3, #4 implements generic and arch specific kernel
> infrastructure
> - Patch #5 enables perf record and script with hazard mode
> - Patch #6, #7, #8 enables perf report with hazard mode
> - Patch #9, #10, #11 enables perf annotate with hazard mode
>
> Note:
> - This series is based on the talk by Madhavan in LPC 2018[4]. This is
> just an early RFC to get comments about the approach and not intended
> to be merged yet.
> - I've prepared the series base on v5.6-rc3. But it depends on generic
> perf annotate fixes [5][6] which are already merged by Arnaldo in
> perf/urgent and perf/core.
>
> [1]: Book III, Section 9.4.10:
> https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
> [2]: https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf#G9.1106986

This document is also available from the "IBM Portal for OpenPOWER" under the "All IBM Material for OpenPOWER" https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=OpenPOWER, under each of the individual modules. (Well hidden, it might be said, and not a simple link like you have here.)

> [3]: https://www.amd.com/system/files/TechDocs/24593.pdf#G19.1089550
> [4]: https://linuxplumbersconf.org/event/2/contributions/76/
> [5]: http://lore.kernel.org/r/[email protected]
> [6]: http://lore.kernel.org/r/[email protected]

PC

2020-03-02 22:26:50

by Kim Phillips

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

On 3/2/20 2:21 PM, Stephane Eranian wrote:
> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>
>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>> Modern processors export such hazard data in Performance
>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>> AMD[3] provides similar information.
>>>
>>> Implementation detail:
>>>
>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>> If it's set, kernel converts arch specific hazard information
>>> into generic format:
>>>
>>> struct perf_pipeline_haz_data {
>>> /* Instruction/Opcode type: Load, Store, Branch .... */
>>> __u8 itype;
>>> /* Instruction Cache source */
>>> __u8 icache;
>>> /* Instruction suffered hazard in pipeline stage */
>>> __u8 hazard_stage;
>>> /* Hazard reason */
>>> __u8 hazard_reason;
>>> /* Instruction suffered stall in pipeline stage */
>>> __u8 stall_stage;
>>> /* Stall reason */
>>> __u8 stall_reason;
>>> __u16 pad;
>>> };
>>
>> Kim, does this format indeed work for AMD IBS?

It's not really 1:1: we don't have these separations of stages
and reasons; we have, for example, "missed in L2 cache". So IBS
output is flatter, with more cycle-latency figures than
IBM's, AFAICT.

> Personally, I don't like the term hazard. This is too IBM Power
> specific. We need to find a better term, maybe stall or penalty.

Right, IBS doesn't have a filter to count only stalled or otherwise
bad events. IBS' PPR description has one occurrence of the word
stall, and none of penalty. The way I read IBS, it's just reporting
more sample data than the precise IP alone: things like hits,
misses, cycle latencies, addresses, types, etc., so words like
'extended', or the 'auxiliary' already in use today, are more
appropriate for IBS, although I'm the last person to bikeshed.

> Also worth considering is the support of ARM SPE (Statistical
> Profiling Extension) which is their version of IBS.
> Whatever gets added needs to cover all three with no limitations.

I thought Intel's various LBR, PEBS, and PT supported providing
similar sample data in perf already, like with perf mem/c2c?

Kim

2020-03-03 01:33:55

by Andi Kleen

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

On Mon, Mar 02, 2020 at 11:13:32AM +0100, Peter Zijlstra wrote:
> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
> > Modern processors export such hazard data in Performance
> > Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> > Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> > AMD[3] provides similar information.
> >
> > Implementation detail:
> >
> > A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> > If it's set, kernel converts arch specific hazard information
> > into generic format:
> >
> > struct perf_pipeline_haz_data {
> > /* Instruction/Opcode type: Load, Store, Branch .... */
> > __u8 itype;
> > /* Instruction Cache source */
> > __u8 icache;
> > /* Instruction suffered hazard in pipeline stage */
> > __u8 hazard_stage;
> > /* Hazard reason */
> > __u8 hazard_reason;
> > /* Instruction suffered stall in pipeline stage */
> > __u8 stall_stage;
> > /* Stall reason */
> > __u8 stall_reason;
> > __u16 pad;
> > };
>
> Kim, does this format indeed work for AMD IBS?

Intel PEBS has a similar concept for annotation of memory accesses,
which is already exported through perf_mem_data_src. This is essentially
an extension. It would be better to have something unified here.
Right now it seems to duplicate at least part of the PEBS facility.

-Andi

2020-03-03 14:32:36

by Ravi Bangoria

Subject: Re: [RFC 02/11] perf/core: Data structure to present hazard data



On 3/2/20 8:24 PM, Mark Rutland wrote:
>> @@ -870,6 +871,13 @@ enum perf_event_type {
>> * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>> * { u64 size;
>> * char data[size]; } && PERF_SAMPLE_AUX
>> + * { u8 itype;
>> + * u8 icache;
>> + * u8 hazard_stage;
>> + * u8 hazard_reason;
>> + * u8 stall_stage;
>> + * u8 stall_reason;
>> + * u16 pad;} && PERF_SAMPLE_PIPELINE_HAZ
>> * };
>
> The existing comment shows the aux data *immediately* after the
> phys_addr field, where you've placed struct perf_pipeline_haz_data.
>
> If adding to struct perf_sample_data is fine, this needs to come before
> the aux data in this comment. If adding to struct perf_sample_data is
> not fine, struct perf_pipeline_haz_data cannot live there.
>
> I suspect the latter is true, but you're getting away with it because
> you're not using both PERF_SAMPLE_AUX and PERF_SAMPLE_PIPELINE_HAZ
> simultaneously.

Right. Thanks for pointing it out. Will change it.

Ravi

2020-03-03 14:33:49

by Ravi Bangoria

Subject: Re: [RFC 02/11] perf/core: Data structure to present hazard data


On 3/2/20 8:18 PM, Mark Rutland wrote:
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 377d794d3105..ff252618ca93 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -142,8 +142,9 @@ enum perf_event_sample_format {
>> PERF_SAMPLE_REGS_INTR = 1U << 18,
>> PERF_SAMPLE_PHYS_ADDR = 1U << 19,
>> PERF_SAMPLE_AUX = 1U << 20,
>> + PERF_SAMPLE_PIPELINE_HAZ = 1U << 21,
>
> Can we please have perf_event_open() reject this sample flag for PMUs
> without the new callback (introduced in the next patch)?
>
> That way it'll be possible to detect whether the PMU exposes this.

Sure. Will change it.

Ravi
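
As a userspace-testable sketch of the check Mark asks for: the struct
definitions and the `pipeline_haz` callback name below are stand-ins for
the kernel's `struct pmu` and the callback introduced in patch 03, not
the actual patch code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the RFC's proposed sample bit (1U << 21). */
#define PERF_SAMPLE_PIPELINE_HAZ (1U << 21)

/* Cut-down stand-ins for the kernel's struct pmu / perf_event_attr. */
struct pmu {
	void (*pipeline_haz)(void); /* NULL: PMU cannot produce hazard data */
};

struct perf_event_attr {
	uint64_t sample_type;
};

/* The check Mark suggests: reject the sample flag when the PMU has no
 * way to produce the data, so userspace can probe for support. */
static int validate_sample_type(const struct perf_event_attr *attr,
				const struct pmu *pmu)
{
	if ((attr->sample_type & PERF_SAMPLE_PIPELINE_HAZ) &&
	    !pmu->pipeline_haz)
		return -EOPNOTSUPP;
	return 0;
}

/* Example callback a hazard-capable PMU would install. */
static void power9_pipeline_haz(void) { }
```

With this in place, a perf_event_open() with the new sample bit fails on
PMUs lacking the callback, which is exactly the feature-detection
behaviour requested.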

2020-03-05 04:29:46

by Madhavan Srinivasan

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information



On 3/3/20 1:51 AM, Stephane Eranian wrote:
> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>> Modern processors export such hazard data in Performance
>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>> AMD[3] provides similar information.
>>>
>>> Implementation detail:
>>>
>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>> If it's set, kernel converts arch specific hazard information
>>> into generic format:
>>>
>>> struct perf_pipeline_haz_data {
>>> /* Instruction/Opcode type: Load, Store, Branch .... */
>>> __u8 itype;
>>> /* Instruction Cache source */
>>> __u8 icache;
>>> /* Instruction suffered hazard in pipeline stage */
>>> __u8 hazard_stage;
>>> /* Hazard reason */
>>> __u8 hazard_reason;
>>> /* Instruction suffered stall in pipeline stage */
>>> __u8 stall_stage;
>>> /* Stall reason */
>>> __u8 stall_reason;
>>> __u16 pad;
>>> };
>> Kim, does this format indeed work for AMD IBS?
>
> Personally, I don't like the term hazard. This is too IBM Power
> specific. We need to find a better term, maybe stall or penalty.

Yes, the names can be reworked. Thinking more on it, how about presenting
these as "pipeline" data instead of "hazard" data?

> Also worth considering is the support of ARM SPE (Statistical
> Profiling Extension) which is their version of IBS.
> Whatever gets added need to cover all three with no limitations.

Thanks for pointing this out. We looked at the ARM SPE spec and it does
provide information like issue latency, translation latency and so on.
And AMD IBS provides data like fetch latency, tag to retire latency,
completion to retire latency and so on when using Fetch sampling.
So yes, we will rework the struct definition to include data from ARM SPE
and AMD IBS as well. We will post a newer version soon.

Thanks for the comments
Maddy
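
Purely as a sketch, a reworked layout folding in the latency counters that
AMD IBS and ARM SPE expose might look like the following. All field names
and widths here are illustrative assumptions, not from any posted patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical rework of perf_pipeline_haz_data: keeps the Power-style
 * stage/reason fields and adds cycle-latency counters in the spirit of
 * ARM SPE (issue/translation latency) and AMD IBS (tag-to-retire).
 */
struct perf_pipeline_data {
	uint8_t  op_type;         /* load, store, branch, ... */
	uint8_t  icache;          /* instruction cache source */
	uint8_t  hazard_stage;    /* pipeline stage that saw the hazard */
	uint8_t  hazard_reason;
	uint8_t  stall_stage;     /* pipeline stage that saw the stall */
	uint8_t  stall_reason;
	uint16_t issue_lat;       /* cycles, e.g. ARM SPE issue latency */
	uint16_t completion_lat;  /* cycles, e.g. IBS tag-to-retire */
	uint16_t translation_lat; /* cycles, e.g. ARM SPE translation latency */
	uint32_t pad;             /* pad record size to a 64-bit multiple */
};
```

Sample records are laid out in u64 units, hence the explicit pad bringing
the struct to 16 bytes.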

2020-03-05 04:47:41

by Ravi Bangoria

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

Hi Kim,

Sorry about being a bit late.

On 3/3/20 3:55 AM, Kim Phillips wrote:
> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>> Modern processors export such hazard data in Performance
>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>> AMD[3] provides similar information.
>>>>
>>>> Implementation detail:
>>>>
>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>> If it's set, kernel converts arch specific hazard information
>>>> into generic format:
>>>>
>>>> struct perf_pipeline_haz_data {
>>>> /* Instruction/Opcode type: Load, Store, Branch .... */
>>>> __u8 itype;
>>>> /* Instruction Cache source */
>>>> __u8 icache;
>>>> /* Instruction suffered hazard in pipeline stage */
>>>> __u8 hazard_stage;
>>>> /* Hazard reason */
>>>> __u8 hazard_reason;
>>>> /* Instruction suffered stall in pipeline stage */
>>>> __u8 stall_stage;
>>>> /* Stall reason */
>>>> __u8 stall_reason;
>>>> __u16 pad;
>>>> };
>>>
>>> Kim, does this format indeed work for AMD IBS?
>
> It's not really 1:1, we don't have these separations of stages
> and reasons, for example: we have missed in L2 cache, for example.
> So IBS output is flatter, with more cycle latency figures than
> IBM's AFAICT.

AMD IBS captures pipeline latency data in case of Fetch sampling, like the
fetch latency, tag to retire latency, completion to retire latency and
so on. Yes, Ops sampling does provide more data on load/store centric
information. But it also captures more detailed data for branch instructions.
And we also looked at ARM SPE, which likewise captures more detailed
pipeline data and latency information.

>
>> Personally, I don't like the term hazard. This is too IBM Power
>> specific. We need to find a better term, maybe stall or penalty.
>
> Right, IBS doesn't have a filter to only count stalled or otherwise
> bad events. IBS' PPR descriptions has one occurrence of the
> word stall, and no penalty. The way I read IBS is it's just
> reporting more sample data than just the precise IP: things like
> hits, misses, cycle latencies, addresses, types, etc., so words
> like 'extended', or the 'auxiliary' already used today even
> are more appropriate for IBS, although I'm the last person to
> bikeshed.

We are thinking of using the word "pipeline" instead of "hazard".

>
>> Also worth considering is the support of ARM SPE (Statistical
>> Profiling Extension) which is their version of IBS.
>> Whatever gets added need to cover all three with no limitations.
>
> I thought Intel's various LBR, PEBS, and PT supported providing
> similar sample data in perf already, like with perf mem/c2c?

perf-mem is more data centric in my opinion. It is geared more towards
memory profiling. The proposal here is to expose pipeline related
details like stalls and latencies.

Thanks for the review,
Ravi

2020-03-05 05:07:24

by Ravi Bangoria

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

Hi Paul,

Sorry for the slightly late reply.

On 3/3/20 2:38 AM, Paul Clarke wrote:
> On 3/1/20 11:23 PM, Ravi Bangoria wrote:
>> Most modern microprocessors employ complex instruction execution
>> pipelines such that many instructions can be 'in flight' at any
>> given point in time. Various factors affect this pipeline and
>> hazards are the primary among them. Different types of hazards
>> exist - Data hazards, Structural hazards and Control hazards.
>> Data hazard is the case where data dependencies exist between
>> instructions in different stages in the pipeline. Structural
>> hazard is when the same processor hardware is needed by more
>> than one instruction in flight at the same time. Control hazards
>> are more the branch misprediction kinds.
>>
>> Information about these hazards are critical towards analyzing
>> performance issues and also to tune software to overcome such
>> issues. Modern processors export such hazard data in Performance
>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>> AMD[3] provides similar information.
>>
>> Implementation detail:
>>
>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>> If it's set, kernel converts arch specific hazard information
>> into generic format:
>>
>> struct perf_pipeline_haz_data {
>> /* Instruction/Opcode type: Load, Store, Branch .... */
>> __u8 itype;
>
> At the risk of bike-shedding (in an RFC, no less), "itype" doesn't convey enough meaning to me. "inst_type"? I see in 03/11, you use "perf_inst_type".

I was thinking of renaming itype to operation_type or op_type, because
AMD IBS and ARM SPE observe micro ops, and op_type is also more aligned
with the "pipeline" wording.

>
>> /* Instruction Cache source */
>> __u8 icache;
>
> Possibly same here, and you use "perf_inst_cache" in 03/11.

Sure.

>
>> /* Instruction suffered hazard in pipeline stage */
>> __u8 hazard_stage;
>> /* Hazard reason */
>> __u8 hazard_reason;
>> /* Instruction suffered stall in pipeline stage */
>> __u8 stall_stage;
>> /* Stall reason */
>> __u8 stall_reason;
>> __u16 pad;
>> };
>>
>> ... which can be read by user from mmap() ring buffer. With this
>> approach, sample perf report in hazard mode looks like (On IBM
>> PowerPC):
>>
>> # ./perf record --hazard ./ebizzy
>> # ./perf report --hazard
>> Overhead Symbol Shared Instruction Type Hazard Stage Hazard Reason Stall Stage Stall Reason ICache access
>> 36.58% [.] thread_run ebizzy Load LSU Mispredict LSU Load fin L1 hit
>> 9.46% [.] thread_run ebizzy Load LSU Mispredict LSU Dcache_miss L1 hit
>> 1.76% [.] thread_run ebizzy Fixed point - - - - L1 hit
>> 1.31% [.] thread_run ebizzy Load LSU ERAT Miss LSU Load fin L1 hit
>> 1.27% [.] thread_run ebizzy Load LSU Mispredict - - L1 hit
>> 1.16% [.] thread_run ebizzy Fixed point - - FXU Fixed cycle L1 hit
>> 0.50% [.] thread_run ebizzy Fixed point ISU Source Unavailable FXU Fixed cycle L1 hit
>> 0.30% [.] thread_run ebizzy Load LSU LMQ Full, DERAT Miss LSU Load fin L1 hit
>> 0.24% [.] thread_run ebizzy Load LSU ERAT Miss - - L1 hit
>> 0.08% [.] thread_run ebizzy - - - BRU Fixed cycle L1 hit
>> 0.05% [.] thread_run ebizzy Branch - - BRU Fixed cycle L1 hit
>> 0.04% [.] thread_run ebizzy Fixed point ISU Source Unavailable - - L1 hit
>
> How are these to be interpreted? This is great information, but is it possible to make it more readable for non-experts?

For the RFC proposal we just pulled the details from the spec. But yes, we
will look into this.

> If each of these map 1:1 with hardware events, should you emit the name of the event here, so that can be used to look up further information? For example, does the first line map to PM_CMPLU_STALL_LSU_FIN?

I'm using the PM_MRK_INST_CMPL event with perf record, and the SIER provides
all this information.

> What was "Mispredict[ed]"? (Is it different from a branch misprediction?) And how does this relate to "L1 hit"?

I'm not 100% sure. I'll check with the hw folks about it.

> Can we emit "Load finish" instead of "Load fin" for easier reading? 03/11 also has "Marked fin before NTC".
> Nit: why does "Dcache_miss" have an underscore and none of the others?

Sure. Will change it.

>
>> Also perf annotate with hazard data:
>
>> │ static int
>> │ compare(const void *p1, const void *p2)
>> │ {
>> 33.23 │ std r31,-8(r1)
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>> │ {haz_stage: LSU, haz_reason: Load Hit Store, stall_stage: LSU, stall_reason: -, icache: L3 hit}
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>> 0.84 │ stdu r1,-64(r1)
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
>> 0.24 │ mr r31,r1
>> │ {haz_stage: -, haz_reason: -, stall_stage: -, stall_reason: -, icache: L1 hit}
>> 21.18 │ std r3,32(r31)
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>> │ {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>>
>>
>> Patches:
>> - Patch #1 is a simple cleanup patch
>> - Patch #2, #3, #4 implements generic and arch specific kernel
>> infrastructure
>> - Patch #5 enables perf record and script with hazard mode
>> - Patch #6, #7, #8 enables perf report with hazard mode
>> - Patch #9, #10, #11 enables perf annotate with hazard mode
>>
>> Note:
>> - This series is based on the talk by Madhavan in LPC 2018[4]. This is
>> just an early RFC to get comments about the approach and not intended
>> to be merged yet.
>> - I've prepared the series base on v5.6-rc3. But it depends on generic
>> perf annotate fixes [5][6] which are already merged by Arnaldo in
>> perf/urgent and perf/core.
>>
>> [1]: Book III, Section 9.4.10:
>> https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
>> [2]: https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf#G9.1106986
>
> This document is also available from the "IBM Portal for OpenPOWER" under the "All IBM Material for OpenPOWER" https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=OpenPOWER, under each of the individual modules. (Well hidden, it might be said, and not a simple link like you have here.)

Thanks for pointing it out :)
Ravi

2020-03-05 05:09:03

by Ravi Bangoria

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

Hi Andi,

Sorry for being a bit late.

On 3/3/20 7:03 AM, Andi Kleen wrote:
> On Mon, Mar 02, 2020 at 11:13:32AM +0100, Peter Zijlstra wrote:
>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>> Modern processors export such hazard data in Performance
>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>> AMD[3] provides similar information.
>>>
>>> Implementation detail:
>>>
>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>> If it's set, kernel converts arch specific hazard information
>>> into generic format:
>>>
>>> struct perf_pipeline_haz_data {
>>> /* Instruction/Opcode type: Load, Store, Branch .... */
>>> __u8 itype;
>>> /* Instruction Cache source */
>>> __u8 icache;
>>> /* Instruction suffered hazard in pipeline stage */
>>> __u8 hazard_stage;
>>> /* Hazard reason */
>>> __u8 hazard_reason;
>>> /* Instruction suffered stall in pipeline stage */
>>> __u8 stall_stage;
>>> /* Stall reason */
>>> __u8 stall_reason;
>>> __u16 pad;
>>> };
>>
>> Kim, does this format indeed work for AMD IBS?
>
> Intel PEBS has a similar concept for annotation of memory accesses,
> which is already exported through perf_mem_data_src. This is essentially
> an extension. It would be better to have something unified here.
> Right now it seems to duplicate at least part of the PEBS facility.

IIUC there is a distinction between perf mem and exposing pipeline details.
perf-mem/perf_mem_data_src is more about memory access profiling, whereas
the proposal here is to expose pipeline related details like stalls and
latencies. I would prefer/suggest not to extend the current structure
further to capture pipeline details.

Ravi

2020-03-05 22:08:44

by Kim Phillips

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

On 3/4/20 10:46 PM, Ravi Bangoria wrote:
> Hi Kim,

Hi Ravi,

> On 3/3/20 3:55 AM, Kim Phillips wrote:
>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>>
>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>> Modern processors export such hazard data in Performance
>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>> AMD[3] provides similar information.
>>>>>
>>>>> Implementation detail:
>>>>>
>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>> If it's set, kernel converts arch specific hazard information
>>>>> into generic format:
>>>>>
>>>>>    struct perf_pipeline_haz_data {
>>>>>           /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>           __u8    itype;
>>>>>           /* Instruction Cache source */
>>>>>           __u8    icache;
>>>>>           /* Instruction suffered hazard in pipeline stage */
>>>>>           __u8    hazard_stage;
>>>>>           /* Hazard reason */
>>>>>           __u8    hazard_reason;
>>>>>           /* Instruction suffered stall in pipeline stage */
>>>>>           __u8    stall_stage;
>>>>>           /* Stall reason */
>>>>>           __u8    stall_reason;
>>>>>           __u16   pad;
>>>>>    };
>>>>
>>>> Kim, does this format indeed work for AMD IBS?
>>
>> It's not really 1:1, we don't have these separations of stages
>> and reasons, for example: we have missed in L2 cache, for example.
>> So IBS output is flatter, with more cycle latency figures than
>> IBM's AFAICT.
>
> AMD IBS captures pipeline latency data incase Fetch sampling like the
> Fetch latency, tag to retire latency, completion to retire latency and
> so on. Yes, Ops sampling do provide more data on load/store centric
> information. But it also captures more detailed data for Branch instructions.
> And we also looked at ARM SPE, which also captures more details pipeline
> data and latency information.
>
>>> Personally, I don't like the term hazard. This is too IBM Power
>>> specific. We need to find a better term, maybe stall or penalty.
>>
>> Right, IBS doesn't have a filter to only count stalled or otherwise
>> bad events.  IBS' PPR descriptions has one occurrence of the
>> word stall, and no penalty.  The way I read IBS is it's just
>> reporting more sample data than just the precise IP: things like
>> hits, misses, cycle latencies, addresses, types, etc., so words
>> like 'extended', or the 'auxiliary' already used today even
>> are more appropriate for IBS, although I'm the last person to
>> bikeshed.
>
> We are thinking of using "pipeline" word instead of Hazard.

Hm, the word 'pipeline' occurs 0 times in IBS documentation.

I realize there are a couple of core pipeline-specific pieces
of information coming out of it, but the vast majority
are addresses, latencies of various components in the memory
hierarchy, and various component hit/miss bits.

What's needed here is a vendor-specific extended
sample information that all these technologies gather,
of which things like e.g., 'L1 TLB cycle latency' we
all should have in common.

I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
either. Can we use PERF_SAMPLE_AUX instead? Take a look at
commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
definitions". The sample identifier can be used to determine
which vendor's sampling IP's data is in it, and events can
be recorded just by copying the content of the SIER, etc.
registers, and then events get synthesized from the aux
sample at report/inject/annotate etc. time. This allows
for less sample recording overhead, and moves all the vendor
specific decoding and common event conversions for userspace
to figure out.
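
A minimal sketch of the report-time synthesis Kim describes might look like
this. The vendor tags, record layout, and the bit positions in the decode
are placeholders, not the real SIER/IBS formats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative vendor tags; real code would key off the recorded PMU id. */
enum haz_vendor { HAZ_VENDOR_IBM_SIER = 1, HAZ_VENDOR_AMD_IBS = 2 };

/* A raw record as it might sit in the AUX area: a tag plus the raw
 * register image copied at PMI time (SIER on Power, IbsOp* on AMD). */
struct haz_raw {
	uint32_t vendor;
	uint32_t size;
	uint64_t regs[4];
};

/* Generic event synthesized at report time (mirrors the RFC's struct). */
struct haz_event {
	uint8_t itype, icache, hazard_stage, hazard_reason;
};

/* Dispatch on the vendor tag and decode the raw registers into the
 * generic format; the shifts/masks below are placeholder decodes. */
static int synthesize_haz_event(const struct haz_raw *raw,
				struct haz_event *ev)
{
	memset(ev, 0, sizeof(*ev));
	switch (raw->vendor) {
	case HAZ_VENDOR_IBM_SIER:
		ev->itype        = raw->regs[0] & 0xff;
		ev->hazard_stage = (raw->regs[0] >> 8) & 0xff;
		return 0;
	case HAZ_VENDOR_AMD_IBS:
		ev->itype = raw->regs[1] & 0xff;
		return 0;
	default:
		return -1; /* unknown vendor: leave record undecoded */
	}
}
```

Record time then reduces to a register copy, and all vendor-specific
knowledge lives in userspace decoders like the one above.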

>>> Also worth considering is the support of ARM SPE (Statistical
>>> Profiling Extension) which is their version of IBS.
>>> Whatever gets added need to cover all three with no limitations.
>>
>> I thought Intel's various LBR, PEBS, and PT supported providing
>> similar sample data in perf already, like with perf mem/c2c?
>
> perf-mem is more of data centric in my opinion. It is more towards
> memory profiling. So proposal here is to expose pipeline related
> details like stalls and latencies.

Like I said, I don't see it that way, I see it as "any particular
vendor's event's extended details', and these pipeline details
have overlap with existing infrastructure within perf, e.g., L2
cache misses.

Kim

2020-03-11 16:08:46

by Ravi Bangoria

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

Hi Kim,

On 3/6/20 3:36 AM, Kim Phillips wrote:
>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>>>
>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>> Modern processors export such hazard data in Performance
>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>> AMD[3] provides similar information.
>>>>>>
>>>>>> Implementation detail:
>>>>>>
>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>> into generic format:
>>>>>>
>>>>>>    struct perf_pipeline_haz_data {
>>>>>>           /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>           __u8    itype;
>>>>>>           /* Instruction Cache source */
>>>>>>           __u8    icache;
>>>>>>           /* Instruction suffered hazard in pipeline stage */
>>>>>>           __u8    hazard_stage;
>>>>>>           /* Hazard reason */
>>>>>>           __u8    hazard_reason;
>>>>>>           /* Instruction suffered stall in pipeline stage */
>>>>>>           __u8    stall_stage;
>>>>>>           /* Stall reason */
>>>>>>           __u8    stall_reason;
>>>>>>           __u16   pad;
>>>>>>    };
>>>>>
>>>>> Kim, does this format indeed work for AMD IBS?
>>>
>>> It's not really 1:1, we don't have these separations of stages
>>> and reasons, for example: we have missed in L2 cache, for example.
>>> So IBS output is flatter, with more cycle latency figures than
>>> IBM's AFAICT.
>>
>> AMD IBS captures pipeline latency data incase Fetch sampling like the
>> Fetch latency, tag to retire latency, completion to retire latency and
>> so on. Yes, Ops sampling do provide more data on load/store centric
>> information. But it also captures more detailed data for Branch instructions.
>> And we also looked at ARM SPE, which also captures more details pipeline
>> data and latency information.
>>
>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>> specific. We need to find a better term, maybe stall or penalty.
>>>
>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>> bad events.  IBS' PPR descriptions has one occurrence of the
>>> word stall, and no penalty.  The way I read IBS is it's just
>>> reporting more sample data than just the precise IP: things like
>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>> like 'extended', or the 'auxiliary' already used today even
>>> are more appropriate for IBS, although I'm the last person to
>>> bikeshed.
>>
>> We are thinking of using "pipeline" word instead of Hazard.
>
> Hm, the word 'pipeline' occurs 0 times in IBS documentation.

NP. We thought "pipeline" is a generic hw term, so we proposed that word.
We are open to any term that is generic enough.

>
> I realize there are a couple of core pipeline-specific pieces
> of information coming out of it, but the vast majority
> are addresses, latencies of various components in the memory
> hierarchy, and various component hit/miss bits.

Yes, we should capture core pipeline specific details. For example,
IBS generates branch unit information (IbsOpData1) and Icache related
data (IbsFetchCtl), which is something that shouldn't be extended as
part of perf-mem, IMO.

>
> What's needed here is a vendor-specific extended
> sample information that all these technologies gather,
> of which things like e.g., 'L1 TLB cycle latency' we
> all should have in common.

Yes. We will include fields to capture the latency cycles (like Issue
latency, Instruction completion latency etc..) along with other pipeline
details in the proposed structure.

>
> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
> either. Can we use PERF_SAMPLE_AUX instead?

We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for
cases where a large volume of data needs to be captured as part of perf.data
without frequent PMIs. But the proposed type is meant to capture pipeline
information on each sample, using a PMI at periodic intervals. Hence we are
proposing PERF_SAMPLE_PIPELINE_HAZ.

> Take a look at
> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
> definitions". The sample identifier can be used to determine
> which vendor's sampling IP's data is in it, and events can
> be recorded just by copying the content of the SIER, etc.
> registers, and then events get synthesized from the aux
> sample at report/inject/annotate etc. time. This allows
> for less sample recording overhead, and moves all the vendor
> specific decoding and common event conversions for userspace
> to figure out.

Once the AUX buffer data is structured, the tool side changes added to
present the pipeline data can be re-used.

>
>>>> Also worth considering is the support of ARM SPE (Statistical
>>>> Profiling Extension) which is their version of IBS.
>>>> Whatever gets added need to cover all three with no limitations.
>>>
>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>> similar sample data in perf already, like with perf mem/c2c?
>>
>> perf-mem is more of data centric in my opinion. It is more towards
>> memory profiling. So proposal here is to expose pipeline related
>> details like stalls and latencies.
>
> Like I said, I don't see it that way, I see it as "any particular
> vendor's event's extended details', and these pipeline details
> have overlap with existing infrastructure within perf, e.g., L2
> cache misses.
>
> Kim
>

2020-03-12 22:40:52

by Kim Phillips

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

On 3/11/20 11:00 AM, Ravi Bangoria wrote:
> Hi Kim,

Hi Ravi,

> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>>>>
>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>> Modern processors export such hazard data in Performance
>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>> AMD[3] provides similar information.
>>>>>>>
>>>>>>> Implementation detail:
>>>>>>>
>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>> into generic format:
>>>>>>>
>>>>>>>     struct perf_pipeline_haz_data {
>>>>>>>            /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>            __u8    itype;
>>>>>>>            /* Instruction Cache source */
>>>>>>>            __u8    icache;
>>>>>>>            /* Instruction suffered hazard in pipeline stage */
>>>>>>>            __u8    hazard_stage;
>>>>>>>            /* Hazard reason */
>>>>>>>            __u8    hazard_reason;
>>>>>>>            /* Instruction suffered stall in pipeline stage */
>>>>>>>            __u8    stall_stage;
>>>>>>>            /* Stall reason */
>>>>>>>            __u8    stall_reason;
>>>>>>>            __u16   pad;
>>>>>>>     };
>>>>>>
>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>
>>>> It's not really 1:1, we don't have these separations of stages
>>>> and reasons, for example: we have missed in L2 cache, for example.
>>>> So IBS output is flatter, with more cycle latency figures than
>>>> IBM's AFAICT.
>>>
>>> AMD IBS captures pipeline latency data incase Fetch sampling like the
>>> Fetch latency, tag to retire latency, completion to retire latency and
>>> so on. Yes, Ops sampling do provide more data on load/store centric
>>> information. But it also captures more detailed data for Branch instructions.
>>> And we also looked at ARM SPE, which also captures more details pipeline
>>> data and latency information.
>>>
>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>
>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>> bad events.  IBS' PPR descriptions has one occurrence of the
>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>> reporting more sample data than just the precise IP: things like
>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>> like 'extended', or the 'auxiliary' already used today even
>>>> are more appropriate for IBS, although I'm the last person to
>>>> bikeshed.
>>>
>>> We are thinking of using "pipeline" word instead of Hazard.
>>
>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>
> NP. We thought pipeline is generic hw term so we proposed "pipeline"
> word. We are open to term which can be generic enough.
>
>>
>> I realize there are a couple of core pipeline-specific pieces
>> of information coming out of it, but the vast majority
>> are addresses, latencies of various components in the memory
>> hierarchy, and various component hit/miss bits.
>
> Yes. we should capture core pipeline specific details. For example,
> IBS generates Branch unit information(IbsOpData1) and Icahce related
> data(IbsFetchCtl) which is something that shouldn't be extended as
> part of perf-mem, IMO.

Sure, IBS Op-side output is more 'perf mem' friendly, and so it
should populate perf_mem_data_src fields, just like POWER9 can:

union perf_mem_data_src {
...
	__u64	mem_rsvd:24,
		mem_snoopx:2,	/* snoop mode, ext */
		mem_remote:1,	/* remote */
		mem_lvl_num:4,	/* memory hierarchy level number */
		mem_dtlb:7,	/* tlb access */
		mem_lock:2,	/* lock instr */
		mem_snoop:5,	/* snoop mode */
		mem_lvl:14,	/* memory hierarchy level */
		mem_op:5;	/* type of opcode */


E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
'mem_lock', and the Reload Bus Source Encoding bits can
be used to populate mem_snoop, right?

For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
used for the ld/st target addresses, too.
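
The mapping sketched above can be exercised with a userspace mock of the
union (little-endian member order; the OP/LVL constants are mirrored from
uapi/linux/perf_event.h and redefined here so the sketch stands alone; the
SIER-to-data_src decode itself is an assumption, not patch code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t __u64;

/* Little-endian declaration of the union quoted above. */
union perf_mem_data_src {
	__u64 val;
	struct {
		__u64	mem_op:5,	/* type of opcode */
			mem_lvl:14,	/* memory hierarchy level */
			mem_snoop:5,	/* snoop mode */
			mem_lock:2,	/* lock instr */
			mem_dtlb:7,	/* tlb access */
			mem_lvl_num:4,	/* memory hierarchy level number */
			mem_remote:1,	/* remote */
			mem_snoopx:2,	/* snoop mode, ext */
			mem_rsvd:24;
	};
};

#define PERF_MEM_OP_LOAD	0x02
#define PERF_MEM_LVL_HIT	0x02
#define PERF_MEM_LVL_L1		0x08

/* Sketch of the mapping: a SIER-reported load that hit in the L1
 * folded into data_src (assumed decode, for illustration only). */
static union perf_mem_data_src encode_l1_load_hit(void)
{
	union perf_mem_data_src d = { .val = 0 };

	d.mem_op  = PERF_MEM_OP_LOAD;
	d.mem_lvl = PERF_MEM_LVL_L1 | PERF_MEM_LVL_HIT;
	return d;
}
```

Once populated this way, the existing perf mem/c2c tooling can consume the
samples without any vendor-specific changes.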

>> What's needed here is a vendor-specific extended
>> sample information that all these technologies gather,
>> of which things like e.g., 'L1 TLB cycle latency' we
>> all should have in common.
>
> Yes. We will include fields to capture the latency cycles (like Issue
> latency, Instruction completion latency etc..) along with other pipeline
> details in the proposed structure.

Latency figures are just an example, and from what I
can tell, struct perf_sample_data already has a 'weight' member,
used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
transfer memory access latency figures. Granted, that's
a bad name given all other vendors don't call latency
'weight'.

I didn't see any latency figures coming out of POWER9,
and do not expect this patchseries to implement those
of other vendors, e.g., AMD's IBS; leave each vendor
to amend perf to suit their own h/w output please.

My main point there, however, was that each vendor should
use streamlined record-level code to just copy the data
in the proprietary format that their hardware produces,
and then then perf tooling can synthesize the events
from the raw data at report/script/etc. time.

>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>> either.  Can we use PERF_SAMPLE_AUX instead?
>
> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
> large volume of data needs to be captured as part of perf.data without
> frequent PMIs. But proposed type is to address the capture of pipeline

SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
PMIs are, even though it may be used in those environments.

> information on each sample using PMI at periodic intervals. Hence proposing
> PERF_SAMPLE_PIPELINE_HAZ.

And that's fine for any extra bits that POWER9 has to convey
to its users beyond things already represented by other sample
types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
and other vendor e.g., AMD IBS data can be made vendor-independent
at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
what IBS currently uses.

>>  Take a look at
>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>> definitions".  The sample identifier can be used to determine
>> which vendor's sampling IP's data is in it, and events can
>> be recorded just by copying the content of the SIER, etc.
>> registers, and then events get synthesized from the aux
>> sample at report/inject/annotate etc. time.  This allows
>> for less sample recording overhead, and moves all the vendor
>> specific decoding and common event conversions for userspace
>> to figure out.
>
> When AUX buffer data is structured, tool side changes added to present the
> pipeline data can be re-used.

Not sure I understand: AUX data would be structured on
each vendor's raw h/w register formats.

Thanks,

Kim

>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>> Profiling Extension) which is their version of IBS.
>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>
>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>> similar sample data in perf already, like with perf mem/c2c?
>>>
>>> perf-mem is more data centric, in my opinion. It is more towards
>>> memory profiling. So the proposal here is to expose pipeline-related
>>> details like stalls and latencies.
>>
>> Like I said, I don't see it that way, I see it as "any particular
>> vendor's event's extended details', and these pipeline details
>> have overlap with existing infrastructure within perf, e.g., L2
>> cache misses.
>>
>> Kim
>>
>

2020-03-17 06:51:47

by Madhavan Srinivasan

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information



On 3/13/20 4:08 AM, Kim Phillips wrote:
> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>> Hi Kim,
> Hi Ravi,
>
>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>> AMD[3] provides similar information.
>>>>>>>>
>>>>>>>> Implementation detail:
>>>>>>>>
>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>> into generic format:
>>>>>>>>
>>>>>>>>     struct perf_pipeline_haz_data {
>>>>>>>>            /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>            __u8    itype;
>>>>>>>>            /* Instruction Cache source */
>>>>>>>>            __u8    icache;
>>>>>>>>            /* Instruction suffered hazard in pipeline stage */
>>>>>>>>            __u8    hazard_stage;
>>>>>>>>            /* Hazard reason */
>>>>>>>>            __u8    hazard_reason;
>>>>>>>>            /* Instruction suffered stall in pipeline stage */
>>>>>>>>            __u8    stall_stage;
>>>>>>>>            /* Stall reason */
>>>>>>>>            __u8    stall_reason;
>>>>>>>>            __u16   pad;
>>>>>>>>     };
>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>> It's not really 1:1, we don't have these separations of stages
>>>>> and reasons; we have 'missed in L2 cache', for example.
>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>> IBM's AFAICT.
>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>> Fetch latency, tag to retire latency, completion to retire latency and
>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>> information. But it also captures more detailed data for Branch instructions.
>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>> data and latency information.
>>>>
>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>>> reporting more sample data than just the precise IP: things like
>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>> are more appropriate for IBS, although I'm the last person to
>>>>> bikeshed.
>>>> We are thinking of using "pipeline" word instead of Hazard.
>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>> NP. We thought pipeline is a generic hw term, so we proposed the
>> "pipeline" word. We are open to any term that is generic enough.
>>
>>> I realize there are a couple of core pipeline-specific pieces
>>> of information coming out of it, but the vast majority
>>> are addresses, latencies of various components in the memory
>>> hierarchy, and various component hit/miss bits.
>> Yes. we should capture core pipeline specific details. For example,
>> IBS generates Branch unit information (IbsOpData1) and Icache related
>> data (IbsFetchCtl), which is something that shouldn't be extended as
>> part of perf-mem, IMO.
> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
> should populate perf_mem_data_src fields, just like POWER9 can:
>
> union perf_mem_data_src {
> ...
> __u64 mem_rsvd:24,
> mem_snoopx:2, /* snoop mode, ext */
> mem_remote:1, /* remote */
> mem_lvl_num:4, /* memory hierarchy level number */
> mem_dtlb:7, /* tlb access */
> mem_lock:2, /* lock instr */
> mem_snoop:5, /* snoop mode */
> mem_lvl:14, /* memory hierarchy level */
> mem_op:5; /* type of opcode */
>
>
> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
> 'mem_lock', and the Reload Bus Source Encoding bits can
> be used to populate mem_snoop, right?
Hi Kim,

Yes. We do expose these data as part of perf-mem for POWER.


> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
> used for the ld/st target addresses, too.
>
>>> What's needed here is a vendor-specific extended
>>> sample information that all these technologies gather,
>>> of which things like e.g., 'L1 TLB cycle latency' we
>>> all should have in common.
>> Yes. We will include fields to capture the latency cycles (like Issue
>> latency, Instruction completion latency etc..) along with other pipeline
>> details in the proposed structure.
> Latency figures are just an example, and from what I
> can tell, struct perf_sample_data already has a 'weight' member,
> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
> transfer memory access latency figures. Granted, that's
> a bad name given all other vendors don't call latency
> 'weight'.
>
> I didn't see any latency figures coming out of POWER9,
> and do not expect this patchseries to implement those
> of other vendors, e.g., AMD's IBS; leave each vendor
> to amend perf to suit their own h/w output please.

The reference structure proposed in this patchset did not have members
to capture latency info for that exact reason. But the idea here is to
abstract away as much vendor-specific detail as possible. So if we
include a u16 array, then this format can also capture data from IBS,
since it provides a few latency details.


>
> My main point there, however, was that each vendor should
> use streamlined record-level code to just copy the data
> in the proprietary format that their hardware produces,
> and then perf tooling can synthesize the events
> from the raw data at report/script/etc. time.
>
>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>> either.  Can we use PERF_SAMPLE_AUX instead?
>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
>> large volume of data needs to be captured as part of perf.data without
>> frequent PMIs. But proposed type is to address the capture of pipeline
> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
> PMIs are, even though it may be used in those environments.
>
>> information on each sample using PMI at periodic intervals. Hence proposing
>> PERF_SAMPLE_PIPELINE_HAZ.
> And that's fine for any extra bits that POWER9 has to convey
> to its users beyond things already represented by other sample
> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
> and other vendor e.g., AMD IBS data can be made vendor-independent
> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
> what IBS currently uses.

My bad, I'm not sure what you mean by this. We are trying to abstract
as much vendor-specific data as possible with this (like perf-mem).


Maddy
>
>>>  Take a look at
>>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>>> definitions".  The sample identifier can be used to determine
>>> which vendor's sampling IP's data is in it, and events can
>>> be recorded just by copying the content of the SIER, etc.
>>> registers, and then events get synthesized from the aux
>>> sample at report/inject/annotate etc. time.  This allows
>>> for less sample recording overhead, and moves all the vendor
>>> specific decoding and common event conversions for userspace
>>> to figure out.
>> When AUX buffer data is structured, tool side changes added to present the
>> pipeline data can be re-used.
> Not sure I understand: AUX data would be structured on
> each vendor's raw h/w register formats.
>
> Thanks,
>
> Kim
>
>>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>>> Profiling Extension) which is their version of IBS.
>>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>>> similar sample data in perf already, like with perf mem/c2c?
>>>> perf-mem is more data centric, in my opinion. It is more towards
>>>> memory profiling. So the proposal here is to expose pipeline-related
>>>> details like stalls and latencies.
>>> Like I said, I don't see it that way, I see it as "any particular
>>> vendor's event's extended details', and these pipeline details
>>> have overlap with existing infrastructure within perf, e.g., L2
>>> cache misses.
>>>
>>> Kim
>>>

2020-03-18 17:37:56

by Kim Phillips

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

Hi Maddy,

On 3/17/20 1:50 AM, maddy wrote:
> On 3/13/20 4:08 AM, Kim Phillips wrote:
>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>
>>>>>>>>> Implementation detail:
>>>>>>>>>
>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>> into generic format:
>>>>>>>>>
>>>>>>>>>      struct perf_pipeline_haz_data {
>>>>>>>>>             /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>             __u8    itype;
>>>>>>>>>             /* Instruction Cache source */
>>>>>>>>>             __u8    icache;
>>>>>>>>>             /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>             __u8    hazard_stage;
>>>>>>>>>             /* Hazard reason */
>>>>>>>>>             __u8    hazard_reason;
>>>>>>>>>             /* Instruction suffered stall in pipeline stage */
>>>>>>>>>             __u8    stall_stage;
>>>>>>>>>             /* Stall reason */
>>>>>>>>>             __u8    stall_reason;
>>>>>>>>>             __u16   pad;
>>>>>>>>>      };
>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>> It's not really 1:1, we don't have these separations of stages
>>>>>> and reasons; we have 'missed in L2 cache', for example.
>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>> IBM's AFAICT.
>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>> Fetch latency, tag to retire latency, completion to retire latency and
>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>> data and latency information.
>>>>>
>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>>>> reporting more sample data than just the precise IP: things like
>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>> bikeshed.
>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>> NP. We thought pipeline is a generic hw term, so we proposed the
>>> "pipeline" word. We are open to any term that is generic enough.
>>>
>>>> I realize there are a couple of core pipeline-specific pieces
>>>> of information coming out of it, but the vast majority
>>>> are addresses, latencies of various components in the memory
>>>> hierarchy, and various component hit/miss bits.
>>> Yes. we should capture core pipeline specific details. For example,
>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>> part of perf-mem, IMO.
>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>> should populate perf_mem_data_src fields, just like POWER9 can:
>>
>> union perf_mem_data_src {
>> ...
>>                  __u64   mem_rsvd:24,
>>                          mem_snoopx:2,   /* snoop mode, ext */
>>                          mem_remote:1,   /* remote */
>>                          mem_lvl_num:4,  /* memory hierarchy level number */
>>                          mem_dtlb:7,     /* tlb access */
>>                          mem_lock:2,     /* lock instr */
>>                          mem_snoop:5,    /* snoop mode */
>>                          mem_lvl:14,     /* memory hierarchy level */
>>                          mem_op:5;       /* type of opcode */
>>
>>
>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>> 'mem_lock', and the Reload Bus Source Encoding bits can
>> be used to populate mem_snoop, right?
> Hi Kim,
>
> Yes. We do expose these data as part of perf-mem for POWER.

OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
isa207_find_source now, thanks.

>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>> used for the ld/st target addresses, too.
>>
>>>> What's needed here is a vendor-specific extended
>>>> sample information that all these technologies gather,
>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>> all should have in common.
>>> Yes. We will include fields to capture the latency cycles (like Issue
>>> latency, Instruction completion latency etc..) along with other pipeline
>>> details in the proposed structure.
>> Latency figures are just an example, and from what I
>> can tell, struct perf_sample_data already has a 'weight' member,
>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>> transfer memory access latency figures.  Granted, that's
>> a bad name given all other vendors don't call latency
>> 'weight'.
>>
>> I didn't see any latency figures coming out of POWER9,
>> and do not expect this patchseries to implement those
>> of other vendors, e.g., AMD's IBS; leave each vendor
>> to amend perf to suit their own h/w output please.
>
> Reference structure proposed in this patchset did not have members
> to capture latency info for that exact reason. But idea here is to
> abstract  as vendor specific as possible. So if we include u16 array,
> then this format can also capture data from IBS since it provides
> few latency details.

OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
struct presented in this patchset.

IBS Ops can report, e.g.:

15 bits of tag-to-retire cycle count,
15 bits of completion-to-retire count,
15 bits of L1 DTLB refill latency,
15 bits of DC miss latency,
5 bits of outstanding memory requests on mem refill, and so on.

IBS Fetch reports 15 bits of fetch latency, and another 16
for iTLB latency, among others.

Some of these may/may not be valid simultaneously, and
there are IBS specific rules to establish validity.

>> My main point there, however, was that each vendor should
>> use streamlined record-level code to just copy the data
>> in the proprietary format that their hardware produces,
>> and then perf tooling can synthesize the events
>> from the raw data at report/script/etc. time.
>>
>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
>>> large volume of data needs to be captured as part of perf.data without
>>> frequent PMIs. But proposed type is to address the capture of pipeline
>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>> PMIs are, even though it may be used in those environments.
>>
>>> information on each sample using PMI at periodic intervals. Hence proposing
>>> PERF_SAMPLE_PIPELINE_HAZ.
>> And that's fine for any extra bits that POWER9 has to convey
>> to its users beyond things already represented by other sample
>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>> and other vendor e.g., AMD IBS data can be made vendor-independent
>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>> what IBS currently uses.
>
> My bad. Not sure what you mean by this. We are trying to abstract
> as much vendor specific data as possible with this (like perf-mem).

Perhaps if I say it this way: instead of doing all the
isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
in patch 4/11, just put the raw SIER value in a
PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
Specific SIER capabilities can be written as part of the perf.data
header. Then synthesize the true pipeline events from the raw SIER
values later, in userspace.

I guess it's technically optional, but I think that's how
I'd do it in IBS, since it minimizes the record-time overhead.

Thanks,

Kim

> Maddy
>>
>>>>   Take a look at
>>>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>>>> definitions".  The sample identifier can be used to determine
>>>> which vendor's sampling IP's data is in it, and events can
>>>> be recorded just by copying the content of the SIER, etc.
>>>> registers, and then events get synthesized from the aux
>>>> sample at report/inject/annotate etc. time.  This allows
>>>> for less sample recording overhead, and moves all the vendor
>>>> specific decoding and common event conversions for userspace
>>>> to figure out.
>>> When AUX buffer data is structured, tool side changes added to present the
>>> pipeline data can be re-used.
>> Not sure I understand: AUX data would be structured on
>> each vendor's raw h/w register formats.
>>
>> Thanks,
>>
>> Kim
>>
>>>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>>>> Profiling Extension) which is their version of IBS.
>>>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>>>> similar sample data in perf already, like with perf mem/c2c?
>>>>> perf-mem is more data centric, in my opinion. It is more towards
>>>>> memory profiling. So the proposal here is to expose pipeline-related
>>>>> details like stalls and latencies.
>>>> Like I said, I don't see it that way, I see it as "any particular
>>>> vendor's event's extended details', and these pipeline details
>>>> have overlap with existing infrastructure within perf, e.g., L2
>>>> cache misses.
>>>>
>>>> Kim
>>>>
>

2020-03-19 11:24:21

by Michael Ellerman

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

Kim Phillips <[email protected]> writes:
> On 3/17/20 1:50 AM, maddy wrote:
>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>
>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>>
>>> And that's fine for any extra bits that POWER9 has to convey
>>> to its users beyond things already represented by other sample
>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>> and other vendor e.g., AMD IBS data can be made vendor-independent
>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>> what IBS currently uses.
>>
>> My bad. Not sure what you mean by this. We are trying to abstract
>> as much vendor specific data as possible with this (like perf-mem).
>
> Perhaps if I say it this way: instead of doing all the
> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
> in patch 4/11, rather/instead just put the raw sier value in a
> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
> Specific SIER capabilities can be written as part of the perf.data
> header. Then synthesize the true pipe events from the raw SIER
> values later, and in userspace.

In the past the perf maintainers have wanted the perf API to abstract
over the specific CPU details, rather than just pushing raw register
values out to userspace.

But maybe that's no longer the case and we should just use
PERF_SAMPLE_AUX?

cheers

2020-03-26 10:21:03

by Madhavan Srinivasan

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information



On 3/18/20 11:05 PM, Kim Phillips wrote:
> Hi Maddy,
>
> On 3/17/20 1:50 AM, maddy wrote:
>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>
>>>>>>>>>> Implementation detail:
>>>>>>>>>>
>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>> into generic format:
>>>>>>>>>>
>>>>>>>>>>      struct perf_pipeline_haz_data {
>>>>>>>>>>             /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>             __u8    itype;
>>>>>>>>>>             /* Instruction Cache source */
>>>>>>>>>>             __u8    icache;
>>>>>>>>>>             /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>             __u8    hazard_stage;
>>>>>>>>>>             /* Hazard reason */
>>>>>>>>>>             __u8    hazard_reason;
>>>>>>>>>>             /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>             __u8    stall_stage;
>>>>>>>>>>             /* Stall reason */
>>>>>>>>>>             __u8    stall_reason;
>>>>>>>>>>             __u16   pad;
>>>>>>>>>>      };
>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>> It's not really 1:1, we don't have these separations of stages
>>>>>>> and reasons; we have 'missed in L2 cache', for example.
>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>> IBM's AFAICT.
>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>>> Fetch latency, tag to retire latency, completion to retire latency and
>>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>>> data and latency information.
>>>>>>
>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>> bikeshed.
>>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>> NP. We thought pipeline is a generic hw term, so we proposed the
>>>> "pipeline" word. We are open to any term that is generic enough.
>>>>
>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>> of information coming out of it, but the vast majority
>>>>> are addresses, latencies of various components in the memory
>>>>> hierarchy, and various component hit/miss bits.
>>>> Yes. we should capture core pipeline specific details. For example,
>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>>> part of perf-mem, IMO.
>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>
>>> union perf_mem_data_src {
>>> ...
>>>                  __u64   mem_rsvd:24,
>>>                          mem_snoopx:2,   /* snoop mode, ext */
>>>                          mem_remote:1,   /* remote */
>>>                          mem_lvl_num:4,  /* memory hierarchy level number */
>>>                          mem_dtlb:7,     /* tlb access */
>>>                          mem_lock:2,     /* lock instr */
>>>                          mem_snoop:5,    /* snoop mode */
>>>                          mem_lvl:14,     /* memory hierarchy level */
>>>                          mem_op:5;       /* type of opcode */
>>>
>>>
>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>> be used to populate mem_snoop, right?
>> Hi Kim,
>>
>> Yes. We do expose these data as part of perf-mem for POWER.
> OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
> isa207_find_source now, thanks.
>
>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>> used for the ld/st target addresses, too.
>>>
>>>>> What's needed here is a vendor-specific extended
>>>>> sample information that all these technologies gather,
>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>> all should have in common.
>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>> latency, Instruction completion latency etc..) along with other pipeline
>>>> details in the proposed structure.
>>> Latency figures are just an example, and from what I
>>> can tell, struct perf_sample_data already has a 'weight' member,
>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>> transfer memory access latency figures.  Granted, that's
>>> a bad name given all other vendors don't call latency
>>> 'weight'.
>>>
>>> I didn't see any latency figures coming out of POWER9,
>>> and do not expect this patchseries to implement those
>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>> to amend perf to suit their own h/w output please.
>> Reference structure proposed in this patchset did not have members
>> to capture latency info for that exact reason. But idea here is to
>> abstract  as vendor specific as possible. So if we include u16 array,
>> then this format can also capture data from IBS since it provides
>> few latency details.
> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
> struct presented in this patchset.
>
> IBS Ops can report e.g.:
>
> 15 tag-to-retire cycles bits,
> 15 completion to retire count bits,
> 15 L1 DTLB refill latency bits,
> 15 DC miss latency bits,
> 5 outstanding memory requests on mem refill bits, and so on.
>
> IBS Fetch reports 15 bits of fetch latency, and another 16
> for iTLB latency, among others.
>
> Some of these may/may not be valid simultaneously, and
> there are IBS specific rules to establish validity.
>
>>> My main point there, however, was that each vendor should
>>> use streamlined record-level code to just copy the data
>>> in the proprietary format that their hardware produces,
>>> and then perf tooling can synthesize the events
>>> from the raw data at report/script/etc. time.
>>>
>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
>>>> large volume of data needs to be captured as part of perf.data without
>>>> frequent PMIs. But proposed type is to address the capture of pipeline
>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>> PMIs are, even though it may be used in those environments.
>>>
>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>> And that's fine for any extra bits that POWER9 has to convey
>>> to its users beyond things already represented by other sample
>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>> and other vendor e.g., AMD IBS data can be made vendor-independent
>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>> what IBS currently uses.
>> My bad. Not sure what you mean by this. We are trying to abstract
>> as much vendor specific data as possible with this (like perf-mem).
> Perhaps if I say it this way: instead of doing all the
> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
> in patch 4/11, rather/instead just put the raw sier value in a
> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
> Specific SIER capabilities can be written as part of the perf.data
> header. Then synthesize the true pipe events from the raw SIER
> values later, and in userspace.

Hi Kim,

We would like to stay away from the SAMPLE_RAW type because of these
comments in perf_event.h:

*      #
*      # The RAW record below is opaque data wrt the ABI
*      #
*      # That is, the ABI doesn't make any promises wrt to
*      # the stability of its content, it may vary depending
*      # on event, hardware, kernel version and phase of
*      # the moon.
*      #
*      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
*      #

Secondly, sorry, I didn't understand your suggestion about using
PERF_SAMPLE_AUX. IIUC, SAMPLE_AUX data goes to the AUX ring buffer,
which costs more memory and makes it more challenging to correlate
and present the pipeline details for each IP. IMO, having a new
sample type can be useful to capture the pipeline data in
perf_sample_data, and, if _AUX is enabled, it can also be made to
push the data to the AUX buffer.

Maddy

>
> I guess it's technically optional, but I think that's how
> I'd do it in IBS, since it minimizes the record-time overhead.
>
> Thanks,
>
> Kim
>
>> Maddy
>>>>>   Take a look at
>>>>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>>>>> definitions".  The sample identifier can be used to determine
>>>>> which vendor's sampling IP's data is in it, and events can
>>>>> be recorded just by copying the content of the SIER, etc.
>>>>> registers, and then events get synthesized from the aux
>>>>> sample at report/inject/annotate etc. time.  This allows
>>>>> for less sample recording overhead, and moves all the vendor
>>>>> specific decoding and common event conversions for userspace
>>>>> to figure out.
>>>> When the AUX buffer data is structured, the tool side changes added to
>>>> present the pipeline data can be re-used.
>>> Not sure I understand: AUX data would be structured on
>>> each vendor's raw h/w register formats.
>>>
>>> Thanks,
>>>
>>> Kim
>>>
>>>>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>>>>> Profiling Extension) which is their version of IBS.
>>>>>>>> Whatever gets added need to cover all three with no limitations.
>>>>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>>>>> similar sample data in perf already, like with perf mem/c2c?
>>>>>> perf-mem is more data centric, in my opinion. It is geared more towards
>>>>>> memory profiling. So the proposal here is to expose pipeline related
>>>>>> details like stalls and latencies.
>>>>> Like I said, I don't see it that way, I see it as "any particular
>>>>> vendor's event's extended details', and these pipeline details
>>>>> have overlap with existing infrastructure within perf, e.g., L2
>>>>> cache misses.
>>>>>
>>>>> Kim
>>>>>

2020-03-26 19:51:47

by Kim Phillips

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information



On 3/26/20 5:19 AM, maddy wrote:
>
>
> On 3/18/20 11:05 PM, Kim Phillips wrote:
>> Hi Maddy,
>>
>> On 3/17/20 1:50 AM, maddy wrote:
>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>>
>>>>>>>>>>> Implementation detail:
>>>>>>>>>>>
>>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>>> into generic format:
>>>>>>>>>>>
>>>>>>>>>>>       struct perf_pipeline_haz_data {
>>>>>>>>>>>              /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>>              __u8    itype;
>>>>>>>>>>>              /* Instruction Cache source */
>>>>>>>>>>>              __u8    icache;
>>>>>>>>>>>              /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>>              __u8    hazard_stage;
>>>>>>>>>>>              /* Hazard reason */
>>>>>>>>>>>              __u8    hazard_reason;
>>>>>>>>>>>              /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>>              __u8    stall_stage;
>>>>>>>>>>>              /* Stall reason */
>>>>>>>>>>>              __u8    stall_reason;
>>>>>>>>>>>              __u16   pad;
>>>>>>>>>>>       };
>>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>>> It's not really 1:1, we don't have these separations of stages
>>> and reasons; we have, for example, 'missed in L2 cache'.
>>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>>> IBM's AFAICT.
>>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>>>> fetch latency, tag-to-retire latency, completion-to-retire latency, and
>>>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>>>> data and latency information.
>>>>>>>
>>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>>> bikeshed.
>>>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>>> NP. We thought 'pipeline' is a generic hw term, so we proposed that
>>>>> word. We are open to any term that is generic enough.
>>>>>
>>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>>> of information coming out of it, but the vast majority
>>>>>> are addresses, latencies of various components in the memory
>>>>>> hierarchy, and various component hit/miss bits.
>>>>> Yes, we should capture core pipeline specific details. For example,
>>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>>>> part of perf-mem, IMO.
>>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>>
>>>> union perf_mem_data_src {
>>>> ...
>>>>                   __u64   mem_rsvd:24,
>>>>                           mem_snoopx:2,   /* snoop mode, ext */
>>>>                           mem_remote:1,   /* remote */
>>>>                           mem_lvl_num:4,  /* memory hierarchy level number */
>>>>                           mem_dtlb:7,     /* tlb access */
>>>>                           mem_lock:2,     /* lock instr */
>>>>                           mem_snoop:5,    /* snoop mode */
>>>>                           mem_lvl:14,     /* memory hierarchy level */
>>>>                           mem_op:5;       /* type of opcode */
>>>>
>>>>
>>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>>> be used to populate mem_snoop, right?
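A small sketch of the mapping Kim describes, assuming already-decoded SIER
fields (the enums are hypothetical; real code extracts them from
SIER[TYPE] and SIER[A_XLATE_SRC]). The PERF_MEM_* constants mirror
include/uapi/linux/perf_event.h; real code should use the uapi header
directly rather than redefining them.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirroring include/uapi/linux/perf_event.h. */
#define PERF_MEM_OP_SHIFT   0
#define PERF_MEM_OP_LOAD    0x02
#define PERF_MEM_OP_STORE   0x04
#define PERF_MEM_LVL_SHIFT  5
#define PERF_MEM_LVL_HIT    0x02
#define PERF_MEM_LVL_L1     0x08

/* Hypothetical, already-decoded SIER fields. */
enum sier_op  { SIER_OP_LOAD, SIER_OP_STORE };
enum sier_src { SIER_SRC_L1 };

/* Compose a perf_mem_data_src value from decoded SIER fields. */
static uint64_t sier_to_data_src(enum sier_op op, enum sier_src src)
{
    uint64_t dsrc = 0;

    dsrc |= (uint64_t)(op == SIER_OP_LOAD ? PERF_MEM_OP_LOAD
                                          : PERF_MEM_OP_STORE)
            << PERF_MEM_OP_SHIFT;
    if (src == SIER_SRC_L1)
        dsrc |= (uint64_t)(PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1)
                << PERF_MEM_LVL_SHIFT;
    return dsrc;
}
```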
>>> Hi Kim,
>>>
>>> Yes. We do expose these data as part of perf-mem for POWER.
>> OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
>> isa207_find_source now, thanks.
>>
>>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>>> used for the ld/st target addresses, too.
>>>>
>>>>>> What's needed here is a vendor-specific extended
>>>>>> sample information that all these technologies gather,
>>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>>> all should have in common.
>>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>>> latency, Instruction completion latency etc..) along with other pipeline
>>>>> details in the proposed structure.
>>>> Latency figures are just an example, and from what I
>>>> can tell, struct perf_sample_data already has a 'weight' member,
>>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>>> transfer memory access latency figures.  Granted, that's
>>>> a bad name given all other vendors don't call latency
>>>> 'weight'.
>>>>
>>>> I didn't see any latency figures coming out of POWER9,
>>>> and do not expect this patchseries to implement those
>>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>>> to amend perf to suit their own h/w output please.
>>> The reference structure proposed in this patchset did not have members
>>> to capture latency info for that exact reason. But the idea here is to
>>> abstract away as much vendor-specific detail as possible. So if we
>>> include a u16 array, then this format can also capture data from IBS,
>>> since it provides a few latency details.
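One possible shape for the u16-array idea, purely illustrative: the field
names, the array length, and the validity bitmap are all invented here,
not part of the patchset.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative extension of the proposed struct: a small u16 latency
 * array plus a bitmap saying which entries the vendor filled, so
 * hardware with latency data (e.g. IBS) populates only what it has.
 */
struct perf_pipeline_haz_data_v2 {
    uint8_t  itype;
    uint8_t  icache;
    uint8_t  hazard_stage;
    uint8_t  hazard_reason;
    uint8_t  stall_stage;
    uint8_t  stall_reason;
    uint16_t lat_valid;   /* bitmap: which latency[] entries are set */
    uint16_t latency[4];  /* vendor-defined latency counts, in cycles */
};
```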
>> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
>> struct presented in this patchset.
>>
>> IBS Ops can report e.g.:
>>
>> 15 tag-to-retire cycles bits,
>> 15 completion to retire count bits,
>> 15 L1 DTLB refill latency bits,
>> 15 DC miss latency bits,
>> 5 outstanding memory requests on mem refill bits, and so on.
>>
>> IBS Fetch reports 15 bits of fetch latency, and another 16
>> for iTLB latency, among others.
>>
>> Some of these may/may not be valid simultaneously, and
>> there are IBS specific rules to establish validity.
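For illustration, a hedged sketch of pulling two of those IBS Op counters
out of a 64-bit register value. The field positions follow the usual
IbsOpData layout (completion-to-retire in the low bits, tag-to-retire
above it), but the authoritative widths and the validity rules Kim
mentions are in the family-specific AMD PPR, so treat this as a sketch.

```c
#include <assert.h>
#include <stdint.h>

struct ibs_op_latency {
    uint16_t tag_to_ret;   /* cycles from tag to retire */
    uint16_t comp_to_ret;  /* cycles from completion to retire */
};

/* Extract the two retire counters from a raw IbsOpData-style value. */
static void decode_ibs_op_data(uint64_t ibs_op_data,
                               struct ibs_op_latency *out)
{
    out->comp_to_ret = ibs_op_data & 0xffff;         /* bits 15:0  */
    out->tag_to_ret  = (ibs_op_data >> 16) & 0xffff; /* bits 31:16 */
}
```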
>>
>>>> My main point there, however, was that each vendor should
>>>> use streamlined record-level code to just copy the data
>>>> in the proprietary format that their hardware produces,
>>>> and then perf tooling can synthesize the events
>>>> from the raw data at report/script/etc. time.
>>>>
>>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for when
>>>>> a large volume of data needs to be captured as part of perf.data without
>>>>> frequent PMIs. But the proposed type is to address the capture of pipeline
>>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>>> PMIs are, even though it may be used in those environments.
>>>>
>>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>>> And that's fine for any extra bits that POWER9 has to convey
>>>> to its users beyond things already represented by other sample
>>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>>> and other vendor e.g., AMD IBS data can be made vendor-independent
>>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>>> what IBS currently uses.
>>> My bad. Not sure what you mean by this. We are trying to abstract
>>> as much vendor specific data as possible with this (like perf-mem).
>> Perhaps if I say it this way: instead of doing all the
>> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
>> in patch 4/11, rather/instead just put the raw sier value in a
>> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
>> Specific SIER capabilities can be written as part of the perf.data
>> header.  Then synthesize the true pipe events from the raw SIER
>> values later, and in userspace.
>
> Hi Kim,
>
> Would like to stay away from the SAMPLE_RAW type because of these comments in perf_event.h:
>
> *      #
> *      # The RAW record below is opaque data wrt the ABI
> *      #
> *      # That is, the ABI doesn't make any promises wrt to
> *      # the stability of its content, it may vary depending
> *      # on event, hardware, kernel version and phase of
> *      # the moon.
> *      #
> *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
> *      #

The "it may vary depending on ... hardware" clause makes it sound
appropriate for the use-case where the raw hardware register contents
are copied directly into the user buffer.

> Secondly, sorry I didn't understand your suggestion about using PERF_SAMPLE_AUX.
> IIUC, SAMPLE_AUX will go to the AUX ring buffer, which costs more memory and
> makes it more challenging to correlate and present the pipeline details for each IP.
> IMO, having a new sample type can be useful to capture the pipeline data
> in perf_sample_data, and, if _AUX is enabled, it can also be pushed to the
> AUX buffer.

OK, I didn't think SAMPLE_AUX and the aux ring buffer were
interdependent, sorry.

Thanks,

Kim

2020-04-20 07:13:20

by Madhavan Srinivasan

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information



On 3/27/20 1:18 AM, Kim Phillips wrote:
>
> On 3/26/20 5:19 AM, maddy wrote:
>>
>> On 3/18/20 11:05 PM, Kim Phillips wrote:
>>> Hi Maddy,
>>>
>>> On 3/17/20 1:50 AM, maddy wrote:
>>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <[email protected]> wrote:
>>>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>>>
>>>>>>>>>>>> Implementation detail:
>>>>>>>>>>>>
>>>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>>>> into generic format:
>>>>>>>>>>>>
>>>>>>>>>>>>       struct perf_pipeline_haz_data {
>>>>>>>>>>>>              /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>>>              __u8    itype;
>>>>>>>>>>>>              /* Instruction Cache source */
>>>>>>>>>>>>              __u8    icache;
>>>>>>>>>>>>              /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>>>              __u8    hazard_stage;
>>>>>>>>>>>>              /* Hazard reason */
>>>>>>>>>>>>              __u8    hazard_reason;
>>>>>>>>>>>>              /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>>>              __u8    stall_stage;
>>>>>>>>>>>>              /* Stall reason */
>>>>>>>>>>>>              __u8    stall_reason;
>>>>>>>>>>>>              __u16   pad;
>>>>>>>>>>>>       };
>>>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>>>> It's not really 1:1, we don't have these separations of stages
>>>>>>>>> and reasons; we have, for example, 'missed in L2 cache'.
>>>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>>>> IBM's AFAICT.
>>>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>>>>> fetch latency, tag-to-retire latency, completion-to-retire latency, and
>>>>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>>>>> data and latency information.
>>>>>>>>
>>>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>>>> bikeshed.
>>>>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>>>> NP. We thought 'pipeline' is a generic hw term, so we proposed that
>>>>>> word. We are open to any term that is generic enough.
>>>>>>
>>>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>>>> of information coming out of it, but the vast majority
>>>>>>> are addresses, latencies of various components in the memory
>>>>>>> hierarchy, and various component hit/miss bits.
>>>>>> Yes, we should capture core pipeline specific details. For example,
>>>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>>>>> part of perf-mem, IMO.
>>>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>>>
>>>>> union perf_mem_data_src {
>>>>> ...
>>>>>                   __u64   mem_rsvd:24,
>>>>>                           mem_snoopx:2,   /* snoop mode, ext */
>>>>>                           mem_remote:1,   /* remote */
>>>>>                           mem_lvl_num:4,  /* memory hierarchy level number */
>>>>>                           mem_dtlb:7,     /* tlb access */
>>>>>                           mem_lock:2,     /* lock instr */
>>>>>                           mem_snoop:5,    /* snoop mode */
>>>>>                           mem_lvl:14,     /* memory hierarchy level */
>>>>>                           mem_op:5;       /* type of opcode */
>>>>>
>>>>>
>>>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>>>> be used to populate mem_snoop, right?
>>>> Hi Kim,
>>>>
>>>> Yes. We do expose these data as part of perf-mem for POWER.
>>> OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
>>> isa207_find_source now, thanks.
>>>
>>>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>>>> used for the ld/st target addresses, too.
>>>>>
>>>>>>> What's needed here is a vendor-specific extended
>>>>>>> sample information that all these technologies gather,
>>>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>>>> all should have in common.
>>>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>>>> latency, Instruction completion latency etc..) along with other pipeline
>>>>>> details in the proposed structure.
>>>>> Latency figures are just an example, and from what I
>>>>> can tell, struct perf_sample_data already has a 'weight' member,
>>>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>>>> transfer memory access latency figures.  Granted, that's
>>>>> a bad name given all other vendors don't call latency
>>>>> 'weight'.
>>>>>
>>>>> I didn't see any latency figures coming out of POWER9,
>>>>> and do not expect this patchseries to implement those
>>>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>>>> to amend perf to suit their own h/w output please.
>>>> The reference structure proposed in this patchset did not have members
>>>> to capture latency info for that exact reason. But the idea here is to
>>>> abstract away as much vendor-specific detail as possible. So if we
>>>> include a u16 array, then this format can also capture data from IBS,
>>>> since it provides a few latency details.
>>> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
>>> struct presented in this patchset.
>>>
>>> IBS Ops can report e.g.:
>>>
>>> 15 tag-to-retire cycles bits,
>>> 15 completion to retire count bits,
>>> 15 L1 DTLB refill latency bits,
>>> 15 DC miss latency bits,
>>> 5 outstanding memory requests on mem refill bits, and so on.
>>>
>>> IBS Fetch reports 15 bits of fetch latency, and another 16
>>> for iTLB latency, among others.
>>>
>>> Some of these may/may not be valid simultaneously, and
>>> there are IBS specific rules to establish validity.
>>>
>>>>> My main point there, however, was that each vendor should
>>>>> use streamlined record-level code to just copy the data
>>>>> in the proprietary format that their hardware produces,
>>>>> and then perf tooling can synthesize the events
>>>>> from the raw data at report/script/etc. time.
>>>>>
>>>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
>>>>>> large volume of data needs to be captured as part of perf.data without
>>>>>> frequent PMIs. But proposed type is to address the capture of pipeline
>>>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>>>> PMIs are, even though it may be used in those environments.
>>>>>
>>>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>>>> And that's fine for any extra bits that POWER9 has to convey
>>>>> to its users beyond things already represented by other sample
>>>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>>>> and other vendor e.g., AMD IBS data can be made vendor-independent
>>>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>>>> what IBS currently uses.
>>>> My bad. Not sure what you mean by this. We are trying to abstract
>>>> as much vendor specific data as possible with this (like perf-mem).
>>> Perhaps if I say it this way: instead of doing all the
>>> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
>>> in patch 4/11, rather/instead just put the raw sier value in a
>>> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
>>> Specific SIER capabilities can be written as part of the perf.data
>>> header.  Then synthesize the true pipe events from the raw SIER
>>> values later, and in userspace.
>> Hi Kim,
>>
>> Would like to stay away from the SAMPLE_RAW type because of these comments in perf_event.h:
>>
>> *      #
>> *      # The RAW record below is opaque data wrt the ABI
>> *      #
>> *      # That is, the ABI doesn't make any promises wrt to
>> *      # the stability of its content, it may vary depending
>> *      # on event, hardware, kernel version and phase of
>> *      # the moon.
>> *      #
>> *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
>> *      #
> The "it may vary depending on ... hardware" clause makes it sound
> appropriate for the use-case where the raw hardware register contents
> are copied directly into the user buffer.


Hi Kim,

Sorry for the delayed response.

But the perf tool side needs infrastructure to handle raw sample
data from the cpu-pmu (today PERF_SAMPLE_RAW is used by tracepoints).
I am not sure whether this is the approach we should take here.

peterz any comments?

>
>> Secondly, sorry I didn't understand your suggestion about using PERF_SAMPLE_AUX.
>> IIUC, SAMPLE_AUX will go to the AUX ring buffer, which costs more memory and
>> makes it more challenging to correlate and present the pipeline details for each IP.
>> IMO, having a new sample type can be useful to capture the pipeline data
>> in perf_sample_data, and, if _AUX is enabled, it can also be pushed to the
>> AUX buffer.
> OK, I didn't think SAMPLE_AUX and the aux ring buffer were
> interdependent, sorry.
>
> Thanks,
>
> Kim

2020-04-27 07:21:28

by Madhavan Srinivasan

Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

peterz,

    Can you please help? Is it okay to use PERF_SAMPLE_RAW to expose
the pipeline stall details, and to add tool-side infrastructure to handle
PERF_SAMPLE_RAW for cpu-pmu samples?

Maddy

On 4/20/20 12:39 PM, Madhavan Srinivasan wrote:
>
>
> On 3/27/20 1:18 AM, Kim Phillips wrote:
>>
>> On 3/26/20 5:19 AM, maddy wrote:
>>>
>>> On 3/18/20 11:05 PM, Kim Phillips wrote:
>>>> Hi Maddy,
>>>>
>>>> On 3/17/20 1:50 AM, maddy wrote:
>>>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>>>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction
>>>>>>>>>>>>> Event
>>>>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based
>>>>>>>>>>>>> Sampling' on
>>>>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Implementation detail:
>>>>>>>>>>>>>
>>>>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is
>>>>>>>>>>>>> introduced.
>>>>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>>>>> into generic format:
>>>>>>>>>>>>>
>>>>>>>>>>>>>        struct perf_pipeline_haz_data {
>>>>>>>>>>>>>               /* Instruction/Opcode type: Load, Store,
>>>>>>>>>>>>> Branch .... */
>>>>>>>>>>>>>               __u8    itype;
>>>>>>>>>>>>>               /* Instruction Cache source */
>>>>>>>>>>>>>               __u8    icache;
>>>>>>>>>>>>>               /* Instruction suffered hazard in pipeline
>>>>>>>>>>>>> stage */
>>>>>>>>>>>>>               __u8    hazard_stage;
>>>>>>>>>>>>>               /* Hazard reason */
>>>>>>>>>>>>>               __u8    hazard_reason;
>>>>>>>>>>>>>               /* Instruction suffered stall in pipeline
>>>>>>>>>>>>> stage */
>>>>>>>>>>>>>               __u8    stall_stage;
>>>>>>>>>>>>>               /* Stall reason */
>>>>>>>>>>>>>               __u8    stall_reason;
>>>>>>>>>>>>>               __u16   pad;
>>>>>>>>>>>>>        };
>>>>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>>>>> It's not really 1:1, we don't have these separations of stages
>>>>>>>>>> and reasons; we have, for example, 'missed in L2 cache'.
>>>>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>>>>> IBM's AFAICT.
>>>>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling,
>>>>>>>>> like the fetch latency, tag-to-retire latency, completion-to-retire
>>>>>>>>> latency, and so on. Yes, Ops sampling does provide more data on
>>>>>>>>> load/store centric information. But it also captures more detailed
>>>>>>>>> data for Branch instructions. And we also looked at ARM SPE, which
>>>>>>>>> also captures more detailed pipeline
>>>>>>>>> data and latency information.
>>>>>>>>>
>>>>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>>>>> specific. We need to find a better term, maybe stall or
>>>>>>>>>>> penalty.
>>>>>>>>>> Right, IBS doesn't have a filter to only count stalled or
>>>>>>>>>> otherwise
>>>>>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>>>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>>>>> bikeshed.
>>>>>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>>>>> NP. We thought 'pipeline' is a generic hw term, so we proposed
>>>>>>> that word. We are open to any term that is generic enough.
>>>>>>>
>>>>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>>>>> of information coming out of it, but the vast majority
>>>>>>>> are addresses, latencies of various components in the memory
>>>>>>>> hierarchy, and various component hit/miss bits.
>>>>>>> Yes, we should capture core pipeline specific details. For example,
>>>>>>> IBS generates Branch unit information (IbsOpData1) and Icache
>>>>>>> related data (IbsFetchCtl), which is something that shouldn't be
>>>>>>> extended as
>>>>>>> part of perf-mem, IMO.
>>>>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>>>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>>>>
>>>>>> union perf_mem_data_src {
>>>>>> ...
>>>>>>                    __u64   mem_rsvd:24,
>>>>>>                            mem_snoopx:2,   /* snoop mode, ext */
>>>>>>                            mem_remote:1,   /* remote */
>>>>>>                            mem_lvl_num:4,  /* memory hierarchy
>>>>>> level number */
>>>>>>                            mem_dtlb:7,     /* tlb access */
>>>>>>                            mem_lock:2,     /* lock instr */
>>>>>>                            mem_snoop:5,    /* snoop mode */
>>>>>>                            mem_lvl:14,     /* memory hierarchy
>>>>>> level */
>>>>>>                            mem_op:5;       /* type of opcode */
>>>>>>
>>>>>>
>>>>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>>>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>>>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>>>>> be used to populate mem_snoop, right?
>>>>> Hi Kim,
>>>>>
>>>>> Yes. We do expose these data as part of perf-mem for POWER.
>>>> OK, I see relevant PERF_MEM_S bits in
>>>> arch/powerpc/perf/isa207-common.c:
>>>> isa207_find_source now, thanks.
>>>>
>>>>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>>>>> used for the ld/st target addresses, too.
>>>>>>
>>>>>>>> What's needed here is a vendor-specific extended
>>>>>>>> sample information that all these technologies gather,
>>>>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>>>>> all should have in common.
>>>>>>> Yes. We will include fields to capture the latency cycles (like
>>>>>>> Issue
>>>>>>> latency, Instruction completion latency etc..) along with other
>>>>>>> pipeline
>>>>>>> details in the proposed structure.
>>>>>> Latency figures are just an example, and from what I
>>>>>> can tell, struct perf_sample_data already has a 'weight' member,
>>>>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>>>>> transfer memory access latency figures.  Granted, that's
>>>>>> a bad name given all other vendors don't call latency
>>>>>> 'weight'.
>>>>>>
>>>>>> I didn't see any latency figures coming out of POWER9,
>>>>>> and do not expect this patchseries to implement those
>>>>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>>>>> to amend perf to suit their own h/w output please.
>>>>> The reference structure proposed in this patchset did not have members
>>>>> to capture latency info for that exact reason. But the idea here is to
>>>>> abstract away as much vendor-specific detail as possible. So if we
>>>>> include a u16 array, then this format can also capture data from IBS,
>>>>> since it provides a few latency details.
>>>> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
>>>> struct presented in this patchset.
>>>>
>>>> IBS Ops can report e.g.:
>>>>
>>>> 15 tag-to-retire cycles bits,
>>>> 15 completion to retire count bits,
>>>> 15 L1 DTLB refill latency bits,
>>>> 15 DC miss latency bits,
>>>> 5 outstanding memory requests on mem refill bits, and so on.
>>>>
>>>> IBS Fetch reports 15 bits of fetch latency, and another 16
>>>> for iTLB latency, among others.
>>>>
>>>> Some of these may/may not be valid simultaneously, and
>>>> there are IBS specific rules to establish validity.
>>>>
>>>>>> My main point there, however, was that each vendor should
>>>>>> use streamlined record-level code to just copy the data
>>>>>> in the proprietary format that their hardware produces,
>>>>>> and then perf tooling can synthesize the events
>>>>>> from the raw data at report/script/etc. time.
>>>>>>
>>>>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is
>>>>>>> intended when
>>>>>>> large volume of data needs to be captured as part of perf.data
>>>>>>> without
>>>>>>> frequent PMIs. But proposed type is to address the capture of
>>>>>>> pipeline
>>>>>> SAMPLE_AUX shouldn't care whether the volume is large, or how
>>>>>> frequent
>>>>>> PMIs are, even though it may be used in those environments.
>>>>>>
>>>>>>> information on each sample using PMI at periodic intervals.
>>>>>>> Hence proposing
>>>>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>>>>> And that's fine for any extra bits that POWER9 has to convey
>>>>>> to its users beyond things already represented by other sample
>>>>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>>>>> and other vendor e.g., AMD IBS data can be made vendor-independent
>>>>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>>>>> what IBS currently uses.
>>>>> My bad, I'm not sure what you mean by this. We are trying to abstract
>>>>> away as much vendor-specific data as possible with this (like perf-mem).
>>>> Perhaps if I say it this way: instead of doing all the
>>>> isa207_get_phazard_data() work past the mfspr(SPRN_SIER) in
>>>> patch 4/11, just put the raw SIER value in a PERF_SAMPLE_RAW or
>>>> _AUX event, and call perf_event_update_userpage().  Specific SIER
>>>> capabilities can be written as part of the perf.data header.  Then
>>>> synthesize the true pipeline events from the raw SIER values later,
>>>> in userspace.
>>> Hi Kim,
>>>
>>> I'd like to stay away from the SAMPLE_RAW type because of these
>>> comments in perf_events.h:
>>>
>>> *      #
>>> *      # The RAW record below is opaque data wrt the ABI
>>> *      #
>>> *      # That is, the ABI doesn't make any promises wrt to
>>> *      # the stability of its content, it may vary depending
>>> *      # on event, hardware, kernel version and phase of
>>> *      # the moon.
>>> *      #
>>> *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
>>> *      #
>> The "it may vary depending on ... hardware" clause makes it sound
>> appropriate for the use-case where the raw hardware register contents
>> are copied directly into the user buffer.
>
>
> Hi Kim,
>
> Sorry for the delayed response.
>
> But the perf tool side needs infrastructure to handle raw sample
> data from the CPU PMU (currently it is used only by tracepoints). I am
> not sure whether this is the approach we should take here.
>
> peterz, any comments?
>
>>
>>> Secondly, sorry, I didn't understand your suggestion about using
>>> PERF_SAMPLE_AUX. IIUC, SAMPLE_AUX will go to the AUX ring buffer,
>>> which requires more memory and makes it more challenging to correlate
>>> and present the pipeline details for each IP. IMO, having a new
>>> sample type can be useful to capture the pipeline data in
>>> perf_sample_data, and, if _AUX is enabled, it can also be pushed to
>>> the AUX buffer.
>> OK, I didn't think SAMPLE_AUX and the aux ring buffer were
>> interdependent, sorry.
>>
>> Thanks,
>>
>> Kim
>