2021-03-22 15:00:47

by Athira Rajeev

Subject: [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information

Performance Monitoring Unit (PMU) registers in powerpc export the
number of cycles elapsed between different stages in the pipeline,
for example through the sampling registers in ISA v3.1.

This patchset implements kernel and perf tools support to expose
these pipeline stage cycles using the sample type PERF_SAMPLE_WEIGHT_TYPE.

Patch 1/5 adds kernel side support to store the cycle counter
values as part of 'var2_w' and 'var3_w' fields of perf_sample_weight
structure.

Patch 2/5 adds support to make the perf report column header
strings dynamic.
Patch 3/5 adds powerpc support in perf tools for PERF_SAMPLE_WEIGHT_STRUCT
in sample type: PERF_SAMPLE_WEIGHT_TYPE.
Patch 4/5 adds support to present pipeline stage cycles as part of
mem-mode.
Patch 5/5 displays the new sort dimension in perf report columns
only on powerpc.

Sample output on powerpc:

# perf mem record ls
# perf mem report

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 11 of event 'cpu/mem-loads/'
# Total weight : 1332
# Sort order : local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat,p_stage_cyc
#
# Overhead  Samples  Local Weight  Memory access  Symbol                              Shared Object     Data Symbol                          Data Object            Snoop  TLB access  Locked  Blocked  Finish Cyc  Dispatch Cyc
# ........  .......  ............  .............  ..................................  ................  ...................................  .....................  .....  ..........  ......  .......  ..........  ............
#
    44.14%  1        588           L1 hit         [k] rcu_nmi_exit                    [kernel.vmlinux]  [k] 0xc0000007ffdd21b0               [unknown]              N/A    N/A         No      N/A      7           5
    22.22%  1        296           L1 hit         [k] copypage_power7                 [kernel.vmlinux]  [k] 0xc0000000ff6a1780               [unknown]              N/A    N/A         No      N/A      293         3
     6.98%  1        93            L1 hit         [.] _dl_addr                        libc-2.31.so      [.] 0x00007fff86fa5058               libc-2.31.so           N/A    N/A         No      N/A      7           1
     6.61%  1        88            L2 hit         [.] new_do_write                    libc-2.31.so      [.] _IO_2_1_stdout_+0x0              libc-2.31.so           N/A    N/A         No      N/A      84          1
     5.93%  1        79            L1 hit         [k] printk_nmi_exit                 [kernel.vmlinux]  [k] 0xc0000006085df6b0               [unknown]              N/A    N/A         No      N/A      7           1
     4.05%  1        54            L2 hit         [.] __alloc_dir                     libc-2.31.so      [.] 0x00007fffdb70a640               [stack]                N/A    N/A         No      N/A      18          1
     3.60%  1        48            L1 hit         [.] _init                           ls                [.] 0x000000016ca82118               [heap]                 N/A    N/A         No      N/A      7           6
     2.40%  1        32            L1 hit         [k] desc_read                       [kernel.vmlinux]  [k] _printk_rb_static_descs+0x1ea10  [kernel.vmlinux].data  N/A    N/A         No      N/A      7           1
     1.65%  1        22            L2 hit         [k] perf_iterate_ctx.constprop.139  [kernel.vmlinux]  [k] 0xc00000064d79e8a8               [unknown]              N/A    N/A         No      N/A      16          1
     1.58%  1        21            L1 hit         [k] perf_event_interrupt            [kernel.vmlinux]  [k] 0xc0000006085df6b0               [unknown]              N/A    N/A         No      N/A      7           1
     0.83%  1        11            L1 hit         [k] perf_event_exec                 [kernel.vmlinux]  [k] 0xc0000007ffdd3288               [unknown]              N/A    N/A         No      N/A      7           4
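
In the sample above, each Overhead value is consistent with that row's Local Weight divided by the reported total weight (for example 588 / 1332 gives 44.14%). The sketch below only reproduces that arithmetic; the function names are hypothetical and it is not how perf computes the column internally.

```c
#include <stdio.h>

/* Hypothetical helper: share of the total weight carried by one row, in percent. */
static double demo_overhead_percent(unsigned long local_weight,
				    unsigned long total_weight)
{
	if (total_weight == 0)
		return 0.0;
	return 100.0 * (double)local_weight / (double)total_weight;
}

/* Format the share with two decimals and a trailing '%', as in the report. */
static int demo_format_overhead(char *buf, size_t size,
				unsigned long local_weight,
				unsigned long total_weight)
{
	return snprintf(buf, size, "%.2f%%",
			demo_overhead_percent(local_weight, total_weight));
}
```

Calling demo_format_overhead() with 588 and 1332 fills the buffer with the string shown in the first row of the sample.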


Changelog:
Changes from v1 -> v2:
  Addressed Jiri's review comments:
  - Display the new sort dimension 'p_stage_cyc' only
    on supported architectures.
  - Check for arch specific header string for matching
    sort order in patch 2.

Athira Rajeev (5):
powerpc/perf: Expose processor pipeline stage cycles using
PERF_SAMPLE_WEIGHT_STRUCT
tools/perf: Add dynamic headers for perf report columns
tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT
tools/perf: Support pipeline stage cycles for powerpc
tools/perf: Display sort dimension p_stage_cyc only on supported archs

arch/powerpc/include/asm/perf_event_server.h | 2 +-
arch/powerpc/perf/core-book3s.c | 4 +-
arch/powerpc/perf/isa207-common.c | 29 ++++++++++++--
arch/powerpc/perf/isa207-common.h | 6 ++-
tools/perf/Documentation/perf-report.txt | 2 +
tools/perf/arch/powerpc/util/Build | 2 +
tools/perf/arch/powerpc/util/event.c | 53 ++++++++++++++++++++++++
tools/perf/arch/powerpc/util/evsel.c | 8 ++++
tools/perf/util/event.h | 3 ++
tools/perf/util/hist.c | 11 +++--
tools/perf/util/hist.h | 1 +
tools/perf/util/session.c | 4 +-
tools/perf/util/sort.c | 60 +++++++++++++++++++++++++++-
tools/perf/util/sort.h | 2 +
14 files changed, 174 insertions(+), 13 deletions(-)
create mode 100644 tools/perf/arch/powerpc/util/event.c
create mode 100644 tools/perf/arch/powerpc/util/evsel.c

--
1.8.3.1


2021-03-22 15:01:56

by Athira Rajeev

Subject: [PATCH V2 2/5] tools/perf: Add dynamic headers for perf report columns

Currently the header string for different columns in perf report
is fixed. Some fields of a perf sample could have a different meaning
on different architectures than the meaning conveyed by the header
string. An example is the new field 'var2_w' of the perf_sample_weight
structure. This is presently captured as 'Local INSTR Latency' in
perf mem report, but it could be used to denote a different latency
cycle on another architecture.

Introduce a weak function arch_perf_header_entry() to set
the arch specific header string for the fields that can have a dynamic
header. If the architecture does not have this function, fall back to the
default header string value.
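
As a rough illustration of the weak-function fallback described above, here is a minimal standalone sketch. All demo_* names are hypothetical stand-ins, and the weak attribute is written in its GCC/Clang form rather than with the perf tree's __weak macro. With no strong override linked in, the generic definition is used and the header string stays as it was.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Generic default. Because it is weak, an architecture-specific object
 * file may provide a strong definition with the same name, and the
 * linker will then pick that one instead.
 */
__attribute__((weak)) const char *demo_arch_header_entry(const char *se_header)
{
	return se_header;
}

/* Hypothetical, trimmed-down stand-in for a sort entry. */
struct demo_sort_entry {
	const char *se_header;
};

/* Sort keys whose column header may be replaced by the architecture. */
static const char *const demo_dynamic_headers[] = { "local_ins_lat" };

/* Ask the (possibly overridden) hook for a header only for listed keys. */
static void demo_apply_dynamic_header(const char *name,
				      struct demo_sort_entry *entry)
{
	size_t j;

	for (j = 0; j < DEMO_ARRAY_SIZE(demo_dynamic_headers); j++) {
		if (!strcmp(demo_dynamic_headers[j], name))
			entry->se_header = demo_arch_header_entry(entry->se_header);
	}
}
```

The design keeps the generic code free of per-architecture conditionals: an architecture opts in simply by defining the hook.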

Signed-off-by: Athira Rajeev <[email protected]>
---
tools/perf/util/event.h | 1 +
tools/perf/util/sort.c | 19 ++++++++++++++++++-
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index f603edbbbc6f..6106a9c134c9 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -427,5 +427,6 @@ void cpu_map_data__synthesize(struct perf_record_cpu_map_data *data, struct per

void arch_perf_parse_sample_weight(struct perf_sample *data, const __u64 *array, u64 type);
void arch_perf_synthesize_sample_weight(const struct perf_sample *data, __u64 *array, u64 type);
+const char *arch_perf_header_entry(const char *se_header);

#endif /* __PERF_RECORD_H */
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 552b590485bf..eeb03e749181 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -25,6 +25,7 @@
#include <traceevent/event-parse.h>
#include "mem-events.h"
#include "annotate.h"
+#include "event.h"
#include "time-utils.h"
#include "cgroup.h"
#include "machine.h"
@@ -45,6 +46,7 @@
regex_t ignore_callees_regex;
int have_ignore_callees = 0;
enum sort_mode sort__mode = SORT_MODE__NORMAL;
+const char *dynamic_headers[] = {"local_ins_lat"};

/*
* Replaces all occurrences of a char used with the:
@@ -1816,6 +1818,16 @@ struct sort_dimension {
int taken;
};

+const char * __weak arch_perf_header_entry(const char *se_header)
+{
+ return se_header;
+}
+
+static void sort_dimension_add_dynamic_header(struct sort_dimension *sd)
+{
+ sd->entry->se_header = arch_perf_header_entry(sd->entry->se_header);
+}
+
#define DIM(d, n, func) [d] = { .name = n, .entry = &(func) }

static struct sort_dimension common_sort_dimensions[] = {
@@ -2739,7 +2751,7 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
struct evlist *evlist,
int level)
{
- unsigned int i;
+ unsigned int i, j;

for (i = 0; i < ARRAY_SIZE(common_sort_dimensions); i++) {
struct sort_dimension *sd = &common_sort_dimensions[i];
@@ -2747,6 +2759,11 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
if (strncasecmp(tok, sd->name, strlen(tok)))
continue;

+ for (j = 0; j < ARRAY_SIZE(dynamic_headers); j++) {
+ if (!strcmp(dynamic_headers[j], sd->name))
+ sort_dimension_add_dynamic_header(sd);
+ }
+
if (sd->entry == &sort_parent) {
int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
if (ret) {
--
1.8.3.1

2021-03-22 15:02:27

by Athira Rajeev

Subject: [PATCH V2 4/5] tools/perf: Support pipeline stage cycles for powerpc

The pipeline stage cycles details can be recorded on powerpc from
the contents of Performance Monitor Unit (PMU) registers. On
ISA v3.1 platforms, sampling registers expose the cycles spent in
different pipeline stages. This patch adds perf tools support to present
two of the cycle counters along with memory latency (weight).

Re-use the field 'ins_lat' for storing the first pipeline stage cycle.
This is stored in the 'var2_w' field of 'perf_sample_weight'.

Add a new field 'p_stage_cyc' to store the second pipeline stage cycle,
which is stored in the 'var3_w' field of 'perf_sample_weight'.

Add a new sort function 'Pipeline Stage Cycle' and include this in
default_mem_sort_order[]. This new sort function may be used to denote
some other pipeline stage on another architecture, so add it to the
list of sort entries that can have a dynamic header string.
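
The field mapping described above can be sketched on the tool side as follows. This is an illustration only: the demo_* names are hypothetical, and the bit positions (var1_dw in bits 0-31, var2_w in bits 32-47, var3_w in bits 48-63 of the 64-bit weight word) are an assumption about the layout, written with shifts so that the sketch does not depend on host endianness.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the fields perf keeps per sample. */
struct demo_sample {
	uint64_t weight;      /* memory latency, from var1_dw */
	uint16_t ins_lat;     /* first pipeline stage cycle, from var2_w */
	uint16_t p_stage_cyc; /* second pipeline stage cycle, from var3_w */
};

/* Split one raw 64-bit weight word into the three values. */
static void demo_parse_weight_struct(struct demo_sample *s, uint64_t raw)
{
	s->weight = raw & 0xffffffffULL;
	s->ins_lat = (uint16_t)((raw >> 32) & 0xffff);
	s->p_stage_cyc = (uint16_t)((raw >> 48) & 0xffff);
}
```

With the powerpc header strings from this patch, ins_lat would appear under "Finish Cyc" and p_stage_cyc under "Dispatch Cyc".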

Signed-off-by: Athira Rajeev <[email protected]>
---
tools/perf/Documentation/perf-report.txt | 2 ++
tools/perf/arch/powerpc/util/event.c | 18 ++++++++++++++++--
tools/perf/util/event.h | 1 +
tools/perf/util/hist.c | 11 ++++++++---
tools/perf/util/hist.h | 1 +
tools/perf/util/session.c | 4 +++-
tools/perf/util/sort.c | 24 ++++++++++++++++++++++--
tools/perf/util/sort.h | 2 ++
8 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index f546b5e9db05..563fb01a9b8d 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -112,6 +112,8 @@ OPTIONS
- ins_lat: Instruction latency in core cycles. This is the global instruction
latency
- local_ins_lat: Local instruction latency version
+ - p_stage_cyc: On powerpc, this presents the number of cycles spent in a
+ pipeline stage. And currently supported only on powerpc.

By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
diff --git a/tools/perf/arch/powerpc/util/event.c b/tools/perf/arch/powerpc/util/event.c
index f49d32c2c8ae..22521bc9481a 100644
--- a/tools/perf/arch/powerpc/util/event.c
+++ b/tools/perf/arch/powerpc/util/event.c
@@ -18,8 +18,11 @@ void arch_perf_parse_sample_weight(struct perf_sample *data,
weight.full = *array;
if (type & PERF_SAMPLE_WEIGHT)
data->weight = weight.full;
- else
+ else {
data->weight = weight.var1_dw;
+ data->ins_lat = weight.var2_w;
+ data->p_stage_cyc = weight.var3_w;
+ }
}

void arch_perf_synthesize_sample_weight(const struct perf_sample *data,
@@ -27,6 +30,17 @@ void arch_perf_synthesize_sample_weight(const struct perf_sample *data,
{
*array = data->weight;

- if (type & PERF_SAMPLE_WEIGHT_STRUCT)
+ if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
*array &= 0xffffffff;
+ *array |= ((u64)data->ins_lat << 32);
+ }
+}
+
+const char *arch_perf_header_entry(const char *se_header)
+{
+ if (!strcmp(se_header, "Local INSTR Latency"))
+ return "Finish Cyc";
+ else if (!strcmp(se_header, "Pipeline Stage Cycle"))
+ return "Dispatch Cyc";
+ return se_header;
}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 6106a9c134c9..e5da4a695ff2 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -147,6 +147,7 @@ struct perf_sample {
u8 cpumode;
u16 misc;
u16 ins_lat;
+ u16 p_stage_cyc;
bool no_hw_idx; /* No hw_idx collected in branch_stack */
char insn[MAX_INSN];
void *raw_data;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index c82f5fc26af8..9299ee535518 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -211,6 +211,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
+ hists__new_col_len(hists, HISTC_P_STAGE_CYC, 13);
if (symbol_conf.nanosecs)
hists__new_col_len(hists, HISTC_TIME, 16);
else
@@ -289,13 +290,14 @@ static long hist_time(unsigned long htime)
}

static void he_stat__add_period(struct he_stat *he_stat, u64 period,
- u64 weight, u64 ins_lat)
+ u64 weight, u64 ins_lat, u64 p_stage_cyc)
{

he_stat->period += period;
he_stat->weight += weight;
he_stat->nr_events += 1;
he_stat->ins_lat += ins_lat;
+ he_stat->p_stage_cyc += p_stage_cyc;
}

static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
@@ -308,6 +310,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
dest->nr_events += src->nr_events;
dest->weight += src->weight;
dest->ins_lat += src->ins_lat;
+ dest->p_stage_cyc += src->p_stage_cyc;
}

static void he_stat__decay(struct he_stat *he_stat)
@@ -597,6 +600,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
u64 period = entry->stat.period;
u64 weight = entry->stat.weight;
u64 ins_lat = entry->stat.ins_lat;
+ u64 p_stage_cyc = entry->stat.p_stage_cyc;
bool leftmost = true;

p = &hists->entries_in->rb_root.rb_node;
@@ -615,11 +619,11 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,

if (!cmp) {
if (sample_self) {
- he_stat__add_period(&he->stat, period, weight, ins_lat);
+ he_stat__add_period(&he->stat, period, weight, ins_lat, p_stage_cyc);
hist_entry__add_callchain_period(he, period);
}
if (symbol_conf.cumulate_callchain)
- he_stat__add_period(he->stat_acc, period, weight, ins_lat);
+ he_stat__add_period(he->stat_acc, period, weight, ins_lat, p_stage_cyc);

/*
* This mem info was allocated from sample__resolve_mem
@@ -731,6 +735,7 @@ static void hists__res_sample(struct hist_entry *he, struct perf_sample *sample)
.period = sample->period,
.weight = sample->weight,
.ins_lat = sample->ins_lat,
+ .p_stage_cyc = sample->p_stage_cyc,
},
.parent = sym_parent,
.filtered = symbol__parent_filter(sym_parent) | al->filtered,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 3c537232294b..e2faa745c8d6 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -75,6 +75,7 @@ enum hist_column {
HISTC_MEM_BLOCKED,
HISTC_LOCAL_INS_LAT,
HISTC_GLOBAL_INS_LAT,
+ HISTC_P_STAGE_CYC,
HISTC_NR_COLS, /* Last entry */
};

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 859832a82496..a6fed96d783d 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1302,8 +1302,10 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,

if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
printf("... weight: %" PRIu64 "", sample->weight);
- if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
+ if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT) {
printf(",0x%"PRIx16"", sample->ins_lat);
+ printf(",0x%"PRIx16"", sample->p_stage_cyc);
+ }
printf("\n");
}

diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index eeb03e749181..d262261ad1a6 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -37,7 +37,7 @@
const char *parent_pattern = default_parent_pattern;
const char *default_sort_order = "comm,dso,symbol";
const char default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
-const char default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat";
+const char default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat,p_stage_cyc";
const char default_top_sort_order[] = "dso,symbol";
const char default_diff_sort_order[] = "dso,symbol";
const char default_tracepoint_sort_order[] = "trace";
@@ -46,7 +46,7 @@
regex_t ignore_callees_regex;
int have_ignore_callees = 0;
enum sort_mode sort__mode = SORT_MODE__NORMAL;
-const char *dynamic_headers[] = {"local_ins_lat"};
+const char *dynamic_headers[] = {"local_ins_lat", "p_stage_cyc"};

/*
* Replaces all occurrences of a char used with the:
@@ -1410,6 +1410,25 @@ struct sort_entry sort_global_ins_lat = {
.se_width_idx = HISTC_GLOBAL_INS_LAT,
};

+static int64_t
+sort__global_p_stage_cyc_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return left->stat.p_stage_cyc - right->stat.p_stage_cyc;
+}
+
+static int hist_entry__p_stage_cyc_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ return repsep_snprintf(bf, size, "%-*u", width, he->stat.p_stage_cyc);
+}
+
+struct sort_entry sort_p_stage_cyc = {
+ .se_header = "Pipeline Stage Cycle",
+ .se_cmp = sort__global_p_stage_cyc_cmp,
+ .se_snprintf = hist_entry__p_stage_cyc_snprintf,
+ .se_width_idx = HISTC_P_STAGE_CYC,
+};
+
struct sort_entry sort_mem_daddr_sym = {
.se_header = "Data Symbol",
.se_cmp = sort__daddr_cmp,
@@ -1853,6 +1872,7 @@ static void sort_dimension_add_dynamic_header(struct sort_dimension *sd)
DIM(SORT_CODE_PAGE_SIZE, "code_page_size", sort_code_page_size),
DIM(SORT_LOCAL_INS_LAT, "local_ins_lat", sort_local_ins_lat),
DIM(SORT_GLOBAL_INS_LAT, "ins_lat", sort_global_ins_lat),
+ DIM(SORT_PIPELINE_STAGE_CYC, "p_stage_cyc", sort_p_stage_cyc),
};

#undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 63f67a3f3630..d9795ca0d676 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -51,6 +51,7 @@ struct he_stat {
u64 period_guest_us;
u64 weight;
u64 ins_lat;
+ u64 p_stage_cyc;
u32 nr_events;
};

@@ -234,6 +235,7 @@ enum sort_type {
SORT_CODE_PAGE_SIZE,
SORT_LOCAL_INS_LAT,
SORT_GLOBAL_INS_LAT,
+ SORT_PIPELINE_STAGE_CYC,

/* branch stack specific sort keys */
__SORT_BRANCH_STACK,
--
1.8.3.1

2021-03-22 15:02:44

by Athira Rajeev

Subject: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT

Performance Monitoring Unit (PMU) registers in powerpc provide
information on cycles elapsed between different stages in the
pipeline. This can be used for application tuning. On ISA v3.1
platforms, this information is exposed by sampling registers.
This patch adds kernel support to capture two of the cycle counters
as part of the perf sample using the sample type
PERF_SAMPLE_WEIGHT_STRUCT.

The power PMU function 'get_mem_weight' currently uses the 64-bit weight
field of perf_sample_data to capture memory latency. But following the
introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
a 64-bit or 32-bit value depending on the architecture support for
PERF_SAMPLE_WEIGHT_STRUCT. This patchset uses WEIGHT_STRUCT to expose the
pipeline stage cycles info. Hence update the ppmu functions to work for
64-bit and 32-bit weight values.

If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
latency is stored in the low 32 bits of the perf_sample_weight structure.
Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
two 16-bit fields of the perf_sample_weight structure.
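
The two cases above can be sketched in plain C as follows. This is not the kernel code: the demo_* names are hypothetical, the SIER2 value is passed in as a parameter instead of being read with mfspr(), and the placement of var2_w at bits 32-47 and var3_w at bits 48-63 is an assumption about the layout. The shift amounts and the 0x7ff mask follow the P10_SIER2_* macros in this patch, which use IBM bit numbering (bit 0 is the most significant bit, so a field ending at bit n is shifted right by 63 - n).

```c
#include <stdint.h>

/* Field ending at IBM bit 37 of SIER2 (finish cycles). */
static uint16_t demo_sier2_finish_cyc(uint64_t sier2)
{
	return (uint16_t)((sier2 >> (63 - 37)) & 0x7ff);
}

/* Field ending at IBM bit 13 of SIER2 (dispatch cycles). */
static uint16_t demo_sier2_dispatch_cyc(uint64_t sier2)
{
	return (uint16_t)((sier2 >> (63 - 13)) & 0x7ff);
}

/*
 * Build the 64-bit weight word.
 * use_struct == 0: plain WEIGHT, the full 64 bits carry the latency.
 * use_struct != 0: WEIGHT_STRUCT, latency is truncated to the low 32 bits
 * and, when has_arch_31 is set, the two cycle counts fill the upper halves.
 */
static uint64_t demo_pack_weight(uint64_t latency, int use_struct,
				 uint64_t sier2, int has_arch_31)
{
	uint64_t word;

	if (!use_struct)
		return latency;

	word = (uint32_t)latency;
	if (has_arch_31) {
		word |= (uint64_t)demo_sier2_finish_cyc(sier2) << 32;
		word |= (uint64_t)demo_sier2_dispatch_cyc(sier2) << 48;
	}
	return word;
}
```

Note that in the WEIGHT_STRUCT case a latency above 32 bits is truncated, which is the trade-off for carrying the extra cycle counts in the same word.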

Signed-off-by: Athira Rajeev <[email protected]>
---
arch/powerpc/include/asm/perf_event_server.h | 2 +-
arch/powerpc/perf/core-book3s.c | 4 ++--
arch/powerpc/perf/isa207-common.c | 29 +++++++++++++++++++++++++---
arch/powerpc/perf/isa207-common.h | 6 +++++-
4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 00e7e671bb4b..112cf092d7b3 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -43,7 +43,7 @@ struct power_pmu {
u64 alt[]);
void (*get_mem_data_src)(union perf_mem_data_src *dsrc,
u32 flags, struct pt_regs *regs);
- void (*get_mem_weight)(u64 *weight);
+ void (*get_mem_weight)(u64 *weight, u64 type);
unsigned long group_constraint_mask;
unsigned long group_constraint_val;
u64 (*bhrb_filter_map)(u64 branch_sample_type);
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 766f064f00fb..6936763246bd 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2206,9 +2206,9 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
ppmu->get_mem_data_src)
ppmu->get_mem_data_src(&data.data_src, ppmu->flags, regs);

- if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&
+ if (event->attr.sample_type & PERF_SAMPLE_WEIGHT_TYPE &&
ppmu->get_mem_weight)
- ppmu->get_mem_weight(&data.weight.full);
+ ppmu->get_mem_weight(&data.weight.full, event->attr.sample_type);

if (perf_event_overflow(event, &data, regs))
power_pmu_stop(event, 0);
diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
index e4f577da33d8..5dcbdbd54598 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -284,8 +284,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
}
}

-void isa207_get_mem_weight(u64 *weight)
+void isa207_get_mem_weight(u64 *weight, u64 type)
{
+ union perf_sample_weight *weight_fields;
+ u64 weight_lat;
u64 mmcra = mfspr(SPRN_MMCRA);
u64 exp = MMCRA_THR_CTR_EXP(mmcra);
u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
@@ -296,9 +298,30 @@ void isa207_get_mem_weight(u64 *weight)
mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);

if (val == 0 || val == 7)
- *weight = 0;
+ weight_lat = 0;
else
- *weight = mantissa << (2 * exp);
+ weight_lat = mantissa << (2 * exp);
+
+ /*
+ * Use 64 bit weight field (full) if sample type is
+ * WEIGHT.
+ *
+ * if sample type is WEIGHT_STRUCT:
+ * - store memory latency in the lower 32 bits.
+ * - For ISA v3.1, use remaining two 16 bit fields of
+ * perf_sample_weight to store cycle counter values
+ * from sier2.
+ */
+ weight_fields = (union perf_sample_weight *)weight;
+ if (type & PERF_SAMPLE_WEIGHT)
+ weight_fields->full = weight_lat;
+ else {
+ weight_fields->var1_dw = (u32)weight_lat;
+ if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+ weight_fields->var2_w = P10_SIER2_FINISH_CYC(mfspr(SPRN_SIER2));
+ weight_fields->var3_w = P10_SIER2_DISPATCH_CYC(mfspr(SPRN_SIER2));
+ }
+ }
}

int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp, u64 event_config1)
diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
index 1af0e8c97ac7..fc30d43c4d0c 100644
--- a/arch/powerpc/perf/isa207-common.h
+++ b/arch/powerpc/perf/isa207-common.h
@@ -265,6 +265,10 @@
#define ISA207_SIER_DATA_SRC_SHIFT 53
#define ISA207_SIER_DATA_SRC_MASK (0x7ull << ISA207_SIER_DATA_SRC_SHIFT)

+/* Bits in SIER2/SIER3 for Power10 */
+#define P10_SIER2_FINISH_CYC(sier2) (((sier2) >> (63 - 37)) & 0x7fful)
+#define P10_SIER2_DISPATCH_CYC(sier2) (((sier2) >> (63 - 13)) & 0x7fful)
+
#define P(a, b) PERF_MEM_S(a, b)
#define PH(a, b) (P(LVL, HIT) | P(a, b))
#define PM(a, b) (P(LVL, MISS) | P(a, b))
@@ -278,6 +282,6 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
const unsigned int ev_alt[][MAX_ALT]);
void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
struct pt_regs *regs);
-void isa207_get_mem_weight(u64 *weight);
+void isa207_get_mem_weight(u64 *weight, u64 type);

#endif
--
1.8.3.1

2021-03-22 15:03:40

by Athira Rajeev

Subject: [PATCH V2 5/5] tools/perf: Display sort dimension p_stage_cyc only on supported archs

The sort dimension "p_stage_cyc" is used to represent pipeline
stage cycle information. Presently, this is used only on powerpc.
For unsupported platforms, we don't want to display it
in the perf report output columns. Hence add a check in sort_dimension__add()
and skip the sort key in case it is not applicable for the particular arch.
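
The skip check described above could look roughly like this standalone sketch. The demo_* names are hypothetical, and the architecture hook is passed as a function pointer here (the patch uses a weak function instead) so that both outcomes can be exercised in one program.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Sort keys that only make sense on some architectures. */
static const char *const demo_arch_specific_keys[] = { "p_stage_cyc" };

/* Return 1 when the key is arch specific and the current arch lacks it. */
static int demo_skip_sort_key(const char *tok,
			      int (*arch_supports)(const char *))
{
	size_t j;

	for (j = 0; j < DEMO_ARRAY_SIZE(demo_arch_specific_keys); j++) {
		if (!strcmp(demo_arch_specific_keys[j], tok) &&
		    !arch_supports(tok))
			return 1;
	}
	return 0;
}

/* Generic default: no arch specific key is supported. */
static int demo_generic_support(const char *sort_key)
{
	(void)sort_key;
	return 0;
}

/* powerpc-like override: only "p_stage_cyc" is supported. */
static int demo_powerpc_support(const char *sort_key)
{
	return !strcmp(sort_key, "p_stage_cyc");
}
```

Keys that are not in the arch specific list are never skipped, so ordinary sort keys behave the same on every architecture.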

Signed-off-by: Athira Rajeev <[email protected]>
---
tools/perf/arch/powerpc/util/event.c | 7 +++++++
tools/perf/util/event.h | 1 +
tools/perf/util/sort.c | 19 +++++++++++++++++++
3 files changed, 27 insertions(+)

diff --git a/tools/perf/arch/powerpc/util/event.c b/tools/perf/arch/powerpc/util/event.c
index 22521bc9481a..3bf441257466 100644
--- a/tools/perf/arch/powerpc/util/event.c
+++ b/tools/perf/arch/powerpc/util/event.c
@@ -44,3 +44,10 @@ const char *arch_perf_header_entry(const char *se_header)
return "Dispatch Cyc";
return se_header;
}
+
+int arch_support_sort_key(const char *sort_key)
+{
+ if (!strcmp(sort_key, "p_stage_cyc"))
+ return 1;
+ return 0;
+}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index e5da4a695ff2..8a62fb39e365 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -429,5 +429,6 @@ void cpu_map_data__synthesize(struct perf_record_cpu_map_data *data, struct per
void arch_perf_parse_sample_weight(struct perf_sample *data, const __u64 *array, u64 type);
void arch_perf_synthesize_sample_weight(const struct perf_sample *data, __u64 *array, u64 type);
const char *arch_perf_header_entry(const char *se_header);
+int arch_support_sort_key(const char *sort_key);

#endif /* __PERF_RECORD_H */
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index d262261ad1a6..e8030778ff44 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -47,6 +47,7 @@
int have_ignore_callees = 0;
enum sort_mode sort__mode = SORT_MODE__NORMAL;
const char *dynamic_headers[] = {"local_ins_lat", "p_stage_cyc"};
+const char *arch_specific_sort_keys[] = {"p_stage_cyc"};

/*
* Replaces all occurrences of a char used with the:
@@ -1837,6 +1838,11 @@ struct sort_dimension {
int taken;
};

+int __weak arch_support_sort_key(const char *sort_key __maybe_unused)
+{
+ return 0;
+}
+
const char * __weak arch_perf_header_entry(const char *se_header)
{
return se_header;
@@ -2773,6 +2779,19 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
{
unsigned int i, j;

+ /*
+ * Check to see if there are any arch specific
+ * sort dimensions not applicable for the current
+ * architecture. If so, Skip that sort key since
+ * we don't want to display it in the output fields.
+ */
+ for (j = 0; j < ARRAY_SIZE(arch_specific_sort_keys); j++) {
+ if (!strcmp(arch_specific_sort_keys[j], tok) &&
+ !arch_support_sort_key(tok)) {
+ return 0;
+ }
+ }
+
for (i = 0; i < ARRAY_SIZE(common_sort_dimensions); i++) {
struct sort_dimension *sd = &common_sort_dimensions[i];

--
1.8.3.1

2021-03-24 22:23:39

by Madhavan Srinivasan

Subject: Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT


On 3/22/21 8:27 PM, Athira Rajeev wrote:
> Performance Monitoring Unit (PMU) registers in powerpc provide
> information on cycles elapsed between different stages in the
> pipeline. This can be used for application tuning. On ISA v3.1
> platforms, this information is exposed by sampling registers.
> This patch adds kernel support to capture two of the cycle counters
> as part of the perf sample using the sample type
> PERF_SAMPLE_WEIGHT_STRUCT.
>
> The power PMU function 'get_mem_weight' currently uses the 64-bit weight
> field of perf_sample_data to capture memory latency. But following the
> introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
> a 64-bit or 32-bit value depending on the architecture support for
> PERF_SAMPLE_WEIGHT_STRUCT. This patchset uses WEIGHT_STRUCT to expose the
> pipeline stage cycles info. Hence update the ppmu functions to work for
> 64-bit and 32-bit weight values.
>
> If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> latency is stored in the low 32 bits of the perf_sample_weight structure.
> Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
> two 16-bit fields of the perf_sample_weight structure.

Changes look fine to me.

Reviewed-by: Madhavan Srinivasan <[email protected]>


> Signed-off-by: Athira Rajeev <[email protected]>
> ---
> arch/powerpc/include/asm/perf_event_server.h | 2 +-
> arch/powerpc/perf/core-book3s.c | 4 ++--
> arch/powerpc/perf/isa207-common.c | 29 +++++++++++++++++++++++++---
> arch/powerpc/perf/isa207-common.h | 6 +++++-
> 4 files changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
> index 00e7e671bb4b..112cf092d7b3 100644
> --- a/arch/powerpc/include/asm/perf_event_server.h
> +++ b/arch/powerpc/include/asm/perf_event_server.h
> @@ -43,7 +43,7 @@ struct power_pmu {
> u64 alt[]);
> void (*get_mem_data_src)(union perf_mem_data_src *dsrc,
> u32 flags, struct pt_regs *regs);
> - void (*get_mem_weight)(u64 *weight);
> + void (*get_mem_weight)(u64 *weight, u64 type);
> unsigned long group_constraint_mask;
> unsigned long group_constraint_val;
> u64 (*bhrb_filter_map)(u64 branch_sample_type);
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 766f064f00fb..6936763246bd 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2206,9 +2206,9 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
> ppmu->get_mem_data_src)
> ppmu->get_mem_data_src(&data.data_src, ppmu->flags, regs);
>
> - if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&
> + if (event->attr.sample_type & PERF_SAMPLE_WEIGHT_TYPE &&
> ppmu->get_mem_weight)
> - ppmu->get_mem_weight(&data.weight.full);
> + ppmu->get_mem_weight(&data.weight.full, event->attr.sample_type);
>
> if (perf_event_overflow(event, &data, regs))
> power_pmu_stop(event, 0);
> diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
> index e4f577da33d8..5dcbdbd54598 100644
> --- a/arch/powerpc/perf/isa207-common.c
> +++ b/arch/powerpc/perf/isa207-common.c
> @@ -284,8 +284,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
> }
> }
>
> -void isa207_get_mem_weight(u64 *weight)
> +void isa207_get_mem_weight(u64 *weight, u64 type)
> {
> + union perf_sample_weight *weight_fields;
> + u64 weight_lat;
> u64 mmcra = mfspr(SPRN_MMCRA);
> u64 exp = MMCRA_THR_CTR_EXP(mmcra);
> u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
> @@ -296,9 +298,30 @@ void isa207_get_mem_weight(u64 *weight)
> mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
>
> if (val == 0 || val == 7)
> - *weight = 0;
> + weight_lat = 0;
> else
> - *weight = mantissa << (2 * exp);
> + weight_lat = mantissa << (2 * exp);
> +
> + /*
> + * Use 64 bit weight field (full) if sample type is
> + * WEIGHT.
> + *
> + * if sample type is WEIGHT_STRUCT:
> + * - store memory latency in the lower 32 bits.
> + * - For ISA v3.1, use remaining two 16 bit fields of
> + * perf_sample_weight to store cycle counter values
> + * from sier2.
> + */
> + weight_fields = (union perf_sample_weight *)weight;
> + if (type & PERF_SAMPLE_WEIGHT)
> + weight_fields->full = weight_lat;
> + else {
> + weight_fields->var1_dw = (u32)weight_lat;
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> + weight_fields->var2_w = P10_SIER2_FINISH_CYC(mfspr(SPRN_SIER2));
> + weight_fields->var3_w = P10_SIER2_DISPATCH_CYC(mfspr(SPRN_SIER2));
> + }
> + }
> }
>
> int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp, u64 event_config1)
> diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
> index 1af0e8c97ac7..fc30d43c4d0c 100644
> --- a/arch/powerpc/perf/isa207-common.h
> +++ b/arch/powerpc/perf/isa207-common.h
> @@ -265,6 +265,10 @@
> #define ISA207_SIER_DATA_SRC_SHIFT 53
> #define ISA207_SIER_DATA_SRC_MASK (0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
>
> +/* Bits in SIER2/SIER3 for Power10 */
> +#define P10_SIER2_FINISH_CYC(sier2) (((sier2) >> (63 - 37)) & 0x7fful)
> +#define P10_SIER2_DISPATCH_CYC(sier2) (((sier2) >> (63 - 13)) & 0x7fful)
> +
> #define P(a, b) PERF_MEM_S(a, b)
> #define PH(a, b) (P(LVL, HIT) | P(a, b))
> #define PM(a, b) (P(LVL, MISS) | P(a, b))
> @@ -278,6 +282,6 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
> const unsigned int ev_alt[][MAX_ALT]);
> void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
> struct pt_regs *regs);
> -void isa207_get_mem_weight(u64 *weight);
> +void isa207_get_mem_weight(u64 *weight, u64 type);
>
> #endif

2021-03-25 13:03:23

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT

On Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan wrote:
>
> On 3/22/21 8:27 PM, Athira Rajeev wrote:
> > Performance Monitoring Unit (PMU) registers in powerpc provide
> > information on cycles elapsed between different stages in the
> > pipeline. This can be used for application tuning. On ISA v3.1
> > platforms, this information is exposed by sampling registers.
> > This patch adds kernel support to capture two of the cycle counters
> > as part of the perf sample using the sample type
> > PERF_SAMPLE_WEIGHT_STRUCT.
> >
> > The power PMU function 'get_mem_weight' currently uses the 64-bit weight
> > field of perf_sample_data to capture memory latency. But following the
> > introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
> > a 64-bit or 32-bit value depending on the architecture support for
> > PERF_SAMPLE_WEIGHT_STRUCT. This patchset uses WEIGHT_STRUCT to expose the
> > pipeline stage cycles info. Hence update the ppmu functions to work for
> > 64-bit and 32-bit weight values.
> >
> > If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> > If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> > latency is stored in the low 32 bits of the perf_sample_weight structure.
> > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
> > two 16-bit fields of the perf_sample_weight structure.
>
> Changes looks fine to me.
>
> Reviewed-by: Madhavan Srinivasan <[email protected]>

So who will process the kernel bits? I'm merging the tooling parts.

Thanks,

- Arnaldo

>
> > Signed-off-by: Athira Rajeev <[email protected]>
> > ---
> > arch/powerpc/include/asm/perf_event_server.h | 2 +-
> > arch/powerpc/perf/core-book3s.c | 4 ++--
> > arch/powerpc/perf/isa207-common.c | 29 +++++++++++++++++++++++++---
> > arch/powerpc/perf/isa207-common.h | 6 +++++-
> > 4 files changed, 34 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
> > index 00e7e671bb4b..112cf092d7b3 100644
> > --- a/arch/powerpc/include/asm/perf_event_server.h
> > +++ b/arch/powerpc/include/asm/perf_event_server.h
> > @@ -43,7 +43,7 @@ struct power_pmu {
> > u64 alt[]);
> > void (*get_mem_data_src)(union perf_mem_data_src *dsrc,
> > u32 flags, struct pt_regs *regs);
> > - void (*get_mem_weight)(u64 *weight);
> > + void (*get_mem_weight)(u64 *weight, u64 type);
> > unsigned long group_constraint_mask;
> > unsigned long group_constraint_val;
> > u64 (*bhrb_filter_map)(u64 branch_sample_type);
> > diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> > index 766f064f00fb..6936763246bd 100644
> > --- a/arch/powerpc/perf/core-book3s.c
> > +++ b/arch/powerpc/perf/core-book3s.c
> > @@ -2206,9 +2206,9 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
> > ppmu->get_mem_data_src)
> > ppmu->get_mem_data_src(&data.data_src, ppmu->flags, regs);
> > - if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&
> > + if (event->attr.sample_type & PERF_SAMPLE_WEIGHT_TYPE &&
> > ppmu->get_mem_weight)
> > - ppmu->get_mem_weight(&data.weight.full);
> > + ppmu->get_mem_weight(&data.weight.full, event->attr.sample_type);
> > if (perf_event_overflow(event, &data, regs))
> > power_pmu_stop(event, 0);
> > diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
> > index e4f577da33d8..5dcbdbd54598 100644
> > --- a/arch/powerpc/perf/isa207-common.c
> > +++ b/arch/powerpc/perf/isa207-common.c
> > @@ -284,8 +284,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
> > }
> > }
> > -void isa207_get_mem_weight(u64 *weight)
> > +void isa207_get_mem_weight(u64 *weight, u64 type)
> > {
> > + union perf_sample_weight *weight_fields;
> > + u64 weight_lat;
> > u64 mmcra = mfspr(SPRN_MMCRA);
> > u64 exp = MMCRA_THR_CTR_EXP(mmcra);
> > u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
> > @@ -296,9 +298,30 @@ void isa207_get_mem_weight(u64 *weight)
> > mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
> > if (val == 0 || val == 7)
> > - *weight = 0;
> > + weight_lat = 0;
> > else
> > - *weight = mantissa << (2 * exp);
> > + weight_lat = mantissa << (2 * exp);
> > +
> > + /*
> > + * Use 64 bit weight field (full) if sample type is
> > + * WEIGHT.
> > + *
> > + * if sample type is WEIGHT_STRUCT:
> > + * - store memory latency in the lower 32 bits.
> > + * - For ISA v3.1, use remaining two 16 bit fields of
> > + * perf_sample_weight to store cycle counter values
> > + * from sier2.
> > + */
> > + weight_fields = (union perf_sample_weight *)weight;
> > + if (type & PERF_SAMPLE_WEIGHT)
> > + weight_fields->full = weight_lat;
> > + else {
> > + weight_fields->var1_dw = (u32)weight_lat;
> > + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> > + weight_fields->var2_w = P10_SIER2_FINISH_CYC(mfspr(SPRN_SIER2));
> > + weight_fields->var3_w = P10_SIER2_DISPATCH_CYC(mfspr(SPRN_SIER2));
> > + }
> > + }
> > }
> > int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp, u64 event_config1)
> > diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
> > index 1af0e8c97ac7..fc30d43c4d0c 100644
> > --- a/arch/powerpc/perf/isa207-common.h
> > +++ b/arch/powerpc/perf/isa207-common.h
> > @@ -265,6 +265,10 @@
> > #define ISA207_SIER_DATA_SRC_SHIFT 53
> > #define ISA207_SIER_DATA_SRC_MASK (0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
> > +/* Bits in SIER2/SIER3 for Power10 */
> > +#define P10_SIER2_FINISH_CYC(sier2) (((sier2) >> (63 - 37)) & 0x7fful)
> > +#define P10_SIER2_DISPATCH_CYC(sier2) (((sier2) >> (63 - 13)) & 0x7fful)
> > +
> > #define P(a, b) PERF_MEM_S(a, b)
> > #define PH(a, b) (P(LVL, HIT) | P(a, b))
> > #define PM(a, b) (P(LVL, MISS) | P(a, b))
> > @@ -278,6 +282,6 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
> > const unsigned int ev_alt[][MAX_ALT]);
> > void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
> > struct pt_regs *regs);
> > -void isa207_get_mem_weight(u64 *weight);
> > +void isa207_get_mem_weight(u64 *weight, u64 type);
> > #endif
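
As a rough illustration of the WEIGHT_STRUCT layout described in the commit message, the sketch below packs and unpacks the three fields with plain shifts. It assumes the field placement implied by the patch comment (latency in the low 32 bits, then the two 16-bit fields above it); struct weight_parts, weight_pack and weight_unpack are hypothetical names for a userspace reader, not part of the kernel or perf tools.

```c
#include <stdint.h>

/* Hypothetical userspace mirror of the three WEIGHT_STRUCT fields. */
struct weight_parts {
	uint32_t var1_dw;	/* memory latency, bits 0..31 */
	uint16_t var2_w;	/* finish cycles, bits 32..47 */
	uint16_t var3_w;	/* dispatch cycles, bits 48..63 */
};

/* Combine the fields into the single 64-bit weight value. */
static inline uint64_t weight_pack(struct weight_parts p)
{
	return (uint64_t)p.var1_dw |
	       ((uint64_t)p.var2_w << 32) |
	       ((uint64_t)p.var3_w << 48);
}

/* Split a 64-bit weight value back into its three fields. */
static inline struct weight_parts weight_unpack(uint64_t full)
{
	struct weight_parts p = {
		.var1_dw = (uint32_t)(full & 0xffffffffu),
		.var2_w  = (uint16_t)((full >> 32) & 0xffffu),
		.var3_w  = (uint16_t)((full >> 48) & 0xffffu),
	};

	return p;
}
```

Using shifts rather than overlaying a struct keeps the sketch independent of host endianness.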

--

- Arnaldo

2021-03-25 13:10:47

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT

On Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan wrote:
>
> On 3/22/21 8:27 PM, Athira Rajeev wrote:
> > Performance Monitoring Unit (PMU) registers in powerpc provide
> > information on cycles elapsed between different stages in the
> > pipeline. This can be used for application tuning. On ISA v3.1
> > platforms, this information is exposed by sampling registers.
> > This patch adds kernel support to capture two of the cycle counters
> > as part of the perf sample using the sample type
> > PERF_SAMPLE_WEIGHT_STRUCT.
> >
> > The power PMU function 'get_mem_weight' currently uses the 64-bit weight
> > field of perf_sample_data to capture memory latency. But following the
> > introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
> > a 64-bit or 32-bit value depending on the architecture support for
> > PERF_SAMPLE_WEIGHT_STRUCT. The patches use WEIGHT_STRUCT to expose the
> > pipeline stage cycles info. Hence update the ppmu functions to work for
> > 64-bit and 32-bit weight values.
> >
> > If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> > If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> > latency is stored in the low 32 bits of the perf_sample_weight structure.
> > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
> > two 16-bit fields of the perf_sample_weight structure.
>
> Changes look fine to me.

Do you mean just the kernel part, or can I add your Reviewed-by to the
whole patchset?

> Reviewed-by: Madhavan Srinivasan <[email protected]>
>
>
> > Signed-off-by: Athira Rajeev <[email protected]>
> > ---
> > arch/powerpc/include/asm/perf_event_server.h | 2 +-
> > arch/powerpc/perf/core-book3s.c | 4 ++--
> > arch/powerpc/perf/isa207-common.c | 29 +++++++++++++++++++++++++---
> > arch/powerpc/perf/isa207-common.h | 6 +++++-
> > 4 files changed, 34 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
> > index 00e7e671bb4b..112cf092d7b3 100644
> > --- a/arch/powerpc/include/asm/perf_event_server.h
> > +++ b/arch/powerpc/include/asm/perf_event_server.h
> > @@ -43,7 +43,7 @@ struct power_pmu {
> > u64 alt[]);
> > void (*get_mem_data_src)(union perf_mem_data_src *dsrc,
> > u32 flags, struct pt_regs *regs);
> > - void (*get_mem_weight)(u64 *weight);
> > + void (*get_mem_weight)(u64 *weight, u64 type);
> > unsigned long group_constraint_mask;
> > unsigned long group_constraint_val;
> > u64 (*bhrb_filter_map)(u64 branch_sample_type);
> > diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> > index 766f064f00fb..6936763246bd 100644
> > --- a/arch/powerpc/perf/core-book3s.c
> > +++ b/arch/powerpc/perf/core-book3s.c
> > @@ -2206,9 +2206,9 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
> > ppmu->get_mem_data_src)
> > ppmu->get_mem_data_src(&data.data_src, ppmu->flags, regs);
> > - if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&
> > + if (event->attr.sample_type & PERF_SAMPLE_WEIGHT_TYPE &&
> > ppmu->get_mem_weight)
> > - ppmu->get_mem_weight(&data.weight.full);
> > + ppmu->get_mem_weight(&data.weight.full, event->attr.sample_type);
> > if (perf_event_overflow(event, &data, regs))
> > power_pmu_stop(event, 0);
> > diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
> > index e4f577da33d8..5dcbdbd54598 100644
> > --- a/arch/powerpc/perf/isa207-common.c
> > +++ b/arch/powerpc/perf/isa207-common.c
> > @@ -284,8 +284,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
> > }
> > }
> > -void isa207_get_mem_weight(u64 *weight)
> > +void isa207_get_mem_weight(u64 *weight, u64 type)
> > {
> > + union perf_sample_weight *weight_fields;
> > + u64 weight_lat;
> > u64 mmcra = mfspr(SPRN_MMCRA);
> > u64 exp = MMCRA_THR_CTR_EXP(mmcra);
> > u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
> > @@ -296,9 +298,30 @@ void isa207_get_mem_weight(u64 *weight)
> > mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
> > if (val == 0 || val == 7)
> > - *weight = 0;
> > + weight_lat = 0;
> > else
> > - *weight = mantissa << (2 * exp);
> > + weight_lat = mantissa << (2 * exp);
> > +
> > + /*
> > + * Use 64 bit weight field (full) if sample type is
> > + * WEIGHT.
> > + *
> > + * if sample type is WEIGHT_STRUCT:
> > + * - store memory latency in the lower 32 bits.
> > + * - For ISA v3.1, use remaining two 16 bit fields of
> > + * perf_sample_weight to store cycle counter values
> > + * from sier2.
> > + */
> > + weight_fields = (union perf_sample_weight *)weight;
> > + if (type & PERF_SAMPLE_WEIGHT)
> > + weight_fields->full = weight_lat;
> > + else {
> > + weight_fields->var1_dw = (u32)weight_lat;
> > + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> > + weight_fields->var2_w = P10_SIER2_FINISH_CYC(mfspr(SPRN_SIER2));
> > + weight_fields->var3_w = P10_SIER2_DISPATCH_CYC(mfspr(SPRN_SIER2));
> > + }
> > + }
> > }
> > int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp, u64 event_config1)
> > diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
> > index 1af0e8c97ac7..fc30d43c4d0c 100644
> > --- a/arch/powerpc/perf/isa207-common.h
> > +++ b/arch/powerpc/perf/isa207-common.h
> > @@ -265,6 +265,10 @@
> > #define ISA207_SIER_DATA_SRC_SHIFT 53
> > #define ISA207_SIER_DATA_SRC_MASK (0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
> > +/* Bits in SIER2/SIER3 for Power10 */
> > +#define P10_SIER2_FINISH_CYC(sier2) (((sier2) >> (63 - 37)) & 0x7fful)
> > +#define P10_SIER2_DISPATCH_CYC(sier2) (((sier2) >> (63 - 13)) & 0x7fful)
> > +
> > #define P(a, b) PERF_MEM_S(a, b)
> > #define PH(a, b) (P(LVL, HIT) | P(a, b))
> > #define PM(a, b) (P(LVL, MISS) | P(a, b))
> > @@ -278,6 +282,6 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
> > const unsigned int ev_alt[][MAX_ALT]);
> > void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
> > struct pt_regs *regs);
> > -void isa207_get_mem_weight(u64 *weight);
> > +void isa207_get_mem_weight(u64 *weight, u64 type);
> > #endif
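
The latency computation in isa207_get_mem_weight() can be read as mantissa * 4^exp. Below is a small sketch of just that arithmetic, under the assumption that the caller has already extracted the exponent and mantissa from MMCRA. The parameter elig stands in for the patch's 'val', whose origin is not shown in the quoted hunk, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode a threshold-counter style latency: mantissa scaled by 4^exp.
 * `elig` mirrors the patch's `val` check: values 0 and 7 yield a zero
 * weight. (thresh_ctr_latency is a hypothetical name.)
 */
static inline uint64_t thresh_ctr_latency(unsigned int elig, uint64_t exp,
					  uint64_t mantissa)
{
	/* Keep the shift count below 64 so the shift is well defined. */
	assert(exp < 32);

	if (elig == 0 || elig == 7)
		return 0;

	return mantissa << (2 * exp);
}
```

For instance, a mantissa of 37 with an exponent of 2 gives 37 << 4, that is 592 cycles.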

--

- Arnaldo

2021-03-25 14:41:57

by Peter Zijlstra

Subject: Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT

On Thu, Mar 25, 2021 at 10:01:35AM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan wrote:
> >
> > On 3/22/21 8:27 PM, Athira Rajeev wrote:
> > > Performance Monitoring Unit (PMU) registers in powerpc provide
> > > information on cycles elapsed between different stages in the
> > > pipeline. This can be used for application tuning. On ISA v3.1
> > > platforms, this information is exposed by sampling registers.
> > > This patch adds kernel support to capture two of the cycle counters
> > > as part of the perf sample using the sample type
> > > PERF_SAMPLE_WEIGHT_STRUCT.
> > >
> > > The power PMU function 'get_mem_weight' currently uses the 64-bit weight
> > > field of perf_sample_data to capture memory latency. But following the
> > > introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
> > > a 64-bit or 32-bit value depending on the architecture support for
> > > PERF_SAMPLE_WEIGHT_STRUCT. The patches use WEIGHT_STRUCT to expose the
> > > pipeline stage cycles info. Hence update the ppmu functions to work for
> > > 64-bit and 32-bit weight values.
> > >
> > > If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> > > If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> > > latency is stored in the low 32 bits of the perf_sample_weight structure.
> > > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
> > > two 16-bit fields of the perf_sample_weight structure.
> >
> > Changes look fine to me.
> >
> > Reviewed-by: Madhavan Srinivasan <[email protected]>
>
> So who will process the kernel bits? I'm merging the tooling parts.

I was sorta expecting these to go through the powerpc tree. Let me know
if you want them in tip/perf/core instead.

2021-03-25 16:44:36

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT



On March 25, 2021 11:38:01 AM GMT-03:00, Peter Zijlstra <[email protected]> wrote:
>On Thu, Mar 25, 2021 at 10:01:35AM -0300, Arnaldo Carvalho de Melo wrote:
>> > > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
>> > > two 16-bit fields of the perf_sample_weight structure.
>> >
>> > Changes look fine to me.
>> >
>> > Reviewed-by: Madhavan Srinivasan <[email protected]>
>>
>> So who will process the kernel bits? I'm merging the tooling parts.
>
>I was sorta expecting these to go through the powerpc tree. Let me know
>if you want them in tip/perf/core instead.

Shouldn't matter by which tree it gets upstream, as long as it gets picked :-)

- Arnaldo


2021-03-26 08:35:22

by Madhavan Srinivasan

Subject: Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT


On 3/25/21 6:36 PM, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan wrote:
>> On 3/22/21 8:27 PM, Athira Rajeev wrote:
>>> Performance Monitoring Unit (PMU) registers in powerpc provide
>>> information on cycles elapsed between different stages in the
>>> pipeline. This can be used for application tuning. On ISA v3.1
>>> platforms, this information is exposed by sampling registers.
>>> This patch adds kernel support to capture two of the cycle counters
>>> as part of the perf sample using the sample type
>>> PERF_SAMPLE_WEIGHT_STRUCT.
>>>
>>> The power PMU function 'get_mem_weight' currently uses the 64-bit weight
>>> field of perf_sample_data to capture memory latency. But following the
>>> introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
>>> a 64-bit or 32-bit value depending on the architecture support for
>>> PERF_SAMPLE_WEIGHT_STRUCT. The patches use WEIGHT_STRUCT to expose the
>>> pipeline stage cycles info. Hence update the ppmu functions to work for
>>> 64-bit and 32-bit weight values.
>>>
>>> If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
>>> If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
>>> latency is stored in the low 32 bits of the perf_sample_weight structure.
>>> Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
>>> two 16-bit fields of the perf_sample_weight structure.
>> Changes look fine to me.
> Do you mean just the kernel part, or can I add your Reviewed-by to the
> whole patchset?


Yes, kindly add it; I did review the whole patchset. My bad, I should have
mentioned it here or replied to the cover letter.


Maddy


>
>> Reviewed-by: Madhavan Srinivasan <[email protected]>
>>
>>
>>> Signed-off-by: Athira Rajeev <[email protected]>
>>> ---
>>> arch/powerpc/include/asm/perf_event_server.h | 2 +-
>>> arch/powerpc/perf/core-book3s.c | 4 ++--
>>> arch/powerpc/perf/isa207-common.c | 29 +++++++++++++++++++++++++---
>>> arch/powerpc/perf/isa207-common.h | 6 +++++-
>>> 4 files changed, 34 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
>>> index 00e7e671bb4b..112cf092d7b3 100644
>>> --- a/arch/powerpc/include/asm/perf_event_server.h
>>> +++ b/arch/powerpc/include/asm/perf_event_server.h
>>> @@ -43,7 +43,7 @@ struct power_pmu {
>>> u64 alt[]);
>>> void (*get_mem_data_src)(union perf_mem_data_src *dsrc,
>>> u32 flags, struct pt_regs *regs);
>>> - void (*get_mem_weight)(u64 *weight);
>>> + void (*get_mem_weight)(u64 *weight, u64 type);
>>> unsigned long group_constraint_mask;
>>> unsigned long group_constraint_val;
>>> u64 (*bhrb_filter_map)(u64 branch_sample_type);
>>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>>> index 766f064f00fb..6936763246bd 100644
>>> --- a/arch/powerpc/perf/core-book3s.c
>>> +++ b/arch/powerpc/perf/core-book3s.c
>>> @@ -2206,9 +2206,9 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>>> ppmu->get_mem_data_src)
>>> ppmu->get_mem_data_src(&data.data_src, ppmu->flags, regs);
>>> - if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&
>>> + if (event->attr.sample_type & PERF_SAMPLE_WEIGHT_TYPE &&
>>> ppmu->get_mem_weight)
>>> - ppmu->get_mem_weight(&data.weight.full);
>>> + ppmu->get_mem_weight(&data.weight.full, event->attr.sample_type);
>>> if (perf_event_overflow(event, &data, regs))
>>> power_pmu_stop(event, 0);
>>> diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
>>> index e4f577da33d8..5dcbdbd54598 100644
>>> --- a/arch/powerpc/perf/isa207-common.c
>>> +++ b/arch/powerpc/perf/isa207-common.c
>>> @@ -284,8 +284,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
>>> }
>>> }
>>> -void isa207_get_mem_weight(u64 *weight)
>>> +void isa207_get_mem_weight(u64 *weight, u64 type)
>>> {
>>> + union perf_sample_weight *weight_fields;
>>> + u64 weight_lat;
>>> u64 mmcra = mfspr(SPRN_MMCRA);
>>> u64 exp = MMCRA_THR_CTR_EXP(mmcra);
>>> u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
>>> @@ -296,9 +298,30 @@ void isa207_get_mem_weight(u64 *weight)
>>> mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
>>> if (val == 0 || val == 7)
>>> - *weight = 0;
>>> + weight_lat = 0;
>>> else
>>> - *weight = mantissa << (2 * exp);
>>> + weight_lat = mantissa << (2 * exp);
>>> +
>>> + /*
>>> + * Use 64 bit weight field (full) if sample type is
>>> + * WEIGHT.
>>> + *
>>> + * if sample type is WEIGHT_STRUCT:
>>> + * - store memory latency in the lower 32 bits.
>>> + * - For ISA v3.1, use remaining two 16 bit fields of
>>> + * perf_sample_weight to store cycle counter values
>>> + * from sier2.
>>> + */
>>> + weight_fields = (union perf_sample_weight *)weight;
>>> + if (type & PERF_SAMPLE_WEIGHT)
>>> + weight_fields->full = weight_lat;
>>> + else {
>>> + weight_fields->var1_dw = (u32)weight_lat;
>>> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
>>> + weight_fields->var2_w = P10_SIER2_FINISH_CYC(mfspr(SPRN_SIER2));
>>> + weight_fields->var3_w = P10_SIER2_DISPATCH_CYC(mfspr(SPRN_SIER2));
>>> + }
>>> + }
>>> }
>>> int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp, u64 event_config1)
>>> diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
>>> index 1af0e8c97ac7..fc30d43c4d0c 100644
>>> --- a/arch/powerpc/perf/isa207-common.h
>>> +++ b/arch/powerpc/perf/isa207-common.h
>>> @@ -265,6 +265,10 @@
>>> #define ISA207_SIER_DATA_SRC_SHIFT 53
>>> #define ISA207_SIER_DATA_SRC_MASK (0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
>>> +/* Bits in SIER2/SIER3 for Power10 */
>>> +#define P10_SIER2_FINISH_CYC(sier2) (((sier2) >> (63 - 37)) & 0x7fful)
>>> +#define P10_SIER2_DISPATCH_CYC(sier2) (((sier2) >> (63 - 13)) & 0x7fful)
>>> +
>>> #define P(a, b) PERF_MEM_S(a, b)
>>> #define PH(a, b) (P(LVL, HIT) | P(a, b))
>>> #define PM(a, b) (P(LVL, MISS) | P(a, b))
>>> @@ -278,6 +282,6 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
>>> const unsigned int ev_alt[][MAX_ALT]);
>>> void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
>>> struct pt_regs *regs);
>>> -void isa207_get_mem_weight(u64 *weight);
>>> +void isa207_get_mem_weight(u64 *weight, u64 type);
>>> #endif
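
To make the two sample-type paths concrete, here is a hedged sketch of the store logic in plain C. The SKETCH_* bit values mirror what I believe include/uapi/linux/perf_event.h defines for PERF_SAMPLE_WEIGHT and PERF_SAMPLE_WEIGHT_STRUCT, so please check them against your headers. store_weight and its parameters are hypothetical, and the cycle values are passed in rather than read from SIER2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values; verify against the uapi perf_event.h header. */
#define SKETCH_WEIGHT        (UINT64_C(1) << 14)
#define SKETCH_WEIGHT_STRUCT (UINT64_C(1) << 24)
#define SKETCH_WEIGHT_TYPE   (SKETCH_WEIGHT | SKETCH_WEIGHT_STRUCT)

/*
 * Store a latency into the 64-bit weight slot according to the sample
 * type. Returns false when neither weight flavour was requested.
 */
static inline bool store_weight(uint64_t *slot, uint64_t type, uint64_t lat,
				bool arch31, uint16_t finish,
				uint16_t dispatch)
{
	if (!(type & SKETCH_WEIGHT_TYPE))
		return false;

	if (type & SKETCH_WEIGHT) {
		/* Plain WEIGHT: the whole 64-bit slot carries the latency. */
		*slot = lat;
		return true;
	}

	/* WEIGHT_STRUCT: latency is truncated into the low 32 bits. */
	*slot = (*slot & ~UINT64_C(0xffffffff)) | (lat & UINT64_C(0xffffffff));

	/* The two 16-bit fields are only written on ISA v3.1 parts. */
	if (arch31)
		*slot = (*slot & UINT64_C(0xffffffff)) |
			((uint64_t)finish << 32) |
			((uint64_t)dispatch << 48);

	return true;
}
```

Note that in the WEIGHT_STRUCT path a latency wider than 32 bits loses its upper bits, which matches the (u32) cast in the patch.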

2021-03-27 13:24:39

by Michael Ellerman

Subject: Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT

Arnaldo <[email protected]> writes:
> On March 25, 2021 11:38:01 AM GMT-03:00, Peter Zijlstra <[email protected]> wrote:
>>On Thu, Mar 25, 2021 at 10:01:35AM -0300, Arnaldo Carvalho de Melo wrote:
>>> > > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
>>> > > two 16-bit fields of the perf_sample_weight structure.
>>> >
>>> > Changes look fine to me.
>>> >
>>> > Reviewed-by: Madhavan Srinivasan <[email protected]>
>>>
>>> So who will process the kernel bits? I'm merging the tooling parts.
>>
>>I was sorta expecting these to go through the powerpc tree. Let me know
>>if you want them in tip/perf/core instead.
>
> Shouldn't matter by which tree it gets upstream, as long as it gets picked :-)

I plan to take them, just haven't got around to it yet :}

cheers

2021-04-21 16:57:42

by Michael Ellerman

Subject: Re: [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information

On Mon, 22 Mar 2021 10:57:22 -0400, Athira Rajeev wrote:
> Performance Monitoring Unit (PMU) registers in powerpc export the
> number of cycles elapsed between different stages in the pipeline,
> for example through the sampling registers in ISA v3.1.
>
> This patchset implements kernel and perf tools support to expose
> these pipeline stage cycles using the sample type PERF_SAMPLE_WEIGHT_TYPE.
>
> [...]

Patch 1 applied to powerpc/next.

[1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
https://git.kernel.org/powerpc/c/af31fd0c9107e400a8eb89d0eafb40bb78802f79

cheers