2022-08-11 07:04:41

by Leo Yan

Subject: [PATCH v6 00/15] perf c2c: Support data source and display for Arm64

Arm64 Neoverse CPUs support data source in Arm SPE trace, which allows
us to detect cache line contention and transfers.

This patch set has been rebased on the acme/perf/core branch with the latest
commit b39c9e1b101d ("perf machine: Fix missing free of
machine->kallsyms_filename").

For the build to succeed, a compilation fix [1] has been sent to LKML;
this patch set depends on it. This patch set has been verified for both
x86 perf memory events and Arm SPE events.

[1] https://lore.kernel.org/lkml/[email protected]/

Changes from v5:
* Removed the patch "perf: Add SNOOP_PEER flag to perf mem data struct"
(Arnaldo);
* Removed the patch "perf arm-spe: Don't set data source if it's not a
memory operation", which has been merged in the mainline kernel, so the
merge conflict goes away;
* Rebased on the latest acme perf/core branch, with no code changes
compared to the previous version.

Changes from v4:
* Included Ali's patch set for adding data source in Arm SPE samples;
* Added Ian's ACK and Ali's review and test tags;
* Updated documentation for the default peer display for Arm64 (Ali).

Changes from v3:
* Changed to display remote and local peer accesses (Joe);
* Fixed the usage info for display types (Joe);
* Do not display HITM dimensions when using the 'peer' display, and the
HITM display doesn't show any 'peer' dimensions (James);
* Split into smaller patches for adding dimensions of peer operations;
* Updated documentation to reflect the latest GUI and stdio.


Ali Saidi (2):
perf tools: sync addition of PERF_MEM_SNOOPX_PEER
perf arm-spe: Use SPE data source for neoverse cores

Leo Yan (13):
perf mem: Print snoop peer flag
perf mem: Add statistics for peer snooping
perf c2c: Output statistics for peer snooping
perf c2c: Add dimensions for peer load operations
perf c2c: Add dimensions of peer metrics for cache line view
perf c2c: Add mean dimensions for peer operations
perf c2c: Use explicit names for display macros
perf c2c: Rename dimension from 'percent_hitm' to
'percent_costly_snoop'
perf c2c: Refactor node header
perf c2c: Refactor display string
perf c2c: Sort on peer snooping for load operations
perf c2c: Use 'peer' as default display for Arm64
perf c2c: Update documentation for new display option 'peer'

tools/include/uapi/linux/perf_event.h | 2 +-
tools/perf/Documentation/perf-c2c.txt | 31 +-
tools/perf/builtin-c2c.c | 454 ++++++++++++++----
.../util/arm-spe-decoder/arm-spe-decoder.c | 1 +
.../util/arm-spe-decoder/arm-spe-decoder.h | 12 +
tools/perf/util/arm-spe.c | 130 ++++-
tools/perf/util/mem-events.c | 46 +-
tools/perf/util/mem-events.h | 3 +
8 files changed, 547 insertions(+), 132 deletions(-)

--
2.34.1


2022-08-11 07:05:39

by Leo Yan

Subject: [PATCH v6 03/15] perf arm-spe: Use SPE data source for neoverse cores

From: Ali Saidi <[email protected]>

When synthesizing data from SPE, augment the type with source information
for Arm Neoverse cores. The field is IMPLDEF but the Neoverse cores all use
the same encoding. I can't find encoding information for any other SPE
implementations to unify their choices with Arm's thus that is left for
future work.

This change populates the mem_lvl_num for Neoverse cores as well as the
deprecated mem_lvl namespace.

Signed-off-by: Ali Saidi <[email protected]>
Reviewed-by: German Gomez <[email protected]>
Reviewed-by: Leo Yan <[email protected]>
Tested-by: Leo Yan <[email protected]>
---
.../util/arm-spe-decoder/arm-spe-decoder.c | 1 +
.../util/arm-spe-decoder/arm-spe-decoder.h | 12 ++
tools/perf/util/arm-spe.c | 130 +++++++++++++++---
3 files changed, 127 insertions(+), 16 deletions(-)

diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
index 5e390a1a79ab..091987dd3966 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
@@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)

break;
case ARM_SPE_DATA_SOURCE:
+ decoder->record.source = payload;
break;
case ARM_SPE_BAD:
break;
diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
index 69b31084d6be..46a61df1145b 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
@@ -29,6 +29,17 @@ enum arm_spe_op_type {
ARM_SPE_ST = 1 << 1,
};

+enum arm_spe_neoverse_data_source {
+ ARM_SPE_NV_L1D = 0x0,
+ ARM_SPE_NV_L2 = 0x8,
+ ARM_SPE_NV_PEER_CORE = 0x9,
+ ARM_SPE_NV_LOCAL_CLUSTER = 0xa,
+ ARM_SPE_NV_SYS_CACHE = 0xb,
+ ARM_SPE_NV_PEER_CLUSTER = 0xc,
+ ARM_SPE_NV_REMOTE = 0xd,
+ ARM_SPE_NV_DRAM = 0xe,
+};
+
struct arm_spe_record {
enum arm_spe_sample_type type;
int err;
@@ -40,6 +51,7 @@ struct arm_spe_record {
u64 virt_addr;
u64 phys_addr;
u64 context_id;
+ u16 source;
};

struct arm_spe_insn;
diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
index d040406f3314..22dcfe07e886 100644
--- a/tools/perf/util/arm-spe.c
+++ b/tools/perf/util/arm-spe.c
@@ -34,6 +34,7 @@
#include "arm-spe-decoder/arm-spe-decoder.h"
#include "arm-spe-decoder/arm-spe-pkt-decoder.h"

+#include "../../arch/arm64/include/asm/cputype.h"
#define MAX_TIMESTAMP (~0ULL)

struct arm_spe {
@@ -45,6 +46,7 @@ struct arm_spe {
struct perf_session *session;
struct machine *machine;
u32 pmu_type;
+ u64 midr;

struct perf_tsc_conversion tc;

@@ -387,35 +389,128 @@ static int arm_spe__synth_instruction_sample(struct arm_spe_queue *speq,
return arm_spe_deliver_synth_event(spe, speq, event, &sample);
}

-static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
+static const struct midr_range neoverse_spe[] = {
+ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
+ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
+ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
+ {},
+};
+
+static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
+ union perf_mem_data_src *data_src)
{
- union perf_mem_data_src data_src = { 0 };
+ /*
+ * Even though four levels of cache hierarchy are possible, no known
+ * production Neoverse systems currently include more than three levels
+ * so for the time being we assume three exist. If a production system
+ * is built with four then this function would have to be changed to
+ * detect the number of levels for reporting.
+ */

- if (record->op == ARM_SPE_LD)
- data_src.mem_op = PERF_MEM_OP_LOAD;
- else if (record->op == ARM_SPE_ST)
- data_src.mem_op = PERF_MEM_OP_STORE;
- else
- return 0;
+ /*
+ * We have no data on the hit level or data source for stores in the
+ * Neoverse SPE records.
+ */
+ if (record->op & ARM_SPE_ST) {
+ data_src->mem_lvl = PERF_MEM_LVL_NA;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_NA;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NA;
+ return;
+ }
+
+ switch (record->source) {
+ case ARM_SPE_NV_L1D:
+ data_src->mem_lvl = PERF_MEM_LVL_L1 | PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+ break;
+ case ARM_SPE_NV_L2:
+ data_src->mem_lvl = PERF_MEM_LVL_L2 | PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+ break;
+ case ARM_SPE_NV_PEER_CORE:
+ data_src->mem_lvl = PERF_MEM_LVL_L2 | PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
+ data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
+ break;
+ /*
+ * We don't know if this is L1, L2 but we do know it was a cache-2-cache
+ * transfer, so set SNOOPX_PEER
+ */
+ case ARM_SPE_NV_LOCAL_CLUSTER:
+ case ARM_SPE_NV_PEER_CLUSTER:
+ data_src->mem_lvl = PERF_MEM_LVL_L3 | PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
+ data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
+ break;
+ /*
+ * System cache is assumed to be L3
+ */
+ case ARM_SPE_NV_SYS_CACHE:
+ data_src->mem_lvl = PERF_MEM_LVL_L3 | PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
+ data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
+ break;
+ /*
+ * We don't know what level it hit in, except it came from the other
+ * socket
+ */
+ case ARM_SPE_NV_REMOTE:
+ data_src->mem_lvl = PERF_MEM_LVL_REM_CCE1;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
+ data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
+ data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
+ break;
+ case ARM_SPE_NV_DRAM:
+ data_src->mem_lvl = PERF_MEM_LVL_LOC_RAM | PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+ break;
+ default:
+ break;
+ }
+}

+static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
+ union perf_mem_data_src *data_src)
+{
if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
- data_src.mem_lvl = PERF_MEM_LVL_L3;
+ data_src->mem_lvl = PERF_MEM_LVL_L3;

if (record->type & ARM_SPE_LLC_MISS)
- data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+ data_src->mem_lvl |= PERF_MEM_LVL_MISS;
else
- data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+ data_src->mem_lvl |= PERF_MEM_LVL_HIT;
} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
- data_src.mem_lvl = PERF_MEM_LVL_L1;
+ data_src->mem_lvl = PERF_MEM_LVL_L1;

if (record->type & ARM_SPE_L1D_MISS)
- data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+ data_src->mem_lvl |= PERF_MEM_LVL_MISS;
else
- data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+ data_src->mem_lvl |= PERF_MEM_LVL_HIT;
}

if (record->type & ARM_SPE_REMOTE_ACCESS)
- data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+ data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+}
+
+static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
+{
+ union perf_mem_data_src data_src = { 0 };
+ bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
+
+ if (record->op == ARM_SPE_LD)
+ data_src.mem_op = PERF_MEM_OP_LOAD;
+ else if (record->op == ARM_SPE_ST)
+ data_src.mem_op = PERF_MEM_OP_STORE;
+ else
+ return 0;
+
+ if (is_neoverse)
+ arm_spe__synth_data_source_neoverse(record, &data_src);
+ else
+ arm_spe__synth_data_source_generic(record, &data_src);

if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
data_src.mem_dtlb = PERF_MEM_TLB_WK;
@@ -436,7 +531,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
u64 data_src;
int err;

- data_src = arm_spe__synth_data_source(record);
+ data_src = arm_spe__synth_data_source(record, spe->midr);

if (spe->sample_flc) {
if (record->type & ARM_SPE_L1D_MISS) {
@@ -1178,6 +1273,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
struct perf_record_time_conv *tc = &session->time_conv;
+ const char *cpuid = perf_env__cpuid(session->evlist->env);
+ u64 midr = strtol(cpuid, NULL, 16);
struct arm_spe *spe;
int err;

@@ -1197,6 +1294,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
spe->machine = &session->machines.host; /* No kvm support */
spe->auxtrace_type = auxtrace_info->type;
spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
+ spe->midr = midr;

spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);

--
2.34.1
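
To make the Neoverse mapping in the patch above easier to follow, here is a minimal standalone sketch of the same switch. The encodings are copied from enum arm_spe_neoverse_data_source in the patch; everything prefixed with sketch_ or SKETCH_ is a hypothetical, simplified stand-in for union perf_mem_data_src and is not perf's real definition. It covers loads only, since the patch reports no source information for stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Data source encodings, as listed in the patch's enum. */
enum sketch_nv_source {
	SKETCH_NV_L1D           = 0x0,
	SKETCH_NV_L2            = 0x8,
	SKETCH_NV_PEER_CORE     = 0x9,
	SKETCH_NV_LOCAL_CLUSTER = 0xa,
	SKETCH_NV_SYS_CACHE     = 0xb,
	SKETCH_NV_PEER_CLUSTER  = 0xc,
	SKETCH_NV_REMOTE        = 0xd,
	SKETCH_NV_DRAM          = 0xe,
};

/* Hypothetical, simplified stand-in for union perf_mem_data_src. */
enum sketch_level {
	SKETCH_LVL_NA,
	SKETCH_LVL_L1,
	SKETCH_LVL_L2,
	SKETCH_LVL_L3,
	SKETCH_LVL_ANY_CACHE,
	SKETCH_LVL_RAM,
};

struct sketch_data_src {
	enum sketch_level level;
	int peer;	/* data was supplied by a peer cache */
	int remote;	/* data came from the other socket */
};

/* Classify one load sample; unknown encodings stay "not available". */
void sketch_classify_load(uint16_t source, struct sketch_data_src *out)
{
	assert(out != NULL);

	out->level = SKETCH_LVL_NA;
	out->peer = 0;
	out->remote = 0;

	switch (source) {
	case SKETCH_NV_L1D:
		out->level = SKETCH_LVL_L1;
		break;
	case SKETCH_NV_L2:
		out->level = SKETCH_LVL_L2;
		break;
	case SKETCH_NV_PEER_CORE:
		out->level = SKETCH_LVL_L2;
		out->peer = 1;
		break;
	case SKETCH_NV_LOCAL_CLUSTER:
	case SKETCH_NV_PEER_CLUSTER:
		/* Cache-to-cache transfer; reported as L3 like the patch does. */
		out->level = SKETCH_LVL_L3;
		out->peer = 1;
		break;
	case SKETCH_NV_SYS_CACHE:
		/* System cache is assumed to be L3. */
		out->level = SKETCH_LVL_L3;
		break;
	case SKETCH_NV_REMOTE:
		/* Level unknown, only that it came from the other socket. */
		out->level = SKETCH_LVL_ANY_CACHE;
		out->peer = 1;
		out->remote = 1;
		break;
	case SKETCH_NV_DRAM:
		out->level = SKETCH_LVL_RAM;
		break;
	default:
		break;
	}
}
```

The peer flag here plays the role that PERF_MEM_SNOOPX_PEER plays in the patch: it is what the later c2c patches in this series count and sort on.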

2022-08-11 07:07:10

by Leo Yan

Subject: [PATCH v6 14/15] perf c2c: Use 'peer' as default display for Arm64

Since the Arm64 arch doesn't support HITM flags, this patch uses 'peer'
as the default display if the user doesn't specify a type; for other
arches, 'tot' remains the default display type.

This patch also calls perf_session__new() earlier, so that the session
environment is initialized first and its arch info can be used for
setting the display type.

Suggested-by: Ali Saidi <[email protected]>
Signed-off-by: Leo Yan <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Tested-by: Ali Saidi <[email protected]>
Reviewed-by: Ali Saidi <[email protected]>
---
tools/perf/builtin-c2c.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index f7a961e55a92..653e13b5037e 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -2878,7 +2878,7 @@ static int setup_callchain(struct evlist *evlist)

static int setup_display(const char *str)
{
- const char *display = str ?: "tot";
+ const char *display = str;

if (!strcmp(display, "tot"))
c2c.display = DISPLAY_TOT_HITM;
@@ -3068,27 +3068,39 @@ static int perf_c2c__report(int argc, const char **argv)
data.path = input_name;
data.force = symbol_conf.force;

+ session = perf_session__new(&data, &c2c.tool);
+ if (IS_ERR(session)) {
+ err = PTR_ERR(session);
+ pr_debug("Error creating perf session\n");
+ goto out;
+ }
+
+ /*
+ * Use the 'tot' as default display type if user doesn't specify it;
+ * since Arm64 platform doesn't support HITMs flag, use 'peer' as the
+ * default display type.
+ */
+ if (!display) {
+ if (!strcmp(perf_env__arch(&session->header.env), "arm64"))
+ display = "peer";
+ else
+ display = "tot";
+ }
+
err = setup_display(display);
if (err)
- goto out;
+ goto out_session;

err = setup_coalesce(coalesce, no_source);
if (err) {
pr_debug("Failed to initialize hists\n");
- goto out;
+ goto out_session;
}

err = c2c_hists__init(&c2c.hists, "dcacheline", 2);
if (err) {
pr_debug("Failed to initialize hists\n");
- goto out;
- }
-
- session = perf_session__new(&data, &c2c.tool);
- if (IS_ERR(session)) {
- err = PTR_ERR(session);
- pr_debug("Error creating perf session\n");
- goto out;
+ goto out_session;
}

session->itrace_synth_opts = &itrace_synth_opts;
@@ -3096,7 +3108,7 @@ static int perf_c2c__report(int argc, const char **argv)
err = setup_nodes(session);
if (err) {
pr_err("Failed setup nodes\n");
- goto out;
+ goto out_session;
}

err = mem2node__init(&c2c.mem2node, &session->header.env);
--
2.34.1
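
The default selection added to perf_c2c__report() above boils down to the following. This is only a sketch: sketch_default_display() is a hypothetical helper, and the arch string is passed in directly rather than read from the session environment via perf_env__arch().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the display type to use: the user's choice if one was given,
 * otherwise "peer" on arm64 and "tot" everywhere else.
 */
const char *sketch_default_display(const char *user_choice, const char *arch)
{
	assert(arch != NULL);

	if (user_choice)
		return user_choice;

	return strcmp(arch, "arm64") == 0 ? "peer" : "tot";
}
```

Because the arch comes from the recorded session rather than the host, a perf.data file captured on Arm64 would get the 'peer' default even when the report is run on x86_64, which is why the session has to exist before setup_display() is called.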

2022-08-11 07:08:31

by Leo Yan

Subject: [PATCH v6 12/15] perf c2c: Refactor display string

The display type is shown by combining the display string array with the
suffix string "HITMs", which makes it hard to extend the display to other
sorting types (e.g. an extension for peer operations).

This patch moves the suffix string "HITMs" into the display string array
for the HITM types, so that a new display type is not forced to output
the string "HITMs".

Signed-off-by: Leo Yan <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Tested-by: Ali Saidi <[email protected]>
Reviewed-by: Ali Saidi <[email protected]>
---
tools/perf/builtin-c2c.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 49a9b8480b41..8b7c1fd35380 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -122,9 +122,9 @@ enum {
};

static const char *display_str[DISPLAY_MAX] = {
- [DISPLAY_LCL_HITM] = "Local",
- [DISPLAY_RMT_HITM] = "Remote",
- [DISPLAY_TOT_HITM] = "Total",
+ [DISPLAY_LCL_HITM] = "Local HITMs",
+ [DISPLAY_RMT_HITM] = "Remote HITMs",
+ [DISPLAY_TOT_HITM] = "Total HITMs",
};

static const struct option c2c_options[] = {
@@ -2489,7 +2489,7 @@ static void print_c2c_info(FILE *out, struct perf_session *session)
fprintf(out, "%-36s: %s\n", first ? " Events" : "", evsel__name(evsel));
first = false;
}
- fprintf(out, " Cachelines sort on : %s HITMs\n",
+ fprintf(out, " Cachelines sort on : %s\n",
display_str[c2c.display]);
fprintf(out, " Cacheline data grouping : %s\n", c2c.cl_sort);
}
@@ -2646,7 +2646,7 @@ static int perf_c2c_browser__title(struct hist_browser *browser,
{
scnprintf(bf, size,
"Shared Data Cache Line Table "
- "(%lu entries, sorted on %s HITMs)",
+ "(%lu entries, sorted on %s)",
browser->nr_non_filtered_entries,
display_str[c2c.display]);
return 0;
--
2.34.1
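
A small sketch of why the suffix is moved into the array: once each entry carries its full label, a non-HITM entry can be added without touching the format strings. The SKETCH_DISPLAY_PEER entry and its "Peer Snoop" label are hypothetical here, as this patch itself only changes the three HITM strings, and the output line is shortened compared to print_c2c_info().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum {
	SKETCH_DISPLAY_LCL_HITM,
	SKETCH_DISPLAY_RMT_HITM,
	SKETCH_DISPLAY_TOT_HITM,
	SKETCH_DISPLAY_PEER,	/* hypothetical new type, not part of this patch */
	SKETCH_DISPLAY_MAX,
};

/* Each entry is the complete label, so no "HITMs" suffix is appended later. */
static const char *const sketch_display_str[SKETCH_DISPLAY_MAX] = {
	[SKETCH_DISPLAY_LCL_HITM] = "Local HITMs",
	[SKETCH_DISPLAY_RMT_HITM] = "Remote HITMs",
	[SKETCH_DISPLAY_TOT_HITM] = "Total HITMs",
	[SKETCH_DISPLAY_PEER]     = "Peer Snoop",
};

/* Format the "sort on" line; returns what snprintf() returns. */
int sketch_format_sort_line(char *buf, size_t size, int display)
{
	assert(buf != NULL);
	assert(display >= 0 && display < SKETCH_DISPLAY_MAX);

	return snprintf(buf, size, "Cachelines sort on : %s",
			sketch_display_str[display]);
}
```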

2022-08-11 07:25:05

by Leo Yan

Subject: [PATCH v6 15/15] perf c2c: Update documentation for new display option 'peer'

Since the new display option 'peer' has been introduced, this patch
updates the documentation to reflect it.

Signed-off-by: Leo Yan <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Reviewed-by: Ali Saidi <[email protected]>
---
tools/perf/Documentation/perf-c2c.txt | 31 +++++++++++++++++++++------
1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
index 6f69173731aa..f1f7ae6b08d1 100644
--- a/tools/perf/Documentation/perf-c2c.txt
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -109,7 +109,9 @@ REPORT OPTIONS

-d::
--display::
- Switch to HITM type (rmt, lcl) to display and sort on. Total HITMs as default.
+ Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
+ and sort on. Total HITMs (tot) as default, except Arm64 uses peer mode
+ as default.

--stitch-lbr::
Show callgraph with stitched LBRs, which may have more complete
@@ -174,12 +176,18 @@ For each cacheline in the 1) list we display following data:
Cacheline
- cacheline address (hex number)

- Rmt/Lcl Hitm
+ Rmt/Lcl Hitm (Display with HITM types)
- cacheline percentage of all Remote/Local HITM accesses

- LLC Load Hitm - Total, LclHitm, RmtHitm
+ Peer Snoop (Display with peer type)
+ - cacheline percentage of all peer accesses
+
+ LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
- count of Total/Local/Remote load HITMs

+ Load Peer - Total, Local, Remote (For display with peer type)
+ - count of Total/Local/Remote load from peer cache or DRAM
+
Total records
- sum of all cachelines accesses

@@ -201,16 +209,21 @@ For each cacheline in the 1) list we display following data:
- count of LLC load accesses, includes LLC hits and LLC HITMs

RMT Load Hit - RmtHit, RmtHitm
- - count of remote load accesses, includes remote hits and remote HITMs
+ - count of remote load accesses, includes remote hits and remote HITMs;
+ on Arm neoverse cores, RmtHit is used to account remote accesses,
+ includes remote DRAM or any upward cache level in remote node

Load Dram - Lcl, Rmt
- count of local and remote DRAM accesses

For each offset in the 2) list we display following data:

- HITM - Rmt, Lcl
+ HITM - Rmt, Lcl (Display with HITM types)
- % of Remote/Local HITM accesses for given offset within cacheline

+ Peer Snoop - Rmt, Lcl (Display with peer type)
+ - % of Remote/Local peer accesses for given offset within cacheline
+
Store Refs - L1 Hit, L1 Miss, N/A
- % of store accesses that hit L1, missed L1 and N/A (no available) memory
level for given offset within cacheline
@@ -227,9 +240,12 @@ For each offset in the 2) list we display following data:
Code address
- code address responsible for the accesses

- cycles - rmt hitm, lcl hitm, load
+ cycles - rmt hitm, lcl hitm, load (Display with HITM types)
- sum of cycles for given accesses - Remote/Local HITM and generic load

+ cycles - rmt peer, lcl peer, load (Display with peer type)
+ - sum of cycles for given accesses - Remote/Local peer load and generic load
+
cpu cnt
- number of cpus that participated on the access

@@ -251,7 +267,8 @@ The 'Node' field displays nodes that accesses given cacheline
offset. Its output comes in 3 flavors:
- node IDs separated by ','
- node IDs with stats for each ID, in following format:
- Node{cpus %hitms %stores}
+ Node{cpus %hitms %stores} (Display with HITM types)
+ Node{cpus %peers %stores} (Display with peer type)
- node IDs with list of affected CPUs in following format:
Node{cpu list}

--
2.34.1

2022-08-11 22:06:56

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v6 03/15] perf arm-spe: Use SPE data source for neoverse cores

Em Thu, Aug 11, 2022 at 02:24:39PM +0800, Leo Yan escreveu:
> From: Ali Saidi <[email protected]>
>
> When synthesizing data from SPE, augment the type with source information
> for Arm Neoverse cores. The field is IMPLDEF but the Neoverse cores all use
> the same encoding. I can't find encoding information for any other SPE
> implementations to unify their choices with Arm's thus that is left for
> future work.
>
> This change populates the mem_lvl_num for Neoverse cores as well as the
> deprecated mem_lvl namespace.

So at this point, building on x86_64, I get:

In file included from util/arm-spe.c:37:
util/../../arch/arm64/include/asm/cputype.h:183:10: fatal error: asm/sysreg.h: No such file or directory
183 | #include <asm/sysreg.h>
| ^~~~~~~~~~~~~~
compilation terminated.
make[4]: *** [/var/home/acme/git/perf/tools/build/Makefile.build:96: /tmp/build/perf/util/arm-spe.o] Error 1
make[4]: *** Waiting for unfinished jobs....
LD /tmp/build/perf/util/arm-spe-decoder/perf-in.o
make[3]: *** [/var/home/acme/git/perf/tools/build/Makefile.build:139: util] Error 2
make[2]: *** [Makefile.perf:660: /tmp/build/perf/perf-in.o] Error 2
make[1]: *** [Makefile.perf:240: sub-make] Error 2
make: *** [Makefile:113: install-bin] Error 2
make: Leaving directory '/var/home/acme/git/perf/tools/perf'

Performance counter stats for 'make -k BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf -C tools/perf install-bin':

12,163,704,676 cycles:u
20,601,569,045 instructions:u # 1.69 insn per cycle

3.733981168 seconds time elapsed

2.897595000 seconds user
1.446798000 seconds sys


⬢[acme@toolbox perf]$

I saw a patch floating by that seems related, will check.

- Arnaldo

> Signed-off-by: Ali Saidi <[email protected]>
> Reviewed-by: German Gomez <[email protected]>
> Reviewed-by: Leo Yan <[email protected]>
> Tested-by: Leo Yan <[email protected]>
> ---
> .../util/arm-spe-decoder/arm-spe-decoder.c | 1 +
> .../util/arm-spe-decoder/arm-spe-decoder.h | 12 ++
> tools/perf/util/arm-spe.c | 130 +++++++++++++++---
> 3 files changed, 127 insertions(+), 16 deletions(-)
>
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> index 5e390a1a79ab..091987dd3966 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> @@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)
>
> break;
> case ARM_SPE_DATA_SOURCE:
> + decoder->record.source = payload;
> break;
> case ARM_SPE_BAD:
> break;
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> index 69b31084d6be..46a61df1145b 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> @@ -29,6 +29,17 @@ enum arm_spe_op_type {
> ARM_SPE_ST = 1 << 1,
> };
>
> +enum arm_spe_neoverse_data_source {
> + ARM_SPE_NV_L1D = 0x0,
> + ARM_SPE_NV_L2 = 0x8,
> + ARM_SPE_NV_PEER_CORE = 0x9,
> + ARM_SPE_NV_LOCAL_CLUSTER = 0xa,
> + ARM_SPE_NV_SYS_CACHE = 0xb,
> + ARM_SPE_NV_PEER_CLUSTER = 0xc,
> + ARM_SPE_NV_REMOTE = 0xd,
> + ARM_SPE_NV_DRAM = 0xe,
> +};
> +
> struct arm_spe_record {
> enum arm_spe_sample_type type;
> int err;
> @@ -40,6 +51,7 @@ struct arm_spe_record {
> u64 virt_addr;
> u64 phys_addr;
> u64 context_id;
> + u16 source;
> };
>
> struct arm_spe_insn;
> diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
> index d040406f3314..22dcfe07e886 100644
> --- a/tools/perf/util/arm-spe.c
> +++ b/tools/perf/util/arm-spe.c
> @@ -34,6 +34,7 @@
> #include "arm-spe-decoder/arm-spe-decoder.h"
> #include "arm-spe-decoder/arm-spe-pkt-decoder.h"
>
> +#include "../../arch/arm64/include/asm/cputype.h"
> #define MAX_TIMESTAMP (~0ULL)
>
> struct arm_spe {
> @@ -45,6 +46,7 @@ struct arm_spe {
> struct perf_session *session;
> struct machine *machine;
> u32 pmu_type;
> + u64 midr;
>
> struct perf_tsc_conversion tc;
>
> @@ -387,35 +389,128 @@ static int arm_spe__synth_instruction_sample(struct arm_spe_queue *speq,
> return arm_spe_deliver_synth_event(spe, speq, event, &sample);
> }
>
> -static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
> +static const struct midr_range neoverse_spe[] = {
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
> + {},
> +};
> +
> +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> + union perf_mem_data_src *data_src)
> {
> - union perf_mem_data_src data_src = { 0 };
> + /*
> + * Even though four levels of cache hierarchy are possible, no known
> + * production Neoverse systems currently include more than three levels
> + * so for the time being we assume three exist. If a production system
> + * is built with four then this function would have to be changed to
> + * detect the number of levels for reporting.
> + */
>
> - if (record->op == ARM_SPE_LD)
> - data_src.mem_op = PERF_MEM_OP_LOAD;
> - else if (record->op == ARM_SPE_ST)
> - data_src.mem_op = PERF_MEM_OP_STORE;
> - else
> - return 0;
> + /*
> + * We have no data on the hit level or data source for stores in the
> + * Neoverse SPE records.
> + */
> + if (record->op & ARM_SPE_ST) {
> + data_src->mem_lvl = PERF_MEM_LVL_NA;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_NA;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NA;
> + return;
> + }
> +
> + switch (record->source) {
> + case ARM_SPE_NV_L1D:
> + data_src->mem_lvl = PERF_MEM_LVL_L1 | PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> + break;
> + case ARM_SPE_NV_L2:
> + data_src->mem_lvl = PERF_MEM_LVL_L2 | PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> + break;
> + case ARM_SPE_NV_PEER_CORE:
> + data_src->mem_lvl = PERF_MEM_LVL_L2 | PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> + data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
> + break;
> + /*
> + * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> + * transfer, so set SNOOPX_PEER
> + */
> + case ARM_SPE_NV_LOCAL_CLUSTER:
> + case ARM_SPE_NV_PEER_CLUSTER:
> + data_src->mem_lvl = PERF_MEM_LVL_L3 | PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> + data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
> + break;
> + /*
> + * System cache is assumed to be L3
> + */
> + case ARM_SPE_NV_SYS_CACHE:
> + data_src->mem_lvl = PERF_MEM_LVL_L3 | PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> + data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> + break;
> + /*
> + * We don't know what level it hit in, except it came from the other
> + * socket
> + */
> + case ARM_SPE_NV_REMOTE:
> + data_src->mem_lvl = PERF_MEM_LVL_REM_CCE1;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> + data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> + data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
> + break;
> + case ARM_SPE_NV_DRAM:
> + data_src->mem_lvl = PERF_MEM_LVL_LOC_RAM | PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> + break;
> + default:
> + break;
> + }
> +}
>
> +static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
> + union perf_mem_data_src *data_src)
> +{
> if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
> - data_src.mem_lvl = PERF_MEM_LVL_L3;
> + data_src->mem_lvl = PERF_MEM_LVL_L3;
>
> if (record->type & ARM_SPE_LLC_MISS)
> - data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> + data_src->mem_lvl |= PERF_MEM_LVL_MISS;
> else
> - data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> + data_src->mem_lvl |= PERF_MEM_LVL_HIT;
> } else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
> - data_src.mem_lvl = PERF_MEM_LVL_L1;
> + data_src->mem_lvl = PERF_MEM_LVL_L1;
>
> if (record->type & ARM_SPE_L1D_MISS)
> - data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> + data_src->mem_lvl |= PERF_MEM_LVL_MISS;
> else
> - data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> + data_src->mem_lvl |= PERF_MEM_LVL_HIT;
> }
>
> if (record->type & ARM_SPE_REMOTE_ACCESS)
> - data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> + data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> +}
> +
> +static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
> +{
> + union perf_mem_data_src data_src = { 0 };
> + bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
> +
> + if (record->op == ARM_SPE_LD)
> + data_src.mem_op = PERF_MEM_OP_LOAD;
> + else if (record->op == ARM_SPE_ST)
> + data_src.mem_op = PERF_MEM_OP_STORE;
> + else
> + return 0;
> +
> + if (is_neoverse)
> + arm_spe__synth_data_source_neoverse(record, &data_src);
> + else
> + arm_spe__synth_data_source_generic(record, &data_src);
>
> if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
> data_src.mem_dtlb = PERF_MEM_TLB_WK;
> @@ -436,7 +531,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
> u64 data_src;
> int err;
>
> - data_src = arm_spe__synth_data_source(record);
> + data_src = arm_spe__synth_data_source(record, spe->midr);
>
> if (spe->sample_flc) {
> if (record->type & ARM_SPE_L1D_MISS) {
> @@ -1178,6 +1273,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
> struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
> size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
> struct perf_record_time_conv *tc = &session->time_conv;
> + const char *cpuid = perf_env__cpuid(session->evlist->env);
> + u64 midr = strtol(cpuid, NULL, 16);
> struct arm_spe *spe;
> int err;
>
> @@ -1197,6 +1294,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
> spe->machine = &session->machines.host; /* No kvm support */
> spe->auxtrace_type = auxtrace_info->type;
> spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
> + spe->midr = midr;
>
> spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
>
> --
> 2.34.1

--

- Arnaldo
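
For context on why arm-spe.c pulls in cputype.h at all: the patch quoted above only needs it for the MIDR macros and is_midr_in_range(). The check itself can be sketched without the kernel header as below. This is a sketch under assumptions: the sketch_ names are hypothetical, the field layout follows the MIDR_EL1 definition (implementer in bits [31:24], architecture in [19:16], part number in [15:4]), and the part numbers are written from memory of cputype.h, so they should be checked against it. It also uses strtoull() where the patch uses strtol().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Keep implementer, architecture and part number; drop variant/revision. */
#define SKETCH_MIDR_MODEL_MASK	0xff0ffff0ULL
#define SKETCH_MIDR_MODEL(part) \
	((0x41ULL << 24) | (0xfULL << 16) | ((uint64_t)(part) << 4))

/* Assumed part numbers for the three cores listed in the patch. */
static const uint64_t sketch_neoverse_models[] = {
	SKETCH_MIDR_MODEL(0xd0c),	/* Neoverse N1 */
	SKETCH_MIDR_MODEL(0xd49),	/* Neoverse N2 */
	SKETCH_MIDR_MODEL(0xd40),	/* Neoverse V1 */
};

/* Parse a hex cpuid string such as "0x00000000413fd0c1". */
uint64_t sketch_parse_midr(const char *cpuid)
{
	assert(cpuid != NULL);
	return strtoull(cpuid, NULL, 16);
}

/* Return 1 if the MIDR matches any listed model, for all versions. */
int sketch_is_neoverse(uint64_t midr)
{
	size_t i;
	size_t n = sizeof(sketch_neoverse_models) / sizeof(sketch_neoverse_models[0]);

	for (i = 0; i < n; i++) {
		if ((midr & SKETCH_MIDR_MODEL_MASK) == sketch_neoverse_models[i])
			return 1;
	}
	return 0;
}
```

Masking out the variant and revision fields is what MIDR_ALL_VERSIONS() expresses in the patch: any stepping of a listed core takes the Neoverse decoding path, and everything else falls back to the generic one.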

2022-08-11 23:01:17

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v6 00/15] perf c2c: Support data source and display for Arm64

Em Thu, Aug 11, 2022 at 02:24:36PM +0800, Leo Yan escreveu:
> Arm64 Neoverse CPUs support data source in Arm SPE trace, which allows
> us to detect cache line contention and transfers.
>
> This patch set has been rebased on the acme/perf/core branch with the latest
> commit b39c9e1b101d ("perf machine: Fix missing free of
> machine->kallsyms_filename").
>
> For the build to succeed, a compilation fix [1] has been sent to LKML;
> this patch set depends on it. This patch set has been verified for both
> x86 perf memory events and Arm SPE events.
>
> [1] https://lore.kernel.org/lkml/[email protected]/

So, I tentatively applied this set after applying the patch for
<asm/sysreg.h>, and it's all now out in tmp.perf/core in my git tree,
please check.

I'm doing the usual set of container build tests, but any additional
checking, including on the committer note I added to the first patch in
this series, clarifying it is not really a "sync" with the kernel
headers, is more than welcome.

- Arnaldo

> Changes from v5:
> * Removed the patch "perf: Add SNOOP_PEER flag to perf mem data struct"
> (Arnaldo);
> * Removed the patch "perf arm-spe: Don't set data source if it's not a
> memory operation", which has been merged in the mainline kernel, so the
> merge conflict goes away;
> * Rebased on the latest acme perf/core branch, with no code changes
> compared to the previous version.
>
> Changes from v4:
> * Included Ali's patch set for adding data source in Arm SPE samples;
> * Added Ian's ACK and Ali's review and test tags;
> * Updated documentation for the default peer display for Arm64 (Ali).
>
> Changes from v3:
> * Changed to display remote and local peer accesses (Joe);
> * Fixed the usage info for display types (Joe);
> * Do not display HITM dimensions when using the 'peer' display, and the
> HITM display doesn't show any 'peer' dimensions (James);
> * Split into smaller patches for adding dimensions of peer operations;
> * Updated documentation to reflect the latest GUI and stdio.
>
>
> Ali Saidi (2):
> perf tools: sync addition of PERF_MEM_SNOOPX_PEER
> perf arm-spe: Use SPE data source for neoverse cores
>
> Leo Yan (13):
> perf mem: Print snoop peer flag
> perf mem: Add statistics for peer snooping
> perf c2c: Output statistics for peer snooping
> perf c2c: Add dimensions for peer load operations
> perf c2c: Add dimensions of peer metrics for cache line view
> perf c2c: Add mean dimensions for peer operations
> perf c2c: Use explicit names for display macros
> perf c2c: Rename dimension from 'percent_hitm' to
> 'percent_costly_snoop'
> perf c2c: Refactor node header
> perf c2c: Refactor display string
> perf c2c: Sort on peer snooping for load operations
> perf c2c: Use 'peer' as default display for Arm64
> perf c2c: Update documentation for new display option 'peer'
>
> tools/include/uapi/linux/perf_event.h | 2 +-
> tools/perf/Documentation/perf-c2c.txt | 31 +-
> tools/perf/builtin-c2c.c | 454 ++++++++++++++----
> .../util/arm-spe-decoder/arm-spe-decoder.c | 1 +
> .../util/arm-spe-decoder/arm-spe-decoder.h | 12 +
> tools/perf/util/arm-spe.c | 130 ++++-
> tools/perf/util/mem-events.c | 46 +-
> tools/perf/util/mem-events.h | 3 +
> 8 files changed, 547 insertions(+), 132 deletions(-)
>
> --
> 2.34.1

--

- Arnaldo

2022-08-12 01:28:44

by Leo Yan

Subject: Re: [PATCH v6 00/15] perf c2c: Support data source and display for Arm64

Hi Arnaldo,

On Thu, Aug 11, 2022 at 07:25:35PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Aug 11, 2022 at 02:24:36PM +0800, Leo Yan escreveu:
> > Arm64 Neoverse CPUs support data source in Arm SPE trace, which allows
> > us to detect cache line contention and transfers.
> >
> > This patch set has been rebased on the acme/perf/core branch with the latest
> > commit b39c9e1b101d ("perf machine: Fix missing free of
> > machine->kallsyms_filename").
> >
> > For the build to succeed, a compilation fix [1] has been sent to LKML;
> > this patch set depends on it. This patch set has been verified for both
> > x86 perf memory events and Arm SPE events.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
>
> So, I tentatively applied this set after applying the patch for
> <asm/sysreg.h>, and it's all now out in tmp.perf/core in my git tree,
> please check.

I discussed this with Suzuki, and he pointed out that adding the asm
include path in that way is not ideal. With the patch on the
tmp.perf/core branch, two include paths are added into CFLAGS for
arm-spe.c:

-I$(srctree)/tools/arch/$(SRCARCH)/include/
-I$(srctree)/tools/arch/arm64/include/

When we build perf on x86_64, $(srctree)/tools/arch/x86/include/ takes
precedence over $(srctree)/tools/arch/arm64/include/; if we want to
include a header file without a relative path in C code, like
"#include <asm/cputype.h>", the compiler may pick up a file of the same
name from x86's asm folder rather than arm64's asm folder.

Yesterday I spent a couple of hours looking for other methods (like
filter-out, CFLAGS_REMOVE, etc.) in the makefile, but had no luck giving
precedence to $(srctree)/tools/arch/arm64/include/.

So the current patches on the tmp.perf/core branch build successfully,
but if there is a better way to resolve the header path precedence
issue, I would prefer to improve this so that we don't need to worry
about it later. Any suggestions?

> I'm doing the usual set of container build tests, but any additional
> checking, including on the committer note I added to the first patch in
> this series, clarifying it is not really a "sync" with the kernel
> headers, is more than welcome.

It's fine with me to add my Signed-off-by to the signature chain.
I appreciate the amendment.

Thanks,
Leo