2022-04-09 05:22:59

by Ali Saidi

[permalink] [raw]
Subject: [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c

When synthesizing data from SPE, augment the type with source information
for Arm Neoverse cores so we can detect situtions like cache line
contention and transfers on Arm platforms.

This changes enables the expected behavior of perf c2c on a system with
SPE where lines that are shared among multiple cores show up in perf c2c
output.

These changes switch to use mem_lvl_num to encode the level information
instead of mem_lvl which is being deprecated, but I haven't found other
users of mem_lvl_num.

Changes in v5:
* Add a new snooping type to disambiguate cache-to-cache transfers where
we don't know if the data is clean or dirty.
* Set snoop flags on all the data-source cases
* Special case stores as we have no information on them

Changes in v4:
* Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
* Add neoverse-v1 to the neoverse cores list

Ali Saidi (4):
tools: arm64: Import cputype.h
perf arm-spe: Use SPE data source for neoverse cores
perf mem: Support mem_lvl_num in c2c command
perf mem: Support HITM for when mem_lvl_num is any

tools/arch/arm64/include/asm/cputype.h | 258 ++++++++++++++++++
.../util/arm-spe-decoder/arm-spe-decoder.c | 1 +
.../util/arm-spe-decoder/arm-spe-decoder.h | 12 +
tools/perf/util/arm-spe.c | 110 +++++++-
tools/perf/util/mem-events.c | 20 +-
5 files changed, 383 insertions(+), 18 deletions(-)
create mode 100644 tools/arch/arm64/include/asm/cputype.h

--
2.32.0


2022-04-09 07:14:46

by Ali Saidi

[permalink] [raw]
Subject: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct

Add a flag to the perf mem data struct to signal that a request caused a
cache-to-cache transfer of a line from a peer of the requestor and
wasn't sourced from a lower cache level. The line being moved from one
peer cache to another has latency and performance implications. On Arm64
Neoverse systems the data source can indicate a cache-to-cache transfer
but not if the line is dirty or clean, so instead of overloading HITM
define a new flag that indicates this type of transfer.

Signed-off-by: Ali Saidi <[email protected]>
---
include/uapi/linux/perf_event.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 82858b697c05..c9e58c79f3e5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1308,7 +1308,7 @@ union perf_mem_data_src {
#define PERF_MEM_SNOOP_SHIFT 19

#define PERF_MEM_SNOOPX_FWD 0x01 /* forward */
-/* 1 free */
+#define PERF_MEM_SNOOPX_PEER 0x02 /* xfer from peer */
#define PERF_MEM_SNOOPX_SHIFT 38

/* locked instruction */
--
2.32.0

2022-04-09 16:54:31

by Ali Saidi

[permalink] [raw]
Subject: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command

In addition to summarizing data encoded in mem_lvl also support data
encoded in mem_lvl_num.

Since other architectures don't seem to populate the mem_lvl_num field
here there shouldn't be a change in functionality.

Signed-off-by: Ali Saidi <[email protected]>
---
tools/perf/util/mem-events.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index ed0ab838bcc5..e5e405185498 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
u64 daddr = mi->daddr.addr;
u64 op = data_src->mem_op;
u64 lvl = data_src->mem_lvl;
+ u64 lnum = data_src->mem_lvl_num;
u64 snoop = data_src->mem_snoop;
u64 lock = data_src->mem_lock;
u64 blk = data_src->mem_blk;
@@ -527,16 +528,18 @@ do { \
if (lvl & P(LVL, UNC)) stats->ld_uncache++;
if (lvl & P(LVL, IO)) stats->ld_io++;
if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
- if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
- if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
- if (lvl & P(LVL, L3 )) {
+ if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
+ stats->ld_l1hit++;
+ if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
+ stats->ld_l2hit++;
+ if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
if (snoop & P(SNOOP, HITM))
HITM_INC(lcl_hitm);
else
stats->ld_llchit++;
}

- if (lvl & P(LVL, LOC_RAM)) {
+ if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
stats->lcl_dram++;
if (snoop & P(SNOOP, HIT))
stats->ld_shared++;
--
2.32.0

2022-04-10 14:22:37

by Ali Saidi

[permalink] [raw]
Subject: [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER

Add a flag to the perf mem data struct to signal that a request caused a
cache-to-cache transfer of a line from a peer of the requestor and
wasn't sourced from a lower cache level. The line being moved from one
peer cache to another has latency and performance implications. On Arm64
Neoverse systems the data source can indicate a cache-to-cache transfer
but not if the line is dirty or clean, so instead of overloading HITM
define a new flag that indicates this type of transfer.

Signed-off-by: Ali Saidi <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 82858b697c05..c9e58c79f3e5 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -1308,7 +1308,7 @@ union perf_mem_data_src {
#define PERF_MEM_SNOOP_SHIFT 19

#define PERF_MEM_SNOOPX_FWD 0x01 /* forward */
-/* 1 free */
+#define PERF_MEM_SNOOPX_PEER 0x02 /* xfer from peer */
#define PERF_MEM_SNOOPX_SHIFT 38

/* locked instruction */
--
2.32.0

2022-04-11 15:07:35

by German Gomez

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER


On 08/04/2022 20:53, Ali Saidi wrote:
> Add a flag to the perf mem data struct to signal that a request caused a
> cache-to-cache transfer of a line from a peer of the requestor and
> wasn't sourced from a lower cache level. The line being moved from one
> peer cache to another has latency and performance implications. On Arm64
> Neoverse systems the data source can indicate a cache-to-cache transfer
> but not if the line is dirty or clean, so instead of overloading HITM
> define a new flag that indicates this type of transfer.

I think it's fine to merge this and the previous commit rather than have
two commits with the same msg.

>
> Signed-off-by: Ali Saidi <[email protected]>
> ---
> tools/include/uapi/linux/perf_event.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index 82858b697c05..c9e58c79f3e5 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -1308,7 +1308,7 @@ union perf_mem_data_src {
> #define PERF_MEM_SNOOP_SHIFT 19
>
> #define PERF_MEM_SNOOPX_FWD 0x01 /* forward */
> -/* 1 free */
> +#define PERF_MEM_SNOOPX_PEER 0x02 /* xfer from peer */
> #define PERF_MEM_SNOOPX_SHIFT 38
>
> /* locked instruction */

2022-04-11 16:27:53

by Ali Saidi

[permalink] [raw]
Subject: [PATCH v5 1/5] tools: arm64: Import cputype.h

Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
for arm64 to make use of all the core-type definitions in perf.

Replace sysreg.h with the version already imported into tools/.

Signed-off-by: Ali Saidi <[email protected]>
---
tools/arch/arm64/include/asm/cputype.h | 258 +++++++++++++++++++++++++
1 file changed, 258 insertions(+)
create mode 100644 tools/arch/arm64/include/asm/cputype.h

diff --git a/tools/arch/arm64/include/asm/cputype.h b/tools/arch/arm64/include/asm/cputype.h
new file mode 100644
index 000000000000..9afcc6467a09
--- /dev/null
+++ b/tools/arch/arm64/include/asm/cputype.h
@@ -0,0 +1,258 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2012 ARM Ltd.
+ */
+#ifndef __ASM_CPUTYPE_H
+#define __ASM_CPUTYPE_H
+
+#define INVALID_HWID ULONG_MAX
+
+#define MPIDR_UP_BITMASK (0x1 << 30)
+#define MPIDR_MT_BITMASK (0x1 << 24)
+#define MPIDR_HWID_BITMASK UL(0xff00ffffff)
+
+#define MPIDR_LEVEL_BITS_SHIFT 3
+#define MPIDR_LEVEL_BITS (1 << MPIDR_LEVEL_BITS_SHIFT)
+#define MPIDR_LEVEL_MASK ((1 << MPIDR_LEVEL_BITS) - 1)
+
+#define MPIDR_LEVEL_SHIFT(level) \
+ (((1 << level) >> 1) << MPIDR_LEVEL_BITS_SHIFT)
+
+#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
+ ((mpidr >> MPIDR_LEVEL_SHIFT(level)) & MPIDR_LEVEL_MASK)
+
+#define MIDR_REVISION_MASK 0xf
+#define MIDR_REVISION(midr) ((midr) & MIDR_REVISION_MASK)
+#define MIDR_PARTNUM_SHIFT 4
+#define MIDR_PARTNUM_MASK (0xfff << MIDR_PARTNUM_SHIFT)
+#define MIDR_PARTNUM(midr) \
+ (((midr) & MIDR_PARTNUM_MASK) >> MIDR_PARTNUM_SHIFT)
+#define MIDR_ARCHITECTURE_SHIFT 16
+#define MIDR_ARCHITECTURE_MASK (0xf << MIDR_ARCHITECTURE_SHIFT)
+#define MIDR_ARCHITECTURE(midr) \
+ (((midr) & MIDR_ARCHITECTURE_MASK) >> MIDR_ARCHITECTURE_SHIFT)
+#define MIDR_VARIANT_SHIFT 20
+#define MIDR_VARIANT_MASK (0xf << MIDR_VARIANT_SHIFT)
+#define MIDR_VARIANT(midr) \
+ (((midr) & MIDR_VARIANT_MASK) >> MIDR_VARIANT_SHIFT)
+#define MIDR_IMPLEMENTOR_SHIFT 24
+#define MIDR_IMPLEMENTOR_MASK (0xff << MIDR_IMPLEMENTOR_SHIFT)
+#define MIDR_IMPLEMENTOR(midr) \
+ (((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)
+
+#define MIDR_CPU_MODEL(imp, partnum) \
+ (((imp) << MIDR_IMPLEMENTOR_SHIFT) | \
+ (0xf << MIDR_ARCHITECTURE_SHIFT) | \
+ ((partnum) << MIDR_PARTNUM_SHIFT))
+
+#define MIDR_CPU_VAR_REV(var, rev) \
+ (((var) << MIDR_VARIANT_SHIFT) | (rev))
+
+#define MIDR_CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
+ MIDR_ARCHITECTURE_MASK)
+
+#define ARM_CPU_IMP_ARM 0x41
+#define ARM_CPU_IMP_APM 0x50
+#define ARM_CPU_IMP_CAVIUM 0x43
+#define ARM_CPU_IMP_BRCM 0x42
+#define ARM_CPU_IMP_QCOM 0x51
+#define ARM_CPU_IMP_NVIDIA 0x4E
+#define ARM_CPU_IMP_FUJITSU 0x46
+#define ARM_CPU_IMP_HISI 0x48
+#define ARM_CPU_IMP_APPLE 0x61
+
+#define ARM_CPU_PART_AEM_V8 0xD0F
+#define ARM_CPU_PART_FOUNDATION 0xD00
+#define ARM_CPU_PART_CORTEX_A57 0xD07
+#define ARM_CPU_PART_CORTEX_A72 0xD08
+#define ARM_CPU_PART_CORTEX_A53 0xD03
+#define ARM_CPU_PART_CORTEX_A73 0xD09
+#define ARM_CPU_PART_CORTEX_A75 0xD0A
+#define ARM_CPU_PART_CORTEX_A35 0xD04
+#define ARM_CPU_PART_CORTEX_A55 0xD05
+#define ARM_CPU_PART_CORTEX_A76 0xD0B
+#define ARM_CPU_PART_NEOVERSE_N1 0xD0C
+#define ARM_CPU_PART_CORTEX_A77 0xD0D
+#define ARM_CPU_PART_NEOVERSE_V1 0xD40
+#define ARM_CPU_PART_CORTEX_A78 0xD41
+#define ARM_CPU_PART_CORTEX_X1 0xD44
+#define ARM_CPU_PART_CORTEX_A510 0xD46
+#define ARM_CPU_PART_CORTEX_A710 0xD47
+#define ARM_CPU_PART_CORTEX_X2 0xD48
+#define ARM_CPU_PART_NEOVERSE_N2 0xD49
+#define ARM_CPU_PART_CORTEX_A78C 0xD4B
+
+#define APM_CPU_PART_POTENZA 0x000
+
+#define CAVIUM_CPU_PART_THUNDERX 0x0A1
+#define CAVIUM_CPU_PART_THUNDERX_81XX 0x0A2
+#define CAVIUM_CPU_PART_THUNDERX_83XX 0x0A3
+#define CAVIUM_CPU_PART_THUNDERX2 0x0AF
+/* OcteonTx2 series */
+#define CAVIUM_CPU_PART_OCTX2_98XX 0x0B1
+#define CAVIUM_CPU_PART_OCTX2_96XX 0x0B2
+#define CAVIUM_CPU_PART_OCTX2_95XX 0x0B3
+#define CAVIUM_CPU_PART_OCTX2_95XXN 0x0B4
+#define CAVIUM_CPU_PART_OCTX2_95XXMM 0x0B5
+#define CAVIUM_CPU_PART_OCTX2_95XXO 0x0B6
+
+#define BRCM_CPU_PART_BRAHMA_B53 0x100
+#define BRCM_CPU_PART_VULCAN 0x516
+
+#define QCOM_CPU_PART_FALKOR_V1 0x800
+#define QCOM_CPU_PART_FALKOR 0xC00
+#define QCOM_CPU_PART_KRYO 0x200
+#define QCOM_CPU_PART_KRYO_2XX_GOLD 0x800
+#define QCOM_CPU_PART_KRYO_2XX_SILVER 0x801
+#define QCOM_CPU_PART_KRYO_3XX_SILVER 0x803
+#define QCOM_CPU_PART_KRYO_4XX_GOLD 0x804
+#define QCOM_CPU_PART_KRYO_4XX_SILVER 0x805
+
+#define NVIDIA_CPU_PART_DENVER 0x003
+#define NVIDIA_CPU_PART_CARMEL 0x004
+
+#define FUJITSU_CPU_PART_A64FX 0x001
+
+#define HISI_CPU_PART_TSV110 0xD01
+
+#define APPLE_CPU_PART_M1_ICESTORM 0x022
+#define APPLE_CPU_PART_M1_FIRESTORM 0x023
+
+#define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
+#define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
+#define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
+#define MIDR_CORTEX_A73 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A73)
+#define MIDR_CORTEX_A75 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A75)
+#define MIDR_CORTEX_A35 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A35)
+#define MIDR_CORTEX_A55 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A55)
+#define MIDR_CORTEX_A76 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76)
+#define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1)
+#define MIDR_CORTEX_A77 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77)
+#define MIDR_NEOVERSE_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V1)
+#define MIDR_CORTEX_A78 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78)
+#define MIDR_CORTEX_X1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1)
+#define MIDR_CORTEX_A510 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A510)
+#define MIDR_CORTEX_A710 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A710)
+#define MIDR_CORTEX_X2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X2)
+#define MIDR_NEOVERSE_N2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N2)
+#define MIDR_CORTEX_A78C MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78C)
+#define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
+#define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX)
+#define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX)
+#define MIDR_OCTX2_98XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_98XX)
+#define MIDR_OCTX2_96XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_96XX)
+#define MIDR_OCTX2_95XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XX)
+#define MIDR_OCTX2_95XXN MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXN)
+#define MIDR_OCTX2_95XXMM MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXMM)
+#define MIDR_OCTX2_95XXO MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXO)
+#define MIDR_CAVIUM_THUNDERX2 MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX2)
+#define MIDR_BRAHMA_B53 MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_BRAHMA_B53)
+#define MIDR_BRCM_VULCAN MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_VULCAN)
+#define MIDR_QCOM_FALKOR_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR_V1)
+#define MIDR_QCOM_FALKOR MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR)
+#define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
+#define MIDR_QCOM_KRYO_2XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_GOLD)
+#define MIDR_QCOM_KRYO_2XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_SILVER)
+#define MIDR_QCOM_KRYO_3XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_3XX_SILVER)
+#define MIDR_QCOM_KRYO_4XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_GOLD)
+#define MIDR_QCOM_KRYO_4XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_SILVER)
+#define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
+#define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
+#define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
+#define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
+#define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM)
+#define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM)
+
+/* Fujitsu Erratum 010001 affects A64FX 1.0 and 1.1, (v0r0 and v1r0) */
+#define MIDR_FUJITSU_ERRATUM_010001 MIDR_FUJITSU_A64FX
+#define MIDR_FUJITSU_ERRATUM_010001_MASK (~MIDR_CPU_VAR_REV(1, 0))
+#define TCR_CLEAR_FUJITSU_ERRATUM_010001 (TCR_NFD1 | TCR_NFD0)
+
+#ifndef __ASSEMBLY__
+
+#include "sysreg.h"
+
+#define read_cpuid(reg) read_sysreg_s(SYS_ ## reg)
+
+/*
+ * Represent a range of MIDR values for a given CPU model and a
+ * range of variant/revision values.
+ *
+ * @model - CPU model as defined by MIDR_CPU_MODEL
+ * @rv_min - Minimum value for the revision/variant as defined by
+ * MIDR_CPU_VAR_REV
+ * @rv_max - Maximum value for the variant/revision for the range.
+ */
+struct midr_range {
+ u32 model;
+ u32 rv_min;
+ u32 rv_max;
+};
+
+#define MIDR_RANGE(m, v_min, r_min, v_max, r_max) \
+ { \
+ .model = m, \
+ .rv_min = MIDR_CPU_VAR_REV(v_min, r_min), \
+ .rv_max = MIDR_CPU_VAR_REV(v_max, r_max), \
+ }
+
+#define MIDR_REV_RANGE(m, v, r_min, r_max) MIDR_RANGE(m, v, r_min, v, r_max)
+#define MIDR_REV(m, v, r) MIDR_RANGE(m, v, r, v, r)
+#define MIDR_ALL_VERSIONS(m) MIDR_RANGE(m, 0, 0, 0xf, 0xf)
+
+static inline bool midr_is_cpu_model_range(u32 midr, u32 model, u32 rv_min,
+ u32 rv_max)
+{
+ u32 _model = midr & MIDR_CPU_MODEL_MASK;
+ u32 rv = midr & (MIDR_REVISION_MASK | MIDR_VARIANT_MASK);
+
+ return _model == model && rv >= rv_min && rv <= rv_max;
+}
+
+static inline bool is_midr_in_range(u32 midr, struct midr_range const *range)
+{
+ return midr_is_cpu_model_range(midr, range->model,
+ range->rv_min, range->rv_max);
+}
+
+static inline bool
+is_midr_in_range_list(u32 midr, struct midr_range const *ranges)
+{
+ while (ranges->model)
+ if (is_midr_in_range(midr, ranges++))
+ return true;
+ return false;
+}
+
+/*
+ * The CPU ID never changes at run time, so we might as well tell the
+ * compiler that it's constant. Use this function to read the CPU ID
+ * rather than directly reading processor_id or read_cpuid() directly.
+ */
+static inline u32 __attribute_const__ read_cpuid_id(void)
+{
+ return read_cpuid(MIDR_EL1);
+}
+
+static inline u64 __attribute_const__ read_cpuid_mpidr(void)
+{
+ return read_cpuid(MPIDR_EL1);
+}
+
+static inline unsigned int __attribute_const__ read_cpuid_implementor(void)
+{
+ return MIDR_IMPLEMENTOR(read_cpuid_id());
+}
+
+static inline unsigned int __attribute_const__ read_cpuid_part_number(void)
+{
+ return MIDR_PARTNUM(read_cpuid_id());
+}
+
+static inline u32 __attribute_const__ read_cpuid_cachetype(void)
+{
+ return read_cpuid(CTR_EL0);
+}
+#endif /* __ASSEMBLY__ */
+
+#endif
--
2.32.0

2022-04-11 20:33:12

by Ali Saidi

[permalink] [raw]
Subject: [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores

When synthesizing data from SPE, augment the type with source information
for Arm Neoverse cores. The field is IMPLDEF but the Neoverse cores all use
the same encoding. I can't find encoding information for any other SPE
implementations to unify their choices with Arm's thus that is left for
future work.

This change populates the mem_lvl_num for Neoverse cores instead of the
deprecated mem_lvl namespace.

Signed-off-by: Ali Saidi <[email protected]>
---
.../util/arm-spe-decoder/arm-spe-decoder.c | 1 +
.../util/arm-spe-decoder/arm-spe-decoder.h | 12 ++
tools/perf/util/arm-spe.c | 127 ++++++++++++++++--
3 files changed, 126 insertions(+), 14 deletions(-)

diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
index 5e390a1a79ab..091987dd3966 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
@@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)

break;
case ARM_SPE_DATA_SOURCE:
+ decoder->record.source = payload;
break;
case ARM_SPE_BAD:
break;
diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
index 69b31084d6be..46a61df1145b 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
@@ -29,6 +29,17 @@ enum arm_spe_op_type {
ARM_SPE_ST = 1 << 1,
};

+enum arm_spe_neoverse_data_source {
+ ARM_SPE_NV_L1D = 0x0,
+ ARM_SPE_NV_L2 = 0x8,
+ ARM_SPE_NV_PEER_CORE = 0x9,
+ ARM_SPE_NV_LOCAL_CLUSTER = 0xa,
+ ARM_SPE_NV_SYS_CACHE = 0xb,
+ ARM_SPE_NV_PEER_CLUSTER = 0xc,
+ ARM_SPE_NV_REMOTE = 0xd,
+ ARM_SPE_NV_DRAM = 0xe,
+};
+
struct arm_spe_record {
enum arm_spe_sample_type type;
int err;
@@ -40,6 +51,7 @@ struct arm_spe_record {
u64 virt_addr;
u64 phys_addr;
u64 context_id;
+ u16 source;
};

struct arm_spe_insn;
diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
index d2b64e3f588b..a20285cf98e3 100644
--- a/tools/perf/util/arm-spe.c
+++ b/tools/perf/util/arm-spe.c
@@ -34,6 +34,7 @@
#include "arm-spe-decoder/arm-spe-decoder.h"
#include "arm-spe-decoder/arm-spe-pkt-decoder.h"

+#include "../../arch/arm64/include/asm/cputype.h"
#define MAX_TIMESTAMP (~0ULL)

struct arm_spe {
@@ -45,6 +46,7 @@ struct arm_spe {
struct perf_session *session;
struct machine *machine;
u32 pmu_type;
+ u64 midr;

struct perf_tsc_conversion tc;

@@ -399,33 +401,127 @@ static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
return false;
}

-static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
+static const struct midr_range neoverse_spe[] = {
+ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
+ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
+ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
+ {},
+};
+
+
+static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
+ union perf_mem_data_src *data_src)
{
- union perf_mem_data_src data_src = { 0 };
+ /*
+ * Even though four levels of cache hierarchy are possible, no known
+ * production Neoverse systems currently include more than three levels
+ * so for the time being we assume three exist. If a production system
+ * is built with four the this function would have to be changed to
+ * detect the number of levels for reporting.
+ */

- if (record->op == ARM_SPE_LD)
- data_src.mem_op = PERF_MEM_OP_LOAD;
- else
- data_src.mem_op = PERF_MEM_OP_STORE;
+ /*
+ * We have no data on the hit level or data source for stores in the
+ * Neoverse SPE records.
+ */
+ if (record->op & ARM_SPE_ST) {
+ data_src->mem_lvl = PERF_MEM_LVL_NA;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NA;
+ return;
+ }
+
+
+ switch (record->source) {
+ case ARM_SPE_NV_L1D:
+ data_src->mem_lvl = PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+ break;
+ case ARM_SPE_NV_L2:
+ data_src->mem_lvl = PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+ break;
+ case ARM_SPE_NV_PEER_CORE:
+ data_src->mem_lvl = PERF_MEM_LVL_HIT;
+ data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
+ break;
+ /*
+ * We don't know if this is L1, L2 but we do know it was a cache-2-cache
+ * transfer, so set SNOOPX_PEER
+ */
+ case ARM_SPE_NV_LOCAL_CLUSTER:
+ case ARM_SPE_NV_PEER_CLUSTER:
+ data_src->mem_lvl = PERF_MEM_LVL_HIT;
+ data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
+ break;
+ /*
+ * System cache is assumed to be L3
+ */
+ case ARM_SPE_NV_SYS_CACHE:
+ data_src->mem_lvl = PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
+ data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
+ break;
+ /*
+ * We don't know what level it hit in, except it came from the other
+ * socket
+ */
+ case ARM_SPE_NV_REMOTE:
+ data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
+ data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NA;
+ break;
+ case ARM_SPE_NV_DRAM:
+ data_src->mem_lvl = PERF_MEM_LVL_HIT;
+ data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
+ data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+ break;
+ default:
+ break;
+ }
+}

+static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
+ union perf_mem_data_src *data_src)
+{
if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
- data_src.mem_lvl = PERF_MEM_LVL_L3;
+ data_src->mem_lvl = PERF_MEM_LVL_L3;

if (record->type & ARM_SPE_LLC_MISS)
- data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+ data_src->mem_lvl |= PERF_MEM_LVL_MISS;
else
- data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+ data_src->mem_lvl |= PERF_MEM_LVL_HIT;
} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
- data_src.mem_lvl = PERF_MEM_LVL_L1;
+ data_src->mem_lvl = PERF_MEM_LVL_L1;

if (record->type & ARM_SPE_L1D_MISS)
- data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+ data_src->mem_lvl |= PERF_MEM_LVL_MISS;
else
- data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+ data_src->mem_lvl |= PERF_MEM_LVL_HIT;
}

if (record->type & ARM_SPE_REMOTE_ACCESS)
- data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+ data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+}
+
+static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
+{
+ union perf_mem_data_src data_src = { 0 };
+ bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
+
+ if (record->op & ARM_SPE_LD)
+ data_src.mem_op = PERF_MEM_OP_LOAD;
+ else
+ data_src.mem_op = PERF_MEM_OP_STORE;
+
+ if (is_neoverse)
+ arm_spe__synth_data_source_neoverse(record, &data_src);
+ else
+ arm_spe__synth_data_source_generic(record, &data_src);

if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
data_src.mem_dtlb = PERF_MEM_TLB_WK;
@@ -446,7 +542,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
u64 data_src;
int err;

- data_src = arm_spe__synth_data_source(record);
+ data_src = arm_spe__synth_data_source(record, spe->midr);

if (spe->sample_flc) {
if (record->type & ARM_SPE_L1D_MISS) {
@@ -1183,6 +1279,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
struct perf_record_time_conv *tc = &session->time_conv;
+ const char *cpuid = perf_env__cpuid(session->evlist->env);
+ u64 midr = strtol(cpuid, NULL, 16);
struct arm_spe *spe;
int err;

@@ -1202,6 +1300,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
spe->machine = &session->machines.host; /* No kvm support */
spe->auxtrace_type = auxtrace_info->type;
spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
+ spe->midr = midr;

spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);

--
2.32.0

2022-04-12 06:02:36

by German Gomez

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER


On 11/04/2022 11:26, German Gomez wrote:
> On 08/04/2022 20:53, Ali Saidi wrote:
>> Add a flag to the perf mem data struct to signal that a request caused a
>> cache-to-cache transfer of a line from a peer of the requestor and
>> wasn't sourced from a lower cache level. The line being moved from one
>> peer cache to another has latency and performance implications. On Arm64
>> Neoverse systems the data source can indicate a cache-to-cache transfer
>> but not if the line is dirty or clean, so instead of overloading HITM
>> define a new flag that indicates this type of transfer.
> I think it's fine to merge this and the previous commit rather than have
> two commits with the same msg.
>

I take it back. It has been pointed out to me that having the separation
is normal/ok. So my comment doesn't apply.

2022-04-12 22:21:13

by German Gomez

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command


On 08/04/2022 20:53, Ali Saidi wrote:
> In addition to summarizing data encoded in mem_lvl also support data
> encoded in mem_lvl_num.
>
> Since other architectures don't seem to populate the mem_lvl_num field
> here there shouldn't be a change in functionality.
>
> Signed-off-by: Ali Saidi <[email protected]>
> ---
> tools/perf/util/mem-events.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index ed0ab838bcc5..e5e405185498 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
> u64 daddr = mi->daddr.addr;
> u64 op = data_src->mem_op;
> u64 lvl = data_src->mem_lvl;
> + u64 lnum = data_src->mem_lvl_num;
> u64 snoop = data_src->mem_snoop;
> u64 lock = data_src->mem_lock;
> u64 blk = data_src->mem_blk;
> @@ -527,16 +528,18 @@ do { \
> if (lvl & P(LVL, UNC)) stats->ld_uncache++;
> if (lvl & P(LVL, IO)) stats->ld_io++;
> if (lvl & P(LVL, LFB)) stats->ld_fbhit++;

Just for completion, can we also handle LFB as it seems to be being set
in "/arch/x86/events/intel/ds.c"? (Sorry I missed this in the v4)

> - if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> - if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> - if (lvl & P(LVL, L3 )) {
> + if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> + stats->ld_l1hit++;
> + if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> + stats->ld_l2hit++;
> + if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
> if (snoop & P(SNOOP, HITM))
> HITM_INC(lcl_hitm);
> else
> stats->ld_llchit++;
> }
>
> - if (lvl & P(LVL, LOC_RAM)) {
> + if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
> stats->lcl_dram++;
> if (snoop & P(SNOOP, HIT))
> stats->ld_shared++;

2022-04-20 06:52:27

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 1/5] tools: arm64: Import cputype.h

On Fri, Apr 08, 2022 at 07:53:40PM +0000, Ali Saidi wrote:
> Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
> for arm64 to make use of all the core-type definitions in perf.
>
> Replace sysreg.h with the version already imported into tools/.
>
> Signed-off-by: Ali Saidi <[email protected]>

Please drop this patch in next spin, it has been merged in the
mainline kernel.

Thanks,
Leo

2022-04-20 10:36:07

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command

On Mon, Apr 11, 2022 at 11:04:28AM +0100, German Gomez wrote:
>
> On 08/04/2022 20:53, Ali Saidi wrote:
> > In addition to summarizing data encoded in mem_lvl also support data
> > encoded in mem_lvl_num.
> >
> > Since other architectures don't seem to populate the mem_lvl_num field
> > here there shouldn't be a change in functionality.
> >
> > Signed-off-by: Ali Saidi <[email protected]>
> > ---
> > tools/perf/util/mem-events.c | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> > index ed0ab838bcc5..e5e405185498 100644
> > --- a/tools/perf/util/mem-events.c
> > +++ b/tools/perf/util/mem-events.c
> > @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
> > u64 daddr = mi->daddr.addr;
> > u64 op = data_src->mem_op;
> > u64 lvl = data_src->mem_lvl;
> > + u64 lnum = data_src->mem_lvl_num;
> > u64 snoop = data_src->mem_snoop;
> > u64 lock = data_src->mem_lock;
> > u64 blk = data_src->mem_blk;
> > @@ -527,16 +528,18 @@ do { \
> > if (lvl & P(LVL, UNC)) stats->ld_uncache++;
> > if (lvl & P(LVL, IO)) stats->ld_io++;
> > if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
>
> Just for completion, can we also handle LFB as it seems to be being set
> in "/arch/x86/events/intel/ds.c"? (Sorry I missed this in the v4)

With fixing LFB issue pointed by German, the change looks good to me:

Reviewed-by: Leo Yan <[email protected]>

It would be appreciate if x86 or PowerPC maintainers could take a look
for this patch. Thanks!

Leo

> > - if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> > - if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> > - if (lvl & P(LVL, L3 )) {
> > + if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> > + stats->ld_l1hit++;
> > + if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> > + stats->ld_l2hit++;
> > + if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
> > if (snoop & P(SNOOP, HITM))
> > HITM_INC(lcl_hitm);
> > else
> > stats->ld_llchit++;
> > }
> >
> > - if (lvl & P(LVL, LOC_RAM)) {
> > + if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
> > stats->lcl_dram++;
> > if (snoop & P(SNOOP, HIT))
> > stats->ld_shared++;

2022-04-20 21:12:27

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores

On Fri, Apr 08, 2022 at 07:53:43PM +0000, Ali Saidi wrote:
> When synthesizing data from SPE, augment the type with source information
> for Arm Neoverse cores. The field is IMPLDEF but the Neoverse cores all use
> the same encoding. I can't find encoding information for any other SPE
> implementations to unify their choices with Arm's thus that is left for
> future work.
>
> This change populates the mem_lvl_num for Neoverse cores instead of the
> deprecated mem_lvl namespace.
>
> Signed-off-by: Ali Saidi <[email protected]>
> ---
> .../util/arm-spe-decoder/arm-spe-decoder.c | 1 +
> .../util/arm-spe-decoder/arm-spe-decoder.h | 12 ++
> tools/perf/util/arm-spe.c | 127 ++++++++++++++++--
> 3 files changed, 126 insertions(+), 14 deletions(-)
>
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> index 5e390a1a79ab..091987dd3966 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> @@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)
>
> break;
> case ARM_SPE_DATA_SOURCE:
> + decoder->record.source = payload;
> break;
> case ARM_SPE_BAD:
> break;
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> index 69b31084d6be..46a61df1145b 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> @@ -29,6 +29,17 @@ enum arm_spe_op_type {
> ARM_SPE_ST = 1 << 1,
> };
>
> +enum arm_spe_neoverse_data_source {
> + ARM_SPE_NV_L1D = 0x0,
> + ARM_SPE_NV_L2 = 0x8,
> + ARM_SPE_NV_PEER_CORE = 0x9,
> + ARM_SPE_NV_LOCAL_CLUSTER = 0xa,
> + ARM_SPE_NV_SYS_CACHE = 0xb,
> + ARM_SPE_NV_PEER_CLUSTER = 0xc,
> + ARM_SPE_NV_REMOTE = 0xd,
> + ARM_SPE_NV_DRAM = 0xe,
> +};
> +
> struct arm_spe_record {
> enum arm_spe_sample_type type;
> int err;
> @@ -40,6 +51,7 @@ struct arm_spe_record {
> u64 virt_addr;
> u64 phys_addr;
> u64 context_id;
> + u16 source;
> };
>
> struct arm_spe_insn;
> diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
> index d2b64e3f588b..a20285cf98e3 100644
> --- a/tools/perf/util/arm-spe.c
> +++ b/tools/perf/util/arm-spe.c
> @@ -34,6 +34,7 @@
> #include "arm-spe-decoder/arm-spe-decoder.h"
> #include "arm-spe-decoder/arm-spe-pkt-decoder.h"
>
> +#include "../../arch/arm64/include/asm/cputype.h"
> #define MAX_TIMESTAMP (~0ULL)
>
> struct arm_spe {
> @@ -45,6 +46,7 @@ struct arm_spe {
> struct perf_session *session;
> struct machine *machine;
> u32 pmu_type;
> + u64 midr;
>
> struct perf_tsc_conversion tc;
>
> @@ -399,33 +401,127 @@ static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
> return false;
> }
>
> -static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
> +static const struct midr_range neoverse_spe[] = {
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
> + {},
> +};
> +
> +
> +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> + union perf_mem_data_src *data_src)
> {
> - union perf_mem_data_src data_src = { 0 };
> + /*
> + * Even though four levels of cache hierarchy are possible, no known
> + * production Neoverse systems currently include more than three levels
> + * so for the time being we assume three exist. If a production system
> + * is built with four the this function would have to be changed to
> + * detect the number of levels for reporting.
> + */
>
> - if (record->op == ARM_SPE_LD)
> - data_src.mem_op = PERF_MEM_OP_LOAD;
> - else
> - data_src.mem_op = PERF_MEM_OP_STORE;
> + /*
> + * We have no data on the hit level or data source for stores in the
> + * Neoverse SPE records.
> + */
> + if (record->op & ARM_SPE_ST) {
> + data_src->mem_lvl = PERF_MEM_LVL_NA;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NA;
> + return;
> + }
> +
> +

Redundant new line.

> + switch (record->source) {
> + case ARM_SPE_NV_L1D:
> + data_src->mem_lvl = PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> + break;
> + case ARM_SPE_NV_L2:
> + data_src->mem_lvl = PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> + break;
> + case ARM_SPE_NV_PEER_CORE:
> + data_src->mem_lvl = PERF_MEM_LVL_HIT;
> + data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> + break;
> + /*
> + * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> + * transfer, so set SNOOPX_PEER
> + */
> + case ARM_SPE_NV_LOCAL_CLUSTER:
> + case ARM_SPE_NV_PEER_CLUSTER:
> + data_src->mem_lvl = PERF_MEM_LVL_HIT;
> + data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;

As a side topic, it's better to use a new patch to dump snooping flag
PERF_MEM_SNOOPX_PEER, some code like below:

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index f8f234251f92..66d44280a4ea 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -410,6 +410,11 @@ static const char * const snoop_access[] = {
"HitM",
};

+static const char * const snoopx_access[] = {
+ "Fwd",
+ "Peer",
+};
+
int perf_mem__snp_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
{
size_t i, l = 0;
@@ -430,13 +435,18 @@ int perf_mem__snp_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
}
l += scnprintf(out + l, sz - l, snoop_access[i]);
}
- if (mem_info &&
- (mem_info->data_src.mem_snoopx & PERF_MEM_SNOOPX_FWD)) {
+
+ if (mem_info)
+ m = mem_info->data_src.mem_snoopx;
+
+ for (i = 0; m && i < ARRAY_SIZE(snoopx_access); i++, m >>= 1) {
+ if (!(m & 0x1))
+ continue;
if (l) {
strcat(out, " or ");
l += 4;
}
- l += scnprintf(out + l, sz - l, "Fwd");
+ l += scnprintf(out + l, sz - l, snoopx_access[i]);
}


> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> + break;
> + /*
> + * System cache is assumed to be L3
> + */
> + case ARM_SPE_NV_SYS_CACHE:
> + data_src->mem_lvl = PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> + data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> + break;
> + /*
> + * We don't know what level it hit in, except it came from the other
> + * socket
> + */
> + case ARM_SPE_NV_REMOTE:
> + data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> + data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NA;
> + break;
> + case ARM_SPE_NV_DRAM:
> + data_src->mem_lvl = PERF_MEM_LVL_HIT;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> + break;
> + default:
> + break;
> + }
> +}
>
> +static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
> + union perf_mem_data_src *data_src)
> +{
> if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
> - data_src.mem_lvl = PERF_MEM_LVL_L3;
> + data_src->mem_lvl = PERF_MEM_LVL_L3;
>
> if (record->type & ARM_SPE_LLC_MISS)
> - data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> + data_src->mem_lvl |= PERF_MEM_LVL_MISS;
> else
> - data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> + data_src->mem_lvl |= PERF_MEM_LVL_HIT;
> } else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
> - data_src.mem_lvl = PERF_MEM_LVL_L1;
> + data_src->mem_lvl = PERF_MEM_LVL_L1;
>
> if (record->type & ARM_SPE_L1D_MISS)
> - data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> + data_src->mem_lvl |= PERF_MEM_LVL_MISS;
> else
> - data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> + data_src->mem_lvl |= PERF_MEM_LVL_HIT;
> }
>
> if (record->type & ARM_SPE_REMOTE_ACCESS)
> - data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> + data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> +}
> +
> +static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
> +{
> + union perf_mem_data_src data_src = { 0 };
> + bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
> +
> + if (record->op & ARM_SPE_LD)
> + data_src.mem_op = PERF_MEM_OP_LOAD;
> + else
> + data_src.mem_op = PERF_MEM_OP_STORE;
> +
> + if (is_neoverse)
> + arm_spe__synth_data_source_neoverse(record, &data_src);
> + else
> + arm_spe__synth_data_source_generic(record, &data_src);
>
> if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
> data_src.mem_dtlb = PERF_MEM_TLB_WK;
> @@ -446,7 +542,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
> u64 data_src;
> int err;
>
> - data_src = arm_spe__synth_data_source(record);
> + data_src = arm_spe__synth_data_source(record, spe->midr);
>
> if (spe->sample_flc) {
> if (record->type & ARM_SPE_L1D_MISS) {
> @@ -1183,6 +1279,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
> struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
> size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
> struct perf_record_time_conv *tc = &session->time_conv;
> + const char *cpuid = perf_env__cpuid(session->evlist->env);
> + u64 midr = strtol(cpuid, NULL, 16);
> struct arm_spe *spe;
> int err;
>
> @@ -1202,6 +1300,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
> spe->machine = &session->machines.host; /* No kvm support */
> spe->auxtrace_type = auxtrace_info->type;
> spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
> + spe->midr = midr;

Except the redundant line, the patch is good for me and I tested it at
my side:

Reviewed-by: Leo Yan <[email protected]>
Tested-by: Leo Yan <[email protected]>

>
> spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
>
> --
> 2.32.0
>

2022-04-21 07:47:16

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command



On 4/8/2022 3:53 PM, Ali Saidi wrote:
> In addition to summarizing data encoded in mem_lvl also support data
> encoded in mem_lvl_num.
>
> Since other architectures don't seem to populate the mem_lvl_num field
> here there shouldn't be a change in functionality.
>
> Signed-off-by: Ali Saidi <[email protected]>
> ---
> tools/perf/util/mem-events.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index ed0ab838bcc5..e5e405185498 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
> u64 daddr = mi->daddr.addr;
> u64 op = data_src->mem_op;
> u64 lvl = data_src->mem_lvl;
> + u64 lnum = data_src->mem_lvl_num;
> u64 snoop = data_src->mem_snoop;
> u64 lock = data_src->mem_lock;
> u64 blk = data_src->mem_blk;
> @@ -527,16 +528,18 @@ do { \
> if (lvl & P(LVL, UNC)) stats->ld_uncache++;
> if (lvl & P(LVL, IO)) stats->ld_io++;
> if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
> - if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> - if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> - if (lvl & P(LVL, L3 )) {
> + if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> + stats->ld_l1hit++;
> + if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> + stats->ld_l2hit++;
> + if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
> if (snoop & P(SNOOP, HITM))
> HITM_INC(lcl_hitm);
> else
> stats->ld_llchit++;
> }
>
> - if (lvl & P(LVL, LOC_RAM)) {
> + if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {

I think the PERF_MEM_LVLNUM_RAM only means it's a DRAM.
It doesn't contain the location information. To distinguish the local
and remote dram, X86 uses PERF_MEM_REMOTE_REMOTE.
Here the remote dram will be mistakenly calculated if you only check the
PERF_MEM_LVLNUM_RAM.

Actually, it looks like the mem_lvl_num fields supported in this patch
are also supported by the PERF_MEM_LVL*. Why don't you set both
PERF_MEM_LVLNUM_* and PERF_MEM_LVL* in your previous patch 4?
Then you can drop this patch.

Thanks,
Kan
> stats->lcl_dram++;
> if (snoop & P(SNOOP, HIT))
> stats->ld_shared++;

2022-04-21 14:44:49

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER

On Mon, Apr 11, 2022 at 03:35:52PM +0100, German Gomez wrote:
>
> On 11/04/2022 11:26, German Gomez wrote:
> > On 08/04/2022 20:53, Ali Saidi wrote:
> >> Add a flag to the perf mem data struct to signal that a request caused a
> >> cache-to-cache transfer of a line from a peer of the requestor and
> >> wasn't sourced from a lower cache level. The line being moved from one
> >> peer cache to another has latency and performance implications. On Arm64
> >> Neoverse systems the data source can indicate a cache-to-cache transfer
> >> but not if the line is dirty or clean, so instead of overloading HITM
> >> define a new flag that indicates this type of transfer.
> > I think it's fine to merge this and the previous commit rather than have
> > two commits with the same msg.
> >
>
> I take it back. It has been pointed out to me that having the separation
> is normal/ok. So my comment doesn't apply.

Yeah, it's good that we split these two patches since it's easier
for maintainer to pick up separately :)

Thanks,
Leo

2022-04-21 19:24:59

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores

On Fri, Apr 08, 2022 at 07:53:43PM +0000, Ali Saidi wrote:
> When synthesizing data from SPE, augment the type with source information
> for Arm Neoverse cores. The field is IMPLDEF but the Neoverse cores all use
> the same encoding. I can't find encoding information for any other SPE
> implementations to unify their choices with Arm's thus that is left for
> future work.
>
> This change populates the mem_lvl_num for Neoverse cores instead of the
> deprecated mem_lvl namespace.
>
> Signed-off-by: Ali Saidi <[email protected]>
> ---
> .../util/arm-spe-decoder/arm-spe-decoder.c | 1 +
> .../util/arm-spe-decoder/arm-spe-decoder.h | 12 ++
> tools/perf/util/arm-spe.c | 127 ++++++++++++++++--
> 3 files changed, 126 insertions(+), 14 deletions(-)
>
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> index 5e390a1a79ab..091987dd3966 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> @@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)
>
> break;
> case ARM_SPE_DATA_SOURCE:
> + decoder->record.source = payload;
> break;
> case ARM_SPE_BAD:
> break;
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> index 69b31084d6be..46a61df1145b 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> @@ -29,6 +29,17 @@ enum arm_spe_op_type {
> ARM_SPE_ST = 1 << 1,
> };
>
> +enum arm_spe_neoverse_data_source {
> + ARM_SPE_NV_L1D = 0x0,
> + ARM_SPE_NV_L2 = 0x8,
> + ARM_SPE_NV_PEER_CORE = 0x9,
> + ARM_SPE_NV_LOCAL_CLUSTER = 0xa,
> + ARM_SPE_NV_SYS_CACHE = 0xb,
> + ARM_SPE_NV_PEER_CLUSTER = 0xc,
> + ARM_SPE_NV_REMOTE = 0xd,
> + ARM_SPE_NV_DRAM = 0xe,
> +};
> +
> struct arm_spe_record {
> enum arm_spe_sample_type type;
> int err;
> @@ -40,6 +51,7 @@ struct arm_spe_record {
> u64 virt_addr;
> u64 phys_addr;
> u64 context_id;
> + u16 source;
> };
>
> struct arm_spe_insn;
> diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
> index d2b64e3f588b..a20285cf98e3 100644
> --- a/tools/perf/util/arm-spe.c
> +++ b/tools/perf/util/arm-spe.c
> @@ -34,6 +34,7 @@
> #include "arm-spe-decoder/arm-spe-decoder.h"
> #include "arm-spe-decoder/arm-spe-pkt-decoder.h"
>
> +#include "../../arch/arm64/include/asm/cputype.h"
> #define MAX_TIMESTAMP (~0ULL)
>
> struct arm_spe {
> @@ -45,6 +46,7 @@ struct arm_spe {
> struct perf_session *session;
> struct machine *machine;
> u32 pmu_type;
> + u64 midr;
>
> struct perf_tsc_conversion tc;
>
> @@ -399,33 +401,127 @@ static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
> return false;
> }
>
> -static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
> +static const struct midr_range neoverse_spe[] = {
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
> + MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
> + {},
> +};
> +
> +
> +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> + union perf_mem_data_src *data_src)
> {
> - union perf_mem_data_src data_src = { 0 };
> + /*
> + * Even though four levels of cache hierarchy are possible, no known
> + * production Neoverse systems currently include more than three levels
> + * so for the time being we assume three exist. If a production system
> + * is built with four the this function would have to be changed to
> + * detect the number of levels for reporting.
> + */
>
> - if (record->op == ARM_SPE_LD)
> - data_src.mem_op = PERF_MEM_OP_LOAD;
> - else
> - data_src.mem_op = PERF_MEM_OP_STORE;
> + /*
> + * We have no data on the hit level or data source for stores in the
> + * Neoverse SPE records.
> + */
> + if (record->op & ARM_SPE_ST) {
> + data_src->mem_lvl = PERF_MEM_LVL_NA;
> + data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> + data_src->mem_snoop = PERF_MEM_SNOOP_NA;
> + return;
> + }

For the store operation, I found we need to use more strictly criteria
to check memory operations, otherwise, we might wrongly synthesize
memory sample even for other types of operations.

To fix the issue, I think we need to add below patch; if this is okay
for you, please consider to include it in the next patch set version.

Thanks,
Leo

From 7f8499d4f44b400d217c01d42059f00e8a1697b0 Mon Sep 17 00:00:00 2001
From: Leo Yan <[email protected]>
Date: Wed, 20 Apr 2022 15:46:21 +0800
Subject: [PATCH] perf arm-spe: Don't set data source if it's not a memory
operation

Except memory load and store operations, Arm SPE records also can
support other operation types, bug when set the data source field the
current code assumes a record is a either load operation or store
operation, this leads to wrongly synthesize memory samples.

This patch strictly checks the record operation type, it only sets data
source only for the operation types ARM_SPE_LD and ARM_SPE_ST,
otherwise, returns zero for data source. Therefore, we can synthesize
memory samples only when data source is a non-zero value, the function
arm_spe__is_memory_event() is useless and removed.

Signed-off-by: Leo Yan <[email protected]>
---
tools/perf/util/arm-spe.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
index d2b64e3f588b..76251825c01d 100644
--- a/tools/perf/util/arm-spe.c
+++ b/tools/perf/util/arm-spe.c
@@ -387,26 +387,16 @@ static int arm_spe__synth_instruction_sample(struct arm_spe_queue *speq,
return arm_spe_deliver_synth_event(spe, speq, event, &sample);
}

-#define SPE_MEM_TYPE (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS | \
- ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS | \
- ARM_SPE_REMOTE_ACCESS)
-
-static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
-{
- if (type & SPE_MEM_TYPE)
- return true;
-
- return false;
-}
-
static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
{
union perf_mem_data_src data_src = { 0 };

if (record->op == ARM_SPE_LD)
data_src.mem_op = PERF_MEM_OP_LOAD;
- else
+ else if (record->op & ARM_SPE_ST)
data_src.mem_op = PERF_MEM_OP_STORE;
+ else
+ return 0;

if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
data_src.mem_lvl = PERF_MEM_LVL_L3;
@@ -510,7 +500,11 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
return err;
}

- if (spe->sample_memory && arm_spe__is_memory_event(record->type)) {
+ /*
+ * When data_src is zero it means the record is not a memory operation,
+ * skip to synthesize memory sample for this case.
+ */
+ if (spe->sample_memory && data_src) {
err = arm_spe__synth_mem_sample(speq, spe->memory_id, data_src);
if (err)
return err;
--
2.25.1

2022-04-22 08:09:51

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct



On 4/8/2022 3:53 PM, Ali Saidi wrote:
> Add a flag to the perf mem data struct to signal that a request caused a
> cache-to-cache transfer of a line from a peer of the requestor and
> wasn't sourced from a lower cache level.

It sounds similar to the Forward state. Why can't the
PERF_MEM_SNOOPX_FWD be reused?

Thanks,
Kan

> The line being moved from one
> peer cache to another has latency and performance implications. On Arm64
> Neoverse systems the data source can indicate a cache-to-cache transfer
> but not if the line is dirty or clean, so instead of overloading HITM
> define a new flag that indicates this type of transfer.
>
> Signed-off-by: Ali Saidi <[email protected]>
> ---
> include/uapi/linux/perf_event.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..c9e58c79f3e5 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1308,7 +1308,7 @@ union perf_mem_data_src {
> #define PERF_MEM_SNOOP_SHIFT 19
>
> #define PERF_MEM_SNOOPX_FWD 0x01 /* forward */
> -/* 1 free */
> +#define PERF_MEM_SNOOPX_PEER 0x02 /* xfer from peer */
> #define PERF_MEM_SNOOPX_SHIFT 38
>
> /* locked instruction */

2022-04-22 19:27:59

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct

On Fri, Apr 08, 2022 at 07:53:41PM +0000, Ali Saidi wrote:
> Add a flag to the perf mem data struct to signal that a request caused a
> cache-to-cache transfer of a line from a peer of the requestor and
> wasn't sourced from a lower cache level. The line being moved from one
> peer cache to another has latency and performance implications. On Arm64
> Neoverse systems the data source can indicate a cache-to-cache transfer
> but not if the line is dirty or clean, so instead of overloading HITM
> define a new flag that indicates this type of transfer.
>
> Signed-off-by: Ali Saidi <[email protected]>

The patch looks good to me:
Reviewed-by: Leo Yan <[email protected]>

Sine this is a common flag, it's better if x86 or PowerPC maintainers
could take a look for this new snooping type. Thanks!

Leo

> ---
> include/uapi/linux/perf_event.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..c9e58c79f3e5 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1308,7 +1308,7 @@ union perf_mem_data_src {
> #define PERF_MEM_SNOOP_SHIFT 19
>
> #define PERF_MEM_SNOOPX_FWD 0x01 /* forward */
> -/* 1 free */
> +#define PERF_MEM_SNOOPX_PEER 0x02 /* xfer from peer */
> #define PERF_MEM_SNOOPX_SHIFT 38
>
> /* locked instruction */
> --
> 2.32.0
>

2022-04-22 20:33:12

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command

Em Wed, Apr 20, 2022 at 04:48:23PM +0800, Leo Yan escreveu:
> On Mon, Apr 11, 2022 at 11:04:28AM +0100, German Gomez wrote:
> >
> > On 08/04/2022 20:53, Ali Saidi wrote:
> > > In addition to summarizing data encoded in mem_lvl also support data
> > > encoded in mem_lvl_num.
> > >
> > > Since other architectures don't seem to populate the mem_lvl_num field
> > > here there shouldn't be a change in functionality.
> > >
> > > Signed-off-by: Ali Saidi <[email protected]>
> > > ---
> > > tools/perf/util/mem-events.c | 11 +++++++----
> > > 1 file changed, 7 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> > > index ed0ab838bcc5..e5e405185498 100644
> > > --- a/tools/perf/util/mem-events.c
> > > +++ b/tools/perf/util/mem-events.c
> > > @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
> > > u64 daddr = mi->daddr.addr;
> > > u64 op = data_src->mem_op;
> > > u64 lvl = data_src->mem_lvl;
> > > + u64 lnum = data_src->mem_lvl_num;
> > > u64 snoop = data_src->mem_snoop;
> > > u64 lock = data_src->mem_lock;
> > > u64 blk = data_src->mem_blk;
> > > @@ -527,16 +528,18 @@ do { \
> > > if (lvl & P(LVL, UNC)) stats->ld_uncache++;
> > > if (lvl & P(LVL, IO)) stats->ld_io++;
> > > if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
> >
> > Just for completion, can we also handle LFB as it seems to be being set
> > in "/arch/x86/events/intel/ds.c"? (Sorry I missed this in the v4)
>
> With fixing LFB issue pointed by German, the change looks good to me:

Waiting for a v6 then, please collect Leo's reviewed-by tag when
submitting it.

- Arnaldo

> Reviewed-by: Leo Yan <[email protected]>
>
> It would be appreciate if x86 or PowerPC maintainers could take a look
> for this patch. Thanks!


> Leo
>
> > > - if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> > > - if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> > > - if (lvl & P(LVL, L3 )) {
> > > + if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> > > + stats->ld_l1hit++;
> > > + if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> > > + stats->ld_l2hit++;
> > > + if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
> > > if (snoop & P(SNOOP, HITM))
> > > HITM_INC(lcl_hitm);
> > > else
> > > stats->ld_llchit++;
> > > }
> > >
> > > - if (lvl & P(LVL, LOC_RAM)) {
> > > + if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
> > > stats->lcl_dram++;
> > > if (snoop & P(SNOOP, HIT))
> > > stats->ld_shared++;

--

- Arnaldo

2022-04-22 22:29:38

by Ali Saidi

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct

On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> > Add a flag to the perf mem data struct to signal that a request caused a
> > cache-to-cache transfer of a line from a peer of the requestor and
> > wasn't sourced from a lower cache level.
>
> It sounds similar to the Forward state. Why can't the
> PERF_MEM_SNOOPX_FWD be reused?

Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
the semantics align between architectures.

Thanks,

Ali

2022-04-22 23:14:03

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct



On 4/22/2022 2:49 PM, Ali Saidi wrote:
> On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
>> On 4/8/2022 3:53 PM, Ali Saidi wrote:
>>> Add a flag to the perf mem data struct to signal that a request caused a
>>> cache-to-cache transfer of a line from a peer of the requestor and
>>> wasn't sourced from a lower cache level.
>>
>> It sounds similar to the Forward state. Why can't the
>> PERF_MEM_SNOOPX_FWD be reused?
>
> Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> the semantics align between architectures.
>

+ Andi

As my understanding, the SNOOPX_FWD means the Forward state, which is a
non-modified (clean) cache-to-cache copy.
https://en.wikipedia.org/wiki/MESIF_protocol

Thanks,
Kan

2022-04-22 23:18:28

by Ali Saidi

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct


On Fri, 22 Apr 2022 21:43:28, Kan Liang wrote:
> On 4/22/2022 2:49 PM, Ali Saidi wrote:
> > On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> >> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> >>> Add a flag to the perf mem data struct to signal that a request caused a
> >>> cache-to-cache transfer of a line from a peer of the requestor and
> >>> wasn't sourced from a lower cache level.
> >>
> >> It sounds similar to the Forward state. Why can't the
> >> PERF_MEM_SNOOPX_FWD be reused?
> >
> > Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> > the semantics align between architectures.
> >
>
> + Andi
>
> As my understanding, the SNOOPX_FWD means the Forward state, which is a
> non-modified (clean) cache-to-cache copy.
> https://en.wikipedia.org/wiki/MESIF_protocol

In this case the semantics are different. We know the line was transferred from
another peer cache, but don't know if it was clean, dirty, or if the receiving core
now has exclusive ownership of it.

Thanks,

Ali

2022-04-23 11:13:12

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct

On Fri, Apr 22, 2022 at 09:22:49PM +0000, Ali Saidi wrote:
>
> On Fri, 22 Apr 2022 21:43:28, Kan Liang wrote:
> > On 4/22/2022 2:49 PM, Ali Saidi wrote:
> > > On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> > >> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> > >>> Add a flag to the perf mem data struct to signal that a request caused a
> > >>> cache-to-cache transfer of a line from a peer of the requestor and
> > >>> wasn't sourced from a lower cache level.
> > >>
> > >> It sounds similar to the Forward state. Why can't the
> > >> PERF_MEM_SNOOPX_FWD be reused?
> > >
> > > Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> > > the semantics align between architectures.
> > >
> >
> > + Andi
> >
> > As my understanding, the SNOOPX_FWD means the Forward state, which is a
> > non-modified (clean) cache-to-cache copy.
> > https://en.wikipedia.org/wiki/MESIF_protocol
>
> In this case the semantics are different. We know the line was transferred from
> another peer cache, but don't know if it was clean, dirty, or if the receiving core
> now has exclusive ownership of it.

In the spec "Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3B: System Programming Guide, Part 2", section "18.8.1.3 Off-core
Response Performance Monitoring in the Processor Core", it defines the
REMOTE_CACHE_FWD as:

"L3 Miss: local homed requests that missed the L3 cache and was serviced
by forwarded data following a cross package snoop where no modified copies
found. (Remote home requests are not counted)".

Except SNOOPX_FWD means a no modified cache snooping, it also means it's
a cache conherency from *remote* socket. This is quite different from we
define SNOOPX_PEER, which only snoop from peer CPU or clusters.

If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
this would be easier for us to distinguish the semantics and support the
statistics for SNOOPX_FWD and SNOOPX_PEER separately.

I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.

Thanks,
Leo

2022-04-23 18:23:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct


> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> a cache conherency from *remote* socket. This is quite different from we
> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>
> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> this would be easier for us to distinguish the semantics and support the
> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>
> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.

Yes seems better to keep using a separate flag if they don't exactly match.

It's not that we're short on flags anyways.

-Andi

2022-04-24 14:35:46

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct

On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
>
> > Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> > a cache conherency from *remote* socket. This is quite different from we
> > define SNOOPX_PEER, which only snoop from peer CPU or clusters.
> >
> > If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> > this would be easier for us to distinguish the semantics and support the
> > statistics for SNOOPX_FWD and SNOOPX_PEER separately.
> >
> > I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
>
> Yes seems better to keep using a separate flag if they don't exactly match.
>
> It's not that we're short on flags anyways.

Thanks for confirmation.

Leo

2022-04-26 04:09:10

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct



On 4/24/2022 7:43 AM, Leo Yan wrote:
> On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
>>
>>> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
>>> a cache conherency from *remote* socket. This is quite different from we
>>> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>>>

The FWD doesn't have to be *remote*. The definition you quoted is just
for the "L3 Miss", which is indeed a remote forward. But we still have
cross-core FWD. See Table 19-101.

Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to
indicate the remote FWD, not just SNOOPX_FWD.

>>> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
>>> this would be easier for us to distinguish the semantics and support the
>>> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>>>
>>> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
>>
>> Yes seems better to keep using a separate flag if they don't exactly match.
>>

Yes, I agree with Andi. If you still think the existing flag combination
doesn't match your requirement, a new separate flag should be
introduced. I'm not familiar with ARM. I think I will leave it to you
and the maintainer to decide.

Thanks,
Kan

2022-04-27 16:54:31

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct

Hi Kan,

On Mon, Apr 25, 2022 at 01:01:40PM -0400, Liang, Kan wrote:
>
>
> On 4/24/2022 7:43 AM, Leo Yan wrote:
> > On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
> > >
> > > > Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> > > > a cache conherency from *remote* socket. This is quite different from we
> > > > define SNOOPX_PEER, which only snoop from peer CPU or clusters.
> > > >
>
> The FWD doesn't have to be *remote*. The definition you quoted is just for
> the "L3 Miss", which is indeed a remote forward. But we still have
> cross-core FWD. See Table 19-101.
>
> Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to
> indicate the remote FWD, not just SNOOPX_FWD.

Thanks a lot for the info.

> > > > If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> > > > this would be easier for us to distinguish the semantics and support the
> > > > statistics for SNOOPX_FWD and SNOOPX_PEER separately.
> > > >
> > > > I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
> > >
> > > Yes seems better to keep using a separate flag if they don't exactly match.
> > >
>
> Yes, I agree with Andi. If you still think the existing flag combination
> doesn't match your requirement, a new separate flag should be introduced.
> I'm not familiar with ARM. I think I will leave it to you and the maintainer
> to decide.

It's a bit difficult for me to make decision is because now SNOOPX_FWD
is not used in the file util/mem-events.c, so I am not very sure if
SNOOPX_FWD has the consistent usage across different arches.

On the other hand, I sent a patch for 'peer' flag statistics [1], you
could review it and it only stats for L2 and L3 cache level for local
node.

The main purpose for my sending this email is if you think the FWD can
be the consistent for both arches, and even the new added display mode
is also useful for x86 arch (we can rename it as 'fwd' display mode),
then I am very glad to unify the flag.

Thanks,
Leo

[1] https://lore.kernel.org/lkml/[email protected]/

2022-04-27 19:53:28

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct



On 4/27/2022 12:19 PM, Leo Yan wrote:
> Hi Kan,
>
> On Mon, Apr 25, 2022 at 01:01:40PM -0400, Liang, Kan wrote:
>>
>>
>> On 4/24/2022 7:43 AM, Leo Yan wrote:
>>> On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
>>>>
>>>>> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
>>>>> a cache conherency from *remote* socket. This is quite different from we
>>>>> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>>>>>
>>
>> The FWD doesn't have to be *remote*. The definition you quoted is just for
>> the "L3 Miss", which is indeed a remote forward. But we still have
>> cross-core FWD. See Table 19-101.
>>
>> Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to
>> indicate the remote FWD, not just SNOOPX_FWD.
>
> Thanks a lot for the info.
>
>>>>> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
>>>>> this would be easier for us to distinguish the semantics and support the
>>>>> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>>>>>
>>>>> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
>>>>
>>>> Yes seems better to keep using a separate flag if they don't exactly match.
>>>>
>>
>> Yes, I agree with Andi. If you still think the existing flag combination
>> doesn't match your requirement, a new separate flag should be introduced.
>> I'm not familiar with ARM. I think I will leave it to you and the maintainer
>> to decide.
>
> It's a bit difficult for me to make decision is because now SNOOPX_FWD
> is not used in the file util/mem-events.c, so I am not very sure if
> SNOOPX_FWD has the consistent usage across different arches.

No, it's used in the file util/mem-events.c
See perf_mem__snp_scnprintf().

>
> On the other hand, I sent a patch for 'peer' flag statistics [1], you
> could review it and it only stats for L2 and L3 cache level for local
> node.

If it's for the local node, why don't you use the hop level which is
introduced recently by Power? The below seems a good fit.

PERF_MEM_LVLNUM_ANY_CACHE | PERF_MEM_HOPS_0?

/* hop level */
#define PERF_MEM_HOPS_0 0x01 /* remote core, same node */
#define PERF_MEM_HOPS_1 0x02 /* remote node, same socket */
#define PERF_MEM_HOPS_2 0x03 /* remote socket, same board */
#define PERF_MEM_HOPS_3 0x04 /* remote board */
/* 5-7 available */
#define PERF_MEM_HOPS_SHIFT 43

Thanks,
Kan

>
> The main purpose for my sending this email is if you think the FWD can
> be the consistent for both arches, and even the new added display mode
> is also useful for x86 arch (we can rename it as 'fwd' display mode),
> then I am very glad to unify the flag.
>
> Thanks,
> Leo
>
> [1] https://lore.kernel.org/lkml/[email protected]/



2022-04-29 14:31:08

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct

On Wed, Apr 27, 2022 at 03:29:31PM -0400, Liang, Kan wrote:

[...]

> > It's a bit difficult for me to make decision is because now SNOOPX_FWD
> > is not used in the file util/mem-events.c, so I am not very sure if
> > SNOOPX_FWD has the consistent usage across different arches.
>
> No, it's used in the file util/mem-events.c
> See perf_mem__snp_scnprintf().

Right. Actually I mean FWD flag is not for statistics.

> > On the other hand, I sent a patch for 'peer' flag statistics [1], you
> > could review it and it only stats for L2 and L3 cache level for local
> > node.
>
> If it's for the local node, why don't you use the hop level which is
> introduced recently by Power? The below seems a good fit.
>
> PERF_MEM_LVLNUM_ANY_CACHE | PERF_MEM_HOPS_0?
>
> /* hop level */
> #define PERF_MEM_HOPS_0 0x01 /* remote core, same node */
> #define PERF_MEM_HOPS_1 0x02 /* remote node, same socket */
> #define PERF_MEM_HOPS_2 0x03 /* remote socket, same board */
> #define PERF_MEM_HOPS_3 0x04 /* remote board */
> /* 5-7 available */
> #define PERF_MEM_HOPS_SHIFT 43

Thanks for reminding. I have considered HOPS flags during the
discussion with Ali for the flags, you could see PERF_MEM_HOPS_0 is for
"remote core, same node", this could introduce confusion for Arm
Neoverse CPUs. Another thinking is how we consume the flags in perf c2c
tool, perf c2c tool uses snoop flag to find out the costly cache
conherency operations, if we use PERF_MEM_HOPS_0 that means we need to
extend perf c2c tool to support two kinds of flags: snoop flag and hop
flag, so it would introduce complexity for perf c2c code.

If we step back to review current flags, we can see different arches
have different memory model (and implementations), it is a bit painful
when we try to unify the flags. So at current stage, I prefer to use
PEER flag for Arm arches, essentially it's not too bad that now we
introduce one bit, and later we may consider to consolidate memory
flags in more general way.

Thanks,
Leo