2024-02-14 01:18:48

by Ian Rogers

Subject: [PATCH v1 00/30] perf vendor event and TMA 4.7 metric update

The first 12 patches update vendor events to those in:
https://github.com/intel/perfmon

The next 18 patches update the TMA metrics from version 4.5 to version
4.7. Across many models this brings improvements such as the
tma_info_bottleneck* metrics, which abstract or summarize the 100+ TMA
tree nodes into a set of 12 familiar performance metrics.
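
As a rough illustration, each bottleneck summary metric is an ordinary
json Metric entry computed from the underlying tree nodes, along the
lines of the sketch below (illustrative only, not copied verbatim from
any shipped json; the exact expression, groups and threshold are model
specific and come from the converter script):

  {
      "BriefDescription": "Illustrative sketch only: total pipeline cost of instruction-fetch bottlenecks from a large code footprint (i-side cache, TLB and BTB misses)",
      "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_dsb_switches + tma_lcp + tma_ms_switches)",
      "MetricGroup": "BigFootprint;Fed;Frontend;IcMiss;MemoryTLB",
      "MetricName": "tma_info_bottleneck_big_code",
      "MetricThreshold": "tma_info_bottleneck_big_code > 20"
  }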

The update was possible due to the release of TMA 4.7 metrics in:
https://github.com/intel/perfmon/pull/140
https://github.com/intel/perfmon/pull/138

Pull request:
https://github.com/intel/perfmon/pull/144
was applied to the converter script. This PR adds back CORE_CLKS as in
TMA metrics 4.5 for pre-Icelake models. It avoids an issue when running
perf stat outside of system-wide mode, where pre-Icelake models need to
scale the slots depending on what the other hyperthread is doing (via
the event CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE - later models have a
slots counter for this). The scaling is known to be inaccurate, hence
its removal from TMA 4.7, but perf would still allow the metric to be
used outside of system-wide mode and would report a value for it. It
seemed better to report a more accurate number outside of system-wide
mode, possibly inducing multiplexing, and keep the TMA 4.5 behavior. We
can later change the behavior to report NaN [1] when not in system-wide
mode to avoid broken usage. The non-system-wide mode is known as
"#EBS_Mode" in the TMA 4.5 spreadsheet and is detected in the perf json
with the "#core_wide" literal.

[1] https://lore.kernel.org/lkml/[email protected]/
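
For reference, the restored pre-Icelake CORE_CLKS scaling is expressed
in the generated json roughly as follows (an illustrative sketch, not
copied verbatim from any one model; the exact events and surrounding
expressions vary per model):

  {
      "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core (illustrative sketch, not verbatim from any shipped json)",
      "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CPU_CLK_UNHALTED.THREAD))",
      "MetricGroup": "SMT",
      "MetricName": "tma_info_core_core_clks"
  }

When "#core_wide" evaluates false (counting is not core wide, e.g. perf
stat on a single process), the per-thread clocks are scaled by one plus
the fraction of reference cycles in which only one hyperthread was
active; when it evaluates true, the per-core
CPU_CLK_UNHALTED.THREAD_ANY count is simply halved.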

Ian Rogers (30):
perf vendor events intel: Update alderlake events to v1.24
perf vendor events intel: Update alderlaken events to v1.24
perf vendor events intel: Update broadwell events to v29
perf vendor events intel: Update emeraldrapids events to v1.03
perf vendor events intel: Update grandridge events to v1.01
perf vendor events intel: Update haswell events to v35
perf vendor events intel: Update icelake events to v1.21
perf vendor events intel: Update meteorlake events to v1.07
perf vendor events intel: Update rocketlake events to v1.02
perf vendor events intel: Update sierraforest events to v1.01
perf vendor events intel: Update skylake events to v58
perf vendor events intel: Update tigerlake events to v1.15
perf vendor events intel: Update alderlake TMA metrics to 4.7
perf vendor events intel: Update broadwell TMA metrics to 4.7
perf vendor events intel: Update broadwellde TMA metrics to 4.7
perf vendor events intel: Update broadwellx TMA metrics to 4.7
perf vendor events intel: Update cascadelakex TMA metrics to 4.7
perf vendor events intel: Update haswell TMA metrics to 4.7
perf vendor events intel: Update haswellx TMA metrics to 4.7
perf vendor events intel: Update icelake TMA metrics to 4.7
perf vendor events intel: Update icelakex TMA metrics to 4.7
perf vendor events intel: Update ivybridge TMA metrics to 4.7
perf vendor events intel: Update ivytown TMA metrics to 4.7
perf vendor events intel: Update jaketown TMA metrics to 4.7
perf vendor events intel: Update rocketlake TMA metrics to 4.7
perf vendor events intel: Update sandybridge TMA metrics to 4.7
perf vendor events intel: Update sapphirerapids TMA metrics to 4.7
perf vendor events intel: Update skylake TMA metrics to 4.7
perf vendor events intel: Update skylakex TMA metrics to 4.7
perf vendor events intel: Update tigerlake TMA metrics to 4.7

.../arch/x86/alderlake/adl-metrics.json | 459 ++-
.../arch/x86/alderlake/floating-point.json | 30 +-
.../arch/x86/alderlake/metricgroups.json | 11 +-
.../pmu-events/arch/x86/alderlake/other.json | 10 +
.../arch/x86/alderlake/pipeline.json | 13 +
.../pmu-events/arch/x86/alderlaken/other.json | 9 +
.../arch/x86/alderlaken/pipeline.json | 9 +
.../arch/x86/broadwell/bdw-metrics.json | 204 +-
.../pmu-events/arch/x86/broadwell/memory.json | 2 +-
.../arch/x86/broadwell/metricgroups.json | 7 +-
.../arch/x86/broadwellde/bdwde-metrics.json | 191 +-
.../arch/x86/broadwellde/metricgroups.json | 7 +-
.../arch/x86/broadwellx/bdx-metrics.json | 250 +-
.../arch/x86/broadwellx/metricgroups.json | 7 +-
.../arch/x86/cascadelakex/clx-metrics.json | 566 +++-
.../arch/x86/cascadelakex/metricgroups.json | 12 +-
.../arch/x86/emeraldrapids/uncore-cache.json | 152 +
.../pmu-events/arch/x86/grandridge/cache.json | 185 ++
.../arch/x86/grandridge/floating-point.json | 68 +
.../arch/x86/grandridge/frontend.json | 16 +
.../arch/x86/grandridge/memory.json | 66 +
.../pmu-events/arch/x86/grandridge/other.json | 16 +
.../arch/x86/grandridge/pipeline.json | 353 ++
.../arch/x86/grandridge/uncore-cache.json | 1795 +++++++++++
.../x86/grandridge/uncore-interconnect.json | 175 +
.../arch/x86/grandridge/uncore-io.json | 1187 +++++++
.../arch/x86/grandridge/uncore-memory.json | 385 +++
.../arch/x86/grandridge/uncore-power.json | 10 +
.../arch/x86/grandridge/virtual-memory.json | 113 +-
.../arch/x86/haswell/hsw-metrics.json | 178 +-
.../pmu-events/arch/x86/haswell/memory.json | 2 +-
.../arch/x86/haswell/metricgroups.json | 7 +-
.../arch/x86/haswellx/hsx-metrics.json | 224 +-
.../arch/x86/haswellx/metricgroups.json | 7 +-
.../arch/x86/icelake/icl-metrics.json | 398 ++-
.../pmu-events/arch/x86/icelake/memory.json | 1 +
.../arch/x86/icelake/metricgroups.json | 12 +-
.../pmu-events/arch/x86/icelake/other.json | 2 +-
.../pmu-events/arch/x86/icelake/pipeline.json | 10 +-
.../arch/x86/icelakex/icx-metrics.json | 586 +++-
.../arch/x86/icelakex/metricgroups.json | 12 +-
.../arch/x86/ivybridge/ivb-metrics.json | 197 +-
.../arch/x86/ivybridge/metricgroups.json | 7 +-
.../arch/x86/ivytown/ivt-metrics.json | 200 +-
.../arch/x86/ivytown/metricgroups.json | 7 +-
.../arch/x86/jaketown/jkt-metrics.json | 64 +-
.../arch/x86/jaketown/metricgroups.json | 7 +-
tools/perf/pmu-events/arch/x86/mapfile.csv | 24 +-
.../pmu-events/arch/x86/meteorlake/cache.json | 8 +-
.../arch/x86/meteorlake/floating-point.json | 86 +-
.../pmu-events/arch/x86/meteorlake/other.json | 10 +
.../arch/x86/meteorlake/pipeline.json | 76 +
.../arch/x86/meteorlake/virtual-memory.json | 36 +
.../arch/x86/rocketlake/memory.json | 1 +
.../arch/x86/rocketlake/metricgroups.json | 12 +-
.../pmu-events/arch/x86/rocketlake/other.json | 2 +-
.../arch/x86/rocketlake/pipeline.json | 10 +-
.../arch/x86/rocketlake/rkl-metrics.json | 406 ++-
.../arch/x86/sandybridge/metricgroups.json | 7 +-
.../arch/x86/sandybridge/snb-metrics.json | 71 +-
.../arch/x86/sapphirerapids/metricgroups.json | 12 +-
.../arch/x86/sapphirerapids/spr-metrics.json | 773 +++--
.../arch/x86/sierraforest/cache.json | 185 ++
.../arch/x86/sierraforest/floating-point.json | 68 +
.../arch/x86/sierraforest/frontend.json | 16 +
.../arch/x86/sierraforest/memory.json | 66 +
.../arch/x86/sierraforest/other.json | 16 +
.../arch/x86/sierraforest/pipeline.json | 360 +++
.../arch/x86/sierraforest/uncore-cache.json | 2853 +++++++++++++++++
.../arch/x86/sierraforest/uncore-cxl.json | 10 +
.../x86/sierraforest/uncore-interconnect.json | 1228 +++++++
.../arch/x86/sierraforest/uncore-io.json | 1634 ++++++++++
.../arch/x86/sierraforest/uncore-memory.json | 385 +++
.../arch/x86/sierraforest/uncore-power.json | 10 +
.../arch/x86/sierraforest/virtual-memory.json | 113 +-
.../pmu-events/arch/x86/skylake/memory.json | 2 +-
.../arch/x86/skylake/metricgroups.json | 12 +-
.../pmu-events/arch/x86/skylake/pipeline.json | 2 +-
.../arch/x86/skylake/skl-metrics.json | 395 ++-
.../arch/x86/skylake/virtual-memory.json | 2 +-
.../arch/x86/skylakex/metricgroups.json | 12 +-
.../arch/x86/skylakex/skx-metrics.json | 548 +++-
.../arch/x86/tigerlake/metricgroups.json | 12 +-
.../pmu-events/arch/x86/tigerlake/other.json | 2 +-
.../arch/x86/tigerlake/pipeline.json | 10 +-
.../arch/x86/tigerlake/tgl-metrics.json | 406 ++-
.../x86/tigerlake/uncore-interconnect.json | 2 +
87 files changed, 15763 insertions(+), 2349 deletions(-)
create mode 100644 tools/perf/pmu-events/arch/x86/grandridge/floating-point.json
create mode 100644 tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json
create mode 100644 tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json
create mode 100644 tools/perf/pmu-events/arch/x86/grandridge/uncore-io.json
create mode 100644 tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json
create mode 100644 tools/perf/pmu-events/arch/x86/grandridge/uncore-power.json
create mode 100644 tools/perf/pmu-events/arch/x86/sierraforest/floating-point.json
create mode 100644 tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json
create mode 100644 tools/perf/pmu-events/arch/x86/sierraforest/uncore-cxl.json
create mode 100644 tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnect.json
create mode 100644 tools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json
create mode 100644 tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json
create mode 100644 tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json

--
2.43.0.687.g38aa6559b0-goog



2024-02-14 01:19:02

by Ian Rogers

Subject: [PATCH v1 01/30] perf vendor events intel: Update alderlake events to v1.24

Update alderlake events to v1.24 released in:
https://github.com/intel/perfmon/commit/e627dd8d89e2d2110f1d499608dd6f37aae37a8c

Adds aliased events, improves documentation and fixes some event fields.

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
.../arch/x86/alderlake/floating-point.json | 30 +++++++++++++++++--
.../pmu-events/arch/x86/alderlake/other.json | 10 +++++++
.../arch/x86/alderlake/pipeline.json | 13 ++++++++
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
4 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
index c8ba96c4a7f8..cd291943dc08 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
@@ -26,7 +26,7 @@
"Unit": "cpu_core"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_0",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_0 [This event is alias to FP_ARITH_DISPATCHED.V0]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_0",
"SampleAfterValue": "2000003",
@@ -34,7 +34,7 @@
"Unit": "cpu_core"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_1",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_1 [This event is alias to FP_ARITH_DISPATCHED.V1]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_1",
"SampleAfterValue": "2000003",
@@ -42,13 +42,37 @@
"Unit": "cpu_core"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_5",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_5 [This event is alias to FP_ARITH_DISPATCHED.V2]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_5",
"SampleAfterValue": "2000003",
"UMask": "0x4",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V0 [This event is alias to FP_ARITH_DISPATCHED.PORT_0]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V0",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V1 [This event is alias to FP_ARITH_DISPATCHED.PORT_1]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V1",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V2 [This event is alias to FP_ARITH_DISPATCHED.PORT_5]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V2",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
"EventCode": "0xc7",
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/other.json b/tools/perf/pmu-events/arch/x86/alderlake/other.json
index 1db73e020215..5250a17d9cae 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/other.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/other.json
@@ -39,6 +39,16 @@
"UMask": "0x8",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "This event is deprecated. [This event is alias to MISC_RETIRED.LBR_INSERTS]",
+ "Deprecated": "1",
+ "EventCode": "0xe4",
+ "EventName": "LBR_INSERTS.ANY",
+ "PEBS": "1",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1",
+ "Unit": "cpu_atom"
+ },
{
"BriefDescription": "Counts modified writebacks from L1 cache and L2 cache that have any type of response.",
"EventCode": "0xB7",
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json
index f9876bef16da..df6032e816d4 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json
@@ -799,6 +799,7 @@
"BriefDescription": "INST_RETIRED.MACRO_FUSED",
"EventCode": "0xc0",
"EventName": "INST_RETIRED.MACRO_FUSED",
+ "PEBS": "1",
"SampleAfterValue": "2000003",
"UMask": "0x10",
"Unit": "cpu_core"
@@ -807,6 +808,7 @@
"BriefDescription": "Retired NOP instructions.",
"EventCode": "0xc0",
"EventName": "INST_RETIRED.NOP",
+ "PEBS": "1",
"PublicDescription": "Counts all retired NOP or ENDBR32/64 instructions",
"SampleAfterValue": "2000003",
"UMask": "0x2",
@@ -825,6 +827,7 @@
"BriefDescription": "Iterations of Repeat string retired instructions.",
"EventCode": "0xc0",
"EventName": "INST_RETIRED.REP_ITERATION",
+ "PEBS": "1",
"PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.",
"SampleAfterValue": "2000003",
"UMask": "0x8",
@@ -1106,6 +1109,16 @@
"UMask": "0x20",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Counts the number of LBR entries recorded. Requires LBRs to be enabled in IA32_LBR_CTL. [This event is alias to LBR_INSERTS.ANY]",
+ "EventCode": "0xe4",
+ "EventName": "MISC_RETIRED.LBR_INSERTS",
+ "PEBS": "1",
+ "PublicDescription": "Counts the number of LBR entries recorded. Requires LBRs to be enabled in IA32_LBR_CTL. This event is PDIR on GP0 and NPEBS on all other GPs [This event is alias to LBR_INSERTS.ANY]",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1",
+ "Unit": "cpu_atom"
+ },
{
"BriefDescription": "Increments whenever there is an update to the LBR array.",
"EventCode": "0xcc",
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 4d1deed4437a..b4adaa1b5e9e 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -1,5 +1,5 @@
Family-model,Version,Filename,EventType
-GenuineIntel-6-(97|9A|B7|BA|BF),v1.23,alderlake,core
+GenuineIntel-6-(97|9A|B7|BA|BF),v1.24,alderlake,core
GenuineIntel-6-BE,v1.23,alderlaken,core
GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
GenuineIntel-6-(3D|47),v28,broadwell,core
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:19:17

by Ian Rogers

Subject: [PATCH v1 02/30] perf vendor events intel: Update alderlaken events to v1.24

Update alderlaken events to v1.24 released in:
https://github.com/intel/perfmon/commit/e627dd8d89e2d2110f1d499608dd6f37aae37a8c

Adds the LBR_INSERTS.ANY/MISC_RETIRED.LBR_INSERTS event.

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/alderlaken/other.json | 9 +++++++++
tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json | 9 +++++++++
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/other.json b/tools/perf/pmu-events/arch/x86/alderlaken/other.json
index 6336de61f628..ccc892149dbe 100644
--- a/tools/perf/pmu-events/arch/x86/alderlaken/other.json
+++ b/tools/perf/pmu-events/arch/x86/alderlaken/other.json
@@ -1,4 +1,13 @@
[
+ {
+ "BriefDescription": "This event is deprecated. [This event is alias to MISC_RETIRED.LBR_INSERTS]",
+ "Deprecated": "1",
+ "EventCode": "0xe4",
+ "EventName": "LBR_INSERTS.ANY",
+ "PEBS": "1",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1"
+ },
{
"BriefDescription": "Counts modified writebacks from L1 cache and L2 cache that have any type of response.",
"EventCode": "0xB7",
diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json b/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json
index 3153bab527a9..846bcdafca6d 100644
--- a/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json
@@ -344,6 +344,15 @@
"SampleAfterValue": "20003",
"UMask": "0x1"
},
+ {
+ "BriefDescription": "Counts the number of LBR entries recorded. Requires LBRs to be enabled in IA32_LBR_CTL. [This event is alias to LBR_INSERTS.ANY]",
+ "EventCode": "0xe4",
+ "EventName": "MISC_RETIRED.LBR_INSERTS",
+ "PEBS": "1",
+ "PublicDescription": "Counts the number of LBR entries recorded. Requires LBRs to be enabled in IA32_LBR_CTL. This event is PDIR on GP0 and NPEBS on all other GPs [This event is alias to LBR_INSERTS.ANY]",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1"
+ },
{
"BriefDescription": "Counts the number of issue slots not consumed by the backend due to a micro-sequencer (MS) scoreboard, which stalls the front-end from issuing from the UROM until a specified older uop retires.",
"EventCode": "0x75",
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index b4adaa1b5e9e..5bda5d498841 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -1,6 +1,6 @@
Family-model,Version,Filename,EventType
GenuineIntel-6-(97|9A|B7|BA|BF),v1.24,alderlake,core
-GenuineIntel-6-BE,v1.23,alderlaken,core
+GenuineIntel-6-BE,v1.24,alderlaken,core
GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
GenuineIntel-6-(3D|47),v28,broadwell,core
GenuineIntel-6-56,v11,broadwellde,core
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:20:13

by Ian Rogers

Subject: [PATCH v1 06/30] perf vendor events intel: Update haswell events to v35

Update haswell events to v35 released in:
https://github.com/intel/perfmon/commit/c0f9b34d421941bc3e13c6ca5554e6a54e8bd574

Updates "must be precise" on RTM_RETIRED.ABORTED.

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/haswell/memory.json | 2 +-
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/haswell/memory.json b/tools/perf/pmu-events/arch/x86/haswell/memory.json
index 2fc25e22a42a..6ba0ea6e3fa6 100644
--- a/tools/perf/pmu-events/arch/x86/haswell/memory.json
+++ b/tools/perf/pmu-events/arch/x86/haswell/memory.json
@@ -371,7 +371,7 @@
"BriefDescription": "Number of times an RTM execution aborted due to any reasons (multiple categories may count as one).",
"EventCode": "0xc9",
"EventName": "RTM_RETIRED.ABORTED",
- "PEBS": "1",
+ "PEBS": "2",
"SampleAfterValue": "2000003",
"UMask": "0x4"
},
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 2e72e81669d5..932d7c094e41 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -12,7 +12,7 @@ GenuineIntel-6-5[CF],v13,goldmont,core
GenuineIntel-6-7A,v1.01,goldmontplus,core
GenuineIntel-6-B6,v1.01,grandridge,core
GenuineIntel-6-A[DE],v1.01,graniterapids,core
-GenuineIntel-6-(3C|45|46),v33,haswell,core
+GenuineIntel-6-(3C|45|46),v35,haswell,core
GenuineIntel-6-3F,v28,haswellx,core
GenuineIntel-6-7[DE],v1.19,icelake,core
GenuineIntel-6-6[AC],v1.23,icelakex,core
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:21:32

by Ian Rogers

Subject: [PATCH v1 03/30] perf vendor events intel: Update broadwell events to v29

Update broadwell events to v29 released in:
https://github.com/intel/perfmon/commit/47117146c6b9e38811618beca31eba4e41c3d874

Updates "must be precise" on RTM_RETIRED.ABORTED.

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/broadwell/memory.json | 2 +-
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/broadwell/memory.json b/tools/perf/pmu-events/arch/x86/broadwell/memory.json
index ac7cdb831960..b01ed47072bc 100644
--- a/tools/perf/pmu-events/arch/x86/broadwell/memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/memory.json
@@ -2005,7 +2005,7 @@
"BriefDescription": "Number of times RTM abort was triggered",
"EventCode": "0xc9",
"EventName": "RTM_RETIRED.ABORTED",
- "PEBS": "1",
+ "PEBS": "2",
"PublicDescription": "Number of times RTM abort was triggered .",
"SampleAfterValue": "2000003",
"UMask": "0x4"
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 5bda5d498841..7b30c0eb036a 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -2,7 +2,7 @@ Family-model,Version,Filename,EventType
GenuineIntel-6-(97|9A|B7|BA|BF),v1.24,alderlake,core
GenuineIntel-6-BE,v1.24,alderlaken,core
GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
-GenuineIntel-6-(3D|47),v28,broadwell,core
+GenuineIntel-6-(3D|47),v29,broadwell,core
GenuineIntel-6-56,v11,broadwellde,core
GenuineIntel-6-4F,v22,broadwellx,core
GenuineIntel-6-55-[56789ABCDEF],v1.20,cascadelakex,core
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:22:54

by Ian Rogers

[permalink] [raw]
Subject: [PATCH v1 12/30] perf vendor events intel: Update tigerlake events to v1.15

Update tigerlake events to v1.15 released in:
https://github.com/intel/perfmon/commit/282a6951fd9f025cff6c8c0ea16b1fcec786a4cd

Documentation fixes, removal of TOPDOWN.BR_MISPREDICT_SLOTS, and
deprecation of UNC_ARB_DAT_REQUESTS.RD and UNC_ARB_IFA_OCCUPANCY.ALL.

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
tools/perf/pmu-events/arch/x86/tigerlake/other.json | 2 +-
tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json | 10 +---------
.../arch/x86/tigerlake/uncore-interconnect.json | 2 ++
4 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 96b86d3b60ce..5297d25f4e03 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -32,7 +32,7 @@ GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core
GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v58,skylake,core
GenuineIntel-6-55-[01234],v1.32,skylakex,core
GenuineIntel-6-86,v1.21,snowridgex,core
-GenuineIntel-6-8[CD],v1.13,tigerlake,core
+GenuineIntel-6-8[CD],v1.15,tigerlake,core
GenuineIntel-6-2C,v5,westmereep-dp,core
GenuineIntel-6-25,v4,westmereep-sp,core
GenuineIntel-6-2F,v4,westmereex,core
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/other.json b/tools/perf/pmu-events/arch/x86/tigerlake/other.json
index 55f3048bcfa6..117b18abcaaf 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/other.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/other.json
@@ -19,7 +19,7 @@
"BriefDescription": "Core cycles where the core was running in a manner where Turbo may be clipped to the AVX512 turbo schedule.",
"EventCode": "0x28",
"EventName": "CORE_POWER.LVL2_TURBO_LICENSE",
- "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchtecture). This includes high current AVX 512-bit instructions.",
+ "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchitecture). This includes high current AVX 512-bit instructions.",
"SampleAfterValue": "200003",
"UMask": "0x20"
},
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json b/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json
index 541bf1dd1679..4f85d53edec2 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json
@@ -537,7 +537,7 @@
"BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread",
"EventCode": "0x5e",
"EventName": "RS_EVENTS.EMPTY_CYCLES",
- "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into stravation periods (e.g. branch mispredictions or i-cache misses)",
+ "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into starvation periods (e.g. branch mispredictions or i-cache misses)",
"SampleAfterValue": "1000003",
"UMask": "0x1"
},
@@ -560,14 +560,6 @@
"SampleAfterValue": "10000003",
"UMask": "0x2"
},
- {
- "BriefDescription": "TMA slots wasted due to incorrect speculation by branch mispredictions",
- "EventCode": "0xa4",
- "EventName": "TOPDOWN.BR_MISPREDICT_SLOTS",
- "PublicDescription": "Number of TMA slots that were wasted due to incorrect speculation by branch mispredictions. This event estimates number of operations that were issued but not retired from the speculative path as well as the out-of-order engine recovery past a branch misprediction.",
- "SampleAfterValue": "10000003",
- "UMask": "0x8"
- },
{
"BriefDescription": "TMA slots available for an unhalted logical processor. Fixed counter - architectural event",
"EventName": "TOPDOWN.SLOTS",
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json
index eed1b90a2779..48f23acc76c0 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json
@@ -25,6 +25,7 @@
},
{
"BriefDescription": "This event is deprecated. Refer to new event UNC_ARB_REQ_TRK_REQUEST.DRD",
+ "Deprecated": "1",
"EventCode": "0x81",
"EventName": "UNC_ARB_DAT_REQUESTS.RD",
"PerPkg": "1",
@@ -33,6 +34,7 @@
},
{
"BriefDescription": "This event is deprecated. Refer to new event UNC_ARB_DAT_OCCUPANCY.ALL",
+ "Deprecated": "1",
"EventCode": "0x85",
"EventName": "UNC_ARB_IFA_OCCUPANCY.ALL",
"PerPkg": "1",
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:24:54

by Ian Rogers

Subject: [PATCH v1 24/30] perf vendor events intel: Update jaketown TMA metrics to 4.7

Top-Down Microarchitecture Analysis (TMA) metrics simplify
cycle-accounting using microarchitecture-abstracted metrics
organized in one hierarchy. This update is from version 4.5 to
4.7.

The update includes:

- Swapped tma_info_core_ilp (becomes per SMT thread) and
tma_info_pipeline_execute (per physical core).
- Tuned thresholds for tma_fetch_bandwidth.

The update came from:

https://github.com/intel/perfmon/pull/140
https://github.com/intel/perfmon/pull/138

Running the script:

https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
.../arch/x86/jaketown/jkt-metrics.json | 64 ++++++++++++++-----
.../arch/x86/jaketown/metricgroups.json | 7 +-
2 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json b/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json
index 35b1a3aa728d..fc8c3f785be1 100644
--- a/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json
@@ -163,7 +163,7 @@
"MetricExpr": "tma_frontend_bound - tma_fetch_latency",
"MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
"MetricName": "tma_fetch_bandwidth",
- "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 4 > 0.35",
+ "MetricThreshold": "tma_fetch_bandwidth > 0.2",
"MetricgroupNoGroup": "TopdownL2",
"PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_frontend_dsb_coverage, tma_lcp",
"ScaleUnit": "100%"
@@ -193,7 +193,7 @@
"MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
"MetricName": "tma_fp_scalar",
"MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
- "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+ "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
"ScaleUnit": "100%"
},
{
@@ -202,7 +202,25 @@
"MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
"MetricName": "tma_fp_vector",
"MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
- "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+ "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors",
+ "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE) / UOPS_DISPATCHED.THREAD",
+ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+ "MetricName": "tma_fp_vector_128b",
+ "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+ "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors",
+ "MetricExpr": "(SIMD_FP_256.PACKED_DOUBLE + SIMD_FP_256.PACKED_SINGLE) / UOPS_DISPATCHED.THREAD",
+ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+ "MetricName": "tma_fp_vector_256b",
+ "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+ "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
"ScaleUnit": "100%"
},
{
@@ -222,7 +240,7 @@
"MetricName": "tma_heavy_operations",
"MetricThreshold": "tma_heavy_operations > 0.1",
"MetricgroupNoGroup": "TopdownL2",
- "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.",
+ "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
"ScaleUnit": "100%"
},
{
@@ -244,7 +262,7 @@
"MetricName": "tma_info_core_flopc"
},
{
- "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core",
+ "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
"MetricExpr": "UOPS_DISPATCHED.THREAD / (cpu@UOPS_DISPATCHED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_DISPATCHED.CORE\\,cmask\\=1@)",
"MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
"MetricName": "tma_info_core_ilp"
@@ -271,21 +289,27 @@
"MetricName": "tma_info_pipeline_retire"
},
{
- "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]",
+ "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
"MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
"MetricGroup": "Power;Summary",
- "MetricName": "tma_info_system_average_frequency"
+ "MetricName": "tma_info_system_core_frequency"
},
{
- "BriefDescription": "Average CPU Utilization",
+ "BriefDescription": "Average CPU Utilization (percentage)",
"MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC",
"MetricGroup": "HPC;Summary",
"MetricName": "tma_info_system_cpu_utilization"
},
+ {
+ "BriefDescription": "Average number of utilized CPUs",
+ "MetricExpr": "#num_cpus_online * tma_info_system_cpu_utilization",
+ "MetricGroup": "Summary",
+ "MetricName": "tma_info_system_cpus_utilized"
+ },
{
"BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
"MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time",
- "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW",
+ "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
"MetricName": "tma_info_system_dram_bw_use",
"PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_mem_bandwidth"
},
@@ -294,7 +318,7 @@
"MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time",
"MetricGroup": "Cor;Flops;HPC",
"MetricName": "tma_info_system_gflops",
- "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine."
+ "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
},
{
"BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
@@ -348,6 +372,12 @@
"MetricGroup": "Power",
"MetricName": "tma_info_system_turbo_utilization"
},
+ {
+ "BriefDescription": "Measured Average Uncore Frequency for the SoC [GHz]",
+ "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time",
+ "MetricGroup": "SoC",
+ "MetricName": "tma_info_system_uncore_frequency"
+ },
{
"BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
"MetricExpr": "CPU_CLK_UNHALTED.THREAD",
@@ -389,7 +419,7 @@
{
"BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
"MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_thread_clks",
- "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+ "MetricGroup": "BigFootprint;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
"MetricName": "tma_itlb_misses",
"MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
"PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED",
@@ -399,7 +429,7 @@
"BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
"MetricConstraint": "NO_GROUP_EVENTS_SMT",
"MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_thread_clks",
- "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+ "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
"MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
"PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS",
@@ -421,7 +451,7 @@
"MetricName": "tma_light_operations",
"MetricThreshold": "tma_light_operations > 0.6",
"MetricgroupNoGroup": "TopdownL2",
- "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
+ "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
"ScaleUnit": "100%"
},
{
@@ -436,21 +466,21 @@
"ScaleUnit": "100%"
},
{
- "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)",
+ "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
"MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_thread_clks",
"MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
"MetricName": "tma_mem_bandwidth",
"MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
- "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_info_system_dram_bw_use",
+ "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_info_system_dram_bw_use",
"ScaleUnit": "100%"
},
{
- "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)",
+ "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM)",
"MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
"MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
"MetricName": "tma_mem_latency",
"MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
- "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: ",
+ "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: ",
"ScaleUnit": "100%"
},
{
diff --git a/tools/perf/pmu-events/arch/x86/jaketown/metricgroups.json b/tools/perf/pmu-events/arch/x86/jaketown/metricgroups.json
index bebb85945d62..a2c27794c0d8 100644
--- a/tools/perf/pmu-events/arch/x86/jaketown/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/jaketown/metricgroups.json
@@ -2,10 +2,10 @@
"Backend": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Bad": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"BadSpec": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
- "BigFoot": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "BigFootprint": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"BrMispredicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Branches": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
- "CacheMisses": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "CacheHits": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Compute": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -23,7 +23,9 @@
"L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"MemoryBound": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -88,6 +90,7 @@
"tma_issueTLB": "Metrics related by the issue $issueTLB",
"tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
"tma_light_operations_group": "Metrics contributing to tma_light_operations category",
+ "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
"tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
"tma_memory_bound_group": "Metrics contributing to tma_memory_bound category",
"tma_microcode_sequencer_group": "Metrics contributing to tma_microcode_sequencer category",
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:26:53

by Ian Rogers

Subject: [PATCH v1 04/30] perf vendor events intel: Update emeraldrapids events to v1.03

Update emeraldrapids events to v1.03 released in:
https://github.com/intel/perfmon/commit/c7c6f72dae07fee35d5982232829c0cd37f9e28e

Adds uncore CHA events.

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
.../arch/x86/emeraldrapids/uncore-cache.json | 152 ++++++++++++++++++
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
2 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-cache.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-cache.json
index bf5a511b99d1..86a8f3b7fe1d 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-cache.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-cache.json
@@ -3564,6 +3564,15 @@
"UMask": "0x10c8968201",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_CXL_EXP_LOCAL",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "PortMask": "0x000",
+ "UMask": "0x20c8968201",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Inserts : DRd_Prefs issued by iA Cores targeting DDR Mem that Missed the LLC",
"EventCode": "0x35",
@@ -3715,6 +3724,15 @@
"UMask": "0x10ccd68201",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA_CXL_EXP_LOCAL",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "PortMask": "0x000",
+ "UMask": "0x20ccd68201",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Inserts; LLCPrefRFO misses from local IA",
"EventCode": "0x35",
@@ -3741,6 +3759,15 @@
"UMask": "0x10c8868201",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFRFO_CXL_EXP_LOCAL",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFRFO_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "PortMask": "0x000",
+ "UMask": "0x20c8868201",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Inserts : WCiLFs issued by iA Cores targeting DDR that missed the LLC - HOMed locally",
"EventCode": "0x35",
@@ -3891,6 +3918,15 @@
"UMask": "0x10ccc68201",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_CXL_EXP_LOCAL",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "PortMask": "0x000",
+ "UMask": "0x20ccc68201",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Inserts; RFO prefetch misses from local IA",
"EventCode": "0x35",
@@ -4428,6 +4464,46 @@
"UMask": "0x40",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "TOR Inserts for INVXTOM opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.RRQ_MISS_INVXTOM_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e87e8240",
+ "Unit": "CHA"
+ },
+ {
+ "BriefDescription": "TOR Inserts for RDCODE opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.RRQ_MISS_RDCODE_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e80e8240",
+ "Unit": "CHA"
+ },
+ {
+ "BriefDescription": "TOR Inserts for RDCUR opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.RRQ_MISS_RDCUR_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e8068240",
+ "Unit": "CHA"
+ },
+ {
+ "BriefDescription": "TOR Inserts for RDDATA opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.RRQ_MISS_RDDATA_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e8168240",
+ "Unit": "CHA"
+ },
+ {
+ "BriefDescription": "TOR Inserts for RDINVOWN_OPT opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x35",
+ "EventName": "UNC_CHA_TOR_INSERTS.RRQ_MISS_RDINVOWN_OPT_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e8268240",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Inserts; All Snoops from Remote",
"EventCode": "0x35",
@@ -5011,6 +5087,15 @@
"UMask": "0x10c8968201",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PREF_CXL_EXP_LOCAL",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PREF_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "PortMask": "0x000",
+ "UMask": "0x20c8968201",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Occupancy : DRd_Prefs issued by iA Cores targeting DDR Mem that Missed the LLC",
"EventCode": "0x36",
@@ -5162,6 +5247,15 @@
"UMask": "0x10ccd68201",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LLCPREFDATA_CXL_EXP_LOCAL",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LLCPREFDATA_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "PortMask": "0x000",
+ "UMask": "0x20ccd68201",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Occupancy; LLCPrefRFO misses from local IA",
"EventCode": "0x36",
@@ -5188,6 +5282,15 @@
"UMask": "0x10c8868201",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LLCPREFRFO_CXL_EXP_LOCAL",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LLCPREFRFO_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "PortMask": "0x000",
+ "UMask": "0x20c8868201",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Occupancy : WCiLFs issued by iA Cores targeting DDR that missed the LLC - HOMed locally",
"EventCode": "0x36",
@@ -5338,6 +5441,15 @@
"UMask": "0x10ccc68201",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO_PREF_CXL_EXP_LOCAL",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO_PREF_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "PortMask": "0x000",
+ "UMask": "0x20ccc68201",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Occupancy; RFO prefetch misses from local IA",
"EventCode": "0x36",
@@ -5841,6 +5953,46 @@
"UMask": "0x40",
"Unit": "CHA"
},
+ {
+ "BriefDescription": "TOR Occupancy for INVXTOM opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.RRQ_MISS_INVXTOM_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e87e8240",
+ "Unit": "CHA"
+ },
+ {
+ "BriefDescription": "TOR Occupancy for RDCODE opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.RRQ_MISS_RDCODE_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e80e8240",
+ "Unit": "CHA"
+ },
+ {
+ "BriefDescription": "TOR Occupancy for RDCUR opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.RRQ_MISS_RDCUR_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e8068240",
+ "Unit": "CHA"
+ },
+ {
+ "BriefDescription": "TOR Occupancy for RDDATA opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.RRQ_MISS_RDDATA_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e8168240",
+ "Unit": "CHA"
+ },
+ {
+ "BriefDescription": "TOR Occupancy for RDINVOWN_OPT opcodes received from a remote socket which miss the L3 and target memory in a CXL type 3 memory expander local to this socket.",
+ "EventCode": "0x36",
+ "EventName": "UNC_CHA_TOR_OCCUPANCY.RRQ_MISS_RDINVOWN_OPT_CXL_EXP_LOCAL",
+ "PerPkg": "1",
+ "UMask": "0x20e8268240",
+ "Unit": "CHA"
+ },
{
"BriefDescription": "TOR Occupancy; All Snoops from Remote",
"EventCode": "0x36",
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 7b30c0eb036a..42cbbe35b52e 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -7,7 +7,7 @@ GenuineIntel-6-56,v11,broadwellde,core
GenuineIntel-6-4F,v22,broadwellx,core
GenuineIntel-6-55-[56789ABCDEF],v1.20,cascadelakex,core
GenuineIntel-6-9[6C],v1.04,elkhartlake,core
-GenuineIntel-6-CF,v1.02,emeraldrapids,core
+GenuineIntel-6-CF,v1.03,emeraldrapids,core
GenuineIntel-6-5[CF],v13,goldmont,core
GenuineIntel-6-7A,v1.01,goldmontplus,core
GenuineIntel-6-B6,v1.00,grandridge,core
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:27:31

by Ian Rogers

Subject: [PATCH v1 07/30] perf vendor events intel: Update icelake events to v1.21

Update icelake events to v1.21 released in:
https://github.com/intel/perfmon/commit/54f1246b0496112c1d2b2a49e4859c85caa3dbf4

Improves descriptions, removes TOPDOWN.BR_MISPREDICT_SLOTS.

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/icelake/memory.json | 1 +
tools/perf/pmu-events/arch/x86/icelake/other.json | 2 +-
tools/perf/pmu-events/arch/x86/icelake/pipeline.json | 10 +---------
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/icelake/memory.json b/tools/perf/pmu-events/arch/x86/icelake/memory.json
index e8d2ec1c029b..f84763220549 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/memory.json
@@ -259,6 +259,7 @@
"BriefDescription": "Number of times an RTM execution aborted.",
"EventCode": "0xc9",
"EventName": "RTM_RETIRED.ABORTED",
+ "PEBS": "1",
"PublicDescription": "Counts the number of times RTM abort was triggered.",
"SampleAfterValue": "100003",
"UMask": "0x4"
diff --git a/tools/perf/pmu-events/arch/x86/icelake/other.json b/tools/perf/pmu-events/arch/x86/icelake/other.json
index cfb590632918..4fdc87339555 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/other.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/other.json
@@ -19,7 +19,7 @@
"BriefDescription": "Core cycles where the core was running in a manner where Turbo may be clipped to the AVX512 turbo schedule.",
"EventCode": "0x28",
"EventName": "CORE_POWER.LVL2_TURBO_LICENSE",
- "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchtecture). This includes high current AVX 512-bit instructions.",
+ "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchitecture). This includes high current AVX 512-bit instructions.",
"SampleAfterValue": "200003",
"UMask": "0x20"
},
diff --git a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json
index 375b78044f14..c7313fd4fdf4 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json
@@ -529,7 +529,7 @@
"BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread",
"EventCode": "0x5e",
"EventName": "RS_EVENTS.EMPTY_CYCLES",
- "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into stravation periods (e.g. branch mispredictions or i-cache misses)",
+ "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into starvation periods (e.g. branch mispredictions or i-cache misses)",
"SampleAfterValue": "1000003",
"UMask": "0x1"
},
@@ -552,14 +552,6 @@
"SampleAfterValue": "10000003",
"UMask": "0x2"
},
- {
- "BriefDescription": "TMA slots wasted due to incorrect speculation by branch mispredictions",
- "EventCode": "0xa4",
- "EventName": "TOPDOWN.BR_MISPREDICT_SLOTS",
- "PublicDescription": "Number of TMA slots that were wasted due to incorrect speculation by branch mispredictions. This event estimates number of operations that were issued but not retired from the speculative path as well as the out-of-order engine recovery past a branch misprediction.",
- "SampleAfterValue": "10000003",
- "UMask": "0x8"
- },
{
"BriefDescription": "TMA slots available for an unhalted logical processor. Fixed counter - architectural event",
"EventName": "TOPDOWN.SLOTS",
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 932d7c094e41..953e13a136a4 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -14,7 +14,7 @@ GenuineIntel-6-B6,v1.01,grandridge,core
GenuineIntel-6-A[DE],v1.01,graniterapids,core
GenuineIntel-6-(3C|45|46),v35,haswell,core
GenuineIntel-6-3F,v28,haswellx,core
-GenuineIntel-6-7[DE],v1.19,icelake,core
+GenuineIntel-6-7[DE],v1.21,icelake,core
GenuineIntel-6-6[AC],v1.23,icelakex,core
GenuineIntel-6-3A,v24,ivybridge,core
GenuineIntel-6-3E,v24,ivytown,core
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:27:58

by Ian Rogers

[permalink] [raw]
Subject: [PATCH v1 09/30] perf vendor events intel: Update rocketlake events to v1.02

Update rocketlake events to v1.02 released in:
https://github.com/intel/perfmon/commit/4931178d1ede1099a3e4ac7e04ed9f073e03d219

Improves documentation, adds a PEBS field to RTM_RETIRED.ABORTED and
removes the TOPDOWN.BR_MISPREDICT_SLOTS event.

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
tools/perf/pmu-events/arch/x86/rocketlake/memory.json | 1 +
tools/perf/pmu-events/arch/x86/rocketlake/other.json | 2 +-
.../perf/pmu-events/arch/x86/rocketlake/pipeline.json | 10 +---------
4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 09145aaa0d8e..be9f342b3206 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -24,7 +24,7 @@ GenuineIntel-6-BD,v1.00,lunarlake,core
GenuineIntel-6-A[AC],v1.07,meteorlake,core
GenuineIntel-6-1[AEF],v4,nehalemep,core
GenuineIntel-6-2E,v4,nehalemex,core
-GenuineIntel-6-A7,v1.01,rocketlake,core
+GenuineIntel-6-A7,v1.02,rocketlake,core
GenuineIntel-6-2A,v19,sandybridge,core
GenuineIntel-6-8F,v1.17,sapphirerapids,core
GenuineIntel-6-AF,v1.00,sierraforest,core
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/memory.json b/tools/perf/pmu-events/arch/x86/rocketlake/memory.json
index e8d2ec1c029b..f84763220549 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/memory.json
@@ -259,6 +259,7 @@
"BriefDescription": "Number of times an RTM execution aborted.",
"EventCode": "0xc9",
"EventName": "RTM_RETIRED.ABORTED",
+ "PEBS": "1",
"PublicDescription": "Counts the number of times RTM abort was triggered.",
"SampleAfterValue": "100003",
"UMask": "0x4"
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/other.json b/tools/perf/pmu-events/arch/x86/rocketlake/other.json
index cfb590632918..4fdc87339555 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/other.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/other.json
@@ -19,7 +19,7 @@
"BriefDescription": "Core cycles where the core was running in a manner where Turbo may be clipped to the AVX512 turbo schedule.",
"EventCode": "0x28",
"EventName": "CORE_POWER.LVL2_TURBO_LICENSE",
- "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchtecture). This includes high current AVX 512-bit instructions.",
+ "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchitecture). This includes high current AVX 512-bit instructions.",
"SampleAfterValue": "200003",
"UMask": "0x20"
},
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json b/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json
index 375b78044f14..c7313fd4fdf4 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json
@@ -529,7 +529,7 @@
"BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread",
"EventCode": "0x5e",
"EventName": "RS_EVENTS.EMPTY_CYCLES",
- "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into stravation periods (e.g. branch mispredictions or i-cache misses)",
+ "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into starvation periods (e.g. branch mispredictions or i-cache misses)",
"SampleAfterValue": "1000003",
"UMask": "0x1"
},
@@ -552,14 +552,6 @@
"SampleAfterValue": "10000003",
"UMask": "0x2"
},
- {
- "BriefDescription": "TMA slots wasted due to incorrect speculation by branch mispredictions",
- "EventCode": "0xa4",
- "EventName": "TOPDOWN.BR_MISPREDICT_SLOTS",
- "PublicDescription": "Number of TMA slots that were wasted due to incorrect speculation by branch mispredictions. This event estimates number of operations that were issued but not retired from the speculative path as well as the out-of-order engine recovery past a branch misprediction.",
- "SampleAfterValue": "10000003",
- "UMask": "0x8"
- },
{
"BriefDescription": "TMA slots available for an unhalted logical processor. Fixed counter - architectural event",
"EventName": "TOPDOWN.SLOTS",
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:28:26

by Ian Rogers

[permalink] [raw]
Subject: [PATCH v1 11/30] perf vendor events intel: Update skylake events to v58

Update skylake events to v58 released in:
https://github.com/intel/perfmon/commit/625fb7507373fef8297052c5f9af9ffe78d460c0

Improves documentation and updates the PEBS field of
RTM_RETIRED.ABORTED and INST_RETIRED.NOP from "1" to "2".

Event json automatically generated by:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py
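
For reference, the PEBS field updates can be sanity-checked against
the regenerated json (a minimal sketch, assuming it is run from the
top of a kernel tree with this patch applied; the pebs() helper below
is purely illustrative and not part of perf):

  import json

  def pebs(path, name):
      # Illustrative helper: pmu-events files are flat JSON arrays of
      # event objects; return the PEBS field of the named event.
      with open(path) as f:
          return next(e["PEBS"] for e in json.load(f)
                      if e.get("EventName") == name)

  base = "tools/perf/pmu-events/arch/x86/skylake/"
  assert pebs(base + "memory.json", "RTM_RETIRED.ABORTED") == "2"
  assert pebs(base + "pipeline.json", "INST_RETIRED.NOP") == "2"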

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
tools/perf/pmu-events/arch/x86/skylake/memory.json | 2 +-
tools/perf/pmu-events/arch/x86/skylake/pipeline.json | 2 +-
tools/perf/pmu-events/arch/x86/skylake/virtual-memory.json | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index e8bb8506c2eb..96b86d3b60ce 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -29,7 +29,7 @@ GenuineIntel-6-2A,v19,sandybridge,core
GenuineIntel-6-8F,v1.17,sapphirerapids,core
GenuineIntel-6-AF,v1.01,sierraforest,core
GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core
-GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v57,skylake,core
+GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v58,skylake,core
GenuineIntel-6-55-[01234],v1.32,skylakex,core
GenuineIntel-6-86,v1.21,snowridgex,core
GenuineIntel-6-8[CD],v1.13,tigerlake,core
diff --git a/tools/perf/pmu-events/arch/x86/skylake/memory.json b/tools/perf/pmu-events/arch/x86/skylake/memory.json
index 588ad6059a13..f047862f9735 100644
--- a/tools/perf/pmu-events/arch/x86/skylake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/skylake/memory.json
@@ -1008,7 +1008,7 @@
"BriefDescription": "Number of times an RTM execution aborted due to any reasons (multiple categories may count as one).",
"EventCode": "0xC9",
"EventName": "RTM_RETIRED.ABORTED",
- "PEBS": "1",
+ "PEBS": "2",
"PublicDescription": "Number of times RTM abort was triggered.",
"SampleAfterValue": "2000003",
"UMask": "0x4"
diff --git a/tools/perf/pmu-events/arch/x86/skylake/pipeline.json b/tools/perf/pmu-events/arch/x86/skylake/pipeline.json
index cd3e737bf4a1..fe202d1e368a 100644
--- a/tools/perf/pmu-events/arch/x86/skylake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/skylake/pipeline.json
@@ -387,7 +387,7 @@
"Errata": "SKL091, SKL044",
"EventCode": "0xC0",
"EventName": "INST_RETIRED.NOP",
- "PEBS": "1",
+ "PEBS": "2",
"SampleAfterValue": "2000003",
"UMask": "0x2"
},
diff --git a/tools/perf/pmu-events/arch/x86/skylake/virtual-memory.json b/tools/perf/pmu-events/arch/x86/skylake/virtual-memory.json
index f59405877ae8..73feadaf7674 100644
--- a/tools/perf/pmu-events/arch/x86/skylake/virtual-memory.json
+++ b/tools/perf/pmu-events/arch/x86/skylake/virtual-memory.json
@@ -205,7 +205,7 @@
"BriefDescription": "Counts 1 per cycle for each PMH that is busy with a page walk for an instruction fetch request. EPT page walk duration are excluded in Skylake.",
"EventCode": "0x85",
"EventName": "ITLB_MISSES.WALK_PENDING",
- "PublicDescription": "Counts 1 per cycle for each PMH (Page Miss Handler) that is busy with a page walk for an instruction fetch request. EPT page walk duration are excluded in Skylake michroarchitecture.",
+ "PublicDescription": "Counts 1 per cycle for each PMH (Page Miss Handler) that is busy with a page walk for an instruction fetch request. EPT page walk duration are excluded in Skylake microarchitecture.",
"SampleAfterValue": "100003",
"UMask": "0x10"
},
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 01:34:06

by Ian Rogers

[permalink] [raw]
Subject: [PATCH v1 26/30] perf vendor events intel: Update sandybridge TMA metrics to 4.7

Top-Down Microarchitecture Analysis (TMA) metrics simplify
cycle-accounting using microarchitecture-abstracted metrics
organized in one hierarchy. This update is from version 4.5 to
4.7.

The update includes:

- Add metrics tma_fp_vector_128b, tma_fp_vector_256b and
  tma_info_system_cpus_utilized.
- Remove metrics tma_info_system_mem_parallel_requests and
  tma_info_system_mem_request_latency.
- Rename tma_info_system_average_frequency to
  tma_info_system_core_frequency.
- Swap the scope of tma_info_core_ilp (now per SMT thread) and
  tma_info_pipeline_execute (now per physical core).
- Tune the threshold for tma_fetch_bandwidth.
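
The metric additions, removals and rename above can be sanity-checked
against the regenerated json (a minimal sketch, assuming it is run
from the top of a kernel tree with this patch applied):

  import json

  path = "tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json"
  with open(path) as f:
      # Metric files are flat JSON arrays of metric objects.
      metrics = {m.get("MetricName") for m in json.load(f)}
  assert {"tma_fp_vector_128b", "tma_fp_vector_256b",
          "tma_info_system_cpus_utilized"} <= metrics
  assert "tma_info_system_mem_parallel_requests" not in metrics
  assert "tma_info_system_mem_request_latency" not in metrics
  assert "tma_info_system_average_frequency" not in metrics
  assert "tma_info_system_core_frequency" in metrics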

The update came from:

https://github.com/intel/perfmon/pull/140
https://github.com/intel/perfmon/pull/138

The metrics were regenerated by running the script:

https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py

Signed-off-by: Ian Rogers <[email protected]>
---
.../arch/x86/sandybridge/metricgroups.json | 7 +-
.../arch/x86/sandybridge/snb-metrics.json | 71 +++++++++++--------
2 files changed, 46 insertions(+), 32 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/metricgroups.json b/tools/perf/pmu-events/arch/x86/sandybridge/metricgroups.json
index bebb85945d62..a2c27794c0d8 100644
--- a/tools/perf/pmu-events/arch/x86/sandybridge/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/sandybridge/metricgroups.json
@@ -2,10 +2,10 @@
"Backend": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Bad": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"BadSpec": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
- "BigFoot": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "BigFootprint": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"BrMispredicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Branches": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
- "CacheMisses": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "CacheHits": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Compute": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -23,7 +23,9 @@
"L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"MemoryBound": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
"MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -88,6 +90,7 @@
"tma_issueTLB": "Metrics related by the issue $issueTLB",
"tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
"tma_light_operations_group": "Metrics contributing to tma_light_operations category",
+ "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
"tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
"tma_memory_bound_group": "Metrics contributing to tma_memory_bound category",
"tma_microcode_sequencer_group": "Metrics contributing to tma_microcode_sequencer category",
diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json b/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json
index 8898b6fd0dea..ce836ebda542 100644
--- a/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json
@@ -163,7 +163,7 @@
"MetricExpr": "tma_frontend_bound - tma_fetch_latency",
"MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
"MetricName": "tma_fetch_bandwidth",
- "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 4 > 0.35",
+ "MetricThreshold": "tma_fetch_bandwidth > 0.2",
"MetricgroupNoGroup": "TopdownL2",
"PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_frontend_dsb_coverage, tma_lcp",
"ScaleUnit": "100%"
@@ -193,7 +193,7 @@
"MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
"MetricName": "tma_fp_scalar",
"MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
- "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+ "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
"ScaleUnit": "100%"
},
{
@@ -202,7 +202,25 @@
"MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
"MetricName": "tma_fp_vector",
"MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
- "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+ "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors",
+ "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE) / UOPS_DISPATCHED.THREAD",
+ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+ "MetricName": "tma_fp_vector_128b",
+ "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+ "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors",
+ "MetricExpr": "(SIMD_FP_256.PACKED_DOUBLE + SIMD_FP_256.PACKED_SINGLE) / UOPS_DISPATCHED.THREAD",
+ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+ "MetricName": "tma_fp_vector_256b",
+ "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+ "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
"ScaleUnit": "100%"
},
{
@@ -222,7 +240,7 @@
"MetricName": "tma_heavy_operations",
"MetricThreshold": "tma_heavy_operations > 0.1",
"MetricgroupNoGroup": "TopdownL2",
- "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.",
+ "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
"ScaleUnit": "100%"
},
{
@@ -244,7 +262,7 @@
"MetricName": "tma_info_core_flopc"
},
{
- "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core",
+ "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
"MetricExpr": "UOPS_DISPATCHED.THREAD / (cpu@UOPS_DISPATCHED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_DISPATCHED.CORE\\,cmask\\=1@)",
"MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
"MetricName": "tma_info_core_ilp"
@@ -271,21 +289,27 @@
"MetricName": "tma_info_pipeline_retire"
},
{
- "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]",
+ "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
"MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
"MetricGroup": "Power;Summary",
- "MetricName": "tma_info_system_average_frequency"
+ "MetricName": "tma_info_system_core_frequency"
},
{
- "BriefDescription": "Average CPU Utilization",
+ "BriefDescription": "Average CPU Utilization (percentage)",
"MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC",
"MetricGroup": "HPC;Summary",
"MetricName": "tma_info_system_cpu_utilization"
},
+ {
+ "BriefDescription": "Average number of utilized CPUs",
+ "MetricExpr": "#num_cpus_online * tma_info_system_cpu_utilization",
+ "MetricGroup": "Summary",
+ "MetricName": "tma_info_system_cpus_utilized"
+ },
{
"BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
"MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3",
- "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW",
+ "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
"MetricName": "tma_info_system_dram_bw_use",
"PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_mem_bandwidth"
},
@@ -294,7 +318,7 @@
"MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time",
"MetricGroup": "Cor;Flops;HPC",
"MetricName": "tma_info_system_gflops",
- "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine."
+ "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
},
{
"BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
@@ -316,19 +340,6 @@
"MetricName": "tma_info_system_kernel_utilization",
"MetricThreshold": "tma_info_system_kernel_utilization > 0.05"
},
- {
- "BriefDescription": "Average number of parallel requests to external memory",
- "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST",
- "MetricGroup": "Mem;SoC",
- "MetricName": "tma_info_system_mem_parallel_requests",
- "PublicDescription": "Average number of parallel requests to external memory. Accounts for all requests"
- },
- {
- "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)",
- "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL",
- "MetricGroup": "Mem;SoC",
- "MetricName": "tma_info_system_mem_request_latency"
- },
{
"BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
"MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)",
@@ -388,7 +399,7 @@
{
"BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
"MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_thread_clks",
- "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+ "MetricGroup": "BigFootprint;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
"MetricName": "tma_itlb_misses",
"MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
"PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED",
@@ -398,7 +409,7 @@
"BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
"MetricConstraint": "NO_GROUP_EVENTS_SMT",
"MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_thread_clks",
- "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+ "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
"MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
"PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS",
@@ -420,7 +431,7 @@
"MetricName": "tma_light_operations",
"MetricThreshold": "tma_light_operations > 0.6",
"MetricgroupNoGroup": "TopdownL2",
- "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
+ "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
"ScaleUnit": "100%"
},
{
@@ -435,21 +446,21 @@
"ScaleUnit": "100%"
},
{
- "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)",
+ "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
"MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_thread_clks",
"MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
"MetricName": "tma_mem_bandwidth",
"MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
- "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_info_system_dram_bw_use",
+ "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_info_system_dram_bw_use",
"ScaleUnit": "100%"
},
{
- "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)",
+ "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM)",
"MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
"MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
"MetricName": "tma_mem_latency",
"MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
- "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: ",
+ "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: ",
"ScaleUnit": "100%"
},
{
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 16:46:56

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] perf vendor event and TMA 4.7 metric update



On 2024-02-13 8:17 p.m., Ian Rogers wrote:
> The first 12 patches update vendor events to those in:
> https://github.com/intel/perfmon
>
> The next 18 patches update the TMA metrics from version 4.5 to version
> 4.7. This includes improvements to many models like the
> tma_info_bottleneck* metrics, an abstraction or summarization of the
> 100+ TMA tree nodes into 12-entry familiar performance metrics.
>
> The update was possible due to the release of TMA 4.7 metrics in:
> https://github.com/intel/perfmon/pull/140
> https://github.com/intel/perfmon/pull/138
>
> Pull request:
> https://github.com/intel/perfmon/pull/144
> was applied to the converter script. This PR adds back CORE_CLKS as in
> TMA metrics 4.5 for pre-Icelake models. This avoids an issue running
> perf stat not in system wide mode where pre-Icelake models need to
> scale the slots dependent on what the other hyperthread is doing (via
> the event CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE - later models have a
> slots counter for this). This is a known not accurate thing to do,
> hence the removal from TMA 4.7, but perf would allow the use of the
> metric not in system wide mode and report a metric with value. It
> seemed better to give a more accurate number in not system wide mode,
> possibly induce multiplexing, and have the TMA 4.5 behavior. We can
> later update the behavior to be NaN [1] when not in system wide mode
> to avoid broken usage. The not system wide mode in the TMA 4.5
> spreadsheet is known as "#EBS_Mode" and is detected in perf json with
> the "#core_wide" literal.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Ian Rogers (30):
> perf vendor events intel: Update alderlake events to v1.24
> perf vendor events intel: Update alderlaken events to v1.24
> perf vendor events intel: Update broadwell events to v29
> perf vendor events intel: Update emeraldrapids events to v1.03
> perf vendor events intel: Update grandridge events to v1.01
> perf vendor events intel: Update haswell events to v35
> perf vendor events intel: Update icelake events to v1.21
> perf vendor events intel: Update meteorlake events to v1.07
> perf vendor events intel: Update rocketlake events to v1.02
> perf vendor events intel: Update sierraforst events to v1.01
> perf vendor events intel: Update skylake events to v58
> perf vendor events intel: Update tigerlake events to v1.15
> perf vendor events intel: Update alderlake TMA metrics to 4.7
> perf vendor events intel: Update broadwell TMA metrics to 4.7
> perf vendor events intel: Update broadwellde TMA metrics to 4.7
> perf vendor events intel: Update broadwellx TMA metrics to 4.7
> perf vendor events intel: Update cascadelakex TMA metrics to 4.7
> perf vendor events intel: Update haswell TMA metrics to 4.7
> perf vendor events intel: Update haswellx TMA metrics to 4.7
> perf vendor events intel: Update icelake TMA metrics to 4.7
> perf vendor events intel: Update icelakex TMA metrics to 4.7
> perf vendor events intel: Update ivybridge TMA metrics to 4.7
> perf vendor events intel: Update ivytown TMA metrics to 4.7
> perf vendor events intel: Update jaketown TMA metrics to 4.7
> perf vendor events intel: Update rocketlake TMA metrics to 4.7
> perf vendor events intel: Update sandybridge TMA metrics to 4.7
> perf vendor events intel: Update sapphirerapids TMA metrics to 4.7
> perf vendor events intel: Update skylake TMA metrics to 4.7
> perf vendor events intel: Update skylakex TMA metrics to 4.7
> perf vendor events intel: Update tigerlake TMA metrics to 4.7
>

Thanks Ian!

Reviewed-by: Kan Liang <[email protected]>

Thanks,
Kan

2024-02-21 02:03:12

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] perf vendor event and TMA 4.7 metric update

On Tue, 13 Feb 2024 17:17:49 -0800, Ian Rogers wrote:
> The first 12 patches update vendor events to those in:
> https://github.com/intel/perfmon
>
> The next 18 patches update the TMA metrics from version 4.5 to version
> 4.7. This includes improvements to many models like the
> tma_info_bottleneck* metrics, an abstraction or summarization of the
> 100+ TMA tree nodes into 12-entry familiar performance metrics.
>
> [...]

Applied to perf-tools-next, thanks!

Best regards,
--
Namhyung Kim <[email protected]>