2024-01-04 07:43:39

by Ian Rogers

Subject: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes

Fix that the core PMU is being specified for 2 uncore events. Specify
a PMU for the alderlake UNCORE_FREQ metric.

Conversion script updated in:
https://github.com/intel/perfmon/pull/126
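
A quick way to exercise the result after rebuilding perf (a minimal sketch,
assuming an Alderlake or Rocketlake test machine; the metric names come from
this patch, the perf invocations are standard ones):

  # UNCORE_FREQ should now resolve against the cpu_core PMU on alderlake
  $ perf stat -M UNCORE_FREQ -a sleep 1
  # the occupancy metric should no longer wrap the uncore event in cpu_core@...@
  $ perf stat -M tma_info_system_mem_parallel_reads -a sleep 1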

Reported-by: Arnaldo Carvalho de Melo <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Ian Rogers <[email protected]>
---
.../arch/x86/alderlake/adl-metrics.json | 15 ++++++++-------
.../arch/x86/rocketlake/rkl-metrics.json | 2 +-
2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
index 3388b58b8f1a..35124a4ddcb2 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
@@ -69,12 +69,6 @@
"MetricName": "C9_Pkg_Residency",
"ScaleUnit": "100%"
},
- {
- "BriefDescription": "Uncore frequency per die [GHZ]",
- "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
- "MetricGroup": "SoC",
- "MetricName": "UNCORE_FREQ"
- },
{
"BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
"MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
@@ -809,6 +803,13 @@
"ScaleUnit": "100%",
"Unit": "cpu_atom"
},
+ {
+ "BriefDescription": "Uncore frequency per die [GHZ]",
+ "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
+ "MetricGroup": "SoC",
+ "MetricName": "UNCORE_FREQ",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
"MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_DISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DISPATCHED.PORT_6@) / (5 * tma_info_core_core_clks)",
@@ -1838,7 +1839,7 @@
},
{
"BriefDescription": "Average number of parallel data read requests to external memory",
- "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu_core@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@",
+ "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@",
"MetricGroup": "Mem;MemoryBW;SoC",
"MetricName": "tma_info_system_mem_parallel_reads",
"PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches",
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
index 0c880e415669..27433fc15ede 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
@@ -985,7 +985,7 @@
},
{
"BriefDescription": "Average number of parallel data read requests to external memory",
- "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@",
+ "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@",
"MetricGroup": "Mem;MemoryBW;SoC",
"MetricName": "tma_info_system_mem_parallel_reads",
"PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
--
2.43.0.472.g3155946c3a-goog



2024-01-04 07:43:55

by Ian Rogers

Subject: [PATCH v1 2/4] perf vendor events intel: Update emeraldrapids events to v1.02

Update to v1.02 released in:
https://github.com/intel/perfmon/pull/123

Remove events AMX_OPS_RETIRED.BF16 and AMX_OPS_RETIRED.INT8. Add
events FP_ARITH_DISPATCHED.V0, FP_ARITH_DISPATCHED.V1,
FP_ARITH_DISPATCHED.V2, UNC_IIO_IOMMU0.1G_HITS, UNC_IIO_IOMMU0.2M_HITS
and UNC_IIO_IOMMU0.4K_HITS. Description updates.
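
As a quick sanity check (a sketch only, assuming an Emerald Rapids host
running a perf built with this update), the new alias events should show up
next to the PORT_* events they alias and count the same way:

  # list the dispatched FP arithmetic events, old and new names
  $ perf list 2>/dev/null | grep -i fp_arith_dispatched
  # an alias pair should report matching counts
  $ perf stat -e FP_ARITH_DISPATCHED.PORT_0,FP_ARITH_DISPATCHED.V0 sleep 1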

Signed-off-by: Ian Rogers <[email protected]>
---
.../x86/emeraldrapids/floating-point.json | 27 +++++++++++++++--
.../arch/x86/emeraldrapids/pipeline.json | 18 +----------
.../emeraldrapids/uncore-interconnect.json | 8 ++---
.../arch/x86/emeraldrapids/uncore-io.json | 30 +++++++++++++++++++
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
5 files changed, 60 insertions(+), 25 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/floating-point.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/floating-point.json
index 4a9d211e9d4f..1bdefaf96287 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/floating-point.json
@@ -23,26 +23,47 @@
"UMask": "0x10"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_0",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_0 [This event is alias to FP_ARITH_DISPATCHED.V0]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_0",
"SampleAfterValue": "2000003",
"UMask": "0x1"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_1",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_1 [This event is alias to FP_ARITH_DISPATCHED.V1]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_1",
"SampleAfterValue": "2000003",
"UMask": "0x2"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_5",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_5 [This event is alias to FP_ARITH_DISPATCHED.V2]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_5",
"SampleAfterValue": "2000003",
"UMask": "0x4"
},
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V0 [This event is alias to FP_ARITH_DISPATCHED.PORT_0]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V0",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x1"
+ },
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V1 [This event is alias to FP_ARITH_DISPATCHED.PORT_1]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V1",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x2"
+ },
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V2 [This event is alias to FP_ARITH_DISPATCHED.PORT_5]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V2",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x4"
+ },
{
"BriefDescription": "Counts number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
"EventCode": "0xc7",
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json
index 6dcf3b763af4..1f8200fb8964 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json
@@ -1,20 +1,4 @@
[
- {
- "BriefDescription": "AMX retired arithmetic BF16 operations.",
- "EventCode": "0xce",
- "EventName": "AMX_OPS_RETIRED.BF16",
- "PublicDescription": "Number of AMX-based retired arithmetic bfloat16 (BF16) floating-point operations. Counts TDPBF16PS FP instructions. SW to use operation multiplier of 4",
- "SampleAfterValue": "1000003",
- "UMask": "0x2"
- },
- {
- "BriefDescription": "AMX retired arithmetic integer 8-bit operations.",
- "EventCode": "0xce",
- "EventName": "AMX_OPS_RETIRED.INT8",
- "PublicDescription": "Number of AMX-based retired arithmetic integer operations of 8-bit width source operands. Counts TDPB[SS,UU,US,SU]D instructions. SW should use operation multiplier of 8.",
- "SampleAfterValue": "1000003",
- "UMask": "0x1"
- },
{
"BriefDescription": "This event is deprecated. Refer to new event ARITH.DIV_ACTIVE",
"CounterMask": "1",
@@ -505,7 +489,7 @@
"UMask": "0x1"
},
{
- "BriefDescription": "INT_MISC.UNKNOWN_BRANCH_CYCLES",
+ "BriefDescription": "Bubble cycles of BAClear (Unknown Branch).",
"EventCode": "0xad",
"EventName": "INT_MISC.UNKNOWN_BRANCH_CYCLES",
"MSRIndex": "0x3F7",
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-interconnect.json
index 09d840c7da4c..65d088556bae 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-interconnect.json
@@ -4825,11 +4825,11 @@
"Unit": "M3UPI"
},
{
- "BriefDescription": "Number of allocations into the CRS Egress used to queue up requests destined to the mesh (AD Bouncable)",
+ "BriefDescription": "Number of allocations into the CRS Egress used to queue up requests destined to the mesh (AD Bounceable)",
"EventCode": "0x47",
"EventName": "UNC_MDF_CRS_TxR_INSERTS.AD_BNC",
"PerPkg": "1",
- "PublicDescription": "AD Bouncable : Number of allocations into the CRS Egress",
+ "PublicDescription": "AD Bounceable : Number of allocations into the CRS Egress",
"UMask": "0x1",
"Unit": "MDF"
},
@@ -4861,11 +4861,11 @@
"Unit": "MDF"
},
{
- "BriefDescription": "Number of allocations into the CRS Egress used to queue up requests destined to the mesh (BL Bouncable)",
+ "BriefDescription": "Number of allocations into the CRS Egress used to queue up requests destined to the mesh (BL Bounceable)",
"EventCode": "0x47",
"EventName": "UNC_MDF_CRS_TxR_INSERTS.BL_BNC",
"PerPkg": "1",
- "PublicDescription": "BL Bouncable : Number of allocations into the CRS Egress",
+ "PublicDescription": "BL Bounceable : Number of allocations into the CRS Egress",
"UMask": "0x4",
"Unit": "MDF"
},
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json
index 557080b74ee5..0761980c34a0 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json
@@ -1185,6 +1185,36 @@
"UMask": "0x70ff010",
"Unit": "IIO"
},
+ {
+ "BriefDescription": ": IOTLB Hits to a 1G Page",
+ "EventCode": "0x40",
+ "EventName": "UNC_IIO_IOMMU0.1G_HITS",
+ "PerPkg": "1",
+ "PortMask": "0x0000",
+ "PublicDescription": ": IOTLB Hits to a 1G Page : Counts if a transaction to a 1G page, on its first lookup, hits the IOTLB.",
+ "UMask": "0x10",
+ "Unit": "IIO"
+ },
+ {
+ "BriefDescription": ": IOTLB Hits to a 2M Page",
+ "EventCode": "0x40",
+ "EventName": "UNC_IIO_IOMMU0.2M_HITS",
+ "PerPkg": "1",
+ "PortMask": "0x0000",
+ "PublicDescription": ": IOTLB Hits to a 2M Page : Counts if a transaction to a 2M page, on its first lookup, hits the IOTLB.",
+ "UMask": "0x8",
+ "Unit": "IIO"
+ },
+ {
+ "BriefDescription": ": IOTLB Hits to a 4K Page",
+ "EventCode": "0x40",
+ "EventName": "UNC_IIO_IOMMU0.4K_HITS",
+ "PerPkg": "1",
+ "PortMask": "0x0000",
+ "PublicDescription": ": IOTLB Hits to a 4K Page : Counts if a transaction to a 4K page, on its first lookup, hits the IOTLB.",
+ "UMask": "0x4",
+ "Unit": "IIO"
+ },
{
"BriefDescription": ": Context cache hits",
"EventCode": "0x40",
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index e571683f59f3..fd38c516c048 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -7,7 +7,7 @@ GenuineIntel-6-56,v11,broadwellde,core
GenuineIntel-6-4F,v22,broadwellx,core
GenuineIntel-6-55-[56789ABCDEF],v1.20,cascadelakex,core
GenuineIntel-6-9[6C],v1.04,elkhartlake,core
-GenuineIntel-6-CF,v1.01,emeraldrapids,core
+GenuineIntel-6-CF,v1.02,emeraldrapids,core
GenuineIntel-6-5[CF],v13,goldmont,core
GenuineIntel-6-7A,v1.01,goldmontplus,core
GenuineIntel-6-B6,v1.00,grandridge,core
--
2.43.0.472.g3155946c3a-goog


2024-01-04 07:44:16

by Ian Rogers

Subject: [PATCH v1 3/4] perf vendor events intel: Update icelakex events to v1.23

Update to v1.23 released in:
https://github.com/intel/perfmon/pull/123

Updates to event descriptions.

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/icelakex/other.json | 2 +-
tools/perf/pmu-events/arch/x86/icelakex/pipeline.json | 2 +-
.../pmu-events/arch/x86/icelakex/uncore-interconnect.json | 6 +++---
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/icelakex/other.json b/tools/perf/pmu-events/arch/x86/icelakex/other.json
index 63d5faf2fc43..11810daaf150 100644
--- a/tools/perf/pmu-events/arch/x86/icelakex/other.json
+++ b/tools/perf/pmu-events/arch/x86/icelakex/other.json
@@ -19,7 +19,7 @@
"BriefDescription": "Core cycles where the core was running in a manner where Turbo may be clipped to the AVX512 turbo schedule.",
"EventCode": "0x28",
"EventName": "CORE_POWER.LVL2_TURBO_LICENSE",
- "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchtecture). This includes high current AVX 512-bit instructions.",
+ "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchitecture). This includes high current AVX 512-bit instructions.",
"SampleAfterValue": "200003",
"UMask": "0x20"
},
diff --git a/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json b/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json
index 176e5ef2a24a..45ee6bceba7f 100644
--- a/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json
@@ -519,7 +519,7 @@
"BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread",
"EventCode": "0x5e",
"EventName": "RS_EVENTS.EMPTY_CYCLES",
- "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into stravation periods (e.g. branch mispredictions or i-cache misses)",
+ "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into starvation periods (e.g. branch mispredictions or i-cache misses)",
"SampleAfterValue": "1000003",
"UMask": "0x1"
},
diff --git a/tools/perf/pmu-events/arch/x86/icelakex/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/icelakex/uncore-interconnect.json
index f87ea3f66d1b..a066a009c511 100644
--- a/tools/perf/pmu-events/arch/x86/icelakex/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/icelakex/uncore-interconnect.json
@@ -38,7 +38,7 @@
"EventCode": "0x10",
"EventName": "UNC_I_COHERENT_OPS.CLFLUSH",
"PerPkg": "1",
- "PublicDescription": "Coherent Ops : CLFlush : Counts the number of coherency related operations servied by the IRP",
+ "PublicDescription": "Coherent Ops : CLFlush : Counts the number of coherency related operations serviced by the IRP",
"UMask": "0x80",
"Unit": "IRP"
},
@@ -65,7 +65,7 @@
"EventCode": "0x10",
"EventName": "UNC_I_COHERENT_OPS.WBMTOI",
"PerPkg": "1",
- "PublicDescription": "Coherent Ops : WbMtoI : Counts the number of coherency related operations servied by the IRP",
+ "PublicDescription": "Coherent Ops : WbMtoI : Counts the number of coherency related operations serviced by the IRP",
"UMask": "0x40",
"Unit": "IRP"
},
@@ -454,7 +454,7 @@
"EventCode": "0x11",
"EventName": "UNC_I_TRANSACTIONS.WRITES",
"PerPkg": "1",
- "PublicDescription": "Inbound Transaction Count : Writes : Counts the number of Inbound transactions from the IRP to the Uncore. This can be filtered based on request type in addition to the source queue. Note the special filtering equation. We do OR-reduction on the request type. If the SOURCE bit is set, then we also do AND qualification based on the source portID. : Trackes only write requests. Each write request should have a prefetch, so there is no need to explicitly track these requests. For writes that are tickled and have to retry, the counter will be incremented for each retry.",
+ "PublicDescription": "Inbound Transaction Count : Writes : Counts the number of Inbound transactions from the IRP to the Uncore. This can be filtered based on request type in addition to the source queue. Note the special filtering equation. We do OR-reduction on the request type. If the SOURCE bit is set, then we also do AND qualification based on the source portID. : Tracks only write requests. Each write request should have a prefetch, so there is no need to explicitly track these requests. For writes that are tickled and have to retry, the counter will be incremented for each retry.",
"UMask": "0x2",
"Unit": "IRP"
},
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index fd38c516c048..c1820eb16a19 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -15,7 +15,7 @@ GenuineIntel-6-A[DE],v1.01,graniterapids,core
GenuineIntel-6-(3C|45|46),v33,haswell,core
GenuineIntel-6-3F,v28,haswellx,core
GenuineIntel-6-7[DE],v1.19,icelake,core
-GenuineIntel-6-6[AC],v1.21,icelakex,core
+GenuineIntel-6-6[AC],v1.23,icelakex,core
GenuineIntel-6-3A,v24,ivybridge,core
GenuineIntel-6-3E,v24,ivytown,core
GenuineIntel-6-2D,v24,jaketown,core
--
2.43.0.472.g3155946c3a-goog


2024-01-04 07:44:28

by Ian Rogers

Subject: [PATCH v1 4/4] perf vendor events intel: Update sapphirerapids events to v1.17

Update to v1.17 released in:
https://github.com/intel/perfmon/pull/123

Add events FP_ARITH_DISPATCHED.V0, FP_ARITH_DISPATCHED.V1,
FP_ARITH_DISPATCHED.V2, UNC_IIO_IOMMU0.1G_HITS, UNC_IIO_IOMMU0.2M_HITS
and UNC_IIO_IOMMU0.4K_HITS. Description updates.
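
For the uncore IIO additions (again just a sketch, assuming a Sapphire Rapids
machine; uncore events need system-wide counting, hence -a), the new IOTLB hit
events can be counted directly:

  # IOTLB hits broken down by page size
  $ perf stat -e UNC_IIO_IOMMU0.4K_HITS,UNC_IIO_IOMMU0.2M_HITS,UNC_IIO_IOMMU0.1G_HITS -a sleep 1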

Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
.../x86/sapphirerapids/floating-point.json | 27 +++++++++++++++--
.../arch/x86/sapphirerapids/pipeline.json | 2 +-
.../sapphirerapids/uncore-interconnect.json | 8 ++---
.../arch/x86/sapphirerapids/uncore-io.json | 30 +++++++++++++++++++
5 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index c1820eb16a19..4d1deed4437a 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -26,7 +26,7 @@ GenuineIntel-6-1[AEF],v4,nehalemep,core
GenuineIntel-6-2E,v4,nehalemex,core
GenuineIntel-6-A7,v1.01,rocketlake,core
GenuineIntel-6-2A,v19,sandybridge,core
-GenuineIntel-6-8F,v1.16,sapphirerapids,core
+GenuineIntel-6-8F,v1.17,sapphirerapids,core
GenuineIntel-6-AF,v1.00,sierraforest,core
GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core
GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v57,skylake,core
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json
index 4a9d211e9d4f..1bdefaf96287 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json
@@ -23,26 +23,47 @@
"UMask": "0x10"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_0",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_0 [This event is alias to FP_ARITH_DISPATCHED.V0]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_0",
"SampleAfterValue": "2000003",
"UMask": "0x1"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_1",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_1 [This event is alias to FP_ARITH_DISPATCHED.V1]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_1",
"SampleAfterValue": "2000003",
"UMask": "0x2"
},
{
- "BriefDescription": "FP_ARITH_DISPATCHED.PORT_5",
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_5 [This event is alias to FP_ARITH_DISPATCHED.V2]",
"EventCode": "0xb3",
"EventName": "FP_ARITH_DISPATCHED.PORT_5",
"SampleAfterValue": "2000003",
"UMask": "0x4"
},
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V0 [This event is alias to FP_ARITH_DISPATCHED.PORT_0]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V0",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x1"
+ },
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V1 [This event is alias to FP_ARITH_DISPATCHED.PORT_1]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V1",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x2"
+ },
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.V2 [This event is alias to FP_ARITH_DISPATCHED.PORT_5]",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.V2",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x4"
+ },
{
"BriefDescription": "Counts number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
"EventCode": "0xc7",
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json
index 6dcf3b763af4..2cfe814d2015 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json
@@ -505,7 +505,7 @@
"UMask": "0x1"
},
{
- "BriefDescription": "INT_MISC.UNKNOWN_BRANCH_CYCLES",
+ "BriefDescription": "Bubble cycles of BAClear (Unknown Branch).",
"EventCode": "0xad",
"EventName": "INT_MISC.UNKNOWN_BRANCH_CYCLES",
"MSRIndex": "0x3F7",
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-interconnect.json
index 09d840c7da4c..65d088556bae 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-interconnect.json
@@ -4825,11 +4825,11 @@
"Unit": "M3UPI"
},
{
- "BriefDescription": "Number of allocations into the CRS Egress used to queue up requests destined to the mesh (AD Bouncable)",
+ "BriefDescription": "Number of allocations into the CRS Egress used to queue up requests destined to the mesh (AD Bounceable)",
"EventCode": "0x47",
"EventName": "UNC_MDF_CRS_TxR_INSERTS.AD_BNC",
"PerPkg": "1",
- "PublicDescription": "AD Bouncable : Number of allocations into the CRS Egress",
+ "PublicDescription": "AD Bounceable : Number of allocations into the CRS Egress",
"UMask": "0x1",
"Unit": "MDF"
},
@@ -4861,11 +4861,11 @@
"Unit": "MDF"
},
{
- "BriefDescription": "Number of allocations into the CRS Egress used to queue up requests destined to the mesh (BL Bouncable)",
+ "BriefDescription": "Number of allocations into the CRS Egress used to queue up requests destined to the mesh (BL Bounceable)",
"EventCode": "0x47",
"EventName": "UNC_MDF_CRS_TxR_INSERTS.BL_BNC",
"PerPkg": "1",
- "PublicDescription": "BL Bouncable : Number of allocations into the CRS Egress",
+ "PublicDescription": "BL Bounceable : Number of allocations into the CRS Egress",
"UMask": "0x4",
"Unit": "MDF"
},
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json
index 8b5f54fed103..03596db87710 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json
@@ -1249,6 +1249,36 @@
"UMask": "0x70ff010",
"Unit": "IIO"
},
+ {
+ "BriefDescription": ": IOTLB Hits to a 1G Page",
+ "EventCode": "0x40",
+ "EventName": "UNC_IIO_IOMMU0.1G_HITS",
+ "PerPkg": "1",
+ "PortMask": "0x0000",
+ "PublicDescription": ": IOTLB Hits to a 1G Page : Counts if a transaction to a 1G page, on its first lookup, hits the IOTLB.",
+ "UMask": "0x10",
+ "Unit": "IIO"
+ },
+ {
+ "BriefDescription": ": IOTLB Hits to a 2M Page",
+ "EventCode": "0x40",
+ "EventName": "UNC_IIO_IOMMU0.2M_HITS",
+ "PerPkg": "1",
+ "PortMask": "0x0000",
+ "PublicDescription": ": IOTLB Hits to a 2M Page : Counts if a transaction to a 2M page, on its first lookup, hits the IOTLB.",
+ "UMask": "0x8",
+ "Unit": "IIO"
+ },
+ {
+ "BriefDescription": ": IOTLB Hits to a 4K Page",
+ "EventCode": "0x40",
+ "EventName": "UNC_IIO_IOMMU0.4K_HITS",
+ "PerPkg": "1",
+ "PortMask": "0x0000",
+ "PublicDescription": ": IOTLB Hits to a 4K Page : Counts if a transaction to a 4K page, on its first lookup, hits the IOTLB.",
+ "UMask": "0x4",
+ "Unit": "IIO"
+ },
{
"BriefDescription": ": Context cache hits",
"EventCode": "0x40",
--
2.43.0.472.g3155946c3a-goog


2024-01-04 12:39:30

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes

On Wed, Jan 03, 2024 at 11:42:56PM -0800, Ian Rogers wrote:
> Fix that the core PMU is being specified for 2 uncore events. Specify
> a PMU for the alderlake UNCORE_FREQ metric.
>
> Conversion script updated in:
> https://github.com/intel/perfmon/pull/126
>
> Reported-by: Arnaldo Carvalho de Melo <[email protected]>
> Closes: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Ian Rogers <[email protected]>

After this first patch:

101: perf all metricgroups test : Ok
102: perf all metrics test : FAILED!
107: perf metrics value validation : Ok

102 is now failing due to some other problem:

root@number:~# perf test -v 102
102: perf all metrics test :
--- start ---
test child forked, pid 2701034
Testing tma_core_bound
Testing tma_info_core_ilp
Testing tma_info_memory_l2mpki
Testing tma_memory_bound
Testing tma_info_bad_spec_branch_misprediction_cost
Testing tma_info_bad_spec_ipmisp_cond_ntaken
Testing tma_info_bad_spec_ipmisp_cond_taken
Testing tma_info_bad_spec_ipmisp_indirect
Testing tma_info_bad_spec_ipmisp_ret
Testing tma_info_bad_spec_ipmispredict
Testing tma_info_bottleneck_mispredictions
Testing tma_info_branches_callret
Testing tma_info_branches_cond_nt
Testing tma_info_branches_cond_tk
Testing tma_info_branches_jump
Testing tma_info_branches_other_branches
Testing tma_branch_mispredicts
Testing tma_clears_resteers
Testing tma_machine_clears
Testing tma_mispredicts_resteers
Testing tma_icache_misses
Testing tma_info_bottleneck_big_code
Testing tma_itlb_misses
Testing tma_unknown_branches
Testing tma_info_inst_mix_bptkbranch
Testing tma_info_inst_mix_ipbranch
Testing tma_info_inst_mix_ipcall
Testing tma_info_inst_mix_iptb
Testing tma_info_system_ipfarbranch
Testing tma_info_thread_uptb
Testing tma_info_memory_fb_hpki
Testing tma_info_memory_l1mpki
Testing tma_info_memory_l1mpki_load
Testing tma_info_memory_l2hpki_all
Testing tma_info_memory_l2hpki_load
Testing tma_info_memory_l2mpki_all
Testing tma_info_memory_l2mpki_load
Testing tma_info_memory_l3mpki
Testing tma_l1_bound
Testing tma_l2_bound
Testing tma_l3_bound
Testing tma_fp_scalar
Testing tma_fp_vector
Testing tma_fp_vector_128b
Testing tma_fp_vector_256b
Testing tma_int_vector_128b
Testing tma_int_vector_256b
Testing tma_port_0
Testing tma_x87_use
Testing tma_info_botlnk_l0_core_bound_likely
Testing tma_info_core_fp_arith_utilization
Testing tma_info_pipeline_execute
Testing tma_info_system_gflops
Testing tma_info_thread_execute_per_issue
Testing tma_dsb
Testing tma_info_frontend_dsb_coverage
Testing tma_decoder0_alone
Testing tma_dsb_switches
Testing tma_info_botlnk_l2_dsb_misses
Testing tma_info_frontend_dsb_switch_cost
Testing tma_info_frontend_ipdsb_miss_ret
Testing tma_mite
Testing tma_contested_accesses
Testing tma_false_sharing
Testing tma_backend_bound
Testing tma_backend_bound_aux
Testing tma_bad_speculation
Testing tma_frontend_bound
Testing tma_retiring
Testing tma_info_botlnk_l2_ic_misses
Testing tma_info_bottleneck_instruction_fetch_bw
Testing tma_info_frontend_fetch_upc
Testing tma_info_frontend_icache_miss_latency
Testing tma_info_frontend_ipunknown_branch
Testing tma_info_frontend_lsd_coverage
Testing tma_info_memory_tlb_code_stlb_mpki
Testing tma_fetch_bandwidth
Testing tma_lsd
Testing tma_branch_resteers
Testing tma_lcp
Testing tma_ms_switches
Testing tma_info_core_flopc
Testing tma_info_inst_mix_iparith
Testing tma_info_inst_mix_iparith_avx128
Testing tma_info_inst_mix_iparith_avx256
Testing tma_info_inst_mix_iparith_scalar_dp
Testing tma_info_inst_mix_iparith_scalar_sp
Testing tma_info_inst_mix_ipflop
Testing tma_fetch_latency
Testing tma_avx_assists
Testing tma_fp_arith
Testing tma_fp_assists
Testing tma_info_system_cpu_utilization
Testing tma_info_system_dram_bw_use
Testing tma_shuffles
Testing tma_info_frontend_l2mpki_code
Testing tma_info_frontend_l2mpki_code_all
Testing tma_info_inst_mix_ipload
Testing tma_info_inst_mix_ipstore
Testing tma_info_bottleneck_memory_bandwidth
Testing tma_info_bottleneck_memory_data_tlbs
Testing tma_info_bottleneck_memory_latency
Testing tma_info_memory_core_l1d_cache_fill_bw
Testing tma_info_memory_core_l2_cache_fill_bw
Testing tma_info_memory_core_l3_cache_access_bw
Testing tma_info_memory_core_l3_cache_fill_bw
Testing tma_info_memory_load_miss_real_latency
Testing tma_info_memory_mlp
Testing tma_info_memory_thread_l1d_cache_fill_bw_1t
Testing tma_info_memory_thread_l2_cache_fill_bw_1t
Testing tma_info_memory_thread_l3_cache_access_bw_1t
Testing tma_info_memory_thread_l3_cache_fill_bw_1t
Testing tma_info_memory_tlb_load_stlb_mpki
Testing tma_info_memory_tlb_page_walks_utilization
Testing tma_info_memory_tlb_store_stlb_mpki
Testing tma_info_system_mem_parallel_reads
Testing tma_info_system_mem_read_latency
Testing tma_info_system_mem_request_latency
Testing tma_info_thread_cpi
Testing tma_fb_full
Testing tma_mem_bandwidth
Testing tma_sq_full
Testing tma_streaming_stores
Testing tma_dram_bound
Testing tma_store_bound
Testing tma_l3_hit_latency
Testing tma_mem_latency
Testing tma_store_latency
Testing tma_dtlb_load
Testing tma_dtlb_store
Testing tma_load_stlb_hit
Testing tma_load_stlb_miss
Testing tma_store_stlb_hit
Testing tma_store_stlb_miss
Testing tma_info_memory_oro_data_l2_mlp
Testing tma_info_memory_oro_load_l2_mlp
Testing tma_info_memory_oro_load_l2_miss_latency
Testing tma_info_memory_oro_load_l3_miss_latency
Testing tma_microcode_sequencer
Testing tma_info_core_clks
Testing tma_info_core_clks_p
Testing tma_info_core_cpi
Testing tma_info_core_ipc
Testing tma_info_core_slots
Testing tma_info_core_upi
Testing tma_info_frontend_inst_miss_cost_dramhit_percent
Testing tma_info_frontend_inst_miss_cost_l2hit_percent
Testing tma_info_frontend_inst_miss_cost_l3hit_percent
Testing tma_info_inst_mix_branch_mispredict_ratio
Testing tma_info_inst_mix_branch_mispredict_to_unknown_branch_ratio
Testing tma_info_inst_mix_fpdiv_uop_ratio
Testing tma_info_inst_mix_idiv_uop_ratio
Testing tma_info_inst_mix_ipfarbranch
Testing tma_info_inst_mix_ipmisp_cond_ntaken
Testing tma_info_inst_mix_ipmisp_cond_taken
Testing tma_info_inst_mix_ipmisp_indirect
Testing tma_info_inst_mix_ipmisp_ret
Testing tma_info_inst_mix_ipmispredict
Testing tma_info_inst_mix_microcode_uop_ratio
Testing tma_info_inst_mix_x87_uop_ratio
Testing tma_info_l1_bound_address_alias_blocks
Testing tma_info_l1_bound_load_splits
Testing tma_info_l1_bound_store_fwd_blocks
Testing tma_info_memory_cycles_per_demand_load_dram_hit
Testing tma_info_memory_cycles_per_demand_load_l2_hit
Testing tma_info_memory_cycles_per_demand_load_l3_hit
Testing tma_info_memory_memloadpki
Testing tma_info_system_kernel_cpi
Testing tma_info_system_kernel_utilization
Testing tma_data_sharing
Testing tma_lock_latency
Testing tma_fused_instructions
Testing tma_info_pipeline_ipassist
Testing tma_info_pipeline_retire
Testing tma_info_pipeline_strings_cycles
Testing tma_info_thread_clks
Testing tma_info_thread_uoppi
Testing tma_int_operations
Testing tma_memory_operations
Testing tma_non_fused_branches
Testing tma_nop_instructions
Testing tma_other_light_ops
Testing tma_ports_utilization
Testing tma_ports_utilized_0
Testing tma_ports_utilized_1
Metric 'tma_ports_utilized_1' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 49.581 usec (+- 0.030 usec)
Average num. events: 47.000 (+- 0.000)
Average time per event 1.055 usec
Average data synthesis took: 53.367 usec (+- 0.032 usec)
Average num. events: 246.000 (+- 0.000)
Average time per event 0.217 usec

Performance counter stats for 'perf bench internals synthesize':

<not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
<not counted> cpu_core/topdown-retiring/ (0.00%)
<not counted> cpu_core/topdown-mem-bound/ (0.00%)
<not counted> cpu_core/topdown-bad-spec/ (0.00%)
<not counted> cpu_core/topdown-fe-bound/ (0.00%)
<not counted> cpu_core/topdown-be-bound/ (0.00%)
<not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
<not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
<not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)

1.180394056 seconds time elapsed

0.409881000 seconds user
0.764134000 seconds sys
Testing tma_ports_utilized_2
Metric 'tma_ports_utilized_2' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 50.456 usec (+- 0.066 usec)
Average num. events: 47.000 (+- 0.000)
Average time per event 1.074 usec
Average data synthesis took: 52.904 usec (+- 0.030 usec)
Average num. events: 246.000 (+- 0.000)
Average time per event 0.215 usec

Performance counter stats for 'perf bench internals synthesize':

<not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
<not counted> cpu_core/topdown-retiring/ (0.00%)
<not counted> cpu_core/topdown-mem-bound/ (0.00%)
<not counted> cpu_core/topdown-bad-spec/ (0.00%)
<not counted> cpu_core/topdown-fe-bound/ (0.00%)
<not counted> cpu_core/topdown-be-bound/ (0.00%)
<not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
<not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
<not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)

1.187012938 seconds time elapsed

0.397919000 seconds user
0.782854000 seconds sys
Testing tma_ports_utilized_3m
Metric 'tma_ports_utilized_3m' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 48.248 usec (+- 0.028 usec)
Average num. events: 47.000 (+- 0.000)
Average time per event 1.027 usec
Average data synthesis took: 52.781 usec (+- 0.036 usec)
Average num. events: 245.000 (+- 0.000)
Average time per event 0.215 usec

Performance counter stats for 'perf bench internals synthesize':

<not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
<not counted> cpu_core/topdown-retiring/ (0.00%)
<not counted> cpu_core/topdown-mem-bound/ (0.00%)
<not counted> cpu_core/topdown-bad-spec/ (0.00%)
<not counted> cpu_core/topdown-fe-bound/ (0.00%)
<not counted> cpu_core/topdown-be-bound/ (0.00%)
<not counted> cpu_core/UOPS_EXECUTED.CYCLES_GE_3/ (0.00%)
<not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
<not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
<not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)

1.160315379 seconds time elapsed

0.388639000 seconds user
0.765363000 seconds sys
Testing tma_serializing_operation
Testing C10_Pkg_Residency
Testing C1_Core_Residency
Testing C2_Pkg_Residency
Testing C3_Pkg_Residency
Testing C6_Core_Residency
Testing C6_Pkg_Residency
Testing C7_Core_Residency
Testing C7_Pkg_Residency
Testing C8_Pkg_Residency
Testing C9_Pkg_Residency
Testing tma_info_system_average_frequency
Testing tma_info_system_turbo_utilization
Testing tma_info_inst_mix_ipswpf
Testing tma_info_bottleneck_branching_overhead
Testing tma_info_core_coreipc
Testing tma_info_thread_ipc
Testing tma_heavy_operations
Testing tma_light_operations
Testing tma_info_core_core_clks
Testing tma_info_system_smt_2t_utilization
Testing tma_info_thread_slots_utilization
Testing UNCORE_FREQ
Testing tma_info_system_socket_clks
Testing tma_info_inst_mix_instructions
Testing tma_info_thread_slots
Testing tma_base
Testing tma_ms_uops
Testing tma_resource_bound
Testing tma_alloc_restriction
Testing tma_branch_detect
Testing tma_branch_resteer
Testing tma_cisc
Testing tma_decode
Testing tma_divider
Testing tma_fast_nuke
Testing tma_few_uops_instructions
Testing tma_fpdiv_uops
Testing tma_mem_scheduler
Testing tma_non_mem_scheduler
Testing tma_nuke
Testing tma_other_fb
Testing tma_other_load_store
Testing tma_other_ret
Testing tma_predecode
Testing tma_register
Testing tma_reorder_buffer
Testing tma_serialization
Testing tma_assists
Testing tma_disambiguation
Testing tma_fp_assist
Testing tma_ld_buffer
Testing tma_memory_ordering
Testing tma_other_l1
Testing tma_page_fault
Testing tma_rsv
Testing tma_smc
Testing tma_split_loads
Testing tma_split_stores
Testing tma_st_buffer
Testing tma_stlb_hit
Testing tma_stlb_miss
Testing tma_store_fwd_blk
Testing tma_alu_op_utilization
Testing tma_load_op_utilization
Testing tma_mixing_vectors
Testing tma_page_faults
Testing tma_store_op_utilization
Testing tma_memory_fence
Metric 'tma_memory_fence' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 49.458 usec (+- 0.033 usec)
Average num. events: 47.000 (+- 0.000)
Average time per event 1.052 usec
Average data synthesis took: 53.268 usec (+- 0.027 usec)
Average num. events: 244.000 (+- 0.000)
Average time per event 0.218 usec

Performance counter stats for 'perf bench internals synthesize':

<not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
<not counted> cpu_core/topdown-retiring/ (0.00%)
<not counted> cpu_core/topdown-mem-bound/ (0.00%)
<not counted> cpu_core/topdown-bad-spec/ (0.00%)
<not counted> cpu_core/topdown-fe-bound/ (0.00%)
<not counted> cpu_core/topdown-be-bound/ (0.00%)
<not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
<not counted> cpu_core/MISC2_RETIRED.LFENCE/ (0.00%)
<not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
<not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)

1.177929044 seconds time elapsed

0.434552000 seconds user
0.736874000 seconds sys
Testing tma_port_1
Testing tma_port_6
Testing tma_slow_pause
Metric 'tma_slow_pause' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 49.987 usec (+- 0.049 usec)
Average num. events: 47.000 (+- 0.000)
Average time per event 1.064 usec
Average data synthesis took: 53.490 usec (+- 0.033 usec)
Average num. events: 245.000 (+- 0.000)
Average time per event 0.218 usec

Performance counter stats for 'perf bench internals synthesize':

<not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
<not counted> cpu_core/topdown-retiring/ (0.00%)
<not counted> cpu_core/topdown-mem-bound/ (0.00%)
<not counted> cpu_core/topdown-bad-spec/ (0.00%)
<not counted> cpu_core/topdown-fe-bound/ (0.00%)
<not counted> cpu_core/topdown-be-bound/ (0.00%)
<not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/ (0.00%)
<not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
<not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
<not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)

1.186254766 seconds time elapsed

0.427220000 seconds user
0.752217000 seconds sys
Testing smi_cycles
Testing smi_num
Testing tsx_aborted_cycles
Testing tsx_cycles_per_elision
Testing tsx_cycles_per_transaction
Testing tsx_transactional_cycles
test child finished with -1
---- end ----
perf all metrics test: FAILED!
root@number:~#
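
From the log, the failing metrics are only "not printed" because every event
in their group shows <not counted> with 0.00% enabled time. A single failing
metric can be reproduced outside the test harness (a sketch; the harness
appears to fall back to the same benchmark workload shown above):

  # run one of the failing metrics over the benchmark the test uses
  $ perf stat -M tma_ports_utilized_1 -- perf bench internals synthesize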

> ---
> .../arch/x86/alderlake/adl-metrics.json | 15 ++++++++-------
> .../arch/x86/rocketlake/rkl-metrics.json | 2 +-
> 2 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> index 3388b58b8f1a..35124a4ddcb2 100644
> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> @@ -69,12 +69,6 @@
> "MetricName": "C9_Pkg_Residency",
> "ScaleUnit": "100%"
> },
> - {
> - "BriefDescription": "Uncore frequency per die [GHZ]",
> - "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
> - "MetricGroup": "SoC",
> - "MetricName": "UNCORE_FREQ"
> - },
> {
> "BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
> "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
> @@ -809,6 +803,13 @@
> "ScaleUnit": "100%",
> "Unit": "cpu_atom"
> },
> + {
> + "BriefDescription": "Uncore frequency per die [GHZ]",
> + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
> + "MetricGroup": "SoC",
> + "MetricName": "UNCORE_FREQ",
> + "Unit": "cpu_core"
> + },
> {
> "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
> "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_DISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DISPATCHED.PORT_6@) / (5 * tma_info_core_core_clks)",
> @@ -1838,7 +1839,7 @@
> },
> {
> "BriefDescription": "Average number of parallel data read requests to external memory",
> - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu_core@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@",
> + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@",
> "MetricGroup": "Mem;MemoryBW;SoC",
> "MetricName": "tma_info_system_mem_parallel_reads",
> "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches",
> diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> index 0c880e415669..27433fc15ede 100644
> --- a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> @@ -985,7 +985,7 @@
> },
> {
> "BriefDescription": "Average number of parallel data read requests to external memory",
> - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@",
> + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@",
> "MetricGroup": "Mem;MemoryBW;SoC",
> "MetricName": "tma_info_system_mem_parallel_reads",
> "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
> --
> 2.43.0.472.g3155946c3a-goog
>

--

- Arnaldo

2024-01-04 13:56:48

by Ian Rogers

Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes

On Thu, Jan 4, 2024 at 4:39 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> On Wed, Jan 03, 2024 at 11:42:56PM -0800, Ian Rogers wrote:
> > Fix that the core PMU is being specified for 2 uncore events. Specify
> > a PMU for the alderlake UNCORE_FREQ metric.
> >
> > Conversion script updated in:
> > https://github.com/intel/perfmon/pull/126
> >
> > Reported-by: Arnaldo Carvalho de Melo <[email protected]>
> > Closes: https://lore.kernel.org/lkml/[email protected]/
> > Signed-off-by: Ian Rogers <[email protected]>
>
> After this first patch:
>
> 101: perf all metricgroups test : Ok
> 102: perf all metrics test : FAILED!
> 107: perf metrics value validation : Ok
>
> 102 is now failing due to some other problem:
>
> root@number:~# perf test -v 102
> 102: perf all metrics test :
> --- start ---
> test child forked, pid 2701034
> Testing tma_core_bound
> Testing tma_info_core_ilp
> Testing tma_info_memory_l2mpki
> Testing tma_memory_bound
> Testing tma_info_bad_spec_branch_misprediction_cost
> Testing tma_info_bad_spec_ipmisp_cond_ntaken
> Testing tma_info_bad_spec_ipmisp_cond_taken
> Testing tma_info_bad_spec_ipmisp_indirect
> Testing tma_info_bad_spec_ipmisp_ret
> Testing tma_info_bad_spec_ipmispredict
> Testing tma_info_bottleneck_mispredictions
> Testing tma_info_branches_callret
> Testing tma_info_branches_cond_nt
> Testing tma_info_branches_cond_tk
> Testing tma_info_branches_jump
> Testing tma_info_branches_other_branches
> Testing tma_branch_mispredicts
> Testing tma_clears_resteers
> Testing tma_machine_clears
> Testing tma_mispredicts_resteers
> Testing tma_icache_misses
> Testing tma_info_bottleneck_big_code
> Testing tma_itlb_misses
> Testing tma_unknown_branches
> Testing tma_info_inst_mix_bptkbranch
> Testing tma_info_inst_mix_ipbranch
> Testing tma_info_inst_mix_ipcall
> Testing tma_info_inst_mix_iptb
> Testing tma_info_system_ipfarbranch
> Testing tma_info_thread_uptb
> Testing tma_info_memory_fb_hpki
> Testing tma_info_memory_l1mpki
> Testing tma_info_memory_l1mpki_load
> Testing tma_info_memory_l2hpki_all
> Testing tma_info_memory_l2hpki_load
> Testing tma_info_memory_l2mpki_all
> Testing tma_info_memory_l2mpki_load
> Testing tma_info_memory_l3mpki
> Testing tma_l1_bound
> Testing tma_l2_bound
> Testing tma_l3_bound
> Testing tma_fp_scalar
> Testing tma_fp_vector
> Testing tma_fp_vector_128b
> Testing tma_fp_vector_256b
> Testing tma_int_vector_128b
> Testing tma_int_vector_256b
> Testing tma_port_0
> Testing tma_x87_use
> Testing tma_info_botlnk_l0_core_bound_likely
> Testing tma_info_core_fp_arith_utilization
> Testing tma_info_pipeline_execute
> Testing tma_info_system_gflops
> Testing tma_info_thread_execute_per_issue
> Testing tma_dsb
> Testing tma_info_frontend_dsb_coverage
> Testing tma_decoder0_alone
> Testing tma_dsb_switches
> Testing tma_info_botlnk_l2_dsb_misses
> Testing tma_info_frontend_dsb_switch_cost
> Testing tma_info_frontend_ipdsb_miss_ret
> Testing tma_mite
> Testing tma_contested_accesses
> Testing tma_false_sharing
> Testing tma_backend_bound
> Testing tma_backend_bound_aux
> Testing tma_bad_speculation
> Testing tma_frontend_bound
> Testing tma_retiring
> Testing tma_info_botlnk_l2_ic_misses
> Testing tma_info_bottleneck_instruction_fetch_bw
> Testing tma_info_frontend_fetch_upc
> Testing tma_info_frontend_icache_miss_latency
> Testing tma_info_frontend_ipunknown_branch
> Testing tma_info_frontend_lsd_coverage
> Testing tma_info_memory_tlb_code_stlb_mpki
> Testing tma_fetch_bandwidth
> Testing tma_lsd
> Testing tma_branch_resteers
> Testing tma_lcp
> Testing tma_ms_switches
> Testing tma_info_core_flopc
> Testing tma_info_inst_mix_iparith
> Testing tma_info_inst_mix_iparith_avx128
> Testing tma_info_inst_mix_iparith_avx256
> Testing tma_info_inst_mix_iparith_scalar_dp
> Testing tma_info_inst_mix_iparith_scalar_sp
> Testing tma_info_inst_mix_ipflop
> Testing tma_fetch_latency
> Testing tma_avx_assists
> Testing tma_fp_arith
> Testing tma_fp_assists
> Testing tma_info_system_cpu_utilization
> Testing tma_info_system_dram_bw_use
> Testing tma_shuffles
> Testing tma_info_frontend_l2mpki_code
> Testing tma_info_frontend_l2mpki_code_all
> Testing tma_info_inst_mix_ipload
> Testing tma_info_inst_mix_ipstore
> Testing tma_info_bottleneck_memory_bandwidth
> Testing tma_info_bottleneck_memory_data_tlbs
> Testing tma_info_bottleneck_memory_latency
> Testing tma_info_memory_core_l1d_cache_fill_bw
> Testing tma_info_memory_core_l2_cache_fill_bw
> Testing tma_info_memory_core_l3_cache_access_bw
> Testing tma_info_memory_core_l3_cache_fill_bw
> Testing tma_info_memory_load_miss_real_latency
> Testing tma_info_memory_mlp
> Testing tma_info_memory_thread_l1d_cache_fill_bw_1t
> Testing tma_info_memory_thread_l2_cache_fill_bw_1t
> Testing tma_info_memory_thread_l3_cache_access_bw_1t
> Testing tma_info_memory_thread_l3_cache_fill_bw_1t
> Testing tma_info_memory_tlb_load_stlb_mpki
> Testing tma_info_memory_tlb_page_walks_utilization
> Testing tma_info_memory_tlb_store_stlb_mpki
> Testing tma_info_system_mem_parallel_reads
> Testing tma_info_system_mem_read_latency
> Testing tma_info_system_mem_request_latency
> Testing tma_info_thread_cpi
> Testing tma_fb_full
> Testing tma_mem_bandwidth
> Testing tma_sq_full
> Testing tma_streaming_stores
> Testing tma_dram_bound
> Testing tma_store_bound
> Testing tma_l3_hit_latency
> Testing tma_mem_latency
> Testing tma_store_latency
> Testing tma_dtlb_load
> Testing tma_dtlb_store
> Testing tma_load_stlb_hit
> Testing tma_load_stlb_miss
> Testing tma_store_stlb_hit
> Testing tma_store_stlb_miss
> Testing tma_info_memory_oro_data_l2_mlp
> Testing tma_info_memory_oro_load_l2_mlp
> Testing tma_info_memory_oro_load_l2_miss_latency
> Testing tma_info_memory_oro_load_l3_miss_latency
> Testing tma_microcode_sequencer
> Testing tma_info_core_clks
> Testing tma_info_core_clks_p
> Testing tma_info_core_cpi
> Testing tma_info_core_ipc
> Testing tma_info_core_slots
> Testing tma_info_core_upi
> Testing tma_info_frontend_inst_miss_cost_dramhit_percent
> Testing tma_info_frontend_inst_miss_cost_l2hit_percent
> Testing tma_info_frontend_inst_miss_cost_l3hit_percent
> Testing tma_info_inst_mix_branch_mispredict_ratio
> Testing tma_info_inst_mix_branch_mispredict_to_unknown_branch_ratio
> Testing tma_info_inst_mix_fpdiv_uop_ratio
> Testing tma_info_inst_mix_idiv_uop_ratio
> Testing tma_info_inst_mix_ipfarbranch
> Testing tma_info_inst_mix_ipmisp_cond_ntaken
> Testing tma_info_inst_mix_ipmisp_cond_taken
> Testing tma_info_inst_mix_ipmisp_indirect
> Testing tma_info_inst_mix_ipmisp_ret
> Testing tma_info_inst_mix_ipmispredict
> Testing tma_info_inst_mix_microcode_uop_ratio
> Testing tma_info_inst_mix_x87_uop_ratio
> Testing tma_info_l1_bound_address_alias_blocks
> Testing tma_info_l1_bound_load_splits
> Testing tma_info_l1_bound_store_fwd_blocks
> Testing tma_info_memory_cycles_per_demand_load_dram_hit
> Testing tma_info_memory_cycles_per_demand_load_l2_hit
> Testing tma_info_memory_cycles_per_demand_load_l3_hit
> Testing tma_info_memory_memloadpki
> Testing tma_info_system_kernel_cpi
> Testing tma_info_system_kernel_utilization
> Testing tma_data_sharing
> Testing tma_lock_latency
> Testing tma_fused_instructions
> Testing tma_info_pipeline_ipassist
> Testing tma_info_pipeline_retire
> Testing tma_info_pipeline_strings_cycles
> Testing tma_info_thread_clks
> Testing tma_info_thread_uoppi
> Testing tma_int_operations
> Testing tma_memory_operations
> Testing tma_non_fused_branches
> Testing tma_nop_instructions
> Testing tma_other_light_ops
> Testing tma_ports_utilization
> Testing tma_ports_utilized_0
> Testing tma_ports_utilized_1
> Metric 'tma_ports_utilized_1' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 49.581 usec (+- 0.030 usec)
> Average num. events: 47.000 (+- 0.000)
> Average time per event 1.055 usec
> Average data synthesis took: 53.367 usec (+- 0.032 usec)
> Average num. events: 246.000 (+- 0.000)
> Average time per event 0.217 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
> <not counted> cpu_core/topdown-retiring/ (0.00%)
> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
> <not counted> cpu_core/topdown-be-bound/ (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
>
> 1.180394056 seconds time elapsed
>
> 0.409881000 seconds user
> 0.764134000 seconds sys
> Testing tma_ports_utilized_2
> Metric 'tma_ports_utilized_2' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 50.456 usec (+- 0.066 usec)
> Average num. events: 47.000 (+- 0.000)
> Average time per event 1.074 usec
> Average data synthesis took: 52.904 usec (+- 0.030 usec)
> Average num. events: 246.000 (+- 0.000)
> Average time per event 0.215 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
> <not counted> cpu_core/topdown-retiring/ (0.00%)
> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
> <not counted> cpu_core/topdown-be-bound/ (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
>
> 1.187012938 seconds time elapsed
>
> 0.397919000 seconds user
> 0.782854000 seconds sys
> Testing tma_ports_utilized_3m
> Metric 'tma_ports_utilized_3m' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 48.248 usec (+- 0.028 usec)
> Average num. events: 47.000 (+- 0.000)
> Average time per event 1.027 usec
> Average data synthesis took: 52.781 usec (+- 0.036 usec)
> Average num. events: 245.000 (+- 0.000)
> Average time per event 0.215 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
> <not counted> cpu_core/topdown-retiring/ (0.00%)
> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
> <not counted> cpu_core/topdown-be-bound/ (0.00%)
> <not counted> cpu_core/UOPS_EXECUTED.CYCLES_GE_3/ (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
>
> 1.160315379 seconds time elapsed
>
> 0.388639000 seconds user
> 0.765363000 seconds sys
> Testing tma_serializing_operation
> Testing C10_Pkg_Residency
> Testing C1_Core_Residency
> Testing C2_Pkg_Residency
> Testing C3_Pkg_Residency
> Testing C6_Core_Residency
> Testing C6_Pkg_Residency
> Testing C7_Core_Residency
> Testing C7_Pkg_Residency
> Testing C8_Pkg_Residency
> Testing C9_Pkg_Residency
> Testing tma_info_system_average_frequency
> Testing tma_info_system_turbo_utilization
> Testing tma_info_inst_mix_ipswpf
> Testing tma_info_bottleneck_branching_overhead
> Testing tma_info_core_coreipc
> Testing tma_info_thread_ipc
> Testing tma_heavy_operations
> Testing tma_light_operations
> Testing tma_info_core_core_clks
> Testing tma_info_system_smt_2t_utilization
> Testing tma_info_thread_slots_utilization
> Testing UNCORE_FREQ
> Testing tma_info_system_socket_clks
> Testing tma_info_inst_mix_instructions
> Testing tma_info_thread_slots
> Testing tma_base
> Testing tma_ms_uops
> Testing tma_resource_bound
> Testing tma_alloc_restriction
> Testing tma_branch_detect
> Testing tma_branch_resteer
> Testing tma_cisc
> Testing tma_decode
> Testing tma_divider
> Testing tma_fast_nuke
> Testing tma_few_uops_instructions
> Testing tma_fpdiv_uops
> Testing tma_mem_scheduler
> Testing tma_non_mem_scheduler
> Testing tma_nuke
> Testing tma_other_fb
> Testing tma_other_load_store
> Testing tma_other_ret
> Testing tma_predecode
> Testing tma_register
> Testing tma_reorder_buffer
> Testing tma_serialization
> Testing tma_assists
> Testing tma_disambiguation
> Testing tma_fp_assist
> Testing tma_ld_buffer
> Testing tma_memory_ordering
> Testing tma_other_l1
> Testing tma_page_fault
> Testing tma_rsv
> Testing tma_smc
> Testing tma_split_loads
> Testing tma_split_stores
> Testing tma_st_buffer
> Testing tma_stlb_hit
> Testing tma_stlb_miss
> Testing tma_store_fwd_blk
> Testing tma_alu_op_utilization
> Testing tma_load_op_utilization
> Testing tma_mixing_vectors
> Testing tma_page_faults
> Testing tma_store_op_utilization
> Testing tma_memory_fence
> Metric 'tma_memory_fence' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 49.458 usec (+- 0.033 usec)
> Average num. events: 47.000 (+- 0.000)
> Average time per event 1.052 usec
> Average data synthesis took: 53.268 usec (+- 0.027 usec)
> Average num. events: 244.000 (+- 0.000)
> Average time per event 0.218 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
> <not counted> cpu_core/topdown-retiring/ (0.00%)
> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
> <not counted> cpu_core/topdown-be-bound/ (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
> <not counted> cpu_core/MISC2_RETIRED.LFENCE/ (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
>
> 1.177929044 seconds time elapsed
>
> 0.434552000 seconds user
> 0.736874000 seconds sys
> Testing tma_port_1
> Testing tma_port_6
> Testing tma_slow_pause
> Metric 'tma_slow_pause' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 49.987 usec (+- 0.049 usec)
> Average num. events: 47.000 (+- 0.000)
> Average time per event 1.064 usec
> Average data synthesis took: 53.490 usec (+- 0.033 usec)
> Average num. events: 245.000 (+- 0.000)
> Average time per event 0.218 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
> <not counted> cpu_core/topdown-retiring/ (0.00%)
> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
> <not counted> cpu_core/topdown-be-bound/ (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/ (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
>
> 1.186254766 seconds time elapsed
>
> 0.427220000 seconds user
> 0.752217000 seconds sys
> Testing smi_cycles
> Testing smi_num
> Testing tsx_aborted_cycles
> Testing tsx_cycles_per_elision
> Testing tsx_cycles_per_transaction
> Testing tsx_transactional_cycles
> test child finished with -1
> ---- end ----
> perf all metrics test: FAILED!
> root@number:~#

Try disabling the NMI watchdog. Agreed that there is more to fix here,
but I think the PMU driver is in part to blame, because manually
breaking the weak group of events is a fix. Fwiw, if we switch to the
buddy watchdog mechanism then we'll no longer need to disable the NMI
watchdog:
https://lore.kernel.org/lkml/20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/

Thanks,
Ian

> > ---
> > .../arch/x86/alderlake/adl-metrics.json | 15 ++++++++-------
> > .../arch/x86/rocketlake/rkl-metrics.json | 2 +-
> > 2 files changed, 9 insertions(+), 8 deletions(-)
> >
> > diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> > index 3388b58b8f1a..35124a4ddcb2 100644
> > --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> > +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> > @@ -69,12 +69,6 @@
> > "MetricName": "C9_Pkg_Residency",
> > "ScaleUnit": "100%"
> > },
> > - {
> > - "BriefDescription": "Uncore frequency per die [GHZ]",
> > - "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
> > - "MetricGroup": "SoC",
> > - "MetricName": "UNCORE_FREQ"
> > - },
> > {
> > "BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
> > "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
> > @@ -809,6 +803,13 @@
> > "ScaleUnit": "100%",
> > "Unit": "cpu_atom"
> > },
> > + {
> > + "BriefDescription": "Uncore frequency per die [GHZ]",
> > + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
> > + "MetricGroup": "SoC",
> > + "MetricName": "UNCORE_FREQ",
> > + "Unit": "cpu_core"
> > + },
> > {
> > "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
> > "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_DISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DISPATCHED.PORT_6@) / (5 * tma_info_core_core_clks)",
> > @@ -1838,7 +1839,7 @@
> > },
> > {
> > "BriefDescription": "Average number of parallel data read requests to external memory",
> > - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu_core@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@",
> > + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@",
> > "MetricGroup": "Mem;MemoryBW;SoC",
> > "MetricName": "tma_info_system_mem_parallel_reads",
> > "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches",
> > diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> > index 0c880e415669..27433fc15ede 100644
> > --- a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> > +++ b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> > @@ -985,7 +985,7 @@
> > },
> > {
> > "BriefDescription": "Average number of parallel data read requests to external memory",
> > - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@",
> > + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@",
> > "MetricGroup": "Mem;MemoryBW;SoC",
> > "MetricName": "tma_info_system_mem_parallel_reads",
> > "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
> > --
> > 2.43.0.472.g3155946c3a-goog
> >
>
> --
>
> - Arnaldo
>

2024-01-04 14:24:08

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes



On 2024-01-04 2:42 a.m., Ian Rogers wrote:
> Fix that the core PMU is being specified for 2 uncore events. Specify
> a PMU for the alderlake UNCORE_FREQ metric.
>
> Conversion script updated in:
> https://github.com/intel/perfmon/pull/126
>
> Reported-by: Arnaldo Carvalho de Melo <[email protected]>
> Closes: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Ian Rogers <[email protected]>

Thanks Ian. For the series,

Reviewed-by: Kan Liang <[email protected]>

Thanks,
Kan

> ---
> .../arch/x86/alderlake/adl-metrics.json | 15 ++++++++-------
> .../arch/x86/rocketlake/rkl-metrics.json | 2 +-
> 2 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> index 3388b58b8f1a..35124a4ddcb2 100644
> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> @@ -69,12 +69,6 @@
> "MetricName": "C9_Pkg_Residency",
> "ScaleUnit": "100%"
> },
> - {
> - "BriefDescription": "Uncore frequency per die [GHZ]",
> - "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
> - "MetricGroup": "SoC",
> - "MetricName": "UNCORE_FREQ"
> - },
> {
> "BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
> "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
> @@ -809,6 +803,13 @@
> "ScaleUnit": "100%",
> "Unit": "cpu_atom"
> },
> + {
> + "BriefDescription": "Uncore frequency per die [GHZ]",
> + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
> + "MetricGroup": "SoC",
> + "MetricName": "UNCORE_FREQ",
> + "Unit": "cpu_core"
> + },
> {
> "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
> "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_DISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DISPATCHED.PORT_6@) / (5 * tma_info_core_core_clks)",
> @@ -1838,7 +1839,7 @@
> },
> {
> "BriefDescription": "Average number of parallel data read requests to external memory",
> - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu_core@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@",
> + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@",
> "MetricGroup": "Mem;MemoryBW;SoC",
> "MetricName": "tma_info_system_mem_parallel_reads",
> "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches",
> diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> index 0c880e415669..27433fc15ede 100644
> --- a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
> @@ -985,7 +985,7 @@
> },
> {
> "BriefDescription": "Average number of parallel data read requests to external memory",
> - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@",
> + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@",
> "MetricGroup": "Mem;MemoryBW;SoC",
> "MetricName": "tma_info_system_mem_parallel_reads",
> "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"

2024-01-04 14:30:35

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes



On 2024-01-04 8:56 a.m., Ian Rogers wrote:
>> Testing tma_slow_pause
>> Metric 'tma_slow_pause' not printed in:
>> # Running 'internals/synthesize' benchmark:
>> Computing performance of single threaded perf event synthesis by
>> synthesizing events on the perf process itself:
>> Average synthesis took: 49.987 usec (+- 0.049 usec)
>> Average num. events: 47.000 (+- 0.000)
>> Average time per event 1.064 usec
>> Average data synthesis took: 53.490 usec (+- 0.033 usec)
>> Average num. events: 245.000 (+- 0.000)
>> Average time per event 0.218 usec
>>
>> Performance counter stats for 'perf bench internals synthesize':
>>
>> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
>> <not counted> cpu_core/topdown-retiring/ (0.00%)
>> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
>> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
>> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
>> <not counted> cpu_core/topdown-be-bound/ (0.00%)
>> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
>> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
>> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
>> <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/ (0.00%)
>> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
>> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
>> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
>> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
>> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
>>
>> 1.186254766 seconds time elapsed
>>
>> 0.427220000 seconds user
>> 0.752217000 seconds sys
>> Testing smi_cycles
>> Testing smi_num
>> Testing tsx_aborted_cycles
>> Testing tsx_cycles_per_elision
>> Testing tsx_cycles_per_transaction
>> Testing tsx_transactional_cycles
>> test child finished with -1
>> ---- end ----
>> perf all metrics test: FAILED!
>> root@number:~#
> Have a try disabling the NMI watchdog. Agreed that there is more to
> fix here but I think the PMU driver is in part to blame because
> manually breaking the weak group of events is a fix.

I think we have a NO_GROUP_EVENTS_NMI metric constraint to mark a group
which requires disabling the NMI watchdog.
Maybe we should mark this group's metric with NO_GROUP_EVENTS_NMI.

Thanks,
Kan

> Fwiw, if we
> switch to the buddy watchdog mechanism then we'll no longer need to
> disable the NMI watchdog:
> https://lore.kernel.org/lkml/20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/

2024-01-04 17:52:08

by Ian Rogers

[permalink] [raw]
Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes

On Thu, Jan 4, 2024 at 6:30 AM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2024-01-04 8:56 a.m., Ian Rogers wrote:
> >> Testing tma_slow_pause
> >> Metric 'tma_slow_pause' not printed in:
> >> # Running 'internals/synthesize' benchmark:
> >> Computing performance of single threaded perf event synthesis by
> >> synthesizing events on the perf process itself:
> >> Average synthesis took: 49.987 usec (+- 0.049 usec)
> >> Average num. events: 47.000 (+- 0.000)
> >> Average time per event 1.064 usec
> >> Average data synthesis took: 53.490 usec (+- 0.033 usec)
> >> Average num. events: 245.000 (+- 0.000)
> >> Average time per event 0.218 usec
> >>
> >> Performance counter stats for 'perf bench internals synthesize':
> >>
> >> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
> >> <not counted> cpu_core/topdown-retiring/ (0.00%)
> >> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
> >> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
> >> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
> >> <not counted> cpu_core/topdown-be-bound/ (0.00%)
> >> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
> >> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> >> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
> >> <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/ (0.00%)
> >> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
> >> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
> >> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
> >> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
> >> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
> >>
> >> 1.186254766 seconds time elapsed
> >>
> >> 0.427220000 seconds user
> >> 0.752217000 seconds sys
> >> Testing smi_cycles
> >> Testing smi_num
> >> Testing tsx_aborted_cycles
> >> Testing tsx_cycles_per_elision
> >> Testing tsx_cycles_per_transaction
> >> Testing tsx_transactional_cycles
> >> test child finished with -1
> >> ---- end ----
> >> perf all metrics test: FAILED!
> >> root@number:~#
> > Have a try disabling the NMI watchdog. Agreed that there is more to
> > fix here but I think the PMU driver is in part to blame because
> > manually breaking the weak group of events is a fix.
>
> I think we have a NO_GROUP_EVENTS_NMI metric constraint to mark a group
> which require disabling of the NMI watchdog.
> Maybe we should mark the group a NO_GROUP_EVENTS_NMI metric.

+Weilin due to the effects of event grouping.

Thanks Kan, NO_GROUP_EVENTS_NMI would be good. Something I see for
tma_ports_utilized_1 that may be worsening things is:

```
Testing tma_ports_utilized_1
Metric 'tma_ports_utilized_1' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 49.581 usec (+- 0.030 usec)
Average num. events: 47.000 (+- 0.000)
Average time per event 1.055 usec
Average data synthesis took: 53.367 usec (+- 0.032 usec)
Average num. events: 246.000 (+- 0.000)
Average time per event 0.217 usec

Performance counter stats for 'perf bench internals synthesize':

<not counted> cpu_core/TOPDOWN.SLOTS/
(0.00%)
<not counted> cpu_core/topdown-retiring/
(0.00%)
<not counted> cpu_core/topdown-mem-bound/
(0.00%)
<not counted> cpu_core/topdown-bad-spec/
(0.00%)
<not counted> cpu_core/topdown-fe-bound/
(0.00%)
<not counted> cpu_core/topdown-be-bound/
(0.00%)
<not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
(0.00%)
<not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/
(0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/
(0.00%)
<not counted> cpu_core/ARITH.DIV_ACTIVE/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
(0.00%)

1.180394056 seconds time elapsed

0.409881000 seconds user
0.764134000 seconds sys
```

The event EXE_ACTIVITY.1_PORTS_UTIL is repeated because the metric code
deduplicates events based purely on their name and so doesn't realize
EXE_ACTIVITY.1_PORTS_UTIL is the same as
cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@. This is a hybrid-only glitch, as we
only prefix events with a PMU for hybrid metrics; I should find out why
there's no PMU prefix on the one occurrence of EXE_ACTIVITY.1_PORTS_UTIL
and remove the cause.
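
To illustrate the glitch, a minimal Python sketch (toy helper names; the
real logic is C code inside the perf tool): deduplicating on the spelled
event name keeps both forms, while a normalized (pmu, event) key would
fold them together.

```
# Toy sketch of the deduplication issue; not the perf tool's actual code.
def normalize(event_id, default_pmu="cpu_core"):
    """Reduce 'pmu@EVENT@' and bare 'EVENT' to a common (pmu, event) key."""
    if "@" in event_id:
        pmu, name = event_id.split("@", 1)
        return (pmu, name.rstrip("@"))
    return (default_pmu, event_id)

ids = ["EXE_ACTIVITY.1_PORTS_UTIL",
       "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@"]

print(len(set(ids)))                      # 2: name-only dedup keeps both
print(len({normalize(i) for i in ids}))   # 1: (pmu, event) dedup merges them
```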

This problem doesn't occur for tma_slow_pause and I wondered if you
could give insight. That metric has the counters below:
```
$ perf stat -M tma_slow_pause -a sleep 0.1

Performance counter stats for 'system wide':

<not counted> cpu_core/TOPDOWN.SLOTS/
(0.00%)
<not counted> cpu_core/topdown-retiring/
(0.00%)
<not counted> cpu_core/topdown-mem-bound/
(0.00%)
<not counted> cpu_core/topdown-bad-spec/
(0.00%)
<not counted> cpu_core/topdown-fe-bound/
(0.00%)
<not counted> cpu_core/topdown-be-bound/
(0.00%)
<not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/
(0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/
(0.00%)
<not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/
(0.00%)
<not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/
(0.00%)
<not counted> cpu_core/ARITH.DIV_ACTIVE/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
(0.00%)
<not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
(0.00%)

0.102074888 seconds time elapsed
```

With -vv I see the event string is:
'{RESOURCE_STALLS.SCOREBOARD/metric-id=RESOURCE_STALLS.SCOREBOARD/,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL,metric-id=cpu_core!3EXE_ACTIVITY.1_PORTS_UTIL!3/,cpu_core/TOPDOWN.SLOTS,metric-id=cpu_core!3TOPDOWN.SLOTS!3/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS,metric-id=cpu_core!3EXE_ACTIVITY.BOUND_ON_LOADS!3/,cpu_core/topdown-retiring,metric-id=cpu_core!3topdown!1retiring!3/,cpu_core/topdown-mem-bound,metric-id=cpu_core!3topdown!1mem!1bound!3/,cpu_core/topdown-bad-spec,metric-id=cpu_core!3topdown!1bad!1spec!3/,CPU_CLK_UNHALTED.PAUSE/metric-id=CPU_CLK_UNHALTED.PAUSE/,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL,metric-id=cpu_core!3CYCLE_ACTIVITY.STALLS_TOTAL!3/,cpu_core/CPU_CLK_UNHALTED.THREAD,metric-id=cpu_core!3CPU_CLK_UNHALTED.THREAD!3/,cpu_core/ARITH.DIV_ACTIVE,metric-id=cpu_core!3ARITH.DIV_ACTIVE!3/,cpu_core/topdown-fe-bound,metric-id=cpu_core!3topdown!1fe!1bound!3/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc,metric-id=cpu_core!3EXE_ACTIVITY.2_PORTS_UTIL!0umask!20xc!3/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80,metric-id=cpu_core!3EXE_ACTIVITY.3_PORTS_UTIL!0umask!20x80!3/,cpu_core/topdown-be-bound,metric-id=cpu_core!3topdown!1be!1bound!3/}:W'

which without the metric-ids becomes:
'{RESOURCE_STALLS.SCOREBOARD,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/,cpu_core/TOPDOWN.SLOTS/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/,cpu_core/topdown-retiring/,cpu_core/topdown-mem-bound/,cpu_core/topdown-bad-spec/,CPU_CLK_UNHALTED.PAUSE,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/,cpu_core/CPU_CLK_UNHALTED.THREAD/,cpu_core/ARITH.DIV_ACTIVE/,cpu_core/topdown-fe-bound/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/,cpu_core/topdown-be-bound/}:W'
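
The !N escapes can be decoded with a small Python sketch; the mapping
below is inferred only from the two strings above, not taken from the
perf source, so treat it as illustrative.

```
# Inferred escape mapping: !0 -> ',', !1 -> '-', !2 -> '=', !3 -> '/'.
def decode_metric_id(s):
    return (s.replace("!0", ",").replace("!1", "-")
             .replace("!2", "=").replace("!3", "/"))

print(decode_metric_id("cpu_core!3EXE_ACTIVITY.2_PORTS_UTIL!0umask!20xc!3"))
# -> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
```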

I count 9 non-slots/top-down counters there, but I see
CPU_CLK_UNHALTED.THREAD can use fixed counter 1. Should perf_event_open
fail for a CPU that has a pinned use of a fixed counter when the group
needs that fixed counter? I'm guessing you don't want this, as
CPU_CLK_UNHALTED.THREAD can also go on a generic counter and the driver
doesn't want to track counter usage, though it seems feasible to add. I
guess we need a NO_GROUP_EVENTS_NMI whenever CPU_CLK_UNHALTED.THREAD is
an event and 8 generic counters are in use.
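
A back-of-the-envelope check of that counting (a Python sketch; the 8 GP
counters per core and the watchdog's cycles event pinning fixed counter
1 are assumptions, not verified here):

```
# Counter budget for the tma_slow_pause weak group quoted above.
group_events      = 15
slots_and_topdown = 1 + 5      # TOPDOWN.SLOTS + 5 topdown-* events
others            = group_events - slots_and_topdown   # 9, incl. CPU_CLK_UNHALTED.THREAD
gp_counters       = 8          # assumed per-core GP counters

# Watchdog off: CPU_CLK_UNHALTED.THREAD can take fixed counter 1.
print(others - 1 <= gp_counters)   # True  -> group can be scheduled
# Watchdog on: fixed counter 1 is pinned, THREAD falls back to a GP counter.
print(others <= gp_counters)       # False -> <not counted>
```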

Checking on Tigerlake I see:
```
$ perf stat -M tma_slow_pause -a sleep 0.1

Performance counter stats for 'system wide':

105,210,913 TOPDOWN.SLOTS # 0.1 %
tma_slow_pause (72.65%)
6,701,129 topdown-retiring
(72.65%)
52,359,712 topdown-fe-bound
(72.65%)
32,904,532 topdown-be-bound
(72.65%)
14,117,814 topdown-bad-spec
(72.65%)
6,602,391 RESOURCE_STALLS.SCOREBOARD
(76.17%)
4,220,773 cpu/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
(76.73%)
421,812 EXE_ACTIVITY.BOUND_ON_STORES
(76.69%)
5,164,088 EXE_ACTIVITY.1_PORTS_UTIL
(76.70%)
299,681 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/
(76.69%)
245 MISC_RETIRED.PAUSE_INST
(76.67%)
58,403,687 CPU_CLK_UNHALTED.THREAD
(76.72%)
25,297,841 CYCLE_ACTIVITY.STALLS_MEM_ANY
(76.67%)
3,788,772 EXE_ACTIVITY.2_PORTS_UTIL
(62.69%)
20,973,875 CYCLE_ACTIVITY.STALLS_TOTAL
(62.16%)
68,053 ARITH.DIVIDER_ACTIVE
(62.18%)

0.102624327 seconds time elapsed
```
so 10 generic counters, which would never fit, and the weak group is
broken - that difference in the metric explains why I've not been
seeing the issue. I think I need to add alderlake/sapphirerapids
constraints here:
https://github.com/captain5050/perfmon/blob/main/scripts/create_perf_json.py#L1382
Ideally we'd automate the constraint generation (or the PMU driver
would help us out by failing to open the weak group).

Thanks,
Ian


> Thanks,
> Kan
>
> > Fwiw, if we
> > switch to the buddy watchdog mechanism then we'll no longer need to
> > disable the NMI watchdog:
> > https://lore.kernel.org/lkml/20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/
>

2024-01-04 19:30:27

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes



On 2024-01-04 12:51 p.m., Ian Rogers wrote:
> On Thu, Jan 4, 2024 at 6:30 AM Liang, Kan <[email protected]> wrote:
>>
>>
>>
>> On 2024-01-04 8:56 a.m., Ian Rogers wrote:
>>>> Testing tma_slow_pause
>>>> Metric 'tma_slow_pause' not printed in:
>>>> # Running 'internals/synthesize' benchmark:
>>>> Computing performance of single threaded perf event synthesis by
>>>> synthesizing events on the perf process itself:
>>>> Average synthesis took: 49.987 usec (+- 0.049 usec)
>>>> Average num. events: 47.000 (+- 0.000)
>>>> Average time per event 1.064 usec
>>>> Average data synthesis took: 53.490 usec (+- 0.033 usec)
>>>> Average num. events: 245.000 (+- 0.000)
>>>> Average time per event 0.218 usec
>>>>
>>>> Performance counter stats for 'perf bench internals synthesize':
>>>>
>>>> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
>>>> <not counted> cpu_core/topdown-retiring/ (0.00%)
>>>> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
>>>> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
>>>> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
>>>> <not counted> cpu_core/topdown-be-bound/ (0.00%)
>>>> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
>>>> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
>>>> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
>>>> <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/ (0.00%)
>>>> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
>>>> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
>>>> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
>>>> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
>>>> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
>>>>
>>>> 1.186254766 seconds time elapsed
>>>>
>>>> 0.427220000 seconds user
>>>> 0.752217000 seconds sys
>>>> Testing smi_cycles
>>>> Testing smi_num
>>>> Testing tsx_aborted_cycles
>>>> Testing tsx_cycles_per_elision
>>>> Testing tsx_cycles_per_transaction
>>>> Testing tsx_transactional_cycles
>>>> test child finished with -1
>>>> ---- end ----
>>>> perf all metrics test: FAILED!
>>>> root@number:~#
>>> Have a try disabling the NMI watchdog. Agreed that there is more to
>>> fix here but I think the PMU driver is in part to blame because
>>> manually breaking the weak group of events is a fix.
>>
>> I think we have a NO_GROUP_EVENTS_NMI metric constraint to mark a group
>> which require disabling of the NMI watchdog.
>> Maybe we should mark the group a NO_GROUP_EVENTS_NMI metric.
>
> +Weilin due to the affects of event grouping.
>
> Thanks Kan, NO_GROUP_EVENTS_NMI would be good. Something I see for
> tma_ports_utilized_1 that may be worsening things is:
>
> ```
> Testing tma_ports_utilized_1
> Metric 'tma_ports_utilized_1' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 49.581 usec (+- 0.030 usec)
> Average num. events: 47.000 (+- 0.000)
> Average time per event 1.055 usec
> Average data synthesis took: 53.367 usec (+- 0.032 usec)
> Average num. events: 246.000 (+- 0.000)
> Average time per event 0.217 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/
> (0.00%)
> <not counted> cpu_core/topdown-retiring/
> (0.00%)
> <not counted> cpu_core/topdown-mem-bound/
> (0.00%)
> <not counted> cpu_core/topdown-bad-spec/
> (0.00%)
> <not counted> cpu_core/topdown-fe-bound/
> (0.00%)
> <not counted> cpu_core/topdown-be-bound/
> (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/
> (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/
> (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> (0.00%)
>
> 1.180394056 seconds time elapsed
>
> 0.409881000 seconds user
> 0.764134000 seconds sys
> ```
>
> The event EXE_ACTIVITY.1_PORTS_UTIL is repeated, this is because the
> metric code deduplicates events based purely on their name and so
> doesn't realize EXE_ACTIVITY.1_PORTS_UTIL is the same as
> cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@. This is a hybrid only glitch as
> we only prefix with a PMU for hybrid metrics, and I should find and
> remove why there's no PMU for the 1 case of EXE_ACTIVITY.1_PORTS_UTIL.
>
> This problem doesn't occur for tma_slow_pause and I wondered if you
> could give insight. That metric has the counters below:
> ```
> $ perf stat -M tma_slow_pause -a sleep 0.1
>
> Performance counter stats for 'system wide':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/
> (0.00%)
> <not counted> cpu_core/topdown-retiring/
> (0.00%)
> <not counted> cpu_core/topdown-mem-bound/
> (0.00%)
> <not counted> cpu_core/topdown-bad-spec/
> (0.00%)
> <not counted> cpu_core/topdown-fe-bound/
> (0.00%)
> <not counted> cpu_core/topdown-be-bound/
> (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/
> (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/
> (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/
> (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/
> (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> (0.00%)
>
> 0.102074888 seconds time elapsed
> ```
>
> With -vv I see the event string is:
> '{RESOURCE_STALLS.SCOREBOARD/metric-id=RESOURCE_STALLS.SCOREBOARD/,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL,metric-id=cpu_core!3EXE_ACTIVITY.1_PORTS_UTIL!3/,cpu_core/TOPDOWN.SLOTS,metric-id=cpu_core!3TOPDOWN.SLOTS!3/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS,metric-id=cpu_core!3EXE_ACTIVITY.BOUND_ON_LOADS!3/,cpu_core/topdown-retiring,metric-id=cpu_core!3topdown!1retiring!3/,cpu_core/topdown-mem-bound,metric-id=cpu_core!3topdown!1mem!1bound!3/,cpu_core/topdown-bad-spec,metric-id=cpu_core!3topdown!1bad!1spec!3/,CPU_CLK_UNHALTED.PAUSE/metric-id=CPU_CLK_UNHALTED.PAUSE/,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL,metric-id=cpu_core!3CYCLE_ACTIVITY.STALLS_TOTAL!3/,cpu_core/CPU_CLK_UNHALTED.THREAD,metric-id=cpu_core!3CPU_CLK_UNHALTED.THREAD!3/,cpu_core/ARITH.DIV_ACTIVE,metric-id=cpu_core!3ARITH.DIV_ACTIVE!3/,cpu_core/topdown-fe-bound,metric-id=cpu_core!3topdown!1fe!1bound!3/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc,metric-id=cpu_core!3EXE_ACTIVITY.2_PORTS_UTIL!0umask!20xc!3/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80,metric-id=cpu_core!3EXE_ACTIVITY.3_PORTS_UTIL!0umask!20x80!3/,cpu_core/topdown-be-bound,metric-id=cpu_core!3topdown!1be!1bound!3/}:W'
>
> which without the metric-ids becomes:
> '{RESOURCE_STALLS.SCOREBOARD,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/,cpu_core/TOPDOWN.SLOTS/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/,cpu_core/topdown-retiring/,cpu_core/topdown-mem-bound/,cpu_core/topdown-bad-spec/,CPU_CLK_UNHALTED.PAUSE,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/,cpu_core/CPU_CLK_UNHALTED.THREAD/,cpu_core/ARITH.DIV_ACTIVE/,cpu_core/topdown-fe-bound/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/,cpu_core/topdown-be-bound/}:W'
>
> I count 9 none slots/top-down counters there, but I see
> CPU_CLK_UNHALTED.THREAD can use fixed counter 1. Should
> perf_event_open fail for a CPU that has a pinned use of a fixed
> counter and the group needs the fixed counter?

I tried, but the idea was rejected.

> I'm guessing you don't
> want this as CPU_CLK_UNHALTED.THREAD can also go on a generic counter
> and the driver doesn't want to count counter usage, it seems feasible
> to add it though. I guess we need a NO_GROUP_EVENTS_NMI whenever
> CPU_CLK_UNHALTED.THREAD is an event and 8 generic counters are in use.

Yes, it looks good to me.

>
> Checking on Tigerlake I see:
> ```
> $ perf stat -M tma_slow_pause -a sleep 0.1
>
> Performance counter stats for 'system wide':
>
> 105,210,913 TOPDOWN.SLOTS # 0.1 %
> tma_slow_pause (72.65%)
> 6,701,129 topdown-retiring
> (72.65%)
> 52,359,712 topdown-fe-bound
> (72.65%)
> 32,904,532 topdown-be-bound
> (72.65%)
> 14,117,814 topdown-bad-spec
> (72.65%)
> 6,602,391 RESOURCE_STALLS.SCOREBOARD
> (76.17%)
> 4,220,773 cpu/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> (76.73%)
> 421,812 EXE_ACTIVITY.BOUND_ON_STORES
> (76.69%)
> 5,164,088 EXE_ACTIVITY.1_PORTS_UTIL
> (76.70%)
> 299,681 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/
> (76.69%)
> 245 MISC_RETIRED.PAUSE_INST
> (76.67%)
> 58,403,687 CPU_CLK_UNHALTED.THREAD
> (76.72%)
> 25,297,841 CYCLE_ACTIVITY.STALLS_MEM_ANY
> (76.67%)
> 3,788,772 EXE_ACTIVITY.2_PORTS_UTIL
> (62.69%)
> 20,973,875 CYCLE_ACTIVITY.STALLS_TOTAL
> (62.16%)
> 68,053 ARITH.DIVIDER_ACTIVE
> (62.18%)
>
> 0.102624327 seconds time elapsed
> ```
> so 10 generic counters which would never fit and the weak group is
> broken - the difference in the metric explaining why I've not been
> seeing the issue. I think I need to add alderlake/sapphirerapids
> constraints here:
> https://github.com/captain5050/perfmon/blob/main/scripts/create_perf_json.py#L1382
> Ideally we'd automate the constraint generation (or the PMU driver
> would help us out by failing to open the weak group).

Yes, automation would be great. NO_GROUP_EVENTS_NMI can be set for a
group which has CPU_CLK_UNHALTED.THREAD and where the number of core
events (except topdown) == the max number of GP counters + 1.
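
A minimal Python sketch of that rule, as it might sit in the conversion
script (hypothetical helper names and data shapes, only to make the rule
concrete; create_perf_json.py's real internals differ):

```
# Sketch: tag a metric NO_GROUP_EVENTS_NMI per the rule above.
TOPDOWN = {"TOPDOWN.SLOTS", "topdown-retiring", "topdown-bad-spec",
           "topdown-fe-bound", "topdown-be-bound", "topdown-mem-bound"}

def needs_nmi_constraint(core_events, max_gp_counters=8):
    """CPU_CLK_UNHALTED.THREAD present and the count of non-slots/topdown
    core events == max GP counters + 1."""
    others = [e for e in core_events if e not in TOPDOWN]
    return ("CPU_CLK_UNHALTED.THREAD" in others and
            len(others) == max_gp_counters + 1)

def tag_metric(metric, core_events):
    if needs_nmi_constraint(core_events):
        metric["MetricConstraint"] = "NO_GROUP_EVENTS_NMI"
    return metric

# tma_slow_pause's 15 events from earlier in the thread: 9 non-topdown
# events with 8 GP counters, so the metric gets the constraint.
events = ["TOPDOWN.SLOTS", "topdown-retiring", "topdown-mem-bound",
          "topdown-bad-spec", "topdown-fe-bound", "topdown-be-bound",
          "RESOURCE_STALLS.SCOREBOARD", "EXE_ACTIVITY.1_PORTS_UTIL",
          "EXE_ACTIVITY.BOUND_ON_LOADS", "CPU_CLK_UNHALTED.PAUSE",
          "CYCLE_ACTIVITY.STALLS_TOTAL", "CPU_CLK_UNHALTED.THREAD",
          "ARITH.DIV_ACTIVE", "EXE_ACTIVITY.2_PORTS_UTIL",
          "EXE_ACTIVITY.3_PORTS_UTIL"]
print(tag_metric({"MetricName": "tma_slow_pause"}, events))
```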

Thanks,
Kan
>
> Thanks,
> Ian
>
>
>> Thanks,
>> Kan
>>
>>> Fwiw, if we
>>> switch to the buddy watchdog mechanism then we'll no longer need to
>>> disable the NMI watchdog:
>>> https://lore.kernel.org/lkml/20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/
>>
>

2024-01-04 20:37:17

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes

Em Thu, Jan 04, 2024 at 05:56:22AM -0800, Ian Rogers escreveu:
> On Thu, Jan 4, 2024 at 4:39 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > Em Wed, Jan 03, 2024 at 11:42:56PM -0800, Ian Rogers escreveu:
> > > Fix that the core PMU is being specified for 2 uncore events. Specify
> > > a PMU for the alderlake UNCORE_FREQ metric.
<SNIP>
> > 101: perf all metricgroups test : Ok
> > 102: perf all metrics test : FAILED!
> > 107: perf metrics value validation : Ok

> > 102 is now failing due to some other problem:

> > root@number:~# perf test -v 102
> > 102: perf all metrics test :
> > --- start ---
> > test child forked, pid 2701034
> > Testing tma_core_bound
> > Testing tma_info_core_ilp
<SNIP>
> > Testing tma_memory_fence
> > Metric 'tma_memory_fence' not printed in:
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> > Average synthesis took: 49.458 usec (+- 0.033 usec)
> > Average num. events: 47.000 (+- 0.000)
> > Average time per event 1.052 usec
> > Average data synthesis took: 53.268 usec (+- 0.027 usec)
> > Average num. events: 244.000 (+- 0.000)
> > Average time per event 0.218 usec

> > Performance counter stats for 'perf bench internals synthesize':

> > <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
> > <not counted> cpu_core/topdown-retiring/ (0.00%)
> > <not counted> cpu_core/topdown-mem-bound/ (0.00%)
> > <not counted> cpu_core/topdown-bad-spec/ (0.00%)
> > <not counted> cpu_core/topdown-fe-bound/ (0.00%)
> > <not counted> cpu_core/topdown-be-bound/ (0.00%)
> > <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
> > <not counted> cpu_core/MISC2_RETIRED.LFENCE/ (0.00%)
> > <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
> > <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
> > <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)

> > 1.177929044 seconds time elapsed

> > 0.434552000 seconds user
> > 0.736874000 seconds sys
> > Testing tma_port_1
<SNIP>
> > test child finished with -1
> > ---- end ----
> > perf all metrics test: FAILED!
> > root@number:~#

> Have a try disabling the NMI watchdog. Agreed that there is more to

Did the trick, added this to the cset log message:

--------------------------------------- 8< ----------------------------
Test 102 is failing for another reason: not being able to get as many
counters as needed. Ian Rogers suggested disabling the NMI watchdog to
have more counters available:

root@number:/home/acme# cat /proc/sys/kernel/nmi_watchdog
1
root@number:/home/acme# echo 0 > /proc/sys/kernel/nmi_watchdog
root@number:/home/acme# perf test 102
102: perf all metrics test : Ok
root@number:/home/acme#
--------------------------------------- 8< ----------------------------

- Arnaldo

> fix here but I think the PMU driver is in part to blame because
> manually breaking the weak group of events is a fix. Fwiw, if we
> switch to the buddy watchdog mechanism then we'll no longer need to
> disable the NMI watchdog:
> https://lore.kernel.org/lkml/20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/

2024-01-04 23:32:07

by Ian Rogers

[permalink] [raw]
Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes

On Thu, Jan 4, 2024 at 11:30 AM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2024-01-04 12:51 p.m., Ian Rogers wrote:
> > On Thu, Jan 4, 2024 at 6:30 AM Liang, Kan <[email protected]> wrote:
> >>
> >>
> >>
> >> On 2024-01-04 8:56 a.m., Ian Rogers wrote:
> >>>> Testing tma_slow_pause
> >>>> Metric 'tma_slow_pause' not printed in:
> >>>> # Running 'internals/synthesize' benchmark:
> >>>> Computing performance of single threaded perf event synthesis by
> >>>> synthesizing events on the perf process itself:
> >>>> Average synthesis took: 49.987 usec (+- 0.049 usec)
> >>>> Average num. events: 47.000 (+- 0.000)
> >>>> Average time per event 1.064 usec
> >>>> Average data synthesis took: 53.490 usec (+- 0.033 usec)
> >>>> Average num. events: 245.000 (+- 0.000)
> >>>> Average time per event 0.218 usec
> >>>>
> >>>> Performance counter stats for 'perf bench internals synthesize':
> >>>>
> >>>> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
> >>>> <not counted> cpu_core/topdown-retiring/ (0.00%)
> >>>> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
> >>>> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
> >>>> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
> >>>> <not counted> cpu_core/topdown-be-bound/ (0.00%)
> >>>> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
> >>>> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
> >>>> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
> >>>> <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/ (0.00%)
> >>>> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
> >>>> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
> >>>> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
> >>>> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
> >>>> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
> >>>>
> >>>> 1.186254766 seconds time elapsed
> >>>>
> >>>> 0.427220000 seconds user
> >>>> 0.752217000 seconds sys
> >>>> Testing smi_cycles
> >>>> Testing smi_num
> >>>> Testing tsx_aborted_cycles
> >>>> Testing tsx_cycles_per_elision
> >>>> Testing tsx_cycles_per_transaction
> >>>> Testing tsx_transactional_cycles
> >>>> test child finished with -1
> >>>> ---- end ----
> >>>> perf all metrics test: FAILED!
> >>>> root@number:~#
> >>> Have a try disabling the NMI watchdog. Agreed that there is more to
> >>> fix here but I think the PMU driver is in part to blame because
> >>> manually breaking the weak group of events is a fix.
> >>
> >> I think we have a NO_GROUP_EVENTS_NMI metric constraint to mark a group
> >> which require disabling of the NMI watchdog.
> >> Maybe we should mark the group a NO_GROUP_EVENTS_NMI metric.
> >
> > +Weilin due to the affects of event grouping.
> >
> > Thanks Kan, NO_GROUP_EVENTS_NMI would be good. Something I see for
> > tma_ports_utilized_1 that may be worsening things is:
> >
> > ```
> > Testing tma_ports_utilized_1
> > Metric 'tma_ports_utilized_1' not printed in:
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> > Average synthesis took: 49.581 usec (+- 0.030 usec)
> > Average num. events: 47.000 (+- 0.000)
> > Average time per event 1.055 usec
> > Average data synthesis took: 53.367 usec (+- 0.032 usec)
> > Average num. events: 246.000 (+- 0.000)
> > Average time per event 0.217 usec
> >
> > Performance counter stats for 'perf bench internals synthesize':
> >
> > <not counted> cpu_core/TOPDOWN.SLOTS/
> > (0.00%)
> > <not counted> cpu_core/topdown-retiring/
> > (0.00%)
> > <not counted> cpu_core/topdown-mem-bound/
> > (0.00%)
> > <not counted> cpu_core/topdown-bad-spec/
> > (0.00%)
> > <not counted> cpu_core/topdown-fe-bound/
> > (0.00%)
> > <not counted> cpu_core/topdown-be-bound/
> > (0.00%)
> > <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> > (0.00%)
> > <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/
> > (0.00%)
> > <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/
> > (0.00%)
> > <not counted> cpu_core/ARITH.DIV_ACTIVE/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> > (0.00%)
> >
> > 1.180394056 seconds time elapsed
> >
> > 0.409881000 seconds user
> > 0.764134000 seconds sys
> > ```
> >
> > The event EXE_ACTIVITY.1_PORTS_UTIL is repeated, this is because the
> > metric code deduplicates events based purely on their name and so
> > doesn't realize EXE_ACTIVITY.1_PORTS_UTIL is the same as
> > cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@. This is a hybrid only glitch as
> > we only prefix with a PMU for hybrid metrics, and I should find and
> > remove why there's no PMU for the 1 case of EXE_ACTIVITY.1_PORTS_UTIL.
> >
> > This problem doesn't occur for tma_slow_pause and I wondered if you
> > could give insight. That metric has the counters below:
> > ```
> > $ perf stat -M tma_slow_pause -a sleep 0.1
> >
> > Performance counter stats for 'system wide':
> >
> > <not counted> cpu_core/TOPDOWN.SLOTS/
> > (0.00%)
> > <not counted> cpu_core/topdown-retiring/
> > (0.00%)
> > <not counted> cpu_core/topdown-mem-bound/
> > (0.00%)
> > <not counted> cpu_core/topdown-bad-spec/
> > (0.00%)
> > <not counted> cpu_core/topdown-fe-bound/
> > (0.00%)
> > <not counted> cpu_core/topdown-be-bound/
> > (0.00%)
> > <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/
> > (0.00%)
> > <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/
> > (0.00%)
> > <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/
> > (0.00%)
> > <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/
> > (0.00%)
> > <not counted> cpu_core/ARITH.DIV_ACTIVE/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
> > (0.00%)
> > <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> > (0.00%)
> >
> > 0.102074888 seconds time elapsed
> > ```
> >
> > With -vv I see the event string is:
> > '{RESOURCE_STALLS.SCOREBOARD/metric-id=RESOURCE_STALLS.SCOREBOARD/,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL,metric-id=cpu_core!3EXE_ACTIVITY.1_PORTS_UTIL!3/,cpu_core/TOPDOWN.SLOTS,metric-id=cpu_core!3TOPDOWN.SLOTS!3/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS,metric-id=cpu_core!3EXE_ACTIVITY.BOUND_ON_LOADS!3/,cpu_core/topdown-retiring,metric-id=cpu_core!3topdown!1retiring!3/,cpu_core/topdown-mem-bound,metric-id=cpu_core!3topdown!1mem!1bound!3/,cpu_core/topdown-bad-spec,metric-id=cpu_core!3topdown!1bad!1spec!3/,CPU_CLK_UNHALTED.PAUSE/metric-id=CPU_CLK_UNHALTED.PAUSE/,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL,metric-id=cpu_core!3CYCLE_ACTIVITY.STALLS_TOTAL!3/,cpu_core/CPU_CLK_UNHALTED.THREAD,metric-id=cpu_core!3CPU_CLK_UNHALTED.THREAD!3/,cpu_core/ARITH.DIV_ACTIVE,metric-id=cpu_core!3ARITH.DIV_ACTIVE!3/,cpu_core/topdown-fe-bound,metric-id=cpu_core!3topdown!1fe!1bound!3/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc,metric-id=cpu_core!3EXE_ACTIVITY.2_PORTS_UTIL!0umask!20xc!3/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80,metric-id=cpu_core!3EXE_ACTIVITY.3_PORTS_UTIL!0umask!20x80!3/,cpu_core/topdown-be-bound,metric-id=cpu_core!3topdown!1be!1bound!3/}:W'
> >
> > which without the metric-ids becomes:
> > '{RESOURCE_STALLS.SCOREBOARD,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/,cpu_core/TOPDOWN.SLOTS/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/,cpu_core/topdown-retiring/,cpu_core/topdown-mem-bound/,cpu_core/topdown-bad-spec/,CPU_CLK_UNHALTED.PAUSE,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/,cpu_core/CPU_CLK_UNHALTED.THREAD/,cpu_core/ARITH.DIV_ACTIVE/,cpu_core/topdown-fe-bound/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/,cpu_core/topdown-be-bound/}:W'
> >
> > I count 9 none slots/top-down counters there, but I see
> > CPU_CLK_UNHALTED.THREAD can use fixed counter 1. Should
> > perf_event_open fail for a CPU that has a pinned use of a fixed
> > counter and the group needs the fixed counter?
>
> I tried, but the idea was rejected.
>
> > I'm guessing you don't
> > want this as CPU_CLK_UNHALTED.THREAD can also go on a generic counter
> > and the driver doesn't want to count counter usage, it seems feasible
> > to add it though. I guess we need a NO_GROUP_EVENTS_NMI whenever
> > CPU_CLK_UNHALTED.THREAD is an event and 8 generic counters are in use.
>
> Yes, it looks good to me.

Fixes all sent out, see the following and its links:
https://lore.kernel.org/lkml/[email protected]/

Thanks,
Ian

> >
> > Checking on Tigerlake I see:
> > ```
> > $ perf stat -M tma_slow_pause -a sleep 0.1
> >
> > Performance counter stats for 'system wide':
> >
> > 105,210,913 TOPDOWN.SLOTS # 0.1 %
> > tma_slow_pause (72.65%)
> > 6,701,129 topdown-retiring
> > (72.65%)
> > 52,359,712 topdown-fe-bound
> > (72.65%)
> > 32,904,532 topdown-be-bound
> > (72.65%)
> > 14,117,814 topdown-bad-spec
> > (72.65%)
> > 6,602,391 RESOURCE_STALLS.SCOREBOARD
> > (76.17%)
> > 4,220,773 cpu/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> > (76.73%)
> > 421,812 EXE_ACTIVITY.BOUND_ON_STORES
> > (76.69%)
> > 5,164,088 EXE_ACTIVITY.1_PORTS_UTIL
> > (76.70%)
> > 299,681 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/
> > (76.69%)
> > 245 MISC_RETIRED.PAUSE_INST
> > (76.67%)
> > 58,403,687 CPU_CLK_UNHALTED.THREAD
> > (76.72%)
> > 25,297,841 CYCLE_ACTIVITY.STALLS_MEM_ANY
> > (76.67%)
> > 3,788,772 EXE_ACTIVITY.2_PORTS_UTIL
> > (62.69%)
> > 20,973,875 CYCLE_ACTIVITY.STALLS_TOTAL
> > (62.16%)
> > 68,053 ARITH.DIVIDER_ACTIVE
> > (62.18%)
> >
> > 0.102624327 seconds time elapsed
> > ```
> > so 10 generic counters which would never fit and the weak group is
> > broken - the difference in the metric explaining why I've not been
> > seeing the issue. I think I need to add alderlake/sapphirerapids
> > constraints here:
> > https://github.com/captain5050/perfmon/blob/main/scripts/create_perf_json.py#L1382
> > Ideally we'd automate the constraint generation (or the PMU driver
> > would help us out by failing to open the weak group).
>
> Yes, an automation will be great. The NO_GROUP_EVENTS_NMI can be set for
> a group which has CPU_CLK_UNHALTED.THREAD and the number of core events
> (expect topdown) == the max number of GP counters + 1.
>
> Thanks,
> Kan
> >
> > Thanks,
> > Ian
> >
> >
> >> Thanks,
> >> Kan
> >>
> >>> Fwiw, if we
> >>> switch to the buddy watchdog mechanism then we'll no longer need to
> >>> disable the NMI watchdog:
> >>> https://lore.kernel.org/lkml/20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/
> >>
> >