The metric tma_info_pipeline_retire uses uops_retired.slots together
with perf metric events. Patch 1 fixes the event sorting so that
uops_retired.slots isn't made the group leader, as that role must go
to topdown.slots.
Patches 2 and 3 update the meteorlake and sapphirerapids events.
Patch 4 addresses an issue with event grouping discussed in:
https://lore.kernel.org/lkml/[email protected]/
by adding and altering metric constraints. The constraints avoid
groups for metrics where the kernel PMU driver fails to open the
group (the trigger for the weak group being removed).
Ian Rogers (4):
perf parse-events x86: Avoid sorting uops_retired.slots
perf vendor events intel: Update meteorlake to 1.04
perf vendor events intel: Update sapphirerapids to 1.15
perf vendor events intel: Update Icelake+ metric constraints
tools/perf/arch/x86/util/evlist.c | 7 +-
tools/perf/arch/x86/util/evsel.c | 7 +-
.../arch/x86/alderlake/adl-metrics.json | 11 +-
.../arch/x86/alderlaken/adln-metrics.json | 2 +
.../arch/x86/icelake/icl-metrics.json | 10 +-
.../arch/x86/icelakex/icx-metrics.json | 10 +-
tools/perf/pmu-events/arch/x86/mapfile.csv | 4 +-
.../pmu-events/arch/x86/meteorlake/cache.json | 165 ++++++++++++++++++
.../arch/x86/meteorlake/floating-point.json | 8 +
.../arch/x86/meteorlake/frontend.json | 56 ++++++
.../arch/x86/meteorlake/memory.json | 80 +++++++++
.../pmu-events/arch/x86/meteorlake/other.json | 16 ++
.../arch/x86/meteorlake/pipeline.json | 159 +++++++++++++++++
.../arch/x86/rocketlake/rkl-metrics.json | 10 +-
.../arch/x86/sapphirerapids/other.json | 18 ++
.../arch/x86/sapphirerapids/spr-metrics.json | 9 +-
.../arch/x86/tigerlake/tgl-metrics.json | 10 +-
17 files changed, 549 insertions(+), 33 deletions(-)
--
2.41.0.585.gd2178a4bd4-goog
As topdown.slots may appear as just "slots", it can be confused with
uops_retired.slots, which is an invalid perf metric event group
leader. Special-case uops_retired.slots to avoid this confusion.
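For illustration, a minimal user-space sketch of the intended
ordering (simplified from arch_evlist__cmp(); the standalone program
and sample event names are assumptions for illustration, not perf
code):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* "slots" sorts first so it becomes the perf metric group leader,
   * but uops_retired.slots must not match as a leader. */
  static int is_leader_slots(const char *name)
  {
          return strcasestr(name, "slots") &&
                 !strcasestr(name, "uops_retired.slots");
  }

  static int cmp(const void *a, const void *b)
  {
          const char *lhs = *(const char * const *)a;
          const char *rhs = *(const char * const *)b;

          if (is_leader_slots(lhs))
                  return -1;
          if (is_leader_slots(rhs))
                  return 1;
          /* Followed by topdown events, then everything else. */
          if (strcasestr(lhs, "topdown") && !strcasestr(rhs, "topdown"))
                  return -1;
          if (strcasestr(rhs, "topdown") && !strcasestr(lhs, "topdown"))
                  return 1;
          return 0;
  }

  int main(void)
  {
          const char *evs[] = { "uops_retired.slots", "topdown-retiring", "slots" };

          qsort(evs, 3, sizeof(evs[0]), cmp);
          for (int i = 0; i < 3; i++)
                  printf("%s\n", evs[i]);
          /* Prints: slots, topdown-retiring, uops_retired.slots. */
          return 0;
  }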
Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/arch/x86/util/evlist.c | 7 ++++---
tools/perf/arch/x86/util/evsel.c | 7 +++----
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/tools/perf/arch/x86/util/evlist.c b/tools/perf/arch/x86/util/evlist.c
index cbd582182932..b1ce0c52d88d 100644
--- a/tools/perf/arch/x86/util/evlist.c
+++ b/tools/perf/arch/x86/util/evlist.c
@@ -75,11 +75,12 @@ int arch_evlist__add_default_attrs(struct evlist *evlist,
int arch_evlist__cmp(const struct evsel *lhs, const struct evsel *rhs)
{
- if (topdown_sys_has_perf_metrics() && evsel__sys_has_perf_metrics(lhs)) {
+ if (topdown_sys_has_perf_metrics() &&
+ (arch_evsel__must_be_in_group(lhs) || arch_evsel__must_be_in_group(rhs))) {
/* Ensure the topdown slots comes first. */
- if (strcasestr(lhs->name, "slots"))
+ if (strcasestr(lhs->name, "slots") && !strcasestr(lhs->name, "uops_retired.slots"))
return -1;
- if (strcasestr(rhs->name, "slots"))
+ if (strcasestr(rhs->name, "slots") && !strcasestr(rhs->name, "uops_retired.slots"))
return 1;
/* Followed by topdown events. */
if (strcasestr(lhs->name, "topdown") && !strcasestr(rhs->name, "topdown"))
diff --git a/tools/perf/arch/x86/util/evsel.c b/tools/perf/arch/x86/util/evsel.c
index 81d22657922a..090d0f371891 100644
--- a/tools/perf/arch/x86/util/evsel.c
+++ b/tools/perf/arch/x86/util/evsel.c
@@ -40,12 +40,11 @@ bool evsel__sys_has_perf_metrics(const struct evsel *evsel)
bool arch_evsel__must_be_in_group(const struct evsel *evsel)
{
- if (!evsel__sys_has_perf_metrics(evsel))
+ if (!evsel__sys_has_perf_metrics(evsel) || !evsel->name ||
+ strcasestr(evsel->name, "uops_retired.slots"))
return false;
- return evsel->name &&
- (strcasestr(evsel->name, "slots") ||
- strcasestr(evsel->name, "topdown"));
+ return strcasestr(evsel->name, "topdown") || strcasestr(evsel->name, "slots");
}
int arch_evsel__hw_name(struct evsel *evsel, char *bf, size_t size)
--
2.41.0.585.gd2178a4bd4-goog
The 1.15 events were released in:
https://github.com/intel/perfmon/commit/76dfb81a1148ec049fd9caae9c62529404da63df
Adds the events OCR.DEMAND_DATA_RD.LOCAL_SOCKET_PMM and
OCR.DEMAND_DATA_RD.PMM.
Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
.../arch/x86/sapphirerapids/other.json | 18 ++++++++++++++++++
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 9020d7a23c91..3a8770e29fe8 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -24,7 +24,7 @@ GenuineIntel-6-1[AEF],v3,nehalemep,core
GenuineIntel-6-2E,v3,nehalemex,core
GenuineIntel-6-A7,v1.01,rocketlake,core
GenuineIntel-6-2A,v19,sandybridge,core
-GenuineIntel-6-(8F|CF),v1.14,sapphirerapids,core
+GenuineIntel-6-(8F|CF),v1.15,sapphirerapids,core
GenuineIntel-6-AF,v1.00,sierraforest,core
GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core
GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v57,skylake,core
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/other.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/other.json
index 31b6be9fb8c7..442ef3807a9d 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/other.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/other.json
@@ -76,6 +76,24 @@
"SampleAfterValue": "100003",
"UMask": "0x1"
},
+ {
+ "BriefDescription": "Counts demand data reads that were supplied by PMM attached to this socket, whether or not in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts PMM accesses that are controlled by the close or distant SNC Cluster.",
+ "EventCode": "0x2A,0x2B",
+ "EventName": "OCR.DEMAND_DATA_RD.LOCAL_SOCKET_PMM",
+ "MSRIndex": "0x1a6,0x1a7",
+ "MSRValue": "0x700C00001",
+ "SampleAfterValue": "100003",
+ "UMask": "0x1"
+ },
+ {
+ "BriefDescription": "Counts demand data reads that were supplied by PMM.",
+ "EventCode": "0x2A,0x2B",
+ "EventName": "OCR.DEMAND_DATA_RD.PMM",
+ "MSRIndex": "0x1a6,0x1a7",
+ "MSRValue": "0x703C00001",
+ "SampleAfterValue": "100003",
+ "UMask": "0x1"
+ },
{
"BriefDescription": "Counts demand data reads that were supplied by DRAM attached to another socket.",
"EventCode": "0x2A,0x2B",
--
2.41.0.585.gd2178a4bd4-goog
The 1.04 events were released in:
https://github.com/intel/perfmon/commit/44fe3681501f43fc515577aced8e944b187c8e51
Addition of 51 core events.
Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
.../pmu-events/arch/x86/meteorlake/cache.json | 165 ++++++++++++++++++
.../arch/x86/meteorlake/floating-point.json | 8 +
.../arch/x86/meteorlake/frontend.json | 56 ++++++
.../arch/x86/meteorlake/memory.json | 80 +++++++++
.../pmu-events/arch/x86/meteorlake/other.json | 16 ++
.../arch/x86/meteorlake/pipeline.json | 159 +++++++++++++++++
7 files changed, 485 insertions(+), 1 deletion(-)
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 6650100830c4..9020d7a23c91 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -19,7 +19,7 @@ GenuineIntel-6-3A,v24,ivybridge,core
GenuineIntel-6-3E,v23,ivytown,core
GenuineIntel-6-2D,v23,jaketown,core
GenuineIntel-6-(57|85),v10,knightslanding,core
-GenuineIntel-6-A[AC],v1.03,meteorlake,core
+GenuineIntel-6-A[AC],v1.04,meteorlake,core
GenuineIntel-6-1[AEF],v3,nehalemep,core
GenuineIntel-6-2E,v3,nehalemex,core
GenuineIntel-6-A7,v1.01,rocketlake,core
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/cache.json b/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
index e1ae7c92f38e..1de0200b32f6 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
@@ -36,6 +36,15 @@
"UMask": "0x2",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Number of cycles a demand request has waited due to L1D due to lack of L2 resources.",
+ "EventCode": "0x48",
+ "EventName": "L1D_PEND_MISS.L2_STALLS",
+ "PublicDescription": "Counts number of cycles a demand request has waited due to L1D due to lack of L2 resources. Demand requests include cacheable/uncacheable demand load, store, lock or SW prefetch accesses.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Number of L1D misses that are outstanding",
"EventCode": "0x48",
@@ -260,6 +269,15 @@
"UMask": "0x40",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Cycles when L1D is locked",
+ "EventCode": "0x42",
+ "EventName": "LOCK_CYCLES.CACHE_LOCK_DURATION",
+ "PublicDescription": "This event counts the number of cycles when the L1D is locked. It is a superset of the 0x1 mask (BUS_LOCK_CLOCKS.BUS_LOCK_DURATION).",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts the number of cacheable memory requests that miss in the LLC. Counts on a per core basis.",
"EventCode": "0x2e",
@@ -514,6 +532,17 @@
"UMask": "0x4",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
+ "Data_LA": "1",
+ "EventCode": "0xd2",
+ "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS",
+ "PEBS": "1",
+ "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
+ "SampleAfterValue": "20011",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Retired load instructions whose data sources were hits in L3 without snoops required",
"Data_LA": "1",
@@ -730,6 +759,14 @@
"UMask": "0x1",
"Unit": "cpu_atom"
},
+ {
+ "BriefDescription": "MEM_STORE_RETIRED.L2_HIT",
+ "EventCode": "0x44",
+ "EventName": "MEM_STORE_RETIRED.L2_HIT",
+ "SampleAfterValue": "200003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts the number of load ops retired.",
"Data_LA": "1",
@@ -977,6 +1014,15 @@
"UMask": "0x8",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Cacheable and Non-Cacheable code read requests",
+ "EventCode": "0x21",
+ "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD",
+ "PublicDescription": "Counts both cacheable and Non-Cacheable code read requests.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Demand Data Read requests sent to uncore",
"EventCode": "0x21",
@@ -995,6 +1041,89 @@
"UMask": "0x4",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Cycles when offcore outstanding cacheable Core Data Read transactions are present in SuperQueue (SQ), queue to uncore.",
+ "CounterMask": "1",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD",
+ "PublicDescription": "Counts cycles when offcore outstanding cacheable Core Data Read transactions are present in the super queue. A transaction is considered to be in the Offcore outstanding state between L2 miss and transaction completion sent to requestor (SQ de-allocation). See corresponding Umask under OFFCORE_REQUESTS.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x8",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles with offcore outstanding Code Reads transactions in the SuperQueue (SQ), queue to uncore.",
+ "CounterMask": "1",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD",
+ "PublicDescription": "Counts the number of offcore outstanding Code Reads transactions in the super queue every cycle. The 'Offcore outstanding' state of the transaction lasts from the L2 miss until the sending transaction completion to requestor (SQ deallocation). See the corresponding Umask under OFFCORE_REQUESTS.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles where at least 1 outstanding demand data read request is pending.",
+ "CounterMask": "1",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles with offcore outstanding demand rfo reads transactions in SuperQueue (SQ), queue to uncore.",
+ "CounterMask": "1",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO",
+ "PublicDescription": "Counts the number of offcore outstanding demand rfo Reads transactions in the super queue every cycle. The 'Offcore outstanding' state of the transaction lasts from the L2 miss until the sending transaction completion to requestor (SQ deallocation). See the corresponding Umask under OFFCORE_REQUESTS.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x8",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Offcore outstanding Code Reads transactions in the SuperQueue (SQ), queue to uncore, every cycle.",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD",
+ "PublicDescription": "Counts the number of offcore outstanding Code Reads transactions in the super queue every cycle. The 'Offcore outstanding' state of the transaction lasts from the L2 miss until the sending transaction completion to requestor (SQ deallocation). See the corresponding Umask under OFFCORE_REQUESTS.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "For every cycle, increments by the number of outstanding demand data read requests pending.",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD",
+ "PublicDescription": "For every cycle, increments by the number of outstanding demand data read requests pending. Requests are considered outstanding from the time they miss the core's L2 cache until the transaction completion message is sent to the requestor.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles with at least 6 offcore outstanding Demand Data Read transactions in uncore queue.",
+ "CounterMask": "6",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD_GE_6",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Store Read transactions pending for off-core. Highly correlated.",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO",
+ "PublicDescription": "Counts the number of off-core outstanding read-for-ownership (RFO) store transactions every cycle. An RFO transaction is considered to be in the Off-core outstanding state between L2 cache miss and transaction completion.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts bus locks, accounts for cache line split locks and UC locks.",
"EventCode": "0x2c",
@@ -1004,6 +1133,42 @@
"UMask": "0x10",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Number of PREFETCHNTA instructions executed.",
+ "EventCode": "0x40",
+ "EventName": "SW_PREFETCH_ACCESS.NTA",
+ "PublicDescription": "Counts the number of PREFETCHNTA instructions executed.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Number of PREFETCHW instructions executed.",
+ "EventCode": "0x40",
+ "EventName": "SW_PREFETCH_ACCESS.PREFETCHW",
+ "PublicDescription": "Counts the number of PREFETCHW instructions executed.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x8",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Number of PREFETCHT0 instructions executed.",
+ "EventCode": "0x40",
+ "EventName": "SW_PREFETCH_ACCESS.T0",
+ "PublicDescription": "Counts the number of PREFETCHT0 instructions executed.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Number of PREFETCHT1 or PREFETCHT2 instructions executed.",
+ "EventCode": "0x40",
+ "EventName": "SW_PREFETCH_ACCESS.T1_T2",
+ "PublicDescription": "Counts the number of PREFETCHT1 or PREFETCHT2 instructions executed.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts the number of issue slots every cycle that were not delivered by the frontend due to an icache miss",
"EventCode": "0x71",
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/floating-point.json b/tools/perf/pmu-events/arch/x86/meteorlake/floating-point.json
index 616489f0974a..f66506ee37ef 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/floating-point.json
@@ -41,6 +41,14 @@
"UMask": "0x2",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "FP_ARITH_DISPATCHED.PORT_5",
+ "EventCode": "0xb3",
+ "EventName": "FP_ARITH_DISPATCHED.PORT_5",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
"EventCode": "0xc7",
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json b/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json
index 0f064518d1c0..8264419500a5 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json
@@ -43,6 +43,14 @@
"UMask": "0x2",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "DSB_FILL.FB_STALL_OT",
+ "EventCode": "0x62",
+ "EventName": "DSB_FILL.FB_STALL_OT",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x10",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Retired ANT branches",
"EventCode": "0xc6",
@@ -55,6 +63,30 @@
"UMask": "0x3",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Retired Instructions who experienced DSB miss.",
+ "EventCode": "0xc6",
+ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS",
+ "MSRIndex": "0x3F7",
+ "MSRValue": "0x1",
+ "PEBS": "1",
+ "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.",
+ "SampleAfterValue": "100007",
+ "UMask": "0x3",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Retired Instructions who experienced a critical DSB miss.",
+ "EventCode": "0xc6",
+ "EventName": "FRONTEND_RETIRED.DSB_MISS",
+ "MSRIndex": "0x3F7",
+ "MSRValue": "0x11",
+ "PEBS": "1",
+ "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.",
+ "SampleAfterValue": "100007",
+ "UMask": "0x3",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to ITLB miss",
"EventCode": "0xc6",
@@ -88,6 +120,18 @@
"UMask": "0x3",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Retired Instructions who experienced Instruction L2 Cache true miss.",
+ "EventCode": "0xc6",
+ "EventName": "FRONTEND_RETIRED.L2_MISS",
+ "MSRIndex": "0x3F7",
+ "MSRValue": "0x13",
+ "PEBS": "1",
+ "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.",
+ "SampleAfterValue": "100007",
+ "UMask": "0x3",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Retired instructions after front-end starvation of at least 1 cycle",
"EventCode": "0xc6",
@@ -243,6 +287,18 @@
"UMask": "0x3",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Retired Instructions who experienced STLB (2nd level TLB) true miss.",
+ "EventCode": "0xc6",
+ "EventName": "FRONTEND_RETIRED.STLB_MISS",
+ "MSRIndex": "0x3F7",
+ "MSRValue": "0x15",
+ "PEBS": "1",
+ "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.",
+ "SampleAfterValue": "100007",
+ "UMask": "0x3",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "FRONTEND_RETIRED.UNKNOWN_BRANCH",
"EventCode": "0xc6",
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/memory.json b/tools/perf/pmu-events/arch/x86/meteorlake/memory.json
index 67e949b4c789..2605e1d0ba9f 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/memory.json
@@ -66,6 +66,15 @@
"UMask": "0x84",
"Unit": "cpu_atom"
},
+ {
+ "BriefDescription": "Number of machine clears due to memory ordering conflicts.",
+ "EventCode": "0xc3",
+ "EventName": "MACHINE_CLEARS.MEMORY_ORDERING",
+ "PublicDescription": "Counts the number of Machine Clears detected dye to memory ordering. Memory Ordering Machine Clears may apply when a memory read may not conform to the memory ordering rules of the x86 architecture",
+ "SampleAfterValue": "100003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Execution stalls while L1 cache miss demand load is outstanding.",
"CounterMask": "3",
@@ -95,6 +104,35 @@
"UMask": "0x9",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "MEMORY_ORDERING.MD_NUKE",
+ "EventCode": "0x09",
+ "EventName": "MEMORY_ORDERING.MD_NUKE",
+ "SampleAfterValue": "100003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Counts the number of memory ordering machine clears due to memory renaming.",
+ "EventCode": "0x09",
+ "EventName": "MEMORY_ORDERING.MRN_NUKE",
+ "SampleAfterValue": "100003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles.",
+ "Data_LA": "1",
+ "EventCode": "0xcd",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024",
+ "MSRIndex": "0x3F6",
+ "MSRValue": "0x400",
+ "PEBS": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles. Reported latency may be longer than just the memory latency.",
+ "SampleAfterValue": "53",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles.",
"Data_LA": "1",
@@ -121,6 +159,19 @@
"UMask": "0x1",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 2048 cycles.",
+ "Data_LA": "1",
+ "EventCode": "0xcd",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2048",
+ "MSRIndex": "0x3F6",
+ "MSRValue": "0x800",
+ "PEBS": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 2048 cycles. Reported latency may be longer than just the memory latency.",
+ "SampleAfterValue": "23",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles.",
"Data_LA": "1",
@@ -235,5 +286,34 @@
"SampleAfterValue": "100003",
"UMask": "0x10",
"Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles where data return is pending for a Demand Data Read request who miss L3 cache.",
+ "CounterMask": "1",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_L3_MISS_DEMAND_DATA_RD",
+ "PublicDescription": "Cycles with at least 1 Demand Data Read requests who miss L3 cache in the superQ.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x10",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "For every cycle, increments by the number of demand data read requests pending that are known to have missed the L3 cache.",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD",
+ "PublicDescription": "For every cycle, increments by the number of demand data read requests pending that are known to have missed the L3 cache. Note that this does not capture all elapsed cycles while requests are outstanding - only cycles from when the requests were known by the requesting core to have missed the L3 cache.",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x10",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles where the core is waiting on at least 6 outstanding demand data read requests known to have missed the L3 cache.",
+ "CounterMask": "6",
+ "EventCode": "0x20",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD_GE_6",
+ "PublicDescription": "Cycles where the core is waiting on at least 6 outstanding demand data read requests known to have missed the L3 cache. Note that this event does not capture all elapsed cycles while the requests are outstanding - only cycles from when the requests were known to have missed the L3 cache.",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x10",
+ "Unit": "cpu_core"
}
]
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/other.json b/tools/perf/pmu-events/arch/x86/meteorlake/other.json
index 2ec57f487525..f4c603599df4 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/other.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/other.json
@@ -1,4 +1,12 @@
[
+ {
+ "BriefDescription": "ASSISTS.PAGE_FAULT",
+ "EventCode": "0xc1",
+ "EventName": "ASSISTS.PAGE_FAULT",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x8",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts streaming stores that have any type of response.",
"EventCode": "0x2A,0x2B",
@@ -30,6 +38,14 @@
"UMask": "0x7",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "RS.EMPTY_RESOURCE",
+ "EventCode": "0xa5",
+ "EventName": "RS.EMPTY_RESOURCE",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts the number of issue slots in a UMWAIT or TPAUSE instruction where no uop issues due to the instruction putting the CPU into the C0.1 activity state. For Tremont, UMWAIT and TPAUSE will only put the CPU into C0.1 activity state (not C0.2 activity state)",
"EventCode": "0x75",
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json b/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json
index eeaa7a97f71c..352c5efafc06 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json
@@ -311,6 +311,16 @@
"UMask": "0x60",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "This event counts the number of mispredicted ret instructions retired. Non PEBS",
+ "EventCode": "0xc5",
+ "EventName": "BR_MISP_RETIRED.RET",
+ "PEBS": "1",
+ "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.",
+ "SampleAfterValue": "100007",
+ "UMask": "0x8",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts the number of mispredicted near RET branch instructions retired.",
"EventCode": "0xc5",
@@ -329,6 +339,33 @@
"UMask": "0x48",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Core clocks when the thread is in the C0.1 light-weight slower wakeup time but more power saving optimized state.",
+ "EventCode": "0xec",
+ "EventName": "CPU_CLK_UNHALTED.C01",
+ "PublicDescription": "Counts core clocks when the thread is in the C0.1 light-weight slower wakeup time but more power saving optimized state. This state can be entered via the TPAUSE or UMWAIT instructions.",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x10",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Core clocks when the thread is in the C0.2 light-weight faster wakeup time but less power saving optimized state.",
+ "EventCode": "0xec",
+ "EventName": "CPU_CLK_UNHALTED.C02",
+ "PublicDescription": "Counts core clocks when the thread is in the C0.2 light-weight faster wakeup time but less power saving optimized state. This state can be entered via the TPAUSE or UMWAIT instructions.",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x20",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Core clocks when the thread is in the C0.1 or C0.2 or running a PAUSE in C0 ACPI state.",
+ "EventCode": "0xec",
+ "EventName": "CPU_CLK_UNHALTED.C0_WAIT",
+ "PublicDescription": "Counts core clocks when the thread is in the C0.1 or C0.2 power saving optimized states (TPAUSE or UMWAIT instructions) or running the PAUSE instruction.",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x70",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Fixed Counter: Counts the number of unhalted core clock cycles",
"EventName": "CPU_CLK_UNHALTED.CORE",
@@ -361,6 +398,24 @@
"UMask": "0x2",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "CPU_CLK_UNHALTED.PAUSE",
+ "EventCode": "0xec",
+ "EventName": "CPU_CLK_UNHALTED.PAUSE",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x40",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "CPU_CLK_UNHALTED.PAUSE_INST",
+ "CounterMask": "1",
+ "EdgeDetect": "1",
+ "EventCode": "0xec",
+ "EventName": "CPU_CLK_UNHALTED.PAUSE_INST",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x40",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Core crystal clock cycles. Cycle counts are evenly distributed between active threads in the Core.",
"EventCode": "0x3c",
@@ -602,6 +657,15 @@
"UMask": "0x10",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Retired NOP instructions.",
+ "EventCode": "0xc0",
+ "EventName": "INST_RETIRED.NOP",
+ "PublicDescription": "Counts all retired NOP or ENDBR32/64 or PREFETCHIT0/1 instructions",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Precise instruction retired with PEBS precise-distribution",
"EventName": "INST_RETIRED.PREC_DIST",
@@ -611,6 +675,15 @@
"UMask": "0x1",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Iterations of Repeat string retired instructions.",
+ "EventCode": "0xc0",
+ "EventName": "INST_RETIRED.REP_ITERATION",
+ "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.",
+ "SampleAfterValue": "2000003",
+ "UMask": "0x8",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Cycles the Backend cluster is recovering after a miss-speculation or a Store Buffer or Load Buffer drain stall.",
"CounterMask": "1",
@@ -621,6 +694,17 @@
"UMask": "0x3",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Clears speculative count",
+ "CounterMask": "1",
+ "EdgeDetect": "1",
+ "EventCode": "0xad",
+ "EventName": "INT_MISC.CLEARS_COUNT",
+ "PublicDescription": "Counts the number of speculative clears due to any type of branch misprediction or machine clears",
+ "SampleAfterValue": "500009",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts cycles after recovery from a branch misprediction or machine clear till the first uop is issued from the resteered path.",
"EventCode": "0xad",
@@ -630,6 +714,15 @@
"UMask": "0x80",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Cycles when Resource Allocation Table (RAT) external stall is sent to Instruction Decode Queue (IDQ) for the thread",
+ "EventCode": "0xad",
+ "EventName": "INT_MISC.RAT_STALLS",
+ "PublicDescription": "This event counts the number of cycles during which Resource Allocation Table (RAT) external stall is sent to Instruction Decode Queue (IDQ) for the current thread. This also includes the cycles during which the Allocator is serving another thread.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x8",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Core cycles the allocator was stalled due to recovery from earlier clear event for this thread",
"EventCode": "0xad",
@@ -733,6 +826,15 @@
"UMask": "0x4",
"Unit": "cpu_atom"
},
+ {
+ "BriefDescription": "False dependencies in MOB due to partial compare on address.",
+ "EventCode": "0x03",
+ "EventName": "LD_BLOCKS.ADDRESS_ALIAS",
+ "PublicDescription": "Counts the number of times a load got blocked due to false dependencies in MOB due to partial compare on address.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts the number of retired loads that are blocked because its address exactly matches an older store whose data is not ready.",
"EventCode": "0x03",
@@ -742,6 +844,15 @@
"UMask": "0x1",
"Unit": "cpu_atom"
},
+ {
+ "BriefDescription": "The number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use.",
+ "EventCode": "0x03",
+ "EventName": "LD_BLOCKS.NO_SR",
+ "PublicDescription": "Counts the number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x88",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts the number of retired loads that are blocked because its address partially overlapped with an older store.",
"EventCode": "0x03",
@@ -751,6 +862,15 @@
"UMask": "0x2",
"Unit": "cpu_atom"
},
+ {
+ "BriefDescription": "Loads blocked due to overlapping with a preceding store that cannot be forwarded.",
+ "EventCode": "0x03",
+ "EventName": "LD_BLOCKS.STORE_FORWARD",
+ "PublicDescription": "Counts the number of times where store forwarding was prevented for a load operation. The most common case is a load blocked due to the address of memory access (partially) overlapping with a preceding uncompleted store. Note: See the table of not supported store forwards in the Optimization Guide.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x82",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Cycles Uops delivered by the LSD, but didn't come from the decoder.",
"CounterMask": "1",
@@ -823,6 +943,24 @@
"UMask": "0x1",
"Unit": "cpu_atom"
},
+ {
+ "BriefDescription": "Self-modifying code (SMC) detected.",
+ "EventCode": "0xc3",
+ "EventName": "MACHINE_CLEARS.SMC",
+ "PublicDescription": "Counts self-modifying code (SMC) detected, which causes a machine clear.",
+ "SampleAfterValue": "100003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "LFENCE instructions retired",
+ "EventCode": "0xe0",
+ "EventName": "MISC2_RETIRED.LFENCE",
+ "PublicDescription": "number of LFENCE retired instructions",
+ "SampleAfterValue": "400009",
+ "UMask": "0x20",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Counts cycles where the pipeline is stalled due to serializing operations.",
"EventCode": "0xa2",
@@ -1260,6 +1398,16 @@
"UMask": "0x1",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Cycles with retired uop(s).",
+ "CounterMask": "1",
+ "EventCode": "0xc2",
+ "EventName": "UOPS_RETIRED.CYCLES",
+ "PublicDescription": "Counts cycles where at least one uop has retired.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Retired uops except the last uop of each instruction.",
"EventCode": "0xc2",
@@ -1306,6 +1454,17 @@
"UMask": "0x2",
"Unit": "cpu_core"
},
+ {
+ "BriefDescription": "Cycles without actually retired uops.",
+ "CounterMask": "1",
+ "EventCode": "0xc2",
+ "EventName": "UOPS_RETIRED.STALLS",
+ "Invert": "1",
+ "PublicDescription": "This event counts cycles without actually retired uops.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x2",
+ "Unit": "cpu_core"
+ },
{
"BriefDescription": "Cycles with less than 10 actually retired uops.",
"CounterMask": "10",
--
2.41.0.585.gd2178a4bd4-goog
Avoid grouping events, especially in cases where the kernel's PMU
driver fails to open the events as a group, causing the events to be
reported as "<not counted>".
This update comes from:
https://github.com/intel/perfmon/pull/94
This fixes issues reported with the patch:
https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Ian Rogers <[email protected]>
---
.../pmu-events/arch/x86/alderlake/adl-metrics.json | 11 +++++++----
.../pmu-events/arch/x86/alderlaken/adln-metrics.json | 2 ++
.../perf/pmu-events/arch/x86/icelake/icl-metrics.json | 10 ++++++----
.../pmu-events/arch/x86/icelakex/icx-metrics.json | 10 ++++++----
.../pmu-events/arch/x86/rocketlake/rkl-metrics.json | 10 ++++++----
.../arch/x86/sapphirerapids/spr-metrics.json | 9 +++++----
.../pmu-events/arch/x86/tigerlake/tgl-metrics.json | 10 ++++++----
7 files changed, 38 insertions(+), 24 deletions(-)
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
index daf9458f0b77..c6780d5c456b 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
@@ -558,6 +558,7 @@
},
{
"BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the Last Level Cache (LLC) or other core with HITE/F/M.",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "cpu_atom@MEM_BOUND_STALLS.LOAD_LLC_HIT@ / tma_info_core_clks - max((cpu_atom@MEM_BOUND_STALLS.LOAD@ - cpu_atom@LD_HEAD.L1_MISS_AT_RET@) / tma_info_core_clks, 0) * cpu_atom@MEM_BOUND_STALLS.LOAD_LLC_HIT@ / cpu_atom@MEM_BOUND_STALLS.LOAD@",
"MetricGroup": "TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
@@ -800,6 +801,7 @@
},
{
"BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a store forward block.",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "LD_HEAD.ST_ADDR_AT_RET / tma_info_core_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_store_fwd_blk",
@@ -1058,7 +1060,6 @@
},
{
"BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
"MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
"MetricName": "tma_fp_arith",
@@ -1230,6 +1231,7 @@
},
{
"BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
"MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
"MetricName": "tma_info_botlnk_l2_ic_misses",
@@ -1267,6 +1269,7 @@
},
{
"BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))",
"MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW",
"MetricName": "tma_info_bottleneck_memory_bandwidth",
@@ -1355,7 +1358,6 @@
},
{
"BriefDescription": "Floating Point Operations Per Cycle",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR_SINGLE@ + cpu_core@FP_ARITH_INST_RETIRED.SCALAR_DOUBLE@ + 2 * cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * (cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@ + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@) + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / tma_info_core_core_clks",
"MetricGroup": "Flops;Ret",
"MetricName": "tma_info_core_flopc",
@@ -1363,7 +1365,6 @@
},
{
"BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.PORT_0@ + cpu_core@FP_ARITH_DISPATCHED.PORT_1@ + cpu_core@FP_ARITH_DISPATCHED.PORT_5@) / (2 * tma_info_core_core_clks)",
"MetricGroup": "Cor;Flops;HPC",
"MetricName": "tma_info_core_fp_arith_utilization",
@@ -1769,7 +1770,6 @@
},
{
"BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
"MetricGroup": "Pipeline;Ret",
"MetricName": "tma_info_pipeline_retire",
@@ -2002,6 +2002,7 @@
},
{
"BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@ - cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@) / tma_info_thread_clks",
"MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
@@ -2375,6 +2376,7 @@
},
{
"BriefDescription": "This metric represents rate of split store accesses",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
"MetricName": "tma_split_stores",
@@ -2405,6 +2407,7 @@
},
{
"BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_store_fwd_blk",
diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json
index 0f1628d698da..06e67e34e1bf 100644
--- a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json
@@ -466,6 +466,7 @@
},
{
"BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the Last Level Cache (LLC) or other core with HITE/F/M.",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / tma_info_core_clks - max((MEM_BOUND_STALLS.LOAD - LD_HEAD.L1_MISS_AT_RET) / tma_info_core_clks, 0) * MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_BOUND_STALLS.LOAD",
"MetricGroup": "TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
@@ -682,6 +683,7 @@
},
{
"BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a store forward block.",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "LD_HEAD.ST_ADDR_AT_RET / tma_info_core_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_store_fwd_blk",
diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
index 8fcc05c4e0a1..a6eed0d9a26d 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
@@ -85,6 +85,7 @@
},
{
"BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_4k_aliasing",
@@ -319,7 +320,6 @@
},
{
"BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
"MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
"MetricName": "tma_fp_arith",
@@ -464,6 +464,7 @@
},
{
"BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
"MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
"MetricName": "tma_info_botlnk_l2_ic_misses",
@@ -497,6 +498,7 @@
},
{
"BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))",
"MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW",
"MetricName": "tma_info_bottleneck_memory_bandwidth",
@@ -574,14 +576,12 @@
},
{
"BriefDescription": "Floating Point Operations Per Cycle",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_core_clks",
"MetricGroup": "Flops;Ret",
"MetricName": "tma_info_core_flopc"
},
{
"BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_core_clks)",
"MetricGroup": "Cor;Flops;HPC",
"MetricName": "tma_info_core_fp_arith_utilization",
@@ -927,7 +927,6 @@
},
{
"BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
"MetricGroup": "Pipeline;Ret",
"MetricName": "tma_info_pipeline_retire"
@@ -1100,6 +1099,7 @@
},
{
"BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
"MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
@@ -1419,6 +1419,7 @@
},
{
"BriefDescription": "This metric represents rate of split store accesses",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
"MetricName": "tma_split_stores",
@@ -1446,6 +1447,7 @@
},
{
"BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_store_fwd_blk",
diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
index 9bb7e3f20f7f..7082ad5ba961 100644
--- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
@@ -289,6 +289,7 @@
},
{
"BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_4k_aliasing",
@@ -523,7 +524,6 @@
},
{
"BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
"MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
"MetricName": "tma_fp_arith",
@@ -668,6 +668,7 @@
},
{
"BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
"MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
"MetricName": "tma_info_botlnk_l2_ic_misses",
@@ -701,6 +702,7 @@
},
{
"BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))",
"MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW",
"MetricName": "tma_info_bottleneck_memory_bandwidth",
@@ -778,14 +780,12 @@
},
{
"BriefDescription": "Floating Point Operations Per Cycle",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_core_clks",
"MetricGroup": "Flops;Ret",
"MetricName": "tma_info_core_flopc"
},
{
"BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_core_clks)",
"MetricGroup": "Cor;Flops;HPC",
"MetricName": "tma_info_core_fp_arith_utilization",
@@ -1144,7 +1144,6 @@
},
{
"BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
"MetricGroup": "Pipeline;Ret",
"MetricName": "tma_info_pipeline_retire"
@@ -1369,6 +1368,7 @@
},
{
"BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
"MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
@@ -1715,6 +1715,7 @@
},
{
"BriefDescription": "This metric represents rate of split store accesses",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
"MetricName": "tma_split_stores",
@@ -1742,6 +1743,7 @@
},
{
"BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_store_fwd_blk",
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
index 1bb9cededa56..a0191c8b708d 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
@@ -85,6 +85,7 @@
},
{
"BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_4k_aliasing",
@@ -319,7 +320,6 @@
},
{
"BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
"MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
"MetricName": "tma_fp_arith",
@@ -464,6 +464,7 @@
},
{
"BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
"MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
"MetricName": "tma_info_botlnk_l2_ic_misses",
@@ -497,6 +498,7 @@
},
{
"BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))",
"MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW",
"MetricName": "tma_info_bottleneck_memory_bandwidth",
@@ -574,14 +576,12 @@
},
{
"BriefDescription": "Floating Point Operations Per Cycle",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_core_clks",
"MetricGroup": "Flops;Ret",
"MetricName": "tma_info_core_flopc"
},
{
"BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_core_clks)",
"MetricGroup": "Cor;Flops;HPC",
"MetricName": "tma_info_core_fp_arith_utilization",
@@ -933,7 +933,6 @@
},
{
"BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
"MetricGroup": "Pipeline;Ret",
"MetricName": "tma_info_pipeline_retire"
@@ -1126,6 +1125,7 @@
},
{
"BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
"MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
@@ -1445,6 +1445,7 @@
},
{
"BriefDescription": "This metric represents rate of split store accesses",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
"MetricName": "tma_split_stores",
@@ -1472,6 +1473,7 @@
},
{
"BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_store_fwd_blk",
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
index c207c851a9f9..222212abd811 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
@@ -553,7 +553,6 @@
},
{
"BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector + tma_fp_amx",
"MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
"MetricName": "tma_fp_arith",
@@ -717,6 +716,7 @@
},
{
"BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
"MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
"MetricName": "tma_info_botlnk_l2_ic_misses",
@@ -750,6 +750,7 @@
},
{
"BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))",
"MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW",
"MetricName": "tma_info_bottleneck_memory_bandwidth",
@@ -827,14 +828,12 @@
},
{
"BriefDescription": "Floating Point Operations Per Cycle",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF + 4 * AMX_OPS_RETIRED.BF16",
"MetricGroup": "Flops;Ret",
"MetricName": "tma_info_core_flopc"
},
{
"BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.PORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)",
"MetricGroup": "Cor;Flops;HPC",
"MetricName": "tma_info_core_fp_arith_utilization",
@@ -1216,7 +1215,6 @@
},
{
"BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
"MetricGroup": "Pipeline;Ret",
"MetricName": "tma_info_pipeline_retire"
@@ -1467,6 +1465,7 @@
},
{
"BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
"MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
@@ -1841,6 +1840,7 @@
},
{
"BriefDescription": "This metric represents rate of split store accesses",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
"MetricName": "tma_split_stores",
@@ -1868,6 +1868,7 @@
},
{
"BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_store_fwd_blk",
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
index c7c2d6ab1a93..fab084e1bc69 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
@@ -79,6 +79,7 @@
},
{
"BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_4k_aliasing",
@@ -313,7 +314,6 @@
},
{
"BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
"MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
"MetricName": "tma_fp_arith",
@@ -458,6 +458,7 @@
},
{
"BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
"MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
"MetricName": "tma_info_botlnk_l2_ic_misses",
@@ -491,6 +492,7 @@
},
{
"BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks",
+ "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))",
"MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW",
"MetricName": "tma_info_bottleneck_memory_bandwidth",
@@ -568,14 +570,12 @@
},
{
"BriefDescription": "Floating Point Operations Per Cycle",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_core_clks",
"MetricGroup": "Flops;Ret",
"MetricName": "tma_info_core_flopc"
},
{
"BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_core_clks)",
"MetricGroup": "Cor;Flops;HPC",
"MetricName": "tma_info_core_fp_arith_utilization",
@@ -927,7 +927,6 @@
},
{
"BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
- "MetricConstraint": "NO_GROUP_EVENTS",
"MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
"MetricGroup": "Pipeline;Ret",
"MetricName": "tma_info_pipeline_retire"
@@ -1114,6 +1113,7 @@
},
{
"BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
"MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
"MetricName": "tma_l3_bound",
@@ -1433,6 +1433,7 @@
},
{
"BriefDescription": "This metric represents rate of split store accesses",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
"MetricName": "tma_split_stores",
@@ -1460,6 +1461,7 @@
},
{
"BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
"MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
"MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
"MetricName": "tma_store_fwd_blk",
--
2.41.0.585.gd2178a4bd4-goog
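For context on the two constraint values toggled above: as the names
suggest, NO_GROUP_EVENTS appears to drop the weak group for a metric's
events unconditionally, while NO_GROUP_EVENTS_NMI drops it only while
the NMI watchdog is holding a counter. Below is a minimal sketch of
that decision, with illustrative names rather than perf's actual
internals:

#include <stdbool.h>
#include <string.h>

/* Illustrative sketch only, assuming the semantics suggested by the
 * constraint names; these are not perf's real types or functions. */
static bool use_weak_group(const char *metric_constraint,
			   bool nmi_watchdog_active)
{
	if (metric_constraint) {
		if (!strcmp(metric_constraint, "NO_GROUP_EVENTS"))
			return false; /* never group this metric's events */
		if (!strcmp(metric_constraint, "NO_GROUP_EVENTS_NMI"))
			return !nmi_watchdog_active; /* group only without the watchdog */
	}
	return true; /* unconstrained metrics keep the weak group */
}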
On 2023-08-01 1:36 a.m., Ian Rogers wrote:
> As topdown.slots may appear as slots it may get confused with
> uops_retired.slots which is an invalid perf metric event group
> leader. Special case uops_retired.slots to avoid this confusion.
>
Does any name of the form "name.slots" cause this confusion? If so, I
don't think we can stop others from using names in that format.

Would it be better to hard-code topdown.slots/slots, rather than
uops_retired.slots?
Thanks,
Kan
> Signed-off-by: Ian Rogers <[email protected]>
> ---
> tools/perf/arch/x86/util/evlist.c | 7 ++++---
> tools/perf/arch/x86/util/evsel.c | 7 +++----
> 2 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/tools/perf/arch/x86/util/evlist.c b/tools/perf/arch/x86/util/evlist.c
> index cbd582182932..b1ce0c52d88d 100644
> --- a/tools/perf/arch/x86/util/evlist.c
> +++ b/tools/perf/arch/x86/util/evlist.c
> @@ -75,11 +75,12 @@ int arch_evlist__add_default_attrs(struct evlist *evlist,
>
> int arch_evlist__cmp(const struct evsel *lhs, const struct evsel *rhs)
> {
> - if (topdown_sys_has_perf_metrics() && evsel__sys_has_perf_metrics(lhs)) {
> + if (topdown_sys_has_perf_metrics() &&
> + (arch_evsel__must_be_in_group(lhs) || arch_evsel__must_be_in_group(rhs))) {
> /* Ensure the topdown slots comes first. */
> - if (strcasestr(lhs->name, "slots"))
> + if (strcasestr(lhs->name, "slots") && !strcasestr(lhs->name, "uops_retired.slots"))
> return -1;
> - if (strcasestr(rhs->name, "slots"))
> + if (strcasestr(rhs->name, "slots") && !strcasestr(rhs->name, "uops_retired.slots"))
> return 1;
> /* Followed by topdown events. */
> if (strcasestr(lhs->name, "topdown") && !strcasestr(rhs->name, "topdown"))
> diff --git a/tools/perf/arch/x86/util/evsel.c b/tools/perf/arch/x86/util/evsel.c
> index 81d22657922a..090d0f371891 100644
> --- a/tools/perf/arch/x86/util/evsel.c
> +++ b/tools/perf/arch/x86/util/evsel.c
> @@ -40,12 +40,11 @@ bool evsel__sys_has_perf_metrics(const struct evsel *evsel)
>
> bool arch_evsel__must_be_in_group(const struct evsel *evsel)
> {
> - if (!evsel__sys_has_perf_metrics(evsel))
> + if (!evsel__sys_has_perf_metrics(evsel) || !evsel->name ||
> + strcasestr(evsel->name, "uops_retired.slots"))
> return false;
>
> - return evsel->name &&
> - (strcasestr(evsel->name, "slots") ||
> - strcasestr(evsel->name, "topdown"));
> + return strcasestr(evsel->name, "topdown") || strcasestr(evsel->name, "slots");
> }
>
> int arch_evsel__hw_name(struct evsel *evsel, char *bf, size_t size)
On Tue, Aug 1, 2023 at 8:40 AM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2023-08-01 1:36 a.m., Ian Rogers wrote:
> > As topdown.slots may appear as slots it may get confused with
> > uops_retired.slots which is an invalid perf metric event group
> > leader. Special case uops_retired.slots to avoid this confusion.
> >
>
> Does any name of the form "name.slots" cause this confusion? If so, I
> don't think we can stop others from using names in that format.
>
> Would it be better to hard-code topdown.slots/slots, rather than
> uops_retired.slots?
So firstly, this is yet another fringe complication that perf metric
events bring, hence this silly bit of complexity. The issue with "just"
trying to pattern match "topdown.slots" and "slots" is that there may
be PMU names in there. So (ignoring case) we can get:

slots
topdown.slots
cpu/slots/
cpu/topdown.slots/
cpu_core/slots/
cpu_core/topdown.slots/

but the name can also contain other junk, like modifiers, and there's
the name= config term too; let's just not think about that breaking
things. To avoid 6 searches I searched for the known problematic
uops_retired.slots, knowing that if it isn't there then one of the 6
forms above is the match. Searching for just "slots" or
"topdown.slots" isn't good enough, as "slots" will get a hit in
"uops_retired.slots", which is how we ended up with this patch in the
first place. So I ended up using the uops_retired.slots search to
reduce complexity in the code and to avoid writing an event parser
just to figure out what the name is... Ideally we'd be using the
perf_event_attr to do this comparison, but when this code runs,
immediately after parse-events, it hasn't been computed yet. Maybe I
can shuffle things around and make that true. Ideally I think
parse-events would just be a library that, given some event
description, gives back a perf_event_attr and not evsels, but that's
yet further work...
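
To make the heuristic concrete, here is a small standalone sketch
(plain C, not perf's code; only the matching expression is taken from
the patch) showing how the search classifies the six forms above
against the known false positive:

#define _GNU_SOURCE /* for strcasestr */
#include <stdio.h>
#include <string.h>

/* Mirrors the patch: a name is treated as the topdown slots event if
 * it contains "slots" but is not the known false positive
 * "uops_retired.slots". */
static int is_topdown_slots_name(const char *name)
{
	return strcasestr(name, "slots") &&
	       !strcasestr(name, "uops_retired.slots");
}

int main(void)
{
	static const char *const names[] = {
		"slots", "topdown.slots",
		"cpu/slots/", "cpu/topdown.slots/",
		"cpu_core/slots/", "cpu_core/topdown.slots/",
		"uops_retired.slots",
	};

	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		printf("%-24s -> %s\n", names[i],
		       is_topdown_slots_name(names[i]) ?
		       "slots group leader" : "not a leader");
	return 0;
}

The first six names print "slots group leader" and the last prints
"not a leader", which is the sorting behaviour the patch wants.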
Thanks,
Ian
>
> Thanks,
> Kan
>
> > Signed-off-by: Ian Rogers <[email protected]>
> > ---
> > tools/perf/arch/x86/util/evlist.c | 7 ++++---
> > tools/perf/arch/x86/util/evsel.c | 7 +++----
> > 2 files changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/tools/perf/arch/x86/util/evlist.c b/tools/perf/arch/x86/util/evlist.c
> > index cbd582182932..b1ce0c52d88d 100644
> > --- a/tools/perf/arch/x86/util/evlist.c
> > +++ b/tools/perf/arch/x86/util/evlist.c
> > @@ -75,11 +75,12 @@ int arch_evlist__add_default_attrs(struct evlist *evlist,
> >
> > int arch_evlist__cmp(const struct evsel *lhs, const struct evsel *rhs)
> > {
> > - if (topdown_sys_has_perf_metrics() && evsel__sys_has_perf_metrics(lhs)) {
> > + if (topdown_sys_has_perf_metrics() &&
> > + (arch_evsel__must_be_in_group(lhs) || arch_evsel__must_be_in_group(rhs))) {
> > /* Ensure the topdown slots comes first. */
> > - if (strcasestr(lhs->name, "slots"))
> > + if (strcasestr(lhs->name, "slots") && !strcasestr(lhs->name, "uops_retired.slots"))
> > return -1;
> > - if (strcasestr(rhs->name, "slots"))
> > + if (strcasestr(rhs->name, "slots") && !strcasestr(rhs->name, "uops_retired.slots"))
> > return 1;
> > /* Followed by topdown events. */
> > if (strcasestr(lhs->name, "topdown") && !strcasestr(rhs->name, "topdown"))
> > diff --git a/tools/perf/arch/x86/util/evsel.c b/tools/perf/arch/x86/util/evsel.c
> > index 81d22657922a..090d0f371891 100644
> > --- a/tools/perf/arch/x86/util/evsel.c
> > +++ b/tools/perf/arch/x86/util/evsel.c
> > @@ -40,12 +40,11 @@ bool evsel__sys_has_perf_metrics(const struct evsel *evsel)
> >
> > bool arch_evsel__must_be_in_group(const struct evsel *evsel)
> > {
> > - if (!evsel__sys_has_perf_metrics(evsel))
> > + if (!evsel__sys_has_perf_metrics(evsel) || !evsel->name ||
> > + strcasestr(evsel->name, "uops_retired.slots"))
> > return false;
> >
> > - return evsel->name &&
> > - (strcasestr(evsel->name, "slots") ||
> > - strcasestr(evsel->name, "topdown"));
> > + return strcasestr(evsel->name, "topdown") || strcasestr(evsel->name, "slots");
> > }
> >
> > int arch_evsel__hw_name(struct evsel *evsel, char *bf, size_t size)
On 2023-08-01 1:36 a.m., Ian Rogers wrote:
> The metric tma_info_pipeline_retire contains uops_retired.slots with
> perf metric events. Patch 1 fixes this event sorting so that
> uops_retired.slots isn't made a group leader as that needs to be
> topdown.slots.
>
> Patch 2 and 3 update the meteorlake and sapphirerapids events.
>
> Patch 4 addresses an issue with event grouping discussed in:
> https://lore.kernel.org/lkml/[email protected]/
> by adding and altering metric constraints. The constraints avoid
> groups for metrics where the kernel PMU fails to not open the group
> (the trigger for the weak group being removed).
>
> Ian Rogers (4):
> perf parse-events x86: Avoid sorting uops_retired.slots
> perf vendor events intel: Update meteorlake to 1.04
> perf vendor events intel: Update sapphirerapids to 1.15
> perf vendor events intel: Update Icelake+ metric constraints
>
Thanks Ian.
Reviewed-by: Kan Liang <[email protected]>
Thanks,
Kan
> tools/perf/arch/x86/util/evlist.c | 7 +-
> tools/perf/arch/x86/util/evsel.c | 7 +-
> .../arch/x86/alderlake/adl-metrics.json | 11 +-
> .../arch/x86/alderlaken/adln-metrics.json | 2 +
> .../arch/x86/icelake/icl-metrics.json | 10 +-
> .../arch/x86/icelakex/icx-metrics.json | 10 +-
> tools/perf/pmu-events/arch/x86/mapfile.csv | 4 +-
> .../pmu-events/arch/x86/meteorlake/cache.json | 165 ++++++++++++++++++
> .../arch/x86/meteorlake/floating-point.json | 8 +
> .../arch/x86/meteorlake/frontend.json | 56 ++++++
> .../arch/x86/meteorlake/memory.json | 80 +++++++++
> .../pmu-events/arch/x86/meteorlake/other.json | 16 ++
> .../arch/x86/meteorlake/pipeline.json | 159 +++++++++++++++++
> .../arch/x86/rocketlake/rkl-metrics.json | 10 +-
> .../arch/x86/sapphirerapids/other.json | 18 ++
> .../arch/x86/sapphirerapids/spr-metrics.json | 9 +-
> .../arch/x86/tigerlake/tgl-metrics.json | 10 +-
> 17 files changed, 549 insertions(+), 33 deletions(-)
>
On 2023-08-01 11:54 a.m., Ian Rogers wrote:
> On Tue, Aug 1, 2023 at 8:40 AM Liang, Kan <[email protected]> wrote:
>>
>>
>>
>> On 2023-08-01 1:36 a.m., Ian Rogers wrote:
>>> As topdown.slots may appear as slots it may get confused with
>>> uops_retired.slots which is an invalid perf metric event group
>>> leader. Special case uops_retired.slots to avoid this confusion.
>>>
>>
>> Does any name of the form "name.slots" cause this confusion? If so, I
>> don't think we can stop others from using names in that format.
>>
>> Would it be better to hard-code topdown.slots/slots, rather than
>> uops_retired.slots?
>
> So firstly, this is yet another fringe complication that perf metric
> events bring, hence this silly bit of complexity. The issue with "just"
> trying to pattern match "topdown.slots" and "slots" is that there may
> be PMU names in there. So (ignoring case) we can get:
>
> slots
> topdown.slots
> cpu/slots/
> cpu/topdown.slots/
> cpu_core/slots/
> cpu_core/topdown.slots/
>
> but the name can also contain other junk, like modifiers, and there's
> the name= config term too; let's just not think about that breaking
> things. To avoid 6 searches I searched for the known problematic
> uops_retired.slots, knowing that if it isn't there then one of the 6
> forms above is the match. Searching for just "slots" or
> "topdown.slots" isn't good enough, as "slots" will get a hit in
> "uops_retired.slots", which is how we ended up with this patch in the
> first place. So I ended up using the uops_retired.slots search to
> reduce complexity in the code and to avoid writing an event parser
> just to figure out what the name is... Ideally we'd be using the
> perf_event_attr to do this comparison, but when this code runs,
> immediately after parse-events, it hasn't been computed yet. Maybe I
> can shuffle things around and make that true. Ideally I think
> parse-events would just be a library that, given some event
> description, gives back a perf_event_attr and not evsels, but that's
> yet further work...
>
Thanks for the details.
I don't think we have other events with a similar pattern. For now,
special-casing uops_retired.slots should be good enough.
Thanks,
Kan
> Thanks,
> Ian
>
>>
>> Thanks,
>> Kan
>>
>>> Signed-off-by: Ian Rogers <[email protected]>
>>> ---
>>> tools/perf/arch/x86/util/evlist.c | 7 ++++---
>>> tools/perf/arch/x86/util/evsel.c | 7 +++----
>>> 2 files changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/tools/perf/arch/x86/util/evlist.c b/tools/perf/arch/x86/util/evlist.c
>>> index cbd582182932..b1ce0c52d88d 100644
>>> --- a/tools/perf/arch/x86/util/evlist.c
>>> +++ b/tools/perf/arch/x86/util/evlist.c
>>> @@ -75,11 +75,12 @@ int arch_evlist__add_default_attrs(struct evlist *evlist,
>>>
>>> int arch_evlist__cmp(const struct evsel *lhs, const struct evsel *rhs)
>>> {
>>> - if (topdown_sys_has_perf_metrics() && evsel__sys_has_perf_metrics(lhs)) {
>>> + if (topdown_sys_has_perf_metrics() &&
>>> + (arch_evsel__must_be_in_group(lhs) || arch_evsel__must_be_in_group(rhs))) {
>>> /* Ensure the topdown slots comes first. */
>>> - if (strcasestr(lhs->name, "slots"))
>>> + if (strcasestr(lhs->name, "slots") && !strcasestr(lhs->name, "uops_retired.slots"))
>>> return -1;
>>> - if (strcasestr(rhs->name, "slots"))
>>> + if (strcasestr(rhs->name, "slots") && !strcasestr(rhs->name, "uops_retired.slots"))
>>> return 1;
>>> /* Followed by topdown events. */
>>> if (strcasestr(lhs->name, "topdown") && !strcasestr(rhs->name, "topdown"))
>>> diff --git a/tools/perf/arch/x86/util/evsel.c b/tools/perf/arch/x86/util/evsel.c
>>> index 81d22657922a..090d0f371891 100644
>>> --- a/tools/perf/arch/x86/util/evsel.c
>>> +++ b/tools/perf/arch/x86/util/evsel.c
>>> @@ -40,12 +40,11 @@ bool evsel__sys_has_perf_metrics(const struct evsel *evsel)
>>>
>>> bool arch_evsel__must_be_in_group(const struct evsel *evsel)
>>> {
>>> - if (!evsel__sys_has_perf_metrics(evsel))
>>> + if (!evsel__sys_has_perf_metrics(evsel) || !evsel->name ||
>>> + strcasestr(evsel->name, "uops_retired.slots"))
>>> return false;
>>>
>>> - return evsel->name &&
>>> - (strcasestr(evsel->name, "slots") ||
>>> - strcasestr(evsel->name, "topdown"));
>>> + return strcasestr(evsel->name, "topdown") || strcasestr(evsel->name, "slots");
>>> }
>>>
>>> int arch_evsel__hw_name(struct evsel *evsel, char *bf, size_t size)
Em Tue, Aug 01, 2023 at 01:48:45PM -0400, Liang, Kan escreveu:
>
>
> On 2023-08-01 1:36 a.m., Ian Rogers wrote:
> > The metric tma_info_pipeline_retire contains uops_retired.slots with
> > perf metric events. Patch 1 fixes this event sorting so that
> > uops_retired.slots isn't made a group leader as that needs to be
> > topdown.slots.
> >
> > Patch 2 and 3 update the meteorlake and sapphirerapids events.
> >
> > Patch 4 addresses an issue with event grouping discussed in:
> > https://lore.kernel.org/lkml/[email protected]/
> > by adding and altering metric constraints. The constraints avoid
> > groups for metrics where the kernel PMU fails to not open the group
> > (the trigger for the weak group being removed).
> >
> > Ian Rogers (4):
> > perf parse-events x86: Avoid sorting uops_retired.slots
> > perf vendor events intel: Update meteorlake to 1.04
> > perf vendor events intel: Update sapphirerapids to 1.15
> > perf vendor events intel: Update Icelake+ metric constraints
> >
>
> Thanks Ian.
>
> Reviewed-by: Kan Liang <[email protected]>
Thanks, applied.
- Arnaldo