From: Kan Liang <[email protected]>
In the default mode, the current output of the metricgroup includes both
events and metrics, which is not necessary and makes the output hard to
read. Also, different ARCHs (and even different generations of the same
ARCH) may have a different output format because of the different events
in a metric.
This patch set proposes a new output format which only outputs the value
of each metric and the metricgroup name. It brings a clean and
consistent output format across ARCHs and generations.
The first two patches are bug fixes for the current code.
Patches 3-6 introduce the new metricgroup output.
Patches 7-8 improve the tests to cover the default mode.
Here are some examples for the new output.
STD output:
On SPR
perf stat -a sleep 1
Performance counter stats for 'system wide':
226,054.13 msec cpu-clock # 224.588 CPUs utilized
932 context-switches # 4.123 /sec
224 cpu-migrations # 0.991 /sec
76 page-faults # 0.336 /sec
45,940,682 cycles # 0.000 GHz
36,676,047 instructions # 0.80 insn per cycle
7,044,516 branches # 31.163 K/sec
62,169 branch-misses # 0.88% of all branches
TopdownL1 # 68.7 % tma_backend_bound
# 3.1 % tma_bad_speculation
# 13.0 % tma_frontend_bound
# 15.2 % tma_retiring
TopdownL2 # 2.7 % tma_branch_mispredicts
# 19.6 % tma_core_bound
# 4.8 % tma_fetch_bandwidth
# 8.3 % tma_fetch_latency
# 2.9 % tma_heavy_operations
# 12.3 % tma_light_operations
# 0.4 % tma_machine_clears
# 49.1 % tma_memory_bound
1.006529767 seconds time elapsed
On Hybrid
perf stat -a sleep 1
Performance counter stats for 'system wide':
32,154.81 msec cpu-clock # 31.978 CPUs utilized
165 context-switches # 5.131 /sec
33 cpu-migrations # 1.026 /sec
72 page-faults # 2.239 /sec
5,653,347 cpu_core/cycles/ # 0.000 GHz
4,164,114 cpu_atom/cycles/ # 0.000 GHz
3,921,839 cpu_core/instructions/ # 0.69 insn per cycle
2,142,800 cpu_atom/instructions/ # 0.38 insn per cycle
713,629 cpu_core/branches/ # 22.194 K/sec
452,838 cpu_atom/branches/ # 14.083 K/sec
26,810 cpu_core/branch-misses/ # 3.76% of all branches
26,029 cpu_atom/branch-misses/ # 3.65% of all branches
TopdownL1 (cpu_core) # 32.0 % tma_backend_bound
# 8.0 % tma_bad_speculation
# 45.5 % tma_frontend_bound
# 14.5 % tma_retiring
JSON output:
On SPR
perf stat --json -a sleep 1
{"counter-value" : "225904.823297", "unit" : "msec", "event" : "cpu-clock", "event-runtime" : 225904323425, "pcnt-running" : 100.00, "metric-value" : "224.456872", "metric-unit" : "CPUs utilized"}
{"counter-value" : "986.000000", "unit" : "", "event" : "context-switches", "event-runtime" : 225904108985, "pcnt-running" : 100.00, "metric-value" : "4.364670", "metric-unit" : "/sec"}
{"counter-value" : "224.000000", "unit" : "", "event" : "cpu-migrations", "event-runtime" : 225904016141, "pcnt-running" : 100.00, "metric-value" : "0.991568", "metric-unit" : "/sec"}
{"counter-value" : "76.000000", "unit" : "", "event" : "page-faults", "event-runtime" : 225903913270, "pcnt-running" : 100.00, "metric-value" : "0.336425", "metric-unit" : "/sec"}
{"counter-value" : "48433482.000000", "unit" : "", "event" : "cycles", "event-runtime" : 225903792732, "pcnt-running" : 100.00, "metric-value" : "0.000214", "metric-unit" : "GHz"}
{"counter-value" : "38620409.000000", "unit" : "", "event" : "instructions", "event-runtime" : 225903657830, "pcnt-running" : 100.00, "metric-value" : "0.797391", "metric-unit" : "insn per cycle"}
{"counter-value" : "7369473.000000", "unit" : "", "event" : "branches", "event-runtime" : 225903464328, "pcnt-running" : 100.00, "metric-value" : "32.622026", "metric-unit" : "K/sec"}
{"counter-value" : "54747.000000", "unit" : "", "event" : "branch-misses", "event-runtime" : 225903234523, "pcnt-running" : 100.00, "metric-value" : "0.742889", "metric-unit" : "of all branches"}
{"event-runtime" : 225902840555, "pcnt-running" : 100.00, "metricgroup" : "TopdownL1"}
{"metric-value" : "69.950631", "metric-unit" : "% tma_backend_bound"}
{"metric-value" : "2.771783", "metric-unit" : "% tma_bad_speculation"}
{"metric-value" : "12.026074", "metric-unit" : "% tma_frontend_bound"}
{"metric-value" : "15.251513", "metric-unit" : "% tma_retiring"}
{"event-runtime" : 225902840555, "pcnt-running" : 100.00, "metricgroup" : "TopdownL2"}
{"metric-value" : "2.351757", "metric-unit" : "% tma_branch_mispredicts"}
{"metric-value" : "19.729771", "metric-unit" : "% tma_core_bound"}
{"metric-value" : "4.555207", "metric-unit" : "% tma_fetch_bandwidth"}
{"metric-value" : "7.470867", "metric-unit" : "% tma_fetch_latency"}
{"metric-value" : "2.938808", "metric-unit" : "% tma_heavy_operations"}
{"metric-value" : "12.312705", "metric-unit" : "% tma_light_operations"}
{"metric-value" : "0.420026", "metric-unit" : "% tma_machine_clears"}
{"metric-value" : "50.220860", "metric-unit" : "% tma_memory_bound"}
On Hybrid
perf stat --json -a sleep 1
{"counter-value" : "32150.838437", "unit" : "msec", "event" : "cpu-clock", "event-runtime" : 32150846654, "pcnt-running" : 100.00, "metric-value" : "31.981465", "metric-unit" : "CPUs utilized"}
{"counter-value" : "154.000000", "unit" : "", "event" : "context-switches", "event-runtime" : 32150849941, "pcnt-running" : 100.00, "metric-value" : "4.789922", "metric-unit" : "/sec"}
{"counter-value" : "32.000000", "unit" : "", "event" : "cpu-migrations", "event-runtime" : 32150851194, "pcnt-running" : 100.00, "metric-value" : "0.995308", "metric-unit" : "/sec"}
{"counter-value" : "73.000000", "unit" : "", "event" : "page-faults", "event-runtime" : 32150855128, "pcnt-running" : 100.00, "metric-value" : "2.270547", "metric-unit" : "/sec"}
{"counter-value" : "6404864.000000", "unit" : "", "event" : "cpu_core/cycles/", "event-runtime" : 16069765136, "pcnt-running" : 100.00, "metric-value" : "0.000199", "metric-unit" : "GHz"}
{"counter-value" : "3011411.000000", "unit" : "", "event" : "cpu_atom/cycles/", "event-runtime" : 16080917475, "pcnt-running" : 100.00, "metric-value" : "0.000094", "metric-unit" : "GHz"}
{"counter-value" : "4748155.000000", "unit" : "", "event" : "cpu_core/instructions/", "event-runtime" : 16069777198, "pcnt-running" : 100.00, "metric-value" : "0.741336", "metric-unit" : "insn per cycle"}
{"counter-value" : "1129678.000000", "unit" : "", "event" : "cpu_atom/instructions/", "event-runtime" : 16080933337, "pcnt-running" : 100.00, "metric-value" : "0.176378", "metric-unit" : "insn per cycle"}
{"counter-value" : "943319.000000", "unit" : "", "event" : "cpu_core/branches/", "event-runtime" : 16069771422, "pcnt-running" : 100.00, "metric-value" : "29.340417", "metric-unit" : "K/sec"}
{"counter-value" : "194500.000000", "unit" : "", "event" : "cpu_atom/branches/", "event-runtime" : 16080937169, "pcnt-running" : 100.00, "metric-value" : "6.049609", "metric-unit" : "K/sec"}
{"counter-value" : "31974.000000", "unit" : "", "event" : "cpu_core/branch-misses/", "event-runtime" : 16069759637, "pcnt-running" : 100.00, "metric-value" : "3.389521", "metric-unit" : "of all branches"}
{"counter-value" : "18643.000000", "unit" : "", "event" : "cpu_atom/branch-misses/", "event-runtime" : 16080929464, "pcnt-running" : 100.00, "metric-value" : "1.976320", "metric-unit" : "of all branches"}
{"event-runtime" : 16069747669, "pcnt-running" : 100.00, "metricgroup" : "TopdownL1 (cpu_core)"}
{"metric-value" : "30.939553", "metric-unit" : "% tma_backend_bound"}
{"metric-value" : "8.303274", "metric-unit" : "% tma_bad_speculation"}
{"metric-value" : "46.181223", "metric-unit" : "% tma_frontend_bound"}
{"metric-value" : "14.575950", "metric-unit" : "% tma_retiring"}
CSV output:
On SPR
perf stat -x, -a sleep 1
225851.20,msec,cpu-clock,225850700108,100.00,224.431,CPUs utilized
976,,context-switches,225850504803,100.00,4.321,/sec
224,,cpu-migrations,225850410336,100.00,0.992,/sec
76,,page-faults,225850304155,100.00,0.337,/sec
52288305,,cycles,225850188531,100.00,0.000,GHz
37977214,,instructions,225850071251,100.00,0.73,insn per cycle
7299859,,branches,225849890722,100.00,32.322,K/sec
51102,,branch-misses,225849672536,100.00,0.70,of all branches
,225849327050,100.00,,,,TopdownL1
,,,,,70.1,% tma_backend_bound
,,,,,2.7,% tma_bad_speculation
,,,,,12.5,% tma_frontend_bound
,,,,,14.6,% tma_retiring
,225849327050,100.00,,,,TopdownL2
,,,,,2.3,% tma_branch_mispredicts
,,,,,19.6,% tma_core_bound
,,,,,4.6,% tma_fetch_bandwidth
,,,,,7.9,% tma_fetch_latency
,,,,,2.9,% tma_heavy_operations
,,,,,11.7,% tma_light_operations
,,,,,0.5,% tma_machine_clears
,,,,,50.5,% tma_memory_bound
On Hybrid
perf stat -x, -a sleep 1
32148.69,msec,cpu-clock,32148689529,100.00,31.974,CPUs utilized
168,,context-switches,32148707526,100.00,5.226,/sec
33,,cpu-migrations,32148718292,100.00,1.026,/sec
73,,page-faults,32148729436,100.00,2.271,/sec
8632400,,cpu_core/cycles/,16067477534,100.00,0.000,GHz
3359282,,cpu_atom/cycles/,16081105672,100.00,0.000,GHz
9222630,,cpu_core/instructions/,16067506390,100.00,1.07,insn per cycle
1256594,,cpu_atom/instructions/,16081131302,100.00,0.15,insn per cycle
1842167,,cpu_core/branches/,16067509544,100.00,57.301,K/sec
215437,,cpu_atom/branches/,16081139517,100.00,6.701,K/sec
38133,,cpu_core/branch-misses/,16067511463,100.00,2.07,of all branches
20857,,cpu_atom/branch-misses/,16081135654,100.00,1.13,of all branches
,16067501860,100.00,,,,TopdownL1 (cpu_core)
,,,,,30.6,% tma_backend_bound
,,,,,7.8,% tma_bad_speculation
,,,,,42.0,% tma_frontend_bound
,,,,,19.6,% tma_retiring
Kan Liang (8):
perf metric: Fix no group check
perf evsel: Fix the annotation for hardware events on hybrid
perf metric: JSON flag to default metric group
perf vendor events arm64: Add default tags into topdown L1 metrics
perf stat,jevents: Introduce Default tags for the default mode
perf stat,metrics: New metrics output for the default mode
perf tests: Support metricgroup perf stat JSON output
perf test: Add test case for the standard perf stat output
tools/perf/builtin-stat.c | 5 +-
tools/perf/pmu-events/arch/arm64/sbsa.json | 12 +-
.../arch/x86/alderlake/adl-metrics.json | 20 +-
.../arch/x86/icelake/icl-metrics.json | 20 +-
.../arch/x86/icelakex/icx-metrics.json | 20 +-
.../arch/x86/sapphirerapids/spr-metrics.json | 60 ++--
.../arch/x86/tigerlake/tgl-metrics.json | 20 +-
tools/perf/pmu-events/jevents.py | 5 +-
tools/perf/pmu-events/pmu-events.h | 1 +
.../tests/shell/lib/perf_json_output_lint.py | 3 +
tools/perf/tests/shell/stat+std_output.sh | 259 ++++++++++++++++++
tools/perf/util/evsel.h | 13 +-
tools/perf/util/metricgroup.c | 111 +++++++-
tools/perf/util/metricgroup.h | 1 +
tools/perf/util/stat-display.c | 69 ++++-
tools/perf/util/stat-shadow.c | 39 +--
16 files changed, 564 insertions(+), 94 deletions(-)
create mode 100755 tools/perf/tests/shell/stat+std_output.sh
--
2.35.1
From: Kan Liang <[email protected]>
Introduce a new metricgroup, Default, to tag all the metric groups
which will be collected in the default mode.
Add a new field, DefaultMetricgroupName, to the JSON file to indicate
the real metric group name. It will be printed in the default output
in place of the event names.
Nothing changes in the output format.
On SPR, both TopdownL1 and TopdownL2 are displayed in the default
output.
On ARM, Intel ICL and later platforms (before SPR), only TopdownL1 is
displayed in the default output.
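The tagging scheme can be sketched as follows (a Python stand-in for
the C/jevents logic; the field names are taken from the patch, but the
metric entries and the helper are illustrative only):

```python
# Hypothetical metric entries in the pmu-events JSON format. Only the
# first one carries the new Default tag and DefaultMetricgroupName.
metrics = [
    {"MetricName": "tma_frontend_bound",
     "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
     "DefaultMetricgroupName": "TopdownL1"},
    {"MetricName": "tma_fetch_latency",
     "MetricGroup": "TopdownL2;tma_L2_group"},
]

def in_group(metric, group):
    """Check membership in the semicolon-separated MetricGroup list."""
    return group in metric.get("MetricGroup", "").split(";")

# The default mode now selects the "Default" tag instead of a
# hard-coded "TopdownL1".
default = [m for m in metrics if in_group(m, "Default")]
assert [m["MetricName"] for m in default] == ["tma_frontend_bound"]

# The printed group name comes from DefaultMetricgroupName, not from
# the "Default" tag itself.
assert default[0]["DefaultMetricgroupName"] == "TopdownL1"
```

This is why the output format is unchanged: the Default tag only drives
selection, while DefaultMetricgroupName supplies the displayed name.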
Suggested-by: Stephane Eranian <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-stat.c | 4 ++--
tools/perf/pmu-events/jevents.py | 5 +++--
tools/perf/pmu-events/pmu-events.h | 1 +
tools/perf/util/metricgroup.c | 3 +++
4 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c87c6897edc9..2269b3e90e9b 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -2154,14 +2154,14 @@ static int add_default_attributes(void)
* Add TopdownL1 metrics if they exist. To minimize
* multiplexing, don't request threshold computation.
*/
- if (metricgroup__has_metric(pmu, "TopdownL1")) {
+ if (metricgroup__has_metric(pmu, "Default")) {
struct evlist *metric_evlist = evlist__new();
struct evsel *metric_evsel;
if (!metric_evlist)
return -1;
- if (metricgroup__parse_groups(metric_evlist, pmu, "TopdownL1",
+ if (metricgroup__parse_groups(metric_evlist, pmu, "Default",
/*metric_no_group=*/false,
/*metric_no_merge=*/false,
/*metric_no_threshold=*/true,
diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py
index 7ed258be1829..12e80bb7939b 100755
--- a/tools/perf/pmu-events/jevents.py
+++ b/tools/perf/pmu-events/jevents.py
@@ -54,8 +54,8 @@ _json_event_attributes = [
# Attributes that are in pmu_metric rather than pmu_event.
_json_metric_attributes = [
'pmu', 'metric_name', 'metric_group', 'metric_expr', 'metric_threshold',
- 'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group', 'aggr_mode',
- 'event_grouping'
+ 'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group',
+ 'default_metricgroup_name', 'aggr_mode', 'event_grouping'
]
# Attributes that are bools or enum int values, encoded as '0', '1',...
_json_enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg']
@@ -307,6 +307,7 @@ class JsonEvent:
self.metric_name = jd.get('MetricName')
self.metric_group = jd.get('MetricGroup')
self.metricgroup_no_group = jd.get('MetricgroupNoGroup')
+ self.default_metricgroup_name = jd.get('DefaultMetricgroupName')
self.event_grouping = convert_metric_constraint(jd.get('MetricConstraint'))
self.metric_expr = None
if 'MetricExpr' in jd:
diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h
index 8cd23d656a5d..caf59f23cd64 100644
--- a/tools/perf/pmu-events/pmu-events.h
+++ b/tools/perf/pmu-events/pmu-events.h
@@ -61,6 +61,7 @@ struct pmu_metric {
const char *desc;
const char *long_desc;
const char *metricgroup_no_group;
+ const char *default_metricgroup_name;
enum aggr_mode_class aggr_mode;
enum metric_event_groups event_grouping;
};
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 74f2d8efc02d..efafa02db5e5 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -137,6 +137,8 @@ struct metric {
* output.
*/
const char *metric_unit;
+ /** Optional default metric group name */
+ const char *default_metricgroup_name;
/** Optional null terminated array of referenced metrics. */
struct metric_ref *metric_refs;
/**
@@ -219,6 +221,7 @@ static struct metric *metric__new(const struct pmu_metric *pm,
m->pmu = pm->pmu ?: "cpu";
m->metric_name = pm->metric_name;
+ m->default_metricgroup_name = pm->default_metricgroup_name;
m->modifier = NULL;
if (modifier) {
m->modifier = strdup(modifier);
--
2.35.1
From: Kan Liang <[email protected]>
The no group check fails if there is more than one metricgroup in
metricgroup_no_group.
The first parameter of match_metric() should be the string (the
semicolon-separated metricgroup_no_group list), while the substring to
match should be the second parameter.
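The argument-order bug can be illustrated with a small sketch (a Python
stand-in for the C helper; the semicolon-list semantics are an
assumption drawn from the commit message, not the exact implementation):

```python
def match_metric(haystack, needle):
    """Sketch of the helper's intended semantics: does `needle` match
    one of the semicolon-separated entries in `haystack`?"""
    if not haystack or not needle:
        return False
    return any(entry == needle for entry in haystack.split(";"))

no_group = "TopdownL1;Default"   # more than one metricgroup

# Correct argument order: the list is the first parameter.
assert match_metric(no_group, "TopdownL1")

# Swapped order, as in the buggy call: the whole list is compared
# against a single entry, so the check silently fails.
assert not match_metric("TopdownL1", no_group)
```

With a single-entry list both orders happen to agree, which is why the
bug only shows up once metricgroup_no_group contains multiple groups.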
Fixes: ccc66c609280 ("perf metric: JSON flag to not group events if gathering a metric group")
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/metricgroup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 70ef2e23a710..74f2d8efc02d 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -1175,7 +1175,7 @@ static int metricgroup__add_metric_callback(const struct pmu_metric *pm,
if (pm->metric_expr && match_pm_metric(pm, data->pmu, data->metric_name)) {
bool metric_no_group = data->metric_no_group ||
- match_metric(data->metric_name, pm->metricgroup_no_group);
+ match_metric(pm->metricgroup_no_group, data->metric_name);
data->has_match = true;
ret = add_metric(data->list, pm, data->modifier, metric_no_group,
--
2.35.1
From: Kan Liang <[email protected]>
Add the default tags for ARM as well.
Signed-off-by: Kan Liang <[email protected]>
Cc: Jing Zhang <[email protected]>
Cc: John Garry <[email protected]>
---
tools/perf/pmu-events/arch/arm64/sbsa.json | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/tools/perf/pmu-events/arch/arm64/sbsa.json b/tools/perf/pmu-events/arch/arm64/sbsa.json
index f678c37ea9c3..f90b338261ac 100644
--- a/tools/perf/pmu-events/arch/arm64/sbsa.json
+++ b/tools/perf/pmu-events/arch/arm64/sbsa.json
@@ -2,28 +2,32 @@
{
"MetricExpr": "stall_slot_frontend / (#slots * cpu_cycles)",
"BriefDescription": "Frontend bound L1 topdown metric",
- "MetricGroup": "TopdownL1",
+ "DefaultMetricgroupName": "TopdownL1",
+ "MetricGroup": "Default;TopdownL1",
"MetricName": "frontend_bound",
"ScaleUnit": "100%"
},
{
"MetricExpr": "(1 - op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
"BriefDescription": "Bad speculation L1 topdown metric",
- "MetricGroup": "TopdownL1",
+ "DefaultMetricgroupName": "TopdownL1",
+ "MetricGroup": "Default;TopdownL1",
"MetricName": "bad_speculation",
"ScaleUnit": "100%"
},
{
"MetricExpr": "(op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
"BriefDescription": "Retiring L1 topdown metric",
- "MetricGroup": "TopdownL1",
+ "DefaultMetricgroupName": "TopdownL1",
+ "MetricGroup": "Default;TopdownL1",
"MetricName": "retiring",
"ScaleUnit": "100%"
},
{
"MetricExpr": "stall_slot_backend / (#slots * cpu_cycles)",
"BriefDescription": "Backend Bound L1 topdown metric",
- "MetricGroup": "TopdownL1",
+ "DefaultMetricgroupName": "TopdownL1",
+ "MetricGroup": "Default;TopdownL1",
"MetricName": "backend_bound",
"ScaleUnit": "100%"
}
--
2.35.1
From: Kan Liang <[email protected]>
A new field, metricgroup, has been added to the perf stat JSON output.
Support it in the test case.
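The relaxed field-count check can be sketched as a standalone function
(a hypothetical distillation of the lint logic below; the real script
keeps more state per line):

```python
import json

def fields_ok(line, expected_items):
    """Accept a JSON output line if it has the expected field count, or
    if it is a metricgroup header line, which carries only a subset of
    the fields (between 1 and 5 of them)."""
    item = json.loads(line)
    count = len(item)
    if count == expected_items:
        return True
    return 1 <= count <= 5 and 'metricgroup' in item

# A metricgroup header line has fewer fields than a counter line but
# is still accepted.
assert fields_ok('{"event-runtime": 1, "pcnt-running": 100.0, '
                 '"metricgroup": "TopdownL1"}', 7)

# A short line without the metricgroup key is still rejected.
assert not fields_ok('{"event-runtime": 1}', 7)
```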
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/tests/shell/lib/perf_json_output_lint.py | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/perf/tests/shell/lib/perf_json_output_lint.py b/tools/perf/tests/shell/lib/perf_json_output_lint.py
index b81582a89d36..5e9bd68c83fe 100644
--- a/tools/perf/tests/shell/lib/perf_json_output_lint.py
+++ b/tools/perf/tests/shell/lib/perf_json_output_lint.py
@@ -55,6 +55,7 @@ def check_json_output(expected_items):
'interval': lambda x: isfloat(x),
'metric-unit': lambda x: True,
'metric-value': lambda x: isfloat(x),
+ 'metricgroup': lambda x: True,
'node': lambda x: True,
'pcnt-running': lambda x: isfloat(x),
'socket': lambda x: True,
@@ -70,6 +71,8 @@ def check_json_output(expected_items):
# values and possibly other prefixes like interval, core and
# aggregate-number.
pass
+ elif count != expected_items and count >= 1 and count <= 5 and 'metricgroup' in item:
+ pass
elif count != expected_items:
raise RuntimeError(f'wrong number of fields. counted {count} expected {expected_items}'
f' in \'{item}\'')
--
2.35.1
From: Kan Liang <[email protected]>
For the default output, the default metric group varies across
platforms. For example, on SPR, both the TopdownL1 and TopdownL2
metrics should be displayed in the default mode. On ICL, only the
TopdownL1 metrics should be displayed.
Add a flag so that the default metric group can be tagged per platform
in the event JSON files rather than hard-coded in the perf tool.
The flag is added to the Intel TopdownL1 metrics since ICL, and to the
TopdownL2 metrics since SPR.
Add a new field, DefaultMetricgroupName, to the JSON file to indicate
the real metric group name.
Signed-off-by: Kan Liang <[email protected]>
---
.../arch/x86/alderlake/adl-metrics.json | 20 ++++---
.../arch/x86/icelake/icl-metrics.json | 20 ++++---
.../arch/x86/icelakex/icx-metrics.json | 20 ++++---
.../arch/x86/sapphirerapids/spr-metrics.json | 60 +++++++++++--------
.../arch/x86/tigerlake/tgl-metrics.json | 20 ++++---
5 files changed, 84 insertions(+), 56 deletions(-)
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
index c9f7e3d4ab08..e78c85220e27 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
@@ -832,22 +832,24 @@
},
{
"BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_backend_bound",
"MetricThreshold": "tma_backend_bound > 0.2",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
"ScaleUnit": "100%",
"Unit": "cpu_core"
},
{
"BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_bad_speculation",
"MetricThreshold": "tma_bad_speculation > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
"ScaleUnit": "100%",
"Unit": "cpu_core"
@@ -1112,11 +1114,12 @@
},
{
"BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
- "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_frontend_bound",
"MetricThreshold": "tma_frontend_bound > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
"ScaleUnit": "100%",
"Unit": "cpu_core"
@@ -2316,11 +2319,12 @@
},
{
"BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_retiring",
"MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
"ScaleUnit": "100%",
"Unit": "cpu_core"
diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
index 20210742171d..cc4edf855064 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
@@ -111,21 +111,23 @@
},
{
"BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_backend_bound",
"MetricThreshold": "tma_backend_bound > 0.2",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
"ScaleUnit": "100%"
},
{
"BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_bad_speculation",
"MetricThreshold": "tma_bad_speculation > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
"ScaleUnit": "100%"
},
@@ -372,11 +374,12 @@
},
{
"BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
- "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_frontend_bound",
"MetricThreshold": "tma_frontend_bound > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
"ScaleUnit": "100%"
},
@@ -1378,11 +1381,12 @@
},
{
"BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_retiring",
"MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
"ScaleUnit": "100%"
},
diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
index ef25cda019be..6f25b5b7aaf6 100644
--- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
@@ -315,21 +315,23 @@
},
{
"BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_backend_bound",
"MetricThreshold": "tma_backend_bound > 0.2",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
"ScaleUnit": "100%"
},
{
"BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_bad_speculation",
"MetricThreshold": "tma_bad_speculation > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
"ScaleUnit": "100%"
},
@@ -576,11 +578,12 @@
},
{
"BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
- "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_frontend_bound",
"MetricThreshold": "tma_frontend_bound > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
"ScaleUnit": "100%"
},
@@ -1674,11 +1677,12 @@
},
{
"BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_retiring",
"MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
"ScaleUnit": "100%"
},
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
index 4f3dd85540b6..c732982f70b5 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
@@ -340,31 +340,34 @@
},
{
"BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_backend_bound",
"MetricThreshold": "tma_backend_bound > 0.2",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
"ScaleUnit": "100%"
},
{
"BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_bad_speculation",
"MetricThreshold": "tma_bad_speculation > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
"ScaleUnit": "100%"
},
{
"BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+ "DefaultMetricgroupName": "TopdownL2",
"MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
- "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+ "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
"MetricName": "tma_branch_mispredicts",
"MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
- "MetricgroupNoGroup": "TopdownL2",
+ "MetricgroupNoGroup": "TopdownL2;Default",
"PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
"ScaleUnit": "100%"
},
@@ -407,11 +410,12 @@
},
{
"BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
+ "DefaultMetricgroupName": "TopdownL2",
"MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
- "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+ "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
"MetricName": "tma_core_bound",
"MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
- "MetricgroupNoGroup": "TopdownL2",
+ "MetricgroupNoGroup": "TopdownL2;Default",
"PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
"ScaleUnit": "100%"
},
@@ -509,21 +513,23 @@
},
{
"BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
+ "DefaultMetricgroupName": "TopdownL2",
"MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
- "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
+ "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
"MetricName": "tma_fetch_bandwidth",
"MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
- "MetricgroupNoGroup": "TopdownL2",
+ "MetricgroupNoGroup": "TopdownL2;Default",
"PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
"ScaleUnit": "100%"
},
{
"BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
+ "DefaultMetricgroupName": "TopdownL2",
"MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
- "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+ "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
"MetricName": "tma_fetch_latency",
"MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
- "MetricgroupNoGroup": "TopdownL2",
+ "MetricgroupNoGroup": "TopdownL2;Default",
"PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
"ScaleUnit": "100%"
},
@@ -611,11 +617,12 @@
},
{
"BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
- "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_frontend_bound",
"MetricThreshold": "tma_frontend_bound > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
"ScaleUnit": "100%"
},
@@ -630,11 +637,12 @@
},
{
"BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+ "DefaultMetricgroupName": "TopdownL2",
"MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
- "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+ "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
"MetricName": "tma_heavy_operations",
"MetricThreshold": "tma_heavy_operations > 0.1",
- "MetricgroupNoGroup": "TopdownL2",
+ "MetricgroupNoGroup": "TopdownL2;Default",
"PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
"ScaleUnit": "100%"
},
@@ -1486,11 +1494,12 @@
},
{
"BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
+ "DefaultMetricgroupName": "TopdownL2",
"MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
- "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+ "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
"MetricName": "tma_light_operations",
"MetricThreshold": "tma_light_operations > 0.6",
- "MetricgroupNoGroup": "TopdownL2",
+ "MetricgroupNoGroup": "TopdownL2;Default",
"PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
"ScaleUnit": "100%"
},
@@ -1540,11 +1549,12 @@
},
{
"BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
+ "DefaultMetricgroupName": "TopdownL2",
"MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
- "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
+ "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
"MetricName": "tma_machine_clears",
"MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
- "MetricgroupNoGroup": "TopdownL2",
+ "MetricgroupNoGroup": "TopdownL2;Default",
"PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
"ScaleUnit": "100%"
},
@@ -1576,11 +1586,12 @@
},
{
"BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
+ "DefaultMetricgroupName": "TopdownL2",
"MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
- "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+ "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
"MetricName": "tma_memory_bound",
"MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
- "MetricgroupNoGroup": "TopdownL2",
+ "MetricgroupNoGroup": "TopdownL2;Default",
"PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
"ScaleUnit": "100%"
},
@@ -1784,11 +1795,12 @@
},
{
"BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_retiring",
"MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
"ScaleUnit": "100%"
},
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
index d0538a754288..83346911aa63 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
@@ -105,21 +105,23 @@
},
{
"BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_backend_bound",
"MetricThreshold": "tma_backend_bound > 0.2",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
"ScaleUnit": "100%"
},
{
"BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_bad_speculation",
"MetricThreshold": "tma_bad_speculation > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
"ScaleUnit": "100%"
},
@@ -366,11 +368,12 @@
},
{
"BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
- "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_frontend_bound",
"MetricThreshold": "tma_frontend_bound > 0.15",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
"ScaleUnit": "100%"
},
@@ -1392,11 +1395,12 @@
},
{
"BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+ "DefaultMetricgroupName": "TopdownL1",
"MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
- "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+ "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
"MetricName": "tma_retiring",
"MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
- "MetricgroupNoGroup": "TopdownL1",
+ "MetricgroupNoGroup": "TopdownL1;Default",
"PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
"ScaleUnit": "100%"
},
--
2.35.1
From: Kan Liang <[email protected]>
In the default mode, the current output of the metricgroup includes both
events and metrics, which is unnecessary and makes the output hard to
read. Since different ARCHs (even different generations of the same
ARCH) may use different events, the output also varies across
platforms.
For a metricgroup, outputting only the value of each metric is good
enough.
Current perf may append different metric groups to the same leader
event, or append metrics from the same metricgroup to different
events. That could bring confusion when perf prints only the
metricgroup output, e.g. printing the same metricgroup name
several times.
Reorganize the metricgroups for the default mode and make sure that
a metricgroup can only be appended to one event.
Sort the metricgroups for the default mode by metricgroup name.
Add a new field, default_metricgroup, in evsel to indicate an event of
the default metricgroup. For those events, printout() prints the
metricgroup name rather than the events.
Add print_metricgroup_header() to print the metricgroup name in the
different output formats.
On SPR
Before:
./perf_old stat sleep 1
Performance counter stats for 'sleep 1':
0.54 msec task-clock:u # 0.001 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
68 page-faults:u # 125.445 K/sec
540,970 cycles:u # 0.998 GHz
556,325 instructions:u # 1.03 insn per cycle
123,602 branches:u # 228.018 M/sec
6,889 branch-misses:u # 5.57% of all branches
3,245,820 TOPDOWN.SLOTS:u # 18.4 % tma_backend_bound
# 17.2 % tma_retiring
# 23.1 % tma_bad_speculation
# 41.4 % tma_frontend_bound
564,859 topdown-retiring:u
1,370,999 topdown-fe-bound:u
603,271 topdown-be-bound:u
744,874 topdown-bad-spec:u
12,661 INT_MISC.UOP_DROPPING:u # 23.357 M/sec
1.001798215 seconds time elapsed
0.000193000 seconds user
0.001700000 seconds sys
After:
$ ./perf stat sleep 1
Performance counter stats for 'sleep 1':
0.51 msec task-clock:u # 0.001 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
68 page-faults:u # 132.683 K/sec
545,228 cycles:u # 1.064 GHz
555,509 instructions:u # 1.02 insn per cycle
123,574 branches:u # 241.120 M/sec
6,957 branch-misses:u # 5.63% of all branches
TopdownL1 # 17.5 % tma_backend_bound
# 22.6 % tma_bad_speculation
# 42.7 % tma_frontend_bound
# 17.1 % tma_retiring
TopdownL2 # 21.8 % tma_branch_mispredicts
# 11.5 % tma_core_bound
# 13.4 % tma_fetch_bandwidth
# 29.3 % tma_fetch_latency
# 2.7 % tma_heavy_operations
# 14.5 % tma_light_operations
# 0.8 % tma_machine_clears
# 6.1 % tma_memory_bound
1.001712086 seconds time elapsed
0.000151000 seconds user
0.001618000 seconds sys
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-stat.c | 1 +
tools/perf/util/evsel.h | 1 +
tools/perf/util/metricgroup.c | 106 ++++++++++++++++++++++++++++++++-
tools/perf/util/metricgroup.h | 1 +
tools/perf/util/stat-display.c | 69 ++++++++++++++++++++-
5 files changed, 172 insertions(+), 6 deletions(-)
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2269b3e90e9b..b274cc264d56 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -2172,6 +2172,7 @@ static int add_default_attributes(void)
evlist__for_each_entry(metric_evlist, metric_evsel) {
metric_evsel->skippable = true;
+ metric_evsel->default_metricgroup = true;
}
evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
evlist__delete(metric_evlist);
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 36a32e4ca168..61b1385108f4 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -130,6 +130,7 @@ struct evsel {
bool reset_group;
bool errored;
bool needs_auxtrace_mmap;
+ bool default_metricgroup;
struct hashmap *per_pkg_mask;
int err;
struct {
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index efafa02db5e5..22181ce4f27f 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -79,6 +79,7 @@ static struct rb_node *metric_event_new(struct rblist *rblist __maybe_unused,
return NULL;
memcpy(me, entry, sizeof(struct metric_event));
me->evsel = ((struct metric_event *)entry)->evsel;
+ me->default_metricgroup_name = NULL;
INIT_LIST_HEAD(&me->head);
return &me->nd;
}
@@ -1133,14 +1134,19 @@ static int metricgroup__add_metric_sys_event_iter(const struct pmu_metric *pm,
/**
* metric_list_cmp - list_sort comparator that sorts metrics with more events to
* the front. tool events are excluded from the count.
+ * For the default metrics, sort them by metricgroup name.
*/
-static int metric_list_cmp(void *priv __maybe_unused, const struct list_head *l,
+static int metric_list_cmp(void *priv, const struct list_head *l,
const struct list_head *r)
{
const struct metric *left = container_of(l, struct metric, nd);
const struct metric *right = container_of(r, struct metric, nd);
struct expr_id_data *data;
int i, left_count, right_count;
+ bool is_default = *(bool *)priv;
+
+ if (is_default && left->default_metricgroup_name && right->default_metricgroup_name)
+ return strcmp(left->default_metricgroup_name, right->default_metricgroup_name);
left_count = hashmap__size(left->pctx->ids);
perf_tool_event__for_each_event(i) {
@@ -1497,6 +1503,91 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu,
return ret;
}
+static struct metric_event *
+metricgroup__lookup_default_metricgroup(struct rblist *metric_events,
+ struct evsel *evsel,
+ struct metric *m)
+{
+ struct metric_event *me;
+ char *name;
+ int err;
+
+ me = metricgroup__lookup(metric_events, evsel, true);
+ if (!me->default_metricgroup_name) {
+ if (m->pmu && strcmp(m->pmu, "cpu"))
+ err = asprintf(&name, "%s (%s)", m->default_metricgroup_name, m->pmu);
+ else
+ err = asprintf(&name, "%s", m->default_metricgroup_name);
+ if (err < 0)
+ return NULL;
+ me->default_metricgroup_name = name;
+ }
+ if (!strncmp(m->default_metricgroup_name,
+ me->default_metricgroup_name,
+ strlen(m->default_metricgroup_name)))
+ return me;
+
+ return NULL;
+}
+
+static struct metric_event *
+metricgroup__lookup_create(struct rblist *metric_events,
+ struct evsel **evsel,
+ struct list_head *metric_list,
+ struct metric *m,
+ bool is_default)
+{
+ struct metric_event *me;
+ struct metric *cur;
+ struct evsel *ev;
+ size_t i;
+
+ if (!is_default)
+ return metricgroup__lookup(metric_events, evsel[0], true);
+
+ /*
+ * If the metric group has been attached to a previous
+ * event/metric, use that metric event.
+ */
+ list_for_each_entry(cur, metric_list, nd) {
+ if (cur == m)
+ break;
+ if (cur->pmu && strcmp(m->pmu, cur->pmu))
+ continue;
+ if (strncmp(m->default_metricgroup_name,
+ cur->default_metricgroup_name,
+ strlen(m->default_metricgroup_name)))
+ continue;
+ if (!cur->evlist)
+ continue;
+ evlist__for_each_entry(cur->evlist, ev) {
+ me = metricgroup__lookup(metric_events, ev, false);
+ if (!strncmp(m->default_metricgroup_name,
+ me->default_metricgroup_name,
+ strlen(m->default_metricgroup_name)))
+ return me;
+ }
+ }
+
+ /*
+ * Different metric groups may append to the same leader event.
+ * For example, TopdownL1 and TopdownL2 are appended to the
+ * TOPDOWN.SLOTS event.
+ * Split it and append the new metric group to the next available
+ * event.
+ */
+ me = metricgroup__lookup_default_metricgroup(metric_events, evsel[0], m);
+ if (me)
+ return me;
+
+ for (i = 1; i < hashmap__size(m->pctx->ids); i++) {
+ me = metricgroup__lookup_default_metricgroup(metric_events, evsel[i], m);
+ if (me)
+ return me;
+ }
+ return NULL;
+}
+
static int parse_groups(struct evlist *perf_evlist,
const char *pmu, const char *str,
bool metric_no_group,
@@ -1512,6 +1603,7 @@ static int parse_groups(struct evlist *perf_evlist,
LIST_HEAD(metric_list);
struct metric *m;
bool tool_events[PERF_TOOL_MAX] = {false};
+ bool is_default = !strcmp(str, "Default");
int ret;
if (metric_events_list->nr_entries == 0)
@@ -1523,7 +1615,7 @@ static int parse_groups(struct evlist *perf_evlist,
goto out;
/* Sort metrics from largest to smallest. */
- list_sort(NULL, &metric_list, metric_list_cmp);
+ list_sort((void *)&is_default, &metric_list, metric_list_cmp);
if (!metric_no_merge) {
struct expr_parse_ctx *combined = NULL;
@@ -1603,7 +1695,15 @@ static int parse_groups(struct evlist *perf_evlist,
goto out;
}
- me = metricgroup__lookup(metric_events_list, metric_events[0], true);
+ me = metricgroup__lookup_create(metric_events_list,
+ metric_events,
+ &metric_list, m,
+ is_default);
+ if (!me) {
+ pr_err("Cannot create metric group for default!\n");
+ ret = -EINVAL;
+ goto out;
+ }
expr = malloc(sizeof(struct metric_expr));
if (!expr) {
diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index bf18274c15df..e3609b853213 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -22,6 +22,7 @@ struct cgroup;
struct metric_event {
struct rb_node nd;
struct evsel *evsel;
+ char *default_metricgroup_name;
struct list_head head; /* list of metric_expr */
};
diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index a2bbdc25d979..efe5fd04c033 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -21,10 +21,12 @@
#include "iostat.h"
#include "pmu.h"
#include "pmus.h"
+#include "metricgroup.h"
#define CNTR_NOT_SUPPORTED "<not supported>"
#define CNTR_NOT_COUNTED "<not counted>"
+#define MGROUP_LEN 50
#define METRIC_LEN 38
#define EVNAME_LEN 32
#define COUNTS_LEN 18
@@ -707,6 +709,55 @@ static bool evlist__has_hybrid(struct evlist *evlist)
return false;
}
+static void print_metricgroup_header_json(struct perf_stat_config *config,
+ struct outstate *os,
+ const char *metricgroup_name)
+{
+ fprintf(config->output, "\"metricgroup\" : \"%s\"}", metricgroup_name);
+ new_line_json(config, (void *)os);
+}
+
+static void print_metricgroup_header_csv(struct perf_stat_config *config,
+ struct outstate *os,
+ const char *metricgroup_name)
+{
+ int i;
+
+ for (i = 0; i < os->nfields; i++)
+ fputs(config->csv_sep, os->fh);
+ fprintf(config->output, "%s", metricgroup_name);
+ new_line_csv(config, (void *)os);
+}
+
+static void print_metricgroup_header_std(struct perf_stat_config *config,
+ struct outstate *os __maybe_unused,
+ const char *metricgroup_name)
+{
+ int n = fprintf(config->output, " %*s", EVNAME_LEN, metricgroup_name);
+
+ fprintf(config->output, "%*s", MGROUP_LEN - n - 1, "");
+}
+
+static void print_metricgroup_header(struct perf_stat_config *config,
+ struct outstate *os,
+ struct evsel *counter,
+ double noise, u64 run, u64 ena,
+ const char *metricgroup_name)
+{
+ aggr_printout(config, os->evsel, os->id, os->aggr_nr);
+
+ print_noise(config, counter, noise, /*before_metric=*/true);
+ print_running(config, run, ena, /*before_metric=*/true);
+
+ if (config->json_output)
+ print_metricgroup_header_json(config, os, metricgroup_name);
+ else if (config->csv_output)
+ print_metricgroup_header_csv(config, os, metricgroup_name);
+ else
+ print_metricgroup_header_std(config, os, metricgroup_name);
+}
+
static void printout(struct perf_stat_config *config, struct outstate *os,
double uval, u64 run, u64 ena, double noise, int aggr_idx)
{
@@ -751,10 +802,17 @@ static void printout(struct perf_stat_config *config, struct outstate *os,
out.force_header = false;
if (!config->metric_only) {
- abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
+ if (counter->default_metricgroup) {
+ struct metric_event *me;
- print_noise(config, counter, noise, /*before_metric=*/true);
- print_running(config, run, ena, /*before_metric=*/true);
+ me = metricgroup__lookup(&config->metric_events, counter, false);
+ print_metricgroup_header(config, os, counter, noise, run, ena,
+ me->default_metricgroup_name);
+ } else {
+ abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
+ print_noise(config, counter, noise, /*before_metric=*/true);
+ print_running(config, run, ena, /*before_metric=*/true);
+ }
}
if (ok) {
@@ -883,6 +941,11 @@ static void print_counter_aggrdata(struct perf_stat_config *config,
if (counter->merged_stat)
return;
+ /* Only print the metric group for the default mode */
+ if (counter->default_metricgroup &&
+ !metricgroup__lookup(&config->metric_events, counter, false))
+ return;
+
uniquify_counter(config, counter);
val = aggr->counts.val;
--
2.35.1
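The fixed-width layout produced by print_metricgroup_header_std() above can be sketched in shell. This is an illustration only, assuming the EVNAME_LEN (32) and MGROUP_LEN (50) constants from the patch; the metric-value tail is a made-up placeholder, not real perf output.

```shell
# Illustrative sketch (not the perf source): pad the Default metricgroup
# name into a fixed-width column, as print_metricgroup_header_std() does.
EVNAME_LEN=32   # width used for event/metricgroup names
MGROUP_LEN=50   # total width of the metricgroup header column

print_header() {
    local name=$1
    local field
    # Right-justify the name in EVNAME_LEN columns, with a leading space.
    field=$(printf " %*s" "$EVNAME_LEN" "$name")
    # Pad the whole field out to MGROUP_LEN - 1 characters.
    printf "%s%*s" "$field" $((MGROUP_LEN - ${#field} - 1)) ""
}

# The "#     68.7 % ..." tail stands in for the metric rendering.
line="$(print_header TopdownL1)#     68.7 %  tma_backend_bound"
echo "$line"
```

The padding keeps the "#" separator at a fixed column, which is what gives the STD examples in the cover letter their aligned look.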
From: Kan Liang <[email protected]>
Add a new test case to verify the standard perf stat output with
different options.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/tests/shell/stat+std_output.sh | 259 ++++++++++++++++++++++
1 file changed, 259 insertions(+)
create mode 100755 tools/perf/tests/shell/stat+std_output.sh
diff --git a/tools/perf/tests/shell/stat+std_output.sh b/tools/perf/tests/shell/stat+std_output.sh
new file mode 100755
index 000000000000..b9db0f245450
--- /dev/null
+++ b/tools/perf/tests/shell/stat+std_output.sh
@@ -0,0 +1,259 @@
+#!/bin/bash
+# perf stat STD output linter
+# SPDX-License-Identifier: GPL-2.0
+# Tests various perf stat STD output commands for
+# default event and metricgroup
+
+set -e
+
+skip_test=0
+
+stat_output=$(mktemp /tmp/__perf_test.stat_output.std.XXXXX)
+
+event_name=(cpu-clock task-clock context-switches cpu-migrations page-faults cycles instructions branches branch-misses stalled-cycles-frontend stalled-cycles-backend)
+event_metric=("CPUs utilized" "CPUs utilized" "/sec" "/sec" "/sec" "GHz" "insn per cycle" "/sec" "of all branches" "frontend cycles idle" "backend cycles idle")
+
+metricgroup_name=(TopdownL1 TopdownL2)
+
+cleanup() {
+ rm -f "${stat_output}"
+
+ trap - EXIT TERM INT
+}
+
+trap_cleanup() {
+ cleanup
+ exit 1
+}
+trap trap_cleanup EXIT TERM INT
+
+function commachecker()
+{
+ local -i cnt=0
+ local prefix=1
+
+ case "$1"
+ in "--interval") prefix=2
+ ;; "--per-thread") prefix=2
+ ;; "--system-wide-no-aggr") prefix=2
+ ;; "--per-core") prefix=3
+ ;; "--per-socket") prefix=3
+ ;; "--per-node") prefix=3
+ ;; "--per-die") prefix=3
+ ;; "--per-cache") prefix=3
+ esac
+
+ while read line
+ do
+ # Ignore initial "started on" comment.
+ x=${line:0:1}
+ [ "$x" = "#" ] && continue
+ # Ignore initial blank line.
+ [ "$line" = "" ] && continue
+ # Ignore "Performance counter stats"
+ x=${line:0:25}
+ [ "$x" = "Performance counter stats" ] && continue
+ # Ignore "seconds time elapsed" and break
+ [[ "$line" == *"time elapsed"* ]] && break
+
+ main_body=$(echo $line | cut -d' ' -f$prefix-)
+ x=${main_body%#*}
+ # Check default metricgroup
+ y=$(echo $x | tr -d ' ')
+ [ "$y" = "" ] && continue
+ for i in "${!metricgroup_name[@]}"; do
+ [[ "$y" == *"${metricgroup_name[$i]}"* ]] && break
+ done
+ [[ "$y" == *"${metricgroup_name[$i]}"* ]] && continue
+
+ # Check default event
+ for i in "${!event_name[@]}"; do
+ [[ "$x" == *"${event_name[$i]}"* ]] && break
+ done
+
+ [[ ! "$x" == *"${event_name[$i]}"* ]] && {
+ echo "Unknown event name in $line" 1>&2
+ exit 1;
+ }
+
+ # Check event metric if it exists
+ [[ ! "$main_body" == *"#"* ]] && continue
+ [[ ! "$main_body" == *"${event_metric[$i]}"* ]] && {
+ echo "wrong event metric. expected ${event_metric[$i]} in $line" 1>&2
+ exit 1;
+ }
+ done < "${stat_output}"
+ return 0
+}
+
+# Return true if perf_event_paranoid is > $1 and not running as root.
+function ParanoidAndNotRoot()
+{
+ [ $(id -u) != 0 ] && [ $(cat /proc/sys/kernel/perf_event_paranoid) -gt $1 ]
+}
+
+check_no_args()
+{
+ echo -n "Checking STD output: no args "
+ perf stat -o "${stat_output}" true
+ commachecker --no-args
+ echo "[Success]"
+}
+
+check_system_wide()
+{
+ echo -n "Checking STD output: system wide "
+ if ParanoidAndNotRoot 0
+ then
+ echo "[Skip] paranoid and not root"
+ return
+ fi
+ perf stat -a -o "${stat_output}" true
+ commachecker --system-wide
+ echo "[Success]"
+}
+
+check_system_wide_no_aggr()
+{
+ echo -n "Checking STD output: system wide no aggregation "
+ if ParanoidAndNotRoot 0
+ then
+ echo "[Skip] paranoid and not root"
+ return
+ fi
+ perf stat -A -a --no-merge -o "${stat_output}" true
+ commachecker --system-wide-no-aggr
+ echo "[Success]"
+}
+
+check_interval()
+{
+ echo -n "Checking STD output: interval "
+ perf stat -I 1000 -o "${stat_output}" true
+ commachecker --interval
+ echo "[Success]"
+}
+
+check_per_core()
+{
+ echo -n "Checking STD output: per core "
+ if ParanoidAndNotRoot 0
+ then
+ echo "[Skip] paranoid and not root"
+ return
+ fi
+ perf stat --per-core -a -o "${stat_output}" true
+ commachecker --per-core
+ echo "[Success]"
+}
+
+check_per_thread()
+{
+ echo -n "Checking STD output: per thread "
+ if ParanoidAndNotRoot 0
+ then
+ echo "[Skip] paranoid and not root"
+ return
+ fi
+ perf stat --per-thread -a -o "${stat_output}" true
+ commachecker --per-thread
+ echo "[Success]"
+}
+
+check_per_cache_instance()
+{
+ echo -n "Checking STD output: per cache instance "
+ if ParanoidAndNotRoot 0
+ then
+ echo "[Skip] paranoid and not root"
+ return
+ fi
+ perf stat --per-cache -a -o "${stat_output}" true
+ commachecker --per-cache
+ echo "[Success]"
+}
+
+check_per_die()
+{
+ echo -n "Checking STD output: per die "
+ if ParanoidAndNotRoot 0
+ then
+ echo "[Skip] paranoid and not root"
+ return
+ fi
+ perf stat --per-die -a -o "${stat_output}" true
+ commachecker --per-die
+ echo "[Success]"
+}
+
+check_per_node()
+{
+ echo -n "Checking STD output: per node "
+ if ParanoidAndNotRoot 0
+ then
+ echo "[Skip] paranoid and not root"
+ return
+ fi
+ perf stat --per-node -a -o "${stat_output}" true
+ commachecker --per-node
+ echo "[Success]"
+}
+
+check_per_socket()
+{
+ echo -n "Checking STD output: per socket "
+ if ParanoidAndNotRoot 0
+ then
+ echo "[Skip] paranoid and not root"
+ return
+ fi
+ perf stat --per-socket -a -o "${stat_output}" true
+ commachecker --per-socket
+ echo "[Success]"
+}
+
+# The perf stat options for per-socket, per-core, per-die
+# and -A (no_aggr mode) use the info fetched from the
+# "/sys/devices/system/cpu/cpu*/topology" directory. For
+# example, the socket value is fetched from the
+# "physical_package_id" file in the topology directory.
+# Reference: cpu__get_topology_int in util/cpumap.c
+# If the platform doesn't expose topology information, the
+# values will be set to -1. For example, in case of the
+# pSeries platform on powerpc, the value for
+# "physical_package_id" is restricted and set to -1. The
+# check here validates the socket-id read from the topology
+# file before proceeding further.
+
+FILE_LOC="/sys/devices/system/cpu/cpu*/topology/"
+FILE_NAME="physical_package_id"
+
+check_for_topology()
+{
+ if ! ParanoidAndNotRoot 0
+ then
+ socket_file=`ls $FILE_LOC/$FILE_NAME | head -n 1`
+ [ -z $socket_file ] && return 0
+ socket_id=`cat $socket_file`
+ [ $socket_id == -1 ] && skip_test=1
+ return 0
+ fi
+}
+
+check_for_topology
+check_no_args
+check_system_wide
+check_interval
+check_per_thread
+check_per_node
+if [ $skip_test -ne 1 ]
+then
+ check_system_wide_no_aggr
+ check_per_core
+ check_per_cache_instance
+ check_per_die
+ check_per_socket
+else
+ echo "[Skip] Skipping tests for system_wide_no_aggr, per_core, per_die and per_socket since socket id exposed via topology is invalid"
+fi
+cleanup
+exit 0
--
2.35.1
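The prefix handling in commachecker above can be sketched in isolation: aggregation modes prepend extra columns to each line, so the checker cuts from field $prefix onward before matching event names. The sample line below is illustrative, not captured perf output.

```shell
# Sketch of commachecker's prefix stripping: --per-core output carries a
# core id and a CPU count before the counter value, so prefix=3 skips them.
line="S0-D0-C0 2 45,940,682 cycles # 0.000 GHz"

prefix=3
# Unquoted $line mirrors the script above: word splitting collapses any
# run of spaces so cut sees single-space-separated fields.
main_body=$(echo $line | cut -d' ' -f$prefix-)
echo "$main_body"
```

With prefix=2 (e.g. --interval) only the leading timestamp field would be dropped instead.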
On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> The no group check fails if there is more than one metricgroup in the
> metricgroup_no_group.
>
> The first parameter of the match_metric() should be the string, while
> the substring should be the second parameter.
>
> Fixes: ccc66c609280 ("perf metric: JSON flag to not group events if gathering a metric group")
> Signed-off-by: Kan Liang <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Thanks,
Ian
> ---
> tools/perf/util/metricgroup.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index 70ef2e23a710..74f2d8efc02d 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -1175,7 +1175,7 @@ static int metricgroup__add_metric_callback(const struct pmu_metric *pm,
>
> if (pm->metric_expr && match_pm_metric(pm, data->pmu, data->metric_name)) {
> bool metric_no_group = data->metric_no_group ||
> - match_metric(data->metric_name, pm->metricgroup_no_group);
> + match_metric(pm->metricgroup_no_group, data->metric_name);
>
> data->has_match = true;
> ret = add_metric(data->list, pm, data->modifier, metric_no_group,
> --
> 2.35.1
>
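The argument-order fix above can be illustrated with a simplified analogue. This sketch is not the perf implementation (the real match_metric() is a C helper and also handles pattern matching); it only models the contract the commit message describes: the semicolon-separated string to search comes first, the name to look for second.

```shell
# Simplified stand-in for match_metric(): exact-token lookup in a
# semicolon-separated list. Argument order matters: list first, name second.
match_metric() {
    local list=$1 name=$2
    case ";$list;" in
        *";$name;"*) return 0 ;;
        *) return 1 ;;
    esac
}

no_group="TopdownL1;Default"   # e.g. a MetricgroupNoGroup field

match_metric "$no_group" "TopdownL1" && echo "found"
# With the arguments swapped, the lookup misses as soon as the list
# holds more than one group name:
match_metric "TopdownL1" "$no_group" || echo "swapped arguments miss"
```

That is exactly the failure mode of the original call: it worked while metricgroup_no_group held a single group, and broke once "TopdownL1;Default" style lists appeared.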
On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> For the default output, the default metric group could vary on different
> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
> should be displayed in the default mode. On ICL, only the TopdownL1
> should be displayed.
>
> Add a flag so we can tag the default metric group for different
> platforms rather than hack the perf code.
>
> The flag is added to the Intel TopdownL1 metrics since ICL, and to the
> TopdownL2 metrics since SPR.
>
> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
> the real metric group name.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> .../arch/x86/alderlake/adl-metrics.json | 20 ++++---
> .../arch/x86/icelake/icl-metrics.json | 20 ++++---
> .../arch/x86/icelakex/icx-metrics.json | 20 ++++---
> .../arch/x86/sapphirerapids/spr-metrics.json | 60 +++++++++++--------
> .../arch/x86/tigerlake/tgl-metrics.json | 20 ++++---
> 5 files changed, 84 insertions(+), 56 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> index c9f7e3d4ab08..e78c85220e27 100644
> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> @@ -832,22 +832,24 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_backend_bound",
> "MetricThreshold": "tma_backend_bound > 0.2",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> "ScaleUnit": "100%",
> "Unit": "cpu_core"
> },
> {
> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_bad_speculation",
> "MetricThreshold": "tma_bad_speculation > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> "ScaleUnit": "100%",
> "Unit": "cpu_core"
> @@ -1112,11 +1114,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_frontend_bound",
> "MetricThreshold": "tma_frontend_bound > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> "ScaleUnit": "100%",
> "Unit": "cpu_core"
> @@ -2316,11 +2319,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_retiring",
> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> "ScaleUnit": "100%",
> "Unit": "cpu_core"
For Alderlake the Default metric group is added for all cpu_core
metrics but not for cpu_atom. This will lead to only getting metrics
for the performance cores even though the workload could be running on
the atoms, which could give the false impression that the workload has
no issues with those metrics. I think this behavior is surprising and
should be called out as intentional in the commit message.
Thanks,
Ian
> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> index 20210742171d..cc4edf855064 100644
> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> @@ -111,21 +111,23 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_backend_bound",
> "MetricThreshold": "tma_backend_bound > 0.2",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> "ScaleUnit": "100%"
> },
> {
> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_bad_speculation",
> "MetricThreshold": "tma_bad_speculation > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> "ScaleUnit": "100%"
> },
> @@ -372,11 +374,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_frontend_bound",
> "MetricThreshold": "tma_frontend_bound > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> "ScaleUnit": "100%"
> },
> @@ -1378,11 +1381,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_retiring",
> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> "ScaleUnit": "100%"
> },
> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> index ef25cda019be..6f25b5b7aaf6 100644
> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> @@ -315,21 +315,23 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_backend_bound",
> "MetricThreshold": "tma_backend_bound > 0.2",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> "ScaleUnit": "100%"
> },
> {
> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_bad_speculation",
> "MetricThreshold": "tma_bad_speculation > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> "ScaleUnit": "100%"
> },
> @@ -576,11 +578,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_frontend_bound",
> "MetricThreshold": "tma_frontend_bound > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> "ScaleUnit": "100%"
> },
> @@ -1674,11 +1677,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_retiring",
> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> "ScaleUnit": "100%"
> },
> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> index 4f3dd85540b6..c732982f70b5 100644
> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> @@ -340,31 +340,34 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_backend_bound",
> "MetricThreshold": "tma_backend_bound > 0.2",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> "ScaleUnit": "100%"
> },
> {
> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_bad_speculation",
> "MetricThreshold": "tma_bad_speculation > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> "ScaleUnit": "100%"
> },
> {
> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
> + "DefaultMetricgroupName": "TopdownL2",
> "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> - "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> + "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> "MetricName": "tma_branch_mispredicts",
> "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
> - "MetricgroupNoGroup": "TopdownL2",
> + "MetricgroupNoGroup": "TopdownL2;Default",
> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
> "ScaleUnit": "100%"
> },
> @@ -407,11 +410,12 @@
> },
> {
> "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
> + "DefaultMetricgroupName": "TopdownL2",
> "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
> - "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> + "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> "MetricName": "tma_core_bound",
> "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
> - "MetricgroupNoGroup": "TopdownL2",
> + "MetricgroupNoGroup": "TopdownL2;Default",
> "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
> "ScaleUnit": "100%"
> },
> @@ -509,21 +513,23 @@
> },
> {
> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
> + "DefaultMetricgroupName": "TopdownL2",
> "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
> - "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> + "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> "MetricName": "tma_fetch_bandwidth",
> "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
> - "MetricgroupNoGroup": "TopdownL2",
> + "MetricgroupNoGroup": "TopdownL2;Default",
> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
> "ScaleUnit": "100%"
> },
> {
> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
> + "DefaultMetricgroupName": "TopdownL2",
> "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> - "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> + "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> "MetricName": "tma_fetch_latency",
> "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
> - "MetricgroupNoGroup": "TopdownL2",
> + "MetricgroupNoGroup": "TopdownL2;Default",
> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
> "ScaleUnit": "100%"
> },
> @@ -611,11 +617,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_frontend_bound",
> "MetricThreshold": "tma_frontend_bound > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> "ScaleUnit": "100%"
> },
> @@ -630,11 +637,12 @@
> },
> {
> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
> + "DefaultMetricgroupName": "TopdownL2",
> "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> "MetricName": "tma_heavy_operations",
> "MetricThreshold": "tma_heavy_operations > 0.1",
> - "MetricgroupNoGroup": "TopdownL2",
> + "MetricgroupNoGroup": "TopdownL2;Default",
> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
> "ScaleUnit": "100%"
> },
> @@ -1486,11 +1494,12 @@
> },
> {
> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
> + "DefaultMetricgroupName": "TopdownL2",
> "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> "MetricName": "tma_light_operations",
> "MetricThreshold": "tma_light_operations > 0.6",
> - "MetricgroupNoGroup": "TopdownL2",
> + "MetricgroupNoGroup": "TopdownL2;Default",
> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
> "ScaleUnit": "100%"
> },
> @@ -1540,11 +1549,12 @@
> },
> {
> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
> + "DefaultMetricgroupName": "TopdownL2",
> "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
> - "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> + "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> "MetricName": "tma_machine_clears",
> "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
> - "MetricgroupNoGroup": "TopdownL2",
> + "MetricgroupNoGroup": "TopdownL2;Default",
> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
> "ScaleUnit": "100%"
> },
> @@ -1576,11 +1586,12 @@
> },
> {
> "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
> + "DefaultMetricgroupName": "TopdownL2",
> "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> - "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> + "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> "MetricName": "tma_memory_bound",
> "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
> - "MetricgroupNoGroup": "TopdownL2",
> + "MetricgroupNoGroup": "TopdownL2;Default",
> "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
> "ScaleUnit": "100%"
> },
> @@ -1784,11 +1795,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_retiring",
> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> "ScaleUnit": "100%"
> },
> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> index d0538a754288..83346911aa63 100644
> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> @@ -105,21 +105,23 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_backend_bound",
> "MetricThreshold": "tma_backend_bound > 0.2",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> "ScaleUnit": "100%"
> },
> {
> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_bad_speculation",
> "MetricThreshold": "tma_bad_speculation > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> "ScaleUnit": "100%"
> },
> @@ -366,11 +368,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_frontend_bound",
> "MetricThreshold": "tma_frontend_bound > 0.15",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> "ScaleUnit": "100%"
> },
> @@ -1392,11 +1395,12 @@
> },
> {
> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> + "DefaultMetricgroupName": "TopdownL1",
> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> "MetricName": "tma_retiring",
> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> - "MetricgroupNoGroup": "TopdownL1",
> + "MetricgroupNoGroup": "TopdownL1;Default",
> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> "ScaleUnit": "100%"
> },
> --
> 2.35.1
>
On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Introduce a new metricgroup, Default, to tag all the metric groups which
> will be collected in the default mode.
>
> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
> the real metric group name. It will be printed in the default output
> to replace the event names.
>
> The output format itself is unchanged.
>
> On SPR, both TopdownL1 and TopdownL2 are displayed in the default
> output.
>
> On ARM, Intel ICL and later platforms (before SPR), only TopdownL1 is
> displayed in the default output.
>
> Suggested-by: Stephane Eranian <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/builtin-stat.c | 4 ++--
> tools/perf/pmu-events/jevents.py | 5 +++--
> tools/perf/pmu-events/pmu-events.h | 1 +
> tools/perf/util/metricgroup.c | 3 +++
> 4 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index c87c6897edc9..2269b3e90e9b 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -2154,14 +2154,14 @@ static int add_default_attributes(void)
> * Add TopdownL1 metrics if they exist. To minimize
> * multiplexing, don't request threshold computation.
> */
> - if (metricgroup__has_metric(pmu, "TopdownL1")) {
> + if (metricgroup__has_metric(pmu, "Default")) {
> struct evlist *metric_evlist = evlist__new();
> struct evsel *metric_evsel;
>
> if (!metric_evlist)
> return -1;
>
> - if (metricgroup__parse_groups(metric_evlist, pmu, "TopdownL1",
> + if (metricgroup__parse_groups(metric_evlist, pmu, "Default",
> /*metric_no_group=*/false,
> /*metric_no_merge=*/false,
> /*metric_no_threshold=*/true,
> diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py
> index 7ed258be1829..12e80bb7939b 100755
> --- a/tools/perf/pmu-events/jevents.py
> +++ b/tools/perf/pmu-events/jevents.py
> @@ -54,8 +54,8 @@ _json_event_attributes = [
> # Attributes that are in pmu_metric rather than pmu_event.
> _json_metric_attributes = [
> 'pmu', 'metric_name', 'metric_group', 'metric_expr', 'metric_threshold',
> - 'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group', 'aggr_mode',
> - 'event_grouping'
> + 'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group',
> + 'default_metricgroup_name', 'aggr_mode', 'event_grouping'
> ]
> # Attributes that are bools or enum int values, encoded as '0', '1',...
> _json_enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg']
> @@ -307,6 +307,7 @@ class JsonEvent:
> self.metric_name = jd.get('MetricName')
> self.metric_group = jd.get('MetricGroup')
> self.metricgroup_no_group = jd.get('MetricgroupNoGroup')
> + self.default_metricgroup_name = jd.get('DefaultMetricgroupName')
> self.event_grouping = convert_metric_constraint(jd.get('MetricConstraint'))
> self.metric_expr = None
> if 'MetricExpr' in jd:
> diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h
> index 8cd23d656a5d..caf59f23cd64 100644
> --- a/tools/perf/pmu-events/pmu-events.h
> +++ b/tools/perf/pmu-events/pmu-events.h
> @@ -61,6 +61,7 @@ struct pmu_metric {
> const char *desc;
> const char *long_desc;
> const char *metricgroup_no_group;
> + const char *default_metricgroup_name;
> enum aggr_mode_class aggr_mode;
> enum metric_event_groups event_grouping;
> };
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index 74f2d8efc02d..efafa02db5e5 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -137,6 +137,8 @@ struct metric {
> * output.
> */
> const char *metric_unit;
> + /** Optional default metric group name */
> + const char *default_metricgroup_name;
Adding a bit more to the comment would be useful, like:
Optional name of the metric group reported if the Default metric group
is being processed.
> /** Optional null terminated array of referenced metrics. */
> struct metric_ref *metric_refs;
> /**
> @@ -219,6 +221,7 @@ static struct metric *metric__new(const struct pmu_metric *pm,
>
> m->pmu = pm->pmu ?: "cpu";
> m->metric_name = pm->metric_name;
> + m->default_metricgroup_name = pm->default_metricgroup_name;
> m->modifier = NULL;
> if (modifier) {
> m->modifier = strdup(modifier);
> --
> 2.35.1
>
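The jevents.py hunk above boils down to one more `jd.get()` call. A minimal sketch (not the actual jevents.py code) of how the new DefaultMetricgroupName key is picked up; since `jd.get()` returns None when a metric has no default group, existing JSON files need no changes:

```python
def parse_metric(jd):
    """Extract the metric attributes relevant to this patch from a
    decoded JSON object, mirroring JsonEvent.__init__ above."""
    return {
        'metric_name': jd.get('MetricName'),
        'metric_group': jd.get('MetricGroup'),
        'metricgroup_no_group': jd.get('MetricgroupNoGroup'),
        'default_metricgroup_name': jd.get('DefaultMetricgroupName'),
    }
```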
On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Add the default tags for ARM as well.
>
> Signed-off-by: Kan Liang <[email protected]>
> Cc: Jing Zhang <[email protected]>
> Cc: John Garry <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Thanks,
Ian
> ---
> tools/perf/pmu-events/arch/arm64/sbsa.json | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/arm64/sbsa.json b/tools/perf/pmu-events/arch/arm64/sbsa.json
> index f678c37ea9c3..f90b338261ac 100644
> --- a/tools/perf/pmu-events/arch/arm64/sbsa.json
> +++ b/tools/perf/pmu-events/arch/arm64/sbsa.json
> @@ -2,28 +2,32 @@
> {
> "MetricExpr": "stall_slot_frontend / (#slots * cpu_cycles)",
> "BriefDescription": "Frontend bound L1 topdown metric",
> - "MetricGroup": "TopdownL1",
> + "DefaultMetricgroupName": "TopdownL1",
> + "MetricGroup": "Default;TopdownL1",
> "MetricName": "frontend_bound",
> "ScaleUnit": "100%"
> },
> {
> "MetricExpr": "(1 - op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
> "BriefDescription": "Bad speculation L1 topdown metric",
> - "MetricGroup": "TopdownL1",
> + "DefaultMetricgroupName": "TopdownL1",
> + "MetricGroup": "Default;TopdownL1",
> "MetricName": "bad_speculation",
> "ScaleUnit": "100%"
> },
> {
> "MetricExpr": "(op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
> "BriefDescription": "Retiring L1 topdown metric",
> - "MetricGroup": "TopdownL1",
> + "DefaultMetricgroupName": "TopdownL1",
> + "MetricGroup": "Default;TopdownL1",
> "MetricName": "retiring",
> "ScaleUnit": "100%"
> },
> {
> "MetricExpr": "stall_slot_backend / (#slots * cpu_cycles)",
> "BriefDescription": "Backend Bound L1 topdown metric",
> - "MetricGroup": "TopdownL1",
> + "DefaultMetricgroupName": "TopdownL1",
> + "MetricGroup": "Default;TopdownL1",
> "MetricName": "backend_bound",
> "ScaleUnit": "100%"
> }
> --
> 2.35.1
>
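To illustrate how the Default tag composes with the existing groups, here is a small sketch with sample data copied from the sbsa.json hunk above (the real selection happens in metricgroup.c, not in Python):

```python
def default_metrics(metrics):
    """Return (default group name, metric name) for metrics tagged Default.

    MetricGroup is a semicolon-separated list, so membership must be
    tested on the split tokens, not with a substring match.
    """
    out = []
    for m in metrics:
        if 'Default' in m.get('MetricGroup', '').split(';'):
            out.append((m['DefaultMetricgroupName'], m['MetricName']))
    return out

# Sample entries mirroring the arm64 sbsa.json hunk above; the third
# entry is an illustrative non-Default metric.
sbsa = [
    {'MetricGroup': 'Default;TopdownL1', 'DefaultMetricgroupName': 'TopdownL1',
     'MetricName': 'frontend_bound'},
    {'MetricGroup': 'Default;TopdownL1', 'DefaultMetricgroupName': 'TopdownL1',
     'MetricName': 'backend_bound'},
    {'MetricGroup': 'Cache', 'MetricName': 'l1d_miss_ratio'},
]
```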
On 2023-06-13 3:59 p.m., Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> Introduce a new metricgroup, Default, to tag all the metric groups which
>> will be collected in the default mode.
>>
>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
>> the real metric group name. It will be printed in the default output
>> to replace the event names.
>>
>> The output format itself is unchanged.
>>
>> On SPR, both TopdownL1 and TopdownL2 are displayed in the default
>> output.
>>
>> On ARM, Intel ICL and later platforms (before SPR), only TopdownL1 is
>> displayed in the default output.
>>
>> Suggested-by: Stephane Eranian <[email protected]>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> tools/perf/builtin-stat.c | 4 ++--
>> tools/perf/pmu-events/jevents.py | 5 +++--
>> tools/perf/pmu-events/pmu-events.h | 1 +
>> tools/perf/util/metricgroup.c | 3 +++
>> 4 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index c87c6897edc9..2269b3e90e9b 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -2154,14 +2154,14 @@ static int add_default_attributes(void)
>> * Add TopdownL1 metrics if they exist. To minimize
>> * multiplexing, don't request threshold computation.
>> */
>> - if (metricgroup__has_metric(pmu, "TopdownL1")) {
>> + if (metricgroup__has_metric(pmu, "Default")) {
>> struct evlist *metric_evlist = evlist__new();
>> struct evsel *metric_evsel;
>>
>> if (!metric_evlist)
>> return -1;
>>
>> - if (metricgroup__parse_groups(metric_evlist, pmu, "TopdownL1",
>> + if (metricgroup__parse_groups(metric_evlist, pmu, "Default",
>> /*metric_no_group=*/false,
>> /*metric_no_merge=*/false,
>> /*metric_no_threshold=*/true,
>> diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py
>> index 7ed258be1829..12e80bb7939b 100755
>> --- a/tools/perf/pmu-events/jevents.py
>> +++ b/tools/perf/pmu-events/jevents.py
>> @@ -54,8 +54,8 @@ _json_event_attributes = [
>> # Attributes that are in pmu_metric rather than pmu_event.
>> _json_metric_attributes = [
>> 'pmu', 'metric_name', 'metric_group', 'metric_expr', 'metric_threshold',
>> - 'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group', 'aggr_mode',
>> - 'event_grouping'
>> + 'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group',
>> + 'default_metricgroup_name', 'aggr_mode', 'event_grouping'
>> ]
>> # Attributes that are bools or enum int values, encoded as '0', '1',...
>> _json_enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg']
>> @@ -307,6 +307,7 @@ class JsonEvent:
>> self.metric_name = jd.get('MetricName')
>> self.metric_group = jd.get('MetricGroup')
>> self.metricgroup_no_group = jd.get('MetricgroupNoGroup')
>> + self.default_metricgroup_name = jd.get('DefaultMetricgroupName')
>> self.event_grouping = convert_metric_constraint(jd.get('MetricConstraint'))
>> self.metric_expr = None
>> if 'MetricExpr' in jd:
>> diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h
>> index 8cd23d656a5d..caf59f23cd64 100644
>> --- a/tools/perf/pmu-events/pmu-events.h
>> +++ b/tools/perf/pmu-events/pmu-events.h
>> @@ -61,6 +61,7 @@ struct pmu_metric {
>> const char *desc;
>> const char *long_desc;
>> const char *metricgroup_no_group;
>> + const char *default_metricgroup_name;
>> enum aggr_mode_class aggr_mode;
>> enum metric_event_groups event_grouping;
>> };
>> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
>> index 74f2d8efc02d..efafa02db5e5 100644
>> --- a/tools/perf/util/metricgroup.c
>> +++ b/tools/perf/util/metricgroup.c
>> @@ -137,6 +137,8 @@ struct metric {
>> * output.
>> */
>> const char *metric_unit;
>> + /** Optional default metric group name */
>> + const char *default_metricgroup_name;
>
> Adding a bit more to the comment would be useful, like:
>
> Optional name of the metric group reported if the Default metric group
> is being processed.
Sure.
Thanks,
Kan
>
>> /** Optional null terminated array of referenced metrics. */
>> struct metric_ref *metric_refs;
>> /**
>> @@ -219,6 +221,7 @@ static struct metric *metric__new(const struct pmu_metric *pm,
>>
>> m->pmu = pm->pmu ?: "cpu";
>> m->metric_name = pm->metric_name;
>> + m->default_metricgroup_name = pm->default_metricgroup_name;
>> m->modifier = NULL;
>> if (modifier) {
>> m->modifier = strdup(modifier);
>> --
>> 2.35.1
>>
On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> A new field, metricgroup, has been added to the perf stat JSON output.
> Support it in the test case.
>
> Signed-off-by: Kan Liang <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Thanks,
Ian
> ---
> tools/perf/tests/shell/lib/perf_json_output_lint.py | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/tools/perf/tests/shell/lib/perf_json_output_lint.py b/tools/perf/tests/shell/lib/perf_json_output_lint.py
> index b81582a89d36..5e9bd68c83fe 100644
> --- a/tools/perf/tests/shell/lib/perf_json_output_lint.py
> +++ b/tools/perf/tests/shell/lib/perf_json_output_lint.py
> @@ -55,6 +55,7 @@ def check_json_output(expected_items):
> 'interval': lambda x: isfloat(x),
> 'metric-unit': lambda x: True,
> 'metric-value': lambda x: isfloat(x),
> + 'metricgroup': lambda x: True,
> 'node': lambda x: True,
> 'pcnt-running': lambda x: isfloat(x),
> 'socket': lambda x: True,
> @@ -70,6 +71,8 @@ def check_json_output(expected_items):
> # values and possibly other prefixes like interval, core and
> # aggregate-number.
> pass
> + elif count != expected_items and count >= 1 and count <= 5 and 'metricgroup' in item:
> + pass
> elif count != expected_items:
> raise RuntimeError(f'wrong number of fields. counted {count} expected {expected_items}'
> f' in \'{item}\'')
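For readers following along, the tolerance rule this hunk adds can be sketched standalone (hypothetical helper name, assuming a metricgroup-only line carries between one and five JSON fields):

```python
# Standalone sketch of the linter's new tolerance rule: a JSON object
# that carries a "metricgroup" key may legitimately have fewer fields
# than a full counter line, since no event is printed for it.
import json

def is_tolerated(line, expected_items, lo=1, hi=5):
    """Return True if a short line is acceptable because it only
    reports a metric group (mirrors the patch's elif branch)."""
    item = json.loads(line)
    count = len(item)
    return count != expected_items and lo <= count <= hi and 'metricgroup' in item

# Example: a metricgroup-only line with 3 fields where 8 are expected.
line = '{"metricgroup": "TopdownL1", "metric-value": "17.5", "metric-unit": "%"}'
```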
> --
> 2.35.1
>
On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> In the default mode, the current output of a metricgroup includes both
> events and metrics, which is not necessary and just makes the output
> hard to read. Since different ARCHs (even different generations of the
> same ARCH) may use different events, the output also varies across
> platforms.
>
> For a metricgroup, only outputting the value of each metric is good
> enough.
>
> Current perf may append different metric groups to the same leader
> event, or append the metrics from the same metricgroup to different
> events. That can cause confusion in the metricgroup-only output
> mode, for example by printing the same metricgroup name several
> times.
> Reorganize metricgroup for the default mode and make sure that
> a metricgroup can only be appended to one event.
> Sort the metricgroup for the default mode by the name of the
> metricgroup.
>
> Add a new field default_metricgroup in evsel to indicate an event of
> the default metricgroup. For those events, printout() should print
> the metricgroup name rather than events.
>
> Add print_metricgroup_header() to print out the metricgroup name in
> different output formats.
>
> On SPR
> Before:
>
> ./perf_old stat sleep 1
>
> Performance counter stats for 'sleep 1':
>
> 0.54 msec task-clock:u # 0.001 CPUs utilized
> 0 context-switches:u # 0.000 /sec
> 0 cpu-migrations:u # 0.000 /sec
> 68 page-faults:u # 125.445 K/sec
> 540,970 cycles:u # 0.998 GHz
> 556,325 instructions:u # 1.03 insn per cycle
> 123,602 branches:u # 228.018 M/sec
> 6,889 branch-misses:u # 5.57% of all branches
> 3,245,820 TOPDOWN.SLOTS:u # 18.4 % tma_backend_bound
> # 17.2 % tma_retiring
> # 23.1 % tma_bad_speculation
> # 41.4 % tma_frontend_bound
> 564,859 topdown-retiring:u
> 1,370,999 topdown-fe-bound:u
> 603,271 topdown-be-bound:u
> 744,874 topdown-bad-spec:u
> 12,661 INT_MISC.UOP_DROPPING:u # 23.357 M/sec
>
> 1.001798215 seconds time elapsed
>
> 0.000193000 seconds user
> 0.001700000 seconds sys
>
> After:
>
> $ ./perf stat sleep 1
>
> Performance counter stats for 'sleep 1':
>
> 0.51 msec task-clock:u # 0.001 CPUs utilized
> 0 context-switches:u # 0.000 /sec
> 0 cpu-migrations:u # 0.000 /sec
> 68 page-faults:u # 132.683 K/sec
> 545,228 cycles:u # 1.064 GHz
> 555,509 instructions:u # 1.02 insn per cycle
> 123,574 branches:u # 241.120 M/sec
> 6,957 branch-misses:u # 5.63% of all branches
> TopdownL1 # 17.5 % tma_backend_bound
> # 22.6 % tma_bad_speculation
> # 42.7 % tma_frontend_bound
> # 17.1 % tma_retiring
> TopdownL2 # 21.8 % tma_branch_mispredicts
> # 11.5 % tma_core_bound
> # 13.4 % tma_fetch_bandwidth
> # 29.3 % tma_fetch_latency
> # 2.7 % tma_heavy_operations
> # 14.5 % tma_light_operations
> # 0.8 % tma_machine_clears
> # 6.1 % tma_memory_bound
>
> 1.001712086 seconds time elapsed
>
> 0.000151000 seconds user
> 0.001618000 seconds sys
>
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/builtin-stat.c | 1 +
> tools/perf/util/evsel.h | 1 +
> tools/perf/util/metricgroup.c | 106 ++++++++++++++++++++++++++++++++-
> tools/perf/util/metricgroup.h | 1 +
> tools/perf/util/stat-display.c | 69 ++++++++++++++++++++-
> 5 files changed, 172 insertions(+), 6 deletions(-)
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 2269b3e90e9b..b274cc264d56 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -2172,6 +2172,7 @@ static int add_default_attributes(void)
>
> evlist__for_each_entry(metric_evlist, metric_evsel) {
> metric_evsel->skippable = true;
> + metric_evsel->default_metricgroup = true;
> }
> evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
> evlist__delete(metric_evlist);
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 36a32e4ca168..61b1385108f4 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -130,6 +130,7 @@ struct evsel {
> bool reset_group;
> bool errored;
> bool needs_auxtrace_mmap;
> + bool default_metricgroup;
A comment would be useful here, something like:
If running perf stat, is this evsel a member of a Default metric group metric.
> struct hashmap *per_pkg_mask;
> int err;
> struct {
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index efafa02db5e5..22181ce4f27f 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -79,6 +79,7 @@ static struct rb_node *metric_event_new(struct rblist *rblist __maybe_unused,
> return NULL;
> memcpy(me, entry, sizeof(struct metric_event));
> me->evsel = ((struct metric_event *)entry)->evsel;
> + me->default_metricgroup_name = NULL;
> INIT_LIST_HEAD(&me->head);
> return &me->nd;
> }
> @@ -1133,14 +1134,19 @@ static int metricgroup__add_metric_sys_event_iter(const struct pmu_metric *pm,
> /**
> * metric_list_cmp - list_sort comparator that sorts metrics with more events to
> * the front. tool events are excluded from the count.
> + * For the default metrics, sort them by metricgroup name.
> */
> -static int metric_list_cmp(void *priv __maybe_unused, const struct list_head *l,
> +static int metric_list_cmp(void *priv, const struct list_head *l,
> const struct list_head *r)
> {
> const struct metric *left = container_of(l, struct metric, nd);
> const struct metric *right = container_of(r, struct metric, nd);
> struct expr_id_data *data;
> int i, left_count, right_count;
> + bool is_default = *(bool *)priv;
> +
> + if (is_default && left->default_metricgroup_name && right->default_metricgroup_name)
> + return strcmp(left->default_metricgroup_name, right->default_metricgroup_name);
This breaks the comment above. The events are now sorted prioritizing
default metric group names. This will potentially reduce sharing of
events between groups; it will also break the assumption within that
code that each metric has the same number of events or fewer as you
process the list. To remedy this I think you need to re-sort the
metrics after the event sharing has had a chance to share events
between groups.
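To make the two-sort idea concrete, here is a compilable sketch of the second (output-order) comparator, using a stub struct and qsort in place of struct metric and list_sort (names here are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct metric: only the two fields the
 * default-mode ordering depends on. */
struct metric_stub {
	const char *group;	/* default_metricgroup_name */
	int nr_events;		/* number of events in the metric */
};

/* Output-order comparator for the second sort: group metrics by
 * Default metricgroup name first; within a group, keep the metric
 * with more events in front, preserving the original
 * "largest to smallest" rule. */
static int default_metric_cmp(const void *a, const void *b)
{
	const struct metric_stub *l = a, *r = b;
	int ret = strcmp(l->group, r->group);

	return ret ? ret : r->nr_events - l->nr_events;
}
```

The first sort would keep the existing more-events-first comparator so event sharing still walks metrics largest to smallest; this comparator would only run once sharing is done.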
>
> left_count = hashmap__size(left->pctx->ids);
> perf_tool_event__for_each_event(i) {
> @@ -1497,6 +1503,91 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu,
> return ret;
> }
>
> +static struct metric_event *
> +metricgroup__lookup_default_metricgroup(struct rblist *metric_events,
> + struct evsel *evsel,
> + struct metric *m)
> +{
> + struct metric_event *me;
> + char *name;
> + int err;
> +
> + me = metricgroup__lookup(metric_events, evsel, true);
> + if (!me->default_metricgroup_name) {
> + if (m->pmu && strcmp(m->pmu, "cpu"))
> + err = asprintf(&name, "%s (%s)", m->default_metricgroup_name, m->pmu);
> + else
> + err = asprintf(&name, "%s", m->default_metricgroup_name);
> + if (err < 0)
> + return NULL;
> + me->default_metricgroup_name = name;
> + }
> + if (!strncmp(m->default_metricgroup_name,
> + me->default_metricgroup_name,
> + strlen(m->default_metricgroup_name)))
> + return me;
> +
> + return NULL;
> +}
A function comment would be useful as the name is confusing: why
lookup, when the function creates the value? Leak sanitizer isn't
happy here:
```
==1545918==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 10 byte(s) in 1 object(s) allocated from:
#0 0x7f2755a7077b in __interceptor_strdup
../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
#1 0x564986a8df31 in asprintf util/util.c:566
#2 0x5649869b5901 in metricgroup__lookup_default_metricgroup
util/metricgroup.c:1520
#3 0x5649869b5e57 in metricgroup__lookup_create util/metricgroup.c:1579
#4 0x5649869b6ddc in parse_groups util/metricgroup.c:1698
#5 0x5649869b7714 in metricgroup__parse_groups util/metricgroup.c:1771
#6 0x5649867da9d5 in add_default_attributes tools/perf/builtin-stat.c:2164
#7 0x5649867ddbfb in cmd_stat tools/perf/builtin-stat.c:2707
#8 0x5649868fa5a2 in run_builtin tools/perf/perf.c:323
#9 0x5649868fab13 in handle_internal_command tools/perf/perf.c:377
#10 0x5649868faedb in run_argv tools/perf/perf.c:421
#11 0x5649868fb443 in main tools/perf/perf.c:537
#12 0x7f2754846189 in __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
```
> +static struct metric_event *
> +metricgroup__lookup_create(struct rblist *metric_events,
> + struct evsel **evsel,
> + struct list_head *metric_list,
> + struct metric *m,
> + bool is_default)
> +{
> + struct metric_event *me;
> + struct metric *cur;
> + struct evsel *ev;
> + size_t i;
> +
> + if (!is_default)
> + return metricgroup__lookup(metric_events, evsel[0], true);
> +
> + /*
> + * If the metric group has been attached to a previous
> + * event/metric, use that metric event.
> + */
> + list_for_each_entry(cur, metric_list, nd) {
> + if (cur == m)
> + break;
> + if (cur->pmu && strcmp(m->pmu, cur->pmu))
> + continue;
> + if (strncmp(m->default_metricgroup_name,
> + cur->default_metricgroup_name,
> + strlen(m->default_metricgroup_name)))
> + continue;
> + if (!cur->evlist)
> + continue;
> + evlist__for_each_entry(cur->evlist, ev) {
> + me = metricgroup__lookup(metric_events, ev, false);
> + if (!strncmp(m->default_metricgroup_name,
> + me->default_metricgroup_name,
> + strlen(m->default_metricgroup_name)))
> + return me;
> + }
> + }
> +
> + /*
> + * Different metric groups may append to the same leader event.
> + * For example, TopdownL1 and TopdownL2 are appended to the
> + * TOPDOWN.SLOTS event.
> + * Split it and append the new metric group to the next available
> + * event.
> + */
> + me = metricgroup__lookup_default_metricgroup(metric_events, evsel[0], m);
> + if (me)
> + return me;
> +
> + for (i = 1; i < hashmap__size(m->pctx->ids); i++) {
> + me = metricgroup__lookup_default_metricgroup(metric_events, evsel[i], m);
> + if (me)
> + return me;
> + }
> + return NULL;
> +}
> +
I have a hard time understanding this function; does it just go away
if you do the two sorts that I proposed above? Should this be
metric_event__lookup_create? A function comment saying what the code
is trying to achieve would be useful.
This appears to be trying to correct output issues by changing how
metrics are associated with events; shouldn't output issues be
resolved by fixing the output code? If not, why not apply this logic
to TopdownL1 as well, rather than just Default?
> static int parse_groups(struct evlist *perf_evlist,
> const char *pmu, const char *str,
> bool metric_no_group,
> @@ -1512,6 +1603,7 @@ static int parse_groups(struct evlist *perf_evlist,
> LIST_HEAD(metric_list);
> struct metric *m;
> bool tool_events[PERF_TOOL_MAX] = {false};
> + bool is_default = !strcmp(str, "Default");
> int ret;
>
> if (metric_events_list->nr_entries == 0)
> @@ -1523,7 +1615,7 @@ static int parse_groups(struct evlist *perf_evlist,
> goto out;
>
> /* Sort metrics from largest to smallest. */
> - list_sort(NULL, &metric_list, metric_list_cmp);
> + list_sort((void *)&is_default, &metric_list, metric_list_cmp);
>
> if (!metric_no_merge) {
> struct expr_parse_ctx *combined = NULL;
> @@ -1603,7 +1695,15 @@ static int parse_groups(struct evlist *perf_evlist,
> goto out;
> }
>
> - me = metricgroup__lookup(metric_events_list, metric_events[0], true);
> + me = metricgroup__lookup_create(metric_events_list,
> + metric_events,
> + &metric_list, m,
> + is_default);
> + if (!me) {
> + pr_err("Cannot create metric group for default!\n");
> + ret = -EINVAL;
> + goto out;
> + }
>
> expr = malloc(sizeof(struct metric_expr));
> if (!expr) {
> diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
> index bf18274c15df..e3609b853213 100644
> --- a/tools/perf/util/metricgroup.h
> +++ b/tools/perf/util/metricgroup.h
> @@ -22,6 +22,7 @@ struct cgroup;
> struct metric_event {
> struct rb_node nd;
> struct evsel *evsel;
> + char *default_metricgroup_name;
> struct list_head head; /* list of metric_expr */
> };
>
> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> index a2bbdc25d979..efe5fd04c033 100644
> --- a/tools/perf/util/stat-display.c
> +++ b/tools/perf/util/stat-display.c
> @@ -21,10 +21,12 @@
> #include "iostat.h"
> #include "pmu.h"
> #include "pmus.h"
> +#include "metricgroup.h"
This is bringing metric code into stat-display; keeping that
separation is kind of the whole reason stat-shadow exists. Should the
logic live in stat-shadow instead?
>
> #define CNTR_NOT_SUPPORTED "<not supported>"
> #define CNTR_NOT_COUNTED "<not counted>"
>
> +#define MGROUP_LEN 50
> #define METRIC_LEN 38
> #define EVNAME_LEN 32
> #define COUNTS_LEN 18
> @@ -707,6 +709,55 @@ static bool evlist__has_hybrid(struct evlist *evlist)
> return false;
> }
>
> +static void print_metricgroup_header_json(struct perf_stat_config *config,
> + struct outstate *os __maybe_unused,
> + const char *metricgroup_name)
> +{
> + fprintf(config->output, "\"metricgroup\" : \"%s\"}", metricgroup_name);
> + new_line_json(config, (void *)os);
> +}
> +
Should the output part of this patch be separate from the
evsel/evlist/metric modifications?
Thanks,
Ian
> +static void print_metricgroup_header_csv(struct perf_stat_config *config,
> + struct outstate *os,
> + const char *metricgroup_name)
> +{
> + int i;
> +
> + for (i = 0; i < os->nfields; i++)
> + fputs(config->csv_sep, os->fh);
> + fprintf(config->output, "%s", metricgroup_name);
> + new_line_csv(config, (void *)os);
> +}
> +
> +static void print_metricgroup_header_std(struct perf_stat_config *config,
> + struct outstate *os __maybe_unused,
> + const char *metricgroup_name)
> +{
> + int n = fprintf(config->output, " %*s", EVNAME_LEN, metricgroup_name);
> +
> + fprintf(config->output, "%*s", MGROUP_LEN - n - 1, "");
> +}
> +
> +static void print_metricgroup_header(struct perf_stat_config *config,
> + struct outstate *os,
> + struct evsel *counter,
> + double noise, u64 run, u64 ena,
> + const char *metricgroup_name)
> +{
> + aggr_printout(config, os->evsel, os->id, os->aggr_nr);
> +
> + print_noise(config, counter, noise, /*before_metric=*/true);
> + print_running(config, run, ena, /*before_metric=*/true);
> +
> + if (config->json_output) {
> + print_metricgroup_header_json(config, os, metricgroup_name);
> + } else if (config->csv_output) {
> + print_metricgroup_header_csv(config, os, metricgroup_name);
> + } else
> + print_metricgroup_header_std(config, os, metricgroup_name);
> +
> +}
> +
> static void printout(struct perf_stat_config *config, struct outstate *os,
> double uval, u64 run, u64 ena, double noise, int aggr_idx)
> {
> @@ -751,10 +802,17 @@ static void printout(struct perf_stat_config *config, struct outstate *os,
> out.force_header = false;
>
> if (!config->metric_only) {
> - abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
> + if (counter->default_metricgroup) {
> + struct metric_event *me;
>
> - print_noise(config, counter, noise, /*before_metric=*/true);
> - print_running(config, run, ena, /*before_metric=*/true);
> + me = metricgroup__lookup(&config->metric_events, counter, false);
> + print_metricgroup_header(config, os, counter, noise, run, ena,
> + me->default_metricgroup_name);
> + } else {
> + abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
> + print_noise(config, counter, noise, /*before_metric=*/true);
> + print_running(config, run, ena, /*before_metric=*/true);
> + }
> }
>
> if (ok) {
> @@ -883,6 +941,11 @@ static void print_counter_aggrdata(struct perf_stat_config *config,
> if (counter->merged_stat)
> return;
>
> + /* Only print the metric group for the default mode */
> + if (counter->default_metricgroup &&
> + !metricgroup__lookup(&config->metric_events, counter, false))
> + return;
> +
> uniquify_counter(config, counter);
>
> val = aggr->counts.val;
> --
> 2.35.1
>
On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Add a new test case to verify the standard perf stat output with
> different options.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/tests/shell/stat+std_output.sh | 259 ++++++++++++++++++++++
> 1 file changed, 259 insertions(+)
> create mode 100755 tools/perf/tests/shell/stat+std_output.sh
>
> diff --git a/tools/perf/tests/shell/stat+std_output.sh b/tools/perf/tests/shell/stat+std_output.sh
> new file mode 100755
> index 000000000000..b9db0f245450
> --- /dev/null
> +++ b/tools/perf/tests/shell/stat+std_output.sh
> @@ -0,0 +1,259 @@
> +#!/bin/bash
> +# perf stat STD output linter
> +# SPDX-License-Identifier: GPL-2.0
> +# Tests various perf stat STD output commands for
> +# default event and metricgroup
> +
> +set -e
> +
> +skip_test=0
> +
> +stat_output=$(mktemp /tmp/__perf_test.stat_output.std.XXXXX)
> +
> +event_name=(cpu-clock task-clock context-switches cpu-migrations page-faults cycles instructions branches branch-misses stalled-cycles-frontend stalled-cycles-backend)
> +event_metric=("CPUs utilized" "CPUs utilized" "/sec" "/sec" "/sec" "GHz" "insn per cycle" "/sec" "of all branches" "frontend cycles idle" "backend cycles idle")
> +
> +metricgroup_name=(TopdownL1 TopdownL2)
> +
> +cleanup() {
> + rm -f "${stat_output}"
> +
> + trap - EXIT TERM INT
> +}
> +
> +trap_cleanup() {
> + cleanup
> + exit 1
> +}
> +trap trap_cleanup EXIT TERM INT
> +
> +function commachecker()
> +{
> + local -i cnt=0
> + local prefix=1
> +
> + case "$1"
> + in "--interval") prefix=2
> + ;; "--per-thread") prefix=2
> + ;; "--system-wide-no-aggr") prefix=2
> + ;; "--per-core") prefix=3
> + ;; "--per-socket") prefix=3
> + ;; "--per-node") prefix=3
> + ;; "--per-die") prefix=3
> + ;; "--per-cache") prefix=3
> + esac
> +
> + while read line
> + do
> + # Ignore initial "started on" comment.
> + x=${line:0:1}
> + [ "$x" = "#" ] && continue
> + # Ignore initial blank line.
> + [ "$line" = "" ] && continue
> + # Ignore "Performance counter stats"
> + x=${line:0:25}
> + [ "$x" = "Performance counter stats" ] && continue
> + # Ignore "seconds time elapsed" and break
> + [[ "$line" == *"time elapsed"* ]] && break
> +
> + main_body=$(echo $line | cut -d' ' -f$prefix-)
> + x=${main_body%#*}
> + # Check default metricgroup
> + y=$(echo $x | tr -d ' ')
> + [ "$y" = "" ] && continue
> + for i in "${!metricgroup_name[@]}"; do
> + [[ "$y" == *"${metricgroup_name[$i]}"* ]] && break
> + done
> + [[ "$y" == *"${metricgroup_name[$i]}"* ]] && continue
> +
> + # Check default event
> + for i in "${!event_name[@]}"; do
> + [[ "$x" == *"${event_name[$i]}"* ]] && break
> + done
> +
> + [[ ! "$x" == *"${event_name[$i]}"* ]] && {
> + echo "Unknown event name in $line" 1>&2
> + exit 1;
> + }
> +
> + # Check event metric if it exists
> + [[ ! "$main_body" == *"#"* ]] && continue
> + [[ ! "$main_body" == *"${event_metric[$i]}"* ]] && {
> + echo "wrong event metric. expected ${event_metric[$i]} in $line" 1>&2
> + exit 1;
> + }
> + done < "${stat_output}"
> + return 0
> +}
> +
> +# Return true if perf_event_paranoid is > $1 and not running as root.
> +function ParanoidAndNotRoot()
> +{
> + [ $(id -u) != 0 ] && [ $(cat /proc/sys/kernel/perf_event_paranoid) -gt $1 ]
> +}
> +
> +check_no_args()
> +{
> + echo -n "Checking STD output: no args "
> + perf stat -o "${stat_output}" true
> + commachecker --no-args
> + echo "[Success]"
> +}
> +
> +check_system_wide()
> +{
> + echo -n "Checking STD output: system wide "
> + if ParanoidAndNotRoot 0
> + then
> + echo "[Skip] paranoid and not root"
> + return
> + fi
> + perf stat -a -o "${stat_output}" true
> + commachecker --system-wide
> + echo "[Success]"
> +}
> +
> +check_system_wide_no_aggr()
> +{
> + echo -n "Checking STD output: system wide no aggregation "
> + if ParanoidAndNotRoot 0
> + then
> + echo "[Skip] paranoid and not root"
> + return
> + fi
> + perf stat -A -a --no-merge -o "${stat_output}" true
> + commachecker --system-wide-no-aggr
> + echo "[Success]"
> +}
> +
> +check_interval()
> +{
> + echo -n "Checking STD output: interval "
> + perf stat -I 1000 -o "${stat_output}" true
> + commachecker --interval
> + echo "[Success]"
> +}
> +
> +
> +check_per_core()
> +{
> + echo -n "Checking STD output: per core "
> + if ParanoidAndNotRoot 0
> + then
> + echo "[Skip] paranoid and not root"
> + return
> + fi
> + perf stat --per-core -a -o "${stat_output}" true
> + commachecker --per-core
> + echo "[Success]"
> +}
> +
> +check_per_thread()
> +{
> + echo -n "Checking STD output: per thread "
> + if ParanoidAndNotRoot 0
> + then
> + echo "[Skip] paranoid and not root"
> + return
> + fi
> + perf stat --per-thread -a -o "${stat_output}" true
> + commachecker --per-thread
> + echo "[Success]"
> +}
> +
> +check_per_cache_instance()
> +{
> + echo -n "Checking STD output: per cache instance "
> + if ParanoidAndNotRoot 0
> + then
> + echo "[Skip] paranoid and not root"
> + return
> + fi
> + perf stat --per-cache -a true 2>&1 | commachecker --per-cache
> + echo "[Success]"
> +}
> +
> +check_per_die()
> +{
> + echo -n "Checking STD output: per die "
> + if ParanoidAndNotRoot 0
> + then
> + echo "[Skip] paranoid and not root"
> + return
> + fi
> + perf stat --per-die -a -o "${stat_output}" true
> + commachecker --per-die
> + echo "[Success]"
> +}
> +
> +check_per_node()
> +{
> + echo -n "Checking STD output: per node "
> + if ParanoidAndNotRoot 0
> + then
> + echo "[Skip] paranoid and not root"
> + return
> + fi
> + perf stat --per-node -a -o "${stat_output}" true
> + commachecker --per-node
> + echo "[Success]"
> +}
> +
> +check_per_socket()
> +{
> + echo -n "Checking STD output: per socket "
> + if ParanoidAndNotRoot 0
> + then
> + echo "[Skip] paranoid and not root"
> + return
> + fi
> + perf stat --per-socket -a -o "${stat_output}" true
> + commachecker --per-socket
> + echo "[Success]"
> +}
> +
> +# The perf stat options for per-socket, per-core, per-die
> +# and -A ( no_aggr mode ) uses the info fetched from this
> +# directory: "/sys/devices/system/cpu/cpu*/topology". For
> +# example, socket value is fetched from "physical_package_id"
> +# file in topology directory.
> +# Reference: cpu__get_topology_int in util/cpumap.c
> +# If the platform doesn't expose topology information, values
> +# will be set to -1. For example, incase of pSeries platform
> +# of powerpc, value for "physical_package_id" is restricted
> +# and set to -1. Check here validates the socket-id read from
> +# topology file before proceeding further
> +
> +FILE_LOC="/sys/devices/system/cpu/cpu*/topology/"
> +FILE_NAME="physical_package_id"
> +
> +check_for_topology()
> +{
> + if ! ParanoidAndNotRoot 0
> + then
> + socket_file=`ls $FILE_LOC/$FILE_NAME | head -n 1`
> + [ -z $socket_file ] && return 0
> + socket_id=`cat $socket_file`
> + [ $socket_id == -1 ] && skip_test=1
> + return 0
> + fi
> +}
Tests, great! This logic is taken from
tools/perf/tests/shell/stat+csv_output.sh; could we share the
implementation between that test and this one by moving the code into
something in the lib directory?
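As a sketch of what the shared helpers might look like (hypothetical file tools/perf/tests/shell/lib/stat_output.sh, sourced by both tests instead of redefining the functions):

```shell
# Hypothetical shared helpers for the stat output linters, extracted
# from stat+csv_output.sh / stat+std_output.sh.

# Return true if perf_event_paranoid is > $1 and not running as root.
ParanoidAndNotRoot() {
	[ "$(id -u)" != 0 ] && [ "$(cat /proc/sys/kernel/perf_event_paranoid)" -gt "$1" ]
}

# Set skip_test=1 when the topology files report -1 (e.g. pSeries),
# so callers skip the per-socket/per-core/per-die checks.
check_for_topology() {
	skip_test=0
	if ! ParanoidAndNotRoot 0; then
		socket_file=$(ls /sys/devices/system/cpu/cpu*/topology/physical_package_id 2>/dev/null | head -n1)
		[ -z "$socket_file" ] && return 0
		[ "$(cat "$socket_file")" = "-1" ] && skip_test=1
	fi
	return 0
}
```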
Thanks,
Ian
> +
> +check_for_topology
> +check_no_args
> +check_system_wide
> +check_interval
> +check_per_thread
> +check_per_node
> +if [ $skip_test -ne 1 ]
> +then
> + check_system_wide_no_aggr
> + check_per_core
> + check_per_cache_instance
> + check_per_die
> + check_per_socket
> +else
> + echo "[Skip] Skipping tests for system_wide_no_aggr, per_core, per_die and per_socket since socket id exposed via topology is invalid"
> +fi
> +cleanup
> +exit 0
> --
> 2.35.1
>
On 2023-06-13 3:44 p.m., Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> For the default output, the default metric group could vary on different
>> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
>> should be displayed in the default mode. On ICL, only the TopdownL1
>> should be displayed.
>>
>> Add a flag so we can tag the default metric group for different
>> platforms rather than hack the perf code.
>>
>> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
>> since SPR.
>>
>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
>> the real metric group name.
>>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> .../arch/x86/alderlake/adl-metrics.json | 20 ++++---
>> .../arch/x86/icelake/icl-metrics.json | 20 ++++---
>> .../arch/x86/icelakex/icx-metrics.json | 20 ++++---
>> .../arch/x86/sapphirerapids/spr-metrics.json | 60 +++++++++++--------
>> .../arch/x86/tigerlake/tgl-metrics.json | 20 ++++---
>> 5 files changed, 84 insertions(+), 56 deletions(-)
>>
>> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>> index c9f7e3d4ab08..e78c85220e27 100644
>> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>> @@ -832,22 +832,24 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_backend_bound",
>> "MetricThreshold": "tma_backend_bound > 0.2",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>> "ScaleUnit": "100%",
>> "Unit": "cpu_core"
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_bad_speculation",
>> "MetricThreshold": "tma_bad_speculation > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>> "ScaleUnit": "100%",
>> "Unit": "cpu_core"
>> @@ -1112,11 +1114,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_frontend_bound",
>> "MetricThreshold": "tma_frontend_bound > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>> "ScaleUnit": "100%",
>> "Unit": "cpu_core"
>> @@ -2316,11 +2319,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_retiring",
>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>> "ScaleUnit": "100%",
>> "Unit": "cpu_core"
>
> For Alderlake the Default metric group is added for all cpu_core
> metrics but not cpu_atom. This will lead to only getting metrics for
> performance cores while the workload could be running on atoms. This
> could lead to a false conclusion that the workload has no issues with
> the metrics. I think this behavior is surprising and should be called
> out as intentional in the commit message.
>
The e-core doesn't have enough counters to calculate all the Topdown
events, which would trigger multiplexing. We try to avoid that in the
default mode.
I will update the commit message in V2.
Thanks,
Kan
> Thanks,
> Ian
>
>> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>> index 20210742171d..cc4edf855064 100644
>> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>> @@ -111,21 +111,23 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_backend_bound",
>> "MetricThreshold": "tma_backend_bound > 0.2",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>> "ScaleUnit": "100%"
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_bad_speculation",
>> "MetricThreshold": "tma_bad_speculation > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>> "ScaleUnit": "100%"
>> },
>> @@ -372,11 +374,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_frontend_bound",
>> "MetricThreshold": "tma_frontend_bound > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>> "ScaleUnit": "100%"
>> },
>> @@ -1378,11 +1381,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_retiring",
>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>> "ScaleUnit": "100%"
>> },
>> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>> index ef25cda019be..6f25b5b7aaf6 100644
>> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>> @@ -315,21 +315,23 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_backend_bound",
>> "MetricThreshold": "tma_backend_bound > 0.2",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>> "ScaleUnit": "100%"
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_bad_speculation",
>> "MetricThreshold": "tma_bad_speculation > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>> "ScaleUnit": "100%"
>> },
>> @@ -576,11 +578,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_frontend_bound",
>> "MetricThreshold": "tma_frontend_bound > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>> "ScaleUnit": "100%"
>> },
>> @@ -1674,11 +1677,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_retiring",
>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>> "ScaleUnit": "100%"
>> },
>> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>> index 4f3dd85540b6..c732982f70b5 100644
>> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>> @@ -340,31 +340,34 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_backend_bound",
>> "MetricThreshold": "tma_backend_bound > 0.2",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>> "ScaleUnit": "100%"
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_bad_speculation",
>> "MetricThreshold": "tma_bad_speculation > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>> "ScaleUnit": "100%"
>> },
>> {
>> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
>> + "DefaultMetricgroupName": "TopdownL2",
>> "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>> + "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>> "MetricName": "tma_branch_mispredicts",
>> "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
>> - "MetricgroupNoGroup": "TopdownL2",
>> + "MetricgroupNoGroup": "TopdownL2;Default",
>> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
>> "ScaleUnit": "100%"
>> },
>> @@ -407,11 +410,12 @@
>> },
>> {
>> "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
>> + "DefaultMetricgroupName": "TopdownL2",
>> "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
>> - "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>> + "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>> "MetricName": "tma_core_bound",
>> "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
>> - "MetricgroupNoGroup": "TopdownL2",
>> + "MetricgroupNoGroup": "TopdownL2;Default",
>> "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
>> "ScaleUnit": "100%"
>> },
>> @@ -509,21 +513,23 @@
>> },
>> {
>> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
>> + "DefaultMetricgroupName": "TopdownL2",
>> "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
>> - "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>> + "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>> "MetricName": "tma_fetch_bandwidth",
>> "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
>> - "MetricgroupNoGroup": "TopdownL2",
>> + "MetricgroupNoGroup": "TopdownL2;Default",
>> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
>> "ScaleUnit": "100%"
>> },
>> {
>> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
>> + "DefaultMetricgroupName": "TopdownL2",
>> "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> - "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>> + "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>> "MetricName": "tma_fetch_latency",
>> "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
>> - "MetricgroupNoGroup": "TopdownL2",
>> + "MetricgroupNoGroup": "TopdownL2;Default",
>> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
>> "ScaleUnit": "100%"
>> },
>> @@ -611,11 +617,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_frontend_bound",
>> "MetricThreshold": "tma_frontend_bound > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>> "ScaleUnit": "100%"
>> },
>> @@ -630,11 +637,12 @@
>> },
>> {
>> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
>> + "DefaultMetricgroupName": "TopdownL2",
>> "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>> "MetricName": "tma_heavy_operations",
>> "MetricThreshold": "tma_heavy_operations > 0.1",
>> - "MetricgroupNoGroup": "TopdownL2",
>> + "MetricgroupNoGroup": "TopdownL2;Default",
>> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
>> "ScaleUnit": "100%"
>> },
>> @@ -1486,11 +1494,12 @@
>> },
>> {
>> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
>> + "DefaultMetricgroupName": "TopdownL2",
>> "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
>> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>> "MetricName": "tma_light_operations",
>> "MetricThreshold": "tma_light_operations > 0.6",
>> - "MetricgroupNoGroup": "TopdownL2",
>> + "MetricgroupNoGroup": "TopdownL2;Default",
>> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
>> "ScaleUnit": "100%"
>> },
>> @@ -1540,11 +1549,12 @@
>> },
>> {
>> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
>> + "DefaultMetricgroupName": "TopdownL2",
>> "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
>> - "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>> + "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>> "MetricName": "tma_machine_clears",
>> "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
>> - "MetricgroupNoGroup": "TopdownL2",
>> + "MetricgroupNoGroup": "TopdownL2;Default",
>> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
>> "ScaleUnit": "100%"
>> },
>> @@ -1576,11 +1586,12 @@
>> },
>> {
>> "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
>> + "DefaultMetricgroupName": "TopdownL2",
>> "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>> + "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>> "MetricName": "tma_memory_bound",
>> "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
>> - "MetricgroupNoGroup": "TopdownL2",
>> + "MetricgroupNoGroup": "TopdownL2;Default",
>> "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
>> "ScaleUnit": "100%"
>> },
>> @@ -1784,11 +1795,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_retiring",
>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>> "ScaleUnit": "100%"
>> },
>> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>> index d0538a754288..83346911aa63 100644
>> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>> @@ -105,21 +105,23 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_backend_bound",
>> "MetricThreshold": "tma_backend_bound > 0.2",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>> "ScaleUnit": "100%"
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_bad_speculation",
>> "MetricThreshold": "tma_bad_speculation > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>> "ScaleUnit": "100%"
>> },
>> @@ -366,11 +368,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_frontend_bound",
>> "MetricThreshold": "tma_frontend_bound > 0.15",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>> "ScaleUnit": "100%"
>> },
>> @@ -1392,11 +1395,12 @@
>> },
>> {
>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> + "DefaultMetricgroupName": "TopdownL1",
>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>> "MetricName": "tma_retiring",
>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> - "MetricgroupNoGroup": "TopdownL1",
>> + "MetricgroupNoGroup": "TopdownL1;Default",
>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>> "ScaleUnit": "100%"
>> },
>> --
>> 2.35.1
>>
On Tue, Jun 13, 2023 at 01:17:41PM -0700, Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
> >
> > From: Kan Liang <[email protected]>
> >
> > A new field metricgroup has been added in the perf stat JSON output.
> > Support it in the test case.
> >
> > Signed-off-by: Kan Liang <[email protected]>
>
> Acked-by: Ian Rogers <[email protected]>
Thanks, applied.
- Arnaldo
On Tue, Jun 13, 2023 at 12:45:10PM -0700, Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
> >
> > From: Kan Liang <[email protected]>
> >
> > Add the default tags for ARM as well.
> >
> > Signed-off-by: Kan Liang <[email protected]>
> > Cc: Jing Zhang <[email protected]>
> > Cc: John Garry <[email protected]>
>
> Acked-by: Ian Rogers <[email protected]>
Thanks, applied.
- Arnaldo
> Thanks,
> Ian
>
> > ---
> > tools/perf/pmu-events/arch/arm64/sbsa.json | 12 ++++++++----
> > 1 file changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/tools/perf/pmu-events/arch/arm64/sbsa.json b/tools/perf/pmu-events/arch/arm64/sbsa.json
> > index f678c37ea9c3..f90b338261ac 100644
> > --- a/tools/perf/pmu-events/arch/arm64/sbsa.json
> > +++ b/tools/perf/pmu-events/arch/arm64/sbsa.json
> > @@ -2,28 +2,32 @@
> > {
> > "MetricExpr": "stall_slot_frontend / (#slots * cpu_cycles)",
> > "BriefDescription": "Frontend bound L1 topdown metric",
> > - "MetricGroup": "TopdownL1",
> > + "DefaultMetricgroupName": "TopdownL1",
> > + "MetricGroup": "Default;TopdownL1",
> > "MetricName": "frontend_bound",
> > "ScaleUnit": "100%"
> > },
> > {
> > "MetricExpr": "(1 - op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
> > "BriefDescription": "Bad speculation L1 topdown metric",
> > - "MetricGroup": "TopdownL1",
> > + "DefaultMetricgroupName": "TopdownL1",
> > + "MetricGroup": "Default;TopdownL1",
> > "MetricName": "bad_speculation",
> > "ScaleUnit": "100%"
> > },
> > {
> > "MetricExpr": "(op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
> > "BriefDescription": "Retiring L1 topdown metric",
> > - "MetricGroup": "TopdownL1",
> > + "DefaultMetricgroupName": "TopdownL1",
> > + "MetricGroup": "Default;TopdownL1",
> > "MetricName": "retiring",
> > "ScaleUnit": "100%"
> > },
> > {
> > "MetricExpr": "stall_slot_backend / (#slots * cpu_cycles)",
> > "BriefDescription": "Backend Bound L1 topdown metric",
> > - "MetricGroup": "TopdownL1",
> > + "DefaultMetricgroupName": "TopdownL1",
> > + "MetricGroup": "Default;TopdownL1",
> > "MetricName": "backend_bound",
> > "ScaleUnit": "100%"
> > }
> > --
> > 2.35.1
> >
--
- Arnaldo
On Tue, Jun 13, 2023 at 1:10 PM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2023-06-13 3:44 p.m., Ian Rogers wrote:
> > On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
> >>
> >> From: Kan Liang <[email protected]>
> >>
> >> For the default output, the default metric group could vary on different
> >> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
> >> metrics should be displayed in the default mode. On ICL, only the
> >> TopdownL1 metrics should be displayed.
> >>
> >> Add a flag so we can tag the default metric group for different
> >> platforms rather than hack the perf code.
> >>
> >> The flag is added to the Intel TopdownL1 metrics since ICL, and to the
> >> TopdownL2 metrics since SPR.
> >>
> >> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
> >> the real metric group name.
> >>
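> >> To make the new fields concrete, here is a minimal sketch of a metric
> >> entry carrying the new tags. The entry itself is hypothetical; the field
> >> names, the "Default" group tag, and the semicolon-separated lists follow
> >> the hunks in this patch:

```json
{
    "BriefDescription": "Illustrative L1 topdown metric (hypothetical entry)",
    "DefaultMetricgroupName": "TopdownL1",
    "MetricExpr": "example_event / example_slots",
    "MetricGroup": "Default;TopdownL1",
    "MetricName": "example_metric",
    "MetricgroupNoGroup": "TopdownL1;Default",
    "ScaleUnit": "100%"
}
```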
> >> Signed-off-by: Kan Liang <[email protected]>
> >> ---
> >> .../arch/x86/alderlake/adl-metrics.json | 20 ++++---
> >> .../arch/x86/icelake/icl-metrics.json | 20 ++++---
> >> .../arch/x86/icelakex/icx-metrics.json | 20 ++++---
> >> .../arch/x86/sapphirerapids/spr-metrics.json | 60 +++++++++++--------
> >> .../arch/x86/tigerlake/tgl-metrics.json | 20 ++++---
> >> 5 files changed, 84 insertions(+), 56 deletions(-)
> >>
> >> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >> index c9f7e3d4ab08..e78c85220e27 100644
> >> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >> @@ -832,22 +832,24 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_backend_bound",
> >> "MetricThreshold": "tma_backend_bound > 0.2",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >> "ScaleUnit": "100%",
> >> "Unit": "cpu_core"
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_bad_speculation",
> >> "MetricThreshold": "tma_bad_speculation > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >> "ScaleUnit": "100%",
> >> "Unit": "cpu_core"
> >> @@ -1112,11 +1114,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
> >> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_frontend_bound",
> >> "MetricThreshold": "tma_frontend_bound > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >> "ScaleUnit": "100%",
> >> "Unit": "cpu_core"
> >> @@ -2316,11 +2319,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_retiring",
> >> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >> "ScaleUnit": "100%",
> >> "Unit": "cpu_core"
> >
> > For Alderlake the Default metric group is added for all cpu_core
> > metrics but not for cpu_atom. This means metrics will only be reported
> > for the performance cores even though the workload could be running on
> > the atoms, which could lead to the false conclusion that the workload
> > has no issues with these metrics. I think this behavior is surprising
> > and should be called out as intentional in the commit message.
> >
>
> The e-cores don't have enough counters to calculate all the Topdown
> events, so enabling them would trigger multiplexing, which we try to
> avoid in the default mode.
> I will update the commit message in V2.
Is multiplexing a worse crime than only giving output for half the
cores? Both can be misleading. Perhaps the safest thing is to not use
Default on hybrid platforms.
Thanks,
Ian
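As an aside for readers of the archive: the membership test behind the
discussion above is just the semicolon-separated list convention visible in
the hunks. The following sketch (illustrative only, not perf's actual
implementation; the sample entries are hypothetical) shows how a consumer of
the pmu-events JSON could select the metrics tagged "Default" and group them
for display under their real group name, as the new output format does:

```python
def in_group(metric: dict, group: str) -> bool:
    """Return True if `group` appears in the metric's semicolon-separated
    MetricGroup list (the convention used in the pmu-events JSON)."""
    return group in metric.get("MetricGroup", "").split(";")

# Two hypothetical entries modeled on the hunks in this thread: one tagged
# for the default mode, one not.
metrics = [
    {"MetricName": "tma_retiring",
     "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
     "DefaultMetricgroupName": "TopdownL1"},
    {"MetricName": "tma_ports_utilization",
     "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group"},
]

# Select only the metrics that opt into the default mode.
default_metrics = [m for m in metrics if in_group(m, "Default")]

# Bucket them by DefaultMetricgroupName, i.e. the real group name
# ("TopdownL1", "TopdownL2") shown in the new output.
by_group: dict[str, list[str]] = {}
for m in default_metrics:
    by_group.setdefault(m.get("DefaultMetricgroupName", ""), []).append(
        m["MetricName"])
```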
> Thanks,
> Kan
>
> > Thanks,
> > Ian
> >
> >> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >> index 20210742171d..cc4edf855064 100644
> >> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >> @@ -111,21 +111,23 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_backend_bound",
> >> "MetricThreshold": "tma_backend_bound > 0.2",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >> "ScaleUnit": "100%"
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_bad_speculation",
> >> "MetricThreshold": "tma_bad_speculation > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -372,11 +374,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_frontend_bound",
> >> "MetricThreshold": "tma_frontend_bound > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -1378,11 +1381,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_retiring",
> >> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >> "ScaleUnit": "100%"
> >> },
> >> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >> index ef25cda019be..6f25b5b7aaf6 100644
> >> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >> @@ -315,21 +315,23 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_backend_bound",
> >> "MetricThreshold": "tma_backend_bound > 0.2",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >> "ScaleUnit": "100%"
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_bad_speculation",
> >> "MetricThreshold": "tma_bad_speculation > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -576,11 +578,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_frontend_bound",
> >> "MetricThreshold": "tma_frontend_bound > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -1674,11 +1677,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_retiring",
> >> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >> "ScaleUnit": "100%"
> >> },
> >> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >> index 4f3dd85540b6..c732982f70b5 100644
> >> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >> @@ -340,31 +340,34 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_backend_bound",
> >> "MetricThreshold": "tma_backend_bound > 0.2",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >> "ScaleUnit": "100%"
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_bad_speculation",
> >> "MetricThreshold": "tma_bad_speculation > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >> "ScaleUnit": "100%"
> >> },
> >> {
> >> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
> >> + "DefaultMetricgroupName": "TopdownL2",
> >> "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> >> + "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> >> "MetricName": "tma_branch_mispredicts",
> >> "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL2",
> >> + "MetricgroupNoGroup": "TopdownL2;Default",
> >> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -407,11 +410,12 @@
> >> },
> >> {
> >> "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
> >> + "DefaultMetricgroupName": "TopdownL2",
> >> "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
> >> - "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >> + "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >> "MetricName": "tma_core_bound",
> >> "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
> >> - "MetricgroupNoGroup": "TopdownL2",
> >> + "MetricgroupNoGroup": "TopdownL2;Default",
> >> "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -509,21 +513,23 @@
> >> },
> >> {
> >> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
> >> + "DefaultMetricgroupName": "TopdownL2",
> >> "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
> >> - "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> >> + "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> >> "MetricName": "tma_fetch_bandwidth",
> >> "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
> >> - "MetricgroupNoGroup": "TopdownL2",
> >> + "MetricgroupNoGroup": "TopdownL2;Default",
> >> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
> >> "ScaleUnit": "100%"
> >> },
> >> {
> >> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
> >> + "DefaultMetricgroupName": "TopdownL2",
> >> "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> - "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> >> + "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> >> "MetricName": "tma_fetch_latency",
> >> "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL2",
> >> + "MetricgroupNoGroup": "TopdownL2;Default",
> >> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -611,11 +617,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_frontend_bound",
> >> "MetricThreshold": "tma_frontend_bound > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -630,11 +637,12 @@
> >> },
> >> {
> >> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
> >> + "DefaultMetricgroupName": "TopdownL2",
> >> "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >> "MetricName": "tma_heavy_operations",
> >> "MetricThreshold": "tma_heavy_operations > 0.1",
> >> - "MetricgroupNoGroup": "TopdownL2",
> >> + "MetricgroupNoGroup": "TopdownL2;Default",
> >> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -1486,11 +1494,12 @@
> >> },
> >> {
> >> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
> >> + "DefaultMetricgroupName": "TopdownL2",
> >> "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
> >> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >> "MetricName": "tma_light_operations",
> >> "MetricThreshold": "tma_light_operations > 0.6",
> >> - "MetricgroupNoGroup": "TopdownL2",
> >> + "MetricgroupNoGroup": "TopdownL2;Default",
> >> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -1540,11 +1549,12 @@
> >> },
> >> {
> >> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
> >> + "DefaultMetricgroupName": "TopdownL2",
> >> "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
> >> - "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> >> + "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> >> "MetricName": "tma_machine_clears",
> >> "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL2",
> >> + "MetricgroupNoGroup": "TopdownL2;Default",
> >> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -1576,11 +1586,12 @@
> >> },
> >> {
> >> "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
> >> + "DefaultMetricgroupName": "TopdownL2",
> >> "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >> + "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >> "MetricName": "tma_memory_bound",
> >> "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
> >> - "MetricgroupNoGroup": "TopdownL2",
> >> + "MetricgroupNoGroup": "TopdownL2;Default",
> >> "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -1784,11 +1795,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_retiring",
> >> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >> "ScaleUnit": "100%"
> >> },
> >> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >> index d0538a754288..83346911aa63 100644
> >> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >> @@ -105,21 +105,23 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_backend_bound",
> >> "MetricThreshold": "tma_backend_bound > 0.2",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >> "ScaleUnit": "100%"
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_bad_speculation",
> >> "MetricThreshold": "tma_bad_speculation > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -366,11 +368,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_frontend_bound",
> >> "MetricThreshold": "tma_frontend_bound > 0.15",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >> "ScaleUnit": "100%"
> >> },
> >> @@ -1392,11 +1395,12 @@
> >> },
> >> {
> >> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> + "DefaultMetricgroupName": "TopdownL1",
> >> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >> "MetricName": "tma_retiring",
> >> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> - "MetricgroupNoGroup": "TopdownL1",
> >> + "MetricgroupNoGroup": "TopdownL1;Default",
> >> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >> "ScaleUnit": "100%"
> >> },
> >> --
> >> 2.35.1
> >>
On 2023-06-13 4:16 p.m., Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> In the default mode, the current output of the metricgroup includes both
>> events and metrics, which is not necessary and just makes the output
>> hard to read. Since different ARCHs (even different generations of the
>> same ARCH) may use different events, the output also varies across
>> platforms.
>>
>> For a metricgroup, only outputting the value of each metric is good
>> enough.
>>
>> Current perf may append different metric groups to the same leader
>> event, or append the metrics from the same metricgroup to different
>> events. That could cause confusion when perf prints only the
>> metricgroup in the output mode; for example, the same metricgroup
>> name could be printed several times.
>> Reorganize the metricgroups for the default mode and make sure that
>> a metricgroup can only be appended to one event.
>> Sort the metricgroups for the default mode by the name of the
>> metricgroup.
>>
>> Add a new field, default_metricgroup, in evsel to indicate that an
>> event belongs to the default metricgroup. For those events, printout()
>> should print the metricgroup name rather than the events.
>>
>> Add print_metricgroup_header() to print out the metricgroup name in
>> different output formats.
>>
>> On SPR
>> Before:
>>
>> ./perf_old stat sleep 1
>>
>> Performance counter stats for 'sleep 1':
>>
>> 0.54 msec task-clock:u # 0.001 CPUs utilized
>> 0 context-switches:u # 0.000 /sec
>> 0 cpu-migrations:u # 0.000 /sec
>> 68 page-faults:u # 125.445 K/sec
>> 540,970 cycles:u # 0.998 GHz
>> 556,325 instructions:u # 1.03 insn per cycle
>> 123,602 branches:u # 228.018 M/sec
>> 6,889 branch-misses:u # 5.57% of all branches
>> 3,245,820 TOPDOWN.SLOTS:u # 18.4 % tma_backend_bound
>> # 17.2 % tma_retiring
>> # 23.1 % tma_bad_speculation
>> # 41.4 % tma_frontend_bound
>> 564,859 topdown-retiring:u
>> 1,370,999 topdown-fe-bound:u
>> 603,271 topdown-be-bound:u
>> 744,874 topdown-bad-spec:u
>> 12,661 INT_MISC.UOP_DROPPING:u # 23.357 M/sec
>>
>> 1.001798215 seconds time elapsed
>>
>> 0.000193000 seconds user
>> 0.001700000 seconds sys
>>
>> After:
>>
>> $ ./perf stat sleep 1
>>
>> Performance counter stats for 'sleep 1':
>>
>> 0.51 msec task-clock:u # 0.001 CPUs utilized
>> 0 context-switches:u # 0.000 /sec
>> 0 cpu-migrations:u # 0.000 /sec
>> 68 page-faults:u # 132.683 K/sec
>> 545,228 cycles:u # 1.064 GHz
>> 555,509 instructions:u # 1.02 insn per cycle
>> 123,574 branches:u # 241.120 M/sec
>> 6,957 branch-misses:u # 5.63% of all branches
>> TopdownL1 # 17.5 % tma_backend_bound
>> # 22.6 % tma_bad_speculation
>> # 42.7 % tma_frontend_bound
>> # 17.1 % tma_retiring
>> TopdownL2 # 21.8 % tma_branch_mispredicts
>> # 11.5 % tma_core_bound
>> # 13.4 % tma_fetch_bandwidth
>> # 29.3 % tma_fetch_latency
>> # 2.7 % tma_heavy_operations
>> # 14.5 % tma_light_operations
>> # 0.8 % tma_machine_clears
>> # 6.1 % tma_memory_bound
>>
>> 1.001712086 seconds time elapsed
>>
>> 0.000151000 seconds user
>> 0.001618000 seconds sys
>>
>>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> tools/perf/builtin-stat.c | 1 +
>> tools/perf/util/evsel.h | 1 +
>> tools/perf/util/metricgroup.c | 106 ++++++++++++++++++++++++++++++++-
>> tools/perf/util/metricgroup.h | 1 +
>> tools/perf/util/stat-display.c | 69 ++++++++++++++++++++-
>> 5 files changed, 172 insertions(+), 6 deletions(-)
>>
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 2269b3e90e9b..b274cc264d56 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -2172,6 +2172,7 @@ static int add_default_attributes(void)
>>
>> evlist__for_each_entry(metric_evlist, metric_evsel) {
>> metric_evsel->skippable = true;
>> + metric_evsel->default_metricgroup = true;
>> }
>> evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
>> evlist__delete(metric_evlist);
>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
>> index 36a32e4ca168..61b1385108f4 100644
>> --- a/tools/perf/util/evsel.h
>> +++ b/tools/perf/util/evsel.h
>> @@ -130,6 +130,7 @@ struct evsel {
>> bool reset_group;
>> bool errored;
>> bool needs_auxtrace_mmap;
>> + bool default_metricgroup;
>
> A comment would be useful here, something like:
>
> If running perf stat, is this evsel a member of a Default metric group metric.
Yes, it's a member of the 'Default' metricgroup.
I will add a comment.
>
>> struct hashmap *per_pkg_mask;
>> int err;
>> struct {
>> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
>> index efafa02db5e5..22181ce4f27f 100644
>> --- a/tools/perf/util/metricgroup.c
>> +++ b/tools/perf/util/metricgroup.c
>> @@ -79,6 +79,7 @@ static struct rb_node *metric_event_new(struct rblist *rblist __maybe_unused,
>> return NULL;
>> memcpy(me, entry, sizeof(struct metric_event));
>> me->evsel = ((struct metric_event *)entry)->evsel;
>> + me->default_metricgroup_name = NULL;
>> INIT_LIST_HEAD(&me->head);
>> return &me->nd;
>> }
>> @@ -1133,14 +1134,19 @@ static int metricgroup__add_metric_sys_event_iter(const struct pmu_metric *pm,
>> /**
>> * metric_list_cmp - list_sort comparator that sorts metrics with more events to
>> * the front. tool events are excluded from the count.
>> + * For the default metrics, sort them by metricgroup name.
>> */
>> -static int metric_list_cmp(void *priv __maybe_unused, const struct list_head *l,
>> +static int metric_list_cmp(void *priv, const struct list_head *l,
>> const struct list_head *r)
>> {
>> const struct metric *left = container_of(l, struct metric, nd);
>> const struct metric *right = container_of(r, struct metric, nd);
>> struct expr_id_data *data;
>> int i, left_count, right_count;
>> + bool is_default = *(bool *)priv;
>> +
>> + if (is_default && left->default_metricgroup_name && right->default_metricgroup_name)
>> + return strcmp(left->default_metricgroup_name, right->default_metricgroup_name);
>
> This breaks the comment above. The events are now sorted prioritizing
> default metric group names. This potentially will have an effect of
> reducing sharing of events between groups; it will also break the
> assumption within that code that a metric always has the same number
> of events or fewer as you process the list. To remedy this I
> think you need to re-sort the metrics after the event sharing has had
> a chance to share events between groups.
>
>
>>
>> left_count = hashmap__size(left->pctx->ids);
>> perf_tool_event__for_each_event(i) {
>> @@ -1497,6 +1503,91 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu,
>> return ret;
>> }
>>
>> +static struct metric_event *
>> +metricgroup__lookup_default_metricgroup(struct rblist *metric_events,
>> + struct evsel *evsel,
>> + struct metric *m)
>> +{
>> + struct metric_event *me;
>> + char *name;
>> + int err;
>> +
>> + me = metricgroup__lookup(metric_events, evsel, true);
>> + if (!me->default_metricgroup_name) {
>> + if (m->pmu && strcmp(m->pmu, "cpu"))
>> + err = asprintf(&name, "%s (%s)", m->default_metricgroup_name, m->pmu);
>> + else
>> + err = asprintf(&name, "%s", m->default_metricgroup_name);
>> + if (err < 0)
>> + return NULL;
>> + me->default_metricgroup_name = name;
>> + }
>> + if (!strncmp(m->default_metricgroup_name,
>> + me->default_metricgroup_name,
>> + strlen(m->default_metricgroup_name)))
>> + return me;
>> +
>> + return NULL;
>> +}
>
> A function comment would be useful, as the name is confusing: why
> lookup? Doesn't it create the value? Leak sanitizer isn't happy here:
>
> ```
> ==1545918==ERROR: LeakSanitizer: detected memory leaks
>
> Direct leak of 10 byte(s) in 1 object(s) allocated from:
> #0 0x7f2755a7077b in __interceptor_strdup
> ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
> #1 0x564986a8df31 in asprintf util/util.c:566
> #2 0x5649869b5901 in metricgroup__lookup_default_metricgroup
> util/metricgroup.c:1520
> #3 0x5649869b5e57 in metricgroup__lookup_create util/metricgroup.c:1579
> #4 0x5649869b6ddc in parse_groups util/metricgroup.c:1698
> #5 0x5649869b7714 in metricgroup__parse_groups util/metricgroup.c:1771
> #6 0x5649867da9d5 in add_default_attributes tools/perf/builtin-stat.c:2164
> #7 0x5649867ddbfb in cmd_stat tools/perf/builtin-stat.c:2707
> #8 0x5649868fa5a2 in run_builtin tools/perf/perf.c:323
> #9 0x5649868fab13 in handle_internal_command tools/perf/perf.c:377
> #10 0x5649868faedb in run_argv tools/perf/perf.c:421
> #11 0x5649868fb443 in main tools/perf/perf.c:537
> #12 0x7f2754846189 in __libc_start_call_main
> ../sysdeps/nptl/libc_start_call_main.h:58
> ```
>
>> +static struct metric_event *
>> +metricgroup__lookup_create(struct rblist *metric_events,
>> + struct evsel **evsel,
>> + struct list_head *metric_list,
>> + struct metric *m,
>> + bool is_default)
>> +{
>> + struct metric_event *me;
>> + struct metric *cur;
>> + struct evsel *ev;
>> + size_t i;
>> +
>> + if (!is_default)
>> + return metricgroup__lookup(metric_events, evsel[0], true);
>> +
>> + /*
>> + * If the metric group has been attached to a previous
>> + * event/metric, use that metric event.
>> + */
>> + list_for_each_entry(cur, metric_list, nd) {
>> + if (cur == m)
>> + break;
>> + if (cur->pmu && strcmp(m->pmu, cur->pmu))
>> + continue;
>> + if (strncmp(m->default_metricgroup_name,
>> + cur->default_metricgroup_name,
>> + strlen(m->default_metricgroup_name)))
>> + continue;
>> + if (!cur->evlist)
>> + continue;
>> + evlist__for_each_entry(cur->evlist, ev) {
>> + me = metricgroup__lookup(metric_events, ev, false);
>> + if (!strncmp(m->default_metricgroup_name,
>> + me->default_metricgroup_name,
>> + strlen(m->default_metricgroup_name)))
>> + return me;
>> + }
>> + }
>> +
>> + /*
>> + * Different metric groups may append to the same leader event.
>> + * For example, TopdownL1 and TopdownL2 are appended to the
>> + * TOPDOWN.SLOTS event.
>> + * Split it and append the new metric group to the next available
>> + * event.
>> + */
>> + me = metricgroup__lookup_default_metricgroup(metric_events, evsel[0], m);
>> + if (me)
>> + return me;
>> +
>> + for (i = 1; i < hashmap__size(m->pctx->ids); i++) {
>> + me = metricgroup__lookup_default_metricgroup(metric_events, evsel[i], m);
>> + if (me)
>> + return me;
>> + }
>> + return NULL;
>> +}
>> +
>
> I have a hard time understanding this function. Does it just go away
> if you do the two sorts that I proposed above? Should this be
> metric_event__lookup_create? A function comment saying what the code
> is trying to achieve would be useful.
>
> This appears to be trying to correct output issues by changing how
> metrics are associated with events; shouldn't output issues be
> resolved by fixing the output code? If not, why don't we apply this
> logic to TopdownL1, why just Default?
Yes, the above code tries to reorganize the metrics and append the
metrics from the same metricgroup to the same event, so they can
easily be printed out later.
With the second sort, I think it should not be a problem to address it
in the output code. Let me do some experiments.
>
>> static int parse_groups(struct evlist *perf_evlist,
>> const char *pmu, const char *str,
>> bool metric_no_group,
>> @@ -1512,6 +1603,7 @@ static int parse_groups(struct evlist *perf_evlist,
>> LIST_HEAD(metric_list);
>> struct metric *m;
>> bool tool_events[PERF_TOOL_MAX] = {false};
>> + bool is_default = !strcmp(str, "Default");
>> int ret;
>>
>> if (metric_events_list->nr_entries == 0)
>> @@ -1523,7 +1615,7 @@ static int parse_groups(struct evlist *perf_evlist,
>> goto out;
>>
>> /* Sort metrics from largest to smallest. */
>> - list_sort(NULL, &metric_list, metric_list_cmp);
>> + list_sort((void *)&is_default, &metric_list, metric_list_cmp);
>>
>> if (!metric_no_merge) {
>> struct expr_parse_ctx *combined = NULL;
>> @@ -1603,7 +1695,15 @@ static int parse_groups(struct evlist *perf_evlist,
>> goto out;
>> }
>>
>> - me = metricgroup__lookup(metric_events_list, metric_events[0], true);
>> + me = metricgroup__lookup_create(metric_events_list,
>> + metric_events,
>> + &metric_list, m,
>> + is_default);
>> + if (!me) {
>> + pr_err("Cannot create metric group for default!\n");
>> + ret = -EINVAL;
>> + goto out;
>> + }
>>
>> expr = malloc(sizeof(struct metric_expr));
>> if (!expr) {
>> diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
>> index bf18274c15df..e3609b853213 100644
>> --- a/tools/perf/util/metricgroup.h
>> +++ b/tools/perf/util/metricgroup.h
>> @@ -22,6 +22,7 @@ struct cgroup;
>> struct metric_event {
>> struct rb_node nd;
>> struct evsel *evsel;
>> + char *default_metricgroup_name;
>> struct list_head head; /* list of metric_expr */
>> };
>>
>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
>> index a2bbdc25d979..efe5fd04c033 100644
>> --- a/tools/perf/util/stat-display.c
>> +++ b/tools/perf/util/stat-display.c
>> @@ -21,10 +21,12 @@
>> #include "iostat.h"
>> #include "pmu.h"
>> #include "pmus.h"
>> +#include "metricgroup.h"
>
> This is bringing metric code from stat-shadow, which is kind of the
> whole reason there is a separation and stat-shadow exists. Should the
> logic exist in stat-shadow instead?
>
>>
>> #define CNTR_NOT_SUPPORTED "<not supported>"
>> #define CNTR_NOT_COUNTED "<not counted>"
>>
>> +#define MGROUP_LEN 50
>> #define METRIC_LEN 38
>> #define EVNAME_LEN 32
>> #define COUNTS_LEN 18
>> @@ -707,6 +709,55 @@ static bool evlist__has_hybrid(struct evlist *evlist)
>> return false;
>> }
>>
>> +static void print_metricgroup_header_json(struct perf_stat_config *config,
>> + struct outstate *os __maybe_unused,
>> + const char *metricgroup_name)
>> +{
>> + fprintf(config->output, "\"metricgroup\" : \"%s\"}", metricgroup_name);
>> + new_line_json(config, (void *)os);
>> +}
>> +
>
> Should the output part of this patch be separate from the
> evsel/evlist/metric modifications?
>
Sure, I will split the patch.
Thanks,
Kan
> Thanks,
> Ian
>
>> +static void print_metricgroup_header_csv(struct perf_stat_config *config,
>> + struct outstate *os,
>> + const char *metricgroup_name)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < os->nfields; i++)
>> + fputs(config->csv_sep, os->fh);
>> + fprintf(config->output, "%s", metricgroup_name);
>> + new_line_csv(config, (void *)os);
>> +}
>> +
>> +static void print_metricgroup_header_std(struct perf_stat_config *config,
>> + struct outstate *os __maybe_unused,
>> + const char *metricgroup_name)
>> +{
>> + int n = fprintf(config->output, " %*s", EVNAME_LEN, metricgroup_name);
>> +
>> + fprintf(config->output, "%*s", MGROUP_LEN - n - 1, "");
>> +}
>> +
>> +static void print_metricgroup_header(struct perf_stat_config *config,
>> + struct outstate *os,
>> + struct evsel *counter,
>> + double noise, u64 run, u64 ena,
>> + const char *metricgroup_name)
>> +{
>> + aggr_printout(config, os->evsel, os->id, os->aggr_nr);
>> +
>> + print_noise(config, counter, noise, /*before_metric=*/true);
>> + print_running(config, run, ena, /*before_metric=*/true);
>> +
>> + if (config->json_output) {
>> + print_metricgroup_header_json(config, os, metricgroup_name);
>> + } else if (config->csv_output) {
>> + print_metricgroup_header_csv(config, os, metricgroup_name);
>> + } else
>> + print_metricgroup_header_std(config, os, metricgroup_name);
>> +
>> +}
>> +
>> static void printout(struct perf_stat_config *config, struct outstate *os,
>> double uval, u64 run, u64 ena, double noise, int aggr_idx)
>> {
>> @@ -751,10 +802,17 @@ static void printout(struct perf_stat_config *config, struct outstate *os,
>> out.force_header = false;
>>
>> if (!config->metric_only) {
>> - abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
>> + if (counter->default_metricgroup) {
>> + struct metric_event *me;
>>
>> - print_noise(config, counter, noise, /*before_metric=*/true);
>> - print_running(config, run, ena, /*before_metric=*/true);
>> + me = metricgroup__lookup(&config->metric_events, counter, false);
>> + print_metricgroup_header(config, os, counter, noise, run, ena,
>> + me->default_metricgroup_name);
>> + } else {
>> + abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
>> + print_noise(config, counter, noise, /*before_metric=*/true);
>> + print_running(config, run, ena, /*before_metric=*/true);
>> + }
>> }
>>
>> if (ok) {
>> @@ -883,6 +941,11 @@ static void print_counter_aggrdata(struct perf_stat_config *config,
>> if (counter->merged_stat)
>> return;
>>
>> + /* Only print the metric group for the default mode */
>> + if (counter->default_metricgroup &&
>> + !metricgroup__lookup(&config->metric_events, counter, false))
>> + return;
>> +
>> uniquify_counter(config, counter);
>>
>> val = aggr->counts.val;
>> --
>> 2.35.1
>>
On Tue, Jun 13, 2023 at 2:00 PM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2023-06-13 4:28 p.m., Ian Rogers wrote:
> > On Tue, Jun 13, 2023 at 1:10 PM Liang, Kan <[email protected]> wrote:
> >>
> >>
> >>
> >> On 2023-06-13 3:44 p.m., Ian Rogers wrote:
> >>> On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
> >>>>
> >>>> From: Kan Liang <[email protected]>
> >>>>
> >>>> For the default output, the default metric group could vary on different
> >>>> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
> >>>> should be displayed in the default mode. On ICL, only the TopdownL1
> >>>> should be displayed.
> >>>>
> >>>> Add a flag so we can tag the default metric group for different
> >>>> platforms rather than hack the perf code.
> >>>>
> >>>> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
> >>>> since SPR.
> >>>>
> >>>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
> >>>> the real metric group name.
> >>>>
> >>>> Signed-off-by: Kan Liang <[email protected]>
> >>>> ---
> >>>> .../arch/x86/alderlake/adl-metrics.json | 20 ++++---
> >>>> .../arch/x86/icelake/icl-metrics.json | 20 ++++---
> >>>> .../arch/x86/icelakex/icx-metrics.json | 20 ++++---
> >>>> .../arch/x86/sapphirerapids/spr-metrics.json | 60 +++++++++++--------
> >>>> .../arch/x86/tigerlake/tgl-metrics.json | 20 ++++---
> >>>> 5 files changed, 84 insertions(+), 56 deletions(-)
> >>>>
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >>>> index c9f7e3d4ab08..e78c85220e27 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >>>> @@ -832,22 +832,24 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_backend_bound",
> >>>> "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>> "ScaleUnit": "100%",
> >>>> "Unit": "cpu_core"
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_bad_speculation",
> >>>> "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>> "ScaleUnit": "100%",
> >>>> "Unit": "cpu_core"
> >>>> @@ -1112,11 +1114,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
> >>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_frontend_bound",
> >>>> "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>> "ScaleUnit": "100%",
> >>>> "Unit": "cpu_core"
> >>>> @@ -2316,11 +2319,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_retiring",
> >>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>> "ScaleUnit": "100%",
> >>>> "Unit": "cpu_core"
> >>>
> >>> For Alderlake the Default metric group is added for all cpu_core
> >>> metrics but not the cpu_atom ones. This will lead to only getting
> >>> metrics for the performance cores while the workload could be running
> >>> on the atom cores, which could lead to the false conclusion that the
> >>> workload has no issues with these metrics. I think this behavior is
> >>> surprising and should be called out as intentional in the commit
> >>> message.
> >>>
> >>
> >> The e-core doesn't have enough counters to calculate all the Topdown
> >> events, so it would trigger multiplexing, which we try to avoid in the
> >> default mode.
> >> I will update the commit message in V2.
> >
> > Is multiplexing a worse crime than only giving output for half the
> > cores? Both can be misleading. Perhaps the safest thing is to not use
> > Default on hybrid platforms.
> >
>
> I think if we cannot give an accurate number, we shouldn't show it. I
> don't think it's a problem to show Topdown only on the p-core. If the
> user doesn't find the data they are interested in in the default mode,
> they can always use --topdown for a specific core.
So --topdown is just syntactic sugar for "-M TopdownL ...", and using -M
is how to drill down by group. I'm not sure how useful the command-line
flag is, especially for levels >2.
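For what it's worth, selecting metrics by group is just a name lookup over
the pmu-events JSON. A rough Python sketch (hypothetical, not perf's actual
C implementation) of what "-M TopdownL1" amounts to:

```python
import json

# Hypothetical sketch: pick metrics out of a pmu-events style JSON list by
# group name, the way "-M <group>" drills down. MetricGroup is a
# semicolon-separated list of group names.
def metrics_in_group(metrics, group):
    """Return the names of metrics whose MetricGroup contains 'group'."""
    return [m["MetricName"] for m in metrics
            if group in m.get("MetricGroup", "").split(";")]

metrics = json.loads("""[
  {"MetricName": "tma_backend_bound",
   "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group"},
  {"MetricName": "tma_branch_mispredicts",
   "MetricGroup": "BadSpec;Default;TmaL2;TopdownL2;tma_L2_group"}
]""")

print(metrics_in_group(metrics, "TopdownL1"))  # ['tma_backend_bound']
print(metrics_in_group(metrics, "Default"))    # both metrics
```

So making Default a metric group, as this series does, reuses the same
lookup path as any other -M argument.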
Playing devil's advocate somewhat on the hybrid metrics: let's say I
configure a managed runtime like a JVM so that all garbage collector
threads run on atom cores while the main workload runs on the p-cores.
This is at least done in research papers. Let's say the garbage collector
is backend memory bound. The default metrics won't show this, just (from
the cover letter):
```
 Performance counter stats for 'system wide':

         32,154.81 msec cpu-clock                 #   31.978 CPUs utilized
               165      context-switches          #    5.131 /sec
                33      cpu-migrations            #    1.026 /sec
                72      page-faults               #    2.239 /sec
         5,653,347      cpu_core/cycles/          #    0.000 GHz
         4,164,114      cpu_atom/cycles/          #    0.000 GHz
         3,921,839      cpu_core/instructions/    #    0.69  insn per cycle
         2,142,800      cpu_atom/instructions/    #    0.38  insn per cycle
           713,629      cpu_core/branches/        #   22.194 K/sec
           452,838      cpu_atom/branches/        #   14.083 K/sec
            26,810      cpu_core/branch-misses/   #    3.76% of all branches
            26,029      cpu_atom/branch-misses/   #    3.65% of all branches
TopdownL1 (cpu_core)                 #     32.0 %  tma_backend_bound
                                     #      8.0 %  tma_bad_speculation
                                     #     45.5 %  tma_frontend_bound
                                     #     14.5 %  tma_retiring
```
As the garbage collector needs to run to free memory, this can lead to
priority inversion, where the garbage collector being slow means there
isn't enough heap for the workload on the p-cores. Here the user has to
interpret the "(cpu_core)" to know that only half the metrics are shown
and that they should run with "-M TopdownL1" to get both cpu_core and
cpu_atom. From that they can see they have a memory-bound issue on the
atom cores. This seems less safe than reporting nothing and having the
user specify "-M TopdownL1" to get the metrics on both cores.
For the multiplexing problem, is it solved by removing IPC from this output?
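To check my reading of the new JSON fields, here is a hypothetical Python
sketch (not the real perf code) of how default-mode output could be driven
by them: a metric is included when "Default" appears in MetricGroup, it is
printed under the DefaultMetricgroupName header, and a "Default" entry in
MetricgroupNoGroup means its events are not forced into a single event
group:

```python
# Hypothetical sketch of the default-mode selection this series adds.
# All field names come from the pmu-events JSON in the patch.
def default_mode_view(metric):
    """Return how 'metric' would appear in default mode, or None if hidden."""
    groups = metric.get("MetricGroup", "").split(";")
    if "Default" not in groups:
        return None  # not part of the default output
    no_group = "Default" in metric.get("MetricgroupNoGroup", "").split(";")
    return {"header": metric.get("DefaultMetricgroupName", ""),
            "name": metric["MetricName"],
            "grouped_events": not no_group}

m = {"MetricName": "tma_backend_bound",
     "DefaultMetricgroupName": "TopdownL1",
     "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
     "MetricgroupNoGroup": "TopdownL1;Default"}
view = default_mode_view(m)
print(view)
```

If that reading is right, the cpu_atom metrics simply never match the
"Default" test above, which is what produces the p-core-only output.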
Thanks,
Ian
> Thanks,
> Kan
>
> > Thanks,
> > Ian
> >
> >> Thanks,
> >> Kan
> >>
> >>> Thanks,
> >>> Ian
> >>>
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >>>> index 20210742171d..cc4edf855064 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >>>> @@ -111,21 +111,23 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_backend_bound",
> >>>> "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_bad_speculation",
> >>>> "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -372,11 +374,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_frontend_bound",
> >>>> "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -1378,11 +1381,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_retiring",
> >>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >>>> index ef25cda019be..6f25b5b7aaf6 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >>>> @@ -315,21 +315,23 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_backend_bound",
> >>>> "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_bad_speculation",
> >>>> "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -576,11 +578,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_frontend_bound",
> >>>> "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -1674,11 +1677,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_retiring",
> >>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >>>> index 4f3dd85540b6..c732982f70b5 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >>>> @@ -340,31 +340,34 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_backend_bound",
> >>>> "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_bad_speculation",
> >>>> "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
> >>>> + "DefaultMetricgroupName": "TopdownL2",
> >>>> "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> >>>> + "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> >>>> "MetricName": "tma_branch_mispredicts",
> >>>> "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL2",
> >>>> + "MetricgroupNoGroup": "TopdownL2;Default",
> >>>> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -407,11 +410,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
> >>>> + "DefaultMetricgroupName": "TopdownL2",
> >>>> "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
> >>>> - "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>>> + "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>>> "MetricName": "tma_core_bound",
> >>>> "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
> >>>> - "MetricgroupNoGroup": "TopdownL2",
> >>>> + "MetricgroupNoGroup": "TopdownL2;Default",
> >>>> "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -509,21 +513,23 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
> >>>> + "DefaultMetricgroupName": "TopdownL2",
> >>>> "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
> >>>> - "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> >>>> + "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> >>>> "MetricName": "tma_fetch_bandwidth",
> >>>> "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
> >>>> - "MetricgroupNoGroup": "TopdownL2",
> >>>> + "MetricgroupNoGroup": "TopdownL2;Default",
> >>>> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
> >>>> + "DefaultMetricgroupName": "TopdownL2",
> >>>> "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> - "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> >>>> + "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> >>>> "MetricName": "tma_fetch_latency",
> >>>> "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL2",
> >>>> + "MetricgroupNoGroup": "TopdownL2;Default",
> >>>> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -611,11 +617,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_frontend_bound",
> >>>> "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -630,11 +637,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
> >>>> + "DefaultMetricgroupName": "TopdownL2",
> >>>> "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>>> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>>> "MetricName": "tma_heavy_operations",
> >>>> "MetricThreshold": "tma_heavy_operations > 0.1",
> >>>> - "MetricgroupNoGroup": "TopdownL2",
> >>>> + "MetricgroupNoGroup": "TopdownL2;Default",
> >>>> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -1486,11 +1494,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
> >>>> + "DefaultMetricgroupName": "TopdownL2",
> >>>> "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
> >>>> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>>> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>>> "MetricName": "tma_light_operations",
> >>>> "MetricThreshold": "tma_light_operations > 0.6",
> >>>> - "MetricgroupNoGroup": "TopdownL2",
> >>>> + "MetricgroupNoGroup": "TopdownL2;Default",
> >>>> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -1540,11 +1549,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
> >>>> + "DefaultMetricgroupName": "TopdownL2",
> >>>> "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
> >>>> - "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> >>>> + "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> >>>> "MetricName": "tma_machine_clears",
> >>>> "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL2",
> >>>> + "MetricgroupNoGroup": "TopdownL2;Default",
> >>>> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -1576,11 +1586,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
> >>>> + "DefaultMetricgroupName": "TopdownL2",
> >>>> "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>>> + "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>>> "MetricName": "tma_memory_bound",
> >>>> "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
> >>>> - "MetricgroupNoGroup": "TopdownL2",
> >>>> + "MetricgroupNoGroup": "TopdownL2;Default",
> >>>> "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -1784,11 +1795,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_retiring",
> >>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >>>> index d0538a754288..83346911aa63 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >>>> @@ -105,21 +105,23 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_backend_bound",
> >>>> "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_bad_speculation",
> >>>> "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -366,11 +368,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_frontend_bound",
> >>>> "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> @@ -1392,11 +1395,12 @@
> >>>> },
> >>>> {
> >>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> + "DefaultMetricgroupName": "TopdownL1",
> >>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>> "MetricName": "tma_retiring",
> >>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> - "MetricgroupNoGroup": "TopdownL1",
> >>>> + "MetricgroupNoGroup": "TopdownL1;Default",
> >>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>> "ScaleUnit": "100%"
> >>>> },
> >>>> --
> >>>> 2.35.1
> >>>>
On 2023-06-13 4:28 p.m., Ian Rogers wrote:
> On Tue, Jun 13, 2023 at 1:10 PM Liang, Kan <[email protected]> wrote:
>>
>>
>>
>> On 2023-06-13 3:44 p.m., Ian Rogers wrote:
>>> On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>>>>
>>>> From: Kan Liang <[email protected]>
>>>>
>>>> For the default output, the default metric group could vary on different
>>>> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
>>>> should be displayed in the default mode. On ICL, only the TopdownL1
>>>> should be displayed.
>>>>
>>>> Add a flag so we can tag the default metric group for different
>>>> platforms rather than hack the perf code.
>>>>
>>>> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
>>>> since SPR.
>>>>
>>>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
>>>> the real metric group name.
>>>>
>>>> Signed-off-by: Kan Liang <[email protected]>
>>>> ---
>>>> .../arch/x86/alderlake/adl-metrics.json | 20 ++++---
>>>> .../arch/x86/icelake/icl-metrics.json | 20 ++++---
>>>> .../arch/x86/icelakex/icx-metrics.json | 20 ++++---
>>>> .../arch/x86/sapphirerapids/spr-metrics.json | 60 +++++++++++--------
>>>> .../arch/x86/tigerlake/tgl-metrics.json | 20 ++++---
>>>> 5 files changed, 84 insertions(+), 56 deletions(-)
>>>>
>>>> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>> index c9f7e3d4ab08..e78c85220e27 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>> @@ -832,22 +832,24 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_backend_bound",
>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>> "ScaleUnit": "100%",
>>>> "Unit": "cpu_core"
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_bad_speculation",
>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>> "ScaleUnit": "100%",
>>>> "Unit": "cpu_core"
>>>> @@ -1112,11 +1114,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_frontend_bound",
>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>> "ScaleUnit": "100%",
>>>> "Unit": "cpu_core"
>>>> @@ -2316,11 +2319,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_retiring",
>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>> "ScaleUnit": "100%",
>>>> "Unit": "cpu_core"
>>>
>>> For Alderlake the Default metric group is added for all cpu_core
>>> metrics but not cpu_atom. This will lead to only getting metrics for
>>> performance cores while the workload could be running on atoms. This
>>> could lead to a false conclusion that the workload has no issues with
>>> the metrics. I think this behavior is surprising and should be called
>>> out as intentional in the commit message.
>>>
>>
>> The e-core doesn't have enough counters to calculate all of the Topdown
>> events, so enabling them would trigger multiplexing, which we try to
>> avoid in the default mode.
>> I will update the commit message in V2.
>
> Is multiplexing a worse crime than only giving output for half the
> cores? Both can be misleading. Perhaps the safest thing is to not use
> Default on hybrid platforms.
>
I think if we cannot give an accurate number, we shouldn't show it. I
don't think it's a problem to show Topdown only on the p-core. If users
don't find the data they are interested in in the default mode, they can
always use --topdown for a specific core.
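For reference, here is a minimal sketch (not perf's actual code) of how the
two JSON fields in this patch compose: "Default" in the semicolon-separated
MetricGroup opts a metric into the default output, and DefaultMetricgroupName
supplies the single heading ("TopdownL1"/"TopdownL2") it is shown under.

```python
def default_metrics(metrics):
    """Return {heading: [metric names]} for the default perf stat output."""
    out = {}
    for m in metrics:
        groups = m.get("MetricGroup", "").split(";")
        if "Default" not in groups:
            continue  # not tagged for the default mode
        heading = m.get("DefaultMetricgroupName", "")
        out.setdefault(heading, []).append(m["MetricName"])
    return out

# Entries abridged from the patch above; only the relevant fields are kept.
metrics = [
    {"MetricName": "tma_retiring",
     "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
     "DefaultMetricgroupName": "TopdownL1"},
    {"MetricName": "tma_memory_bound",
     "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
     "DefaultMetricgroupName": "TopdownL2"},
    {"MetricName": "tma_info_thread_slots",
     "MetricGroup": "TmaL1;tma_L1_group"},  # no Default tag -> hidden
]

print(default_metrics(metrics))
# {'TopdownL1': ['tma_retiring'], 'TopdownL2': ['tma_memory_bound']}
```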
Thanks,
Kan
> Thanks,
> Ian
>
>> Thanks,
>> Kan
>>
>>> Thanks,
>>> Ian
>>>
>>>> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>> index 20210742171d..cc4edf855064 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>> @@ -111,21 +111,23 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_backend_bound",
>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_bad_speculation",
>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -372,11 +374,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_frontend_bound",
>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -1378,11 +1381,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_retiring",
>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>> index ef25cda019be..6f25b5b7aaf6 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>> @@ -315,21 +315,23 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_backend_bound",
>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_bad_speculation",
>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -576,11 +578,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_frontend_bound",
>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -1674,11 +1677,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_retiring",
>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>> index 4f3dd85540b6..c732982f70b5 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>> @@ -340,31 +340,34 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_backend_bound",
>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_bad_speculation",
>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> {
>>>> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>> "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>>> + "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>>> "MetricName": "tma_branch_mispredicts",
>>>> "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -407,11 +410,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>> "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
>>>> - "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>> + "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>> "MetricName": "tma_core_bound",
>>>> "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>> "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -509,21 +513,23 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>> "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
>>>> - "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>>> + "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>>> "MetricName": "tma_fetch_bandwidth",
>>>> "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> {
>>>> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>> "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> - "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>>> + "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>>> "MetricName": "tma_fetch_latency",
>>>> "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -611,11 +617,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_frontend_bound",
>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -630,11 +637,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>> "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>> "MetricName": "tma_heavy_operations",
>>>> "MetricThreshold": "tma_heavy_operations > 0.1",
>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -1486,11 +1494,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>> "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
>>>> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>> "MetricName": "tma_light_operations",
>>>> "MetricThreshold": "tma_light_operations > 0.6",
>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -1540,11 +1549,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>> "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
>>>> - "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>>> + "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>>> "MetricName": "tma_machine_clears",
>>>> "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -1576,11 +1586,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>> "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>> + "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>> "MetricName": "tma_memory_bound",
>>>> "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>> "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -1784,11 +1795,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_retiring",
>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>> index d0538a754288..83346911aa63 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>> @@ -105,21 +105,23 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_backend_bound",
>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_bad_speculation",
>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -366,11 +368,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_frontend_bound",
>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> @@ -1392,11 +1395,12 @@
>>>> },
>>>> {
>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>> "MetricName": "tma_retiring",
>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>> "ScaleUnit": "100%"
>>>> },
>>>> --
>>>> 2.35.1
>>>>
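For reference, the group-membership mechanics the patch relies on can be sketched as follows. This is illustrative Python only, not the actual perf tool code; the second metric entry and its groups are made up to show a non-default metric. Membership is decided by splitting the semicolon-separated "MetricGroup" string, and "DefaultMetricgroupName" supplies the heading shown in the default output in place of the internal "Default" group name:

```python
# Sketch (not the perf implementation) of how a consumer of these JSON
# metric entries could pick out the metrics tagged with the new
# "Default" group introduced by this patch.
metrics = [
    {
        "MetricName": "tma_backend_bound",
        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
        "DefaultMetricgroupName": "TopdownL1",
        "MetricgroupNoGroup": "TopdownL1;Default",
    },
    {
        # Hypothetical non-default metric for contrast.
        "MetricName": "tma_info_thread_ipc",
        "MetricGroup": "Ret;Summary",
    },
]

def in_group(metric, group):
    """True if 'group' appears in the metric's semicolon-separated list."""
    return group in metric.get("MetricGroup", "").split(";")

# Metrics shown in the default mode are exactly those tagged "Default".
default_metrics = [m["MetricName"] for m in metrics if in_group(m, "Default")]
print(default_metrics)  # -> ['tma_backend_bound']

# DefaultMetricgroupName gives the display heading ("TopdownL1") used in
# the default output instead of the internal "Default" group name.
headings = {m["MetricName"]: m.get("DefaultMetricgroupName")
            for m in metrics if in_group(m, "Default")}
print(headings)  # -> {'tma_backend_bound': 'TopdownL1'}
```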
On 2023-06-13 5:28 p.m., Ian Rogers wrote:
> On Tue, Jun 13, 2023 at 2:00 PM Liang, Kan <[email protected]> wrote:
>>
>>
>>
>> On 2023-06-13 4:28 p.m., Ian Rogers wrote:
>>> On Tue, Jun 13, 2023 at 1:10 PM Liang, Kan <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 2023-06-13 3:44 p.m., Ian Rogers wrote:
>>>>> On Wed, Jun 7, 2023 at 9:27 AM <[email protected]> wrote:
>>>>>>
>>>>>> From: Kan Liang <[email protected]>
>>>>>>
>>>>>> For the default output, the default metric group could vary on different
>>>>>> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
>>>>>> should be displayed in the default mode. On ICL, only the TopdownL1
>>>>>> should be displayed.
>>>>>>
>>>>>> Add a flag so we can tag the default metric group for different
>>>>>> platforms rather than hack the perf code.
>>>>>>
>>>>>> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
>>>>>> since SPR.
>>>>>>
>>>>>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
>>>>>> the real metric group name.
>>>>>>
>>>>>> Signed-off-by: Kan Liang <[email protected]>
>>>>>> ---
>>>>>> .../arch/x86/alderlake/adl-metrics.json | 20 ++++---
>>>>>> .../arch/x86/icelake/icl-metrics.json | 20 ++++---
>>>>>> .../arch/x86/icelakex/icx-metrics.json | 20 ++++---
>>>>>> .../arch/x86/sapphirerapids/spr-metrics.json | 60 +++++++++++--------
>>>>>> .../arch/x86/tigerlake/tgl-metrics.json | 20 ++++---
>>>>>> 5 files changed, 84 insertions(+), 56 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>>>> index c9f7e3d4ab08..e78c85220e27 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>>>> @@ -832,22 +832,24 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_backend_bound",
>>>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>> "ScaleUnit": "100%",
>>>>>> "Unit": "cpu_core"
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_bad_speculation",
>>>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>> "ScaleUnit": "100%",
>>>>>> "Unit": "cpu_core"
>>>>>> @@ -1112,11 +1114,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
>>>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_frontend_bound",
>>>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>> "ScaleUnit": "100%",
>>>>>> "Unit": "cpu_core"
>>>>>> @@ -2316,11 +2319,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_retiring",
>>>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>> "ScaleUnit": "100%",
>>>>>> "Unit": "cpu_core"
>>>>>
>>>>> For Alderlake the Default metric group is added for all cpu_core
>>>>> metrics but not cpu_atom. This will lead to only getting metrics for
>>>>> performance cores while the workload could be running on atoms. This
>>>>> could lead to a false conclusion that the workload has no issues with
>>>>> the metrics. I think this behavior is surprising and should be called
>>>>> out as intentional in the commit message.
>>>>>
>>>>
>>>> The e-core doesn't have enough counters to calculate all the Topdown
>>>> events, so it would trigger multiplexing, which we try to avoid in
>>>> the default mode.
>>>> I will update the commit in V2.
>>>
>>> Is multiplexing a worse crime than only giving output for half the
>>> cores? Both can be misleading. Perhaps the safest thing is to not use
>>> Default on hybrid platforms.
>>>
>>
>> I think if we cannot give an accurate number, we shouldn't show it. I
>> don't think it's a problem to show the Topdown metrics only on the
>> p-core. If users don't find the data they're interested in in the
>> default mode, they can always use --topdown for a specific core.
>
> So --topdown is just dressing over using "-M TopdownL ...", and using
> -M is how to drill down by group. I'm not sure how useful the command
> line flag is, especially for levels >2.
>
> Playing devil's advocate somewhat on the hybrid metric: let's say I
> configure a managed runtime like a JVM so that all garbage collector
> threads run on the atom cores and the main workload runs on the
> p-cores. This is at least done in research papers. Let's say the
> garbage collector is backend memory bound. The result from the default
> metrics won't show this (from the cover letter):
>
> ```
> Performance counter stats for 'system wide':
>
>      32,154.81 msec cpu-clock                 #   31.978 CPUs utilized
>            165      context-switches          #    5.131 /sec
>             33      cpu-migrations            #    1.026 /sec
>             72      page-faults               #    2.239 /sec
>      5,653,347      cpu_core/cycles/          #    0.000 GHz
>      4,164,114      cpu_atom/cycles/          #    0.000 GHz
>      3,921,839      cpu_core/instructions/    #    0.69  insn per cycle
>      2,142,800      cpu_atom/instructions/    #    0.38  insn per cycle
>        713,629      cpu_core/branches/        #   22.194 K/sec
>        452,838      cpu_atom/branches/        #   14.083 K/sec
>         26,810      cpu_core/branch-misses/   #    3.76% of all branches
>         26,029      cpu_atom/branch-misses/   #    3.65% of all branches
>              TopdownL1 (cpu_core)             #   32.0 %  tma_backend_bound
>                                               #    8.0 %  tma_bad_speculation
>                                               #   45.5 %  tma_frontend_bound
>                                               #   14.5 %  tma_retiring
> ```
>
> As the garbage collector needs to run to free memory, this can lead to
> priority inversion, where the garbage collector being slow means there
> isn't enough heap on the p-cores. Here the user has to interpret the
> "(cpu_core)" to know that only half the metrics are shown and that they
> should run with "-M TopdownL1" to get both cpu_core and cpu_atom. From
> this they can see they have a memory bound issue on the atom cores.
> This seems less safe than reporting nothing and having the user specify
> "-M TopdownL1" to get the metrics on both core types.
OK. I will think about it. But no matter which way we choose, I think we
have to update the script anyway.
>
> For the multiplexing problem, is it solved by removing IPC from this output?
No, IPC should only use the fixed counters. The branch events share the
GP counters with the Topdown events.
Thanks,
Kan
>
> Thanks,
> Ian
>
>> Thanks,
>> Kan
>>
>>> Thanks,
>>> Ian
>>>
>>>> Thanks,
>>>> Kan
>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>>>> index 20210742171d..cc4edf855064 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>>>> @@ -111,21 +111,23 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_backend_bound",
>>>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_bad_speculation",
>>>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -372,11 +374,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_frontend_bound",
>>>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -1378,11 +1381,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_retiring",
>>>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>>>> index ef25cda019be..6f25b5b7aaf6 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>>>> @@ -315,21 +315,23 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_backend_bound",
>>>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_bad_speculation",
>>>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -576,11 +578,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_frontend_bound",
>>>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -1674,11 +1677,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_retiring",
>>>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>>>> index 4f3dd85540b6..c732982f70b5 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>>>> @@ -340,31 +340,34 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_backend_bound",
>>>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_bad_speculation",
>>>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
>>>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>>>> "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>>>>> + "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>>>>> "MetricName": "tma_branch_mispredicts",
>>>>>> "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -407,11 +410,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
>>>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>>>> "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
>>>>>> - "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>>> + "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>>> "MetricName": "tma_core_bound",
>>>>>> "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
>>>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>> "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -509,21 +513,23 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
>>>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>>>> "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
>>>>>> - "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>>>>> + "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>>>>> "MetricName": "tma_fetch_bandwidth",
>>>>>> "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
>>>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
>>>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>>>> "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> - "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>>>>> + "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>>>>> "MetricName": "tma_fetch_latency",
>>>>>> "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>> "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -611,11 +617,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_frontend_bound",
>>>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -630,11 +637,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
>>>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>>>> "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>>> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>>> "MetricName": "tma_heavy_operations",
>>>>>> "MetricThreshold": "tma_heavy_operations > 0.1",
>>>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -1486,11 +1494,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
>>>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>>>> "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
>>>>>> - "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>>> + "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>>> "MetricName": "tma_light_operations",
>>>>>> "MetricThreshold": "tma_light_operations > 0.6",
>>>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>> "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -1540,11 +1549,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
>>>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>>>> "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
>>>>>> - "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>>>>> + "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>>>>> "MetricName": "tma_machine_clears",
>>>>>> "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>> "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -1576,11 +1586,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
>>>>>> + "DefaultMetricgroupName": "TopdownL2",
>>>>>> "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>>> + "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>>> "MetricName": "tma_memory_bound",
>>>>>> "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
>>>>>> - "MetricgroupNoGroup": "TopdownL2",
>>>>>> + "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>> "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -1784,11 +1795,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_retiring",
>>>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>>>> index d0538a754288..83346911aa63 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>>>> @@ -105,21 +105,23 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_backend_bound",
>>>>>> "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_bad_speculation",
>>>>>> "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -366,11 +368,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> - "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_frontend_bound",
>>>>>> "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> @@ -1392,11 +1395,12 @@
>>>>>> },
>>>>>> {
>>>>>> "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> + "DefaultMetricgroupName": "TopdownL1",
>>>>>> "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> - "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> + "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>> "MetricName": "tma_retiring",
>>>>>> "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> - "MetricgroupNoGroup": "TopdownL1",
>>>>>> + "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>> "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>> "ScaleUnit": "100%"
>>>>>> },
>>>>>> --
>>>>>> 2.35.1
>>>>>>
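The two MetricExpr fields quoted above reduce to simple ratios over the four topdown slot counters. As a sketch only (not perf code; the raw counter values below are made up, chosen to mirror the SPR example output in the cover letter):

```python
# Sketch: evaluating the Topdown L1 MetricExpr fields from the JSON above.
# tma_frontend_bound = topdown-fe-bound / (fe + bad_spec + retiring + be)
#                      - INT_MISC.UOP_DROPPING / tma_info_thread_slots
# tma_retiring       = topdown-retiring / (fe + bad_spec + retiring + be)

def tma_frontend_bound(fe, bad_spec, retiring, be, uop_dropping, slots):
    """Fraction of slots where the Frontend undersupplies the Backend."""
    total = fe + bad_spec + retiring + be
    return fe / total - uop_dropping / slots

def tma_retiring(fe, bad_spec, retiring, be):
    """Fraction of slots utilized by useful work (retired uops)."""
    total = fe + bad_spec + retiring + be
    return retiring / total

# Hypothetical slot counts for one interval, roughly matching the
# SPR example (13.0 % frontend bound, 15.2 % retiring).
fe, bad_spec, ret, be = 130, 31, 152, 687
uop_dropping, slots = 0, 1000

print(f"tma_frontend_bound: {100 * tma_frontend_bound(fe, bad_spec, ret, be, uop_dropping, slots):.1f} %")
print(f"tma_retiring:       {100 * tma_retiring(fe, bad_spec, ret, be):.1f} %")
```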
On 07/06/2023 17:26, [email protected] wrote:
> From: Kan Liang<[email protected]>
>
> Add the default tags for ARM as well.
>
> Signed-off-by: Kan Liang <[email protected]>
> Cc: Jing Zhang <[email protected]>
> Cc: John Garry <[email protected]>
Reviewed-by: John Garry <[email protected]>
But does pmu-events/arch/arm64/hisilicon/hip08/metrics.json need to be
fixed up as well?
On 2023-06-14 10:30 a.m., John Garry wrote:
> On 07/06/2023 17:26, [email protected] wrote:
>> From: Kan Liang<[email protected]>
>>
>> Add the default tags for ARM as well.
>>
>> Signed-off-by: Kan Liang <[email protected]>
>> Cc: Jing Zhang <[email protected]>
>> Cc: John Garry <[email protected]>
>
> Reviewed-by: John Garry <[email protected]>
>
> But does pmu-events/arch/arm64/hisilicon/hip08/metrics.json need to be
> fixed up as well?
The patch has been added in V4. Please take a look.
https://lore.kernel.org/lkml/[email protected]/
Thanks,
Kan
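For reference, the intent of the new tags in the JSON diff above can be sketched as follows: in default mode, only metrics whose MetricGroup list contains "Default" are shown, grouped under their DefaultMetricgroupName. This is an illustrative sketch, not perf's actual implementation; the third (non-default) metric entry is hypothetical:

```python
# Sketch: selecting and grouping metrics for the new default-mode output,
# driven by the "Default" MetricGroup tag and "DefaultMetricgroupName".

metrics = [
    {"MetricName": "tma_frontend_bound",
     "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
     "DefaultMetricgroupName": "TopdownL1"},
    {"MetricName": "tma_retiring",
     "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
     "DefaultMetricgroupName": "TopdownL1"},
    # Hypothetical metric without the Default tag; skipped in default mode.
    {"MetricName": "tma_info_thread_ipc",
     "MetricGroup": "Ret;Summary"},
]

def default_metrics(metrics):
    """Group the metrics tagged 'Default' by their DefaultMetricgroupName."""
    groups = {}
    for m in metrics:
        if "Default" in m["MetricGroup"].split(";"):
            groups.setdefault(m["DefaultMetricgroupName"], []).append(m["MetricName"])
    return groups

print(default_metrics(metrics))
# -> {'TopdownL1': ['tma_frontend_bound', 'tma_retiring']}
```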