From: Kan Liang <[email protected]>
Icelake has support for measuring the level 1 TopDown metrics
directly in hardware. This is implemented by an additional METRICS
register, and a new Fixed Counter 3 that measures pipeline SLOTS.
New in Icelake:
- Generic counters are not required, so TopDown can always be collected
in addition to other events.
- TopDown can be measured per thread/process instead of only per core.
For the Ice Lake implementation of performance metrics, the values in
PERF_METRICS MSR are derived from fixed counter 3. Software should start
both registers, PERF_METRICS and fixed counter 3, from zero.
Additionally, software is recommended to periodically clear both
registers in order to maintain accurate measurements. The latter is
required for certain scenarios that involve sampling metrics at high
rates. Software should always write fixed counter 3 before writing to
PERF_METRICS.
IA32_PERF_GLOBAL_STATUS.OVF_PERF_METRICS[48]: If this bit is set,
it indicates that some PERF_METRICS-related counter has overflowed and
a PMI is triggered. Software has to synchronize, e.g. re-start,
PERF_METRICS as well as fixed counter 3. Otherwise, PERF_METRICS may
return invalid values.
Limitations
- To get accurate results and avoid reading the METRICS register multiple
times, the TopDown metrics events and the SLOTS event have to be in the
same group.
- The METRICS and SLOTS registers have to be cleared by SW after each read.
This prevents loss of precision.
- Sampling read of the SLOTS and TopDown metric events is not supported.
Please refer to SDM Vol3, 18.3.9.3 Performance Metrics for the details of
TopDown metrics.
Changes since V4:
- Add a description of the event-code naming for fixed counters
- Fix add_nr_metric_event().
For leader event, we have to take the accepted metrics events into
account.
For sibling event, it doesn't need to count accepted metrics events
again.
- Remove is_first_topdown_event_in_group().
Force slots in topdown group. Only update topdown events with slots
event.
- Re-use last_period and period_left for saved_metric and saved_slots.
Changes since V3:
- Separate fixed counter3 definition patch
- Separate BTS index patch
- Apply Peter's cleanup patch
- Fix the name of perf capabilities for perf METRICS
- Apply patch for mul_u64_u32_div() x86_64 implementation
- Fix unconditionally allowing the collection of 4 extra events
- Add patch to clean up NMI handler by naming global status bit
- Add patch to reuse event_base_rdpmc for RDPMC userspace support
Changes since V2:
- Rebase on top of v5.3-rc1
Key changes since V1:
- Remove the variables for reg_idx and the enabled_events[] array.
The reg_idx can be calculated from idx at runtime.
Use the existing active_mask instead of enabled_events.
- Choose value 47 for the fixed index of BTS.
- Support the OVF_PERF_METRICS overflow bit in the PMI handler
- Drop the caching mechanism and related variables
The new mechanism is to update all active slots/metrics events for the
first slots/metrics event in a group. For each group read, it
still only reads the slots/perf_metrics MSR once
- Disable the PMU for reads of topdown events to avoid the NMI issue
- Move RDPMC support to a separate patch
- Use event=0x00,umask=0x1X for topdown metrics events
- Drop the patch which adds the REMOVE transaction
We can detect x86_pmu_stop() by checking
(event && !test_bit(event->hw.idx, cpuc->active_mask)),
which is a good place to save the slots/metrics MSR value
Andi Kleen (2):
perf, tools, stat: Support new per thread TopDown metrics
perf, tools: Add documentation for topdown metrics
Kan Liang (12):
perf/x86/intel: Introduce the fourth fixed counter
perf/x86/intel: Set correct mask for TOPDOWN.SLOTS
perf/x86/intel: Move BTS index to 47
perf/x86/intel: Basic support for metrics counters
perf/x86/intel: Fix the name of perf capabilities for perf METRICS
perf/x86/intel: Support hardware TopDown metrics
perf/x86/intel: Support per thread RDPMC TopDown metrics
perf/x86/intel: Export TopDown events for Icelake
perf/x86/intel: Disable sampling read slots and topdown
perf/x86/intel: Name global status bit in NMI handler
perf/x86: Use event_base_rdpmc for RDPMC userspace support
perf, tools, stat: Check Topdown Metric group
arch/x86/events/core.c | 86 +++++-
arch/x86/events/intel/core.c | 399 ++++++++++++++++++++++---
arch/x86/events/perf_event.h | 57 +++-
arch/x86/include/asm/msr-index.h | 3 +
arch/x86/include/asm/perf_event.h | 60 +++-
include/linux/perf_event.h | 29 +-
tools/perf/Documentation/perf-stat.txt | 9 +-
tools/perf/Documentation/topdown.txt | 235 +++++++++++++++
tools/perf/builtin-stat.c | 97 ++++++
tools/perf/util/stat-shadow.c | 89 ++++++
tools/perf/util/stat.c | 4 +
tools/perf/util/stat.h | 8 +
12 files changed, 1007 insertions(+), 69 deletions(-)
create mode 100644 tools/perf/Documentation/topdown.txt
--
2.17.1
From: Kan Liang <[email protected]>
The fourth fixed counter, TOPDOWN.SLOTS, is introduced in Ice Lake.
Add MSR address and macros for the new fixed counter, which will be used
in the following patch.
Add comments to explain the event encoding rules for fixed counters.
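
As an illustration of the encoding rule, the sketch below builds the raw
config value for a fixed counter that has no equivalent general-purpose
event. The helper name fixed_pseudo_config() is hypothetical and not part
of this patch. It assumes the usual layout (event code in bits 7:0, umask
in bits 15:8) and, judging from the 0x3 and 0x4 umasks used for
CPU_CLK_Unhalted.Ref and TOPDOWN.SLOTS, that the umask is the counter
index plus one.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: pseudo-encoding of a
 * fixed-mode PMC without an equivalent general-purpose event.
 * The event code is always 0x00; the umask is assumed to be index + 1.
 */
static inline uint64_t fixed_pseudo_config(unsigned int fixed_idx)
{
	uint64_t event = 0x00;
	uint64_t umask;

	/* Keep the umask within the 0x0X range described in the comment. */
	assert(fixed_idx < 15);
	umask = fixed_idx + 1;

	return (umask << 8) | event;
}
```

Under these assumptions, fixed counter 2 yields 0x0300 and fixed counter 3
yields 0x0400, matching the encodings documented in the comments.
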
Signed-off-by: Kan Liang <[email protected]>
---
Changes since V4:
- Add a description of the event-code naming for fixed counters
arch/x86/include/asm/perf_event.h | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index ee26e9215f18..55a4d05ba6ec 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -146,12 +146,22 @@ struct x86_pmu_capability {
*/
/*
- * All 3 fixed-mode PMCs are configured via this single MSR:
+ * All fixed-mode PMCs are configured via this single MSR:
*/
#define MSR_ARCH_PERFMON_FIXED_CTR_CTRL 0x38d
/*
- * The counts are available in three separate MSRs:
+ * There is no event-code assigned to fixed-mode PMCs.
+ * For a fixed-mode PMC which has an equivalent event on general-purpose PMCs,
+ * use the event-code of the equivalent event for the fixed-mode PMC.
+ * E.g. Instr_Retired.Any, CPU_CLK_Unhalted.Core
+ *
+ * For fixed-mode PMCs which don't have an equivalent event,
+ * use a pseudo-encoding, e.g. CPU_CLK_Unhalted.Ref, TOPDOWN.SLOTS.
+ * The event-code for such fixed-mode PMCs must be 0x00.
+ * The umask-code is 0x0X. X is the index of the fixed counter plus one.
+ *
+ * The counts are available in separate MSRs:
*/
/* Instr_Retired.Any: */
@@ -162,11 +172,16 @@ struct x86_pmu_capability {
#define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
#define INTEL_PMC_IDX_FIXED_CPU_CYCLES (INTEL_PMC_IDX_FIXED + 1)
-/* CPU_CLK_Unhalted.Ref: */
+/* CPU_CLK_Unhalted.Ref: event=0x00,umask=0x3 (pseudo-encoding) */
#define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
#define INTEL_PMC_IDX_FIXED_REF_CYCLES (INTEL_PMC_IDX_FIXED + 2)
#define INTEL_PMC_MSK_FIXED_REF_CYCLES (1ULL << INTEL_PMC_IDX_FIXED_REF_CYCLES)
+/* TOPDOWN.SLOTS: event=0x00,umask=0x4 (pseudo-encoding) */
+#define MSR_ARCH_PERFMON_FIXED_CTR3 0x30c
+#define INTEL_PMC_IDX_FIXED_SLOTS (INTEL_PMC_IDX_FIXED + 3)
+#define INTEL_PMC_MSK_FIXED_SLOTS (1ULL << INTEL_PMC_IDX_FIXED_SLOTS)
+
/*
* We model BTS tracing as another fixed-mode PMC.
*
--
2.17.1
From: Kan Liang <[email protected]>
TOPDOWN.SLOTS(0x0400) is not a generic event. It is only available on
fixed counter3.
Don't extend its mask to generic counters.
Signed-off-by: Kan Liang <[email protected]>
---
No changes since V4
arch/x86/events/intel/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index dc64b16e6b71..b61e81316c2b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5118,12 +5118,14 @@ __init int intel_pmu_init(void)
if (x86_pmu.event_constraints) {
/*
- * event on fixed counter2 (REF_CYCLES) only works on this
+ * event on fixed counter2 (REF_CYCLES) and
+ * fixed counter3 (TOPDOWN.SLOTS) only work on this
* counter, so do not extend mask to generic counters
*/
for_each_event_constraint(c, x86_pmu.event_constraints) {
if (c->cmask == FIXED_EVENT_FLAGS
- && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES) {
+ && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES
+ && c->idxmsk64 != INTEL_PMC_MSK_FIXED_SLOTS) {
c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
}
c->idxmsk64 &=
--
2.17.1
From: Kan Liang <[email protected]>
Metrics counters (hardware counters containing multiple metrics)
are modeled as separate registers, one for each TopDown metric event,
with an extra reg being used for coordinating access to the
underlying register in the scheduler.
Add the basic infrastructure to separate the scheduler register indexes
from the actual hardware register indexes. In most cases the MSR address
is already used correctly, but for code using indexes we need to calculate
the correct underlying register.
The TopDown metric events share fixed counter 3, so it only needs to be
enabled/disabled once for all of them.
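
The "only once" rule can be sketched as a check on the active mask: the
shared counter is touched in hardware only when no other TopDown event is
active. This is a standalone userspace sketch, not the kernel code. The
constants mirror the index definitions in this patch (fixed counters
starting at index 32 follows the existing FIXED_EVENT_CONSTRAINT macro),
and must_touch_slots_counter() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IDX_FIXED        32                 /* first fixed counter index */
#define IDX_FIXED_SLOTS  (IDX_FIXED + 3)    /* fixed counter 3 */
#define IDX_METRIC_BASE  (IDX_FIXED + 16)   /* first metric pseudo-counter */
#define MSK_TOPDOWN      ((0xfULL << IDX_METRIC_BASE) | \
			  (1ULL << IDX_FIXED_SLOTS))

/*
 * Hypothetical helper: returns true when no TopDown event other than the
 * one at idx is active, i.e. fixed counter 3 really has to be enabled or
 * disabled in hardware.
 */
static inline bool must_touch_slots_counter(uint64_t active_mask, int idx)
{
	uint64_t others;

	assert(idx == IDX_FIXED_SLOTS ||
	       (idx >= IDX_METRIC_BASE && idx < IDX_METRIC_BASE + 4));

	/* All TopDown bits except the one belonging to this event. */
	others = MSK_TOPDOWN & ~(1ULL << idx);
	return (active_mask & others) == 0;
}
```

Bits of generic counters in the active mask do not influence the result;
only the slots bit and the four metric bits are considered.
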
Naming:
The events which use metrics counters are called TopDown metric
events or metric events in the code.
The event on fixed counter 3 is called the TopDown slots event or slots event.
Topdown events stand for metric events + slots event in the code.
Many thanks to Peter Zijlstra for cleaning up the patch. All the topdown
support has been properly placed in the fixed counter functions, so
is_topdown_idx() only needs to be checked once. Also, clean up
x86_assign_hw_event() by converting multiple if-else statements to a
switch statement.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
No changes since V4
arch/x86/events/core.c | 23 ++++++--
arch/x86/events/intel/core.c | 98 ++++++++++++++++++++++---------
arch/x86/events/perf_event.h | 14 +++++
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/perf_event.h | 28 +++++++++
5 files changed, 129 insertions(+), 35 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6e3f0c18908e..12410f4beea5 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1054,22 +1054,33 @@ static inline void x86_assign_hw_event(struct perf_event *event,
struct cpu_hw_events *cpuc, int i)
{
struct hw_perf_event *hwc = &event->hw;
+ int idx;
- hwc->idx = cpuc->assign[i];
+ idx = hwc->idx = cpuc->assign[i];
hwc->last_cpu = smp_processor_id();
hwc->last_tag = ++cpuc->tags[i];
- if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
+ switch (hwc->idx) {
+ case INTEL_PMC_IDX_FIXED_BTS:
hwc->config_base = 0;
hwc->event_base = 0;
- } else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
+ break;
+
+ case INTEL_PMC_IDX_FIXED_METRIC_BASE ... INTEL_PMC_IDX_FIXED_METRIC_BASE + 3:
+ /* All METRIC events are mapped onto the fixed SLOTS event */
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ /* fall through */
+ case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
- hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
- hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
- } else {
+ hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (idx - INTEL_PMC_IDX_FIXED);
+ hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30;
+ break;
+
+ default:
hwc->config_base = x86_pmu_config_addr(hwc->idx);
hwc->event_base = x86_pmu_event_addr(hwc->idx);
hwc->event_base_rdpmc = x86_pmu_rdpmc_index(hwc->idx);
+ break;
}
}
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b61e81316c2b..9b40d6c0eb5a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2129,27 +2129,60 @@ static inline void intel_pmu_ack_status(u64 ack)
wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
}
-static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
+static inline bool event_is_checkpointed(struct perf_event *event)
+{
+ return unlikely(event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
+}
+
+static inline void intel_set_masks(struct perf_event *event, int idx)
{
- int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ if (event->attr.exclude_host)
+ __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
+ if (event->attr.exclude_guest)
+ __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
+ if (event_is_checkpointed(event))
+ __set_bit(idx, (unsigned long *)&cpuc->intel_cp_status);
+}
+
+static inline void intel_clear_masks(struct perf_event *event, int idx)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ __clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
+ __clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
+ __clear_bit(idx, (unsigned long *)&cpuc->intel_cp_status);
+}
+
+static void intel_pmu_disable_fixed(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
u64 ctrl_val, mask;
+ int idx = hwc->idx;
- mask = 0xfULL << (idx * 4);
+ if (is_topdown_idx(idx)) {
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ /*
+ * When there are other Top-Down events still active,
+ * don't disable the SLOTS counter.
+ */
+ if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
+ return;
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ }
+ intel_clear_masks(event, idx);
+
+ mask = 0xfULL << ((idx - INTEL_PMC_IDX_FIXED) * 4);
rdmsrl(hwc->config_base, ctrl_val);
ctrl_val &= ~mask;
wrmsrl(hwc->config_base, ctrl_val);
}
-static inline bool event_is_checkpointed(struct perf_event *event)
-{
- return (event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
-}
-
static void intel_pmu_disable_event(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
intel_pmu_disable_bts();
@@ -2157,18 +2190,19 @@ static void intel_pmu_disable_event(struct perf_event *event)
return;
}
- cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
- cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
- cpuc->intel_cp_status &= ~(1ull << hwc->idx);
-
- if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
- intel_pmu_disable_fixed(hwc);
- else
+ if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
+ intel_pmu_disable_fixed(event);
+ } else {
+ intel_clear_masks(event, hwc->idx);
x86_pmu_disable_event(event);
+ }
/*
* Needs to be called after x86_pmu_disable_event,
* so we don't trigger the event without PEBS bit set.
+ *
+ * Metric stuff doesn't do PEBS. So the early exit from
+ * intel_pmu_disable_fixed() is OK.
*/
if (unlikely(event->attr.precise_ip))
intel_pmu_pebs_disable(event);
@@ -2193,8 +2227,22 @@ static void intel_pmu_read_event(struct perf_event *event)
static void intel_pmu_enable_fixed(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
u64 ctrl_val, mask, bits = 0;
+ int idx = hwc->idx;
+
+ if (is_topdown_idx(idx)) {
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ /*
+ * When there are other Top-Down events already active,
+ * don't enable the SLOTS counter.
+ */
+ if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
+ return;
+
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ }
+
+ intel_set_masks(event, idx);
/*
* Enable IRQ generation (0x8), if not PEBS,
@@ -2214,6 +2262,7 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
if (x86_pmu.version > 2 && hwc->config & ARCH_PERFMON_EVENTSEL_ANY)
bits |= 0x4;
+ idx -= INTEL_PMC_IDX_FIXED;
bits <<= (idx * 4);
mask = 0xfULL << (idx * 4);
@@ -2231,7 +2280,6 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
static void intel_pmu_enable_event(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
if (!__this_cpu_read(cpu_hw_events.enabled))
@@ -2241,23 +2289,15 @@ static void intel_pmu_enable_event(struct perf_event *event)
return;
}
- if (event->attr.exclude_host)
- cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
- if (event->attr.exclude_guest)
- cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
-
- if (unlikely(event_is_checkpointed(event)))
- cpuc->intel_cp_status |= (1ull << hwc->idx);
-
if (unlikely(event->attr.precise_ip))
intel_pmu_pebs_enable(event);
if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
intel_pmu_enable_fixed(event);
- return;
+ } else {
+ intel_set_masks(event, hwc->idx);
+ __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
}
-
- __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
}
static void intel_pmu_add_event(struct perf_event *event)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 930611db8f9a..6ebca54f86df 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -356,6 +356,20 @@ struct cpu_hw_events {
#define FIXED_EVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, (1ULL << (32+n)), FIXED_EVENT_FLAGS)
+/*
+ * Special metric counters do not actually exist, but get remapped
+ * to a combination of FxCtr3 + MSR_PERF_METRICS
+ *
+ * This allocates them to a dummy offset for the scheduler.
+ * This does not allow sharing of multiple users of the same
+ * metric without multiplexing, even though the hardware supports that
+ * in principle.
+ */
+
+#define METRIC_EVENT_CONSTRAINT(c, n) \
+ EVENT_CONSTRAINT(c, (1ULL << (INTEL_PMC_IDX_FIXED_METRIC_BASE+n)), \
+ FIXED_EVENT_FLAGS)
+
/*
* Constraint on the Event code + UMask
*/
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 20ce682a2540..bc6a5c2c8f86 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -799,6 +799,7 @@
#define MSR_CORE_PERF_FIXED_CTR0 0x00000309
#define MSR_CORE_PERF_FIXED_CTR1 0x0000030a
#define MSR_CORE_PERF_FIXED_CTR2 0x0000030b
+#define MSR_CORE_PERF_FIXED_CTR3 0x0000030c
#define MSR_CORE_PERF_FIXED_CTR_CTRL 0x0000038d
#define MSR_CORE_PERF_GLOBAL_STATUS 0x0000038e
#define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7df1d5b78aa8..3f1290424c52 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -191,6 +191,34 @@ struct x86_pmu_capability {
*/
#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 15)
+/*
+ * We model PERF_METRICS as more magic fixed-mode PMCs, one for each metric
+ *
+ * Internally they all map to Fixed Ctr 3 (SLOTS), and allocate PERF_METRICS
+ * as an extra_reg. PERF_METRICS has no own configuration, but we fill in
+ * the configuration of FxCtr3 to enforce that all the shared users of SLOTS
+ * have the same configuration.
+ */
+#define INTEL_PMC_IDX_FIXED_METRIC_BASE (INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_TD_RETIRING (INTEL_PMC_IDX_FIXED_METRIC_BASE + 0)
+#define INTEL_PMC_IDX_TD_BAD_SPEC (INTEL_PMC_IDX_FIXED_METRIC_BASE + 1)
+#define INTEL_PMC_IDX_TD_FE_BOUND (INTEL_PMC_IDX_FIXED_METRIC_BASE + 2)
+#define INTEL_PMC_IDX_TD_BE_BOUND (INTEL_PMC_IDX_FIXED_METRIC_BASE + 3)
+#define INTEL_PMC_MSK_TOPDOWN ((0xfull << INTEL_PMC_IDX_FIXED_METRIC_BASE) | \
+ INTEL_PMC_MSK_FIXED_SLOTS)
+
+static inline bool is_metric_idx(int idx)
+{
+ return (unsigned)(idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) < 4;
+}
+
+static inline bool is_topdown_idx(int idx)
+{
+ return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS;
+}
+
+#define INTEL_PMC_OTHER_TOPDOWN_BITS(bit) (~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN)
+
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(62)
#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
--
2.17.1
From: Kan Liang <[email protected]>
Intro
=====
Icelake has support for measuring the four top level TopDown metrics
directly in hardware. This is implemented by an additional "metrics"
register, and a new Fixed Counter 3 that measures pipeline "slots".
Events
======
We export four metric events as separate perf events, which map to the
internal "metrics" counter register. Those events do not exist in
hardware, but can be allocated by the scheduler.
For the event mapping we use a special 0x00 event code, which is
reserved for fake events. The metric events start from umask 0x10.
When setting up such events they point to the slots counter, and a
special callback, update_topdown_event(), reads the additional metrics
MSR to generate the metrics. The metric is then reported by multiplying
the metric (fraction) by slots.
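
The conversion from a fraction to a slot count can be sketched as follows.
This is a standalone illustration rather than the code in this patch: the
name metric_to_slots() is hypothetical, and the split into quotient and
remainder is just one way to avoid overflowing 64 bits (the patch itself
uses mul_u64_u32_div()). Each metric is assumed to occupy one byte of the
metrics value, in the order retiring, bad speculation, frontend bound,
backend bound.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustration only: convert the 8-bit fraction of one metric into a
 * slot count. metric_idx selects the byte (0 = retiring, 1 = bad
 * speculation, 2 = frontend bound, 3 = backend bound).
 */
static inline uint64_t metric_to_slots(uint64_t metrics, uint64_t slots,
				       unsigned int metric_idx)
{
	uint32_t frac;

	assert(metric_idx < 4);
	frac = (metrics >> (metric_idx * 8)) & 0xff;

	/*
	 * slots-in-metric = slots * frac / 0xff, computed as
	 * q * frac + (r * frac) / 0xff with slots = q * 0xff + r,
	 * which gives the same result without a 64-bit overflow.
	 */
	return (slots / 0xff) * frac + (slots % 0xff) * frac / 0xff;
}
```

For example, with fractions 0x7f, 0x20, 0x30 and 0x30 (which sum to 0xff)
and 2550 slots, the four metrics come out as 1270, 320, 480 and 480 slots.
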
This multiplication makes it easy to keep a running count, for example
when the slots counter overflows, and makes all the standard tools, such
as perf stat, work. They can compute deltas of the values without needing
to know about fractions. This also simplifies accumulating the counts
of child events, which otherwise would need to know how to average
fraction values.
None of the four metric events supports sampling. Since they are
handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is
introduced to indicate this case.
The slots event can support both sampling and counting.
For counting, the flag is also applied.
For sampling, it is handled like any other normal event.
Groups
======
The slots event is required in a Topdown group.
To avoid reading the METRICS register multiple times, the metrics and
slots values can only be updated by the slots event in a group.
All active slots and metrics events are updated at the same time.
Reset
======
For the Ice Lake implementation of performance metrics, the values in
PERF_METRICS MSR are derived from fixed counter 3. Software should start
both registers, PERF_METRICS and fixed counter 3, from zero.
Additionally, software is recommended to periodically clear both
registers in order to maintain accurate measurements. The latter is
required for certain scenarios that involve sampling metrics at high
rates. Software should always write fixed counter 3 before writing to
PERF_METRICS.
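
The required write order can be sketched in isolation. The snippet below
is a userspace stand-in, not kernel code: fake_wrmsr() and struct msr_log
are hypothetical and only record which register would be written first.
The address of fixed counter 3 (0x30c) is taken from this series; the
PERF_METRICS address (0x329) is the value given in the SDM and is not
shown in this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIXED_CTR3    0x30c   /* fixed counter 3 (SLOTS) */
#define PERF_METRICS  0x329   /* PERF_METRICS, address per the SDM */

/* Stand-in for a real MSR write: records the order of the writes. */
struct msr_log {
	uint32_t addr[4];
	size_t n;
};

static void fake_wrmsr(struct msr_log *log, uint32_t addr, uint64_t val)
{
	(void)val;
	assert(log->n < 4);
	log->addr[log->n++] = addr;
}

/* Clear both registers, fixed counter 3 first, then PERF_METRICS. */
static void reset_topdown(struct msr_log *log)
{
	fake_wrmsr(log, FIXED_CTR3, 0);
	fake_wrmsr(log, PERF_METRICS, 0);
}
```
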
NMI
======
The METRICS-related registers may overflow, in which case bit 48 of the
STATUS register is set. If so, PERF_METRICS and Fixed counter 3 are
required to be reset. The patch also updates all active slots and
metrics events in the NMI handler.
update_topdown_event() has to read two registers separately. The
values may be modified by an NMI. The PMU has to be disabled before
calling the function.
RDPMC
======
RDPMC is temporarily disabled. The following patch will enable it.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
Changes since V4
- Fix add_nr_metric_event().
For leader event, we have to take the accepted metrics events into
account.
For sibling event, it doesn't need to count accepted metrics events
again.
- Remove is_first_topdown_event_in_group().
Force slots in topdown group. Only update topdown events with slots
event.
arch/x86/events/core.c | 44 +++++++
arch/x86/events/intel/core.c | 192 ++++++++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 39 +++++++
arch/x86/include/asm/msr-index.h | 2 +
4 files changed, 273 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 12410f4beea5..bfa5e8286eed 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -76,6 +76,9 @@ u64 x86_perf_event_update(struct perf_event *event)
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;
+ /* Specially handle the counting of Topdown slots/metrics */
+ if (unlikely(is_topdown_count(event)) && x86_pmu.update_topdown_event)
+ return x86_pmu.update_topdown_event(event);
/*
* Careful: an NMI might modify the previous event value.
*
@@ -992,6 +995,32 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
return unsched ? -EINVAL : 0;
}
+static int add_nr_metric_event(struct cpu_hw_events *cpuc,
+ struct perf_event *event,
+ int *max_count, bool sibling)
+{
+ /* There are 4 TopDown metrics events. */
+ if (is_metric_event(event) && (++cpuc->n_metric_event > 4))
+ return -EINVAL;
+
+ /*
+ * Take the accepted metrics events into account for leader event.
+ */
+ if (!sibling)
+ *max_count += cpuc->n_metric_event;
+ else if (is_metric_event(event))
+ (*max_count)++;
+
+ return 0;
+}
+
+static void del_nr_metric_event(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ if (is_metric_event(event))
+ cpuc->n_metric_event--;
+}
+
/*
* dogrp: true if must collect siblings events (group)
* returns total number of events and error code
@@ -1027,6 +1056,10 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
cpuc->pebs_output = is_pebs_pt(leader) + 1;
}
+ if (x86_pmu.intel_cap.perf_metrics &&
+ add_nr_metric_event(cpuc, leader, &max_count, false))
+ return -EINVAL;
+
if (is_x86_event(leader)) {
if (n >= max_count)
return -EINVAL;
@@ -1041,6 +1074,10 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
event->state <= PERF_EVENT_STATE_OFF)
continue;
+ if (x86_pmu.intel_cap.perf_metrics &&
+ add_nr_metric_event(cpuc, event, &max_count, true))
+ return -EINVAL;
+
if (n >= max_count)
return -EINVAL;
@@ -1204,6 +1241,11 @@ int x86_perf_event_set_period(struct perf_event *event)
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;
+ /* Specially handle the counting of Topdown slots/metrics */
+ if (unlikely(is_topdown_count(event)) &&
+ x86_pmu.set_topdown_event_period)
+ return x86_pmu.set_topdown_event_period(event);
+
/*
* If we are way outside a reasonable range then just skip forward:
*/
@@ -1481,6 +1523,8 @@ static void x86_pmu_del(struct perf_event *event, int flags)
}
cpuc->event_constraint[i-1] = NULL;
--cpuc->n_events;
+ if (x86_pmu.intel_cap.perf_metrics)
+ del_nr_metric_event(cpuc, event);
perf_event_update_userpage(event);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 9b40d6c0eb5a..d7aecfe03372 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -247,6 +247,10 @@ static struct event_constraint intel_icl_event_constraints[] = {
FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */
+ METRIC_EVENT_CONSTRAINT(0x1000, 0), /* Retiring metric */
+ METRIC_EVENT_CONSTRAINT(0x1100, 1), /* Bad speculation metric */
+ METRIC_EVENT_CONSTRAINT(0x1200, 2), /* FE bound metric */
+ METRIC_EVENT_CONSTRAINT(0x1300, 3), /* BE bound metric */
INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */
@@ -267,6 +271,14 @@ static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffffbfffull, RSP_1),
INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
+ /*
+ * The original Fixed Ctr 3 is shared by different metrics
+ * events. So use the extra reg to enforce the same
+ * configuration on the original register, but do not actually
+ * write to it.
+ */
+ INTEL_UEVENT_EXTRA_REG(0x0400, 0, -1L, TOPDOWN),
+ INTEL_UEVENT_TOPDOWN_EXTRA_REG(0x1000),
EVENT_EXTRA_END
};
@@ -2216,10 +2228,125 @@ static void intel_pmu_del_event(struct perf_event *event)
intel_pmu_pebs_del(event);
}
+static int icl_set_topdown_event_period(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ s64 left = local64_read(&hwc->period_left);
+
+ /*
+ * The values in PERF_METRICS MSR are derived from fixed counter 3.
+ * Software should start both registers, PERF_METRICS and fixed
+ * counter 3, from zero.
+ * Clear PERF_METRICS and Fixed counter 3 in initialization.
+ * After that, both MSRs will be cleared for each read.
+ * Don't need to clear them again.
+ */
+ if (left == x86_pmu.max_period) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+ local64_set(&hwc->period_left, 0);
+ }
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
+{
+ u32 val;
+
+ /*
+ * The metric is reported as an 8-bit integer fraction
+ * summing up to 0xff.
+ * slots-in-metric = (Metric / 0xff) * slots
+ */
+ val = (metric >> ((idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) * 8)) & 0xff;
+ return mul_u64_u32_div(slots, val, 0xff);
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+ u64 slots, u64 metrics)
+{
+ int idx = event->hw.idx;
+ u64 delta;
+
+ if (is_metric_idx(idx))
+ delta = icl_get_metrics_event_value(metrics, slots, idx);
+ else
+ delta = slots;
+
+ local64_add(delta, &event->count);
+}
+
+/*
+ * Update all active Topdown events.
+ *
+ * PERF_METRICS and Fixed counter 3 are read separately. The values may be
+ * modified by an NMI. The PMU has to be disabled before calling this function.
+ */
+static u64 icl_update_topdown_event(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *other;
+ u64 slots, metrics;
+ int idx;
+
+ /* read Fixed counter 3 */
+ rdpmcl((3 | 1<<30), slots);
+ if (!slots)
+ return 0;
+
+ /* read PERF_METRICS */
+ rdpmcl((1<<29), metrics);
+
+ for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+ if (!is_topdown_idx(idx))
+ continue;
+ other = cpuc->events[idx];
+ __icl_update_topdown_event(other, slots, metrics);
+ }
+
+ /*
+ * Check and update this event, which may have been cleared
+ * in active_mask e.g. x86_pmu_stop()
+ */
+ if (event && !test_bit(event->hw.idx, cpuc->active_mask))
+ __icl_update_topdown_event(event, slots, metrics);
+
+ /*
+ * Software is recommended to periodically clear both registers
+ * in order to maintain accurate measurements, which is required for
+ * certain scenarios that involve sampling metrics at high rates.
+ * Software should always write fixed counter 3 before writing to
+ * PERF_METRICS.
+ */
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+
+ return slots;
+}
+
+static void intel_pmu_read_topdown_event(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /* Only need to call update_topdown_event() once for group read. */
+ if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
+ !is_slots_event(event))
+ return;
+
+ perf_pmu_disable(event->pmu);
+ x86_pmu.update_topdown_event(event);
+ perf_pmu_enable(event->pmu);
+}
+
static void intel_pmu_read_event(struct perf_event *event)
{
if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
intel_pmu_auto_reload_read(event);
+ else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
+ intel_pmu_read_topdown_event(event);
else
x86_perf_event_update(event);
}
@@ -2431,6 +2558,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
intel_pt_interrupt();
}
+ /*
+ * Intel Perf metrics
+ */
+ if (__test_and_clear_bit(48, (unsigned long *)&status)) {
+ handled++;
+ if (x86_pmu.update_topdown_event)
+ x86_pmu.update_topdown_event(NULL);
+ }
+
/*
* Checkpointed counters can lead to 'spurious' PMIs because the
* rollback caused by the PMI will have cleared the overflow status
@@ -3349,6 +3485,42 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.type != PERF_TYPE_RAW)
return 0;
+ /*
+ * Config Topdown slots and metric events
+ *
+ * The slots event on Fixed Counter 3 can support sampling,
+ * which will be handled normally in x86_perf_event_update().
+ *
+ * The metric events don't support sampling.
+ *
+ * For counting, topdown slots and metric events will be
+ * handled specially for event update.
+ * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
+ */
+ if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
+ if (is_metric_event(event) && is_sampling_event(event))
+ return -EINVAL;
+
+ if (!is_sampling_event(event)) {
+ if (event->attr.config1 != 0)
+ return -EINVAL;
+ if (event->attr.config & ARCH_PERFMON_EVENTSEL_ANY)
+ return -EINVAL;
+ /*
+ * Put configuration (minus event) into config1 so that
+ * the scheduler enforces through an extra_reg that
+ * all instances of the metrics events have the same
+ * configuration.
+ */
+ event->attr.config1 = event->hw.config &
+ X86_ALL_EVENT_FLAGS;
+ event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
+
+ if (is_metric_event(event))
+ event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
+ }
+ }
+
if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
return 0;
@@ -5107,6 +5279,8 @@ __init int intel_pmu_init(void)
x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xca, .umask=0x02);
x86_pmu.lbr_pt_coexist = true;
intel_pmu_pebs_data_source_skl(pmem);
+ x86_pmu.update_topdown_event = icl_update_topdown_event;
+ x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
pr_cont("Icelake events, ");
name = "icelake";
break;
@@ -5163,10 +5337,17 @@ __init int intel_pmu_init(void)
* counter, so do not extend mask to generic counters
*/
for_each_event_constraint(c, x86_pmu.event_constraints) {
- if (c->cmask == FIXED_EVENT_FLAGS
- && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES
- && c->idxmsk64 != INTEL_PMC_MSK_FIXED_SLOTS) {
- c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
+ if (c->cmask == FIXED_EVENT_FLAGS) {
+ /*
+ * Don't extend topdown slots and metrics
+ * events to generic counters.
+ */
+ if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) {
+ c->weight = hweight64(c->idxmsk64);
+ continue;
+ }
+ if (c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES)
+ c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
}
c->idxmsk64 &=
~(~0ULL << (INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed));
@@ -5219,6 +5400,9 @@ __init int intel_pmu_init(void)
if (x86_pmu.counter_freezing)
x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
+ if (x86_pmu.intel_cap.perf_metrics)
+ x86_pmu.intel_ctrl |= 1ULL << 48;
+
return 0;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ecce05141f71..404bf3f2c293 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -40,6 +40,7 @@ enum extra_reg_type {
EXTRA_REG_LBR = 2, /* lbr_select */
EXTRA_REG_LDLAT = 3, /* ld_lat_threshold */
EXTRA_REG_FE = 4, /* fe_* */
+ EXTRA_REG_TOPDOWN = 5, /* Topdown slots/metrics */
EXTRA_REG_MAX /* number of entries needed */
};
@@ -77,6 +78,29 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
+#define PERF_X86_EVENT_TOPDOWN 0x1000 /* Count Topdown slots/metrics events */
+
+static inline bool is_topdown_count(struct perf_event *event)
+{
+ return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
+}
+
+static inline bool is_metric_event(struct perf_event *event)
+{
+ return ((event->attr.config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
+ ((event->attr.config & INTEL_ARCH_EVENT_MASK) >= 0x1000) &&
+ ((event->attr.config & INTEL_ARCH_EVENT_MASK) <= 0x1300);
+}
+
+static inline bool is_slots_event(struct perf_event *event)
+{
+ return (event->attr.config & INTEL_ARCH_EVENT_MASK) == 0x0400;
+}
+
+static inline bool is_topdown_event(struct perf_event *event)
+{
+ return is_metric_event(event) || is_slots_event(event);
+}
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -266,6 +290,12 @@ struct cpu_hw_events {
*/
u64 tfa_shadow;
+ /*
+ * Perf Metrics
+ */
+ /* number of accepted metrics events */
+ int n_metric_event;
+
/*
* AMD specific bits
*/
@@ -517,6 +547,9 @@ struct extra_reg {
0xffff, \
LDLAT)
+#define INTEL_UEVENT_TOPDOWN_EXTRA_REG(event) \
+ EVENT_EXTRA_REG(event, 0, 0xfcff, -1L, TOPDOWN)
+
#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)
union perf_capabilities {
@@ -696,6 +729,12 @@ struct x86_pmu {
*/
atomic_t lbr_exclusive[x86_lbr_exclusive_max];
+ /*
+ * Intel perf metrics
+ */
+ u64 (*update_topdown_event)(struct perf_event *event);
+ int (*set_topdown_event_period)(struct perf_event *event);
+
/*
* perf task context (i.e. struct perf_event_context::task_ctx_data)
* switch helper to bridge calls from perf/core to perf/x86.
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index bc6a5c2c8f86..4571b79b63db 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -118,6 +118,8 @@
#define MSR_TURBO_RATIO_LIMIT1 0x000001ae
#define MSR_TURBO_RATIO_LIMIT2 0x000001af
+#define MSR_PERF_METRICS 0x00000329
+
#define MSR_LBR_SELECT 0x000001c8
#define MSR_LBR_TOS 0x000001c9
#define MSR_LBR_NHM_FROM 0x00000680
--
2.17.1
From: Kan Liang <[email protected]>
Export new TopDown metrics events for perf that map to the sub metrics
in the metrics register, and another for the new slots fixed counter.
This makes the new fixed counters in Icelake visible to the perf
user tools.
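
As a rough illustration of how the event strings in this patch relate to
raw configs, the sketch below packs the event code into bits 0-7 and the
umask into bits 8-15, which is the usual x86 event select layout. The
helper names td_raw_config() and td_is_metric_config() are hypothetical
and not kernel API; the 0x1000..0x1300 range mirrors is_metric_event()
from this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: event code in bits 0-7, umask in bits 8-15. */
static uint64_t td_raw_config(uint8_t event, uint8_t umask)
{
	return ((uint64_t)umask << 8) | event;
}

/*
 * Hypothetical helper mirroring the range check in is_metric_event():
 * event code 0 with a umask between 0x10 and 0x13.
 */
static bool td_is_metric_config(uint64_t config)
{
	return (config & 0xff) == 0 &&
	       (config & 0xffff) >= 0x1000 &&
	       (config & 0xffff) <= 0x1300;
}
```

With this layout, "event=0x00,umask=0x4" (slots) becomes raw config
0x0400 and "event=0x00,umask=0x10" (topdown-retiring) becomes 0x1000.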
Originally-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
No changes since V4
arch/x86/events/intel/core.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0d1a327c18fc..d913dda3e1c2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -321,6 +321,12 @@ EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
"4", "2");
+EVENT_ATTR_STR(slots, slots, "event=0x00,umask=0x4");
+EVENT_ATTR_STR(topdown-retiring, td_retiring, "event=0x00,umask=0x10");
+EVENT_ATTR_STR(topdown-bad-spec, td_bad_spec, "event=0x00,umask=0x11");
+EVENT_ATTR_STR(topdown-fe-bound, td_fe_bound, "event=0x00,umask=0x12");
+EVENT_ATTR_STR(topdown-be-bound, td_be_bound, "event=0x00,umask=0x13");
+
static struct attribute *snb_events_attrs[] = {
EVENT_PTR(td_slots_issued),
EVENT_PTR(td_slots_retired),
@@ -4567,6 +4573,15 @@ static struct attribute *icl_events_attrs[] = {
NULL,
};
+static struct attribute *icl_td_events_attrs[] = {
+ EVENT_PTR(slots),
+ EVENT_PTR(td_retiring),
+ EVENT_PTR(td_bad_spec),
+ EVENT_PTR(td_fe_bound),
+ EVENT_PTR(td_be_bound),
+ NULL,
+};
+
static struct attribute *icl_tsx_events_attrs[] = {
EVENT_PTR(tx_start),
EVENT_PTR(tx_abort),
@@ -5342,6 +5357,7 @@ __init int intel_pmu_init(void)
hsw_format_attr : nhm_format_attr;
extra_skl_attr = skl_format_attr;
mem_attr = icl_events_attrs;
+ td_attr = icl_td_events_attrs;
tsx_attr = icl_tsx_events_attrs;
x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xca, .umask=0x02);
x86_pmu.lbr_pt_coexist = true;
--
2.17.1
From: Kan Liang <[email protected]>
With Ice Lake CPUs, the TopDown metrics are directly available as fixed
counters and do not require generic counters, which makes it possible to
measure TopDown per thread/process instead of only per core.
The metrics and slots values have to be saved/restored during context
switching.
The saved values are also used as previous values to calculate the
delta.
The PERF_METRICS MSR value is returned when RDPMC is used to read a metrics event.
Re-use last_period and period_left, which are unused sampling fields,
for saved_metric and saved_slots.
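
The delta calculation against the saved values can be sketched in plain C
as follows. This is only an illustration under stated assumptions: the
names td_scaled_value() and td_delta() are hypothetical, and the plain
multiplication assumes slots stays below 2^56, whereas the patch uses
mul_u64_u32_div() for the scaling.

```c
#include <stdint.h>

/*
 * Hypothetical helper: scale one 8-bit metric field to a slots count.
 * Assumes slots < 2^56 so that slots * 0xff cannot overflow.
 */
static uint64_t td_scaled_value(uint64_t metrics, uint64_t slots, int field)
{
	uint64_t frac = (metrics >> (field * 8)) & 0xff;

	return slots * frac / 0xff;
}

/*
 * Hypothetical helper: delta of one metric against the saved snapshot.
 * A saved_slots of 0 means there is no previous value. Returns 0 when
 * the rounded value went backwards, so the caller adds nothing.
 */
static uint64_t td_delta(uint64_t metrics, uint64_t slots,
			 uint64_t saved_metrics, uint64_t saved_slots,
			 int field)
{
	uint64_t cur = td_scaled_value(metrics, slots, field);
	uint64_t last = 0;

	if (saved_slots)
		last = td_scaled_value(saved_metrics, saved_slots, field);

	return cur > last ? cur - last : 0;
}
```

The "went backwards" case corresponds to the 8-bit fraction dropping,
for example from 1 to 0, between two reads.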
Signed-off-by: Kan Liang <[email protected]>
---
Changes since V4:
- Re-use last_period and period_left for saved_metric and saved_slots.
arch/x86/events/core.c | 5 +-
arch/x86/events/intel/core.c | 103 +++++++++++++++++++++++++++++------
include/linux/perf_event.h | 29 ++++++----
3 files changed, 108 insertions(+), 29 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index bfa5e8286eed..333541c05815 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2204,7 +2204,10 @@ static int x86_pmu_event_idx(struct perf_event *event)
if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return 0;
- if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
+ /* Return PERF_METRICS MSR value for metrics event */
+ if (is_metric_idx(idx))
+ idx = 1 << 29;
+ else if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
idx -= INTEL_PMC_IDX_FIXED;
idx |= 1 << 30;
}
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d7aecfe03372..0d1a327c18fc 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2244,7 +2244,13 @@ static int icl_set_topdown_event_period(struct perf_event *event)
if (left == x86_pmu.max_period) {
wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
wrmsrl(MSR_PERF_METRICS, 0);
- local64_set(&hwc->period_left, 0);
+ hwc->saved_slots = 0;
+ hwc->saved_metric = 0;
+ }
+
+ if ((hwc->saved_slots) && is_slots_event(event)) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, hwc->saved_slots);
+ wrmsrl(MSR_PERF_METRICS, hwc->saved_metric);
}
perf_event_update_userpage(event);
@@ -2265,7 +2271,7 @@ static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
return mul_u64_u32_div(slots, val, 0xff);
}
-static void __icl_update_topdown_event(struct perf_event *event,
+static u64 icl_get_topdown_value(struct perf_event *event,
u64 slots, u64 metrics)
{
int idx = event->hw.idx;
@@ -2276,7 +2282,50 @@ static void __icl_update_topdown_event(struct perf_event *event,
else
delta = slots;
- local64_add(delta, &event->count);
+ return delta;
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+ u64 slots, u64 metrics,
+ u64 last_slots, u64 last_metrics)
+{
+ u64 delta, last = 0;
+
+ delta = icl_get_topdown_value(event, slots, metrics);
+ if (last_slots)
+ last = icl_get_topdown_value(event, last_slots, last_metrics);
+
+ /*
+ * The 8-bit integer fraction of a metric may not be accurate,
+ * especially when the change is very small.
+ * For example, if only a few bad_spec slots occur, the fraction
+ * may be reduced from 1 to 0. If so, the bad_spec event value
+ * will be 0, which is definitely less than the last value.
+ * Avoid updating event->count in this case.
+ */
+ if (delta > last) {
+ delta -= last;
+ local64_add(delta, &event->count);
+ }
+}
+
+static void update_saved_topdown_regs(struct perf_event *event,
+ u64 slots, u64 metrics)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *other;
+ int idx;
+
+ event->hw.saved_slots = slots;
+ event->hw.saved_metric = metrics;
+
+ for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+ if (!is_topdown_idx(idx))
+ continue;
+ other = cpuc->events[idx];
+ other->hw.saved_slots = slots;
+ other->hw.saved_metric = metrics;
+ }
}
/*
@@ -2290,6 +2339,7 @@ static u64 icl_update_topdown_event(struct perf_event *event)
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_event *other;
u64 slots, metrics;
+ bool reset = true;
int idx;
/* read Fixed counter 3 */
@@ -2304,25 +2354,45 @@ static u64 icl_update_topdown_event(struct perf_event *event)
if (!is_topdown_idx(idx))
continue;
other = cpuc->events[idx];
- __icl_update_topdown_event(other, slots, metrics);
+ __icl_update_topdown_event(other, slots, metrics,
+ event ? event->hw.saved_slots : 0,
+ event ? event->hw.saved_metric : 0);
}
/*
* Check and update this event, which may have been cleared
* in active_mask e.g. x86_pmu_stop()
*/
- if (event && !test_bit(event->hw.idx, cpuc->active_mask))
- __icl_update_topdown_event(event, slots, metrics);
+ if (event && !test_bit(event->hw.idx, cpuc->active_mask)) {
+ __icl_update_topdown_event(event, slots, metrics,
+ event->hw.saved_slots,
+ event->hw.saved_metric);
- /*
- * Software is recommended to periodically clear both registers
- * in order to maintain accurate measurements, which is required for
- * certain scenarios that involve sampling metrics at high rates.
- * Software should always write fixed counter 3 before write to
- * PERF_METRICS.
- */
- wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
- wrmsrl(MSR_PERF_METRICS, 0);
+ /*
+ * In x86_pmu_stop(), the event is cleared in active_mask first
+ * and then the delta is drained, which indicates a context
+ * switch for counting.
+ * Save metric and slots for the context switch.
+ * There is no need to reset PERF_METRICS and fixed counter 3,
+ * because the values will be restored at the next schedule-in.
+ */
+ update_saved_topdown_regs(event, slots, metrics);
+ reset = false;
+ }
+
+ if (reset) {
+ /*
+ * Software is recommended to periodically clear both registers
+ * in order to maintain accurate measurements, which is required
+ * for certain scenarios that involve sampling metrics at high
+ * rates. Software should always write fixed counter 3 before
+ * write to PERF_METRICS.
+ */
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+ if (event)
+ update_saved_topdown_regs(event, 0, 0);
+ }
return slots;
}
@@ -3515,9 +3585,6 @@ static int intel_pmu_hw_config(struct perf_event *event)
event->attr.config1 = event->hw.config &
X86_ALL_EVENT_FLAGS;
event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
-
- if (is_metric_event(event))
- event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
}
}
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 011dcbdbccc2..3f58414e4a91 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -200,17 +200,26 @@ struct hw_perf_event {
*/
u64 sample_period;
- /*
- * The period we started this sample with.
- */
- u64 last_period;
+ union {
+ struct { /* Sampling */
+ /*
+ * The period we started this sample with.
+ */
+ u64 last_period;
- /*
- * However much is left of the current period; note that this is
- * a full 64bit value and allows for generation of periods longer
- * than hardware might allow.
- */
- local64_t period_left;
+ /*
+ * However much is left of the current period;
+ * note that this is a full 64bit value and
+ * allows for generation of periods longer
+ * than hardware might allow.
+ */
+ local64_t period_left;
+ };
+ struct { /* Topdown events counting for context switch */
+ u64 saved_metric;
+ u64 saved_slots;
+ };
+ };
/*
* State for throttling the event, see __perf_event_overflow() and
--
2.17.1
From: Andi Kleen <[email protected]>
Icelake has support for reporting per thread TopDown metrics.
These are reported differently from the previous TopDown support:
each metric is standalone, but scaled to pipeline "slots".
We don't need to do anything special for HyperThreading anymore.
Teach perf stat --topdown to handle these new metrics and
print them in the same way as the previous TopDown metrics.
The restriction of only being able to report information per core is
gone.
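
Since the kernel reports each metric already multiplied by slots, the
tool has to divide by the sum of the four metrics to get ratios back.
A minimal sketch of that step follows; struct td_counts and td_ratio()
are hypothetical names used for illustration only, and the real code in
stat-shadow.c works on runtime_stat averages instead.

```c
#include <stdint.h>

/* Hypothetical container for the four slots-scaled level 1 metrics. */
struct td_counts {
	uint64_t retiring;
	uint64_t bad_spec;
	uint64_t fe_bound;
	uint64_t be_bound;
};

/*
 * Hypothetical helper: ratio of one metric value to the sum of all
 * four. Returns 0 when nothing has been counted yet.
 */
static double td_ratio(const struct td_counts *c, uint64_t value)
{
	double sum = (double)c->retiring + (double)c->bad_spec +
		     (double)c->fe_bound + (double)c->be_bound;

	return sum > 0 ? (double)value / sum : 0.0;
}
```

Multiplying the result by 100 gives the percentage that perf stat
prints next to each topdown event.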
Signed-off-by: Andi Kleen <[email protected]>
---
Changes since V4
- Add slots into topdown group
tools/perf/Documentation/perf-stat.txt | 9 ++-
tools/perf/builtin-stat.c | 25 ++++++++
tools/perf/util/stat-shadow.c | 89 ++++++++++++++++++++++++++
tools/perf/util/stat.c | 4 ++
tools/perf/util/stat.h | 8 +++
5 files changed, 133 insertions(+), 2 deletions(-)
diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index a9af4e440e80..fefbc886c519 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -279,8 +279,13 @@ if the workload is actually bound by the CPU and not by something else.
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
-The top down metrics are collected per core instead of per
-CPU thread. Per core mode is automatically enabled
+This enables --metric-only, unless overridden with --no-metric-only.
+
+On newer CPUs (Ice Lake and later) TopDown can be collected for any thread.
+The following restrictions only apply to older Intel CPUs and Atom:
+
+The top down metrics are collected per core instead of per CPU thread.
+Per core mode is automatically enabled
and -a (global monitoring) is needed, requiring root rights or
perf.perf_event_paranoid=-1.
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c88d4e118409..0ed191bf8b5e 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -126,6 +126,15 @@ static const char * topdown_attrs[] = {
NULL,
};
+static const char *topdown_metric_attrs[] = {
+ "slots",
+ "topdown-retiring",
+ "topdown-bad-spec",
+ "topdown-fe-bound",
+ "topdown-be-bound",
+ NULL,
+};
+
static const char *smi_cost_attrs = {
"{"
"msr/aperf/,"
@@ -1327,6 +1336,21 @@ static int add_default_attributes(void)
char *str = NULL;
bool warn = false;
+ if (topdown_filter_events(topdown_metric_attrs, &str, 1) < 0) {
+ pr_err("Out of memory\n");
+ return -1;
+ }
+ if (topdown_metric_attrs[0] && str) {
+ if (!stat_config.interval) {
+ fprintf(stat_config.output,
+ "Topdown accuracy may decrease when measuring long periods.\n"
+ "Please print the result regularly, e.g. -I1000\n");
+ }
+ goto setup_metrics;
+ }
+
+ str = NULL;
+
if (stat_config.aggr_mode != AGGR_GLOBAL &&
stat_config.aggr_mode != AGGR_CORE) {
pr_err("top down event configuration requires --per-core mode\n");
@@ -1348,6 +1372,7 @@ static int add_default_attributes(void)
if (topdown_attrs[0] && str) {
if (warn)
arch_topdown_group_warn();
+setup_metrics:
err = parse_events(evsel_list, str, &errinfo);
if (err) {
fprintf(stderr,
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 2c41d47f6f83..0484b1377cc8 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -243,6 +243,18 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count,
else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
update_runtime_stat(st, STAT_TOPDOWN_RECOVERY_BUBBLES,
ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_RETIRING))
+ update_runtime_stat(st, STAT_TOPDOWN_RETIRING,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_BAD_SPEC))
+ update_runtime_stat(st, STAT_TOPDOWN_BAD_SPEC,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_FE_BOUND))
+ update_runtime_stat(st, STAT_TOPDOWN_FE_BOUND,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_BE_BOUND))
+ update_runtime_stat(st, STAT_TOPDOWN_BE_BOUND,
+ ctx, cpu, count);
else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT,
ctx, cpu, count);
@@ -694,6 +706,47 @@ static double td_be_bound(int ctx, int cpu, struct runtime_stat *st)
return sanitize_val(1.0 - sum);
}
+/*
+ * Kernel reports metrics multiplied with slots. To get back
+ * the ratios we need to recreate the sum.
+ */
+
+static double td_metric_ratio(int ctx, int cpu,
+ enum stat_type type,
+ struct runtime_stat *stat)
+{
+ double sum = runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, ctx, cpu);
+ double d = runtime_stat_avg(stat, type, ctx, cpu);
+
+ if (sum)
+ return d / sum;
+ return 0;
+}
+
+/*
+ * ... but only if most of the values are actually available.
+ * We allow two missing.
+ */
+
+static bool full_td(int ctx, int cpu,
+ struct runtime_stat *stat)
+{
+ int c = 0;
+
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, ctx, cpu) > 0)
+ c++;
+ return c >= 2;
+}
+
static void print_smi_cost(struct perf_stat_config *config,
int cpu, struct evsel *evsel,
struct perf_stat_output_ctx *out,
@@ -1025,6 +1078,42 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
be_bound * 100.);
else
print_metric(config, ctxp, NULL, NULL, name, 0);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_RETIRING) &&
+ full_td(ctx, cpu, st)) {
+ double retiring = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_RETIRING, st);
+
+ if (retiring > 0.7)
+ color = PERF_COLOR_GREEN;
+ print_metric(config, ctxp, color, "%8.1f%%", "retiring",
+ retiring * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_FE_BOUND) &&
+ full_td(ctx, cpu, st)) {
+ double fe_bound = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_FE_BOUND, st);
+
+ if (fe_bound > 0.2)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "frontend bound",
+ fe_bound * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_BE_BOUND) &&
+ full_td(ctx, cpu, st)) {
+ double be_bound = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_BE_BOUND, st);
+
+ if (be_bound > 0.2)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "backend bound",
+ be_bound * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_BAD_SPEC) &&
+ full_td(ctx, cpu, st)) {
+ double bad_spec = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_BAD_SPEC, st);
+
+ if (bad_spec > 0.1)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "bad speculation",
+ bad_spec * 100.);
} else if (evsel->metric_expr) {
generic_metric(config, evsel->metric_expr, evsel->metric_events, evsel->name,
evsel->metric_name, NULL, avg, cpu, out, st);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 6822e4ffe224..cf7709dbd066 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -95,6 +95,10 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
+ ID(TOPDOWN_RETIRING, topdown-retiring),
+ ID(TOPDOWN_BAD_SPEC, topdown-bad-spec),
+ ID(TOPDOWN_FE_BOUND, topdown-fe-bound),
+ ID(TOPDOWN_BE_BOUND, topdown-be-bound),
ID(SMI_NUM, msr/smi/),
ID(APERF, msr/aperf/),
};
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 081c4a5113c6..368a2a324c90 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -28,6 +28,10 @@ enum perf_stat_evsel_id {
PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
+ PERF_STAT_EVSEL_ID__TOPDOWN_RETIRING,
+ PERF_STAT_EVSEL_ID__TOPDOWN_BAD_SPEC,
+ PERF_STAT_EVSEL_ID__TOPDOWN_FE_BOUND,
+ PERF_STAT_EVSEL_ID__TOPDOWN_BE_BOUND,
PERF_STAT_EVSEL_ID__SMI_NUM,
PERF_STAT_EVSEL_ID__APERF,
PERF_STAT_EVSEL_ID__MAX,
@@ -81,6 +85,10 @@ enum stat_type {
STAT_TOPDOWN_SLOTS_RETIRED,
STAT_TOPDOWN_FETCH_BUBBLES,
STAT_TOPDOWN_RECOVERY_BUBBLES,
+ STAT_TOPDOWN_RETIRING,
+ STAT_TOPDOWN_BAD_SPEC,
+ STAT_TOPDOWN_FE_BOUND,
+ STAT_TOPDOWN_BE_BOUND,
STAT_SMI_NUM,
STAT_APERF,
STAT_MAX
--
2.17.1
From: Andi Kleen <[email protected]>
Add some documentation on how to use the topdown metrics in ring 3.
Signed-off-by: Andi Kleen <[email protected]>
---
Changes since V4
- Update example for Ice Lake
tools/perf/Documentation/topdown.txt | 235 +++++++++++++++++++++++++++
1 file changed, 235 insertions(+)
create mode 100644 tools/perf/Documentation/topdown.txt
diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt
new file mode 100644
index 000000000000..e724d2af3b8d
--- /dev/null
+++ b/tools/perf/Documentation/topdown.txt
@@ -0,0 +1,235 @@
+Using TopDown metrics in user space
+-----------------------------------
+
+Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
+methodology to break down CPU pipeline execution into 4 bottlenecks:
+frontend bound, backend bound, bad speculation, retiring.
+
+For more details on Topdown see [1][5]
+
+Traditionally this was implemented by events in generic counters
+and specific formulas to compute the bottlenecks.
+
+perf stat --topdown implements this.
+
+Full Top Down includes more levels that can break down the
+bottlenecks further. This is not directly implemented in perf,
+but available in other tools that can run on top of perf,
+such as toplev[2] or vtune[3]
+
+New Topdown features in Ice Lake
+===============================
+
+With Ice Lake CPUs the TopDown metrics are directly available as
+fixed counters and do not require generic counters. This allows
+TopDown to always be collected in addition to other events.
+
+% perf stat -a --topdown -I1000
+# time counts unit events
+ 1.000854735 20,097,158,100 slots
+ 1.000854735 79,327,616 topdown-retiring # 0.4% retiring
+ 1.000854735 157,932,715 topdown-bad-spec # 0.8% bad speculation
+ 1.000854735 81,610,855 topdown-fe-bound # 0.4% frontend bound
+ 1.000854735 19,778,286,903 topdown-be-bound # 98.4% backend bound
+ 2.003623823 20,010,908,365 slots
+ 2.003623823 79,905,340 topdown-retiring # 0.4% retiring
+ 2.003623823 158,405,024 topdown-bad-spec # 0.8% bad speculation
+ 2.003623823 87,980,097 topdown-fe-bound # 0.4% frontend bound
+ 2.003623823 19,684,617,888 topdown-be-bound # 98.4% backend bound
+ 3.005828889 20,062,101,220 slots
+ 3.005828889 80,077,032 topdown-retiring # 0.4% retiring
+ 3.005828889 158,682,921 topdown-bad-spec # 0.8% bad speculation
+ 3.005828889 86,579,604 topdown-fe-bound # 0.4% frontend bound
+ 3.005828889 19,736,761,649 topdown-be-bound # 98.4% backend bound
+...
+
+This also enables measuring TopDown per thread/process instead
+of only per core.
+
+Using TopDown through RDPMC in applications on Ice Lake
+======================================================
+
+For more fine grained measurements it can be useful to
+access the new metrics directly from user space. This is more complicated,
+but drastically lowers overhead.
+
+On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
+"pipeline SLOTS" (cycles multiplied by core issue width) and a
+metric register that reports slots ratios for the different bottleneck
+categories.
+
+The metrics counter is CPU model specific and is not available
+on older CPUs.
+
+Example code
+============
+
+Library functions that implement the functionality described below
+are also available in libjevents [4]
+
+The application opens a perf_event file descriptor
+and sets up fixed counter 3 (SLOTS) to start and
+allow user programs to read the performance counters.
+
+Fixed counter 3 is mapped to a pseudo event event=0x00, umask=0x4,
+so the perf_event_attr structure should be initialized with
+{ .config = 0x0400, .type = PERF_TYPE_RAW }
+
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/* Provide own perf_event_open stub because glibc doesn't */
+__attribute__((weak))
+int perf_event_open(struct perf_event_attr *attr, pid_t pid,
+ int cpu, int group_fd, unsigned long flags)
+{
+ return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
+}
+
+/* open slots counter file descriptor for current task */
+struct perf_event_attr slots = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .config = 0x400,
+ .exclude_kernel = 1,
+};
+
+int fd = perf_event_open(&slots, 0, -1, -1, 0);
+if (fd < 0)
+ ... error ...
+
+The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
+to read slots and the topdown metrics at different points of the program:
+
+#include <stdint.h>
+#include <x86intrin.h>
+
+#define RDPMC_FIXED (1 << 30) /* return fixed counters */
+#define RDPMC_METRIC (1 << 29) /* return metric counters */
+
+#define FIXED_COUNTER_SLOTS 3
+#define METRIC_COUNTER_TOPDOWN_L1 0
+
+static inline uint64_t read_slots(void)
+{
+ return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
+}
+
+static inline uint64_t read_metrics(void)
+{
+ return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
+}
+
+Then the program can be instrumented to read these metrics at different
+points.
+
+It's not a good idea to do this with too short code regions,
+as the parallelism and overlap in the CPU program execution will
+cause too much measurement inaccuracy. For example instrumenting
+individual basic blocks is definitely too fine grained.
+
+Decoding metrics values
+=======================
+
+The value reported by read_metrics() contains four 8-bit fields,
+each a scaled ratio for one Level 1 bottleneck category.
+All four fields add up to 0xff (= 100%).
+
+The binary ratios in the metric value can be converted to float ratios:
+
+#define GET_METRIC(m, i) (((m) >> (i*8)) & 0xff)
+
+#define TOPDOWN_RETIRING(val) ((float)GET_METRIC(val, 0) / 0xff)
+#define TOPDOWN_BAD_SPEC(val) ((float)GET_METRIC(val, 1) / 0xff)
+#define TOPDOWN_FE_BOUND(val) ((float)GET_METRIC(val, 2) / 0xff)
+#define TOPDOWN_BE_BOUND(val) ((float)GET_METRIC(val, 3) / 0xff)
+
+and then converted to percent for printing.
+
+The ratios in the metric accumulate for the time when the counter
+is enabled. For measuring programs it is often useful to measure
+specific sections. For this it is necessary to compute deltas on metrics.
+
+This can be done by scaling the metrics with the slots counter
+read at the same time.
+
+Then it's possible to take deltas of these slots counts
+measured at different points, and determine the metrics
+for that time period.
+
+ slots_a = read_slots();
+ metric_a = read_metrics();
+
+ ... larger code region ...
+
+ slots_b = read_slots();
+ metric_b = read_metrics();
+
+ # compute scaled metrics for measurement a (each field is a fraction of 0xff)
+ retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a / 0xff
+ bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a / 0xff
+ fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a / 0xff
+ be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a / 0xff
+
+ # compute delta scaled metrics between b and a
+ retiring_slots = GET_METRIC(metric_b, 0) * slots_b / 0xff - retiring_slots_a
+ bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b / 0xff - bad_spec_slots_a
+ fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b / 0xff - fe_bound_slots_a
+ be_bound_slots = GET_METRIC(metric_b, 3) * slots_b / 0xff - be_bound_slots_a
+
+Later the individual ratios for the measurement period can be recreated
+from these counts.
+
+ slots_delta = slots_b - slots_a
+ retiring_ratio = (float)retiring_slots / slots_delta
+ bad_spec_ratio = (float)bad_spec_slots / slots_delta
+ fe_bound_ratio = (float)fe_bound_slots / slots_delta
+ be_bound_ratio = (float)be_bound_slots / slots_delta
+
+ printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
+ retiring_ratio * 100.,
+ bad_spec_ratio * 100.,
+ fe_bound_ratio * 100.,
+ be_bound_ratio * 100.);
+
+Resetting metrics counters
+==========================
+
+Since the individual metrics are only 8 bits wide, they lose precision for
+short regions over time, because the number of slots covered by each
+fraction bit grows as the counters accumulate. So the counters need to be reset regularly.
+
+When using the kernel perf API the kernel resets on every read.
+So as long as the reading is at reasonable intervals (every few
+seconds) the precision is good.
+
+When using perf stat it is recommended to always use the -I option,
+with an interval no longer than a few seconds
+
+ perf stat -I 1000 --topdown ...
+
+For user programs using RDPMC directly the counter can
+be reset explicitly using ioctl:
+
+ ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);
+
+This "opens" a new measurement period.
+
+A program using RDPMC for TopDown should schedule such a reset
+regularly, as in every few seconds.
+
+Limits on Ice Lake
+==================
+
+All the TopDown events must be in a group with the SLOTS event.
+
+There is no sampling support for TopDown events.
+A sampling read of a group that combines SLOTS and TopDown events is forbidden,
+for example, perf record -e '{slots, topdown-retiring}:S'
+
+[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
+[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
+[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
+[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
+[5] https://sites.google.com/site/analysismethods/yasin-pubs
--
2.17.1
From: Kan Liang <[email protected]>
The RDPMC index is always re-calculated in RDPMC userspace support,
especially for fixed counters.
The RDPMC index value is stored in variable event_base_rdpmc for kernel
usage, which can be used for RDPMC userspace support as well. Only the
metrics event has to be specially handled.
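
The resulting index convention can be sketched as below. This is a
simplified model, not the kernel code: td_event_idx() and td_rdpmc_ecx()
are hypothetical names, and event_base_rdpmc is assumed to already hold
the value RDPMC expects in ECX (bit 30 set for fixed counters). The
exported value is that ECX value plus one, with 0 meaning RDPMC is not
allowed for the event.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDPMC_FIXED	(1u << 30)	/* select fixed counters */
#define RDPMC_METRIC	(1u << 29)	/* select the PERF_METRICS MSR */

/*
 * Hypothetical model of the exported index:
 * 0 means RDPMC is not allowed, otherwise it is the RDPMC ECX value + 1.
 */
static uint32_t td_event_idx(bool rdpmc_allowed, bool is_metric,
			     uint32_t event_base_rdpmc)
{
	if (!rdpmc_allowed)
		return 0;
	if (is_metric)
		return RDPMC_METRIC + 1;
	return event_base_rdpmc + 1;
}

/*
 * Hypothetical user-space side: recover the ECX value for RDPMC.
 * Returns false when the index says RDPMC is unavailable.
 */
static bool td_rdpmc_ecx(uint32_t index, uint32_t *ecx)
{
	if (!index)
		return false;
	*ecx = index - 1;
	return true;
}
```

For fixed counter 3 (SLOTS) the base is RDPMC_FIXED | 3, so the exported
index is 0x40000004; any metrics event exports 0x20000001.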
Signed-off-by: Kan Liang <[email protected]>
---
No changes since V4
arch/x86/events/core.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 48dd920c5e7d..3ab18277e4a7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2203,20 +2203,16 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *m
static int x86_pmu_event_idx(struct perf_event *event)
{
- int idx = event->hw.idx;
+ struct hw_perf_event *hwc = &event->hw;
- if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
+ if (!(hwc->flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return 0;
/* Return PERF_METRICS MSR value for metrics event */
- if (is_metric_idx(idx))
- idx = 1 << 29;
- else if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
- idx -= INTEL_PMC_IDX_FIXED;
- idx |= 1 << 30;
- }
-
- return idx + 1;
+ if (is_metric_idx(hwc->idx))
+ return (1 << 29) + 1;
+ else
+ return hwc->event_base_rdpmc + 1;
}
static ssize_t get_attr_rdpmc(struct device *cdev,
--
2.17.1
From: Kan Liang <[email protected]>
The slots event supports sampling. Users may do a sampling read of the
slots and metrics events, e.g. perf record -e '{slots, topdown-retiring}:S'.
But the metrics event resets fixed counter 3, which disturbs the
sampling of the slots event.
Add specific validate_group() support to reject this case and error out
on Icelake.
An alternative fix would be to unconditionally disable SLOTS sampling,
but that is not a good fix, because users may want to sample only the
slots event, without any topdown metrics event.
Signed-off-by: Kan Liang <[email protected]>
---
No changes since V4
arch/x86/events/core.c | 4 ++++
arch/x86/events/intel/core.c | 20 ++++++++++++++++++++
arch/x86/events/perf_event.h | 2 ++
3 files changed, 26 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 333541c05815..48dd920c5e7d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2111,7 +2111,11 @@ static int validate_group(struct perf_event *event)
fake_cpuc->n_events = 0;
ret = x86_pmu.schedule_events(fake_cpuc, n, NULL);
+ if (ret)
+ goto out;
+ if (x86_pmu.validate_group)
+ ret = x86_pmu.validate_group(fake_cpuc, n);
out:
free_fake_cpuc(fake_cpuc);
return ret;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d913dda3e1c2..7fbf268f5143 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4512,6 +4512,25 @@ static __init void intel_ht_bug(void)
x86_pmu.stop_scheduling = intel_stop_scheduling;
}
+static int icl_validate_group(struct cpu_hw_events *cpuc, int n)
+{
+ bool has_sampling_slots = false, has_metrics = false;
+ struct perf_event *e;
+ int i;
+
+ for (i = 0; i < n; i++) {
+ e = cpuc->event_list[i];
+ if (is_slots_event(e) && is_sampling_event(e))
+ has_sampling_slots = true;
+
+ if (is_metric_event(e))
+ has_metrics = true;
+ }
+ if (unlikely(has_sampling_slots && has_metrics))
+ return -EINVAL;
+ return 0;
+}
+
EVENT_ATTR_STR(mem-loads, mem_ld_hsw, "event=0xcd,umask=0x1,ldlat=3");
EVENT_ATTR_STR(mem-stores, mem_st_hsw, "event=0xd0,umask=0x82")
@@ -5364,6 +5383,7 @@ __init int intel_pmu_init(void)
intel_pmu_pebs_data_source_skl(pmem);
x86_pmu.update_topdown_event = icl_update_topdown_event;
x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
+ x86_pmu.validate_group = icl_validate_group;
pr_cont("Icelake events, ");
name = "icelake";
break;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 404bf3f2c293..132ac123e83f 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -661,6 +661,8 @@ struct x86_pmu {
int perfctr_second_write;
u64 (*limit_period)(struct perf_event *event, u64 l);
+ int (*validate_group)(struct cpu_hw_events *cpuc, int n);
+
/* PMI handler bits */
unsigned int late_ack :1,
counter_freezing :1;
--
2.17.1
From: Kan Liang <[email protected]>
The current NMI handler uses the bit index numbers of the global status
register directly. Replace the numbers with meaningful names to improve
the readability of the code.
Signed-off-by: Kan Liang <[email protected]>
---
No changes since V4
arch/x86/events/intel/core.c | 6 +++---
arch/x86/include/asm/perf_event.h | 7 +++++--
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7fbf268f5143..bc6468329c52 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2616,7 +2616,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
/*
* PEBS overflow sets bit 62 in the global status register
*/
- if (__test_and_clear_bit(62, (unsigned long *)&status)) {
+ if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long *)&status)) {
handled++;
x86_pmu.drain_pebs(regs);
status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
@@ -2625,7 +2625,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
/*
* Intel PT
*/
- if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+ if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned long *)&status)) {
handled++;
if (unlikely(perf_guest_cbs && perf_guest_cbs->is_in_guest() &&
perf_guest_cbs->handle_intel_pt_intr))
@@ -2637,7 +2637,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
/*
* Intel Perf mertrics
*/
- if (__test_and_clear_bit(48, (unsigned long *)&status)) {
+ if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
handled++;
if (x86_pmu.update_topdown_event)
x86_pmu.update_topdown_event(NULL);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 3f1290424c52..e684e7851b48 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -220,12 +220,15 @@ static inline bool is_topdown_idx(int idx)
#define INTEL_PMC_OTHER_TOPDOWN_BITS(bit) (~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN)
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
-#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(62)
+#define GLOBAL_STATUS_BUFFER_OVF_BIT 62
+#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(GLOBAL_STATUS_BUFFER_OVF_BIT)
#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
#define GLOBAL_STATUS_ASIF BIT_ULL(60)
#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
-#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
+#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55
+#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
+#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
/*
* Adaptive PEBS v4
--
2.17.1
From: Kan Liang <[email protected]>
Bit 48 of PERF_GLOBAL_STATUS is now used to indicate the overflow
status of the PERF_METRICS counters.
Move the BTS index to 47.
Signed-off-by: Kan Liang <[email protected]>
---
No changes since V4
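
For illustration only, the index arithmetic behind this change as a
small sketch. The SKETCH_ names are hypothetical; the fixed-counter
base of 32 is assumed to match INTEL_PMC_IDX_FIXED. The old BTS index
(base + 16) lands exactly on the new PERF_METRICS overflow bit, while
the new one (base + 15) does not.

```c
#include <stdint.h>

/* Assumed to match INTEL_PMC_IDX_FIXED. */
#define SKETCH_PMC_IDX_FIXED		32
#define SKETCH_OLD_IDX_FIXED_BTS	(SKETCH_PMC_IDX_FIXED + 16)
#define SKETCH_NEW_IDX_FIXED_BTS	(SKETCH_PMC_IDX_FIXED + 15)
#define SKETCH_PERF_METRICS_OVF_BIT	48

/* Compile-time checks of the collision and of its resolution. */
_Static_assert(SKETCH_OLD_IDX_FIXED_BTS == SKETCH_PERF_METRICS_OVF_BIT,
	       "old BTS index collides with the metrics overflow bit");
_Static_assert(SKETCH_NEW_IDX_FIXED_BTS == 47,
	       "new BTS index sits just below the metrics overflow bit");

/* Hypothetical helper: test one bit of a global status value. */
int sketch_status_bit_set(uint64_t status, unsigned int bit)
{
	return (int)((status >> bit) & 1);
}
```
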
arch/x86/include/asm/perf_event.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 55a4d05ba6ec..7df1d5b78aa8 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -185,11 +185,11 @@ struct x86_pmu_capability {
/*
* We model BTS tracing as another fixed-mode PMC.
*
- * We choose a value in the middle of the fixed event range, since lower
+ * We choose value 47 for the fixed index of BTS, since lower
* values are used by actual fixed events and higher values are used
* to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
*/
-#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 15)
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(62)
--
2.17.1
From: Kan Liang <[email protected]>
The slots event is required in a Topdown Metric group.
Add a check that examines each Topdown Metric group and errors out if
no slots event is found in it.
Only check the group on platforms that use topdown_metric_attrs,
e.g. Ice Lake.
Signed-off-by: Kan Liang <[email protected]>
---
New for V5
tools/perf/builtin-stat.c | 72 +++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 0ed191bf8b5e..948a0300410c 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1134,6 +1134,72 @@ static int topdown_filter_events(const char **attr, char **str, bool use_group)
return 0;
}
+/* Event encoding for Topdown Metric events */
+#define TOPDOWN_SLOTS 0x0400
+#define TOPDOWN_RETIRE 0x1000
+#define TOPDOWN_BAD_SPEC 0x1100
+#define TOPDOWN_FE_BOUND 0x1200
+#define TOPDOWN_BE_BOUND 0x1300
+
+static bool is_topdown_metric_event(struct evsel *counter)
+{
+ if (!counter->pmu_name)
+ return false;
+
+ if (strcmp(counter->pmu_name, "cpu"))
+ return false;
+
+ if ((counter->core.attr.config == TOPDOWN_RETIRE) ||
+ (counter->core.attr.config == TOPDOWN_BAD_SPEC) ||
+ (counter->core.attr.config == TOPDOWN_FE_BOUND) ||
+ (counter->core.attr.config == TOPDOWN_BE_BOUND))
+ return true;
+
+ return false;
+}
+
+static bool is_topdown_slots_event(struct evsel *counter)
+{
+ if (!counter->pmu_name)
+ return false;
+
+ if (strcmp(counter->pmu_name, "cpu"))
+ return false;
+
+ if (counter->core.attr.config == TOPDOWN_SLOTS)
+ return true;
+
+ return false;
+}
+
+static bool topdown_check_group_member(void)
+{
+ struct evsel *counter, *leader, *member;
+ bool has_slots;
+
+ if (!pmu_have_event("cpu", topdown_metric_attrs[0]))
+ return true;
+
+ evlist__for_each_entry(evsel_list, counter) {
+ if (!is_topdown_metric_event(counter))
+ continue;
+
+ leader = counter->leader;
+ has_slots = false;
+
+ for_each_group_evsel(member, leader) {
+ if (is_topdown_slots_event(member))
+ has_slots = true;
+ counter = member;
+ }
+
+ if (!has_slots)
+ return false;
+ }
+
+ return true;
+}
+
__weak bool arch_topdown_check_group(bool *warn)
{
*warn = false;
@@ -1740,6 +1806,12 @@ int cmd_stat(int argc, const char **argv)
(const char **) stat_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
perf_stat__collect_metric_expr(evsel_list);
+
+ if (!topdown_check_group_member()) {
+ fprintf(stderr, "Topdown group must include slots event\n");
+ goto out;
+ }
+
perf_stat__init_shadow_stats();
if (stat_config.csv_sep) {
--
2.17.1
From: Kan Liang <[email protected]>
Bit 15 of the PERF_CAPABILITIES MSR indicates that the architecture
provides built-in support for perf METRICS. Perf METRICS is not a PEBS
feature.
Rename pebs_metrics_available to perf_metrics.
Nothing uses the bit in the current code. A following patch will use it.
Signed-off-by: Kan Liang <[email protected]>
---
No changes since V4
arch/x86/events/perf_event.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 6ebca54f86df..ecce05141f71 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -532,7 +532,7 @@ union perf_capabilities {
*/
u64 full_width_write:1;
u64 pebs_baseline:1;
- u64 pebs_metrics_available:1;
+ u64 perf_metrics:1;
u64 pebs_output_pt_available:1;
};
u64 capabilities;
--
2.17.1
On Mon, Jan 06, 2020 at 12:29:05PM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Icelake has support for measuring the level 1 TopDown metrics
> directly in hardware. This is implemented by an additional METRICS
> register, and a new Fixed Counter 3 that measures pipeline SLOTS.
>
> New in Icelake
> - Do not require generic counters. This allows to collect TopDown always
> in addition to other events.
> - Measuring TopDown per thread/process instead of only per core
>
> For the Ice Lake implementation of performance metrics, the values in
> PERF_METRICS MSR are derived from fixed counter 3. Software should start
> both registers, PERF_METRICS and fixed counter 3, from zero.
> Additionally, software is recommended to periodically clear both
> registers in order to maintain accurate measurements. The latter is
> required for certain scenarios that involve sampling metrics at high
> rates. Software should always write fixed counter 3 before write to
> PERF_METRICS.
Do we really have to support this trainwreck? This is such ill designed
hardware, I'm loath to support it, it might encourage more such
'creative' things and we really don't need that.
Hi,
On Fri, Jan 10, 2020 at 5:17 AM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Jan 06, 2020 at 12:29:05PM -0800, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > Icelake has support for measuring the level 1 TopDown metrics
> > directly in hardware. This is implemented by an additional METRICS
> > register, and a new Fixed Counter 3 that measures pipeline SLOTS.
> >
> > New in Icelake
> > - Do not require generic counters. This allows to collect TopDown always
> > in addition to other events.
> > - Measuring TopDown per thread/process instead of only per core
> >
> > For the Ice Lake implementation of performance metrics, the values in
> > PERF_METRICS MSR are derived from fixed counter 3. Software should start
> > both registers, PERF_METRICS and fixed counter 3, from zero.
> > Additionally, software is recommended to periodically clear both
> > registers in order to maintain accurate measurements. The latter is
> > required for certain scenarios that involve sampling metrics at high
> > rates. Software should always write fixed counter 3 before write to
> > PERF_METRICS.
>
> Do we really have to support this trainwreck? This is such ill designed
> hardware, I'm loath to support it, it might encourage more such
> 'creative' things and we really don't need that.
>
Yes, we do because it provides important information per hyper-thread.
I understand that the hardware is convoluted to support because it
introduces a new concept: a single counter computing multiple high
level metrics. It is difficult to abstract cleanly especially when you
add on top that it is connected with a new fixed counter (SLOTS).
The challenges I see:
- single MSR containing multiple non monotonically incrementing fields
- point of reference. Need to know when fields were zeroed to
understand on what part of the execution the topdown percentages are
computed.
- must combine with fixed counter 3 (SLOTS) to operate correctly
I see two ways of supporting this new concept.
1/ Abstract as individual events
In Kan's approach, the nature of the PERF_METRICS MSR is hidden.
He exposes the individual metrics as pseudo-events: topdown-retiring,
topdown-bad-spec, slots, ...
These events are based on the fields of the PERF_METRICS (except slots).
Given that each field is a percentage, he chose to scale them by SLOTS
to expose them as monotonically incrementing events. This makes it
easier on the perf tool.
To ensure the pseudo-events make sense, it is necessary to put them
into a single event group.
That also helps the kernel with a single WRMSR/RDMSR for all 4 metrics.
Given that the point of reference is important, any read of the group
resets the fields.
With this approach, the perf tool needs no changes, except for
recomputing the topdown percentages from the scaled counts.
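
To make the scaling concrete, here is a minimal sketch, not the code
from the patch. It assumes each PERF_METRICS field is an 8-bit fraction
where 0xff stands for 100% of SLOTS; the function names are
hypothetical.

```c
#include <stdint.h>

/*
 * Scale an 8-bit fraction (assumed: 0xff == 100%) by the SLOTS count to
 * get a monotonically increasing pseudo-event count. The division is
 * split so that the intermediate product cannot overflow 64 bits.
 */
uint64_t sketch_metric_to_count(uint64_t slots, uint8_t fraction)
{
	return (slots / 0xff) * fraction + (slots % 0xff) * fraction / 0xff;
}

/* What the tool does afterwards: turn a scaled count back into a percentage. */
double sketch_count_to_percent(uint64_t count, uint64_t slots)
{
	return slots ? 100.0 * (double)count / (double)slots : 0.0;
}
```

The round trip loses a little precision, which is one reason the
registers have to be cleared regularly.
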
2/ Abstract the multi-metric MSR
This is another approach, whereby we could export a new abstraction of
a structured counter. The kernel could publish the structure of the
counter like it does today for the structure of the config register
(/sys/devices/cpu/format). The tool would parse the format and extract
the fields from the 64-bit value of the MSR. The width and unit would
be part of the format, just like what is done for some pseudo events
already.
To program this MSR, you'd have to add a single pseudo event, e.g.,
TOPDOWN_L1. The grouping would be implicit.
The point of reference approach would be the same as the first
approach: any read would reset the counts.
The kernel would still have to handle the SLOTS counter.
This approach requires fewer changes to the kernel but more in the tool.
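
As a rough sketch of the tool side of this second approach: parse a
"name:lo-hi" description, in the style of the existing format files,
and pull that bit range out of the raw 64-bit value. The field names
and bit ranges used with it are hypothetical, since no such format is
published today, and so are the sketch_ helpers.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit range of one field inside the 64-bit counter value. */
struct sketch_field {
	unsigned int lo, hi;
};

/* Parse "name:lo-hi". Returns 0 on success, -1 on malformed input. */
int sketch_parse_format(const char *s, struct sketch_field *f)
{
	unsigned int lo, hi;

	if (sscanf(s, "%*[^:]:%u-%u", &lo, &hi) != 2)
		return -1;
	if (lo > hi || hi > 63)
		return -1;
	f->lo = lo;
	f->hi = hi;
	return 0;
}

/* Extract the described field from the raw MSR value. */
uint64_t sketch_extract(uint64_t raw, const struct sketch_field *f)
{
	unsigned int width = f->hi - f->lo + 1;
	uint64_t mask = width == 64 ? ~0ULL : (1ULL << width) - 1;

	return (raw >> f->lo) & mask;
}
```
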
If you have another approach in mind, please share it.
The PERF_METRICS hardware is very useful, we cannot really afford not
having it supported.
I am happy to help.
On Mon, Apr 20, 2020 at 09:00:56AM -0700, Stephane Eranian wrote:
> Hi,
>
> On Fri, Jan 10, 2020 at 5:17 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Jan 06, 2020 at 12:29:05PM -0800, [email protected] wrote:
> > > From: Kan Liang <[email protected]>
> > >
> > > Icelake has support for measuring the level 1 TopDown metrics
> > > directly in hardware. This is implemented by an additional METRICS
> > > register, and a new Fixed Counter 3 that measures pipeline SLOTS.
> > >
> > > New in Icelake
> > > - Do not require generic counters. This allows to collect TopDown always
> > > in addition to other events.
> > > - Measuring TopDown per thread/process instead of only per core
> > >
> > > For the Ice Lake implementation of performance metrics, the values in
> > > PERF_METRICS MSR are derived from fixed counter 3. Software should start
> > > both registers, PERF_METRICS and fixed counter 3, from zero.
> > > Additionally, software is recommended to periodically clear both
> > > registers in order to maintain accurate measurements. The latter is
> > > required for certain scenarios that involve sampling metrics at high
> > > rates. Software should always write fixed counter 3 before write to
> > > PERF_METRICS.
> >
> > Do we really have to support this trainwreck? This is such ill designed
> > hardware, I'm loath to support it, it might encourage more such
> > 'creative' things and we really don't need that.
> >
> Yes, we do because it provides important information per hyper-thread.
>
> I understand that the hardware is convoluted to support because it
> introduces a new concept: a single counter computing multiple high
> level metrics. It is difficult to abstract cleanly especially when you
> add on top that it is connected with a new fixed counter (SLOTS).
It's not a new concept, it's just completely idiotic. It didn't need to
be this crazy. There is absolutely no sane reason for it to be this
crazy.
The 4 counters in a single msr thing is insane because it uses a
division.
Very much worse, it explicitly uses the exact value of another counter
(SLOTS) to drive that division, creating a tight coupling between the
registers and completely and utterly destroying the SLOTS counter.
Since it keeps internal 'shadow' counters for the 4 events anyway, it
might as well have kept a shadow counter for the SLOTS event and driven
it off of that, that would have kept the SLOTS counter sane, but nooo,
gotta wreck that.
> That also helps the kernel with a single WRMSR/RDMSR for all 4 metrics.
I also really don't buy that as a driver for all this insanity.
Optimizing MSRs to not be utterly stupid expensive would've been so much
saner and would've helped everyone.
This is just creating more wreckage.
What I really want to know is if future hardware is going to be as
stupid; or if there's going to be change. I really don't want to commit
to ABI here and then have to find out they fixed the hardware and we
can't do sane things anymore.
Obviously, future hardware is not something that is to be discussed, so
we're at a stand-still here.