From: Kan Liang <[email protected]>
Changes since V6:
- Correct the comments regarding the pseudo umask-code of the fixed
counter.
- Add a new PERF_EV_CAP_COEXIST event capability
- Update the code regarding the nr_metric check
- Check the TopDown group in hw_config. Drop the patch for
x86_pmu.validate_group.
- The pseudo umask-code of metrics events starts from the middle of the
pseudo event space, 0x80.
Changes since V5:
- Rebase on top of tip perf/core branch
commit c085fb877467 ("perf/x86/intel/lbr: Support XSAVES for arch LBR
read")
- Add more cleanup code, e.g. ("perf/x86/intel: Use switch in
intel_pmu_disable/enable_event"). Use more macros to replace magic
numbers.
- Reorganize the patch set:
- Move all cleanup related patches to the beginning of the patch set.
- Split the TopDown metrics patch into a generic patch and an
Icelake-specific patch.
- Drop extra_reg. The TopDown metrics events and slots event don't
support any filters.
Changes since V4:
- Add a description regarding event-code naming for fixed counters
- Fix add_nr_metric_event().
For leader event, we have to take the accepted metrics events into
account.
For sibling event, it doesn't need to count accepted metrics events
again.
- Remove is_first_topdown_event_in_group().
Force slots in topdown group. Only update topdown events with slots
event.
- Re-use last_period and period_left for saved_metric and saved_slots.
Changes since V3:
- Separate fixed counter3 definition patch
- Separate BTS index patch
- Apply Peter's cleanup patch
- Fix the name of perf capabilities for perf METRICS
- Apply patch for mul_u64_u32_div() x86_64 implementation
- Fix unconditionally allowing the collection of 4 extra events
- Add patch to clean up NMI handler by naming global status bit
- Add patch to reuse event_base_rdpmc for RDPMC userspace support
Changes since V2:
- Rebase on top of v5.3-rc1
Key changes since V1:
- Remove variables for reg_idx and enabled_events[] array.
The reg_idx can be calculated from idx at runtime.
Use the existing active_mask to replace enabled_events.
- Choose value 47 for the fixed index of BTS.
- Support OVF_PERF_METRICS overflow bit in PMI handler
- Drop the caching mechanism and related variables
The new mechanism updates all active slots/metrics events for the
first slots/metrics event in a group. For each group read, the
slots/perf_metrics MSR is still read only once
- Disable PMU for read of topdown events to avoid the NMI issue
- Move RDPMC support to a separate patch
- Using event=0x00,umask=0x1X for topdown metrics events
- Drop the patch which adds a REMOVE transaction
We can detect x86_pmu_stop() by checking
(event && !test_bit(event->hw.idx, cpuc->active_mask)),
which is a good place to save the slots/metrics MSR value
Icelake has support for measuring the level 1 TopDown metrics
directly in hardware. This is implemented by an additional METRICS
register, and a new Fixed Counter 3 that measures pipeline SLOTS.
New in Icelake:
- Does not require generic counters. This allows collecting TopDown
metrics at all times, in addition to other events.
- Measures TopDown per thread/process instead of only per core.
For the Ice Lake implementation of performance metrics, the values in
PERF_METRICS MSR are derived from fixed counter 3. Software should start
both registers, PERF_METRICS and fixed counter 3, from zero.
Additionally, software is recommended to periodically clear both
registers in order to maintain accurate measurements. The latter is
required for certain scenarios that involve sampling metrics at high
rates. Software should always write fixed counter 3 before writing
PERF_METRICS.
IA32_PERF_GLOBAL_STATUS. OVF_PERF_METRICS[48]: If this bit is set,
it indicates that some PERF_METRICS-related counter has overflowed and
a PMI is triggered. Software has to synchronize, e.g. re-start,
PERF_METRICS as well as fixed counter 3. Otherwise, PERF_METRICS may
return invalid values.
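As a rough illustration only (the real handling lives in the NMI-handler
and set_period patches below), the synchronization on overflow boils
down to the following sketch, using the macros introduced later in this
series:

	/* Sketch: on OVF_PERF_METRICS, re-start SLOTS and PERF_METRICS */
	if (status & BIT_ULL(GLOBAL_STATUS_PERF_METRICS_OVF_BIT)) {
		/* Fixed counter 3 (SLOTS) must be written first ... */
		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
		/* ... and only then PERF_METRICS */
		wrmsrl(MSR_PERF_METRICS, 0);
	}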
Limitations
- To get accurate results and avoid reading the METRICS register
multiple times, the TopDown metrics events and the SLOTS event have to
be in the same group.
- The METRICS and SLOTS registers have to be cleared by SW after each
read, to prevent the loss of precision.
- Sampling read of the SLOTS and TopDown metric events is not supported.
Please refer to SDM Vol3, 18.3.9.3 "Performance Metrics" for the details
of the TopDown metrics.
Andi Kleen (2):
perf, tools, stat: Support new per thread TopDown metrics
perf, tools: Add documentation for topdown metrics
Kan Liang (12):
perf/x86: Use event_base_rdpmc for the RDPMC userspace support
perf/x86/intel: Name the global status bit in NMI handler
perf/x86/intel: Introduce the fourth fixed counter
perf/x86/intel: Move BTS index to 47
perf/x86/intel: Fix the name of perf METRICS
perf/x86/intel: Use switch in intel_pmu_disable/enable_event
perf/core: Add a new PERF_EV_CAP_COEXIST event capability
perf/x86/intel: Generic support for hardware TopDown metrics
perf/x86: Add a macro for RDPMC offset of fixed counters
perf/x86/intel: Support TopDown metrics on Ice Lake
perf/x86/intel: Support per-thread RDPMC TopDown metrics
perf, tools, stat: Check Topdown Metric group
arch/x86/events/core.c | 80 ++++--
arch/x86/events/intel/core.c | 363 ++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 52 +++-
arch/x86/include/asm/msr-index.h | 3 +
arch/x86/include/asm/perf_event.h | 97 ++++++-
include/linux/perf_event.h | 34 ++-
kernel/events/core.c | 52 +++-
tools/perf/Documentation/perf-stat.txt | 9 +-
tools/perf/Documentation/topdown.txt | 235 ++++++++++++++++
tools/perf/builtin-stat.c | 97 +++++++
tools/perf/util/stat-shadow.c | 89 ++++++
tools/perf/util/stat.c | 4 +
tools/perf/util/stat.h | 8 +
13 files changed, 1059 insertions(+), 64 deletions(-)
create mode 100644 tools/perf/Documentation/topdown.txt
--
2.17.1
From: Kan Liang <[email protected]>
The RDPMC index is always re-calculated for the RDPMC userspace support,
which is unnecessary.
The RDPMC index value is stored in the variable event_base_rdpmc for
kernel usage, which can be reused for the RDPMC userspace support.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1cbf57dc2ac8..8e108ea378e3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2208,17 +2208,12 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *m
static int x86_pmu_event_idx(struct perf_event *event)
{
- int idx = event->hw.idx;
+ struct hw_perf_event *hwc = &event->hw;
- if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
+ if (!(hwc->flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return 0;
- if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
- idx -= INTEL_PMC_IDX_FIXED;
- idx |= 1 << 30;
- }
-
- return idx + 1;
+ return hwc->event_base_rdpmc + 1;
}
static ssize_t get_attr_rdpmc(struct device *cdev,
--
2.17.1
From: Kan Liang <[email protected]>
Magic numbers are used in the current NMI handler for the global status
bits. Use meaningful names to replace the magic numbers and improve the
readability of the code.
Remove a tab from all GLOBAL_STATUS_* and INTEL_PMC_IDX_FIXED_BTS macros
to reduce the line length.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 4 ++--
arch/x86/include/asm/perf_event.h | 22 ++++++++++++----------
2 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 50963472ee85..ac1408fe1aee 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2389,7 +2389,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
/*
* PEBS overflow sets bit 62 in the global status register
*/
- if (__test_and_clear_bit(62, (unsigned long *)&status)) {
+ if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long *)&status)) {
u64 pebs_enabled = cpuc->pebs_enabled;
handled++;
@@ -2410,7 +2410,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
/*
* Intel PT
*/
- if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+ if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned long *)&status)) {
handled++;
if (unlikely(perf_guest_cbs && perf_guest_cbs->is_in_guest() &&
perf_guest_cbs->handle_intel_pt_intr))
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 0c1b13720525..fd3eba65337f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -225,16 +225,18 @@ struct x86_pmu_capability {
* values are used by actual fixed events and higher values are used
* to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
*/
-#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16)
-
-#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
-#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(62)
-#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
-#define GLOBAL_STATUS_ASIF BIT_ULL(60)
-#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
-#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
-#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
-#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
+#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16)
+
+#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
+#define GLOBAL_STATUS_BUFFER_OVF_BIT 62
+#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(GLOBAL_STATUS_BUFFER_OVF_BIT)
+#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
+#define GLOBAL_STATUS_ASIF BIT_ULL(60)
+#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
+#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
+#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
+#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55
+#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
/*
* We model guest LBR event tracing as another fixed-mode PMC like BTS.
--
2.17.1
From: Kan Liang <[email protected]>
Current perf assumes that events in a group are independent. Closing an
event doesn't impact the value of the other events in the same group.
If the closed event is a member, the other events keep running as a
group after the closure. If the closed event is the leader, the other
events keep running as singleton events.
However, the assumption no longer holds for, e.g., the TopDown slots
and metrics events. The slots and metrics events must coexist in the
same group. If the slots event is closed, the values of the metrics
events become invalid.
Add a new PERF_EV_CAP_COEXIST event capability to indicate the
relationship among group events.
If any event with the flag is detached from the group, split the group
into singleton events, and move all events, which have the flag, to the
unrecoverable ERROR state.
The leader of a PERF_EV_CAP_COEXIST group has to be updated last. Move
perf_event__header_size(leader) to the end of perf_group_detach().
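For illustration, a PMU driver opts events into this behavior roughly as
in the sketch below (the actual Ice Lake hook-up is done in a later
patch of this series; is_topdown_event() is a helper added there):

	static int example_hw_config(struct perf_event *event)
	{
		/*
		 * Events that cannot survive the removal of a sibling set
		 * PERF_EV_CAP_COEXIST; perf_group_detach() then moves them
		 * to the ERROR state when the group is split.
		 */
		if (is_topdown_event(event))
			event->event_caps |= PERF_EV_CAP_COEXIST;

		return 0;
	}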
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 5 ++++
kernel/events/core.c | 52 +++++++++++++++++++++++++++++++++++---
2 files changed, 53 insertions(+), 4 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3b22db08b6fb..93631e5389bf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -576,9 +576,14 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
* PERF_EV_CAP_SOFTWARE: Is a software event.
* PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
* from any CPU in the package where it is active.
+ * PERF_EV_CAP_COEXIST: An event with this flag must coexist with other sibling
+ * events, which have the same flag. If any event with the flag is detached
+ * from the group, split the group into singleton events, and move the events
+ * with the flag to the unrecoverable ERROR state.
*/
#define PERF_EV_CAP_SOFTWARE BIT(0)
#define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
+#define PERF_EV_CAP_COEXIST BIT(2)
#define SWEVENT_HLIST_BITS 8
#define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7c436d705fbd..e35d549a356d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2133,10 +2133,28 @@ static inline struct list_head *get_event_list(struct perf_event *event)
return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
}
+/*
+ * If the event has PERF_EV_CAP_COEXIST capability,
+ * schedule it out and move it into the ERROR state.
+ */
+static inline void perf_remove_coexist_events(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+ struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+
+ if (!(event->event_caps & PERF_EV_CAP_COEXIST))
+ return;
+
+ event_sched_out(event, cpuctx, ctx);
+ perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+}
+
static void perf_group_detach(struct perf_event *event)
{
+ struct perf_event *leader = event->group_leader;
struct perf_event *sibling, *tmp;
struct perf_event_context *ctx = event->ctx;
+ bool cap_coexist = !!(event->event_caps & PERF_EV_CAP_COEXIST);
lockdep_assert_held(&ctx->lock);
@@ -2150,15 +2168,25 @@ static void perf_group_detach(struct perf_event *event)
perf_put_aux_event(event);
+ /*
+ * If a PERF_EV_CAP_COEXIST event is detached,
+ * split the group into singleton events.
+ */
+ if (cap_coexist) {
+ event = leader;
+ goto split_group;
+ }
+
/*
* If this is a sibling, remove it from its group.
*/
- if (event->group_leader != event) {
+ if (leader != event) {
list_del_init(&event->sibling_list);
event->group_leader->nr_siblings--;
goto out;
}
+split_group:
/*
* If this was a group event with sibling events then
* upgrade the siblings to singleton events by adding them
@@ -2172,6 +2200,10 @@ static void perf_group_detach(struct perf_event *event)
/* Inherit group flags from the previous leader */
sibling->group_caps = event->group_caps;
+ /* Remove sibling PERF_EV_CAP_COEXIST event */
+ if (cap_coexist)
+ perf_remove_coexist_events(sibling);
+
if (!RB_EMPTY_NODE(&event->group_node)) {
add_event_to_groups(sibling, event->ctx);
@@ -2181,12 +2213,24 @@ static void perf_group_detach(struct perf_event *event)
WARN_ON_ONCE(sibling->ctx != event->ctx);
}
-
out:
- perf_event__header_size(event->group_leader);
- for_each_sibling_event(tmp, event->group_leader)
+ for_each_sibling_event(tmp, leader)
perf_event__header_size(tmp);
+
+ /*
+ * Change the leader of a PERF_EV_CAP_COEXIST group into
+ * a singleton event. If the leader is a PERF_EV_CAP_COEXIST
+ * event as well, remove it.
+ */
+
+ if (cap_coexist) {
+ list_del_init(&leader->sibling_list);
+ leader->group_leader->nr_siblings = 0;
+ perf_remove_coexist_events(leader);
+ }
+
+ perf_event__header_size(leader);
}
static bool is_orphaned_event(struct perf_event *event)
--
2.17.1
From: Kan Liang <[email protected]>
Bit 15 of the PERF_CAPABILITIES MSR indicates that the perf METRICS
feature is supported. The perf METRICS feature is not a PEBS feature.
Rename pebs_metrics_available to perf_metrics.
The bit is not used in the current code. It will be used in a later
patch.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/perf_event.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7b68ab5f19e7..5d453da8cfa6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -537,7 +537,7 @@ union perf_capabilities {
*/
u64 full_width_write:1;
u64 pebs_baseline:1;
- u64 pebs_metrics_available:1;
+ u64 perf_metrics:1;
u64 pebs_output_pt_available:1;
};
u64 capabilities;
--
2.17.1
From: Kan Liang <[email protected]>
Currently, if-else statements are used in intel_pmu_disable/enable_event
to check the type of an event. This works well, but with more and more
types added later, e.g., perf metrics, the if-else chain impairs the
readability of the code compared with a switch statement.
There is no harm in replacing the if-else with a switch statement here.
Also, some optimizing compilers may compile a switch statement into a
jump table, which is more efficient than if-else for a large number of
cases. The performance gain may not be observable now, because there are
only 5 cases, but the benefit may show up as more types are added in the
future.
Use switch to replace the if-else in intel_pmu_disable/enable_event.
If the idx is invalid, print a warning.
For the INTEL_PMC_IDX_FIXED_BTS case in intel_pmu_disable_event(), there
is no need to check event->attr.precise_ip. Return directly for that
case.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ac1408fe1aee..76eab8178047 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2180,17 +2180,28 @@ static void intel_pmu_disable_event(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
- if (idx < INTEL_PMC_IDX_FIXED) {
+ switch (idx) {
+ case 0 ... INTEL_PMC_IDX_FIXED - 1:
intel_clear_masks(event, idx);
x86_pmu_disable_event(event);
- } else if (idx < INTEL_PMC_IDX_FIXED_BTS) {
+ break;
+ case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
intel_clear_masks(event, idx);
intel_pmu_disable_fixed(event);
- } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
+ break;
+ case INTEL_PMC_IDX_FIXED_BTS:
intel_pmu_disable_bts();
intel_pmu_drain_bts_buffer();
- } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
+ return;
+ case INTEL_PMC_IDX_FIXED_VLBR:
intel_clear_masks(event, idx);
+ break;
+ default:
+ intel_clear_masks(event, idx);
+ pr_warn("Failed to disable the event with invalid index %d\n",
+ idx);
+ return;
+ }
/*
* Needs to be called after x86_pmu_disable_event,
@@ -2262,18 +2273,27 @@ static void intel_pmu_enable_event(struct perf_event *event)
if (unlikely(event->attr.precise_ip))
intel_pmu_pebs_enable(event);
- if (idx < INTEL_PMC_IDX_FIXED) {
+ switch (idx) {
+ case 0 ... INTEL_PMC_IDX_FIXED - 1:
intel_set_masks(event, idx);
__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
- } else if (idx < INTEL_PMC_IDX_FIXED_BTS) {
+ break;
+ case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
intel_set_masks(event, idx);
intel_pmu_enable_fixed(event);
- } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
+ break;
+ case INTEL_PMC_IDX_FIXED_BTS:
if (!__this_cpu_read(cpu_hw_events.enabled))
return;
intel_pmu_enable_bts(hwc->config);
- } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
+ break;
+ case INTEL_PMC_IDX_FIXED_VLBR:
intel_set_masks(event, idx);
+ break;
+ default:
+ pr_warn("Failed to enable the event with invalid index %d\n",
+ idx);
+ }
}
static void intel_pmu_add_event(struct perf_event *event)
--
2.17.1
From: Kan Liang <[email protected]>
Bit 48 in the PERF_GLOBAL_STATUS is used to indicate the overflow
status of the PERF_METRICS counters.
Move the BTS index to bit 47.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/include/asm/perf_event.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index fe8110a8c75b..58419e50ff98 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -238,11 +238,11 @@ struct x86_pmu_capability {
/*
* We model BTS tracing as another fixed-mode PMC.
*
- * We choose a value in the middle of the fixed event range, since lower
+ * We choose the value 47 for the fixed index of BTS, since lower
* values are used by actual fixed events and higher values are used
* to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
*/
-#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 15)
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
#define GLOBAL_STATUS_BUFFER_OVF_BIT 62
--
2.17.1
From: Kan Liang <[email protected]>
Intro
=====
The TopDown Microarchitecture Analysis (TMA) Method is a structured
analysis methodology to identify critical performance bottlenecks in
out-of-order processors. Current perf already supports the method.
The method works well, but there is one problem. To collect the TopDown
events, several GP counters have to be used. If a user wants to collect
other events at the same time, multiplexing will probably be triggered,
which impacts the accuracy.
To free up the scarce GP counters, the hardware TopDown metrics feature
is introduced starting from Ice Lake. The hardware implements an additional
"metrics" register and a new Fixed Counter 3 that measures pipeline
"slots". The TopDown events can be calculated from them instead.
Events
======
The level 1 TopDown has four metrics. There is no event-code assigned to
the TopDown metrics. Four metric events are exported as separate perf
events, which map to the internal "metrics" counter register. Those
events do not exist in hardware, but can be allocated by the scheduler.
For the event mapping, a special 0x00 event code is used, which is
reserved for fake events. The metric events start from umask 0x80.
When setting up the metric events, they point to the Fixed Counter 3.
They have to be specially handled.
- Add the update_topdown_event() callback to read the additional metrics
MSR and generate the metrics.
- Add the set_topdown_event_period() callback to initialize metrics MSR
and the fixed counter 3.
- Add a variable n_metric to track the number of accepted metrics
events. Sharing the same metric between multiple users without
multiplexing is not allowed.
- Only enable/disable the fixed counter 3 when there are no other active
TopDown events, which avoids unnecessary writes to the fixed counter
control register.
- Disable the PMU when reading the metrics event. The metrics MSR and
the fixed counter 3 are read separately. The values may be modified by
an NMI.
None of the four metric events supports sampling. Since they will be
handled specially for event update, a flag, PERF_X86_EVENT_TOPDOWN, is
introduced to indicate this case.
The slots event supports both sampling and counting.
For counting, the flag is also applied.
For sampling, it is handled normally, like other regular events.
Groups
======
The slots event is required in a Topdown group.
To avoid reading the METRICS register multiple times, the metrics and
slots values can only be updated by the slots event in a group.
All active slots and metrics events are updated at the same time.
Therefore, the slots event must come before any metric events in a
Topdown group.
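For example, with the event names exported later in this series, a valid
group on Ice Lake looks like:

	perf stat -e '{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound}' -a sleep 1

Listing a metric event before the slots event is rejected by
intel_pmu_hw_config().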
NMI
======
The METRICS-related register may overflow. If so, bit 48 of the STATUS
register is set, and PERF_METRICS and fixed counter 3 have to be reset.
The patch also updates all active slots and metrics events in the NMI
handler.
The update_topdown_event() callback has to read two registers
separately. The values may be modified by an NMI. The PMU has to be
disabled before calling the function.
RDPMC
======
RDPMC is temporarily disabled. A later patch will enable it.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 63 +++++++++++---
arch/x86/events/intel/core.c | 138 ++++++++++++++++++++++++++++--
arch/x86/events/perf_event.h | 37 ++++++++
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/perf_event.h | 47 ++++++++++
5 files changed, 271 insertions(+), 15 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8e108ea378e3..53fcf0a2b025 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -76,6 +76,9 @@ u64 x86_perf_event_update(struct perf_event *event)
if (unlikely(!hwc->event_base))
return 0;
+ if (unlikely(is_topdown_count(event)) && x86_pmu.update_topdown_event)
+ return x86_pmu.update_topdown_event(event);
+
/*
* Careful: an NMI might modify the previous event value.
*
@@ -1031,6 +1034,42 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
return unsched ? -EINVAL : 0;
}
+static int add_nr_metric_event(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ if (is_metric_event(event)) {
+ if (cpuc->n_metric == INTEL_TD_METRIC_NUM)
+ return -EINVAL;
+ cpuc->n_metric++;
+ }
+
+ return 0;
+}
+
+static void del_nr_metric_event(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ if (is_metric_event(event))
+ cpuc->n_metric--;
+}
+
+static int collect_event(struct cpu_hw_events *cpuc, struct perf_event *event,
+ int max_count, int n)
+{
+
+ if (x86_pmu.intel_cap.perf_metrics && add_nr_metric_event(cpuc, event))
+ return -EINVAL;
+
+ if (n >= max_count + cpuc->n_metric)
+ return -EINVAL;
+
+ cpuc->event_list[n] = event;
+ if (is_counter_pair(&event->hw))
+ cpuc->n_pair++;
+
+ return 0;
+}
+
/*
* dogrp: true if must collect siblings events (group)
* returns total number of events and error code
@@ -1067,28 +1106,22 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
}
if (is_x86_event(leader)) {
- if (n >= max_count)
+ if (collect_event(cpuc, leader, max_count, n))
return -EINVAL;
- cpuc->event_list[n] = leader;
n++;
- if (is_counter_pair(&leader->hw))
- cpuc->n_pair++;
}
+
if (!dogrp)
return n;
for_each_sibling_event(event, leader) {
- if (!is_x86_event(event) ||
- event->state <= PERF_EVENT_STATE_OFF)
+ if (!is_x86_event(event) || event->state <= PERF_EVENT_STATE_OFF)
continue;
- if (n >= max_count)
+ if (collect_event(cpuc, event, max_count, n))
return -EINVAL;
- cpuc->event_list[n] = event;
n++;
- if (is_counter_pair(&event->hw))
- cpuc->n_pair++;
}
return n;
}
@@ -1110,6 +1143,10 @@ static inline void x86_assign_hw_event(struct perf_event *event,
hwc->event_base = 0;
break;
+ case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
+ /* All the metric events are mapped onto the fixed counter 3. */
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ /* fall through */
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS-1:
hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
@@ -1245,6 +1282,10 @@ int x86_perf_event_set_period(struct perf_event *event)
if (unlikely(!hwc->event_base))
return 0;
+ if (unlikely(is_topdown_count(event)) &&
+ x86_pmu.set_topdown_event_period)
+ return x86_pmu.set_topdown_event_period(event);
+
/*
* If we are way outside a reasonable range then just skip forward:
*/
@@ -1529,6 +1570,8 @@ static void x86_pmu_del(struct perf_event *event, int flags)
}
cpuc->event_constraint[i-1] = NULL;
--cpuc->n_events;
+ if (x86_pmu.intel_cap.perf_metrics)
+ del_nr_metric_event(cpuc, event);
perf_event_update_userpage(event);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 76eab8178047..5655e09c6878 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2165,11 +2165,24 @@ static inline void intel_clear_masks(struct perf_event *event, int idx)
static void intel_pmu_disable_fixed(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
u64 ctrl_val, mask;
+ int idx = hwc->idx;
- mask = 0xfULL << (idx * 4);
+ if (is_topdown_idx(idx)) {
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /*
+ * When there are other active TopDown events,
+ * don't disable the fixed counter 3.
+ */
+ if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
+ return;
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ }
+
+ intel_clear_masks(event, idx);
+ mask = 0xfULL << ((idx - INTEL_PMC_IDX_FIXED) * 4);
rdmsrl(hwc->config_base, ctrl_val);
ctrl_val &= ~mask;
wrmsrl(hwc->config_base, ctrl_val);
@@ -2186,7 +2199,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
x86_pmu_disable_event(event);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
- intel_clear_masks(event, idx);
+ case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_disable_fixed(event);
break;
case INTEL_PMC_IDX_FIXED_BTS:
@@ -2219,10 +2232,26 @@ static void intel_pmu_del_event(struct perf_event *event)
intel_pmu_pebs_del(event);
}
+static void intel_pmu_read_topdown_event(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /* Only need to call update_topdown_event() once for group read. */
+ if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
+ !is_slots_event(event))
+ return;
+
+ perf_pmu_disable(event->pmu);
+ x86_pmu.update_topdown_event(event);
+ perf_pmu_enable(event->pmu);
+}
+
static void intel_pmu_read_event(struct perf_event *event)
{
if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
intel_pmu_auto_reload_read(event);
+ else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
+ intel_pmu_read_topdown_event(event);
else
x86_perf_event_update(event);
}
@@ -2230,8 +2259,22 @@ static void intel_pmu_read_event(struct perf_event *event)
static void intel_pmu_enable_fixed(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
u64 ctrl_val, mask, bits = 0;
+ int idx = hwc->idx;
+
+ if (is_topdown_idx(idx)) {
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ /*
+ * When there are other active TopDown events,
+ * don't enable the fixed counter 3 again.
+ */
+ if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
+ return;
+
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ }
+
+ intel_set_masks(event, idx);
/*
* Enable IRQ generation (0x8), if not PEBS,
@@ -2251,6 +2294,7 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
if (x86_pmu.version > 2 && hwc->config & ARCH_PERFMON_EVENTSEL_ANY)
bits |= 0x4;
+ idx -= INTEL_PMC_IDX_FIXED;
bits <<= (idx * 4);
mask = 0xfULL << (idx * 4);
@@ -2279,7 +2323,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
- intel_set_masks(event, idx);
+ case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_enable_fixed(event);
break;
case INTEL_PMC_IDX_FIXED_BTS:
@@ -2439,6 +2483,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
intel_pt_interrupt();
}
+ /*
+	 * Intel Perf metrics
+ */
+ if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
+ handled++;
+ if (x86_pmu.update_topdown_event)
+ x86_pmu.update_topdown_event(NULL);
+ }
+
/*
* Checkpointed counters can lead to 'spurious' PMIs because the
* rollback caused by the PMI will have cleared the overflow status
@@ -3375,6 +3428,72 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.type != PERF_TYPE_RAW)
return 0;
+ /*
+ * Config Topdown slots and metric events
+ *
+ * The slots event on Fixed Counter 3 can support sampling,
+ * which will be handled normally in x86_perf_event_update().
+ *
+ * The metric events don't support sampling.
+ *
+ * For counting, topdown slots and metric events will be
+ * handled specially for event update.
+ * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
+ */
+ if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
+ if (is_metric_event(event)) {
+ struct perf_event *leader = event->group_leader;
+ struct perf_event *sibling;
+
+ /* The metric events don't support sampling. */
+ if (is_sampling_event(event))
+ return -EINVAL;
+
+ /* The metric events cannot be a group leader. */
+ if (leader == event)
+ return -EINVAL;
+
+ /*
+ * The slots event cannot be the leader of a topdown
+ * sample-read group, e.g., {slots, topdown-retiring}:S
+ */
+ if (is_slots_event(leader) && is_sampling_event(leader))
+ return -EINVAL;
+
+ /*
+ * The slots event must be before the metric events,
+ * because we only update the values of a topdown
+ * group once with the slots event.
+ */
+ if (!is_slots_event(leader)) {
+ for_each_sibling_event(sibling, leader) {
+ if (is_slots_event(sibling))
+ break;
+ if (is_metric_event(sibling))
+ return -EINVAL;
+ }
+ }
+ }
+
+ if (!is_sampling_event(event)) {
+ if (event->attr.config1 != 0)
+ return -EINVAL;
+ /*
+ * The TopDown metrics events and slots event don't
+ * support any filters.
+ */
+ if (event->attr.config & X86_ALL_EVENT_FLAGS)
+ return -EINVAL;
+
+ event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
+
+ event->event_caps |= PERF_EV_CAP_COEXIST;
+
+ if (is_metric_event(event))
+ event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
+ }
+ }
+
if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
return 0;
@@ -5218,6 +5337,15 @@ __init int intel_pmu_init(void)
* counter, so do not extend mask to generic counters
*/
for_each_event_constraint(c, x86_pmu.event_constraints) {
+ /*
+ * Don't extend the topdown slots and metrics
+ * events to the generic counters.
+ */
+ if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) {
+ c->weight = hweight64(c->idxmsk64);
+ continue;
+ }
+
if (c->cmask == FIXED_EVENT_FLAGS
&& c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES) {
c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5d453da8cfa6..dea009ce0e4a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -79,6 +79,31 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
#define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */
#define PERF_X86_EVENT_LBR_SELECT 0x2000 /* Save/Restore MSR_LBR_SELECT */
+#define PERF_X86_EVENT_TOPDOWN 0x4000 /* Count Topdown slots/metrics events */
+
+static inline bool is_topdown_count(struct perf_event *event)
+{
+ return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
+}
+
+static inline bool is_metric_event(struct perf_event *event)
+{
+ u64 config = event->attr.config;
+
+ return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
+ ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) &&
+ ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_MAX);
+}
+
+static inline bool is_slots_event(struct perf_event *event)
+{
+ return (event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_TD_SLOTS;
+}
+
+static inline bool is_topdown_event(struct perf_event *event)
+{
+ return is_metric_event(event) || is_slots_event(event);
+}
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -284,6 +309,12 @@ struct cpu_hw_events {
*/
u64 tfa_shadow;
+ /*
+ * Perf Metrics
+ */
+ /* number of accepted metrics events */
+ int n_metric;
+
/*
* AMD specific bits
*/
@@ -726,6 +757,12 @@ struct x86_pmu {
*/
atomic_t lbr_exclusive[x86_lbr_exclusive_max];
+ /*
+ * Intel perf metrics
+ */
+ u64 (*update_topdown_event)(struct perf_event *event);
+ int (*set_topdown_event_period)(struct perf_event *event);
+
/*
* perf task context (i.e. struct perf_event_context::task_ctx_data)
* switch helper to bridge calls from perf/core to perf/x86.
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index bdc07fc6e517..d8c6cb82866d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -853,6 +853,7 @@
#define MSR_CORE_PERF_FIXED_CTR0 0x00000309
#define MSR_CORE_PERF_FIXED_CTR1 0x0000030a
#define MSR_CORE_PERF_FIXED_CTR2 0x0000030b
+#define MSR_CORE_PERF_FIXED_CTR3 0x0000030c
#define MSR_CORE_PERF_FIXED_CTR_CTRL 0x0000038d
#define MSR_CORE_PERF_GLOBAL_STATUS 0x0000038e
#define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 58419e50ff98..000cab7818b5 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -244,6 +244,52 @@ struct x86_pmu_capability {
*/
#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 15)
+/*
+ * The PERF_METRICS MSR is modeled as several magic fixed-mode PMCs, one for
+ * each TopDown metric event.
+ *
+ * Internally the TopDown metric events are mapped to the FxCtr 3 (SLOTS).
+ */
+#define INTEL_PMC_IDX_METRIC_BASE (INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_TD_RETIRING (INTEL_PMC_IDX_METRIC_BASE + 0)
+#define INTEL_PMC_IDX_TD_BAD_SPEC (INTEL_PMC_IDX_METRIC_BASE + 1)
+#define INTEL_PMC_IDX_TD_FE_BOUND (INTEL_PMC_IDX_METRIC_BASE + 2)
+#define INTEL_PMC_IDX_TD_BE_BOUND (INTEL_PMC_IDX_METRIC_BASE + 3)
+#define INTEL_PMC_IDX_METRIC_END INTEL_PMC_IDX_TD_BE_BOUND
+#define INTEL_PMC_MSK_TOPDOWN ((0xfull << INTEL_PMC_IDX_METRIC_BASE) | \
+ INTEL_PMC_MSK_FIXED_SLOTS)
+
+/*
+ * There is no event-code assigned to the TopDown events.
+ *
+ * For the slots event, use the pseudo code of the fixed counter 3.
+ *
+ * For the metric events, the pseudo event-code is 0x00.
+ * The pseudo umask-code starts from the middle of the pseudo event
+ * space, 0x80.
+ */
+#define INTEL_TD_SLOTS 0x0400 /* TOPDOWN.SLOTS */
+/* Level 1 metrics */
+#define INTEL_TD_METRIC_RETIRING 0x8000 /* Retiring metric */
+#define INTEL_TD_METRIC_BAD_SPEC 0x8100 /* Bad speculation metric */
+#define INTEL_TD_METRIC_FE_BOUND 0x8200 /* FE bound metric */
+#define INTEL_TD_METRIC_BE_BOUND 0x8300 /* BE bound metric */
+#define INTEL_TD_METRIC_MAX INTEL_TD_METRIC_BE_BOUND
+#define INTEL_TD_METRIC_NUM 4
+
+static inline bool is_metric_idx(int idx)
+{
+ return (unsigned)(idx - INTEL_PMC_IDX_METRIC_BASE) < INTEL_TD_METRIC_NUM;
+}
+
+static inline bool is_topdown_idx(int idx)
+{
+ return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS;
+}
+
+#define INTEL_PMC_OTHER_TOPDOWN_BITS(bit) \
+ (~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN)
+
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
#define GLOBAL_STATUS_BUFFER_OVF_BIT 62
#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(GLOBAL_STATUS_BUFFER_OVF_BIT)
@@ -254,6 +300,7 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
+#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
/*
* We model guest LBR event tracing as another fixed-mode PMC like BTS.
--
2.17.1
From: Kan Liang <[email protected]>
Ice Lake supports the hardware TopDown metrics feature, which can free
up the scarce GP counters.
Update the event constraints for the metrics events. The metric
counters do not actually exist; they are mapped to a dummy offset.
Sharing the same metric between multiple users without multiplexing is
not allowed.
Implement set_topdown_event_period for Ice Lake. The values in
PERF_METRICS MSR are derived from the fixed counter 3. Both registers
should start from zero.
Implement update_topdown_event for Ice Lake. The metric is reported by
multiplying the metric (fraction) with slots. To maintain accurate
measurements, both registers are cleared on each update. The fixed
counter 3 must always be cleared before PERF_METRICS.
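As an illustrative example (with made-up numbers): if fixed counter 3
reads 1,000,000 slots and the retiring byte of PERF_METRICS is 0x40, the
retiring event is credited with mul_u64_u32_div(1000000, 0x40, 0xff),
i.e. roughly 251,000 slots, or about 25% of the total. Since the four
metric bytes always sum to 0xff, the four metric events together account
for all counted slots.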
Implement td_attr for the new metrics events and the new slots fixed
counter. Make them visible to the perf user tools.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 118 ++++++++++++++++++++++++++++++
arch/x86/events/perf_event.h | 13 ++++
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/include/asm/perf_event.h | 2 +
4 files changed, 135 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 5655e09c6878..e7d6c3b79772 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -247,6 +247,10 @@ static struct event_constraint intel_icl_event_constraints[] = {
FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */
+ METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_RETIRING, 0),
+ METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BAD_SPEC, 1),
+ METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_FE_BOUND, 2),
+ METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BE_BOUND, 3),
INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */
@@ -309,6 +313,12 @@ EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
"4", "2");
+EVENT_ATTR_STR(slots, slots, "event=0x00,umask=0x4");
+EVENT_ATTR_STR(topdown-retiring, td_retiring, "event=0x00,umask=0x80");
+EVENT_ATTR_STR(topdown-bad-spec, td_bad_spec, "event=0x00,umask=0x81");
+EVENT_ATTR_STR(topdown-fe-bound, td_fe_bound, "event=0x00,umask=0x82");
+EVENT_ATTR_STR(topdown-be-bound, td_be_bound, "event=0x00,umask=0x83");
+
static struct attribute *snb_events_attrs[] = {
EVENT_PTR(td_slots_issued),
EVENT_PTR(td_slots_retired),
@@ -2232,6 +2242,99 @@ static void intel_pmu_del_event(struct perf_event *event)
intel_pmu_pebs_del(event);
}
+static int icl_set_topdown_event_period(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ s64 left = local64_read(&hwc->period_left);
+
+ /*
+ * The values in PERF_METRICS MSR are derived from fixed counter 3.
+ * Software should start both registers, PERF_METRICS and fixed
+ * counter 3, from zero.
+ * Clear PERF_METRICS and Fixed counter 3 in initialization.
+ * After that, both MSRs will be cleared for each read.
+ * Don't need to clear them again.
+ */
+ if (left == x86_pmu.max_period) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+ local64_set(&hwc->period_left, 0);
+ }
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static inline u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
+{
+ u32 val;
+
+ /*
+ * The metric is reported as an 8bit integer fraction
+	 * summing up to 0xff.
+ * slots-in-metric = (Metric / 0xff) * slots
+ */
+ val = (metric >> ((idx - INTEL_PMC_IDX_METRIC_BASE) * 8)) & 0xff;
+ return mul_u64_u32_div(slots, val, 0xff);
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+ u64 slots, u64 metrics)
+{
+ int idx = event->hw.idx;
+ u64 delta;
+
+ if (is_metric_idx(idx))
+ delta = icl_get_metrics_event_value(metrics, slots, idx);
+ else
+ delta = slots;
+
+ local64_add(delta, &event->count);
+}
+
+/*
+ * Update all active Topdown events.
+ *
+ * The PERF_METRICS and Fixed counter 3 are read separately. The values may be
+ * modified by an NMI. The PMU has to be disabled before calling this function.
+ */
+static u64 icl_update_topdown_event(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *other;
+ u64 slots, metrics;
+ int idx;
+
+ /* read Fixed counter 3 */
+ rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots);
+ if (!slots)
+ return 0;
+
+ /* read PERF_METRICS */
+ rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
+
+ for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+ if (!is_topdown_idx(idx))
+ continue;
+ other = cpuc->events[idx];
+ __icl_update_topdown_event(other, slots, metrics);
+ }
+
+ /*
+ * Check and update this event, which may have been cleared
+ * in active_mask e.g. x86_pmu_stop()
+ */
+ if (event && !test_bit(event->hw.idx, cpuc->active_mask))
+ __icl_update_topdown_event(event, slots, metrics);
+
+ /* The fixed counter 3 has to be written before the PERF_METRICS. */
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+
+ return slots;
+}
+
static void intel_pmu_read_topdown_event(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -4494,6 +4597,15 @@ static struct attribute *icl_events_attrs[] = {
NULL,
};
+static struct attribute *icl_td_events_attrs[] = {
+ EVENT_PTR(slots),
+ EVENT_PTR(td_retiring),
+ EVENT_PTR(td_bad_spec),
+ EVENT_PTR(td_fe_bound),
+ EVENT_PTR(td_be_bound),
+ NULL,
+};
+
static struct attribute *icl_tsx_events_attrs[] = {
EVENT_PTR(tx_start),
EVENT_PTR(tx_abort),
@@ -5278,10 +5390,13 @@ __init int intel_pmu_init(void)
hsw_format_attr : nhm_format_attr;
extra_skl_attr = skl_format_attr;
mem_attr = icl_events_attrs;
+ td_attr = icl_td_events_attrs;
tsx_attr = icl_tsx_events_attrs;
x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xca, .umask=0x02);
x86_pmu.lbr_pt_coexist = true;
intel_pmu_pebs_data_source_skl(pmem);
+ x86_pmu.update_topdown_event = icl_update_topdown_event;
+ x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
pr_cont("Icelake events, ");
name = "icelake";
break;
@@ -5401,6 +5516,9 @@ __init int intel_pmu_init(void)
if (x86_pmu.counter_freezing)
x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
+ if (x86_pmu.intel_cap.perf_metrics)
+ x86_pmu.intel_ctrl |= 1ULL << GLOBAL_CTRL_EN_PERF_METRICS;
+
return 0;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index dea009ce0e4a..345442410a4d 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -406,6 +406,19 @@ struct cpu_hw_events {
#define FIXED_EVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, (1ULL << (32+n)), FIXED_EVENT_FLAGS)
+/*
+ * The special metric counters do not actually exist. They are calculated from
+ * the combination of the FxCtr3 + MSR_PERF_METRICS.
+ *
+ * The special metric counters are mapped to a dummy offset for the scheduler.
+ * The sharing between multiple users of the same metric without multiplexing
+ * is not allowed, even though the hardware supports that in principle.
+ */
+
+#define METRIC_EVENT_CONSTRAINT(c, n) \
+ EVENT_CONSTRAINT(c, (1ULL << (INTEL_PMC_IDX_METRIC_BASE + n)), \
+ INTEL_ARCH_EVENT_MASK)
+
/*
* Constraint on the Event code + UMask
*/
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d8c6cb82866d..325cfe5dd95a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -859,6 +859,8 @@
#define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL 0x00000390
+#define MSR_PERF_METRICS 0x00000329
+
/* PERF_GLOBAL_OVF_CTL bits */
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT 55
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI (1ULL << MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 964ba312c249..e20aa588dc47 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -198,6 +198,7 @@ struct x86_pmu_capability {
/* RDPMC offset for Fixed PMCs */
#define INTEL_PMC_FIXED_RDPMC_BASE (1 << 30)
+#define INTEL_PMC_FIXED_RDPMC_METRICS (1 << 29)
/*
* All the fixed-mode PMCs are configured via this single MSR:
@@ -305,6 +306,7 @@ static inline bool is_topdown_idx(int idx)
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
+#define GLOBAL_CTRL_EN_PERF_METRICS 48
/*
* We model guest LBR event tracing as another fixed-mode PMC like BTS.
*
--
2.17.1
From: Andi Kleen <[email protected]>
Add some documentation on how to use the topdown metrics in ring 3.
Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/Documentation/topdown.txt | 235 +++++++++++++++++++++++++++
1 file changed, 235 insertions(+)
create mode 100644 tools/perf/Documentation/topdown.txt
diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt
new file mode 100644
index 000000000000..e724d2af3b8d
--- /dev/null
+++ b/tools/perf/Documentation/topdown.txt
@@ -0,0 +1,235 @@
+Using TopDown metrics in user space
+-----------------------------------
+
+Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
+methodology to break down CPU pipeline execution into 4 bottlenecks:
+frontend bound, backend bound, bad speculation, retiring.
+
+For more details on Topdown see [1][5]
+
+Traditionally this was implemented by events in generic counters
+and specific formulas to compute the bottlenecks.
+
+perf stat --topdown implements this.
+
+Full Top Down includes more levels that can break down the
+bottlenecks further. This is not directly implemented in perf,
+but available in other tools that can run on top of perf,
+such as toplev[2] or vtune[3]
+
+New Topdown features in Ice Lake
+===============================
+
+With Ice Lake CPUs the TopDown metrics are directly available as
+fixed counters and do not require generic counters. This allows
+collecting TopDown metrics at all times, in addition to other events.
+
+% perf stat -a --topdown -I1000
+# time counts unit events
+ 1.000854735 20,097,158,100 slots
+ 1.000854735 79,327,616 topdown-retiring # 0.4% retiring
+ 1.000854735 157,932,715 topdown-bad-spec # 0.8% bad speculation
+ 1.000854735 81,610,855 topdown-fe-bound # 0.4% frontend bound
+ 1.000854735 19,778,286,903 topdown-be-bound # 98.4% backend bound
+ 2.003623823 20,010,908,365 slots
+ 2.003623823 79,905,340 topdown-retiring # 0.4% retiring
+ 2.003623823 158,405,024 topdown-bad-spec # 0.8% bad speculation
+ 2.003623823 87,980,097 topdown-fe-bound # 0.4% frontend bound
+ 2.003623823 19,684,617,888 topdown-be-bound # 98.4% backend bound
+ 3.005828889 20,062,101,220 slots
+ 3.005828889 80,077,032 topdown-retiring # 0.4% retiring
+ 3.005828889 158,682,921 topdown-bad-spec # 0.8% bad speculation
+ 3.005828889 86,579,604 topdown-fe-bound # 0.4% frontend bound
+ 3.005828889 19,736,761,649 topdown-be-bound # 98.4% backend bound
+...
+
+This also enables measuring TopDown per thread/process instead
+of only per core.
+
+Using TopDown through RDPMC in applications on Ice Lake
+======================================================
+
+For more fine-grained measurements it can be useful to
+access the new counters directly from user space. This is more complicated,
+but drastically lowers overhead.
+
+On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
+"pipeline SLOTS" (cycles multiplied by core issue width) and a
+metric register that reports slots ratios for the different bottleneck
+categories.
+
+The metrics counter is CPU model specific and is not available
+on older CPUs.
+
+Example code
+============
+
+Library functions implementing the functionality described below
+are also available in libjevents [4].
+
+The application opens a perf_event file descriptor
+and sets up fixed counter 3 (SLOTS) to start and
+allow user programs to read the performance counters.
+
+Fixed counter 3 is mapped to the pseudo event event=0x00, umask=0x4,
+so the perf_event_attr structure should be initialized with
+{ .config = 0x0400, .type = PERF_TYPE_RAW }
+
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/* Provide own perf_event_open stub because glibc doesn't */
+__attribute__((weak))
+int perf_event_open(struct perf_event_attr *attr, pid_t pid,
+ int cpu, int group_fd, unsigned long flags)
+{
+ return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
+}
+
+/* open slots counter file descriptor for current task */
+struct perf_event_attr slots = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .config = 0x400,
+ .exclude_kernel = 1,
+};
+
+int fd = perf_event_open(&slots, 0, -1, -1, 0);
+if (fd < 0)
+ ... error ...
+
+The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
+to read slots and the topdown metrics at different points of the program:
+
+#include <stdint.h>
+#include <x86intrin.h>
+
+#define RDPMC_FIXED (1 << 30) /* return fixed counters */
+#define RDPMC_METRIC (1 << 29) /* return metric counters */
+
+#define FIXED_COUNTER_SLOTS 3
+#define METRIC_COUNTER_TOPDOWN_L1 0
+
+static inline uint64_t read_slots(void)
+{
+ return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
+}
+
+static inline uint64_t read_metrics(void)
+{
+ return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
+}
+
+Then the program can be instrumented to read these metrics at different
+points.
+
+It's not a good idea to do this with too short code regions,
+as the parallelism and overlap in the CPU program execution will
+cause too much measurement inaccuracy. For example instrumenting
+individual basic blocks is definitely too fine grained.
+
+Decoding metrics values
+=======================
+
+The value reported by read_metrics() contains four 8 bit fields,
+each representing a scaled ratio for one Level 1 bottleneck.
+All four fields add up to 0xff (= 100%).
+
+The binary ratios in the metric value can be converted to float ratios:
+
+#define GET_METRIC(m, i) (((m) >> (i*8)) & 0xff)
+
+#define TOPDOWN_RETIRING(val) ((float)GET_METRIC(val, 0) / 0xff)
+#define TOPDOWN_BAD_SPEC(val) ((float)GET_METRIC(val, 1) / 0xff)
+#define TOPDOWN_FE_BOUND(val) ((float)GET_METRIC(val, 2) / 0xff)
+#define TOPDOWN_BE_BOUND(val) ((float)GET_METRIC(val, 3) / 0xff)
+
+and then converted to percent for printing.
+
+The ratios in the metric accumulate for the time when the counter
+is enabled. For measuring programs it is often useful to measure
+specific sections. For this it is necessary to compute deltas on metrics.
+
+This can be done by scaling the metrics with the slots counter
+read at the same time.
+
+Then it's possible to take deltas of these slots counts
+measured at different points, and determine the metrics
+for that time period.
+
+ slots_a = read_slots();
+ metric_a = read_metrics();
+
+ ... larger code region ...
+
+ slots_b = read_slots()
+ metric_b = read_metrics()
+
+ # compute scaled metrics for measurement a
+ retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a
+ bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a
+ fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a
+ be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a
+
+ # compute delta scaled metrics between b and a
+ retiring_slots = GET_METRIC(metric_b, 0) * slots_b - retiring_slots_a
+ bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b - bad_spec_slots_a
+ fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a
+ be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a
+
+Later the individual ratios for the measurement period can be recreated
+from these counts.
+
+ slots_delta = slots_b - slots_a
+ retiring_ratio = (float)retiring_slots / slots_delta
+ bad_spec_ratio = (float)bad_spec_slots / slots_delta
+ fe_bound_ratio = (float)fe_bound_slots / slots_delta
+ be_bound_ratio = (float)be_bound_slots / slots_delta
+
+ printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
+ retiring_ratio * 100.,
+ bad_spec_ratio * 100.,
+ fe_bound_ratio * 100.,
+ be_bound_ratio * 100.);
+
+Resetting metrics counters
+==========================
+
+Since the individual metrics are only 8bit they lose precision for
+short regions over time because the number of cycles covered by each
+fraction bit shrinks. So the counters need to be reset regularly.
+
+When using the kernel perf API the kernel resets on every read.
+So as long as the reading is at reasonable intervals (every few
+seconds) the precision is good.
+
+When using perf stat it is recommended to always use the -I option,
+with an interval no longer than a few seconds
+
+ perf stat -I 1000 --topdown ...
+
+For user programs using RDPMC directly the counter can
+be reset explicitly using ioctl:
+
+ ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);
+
+This "opens" a new measurement period.
+
+A program using RDPMC for TopDown should schedule such a reset
+regularly, as in every few seconds.
+
+Limits on Ice Lake
+==================
+
+All the TopDown events must be in a group with the SLOTS event.
+
+There is no sampling support for TopDown events.
+Sampling read of SLOTS and TopDown events is forbidden.
+For example, perf record -e '{slots, topdown-retiring}:S' is not allowed.
+
+[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
+[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
+[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
+[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
+[5] https://sites.google.com/site/analysismethods/yasin-pubs
--
2.17.1
From: Kan Liang <[email protected]>
The slots event is required in a Topdown Metric group.
Add a check to examine a Topdown Metric group. Error out if no slots
event is detected in the group.
Only check the group on platforms that use topdown_metric_attrs,
e.g. Ice Lake.
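For example (illustrative), a metrics-only group such as

	perf stat -e '{topdown-retiring,topdown-be-bound}' -a sleep 1

is now rejected with "Topdown group must include slots event", while
adding the slots event to the group makes it valid.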
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-stat.c | 72 +++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 078ea1485317..9886aef78440 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1441,6 +1441,72 @@ static int topdown_filter_events(const char **attr, char **str, bool use_group)
return 0;
}
+/* Event encoding for Topdown Metric events */
+#define TOPDOWN_SLOTS 0x0400
+#define TOPDOWN_RETIRE 0x8000
+#define TOPDOWN_BAD_SPEC 0x8100
+#define TOPDOWN_FE_BOUND 0x8200
+#define TOPDOWN_BE_BOUND 0x8300
+
+static bool is_topdown_metric_event(struct evsel *counter)
+{
+ if (!counter->pmu_name)
+ return false;
+
+ if (strcmp(counter->pmu_name, "cpu"))
+ return false;
+
+ if ((counter->core.attr.config == TOPDOWN_RETIRE) ||
+ (counter->core.attr.config == TOPDOWN_BAD_SPEC) ||
+ (counter->core.attr.config == TOPDOWN_FE_BOUND) ||
+ (counter->core.attr.config == TOPDOWN_BE_BOUND))
+ return true;
+
+ return false;
+}
+
+static bool is_topdown_slots_event(struct evsel *counter)
+{
+ if (!counter->pmu_name)
+ return false;
+
+ if (strcmp(counter->pmu_name, "cpu"))
+ return false;
+
+ if (counter->core.attr.config == TOPDOWN_SLOTS)
+ return true;
+
+ return false;
+}
+
+static bool topdown_check_group_member(void)
+{
+ struct evsel *counter, *leader, *member;
+ bool has_slots;
+
+ if (!pmu_have_event("cpu", topdown_metric_attrs[0]))
+ return true;
+
+ evlist__for_each_entry(evsel_list, counter) {
+ if (!is_topdown_metric_event(counter))
+ continue;
+
+ leader = counter->leader;
+ has_slots = false;
+
+ for_each_group_evsel(member, leader) {
+ if (is_topdown_slots_event(member))
+ has_slots = true;
+ counter = member;
+ }
+
+ if (!has_slots)
+ return false;
+ }
+
+ return true;
+}
+
__weak bool arch_topdown_check_group(bool *warn)
{
*warn = false;
@@ -2024,6 +2090,12 @@ int cmd_stat(int argc, const char **argv)
(const char **) stat_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
perf_stat__collect_metric_expr(evsel_list);
+
+ if (!topdown_check_group_member()) {
+ fprintf(stderr, "Topdown group must include slots event\n");
+ goto out;
+ }
+
perf_stat__init_shadow_stats();
if (stat_config.csv_sep) {
--
2.17.1
From: Kan Liang <[email protected]>
Starting from Ice Lake, the TopDown metrics are directly available as
fixed counters and do not require generic counters. Also, the TopDown
metrics can be collected per thread. Extend the RDPMC usage to support
per-thread TopDown metrics.
The RDPMC index of PERF_METRICS is returned when RDPMC users ask for
the RDPMC index of a metrics event.
To support per-thread RDPMC TopDown, the metrics and slots counters have
to be saved/restored during context switching.
The last_period and period_left fields are not used in counting mode.
Reuse them for saved_metric and saved_slots.
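As an illustrative aside (not part of the patch), a user-space consumer
of the exported index could look roughly like the sketch below. It
assumes "fd" is a perf_event_open() fd for a metrics event bound to the
current thread with RDPMC allowed; a complete reader would also use the
pc->lock sequence count and pc->offset.

  #include <stdint.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <linux/perf_event.h>

  static inline uint64_t rdpmc(uint32_t counter)
  {
          uint32_t low, high;

          asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
          return low | ((uint64_t)high) << 32;
  }

  /* Read the raw PERF_METRICS value for a metrics event, or 0. */
  static uint64_t read_metrics(int fd)
  {
          size_t page = sysconf(_SC_PAGESIZE);
          struct perf_event_mmap_page *pc;
          uint64_t val = 0;

          pc = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
          if (pc == MAP_FAILED)
                  return 0;

          /* pc->index is the RDPMC index + 1; 0 means RDPMC is unusable. */
          if (pc->cap_user_rdpmc && pc->index)
                  val = rdpmc(pc->index - 1);

          munmap(pc, page);
          return val;
  }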
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 5 +-
arch/x86/events/intel/core.c | 91 +++++++++++++++++++++++++++++++-----
include/linux/perf_event.h | 29 ++++++++----
3 files changed, 102 insertions(+), 23 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ebf723f33794..0f3d01562ded 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2257,7 +2257,10 @@ static int x86_pmu_event_idx(struct perf_event *event)
if (!(hwc->flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return 0;
- return hwc->event_base_rdpmc + 1;
+ if (is_metric_idx(hwc->idx))
+ return INTEL_PMC_FIXED_RDPMC_METRICS + 1;
+ else
+ return hwc->event_base_rdpmc + 1;
}
static ssize_t get_attr_rdpmc(struct device *cdev,
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e7d6c3b79772..b0ab638e48ee 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2258,7 +2258,13 @@ static int icl_set_topdown_event_period(struct perf_event *event)
if (left == x86_pmu.max_period) {
wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
wrmsrl(MSR_PERF_METRICS, 0);
- local64_set(&hwc->period_left, 0);
+ hwc->saved_slots = 0;
+ hwc->saved_metric = 0;
+ }
+
+ if ((hwc->saved_slots) && is_slots_event(event)) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, hwc->saved_slots);
+ wrmsrl(MSR_PERF_METRICS, hwc->saved_metric);
}
perf_event_update_userpage(event);
@@ -2279,7 +2285,7 @@ static inline u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
return mul_u64_u32_div(slots, val, 0xff);
}
-static void __icl_update_topdown_event(struct perf_event *event,
+static u64 icl_get_topdown_value(struct perf_event *event,
u64 slots, u64 metrics)
{
int idx = event->hw.idx;
@@ -2290,7 +2296,50 @@ static void __icl_update_topdown_event(struct perf_event *event,
else
delta = slots;
- local64_add(delta, &event->count);
+ return delta;
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+ u64 slots, u64 metrics,
+ u64 last_slots, u64 last_metrics)
+{
+ u64 delta, last = 0;
+
+ delta = icl_get_topdown_value(event, slots, metrics);
+ if (last_slots)
+ last = icl_get_topdown_value(event, last_slots, last_metrics);
+
+ /*
+ * The 8-bit integer fraction of a metric may not be accurate,
+ * especially when the change is very small.
+ * For example, if only a few bad_spec events happen, the fraction
+ * may be reduced from 1 to 0. If so, the bad_spec event value
+ * will be 0, which is definitely less than the last value.
+ * Avoid updating event->count for this case.
+ */
+ if (delta > last) {
+ delta -= last;
+ local64_add(delta, &event->count);
+ }
+}
+
+static void update_saved_topdown_regs(struct perf_event *event,
+ u64 slots, u64 metrics)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *other;
+ int idx;
+
+ event->hw.saved_slots = slots;
+ event->hw.saved_metric = metrics;
+
+ for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+ if (!is_topdown_idx(idx))
+ continue;
+ other = cpuc->events[idx];
+ other->hw.saved_slots = slots;
+ other->hw.saved_metric = metrics;
+ }
}
/*
@@ -2304,6 +2353,7 @@ static u64 icl_update_topdown_event(struct perf_event *event)
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_event *other;
u64 slots, metrics;
+ bool reset = true;
int idx;
/* read Fixed counter 3 */
@@ -2318,19 +2368,39 @@ static u64 icl_update_topdown_event(struct perf_event *event)
if (!is_topdown_idx(idx))
continue;
other = cpuc->events[idx];
- __icl_update_topdown_event(other, slots, metrics);
+ __icl_update_topdown_event(other, slots, metrics,
+ event ? event->hw.saved_slots : 0,
+ event ? event->hw.saved_metric : 0);
}
/*
* Check and update this event, which may have been cleared
* in active_mask e.g. x86_pmu_stop()
*/
- if (event && !test_bit(event->hw.idx, cpuc->active_mask))
- __icl_update_topdown_event(event, slots, metrics);
+ if (event && !test_bit(event->hw.idx, cpuc->active_mask)) {
+ __icl_update_topdown_event(event, slots, metrics,
+ event->hw.saved_slots,
+ event->hw.saved_metric);
- /* The fixed counter 3 has to be written before the PERF_METRICS. */
- wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
- wrmsrl(MSR_PERF_METRICS, 0);
+ /*
+ * In x86_pmu_stop(), the event is cleared in active_mask first,
+ * then the delta is drained, which indicates a context switch for
+ * counting.
+ * Save the metric and slots values for the context switch.
+ * There is no need to reset PERF_METRICS and fixed counter 3,
+ * because the values will be restored on the next schedule in.
+ */
+ update_saved_topdown_regs(event, slots, metrics);
+ reset = false;
+ }
+
+ if (reset) {
+ /* The fixed counter 3 has to be written before the PERF_METRICS. */
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+ if (event)
+ update_saved_topdown_regs(event, 0, 0);
+ }
return slots;
}
@@ -3591,9 +3661,6 @@ static int intel_pmu_hw_config(struct perf_event *event)
event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
event->event_caps |= PERF_EV_CAP_COEXIST;
-
- if (is_metric_event(event))
- event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
}
}
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 93631e5389bf..aa60a1381aa1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -212,17 +212,26 @@ struct hw_perf_event {
*/
u64 sample_period;
- /*
- * The period we started this sample with.
- */
- u64 last_period;
+ union {
+ struct { /* Sampling */
+ /*
+ * The period we started this sample with.
+ */
+ u64 last_period;
- /*
- * However much is left of the current period; note that this is
- * a full 64bit value and allows for generation of periods longer
- * than hardware might allow.
- */
- local64_t period_left;
+ /*
+ * However much is left of the current period;
+ * note that this is a full 64bit value and
+ * allows for generation of periods longer
+ * than hardware might allow.
+ */
+ local64_t period_left;
+ };
+ struct { /* Topdown events counting for context switch */
+ u64 saved_metric;
+ u64 saved_slots;
+ };
+ };
/*
* State for throttling the event, see __perf_event_overflow() and
--
2.17.1
From: Kan Liang <[email protected]>
The RDPMC base offset of fixed counters is hard-coded. Use a meaningful
name to replace the magic number to improve the readability of the code.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 3 ++-
arch/x86/include/asm/perf_event.h | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 53fcf0a2b025..ebf723f33794 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1151,7 +1151,8 @@ static inline void x86_assign_hw_event(struct perf_event *event,
hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
(idx - INTEL_PMC_IDX_FIXED);
- hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30;
+ hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) |
+ INTEL_PMC_FIXED_RDPMC_BASE;
break;
default:
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 000cab7818b5..964ba312c249 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -196,6 +196,9 @@ struct x86_pmu_capability {
* Fixed-purpose performance events:
*/
+/* RDPMC offset for Fixed PMCs */
+#define INTEL_PMC_FIXED_RDPMC_BASE (1 << 30)
+
/*
* All the fixed-mode PMCs are configured via this single MSR:
*/
--
2.17.1
From: Kan Liang <[email protected]>
The fourth fixed counter, TOPDOWN.SLOTS, is introduced in Ice Lake to
measure the level 1 TopDown events.
Add MSR address and macros for the new fixed counter, which will be used
in a later patch.
Add comments to explain the event encoding rules for the fixed counters.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/include/asm/perf_event.h | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index fd3eba65337f..fe8110a8c75b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -197,12 +197,24 @@ struct x86_pmu_capability {
*/
/*
- * All 3 fixed-mode PMCs are configured via this single MSR:
+ * All the fixed-mode PMCs are configured via this single MSR:
*/
#define MSR_ARCH_PERFMON_FIXED_CTR_CTRL 0x38d
/*
- * The counts are available in three separate MSRs:
+ * There is no event-code assigned to the fixed-mode PMCs.
+ *
+ * For a fixed-mode PMC, which has an equivalent event on a general-purpose
+ * PMC, the event-code of the equivalent event is used for the fixed-mode PMC,
+ * e.g., Instr_Retired.Any and CPU_CLK_Unhalted.Core.
+ *
+ * For a fixed-mode PMC, which doesn't have an equivalent event, a
+ * pseudo-encoding is used, e.g., CPU_CLK_Unhalted.Ref and TOPDOWN.SLOTS.
+ * The pseudo event-code for a fixed-mode PMC must be 0x00.
+ * The pseudo umask-code is 0xX. The X equals the index of the fixed
+ * counter + 1, e.g., the fixed counter 2 has the pseudo-encoding 0x0300.
+ *
+ * The counts are available in separate MSRs:
*/
/* Instr_Retired.Any: */
@@ -213,11 +225,16 @@ struct x86_pmu_capability {
#define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
#define INTEL_PMC_IDX_FIXED_CPU_CYCLES (INTEL_PMC_IDX_FIXED + 1)
-/* CPU_CLK_Unhalted.Ref: */
+/* CPU_CLK_Unhalted.Ref: event=0x00,umask=0x3 (pseudo-encoding) */
#define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
#define INTEL_PMC_IDX_FIXED_REF_CYCLES (INTEL_PMC_IDX_FIXED + 2)
#define INTEL_PMC_MSK_FIXED_REF_CYCLES (1ULL << INTEL_PMC_IDX_FIXED_REF_CYCLES)
+/* TOPDOWN.SLOTS: event=0x00,umask=0x4 (pseudo-encoding) */
+#define MSR_ARCH_PERFMON_FIXED_CTR3 0x30c
+#define INTEL_PMC_IDX_FIXED_SLOTS (INTEL_PMC_IDX_FIXED + 3)
+#define INTEL_PMC_MSK_FIXED_SLOTS (1ULL << INTEL_PMC_IDX_FIXED_SLOTS)
+
/*
* We model BTS tracing as another fixed-mode PMC.
*
--
2.17.1
From: Andi Kleen <[email protected]>
Icelake has support for reporting per thread TopDown metrics.
These are reported differently from the previous TopDown support:
each metric is standalone, but scaled to pipeline "slots".
We don't need to do anything special for HyperThreading anymore.
Teach perf stat --topdown to handle these new metrics and
print them in the same way as the previous TopDown metrics.
The restriction of only being able to report information per core is
gone.
Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/Documentation/perf-stat.txt | 9 ++-
tools/perf/builtin-stat.c | 25 ++++++++
tools/perf/util/stat-shadow.c | 89 ++++++++++++++++++++++++++
tools/perf/util/stat.c | 4 ++
tools/perf/util/stat.h | 8 +++
5 files changed, 133 insertions(+), 2 deletions(-)
diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index b029ee728a0b..a1a2b9755ae8 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -317,8 +317,13 @@ if the workload is actually bound by the CPU and not by something else.
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
-The top down metrics are collected per core instead of per
-CPU thread. Per core mode is automatically enabled
+This enables --metric-only, unless overridden with --no-metric-only.
+
+The following restrictions only apply to older Intel CPUs and Atom;
+on newer CPUs (IceLake and later) TopDown can be collected for any thread:
+
+The top down metrics are collected per core instead of per CPU thread.
+Per core mode is automatically enabled
and -a (global monitoring) is needed, requiring root rights or
perf.perf_event_paranoid=-1.
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 9be020e0098a..078ea1485317 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -128,6 +128,15 @@ static const char * topdown_attrs[] = {
NULL,
};
+static const char *topdown_metric_attrs[] = {
+ "slots",
+ "topdown-retiring",
+ "topdown-bad-spec",
+ "topdown-fe-bound",
+ "topdown-be-bound",
+ NULL,
+};
+
static const char *smi_cost_attrs = {
"{"
"msr/aperf/,"
@@ -1638,6 +1647,21 @@ static int add_default_attributes(void)
char *str = NULL;
bool warn = false;
+ if (topdown_filter_events(topdown_metric_attrs, &str, 1) < 0) {
+ pr_err("Out of memory\n");
+ return -1;
+ }
+ if (topdown_metric_attrs[0] && str) {
+ if (!stat_config.interval) {
+ fprintf(stat_config.output,
+ "Topdown accuracy may decreases when measuring long period.\n"
+ "Please print the result regularly, e.g. -I1000\n");
+ }
+ goto setup_metrics;
+ }
+
+ str = NULL;
+
if (stat_config.aggr_mode != AGGR_GLOBAL &&
stat_config.aggr_mode != AGGR_CORE) {
pr_err("top down event configuration requires --per-core mode\n");
@@ -1659,6 +1683,7 @@ static int add_default_attributes(void)
if (topdown_attrs[0] && str) {
if (warn)
arch_topdown_group_warn();
+setup_metrics:
err = parse_events(evsel_list, str, &errinfo);
if (err) {
fprintf(stderr,
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index a7c13a88ecb9..933ddb601cb8 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -241,6 +241,18 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count,
else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
update_runtime_stat(st, STAT_TOPDOWN_RECOVERY_BUBBLES,
ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_RETIRING))
+ update_runtime_stat(st, STAT_TOPDOWN_RETIRING,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_BAD_SPEC))
+ update_runtime_stat(st, STAT_TOPDOWN_BAD_SPEC,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_FE_BOUND))
+ update_runtime_stat(st, STAT_TOPDOWN_FE_BOUND,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_BE_BOUND))
+ update_runtime_stat(st, STAT_TOPDOWN_BE_BOUND,
+ ctx, cpu, count);
else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT,
ctx, cpu, count);
@@ -705,6 +717,47 @@ static double td_be_bound(int ctx, int cpu, struct runtime_stat *st)
return sanitize_val(1.0 - sum);
}
+/*
+ * Kernel reports metrics multiplied with slots. To get back
+ * the ratios we need to recreate the sum.
+ */
+
+static double td_metric_ratio(int ctx, int cpu,
+ enum stat_type type,
+ struct runtime_stat *stat)
+{
+ double sum = runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, ctx, cpu);
+ double d = runtime_stat_avg(stat, type, ctx, cpu);
+
+ if (sum)
+ return d / sum;
+ return 0;
+}
+
+/*
+ * ... but only if most of the values are actually available.
+ * We allow two missing.
+ */
+
+static bool full_td(int ctx, int cpu,
+ struct runtime_stat *stat)
+{
+ int c = 0;
+
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, ctx, cpu) > 0)
+ c++;
+ return c >= 2;
+}
+
static void print_smi_cost(struct perf_stat_config *config,
int cpu, struct evsel *evsel,
struct perf_stat_output_ctx *out,
@@ -1034,6 +1087,42 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
be_bound * 100.);
else
print_metric(config, ctxp, NULL, NULL, name, 0);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_RETIRING) &&
+ full_td(ctx, cpu, st)) {
+ double retiring = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_RETIRING, st);
+
+ if (retiring > 0.7)
+ color = PERF_COLOR_GREEN;
+ print_metric(config, ctxp, color, "%8.1f%%", "retiring",
+ retiring * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_FE_BOUND) &&
+ full_td(ctx, cpu, st)) {
+ double fe_bound = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_FE_BOUND, st);
+
+ if (fe_bound > 0.2)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "frontend bound",
+ fe_bound * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_BE_BOUND) &&
+ full_td(ctx, cpu, st)) {
+ double be_bound = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_BE_BOUND, st);
+
+ if (be_bound > 0.2)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "backend bound",
+ be_bound * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_BAD_SPEC) &&
+ full_td(ctx, cpu, st)) {
+ double bad_spec = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_BAD_SPEC, st);
+
+ if (bad_spec > 0.1)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "bad speculation",
+ bad_spec * 100.);
} else if (evsel->metric_expr) {
generic_metric(config, evsel->metric_expr, evsel->metric_events, evsel->name,
evsel->metric_name, NULL, 1, cpu, out, st);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index cdb154381a87..bd0decd6d753 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -95,6 +95,10 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
+ ID(TOPDOWN_RETIRING, topdown-retiring),
+ ID(TOPDOWN_BAD_SPEC, topdown-bad-spec),
+ ID(TOPDOWN_FE_BOUND, topdown-fe-bound),
+ ID(TOPDOWN_BE_BOUND, topdown-be-bound),
ID(SMI_NUM, msr/smi/),
ID(APERF, msr/aperf/),
};
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index f75ae679eb28..9705421d99a4 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -28,6 +28,10 @@ enum perf_stat_evsel_id {
PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
+ PERF_STAT_EVSEL_ID__TOPDOWN_RETIRING,
+ PERF_STAT_EVSEL_ID__TOPDOWN_BAD_SPEC,
+ PERF_STAT_EVSEL_ID__TOPDOWN_FE_BOUND,
+ PERF_STAT_EVSEL_ID__TOPDOWN_BE_BOUND,
PERF_STAT_EVSEL_ID__SMI_NUM,
PERF_STAT_EVSEL_ID__APERF,
PERF_STAT_EVSEL_ID__MAX,
@@ -82,6 +86,10 @@ enum stat_type {
STAT_TOPDOWN_SLOTS_RETIRED,
STAT_TOPDOWN_FETCH_BUBBLES,
STAT_TOPDOWN_RECOVERY_BUBBLES,
+ STAT_TOPDOWN_RETIRING,
+ STAT_TOPDOWN_BAD_SPEC,
+ STAT_TOPDOWN_FE_BOUND,
+ STAT_TOPDOWN_BE_BOUND,
STAT_SMI_NUM,
STAT_APERF,
STAT_MAX
--
2.17.1
> + if (topdown_metric_attrs[0] && str) {
> + if (!stat_config.interval) {
> + fprintf(stat_config.output,
> + "Topdown accuracy may decrease when measuring a long period.\n"
> + "Please print the result regularly, e.g. -I1000\n");
Can you disable this warning when --metric-only is used? In this case it doesn't matter
because the error is proportional to the percentage accuracy and should be invisible.
You can only see a difference when looking at the expanded counts.
-andi
On Thu, Jul 23, 2020 at 10:11:10AM -0700, [email protected] wrote:
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3b22db08b6fb..93631e5389bf 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -576,9 +576,14 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
> * PERF_EV_CAP_SOFTWARE: Is a software event.
> * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
> * from any CPU in the package where it is active.
> + * PERF_EV_CAP_COEXIST: An event with this flag must coexist with other sibling
> + * events, which have the same flag. If any event with the flag is detached
> + * from the group, split the group into singleton events, and move the events
> + * with the flag to the unrecoverable ERROR state.
> */
> #define PERF_EV_CAP_SOFTWARE BIT(0)
> #define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
> +#define PERF_EV_CAP_COEXIST BIT(2)
>
> #define SWEVENT_HLIST_BITS 8
> #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7c436d705fbd..e35d549a356d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2133,10 +2133,28 @@ static inline struct list_head *get_event_list(struct perf_event *event)
> return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
> }
>
> +/*
> + * If the event has PERF_EV_CAP_COEXIST capability,
> + * schedule it out and move it into the ERROR state.
> + */
> +static inline void perf_remove_coexist_events(struct perf_event *event)
> +{
> + struct perf_event_context *ctx = event->ctx;
> + struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> +
> + if (!(event->event_caps & PERF_EV_CAP_COEXIST))
> + return;
> +
> + event_sched_out(event, cpuctx, ctx);
> + perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> +}
Ah, so the problem here is that ERROR is actually recoverable using
IOC_ENABLE. We don't want that either. Let me try and figure out if EXIT
would work.
On Fri, Jul 24, 2020 at 12:55:43PM +0200, [email protected] wrote:
> > + event_sched_out(event, cpuctx, ctx);
> > + perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> > +}
>
> Ah, so the problem here is that ERROR is actually recoverable using
> IOC_ENABLE. We don't want that either. Let me try and figure out if EXIT
> would work.
EXIT is difficult too.. Also, that COEXIST thing hurt my brain, can't we
do a simpler SIBLING thing that simply requires the event to be a group
sibling?
The consequence is that SLOTS must be the leader, is that really a
problem? You keep providing the {cycles, slots, metric-things} example,
but afaict {slots, metric-thing, cycles} should work just fine too.
PERF_SAMPLE_READ with PERF_FORMAT_GROUP doesn't need to be the leader.
The thing is, with COEXIST, if you have:
{cycles, slots, metric1, metric2}
and you say that COEXIST events need to co-exist with at least one other
COEXIST event you end up with a problem when: close(fd_slots) happens,
because in the above case, there's two more COEXIST events, so the
requirement holds, even though this is very much not what we want.
You 'fixed' this by saying closing any COEXIST event will tear the group
up, but that's 'weird' and unexpected if you were to do:
close(fd_metric2).
So something like this.
---
Subject: perf/core: Add a new PERF_EV_CAP_SIBLING event capability
Current perf assumes that events in a group are independent. Closing an
event doesn't impact the value of the other events in the same group.
If the closed event is a member, after the event closure, other events
are still running like a group. If the closed event is a leader, other
events are running as singleton events.
Add PERF_EV_CAP_SIBLING to allow events to indicate they require being
part of a group, and when the leader dies they cannot exist
independently.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/perf_event.h | 4 ++++
kernel/events/core.c | 38 +++++++++++++++++++++++++++++++++-----
2 files changed, 37 insertions(+), 5 deletions(-)
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -576,9 +576,13 @@ typedef void (*perf_overflow_handler_t)(
* PERF_EV_CAP_SOFTWARE: Is a software event.
* PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
* from any CPU in the package where it is active.
+ * PERF_EV_CAP_SIBLING: An event with this flag must be a group sibling and
+ * cannot be a group leader. If an event with this flag is detached from the
+ * group it is scheduled out and moved into an unrecoverable ERROR state.
*/
#define PERF_EV_CAP_SOFTWARE BIT(0)
#define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
+#define PERF_EV_CAP_SIBLING BIT(2)
#define SWEVENT_HLIST_BITS 8
#define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2133,8 +2133,24 @@ static inline struct list_head *get_even
return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
}
+/*
+ * Events that have PERF_EV_CAP_SIBLING require being part of a group and
+ * cannot exist on their own, schedule them out and move them into the ERROR
+ * state. Also see _perf_event_enable(), it will not be able to recover
+ * this ERROR state.
+ */
+static inline void perf_remove_sibling_event(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+ struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+
+ event_sched_out(event, cpuctx, ctx);
+ perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+}
+
static void perf_group_detach(struct perf_event *event)
{
+ struct perf_event *leader = event->group_leader;
struct perf_event *sibling, *tmp;
struct perf_event_context *ctx = event->ctx;
@@ -2153,7 +2169,7 @@ static void perf_group_detach(struct per
/*
* If this is a sibling, remove it from its group.
*/
- if (event->group_leader != event) {
+ if (leader != event) {
list_del_init(&event->sibling_list);
event->group_leader->nr_siblings--;
goto out;
@@ -2166,6 +2182,9 @@ static void perf_group_detach(struct per
*/
list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
+ if (sibling->event_caps & PERF_EV_CAP_SIBLING)
+ perf_remove_sibling_event(sibling);
+
sibling->group_leader = sibling;
list_del_init(&sibling->sibling_list);
@@ -2183,10 +2202,10 @@ static void perf_group_detach(struct per
}
out:
- perf_event__header_size(event->group_leader);
-
- for_each_sibling_event(tmp, event->group_leader)
+ for_each_sibling_event(tmp, leader)
perf_event__header_size(tmp);
+
+ perf_event__header_size(leader);
}
static bool is_orphaned_event(struct perf_event *event)
@@ -2979,6 +2998,7 @@ static void _perf_event_enable(struct pe
raw_spin_lock_irq(&ctx->lock);
if (event->state >= PERF_EVENT_STATE_INACTIVE ||
event->state < PERF_EVENT_STATE_ERROR) {
+out:
raw_spin_unlock_irq(&ctx->lock);
return;
}
@@ -2990,8 +3010,16 @@ static void _perf_event_enable(struct pe
* has gone back into error state, as distinct from the task having
* been scheduled away before the cross-call arrived.
*/
- if (event->state == PERF_EVENT_STATE_ERROR)
+ if (event->state == PERF_EVENT_STATE_ERROR) {
+ /*
+ * Detached SIBLING events cannot leave ERROR state.
+ */
+ if (event->event_caps & PERF_EV_CAP_SIBLING &&
+ event->group_leader == event)
+ goto out;
+
event->state = PERF_EVENT_STATE_OFF;
+ }
raw_spin_unlock_irq(&ctx->lock);
event_function_call(event, __perf_event_enable, NULL);
On Thu, Jul 23, 2020 at 10:11:11AM -0700, [email protected] wrote:
> @@ -3375,6 +3428,72 @@ static int intel_pmu_hw_config(struct perf_event *event)
> if (event->attr.type != PERF_TYPE_RAW)
> return 0;
>
> + /*
> + * Config Topdown slots and metric events
> + *
> + * The slots event on Fixed Counter 3 can support sampling,
> + * which will be handled normally in x86_perf_event_update().
> + *
> + * The metric events don't support sampling.
> + *
> + * For counting, topdown slots and metric events will be
> + * handled specially for event update.
> + * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
> + */
> + if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
> + if (is_metric_event(event)) {
> + struct perf_event *leader = event->group_leader;
> + struct perf_event *sibling;
> +
> + /* The metric events don't support sampling. */
> + if (is_sampling_event(event))
> + return -EINVAL;
> +
> + /* The metric events cannot be a group leader. */
> + if (leader == event)
> + return -EINVAL;
> +
> + /*
> + * The slots event cannot be the leader of a topdown
> + * sample-read group, e.g., {slots, topdown-retiring}:S
> + */
> + if (is_slots_event(leader) && is_sampling_event(leader))
> + return -EINVAL;
This has nothing to do with sample-read; SLOTS cannot be sampling when
coupled with the METRIC stuff because hardware is daft.
And you can have SAMPLE_READ on non-leader events just fine.
> +
> + /*
> + * The slots event must be before the metric events,
> + * because we only update the values of a topdown
> + * group once with the slots event.
> + */
> + if (!is_slots_event(leader)) {
> + for_each_sibling_event(sibling, leader) {
> + if (is_slots_event(sibling))
> + break;
> + if (is_metric_event(sibling))
> + return -EINVAL;
> + }
> + }
Per the SIBLING patch this then wants to be:
if (!is_slots_event(leader))
return -EINVAL;
event->event_caps |= PERF_EV_CAP_SIBLING.
/*
* Only once we have a METRICs sibling do we
* need TopDown magic.
*/
leader->hw.flags |= PERF_X86_EVENT_TOPDOWN;
> + }
> +
> + if (!is_sampling_event(event)) {
> + if (event->attr.config1 != 0)
> + return -EINVAL;
How does this depend on sampling?
> + /*
> + * The TopDown metrics events and slots event don't
> + * support any filters.
> + */
> + if (event->attr.config & X86_ALL_EVENT_FLAGS)
> + return -EINVAL;
That seems independent of sampling too. Even a sampling SLOTS shouldn't
be having any of those afaict.
> +
> + event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
This is confusing too, a !sampling SLOTS event without METRIC siblings
shouldn't have this set, right? So arguably, this should be like above.
> +
> + event->event_caps |= PERF_EV_CAP_COEXIST;
> +
> + if (is_metric_event(event))
> + event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
This too seems like something that should be in the is_metric_event()
branch above.
> + }
> + }
> +
> if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
> return 0;
>
On 7/24/2020 7:46 AM, [email protected] wrote:
> On Fri, Jul 24, 2020 at 12:55:43PM +0200, [email protected] wrote:
>>> + event_sched_out(event, cpuctx, ctx);
>>> + perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>>> +}
>>
>> Ah, so the problem here is that ERROR is actually recoverable using
>> IOC_ENABLE. We don't want that either. Let me try and figure out if EXIT
>> would work.
>
> EXIT is difficult too.. Also, that COEXIST thing hurt my brain, can't we
> do a simpler SIBLING thing that simply requires the event to be a group
> sibling?
>
> The consequence is that SLOTS must be the leader, is that really a
> problem? You keep providing the {cycles, slots, metric-things} example,
> but afaict {slots, metric-thing, cycles} should work just fine too.
> PERF_SAMPLE_READ with PERF_FORMAT_GROUP doesn't need to be the leader.
I'm not sure I get your point.
For the PERF_SAMPLE_READ with PERF_FORMAT_GROUP case, other events can
be the leader, e.g., {cycles, slots, metric-things}:S.
In this case, the SLOTS event is not a leader. I don't think we can
assume that the SLOTS event must be the leader.
I think the case should be a useful case. Users may want to read the
Topdown values for each sample.
Thanks,
Kan
>
> The thing is, with COEXIST, if you have:
>
> {cycles, slots, metric1, metric2}
>
> and you say that COEXIST events need to co-exist with at least one other
> COEXIST event you end up with a problem when: close(fd_slots) happens,
> because in the above case, there's two more COEXIST events, so the
> requirement holds, even though this is very much not what we want.
>
> You 'fixed' this by saying closing any COEXIST event will tear the group
> up, but that's 'weird' and unexpected if you were to do:
> close(fd_metric2).
> >
> So something like this.
>
> ---
> Subject: perf/core: Add a new PERF_EV_CAP_SIBLING event capability
>
> Current perf assumes that events in a group are independent. Closing an
> event doesn't impact the value of the other events in the same group.
> If the closed event is a member, after the event closure, other events
> are still running like a group. If the closed event is a leader, other
> events are running as singleton events.
>
> Add PERF_EV_CAP_SIBLING to allow events to indicate they require being
> part of a group, and when the leader dies they cannot exist
> independently.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> include/linux/perf_event.h | 4 ++++
> kernel/events/core.c | 38 +++++++++++++++++++++++++++++++++-----
> 2 files changed, 37 insertions(+), 5 deletions(-)
>
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -576,9 +576,13 @@ typedef void (*perf_overflow_handler_t)(
> * PERF_EV_CAP_SOFTWARE: Is a software event.
> * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
> * from any CPU in the package where it is active.
> + * PERF_EV_CAP_SIBLING: An event with this flag must be a group sibling and
> + * cannot be a group leader. If an event with this flag is detached from the
> + * group it is scheduled out and moved into an unrecoverable ERROR state.
> */
> #define PERF_EV_CAP_SOFTWARE BIT(0)
> #define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
> +#define PERF_EV_CAP_SIBLING BIT(2)
>
> #define SWEVENT_HLIST_BITS 8
> #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2133,8 +2133,24 @@ static inline struct list_head *get_even
> return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
> }
>
> +/*
> + * Events that have PERF_EV_CAP_SIBLING require being part of a group and
> + * cannot exist on their own, schedule them out and move them into the ERROR
> + * state. Also see _perf_event_enable(), it will not be able to recover
> + * this ERROR state.
> + */
> +static inline void perf_remove_sibling_event(struct perf_event *event)
> +{
> + struct perf_event_context *ctx = event->ctx;
> + struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> +
> + event_sched_out(event, cpuctx, ctx);
> + perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> +}
> +
> static void perf_group_detach(struct perf_event *event)
> {
> + struct perf_event *leader = event->group_leader;
> struct perf_event *sibling, *tmp;
> struct perf_event_context *ctx = event->ctx;
>
> @@ -2153,7 +2169,7 @@ static void perf_group_detach(struct per
> /*
> * If this is a sibling, remove it from its group.
> */
> - if (event->group_leader != event) {
> + if (leader != event) {
> list_del_init(&event->sibling_list);
> event->group_leader->nr_siblings--;
> goto out;
> @@ -2166,6 +2182,9 @@ static void perf_group_detach(struct per
> */
> list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
>
> + if (sibling->event_caps & PERF_EV_CAP_SIBLING)
> + perf_remove_sibling_event(sibling);
> +
> sibling->group_leader = sibling;
> list_del_init(&sibling->sibling_list);
>
> @@ -2183,10 +2202,10 @@ static void perf_group_detach(struct per
> }
>
> out:
> - perf_event__header_size(event->group_leader);
> -
> - for_each_sibling_event(tmp, event->group_leader)
> + for_each_sibling_event(tmp, leader)
> perf_event__header_size(tmp);
> +
> + perf_event__header_size(leader);
> }
>
> static bool is_orphaned_event(struct perf_event *event)
> @@ -2979,6 +2998,7 @@ static void _perf_event_enable(struct pe
> raw_spin_lock_irq(&ctx->lock);
> if (event->state >= PERF_EVENT_STATE_INACTIVE ||
> event->state < PERF_EVENT_STATE_ERROR) {
> +out:
> raw_spin_unlock_irq(&ctx->lock);
> return;
> }
> @@ -2990,8 +3010,16 @@ static void _perf_event_enable(struct pe
> * has gone back into error state, as distinct from the task having
> * been scheduled away before the cross-call arrived.
> */
> - if (event->state == PERF_EVENT_STATE_ERROR)
> + if (event->state == PERF_EVENT_STATE_ERROR) {
> + /*
> + * Detached SIBLING events cannot leave ERROR state.
> + */
> + if (event->event_caps & PERF_EV_CAP_SIBLING &&
> + event->group_leader == event)
> + goto out;
> +
> event->state = PERF_EVENT_STATE_OFF;
> + }
> raw_spin_unlock_irq(&ctx->lock);
>
> event_function_call(event, __perf_event_enable, NULL);
>
On Fri, Jul 24, 2020 at 09:43:44AM -0400, Liang, Kan wrote:
>
>
> On 7/24/2020 7:46 AM, [email protected] wrote:
> > On Fri, Jul 24, 2020 at 12:55:43PM +0200, [email protected] wrote:
> > > > + event_sched_out(event, cpuctx, ctx);
> > > > + perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> > > > +}
> > >
> > > Ah, so the problem here is that ERROR is actually recoverable using
> > > IOC_ENABLE. We don't want that either. Let me try and figure out if EXIT
> > > would work.
> >
> > EXIT is difficult too.. Also, that COEXIST thing hurt my brain, can't we
> > do a simpler SIBLING thing that simply requires the event to be a group
> > sibling?
> >
> > The consequence is that SLOTS must be the leader, is that really a
> > problem? You keep providing the {cycles, slots, metric-things} example,
> > but afaict {slots, metric-thing, cycles} should work just fine too.
> > PERF_SAMPLE_READ with PERF_FORMAT_GROUP doesn't need to be the leader.
>
> I'm not sure I get your point.
> For the PERF_SAMPLE_READ with PERF_FORMAT_GROUP case, other events can be
> the leader, e.g., {cycles, slots, metric-things}:S.
> In this case, the SLOTS event is not a leader. I don't think we can assume
> that the SLOTS event must be the leader.
You can have a sibling event with SAMPLE_READ and FORMAT_GROUP just
fine. The sampling event doesn't need to be the leader.
On 7/24/2020 9:54 AM, Peter Zijlstra wrote:
> On Fri, Jul 24, 2020 at 09:43:44AM -0400, Liang, Kan wrote:
>>
>>
>> On 7/24/2020 7:46 AM, [email protected] wrote:
>>> On Fri, Jul 24, 2020 at 12:55:43PM +0200, [email protected] wrote:
>>>>> + event_sched_out(event, cpuctx, ctx);
>>>>> + perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>>>>> +}
>>>>
>>>> Ah, so the problem here is that ERROR is actually recoverable using
>>>> IOC_ENABLE. We don't want that either. Let me try and figure out if EXIT
>>>> would work.
>>>
>>> EXIT is difficult too.. Also, that COEXIST thing hurt my brain, can't we
>>> do a simpler SIBLING thing that simply requires the event to be a group
>>> sibling?
>>>
>>> The consequence is that SLOTS must be the leader, is that really a
>>> problem? You keep providing the {cycles, slots, metric-things} example,
>>> but afaict {slots, metric-thing, cycles} should work just fine too.
>>> PERF_SAMPLE_READ with PERF_FORMAT_GROUP doesn't need to be the leader.
>>
>> I'm not sure I get your point.
>> For the PERF_SAMPLE_READ with PERF_FORMAT_GROUP case, other events can be
>> the leader, e.g., {cycles, slots, metric-things}:S.
>> In this case, the SLOTS event is not a leader. I don't think we can assume
>> that the SLOTS event must be the leader.
>
> You can have a sibling event with SAMPLE_READ and FORMAT_GROUP just
> fine. The sampling event doesn't need to be the leader.
>
There will be a problem for the current perf tool, which assumes that
the leader event is the sampling event.
I will check how can we specially handle it in the perf tool.
Thanks,
Kan
On Fri, Jul 24, 2020 at 10:19:25AM -0400, Liang, Kan wrote:
> There will be a problem for the current perf tool, which assumes that the
> leader event is the sampling event.
>
> I will check how can we specially handle it in the perf tool.
Ah, okay. I've long lost track of how the tool works :/
Something that seems to 'work' is:
'{cycles,cpu/instructions,period=50000/}', so maybe you can make the
group modifier :S use any sampling event if there is one, and otherwise
designate the leader.
Then you can write things like:
'{slots, metric1, metric2, cpu/cycles,freq=50000/}:S'
and then since cycles is specified as a sampling event, it will use
that.
> The consequence is that SLOTS must be the leader, is that really a
> problem? You keep providing the {cycles, slots, metric-things} example,
Yes that's a problem. One (major) use case for topdown is to
sample on lots of different events, but always create groups that
also measure topdown metrics for that sampling period. For that we need
other leaders.
-Andi
> Something that seems to 'work' is:
> '{cycles,cpu/instructions,period=50000/}', so maybe you can make the
> group modifier :S use any sampling event if there is one, and otherwise
> designate the leader.
>
> Then you can write things like:
>
> '{slots, metric1, metric2, cpu/cycles,freq=50000/}:S'
>
> and then since cycles is specified as a sampling event, it will use
> that.
Okay possible, but it makes things more complicated
for the user to understand and requires special documentation.
Hopefully it's worth it for the internal simplification.
-andi
On Fri, Jul 24, 2020 at 07:39:08AM -0700, Andi Kleen wrote:
> > The consequence is that SLOTS must be the leader, is that really a
> > problem? You keep providing the {cycles, slots, metric-things} example,
>
> Yes that's a problem. One (major) use case for topdown is to
> sample on lots of different events, but always create groups that
> also measure topdown metrics for that sampling period. For that we need
> other leaders.
Apparently the only problem here is the tool, the kernel doesn't care
which events in a group are sampling, there could be multiple.
On Fri, Jul 24, 2020 at 07:46:32AM -0700, Andi Kleen wrote:
> > Something that seems to 'work' is:
> > '{cycles,cpu/instructions,period=50000/}', so maybe you can make the
> > group modifier :S use any sampling event if there is one, and otherwise
> > designate the leader.
> >
> > Then you can write things like:
> >
> > '{slots, metric1, metric2, cpu/cycles,freq=50000/}:S'
> >
> > and then since cycles is specified as a sampling event, it will use
> > that.
>
> Okay possible, but it makes things more complicated
> for the user to understand and requires special documentation.
> Hopefully it's worth it the internal simplification.
You already require special documentation for this metrics stuff. We
already need to state that SLOTS cannot be a sampling event, so you
already need to pay attention to this anyway.
A shortcut could be a :s event modifier, then you can write:
'{slots, metric1, metric2, cycles:s}:S'
and have the tool select the :s tagged one.
On Fri, Jul 24, 2020 at 03:19:06PM +0200, [email protected] wrote:
> On Thu, Jul 23, 2020 at 10:11:11AM -0700, [email protected] wrote:
> > @@ -3375,6 +3428,72 @@ static int intel_pmu_hw_config(struct perf_event *event)
> > if (event->attr.type != PERF_TYPE_RAW)
> > return 0;
> >
> > + /*
> > + * Config Topdown slots and metric events
> > + *
> > + * The slots event on Fixed Counter 3 can support sampling,
> > + * which will be handled normally in x86_perf_event_update().
> > + *
> > + * The metric events don't support sampling.
> > + *
> > + * For counting, topdown slots and metric events will be
> > + * handled specially for event update.
> > + * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
> > + */
> > + if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
> > + if (is_metric_event(event)) {
> > + struct perf_event *leader = event->group_leader;
> > + struct perf_event *sibling;
> > +
> > + /* The metric events don't support sampling. */
> > + if (is_sampling_event(event))
> > + return -EINVAL;
> > +
> > + /* The metric events cannot be a group leader. */
> > + if (leader == event)
> > + return -EINVAL;
> > +
> > + /*
> > + * The slots event cannot be the leader of a topdown
> > + * sample-read group, e.g., {slots, topdown-retiring}:S
> > + */
> > + if (is_slots_event(leader) && is_sampling_event(leader))
> > + return -EINVAL;
>
> This has nothing to do with sample-read; SLOTS cannot be sampling when
> coupled with the METRIC stuff because hardware is daft.
>
> And you can have SAMPLE_READ on non-leader events just fine.
>
> > +
> > + /*
> > + * The slots event must be before the metric events,
> > + * because we only update the values of a topdown
> > + * group once with the slots event.
> > + */
> > + if (!is_slots_event(leader)) {
> > + for_each_sibling_event(sibling, leader) {
> > + if (is_slots_event(sibling))
> > + break;
> > + if (is_metric_event(sibling))
> > + return -EINVAL;
> > + }
> > + }
>
> Per the SIBLING patch this then wants to be:
>
> if (!is_slots_event(leader))
> return -EINVAL;
>
> event->event_caps |= PERF_EV_CAP_SIBLING.
> /*
> * Only once we have a METRICs sibling to we
> * need TopDown magic.
> */
> leader->hw.flags |= PERF_X86_EVENT_TOPDOWN;
> > + }
> > +
> > + if (!is_sampling_event(event)) {
> > + if (event->attr.config1 != 0)
> > + return -EINVAL;
>
> How does this depend on sampling?
>
> > + /*
> > + * The TopDown metrics events and slots event don't
> > + * support any filters.
> > + */
> > + if (event->attr.config & X86_ALL_EVENT_FLAGS)
> > + return -EINVAL;
>
> That seems independent of sampling too. Even a sampling SLOTS shouldn't
> be having any of those afaict.
>
> > +
> > + event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
>
> This is confusing too, a !sampling SLOTS event without METRIC siblings
> shouldn't have this set, right? So arguably, this should be like above.
>
> > +
> > + event->event_caps |= PERF_EV_CAP_COEXIST;
> > +
> > + if (is_metric_event(event))
> > + event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
>
> This too seems like something that should be in the is_metric_event()
> branch above.
>
> > + }
> > + }
> > +
> > if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
> > return 0;
> >
FWIW, I pushed out a branch with all these changes in:
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/metric
Just to get it some build love, if you want it differently, I'm happy to
throw it all out again.
On 7/24/2020 11:27 AM, [email protected] wrote:
> On Fri, Jul 24, 2020 at 03:19:06PM +0200, [email protected] wrote:
>> On Thu, Jul 23, 2020 at 10:11:11AM -0700, [email protected] wrote:
>>> @@ -3375,6 +3428,72 @@ static int intel_pmu_hw_config(struct perf_event *event)
>>> if (event->attr.type != PERF_TYPE_RAW)
>>> return 0;
>>>
>>> + /*
>>> + * Config Topdown slots and metric events
>>> + *
>>> + * The slots event on Fixed Counter 3 can support sampling,
>>> + * which will be handled normally in x86_perf_event_update().
>>> + *
>>> + * The metric events don't support sampling.
>>> + *
>>> + * For counting, topdown slots and metric events will be
>>> + * handled specially for event update.
>>> + * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
>>> + */
>>> + if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
>>> + if (is_metric_event(event)) {
>>> + struct perf_event *leader = event->group_leader;
>>> + struct perf_event *sibling;
>>> +
>>> + /* The metric events don't support sampling. */
>>> + if (is_sampling_event(event))
>>> + return -EINVAL;
>>> +
>>> + /* The metric events cannot be a group leader. */
>>> + if (leader == event)
>>> + return -EINVAL;
>>> +
>>> + /*
>>> + * The slots event cannot be the leader of a topdown
>>> + * sample-read group, e.g., {slots, topdown-retiring}:S
>>> + */
>>> + if (is_slots_event(leader) && is_sampling_event(leader))
>>> + return -EINVAL;
>>
>> This has nothing to do with sample-read; SLOTS cannot be sampling when
>> coupled with the METRIC stuff because hardware is daft.
>>
>> And you can have SAMPLE_READ on non-leader events just fine.
>>
>>> +
>>> + /*
>>> + * The slots event must be before the metric events,
>>> + * because we only update the values of a topdown
>>> + * group once with the slots event.
>>> + */
>>> + if (!is_slots_event(leader)) {
>>> + for_each_sibling_event(sibling, leader) {
>>> + if (is_slots_event(sibling))
>>> + break;
>>> + if (is_metric_event(sibling))
>>> + return -EINVAL;
>>> + }
>>> + }
>>
>> Per the SIBLING patch this then wants to be:
>>
>> if (!is_slots_event(leader))
>> return -EINVAL;
>>
>> event->event_caps |= PERF_EV_CAP_SIBLING.
>> /*
>> * Only once we have a METRICs sibling do we
>> * need TopDown magic.
>> */
>> leader->hw.flags |= PERF_X86_EVENT_TOPDOWN;
>>> + }
>>> +
>>> + if (!is_sampling_event(event)) {
>>> + if (event->attr.config1 != 0)
>>> + return -EINVAL;
>>
>> How does this depend on sampling?
>>
>>> + /*
>>> + * The TopDown metrics events and slots event don't
>>> + * support any filters.
>>> + */
>>> + if (event->attr.config & X86_ALL_EVENT_FLAGS)
>>> + return -EINVAL;
>>
>> That seems independent of sampling too. Even a sampling SLOTS shouldn't
>> be having any of those afaict.
>>
>>> +
>>> + event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
>>
>> This is confusing too, a !sampling SLOTS event without METRIC siblings
>> shouldn't have this set, right? So arguably, this should be like above.
>>
>>> +
>>> + event->event_caps |= PERF_EV_CAP_COEXIST;
>>> +
>>> + if (is_metric_event(event))
>>> + event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
>>
>> This too seems like something that should be in the is_metric_event()
>> branch above.
>>
>>> + }
>>> + }
>>> +
>>> if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
>>> return 0;
>>>
>
> FWIW, I pushed out a branch with all these changes in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/metric
>
> Just to get it some build love, if you want it differently, I'm happy to
> throw it all out again.
Thanks Peter.
I will pull the branch and do more tests.
Thanks,
Kan
On Fri, Jul 24, 2020 at 04:59:34PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 24, 2020 at 07:46:32AM -0700, Andi Kleen wrote:
> > > Something that seems to 'work' is:
> > > '{cycles,cpu/instructions,period=50000/}', so maybe you can make the
> > > group modifier :S use any sampling event if there is one, and otherwise
> > > designate the leader.
> > >
> > > Then you can write things like:
> > >
> > > '{slots, metric1, metric2, cpu/cycles,freq=50000/}:S'
> > >
> > > and then since cycles is specified as a sampling event, it will use
> > > that.
> >
> > Okay possible, but it makes things more complicated
> > for the user to understand and requires special documentation.
> > Hopefully it's worth it for the internal simplification.
>
> You already require special documentation for this metrics stuff. We
> already need to state that SLOTS cannot be a sampling event, so you
> already need to pay attention to this anyway.
>
> A shortcut could be a :s event modifier, then you can write:
>
> '{slots, metric1, metric2, cycles:s}:S'
>
> and have the tool select the :s tagged one.
Having slots as leader also would allow doing something like
FORMAT_METRIC, where we return sibling/leader in some fashion.
That also makes sense for instructions, because, IIRC,
instructions/slots is the better IPC.
And we should probably consider FORMAT_RESET.
On 7/24/2020 12:43 PM, [email protected] wrote:
> On Fri, Jul 24, 2020 at 04:59:34PM +0200, Peter Zijlstra wrote:
>> On Fri, Jul 24, 2020 at 07:46:32AM -0700, Andi Kleen wrote:
>>>> Something that seems to 'work' is:
>>>> '{cycles,cpu/instructions,period=50000/}', so maybe you can make the
>>>> group modifier :S use any sampling event if there is one, and otherwise
>>>> designate the leader.
>>>>
>>>> Then you can write things like:
>>>>
>>>> '{slots, metric1, metric2, cpu/cycles,freq=50000/}:S'
>>>>
>>>> and then since cycles is specified as a sampling event, it will use
>>>> that.
>>>
>>> Okay possible, but it makes things more complicated
>>> for the user to understand and requires special documentation.
>>> Hopefully it's worth it for the internal simplification.
>>
>> You already require special documentation for this metrics stuff. We
>> already need to state that SLOTS cannot be a sampling event, so you
>> already need to pay attention to this anyway.
>>
>> A shortcut could be a :s event modifier, then you can write:
>>
>> '{slots, metric1, metric2, cycles:s}:S'
>>
>> and have the tool select the :s tagged one.
It looks like PT encountered a similar issue to ours.
They use the 2nd event of the group as the
"leader". I think we can simply extend the function to check for the slots
event in the perf tool.
https://lore.kernel.org/lkml/[email protected]/
>
> Having slots as leader also would allow doing something like
> FORMAT_METRIC, where we return sibling/leader in some fashion.
>
> That also makes sense for instructions, because, IIRC,
> instructions/slots is the better IPC.
>
> And we should probably consider FORMAT_RESET.
What's FORMAT_RESET for?
Thanks,
Kan
On 7/24/2020 12:07 PM, Liang, Kan wrote:
>
>
> On 7/24/2020 11:27 AM, [email protected] wrote:
>> On Fri, Jul 24, 2020 at 03:19:06PM +0200, [email protected] wrote:
>>> On Thu, Jul 23, 2020 at 10:11:11AM -0700, [email protected]
>>> wrote:
>>>> @@ -3375,6 +3428,72 @@ static int intel_pmu_hw_config(struct perf_event *event)
>>>> 	if (event->attr.type != PERF_TYPE_RAW)
>>>> 		return 0;
>>>> +	/*
>>>> +	 * Config Topdown slots and metric events
>>>> +	 *
>>>> +	 * The slots event on Fixed Counter 3 can support sampling,
>>>> +	 * which will be handled normally in x86_perf_event_update().
>>>> +	 *
>>>> +	 * The metric events don't support sampling.
>>>> +	 *
>>>> +	 * For counting, topdown slots and metric events will be
>>>> +	 * handled specially for event update.
>>>> +	 * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
>>>> +	 */
>>>> +	if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
>>>> +		if (is_metric_event(event)) {
>>>> +			struct perf_event *leader = event->group_leader;
>>>> +			struct perf_event *sibling;
>>>> +
>>>> +			/* The metric events don't support sampling. */
>>>> +			if (is_sampling_event(event))
>>>> +				return -EINVAL;
>>>> +
>>>> +			/* The metric events cannot be a group leader. */
>>>> +			if (leader == event)
>>>> +				return -EINVAL;
>>>> +
>>>> +			/*
>>>> +			 * The slots event cannot be the leader of a topdown
>>>> +			 * sample-read group, e.g., {slots, topdown-retiring}:S
>>>> +			 */
>>>> +			if (is_slots_event(leader) && is_sampling_event(leader))
>>>> +				return -EINVAL;
>>>
>>> This has nothing to do with sample-read; SLOTS cannot be sampling when
>>> coupled with the METRIC stuff because hardware is daft.
>>>
>>> And you can have SAMPLE_READ on non-leader events just fine.
>>>
>>>> +
>>>> +			/*
>>>> +			 * The slots event must be before the metric events,
>>>> +			 * because we only update the values of a topdown
>>>> +			 * group once with the slots event.
>>>> +			 */
>>>> +			if (!is_slots_event(leader)) {
>>>> +				for_each_sibling_event(sibling, leader) {
>>>> +					if (is_slots_event(sibling))
>>>> +						break;
>>>> +					if (is_metric_event(sibling))
>>>> +						return -EINVAL;
>>>> +				}
>>>> +			}
>>>
>>> Per the SIBLING patch this then wants to be:
>>>
>>> 			if (!is_slots_event(leader))
>>> 				return -EINVAL;
>>>
>>> 			event->event_caps |= PERF_EV_CAP_SIBLING.
>>> 			/*
>>> 			 * Only once we have a METRICs sibling do we
>>> 			 * need TopDown magic.
>>> 			 */
>>> 			leader->hw.flags |= PERF_X86_EVENT_TOPDOWN;
Since we only set the flag for the SLOTS event now, the V7 patch will
treat the metric events as normal events, which triggers an error.
To fix the error, I think we may merge the changes as below with the
[PATCH 08/14] perf/x86/intel: Generic support for hardware TopDown metrics.
I think we don't need the PERF_X86_EVENT_TOPDOWN flag anymore.
If it's a non-sampling slots event, apply the special function.
If it's a metric event, do nothing.
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 0f3d01562ded..02dfee0b6615 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -73,10 +73,10 @@ u64 x86_perf_event_update(struct perf_event *event)
u64 prev_raw_count, new_raw_count;
u64 delta;
- if (unlikely(!hwc->event_base))
+ if (unlikely(!hwc->event_base || is_metric_event(event)))
return 0;
- if (unlikely(is_topdown_count(event)) && x86_pmu.update_topdown_event)
+ if (unlikely(is_slots_count(event)) && x86_pmu.update_topdown_event)
return x86_pmu.update_topdown_event(event);
/*
@@ -1280,11 +1280,10 @@ int x86_perf_event_set_period(struct perf_event
*event)
s64 period = hwc->sample_period;
int ret = 0, idx = hwc->idx;
- if (unlikely(!hwc->event_base))
+ if (unlikely(!hwc->event_base || is_metric_event(event)))
return 0;
- if (unlikely(is_topdown_count(event)) &&
- x86_pmu.set_topdown_event_period)
+ if (unlikely(is_slots_count(event)) && x86_pmu.set_topdown_event_period)
return x86_pmu.set_topdown_event_period(event);
/*
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 6cb079e0c9d9..010ac74afc09 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2405,27 +2405,18 @@ static u64 icl_update_topdown_event(struct
perf_event *event)
return slots;
}
-static void intel_pmu_read_topdown_event(struct perf_event *event)
+static void intel_pmu_read_event(struct perf_event *event)
{
- struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-
- /* Only need to call update_topdown_event() once for group read. */
- if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
- !is_slots_event(event))
+ if (unlikely(is_metric_event(event)))
return;
- perf_pmu_disable(event->pmu);
- x86_pmu.update_topdown_event(event);
- perf_pmu_enable(event->pmu);
-}
-
-static void intel_pmu_read_event(struct perf_event *event)
-{
if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
intel_pmu_auto_reload_read(event);
- else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
- intel_pmu_read_topdown_event(event);
- else
+ else if (is_slots_count(event) && x86_pmu.update_topdown_event) {
+ perf_pmu_disable(event->pmu);
+ x86_pmu.update_topdown_event(event);
+ perf_pmu_enable(event->pmu);
+ } else
x86_perf_event_update(event);
}
@@ -3606,12 +3597,14 @@ static int intel_pmu_hw_config(struct perf_event
*event)
*
* The slots event on Fixed Counter 3 can support sampling,
* which will be handled normally in x86_perf_event_update().
- *
* The metric events don't support sampling.
*
- * For counting, topdown slots and metric events will be
- * handled specially for event update.
- * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
+ * For the counting of the slots and metric events, the rules
+ * as below have to be applied:
+ * - A standalone metric event or pure metric events group is
+ * not supported.
+ * - The metric events group must include the slots event.
+ * - The slots event must be the leader of the group.
*/
if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
if (event->attr.config1 || event->attr.config2)
@@ -3647,11 +3640,6 @@ static int intel_pmu_hw_config(struct perf_event
*event)
return -EINVAL;
event->event_caps |= PERF_EV_CAP_SIBLING;
- /*
- * Only once we have a METRICs sibling do we
- * need TopDown magic.
- */
- leader->hw.flags |= PERF_X86_EVENT_TOPDOWN;
}
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 345442410a4d..153db209c105 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -79,12 +79,6 @@ static inline bool constraint_match(struct
event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
#define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */
#define PERF_X86_EVENT_LBR_SELECT 0x2000 /* Save/Restore MSR_LBR_SELECT */
-#define PERF_X86_EVENT_TOPDOWN 0x4000 /* Count Topdown slots/metrics
events */
-
-static inline bool is_topdown_count(struct perf_event *event)
-{
- return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
-}
static inline bool is_metric_event(struct perf_event *event)
{
@@ -105,6 +99,14 @@ static inline bool is_topdown_event(struct
perf_event *event)
return is_metric_event(event) || is_slots_event(event);
}
+static inline bool is_slots_count(struct perf_event *event)
+{
+ if (is_slots_event(event) && !is_sampling_event(event))
+ return true;
+
+ return false;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
Thanks,
Kan
>>>> +		}
>>>> +
>>>> +		if (!is_sampling_event(event)) {
>>>> +			if (event->attr.config1 != 0)
>>>> +				return -EINVAL;
>>>
>>> How does this depend on sampling?
>>>
>>>> +			/*
>>>> +			 * The TopDown metrics events and slots event don't
>>>> +			 * support any filters.
>>>> +			 */
>>>> +			if (event->attr.config & X86_ALL_EVENT_FLAGS)
>>>> +				return -EINVAL;
>>>
>>> That seems independent of sampling too. Even a sampling SLOTS shouldn't
>>> be having any of those afaict.
>>>
>>>> +
>>>> +			event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
>>>
>>> This is confusing too, a !sampling SLOTS event without METRIC siblings
>>> shouldn't have this set, right? So arguably, this should be like above.
>>>
>>>> +
>>>> +			event->event_caps |= PERF_EV_CAP_COEXIST;
>>>> +
>>>> +			if (is_metric_event(event))
>>>> +				event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
>>>
>>> This too seems like something that should be in the is_metric_event()
>>> branch above.
>>>
>>>> +		}
>>>> +	}
>>>> +
>>>> 	if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
>>>> 		return 0;
>>
>> FWIW, I pushed out a branch with all these changes in:
>>
>> Â Â git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git
>> perf/metric
>>
>> Just to get it some build love, if you want it differently, I'm happy to
>> throw it all out again.
>
> Thanks Peter.
>
> I will pull the branch and do more tests.
>
> Thanks,
> Kan
On Fri, Jul 24, 2020 at 03:10:52PM -0400, Liang, Kan wrote:
> > > > Per the SIBLING patch this then wants to be:
> > > >
> > > > 			if (!is_slots_event(leader))
> > > > 				return -EINVAL;
> > > >
> > > > 			event->event_caps |= PERF_EV_CAP_SIBLING;
> > > > 			/*
> > > > 			 * Only once we have a METRICs sibling do we
> > > > 			 * need TopDown magic.
> > > > 			 */
> > > > 			leader->hw.flags |= PERF_X86_EVENT_TOPDOWN;
>
> Since we only set the flag for the SLOTS event now, the V7 patch will treat
> the metric events as normal events, which trigger an error.
Damn, that was a silly oversight on my part.
> I think we don't need the PERF_X86_EVENT_TOPDOWN flag anymore.
> If it's a non-sampling slots event, apply the special function.
> If it's a metric event, do nothing.
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 0f3d01562ded..02dfee0b6615 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -73,10 +73,10 @@ u64 x86_perf_event_update(struct perf_event *event)
> u64 prev_raw_count, new_raw_count;
> u64 delta;
>
> - if (unlikely(!hwc->event_base))
> + if (unlikely(!hwc->event_base || is_metric_event(event)))
> return 0;
>
> - if (unlikely(is_topdown_count(event)) && x86_pmu.update_topdown_event)
> + if (unlikely(is_slots_count(event)) && x86_pmu.update_topdown_event)
> return x86_pmu.update_topdown_event(event);
>
> /*
> @@ -1280,11 +1280,10 @@ int x86_perf_event_set_period(struct perf_event
> *event)
> s64 period = hwc->sample_period;
> int ret = 0, idx = hwc->idx;
>
> - if (unlikely(!hwc->event_base))
> + if (unlikely(!hwc->event_base || is_metric_event(event)))
> return 0;
>
> - if (unlikely(is_topdown_count(event)) &&
> - x86_pmu.set_topdown_event_period)
> + if (unlikely(is_slots_count(event)) && x86_pmu.set_topdown_event_period)
> return x86_pmu.set_topdown_event_period(event);
>
> /*
This; I don't like that much, it adds even more conditions to fairly hot
code.
I was even considering adding a static_branch for
x86_pmu.intel_cap.perf_metrics.
Anyway, let me fix this.
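For illustration, a minimal sketch of what such a static_branch could look
like (the key name and its placement are hypothetical, not from any posted
patch):

#include <linux/jump_label.h>

/* hypothetical key, flipped once at init time */
DEFINE_STATIC_KEY_FALSE(perf_metrics_key);

	/* e.g. in intel_pmu_init(), after reading PERF_CAPABILITIES: */
	if (x86_pmu.intel_cap.perf_metrics)
		static_branch_enable(&perf_metrics_key);

	/* hot path: the test is patched to a plain NOP when the key is off */
	if (static_branch_unlikely(&perf_metrics_key) &&
	    unlikely(is_topdown_count(event)) && x86_pmu.update_topdown_event)
		return x86_pmu.update_topdown_event(event);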
On Fri, Jul 24, 2020 at 03:10:52PM -0400, Liang, Kan wrote:
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 6cb079e0c9d9..010ac74afc09 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2405,27 +2405,18 @@ static u64 icl_update_topdown_event(struct
> perf_event *event)
> return slots;
> }
>
> -static void intel_pmu_read_topdown_event(struct perf_event *event)
> +static void intel_pmu_read_event(struct perf_event *event)
> {
> - struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> -
> - /* Only need to call update_topdown_event() once for group read. */
> - if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
> - !is_slots_event(event))
> return;
>
> - perf_pmu_disable(event->pmu);
> - x86_pmu.update_topdown_event(event);
> - perf_pmu_enable(event->pmu);
> -}
> -
> -static void intel_pmu_read_event(struct perf_event *event)
> -{
> if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
> intel_pmu_auto_reload_read(event);
> - else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> - intel_pmu_read_topdown_event(event);
> - else
> + else if (is_slots_count(event) && x86_pmu.update_topdown_event) {
> + perf_pmu_disable(event->pmu);
> + x86_pmu.update_topdown_event(event);
> + perf_pmu_enable(event->pmu);
> + } else
> x86_perf_event_update(event);
> }
I'm a little puzzled by this; what happens if you:
fd = sys_perf_event_open(&attr_slots);
fd1 = sys_perf_event_open(&attr_metric, .group_fd=fd);
read(fd1);
?
On 7/28/2020 9:09 AM, Peter Zijlstra wrote:
> On Fri, Jul 24, 2020 at 03:10:52PM -0400, Liang, Kan wrote:
>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 6cb079e0c9d9..010ac74afc09 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2405,27 +2405,18 @@ static u64 icl_update_topdown_event(struct
>> perf_event *event)
>> return slots;
>> }
>>
>> -static void intel_pmu_read_topdown_event(struct perf_event *event)
>> +static void intel_pmu_read_event(struct perf_event *event)
>> {
>> - struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> -
>> - /* Only need to call update_topdown_event() once for group read. */
>> - if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
>> - !is_slots_event(event))
>> return;
>>
>> - perf_pmu_disable(event->pmu);
>> - x86_pmu.update_topdown_event(event);
>> - perf_pmu_enable(event->pmu);
>> -}
>> -
>> -static void intel_pmu_read_event(struct perf_event *event)
>> -{
>> if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
>> intel_pmu_auto_reload_read(event);
>> - else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
>> - intel_pmu_read_topdown_event(event);
>> - else
>> + else if (is_slots_count(event) && x86_pmu.update_topdown_event) {
>> + perf_pmu_disable(event->pmu);
>> + x86_pmu.update_topdown_event(event);
>> + perf_pmu_enable(event->pmu);
>> + } else
>> x86_perf_event_update(event);
>> }
>
> I'm a little puzzled by this; what happens if you:
>
> fd = sys_perf_event_open(&attr_slots);
> fd1 = sys_perf_event_open(&attr_metric, .group_fd=fd);
>
> read(fd1);
>
> ?
>
I did a quick test. The result depends on the .read_format of attr_metric.
If PERF_FORMAT_GROUP is applied for attr_metric, perf_read_group() will
be invoked and the value of fd1 is updated correctly.
If the flag is not applied, 0 will be returned. (A minimal userspace
sketch follows the __perf_read() excerpt below.)
static ssize_t
__perf_read(struct perf_event *event, char __user *buf, size_t count)
{
	u64 read_format = event->attr.read_format;
	int ret;

	/*
	 * Return end-of-file for a read on an event that is in
	 * error state (i.e. because it was pinned but it couldn't be
	 * scheduled on to the CPU at some point).
	 */
	if (event->state == PERF_EVENT_STATE_ERROR)
		return 0;

	if (count < event->read_size)
		return -ENOSPC;

	WARN_ON_ONCE(event->ctx->parent_ctx);
	if (read_format & PERF_FORMAT_GROUP)
		ret = perf_read_group(event, read_format, buf);
	else
		ret = perf_read_one(event, read_format, buf);

	return ret;
}
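As a minimal userspace sketch of the scenario above (error handling omitted;
the raw encodings are the ones exported by the Icelake patch later in this
thread, slots event=0x00,umask=0x4 and topdown-retiring event=0x00,umask=0x80):
with PERF_FORMAT_GROUP on the sibling the read() of fd1 returns the whole
group, without it the read() returns 0.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

static int perf_open(struct perf_event_attr *attr, int group_fd)
{
	return syscall(__NR_perf_event_open, attr, 0, -1, group_fd, 0);
}

int main(void)
{
	struct perf_event_attr slots = { 0 }, metric = { 0 };
	uint64_t buf[4];
	int fd, fd1;

	slots.type   = PERF_TYPE_RAW;
	slots.size   = sizeof(slots);
	slots.config = 0x0400;			/* TOPDOWN.SLOTS */

	metric.type        = PERF_TYPE_RAW;
	metric.size        = sizeof(metric);
	metric.config      = 0x8000;		/* topdown-retiring */
	metric.read_format = PERF_FORMAT_GROUP;	/* group read via the sibling */

	fd  = perf_open(&slots, -1);		/* slots must be the group leader */
	fd1 = perf_open(&metric, fd);		/* metric event as a sibling */

	/* ... run the workload to be measured ... */

	/* layout with PERF_FORMAT_GROUP only: buf[0]=nr, buf[1]=slots, buf[2]=retiring */
	read(fd1, buf, sizeof(buf));
	printf("slots=%llu retiring=%llu\n",
	       (unsigned long long)buf[1], (unsigned long long)buf[2]);
	return 0;
}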
Thanks,
Kan
On Tue, Jul 28, 2020 at 09:28:39AM -0400, Liang, Kan wrote:
>
>
> On 7/28/2020 9:09 AM, Peter Zijlstra wrote:
> > On Fri, Jul 24, 2020 at 03:10:52PM -0400, Liang, Kan wrote:
> >
> > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > index 6cb079e0c9d9..010ac74afc09 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -2405,27 +2405,18 @@ static u64 icl_update_topdown_event(struct
> > > perf_event *event)
> > > return slots;
> > > }
> > >
> > > -static void intel_pmu_read_topdown_event(struct perf_event *event)
> > > +static void intel_pmu_read_event(struct perf_event *event)
> > > {
> > > - struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > > -
> > > - /* Only need to call update_topdown_event() once for group read. */
> > > - if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
> > > - !is_slots_event(event))
> > > return;
> > >
> > > - perf_pmu_disable(event->pmu);
> > > - x86_pmu.update_topdown_event(event);
> > > - perf_pmu_enable(event->pmu);
> > > -}
> > > -
> > > -static void intel_pmu_read_event(struct perf_event *event)
> > > -{
> > > if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
> > > intel_pmu_auto_reload_read(event);
> > > - else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> > > - intel_pmu_read_topdown_event(event);
> > > - else
> > > + else if (is_slots_count(event) && x86_pmu.update_topdown_event) {
> > > + perf_pmu_disable(event->pmu);
> > > + x86_pmu.update_topdown_event(event);
> > > + perf_pmu_enable(event->pmu);
> > > + } else
> > > x86_perf_event_update(event);
> > > }
> >
> > I'm a little puzzled by this; what happens if you:
> >
> > fd = sys_perf_event_open(&attr_slots);
> > fd1 = sys_perf_event_open(&attr_metric, .group_fd=fd);
> >
> > read(fd1);
> >
> > ?
> >
>
> I did a quick test. It depends on the .read_format of attr_metric.
> If PERF_FORMAT_GROUP is applied for attr_metric, perf_read_group() will be
> invoked. The value of fd1 is updated correctly.
> If the flag is not applied, 0 will be returned.
Exactly :-), was that intentional? Because prior to this change that
would've worked just fine.
Now, I agree it's a daft thing, because that value is pretty useless
without also having the slots value, but I feel we should be explicit in
our choices here.
If for example, we would've had hardware provide us the raw metric
counters, instead of us having to reconstruct them, this would've been a
semi useful thing.
So I'm tempted to leave things as is, and keep this 'working'.
On 7/28/2020 9:44 AM, [email protected] wrote:
> On Tue, Jul 28, 2020 at 09:28:39AM -0400, Liang, Kan wrote:
>>
>>
>> On 7/28/2020 9:09 AM, Peter Zijlstra wrote:
>>> On Fri, Jul 24, 2020 at 03:10:52PM -0400, Liang, Kan wrote:
>>>
>>>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>>>> index 6cb079e0c9d9..010ac74afc09 100644
>>>> --- a/arch/x86/events/intel/core.c
>>>> +++ b/arch/x86/events/intel/core.c
>>>> @@ -2405,27 +2405,18 @@ static u64 icl_update_topdown_event(struct
>>>> perf_event *event)
>>>> return slots;
>>>> }
>>>>
>>>> -static void intel_pmu_read_topdown_event(struct perf_event *event)
>>>> +static void intel_pmu_read_event(struct perf_event *event)
>>>> {
>>>> - struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>>> -
>>>> - /* Only need to call update_topdown_event() once for group read. */
>>>> - if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
>>>> - !is_slots_event(event))
>>>> return;
>>>>
>>>> - perf_pmu_disable(event->pmu);
>>>> - x86_pmu.update_topdown_event(event);
>>>> - perf_pmu_enable(event->pmu);
>>>> -}
>>>> -
>>>> -static void intel_pmu_read_event(struct perf_event *event)
>>>> -{
>>>> if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
>>>> intel_pmu_auto_reload_read(event);
>>>> - else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
>>>> - intel_pmu_read_topdown_event(event);
>>>> - else
>>>> + else if (is_slots_count(event) && x86_pmu.update_topdown_event) {
>>>> + perf_pmu_disable(event->pmu);
>>>> + x86_pmu.update_topdown_event(event);
>>>> + perf_pmu_enable(event->pmu);
>>>> + } else
>>>> x86_perf_event_update(event);
>>>> }
>>>
>>> I'm a little puzzled by this; what happens if you:
>>>
>>> fd = sys_perf_event_open(&attr_slots);
>>> fd1 = sys_perf_event_open(&attr_metric, .group_fd=fd);
>>>
>>> read(fd1);
>>>
>>> ?
>>>
>>
>> I did a quick test. It depends on the .read_format of attr_metric.
>> If PERF_FORMAT_GROUP is applied for attr_metric, perf_read_group() will be
>> invoked. The value of fd1 is updated correctly.
>> If the flag is not applied, 0 will be returned.
>
> Exactly :-), was that intentional?
Kind of, because a metric event must be in a group with the leader slots
event. If a user reads (treats) the metric event as a singleton event,
an invalid value should be expected.
> Because prior to this change that
> would've worked just fine.
>
> Now, I agree it's a daft thing, because that value is pretty useless
> without also having the slots value, but I feel we should be explicit in
> our choices here.
>
> If for example, we would've had hardware provide us the raw metric
> counters, instead of us having to reconstruct them, this would've been a
> semi useful thing.
>
> So I'm tempted to leave things as it, and keep this 'working'.
I will update the perf tool documentation to force the PERF_FORMAT_GROUP
flag for each metric event.
Thanks,
Kan
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 6f7225099d5f3ec3019f380a0da2b456b7796cb0
Gitweb: https://git.kernel.org/tip/6f7225099d5f3ec3019f380a0da2b456b7796cb0
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:06 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:35 +02:00
perf/x86/intel: Introduce the fourth fixed counter
The fourth fixed counter, TOPDOWN.SLOTS, is introduced in Ice Lake to
measure the level 1 TopDown events.
Add MSR address and macros for the new fixed counter, which will be used
in a later patch.
Add comments to explain the event encoding rules for the fixed counters.
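As an illustration of those rules (the encodings below match the Icelake
event constraint table later in this series):

  Fixed Counter 0: INST_RETIRED.ANY       -> real encoding   0x00c0
  Fixed Counter 1: CPU_CLK_UNHALTED.CORE  -> real encoding   0x003c
  Fixed Counter 2: CPU_CLK_UNHALTED.REF   -> pseudo encoding 0x0300 (event=0x00,umask=0x3)
  Fixed Counter 3: TOPDOWN.SLOTS          -> pseudo encoding 0x0400 (event=0x00,umask=0x4)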
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/perf_event.h | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index fd3eba6..fe8110a 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -197,12 +197,24 @@ struct x86_pmu_capability {
*/
/*
- * All 3 fixed-mode PMCs are configured via this single MSR:
+ * All the fixed-mode PMCs are configured via this single MSR:
*/
#define MSR_ARCH_PERFMON_FIXED_CTR_CTRL 0x38d
/*
- * The counts are available in three separate MSRs:
+ * There is no event-code assigned to the fixed-mode PMCs.
+ *
+ * For a fixed-mode PMC, which has an equivalent event on a general-purpose
+ * PMC, the event-code of the equivalent event is used for the fixed-mode PMC,
+ * e.g., Instr_Retired.Any and CPU_CLK_Unhalted.Core.
+ *
+ * For a fixed-mode PMC, which doesn't have an equivalent event, a
+ * pseudo-encoding is used, e.g., CPU_CLK_Unhalted.Ref and TOPDOWN.SLOTS.
+ * The pseudo event-code for a fixed-mode PMC must be 0x00.
+ * The pseudo umask-code is 0xX. The X equals the index of the fixed
+ * counter + 1, e.g., the fixed counter 2 has the pseudo-encoding 0x0300.
+ *
+ * The counts are available in separate MSRs:
*/
/* Instr_Retired.Any: */
@@ -213,11 +225,16 @@ struct x86_pmu_capability {
#define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
#define INTEL_PMC_IDX_FIXED_CPU_CYCLES (INTEL_PMC_IDX_FIXED + 1)
-/* CPU_CLK_Unhalted.Ref: */
+/* CPU_CLK_Unhalted.Ref: event=0x00,umask=0x3 (pseudo-encoding) */
#define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
#define INTEL_PMC_IDX_FIXED_REF_CYCLES (INTEL_PMC_IDX_FIXED + 2)
#define INTEL_PMC_MSK_FIXED_REF_CYCLES (1ULL << INTEL_PMC_IDX_FIXED_REF_CYCLES)
+/* TOPDOWN.SLOTS: event=0x00,umask=0x4 (pseudo-encoding) */
+#define MSR_ARCH_PERFMON_FIXED_CTR3 0x30c
+#define INTEL_PMC_IDX_FIXED_SLOTS (INTEL_PMC_IDX_FIXED + 3)
+#define INTEL_PMC_MSK_FIXED_SLOTS (1ULL << INTEL_PMC_IDX_FIXED_SLOTS)
+
/*
* We model BTS tracing as another fixed-mode PMC.
*
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 58da7dbe6f036fefe504a4bb452afbd39bba73f7
Gitweb: https://git.kernel.org/tip/58da7dbe6f036fefe504a4bb452afbd39bba73f7
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:09 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:36 +02:00
perf/x86/intel: Use switch in intel_pmu_disable/enable_event
Currently, an if-else chain is used in intel_pmu_disable/enable_event to
check the type of an event. It works well, but with more and more types
added later, e.g., perf metrics, the if-else may impair the readability
of the code compared to a switch statement.
There is no harm in using a switch statement to replace the if-else
here. Also, some optimizing compilers may compile a switch statement
into a jump table, which is more efficient than if-else for a large
number of cases. The performance gain may not be observable for now,
because the number of cases is only 5, but the benefit should grow as
more types are added in the future.
Use switch to replace the if-else in the intel_pmu_disable/enable_event.
If the idx is invalid, print a warning.
For the INTEL_PMC_IDX_FIXED_BTS case in intel_pmu_disable_event, there is
no need to check event->attr.precise_ip. Use return for that case.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/intel/core.c | 36 +++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ac1408f..76eab81 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2180,17 +2180,28 @@ static void intel_pmu_disable_event(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
- if (idx < INTEL_PMC_IDX_FIXED) {
+ switch (idx) {
+ case 0 ... INTEL_PMC_IDX_FIXED - 1:
intel_clear_masks(event, idx);
x86_pmu_disable_event(event);
- } else if (idx < INTEL_PMC_IDX_FIXED_BTS) {
+ break;
+ case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
intel_clear_masks(event, idx);
intel_pmu_disable_fixed(event);
- } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
+ break;
+ case INTEL_PMC_IDX_FIXED_BTS:
intel_pmu_disable_bts();
intel_pmu_drain_bts_buffer();
- } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
+ return;
+ case INTEL_PMC_IDX_FIXED_VLBR:
intel_clear_masks(event, idx);
+ break;
+ default:
+ intel_clear_masks(event, idx);
+ pr_warn("Failed to disable the event with invalid index %d\n",
+ idx);
+ return;
+ }
/*
* Needs to be called after x86_pmu_disable_event,
@@ -2262,18 +2273,27 @@ static void intel_pmu_enable_event(struct perf_event *event)
if (unlikely(event->attr.precise_ip))
intel_pmu_pebs_enable(event);
- if (idx < INTEL_PMC_IDX_FIXED) {
+ switch (idx) {
+ case 0 ... INTEL_PMC_IDX_FIXED - 1:
intel_set_masks(event, idx);
__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
- } else if (idx < INTEL_PMC_IDX_FIXED_BTS) {
+ break;
+ case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
intel_set_masks(event, idx);
intel_pmu_enable_fixed(event);
- } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
+ break;
+ case INTEL_PMC_IDX_FIXED_BTS:
if (!__this_cpu_read(cpu_hw_events.enabled))
return;
intel_pmu_enable_bts(hwc->config);
- } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
+ break;
+ case INTEL_PMC_IDX_FIXED_VLBR:
intel_set_masks(event, idx);
+ break;
+ default:
+ pr_warn("Failed to enable the event with invalid index %d\n",
+ idx);
+ }
}
static void intel_pmu_add_event(struct perf_event *event)
The following commit has been merged into the perf/core branch of tip:
Commit-ID: bbdbde2a415d9f479803266cae6fb0c1a9f6c80e
Gitweb: https://git.kernel.org/tip/bbdbde2a415d9f479803266cae6fb0c1a9f6c80e
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:08 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:35 +02:00
perf/x86/intel: Fix the name of perf METRICS
Bit 15 of the PERF_CAPABILITIES MSR indicates that the perf METRICS
feature is supported. The perf METRICS is not a PEBS feature.
Rename pebs_metrics_available to perf_metrics.
The bit is not used in the current code. It will be used in a later
patch.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/perf_event.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7b68ab5..5d453da 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -537,7 +537,7 @@ union perf_capabilities {
*/
u64 full_width_write:1;
u64 pebs_baseline:1;
- u64 pebs_metrics_available:1;
+ u64 perf_metrics:1;
u64 pebs_output_pt_available:1;
};
u64 capabilities;
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 7b2c05a15d29d0570a0d21da1e4fd5cbc85cbf13
Gitweb: https://git.kernel.org/tip/7b2c05a15d29d0570a0d21da1e4fd5cbc85cbf13
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:11 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:36 +02:00
perf/x86/intel: Generic support for hardware TopDown metrics
Intro
=====
The TopDown Microarchitecture Analysis (TMA) Method is a structured
analysis methodology to identify critical performance bottlenecks in
out-of-order processors. Current perf already supports the method.
The method works well, but there is one problem. To collect the TopDown
events, several GP counters have to be used. If a user wants to collect
other events at the same time, multiplexing will probably be triggered,
which impacts the accuracy.
To free up the scarce GP counters, the hardware TopDown metrics feature
is introduced from Ice Lake. The hardware implements an additional
"metrics" register and a new Fixed Counter 3 that measures pipeline
"slots". The TopDown events can be calculated from them instead.
Events
======
The level 1 TopDown has four metrics. There is no event-code assigned to
the TopDown metrics. Four metric events are exported as separate perf
events, which map to the internal "metrics" counter register. Those
events do not exist in hardware, but can be allocated by the scheduler.
For the event mapping, a special 0x00 event code is used, which is
reserved for fake events. The metric events start from umask 0x80.
When setting up the metric events, they point to the Fixed Counter 3.
They have to be specially handled.
- Add the update_topdown_event() callback to read the additional metrics
MSR and generate the metrics.
- Add the set_topdown_event_period() callback to initialize metrics MSR
and the fixed counter 3.
- Add a variable n_metric to track the number of accepted metrics
events. The sharing between multiple users of the same metric without
multiplexing is not allowed.
- Only enable/disable the fixed counter 3 when there are no other active
TopDown events, which avoids unnecessary writes to the fixed control
register.
- Disable the PMU when reading the metrics event. The metrics MSR and
the fixed counter 3 are read separately. The values may be modified by
an NMI.
All four metric events don't support sampling. Since they will be
handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is
introduced to indicate this case.
The slots event can support both sampling and counting.
For counting, the flag is also applied.
For sampling, it will be handled normally as other normal events.
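For reference, these are the encodings that the Icelake patch later in this
series exports through sysfs (a pseudo event code of 0x00 for all of them,
umask 0x4 for the slots fixed counter, umask 0x80-0x83 for the metrics):

  slots             event=0x00,umask=0x4
  topdown-retiring  event=0x00,umask=0x80
  topdown-bad-spec  event=0x00,umask=0x81
  topdown-fe-bound  event=0x00,umask=0x82
  topdown-be-bound  event=0x00,umask=0x83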
Groups
======
The slots event is required in a Topdown group.
To avoid reading the METRICS register multiple times, the metrics and
slots value can only be updated by slots event in a group.
All active slots and metrics events will be updated one time.
Therefore, the slots event must be before any metric events in a Topdown
group.
NMI
======
The METRICS-related register may overflow. Bit 48 of the STATUS
register will be set. If so, PERF_METRICS and Fixed counter 3 are
required to be reset. The patch also updates all active slots and
metrics events in the NMI handler.
The update_topdown_event() has to read two registers separately. The
values may be modified by an NMI. PMU has to be disabled before calling
the function.
RDPMC
======
RDPMC is temporarily disabled. A later patch will enable it.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/core.c | 63 ++++++++++++---
arch/x86/events/intel/core.c | 124 +++++++++++++++++++++++++++--
arch/x86/events/perf_event.h | 37 +++++++++-
arch/x86/include/asm/msr-index.h | 1 +-
arch/x86/include/asm/perf_event.h | 47 +++++++++++-
5 files changed, 257 insertions(+), 15 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8e108ea..53fcf0a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -76,6 +76,9 @@ u64 x86_perf_event_update(struct perf_event *event)
if (unlikely(!hwc->event_base))
return 0;
+ if (unlikely(is_topdown_count(event)) && x86_pmu.update_topdown_event)
+ return x86_pmu.update_topdown_event(event);
+
/*
* Careful: an NMI might modify the previous event value.
*
@@ -1031,6 +1034,42 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
return unsched ? -EINVAL : 0;
}
+static int add_nr_metric_event(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ if (is_metric_event(event)) {
+ if (cpuc->n_metric == INTEL_TD_METRIC_NUM)
+ return -EINVAL;
+ cpuc->n_metric++;
+ }
+
+ return 0;
+}
+
+static void del_nr_metric_event(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ if (is_metric_event(event))
+ cpuc->n_metric--;
+}
+
+static int collect_event(struct cpu_hw_events *cpuc, struct perf_event *event,
+ int max_count, int n)
+{
+
+ if (x86_pmu.intel_cap.perf_metrics && add_nr_metric_event(cpuc, event))
+ return -EINVAL;
+
+ if (n >= max_count + cpuc->n_metric)
+ return -EINVAL;
+
+ cpuc->event_list[n] = event;
+ if (is_counter_pair(&event->hw))
+ cpuc->n_pair++;
+
+ return 0;
+}
+
/*
* dogrp: true if must collect siblings events (group)
* returns total number of events and error code
@@ -1067,28 +1106,22 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
}
if (is_x86_event(leader)) {
- if (n >= max_count)
+ if (collect_event(cpuc, leader, max_count, n))
return -EINVAL;
- cpuc->event_list[n] = leader;
n++;
- if (is_counter_pair(&leader->hw))
- cpuc->n_pair++;
}
+
if (!dogrp)
return n;
for_each_sibling_event(event, leader) {
- if (!is_x86_event(event) ||
- event->state <= PERF_EVENT_STATE_OFF)
+ if (!is_x86_event(event) || event->state <= PERF_EVENT_STATE_OFF)
continue;
- if (n >= max_count)
+ if (collect_event(cpuc, event, max_count, n))
return -EINVAL;
- cpuc->event_list[n] = event;
n++;
- if (is_counter_pair(&event->hw))
- cpuc->n_pair++;
}
return n;
}
@@ -1110,6 +1143,10 @@ static inline void x86_assign_hw_event(struct perf_event *event,
hwc->event_base = 0;
break;
+ case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
+ /* All the metric events are mapped onto the fixed counter 3. */
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ /* fall through */
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS-1:
hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
@@ -1245,6 +1282,10 @@ int x86_perf_event_set_period(struct perf_event *event)
if (unlikely(!hwc->event_base))
return 0;
+ if (unlikely(is_topdown_count(event)) &&
+ x86_pmu.set_topdown_event_period)
+ return x86_pmu.set_topdown_event_period(event);
+
/*
* If we are way outside a reasonable range then just skip forward:
*/
@@ -1529,6 +1570,8 @@ static void x86_pmu_del(struct perf_event *event, int flags)
}
cpuc->event_constraint[i-1] = NULL;
--cpuc->n_events;
+ if (x86_pmu.intel_cap.perf_metrics)
+ del_nr_metric_event(cpuc, event);
perf_event_update_userpage(event);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 76eab81..4a43668 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2165,11 +2165,24 @@ static inline void intel_clear_masks(struct perf_event *event, int idx)
static void intel_pmu_disable_fixed(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
u64 ctrl_val, mask;
+ int idx = hwc->idx;
- mask = 0xfULL << (idx * 4);
+ if (is_topdown_idx(idx)) {
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /*
+ * When there are other active TopDown events,
+ * don't disable the fixed counter 3.
+ */
+ if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
+ return;
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ }
+ intel_clear_masks(event, idx);
+
+ mask = 0xfULL << ((idx - INTEL_PMC_IDX_FIXED) * 4);
rdmsrl(hwc->config_base, ctrl_val);
ctrl_val &= ~mask;
wrmsrl(hwc->config_base, ctrl_val);
@@ -2186,7 +2199,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
x86_pmu_disable_event(event);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
- intel_clear_masks(event, idx);
+ case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_disable_fixed(event);
break;
case INTEL_PMC_IDX_FIXED_BTS:
@@ -2219,10 +2232,26 @@ static void intel_pmu_del_event(struct perf_event *event)
intel_pmu_pebs_del(event);
}
+static void intel_pmu_read_topdown_event(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /* Only need to call update_topdown_event() once for group read. */
+ if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
+ !is_slots_event(event))
+ return;
+
+ perf_pmu_disable(event->pmu);
+ x86_pmu.update_topdown_event(event);
+ perf_pmu_enable(event->pmu);
+}
+
static void intel_pmu_read_event(struct perf_event *event)
{
if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
intel_pmu_auto_reload_read(event);
+ else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
+ intel_pmu_read_topdown_event(event);
else
x86_perf_event_update(event);
}
@@ -2230,8 +2259,22 @@ static void intel_pmu_read_event(struct perf_event *event)
static void intel_pmu_enable_fixed(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
u64 ctrl_val, mask, bits = 0;
+ int idx = hwc->idx;
+
+ if (is_topdown_idx(idx)) {
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ /*
+ * When there are other active TopDown events,
+ * don't enable the fixed counter 3 again.
+ */
+ if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
+ return;
+
+ idx = INTEL_PMC_IDX_FIXED_SLOTS;
+ }
+
+ intel_set_masks(event, idx);
/*
* Enable IRQ generation (0x8), if not PEBS,
@@ -2251,6 +2294,7 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
if (x86_pmu.version > 2 && hwc->config & ARCH_PERFMON_EVENTSEL_ANY)
bits |= 0x4;
+ idx -= INTEL_PMC_IDX_FIXED;
bits <<= (idx * 4);
mask = 0xfULL << (idx * 4);
@@ -2279,7 +2323,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
- intel_set_masks(event, idx);
+ case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_enable_fixed(event);
break;
case INTEL_PMC_IDX_FIXED_BTS:
@@ -2440,6 +2484,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
}
/*
+ * Intel Perf metrics
+ */
+ if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
+ handled++;
+ if (x86_pmu.update_topdown_event)
+ x86_pmu.update_topdown_event(NULL);
+ }
+
+ /*
* Checkpointed counters can lead to 'spurious' PMIs because the
* rollback caused by the PMI will have cleared the overflow status
* bit. Therefore always force probe these counters.
@@ -3375,6 +3428,58 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.type != PERF_TYPE_RAW)
return 0;
+ /*
+ * Config Topdown slots and metric events
+ *
+ * The slots event on Fixed Counter 3 can support sampling,
+ * which will be handled normally in x86_perf_event_update().
+ *
+ * Metric events don't support sampling and require being paired
+ * with a slots event as group leader. When the slots event
+ * is used in a metrics group, it too cannot support sampling.
+ */
+ if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
+ if (event->attr.config1 || event->attr.config2)
+ return -EINVAL;
+
+ /*
+ * The TopDown metrics events and slots event don't
+ * support any filters.
+ */
+ if (event->attr.config & X86_ALL_EVENT_FLAGS)
+ return -EINVAL;
+
+ if (is_metric_event(event)) {
+ struct perf_event *leader = event->group_leader;
+
+ /* The metric events don't support sampling. */
+ if (is_sampling_event(event))
+ return -EINVAL;
+
+ /* The metric events require a slots group leader. */
+ if (!is_slots_event(leader))
+ return -EINVAL;
+
+ /*
+ * The leader/SLOTS must not be a sampling event for
+ * metric use; hardware requires it starts at 0 when used
+ * in conjunction with MSR_PERF_METRICS.
+ */
+ if (is_sampling_event(leader))
+ return -EINVAL;
+
+ event->event_caps |= PERF_EV_CAP_SIBLING;
+ /*
+ * Only once we have a METRICs sibling do we
+ * need TopDown magic.
+ */
+ leader->hw.flags |= PERF_X86_EVENT_TOPDOWN;
+ event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
+
+ event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
+ }
+ }
+
if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
return 0;
@@ -5218,6 +5323,15 @@ __init int intel_pmu_init(void)
* counter, so do not extend mask to generic counters
*/
for_each_event_constraint(c, x86_pmu.event_constraints) {
+ /*
+ * Don't extend the topdown slots and metrics
+ * events to the generic counters.
+ */
+ if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) {
+ c->weight = hweight64(c->idxmsk64);
+ continue;
+ }
+
if (c->cmask == FIXED_EVENT_FLAGS
&& c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES) {
c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5d453da..dea009c 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -79,6 +79,31 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
#define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */
#define PERF_X86_EVENT_LBR_SELECT 0x2000 /* Save/Restore MSR_LBR_SELECT */
+#define PERF_X86_EVENT_TOPDOWN 0x4000 /* Count Topdown slots/metrics events */
+
+static inline bool is_topdown_count(struct perf_event *event)
+{
+ return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
+}
+
+static inline bool is_metric_event(struct perf_event *event)
+{
+ u64 config = event->attr.config;
+
+ return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
+ ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) &&
+ ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_MAX);
+}
+
+static inline bool is_slots_event(struct perf_event *event)
+{
+ return (event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_TD_SLOTS;
+}
+
+static inline bool is_topdown_event(struct perf_event *event)
+{
+ return is_metric_event(event) || is_slots_event(event);
+}
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -285,6 +310,12 @@ struct cpu_hw_events {
u64 tfa_shadow;
/*
+ * Perf Metrics
+ */
+ /* number of accepted metrics events */
+ int n_metric;
+
+ /*
* AMD specific bits
*/
struct amd_nb *amd_nb;
@@ -727,6 +758,12 @@ struct x86_pmu {
atomic_t lbr_exclusive[x86_lbr_exclusive_max];
/*
+ * Intel perf metrics
+ */
+ u64 (*update_topdown_event)(struct perf_event *event);
+ int (*set_topdown_event_period)(struct perf_event *event);
+
+ /*
* perf task context (i.e. struct perf_event_context::task_ctx_data)
* switch helper to bridge calls from perf/core to perf/x86.
* See struct pmu::swap_task_ctx() usage for examples;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2859ee4..fabb61c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -857,6 +857,7 @@
#define MSR_CORE_PERF_FIXED_CTR0 0x00000309
#define MSR_CORE_PERF_FIXED_CTR1 0x0000030a
#define MSR_CORE_PERF_FIXED_CTR2 0x0000030b
+#define MSR_CORE_PERF_FIXED_CTR3 0x0000030c
#define MSR_CORE_PERF_FIXED_CTR_CTRL 0x0000038d
#define MSR_CORE_PERF_GLOBAL_STATUS 0x0000038e
#define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 58419e5..000cab7 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -244,6 +244,52 @@ struct x86_pmu_capability {
*/
#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 15)
+/*
+ * The PERF_METRICS MSR is modeled as several magic fixed-mode PMCs, one for
+ * each TopDown metric event.
+ *
+ * Internally the TopDown metric events are mapped to the FxCtr 3 (SLOTS).
+ */
+#define INTEL_PMC_IDX_METRIC_BASE (INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_TD_RETIRING (INTEL_PMC_IDX_METRIC_BASE + 0)
+#define INTEL_PMC_IDX_TD_BAD_SPEC (INTEL_PMC_IDX_METRIC_BASE + 1)
+#define INTEL_PMC_IDX_TD_FE_BOUND (INTEL_PMC_IDX_METRIC_BASE + 2)
+#define INTEL_PMC_IDX_TD_BE_BOUND (INTEL_PMC_IDX_METRIC_BASE + 3)
+#define INTEL_PMC_IDX_METRIC_END INTEL_PMC_IDX_TD_BE_BOUND
+#define INTEL_PMC_MSK_TOPDOWN ((0xfull << INTEL_PMC_IDX_METRIC_BASE) | \
+ INTEL_PMC_MSK_FIXED_SLOTS)
+
+/*
+ * There is no event-code assigned to the TopDown events.
+ *
+ * For the slots event, use the pseudo code of the fixed counter 3.
+ *
+ * For the metric events, the pseudo event-code is 0x00.
+ * The pseudo umask-code starts from the middle of the pseudo event
+ * space, 0x80.
+ */
+#define INTEL_TD_SLOTS 0x0400 /* TOPDOWN.SLOTS */
+/* Level 1 metrics */
+#define INTEL_TD_METRIC_RETIRING 0x8000 /* Retiring metric */
+#define INTEL_TD_METRIC_BAD_SPEC 0x8100 /* Bad speculation metric */
+#define INTEL_TD_METRIC_FE_BOUND 0x8200 /* FE bound metric */
+#define INTEL_TD_METRIC_BE_BOUND 0x8300 /* BE bound metric */
+#define INTEL_TD_METRIC_MAX INTEL_TD_METRIC_BE_BOUND
+#define INTEL_TD_METRIC_NUM 4
+
+static inline bool is_metric_idx(int idx)
+{
+ return (unsigned)(idx - INTEL_PMC_IDX_METRIC_BASE) < INTEL_TD_METRIC_NUM;
+}
+
+static inline bool is_topdown_idx(int idx)
+{
+ return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS;
+}
+
+#define INTEL_PMC_OTHER_TOPDOWN_BITS(bit) \
+ (~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN)
+
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
#define GLOBAL_STATUS_BUFFER_OVF_BIT 62
#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(GLOBAL_STATUS_BUFFER_OVF_BIT)
@@ -254,6 +300,7 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
+#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
/*
* We model guest LBR event tracing as another fixed-mode PMC like BTS.
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 2cb5383b30d47c446ec7d884cd80f93ffcc31817
Gitweb: https://git.kernel.org/tip/2cb5383b30d47c446ec7d884cd80f93ffcc31817
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:14 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:37 +02:00
perf/x86/intel: Support per-thread RDPMC TopDown metrics
Starting from Ice Lake, the TopDown metrics are directly available as
fixed counters and do not require generic counters. Also, the TopDown
metrics can be collected per thread. Extend the RDPMC usage to support
per-thread TopDown metrics.
The RDPMC index of the PERF_METRICS will be output if RDPMC users ask
for the RDPMC index of the metrics events.
To support per-thread RDPMC TopDown, the metrics and slots counters have
to be saved/restored during context switching.
The last_period and period_left are not used in the counting mode. Use
the fields for saved_metric and saved_slots.
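A minimal sketch of the userspace side, following the usual self-monitoring
pattern around the mmap'ed perf_event_mmap_page (pc below is the first page
of an mmap() of the event fd; sign extension by pc->pmc_width and error
handling are omitted for brevity):

#include <linux/perf_event.h>
#include <stdint.h>

static inline uint64_t rdpmc(uint32_t counter)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
	return (uint64_t)hi << 32 | lo;
}

/* Read an event's value without a syscall, if the kernel allows RDPMC. */
static uint64_t read_event(struct perf_event_mmap_page *pc)
{
	uint64_t count;
	uint32_t seq, idx;

	do {
		seq = pc->lock;
		asm volatile("" ::: "memory");	/* barrier() */
		idx   = pc->index;		/* 0 means RDPMC is not usable */
		count = pc->offset;
		if (pc->cap_user_rdpmc && idx) {
			/*
			 * For a metric event, idx - 1 is the PERF_METRICS
			 * RDPMC alias (bit 29 set), so this reads the raw
			 * metrics value.
			 */
			count += rdpmc(idx - 1);
		}
		asm volatile("" ::: "memory");
	} while (pc->lock != seq);

	return count;
}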
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/core.c | 5 +-
arch/x86/events/intel/core.c | 90 ++++++++++++++++++++++++++++++-----
include/linux/perf_event.h | 29 +++++++----
3 files changed, 102 insertions(+), 22 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ebf723f..0f3d015 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2257,7 +2257,10 @@ static int x86_pmu_event_idx(struct perf_event *event)
if (!(hwc->flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return 0;
- return hwc->event_base_rdpmc + 1;
+ if (is_metric_idx(hwc->idx))
+ return INTEL_PMC_FIXED_RDPMC_METRICS + 1;
+ else
+ return hwc->event_base_rdpmc + 1;
}
static ssize_t get_attr_rdpmc(struct device *cdev,
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index db83334..c72e490 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2258,7 +2258,13 @@ static int icl_set_topdown_event_period(struct perf_event *event)
if (left == x86_pmu.max_period) {
wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
wrmsrl(MSR_PERF_METRICS, 0);
- local64_set(&hwc->period_left, 0);
+ hwc->saved_slots = 0;
+ hwc->saved_metric = 0;
+ }
+
+ if ((hwc->saved_slots) && is_slots_event(event)) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, hwc->saved_slots);
+ wrmsrl(MSR_PERF_METRICS, hwc->saved_metric);
}
perf_event_update_userpage(event);
@@ -2279,7 +2285,7 @@ static inline u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
return mul_u64_u32_div(slots, val, 0xff);
}
-static void __icl_update_topdown_event(struct perf_event *event,
+static u64 icl_get_topdown_value(struct perf_event *event,
u64 slots, u64 metrics)
{
int idx = event->hw.idx;
@@ -2290,7 +2296,50 @@ static void __icl_update_topdown_event(struct perf_event *event,
else
delta = slots;
- local64_add(delta, &event->count);
+ return delta;
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+ u64 slots, u64 metrics,
+ u64 last_slots, u64 last_metrics)
+{
+ u64 delta, last = 0;
+
+ delta = icl_get_topdown_value(event, slots, metrics);
+ if (last_slots)
+ last = icl_get_topdown_value(event, last_slots, last_metrics);
+
+ /*
+ * The 8bit integer fraction of metric may not be accurate,
+ * especially when the change is very small.
+ * For example, if only a few bad_spec events happen, the fraction
+ * may be reduced from 1 to 0. If so, the bad_spec event value
+ * will be 0 which is definitely less than the last value.
+ * Avoid updating event->count for this case.
+ */
+ if (delta > last) {
+ delta -= last;
+ local64_add(delta, &event->count);
+ }
+}
+
+static void update_saved_topdown_regs(struct perf_event *event,
+ u64 slots, u64 metrics)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *other;
+ int idx;
+
+ event->hw.saved_slots = slots;
+ event->hw.saved_metric = metrics;
+
+ for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+ if (!is_topdown_idx(idx))
+ continue;
+ other = cpuc->events[idx];
+ other->hw.saved_slots = slots;
+ other->hw.saved_metric = metrics;
+ }
}
/*
@@ -2304,6 +2353,7 @@ static u64 icl_update_topdown_event(struct perf_event *event)
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_event *other;
u64 slots, metrics;
+ bool reset = true;
int idx;
/* read Fixed counter 3 */
@@ -2318,19 +2368,39 @@ static u64 icl_update_topdown_event(struct perf_event *event)
if (!is_topdown_idx(idx))
continue;
other = cpuc->events[idx];
- __icl_update_topdown_event(other, slots, metrics);
+ __icl_update_topdown_event(other, slots, metrics,
+ event ? event->hw.saved_slots : 0,
+ event ? event->hw.saved_metric : 0);
}
/*
* Check and update this event, which may have been cleared
* in active_mask e.g. x86_pmu_stop()
*/
- if (event && !test_bit(event->hw.idx, cpuc->active_mask))
- __icl_update_topdown_event(event, slots, metrics);
+ if (event && !test_bit(event->hw.idx, cpuc->active_mask)) {
+ __icl_update_topdown_event(event, slots, metrics,
+ event->hw.saved_slots,
+ event->hw.saved_metric);
- /* The fixed counter 3 has to be written before the PERF_METRICS. */
- wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
- wrmsrl(MSR_PERF_METRICS, 0);
+ /*
+ * In x86_pmu_stop(), the event is cleared in active_mask first,
+ * then drain the delta, which indicates context switch for
+ * counting.
+ * Save metric and slots for context switch.
+ * Don't need to reset the PERF_METRICS and Fixed counter 3.
+ * Because the values will be restored in next schedule in.
+ */
+ update_saved_topdown_regs(event, slots, metrics);
+ reset = false;
+ }
+
+ if (reset) {
+ /* The fixed counter 3 has to be written before the PERF_METRICS. */
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+ if (event)
+ update_saved_topdown_regs(event, 0, 0);
+ }
return slots;
}
@@ -3578,8 +3648,6 @@ static int intel_pmu_hw_config(struct perf_event *event)
*/
leader->hw.flags |= PERF_X86_EVENT_TOPDOWN;
event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
-
- event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
}
}
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6048650..46a3974 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -212,17 +212,26 @@ struct hw_perf_event {
*/
u64 sample_period;
- /*
- * The period we started this sample with.
- */
- u64 last_period;
+ union {
+ struct { /* Sampling */
+ /*
+ * The period we started this sample with.
+ */
+ u64 last_period;
- /*
- * However much is left of the current period; note that this is
- * a full 64bit value and allows for generation of periods longer
- * than hardware might allow.
- */
- local64_t period_left;
+ /*
+ * However much is left of the current period;
+ * note that this is a full 64bit value and
+ * allows for generation of periods longer
+ * than hardware might allow.
+ */
+ local64_t period_left;
+ };
+ struct { /* Topdown events counting for context switch */
+ u64 saved_metric;
+ u64 saved_slots;
+ };
+ };
/*
* State for throttling the event, see __perf_event_overflow() and
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 59a854e2f3b90ad2cc7368ae392de40b981ad51d
Gitweb: https://git.kernel.org/tip/59a854e2f3b90ad2cc7368ae392de40b981ad51d
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:13 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:37 +02:00
perf/x86/intel: Support TopDown metrics on Ice Lake
Ice Lake supports the hardware TopDown metrics feature, which can free
up the scarce GP counters.
Update the event constraints for the metrics events. The metric counters
do not exist; they are mapped to a dummy offset. The sharing between
multiple users of the same metric without multiplexing is not allowed.
Implement set_topdown_event_period for Ice Lake. The values in
PERF_METRICS MSR are derived from the fixed counter 3. Both registers
should start from zero.
Implement update_topdown_event for Ice Lake. The metric is reported by
multiplying the metric (fraction) with slots. To maintain accurate
measurements, both registers are cleared for each update. The fixed
counter 3 should always be cleared before the PERF_METRICS.
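As a worked example of that calculation (the numbers are purely
illustrative), with SLOTS = 1,000,000 and the retiring byte of PERF_METRICS
(bits 0-7) equal to 0x80:

  retiring slots = slots * metric_byte / 0xff
                 = 1,000,000 * 128 / 255
                 ~= 501,960

which is what icl_get_metrics_event_value() in the diff below computes with
mul_u64_u32_div().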
Implement td_attr for the new metrics events and the new slots fixed
counter. Make them visible to the perf user tools.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/intel/core.c | 118 +++++++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 13 +++-
arch/x86/include/asm/msr-index.h | 2 +-
arch/x86/include/asm/perf_event.h | 2 +-
4 files changed, 135 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 4a43668..db83334 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -247,6 +247,10 @@ static struct event_constraint intel_icl_event_constraints[] = {
FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */
+ METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_RETIRING, 0),
+ METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BAD_SPEC, 1),
+ METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_FE_BOUND, 2),
+ METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BE_BOUND, 3),
INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */
@@ -309,6 +313,12 @@ EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
"4", "2");
+EVENT_ATTR_STR(slots, slots, "event=0x00,umask=0x4");
+EVENT_ATTR_STR(topdown-retiring, td_retiring, "event=0x00,umask=0x80");
+EVENT_ATTR_STR(topdown-bad-spec, td_bad_spec, "event=0x00,umask=0x81");
+EVENT_ATTR_STR(topdown-fe-bound, td_fe_bound, "event=0x00,umask=0x82");
+EVENT_ATTR_STR(topdown-be-bound, td_be_bound, "event=0x00,umask=0x83");
+
static struct attribute *snb_events_attrs[] = {
EVENT_PTR(td_slots_issued),
EVENT_PTR(td_slots_retired),
@@ -2232,6 +2242,99 @@ static void intel_pmu_del_event(struct perf_event *event)
intel_pmu_pebs_del(event);
}
+static int icl_set_topdown_event_period(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ s64 left = local64_read(&hwc->period_left);
+
+ /*
+ * The values in PERF_METRICS MSR are derived from fixed counter 3.
+ * Software should start both registers, PERF_METRICS and fixed
+ * counter 3, from zero.
+ * Clear PERF_METRICS and Fixed counter 3 in initialization.
+ * After that, both MSRs will be cleared for each read.
+ * Don't need to clear them again.
+ */
+ if (left == x86_pmu.max_period) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+ local64_set(&hwc->period_left, 0);
+ }
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static inline u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
+{
+ u32 val;
+
+ /*
+ * The metric is reported as an 8bit integer fraction
+ * summing up to 0xff.
+ * slots-in-metric = (Metric / 0xff) * slots
+ */
+ val = (metric >> ((idx - INTEL_PMC_IDX_METRIC_BASE) * 8)) & 0xff;
+ return mul_u64_u32_div(slots, val, 0xff);
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+ u64 slots, u64 metrics)
+{
+ int idx = event->hw.idx;
+ u64 delta;
+
+ if (is_metric_idx(idx))
+ delta = icl_get_metrics_event_value(metrics, slots, idx);
+ else
+ delta = slots;
+
+ local64_add(delta, &event->count);
+}
+
+/*
+ * Update all active Topdown events.
+ *
+ * The PERF_METRICS and Fixed counter 3 are read separately. The values may be
+ * modified by an NMI. The PMU has to be disabled before calling this function.
+ */
+static u64 icl_update_topdown_event(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *other;
+ u64 slots, metrics;
+ int idx;
+
+ /* read Fixed counter 3 */
+ rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots);
+ if (!slots)
+ return 0;
+
+ /* read PERF_METRICS */
+ rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
+
+ for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+ if (!is_topdown_idx(idx))
+ continue;
+ other = cpuc->events[idx];
+ __icl_update_topdown_event(other, slots, metrics);
+ }
+
+ /*
+ * Check and update this event, which may have been cleared
+ * in active_mask e.g. x86_pmu_stop()
+ */
+ if (event && !test_bit(event->hw.idx, cpuc->active_mask))
+ __icl_update_topdown_event(event, slots, metrics);
+
+ /* The fixed counter 3 has to be written before the PERF_METRICS. */
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+
+ return slots;
+}
+
static void intel_pmu_read_topdown_event(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -4480,6 +4583,15 @@ static struct attribute *icl_events_attrs[] = {
NULL,
};
+static struct attribute *icl_td_events_attrs[] = {
+ EVENT_PTR(slots),
+ EVENT_PTR(td_retiring),
+ EVENT_PTR(td_bad_spec),
+ EVENT_PTR(td_fe_bound),
+ EVENT_PTR(td_be_bound),
+ NULL,
+};
+
static struct attribute *icl_tsx_events_attrs[] = {
EVENT_PTR(tx_start),
EVENT_PTR(tx_abort),
@@ -5264,10 +5376,13 @@ __init int intel_pmu_init(void)
hsw_format_attr : nhm_format_attr;
extra_skl_attr = skl_format_attr;
mem_attr = icl_events_attrs;
+ td_attr = icl_td_events_attrs;
tsx_attr = icl_tsx_events_attrs;
x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xca, .umask=0x02);
x86_pmu.lbr_pt_coexist = true;
intel_pmu_pebs_data_source_skl(pmem);
+ x86_pmu.update_topdown_event = icl_update_topdown_event;
+ x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
pr_cont("Icelake events, ");
name = "icelake";
break;
@@ -5387,6 +5502,9 @@ __init int intel_pmu_init(void)
if (x86_pmu.counter_freezing)
x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
+ if (x86_pmu.intel_cap.perf_metrics)
+ x86_pmu.intel_ctrl |= 1ULL << GLOBAL_CTRL_EN_PERF_METRICS;
+
return 0;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index dea009c..3454424 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -407,6 +407,19 @@ struct cpu_hw_events {
EVENT_CONSTRAINT(c, (1ULL << (32+n)), FIXED_EVENT_FLAGS)
/*
+ * The special metric counters do not actually exist. They are calculated from
+ * the combination of the FxCtr3 + MSR_PERF_METRICS.
+ *
+ * The special metric counters are mapped to a dummy offset for the scheduler.
+ * Sharing between multiple users of the same metric without multiplexing
+ * is not allowed, even though the hardware supports that in principle.
+ */
+
+#define METRIC_EVENT_CONSTRAINT(c, n) \
+ EVENT_CONSTRAINT(c, (1ULL << (INTEL_PMC_IDX_METRIC_BASE + n)), \
+ INTEL_ARCH_EVENT_MASK)
+
+/*
* Constraint on the Event code + UMask
*/
#define INTEL_UEVENT_CONSTRAINT(c, n) \
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index fabb61c..dc131b8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -863,6 +863,8 @@
#define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL 0x00000390
+#define MSR_PERF_METRICS 0x00000329
+
/* PERF_GLOBAL_OVF_CTL bits */
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT 55
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI (1ULL << MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 964ba31..e20aa58 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -198,6 +198,7 @@ struct x86_pmu_capability {
/* RDPMC offset for Fixed PMCs */
#define INTEL_PMC_FIXED_RDPMC_BASE (1 << 30)
+#define INTEL_PMC_FIXED_RDPMC_METRICS (1 << 29)
/*
* All the fixed-mode PMCs are configured via this single MSR:
@@ -305,6 +306,7 @@ static inline bool is_topdown_idx(int idx)
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
+#define GLOBAL_CTRL_EN_PERF_METRICS 48
/*
* We model guest LBR event tracing as another fixed-mode PMC like BTS.
*
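The conversion done by __icl_update_topdown_event() can be illustrated with
a minimal sketch (not part of the patch; metric_to_slots() is a made-up name).
Byte n of PERF_METRICS carries the metric whose pseudo index is
INTEL_PMC_IDX_METRIC_BASE + n as an 8-bit fraction of 0xff, so the reported
count is that fraction of the SLOTS value:

#include <stdint.h>

/*
 * Illustrative only: slots-in-metric = slots * fraction / 0xff, the same
 * calculation icl_get_metrics_event_value() performs.  The kernel uses
 * mul_u64_u32_div() instead of the plain expression below to avoid
 * overflowing the 64-bit multiply for very large slots counts.
 */
static uint64_t metric_to_slots(uint64_t metrics, uint64_t slots, int n)
{
	uint64_t frac = (metrics >> (n * 8)) & 0xff;

	return slots * frac / 0xff;
}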
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 0e2e45e2ded4988f5641115fd996c75dc32e4be3
Gitweb: https://git.kernel.org/tip/0e2e45e2ded4988f5641115fd996c75dc32e4be3
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:12 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:36 +02:00
perf/x86: Add a macro for RDPMC offset of fixed counters
The RDPMC base offset of fixed counters is hard-coded. Use a meaningful
name to replace the magic number and improve the readability of the code.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/core.c | 3 ++-
arch/x86/include/asm/perf_event.h | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 53fcf0a..ebf723f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1151,7 +1151,8 @@ static inline void x86_assign_hw_event(struct perf_event *event,
hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
(idx - INTEL_PMC_IDX_FIXED);
- hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30;
+ hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) |
+ INTEL_PMC_FIXED_RDPMC_BASE;
break;
default:
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 000cab7..964ba31 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -196,6 +196,9 @@ struct x86_pmu_capability {
* Fixed-purpose performance events:
*/
+/* RDPMC offset for Fixed PMCs */
+#define INTEL_PMC_FIXED_RDPMC_BASE (1 << 30)
+
/*
* All the fixed-mode PMCs are configured via this single MSR:
*/
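As a quick illustration of what the new name captures (a sketch, not part of
the patch; fixed_rdpmc_index() is made up): RDPMC addresses the fixed-counter
range by setting bit 30 of the selector, with the low bits picking the counter.

/*
 * Illustrative sketch only: the selector x86_assign_hw_event() now stores
 * in event_base_rdpmc for a fixed counter.
 */
static inline unsigned int fixed_rdpmc_index(unsigned int fixed_idx)
{
	/* e.g. fixed counter 3 (SLOTS) -> 0x40000003 */
	return fixed_idx | INTEL_PMC_FIXED_RDPMC_BASE;
}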
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 75608cb02ea5dd997990e2998eca3670cb71a18c
Gitweb: https://git.kernel.org/tip/75608cb02ea5dd997990e2998eca3670cb71a18c
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:04 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:34 +02:00
perf/x86: Use event_base_rdpmc for the RDPMC userspace support
The RDPMC index is always re-calculated for the RDPMC userspace support,
which is unnecessary.
The RDPMC index value is already stored in the variable event_base_rdpmc
for kernel usage, and it can be used for the RDPMC userspace support as well.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/core.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1cbf57d..8e108ea 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2208,17 +2208,12 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *m
static int x86_pmu_event_idx(struct perf_event *event)
{
- int idx = event->hw.idx;
+ struct hw_perf_event *hwc = &event->hw;
- if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
+ if (!(hwc->flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return 0;
- if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
- idx -= INTEL_PMC_IDX_FIXED;
- idx |= 1 << 30;
- }
-
- return idx + 1;
+ return hwc->event_base_rdpmc + 1;
}
static ssize_t get_attr_rdpmc(struct device *cdev,
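For context, this is how the published index is consumed from userspace: a
trimmed sketch of the self-monitoring pattern documented in perf_event_open(2),
assuming the event has already been opened and its first page mmap'ed. The
value read from pc->index is exactly event_base_rdpmc + 1, so userspace
subtracts one again before issuing RDPMC.

#include <linux/perf_event.h>
#include <stdint.h>

static inline uint64_t rdpmc(uint32_t counter)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
	return lo | ((uint64_t)hi << 32);
}

/* pc points at the mmap'ed first page of the perf event. */
static uint64_t read_self_monitored(volatile struct perf_event_mmap_page *pc)
{
	uint32_t seq, idx;
	uint64_t count;

	do {
		seq = pc->lock;
		__sync_synchronize();
		idx = pc->index;	/* 0 means RDPMC is not allowed */
		count = pc->offset;
		if (pc->cap_user_rdpmc && idx)
			count += rdpmc(idx - 1);
		__sync_synchronize();
	} while (pc->lock != seq);

	return count;
}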
The following commit has been merged into the perf/core branch of tip:
Commit-ID: d39fcc32893dac2d02900d99c38276a00cc54d60
Gitweb: https://git.kernel.org/tip/d39fcc32893dac2d02900d99c38276a00cc54d60
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:07 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:35 +02:00
perf/x86/intel: Move BTS index to 47
Bit 48 in the PERF_GLOBAL_STATUS is used to indicate the overflow
status of the PERF_METRICS counters.
Move the BTS index to bit 47.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/perf_event.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index fe8110a..58419e5 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -238,11 +238,11 @@ struct x86_pmu_capability {
/*
* We model BTS tracing as another fixed-mode PMC.
*
- * We choose a value in the middle of the fixed event range, since lower
+ * We choose the value 47 for the fixed index of BTS, since lower
* values are used by actual fixed events and higher values are used
* to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
*/
-#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 15)
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
#define GLOBAL_STATUS_BUFFER_OVF_BIT 62
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 9f0c4fa111dc909ca545c45ea20ec84da555ce16
Gitweb: https://git.kernel.org/tip/9f0c4fa111dc909ca545c45ea20ec84da555ce16
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:10 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:36 +02:00
perf/core: Add a new PERF_EV_CAP_SIBLING event capability
Current perf assumes that events in a group are independent. Closing an
event doesn't impact the value of the other events in the same group.
If the closed event is a sibling, the remaining events still run as a
group after the closure. If the closed event is the leader, the
remaining events run as singleton events.
Add PERF_EV_CAP_SIBLING to allow events to indicate they require being
part of a group, and when the leader dies they cannot exist
independently.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/perf_event.h | 4 ++++-
kernel/events/core.c | 38 ++++++++++++++++++++++++++++++++-----
2 files changed, 37 insertions(+), 5 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 04a49cc..6048650 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -576,9 +576,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
* PERF_EV_CAP_SOFTWARE: Is a software event.
* PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
* from any CPU in the package where it is active.
+ * PERF_EV_CAP_SIBLING: An event with this flag must be a group sibling and
+ * cannot be a group leader. If an event with this flag is detached from the
+ * group it is scheduled out and moved into an unrecoverable ERROR state.
*/
#define PERF_EV_CAP_SOFTWARE BIT(0)
#define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
+#define PERF_EV_CAP_SIBLING BIT(2)
#define SWEVENT_HLIST_BITS 8
#define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5bfe8e3..57efe3b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2133,8 +2133,24 @@ static inline struct list_head *get_event_list(struct perf_event *event)
return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
}
+/*
+ * Events that have PERF_EV_CAP_SIBLING require being part of a group and
+ * cannot exist on their own, schedule them out and move them into the ERROR
+ * state. Also see _perf_event_enable(), it will not be able to recover
+ * this ERROR state.
+ */
+static inline void perf_remove_sibling_event(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+ struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+
+ event_sched_out(event, cpuctx, ctx);
+ perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+}
+
static void perf_group_detach(struct perf_event *event)
{
+ struct perf_event *leader = event->group_leader;
struct perf_event *sibling, *tmp;
struct perf_event_context *ctx = event->ctx;
@@ -2153,7 +2169,7 @@ static void perf_group_detach(struct perf_event *event)
/*
* If this is a sibling, remove it from its group.
*/
- if (event->group_leader != event) {
+ if (leader != event) {
list_del_init(&event->sibling_list);
event->group_leader->nr_siblings--;
goto out;
@@ -2166,6 +2182,9 @@ static void perf_group_detach(struct perf_event *event)
*/
list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
+ if (sibling->event_caps & PERF_EV_CAP_SIBLING)
+ perf_remove_sibling_event(sibling);
+
sibling->group_leader = sibling;
list_del_init(&sibling->sibling_list);
@@ -2183,10 +2202,10 @@ static void perf_group_detach(struct perf_event *event)
}
out:
- perf_event__header_size(event->group_leader);
-
- for_each_sibling_event(tmp, event->group_leader)
+ for_each_sibling_event(tmp, leader)
perf_event__header_size(tmp);
+
+ perf_event__header_size(leader);
}
static bool is_orphaned_event(struct perf_event *event)
@@ -2979,6 +2998,7 @@ static void _perf_event_enable(struct perf_event *event)
raw_spin_lock_irq(&ctx->lock);
if (event->state >= PERF_EVENT_STATE_INACTIVE ||
event->state < PERF_EVENT_STATE_ERROR) {
+out:
raw_spin_unlock_irq(&ctx->lock);
return;
}
@@ -2990,8 +3010,16 @@ static void _perf_event_enable(struct perf_event *event)
* has gone back into error state, as distinct from the task having
* been scheduled away before the cross-call arrived.
*/
- if (event->state == PERF_EVENT_STATE_ERROR)
+ if (event->state == PERF_EVENT_STATE_ERROR) {
+ /*
+ * Detached SIBLING events cannot leave ERROR state.
+ */
+ if (event->event_caps & PERF_EV_CAP_SIBLING &&
+ event->group_leader == event)
+ goto out;
+
event->state = PERF_EVENT_STATE_OFF;
+ }
raw_spin_unlock_irq(&ctx->lock);
event_function_call(event, __perf_event_enable, NULL);
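A sketch of how a PMU driver is expected to use the new capability (the
TopDown metric events in this series set it from their hw_config path;
example_hw_config() and is_metric_event() below are placeholders for
whatever hook and check the driver actually uses):

static int example_hw_config(struct perf_event *event)
{
	if (is_metric_event(event)) {
		/*
		 * A metric event can only live as a sibling of its SLOTS
		 * group leader.  If it is ever detached from the group,
		 * perf_group_detach() schedules it out and parks it in
		 * the unrecoverable ERROR state.
		 */
		event->event_caps |= PERF_EV_CAP_SIBLING;
	}

	return 0;
}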
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 60a2a271cf05cf046c522e1d7f62116b4bcb32a2
Gitweb: https://git.kernel.org/tip/60a2a271cf05cf046c522e1d7f62116b4bcb32a2
Author: Kan Liang <[email protected]>
AuthorDate: Thu, 23 Jul 2020 10:11:05 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Aug 2020 16:34:34 +02:00
perf/x86/intel: Name the global status bit in NMI handler
Magic numbers are used in the current NMI handler for the global status
bits. Use meaningful names to replace the magic numbers and improve the
readability of the code.
Remove a tab from all GLOBAL_STATUS_* and INTEL_PMC_IDX_FIXED_BTS macros
to reduce the line length.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/intel/core.c | 4 ++--
arch/x86/include/asm/perf_event.h | 22 ++++++++++++----------
2 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 5096347..ac1408f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2389,7 +2389,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
/*
* PEBS overflow sets bit 62 in the global status register
*/
- if (__test_and_clear_bit(62, (unsigned long *)&status)) {
+ if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long *)&status)) {
u64 pebs_enabled = cpuc->pebs_enabled;
handled++;
@@ -2410,7 +2410,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
/*
* Intel PT
*/
- if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+ if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned long *)&status)) {
handled++;
if (unlikely(perf_guest_cbs && perf_guest_cbs->is_in_guest() &&
perf_guest_cbs->handle_intel_pt_intr))
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 0c1b137..fd3eba6 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -225,16 +225,18 @@ struct x86_pmu_capability {
* values are used by actual fixed events and higher values are used
* to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
*/
-#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16)
-
-#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
-#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(62)
-#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
-#define GLOBAL_STATUS_ASIF BIT_ULL(60)
-#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
-#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
-#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
-#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
+#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16)
+
+#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
+#define GLOBAL_STATUS_BUFFER_OVF_BIT 62
+#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(GLOBAL_STATUS_BUFFER_OVF_BIT)
+#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
+#define GLOBAL_STATUS_ASIF BIT_ULL(60)
+#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
+#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
+#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
+#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55
+#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
/*
* We model guest LBR event tracing as another fixed-mode PMC like BTS.