This patch series adds support for Large Increment per Cycle events,
which is needed to count events like Retired SSE/AVX FLOPs.
The first patch constrains Large Increment events to the even PMCs,
and the second patch changes the scheduler to accommodate and
program the new Merge event needed on the odd counters.
The RFC was posted here:
https://lkml.org/lkml/2019/8/26/828
Changes since then mostly consist of fixing interoperation with the
watchdog, splitting and rewording the patches, and addressing
Peter Zijlstra's comments:
- Mentioned programming the odd counter before the even counter
in the commit text, as is now also done in the code.
- Do the programming of the counters in the enable/disable paths
instead of the commit_scheduler hook.
- Instead of the loop re-counting all large increment events,
have collect_events() and a new amd_put_event_constraints_f17h
update a new cpuc variable 'n_lg_inc'. Now the scheduler
does a simple subtraction to get the target gpmax value.
- Amend the fastpath's used_mask code to fix a problem where
counter programming was being overwritten when running with
the watchdog enabled.
- Omit the superfluous __set_bit(idx + 1) in __perf_sched_find_counter
and clear the large increment's sched->state.used bit in the
path where a failure to schedule is determined due to the
next counter already being used (thanks Nathan Fontenot).
- Broaden new PMU initialization code to run on families 17h and
above.
- Made the new is_large_inc(struct perf_event) helper common to all
x86 paths, as is_pebs_pt() is. That way, the raw event code checker
amd_is_lg_inc_event_code() can stay in its vendor-specific area,
events/amd/core.c.
- __set_bit, WARN_ON(!gpmax), all addressed.
- WRT changing the naming to PAIR, etc.: I dislike the idea because
the h/w documentation consistently refers to this now relatively old
feature as "Large Increment per Cycle" events, and to the secondary
event needed specifically as the "Merge event (0xFFF)". When I
started this project, the biggest problem was disambiguating
between the Large Increment event (FLOPs, or others) and the
Merge event (0xFFF) itself. Different phases of development used
"Merge" for the Merge event vs. "Merged" or "Mergee" for the
Large Increment event(s), which made it too easy to mistake one
for the other when reading the source code. So I opted for two
distinctly different base terms/stem-words, Large increment (lg_inc)
and Merge, to match the documentation, which basically has it right.
Changing the term to "pair" would have created the same "pair" vs.
"paired" vs. "pairer" etc. confusion, so I dropped it.
- WRT the comment "How about you make __perf_sched_find_count() set
the right value? That already knows it did this.": I didn't see
how I could avoid still having to do the constraint flags &
LARGE_INC check in perf_assign_events(), to re-adjust the assignment
in the assign array, or sched.state.counter. This code really
is only needed after the counter assignment is made, in order to
program the h/w correctly.
Kim Phillips (2):
perf/x86/amd: Constrain Large Increment per Cycle events
perf/x86/amd: Add support for Large Increment per Cycle Events
arch/x86/events/amd/core.c | 110 +++++++++++++++++++++++++----------
arch/x86/events/core.c | 46 ++++++++++++++-
arch/x86/events/perf_event.h | 21 +++++++
3 files changed, 145 insertions(+), 32 deletions(-)
--
2.24.0
Description of hardware operation
---------------------------------
The core AMD PMU has a 4-bit wide per-cycle increment for each
performance monitor counter. That works for most events, but
now with AMD Family 17h and above processors, some events can
occur more than 15 times in a cycle. Those events are called
"Large Increment per Cycle" events. In order to count these
events, two adjacent h/w PMCs get their count signals merged
to form 8 bits per cycle total. In addition, the PERF_CTR count
registers are merged to be able to count up to 64 bits.
Normally, events like instructions retired get programmed on a single
counter, like so:
PERF_CTL0 (MSR 0xc0010200) 0x000000000053ff0c # event 0x0c, umask 0xff
PERF_CTR0 (MSR 0xc0010201) 0x0000800000000001 # r/w 48-bit count
The next counter at MSRs 0xc0010202-3 remains unused, or can be used
independently to count something else.
When counting Large Increment per Cycle events, such as FLOPs,
however, we now have to reserve the next counter and program its
PERF_CTL (config) register with the Merge event (0xFFF), like so:
PERF_CTL0 (msr 0xc0010200) 0x000000000053ff03 # FLOPs event, umask 0xff
PERF_CTR0 (msr 0xc0010201) 0x0000800000000001 # rd 64-bit cnt, wr lo 48b
PERF_CTL1 (msr 0xc0010202) 0x0000000f004000ff # Merge event, enable bit
PERF_CTR1 (msr 0xc0010203) 0x0000000000000000 # wr hi 16-bits count
The count is widened from the normal 48 bits to 64 bits by having the
second counter carry the upper 16 bits of the count in the lower 16
bits of its counter register.
The odd counter, e.g., PERF_CTL1, is programmed with the enabled Merge
event before the even counter, PERF_CTL0.
The Large Increment feature is available starting with Family 17h.
For more details, search any Family 17h PPR for the "Large Increment
per Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this
version:
https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip
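Putting the above together, here is a minimal sketch of the programming
sequence. It is illustrative only, not the kernel implementation:
program_flops_pair() is a hypothetical helper, the wrmsrl() prototype is
assumed to be provided by the platform, and the MSR numbers and control
values are taken from the register example above.

/* Illustrative sketch only -- not the kernel implementation. */
#include <stdint.h>

#define PERF_CTL0 0xc0010200 /* even counter: control */
#define PERF_CTR0 0xc0010201 /* even counter: count   */
#define PERF_CTL1 0xc0010202 /* odd counter: control  */
#define PERF_CTR1 0xc0010203 /* odd counter: count    */

/* Assumed to be provided by the platform; MSR writes are kernel-only. */
extern void wrmsrl(uint32_t msr, uint64_t val);

static void program_flops_pair(uint64_t start_count)
{
	/* Initial count: the low 48 bits go to the even counter, the
	 * upper 16 bits go to the low 16 bits of the odd counter. */
	wrmsrl(PERF_CTR0, start_count & ((1ULL << 48) - 1));
	wrmsrl(PERF_CTR1, (start_count >> 48) & 0xffff);

	/* Program the odd counter with the enabled Merge event (0xFFF)
	 * before enabling the even counter with the FLOPs event. */
	wrmsrl(PERF_CTL1, 0x0000000f004000ffULL); /* Merge event, enable bit */
	wrmsrl(PERF_CTL0, 0x000000000053ff03ULL); /* FLOPs event, umask 0xff */
}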
Description of software operation
---------------------------------
The following steps are taken in order to support reserving and
enabling the extra counter for Large Increment per Cycle events:
1. In the main x86 scheduler, we reduce the number of available
counters by the number of Large Increment per Cycle events being
scheduled, tracked by a new cpuc variable 'n_lg_inc' (incremented
in collect_events() and decremented in the new
amd_put_event_constraints_f17h()). This improves the counter
scheduler success rate.
2. In perf_assign_events(), if a counter is assigned to a Large
Increment event, we increment the current counter variable, so the
counter used for the Merge event is removed from assignment
consideration by upcoming event assignments.
3. In __perf_sched_find_counter(), if a counter has been found for the
Large Increment event, we set the next counter as used, to prevent
other events from using it.
4. We also perform steps 2 & 3 in the x86 scheduler fastpath, i.e.,
we add Merge event accounting to the existing used_mask logic.
5. Finally, we add programming of the Merge event on the neighbouring
PMC in the counter enable/disable{_all} code paths.
Currently, software does not support a single PMU with mixed 48- and
64-bit counting, so Large Increment event counts are limited to 48
bits. In x86_perf_event_set_period(), we zero out the upper 16 bits
of the count, so the hardware doesn't copy them to the even counter's
higher bits.
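As a worked example of the gpmax accounting in step 1, assuming the
six core counters currently available: a group with two Large
Increment events and two ordinary events gives n_lg_inc = 2, so
gpmax = 6 - 2 = 4. The scheduler then assigns at most four events,
and since each Large Increment event occupies a counter pair, all
six counters can end up in use (2 x 2 + 2 x 1 = 6).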
Simple invocation example showing counting 8 FLOPs per 256-bit/%ymm
vaddps instruction executed in a loop 100 million times:
perf stat -e cpu/fp_ret_sse_avx_ops.all/,cpu/instructions/ <workload>
Performance counter stats for '<workload>':
800,000,000 cpu/fp_ret_sse_avx_ops.all/u
300,042,101 cpu/instructions/u
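The FLOPs count matches the expected 8 FLOPs x 100,000,000 loop
iterations = 800,000,000.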
Prior to this patch, the reported SSE/AVX FLOPs retired count would
be wrong.
Signed-off-by: Kim Phillips <[email protected]>
Cc: Janakarajan Natarajan <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Martin Liška <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/events/amd/core.c | 19 +++++++++++++++
arch/x86/events/core.c | 46 ++++++++++++++++++++++++++++++++++--
arch/x86/events/perf_event.h | 19 +++++++++++++++
3 files changed, 82 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 78f771cadb8e..e69723cb5b51 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -14,6 +14,10 @@
static DEFINE_PER_CPU(unsigned long, perf_nmi_tstamp);
static unsigned long perf_nmi_window;
+/* AMD Event 0xFFF: Merge. Used with Large Increment per Cycle events */
+#define AMD_MERGE_EVENT ((0xFULL << 32) | 0xFFULL)
+#define AMD_MERGE_EVENT_ENABLE (AMD_MERGE_EVENT | ARCH_PERFMON_EVENTSEL_ENABLE)
+
static __initconst const u64 amd_hw_cache_event_ids
[PERF_COUNT_HW_CACHE_MAX]
[PERF_COUNT_HW_CACHE_OP_MAX]
@@ -335,6 +339,10 @@ static int amd_core_hw_config(struct perf_event *event)
else if (event->attr.exclude_guest)
event->hw.config |= AMD64_EVENTSEL_HOSTONLY;
+ if ((x86_pmu.flags & PMU_FL_MERGE) &&
+ amd_is_lg_inc_event_code(&event->hw))
+ event->hw.flags |= PERF_X86_EVENT_LARGE_INC;
+
return 0;
}
@@ -889,6 +897,15 @@ amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx,
return &unconstrained;
}
+static void amd_put_event_constraints_f17h(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (is_large_inc(hwc))
+ --cpuc->n_lg_inc;
+}
+
static ssize_t amd_event_sysfs_show(char *page, u64 config)
{
u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
@@ -919,6 +936,7 @@ static __initconst const struct x86_pmu amd_pmu = {
.max_period = (1ULL << 47) - 1,
.get_event_constraints = amd_get_event_constraints,
.put_event_constraints = amd_put_event_constraints,
+ .perf_ctr_merge_en = AMD_MERGE_EVENT_ENABLE,
.format_attrs = amd_format_attr,
.events_sysfs_show = amd_event_sysfs_show,
@@ -976,6 +994,7 @@ static int __init amd_core_pmu_init(void)
PERF_X86_EVENT_LARGE_INC);
x86_pmu.get_event_constraints = amd_get_event_constraints_f17h;
+ x86_pmu.put_event_constraints = amd_put_event_constraints_f17h;
x86_pmu.flags |= PMU_FL_MERGE;
}
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7b21455d7504..51c4c69a9d76 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -612,6 +612,7 @@ void x86_pmu_disable_all(void)
int idx;
for (idx = 0; idx < x86_pmu.num_counters; idx++) {
+ struct hw_perf_event *hwc = &cpuc->events[idx]->hw;
u64 val;
if (!test_bit(idx, cpuc->active_mask))
@@ -621,6 +622,8 @@ void x86_pmu_disable_all(void)
continue;
val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
wrmsrl(x86_pmu_config_addr(idx), val);
+ if (is_large_inc(hwc))
+ wrmsrl(x86_pmu_config_addr(idx + 1), 0);
}
}
@@ -787,6 +790,16 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
if (!__test_and_set_bit(idx, sched->state.used)) {
if (sched->state.nr_gp++ >= sched->max_gp)
return false;
+ /*
+ * Large inc. events need the Merge event
+ * on the next counter
+ */
+ if ((c->flags & PERF_X86_EVENT_LARGE_INC) &&
+ __test_and_set_bit(idx + 1, sched->state.used)) {
+ /* next counter already used */
+ __clear_bit(idx, sched->state.used);
+ return false;
+ }
goto done;
}
@@ -849,14 +862,19 @@ int perf_assign_events(struct event_constraint **constraints, int n,
int wmin, int wmax, int gpmax, int *assign)
{
struct perf_sched sched;
+ struct event_constraint *c;
perf_sched_init(&sched, constraints, n, wmin, wmax, gpmax);
do {
if (!perf_sched_find_counter(&sched))
break; /* failed */
- if (assign)
+ if (assign) {
assign[sched.state.event] = sched.state.counter;
+ c = constraints[sched.state.event];
+ if (c->flags & PERF_X86_EVENT_LARGE_INC)
+ sched.state.counter++;
+ }
} while (perf_sched_next_event(&sched));
return sched.state.unassigned;
@@ -926,10 +944,14 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
break;
/* not already used */
- if (test_bit(hwc->idx, used_mask))
+ if (test_bit(hwc->idx, used_mask) || (is_large_inc(hwc) &&
+ test_bit(hwc->idx + 1, used_mask)))
break;
__set_bit(hwc->idx, used_mask);
+ if (is_large_inc(hwc))
+ __set_bit(hwc->idx + 1, used_mask);
+
if (assign)
assign[i] = hwc->idx;
}
@@ -952,6 +974,15 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
READ_ONCE(cpuc->excl_cntrs->exclusive_present))
gpmax /= 2;
+ /*
+ * Reduce the amount of available counters to allow fitting
+ * the extra Merge events needed by large increment events.
+ */
+ if (x86_pmu.flags & PMU_FL_MERGE) {
+ gpmax = x86_pmu.num_counters - cpuc->n_lg_inc;
+ WARN_ON(gpmax <= 0);
+ }
+
unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
wmax, gpmax, assign);
}
@@ -1032,6 +1063,8 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
return -EINVAL;
cpuc->event_list[n] = leader;
n++;
+ if (is_large_inc(&leader->hw))
+ cpuc->n_lg_inc++;
}
if (!dogrp)
return n;
@@ -1046,6 +1079,8 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
cpuc->event_list[n] = event;
n++;
+ if (is_large_inc(&event->hw))
+ cpuc->n_lg_inc++;
}
return n;
}
@@ -1231,6 +1266,13 @@ int x86_perf_event_set_period(struct perf_event *event)
wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ /*
+ * Clear the Merge event counter's upper 16 bits since
+ * we currently declare a 48-bit counter width
+ */
+ if (is_large_inc(hwc))
+ wrmsrl(x86_pmu_event_addr(idx + 1), 0);
+
/*
* Due to erratum on certan cpu we need
* a second write to be sure the register
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 6154f06eb4ba..6db6e4c69e3d 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -273,6 +273,7 @@ struct cpu_hw_events {
struct amd_nb *amd_nb;
/* Inverted mask of bits to clear in the perf_ctr ctrl registers */
u64 perf_ctr_virt_mask;
+ int n_lg_inc; /* Large increment events */
void *kfree_on_online[X86_PERF_KFREE_MAX];
};
@@ -687,6 +688,8 @@ struct x86_pmu {
* AMD bits
*/
unsigned int amd_nb_constraints : 1;
+ /* Merge event enable: value to set in perf_ctl next to large inc. */
+ u64 perf_ctr_merge_en;
/*
* Extra registers for events
@@ -832,6 +835,11 @@ int x86_pmu_hw_config(struct perf_event *event);
void x86_pmu_disable_all(void);
+static inline int is_large_inc(struct hw_perf_event *hwc)
+{
+ return !!(hwc->flags & PERF_X86_EVENT_LARGE_INC);
+}
+
static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
u64 enable_mask)
{
@@ -839,6 +847,14 @@ static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
if (hwc->extra_reg.reg)
wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config);
+
+ /*
+ * Add enabled Merge event on next counter
+ * if large increment event being enabled on this counter
+ */
+ if (is_large_inc(hwc))
+ wrmsrl(hwc->config_base + 2, x86_pmu.perf_ctr_merge_en);
+
wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask);
}
@@ -855,6 +871,9 @@ static inline void x86_pmu_disable_event(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
wrmsrl(hwc->config_base, hwc->config);
+
+ if (is_large_inc(hwc))
+ wrmsrl(hwc->config_base + 2, 0);
}
void x86_pmu_enable_event(struct perf_event *event);
--
2.24.0
AMD Family 17h processors and above gain support for Large Increment
per Cycle events. Unfortunately there is no CPUID or equivalent bit
that indicates whether the feature exists or not, so we continue to
determine eligibility based on a CPU family number comparison.
For Large Increment per Cycle events, we add an f17h-and-compatibles
get_event_constraints_f17h() that returns an even counter bitmask:
Large Increment per Cycle events can only be placed on PMCs 0, 2,
and 4 out of the currently available 0-5. The only currently
public event that requires this feature to report valid counts
is PMCx003 "Retired SSE/AVX Operations".
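For example, with the six core counters currently available, the mask
built in amd_core_pmu_init() works out to
even_ctr_mask = (1 << 0) | (1 << 2) | (1 << 4) = 0x15, and the
constraint weight is num_counters / 2 = 3.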
Note that the CPU family logic in amd_core_pmu_init() is changed
so as to be able to selectively add initialization for features
available in ranges of backward-compatible CPU families. This
Large Increment per Cycle feature is expected to be retained
in future families.
A side-effect of assigning a new get_constraints function for f17h is
that the old (prior to f15h) amd_get_event_constraints implementation,
left enabled by commit e40ed1542dd7 ("perf/x86: Add perf support for
AMD family-17h processors"), is no longer called; it is no longer
necessary since those North Bridge event codes are obsolete.
Also fix a spelling mistake whilst in the area (calulating ->
calculating).
Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
Signed-off-by: Kim Phillips <[email protected]>
Cc: Janakarajan Natarajan <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Martin Liška <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/events/amd/core.c | 91 ++++++++++++++++++++++++------------
arch/x86/events/perf_event.h | 2 +
2 files changed, 63 insertions(+), 30 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 64c3e70b0556..78f771cadb8e 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -301,6 +301,25 @@ static inline int amd_pmu_addr_offset(int index, bool eventsel)
return offset;
}
+/*
+ * AMD64 events are detected based on their event codes.
+ */
+static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
+{
+ return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff);
+}
+
+static inline bool amd_is_lg_inc_event_code(struct hw_perf_event *hwc)
+{
+ if (!(x86_pmu.flags & PMU_FL_MERGE))
+ return false;
+
+ switch (amd_get_event_code(hwc)) {
+ case 0x003: return true; /* Retired SSE/AVX FLOPs */
+ default: return false;
+ }
+}
+
static int amd_core_hw_config(struct perf_event *event)
{
if (event->attr.exclude_host && event->attr.exclude_guest)
@@ -319,14 +338,6 @@ static int amd_core_hw_config(struct perf_event *event)
return 0;
}
-/*
- * AMD64 events are detected based on their event codes.
- */
-static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
-{
- return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff);
-}
-
static inline int amd_is_nb_event(struct hw_perf_event *hwc)
{
return (hwc->config & 0xe0) == 0xe0;
@@ -864,6 +875,20 @@ amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, int idx,
}
}
+static struct event_constraint lg_inc_constraint;
+
+static struct event_constraint *
+amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (amd_is_lg_inc_event_code(hwc))
+ return &lg_inc_constraint;
+
+ return &unconstrained;
+}
+
static ssize_t amd_event_sysfs_show(char *page, u64 config)
{
u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
@@ -907,33 +932,15 @@ static __initconst const struct x86_pmu amd_pmu = {
static int __init amd_core_pmu_init(void)
{
+ u64 even_ctr_mask = 0ULL;
+ int i;
+
if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
return 0;
- /* Avoid calulating the value each time in the NMI handler */
+ /* Avoid calculating the value each time in the NMI handler */
perf_nmi_window = msecs_to_jiffies(100);
- switch (boot_cpu_data.x86) {
- case 0x15:
- pr_cont("Fam15h ");
- x86_pmu.get_event_constraints = amd_get_event_constraints_f15h;
- break;
- case 0x17:
- pr_cont("Fam17h ");
- /*
- * In family 17h, there are no event constraints in the PMC hardware.
- * We fallback to using default amd_get_event_constraints.
- */
- break;
- case 0x18:
- pr_cont("Fam18h ");
- /* Using default amd_get_event_constraints. */
- break;
- default:
- pr_err("core perfctr but no constraints; unknown hardware!\n");
- return -ENODEV;
- }
-
/*
* If core performance counter extensions exists, we must use
* MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also
@@ -948,6 +955,30 @@ static int __init amd_core_pmu_init(void)
*/
x86_pmu.amd_nb_constraints = 0;
+ if (boot_cpu_data.x86 == 0x15) {
+ pr_cont("Fam15h ");
+ x86_pmu.get_event_constraints = amd_get_event_constraints_f15h;
+ }
+ if (boot_cpu_data.x86 >= 0x17) {
+ pr_cont("Fam17h+ ");
+ /*
+ * Family 17h and compatibles have constraints for Large
+ * Increment per Cycle events: they may only be assigned an
+ * even numbered counter that has a consecutive adjacent odd
+ * numbered counter following it.
+ */
+ for (i = 0; i < x86_pmu.num_counters - 1; i += 2)
+ even_ctr_mask |= 1 << i;
+
+ lg_inc_constraint = (struct event_constraint)
+ __EVENT_CONSTRAINT(0, even_ctr_mask, 0,
+ x86_pmu.num_counters / 2, 0,
+ PERF_X86_EVENT_LARGE_INC);
+
+ x86_pmu.get_event_constraints = amd_get_event_constraints_f17h;
+ x86_pmu.flags |= PMU_FL_MERGE;
+ }
+
pr_cont("core perfctr, ");
return 0;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ecacfbf4ebc1..6154f06eb4ba 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -77,6 +77,7 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
+#define PERF_X86_EVENT_LARGE_INC 0x1000 /* Large Increment per Cycle */
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -735,6 +736,7 @@ do { \
#define PMU_FL_EXCL_ENABLED 0x8 /* exclusive counter active */
#define PMU_FL_PEBS_ALL 0x10 /* all events are valid PEBS events */
#define PMU_FL_TFA 0x20 /* deal with TSX force abort */
+#define PMU_FL_MERGE 0x40 /* merge counters for large incr. events */
#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
--
2.24.0
On Thu, Nov 14, 2019 at 12:37:20PM -0600, Kim Phillips wrote:
I still hate the naming on this, "large increment per cycle" is just a
random bunch of words collected by AMD marketing or somesuch. It doesn't
convey the fundamental point that counters get paired. So I've done a
giant bunch of search and replace on it for you.
> @@ -621,6 +622,8 @@ void x86_pmu_disable_all(void)
> continue;
> val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
> wrmsrl(x86_pmu_config_addr(idx), val);
> + if (is_large_inc(hwc))
> + wrmsrl(x86_pmu_config_addr(idx + 1), 0);
> }
> }
>
See, the above makes sense, it doesn't assume anything about where
config for idx and config for idx+1 are, and then here:
> @@ -855,6 +871,9 @@ static inline void x86_pmu_disable_event(struct perf_event *event)
> struct hw_perf_event *hwc = &event->hw;
>
> wrmsrl(hwc->config_base, hwc->config);
> +
> + if (is_large_inc(hwc))
> + wrmsrl(hwc->config_base + 2, 0);
> }
You hard-code the offset as being 2. I fixed that for you.
> @@ -849,14 +862,19 @@ int perf_assign_events(struct event_constraint **constraints, int n,
> int wmin, int wmax, int gpmax, int *assign)
> {
> struct perf_sched sched;
> + struct event_constraint *c;
>
> perf_sched_init(&sched, constraints, n, wmin, wmax, gpmax);
>
> do {
> if (!perf_sched_find_counter(&sched))
> break; /* failed */
> - if (assign)
> + if (assign) {
> assign[sched.state.event] = sched.state.counter;
> + c = constraints[sched.state.event];
> + if (c->flags & PERF_X86_EVENT_LARGE_INC)
> + sched.state.counter++;
> + }
> } while (perf_sched_next_event(&sched));
>
> return sched.state.unassigned;
I'm still confused by this bit. AFAICT it serves no purpose.
perf_sched_next_event() will reset sched->state.counter to 0 on every
pass anyway.
I've deleted it for you.
> @@ -926,10 +944,14 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> break;
>
> /* not already used */
> - if (test_bit(hwc->idx, used_mask))
> + if (test_bit(hwc->idx, used_mask) || (is_large_inc(hwc) &&
> + test_bit(hwc->idx + 1, used_mask)))
> break;
>
> __set_bit(hwc->idx, used_mask);
> + if (is_large_inc(hwc))
> + __set_bit(hwc->idx + 1, used_mask);
> +
> if (assign)
> assign[i] = hwc->idx;
> }
This is just really sad.. fixed that too.
Can you verify the patches in:
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/amd
work?
Hi Peter,
Happy New Year, and thanks for your review...
On 12/20/19 6:09 AM, Peter Zijlstra wrote:
> On Thu, Nov 14, 2019 at 12:37:20PM -0600, Kim Phillips wrote:
>> @@ -621,6 +622,8 @@ void x86_pmu_disable_all(void)
>> continue;
>> val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
>> wrmsrl(x86_pmu_config_addr(idx), val);
>> + if (is_large_inc(hwc))
>> + wrmsrl(x86_pmu_config_addr(idx + 1), 0);
>> }
>> }
>>
>
> See, the above makes sense, it doesn't assume anything about where
> config for idx and config for idx+1 are, and then here:
>
>> @@ -855,6 +871,9 @@ static inline void x86_pmu_disable_event(struct perf_event *event)
>> struct hw_perf_event *hwc = &event->hw;
>>
>> wrmsrl(hwc->config_base, hwc->config);
>> +
>> + if (is_large_inc(hwc))
>> + wrmsrl(hwc->config_base + 2, 0);
>> }
>
> You hard-code the offset as being 2. I fixed that for you.
Sorry I missed that! Thanks for catching and fixing it.
>> @@ -849,14 +862,19 @@ int perf_assign_events(struct event_constraint **constraints, int n,
>> int wmin, int wmax, int gpmax, int *assign)
>> {
>> struct perf_sched sched;
>> + struct event_constraint *c;
>>
>> perf_sched_init(&sched, constraints, n, wmin, wmax, gpmax);
>>
>> do {
>> if (!perf_sched_find_counter(&sched))
>> break; /* failed */
>> - if (assign)
>> + if (assign) {
>> assign[sched.state.event] = sched.state.counter;
>> + c = constraints[sched.state.event];
>> + if (c->flags & PERF_X86_EVENT_LARGE_INC)
>> + sched.state.counter++;
>> + }
>> } while (perf_sched_next_event(&sched));
>>
>> return sched.state.unassigned;
>
> I'm still confused by this bit. AFAICT it serves no purpose.
> perf_sched_next_event() will reset sched->state.counter to 0 on every
> pass anyway.
>
> I've deleted it for you.
OK. I had my doubts because perf_sched_next_event() doesn't reset the counter in the "next weight" exit case, which indeed might be a moot case; but then perf_sched_init(), called at the beginning of perf_assign_events(), doesn't clear the counter variable either.
>> @@ -926,10 +944,14 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>> break;
>>
>> /* not already used */
>> - if (test_bit(hwc->idx, used_mask))
>> + if (test_bit(hwc->idx, used_mask) || (is_large_inc(hwc) &&
>> + test_bit(hwc->idx + 1, used_mask)))
>> break;
>>
>> __set_bit(hwc->idx, used_mask);
>> + if (is_large_inc(hwc))
>> + __set_bit(hwc->idx + 1, used_mask);
>> +
>> if (assign)
>> assign[i] = hwc->idx;
>> }
>
> This is just really sad.. fixed that too.
[*]
> Can you verify the patches in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/amd
>
> work?
I found a test that failed: perf stat on 6 instructions events + 1 FLOPs event, multiplexing over 6 h/w counters along with the watchdog, often returns exceptionally high and incorrect counts for a couple of the instructions events. The FLOPs event counts OK.
My original patch, rebased onto the same code your perf/amd branch is, does not fail this test.
If I re-add my perf_assign_events changes you removed, the problem does not go away.
If I undo re-adding my perf_assign_events code, and re-add my "not already used" code that you removed - see [*] above - the problem DOES go away, and all the counts are all accurate.
One problem I see with your change in the "not already used" fastpath area is that the new mask variable gets updated with position 'i' regardless of any previous Large Increment event assignments. I.e., a successfully scheduled Large Increment event assignment may have already consumed that 'i' slot for its Merge event in a previous iteration of the loop. So if the fastpath scheduler fails to assign that following event, the slow path is wrongly entered due to a wrong 'i' comparison with 'n', for example.
Btw, thanks for adding the comment pointing out perf_sched_restore_state is incompatible with Large Increment events.
I'd just as soon put my original "not already used" code back in, but if it's still unwanted, let me know how you'd like to proceed...
Kim
On Wed, Jan 08, 2020 at 04:26:47PM -0600, Kim Phillips wrote:
> On 12/20/19 6:09 AM, Peter Zijlstra wrote:
> > On Thu, Nov 14, 2019 at 12:37:20PM -0600, Kim Phillips wrote:
> >> @@ -926,10 +944,14 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> >> break;
> >>
> >> /* not already used */
> >> - if (test_bit(hwc->idx, used_mask))
> >> + if (test_bit(hwc->idx, used_mask) || (is_large_inc(hwc) &&
> >> + test_bit(hwc->idx + 1, used_mask)))
> >> break;
> >>
> >> __set_bit(hwc->idx, used_mask);
> >> + if (is_large_inc(hwc))
> >> + __set_bit(hwc->idx + 1, used_mask);
> >> +
> >> if (assign)
> >> assign[i] = hwc->idx;
> >> }
> >
> > This is just really sad.. fixed that too.
>
> [*]
> If I undo re-adding my perf_assign_events code, and re-add my "not
> already used" code that you removed - see [*] above - the problem DOES
> go away, and all the counts are all accurate.
>
> One problem I see with your change in the "not already used" fastpath
> area, is that the new mask variable gets updated with position 'i'
> regardless of any previous Large Increment event assignments.
Urgh, I completely messed that up. Find the below delta (I'll push out a
new version to queue.git as well).
> I.e., a
> successfully scheduled large increment event assignment may have
> already consumed that 'i' slot for its Merge event in a previous
> iteration of the loop. So if the fastpath scheduler fails to assign
> that following event, the slow path is wrongly entered due to a wrong
> 'i' comparison with 'n', for example.
That should only be part of the story though; the fast-path is purely
optional. False-negatives on the fast path should not affect
functionality, only performance. False-positives on the fast path are a
no-no of course.
---
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 222f172cbaf5..3bb738f5a472 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -937,7 +937,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
* fastpath, try to reuse previous register
*/
for (i = 0; i < n; i++) {
- u64 mask = BIT_ULL(i);
+ u64 mask;
hwc = &cpuc->event_list[i]->hw;
c = cpuc->event_constraint[i];
@@ -950,6 +950,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (!test_bit(hwc->idx, c->idxmsk))
break;
+ mask = BIT_ULL(hwc->idx);
if (is_counter_pair(hwc))
mask |= mask << 1;
On 1/10/20 9:09 AM, Peter Zijlstra wrote:
> On Wed, Jan 08, 2020 at 04:26:47PM -0600, Kim Phillips wrote:
>> On 12/20/19 6:09 AM, Peter Zijlstra wrote:
>>> On Thu, Nov 14, 2019 at 12:37:20PM -0600, Kim Phillips wrote:
>> One problem I see with your change in the "not already used" fastpath
>> area, is that the new mask variable gets updated with position 'i'
>> regardless of any previous Large Increment event assignments.
>
> Urgh, I completely messed that up. Find the below delta (I'll push out a
> new version to queue.git as well).
OK, I tested what you pushed on your perf/amd branch, and it passes all my tests.
BTW, a large part of the commit message went missing, hopefully it'll be brought back before being pushed further upstream?
Thank you,
Kim
On Fri, Jan 10, 2020 at 10:22:10AM -0600, Kim Phillips wrote:
> On 1/10/20 9:09 AM, Peter Zijlstra wrote:
> > On Wed, Jan 08, 2020 at 04:26:47PM -0600, Kim Phillips wrote:
> >> On 12/20/19 6:09 AM, Peter Zijlstra wrote:
> >>> On Thu, Nov 14, 2019 at 12:37:20PM -0600, Kim Phillips wrote:
> >> One problem I see with your change in the "not already used" fastpath
> >> area, is that the new mask variable gets updated with position 'i'
> >> regardless of any previous Large Increment event assignments.
> >
> > Urgh, I completely messed that up. Find the below delta (I'll push out a
> > new version to queue.git as well).
>
> OK, I tested what you pushed on your perf/amd branch, and it passes all my tests.
Excellent!
> BTW, a large part of the commit message went missing, hopefully it'll be brought back before being pushed further upstream?
Argh.. it's those ---- lines; my script thinks they're cuts. I'll go
fix it up.
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 471af006a747f1c535c8a8c6c0973c320fe01b22
Gitweb: https://git.kernel.org/tip/471af006a747f1c535c8a8c6c0973c320fe01b22
Author: Kim Phillips <[email protected]>
AuthorDate: Thu, 14 Nov 2019 12:37:19 -06:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 17 Jan 2020 10:19:26 +01:00
perf/x86/amd: Constrain Large Increment per Cycle events
AMD Family 17h processors and above gain support for Large Increment
per Cycle events. Unfortunately there is no CPUID or equivalent bit
that indicates whether the feature exists or not, so we continue to
determine eligibility based on a CPU family number comparison.
For Large Increment per Cycle events, we add an f17h-and-compatibles
get_event_constraints_f17h() that returns an even counter bitmask:
Large Increment per Cycle events can only be placed on PMCs 0, 2,
and 4 out of the currently available 0-5. The only currently
public event that requires this feature to report valid counts
is PMCx003 "Retired SSE/AVX Operations".
Note that the CPU family logic in amd_core_pmu_init() is changed
so as to be able to selectively add initialization for features
available in ranges of backward-compatible CPU families. This
Large Increment per Cycle feature is expected to be retained
in future families.
A side-effect of assigning a new get_constraints function for f17h is
that the old (prior to f15h) amd_get_event_constraints implementation,
left enabled by commit e40ed1542dd7 ("perf/x86: Add perf support for
AMD family-17h processors"), is no longer called; it is no longer
necessary since those North Bridge event codes are obsolete.
Also fix a spelling mistake whilst in the area (calulating ->
calculating).
Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/amd/core.c | 91 +++++++++++++++++++++++------------
arch/x86/events/perf_event.h | 2 +-
2 files changed, 63 insertions(+), 30 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index a7752cd..571168f 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -301,6 +301,25 @@ static inline int amd_pmu_addr_offset(int index, bool eventsel)
return offset;
}
+/*
+ * AMD64 events are detected based on their event codes.
+ */
+static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
+{
+ return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff);
+}
+
+static inline bool amd_is_pair_event_code(struct hw_perf_event *hwc)
+{
+ if (!(x86_pmu.flags & PMU_FL_PAIR))
+ return false;
+
+ switch (amd_get_event_code(hwc)) {
+ case 0x003: return true; /* Retired SSE/AVX FLOPs */
+ default: return false;
+ }
+}
+
static int amd_core_hw_config(struct perf_event *event)
{
if (event->attr.exclude_host && event->attr.exclude_guest)
@@ -319,14 +338,6 @@ static int amd_core_hw_config(struct perf_event *event)
return 0;
}
-/*
- * AMD64 events are detected based on their event codes.
- */
-static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
-{
- return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff);
-}
-
static inline int amd_is_nb_event(struct hw_perf_event *hwc)
{
return (hwc->config & 0xe0) == 0xe0;
@@ -855,6 +866,20 @@ amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, int idx,
}
}
+static struct event_constraint pair_constraint;
+
+static struct event_constraint *
+amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (amd_is_pair_event_code(hwc))
+ return &pair_constraint;
+
+ return &unconstrained;
+}
+
static ssize_t amd_event_sysfs_show(char *page, u64 config)
{
u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
@@ -898,33 +923,15 @@ static __initconst const struct x86_pmu amd_pmu = {
static int __init amd_core_pmu_init(void)
{
+ u64 even_ctr_mask = 0ULL;
+ int i;
+
if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
return 0;
- /* Avoid calulating the value each time in the NMI handler */
+ /* Avoid calculating the value each time in the NMI handler */
perf_nmi_window = msecs_to_jiffies(100);
- switch (boot_cpu_data.x86) {
- case 0x15:
- pr_cont("Fam15h ");
- x86_pmu.get_event_constraints = amd_get_event_constraints_f15h;
- break;
- case 0x17:
- pr_cont("Fam17h ");
- /*
- * In family 17h, there are no event constraints in the PMC hardware.
- * We fallback to using default amd_get_event_constraints.
- */
- break;
- case 0x18:
- pr_cont("Fam18h ");
- /* Using default amd_get_event_constraints. */
- break;
- default:
- pr_err("core perfctr but no constraints; unknown hardware!\n");
- return -ENODEV;
- }
-
/*
* If core performance counter extensions exists, we must use
* MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also
@@ -939,6 +946,30 @@ static int __init amd_core_pmu_init(void)
*/
x86_pmu.amd_nb_constraints = 0;
+ if (boot_cpu_data.x86 == 0x15) {
+ pr_cont("Fam15h ");
+ x86_pmu.get_event_constraints = amd_get_event_constraints_f15h;
+ }
+ if (boot_cpu_data.x86 >= 0x17) {
+ pr_cont("Fam17h+ ");
+ /*
+ * Family 17h and compatibles have constraints for Large
+ * Increment per Cycle events: they may only be assigned an
+ * even numbered counter that has a consecutive adjacent odd
+ * numbered counter following it.
+ */
+ for (i = 0; i < x86_pmu.num_counters - 1; i += 2)
+ even_ctr_mask |= 1 << i;
+
+ pair_constraint = (struct event_constraint)
+ __EVENT_CONSTRAINT(0, even_ctr_mask, 0,
+ x86_pmu.num_counters / 2, 0,
+ PERF_X86_EVENT_PAIR);
+
+ x86_pmu.get_event_constraints = amd_get_event_constraints_f17h;
+ x86_pmu.flags |= PMU_FL_PAIR;
+ }
+
pr_cont("core perfctr, ");
return 0;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 930611d..e2fd363 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -77,6 +77,7 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
+#define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -743,6 +744,7 @@ do { \
#define PMU_FL_EXCL_ENABLED 0x8 /* exclusive counter active */
#define PMU_FL_PEBS_ALL 0x10 /* all events are valid PEBS events */
#define PMU_FL_TFA 0x20 /* deal with TSX force abort */
+#define PMU_FL_PAIR 0x40 /* merge counters for large incr. events */
#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr