LinuxLists.cc - [RFC] perf/x86/amd: add support for Large Increment per Cycle Events

2019-08-26 21:50:25

Subject: [RFC] perf/x86/amd: add support for Large Increment per Cycle Events

The core AMD PMU has a 4-bit wide per-cycle increment for each
performance monitor counter. That works for most counters, but
now with AMD Family 17h and above processors, for some, more than 15
events can occur in a cycle. Those events are called "Large
Increment per Cycle" events, and one example is the number of
SSE/AVX FLOPs retired (event code 0x003). In order to count these
events, two adjacent h/w PMCs get their count signals merged
to form 8 bits per cycle total. In addition, the PERF_CTR count
registers are merged to be able to count up to 64 bits.

Normally, events like instructions retired, get programmed on a single
counter like so:

PERF_CTL0 (MSR 0xc0010200) 0x000000000053ff0c # event 0x0c, umask 0xff
PERF_CTR0 (MSR 0xc0010201) 0x0000800000000001 # r/w 48-bit count

The next counter at MSRs 0xc0010202-3 remains unused, or can be used
independently to count something else.

When counting Large Increment per Cycle events, such as FLOPs,
however, we now have to reserve the next counter and program the
PERF_CTL (config) register with the Merge event (0xFFF), like so:

PERF_CTL0 (msr 0xc0010200) 0x000000000053ff03 # FLOPs event, umask 0xff
PERF_CTR0 (msr 0xc0010201) 0x0000800000000001 # read 64-bit count, wr low 48b
PERF_CTL1 (msr 0xc0010202) 0x0000000f004000ff # Merge event, enable bit
PERF_CTR1 (msr 0xc0010203) 0x0000000000000000 # write higher 16-bits of count

The count is widened from the normal 48-bits to 64 bits by having the
second counter carry the higher 16 bits of the count in its lower 16
bits of its counter register. Support for mixed 48- and 64-bit counting
is not supported in this version.

For more details, search a Family 17h PPR for the "Large Increment per
Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this version:

https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip

In order to support reserving the extra counter for a single Large
Increment per Cycle event in the perf core, we:

1. Add a f17h get_event_constraints() that returns only an even counter
bitmask, since Large Increment events can only be placed on counters 0,
2, and 4 out of the currently available 0-5.

2. We add a commit_scheduler hook that adds the Merge event (0xFFF) to
any Large Increment event being scheduled. If the event being scheduled
is not a Large Increment event, we check for, and remove any
pre-existing Large Increment events on the next counter.

3. In the main x86 scheduler, we reduce the number of available
counters by the number of Large Increment per Cycle events being added.
This improves the counter scheduler success rate.

4. In perf_assign_events(), if a counter is assigned to a Large
Increment event, we increment the current counter variable, so the
counter used for the Merge event is skipped.

5. In find_counter(), if a counter has been found for the
Large Increment event, we set the next counter as used, to
prevent other events from using it.

A side-effect of assigning a new get_constraints function for f17h
disables calling the old (prior to f15h) amd_get_event_constraints
implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf
support for AMD family-17h processors"), which is no longer
necessary since those North Bridge events are obsolete.

Simple invocation example:

perf stat -e cpu/fp_ret_sse_avx_ops.all/,cpu/instructions/, \
cpu/event=0x03,umask=0xff/ <workload>

Performance counter stats for '<workload>':

800,000,000 cpu/fp_ret_sse_avx_ops.all/u
300,042,101 cpu/instructions/u
800,000,000 cpu/event=0x03,umask=0xff/u

0.041359898 seconds time elapsed

0.041200000 seconds user
0.000000000 seconds sys

Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
Signed-off-by: Kim Phillips <[email protected]>
Cc: Janakarajan Natarajan <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Martin Liska <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
RFC because I'd like input on the approach, including how to add support
for mixed-width (48- and 64-bit) counting for a single PMU. Plus there
are bugs:

- with nmi_watchdog=0, single invocations work, but it fails to count
correctly under when invoking two simultaneous perfs pinned to the
same cpu

- it fails to count correctly under certain conditions with
nmi_watchdog=1

arch/x86/events/amd/core.c | 102 +++++++++++++++++++++++++++--------
arch/x86/events/core.c | 39 +++++++++++++-
arch/x86/events/perf_event.h | 46 ++++++++++++++++
3 files changed, 163 insertions(+), 24 deletions(-)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index e7d35f60d53f..351e72449fb8 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -12,6 +12,10 @@

static DEFINE_PER_CPU(unsigned int, perf_nmi_counter);

+/* AMD Event 0xFFF: Merge. For pairing with Large Increment per Cycle events */
+#define AMD_MERGE_EVENT (0xFFULL | (0xFULL << 32))
+#define AMD_MERGE_EVENT_ENABLE (AMD_MERGE_EVENT | ARCH_PERFMON_EVENTSEL_ENABLE)
+
static __initconst const u64 amd_hw_cache_event_ids
[PERF_COUNT_HW_CACHE_MAX]
[PERF_COUNT_HW_CACHE_OP_MAX]
@@ -317,19 +321,6 @@ static int amd_core_hw_config(struct perf_event *event)
return 0;
}

-/*
- * AMD64 events are detected based on their event codes.
- */
-static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
-{
- return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff);
-}
-
-static inline int amd_is_nb_event(struct hw_perf_event *hwc)
-{
- return (hwc->config & 0xe0) == 0xe0;
-}
-
static inline int amd_has_nb(struct cpu_hw_events *cpuc)
{
struct amd_nb *nb = cpuc->amd_nb;
@@ -863,6 +854,53 @@ amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, int idx,
}
}

+static struct event_constraint lg_inc_constraint;
+
+static struct event_constraint *
+amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (is_lg_inc_event(hwc))
+ return &lg_inc_constraint;
+
+ return &unconstrained;
+}
+
+/*
+ * Add the Merge event to any merged event being scheduled.
+ * Otherwise, check and remove any pre-existing Merge events
+ * on the next adjacent counter
+ */
+static void
+amd_commit_scheduling_f17h(struct cpu_hw_events *cpuc, int idx, int cntr)
+{
+ struct event_constraint *c = cpuc->event_constraint[idx];
+ unsigned int next_ctr_ctl;
+ u64 config;
+
+ /*
+ * Merge events only modify events on even numbered counters,
+ * and up to the one before last
+ */
+ if ((cntr & 1) || cntr + 1 >= x86_pmu.num_counters)
+ return;
+
+ next_ctr_ctl = x86_pmu_config_addr(cntr + 1);
+
+ if (c->flags & PERF_X86_EVENT_LARGE_INC) {
+ wrmsrl(next_ctr_ctl, AMD_MERGE_EVENT_ENABLE);
+ } else {
+ rdmsrl(next_ctr_ctl, config);
+ if ((config & AMD_MERGE_EVENT_ENABLE) ==
+ AMD_MERGE_EVENT_ENABLE) {
+ config &= ~AMD_MERGE_EVENT_ENABLE;
+ wrmsrl(next_ctr_ctl, config);
+ }
+ }
+}
+
static ssize_t amd_event_sysfs_show(char *page, u64 config)
{
u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
@@ -906,9 +944,21 @@ static __initconst const struct x86_pmu amd_pmu = {

static int __init amd_core_pmu_init(void)
{
+ u64 even_ctr_mask = 0ULL;
+ int i;
+
if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
return 0;

+ /*
+ * If core performance counter extensions exists, we must use
+ * MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also
+ * amd_pmu_addr_offset().
+ */
+ x86_pmu.eventsel = MSR_F15H_PERF_CTL;
+ x86_pmu.perfctr = MSR_F15H_PERF_CTR;
+ x86_pmu.num_counters = AMD64_NUM_COUNTERS_CORE;
+
switch (boot_cpu_data.x86) {
case 0x15:
pr_cont("Fam15h ");
@@ -917,9 +967,23 @@ static int __init amd_core_pmu_init(void)
case 0x17:
pr_cont("Fam17h ");
/*
- * In family 17h, there are no event constraints in the PMC hardware.
- * We fallback to using default amd_get_event_constraints.
+ * Family 17h has constraints for Large Increment per Cycle
+ * events: they may only land on an even numbered counter
+ * that has a consecutive adjacent odd numbered counter
+ * following it.
*/
+ for (i = 0; i < x86_pmu.num_counters - 1; i += 2)
+ even_ctr_mask |= 1 << i;
+
+ lg_inc_constraint = (struct event_constraint)
+ __EVENT_CONSTRAINT(0, even_ctr_mask, 0,
+ x86_pmu.num_counters / 2, 0,
+ PERF_X86_EVENT_LARGE_INC);
+
+ x86_pmu.get_event_constraints = amd_get_event_constraints_f17h;
+ x86_pmu.commit_scheduling = amd_commit_scheduling_f17h;
+ x86_pmu.flags |= PMU_FL_MERGE;
+
break;
case 0x18:
pr_cont("Fam18h ");
@@ -930,14 +994,6 @@ static int __init amd_core_pmu_init(void)
return -ENODEV;
}

- /*
- * If core performance counter extensions exists, we must use
- * MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also
- * amd_pmu_addr_offset().
- */
- x86_pmu.eventsel = MSR_F15H_PERF_CTL;
- x86_pmu.perfctr = MSR_F15H_PERF_CTR;
- x86_pmu.num_counters = AMD64_NUM_COUNTERS_CORE;
/*
* AMD Core perfctr has separate MSRs for the NB events, see
* the amd/uncore.c driver.
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 325959d19d9a..4596c141f348 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -787,6 +787,18 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
if (!__test_and_set_bit(idx, sched->state.used)) {
if (sched->state.nr_gp++ >= sched->max_gp)
return false;
+ if (c->flags & PERF_X86_EVENT_LARGE_INC) {
+ /*
+ * merged events need the Merge event
+ * on the next counter
+ */
+ if (__test_and_set_bit(idx + 1,
+ sched->state.used))
+ /* next counter already used */
+ return false;
+
+ set_bit(idx + 1, sched->state.used);
+ }

goto done;
}
@@ -849,14 +861,20 @@ int perf_assign_events(struct event_constraint **constraints, int n,
int wmin, int wmax, int gpmax, int *assign)
{
struct perf_sched sched;
+ struct event_constraint *c;
+

perf_sched_init(&sched, constraints, n, wmin, wmax, gpmax);

do {
if (!perf_sched_find_counter(&sched))
break; /* failed */
- if (assign)
+ if (assign) {
assign[sched.state.event] = sched.state.counter;
+ c = constraints[sched.state.event];
+ if (c->flags & PERF_X86_EVENT_LARGE_INC)
+ sched.state.counter++;
+ }
} while (perf_sched_next_event(&sched));

return sched.state.unassigned;
@@ -952,6 +970,18 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
READ_ONCE(cpuc->excl_cntrs->exclusive_present))
gpmax /= 2;

+ /*
+ * reduce the amount of available counters
+ * to allow fitting the Merge event along
+ * with their large increment event
+ */
+ if (x86_pmu.flags & PMU_FL_MERGE) {
+ for (i = 0; i < n; i++) {
+ hwc = &cpuc->event_list[i]->hw;
+ if (is_lg_inc_event(hwc) && gpmax > 1)
+ gpmax--;
+ }
+ }
unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
wmax, gpmax, assign);
}
@@ -1210,6 +1240,13 @@ int x86_perf_event_set_period(struct perf_event *event)

wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

+ /*
+ * Clear the Merge event counter's upper 16 bits since
+ * we currently declare a 48-bit counter width
+ */
+ if (is_lg_inc_event(hwc))
+ wrmsrl(x86_pmu_event_addr(idx + 1), 0);
+
/*
* Due to erratum on certan cpu we need
* a second write to be sure the register
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8751008fc170..533fd9f982fc 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -76,6 +76,7 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_EXCL_ACCT 0x0100 /* accounted EXCL event */
#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
+#define PERF_X86_EVENT_LARGE_INC 0x0800 /* Large Increment per Cycle */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -324,6 +325,8 @@ struct cpu_hw_events {
*/
#define INTEL_EVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT)
+#define AMD_EVENT_CONSTRAINT(c, n) \
+ EVENT_CONSTRAINT(c, n, AMD64_EVENTSEL_EVENT)

/*
* Constraint on a range of Event codes
@@ -723,6 +726,7 @@ do { \
#define PMU_FL_EXCL_ENABLED 0x8 /* exclusive counter active */
#define PMU_FL_PEBS_ALL 0x10 /* all events are valid PEBS events */
#define PMU_FL_TFA 0x20 /* deal with TSX force abort */
+#define PMU_FL_MERGE 0x40 /* merge counters for large incr. events */

#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
@@ -892,6 +896,28 @@ ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,

int amd_pmu_init(void);

+/*
+ * AMD64 events are detected based on their event codes.
+ */
+static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
+{
+ return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff);
+}
+
+static inline int amd_is_nb_event(struct hw_perf_event *hwc)
+{
+ return (hwc->config & 0xe0) == 0xe0;
+}
+
+/* assumes x86_pmu.flags & PMU_FL_MERGE */
+static inline bool is_lg_inc_event(struct hw_perf_event *hwc)
+{
+ switch (amd_get_event_code(hwc)) {
+ case 0x003: return true; /* Retired SSE/AVX FLOPs */
+ default: return false;
+ }
+}
+
#else /* CONFIG_CPU_SUP_AMD */

static inline int amd_pmu_init(void)
@@ -899,6 +925,26 @@ static inline int amd_pmu_init(void)
return 0;
}

+static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc)
+{
+ return 0;
+}
+
+static inline int amd_is_nb_event(struct hw_perf_event *hwc)
+{
+ return 0;
+}
+
+static inline int amd_is_merged_event(struct hw_perf_event *hwc)
+{
+ return 0;
+}
+
+static inline bool is_lg_inc_event(struct hw_perf_event *hwc)
+{
+ return 0;
+}
+
#endif /* CONFIG_CPU_SUP_AMD */

#ifdef CONFIG_CPU_SUP_INTEL
--
2.23.0

2019-08-28 12:49:45

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] perf/x86/amd: add support for Large Increment per Cycle Events

On Mon, Aug 26, 2019 at 02:59:15PM -0500, Kim Phillips wrote:
> The core AMD PMU has a 4-bit wide per-cycle increment for each
> performance monitor counter. That works for most counters, but
> now with AMD Family 17h and above processors, for some, more than 15
> events can occur in a cycle. Those events are called "Large
> Increment per Cycle" events, and one example is the number of
> SSE/AVX FLOPs retired (event code 0x003). In order to count these
> events, two adjacent h/w PMCs get their count signals merged
> to form 8 bits per cycle total.

*groan*

> In addition, the PERF_CTR count
> registers are merged to be able to count up to 64 bits.

That is daft; why can't you extend the existing MSR to 64bit?

> Normally, events like instructions retired, get programmed on a single
> counter like so:
>
> PERF_CTL0 (MSR 0xc0010200) 0x000000000053ff0c # event 0x0c, umask 0xff
> PERF_CTR0 (MSR 0xc0010201) 0x0000800000000001 # r/w 48-bit count
>
> The next counter at MSRs 0xc0010202-3 remains unused, or can be used
> independently to count something else.
>
> When counting Large Increment per Cycle events, such as FLOPs,
> however, we now have to reserve the next counter and program the
> PERF_CTL (config) register with the Merge event (0xFFF), like so:
>
> PERF_CTL0 (msr 0xc0010200) 0x000000000053ff03 # FLOPs event, umask 0xff
> PERF_CTR0 (msr 0xc0010201) 0x0000800000000001 # read 64-bit count, wr low 48b
> PERF_CTL1 (msr 0xc0010202) 0x0000000f004000ff # Merge event, enable bit
> PERF_CTR1 (msr 0xc0010203) 0x0000000000000000 # write higher 16-bits of count
>
> The count is widened from the normal 48-bits to 64 bits by having the
> second counter carry the higher 16 bits of the count in its lower 16
> bits of its counter register. Support for mixed 48- and 64-bit counting
> is not supported in this version.

This is diguisting.. please talk to your hardware people. I sort of
understand the pairing, but that upper 16 bit split for writes is just
woeful crap.

> For more details, search a Family 17h PPR for the "Large Increment per
> Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this version:
>
> https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip

My mama told me not to open random zip files of the interweb :-)

Also; afaict the only additional information there is that it works in
odd/even pairs and you have to program the odd one before the even one.
Surely you could've included that here.

> In order to support reserving the extra counter for a single Large
> Increment per Cycle event in the perf core, we:
>
> 1. Add a f17h get_event_constraints() that returns only an even counter
> bitmask, since Large Increment events can only be placed on counters 0,
> 2, and 4 out of the currently available 0-5.

So hereby you promise that all LI events are unconstrained, right?
Also, what marks the paired counter in the used mask? Aaah, you modify
__perf_sched_find_counter(). Comments below.

> 2. We add a commit_scheduler hook that adds the Merge event (0xFFF) to
> any Large Increment event being scheduled. If the event being scheduled
> is not a Large Increment event, we check for, and remove any
> pre-existing Large Increment events on the next counter.

That is weird at best; the scheduling hooks shouldn't be the one doing
the programming; that should be done in x86_pmu_enable(). Can't you do
this by changing amd_pmu::{en,dis}able() ?

(also; we really should rename some of those x86_pmu::ops :/)

> 3. In the main x86 scheduler, we reduce the number of available
> counters by the number of Large Increment per Cycle events being added.
> This improves the counter scheduler success rate.
>
> 4. In perf_assign_events(), if a counter is assigned to a Large
> Increment event, we increment the current counter variable, so the
> counter used for the Merge event is skipped.
>
> 5. In find_counter(), if a counter has been found for the
> Large Increment event, we set the next counter as used, to
> prevent other events from using it.
>
> A side-effect of assigning a new get_constraints function for f17h
> disables calling the old (prior to f15h) amd_get_event_constraints
> implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf
> support for AMD family-17h processors"), which is no longer
> necessary since those North Bridge events are obsolete.

> RFC because I'd like input on the approach, including how to add support
> for mixed-width (48- and 64-bit) counting for a single PMU.

Ideally I'd tell you to wait for sane hardware :/

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 325959d19d9a..4596c141f348 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -787,6 +787,18 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
> if (!__test_and_set_bit(idx, sched->state.used)) {
> if (sched->state.nr_gp++ >= sched->max_gp)
> return false;
> + if (c->flags & PERF_X86_EVENT_LARGE_INC) {

Can we please call that something like:

PERF_X86_EVENT_PAIR

> + /*
> + * merged events need the Merge event
> + * on the next counter
> + */
> + if (__test_and_set_bit(idx + 1,
> + sched->state.used))
> + /* next counter already used */
> + return false;

Coding Style wants { } there. Also, remove that line-break.

> +
> + set_bit(idx + 1, sched->state.used);

__set_bit() surely

> + }
>
> goto done;
> }
> @@ -849,14 +861,20 @@ int perf_assign_events(struct event_constraint **constraints, int n,
> int wmin, int wmax, int gpmax, int *assign)
> {
> struct perf_sched sched;
> + struct event_constraint *c;
> +
>
> perf_sched_init(&sched, constraints, n, wmin, wmax, gpmax);
>
> do {
> if (!perf_sched_find_counter(&sched))
> break; /* failed */
> - if (assign)
> + if (assign) {
> assign[sched.state.event] = sched.state.counter;
> + c = constraints[sched.state.event];
> + if (c->flags & PERF_X86_EVENT_LARGE_INC)
> + sched.state.counter++;
> + }

How about you make __perf_sched_find_count() set the right value? That
already knows it did this.

> } while (perf_sched_next_event(&sched));
>
> return sched.state.unassigned;
> @@ -952,6 +970,18 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> READ_ONCE(cpuc->excl_cntrs->exclusive_present))
> gpmax /= 2;
>
> + /*
> + * reduce the amount of available counters
> + * to allow fitting the Merge event along
> + * with their large increment event
> + */
> + if (x86_pmu.flags & PMU_FL_MERGE) {
> + for (i = 0; i < n; i++) {
> + hwc = &cpuc->event_list[i]->hw;
> + if (is_lg_inc_event(hwc) && gpmax > 1)

It should not be possible to hit !gpmax; make that a WARN.

> + gpmax--;
> + }

Alternatively you could have collect_events() cound the number of
'lg_inc' (we really have to come up with a better name) events on the
cpuc. Then you can do a simple subtraction and avoid the loop.

> + }
> unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
> wmax, gpmax, assign);
> }
> @@ -1210,6 +1240,13 @@ int x86_perf_event_set_period(struct perf_event *event)
>
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> + /*
> + * Clear the Merge event counter's upper 16 bits since
> + * we currently declare a 48-bit counter width
> + */
> + if (is_lg_inc_event(hwc))
> + wrmsrl(x86_pmu_event_addr(idx + 1), 0);
> +

*yuck*...

> /*
> * Due to erratum on certan cpu we need
> * a second write to be sure the register

2019-08-28 18:09:53

by Stephane Eranian

[permalink] [raw]

Subject: Re: [RFC] perf/x86/amd: add support for Large Increment per Cycle Events

On Wed, Aug 28, 2019 at 5:47 AM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Aug 26, 2019 at 02:59:15PM -0500, Kim Phillips wrote:
> > The core AMD PMU has a 4-bit wide per-cycle increment for each
> > performance monitor counter. That works for most counters, but
> > now with AMD Family 17h and above processors, for some, more than 15
> > events can occur in a cycle. Those events are called "Large
> > Increment per Cycle" events, and one example is the number of
> > SSE/AVX FLOPs retired (event code 0x003). In order to count these
> > events, two adjacent h/w PMCs get their count signals merged
> > to form 8 bits per cycle total.
>
> *groan*
>
> > In addition, the PERF_CTR count
> > registers are merged to be able to count up to 64 bits.
>
> That is daft; why can't you extend the existing MSR to 64bit?
>
My understanding is that the problem is not the width of the counter
but its ability to increment by more than 15 per cycle. They need two
counters to swallow 16+ events/cycle.

> > Normally, events like instructions retired, get programmed on a single
> > counter like so:
> >
> > PERF_CTL0 (MSR 0xc0010200) 0x000000000053ff0c # event 0x0c, umask 0xff
> > PERF_CTR0 (MSR 0xc0010201) 0x0000800000000001 # r/w 48-bit count
> >
> > The next counter at MSRs 0xc0010202-3 remains unused, or can be used
> > independently to count something else.
> >
> > When counting Large Increment per Cycle events, such as FLOPs,
> > however, we now have to reserve the next counter and program the
> > PERF_CTL (config) register with the Merge event (0xFFF), like so:
> >
> > PERF_CTL0 (msr 0xc0010200) 0x000000000053ff03 # FLOPs event, umask 0xff
> > PERF_CTR0 (msr 0xc0010201) 0x0000800000000001 # read 64-bit count, wr low 48b
> > PERF_CTL1 (msr 0xc0010202) 0x0000000f004000ff # Merge event, enable bit
> > PERF_CTR1 (msr 0xc0010203) 0x0000000000000000 # write higher 16-bits of count
> >
> > The count is widened from the normal 48-bits to 64 bits by having the
> > second counter carry the higher 16 bits of the count in its lower 16
> > bits of its counter register. Support for mixed 48- and 64-bit counting
> > is not supported in this version.
>
> This is diguisting.. please talk to your hardware people. I sort of
> understand the pairing, but that upper 16 bit split for writes is just
> woeful crap.
>
> > For more details, search a Family 17h PPR for the "Large Increment per
> > Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this version:
> >
> > https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip
>
> My mama told me not to open random zip files of the interweb :-)
>
> Also; afaict the only additional information there is that it works in
> odd/even pairs and you have to program the odd one before the even one.
> Surely you could've included that here.
>
> > In order to support reserving the extra counter for a single Large
> > Increment per Cycle event in the perf core, we:
> >
> > 1. Add a f17h get_event_constraints() that returns only an even counter
> > bitmask, since Large Increment events can only be placed on counters 0,
> > 2, and 4 out of the currently available 0-5.
>
> So hereby you promise that all LI events are unconstrained, right?
> Also, what marks the paired counter in the used mask? Aaah, you modify
> __perf_sched_find_counter(). Comments below.
>
> > 2. We add a commit_scheduler hook that adds the Merge event (0xFFF) to
> > any Large Increment event being scheduled. If the event being scheduled
> > is not a Large Increment event, we check for, and remove any
> > pre-existing Large Increment events on the next counter.
>
> That is weird at best; the scheduling hooks shouldn't be the one doing
> the programming; that should be done in x86_pmu_enable(). Can't you do
> this by changing amd_pmu::{en,dis}able() ?
>
> (also; we really should rename some of those x86_pmu::ops :/)
>
> > 3. In the main x86 scheduler, we reduce the number of available
> > counters by the number of Large Increment per Cycle events being added.
> > This improves the counter scheduler success rate.
> >
> > 4. In perf_assign_events(), if a counter is assigned to a Large
> > Increment event, we increment the current counter variable, so the
> > counter used for the Merge event is skipped.
> >
> > 5. In find_counter(), if a counter has been found for the
> > Large Increment event, we set the next counter as used, to
> > prevent other events from using it.
> >
> > A side-effect of assigning a new get_constraints function for f17h
> > disables calling the old (prior to f15h) amd_get_event_constraints
> > implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf
> > support for AMD family-17h processors"), which is no longer
> > necessary since those North Bridge events are obsolete.
>
> > RFC because I'd like input on the approach, including how to add support
> > for mixed-width (48- and 64-bit) counting for a single PMU.
>
> Ideally I'd tell you to wait for sane hardware :/
>
>
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index 325959d19d9a..4596c141f348 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -787,6 +787,18 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
> > if (!__test_and_set_bit(idx, sched->state.used)) {
> > if (sched->state.nr_gp++ >= sched->max_gp)
> > return false;
> > + if (c->flags & PERF_X86_EVENT_LARGE_INC) {
>
> Can we please call that something like:
>
> PERF_X86_EVENT_PAIR
>
> > + /*
> > + * merged events need the Merge event
> > + * on the next counter
> > + */
> > + if (__test_and_set_bit(idx + 1,
> > + sched->state.used))
> > + /* next counter already used */
> > + return false;
>
> Coding Style wants { } there. Also, remove that line-break.
>
> > +
> > + set_bit(idx + 1, sched->state.used);
>
> __set_bit() surely
>
> > + }
> >
> > goto done;
> > }
> > @@ -849,14 +861,20 @@ int perf_assign_events(struct event_constraint **constraints, int n,
> > int wmin, int wmax, int gpmax, int *assign)
> > {
> > struct perf_sched sched;
> > + struct event_constraint *c;
> > +
> >
> > perf_sched_init(&sched, constraints, n, wmin, wmax, gpmax);
> >
> > do {
> > if (!perf_sched_find_counter(&sched))
> > break; /* failed */
> > - if (assign)
> > + if (assign) {
> > assign[sched.state.event] = sched.state.counter;
> > + c = constraints[sched.state.event];
> > + if (c->flags & PERF_X86_EVENT_LARGE_INC)
> > + sched.state.counter++;
> > + }
>
> How about you make __perf_sched_find_count() set the right value? That
> already knows it did this.
>
> > } while (perf_sched_next_event(&sched));
> >
> > return sched.state.unassigned;
> > @@ -952,6 +970,18 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> > READ_ONCE(cpuc->excl_cntrs->exclusive_present))
> > gpmax /= 2;
> >
> > + /*
> > + * reduce the amount of available counters
> > + * to allow fitting the Merge event along
> > + * with their large increment event
> > + */
> > + if (x86_pmu.flags & PMU_FL_MERGE) {
> > + for (i = 0; i < n; i++) {
> > + hwc = &cpuc->event_list[i]->hw;
> > + if (is_lg_inc_event(hwc) && gpmax > 1)
>
> It should not be possible to hit !gpmax; make that a WARN.
>
> > + gpmax--;
> > + }
>
> Alternatively you could have collect_events() cound the number of
> 'lg_inc' (we really have to come up with a better name) events on the
> cpuc. Then you can do a simple subtraction and avoid the loop.
>
> > + }
> > unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
> > wmax, gpmax, assign);
> > }
> > @@ -1210,6 +1240,13 @@ int x86_perf_event_set_period(struct perf_event *event)
> >
> > wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
> >
> > + /*
> > + * Clear the Merge event counter's upper 16 bits since
> > + * we currently declare a 48-bit counter width
> > + */
> > + if (is_lg_inc_event(hwc))
> > + wrmsrl(x86_pmu_event_addr(idx + 1), 0);
> > +
>
> *yuck*...
>
> > /*
> > * Due to erratum on certan cpu we need
> > * a second write to be sure the register