2019-08-26 15:39:51

by Liang, Kan

Subject: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

From: Kan Liang <[email protected]>

Intro
=====

Icelake has support for measuring the four top level TopDown metrics
directly in hardware. This is implemented by an additional "metrics"
register, and a new Fixed Counter 3 that measures pipeline "slots".

Events
======

We export four metric events as separate perf events, which map to the
internal "metrics" counter register. Those events do not exist in
hardware, but can be allocated by the scheduler.

For the event mapping we use a special 0x00 event code, which is
reserved for fake events. The metric events start from umask 0x10.

When setting up such events, they point to the slots counter, and a
special callback, update_topdown_event(), reads the additional metrics
MSR to generate the metrics. The metric is then reported by multiplying
the metric (percentage) by slots.

This multiplication makes it easy to keep a running count, for example
when the slots counter overflows, and makes all the standard tools, such
as perf stat, work. They can take deltas of the values without needing
to know about percentages. This also simplifies accumulating the counts
of child events, which would otherwise need to know how to average
percentage values.
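
As a rough illustration (not part of the patch, numbers made up), the
conversion from one 8bit metric fraction plus a SLOTS value to the
reported count amounts to this:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t slots = 1000000;       /* value read from Fixed Counter 3 */
        uint64_t frac  = 0x40;          /* one 8bit field of PERF_METRICS  */

        /* slots-in-metric = slots * frac / 0xff, i.e. ~25% of slots */
        printf("%llu\n", (unsigned long long)(slots * frac / 0xff));
        return 0;
}

perf stat then only ever sees such slot counts and can take deltas of
them as usual.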

None of the four metric events supports sampling. Since they are handled
specially for event update, a new flag, PERF_X86_EVENT_TOPDOWN, is
introduced to indicate this case.

The slots event supports both sampling and counting.
For counting, the flag is applied as well.
For sampling, it is handled like any other normal event.

Groups
======

To avoid reading the METRICS register multiple times, the metrics and
slots values can only be updated by the first slots/metrics event in a
group. All active slots and metrics events will be updated at the same time.

Reset
======

The PERF_METRICS and Fixed counter 3 have to be reset for each read,
because:
- The 8bit metrics ratio values lose precision when the measurement
period gets longer.
- The PERF_METRICS may report a wrong value if its delta was less than
1/255 of SLOTS (Fixed counter 3).

Also, for counting, -max_period is the initial value of SLOTS. Such a
huge initial value will definitely trigger the issue mentioned above.
Force the initial value to 0 for topdown and slots event counting.

NMI
======

The METRICS register may overflow. When it does, bit 48 of the STATUS
register is set. If so, update all active slots and metrics events.

The update_topdown_event() has to read two registers separately. The
values may be modified by an NMI, so the PMU has to be disabled before
calling the function.

RDPMC
======

RDPMC is temporarily disabled. The following patch will enable it.

Originally-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 10 ++
arch/x86/events/intel/core.c | 230 ++++++++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 17 +++
arch/x86/include/asm/msr-index.h | 2 +
4 files changed, 255 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 54534ff00940..1ae23db5c2d7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;

+ if (is_topdown_count(event) && x86_pmu.update_topdown_event)
+ return x86_pmu.update_topdown_event(event);
/*
* Careful: an NMI might modify the previous event value.
*
@@ -1003,6 +1005,10 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,

max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;

+ /* There are 4 TopDown metrics events. */
+ if (x86_pmu.intel_cap.perf_metrics)
+ max_count += 4;
+
/* current number of events already accepted */
n = cpuc->n_events;

@@ -1184,6 +1190,10 @@ int x86_perf_event_set_period(struct perf_event *event)
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;

+ if (unlikely(is_topdown_count(event)) &&
+ x86_pmu.set_topdown_event_period)
+ return x86_pmu.set_topdown_event_period(event);
+
/*
* If we are way outside a reasonable range then just skip forward:
*/
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f4d6335a18e2..616313d7f3d7 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -247,6 +247,10 @@ static struct event_constraint intel_icl_event_constraints[] = {
FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */
+ METRIC_EVENT_CONSTRAINT(0x1000, 0), /* Retiring metric */
+ METRIC_EVENT_CONSTRAINT(0x1100, 1), /* Bad speculation metric */
+ METRIC_EVENT_CONSTRAINT(0x1200, 2), /* FE bound metric */
+ METRIC_EVENT_CONSTRAINT(0x1300, 3), /* BE bound metric */
INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */
@@ -267,6 +271,14 @@ static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1),
INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
+ /*
+ * The original Fixed Ctr 3 is shared by different metrics
+ * events. So use the extra reg to enforce the same
+ * configuration on the original register, but do not actually
+ * write to it.
+ */
+ INTEL_UEVENT_EXTRA_REG(0x0400, 0, -1L, TOPDOWN),
+ INTEL_UEVENT_TOPDOWN_EXTRA_REG(0x1000),
EVENT_EXTRA_END
};

@@ -2190,10 +2202,163 @@ static void intel_pmu_del_event(struct perf_event *event)
intel_pmu_pebs_del(event);
}

+static inline bool is_metric_event(struct perf_event *event)
+{
+ return ((event->attr.config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
+ ((event->attr.config & INTEL_ARCH_EVENT_MASK) >= 0x1000) &&
+ ((event->attr.config & INTEL_ARCH_EVENT_MASK) <= 0x1300);
+}
+
+static inline bool is_slots_event(struct perf_event *event)
+{
+ return (event->attr.config & INTEL_ARCH_EVENT_MASK) == 0x0400;
+}
+
+static inline bool is_topdown_event(struct perf_event *event)
+{
+ return is_metric_event(event) || is_slots_event(event);
+}
+
+static bool is_first_topdown_event_in_group(struct perf_event *event)
+{
+ struct perf_event *first = NULL;
+
+ if (is_topdown_event(event->group_leader))
+ first = event->group_leader;
+ else {
+ for_each_sibling_event(first, event->group_leader)
+ if (is_topdown_event(first))
+ break;
+ }
+
+ if (event == first)
+ return true;
+
+ return false;
+}
+
+static int icl_set_topdown_event_period(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ s64 left = local64_read(&hwc->period_left);
+
+ /*
+ * Clear PERF_METRICS and Fixed counter 3 in initialization.
+ * After that, both MSRs will be cleared for each read.
+ * Don't need to clear them again.
+ */
+ if (left == x86_pmu.max_period) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+ local64_set(&hwc->period_left, 0);
+ }
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
+{
+ u32 val;
+
+ /*
+ * The metric is reported as an 8bit integer percentage
+ * summing up to 0xff.
+ * slots-in-metric = (Metric / 0xff) * slots
+ */
+ val = (metric >> ((idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) * 8)) & 0xff;
+ return mul_u64_u32_div(slots, val, 0xff);
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+ u64 slots, u64 metrics)
+{
+ int idx = event->hw.idx;
+ u64 delta;
+
+ if (is_metric_idx(idx))
+ delta = icl_get_metrics_event_value(metrics, slots, idx);
+ else
+ delta = slots;
+
+ local64_add(delta, &event->count);
+}
+
+/*
+ * Update all active Topdown events.
+ * PMU has to be disabled before calling this function.
+ */
+static u64 icl_update_topdown_event(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *other;
+ u64 slots, metrics;
+ int idx;
+
+ /*
+ * Only need to update all events for the first
+ * slots/metrics event in a group
+ */
+ if (event && !is_first_topdown_event_in_group(event))
+ return 0;
+
+ /* read Fixed counter 3 */
+ rdpmcl((3 | 1<<30), slots);
+ if (!slots)
+ return 0;
+
+ /* read PERF_METRICS */
+ rdpmcl((1<<29), metrics);
+
+ for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+ if (!is_topdown_idx(idx))
+ continue;
+ other = cpuc->events[idx];
+ __icl_update_topdown_event(other, slots, metrics);
+ }
+
+ /*
+ * Check and update this event, which may have been cleared
+ * in active_mask e.g. x86_pmu_stop()
+ */
+ if (event && !test_bit(event->hw.idx, cpuc->active_mask))
+ __icl_update_topdown_event(event, slots, metrics);
+
+ /*
+ * To avoid the known issues as below, the PERF_METRICS and
+ * Fixed counter 3 are reset for each read.
+ * - The 8bit metrics ratio values lose precision when the
+ * measurement period gets longer.
+ * - The PERF_METRICS may report wrong value if its delta was
+ * less than 1/255 of Fixed counter 3.
+ */
+ wrmsrl(MSR_PERF_METRICS, 0);
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+
+ return slots;
+}
+
+static void intel_pmu_read_topdown_event(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /* Only need to call update_topdown_event() once for group read. */
+ if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
+ !is_first_topdown_event_in_group(event))
+ return;
+
+ perf_pmu_disable(event->pmu);
+ x86_pmu.update_topdown_event(event);
+ perf_pmu_enable(event->pmu);
+}
+
static void intel_pmu_read_event(struct perf_event *event)
{
if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
intel_pmu_auto_reload_read(event);
+ else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
+ intel_pmu_read_topdown_event(event);
else
x86_perf_event_update(event);
}
@@ -2401,6 +2566,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
intel_pt_interrupt();
}

+ /*
+ * Intel Perf metrics
+ */
+ if (__test_and_clear_bit(48, (unsigned long *)&status)) {
+ handled++;
+ if (x86_pmu.update_topdown_event)
+ x86_pmu.update_topdown_event(NULL);
+ }
+
/*
* Checkpointed counters can lead to 'spurious' PMIs because the
* rollback caused by the PMI will have cleared the overflow status
@@ -3312,6 +3486,42 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.type != PERF_TYPE_RAW)
return 0;

+ /*
+ * Config Topdown slots and metric events
+ *
+ * The slots event on Fixed Counter 3 can support sampling,
+ * which will be handled normally in x86_perf_event_update().
+ *
+ * The metric events don't support sampling.
+ *
+ * For counting, topdown slots and metric events will be
+ * handled specially for event update.
+ * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
+ */
+ if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
+ if (is_metric_event(event) && is_sampling_event(event))
+ return -EINVAL;
+
+ if (!is_sampling_event(event)) {
+ if (event->attr.config1 != 0)
+ return -EINVAL;
+ if (event->attr.config & ARCH_PERFMON_EVENTSEL_ANY)
+ return -EINVAL;
+ /*
+ * Put configuration (minus event) into config1 so that
+ * the scheduler enforces through an extra_reg that
+ * all instances of the metrics events have the same
+ * configuration.
+ */
+ event->attr.config1 = event->hw.config &
+ X86_ALL_EVENT_FLAGS;
+ event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
+
+ if (is_metric_event(event))
+ event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
+ }
+ }
+
if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
return 0;

@@ -5040,6 +5250,8 @@ __init int intel_pmu_init(void)
x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xca, .umask=0x02);
x86_pmu.lbr_pt_coexist = true;
intel_pmu_pebs_data_source_skl(pmem);
+ x86_pmu.update_topdown_event = icl_update_topdown_event;
+ x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
pr_cont("Icelake events, ");
name = "icelake";
break;
@@ -5096,10 +5308,17 @@ __init int intel_pmu_init(void)
* counter, so do not extend mask to generic counters
*/
for_each_event_constraint(c, x86_pmu.event_constraints) {
- if (c->cmask == FIXED_EVENT_FLAGS
- && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES
- && c->idxmsk64 != INTEL_PMC_MSK_FIXED_SLOTS) {
- c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
+ if (c->cmask == FIXED_EVENT_FLAGS) {
+ /*
+ * Don't extend topdown slots and metrics
+ * events to generic counters.
+ */
+ if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) {
+ c->weight = hweight64(c->idxmsk64);
+ continue;
+ }
+ if (c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES)
+ c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
}
c->idxmsk64 &=
~(~0ULL << (INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed));
@@ -5152,6 +5371,9 @@ __init int intel_pmu_init(void)
if (x86_pmu.counter_freezing)
x86_pmu.handle_irq = intel_pmu_handle_irq_v4;

+ if (x86_pmu.intel_cap.perf_metrics)
+ x86_pmu.intel_ctrl |= 1ULL << 48;
+
return 0;
}

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 37f17f55ef2d..7c59f08fadc0 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -40,6 +40,7 @@ enum extra_reg_type {
EXTRA_REG_LBR = 2, /* lbr_select */
EXTRA_REG_LDLAT = 3, /* ld_lat_threshold */
EXTRA_REG_FE = 4, /* fe_* */
+ EXTRA_REG_TOPDOWN = 5, /* Topdown slots/metrics */

EXTRA_REG_MAX /* number of entries needed */
};
@@ -76,6 +77,12 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_EXCL_ACCT 0x0100 /* accounted EXCL event */
#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
+#define PERF_X86_EVENT_TOPDOWN 0x0800 /* Count Topdown slots/metrics events */
+
+static inline bool is_topdown_count(struct perf_event *event)
+{
+ return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
+}

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -509,6 +516,9 @@ struct extra_reg {
0xffff, \
LDLAT)

+#define INTEL_UEVENT_TOPDOWN_EXTRA_REG(event) \
+ EVENT_EXTRA_REG(event, 0, 0xfcff, -1L, TOPDOWN)
+
#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)

union perf_capabilities {
@@ -524,6 +534,7 @@ union perf_capabilities {
*/
u64 full_width_write:1;
u64 pebs_baseline:1;
+ u64 perf_metrics:1;
};
u64 capabilities;
};
@@ -686,6 +697,12 @@ struct x86_pmu {
*/
atomic_t lbr_exclusive[x86_lbr_exclusive_max];

+ /*
+ * Intel perf metrics
+ */
+ u64 (*update_topdown_event)(struct perf_event *event);
+ int (*set_topdown_event_period)(struct perf_event *event);
+
/*
* AMD bits
*/
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 78f3a5ebc1e2..460a419a7214 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -118,6 +118,8 @@
#define MSR_TURBO_RATIO_LIMIT1 0x000001ae
#define MSR_TURBO_RATIO_LIMIT2 0x000001af

+#define MSR_PERF_METRICS 0x00000329
+
#define MSR_LBR_SELECT 0x000001c8
#define MSR_LBR_TOS 0x000001c9
#define MSR_LBR_NHM_FROM 0x00000680
--
2.17.1


2019-08-28 15:32:57

by Peter Zijlstra

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Mon, Aug 26, 2019 at 07:47:35AM -0700, [email protected] wrote:

> Groups
> ======
>
> To avoid reading the METRICS register multiple times, the metrics and
> slots value can only be updated by the first slots/metrics event in a
> group. All active slots and metrics events will be updated one time.

Can't we require SLOTS to be the group leader for any Metric group?

Is there ever a case where we want to add other events to a metric
group?

> Reset
> ======
>
> The PERF_METRICS and Fixed counter 3 have to be reset for each read,
> because:
> - The 8bit metrics ratio values lose precision when the measurement
> period gets longer.

So it mustn't be too hot,

> - The PERF_METRICS may report wrong value if its delta was less than
> 1/255 of SLOTS (Fixed counter 3).

it also mustn't be too cold. But that leaves it unspecified what exactly
the right range is.

IOW, you want a Goldilocks number of SLOTS.

> Also, for counting, the -max_period is the initial value of the SLOTS.
> The huge initial value will definitely trigger the issue mentioned
> above. Force initial value as 0 for topdown and slots event counting.

But you just told us that 0 is wrong too (too cold).

I'm still confused by all this; when exactly does:

> NMI
> ======
>
> The METRICS register may be overflow. The bit 48 of STATUS register
> will be set. If so, update all active slots and metrics events.

that happen? It would be useful to get that METRIC_OVF (can we please
start naming them; 62,55,48 is past silly) at the exact point
where PERF_METRICS is saturated.

If this is so; then we can use this to update/reset PERF_METRICS and
nothing else.

That is; leave the SLOTS programming alone; use -max_period as usual.

Then on METRIC_OVF, read PERF_METRICS and clear it, and update all the
metric events by adding slots_delta * frac / 256 -- where slots_delta is
the SLOTS count since the last METRIC_OVF.

On read; read PERF_METRICS -- BUT DO NOT RESET -- and compute an
intermediate delta and add that to whatever stable count we had.

Maybe something like:

	do {
		count1 = local64_read(&event->count);
		barrier();
		metrics = read_perf_metrics();
		barrier();
		count2 = local64_read(&event->count);
	} while (count1 != count2);

	/* no METRIC_OVF happened and {count,metrics} is consistent */

	return count1 + (slots_delta * frac / 256);

> The update_topdown_event() has to read two registers separately. The
> values may be modify by a NMI. PMU has to be disabled before calling the
> function.

Then there is no mucking about with that odd counter/metrics msr pair
reset nonsense. Because that really stinks.

2019-08-28 15:33:12

by Peter Zijlstra

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Mon, Aug 26, 2019 at 07:47:35AM -0700, [email protected] wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 54534ff00940..1ae23db5c2d7 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
> if (idx == INTEL_PMC_IDX_FIXED_BTS)
> return 0;
>
> + if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> + return x86_pmu.update_topdown_event(event);

If is_topdown_count() is true; it had better bloody well have that
function. But I really hate this.

> /*
> * Careful: an NMI might modify the previous event value.
> *
> @@ -1003,6 +1005,10 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
>
> max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
>
> + /* There are 4 TopDown metrics events. */
> + if (x86_pmu.intel_cap.perf_metrics)
> + max_count += 4;

I'm thinking this is wrong.. this unconditionally allows collecting 4
extra events on every part that has this metrics crud on.

> /* current number of events already accepted */
> n = cpuc->n_events;
>
> @@ -1184,6 +1190,10 @@ int x86_perf_event_set_period(struct perf_event *event)
> if (idx == INTEL_PMC_IDX_FIXED_BTS)
> return 0;
>
> + if (unlikely(is_topdown_count(event)) &&
> + x86_pmu.set_topdown_event_period)
> + return x86_pmu.set_topdown_event_period(event);

Same as with the other method; and I similarly hate it.

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index f4d6335a18e2..616313d7f3d7 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c

> +static int icl_set_topdown_event_period(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + s64 left = local64_read(&hwc->period_left);
> +
> + /*
> + * Clear PERF_METRICS and Fixed counter 3 in initialization.
> + * After that, both MSRs will be cleared for each read.
> + * Don't need to clear them again.
> + */
> + if (left == x86_pmu.max_period) {
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
> + wrmsrl(MSR_PERF_METRICS, 0);
> + local64_set(&hwc->period_left, 0);
> + }

This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
never trigger the overflow there; this then seems to suggest the actual
counter value is irrelevant. Therefore you don't actually need this.

> +
> + perf_event_update_userpage(event);
> +
> + return 0;
> +}
> +
> +static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
> +{
> + u32 val;
> +
> + /*
> + * The metric is reported as an 8bit integer percentage

s/percentage/fraction/

percentage is 1/100, 8bit is 256.

> + * suming up to 0xff.
> + * slots-in-metric = (Metric / 0xff) * slots
> + */
> + val = (metric >> ((idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) * 8)) & 0xff;
> + return mul_u64_u32_div(slots, val, 0xff);

This really utterly blows.. I'd have wasted range to be able to do a
power-of-two fraction here. That is, use 8 bits with a range [0,128].

But also; x86_64 seems to lack a sane implementation of that function,
and it currently compiles into utter crap (it can be 2 instructions).

> +}

> +/*
> + * Update all active Topdown events.
> + * PMU has to be disabled before calling this function.

Forgets to explain why...

> + */

2019-08-28 16:13:55

by Peter Zijlstra

Subject: [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64

On Wed, Aug 28, 2019 at 05:19:21PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 26, 2019 at 07:47:35AM -0700, [email protected] wrote:

> > + return mul_u64_u32_div(slots, val, 0xff);
>
> But also; x86_64 seems to lack a sane implementation of that function,
> and it currently compiles into utter crap (it can be 2 instructions).

---
Subject: x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 28 17:39:46 CEST 2019

On x86_64 we can do a u64 * u64 -> u128 widening multiply followed by
a u128 / u64 -> u64 division to implement a sane version of
mul_u64_u32_div().

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/div64.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -73,6 +73,19 @@ static inline u64 mul_u32_u32(u32 a, u32

#else
# include <asm-generic/div64.h>
+
+static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
+{
+ u64 q;
+
+ asm ("mulq %2; divq %3" : "=a" (q)
+ : "a" (a), "rm" (mul), "rm" (div)
+ : "rdx");
+
+ return q;
+}
+#define mul_u64_u32_div mul_u64_u32_div
+
#endif /* CONFIG_X86_32 */

#endif /* _ASM_X86_DIV64_H */

2019-08-28 16:20:43

by Andi Kleen

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

> This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> never trigger the overflow there; this then seems to suggest the actual

The 48bit counter might overflow in a few hours.

-Andi

2019-08-28 16:30:11

by Peter Zijlstra

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > never trigger the overflow there; this then seems to suggest the actual
>
> The 48bit counter might overflow in a few hours.

Sure; the point is? Kan said it should not be too big; a full 48bit wrap
around must be too big or nothing is.

2019-08-28 19:06:10

by Andi Kleen

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Wed, Aug 28, 2019 at 05:02:38PM +0200, Peter Zijlstra wrote:
> >
> > To avoid reading the METRICS register multiple times, the metrics and
> > slots value can only be updated by the first slots/metrics event in a
> > group. All active slots and metrics events will be updated one time.
>
> Can't we require SLOTS to be the group leader for any Metric group?

Metrics are really useful to collect with any other sampling event
to give you more context on the program's behavior.

> Is there ever a case where we want to add other events to a metric
> group?

Yes. Any normal leader sampling case. You just collect metrics too.

> it also musn't be too cold. But that leaves it unspecified what exactly
> is the right range.
>
> IOW, you want a Goldilocks number of SLOTS.

That's too strong. You probably don't want minutes, but seconds'
worth of measurement should be ok.

> > NMI
> > ======
> >
> > The METRICS register may be overflow. The bit 48 of STATUS register
> > will be set. If so, update all active slots and metrics events.
>
> that happen? It would be useful to get that METRIC_OVF (can we please

This happens when the internal counters that feed the metrics
overflow.

> If this is so; then we can use this to update/reset PERF_METRICS and
> nothing else.

It has to be handled in the PMI.

> Then there is no mucking about with that odd counter/metrics msr pair
> reset nonsense. Becuase that really stinks.

You have to write them to reset the internal counters.

-Andi

2019-08-28 19:36:26

by Liang, Kan

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics



On 8/28/2019 11:02 AM, Peter Zijlstra wrote:
>> Reset
>> ======
>>
>> The PERF_METRICS and Fixed counter 3 have to be reset for each read,
>> because:
>> - The 8bit metrics ratio values lose precision when the measurement
>> period gets longer.
> So it musn't be too hot,
>
>> - The PERF_METRICS may report wrong value if its delta was less than
>> 1/255 of SLOTS (Fixed counter 3).

The "delta" is actually for the internal counters. Sorry for the
confusion.

> it also musn't be too cold. But that leaves it unspecified what exactly
> is the right range.
>
> IOW, you want a Goldilocks number of SLOTS.
>
>> Also, for counting, the -max_period is the initial value of the SLOTS.
>> The huge initial value will definitely trigger the issue mentioned
>> above. Force initial value as 0 for topdown and slots event counting.
> But you just told us that 0 is wrong too (too cold).

We set -max_period (0x8000 0000 0001) as the initial value of FIXCTR3 for
counting. But the internal counters probably start counting from 0.
PERF_METRICS is the fraction internal counters / FIXCTR3,
so PERF_METRICS will never work.
We have to make FIXCTR3 count from 0 as well.
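
A quick user-space sketch of the effect, with made-up numbers (assuming
the PERF_METRICS fields are computed as 0xff * internal / FIXCTR3 as
described above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* FIXCTR3 starts at -max_period, the internal counter at 0 */
        uint64_t fixctr3  = 0x800000000001ULL + 2000000; /* after ~2M slots */
        uint64_t internal = 1000000;                     /* e.g. retiring   */

        /* 0xff * internal / fixctr3 rounds down to 0: metric stuck at 0 */
        printf("%llu\n", (unsigned long long)(0xffULL * internal / fixctr3));
        return 0;
}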

Thanks,
Kan

2019-08-29 03:13:57

by Andi Kleen

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Wed, Aug 28, 2019 at 06:28:57PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > > never trigger the overflow there; this then seems to suggest the actual
> >
> > The 48bit counter might overflow in a few hours.
>
> Sure; the point is? Kan said it should not be too big; a full 48bit wrap
> around must be too big or nothing is.

We expect users to avoid making it too big, but we cannot rule it out.

Actually, for the common long-running perf stat case you can make it fairly
big, because the percentages still work out over the complete run time.

The problem with letting it accumulate too much is mainly if you
want to use a continuously running metrics counter to time smaller
regions by taking deltas, for example using RDPMC.

In this case the small regions would be too inaccurate after some time.

But to make the first case work we absolutely need to handle the overflow
case. Otherwise the metrics would just randomly stop at some
point.


-Andi



2019-08-29 09:19:09

by Peter Zijlstra

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Wed, Aug 28, 2019 at 08:11:51PM -0700, Andi Kleen wrote:
> On Wed, Aug 28, 2019 at 06:28:57PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > > > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > > > never trigger the overflow there; this then seems to suggest the actual
> > >
> > > The 48bit counter might overflow in a few hours.
> >
> > Sure; the point is? Kan said it should not be too big; a full 48bit wrap
> > around must be too big or nothing is.
>
> We expect users to avoid making it too big, but we cannot rule it out.
>
> Actually for the common perf stat for a long time case you can make it fairly
> big because the percentages still work out over the complete run time.
>
> The problem with letting it accumulate too much is mainly if you
> want to use a continuously running metrics counter to time smaller
> regions by taking deltas, for example using RDPMC.
>
> In this case the small regions would be too inaccurate after some time.
>
> But to make the first case work absolutely need to handle the overflow
> case. Otherwise the metrics would just randomly stop at some
> point.

But then you need -max_period, not 0. By using half the period, there is
no ambiguity on overflow.

2019-08-29 09:32:22

by Peter Zijlstra

Subject: Re: [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64

On Wed, Aug 28, 2019 at 06:11:23PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 05:19:21PM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 26, 2019 at 07:47:35AM -0700, [email protected] wrote:
>
> > > + return mul_u64_u32_div(slots, val, 0xff);
> >
> > But also; x86_64 seems to lack a sane implementation of that function,
> > and it currently compiles into utter crap (it can be 2 instructions).

This one actually builds defconfig :-)

---
Subject: x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 28 17:39:46 CEST 2019

On x86_64 we can do a u64 * u64 -> u128 widening multiply followed by
a u128 / u64 -> u64 division to implement a sane version of
mul_u64_u32_div().

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/div64.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index 20a46150e0a8..9b8cb50768c2 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -73,6 +73,19 @@ static inline u64 mul_u32_u32(u32 a, u32 b)

#else
# include <asm-generic/div64.h>
+
+static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
+{
+ u64 q;
+
+ asm ("mulq %2; divq %3" : "=a" (q)
+ : "a" (a), "rm" ((u64)mul), "rm" ((u64)div)
+ : "rdx");
+
+ return q;
+}
+#define mul_u64_u32_div mul_u64_u32_div
+
#endif /* CONFIG_X86_32 */

#endif /* _ASM_X86_DIV64_H */
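
For reference, a plain C model of what the helper computes (the kernel
version above uses the widening mulq/divq pair; this illustration relies
on GCC's unsigned __int128 on x86_64):

#include <stdint.h>
#include <stdio.h>

/* (a * mul) / div with a 128-bit intermediate, so the product cannot
 * overflow 64 bits before the division. */
static uint64_t mul_u64_u32_div(uint64_t a, uint32_t mul, uint32_t div)
{
        return (uint64_t)(((unsigned __int128)a * mul) / div);
}

int main(void)
{
        /* e.g. scale a SLOTS count by an 8bit metric fraction */
        printf("%llu\n",
               (unsigned long long)mul_u64_u32_div(1000000, 0x40, 0xff));
        return 0;
}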

2019-08-29 13:33:01

by Liang, Kan

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics



On 8/28/2019 11:19 AM, Peter Zijlstra wrote:
>> +static int icl_set_topdown_event_period(struct perf_event *event)
>> +{
>> + struct hw_perf_event *hwc = &event->hw;
>> + s64 left = local64_read(&hwc->period_left);
>> +
>> + /*
>> + * Clear PERF_METRICS and Fixed counter 3 in initialization.
>> + * After that, both MSRs will be cleared for each read.
>> + * Don't need to clear them again.
>> + */
>> + if (left == x86_pmu.max_period) {
>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
>> + wrmsrl(MSR_PERF_METRICS, 0);
>> + local64_set(&hwc->period_left, 0);
>> + }
> This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> never trigger the overflow there; this then seems to suggest the actual
> counter value is irrelevant. Therefore you don't actually need this.
>

Could you please elaborate on why initialization to 0 never triggers an
overflow?
As I understand it, initializing to 0 only means that it will take
more time to trigger an overflow than initializing to -max_period
(0x8000 0000 0001).

Maybe 0 is too tricky. We can set the initial value to 1.
I think the bottom line is that we need a small initial value for
FIXED_CTR3 here.
PERF_METRICS reports an 8bit integer fraction which is something like
0xff * internal counters / FIXCTR3.
The internal counters only start counting from 0. (SW cannot set an
arbitrary initial value for internal counters.)
If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always
remain constant, e.g. 0.

Thanks,
Kan

2019-08-29 13:54:20

by Peter Zijlstra

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Thu, Aug 29, 2019 at 09:31:37AM -0400, Liang, Kan wrote:
> On 8/28/2019 11:19 AM, Peter Zijlstra wrote:
> > > +static int icl_set_topdown_event_period(struct perf_event *event)
> > > +{
> > > + struct hw_perf_event *hwc = &event->hw;
> > > + s64 left = local64_read(&hwc->period_left);
> > > +
> > > + /*
> > > + * Clear PERF_METRICS and Fixed counter 3 in initialization.
> > > + * After that, both MSRs will be cleared for each read.
> > > + * Don't need to clear them again.
> > > + */
> > > + if (left == x86_pmu.max_period) {
> > > + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
> > > + wrmsrl(MSR_PERF_METRICS, 0);
> > > + local64_set(&hwc->period_left, 0);
> > > + }
> > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > never trigger the overflow there; this then seems to suggest the actual
> > counter value is irrelevant. Therefore you don't actually need this.
> >
>
> Could you please elaborate on why initialization to 0 never triggers an
> overflow?

Well, 'never' as in a 'long' time.

> As of my understanding, initialization to 0 only means that it will take
> more time than initialization to -max_period (0x8000 0000 0001) to trigger
> an overflow.

Only twice as long. And why do we care about that?

The problem with it is though that you get the overflow at the end of
the whole period, instead of halfway through, so reconstruction is
trickier.

> Maybe 0 is too tricky. We can set the initial value to 1.

That's even worse. I'm still not understanding why we can't use the
normal code.

> I think the bottom line is that we need a small initial value for FIXED_CTR3
> here.

But why?!

> PERF_METRICS reports an 8bit integer fraction which is something like 0xff *
> internal counters / FIXCTR3.
> The internal counters only start counting from 0. (SW cannot set an
> arbitrary initial value for internal counters.)
> If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always
> remain constant, e.g. 0.

What what? The PERF_METRICS contents depends on the FIXCTR3 value ?!
That's bloody insane. /me goes find the SDM. The SDM is bloody useless
:-(.

Please give a complete and coherent description of all of this. I can't
very well review any of this until I know how the hardware works, now
can I.

In this write-up, include the exact condition for METRICS_OVF (the SDM
states: 'it indicates that PERF_METRIC counter has overflowed', which is
gramatically incorrect and makes no sense even with the missing article
injected).

2019-08-29 16:57:17

by Liang, Kan

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics



On 8/29/2019 9:52 AM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 09:31:37AM -0400, Liang, Kan wrote:
>> On 8/28/2019 11:19 AM, Peter Zijlstra wrote:
>>>> +static int icl_set_topdown_event_period(struct perf_event *event)
>>>> +{
>>>> + struct hw_perf_event *hwc = &event->hw;
>>>> + s64 left = local64_read(&hwc->period_left);
>>>> +
>>>> + /*
>>>> + * Clear PERF_METRICS and Fixed counter 3 in initialization.
>>>> + * After that, both MSRs will be cleared for each read.
>>>> + * Don't need to clear them again.
>>>> + */
>>>> + if (left == x86_pmu.max_period) {
>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
>>>> + wrmsrl(MSR_PERF_METRICS, 0);
>>>> + local64_set(&hwc->period_left, 0);
>>>> + }
>>> This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
>>> never trigger the overflow there; this then seems to suggest the actual
>>> counter value is irrelevant. Therefore you don't actually need this.
>>>
>>
>> Could you please elaborate on why initialization to 0 never triggers an
>> overflow?
>
> Well, 'never' as in a 'long' time.
>
>> As of my understanding, initialization to 0 only means that it will take
>> more time than initialization to -max_period (0x8000 0000 0001) to trigger
>> an overflow.
>
> Only twice as long. And why do we care about that?
>
> The problem with it is though that you get the overflow at the end of
> the whole period, instead of halfway through, so reconstruction is
> trickier.
>
>> Maybe 0 is too tricky. We can set the initial value to 1.
>
> That's even worse. I'm still not understanding why we can't use the
> normal code.
>
>> I think the bottom line is that we need a small initial value for FIXED_CTR3
>> here.
>
> But why?!
>
>> PERF_METRICS reports an 8bit integer fraction which is something like 0xff *
>> internal counters / FIXCTR3.
>> The internal counters only start counting from 0. (SW cannot set an
>> arbitrary initial value for internal counters.)
>> If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always
>> remain constant, e.g. 0.
>
> What what? The PERF_METRICS contents depends on the FIXCTR3 value ?!

Yes.

In the current implementation, the PERF_METRIC MSR is composed of four
fields: backend bound, frontend bound, bad speculation and retiring.
Each field is populated using the formula below, e.g.:

  PERF_METRIC[RETIRING] =
      (0xFF * PERF_METRICS_RETIRING_INTERNAL_48bit_COUNTER) / FIXCTR3

METRICS_OVF indicates the overflow of any of the internal counters.

The internal counters only start counting from 0, which cannot be
programmed by SW. But resetting PERF_METRIC implicitly resets the
internal counters.
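
Purely for illustration (field order taken from the constraint table in
the patch, i.e. retiring in the lowest byte; the input value is made up),
decoding the four 8bit fields from a raw PERF_METRICS value would look
roughly like this:

#include <stdint.h>
#include <stdio.h>

static const char *name[4] = {
        "retiring", "bad speculation", "frontend bound", "backend bound"
};

static void decode_perf_metrics(uint64_t metrics, uint64_t slots)
{
        int i;

        for (i = 0; i < 4; i++) {
                uint64_t frac = (metrics >> (i * 8)) & 0xff;

                printf("%-15s %llu slots\n", name[i],
                       (unsigned long long)(slots * frac / 0xff));
        }
}

int main(void)
{
        /* made-up PERF_METRICS value whose four fields sum to 0xff */
        decode_perf_metrics(0x20204f70, 1000000);
        return 0;
}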

Thanks,
Kan


> That's bloody insane. /me goes find the SDM. The SDM is bloody useless
> :-(.
>
> Please give a complete and coherent description of all of this. I can't
> very well review any of this until I know how the hardware works, now
> can I.
>
> In this write-up, include the exact condition for METRICS_OVF (the SDM
> states: 'it indicates that PERF_METRIC counter has overflowed', which is
> gramatically incorrect and makes no sense even with the missing article
> injected).
>




2019-08-30 23:21:32

by Stephane Eranian

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

Hi,

On Mon, Aug 26, 2019 at 7:48 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Intro
> =====
>
> Icelake has support for measuring the four top level TopDown metrics
> directly in hardware. This is implemented by an additional "metrics"
> register, and a new Fixed Counter 3 that measures pipeline "slots".
>
> Events
> ======
>
> We export four metric events as separate perf events, which map to
> internal "metrics" counter register. Those events do not exist in
> hardware, but can be allocated by the scheduler.
>
There is another approach possible for supporting Topdown-style counters.
Instead of trying to abstract them as separate events to the user and then
trying to put them back together in the kernel and then using slots to scale
them as counts, we could just expose them as is, i.e., structured counter
values. The kernel already handles structured counter configs and exports
the fields of the config via sysfs, and the perf tool picks them up and can
encode any event. We could have a similar approach for a counter
value. It could have fields, units, types. Perf stat would pick them up in
the same manner. It would greatly simplify the kernel implementation.
You would need to publish a pseudo-event code for each group of
metrics. Note that I am not advocating exposing the raw counter value.
That way you would maintain one event code -> one "counter" on hw.
The reset on read would also work. It would generate only one rdmsr
per read without forcing any grouping. You would not need any grouping,
or the use of slots under the hood for scaling. If PERF_METRICS gets
extended, you can just add another pseudo-event code or umask.

The PERF_METRICS events do not make real sense in isolation. The SLOTS
scaling is hard to interpret. You can never profile on a PERF_METRICS event,
so keeping them grouped is okay.


2019-08-31 00:32:17

by Andi Kleen

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

> the same manner. It would greatly simplify the kernel implementation.

I tried that originally. It was actually more complicated.

You can't really do deltas on raw metrics, and a lot of the perf
infrastructure is built around deltas.

To do the regular reset and not lose precision over time, internally
you have to keep expanded counters anyway. And if you do that
you can just expose them to user space too, and have everything
in user space just work without any changes (except for the final
output).

-Andi

2019-08-31 09:14:23

by Stephane Eranian

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

Andi,

On Fri, Aug 30, 2019 at 5:31 PM Andi Kleen <[email protected]> wrote:
>
> > the same manner. It would greatly simplify the kernel implementation.
>
> I tried that originally. It was actually more complicated.
>
> You can't really do deltas on raw metrics, and a lot of the perf
> infrastructure is built around deltas.
>
How is RAPL handled? No deltas there either. It uses the snapshot model.
At each interval, perf stat just reads the current count, and does not compute
a delta since the previous read.
With PERF_METRICS, the delta is always since the previous read. If you read
frequently enough you do not lose precision.

>
> To do the regular reset and not lose precision over time internally
> you have to keep expanded counters anyways. And if you do that
> you can just expose them to user space too, and have everything
> in user space just work without any changes (except for the final
> output)
>
> -Andi
>

2019-08-31 09:19:52

by Peter Zijlstra

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Thu, Aug 29, 2019 at 12:56:02PM -0400, Liang, Kan wrote:
> On 8/29/2019 9:52 AM, Peter Zijlstra wrote:

> > What what? The PERF_METRICS contents depends on the FIXCTR3 value ?!
>
> Yes.
>
> For current implementation, PERF_METRIC MSR is composed by four fields,
> backend bound, frontend bound, bad speculation and retiring.
> Each of the fields are populated using the below formula for eg:
> PERF_METRIC[RETIRING] = (0xFF *
> PERF_METRICS_RETIRING_INTERNAL_48bit_COUNTER)
> / FIXCTR3

So it really depends on the actual exposed FIXCTR3 _value_ to compute
the PERF_METRIC field? *mind boggles*, that's really unspeakable crap.
And this isn't documented anywhere afaict.

I was thinking they'd have an internal counter for the SLOTS value too,
so the PERF_METRIC fields are independent; which would be like 'sane'.

Exposing the internal counters would've been _soooo_ much better, just
add 4 more fixed counters and call it a day.

> The METRICS_OVF indicates the overflow of any internal counters.

OK, but I'm thinking that by that time the fraction in PERF_METRIC will
be too coarse and we're losing precision. Reconstruction will be
inaccurate.

> The internal counters only start counting from 0, which cannot be programmed
> by SW. But resetting the PERF_METRIC would implicitly resetting the internal
> counters.

The only possible option given the choices.

2019-08-31 09:21:04

by Peter Zijlstra

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Wed, Aug 28, 2019 at 12:04:45PM -0700, Andi Kleen wrote:

> > > NMI
> > > ======
> > >
> > > The METRICS register may be overflow. The bit 48 of STATUS register
> > > will be set. If so, update all active slots and metrics events.
> >
> > that happen? It would be useful to get that METRIC_OVF (can we please
>
> This happens when the internal counters that feed the metrics
> overflow.
>
> > If this is so; then we can use this to update/reset PERF_METRICS and
> > nothing else.
>
> It has to be handled in the PMI.

That's what I wrote; Overflow is always NMI.

> > Then there is no mucking about with that odd counter/metrics msr pair
> > reset nonsense. Becuase that really stinks.
>
> You have to write them to reset the internal counters.

But not for every read, only on METRIC_OVF.

2019-08-31 09:31:14

by Peter Zijlstra

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Sat, Aug 31, 2019 at 02:13:05AM -0700, Stephane Eranian wrote:

> With PERF_METRICS, the delta is always since previous read. If you read
> frequently enough you do not lose precision.

You always lose precision, the whole fraction of 255 loses the
fractional part; consider:

255 * 1 / 8
31.87500000000000000000
255 * 7 / 8
223.12500000000000000000

31 * 8 / 255
.97254901960784313725
223 * 8 / 255
6.99607843137254901960

Now, we can make reconstruction use normal 'round-nearest' and then we'd
get 1 and 7 back, but there are always cases where you lose things.

Also, with 255 being an odd number, round-nearest is actually 'hard' :/

2019-08-31 17:54:39

by Andi Kleen

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics

On Sat, Aug 31, 2019 at 02:13:05AM -0700, Stephane Eranian wrote:
> Andi,
>
> On Fri, Aug 30, 2019 at 5:31 PM Andi Kleen <[email protected]> wrote:
> >
> > > the same manner. It would greatly simplify the kernel implementation.
> >
> > I tried that originally. It was actually more complicated.
> >
> > You can't really do deltas on raw metrics, and a lot of the perf
> > infrastructure is built around deltas.
> >
> How is RAPL handled? No deltas there either. It uses the snapshot model.

RAPL doesn't support any context switch or CPU context.
Also it has no concept of "accumulate with clear on read".

-Andi

2019-09-10 07:56:17

by Liang, Kan

Subject: Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics



On 8/31/2019 5:19 AM, Peter Zijlstra wrote:
>>> Then there is no mucking about with that odd counter/metrics msr pair
>>> reset nonsense. Becuase that really stinks.
>> You have to write them to reset the internal counters.
> But not for ever read, only on METRIC_OVF.

Precision is lost if the counters have been running much longer than the
measured delta, so we have to reset the counters after every read.
For example, the worst case is that the previous reading was rounded up
and the current reading is rounded down. The error of METRICS is then
1/256 - 1/SLOTS, which is related to SLOTS, and the error for each
Topdown event is (1/256 - 1/SLOTS) * SLOTS.
As SLOTS increases, the precision gets worse and worse.
Therefore, we have to reset the counters periodically.
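
To put rough numbers on that (using the worst-case formula above,
illustrative only):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t slots[3] = { 1000000ULL, 1000000000ULL, 1000000000000ULL };
        int i;

        for (i = 0; i < 3; i++) {
                /* worst case per the formula: (1/256 - 1/SLOTS) * SLOTS */
                uint64_t err = slots[i] / 256 - 1;

                printf("SLOTS=%-14llu worst-case error ~%llu slots\n",
                       (unsigned long long)slots[i],
                       (unsigned long long)err);
        }
        return 0;
}

So the longer the counters run between resets, the larger the possible
absolute error in the reconstructed per-event slot counts.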

Thanks,
Kan