From: Kan Liang <[email protected]>
The patch series intends to add Icelake support for Linux perf.
PATCH 1-18: Kernel patches to support Icelake.
- 1-4: Support adaptive PEBS feature
- 5-6: Enable core support with some new features, e.g. 8 generic
counters, new event constraints, a new fixed counter.
- 7-10: Enable cstate, rapl, msr and uncore support on Icelake
- 11-17: Support hardware Metrics counters and SLOT fixed counter for
Topdown events.
- 18: Support CPUID 10.ECX to disable fixed counters
PATCH 19-22: Perf tool patches to support XMM, Topdown and event list.
Andi Kleen (13):
perf/core: Support outputting registers from a separate array
perf/x86/intel: Extract memory code PEBS parser for reuse
perf/x86/lbr: Avoid reading the LBRs when adaptive PEBS handles them
perf/x86: Support constraint ranges
perf/core: Support a REMOVE transaction
perf/x86/intel: Basic support for metrics counters
perf/x86/intel: Support overflows on SLOTS
perf/x86/intel: Set correct weight for topdown subevent counters
perf/x86/intel: Export new top down events for Icelake
perf/x86/intel: Support CPUID 10.ECX to disable fixed counters
perf, tools: Add support for recording and printing XMM registers
perf, tools, stat: Support new per thread TopDown
perf, tools: Add documentation for topdown metrics
Kan Liang (9):
perf/x86/intel: Support adaptive PEBSv4
perf/x86/intel: Add Icelake support
perf/x86/intel/cstate: Add Icelake support
perf/x86/intel/rapl: Add Icelake support
perf/x86/msr: Add Icelake support
perf/x86/intel/uncore: Add Intel Icelake uncore support
perf/x86/intel: Support hardware TopDown metrics
perf/x86/intel: Disable sampling read slots and topdown
perf vendor events intel: Add JSON files for Icelake
arch/arm/kernel/perf_regs.c | 2 +-
arch/arm64/kernel/perf_regs.c | 2 +-
arch/powerpc/perf/perf_regs.c | 2 +-
arch/s390/kernel/perf_regs.c | 2 +-
arch/x86/events/core.c | 71 +-
arch/x86/events/intel/core.c | 421 ++++++++-
arch/x86/events/intel/cstate.c | 2 +
arch/x86/events/intel/ds.c | 382 +++++++-
arch/x86/events/intel/lbr.c | 35 +-
arch/x86/events/intel/rapl.c | 2 +
arch/x86/events/intel/uncore.c | 6 +
arch/x86/events/intel/uncore.h | 1 +
arch/x86/events/intel/uncore_snb.c | 91 ++
arch/x86/events/msr.c | 1 +
arch/x86/events/perf_event.h | 88 ++
arch/x86/include/asm/intel_ds.h | 2 +-
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/perf_event.h | 74 +-
arch/x86/include/uapi/asm/perf_regs.h | 25 +-
arch/x86/kernel/perf_regs.c | 17 +-
include/linux/perf_event.h | 9 +
include/linux/perf_regs.h | 4 +-
kernel/events/core.c | 12 +-
tools/arch/x86/include/uapi/asm/perf_regs.h | 19 +
tools/perf/Documentation/perf-stat.txt | 9 +-
tools/perf/Documentation/topdown.txt | 223 +++++
tools/perf/arch/x86/include/perf_regs.h | 27 +-
tools/perf/arch/x86/util/perf_regs.c | 16 +
tools/perf/builtin-stat.c | 24 +
.../pmu-events/arch/x86/icelake/cache.json | 552 +++++++++++
.../arch/x86/icelake/floating-point.json | 90 ++
.../pmu-events/arch/x86/icelake/frontend.json | 424 +++++++++
.../pmu-events/arch/x86/icelake/memory.json | 410 ++++++++
.../pmu-events/arch/x86/icelake/other.json | 133 +++
.../pmu-events/arch/x86/icelake/pipeline.json | 892 ++++++++++++++++++
.../arch/x86/icelake/virtual-memory.json | 236 +++++
tools/perf/pmu-events/arch/x86/mapfile.csv | 1 +
tools/perf/util/perf_regs.h | 1 +
tools/perf/util/stat-shadow.c | 89 ++
tools/perf/util/stat.c | 4 +
tools/perf/util/stat.h | 8 +
41 files changed, 4311 insertions(+), 102 deletions(-)
create mode 100644 tools/perf/Documentation/topdown.txt
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/cache.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/floating-point.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/frontend.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/memory.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/other.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/pipeline.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json
--
2.17.1
From: Andi Kleen <[email protected]>
Extract some code related to memory profiling from the PEBS record
parser into separate functions. It can be reused by the upcoming
adaptive PEBS parser. No functional changes.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/ds.c | 63 ++++++++++++++++++++------------------
1 file changed, 34 insertions(+), 29 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 10c99ce1fead..4a2206876baa 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1125,34 +1125,50 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
return 0;
}
-static inline u64 intel_hsw_weight(struct pebs_record_skl *pebs)
+static inline u64 intel_hsw_weight(u64 tsx_tuning)
{
- if (pebs->tsx_tuning) {
- union hsw_tsx_tuning tsx = { .value = pebs->tsx_tuning };
+ if (tsx_tuning) {
+ union hsw_tsx_tuning tsx = { .value = tsx_tuning };
return tsx.cycles_last_block;
}
return 0;
}
-static inline u64 intel_hsw_transaction(struct pebs_record_skl *pebs)
+static u64 intel_hsw_transaction(u64 tsx_tuning, u64 ax)
{
- u64 txn = (pebs->tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
+ u64 txn = (tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
/* For RTM XABORTs also log the abort code from AX */
- if ((txn & PERF_TXN_TRANSACTION) && (pebs->ax & 1))
- txn |= ((pebs->ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT;
+ if ((txn & PERF_TXN_TRANSACTION) && (ax & 1))
+ txn |= ((ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT;
return txn;
}
+#define PERF_X86_EVENT_PEBS_HSW_PREC \
+ (PERF_X86_EVENT_PEBS_ST_HSW | \
+ PERF_X86_EVENT_PEBS_LD_HSW | \
+ PERF_X86_EVENT_PEBS_NA_HSW)
+
+static u64 get_data_src(struct perf_event *event, u64 aux)
+{
+ u64 val = PERF_MEM_NA;
+ int fl = event->hw.flags;
+ bool fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);
+
+ if (fl & PERF_X86_EVENT_PEBS_LDLAT)
+ val = load_latency_data(aux);
+ else if (fst && (fl & PERF_X86_EVENT_PEBS_HSW_PREC))
+ val = precise_datala_hsw(event, aux);
+ else if (fst)
+ val = precise_store_data(aux);
+ return val;
+}
+
static void setup_pebs_sample_data(struct perf_event *event,
struct pt_regs *iregs, void *__pebs,
struct perf_sample_data *data,
struct pt_regs *regs)
{
-#define PERF_X86_EVENT_PEBS_HSW_PREC \
- (PERF_X86_EVENT_PEBS_ST_HSW | \
- PERF_X86_EVENT_PEBS_LD_HSW | \
- PERF_X86_EVENT_PEBS_NA_HSW)
/*
* We cast to the biggest pebs_record but are careful not to
* unconditionally access the 'extra' entries.
@@ -1160,17 +1176,13 @@ static void setup_pebs_sample_data(struct perf_event *event,
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct pebs_record_skl *pebs = __pebs;
u64 sample_type;
- int fll, fst, dsrc;
- int fl = event->hw.flags;
+ int fll;
if (pebs == NULL)
return;
sample_type = event->attr.sample_type;
- dsrc = sample_type & PERF_SAMPLE_DATA_SRC;
-
- fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
- fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);
+ fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
perf_sample_data_init(data, 0, event->hw.last_period);
@@ -1185,16 +1197,8 @@ static void setup_pebs_sample_data(struct perf_event *event,
/*
* data.data_src encodes the data source
*/
- if (dsrc) {
- u64 val = PERF_MEM_NA;
- if (fll)
- val = load_latency_data(pebs->dse);
- else if (fst && (fl & PERF_X86_EVENT_PEBS_HSW_PREC))
- val = precise_datala_hsw(event, pebs->dse);
- else if (fst)
- val = precise_store_data(pebs->dse);
- data->data_src.val = val;
- }
+ if (sample_type & PERF_SAMPLE_DATA_SRC)
+ data->data_src.val = get_data_src(event, pebs->dse);
/*
* We must however always use iregs for the unwinder to stay sane; the
@@ -1281,10 +1285,11 @@ static void setup_pebs_sample_data(struct perf_event *event,
if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
- data->weight = intel_hsw_weight(pebs);
+ data->weight = intel_hsw_weight(pebs->tsx_tuning);
if (sample_type & PERF_SAMPLE_TRANSACTION)
- data->txn = intel_hsw_transaction(pebs);
+ data->txn = intel_hsw_transaction(pebs->tsx_tuning,
+ pebs->ax);
}
/*
--
2.17.1
From: Andi Kleen <[email protected]>
Icelake extended the general counters to 8, even when SMT is enabled.
However only a (large) subset of the events can be used on all 8
counters.
The events that can or cannot be used on all counters are organized
in ranges.
We need a lot of scheduler constraints to handle all this.
To avoid blowing up the tables add event code ranges to the constraint
tables, and a new inline function to match them.
The changes costs ~2k text size according to 0day report.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 2 +-
arch/x86/events/perf_event.h | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a964b9832b0c..8486ab87f8f8 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2655,7 +2655,7 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
if (x86_pmu.event_constraints) {
for_each_event_constraint(c, x86_pmu.event_constraints) {
- if ((event->hw.config & c->cmask) == c->code) {
+ if (constraint_match(c, event->hw.config)) {
event->hw.flags |= c->flags;
return c;
}
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 974284c5ed6c..30370fb93e21 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -858,7 +858,7 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
if (x86_pmu.pebs_constraints) {
for_each_event_constraint(c, x86_pmu.pebs_constraints) {
- if ((event->hw.config & c->cmask) == c->code) {
+ if (constraint_match(c, event->hw.config)) {
event->hw.flags |= c->flags;
return c;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 27c7945b5174..863d27f4c352 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -54,6 +54,7 @@ struct event_constraint {
int weight;
int overlap;
int flags;
+ u64 range_end;
};
/*
* struct hw_perf_event.flags flags
@@ -71,6 +72,12 @@ struct event_constraint {
#define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0800 /* use large PEBS */
+static inline bool constraint_match(struct event_constraint *c, u64 ecode)
+{
+ ecode &= c->cmask;
+ return ecode == c->code ||
+ (c->range_end && ecode >= c->code && ecode <= c->range_end);
+}
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -267,9 +274,22 @@ struct cpu_hw_events {
.flags = f, \
}
+#define __EVENT_CONSTRAINT_RANGE(c, e, n, m, w, o, f) { \
+ { .idxmsk64 = (n) }, \
+ .code = (c), \
+ .range_end = (e), \
+ .cmask = (m), \
+ .weight = (w), \
+ .overlap = (o), \
+ .flags = f, \
+}
+
#define EVENT_CONSTRAINT(c, n, m) \
__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)
+#define EVENT_CONSTRAINT_RANGE(c, e, n, m) \
+ __EVENT_CONSTRAINT_RANGE(c, e, n, m, HWEIGHT(n), 0, 0)
+
#define INTEL_EXCLEVT_CONSTRAINT(c, n) \
__EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT, HWEIGHT(n),\
0, PERF_X86_EVENT_EXCL)
@@ -304,6 +324,12 @@ struct cpu_hw_events {
#define INTEL_EVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT)
+/*
+ * Constraint on a range of Event codes
+ */
+#define INTEL_EVENT_CONSTRAINT_RANGE(c, e, n) \
+ EVENT_CONSTRAINT_RANGE(c, e, n, ARCH_PERFMON_EVENTSEL_EVENT)
+
/*
* Constraint on the Event code + UMask + fixed-mask
*
@@ -351,6 +377,9 @@ struct cpu_hw_events {
#define INTEL_FLAGS_EVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+#define INTEL_FLAGS_EVENT_CONSTRAINT_RANGE(c, e, n) \
+ EVENT_CONSTRAINT_RANGE(c, e, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+
/* Check only flags, but allow all event/umask */
#define INTEL_ALL_EVENT_CONSTRAINT(code, n) \
EVENT_CONSTRAINT(code, n, X86_ALL_EVENT_FLAGS)
@@ -367,6 +396,11 @@ struct cpu_hw_events {
ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)
+#define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD_RANGE(code, end, n) \
+ __EVENT_CONSTRAINT_RANGE(code, end, n, \
+ ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
+ HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)
+
#define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(code, n) \
__EVENT_CONSTRAINT(code, n, \
ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
--
2.17.1
From: Kan Liang <[email protected]>
Adaptive PEBS is a new way to report PEBS sampling information. Instead
of a fixed size record for all PEBS events it allows to configure the
PEBS record to only include the information needed. Events can then opt
in to use such an extended record, or stay with a basic record which
only contains the IP.
The major new feature is to support LBRs in PEBS record.
This allows (much faster) large PEBS, while still supporting callstacks
through callstack LBR. So essentially a lot of profiling can now be done
without frequent interrupts, dropping the overhead significantly.
The main requirement still is to use a period, and not use frequency
mode, because frequency mode requires reevaluating the frequency on each
overflow.
The floating point state (XMM) is also supported, which allows efficient
profiling of FP function arguments.
Adapt the drain function to handle variable length records.
Use a new callback to parse the new record format, and also handle the
STATUS field now being at a different offset.
Add code to set up the configuration register. Since there is only a
single register, all events either get the full super set of all events,
or only the basic record.
Originally-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 2 +
arch/x86/events/intel/ds.c | 293 ++++++++++++++++++++++++++++--
arch/x86/events/intel/lbr.c | 22 +++
arch/x86/events/perf_event.h | 14 ++
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/perf_event.h | 42 +++++
6 files changed, 359 insertions(+), 15 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 17096d3cd616..a964b9832b0c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3446,6 +3446,8 @@ static int intel_pmu_cpu_prepare(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
+ cpuc->pebs_record_size = x86_pmu.pebs_record_size;
+
if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
cpuc->shared_regs = allocate_shared_regs(cpu);
if (!cpuc->shared_regs)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 4a2206876baa..974284c5ed6c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -906,17 +906,82 @@ static inline void pebs_update_threshold(struct cpu_hw_events *cpuc)
if (cpuc->n_pebs == cpuc->n_large_pebs) {
threshold = ds->pebs_absolute_maximum -
- reserved * x86_pmu.pebs_record_size;
+ reserved * cpuc->pebs_record_size;
} else {
- threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ threshold = ds->pebs_buffer_base + cpuc->pebs_record_size;
}
ds->pebs_interrupt_threshold = threshold;
}
+static void adaptive_pebs_record_size_update(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 d = cpuc->pebs_data_cfg;
+ int sz = sizeof(struct pebs_basic);
+
+ if (d & PEBS_DATACFG_MEMINFO)
+ sz += sizeof(struct pebs_meminfo);
+ if (d & PEBS_DATACFG_GPRS)
+ sz += sizeof(struct pebs_gprs);
+ if (d & PEBS_DATACFG_XMMS)
+ sz += sizeof(struct pebs_xmm);
+ if (d & PEBS_DATACFG_LBRS)
+ sz += x86_pmu.lbr_nr * sizeof(struct pebs_lbr_entry);
+
+ cpuc->pebs_record_size = sz;
+}
+
+static u64 pebs_update_adaptive_cfg(struct perf_event *event)
+{
+ u64 sample_type = event->attr.sample_type;
+ u64 pebs_data_cfg = 0;
+
+
+ if ((sample_type & ~(PERF_SAMPLE_IP|PERF_SAMPLE_TIME)) ||
+ event->attr.precise_ip < 2) {
+
+ if (sample_type & (PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC |
+ PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT |
+ PERF_SAMPLE_TRANSACTION))
+ pebs_data_cfg |= PEBS_DATACFG_MEMINFO;
+
+ /*
+ * Cases we need the registers:
+ * + user requested registers
+ * + precise_ip < 2 for the non event IP
+ * + For RTM TSX weight we need GPRs too for the abort
+ * code. But we don't want to force GPRs for all other
+ * weights. So add it only for the RTM abort event.
+ */
+ if (((sample_type & PERF_SAMPLE_REGS_INTR) &&
+ (event->attr.sample_regs_intr & 0xffffffff)) ||
+ (event->attr.precise_ip < 2) ||
+ ((sample_type & PERF_SAMPLE_WEIGHT) &&
+ ((event->attr.config & 0xffff) == x86_pmu.force_gpr_event)))
+ pebs_data_cfg |= PEBS_DATACFG_GPRS;
+
+ if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
+ (event->attr.sample_regs_intr >> 32))
+ pebs_data_cfg |= PEBS_DATACFG_XMMS;
+
+ if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+ /*
+ * For now always log all LBRs. Could configure this
+ * later.
+ */
+ pebs_data_cfg |= PEBS_DATACFG_LBRS |
+ ((x86_pmu.lbr_nr-1) << PEBS_DATACFG_LBR_SHIFT);
+ }
+ }
+ return pebs_data_cfg;
+}
+
static void
-pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, struct pmu *pmu)
+pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
+ struct perf_event *event, bool add)
{
+ struct pmu *pmu = event->ctx->pmu;
/*
* Make sure we get updated with the first PEBS
* event. It will trigger also during removal, but
@@ -933,6 +998,19 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, struct pmu *pmu)
update = true;
}
+ if (x86_pmu.intel_cap.pebs_baseline && add) {
+ u64 pebs_data_cfg;
+
+ pebs_data_cfg = pebs_update_adaptive_cfg(event);
+
+ /* Update pebs_record_size if new event requires more data. */
+ if (pebs_data_cfg & ~cpuc->pebs_data_cfg) {
+ cpuc->pebs_data_cfg |= pebs_data_cfg;
+ adaptive_pebs_record_size_update();
+ update = true;
+ }
+ }
+
if (update)
pebs_update_threshold(cpuc);
}
@@ -947,7 +1025,7 @@ void intel_pmu_pebs_add(struct perf_event *event)
if (hwc->flags & PERF_X86_EVENT_LARGE_PEBS)
cpuc->n_large_pebs++;
- pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
+ pebs_update_state(needed_cb, cpuc, event, true);
}
void intel_pmu_pebs_enable(struct perf_event *event)
@@ -965,6 +1043,14 @@ void intel_pmu_pebs_enable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+ if (x86_pmu.intel_cap.pebs_baseline) {
+ hwc->config |= ICL_EVENTSEL_ADAPTIVE;
+ if (cpuc->pebs_data_cfg != cpuc->active_pebs_data_cfg) {
+ wrmsrl(MSR_PEBS_DATA_CFG, cpuc->pebs_data_cfg);
+ cpuc->active_pebs_data_cfg = cpuc->pebs_data_cfg;
+ }
+ }
+
/*
* Use auto-reload if possible to save a MSR write in the PMI.
* This must be done in pmu::start(), because PERF_EVENT_IOC_PERIOD.
@@ -991,7 +1077,12 @@ void intel_pmu_pebs_del(struct perf_event *event)
if (hwc->flags & PERF_X86_EVENT_LARGE_PEBS)
cpuc->n_large_pebs--;
- pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
+ /* Clear both pebs_data_cfg and pebs_record_size for first PEBS. */
+ if (x86_pmu.intel_cap.pebs_baseline && !cpuc->n_pebs) {
+ cpuc->pebs_data_cfg = 0;
+ cpuc->pebs_record_size = sizeof(struct pebs_basic);
+ }
+ pebs_update_state(needed_cb, cpuc, event, false);
}
void intel_pmu_pebs_disable(struct perf_event *event)
@@ -1004,6 +1095,8 @@ void intel_pmu_pebs_disable(struct perf_event *event)
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+ /* Delay reprograming DATA_CFG to next enable */
+
if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
@@ -1013,6 +1106,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+ hwc->config &= ~ICL_EVENTSEL_ADAPTIVE;
}
void intel_pmu_pebs_enable_all(void)
@@ -1144,6 +1238,24 @@ static u64 intel_hsw_transaction(u64 tsx_tuning, u64 ax)
return txn;
}
+static inline void *next_pebs_record(void *p)
+{
+ unsigned int size;
+
+ if (x86_pmu.intel_cap.pebs_format < 4)
+ size = x86_pmu.pebs_record_size;
+ else
+ size = ((struct pebs_basic *)p)->format_size >> 48;
+ return p + size;
+}
+
+static inline u64 get_pebs_status(struct pebs_record_nhm *n)
+{
+ if (x86_pmu.intel_cap.pebs_format < 4)
+ return n->status;
+ return ((struct pebs_basic *)n)->applicable_counters;
+}
+
#define PERF_X86_EVENT_PEBS_HSW_PREC \
(PERF_X86_EVENT_PEBS_ST_HSW | \
PERF_X86_EVENT_PEBS_LD_HSW | \
@@ -1164,7 +1276,7 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
return val;
}
-static void setup_pebs_sample_data(struct perf_event *event,
+static void setup_pebs_fixed_sample_data(struct perf_event *event,
struct pt_regs *iregs, void *__pebs,
struct perf_sample_data *data,
struct pt_regs *regs)
@@ -1306,6 +1418,129 @@ static void setup_pebs_sample_data(struct perf_event *event,
data->br_stack = &cpuc->lbr_stack;
}
+static void adaptive_pebs_save_regs(struct pt_regs *regs,
+ struct pebs_gprs *gprs)
+{
+ regs->ax = gprs->ax;
+ regs->bx = gprs->bx;
+ regs->cx = gprs->cx;
+ regs->dx = gprs->dx;
+ regs->si = gprs->si;
+ regs->di = gprs->di;
+ regs->bp = gprs->bp;
+ regs->sp = gprs->sp;
+#ifndef CONFIG_X86_32
+ regs->r8 = gprs->r8;
+ regs->r9 = gprs->r9;
+ regs->r10 = gprs->r10;
+ regs->r11 = gprs->r11;
+ regs->r12 = gprs->r12;
+ regs->r13 = gprs->r13;
+ regs->r14 = gprs->r14;
+ regs->r15 = gprs->r15;
+#endif
+}
+
+/*
+ * With adaptive PEBS the layout depends on what fields are configured.
+ */
+
+static void setup_pebs_adaptive_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct pebs_basic *basic = __pebs;
+ void *next_record = basic + 1;
+ u64 sample_type;
+ u64 format_size;
+ struct pebs_gprs *gprs = NULL;
+
+ if (basic == NULL)
+ return;
+
+ sample_type = event->attr.sample_type;
+ format_size = basic->format_size;
+ perf_sample_data_init(data, 0, event->hw.last_period);
+ data->period = event->hw.last_period;
+
+ if (event->attr.use_clockid == 0)
+ data->time = native_sched_clock_from_tsc(basic->tsc);
+
+ /*
+ * We must however always use iregs for the unwinder to stay sane; the
+ * record BP,SP,IP can point into thin air when the record is from a
+ * previous PMI context or an (I)RET happened between the record and
+ * PMI.
+ */
+ if (sample_type & PERF_SAMPLE_CALLCHAIN)
+ data->callchain = perf_callchain(event, iregs);
+
+ *regs = *iregs;
+ /* The ip in basic is EventingIP */
+ set_linear_ip(regs, basic->ip);
+ regs->flags = PERF_EFLAGS_EXACT;
+
+ if (format_size & PEBS_DATACFG_GPRS) {
+ gprs = next_record;
+ next_record = gprs + 1;
+
+ if (event->attr.precise_ip < 2) {
+ set_linear_ip(regs, gprs->ip);
+ regs->flags &= ~PERF_EFLAGS_EXACT;
+ }
+
+ if (sample_type & PERF_SAMPLE_REGS_INTR)
+ adaptive_pebs_save_regs(regs, gprs);
+ }
+
+ if (format_size & PEBS_DATACFG_MEMINFO) {
+ struct pebs_meminfo *meminfo = next_record;
+
+ next_record = meminfo + 1;
+
+ if (sample_type & PERF_SAMPLE_WEIGHT)
+ data->weight = meminfo->latency ?:
+ intel_hsw_weight(meminfo->tsx_tuning);
+
+ if (sample_type & PERF_SAMPLE_DATA_SRC)
+ data->data_src.val = get_data_src(event, meminfo->aux);
+
+ if (sample_type & (PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR))
+ data->addr = meminfo->address;
+
+ if (sample_type & PERF_SAMPLE_TRANSACTION)
+ data->txn = intel_hsw_transaction(meminfo->tsx_tuning,
+ gprs ? gprs->ax : 0);
+ }
+
+ if (format_size & PEBS_DATACFG_LBRS) {
+ struct pebs_lbr *lbr = next_record;
+ int num_lbr = ((format_size >> PEBS_DATACFG_LBR_SHIFT)
+ & 0xff) + 1;
+ next_record = next_record + num_lbr*sizeof(struct pebs_lbr_entry);
+
+ if (has_branch_stack(event)) {
+ intel_pmu_store_pebs_lbrs(lbr);
+ data->br_stack = &cpuc->lbr_stack;
+ }
+ }
+
+ if (format_size & PEBS_DATACFG_XMMS) {
+ struct pebs_xmm *xmm = next_record;
+
+ next_record = xmm + 1;
+ data->extra_regs = xmm->xmm;
+ }
+
+ WARN_ONCE(next_record != __pebs + (format_size >> 48),
+ "PEBS record size %llu, expected %llu, config %llx\n",
+ format_size >> 48,
+ (u64)(next_record - __pebs),
+ basic->format_size);
+}
+
static inline void *
get_next_pebs_record_by_bit(void *base, void *top, int bit)
{
@@ -1323,19 +1558,20 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
if (base == NULL)
return NULL;
- for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at = next_pebs_record(at)) {
struct pebs_record_nhm *p = at;
+ unsigned long status = get_pebs_status(p);
- if (test_bit(bit, (unsigned long *)&p->status)) {
+ if (test_bit(bit, (unsigned long *)&status)) {
/* PEBS v3 has accurate status bits */
if (x86_pmu.intel_cap.pebs_format >= 3)
return at;
- if (p->status == (1 << bit))
+ if (status == (1 << bit))
return at;
/* clear non-PEBS bit and re-check */
- pebs_status = p->status & cpuc->pebs_enabled;
+ pebs_status = status & cpuc->pebs_enabled;
pebs_status &= PEBS_COUNTER_MASK;
if (pebs_status == (1 << bit))
return at;
@@ -1434,14 +1670,14 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
return;
while (count > 1) {
- setup_pebs_sample_data(event, iregs, at, &data, ®s);
+ x86_pmu.setup_pebs_sample_data(event, iregs, at, &data, ®s);
perf_event_output(event, &data, ®s);
- at += x86_pmu.pebs_record_size;
+ at = next_pebs_record(at);
at = get_next_pebs_record_by_bit(at, top, bit);
count--;
}
- setup_pebs_sample_data(event, iregs, at, &data, ®s);
+ x86_pmu.setup_pebs_sample_data(event, iregs, at, &data, ®s);
/*
* All but the last records are processed.
@@ -1534,11 +1770,11 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
return;
}
- for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at = next_pebs_record(at)) {
struct pebs_record_nhm *p = at;
u64 pebs_status;
- pebs_status = p->status & cpuc->pebs_enabled;
+ pebs_status = get_pebs_status(p) & cpuc->pebs_enabled;
pebs_status &= mask;
/* PEBS v3 has more accurate status bits */
@@ -1637,8 +1873,10 @@ void __init intel_ds_init(void)
x86_pmu.pebs_no_isolation = 1;
if (x86_pmu.pebs) {
char pebs_type = x86_pmu.intel_cap.pebs_trap ? '+' : '-';
+ char *pebs_qual = "";
int format = x86_pmu.intel_cap.pebs_format;
+ x86_pmu.setup_pebs_sample_data = setup_pebs_fixed_sample_data;
switch (format) {
case 0:
pr_cont("PEBS fmt0%c, ", pebs_type);
@@ -1674,10 +1912,35 @@ void __init intel_ds_init(void)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
break;
+ case 4:
+ x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
+ x86_pmu.setup_pebs_sample_data = setup_pebs_adaptive_sample_data;
+ x86_pmu.pebs_record_size = sizeof(struct pebs_basic);
+ if (x86_pmu.intel_cap.pebs_baseline) {
+ x86_pmu.large_pebs_flags |=
+ PERF_SAMPLE_BRANCH_STACK |
+ PERF_SAMPLE_TIME;
+ x86_pmu.flags |= PMU_FL_PEBS_ALL;
+ pebs_qual = "-baseline";
+ } else {
+ /* Only basic record supported */
+ x86_pmu.large_pebs_flags &=
+ ~(PERF_SAMPLE_ADDR |
+ PERF_SAMPLE_TIME |
+ PERF_SAMPLE_DATA_SRC |
+ PERF_SAMPLE_TRANSACTION |
+ PERF_SAMPLE_REGS_USER |
+ PERF_SAMPLE_REGS_INTR);
+ }
+ pr_cont("PEBS fmt4%c%s, ", pebs_type, pebs_qual);
+ break;
+
default:
pr_cont("no PEBS fmt%d%c, ", format, pebs_type);
x86_pmu.pebs = 0;
}
+ if (format != 4)
+ x86_pmu.intel_cap.pebs_baseline = 0;
}
}
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index c88ed39582a1..58889f952959 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1079,6 +1079,28 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
}
}
+void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ int i;
+
+ cpuc->lbr_stack.nr = x86_pmu.lbr_nr;
+ for (i = 0; i < x86_pmu.lbr_nr; i++) {
+ u64 info = lbr->lbr[i].info;
+ struct perf_branch_entry *e = &cpuc->lbr_entries[i];
+
+ e->from = lbr->lbr[i].from;
+ e->to = lbr->lbr[i].to;
+ e->mispred = !!(info & LBR_INFO_MISPRED);
+ e->predicted = !(info & LBR_INFO_MISPRED);
+ e->in_tx = !!(info & LBR_INFO_IN_TX);
+ e->abort = !!(info & LBR_INFO_ABORT);
+ e->cycles = info & LBR_INFO_CYCLES;
+ e->reserved = 0;
+ }
+ intel_pmu_lbr_filter(cpuc);
+}
+
/*
* Map interface branch filters onto LBR filters
*/
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7e75f474b076..db3b622265fa 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -207,6 +207,11 @@ struct cpu_hw_events {
int n_pebs;
int n_large_pebs;
+ /* Current super set of events hardware configuration */
+ u64 pebs_data_cfg;
+ u64 active_pebs_data_cfg;
+ int pebs_record_size;
+
/*
* Intel LBR bits
*/
@@ -468,6 +473,7 @@ union perf_capabilities {
* values > 32bit.
*/
u64 full_width_write:1;
+ u64 pebs_baseline:1;
};
u64 capabilities;
};
@@ -612,10 +618,16 @@ struct x86_pmu {
int pebs_record_size;
int pebs_buffer_size;
void (*drain_pebs)(struct pt_regs *regs);
+ void (*setup_pebs_sample_data)(struct perf_event *event,
+ struct pt_regs *iregs,
+ void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
struct event_constraint *pebs_constraints;
void (*pebs_aliases)(struct perf_event *event);
int max_pebs_events;
unsigned long large_pebs_flags;
+ u64 force_gpr_event;
/*
* Intel LBR
@@ -952,6 +964,8 @@ void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
void intel_pmu_auto_reload_read(struct perf_event *event);
+void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr);
+
void intel_ds_init(void);
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 8e40c2446fd1..b14b8bea8681 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -116,6 +116,7 @@
#define LBR_INFO_CYCLES 0xffff
#define MSR_IA32_PEBS_ENABLE 0x000003f1
+#define MSR_PEBS_DATA_CFG 0x000003f2
#define MSR_IA32_DS_AREA 0x00000600
#define MSR_IA32_PERF_CAPABILITIES 0x00000345
#define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8bdf74902293..da0a80ef6505 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -32,6 +32,7 @@
#define HSW_IN_TX (1ULL << 32)
#define HSW_IN_TX_CHECKPOINTED (1ULL << 33)
+#define ICL_EVENTSEL_ADAPTIVE (1ULL << 34)
#define AMD64_EVENTSEL_INT_CORE_ENABLE (1ULL << 36)
#define AMD64_EVENTSEL_GUESTONLY (1ULL << 40)
@@ -87,6 +88,12 @@
#define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6
#define ARCH_PERFMON_EVENTS_COUNT 7
+#define PEBS_DATACFG_MEMINFO BIT_ULL(0)
+#define PEBS_DATACFG_GPRS BIT_ULL(1)
+#define PEBS_DATACFG_XMMS BIT_ULL(2)
+#define PEBS_DATACFG_LBRS BIT_ULL(3)
+#define PEBS_DATACFG_LBR_SHIFT 24
+
/*
* Intel "Architectural Performance Monitoring" CPUID
* detection/enumeration details:
@@ -176,6 +183,41 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
+/*
+ * Adaptive PEBS v4
+ */
+
+struct pebs_basic {
+ u64 format_size;
+ u64 ip;
+ u64 applicable_counters;
+ u64 tsc;
+};
+
+struct pebs_meminfo {
+ u64 address;
+ u64 aux;
+ u64 latency;
+ u64 tsx_tuning;
+};
+
+struct pebs_gprs {
+ u64 flags, ip, ax, cx, dx, bx, sp, bp, si, di;
+ u64 r8, r9, r10, r11, r12, r13, r14, r15;
+};
+
+struct pebs_xmm {
+ u64 xmm[16*2]; /* two entries for each register */
+};
+
+struct pebs_lbr_entry {
+ u64 from, to, info;
+};
+
+struct pebs_lbr {
+ struct pebs_lbr_entry lbr[0]; /* Variable length */
+};
+
/*
* IBS cpuid feature detection
*/
--
2.17.1
From: Andi Kleen <[email protected]>
Icelake supports a new CPUID 10.ECX cpu leaf to indicate some fixed
counters are not supported. This extends the previous count to a bitmap
which allows to disable even lower counters.
It's a nop on Icelake (all fixed counters are supported), but let's
implement it here. This adds the necessary checks. In theory it could
be used today by a Hypervisor.
For disabled counters disable any constraint events. I reuse the
existing intel_ctrl variable to remember which counters are disabled.
All code that reads all counters is fixed to check this extra bitmask.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 8 +++++++-
arch/x86/events/intel/core.c | 22 +++++++++++++++-------
arch/x86/events/perf_event.h | 6 ++++++
3 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 796e46a59148..283df78c52e0 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -225,6 +225,8 @@ static bool check_hw_exists(void)
if (ret)
goto msr_fail;
for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
+ if (fixed_counter_disabled(i))
+ continue;
if (val & (0x03 << i*4)) {
bios_fail = 1;
val_fail = val;
@@ -1362,6 +1364,8 @@ void perf_event_print_debug(void)
cpu, idx, prev_left);
}
for (idx = 0; idx < x86_pmu.num_counters_fixed; idx++) {
+ if (fixed_counter_disabled(idx))
+ continue;
rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + idx, pmc_count);
pr_info("CPU#%d: fixed-PMC%d count: %016llx\n",
@@ -1877,7 +1881,9 @@ static int __init init_hw_perf_events(void)
pr_info("... generic registers: %d\n", x86_pmu.num_counters);
pr_info("... value mask: %016Lx\n", x86_pmu.cntval_mask);
pr_info("... max period: %016Lx\n", x86_pmu.max_period);
- pr_info("... fixed-purpose events: %d\n", x86_pmu.num_counters_fixed);
+ pr_info("... fixed-purpose events: %lu\n",
+ hweight64((((1ULL << x86_pmu.num_counters_fixed) - 1)
+ << INTEL_PMC_IDX_FIXED) & x86_pmu.intel_ctrl));
pr_info("... event mask: %016Lx\n", x86_pmu.intel_ctrl);
/*
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 3f86af8ce832..433dbd0152a9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2278,8 +2278,11 @@ static void intel_pmu_reset(void)
wrmsrl_safe(x86_pmu_config_addr(idx), 0ull);
wrmsrl_safe(x86_pmu_event_addr(idx), 0ull);
}
- for (idx = 0; idx < x86_pmu.num_counters_fixed; idx++)
+ for (idx = 0; idx < x86_pmu.num_counters_fixed; idx++) {
+ if (fixed_counter_disabled(idx))
+ continue;
wrmsrl_safe(MSR_ARCH_PERFMON_FIXED_CTR0 + idx, 0ull);
+ }
if (ds)
ds->bts_index = ds->bts_buffer_base;
@@ -4476,7 +4479,7 @@ __init int intel_pmu_init(void)
union cpuid10_eax eax;
union cpuid10_ebx ebx;
struct event_constraint *c;
- unsigned int unused;
+ unsigned int fixed_mask;
struct extra_reg *er;
int version, i;
char *name;
@@ -4497,9 +4500,11 @@ __init int intel_pmu_init(void)
* Check whether the Architectural PerfMon supports
* Branch Misses Retired hw_event or not.
*/
- cpuid(10, &eax.full, &ebx.full, &unused, &edx.full);
+ cpuid(10, &eax.full, &ebx.full, &fixed_mask, &edx.full);
if (eax.split.mask_length < ARCH_PERFMON_EVENTS_COUNT)
return -ENODEV;
+ if (!fixed_mask)
+ fixed_mask = -1;
version = eax.split.version_id;
if (version < 2)
@@ -5017,7 +5022,8 @@ __init int intel_pmu_init(void)
}
x86_pmu.intel_ctrl |=
- ((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
+ (((1LL << x86_pmu.num_counters_fixed)-1) & (u64)fixed_mask)
+ << INTEL_PMC_IDX_FIXED;
if (x86_pmu.event_constraints) {
/*
@@ -5034,9 +5040,11 @@ __init int intel_pmu_init(void)
c->weight = hweight64(c->idxmsk64);
continue;
}
- if (c->cmask == FIXED_EVENT_FLAGS
- && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES) {
- c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
+ if (c->cmask == FIXED_EVENT_FLAGS) {
+ if (c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES)
+ c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
+ /* Disabled fixed counters which are not in CPUID */
+ c->idxmsk64 &= x86_pmu.intel_ctrl;
}
c->idxmsk64 &=
~(~0ULL << (INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed));
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ef8c4d846e87..8894e3bd1f23 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -926,6 +926,12 @@ ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
char *page);
+static inline bool fixed_counter_disabled(int i)
+{
+ return x86_pmu.intel_ctrl &&
+ ((1ULL << (i + INTEL_PMC_IDX_FIXED)) & x86_pmu.intel_ctrl);
+}
+
#ifdef CONFIG_CPU_SUP_AMD
int amd_pmu_init(void);
--
2.17.1
From: Andi Kleen <[email protected]>
Add some documentation how to use the topdown metrics in ring 3.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/topdown.txt | 223 +++++++++++++++++++++++++++
1 file changed, 223 insertions(+)
create mode 100644 tools/perf/Documentation/topdown.txt
diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt
new file mode 100644
index 000000000000..167393225641
--- /dev/null
+++ b/tools/perf/Documentation/topdown.txt
@@ -0,0 +1,223 @@
+Using TopDown metrics in user space
+-----------------------------------
+
+Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
+methology to break down CPU pipeline execution into 4 bottlenecks:
+frontend bound, backend bound, bad speculation, retiring.
+
+For more details on Topdown see [1][5]
+
+Traditionally this was implemented by events in generic counters
+and specific formulas to compute the bottlenecks.
+
+perf stat --topdown implements this.
+
+% perf stat -a --topdown -I1000
+# time counts unit events
+ 1.000373951 8,460,978,609 topdown-retiring # 22.9% retiring
+ 1.000373951 3,445,383,303 topdown-bad-spec # 9.3% bad speculation
+ 1.000373951 15,886,483,355 topdown-fe-bound # 43.0% frontend bound
+ 1.000373951 9,163,488,720 topdown-be-bound # 24.8% backend bound
+ 2.000782154 8,477,925,431 topdown-retiring # 22.9% retiring
+ 2.000782154 3,459,151,256 topdown-bad-spec # 9.3% bad speculation
+ 2.000782154 15,947,224,725 topdown-fe-bound # 43.0% frontend bound
+ 2.000782154 9,145,551,695 topdown-be-bound # 24.7% backend bound
+ 3.001155967 8,487,323,125 topdown-retiring # 22.9% retiring
+ 3.001155967 3,451,808,066 topdown-bad-spec # 9.3% bad speculation
+ 3.001155967 15,959,068,902 topdown-fe-bound # 43.0% frontend bound
+ 3.001155967 9,172,479,784 topdown-be-bound # 24.7% backend bound
+...
+
+Full Top Down includes more levels that can break down the
+bottlenecks further. This is not directly implemented in perf,
+but available in other tools that can run on top of perf,
+such as toplev[2] or vtune[3]
+
+New Topdown features in Icelake
+===============================
+
+With Icelake (2018 Core) CPUs the TopDown metrics are directly available as
+fixed counters and do not require generic counters. This allows
+to collect TopDown always in addition to other events.
+
+This also enables measuring TopDown per thread/process instead
+of only per core.
+
+Using TopDown through RDPMC in applications on Icelake
+======================================================
+
+For more fine grained measurements it can be useful to
+access the new directly from user space. This is more complicated,
+but drastically lowers overhead.
+
+On Icelake, there is a new fixed counter 3: SLOTS, which reports
+"pipeline SLOTS" (cycles multiplied by core issue width) and a
+metric register that reports slots ratios for the different bottleneck
+categories.
+
+The metrics counter is CPU model specific and is not be available
+on older CPUs.
+
+Example code
+============
+
+Library functions to do the functionality described below
+is also available in libjevents [4]
+
+The application opens a perf_event file descriptor
+and sets up fixed counter 3 (SLOTS) to start and
+allow user programs to read the performance counters.
+
+Fixed counter 3 is mapped to a pseudo event event=0x00, umask=04,
+so the perf_event_attr structure should be initialized with
+{ .config = 0x0400, .type = PERF_TYPE_RAW }
+
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/* Provide own perf_event_open stub because glibc doesn't */
+__attribute__((weak))
+int perf_event_open(struct perf_event_attr *attr, pid_t pid,
+ int cpu, int group_fd, unsigned long flags)
+{
+ return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
+}
+
+/* open slots counter file descriptor for current task */
+struct perf_event_attr slots = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .config = 0x400,
+ .exclude_kernel = 1,
+};
+
+int fd = perf_event_open(&slots, 0, -1, -1, 0);
+if (fd < 0)
+ ... error ...
+
+The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
+to read slots and the topdown metrics at different points of the program:
+
+#include <stdint.h>
+#include <x86intrin.h>
+
+#define RDPMC_FIXED (1 << 30) /* return fixed counters */
+#define RDPMC_METRIC (1 << 29) /* return metric counters */
+
+#define FIXED_COUNTER_SLOTS 3
+#define METRIC_COUNTER_TOPDOWN_L1 0
+
+static inline uint64_t read_slots(void)
+{
+ return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
+}
+
+static inline uint64_t read_metrics(void)
+{
+ return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
+}
+
+Then the program can be instrumented to read these metrics at different
+points.
+
+It's not a good idea to do this with too short code regions,
+as the parallelism and overlap in the CPU program execution will
+cause too much measurement inaccuracy. For example instrumenting
+individual basic blocks is definitely too fine grained.
+
+Decoding metrics values
+=======================
+
+The value reported by read_metrics() contains four 8 bit fields
+that represent a scaled ratio that represent the Level 1 bottleneck.
+All four fields add up to 0xff (= 100%)
+
+The binary ratios in the metric value can be converted to float ratios:
+
+#define GET_METRIC(m, i) (((m) >> (i*8)) & 0xff)
+
+#define TOPDOWN_RETIRING(val) ((float)GET_METRIC(val, 0) / 0xff)
+#define TOPDOWN_BAD_SPEC(val) ((float)GET_METRIC(val, 1) / 0xff)
+#define TOPDOWN_FE_BOUND(val) ((float)GET_METRIC(val, 2) / 0xff)
+#define TOPDOWN_BE_BOUND(val) ((float)GET_METRIC(val, 3) / 0xff)
+
+and then converted to percent for printing.
+
+The ratios in the metric accumulate for the time when the counter
+is enabled. For measuring programs it is often useful to measure
+specific sections. For this it is needed to deltas on metrics.
+
+This can be done by scaling the metrics with the slots counter
+read at the same time.
+
+Then it's possible to take deltas of these slots counts
+measured at different points, and determine the metrics
+for that time period.
+
+ slots_a = read_slots();
+ metric_a = read_metrics();
+
+ ... larger code region ...
+
+ slots_b = read_slots()
+ metric_b = read_metrics()
+
+ # compute scaled metrics for measurement a
+ retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a
+ bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a
+ fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a
+ be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a
+
+ # compute delta scaled metrics between b and a
+ retiring_slots = GET_METRIC(metric_b, 0) * slots_b - retiring_slots_a
+ bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b - bad_spec_slots_a
+ fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a
+ be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a
+
+Later the individual ratios for the measurement period can be recreated
+from these counts.
+
+ slots_delta = slots_b - slots_a
+ retiring_ratio = (float)retiring_slots / slots_delta
+ bad_spec_ratio = (float)bad_spec_slots / slots_delta
+ fe_bound_ratio = (float)fe_bound_slots / slots_delta
+ be_bound_ratio = (float)be_bound_slots / slota_delta
+
+ printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
+ retiring_ratio * 100.,
+ bad_spec_ratio * 100.,
+ fe_bound_ratio * 100.,
+ be_bound_ratio * 100.);
+
+Resetting metrics counters
+==========================
+
+Since the individual metrics are only 8bit they lose precision for
+short regions over time because the number of cycles covered by each
+fraction bit shrinks. So the counters need to be reset regularly.
+
+When using the kernel perf API the kernel resets on every read.
+So as long as the reading is at reasonable intervals (every few
+seconds) the precision is good.
+
+When using perf stat it is recommended to always use the -I option,
+with no longer interval than a few seconds
+
+ perf stat -I 1000 --topdown ...
+
+For user programs using RDPMC directly the counter can
+be reset explicitly using ioctl:
+
+ ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);
+
+This "opens" a new measurement period.
+
+A program using RDPMC for TopDown should schedule such a reset
+regularly, as in every few seconds.
+
+[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
+[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
+[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
+[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
+[5] https://sites.google.com/site/analysismethods/yasin-pubs
--
2.17.1
From: Andi Kleen <[email protected]>
Icelake has support for reporting per thread TopDown metrics.
These are reported differently than the previous TopDown support,
each metric is standalone, but scaled to pipeline "slots".
We don't need to do anything special for HyperThreading anymore.
Teach perf stat --topdown to handle these new metrics and
print them in the same way as the previous TopDown metrics.
The restrictions of only being able to report information per core is
gone.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-stat.txt | 9 ++-
tools/perf/builtin-stat.c | 24 +++++++
tools/perf/util/stat-shadow.c | 89 ++++++++++++++++++++++++++
tools/perf/util/stat.c | 4 ++
tools/perf/util/stat.h | 8 +++
5 files changed, 132 insertions(+), 2 deletions(-)
diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 4bc2085e5197..f4469751ef0a 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -266,8 +266,13 @@ if the workload is actually bound by the CPU and not by something else.
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
-The top down metrics are collected per core instead of per
-CPU thread. Per core mode is automatically enabled
+This enables --metric-only, unless overridden with --no-metric-only.
+
+The following restrictions only apply to older Intel CPUs and Atom,
+on newer CPUs (IceLake and later) TopDown can be collected for any thread:
+
+The top down metrics are collected per core instead of per CPU thread.
+Per core mode is automatically enabled
and -a (global monitoring) is needed, requiring root rights or
perf.perf_event_paranoid=-1.
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7b8f09b0b8bf..5068396c241b 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -123,6 +123,14 @@ static const char * topdown_attrs[] = {
NULL,
};
+static const char *topdown_metric_attrs[] = {
+ "topdown-retiring",
+ "topdown-bad-spec",
+ "topdown-fe-bound",
+ "topdown-be-bound",
+ NULL,
+};
+
static const char *smi_cost_attrs = {
"{"
"msr/aperf/,"
@@ -1215,6 +1223,21 @@ static int add_default_attributes(void)
char *str = NULL;
bool warn = false;
+ if (topdown_filter_events(topdown_metric_attrs, &str, 1) < 0) {
+ pr_err("Out of memory\n");
+ return -1;
+ }
+ if (topdown_metric_attrs[0] && str) {
+ if (!stat_config.interval) {
+ fprintf(stat_config.output,
+ "Topdown accuracy may decreases when measuring long period.\n"
+ "Please print the result regularly, e.g. -I1000\n");
+ }
+ goto setup_metrics;
+ }
+
+ str = NULL;
+
if (stat_config.aggr_mode != AGGR_GLOBAL &&
stat_config.aggr_mode != AGGR_CORE) {
pr_err("top down event configuration requires --per-core mode\n");
@@ -1236,6 +1259,7 @@ static int add_default_attributes(void)
if (topdown_attrs[0] && str) {
if (warn)
arch_topdown_group_warn();
+setup_metrics:
err = parse_events(evsel_list, str, &errinfo);
if (err) {
fprintf(stderr,
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 83d8094be4fe..6b6ffd64a4c4 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -238,6 +238,18 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 count,
else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
update_runtime_stat(st, STAT_TOPDOWN_RECOVERY_BUBBLES,
ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_RETIRING))
+ update_runtime_stat(st, STAT_TOPDOWN_RETIRING,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_BAD_SPEC))
+ update_runtime_stat(st, STAT_TOPDOWN_BAD_SPEC,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_FE_BOUND))
+ update_runtime_stat(st, STAT_TOPDOWN_FE_BOUND,
+ ctx, cpu, count);
+ else if (perf_stat_evsel__is(counter, TOPDOWN_BE_BOUND))
+ update_runtime_stat(st, STAT_TOPDOWN_BE_BOUND,
+ ctx, cpu, count);
else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT,
ctx, cpu, count);
@@ -682,6 +694,47 @@ static double td_be_bound(int ctx, int cpu, struct runtime_stat *st)
return sanitize_val(1.0 - sum);
}
+/*
+ * Kernel reports metrics multiplied with slots. To get back
+ * the ratios we need to recreate the sum.
+ */
+
+static double td_metric_ratio(int ctx, int cpu,
+ enum stat_type type,
+ struct runtime_stat *stat)
+{
+ double sum = runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, ctx, cpu) +
+ runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, ctx, cpu);
+ double d = runtime_stat_avg(stat, type, ctx, cpu);
+
+ if (sum)
+ return d / sum;
+ return 0;
+}
+
+/*
+ * ... but only if most of the values are actually available.
+ * We allow two missing.
+ */
+
+static bool full_td(int ctx, int cpu,
+ struct runtime_stat *stat)
+{
+ int c = 0;
+
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, ctx, cpu) > 0)
+ c++;
+ if (runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, ctx, cpu) > 0)
+ c++;
+ return c >= 2;
+}
+
static void print_smi_cost(struct perf_stat_config *config,
int cpu, struct perf_evsel *evsel,
struct perf_stat_output_ctx *out,
@@ -970,6 +1023,42 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
be_bound * 100.);
else
print_metric(config, ctxp, NULL, NULL, name, 0);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_RETIRING) &&
+ full_td(ctx, cpu, st)) {
+ double retiring = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_RETIRING, st);
+
+ if (retiring > 0.7)
+ color = PERF_COLOR_GREEN;
+ print_metric(config, ctxp, color, "%8.1f%%", "retiring",
+ retiring * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_FE_BOUND) &&
+ full_td(ctx, cpu, st)) {
+ double fe_bound = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_FE_BOUND, st);
+
+ if (fe_bound > 0.2)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "frontend bound",
+ fe_bound * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_BE_BOUND) &&
+ full_td(ctx, cpu, st)) {
+ double be_bound = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_BE_BOUND, st);
+
+ if (be_bound > 0.2)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "backend bound",
+ be_bound * 100.);
+ } else if (perf_stat_evsel__is(evsel, TOPDOWN_BAD_SPEC) &&
+ full_td(ctx, cpu, st)) {
+ double bad_spec = td_metric_ratio(ctx, cpu,
+ STAT_TOPDOWN_BAD_SPEC, st);
+
+ if (bad_spec > 0.1)
+ color = PERF_COLOR_RED;
+ print_metric(config, ctxp, color, "%8.1f%%", "bad speculation",
+ bad_spec * 100.);
} else if (evsel->metric_expr) {
generic_metric(config, evsel->metric_expr, evsel->metric_events, evsel->name,
evsel->metric_name, avg, cpu, out, st);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 4d40515307b8..56f86752a14e 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -87,6 +87,10 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
+ ID(TOPDOWN_RETIRING, topdown-retiring),
+ ID(TOPDOWN_BAD_SPEC, topdown-bad-spec),
+ ID(TOPDOWN_FE_BOUND, topdown-fe-bound),
+ ID(TOPDOWN_BE_BOUND, topdown-be-bound),
ID(SMI_NUM, msr/smi/),
ID(APERF, msr/aperf/),
};
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 2f9c9159a364..827c560ad19f 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -29,6 +29,10 @@ enum perf_stat_evsel_id {
PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
+ PERF_STAT_EVSEL_ID__TOPDOWN_RETIRING,
+ PERF_STAT_EVSEL_ID__TOPDOWN_BAD_SPEC,
+ PERF_STAT_EVSEL_ID__TOPDOWN_FE_BOUND,
+ PERF_STAT_EVSEL_ID__TOPDOWN_BE_BOUND,
PERF_STAT_EVSEL_ID__SMI_NUM,
PERF_STAT_EVSEL_ID__APERF,
PERF_STAT_EVSEL_ID__MAX,
@@ -81,6 +85,10 @@ enum stat_type {
STAT_TOPDOWN_SLOTS_RETIRED,
STAT_TOPDOWN_FETCH_BUBBLES,
STAT_TOPDOWN_RECOVERY_BUBBLES,
+ STAT_TOPDOWN_RETIRING,
+ STAT_TOPDOWN_BAD_SPEC,
+ STAT_TOPDOWN_FE_BOUND,
+ STAT_TOPDOWN_BE_BOUND,
STAT_SMI_NUM,
STAT_APERF,
STAT_MAX
--
2.17.1
From: Kan Liang <[email protected]>
Add V1 event list for Icelake.
Signed-off-by: Kan Liang <[email protected]>
---
.../pmu-events/arch/x86/icelake/cache.json | 552 +++++++++++
.../arch/x86/icelake/floating-point.json | 90 ++
.../pmu-events/arch/x86/icelake/frontend.json | 424 +++++++++
.../pmu-events/arch/x86/icelake/memory.json | 410 ++++++++
.../pmu-events/arch/x86/icelake/other.json | 133 +++
.../pmu-events/arch/x86/icelake/pipeline.json | 892 ++++++++++++++++++
.../arch/x86/icelake/virtual-memory.json | 236 +++++
tools/perf/pmu-events/arch/x86/mapfile.csv | 1 +
8 files changed, 2738 insertions(+)
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/cache.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/floating-point.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/frontend.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/memory.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/other.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/pipeline.json
create mode 100644 tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json
diff --git a/tools/perf/pmu-events/arch/x86/icelake/cache.json b/tools/perf/pmu-events/arch/x86/icelake/cache.json
new file mode 100644
index 000000000000..3529fc338c17
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/icelake/cache.json
@@ -0,0 +1,552 @@
+[
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of demand Data Read requests that miss L2 cache. Only not rejected loads are counted.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0x21",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.DEMAND_DATA_RD_MISS",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Demand Data Read miss L2, no rejects"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the RFO (Read-for-Ownership) requests that miss L2 cache.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0x22",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.RFO_MISS",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "RFO requests that miss L2 cache"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts L2 cache misses when fetching instructions.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0x24",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.CODE_RD_MISS",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "L2 cache misses when fetching instructions"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts demand requests that miss L2 cache.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0x27",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.ALL_DEMAND_MISS",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Demand requests that miss L2 cache"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts Software prefetch requests that miss the L2 cache. This event accounts for PREFETCHNTA and PREFETCHT0/1/2 instructions.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0x28",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.SWPF_MISS",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "SW prefetch requests that miss L2 cache."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of demand Data Read requests initiated by load instructions that hit L2 cache.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0xc1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.DEMAND_DATA_RD_HIT",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Demand Data Read requests that hit L2 cache"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the RFO (Read-for-Ownership) requests that hit L2 cache.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0xc2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.RFO_HIT",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "RFO requests that hit L2 cache"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts L2 cache hits when fetching instructions, code reads.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0xc4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.CODE_RD_HIT",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "L2 cache hits when fetching instructions, code reads."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts Software prefetch requests that hit the L2 cache. This event accounts for PREFETCHNTA and PREFETCHT0/1/2 instructions.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0xc8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.SWPF_HIT",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "SW prefetch requests that hit L2 cache."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of demand Data Read requests (including requests from L1D hardware prefetchers). These loads may hit or miss L2 cache. Only non rejected loads are counted.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0xe1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.ALL_DEMAND_DATA_RD",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Demand Data Read requests"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0xe2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.ALL_RFO",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "RFO requests to L2 cache"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the total number of L2 code requests.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0xe4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.ALL_CODE_RD",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "L2 code requests"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts demand requests to L2 cache.",
+ "EventCode": "0x24",
+ "Counter": "0,1,2,3",
+ "UMask": "0xe7",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_RQSTS.ALL_DEMAND_REFERENCES",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Demand requests to L2 cache"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of L1D misses that are outstanding in each cycle, that is each cycle the number of Fill Buffers (FB) outstanding required by Demand Reads. FB either is held by demand loads, or it is held by non-demand loads and gets hit at least once by demand. The valid outstanding interval is defined until the FB deallocation by one of the following ways: from FB allocation, if FB is allocated by demand from the demand Hit FB, if it is allocated by hardware or software prefetch. Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulted from any request type.",
+ "EventCode": "0x48",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L1D_PEND_MISS.PENDING",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of L1D misses that are outstanding"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts duration of L1D miss outstanding in cycles.",
+ "EventCode": "0x48",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L1D_PEND_MISS.PENDING_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles with L1D load Misses outstanding.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of cycles a demand request has waited due to L1D Fill Buffer (FB) unavailablability. Demand requests include cacheable/uncacheable demand load, store, lock or SW prefetch accesses.",
+ "EventCode": "0x48",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L1D_PEND_MISS.FB_FULL",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of cycles a demand request has waited due to L1D Fill Buffer (FB) unavailablability."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of phases a demand request has waited due to L1D Fill Buffer (FB) unavailablability. Demand requests include cacheable/uncacheable demand load, store, lock or SW prefetch accesses.",
+ "EventCode": "0x48",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L1D_PEND_MISS.FB_FULL_PERIODS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of phases a demand request has waited due to L1D Fill Buffer (FB) unavailablability.",
+ "CounterMask": "1",
+ "EdgeDetect": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of cycles a demand request has waited due to L1D due to lack of L2 resources. Demand requests include cacheable/uncacheable demand load, store, lock or SW prefetch accesses.",
+ "EventCode": "0x48",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L1D_PEND_MISS.L2_STALL",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of cycles a demand request has waited due to L1D due to lack of L2 resources."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.",
+ "EventCode": "0x51",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L1D.REPLACEMENT",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Counts the number of cache lines replaced in L1 data cache."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of offcore outstanding demand rfo Reads transactions in the super queue every cycle. The 'Offcore outstanding' state of the transaction lasts from the L2 miss until the sending transaction completion to requestor (SQ deallocation). See the corresponding Umask under OFFCORE_REQUESTS.",
+ "EventCode": "0x60",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles with offcore outstanding demand rfo reads transactions in SuperQueue (SQ), queue to uncore.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of offcore outstanding cacheable Core Data Read transactions in the super queue every cycle. A transaction is considered to be in the Offcore outstanding state between L2 miss and transaction completion sent to requestor (SQ de-allocation). See corresponding Umask under OFFCORE_REQUESTS.",
+ "EventCode": "0x60",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Offcore outstanding cacheable Core Data Read transactions in SuperQueue (SQ), queue to uncore"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when offcore outstanding cacheable Core Data Read transactions are present in the super queue. A transaction is considered to be in the Offcore outstanding state between L2 miss and transaction completion sent to requestor (SQ de-allocation). See corresponding Umask under OFFCORE_REQUESTS.",
+ "EventCode": "0x60",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles when offcore outstanding cacheable Core Data Read transactions are present in SuperQueue (SQ), queue to uncore.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the Demand Data Read requests sent to uncore. Use it in conjunction with OFFCORE_REQUESTS_OUTSTANDING to determine average latency in the uncore.",
+ "EventCode": "0xB0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "OFFCORE_REQUESTS.DEMAND_DATA_RD",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Demand Data Read requests sent to uncore"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the demand RFO (read for ownership) requests including regular RFOs, locks, ItoM.",
+ "EventCode": "0xB0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "OFFCORE_REQUESTS.DEMAND_RFO",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Demand RFO requests including regular RFOs, locks, ItoM"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the demand and prefetch data reads. All Core Data Reads include cacheable 'Demands' and L2 prefetchers (not L3 prefetchers). Counting also covers reads due to page walks resulted from any request type.",
+ "EventCode": "0xB0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "OFFCORE_REQUESTS.ALL_DATA_RD",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Demand and prefetch data reads"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts memory transactions reached the super queue including requests initiated by the core, all L3 prefetches, page walks, etc..",
+ "EventCode": "0xB0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "OFFCORE_REQUESTS.ALL_REQUESTS",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Any memory transaction that reached the SQ."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions that true miss the STLB.",
+ "EventCode": "0xD0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x11",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Retired load instructions that miss the STLB.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired store instructions that true miss the STLB.",
+ "EventCode": "0xD0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x12",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Retired store instructions that miss the STLB.",
+ "Data_LA": "1",
+ "L1_Hit_Indication": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions with locked access.",
+ "EventCode": "0xD0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x21",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_INST_RETIRED.LOCK_LOADS",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired load instructions with locked access.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.",
+ "EventCode": "0xD0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x41",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_INST_RETIRED.SPLIT_LOADS",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Retired load instructions that split across a cacheline boundary.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.",
+ "EventCode": "0xD0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x42",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_INST_RETIRED.SPLIT_STORES",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Retired store instructions that split across a cacheline boundary.",
+ "Data_LA": "1",
+ "L1_Hit_Indication": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions for loads.",
+ "EventCode": "0xD0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x81",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_INST_RETIRED.ALL_LOADS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "All retired load instructions.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts all retired store instructions. This event account for SW prefetch instructions and PREFETCHW instruction for stores.",
+ "EventCode": "0xD0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x82",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_INST_RETIRED.ALL_STORES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "All retired store instructions.",
+ "Data_LA": "1",
+ "L1_Hit_Indication": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.",
+ "EventCode": "0xD1",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_RETIRED.L1_HIT",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Retired load instructions with L1 cache hits as data sources",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.",
+ "EventCode": "0xD1",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_RETIRED.L2_HIT",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Retired load instructions with L2 cache hits as data sources",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.",
+ "EventCode": "0xD1",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_RETIRED.L3_HIT",
+ "SampleAfterValue": "50021",
+ "BriefDescription": "Retired load instructions with L3 cache hits as data sources",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.",
+ "EventCode": "0xD1",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_RETIRED.L1_MISS",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Retired load instructions missed L1 cache as data sources",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.",
+ "EventCode": "0xD1",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_RETIRED.L2_MISS",
+ "SampleAfterValue": "50021",
+ "BriefDescription": "Retired load instructions missed L2 cache as data sources",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.",
+ "EventCode": "0xD1",
+ "Counter": "0,1,2,3",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_RETIRED.L3_MISS",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired load instructions missed L3 cache as data sources",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.",
+ "EventCode": "0xd1",
+ "Counter": "0,1,2,3",
+ "UMask": "0x40",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_RETIRED.FB_HIT",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Number of completed demand load requests that missed the L1, but hit the FB(fill buffer), because a preceding miss to the same cacheline initiated the line to be brought into L1, but data is not yet ready in L1.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
+ "EventCode": "0xd2",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS",
+ "SampleAfterValue": "20011",
+ "BriefDescription": "Retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.",
+ "EventCode": "0xd2",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT",
+ "SampleAfterValue": "20011",
+ "BriefDescription": "Retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.",
+ "EventCode": "0xd2",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM",
+ "SampleAfterValue": "20011",
+ "BriefDescription": "Retired load instructions whose data sources were HitM responses from shared L3",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.",
+ "EventCode": "0xd2",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Retired load instructions whose data sources were hits in L3 without snoops required",
+ "Data_LA": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of L2 cache lines filling the L2. Counting does not cover rejects.",
+ "EventCode": "0xF1",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1f",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "L2_LINES_IN.ALL",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "L2 cache lines filling L2"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the cycles for which the thread is active and the superQ cannot take any more entries.",
+ "EventCode": "0xF4",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "SQ_MISC.SQ_FULL",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Cycles the thread is active and superQ cannot take any more entries."
+ }
+]
\ No newline at end of file
diff --git a/tools/perf/pmu-events/arch/x86/icelake/floating-point.json b/tools/perf/pmu-events/arch/x86/icelake/floating-point.json
new file mode 100644
index 000000000000..8e59146c2432
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/icelake/floating-point.json
@@ -0,0 +1,90 @@
+[
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+ "EventCode": "0xc7",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+ "EventCode": "0xc7",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+ "EventCode": "0xc7",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT14 RCP14 RANGE DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+ "EventCode": "0xc7",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+ "EventCode": "0xc7",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+ "EventCode": "0xc7",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of SSE/AVX computational 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 RANGE FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+ "EventCode": "0xc7",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x40",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of SSE/AVX computational 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 16 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 RANGE FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of SSE/AVX computational 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 16 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 RANGE FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+ "EventCode": "0xc7",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of SSE/AVX computational 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 RANGE FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
+ }
+]
\ No newline at end of file
diff --git a/tools/perf/pmu-events/arch/x86/icelake/frontend.json b/tools/perf/pmu-events/arch/x86/icelake/frontend.json
new file mode 100644
index 000000000000..9c3cfbfcec0f
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/icelake/frontend.json
@@ -0,0 +1,424 @@
+[
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of uops delivered to Instruction Decode Queue (IDQ) from the MITE path. This also means that uops are not being delivered from the Decode Stream Buffer (DSB).",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.MITE_UOPS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) from MITE path"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of cycles where optimal number of uops was delivered to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipeline) path. During these cycles uops are not being delivered from the Decode Stream Buffer (DSB).",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.MITE_CYCLES_OK",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles MITE is delivering optimal number of Uops",
+ "CounterMask": "5"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of cycles uops were delivered to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipeline) path. During these cycles uops are not being delivered from the Decode Stream Buffer (DSB).",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.MITE_CYCLES_ANY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles MITE is delivering any Uop",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of uops delivered to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.DSB_UOPS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of cycles where optimal number of uops was delivered to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipeline) path. During these cycles uops are not being delivered from the Decode Stream Buffer (DSB).",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.DSB_CYCLES_OK",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles DSB is delivering optimal number of Uops",
+ "CounterMask": "5"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of cycles uops were delivered to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.DSB_CYCLES_ANY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles Decode Stream Buffer (DSB) is delivering any Uop",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer.",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x30",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.MS_SWITCHES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of switches from DSB or MITE to the MS",
+ "CounterMask": "1",
+ "EdgeDetect": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the total number of uops delivered by the Microcode Sequencer (MS). Any instruction over 4 uops will be delivered by the MS. Some instructions such as transcendentals may additionally generate uops from the MS.",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x30",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.MS_UOPS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Uops delivered to IDQ while MS is busy"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles during which uops are being delivered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Uops maybe initiated by Decode Stream Buffer (DSB) or MITE.",
+ "EventCode": "0x79",
+ "Counter": "0,1,2,3",
+ "UMask": "0x30",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "IDQ.MS_CYCLES_ANY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles when uops are being delivered to IDQ while MS is busy",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles where a code line fetch is stalled due to an L1 instruction cache miss. The legacy decode pipeline works at a 16 Byte granularity.",
+ "EventCode": "0x80",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ICACHE_16B.IFDATA_STALL",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles where a code fetch is stalled due to L1 instruction cache miss."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts instruction fetch tag lookups that hit in the instruction cache (L1I). Counts at 64-byte cache-line granularity. Accounts for both cacheable and uncacheable accesses.",
+ "EventCode": "0x83",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ICACHE_64B.IFTAG_HIT",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Instruction fetch tag lookups that hit in the instruction cache (L1I). Counts at 64-byte cache-line granularity."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts instruction fetch tag lookups that miss in the instruction cache (L1I). Counts at 64-byte cache-line granularity. Accounts for both cacheable and uncacheable accesses.",
+ "EventCode": "0x83",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ICACHE_64B.IFTAG_MISS",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Instruction fetch tag lookups that miss in the instruction cache (L1I). Counts at 64-byte cache-line granularity."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles where a code fetch is stalled due to L1 instruction cache tag miss.",
+ "EventCode": "0x83",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ICACHE_64B.IFTAG_STALL",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Cycles where a code fetch is stalled due to L1 instruction cache tag miss."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of uops not delivered to by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there was no back-end stalls. This event counts for one SMT thread in a given cycle.",
+ "EventCode": "0x9C",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Uops not delivered by IDQ when backend of the machine is not stalled"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of cycles when no uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there was no back-end stalls. This event counts for one SMT thread in a given cycle.",
+ "EventCode": "0x9c",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles when no uops are not delivered by the IDQ when backend of the machine is not stalled",
+ "CounterMask": "5"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of cycles when the optimal number of uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there was no back-end stalls. This event counts for one SMT thread in a given cycle.",
+ "EventCode": "0x9C",
+ "Invert": "1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "IDQ_UOPS_NOT_DELIVERED.CYCLES_FE_WAS_OK",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles when optimal number of uops was delivered to the back-end when the back-end is not stalled",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Decode Stream Buffer (DSB) is a Uop-cache that holds translations of previously fetched instructions that were decoded by the legacy x86 decode pipeline (MITE). This event counts fetch penalty cycles when a transition occurs from DSB to MITE.",
+ "EventCode": "0xAB",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "DSB-to-MITE switch true penalty cycles."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x11",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.DSB_MISS",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired Instructions who experienced DSB miss.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x12",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.L1I_MISS",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired Instructions who experienced Instruction L1 Cache true miss.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x13",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.L2_MISS",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired Instructions who experienced Instruction L2 Cache true miss.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x14",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.ITLB_MISS",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired Instructions who experienced iTLB true miss.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x15",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.STLB_MISS",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired Instructions who experienced STLB (2nd level TLB) true miss.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 2 cycles which was not interrupted by a back-end stall.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x500206",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 2 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x500406",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x500806",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 8 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x501006",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 16 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x502006",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 32 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x504006",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x508006",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x510006",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x520006",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.",
+ "EventCode": "0xC6",
+ "MSRValue": "0x100206",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1",
+ "MSRIndex": "0x3F7",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Retired instructions that are fetched after an interval where the front-end had at least 1 bubble-slot for a period of 2 cycles which was not interrupted by a back-end stall.",
+ "TakenAlone": "1"
+ }
+]
\ No newline at end of file
diff --git a/tools/perf/pmu-events/arch/x86/icelake/memory.json b/tools/perf/pmu-events/arch/x86/icelake/memory.json
new file mode 100644
index 000000000000..f158366b9dd6
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/icelake/memory.json
@@ -0,0 +1,410 @@
+[
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times a TSX line had a cache conflict.",
+ "EventCode": "0x54",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TX_MEM.ABORT_CONFLICT",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times a transactional abort was signaled due to a data conflict on a transactionally accessed address"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Speculatively counts the number Transactional Synchronization Extensions (TSX) Aborts due to a data capacity limitation for transactional writes.",
+ "EventCode": "0x54",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TX_MEM.ABORT_CAPACITY_WRITE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Speculatively counts the number TSX Aborts due to a data capacity limitation for transactional writes."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times a TSX Abort was triggered due to a non-release/commit store to lock.",
+ "EventCode": "0x54",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TX_MEM.ABORT_HLE_STORE_TO_ELIDED_LOCK",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Number of times a HLE transactional region aborted due to a non XRELEASE prefixed instruction writing to an elided lock in the elision buffer"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times a TSX Abort was triggered due to commit but Lock Buffer not empty.",
+ "EventCode": "0x54",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TX_MEM.ABORT_HLE_ELISION_BUFFER_NOT_EMPTY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE transactional execution aborted due to NoAllocatedElisionBuffer being non-zero."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times a TSX Abort was triggered due to release/commit but data and address mismatch.",
+ "EventCode": "0x54",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TX_MEM.ABORT_HLE_ELISION_BUFFER_MISMATCH",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE transactional execution aborted due to XRELEASE lock not satisfying the address and value requirements in the elision buffer"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times a TSX Abort was triggered due to attempting an unsupported alignment from Lock Buffer.",
+ "EventCode": "0x54",
+ "Counter": "0,1,2,3",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TX_MEM.ABORT_HLE_ELISION_BUFFER_UNSUPPORTED_ALIGNMENT",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE transactional execution aborted due to an unsupported read alignment from the elision buffer."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times we could not allocate Lock Buffer.",
+ "EventCode": "0x54",
+ "Counter": "0,1,2,3",
+ "UMask": "0x40",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TX_MEM.HLE_ELISION_BUFFER_FULL",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times HLE lock could not be elided due to ElisionBufferAvailable being zero."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts Unfriendly TSX abort triggered by a vzeroupper instruction.",
+ "EventCode": "0x5d",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "TX_EXEC.MISC2",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Counts the number of times a class of instructions that may cause a transactional abort was executed inside a transactional region"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts Unfriendly TSX abort triggered by a nest count that is too deep.",
+ "EventCode": "0x5d",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "TX_EXEC.MISC3",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an instruction execution caused the transactional nest count supported to be exceeded"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CYCLE_ACTIVITY.CYCLES_L3_MISS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles while L3 cache miss demand load is outstanding.",
+ "CounterMask": "2"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3",
+ "UMask": "0x6",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CYCLE_ACTIVITY.STALLS_L3_MISS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Execution stalls while L3 cache miss demand load is outstanding.",
+ "CounterMask": "6"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Demand Data Read requests who miss L3 cache.",
+ "EventCode": "0xB0",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Demand Data Read requests who miss L3 cache"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of Machine Clears detected dye to memory ordering. Memory Ordering Machine Clears may apply when a memory read may not conform to the memory ordering rules of the x86 architecture",
+ "EventCode": "0xc3",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MACHINE_CLEARS.MEMORY_ORDERING",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Number of machine clears due to memory ordering conflicts."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times we entered an HLE region. Does not count nested transactions.",
+ "EventCode": "0xC8",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "HLE_RETIRED.START",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE execution started."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times HLE commit succeeded.",
+ "EventCode": "0xC8",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "HLE_RETIRED.COMMIT",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE execution successfully committed",
+ "Data_LA": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times HLE abort was triggered.",
+ "EventCode": "0xc8",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "HLE_RETIRED.ABORTED",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE execution aborted due to any reasons (multiple categories may count as one)."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times an HLE execution aborted due to various memory events (e.g., read/write capacity and conflicts).",
+ "EventCode": "0xC8",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "HLE_RETIRED.ABORTED_MEM",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE execution aborted due to various memory events (e.g., read/write capacity and conflicts)."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times an HLE execution aborted due to HLE-unfriendly instructions and certain unfriendly events (such as AD assists etc.).",
+ "EventCode": "0xC8",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "HLE_RETIRED.ABORTED_UNFRIENDLY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE execution aborted due to HLE-unfriendly instructions and certain unfriendly events (such as AD assists etc.)."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times an HLE execution aborted due to unfriendly events (such as interrupts).",
+ "EventCode": "0xC8",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "HLE_RETIRED.ABORTED_EVENTS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an HLE execution aborted due to unfriendly events (such as interrupts)."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times we entered an RTM region. Does not count nested transactions.",
+ "EventCode": "0xC9",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RTM_RETIRED.START",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an RTM execution started."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times RTM commit succeeded.",
+ "EventCode": "0xC9",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RTM_RETIRED.COMMIT",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an RTM execution successfully committed"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times RTM abort was triggered.",
+ "EventCode": "0xc9",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RTM_RETIRED.ABORTED",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an RTM execution aborted.",
+ "Data_LA": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times an RTM execution aborted due to various memory events (e.g. read/write capacity and conflicts).",
+ "EventCode": "0xC9",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RTM_RETIRED.ABORTED_MEM",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an RTM execution aborted due to various memory events (e.g. read/write capacity and conflicts)"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times an RTM execution aborted due to HLE-unfriendly instructions.",
+ "EventCode": "0xC9",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RTM_RETIRED.ABORTED_UNFRIENDLY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an RTM execution aborted due to HLE-unfriendly instructions"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times an RTM execution aborted due to incompatible memory type.",
+ "EventCode": "0xC9",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x40",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RTM_RETIRED.ABORTED_MEMTYPE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an RTM execution aborted due to incompatible memory type"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt).",
+ "EventCode": "0xC9",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RTM_RETIRED.ABORTED_EVENTS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.",
+ "EventCode": "0xcd",
+ "MSRValue": "0x4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
+ "MSRIndex": "0x3F6",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.",
+ "EventCode": "0xcd",
+ "MSRValue": "0x8",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
+ "MSRIndex": "0x3F6",
+ "SampleAfterValue": "50021",
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.",
+ "EventCode": "0xcd",
+ "MSRValue": "0x10",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16",
+ "MSRIndex": "0x3F6",
+ "SampleAfterValue": "20011",
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.",
+ "EventCode": "0xcd",
+ "MSRValue": "0x20",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
+ "MSRIndex": "0x3F6",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.",
+ "EventCode": "0xcd",
+ "MSRValue": "0x40",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
+ "MSRIndex": "0x3F6",
+ "SampleAfterValue": "2003",
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.",
+ "EventCode": "0xcd",
+ "MSRValue": "0x80",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128",
+ "MSRIndex": "0x3F6",
+ "SampleAfterValue": "1009",
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.",
+ "EventCode": "0xcd",
+ "MSRValue": "0x100",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256",
+ "MSRIndex": "0x3F6",
+ "SampleAfterValue": "503",
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles.",
+ "TakenAlone": "1"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.",
+ "EventCode": "0xcd",
+ "MSRValue": "0x200",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
+ "MSRIndex": "0x3F6",
+ "SampleAfterValue": "101",
+ "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles.",
+ "TakenAlone": "1"
+ }
+]
\ No newline at end of file
diff --git a/tools/perf/pmu-events/arch/x86/icelake/other.json b/tools/perf/pmu-events/arch/x86/icelake/other.json
new file mode 100644
index 000000000000..b9d0e17d4d23
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/icelake/other.json
@@ -0,0 +1,133 @@
+[
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of available slots for an unhalted logical processor. The event increments by machine-width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method. The count is distributed among unhalted logical processors (hyper-threads) who share the same physical core. Software can use this event as the denominator for the top-level metrics of the Top-down Microarchitecture Analysis method. This event is counted on a designated fixed counter (Fixed Counter 3) and is an architectural event.",
+ "Counter": "35",
+ "UMask": "0x4",
+ "PEBScounters": "35",
+ "EventName": "TOPDOWN.SLOTS",
+ "SampleAfterValue": "10000003",
+ "BriefDescription": "Counts the number of available slots for an unhalted logical processor."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes.",
+ "EventCode": "0x28",
+ "Counter": "0,1,2,3",
+ "UMask": "0x7",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CORE_POWER.LVL0_TURBO_LICENSE",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Core cycles where the core was running in a manner where Turbo may be clipped to the Non-AVX turbo schedule."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions.",
+ "EventCode": "0x28",
+ "Counter": "0,1,2,3",
+ "UMask": "0x18",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CORE_POWER.LVL1_TURBO_LICENSE",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Core cycles where the core was running in a manner where Turbo may be clipped to the AVX2 turbo schedule."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Core cycles where the core was running with power-delivery for license level 2 (introduced in Skylake Server microarchtecture). This includes high current AVX 512-bit instructions.",
+ "EventCode": "0x28",
+ "Counter": "0,1,2,3",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CORE_POWER.LVL2_TURBO_LICENSE",
+ "SampleAfterValue": "200003",
+ "BriefDescription": "Core cycles where the core was running in a manner where Turbo may be clipped to the AVX512 turbo schedule."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of PREFETCHNTA instructions executed.",
+ "EventCode": "0x32",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "SW_PREFETCH_ACCESS.NTA",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of PREFETCHNTA instructions executed."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of PREFETCHT0 instructions executed.",
+ "EventCode": "0x32",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "SW_PREFETCH_ACCESS.T0",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of PREFETCHT0 instructions executed."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of PREFETCHT1 or PREFETCHT2 instructions executed.",
+ "EventCode": "0x32",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "SW_PREFETCH_ACCESS.T1_T2",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of PREFETCHT1 or PREFETCHT2 instructions executed."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of PREFETCHW instructions executed.",
+ "EventCode": "0x32",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "SW_PREFETCH_ACCESS.PREFETCHW",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of PREFETCHW instructions executed."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of available slots for an unhalted logical processor. The event increments by machine-width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method. The count is distributed among unhalted logical processors (hyper-threads) who share the same physical core.",
+ "EventCode": "0xa4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "TOPDOWN.SLOTS_P",
+ "SampleAfterValue": "10000003",
+ "BriefDescription": "Counts the number of available slots for an unhalted logical processor."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "TOPDOWN.BACKEND_BOUND_SLOTS",
+ "SampleAfterValue": "10000003",
+ "BriefDescription": "Issue slots where no uops were being issued due to lack of back end resources."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts all microcode Floating Point assists.",
+ "EventCode": "0xC1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "ASSISTS.FP",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Counts all microcode FP assists.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of occurrences where a microcode assist is invoked by hardware Examples include AD (page Access Dirty), FP and AVX related assists.",
+ "EventCode": "0xc1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x7",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "ASSISTS.ANY",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Number of occurrences where a microcode assist is invoked by hardware."
+ }
+]
\ No newline at end of file
diff --git a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json
new file mode 100644
index 000000000000..6d8311e634aa
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json
@@ -0,0 +1,892 @@
+[
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
+ "Counter": "32",
+ "UMask": "0x1",
+ "PEBScounters": "32",
+ "EventName": "INST_RETIRED.ANY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event"
+ },
+ {
+ "PEBS": "2",
+ "CollectPEBSRecord": "3",
+ "PublicDescription": "A version of INST_RETIRED that allows for a more unbiased distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR) feature to mitigate some bias in how retired instructions get sampled. Use on Fixed Counter 0.",
+ "Counter": "32",
+ "UMask": "0x1",
+ "PEBScounters": "32",
+ "EventName": "INST_RETIRED.PREC_DIST",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Precise instruction retired event with a reduced effect of PEBS shadow in IP distribution"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of core cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason this event may have a changing ratio with regards to time. When the core frequency is constant, this event can approximate elapsed time while the core was not in the halt state. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events.",
+ "Counter": "33",
+ "UMask": "0x2",
+ "PEBScounters": "33",
+ "EventName": "CPU_CLK_UNHALTED.THREAD",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Core cycles when the thread is not in halt state"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK event. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. Note: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.",
+ "Counter": "34",
+ "UMask": "0x3",
+ "PEBScounters": "34",
+ "EventName": "CPU_CLK_UNHALTED.REF_TSC",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Reference cycles when the core is not in halt state."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when: a. preceding store conflicts with the load (incomplete overlap),b. store forwarding is impossible due to u-arch limitations, c. preceding lock RMW operations are not forwarded, d. store has the no-forward bit set (uncacheable/page-split/masked stores), e. all-blocking stores are used (mostly, fences and port I/O), and others. The most common case is a load blocked due to its address range overlapping with a preceding smaller uncompleted store. Note: This event does not take into account cases of out-of-SW-control (for example, SbTailHit), unknown physical STA, and cases of blocking loads on store due to being non-WB memory type or a lock. These cases are covered by other events. See the table of not supported store forwards in the Optimization Guide.",
+ "EventCode": "0x03",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "LD_BLOCKS.STORE_FORWARD",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Loads blocked by overlapping with store buffer that cannot be forwarded."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use.",
+ "EventCode": "0x03",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "LD_BLOCKS.NO_SR",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "The number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times a load got blocked due to false dependencies in MOB due to partial compare on address.",
+ "EventCode": "0x07",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "False dependencies in MOB due to partial compare on address."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts core cycles when the Resource allocator was stalled due to recovery from an earlier branch misprediction or machine clear event.",
+ "EventCode": "0x0D",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "INT_MISC.RECOVERY_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Core cycles the allocator was stalled due to recovery from earlier clear event for this thread"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles the Backend cluster is recovering after a miss-speculation or a Store Buffer or Load Buffer drain stall.",
+ "EventCode": "0x0D",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x3",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "INT_MISC.ALL_RECOVERY_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles the Backend cluster is recovering after a miss-speculation or a Store Buffer or Load Buffer drain stall.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Cycles after recovery from a branch misprediction or machine clear till the first uop is issued from the resteered path.",
+ "EventCode": "0x0d",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "INT_MISC.CLEAR_RESTEER_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Counts cycles after recovery from a branch misprediction or machine clear till the first uop is issued from the resteered path."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of uops that the Resource Allocation Table (RAT) issues to the Reservation Station (RS).",
+ "EventCode": "0x0E",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_ISSUED.ANY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Uops that RAT issues to RS"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles during which the Resource Allocation Table (RAT) does not issue any Uops to the reservation station (RS) for the current thread.",
+ "EventCode": "0x0E",
+ "Invert": "1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_ISSUED.STALL_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles when RAT does not issue Uops to RS for the thread",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations.",
+ "EventCode": "0x14",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x9",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "ARITH.DIVIDER_ACTIVE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles when divide unit is busy executing divide or square root operations.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "This is an architectural event that counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling. For this reason, this event may have a changing ratio with regards to wall clock time.",
+ "EventCode": "0x3C",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "CPU_CLK_UNHALTED.THREAD_P",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Thread cycles when thread is not in halt state"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts core crystal clock cycles when the thread is unhalted.",
+ "EventCode": "0x3C",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "CPU_CLK_UNHALTED.REF_XCLK",
+ "SampleAfterValue": "25003",
+ "BriefDescription": "Core crystal clock cycles when the thread is unhalted."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts Core crystal clock cycles when current thread is unhalted and the other thread is halted.",
+ "EventCode": "0x3C",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE",
+ "SampleAfterValue": "25003",
+ "BriefDescription": "Core crystal clock cycles when this thread is unhalted and the other thread is halted."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts all not software-prefetch load dispatches that hit the fill buffer (FB) allocated for the software prefetch. It can also be incremented by some lock instructions. So it should only be used with profiling so that the locks can be excluded by ASM (Assembly File) inspection of the nearby instructions.",
+ "EventCode": "0x4c",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "LOAD_HIT_PREFETCH.SWPF",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Counts the number of demand load dispatches that hit L1D fill buffer (FB) allocated for software prefetch."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into stravation periods (e.g. branch mispredictions or i-cache misses)",
+ "EventCode": "0x5E",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RS_EVENTS.EMPTY_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts end of periods where the Reservation Station (RS) was empty. Could be useful to closely sample on front-end latency issues (see the FRONTEND_RETIRED event of designated precise events)",
+ "EventCode": "0x5E",
+ "Invert": "1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RS_EVENTS.EMPTY_END",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Counts end of periods where the Reservation Station (RS) was empty.",
+ "CounterMask": "1",
+ "EdgeDetect": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles that the Instruction Length decoder (ILD) stalls occurred due to dynamically changing prefix length of the decoded instruction (by operand size prefix instruction 0x66, address size prefix instruction 0x67 or REX.W for Intel64). Count is proportional to the number of prefixes in a 16B-line. This may result in a three-cycle penalty for each LCP (Length changing prefix) in a 16-byte chunk.",
+ "EventCode": "0x87",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ILD_STALL.LCP",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Stalls caused by changing prefix length of the instruction."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts, on the per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to port 0.",
+ "EventCode": "0xa1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_DISPATCHED.PORT_0",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of uops executed on port 0"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts, on the per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to port 1.",
+ "EventCode": "0xa1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_DISPATCHED.PORT_1",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of uops executed on port 1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts, on the per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to ports 2 and 3.",
+ "EventCode": "0xa1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_DISPATCHED.PORT_2_3",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of uops executed on port 2 and 3"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts, on the per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to ports 5 and 9.",
+ "EventCode": "0xa1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_DISPATCHED.PORT_4_9",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of uops executed on port 4 and 9"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts, on the per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to port 5.",
+ "EventCode": "0xa1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_DISPATCHED.PORT_5",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of uops executed on port 5"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts, on the per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to port 6.",
+ "EventCode": "0xa1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x40",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_DISPATCHED.PORT_6",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of uops executed on port 6"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts, on the per-thread basis, cycles during which at least one uop is dispatched from the Reservation Station (RS) to ports 7 and 8.",
+ "EventCode": "0xa1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_DISPATCHED.PORT_7_8",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of uops executed on port 7 and 8"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xa2",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RESOURCE_STALLS.SCOREBOARD",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Counts cycles where the pipeline is stalled due to serializing operations."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts allocation stall cycles caused by the store buffer (SB) being full. This counts cycles that the pipeline back-end blocked uop delivery from the front-end.",
+ "EventCode": "0xA2",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "RESOURCE_STALLS.SB",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles stalled due to no store buffers available. (not including draining form sync)."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CYCLE_ACTIVITY.CYCLES_L2_MISS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles while L2 cache miss demand load is outstanding.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "CYCLE_ACTIVITY.STALLS_TOTAL",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Total execution stalls.",
+ "CounterMask": "4"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3",
+ "UMask": "0x5",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CYCLE_ACTIVITY.STALLS_L2_MISS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Execution stalls while L2 cache miss demand load is outstanding.",
+ "CounterMask": "5"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CYCLE_ACTIVITY.CYCLES_L1D_MISS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles while L1 cache miss demand load is outstanding.",
+ "CounterMask": "8"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3",
+ "UMask": "0xc",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "CYCLE_ACTIVITY.STALLS_L1D_MISS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Execution stalls while L1 cache miss demand load is outstanding.",
+ "CounterMask": "12"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "CYCLE_ACTIVITY.CYCLES_MEM_ANY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles while memory subsystem has an outstanding load.",
+ "CounterMask": "16"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xA3",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x14",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "CYCLE_ACTIVITY.STALLS_MEM_ANY",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Execution stalls while memory subsystem has an outstanding load.",
+ "CounterMask": "20"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles during which a total of 1 uop was executed on all ports and Reservation Station (RS) was not empty.",
+ "EventCode": "0xa6",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "EXE_ACTIVITY.1_PORTS_UTIL",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles total of 1 uop is executed on all ports and Reservation Station was not empty."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles during which a total of 2 uops were executed on all ports and Reservation Station (RS) was not empty.",
+ "EventCode": "0xa6",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "EXE_ACTIVITY.2_PORTS_UTIL",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles total of 2 uops are executed on all ports and Reservation Station was not empty."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles where the Store Buffer was full and no loads caused an execution stall.",
+ "EventCode": "0xA6",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x40",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "EXE_ACTIVITY.BOUND_ON_STORES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles where the Store Buffer was full and no loads caused an execution stall.",
+ "CounterMask": "2"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles during which no uops were executed on all ports and Reservation Station (RS) was not empty.",
+ "EventCode": "0xa6",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "EXE_ACTIVITY.EXE_BOUND_0_PORTS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles where no uops were executed, the Reservation Station was not empty, the Store Buffer was full and there was no outstanding load."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of uops delivered to the back-end by the LSD(Loop Stream Detector).",
+ "EventCode": "0xA8",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "LSD.UOPS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of Uops delivered by the LSD."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the cycles when at least one uop is delivered by the LSD (Loop-stream detector).",
+ "EventCode": "0xA8",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "LSD.CYCLES_ACTIVE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles Uops delivered by the LSD, but didn't come from the decoder.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the cycles when optimal number of uops is delivered by the LSD (Loop-stream detector).",
+ "EventCode": "0xa8",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "LSD.CYCLES_OK",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles optimal number of Uops delivered by the LSD, but did not come from the decoder.",
+ "CounterMask": "5"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "EventCode": "0xB1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.THREAD",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Counts the number of uops to be executed per-thread each cycle."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles during which no uops were dispatched from the Reservation Station (RS) per thread.",
+ "EventCode": "0xB1",
+ "Invert": "1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.STALL_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Counts number of cycles no uops were dispatched to be executed on this thread.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Cycles where at least 1 uop was executed per-thread.",
+ "EventCode": "0xb1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CYCLES_GE_1",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles where at least 1 uop was executed per-thread",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Cycles where at least 2 uops were executed per-thread.",
+ "EventCode": "0xb1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CYCLES_GE_2",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles where at least 2 uops were executed per-thread",
+ "CounterMask": "2"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Cycles where at least 3 uops were executed per-thread.",
+ "EventCode": "0xb1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CYCLES_GE_3",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles where at least 3 uops were executed per-thread",
+ "CounterMask": "3"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Cycles where at least 4 uops were executed per-thread.",
+ "EventCode": "0xb1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CYCLES_GE_4",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles where at least 4 uops were executed per-thread",
+ "CounterMask": "4"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of uops executed from any thread.",
+ "EventCode": "0xB1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CORE",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of uops executed on the core."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when at least 1 micro-op is executed from any thread on physical core.",
+ "EventCode": "0xB1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CORE_CYCLES_GE_1",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles at least 1 micro-op is executed from any thread on physical core.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when at least 2 micro-ops are executed from any thread on physical core.",
+ "EventCode": "0xB1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CORE_CYCLES_GE_2",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles at least 2 micro-op is executed from any thread on physical core.",
+ "CounterMask": "2"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when at least 3 micro-ops are executed from any thread on physical core.",
+ "EventCode": "0xB1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CORE_CYCLES_GE_3",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles at least 3 micro-op is executed from any thread on physical core.",
+ "CounterMask": "3"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when at least 4 micro-ops are executed from any thread on physical core.",
+ "EventCode": "0xB1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.CORE_CYCLES_GE_4",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles at least 4 micro-op is executed from any thread on physical core.",
+ "CounterMask": "4"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of x87 uops executed.",
+ "EventCode": "0xB1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_EXECUTED.X87",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Counts the number of x87 uops dispatched."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
+ "EventCode": "0xC0",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "INST_RETIRED.ANY_P",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of instructions retired. General Counter - architectural event"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of cycles using always true condition (uops_ret &lt; 16) applied to non PEBS uops retired event.",
+ "EventCode": "0xC2",
+ "Invert": "1",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_RETIRED.TOTAL_CYCLES",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycles with less than 10 actually retired uops.",
+ "CounterMask": "10"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the retirement slots used each cycle.",
+ "EventCode": "0xc2",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "UOPS_RETIRED.SLOTS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Retirement slots used."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of machine clears (nukes) of any type.",
+ "EventCode": "0xC3",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MACHINE_CLEARS.COUNT",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Number of machine clears (nukes) of any type.",
+ "CounterMask": "1",
+ "EdgeDetect": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts self-modifying code (SMC) detected, which causes a machine clear.",
+ "EventCode": "0xC3",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MACHINE_CLEARS.SMC",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Self-modifying code (SMC) detected."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts all branch instructions retired.",
+ "EventCode": "0xC4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "All branch instructions retired."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts taken conditional branch instructions retired.",
+ "EventCode": "0xc4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.COND_TAKEN",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "Taken conditional branch instructions retired."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts both direct and indirect near call instructions retired.",
+ "EventCode": "0xC4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.NEAR_CALL",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Direct and indirect near call instructions retired."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts return instructions retired.",
+ "EventCode": "0xC4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x8",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.NEAR_RETURN",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Return instructions retired."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts not taken branch instructions retired.",
+ "EventCode": "0xC4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.COND_NTAKEN",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "Not taken branch instructions retired."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts conditional branch instructions retired.",
+ "EventCode": "0xc4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x11",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.COND",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "Conditional branch instructions retired."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts taken branch instructions retired.",
+ "EventCode": "0xC4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "Taken branch instructions retired."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts far branch instructions retired.",
+ "EventCode": "0xC4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x40",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.FAR_BRANCH",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Far branch instructions retired."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts all indirect branch instructions retired (excluding RETs. TSX aborts is considered indirect branch).",
+ "EventCode": "0xc4",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_INST_RETIRED.INDIRECT",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "All indirect branch instructions retired (excluding RETs. TSX aborts are considered indirect branch)."
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.",
+ "EventCode": "0xC5",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_MISP_RETIRED.ALL_BRANCHES",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "All mispredicted branch instructions retired.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.",
+ "EventCode": "0xc5",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_MISP_RETIRED.COND_TAKEN",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "number of branch instructions retired that were mispredicted and taken. Non PEBS",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts mispredicted conditional branch instructions retired.",
+ "EventCode": "0xc5",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x11",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_MISP_RETIRED.COND",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "Mispredicted conditional branch instructions retired.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.",
+ "EventCode": "0xC5",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_MISP_RETIRED.NEAR_TAKEN",
+ "SampleAfterValue": "400009",
+ "BriefDescription": "Number of near branch instructions retired that were mispredicted and taken.",
+ "Data_LA": "1"
+ },
+ {
+ "PEBS": "1",
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts all miss-predicted indirect branch instructions retired (excluding RETs. TSX aborts is considered indirect branch).",
+ "EventCode": "0xC5",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x80",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "BR_MISP_RETIRED.INDIRECT",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "All miss-predicted indirect branch instructions retired (excluding RETs. TSX aborts is considered indirect branch).",
+ "Data_LA": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Increments when an entry is added to the Last Branch Record (LBR) array (or removed from the array in case of RETURNs in call stack mode). The event requires LBR enable via IA32_DEBUGCTL MSR and branch type selection via MSR_LBR_SELECT.",
+ "EventCode": "0xcc",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "MISC_RETIRED.LBR_INSERTS",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Increments whenever there is an update to the LBR array."
+ },
+ {
+ "PublicDescription": "Counts number of retired PAUSE instructions (that do not end up with a VMExit to the VMM; TSX aborted Instructions may be counted).",
+ "EventCode": "0xcc",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x40",
+ "EventName": "MISC_RETIRED.PAUSE_INST",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of retired PAUSE instructions."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of times the front-end is resteered when it finds a branch instruction in a fetch line. This occurs for the first time a branch instruction is fetched or when the branch is not tracked by the BPU (Branch Prediction Unit) anymore.",
+ "EventCode": "0xE6",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "BACLEARS.ANY",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Counts the total number when the front end is resteered, mainly when the BPU cannot provide a correct prediction and this is corrected by other branch handling mechanisms at the front end."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "This event distributes cycle counts between active hyperthreads, i.e., those in C0. A hyperthread becomes inactive when it executes the HLT or MWAIT instructions. If all other hyperthreads are inactive (or disabled or do not exist), all counts are attributed to this hyperthread. To obtain the full count when the Core is active, sum the counts from each hyperthread.",
+ "EventCode": "0xec",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3,4,5,6,7",
+ "EventName": "CPU_CLK_UNHALTED.DISTRIBUTED",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Cycle counts are evenly distributed between active threads in the Core."
+ }
+]
\ No newline at end of file
diff --git a/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json b/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json
new file mode 100644
index 000000000000..7180a900c175
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json
@@ -0,0 +1,236 @@
+[
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts page walks completed due to demand data loads whose address translations missed in the TLB and were mapped to 4K pages. The page walks can end with or without a page fault.",
+ "EventCode": "0x08",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Page walks completed due to a demand data load to a 4K page."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts page walks completed due to demand data loads whose address translations missed in the TLB and were mapped to 2M/4M pages. The page walks can end with or without a page fault.",
+ "EventCode": "0x08",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Page walks completed due to a demand data load to a 2M/4M page."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts demand data loads that caused a completed page walk of any page size (4K/2M/4M/1G). This implies it missed in all TLB levels. The page walk can end with or without a fault.",
+ "EventCode": "0x08",
+ "Counter": "0,1,2,3",
+ "UMask": "0xe",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Load miss in all TLB levels causes a page walk that completes. (All page sizes)"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of page walks outstanding for a demand load in the PMH (Page Miss Handler) each cycle.",
+ "EventCode": "0x08",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_LOAD_MISSES.WALK_PENDING",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of page walks outstanding for a demand load in the PMH each cycle."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a demand load.",
+ "EventCode": "0x08",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_LOAD_MISSES.WALK_ACTIVE",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Cycles when at least one PMH is busy with a page walk for a demand load.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts loads that miss the DTLB (Data TLB) and hit the STLB (Second level TLB).",
+ "EventCode": "0x08",
+ "Counter": "0,1,2,3",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_LOAD_MISSES.STLB_HIT",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Loads that miss the DTLB and hit the STLB."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts page walks completed due to demand data stores whose address translations missed in the TLB and were mapped to 4K pages. The page walks can end with or without a page fault.",
+ "EventCode": "0x49",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Page walks completed due to a demand data store to a 4K page."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts page walks completed due to demand data stores whose address translations missed in the TLB and were mapped to 2M/4M pages. The page walks can end with or without a page fault.",
+ "EventCode": "0x49",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Page walks completed due to a demand data store to a 2M/4M page."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts demand data stores that caused a completed page walk of any page size (4K/2M/4M/1G). This implies it missed in all TLB levels. The page walk can end with or without a fault.",
+ "EventCode": "0x49",
+ "Counter": "0,1,2,3",
+ "UMask": "0xe",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Store misses in all TLB levels causes a page walk that completes. (All page sizes)"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of page walks outstanding for a store in the PMH (Page Miss Handler) each cycle.",
+ "EventCode": "0x49",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_STORE_MISSES.WALK_PENDING",
+ "SampleAfterValue": "2000003",
+ "BriefDescription": "Number of page walks outstanding for a store in the PMH each cycle."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a store.",
+ "EventCode": "0x49",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_STORE_MISSES.WALK_ACTIVE",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Cycles when at least one PMH is busy with a page walk for a store.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts stores that miss the DTLB (Data TLB) and hit the STLB (2nd Level TLB).",
+ "EventCode": "0x49",
+ "Counter": "0,1,2,3",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "DTLB_STORE_MISSES.STLB_HIT",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Stores that miss the DTLB and hit the STLB."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts completed page walks (4K page size) caused by a code fetch. This implies it missed in the ITLB and further levels of TLB. The page walk can end with or without a fault.",
+ "EventCode": "0x85",
+ "Counter": "0,1,2,3",
+ "UMask": "0x2",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ITLB_MISSES.WALK_COMPLETED_4K",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Code miss in all TLB levels causes a page walk that completes. (4K)"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts code misses in all ITLB (Instruction TLB) levels that caused a completed page walk (2M and 4M page sizes). The page walk can end with or without a fault.",
+ "EventCode": "0x85",
+ "Counter": "0,1,2,3",
+ "UMask": "0x4",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Code miss in all TLB levels causes a page walk that completes. (2M/4M)"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts completed page walks (2M and 4M page sizes) caused by a code fetch. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB. The page walk can end with or without a fault.",
+ "EventCode": "0x85",
+ "Counter": "0,1,2,3",
+ "UMask": "0xe",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ITLB_MISSES.WALK_COMPLETED",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Code miss in all TLB levels causes a page walk that completes. (All page sizes)"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of page walks outstanding for an outstanding code (instruction fetch) request in the PMH (Page Miss Handler) each cycle.",
+ "EventCode": "0x85",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ITLB_MISSES.WALK_PENDING",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Number of page walks outstanding for an outstanding code request in the PMH each cycle."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a code (instruction fetch) request.",
+ "EventCode": "0x85",
+ "Counter": "0,1,2,3",
+ "UMask": "0x10",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ITLB_MISSES.WALK_ACTIVE",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Cycles when at least one PMH is busy with a page walk for code (instruction fetch) request.",
+ "CounterMask": "1"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts instruction fetch requests that miss the ITLB (Instruction TLB) and hit the STLB (Second-level TLB).",
+ "EventCode": "0x85",
+ "Counter": "0,1,2,3",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ITLB_MISSES.STLB_HIT",
+ "SampleAfterValue": "100003",
+ "BriefDescription": "Instruction fetch requests that miss the ITLB and hit the STLB."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of flushes of the big or small ITLB pages. Counting include both TLB Flush (covering all sets) and TLB Set Clear (set-specific).",
+ "EventCode": "0xAE",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "ITLB.ITLB_FLUSH",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "Flushing of the Instruction TLB (ITLB) pages, includes 4k/2M/4M pages."
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of DTLB flush attempts of the thread-specific entries.",
+ "EventCode": "0xBD",
+ "Counter": "0,1,2,3",
+ "UMask": "0x1",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TLB_FLUSH.DTLB_THREAD",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "DTLB flush attempts of the thread-specific entries"
+ },
+ {
+ "CollectPEBSRecord": "2",
+ "PublicDescription": "Counts the number of any STLB flush attempts (such as entire, VPID, PCID, InvPage, CR3 write, etc.).",
+ "EventCode": "0xBD",
+ "Counter": "0,1,2,3",
+ "UMask": "0x20",
+ "PEBScounters": "0,1,2,3",
+ "EventName": "TLB_FLUSH.STLB_ANY",
+ "SampleAfterValue": "100007",
+ "BriefDescription": "STLB flush attempts"
+ }
+]
\ No newline at end of file
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index e05c2c8458fc..7e5947aa42e1 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -33,3 +33,4 @@ GenuineIntel-6-25,v2,westmereep-sp,core
GenuineIntel-6-2F,v2,westmereex,core
GenuineIntel-6-55-[01234],v1,skylakex,core
GenuineIntel-6-55-[56789ABCDEF],v1,cascadelakex,core
+GenuineIntel-6-7E,v1,icelake,core
--
2.17.1
From: Andi Kleen <[email protected]>
Newer kernel code can collect XMM registers in some cases.
Add support for perf script to dump them, and support
for the register parser in perf record -I ... to configure them.
For now they are just printed in hex, could potentially add
other formats too.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/arch/x86/include/uapi/asm/perf_regs.h | 19 +++++++++++++++
tools/perf/arch/x86/include/perf_regs.h | 27 ++++++++++++++++++---
tools/perf/arch/x86/util/perf_regs.c | 16 ++++++++++++
tools/perf/util/perf_regs.h | 1 +
4 files changed, 60 insertions(+), 3 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index f3329cabce5c..29a90e4464b2 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -28,6 +28,25 @@ enum perf_event_x86_regs {
PERF_REG_X86_R14,
PERF_REG_X86_R15,
+ /* These all need two bits set because they are 128bit */
+ PERF_REG_X86_XMM0 = 32,
+ PERF_REG_X86_XMM1 = 34,
+ PERF_REG_X86_XMM2 = 36,
+ PERF_REG_X86_XMM3 = 38,
+ PERF_REG_X86_XMM4 = 40,
+ PERF_REG_X86_XMM5 = 42,
+ PERF_REG_X86_XMM6 = 44,
+ PERF_REG_X86_XMM7 = 46,
+ PERF_REG_X86_XMM8 = 48,
+ PERF_REG_X86_XMM9 = 50,
+ PERF_REG_X86_XMM10 = 52,
+ PERF_REG_X86_XMM11 = 54,
+ PERF_REG_X86_XMM12 = 56,
+ PERF_REG_X86_XMM13 = 58,
+ PERF_REG_X86_XMM14 = 60,
+ PERF_REG_X86_XMM15 = 62,
+
+ /* This does not include the XMMX registers */
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
};
diff --git a/tools/perf/arch/x86/include/perf_regs.h b/tools/perf/arch/x86/include/perf_regs.h
index 7f6d538f8a89..c6923eb3a0ce 100644
--- a/tools/perf/arch/x86/include/perf_regs.h
+++ b/tools/perf/arch/x86/include/perf_regs.h
@@ -8,9 +8,9 @@
void perf_regs_load(u64 *regs);
+#define PERF_REGS_MAX 64
#ifndef HAVE_ARCH_X86_64_SUPPORT
-#define PERF_REGS_MASK ((1ULL << PERF_REG_X86_32_MAX) - 1)
-#define PERF_REGS_MAX PERF_REG_X86_32_MAX
+#define PERF_REGS_MASK (((1ULL << PERF_REG_X86_32_MAX) - 1))
#define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_32
#else
#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \
@@ -18,7 +18,6 @@ void perf_regs_load(u64 *regs);
(1ULL << PERF_REG_X86_FS) | \
(1ULL << PERF_REG_X86_GS))
#define PERF_REGS_MASK (((1ULL << PERF_REG_X86_64_MAX) - 1) & ~REG_NOSUPPORT)
-#define PERF_REGS_MAX PERF_REG_X86_64_MAX
#define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_64
#endif
#define PERF_REG_IP PERF_REG_X86_IP
@@ -77,6 +76,28 @@ static inline const char *perf_reg_name(int id)
case PERF_REG_X86_R15:
return "R15";
#endif /* HAVE_ARCH_X86_64_SUPPORT */
+
+#define XMM(x) \
+ case PERF_REG_X86_XMM ## x: \
+ case PERF_REG_X86_XMM ## x + 1: \
+ return "XMM" #x;
+ XMM(0)
+ XMM(1)
+ XMM(2)
+ XMM(3)
+ XMM(4)
+ XMM(5)
+ XMM(6)
+ XMM(7)
+ XMM(8)
+ XMM(9)
+ XMM(10)
+ XMM(11)
+ XMM(12)
+ XMM(13)
+ XMM(14)
+ XMM(15)
+#undef XMM
default:
return NULL;
}
diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index fead6b3b4206..71d7604dbf0b 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -31,6 +31,22 @@ const struct sample_reg sample_reg_masks[] = {
SMPL_REG(R14, PERF_REG_X86_R14),
SMPL_REG(R15, PERF_REG_X86_R15),
#endif
+ SMPL_REG2(XMM0, PERF_REG_X86_XMM0),
+ SMPL_REG2(XMM1, PERF_REG_X86_XMM1),
+ SMPL_REG2(XMM2, PERF_REG_X86_XMM2),
+ SMPL_REG2(XMM3, PERF_REG_X86_XMM3),
+ SMPL_REG2(XMM4, PERF_REG_X86_XMM4),
+ SMPL_REG2(XMM5, PERF_REG_X86_XMM5),
+ SMPL_REG2(XMM6, PERF_REG_X86_XMM6),
+ SMPL_REG2(XMM7, PERF_REG_X86_XMM7),
+ SMPL_REG2(XMM8, PERF_REG_X86_XMM8),
+ SMPL_REG2(XMM9, PERF_REG_X86_XMM9),
+ SMPL_REG2(XMM10, PERF_REG_X86_XMM10),
+ SMPL_REG2(XMM11, PERF_REG_X86_XMM11),
+ SMPL_REG2(XMM12, PERF_REG_X86_XMM12),
+ SMPL_REG2(XMM13, PERF_REG_X86_XMM13),
+ SMPL_REG2(XMM14, PERF_REG_X86_XMM14),
+ SMPL_REG2(XMM15, PERF_REG_X86_XMM15),
SMPL_REG_END
};
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index c9319f8d17a6..1a15a4bfc28d 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -12,6 +12,7 @@ struct sample_reg {
uint64_t mask;
};
#define SMPL_REG(n, b) { .name = #n, .mask = 1ULL << (b) }
+#define SMPL_REG2(n, b) { .name = #n, .mask = 3ULL << (b) }
#define SMPL_REG_END { .name = NULL }
extern const struct sample_reg sample_reg_masks[];
--
2.17.1
From: Kan Liang <[email protected]>
To get correct PERF_METRICS value, the fixed counter 3 must start from
0. It would bring problems when sampling read slots and topdown events.
For example,
perf record -e '{slots, topdown-retiring}:S'
The slots would not overflow if it starts from 0.
Add specific validate_group() support to reject the case and error out
for Icelake.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 2 ++
arch/x86/events/intel/core.c | 20 ++++++++++++++++++++
arch/x86/events/perf_event.h | 2 ++
3 files changed, 24 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index bc5cc3cfc86d..796e46a59148 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2115,6 +2115,8 @@ static int validate_group(struct perf_event *event)
ret = x86_pmu.schedule_events(fake_cpuc, n, NULL);
+ if (x86_pmu.validate_group)
+ ret = x86_pmu.validate_group(fake_cpuc, n);
out:
free_fake_cpuc(fake_cpuc);
return ret;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 1d9570646994..3f86af8ce832 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4259,6 +4259,25 @@ static int icl_set_period(struct perf_event *event)
return 1;
}
+static int icl_validate_group(struct cpu_hw_events *cpuc, int n)
+{
+ bool has_sampling_slots = false, has_metrics = false;
+ struct perf_event *e;
+ int i;
+
+ for (i = 0; i < n; i++) {
+ e = cpuc->event_list[i];
+ if (is_slots_event(e) && is_sampling_event(e))
+ has_sampling_slots = true;
+
+ if (is_perf_metrics_event(e))
+ has_metrics = true;
+ }
+ if (unlikely(has_sampling_slots && has_metrics))
+ return -EINVAL;
+ return 0;
+}
+
EVENT_ATTR_STR(mem-loads, mem_ld_hsw, "event=0xcd,umask=0x1,ldlat=3");
EVENT_ATTR_STR(mem-stores, mem_st_hsw, "event=0xd0,umask=0x82")
@@ -4950,6 +4969,7 @@ __init int intel_pmu_init(void)
x86_pmu.has_metric = x86_pmu.intel_cap.perf_metrics;
x86_pmu.metric_update_event = icl_metric_update_event;
x86_pmu.set_period = icl_set_period;
+ x86_pmu.validate_group = icl_validate_group;
pr_cont("Icelake events, ");
name = "icelake";
break;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 6165f4b2a0e6..ef8c4d846e87 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -630,6 +630,8 @@ struct x86_pmu {
u64 (*limit_period)(struct perf_event *event, u64 l);
int (*set_period)(struct perf_event *event);
+ int (*validate_group)(struct cpu_hw_events *cpuc, int n);
+
/* PMI handler bits */
unsigned int late_ack :1,
counter_freezing :1;
--
2.17.1
From: Andi Kleen <[email protected]>
For TopDown metrics it is useful to have a remove transaction when
the counter is unscheduled, so that the value can be saved correctly.
Add a remove transaction to the perf core.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 3 +--
include/linux/perf_event.h | 1 +
kernel/events/core.c | 5 +++++
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b684f0294f35..2b2328a528df 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1902,8 +1902,7 @@ static inline void x86_pmu_read(struct perf_event *event)
* Set the flag to make pmu::enable() not perform the
* schedulability test, it will be performed at commit time
*
- * We only support PERF_PMU_TXN_ADD transactions. Save the
- * transaction flags but otherwise ignore non-PERF_PMU_TXN_ADD
+ * Save the transaction flags and ignore non-PERF_PMU_TXN_ADD
* transactions.
*/
static void x86_pmu_start_txn(struct pmu *pmu, unsigned int txn_flags)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bd3d6a89ccd4..088358be55ff 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -233,6 +233,7 @@ struct perf_event;
*/
#define PERF_PMU_TXN_ADD 0x1 /* txn to add/schedule event on PMU */
#define PERF_PMU_TXN_READ 0x2 /* txn to read event group from PMU */
+#define PERF_PMU_TXN_REMOVE 0x4 /* txn to remove event on PMU */
/**
* pmu::capabilities flags
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 560ac237b8be..bc07cd5a302d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2032,6 +2032,7 @@ group_sched_out(struct perf_event *group_event,
struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx)
{
+ struct pmu *pmu = ctx->pmu;
struct perf_event *event;
if (group_event->state != PERF_EVENT_STATE_ACTIVE)
@@ -2039,6 +2040,8 @@ group_sched_out(struct perf_event *group_event,
perf_pmu_disable(ctx->pmu);
+ pmu->start_txn(pmu, PERF_PMU_TXN_REMOVE);
+
event_sched_out(group_event, cpuctx, ctx);
/*
@@ -2051,6 +2054,8 @@ group_sched_out(struct perf_event *group_event,
if (group_event->attr.exclusive)
cpuctx->exclusive = 0;
+
+ pmu->commit_txn(pmu);
}
#define DETACH_GROUP 0x01UL
--
2.17.1
From: Kan Liang <[email protected]>
Add Intel Icelake uncore support,
- The init code is based on Skylake
- Add new pci id for IMC
- New MSR address for CBOX
- Get CBOX# from CNL_UNC_CBO_CONFIG MSR directly
- Create a new PMU for fixed clocktick counter
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/uncore.c | 6 ++
arch/x86/events/intel/uncore.h | 1 +
arch/x86/events/intel/uncore_snb.c | 91 ++++++++++++++++++++++++++++++
3 files changed, 98 insertions(+)
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index d516161c00c4..dcf87fbbce9d 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -1366,6 +1366,11 @@ static const struct intel_uncore_init_fun skx_uncore_init __initconst = {
.pci_init = skx_uncore_pci_init,
};
+static const struct intel_uncore_init_fun icl_uncore_init __initconst = {
+ .cpu_init = icl_uncore_cpu_init,
+ .pci_init = skl_uncore_pci_init,
+};
+
static const struct x86_cpu_id intel_uncore_match[] __initconst = {
X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM_EP, nhm_uncore_init),
X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM, nhm_uncore_init),
@@ -1392,6 +1397,7 @@ static const struct x86_cpu_id intel_uncore_match[] __initconst = {
X86_UNCORE_MODEL_MATCH(INTEL_FAM6_SKYLAKE_X, skx_uncore_init),
X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_MOBILE, skl_uncore_init),
X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_DESKTOP, skl_uncore_init),
+ X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_MOBILE, icl_uncore_init),
{},
};
diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
index cb46d602a6b8..1b5099a48ba9 100644
--- a/arch/x86/events/intel/uncore.h
+++ b/arch/x86/events/intel/uncore.h
@@ -512,6 +512,7 @@ int skl_uncore_pci_init(void);
void snb_uncore_cpu_init(void);
void nhm_uncore_cpu_init(void);
void skl_uncore_cpu_init(void);
+void icl_uncore_cpu_init(void);
int snb_pci2phy_map_init(int devid);
/* uncore_snbep.c */
diff --git a/arch/x86/events/intel/uncore_snb.c b/arch/x86/events/intel/uncore_snb.c
index b12517fae77a..97fbbc91aa29 100644
--- a/arch/x86/events/intel/uncore_snb.c
+++ b/arch/x86/events/intel/uncore_snb.c
@@ -34,6 +34,8 @@
#define PCI_DEVICE_ID_INTEL_CFL_4S_S_IMC 0x3e33
#define PCI_DEVICE_ID_INTEL_CFL_6S_S_IMC 0x3eca
#define PCI_DEVICE_ID_INTEL_CFL_8S_S_IMC 0x3e32
+#define PCI_DEVICE_ID_INTEL_ICL_U_IMC 0x8a02
+#define PCI_DEVICE_ID_INTEL_ICL_U2_IMC 0x8a12
/* SNB event control */
#define SNB_UNC_CTL_EV_SEL_MASK 0x000000ff
@@ -93,6 +95,12 @@
#define SKL_UNC_PERF_GLOBAL_CTL 0xe01
#define SKL_UNC_GLOBAL_CTL_CORE_ALL ((1 << 5) - 1)
+/* ICL Cbo register */
+#define ICL_UNC_CBO_CONFIG 0x396
+#define ICL_UNC_NUM_CBO_MASK 0xf
+#define ICL_UNC_CBO_0_PER_CTR0 0x702
+#define ICL_UNC_CBO_MSR_OFFSET 0x8
+
DEFINE_UNCORE_FORMAT_ATTR(event, event, "config:0-7");
DEFINE_UNCORE_FORMAT_ATTR(umask, umask, "config:8-15");
DEFINE_UNCORE_FORMAT_ATTR(edge, edge, "config:18");
@@ -280,6 +288,70 @@ void skl_uncore_cpu_init(void)
snb_uncore_arb.ops = &skl_uncore_msr_ops;
}
+static struct intel_uncore_type icl_uncore_cbox = {
+ .name = "cbox",
+ .num_counters = 4,
+ .perf_ctr_bits = 44,
+ .perf_ctr = ICL_UNC_CBO_0_PER_CTR0,
+ .event_ctl = SNB_UNC_CBO_0_PERFEVTSEL0,
+ .event_mask = SNB_UNC_RAW_EVENT_MASK,
+ .msr_offset = ICL_UNC_CBO_MSR_OFFSET,
+ .ops = &skl_uncore_msr_ops,
+ .format_group = &snb_uncore_format_group,
+};
+
+static struct uncore_event_desc icl_uncore_events[] = {
+ INTEL_UNCORE_EVENT_DESC(clockticks, "event=0xff"),
+ { /* end: all zeroes */ },
+};
+
+static struct attribute *icl_uncore_clock_formats_attr[] = {
+ &format_attr_event.attr,
+ NULL,
+};
+
+static struct attribute_group icl_uncore_clock_format_group = {
+ .name = "format",
+ .attrs = icl_uncore_clock_formats_attr,
+};
+
+static struct intel_uncore_type icl_uncore_clockbox = {
+ .name = "clock",
+ .num_counters = 1,
+ .num_boxes = 1,
+ .fixed_ctr_bits = 48,
+ .fixed_ctr = SNB_UNC_FIXED_CTR,
+ .fixed_ctl = SNB_UNC_FIXED_CTR_CTRL,
+ .single_fixed = 1,
+ .event_mask = SNB_UNC_CTL_EV_SEL_MASK,
+ .format_group = &icl_uncore_clock_format_group,
+ .ops = &skl_uncore_msr_ops,
+ .event_descs = icl_uncore_events,
+};
+
+static struct intel_uncore_type *icl_msr_uncores[] = {
+ &icl_uncore_cbox,
+ &snb_uncore_arb,
+ &icl_uncore_clockbox,
+ NULL,
+};
+
+static int icl_get_cbox_num(void)
+{
+ u64 num_boxes;
+
+ rdmsrl(ICL_UNC_CBO_CONFIG, num_boxes);
+
+ return num_boxes & ICL_UNC_NUM_CBO_MASK;
+}
+
+void icl_uncore_cpu_init(void)
+{
+ uncore_msr_uncores = icl_msr_uncores;
+ icl_uncore_cbox.num_boxes = icl_get_cbox_num();
+ snb_uncore_arb.ops = &skl_uncore_msr_ops;
+}
+
enum {
SNB_PCI_UNCORE_IMC,
};
@@ -666,6 +738,18 @@ static const struct pci_device_id skl_uncore_pci_ids[] = {
{ /* end: all zeroes */ },
};
+static const struct pci_device_id icl_uncore_pci_ids[] = {
+ { /* IMC */
+ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICL_U_IMC),
+ .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0),
+ },
+ { /* IMC */
+ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICL_U2_IMC),
+ .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0),
+ },
+ { /* end: all zeroes */ },
+};
+
static struct pci_driver snb_uncore_pci_driver = {
.name = "snb_uncore",
.id_table = snb_uncore_pci_ids,
@@ -691,6 +775,11 @@ static struct pci_driver skl_uncore_pci_driver = {
.id_table = skl_uncore_pci_ids,
};
+static struct pci_driver icl_uncore_pci_driver = {
+ .name = "icl_uncore",
+ .id_table = icl_uncore_pci_ids,
+};
+
struct imc_uncore_pci_dev {
__u32 pci_id;
struct pci_driver *driver;
@@ -730,6 +819,8 @@ static const struct imc_uncore_pci_dev desktop_imc_pci_ids[] = {
IMC_DEV(CFL_4S_S_IMC, &skl_uncore_pci_driver), /* 8th Gen Core S 4 Cores Server */
IMC_DEV(CFL_6S_S_IMC, &skl_uncore_pci_driver), /* 8th Gen Core S 6 Cores Server */
IMC_DEV(CFL_8S_S_IMC, &skl_uncore_pci_driver), /* 8th Gen Core S 8 Cores Server */
+ IMC_DEV(ICL_U_IMC, &icl_uncore_pci_driver), /* 10th Gen Core Mobile */
+ IMC_DEV(ICL_U2_IMC, &icl_uncore_pci_driver), /* 10th Gen Core Mobile */
{ /* end marker */ }
};
--
2.17.1
From: Andi Kleen <[email protected]>
Metrics counters (hardware counters containing multiple metrics)
are modelled as separate registers for each sub-event, with an
extra reg being used for coordinating access to the underlying
register in the scheduler.
This patch adds the basic infrastructure to separate the scheduler
register indexes from the actual hardware register indexes. In
most cases the MSR address is already used correctly, but for
code using indexes we need a separate reg_idx field in the event
to indicate the correct underlying register.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 18 ++++++++++++++++--
arch/x86/events/intel/core.c | 29 ++++++++++++++++++++---------
arch/x86/events/perf_event.h | 15 +++++++++++++++
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/perf_event.h | 30 ++++++++++++++++++++++++++++++
include/linux/perf_event.h | 1 +
6 files changed, 83 insertions(+), 11 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2b2328a528df..9320cb7906cd 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -996,16 +996,30 @@ static inline void x86_assign_hw_event(struct perf_event *event,
struct hw_perf_event *hwc = &event->hw;
hwc->idx = cpuc->assign[i];
+ hwc->reg_idx = hwc->idx;
hwc->last_cpu = smp_processor_id();
hwc->last_tag = ++cpuc->tags[i];
+ /*
+ * Metrics counters use different indexes in the scheduler
+ * versus the hardware.
+ *
+ * Map metrics to fixed counter 3 (which is the base count),
+ * but the update event callback reads the extra metric register
+ * and converts to the right metric.
+ */
+ if (is_metric_idx(hwc->idx))
+ hwc->reg_idx = INTEL_PMC_IDX_FIXED_SLOTS;
+
if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
hwc->config_base = 0;
hwc->event_base = 0;
} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
- hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
- hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
+ hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
+ (hwc->reg_idx - INTEL_PMC_IDX_FIXED);
+ hwc->event_base_rdpmc = (hwc->reg_idx - INTEL_PMC_IDX_FIXED)
+ | 1<<30;
} else {
hwc->config_base = x86_pmu_config_addr(hwc->idx);
hwc->event_base = x86_pmu_event_addr(hwc->idx);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 87dafac87520..f1cb0155c79a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2057,7 +2057,7 @@ static inline void intel_pmu_ack_status(u64 ack)
static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
{
- int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
+ int idx = hwc->reg_idx - INTEL_PMC_IDX_FIXED;
u64 ctrl_val, mask;
mask = 0xfULL << (idx * 4);
@@ -2083,9 +2083,19 @@ static void intel_pmu_disable_event(struct perf_event *event)
return;
}
- cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
- cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
- cpuc->intel_cp_status &= ~(1ull << hwc->idx);
+ __clear_bit(hwc->idx, cpuc->enabled_events);
+
+ /*
+ * When any other slots sharing event is still enabled,
+ * cancel the disabling.
+ */
+ if (is_any_slots_idx(hwc->idx) &&
+ (*(u64 *)&cpuc->enabled_events & INTEL_PMC_MSK_ANY_SLOTS))
+ return;
+
+ cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->reg_idx);
+ cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->reg_idx);
+ cpuc->intel_cp_status &= ~(1ull << hwc->reg_idx);
if (unlikely(event->attr.precise_ip))
intel_pmu_pebs_disable(event);
@@ -2117,7 +2127,7 @@ static void intel_pmu_read_event(struct perf_event *event)
static void intel_pmu_enable_fixed(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
+ int idx = hwc->reg_idx - INTEL_PMC_IDX_FIXED;
u64 ctrl_val, mask, bits = 0;
/*
@@ -2161,18 +2171,19 @@ static void intel_pmu_enable_event(struct perf_event *event)
}
if (event->attr.exclude_host)
- cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
+ cpuc->intel_ctrl_guest_mask |= (1ull << hwc->reg_idx);
if (event->attr.exclude_guest)
- cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
+ cpuc->intel_ctrl_host_mask |= (1ull << hwc->reg_idx);
if (unlikely(event_is_checkpointed(event)))
- cpuc->intel_cp_status |= (1ull << hwc->idx);
+ cpuc->intel_cp_status |= (1ull << hwc->reg_idx);
if (unlikely(event->attr.precise_ip))
intel_pmu_pebs_enable(event);
if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
- intel_pmu_enable_fixed(event);
+ if (!__test_and_set_bit(hwc->idx, cpuc->enabled_events))
+ intel_pmu_enable_fixed(event);
return;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index efa893ce95e2..e4215ea5c65e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -187,6 +187,7 @@ struct cpu_hw_events {
unsigned long active_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
unsigned long running[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
int enabled;
+ unsigned long enabled_events[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
int n_events; /* the # of events in the below arrays */
int n_added; /* the # last events in the below arrays;
@@ -347,6 +348,20 @@ struct cpu_hw_events {
#define FIXED_EVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, (1ULL << (32+n)), FIXED_EVENT_FLAGS)
+/*
+ * Special metric counters do not actually exist, but get remapped
+ * to a combination of FxCtr3 + MSR_PERF_METRICS
+ *
+ * This allocates them to a dummy offset for the scheduler.
+ * This does not allow sharing of multiple users of the same
+ * metric without multiplexing, even though the hardware supports that
+ * in principle.
+ */
+
+#define METRIC_EVENT_CONSTRAINT(c, n) \
+ EVENT_CONSTRAINT(c, (1ULL << (INTEL_PMC_IDX_FIXED_METRIC_BASE+n)), \
+ FIXED_EVENT_FLAGS)
+
/*
* Constraint on the Event code + UMask
*/
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b14b8bea8681..5f369e954ba8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -771,6 +771,7 @@
#define MSR_CORE_PERF_FIXED_CTR0 0x00000309
#define MSR_CORE_PERF_FIXED_CTR1 0x0000030a
#define MSR_CORE_PERF_FIXED_CTR2 0x0000030b
+#define MSR_CORE_PERF_FIXED_CTR3 0x0000030c
#define MSR_CORE_PERF_FIXED_CTR_CTRL 0x0000038d
#define MSR_CORE_PERF_GLOBAL_STATUS 0x0000038e
#define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 64cb4dffe4cd..fc85efa4cb69 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -166,6 +166,10 @@ struct x86_pmu_capability {
#define INTEL_PMC_IDX_FIXED_REF_CYCLES (INTEL_PMC_IDX_FIXED + 2)
#define INTEL_PMC_MSK_FIXED_REF_CYCLES (1ULL << INTEL_PMC_IDX_FIXED_REF_CYCLES)
+#define MSR_ARCH_PERFMON_FIXED_CTR3 0x30c
+#define INTEL_PMC_IDX_FIXED_SLOTS (INTEL_PMC_IDX_FIXED + 3)
+#define INTEL_PMC_MSK_FIXED_SLOTS (1ULL << INTEL_PMC_IDX_FIXED_SLOTS)
+
/*
* We model BTS tracing as another fixed-mode PMC.
*
@@ -175,6 +179,32 @@ struct x86_pmu_capability {
*/
#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16)
+/*
+ * We model PERF_METRICS as more magic fixed-mode PMCs, one for each metric
+ * and another for the whole slots counter
+ *
+ * Internally they all map to Fixed Ctr 3 (SLOTS), and allocate PERF_METRICS
+ * as an extra_reg. PERF_METRICS has no own configuration, but we fill in
+ * the configuration of FxCtr3 to enforce that all the shared users of SLOTS
+ * have the same configuration.
+ */
+#define INTEL_PMC_IDX_FIXED_METRIC_BASE (INTEL_PMC_IDX_FIXED + 17)
+#define INTEL_PMC_IDX_TD_RETIRING (INTEL_PMC_IDX_FIXED_METRIC_BASE + 0)
+#define INTEL_PMC_IDX_TD_BAD_SPEC (INTEL_PMC_IDX_FIXED_METRIC_BASE + 1)
+#define INTEL_PMC_IDX_TD_FE_BOUND (INTEL_PMC_IDX_FIXED_METRIC_BASE + 2)
+#define INTEL_PMC_IDX_TD_BE_BOUND (INTEL_PMC_IDX_FIXED_METRIC_BASE + 3)
+#define INTEL_PMC_MSK_ANY_SLOTS ((0xfull << INTEL_PMC_IDX_FIXED_METRIC_BASE) | \
+ INTEL_PMC_MSK_FIXED_SLOTS)
+static inline bool is_metric_idx(int idx)
+{
+ return idx >= INTEL_PMC_IDX_FIXED_METRIC_BASE && idx <= INTEL_PMC_IDX_TD_BE_BOUND;
+}
+
+static inline bool is_any_slots_idx(int idx)
+{
+ return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS;
+}
+
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63)
#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(62)
#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 088358be55ff..4c62da79ff41 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -127,6 +127,7 @@ struct hw_perf_event {
unsigned long event_base;
int event_base_rdpmc;
int idx;
+ int reg_idx;
int last_cpu;
int flags;
--
2.17.1
From: Andi Kleen <[email protected]>
With adaptive PEBS the CPU can directly supply the LBR information,
so we don't need to read it again. But the LBRs still need to be
enabled. Add a special count to the cpuc that distinguishes these
two cases, and avoid reading the LBRs unnecessarily when PEBS is
active.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/lbr.c | 13 ++++++++++++-
arch/x86/events/perf_event.h | 1 +
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 58889f952959..b8cac31f089c 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -488,6 +488,8 @@ void intel_pmu_lbr_add(struct perf_event *event)
* be 'new'. Conversely, a new event can get installed through the
* context switch path for the first time.
*/
+ if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
+ cpuc->lbr_pebs_users++;
perf_sched_cb_inc(event->ctx->pmu);
if (!cpuc->lbr_users++ && !event->total_time_running)
intel_pmu_lbr_reset();
@@ -507,8 +509,11 @@ void intel_pmu_lbr_del(struct perf_event *event)
task_ctx->lbr_callstack_users--;
}
+ if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
+ cpuc->lbr_pebs_users--;
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
+ WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
perf_sched_cb_dec(event->ctx->pmu);
}
@@ -658,7 +663,13 @@ void intel_pmu_lbr_read(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (!cpuc->lbr_users)
+ /*
+ * Don't read when all LBRs users are using adaptive PEBS.
+ *
+ * This could be smarter and actually check the event,
+ * but this simple approach seems to work for now.
+ */
+ if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
return;
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index db3b622265fa..27c7945b5174 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -216,6 +216,7 @@ struct cpu_hw_events {
* Intel LBR bits
*/
int lbr_users;
+ int lbr_pebs_users;
struct perf_branch_stack lbr_stack;
struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
struct er_account *lbr_sel;
--
2.17.1
From: Kan Liang <[email protected]>
Icelake support the same RAPL counters as Skylake.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/rapl.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c
index 94dc564146ca..37ebf6fc5415 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/intel/rapl.c
@@ -775,6 +775,8 @@ static const struct x86_cpu_id rapl_cpu_match[] __initconst = {
X86_RAPL_MODEL_MATCH(INTEL_FAM6_ATOM_GOLDMONT_X, hsw_rapl_init),
X86_RAPL_MODEL_MATCH(INTEL_FAM6_ATOM_GOLDMONT_PLUS, hsw_rapl_init),
+
+ X86_RAPL_MODEL_MATCH(INTEL_FAM6_ICELAKE_MOBILE, skl_rapl_init),
{},
};
--
2.17.1
From: Kan Liang <[email protected]>
Icelake uses the same C-state residency events as Sandy Bridge.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/cstate.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
index 94a4b7fc75d0..dd5658ec31d5 100644
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -578,6 +578,8 @@ static const struct x86_cpu_id intel_cstates_match[] __initconst = {
X86_CSTATES_MODEL(INTEL_FAM6_ATOM_GOLDMONT_X, glm_cstates),
X86_CSTATES_MODEL(INTEL_FAM6_ATOM_GOLDMONT_PLUS, glm_cstates),
+
+ X86_CSTATES_MODEL(INTEL_FAM6_ICELAKE_MOBILE, snb_cstates),
{ },
};
MODULE_DEVICE_TABLE(x86cpu, intel_cstates_match);
--
2.17.1
From: Kan Liang <[email protected]>
Add Icelake core PMU perf code, including constraint tables and the main
enable code.
Icelake expanded the generic counters to always 8 even with HT on, but a
range of events cannot be scheduled on the extra 4 counters.
Add new constraint ranges to describe this to the scheduler.
The number of constraints that need to be checked is larger now than
with earlier CPUs.
At some point we may need a new data structure to look them up more
efficiently than with linear search. So far it still seems to be
acceptable however.
Icelake added a new fixed counter SLOTS. Full support for it is added
later in the patch series.
The cache events table is identical to Skylake.
Compare to PEBS instruction event on generic counter, fixed counter 0
has less skid. Force instruction:ppp always in fixed counter 0.
Originally-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 111 ++++++++++++++++++++++++++++++
arch/x86/events/intel/ds.c | 26 ++++++-
arch/x86/events/perf_event.h | 2 +
arch/x86/include/asm/intel_ds.h | 2 +-
arch/x86/include/asm/perf_event.h | 2 +-
5 files changed, 139 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8486ab87f8f8..87dafac87520 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -239,6 +239,35 @@ static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
EVENT_EXTRA_END
};
+static struct event_constraint intel_icl_event_constraints[] = {
+ FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
+ INTEL_UEVENT_CONSTRAINT(0x1c0, 0), /* INST_RETIRED.PREC_DIST */
+ FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
+ FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
+ FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */
+ INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
+ INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
+ INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */
+ INTEL_EVENT_CONSTRAINT_RANGE(0x48, 0x54, 0xf),
+ INTEL_EVENT_CONSTRAINT_RANGE(0x60, 0x8b, 0xf),
+ INTEL_UEVENT_CONSTRAINT(0x04a3, 0xff), /* CYCLE_ACTIVITY.STALLS_TOTAL */
+ INTEL_UEVENT_CONSTRAINT(0x10a3, 0xff), /* CYCLE_ACTIVITY.STALLS_MEM_ANY */
+ INTEL_EVENT_CONSTRAINT(0xa3, 0xf), /* CYCLE_ACTIVITY.* */
+ INTEL_EVENT_CONSTRAINT_RANGE(0xa8, 0xb0, 0xf),
+ INTEL_EVENT_CONSTRAINT_RANGE(0xb7, 0xbd, 0xf),
+ INTEL_EVENT_CONSTRAINT_RANGE(0xd0, 0xe6, 0xf),
+ INTEL_EVENT_CONSTRAINT_RANGE(0xf0, 0xf4, 0xf),
+ EVENT_CONSTRAINT_END
+};
+
+static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
+ INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff9fffull, RSP_0),
+ INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1),
+ INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
+ INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
+ EVENT_EXTRA_END
+};
+
EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3");
EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
EVENT_ATTR_STR(mem-stores, mem_st_snb, "event=0xcd,umask=0x2");
@@ -3324,6 +3353,9 @@ static struct event_constraint counter0_constraint =
static struct event_constraint counter2_constraint =
EVENT_CONSTRAINT(0, 0x4, 0);
+static struct event_constraint fixed_counter0_constraint =
+ FIXED_EVENT_CONSTRAINT(0x00c0, 0);
+
static struct event_constraint *
hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
struct perf_event *event)
@@ -3342,6 +3374,21 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
return c;
}
+static struct event_constraint *
+icl_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+ /*
+ * Fixed counter 0 has less skid.
+ * Force instruction:ppp in Fixed counter 0
+ */
+ if ((event->attr.precise_ip == 3) &&
+ ((event->hw.config & X86_RAW_EVENT_MASK) == 0x00c0))
+ return &fixed_counter0_constraint;
+
+ return hsw_get_event_constraints(cpuc, idx, event);
+}
+
static struct event_constraint *
glp_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
struct perf_event *event)
@@ -4038,6 +4085,42 @@ static struct attribute *hsw_tsx_events_attrs[] = {
NULL
};
+EVENT_ATTR_STR(tx-capacity-read, tx_capacity_read, "event=0x54,umask=0x80");
+EVENT_ATTR_STR(tx-capacity-write, tx_capacity_write, "event=0x54,umask=0x2");
+EVENT_ATTR_STR(el-capacity-read, el_capacity_read, "event=0x54,umask=0x80");
+EVENT_ATTR_STR(el-capacity-write, el_capacity_write, "event=0x54,umask=0x2");
+
+static struct attribute *icl_events_attrs[] = {
+ EVENT_PTR(mem_ld_hsw),
+ EVENT_PTR(mem_st_hsw),
+ NULL,
+};
+
+static struct attribute *icl_tsx_events_attrs[] = {
+ EVENT_PTR(tx_start),
+ EVENT_PTR(tx_abort),
+ EVENT_PTR(tx_commit),
+ EVENT_PTR(tx_capacity_read),
+ EVENT_PTR(tx_capacity_write),
+ EVENT_PTR(tx_conflict),
+ EVENT_PTR(el_start),
+ EVENT_PTR(el_abort),
+ EVENT_PTR(el_commit),
+ EVENT_PTR(el_capacity_read),
+ EVENT_PTR(el_capacity_write),
+ EVENT_PTR(el_conflict),
+ EVENT_PTR(cycles_t),
+ EVENT_PTR(cycles_ct),
+ NULL,
+};
+
+static __init struct attribute **get_icl_events_attrs(void)
+{
+ return boot_cpu_has(X86_FEATURE_RTM) ?
+ merge_attr(icl_events_attrs, icl_tsx_events_attrs) :
+ icl_events_attrs;
+}
+
static ssize_t freeze_on_smi_show(struct device *cdev,
struct device_attribute *attr,
char *buf)
@@ -4611,6 +4694,34 @@ __init int intel_pmu_init(void)
name = "skylake";
break;
+ case INTEL_FAM6_ICELAKE_MOBILE:
+ x86_pmu.late_ack = true;
+ memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
+ memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
+ hw_cache_event_ids[C(ITLB)][C(OP_READ)][C(RESULT_ACCESS)] = -1;
+ intel_pmu_lbr_init_skl();
+
+ x86_pmu.event_constraints = intel_icl_event_constraints;
+ x86_pmu.pebs_constraints = intel_icl_pebs_event_constraints;
+ x86_pmu.extra_regs = intel_icl_extra_regs;
+ x86_pmu.pebs_aliases = NULL;
+ x86_pmu.pebs_prec_dist = true;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
+
+ x86_pmu.hw_config = hsw_hw_config;
+ x86_pmu.get_event_constraints = icl_get_event_constraints;
+ extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
+ hsw_format_attr : nhm_format_attr;
+ extra_attr = merge_attr(extra_attr, skl_format_attr);
+ x86_pmu.cpu_events = get_icl_events_attrs();
+ x86_pmu.force_gpr_event = 0x2ca;
+ x86_pmu.lbr_pt_coexist = true;
+ intel_pmu_pebs_data_source_skl(false);
+ pr_cont("Icelake events, ");
+ name = "icelake";
+ break;
+
default:
switch (x86_pmu.version) {
case 1:
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 30370fb93e21..97bad6b0f470 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -849,6 +849,26 @@ struct event_constraint intel_skl_pebs_event_constraints[] = {
EVENT_CONSTRAINT_END
};
+struct event_constraint intel_icl_pebs_event_constraints[] = {
+ INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x100000000ULL), /* INST_RETIRED.PREC_DIST */
+ INTEL_FLAGS_UEVENT_CONSTRAINT(0x0400, 0x400000000ULL), /* SLOTS */
+
+ INTEL_PLD_CONSTRAINT(0x1cd, 0xff), /* MEM_TRANS_RETIRED.LOAD_LATENCY */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x1d0, 0xf), /* MEM_INST_RETIRED.LOAD */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x2d0, 0xf), /* MEM_INST_RETIRED.STORE */
+
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD_RANGE(0xd1, 0xd4, 0xf), /* MEM_LOAD_*_RETIRED.* */
+
+ INTEL_FLAGS_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_INST_RETIRED.* */
+
+ /*
+ * Everything else is handled by PMU_FL_PEBS_ALL, because we
+ * need the full constraints from the main table.
+ */
+
+ EVENT_CONSTRAINT_END
+};
+
struct event_constraint *intel_pebs_constraints(struct perf_event *event)
{
struct event_constraint *c;
@@ -1038,7 +1058,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << hwc->idx;
- if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
+ if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) &&
+ (x86_pmu.version < 5))
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
@@ -1097,7 +1118,8 @@ void intel_pmu_pebs_disable(struct perf_event *event)
/* Delay reprograming DATA_CFG to next enable */
- if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
+ if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) &&
+ (x86_pmu.version < 5))
cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled &= ~(1ULL << 63);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 863d27f4c352..efa893ce95e2 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -981,6 +981,8 @@ extern struct event_constraint intel_bdw_pebs_event_constraints[];
extern struct event_constraint intel_skl_pebs_event_constraints[];
+extern struct event_constraint intel_icl_pebs_event_constraints[];
+
struct event_constraint *intel_pebs_constraints(struct perf_event *event);
void intel_pmu_pebs_add(struct perf_event *event);
diff --git a/arch/x86/include/asm/intel_ds.h b/arch/x86/include/asm/intel_ds.h
index ae26df1c2789..8380c3ddd4b2 100644
--- a/arch/x86/include/asm/intel_ds.h
+++ b/arch/x86/include/asm/intel_ds.h
@@ -8,7 +8,7 @@
/* The maximal number of PEBS events: */
#define MAX_PEBS_EVENTS 8
-#define MAX_FIXED_PEBS_EVENTS 3
+#define MAX_FIXED_PEBS_EVENTS 4
/*
* A debug store configuration.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index da0a80ef6505..64cb4dffe4cd 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -7,7 +7,7 @@
*/
#define INTEL_PMC_MAX_GENERIC 32
-#define INTEL_PMC_MAX_FIXED 3
+#define INTEL_PMC_MAX_FIXED 4
#define INTEL_PMC_IDX_FIXED 32
#define X86_PMC_IDX_MAX 64
--
2.17.1
From: Andi Kleen <[email protected]>
Export new top down events for perf that map to the sub metrics
in the metrics register, and another for the new slots fixed counter.
This makes the new fixed counters in Icelake visible to the perf
user tools.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d69537b6c184..1d9570646994 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -320,6 +320,12 @@ EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
"4", "2");
+EVENT_ATTR_STR(slots, slots, "event=0x00,umask=0x4");
+EVENT_ATTR_STR(topdown-retiring, td_retiring, "event=0xff,umask=0x1");
+EVENT_ATTR_STR(topdown-bad-spec, td_bad_spec, "event=0xff,umask=0x2");
+EVENT_ATTR_STR(topdown-fe-bound, td_fe_bound, "event=0xff,umask=0x3");
+EVENT_ATTR_STR(topdown-be-bound, td_be_bound, "event=0xff,umask=0x4");
+
static struct attribute *snb_events_attrs[] = {
EVENT_PTR(td_slots_issued),
EVENT_PTR(td_slots_retired),
@@ -4311,6 +4317,11 @@ EVENT_ATTR_STR(el-capacity-write, el_capacity_write, "event=0x54,umask=0x2");
static struct attribute *icl_events_attrs[] = {
EVENT_PTR(mem_ld_hsw),
EVENT_PTR(mem_st_hsw),
+ EVENT_PTR(slots),
+ EVENT_PTR(td_retiring),
+ EVENT_PTR(td_bad_spec),
+ EVENT_PTR(td_fe_bound),
+ EVENT_PTR(td_be_bound),
NULL,
};
--
2.17.1
From: Andi Kleen <[email protected]>
Add support to the perf core for outputting registers from a separate
array and add support for outputting XMM registers for x86.
This requires changing all the perf_reg_value functions for the
different architectures to pass the additional argument. Except for x86,
they just ignore it.
XMM registers are 128 bit. To simplify the code, they are handled like
two different registers, which means setting two bits in the register
bitmap. This also allows only sampling the lower 64bit bits in XMM.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/arm/kernel/perf_regs.c | 2 +-
arch/arm64/kernel/perf_regs.c | 2 +-
arch/powerpc/perf/perf_regs.c | 2 +-
arch/s390/kernel/perf_regs.c | 2 +-
arch/x86/include/uapi/asm/perf_regs.h | 25 +++++++++++++++++++++++--
arch/x86/kernel/perf_regs.c | 17 ++++++++++++-----
include/linux/perf_event.h | 2 ++
include/linux/perf_regs.h | 4 ++--
kernel/events/core.c | 7 +++++--
9 files changed, 48 insertions(+), 15 deletions(-)
diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 05fe92aa7d98..1feedc151e87 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -8,7 +8,7 @@
#include <asm/perf_regs.h>
#include <asm/ptrace.h>
-u64 perf_reg_value(struct pt_regs *regs, int idx)
+u64 perf_reg_value(struct pt_regs *regs, u64 *extra_regs, int idx)
{
if (WARN_ON_ONCE((u32)idx >= PERF_REG_ARM_MAX))
return 0;
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 0bbac612146e..85d7db5e5428 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -9,7 +9,7 @@
#include <asm/perf_regs.h>
#include <asm/ptrace.h>
-u64 perf_reg_value(struct pt_regs *regs, int idx)
+u64 perf_reg_value(struct pt_regs *regs, u64 *extra_regs, int idx)
{
if (WARN_ON_ONCE((u32)idx >= PERF_REG_ARM64_MAX))
return 0;
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 3349f3f8fe84..29380a99473a 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -73,7 +73,7 @@ static unsigned int pt_regs_offset[PERF_REG_POWERPC_MAX] = {
PT_REGS_OFFSET(PERF_REG_POWERPC_MMCRA, dsisr),
};
-u64 perf_reg_value(struct pt_regs *regs, int idx)
+u64 perf_reg_value(struct pt_regs *regs, u64 *extra_regs, int idx)
{
if (WARN_ON_ONCE(idx >= PERF_REG_POWERPC_MAX))
return 0;
diff --git a/arch/s390/kernel/perf_regs.c b/arch/s390/kernel/perf_regs.c
index 4352a504f235..88974116dbc6 100644
--- a/arch/s390/kernel/perf_regs.c
+++ b/arch/s390/kernel/perf_regs.c
@@ -8,7 +8,7 @@
#include <asm/fpu/api.h>
#include <asm/fpu/types.h>
-u64 perf_reg_value(struct pt_regs *regs, int idx)
+u64 perf_reg_value(struct pt_regs *regs, u64 *extra_regs, int idx)
{
freg_t fp;
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index f3329cabce5c..1ff0df1c97ae 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -28,7 +28,28 @@ enum perf_event_x86_regs {
PERF_REG_X86_R14,
PERF_REG_X86_R15,
- PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
- PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+ /* These all need two bits set because they are 128bit */
+ PERF_REG_X86_XMM0 = 32,
+ PERF_REG_X86_XMM1 = 34,
+ PERF_REG_X86_XMM2 = 36,
+ PERF_REG_X86_XMM3 = 38,
+ PERF_REG_X86_XMM4 = 40,
+ PERF_REG_X86_XMM5 = 42,
+ PERF_REG_X86_XMM6 = 44,
+ PERF_REG_X86_XMM7 = 46,
+ PERF_REG_X86_XMM8 = 48,
+ PERF_REG_X86_XMM9 = 50,
+ PERF_REG_X86_XMM10 = 52,
+ PERF_REG_X86_XMM11 = 54,
+ PERF_REG_X86_XMM12 = 56,
+ PERF_REG_X86_XMM13 = 58,
+ PERF_REG_X86_XMM14 = 60,
+ PERF_REG_X86_XMM15 = 62,
+
+ /* This does not include the XMMX registers */
+ PERF_REG_GPR_X86_32_MAX = PERF_REG_X86_GS + 1,
+ PERF_REG_GPR_X86_64_MAX = PERF_REG_X86_R15 + 1,
+
+ PERF_REG_X86_MAX = PERF_REG_X86_XMM15 + 2,
};
#endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index c06c4c16c6b6..8b44a4c5a161 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -10,14 +10,14 @@
#include <asm/ptrace.h>
#ifdef CONFIG_X86_32
-#define PERF_REG_X86_MAX PERF_REG_X86_32_MAX
+#define PERF_REG_GPR_X86_MAX PERF_REG_GPR_X86_32_MAX
#else
-#define PERF_REG_X86_MAX PERF_REG_X86_64_MAX
+#define PERF_REG_GPR_X86_MAX PERF_REG_GPR_X86_64_MAX
#endif
#define PT_REGS_OFFSET(id, r) [id] = offsetof(struct pt_regs, r)
-static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
+static unsigned int pt_regs_offset[PERF_REG_GPR_X86_MAX] = {
PT_REGS_OFFSET(PERF_REG_X86_AX, ax),
PT_REGS_OFFSET(PERF_REG_X86_BX, bx),
PT_REGS_OFFSET(PERF_REG_X86_CX, cx),
@@ -57,15 +57,22 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
#endif
};
-u64 perf_reg_value(struct pt_regs *regs, int idx)
+u64 perf_reg_value(struct pt_regs *regs, u64 *extra_regs, int idx)
{
+ if (idx >= 32 && idx < 64) {
+ if (!extra_regs)
+ return 0;
+ return extra_regs[idx - 32];
+ }
+
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
return 0;
return regs_get_register(regs, pt_regs_offset[idx]);
}
-#define REG_RESERVED (~((1ULL << PERF_REG_X86_MAX) - 1ULL))
+#define REG_RESERVED \
+ (PERF_REG_X86_MAX == 64 ? 0 : ~((1ULL << PERF_REG_X86_MAX)) - 1ULL)
#ifdef CONFIG_X86_32
int perf_reg_validate(u64 mask)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e47ef764f613..bd3d6a89ccd4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -948,6 +948,7 @@ struct perf_sample_data {
u64 stack_user_size;
u64 phys_addr;
+ u64 *extra_regs;
} ____cacheline_aligned;
/* default value for data source */
@@ -968,6 +969,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
data->weight = 0;
data->data_src.val = PERF_MEM_NA;
data->txn = 0;
+ data->extra_regs = NULL;
}
extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 476747456bca..9884c64d5598 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -11,14 +11,14 @@ struct perf_regs {
#ifdef CONFIG_HAVE_PERF_REGS
#include <asm/perf_regs.h>
-u64 perf_reg_value(struct pt_regs *regs, int idx);
+u64 perf_reg_value(struct pt_regs *regs, u64 *extra_regs, int idx);
int perf_reg_validate(u64 mask);
u64 perf_reg_abi(struct task_struct *task);
void perf_get_regs_user(struct perf_regs *regs_user,
struct pt_regs *regs,
struct pt_regs *regs_user_copy);
#else
-static inline u64 perf_reg_value(struct pt_regs *regs, int idx)
+static inline u64 perf_reg_value(struct pt_regs *regs, u64 *extra_regs, int idx)
{
return 0;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5f59d848171e..560ac237b8be 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5858,7 +5858,8 @@ EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
static void
perf_output_sample_regs(struct perf_output_handle *handle,
- struct pt_regs *regs, u64 mask)
+ struct pt_regs *regs,
+ u64 *extra_regs, u64 mask)
{
int bit;
DECLARE_BITMAP(_mask, 64);
@@ -5867,7 +5868,7 @@ perf_output_sample_regs(struct perf_output_handle *handle,
for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) {
u64 val;
- val = perf_reg_value(regs, bit);
+ val = perf_reg_value(regs, extra_regs, bit);
perf_output_put(handle, val);
}
}
@@ -6274,6 +6275,7 @@ void perf_output_sample(struct perf_output_handle *handle,
u64 mask = event->attr.sample_regs_user;
perf_output_sample_regs(handle,
data->regs_user.regs,
+ NULL,
mask);
}
}
@@ -6306,6 +6308,7 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_sample_regs(handle,
data->regs_intr.regs,
+ data->extra_regs,
mask);
}
}
--
2.17.1
From: Kan Liang <[email protected]>
Intro
=====
Icelake has support for measuring the four top level TopDown metrics
directly in hardware. This is implemented by an additional "metrics"
register, and a new Fixed Counter 3 that measures pipeline "slots".
Events
======
We export four metric events as separate perf events, which map to
internal "metrics" counter register. Those events do not exist in
hardware, but can be allocated by the scheduler.
For the event mapping we use a special 0xff event code, which is
reserved for software.
When setting up such events they point to the slots counter, and a
special callback, metric_update_event(), reads the additional metrics
msr to generate the metrics. Then the metric is reported by multiplying
the metric (percentage) with slots.
This multiplication allows to easily keep a running count, for example
when the slots counter overflows, and makes all the standard tools, such
as a perf stat, work. They can do deltas of the values without needing
to know about percentages. This also simplifies accumulating the counts
of child events, which otherwise would need to know how to average
percent values.
Groups
======
To avoid reading the METRICS register multiple times, the metrics and
slots value are cached. This only works when multiple sub-events are in
the same group.
Resetting1
==========
The 8bit metrics ratio values lose precision when the measurement period
gets longer.
To avoid this we always reset the metric value when reading, as we
already accumulate the count in the perf count value.
For a long period read, low precision is acceptable.
For a short period read, the register will be reset often enough that it
is not a problem.
This implies that to read more than one submetrics always a group needs
to be used, so that the caching above still gives the correct value.
We also need to support this in the NMI, so that it's possible to
collect all top down metrics as part of leader sampling. To avoid races
with the normal transactions use a special nmi_metric cache that is only
used during the NMI.
Resetting2
==========
The PERF_METRICS may report wrong value if its delta was less than 1/255
of SLOTS (Fixed counter 3).
To avoid this, the PERF_METRICS and SLOTS registers have to be reset
simultaneously. The slots value has to be cached as well.
In counting, the -max_period is the initial value of the SLOTS. The huge
initial value will definitely trigger the issue mentioned above.
Force initial value as 0 for topdown and slots event counting.
RDPMC
=========
The TopDown events can be collected per thread/process. To use TopDown
through RDPMC in applications on Icelake, the metrics and slots values
have to be saved/restored during context switching.
Add specific set_period() to specially handle the slots and metrics
event. Because,
- The initial value must be 0.
- Only need to restore the value in context switch. For other cases,
the counters have been cleared after read.
Originally-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 40 +++++--
arch/x86/events/intel/core.c | 192 +++++++++++++++++++++++++++++++
arch/x86/events/perf_event.h | 14 +++
arch/x86/include/asm/msr-index.h | 2 +
include/linux/perf_event.h | 5 +
5 files changed, 242 insertions(+), 11 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 9320cb7906cd..bc5cc3cfc86d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -91,16 +91,20 @@ u64 x86_perf_event_update(struct perf_event *event)
new_raw_count) != prev_raw_count)
goto again;
- /*
- * Now we have the new raw value and have updated the prev
- * timestamp already. We can now calculate the elapsed delta
- * (event-)time and add that to the generic event.
- *
- * Careful, not all hw sign-extends above the physical width
- * of the count.
- */
- delta = (new_raw_count << shift) - (prev_raw_count << shift);
- delta >>= shift;
+ if (unlikely(hwc->flags & PERF_X86_EVENT_UPDATE))
+ delta = x86_pmu.metric_update_event(event, new_raw_count);
+ else {
+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (event-)time and add that to the generic event.
+ *
+ * Careful, not all hw sign-extends above the physical width
+ * of the count.
+ */
+ delta = (new_raw_count << shift) - (prev_raw_count << shift);
+ delta >>= shift;
+ }
local64_add(delta, &event->count);
local64_sub(delta, &hwc->period_left);
@@ -964,6 +968,10 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
+ /* There are 4 TopDown metrics */
+ if (x86_pmu.has_metric)
+ max_count += 4;
+
/* current number of events already accepted */
n = cpuc->n_events;
@@ -1147,6 +1155,9 @@ int x86_perf_event_set_period(struct perf_event *event)
if (idx == INTEL_PMC_IDX_FIXED_BTS)
return 0;
+ if (x86_pmu.set_period && x86_pmu.set_period(event))
+ goto update_userpage;
+
/*
* If we are way outside a reasonable range then just skip forward:
*/
@@ -1195,6 +1206,7 @@ int x86_perf_event_set_period(struct perf_event *event)
(u64)(-left) & x86_pmu.cntval_mask);
}
+update_userpage:
perf_event_update_userpage(event);
return ret;
@@ -1947,6 +1959,8 @@ static void x86_pmu_cancel_txn(struct pmu *pmu)
txn_flags = cpuc->txn_flags;
cpuc->txn_flags = 0;
+ cpuc->txn_metric = 0;
+ cpuc->txn_slots = 0;
if (txn_flags & ~PERF_PMU_TXN_ADD)
return;
@@ -1974,6 +1988,8 @@ static int x86_pmu_commit_txn(struct pmu *pmu)
WARN_ON_ONCE(!cpuc->txn_flags); /* no txn in flight */
+ cpuc->txn_metric = 0;
+ cpuc->txn_slots = 0;
if (cpuc->txn_flags & ~PERF_PMU_TXN_ADD) {
cpuc->txn_flags = 0;
return 0;
@@ -2191,7 +2207,9 @@ static int x86_pmu_event_idx(struct perf_event *event)
if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return 0;
- if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
+ if (is_metric_idx(idx))
+ idx = 1 << 29;
+ else if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
idx -= INTEL_PMC_IDX_FIXED;
idx |= 1 << 30;
}
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 545db8da24de..af2028ee2d1e 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -245,6 +245,11 @@ static struct event_constraint intel_icl_event_constraints[] = {
FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */
+ METRIC_EVENT_CONSTRAINT(0x01ff, 0), /* Retiring metric */
+ METRIC_EVENT_CONSTRAINT(0x02ff, 1), /* Bad speculation metric */
+ METRIC_EVENT_CONSTRAINT(0x03ff, 2), /* FE bound metric */
+ METRIC_EVENT_CONSTRAINT(0x04ff, 3), /* BE bound metric */
+
INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */
@@ -265,6 +270,14 @@ static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1),
INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
+ /*
+ * PERF_METRICS does exist, but it is not configured. But we
+ * share the original Fixed Ctr 3 from different metrics
+ * events. So use the extra reg to enforce the same
+ * configuration on the original register, but do not actually
+ * write to it.
+ */
+ INTEL_EVENT_EXTRA_REG(0xff, 0, -1L, PERF_METRICS),
EVENT_EXTRA_END
};
@@ -2399,6 +2412,8 @@ static int intel_pmu_handle_irq_v4(struct pt_regs *regs)
int pmu_enabled = cpuc->enabled;
int loops = 0;
+ cpuc->nmi_metric = 0;
+ cpuc->nmi_slots = 0;
/* PMU has been disabled because of counter freezing */
cpuc->enabled = 0;
if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
@@ -2472,6 +2487,8 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
int pmu_enabled;
cpuc = this_cpu_ptr(&cpu_hw_events);
+ cpuc->nmi_metric = 0;
+ cpuc->nmi_slots = 0;
/*
* Save the PMU state.
@@ -3194,6 +3211,13 @@ static int core_pmu_hw_config(struct perf_event *event)
return intel_pmu_bts_config(event);
}
+#define EVENT_CODE(e) (e->attr.config & INTEL_ARCH_EVENT_MASK)
+#define is_slots_event(e) (EVENT_CODE(e) == 0x0400)
+#define is_perf_metrics_event(e) \
+ (((EVENT_CODE(e) & 0xff) == 0xff) && \
+ (EVENT_CODE(e) >= 0x01ff) && \
+ (EVENT_CODE(e) <= 0x04ff))
+
static int intel_pmu_hw_config(struct perf_event *event)
{
int ret = x86_pmu_hw_config(event);
@@ -3239,6 +3263,31 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (event->attr.type != PERF_TYPE_RAW)
return 0;
+ /* Fixed Counter 3 with its metric sub events */
+ if (x86_pmu.has_metric &&
+ (is_slots_event(event) || is_perf_metrics_event(event))) {
+ if (event->attr.config1 != 0)
+ return -EINVAL;
+ if (event->attr.config & ARCH_PERFMON_EVENTSEL_ANY)
+ return -EINVAL;
+ /*
+ * Put configuration (minus event) into config1 so that
+ * the scheduler enforces through an extra_reg that
+ * all instances of the metrics events have the same
+ * configuration.
+ */
+ event->attr.config1 = event->hw.config & X86_ALL_EVENT_FLAGS;
+ if (is_perf_metrics_event(event)) {
+ if (!x86_pmu.intel_cap.perf_metrics)
+ return -EINVAL;
+ if (is_sampling_event(event))
+ return -EINVAL;
+ }
+ if (!is_sampling_event(event))
+ event->hw.flags |= PERF_X86_EVENT_UPDATE;
+ return 0;
+ }
+
if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
return 0;
@@ -4069,6 +4118,141 @@ static __init void intel_ht_bug(void)
x86_pmu.stop_scheduling = intel_stop_scheduling;
}
+/*
+ * Update metric event with the PERF_METRICS register and return the delta
+ *
+ * Metric events are defined as SLOTS * metric. The original
+ * metric can be reconstructed by taking SUM(all-metrics)/metric
+ * (or SLOTS/metric)
+ *
+ * There are some limits for reading and resetting the PERF_METRICS register.
+ * - The PERF_METRICS and SLOTS registers have to be reset simultaneously.
+ * - To get the high precision, the measurement period has to be short.
+ * The PERF_METRICS and SLOTS will be reset for each reading.
+ * The only exception is context switch. The PERF_METRICS and SLOTS have to
+ * be saved/restored during context switch for RDPMC users.
+ *
+ * The PERF_METRICS doesn't have the absolute value. To calculate the delta
+ * of metric events,
+ * delta = new_SLOTS * new_METRICS - last_SLOTS * last_METRICS;
+ * Only need to save the last SLOTS and METRICS for context switch. For other
+ * cases, the registers have been reset. The last_* values are 0.
+ *
+ * To avoid reading the METRICS register multiple times, the metrics and slots
+ * value are cached. This only works when multiple sub-events are in the same
+ * group.
+ */
+static u64 icl_metric_update_event(struct perf_event *event, u64 val)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+ u64 newval, metric, slots_val = 0, new, last;
+ bool nmi = in_nmi();
+ int txn_flags = nmi ? 0 : cpuc->txn_flags;
+
+ /*
+ * Use cached value for transaction.
+ */
+ newval = 0;
+ if (txn_flags) {
+ newval = cpuc->txn_metric;
+ slots_val = cpuc->txn_slots;
+ } else if (nmi) {
+ newval = cpuc->nmi_metric;
+ slots_val = cpuc->nmi_slots;
+ }
+
+ if (!newval) {
+ slots_val = val;
+
+ rdpmcl((1<<29), newval);
+ if (txn_flags) {
+ cpuc->txn_metric = newval;
+ cpuc->txn_slots = slots_val;
+ } else if (nmi) {
+ cpuc->nmi_metric = newval;
+ cpuc->nmi_slots = slots_val;
+ }
+
+ if (!(txn_flags & PERF_PMU_TXN_REMOVE)) {
+ /* Reset the metric value when reading
+ * The SLOTS register must be reset when PERF_METRICS reset,
+ * otherwise PERF_METRICS may has wrong output.
+ */
+ wrmsrl(MSR_PERF_METRICS, 0);
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ hwc->saved_metric = 0;
+ hwc->saved_slots = 0;
+ } else {
+ /* saved metric and slots for context switch */
+ hwc->saved_metric = newval;
+ hwc->saved_slots = val;
+
+ }
+ /* cache the last metric and slots */
+ cpuc->last_metric = hwc->last_metric;
+ cpuc->last_slots = hwc->last_slots;
+ hwc->last_metric = 0;
+ hwc->last_slots = 0;
+ }
+
+ /* The METRICS and SLOTS have been reset when reading */
+ if (!(txn_flags & PERF_PMU_TXN_REMOVE))
+ local64_set(&hwc->prev_count, 0);
+
+ if (is_slots_event(event))
+ return (slots_val - cpuc->last_slots);
+
+ /*
+ * The metric is reported as an 8bit integer percentage
+ * suming up to 0xff. As the counter is less than 64bits
+ * we can use the not used bits to get the needed precision.
+ * Use 16bit fixed point arithmetic for
+ * slots-in-metric = (MetricPct / 0xff) * val
+ * This works fine for upto 48bit counters, but will
+ * lose precision above that.
+ */
+
+ metric = (cpuc->last_metric >> ((hwc->idx - INTEL_PMC_IDX_FIXED_METRIC_BASE)*8)) & 0xff;
+ last = (((metric * 0xffff) >> 8) * cpuc->last_slots) >> 16;
+
+ metric = (newval >> ((hwc->idx - INTEL_PMC_IDX_FIXED_METRIC_BASE)*8)) & 0xff;
+ new = (((metric * 0xffff) >> 8) * slots_val) >> 16;
+
+ return (new - last);
+}
+
+/*
+ * Update counters for metrics and slots events
+ */
+static int icl_set_period(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ s64 left = local64_read(&hwc->period_left);
+
+ if (!(hwc->flags & PERF_X86_EVENT_UPDATE))
+ return 0;
+
+ /* The initial value must be 0 */
+ if ((left == x86_pmu.max_period) && !hwc->saved_slots) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+ wrmsrl(MSR_PERF_METRICS, 0);
+ }
+
+ /* restore metric and slots for context switch */
+ if (hwc->saved_slots) {
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR3, hwc->saved_slots);
+ wrmsrl(MSR_PERF_METRICS, hwc->saved_metric);
+ hwc->last_slots = hwc->saved_slots;
+ hwc->last_metric = hwc->saved_metric;
+ hwc->saved_slots = 0;
+ hwc->saved_metric = 0;
+
+ }
+
+ return 1;
+}
+
EVENT_ATTR_STR(mem-loads, mem_ld_hsw, "event=0xcd,umask=0x1,ldlat=3");
EVENT_ATTR_STR(mem-stores, mem_st_hsw, "event=0xd0,umask=0x82")
@@ -4752,6 +4936,9 @@ __init int intel_pmu_init(void)
x86_pmu.force_gpr_event = 0x2ca;
x86_pmu.lbr_pt_coexist = true;
intel_pmu_pebs_data_source_skl(false);
+ x86_pmu.has_metric = x86_pmu.intel_cap.perf_metrics;
+ x86_pmu.metric_update_event = icl_metric_update_event;
+ x86_pmu.set_period = icl_set_period;
pr_cont("Icelake events, ");
name = "icelake";
break;
@@ -4866,6 +5053,11 @@ __init int intel_pmu_init(void)
if (x86_pmu.counter_freezing)
x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
+ if (x86_pmu.has_metric) {
+ x86_pmu.intel_ctrl |= 1ULL << 48;
+ pr_cont("TopDown, ");
+ }
+
kfree(to_free);
return 0;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e4215ea5c65e..6165f4b2a0e6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -40,6 +40,7 @@ enum extra_reg_type {
EXTRA_REG_LBR = 2, /* lbr_select */
EXTRA_REG_LDLAT = 3, /* ld_lat_threshold */
EXTRA_REG_FE = 4, /* fe_* */
+ EXTRA_REG_PERF_METRICS = 5, /* perf metrics */
EXTRA_REG_MAX /* number of entries needed */
};
@@ -71,6 +72,7 @@ struct event_constraint {
#define PERF_X86_EVENT_EXCL_ACCT 0x0200 /* accounted EXCL event */
#define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0800 /* use large PEBS */
+#define PERF_X86_EVENT_UPDATE 0x1000 /* call update_event after read */
static inline bool constraint_match(struct event_constraint *c, u64 ecode)
{
@@ -239,6 +241,13 @@ struct cpu_hw_events {
u64 intel_ctrl_host_mask;
struct perf_guest_switch_msr guest_switch_msrs[X86_PMC_IDX_MAX];
+ unsigned long txn_metric;
+ unsigned long txn_slots;
+ unsigned long nmi_metric;
+ unsigned long nmi_slots;
+ unsigned long last_metric;
+ unsigned long last_slots;
+
/*
* Intel checkpoint mask
*/
@@ -524,6 +533,7 @@ union perf_capabilities {
*/
u64 full_width_write:1;
u64 pebs_baseline:1;
+ u64 perf_metrics:1;
};
u64 capabilities;
};
@@ -587,6 +597,7 @@ struct x86_pmu {
int (*addr_offset)(int index, bool eventsel);
int (*rdpmc_index)(int index);
u64 (*event_map)(int);
+ u64 (*metric_update_event)(struct perf_event *event, u64 val);
int max_events;
int num_counters;
int num_counters_fixed;
@@ -617,6 +628,7 @@ struct x86_pmu {
struct x86_pmu_quirk *quirks;
int perfctr_second_write;
u64 (*limit_period)(struct perf_event *event, u64 l);
+ int (*set_period)(struct perf_event *event);
/* PMI handler bits */
unsigned int late_ack :1,
@@ -705,6 +717,8 @@ struct x86_pmu {
struct extra_reg *extra_regs;
unsigned int flags;
+ bool has_metric;
+
/*
* Intel host/guest support (KVM)
*/
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 5f369e954ba8..60ec21d9dd6d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -102,6 +102,8 @@
#define MSR_TURBO_RATIO_LIMIT1 0x000001ae
#define MSR_TURBO_RATIO_LIMIT2 0x000001af
+#define MSR_PERF_METRICS 0x00000329
+
#define MSR_LBR_SELECT 0x000001c8
#define MSR_LBR_TOS 0x000001c9
#define MSR_LBR_NHM_FROM 0x00000680
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4c62da79ff41..9c42876c5576 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -133,6 +133,11 @@ struct hw_perf_event {
struct hw_perf_event_extra extra_reg;
struct hw_perf_event_extra branch_reg;
+
+ u64 saved_metric;
+ u64 saved_slots;
+ u64 last_slots;
+ u64 last_metric;
};
struct { /* software */
struct hrtimer hrtimer;
--
2.17.1
From: Andi Kleen <[email protected]>
The top down sub event counters are mapped to a fixed counter,
but should have the normal weight for the scheduler.
So special case this.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index af2028ee2d1e..d69537b6c184 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4994,6 +4994,15 @@ __init int intel_pmu_init(void)
* counter, so do not extend mask to generic counters
*/
for_each_event_constraint(c, x86_pmu.event_constraints) {
+ /*
+ * Don't limit the event mask for topdown sub event
+ * counters.
+ */
+ if (x86_pmu.num_counters_fixed >= 3 &&
+ c->idxmsk64 & INTEL_PMC_MSK_ANY_SLOTS) {
+ c->weight = hweight64(c->idxmsk64);
+ continue;
+ }
if (c->cmask == FIXED_EVENT_FLAGS
&& c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES) {
c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
--
2.17.1
From: Andi Kleen <[email protected]>
The internal counters used for the metrics can overflow. If this happens
an overflow is triggered on the SLOTS fixed counter. Add special code
that resets all the slave metric counters in this case.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f1cb0155c79a..545db8da24de 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2198,12 +2198,35 @@ static void intel_pmu_add_event(struct perf_event *event)
intel_pmu_lbr_add(event);
}
+/* When SLOTS overflowed update all the active topdown-* events */
+static void intel_pmu_update_metrics(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ int idx;
+ u64 slots_events;
+
+ slots_events = *(u64 *)cpuc->enabled_events & INTEL_PMC_MSK_ANY_SLOTS;
+
+ for_each_set_bit(idx, (unsigned long *)&slots_events, 64) {
+ struct perf_event *ev = cpuc->events[idx];
+
+ if (ev == event)
+ continue;
+ x86_perf_event_update(event);
+ }
+}
+
/*
* Save and restart an expired event. Called by NMI contexts,
* so it has to be careful about preempting normal event ops:
*/
int intel_pmu_save_and_restart(struct perf_event *event)
{
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (unlikely(hwc->reg_idx == INTEL_PMC_IDX_FIXED_SLOTS))
+ intel_pmu_update_metrics(event);
+
x86_perf_event_update(event);
/*
* For a checkpointed counter always reset back to 0. This
--
2.17.1
From: Kan Liang <[email protected]>
Icelake is the same as the existing Skylake parts.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/msr.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index a878e6286e4a..f3f4c2263501 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -89,6 +89,7 @@ static bool test_intel(int idx)
case INTEL_FAM6_SKYLAKE_X:
case INTEL_FAM6_KABYLAKE_MOBILE:
case INTEL_FAM6_KABYLAKE_DESKTOP:
+ case INTEL_FAM6_ICELAKE_MOBILE:
if (idx == PERF_MSR_SMI || idx == PERF_MSR_PPERF)
return true;
break;
--
2.17.1
On Mon, Mar 18, 2019 at 02:41:23PM -0700, [email protected] wrote:
> From: Andi Kleen <[email protected]>
>
> Add support to the perf core for outputting registers from a separate
> array and add support for outputting XMM registers for x86.
What separate array and why?
> This requires changing all the perf_reg_value functions for the
> different architectures to pass the additional argument.
What additional argument? (basically a dangling reference here)
> Except for x86, they just ignore it.
>
> XMM registers are 128 bit. To simplify the code, they are handled like
> two different registers, which means setting two bits in the register
> bitmap. This also allows only sampling the lower 64bit bits in XMM.
So that is at least 2 changes in one patch; I though there was a rule
about that.
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index f3329cabce5c..1ff0df1c97ae 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h
> @@ -28,7 +28,28 @@ enum perf_event_x86_regs {
> PERF_REG_X86_R14,
> PERF_REG_X86_R15,
>
> - PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> - PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> + /* These all need two bits set because they are 128bit */
> + PERF_REG_X86_XMM0 = 32,
> + PERF_REG_X86_XMM1 = 34,
> + PERF_REG_X86_XMM2 = 36,
> + PERF_REG_X86_XMM3 = 38,
> + PERF_REG_X86_XMM4 = 40,
> + PERF_REG_X86_XMM5 = 42,
> + PERF_REG_X86_XMM6 = 44,
> + PERF_REG_X86_XMM7 = 46,
> + PERF_REG_X86_XMM8 = 48,
> + PERF_REG_X86_XMM9 = 50,
> + PERF_REG_X86_XMM10 = 52,
> + PERF_REG_X86_XMM11 = 54,
> + PERF_REG_X86_XMM12 = 56,
> + PERF_REG_X86_XMM13 = 58,
> + PERF_REG_X86_XMM14 = 60,
> + PERF_REG_X86_XMM15 = 62,
> +
> + /* This does not include the XMMX registers */
> + PERF_REG_GPR_X86_32_MAX = PERF_REG_X86_GS + 1,
> + PERF_REG_GPR_X86_64_MAX = PERF_REG_X86_R15 + 1,
> +
> + PERF_REG_X86_MAX = PERF_REG_X86_XMM15 + 2,
This needs explaining in both the Changelog and a comment.
> };
> #endif /* _ASM_X86_PERF_REGS_H */
> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
> index c06c4c16c6b6..8b44a4c5a161 100644
> --- a/arch/x86/kernel/perf_regs.c
> +++ b/arch/x86/kernel/perf_regs.c
> @@ -10,14 +10,14 @@
> #include <asm/ptrace.h>
>
> #ifdef CONFIG_X86_32
> -#define PERF_REG_X86_MAX PERF_REG_X86_32_MAX
> +#define PERF_REG_GPR_X86_MAX PERF_REG_GPR_X86_32_MAX
> #else
> -#define PERF_REG_X86_MAX PERF_REG_X86_64_MAX
> +#define PERF_REG_GPR_X86_MAX PERF_REG_GPR_X86_64_MAX
> #endif
>
> #define PT_REGS_OFFSET(id, r) [id] = offsetof(struct pt_regs, r)
>
> -static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
> +static unsigned int pt_regs_offset[PERF_REG_GPR_X86_MAX] = {
> PT_REGS_OFFSET(PERF_REG_X86_AX, ax),
> PT_REGS_OFFSET(PERF_REG_X86_BX, bx),
> PT_REGS_OFFSET(PERF_REG_X86_CX, cx),
> @@ -57,15 +57,22 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
> #endif
> };
>
> -u64 perf_reg_value(struct pt_regs *regs, int idx)
> +u64 perf_reg_value(struct pt_regs *regs, u64 *extra_regs, int idx)
> {
> + if (idx >= 32 && idx < 64) {
> + if (!extra_regs)
> + return 0;
> + return extra_regs[idx - 32];
> + }
> +
> if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
> return 0;
>
> return regs_get_register(regs, pt_regs_offset[idx]);
> }
>
> -#define REG_RESERVED (~((1ULL << PERF_REG_X86_MAX) - 1ULL))
> +#define REG_RESERVED \
> + (PERF_REG_X86_MAX == 64 ? 0 : ~((1ULL << PERF_REG_X86_MAX)) - 1ULL)
>
> #ifdef CONFIG_X86_32
> int perf_reg_validate(u64 mask)
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index e47ef764f613..bd3d6a89ccd4 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -948,6 +948,7 @@ struct perf_sample_data {
> u64 stack_user_size;
>
> u64 phys_addr;
> + u64 *extra_regs;
> } ____cacheline_aligned;
>
> /* default value for data source */
> @@ -968,6 +969,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
> data->weight = 0;
> data->data_src.val = PERF_MEM_NA;
> data->txn = 0;
> + data->extra_regs = NULL;
> }
NAK, why do I have to keep explaining this?
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5f59d848171e..560ac237b8be 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5858,7 +5858,8 @@ EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
>
> static void
> perf_output_sample_regs(struct perf_output_handle *handle,
> - struct pt_regs *regs, u64 mask)
> + struct pt_regs *regs,
> + u64 *extra_regs, u64 mask)
> {
> int bit;
> DECLARE_BITMAP(_mask, 64);
> @@ -5867,7 +5868,7 @@ perf_output_sample_regs(struct perf_output_handle *handle,
> for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) {
> u64 val;
>
> - val = perf_reg_value(regs, bit);
> + val = perf_reg_value(regs, extra_regs, bit);
> perf_output_put(handle, val);
> }
> }
> @@ -6274,6 +6275,7 @@ void perf_output_sample(struct perf_output_handle *handle,
> u64 mask = event->attr.sample_regs_user;
> perf_output_sample_regs(handle,
> data->regs_user.regs,
> + NULL,
> mask);
> }
> }
> @@ -6306,6 +6308,7 @@ void perf_output_sample(struct perf_output_handle *handle,
>
> perf_output_sample_regs(handle,
> data->regs_intr.regs,
> + data->extra_regs,
> mask);
> }
> }
See, I think most of this is completely unnessecary. Both sites pass:
&perf_regs::regs to perf_output_sample_regs()<-perf_reg_value().
So all you need to do is add the XMM crud to perf_regs, and use
container_of() on the pt_regs pointer in perf_reg_value() to get back to
perf_regs and voila, XMM registers.
On Mon, Mar 18, 2019 at 02:41:24PM -0700, [email protected] wrote:
> @@ -1125,34 +1125,50 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
> return 0;
> }
>
> -static inline u64 intel_hsw_weight(struct pebs_record_skl *pebs)
> +static inline u64 intel_hsw_weight(u64 tsx_tuning)
That function name is now completely insane. It used to be a reference
to the hsw pebs format, but you just destroyed that.
> {
> - if (pebs->tsx_tuning) {
> - union hsw_tsx_tuning tsx = { .value = pebs->tsx_tuning };
> + if (tsx_tuning) {
> + union hsw_tsx_tuning tsx = { .value = tsx_tuning };
> return tsx.cycles_last_block;
> }
> return 0;
> }
>
> -static inline u64 intel_hsw_transaction(struct pebs_record_skl *pebs)
> +static u64 intel_hsw_transaction(u64 tsx_tuning, u64 ax)
Guess what...
> {
> - u64 txn = (pebs->tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
> + u64 txn = (tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
>
> /* For RTM XABORTs also log the abort code from AX */
> - if ((txn & PERF_TXN_TRANSACTION) && (pebs->ax & 1))
> - txn |= ((pebs->ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT;
> + if ((txn & PERF_TXN_TRANSACTION) && (ax & 1))
> + txn |= ((ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT;
> return txn;
> }
On Tue, Mar 19, 2019 at 02:00:04PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 18, 2019 at 02:41:23PM -0700, [email protected] wrote:
> > @@ -6274,6 +6275,7 @@ void perf_output_sample(struct perf_output_handle *handle,
> > u64 mask = event->attr.sample_regs_user;
> > perf_output_sample_regs(handle,
> > data->regs_user.regs,
> > + NULL,
> > mask);
> > }
> > }
> > @@ -6306,6 +6308,7 @@ void perf_output_sample(struct perf_output_handle *handle,
> >
> > perf_output_sample_regs(handle,
> > data->regs_intr.regs,
> > + data->extra_regs,
> > mask);
> > }
> > }
>
> See, I think most of this is completely unnessecary. Both sites pass:
> &perf_regs::regs to perf_output_sample_regs()<-perf_reg_value().
>
> So all you need to do is add the XMM crud to perf_regs, and use
> container_of() on the pt_regs pointer in perf_reg_value() to get back to
> perf_regs and voila, XMM registers.
Ah, that's not quite true, but I still think you can make something like
that work.
On Mon, Mar 18, 2019 at 02:41:25PM -0700, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Adaptive PEBS is a new way to report PEBS sampling information. Instead
> of a fixed size record for all PEBS events it allows to configure the
> PEBS record to only include the information needed. Events can then opt
> in to use such an extended record, or stay with a basic record which
> only contains the IP.
>
> The major new feature is to support LBRs in PEBS record.
> This allows (much faster) large PEBS, while still supporting callstacks
> through callstack LBR.
Does it also allow normal LBR usage? Or does it have to be callstacks?
> arch/x86/events/intel/core.c | 2 +
> arch/x86/events/intel/ds.c | 293 ++++++++++++++++++++++++++++--
> arch/x86/events/intel/lbr.c | 22 +++
> arch/x86/events/perf_event.h | 14 ++
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/include/asm/perf_event.h | 42 +++++
> 6 files changed, 359 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 17096d3cd616..a964b9832b0c 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3446,6 +3446,8 @@ static int intel_pmu_cpu_prepare(int cpu)
> {
> struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
>
> + cpuc->pebs_record_size = x86_pmu.pebs_record_size;
> +
> if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
> cpuc->shared_regs = allocate_shared_regs(cpu);
> if (!cpuc->shared_regs)
Does not apply... Didn't apply when you send it. At the very least you
could've refreshed the series before sending :/
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 4a2206876baa..974284c5ed6c 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -906,17 +906,82 @@ static inline void pebs_update_threshold(struct cpu_hw_events *cpuc)
>
> if (cpuc->n_pebs == cpuc->n_large_pebs) {
> threshold = ds->pebs_absolute_maximum -
> - reserved * x86_pmu.pebs_record_size;
> + reserved * cpuc->pebs_record_size;
> } else {
> - threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
> + threshold = ds->pebs_buffer_base + cpuc->pebs_record_size;
> }
>
> ds->pebs_interrupt_threshold = threshold;
> }
>
> +static void adaptive_pebs_record_size_update(void)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + u64 d = cpuc->pebs_data_cfg;
> + int sz = sizeof(struct pebs_basic);
> +
> + if (d & PEBS_DATACFG_MEMINFO)
> + sz += sizeof(struct pebs_meminfo);
> + if (d & PEBS_DATACFG_GPRS)
> + sz += sizeof(struct pebs_gprs);
> + if (d & PEBS_DATACFG_XMMS)
> + sz += sizeof(struct pebs_xmm);
> + if (d & PEBS_DATACFG_LBRS)
> + sz += x86_pmu.lbr_nr * sizeof(struct pebs_lbr_entry);
> +
> + cpuc->pebs_record_size = sz;
> +}
You call that @d pebs_data_cfg elsewhere, why the inconsistency?
> +static u64 pebs_update_adaptive_cfg(struct perf_event *event)
> +{
> + u64 sample_type = event->attr.sample_type;
> + u64 pebs_data_cfg = 0;
> +
> +
too much whitespace
> + if ((sample_type & ~(PERF_SAMPLE_IP|PERF_SAMPLE_TIME)) ||
> + event->attr.precise_ip < 2) {
> +
> + if (sample_type & (PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC |
> + PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT |
> + PERF_SAMPLE_TRANSACTION))
> + pebs_data_cfg |= PEBS_DATACFG_MEMINFO;
> +
> + /*
> + * Cases we need the registers:
> + * + user requested registers
> + * + precise_ip < 2 for the non event IP
> + * + For RTM TSX weight we need GPRs too for the abort
> + * code. But we don't want to force GPRs for all other
> + * weights. So add it only for the RTM abort event.
> + */
> + if (((sample_type & PERF_SAMPLE_REGS_INTR) &&
> + (event->attr.sample_regs_intr & 0xffffffff)) ||
> + (event->attr.precise_ip < 2) ||
> + ((sample_type & PERF_SAMPLE_WEIGHT) &&
> + ((event->attr.config & 0xffff) == x86_pmu.force_gpr_event)))
> + pebs_data_cfg |= PEBS_DATACFG_GPRS;
I know it has a comment, but it would be nice for the code to be
readable too. This is horrible.
> +
> + if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
> + (event->attr.sample_regs_intr >> 32))
> + pebs_data_cfg |= PEBS_DATACFG_XMMS;
indent fail
> +
> + if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
> + /*
> + * For now always log all LBRs. Could configure this
> + * later.
> + */
> + pebs_data_cfg |= PEBS_DATACFG_LBRS |
> + ((x86_pmu.lbr_nr-1) << PEBS_DATACFG_LBR_SHIFT);
> + }
> + }
> + return pebs_data_cfg;
> +}
> +
> static void
> -pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, struct pmu *pmu)
> +pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
> + struct perf_event *event, bool add)
> {
> + struct pmu *pmu = event->ctx->pmu;
> /*
> * Make sure we get updated with the first PEBS
> * event. It will trigger also during removal, but
> @@ -933,6 +998,19 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, struct pmu *pmu)
> update = true;
> }
>
> + if (x86_pmu.intel_cap.pebs_baseline && add) {
> + u64 pebs_data_cfg;
> +
> + pebs_data_cfg = pebs_update_adaptive_cfg(event);
> +
> + /* Update pebs_record_size if new event requires more data. */
> + if (pebs_data_cfg & ~cpuc->pebs_data_cfg) {
> + cpuc->pebs_data_cfg |= pebs_data_cfg;
> + adaptive_pebs_record_size_update();
> + update = true;
> + }
> + }
> +
> if (update)
> pebs_update_threshold(cpuc);
> }
Hurmph.. this only grows the PEBS record.
> @@ -947,7 +1025,7 @@ void intel_pmu_pebs_add(struct perf_event *event)
> if (hwc->flags & PERF_X86_EVENT_LARGE_PEBS)
> cpuc->n_large_pebs++;
>
> - pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
> + pebs_update_state(needed_cb, cpuc, event, true);
> }
>
> void intel_pmu_pebs_enable(struct perf_event *event)
> @@ -965,6 +1043,14 @@ void intel_pmu_pebs_enable(struct perf_event *event)
> else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
> cpuc->pebs_enabled |= 1ULL << 63;
>
> + if (x86_pmu.intel_cap.pebs_baseline) {
> + hwc->config |= ICL_EVENTSEL_ADAPTIVE;
> + if (cpuc->pebs_data_cfg != cpuc->active_pebs_data_cfg) {
> + wrmsrl(MSR_PEBS_DATA_CFG, cpuc->pebs_data_cfg);
> + cpuc->active_pebs_data_cfg = cpuc->pebs_data_cfg;
> + }
> + }
> +
> /*
> * Use auto-reload if possible to save a MSR write in the PMI.
> * This must be done in pmu::start(), because PERF_EVENT_IOC_PERIOD.
> @@ -991,7 +1077,12 @@ void intel_pmu_pebs_del(struct perf_event *event)
> if (hwc->flags & PERF_X86_EVENT_LARGE_PEBS)
> cpuc->n_large_pebs--;
>
> - pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
> + /* Clear both pebs_data_cfg and pebs_record_size for first PEBS. */
Weird comment..
> + if (x86_pmu.intel_cap.pebs_baseline && !cpuc->n_pebs) {
> + cpuc->pebs_data_cfg = 0;
> + cpuc->pebs_record_size = sizeof(struct pebs_basic);
> + }
> + pebs_update_state(needed_cb, cpuc, event, false);
Why do we have to reset record_size? That'll be updated in
pebs_update_state() on the next add.
> }
>
> void intel_pmu_pebs_disable(struct perf_event *event)
> @@ -1004,6 +1095,8 @@ void intel_pmu_pebs_disable(struct perf_event *event)
>
> cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
>
> + /* Delay reprograming DATA_CFG to next enable */
> +
No need for that I think.
> if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
> cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
> else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
> @@ -1013,6 +1106,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
> wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>
> hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
> + hwc->config &= ~ICL_EVENTSEL_ADAPTIVE;
Just curious; the way I read the SDM, we could leave this set, is that
correct?
> }
>
> void intel_pmu_pebs_enable_all(void)
> @@ -1323,19 +1558,20 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
> if (base == NULL)
> return NULL;
>
> - for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> + for (at = base; at < top; at = next_pebs_record(at)) {
That _should_ work with cpuc->pebs_record_size, right?
> struct pebs_record_nhm *p = at;
> + unsigned long status = get_pebs_status(p);
>
> - if (test_bit(bit, (unsigned long *)&p->status)) {
> + if (test_bit(bit, (unsigned long *)&status)) {
> /* PEBS v3 has accurate status bits */
> if (x86_pmu.intel_cap.pebs_format >= 3)
> return at;
>
> - if (p->status == (1 << bit))
> + if (status == (1 << bit))
> return at;
>
> /* clear non-PEBS bit and re-check */
> - pebs_status = p->status & cpuc->pebs_enabled;
> + pebs_status = status & cpuc->pebs_enabled;
> pebs_status &= PEBS_COUNTER_MASK;
> if (pebs_status == (1 << bit))
> return at;
> @@ -1434,14 +1670,14 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
> return;
>
> while (count > 1) {
> - setup_pebs_sample_data(event, iregs, at, &data, ®s);
> + x86_pmu.setup_pebs_sample_data(event, iregs, at, &data, ®s);
> perf_event_output(event, &data, ®s);
> - at += x86_pmu.pebs_record_size;
> + at = next_pebs_record(at);
> at = get_next_pebs_record_by_bit(at, top, bit);
> count--;
> }
>
> - setup_pebs_sample_data(event, iregs, at, &data, ®s);
> + x86_pmu.setup_pebs_sample_data(event, iregs, at, &data, ®s);
>
> /*
> * All but the last records are processed.
> @@ -1534,11 +1770,11 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
> return;
> }
>
> - for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> + for (at = base; at < top; at = next_pebs_record(at)) {
> struct pebs_record_nhm *p = at;
> u64 pebs_status;
>
> - pebs_status = p->status & cpuc->pebs_enabled;
> + pebs_status = get_pebs_status(p) & cpuc->pebs_enabled;
> pebs_status &= mask;
>
> /* PEBS v3 has more accurate status bits */
How much work would intel_pmu_drain_pebs_icl() be?
I'm thinking that might not be terrible.
On Mon, Mar 18, 2019 at 02:41:27PM -0700, [email protected] wrote:
> From: Andi Kleen <[email protected]>
>
> Icelake extended the general counters to 8, even when SMT is enabled.
> However only a (large) subset of the events can be used on all 8
> counters.
>
> The events that can or cannot be used on all counters are organized
> in ranges.
>
> We need a lot of scheduler constraints to handle all this.
>
> To avoid blowing up the tables add event code ranges to the constraint
> tables, and a new inline function to match them.
>
> The changes costs ~2k text size according to 0day report.
Where?! there isn't much code here.
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/events/intel/core.c | 2 +-
> arch/x86/events/intel/ds.c | 2 +-
> arch/x86/events/perf_event.h | 34 ++++++++++++++++++++++++++++++++++
> 3 files changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a964b9832b0c..8486ab87f8f8 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2655,7 +2655,7 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>
> if (x86_pmu.event_constraints) {
> for_each_event_constraint(c, x86_pmu.event_constraints) {
> - if ((event->hw.config & c->cmask) == c->code) {
> + if (constraint_match(c, event->hw.config)) {
> event->hw.flags |= c->flags;
> return c;
> }
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 974284c5ed6c..30370fb93e21 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -858,7 +858,7 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
>
> if (x86_pmu.pebs_constraints) {
> for_each_event_constraint(c, x86_pmu.pebs_constraints) {
> - if ((event->hw.config & c->cmask) == c->code) {
> + if (constraint_match(c, event->hw.config)) {
> event->hw.flags |= c->flags;
> return c;
> }
> @@ -71,6 +72,12 @@ struct event_constraint {
> #define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
> #define PERF_X86_EVENT_LARGE_PEBS 0x0800 /* use large PEBS */
>
> +static inline bool constraint_match(struct event_constraint *c, u64 ecode)
> +{
> + ecode &= c->cmask;
> + return ecode == c->code ||
> + (c->range_end && ecode >= c->code && ecode <= c->range_end);
> +}
>
> struct amd_nb {
> int nb_id; /* NorthBridge id */
That's all the code, how does that add up to 2k ?
On Tue, Mar 19, 2019 at 03:53:09PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 18, 2019 at 02:41:27PM -0700, [email protected] wrote:
> > The changes costs ~2k text size according to 0day report.
>
> Where?! there isn't much code here.
> > @@ -71,6 +72,12 @@ struct event_constraint {
> > #define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
> > #define PERF_X86_EVENT_LARGE_PEBS 0x0800 /* use large PEBS */
> >
> > +static inline bool constraint_match(struct event_constraint *c, u64 ecode)
> > +{
> > + ecode &= c->cmask;
> > + return ecode == c->code ||
> > + (c->range_end && ecode >= c->code && ecode <= c->range_end);
> > +}
> >
> > struct amd_nb {
> > int nb_id; /* NorthBridge id */
>
> That's all the code, how does that add up to 2k ?
By my counting it adds all of 108 bytes of text. What it does do is add
5784 bytes of data. But that too appears not to be required.
The below patch does the same, it adds 37 bytes of text and no
additional data.
It mostly works because of how Intel event codes are 'small'. We can
easily compress the constraint to allow a u64 size, but I don't think
that is needed. If we need to cover AMD64_EVENTSEL_EVENT the range
compare needs to get fixed anyway.
---
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2688,7 +2688,7 @@ x86_get_event_constraints(struct cpu_hw_
if (x86_pmu.event_constraints) {
for_each_event_constraint(c, x86_pmu.event_constraints) {
- if ((event->hw.config & c->cmask) == c->code) {
+ if (constraint_match(c, event->hw.config)) {
event->hw.flags |= c->flags;
return c;
}
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -858,7 +858,7 @@ struct event_constraint *intel_pebs_cons
if (x86_pmu.pebs_constraints) {
for_each_event_constraint(c, x86_pmu.pebs_constraints) {
- if ((event->hw.config & c->cmask) == c->code) {
+ if (constraint_match(c, event->hw.config)) {
event->hw.flags |= c->flags;
return c;
}
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -54,6 +54,7 @@ struct event_constraint {
int weight;
int overlap;
int flags;
+ unsigned int size;
};
/*
* struct hw_perf_event.flags flags
@@ -71,6 +72,10 @@ struct event_constraint {
#define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0800 /* use large PEBS */
+static inline bool constraint_match(struct event_constraint *c, u64 ecode)
+{
+ return ((ecode & c->cmask) - c->code) <= (u64)c->size;
+}
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -257,18 +262,25 @@ struct cpu_hw_events {
void *kfree_on_online[X86_PERF_KFREE_MAX];
};
-#define __EVENT_CONSTRAINT(c, n, m, w, o, f) {\
+#define __EVENT_CONSTRAINT_RANGE(c, e, n, m, w, o, f) { \
{ .idxmsk64 = (n) }, \
.code = (c), \
+ .size = (e) - (c), \
.cmask = (m), \
.weight = (w), \
.overlap = (o), \
.flags = f, \
}
+#define __EVENT_CONSTRAINT(c, n, m, w, o, f) \
+ __EVENT_CONSTRAINT_RANGE(c, c, n, m, w, o, f)
+
#define EVENT_CONSTRAINT(c, n, m) \
__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)
+#define EVENT_CONSTRAINT_RANGE(c, e, n, m) \
+ __EVENT_CONSTRAINT_RANGE(c, e, n, m, HWEIGHT(n), 0, 0)
+
#define INTEL_EXCLEVT_CONSTRAINT(c, n) \
__EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT, HWEIGHT(n),\
0, PERF_X86_EVENT_EXCL)
@@ -304,6 +316,12 @@ struct cpu_hw_events {
EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT)
/*
+ * Constraint on a range of Event codes
+ */
+#define INTEL_EVENT_CONSTRAINT_RANGE(c, e, n) \
+ EVENT_CONSTRAINT_RANGE(c, e, n, ARCH_PERFMON_EVENTSEL_EVENT)
+
+/*
* Constraint on the Event code + UMask + fixed-mask
*
* filter mask to validate fixed counter events.
@@ -350,6 +368,9 @@ struct cpu_hw_events {
#define INTEL_FLAGS_EVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+#define INTEL_FLAGS_EVENT_CONSTRAINT_RANGE(c, e, n) \
+ EVENT_CONSTRAINT_RANGE(c, e, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+
/* Check only flags, but allow all event/umask */
#define INTEL_ALL_EVENT_CONSTRAINT(code, n) \
EVENT_CONSTRAINT(code, n, X86_ALL_EVENT_FLAGS)
@@ -366,6 +387,11 @@ struct cpu_hw_events {
ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)
+#define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD_RANGE(code, end, n) \
+ __EVENT_CONSTRAINT_RANGE(code, end, n, \
+ ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
+ HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)
+
#define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(code, n) \
__EVENT_CONSTRAINT(code, n, \
ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
On Mon, Mar 18, 2019 at 02:41:33PM -0700, [email protected] wrote:
> From: Andi Kleen <[email protected]>
>
> For TopDown metrics it is useful to have a remove transaction when
> the counter is unscheduled, so that the value can be saved correctly.
> Add a remove transaction to the perf core.
That needs more explaining.
> > struct amd_nb {
> > int nb_id; /* NorthBridge id */
>
> That's all the code, how does that add up to 2k ?
Because the tables are bigger. There are much more tables than code.
In theory could use two kinds of tables, but that would complicate
the code quite a bit.
-Andi
On Tue, Mar 19, 2019 at 03:47:48PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 18, 2019 at 02:41:25PM -0700, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > Adaptive PEBS is a new way to report PEBS sampling information. Instead
> > of a fixed size record for all PEBS events it allows to configure the
> > PEBS record to only include the information needed. Events can then opt
> > in to use such an extended record, or stay with a basic record which
> > only contains the IP.
> >
> > The major new feature is to support LBRs in PEBS record.
> > This allows (much faster) large PEBS, while still supporting callstacks
> > through callstack LBR.
>
> Does it also allow normal LBR usage? Or does it have to be callstacks?
It allows normal LBR too. But I would expect callstack to be the most
common one. As long as you set a period you can get multi-record
PEBS with -g, which has a lot lower lower overhead than using
PMIs.
Eventually I hope we can even make multi-record PEBS
work in frequency mode by averaging the frequency over multiple
records.
> > hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
> > + hwc->config &= ~ICL_EVENTSEL_ADAPTIVE;
>
> Just curious; the way I read the SDM, we could leave this set, is that
> correct?
It needs to be cleared to get the basic record (which should be
a common case)
-Andi
On Tue, Mar 19, 2019 at 08:57:57AM -0700, Andi Kleen wrote:
> > > struct amd_nb {
> > > int nb_id; /* NorthBridge id */
> >
> > That's all the code, how does that add up to 2k ?
>
> Because the tables are bigger. There are much more tables than code.
Tables is data, not text.
On Tue, Mar 19, 2019 at 09:03:37AM -0700, Andi Kleen wrote:
> On Tue, Mar 19, 2019 at 03:47:48PM +0100, Peter Zijlstra wrote:
> > On Mon, Mar 18, 2019 at 02:41:25PM -0700, [email protected] wrote:
> > > From: Kan Liang <[email protected]>
> > >
> > > Adaptive PEBS is a new way to report PEBS sampling information. Instead
> > > of a fixed size record for all PEBS events it allows to configure the
> > > PEBS record to only include the information needed. Events can then opt
> > > in to use such an extended record, or stay with a basic record which
> > > only contains the IP.
> > >
> > > The major new feature is to support LBRs in PEBS record.
> > > This allows (much faster) large PEBS, while still supporting callstacks
> > > through callstack LBR.
> >
> > Does it also allow normal LBR usage? Or does it have to be callstacks?
>
> It allows normal LBR too. But I would expect callstack to be the most
> common one. As long as you set a period you can get multi-record
> PEBS with -g, which has a lot lower lower overhead than using
> PMIs.
>
> Eventually I hope we can even make multi-record PEBS
> work in frequency mode by averaging the frequency over multiple
> records.
Should be doable..
> > > hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
> > > + hwc->config &= ~ICL_EVENTSEL_ADAPTIVE;
> >
> > Just curious; the way I read the SDM, we could leave this set, is that
> > correct?
>
> It needs to be cleared to get the basic record (which should be
> a common case)
But you unconditionally set the bit when you enable PEBS for the event.
So why clear it?
On 3/19/2019 10:47 AM, Peter Zijlstra wrote:
>> @@ -933,6 +998,19 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, struct pmu *pmu)
>> update = true;
>> }
>>
>> + if (x86_pmu.intel_cap.pebs_baseline && add) {
>> + u64 pebs_data_cfg;
>> +
>> + pebs_data_cfg = pebs_update_adaptive_cfg(event);
>> +
>> + /* Update pebs_record_size if new event requires more data. */
>> + if (pebs_data_cfg & ~cpuc->pebs_data_cfg) {
>> + cpuc->pebs_data_cfg |= pebs_data_cfg;
>> + adaptive_pebs_record_size_update();
>> + update = true;
>> + }
>> + }
>> +
>> if (update)
>> pebs_update_threshold(cpuc);
>> }
> Hurmph.. this only grows the PEBS record.
>
Yes, the PEBS record doesn't shrink on the del. Because we have to go
through all the existing pebs events for an accurate config. I think it
doesn't worth it. There is no harmful for a bigger PEBS record, except
little performance impacts. But that's rare case. For most cases, we
usually apply the same pebs config for all pebs events.
>
>> @@ -947,7 +1025,7 @@ void intel_pmu_pebs_add(struct perf_event *event)
>> if (hwc->flags & PERF_X86_EVENT_LARGE_PEBS)
>> cpuc->n_large_pebs++;
>>
>> - pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
>> + pebs_update_state(needed_cb, cpuc, event, true);
>> }
>>
>> void intel_pmu_pebs_enable(struct perf_event *event)
>> @@ -965,6 +1043,14 @@ void intel_pmu_pebs_enable(struct perf_event *event)
>> else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
>> cpuc->pebs_enabled |= 1ULL << 63;
>>
>> + if (x86_pmu.intel_cap.pebs_baseline) {
>> + hwc->config |= ICL_EVENTSEL_ADAPTIVE;
>> + if (cpuc->pebs_data_cfg != cpuc->active_pebs_data_cfg) {
>> + wrmsrl(MSR_PEBS_DATA_CFG, cpuc->pebs_data_cfg);
>> + cpuc->active_pebs_data_cfg = cpuc->pebs_data_cfg;
>> + }
>> + }
>> +
>> /*
>> * Use auto-reload if possible to save a MSR write in the PMI.
>> * This must be done in pmu::start(), because PERF_EVENT_IOC_PERIOD.
>> @@ -991,7 +1077,12 @@ void intel_pmu_pebs_del(struct perf_event *event)
>> if (hwc->flags & PERF_X86_EVENT_LARGE_PEBS)
>> cpuc->n_large_pebs--;
>>
>> - pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
>> + /* Clear both pebs_data_cfg and pebs_record_size for first PEBS. */
> Weird comment..
>
>> + if (x86_pmu.intel_cap.pebs_baseline && !cpuc->n_pebs) {
>> + cpuc->pebs_data_cfg = 0;
>> + cpuc->pebs_record_size = sizeof(struct pebs_basic);
>> + }
>> + pebs_update_state(needed_cb, cpuc, event, false);
> Why do we have to reset record_size? That'll be updated in
> pebs_update_state() on the next add.
>
The record_size should be reset for the first PEBS events.
Right, I can move the reset in pebs_update_state().
Thanks,
Kan
> How much work would intel_pmu_drain_pebs_icl() be?
>
> I'm thinking that might not be terrible.
I had it in an early version of the code, but it ended up with a lot
of code duplication.
-Andi
On Mon, Mar 18, 2019 at 2:44 PM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Add Icelake core PMU perf code, including constraint tables and the main
> enable code.
>
> Icelake expanded the generic counters to always 8 even with HT on, but a
> range of events cannot be scheduled on the extra 4 counters.
> Add new constraint ranges to describe this to the scheduler.
> The number of constraints that need to be checked is larger now than
> with earlier CPUs.
> At some point we may need a new data structure to look them up more
> efficiently than with linear search. So far it still seems to be
> acceptable however.
>
> Icelake added a new fixed counter SLOTS. Full support for it is added
> later in the patch series.
>
> The cache events table is identical to Skylake.
>
> Compare to PEBS instruction event on generic counter, fixed counter 0
> has less skid. Force instruction:ppp always in fixed counter 0.
>
> Originally-by: Andi Kleen <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/events/intel/core.c | 111 ++++++++++++++++++++++++++++++
> arch/x86/events/intel/ds.c | 26 ++++++-
> arch/x86/events/perf_event.h | 2 +
> arch/x86/include/asm/intel_ds.h | 2 +-
> arch/x86/include/asm/perf_event.h | 2 +-
> 5 files changed, 139 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 8486ab87f8f8..87dafac87520 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -239,6 +239,35 @@ static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
> EVENT_EXTRA_END
> };
>
> +static struct event_constraint intel_icl_event_constraints[] = {
> + FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
> + INTEL_UEVENT_CONSTRAINT(0x1c0, 0), /* INST_RETIRED.PREC_DIST */
> + FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
> + FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
> + FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */
> + INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
> + INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
> + INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */
> + INTEL_EVENT_CONSTRAINT_RANGE(0x48, 0x54, 0xf),
> + INTEL_EVENT_CONSTRAINT_RANGE(0x60, 0x8b, 0xf),
> + INTEL_UEVENT_CONSTRAINT(0x04a3, 0xff), /* CYCLE_ACTIVITY.STALLS_TOTAL */
> + INTEL_UEVENT_CONSTRAINT(0x10a3, 0xff), /* CYCLE_ACTIVITY.STALLS_MEM_ANY */
> + INTEL_EVENT_CONSTRAINT(0xa3, 0xf), /* CYCLE_ACTIVITY.* */
> + INTEL_EVENT_CONSTRAINT_RANGE(0xa8, 0xb0, 0xf),
> + INTEL_EVENT_CONSTRAINT_RANGE(0xb7, 0xbd, 0xf),
> + INTEL_EVENT_CONSTRAINT_RANGE(0xd0, 0xe6, 0xf),
> + INTEL_EVENT_CONSTRAINT_RANGE(0xf0, 0xf4, 0xf),
> + EVENT_CONSTRAINT_END
> +};
> +
> +static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
> + INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff9fffull, RSP_0),
> + INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1),
> + INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
> + INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
> + EVENT_EXTRA_END
> +};
> +
> EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3");
> EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
> EVENT_ATTR_STR(mem-stores, mem_st_snb, "event=0xcd,umask=0x2");
> @@ -3324,6 +3353,9 @@ static struct event_constraint counter0_constraint =
> static struct event_constraint counter2_constraint =
> EVENT_CONSTRAINT(0, 0x4, 0);
>
> +static struct event_constraint fixed_counter0_constraint =
> + FIXED_EVENT_CONSTRAINT(0x00c0, 0);
> +
> static struct event_constraint *
> hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> struct perf_event *event)
> @@ -3342,6 +3374,21 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> return c;
> }
>
> +static struct event_constraint *
> +icl_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> + struct perf_event *event)
> +{
> + /*
> + * Fixed counter 0 has less skid.
> + * Force instruction:ppp in Fixed counter 0
> + */
> + if ((event->attr.precise_ip == 3) &&
> + ((event->hw.config & X86_RAW_EVENT_MASK) == 0x00c0))
> + return &fixed_counter0_constraint;
> +
Not clear to me why you need to treat this one separately from the
PEBS constraints, if you check that you have
:ppp (precise_ip > 0)?
> + return hsw_get_event_constraints(cpuc, idx, event);
> +}
> +
> static struct event_constraint *
> glp_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> struct perf_event *event)
> @@ -4038,6 +4085,42 @@ static struct attribute *hsw_tsx_events_attrs[] = {
> NULL
> };
>
> +EVENT_ATTR_STR(tx-capacity-read, tx_capacity_read, "event=0x54,umask=0x80");
> +EVENT_ATTR_STR(tx-capacity-write, tx_capacity_write, "event=0x54,umask=0x2");
> +EVENT_ATTR_STR(el-capacity-read, el_capacity_read, "event=0x54,umask=0x80");
> +EVENT_ATTR_STR(el-capacity-write, el_capacity_write, "event=0x54,umask=0x2");
> +
> +static struct attribute *icl_events_attrs[] = {
> + EVENT_PTR(mem_ld_hsw),
> + EVENT_PTR(mem_st_hsw),
> + NULL,
> +};
> +
> +static struct attribute *icl_tsx_events_attrs[] = {
> + EVENT_PTR(tx_start),
> + EVENT_PTR(tx_abort),
> + EVENT_PTR(tx_commit),
> + EVENT_PTR(tx_capacity_read),
> + EVENT_PTR(tx_capacity_write),
> + EVENT_PTR(tx_conflict),
> + EVENT_PTR(el_start),
> + EVENT_PTR(el_abort),
> + EVENT_PTR(el_commit),
> + EVENT_PTR(el_capacity_read),
> + EVENT_PTR(el_capacity_write),
> + EVENT_PTR(el_conflict),
> + EVENT_PTR(cycles_t),
> + EVENT_PTR(cycles_ct),
> + NULL,
> +};
> +
> +static __init struct attribute **get_icl_events_attrs(void)
> +{
> + return boot_cpu_has(X86_FEATURE_RTM) ?
> + merge_attr(icl_events_attrs, icl_tsx_events_attrs) :
> + icl_events_attrs;
> +}
> +
> static ssize_t freeze_on_smi_show(struct device *cdev,
> struct device_attribute *attr,
> char *buf)
> @@ -4611,6 +4694,34 @@ __init int intel_pmu_init(void)
> name = "skylake";
> break;
>
> + case INTEL_FAM6_ICELAKE_MOBILE:
> + x86_pmu.late_ack = true;
> + memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
> + memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
> + hw_cache_event_ids[C(ITLB)][C(OP_READ)][C(RESULT_ACCESS)] = -1;
> + intel_pmu_lbr_init_skl();
> +
> + x86_pmu.event_constraints = intel_icl_event_constraints;
> + x86_pmu.pebs_constraints = intel_icl_pebs_event_constraints;
> + x86_pmu.extra_regs = intel_icl_extra_regs;
> + x86_pmu.pebs_aliases = NULL;
> + x86_pmu.pebs_prec_dist = true;
> + x86_pmu.flags |= PMU_FL_HAS_RSP_1;
> + x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
> +
> + x86_pmu.hw_config = hsw_hw_config;
> + x86_pmu.get_event_constraints = icl_get_event_constraints;
> + extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
> + hsw_format_attr : nhm_format_attr;
> + extra_attr = merge_attr(extra_attr, skl_format_attr);
> + x86_pmu.cpu_events = get_icl_events_attrs();
> + x86_pmu.force_gpr_event = 0x2ca;
> + x86_pmu.lbr_pt_coexist = true;
> + intel_pmu_pebs_data_source_skl(false);
> + pr_cont("Icelake events, ");
> + name = "icelake";
> + break;
> +
> default:
> switch (x86_pmu.version) {
> case 1:
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 30370fb93e21..97bad6b0f470 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -849,6 +849,26 @@ struct event_constraint intel_skl_pebs_event_constraints[] = {
> EVENT_CONSTRAINT_END
> };
>
> +struct event_constraint intel_icl_pebs_event_constraints[] = {
> + INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x100000000ULL), /* INST_RETIRED.PREC_DIST */
> + INTEL_FLAGS_UEVENT_CONSTRAINT(0x0400, 0x400000000ULL), /* SLOTS */
> +
> + INTEL_PLD_CONSTRAINT(0x1cd, 0xff), /* MEM_TRANS_RETIRED.LOAD_LATENCY */
> + INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x1d0, 0xf), /* MEM_INST_RETIRED.LOAD */
> + INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x2d0, 0xf), /* MEM_INST_RETIRED.STORE */
> +
> + INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD_RANGE(0xd1, 0xd4, 0xf), /* MEM_LOAD_*_RETIRED.* */
> +
> + INTEL_FLAGS_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_INST_RETIRED.* */
> +
> + /*
> + * Everything else is handled by PMU_FL_PEBS_ALL, because we
> + * need the full constraints from the main table.
> + */
> +
> + EVENT_CONSTRAINT_END
> +};
> +
> struct event_constraint *intel_pebs_constraints(struct perf_event *event)
> {
> struct event_constraint *c;
> @@ -1038,7 +1058,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
>
> cpuc->pebs_enabled |= 1ULL << hwc->idx;
>
> - if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
> + if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) &&
> + (x86_pmu.version < 5))
> cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
> else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
> cpuc->pebs_enabled |= 1ULL << 63;
> @@ -1097,7 +1118,8 @@ void intel_pmu_pebs_disable(struct perf_event *event)
>
> /* Delay reprograming DATA_CFG to next enable */
>
> - if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
> + if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) &&
> + (x86_pmu.version < 5))
> cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
> else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
> cpuc->pebs_enabled &= ~(1ULL << 63);
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 863d27f4c352..efa893ce95e2 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -981,6 +981,8 @@ extern struct event_constraint intel_bdw_pebs_event_constraints[];
>
> extern struct event_constraint intel_skl_pebs_event_constraints[];
>
> +extern struct event_constraint intel_icl_pebs_event_constraints[];
> +
> struct event_constraint *intel_pebs_constraints(struct perf_event *event);
>
> void intel_pmu_pebs_add(struct perf_event *event);
> diff --git a/arch/x86/include/asm/intel_ds.h b/arch/x86/include/asm/intel_ds.h
> index ae26df1c2789..8380c3ddd4b2 100644
> --- a/arch/x86/include/asm/intel_ds.h
> +++ b/arch/x86/include/asm/intel_ds.h
> @@ -8,7 +8,7 @@
>
> /* The maximal number of PEBS events: */
> #define MAX_PEBS_EVENTS 8
> -#define MAX_FIXED_PEBS_EVENTS 3
> +#define MAX_FIXED_PEBS_EVENTS 4
>
> /*
> * A debug store configuration.
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index da0a80ef6505..64cb4dffe4cd 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -7,7 +7,7 @@
> */
>
> #define INTEL_PMC_MAX_GENERIC 32
> -#define INTEL_PMC_MAX_FIXED 3
> +#define INTEL_PMC_MAX_FIXED 4
> #define INTEL_PMC_IDX_FIXED 32
>
> #define X86_PMC_IDX_MAX 64
> --
> 2.17.1
>
On 3/19/2019 8:08 PM, Stephane Eranian wrote:
> On Mon, Mar 18, 2019 at 2:44 PM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> Add Icelake core PMU perf code, including constraint tables and the main
>> enable code.
>>
>> Icelake expanded the generic counters to always 8 even with HT on, but a
>> range of events cannot be scheduled on the extra 4 counters.
>> Add new constraint ranges to describe this to the scheduler.
>> The number of constraints that need to be checked is larger now than
>> with earlier CPUs.
>> At some point we may need a new data structure to look them up more
>> efficiently than with linear search. So far it still seems to be
>> acceptable however.
>>
>> Icelake added a new fixed counter SLOTS. Full support for it is added
>> later in the patch series.
>>
>> The cache events table is identical to Skylake.
>>
>> Compare to PEBS instruction event on generic counter, fixed counter 0
>> has less skid. Force instruction:ppp always in fixed counter 0.
>>
>> Originally-by: Andi Kleen <[email protected]>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> arch/x86/events/intel/core.c | 111 ++++++++++++++++++++++++++++++
>> arch/x86/events/intel/ds.c | 26 ++++++-
>> arch/x86/events/perf_event.h | 2 +
>> arch/x86/include/asm/intel_ds.h | 2 +-
>> arch/x86/include/asm/perf_event.h | 2 +-
>> 5 files changed, 139 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 8486ab87f8f8..87dafac87520 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -239,6 +239,35 @@ static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
>> EVENT_EXTRA_END
>> };
>>
>> +static struct event_constraint intel_icl_event_constraints[] = {
>> + FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
>> + INTEL_UEVENT_CONSTRAINT(0x1c0, 0), /* INST_RETIRED.PREC_DIST */
>> + FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
>> + FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */
>> + FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */
>> + INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
>> + INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
>> + INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */
>> + INTEL_EVENT_CONSTRAINT_RANGE(0x48, 0x54, 0xf),
>> + INTEL_EVENT_CONSTRAINT_RANGE(0x60, 0x8b, 0xf),
>> + INTEL_UEVENT_CONSTRAINT(0x04a3, 0xff), /* CYCLE_ACTIVITY.STALLS_TOTAL */
>> + INTEL_UEVENT_CONSTRAINT(0x10a3, 0xff), /* CYCLE_ACTIVITY.STALLS_MEM_ANY */
>> + INTEL_EVENT_CONSTRAINT(0xa3, 0xf), /* CYCLE_ACTIVITY.* */
>> + INTEL_EVENT_CONSTRAINT_RANGE(0xa8, 0xb0, 0xf),
>> + INTEL_EVENT_CONSTRAINT_RANGE(0xb7, 0xbd, 0xf),
>> + INTEL_EVENT_CONSTRAINT_RANGE(0xd0, 0xe6, 0xf),
>> + INTEL_EVENT_CONSTRAINT_RANGE(0xf0, 0xf4, 0xf),
>> + EVENT_CONSTRAINT_END
>> +};
>> +
>> +static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
>> + INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff9fffull, RSP_0),
>> + INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1),
>> + INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
>> + INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
>> + EVENT_EXTRA_END
>> +};
>> +
>> EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3");
>> EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
>> EVENT_ATTR_STR(mem-stores, mem_st_snb, "event=0xcd,umask=0x2");
>> @@ -3324,6 +3353,9 @@ static struct event_constraint counter0_constraint =
>> static struct event_constraint counter2_constraint =
>> EVENT_CONSTRAINT(0, 0x4, 0);
>>
>> +static struct event_constraint fixed_counter0_constraint =
>> + FIXED_EVENT_CONSTRAINT(0x00c0, 0);
>> +
>> static struct event_constraint *
>> hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>> struct perf_event *event)
>> @@ -3342,6 +3374,21 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>> return c;
>> }
>>
>> +static struct event_constraint *
>> +icl_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>> + struct perf_event *event)
>> +{
>> + /*
>> + * Fixed counter 0 has less skid.
>> + * Force instruction:ppp in Fixed counter 0
>> + */
>> + if ((event->attr.precise_ip == 3) &&
>> + ((event->hw.config & X86_RAW_EVENT_MASK) == 0x00c0))
>> + return &fixed_counter0_constraint;
>> +
> Not clear to me why you need to treat this one separately from the
> PEBS constraints, if you check that you have
> :ppp (precise_ip > 0)?
>
>
There is no model specific get_pebs_constrains, so I have to put the
checking here for Icelake.
We only want to force the :ppp case in fixed counter 0, which has less
skid.
Thanks,
Kan
On Tue, Mar 19, 2019 at 02:38:24PM -0700, Andi Kleen wrote:
> > How much work would intel_pmu_drain_pebs_icl() be?
> >
> > I'm thinking that might not be terrible.
>
> I had it in an early version of the code, but it ended up with a lot
> of code duplication.
I'm thinking that if you remove all the <v3 nonsense, it shouldn't be
_that_ much. A quick hack here makes it ~60 lines.
And it avoids sprinkling a whole bunch of conditionals around and
introducing that new x86_pmu method.